On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever. If they happen to be placed on a
low capacity cpu from the beginning they are stuck there forever while
high capacity cpus may have become available in the meantime.
To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.
The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.
The patches have been tested on:
1. Arm Juno (r0): 2+4 Cortex A57/A53
2. Hikey960: 4+4 Cortex A73/A53
Test case:
Big cpus are always kept busy. Pin a shorter running sysbench tasks to
big cpus, while creating a longer running set of unpinned sysbench
tasks.
REQUESTS=1000
BIGS="1 2"
LITTLES="0 3 4 5"
# Don't care about the score for those, just keep the bigs busy
for i in $BIGS; do
taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
--test=cpu run &>/dev/null &
done
for i in $LITTLES; do
sysbench --max-requests=$REQUESTS --test=cpu run \
| grep "total time:" &
done
wait
Results:
Single runs with completion time of each task
Juno (tip)
total time: 1.2608s
total time: 1.2995s
total time: 1.5954s
total time: 1.7463s
Juno (misfit)
total time: 1.2575s
total time: 1.3004s
total time: 1.5860s
total time: 1.5871s
Hikey960 (tip)
total time: 1.7431s
total time: 2.2914s
total time: 2.5976s
total time: 1.7280s
Hikey960 (misfit)
total time: 1.7866s
total time: 1.7513s
total time: 1.6918s
total time: 1.6965s
10 run summary (tracking longest running task for each run)
Juno Hikey960
avg max avg max
tip 1.7465 1.7469 2.5997 2.6131
misfit 1.6016 1.6192 1.8506 1.9666
Changelog:
v3
- Fixed locking around static_key.
- Changed group per-cpu capacity comparison to be based on max rather
than min capacity.
- Added patch to prevent occasional pointless high->low capacity
migrations.
- Changed type of group_misfit_task_load and misfit_task_load to
unsigned long.
- Changed fbq() to pick the cpu with highest misfit_task_load rather
than breaking when the first is found.
- Rebased against tip/sched/core.
- v2 Tested-by: Gaku Inami <[email protected]>
v2
- Removed redudant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao <[email protected]>
- Dropped patch #4 as although the patch seems to make sense no benefit
has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami <[email protected]>
Chris Redpath (1):
sched/fair: Don't move tasks to lower capacity cpus unless necessary
Morten Rasmussen (4):
sched: Add static_key for asymmetric cpu capacity optimizations
sched/fair: Add group_misfit_task load-balance type
sched: Add sched_group per-cpu max capacity
sched/fair: Consider misfit tasks when load-balancing
Valentin Schneider (4):
sched/fair: Kick nohz balance if rq->misfit_task
sched: Change root_domain->overload type to int
sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
sched/fair: Set sd->overload when misfit
kernel/sched/fair.c | 162 +++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 16 +++--
kernel/sched/topology.c | 20 ++++++
3 files changed, 171 insertions(+), 27 deletions(-)
--
2.7.4
From: Valentin Schneider <[email protected]>
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 +++---
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 545fa3600894..9dd147c71dd0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8431,8 +8431,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+ WRITE_ONCE(env->dst_rq->rd->overload, overload);
}
}
@@ -9875,7 +9875,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
rq_unpin_lock(this_rq, rf);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ !READ_ONCE(this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f945edadd2c5..791d9badad31 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1665,8 +1665,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
- if (!rq->rd->overload)
- rq->rd->overload = 1;
+ if (!READ_ONCE(rq->rd->overload))
+ WRITE_ONCE(rq->rd->overload, 1);
#endif
}
--
2.7.4
From: Chris Redpath <[email protected]>
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Chris Redpath <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aac7b9155bc4..5d612d3906d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8796,6 +8796,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
capacity = capacity_of(i);
+ /*
+ * For ASYM_CPUCAPACITY domains, don't pick a cpu that could
+ * eventually lead to active_balancing high->low capacity.
+ * Higher per-cpu capacity is considered better than balancing
+ * average load.
+ */
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ capacity_of(env->dst_cpu) < capacity &&
+ rq->nr_running == 1)
+ continue;
+
wl = weighted_cpuload(rq);
/*
--
2.7.4
From: Valentin Schneider <[email protected]>
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks
_ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/sched.h | 6 +++++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9dd147c71dd0..aac7b9155bc4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8165,7 +8165,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
@@ -8209,8 +8209,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- sgs->group_misfit_task_load < rq->misfit_task_load)
+ sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+ *overload = 1;
+ }
}
/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 791d9badad31..6dfab618e457 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -688,7 +688,11 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
- /* Indicate more than one runnable task for any CPU */
+ /*
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
+ */
int overload;
/*
--
2.7.4
From: Valentin Schneider <[email protected]>
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 7 +++----
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29d53d1e22a..545fa3600894 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8170,7 +8170,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ int *overload)
{
unsigned long load;
int i, nr_running;
@@ -8195,7 +8195,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
if (nr_running > 1)
- *overload = true;
+ *overload = 1;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -8354,8 +8354,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- int load_idx, prefer_sibling = 0;
- bool overload = false;
+ int load_idx, prefer_sibling, overload = 0;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5ed67122cf59..f945edadd2c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -689,7 +689,7 @@ struct root_domain {
cpumask_var_t online;
/* Indicate more than one runnable task for any CPU */
- bool overload;
+ int overload;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -1666,7 +1666,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
if (!rq->rd->overload)
- rq->rd->overload = true;
+ rq->rd->overload = 1;
#endif
}
--
2.7.4
On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6af3354e9e26..11f3efa299ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7287,6 +7287,7 @@ struct lb_env {
unsigned int loop_max;
enum fbq_type fbq_type;
+ enum group_type src_grp_type;
struct list_head tasks;
};
@@ -8245,6 +8246,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ /*
+ * Don't try to pull misfit tasks we can't help.
+ * We can use max_capacity here as reduction in capacity on some
+ * cpus in the group should either be possible to resolve
+ * internally or be covered by avg_load imbalance (eventually).
+ */
+ if (sgs->group_type == group_misfit_task &&
+ (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ !group_has_capacity(env, &sds->local_stat)))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -8267,6 +8279,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
group_smaller_min_cpu_capacity(sds->local, sg))
return false;
+ /*
+ * If we have more than one misfit sg go with the biggest misfit.
+ */
+ if (sgs->group_type == group_misfit_task &&
+ sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+ return false;
+
asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8564,8 +8583,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* factors in sg capacity and sgs with smaller group_type are
* skipped when updating the busiest sg:
*/
- if (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load) {
+ if (busiest->group_type != group_misfit_task &&
+ (busiest->avg_load <= sds->avg_load ||
+ local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8599,6 +8619,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
+ /* Boost imbalance to allow misfit task to be balanced. */
+ if (busiest->group_type == group_misfit_task) {
+ env->imbalance = max_t(long, env->imbalance,
+ busiest->group_misfit_task_load);
+ }
+
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
@@ -8665,6 +8691,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->group_no_capacity)
goto force_balance;
+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -8702,6 +8732,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
force_balance:
/* Looks like there is an imbalance. Compute it */
+ env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return sds.busiest;
@@ -8749,6 +8780,19 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;
+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+ * seek the "biggest" misfit task.
+ */
+ if (env->src_grp_type == group_misfit_task) {
+ if (rq->misfit_task_load > busiest_load) {
+ busiest_load = rq->misfit_task_load;
+ busiest = rq;
+ }
+
+ continue;
+ }
+
capacity = capacity_of(i);
wl = weighted_cpuload(rq);
@@ -8818,6 +8862,9 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if (env->src_grp_type == group_misfit_task)
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
--
2.7.4
From: Valentin Schneider <[email protected]>
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
* rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
* sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity cpus, this will be evaluated to False.
* rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <[email protected]>, and a similar patch was submitted for
the Android Common Kernel - see [1].
[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11f3efa299ca..b29d53d1e22a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9507,7 +9507,7 @@ static void nohz_balancer_kick(struct rq *rq)
if (time_before(now, nohz.next_balance))
goto out;
- if (rq->nr_running >= 2) {
+ if (rq->nr_running >= 2 || rq->misfit_task_load) {
flags = NOHZ_KICK_MASK;
goto out;
}
--
2.7.4
The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple cpus if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all cpus or
asymmetric cpu capacities (e.g. big.LITTLE).
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 2 ++
3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed7eb2ac068f..6af3354e9e26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7929,13 +7929,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
+ sdg->sgc->max_capacity = capacity;
}
void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, min_capacity;
+ unsigned long capacity, min_capacity, max_capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -7949,6 +7950,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
capacity = 0;
min_capacity = ULONG_MAX;
+ max_capacity = 0;
if (child->flags & SD_OVERLAP) {
/*
@@ -7979,6 +7981,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
}
min_capacity = min(capacity, min_capacity);
+ max_capacity = max(capacity, max_capacity);
}
} else {
/*
@@ -7992,12 +7995,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
capacity += sgc->capacity;
min_capacity = min(sgc->min_capacity, min_capacity);
+ max_capacity = max(sgc->max_capacity, max_capacity);
group = group->next;
} while (group != child->groups);
}
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = min_capacity;
+ sdg->sgc->max_capacity = max_capacity;
}
/*
@@ -8093,16 +8098,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
}
/*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
* per-CPU capacity than sched_group ref.
*/
static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
{
return sg->sgc->min_capacity * capacity_margin <
ref->sgc->min_capacity * 1024;
}
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->max_capacity * capacity_margin <
+ ref->sgc->max_capacity * 1024;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -8248,7 +8264,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* power/energy consequences are not considered.
*/
if (sgs->sum_nr_running <= sgs->group_weight &&
- group_smaller_cpu_capacity(sds->local, sg))
+ group_smaller_min_cpu_capacity(sds->local, sg))
return false;
asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 273d07dedc90..5ed67122cf59 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1165,6 +1165,7 @@ struct sched_group_capacity {
*/
unsigned long capacity;
unsigned long min_capacity; /* Min per-CPU capacity in group */
+ unsigned long max_capacity; /* Max per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index edc87e35fc75..f32bf3a998b1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -708,6 +708,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg_span = sched_group_span(sg);
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
static int
@@ -867,6 +868,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
return sg;
}
--
2.7.4
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 18 ++++++++++++++++++
3 files changed, 22 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..6116d1b7e441 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6585,6 +6585,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
long min_cap, max_cap;
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4d9ac7..0fbfbcf5c551 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1153,6 +1153,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 61a1125c1ae4..edc87e35fc75 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@@ -425,6 +426,21 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
}
+static void update_asym_cpucapacity(int cpu)
+{
+ int enable = false;
+
+ rcu_read_lock();
+ if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+ enable = true;
+ rcu_read_unlock();
+
+ if (enable) {
+ /* This expects to be hotplug-safe */
+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+ }
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
@@ -1707,6 +1723,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
rcu_read_unlock();
+ update_asym_cpucapacity(cpumask_first(cpu_map));
+
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
--
2.7.4
To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 2 ++
2 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6116d1b7e441..ed7eb2ac068f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -1448,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
static unsigned long weighted_cpuload(struct rq *rq);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
/* Cached statistics for all CPUs within a node */
struct numa_stats {
@@ -4043,6 +4043,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
WRITE_ONCE(p->se.avg.util_est, ue);
}
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+ return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return;
+
+ if (!p) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ rq->misfit_task_load = task_h_load(p);
+}
+
#else /* CONFIG_SMP */
static inline int
@@ -4078,6 +4101,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
static inline void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
#endif /* CONFIG_SMP */
@@ -6598,7 +6622,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return !task_fits_capacity(p, min_cap);
}
/*
@@ -7015,9 +7039,12 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
+ update_misfit_status(p, rq);
+
return p;
idle:
+ update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
/*
@@ -7223,6 +7250,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
enum fbq_type { regular, remote, all };
+enum group_type {
+ group_other = 0,
+ group_misfit_task,
+ group_imbalanced,
+ group_overloaded,
+};
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -7764,12 +7798,6 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/
-enum group_type {
- group_other = 0,
- group_imbalanced,
- group_overloaded,
-};
-
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -7785,6 +7813,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ unsigned long group_misfit_task_load; /* A cpu has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -8084,6 +8113,9 @@ group_type group_classify(struct sched_group *group,
if (sg_imbalanced(group))
return group_imbalanced;
+ if (sgs->group_misfit_task_load)
+ return group_misfit_task;
+
return group_other;
}
@@ -8158,6 +8190,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ sgs->group_misfit_task_load < rq->misfit_task_load)
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
/* Adjust by relative CPU capacity of the group */
@@ -9939,6 +9975,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_misfit_status(curr, rq);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fbfbcf5c551..273d07dedc90 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -815,6 +815,8 @@ struct rq {
unsigned char idle_balance;
+ unsigned long misfit_task_load;
+
/* For active balancing */
int active_balance;
int push_cpu;
--
2.7.4
Hi Morten,
On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
> +static void update_asym_cpucapacity(int cpu)
> +{
> + int enable = false;
> +
> + rcu_read_lock();
> + if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> + enable = true;
> + rcu_read_unlock();
> +
> + if (enable) {
> + /* This expects to be hotplug-safe */
> + static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
> + }
> +}
What would happen if you hotplugged an entire cluster ? You'd loose the
SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
we care ?
And also, Peter mentioned an issue with the EAS patches with multiple
root_domains. Does that apply here as well ? What if you had a
configuration with big and little CPUs in different root_domains for ex?
Should we disable the static key in the above cases ?
Thanks,
Quentin
On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
> Hi Morten,
>
> On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
> > +static void update_asym_cpucapacity(int cpu)
> > +{
> > + int enable = false;
> > +
> > + rcu_read_lock();
> > + if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > + enable = true;
> > + rcu_read_unlock();
> > +
> > + if (enable) {
> > + /* This expects to be hotplug-safe */
> > + static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
> > + }
> > +}
>
> What would happen if you hotplugged an entire cluster ? You'd loose the
> SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
> we care ?
I don't think we should care. The static key enables additional checks
and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
be set and they should all be have no effect if that is the case. I
added the static key just avoid the overhead on systems where they would
have no effect. At least that is intention, I could course have broken
things by mistake.
> And also, Peter mentioned an issue with the EAS patches with multiple
> root_domains. Does that apply here as well ? What if you had a
> configuration with big and little CPUs in different root_domains for ex?
>
> Should we disable the static key in the above cases ?
Exclusive cpusets are more tricky as the flags will be the same for
sched_domains at the same level. So we can't set the flag correctly if
someone configures the exclusive cpusets such that you have one
root_domain spanning big and a subset of little, and one spanning the
remaining little cpus if all topology levels are preserved. If we
imagine a three cluster system where 0-3 and 4-7 little clusters, and
8-11 is a big cluster with cpusets configured as 0-5 and 6-11. The first
set should _not_ have SD_ASYM_CPUCAPACITY set, while the second should.
I'm tempted to say we shouldn't care in this situation. Setting the
flags correctly in the three cluster example would require knowledge
about the cpuset configuration which we don't have in the arch code so
SD_ASYM_CPUCAPACITY flag detection would have be done by the
sched_domain build code. However, not setting the flag according to the
actual members of the exclusive cpuset means that homogeneous
sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
wrong scheduling decisions.
We can actually end up with this problem just by hotplugging too. If you
unplug the entire big cluster in the three cluster example above, you
preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
we only have little cpus left.
As I see it, we have two choices: 1) Set the flags correctly for
exclusive cpusets which means some additional "fun" in the sched_domain
hierarchy set up, or 2) ignore it and make sure that setting
SD_ASYM_CPUCAPACITY on homogeneous sched_domains works fairly okay. The
latter seems easier.
Morten
On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
> On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
>> Hi Morten,
>>
>> On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
>>> +static void update_asym_cpucapacity(int cpu)
>>> +{
>>> + int enable = false;
>>> +
>>> + rcu_read_lock();
>>> + if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
>>> + enable = true;
>>> + rcu_read_unlock();
>>> +
>>> + if (enable) {
>>> + /* This expects to be hotplug-safe */
>>> + static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
>>> + }
>>> +}
>>
>> What would happen if you hotplugged an entire cluster ? You'd loose the
>> SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
>> we care ?
>
> I don't think we should care. The static key enables additional checks
> and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
> be set and they should all be have no effect if that is the case. I
> added the static key just avoid the overhead on systems where they would
> have no effect. At least that is intention, I could course have broken
> things by mistake.
I tent to agree for misfit but it would be easy to just add an
static_branch_disable_cpuslocked() into the else path of if(enable).
>> And also, Peter mentioned an issue with the EAS patches with multiple
>> root_domains. Does that apply here as well ? What if you had a
>> configuration with big and little CPUs in different root_domains for ex?
>>
>> Should we disable the static key in the above cases ?
>
> Exclusive cpusets are more tricky as the flags will be the same for
> sched_domains at the same level. So we can't set the flag correctly if
> someone configures the exclusive cpusets such that you have one
> root_domain spanning big and a subset of little, and one spanning the
> remaining little cpus if all topology levels are preserved. If we
> imagine a three cluster system where 0-3 and 4-7 little clusters, and
> 8-11 is a big cluster with cpusets configured as 0-5 and 6-11. The first
> set should _not_ have SD_ASYM_CPUCAPACITY set, while the second should.
>
> I'm tempted to say we shouldn't care in this situation. Setting the
> flags correctly in the three cluster example would require knowledge
> about the cpuset configuration which we don't have in the arch code so
> SD_ASYM_CPUCAPACITY flag detection would have be done by the
> sched_domain build code. However, not setting the flag according to the
> actual members of the exclusive cpuset means that homogeneous
> sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
> wrong scheduling decisions.
We could easily pass the CPU as an argument to all these
sched_domain_flags_f functions.
-typedef int (*sched_domain_flags_f)(void);
+typedef int (*sched_domain_flags_f)(int cpu);
In this case, the arch specific flag functions on a sched domain (sd)
level could use the corresponding sched_domain_mask_f function to
iterate over the span of the sd seen by CPU instead of all online cpus.
The overall question is ... do we need sd setups which are asymmetric
(different topology flags for sd hierarchies seen by different CPUs) and
does the scheduler cope with this?
We have seen 3 cluster systems like the MediaTek X20 (2 A72, 4 A53 (max
1.85Ghz), 4 A53 (max 1.4Ghz)) but this would rather be a big-little-tiny
system due to the max CPU frequency differences between the A53's.
We could also say that systems with 2 clusters with the same uArch and
same max CPU frequency and additional clusters are insane, like we e.g.
do with the Energy Model and CPUs with different uArch within a
frequency domain?
> We can actually end up with this problem just by hotplugging too. If you
> unplug the entire big cluster in the three cluster example above, you
> preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
> we only have little cpus left.
>
> As I see it, we have two choices: 1) Set the flags correctly for
> exclusive cpusets which means some additional "fun" in the sched_domain
> hierarchy set up, or 2) ignore it and make sure that setting
I assume you refer to this cpu parameter for sched_domain_flags_f under 1).
> SD_ASYM_CPUCAPACITY on homogeneous sched_domains works fairly okay. The
> latter seems easier.
>
> Morten
>
On Wed, Jun 27, 2018 at 05:41:22PM +0200, Dietmar Eggemann wrote:
> On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
> >On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
> >>Hi Morten,
> >>
> >>On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
> >>>+static void update_asym_cpucapacity(int cpu)
> >>>+{
> >>>+ int enable = false;
> >>>+
> >>>+ rcu_read_lock();
> >>>+ if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> >>>+ enable = true;
> >>>+ rcu_read_unlock();
> >>>+
> >>>+ if (enable) {
> >>>+ /* This expects to be hotplug-safe */
> >>>+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
> >>>+ }
> >>>+}
> >>
> >>What would happen if you hotplugged an entire cluster ? You'd loose the
> >>SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
> >>we care ?
> >
> >I don't think we should care. The static key enables additional checks
> >and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
> >be set and they should all be have no effect if that is the case. I
> >added the static key just avoid the overhead on systems where they would
> >have no effect. At least that is intention, I could course have broken
> >things by mistake.
>
> I tent to agree for misfit but it would be easy to just add an
> static_branch_disable_cpuslocked() into the else path of if(enable).
Depending on how we address the exclusive cpuset mess Quentin pointed
out, it isn't as simple as that. As it is with our current
not-yet-posted patches we would never remove the SD_ASYM_CPUCAPACITY
flag, so we would never do a static_branch_disable_cpuslocked().
However, I'm currently thinking that we should probably set the flag
according to the cpus in each root_domain. If we do that, we can't just
disable the static_key because one root_domain is SMP, so we would have
to track if _any_ root_domain (exclusive cpuset topology) has the flag
set.
Is it worth it to iterate over all the exclusive cpuset with all
the locking implications to disable the static_key in the corner case
where exclusive cpuset are set up for all cpu capacities in the system?
> >>And also, Peter mentioned an issue with the EAS patches with multiple
> >>root_domains. Does that apply here as well ? What if you had a
> >>configuration with big and little CPUs in different root_domains for ex?
> >>
> >>Should we disable the static key in the above cases ?
> >
> >Exclusive cpusets are more tricky as the flags will be the same for
> >sched_domains at the same level. So we can't set the flag correctly if
> >someone configures the exclusive cpusets such that you have one
> >root_domain spanning big and a subset of little, and one spanning the
> >remaining little cpus if all topology levels are preserved. If we
> >imagine a three cluster system where 0-3 and 4-7 little clusters, and
> >8-11 is a big cluster with cpusets configured as 0-5 and 6-11. The first
> >set should _not_ have SD_ASYM_CPUCAPACITY set, while the second should.
> >
> >I'm tempted to say we shouldn't care in this situation. Setting the
> >flags correctly in the three cluster example would require knowledge
> >about the cpuset configuration which we don't have in the arch code so
> >SD_ASYM_CPUCAPACITY flag detection would have be done by the
> >sched_domain build code. However, not setting the flag according to the
> >actual members of the exclusive cpuset means that homogeneous
> >sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
> >wrong scheduling decisions.
>
> We could easily pass the CPU as an argument to all these
> sched_domain_flags_f functions.
>
> -typedef int (*sched_domain_flags_f)(void);
> +typedef int (*sched_domain_flags_f)(int cpu);
>
> In this case, the arch specific flag functions on a sched domain (sd) level
> could use the corresponding sched_domain_mask_f function to iterate over the
> span of the sd seen by CPU instead of all online cpus.
Yeah, but I'm afraid that doesn't solve the problem. The
sched_domain_mask_f function doesn't tell us anything about any
exclusive cpusets. We need sched_domain_span(sd) which is
sched_domain_mask_f & cpu_map where cpu_map is the cpumask of the
exclusive cpuset. So we either need to pass the sched_domain_span() mask
through sched_domain_flags_f or work our way back to that information
based on the cpu id. I'm not sure if the latter is possible.
> The overall question is ... do we need sd setups which are asymmetric
> (different topology flags for sd hierarchies seen by different CPUs) and
> does the scheduler cope with this?
Well, in this case each root_domain is still flag symmetric so don't
think it should cause any problems.
> We have seen 3 cluster systems like the MediaTek X20 (2 A72, 4 A53 (max
> 1.85Ghz), 4 A53 (max 1.4Ghz)) but this would rather be a big-little-tiny
> system due to the max CPU frequency differences between the A53's.
>
> We could also say that systems with 2 clusters with the same uArch and same
> max CPU frequency and additional clusters are insane, like we e.g. do with
> the Energy Model and CPUs with different uArch within a frequency domain?
I fail to get the point here. Are you saying that SMP is insane? ;-)
>
> >We can actually end up with this problem just by hotplugging too. If you
> >unplug the entire big cluster in the three cluster example above, you
> >preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
> >we only have little cpus left.
> >
> >As I see it, we have two choices: 1) Set the flags correctly for
> >exclusive cpusets which means some additional "fun" in the sched_domain
> >hierarchy set up, or 2) ignore it and make sure that setting
>
> I assume you refer to this cpu parameter for sched_domain_flags_f under 1).
No, what I had in mind was to let sd_init() set SD_ASYM_CPUCAPACITY
flag as the arch-code doesn't know about the exclusive cpusets as
discussed above. In the meantime I have thought about a different
approach where sd_init() only disables the flag when it is no longer
needed due to exclusive cpusets.
On 06/28/2018 10:48 AM, Morten Rasmussen wrote:
> On Wed, Jun 27, 2018 at 05:41:22PM +0200, Dietmar Eggemann wrote:
>> On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
>>> On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
[...]
>>>> What would happen if you hotplugged an entire cluster ? You'd loose the
>>>> SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
>>>> we care ?
>>>
>>> I don't think we should care. The static key enables additional checks
>>> and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
>>> be set and they should all be have no effect if that is the case. I
>>> added the static key just avoid the overhead on systems where they would
>>> have no effect. At least that is intention, I could course have broken
>>> things by mistake.
>>
>> I tent to agree for misfit but it would be easy to just add an
>> static_branch_disable_cpuslocked() into the else path of if(enable).
>
> Depending on how we address the exclusive cpuset mess Quentin pointed
> out, it isn't as simple as that. As it is with our current
> not-yet-posted patches we would never remove the SD_ASYM_CPUCAPACITY
> flag, so we would never do a static_branch_disable_cpuslocked().
I was referring rather to the 'hotplug out an entire cluster' mentioned
above. Looking closer into the code, I see that this will only work for
traditional big.LITTLE (one big, one little cluster) since on them the
DIE level vanishes and so the SD_ASYM_CPUCAPACITY flag.
Currently we only detect the flags when the system comes up (in the
init_cpu_capacity_callback()) and not when we hotplug cpus. That's why
it doesn't work for your 'three cluster system where 0-3 and 4-7 little
clusters, and 8-11 is a big cluster' example.
So we don't re-detect the flags every time we call from the scheduler
into the arch code and so the the DIE level arch topology flag function
will return SD_ASYM_CPUCAPACITY.
> However, I'm currently thinking that we should probably set the flag
> according to the cpus in each root_domain. If we do that, we can't just
> disable the static_key because one root_domain is SMP, so we would have
> to track if _any_ root_domain (exclusive cpuset topology) has the flag
> set.
This is then in sync with the energy model where the static key should
be enabled if any root domain can do EAS. The static key would be system
wide, not per root domain.
> Is it worth it to iterate over all the exclusive cpuset with all
> the locking implications to disable the static_key in the corner case
> where exclusive cpuset are set up for all cpu capacities in the system?
Don't know either? But if the code to do this would include 'exclusive
cpusets' and platforms like your three cluster example then maybe ...
[...]
>>> I'm tempted to say we shouldn't care in this situation. Setting the
>>> flags correctly in the three cluster example would require knowledge
>>> about the cpuset configuration which we don't have in the arch code so
>>> SD_ASYM_CPUCAPACITY flag detection would have be done by the
>>> sched_domain build code. However, not setting the flag according to the
>>> actual members of the exclusive cpuset means that homogeneous
>>> sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
>>> wrong scheduling decisions.
>>
>> We could easily pass the CPU as an argument to all these
>> sched_domain_flags_f functions.
>>
>> -typedef int (*sched_domain_flags_f)(void);
>> +typedef int (*sched_domain_flags_f)(int cpu);
>>
>> In this case, the arch specific flag functions on a sched domain (sd) level
>> could use the corresponding sched_domain_mask_f function to iterate over the
>> span of the sd seen by CPU instead of all online cpus.
>
> Yeah, but I'm afraid that doesn't solve the problem. The
> sched_domain_mask_f function doesn't tell us anything about any
> exclusive cpusets. We need sched_domain_span(sd) which is
You're right, I checked again and even if we hotplug, the flag function
tl->mask(cpu) like cpu_coregroup_mask() always return the initial set of
cpus. So we are only save if the DIE sd vanishes and with it the
SD_ASYM_CPUCAPACITY flag since nobody will call the appropriate arch
topology flag function.
> sched_domain_mask_f & cpu_map where cpu_map is the cpumask of the
> exclusive cpuset. So we either need to pass the sched_domain_span() mask
> through sched_domain_flags_f or work our way back to that information
> based on the cpu id. I'm not sure if the latter is possible.
Agreed.
[...]
>> We could also say that systems with 2 clusters with the same uArch and same
>> max CPU frequency and additional clusters are insane, like we e.g. do with
>> the Energy Model and CPUs with different uArch within a frequency domain?
>
> I fail to get the point here. Are you saying that SMP is insane? ;-)
The '... and additional clusters' is important, a system like your three
cluster system you describe above.
But the problem that if you hotplug-out the big cluster, the DIE level
is still there with the SD_ASYM_CPUCAPACITY flag is due to the fact that
we don't start the flag detection mechanism during hotplug, right?
>>> We can actually end up with this problem just by hotplugging too. If you
>>> unplug the entire big cluster in the three cluster example above, you
>>> preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
>>> we only have little cpus left.
IMHO, again that's because we do flag detection only at system startup.
>>> As I see it, we have two choices: 1) Set the flags correctly for
>>> exclusive cpusets which means some additional "fun" in the sched_domain
>>> hierarchy set up, or 2) ignore it and make sure that setting
>>
>> I assume you refer to this cpu parameter for sched_domain_flags_f under 1).
>
> No, what I had in mind was to let sd_init() set SD_ASYM_CPUCAPACITY
> flag as the arch-code doesn't know about the exclusive cpusets as
> discussed above. In the meantime I have thought about a different
> approach where sd_init() only disables the flag when it is no longer
> needed due to exclusive cpusets.
In this case sd_init() has to know the cpu capacity values. I assume
that the arch still has to call rebuild_sched_domains() after cpufreq is
up and running due to the max cpu frequency dependency for cpu capacity.
On Thu, Jun 28, 2018 at 07:16:46PM +0200, Dietmar Eggemann wrote:
> On 06/28/2018 10:48 AM, Morten Rasmussen wrote:
> >On Wed, Jun 27, 2018 at 05:41:22PM +0200, Dietmar Eggemann wrote:
> >>On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
> >>>On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
>
> [...]
>
> >>>>What would happen if you hotplugged an entire cluster ? You'd loose the
> >>>>SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
> >>>>we care ?
> >>>
> >>>I don't think we should care. The static key enables additional checks
> >>>and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
> >>>be set and they should all be have no effect if that is the case. I
> >>>added the static key just avoid the overhead on systems where they would
> >>>have no effect. At least that is intention, I could course have broken
> >>>things by mistake.
> >>
> >>I tent to agree for misfit but it would be easy to just add an
> >>static_branch_disable_cpuslocked() into the else path of if(enable).
> >
> >Depending on how we address the exclusive cpuset mess Quentin pointed
> >out, it isn't as simple as that. As it is with our current
> >not-yet-posted patches we would never remove the SD_ASYM_CPUCAPACITY
> >flag, so we would never do a static_branch_disable_cpuslocked().
>
> I was referring rather to the 'hotplug out an entire cluster' mentioned
> above. Looking closer into the code, I see that this will only work for
> traditional big.LITTLE (one big, one little cluster) since on them the DIE
> level vanishes and so the SD_ASYM_CPUCAPACITY flag.
Agreed, disabling the branch in that case should be possible but only if
we know that there aren't any exclusive cpusets.
> Currently we only detect the flags when the system comes up (in the
> init_cpu_capacity_callback()) and not when we hotplug cpus. That's why it
> doesn't work for your 'three cluster system where 0-3 and 4-7 little
> clusters, and 8-11 is a big cluster' example.
> So we don't re-detect the flags every time we call from the scheduler into
> the arch code and so the the DIE level arch topology flag function will
> return SD_ASYM_CPUCAPACITY.
It would fairly easy to re-detect every time we hotplug a cpu in/out but
that won't help us with exclusive cpuset issue.
>
> >However, I'm currently thinking that we should probably set the flag
> >according to the cpus in each root_domain. If we do that, we can't just
> >disable the static_key because one root_domain is SMP, so we would have
> >to track if _any_ root_domain (exclusive cpuset topology) has the flag
> >set.
>
> This is then in sync with the energy model where the static key should be
> enabled if any root domain can do EAS. The static key would be system wide,
> not per root domain.
The static_key can only be system wide :-)
>
> >Is it worth it to iterate over all the exclusive cpuset with all
> >the locking implications to disable the static_key in the corner case
> >where exclusive cpuset are set up for all cpu capacities in the system?
>
> Don't know either? But if the code to do this would include 'exclusive
> cpusets' and platforms like your three cluster example then maybe ...
Iterating over all root_domains should work for any platform. IMHO, it
might be a lot of additional complexity to disable some optimizations in
a few corner cases.
>
> [...]
>
> >>>I'm tempted to say we shouldn't care in this situation. Setting the
> >>>flags correctly in the three cluster example would require knowledge
> >>>about the cpuset configuration which we don't have in the arch code so
> >>>SD_ASYM_CPUCAPACITY flag detection would have be done by the
> >>>sched_domain build code. However, not setting the flag according to the
> >>>actual members of the exclusive cpuset means that homogeneous
> >>>sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
> >>>wrong scheduling decisions.
> >>
> >>We could easily pass the CPU as an argument to all these
> >>sched_domain_flags_f functions.
> >>
> >>-typedef int (*sched_domain_flags_f)(void);
> >>+typedef int (*sched_domain_flags_f)(int cpu);
> >>
> >>In this case, the arch specific flag functions on a sched domain (sd) level
> >>could use the corresponding sched_domain_mask_f function to iterate over the
> >>span of the sd seen by CPU instead of all online cpus.
> >
> >Yeah, but I'm afraid that doesn't solve the problem. The
> >sched_domain_mask_f function doesn't tell us anything about any
> >exclusive cpusets. We need sched_domain_span(sd) which is
>
> You're right, I checked again and even if we hotplug, the flag function
> tl->mask(cpu) like cpu_coregroup_mask() always return the initial set of
> cpus. So we are only save if the DIE sd vanishes and with it the
> SD_ASYM_CPUCAPACITY flag since nobody will call the appropriate arch
> topology flag function.
Yeah, that isn't really solving the problem, it is more like working by
accident ;-)
>
> >sched_domain_mask_f & cpu_map where cpu_map is the cpumask of the
> >exclusive cpuset. So we either need to pass the sched_domain_span() mask
> >through sched_domain_flags_f or work our way back to that information
> >based on the cpu id. I'm not sure if the latter is possible.
>
> Agreed.
>
> [...]
>
> >>We could also say that systems with 2 clusters with the same uArch and same
> >>max CPU frequency and additional clusters are insane, like we e.g. do with
> >>the Energy Model and CPUs with different uArch within a frequency domain?
> >
> >I fail to get the point here. Are you saying that SMP is insane? ;-)
>
> The '... and additional clusters' is important, a system like your three
> cluster system you describe above.
>
> But the problem that if you hotplug-out the big cluster, the DIE level is
> still there with the SD_ASYM_CPUCAPACITY flag is due to the fact that we
> don't start the flag detection mechanism during hotplug, right?
True, flag detection is currently only done once (well, twice: when the
secondary cpus come up, and again at cpufreq init). We can't rely on
sched_domains being elemitated to get the flags right in the general
case. It just appears to work on systems we have considered so far.
>
> >>>We can actually end up with this problem just by hotplugging too. If you
> >>>unplug the entire big cluster in the three cluster example above, you
> >>>preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
> >>>we only have little cpus left.
>
> IMHO, again that's because we do flag detection only at system startup.
Yes.
>
> >>>As I see it, we have two choices: 1) Set the flags correctly for
> >>>exclusive cpusets which means some additional "fun" in the sched_domain
> >>>hierarchy set up, or 2) ignore it and make sure that setting
> >>
> >>I assume you refer to this cpu parameter for sched_domain_flags_f under 1).
> >
> >No, what I had in mind was to let sd_init() set SD_ASYM_CPUCAPACITY
> >flag as the arch-code doesn't know about the exclusive cpusets as
> >discussed above. In the meantime I have thought about a different
> >approach where sd_init() only disables the flag when it is no longer
> >needed due to exclusive cpusets.
>
> In this case sd_init() has to know the cpu capacity values. I assume that
> the arch still has to call rebuild_sched_domains() after cpufreq is up and
> running due to the max cpu frequency dependency for cpu capacity.
Yes, sd_init() will just access rq->cpu_capacity_orig which should be
straightforward. There is no change in when we rebuild the sched_domain
hierarchy. It will be rebuild once cpufreq is up, and every time we mess
with exclusive cpusets.
Hi,
> -----Original Message-----
> From: Morten Rasmussen <[email protected]>
> Sent: Wednesday, June 20, 2018 6:06 PM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected]; Gaku Inami
> <[email protected]>; [email protected]; Morten Rasmussen <[email protected]>
> Subject: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
[snip]
>
> The patches have been tested on:
> 1. Arm Juno (r0): 2+4 Cortex A57/A53
> 2. Hikey960: 4+4 Cortex A73/A53
>
> Test case:
> Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> big cpus, while creating a longer running set of unpinned sysbench
> tasks.
I have tested v3 patches on Renesas SoC again. It looks fine.
You can add:
Tested-by: Gaku Inami <[email protected]>
The patches have been tested on:
3. Renesas R-Car H3 : 4+4 Cortex A57/A53
Results:
Single runs with completion time of each task
R-Car H3 (tip)
total time: 0.9391s
total time: 0.9865s
total time: 1.3691s
total time: 1.6740s
R-Car H3 (misfit)
total time: 0.9368s
total time: 0.9475s
total time: 0.9471s
total time: 0.9505s
10 run summary (tracking longest running task for each run)
R-Car H3
avg max
tip 1.6742 1.6750
misfit 0.9784 0.9905
Regards,
Inami
[snip]
Hi,
On Tue, Jul 03, 2018 at 02:28:28AM +0000, Gaku Inami wrote:
> Hi,
>
> > -----Original Message-----
> > From: Morten Rasmussen <[email protected]>
> > Sent: Wednesday, June 20, 2018 6:06 PM
> > To: [email protected]; [email protected]
> > Cc: [email protected]; [email protected]; [email protected]; Gaku Inami
> > <[email protected]>; [email protected]; Morten Rasmussen <[email protected]>
> > Subject: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
> [snip]
> >
> > The patches have been tested on:
> > 1. Arm Juno (r0): 2+4 Cortex A57/A53
> > 2. Hikey960: 4+4 Cortex A73/A53
> >
> > Test case:
> > Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> > big cpus, while creating a longer running set of unpinned sysbench
> > tasks.
>
> I have tested v3 patches on Renesas SoC again. It looks fine.
>
> You can add:
>
> Tested-by: Gaku Inami <[email protected]>
>
> The patches have been tested on:
> 3. Renesas R-Car H3 : 4+4 Cortex A57/A53
>
> Results:
> Single runs with completion time of each task
> R-Car H3 (tip)
> total time: 0.9391s
> total time: 0.9865s
> total time: 1.3691s
> total time: 1.6740s
>
> R-Car H3 (misfit)
> total time: 0.9368s
> total time: 0.9475s
> total time: 0.9471s
> total time: 0.9505s
>
> 10 run summary (tracking longest running task for each run)
> R-Car H3
> avg max
> tip 1.6742 1.6750
> misfit 0.9784 0.9905
Thanks for testing again. I have just posted v4 with some minor changes.
Behaviour for the test-cases should be the same.
Morten