2018-02-16 08:24:44

by Morten Rasmussen

Subject: [PATCH 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this, as such
tasks may run or be runnable indefinitely without going through another
wake-up. If they happen to be placed on a low capacity cpu to begin
with, they are stuck there even when high capacity cpus become
available in the meantime.

To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.

The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.

The patches have been tested on:
1. Arm Juno (r0): 2+4 Cortex A57/A53
2. Hikey960: 4+4 Cortex A73/A53

Test case:
Big cpus are always kept busy. Pin a set of shorter-running sysbench
tasks to the big cpus, while creating a longer-running set of unpinned
sysbench tasks.

REQUESTS=1000
BIGS="1 2"
LITTLES="0 3 4 5"

# Don't care about the score for those, just keep the bigs busy
for i in $BIGS; do
    taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
        --test=cpu run &>/dev/null &
done

for i in $LITTLES; do
    sysbench --max-requests=$REQUESTS --test=cpu run \
        | grep "total time:" &
done

wait

Results:
Single runs with completion time of each task
Juno (tip)
total time: 1.2608s
total time: 1.2995s
total time: 1.5954s
total time: 1.7463s

Juno (misfit)
total time: 1.2575s
total time: 1.3004s
total time: 1.5860s
total time: 1.5871s

Hikey960 (tip)
total time: 1.7431s
total time: 2.2914s
total time: 2.5976s
total time: 1.7280s

Hikey960 (misfit)
total time: 1.7866s
total time: 1.7513s
total time: 1.6918s
total time: 1.6965s

10 run summary (tracking longest running task for each run)
        Juno            Hikey960
        avg     max     avg     max
tip     1.7465  1.7469  2.5997  2.6131
misfit  1.6016  1.6192  1.8506  1.9666

Morten Rasmussen (4):
sched: Add static_key for asymmetric cpu capacity optimizations
sched/fair: Add group_misfit_task load-balance type
sched/fair: Consider misfit tasks when load-balancing
sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

Valentin Schneider (3):
sched/fair: Kick nohz balance if rq->misfit_task
sched: Rename root_domain->overload to should_idle_balance
sched/fair: Set sd->should_idle_balance when misfit

kernel/sched/fair.c | 141 +++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 15 ++++--
kernel/sched/topology.c | 10 ++++
3 files changed, 142 insertions(+), 24 deletions(-)

--
2.7.4



2018-02-15 16:24:26

by Morten Rasmussen

Subject: [PATCH 2/7] sched/fair: Add group_misfit_task load-balance type

To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE), load-balancing has to consider task and cpu
utilization as well as per-cpu compute capacity, in addition to the
current average-load-based policy. Tasks with high utilization that are
scheduled on a lower capacity cpu need to be identified and migrated to
a higher capacity cpu if possible.

To implement this additional policy, a new group_type (load-balance
scenario) is added: group_misfit_task. It represents scenarios where a
sched_group has one or more tasks that are not suitable for its per-cpu
capacity. group_misfit_task is only considered if the system is not
overloaded or imbalanced (group_imbalanced or group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks itself and updating
rq->misfit_task_load. This means checking task utilization when tasks
are scheduled and on each sched_tick.
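
As a rough illustration (assuming the existing capacity_margin of 1280,
i.e. a ~20% margin: a task fits if its utilization is below ~80% of the
cpu's capacity): on a little cpu with a capacity of 430, a task with
utilization 300 fits (430 * 1024 = 440320 > 300 * 1280 = 384000),
whereas a task with utilization 400 does not (430 * 1024 = 440320 <
400 * 1280 = 512000) and causes rq->misfit_task_load to be set to its
load on the next pick or tick.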

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 2 ++
2 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 452ad2e6f1a0..bdca97ea516e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -712,6 +712,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)

static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);

/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -1422,7 +1423,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
static unsigned long weighted_cpuload(struct rq *rq);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);

/* Cached statistics for all CPUs within a node */
struct numa_stats {
@@ -3869,6 +3869,30 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)

static int idle_balance(struct rq *this_rq, struct rq_flags *rf);

+static inline unsigned long task_util(struct task_struct *p);
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+ return capacity * 1024 > task_util(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return;
+
+ if (!p) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ rq->misfit_task_load = task_h_load(p);
+}
+
#else /* CONFIG_SMP */

static inline int
@@ -3898,6 +3922,8 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
+
#endif /* CONFIG_SMP */

static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5767,7 +5793,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}

-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6304,7 +6329,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);

- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return task_fits_capacity(p, min_cap);
}

/*
@@ -6733,9 +6758,12 @@ done: __maybe_unused
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);

+ update_misfit_status(p, rq);
+
return p;

idle:
+ update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);

/*
@@ -6941,6 +6969,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

enum fbq_type { regular, remote, all };

+enum group_type {
+ group_other = 0,
+ group_misfit_task,
+ group_imbalanced,
+ group_overloaded,
+};
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -7453,12 +7488,6 @@ static unsigned long task_h_load(struct task_struct *p)

/********** Helpers for find_busiest_group ************************/

-enum group_type {
- group_other = 0,
- group_imbalanced,
- group_overloaded,
-};
-
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -7474,6 +7503,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ int group_misfit_task_load; /* A cpu has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -7773,6 +7803,9 @@ group_type group_classify(struct sched_group *group,
if (sg_imbalanced(group))
return group_imbalanced;

+ if (sgs->group_misfit_task_load)
+ return group_misfit_task;
+
return group_other;
}

@@ -7822,6 +7855,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ !sgs->group_misfit_task_load && rq->misfit_task_load)
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}

/* Adjust by relative CPU capacity of the group */
@@ -9436,6 +9473,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_misfit_status(curr, rq);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a06184906640..7d324b706e67 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -771,6 +771,8 @@ struct rq {
struct callback_head *balance_callback;

unsigned char idle_balance;
+
+ unsigned int misfit_task_load;
/* For active balancing */
int active_balance;
int push_cpu;
--
2.7.4


2018-02-16 08:04:58

by Morten Rasmussen

Subject: [PATCH 6/7] sched: Rename root_domain->overload to should_idle_balance

From: Valentin Schneider <[email protected]>

The name "overload" is not very explicit, especially since it doesn't
use any concept of "load" coming from load-tracking signals. For now it
simply tracks if any of the CPUs in root_domain has more than one
runnable task, and is then used to decide whether idle balance should be
performed.

As such, this commit changes that flag's name to 'should_idle_balance',
which makes its role more explicit.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Suggested-by: Patrick Bellasi <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 18 +++++++++---------
kernel/sched/sched.h | 12 ++++++++----
2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1439b784b1f0..2d2302b7b584 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7830,12 +7830,12 @@ group_type group_classify(struct sched_group *group,
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @should_idle_balance: Indicate groups could need idle balance.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ bool *should_idle_balance)
{
unsigned long load;
int i, nr_running;
@@ -7857,7 +7857,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,

nr_running = rq->nr_running;
if (nr_running > 1)
- *overload = true;
+ *should_idle_balance = true;

#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -8016,7 +8016,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false;
+ bool should_idle_balance = false;

if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -8038,7 +8038,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ &should_idle_balance);

if (local_group)
goto next_group;
@@ -8078,9 +8078,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
env->fbq_type = fbq_classify_group(&sds->busiest_stat);

if (!env->sd->parent) {
- /* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ /* update idle_balance indicator if we are at root domain */
+ if (env->dst_rq->rd->should_idle_balance != should_idle_balance)
+ env->dst_rq->rd->should_idle_balance = should_idle_balance;
}
}

@@ -8878,7 +8878,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
rq_unpin_lock(this_rq, rf);

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ !this_rq->rd->should_idle_balance) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7d324b706e67..4215438667e5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -650,8 +650,12 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;

- /* Indicate more than one runnable task for any CPU */
- bool overload;
+ /*
+ * Indicate whether the idle balance can be used to solve
+ * imbalance within the root domain.
+ * e.g. There is more than one runnable task for any CPU
+ */
+ bool should_idle_balance;

/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -1610,8 +1614,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)

if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
- if (!rq->rd->overload)
- rq->rd->overload = true;
+ if (!rq->rd->should_idle_balance)
+ rq->rd->should_idle_balance = true;
#endif
}

--
2.7.4


2018-02-16 08:10:14

by Morten Rasmussen

Subject: [PATCH 7/7] sched/fair: Set sd->should_idle_balance when misfit

From: Valentin Schneider <[email protected]>

Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the should_idle_balance flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs have
lower performance than the big CPUs, so tasks running on the LITTLEs
will take longer to complete.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks
_ signifies idling

  LITTLE_0 | a a a a _ _
  LITTLE_1 | b b b b _ _
  ---------|-------------
     big_0 | c c c c a a
     big_1 | d d d d b b
                      ^
    Tasks end on the big CPUs, idle balance happens
    and the misfit tasks are pulled straight away

This however won't happen, because currently the should_idle_balance
flag is only set when there is any CPU that has more than one runnable
task - which may very well not be the case here if our CPU-hogging
workload is all there is to run.

As such, this commit sets the should_idle_balance flag in
update_sg_lb_stats when a group is flagged as having a misfit task.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d2302b7b584..d080a144f87f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7871,8 +7871,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;

if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- !sgs->group_misfit_task_load && rq->misfit_task_load)
+ !sgs->group_misfit_task_load && rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+ *should_idle_balance = true;
+ }
}

/* Adjust by relative CPU capacity of the group */
--
2.7.4


2018-02-16 08:11:25

by Morten Rasmussen

Subject: [PATCH 3/7] sched/fair: Consider misfit tasks when load-balancing

On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In such scenarios, 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-cpu capacity and a destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has the lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the first cpu with rq->misfit_task_load set as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.
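
As a concrete illustration of points 3 and 5 (with made-up numbers): if
the average-load calculation would only give an imbalance of, say, 150
while the misfit task contributes a load of 900, the imbalance is
raised to 900 so the balancer is allowed to move the misfit task even
though the average loads look fairly balanced. And since such a task is
typically running when we balance, it cannot simply be detached, so
active balancing is used to push it to the destination cpu.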

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdca97ea516e..0d70838ed711 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7004,6 +7004,7 @@ struct lb_env {
unsigned int loop_max;

enum fbq_type fbq_type;
+ enum group_type src_grp_type;
struct list_head tasks;
};

@@ -7894,6 +7895,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
{
struct sg_lb_stats *busiest = &sds->busiest_stat;

+ /*
+ * Don't try to pull misfit tasks we can't help.
+ */
+ if (sgs->group_type == group_misfit_task &&
+ (!group_smaller_cpu_capacity(sg, sds->local) ||
+ !group_has_capacity(env, &sds->local_stat)))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;

@@ -8199,8 +8208,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* factors in sg capacity and sgs with smaller group_type are
* skipped when updating the busiest sg:
*/
- if (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load) {
+ if (busiest->group_type != group_misfit_task &&
+ (busiest->avg_load <= sds->avg_load ||
+ local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8234,6 +8244,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;

+ /* Boost imbalance to allow misfit task to be balanced. */
+ if (busiest->group_type == group_misfit_task) {
+ env->imbalance = max_t(long, env->imbalance,
+ busiest->group_misfit_task_load);
+ }
+
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
@@ -8300,6 +8316,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->group_no_capacity)
goto force_balance;

+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -8337,6 +8357,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)

force_balance:
/* Looks like there is an imbalance. Compute it */
+ env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return sds.busiest;

@@ -8384,6 +8405,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;

+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we ignore
+ * load.
+ */
+ if (env->src_grp_type == group_misfit_task &&
+ rq->misfit_task_load)
+ return rq;
+
capacity = capacity_of(i);

wl = weighted_cpuload(rq);
@@ -8453,6 +8482,9 @@ static int need_active_balance(struct lb_env *env)
return 1;
}

+ if (env->src_grp_type == group_misfit_task)
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}

--
2.7.4


2018-02-16 08:12:15

by Morten Rasmussen

Subject: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

The existing asymmetric cpu capacity code should cause minimal overhead
for everybody else. Putting it behind a static_key, as has been done for
the SMT optimizations, would make it easier to extend and improve
without causing harm to others moving forward.
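
The intended usage pattern, a minimal sketch of which is shown below
(wake_cap() in this patch is the first user), is to bail out of the
asymmetric-capacity specific paths early when the key is disabled:

	/* Bail out early on symmetric systems; the disabled key makes this cheap. */
	if (!static_branch_unlikely(&sched_asym_cpucapacity))
		return 0;

	/* ... asymmetric cpu capacity specific code ... */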

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 10 ++++++++++
3 files changed, 14 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1070803cb423..452ad2e6f1a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6291,6 +6291,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
long min_cap, max_cap;

+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..a06184906640 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1095,6 +1095,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;

struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed7f88b..517c57d312df 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -393,6 +393,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);

static void update_top_cache_domain(int cpu)
{
@@ -420,6 +421,13 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
}

+static void update_asym_cpucapacity(int cpu)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
+ lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+ static_branch_enable(&sched_asym_cpucapacity);
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
@@ -1697,6 +1705,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att

cpu_attach_domain(sd, d.rd, i);
}
+
+ update_asym_cpucapacity(cpumask_first(cpu_map));
rcu_read_unlock();

if (rq && sched_debug_enabled) {
--
2.7.4


2018-02-16 08:25:23

by Morten Rasmussen

Subject: [PATCH 5/7] sched/fair: Kick nohz balance if rq->misfit_task

From: Valentin Schneider <[email protected]>

There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:

* rq->nr_running >= 2: not relevant here because we are running a
  misfit task; it needs to be migrated regardless, potentially through
  active balance.
* sds->nr_busy_cpus > 1: if only the misfit task is running on a group
  of low capacity cpus, this evaluates to false.
* rq->cfs.h_nr_running >= 1 && check_cpu_capacity(): not relevant here,
  the misfit task needs to be migrated regardless of rt/IRQ pressure.

As such, this commit adds an rq->misfit_task_load condition to trigger
a nohz kick.

The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <[email protected]>, and a similar patch was submitted for
the Android Common Kernel - see [1].

[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e036aef3f10a..1439b784b1f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9411,6 +9411,9 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (rq->nr_running >= 2)
return true;

+ if (rq->misfit_task_load)
+ return true;
+
rcu_read_lock();
sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
if (sds) {
--
2.7.4


2018-02-16 08:25:27

by Morten Rasmussen

Subject: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On systems with asymmetric cpu capacities, a skewed load distribution
might yield better throughput than balancing load per group capacity.
For example, it can be better to prefer high capacity cpus for compute
intensive tasks, leaving low capacity cpus idle, rather than balancing
the number of idle cpus across the different cpu types. To allow this,
let load-balance back off if the busiest group isn't really overloaded.
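
To give a rough feel for the 12.5% check introduced below (capacity
values are made up): comparing groups with min_capacity 1024 and 920
gives abs(diff) = 104, which is below max >> 3 = 128, so the groups are
considered similar and balancing proceeds as usual; comparing 1024 and
446 gives abs(diff) = 578 >= 128, so the groups are considered
asymmetric and a group_other candidate is not picked as busiest.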

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d70838ed711..e036aef3f10a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7794,6 +7794,19 @@ group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
ref->sgc->min_capacity * 1024;
}

+/*
+ * group_similar_cpu_capacity: Returns true if the minimum capacity of the
+ * compared groups differ by less than 12.5%.
+ */
+static inline bool
+group_similar_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ long diff = sg->sgc->min_capacity - ref->sgc->min_capacity;
+ long max = max(sg->sgc->min_capacity, ref->sgc->min_capacity);
+
+ return abs(diff) < max >> 3;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -7925,6 +7938,15 @@ static bool update_sd_pick_busiest(struct lb_env *env,
group_smaller_cpu_capacity(sds->local, sg))
return false;

+ /*
+ * Candidate sg doesn't face any severe imbalance issues so
+ * don't disturb unless the groups are of similar capacity
+ * where balancing is more harmless.
+ */
+ if (sgs->group_type == group_other &&
+ !group_similar_cpu_capacity(sds->local, sg))
+ return false;
+
asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
--
2.7.4


2018-02-16 18:54:11

by Juri Lelli

Subject: Re: [PATCH 6/7] sched: Rename root_domain->overload to should_idle_balance

Hi,

On 15/02/18 16:20, Morten Rasmussen wrote:
> From: Valentin Schneider <[email protected]>
>
> The name "overload" is not very explicit, especially since it doesn't
> use any concept of "load" coming from load-tracking signals. For now it
> simply tracks if any of the CPUs in root_domain has more than one
> runnable task, and is then used to decide whether idle balance should be
> performed.
>
> As such, this commit changes that flag's name to 'should_idle_balance',
> which makes its role more explicit.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
> Suggested-by: Patrick Bellasi <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---

[...]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7d324b706e67..4215438667e5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -650,8 +650,12 @@ struct root_domain {
> cpumask_var_t span;
> cpumask_var_t online;
>
> - /* Indicate more than one runnable task for any CPU */
> - bool overload;
> + /*
> + * Indicate whether the idle balance can be used to solve
> + * imbalance within the root domain.
> + * e.g. There is more than one runnable task for any CPU
> + */
> + bool should_idle_balance;

The current name is, however, consistent with the RT/DL naming convention:

[...]
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
*/
cpumask_var_t dlo_mask;

[...]

/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
*/
cpumask_var_t rto_mask;

Not a big deal, though. Just wanted to point that out. :)

Best,

- Juri

2018-02-16 18:55:44

by Peter Zijlstra

Subject: Re: [PATCH 6/7] sched: Rename root_domain->overload to should_idle_balance

On Fri, Feb 16, 2018 at 10:14:02AM +0100, Juri Lelli wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 7d324b706e67..4215438667e5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -650,8 +650,12 @@ struct root_domain {
> > cpumask_var_t span;
> > cpumask_var_t online;
> >
> > - /* Indicate more than one runnable task for any CPU */
> > - bool overload;
> > + /*
> > + * Indicate whether the idle balance can be used to solve
> > + * imbalance within the root domain.
> > + * e.g. There is more than one runnable task for any CPU
> > + */
> > + bool should_idle_balance;
>
> Current name is however consistent with RT/DL's naming convention

Yeah, not a fan either. We've consistently used the term to mean
nr_running>1. The thing to fix there is the stupid bool, not the name.

2018-02-16 19:05:24

by Valentin Schneider

Subject: Re: [PATCH 6/7] sched: Rename root_domain->overload to should_idle_balance

On 02/16/2018 09:49 AM, Peter Zijlstra wrote:
> On Fri, Feb 16, 2018 at 10:14:02AM +0100, Juri Lelli wrote:
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 7d324b706e67..4215438667e5 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -650,8 +650,12 @@ struct root_domain {
>>> cpumask_var_t span;
>>> cpumask_var_t online;
>>>
>>> - /* Indicate more than one runnable task for any CPU */
>>> - bool overload;
>>> + /*
>>> + * Indicate whether the idle balance can be used to solve
>>> + * imbalance within the root domain.
>>> + * e.g. There is more than one runnable task for any CPU
>>> + */
>>> + bool should_idle_balance;
>>
>> Current name is however consistent with RT/DL's naming convention

I saw that it was already used elsewhere in fair but didn't know about
RT/DL, thanks for pointing that out.

>
> Yeah, not a fan either. We've consistently used the term to mean
> nr_running>1. The thing to fix there is the stupid bool, not the name.
>

So yeah the other thing that doesn't help here is that we're cramming
several meanings into rq->rd->overload:
- is there an overloaded group
- is there a group with misfit task(s)

So it didn't make sense to keep it named "overload". Perhaps a better way of
handling this would be to keep exposing which is which instead of merging it
all in a bool. Something along those lines:

@update_sg_lb_stats():
[...]
nr_running = rq->nr_running;
if (nr_running > 1)
- *overload = true;
+ sds->balance_status |= LB_STATUS_OVERLOAD

[...]

if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
!sgs->group_misfit_task_load && rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
- *should_idle_balance = true;
+ sds->balance_status |= LB_STATUS_MISFIT
}

@update_sd_lb_stats():

[...]
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ if (env->dst_rq->rd->balance_status != sds->balance_status)
+ env->dst_rq->rd->balance_status = sds->balance_status
}
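
For completeness, the status bits in the sketch above could be defined
along these lines (hypothetical names, not part of the posted patches):

	/* Hypothetical balance status bits for the sketch above */
	#define LB_STATUS_OVERLOAD	0x1	/* some CPU has more than one runnable task */
	#define LB_STATUS_MISFIT	0x2	/* some CPU is running a misfit task */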

2018-02-16 19:10:29

by Peter Zijlstra

Subject: Re: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

On Thu, Feb 15, 2018 at 04:20:48PM +0000, Morten Rasmussen wrote:
> +static void update_asym_cpucapacity(int cpu)
> +{
> + if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
> + lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> + static_branch_enable(&sched_asym_cpucapacity);
> +}

That looks odd, why not just:

	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
		static_branch_enable(&sched_asym_cpucapacity);

? possibly with:

	else
		static_branch_disable(&sched_asym_cpucapacity);

if you want to play funny games :-)

2018-02-16 19:16:37

by Morten Rasmussen

Subject: Re: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

On Fri, Feb 16, 2018 at 02:47:04PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 15, 2018 at 04:20:48PM +0000, Morten Rasmussen wrote:
> > +static void update_asym_cpucapacity(int cpu)
> > +{
> > + if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
> > + lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > + static_branch_enable(&sched_asym_cpucapacity);
> > +}
>
> That looks odd, why not just:
>
> if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> static_branch_enable(&sched_asym_cpucapacity);

I actually had that initially and then I misread the implementation of
static_key_enable() as if it would trigger the WARN_ON_ONCE() condition
if I enabled an already enabled static key. But I see now that it
should be safe to do.

> ? possibly with:
>
> else
> static_branch_disable(&sched_asym_cpucapacity);
>
> if you want to play funny games :-)

I thought about that too. It could make certain hotplug scenarios even
more expensive. I think we want the sched_asym_cpucapacity code to
behave correctly even if SD_ASYM_CPUCAPACITY isn't set anywhere, so the
static key would be permanently set from the point we detect asymmetry,
and we would leave it set. This would be in line with how the SMT
static key works.

2018-02-16 19:19:08

by Peter Zijlstra

Subject: Re: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

On Fri, Feb 16, 2018 at 03:41:01PM +0000, Morten Rasmussen wrote:
> On Fri, Feb 16, 2018 at 02:47:04PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 15, 2018 at 04:20:48PM +0000, Morten Rasmussen wrote:
> > > +static void update_asym_cpucapacity(int cpu)
> > > +{
> > > + if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
> > > + lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > > + static_branch_enable(&sched_asym_cpucapacity);
> > > +}
> >
> > That looks odd, why not just:
> >
> > if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > static_branch_enable(&sched_asym_cpucapacity);
>
> I actually had that initially and then I misread the implementation of
> static_key_enable() as if it trigger the WARN_ON_ONCE() condition if I
> enable an already enabled static key. But I see now that it should be
> safe to do.

Right, that WARN is there for when we use enable/disable on a key with a
value outside of [0,1].

> > ? possibly with:
> >
> > else
> > static_branch_disable(&sched_asym_cpucapacity);
> >
> > if you want to play funny games :-)
>
> I thought about that too. It could make certain hotplug scenarios even
> more expensive. I think we want the sched_asym_cpucapacity code to behave
> even if SD_ASYM_CPUCAPACITY isn't set anywhere, so the static key would
> be permanently from the point we detect asymmetry and leave it set. This
> would be in line with how the smt static key works.

Fair enough..

2018-02-16 19:21:11

by Quentin Perret

Subject: Re: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

Hi Morten,

On Friday 16 Feb 2018 at 15:41:01 (+0000), Morten Rasmussen wrote:
> On Fri, Feb 16, 2018 at 02:47:04PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 15, 2018 at 04:20:48PM +0000, Morten Rasmussen wrote:
> > > +static void update_asym_cpucapacity(int cpu)
> > > +{
> > > + if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
> > > + lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > > + static_branch_enable(&sched_asym_cpucapacity);
> > > +}
> >
> > That looks odd, why not just:
> >
> > if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > static_branch_enable(&sched_asym_cpucapacity);
>
> I actually had that initially and then I misread the implementation of
> static_key_enable() as if it trigger the WARN_ON_ONCE() condition if I
> enable an already enabled static key. But I see now that it should be
> safe to do.

AFAIU it should be safe, but without your check you'll have to go through
cpus_read_lock()/unlock() every time a CPU is hotplugged. There is probably
no good reason to re-do that again and again if the state of the key
never changes. I don't know how expensive those lock/unlock operations are,
but I bet they are more expensive than testing static_branch_unlikely() :)

>
> > ? possibly with:
> >
> > else
> > static_branch_disable(&sched_asym_cpucapacity);
> >
> > if you want to play funny games :-)
>
> I thought about that too. It could make certain hotplug scenarios even
> more expensive. I think we want the sched_asym_cpucapacity code to behave
> even if SD_ASYM_CPUCAPACITY isn't set anywhere, so the static key would
> be permanently from the point we detect asymmetry and leave it set. This
> would be in line with how the smt static key works.

Thanks !
Quentin

2018-02-16 19:21:29

by Peter Zijlstra

Subject: Re: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

On Fri, Feb 16, 2018 at 05:39:27PM +0000, Quentin Perret wrote:
> AFAIU it should be safe, but without your check you'll have to go through
> cpus_read_lock()/unlock() every time a CPU is hotplugged. There is probably
> no good reason to re-do that again and again if the state of the key
> never changes. I don't know how expensive those lock/unlock operations are,
> but I bet they are more expensive than testing static_branch_unlikely() :)

This is not a performance critical path, more obvious code is more
better.

Also cpus_read_lock() is dirt cheap most of the time.

2018-02-19 11:51:56

by Morten Rasmussen

Subject: Re: [PATCH 1/7] sched: Add static_key for asymmetric cpu capacity optimizations

On Fri, Feb 16, 2018 at 05:51:23PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 16, 2018 at 03:41:01PM +0000, Morten Rasmussen wrote:
> > On Fri, Feb 16, 2018 at 02:47:04PM +0100, Peter Zijlstra wrote:
> > > ? possibly with:
> > >
> > > else
> > > static_branch_disable(&sched_asym_cpucapacity);
> > >
> > > if you want to play funny games :-)
> >
> > I thought about that too. It could make certain hotplug scenarios even
> > more expensive. I think we want the sched_asym_cpucapacity code to behave
> > even if SD_ASYM_CPUCAPACITY isn't set anywhere, so the static key would
> > be permanently from the point we detect asymmetry and leave it set. This
> > would be in line with how the smt static key works.
>
> Fair enough..

Yeah, we can always add the 'else' bit later if we find a good reason to
do so.

2018-02-20 13:05:56

by Peter Zijlstra

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> +/*
> + * group_similar_cpu_capacity: Returns true if the minimum capacity of the
> + * compared groups differ by less than 12.5%.
> + */
> +static inline bool
> +group_similar_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> +{
> + long diff = sg->sgc->min_capacity - ref->sgc->min_capacity;
> + long max = max(sg->sgc->min_capacity, ref->sgc->min_capacity);
> +
> + return abs(diff) < max >> 3;
> +}

This seems a fairly random and dodgy heuristic.


2018-02-20 13:06:21

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched/fair: Add group_misfit_task load-balance type

On Mon, Feb 19, 2018 at 02:56:44PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 15, 2018 at 04:20:49PM +0000, Morten Rasmussen wrote:
> > @@ -6733,9 +6758,12 @@ done: __maybe_unused
> > if (hrtick_enabled(rq))
> > hrtick_start_fair(rq, p);
> >
> > + update_misfit_status(p, rq);
> > +
> > return p;
> >
> > idle:
> > + update_misfit_status(NULL, rq);
> > new_tasks = idle_balance(rq, rf);
> >
> > /*
>
> So we set a point when picking a task (or tick). We clear said pointer
> when idle.

N/m, I can't read today. You only store the load, not the actual task.


2018-02-20 13:06:35

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched/fair: Add group_misfit_task load-balance type

On Thu, Feb 15, 2018 at 04:20:49PM +0000, Morten Rasmussen wrote:
> @@ -6733,9 +6758,12 @@ done: __maybe_unused
> if (hrtick_enabled(rq))
> hrtick_start_fair(rq, p);
>
> + update_misfit_status(p, rq);
> +
> return p;
>
> idle:
> + update_misfit_status(NULL, rq);
> new_tasks = idle_balance(rq, rf);
>
> /*

So we set a pointer when picking a task (or tick). We clear said
pointer when idle.

> @@ -7822,6 +7855,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> */
> if (!nr_running && idle_cpu(i))
> sgs->idle_cpus++;
> +
> + if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + !sgs->group_misfit_task_load && rq->misfit_task_load)
> + sgs->group_misfit_task_load = rq->misfit_task_load;
> }
>
> /* Adjust by relative CPU capacity of the group */

And we read said pointer from another CPU, without holding the
respective rq->lock.

What happens if, right after we set sgs->group_misfit_task_load, our
task decides to exit?


2018-02-20 13:07:27

by Peter Zijlstra

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Mon, Feb 19, 2018 at 03:50:12PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> > On systems with asymmetric cpu capacities, a skewed load distribution
> > might yield better throughput than balancing load per group capacity.
> > For example, preferring high capacity cpus for compute intensive tasks
> > leaving low capacity cpus idle rather than balancing the number of idle
> > cpus across different cpu types. Instead, let load-balance back off if
> > the busiest group isn't really overloaded.
>
> I'm sorry. I just can't seem to make sense of that today. What?

Aah, you're saying that if we have 4+4 big.LITTLE and 4 runnable tasks,
we would like them running on our big cluster and leave the small
cluster entirely idle, instead of running 2+2.

So what about this DynamiQ nonsense? Or is that so unstructured we
can't really do anything sensible with it?


2018-02-20 13:07:32

by Peter Zijlstra

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> On systems with asymmetric cpu capacities, a skewed load distribution
> might yield better throughput than balancing load per group capacity.
> For example, preferring high capacity cpus for compute intensive tasks
> leaving low capacity cpus idle rather than balancing the number of idle
> cpus across different cpu types. Instead, let load-balance back off if
> the busiest group isn't really overloaded.

I'm sorry. I just can't seem to make sense of that today. What?


2018-02-20 16:02:05

by Morten Rasmussen

Subject: Re: [PATCH 2/7] sched/fair: Add group_misfit_task load-balance type

On Mon, Feb 19, 2018 at 02:58:42PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 19, 2018 at 02:56:44PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 15, 2018 at 04:20:49PM +0000, Morten Rasmussen wrote:
> > > @@ -6733,9 +6758,12 @@ done: __maybe_unused
> > > if (hrtick_enabled(rq))
> > > hrtick_start_fair(rq, p);
> > >
> > > + update_misfit_status(p, rq);
> > > +
> > > return p;
> > >
> > > idle:
> > > + update_misfit_status(NULL, rq);
> > > new_tasks = idle_balance(rq, rf);
> > >
> > > /*
> >
> > So we set a point when picking a task (or tick). We clear said pointer
> > when idle.
>
> N/m, I can't read today. You only store the load, not the actual task.

It is a very valid question though.

Storing the load of the misfit task isn't the perfect solution as there
is no guarantee that we actually end up pulling the misfit task if some
other task happens to be on the rq when we balance. However, as I think
you are hinting, we will get into all sorts of interesting races if we
store a pointer to the actual task.

In most real scenarios it appears to be 'good enough'. The typical
misfit scenario is one always-running task and potentially a few small
tasks showing up periodically. In that case we are most likely to see
the always-running task when we are balancing.

2018-02-20 16:24:07

by Morten Rasmussen

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Mon, Feb 19, 2018 at 03:53:35PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 19, 2018 at 03:50:12PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> > > On systems with asymmetric cpu capacities, a skewed load distribution
> > > might yield better throughput than balancing load per group capacity.
> > > For example, preferring high capacity cpus for compute intensive tasks
> > > leaving low capacity cpus idle rather than balancing the number of idle
> > > cpus across different cpu types. Instead, let load-balance back off if
> > > the busiest group isn't really overloaded.
> >
> > I'm sorry. I just can't seem to make sense of that today. What?
>
> Aah, you're saying that is we have 4+4 bit.little and 4 runnable tasks,
> we would like them running on our big cluster and leave the small
> cluster entirely idle, instead of runnint 2+2.

Yes. Ideally, when we are busy, we want to fill up the big cpus first,
one task on each, and if we have more tasks we start using little cpus
before putting additional tasks on any of the bigs. The load per group
capacity metric is sort of working against that up until roughly the
point where we have enough tasks for all the cpus.

> So what about this DynamicQ nonsense? Or is that so unstructured we
> can't really do anything sensible with it?

With DynamiQ we don't have any grouping of the cpu types provided by
the architecture, so we are going to end up balancing between individual
cpus. I hope these tweaks will work for DynamiQ too: always-running
tasks on little cpus should be flagged as misfits and be picked up by
big cpus. It is still to be verified though.

2018-02-20 16:35:10

by Morten Rasmussen

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Mon, Feb 19, 2018 at 04:10:11PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> > +/*
> > + * group_similar_cpu_capacity: Returns true if the minimum capacity of the
> > + * compared groups differ by less than 12.5%.
> > + */
> > +static inline bool
> > +group_similar_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> > +{
> > + long diff = sg->sgc->min_capacity - ref->sgc->min_capacity;
> > + long max = max(sg->sgc->min_capacity, ref->sgc->min_capacity);
> > +
> > + return abs(diff) < max >> 3;
> > +}
>
> This seems a fairly random and dodgy heuristic.

I can't deny that :-)

We need to somehow figure out if we are doing asymmetric cpu capacity
balancing or normal SMP balancing. We probably don't care about
migrating tasks if the capacities are nearly identical. But how much is
'nearly'?

We could make it strictly equal as long as sgc->min_capacity is based on
capacity_orig. If we let things like rt-pressure influence
sgc->min_capacity, it might become a mess.

We could tie it to sd->imbalance_pct to make it slightly less arbitrary,
or we can try to drop the margin.

Alternative solutions and preferences are welcome...

2018-02-20 18:27:14

by Peter Zijlstra

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Tue, Feb 20, 2018 at 04:33:52PM +0000, Morten Rasmussen wrote:
> On Mon, Feb 19, 2018 at 04:10:11PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> > > +/*
> > > + * group_similar_cpu_capacity: Returns true if the minimum capacity of the
> > > + * compared groups differ by less than 12.5%.
> > > + */
> > > +static inline bool
> > > +group_similar_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> > > +{
> > > + long diff = sg->sgc->min_capacity - ref->sgc->min_capacity;
> > > + long max = max(sg->sgc->min_capacity, ref->sgc->min_capacity);
> > > +
> > > + return abs(diff) < max >> 3;
> > > +}
> >
> > This seems a fairly random and dodgy heuristic.
>
> I can't deny that :-)
>
> We need to somehow figure out if we are doing asymmetric cpu capacity
> balancing or normal SMP balancing. We probably don't care about
> migrating tasks if the capacities are nearly identical. But how much is
> 'nearly'?
>
> We could make it strictly equal as long as sgc->min_capacity is based on
> capacity_orig. If we let things like rt-pressure influence
> sgc->min_capacity, it might become a mess.

See, that is the problem, I think this min_capacity thing is
influenced by rt-pressure and the like.

See update_cpu_capacity(), min_capacity is set after we add the RT scale
factor thingy, and then update_group_capacity() filters the min of the
whole group. The thing only ever goes down.

But this means that if a big CPU has a very high IRQ/RT load, its
capacity will dip below that of a little core and min_capacity for the
big group as a whole will appear smaller than that of the little group.
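
(For instance, with made-up numbers: a big cpu with capacity_orig 1024
that loses 60% of its capacity to RT/IRQ pressure ends up with a
cpu_capacity of roughly 410, which can easily be below a little cpu's
capacity_orig, so the big group's min_capacity drops below the little
group's.)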

Or am I now terminally confused again?

2018-02-23 16:39:04

by Morten Rasmussen

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Tue, Feb 20, 2018 at 07:26:05PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 20, 2018 at 04:33:52PM +0000, Morten Rasmussen wrote:
> > On Mon, Feb 19, 2018 at 04:10:11PM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 15, 2018 at 04:20:51PM +0000, Morten Rasmussen wrote:
> > > > +/*
> > > > + * group_similar_cpu_capacity: Returns true if the minimum capacity of the
> > > > + * compared groups differ by less than 12.5%.
> > > > + */
> > > > +static inline bool
> > > > +group_similar_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> > > > +{
> > > > + long diff = sg->sgc->min_capacity - ref->sgc->min_capacity;
> > > > + long max = max(sg->sgc->min_capacity, ref->sgc->min_capacity);
> > > > +
> > > > + return abs(diff) < max >> 3;
> > > > +}
> > >
> > > This seems a fairly random and dodgy heuristic.
> >
> > I can't deny that :-)
> >
> > We need to somehow figure out if we are doing asymmetric cpu capacity
> > balancing or normal SMP balancing. We probably don't care about
> > migrating tasks if the capacities are nearly identical. But how much is
> > 'nearly'?
> >
> > We could make it strictly equal as long as sgc->min_capacity is based on
> > capacity_orig. If we let things like rt-pressure influence
> > sgc->min_capacity, it might become a mess.
>
> See, that is the problem, I think it this min_capacity thing is
> influenced by rt-pressure and the like.
>
> See update_cpu_capacity(), min_capacity is set after we add the RT scale
> factor thingy, and then update_group_capacity() filters the min of the
> whole group. The thing only ever goes down.
>
> But this means that if a big CPU has a very high IRQ/RT load, its
> capacity will dip below that of a little core and min_capacity for the
> big group as a whole will appear smaller than that of the little group.
>
> Or am I now terminally confused again?

No, I think you are right, or I'm equally confused.

I don't think we can avoid having some sort of margin to decide when
capacities are significantly different if we want to keep this patch.

Looking more into this, I realized that we already have a similar
comparison and margin in group_smaller_cpu_capacity(). So I'm going to
look at using that instead if possible.

The commit message could use some clarification too, as we do in fact
already use group_smaller_cpu_capacity() to bail out of pulling big
tasks from big cpus to little ones. However, there are cases where it is
fine to have sum_nr_running > group_weight on the big side, and cases
where it is fine to have the bigs idling while the littles are lightly
utilized, which is what should be addressed by this patch.

2018-02-23 16:49:50

by Peter Zijlstra

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Fri, Feb 23, 2018 at 04:38:06PM +0000, Morten Rasmussen wrote:
> > Or am I now terminally confused again?
>
> No, I think you are right, or I'm equally confused.

:-)

Would it make sense to also track max_capacity, but then based on the
value before RT scaling ?

That should readily be able to distinguish between big and little
clusters, although then DynamiQ would still completely ruin things.

2018-02-26 15:11:22

by Morten Rasmussen

Subject: Re: [PATCH 4/7] sched/fair: Avoid unnecessary balancing of asymmetric capacity groups

On Fri, Feb 23, 2018 at 05:47:52PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 23, 2018 at 04:38:06PM +0000, Morten Rasmussen wrote:
> > > Or am I now terminally confused again?
> >
> > No, I think you are right, or I'm equally confused.
>
> :-)
>
> Would it make sense to also track max_capacity, but then based on the
> value before RT scaling ?
>
> That should readily be able to distinguish between big and little
> clusters, although then DynamiQ would still completely ruin things.

IIRC, I did actually track max_capacity to begin with for the wake_cap()
stuff, but someone suggested using min_capacity instead to factor in
the RT scaling as it could potentially help some use-cases.

I can add unscaled max_capacity tracking and use that, as this is
primarily a solution for asymmetric cpu capacity systems.

Whether we track max or min shouldn't really matter if it is based on
original capacity, unless you have a DynamiQ system. For DynamiQ systems
it depends on how the system is configured. If it is a single DynamiQ
cluster we will have just one sched_domain with per-cpu sched_groups, so
we don't have balancing between mixed groups. If we have multiple
DynamiQ clusters, we can have mixed groups, homogeneous groups, or both
depending on the system configuration. Homogeneous groups should be
okay, and mixed groups could work okay, I think, as long as all groups
have the same mix; a mix of mixed groups is going to be a challenge.
Most likely we would have to treat all these groups as one and ignore
cache boundaries for scheduling decisions.

We are slowly getting into the mess of which capacity should be used for
various conditions. Is it the original capacity (unscaled and at the
highest frequency), do we subtract the RT utilization, and what if the
thermal framework has disabled some of the higher frequencies?

In particular, the constraints imposed by RT and thermal are not
permanent and might change depending on the use-case and how we place
the tasks. But that is a longer discussion to be had.

2018-02-28 07:47:06

by Gaku Inami

Subject: RE: [PATCH 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

Hi

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Morten Rasmussen
> Sent: Friday, February 16, 2018 1:21 AM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected]; [email protected];
> Morten Rasmussen <[email protected]>
> Subject: [PATCH 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
>
<snip>
>
> The patches have been tested on:
> 1. Arm Juno (r0): 2+4 Cortex A57/A53
> 2. Hikey960: 4+4 Cortex A73/A53
>
> Test case:
> Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> big cpus, while creating a longer running set of unpinned sysbench
> tasks.

Tested-by: Gaku Inami <[email protected]>

I ran the same test on another SoC. It looks fine.

The patches have been tested on:
3. Renesas R-Car H3 : 4+4 Cortex A57/A53

Results:
Single runs with completion time of each task
R-Car H3 (tip)
total time: 0.9415s
total time: 0.9582s
total time: 1.3275s
total time: 1.6752s

R-Car H3 (misfit)
total time: 0.9312s
total time: 0.9429s
total time: 0.9525s
total time: 0.9660s

10 run summary (tracking longest running task for each run)
        R-Car H3
        avg     max
tip     1.6752  1.6753
misfit  0.9890  1.0204

Regards,
Inami

2018-03-01 12:00:34

by Valentin Schneider

Subject: Re: [PATCH 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

Hi,

On 02/28/2018 07:46 AM, Gaku Inami wrote:
> Hi
>
[...]
>
> Tested-by: Gaku Inami <[email protected]>
>
> I tested as same in other SoC. It looks fine.
>

Thanks for testing on your side!