Hi folks,
These are the misfit-specific bits I tore out of [1] and that I've been
further chewing on.
o Patch 1 reworks the group vs CPU capacity checks. It removes a safeguard
  we had against downmigrations, so it had to grow fatter to compensate for
  that.
o Patch 2 aligns running and non-running misfit task cache hotness
considerations
This is based on top of today's tip/sched/core at:
e0ccd4a3b01 ("rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()")
Revisions
=========
v3 -> part 2 v1
---------------
o Removed sg_lb_stats->group_has_misfit_task (Vincent)
o Shoved misfit specific update_sg_lb_stats() work into its own function
(Vincent)
o Added migrate_degrades_capacity()
o Made task_hot() tweak depend on
src_grp_type == group_has_spare
rather than
idle != CPU_NOT_IDLE
v2 -> v3
--------
o Rebased on top of latest tip/sched/core
o Added test results vs stress-ng.vm-segv
v1 -> v2
--------
o Collected Reviewed-by
o Minor comment and code cleanups
o Consolidated static key vs SD flag explanation (Dietmar)
Note to Vincent: I didn't measure the impact of adding said static key to
load_balance(); I do, however, believe it is low-hanging fruit. The
wrapper keeps things neat and tidy, and is also helpful for documenting
the intricacies of the static key status vs the presence of the SD flag
in a CPU's sched_domain hierarchy.
o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
I had convinced myself it was.
o Squashed capacity margin usage with removal of
group_smaller_{min, max}_capacity() (Vincent)
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog
Links
=====
[1]: http://lore.kernel.org/r/[email protected]
[2]: http://lore.kernel.org/r/[email protected]
Cheers,
Valentin
Valentin Schneider (2):
sched/fair: Filter out locally-unsolvable misfit imbalances
sched/fair: Relax task_hot() for misfit tasks
kernel/sched/fair.c | 99 ++++++++++++++++++++++++++++++++++++---------
1 file changed, 81 insertions(+), 18 deletions(-)
--
2.25.1
Consider the following topology:
DIE [          ]
MC  [    ][    ]
     0  1  2  3
capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})
w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).
When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just before
this load_balance() happens and preempt the CPU hog running there, we would
have, for the [0-1] group at CPU2's DIE level:
o sgs->sum_nr_running > sgs->group_weight
o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct
IOW, this group is group_overloaded.
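For reference, those two conditions mirror the group_overloaded
classification check; a condensed sketch of group_is_overloaded() as it
reads in kernel/sched/fair.c around this series (shown for context, not
part of this patch):

static inline bool
group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
{
	/* More runnable tasks than CPUs in the group */
	if (sgs->sum_nr_running <= sgs->group_weight)
		return false;

	/* Utilization exceeds capacity scaled by imbalance_pct */
	if ((sgs->group_capacity * 100) <
			(sgs->group_util * imbalance_pct))
		return true;

	return false;
}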
Assuming CPU0 is picked by find_busiest_queue(), we would then visit the
preempted CPU hog in detach_tasks(). However, given it has just been
preempted by this pcpu kworker, task_hot() will prevent it from being
detached. We then leave load_balance() without having done anything.
Long story short, preempted misfit tasks are affected by task_hot(), while
currently running misfit tasks are intentionally preempted by the stopper
task to migrate them over to a higher-capacity CPU.
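The task_hot() check in play here is the recent-execution test at the end
of that function; condensed from kernel/sched/fair.c, with the earlier
special cases elided:

	/*
	 * A task that ran on the src CPU very recently is considered
	 * cache-hot and is skipped by detach_tasks() unless
	 * nr_balance_failed exceeds cache_nice_tries.
	 */
	delta = rq_clock_task(env->src_rq) - p->se.exec_start;

	return delta < (s64)sysctl_sched_migration_cost;

A freshly-preempted CPU hog has a tiny delta, hence the missed pull
described above.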
Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2d1a69d7aa7..43fc98d34276 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7493,6 +7493,7 @@ struct lb_env {
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
 	enum group_type		src_grp_type;
+	enum group_type		dst_grp_type;
 	struct list_head	tasks;
 };
@@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+
+/*
+ * What does migrating this task do to our capacity-aware scheduling criterion?
+ *
+ * Returns 1, if the task needs more capacity than the dst CPU can provide.
+ * Returns 0, if the task needs the extra capacity provided by the dst CPU.
+ * Returns -1, if the task isn't impacted by the migration wrt capacity.
+ */
+static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
+{
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		return -1;
+
+	if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
+		if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+			return 0;
+		else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
+			return 1;
+		else
+			return -1;
+	}
+
+	return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Returns 1, if task migration degrades locality
@@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (tsk_cache_hot == -1)
 		tsk_cache_hot = task_hot(p, env);
 
+	/*
+	 * On a (sane) asymmetric CPU capacity system, the increase in compute
+	 * capacity should offset any potential performance hit caused by a
+	 * migration.
+	 */
+	if ((env->dst_grp_type == group_has_spare) &&
+	    !migrate_degrades_capacity(p, env))
+		tsk_cache_hot = 0;
+
 	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 		if (tsk_cache_hot == 1) {
@@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (!sds.busiest)
 		goto out_balanced;
 
+	env->dst_grp_type = local->group_type;
 	env->src_grp_type = busiest->group_type;
 
 	/* Misfit tasks should be dealt with regardless of the avg load */
--
2.25.1
Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):
DIE [          ]
MC  [    ][    ]
     0  1  2  3
| CPU | capacity_orig | capacity |
|-----+---------------+----------|
| 0 | 870 | 860 |
| 1 | 870 | 600 |
| 2 | 1024 | 850 |
| 3 | 1024 | 860 |
If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 to be
sufficiently busy, i.e. they don't have enough spare capacity to accommodate
CPU1's misfit task. It would then fall to CPU2 to pull the task.
This currently won't happen, because CPU2 will fail
capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0, which is at this
point in time greater than that of CPU2. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).
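For reference, capacity_greater() is the margined capacity comparison
helper from kernel/sched/fair.c; around this series it reads roughly as:

/*
 * The margin used when comparing CPU capacities.
 * is 'cap1' noticeably greater than 'cap2'
 *
 * (default: ~5%)
 */
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

With the numbers above, capacity_greater(850, 860) is false
(850 * 1024 = 870400 < 860 * 1078 = 927080), so CPU2 fails the check.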
Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift.
Note that the aforementioned capacity vs sgc->max_capacity comparison was
meant to prevent misfit task downmigration: candidate groups classified as
group_misfit_task but with a higher (max) CPU capacity than the destination CPU
would be discarded. This change makes it so said group_misfit_task
classification can't happen anymore, which may cause some undesired
downmigrations.
Further tweak find_busiest_queue() to ensure this doesn't happen. Also note
find_busiest_queue() can now iterate over CPUs with a higher capacity than
the local CPU's, so add a capacity check there.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b8ae02f1994..d2d1a69d7aa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+	return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -7486,6 +7492,7 @@ struct lb_env {
 
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
+	enum group_type		src_grp_type;
 	struct list_head	tasks;
 };
@@ -8447,6 +8454,32 @@ static bool update_nohz_stats(struct rq *rq)
 #endif
 }
 
+static inline void update_sg_lb_misfit_stats(struct lb_env *env,
+					     struct sched_group *group,
+					     struct sg_lb_stats *sgs,
+					     int *sg_status,
+					     int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY) ||
+	    !rq->misfit_task_load)
+		return;
+
+	*sg_status |= SG_OVERLOAD;
+
+	/*
+	 * Don't attempt to maximize load for misfit tasks that can't be
+	 * granted a CPU capacity uplift.
+	 */
+	if (cpu_capacity_greater(env->dst_cpu, cpu)) {
+		sgs->group_misfit_task_load = max(
+				sgs->group_misfit_task_load,
+				rq->misfit_task_load);
+	}
+
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -8498,12 +8531,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (local_group)
 			continue;
 
-		/* Check for a misfit task on the cpu */
-		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
-			*sg_status |= SG_OVERLOAD;
-		}
+		update_sg_lb_misfit_stats(env, group, sgs, sg_status, i);
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
@@ -8550,15 +8578,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (!sgs->sum_h_nr_running)
 		return false;
 
-	/*
-	 * Don't try to pull misfit tasks we can't help.
-	 * We can use max_capacity here as reduction in capacity on some
-	 * CPUs in the group should either be possible to resolve
-	 * internally or be covered by avg_load imbalance (eventually).
-	 */
+	/* Don't try to pull misfit tasks we can't help */
 	if (sgs->group_type == group_misfit_task &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
-	     sds->local_stat.group_type != group_has_spare))
+	    sds->local_stat.group_type != group_has_spare)
 		return false;
 
 	if (sgs->group_type > busiest->group_type)
@@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (!sds.busiest)
 		goto out_balanced;
 
+	env->src_grp_type = busiest->group_type;
+
 	/* Misfit tasks should be dealt with regardless of the avg load */
 	if (busiest->group_type == group_misfit_task)
 		goto force_balance;
@@ -9441,8 +9465,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * average load.
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
-		    nr_running == 1)
+		    env->src_grp_type <= group_fully_busy &&
+		    !capacity_greater(capacity_of(env->dst_cpu), capacity))
 			continue;
 
 		switch (env->migration_type) {
@@ -9504,15 +9528,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		case migrate_misfit:
 			/*
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we
-			 * simply seek the "biggest" misfit task.
+			 * simply seek the "biggest" misfit task we can
+			 * accommodate.
 			 */
+			if (!cpu_capacity_greater(env->dst_cpu, i))
+				continue;
+
 			if (rq->misfit_task_load > busiest_load) {
 				busiest_load = rq->misfit_task_load;
 				busiest = rq;
 			}
 
 			break;
-
 		}
 	}
--
2.25.1
On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> ...
> Further tweak find_busiest_queue() to ensure this doesn't happen.
> Also note
> find_busiest_queue() can now iterate over CPUs with a higher capacity
> than
> the local CPU's, so add a capacity check there.
>
> Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> Consider the following topology:
>
> Long story short, preempted misfit tasks are affected by task_hot(),
> while
> currently running misfit tasks are intentionally preempted by the
> stopper
> task to migrate them over to a higher-capacity CPU.
>
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.
>
> Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
This patch looks good, but...
> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> struct lb_env *env)
> if (tsk_cache_hot == -1)
> tsk_cache_hot = task_hot(p, env);
>
> + /*
> + * On a (sane) asymmetric CPU capacity system, the increase in
> compute
> + * capacity should offset any potential performance hit caused
> by a
> + * migration.
> + */
> + if ((env->dst_grp_type == group_has_spare) &&
> + !migrate_degrades_capacity(p, env))
> + tsk_cache_hot = 0;
... I'm starting to wonder if we should not rename the
tsk_cache_hot variable to something else to make this
code more readable. Probably in another patch :)
--
All Rights Reversed.
On 15/04/21 16:39, Rik van Riel wrote:
> On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
>> Consider the following topology:
>>
>> Long story short, preempted misfit tasks are affected by task_hot(),
>> while
>> currently running misfit tasks are intentionally preempted by the
>> stopper
>> task to migrate them over to a higher-capacity CPU.
>>
>> Align detach_tasks() with the active-balance logic and let it pick a
>> cache-hot misfit task when the destination CPU can provide a capacity
>> uplift.
>>
>> Signed-off-by: Valentin Schneider <[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>
>
Thanks!
>
> This patch looks good, but...
>
>> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
>> struct lb_env *env)
>> if (tsk_cache_hot == -1)
>> tsk_cache_hot = task_hot(p, env);
>>
>> + /*
>> + * On a (sane) asymmetric CPU capacity system, the increase in
>> compute
>> + * capacity should offset any potential performance hit caused
>> by a
>> + * migration.
>> + */
>> + if ((env->dst_grp_type == group_has_spare) &&
>> + !migrate_degrades_capacity(p, env))
>> + tsk_cache_hot = 0;
>
> ... I'm starting to wonder if we should not rename the
> tsk_cache_hot variable to something else to make this
> code more readable. Probably in another patch :)
>
I'd tend to agree, but naming is hard. "migration_harmful"?
> --
> All Rights Reversed.
On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> Consider the following topology:
>
> DIE [ ]
> MC [ ][ ]
> 0 1 2 3
>
> capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})
>
> w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).
>
> When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
> should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
> upmigration). However, should a e.g. pcpu kworker awake on CPU0 just before
> this load_balance() happens and preempt the CPU hog running there, we would
> have, for the [0-1] group at CPU2's DIE level:
>
> o sgs->sum_nr_running > sgs->group_weight
> o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct
>
> IOW, this group is group_overloaded.
>
> Considering CPU0 is picked by find_busiest_queue(), we would then visit the
> preempted CPU hog in detach_tasks(). However, given it has just been
> preempted by this pcpu kworker, task_hot() will prevent it from being
> detached. We then leave load_balance() without having done anything.
>
> Long story short, preempted misfit tasks are affected by task_hot(), while
> currently running misfit tasks are intentionally preempted by the stopper
> task to migrate them over to a higher-capacity CPU.
>
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2d1a69d7aa7..43fc98d34276 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7493,6 +7493,7 @@ struct lb_env {
> enum fbq_type fbq_type;
> enum migration_type migration_type;
> enum group_type src_grp_type;
> + enum group_type dst_grp_type;
> struct list_head tasks;
> };
>
> @@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
> return delta < (s64)sysctl_sched_migration_cost;
> }
>
> +
> +/*
> + * What does migrating this task do to our capacity-aware scheduling criterion?
> + *
> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> + */
> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> +{
> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> + return -1;
> +
> + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> + return 0;
> + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> + return 1;
> + else
> + return -1;
> + }
Being there means that the task fits src_cpu's capacity, so why test p against dst_cpu?
> +
> + return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
> +}
I prefer the below, which is easier to read because the same var is used everywhere and you can remove cpu_capacity_greater():

static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
{
	unsigned long src_capacity, dst_capacity;

	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
		return -1;

	src_capacity = capacity_of(env->src_cpu);
	dst_capacity = capacity_of(env->dst_cpu);

	if (!task_fits_capacity(p, src_capacity)) {
		if (capacity_greater(dst_capacity, src_capacity))
			return 0;
		else if (capacity_greater(src_capacity, dst_capacity))
			return 1;
		else
			return -1;
	}

	return task_fits_capacity(p, dst_capacity) ? -1 : 1;
}
> +
> #ifdef CONFIG_NUMA_BALANCING
> /*
> * Returns 1, if task migration degrades locality
> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (tsk_cache_hot == -1)
> tsk_cache_hot = task_hot(p, env);
>
> + /*
> + * On a (sane) asymmetric CPU capacity system, the increase in compute
> + * capacity should offset any potential performance hit caused by a
> + * migration.
> + */
> + if ((env->dst_grp_type == group_has_spare) &&
Shouldn't it be env->src_grp_type == group_misfit_task, to only care about
the misfit task case as stated in $subject?
> + !migrate_degrades_capacity(p, env))
> + tsk_cache_hot = 0;
> +
> if (tsk_cache_hot <= 0 ||
> env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> if (tsk_cache_hot == 1) {
> @@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> if (!sds.busiest)
> goto out_balanced;
>
> + env->dst_grp_type = local->group_type;
> env->src_grp_type = busiest->group_type;
>
> /* Misfit tasks should be dealt with regardless of the avg load */
> --
> 2.25.1
>
On Thu, 15 Apr 2021 at 19:58, Valentin Schneider
<[email protected]> wrote:
>
> Consider the following (hypothetical) asymmetric CPU capacity topology,
> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>
> DIE [ ]
> MC [ ][ ]
> 0 1 2 3
>
> | CPU | capacity_orig | capacity |
> |-----+---------------+----------|
> | 0 | 870 | 860 |
> | 1 | 870 | 600 |
> | 2 | 1024 | 850 |
> | 3 | 1024 | 860 |
>
> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>
> This currently won't happen, because CPU2 will fail
>
> capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
>
> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> level. In this case, the max_capacity is that of CPU0's, which is at this
> point in time greater than that of CPU2's. This comparison doesn't make
> much sense, given that the only CPUs we should care about in this scenario
> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> destination CPU).
>
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift.
>
> Note that the aforementioned capacity vs sgc->max_capacity comparison was
> meant to prevent misfit task downmigration: candidate groups classified as
> group_misfit_task but with a higher (max) CPU capacity than the destination CPU
> would be discarded. This change makes it so said group_misfit_task
> classification can't happen anymore, which may cause some undesired
> downmigrations.
>
> Further tweak find_busiest_queue() to ensure this doesn't happen. Also note
> find_busiest_queue() can now iterate over CPUs with a higher capacity than
> the local CPU's, so add a capacity check there.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++-------------
> 1 file changed, 45 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9b8ae02f1994..d2d1a69d7aa7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity;
> }
>
> +/* Is CPU a's capacity noticeably greater than CPU b's? */
> +static inline bool cpu_capacity_greater(int a, int b)
> +{
> + return capacity_greater(capacity_of(a), capacity_of(b));
> +}
> +
> static void record_wakee(struct task_struct *p)
> {
> /*
> @@ -7486,6 +7492,7 @@ struct lb_env {
>
> enum fbq_type fbq_type;
> enum migration_type migration_type;
> + enum group_type src_grp_type;
> struct list_head tasks;
> };
>
> @@ -8447,6 +8454,32 @@ static bool update_nohz_stats(struct rq *rq)
> #endif
> }
>
> +static inline void update_sg_lb_misfit_stats(struct lb_env *env,
> + struct sched_group *group,
> + struct sg_lb_stats *sgs,
> + int *sg_status,
> + int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> +
> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY) ||
> + !rq->misfit_task_load)
> + return;
> +
> + *sg_status |= SG_OVERLOAD;
> +
> + /*
> + * Don't attempt to maximize load for misfit tasks that can't be
> + * granted a CPU capacity uplift.
> + */
> + if (cpu_capacity_greater(env->dst_cpu, cpu)) {
> + sgs->group_misfit_task_load = max(
> + sgs->group_misfit_task_load,
> + rq->misfit_task_load);
> + }
> +
> +}
> +
> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -8498,12 +8531,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (local_group)
> continue;
>
> - /* Check for a misfit task on the cpu */
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - sgs->group_misfit_task_load < rq->misfit_task_load) {
> - sgs->group_misfit_task_load = rq->misfit_task_load;
> - *sg_status |= SG_OVERLOAD;
> - }
> + update_sg_lb_misfit_stats(env, group, sgs, sg_status, i);
> }
>
> /* Check if dst CPU is idle and preferred to this group */
> @@ -8550,15 +8578,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> if (!sgs->sum_h_nr_running)
> return false;
>
> - /*
> - * Don't try to pull misfit tasks we can't help.
> - * We can use max_capacity here as reduction in capacity on some
> - * CPUs in the group should either be possible to resolve
> - * internally or be covered by avg_load imbalance (eventually).
> - */
> + /* Don't try to pull misfit tasks we can't help */
> if (sgs->group_type == group_misfit_task &&
> - (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> - sds->local_stat.group_type != group_has_spare))
> + sds->local_stat.group_type != group_has_spare)
> return false;
>
> if (sgs->group_type > busiest->group_type)
> @@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> if (!sds.busiest)
> goto out_balanced;
>
> + env->src_grp_type = busiest->group_type;
> +
> /* Misfit tasks should be dealt with regardless of the avg load */
> if (busiest->group_type == group_misfit_task)
> goto force_balance;
> @@ -9441,8 +9465,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * average load.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> - nr_running == 1)
> + env->src_grp_type <= group_fully_busy &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity))
> continue;
>
> switch (env->migration_type) {
> @@ -9504,15 +9528,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> case migrate_misfit:
> /*
> * For ASYM_CPUCAPACITY domains with misfit tasks we
> - * simply seek the "biggest" misfit task.
> + * simply seek the "biggest" misfit task we can
> + * accommodate.
> */
> + if (!cpu_capacity_greater(env->dst_cpu, i))
Use the same level of interface as above. This makes the code and the
condition easier to follow in find_busiest_queue():
capacity_greater(capacity_of(env->dst_cpu), capacity_of(i))
> + continue;
> +
> if (rq->misfit_task_load > busiest_load) {
> busiest_load = rq->misfit_task_load;
> busiest = rq;
> }
>
> break;
> -
> }
> }
>
> --
> 2.25.1
>
On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
> On 15/04/21 16:39, Rik van Riel wrote:
> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> >> Consider the following topology:
> >>
> >> Long story short, preempted misfit tasks are affected by task_hot(),
> >> while
> >> currently running misfit tasks are intentionally preempted by the
> >> stopper
> >> task to migrate them over to a higher-capacity CPU.
> >>
> >> Align detach_tasks() with the active-balance logic and let it pick a
> >> cache-hot misfit task when the destination CPU can provide a capacity
> >> uplift.
> >>
> >> Signed-off-by: Valentin Schneider <[email protected]>
> >
> > Reviewed-by: Rik van Riel <[email protected]>
> >
>
> Thanks!
>
> >
> > This patch looks good, but...
> >
> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> >> struct lb_env *env)
> >> if (tsk_cache_hot == -1)
> >> tsk_cache_hot = task_hot(p, env);
> >>
> >> + /*
> >> + * On a (sane) asymmetric CPU capacity system, the increase in
> >> compute
> >> + * capacity should offset any potential performance hit caused
> >> by a
> >> + * migration.
> >> + */
> >> + if ((env->dst_grp_type == group_has_spare) &&
> >> + !migrate_degrades_capacity(p, env))
> >> + tsk_cache_hot = 0;
> >
> > ... I'm starting to wonder if we should not rename the
> > tsk_cache_hot variable to something else to make this
> > code more readable. Probably in another patch :)
> >
>
> I'd tend to agree, but naming is hard. "migration_harmful" ?
I thought Rik meant tsk_cache_hot, for which I'd suggest at least
buying a vowel and putting an 'a' in there :)
Cheers,
Phil
>
> > --
> > All Rights Reversed.
>
--
On 16/04/21 15:29, Vincent Guittot wrote:
> On Thu, 15 Apr 2021 at 19:58, Valentin Schneider
> <[email protected]> wrote:
>> @@ -9441,8 +9465,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>> * average load.
>> */
>> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
>> - !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>> - nr_running == 1)
>> + env->src_grp_type <= group_fully_busy &&
>> + !capacity_greater(capacity_of(env->dst_cpu), capacity))
>> continue;
>>
>> switch (env->migration_type) {
>> @@ -9504,15 +9528,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>> case migrate_misfit:
>> /*
>> * For ASYM_CPUCAPACITY domains with misfit tasks we
>> - * simply seek the "biggest" misfit task.
>> + * simply seek the "biggest" misfit task we can
>> + * accommodate.
>> */
>> + if (!cpu_capacity_greater(env->dst_cpu, i))
>
> Use the same level of interface as above. This makes code and the
> condition easier to follow in find_busiest_queue()
>
> capacity_greater(capacity_of(env->dst_cpu), capacity_of(i))
>
OK
On 19/04/21 08:59, Phil Auld wrote:
> On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
>> On 15/04/21 16:39, Rik van Riel wrote:
>> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
>> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
>> >> struct lb_env *env)
>> >> if (tsk_cache_hot == -1)
>> >> tsk_cache_hot = task_hot(p, env);
>> >>
>> >> + /*
>> >> + * On a (sane) asymmetric CPU capacity system, the increase in
>> >> compute
>> >> + * capacity should offset any potential performance hit caused
>> >> by a
>> >> + * migration.
>> >> + */
>> >> + if ((env->dst_grp_type == group_has_spare) &&
>> >> + !migrate_degrades_capacity(p, env))
>> >> + tsk_cache_hot = 0;
>> >
>> > ... I'm starting to wonder if we should not rename the
>> > tsk_cache_hot variable to something else to make this
>> > code more readable. Probably in another patch :)
>> >
>>
>> I'd tend to agree, but naming is hard. "migration_harmful" ?
>
> I thought Rik meant tsk_cache_hot, for which I'd suggest at least
> buying a vowel and putting an 'a' in there :)
>
That's the one I was eyeing: s/tsk_cache_hot/migration_harmful/ or
somesuch. Right now we're feeding it:
o migrate_degrades_locality()
o task_hot()
and I'm adding another one, so that's 2/3 which don't actually care about
cache hotness, but rather "does doing this migration degrade/improve
$criterion?"
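For context, the flow feeding that variable looks roughly like this
(condensed from can_migrate_task() in kernel/sched/fair.c; not verbatim):

	/* -1: locality doesn't matter, 0: improves, 1: degrades */
	tsk_cache_hot = migrate_degrades_locality(p, env);
	if (tsk_cache_hot == -1)
		tsk_cache_hot = task_hot(p, env);	/* 0 or 1 */

	if (tsk_cache_hot <= 0 ||
	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
		if (tsk_cache_hot == 1) {
			schedstat_inc(env->sd->lb_hot_gained[env->idle]);
			schedstat_inc(p->se.statistics.nr_forced_migrations);
		}
		return 1;
	}

	schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
	return 0;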
On 16/04/21 15:51, Vincent Guittot wrote:
> On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
>> +
>> +/*
>> + * What does migrating this task do to our capacity-aware scheduling criterion?
>> + *
>> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
>> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
>> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
>> + */
>> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
>> +{
>> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>> + return -1;
>> +
>> + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
>> + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>> + return 0;
>> + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
>> + return 1;
>> + else
>> + return -1;
>> + }
>
> Being there means that task fits src_cpu capacity so why testing p against dst_cpu ?
>
Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
which it *doesn't* fit.
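For reference, the fits check used on both ends is, roughly
(kernel/sched/fair.c around this series):

/* The margin used when comparing utilization with CPU capacity (~20%) */
#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)

static inline int task_fits_capacity(struct task_struct *p, long capacity)
{
	return fits_capacity(uclamp_task_util(p), capacity);
}

so a task can simultaneously fit on a big src_cpu and not fit on a smaller
dst_cpu.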
>> +
>> + return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
>> +}
>
> I prefer the below, which is easier to read because the same var is used everywhere and you can remove cpu_capacity_greater():
>
> static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> {
> 	unsigned long src_capacity, dst_capacity;
>
> 	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> 		return -1;
>
> 	src_capacity = capacity_of(env->src_cpu);
> 	dst_capacity = capacity_of(env->dst_cpu);
>
> 	if (!task_fits_capacity(p, src_capacity)) {
> 		if (capacity_greater(dst_capacity, src_capacity))
> 			return 0;
> 		else if (capacity_greater(src_capacity, dst_capacity))
> 			return 1;
> 		else
> 			return -1;
> 	}
>
> 	return task_fits_capacity(p, dst_capacity) ? -1 : 1;
> }
>
I'll take it, thanks!
>
>> +
>> #ifdef CONFIG_NUMA_BALANCING
>> /*
>> * Returns 1, if task migration degrades locality
>> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> if (tsk_cache_hot == -1)
>> tsk_cache_hot = task_hot(p, env);
>>
>> + /*
>> + * On a (sane) asymmetric CPU capacity system, the increase in compute
>> + * capacity should offset any potential performance hit caused by a
>> + * migration.
>> + */
>> + if ((env->dst_grp_type == group_has_spare) &&
>
> Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as
> stated in $subject
>
Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
could give us a better picture. Staring at this some more, this isn't so
true when the group size goes up - there's no guarantee the dst_cpu is the
one that has spare cycles, and the other CPUs might not be able to grant
the capacity uplift dst_cpu can.
As for not using src_grp_type == group_misfit_task, this is pretty much the
same as [1]. CPU-bound (misfit) task + some other task on the same rq
implies group_overloaded classification when balancing at MC level (no SMT,
so one group per CPU).
[1]: http://lore.kernel.org/r/[email protected]
On Mon, Apr 19, 2021 at 06:17:47PM +0100 Valentin Schneider wrote:
> On 19/04/21 08:59, Phil Auld wrote:
> > On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
> >> On 15/04/21 16:39, Rik van Riel wrote:
> >> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> >> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> >> >> struct lb_env *env)
> >> >> if (tsk_cache_hot == -1)
> >> >> tsk_cache_hot = task_hot(p, env);
> >> >>
> >> >> + /*
> >> >> + * On a (sane) asymmetric CPU capacity system, the increase in
> >> >> compute
> >> >> + * capacity should offset any potential performance hit caused
> >> >> by a
> >> >> + * migration.
> >> >> + */
> >> >> + if ((env->dst_grp_type == group_has_spare) &&
> >> >> + !migrate_degrades_capacity(p, env))
> >> >> + tsk_cache_hot = 0;
> >> >
> >> > ... I'm starting to wonder if we should not rename the
> >> > tsk_cache_hot variable to something else to make this
> >> > code more readable. Probably in another patch :)
> >> >
> >>
> >> I'd tend to agree, but naming is hard. "migration_harmful" ?
> >
> > I thought Rik meant tsk_cache_hot, for which I'd suggest at least
> > buying a vowel and putting an 'a' in there :)
> >
>
> That's the one I was eyeing: s/tsk_cache_hot/migration_harmful/ or
> somesuch. Right now we're feeding it:
>
Fair enough, my bad, the migration part immediately drew me to
migrate_degrades_capacity().
> o migrate_degrades_locality()
> o task_hot()
>
> and I'm adding another one, so that's 2/3 which don't actually care about
> cache hotness, but rather "does doing this migration degrade/improve
> $criterion?"
>
prefer_no_migrate?
Cheers,
Phil
--
On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
<[email protected]> wrote:
>
> On 16/04/21 15:51, Vincent Guittot wrote:
> > On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> >> +
> >> +/*
> >> + * What does migrating this task do to our capacity-aware scheduling criterion?
> >> + *
> >> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> >> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
> >> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> >> + */
> >> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> >> +{
> >> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> >> + return -1;
> >> +
> >> + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> >> + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> >> + return 0;
> >> + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> >> + return 1;
> >> + else
> >> + return -1;
> >> + }
> >
> > Being there means that task fits src_cpu capacity so why testing p against dst_cpu ?
> >
>
> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
> which it *doesn't* fit.
OK. I was confused because I thought that this was only meant to force
migration in the group_misfit_task case, but you tried to extend it to
other cases... I'm not convinced that you succeeded in covering them all.
Also, I find it a bit disturbing that this function returns 3 values.
IIUC you tried to align it with migrate_degrades_locality, but you should
have aligned it with task_hot and returned only 0 or 1; -1 is not used.
>
> >> +
> >> + return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
> >> +}
> >
> > I prefer the below, which is easier to read because the same var is used everywhere and you can remove cpu_capacity_greater():
> >
> > static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> > {
> > 	unsigned long src_capacity, dst_capacity;
> >
> > 	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> > 		return -1;
> >
> > 	src_capacity = capacity_of(env->src_cpu);
> > 	dst_capacity = capacity_of(env->dst_cpu);
> >
> > 	if (!task_fits_capacity(p, src_capacity)) {
> > 		if (capacity_greater(dst_capacity, src_capacity))
> > 			return 0;
> > 		else if (capacity_greater(src_capacity, dst_capacity))
> > 			return 1;
> > 		else
> > 			return -1;
> > 	}
> >
> > 	return task_fits_capacity(p, dst_capacity) ? -1 : 1;
> > }
> >
>
> I'll take it, thanks!
>
> >
> >> +
> >> #ifdef CONFIG_NUMA_BALANCING
> >> /*
> >> * Returns 1, if task migration degrades locality
> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >> if (tsk_cache_hot == -1)
> >> tsk_cache_hot = task_hot(p, env);
> >>
> >> + /*
> >> + * On a (sane) asymmetric CPU capacity system, the increase in compute
> >> + * capacity should offset any potential performance hit caused by a
> >> + * migration.
> >> + */
> >> + if ((env->dst_grp_type == group_has_spare) &&
> >
> > Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as
> > stated in $subject
> >
>
> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
> could give us a better picture. Staring at this some more, this isn't so
> true when the group size goes up - there's no guarantees the dst_cpu is the
> one that has spare cycles, and the other CPUs might not be able to grant
> the capacity uplift dst_cpu can.
yeah you have to keep checking for env->idle != CPU_NOT_IDLE
>
> As for not using src_grp_type == group_misfit_task, this is pretty much the
> same as [1]. CPU-bound (misfit) task + some other task on the same rq
> implies group_overloaded classification when balancing at MC level (no SMT,
> so one group per CPU).
Is it something that happens often, or just a sporadic/transient state?
I mean, is it really worth the extra complexity, and do you see a
performance improvement?
You should rather focus on fixing the simple group_misfit_task case.
The other cases look far more complex, with lots of corner cases.
>
> [1]: http://lore.kernel.org/r/[email protected]
On 20/04/21 16:33, Vincent Guittot wrote:
> On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
> <[email protected]> wrote:
>>
>> On 16/04/21 15:51, Vincent Guittot wrote:
>> > On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
>> >> +
>> >> +/*
>> >> + * What does migrating this task do to our capacity-aware scheduling criterion?
>> >> + *
>> >> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
>> >> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
>> >> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
>> >> + */
>> >> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
>> >> +{
>> >> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>> >> + return -1;
>> >> +
>> >> + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
>> >> + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>> >> + return 0;
>> >> + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
>> >> + return 1;
>> >> + else
>> >> + return -1;
>> >> + }
>> >
>> > Being there means that task fits src_cpu capacity so why testing p against dst_cpu ?
>> >
>>
>> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
>> which it *doesn't* fit.
>
> OK. I was confused because I thought that this was only to force
> migration in case of group_misfit_task but you tried to extend to
> other cases... I'm not convinced that you succeeded to cover all cases
>
> Also I found this function which returns 3 values a bit disturbing.
> IIUC you tried to align to migrate_degrades_capacity but you should
> have better aligned to task_hot and return only 0 or 1. -1 is not used
>
Ack, will do.
>> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> >> if (tsk_cache_hot == -1)
>> >> tsk_cache_hot = task_hot(p, env);
>> >>
>> >> + /*
>> >> + * On a (sane) asymmetric CPU capacity system, the increase in compute
>> >> + * capacity should offset any potential performance hit caused by a
>> >> + * migration.
>> >> + */
>> >> + if ((env->dst_grp_type == group_has_spare) &&
>> >
>> > Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as
>> > stated in $subject
>> >
>>
>> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
>> could give us a better picture. Staring at this some more, this isn't so
>> true when the group size goes up - there's no guarantees the dst_cpu is the
>> one that has spare cycles, and the other CPUs might not be able to grant
>> the capacity uplift dst_cpu can.
>
> yeah you have to keep checking for env->idle != CPU_NOT_IDLE
>
>>
>> As for not using src_grp_type == group_misfit_task, this is pretty much the
>> same as [1]. CPU-bound (misfit) task + some other task on the same rq
>> implies group_overloaded classification when balancing at MC level (no SMT,
>> so one group per CPU).
>
> Is it something that happens often or just a sporadic/transient state
> ? I mean does it really worth the extra complexity and do you see
> performance improvement ?
>
"Unfortunately" yes, this is a relatively common scenario when running "1
big task per CPU" types of workloads. The expected behaviour for big.LITTLE
systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
becomes free, usually via newidle balance (which, since they process work
faster than the LITTLEs, is bound to happen), and an extra task being
enqueued at "the wrong time" can prevent this from happening.
This usually means a misfit task can take a few dozen ms longer than it
should to be migrated - in the tests I run (which are pretty much this 1
hog per CPU workload) this happens about ~20% of the time.
> You should better focus on fixing the simple case of group_misfit_task
> task. This other cases looks far more complex with lot of corner cases
>
>>
>> [1]: http://lore.kernel.org/r/[email protected]
On 15/04/2021 19:58, Valentin Schneider wrote:
[...]
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift.
>
> Note that the aforementioned capacity vs sgc->max_capacity comparison was
> meant to prevent misfit task downmigration: candidate groups classified as
> group_misfit_task but with a higher (max) CPU capacity than the destination CPU
> would be discarded. This change makes it so said group_misfit_task
> classification can't happen anymore, which may cause some undesired
> downmigrations.
>
> Further tweak find_busiest_queue() to ensure this doesn't happen.
Maybe you can briefly describe here what the new mechanism is that you
replace the old 'prevent misfit task downmigration' one with.
> Also note
> find_busiest_queue() can now iterate over CPUs with a higher capacity than
> the local CPU's, so add a capacity check there.
[...]
> +static inline void update_sg_lb_misfit_stats(struct lb_env *env,
> + struct sched_group *group,
Seems to be unused.
> + struct sg_lb_stats *sgs,
> + int *sg_status,
> + int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> +
> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY) ||
> + !rq->misfit_task_load)
> + return;
> +
> + *sg_status |= SG_OVERLOAD;
> +
> + /*
> + * Don't attempt to maximize load for misfit tasks that can't be
> + * granted a CPU capacity uplift.
> + */
> + if (cpu_capacity_greater(env->dst_cpu, cpu)) {
> + sgs->group_misfit_task_load = max(
> + sgs->group_misfit_task_load,
> + rq->misfit_task_load);
> + }
> +
> +}
> +
[...]
> @@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> if (!sds.busiest)
> goto out_balanced;
>
> + env->src_grp_type = busiest->group_type;
Maybe add a short comment on why you set it here in fbg(). It's only used
later in fbq() for an asym. CPU capacity sd, but is set unconditionally.
[...]
On 21/04/2021 12:52, Valentin Schneider wrote:
> On 20/04/21 16:33, Vincent Guittot wrote:
>> On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
>> <[email protected]> wrote:
>>>
>>> On 16/04/21 15:51, Vincent Guittot wrote:
>>>> On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
>>>>> +
>>>>> +/*
>>>>> + * What does migrating this task do to our capacity-aware scheduling criterion?
>>>>> + *
>>>>> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
>>>>> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
>>>>> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
>>>>> + */
>>>>> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
>>>>> +{
>>>>> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>>>>> + return -1;
>>>>> +
>>>>> + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
>>>>> + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>>>>> + return 0;
>>>>> + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
>>>>> + return 1;
>>>>> + else
>>>>> + return -1;
>>>>> + }
>>>>
>>>> Being there means that task fits src_cpu capacity so why testing p against dst_cpu ?
>>>>
>>>
>>> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
>>> which it *doesn't* fit.
>>
>> OK. I was confused because I thought that this was only to force
>> migration in case of group_misfit_task but you tried to extend to
>> other cases... I'm not convinced that you succeeded to cover all cases
>>
>> Also I found this function which returns 3 values a bit disturbing.
>> IIUC you tried to align to migrate_degrades_capacity but you should
>> have better aligned to task_hot and return only 0 or 1. -1 is not used
>>
>
> Ack, will do.
>
>>>>> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>>> if (tsk_cache_hot == -1)
>>>>> tsk_cache_hot = task_hot(p, env);
>>>>>
>>>>> + /*
>>>>> + * On a (sane) asymmetric CPU capacity system, the increase in compute
>>>>> + * capacity should offset any potential performance hit caused by a
>>>>> + * migration.
>>>>> + */
>>>>> + if ((env->dst_grp_type == group_has_spare) &&
>>>>
>>>> Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as
>>>> stated in $subject
>>>>
>>>
>>> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
>>> could give us a better picture. Staring at this some more, this isn't so
>>> true when the group size goes up - there's no guarantees the dst_cpu is the
>>> one that has spare cycles, and the other CPUs might not be able to grant
>>> the capacity uplift dst_cpu can.
>>
>> yeah you have to keep checking for env->idle != CPU_NOT_IDLE
>>
>>>
>>> As for not using src_grp_type == group_misfit_task, this is pretty much the
>>> same as [1]. CPU-bound (misfit) task + some other task on the same rq
>>> implies group_overloaded classification when balancing at MC level (no SMT,
>>> so one group per CPU).
>>
>> Is it something that happens often or just a sporadic/transient state
>> ? I mean does it really worth the extra complexity and do you see
>> performance improvement ?
>>
>
> "Unfortunately" yes, this is a relatively common scenario when running "1
> big task per CPU" types of workloads. The expected behaviour for big.LITTLE
> systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
> becomes free, usually via newidle balance (which, since they process work
> faster than the LITTLEs, is bound to happen), and an extra task being
> enqueued at "the wrong time" can prevent this from happening.
>
> This usually means a misfit task can take a few dozen extra ms than it
> should to be migrated - in the tests I run (which are pretty much this 1
> hog per CPU workload) this happens about ~20% of the time.
>
>> You should better focus on fixing the simple case of group_misfit_task
>> task. This other cases looks far more complex with lot of corner cases
>>
>>>
>>> [1]: http://lore.kernel.org/r/[email protected]
Just to make sure I can follow the conversation ...
In case you:
(1) return 1 instead of -1
(2) keep the `env->idle != CPU_NOT_IDLE` check
(3) and remove the `dst_grp_type == group_has_spare` check
you are pretty much back to what you had in [PATCH v3 7/7] directly in
task_hot() except:
(4) the 'if (p fits on src_cpu && p !fits dst_cpu) => tsk_cache_hot' check?
On 22/04/21 11:48, Dietmar Eggemann wrote:
> On 15/04/2021 19:58, Valentin Schneider wrote:
>
> [...]
>
>> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
>> env->dst_cpu would grant it a capacity uplift.
>>
>> Note that the aforementioned capacity vs sgc->max_capacity comparison was
>> meant to prevent misfit task downmigration: candidate groups classified as
>> group_misfit_task but with a higher (max) CPU capacity than the destination CPU
>> would be discarded. This change makes it so said group_misfit_task
>> classification can't happen anymore, which may cause some undesired
>> downmigrations.
>>
>> Further tweak find_busiest_queue() to ensure this doesn't happen.
>
> Maybe you can describe shortly here what's the new mechanism is you
> replace the old 'prevent misfit task downmigration' with.
>
Will do.
> Also note
>> find_busiest_queue() can now iterate over CPUs with a higher capacity than
>> the local CPU's, so add a capacity check there.
>
> [...]
>
>> +static inline void update_sg_lb_misfit_stats(struct lb_env *env,
>> + struct sched_group *group,
>
> Seems to be not used.
>
Right, that's an update_sg_lb_stats() copy/paste fail :-)
>
>> @@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> if (!sds.busiest)
>> goto out_balanced;
>>
>> + env->src_grp_type = busiest->group_type;
>
> Maybe a short comment why you set it here in fbg(). It's only used later
> in fbq() for asym. CPU capacity sd but is set unconditionally.
>
Seeing as almost every other line there has an accompanying comment, I
think I'll do that.
> [...]
On 22/04/21 19:29, Dietmar Eggemann wrote:
> Just to make sure I can follow the conversation ...
>
> In case you:
>
> (1) return 1 instead of -1
>
> (2) keep the `env->idle != CPU_NOT_IDLE` check
>
> (3) and remove the `dst_grp_type == group_has_spare` check
>
> you are pretty much back to what you had in [PATCH v3 7/7] directly in
> task_hot() except:
>
> (4) the 'if (p fits on src_cpu && p !fits dst_cpu) => tsk_cache_hot) check?
Pretty much, I'll now filter the return value of task_hot() rather than
affect it directly.
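Something along these lines, perhaps - a hypothetical sketch folding in
points (1)-(3), not necessarily what will end up being posted:

	if (tsk_cache_hot == -1)
		tsk_cache_hot = task_hot(p, env);

	/*
	 * Only downgrade a positive task_hot() verdict: a capacity uplift
	 * is worth losing some cache warmth. Per (1),
	 * migrate_degrades_capacity() now returns only 0 or 1.
	 */
	if (tsk_cache_hot == 1 &&
	    env->idle != CPU_NOT_IDLE &&
	    !migrate_degrades_capacity(p, env))
		tsk_cache_hot = 0;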
On Wed, 21 Apr 2021 at 12:52, Valentin Schneider
<[email protected]> wrote:
>
> On 20/04/21 16:33, Vincent Guittot wrote:
> > On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
> > <[email protected]> wrote:
> >>
> >> On 16/04/21 15:51, Vincent Guittot wrote:
> >> > On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> >> >> +
> >> >> +/*
> >> >> + * What does migrating this task do to our capacity-aware scheduling criterion?
> >> >> + *
> >> >> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> >> >> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
> >> >> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> >> >> + */
> >> >> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> >> >> +{
> >> >> + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> >> >> + return -1;
> >> >> +
> >> >> + if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> >> >> + if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> >> >> + return 0;
> >> >> + else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> >> >> + return 1;
> >> >> + else
> >> >> + return -1;
> >> >> + }
> >> >
> >> > Being there means that task fits src_cpu capacity so why testing p against dst_cpu ?
> >> >
> >>
> >> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
> >> which it *doesn't* fit.
> >
> > OK. I was confused because I thought that this was only to force
> > migration in case of group_misfit_task but you tried to extend to
> > other cases... I'm not convinced that you succeeded to cover all cases
> >
> > Also I found this function which returns 3 values a bit disturbing.
> > IIUC you tried to align to migrate_degrades_capacity but you should
> > have better aligned to task_hot and return only 0 or 1. -1 is not used
> >
>
> Ack, will do.
>
> >> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >> >> if (tsk_cache_hot == -1)
> >> >> tsk_cache_hot = task_hot(p, env);
> >> >>
> >> >> + /*
> >> >> + * On a (sane) asymmetric CPU capacity system, the increase in compute
> >> >> + * capacity should offset any potential performance hit caused by a
> >> >> + * migration.
> >> >> + */
> >> >> + if ((env->dst_grp_type == group_has_spare) &&
> >> >
> >> > Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as
> >> > stated in $subject
> >> >
> >>
> >> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
> >> could give us a better picture. Staring at this some more, this isn't so
> >> true when the group size goes up - there's no guarantees the dst_cpu is the
> >> one that has spare cycles, and the other CPUs might not be able to grant
> >> the capacity uplift dst_cpu can.
> >
> > yeah you have to keep checking for env->idle != CPU_NOT_IDLE
> >
> >>
> >> As for not using src_grp_type == group_misfit_task, this is pretty much the
> >> same as [1]. CPU-bound (misfit) task + some other task on the same rq
> >> implies group_overloaded classification when balancing at MC level (no SMT,
> >> so one group per CPU).
> >
> > Is it something that happens often or just a sporadic/transient state
> > ? I mean does it really worth the extra complexity and do you see
> > performance improvement ?
> >
>
> "Unfortunately" yes, this is a relatively common scenario when running "1
> big task per CPU" types of workloads. The expected behaviour for big.LITTLE
> systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
> becomes free, usually via newidle balance (which, since they process work
> faster than the LITTLEs, is bound to happen), and an extra task being
> enqueued at "the wrong time" can prevent this from happening.
>
> This usually means a misfit task can take a few dozen extra ms than it
A few dozen ms is quite long. With a big core idle, it should retry
every 8ms on a quad x quad system, and I suspect the next try will be
during the next tick. It would be good to understand why it has to wait
so long.
> should to be migrated - in the tests I run (which are pretty much this 1
> hog per CPU workload) this happens about ~20% of the time.
>
> > You should better focus on fixing the simple case of group_misfit_task
> > task. This other cases looks far more complex with lot of corner cases
> >
> >>
> >> [1]: http://lore.kernel.org/r/[email protected]
Hi Vincent, apologies for the belated reply
On 30/04/21 08:58, Vincent Guittot wrote:
> On Wed, 21 Apr 2021 at 12:52, Valentin Schneider
> <[email protected]> wrote:
>> On 20/04/21 16:33, Vincent Guittot wrote:
>> > Is it something that happens often or just a sporadic/transient state
>> > ? I mean does it really worth the extra complexity and do you see
>> > performance improvement ?
>> >
>>
>> "Unfortunately" yes, this is a relatively common scenario when running "1
>> big task per CPU" types of workloads. The expected behaviour for big.LITTLE
>> systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
>> becomes free, usually via newidle balance (which, since they process work
>> faster than the LITTLEs, is bound to happen), and an extra task being
>> enqueued at "the wrong time" can prevent this from happening.
>>
>> This usually means a misfit task can take a few dozen extra ms than it
>
> A few dozens is quite long. With a big core being idle, it should try
> every 8ms on a quad x quad system and I suspect the next try will be
> during the next tick. Would be good to understand why it has to wait
> so much
>
True, IIRC this was mostly due to a compound effect of the different issues
I've described in that thread (and the previous one). Now that
9bcb959d05ee ("sched/fair: Ignore percpu threads for imbalance pulls")
is in, I'll re-run some tests against upstream and see how we fare.