Hi folks,
I split up the extra misfit patches from v3 as I'm still playing around with
those following Vincent's comments. In the meantime, I believe the first few
patches of the series can still be considered standalone.
o Patch 1 prevents pcpu kworkers from causing group_imbalanced
o Patch 2 is an independent active balance cleanup
o Patch 3 introduces yet another margin for capacity-to-capacity
  comparisons
The "important" one is patch 3, as it solves misfit migration issues on newer
platforms.
This is based on top of today's tip/sched/core at:
0a2b65c03e9b ("sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()")
Testing
=======
I ran my usual [1] misfit tests on
o TC2
o Juno
o HiKey960
o Dragonboard845C
o RB5
RB5 has a similar topology to the Pixel 4 and highlights the problem of having
two different CPU capacity values above 819 (in this case 871 and 1024):
without these patches, CPU hogs (i.e. misfit tasks) running on the "medium"
CPUs will never be upmigrated to a "big" CPU via misfit balance.
The 0day bot reported [3] the first patch causes a ~14% regression on its
stress-ng.vm-segv testcase. I ran that testcase on:
o Ampere eMAG (arm64, 32 cores)
o 2-socket Xeon E5-2690 (x86, 40 cores)
and found at worst a -0.3% regression and at best a 2% improvement - I'm
getting nowhere near -14%.
Revisions
=========
v3 -> v4
--------
o Tore out the extra misfit patches
o Rewrote patch 1 changelog (Dietmar)
o Reused LBF_ACTIVE_BALANCE to ditch LBF_DST_PINNED active balance logic
(Dietmar)
o Collected Tested-by (Lingutla)
o Squashed capacity_greater() introduction and use (Vincent)
o Removed sched_asym_cpucapacity() static key proliferation (Vincent)
v2 -> v3
--------
o Rebased on top of latest tip/sched/core
o Added test results vs stress-ng.vm-segv
v1 -> v2
--------
o Collected Reviewed-by
o Minor comment and code cleanups
o Consolidated static key vs SD flag explanation (Dietmar)
Note to Vincent: I didn't measure the impact of adding said static key to
load_balance(); I do however believe it is low-hanging fruit. The
wrapper keeps things neat and tidy, and is also helpful for documenting
the intricacies of the static key status vs the presence of the SD flag
in a CPU's sched_domain hierarchy.
o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
I had convinced myself it was.
o Squashed capacity margin usage with removal of
group_smaller_{min, max}_capacity() (Vincent)
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog
Links
=====
[1]: https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.misfit.StaggeredFinishes
[2]: http://lore.kernel.org/r/[email protected]
[3]: http://lore.kernel.org/r/20210223023004.GB25487@xsang-OptiPlex-9020
Cheers,
Valentin
Lingutla Chandrasekhar (1):
sched/fair: Ignore percpu threads for imbalance pulls
Valentin Schneider (2):
sched/fair: Clean up active balance nr_balance_failed trickery
sched/fair: Introduce a CPU capacity comparison helper
kernel/sched/fair.c | 68 +++++++++++++++++++--------------------------
1 file changed, 29 insertions(+), 39 deletions(-)
--
2.25.1
From: Lingutla Chandrasekhar <[email protected]>
During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.
In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).
It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.
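To illustrate the distinction, here is a throwaway userspace sketch (not
kernel code; PF_KTHREAD's value is copied from include/linux/sched.h, a
plain boolean stands in for kthread_is_per_cpu(), and skip_by_affinity()
models the nr_cpus_allowed-based check used by an earlier version of this
patch):

#include <stdbool.h>
#include <stdio.h>

#define PF_KTHREAD 0x00200000

struct task {
        unsigned int flags;
        int nr_cpus_allowed;
        bool per_cpu;                   /* kthread_is_per_cpu(p) stand-in */
};

/* Affinity-only heuristic: also catches "accidental" per-CPU user tasks */
static bool skip_by_affinity(const struct task *p)
{
        return p->nr_cpus_allowed == 1;
}

/* This patch: only genuinely per-CPU kthreads are disregarded */
static bool skip_pcpu_kthread(const struct task *p)
{
        return (p->flags & PF_KTHREAD) && p->per_cpu;
}

int main(void)
{
        struct task kworker = { PF_KTHREAD, 1, true };  /* pinned by design */
        struct task user    = { 0,          1, false }; /* pinned by accident */

        printf("kworker: affinity=%d kthread=%d\n",
               skip_by_affinity(&kworker), skip_pcpu_kthread(&kworker));
        printf("user:    affinity=%d kthread=%d\n",
               skip_by_affinity(&user), skip_pcpu_kthread(&user));
        return 0;
}

Only the kworker is skipped by the kthread-based check; the accidentally
pinned user task still gets LBF_SOME_PINNED treatment as before.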
Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:
DIE [ ]
MC [ ][ ]
0 1 2 3
L L B B
arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.
Signed-off-by: Lingutla Chandrasekhar <[email protected]>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..04d5e14fa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
+ /* Disregard pcpu kthreads; they are where they need to be. */
+ if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+ return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
--
2.25.1
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().
This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.
Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Clean up the LBF_DST_PINNED active balance special case.
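As a reminder of what that write achieved, here is a standalone sketch (not
kernel code): can_migrate_task() stops caring about task_hot() once
sd->nr_balance_failed exceeds sd->cache_nice_tries, which is exactly the
condition the removed write used to force; with this patch the intent
travels in the lb_env flags instead:

#include <stdbool.h>
#include <stdio.h>

#define LBF_ACTIVE_LB 0x10

struct sd_model {
        unsigned int cache_nice_tries;
        unsigned int nr_balance_failed;
};

/* Old scheme: the asynchronous write made this condition true */
static bool force_migration_old(const struct sd_model *sd)
{
        return sd->nr_balance_failed > sd->cache_nice_tries;
}

/* New scheme: the intent is carried by the lb_env flags */
static bool force_migration_new(unsigned int lb_flags)
{
        return lb_flags & LBF_ACTIVE_LB;
}

int main(void)
{
        struct sd_model sd = { .cache_nice_tries = 1, .nr_balance_failed = 0 };

        printf("before write: %d\n", force_migration_old(&sd));           /* 0 */
        sd.nr_balance_failed = sd.cache_nice_tries + 1;    /* the removed write */
        printf("after write:  %d\n", force_migration_old(&sd));           /* 1 */
        printf("with flag:    %d\n", force_migration_new(LBF_ACTIVE_LB)); /* 1 */
        return 0;
}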
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04d5e14fa261..d8077f82a380 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7422,6 +7422,7 @@ enum migration_type {
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
#define LBF_SOME_PINNED 0x08
+#define LBF_ACTIVE_LB 0x10
struct lb_env {
struct sched_domain *sd;
@@ -7583,10 +7584,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* our sched_group. We may want to revisit it if we couldn't
* meet load balance goals by pulling other tasks on src_cpu.
*
- * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
- * already computed one in current iteration.
+ * Avoid computing new_dst_cpu
+ * - for NEWLY_IDLE
+ * - if we have already computed one in current iteration
+ * - if it's an active balance
*/
- if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE ||
+ env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
return 0;
/* Prevent to re-select dst_cpu via env's CPUs: */
@@ -7611,10 +7615,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* Aggressive migration if:
- * 1) destination numa is preferred
- * 2) task is cache cold, or
- * 3) too many balance attempts have failed.
+ * 1) active balance
+ * 2) destination numa is preferred
+ * 3) task is cache cold, or
+ * 4) too many balance attempts have failed.
*/
+ if (env->flags & LBF_ACTIVE_LB)
+ return 1;
+
tsk_cache_hot = migrate_degrades_locality(p, env);
if (tsk_cache_hot == -1)
tsk_cache_hot = task_hot(p, env);
@@ -9805,9 +9813,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
}
-
- /* We've kicked active balancing, force task migration. */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
}
} else {
sd->nr_balance_failed = 0;
@@ -9957,13 +9962,7 @@ static int active_load_balance_cpu_stop(void *data)
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,
- /*
- * can_migrate_task() doesn't need to compute new_dst_cpu
- * for active balancing. Since we have CPU_IDLE, but no
- * @dst_grpmask we need to make that test go away with lying
- * about DST_PINNED.
- */
- .flags = LBF_DST_PINNED,
+ .flags = LBF_ACTIVE_LB,
};
schedstat_inc(sd->alb_count);
--
2.25.1
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass
group_smaller_max_cpu_capacity(<candidate group>, <local group>);
which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.
Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.
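To make the numbers concrete, here is a throwaway userspace program using the
two macros as defined in kernel/sched/fair.c and in the hunk below, fed with
the Pixel 4 values quoted above plus the boundary cases:

#include <stdio.h>

#define fits_capacity(cap, max)         ((cap) * 1280 < (max) * 1024)
#define capacity_greater(cap1, cap2)    ((cap1) * 1024 > (cap2) * 1078)

int main(void)
{
        /* ~20% margin: 819 is the largest value that still "fits" in 1024 */
        printf("fits_capacity(819, 1024):    %d\n", fits_capacity(819UL, 1024UL)); /* 1 */
        printf("fits_capacity(820, 1024):    %d\n", fits_capacity(820UL, 1024UL)); /* 0 */
        /* Pixel 4 mediums vs bigs: the old group filter can never pass */
        printf("fits_capacity(871, 1024):    %d\n", fits_capacity(871UL, 1024UL)); /* 0 */
        /* ~5% margin: the new helper does see the bigs as noticeably bigger */
        printf("capacity_greater(1024, 871): %d\n", capacity_greater(1024UL, 871UL)); /* 1 */
        printf("capacity_greater(1024, 972): %d\n", capacity_greater(1024UL, 972UL)); /* 1 */
        printf("capacity_greater(1024, 973): %d\n", capacity_greater(1024UL, 973UL)); /* 0 */
        return 0;
}

IOW, with the ~5% margin only capacities above 972 are treated as equivalent
to 1024, rather than everything above 819.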
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
Reviewed-by: Qais Yousef <[email protected]>
Tested-by: Lingutla Chandrasekhar <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 33 ++++++++++-----------------------
1 file changed, 10 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8077f82a380..c9c5c2697998 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap1' noticeably greater than 'cap2'?
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
#endif
#ifdef CONFIG_CFS_BANDWIDTH
@@ -8364,26 +8371,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
return false;
}
-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
-}
-
static inline enum
group_type group_classify(unsigned int imbalance_pct,
struct sched_group *group,
@@ -8539,7 +8526,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* internally or be covered by avg_load imbalance (eventually).
*/
if (sgs->group_type == group_misfit_task &&
- (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -8623,7 +8610,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type <= group_fully_busy) &&
- (group_smaller_min_cpu_capacity(sds->local, sg)))
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
return false;
return true;
@@ -9423,7 +9410,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* average load.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- capacity_of(env->dst_cpu) < capacity &&
+ !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;
--
2.25.1
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<[email protected]> wrote:
>
> From: Lingutla Chandrasekhar <[email protected]>
>
> During load balance, LBF_SOME_PINNED will bet set if any candidate task
> cannot be detached due to CPU affinity constraints. This can result in
> setting env->sd->parent->sgc->group_imbalance, which can lead to a group
> being classified as group_imbalanced (rather than any of the other, lower
> group_type) when balancing at a higher level.
>
> In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
> set due to per-CPU kthreads being the only other runnable tasks on any
> given rq. This results in changing the group classification during
> load-balance at higher levels when in reality there is nothing that can be
> done for this affinity constraint: per-CPU kthreads, as the name implies,
> don't get to move around (modulo hotplug shenanigans).
>
> It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
> with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
> an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
> we can leverage here to not set LBF_SOME_PINNED.
>
> Note that the aforementioned classification to group_imbalance (when
> nothing can be done) is especially problematic on big.LITTLE systems, which
> have a topology the likes of:
>
> DIE [ ]
> MC [ ][ ]
> 0 1 2 3
> L L B B
>
> arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
>
> Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
> level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
> the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
> CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
> can significantly delay the upgmigration of said misfit tasks. Systems
> relying on ASYM_PACKING are likely to face similar issues.
>
> Signed-off-by: Lingutla Chandrasekhar <[email protected]>
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> [Reword changelog]
> Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d73bdbb2d40..04d5e14fa261 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
> + /* Disregard pcpu kthreads; they are where they need to be. */
> + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> + return 0;
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>
> --
> 2.25.1
>
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<[email protected]> wrote:
>
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
>
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
>
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
>
> Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04d5e14fa261..d8077f82a380 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7422,6 +7422,7 @@ enum migration_type {
> #define LBF_NEED_BREAK 0x02
> #define LBF_DST_PINNED 0x04
> #define LBF_SOME_PINNED 0x08
> +#define LBF_ACTIVE_LB 0x10
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -7583,10 +7584,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * our sched_group. We may want to revisit it if we couldn't
> * meet load balance goals by pulling other tasks on src_cpu.
> *
> - * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
> - * already computed one in current iteration.
> + * Avoid computing new_dst_cpu
> + * - for NEWLY_IDLE
> + * - if we have already computed one in current iteration
> + * - if it's an active balance
> */
> - if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
> + if (env->idle == CPU_NEWLY_IDLE ||
> + env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
> return 0;
>
> /* Prevent to re-select dst_cpu via env's CPUs: */
> @@ -7611,10 +7615,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>
> /*
> * Aggressive migration if:
> - * 1) destination numa is preferred
> - * 2) task is cache cold, or
> - * 3) too many balance attempts have failed.
> + * 1) active balance
> + * 2) destination numa is preferred
> + * 3) task is cache cold, or
> + * 4) too many balance attempts have failed.
> */
> + if (env->flags & LBF_ACTIVE_LB)
> + return 1;
> +
> tsk_cache_hot = migrate_degrades_locality(p, env);
> if (tsk_cache_hot == -1)
> tsk_cache_hot = task_hot(p, env);
> @@ -9805,9 +9813,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> active_load_balance_cpu_stop, busiest,
> &busiest->active_balance_work);
> }
> -
> - /* We've kicked active balancing, force task migration. */
> - sd->nr_balance_failed = sd->cache_nice_tries+1;
> }
> } else {
> sd->nr_balance_failed = 0;
> @@ -9957,13 +9962,7 @@ static int active_load_balance_cpu_stop(void *data)
> .src_cpu = busiest_rq->cpu,
> .src_rq = busiest_rq,
> .idle = CPU_IDLE,
> - /*
> - * can_migrate_task() doesn't need to compute new_dst_cpu
> - * for active balancing. Since we have CPU_IDLE, but no
> - * @dst_grpmask we need to make that test go away with lying
> - * about DST_PINNED.
> - */
> - .flags = LBF_DST_PINNED,
> + .flags = LBF_ACTIVE_LB,
> };
>
> schedstat_inc(sd->alb_count);
> --
> 2.25.1
>
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<[email protected]> wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
> group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
>
> Reviewed-by: Qais Yousef <[email protected]>
> Tested-by: Lingutla Chandrasekhar <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++++++++-----------------------
> 1 file changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d8077f82a380..c9c5c2697998 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
> */
> #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
> #endif
>
> #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8364,26 +8371,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> return false;
> }
>
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> - return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> - return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> -}
> -
> static inline enum
> group_type group_classify(unsigned int imbalance_pct,
> struct sched_group *group,
> @@ -8539,7 +8526,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * internally or be covered by avg_load imbalance (eventually).
> */
> if (sgs->group_type == group_misfit_task &&
> - (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> + (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> @@ -8623,7 +8610,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> */
> if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> (sgs->group_type <= group_fully_busy) &&
> - (group_smaller_min_cpu_capacity(sds->local, sg)))
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> return false;
>
> return true;
> @@ -9423,7 +9410,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * average load.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - capacity_of(env->dst_cpu) < capacity &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> nr_running == 1)
> continue;
>
> --
> 2.25.1
>
On 01/04/2021 21:30, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar <[email protected]>
>
> During load balance, LBF_SOME_PINNED will bet set if any candidate task
nitpick; s/bet/be ?
[...]
Reviewed-by: Dietmar Eggemann <[email protected]>
On 01/04/2021 21:30, Valentin Schneider wrote:
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
>
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
>
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
>
> Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
[...]
On 06/04/21 17:37, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> While at it, replace group_smaller_{min, max}_cpu_capacity() with
>> comparisons of the source group's min/max capacity and the destination
>> CPU's capacity.
>
> IMHO, you haven't mentioned why you replace local sched group with dst
> CPU. I can see that only the capacity of the dst CPU really makes sense
> here. Might be worth mentioning in the patch header why. There is some
> of it in v3 6/7 but that's a different change.
>
Right, some of it got lost in the shuffling. This was a separate change in
v1 (4/8). I can reuse that changelog into:
"""
Also note that comparing capacity extrema of local and source sched_group's
doesn't make much sense when at the end of the day the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
"""
> Reviewed-by: Dietmar Eggemann <[email protected]>
>
> [...]
On 01/04/2021 21:30, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
> group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
IMHO, you haven't mentioned why you replace local sched group with dst
CPU. I can see that only the capacity of the dst CPU really makes sense
here. Might be worth mentioning in the patch header why. There is some
of it in v3 6/7 but that's a different change.
Reviewed-by: Dietmar Eggemann <[email protected]>
[...]
On 06/04/21 17:35, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> From: Lingutla Chandrasekhar <[email protected]>
>>
>> During load balance, LBF_SOME_PINNED will bet set if any candidate task
>
> nitpick; s/bet/be ?
>
Yes indeed...
> [...]
>
> Reviewed-by: Dietmar Eggemann <[email protected]>