Hi folks,
Here is this year's series of misfit changes. On the menu:
o Patch 1 prevents pcpu kworkers from causing group_imbalanced
o Patch 2 is an independent active balance cleanup
o Patch 3 adds some more sched_asym_cpucapacity static branches
o Patch 4 introduces yet another margin for capacity-to-capacity
comparisons
o Patches 5-6 build on top of patch 4 and change capacity comparisons
throughout misfit load balancing
o Patch 7 aligns running and non-running misfit task cache hotness
considerations
IMO the somewhat controversial bit is patch 4, because it attempts to solve
margin issues by... Adding another margin. This does solve issues on
existing platforms (e.g. Pixel4), but we'll be back to square one the day
some "clever" folks spin a platform with two different CPU capacities less
than 5% apart.
This is based on top of today's tip/sched/core at:
13c2235b2b28 ("sched: Remove unnecessary variable from schedule_tail()")
Testing
=======
I ran my usual [1] misfit tests on
o TC2
o Juno
o HiKey960
o Dragonboard845C
o RB5
RB5 has a similar topology to Pixel4 and highlights the problem of having
two different CPU capacity values above 819 (in this case 871 and 1024):
without these patches, CPU hogs (i.e. misfit tasks) running on the "medium"
CPUs will never be upmigrated to a "big" via misfit balance.
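For reference, that 819 cutoff falls straight out of the current margin, since
the misfit path ends up feeding two capacities into
fits_capacity(cap, max) := (cap) * 1280 < (max) * 1024:
with max = 1024 this requires cap * 1280 < 1048576, i.e. cap <= 819, so the
871-capacity "mediums" are never considered meaningfully smaller than the
"bigs". Patch 4 works through the numbers.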
The 0day bot reported [3] that the first patch causes a ~14% regression on its
stress-ng.vm-segv testcase. I ran that testcase on:
o Ampere eMAG (arm64, 32 cores)
o 2-socket Xeon E5-2690 (x86, 40 cores)
and found at worst a -0.3% regression and at best a 2% improvement - I'm
getting nowhere near -14%.
Revisions
=========
v2 -> v3
--------
o Rebased on top of latest tip/sched/core
o Added test results vs stress-ng.vm-segv
v1 -> v2
--------
o Collected Reviewed-by
o Minor comment and code cleanups
o Consolidated static key vs SD flag explanation (Dietmar)
Note to Vincent: I didn't measure the impact of adding said static key to
load_balance(); I do, however, believe it is low-hanging fruit. The
wrapper keeps things neat and tidy, and is also helpful for documenting
the intricacies of the static key status vs the presence of the SD flag
in a CPU's sched_domain hierarchy.
o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
I had convinced myself it was.
o Squashed capacity margin usage with removal of
group_smaller_{min,max}_cpu_capacity() (Vincent)
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog
Links
=====
[1]: https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.misfit.StaggeredFinishes
[2]: http://lore.kernel.org/r/[email protected]
[3]: http://lore.kernel.org/r/20210223023004.GB25487@xsang-OptiPlex-9020
Cheers,
Valentin
Lingutla Chandrasekhar (1):
sched/fair: Ignore percpu threads for imbalance pulls
Valentin Schneider (6):
sched/fair: Clean up active balance nr_balance_failed trickery
sched/fair: Add more sched_asym_cpucapacity static branch checks
sched/fair: Introduce a CPU capacity comparison helper
sched/fair: Employ capacity_greater() throughout load_balance()
sched/fair: Filter out locally-unsolvable misfit imbalances
sched/fair: Relax task_hot() for misfit tasks
kernel/sched/fair.c | 128 ++++++++++++++++++++++++-------------------
kernel/sched/sched.h | 33 +++++++++++
2 files changed, 105 insertions(+), 56 deletions(-)
--
2.25.1
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().
This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.
Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83aea97fbf22..f50a902bdf24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7420,6 +7420,7 @@ enum migration_type {
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
#define LBF_SOME_PINNED 0x08
+#define LBF_ACTIVE_LB 0x10
struct lb_env {
struct sched_domain *sd;
@@ -7609,10 +7610,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* Aggressive migration if:
- * 1) destination numa is preferred
- * 2) task is cache cold, or
- * 3) too many balance attempts have failed.
+ * 1) active balance
+ * 2) destination numa is preferred
+ * 3) task is cache cold, or
+ * 4) too many balance attempts have failed.
*/
+ if (env->flags & LBF_ACTIVE_LB)
+ return 1;
+
tsk_cache_hot = migrate_degrades_locality(p, env);
if (tsk_cache_hot == -1)
tsk_cache_hot = task_hot(p, env);
@@ -9794,9 +9799,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
}
-
- /* We've kicked active balancing, force task migration. */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
}
} else {
sd->nr_balance_failed = 0;
@@ -9952,7 +9954,8 @@ static int active_load_balance_cpu_stop(void *data)
* @dst_grpmask we need to make that test go away with lying
* about DST_PINNED.
*/
- .flags = LBF_DST_PINNED,
+ .flags = LBF_DST_PINNED |
+ LBF_ACTIVE_LB,
};
schedstat_inc(sd->alb_count);
--
2.25.1
Rik noted a while back that a handful of
sd->flags & SD_ASYM_CPUCAPACITY
& family in the CFS load-balancer code aren't guarded by the
sched_asym_cpucapacity static branch.
Turning those checks into NOPs for those who don't need it is fairly
straightforward, and hiding it in a helper doesn't change code size in all
but one spot. It also gives us a place to document the differences between
checking the static key and checking the SD flag.
Suggested-by: Rik van Riel <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++-------------
kernel/sched/sched.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f50a902bdf24..db892f6e222f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6300,15 +6300,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* sd_asym_cpucapacity rather than sd_llc.
*/
if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ /* See sd_has_asym_cpucapacity() */
sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
- /*
- * On an asymmetric CPU capacity system where an exclusive
- * cpuset defines a symmetric island (i.e. one unique
- * capacity_orig value through the cpuset), the key will be set
- * but the CPUs within that cpuset will not have a domain with
- * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
- * capacity path.
- */
if (sd) {
i = select_idle_capacity(p, sd, target);
return ((unsigned)i < nr_cpumask_bits) ? i : target;
@@ -8467,7 +8460,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
/* Check for a misfit task on the cpu */
- if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ if (sd_has_asym_cpucapacity(env->sd) &&
sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
*sg_status |= SG_OVERLOAD;
@@ -8524,7 +8517,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
*/
- if (sgs->group_type == group_misfit_task &&
+ if (static_branch_unlikely(&sched_asym_cpucapacity) &&
+ sgs->group_type == group_misfit_task &&
(!group_smaller_max_cpu_capacity(sg, sds->local) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -8607,7 +8601,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* throughput. Maximize throughput, power/energy consequences are not
* considered.
*/
- if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ if (sd_has_asym_cpucapacity(env->sd) &&
(sgs->group_type <= group_fully_busy) &&
(group_smaller_min_cpu_capacity(sds->local, sg)))
return false;
@@ -8730,7 +8724,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
}
/* Check if task fits in the group */
- if (sd->flags & SD_ASYM_CPUCAPACITY &&
+ if (sd_has_asym_cpucapacity(sd) &&
!task_fits_capacity(p, group->sgc->max_capacity)) {
sgs->group_misfit_task_load = 1;
}
@@ -9408,7 +9402,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* Higher per-CPU capacity is considered better than balancing
* average load.
*/
- if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ if (sd_has_asym_cpucapacity(env->sd) &&
capacity_of(env->dst_cpu) < capacity &&
nr_running == 1)
continue;
@@ -10225,6 +10219,7 @@ static void nohz_balancer_kick(struct rq *rq)
}
}
+ /* See sd_has_asym_cpucapacity(). */
sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
if (sd) {
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2e09a647c4f..27bf70bc86c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1492,6 +1492,39 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
extern struct static_key_false sched_asym_cpucapacity;
+/*
+ * Note that the static key is system-wide, but the visibility of
+ * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
+ * imply all CPUs can see asymmetry.
+ *
+ * Consider an asymmetric CPU capacity system such as:
+ *
+ * MC [ ]
+ * 0 1 2 3 4 5
+ * L L L L B B
+ *
+ * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
+ *
+ * By default, booting this system will enable the sched_asym_cpucapacity
+ * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
+ * sched_domain.
+ *
+ * Further consider exclusive cpusets creating a "symmetric island":
+ *
+ * MC [ ][ ]
+ * 0 1 2 3 4 5
+ * L L L L B B
+ *
+ * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
+ * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intended
+ * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
+ */
+static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
+{
+ return static_branch_unlikely(&sched_asym_cpucapacity) &&
+ sd->flags & SD_ASYM_CPUCAPACITY;
+}
+
struct sched_group_capacity {
atomic_t ref;
/*
--
2.25.1
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass
group_smaller_max_cpu_capacity(<candidate group>, <local group>);
which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.
Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.
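For illustration, with the Pixel 4 "medium" and "big" capacities above:
fits_capacity(871, 1024) is false (871 * 1280 = 1114880 >= 1024 * 1024 = 1048576),
whereas capacity_greater(1024, 871) is true (1024 * 1024 = 1048576 > 871 * 1078 = 938938),
i.e. with the ~5% margin the big CPUs are again seen as a meaningful uplift
over the mediums.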
Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db892f6e222f..ddb2ab3edf6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap1' noticeably greater than 'cap2'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
#endif
#ifdef CONFIG_CFS_BANDWIDTH
--
2.25.1
Consider the following topology:
DIE [ ]
MC [ ][ ]
0 1 2 3
capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})
w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).
When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just before
this load_balance() happens and preempt the CPU hog running there, we would
have, for the [0-1] group at CPU2's DIE level:
o sgs->sum_nr_running > sgs->group_weight
o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct
IOW, this group is group_overloaded.
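To put illustrative numbers on this (capacity_orig = 446 for CPUs 0-1 and
imbalance_pct = 117 are made up for the example): with the two 1024-util hogs
plus the kworker, sum_nr_running = 3 > group_weight = 2, and
group_capacity * 100 ~= 892 * 100 = 89200 < group_util * imbalance_pct ~= 2048 * 117 = 239616,
so group_classify() reports group_overloaded rather than group_misfit_task.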
Considering CPU0 is picked by find_busiest_queue(), we would then visit the
preempted CPU hog in detach_tasks(). However, given it has just been
preempted by this pcpu kworker, task_hot() will prevent it from being
detached. We then leave load_balance() without having done anything.
Long story short, preempted misfit tasks are affected by task_hot(), while
currently running misfit tasks are intentionally preempted by the stopper
task to migrate them over to a higher-capacity CPU.
Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.
Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41cdda7a8ea6..5454429cea7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7474,6 +7474,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
if (env->sd->flags & SD_SHARE_CPUCAPACITY)
return 0;
+ /*
+ * On a (sane) asymmetric CPU capacity system, the increase in compute
+ * capacity should offset any potential performance hit caused by a
+ * migration.
+ */
+ if (sd_has_asym_cpucapacity(env->sd) &&
+ env->idle != CPU_NOT_IDLE &&
+ !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
+ cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+ return 0;
+
/*
* Buddy candidates are cache hot:
*/
--
2.25.1
Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):
DIE [ ]
MC [ ][ ]
0 1 2 3
| CPU | capacity_orig | capacity |
|-----+---------------+----------|
| 0 | 870 | 860 |
| 1 | 870 | 600 |
| 2 | 1024 | 850 |
| 3 | 1024 | 860 |
If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
sufficiently busy, i.e. they don't have enough spare capacity to accommodate
CPU1's misfit task. It would then fall to CPU2 to pull the task.
This currently won't happen, because CPU2 will fail
capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0's, which is at this
point in time greater than that of CPU2's. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).
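Plugging in the capacities above: capacity_greater(850, 860) evaluates to
850 * 1024 = 870400 > 860 * 1078 = 927080, which is false, so
update_sd_pick_busiest() skips the [0, 1] group even though CPU2 could help
CPU1's misfit task.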
Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift. Separately track whether a
sched_group contains a misfit task to still classify it as
group_misfit_task and not pick it as busiest group when pulling from a
lower-capacity CPU (which is the current behaviour and prevents
down-migration).
Since find_busiest_queue() can now iterate over CPUs with a higher capacity
than the local CPU's, add a capacity check there.
Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e8a242cd1f7..41cdda7a8ea6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}
+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+ return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
static void record_wakee(struct task_struct *p)
{
/*
@@ -8091,7 +8097,8 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned long group_misfit_task_load; /* Task load that can be uplifted */
+ int group_has_misfit_task; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
if (sgs->group_asym_packing)
return group_asym_packing;
- if (sgs->group_misfit_task_load)
+ if (sgs->group_has_misfit_task)
return group_misfit_task;
if (!group_has_capacity(imbalance_pct, sgs))
@@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
/* Check for a misfit task on the cpu */
- if (sd_has_asym_cpucapacity(env->sd) &&
- sgs->group_misfit_task_load < rq->misfit_task_load) {
- sgs->group_misfit_task_load = rq->misfit_task_load;
- *sg_status |= SG_OVERLOAD;
+ if (!sd_has_asym_cpucapacity(env->sd) ||
+ !rq->misfit_task_load)
+ continue;
+
+ *sg_status |= SG_OVERLOAD;
+ sgs->group_has_misfit_task = true;
+
+ /*
+ * Don't attempt to maximize load for misfit tasks that can't be
+ * granted a CPU capacity uplift.
+ */
+ if (cpu_capacity_greater(env->dst_cpu, i)) {
+ sgs->group_misfit_task_load = max(
+ sgs->group_misfit_task_load,
+ rq->misfit_task_load);
}
}
@@ -8501,7 +8519,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Don't try to pull misfit tasks we can't help */
if (static_branch_unlikely(&sched_asym_cpucapacity) &&
sgs->group_type == group_misfit_task &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ (!sgs->group_misfit_task_load ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -9448,15 +9466,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
case migrate_misfit:
/*
* For ASYM_CPUCAPACITY domains with misfit tasks we
- * simply seek the "biggest" misfit task.
+ * simply seek the "biggest" misfit task we can
+ * accommodate.
*/
+ if (!cpu_capacity_greater(env->dst_cpu, i))
+ continue;
+
if (rq->misfit_task_load > busiest_load) {
busiest_load = rq->misfit_task_load;
busiest = rq;
}
break;
-
}
}
--
2.25.1
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 33 ++++-----------------------------
1 file changed, 4 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddb2ab3edf6d..1e8a242cd1f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8350,26 +8350,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
return false;
}
-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
-}
-
static inline enum
group_type group_classify(unsigned int imbalance_pct,
struct sched_group *group,
@@ -8518,15 +8498,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (!sgs->sum_h_nr_running)
return false;
- /*
- * Don't try to pull misfit tasks we can't help.
- * We can use max_capacity here as reduction in capacity on some
- * CPUs in the group should either be possible to resolve
- * internally or be covered by avg_load imbalance (eventually).
- */
+ /* Don't try to pull misfit tasks we can't help */
if (static_branch_unlikely(&sched_asym_cpucapacity) &&
sgs->group_type == group_misfit_task &&
- (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -8610,7 +8585,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
if (sd_has_asym_cpucapacity(env->sd) &&
(sgs->group_type <= group_fully_busy) &&
- (group_smaller_min_cpu_capacity(sds->local, sg)))
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
return false;
return true;
@@ -9410,7 +9385,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* average load.
*/
if (sd_has_asym_cpucapacity(env->sd) &&
- capacity_of(env->dst_cpu) < capacity &&
+ !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;
--
2.25.1
On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
<[email protected]> wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
> group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Reviewed-by: Qais Yousef <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db892f6e222f..ddb2ab3edf6d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
> */
> #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
defined but not used.
It should be merged with the next patch, which starts to use it.
> #endif
>
> #ifdef CONFIG_CFS_BANDWIDTH
> --
> 2.25.1
>
On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
<[email protected]> wrote:
>
> Consider the following (hypothetical) asymmetric CPU capacity topology,
> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>
> DIE [ ]
> MC [ ][ ]
> 0 1 2 3
>
> | CPU | capacity_orig | capacity |
> |-----+---------------+----------|
> | 0 | 870 | 860 |
> | 1 | 870 | 600 |
> | 2 | 1024 | 850 |
> | 3 | 1024 | 860 |
>
> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>
> This currently won't happen, because CPU2 will fail
>
> capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
which has been introduced by the previous patch (patch 5)
>
> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> level. In this case, the max_capacity is that of CPU0's, which is at this
> point in time greater than that of CPU2's. This comparison doesn't make
> much sense, given that the only CPUs we should care about in this scenario
> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> destination CPU).
>
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> sched_group contains a misfit task to still classify it as
> group_misfit_task and not pick it as busiest group when pulling from a
Could you give more details about why we should keep tracking the
group as misfit? Do you have a UC in mind?
> lower-capacity CPU (which is the current behaviour and prevents
> down-migration).
>
> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> than the local CPU's, add a capacity check there.
>
> Reviewed-by: Qais Yousef <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++---------
> 1 file changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e8a242cd1f7..41cdda7a8ea6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity;
> }
>
> +/* Is CPU a's capacity noticeably greater than CPU b's? */
> +static inline bool cpu_capacity_greater(int a, int b)
> +{
> + return capacity_greater(capacity_of(a), capacity_of(b));
> +}
> +
> static void record_wakee(struct task_struct *p)
> {
> /*
> @@ -8091,7 +8097,8 @@ struct sg_lb_stats {
> unsigned int group_weight;
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> - unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> + unsigned long group_misfit_task_load; /* Task load that can be uplifted */
> + int group_has_misfit_task; /* A CPU has a task too big for its capacity */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> @@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
> if (sgs->group_asym_packing)
> return group_asym_packing;
>
> - if (sgs->group_misfit_task_load)
> + if (sgs->group_has_misfit_task)
> return group_misfit_task;
>
> if (!group_has_capacity(imbalance_pct, sgs))
> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
>
> /* Check for a misfit task on the cpu */
> - if (sd_has_asym_cpucapacity(env->sd) &&
> - sgs->group_misfit_task_load < rq->misfit_task_load) {
> - sgs->group_misfit_task_load = rq->misfit_task_load;
> - *sg_status |= SG_OVERLOAD;
> + if (!sd_has_asym_cpucapacity(env->sd) ||
> + !rq->misfit_task_load)
> + continue;
> +
> + *sg_status |= SG_OVERLOAD;
> + sgs->group_has_misfit_task = true;
> +
> + /*
> + * Don't attempt to maximize load for misfit tasks that can't be
> + * granted a CPU capacity uplift.
> + */
> + if (cpu_capacity_greater(env->dst_cpu, i)) {
> + sgs->group_misfit_task_load = max(
> + sgs->group_misfit_task_load,
> + rq->misfit_task_load);
Please encapsulate all this misfit-specific code in a dedicated
function which will be called from update_sg_lb_stats().
> }
> }
>
> @@ -8501,7 +8519,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> /* Don't try to pull misfit tasks we can't help */
> if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> sgs->group_type == group_misfit_task &&
> - (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> + (!sgs->group_misfit_task_load ||
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> @@ -9448,15 +9466,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> case migrate_misfit:
> /*
> * For ASYM_CPUCAPACITY domains with misfit tasks we
> - * simply seek the "biggest" misfit task.
> + * simply seek the "biggest" misfit task we can
> + * accommodate.
> */
> + if (!cpu_capacity_greater(env->dst_cpu, i))
> + continue;
> +
> if (rq->misfit_task_load > busiest_load) {
> busiest_load = rq->misfit_task_load;
> busiest = rq;
> }
>
> break;
> -
> }
> }
>
> --
> 2.25.1
>
On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
<[email protected]> wrote:
>
> Rik noted a while back that a handful of
>
> sd->flags & SD_ASYM_CPUCAPACITY
>
> & family in the CFS load-balancer code aren't guarded by the
> sched_asym_cpucapacity static branch.
Guarding asym capacity with a static branch in the fast path makes sense,
but I see no benefit in this slow path other than hiding and complexifying
the code. Also, if you start down this way, you then have to add a NOP in
all the other places where a flag or a group_type might be unused.
>
> Turning those checks into NOPs for those who don't need it is fairly
> straightforward, and hiding it in a helper doesn't change code size in all
> but one spot. It also gives us a place to document the differences between
> checking the static key and checking the SD flag.
>
> Suggested-by: Rik van Riel <[email protected]>
> Reviewed-by: Qais Yousef <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 21 ++++++++-------------
> kernel/sched/sched.h | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f50a902bdf24..db892f6e222f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6300,15 +6300,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> * sd_asym_cpucapacity rather than sd_llc.
> */
> if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> + /* See sd_has_asym_cpucapacity() */
> sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
> - /*
> - * On an asymmetric CPU capacity system where an exclusive
> - * cpuset defines a symmetric island (i.e. one unique
> - * capacity_orig value through the cpuset), the key will be set
> - * but the CPUs within that cpuset will not have a domain with
> - * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
> - * capacity path.
> - */
> if (sd) {
> i = select_idle_capacity(p, sd, target);
> return ((unsigned)i < nr_cpumask_bits) ? i : target;
> @@ -8467,7 +8460,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
>
> /* Check for a misfit task on the cpu */
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + if (sd_has_asym_cpucapacity(env->sd) &&
> sgs->group_misfit_task_load < rq->misfit_task_load) {
> sgs->group_misfit_task_load = rq->misfit_task_load;
> *sg_status |= SG_OVERLOAD;
> @@ -8524,7 +8517,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * CPUs in the group should either be possible to resolve
> * internally or be covered by avg_load imbalance (eventually).
> */
> - if (sgs->group_type == group_misfit_task &&
> + if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> + sgs->group_type == group_misfit_task &&
> (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> sds->local_stat.group_type != group_has_spare))
> return false;
> @@ -8607,7 +8601,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * throughput. Maximize throughput, power/energy consequences are not
> * considered.
> */
> - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> + if (sd_has_asym_cpucapacity(env->sd) &&
> (sgs->group_type <= group_fully_busy) &&
> (group_smaller_min_cpu_capacity(sds->local, sg)))
> return false;
> @@ -8730,7 +8724,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> }
>
> /* Check if task fits in the group */
> - if (sd->flags & SD_ASYM_CPUCAPACITY &&
> + if (sd_has_asym_cpucapacity(sd) &&
> !task_fits_capacity(p, group->sgc->max_capacity)) {
> sgs->group_misfit_task_load = 1;
> }
> @@ -9408,7 +9402,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * Higher per-CPU capacity is considered better than balancing
> * average load.
> */
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + if (sd_has_asym_cpucapacity(env->sd) &&
> capacity_of(env->dst_cpu) < capacity &&
> nr_running == 1)
> continue;
> @@ -10225,6 +10219,7 @@ static void nohz_balancer_kick(struct rq *rq)
> }
> }
>
> + /* See sd_has_asym_cpucapacity(). */
> sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
> if (sd) {
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d2e09a647c4f..27bf70bc86c7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1492,6 +1492,39 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> extern struct static_key_false sched_asym_cpucapacity;
>
> +/*
> + * Note that the static key is system-wide, but the visibility of
> + * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
> + * imply all CPUs can see asymmetry.
> + *
> + * Consider an asymmetric CPU capacity system such as:
> + *
> + * MC [ ]
> + * 0 1 2 3 4 5
> + * L L L L B B
> + *
> + * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
> + *
> + * By default, booting this system will enable the sched_asym_cpucapacity
> + * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
> + * sched_domain.
> + *
> + * Further consider exclusive cpusets creating a "symmetric island":
> + *
> + * MC [ ][ ]
> + * 0 1 2 3 4 5
> + * L L L L B B
> + *
> + * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
> + * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intended
> + * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
> + */
> +static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
> +{
> + return static_branch_unlikely(&sched_asym_cpucapacity) &&
> + sd->flags & SD_ASYM_CPUCAPACITY;
> +}
> +
> struct sched_group_capacity {
> atomic_t ref;
> /*
> --
> 2.25.1
>
On 15/03/21 16:13, Vincent Guittot wrote:
> On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> <[email protected]> wrote:
>>
>> Consider the following (hypothetical) asymmetric CPU capacity topology,
>> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>>
>> DIE [ ]
>> MC [ ][ ]
>> 0 1 2 3
>>
>> | CPU | capacity_orig | capacity |
>> |-----+---------------+----------|
>> | 0 | 870 | 860 |
>> | 1 | 870 | 600 |
>> | 2 | 1024 | 850 |
>> | 3 | 1024 | 860 |
>>
>> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
>> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
>> sufficiently busy, i.e. don't have enough spare capacity to accommodate
>> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>>
>> This currently won't happen, because CPU2 will fail
>>
>> capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
>
> which has been introduced by the previous patch: patch5
>
>>
>> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
>> level. In this case, the max_capacity is that of CPU0's, which is at this
>> point in time greater than that of CPU2's. This comparison doesn't make
>> much sense, given that the only CPUs we should care about in this scenario
>> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
>> destination CPU).
>>
>> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
>> env->dst_cpu would grant it a capacity uplift. Separately track whether a
>> sched_group contains a misfit task to still classify it as
>> group_misfit_task and not pick it as busiest group when pulling from a
>
> Could you give more details about why we should keep tracking the
> group as misfit ? Do you have a UC in mind ?
>
As stated, the current behaviour is to classify groups as group_misfit_task
regardless of the dst_cpu's capacity. When we see a group_misfit_task
candidate group with higher per-CPU capacity than the local group, we don't
pick it as busiest.
I initially thought not marking those as group_misfit_task was the right
thing to do, as they could then be classified as group_fully_busy or
group_has_spare. Consider:
DIE [ ]
MC [ ][ ]
0 1 2 3
L L B B
arch_scale_capacity(L) < arch_scale_capacity(B)
CPUs 0-1 are idle / lightly loaded
CPU2 has a misfit task and a few very small tasks
CPU3 has a few very small tasks
When CPU0 is running load_balance() at DIE level, right now we'll classify
the [2-3] group as group_misfit_task and not pick it as busiest because the
local group has a lower CPU capacity.
If we didn't do that, we could leave the misfit task alone and pull some
small task(s) from CPU2 or CPU3, which would be a good thing to
do. However, by allowing a group containing a misfit task to be picked as
the busiest group when a CPU of lower capacity is pulling, we run the risk
of the misfit task itself being downmigrated - e.g. if we repeatedly
increment the sd->nr_balance_failed counter and do an active balance (maybe
because the small tasks were unfortunately cache-hot).
It's less than ideal, but I considered that not downmigrating misfit tasks was
the thing to prioritize (and FWIW it also maintains current behaviour).
Another approach would be to add task utilization vs CPU capacity checks in
detach_tasks() and need_active_balance() to prevent downmigration when
env->imbalance_type < group_misfit_task. This may go against the busiest
group selection heuristics however (misfit tasks could be the main
contributors to the imbalance, but we end up not moving them).
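A rough sketch of what that check could look like in detach_tasks() - purely
illustrative, the guard and its placement are made up and this is not part of
the series:
	/*
	 * Illustrative only: when not balancing misfit load, skip tasks
	 * that fit on the source CPU but wouldn't fit on the destination.
	 */
	if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
	    env->migration_type != migrate_misfit &&
	    task_fits_capacity(p, capacity_of(env->src_cpu)) &&
	    !task_fits_capacity(p, capacity_of(env->dst_cpu)))
		goto next;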
>> lower-capacity CPU (which is the current behaviour and prevents
>> down-migration).
>>
>> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
>> than the local CPU's, add a capacity check there.
>>
>> Reviewed-by: Qais Yousef <[email protected]>
>> Signed-off-by: Valentin Schneider <[email protected]>
>> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> continue;
>>
>> /* Check for a misfit task on the cpu */
>> - if (sd_has_asym_cpucapacity(env->sd) &&
>> - sgs->group_misfit_task_load < rq->misfit_task_load) {
>> - sgs->group_misfit_task_load = rq->misfit_task_load;
>> - *sg_status |= SG_OVERLOAD;
>> + if (!sd_has_asym_cpucapacity(env->sd) ||
>> + !rq->misfit_task_load)
>> + continue;
>> +
>> + *sg_status |= SG_OVERLOAD;
>> + sgs->group_has_misfit_task = true;
>> +
>> + /*
>> + * Don't attempt to maximize load for misfit tasks that can't be
>> + * granted a CPU capacity uplift.
>> + */
>> + if (cpu_capacity_greater(env->dst_cpu, i)) {
>> + sgs->group_misfit_task_load = max(
>> + sgs->group_misfit_task_load,
>> + rq->misfit_task_load);
>
> Please encapsulate all this misfit specific code in a dedicated
> function which will be called from update_sg_lb_stats
>
Will do.
Hi Vincent,
Thanks for taking another look at this.
On 15/03/21 15:18, Vincent Guittot wrote:
> On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> <[email protected]> wrote:
>>
>> Rik noted a while back that a handful of
>>
>> sd->flags & SD_ASYM_CPUCAPACITY
>>
>> & family in the CFS load-balancer code aren't guarded by the
>> sched_asym_cpucapacity static branch.
>
> guarding asym capacity with static branch in fast path makes sense but
> I see no benefit in this slow path but hiding and complexifying the
> code. Also if you start with this way then you have to add a nop in
> all other places where flag or a group_type might be unused.
>
OK, fair enough, I'll drop this one.
On 15/03/21 15:24, Vincent Guittot wrote:
>> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>> */
>> #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>>
>> +/*
>> + * The margin used when comparing CPU capacities.
>> + * is 'cap1' noticeably greater than 'cap2'
>> + *
>> + * (default: ~5%)
>> + */
>> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>
> defined but not used.
>
> Should be merged with next patch which start to use it
>
Will do.
On 11/03/2021 13:05, Valentin Schneider wrote:
[...]
> @@ -9952,7 +9954,8 @@ static int active_load_balance_cpu_stop(void *data)
> * @dst_grpmask we need to make that test go away with lying
> * about DST_PINNED.
> */
> - .flags = LBF_DST_PINNED,
> + .flags = LBF_DST_PINNED |
> + LBF_ACTIVE_LB,
Since you now have a dedicated LBF flag for active balancing you could remove the
LBF_DST_PINNED here and adapt the if condition in can_migrate_task():
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f50a902bdf24..9f7feb512016 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7583,10 +7583,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* meet load balance goals by pulling other tasks on src_cpu.
*
* Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
- * already computed one in current iteration.
+ * already computed one in current iteration or if it is an
+ * active balance.
*/
- if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE ||
+ (env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))) {
return 0;
+ }
/* Prevent to re-select dst_cpu via env's CPUs: */
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
@@ -9948,14 +9951,7 @@ static int active_load_balance_cpu_stop(void *data)
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,
- /*
- * can_migrate_task() doesn't need to compute new_dst_cpu
- * for active balancing. Since we have CPU_IDLE, but no
- * @dst_grpmask we need to make that test go away with lying
- * about DST_PINNED.
- */
- .flags = LBF_DST_PINNED |
- LBF_ACTIVE_LB,
+ .flags = LBF_ACTIVE_LB,
};
On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
<[email protected]> wrote:
>
> On 15/03/21 16:13, Vincent Guittot wrote:
> > On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> > <[email protected]> wrote:
> >>
> >> Consider the following (hypothetical) asymmetric CPU capacity topology,
> >> with some amount of capacity pressure (RT | DL | IRQ | thermal):
> >>
> >> DIE [ ]
> >> MC [ ][ ]
> >> 0 1 2 3
> >>
> >> | CPU | capacity_orig | capacity |
> >> |-----+---------------+----------|
> >> | 0 | 870 | 860 |
> >> | 1 | 870 | 600 |
> >> | 2 | 1024 | 850 |
> >> | 3 | 1024 | 860 |
> >>
> >> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> >> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> >> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> >> CPU1's misfit task. This would then fall on CPU2 to pull the task.
> >>
> >> This currently won't happen, because CPU2 will fail
> >>
> >> capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> >
> > which has been introduced by the previous patch: patch5
> >
> >>
> >> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> >> level. In this case, the max_capacity is that of CPU0's, which is at this
> >> point in time greater than that of CPU2's. This comparison doesn't make
> >> much sense, given that the only CPUs we should care about in this scenario
> >> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> >> destination CPU).
> >>
> >> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> >> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> >> sched_group contains a misfit task to still classify it as
> >> group_misfit_task and not pick it as busiest group when pulling from a
> >
> > Could you give more details about why we should keep tracking the
> > group as misfit ? Do you have a UC in mind ?
> >
>
> As stated the current behaviour is to classify groups as group_misfit_task
> regardless of the dst_cpu's capacity. When we see a group_misfit_task
> candidate group misfit task with higher per-CPU capacity than the local
> group, we don't pick it as busiest.
>
> I initially thought not marking those as group_misfit_task was the right
> thing to do, as they could then be classified as group_fully_busy or
> group_has_spare. Consider:
>
> DIE [ ]
> MC [ ][ ]
> 0 1 2 3
> L L B B
>
> arch_scale_capacity(L) < arch_scale_capacity(B)
>
> CPUs 0-1 are idle / lightly loaded
> CPU2 has a misfit task and a few very small tasks
> CPU3 has a few very small tasks
>
> When CPU0 is running load_balance() at DIE level, right now we'll classify
> the [2-3] group as group_misfit_task and not pick it as busiest because the
> local group has a lower CPU capacity.
>
> If we didn't do that, we could leave the misfit task alone and pull some
> small task(s) from CPU2 or CPU3, which would be a good thing to
Are you sure? The last check in update_sd_pick_busiest() should
already filter this, so it should be enough to let it be classified
correctly.
A group should be classified as group_misfit_task when there is a task
to migrate in priority compared to some other groups. In your case,
you tag it as group_misfit_task but in order to do the opposite, i.e.
to make sure it is not selected. As mentioned above, this will be filtered by
the last check in update_sd_pick_busiest().
> do. However, by allowing a group containing a misfit task to be picked as
> the busiest group when a CPU of lower capacity is pulling, we run the risk
> of the misfit task itself being downmigrated - e.g. if we repeatedly
> increment the sd->nr_balance_failed counter and do an active balance (maybe
> because the small tasks were unfortunately cache_hot()).
>
> It's less than ideal, but I considered not downmigrating misfit tasks was
> the thing to prioritize (and FWIW it also maintains current behaviour).
>
>
> Another approach would be to add task utilization vs CPU capacity checks in
> detach_tasks() and need_active_balance() to prevent downmigration when
> env->imbalance_type < group_misfit_task. This may go against the busiest
> group selection heuristics however (misfit tasks could be the main
> contributors to the imbalance, but we end up not moving them).
>
>
> >> lower-capacity CPU (which is the current behaviour and prevents
> >> down-migration).
> >>
> >> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> >> than the local CPU's, add a capacity check there.
> >>
> >> Reviewed-by: Qais Yousef <[email protected]>
> >> Signed-off-by: Valentin Schneider <[email protected]>
>
> >> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> continue;
> >>
> >> /* Check for a misfit task on the cpu */
> >> - if (sd_has_asym_cpucapacity(env->sd) &&
> >> - sgs->group_misfit_task_load < rq->misfit_task_load) {
> >> - sgs->group_misfit_task_load = rq->misfit_task_load;
> >> - *sg_status |= SG_OVERLOAD;
> >> + if (!sd_has_asym_cpucapacity(env->sd) ||
> >> + !rq->misfit_task_load)
> >> + continue;
> >> +
> >> + *sg_status |= SG_OVERLOAD;
> >> + sgs->group_has_misfit_task = true;
> >> +
> >> + /*
> >> + * Don't attempt to maximize load for misfit tasks that can't be
> >> + * granted a CPU capacity uplift.
> >> + */
> >> + if (cpu_capacity_greater(env->dst_cpu, i)) {
> >> + sgs->group_misfit_task_load = max(
> >> + sgs->group_misfit_task_load,
> >> + rq->misfit_task_load);
> >
> > Please encapsulate all this misfit specific code in a dedicated
> > function which will be called from update_sg_lb_stats
> >
>
> Will do.
On 19/03/21 16:19, Vincent Guittot wrote:
> On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
> <[email protected]> wrote:
>> As stated the current behaviour is to classify groups as group_misfit_task
>> regardless of the dst_cpu's capacity. When we see a group_misfit_task
>> candidate group misfit task with higher per-CPU capacity than the local
>> group, we don't pick it as busiest.
>>
>> I initially thought not marking those as group_misfit_task was the right
>> thing to do, as they could then be classified as group_fully_busy or
>> group_has_spare. Consider:
>>
>> DIE [ ]
>> MC [ ][ ]
>> 0 1 2 3
>> L L B B
>>
>> arch_scale_capacity(L) < arch_scale_capacity(B)
>>
>> CPUs 0-1 are idle / lightly loaded
>> CPU2 has a misfit task and a few very small tasks
>> CPU3 has a few very small tasks
>>
>> When CPU0 is running load_balance() at DIE level, right now we'll classify
>> the [2-3] group as group_misfit_task and not pick it as busiest because the
>> local group has a lower CPU capacity.
>>
>> If we didn't do that, we could leave the misfit task alone and pull some
>> small task(s) from CPU2 or CPU3, which would be a good thing to
>
> Are you sure? the last check in update_sd_pick_busiest() should
> already filter this. So it should be enough to let it be classify
> correctly
>
> A group should be classified as group_misfit_task when there is a task
> to migrate in priority compared to some other groups. In your case,
> you tag it as group_misfit_task but in order to do the opposite, i.e.
> make sure to not select it. As mentioned above, this will be filter in
> the last check in update_sd_pick_busiest()
>
This hinges on sgc->min_capacity, which might be influenced by a CPU in the
candidate group being severely pressured by IRQ / thermal / RT / DL
pressure. That said, you have a point in that this check and the one in
find_busiest_queue() catch most scenarios I can think of.
Let me ponder this some more, and I'll throw it at the test
infrastructure monster if I go down that route.
Hi Valentin,
The reduced margin is helping our platforms. Please feel free to add my
Tested-by tag:
Tested-by: Lingutla Chandrasekhar <[email protected]>
On 3/11/2021 5:35 PM, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
> group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Reviewed-by: Qais Yousef <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db892f6e222f..ddb2ab3edf6d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
> */
> #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
> #endif
>
> #ifdef CONFIG_CFS_BANDWIDTH
Hi Valentin,
The reduced margin is helping our platforms. Please feel free to add my
Tested-by tag:
Tested-by: Lingutla Chandrasekhar <[email protected]>
On 3/11/2021 5:35 PM, Valentin Schneider wrote:
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
>
> Reviewed-by: Qais Yousef <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++-----------------------------
> 1 file changed, 4 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ddb2ab3edf6d..1e8a242cd1f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8350,26 +8350,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> return false;
> }
>
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> - return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> - return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> -}
> -
> static inline enum
> group_type group_classify(unsigned int imbalance_pct,
> struct sched_group *group,
> @@ -8518,15 +8498,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> if (!sgs->sum_h_nr_running)
> return false;
>
> - /*
> - * Don't try to pull misfit tasks we can't help.
> - * We can use max_capacity here as reduction in capacity on some
> - * CPUs in the group should either be possible to resolve
> - * internally or be covered by avg_load imbalance (eventually).
> - */
> + /* Don't try to pull misfit tasks we can't help */
> if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> sgs->group_type == group_misfit_task &&
> - (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> + (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> @@ -8610,7 +8585,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> */
> if (sd_has_asym_cpucapacity(env->sd) &&
> (sgs->group_type <= group_fully_busy) &&
> - (group_smaller_min_cpu_capacity(sds->local, sg)))
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> return false;
>
> return true;
> @@ -9410,7 +9385,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * average load.
> */
> if (sd_has_asym_cpucapacity(env->sd) &&
> - capacity_of(env->dst_cpu) < capacity &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> nr_running == 1)
> continue;
>