Hi,
This is v3 of this series. Previous versions can be found here [1] and
here [2]. To avoid duplication, I do not include the cover letter of the
original submission. You can read it in [1].
Changes since v2:
Vincent correctly indicated that I was abusing asym_packing to force load
balances unrelated to CPU priority. The underlying issue is that the
scheduler cannot handle load balances between SMT and non-SMT cores
correctly. I added several prework patches to fix it... and I removed the
abuse of asym_packing.
Dietmar helped me to realize that there is a better way to check the idle
state of SMT cores. Now I give the task to the scheduler instead of
architecture-specific overrides. I unconditionally obey CPU priorities
at the SMT level. This keeps Power7 happy. At upper levels (i.e., when
balancing load between cores) the scheduler also considers the idle state
of the core in addition to CPU priority. This satisfies x86.
Ionela spotted a violation of the scheduler topology sanity checks. We did
not find a check that suits both Power7 and x86. For now, I removed the
SHARED_CHILD flag of SD_ASYM_PACKING.
Hopefully, these patches are in sufficiently good shape to be merged.
Thank you for your feedback and I look forward to getting more of it!
New patches: 2, 3, 4, 5, 6, 7, 8
Updated patches: 1
Unchanged patches: 9, 10
BR,
Ricardo
[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
Ricardo Neri (10):
sched/fair: Generalize asym_packing logic for SMT cores
sched/fair: Move is_core_idle() out of CONFIG_NUMA
sched/fair: Only do asym_packing load balancing from fully idle SMT
cores
sched/fair: Let low-priority cores help high-priority busy SMT cores
sched/fair: Keep a fully_busy SMT sched group as busiest
sched/fair: Use the prefer_sibling flag of the current sched domain
sched/fair: Do not even the number of busy CPUs via asym_packing
sched/topology: Remove SHARED_CHILD from ASYM_PACKING
x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
x86/sched/itmt: Give all SMT siblings of a core the same priority
arch/x86/kernel/itmt.c | 23 +----
arch/x86/kernel/smpboot.c | 2 +-
include/linux/sched/sd_flags.h | 5 +-
kernel/sched/fair.c | 175 +++++++++++++++++----------------
4 files changed, 99 insertions(+), 106 deletions(-)
--
2.25.1
When doing asym_packing load balancing between cores, all we care about is that
the destination core is fully idle (including SMT siblings, if any) and
that the busiest candidate scheduling group has exactly one busy CPU. It is
irrelevant whether the candidate busiest core is non-SMT, SMT2, SMT4, SMT8,
etc.
Do not handle the candidate busiest non-SMT vs SMT cases separately. Simply
do the two checks described above. Let find_busiest_group() handle bigger
imbalances in the number of idle CPUs.
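In code terms, the simplified policy reduces to this sketch (a condensed
restatement of the logic in the diff below, with sds, sgs, sg and dst_cpu as
in asym_smt_can_pull_tasks()):

int sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;

/*
 * Pull only if the candidate busiest group has exactly one busy CPU and
 * the whole destination core (SMT siblings included) is idle.
 */
if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);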
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Updated documentation of the function to reflect the new behavior.
(Dietmar)
Changes since v1:
* Reworded commit message and inline comments for clarity.
* Stated that this changeset does not impact SMT4 or SMT8.
---
kernel/sched/fair.c | 41 ++++++++++++++---------------------------
1 file changed, 14 insertions(+), 27 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c46485d65d7..df46e06c9a3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9254,13 +9254,11 @@ group_type group_classify(unsigned int imbalance_pct,
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
*
- * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
- * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
- * Bigger imbalances in the number of busy CPUs will be dealt with in
- * update_sd_pick_busiest().
- *
- * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
- * of @dst_cpu are idle and @sg has lower priority.
+ * If @dst_cpu has SMT siblings, check if there are no running tasks in
+ * @sds::local. In such case, decide based on the priority of @sg. Do it only
+ * if @sg has exactly one busy CPU (i.e., one more than @sds::local). Bigger
+ * imbalances in the number of busy CPUs will be dealt with in
+ * find_busiest_group().
*
* Return: true if @dst_cpu can pull tasks, false otherwise.
*/
@@ -9269,12 +9267,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
struct sched_group *sg)
{
#ifdef CONFIG_SCHED_SMT
- bool local_is_smt, sg_is_smt;
+ bool local_is_smt;
int sg_busy_cpus;
local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
- sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
if (!local_is_smt) {
@@ -9295,25 +9291,16 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
}
- /* @dst_cpu has SMT siblings. */
-
- if (sg_is_smt) {
- int local_busy_cpus = sds->local->group_weight -
- sds->local_stat.idle_cpus;
- int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
- if (busy_cpus_delta == 1)
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
- return false;
- }
-
/*
- * @sg does not have SMT siblings. Ensure that @sds::local does not end
- * up with more than one busy SMT sibling and only pull tasks if there
- * are not busy CPUs (i.e., no CPU has running tasks).
+ * @dst_cpu has SMT siblings. Do asym_packing load balancing only if
+ * all its siblings are idle (moving tasks between physical cores in
+ * which some SMT siblings are busy results in the same throughput).
+ *
+ * If the difference in the number of busy CPUs is two or more, let
+ * find_busiest_group() take care of it. We only care if @sg has
+ * exactly one busy CPU. This covers SMT and non-SMT sched groups.
*/
- if (!sds->local_stat.sum_nr_running)
+ if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
return false;
--
2.25.1
asym_packing needs this function to determine whether an SMT core is a
suitable destination for load balancing.
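As a usage sketch (a hypothetical caller; the real ones appear in later
patches of this series), a core-level balancing path would use it like this:

/*
 * Only pull tasks toward dst_cpu if its whole core, SMT siblings
 * included, is idle; otherwise a busy sibling loses throughput.
 */
if (!is_core_idle(dst_cpu))
	return false;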
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df46e06c9a3e..767bec7789ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,23 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ int sibling;
+
+ for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+ if (cpu == sibling)
+ continue;
+
+ if (!idle_cpu(sibling))
+ return false;
+ }
+#endif
+
+ return true;
+}
+
#ifdef CONFIG_NUMA
#define NUMA_IMBALANCE_MIN 2
@@ -1700,23 +1717,6 @@ struct numa_stats {
int idle_cpu;
};
-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
- int sibling;
-
- for_each_cpu(sibling, cpu_smt_mask(cpu)) {
- if (cpu == sibling)
- continue;
-
- if (!idle_cpu(sibling))
- return false;
- }
-#endif
-
- return true;
-}
-
struct task_numa_env {
struct task_struct *p;
--
2.25.1
Using asym_packing priorities within an SMT core is straightforward. Just
follow the priorities that hardware indicates.
When balancing load from an SMT core, also consider the idle state of its
siblings. Priorities do not reflect that an SMT core divides its throughput
among all its busy siblings. They only make sense when exactly one sibling
is busy.
Indicate that active balance is needed if the destination CPU has lower
priority than the source CPU but the latter has busy SMT siblings.
Make find_busiest_queue() not skip higher-priority SMT cores with more than
one busy sibling.
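Condensed, the new asym_active_balance() policy looks like this (a sketch of
the change in the diff below):

if (env->sd->flags & SD_SHARE_CPUCAPACITY)
	/* Between SMT siblings, always obey priorities. */
	return sched_asym_prefer(env->dst_cpu, env->src_cpu);

/*
 * Between cores, also let a lower-priority CPU help an SMT core with
 * more than one busy sibling.
 */
return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
       !is_core_idle(env->src_cpu);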
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80c86462c6f6..c9d0ddfd11f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct lb_env *env,
nr_running == 1)
continue;
- /* Make sure we only pull tasks from a CPU of lower priority */
+ /*
+ * Make sure we only pull tasks from a CPU of lower priority
+ * when balancing between SMT siblings.
+ *
+ * If balancing between cores, let lower priority CPUs help
+ * SMT cores with more than one busy sibling.
+ */
if ((env->sd->flags & SD_ASYM_PACKING) &&
sched_asym_prefer(i, env->dst_cpu) &&
- nr_running == 1)
- continue;
+ nr_running == 1) {
+ if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
+ (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
+ continue;
+ }
switch (env->migration_type) {
case migrate_load:
@@ -10530,8 +10539,20 @@ asym_active_balance(struct lb_env *env)
* lower priority CPUs in order to pack all tasks in the
* highest priority CPUs.
*/
- return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
- sched_asym_prefer(env->dst_cpu, env->src_cpu);
+ if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
+ /* Always obey priorities between SMT siblings. */
+ if (env->sd->flags & SD_SHARE_CPUCAPACITY)
+ return sched_asym_prefer(env->dst_cpu, env->src_cpu);
+
+ /*
+ * A lower priority CPU can help an SMT core with more than one
+ * busy sibling.
+ */
+ return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+ !is_core_idle(env->src_cpu);
+ }
+
+ return false;
}
static inline bool
--
2.25.1
When balancing load between cores, all the SMT siblings of the destination
CPU, if any, must be idle. Otherwise, pulling new tasks degrades the
throughput of the busy SMT siblings. The overall throughput of the system
remains the same.
When balancing load within an SMT core, this consideration is not relevant.
Follow the priorities that hardware indicates.
Using is_core_idle() renders checking !sds->local_stat.sum_nr_running
redundant. Remove it.
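The net effect can be summarized with this sketch (a condensed restatement
of the gate added to sched_asym() in the diff below):

/*
 * If the destination CPU has SMT siblings, all of them must be idle
 * before asym_packing load balancing is considered.
 */
if (sds->local->flags & SD_SHARE_CPUCAPACITY && !is_core_idle(env->dst_cpu))
	return false;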
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 767bec7789ac..80c86462c6f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9250,12 +9250,14 @@ group_type group_classify(unsigned int imbalance_pct,
* Check the state of the SMT siblings of both @sds::local and @sg and decide
* if @dst_cpu can pull tasks.
*
+ * This function must be called only if all the SMT siblings of @dst_cpu are
+ * idle, if any.
+ *
* If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
*
- * If @dst_cpu has SMT siblings, check if there are no running tasks in
- * @sds::local. In such case, decide based on the priority of @sg. Do it only
+ * If @dst_cpu has SMT siblings, decide based on the priority of @sg. Do it only
* if @sg has exactly one busy CPU (i.e., one more than @sds::local). Bigger
* imbalances in the number of busy CPUs will be dealt with in
* find_busiest_group().
@@ -9292,15 +9294,13 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
}
/*
- * @dst_cpu has SMT siblings. Do asym_packing load balancing only if
- * all its siblings are idle (moving tasks between physical cores in
- * which some SMT siblings are busy results in the same throughput).
+ * @dst_cpu has SMT siblings and all of them are idle.
*
* If the difference in the number of busy CPUs is two or more, let
* find_busiest_group() take care of it. We only care if @sg has
* exactly one busy CPU. This covers SMT and non-SMT sched groups.
*/
- if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
+ if (sg_busy_cpus == 1)
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
return false;
@@ -9314,7 +9314,14 @@ static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- /* Only do SMT checks if either local or candidate have SMT siblings */
+ /*
+ * If the destination CPU has SMT siblings, env->idle != CPU_NOT_IDLE
+ * is not sufficient. We need to make sure the whole core is idle.
+ */
+ if (sds->local->flags & SD_SHARE_CPUCAPACITY && !is_core_idle(env->dst_cpu))
+ return false;
+
+ /* Only do SMT checks if either local or candidate have SMT siblings. */
if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
(group->flags & SD_SHARE_CPUCAPACITY))
return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
@@ -11261,8 +11268,17 @@ static void nohz_balancer_kick(struct rq *rq)
*/
for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
if (sched_asym_prefer(i, cpu)) {
- flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
- goto unlock;
+ /*
+ * Always do ASYM_PACKING balance in the SMT
+ * domain. In upper domains, the core must be
+ * fully idle.
+ */
+ if (sd->flags & SD_SHARE_CPUCAPACITY ||
+ (!(sd->flags & SD_SHARE_CPUCAPACITY) &&
+ is_core_idle(i))) {
+ flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+ goto unlock;
+ }
}
}
}
--
2.25.1
When comparing two fully_busy scheduling groups, keep the current busiest
group if it represents an SMT core. Tasks in such a scheduling group share
CPU resources and need more help than tasks in a non-SMT fully_busy group.
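In short, the tie-break added in the diff below behaves like this sketch
(with sgs, busiest and sds as in update_sd_pick_busiest()):

if (sgs->avg_load == busiest->avg_load &&
    (sds->busiest->flags & SD_SHARE_CPUCAPACITY))
	return false;	/* keep the fully_busy SMT core as busiest */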
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9d0ddfd11f2..df7bcbf634a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9514,10 +9514,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* contention when accessing shared HW resources.
*
* XXX for now avg_load is not computed and always 0 so we
- * select the 1st one.
+ * select the 1st one, except if @sg is composed of SMT
+ * siblings.
*/
- if (sgs->avg_load <= busiest->avg_load)
+
+ if (sgs->avg_load < busiest->avg_load)
return false;
+
+ if (sgs->avg_load == busiest->avg_load) {
+ /*
+ * SMT sched groups need more help than non-SMT groups.
+ * If @sg happens to also be SMT, either choice is good.
+ */
+ if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+ return false;
+ }
+
break;
case group_has_spare:
--
2.25.1
SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
Above the SMT sched domain, all domains have a child. The SD_PREFER_SIBLING
flag is always honored, regardless of the scheduling domain at which the
load balance takes place.
There are cases, however, in which the busiest CPU's sched domain has a
child but the destination CPU's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
left with a fully busy SMT core and an idle non-SMT core.
Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
scheduling domain, not its child.
The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
will not spread load among NUMA sched groups, as desired.
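The change boils down to one assignment in update_sd_lb_stats(), shown here
side by side for clarity (see the diff below):

/* Before: honor the flag of the child domain. */
sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

/* After: honor the flag of the domain being balanced. */
sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;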
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df7bcbf634a8..a37ad59f20ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
@@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg = sg->next;
} while (sg != env->sd->groups);
- /* Tag domain that child domain prefers tasks go to siblings first */
- sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
-
+ /*
+ * Tag domain that @env::sd prefers to spread excess tasks among
+ * sibling sched groups.
+ */
+ sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);
@@ -10346,7 +10347,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto out_balanced;
}
- /* Try to move all excess tasks to child's sibling domain */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;
--
2.25.1
Only x86 and Power7 use ASYM_PACKING. They use it differently.
Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.
X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.
Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.
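For context, SDF_SHARED_CHILD encodes an invariant along the lines of this
illustrative pseudo-check (not the actual implementation; the real validation
lives in the scheduler topology debug code):

/*
 * If a domain has a SHARED_CHILD flag, every one of its children must
 * have it too, all the way down to the base domain.
 */
if ((sd->flags & SD_ASYM_PACKING) && sd->child &&
    !(sd->child->flags & SD_ASYM_PACKING))
	pr_err("SD_ASYM_PACKING set on a domain but not on its child\n");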
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
include/linux/sched/sd_flags.h | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..800238854ba5 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
/*
* Place busy tasks earlier in the domain
*
- * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- * up, but currently assumed to be set from the base domain
- * upwards (see update_top_cache_domain()).
* NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
/*
* Prefer to place tasks in a sibling domain
--
2.25.1
Now that find_busiest_group() triggers load balancing between a fully_busy
SMT2 core and an idle non-SMT core, it is no longer necessary to force
balancing via asym_packing. Use asym_packing only as intended: when there
is an idle high-priority CPU.
After this change, the same logic applies to SMT and non-SMT local groups.
Simplify asym_smt_can_pull_tasks() accordingly.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 37 +++++--------------------------------
1 file changed, 5 insertions(+), 32 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a37ad59f20ea..0ada2d18b934 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9247,20 +9247,15 @@ group_type group_classify(unsigned int imbalance_pct,
* @sgs: Load-balancing statistics of the candidate busiest group
* @sg: The candidate busiest group
*
- * Check the state of the SMT siblings of both @sds::local and @sg and decide
- * if @dst_cpu can pull tasks.
+ * Check the state of the SMT siblings of @sg and decide if @dst_cpu can pull
+ * tasks.
*
* This function must be called only if all the SMT siblings of @dst_cpu are
* idle, if any.
*
- * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
- * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
- * only if @dst_cpu has higher priority.
- *
- * If @dst_cpu has SMT siblings, decide based on the priority of @sg. Do it only
- * if @sg has exactly one busy CPU (i.e., one more than @sds::local). Bigger
- * imbalances in the number of busy CPUs will be dealt with in
- * find_busiest_group().
+ * @dst_cpu can pull tasks if @sg has exactly one busy CPU (i.e., one more than
+ * @sds::local) and has lower group priority than @sds::local. Bigger imbalances
+ * in the number of busy CPUs will be dealt with in find_busiest_group().
*
* Return: true if @dst_cpu can pull tasks, false otherwise.
*/
@@ -9269,33 +9264,11 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
struct sched_group *sg)
{
#ifdef CONFIG_SCHED_SMT
- bool local_is_smt;
int sg_busy_cpus;
- local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
- if (!local_is_smt) {
- /*
- * If we are here, @dst_cpu is idle and does not have SMT
- * siblings. Pull tasks if candidate group has two or more
- * busy CPUs.
- */
- if (sg_busy_cpus >= 2) /* implies sg_is_smt */
- return true;
-
- /*
- * @dst_cpu does not have SMT siblings. @sg may have SMT
- * siblings and only one is busy. In such case, @dst_cpu
- * can help if it has higher priority and is idle (i.e.,
- * it has no running tasks).
- */
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
- }
-
/*
- * @dst_cpu has SMT siblings and all of them are idle.
- *
* If the difference in the number of busy CPUs is two or more, let
* find_busiest_group() take care of it. We only care if @sg has
* exactly one busy CPU. This covers SMT and non-SMT sched groups.
--
2.25.1
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None.
Changes since v1:
* Introduced this patch.
---
arch/x86/kernel/smpboot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 55cad72715d9..0213d066a9a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -547,7 +547,7 @@ static int x86_core_flags(void)
#ifdef CONFIG_SCHED_SMT
static int x86_smt_flags(void)
{
- return cpu_smt_flags() | x86_sched_itmt_flags();
+ return cpu_smt_flags();
}
#endif
#ifdef CONFIG_SCHED_CLUSTER
--
2.25.1
X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.
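For illustration, with the formula removed in the diff below
(smt_prio = prio * smp_num_siblings / (i * i)) on a two-sibling core with
prio = 100, the first sibling would get 100 * 2 / 1 = 200 and the second
100 * 2 / 4 = 50. After this patch, both siblings simply get 100.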
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None.
Changes since v1:
* Reworded commit message for clarity.
---
arch/x86/kernel/itmt.c | 23 +++++------------------
1 file changed, 5 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 9ff480e94511..6510883c5e81 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -174,32 +174,19 @@ int arch_asym_cpu_priority(int cpu)
/**
* sched_set_itmt_core_prio() - Set CPU priority based on ITMT
- * @prio: Priority of cpu core
- * @core_cpu: The cpu number associated with the core
+ * @prio: Priority of @cpu
+ * @cpu: The CPU number
*
* The pstate driver will find out the max boost frequency
* and call this function to set a priority proportional
- * to the max boost frequency. CPU with higher boost
+ * to the max boost frequency. CPUs with higher boost
* frequency will receive higher priority.
*
* No need to rebuild sched domain after updating
* the CPU priorities. The sched domains have no
* dependency on CPU priorities.
*/
-void sched_set_itmt_core_prio(int prio, int core_cpu)
+void sched_set_itmt_core_prio(int prio, int cpu)
{
- int cpu, i = 1;
-
- for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
- int smt_prio;
-
- /*
- * Ensure that the siblings are moved to the end
- * of the priority chain and only used when
- * all other high priority cpus are out of capacity.
- */
- smt_prio = prio * smp_num_siblings / (i * i);
- per_cpu(sched_core_priority, cpu) = smt_prio;
- i++;
- }
+ per_cpu(sched_core_priority, cpu) = prio;
}
--
2.25.1
On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
<[email protected]> wrote:
>
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>
> Above the SMT sched domain, all domains have a child. The SD_PREFER_SIBLING
> flag is always honored, regardless of the scheduling domain at which the
> load balance takes place.
>
> There are cases, however, in which the busiest CPU's sched domain has a
> child but the destination CPU's does not. Consider, for instance, a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
>
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
>
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.
This is a significant change in the behavior of the NUMA system. It
would be good to get figures or confirmation that demonstrate that
it's OK to remove the prefer_sibling behavior at the 1st NUMA level.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Suggested-by: Valentin Schneider <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Introduced this patch.
>
> Changes since v1:
> * N/A
> ---
> kernel/sched/fair.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df7bcbf634a8..a37ad59f20ea 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
>
> static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - struct sched_domain *child = env->sd->child;
> struct sched_group *sg = env->sd->groups;
> struct sg_lb_stats *local = &sds->local_stat;
> struct sg_lb_stats tmp_sgs;
> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> sg = sg->next;
> } while (sg != env->sd->groups);
>
> - /* Tag domain that child domain prefers tasks go to siblings first */
> - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> -
> + /*
> + * Tag domain that @env::sd prefers to spread excess tasks among
> + * sibling sched groups.
> + */
> + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
>
> if (env->sd->flags & SD_NUMA)
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> @@ -10346,7 +10347,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> goto out_balanced;
> }
>
> - /* Try to move all excess tasks to child's sibling domain */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> busiest->sum_nr_running > local->sum_nr_running + 1)
> goto force_balance;
> --
> 2.25.1
>
On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
<[email protected]> wrote:
>
> Using asym_packing priorities within an SMT core is straightforward. Just
> follow the priorities that hardware indicates.
>
> When balancing load from an SMT core, also consider the idle state of its
> siblings. Priorities do not reflect that an SMT core divides its throughput
> among all its busy siblings. They only make sense when exactly one sibling
> is busy.
>
> Indicate that active balance is needed if the destination CPU has lower
> priority than the source CPU but the latter has busy SMT siblings.
>
> Make find_busiest_queue() not skip higher-priority SMT cores with more than
> one busy sibling.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Suggested-by: Valentin Schneider <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Introduced this patch.
>
> Changes since v1:
> * N/A
> ---
> kernel/sched/fair.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 80c86462c6f6..c9d0ddfd11f2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> nr_running == 1)
> continue;
>
> - /* Make sure we only pull tasks from a CPU of lower priority */
> + /*
> + * Make sure we only pull tasks from a CPU of lower priority
> + * when balancing between SMT siblings.
> + *
> + * If balancing between cores, let lower priority CPUs help
> + * SMT cores with more than one busy sibling.
> + */
> if ((env->sd->flags & SD_ASYM_PACKING) &&
> sched_asym_prefer(i, env->dst_cpu) &&
> - nr_running == 1)
> - continue;
> + nr_running == 1) {
> + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
This 2nd if could be merged with the upper one
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10518,11 +10518,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 */
 if ((env->sd->flags & SD_ASYM_PACKING) &&
     sched_asym_prefer(i, env->dst_cpu) &&
-    nr_running == 1) {
-	if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
-	    (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
+    (nr_running == 1) &&
+    (env->sd->flags & SD_SHARE_CPUCAPACITY ||
+     (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i))))
 	continue;
-}
switch (env->migration_type) {
case migrate_load:
---
AFAICT, you can even remove one env->sd->flags & SD_SHARE_CPUCAPACITY
test with the below, but this makes the condition far less obvious
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6021af9de11..7dfa30c45327 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10518,11 +10518,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 */
 if ((env->sd->flags & SD_ASYM_PACKING) &&
     sched_asym_prefer(i, env->dst_cpu) &&
-    nr_running == 1) {
-	if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
-	    (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
+    (nr_running == 1) &&
+    !(!(env->sd->flags & SD_SHARE_CPUCAPACITY) &&
+      !is_core_idle(i)))
 	continue;
-}
> + continue;
> + }
>
> switch (env->migration_type) {
> case migrate_load:
> @@ -10530,8 +10539,20 @@ asym_active_balance(struct lb_env *env)
> * lower priority CPUs in order to pack all tasks in the
> * highest priority CPUs.
> */
> - return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
> - sched_asym_prefer(env->dst_cpu, env->src_cpu);
> + if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
> + /* Always obey priorities between SMT siblings. */
> + if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> + return sched_asym_prefer(env->dst_cpu, env->src_cpu);
> +
> + /*
> + * A lower priority CPU can help an SMT core with more than one
> + * busy sibling.
> + */
> + return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> + !is_core_idle(env->src_cpu);
> + }
> +
> + return false;
> }
>
> static inline bool
> --
> 2.25.1
>
Hi, All,
On Mon, 2023-02-06 at 20:58 -0800, Ricardo Neri wrote:
> Hi,
>
> This is v3 of this series. Previous versions can be found here [1] and
> here [2]. To avoid duplication, I do not include the cover letter of the
> original submission. You can read it in [1].
I happened to run into a similar issue when testing another patch series,
which allows idle injection for a subset of CPUs instead of all CPUs.
https://lore.kernel.org/all/[email protected]/
On an ADL-P NUC system with 4 Pcores (cpu0-cpu7), and 8 Ecores (cpu8-
cpu15), the problem can be reproduced by
1. start 16 stress threads
2. force idle injection to all Ecore cpus
3. stop idle injection after 10 seconds
After step 3, all the Pcore cpus are 100% busy, and all the Ecore cpus
are almost 100% idle. This situation lasts for a long time, till I kill
all the stress threads after 20 seconds.
After sync with Chen Yu, I also tried
stress -c 16 &
chrt -r 70 taskset -c 8-15 stress -c 8 -t 10
instead of idle injection, and the problem is also 100% reproducible.
And note that the problem can be reproduced w/ and w/o ITMT enabled, by
poking /proc/sys/kernel/sched_itmt_enabled.
With this whole patch series applied, I can confirm the problem is gone
both w/ and w/o ITMT enabled. So
Tested-by: Zhang Rui <[email protected]>
thanks,
rui
>
> Changes since v2:
>
> Vincent correctly indicated that I was abusing asym_packing to force load
> balances unrelated to CPU priority. The underlying issue is that the
> scheduler cannot handle load balances between SMT and non-SMT cores
> correctly. I added several prework patches to fix it... and I removed the
> abuse of asym_packing.
>
> Dietmar helped me to realize that there is a better way to check the idle
> state of SMT cores. Now I give the task to the scheduler instead of
> architecture-specific overrides. I unconditionally obey CPU priorities
> at the SMT level. This keeps Power7 happy. At upper levels (i.e., when
> balancing load between cores) the scheduler also considers the idle state
> of the core in addition to CPU priority. This satisfies x86.
>
> Ionela spotted a violation of the scheduler topology sanity checks. We did
> not find a check that suits both Power7 and x86. For now, I removed the
> SHARED_CHILD flag of SD_ASYM_PACKING.
>
> Hopefully, these patches are in sufficiently good shape to be merged.
>
> Thank you for your feedback and I look forward to getting more of it!
>
> New patches 2, 3, 4, 5, 6, 7, 8
> Updated patches: 1
> Unchanged patches: 9, 10
>
> BR,
> Ricardo
>
> [1]. https://lore.kernel.org/lkml/[email protected]/
> [2]. https://lore.kernel.org/lkml/[email protected]/
>
>
> Ricardo Neri (10):
> sched/fair: Generalize asym_packing logic for SMT cores
> sched/fair: Move is_core_idle() out of CONFIG_NUMA
> sched/fair: Only do asym_packing load balancing from fully idle SMT
> cores
> sched/fair: Let low-priority cores help high-priority busy SMT
> cores
> sched/fair: Keep a fully_busy SMT sched group as busiest
> sched/fair: Use the prefer_sibling flag of the current sched domain
> sched/fair: Do not even the number of busy CPUs via asym_packing
> sched/topology: Remove SHARED_CHILD from ASYM_PACKING
> x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
> x86/sched/itmt: Give all SMT siblings of a core the same priority
>
> arch/x86/kernel/itmt.c | 23 +----
> arch/x86/kernel/smpboot.c | 2 +-
> include/linux/sched/sd_flags.h | 5 +-
> kernel/sched/fair.c | 175 +++++++++++++++++----------------
> 4 files changed, 99 insertions(+), 106 deletions(-)
>
On Wed, Feb 08, 2023 at 08:56:32AM +0100, Vincent Guittot wrote:
> > + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
>
> This 2nd if could be merged with the upper one
Wasn't this exact same condition used in the previous patch as well?
Does it want to be a helper perhaps?
On 2023-02-06 at 20:58:34 -0800, Ricardo Neri wrote:
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>
> Above the SMT sched domain, all domains have a child. The SD_PREFER_SIBLING
> flag is always honored, regardless of the scheduling domain at which the
> load balance takes place.
>
> There are cases, however, in which the busiest CPU's sched domain has a
> child but the destination CPU's does not. Consider, for instance, a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
>
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
>
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Suggested-by: Valentin Schneider <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Introduced this patch.
>
> Changes since v1:
> * N/A
> ---
> kernel/sched/fair.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df7bcbf634a8..a37ad59f20ea 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
>
> static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - struct sched_domain *child = env->sd->child;
> struct sched_group *sg = env->sd->groups;
> struct sg_lb_stats *local = &sds->local_stat;
> struct sg_lb_stats tmp_sgs;
> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> sg = sg->next;
> } while (sg != env->sd->groups);
>
> - /* Tag domain that child domain prefers tasks go to siblings first */
> - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> -
> + /*
> + * Tag domain that @env::sd prefers to spread excess tasks among
> + * sibling sched groups.
> + */
> + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
>
This does help fix the issue that a non-SMT core fails to pull tasks from busy
SMT cores.
And it also semantically changes the definition of prefer_sibling. Do we also
need to change this:
if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
sd->child->flags &= ~SD_PREFER_SIBLING;
might be:
if ((sd->flags & SD_ASYM_CPUCAPACITY))
sd->flags &= ~SD_PREFER_SIBLING;
thanks,
Chenyu
>
>> static inline void update_sd_lb_stats(struct lb_env *env, struct
>> sd_lb_stats *sds) {
>> - struct sched_domain *child = env->sd->child;
>> struct sched_group *sg = env->sd->groups;
>> struct sg_lb_stats *local = &sds->local_stat;
>> struct sg_lb_stats tmp_sgs;
>> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct
>lb_env *env, struct sd_lb_stats *sd
>> sg = sg->next;
>> } while (sg != env->sd->groups);
>>
>> - /* Tag domain that child domain prefers tasks go to siblings first */
>> - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>> -
>> + /*
>> + * Tag domain that @env::sd prefers to spread excess tasks among
>> + * sibling sched groups.
>> + */
>> + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
>>
>This does help fix the issue that a non-SMT core fails to pull tasks from busy
>SMT cores.
>And it also semantically changes the definition of prefer_sibling. Do we also
>need to change this:
> if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> if ((sd->flags & SD_ASYM_CPUCAPACITY))
> sd->flags &= ~SD_PREFER_SIBLING;
>
Yu,
I think you are talking about the code in sd_init()
where SD_PREFER_SIBLING is first set
to "ON" and updated depending on SD_ASYM_CPUCAPACITY. The intention of the code
is if there are cpus in the scheduler domain that have differing cpu capacities,
we do not want to do spreading among the child groups in the sched domain.
So the flag is turned off in the child group level and not the parent level. But with your above
change, the parent's flag is turned off, leaving the child level flag on.
This moves the level where spreading happens (SD_PREFER_SIBLING on)
up one level which is undesired (see table below).
              SD_ASYM_CPUCAPACITY    SD_PREFER_SIBLING after init
                                      original code    proposed
SD Level
root          ON                      ON               OFF (note: SD_PREFER_SIBLING unused at this level)
first level   ON                      OFF              OFF
second level  OFF                     OFF              ON
third level   OFF                     ON               ON
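For reference, the two sd_init() variants under discussion, side by side
(both already quoted in this thread):

/* Current code: clear the flag on the child domain. */
if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
	sd->child->flags &= ~SD_PREFER_SIBLING;

/* Proposed: clear the flag on the domain itself. */
if (sd->flags & SD_ASYM_CPUCAPACITY)
	sd->flags &= ~SD_PREFER_SIBLING;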
Tim
On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > > static inline void update_sd_lb_stats(struct lb_env *env, struct
> > > sd_lb_stats *sds) {
> > > - struct sched_domain *child = env->sd->child;
> > > struct sched_group *sg = env->sd->groups;
> > > struct sg_lb_stats *local = &sds->local_stat;
> > > struct sg_lb_stats tmp_sgs;
> > > @@ -10045,9 +10044,11 @@ static inline void
> > > update_sd_lb_stats(struct
> > lb_env *env, struct sd_lb_stats *sd
> > > sg = sg->next;
> > > } while (sg != env->sd->groups);
> > >
> > > - /* Tag domain that child domain prefers tasks go to
> > > siblings first */
> > > - sds->prefer_sibling = child && child->flags &
> > > SD_PREFER_SIBLING;
> > > -
> > > + /*
> > > + * Tag domain that @env::sd prefers to spread excess
> > > tasks among
> > > + * sibling sched groups.
> > > + */
> > > + sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > >
> > This does help fix the issue that non-SMT core fails to pull task
> > from busy SMT-
> > cores.
> > And it also semantically changes the definition of prefer
> > sibling. Do we also
> > need to change this:
> > if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> > if ((sd->flags & SD_ASYM_CPUCAPACITY))
> > sd->flags &= ~SD_PREFER_SIBLING;
> >
>
> Yu,
>
> I think you are talking about the code in sd_init()
> where SD_PREFER_SIBLING is first set
> to "ON" and updated depending on SD_ASYM_CPUCAPACITY. The intention
> of the code
> is if there are cpus in the scheduler domain that have differing cpu
> capacities,
> we do not want to do spreading among the child groups in the sched
> domain.
> So the flag is turned off in the child group level and not the parent
> level. But with your above
> change, the parent's flag is turned off, leaving the child level flag
> on.
> This moves the level where spreading happens (SD_PREFER_SIBLING on)
> up one level which is undesired (see table below).
>
>
Sorry got a bad mail client messing up the table format. Updated below
              SD_ASYM_CPUCAPACITY    SD_PREFER_SIBLING after init
                                      original code    proposed
SD Level
root          ON                      ON               OFF (note: SD_PREFER_SIBLING unused at this level)
first level   ON                      OFF              OFF
second level  OFF                     OFF              ON
third level   OFF                     ON               ON
Tim
On Thu, Feb 09, 2023 at 12:53:41PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 08, 2023 at 08:56:32AM +0100, Vincent Guittot wrote:
>
> > > + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > > + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> >
> > This 2nd if could be merged with the upper one
>
> Wasn't this exact same condition used in the previous patch as well?
> Does it wants to be a helper perhaps?
Patch 3 focuses on the destination CPU: make sure that it and its SMT
siblings are idle before attempting to do asym_packing balance.
This patch focuses on the busiest group: if the busiest group is an SMT
core with more than one busy sibling, help it even if it has higher
priority.
On Wed, Feb 08, 2023 at 08:56:32AM +0100, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
> <[email protected]> wrote:
> >
> > Using asym_packing priorities within an SMT core is straightforward. Just
> > follow the priorities that hardware indicates.
> >
> > When balancing load from an SMT core, also consider the idle state of its
> > siblings. Priorities do not reflect that an SMT core divides its throughput
> > among all its busy siblings. They only make sense when exactly one sibling
> > is busy.
> >
> > Indicate that active balance is needed if the destination CPU has lower
> > priority than the source CPU but the latter has busy SMT siblings.
> >
> > Make find_busiest_queue() not skip higher-priority SMT cores with more than
> > one busy sibling.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Suggested-by: Valentin Schneider <[email protected]>
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * Introduced this patch.
> >
> > Changes since v1:
> > * N/A
> > ---
> > kernel/sched/fair.c | 31 ++++++++++++++++++++++++++-----
> > 1 file changed, 26 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 80c86462c6f6..c9d0ddfd11f2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > nr_running == 1)
> > continue;
> >
> > - /* Make sure we only pull tasks from a CPU of lower priority */
> > + /*
> > + * Make sure we only pull tasks from a CPU of lower priority
> > + * when balancing between SMT siblings.
> > + *
> > + * If balancing between cores, let lower priority CPUs help
> > + * SMT cores with more than one busy sibling.
> > + */
> > if ((env->sd->flags & SD_ASYM_PACKING) &&
> > sched_asym_prefer(i, env->dst_cpu) &&
> > - nr_running == 1)
> > - continue;
> > + nr_running == 1) {
> > + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
>
> This 2nd if could be merged with the upper one
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10518,11 +10518,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  */
>  if ((env->sd->flags & SD_ASYM_PACKING) &&
>      sched_asym_prefer(i, env->dst_cpu) &&
> -    nr_running == 1) {
> -	if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> -	    (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> +    (nr_running == 1) &&
> +    (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> +     (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i))))
>  	continue;
> -}
>
> switch (env->migration_type) {
> case migrate_load:
> ---
>
> AFAICT, you can even remove one env->sd->flags & SD_SHARE_CPUCAPACITY
> test with the below, but this makes the condition far less obvious
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a6021af9de11..7dfa30c45327 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10518,11 +10518,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  */
>  if ((env->sd->flags & SD_ASYM_PACKING) &&
>      sched_asym_prefer(i, env->dst_cpu) &&
> -    nr_running == 1) {
> -	if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> -	    (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> +    (nr_running == 1) &&
> +    !(!(env->sd->flags & SD_SHARE_CPUCAPACITY) &&
> +      !is_core_idle(i)))
>  	continue;
I agree. This expression is equivalent to what I proposed. It is less
obvious but the comment above clarifies what is going on. I will take
your suggestion.
Thanks and BR,
Ricardo
On Thu, Feb 09, 2023 at 03:05:03PM -0800, Tim Chen wrote:
> On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > > >  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) {
> > > > -        struct sched_domain *child = env->sd->child;
> > > >          struct sched_group *sg = env->sd->groups;
> > > >          struct sg_lb_stats *local = &sds->local_stat;
> > > >          struct sg_lb_stats tmp_sgs;
> > > > @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > > >                  sg = sg->next;
> > > >          } while (sg != env->sd->groups);
> > > >
> > > > -        /* Tag domain that child domain prefers tasks go to siblings first */
> > > > -        sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > > > -
> > > > +        /*
> > > > +         * Tag domain that @env::sd prefers to spread excess tasks among
> > > > +         * sibling sched groups.
> > > > +         */
> > > > +        sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > > >
> > > This does help fix the issue that non-SMT core fails to pull task
> > > from busy SMT-
> > > cores.
> > > And it also semantically changes the definition of prefer
> > > sibling. Do we also
> > > need to change this:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > >                sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY))
> > >                sd->flags &= ~SD_PREFER_SIBLING;
> > >
> >
> > Yu,
> >
> > I think you are talking about the code in sd_init()
> > where SD_PREFER_SIBLING is first set
> > to "ON" and updated depending on SD_ASYM_CPUCAPACITY.? The intention
> > of the code
> > is if there are cpus in the scheduler domain that have differing cpu
> > capacities,
> > we do not want to do spreading among the child groups in the sched
> > domain.
> > So the flag is turned off in the child group level and not the parent
> > level. But with your above
> > change, the parent's flag is turned off, leaving the child level flag
> > on.
> > This moves the level where spreading happens (SD_PREFER_SIBLING on)
> > up one level which is undesired (see table below).
But my patch moves the level at which we act on prefer_sibling: it now
checks the SD_PREFER_SIBLING flag at the current level, not its child.
Thus, removing SD_PREFER_SIBLING from a level with SD_ASYM_CPUCAPACITY
prevents spreading among CPUs of different CPU capacity, no?
Thanks and BR,
Ricardo
On 2023-02-09 at 15:05:03 -0800, Tim Chen wrote:
> On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > > >  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) {
> > > > -        struct sched_domain *child = env->sd->child;
> > > >          struct sched_group *sg = env->sd->groups;
> > > >          struct sg_lb_stats *local = &sds->local_stat;
> > > >          struct sg_lb_stats tmp_sgs;
> > > > @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > > >                  sg = sg->next;
> > > >          } while (sg != env->sd->groups);
> > > >
> > > > -        /* Tag domain that child domain prefers tasks go to siblings first */
> > > > -        sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > > > -
> > > > +        /*
> > > > +         * Tag domain that @env::sd prefers to spread excess tasks among
> > > > +         * sibling sched groups.
> > > > +         */
> > > > +        sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > > >
> > > This does help fix the issue that non-SMT core fails to pull task
> > > from busy SMT-
> > > cores.
> > > And it also semantically changes the definition of prefer
> > > sibling. Do we also
> > > need to change this:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > >                sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY))
> > >                sd->flags &= ~SD_PREFER_SIBLING;
> > >
> >
> > Yu,
> >
> > I think you are talking about the code in sd_init()
> > where SD_PREFER_SIBLING is first set
> > to "ON" and updated depending on SD_ASYM_CPUCAPACITY. The intention
> > of the code is that if there are CPUs in the scheduler domain with
> > differing CPU capacities, we do not want to do spreading among the
> > child groups in the sched domain.
> > So the flag is turned off at the child group level and not the parent
> > level. But with your above change, the parent's flag is turned off,
> > leaving the child level flag on.
> > This moves the level where spreading happens (SD_PREFER_SIBLING on)
> > up one level, which is undesired (see table below).
> >
Yes, it moves the flag 1 level up. And if I understand correctly, with Ricardo's patch
applied, we have changed the original meaning of SD_PREFER_SIBLING:
Original: Tasks in this sched domain want to be migrated to another sched domain.
After init change: Tasks in the sched group under this sched domain want to
be migrated to a sibling group.
> >
> Sorry, a bad mail client messed up the table format. Updated below:
>
> SD Level        SD_ASYM_CPUCAPACITY    SD_PREFER_SIBLING after init
>                                        original code    proposed
> root            ON                     ON               OFF (note: SD_PREFER_SIBLING unused at this level)
SD_PREFER_SIBLING is honored at the root level after the init proposal.
> first level     ON                     OFF              OFF
Before the proposed init change, tasks in a first-level sd do not want
to be spread to a sibling sd. After the proposed change, tasks
in all sched groups under the root sd do not want to be spread
to a sibling sched group (AKA a first-level sd).
thanks,
Chenyu
> second level    OFF                    OFF              ON
> third level     OFF                    ON               ON
>
> Tim
On Thu, Feb 09, 2023 at 04:43:33PM -0800, Ricardo Neri wrote:
> On Thu, Feb 09, 2023 at 12:53:41PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 08, 2023 at 08:56:32AM +0100, Vincent Guittot wrote:
> >
> > > > + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > > > + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> > >
> > > This 2nd if could be merged with the upper one
> >
> > Wasn't this exact same condition used in the previous patch as well?
> > Does it want to be a helper perhaps?
>
> Patch 3 focuses on the destination CPU: make sure that it and its SMT
> siblings are idle before attempting to do asym_packing balance.
>
> This patch focuses on the busiest group: if the busiest group is an SMT
> core with more than one busy sibling, help it even if it has higher
> priority.
Yeah, so? If it's a recurring expression, a helper never hurts.
On Mon, Feb 06, 2023 at 08:58:34PM -0800, Ricardo Neri wrote:
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>
> Above the SMT sched domain, all domains have a child. The SD_PREFER_
> SIBLING flag is always honored regardless of the scheduling domain at which the
> load balance takes place.
>
> There are cases, however, in which the busiest CPU's sched domain has a
> child but the destination CPU's does not. Consider, for instance, a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
>
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
>
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.
>
Like many of the others; I don't much like this.
Why not simply detect this asymmetric having of SMT and kill the
PREFER_SIBLING flag on the SMT leafs in that case?
Specifically, I'm thinking something in the degenerate area where it
looks if a given domain has equal depth children or so.
Note that this should not be tied to having special hardware, you can
create the very same weirdness by just offlining a few SMT siblings and
leaving a few on.
On Fri, Feb 10, 2023 at 09:41:14AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 09, 2023 at 04:43:33PM -0800, Ricardo Neri wrote:
> > On Thu, Feb 09, 2023 at 12:53:41PM +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 08, 2023 at 08:56:32AM +0100, Vincent Guittot wrote:
> > >
> > > > > + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > > > > + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> > > >
> > > > This 2nd if could be merged with the upper one
> > >
> > > Wasn't this exact same condition used in the previous patch as well?
> > > Does it want to be a helper perhaps?
> >
> > Patch 3 focuses on the destination CPU: make sure that it and its SMT
> > siblings are idle before attempting to do asym_packing balance.
> >
> > This patch focuses on the busiest group: if the busiest group is an SMT
> > core with more than one busy sibling, help it even if it has higher
> > priority.
>
> Yeah, so? If it's a recurring expression, a helper never hurts.
Ah! I get your point now. You meant a helper function. Thank you for the
clarification.
Sure! I can add this helper function.
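[ For reference, a minimal sketch of the helper being agreed on here; the
name sched_use_asym_prio() and the exact shape are illustrative, not taken
from a posted patch: ]

static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
{
	/* Within an SMT core, CPU priority is always meaningful. */
	if (sd->flags & SD_SHARE_CPUCAPACITY)
		return true;

	/* Between cores, only defer to priority if the whole core is idle. */
	return is_core_idle(cpu);
}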
On Wed, Feb 08, 2023 at 08:48:05AM +0100, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
> <[email protected]> wrote:
> >
> > SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> > non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
> >
> > Above the SMT sched domain, all domains have a child. The SD_PREFER_
> > SIBLING flag is always honored regardless of the scheduling domain at which the
> > load balance takes place.
> >
> > There are cases, however, in which the busiest CPU's sched domain has a
> > child but the destination CPU's does not. Consider, for instance, a non-SMT
> > core (or an SMT core with only one online sibling) doing load balance with
> > an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> > left with a fully busy SMT core and an idle non-SMT core.
> >
> > Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> > scheduling domain, not its child.
> >
> > The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> > will not spread load among NUMA sched groups, as desired.
>
> This is a significant change in the behavior of the numa system. It
> would be good to get figures or confirmation that demonstrate that
> it's ok to remove prefer_sibling behavior at the 1st numa level.
>
Thank you very much for your feedback Vincent!
You are right. It does change the behavior at the first numa level. Peter
did not like this change either. It also made things confusing for
SD_ASYM_CPUCAPACITY vs SD_PREFER_SIBLING.
I'll work on a different approach.
On 10/02/23 11:08, Peter Zijlstra wrote:
> On Mon, Feb 06, 2023 at 08:58:34PM -0800, Ricardo Neri wrote:
>> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
>> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>>
>> Above the SMT sched domain, all domains have a child. The SD_PREFER_
>> SIBLING flag is always honored regardless of the scheduling domain at which the
>> load balance takes place.
>>
>> There are cases, however, in which the busiest CPU's sched domain has a
>> child but the destination CPU's does not. Consider, for instance, a non-SMT
>> core (or an SMT core with only one online sibling) doing load balance with
>> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
>> left with a fully busy SMT core and an idle non-SMT core.
>>
>> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
>> scheduling domain, not its child.
>>
>> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
>> will not spread load among NUMA sched groups, as desired.
>>
>
> Like many of the others; I don't much like this.
>
> Why not simply detect this asymmetric having of SMT and kill the
> PREFER_SIBLING flag on the SMT leafs in that case?
>
> Specifically, I'm thinking something in the degenerate area where it
> looks if a given domain has equal depth children or so.
>
> Note that this should not be tied to having special hardware, you can
> create the very same weirdness by just offlining a few SMT siblings and
> leaving a few on.
So something like have SD_PREFER_SIBLING affect the SD it's on (and not
its parent), but remove it from the lowest non-degenerated topology level?
(+ add it to the first NUMA level to keep things as they are, even if TBF I
find relying on it for NUMA balancing a bit odd).
On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> its parent), but remove it from the lowest non-degenerated topology level?
So I was rather confused about the whole moving it between levels things
this morning -- conceptually, prefer siblings says you want to try
sibling domains before filling up your current domain. Now, balancing
between siblings happens one level up, hence looking at child->flags
makes perfect sense.
But looking at the current domain and still calling it prefer sibling
makes absolutely no sense what so ever.
In that confusion I think I also got the polarity wrong: I thought you
wanted to kill prefer_sibling for the asymmetric SMT cases; instead you
want to force enable it as long as there is one SMT child around.
Whichever way around it we do it, I'm thinking perhaps some renaming
might be in order to clarify things.
How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
of the behaviour, but have it be set by children with SD_PREFER_SIBLING
or something.
OTOH, there's also
if (busiest->group_weight == 1 || sds->prefer_sibling) {
which explicitly also takes the group-of-one (the !child case) into
account, but that's not consistently done.
sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;
seems an interesting option, except perhaps ASYM_CPUCAPACITY -- I
forget, can CPUs of different capacity be in the same leaf domain? With
big.LITTLE I think not, they had their own cache domains and so you get
at least MC domains per capacity, but DynamiQ might have totally wrecked
that party.
> (+ add it to the first NUMA level to keep things as they are, even if TBF I
> find relying on it for NUMA balancing a bit odd).
Arguably it ought to perhaps be one of those node_reclaim_distance
things. The thing is that NUMA-1 is often fairly quick, esp. these days
where it's basically on die numa.
On 10/02/23 17:53, Peter Zijlstra wrote:
> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
>
>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
>> its parent), but remove it from the lowest non-degenerated topology level?
>
> So I was rather confused about the whole moving it between levels things
> this morning -- conceptually, prefer siblings says you want to try
> sibling domains before filling up your current domain. Now, balancing
> between siblings happens one level up, hence looking at child->flags
> makes perfect sense.
>
> But looking at the current domain and still calling it prefer sibling
> makes absolutely no sense what so ever.
>
True :-)
> In that confusion I think I also got the polarity wrong: I thought you
> wanted to kill prefer_sibling for the asymmetric SMT cases; instead you
> want to force enable it as long as there is one SMT child around.
>
> Whichever way around it we do it, I'm thinking perhaps some renaming
> might be in order to clarify things.
>
> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> or something.
>
Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
something along those lines.
> OTOH, there's also
>
> if (busiest->group_weight == 1 || sds->prefer_sibling) {
>
> which explicitly also takes the group-of-one (the !child case) into
> account, but that's not consistently done.
>
> sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;
>
> seems an interesting option,
> except perhaps ASYM_CPUCAPACITY -- I
> forget, can CPUs of different capacity be in the same leaf domain? With
> big.LITTLE I think not, they had their own cache domains and so you get
> at least MC domains per capacity, but DynamiQ might have totally wrecked
> that party.
Yeah, newer systems can have different capacities in one MC domain, cf:
b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
>
>> (+ add it to the first NUMA level to keep things as they are, even if TBF I
>> find relying on it for NUMA balancing a bit odd).
>
> Arguably it ought to perhaps be one of those node_reclaim_distance
> things. The thing is that NUMA-1 is often fairly quick, esp. these days
> where it's basically on die numa.
Right, makes sense, thanks.
On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> On 10/02/23 17:53, Peter Zijlstra wrote:
> > On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> >
> >> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> >> its parent), but remove it from the lowest non-degenerated topology level?
> >
> > So I was rather confused about the whole moving it between levels things
> > this morning -- conceptually, prefer siblings says you want to try
> > sibling domains before filling up your current domain. Now, balancing
> > between siblings happens one level up, hence looking at child->flags
> > makes perfect sense.
> >
> > But looking at the current domain and still calling it prefer sibling
> > makes absolutely no sense what so ever.
> >
>
> True :-)
>
> > In that confusion I think I also got the polarity wrong: I thought you
> > wanted to kill prefer_sibling for the asymmetric SMT cases; instead you
> > want to force enable it as long as there is one SMT child around.
Exactly.
> >
> > Whichever way around it we do it, I'm thinking perhaps some renaming
> > might be in order to clarify things.
> >
> > How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> > of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> > or something.
> >
>
> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> something along those lines.
I sense a consensus towards SD_SPREAD_TASKS.
>
> > OTOH, there's also
> >
> > if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >
> > which explicitly also takes the group-of-one (the !child case) into
> > account, but that's not consistently done.
> >
> > sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;
This would need a special provision for SD_ASYM_CPUCAPACITY.
> >
> > seems an interesting option,
>
> > except perhaps ASYM_CPUCAPACITY -- I
> > forget, can CPUs of different capacity be in the same leaf domain? With
> > big.LITTLE I think not, they had their own cache domains and so you get
> > at least MC domains per capacity, but DynamiQ might have totally wrecked
> > that party.
>
> Yeah, newer systems can have different capacities in one MC domain, cf:
>
> b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
>
> >
> >> (+ add it to the first NUMA level to keep things as they are, even if TBF I
> >> find relying on it for NUMA balancing a bit odd).
> >
> > Arguably it ought to perhaps be one of those node_reclaim_distance
> > things. The thing is that NUMA-1 is often fairly quick, esp. these days
> > where it's basically on die numa.
To conserve the current behavior the NUMA level would need to have
SD_SPREAD_TASKS. It will be cleared along with SD_BALANCE_{EXEC, FORK} and
SD_WAKE_AFFINE if the numa distance is larger than node_reclaim_distance,
yes?
Thanks and BR,
Ricardo
On 10/02/2023 19:31, Ricardo Neri wrote:
> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
>> On 10/02/23 17:53, Peter Zijlstra wrote:
>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
>>>
>>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
>>>> its parent), but remove it from the lowest non-degenerated topology level?
>>>
>>> So I was rather confused about the whole moving it between levels things
>>> this morning -- conceptually, prefer siblings says you want to try
>>> sibling domains before filling up your current domain. Now, balancing
>>> between siblings happens one level up, hence looking at child->flags
>>> makes perfect sense.
>>>
>>> But looking at the current domain and still calling it prefer sibling
>>> makes absolutely no sense what so ever.
>>>
>>
>> True :-)
>>
>>> In that confusion I think I also got the polarity wrong: I thought you
>>> wanted to kill prefer_sibling for the asymmetric SMT cases; instead you
>>> want to force enable it as long as there is one SMT child around.
>
> Exactly.
>
>>>
>>> Whichever way around it we do it, I'm thinking perhaps some renaming
>>> might be in order to clarify things.
>>>
>>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
>>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
>>> or something.
>>>
>>
>> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
>> something along those lines.
>
> I sense a consensus towards SD_SPREAD_TASKS.
Can you not detect the E-core dst_cpu case on MC with:
+ if (child)
+ sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
+ else if (sds->busiest)
+ sds->prefer_sibling = sds->busiest->group_weight > 1;
+
[...]
On 07/02/2023 05:58, Ricardo Neri wrote:
[...]
> @@ -9269,33 +9264,11 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> struct sched_group *sg)
> {
> #ifdef CONFIG_SCHED_SMT
> - bool local_is_smt;
> int sg_busy_cpus;
>
> - local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
>
> - if (!local_is_smt) {
> - /*
> - * If we are here, @dst_cpu is idle and does not have SMT
> - * siblings. Pull tasks if candidate group has two or more
> - * busy CPUs.
> - */
> - if (sg_busy_cpus >= 2) /* implies sg_is_smt */
> - return true;
> -
> - /*
> - * @dst_cpu does not have SMT siblings. @sg may have SMT
> - * siblings and only one is busy. In such case, @dst_cpu
> - * can help if it has higher priority and is idle (i.e.,
> - * it has no running tasks).
> - */
> - return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> - }
> -
> /*
> - * @dst_cpu has SMT siblings and are also idle.
> - *
> * If the difference in the number of busy CPUs is two or more, let
> * find_busiest_group() take care of it. We only care if @sg has
> * exactly one busy CPU. This covers SMT and non-SMT sched groups.
Can't this be made lighter by removing asym_smt_can_pull_tasks() and
putting the logic to exclude the call to sched_asym_prefer() into
sched_asym() directly?
Not sure if we need the CONFIG_SCHED_SMT since it's all guarded by
`flags & SD_SHARE_CPUCAPACITY` already, which is only set under
CONFIG_SCHED_SMT.
static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds,
	   struct sg_lb_stats *sgs, struct sched_group *group)
{
	bool local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;

	if (local_is_smt && !is_core_idle(env->dst_cpu))
		return false;

	if ((local_is_smt || group->flags & SD_SHARE_CPUCAPACITY)) {
		int sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;

		if (sg_busy_cpus != 1)
			return false;
	}

	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}
[...]
On 07/02/2023 05:58, Ricardo Neri wrote:
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 80c86462c6f6..c9d0ddfd11f2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> nr_running == 1)
> continue;
>
> - /* Make sure we only pull tasks from a CPU of lower priority */
> + /*
> + * Make sure we only pull tasks from a CPU of lower priority
> + * when balancing between SMT siblings.
> + *
> + * If balancing between cores, let lower priority CPUs help
> + * SMT cores with more than one busy sibling.
> + */
> if ((env->sd->flags & SD_ASYM_PACKING) &&
> sched_asym_prefer(i, env->dst_cpu) &&
> - nr_running == 1)
> - continue;
> + nr_running == 1) {
> + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> + continue;
is_core_idle(i) returns true for !CONFIG_SCHED_SMT. So far it was always
guarded by `flags & SD_SHARE_CPUCAPACITY` which is only set for
CONFIG_SCHED_SMT.
Here it's different but still depends on `flags & SD_ASYM_PACKING`.
Can we have SD_ASYM_PACKING w/o CONFIG_SCHED_SMT? The comment just says
`If balancing between cores (MC), let lower priority CPUs help SMT cores
with more than one busy sibling.`
So this only mentions your specific asymmetric e-cores w/o SMT and
p-cores w/ SMT case.
I'm asking since numa_idle_core(), the only user of is_core_idle() so
far has an extra `!static_branch_likely(&sched_smt_present)` condition
before calling it.
> + }
>
> switch (env->migration_type) {
> case migrate_load:
> @@ -10530,8 +10539,20 @@ asym_active_balance(struct lb_env *env)
> * lower priority CPUs in order to pack all tasks in the
> * highest priority CPUs.
> */
> - return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
> - sched_asym_prefer(env->dst_cpu, env->src_cpu);
> + if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
> + /* Always obey priorities between SMT siblings. */
> + if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> + return sched_asym_prefer(env->dst_cpu, env->src_cpu);
> +
> + /*
> + * A lower priority CPU can help an SMT core with more than one
> + * busy sibling.
> + */
> + return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> + !is_core_idle(env->src_cpu);
Here it is similar.
> + }
> +
> + return false;
On Mon, Feb 13, 2023 at 01:44:20PM +0100, Dietmar Eggemann wrote:
> On 07/02/2023 05:58, Ricardo Neri wrote:
>
> [...]
>
> > @@ -9269,33 +9264,11 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > struct sched_group *sg)
> > {
> > #ifdef CONFIG_SCHED_SMT
> > - bool local_is_smt;
> > int sg_busy_cpus;
> >
> > - local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> >
> > - if (!local_is_smt) {
> > - /*
> > - * If we are here, @dst_cpu is idle and does not have SMT
> > - * siblings. Pull tasks if candidate group has two or more
> > - * busy CPUs.
> > - */
> > - if (sg_busy_cpus >= 2) /* implies sg_is_smt */
> > - return true;
> > -
> > - /*
> > - * @dst_cpu does not have SMT siblings. @sg may have SMT
> > - * siblings and only one is busy. In such case, @dst_cpu
> > - * can help if it has higher priority and is idle (i.e.,
> > - * it has no running tasks).
> > - */
> > - return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > - }
> > -
> > /*
> > - * @dst_cpu has SMT siblings and are also idle.
> > - *
> > * If the difference in the number of busy CPUs is two or more, let
> > * find_busiest_group() take care of it. We only care if @sg has
> > * exactly one busy CPU. This covers SMT and non-SMT sched groups.
>
> Can't this be made lighter by removing asym_smt_can_pull_tasks() and
> putting the logic to exclude the call to sched_asym_prefer() into
> sched_asym() directly?
Yes, you are right. asym_smt_can_pull_tasks() was simplified significantly.
I'll take your suggestion.
> Not sure if we need the CONFIG_SCHED_SMT since it's all guarded by
> `flags & SD_SHARE_CPUCAPACITY` already, which is only set under
> CONFIG_SCHED_SMT.
Yes, asym_smt_can_pull_tasks() now cares for a very specific case, which
only happens with CONFIG_SCHED_SMT. I'll remove the !CONFIG_SCHED_SMT part.
>
> static inline bool
> sched_asym(struct lb_env *env, struct sd_lb_stats *sds,
>	     struct sg_lb_stats *sgs, struct sched_group *group)
> {
> 	bool local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
>
> 	if (local_is_smt && !is_core_idle(env->dst_cpu))
> 		return false;
>
> 	if ((local_is_smt || group->flags & SD_SHARE_CPUCAPACITY)) {
> 		int sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
>
> 		if (sg_busy_cpus != 1)
> 			return false;
> 	}
>
> 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
I'll take your suggestion. Thanks!
On Mon, Feb 13, 2023 at 02:40:24PM +0100, Dietmar Eggemann wrote:
> On 07/02/2023 05:58, Ricardo Neri wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 80c86462c6f6..c9d0ddfd11f2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > nr_running == 1)
> > continue;
> >
> > - /* Make sure we only pull tasks from a CPU of lower priority */
> > + /*
> > + * Make sure we only pull tasks from a CPU of lower priority
> > + * when balancing between SMT siblings.
> > + *
> > + * If balancing between cores, let lower priority CPUs help
> > + * SMT cores with more than one busy sibling.
> > + */
> > if ((env->sd->flags & SD_ASYM_PACKING) &&
> > sched_asym_prefer(i, env->dst_cpu) &&
> > - nr_running == 1)
> > - continue;
> > + nr_running == 1) {
> > + if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > + (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> > + continue;
>
> is_core_idle(i) returns true for !CONFIG_SCHED_SMT. So far it was always
> guarded by `flags & SD_SHARE_CPUCAPACITY` which is only set for
> CONFIG_SCHED_SMT.
>
> Here it's different but still depends on `flags & SD_ASYM_PACKING`.
>
> Can we have SD_ASYM_PACKING w/o CONFIG_SCHED_SMT? The comment just says
> `If balancing between cores (MC), let lower priority CPUs help SMT cores
> with more than one busy sibling.`
We cannot have SD_ASYM_PACKING w/o CONFIG_SMP. We may have it without
CONFIG_SCHED_SMT. In the latter case we want is_core_idle() to return true
as there are no SMT siblings competing for core throughput and CPU priority
is meaningful. I can add an extra comment clarifying the !CONFIG_SCHED_SMT case.
>
> So this only mentions your specific asymmetric e-cores w/o SMT and
> p-cores w/ SMT case.
>
> I'm asking since numa_idle_core(), the only user of is_core_idle() so
> far has an extra `!static_branch_likely(&sched_smt_present)` condition
> before calling it.
That is a good point. Calling is_core_idle() is pointless if
!static_branch_likely(&sched_smt_present).
As per feedback from Vincent and Peter, I have put this logic in a helper
function. I'll add an extra check for this static key.
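[ With the static key check folded in, the helper sketched earlier would
become roughly the following; again a sketch, not the final patch: ]

static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
{
	/* Without SMT, there are no siblings to check. */
	if (!static_branch_likely(&sched_smt_present))
		return true;

	return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
}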
>
> > + }
> >
> > switch (env->migration_type) {
> > case migrate_load:
> > @@ -10530,8 +10539,20 @@ asym_active_balance(struct lb_env *env)
> > * lower priority CPUs in order to pack all tasks in the
> > * highest priority CPUs.
> > */
> > - return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
> > - sched_asym_prefer(env->dst_cpu, env->src_cpu);
> > + if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
> > + /* Always obey priorities between SMT siblings. */
> > + if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> > + return sched_asym_prefer(env->dst_cpu, env->src_cpu);
> > +
> > + /*
> > + * A lower priority CPU can help an SMT core with more than one
> > + * busy sibling.
> > + */
> > + return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> > + !is_core_idle(env->src_cpu);
>
> Here it is similar.
I will use my helper function here as well.
Thanks and BR,
Ricardo
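[ For illustration, the two call sites could then collapse as follows,
assuming the hypothetical sched_use_asym_prio() sketched above: ]

	/* find_busiest_queue(): skip CPU @i only when its priority applies. */
	if ((env->sd->flags & SD_ASYM_PACKING) &&
	    sched_asym_prefer(i, env->dst_cpu) &&
	    nr_running == 1 &&
	    sched_use_asym_prio(env->sd, i))
		continue;

	/* asym_active_balance(): a lower-priority dst may help a busy SMT src. */
	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
	       (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
		!sched_use_asym_prio(env->sd, env->src_cpu));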
On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> On 10/02/2023 19:31, Ricardo Neri wrote:
> > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> >> On 10/02/23 17:53, Peter Zijlstra wrote:
> >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> >>>
> >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> >>>> its parent), but remove it from the lowest non-degenerated topology level?
> >>>
> >>> So I was rather confused about the whole moving it between levels things
> >>> this morning -- conceptually, prefer siblings says you want to try
> >>> sibling domains before filling up your current domain. Now, balancing
> >>> between siblings happens one level up, hence looking at child->flags
> >>> makes perfect sense.
> >>>
> >>> But looking at the current domain and still calling it prefer sibling
> >>> makes absolutely no sense what so ever.
> >>>
> >>
> >> True :-)
> >>
> >>> In that confusion I think I also got the polarity wrong: I thought you
> >>> wanted to kill prefer_sibling for the asymmetric SMT cases; instead you
> >>> want to force enable it as long as there is one SMT child around.
> >
> > Exactly.
> >
> >>>
> >>> Whichever way around it we do it, I'm thinking perhaps some renaming
> >>> might be in order to clarify things.
> >>>
> >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> >>> or something.
> >>>
> >>
> >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> >> something along those lines.
> >
> > > I sense a consensus towards SD_SPREAD_TASKS.
>
> Can you not detect the E-core dst_cpu case on MC with:
>
> + if (child)
> + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> + else if (sds->busiest)
> + sds->prefer_sibling = sds->busiest->group_weight > 1;
Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
is set based on the flags of the destination CPU's sched domain. But when
used in find_busiest_group() tasks are spread from the busiest group's
child domain.
Your proposed code also needs a check for SD_PREFER_SIBLING, no?
Thanks and BR,
Ricardo
On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> > On 10/02/2023 19:31, Ricardo Neri wrote:
> > > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> > >> On 10/02/23 17:53, Peter Zijlstra wrote:
> > >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> > >>>
> > >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> > >>>> its parent), but remove it from the lowest non-degenerated topology level?
> > >>>
> > >>> So I was rather confused about the whole moving it between levels things
> > >>> this morning -- conceptually, prefer siblings says you want to try
> > >>> sibling domains before filling up your current domain. Now, balancing
> > >>> between siblings happens one level up, hence looking at child->flags
> > >>> makes perfect sense.
> > >>>
> > >>> But looking at the current domain and still calling it prefer sibling
> > >>> makes absolutely no sense what so ever.
> > >>>
> > >>
> > >> True :-)
> > >>
> > >>> In that confusion I think I also got the polarity wrong: I thought you
> > >>> wanted to kill prefer_sibling for the asymmetric SMT cases; instead you
> > >>> want to force enable it as long as there is one SMT child around.
> > >
> > > Exactly.
> > >
> > >>>
> > >>> Whichever way around it we do it, I'm thinking perhaps some renaming
> > >>> might be in order to clarify things.
> > >>>
> > >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> > >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> > >>> or something.
> > >>>
> > >>
> > >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> > >> something along those lines.
> > >
> > > I sense a consensus towards SD_SPREAD_TASKS.
> >
> > Can you not detect the E-core dst_cpu case on MC with:
> >
> > + if (child)
> > + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> > + else if (sds->busiest)
> > + sds->prefer_sibling = sds->busiest->group_weight > 1;
>
> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
> is set based on the flags of the destination CPU's sched domain. But when
> used in find_busiest_group() tasks are spread from the busiest group's
> child domain.
>
> Your proposed code also needs a check for SD_PREFER_SIBLING, no?
I tweaked the solution that Dietmar proposed:
- sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+ if (sds->busiest)
+ sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
This comes from the observation that the prefer_sibling setting acts on the
busiest group. It then depends on whether the busiest group, not the local
group, has child sched domains. Today it works because in most cases
both the local and the busiest groups have child domains with the SD_
PREFER_SIBLING flag.
This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
prefer_sibling would not be set in that case.
It would also conserve the current behavior at the NUMA level. We would
not need to implement SD_SPREAD_TASKS.
This would both fix the SMT vs non-SMT bug and be less invasive.
Thoughts?
Thanks and BR,
Ricardo
On Wed, Feb 15, 2023 at 09:21:05PM -0800, Ricardo Neri wrote:
> I tweaked the solution that Dietmar proposed:
>
> - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> + if (sds->busiest)
> + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
>
> This comes from the observation that the prefer_sibling setting acts on the
> busiest group. It then depends on whether the busiest group, not the local
> group, has child sched domains. Today it works because in most cases
> both the local and the busiest groups have child domains with the SD_
> PREFER_SIBLING flag.
>
> This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> prefer_sibling would not be set in that case.
>
> It would also conserve the current behavior at the NUMA level. We would
> not need to implement SD_SPREAD_TASKS.
>
> This would both fix the SMT vs non-SMT bug and be less invasive.
>
> Thoughts?
That does look nice. Be sure to put in a nice comment too.
On Thu, Feb 16, 2023 at 01:16:42PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 15, 2023 at 09:21:05PM -0800, Ricardo Neri wrote:
>
> > I tweaked the solution that Dietmar proposed:
> >
> > - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > + if (sds->busiest)
> > + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
> >
> > This comes from the observation that the prefer_sibling setting acts on the
> > busiest group. It then depends on whether the busiest group, not the local
> > group, has child sched domains. Today it works because in most cases
> > both the local and the busiest groups have child domains with the SD_
> > PREFER_SIBLING flag.
> >
> > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> > prefer_sibling would not be set in that case.
> >
> > It would also conserve the current behavior at the NUMA level. We would
> > not need to implement SD_SPREAD_TASKS.
> >
> > This would both fix the SMT vs non-SMT bug and be less invasive.
> >
> > Thoughts?
>
> That does look nice. Be sure to put in a nice comment too.
Will do!
On 20/02/23 15:22, hupu wrote:
> From: Peter Zijlstra <[email protected]>
>> Specifically, I'm thinking something in the degenerate area where it
>> looks if a given domain has equal depth children or so.
>
> By the way, in the last email you mentioned "the degenerate area". I can't understand what it means because of cultural differences. Does it refer to this scenario: sched_domain at the SMT layer is destroyed after an SMT thread goes offline, so mc_domain->child is NULL?
>
This references sched domain degeneration, cf. sd_parent_degenerate() and
sd_degenerate() in sched/topology.c
Sched domains are built following a layout described in
sched_domain_topology[], and depending on the actual CPU layout some
domains end up not being worth keeping (e.g. they only contain one CPU - we
need at least 2 to balance things), so we destroy (degenerate) them.
> hupu
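[ A greatly simplified sketch of that check; the real sd_degenerate() in
kernel/sched/topology.c also special-cases flags such as SD_WAKE_AFFINE: ]

static bool sd_degenerate_sketch(struct sched_domain *sd)
{
	/* A domain spanning a single CPU has nothing to balance. */
	if (cpumask_weight(sched_domain_span(sd)) == 1)
		return true;

	/* Flags that need groups are pointless with a single group. */
	return !(sd->flags & SD_DEGENERATE_GROUPS_MASK) ||
	       sd->groups == sd->groups->next;
}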
On 21/02/23 17:49, tmmdh wrote:
> From: Valentin Schneider <[email protected]>
> But I am still confused by Peter's description about prefer_sibling in the last email, linked at
> https://lore.kernel.org/all/Y+Z2b%[email protected]/
>> this morning -- conceptually, prefer siblings says you want to try
>> sibling domains before filling up your current domain.
>
> Why should we try sibling domains before filling up your current domain? Why does Peter think that sibling domains are better than the current domain?
> My understanding about this problem is described as follows, but I am not
> sure if it is correct. I think the sibling domain is a domain lower than
> the current level. Just like SMT is the sibling of MC, while DIE is the
> sibling of NUMA.
That's the wrong way around; going up (or down) the sched_domain hierarchy is
done via parent (or child) pointers. Sibling means going sideways (i.e. the
same topology level but viewed from a different CPU)
> Is it because the CPUs covered by sibling domains share more resources (such as cache), which can improve the performance of running tasks?
>
If sibling domains are in the same cache hierarchy then spreading tasks between
them can improve overall performance. That doesn't work with NUMA or
big.LITTLE domains, so we don't have the flag on those.
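[ Illustrative only — how the hierarchy is navigated from a given domain;
the variable names here are made up: ]

	struct sched_domain *up   = sd->parent;	/* wider span, e.g. DIE  */
	struct sched_domain *down = sd->child;	/* narrower span, e.g. SMT */
	struct sched_group *sg    = sd->groups;

	/* "Siblings" at this level are visited via the group ring: */
	do {
		/* each @sg spans the CPUs of one child-level domain */
		sg = sg->next;
	} while (sg != sd->groups);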
On 16/02/2023 06:21, Ricardo Neri wrote:
> On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
>> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
>>> On 10/02/2023 19:31, Ricardo Neri wrote:
>>>> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
>>>>> On 10/02/23 17:53, Peter Zijlstra wrote:
>>>>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
[...]
>>> Can you not detect the E-core dst_cpu case on MC with:
>>>
>>> + if (child)
>>> + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
>>> + else if (sds->busiest)
>>> + sds->prefer_sibling = sds->busiest->group_weight > 1;
>>
>> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
>> is set based on the flags of the destination CPU's sched domain. But when
>> used in find_busiest_group() tasks are spread from the busiest group's
>> child domain.
>>
>> Your proposed code also needs a check for SD_PREFER_SIBLING, no?
>
> I tweaked the solution that Dietmar proposed:
>
> - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> + if (sds->busiest)
> + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
Maybe:
sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
1 vs 2048 ?
> This comes from the observation that the prefer_sibling setting acts on the
> busiest group. It then depends on whether the busiest group, not the local
> group, has child sched domains. Today it works because in most cases
> both the local and the busiest groups have child domains with the SD_
> PREFER_SIBLING flag.
>
> This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> prefer_sibling would not be set in that case.
>
> It would also conserve the current behavior at the NUMA level. We would
> not need to implement SD_SPREAD_TASKS.
>
> This would both fix the SMT vs non-SMT bug and be less invasive.
Yeah, much better! I always forget that we have those flags on SGs now
as well. Luckily, we just need to check busiest sg to cover all cases.
On Thu, Feb 23, 2023 at 11:09:55AM +0100, Dietmar Eggemann wrote:
> On 16/02/2023 06:21, Ricardo Neri wrote:
> > On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
> >> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> >>> On 10/02/2023 19:31, Ricardo Neri wrote:
> >>>> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> >>>>> On 10/02/23 17:53, Peter Zijlstra wrote:
> >>>>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
>
> [...]
>
> >>> Can you not detect the E-core dst_cpu case on MC with:
> >>>
> >>> + if (child)
> >>> + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> >>> + else if (sds->busiest)
> >>> + sds->prefer_sibling = sds->busiest->group_weight > 1;
> >>
> >> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
> >> is set based on the flags of the destination CPU's sched domain. But when
> >> used in find_busiest_group() tasks are spread from the busiest group's
> >> child domain.
> >>
> >> Your proposed code also needs a check for SD_PREFER_SIBLING, no?
> >
> > I tweaked the solution that Dietmar proposed:
> >
> > - sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > + if (sds->busiest)
> > + sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
>
> Maybe:
>
> sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
>
> 1 vs 2048 ?
Sure, I can do this.
>
> > This comes from the observation that the prefer_sibling setting acts on the
> > busiest group. It then depends on whether the busiest group, not the local
> > group, has child sched domains. Today it works because in most cases
> > both the local and the busiest groups have child domains with the SD_
> > PREFER_SIBLING flag.
> >
> > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> > prefer_sibling would not be set in that case.
> >
> > It would also conserve the current behavior at the NUMA level. We would
> > not need to implement SD_SPREAD_TASKS.
> >
> > This would both fix the SMT vs non-SMT bug and be less invasive.
>
> Yeah, much better! I always forget that we have those flags on SGs now
> as well. Luckily, we just need to check busiest sg to cover all cases.
Right. I can add a comment to clarify from where the flags come.
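[ Putting this sub-thread together, the hunk would read roughly as follows;
a sketch combining the tweak, the !! normalization and the promised
comment, not the posted patch: ]

	/*
	 * Tasks are spread out of the busiest group's child domain, so
	 * honoring prefer_sibling depends on the flags of the busiest
	 * group (inherited from its child sched domain), not on the
	 * local group's.
	 */
	if (sds->busiest)
		sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);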
Hi Ricardo,
On Monday 06 Feb 2023 at 20:58:36 (-0800), Ricardo Neri wrote:
> Only x86 and Power7 use ASYM_PACKING. They use it differently.
>
> Power7 has cores of equal priority, but the SMT siblings of a core have
> different priorities. Parent scheduling domains do not need (nor have) the
> ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
> cause the topology debug code to complain.
>
> X86 has cores of different priority, but all the SMT siblings of the core
> have equal priority. It needs ASYM_PACKING at the MC level, but not at the
> SMT level (it also needs it at upper levels if they have scheduling groups
> of different priority). Removing ASYM_PACKING from the SMT domain causes
> the topology debug code to complain.
>
> Remove SHARED_CHILD for now. We still need a topology check that satisfies
> both architectures.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Suggested-by: Valentin Schneider <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Introduced this patch.
>
> Changes since v1:
> * N/A
> ---
> include/linux/sched/sd_flags.h | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> index 57bde66d95f7..800238854ba5 100644
> --- a/include/linux/sched/sd_flags.h
> +++ b/include/linux/sched/sd_flags.h
> @@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> /*
> * Place busy tasks earlier in the domain
> *
> - * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
> - * up, but currently assumed to be set from the base domain
> - * upwards (see update_top_cache_domain()).
> * NEEDS_GROUPS: Load balancing flag.
> */
> -SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> +SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
While this silences the warning one would have gotten when removing
SD_ASYM_PACKING from SMT level, it will still result in sd_asym_packing
being NULL for these systems, which breaks nohz balance. That is because
highest_flag_domain() still stops searching at the first level without
the flag set, in this case SMT, even if levels above have the flag set.
Maybe highest_flag_domain() should be changed to take into account the
metadata flags?
Thanks,
Ionela.
>
> /*
> * Prefer to place tasks in a sibling domain
> --
> 2.25.1
>
>
On Fri, Mar 03, 2023 at 11:29:52AM +0000, Ionela Voinescu wrote:
> Hi Ricardo,
Hi Ionela!
>
> On Monday 06 Feb 2023 at 20:58:36 (-0800), Ricardo Neri wrote:
> > Only x86 and Power7 use ASYM_PACKING. They use it differently.
> >
> > Power7 has cores of equal priority, but the SMT siblings of a core have
> > different priorities. Parent scheduling domains do not need (nor have) the
> > ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
> > cause the topology debug code to complain.
> >
> > X86 has cores of different priority, but all the SMT siblings of the core
> > have equal priority. It needs ASYM_PACKING at the MC level, but not at the
> > SMT level (it also needs it at upper levels if they have scheduling groups
> > of different priority). Removing ASYM_PACKING from the SMT domain causes
> > the topology debug code to complain.
> >
> > Remove SHARED_CHILD for now. We still need a topology check that satisfies
> > both architectures.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Suggested-by: Valentin Schneider <[email protected]>
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * Introduced this patch.
> >
> > Changes since v1:
> > * N/A
> > ---
> > include/linux/sched/sd_flags.h | 5 +----
> > 1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > index 57bde66d95f7..800238854ba5 100644
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> > /*
> > * Place busy tasks earlier in the domain
> > *
> > - * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
> > - * up, but currently assumed to be set from the base domain
> > - * upwards (see update_top_cache_domain()).
> > * NEEDS_GROUPS: Load balancing flag.
> > */
> > -SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> > +SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
>
> While this silences the warning one would have gotten when removing
> SD_ASYM_PACKING from SMT level, it will still result in sd_asym_packing
> being NULL for these systems, which breaks nohz balance. That is because
> highest_flag_domain() still stops searching at the first level without
> the flag set, in this case SMT, even if levels above have the flag set.
You are absolutely right! This is how this whole discussion started. It
slipped my mind.
>
> Maybe highest_flag_domain() should be changed to take into account the
> metadata flags?
What about the patch below? Search will stop if the flag has
SDF_SHARED_CHILD as it does today. Otherwise it will search all the
domains.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1773,6 +1773,12 @@ queue_balance_callback(struct rq *rq,
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
 			__sd; __sd = __sd->parent)

+#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
+static const unsigned int SD_SHARED_CHILD_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:	The CPU whose highest level of sched domain is to
@@ -1781,15 +1787,19 @@ queue_balance_callback(struct rq *rq,
  * for the given CPU.
  *
  * Returns the highest sched_domain of a CPU which contains the given flag.
- */
+*/
 static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 {
 	struct sched_domain *sd, *hsd = NULL;

 	for_each_domain(cpu, sd) {
-		if (!(sd->flags & flag))
+		if (sd->flags & flag) {
+			hsd = sd;
+			continue;
+		}
+
+		if (flag & SD_SHARED_CHILD_MASK)
 			break;
-		hsd = sd;
 	}

 	return hsd;
>
> Thanks,
> Ionela.
>
> >
> > /*
> > * Prefer to place tasks in a sibling domain
> > --
> > 2.25.1
> >
> >
Hey,
On Sunday 05 Mar 2023 at 11:08:11 (-0800), Ricardo Neri wrote:
> On Fri, Mar 03, 2023 at 11:29:52AM +0000, Ionela Voinescu wrote:
> > Hi Ricardo,
>
> Hi Ionela!
>
> >
> > On Monday 06 Feb 2023 at 20:58:36 (-0800), Ricardo Neri wrote:
> > > Only x86 and Power7 use ASYM_PACKING. They use it differently.
> > >
> > > Power7 has cores of equal priority, but the SMT siblings of a core have
> > > different priorities. Parent scheduling domains do not need (nor have) the
> > > ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
> > > cause the topology debug code to complain.
> > >
> > > X86 has cores of different priority, but all the SMT siblings of the core
> > > have equal priority. It needs ASYM_PACKING at the MC level, but not at the
> > > SMT level (it also needs it at upper levels if they have scheduling groups
> > > of different priority). Removing ASYM_PACKING from the SMT domain causes
> > > the topology debug code to complain.
> > >
> > > Remove SHARED_CHILD for now. We still need a topology check that satisfies
> > > both architectures.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Suggested-by: Valentin Schneider <[email protected]>
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * Introduced this patch.
> > >
> > > Changes since v1:
> > > * N/A
> > > ---
> > > include/linux/sched/sd_flags.h | 5 +----
> > > 1 file changed, 1 insertion(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > > index 57bde66d95f7..800238854ba5 100644
> > > --- a/include/linux/sched/sd_flags.h
> > > +++ b/include/linux/sched/sd_flags.h
> > > @@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> > > /*
> > > * Place busy tasks earlier in the domain
> > > *
> > > - * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
> > > - * up, but currently assumed to be set from the base domain
> > > - * upwards (see update_top_cache_domain()).
> > > * NEEDS_GROUPS: Load balancing flag.
> > > */
> > > -SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> > > +SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> >
> > While this silences the warning one would have gotten when removing
> > SD_ASYM_PACKING from SMT level, it will still result in sd_asym_packing
> > being NULL for these systems, which breaks nohz balance. That is because
> > highest_flag_domain() still stops searching at the first level without
> > the flag set, in this case SMT, even if levels above have the flag set.
>
> > You are absolutely right! This is how this whole discussion started. It
> slipped my mind.
>
> >
> > Maybe highest_flag_domain() should be changed to take into account the
> > metadata flags?
>
> What about the patch below? Search will stop if the flag has
> SDF_SHARED_CHILD as it does today. Otherwise it will search all the
> domains.
>
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1773,6 +1773,12 @@ queue_balance_callback(struct rq *rq,
> for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
> __sd; __sd = __sd->parent)
>
> +#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
> +static const unsigned int SD_SHARED_CHILD_MASK =
> +#include <linux/sched/sd_flags.h>
> +0;
> +#undef SD_FLAG
> +
> /**
> * highest_flag_domain - Return highest sched_domain containing flag.
> * @cpu: The CPU whose highest level of sched domain is to
> @@ -1781,15 +1787,19 @@ queue_balance_callback(struct rq *rq,
> * for the given CPU.
> *
> * Returns the highest sched_domain of a CPU which contains the given flag.
> - */
> +*/
^^^
likely an unintended change
> static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
> {
> 	struct sched_domain *sd, *hsd = NULL;
>
> 	for_each_domain(cpu, sd) {
> -		if (!(sd->flags & flag))
> +		if (sd->flags & flag) {
> +			hsd = sd;
> +			continue;
> +		}
> +
There might be room for a comment here:
/*
* If the flag is not set and is known to be shared with lower
* domains, stop the search, as it won't be found further up.
*/
> +		if (flag & SD_SHARED_CHILD_MASK)
> 			break;
> -		hsd = sd;
> 	}
>
> 	return hsd;
It looks nice and sane to me - I've not compiled or tested it :).
Thanks,
Ionela.
>
> >
> > Thanks,
> > Ionela.
> >
> > >
> > > /*
> > > * Prefer to place tasks in a sibling domain
> > > --
> > > 2.25.1
> > >
> > >
On Mon, Mar 06, 2023 at 01:10:37PM +0000, Ionela Voinescu wrote:
> Hey,
>
> On Sunday 05 Mar 2023 at 11:08:11 (-0800), Ricardo Neri wrote:
> > On Fri, Mar 03, 2023 at 11:29:52AM +0000, Ionela Voinescu wrote:
> > > Hi Ricardo,
> >
> > Hi Ionela!
> >
> > >
> > > On Monday 06 Feb 2023 at 20:58:36 (-0800), Ricardo Neri wrote:
> > > > Only x86 and Power7 use ASYM_PACKING. They use it differently.
> > > >
> > > > Power7 has cores of equal priority, but the SMT siblings of a core have
> > > > different priorities. Parent scheduling domains do not need (nor have) the
> > > > ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
> > > > cause the topology debug code to complain.
> > > >
> > > > X86 has cores of different priority, but all the SMT siblings of the core
> > > > have equal priority. It needs ASYM_PACKING at the MC level, but not at the
> > > > SMT level (it also needs it at upper levels if they have scheduling groups
> > > > of different priority). Removing ASYM_PACKING from the SMT domain causes
> > > > the topology debug code to complain.
> > > >
> > > > Remove SHARED_CHILD for now. We still need a topology check that satisfies
> > > > both architectures.
> > > >
> > > > Cc: Ben Segall <[email protected]>
> > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > Cc: Dietmar Eggemann <[email protected]>
> > > > Cc: Len Brown <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>
> > > > Cc: Tim C. Chen <[email protected]>
> > > > Cc: Valentin Schneider <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Suggested-by: Valentin Schneider <[email protected]>
> > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > ---
> > > > Changes since v2:
> > > > * Introduced this patch.
> > > >
> > > > Changes since v1:
> > > > * N/A
> > > > ---
> > > > include/linux/sched/sd_flags.h | 5 +----
> > > > 1 file changed, 1 insertion(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > > > index 57bde66d95f7..800238854ba5 100644
> > > > --- a/include/linux/sched/sd_flags.h
> > > > +++ b/include/linux/sched/sd_flags.h
> > > > @@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> > > > /*
> > > > * Place busy tasks earlier in the domain
> > > > *
> > > > - * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
> > > > - * up, but currently assumed to be set from the base domain
> > > > - * upwards (see update_top_cache_domain()).
> > > > * NEEDS_GROUPS: Load balancing flag.
> > > > */
> > > > -SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> > > > +SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> > >
> > > While this silences the warning one would have gotten when removing
> > > SD_ASYM_PACKING from SMT level, it will still result in sd_asym_packing
> > > being NULL for these systems, which breaks nohz balance. That is because
> > > highest_flag_domain() still stops searching at the first level without
> > > the flag set, in this case SMT, even if levels above have the flag set.
> >
> > You are absolutely right! This is how this whole discussion started. It
> > slipped my mind.
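> >
To illustrate the breakage on x86 once SD_ASYM_PACKING leaves the SMT
domain, the current loop (sketched from the lines the patch below removes,
not new code) does:

	for_each_domain(cpu, sd) {
		if (!(sd->flags & flag))
			break;		/* SMT lacks the flag; we break here... */
		hsd = sd;		/* ...so MC, which has the flag, is never reached */
	}

so highest_flag_domain(cpu, SD_ASYM_PACKING) returns NULL and
sd_asym_packing is never set.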
> >
> > >
> > > Maybe highest_flag_domain() should be changed to take into account the
> > > metadata flags?
> >
> > What about the patch below? Search will stop if the flag has
> > SDF_SHARED_CHILD as it does today. Otherwise it will search all the
> > domains.
> >
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1773,6 +1773,12 @@ queue_balance_callback(struct rq *rq,
> >  	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
> >  			__sd; __sd = __sd->parent)
> >
> > +#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
> > +static const unsigned int SD_SHARED_CHILD_MASK =
> > +#include <linux/sched/sd_flags.h>
> > +0;
> > +#undef SD_FLAG
> > +
> > /**
> > * highest_flag_domain - Return highest sched_domain containing flag.
> > * @cpu: The CPU whose highest level of sched domain is to
> > @@ -1781,15 +1787,19 @@ queue_balance_callback(struct rq *rq,
> > * for the given CPU.
> > *
> > * Returns the highest sched_domain of a CPU which contains the given flag.
> > - */
> > +*/
> ^^^
> likely an unintended change
Yes! I will remove it in the patch I post.
> >  static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
> >  {
> >  	struct sched_domain *sd, *hsd = NULL;
> > 
> >  	for_each_domain(cpu, sd) {
> > -		if (!(sd->flags & flag))
> > +		if (sd->flags & flag) {
> > +			hsd = sd;
> > +			continue;
> > +		}
> > +
>
> There might be room for a comment here:
> /*
> * If the flag is not set and is known to be shared with lower
> * domains, stop the search, as it won't be found further up.
> */
Sure, I can add a comment to this effect.
> > +		if (flag & SD_SHARED_CHILD_MASK)
> >  			break;
> > -		hsd = sd;
> >  	}
> > 
> >  	return hsd;
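To make the SD_FLAG trick above concrete: every SD_FLAG(name, mflags)
entry in sd_flags.h expands to one term of the OR expression, and only
entries tagged SDF_SHARED_CHILD contribute their bit. A rough sketch of
the expansion for two entries (illustrative only, not the literal
preprocessor output):

	static const unsigned int SD_SHARED_CHILD_MASK =
		/* SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS) */
		(SD_SHARE_CPUCAPACITY * 1) |	/* tagged: bit kept */
		/* SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS), after this series */
		(SD_ASYM_PACKING * 0) |		/* untagged: bit dropped */
		0;

With SD_ASYM_PACKING out of the mask, the new "flag & SD_SHARED_CHILD_MASK"
check no longer stops the search early for it.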
>
> It looks nice and sane to me - I've not compiled or tested it :).
Thank you very much for your feedback!
BR,
Ricardo
On Mon, 2023-02-06 at 20:58 -0800, Ricardo Neri wrote:
> Using asym_packing priorities within an SMT core is straightforward.
> Just follow the priorities that hardware indicates.
>
> When balancing load from an SMT core, also consider the idle state of
> its siblings. Priorities do not reflect that an SMT core divides its
> throughput among all its busy siblings. They only make sense when
> exactly one sibling is busy.
>
> Indicate that active balance is needed if the destination CPU has lower
> priority than the source CPU but the latter has busy SMT siblings.
>
> Make find_busiest_queue() not skip higher-priority SMT cores with more
> than one busy sibling.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Suggested-by: Valentin Schneider <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Introduced this patch.
>
> Changes since v1:
> * N/A
> ---
> kernel/sched/fair.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 80c86462c6f6..c9d0ddfd11f2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  		    nr_running == 1)
>  			continue;
> 
> -		/* Make sure we only pull tasks from a CPU of lower priority */
> +		/*
> +		 * Make sure we only pull tasks from a CPU of lower priority
> +		 * when balancing between SMT siblings.
> +		 *
> +		 * If balancing between cores, let lower priority CPUs help
> +		 * SMT cores with more than one busy sibling.
> +		 */
>  		if ((env->sd->flags & SD_ASYM_PACKING) &&
>  		    sched_asym_prefer(i, env->dst_cpu) &&
> -		    nr_running == 1)
> -			continue;
> +		    nr_running == 1) {
> +			if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> +			    (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
> +				continue;
> +		}
> 
>  		switch (env->migration_type) {
>  		case migrate_load:
> @@ -10530,8 +10539,20 @@ asym_active_balance(struct lb_env *env)
>  	 * lower priority CPUs in order to pack all tasks in the
>  	 * highest priority CPUs.
>  	 */
> -	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
> -	       sched_asym_prefer(env->dst_cpu, env->src_cpu);
> +	if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
> +		/* Always obey priorities between SMT siblings. */
> +		if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> +			return sched_asym_prefer(env->dst_cpu, env->src_cpu);
> +
> +		/*
> +		 * A lower priority CPU can help an SMT core with more than one
> +		 * busy sibling.
> +		 */
> +		return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> +		       !is_core_idle(env->src_cpu);
> +	}
> +
> +	return false;
>  }
Suppose we have the Atom cores in a sched group (e.g. a cluster); this
will pull the tasks from those cores to an SMT thread even if its
sibling thread is busy. Suggest this change:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da1afa99cd55..b671cb0d7ab3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10473,9 +10473,11 @@ asym_active_balance(struct lb_env *env)
 		/*
 		 * A lower priority CPU can help an SMT core with more than one
 		 * busy sibling.
+		 * Pull only if no SMT sibling is busy.
 		 */
-		return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
-		       !is_core_idle(env->src_cpu);
+		if (is_core_idle(env->dst_cpu))
+			return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+			       !is_core_idle(env->src_cpu);
 	}
Tim
On Thu, Mar 09, 2023 at 04:51:35PM -0800, Tim Chen wrote:
> On Mon, 2023-02-06 at 20:58 -0800, Ricardo Neri wrote:
> > Using asym_packing priorities within an SMT core is straightforward.
> > Just
> > follow the priorities that hardware indicates.
> >
> > When balancing load from an SMT core, also consider the idle of its
> > siblings. Priorities do not reflect that an SMT core divides its
> > throughput
> > among all its busy siblings. They only makes sense when exactly one
> > sibling
> > is busy.
> >
> > Indicate that active balance is needed if the destination CPU has
> > lower
> > priority than the source CPU but the latter has busy SMT siblings.
> >
> > Make find_busiest_queue() not skip higher-priority SMT cores with
> > more than
> > busy sibling.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Suggested-by: Valentin Schneider <[email protected]>
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > ?* Introduced this patch.
> >
> > Changes since v1:
> > ?* N/A
> > ---
> > ?kernel/sched/fair.c | 31 ++++++++++++++++++++++++++-----
> > ?1 file changed, 26 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 80c86462c6f6..c9d0ddfd11f2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10436,11 +10436,20 @@ static struct rq *find_busiest_queue(struct
> > lb_env *env,
> > ??????????????????? nr_running == 1)
> > ????????????????????????continue;
> > ?
> > -???????????????/* Make sure we only pull tasks from a CPU of lower
> > priority */
> > +???????????????/*
> > +??????????????? * Make sure we only pull tasks from a CPU of lower
> > priority
> > +??????????????? * when balancing between SMT siblings.
> > +??????????????? *
> > +??????????????? * If balancing between cores, let lower priority
> > CPUs help
> > +??????????????? * SMT cores with more than one busy sibling.
> > +??????????????? */
> > ????????????????if ((env->sd->flags & SD_ASYM_PACKING) &&
> > ??????????????????? sched_asym_prefer(i, env->dst_cpu) &&
> > -?????????????????? nr_running == 1)
> > -???????????????????????continue;
> > +?????????????????? nr_running == 1) {
> > +???????????????????????if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
> > +?????????????????????????? (!(env->sd->flags & SD_SHARE_CPUCAPACITY)
> > && is_core_idle(i)))
> > +???????????????????????????????continue;
> > +???????????????}
> > ?
> > ????????????????switch (env->migration_type) {
> > ????????????????case migrate_load:
> > @@ -10530,8 +10539,20 @@ asym_active_balance(struct lb_env *env)
> >  	 * lower priority CPUs in order to pack all tasks in the
> >  	 * highest priority CPUs.
> >  	 */
> > -	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
> > -	       sched_asym_prefer(env->dst_cpu, env->src_cpu);
> > +	if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
> > +		/* Always obey priorities between SMT siblings. */
> > +		if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> > +			return sched_asym_prefer(env->dst_cpu, env->src_cpu);
> > +
> > +		/*
> > +		 * A lower priority CPU can help an SMT core with more than one
> > +		 * busy sibling.
> > +		 */
> > +		return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> > +		       !is_core_idle(env->src_cpu);
> > +	}
>
> Suppose we have the Atom cores in a sched group (e.g. a cluster); this
> will pull the tasks from those cores to an SMT thread even if its
> sibling thread is busy. Suggest this change:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da1afa99cd55..b671cb0d7ab3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10473,9 +10473,11 @@ asym_active_balance(struct lb_env *env)
>  		/*
>  		 * A lower priority CPU can help an SMT core with more than one
>  		 * busy sibling.
> +		 * Pull only if no SMT sibling is busy.
>  		 */
> -		return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> -		       !is_core_idle(env->src_cpu);
> +		if (is_core_idle(env->dst_cpu))
> +			return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> +			       !is_core_idle(env->src_cpu);
Thank you Tim! Patch 3 does this check for asym_packing, but we could land
here from other types of idle load balancing.
I will integrate this change into the series.
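For v4, folding your check into my patch would give roughly this
(untested sketch of the merged logic, not the final code):

static inline bool
asym_active_balance(struct lb_env *env)
{
	/*
	 * ASYM_PACKING needs to force migrate tasks from busy but lower
	 * priority CPUs in order to pack all tasks in the highest
	 * priority CPUs.
	 */
	if (env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING)) {
		/* Always obey priorities between SMT siblings. */
		if (env->sd->flags & SD_SHARE_CPUCAPACITY)
			return sched_asym_prefer(env->dst_cpu, env->src_cpu);

		/*
		 * A lower priority CPU can help an SMT core with more than
		 * one busy sibling, but only pull if the destination core
		 * is fully idle.
		 */
		if (is_core_idle(env->dst_cpu))
			return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
			       !is_core_idle(env->src_cpu);
	}

	return false;
}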
Thanks and BR,
Ricardo