Hi,
This is v4 of this series. Previous versions can be found here [1], [2],
and here [3]. To avoid duplication, I do not include the cover letter of
the original submission. You can read it in [1].
This patchset applies cleanly on today's master branch of the tip tree.
Changes since v3:
Nobody liked the proposed changes to the setting of prefer_sibling.
Instead, I tweaked the solution that Dietmar proposed. Now the busiest
group, not the local group, determines the setting of prefer_sibling.
Vincent suggested improvements to the logic to decide whether to follow
asym_packing priorities. Peter suggested wrapping that logic in a helper
function. I added sched_use_asym_prio().
Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
does not assume that all child domains have the requested flag.
Tim found that asym_active_balance() needs to also check for the idle
states of the SMT siblings of lb_env::dst_cpu. I added such a check.
I wrongly assumed that asym_packing could only be used when the busiest
group had exactly one busy CPU. This broke asym_packing balancing at the
DIE domain. I limited this check to balances between cores at the MC
level.
As per suggestion from Dietmar, I removed sched_asym_smt_can_pull_tasks()
and placed its logic in sched_asym(). Also, sched_asym() uses
sched_smt_active() to skip checks when not needed.
I also added a patch from Chen Yu to enable asym_packing balancing in
Meteor Lake, which has CPUs of different maximum frequency in more than
one die.
Hopefully, these patches are in sufficiently good shape to be merged?
Thank you for your feedback and I look forward to getting more of it!
New patches: 8, 12
Updated patches: 2, 3, 4, 6, 7
Unchanged patches: 1, 5, 9, 10, 11
BR,
Ricardo
[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/lkml/[email protected]/
Chen Yu (1):
x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid
processors
Ricardo Neri (11):
sched/fair: Move is_core_idle() out of CONFIG_NUMA
sched/fair: Only do asym_packing load balancing from fully idle SMT
cores
sched/fair: Simplify asym_packing logic for SMT cores
sched/fair: Let low-priority cores help high-priority busy SMT cores
sched/fair: Keep a fully_busy SMT sched group as busiest
sched/fair: Use the busiest group to set prefer_sibling
sched/fair: Do not even the number of busy CPUs via asym_packing
sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
sched/topology: Remove SHARED_CHILD from ASYM_PACKING
x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
x86/sched/itmt: Give all SMT siblings of a core the same priority
arch/x86/kernel/itmt.c | 23 +---
arch/x86/kernel/smpboot.c | 4 +-
include/linux/sched/sd_flags.h | 5 +-
kernel/sched/fair.c | 216 +++++++++++++++++----------------
kernel/sched/sched.h | 22 +++-
5 files changed, 138 insertions(+), 132 deletions(-)
--
2.25.1
asym_packing needs this function to determine whether an SMT core is a
suitable destination for load balancing.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4dc1950ae5ae..57c106fa721d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,23 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ int sibling;
+
+ for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+ if (cpu == sibling)
+ continue;
+
+ if (!idle_cpu(sibling))
+ return false;
+ }
+#endif
+
+ return true;
+}
+
#ifdef CONFIG_NUMA
#define NUMA_IMBALANCE_MIN 2
@@ -1700,23 +1717,6 @@ struct numa_stats {
int idle_cpu;
};
-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
- int sibling;
-
- for_each_cpu(sibling, cpu_smt_mask(cpu)) {
- if (cpu == sibling)
- continue;
-
- if (!idle_cpu(sibling))
- return false;
- }
-#endif
-
- return true;
-}
-
struct task_numa_env {
struct task_struct *p;
--
2.25.1
When balancing load between cores, all the SMT siblings of the destination
CPU, if any, must be idle. Otherwise, pulling new tasks degrades the
throughput of the busy SMT siblings while the overall throughput of the
system remains the same.
When balancing load within an SMT core this consideration is not relevant.
Follow the priorities that hardware indicates.
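To make the between-cores rule concrete, here is a tiny stand-alone model
(plain userspace C, not kernel code; the two-core SMT2 layout and the busy
map are made up for illustration):

  #include <stdbool.h>
  #include <stdio.h>

  #define NR_SIBLINGS 2

  /* CPUs 0-1 form one SMT core, CPUs 2-3 another. CPU 1 is busy. */
  static const bool cpu_busy[4] = { false, true, false, false };

  /* Models the is_core_idle() check: are all siblings of @cpu idle? */
  static bool core_fully_idle(int cpu)
  {
          int base = cpu - (cpu % NR_SIBLINGS);
          int s;

          for (s = base; s < base + NR_SIBLINGS; s++) {
                  if (s != cpu && cpu_busy[s])
                          return false;
          }
          return true;
  }

  int main(void)
  {
          /*
           * CPU 0 is idle but its sibling (CPU 1) is busy: pulling a task
           * to CPU 0 would take throughput away from CPU 1 without raising
           * the total, so asym_packing should skip it. The core of CPU 2
           * is fully idle and is a suitable destination.
           */
          printf("pull to CPU0? %s\n", core_fully_idle(0) ? "yes" : "no");
          printf("pull to CPU2? %s\n", core_fully_idle(2) ? "yes" : "no");
          return 0;
  }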
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Improved the logic to determine whether CPU priority should be followed.
Also, wrapped this logic in a helper function. (Vincent G./ Peter)
* Used sched_smt_active() to avoid pointless calls of is_core_idle().
(Dietmar)
* Ensure that the core is idle in asym_active_balance(). (Tim)
* Used sched_use_asym_prio() to check for fully idle SMT cores in
sched_asym().
* Removed check for fully idle core inside asym_smt_can_pull_tasks().
Now such condition is verified outside the function.
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 60 +++++++++++++++++++++++++++++++++------------
1 file changed, 44 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57c106fa721d..ec7ddbfd1136 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9273,6 +9273,29 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}
+/**
+ * sched_use_asym_prio - Check whether asym_packing priority must be used
+ * @sd: The scheduling domain of the load balancing
+ * @cpu: A CPU
+ *
+ * Always use CPU priority when balancing load between SMT siblings. When
+ * balancing load between cores, it is not sufficient that @cpu is idle. Only
+ * use CPU priority if the whole core is idle.
+ *
+ * Returns: True if the priority of @cpu must be followed. False otherwise.
+ */
+static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ if (!sched_smt_active())
+ return true;
+
+ return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+#else
+ return true;
+#endif
+}
+
/**
* asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
* @dst_cpu: Destination CPU of the load balancing
@@ -9283,6 +9306,9 @@ group_type group_classify(unsigned int imbalance_pct,
* Check the state of the SMT siblings of both @sds::local and @sg and decide
* if @dst_cpu can pull tasks.
*
+ * This function must be called only if all the SMT siblings of @dst_cpu are
+ * idle, if any.
+ *
* If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
@@ -9292,8 +9318,7 @@ group_type group_classify(unsigned int imbalance_pct,
* Bigger imbalances in the number of busy CPUs will be dealt with in
* update_sd_pick_busiest().
*
- * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
- * of @dst_cpu are idle and @sg has lower priority.
+ * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
*
* Return: true if @dst_cpu can pull tasks, false otherwise.
*/
@@ -9341,15 +9366,8 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
return false;
}
- /*
- * @sg does not have SMT siblings. Ensure that @sds::local does not end
- * up with more than one busy SMT sibling and only pull tasks if there
- * are not busy CPUs (i.e., no CPU has running tasks).
- */
- if (!sds->local_stat.sum_nr_running)
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
- return false;
+ /* If we are here, @dst_cpu has SMT siblings and they are all idle. */
+ return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
#else
/* Always return false so that callers deal with non-SMT cases. */
return false;
@@ -9360,7 +9378,11 @@ static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- /* Only do SMT checks if either local or candidate have SMT siblings */
+ /* Ensure that the whole local core is idle, if applicable. */
+ if (!sched_use_asym_prio(env->sd, env->dst_cpu))
+ return false;
+
+ /* Only do SMT checks if either local or candidate have SMT siblings. */
if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
(group->flags & SD_SHARE_CPUCAPACITY))
return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
@@ -10565,11 +10587,13 @@ static inline bool
asym_active_balance(struct lb_env *env)
{
/*
- * ASYM_PACKING needs to force migrate tasks from busy but
- * lower priority CPUs in order to pack all tasks in the
- * highest priority CPUs.
+ * ASYM_PACKING needs to force migrate tasks from busy but lower
+ * priority CPUs in order to pack all tasks in the highest priority
+ * CPUs. When done between cores, do it only if the whole core is
+ * idle.
*/
return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
+ sched_use_asym_prio(env->sd, env->dst_cpu) &&
sched_asym_prefer(env->dst_cpu, env->src_cpu);
}
@@ -11304,9 +11328,13 @@ static void nohz_balancer_kick(struct rq *rq)
* When ASYM_PACKING; see if there's a more preferred CPU
* currently idle; in which case, kick the ILB to move tasks
* around.
+ *
+ * When balancing between cores, all the SMT siblings of the
+ * preferred CPU must be idle.
*/
for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
- if (sched_asym_prefer(i, cpu)) {
+ if (sched_use_asym_prio(sd, i) &&
+ sched_asym_prefer(i, cpu)) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
--
2.25.1
When comparing two fully_busy scheduling groups, keep the current busiest
group if it represents an SMT core. Tasks in such scheduling group share
CPU resources and need more help than tasks in a non-SMT fully_busy group.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b151e93ec316..ea23a5163bfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9566,10 +9566,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* contention when accessing shared HW resources.
*
* XXX for now avg_load is not computed and always 0 so we
- * select the 1st one.
+ * select the 1st one, except if @sg is composed of SMT
+ * siblings.
*/
- if (sgs->avg_load <= busiest->avg_load)
+
+ if (sgs->avg_load < busiest->avg_load)
return false;
+
+ if (sgs->avg_load == busiest->avg_load) {
+ /*
+ * SMT sched groups need more help than non-SMT groups.
+ * If @sg happens to also be SMT, either choice is good.
+ */
+ if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+ return false;
+ }
+
break;
case group_has_spare:
--
2.25.1
Using asym_packing priorities within an SMT core is straightforward. Just
follow the priorities that hardware indicates.
When balancing load from an SMT core, also consider the idle state of its
siblings. Priorities do not reflect that an SMT core divides its throughput
among all its busy siblings. They only make sense when exactly one sibling
is busy.
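As a rough illustration (the capacity numbers are made up): if an SMT2 core
of capacity 1024 has both siblings busy, each sibling effectively gets
about 1024 / 2 = 512 of throughput. An idle lower-priority core of, say,
capacity 700 then serves a task better than the nominally higher-priority
SMT core, which static priorities alone cannot express.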
Indicate that active balance is needed if the destination CPU has lower
priority than the source CPU but the latter has busy SMT siblings.
Make find_busiest_queue() not skip higher-priority SMT cores with more than
one busy sibling.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Improved the logic to determine whether CPU priority should be followed.
Also, wrapped this logic in a helper function. (Vincent G./ Peter)
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6bbe0300635..b151e93ec316 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10488,8 +10488,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
nr_running == 1)
continue;
- /* Make sure we only pull tasks from a CPU of lower priority */
+ /*
+ * Make sure we only pull tasks from a CPU of lower priority
+ * when balancing between SMT siblings.
+ *
+ * If balancing between cores, let lower priority CPUs help
+ * SMT cores with more than one busy sibling.
+ */
if ((env->sd->flags & SD_ASYM_PACKING) &&
+ sched_use_asym_prio(env->sd, i) &&
sched_asym_prefer(i, env->dst_cpu) &&
nr_running == 1)
continue;
@@ -10582,10 +10589,15 @@ asym_active_balance(struct lb_env *env)
* priority CPUs in order to pack all tasks in the highest priority
* CPUs. When done between cores, do it only if the whole core is
* idle.
+ *
+ * If @env::src_cpu is an SMT core with busy siblings, let
+ * the lower priority @env::dst_cpu help it. Do not follow
+ * CPU priority.
*/
return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
sched_use_asym_prio(env->sd, env->dst_cpu) &&
- sched_asym_prefer(env->dst_cpu, env->src_cpu);
+ (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+ !sched_use_asym_prio(env->sd, env->src_cpu));
}
static inline bool
--
2.25.1
Callers of asym_smt_can_pull_tasks() check the idle state of the
destination CPU and its SMT siblings, if any. No extra checks are needed
in the function itself.
Since SMT cores divide capacity among their siblings, priorities only
really make sense if only one sibling is active. This is true for SMT2,
SMT4, SMT8, etc. Do not use asym_packing load balancing for SMT cores with
more than one busy sibling. Instead, let find_busiest_group() handle
imbalances.
When balancing non-SMT cores or at higher scheduling domains (e.g.,
between MC scheduling groups), continue using priorities.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Dropped checks for a destination SMT core.
* Limited the check of a single busy CPU to SMT cores. This fixes a bug
when balancing between MC scheduling groups of different priority.
Changes since v2:
* Updated documentation of the function to reflect the new behavior.
(Dietmar)
Changes since v1:
* Reworded commit message and inline comments for clarity.
* Stated that this changeset does not impact SMT4 or SMT8.
---
kernel/sched/fair.c | 33 ++++++++++++---------------------
1 file changed, 12 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec7ddbfd1136..b6bbe0300635 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9313,12 +9313,9 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
*
- * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
- * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
- * Bigger imbalances in the number of busy CPUs will be dealt with in
- * update_sd_pick_busiest().
- *
- * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
+ * When dealing with SMT cores, only use priorities if the SMT core has exactly
+ * one busy sibling. find_busiest_group() will handle bigger imbalances in the
+ * number of busy CPUs.
*
* Return: true if @dst_cpu can pull tasks, false otherwise.
*/
@@ -9327,12 +9324,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
struct sched_group *sg)
{
#ifdef CONFIG_SCHED_SMT
- bool local_is_smt, sg_is_smt;
+ bool local_is_smt;
int sg_busy_cpus;
local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
- sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
if (!local_is_smt) {
@@ -9353,21 +9348,17 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
}
- /* @dst_cpu has SMT siblings. */
-
- if (sg_is_smt) {
- int local_busy_cpus = sds->local->group_weight -
- sds->local_stat.idle_cpus;
- int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
- if (busy_cpus_delta == 1)
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
+ /*
+ * If we are here, @dst_cpu has SMT siblings and they are all idle.
+ *
+ * CPU priorities do not make sense for SMT cores with more than one
+ * busy sibling.
+ */
+ if (sg->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
return false;
- }
- /* If we are here, @dst_cpu has SMT siblings and they are all idle. */
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
#else
/* Always return false so that callers deal with non-SMT cases. */
return false;
--
2.25.1
The prefer_sibling setting acts on the busiest group to move excess tasks
to the local group. This should be done as per request of the child of the
busiest group's sched domain, not the local group's.
Using the flags of the child domain of the local group works fortuitously
if both groups have child domains.
There are cases, however, in which the busiest group's sched domain has a
child but the local group's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING of the busiest group's child
domain will not be honored. We are left with a fully busy SMT core and an
idle non-SMT core.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Suggested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Dropped the patch that set prefer_sibling from the current sched domain
level. Instead, use the flags of the busiest group's child sched
domain. (Vincent G, PeterZ, Dietmar)
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea23a5163bfa..477cb5777036 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10056,7 +10056,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
@@ -10097,8 +10096,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg = sg->next;
} while (sg != env->sd->groups);
- /* Tag domain that child domain prefers tasks go to siblings first */
- sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+ /*
+ * Indicate that the child domain of the busiest group prefers tasks
+ * go to a child's sibling domains first. NB the flags of a sched group
+ * are those of the child domain.
+ */
+ if (sds->busiest)
+ sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
if (env->sd->flags & SD_NUMA)
@@ -10398,7 +10402,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto out_balanced;
}
- /* Try to move all excess tasks to child's sibling domain */
+ /*
+ * Try to move all excess tasks to a sibling domain of the busiest
+ * group's child domain.
+ */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;
--
2.25.1
Now that find_busiest_group() triggers load balancing between a fully_busy
SMT2 core and an idle non-SMT core, it is no longer needed to force
balancing via asym_packing. Use asym_packing only as intended: when there
is a high-priority CPU that is idle.
After this change, the same logic applies to SMT and non-SMT local groups.
It makes less sense having a separate function to deal specifically with
SMT. Fold the logic in asym_smt_can_pull_tasks() into sched_asym().
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Removed unnecessary local variable sg_busy_cpus.
* Folded asym_smt_can_pull_tasks() into sched_asym() and simplify
further. (Dietmar)
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 86 +++++++++++----------------------------------
1 file changed, 21 insertions(+), 65 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 477cb5777036..145ca52f9629 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9297,74 +9297,26 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
}
/**
- * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
- * @dst_cpu: Destination CPU of the load balancing
+ * sched_asym - Check if the destination CPU can do asym_packing load balance
+ * @env: The load balancing environment
* @sds: Load-balancing data with statistics of the local group
* @sgs: Load-balancing statistics of the candidate busiest group
- * @sg: The candidate busiest group
+ * @group: The candidate busiest group
*
- * Check the state of the SMT siblings of both @sds::local and @sg and decide
- * if @dst_cpu can pull tasks.
+ * @env::dst_cpu can do asym_packing if it has higher priority than the
+ * preferred CPU of @group.
*
- * This function must be called only if all the SMT siblings of @dst_cpu are
- * idle, if any.
+ * SMT is a special case. If we are balancing load between cores, @env::dst_cpu
+ * can do asym_packing balance only if all its SMT siblings are idle. Also, it
+ * can only do it if @group is an SMT group and has exactly one busy CPU. Larger
+ * imbalances in the number of CPUs are dealt with in find_busiest_group().
*
- * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
- * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
- * only if @dst_cpu has higher priority.
+ * If we are balancing load within an SMT core, or at DIE domain level, always
+ * proceed.
*
- * When dealing with SMT cores, only use priorities if the SMT core has exactly
- * one busy sibling. find_busiest_group() will handle bigger imbalances in the
- * number of busy CPUs.
- *
- * Return: true if @dst_cpu can pull tasks, false otherwise.
+ * Return: true if @env::dst_cpu can do asym_packing load balance. False
+ * otherwise.
*/
-static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
- struct sg_lb_stats *sgs,
- struct sched_group *sg)
-{
-#ifdef CONFIG_SCHED_SMT
- bool local_is_smt;
- int sg_busy_cpus;
-
- local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
- sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
-
- if (!local_is_smt) {
- /*
- * If we are here, @dst_cpu is idle and does not have SMT
- * siblings. Pull tasks if candidate group has two or more
- * busy CPUs.
- */
- if (sg_busy_cpus >= 2) /* implies sg_is_smt */
- return true;
-
- /*
- * @dst_cpu does not have SMT siblings. @sg may have SMT
- * siblings and only one is busy. In such case, @dst_cpu
- * can help if it has higher priority and is idle (i.e.,
- * it has no running tasks).
- */
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
- }
-
- /*
- * If we are here, @dst_cpu has SMT siblings and they are all idle.
- *
- * CPU priorities do not make sense for SMT cores with more than one
- * busy sibling.
- */
- if (sg->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
- return false;
-
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-#else
- /* Always return false so that callers deal with non-SMT cases. */
- return false;
-#endif
-}
-
static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
@@ -9373,10 +9325,14 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
if (!sched_use_asym_prio(env->sd, env->dst_cpu))
return false;
- /* Only do SMT checks if either local or candidate have SMT siblings. */
- if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
- (group->flags & SD_SHARE_CPUCAPACITY))
- return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+ /*
+ * CPU priorities do not make sense for SMT cores with more than one
+ * busy sibling.
+ */
+ if (group->flags & SD_SHARE_CPUCAPACITY) {
+ if (sgs->group_weight - sgs->idle_cpus != 1)
+ return false;
+ }
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}
--
2.25.1
Only x86 and Power7 use ASYM_PACKING. They use it differently.
Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.
X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.
Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.
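Schematically, where each architecture needs SD_ASYM_PACKING:

  Power7:  SMT (SD_ASYM_PACKING) -> parent domains without the flag
  x86:     SMT -> MC (SD_ASYM_PACKING) [-> upper levels too, if their
           scheduling groups have different priorities]

Neither shape satisfies SDF_SHARED_CHILD (assumed to be set from the base
domain upwards) nor SDF_SHARED_PARENT.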
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Removed spurious space.
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
include/linux/sched/sd_flags.h | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..fad77b5172e2 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
/*
* Place busy tasks earlier in the domain
*
- * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- * up, but currently assumed to be set from the base domain
- * upwards (see update_top_cache_domain()).
* NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
/*
* Prefer to place tasks in a sibling domain
--
2.25.1
Do not assume that all the children of a scheduling domain have a given
flag. Check whether the flag has the SDF_SHARED_CHILD meta flag. Otherwise,
removing SD_ASYM_PACKING from the SMT domain on x86 would wrongly leave
sd_asym_packing NULL on SMT cores.
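For illustration, the SD_SHARED_CHILD_MASK constant in the diff below is
built at compile time by re-including sd_flags.h with a temporary SD_FLAG()
definition. Abridged, the expansion looks roughly like this (per-flag meta
flags as declared in include/linux/sched/sd_flags.h):

  static const unsigned int SD_SHARED_CHILD_MASK =
          /* SDF_SHARED_CHILD present: the flag stays in the mask */
          (SD_BALANCE_NEWIDLE * !!((SDF_SHARED_CHILD | SDF_NEEDS_GROUPS) & SDF_SHARED_CHILD)) |
          /* ... one term per flag ... */
          /* SDF_SHARED_CHILD absent: the term evaluates to 0 */
          (SD_SERIALIZE * !!((SDF_SHARED_PARENT | SDF_NEEDS_GROUPS) & SDF_SHARED_CHILD)) |
          0;

highest_flag_domain() can then stop walking up early only for flags in this
mask; for the others (e.g. SD_ASYM_PACKING after this series) it must keep
searching all the way up.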
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Ionela Voinescu <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Introduced this patch.
Changes since v2:
* N/A
Changes since v1:
* N/A
---
kernel/sched/sched.h | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 060616944d7a..70abce91b549 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1772,6 +1772,13 @@ queue_balance_callback(struct rq *rq,
for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
__sd; __sd = __sd->parent)
+/* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */
+#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
+static const unsigned int SD_SHARED_CHILD_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
/**
* highest_flag_domain - Return highest sched_domain containing flag.
* @cpu: The CPU whose highest level of sched domain is to
@@ -1779,16 +1786,25 @@ queue_balance_callback(struct rq *rq,
* @flag: The flag to check for the highest sched_domain
* for the given CPU.
*
- * Returns the highest sched_domain of a CPU which contains the given flag.
+ * Returns the highest sched_domain of a CPU which contains @flag. If @flag has
+ * the SDF_SHARED_CHILD metaflag, all the children domains also have @flag.
*/
static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
{
struct sched_domain *sd, *hsd = NULL;
for_each_domain(cpu, sd) {
- if (!(sd->flags & flag))
+ if (sd->flags & flag) {
+ hsd = sd;
+ continue;
+ }
+
+ /*
+ * Stop the search if @flag is known to be shared at lower
+ * levels. It will not be found further up.
+ */
+ if (flag & SD_SHARED_CHILD_MASK)
break;
- hsd = sd;
}
return hsd;
--
2.25.1
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Introduced this patch.
---
arch/x86/kernel/smpboot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 851477f7d728..cf7f42189e86 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -552,7 +552,7 @@ static int x86_core_flags(void)
#ifdef CONFIG_SCHED_SMT
static int x86_smt_flags(void)
{
- return cpu_smt_flags() | x86_sched_itmt_flags();
+ return cpu_smt_flags();
}
#endif
#ifdef CONFIG_SCHED_CLUSTER
--
2.25.1
From: Chen Yu <[email protected]>
Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have higher maximum frequency. Use the
SD_ASYM_PACKING flag to give higher priority to the die with CPUs of
higher maximum frequency.
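With this change, ITMT priorities act at both levels where CPUs of
different priority can meet (sketch; the flag helpers are those used in
arch/x86/kernel/smpboot.c, and x86_core_flags() is assumed to include the
ITMT flags):

  SMT:  cpu_smt_flags()           - no SD_ASYM_PACKING
  MC:   x86_core_flags()          - SD_ASYM_PACKING within a die
  DIE:  x86_sched_itmt_flags()    - SD_ASYM_PACKING across dies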
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Ricardo Neri <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
Changes since v3:
* Introduced this patch.
Changes since v2:
* N/A
Changes since v1:
* N/A
---
arch/x86/kernel/smpboot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index cf7f42189e86..8d966f618413 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -583,7 +583,7 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(DIE) },
{ NULL, },
};
--
2.25.1
X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.
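For reference, with smp_num_siblings == 2 and a core priority of, say, 100
(value made up), the old code below assigned 100 * 2 / (1 * 1) = 200 to the
first sibling and 100 * 2 / (2 * 2) = 50 to the second, pushing secondary
siblings to the end of the priority chain. After this change, every sibling
simply gets 100.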
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Tested-by: Zhang Rui <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Reworded commit message for clarity.
---
arch/x86/kernel/itmt.c | 23 +++++------------------
1 file changed, 5 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 670eb08b972a..ee4fe8cdb857 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -165,32 +165,19 @@ int arch_asym_cpu_priority(int cpu)
/**
* sched_set_itmt_core_prio() - Set CPU priority based on ITMT
- * @prio: Priority of cpu core
- * @core_cpu: The cpu number associated with the core
+ * @prio: Priority of @cpu
+ * @cpu: The CPU number
*
* The pstate driver will find out the max boost frequency
* and call this function to set a priority proportional
- * to the max boost frequency. CPU with higher boost
+ * to the max boost frequency. CPUs with higher boost
* frequency will receive higher priority.
*
* No need to rebuild sched domain after updating
* the CPU priorities. The sched domains have no
* dependency on CPU priorities.
*/
-void sched_set_itmt_core_prio(int prio, int core_cpu)
+void sched_set_itmt_core_prio(int prio, int cpu)
{
- int cpu, i = 1;
-
- for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
- int smt_prio;
-
- /*
- * Ensure that the siblings are moved to the end
- * of the priority chain and only used when
- * all other high priority cpus are out of capacity.
- */
- smt_prio = prio * smp_num_siblings / (i * i);
- per_cpu(sched_core_priority, cpu) = smt_prio;
- i++;
- }
+ per_cpu(sched_core_priority, cpu) = prio;
}
--
2.25.1
On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> [...]
> I also added a patch from Chen Yu to enable asym_packing balancing in
> Meteor Lake, which has CPUs of different maximum frequency in more than
> one die.
Is the actual topology of Meteor Lake already public? This patch made me
wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
but I can't remember (one of the raisins why the endless calls are such
a frigging waste of time) and I can't seem to find the answer using
Google either.
> Hopefully, these patches are in sufficiently good shape to be merged?
Changelogs are very sparse towards the end and I had to reverse engineer
some of it which is a shame. But yeah, on a first reading the code looks
mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
start to make a little sense. Mostly they utterly fail to answer the
very fundamental "why did you do this" question.
Also, you seem to have forgotten to Cc our friends from IBM such that
they might verify you didn't break their Power7 stuff -- or do you have
a Power7 yourself to verify and forgot to mention that?
> [...]
I'm going to start to queue this and hopefully push out post -rc1 if
nobody objects.
On Sat, 2023-04-29 at 17:32 +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> > [...]
> > I also added a patch from Chen Yu to enable asym_packing balancing in
> > Meteor Lake, which has CPUs of different maximum frequency in more than
> > one die.
>
> Is the actual topology of Meteor Lake already public? This patch made me
> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
> but I can't remember (one of the raisins why the endless calls are such
> a frigging waste of time) and I can't seem to find the answer using
> Google either.
There are a bunch of fixes that are needed for SCHED_CLUSTER to work
properly on hybrid_topology. I'll clean them up and post them on
top of Ricardo's current patch set this week.
Tim
On Sat, Apr 29, 2023 at 05:32:19PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> > [...]
> > I also added a patch from Chen Yu to enable asym_packing balancing in
> > Meteor Lake, which has CPUs of different maximum frequency in more than
> > one die.
>
> Is the actual topology of Meteor Lake already public? This patch made me
> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
Indeed, Meteor Lake will need SCHED_CLUSTER as does Alder Lake. This is in
addition to multi-die support.
> but I can't remember (one of the raisins why the endless calls are such
> a frigging waste of time) and I can't seem to find the answer using
> Google either.
>
> > Hopefully, these patches are in sufficiently good shape to be merged?
>
> Changelogs are very sparse towards the end and I had to reverse engineer
> some of it which is a shame. But yeah, on a first reading the code looks
> mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
> start to make a little sense. Mostly they utterly fail to answer the
> very fundamental "why did you do this" question.
I am sorry changelogs are not sufficiently clear. I thought stating the
overall goal in the cover letter was enough. In the future, would you
prefer that I repeat the cover letter instead of referring to it? Should
individual changelogs state the overall goal?
>
> Also, you seem to have forgotten to Cc our friends from IBM such that
> they might verify you didn't break their Power7 stuff -- or do you have
> a Power7 yourself to verify and forgot to mention that?
I do not have a Power7 system. I did emulate it on an x86 system by
setting all cores with identical sg->asym_prefer_cpu. Within cores, SMT
siblings had asymmetric priorities. It was only SMT2, though.
>
> > [...]
>
> I'm going to start to queue this and hopefully push out post -rc1 if
> nobody objects.
Thanks! Will it be content for v6.4 or v6.5?
BR,
Ricardo
On Mon, 1 May 2023 18:42:55 -0700
Ricardo Neri <[email protected]> wrote:
> I am sorry changelogs are not sufficiently clear. I thought stating the
> overall goal in the cover letter was enough. In the future, would you
> prefer that I repeat the cover letter instead of referring to it? Should
> individual changelogs state the overall goal?
Yes. Every commit should have a change log that explains why that commit
was done without having to look elsewhere.
The cover letter should be the summary of what the patches do. But 5 to 10
years from now, when a git bisect comes across a commit, there's no
guarantee that a cover letter will be easily found. The change log may be
the only thing a developer debugging some code will have to understand what
the change was for.
I'm guilty of having poor change logs which I myself suffered from, as I
don't remember what I meant. So I've been more adamant on adding more
detail to my change logs which has saved me a few times.
-- Steve
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 40b4d3dc328265c8ec6688657d74813edf785c83
Gitweb: https://git.kernel.org/tip/40b4d3dc328265c8ec6688657d74813edf785c83
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:44 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:36 +02:00
sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
Do not assume that all the children of a scheduling domain have a given
flag. Check whether the flag has the SDF_SHARED_CHILD meta flag. Otherwise,
removing SD_ASYM_PACKING from the SMT domain on x86 would wrongly leave
sd_asym_packing NULL on SMT cores.
Suggested-by: Ionela Voinescu <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/sched.h | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0..6784462 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1772,6 +1772,13 @@ queue_balance_callback(struct rq *rq,
for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
__sd; __sd = __sd->parent)
+/* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */
+#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
+static const unsigned int SD_SHARED_CHILD_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
/**
* highest_flag_domain - Return highest sched_domain containing flag.
* @cpu: The CPU whose highest level of sched domain is to
@@ -1779,16 +1786,25 @@ queue_balance_callback(struct rq *rq,
* @flag: The flag to check for the highest sched_domain
* for the given CPU.
*
- * Returns the highest sched_domain of a CPU which contains the given flag.
+ * Returns the highest sched_domain of a CPU which contains @flag. If @flag has
+ * the SDF_SHARED_CHILD metaflag, all the children domains also have @flag.
*/
static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
{
struct sched_domain *sd, *hsd = NULL;
for_each_domain(cpu, sd) {
- if (!(sd->flags & flag))
+ if (sd->flags & flag) {
+ hsd = sd;
+ continue;
+ }
+
+ /*
+ * Stop the search if @flag is known to be shared at lower
+ * levels. It will not be found further up.
+ */
+ if (flag & SD_SHARED_CHILD_MASK)
break;
- hsd = sd;
}
return hsd;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 43726bdedd29797d8e1fee2e7300a6d2b9a74ba8
Gitweb: https://git.kernel.org/tip/43726bdedd29797d8e1fee2e7300a6d2b9a74ba8
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:42 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:35 +02:00
sched/fair: Use the busiest group to set prefer_sibling
The prefer_sibling setting acts on the busiest group to move excess tasks
to the local group. This should be done as per request of the child of the
busiest group's sched domain, not the local group's.
Using the flags of the child domain of the local group works fortuitously
if both groups have child domains.
There are cases, however, in which the busiest group's sched domain has a
child but the local group's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING of the busiest group's child
domain will not be honored. We are left with a fully busy SMT core and an
idle non-SMT core.
Suggested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a9f040..3bb8934 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10109,7 +10109,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
@@ -10150,8 +10149,13 @@ next_group:
sg = sg->next;
} while (sg != env->sd->groups);
- /* Tag domain that child domain prefers tasks go to siblings first */
- sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+ /*
+ * Indicate that the child domain of the busiest group prefers tasks
+ * go to a child's sibling domains first. NB the flags of a sched group
+ * are those of the child domain.
+ */
+ if (sds->busiest)
+ sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
if (env->sd->flags & SD_NUMA)
@@ -10461,7 +10465,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto out_balanced;
}
- /* Try to move all excess tasks to child's sibling domain */
+ /*
+ * Try to move all excess tasks to a sibling domain of the busiest
+ * group's child domain.
+ */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 5fd6d7f43958cb62da105c8413eac3e78480f09a
Gitweb: https://git.kernel.org/tip/5fd6d7f43958cb62da105c8413eac3e78480f09a
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:41 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:35 +02:00
sched/fair: Keep a fully_busy SMT sched group as busiest
When comparing two fully_busy scheduling groups, keep the current busiest
group if it represents an SMT core. Tasks in such scheduling group share
CPU resources and need more help than tasks in a non-SMT fully_busy group.
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85ce249..4a9f040 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9619,10 +9619,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* contention when accessing shared HW resources.
*
* XXX for now avg_load is not computed and always 0 so we
- * select the 1st one.
+ * select the 1st one, except if @sg is composed of SMT
+ * siblings.
*/
- if (sgs->avg_load <= busiest->avg_load)
+
+ if (sgs->avg_load < busiest->avg_load)
return false;
+
+ if (sgs->avg_load == busiest->avg_load) {
+ /*
+ * SMT sched groups need more help than non-SMT groups.
+ * If @sg happens to also be SMT, either choice is good.
+ */
+ if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+ return false;
+ }
+
break;
case group_has_spare:
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 046a5a95c3b0425cfe79e43021d8ee90c1c4f8c9
Gitweb: https://git.kernel.org/tip/046a5a95c3b0425cfe79e43021d8ee90c1c4f8c9
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:47 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:38 +02:00
x86/sched/itmt: Give all SMT siblings of a core the same priority
X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/itmt.c | 23 +++++------------------
1 file changed, 5 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 670eb08..ee4fe8c 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -165,32 +165,19 @@ int arch_asym_cpu_priority(int cpu)
/**
* sched_set_itmt_core_prio() - Set CPU priority based on ITMT
- * @prio: Priority of cpu core
- * @core_cpu: The cpu number associated with the core
+ * @prio: Priority of @cpu
+ * @cpu: The CPU number
*
* The pstate driver will find out the max boost frequency
* and call this function to set a priority proportional
- * to the max boost frequency. CPU with higher boost
+ * to the max boost frequency. CPUs with higher boost
* frequency will receive higher priority.
*
* No need to rebuild sched domain after updating
* the CPU priorities. The sched domains have no
* dependency on CPU priorities.
*/
-void sched_set_itmt_core_prio(int prio, int core_cpu)
+void sched_set_itmt_core_prio(int prio, int cpu)
{
- int cpu, i = 1;
-
- for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
- int smt_prio;
-
- /*
- * Ensure that the siblings are moved to the end
- * of the priority chain and only used when
- * all other high priority cpus are out of capacity.
- */
- smt_prio = prio * smp_num_siblings / (i * i);
- per_cpu(sched_core_priority, cpu) = smt_prio;
- i++;
- }
+ per_cpu(sched_core_priority, cpu) = prio;
}
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c9ca07886aaa40225a29e5c1e46ac31d2e14f53a
Gitweb: https://git.kernel.org/tip/c9ca07886aaa40225a29e5c1e46ac31d2e14f53a
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:43 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:36 +02:00
sched/fair: Do not even the number of busy CPUs via asym_packing
Now that find_busiest_group() triggers load balancing between a fully_busy
SMT2 core and an idle non-SMT core, it is no longer needed to force balancing
via asym_packing. Use asym_packing only as intended: when there is a
high-priority CPU that is idle.
After this change, the same logic applies to SMT and non-SMT local groups. It
makes little sense to have a separate function to deal specifically with SMT.
Fold the logic of asym_smt_can_pull_tasks() into sched_asym().
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 86 ++++++++++----------------------------------
1 file changed, 21 insertions(+), 65 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bb8934..48b6f0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9350,74 +9350,26 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
}
/**
- * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
- * @dst_cpu: Destination CPU of the load balancing
+ * sched_asym - Check if the destination CPU can do asym_packing load balance
+ * @env: The load balancing environment
* @sds: Load-balancing data with statistics of the local group
* @sgs: Load-balancing statistics of the candidate busiest group
- * @sg: The candidate busiest group
+ * @group: The candidate busiest group
*
- * Check the state of the SMT siblings of both @sds::local and @sg and decide
- * if @dst_cpu can pull tasks.
+ * @env::dst_cpu can do asym_packing if it has higher priority than the
+ * preferred CPU of @group.
*
- * This function must be called only if all the SMT siblings of @dst_cpu are
- * idle, if any.
+ * SMT is a special case. If we are balancing load between cores, @env::dst_cpu
+ * can do asym_packing balance only if all its SMT siblings are idle. Also, it
+ * can only do it if @group is an SMT group and has exactly one busy CPU. Larger
+ * imbalances in the number of CPUs are dealt with in find_busiest_group().
*
- * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
- * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
- * only if @dst_cpu has higher priority.
+ * If we are balancing load within an SMT core, or at DIE domain level, always
+ * proceed.
*
- * When dealing with SMT cores, only use priorities if the SMT core has exactly
- * one busy sibling. find_busiest_group() will handle bigger imbalances in the
- * number of busy CPUs.
- *
- * Return: true if @dst_cpu can pull tasks, false otherwise.
+ * Return: true if @env::dst_cpu can do asym_packing load balance. False
+ * otherwise.
*/
-static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
- struct sg_lb_stats *sgs,
- struct sched_group *sg)
-{
-#ifdef CONFIG_SCHED_SMT
- bool local_is_smt;
- int sg_busy_cpus;
-
- local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
- sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
-
- if (!local_is_smt) {
- /*
- * If we are here, @dst_cpu is idle and does not have SMT
- * siblings. Pull tasks if candidate group has two or more
- * busy CPUs.
- */
- if (sg_busy_cpus >= 2) /* implies sg_is_smt */
- return true;
-
- /*
- * @dst_cpu does not have SMT siblings. @sg may have SMT
- * siblings and only one is busy. In such case, @dst_cpu
- * can help if it has higher priority and is idle (i.e.,
- * it has no running tasks).
- */
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
- }
-
- /*
- * If we are here, @dst_cpu has SMT siblings and they are also idle.
- *
- * CPU priorities do not make sense for SMT cores with more than one
- * busy sibling.
- */
- if (sg->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
- return false;
-
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-#else
- /* Always return false so that callers deal with non-SMT cases. */
- return false;
-#endif
-}
-
static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
@@ -9426,10 +9378,14 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
if (!sched_use_asym_prio(env->sd, env->dst_cpu))
return false;
- /* Only do SMT checks if either local or candidate have SMT siblings. */
- if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
- (group->flags & SD_SHARE_CPUCAPACITY))
- return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+ /*
+ * CPU priorities do not make sense for SMT cores with more than one
+ * busy sibling.
+ */
+ if (group->flags & SD_SHARE_CPUCAPACITY) {
+ if (sgs->group_weight - sgs->idle_cpus != 1)
+ return false;
+ }
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}
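The folded check is small enough to restate outside the kernel. In this
sketch the kernel helpers are replaced by caller-supplied values; the names
are illustrative only:

#include <stdbool.h>

struct stats { unsigned int group_weight, idle_cpus; };

/*
 * Sketch of the folded sched_asym() logic: within an SMT candidate group,
 * follow priorities only when exactly one CPU is busy; otherwise fall
 * through to the plain priority comparison.
 */
static bool sched_asym_sketch(const struct stats *sgs, bool group_is_smt,
                              bool dst_has_higher_prio)
{
        if (group_is_smt && sgs->group_weight - sgs->idle_cpus != 1)
                return false;

        return dst_has_higher_prio;
}

int main(void)
{
        struct stats two_busy = { .group_weight = 2, .idle_cpus = 0 };

        /* SMT group with two busy siblings: priorities are not used. */
        return sched_asym_sketch(&two_busy, true, true) ? 1 : 0;
}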
The following commit has been merged into the sched/core branch of tip:
Commit-ID: ca528cc501896a808dc79c3c0544369d23b331c8
Gitweb: https://git.kernel.org/tip/ca528cc501896a808dc79c3c0544369d23b331c8
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:45 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:37 +02:00
sched/topology: Remove SHARED_CHILD from ASYM_PACKING
Only x86 and Power7 use ASYM_PACKING, but they use it differently.
Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.
X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.
Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched/sd_flags.h | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66..fad77b5 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
/*
* Place busy tasks earlier in the domain
*
- * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- * up, but currently assumed to be set from the base domain
- * upwards (see update_top_cache_domain()).
* NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
/*
* Prefer to place tasks in a sibling domain
The following commit has been merged into the sched/core branch of tip:
Commit-ID: ef7657d4d2d6a8456aa624010de456c32a135fe9
Gitweb: https://git.kernel.org/tip/ef7657d4d2d6a8456aa624010de456c32a135fe9
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:39 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:34 +02:00
sched/fair: Simplify asym_packing logic for SMT cores
Callers of asym_smt_can_pull_tasks() check the idle state of the
destination CPU and its SMT siblings, if any. No extra checks are needed
in this function.
Since SMT cores divide capacity among their siblings, priorities only really
make sense if only one sibling is active. This is true for SMT2, SMT4,
SMT8, etc. Do not use asym_packing load balance for this case. Instead,
let find_busiest_group() handle imbalances.
When balancing non-SMT cores or at higher scheduling domains (e.g.,
between MC scheduling groups), continue using priorities.
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 33 ++++++++++++---------------------
1 file changed, 12 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 713d03e..a8a02ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9366,12 +9366,9 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
*
- * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
- * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
- * Bigger imbalances in the number of busy CPUs will be dealt with in
- * update_sd_pick_busiest().
- *
- * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
+ * When dealing with SMT cores, only use priorities if the SMT core has exactly
+ * one busy sibling. find_busiest_group() will handle bigger imbalances in the
+ * number of busy CPUs.
*
* Return: true if @dst_cpu can pull tasks, false otherwise.
*/
@@ -9380,12 +9377,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
struct sched_group *sg)
{
#ifdef CONFIG_SCHED_SMT
- bool local_is_smt, sg_is_smt;
+ bool local_is_smt;
int sg_busy_cpus;
local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
- sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
if (!local_is_smt) {
@@ -9406,21 +9401,17 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
}
- /* @dst_cpu has SMT siblings. */
-
- if (sg_is_smt) {
- int local_busy_cpus = sds->local->group_weight -
- sds->local_stat.idle_cpus;
- int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
- if (busy_cpus_delta == 1)
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
+ /*
+ * If we are here, @dst_cpu has SMT siblings and they are also idle.
+ *
+ * CPU priorities do not make sense for SMT cores with more than one
+ * busy sibling.
+ */
+ if (sg->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
return false;
- }
- /* If we are here, @dst_cpu has SMT siblings and they are also idle. */
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
#else
/* Always return false so that callers deal with non-SMT cases. */
return false;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 18ad34532755feb5b9f4284b07769b1bfec18ab3
Gitweb: https://git.kernel.org/tip/18ad34532755feb5b9f4284b07769b1bfec18ab3
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:40 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:35 +02:00
sched/fair: Let low-priority cores help high-priority busy SMT cores
Using asym_packing priorities within an SMT core is straightforward. Just
follow the priorities that hardware indicates.
When balancing load from an SMT core, also consider the idle state of its
siblings. Priorities do not reflect that an SMT core divides its throughput
among all its busy siblings. They only make sense when exactly one sibling
is busy.
Indicate that active balance is needed if the destination CPU has lower
priority than the source CPU but the latter has busy SMT siblings.
Make find_busiest_queue() not skip higher-priority SMT cores with more than
one busy sibling.
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8a02ae..85ce249 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10551,8 +10551,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
nr_running == 1)
continue;
- /* Make sure we only pull tasks from a CPU of lower priority */
+ /*
+ * Make sure we only pull tasks from a CPU of lower priority
+ * when balancing between SMT siblings.
+ *
+ * If balancing between cores, let lower priority CPUs help
+ * SMT cores with more than one busy sibling.
+ */
if ((env->sd->flags & SD_ASYM_PACKING) &&
+ sched_use_asym_prio(env->sd, i) &&
sched_asym_prefer(i, env->dst_cpu) &&
nr_running == 1)
continue;
@@ -10645,10 +10652,15 @@ asym_active_balance(struct lb_env *env)
* priority CPUs in order to pack all tasks in the highest priority
* CPUs. When done between cores, do it only if the whole core if the
* whole core is idle.
+ *
+ * If @env::src_cpu is an SMT core with busy siblings, let
+ * the lower priority @env::dst_cpu help it. Do not follow
+ * CPU priority.
*/
return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
sched_use_asym_prio(env->sd, env->dst_cpu) &&
- sched_asym_prefer(env->dst_cpu, env->src_cpu);
+ (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+ !sched_use_asym_prio(env->sd, env->src_cpu));
}
static inline bool
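The extended condition can be paraphrased as follows; each boolean stands in
for one of the kernel predicates in the hunk above (a sketch, not kernel
code):

#include <stdbool.h>

/*
 * Sketch of the extended asym_active_balance() test. use_prio_dst stands
 * for sched_use_asym_prio(sd, dst_cpu), use_prio_src for
 * sched_use_asym_prio(sd, src_cpu), and dst_prefers_src for
 * sched_asym_prefer(dst_cpu, src_cpu).
 */
static bool asym_active_balance_sketch(bool dst_idle, bool asym_packing,
                                       bool use_prio_dst, bool dst_prefers_src,
                                       bool use_prio_src)
{
        /*
         * Balance when @dst has higher priority, or when @src is an SMT
         * core with busy siblings, in which case its priority is ignored.
         */
        return dst_idle && asym_packing && use_prio_dst &&
               (dst_prefers_src || !use_prio_src);
}

int main(void)
{
        /* A lower-priority dst helping a busy SMT src core: balance. */
        return asym_active_balance_sketch(true, true, true, false, false) ? 0 : 1;
}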
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8b36d07f1d63de102d464f44a89704bc62d00811
Gitweb: https://git.kernel.org/tip/8b36d07f1d63de102d464f44a89704bc62d00811
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:37 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:33 +02:00
sched/fair: Move is_core_idle() out of CONFIG_NUMA
asym_packing needs this function to determine whether an SMT core is a
suitable destination for load balancing.
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f..a47208d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,23 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ int sibling;
+
+ for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+ if (cpu == sibling)
+ continue;
+
+ if (!idle_cpu(sibling))
+ return false;
+ }
+#endif
+
+ return true;
+}
+
#ifdef CONFIG_NUMA
#define NUMA_IMBALANCE_MIN 2
@@ -1700,23 +1717,6 @@ struct numa_stats {
int idle_cpu;
};
-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
- int sibling;
-
- for_each_cpu(sibling, cpu_smt_mask(cpu)) {
- if (cpu == sibling)
- continue;
-
- if (!idle_cpu(sibling))
- return false;
- }
-#endif
-
- return true;
-}
-
struct task_numa_env {
struct task_struct *p;
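As an aside, the semantics of is_core_idle() are easy to model outside the
kernel. A toy sketch with a hard-coded sibling map and idle states (both made
up):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static bool cpu_is_idle[NR_CPUS] = { true, false, true, true };

/* Stand-in for cpu_smt_mask(): CPUs 0-1 form one core, CPUs 2-3 another. */
static int smt_sibling(int cpu)
{
        return cpu ^ 1;
}

/* A core is idle only if every sibling other than @cpu is idle. */
static bool is_core_idle(int cpu)
{
        return cpu_is_idle[smt_sibling(cpu)];
}

int main(void)
{
        printf("core of CPU 0 idle: %d\n", is_core_idle(0)); /* 0: CPU 1 busy */
        printf("core of CPU 2 idle: %d\n", is_core_idle(2)); /* 1: CPU 3 idle */
        return 0;
}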
The following commit has been merged into the sched/core branch of tip:
Commit-ID: eefefa716c9fa6aa73159f09954b7eeba4cafd09
Gitweb: https://git.kernel.org/tip/eefefa716c9fa6aa73159f09954b7eeba4cafd09
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:38 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:34 +02:00
sched/fair: Only do asym_packing load balancing from fully idle SMT cores
When balancing load between cores, all the SMT siblings of the destination
CPU, if any, must be idle. Otherwise, pulling new tasks degrades the
throughput of the busy SMT siblings. The overall throughput of the system
remains the same.
When balancing load within an SMT core this consideration is not relevant.
Follow the priorities that hardware indicates.
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++-------------
1 file changed, 40 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a47208d..713d03e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9331,6 +9331,25 @@ group_type group_classify(unsigned int imbalance_pct,
}
/**
+ * sched_use_asym_prio - Check whether asym_packing priority must be used
+ * @sd: The scheduling domain of the load balancing
+ * @cpu: A CPU
+ *
+ * Always use CPU priority when balancing load between SMT siblings. When
+ * balancing load between cores, it is not sufficient that @cpu is idle. Only
+ * use CPU priority if the whole core is idle.
+ *
+ * Returns: True if the priority of @cpu must be followed. False otherwise.
+ */
+static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
+{
+ if (!sched_smt_active())
+ return true;
+
+ return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+}
+
+/**
* asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
* @dst_cpu: Destination CPU of the load balancing
* @sds: Load-balancing data with statistics of the local group
@@ -9340,6 +9359,9 @@ group_type group_classify(unsigned int imbalance_pct,
* Check the state of the SMT siblings of both @sds::local and @sg and decide
* if @dst_cpu can pull tasks.
*
+ * This function must be called only if all the SMT siblings of @dst_cpu are
+ * idle, if any.
+ *
* If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
@@ -9349,8 +9371,7 @@ group_type group_classify(unsigned int imbalance_pct,
* Bigger imbalances in the number of busy CPUs will be dealt with in
* update_sd_pick_busiest().
*
- * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
- * of @dst_cpu are idle and @sg has lower priority.
+ * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
*
* Return: true if @dst_cpu can pull tasks, false otherwise.
*/
@@ -9398,15 +9419,8 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
return false;
}
- /*
- * @sg does not have SMT siblings. Ensure that @sds::local does not end
- * up with more than one busy SMT sibling and only pull tasks if there
- * are not busy CPUs (i.e., no CPU has running tasks).
- */
- if (!sds->local_stat.sum_nr_running)
- return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
- return false;
+ /* If we are here, @dst_cpu has SMT siblings and they are also idle. */
+ return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
#else
/* Always return false so that callers deal with non-SMT cases. */
return false;
@@ -9417,7 +9431,11 @@ static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- /* Only do SMT checks if either local or candidate have SMT siblings */
+ /* Ensure that the whole local core is idle, if applicable. */
+ if (!sched_use_asym_prio(env->sd, env->dst_cpu))
+ return false;
+
+ /* Only do SMT checks if either local or candidate have SMT siblings. */
if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
(group->flags & SD_SHARE_CPUCAPACITY))
return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
@@ -10632,11 +10650,13 @@ static inline bool
asym_active_balance(struct lb_env *env)
{
/*
- * ASYM_PACKING needs to force migrate tasks from busy but
- * lower priority CPUs in order to pack all tasks in the
- * highest priority CPUs.
+ * ASYM_PACKING needs to force migrate tasks from busy but lower
+ * priority CPUs in order to pack all tasks in the highest priority
+ * CPUs. When done between cores, do it only if the whole core if the
+ * whole core is idle.
*/
return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
+ sched_use_asym_prio(env->sd, env->dst_cpu) &&
sched_asym_prefer(env->dst_cpu, env->src_cpu);
}
@@ -11371,9 +11391,13 @@ static void nohz_balancer_kick(struct rq *rq)
* When ASYM_PACKING; see if there's a more preferred CPU
* currently idle; in which case, kick the ILB to move tasks
* around.
+ *
+ * When balancing between cores, all the SMT siblings of the
+ * preferred CPU must be idle.
*/
for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
- if (sched_asym_prefer(i, cpu)) {
+ if (sched_use_asym_prio(sd, i) &&
+ sched_asym_prefer(i, cpu)) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
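The rule sched_use_asym_prio() implements can be restated in a few lines. In
this sketch the kernel helpers are replaced by booleans supplied by the
caller (illustrative only):

#include <stdbool.h>

/*
 * Sketch of sched_use_asym_prio(): smt_active stands for
 * sched_smt_active(), balancing_siblings for sd->flags &
 * SD_SHARE_CPUCAPACITY, and core_idle for is_core_idle(cpu).
 */
static bool use_asym_prio(bool smt_active, bool balancing_siblings,
                          bool core_idle)
{
        /* Without SMT, every CPU is its own core: always use priorities. */
        if (!smt_active)
                return true;

        /*
         * Between SMT siblings, always follow priorities; between cores,
         * only when the whole destination core is idle.
         */
        return balancing_siblings || core_idle;
}

int main(void)
{
        /* Balancing between cores with a partially busy core: do not. */
        return use_asym_prio(true, false, false) ? 1 : 0;
}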
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 044f0e27dec6e30bb8875a4a12c5f2594964e93f
Gitweb: https://git.kernel.org/tip/044f0e27dec6e30bb8875a4a12c5f2594964e93f
Author: Chen Yu <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:48 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:38 +02:00
x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have higher maximum frequency. Use the
SD_ASYM_PACKING flag to give higher priority to the die with CPUs of higher
maximum frequency.
Suggested-by: Ricardo Neri <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/smpboot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a335abd..34066f6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -583,7 +583,7 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(DIE) },
{ NULL, },
};
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 995998ebdebd09b85c28cc46068d8a0744113837
Gitweb: https://git.kernel.org/tip/995998ebdebd09b85c28cc46068d8a0744113837
Author: Ricardo Neri <[email protected]>
AuthorDate: Thu, 06 Apr 2023 13:31:46 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 08 May 2023 10:58:37 +02:00
x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/smpboot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce..a335abd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -552,7 +552,7 @@ static int x86_core_flags(void)
#ifdef CONFIG_SCHED_SMT
static int x86_smt_flags(void)
{
- return cpu_smt_flags() | x86_sched_itmt_flags();
+ return cpu_smt_flags();
}
#endif
#ifdef CONFIG_SCHED_CLUSTER
On 4/29/23 9:02 PM, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
>> [...]
>> I also added a patch from Chen Yu to enable asym_packing balancing in
>> Meteor Lake, which has CPUs of different maximum frequency in more than
>> one die.
>
> Is the actual topology of Meteor Lake already public? This patch made me
> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
> but I can't remember (one of the raisins why the endless calls are such
> a frigging waste of time) and I can't seem to find the answer using
> Google either.
>
>> Hopefully, these patches are in sufficiently good shape to be merged?
>
> Changelogs are very sparse towards the end and I had to reverse engineer
> some of it which is a shame. But yeah, on a first reading the code looks
> mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
> start to make a little sense. Mostly they utterly fail to answer the
> very fundament "why did you do this" question.
>
> Also, you seem to have forgotten to Cc our friends from IBM such that
> they might verify you didn't break their Power7 stuff -- or do you have
> a Power7 yourself to verify and forgot to mention that?
Very good patch series addressing asym_packing, and interesting discussions
as well. It took me quite some time to get through it, understand it, and do
a little bit of testing.
Tested this patch series a bit on Power7 with qemu, with SMT=4. The sched
domains show ASYM_PACKING present only for the SMT domain.
We don't see any regressions or gains due to the patches. SMT priorities are
honored when tasks are scheduled and load balanced.
>
>> [...]
>
> I'm going to start to queue this and hopefully push out post -rc1 if
> nobody objects.
On 4/7/23 2:01 AM, Ricardo Neri wrote:
> When comparing two fully_busy scheduling groups, keep the current busiest
> group if it represents an SMT core. Tasks in such scheduling group share
> CPU resources and need more help than tasks in a non-SMT fully_busy group.
>
> [...]
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b151e93ec316..ea23a5163bfa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9566,10 +9566,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * contention when accessing shared HW resources.
> *
> * XXX for now avg_load is not computed and always 0 so we
> - * select the 1st one.
> + * select the 1st one, except if @sg is composed of SMT
> + * siblings.
> */
> - if (sgs->avg_load <= busiest->avg_load)
> +
> + if (sgs->avg_load < busiest->avg_load)
> return false;
> +
> + if (sgs->avg_load == busiest->avg_load) {
> + /*
> + * SMT sched groups need more help than non-SMT groups.
> + * If @sg happens to also be SMT, either choice is good.
> + */
> + if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> + return false;
> + }
> +
> break;
IIUC,
Earlier, we used to go to out_balanced if sgs->avg_load <= busiest->avg_load.
Now we go only if it is strictly less. Let's say sgs->avg_load ==
busiest->avg_load; then we will return true in the MC and DIE domains. This
might end up traversing multiple such groups and picking the last one as the
busiest instead of the first. I guess any existing imbalance will eventually
be fixed, but wouldn't this cause slight overhead?
nit: There is a typo in [2/12]: "if the whole core" is repeated.
+ * CPUs. When done between cores, do it only if the whole core if the
+ * whole core is idle.
Mentioning it in this reply instead, to avoid sending another mail for this.
>
> case group_has_spare:
On Sat, May 13, 2023 at 12:11:45AM +0530, Shrikanth Hegde wrote:
>
>
> On 4/7/23 2:01 AM, Ricardo Neri wrote:
> > [...]
>
Thank you very much for your review!
> IIUC,
>
> Earlier, we used to go to out_balanced if sgs->avg_load <= busiest->avg_load.
> Now we go only if it is strictly less.
In this particular case we are comparing two fully_busy groups. Both
sgs->avg_load and busiest->avg_load are equal to 0.
> Let's say sgs->avg_load == busiest->avg_load; then we will return true in
> the MC and DIE domains. This might end up traversing multiple such groups
> and picking the last one as the busiest instead of the first.
Yes, that is correct. But we traverse all sched groups from
update_sd_lb_stats() anyway. We are here because both sgs and busiest are
of type fully_busy and we need to break a tie. Previously we always kept
on selecting sgs as busiest.
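To illustrate, here is a toy walk over three equally loaded fully_busy groups
(layout made up): once an SMT group becomes busiest it is kept, rather than
the traversal always drifting to the last group:

#include <stdbool.h>
#include <stdio.h>

struct group { const char *name; bool smt; };

int main(void)
{
        /* Three fully_busy groups with equal (zero) avg_load. */
        struct group groups[] = { { "A", false }, { "B", true }, { "C", false } };
        int busiest = 0, i;

        for (i = 1; i < 3; i++) {
                /*
                 * Equal-load path: keep the current busiest only if it is
                 * an SMT core; otherwise the candidate replaces it.
                 */
                if (!groups[busiest].smt)
                        busiest = i;
        }

        printf("busiest: %s\n", groups[busiest].name);  /* prints "B" */
        return 0;
}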
> I guess any existing imbalance will eventually be fixed, but wouldn't this
> cause slight overhead?
>
>
>
> nit: There is a typo in [2/12]: "if the whole core" is repeated.
> + * CPUs. When done between cores, do it only if the whole core if the
> + * whole core is idle.
>
> Mentioning it in this reply instead, to avoid sending another mail for this.
Ah! I read my patches dozens of times and I still missed this. Thank you
for noticing. I will post a trivial patch to fix it.
Thanks and BR,
Ricardo
On Fri, May 12, 2023 at 11:53:48PM +0530, Shrikanth Hegde wrote:
>
>
> On 4/29/23 9:02 PM, Peter Zijlstra wrote:
> > [...]
>
> Very good patch series addressing asym_packing, and interesting discussions
> as well. It took me quite some time to get through it, understand it, and do
> a little bit of testing.
>
> Tested this patch series a bit on Power7 with qemu, with SMT=4. The sched
> domains show ASYM_PACKING present only for the SMT domain.
>
> We don't see any regressions or gains due to the patches. SMT priorities are
> honored when tasks are scheduled and load balanced.
Thank you very much for your review and testing! Would you mind sharing the
qemu command you use? I would like to test my future patches on power7.
BR,
Ricardo
On 5/19/23 5:33 AM, Ricardo Neri wrote:
> On Fri, May 12, 2023 at 11:53:48PM +0530, Shrikanth Hegde wrote:
>> [...]
>
> Thank you very much for your review and testing! Would you mind sharing the
> qemu command you use? I would like to test my future patches on power7.
Sure. I tried this qemu command on a Power9 system; it boots in Power7 compat
mode. This simulates 4 cores with SMT4. You need to have CONFIG_POWER7_CPU=y
and CONFIG_CPU_BIG_ENDIAN=y and the right qcow2 image.
qemu-system-ppc64 -M pseries,accel=kvm,max-cpu-compat=power7 -m 8192 -smp cores=4,threads=4 -enable-kvm -drive file=<qcow2_image>,format=qcow2 -vga none -nographic -net nic -net user,hostfwd=tcp::2222-:22 -kernel ./<custom_vmlinux> --append "noautotest root=/dev/sda2"
>
> BR,
> Ricardo