2021-05-13 22:36:08

by Ricardo Neri

Subject: [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING

This is v3 of the series. The v1 and v2 patches and test results can be found
in [1] and [2], respectively.

=== Problem statement ===
On SMT-enabled hardware, ASYM_PACKING can cause the load balancer to choose
low priority CPUs over medium priority CPUs.

When balancing load in scheduling domains with the SD_ASYM_PACKING flag,
idle CPUs of higher priority pull tasks from CPUs of lower priority. This
balancing is done by comparing pairs of scheduling groups. There may also
be scheduling groups composed of CPUs that are SMT siblings.

When using SD_ASYM_PACKING on x86 with Intel Turbo Boost Max Technology 3.0
(ITMT), the CPUs of a scheduling group with N SMT siblings are assigned the
priorities N*P, N*P/2, N*P/3, ..., P, where P is the priority assigned by the
hardware to the physical core and N is the number of SMT siblings.
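
(As an illustration only: the helper below is hypothetical, but the real
assignment is done by sched_set_itmt_core_prio() in arch/x86/kernel/itmt.c
and follows the same formula.)

    /*
     * Hypothetical sketch of ITMT-style sibling priorities: the i-th
     * SMT sibling (1-based) of a core with hardware priority core_prio
     * gets core_prio * nr_siblings / i, i.e. N*P, N*P/2, ..., P.
     */
    int smt_sibling_priority(int core_prio, int nr_siblings, int i)
    {
        return core_prio * nr_siblings / i;
    }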

Systems such as Intel Comet Lake can have some cores supporting SMT, while
others do not. As a result, it is possible to have medium non-SMT
priorities, Q, such that N*P > Q > P.
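For instance (values purely illustrative), an SMT core with hardware priority
P = 60 and N = 2 gets sibling priorities 120 and 60, while a non-SMT core may
be assigned a medium priority Q = 80, so that N*P > Q > P holds.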

When comparing groups for load balancing, the priority of the CPU doing the
load balancing is only compared with that of the preferred CPU of the
candidate busiest group (N*P vs Q in the example above). Thus, a scheduling
group whose preferred CPU has priority N*P always pulls tasks from the
scheduling group with priority Q, and such tasks are then spread across the
"SMT" domain. Conversely, since N*P > Q, CPUs with priority Q cannot pull
tasks from a group whose preferred CPU has priority N*P, even though Q > P.
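
For reference, that comparison is sched_asym_prefer(); in v5.12 it is defined
in kernel/sched/sched.h as:

    static inline bool sched_asym_prefer(int a, int b)
    {
        return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
    }

With only this check, a group whose preferred CPU has priority N*P can never
be the target of a pull by a CPU of priority Q, no matter how busy the
group's lower priority SMT siblings are.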

Balancing based on load (i.e., when the busiest group is of type
group_overloaded) does not help if the system is not fully busy, as the
involved groups will have only one task each and load balancing will not be
deemed necessary.

The behavior described above leaves CPUs with medium priority idle while CPUs
with lower priority are busy. Furthermore, those lower priority CPUs are SMT
siblings of high priority CPUs, which are also busy.

This patchset fixes this behavior by also checking the idle state of the SMT
siblings of both the CPU doing the load balancing and the busiest candidate
group, as summarized below.
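
Condensed, the rules implemented by patch 5/6 for deciding whether dst_cpu
can pull tasks from the busiest candidate group (@sg) are:

    dst_cpu      @sg       pull tasks when
    -----------  --------  ----------------------------------------------
    non-SMT      SMT       @sg has two or more busy CPUs; with one busy
                           CPU, only if dst_cpu is idle and has higher
                           priority
    SMT          SMT       @sg has at least two more busy CPUs than the
                           local group; with a difference of one, only if
                           dst_cpu has higher priority
    SMT          non-SMT   all SMT siblings of dst_cpu are idle and @sg
                           has lower priority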

I ran a few benchmarks with and without this version of the patchset on an
Intel(R) Core(TM) i9-7900X CPU. I kept both SMT siblings of two high priority
cores online, offlined the lower priority SMT siblings of four low priority
cores, and offlined the rest of the cores. The resulting priority
configuration meets the N*P > Q > P condition described above.

The baseline for the results is an unmodified v5.12 kernel. Results
show a comparative percentage of improvement (positive) or degradation
(negative). Each test case is repeated five times, and the standard
deviation among repetitions is also documented.

In order to judge only the improvements this patchset provides, Table 1
shows the results when setting the CPU's frequency at 1000MHz. It can be
observed that the patches bring an overall positive impact for tests
that exhibit low variance. Some of the test cases, however, show a high
variance and it is difficult to assess the impact. Still, a few test
cases with low variance exhibit slight degradation.

Table 2 shows the results when using hardware-controlled performance states
(HWP), a common use case. In this case, the impact of the patches is also
overall positive, with a few test cases showing slight degradation.

Thanks and BR,
Ricardo

========
Changes since v2:
* Removed arch_sched_asym_prefer_early() and re-evaluation of the
candidate busiest group in update_sd_pick_busiest(). (PeterZ)
* Introduced sched_group::flags to reflect the properties of CPUs
in the scheduling group. This helps to identify scheduling groups
whose CPUs are SMT siblings. (PeterZ)
* Modified update_sg_lb_stats() to get statistics of the scheduling
domain as an argument. This provides it with the statistics of the
local domain. (PeterZ)
* Introduced sched_asym() to decide if a busiest candidate group can
be marked for asymmetric packing.
* Reworded patch 1 for clarity. (PeterZ)

========
Changes since v1:
* Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
tasks. Instead, reclassify the candidate busiest group, as it
may still be selected. (PeterZ)
* Avoid an expensive and unnecessary call to cpumask_weight() when
determining if a sched_group is comprised of SMT siblings.
(PeterZ).
* Updated test results using the v2 patches.

======== Table 1. Test results of patches at 1000MHz ========
=======================================================================
hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            group-2          1.00 (  4.17)   +0.41 (  1.35)
process-pipe            group-4          1.00 (  5.26)   +4.02 (  3.80)
process-pipe            group-8          1.00 ( 12.57)   +1.34 ( 16.28)
process-sockets         group-2          1.00 (  1.04)   +0.83 (  0.58)
process-sockets         group-4          1.00 (  0.57)   -0.38 (  1.13)
process-sockets         group-8          1.00 (  6.39)   +1.15 (  3.44)
threads-pipe            group-2          1.00 (  2.50)   +1.47 (  1.78)
threads-pipe            group-4          1.00 (  5.55)   +2.68 (  4.88)
threads-pipe            group-8          1.00 ( 12.85)   -0.89 ( 10.56)
threads-sockets         group-2          1.00 (  2.11)   +1.77 (  0.77)
threads-sockets         group-4          1.00 (  0.70)   +1.11 (  0.92)
threads-sockets         group-8          1.00 (  1.20)   -1.79 (  1.88)

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  thread-2         1.00 (  0.72)   -1.42 (  0.45)
TCP_RR                  thread-4         1.00 (  0.38)   +4.57 (  0.44)
TCP_RR                  thread-8         1.00 (  9.29)   +1.09 ( 15.02)
TCP_RR                  thread-16        1.00 ( 21.18)   +0.88 ( 23.62)
UDP_RR                  thread-2         1.00 (  0.23)   -2.41 (  0.49)
UDP_RR                  thread-4         1.00 (  0.19)   +4.54 (  0.37)
UDP_RR                  thread-8         1.00 (  6.83)   -0.78 (  7.70)
UDP_RR                  thread-16        1.00 ( 20.85)   -0.00 ( 19.52)

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                thread-2         1.00 (  0.15)   -1.48 (  0.15)
loopback                thread-4         1.00 (  0.25)   +3.65 (  0.36)
loopback                thread-8         1.00 (  0.16)   +0.74 (  0.28)
loopback                thread-16        1.00 (  0.32)   +0.85 (  0.32)

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  mthread-2        1.00 (  0.10)   -0.05 (  0.00)
normal                  mthread-4        1.00 (  0.00)   +0.00 (  0.00)
normal                  mthread-8        1.00 (  4.81)   +5.57 (  2.61)

======== Table 2. Test results of patches with HWP ========
=======================================================================

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            group-2          1.00 (  2.59)   +2.93 (  2.10)
process-pipe            group-4          1.00 (  4.05)   +5.82 (  4.55)
process-pipe            group-8          1.00 (  7.37)   -0.09 ( 15.19)
process-sockets         group-2          1.00 (  1.64)   -1.48 (  1.61)
process-sockets         group-4          1.00 (  0.95)   -1.97 (  1.44)
process-sockets         group-8          1.00 (  1.84)   -0.64 (  2.08)
threads-pipe            group-2          1.00 (  3.02)   -0.09 (  1.70)
threads-pipe            group-4          1.00 (  5.39)   +3.42 (  4.59)
threads-pipe            group-8          1.00 (  5.56)   +1.96 ( 10.98)
threads-sockets         group-2          1.00 (  1.10)   +1.99 (  1.67)
threads-sockets         group-4          1.00 (  0.45)   +1.56 (  1.10)
threads-sockets         group-8          1.00 (  3.56)   +3.27 (  1.47)

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  thread-2         1.00 (  0.36)  +21.26 (  1.40)
TCP_RR                  thread-4         1.00 (  0.31)   +1.74 (  0.31)
TCP_RR                  thread-8         1.00 ( 11.70)   -1.03 ( 13.55)
TCP_RR                  thread-16        1.00 ( 23.49)   -0.43 ( 22.79)
UDP_RR                  thread-2         1.00 (  0.19)  +16.29 (  0.97)
UDP_RR                  thread-4         1.00 (  0.13)   +3.51 (  0.51)
UDP_RR                  thread-8         1.00 (  8.95)   +0.05 ( 10.84)
UDP_RR                  thread-16        1.00 ( 21.54)   +0.10 ( 22.89)

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                thread-2         1.00 (  0.22)  +10.81 (  1.23)
loopback                thread-4         1.00 (  2.56)  +61.43 (  0.56)
loopback                thread-8         1.00 (  0.41)   -0.10 (  0.42)
loopback                thread-16        1.00 (  0.60)   +0.59 (  0.21)

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  mthread-2        1.00 (  0.00)   +0.00 (  0.00)
normal                  mthread-4        1.00 (  0.00)   +0.00 (  0.00)
normal                  mthread-8        1.00 (  0.00)   -2.67 (  5.20)

========
[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/

Ricardo Neri (6):
sched/topology: Introduce sched_group::flags
sched/fair: Optimize checking for group_asym_packing
sched/fair: Provide update_sg_lb_stats() with sched domain statistics
sched/fair: Carve out logic to mark a group for asymmetric packing
sched/fair: Consider SMT in ASYM_PACKING load balance
x86/sched: Enable SMT checks for asymmetric packing in load balancing

arch/x86/kernel/itmt.c | 10 +++
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 135 ++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 7 +-
5 files changed, 142 insertions(+), 12 deletions(-)

--
2.17.1



2021-05-13 22:36:29

by Ricardo Neri

Subject: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

When deciding to pull tasks in ASYM_PACKING, it is necessary to check not
only the idle state of the destination CPU, dst_cpu, but also that of its
SMT siblings.

If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.

Implement asym_can_pull_tasks() to inspect the state of the SMT siblings
of both dst_cpu and the CPUs in the candidate busiest group.

To keep the legacy behavior, add arch_asym_check_smt_siblings(), which causes
the new checks to be skipped by default. Individual architectures must
override this function to use asym_can_pull_tasks().
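
As a sketch of what such an override might look like (hypothetical; the
actual x86 hook is added by patch 6/6 in arch/x86/kernel/itmt.c):

    /* Hypothetical override; the condition x86 uses is in patch 6/6. */
    bool arch_asym_check_smt_siblings(void)
    {
        return true;
    }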

Cc: Aubrey Li <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim Chen <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Reworded the commit message to reflect updates in code.
* Corrected misrepresentation of dst_cpu as the CPU doing the load
balancing. (PeterZ)
* Removed call to arch_asym_check_smt_siblings() as it is now called in
sched_asym().

Changes since v1:
* Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
tasks. Instead, reclassify the candidate busiest group, as it
may still be selected. (PeterZ)
* Avoid an expensive and unnecessary call to cpumask_weight() when
determining if a sched_group is comprised of SMT siblings.
(PeterZ).
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++
2 files changed, 102 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..43bdb8b1e1df 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
#endif

extern int arch_asym_cpu_priority(int cpu);
+extern bool arch_asym_check_smt_siblings(void);

struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8b66a5d593e..3d6cc027e6e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
return -cpu;
}

+/*
+ * For asym packing, first check the state of SMT siblings before deciding to
+ * pull tasks.
+ */
+bool __weak arch_asym_check_smt_siblings(void)
+{
+ return false;
+}
+
/*
* The margin used when comparing utilization with CPU capacity.
*
@@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}

+/**
+ * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu: Destination CPU of the load balancing
+ * @sds: Load-balancing data with statistics of the local group
+ * @sgs: Load-balancing statistics of the candidate busiest group
+ * @sg: The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
+ * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
+ * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, even out the number of idle
+ * CPUs between @sds::local and @sg. Thus, pull tasks from @sg if the
+ * difference between the numbers of busy CPUs is 2 or more. If the
+ * difference is 1, only pull if @dst_cpu has higher priority. If @sg does
+ * not have SMT siblings, only pull tasks if all of the SMT siblings of
+ * @dst_cpu are idle and @sg has lower priority.
+ */
+static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+ struct sg_lb_stats *sgs, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+ int cpu, local_busy_cpus, sg_busy_cpus;
+ bool local_is_smt, sg_is_smt;
+
+ cpu = group_first_cpu(sg);
+ local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
+ sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
+
+ sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+ if (!local_is_smt) {
+ /*
+ * If we are here, @dst_cpu is idle and does not have SMT
+ * siblings. Pull tasks if candidate group has two or more
+ * busy CPUs.
+ */
+ if (sg_is_smt && sg_busy_cpus >= 2)
+ return true;
+
+ /*
+ * @dst_cpu does not have SMT siblings. @sg may have SMT
+ * siblings and only one is busy. In such a case, @dst_cpu
+ * can help if it has higher priority and is idle.
+ */
+ return !sds->local_stat.group_util &&
+ sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+ }
+
+ /* @dst_cpu has SMT siblings. */
+
+ local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;
+
+ if (sg_is_smt) {
+ int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
+
+ /* Local can always help to even out the number of busy CPUs. */
+ if (busy_cpus_delta >= 2)
+ return true;
+
+ if (busy_cpus_delta == 1)
+ return sched_asym_prefer(dst_cpu,
+ sg->asym_prefer_cpu);
+
+ return false;
+ }
+
+ /*
+ * @sg does not have SMT siblings. Ensure that @sds::local does not end
+ * up with more than one busy SMT sibling and only pull tasks if there
+ * are no busy CPUs. As CPUs move in and out of idle state frequently,
+ * also check the group utilization to smoother the decision.
+ */
+ if (!local_busy_cpus && !sds->local_stat.group_util)
+ return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+ return false;
+#else
+ return true;
+#endif
+}
+
static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
@@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
if (group == sds->local)
return false;

+ if (arch_asym_check_smt_siblings())
+ return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}

@@ -9463,6 +9558,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
nr_running == 1)
continue;

+ /* Make sure we only pull tasks from a CPU of lower priority */
+ if ((env->sd->flags & SD_ASYM_PACKING) &&
+ sched_asym_prefer(i, env->dst_cpu) &&
+ nr_running == 1)
+ continue;
+
switch (env->migration_type) {
case migrate_load:
/*
--
2.17.1


2021-05-14 03:20:34

by Ricardo Neri

Subject: [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing

Currently when balancing load, a candidate busiest group is marked for
asymmetric packing if, among other factors, dst_cpu has higher priority
than the preferred CPU of the candidate busiest group. However, other
factors influence the decision, such as the state of the SMT siblings of
dst_cpu, if any.

Thus, create a separate function, sched_asym(), in which logic for such
decisions can be implemented. A subsequent changeset will introduce logic
to deal with SMT.

Cc: Aubrey Li <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim Chen <[email protected]>
Co-developed-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.

Changes since v1:
* N/A
---
kernel/sched/fair.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8c04e9d0d3b..c8b66a5d593e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8447,6 +8447,20 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}

+static inline bool
+sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ struct sched_group *group)
+{
+ /*
+ * Because sd->groups starts with the local group, anything that isn't
+ * the local group will have access to the local state.
+ */
+ if (group == sds->local)
+ return false;
+
+ return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -8507,18 +8521,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}

+ sgs->group_capacity = group->sgc->capacity;
+
+ sgs->group_weight = group->group_weight;
+
/* Check if dst CPU is idle and preferred to this group */
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
- env->idle != CPU_NOT_IDLE &&
- sgs->sum_h_nr_running &&
- sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+ env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+ sched_asym(env, sds, sgs, group)) {
sgs->group_asym_packing = 1;
}

- sgs->group_capacity = group->sgc->capacity;
-
- sgs->group_weight = group->group_weight;
-
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);

/* Computing avg_load makes sense only when group is overloaded */
--
2.17.1


2021-05-14 10:41:41

by Peter Zijlstra

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> include/linux/sched/topology.h | 1 +
> kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++
> 2 files changed, 102 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..43bdb8b1e1df 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> #endif
>
> extern int arch_asym_cpu_priority(int cpu);
> +extern bool arch_asym_check_smt_siblings(void);
>
> struct sched_domain_attr {
> int relax_domain_level;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8b66a5d593e..3d6cc027e6e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> return -cpu;
> }
>
> +/*
> + * For asym packing, first check the state of SMT siblings before deciding to
> + * pull tasks.
> + */
> +bool __weak arch_asym_check_smt_siblings(void)
> +{
> + return false;
> +}
> +
> /*
> * The margin used when comparing utilization with CPU capacity.
> *

> @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> if (group == sds->local)
> return false;
>
> + if (arch_asym_check_smt_siblings())
> + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }

So I'm thinking that this is a property of having ASYM_PACKING at a core
level, rather than some arch special. Wouldn't something like this be
more appropriate?

---
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
#endif

extern int arch_asym_cpu_priority(int cpu);
-extern bool arch_asym_check_smt_siblings(void);

struct sched_domain_attr {
int relax_domain_level;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
}

/*
- * For asym packing, first check the state of SMT siblings before deciding to
- * pull tasks.
- */
-bool __weak arch_asym_check_smt_siblings(void)
-{
- return false;
-}
-
-/*
* The margin used when comparing utilization with CPU capacity.
*
* (default: ~20%)
@@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
if (group == sds->local)
return false;

- if (arch_asym_check_smt_siblings())
+ if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
+ (group->flags & SD_SHARE_CPUCAPACITY))
return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);

return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);

2021-05-15 10:57:46

by Ricardo Neri

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > include/linux/sched/topology.h | 1 +
> > kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++
> > 2 files changed, 102 insertions(+)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778b7c91..43bdb8b1e1df 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> > #endif
> >
> > extern int arch_asym_cpu_priority(int cpu);
> > +extern bool arch_asym_check_smt_siblings(void);
> >
> > struct sched_domain_attr {
> > int relax_domain_level;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c8b66a5d593e..3d6cc027e6e6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> > return -cpu;
> > }
> >
> > +/*
> > + * For asym packing, first check the state of SMT siblings before deciding to
> > + * pull tasks.
> > + */
> > +bool __weak arch_asym_check_smt_siblings(void)
> > +{
> > + return false;
> > +}
> > +
> > /*
> > * The margin used when comparing utilization with CPU capacity.
> > *
>
> > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> > if (group == sds->local)
> > return false;
> >
> > + if (arch_asym_check_smt_siblings())
> > + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> > +
> > return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> > }
>
> So I'm thinking that this is a property of having ASYM_PACKING at a core
> level, rather than some arch special. Wouldn't something like this be
> more appropriate?
>
> ---
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
> #endif
>
> extern int arch_asym_cpu_priority(int cpu);
> -extern bool arch_asym_check_smt_siblings(void);
>
> struct sched_domain_attr {
> int relax_domain_level;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
> }
>
> /*
> - * For asym packing, first check the state of SMT siblings before deciding to
> - * pull tasks.
> - */
> -bool __weak arch_asym_check_smt_siblings(void)
> -{
> - return false;
> -}
> -
> -/*
> * The margin used when comparing utilization with CPU capacity.
> *
> * (default: ~20%)
> @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
> if (group == sds->local)
> return false;
>
> - if (arch_asym_check_smt_siblings())
> + if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> + (group->flags & SD_SHARE_CPUCAPACITY))
> return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);

Thanks Peter for the quick review! This makes sense to me. The only
reason we proposed arch_asym_check_smt_siblings() is because we were
worried about breaking powerpc (I need to study how they set priorities
for SMT, if applicable). If you think this is not an issue I can post a
v4 with this update.

Thanks and BR,
Ricardo

2021-05-18 00:16:54

by Dietmar Eggemann

Subject: Re: [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing

On 13/05/2021 17:49, Ricardo Neri wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8c04e9d0d3b..c8b66a5d593e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8447,6 +8447,20 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +static inline bool
> +sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
> + struct sched_group *group)
> +{
> + /*
> + * Because sd->groups starts with the local group, anything that isn't
> + * the local group will have access to the local state.
> + */
> + if (group == sds->local)
> + return false;

sched_asym() is called under if (!local_group ...) from
update_sg_lb_stats(). So why check this here again?

[...]

2021-05-18 13:17:29

by Dietmar Eggemann

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On 13/05/2021 17:49, Ricardo Neri wrote:

[...]

> @@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +/**
> + * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu: Destination CPU of the load balancing
> + * @sds: Load-balancing data with statistics of the local group
> + * @sgs: Load-balancing statistics of the candidate busiest group
> + * @sg: The candidate busiet group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings. Even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> + struct sg_lb_stats *sgs, struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + int cpu, local_busy_cpus, sg_busy_cpus;
> + bool local_is_smt, sg_is_smt;
> +
> + cpu = group_first_cpu(sg);

Looks like `cpu` isn't used.

[...]

2021-05-19 08:49:07

by Joel Fernandes

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_can_pull_tasks() to inspect the state of the SMT siblings
> of both dst_cpu and the CPUs in the candidate busiest group.
>
> To keep the legacy behavior, add an arch_asym_check_smt_siblings() to
> skip these new checks by default. Individual architectures must override
> the mentioned function to use asym_can_pull_tasks().
>
> Cc: Aubrey Li <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim Chen <[email protected]>
> Reviewed-by: Len Brown <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Reworded the commit message to reflect updates in code.
> * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
> * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
> * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
> * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
> include/linux/sched/topology.h | 1 +
> kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++
> 2 files changed, 102 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..43bdb8b1e1df 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> #endif
>
> extern int arch_asym_cpu_priority(int cpu);
> +extern bool arch_asym_check_smt_siblings(void);
>
> struct sched_domain_attr {
> int relax_domain_level;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8b66a5d593e..3d6cc027e6e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> return -cpu;
> }
>
> +/*
> + * For asym packing, first check the state of SMT siblings before deciding to
> + * pull tasks.
> + */
> +bool __weak arch_asym_check_smt_siblings(void)
> +{
> + return false;
> +}
> +
> /*
> * The margin used when comparing utilization with CPU capacity.
> *
> @@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +/**
> + * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu: Destination CPU of the load balancing
> + * @sds: Load-balancing data with statistics of the local group
> + * @sgs: Load-balancing statistics of the candidate busiest group
> + * @sg: The candidate busiet group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings. Even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> + struct sg_lb_stats *sgs, struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + int cpu, local_busy_cpus, sg_busy_cpus;
> + bool local_is_smt, sg_is_smt;
> +
> + cpu = group_first_cpu(sg);
> + local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> + sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> + sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> + if (!local_is_smt) {
> + /*
> + * If we are here, @dst_cpu is idle and does not have SMT
> + * siblings. Pull tasks if candidate group has two or more
> + * busy CPUs.
> + */
> + if (sg_is_smt && sg_busy_cpus >= 2)
> + return true;
> +
> + /*
> + * @dst_cpu does not have SMT siblings. @sg may have SMT
> + * siblings and only one is busy. In such case, @dst_cpu
> + * can help if it has higher priority and is idle.
> + */
> + return !sds->local_stat.group_util &&
> + sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> + }
> +
> + /* @dst_cpu has SMT siblings. */
> +
> + local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;
> +
> + if (sg_is_smt) {
> + int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> +
> + /* Local can always help to even the number busy CPUs. */
> + if (busy_cpus_delta >= 2)
> + return true;
> +
> + if (busy_cpus_delta == 1)
> + return sched_asym_prefer(dst_cpu,
> + sg->asym_prefer_cpu);
> +
> + return false;
> + }
> +
> + /*
> + * @sg does not have SMT siblings. Ensure that @sds::local does not end
> + * up with more than one busy SMT sibling and only pull tasks if there
> + * are not busy CPUs. As CPUs move in and out of idle state frequently,
> + * also check the group utilization to smoother the decision.

nit: s/smoother/smoothen/

> + */
> + if (!local_busy_cpus && !sds->local_stat.group_util)
> + return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);

Hmm, I am not sure, but is it possible that there are some local_busy_cpus
yet group_util is 0? If not, just check for !group_util?

FWIW on the series:
Reviewed-by: Joel Fernandes (Google) <[email protected]>

thanks,

- Joel

> +
> + return false;
> +#else
> + return true;
> +#endif
> +}
> +
> static inline bool
> sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
> struct sched_group *group)
> @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> if (group == sds->local)
> return false;
>
> + if (arch_asym_check_smt_siblings())
> + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> @@ -9463,6 +9558,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> nr_running == 1)
> continue;
>
> + /* Make sure we only pull tasks from a CPU of lower priority */
> + if ((env->sd->flags & SD_ASYM_PACKING) &&
> + sched_asym_prefer(i, env->dst_cpu) &&
> + nr_running == 1)
> + continue;
> +
> switch (env->migration_type) {
> case migrate_load:
> /*
> --
> 2.17.1
>

2021-05-19 09:04:36

by Joel Fernandes

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Mon, May 17, 2021 at 6:28 PM Joel Fernandes <[email protected]> wrote:
>
> On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> > check for the idle state of the destination CPU, dst_cpu, but also of
> > its SMT siblings.
> >
[...]
> > + if (!local_is_smt) {
> > + /*
> > + * If we are here, @dst_cpu is idle and does not have SMT
> > + * siblings. Pull tasks if candidate group has two or more
> > + * busy CPUs.
> > + */
> > + if (sg_is_smt && sg_busy_cpus >= 2)
> > + return true;
> > +
> > + /*
> > + * @dst_cpu does not have SMT siblings. @sg may have SMT
> > + * siblings and only one is busy. In such case, @dst_cpu
> > + * can help if it has higher priority and is idle.
> > + */
> > + return !sds->local_stat.group_util &&
> > + sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > + }
> > +
> > + /* @dst_cpu has SMT siblings. */
> > +
> > + local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;
> > +
> > + if (sg_is_smt) {
> > + int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > +
> > + /* Local can always help to even the number busy CPUs. */
> > + if (busy_cpus_delta >= 2)
> > + return true;
> > +
> > + if (busy_cpus_delta == 1)
> > + return sched_asym_prefer(dst_cpu,
> > + sg->asym_prefer_cpu);
> > +
> > + return false;
> > + }
> > +
> > + /*
> > + * @sg does not have SMT siblings. Ensure that @sds::local does not end
> > + * up with more than one busy SMT sibling and only pull tasks if there
> > + * are not busy CPUs. As CPUs move in and out of idle state frequently,
> > + * also check the group utilization to smoother the decision.
>
> nit: s/smoother/smoothen/
>
> > + */
> > + if (!local_busy_cpus && !sds->local_stat.group_util)
> > + return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
>
> Hmm, I am not sure but is it possible that there some local_busy_cpus yet
> group_util is 0? If not just check for !group_util ?

Sorry - I meant here, "yet group_util is not 0..."

2021-05-19 18:31:57

by Ricardo Neri

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Mon, May 17, 2021 at 05:18:29PM +0200, Dietmar Eggemann wrote:
> On 13/05/2021 17:49, Ricardo Neri wrote:
>
> [...]
>
> > @@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
> > return group_has_spare;
> > }
> >
> > +/**
> > + * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> > + * @dst_cpu: Destination CPU of the load balancing
> > + * @sds: Load-balancing data with statistics of the local group
> > + * @sgs: Load-balancing statistics of the candidate busiest group
> > + * @sg: The candidate busiet group
> > + *
> > + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> > + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > + *
> > + * If both @dst_cpu and @sg have SMT siblings. Even the number of idle CPUs
> > + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> > + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> > + * has lower priority.
> > + */
> > +static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > + struct sg_lb_stats *sgs, struct sched_group *sg)
> > +{
> > +#ifdef CONFIG_SCHED_SMT
> > + int cpu, local_busy_cpus, sg_busy_cpus;
> > + bool local_is_smt, sg_is_smt;
> > +
> > + cpu = group_first_cpu(sg);
>
> Looks like `cpu` isn't used.

Thank you very much for your feedback Dietmar!

Ah! That is correct. I will remove this variable.

Thanks and BR,
Ricardo
>
> [...]

2021-05-19 18:33:18

by Ricardo Neri

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> > On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > > include/linux/sched/topology.h | 1 +
> > > kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++
> > > 2 files changed, 102 insertions(+)
> > >
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index 8f0f778b7c91..43bdb8b1e1df 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> > > #endif
> > >
> > > extern int arch_asym_cpu_priority(int cpu);
> > > +extern bool arch_asym_check_smt_siblings(void);
> > >
> > > struct sched_domain_attr {
> > > int relax_domain_level;
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c8b66a5d593e..3d6cc027e6e6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> > > return -cpu;
> > > }
> > >
> > > +/*
> > > + * For asym packing, first check the state of SMT siblings before deciding to
> > > + * pull tasks.
> > > + */
> > > +bool __weak arch_asym_check_smt_siblings(void)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > /*
> > > * The margin used when comparing utilization with CPU capacity.
> > > *
> >
> > > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> > > if (group == sds->local)
> > > return false;
> > >
> > > + if (arch_asym_check_smt_siblings())
> > > + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> > > +
> > > return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> > > }
> >
> > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > level, rather than some arch special. Wouldn't something like this be
> > more appropriate?
> >
> > ---
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
> > #endif
> >
> > extern int arch_asym_cpu_priority(int cpu);
> > -extern bool arch_asym_check_smt_siblings(void);
> >
> > struct sched_domain_attr {
> > int relax_domain_level;
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
> > }
> >
> > /*
> > - * For asym packing, first check the state of SMT siblings before deciding to
> > - * pull tasks.
> > - */
> > -bool __weak arch_asym_check_smt_siblings(void)
> > -{
> > - return false;
> > -}
> > -
> > -/*
> > * The margin used when comparing utilization with CPU capacity.
> > *
> > * (default: ~20%)
> > @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
> > if (group == sds->local)
> > return false;
> >
> > - if (arch_asym_check_smt_siblings())
> > + if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> > + (group->flags & SD_SHARE_CPUCAPACITY))
> > return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
>
> Thanks Peter for the quick review! This makes sense to me. The only
> reason we proposed arch_asym_check_smt_siblings() is because we were
> about breaking powerpc (I need to study how they set priorities for SMT,
> if applicable). If you think this is not an issue I can post a
> v4 with this update.

As far as I can see, priorities in powerpc are set by the CPU number.
However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
[1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
core would need to be busy before a new core is used.

Still, I think the issue described in the cover letter may be
reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
able to help since CPU0 has the highest priority.

I am cc'ing the linuxppc list to get some feedback.

Thanks and BR,
Ricardo

2021-05-19 18:33:28

by Ricardo Neri

Subject: Re: [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing

On Mon, May 17, 2021 at 04:21:36PM +0200, Dietmar Eggemann wrote:
> On 13/05/2021 17:49, Ricardo Neri wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c8c04e9d0d3b..c8b66a5d593e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8447,6 +8447,20 @@ group_type group_classify(unsigned int imbalance_pct,
> > return group_has_spare;
> > }
> >
> > +static inline bool
> > +sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
> > + struct sched_group *group)
> > +{
> > + /*
> > + * Because sd->groups starts with the local group, anything that isn't
> > + * the local group will have access to the local state.
> > + */
> > + if (group == sds->local)
> > + return false;
>
> sched_asym() is called under if(!local_group ...) from
> update_sg_lb_stats(). So why checking this here again?

This is true. It looks to me that this check is not needed. It makes
sense to me to keep the check in update_sg_lb_stats() so that we can
avoid the extra checks, right?

Thanks and BR,
Ricardo

2021-05-19 19:26:35

by Peter Zijlstra

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:

> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > level, rather than some arch special. Wouldn't something like this be
> > > more appropriate?

> > Thanks Peter for the quick review! This makes sense to me. The only
> > reason we proposed arch_asym_check_smt_siblings() is because we were
> > about breaking powerpc (I need to study how they set priorities for SMT,
> > if applicable). If you think this is not an issue I can post a
> > v4 with this update.
>
> As far as I can see, priorities in powerpc are set by the CPU number.
> However, I am not sure how CPUs are enumerated? If CPUs in brackets are
> SMT sibling, Does an enumeration looks like A) [0, 1], [2, 3] or B) [0, 2],
> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> core would need to be busy before a new core is used.
>
> Still, I think the issue described in the cover letter may be
> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> able to help since CPU0 has the highest priority.
>
> I am cc'ing the linuxppc list to get some feedback.

IIRC the concern with Power is that their Cores can go faster if the
higher SMT siblings are unused.

That is, suppose you have an SMT4 Core with only a single active task,
then if only SMT0 is used it can reach max performance, but if the
active sibling is SMT1 it can not reach max performance, and if the only
active sibling is SMT2 it goes slower still.

So they need to pack the tasks to the lowest SMT siblings, and have the
highest SMT siblings idle (where possible) in order to increase
performance.



2021-05-19 19:31:13

by Nicholas Piggin

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

Excerpts from Peter Zijlstra's message of May 19, 2021 7:59 pm:
> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
>> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
>> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
>
>> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
>> > > level, rather than some arch special. Wouldn't something like this be
>> > > more appropriate?
>
>> > Thanks Peter for the quick review! This makes sense to me. The only
>> > reason we proposed arch_asym_check_smt_siblings() is because we were
>> > about breaking powerpc (I need to study how they set priorities for SMT,
>> > if applicable). If you think this is not an issue I can post a
>> > v4 with this update.
>>
>> As far as I can see, priorities in powerpc are set by the CPU number.
>> However, I am not sure how CPUs are enumerated? If CPUs in brackets are
>> SMT sibling, Does an enumeration looks like A) [0, 1], [2, 3] or B) [0, 2],
>> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
>> core would need to be busy before a new core is used.
>>
>> Still, I think the issue described in the cover letter may be
>> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
>> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
>> able to help since CPU0 has the highest priority.
>>
>> I am cc'ing the linuxppc list to get some feedback.
>
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
>
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it can not reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
>
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.

That's correct.

Thanks,
Nick

2021-05-19 19:31:19

by Srikar Dronamraju

Subject: Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

* Peter Zijlstra <[email protected]> [2021-05-19 11:59:48]:

> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
>
> > > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > > level, rather than some arch special. Wouldn't something like this be
> > > > more appropriate?
>
> > > Thanks Peter for the quick review! This makes sense to me. The only
> > > reason we proposed arch_asym_check_smt_siblings() is because we were
> > > about breaking powerpc (I need to study how they set priorities for SMT,
> > > if applicable). If you think this is not an issue I can post a
> > > v4 with this update.
> >
> > As far as I can see, priorities in powerpc are set by the CPU number.
> > However, I am not sure how CPUs are enumerated? If CPUs in brackets are
> > SMT sibling, Does an enumeration looks like A) [0, 1], [2, 3] or B) [0, 2],
> > [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> > core would need to be busy before a new core is used.
> >
> > Still, I think the issue described in the cover letter may be
> > reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> > tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> > able to help since CPU0 has the highest priority.
> >
> > I am cc'ing the linuxppc list to get some feedback.
>
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
>
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it can not reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
>
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.
>
>

If you are referring to SD_ASYM_PACKING, then packing tasks to the lowest SMT
sibling was needed in the POWER7 timeframe. If there was one thread running,
running it on the lowest sibling provided the best performance, as if running
in single-threaded mode.

However, recent chips like POWER8/POWER9/POWER10 don't need SD_ASYM_PACKING,
since the hardware itself does the switch. So even if a task is placed on a
higher sibling within the core, we don't need to switch the task to a lower
sibling for it to perform better. With only one thread running on any
sibling, it is expected to run in single-threaded mode.

--
Thanks and Regards
Srikar Dronamraju