2023-05-04 16:21:42

by Tim Chen

Subject: [PATCH 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs

The cluster scheduling domain is not enabled on x86 hybrid CPUs because the
logic to do proper load balancing between a cluster containing the SMT CPUs
of a single Big core and a cluster containing multiple Atom CPUs is missing.

When cluster scheduling was first introduced to x86, it was noticed that,
with cluster scheduling on a hybrid CPU, single threaded tasks often ended
up on an Atom core (E-core) instead of an idle Big core (P-core), resulting
in lower performance. Hence cluster scheduling was disabled on x86 hybrid
CPUs. (See: https://www.phoronix.com/review/linux-516-regress)

Ricardo recently introduced a patch series that greatly improved
the load balancing logic between P-cores and E-cores on x86 hybrid
CPUs.
https://lore.kernel.org/lkml/[email protected]/T/#m16ebc8de64dbf4c54adebab701b42b47805105f4

However, that patch series is not enough to allow enabling cluster
scheduling on hybrid x86 CPUs. This patch series provides the additional
fixes needed for load balancing between a cluster sched group consisting
of the SMT CPUs of Big cores and a cluster sched group consisting of Atom
CPUs. With these patches applied on top of Ricardo's patch series, load is
properly balanced between the P-core and E-core clusters. Idle CPUs
are used in the proper order:

1) SMT CPU on an idle P-core,
2) idle E-core,
3) unused SMT CPU with a busy sibling.

On x86, CPUs in a cluster share the L2 cache. With cluster scheduling
enabled, load is now balanced between the clusters, potentially reducing
L2 cache contention.

I did some experiments on an Alder Lake system with 6 P-cores and 8
E-cores, the E-cores organized in two clusters of 4 E-cores each.

I tested some single threaded benchmarks from the Phoronix suite that
previously showed regressions when cluster scheduling was first enabled.
With this patch series, cluster scheduling performs as well as the vanilla
kernel.

Single Threaded      6.3-rc5              with cluster          Improvement
Benchmark            (run-run deviation)  scheduling            in Performance
                                          (run-run deviation)
----------------------------------------------------------------------------
tjbench              (+/- 0.08%)          (+/- 0.23%)           -0.23%
PhPbench             (+/- 0.31%)          (+/- 0.89%)           -0.39%
flac                 (+/- 0.58%)          (+/- 0.22%)           +0.17%
pybench              (+/- 3.16%)          (+/- 0.27%)           +2.55%

For multi-threaded benchmarks, I tried kernel build and TensorFlow Lite.
Cluster scheduling did best for the 10 thread case, where 6 threads run on
the P-cores, 2 threads on one Atom cluster and 2 threads on the other Atom
cluster, whereas the vanilla kernel puts 6 threads on the P-cores and
4 threads on one Atom cluster. The differences are small, though, and
fall within run-to-run variations.

Multi Threaded       6.3-rc5              with cluster          Improvement
Benchmark            (run-run deviation)  scheduling            in Performance
(-#threads)                               (run-run deviation)
----------------------------------------------------------------------------
Kbuild-8             (+/- 2.90%)          (+/- 1.16%)           -0.76%
Kbuild-10            (+/- 3.08%)          (+/- 3.09%)           +0.64%
Kbuild-12            (+/- 3.28%)          (+/- 3.55%)           +0.91%
Tensor Lite-8        (+/- 4.84%)          (+/- 4.61%)           -0.23%
Tensor Lite-10       (+/- 0.87%)          (+/- 1.45%)           +0.47%
Tensor Lite-12       (+/- 1.37%)          (+/- 1.04%)           -0.12%

Thanks for reviewing these patches.

Tim Chen

Ricardo Neri (1):
sched/fair: Consider the idle state of the whole core for load balance

Tim C Chen (5):
sched/topology: Propagate SMT flags when removing degenerate domain
sched/fair: Check whether active load balance is needed in busiest
group
sched/fair: Fix busiest group selection for asym groups
sched/fair: Skip prefer sibling move between SMT group and non-SMT
group
sched/x86: Add cluster topology to hybrid CPU

arch/x86/kernel/smpboot.c | 3 ++
kernel/sched/fair.c | 78 ++++++++++++++++++++++++++++++++++++++-
kernel/sched/topology.c | 7 +++-
3 files changed, 86 insertions(+), 2 deletions(-)

--
2.32.0


2023-05-04 16:23:03

by Tim Chen

Subject: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

From: Tim C Chen <[email protected]>

Do not try to move tasks between non SMT sched group and SMT sched
group for "prefer sibling" load balance.
Let asym_active_balance_busiest() handle that case properly.
Otherwise we could get task bouncing back and forth between
the SMT sched group and non SMT sched group.

Reviewed-by: Ricardo Neri <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a325db34b02..58ef7d529731 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
/*
* Try to move all excess tasks to a sibling domain of the busiest
* group's child domain.
+ *
+ * Do not try to move between non smt sched group and smt sched
+ * group. Let asym active balance properly handle that case.
*/
if (sds.prefer_sibling && local->group_type == group_has_spare &&
+ !asymmetric_groups(sds.busiest, sds.local) &&
busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;

--
2.32.0
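
For reference, asymmetric_groups(), used in the hunk above, is introduced by
an earlier patch in this series that is not quoted in this excerpt. A minimal
sketch of what it checks, inferred from how it is used here (the exact helper
in the posted series may differ slightly):

static inline bool asymmetric_groups(struct sched_group *sg1,
                                     struct sched_group *sg2)
{
        if (!sg1 || !sg2)
                return false;

        /* Asymmetric iff exactly one of the two groups consists of SMT CPUs. */
        return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
               (sg2->flags & SD_SHARE_CPUCAPACITY);
}

In other words, the prefer-sibling spreading above is skipped only when one
sched group is made of SMT CPUs and the other is not.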

2023-05-04 16:34:46

by Tim Chen

Subject: [PATCH 5/6] sched/fair: Consider the idle state of the whole core for load balance

From: Ricardo Neri <[email protected]>

should_we_balance() traverses the group_balance_mask (AND'ed with
lb_env::cpus) starting from lower numbered CPUs, looking for the first
idle CPU.

In hybrid x86 systems, the siblings of SMT cores get CPU numbers before
non-SMT cores:

[0, 1]  [2, 3]  [4, 5]  6  7  8  9
 b  i    b  i    b  i   b  i  i  i

In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.

We should let a CPU on a fully idle core get the first chance to do idle
load balancing, as it has more CPU capacity than an idle SMT CPU whose
sibling is busy. So for the figure above, if we are running
should_we_balance() for CPU 1, we should return false to let CPU 7 on an
idle core have the first chance to do idle load balancing.

A partially busy (i.e., of type group_has_spare) local group with SMT
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, partially busy, lower-numbered SMT cores should not
be considered when finding the first idle CPU.

However, in should_we_balance(), when we encounter an idle SMT CPU in a
partially busy core first, we prematurely break the search for the first
idle CPU.

Higher-numbered, non-SMT cores are not given the chance to have idle
balancing done on their behalf. Those CPUs will only be considered for
idle balancing by chance via CPU_NEWLY_IDLE.

Instead, consider the idle state of the whole SMT core.

Signed-off-by: Ricardo Neri <[email protected]>
Co-developed-by: Tim Chen <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58ef7d529731..c77fadba063d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10683,7 +10683,7 @@ static int active_load_balance_cpu_stop(void *data);
static int should_we_balance(struct lb_env *env)
{
struct sched_group *sg = env->sd->groups;
- int cpu;
+ int cpu, idle_smt = -1;

/*
* Ensure the balancing environment is consistent; can happen
@@ -10709,11 +10709,26 @@ static int should_we_balance(struct lb_env *env)
for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
if (!idle_cpu(cpu))
continue;
+ else {
+ /*
+ * Don't balance to idle SMT in busy core right away when
+ * balancing cores, but remember the first idle SMT CPU for
+ * later consideration. Find CPU on an idle core first.
+ */
+ if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
+ if (idle_smt == -1)
+ idle_smt = cpu;
+ continue;
+ }
+ }

/* Are we the first idle CPU? */
return cpu == env->dst_cpu;
}

+ if (idle_smt == env->dst_cpu)
+ return true;
+
/* Are we the first CPU of this group ? */
return group_balance_cpu(sg) == env->dst_cpu;
}
--
2.32.0
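
For reference, is_core_idle(), called in the hunk above, is an existing helper
in kernel/sched/fair.c that predates this series; roughly (paraphrased, so the
exact form in the tree this series is based on may differ slightly):

static inline bool is_core_idle(int cpu)
{
#ifdef CONFIG_SCHED_SMT
        int sibling;

        /* A core is idle only if every SMT sibling of @cpu is idle. */
        for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                if (cpu == sibling)
                        continue;

                if (!idle_cpu(sibling))
                        return false;
        }
#endif

        return true;
}

With CONFIG_SCHED_SMT disabled, every CPU is trivially treated as its own idle
core.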

2023-05-04 16:40:03

by Tim Chen

Subject: [PATCH 1/6] sched/topology: Propagate SMT flags when removing degenerate domain

From: Tim C Chen <[email protected]>

When a degenerate cluster domain for a core with SMT CPUs is removed,
the SD_SHARE_CPUCAPACITY flag in the local child sched group is not
propagated to the new parent. We need this flag to properly determine
whether the local sched group is an SMT group. Set the flag in the local
child sched group of the new parent sched domain.

Reviewed-by: Ricardo Neri <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/topology.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 051aaf65c749..6d5628fcebcf 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -719,8 +719,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)

if (sd_parent_degenerate(tmp, parent)) {
tmp->parent = parent->parent;
- if (parent->parent)
+
+ if (parent->parent) {
parent->parent->child = tmp;
+ if (tmp->flags & SD_SHARE_CPUCAPACITY)
+ parent->parent->groups->flags |= SD_SHARE_CPUCAPACITY;
+ }
+
/*
* Transfer SD_PREFER_SIBLING down in case of a
* degenerate parent; the spans match for this
--
2.32.0

2023-05-04 16:42:00

by Tim Chen

Subject: [PATCH 3/6] sched/fair: Fix busiest group selection for asym groups

From: Tim C Chen <[email protected]>

For two groups that have spare capacity and are partially busy, the
busier group should be the group with non-SMT CPUs rather than the group
with SMT CPUs. We do not want to make the SMT group the busiest, pull a
task off the SMT core, and make the whole core empty.

Otherwise, suppose that in the search for the busiest group we first
encounter an SMT group with 1 task and set it as the busiest. The local
group is an Atom cluster with 1 task, and we then encounter an Atom
cluster group with 3 tasks. We will not pick this Atom cluster group over
the SMT group, even though we should. As a result, we do not load balance
the busier Atom cluster (with 3 tasks) towards the local Atom cluster
(with 1 task). It also does not make sense to pick the 1-task SMT group
as the busier group, as we should not pull a task off the SMT core
towards the 1-task Atom cluster and make the SMT core completely empty.

Fix this case.

Reviewed-by: Ricardo Neri <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bde962aa160a..8a325db34b02 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9548,6 +9548,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;

case group_has_spare:
+ /*
+ * Do not pick sg with SMT CPUs over sg with pure CPUs,
+ * as we do not want to pull task off half empty SMT core
+ * and make the core idle.
+ */
+ if (asymmetric_groups(sds->busiest, sg)) {
+ if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+ return true;
+ else
+ return false;
+ }
+
/*
* Select not overloaded group with lowest number of idle cpus
* and highest number of running tasks. We could also compare
--
2.32.0

2023-05-04 16:47:41

by Tim Chen

Subject: [PATCH 6/6] sched/x86: Add cluster topology to hybrid CPU

From: Tim C Chen <[email protected]>

Cluster topology was not enabled on hybrid x86 CPUs because load balancing
did not work properly for the cluster domain. That has now been fixed, and
the cluster domain can be enabled for hybrid CPUs.

Reviewed-by: Ricardo Neri <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
arch/x86/kernel/smpboot.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index cea297d97034..2489d767c398 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -575,6 +575,9 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
+#ifdef CONFIG_SCHED_CLUSTER
+ { cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
--
2.32.0
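
For context, x86_cluster_flags(), referenced in the new CLS entry above, is
the same helper already used by the non-hybrid x86_topology table; roughly
(paraphrased from arch/x86/kernel/smpboot.c, not part of this patch):

#ifdef CONFIG_SCHED_CLUSTER
static int x86_cluster_flags(void)
{
        /* Cluster-level sched domain flags plus the x86 ITMT (asym packing) flags. */
        return cpu_cluster_flags() | x86_sched_itmt_flags();
}
#endif

So the hybrid topology table reuses the existing cluster flags rather than
introducing new ones.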

2023-05-05 13:29:57

by Peter Zijlstra

Subject: Re: [PATCH 5/6] sched/fair: Consider the idle state of the whole core for load balance

On Thu, May 04, 2023 at 09:09:55AM -0700, Tim Chen wrote:

> @@ -10709,11 +10709,26 @@ static int should_we_balance(struct lb_env *env)
> for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
> if (!idle_cpu(cpu))
> continue;
> + else {
> + /*
> + * Don't balance to idle SMT in busy core right away when
> + * balancing cores, but remember the first idle SMT CPU for
> + * later consideration. Find CPU on an idle core first.
> + */
> + if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
> + if (idle_smt == -1)
> + idle_smt = cpu;
> + continue;
> + }
> + }

Not only does that bust CodingStyle, it's also entirely daft. What
exactly is the purpose of that else statement?

>
> /* Are we the first idle CPU? */
> return cpu == env->dst_cpu;
> }

2023-05-05 13:31:07

by Peter Zijlstra

Subject: Re: [PATCH 3/6] sched/fair: Fix busiest group selection for asym groups

On Thu, May 04, 2023 at 09:09:53AM -0700, Tim Chen wrote:
> From: Tim C Chen <[email protected]>
>
> For two groups that have spare capacity and are partially busy, the
> busier group should be the group with non-SMT CPUs rather than the group
> with SMT CPUs. We do not want to make the SMT group the busiest, pull a
> task off the SMT core, and make the whole core empty.
>
> Otherwise, suppose that in the search for the busiest group we first
> encounter an SMT group with 1 task and set it as the busiest. The local
> group is an Atom cluster with 1 task, and we then encounter an Atom
> cluster group with 3 tasks. We will not pick this Atom cluster group over
> the SMT group, even though we should. As a result, we do not load balance
> the busier Atom cluster (with 3 tasks) towards the local Atom cluster
> (with 1 task). It also does not make sense to pick the 1-task SMT group
> as the busier group, as we should not pull a task off the SMT core
> towards the 1-task Atom cluster and make the SMT core completely empty.
>
> Fix this case.
>
> Reviewed-by: Ricardo Neri <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bde962aa160a..8a325db34b02 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9548,6 +9548,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
>
> case group_has_spare:
> + /*
> + * Do not pick sg with SMT CPUs over sg with pure CPUs,
> + * as we do not want to pull task off half empty SMT core
> + * and make the core idle.
> + */

Comment says what the code does; not why.

> + if (asymmetric_groups(sds->busiest, sg)) {
> + if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> + return true;
> + else
> + return false;

return (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> + }

Also, should this not be part of the previous patch?

2023-05-05 14:09:05

by Peter Zijlstra

Subject: Re: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> From: Tim C Chen <[email protected]>
>
> Do not try to move tasks between non SMT sched group and SMT sched
> group for "prefer sibling" load balance.
> Let asym_active_balance_busiest() handle that case properly.
> Otherwise we could get task bouncing back and forth between
> the SMT sched group and non SMT sched group.
>
> Reviewed-by: Ricardo Neri <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a325db34b02..58ef7d529731 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> /*
> * Try to move all excess tasks to a sibling domain of the busiest
> * group's child domain.
> + *
> + * Do not try to move between non smt sched group and smt sched
> + * group. Let asym active balance properly handle that case.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> + !asymmetric_groups(sds.busiest, sds.local) &&
> busiest->sum_nr_running > local->sum_nr_running + 1)
> goto force_balance;

This seems to have the hidden assumption that a !SMT core is somehow
'less' than an SMT core. Should this not also look at
sched_asym_prefer() to establish this is so?

I mean, imagine I have a regular system and just offline one smt sibling
for giggles.

2023-05-05 22:39:34

by Tim Chen

Subject: Re: [PATCH 3/6] sched/fair: Fix busiest group selection for asym groups

On Fri, 2023-05-05 at 15:19 +0200, Peter Zijlstra wrote:
>
> >
> > case group_has_spare:
> > + /*
> > + * Do not pick sg with SMT CPUs over sg with pure CPUs,
> > + * as we do not want to pull task off half empty SMT core
> > + * and make the core idle.
> > + */
>
> Comment says what the code does; not why.

Good point, will make the comment better.

>
> > + if (asymmetric_groups(sds->busiest, sg)) {
> > + if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> > + return true;
> > + else
> > + return false;
>
> return (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> > + }
>
> Also, should this not be part of the previous patch?

Sure, I can merge it with the previous patch.

Tim
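
Folding in both of Peter's comments (a "why" comment and the simplified
return), the reworked hunk might read roughly as follows; this is only a
hypothetical sketch, since no v2 is quoted in this thread:

	case group_has_spare:
		/*
		 * A group of SMT CPUs is never picked as busiest over a
		 * group of non-SMT CPUs: pulling the task off a half-empty
		 * SMT core would leave that whole core idle while the
		 * non-SMT (Atom) group stays loaded.
		 */
		if (asymmetric_groups(sds->busiest, sg))
			return sds->busiest->flags & SD_SHARE_CPUCAPACITY;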

2023-05-05 23:07:35

by Tim Chen

Subject: Re: [PATCH 5/6] sched/fair: Consider the idle state of the whole core for load balance

On Fri, 2023-05-05 at 15:17 +0800, Hillf Danton wrote:
> On 4 May 2023 09:09:55 -0700 Tim Chen <[email protected]>
> > From: Ricardo Neri <[email protected]>
> >
> > should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
> > cpus) starting from lower numbered CPUs looking for the first idle CPU.
> >
> > In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
> > non-SMT cores:
> >
> > [0, 1]  [2, 3]  [4, 5]  6  7  8  9
> >  b  i    b  i    b  i   b  i  i  i
> >
> > In the figure above, CPUs in brackets are siblings of an SMT core. The
> > rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
> > idle CPU.
>
> Better if L2-cache affinity is added to the diagram above. And better
> still if the diagram is also shown when hybrid x86 is introduced in the
> cover letter.

Good suggestion. Will update the diagram and add it to cover letter.

Tim

2023-05-05 23:10:38

by Tim Chen

Subject: Re: [PATCH 5/6] sched/fair: Consider the idle state of the whole core for load balance

On Fri, 2023-05-05 at 15:23 +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 09:09:55AM -0700, Tim Chen wrote:
>
> > @@ -10709,11 +10709,26 @@ static int should_we_balance(struct lb_env *env)
> > for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
> > if (!idle_cpu(cpu))
> > continue;
> > + else {
> > + /*
> > + * Don't balance to idle SMT in busy core right away when
> > + * balancing cores, but remember the first idle SMT CPU for
> > + * later consideration. Find CPU on an idle core first.
> > + */
> > + if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
> > + if (idle_smt == -1)
> > + idle_smt = cpu;
> > + continue;
> > + }
> > + }
>
> Not only does that bust CodingStyle, it's also entirely daft. What
> exactly is the purpose of that else statement?
>
>

Yeah, that's a dumb "else" statement. Will remove that.

Tim
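
A sketch of how the loop could read with the redundant else dropped (again
hypothetical; the actual respin is not part of this thread):

	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
		if (!idle_cpu(cpu))
			continue;

		/*
		 * Don't balance to an idle SMT CPU in a busy core right away
		 * when balancing cores, but remember the first such CPU for
		 * later consideration.  Find a CPU on an idle core first.
		 */
		if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
			if (idle_smt == -1)
				idle_smt = cpu;
			continue;
		}

		/* Are we the first idle CPU? */
		return cpu == env->dst_cpu;
	}

The behavior is unchanged: once the !idle_cpu(cpu) case has hit continue, the
rest of the loop body only runs for idle CPUs, so the else adds nothing.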

2023-05-05 23:42:30

by Tim Chen

Subject: Re: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

On Fri, 2023-05-05 at 15:22 +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> > From: Tim C Chen <[email protected]>
> >
> > Do not try to move tasks between non SMT sched group and SMT sched
> > group for "prefer sibling" load balance.
> > Let asym_active_balance_busiest() handle that case properly.
> > Otherwise we could get task bouncing back and forth between
> > the SMT sched group and non SMT sched group.
> >
> > Reviewed-by: Ricardo Neri <[email protected]>
> > Signed-off-by: Tim Chen <[email protected]>
> > ---
> > kernel/sched/fair.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a325db34b02..58ef7d529731 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > /*
> > * Try to move all excess tasks to a sibling domain of the busiest
> > * group's child domain.
> > + *
> > + * Do not try to move between non smt sched group and smt sched
> > + * group. Let asym active balance properly handle that case.
> > */
> > if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > + !asymmetric_groups(sds.busiest, sds.local) &&
> > busiest->sum_nr_running > local->sum_nr_running + 1)
> > goto force_balance;
>
> This seems to have the hidden assumption that a !SMT core is somehow
> 'less' than an SMT core. Should this not also look at
> sched_asym_prefer() to establish this is so?
>
> I mean, imagine I have a regular system and just offline one smt sibling
> for giggles.

I don't quite follow your point, as asymmetric_groups() returns false even
when one SMT sibling is offlined.

Even if, say, sds.busiest has 1 SMT CPU and sds.local has 2 SMT CPUs, both sched
groups still have the SD_SHARE_CPUCAPACITY flag turned on. So asymmetric_groups()
returns false and the load balancing logic is not changed for a regular
non-hybrid system.

I may be missing something.

Tim

2023-05-05 23:58:51

by Peter Zijlstra

Subject: Re: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

On Fri, May 05, 2023 at 04:07:39PM -0700, Tim Chen wrote:
> On Fri, 2023-05-05 at 15:22 +0200, Peter Zijlstra wrote:
> > On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> > > From: Tim C Chen <[email protected]>
> > >
> > > Do not try to move tasks between non SMT sched group and SMT sched
> > > group for "prefer sibling" load balance.
> > > Let asym_active_balance_busiest() handle that case properly.
> > > Otherwise we could get task bouncing back and forth between
> > > the SMT sched group and non SMT sched group.
> > >
> > > Reviewed-by: Ricardo Neri <[email protected]>
> > > Signed-off-by: Tim Chen <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 4 ++++
> > > 1 file changed, 4 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8a325db34b02..58ef7d529731 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > > /*
> > > * Try to move all excess tasks to a sibling domain of the busiest
> > > * group's child domain.
> > > + *
> > > + * Do not try to move between non smt sched group and smt sched
> > > + * group. Let asym active balance properly handle that case.
> > > */
> > > if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > + !asymmetric_groups(sds.busiest, sds.local) &&
> > > busiest->sum_nr_running > local->sum_nr_running + 1)
> > > goto force_balance;
> >
> > This seems to have the hidden assumption that a !SMT core is somehow
> > 'less' than an SMT core. Should this not also look at
> > sched_asym_prefer() to establish this is so?
> >
> > I mean, imagine I have a regular system and just offline one smt sibling
> > for giggles.
>
> I don't quite follow your point, as asymmetric_groups() returns false even
> when one SMT sibling is offlined.
>
> Even if, say, sds.busiest has 1 SMT CPU and sds.local has 2 SMT CPUs, both sched
> groups still have the SD_SHARE_CPUCAPACITY flag turned on. So asymmetric_groups()
> returns false and the load balancing logic is not changed for a regular
> non-hybrid system.
>
> I may be missing something.

What's the difference between the two cases? That is, if the remaining
sibling will have SD_SHARE_CPUCAPACITY from the degenerate SMT domain
that's been reaped, then why doesn't the same thing apply to the atoms
in the hybrid muck?

Those two cases *should* be identical, both cases you have cores with
and cores without SMT.

2023-05-06 00:18:44

by Peter Zijlstra

Subject: Re: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

On Sat, May 06, 2023 at 01:38:36AM +0200, Peter Zijlstra wrote:
> On Fri, May 05, 2023 at 04:07:39PM -0700, Tim Chen wrote:
> > On Fri, 2023-05-05 at 15:22 +0200, Peter Zijlstra wrote:
> > > On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> > > > From: Tim C Chen <[email protected]>
> > > >
> > > > Do not try to move tasks between non SMT sched group and SMT sched
> > > > group for "prefer sibling" load balance.
> > > > Let asym_active_balance_busiest() handle that case properly.
> > > > Otherwise we could get task bouncing back and forth between
> > > > the SMT sched group and non SMT sched group.
> > > >
> > > > Reviewed-by: Ricardo Neri <[email protected]>
> > > > Signed-off-by: Tim Chen <[email protected]>
> > > > ---
> > > > kernel/sched/fair.c | 4 ++++
> > > > 1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 8a325db34b02..58ef7d529731 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > > > /*
> > > > * Try to move all excess tasks to a sibling domain of the busiest
> > > > * group's child domain.
> > > > + *
> > > > + * Do not try to move between non smt sched group and smt sched
> > > > + * group. Let asym active balance properly handle that case.
> > > > */
> > > > if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > > + !asymmetric_groups(sds.busiest, sds.local) &&
> > > > busiest->sum_nr_running > local->sum_nr_running + 1)
> > > > goto force_balance;
> > >
> > > This seems to have the hidden assumption that a !SMT core is somehow
> > > 'less' than an SMT core. Should this not also look at
> > > sched_asym_prefer() to establish this is so?
> > >
> > > I mean, imagine I have a regular system and just offline one smt sibling
> > > for giggles.
> >
> > I don't quite follow your point, as asymmetric_groups() returns false even
> > when one SMT sibling is offlined.
> >
> > Even if, say, sds.busiest has 1 SMT CPU and sds.local has 2 SMT CPUs, both sched
> > groups still have the SD_SHARE_CPUCAPACITY flag turned on. So asymmetric_groups()
> > returns false and the load balancing logic is not changed for a regular
> > non-hybrid system.
> >
> > I may be missing something.
>
> What's the difference between the two cases? That is, if the remaining
> sibling will have SD_SHARE_CPUCAPACITY from the degenerate SMT domain
> that's been reaped, then why doesn't the same thing apply to the atoms
> in the hybrid muck?
>
> Those two cases *should* be identical, both cases you have cores with
> and cores without SMT.

On my alderlake:

[ 202.222019] CPU0 attaching sched-domain(s):
[ 202.222509] domain-0: span=0-1 level=SMT
[ 202.222707] groups: 0:{ span=0 }, 1:{ span=1 }
[ 202.222945] domain-1: span=0-23 level=MC
[ 202.223148] groups: 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 },12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
...
[ 202.249979] CPU23 attaching sched-domain(s):
[ 202.250127] domain-0: span=0-23 level=MC
[ 202.250198] groups: 23:{ span=23 }, 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }

$ echo 0 > /sys/devices/system/cpu/cpu1/online
[ 251.213848] CPU0 attaching sched-domain(s):
[ 251.214376] domain-0: span=0,2-23 level=MC
[ 251.214580] groups: 0:{ span=0 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
...
[ 251.239511] CPU23 attaching sched-domain(s):
[ 251.239656] domain-0: span=0,2-23 level=MC
[ 251.239727] groups: 23:{ span=23 }, 0:{ span=0 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }

$ cat /debug/sched/domains/cpu0/domain0/groups_flags

$ cat /debug/sched/domains/cpu23/domain0/groups_flags


IOW, neither the big core with SMT with one sibling offline, nor the
little core with no SMT on at all have the relevant flags set on their
domain0 groups.



---
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 98bfc0f4ec94..e408b2889186 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -427,6 +427,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
#undef SDM

debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
+ debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
}

void update_sched_domain_debugfs(void)

2023-05-09 13:57:40

by Vincent Guittot

Subject: Re: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

On Thu, 4 May 2023 at 18:11, Tim Chen <[email protected]> wrote:
>
> From: Tim C Chen <[email protected]>
>
> Do not try to move tasks between non SMT sched group and SMT sched
> group for "prefer sibling" load balance.
> Let asym_active_balance_busiest() handle that case properly.
> Otherwise we could get task bouncing back and forth between
> the SMT sched group and non SMT sched group.
>
> Reviewed-by: Ricardo Neri <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a325db34b02..58ef7d529731 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> /*
> * Try to move all excess tasks to a sibling domain of the busiest
> * group's child domain.
> + *
> + * Do not try to move between non smt sched group and smt sched
> + * group. Let asym active balance properly handle that case.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> + !asymmetric_groups(sds.busiest, sds.local) &&

Can't you delete the SD_PREFER_SIBLING flag when building the topology, like
SD_ASYM_CPUCAPACITY does?

Generally speaking, SD_ASYM_CPUCAPACITY and SD_ASYM_PACKING are doing
quite similar things; it would be good to get one common solution
instead of 2 parallel paths.

> busiest->sum_nr_running > local->sum_nr_running + 1)
> goto force_balance;
>
> --
> 2.32.0
>
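
For reference, the existing precedent Vincent points to is in sd_init() in
kernel/sched/topology.c, which clears SD_PREFER_SIBLING at domain build time
for asymmetric-capacity systems; roughly (paraphrased from the kernel of this
era, not part of this series):

	/*
	 * Don't attempt to spread across CPUs of different capacities:
	 * SD_ASYM_CPUCAPACITY drops SD_PREFER_SIBLING on the child domain
	 * when the topology is built, instead of special-casing it in the
	 * load balancer.
	 */
	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
		sd->child->flags &= ~SD_PREFER_SIBLING;

The question raised here is whether the SMT/non-SMT asymmetry could be handled
the same way at topology build time.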

2023-05-09 23:44:24

by Tim Chen

Subject: Re: [PATCH 4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

On Tue, 2023-05-09 at 15:36 +0200, Vincent Guittot wrote:
> On Thu, 4 May 2023 at 18:11, Tim Chen <[email protected]> wrote:
> >
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a325db34b02..58ef7d529731 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > /*
> > * Try to move all excess tasks to a sibling domain of the busiest
> > * group's child domain.
> > + *
> > + * Do not try to move between non smt sched group and smt sched
> > + * group. Let asym active balance properly handle that case.
> > */
> > if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > + !asymmetric_groups(sds.busiest, sds.local) &&
>
> Can't you delete SD_PREFER_SIBLING flags when building topology like
> SD_ASYM_CPUCAPACITY does ?

The sched domain actually can have a mixture of sched groups with Atom modules
and sched groups with SMT cores. When comparing an Atom core cluster sched
group with another Atom core cluster, or an SMT core with another SMT core,
I think we do want the prefer-sibling logic. It is only when we are comparing
an SMT core with an Atom core cluster that we want to skip it. Ricardo, please
correct me if I am wrong.

>
> Generally speaking SD_ASYM_CPUCAPACITY and SD_ASYM_PACKING are doing
> quite similar thing, it would be good to get one common solution
> instead 2 parallel paths

Okay. I'll see what I can do to merge the handling.

Tim


>
> > busiest->sum_nr_running > local->sum_nr_running + 1)
> > goto force_balance;
> >
> > --
> > 2.32.0
> >