2023-06-08 22:49:20

by Tim Chen

Subject: [Patch v2 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs

This is the second version of the patch series that fixes issues so that
cluster scheduling can be enabled on x86 hybrid CPUs. It addresses the
concerns raised by reviewers of the first version:
https://lore.kernel.org/lkml/CAKfTPtD1W6vJQBsNKEt_4tn2EeAs=73CeH4LoCwENrh2JUDwnQ@mail.gmail.com/T/

The review comments were greatly appreciated.

Changes from v1:
1. Peter pointed out that the number of CPUs in a cluster can also
change as CPUs are brought online or offline. Balancing between
sibling clusters should therefore not only consider whether a cluster
has SMT CPUs or pure cores. In this version, I take the approach of
balancing tasks between clusters such that the running_tasks/num_cores
ratio is similar across clusters. This accommodates balancing between
SMT clusters and non-SMT clusters, or between clusters of the same kind
with different numbers of cores (see the illustrative sketch after this
list).

2. Vincent pointed out that the special-case logic in the general path
for detecting a fully busy SMT group can be simplified. A fully busy
SMT group can instead be detected during statistics gathering, which
makes the code cleaner. This version of the patch series makes that
change.

3. Incorporated suggestions by Chen Yu and Hillf Danton to improve the
commit logs.

4. Added a patch by Peter to dump the sched domain groups' flags, and
incorporated his suggestions to simplify the code.
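
For illustration only (this sketch is not part of the patches; the
helper name and the numbers in the comment are made up), the
tasks/cores balancing in change 1 can be sketched as:

/*
 * Sketch: how many tasks to move so that nr_running/num_cores becomes
 * roughly equal between the busiest and the local group.
 */
static long proportional_imbalance(long busiest_nr, long busiest_cores,
                                   long local_nr, long local_cores)
{
        long diff = local_cores * busiest_nr - busiest_cores * local_nr;

        if (diff <= 0)
                return 0;

        /* e.g. busiest: 6 tasks on 4 cores, local: 1 task on 2 cores -> move 1 */
        return diff / (local_cores + busiest_cores);
}

Patch 3 implements this idea (plus additional handling for asymmetric
packing) in sibling_imbalance().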

The performance of this version is similar to the previous version for
most benchmarks, though kbuild is about a couple of percent worse.
Experiments were done on an Alder Lake system with 6 P-cores and 8
E-cores, the E-cores being organized in two clusters of 4 E-cores each.

Single Threaded     6.3-rc5               with cluster          Improvement
Benchmark           (run-run deviation)   scheduling            in Performance
                                          (run-run deviation)
-------------------------------------------------------------------------------
tjbench             (+/- 0.08%)           (+/- 0.51%)           -0.34%
PhPbench            (+/- 0.31%)           (+/- 2.48%)           -0.99%
flac                (+/- 0.58%)           (+/- 0.86%)           +0.61%
pybench             (+/- 3.16%)           (+/- 3.36%)           +1.36%


Multi Threaded      6.3-rc5               with cluster          Improvement
Benchmark           (run-run deviation)   scheduling            in Performance
(-#threads)                               (run-run deviation)
-------------------------------------------------------------------------------
Kbuild-8            (+/- 2.90%)           (+/- 0.24%)           -1.63%
Kbuild-10           (+/- 3.08%)           (+/- 0.47%)           -2.06%
Kbuild-12           (+/- 3.28%)           (+/- 0.28%)           -1.38%
Tensor Lite-8       (+/- 4.84%)           (+/- 3.21%)           -0.57%
Tensor Lite-10      (+/- 0.87%)           (+/- 1.21%)           -1.00%
Tensor Lite-12      (+/- 1.37%)           (+/- 0.43%)           -0.05%

Tim Chen

Ricardo Neri (1):
sched/fair: Consider the idle state of the whole core for load balance

Tim C Chen (5):
sched/fair: Determine active load balance for SMT sched groups
sched/topology: Record number of cores in sched group
sched/fair: Implement prefer sibling imbalance calculation between
asymmetric groups
sched/x86: Add cluster topology to hybrid CPU
sched/debug: Dump domains' sched group flags

arch/x86/kernel/smpboot.c | 3 +
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 6 ++
kernel/sched/topology.c | 10 ++-
5 files changed, 165 insertions(+), 6 deletions(-)

--
2.32.0



2023-06-08 22:50:44

by Tim Chen

Subject: [Patch v2 2/6] sched/topology: Record number of cores in sched group

From: Tim C Chen <[email protected]>

When balancing sibling domains that have different numbers of cores,
the number of tasks in each sibling domain should be proportional to the
number of cores in that domain. In preparation for implementing such a
policy, record the number of cores in a scheduling group.
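
For example (illustrative numbers only): a sched group spanning CPUs
0-3, where {0,1} and {2,3} are SMT siblings, has a group_weight of 4
but spans only 2 cores, so the new sg->cores field would be 2.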

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 10 +++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d0eb36350d2..5f7f36e45b87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1860,6 +1860,7 @@ struct sched_group {
atomic_t ref;

unsigned int group_weight;
+ unsigned int cores;
struct sched_group_capacity *sgc;
int asym_prefer_cpu; /* CPU of highest priority in group */
int flags;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6d5628fcebcf..6b099dbdfb39 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
{
struct sched_group *sg = sd->groups;
+ struct cpumask *mask = sched_domains_tmpmask2;

WARN_ON(!sg);

do {
- int cpu, max_cpu = -1;
+ int cpu, cores = 0, max_cpu = -1;

sg->group_weight = cpumask_weight(sched_group_span(sg));

+ cpumask_copy(mask, sched_group_span(sg));
+ for_each_cpu(cpu, mask) {
+ cores++;
+ cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
+ }
+ sg->cores = cores;
+
if (!(sd->flags & SD_ASYM_PACKING))
goto next;

--
2.32.0


2023-06-08 22:52:10

by Tim Chen

Subject: [Patch v2 1/6] sched/fair: Determine active load balance for SMT sched groups

From: Tim C Chen <[email protected]>

On hybrid CPUs with scheduling clusters enabled, we will need to
consider balancing between the SMT CPU clusters and the Atom core
clusters.

Below is such a hybrid x86 CPU with 4 big cores and 8 Atom cores.
Each scheduling cluster spans an L2 cache.

--L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
[0, 1] [2, 3] [4, 5] [6, 7] [8 9 10 11] [12 13 14 15]
Big Big Big Big Atom Atom
core core core core Module Module

If the busiest group is a big core with both SMT CPUs busy, we should
do active load balancing if the destination group has idle CPU cores.
Such a condition is considered by asym_active_balance() during load
balancing, but not when looking for the busiest group and computing the
load imbalance. Add this consideration in find_busiest_group() and
calculate_imbalance().

In addition, update the logic that determines the busier group when one
group is SMT and the other group is non-SMT but both groups are partially
busy with idle CPUs. The busier group should be the group with idle cores
rather than the group with one busy SMT CPU. We do not want to make the
SMT group the busiest one, pull the only task off the SMT CPU, and cause
the whole core to go empty.

Otherwise, suppose in the search for the busiest group we first encounter
an SMT group with 1 task and set it as the busiest. The destination
group is an Atom cluster with 1 task, and we next encounter an Atom
cluster group with 3 tasks; we will not pick this Atom cluster over the
SMT group, even though we should. As a result, we do not load balance
the busier Atom cluster (with 3 tasks) towards the local Atom cluster
(with 1 task). And it does not make sense to pick the 1-task SMT group
as the busier group, as we also should not pull the task off the SMT CPU
towards the 1-task Atom cluster and make the SMT core completely empty.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87317634fab2..03573362274f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8279,6 +8279,11 @@ enum group_type {
* more powerful CPU.
*/
group_misfit_task,
+ /*
+ * Balance SMT group that's fully busy. Can benefit from migrating
+ * a task on SMT with busy sibling to another CPU on idle core.
+ */
+ group_smt_balance,
/*
* SD_ASYM_PACKING only: One local CPU with higher capacity is available,
* and the task should be migrated to it instead of running on the
@@ -8987,6 +8992,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned int group_smt_balance; /* Task on busy SMT be moved */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
if (sgs->group_asym_packing)
return group_asym_packing;

+ if (sgs->group_smt_balance)
+ return group_smt_balance;
+
if (sgs->group_misfit_task_load)
return group_misfit_task;

@@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}

+/* One group has more than one SMT CPU while the other group does not */
+static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
+ struct sched_group *sg2)
+{
+ if (!sg1 || !sg2)
+ return false;
+
+ return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
+ (sg2->flags & SD_SHARE_CPUCAPACITY);
+}
+
+static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+ struct sched_group *group)
+{
+ if (env->idle == CPU_NOT_IDLE)
+ return false;
+
+ /*
+ * For SMT source group, it is better to move a task
+ * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
+ * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
+ * will not be on.
+ */
+ if (group->flags & SD_SHARE_CPUCAPACITY &&
+ sgs->sum_h_nr_running > 1)
+ return true;
+
+ return false;
+}
+
static inline bool
sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
{
@@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_asym_packing = 1;
}

+ /* Check for loaded SMT group to be balanced to dst CPU */
+ if (!local_group && smt_balance(env, sgs, group))
+ sgs->group_smt_balance = 1;
+
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);

/* Computing avg_load makes sense only when group is overloaded */
@@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
break;

+ case group_smt_balance:
case group_fully_busy:
/*
* Select the fully busy group with highest avg_load. In
@@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;

case group_has_spare:
+ /*
+ * Do not pick sg with SMT CPUs over sg with pure CPUs,
+ * as we do not want to pull task off half empty SMT core
+ * and make the core idle.
+ */
+ if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
+ if (sg->flags & SD_SHARE_CPUCAPACITY)
+ return false;
+ else
+ return true;
+ }
+
/*
* Select not overloaded group with lowest number of idle cpus
* and highest number of running tasks. We could also compare
@@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,

case group_imbalanced:
case group_asym_packing:
+ case group_smt_balance:
/* Those types are not used in the slow wakeup path */
return false;

@@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)

case group_imbalanced:
case group_asym_packing:
+ case group_smt_balance:
/* Those type are not used in the slow wakeup path */
return NULL;

@@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return;
}

+ if (busiest->group_type == group_smt_balance) {
+ /* Reduce number of tasks sharing CPU capacity */
+ env->migration_type = migrate_task;
+ env->imbalance = 1;
+ return;
+ }
+
if (busiest->group_type == group_imbalanced) {
/*
* In the group_imb case we cannot rely on group-wide averages
@@ -10371,6 +10436,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
goto out_balanced;

+ if (busiest->group_type == group_smt_balance &&
+ smt_vs_nonsmt_groups(sds.local, sds.busiest))
+ /* Let non SMT CPU pull from SMT CPU sharing with sibling */
+ goto force_balance;
+
if (busiest->group_weight > 1 &&
local->idle_cpus <= (busiest->idle_cpus + 1))
/*
--
2.32.0


2023-06-08 22:52:52

by Tim Chen

Subject: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

From: Tim C Chen <[email protected]>

In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the number of tasks to be moved is simply the difference
in the number of tasks between the two groups (i.e. imbalance) divided by two.

However, the cluster groups may have different numbers of cores, say
when we take CPUs offline or when we have hybrid groups. In that case,
we should balance between the two groups such that the #tasks/#cores
ratio is the same in both groups. Hence the imbalance computed will
need to reflect this.

Additionally, when we have asymmetric packing, we will need to bias the
balancing according to whether the busiest group or the local group is
favored. If the destination group is favored and has idle cores, we
should move at least one task from the busiest group to avoid leaving a
favored idle core unused. But if the busiest group is favored, we should
limit the number of tasks we move so we will not create idle cores in
the busiest group, leaving favored cores unused.

Adjust the sibling imbalance computation to take the above
considerations into account.
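
As a worked example (the numbers are made up): suppose the busiest group
is the favored group with 3 cores and 4 tasks, and the local group has 3
cores and no tasks. The proportional computation gives
(3*4 - 3*0) / (3+3) = 2 tasks to move, but the limit of
sum_nr_running - ncores = 4 - 3 = 1 caps this at 1 task, so the favored
group is not drained below one task per core.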

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 5 ++++
2 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03573362274f..0b0904263d51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9372,6 +9372,65 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
return false;
}

+static inline long sibling_imbalance(struct lb_env *env,
+ struct sd_lb_stats *sds,
+ struct sg_lb_stats *busiest,
+ struct sg_lb_stats *local)
+{
+ int ncores_busiest, ncores_local;
+ long imbalance;
+
+ if (env->idle == CPU_NOT_IDLE)
+ return 0;
+
+ ncores_busiest = sds->busiest->cores;
+ ncores_local = sds->local->cores;
+
+ if (ncores_busiest == ncores_local &&
+ (!(env->sd->flags & SD_ASYM_PACKING) ||
+ sched_asym_equal(env->dst_cpu,
+ sds->busiest->asym_prefer_cpu))) {
+ imbalance = busiest->sum_nr_running;
+ lsub_positive(&imbalance, local->sum_nr_running);
+ return imbalance;
+ }
+
+ /* Balance such that nr_running/ncores ratio are same on both groups */
+ imbalance = ncores_local * busiest->sum_nr_running;
+ lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
+ /* Normalize imbalance to become tasks to be moved to restore balance */
+ imbalance /= ncores_local + ncores_busiest;
+
+ if (env->sd->flags & SD_ASYM_PACKING) {
+ int limit;
+
+ if (!busiest->sum_nr_running)
+ goto out;
+
+ if (sched_asym_prefer(env->dst_cpu, sds->busiest->asym_prefer_cpu)) {
+ /* Don't leave preferred core idle */
+ if (imbalance == 0 && local->sum_nr_running < ncores_local)
+ imbalance = 1;
+ goto out;
+ }
+
+ /* Limit tasks moved from preferred group, don't leave cores idle */
+ limit = busiest->sum_nr_running;
+ lsub_positive(&limit, ncores_busiest);
+ if (imbalance > limit)
+ imbalance = limit;
+
+ goto out;
+ }
+
+ /* Take advantage of resource in an empty sched group */
+ if (imbalance == 0 && local->sum_nr_running == 0 &&
+ busiest->sum_nr_running > 1)
+ imbalance = 1;
+out:
+ return imbalance << 1;
+}
+
static inline bool
sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
{
@@ -10230,14 +10289,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
}

if (busiest->group_weight == 1 || sds->prefer_sibling) {
- unsigned int nr_diff = busiest->sum_nr_running;
/*
* When prefer sibling, evenly spread running tasks on
* groups.
*/
env->migration_type = migrate_task;
- lsub_positive(&nr_diff, local->sum_nr_running);
- env->imbalance = nr_diff;
+ env->imbalance = sibling_imbalance(env, sds, busiest, local);
} else {

/*
@@ -10424,7 +10481,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* group's child domain.
*/
if (sds.prefer_sibling && local->group_type == group_has_spare &&
- busiest->sum_nr_running > local->sum_nr_running + 1)
+ sibling_imbalance(env, &sds, busiest, local) > 1)
goto force_balance;

if (busiest->group_type != group_overloaded) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f7f36e45b87..adffe0894cdb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -804,6 +804,11 @@ static inline bool sched_asym_prefer(int a, int b)
return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}

+static inline bool sched_asym_equal(int a, int b)
+{
+ return arch_asym_cpu_priority(a) == arch_asym_cpu_priority(b);
+}
+
struct perf_domain {
struct em_perf_domain *em_pd;
struct perf_domain *next;
--
2.32.0


2023-06-08 23:11:05

by Tim Chen

Subject: [Patch v2 4/6] sched/fair: Consider the idle state of the whole core for load balance

From: Ricardo Neri <[email protected]>

should_we_balance() traverses the group_balance_mask (AND'ed with
lb_env::cpus), starting from lower-numbered CPUs, looking for the first
idle CPU.

In hybrid x86 systems, the siblings of SMT cores get CPU numbers before
non-SMT cores:

[0, 1] [2, 3] [4, 5] 6 7 8 9
b i b i b i b i i i

In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.

We should let a CPU on a fully idle core get the first chance to do
idle load balancing, as it has more CPU capacity than an idle SMT CPU
whose sibling is busy. So for the figure above, if we are running
should_we_balance() for CPU 1, we should return false to let CPU 7,
which is on an idle core, have the first chance to do idle load
balancing.

A partially busy (i.e., of type group_has_spare) local group with SMT
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, such partially busy, lower-numbered SMT cores should
not be considered when finding the first idle CPU.

However, in should_we_balance(), when we first encounter an idle SMT CPU
in a partially busy core, we prematurely end the search for the first
idle CPU.

Higher-numbered, non-SMT cores are not given the chance to have idle
balancing done on their behalf. Those CPUs will only be considered for
idle balancing by chance via CPU_NEWLY_IDLE.

Instead, consider the idle state of the whole SMT core.
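
The patch relies on is_core_idle() for this. As a simplified sketch of
the idea (illustrative only, not the kernel's exact helper; the function
name below is made up), a core counts as idle only when every one of its
SMT siblings is idle:

static bool core_fully_idle(int cpu)
{
        int sibling;

        for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                if (!idle_cpu(sibling))
                        return false;
        }
        return true;
}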

Signed-off-by: Ricardo Neri <[email protected]>
Co-developed-by: Tim Chen <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b0904263d51..33246dce10db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10749,7 +10749,7 @@ static int active_load_balance_cpu_stop(void *data);
static int should_we_balance(struct lb_env *env)
{
struct sched_group *sg = env->sd->groups;
- int cpu;
+ int cpu, idle_smt = -1;

/*
* Ensure the balancing environment is consistent; can happen
@@ -10776,10 +10776,24 @@ static int should_we_balance(struct lb_env *env)
if (!idle_cpu(cpu))
continue;

+ /*
+ * Don't balance to idle SMT in busy core right away when
+ * balancing cores, but remember the first idle SMT CPU for
+ * later consideration. Find CPU on an idle core first.
+ */
+ if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
+ if (idle_smt == -1)
+ idle_smt = cpu;
+ continue;
+ }
+
/* Are we the first idle CPU? */
return cpu == env->dst_cpu;
}

+ if (idle_smt == env->dst_cpu)
+ return true;
+
/* Are we the first CPU of this group ? */
return group_balance_cpu(sg) == env->dst_cpu;
}
--
2.32.0


2023-06-12 11:41:34

by Peter Zijlstra

Subject: Re: [Patch v2 1/6] sched/fair: Determine active load balance for SMT sched groups

On Thu, Jun 08, 2023 at 03:32:27PM -0700, Tim Chen wrote:

> +/* One group has more than one SMT CPU while the other group does not */
> +static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
> + struct sched_group *sg2)
> +{
> + if (!sg1 || !sg2)
> + return false;
> +
> + return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
> + (sg2->flags & SD_SHARE_CPUCAPACITY);
> +}
> +
> +static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> + struct sched_group *group)
> +{
> + if (env->idle == CPU_NOT_IDLE)
> + return false;
> +
> + /*
> + * For SMT source group, it is better to move a task
> + * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
> + * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> + * will not be on.
> + */
> + if (group->flags & SD_SHARE_CPUCAPACITY &&
> + sgs->sum_h_nr_running > 1)
> + return true;

AFAICT this does the right thing for SMT>2

> +
> + return false;
> +}
> +
> static inline bool
> sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> {

> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
>
> case group_has_spare:
> + /*
> + * Do not pick sg with SMT CPUs over sg with pure CPUs,
> + * as we do not want to pull task off half empty SMT core
> + * and make the core idle.
> + */
> + if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> + if (sg->flags & SD_SHARE_CPUCAPACITY)
> + return false;
> + else
> + return true;
> + }

However, here I'm not at all sure. Consider SMT-4 with 2 active CPUs, we
still very much would like to pull one task off if we have an idle core
somewhere, no?


2023-06-12 11:41:35

by Peter Zijlstra

Subject: Re: [Patch v2 1/6] sched/fair: Determine active load balance for SMT sched groups

On Thu, Jun 08, 2023 at 03:32:27PM -0700, Tim Chen wrote:
> @@ -10371,6 +10436,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> */
> goto out_balanced;
>
> + if (busiest->group_type == group_smt_balance &&
> + smt_vs_nonsmt_groups(sds.local, sds.busiest))
> + /* Let non SMT CPU pull from SMT CPU sharing with sibling */
> + goto force_balance;
> +
> if (busiest->group_weight > 1 &&
> local->idle_cpus <= (busiest->idle_cpus + 1))
> /*

Could you please add {} for all of them? The comment makes them all
multi-line.

2023-06-12 11:53:30

by Peter Zijlstra

Subject: Re: [Patch v2 2/6] sched/topology: Record number of cores in sched group

On Thu, Jun 08, 2023 at 03:32:28PM -0700, Tim Chen wrote:
> From: Tim C Chen <[email protected]>
>
> When balancing sibling domains that have different numbers of cores,
> the number of tasks in each sibling domain should be proportional to the
> number of cores in that domain. In preparation for implementing such a
> policy, record the number of cores in a scheduling group.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/sched.h | 1 +
> kernel/sched/topology.c | 10 +++++++++-
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3d0eb36350d2..5f7f36e45b87 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1860,6 +1860,7 @@ struct sched_group {
> atomic_t ref;
>
> unsigned int group_weight;
> + unsigned int cores;
> struct sched_group_capacity *sgc;
> int asym_prefer_cpu; /* CPU of highest priority in group */
> int flags;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6d5628fcebcf..6b099dbdfb39 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> {
> struct sched_group *sg = sd->groups;
> + struct cpumask *mask = sched_domains_tmpmask2;
>
> WARN_ON(!sg);
>
> do {
> - int cpu, max_cpu = -1;
> + int cpu, cores = 0, max_cpu = -1;
>
> sg->group_weight = cpumask_weight(sched_group_span(sg));
>
> + cpumask_copy(mask, sched_group_span(sg));
> + for_each_cpu(cpu, mask) {
> + cores++;
> + cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
> + }
> + sg->cores = cores;
> +
> if (!(sd->flags & SD_ASYM_PACKING))
> goto next;

Just a note; not sure we want or can do anything about this, but
consider someone doing partitions like:

[0,1] [2,3] [4,5]
[------] [------]

That is, 3 SMT cores, and 2 partitions splitting an SMT core in two.

Then the domain trees will see either 2 or 3 but not the full core.

I'm perfectly fine with saying: don't do that then.

2023-06-12 12:24:04

by Peter Zijlstra

Subject: Re: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

On Thu, Jun 08, 2023 at 03:32:29PM -0700, Tim Chen wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03573362274f..0b0904263d51 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9372,6 +9372,65 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> return false;
> }
>
> +static inline long sibling_imbalance(struct lb_env *env,
> + struct sd_lb_stats *sds,
> + struct sg_lb_stats *busiest,
> + struct sg_lb_stats *local)
> +{
> + int ncores_busiest, ncores_local;
> + long imbalance;
> +
> + if (env->idle == CPU_NOT_IDLE)
> + return 0;
> +
> + ncores_busiest = sds->busiest->cores;
> + ncores_local = sds->local->cores;
> +
> + if (ncores_busiest == ncores_local &&
> + (!(env->sd->flags & SD_ASYM_PACKING) ||
> + sched_asym_equal(env->dst_cpu,
> + sds->busiest->asym_prefer_cpu))) {
> + imbalance = busiest->sum_nr_running;
> + lsub_positive(&imbalance, local->sum_nr_running);
> + return imbalance;
> + }
> +
> + /* Balance such that nr_running/ncores ratio are same on both groups */
> + imbalance = ncores_local * busiest->sum_nr_running;
> + lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
> + /* Normalize imbalance to become tasks to be moved to restore balance */
> + imbalance /= ncores_local + ncores_busiest;
> +
> + if (env->sd->flags & SD_ASYM_PACKING) {
> + int limit;
> +
> + if (!busiest->sum_nr_running)
> + goto out;

This seems out-of-place, shouldn't we terminate sooner if busiest
is empty?

> +
> + if (sched_asym_prefer(env->dst_cpu, sds->busiest->asym_prefer_cpu)) {
> + /* Don't leave preferred core idle */
> + if (imbalance == 0 && local->sum_nr_running < ncores_local)
> + imbalance = 1;
> + goto out;
> + }
> +
> + /* Limit tasks moved from preferred group, don't leave cores idle */
> + limit = busiest->sum_nr_running;
> + lsub_positive(&limit, ncores_busiest);
> + if (imbalance > limit)
> + imbalance = limit;

How does this affect the server parts that have larger than single core
turbo domains?

> +
> + goto out;
> + }
> +
> + /* Take advantage of resource in an empty sched group */
> + if (imbalance == 0 && local->sum_nr_running == 0 &&
> + busiest->sum_nr_running > 1)
> + imbalance = 1;
> +out:
> + return imbalance << 1;
> +}


But basically you have:

LcBn - BcLn
imb = -----------
LcBc

Which makes sense, except you then return:

imb * 2

which then made me wonder about rounding.

Do we want to add (LcBc -1) or (LcBc/2) to resp. ceil() or round()
the thing before division? Because currently it uses floor().

If you evaluate it like:


2 * (LcBn - BcLn)
imb = -----------------
LcBc

The result is different from what you have now.

What actual behaviour is desired in these low imbalance cases? and can
you write a comment as to why we do as we do etc..?

2023-06-12 20:20:26

by Tim Chen

Subject: Re: [Patch v2 1/6] sched/fair: Determine active load balance for SMT sched groups

On Mon, 2023-06-12 at 13:13 +0200, Peter Zijlstra wrote:
>
>
>
> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > break;
> >
> > case group_has_spare:
> > + /*
> > + * Do not pick sg with SMT CPUs over sg with pure CPUs,
> > + * as we do not want to pull task off half empty SMT core
> > + * and make the core idle.
> > + */
> > + if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> > + if (sg->flags & SD_SHARE_CPUCAPACITY)
> > + return false;
> > + else
> > + return true;
> > + }
>
> However, here I'm not at all sure. Consider SMT-4 with 2 active CPUs, we
> still very much would like to pull one task off if we have an idle core
> somewhere, no?
>

How about making this modification to take care of SMT-4 case?

Tim

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33246dce10db..e2261c24e536 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9642,11 +9642,11 @@ static bool update_sd_pick_busiest(struct lb_env *env,
case group_has_spare:
/*
* Do not pick sg with SMT CPUs over sg with pure CPUs,
- * as we do not want to pull task off half empty SMT core
+ * as we do not want to pull task off SMT core with one task
* and make the core idle.
*/
if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
- if (sg->flags & SD_SHARE_CPUCAPACITY)
+ if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
return false;
else
return true;



2023-06-12 20:26:32

by Tim Chen

Subject: Re: [Patch v2 2/6] sched/topology: Record number of cores in sched group

On Mon, 2023-06-12 at 13:29 +0200, Peter Zijlstra wrote:
> On Thu, Jun 08, 2023 at 03:32:28PM -0700, Tim Chen wrote:
> >
> > sg->group_weight = cpumask_weight(sched_group_span(sg));
> >
> > + cpumask_copy(mask, sched_group_span(sg));
> > + for_each_cpu(cpu, mask) {
> > + cores++;
> > + cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
> > + }
> > + sg->cores = cores;
> > +
> > if (!(sd->flags & SD_ASYM_PACKING))
> > goto next;
>
> Just a note; not sure we want or can do anything about this, but
> consider someone doing partitions like:
>
> [0,1] [2,3] [4,5]
> [------] [------]
>
> That is, 3 SMT cores, and 2 partitions splitting an SMT core in two.
>
> Then the domain trees will see either 2 or 3 but not the full core.
>
> I'm perfectly fine with saying: don't do that then.

I also can't see a reason to split SMT between two domains.

Tim

2023-06-12 20:35:58

by Tim Chen

Subject: Re: [Patch v2 1/6] sched/fair: Determine active load balance for SMT sched groups

On Mon, 2023-06-12 at 13:16 +0200, Peter Zijlstra wrote:
> On Thu, Jun 08, 2023 at 03:32:27PM -0700, Tim Chen wrote:
> > @@ -10371,6 +10436,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > */
> > goto out_balanced;
> >
> > + if (busiest->group_type == group_smt_balance &&
> > + smt_vs_nonsmt_groups(sds.local, sds.busiest))
> > + /* Let non SMT CPU pull from SMT CPU sharing with sibling */
> > + goto force_balance;
> > +
> > if (busiest->group_weight > 1 &&
> > local->idle_cpus <= (busiest->idle_cpus + 1))
> > /*
>
> Could you please add {} for all of them? The comment makes them all
> multi-line.

Will do.

Tim

2023-06-12 20:36:08

by Peter Zijlstra

Subject: Re: [Patch v2 1/6] sched/fair: Determine active load balance for SMT sched groups

On Mon, Jun 12, 2023 at 01:12:10PM -0700, Tim Chen wrote:
> How about making this modification to take care of SMT-4 case?
>

Yep, that works.

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 33246dce10db..e2261c24e536 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9642,11 +9642,11 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> case group_has_spare:
> /*
> * Do not pick sg with SMT CPUs over sg with pure CPUs,
> - * as we do not want to pull task off half empty SMT core
> + * as we do not want to pull task off SMT core with one task
> * and make the core idle.
> */
> if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> - if (sg->flags & SD_SHARE_CPUCAPACITY)
> + if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> return false;
> else
> return true;
>
>

2023-06-13 18:15:30

by Tim Chen

Subject: Re: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

On Mon, 2023-06-12 at 14:05 +0200, Peter Zijlstra wrote:
> On Thu, Jun 08, 2023 at 03:32:29PM -0700, Tim Chen wrote:
>
> >
> > + if (env->sd->flags & SD_ASYM_PACKING) {
> > + int limit;
> > +
> > + if (!busiest->sum_nr_running)
> > + goto out;
>
> This seems out-of-place, shouldn't we terminate sooner if busiest
> is empty?

Yes. Should move this check to the beginning.

>
> > +
> > + if (sched_asym_prefer(env->dst_cpu, sds->busiest->asym_prefer_cpu)) {
> > + /* Don't leave preferred core idle */
> > + if (imbalance == 0 && local->sum_nr_running < ncores_local)
> > + imbalance = 1;
> > + goto out;
> > + }
> > +
> > + /* Limit tasks moved from preferred group, don't leave cores idle */
> > + limit = busiest->sum_nr_running;
> > + lsub_positive(&limit, ncores_busiest);
> > + if (imbalance > limit)
> > + imbalance = limit;
>
> How does this affect the server parts that have larger than single core
> turbo domains?

Are you thinking about the case where the local group is completely empty
so there's turbo headroom and we should move at least one task, even though
CPU in busiest group has higher priority?

In other words, are you suggesting we should add

if (imbalance == 0 && busiest->sum_nr_running > 0 &&
local->sum_nr_running == 0)
imbalance = 1;


>
> > +
> > + goto out;
> > + }
> > +
> > + /* Take advantage of resource in an empty sched group */
> > + if (imbalance == 0 && local->sum_nr_running == 0 &&
> > + busiest->sum_nr_running > 1)
> > + imbalance = 1;
> > +out:
> > + return imbalance << 1;
> > +}
>
>
> But basically you have:
>
> LcBn - BcLn
> imb = -----------
> LcBc
>
> Which makes sense, except you then return:
>
> imb * 2
>
> which then made me wonder about rounding.
>
> Do we want to add (LcBc -1) or (LcBc/2) to resp. ceil() or round()
> the thing before division? Because currently it uses floor().
>
> If you evaluate it like:
>
>
> 2 * (LcBn - BcLn)
> imb = -----------------
> LcBc
>
> The result is different from what you have now.

If I do the rounding after multiplying imb by two (call it imb_new),
the difference with imb I am returning now (call it imb_old)
will be at most 1. Note that imb_old returned is always a multiple of 2.

I will be using imb in calculate_imbalance() and divide it
by 2 there to get the number of tasks to move from the busiest group.
So when there is a difference of 1 between imb_old and imb_new,
the difference will be trimmed off after the division of 2.

We will get the same number of tasks to move with either
imb_old or imb_new in calculate_imbalance() so the two
computations will arrive at the same result eventually.
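
For a concrete example (made-up numbers): with a numerator of 8 and a
denominator of 5, imb_old = 2 * floor(8/5) = 2 while imb_new =
floor(16/5) = 3; after the later division by 2, both give 1 task to
move.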

>
> What actual behaviour is desired in these low imbalance cases? and can
> you write a comment as to why we do as we do etc..?

I do not keep imb as

2 * (LcBn - BcLn)
imb = -----------------
LcBc

as it is easier to leave out the factor of 2
in the middle of the sibling_imbalance() computation
so I can directly interpret imb as the number
of tasks to move, and add the factor of two
when I actually need to return the imbalance.

Would you like me to add this reasoning in the comments?

Thanks.

Tim


2023-06-15 11:19:51

by Peter Zijlstra

Subject: Re: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

On Tue, Jun 13, 2023 at 10:46:36AM -0700, Tim Chen wrote:
> On Mon, 2023-06-12 at 14:05 +0200, Peter Zijlstra wrote:

> > > + /* Limit tasks moved from preferred group, don't leave cores idle */
> > > + limit = busiest->sum_nr_running;
> > > + lsub_positive(&limit, ncores_busiest);
> > > + if (imbalance > limit)
> > > + imbalance = limit;
> >
> > How does this affect the server parts that have larger than single core
> > turbo domains?
>
> Are you thinking about the case where the local group is completely empty
> so there's turbo headroom and we should move at least one task, even though
> CPU in busiest group has higher priority?

Something along those lines, I didn't think it through, just wondered
about the wisdom of piling everything in the highest priority 'domain',
which would depress the turbo range.

Rjw said that on modern client the frequency domains are per-core, but
that on server parts they are still wider -- or something long those
lines. So it might make sense to consider some of that.

> In other words, are you suggesting we should add
>
> if (imbalance == 0 && busiest->sum_nr_running > 0 &&
> local->sum_nr_running == 0)
> imbalance = 1;

I didn't get that far; and I don't think we have the right topology
information on the servers to even begin considering the effects of the
turbo-range, so perhaps it all doesn't matter.

Just wanted to raise the point for consideration.

Because as-is, the policy of piling extra on the preferred group doesn't
immediately make sense. IIRC the whole ITMT preferred core is mostly
about an extra turbo bin, but if you pile on, the headroom evaporates --
you'll never turbo that high anymore and special casing it doesn't make
sense.

So perhaps I'm not saying more code, but less code is better here.

Dunno, is any of this measurable either way around?

> > > +
> > > + goto out;
> > > + }
> > > +
> > > + /* Take advantage of resource in an empty sched group */
> > > + if (imbalance == 0 && local->sum_nr_running == 0 &&
> > > + busiest->sum_nr_running > 1)
> > > + imbalance = 1;
> > > +out:
> > > + return imbalance << 1;
> > > +}
> >
> >
> > But basically you have:
> >
> > LcBn - BcLn
> > imb = -----------
> > LcBc
> >
> > Which makes sense, except you then return:
> >
> > imb * 2
> >
> > which then made me wonder about rounding.
> >
> > Do we want to add (LcBc -1) or (LcBc/2) to resp. ceil() or round()
> > the thing before division? Because currently it uses floor().
> >
> > If you evaluate it like:
> >
> >
> > 2 * (LcBn - BcLn)
> > imb = -----------------
> > LcBc
> >
> > The result is different from what you have now.
>
> If I do the rounding after multiplying imb by two (call it imb_new),
> the difference with imb I am returning now (call it imb_old)
> will be at most 1. Note that imb_old returned is always a multiple of 2.
>
> I will be using imb in calculate_imbalance() and divide it
> by 2 there to get the number of tasks to move from the busiest group.
> So when there is a difference of 1 between imb_old and imb_new,
> the difference will be trimmed off after the division of 2.
>
> We will get the same number of tasks to move with either
> imb_old or imb_new in calculate_imbalance() so the two
> computations will arrive at the same result eventually.
>
> >
> > What actual behaviour is desired in these low imbalance cases? and can
> > you write a comment as to why we do as we do etc..?
>
> I do not keep imb as
>
> 2 * (LcBn - BcLn)
> imb = -----------------
> LcBc
>
> as it is easier to leave out the factor of 2
> in the middle of the sibling_imbalance() computation
> so I can directly interpret imb as the number
> of tasks to move, and add the factor of two
> when I actually need to return the imbalance.
>
> Would you like me to add this reasoning in the comments?

So if we want a multiple of 2, leaving that multiplication off makes
sense, but I'm not sure I got the argument for or against the rounding.

floor() gets us 1 task to move when there is at least a whole task's
worth of imbalance, but less than 2.

round() would get us 1 task to move when there's at least half a task's
worth of imbalance but less than 1.5.

ceil() will migrate on any imbalance, however small -- which will result
in ping-pong I suppose, so let's disregard that.

The difference, with the multiplication later, is 0 or 2.

Does the round() still result in ping-pong?

2023-06-15 17:28:47

by Tim Chen

Subject: Re: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

On Thu, 2023-06-15 at 13:07 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 10:46:36AM -0700, Tim Chen wrote:
> > On Mon, 2023-06-12 at 14:05 +0200, Peter Zijlstra wrote:
>
> > > > + /* Limit tasks moved from preferred group, don't leave cores idle */
> > > > + limit = busiest->sum_nr_running;
> > > > + lsub_positive(&limit, ncores_busiest);
> > > > + if (imbalance > limit)
> > > > + imbalance = limit;
> > >
> > > How does this affect the server parts that have larger than single core
> > > turbo domains?
> >
> > Are you thinking about the case where the local group is completely empty
> > so there's turbo headroom and we should move at least one task, even though
> > CPU in busiest group has higher priority?
>
> Something along those lines, I didn't think it through, just wondered
> about the wisdom of piling everything in the highest priority 'domain',
> which would depress the turbo range.
>
> Rjw said that on modern client the frequency domains are per-core, but
> that on server parts they are still wider -- or something along those
> lines. So it might make sense to consider some of that.
>

I see what you are saying. In that case, even for ASYM_PACKING, it probably
makes more sense to simply distribute tasks in proportion to the number of
cores in each sched group, and to pay attention that we do not let the
preferred sched group go idle.

>
> > In other words, are you suggesting we should add
> >
> > if (imbalance == 0 && busiest->sum_nr_running > 0 &&
> > local->sum_nr_running == 0)
> > imbalance = 1;
>
> I didn't get that far; and I don't think we have the right topology
> information on the servers to even begin considering the effects of the
> turbo-range, so perhaps it all doesn't matter.
>
> Just wanted to raise the point for consideration.
>
> Because as-is, the policy of piling extra on the preferred group doesn't
> immediately make sense. IIRC the whole ITMT preferred core is mostly
> about an extra turbo bin, but if you pile on, the headroom evaporates --
> you'll never turbo that high anymore and special casing it doesn't make
> sense.
>
> So perhaps I'm not saying more code, but less code is better here.
>
> Dunno, is any of this measurable either way around?

As far as I know, only client CPUs have ITMT and asymmetric turbo.
So this behavior could only be approximated on a client's Atom clusters with
big cores turned off.

>
> > > > +
> > > > + goto out;
> > > > + }
> > > > +
> > > > + /* Take advantage of resource in an empty sched group */
> > > > + if (imbalance == 0 && local->sum_nr_running == 0 &&
> > > > + busiest->sum_nr_running > 1)
> > > > + imbalance = 1;
> > > > +out:
> > > > + return imbalance << 1;
> > > > +}
> > >
> > >
> > > But basically you have:
> > >
> > > LcBn - BcLn
> > > imb = -----------
> > > LcBc
> > >
> > > Which makes sense, except you then return:
> > >
> > > imb * 2
> > >
> > > which then made me wonder about rounding.
> > >
> > > Do we want to add (LcBc -1) or (LcBc/2) to resp. ceil() or round()
> > > the thing before division? Because currently it uses floor().
> > >
> > > If you evaluate it like:
> > >
> > >
> > > 2 * (LcBn - BcLn)
> > > imb = -----------------
> > > LcBc
> > >
> > > The result is different from what you have now.
> >
> > If I do the rounding after multiplying imb by two (call it imb_new),
> > the difference with imb I am returning now (call it imb_old)
> > will be at most 1. Note that imb_old returned is always a multiple of 2.
> >
> > I will be using imb in calculate_imbalance() and divide it
> > by 2 there to get the number of tasks to move from the busiest group.
> > So when there is a difference of 1 between imb_old and imb_new,
> > the difference will be trimmed off after the division of 2.
> >
> > We will get the same number of tasks to move with either
> > imb_old or imb_new in calculate_imbalance() so the two
> > computations will arrive at the same result eventually.
> >
> > >
> > > What actual behaviour is desired in these low imbalance cases? and can
> > > you write a comment as to why we do as we do etc..?
> >
> > I do not keep imb as
> >
> > 2 * (LcBn - BcLn)
> > imb = -----------------
> > LcBc
> >
> > as it is easier to leave out the factor of 2
> > in the middle of the sibling_imbalance() computation
> > so I can directly interpret imb as the number
> > of tasks to move, and add the factor of two
> > when I actually need to return the imbalance.
> >
> > Would you like me to add this reasoning in the comments?
>
> So if we want a multiple of 2, leaving that multiplication off makes
> sense, but I'm not sure I got the argument for or against the rounding.
>
> floor() gets us 1 task to move when there is at least a whole task's
> worth of imbalance, but less than 2.
>
> round() would get us 1 task to move when there's at least half a task's
> worth of imbalance but less than 1.5.
>
> ceil() will migrate on any imbalance, however small -- which will result
> in ping-pong I suppose, so let's disregard that.
>
> The difference, with the multiplication later, is 0 or 2.
>
> Does the round() still result in ping-pong?

Will have to experiment to see whether round() is better.
I chose floor() on purpose in my initial implementation
to minimize migrations.

Tim