Subject: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building the sched_domain from the sibling's
child domain will definitely create a sched_domain containing a sched_group
which spans outside the sched_domain:

+------+ +------+ +-------+ +------+
| node | 12 |node | 20 | node | 12 |node |
| 0 +---------+1 +--------+ 2 +-------+3 |
+------+ +------+ +-------+ +------+

domain0 node0 node1 node2 node3

domain1 node0+1 node0+1 node2+3 node2+3
+
domain2 node0+1+2 |
group: node0+1 |
group:node2+3 <-------------------+

When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node3 is outside
the span of the domain covering node0+1+2.

This makes load_balance() run based on a skewed avg_load and group_type
in the sched_group spanning outside the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we instead use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the
lower-level sgc directly.

+------+ +------+ +-------+ +------+
| node | 12 |node | 20 | node | 12 |node |
| 0 +---------+1 +--------+ 2 +-------+3 |
+------+ +------+ +-------+ +------+

domain0 node0 node1 +- node2 node3
|
domain1 node0+1 node0+1 | node2+3 node2+3
|
domain2 node0+1+2 |
group: node0+1 |
group:node2 <-------------------+
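
The descend-to-grandchild step can be sketched as a toy userspace model
(hypothetical struct and function names; node-level bitmasks stand in for
cpumasks — this is an illustration, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a sched_domain hierarchy: 'span' is a bitmask of NUMA
 * nodes covered, 'child' points to the next lower level. */
struct toy_sd {
    unsigned int span;
    struct toy_sd *child;
};

/* Mirror of the patch's loop: while the sibling's child still spans
 * outside the domain being built, step down one level. */
struct toy_sd *fit_sibling(struct toy_sd *sibling, unsigned int domain_span)
{
    while (sibling->child && (sibling->child->span & ~domain_span))
        sibling = sibling->child;
    return sibling;
}

/* The group is then built from the returned sibling's child, as
 * build_group_from_child_sched_domain() does. */
unsigned int group_span(struct toy_sd *sibling)
{
    return sibling->child ? sibling->child->span : sibling->span;
}
```

With node2's hierarchy domain0 = {2}, domain1 = {2,3}, domain2 = {0,2,3}
and node0's domain2 spanning {0,1,2}, the walk stops at domain1 and the
group becomes {2} — matching "group:node2" in the diagram above.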

Tested with the topology below:
qemu-system-aarch64 -M virt -nographic \
-smp cpus=8 \
-numa node,cpus=0-1,nodeid=0 \
-numa node,cpus=2-3,nodeid=1 \
-numa node,cpus=4-5,nodeid=2 \
-numa node,cpus=6-7,nodeid=3 \
-numa dist,src=0,dst=1,val=12 \
-numa dist,src=0,dst=2,val=20 \
-numa dist,src=0,dst=3,val=22 \
-numa dist,src=1,dst=2,val=22 \
-numa dist,src=2,dst=3,val=12 \
-numa dist,src=1,dst=3,val=24 \
-m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
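
The subset violation can be reproduced from the distance table above with a
few lines of plain C (a hypothetical userspace illustration, not kernel
code): a domain at distance level d spans every node within distance d, and
the sibling's child span fails to be a subset of the domain being built.

```c
#include <assert.h>

/* NUMA distance table from the qemu test topology above (nodes 0..3). */
const int numa_dist[4][4] = {
    { 10, 12, 20, 22 },
    { 12, 10, 22, 24 },
    { 20, 22, 10, 12 },
    { 22, 24, 12, 10 },
};

/* Bitmask of nodes within distance 'd' of 'node': the span a
 * sched_domain at that distance level covers. */
unsigned int node_span(int node, int d)
{
    unsigned int mask = 0;

    for (int j = 0; j < 4; j++)
        if (numa_dist[node][j] <= d)
            mask |= 1u << j;
    return mask;
}
```

Here node0's domain2 spans nodes {0,1,2}, node2's child domain spans {2,3}
(spilling node3 outside), while node2's grandchild span {2} fits — exactly
the fix the patch makes.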

w/o patch, we get lots of "groups don't span domain->span":
[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193] domain-0: span=0-1 level=MC
[ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693] domain-1: span=0-3 level=NUMA
[ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811] domain-2: span=0-5 level=NUMA
[ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058] domain-3: span=0-7 level=NUMA
[ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804055] CPU1 attaching sched-domain(s):
[ 0.804072] domain-0: span=0-1 level=MC
[ 0.804096] groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[ 0.804152] domain-1: span=0-3 level=NUMA
[ 0.804170] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.804219] domain-2: span=0-5 level=NUMA
[ 0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.804302] ERROR: groups don't span domain->span
[ 0.804520] domain-3: span=0-7 level=NUMA
[ 0.804546] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804677] CPU2 attaching sched-domain(s):
[ 0.804687] domain-0: span=2-3 level=MC
[ 0.804705] groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
[ 0.804754] domain-1: span=0-3 level=NUMA
[ 0.804772] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.804820] domain-2: span=0-5 level=NUMA
[ 0.804836] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.804944] ERROR: groups don't span domain->span
[ 0.805108] domain-3: span=0-7 level=NUMA
[ 0.805134] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805223] CPU3 attaching sched-domain(s):
[ 0.805232] domain-0: span=2-3 level=MC
[ 0.805249] groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
[ 0.805319] domain-1: span=0-3 level=NUMA
[ 0.805336] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.805383] domain-2: span=0-5 level=NUMA
[ 0.805399] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.805458] ERROR: groups don't span domain->span
[ 0.805605] domain-3: span=0-7 level=NUMA
[ 0.805626] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805712] CPU4 attaching sched-domain(s):
[ 0.805721] domain-0: span=4-5 level=MC
[ 0.805738] groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
[ 0.805787] domain-1: span=4-7 level=NUMA
[ 0.805803] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.805851] domain-2: span=0-1,4-7 level=NUMA
[ 0.805867] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.805915] ERROR: groups don't span domain->span
[ 0.806108] domain-3: span=0-7 level=NUMA
[ 0.806130] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.806214] CPU5 attaching sched-domain(s):
[ 0.806222] domain-0: span=4-5 level=MC
[ 0.806240] groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
[ 0.806841] domain-1: span=4-7 level=NUMA
[ 0.806866] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.806934] domain-2: span=0-1,4-7 level=NUMA
[ 0.806953] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.807004] ERROR: groups don't span domain->span
[ 0.807312] domain-3: span=0-7 level=NUMA
[ 0.807386] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.807686] CPU6 attaching sched-domain(s):
[ 0.807710] domain-0: span=6-7 level=MC
[ 0.807750] groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
[ 0.807840] domain-1: span=4-7 level=NUMA
[ 0.807870] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.807952] domain-2: span=0-1,4-7 level=NUMA
[ 0.807985] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.808045] ERROR: groups don't span domain->span
[ 0.808257] domain-3: span=0-7 level=NUMA
[ 0.808571] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
[ 0.808848] CPU7 attaching sched-domain(s):
[ 0.808860] domain-0: span=6-7 level=MC
[ 0.808880] groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
[ 0.808953] domain-1: span=4-7 level=NUMA
[ 0.808974] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.809034] domain-2: span=0-1,4-7 level=NUMA
[ 0.809055] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.809128] ERROR: groups don't span domain->span
[ 0.810361] domain-3: span=0-7 level=NUMA
[ 0.810400] groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }

w/ patch, we don't get "groups don't span domain->span" any more:
[ 1.486271] CPU0 attaching sched-domain(s):
[ 1.486820] domain-0: span=0-1 level=MC
[ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
[ 1.515717] domain-1: span=0-3 level=NUMA
[ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.516989] domain-2: span=0-5 level=NUMA
[ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
[ 1.517369] domain-3: span=0-7 level=NUMA
[ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
[ 1.520027] CPU1 attaching sched-domain(s):
[ 1.520097] domain-0: span=0-1 level=MC
[ 1.520184] groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 }
[ 1.520429] domain-1: span=0-3 level=NUMA
[ 1.520487] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.520687] domain-2: span=0-5 level=NUMA
[ 1.520744] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
[ 1.520948] domain-3: span=0-7 level=NUMA
[ 1.521038] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
[ 1.522068] CPU2 attaching sched-domain(s):
[ 1.522348] domain-0: span=2-3 level=MC
[ 1.522606] groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 }
[ 1.522832] domain-1: span=0-3 level=NUMA
[ 1.522885] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[ 1.523043] domain-2: span=0-5 level=NUMA
[ 1.523092] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
[ 1.523302] domain-3: span=0-7 level=NUMA
[ 1.523352] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
[ 1.523748] CPU3 attaching sched-domain(s):
[ 1.523774] domain-0: span=2-3 level=MC
[ 1.523825] groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 }
[ 1.524009] domain-1: span=0-3 level=NUMA
[ 1.524086] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[ 1.524281] domain-2: span=0-5 level=NUMA
[ 1.524331] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
[ 1.524534] domain-3: span=0-7 level=NUMA
[ 1.524586] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
[ 1.524847] CPU4 attaching sched-domain(s):
[ 1.524873] domain-0: span=4-5 level=MC
[ 1.524954] groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 }
[ 1.525105] domain-1: span=4-7 level=NUMA
[ 1.525153] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[ 1.525368] domain-2: span=0-1,4-7 level=NUMA
[ 1.525428] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[ 1.532726] domain-3: span=0-7 level=NUMA
[ 1.532811] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 }
[ 1.534125] CPU5 attaching sched-domain(s):
[ 1.534159] domain-0: span=4-5 level=MC
[ 1.534303] groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 }
[ 1.534490] domain-1: span=4-7 level=NUMA
[ 1.534572] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[ 1.534734] domain-2: span=0-1,4-7 level=NUMA
[ 1.534783] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[ 1.536057] domain-3: span=0-7 level=NUMA
[ 1.536430] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 }
[ 1.536815] CPU6 attaching sched-domain(s):
[ 1.536846] domain-0: span=6-7 level=MC
[ 1.536934] groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 }
[ 1.537144] domain-1: span=4-7 level=NUMA
[ 1.537262] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[ 1.537553] domain-2: span=0-1,4-7 level=NUMA
[ 1.537613] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
[ 1.537872] domain-3: span=0-7 level=NUMA
[ 1.537998] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
[ 1.538448] CPU7 attaching sched-domain(s):
[ 1.538505] domain-0: span=6-7 level=MC
[ 1.538586] groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 }
[ 1.538746] domain-1: span=4-7 level=NUMA
[ 1.538798] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[ 1.539048] domain-2: span=0-1,4-7 level=NUMA
[ 1.539111] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
[ 1.539571] domain-3: span=0-7 level=NUMA
[ 1.539610] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }

Reported-by: Valentin Schneider <[email protected]>
Tested-by: Meelis Roos <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
-v2:
merged Valentin's edits on v1, fixing two issues:
* "one CPU per NUMA node" was not supported
* the capacity of groups from the grandchild couldn't be updated
Thanks to Valentin for the nice edits!

Also added some comments in build_balance_mask()

kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
1 file changed, 53 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..964ed89001fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -723,35 +723,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
for (tmp = sd; tmp; tmp = tmp->parent)
numa_distance += !!(tmp->flags & SD_NUMA);

- /*
- * FIXME: Diameter >=3 is misrepresented.
- *
- * Smallest diameter=3 topology is:
- *
- * node 0 1 2 3
- * 0: 10 20 30 40
- * 1: 20 10 20 30
- * 2: 30 20 10 20
- * 3: 40 30 20 10
- *
- * 0 --- 1 --- 2 --- 3
- *
- * NUMA-3 0-3 N/A N/A 0-3
- * groups: {0-2},{1-3} {1-3},{0-2}
- *
- * NUMA-2 0-2 0-3 0-3 1-3
- * groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} {2-3},{0-2}
- *
- * NUMA-1 0-1 0-2 1-3 2-3
- * groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2}
- *
- * NUMA-0 0 1 2 3
- *
- * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
- * group span isn't a subset of the domain span.
- */
- WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
-
sched_domain_debug(sd, cpu);

rq_attach_root(rq, rd);
@@ -908,6 +879,17 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
for_each_cpu(i, sg_span) {
sibling = *per_cpu_ptr(sdd->sd, i);

+ /*
+ * For NUMA diameter >= 3, while the sibling's child spans
+ * outside the domain being built, step down to the grandchild;
+ * see build_overlap_sched_groups() for details on how this
+ * can happen.
+ */
+ while (sibling->child &&
+ !cpumask_subset(sched_domain_span(sibling->child),
+ sched_domain_span(sd)))
+ sibling = sibling->child;
+
/*
* Can happen in the asymmetric case, where these siblings are
* unused. The mask will not be empty because those CPUs that
@@ -955,10 +937,11 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
}

static void init_overlap_sched_group(struct sched_domain *sd,
- struct sched_group *sg)
+ struct sched_group *sg,
+ struct sched_domain *grandchild)
{
struct cpumask *mask = sched_domains_tmpmask2;
- struct sd_data *sdd = sd->private;
+ struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
struct cpumask *sg_span;
int cpu;

@@ -996,6 +979,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)

for_each_cpu_wrap(i, span, cpu) {
struct cpumask *sg_span;
+ bool from_grandchild = false;

if (cpumask_test_cpu(i, covered))
continue;
@@ -1015,6 +999,43 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;

+ /*
+ * For NUMA diameter >= 3, build the sched_group from the
+ * sibling's grandchild domain to prevent the sched_group
+ * from spanning outside the sched_domain.
+ * If we don't do this, diameter >= 3 is misrepresented:
+ *
+ * Smallest diameter=3 topology is:
+ *
+ * node 0 1 2 3
+ * 0: 10 20 30 40
+ * 1: 20 10 20 30
+ * 2: 30 20 10 20
+ * 3: 40 30 20 10
+ *
+ * 0 --- 1 --- 2 --- 3
+ *
+ * NUMA-3 0-3 N/A N/A 0-3
+ * groups: {0-2},{1-3} {1-3},{0-2}
+ *
+ * NUMA-2 0-2 0-3 0-3 1-3
+ * groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} {2-3},{0-2}
+ *
+ * NUMA-1 0-1 0-2 1-3 2-3
+ * groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2}
+ *
+ * NUMA-0 0 1 2 3
+ *
+ * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
+ * group span isn't a subset of the domain span.
+ */
+ while (sibling->child &&
+ !cpumask_subset(sched_domain_span(sibling->child),
+ span)) {
+ sibling = sibling->child;
+ from_grandchild = true;
+ }
+
sg = build_group_from_child_sched_domain(sibling, cpu);
if (!sg)
goto fail;
@@ -1022,7 +1043,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);

- init_overlap_sched_group(sd, sg);
+ init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);

if (!first)
first = sg;
--
2.25.1


2021-02-03 12:02:24

by Meelis Roos

Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

03.02.21 13:12 Barry Song wrote:
> kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
> 1 file changed, 53 insertions(+), 32 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5d3675c7a76b..964ed89001fe 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c

This one still works on the Sun X4600-M2, on top of v5.11-rc6-55-g3aaf0a27ffc2.


Performance-wise - is there some simple benchmark to run to measure the impact? Compared to what - 5.10.0 or the kernel with the warning?

Drop caches and time a Linux kernel build with make -j64?

--
Meelis Roos

Subject: RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2



> -----Original Message-----
> From: Meelis Roos [mailto:[email protected]]
> Sent: Thursday, February 4, 2021 12:58 AM
> To: Song Bao Hua (Barry Song) <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Cc: [email protected]; xuwei (O) <[email protected]>; Liguozhu (Kenneth)
> <[email protected]>; tiantao (H) <[email protected]>; wanghuiqiang
> <[email protected]>; Zengtao (B) <[email protected]>; Jonathan
> Cameron <[email protected]>; [email protected]
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> 03.02.21 13:12 Barry Song wrote:
> > kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
> > 1 file changed, 53 insertions(+), 32 deletions(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5d3675c7a76b..964ed89001fe 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
>
> This one still works on the Sun X4600-M2, on top of v5.11-rc6-55-g3aaf0a27ffc2.
>
>
> Performance-wise - is there some simple benchmark to run to measure the impact?
> Compared to what - 5.10.0 or the kernel with the warning?

Hi Meelis,
Thanks for retesting.

Comparing to the kernel with the warning is enough. As I mentioned here:
https://lore.kernel.org/lkml/[email protected]/

I have seen two major issues caused by the broken sched_group:

* in load_balance() and find_busiest_group()
the kernel calculates the avg_load and group_type by:

sum(load of cpus within sched_domain)
------------------------------------
capacity of the whole sched_group

Since the sched_group isn't a subset of the sched_domain, the load of
the problematic group is severely underestimated.

sched_domain

+----------------------------------+
| |
| +-------------------------------------------+
| | +-------+ +------+ | |
| | | cpu0 | | cpu1 | | |
| | +-------+ +------+ | |
+----------------------------------+ |
| |
| +-------+ +-------+ |
| |cpu2 | |cpu3 | |
| +-------+ +-------+ |
| |
+-------------------------------------------+
problematic sched_group


For the above example, the kernel divides "the sum of the load of
cpu0 and cpu1" by "the capacity of the whole group, including
cpu0, 1, 2 and 3".
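
The underestimation is easy to put in numbers. A hedged sketch with made-up
loads and capacities, following the same sum-load-over-group-capacity shape
described above:

```c
#include <assert.h>

/* avg_load in the same shape as described above: the sum of the load,
 * scaled by 1024 (SCHED_CAPACITY_SCALE), over the group capacity. */
unsigned long toy_avg_load(unsigned long sum_load, unsigned long group_cap)
{
    return sum_load * 1024 / group_cap;
}
```

With cpu0 and cpu1 each carrying load 1000 and every CPU having capacity
1000, the broken group's capacity counts all four CPUs: avg_load comes out
as 512 instead of 1024, i.e. half the real value.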

* in select_task_rq_fair() and find_idlest_group()
The kernel could push a forked/exec-ed task outside the
sched_domain, but still inside the sched_group. For the above
diagram, while the kernel wants to find the idlest CPU in the
sched_domain, it can end up picking cpu2 or cpu3.

I guess these two issues can potentially affect many benchmarks.
Our team has seen a 5% unixbench score increase with the fix on
some machines, though the real impact might be case-by-case.

>
> drop caches and time the build time of linux kernel with make -j64?
>
> --
> Meelis Roos

Thanks
Barry

2021-02-09 12:58:07

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
> As long as NUMA diameter > 2, building sched_domain by sibling's child
> domain will definitely create a sched_domain with sched_group which will
> span out of the sched_domain:
>
> +------+ +------+ +-------+ +------+
> | node | 12 |node | 20 | node | 12 |node |
> | 0 +---------+1 +--------+ 2 +-------+3 |
> +------+ +------+ +-------+ +------+
>
> domain0 node0 node1 node2 node3
>
> domain1 node0+1 node0+1 node2+3 node2+3
> +
> domain2 node0+1+2 |
> group: node0+1 |
> group:node2+3 <-------------------+
>
> when node2 is added into the domain2 of node0, kernel is using the child
> domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> the span of the domain including node0+1+2.
>
> This will make load_balance() run based on screwed avg_load and group_type
> in the sched_group spanning out of the sched_domain, and it also makes
> select_task_rq_fair() pick an idle CPU out of the sched_domain.
>
> Real servers which suffer from this problem include Kunpeng920 and 8-node
> Sun Fire X4600-M2, at least.
>
> Here we move to use the *child* domain of the *child* domain of node2's
> domain2 as the new added sched_group. At the same time, we re-use the
> lower level sgc directly.
>
> +------+ +------+ +-------+ +------+
> | node | 12 |node | 20 | node | 12 |node |
> | 0 +---------+1 +--------+ 2 +-------+3 |
> +------+ +------+ +-------+ +------+
>
> domain0 node0 node1 +- node2 node3
> |
> domain1 node0+1 node0+1 | node2+3 node2+3
> |
> domain2 node0+1+2 |
> group: node0+1 |
> group:node2 <-------------------+
>

I've finally had a moment to think about this, would it make sense to
also break up group: node0+1, such that we then end up with 3 groups of
equal size?

> w/ patch, we don't get "groups don't span domain->span" any more:
> [ 1.486271] CPU0 attaching sched-domain(s):
> [ 1.486820] domain-0: span=0-1 level=MC
> [ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
> [ 1.515717] domain-1: span=0-3 level=NUMA
> [ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
> [ 1.516989] domain-2: span=0-5 level=NUMA
> [ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }

groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3, cap=1989 }, 4:{ span=4-5, cap=1949 }

> [ 1.517369] domain-3: span=0-7 level=NUMA
> [ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }

Let me continue to think about this... it's been a while :/

Subject: RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2



> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Wednesday, February 10, 2021 1:56 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; xuwei (O)
> <[email protected]>; Liguozhu (Kenneth) <[email protected]>; tiantao (H)
> <[email protected]>; wanghuiqiang <[email protected]>; Zengtao (B)
> <[email protected]>; Jonathan Cameron <[email protected]>;
> [email protected]; Meelis Roos <[email protected]>
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
> > As long as NUMA diameter > 2, building sched_domain by sibling's child
> > domain will definitely create a sched_domain with sched_group which will
> > span out of the sched_domain:
> >
> > +------+ +------+ +-------+ +------+
> > | node | 12 |node | 20 | node | 12 |node |
> > | 0 +---------+1 +--------+ 2 +-------+3 |
> > +------+ +------+ +-------+ +------+
> >
> > domain0 node0 node1 node2 node3
> >
> > domain1 node0+1 node0+1 node2+3 node2+3
> > +
> > domain2 node0+1+2 |
> > group: node0+1 |
> > group:node2+3 <-------------------+
> >
> > when node2 is added into the domain2 of node0, kernel is using the child
> > domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> > the span of the domain including node0+1+2.
> >
> > This will make load_balance() run based on screwed avg_load and group_type
> > in the sched_group spanning out of the sched_domain, and it also makes
> > select_task_rq_fair() pick an idle CPU out of the sched_domain.
> >
> > Real servers which suffer from this problem include Kunpeng920 and 8-node
> > Sun Fire X4600-M2, at least.
> >
> > Here we move to use the *child* domain of the *child* domain of node2's
> > domain2 as the new added sched_group. At the same time, we re-use the
> > lower level sgc directly.
> >
> > +------+ +------+ +-------+ +------+
> > | node | 12 |node | 20 | node | 12 |node |
> > | 0 +---------+1 +--------+ 2 +-------+3 |
> > +------+ +------+ +-------+ +------+
> >
> > domain0 node0 node1 +- node2 node3
> > |
> > domain1 node0+1 node0+1 | node2+3 node2+3
> > |
> > domain2 node0+1+2 |
> > group: node0+1 |
> > group:node2 <-------------------+
> >
>
> I've finally had a moment to think about this, would it make sense to
> also break up group: node0+1, such that we then end up with 3 groups of
> equal size?

We used to create the sched_groups of sched_domain[n] of node[m] from:
1. local group: sched_domain[n-1] of node[m]
2. remote groups: sched_domain[n-1] of node[m]'s siblings
at the same level.
Since the sched_domain[n-1] of a subset of node[m]'s siblings is able
to cover the whole span of sched_domain[n] of node[m], there is no
need to scan over all siblings of node[m]; once sched_domain[n]
of node[m] has been covered, we can stop making more sched_groups. So
the number of sched_groups is small.
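
A toy version of that covering walk (hypothetical names; node bitmasks
stand in for cpumasks) shows why only a few groups get built:

```c
#include <assert.h>

/* child_span[i]: span of sched_domain[n-1] of node i, as a node bitmask.
 * Walk the nodes of the domain and add a group only for nodes not yet
 * covered; the walk effectively stops once everything is covered. */
int count_groups(const unsigned int child_span[], int nr_nodes,
                 unsigned int domain_span)
{
    unsigned int covered = 0;
    int groups = 0;

    for (int i = 0; i < nr_nodes; i++) {
        if (!(domain_span & (1u << i)))   /* node outside this domain */
            continue;
        if (covered & (1u << i))          /* already in some group */
            continue;
        covered |= child_span[i];
        groups++;
    }
    return groups;
}
```

For the 4-node line topology, with domain1 spans {0,1}, {0,1}, {2,3},
{2,3} and node0's domain2 spanning {0,1,2}, only two groups are built;
note the second group's span {2,3} already spills outside {0,1,2}.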

So historically, the code has never tried to make sched_groups result
in equal size. And it permits the overlapping of local group and remote
groups.

One issue we are facing in the original code is that once the topology
gets to 3-hop NUMA, the sched_domain[n-1] of node[m]'s siblings might
span outside the range of sched_domain[n] of node[m]. Here my approach
tries to find a descendant of the sibling to build remote groups and
fix this issue for those machines with this problem, so it keeps
machines without the 3-hop issue untouched.

Valentin sent another RFC which breaks up all remote groups to include
the remote node only, instead of using the sched_domain[n-1] of the
siblings; this eliminates the problem from the very beginning. One side
effect is that it changes all machines, including those w/o the 3-hop
issue, by creating many more remote sched_groups. So we both agree we
can start from the descendant-sibling (grandchild) approach first.

What you are advising seems to be breaking up the local sched_group,
which will create many more local groups. That sounds like a huge change,
even beyond the scope of the original issue we are trying to fix :-)

>
> > w/ patch, we don't get "groups don't span domain->span" any more:
> > [ 1.486271] CPU0 attaching sched-domain(s):
> > [ 1.486820] domain-0: span=0-1 level=MC
> > [ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
> > [ 1.515717] domain-1: span=0-3 level=NUMA
> > [ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
> > [ 1.516989] domain-2: span=0-5 level=NUMA
> > [ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
>
> groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3, cap=1989 },
> 4:{ span=4-5, cap=1949 }
>
> > [ 1.517369] domain-3: span=0-7 level=NUMA
> > [ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7
> mask=6-7 cap=4054 }
>
> Let me continue to think about this... it's been a while :/

Sure, thanks!

Barry

2021-02-10 11:31:03

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:

> > I've finally had a moment to think about this, would it make sense to
> > also break up group: node0+1, such that we then end up with 3 groups of
> > equal size?
>

> Since the sched_domain[n-1] of a part of node[m]'s siblings are able
> to cover the whole span of sched_domain[n] of node[m], there is no
> necessity to scan over all siblings of node[m], once sched_domain[n]
> of node[m] has been covered, we can stop making more sched_groups. So
> the number of sched_groups is small.
>
> So historically, the code has never tried to make sched_groups result
> in equal size. And it permits the overlapping of local group and remote
> groups.

Historically groups have (typically) always been the same size though.

The reason I ask is because when you get one large group and a bunch of
smaller groups, the load-balancing 'pull' of the large group is
relatively smaller.

That is, IIRC should_we_balance() ensures only 1 CPU out of the group
continues the load-balancing pass. So if, for example, we have one group
of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
1/2 times, while the group of 4 CPUs will pull 1/4 times.

By making sure all groups are of the same level, and thus of equal size,
this doesn't happen.
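
The pull-rate argument above is simple arithmetic. A hedged sketch
(hypothetical function name): if roughly one CPU per group runs each
balance pass, a group's share of pulls is the inverse of its size.

```c
#include <assert.h>

/* Out of 'ticks' balance attempts spread evenly over a group's CPUs,
 * how many does any single CPU (and hence the group) get to run? */
int toy_pulls(int ticks, int group_size)
{
    return ticks / group_size;
}
```

Over 4 ticks, a 2-CPU group pulls twice while a 4-CPU group pulls only
once — the asymmetry described in the message above.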

Subject: RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2



> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Thursday, February 11, 2021 12:22 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; xuwei (O)
> <[email protected]>; Liguozhu (Kenneth) <[email protected]>; tiantao (H)
> <[email protected]>; wanghuiqiang <[email protected]>; Zengtao (B)
> <[email protected]>; Jonathan Cameron <[email protected]>;
> [email protected]; Meelis Roos <[email protected]>
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
>
> > > I've finally had a moment to think about this, would it make sense to
> > > also break up group: node0+1, such that we then end up with 3 groups of
> > > equal size?
> >
>
> > Since the sched_domain[n-1] of a part of node[m]'s siblings are able
> > to cover the whole span of sched_domain[n] of node[m], there is no
> > necessity to scan over all siblings of node[m], once sched_domain[n]
> > of node[m] has been covered, we can stop making more sched_groups. So
> > the number of sched_groups is small.
> >
> > So historically, the code has never tried to make sched_groups result
> > in equal size. And it permits the overlapping of local group and remote
> > groups.
>
> Historically groups have (typically) always been the same size though.

This is probably true for other platforms. But unfortunately it has never
been true on my platform :-)

node 0 1 2 3
0: 10 12 20 22
1: 12 10 22 24
2: 20 22 10 12
3: 22 24 12 10

Assume we have only two CPUs per NUMA node.

CPU0's domain-3 has no sched_group spanning outside the domain, but its
first group covers 0-5 (node0-node2) while the second group covers 4-7
(node2-node3):

[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193] domain-0: span=0-1 level=MC
[ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693] domain-1: span=0-3 level=NUMA
[ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811] domain-2: span=0-5 level=NUMA
[ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058] domain-3: span=0-7 level=NUMA
[ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }


>
> The reason I did ask is because when you get one large and a bunch of
> smaller groups, the load-balancing 'pull' is relatively smaller to the
> large groups.
>
> That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> continues the load-balancing pass. So if, for example, we have one group
> of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> 1/2 times, while the group of 4 CPUs will pull 1/4 times.
>
> By making sure all groups are of the same level, and thus of equal size,
> this doesn't happen.

As you can see, even if we give all groups of domain-2 equal size by
breaking up both the local group and the remote groups, we will hit
the same problem in domain-3. And what is trickier is that domain-3
has no "groups don't span domain->span" problem.

So it seems we would need to change both domain-2 and domain-3, even
though domain-3 has no such issue.

Thanks
Barry

2021-02-11 20:04:14

by Valentin Schneider

Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

On 10/02/21 12:21, Peter Zijlstra wrote:
> On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
>> So historically, the code has never tried to make sched_groups result
>> in equal size. And it permits the overlapping of local group and remote
>> groups.
>
> Historically groups have (typically) always been the same size though.
>
> The reason I did ask is because when you get one large and a bunch of
> smaller groups, the load-balancing 'pull' is relatively smaller to the
> large groups.
>
> That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> continues the load-balancing pass. So if, for example, we have one group
> of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> 1/2 times, while the group of 4 CPUs will pull 1/4 times.
>
> By making sure all groups are of the same level, and thus of equal size,
> this doesn't happen.

So I hacked something that tries to do this, with the notable exception
that it doesn't change the way the local group is generated. Breaking the
assumption that the local group always spans the child domain doesn't sound
like the best thing to do.

Anywho, the below makes it so all !local NUMA groups are built using the
same sched_domain_topology_level. Some of it is absolutely disgusting
(esp. the sched_domains_curr_level stuff), I didn't bother with handling
domain degeneration yet, and I trigger the WARN_ON in get_group()... But at
least the topology gets built!

AFAICT Barry's topology is handled the same. On that sunfire topology, it
pretty much turns all remote groups into groups spanning a single
node. That does almost double the number of groups for some domains,
compared to Barry's current v3.

If that is really a route we want to go down, I'll try to clean the below.

(deposit your drinks before going any further)
--->8---
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..e74f48bdd226 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -187,7 +187,10 @@ struct sched_domain_topology_level {
sched_domain_mask_f mask;
sched_domain_flags_f sd_flags;
int flags;
+#ifdef CONFIG_NUMA
int numa_level;
+ int numa_sibling_level;
+#endif
struct sd_data data;
#ifdef CONFIG_SCHED_DEBUG
char *name;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3c50cc7285c9..5a9e6a4d5d89 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -742,6 +742,34 @@ enum s_alloc {
sa_none,
};

+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+ { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+ { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+ { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { NULL, },
+};
+
+static struct sched_domain_topology_level *sched_domain_topology =
+ default_topology;
+
+#define for_each_sd_topology(tl) \
+ for (tl = sched_domain_topology; tl->mask; tl++)
+
+void set_sched_topology(struct sched_domain_topology_level *tl)
+{
+ if (WARN_ON_ONCE(sched_smp_initialized))
+ return;
+
+ sched_domain_topology = tl;
+}
+
/*
* Return the canonical balance CPU for this group, this is the first CPU
* of this group that's also in the balance mask.
@@ -955,10 +983,12 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct sched_group *first = NULL, *last = NULL, *sg;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered = sched_domains_tmpmask;
+ struct sched_domain_topology_level *tl;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;

+ tl = container_of(sd->private, struct sched_domain_topology_level, data);
cpumask_clear(covered);

for_each_cpu_wrap(i, span, cpu) {
@@ -982,6 +1012,10 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;

+ /* Don't mess up local group being child sched domain */
+ if (tl->numa_sibling_level != sd->level && i != cpu)
+ sibling = *per_cpu_ptr(sched_domain_topology[tl->numa_sibling_level].data.sd, i);
+
sg = build_group_from_child_sched_domain(sibling, cpu);
if (!sg)
goto fail;
@@ -989,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);

- init_overlap_sched_group(sd, sg);
+ init_overlap_sched_group(sibling, sg);

if (!first)
first = sg;
@@ -1440,34 +1474,6 @@ sd_init(struct sched_domain_topology_level *tl,
return sd;
}

-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-#ifdef CONFIG_SCHED_MC
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
-#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology =
- default_topology;
-
-#define for_each_sd_topology(tl) \
- for (tl = sched_domain_topology; tl->mask; tl++)
-
-void set_sched_topology(struct sched_domain_topology_level *tl)
-{
- if (WARN_ON_ONCE(sched_smp_initialized))
- return;
-
- sched_domain_topology = tl;
-}
-
#ifdef CONFIG_NUMA

static const struct cpumask *sd_numa_mask(int cpu)
@@ -1566,6 +1572,52 @@ static void init_numa_topology_type(void)

#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)

+void sched_init_numa_siblings(void)
+{
+ struct sched_domain_topology_level *tl;
+ int tl_id = 0, sibling_tl_id = 0;
+ const struct cpumask *mask;
+ int last_written_tl = 0;
+ int n, i, j;
+
+ for_each_sd_topology(tl) {
+ if (!tl->numa_level)
+ goto next;
+
+ for_each_node(n) {
+ /* XXX: this *thing* needs to die */
+ sched_domains_curr_level = tl->numa_level;
+ i = cpumask_first(cpumask_of_node(n));
+ mask = tl->mask(i);
+
+ for_each_cpu_wrap(j, mask, i) {
+ sibling_tl_id = tl_id;
+ sched_domains_curr_level = tl->numa_level;
+
+ /* Not subset? Down we go! */
+ /* XXX bother about degeneration here? */
+ do {
+ sibling_tl_id--;
+ sched_domains_curr_level--;
+ } while (sibling_tl_id > 0 &&
+ !cpumask_subset(sched_domain_topology[sibling_tl_id].mask(j), mask));
+
+ /* Only write if first write or it lowers level */
+ if (last_written_tl < tl_id ||
+ tl->numa_sibling_level > sibling_tl_id) {
+ tl->numa_sibling_level = sibling_tl_id;
+ last_written_tl = tl_id;
+ }
+ }
+ }
+next:
+ if (last_written_tl < tl_id)
+ tl->numa_sibling_level = tl_id;
+
+ tl_id++;
+ }
+}
+
void sched_init_numa(void)
{
struct sched_domain_topology_level *tl;
@@ -1708,6 +1760,8 @@ void sched_init_numa(void)
sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];

init_numa_topology_type();
+
+ sched_init_numa_siblings();
}

void sched_domains_numa_masks_set(unsigned int cpu)

Subject: RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2



> -----Original Message-----
> From: Valentin Schneider [mailto:[email protected]]
> Sent: Friday, February 12, 2021 8:55 AM
> To: Peter Zijlstra <[email protected]>; Song Bao Hua (Barry Song)
> <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; xuwei (O)
> <[email protected]>; Liguozhu (Kenneth) <[email protected]>; tiantao (H)
> <[email protected]>; wanghuiqiang <[email protected]>; Zengtao (B)
> <[email protected]>; Jonathan Cameron <[email protected]>;
> [email protected]; Meelis Roos <[email protected]>
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
>
> On 10/02/21 12:21, Peter Zijlstra wrote:
> > On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
> >> So historically, the code has never tried to make sched_groups result
> >> in equal size. And it permits the overlapping of local group and remote
> >> groups.
> >
> > Historically groups have (typically) always been the same size though.
> >
> > The reason I did ask is because when you get one large and a bunch of
> > smaller groups, the load-balancing 'pull' is relatively smaller to the
> > large groups.
> >
> > That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> > continues the load-balancing pass. So if, for example, we have one group
> > of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> > 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> >
> > By making sure all groups are of the same level, and thus of equal size,
> > this doesn't happen.
>
> So I hacked something that tries to do this, with the notable exception
> that it doesn't change the way the local group is generated. Breaking the
> assumption that the local group always spans the child domain doesn't sound
> like the best thing to do.
>
> Anywho, the below makes it so all !local NUMA groups are built using the
> same sched_domain_topology_level. Some of it is absolutely disgusting
> (esp. the sched_domains_curr_level stuff), I didn't bother with handling
> domain degeneration yet, and I trigger the WARN_ON in get_group()... But at
> least the topology gets built!
>
> AFAICT Barry's topology is handled the same. On that sunfire topology, it
> pretty much turns all remote groups into groups spanning a single
> node. That does almost double the number of groups for some domains,
> compared to Barry's current v3.
>
> If that is really a route we want to go down, I'll try to clean the below.
>
Hi Valentin,

I understand Peter's concern is that the local group has a different
size from the remote groups. Does this patch resolve Peter's concern?
To me, it seems not :-)

Though I don't understand why different group sizes would be harmful,
since every group calculates avg_load and group_type based on its own
capacity; a smaller group simply has a smaller capacity.

Is it because a bigger group has relatively less chance to pull, so
load balancing completes more slowly when small groups are under
high load?


Thanks
Barry

2021-02-18 14:40:47

by Valentin Schneider

Subject: RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2


Hi Barry,

On 18/02/21 09:17, Song Bao Hua (Barry Song) wrote:
> Hi Valentin,
>
> I understand Peter's concern is that the local group has different
> size with remote groups. Is this patch resolving Peter's concern?
> To me, it seems not :-)
>

If you remove the '&& i != cpu' in build_overlap_sched_groups() you get
that, but then you also get some extra warnings :-)

Now yes, should_we_balance() only matters for the local group. However I'm
somewhat wary of messing with the local groups; for one it means you would
have more than one tl now accessing the same sgc->next_update, sgc->{min,
max}capacity, sgc->group_imbalance (as Vincent had pointed out).

By ensuring only remote (i.e. !local) groups are modified (which is what
your patch does), we absolve ourselves of this issue, which is why I prefer
this approach ATM.

> Though I don’t understand why different group sizes will be harmful
> since all groups are calculating avg_load and group_type based on
> their own capacities. Thus, for a smaller group, its capacity would
> be smaller.
>
> Is it because a bigger group has relatively less chance to pull, so
> load balancing will be completed more slowly while small groups have
> high load?
>

Peter's point is that, if at a given tl you have groups that look like

g0: 0-4, g1: 5-6, g2: 7-8

Then g0 is half as likely to pull tasks with load_balance() than g1 or g2
(due to the group size vs should_we_balance())


However, I suppose one "trick" to be aware of here is that since your patch
*doesn't* change the local group, we do have e.g. on CPU0:

[ 0.374840] domain-2: span=0-5 level=NUMA
[ 0.375054] groups: 0:{ span=0-3 cap=4003 }, 4:{ span=4-5 cap=1988 }

*but* on CPU4 we get:

[ 0.387019] domain-2: span=0-1,4-7 level=NUMA
[ 0.387211] groups: 4:{ span=4-7 cap=3984 }, 0:{ span=0-1 cap=2013 }

IOW, at a given tl, all *local* groups have /roughly/ the same size and thus
similar pull probability (it took me writing this mail to see it that
way). So perhaps this is all fine already?

Subject: RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2



> -----Original Message-----
> From: Valentin Schneider [mailto:[email protected]]
> Sent: Friday, February 19, 2021 1:41 AM
> To: Song Bao Hua (Barry Song) <[email protected]>; Peter Zijlstra
> <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; xuwei (O)
> <[email protected]>; Liguozhu (Kenneth) <[email protected]>; tiantao (H)
> <[email protected]>; wanghuiqiang <[email protected]>; Zengtao (B)
> <[email protected]>; Jonathan Cameron <[email protected]>;
> [email protected]; Meelis Roos <[email protected]>
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
>
>
> Hi Barry,
>
> On 18/02/21 09:17, Song Bao Hua (Barry Song) wrote:
> > Hi Valentin,
> >
> > I understand Peter's concern is that the local group has different
> > size with remote groups. Is this patch resolving Peter's concern?
> > To me, it seems not :-)
> >
>
> If you remove the '&& i != cpu' in build_overlap_sched_groups() you get that,
> but then you also get some extra warnings :-)
>
> Now yes, should_we_balance() only matters for the local group. However I'm
> somewhat wary of messing with the local groups; for one it means you would have
> more than one tl now accessing the same sgc->next_update, sgc->{min,
> max}capacity, sgc->group_imbalance (as Vincent had pointed out).
>
> By ensuring only remote (i.e. !local) groups are modified (which is what your
> patch does), we absolve ourselves of this issue, which is why I prefer this
> approach ATM.

Yep. The grandchild approach still seems to be the feasible way for the moment.

>
> > Though I don’t understand why different group sizes will be harmful
> > since all groups are calculating avg_load and group_type based on
> > their own capacities. Thus, for a smaller group, its capacity would be
> > smaller.
> >
> > Is it because a bigger group has relatively less chance to pull, so
> > load balancing will be completed more slowly while small groups have
> > high load?
> >
>
> Peter's point is that, if at a given tl you have groups that look like
>
> g0: 0-4, g1: 5-6, g2: 7-8
>
> Then g0 is half as likely to pull tasks with load_balance() than g1 or g2 (due
> to the group size vs should_we_balance())

Yep. The difference is that g1 and g2 won't be the local group of any
CPU at this tl. The smaller groups g1 and g2 are only ever remote
groups, so should_we_balance() doesn't matter for them here.

>
>
> However, I suppose one "trick" to be aware of here is that since your patch
> *doesn't* change the local group, we do have e.g. on CPU0:
>
> [ 0.374840] domain-2: span=0-5 level=NUMA
> [ 0.375054] groups: 0:{ span=0-3 cap=4003 }, 4:{ span=4-5 cap=1988 }
>
> *but* on CPU4 we get:
>
> [ 0.387019] domain-2: span=0-1,4-7 level=NUMA
> [ 0.387211] groups: 4:{ span=4-7 cap=3984 }, 0:{ span=0-1 cap=2013 }
>
> IOW, at a given tl, all *local* groups have /roughly/ the same size and thus
> similar pull probability (it took me writing this mail to see it that way).
> So perhaps this is all fine already?

Yep. Since should_we_balance() only matters for local groups, and the
grandchild approach doesn't change the size of local groups, all local
groups still get a similar pull probability at this topology level.

Since we still prefer the grandchild approach ATM, if Peter has no more
concerns about the unequal size between local and remote groups, I would
be glad to send a v4 of the grandchild approach, rewriting the changelog
to explain the update issue of sgc->next_update, sgc->{min,max}capacity
and sgc->group_imbalance that Vincent pointed out, and to note that the
local groups are untouched and thus keep their equal size.

Thanks
Barry