Platform
========
I've been investigating an assertion failure on a D06 board (Kunpeng 920
based), which AFAIA has 2 sockets, each with 2 dies; each die has 24 CPUs
and its own memory controller (4 NUMA nodes in total).
The default distance table I get is:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
Which can be summarised as:
      2      10       2
  0 <---> 1 <---> 2 <---> 3
Error
=====
Using sched_debug=1, I get an assertion error for every single CPU. This is
for CPU0:
[344276.794534] CPU0 attaching sched-domain(s):
[344276.794536] domain-0: span=0-23 level=MC
[344276.794539] groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
[344276.794554] domain-1: span=0-47 level=NUMA
[344276.794555] groups: 0:{ span=0-23 cap=24576 }, 24:{ span=24-47 cap=24576 }
[344276.794558] domain-2: span=0-71 level=NUMA
[344276.794560] groups: 0:{ span=0-47 cap=49152 }, 48:{ span=48-95 cap=49152 }
[344276.794563] ERROR: groups don't span domain->span
[344276.799346] domain-3: span=0-95 level=NUMA
[344276.799353] groups: 0:{ span=0-71 mask=0-23 cap=73728 }, 72:{ span=48-95 mask=72-95 cap=49152 }
Root cause
==========
The NUMA distance setup is correct for node 0 (despite the deduplicating
sort not liking this distance table - that's for another day), and so are
the sched_domain spans (CPU0 PoV):
            node0       node1      node2      node3
           /^^^^^^\    /^^^^^^^\  /^^^^^^^\  /^^^^^^^\
           0 ... 23    24 ... 47  48 ... 71  72 ... 95
MC         [        ]
NUMA(<=12) [                    ]
NUMA(<=20) [                               ]
NUMA(<=22) [                                          ]
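As a rough sketch of how those spans fall out of the distance table (this
follows the spirit of sched_init_numa(), but 'numa_distances' and
'numa_masks' below are simplified stand-ins, not the actual kernel
structures):

  /* Unique distances from the table above, sorted: one NUMA level each */
  static const int numa_distances[] = { 10, 12, 20, 22 };

  static void build_numa_masks(void)
  {
          int level, j, k;

          for (level = 0; level < ARRAY_SIZE(numa_distances); level++) {
                  for_each_node(j) {
                          for_each_node(k) {
                                  /*
                                   * Node k contributes its CPUs to node j's
                                   * mask if it is close enough for this level.
                                   */
                                  if (node_distance(j, k) <= numa_distances[level])
                                          cpumask_or(numa_masks[level][j],
                                                     numa_masks[level][j],
                                                     cpumask_of_node(k));
                          }
                  }
          }
  }

For node 0 this gives 0-23 (<=10), 0-47 (<=12), 0-71 (<=20) and 0-95
(<=22), which are exactly the spans drawn above.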
The problem lies in the *groups* of that middle NUMA domain (the NUMA(<=20)
one, i.e. domain-2): they are {(0-47), (48-95)}, IOW they include node3
when they really shouldn't. Let me try to explain why:
[]: domain span
(): group span
CPU0 perspective:
,----
|              node0        node1      node2      node3
|            /^^^^^^^^^^\  /^^^^^^^\  /^^^^^^^\  /^^^^^^^\
|            0 ....... 23  24 ... 47  48 ... 71  72 ... 95
| MC         [(0)...(23)]
| NUMA(<=12) [ (0 - 23)    (24 - 47) ]
`----
CPU48 perspective:
,----
|              node0      node1       node2         node3
|            /^^^^^^\    /^^^^^^^\   /^^^^^^^^^^^\  /^^^^^^^\
|            0 ... 23    24 ... 47   48 ....... 71  72 ... 95
| MC                                 [(48)...(71)]
| NUMA(<=12)                         [ (48 - 71)     (72 - 95) ]
`----
When we go through build_overlap_sched_groups() for CPU0's NUMA(<=20)
domain, we first use CPU0's NUMA(<=20) child (i.e. NUMA(<=12)) as the span
of the first group, which gives us (0-47). The next CPU we iterate over
that isn't already covered is CPU48; we do the same thing and use its
NUMA(<=20) child span as the group span, which here is (48-95) - despite
CPUs 72-95 *not* being within distance 20 of CPU0 (they are at distance 22)!
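For reference, the relevant loop looks roughly like this (trimmed from the
kernel this patch is against, with error handling and group linking elided,
so treat it as a sketch rather than the exact source):

  static int
  build_overlap_sched_groups(struct sched_domain *sd, int cpu)
  {
          /* 0-71 when building CPU0's NUMA(<=20) domain */
          const struct cpumask *span = sched_domain_span(sd);
          struct cpumask *covered = sched_domains_tmpmask;
          struct sd_data *sdd = sd->private;
          struct sched_domain *sibling;
          struct sched_group *sg;
          int i;

          cpumask_clear(covered);

          for_each_cpu_wrap(i, span, cpu) {
                  if (cpumask_test_cpu(i, covered))
                          continue;

                  sibling = *per_cpu_ptr(sdd->sd, i);

                  if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
                          continue;

                  /*
                   * The group span is copied straight from the sibling's
                   * child domain: for i == 48 that is CPU48's NUMA(<=12),
                   * i.e. 48-95, which overflows the 0-71 span we are
                   * building for.
                   */
                  sg = build_group_from_child_sched_domain(sibling, cpu);

                  cpumask_or(covered, covered, sched_group_span(sg));

                  init_overlap_sched_group(sd, sg);

                  /* group linking elided */
          }

          /* ... */
  }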
AFAICT this isn't too complicated to reproduce. For instance, a ring of 6
nodes with no links across the middle (i.e. degree(v) == 2 for every node)
would trigger the same issue.
Fix
===
Sanitize the groups we get out of build_group_from_child_sched_domain()
with the span of the domain we're currently building - this ensures the
groups we build only contain CPUs that are the right distance away from the
base CPU. This also requires modifying build_balance_mask().
With the patch applied:
[ 83.177623] domain-0: span=0-23 level=MC
[ 83.177626] groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
[ 83.177641] domain-1: span=0-47 level=NUMA
[ 83.177642] groups: 0:{ span=0-23 cap=24576 }, 24:{ span=24-47 cap=24576 }
[ 83.177645] domain-2: span=0-71 level=NUMA
[ 83.177646] groups: 0:{ span=0-47 cap=49152 }, 48:{ span=48-71 cap=49152 }
[ 83.177649] domain-3: span=0-95 level=NUMA
[ 83.177651] groups: 0:{ span=0-71 mask=0-23 cap=73728 }, 48:{ span=48-95 cap=73728 }
Note that the modification of build_balance_mask() feels icky, but I
couldn't think of topologies it would break. AFAICT the masks for the
topology pointed out in commit 73bb059f9b8a ("sched/topology: Fix
overlapping sched_group_mask") would remain unchanged.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/topology.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..7033b27e5162 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -866,7 +866,7 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
continue;
/* If we would not end up here, we can't continue from here */
- if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
+ if (!cpumask_subset(sg_span, sched_domain_span(sibling->child)))
continue;
cpumask_set_cpu(i, mask);
@@ -882,7 +882,9 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
* statistics having the groups node local is of dubious benefit.
*/
static struct sched_group *
-build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+build_group_from_child_sched_domain(struct sched_domain *sd,
+ int cpu,
+ const struct cpumask *level_span)
{
struct sched_group *sg;
struct cpumask *sg_span;
@@ -899,6 +901,21 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
else
cpumask_copy(sg_span, sched_domain_span(sd));
+ /*
+ * We're using sibling sched_domains to build up the groups of our NUMA
+ * domains, and those are built up from the point of view of their
+ * respective base CPU. This means that when we copy the span of said
+ * sibling->child domain, we are including CPUs that are
+ * sched_domains_numa_distance[level-1] hops away from the *sibling*
+ * base CPU, not the base CPU of the domain we are currently building
+ * for.
+ *
+ * Correct this by trimming the newly built sched_group with the current
+ * domain span which is known to only include CPUs that are the correct
+ * distance away from the current base CPU.
+ */
+ cpumask_and(sg_span, sg_span, level_span);
+
atomic_inc(&sg->ref);
return sg;
}
@@ -964,7 +981,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;
- sg = build_group_from_child_sched_domain(sibling, cpu);
+ sg = build_group_from_child_sched_domain(sibling, cpu, span);
if (!sg)
goto fail;
--
2.24.0
On Tue, Mar 24 2020, Valentin Schneider wrote:
> Fix
> ===
>
> Sanitize the groups we get out of build_group_from_child_sched_domain()
> with the span of the domain we're currently building - this ensures the
> groups we build only contain CPUs that are the right distance away from the
> base CPU. This also requires modifying build_balance_mask().
>
I somehow missed that this triggers the WARN_ON_ONCE() in
init_overlap_sched_group(). Gotta figure out why.
On 24.03.20 18:20, Valentin Schneider wrote:
>
> On Tue, Mar 24 2020, Valentin Schneider wrote:
>> Fix
>> ===
>>
>> Sanitize the groups we get out of build_group_from_child_sched_domain()
>> with the span of the domain we're currently building - this ensures the
>> groups we build only contain CPUs that are the right distance away from the
>> base CPU. This also requires modifying build_balance_mask().
>>
>
> I somehow missed that this triggers the WARN_ON_ONCE() in
> init_overlap_sched_group(). Gotta figure out why.
Cool, seems like qemu can emulate this!
qemu-system-aarch64 -kernel Image -hda ./qemu-image-aarch64.img \
  -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' \
  -smp cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt \
  -numa node,cpus=0-1,nodeid=0 -numa node,cpus=2-3,nodeid=1, \
  -numa node,cpus=4-5,nodeid=2, -numa node,cpus=6-7,nodeid=3, \
  -numa dist,src=0,dst=1,val=12, -numa dist,src=0,dst=2,val=20, \
  -numa dist,src=0,dst=3,val=22, -numa dist,src=1,dst=2,val=22, \
  -numa dist,src=1,dst=3,val=24, -numa dist,src=2,dst=3,val=12
...
[ 0.711685]
[ 0.711767] 10 12 20 22
[ 0.711860] 12 10 22 24
[ 0.711917] 20 22 10 12
[ 0.711970] 22 24 12 10
[ 0.712036]
[ 0.718356] CPU0 attaching sched-domain(s):
[ 0.718433] domain-0: span=0-1 level=MC
[ 0.718646] groups: 0:{ span=0 cap=1006 }, 1:{ span=1 cap=1015 }
[ 0.718865] domain-1: span=0-3 level=NUMA
[ 0.718906] groups: 0:{ span=0-1 cap=2021 }, 2:{ span=2-3 cap=2014 }
[ 0.719044] domain-2: span=0-5 level=NUMA
[ 0.719082] groups: 0:{ span=0-3 cap=4035 }, 4:{ span=4-7 cap=3855 }
[ 0.719164] ERROR: groups don't span domain->span
[ 0.719191] domain-3: span=0-7 level=NUMA
[ 0.719228] groups: 0:{ span=0-5 mask=0-1 cap=5974 }, 6:{ span=4-7 mask=6-7 cap=3964 }
[ 0.719961] CPU1 attaching sched-domain(s):
...
cat /proc/sys/kernel/sched_domain/cpu0/domain*/{name,flags}
MC
NUMA
NUMA
NUMA
4655
25647
25647
25647
On Wed, Mar 25 2020, Dietmar Eggemann wrote:
>> I somehow missed that this triggers the WARN_ON_ONCE() in
>> init_overlap_sched_group(). Gotta figure out why.
>
> Cool, seems like qemu can emulate this!
>
>
> qemu-system-aarch64 -kernel Image -hda ./qemu-image-aarch64.img -append
> 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp cores=8
> --nographic -m 512 -cpu cortex-a53 -machine virt -numa
> node,cpus=0-1,nodeid=0 -numa node,cpus=2-3,nodeid=1, -numa
> node,cpus=4-5,nodeid=2, -numa node,cpus=6-7,nodeid=3, -numa
> dist,src=0,dst=1,val=12, -numa dist,src=0,dst=2,val=20, -numa
> dist,src=0,dst=3,val=22, -numa dist,src=1,dst=2,val=22, -numa
> dist,src=1,dst=3,val=24, -numa dist,src=2,dst=3,val=12
>
> ...
> [ 0.711685]
> [ 0.711767] 10 12 20 22
> [ 0.711860] 12 10 22 24
> [ 0.711917] 20 22 10 12
> [ 0.711970] 22 24 12 10
> [ 0.712036]
> [ 0.718356] CPU0 attaching sched-domain(s):
> [ 0.718433] domain-0: span=0-1 level=MC
> [ 0.718646] groups: 0:{ span=0 cap=1006 }, 1:{ span=1 cap=1015 }
> [ 0.718865] domain-1: span=0-3 level=NUMA
> [ 0.718906] groups: 0:{ span=0-1 cap=2021 }, 2:{ span=2-3 cap=2014 }
> [ 0.719044] domain-2: span=0-5 level=NUMA
> [ 0.719082] groups: 0:{ span=0-3 cap=4035 }, 4:{ span=4-7 cap=3855 }
> [ 0.719164] ERROR: groups don't span domain->span
> [ 0.719191] domain-3: span=0-7 level=NUMA
> [ 0.719228] groups: 0:{ span=0-5 mask=0-1 cap=5974 }, 6:{ span=4-7 mask=6-7 cap=3964 }
> [ 0.719961] CPU1 attaching sched-domain(s):
> ...
>
I wasn't aware you could specify the distances directly on the CLI,
that's awesome! Thanks for pointing this out.
On Tue, Mar 24 2020, Valentin Schneider wrote:
> kernel/sched/topology.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8344757bba6e..7033b27e5162 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -866,7 +866,7 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
> continue;
>
> /* If we would not end up here, we can't continue from here */
> - if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
> + if (!cpumask_subset(sg_span, sched_domain_span(sibling->child)))
So this is one source of issues; what I've done here is a bit stupid
since we include CPUs that *cannot* end up there. What I should've done
is something like:
cpumask_and(tmp, sched_domain_span(sibling->child), sched_domain_span(sd));
if (!cpumask_equal(sg_span, tmp))
...
But even with that I just unfold even more horrors: this breaks the
overlapping sched_group_capacity (see 1676330ecfa8 ("sched/topology: Fix
overlapping sched_group_capacity")).
For instance, here I would have
CPU0-domain2-group4: span=4-5
CPU4-domain2-group4: span=4-7 mask=4-5
Both groups are at the same topology level and have the same first CPU,
so they point to the same sched_group_capacity structure - but they
don't have the same span. They would without my "fix", but then the
group spans are back to being wrong. I'm starting to think this is
doomed, at least in the current state of things :/
On Wed, Mar 25 2020, Valentin Schneider wrote:
> On Tue, Mar 24 2020, Valentin Schneider wrote:
>> kernel/sched/topology.c | 23 ++++++++++++++++++++---
>> 1 file changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 8344757bba6e..7033b27e5162 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -866,7 +866,7 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
>> continue;
>>
>> /* If we would not end up here, we can't continue from here */
>> - if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
>> + if (!cpumask_subset(sg_span, sched_domain_span(sibling->child)))
>
> So this is one source of issues; what I've done here is a bit stupid
> since we include CPUs that *cannot* end up there. What I should've done
> is something like:
>
> cpumask_and(tmp, sched_domain_span(sibling->child), sched_domain_span(sd));
> if (!cpumask_equal(sg_span, tmp))
> ...
>
> But even with that I just unfold even more horrors: this breaks the
> overlapping sched_group_capacity (see 1676330ecfa8 ("sched/topology: Fix
> overlapping sched_group_capacity")).
>
> For instance, here I would have
>
> CPU0-domain2-group4: span=4-5
> CPU4-domain2-group4: span=4-7 mask=4-5
>
^ That's using Dietmar's qemu setup; on the D06 that is
CPU0-domain2-group48: span=48-71
CPU48-domain2-group48: span=48-95 mask=48-71
> Both groups are at the same topology level and have the same first CPU,
> so they point to the same sched_group_capacity structure - but they
> don't have the same span. They would without my "fix", but then the
> group spans are back to being wrong. I'm starting to think this is
> doomed, at least in the current state of things :/
On Wed, Mar 25 2020, Valentin Schneider wrote:
> On Wed, Mar 25 2020, Valentin Schneider wrote:
>> On Tue, Mar 24 2020, Valentin Schneider wrote:
>>> kernel/sched/topology.c | 23 ++++++++++++++++++++---
>>> 1 file changed, 20 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 8344757bba6e..7033b27e5162 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -866,7 +866,7 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
>>> continue;
>>>
>>> /* If we would not end up here, we can't continue from here */
>>> - if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
>>> + if (!cpumask_subset(sg_span, sched_domain_span(sibling->child)))
>>
>> So this is one source of issues; what I've done here is a bit stupid
>> since we include CPUs that *cannot* end up there. What I should've done
>> is something like:
>>
>> cpumask_and(tmp, sched_domain_span(sibling->child), sched_domain_span(sd));
>> if (!cpumask_equal(sg_span, tmp))
>> ...
>>
>> But even with that I just unfold even more horrors: this breaks the
>> overlapping sched_group_capacity (see 1676330ecfa8 ("sched/topology: Fix
>> overlapping sched_group_capacity")).
>>
>> For instance, here I would have
>>
>> CPU0-domain2-group4: span=4-5
>> CPU4-domain2-group4: span=4-7 mask=4-5
>>
>
> ^ That's using Dietmar's qemu setup; on the D06 that is
>
> CPU0-domain2-group48: span=48-71
> CPU48-domain2-group48: span=48-95 mask=48-71
>
I realize this isn't clear at all, so to spell it out:
- build_balance_mask() for CPU0-domain2-group48 (span=48-71) builds a
mask that contains CPUs 48-71.
- build_balance_mask() for CPU48-domain2-group48 (span=48-95) builds a
mask that contains CPUs 48-95.
Both masks have the same first CPU, so they both use the same
sched_group_capacity structure. However, CPU0-domain2-sgc48 was first
and set the sgc cpumask to what it wanted, so when CPU48-domain2-group48's
turn comes we hit the WARN_ON_ONCE() and leave the mask as-is.
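For reference, this is roughly the path in init_overlap_sched_group() that
trips - simplified from the code this patch is against, so treat it as a
sketch:

  static void init_overlap_sched_group(struct sched_domain *sd,
                                       struct sched_group *sg)
  {
          struct cpumask *mask = sched_domains_tmpmask2;
          struct sd_data *sdd = sd->private;
          int cpu;

          build_balance_mask(sd, sg, mask);
          /* The sgc is indexed by the first CPU of the balance mask... */
          cpu = cpumask_first_and(sched_group_span(sg), mask);

          sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
          if (atomic_inc_return(&sg->sgc->ref) == 1)
                  /* ...so the first group to get here owns the sgc's mask */
                  cpumask_copy(group_balance_mask(sg), mask);
          else
                  /* ...and later groups must agree with it - here they don't */
                  WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));

          /* capacity initialisation elided */
  }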
And now I'll stop spewing nonsense for today.
>> Both groups are at the same topology level and have the same first CPU,
>> so they point to the same sched_group_capacity structure - but they
>> don't have the same span. They would without my "fix", but then the
>> group spans are back to being wrong. I'm starting to think this is
>> doomed, at least in the current state of things :/