From: Valentin Schneider <valentin.schneider@arm.com>
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@kernel.org, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, morten.rasmussen@arm.com,
    mgorman@techsingularity.net
Subject: [PATCH] sched/topology: Fix overlapping sched_group build
Date: Tue, 24 Mar 2020 12:55:33 +0000
Message-Id: <20200324125533.17447-1-valentin.schneider@arm.com>

Platform
========

I've been investigating an assertion failure on a D06 board (Kunpeng 920
based), which AFAIA is 2 sockets, each with 2 dies, each with 24 CPUs and
its own memory controller (4 nodes total). The default distance table I
get is:

  node   0   1   2   3
    0:  10  12  20  22
    1:  12  10  22  24
    2:  20  22  10  12
    3:  22  24  12  10

Which can be summarised as:

      2       10      2
  0 <---> 1 <---> 2 <---> 3
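For reference, the NUMA(<=x) domains drawn in the "Root cause" section
below fall out of these distances: from node 0's point of view there are
three remote distance levels (12, 20, 22). Here is a standalone
illustration of that mapping (plain userspace C, *not* kernel code -
roughly speaking sched_init_numa() builds its levels from the unique
distances in the whole table, whereas this only looks at node 0's row,
which is all that matters for CPU0's domains):

  /*
   * Standalone illustration, not kernel code: which nodes fall within each
   * distance threshold from node 0, using the distance table above.
   */
  #include <stdio.h>

  int main(void)
  {
          int dist[4][4] = {
                  { 10, 12, 20, 22 },
                  { 12, 10, 22, 24 },
                  { 20, 22, 10, 12 },
                  { 22, 24, 12, 10 },
          };
          int levels[] = { 12, 20, 22 }; /* remote distances seen from node 0 */

          for (unsigned int l = 0; l < sizeof(levels) / sizeof(levels[0]); l++) {
                  printf("NUMA(<=%d) from node 0 spans nodes:", levels[l]);
                  for (int n = 0; n < 4; n++) {
                          if (dist[0][n] <= levels[l])
                                  printf(" %d", n);
                  }
                  printf("\n");
          }
          return 0;
  }

Running it prints nodes {0, 1}, {0, 1, 2} and {0, 1, 2, 3} for the three
levels, matching the 0-47, 0-71 and 0-95 NUMA domain spans shown below.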
Error
=====

Using sched_debug=1, I get an assertion error for every single CPU. This
is the output for CPU0:

[344276.794534] CPU0 attaching sched-domain(s):
[344276.794536]  domain-0: span=0-23 level=MC
[344276.794539]   groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
[344276.794554]   domain-1: span=0-47 level=NUMA
[344276.794555]    groups: 0:{ span=0-23 cap=24576 }, 24:{ span=24-47 cap=24576 }
[344276.794558]    domain-2: span=0-71 level=NUMA
[344276.794560]     groups: 0:{ span=0-47 cap=49152 }, 48:{ span=48-95 cap=49152 }
[344276.794563] ERROR: groups don't span domain->span
[344276.799346]     domain-3: span=0-95 level=NUMA
[344276.799353]      groups: 0:{ span=0-71 mask=0-23 cap=73728 }, 72:{ span=48-95 mask=72-95 cap=49152 }

Root cause
==========

The NUMA distance setup is correct for node 0 (despite the deduplicating
sort not liking this distance table - that's for another day), and so are
the sched_domain spans (CPU0 PoV):

             node0      node1      node2      node3
            /^^^^^^\   /^^^^^^^\  /^^^^^^^\  /^^^^^^^\
            0 ... 23   24 ... 47  48 ... 71  72 ... 95

MC          [      ]
NUMA(<=12)  [                  ]
NUMA(<=20)  [                             ]
NUMA(<=22)  [                                        ]

The problem lies in the *groups* of that middle NUMA domain: they are
{[0-47], [48-95]}, IOW it includes node3 when it really shouldn't. Let me
try to explain why:

  []: domain span
  (): group span

CPU0 perspective:
,----
|                 node0         node1      node2      node3
|             /^^^^^^^^^^\   /^^^^^^^\  /^^^^^^^\  /^^^^^^^\
|             0 ....... 23   24 ... 47  48 ... 71  72 ... 95
|
| MC          [(0)...(23)]
| NUMA(<=12)  [ (0 - 23)     (24 - 47)]
`----

CPU48 perspective:
,----
|               node0      node1         node2        node3
|             /^^^^^^\   /^^^^^^^\  /^^^^^^^^^^^\  /^^^^^^^\
|             0 ... 23   24 ... 47  48 ....... 71  72 ... 95
|
| MC                                [(48)...(71)]
| NUMA(<=12)                        [ (48 - 71)      (72 - 95)]
`----

When we go through build_overlap_sched_groups() for CPU0's NUMA(<=20)
domain, we'll first use CPU0's NUMA(<=20) child (i.e. NUMA(<=12)) as span
for the first group, which will be (0-47).

The next CPU we'll iterate over will be CPU48; we'll do the same thing and
use its NUMA(<=20) child span as group span, which here will be (48-95) -
despite CPUs (72-95) *not* being at distance <= 20 from CPU0 (they are at
distance 22)!

AFAICT this isn't too complicated to reproduce. For instance, a ring of 6
CPUs with no interlink in the middle (indegree(v) == 2 for all v) would
trigger the same issue.
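To make the over-inclusion concrete, here is a standalone sketch (again
plain userspace C, *not* the kernel code path; bit n stands for node n
rather than a real cpumask) of the group span CPU48 contributes to CPU0's
NUMA(<=20) domain, and of what trimming it with the current domain's span
- the approach taken by the fix below - yields:

  /*
   * Standalone sketch, not kernel code: bit n represents node n, so
   * 0x1 = CPUs 0-23, 0x2 = CPUs 24-47, 0x4 = CPUs 48-71, 0x8 = CPUs 72-95.
   */
  #include <stdio.h>

  int main(void)
  {
          unsigned int domain_span   = 0x7; /* CPU0's NUMA(<=20): nodes 0-2, CPUs 0-71  */
          unsigned int sibling_child = 0xc; /* CPU48's NUMA(<=12): nodes 2-3, CPUs 48-95 */

          /* Current behaviour: the sibling's child span is copied verbatim */
          unsigned int group_span = sibling_child;
          printf("unpatched group span: 0x%x (node3, i.e. CPUs 72-95, sneaks in)\n",
                 group_span);

          /* With the fix below: trim with the span of the domain being built */
          group_span &= domain_span;
          printf("patched group span:   0x%x (node2 only, i.e. CPUs 48-71)\n",
                 group_span);

          return 0;
  }

The trimmed result, node2 only (CPUs 48-71), is exactly what shows up as
48:{ span=48-71 } in the domain-2 groups of the patched output further
down.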
Fix
===

Sanitize the groups we get out of build_group_from_child_sched_domain()
with the span of the domain we're currently building - this ensures the
groups we build only contain CPUs that are the right distance away from
the base CPU. This also requires modifying build_balance_mask().

With the patch applied:

[   83.177623]  domain-0: span=0-23 level=MC
[   83.177626]   groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
[   83.177641]   domain-1: span=0-47 level=NUMA
[   83.177642]    groups: 0:{ span=0-23 cap=24576 }, 24:{ span=24-47 cap=24576 }
[   83.177645]    domain-2: span=0-71 level=NUMA
[   83.177646]     groups: 0:{ span=0-47 cap=49152 }, 48:{ span=48-71 cap=49152 }
[   83.177649]     domain-3: span=0-95 level=NUMA
[   83.177651]      groups: 0:{ span=0-71 mask=0-23 cap=73728 }, 48:{ span=48-95 cap=73728 }

Note that the modification of build_balance_mask() feels icky, but I
couldn't think of topologies it would break. AFAICT the masks for the
topology pointed out in commit 73bb059f9b8a ("sched/topology: Fix
overlapping sched_group_mask") would remain unchanged.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..7033b27e5162 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -866,7 +866,7 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
 			continue;
 
 		/* If we would not end up here, we can't continue from here */
-		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
+		if (!cpumask_subset(sg_span, sched_domain_span(sibling->child)))
 			continue;
 
 		cpumask_set_cpu(i, mask);
@@ -882,7 +882,9 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
  * statistics having the groups node local is of dubious benefit.
  */
 static struct sched_group *
-build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+build_group_from_child_sched_domain(struct sched_domain *sd,
+				    int cpu,
+				    const struct cpumask *level_span)
 {
 	struct sched_group *sg;
 	struct cpumask *sg_span;
@@ -899,6 +901,21 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 	else
 		cpumask_copy(sg_span, sched_domain_span(sd));
 
+	/*
+	 * We're using sibling sched_domains to build up the groups of our NUMA
+	 * domains, and those are built up from the point of view of their
+	 * respective base CPU. This means that when we copy the span of said
+	 * sibling->child domain, we are including CPUs that are
+	 * sched_domains_numa_distance[level-1] hops away from the *sibling*
+	 * base CPU, not the base CPU of the domain we are currently building
+	 * for.
+	 *
+	 * Correct this by trimming the newly built sched_group with the current
+	 * domain span which is known to only include CPUs that are the correct
+	 * distance away from the current base CPU.
+	 */
+	cpumask_and(sg_span, sg_span, level_span);
+
 	atomic_inc(&sg->ref);
 	return sg;
 }
@@ -964,7 +981,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
-		sg = build_group_from_child_sched_domain(sibling, cpu);
+		sg = build_group_from_child_sched_domain(sibling, cpu, span);
 		if (!sg)
 			goto fail;
 
-- 
2.24.0