2021-05-21 06:11:46

by Srikar Dronamraju

Subject: [PATCH 0/3] Skip numa distance for offline nodes

Geetika reported yet another trace while doing a dlpar CPU add
operation. The trace shows up even on top of recent commit
6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask"), which
fixed a similar trace.

WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 build_sched_domains+0x6e8/0x1540
Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag
xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag
netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables
nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs
libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror
dm_region_hash dm_log dm_mod fuse
CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19
Workqueue: events cpuset_hotplug_workfn
NIP: c0000000001de588 LR: c0000000001de584 CTR: 00000000006cd36c
REGS: c00000002772b250 TRAP: 0700 Not tainted (5.12.0-rc5-master+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28828422 XER: 0000000d
CFAR: c00000000020c2f8 IRQMASK: 0
GPR00: c0000000001de584 c00000002772b4f0 c000000001f55400 0000000000000036
GPR04: c0000063c6368010 c0000063c63f0a00 0000000000000027 c0000063c6368018
GPR08: 0000000000000023 c0000063c636ef48 00000063c4de0000 c0000063bfe9ffe8
GPR12: 0000000028828424 c0000063fe68fe80 0000000000000000 0000000000000417
GPR16: 0000000000000028 c00000000740dcd8 c00000000205db68 c000000001a3a4a0
GPR20: c000000091ed7d20 c000000091ed8520 0000000000000001 0000000000000000
GPR24: c0000000113a9600 0000000000000190 0000000000000028 c0000000010e3ac0
GPR28: 0000000000000000 c00000000740dd00 c0000000317b5900 0000000000000190
NIP [c0000000001de588] build_sched_domains+0x6e8/0x1540
LR [c0000000001de584] build_sched_domains+0x6e4/0x1540
Call Trace:
[c00000002772b4f0] [c0000000001de584] build_sched_domains+0x6e4/0x1540 (unreliable)
[c00000002772b640] [c0000000001e08dc] partition_sched_domains_locked+0x3ec/0x530
[c00000002772b6e0] [c0000000002a2144] rebuild_sched_domains_locked+0x524/0xbf0
[c00000002772b7e0] [c0000000002a5620] rebuild_sched_domains+0x40/0x70
[c00000002772b810] [c0000000002a58e4] cpuset_hotplug_workfn+0x294/0xe20
[c00000002772bc30] [c000000000187510] process_one_work+0x300/0x670
[c00000002772bd10] [c0000000001878f8] worker_thread+0x78/0x520
[c00000002772bda0] [c0000000001937f0] kthread+0x1a0/0x1b0
[c00000002772be10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19
fbe10060 3863e558 4802dd31 60000000 <0fe00000> 3920fff4 f9210080 e86100b0

Detailed analysis of the failing scenario showed that the span in
question belongs to the NODE domain, and further that the cpumasks of
some CPUs in NODE overlapped. There are two possible reasons why we
ended up here:

(1) The numa node was offline or blank, with no CPUs or memory. Hence
sched_max_numa_distance could not be set correctly, or
sched_domains_numa_distance ended up only partially populated.

(2) The cpumasks were populated by relying on the bogus node_distance
of an offline node. On POWER, node_distance is correctly available only
for an online node that has some CPU or memory resource associated
with it.

For example, the distance info from numactl on a fully populated 8-node
system at boot may look like this:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system is booted with only two nodes online, the
numa topology will look like:
node distances:
node   0   1
  0:  10  20
  1:  20  10
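
To make the consequence concrete, here is a standalone userspace sketch
(not kernel code; the distance matrix is copied from the numactl output
above) of how the set of unique distance levels the scheduler derives
shrinks when only two nodes are online at boot:

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 8

/* Distances as reported by numactl on the fully populated system. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40, 40, 40, 40, 40 },
	{ 20, 10, 40, 40, 40, 40, 40, 40 },
	{ 40, 40, 10, 20, 40, 40, 40, 40 },
	{ 40, 40, 20, 10, 40, 40, 40, 40 },
	{ 40, 40, 40, 40, 10, 20, 40, 40 },
	{ 40, 40, 40, 40, 20, 10, 40, 40 },
	{ 40, 40, 40, 40, 40, 40, 10, 20 },
	{ 40, 40, 40, 40, 40, 40, 20, 10 },
};

/* Mimics the distance_map bitmap that sched_init_numa() builds. */
static void print_levels(int nr_online)
{
	bool seen[256] = { false };

	for (int i = 0; i < nr_online; i++)
		for (int j = 0; j < nr_online; j++)
			seen[distance[i][j]] = true;

	printf("%d node(s) online, distance levels:", nr_online);
	for (int d = 0; d < 256; d++)
		if (seen[d])
			printf(" %d", d);
	printf("\n");
}

int main(void)
{
	print_levels(8);	/* levels {10, 20, 40} */
	print_levels(2);	/* levels {10, 20}: 40 is never seen */
	return 0;
}

With all eight nodes online the scheduler sizes its NUMA levels for
distances {10, 20, 40}; booted with two nodes it sizes them for
{10, 20} and is later surprised by a hot-added node at distance 40.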

This series tries to fix both of these problems.
Note: these problems became visible thanks to commit ccf74128d66c
("sched/topology: Assert non-NUMA topology masks don't (partially)
overlap").
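
For reference, the invariant that commit asserts can be stated with a
small standalone sketch (this is not the kernel's topology_span_sane()
itself; cpumasks are modelled here as plain 64-bit words): at any
non-NUMA topology level, the masks of two CPUs must be either identical
or completely disjoint.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Identical or disjoint is fine; a partial overlap trips the WARN. */
static bool span_sane(uint64_t a, uint64_t b)
{
	return a == b || (a & b) == 0;
}

int main(void)
{
	assert(span_sane(0x0f, 0x0f));	/* identical masks: sane */
	assert(span_sane(0x0f, 0xf0));	/* disjoint masks: sane */
	assert(!span_sane(0x0f, 0x3c));	/* partial overlap: WARN */
	return 0;
}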


Cc: LKML <[email protected]>
Cc: [email protected]
Cc: Nathan Lynch <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Geetika Moolchandani <[email protected]>

Srikar Dronamraju (3):
sched/topology: Allow archs to populate distance map
powerpc/numa: Populate distance map correctly
sched/topology: Skip updating masks for non-online nodes

 arch/powerpc/include/asm/topology.h |  3 +++
 arch/powerpc/mm/numa.c              | 19 +++++++++++++++
 kernel/sched/topology.c             | 38 +++++++++++++++++++++--------
 3 files changed, 50 insertions(+), 10 deletions(-)


base-commit: 1699949d3314e5d1956fb082e4cd4798bf6149fc
--
2.27.0


2021-05-21 15:03:28

by Srikar Dronamraju

Subject: [PATCH 1/3] sched/topology: Allow archs to populate distance map

Currently the scheduler populates the distance map by looking at the
distance of each node from all other nodes. This should work for most
architectures and platforms.

However, some architectures, such as POWER, may not expose the distance
of nodes that are not yet onlined, because those resources are not yet
allocated to the OS instance. Such architectures have other means to
provide valid distance data for the current platform.

For example, the distance info from numactl on a fully populated 8-node
system at boot may look like this:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system is booted with only two nodes online, the
numa topology will look like:
node distances:
node   0   1
  0:  10  20
  1:  20  10

What node_distance(0, 3) returns, where node 0 is online and node 3 is
offline, may be implementation dependent. On POWER it returns
LOCAL_DISTANCE (10). Hence at boot the scheduler would assume that the
max distance between nodes is 20. However, that would not be true.

When the remaining nodes are onlined and CPUs from those nodes are
hotplugged, the max node distance would be 40.

To handle such scenarios, let the scheduler allow architectures to
populate the distance map. Architectures that want to populate the
distance map can override arch_populate_distance_map().
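
For illustration, the override pattern can be demonstrated standalone.
Everything below is illustrative only; it is NOT patch 2 of this series
(which is not quoted in this excerpt), the distance values are taken
from the example above, and it assumes a 64-bit unsigned long:

#include <stdio.h>

/* An arch header would define the macro, replacing the default: */
#define arch_populate_distance_map arch_populate_distance_map
static int arch_populate_distance_map(unsigned long *distance_map)
{
	/* Hypothetical: set bits from platform data that stays valid
	 * even for offline nodes, instead of using node_distance(). */
	*distance_map |= (1UL << 10) | (1UL << 20) | (1UL << 40);
	return 0;
}

/* The generic fallback from this patch compiles out when the arch
 * macro is defined: */
#ifndef arch_populate_distance_map
static int arch_populate_distance_map(unsigned long *distance_map)
{
	return 0;	/* would walk node_distance() pairs here */
}
#endif

int main(void)
{
	unsigned long map = 0;

	if (!arch_populate_distance_map(&map))
		printf("distance map bits: %#lx\n", map);
	return 0;
}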

Cc: LKML <[email protected]>
Cc: [email protected]
Cc: Nathan Lynch <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Geetika Moolchandani <[email protected]>
Reported-by: Geetika Moolchandani <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/topology.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 053115b55f89..ccb9aff59add 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1630,6 +1630,26 @@ static void init_numa_topology_type(void)

 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
+#ifndef arch_populate_distance_map
+static int arch_populate_distance_map(unsigned long *distance_map)
+{
+	int i, j;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			int distance = node_distance(i, j);
+
+			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
+				sched_numa_warn("Invalid distance value range");
+				return -1;
+			}
+			bitmap_set(distance_map, distance, 1);
+		}
+	}
+	return 0;
+}
+#endif
+
 void sched_init_numa(void)
 {
 	struct sched_domain_topology_level *tl;
@@ -1646,18 +1666,10 @@ void sched_init_numa(void)
 		return;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-	for (i = 0; i < nr_node_ids; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
-			int distance = node_distance(i, j);
 
-			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
-				sched_numa_warn("Invalid distance value range");
-				return;
-			}
+	if (arch_populate_distance_map(distance_map))
+		return;
 
-			bitmap_set(distance_map, distance, 1);
-		}
-	}
 	/*
 	 * We can now figure out how many unique distance values there are and
 	 * allocate memory accordingly.
--
2.27.0

2021-05-21 15:03:30

by Srikar Dronamraju

Subject: [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes

Currently the scheduler doesn't check whether a node is online before
adding CPUs to the node mask. However, on some architectures the node
distance is only available for nodes that are online. It is unclear how
much the node distance can be relied upon when one of the nodes is
offline.

If said node distance is fake (since one of the nodes is offline) and
the actual node distance turns out to be different, then the cpumask of
such nodes will be wrong once the nodes come online.

This can cause topology_span_sane() to throw up a warning message, and
the rest of the topology ends up not being updated properly.

Resolve this by skipping the update of the cpumask for nodes that are
not online.
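
For context, sched_domains_numa_masks_set() is invoked when a CPU comes
online, so a node skipped here gets its masks filled in once it gains a
CPU. A rough, non-compilable sketch of that path (simplified from
kernel/sched/core.c; surrounding details elided, so treat it as an
approximation):

/* When a CPU is activated during hotplug, its node's NUMA masks are
 * refreshed; this is the helper the second hunk below modifies. */
int sched_cpu_activate(unsigned int cpu)
{
	/* ... */
	sched_domains_numa_masks_set(cpu);
	/* ... sched domains are then rebuilt ... */
	return 0;
}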

Cc: LKML <[email protected]>
Cc: [email protected]
Cc: Nathan Lynch <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Geetika Moolchandani <[email protected]>
Reported-by: Geetika Moolchandani <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/topology.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ccb9aff59add..ba0555e83548 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1731,6 +1731,9 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for_each_node(k) {
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1793,6 +1796,9 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_online(j))
+				continue;
+
 			if (node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
--
2.27.0

2021-05-21 16:35:19

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> Currently the scheduler populates the distance map by looking at the
> distance of each node from all other nodes. This should work for most
> architectures and platforms.
>
> However, some architectures, such as POWER, may not expose the distance
> of nodes that are not yet onlined, because those resources are not yet
> allocated to the OS instance. Such architectures have other means to
> provide valid distance data for the current platform.
>
> For example, the distance info from numactl on a fully populated 8-node
> system at boot may look like this:
>
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  40  40  40  40  40  40
>   1:  20  10  40  40  40  40  40  40
>   2:  40  40  10  20  40  40  40  40
>   3:  40  40  20  10  40  40  40  40
>   4:  40  40  40  40  10  20  40  40
>   5:  40  40  40  40  20  10  40  40
>   6:  40  40  40  40  40  40  10  20
>   7:  40  40  40  40  40  40  20  10
>
> However, when the same system is booted with only two nodes online, the
> numa topology will look like:
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> What node_distance(0, 3) returns, where node 0 is online and node 3 is
> offline, may be implementation dependent. On POWER it returns
> LOCAL_DISTANCE (10). Hence at boot the scheduler would assume that the
> max distance between nodes is 20. However, that would not be true.
>
> When the remaining nodes are onlined and CPUs from those nodes are
> hotplugged, the max node distance would be 40.
>
> To handle such scenarios, let the scheduler allow architectures to
> populate the distance map. Architectures that want to populate the
> distance map can override arch_populate_distance_map().

Why? Why can't your node_distance() DTRT? The arch interface is
nr_node_ids and node_distance(), I don't see why we need something new
and then replace one special use of it.

By virtue of you being able to actually implement this new hook, you
supposedly can actually do node_distance() right too.

2021-05-21 20:06:18

by Srikar Dronamraju

Subject: Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

* Peter Zijlstra <[email protected]> [2021-05-20 20:56:31]:

> On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > Currently the scheduler populates the distance map by looking at the
> > distance of each node from all other nodes. This should work for most
> > architectures and platforms.
> >
> > However, some architectures, such as POWER, may not expose the distance
> > of nodes that are not yet onlined, because those resources are not yet
> > allocated to the OS instance. Such architectures have other means to
> > provide valid distance data for the current platform.
> >
> > For example, the distance info from numactl on a fully populated 8-node
> > system at boot may look like this:
> >
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> >
> > However, when the same system is booted with only two nodes online, the
> > numa topology will look like:
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> >
> > What node_distance(0, 3) returns, where node 0 is online and node 3 is
> > offline, may be implementation dependent. On POWER it returns
> > LOCAL_DISTANCE (10). Hence at boot the scheduler would assume that the
> > max distance between nodes is 20. However, that would not be true.
> >
> > When the remaining nodes are onlined and CPUs from those nodes are
> > hotplugged, the max node distance would be 40.
> >
> > To handle such scenarios, let the scheduler allow architectures to
> > populate the distance map. Architectures that want to populate the
> > distance map can override arch_populate_distance_map().
>
> Why? Why can't your node_distance() DTRT? The arch interface is
> nr_node_ids and node_distance(), I don't see why we need something new
> and then replace one special use of it.
>
> By virtue of you being able to actually implement this new hook, you
> supposedly can actually do node_distance() right too.

Because for an offline node, the arch interface code doesn't have the
info. As far as I know/understand, on POWER, unless there is an active
memory or CPU that's getting onlined, the arch can't fetch the correct
node distance.

Taking the above example: if node 3 is offline, then the node_distance
of (3, X), where X is anything other than 3, is not reliable. The
moment node 3 is onlined, the node distance is reliable.

This problem would not happen even on POWER if all the nodes had either
memory or CPUs active at the time of boot.

--
Thanks and Regards
Srikar Dronamraju

2021-05-21 20:09:08

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2021-05-20 20:56:31]:
>
> > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > Currently the scheduler populates the distance map by looking at the
> > > distance of each node from all other nodes. This should work for most
> > > architectures and platforms.
> > >
> > > However, some architectures, such as POWER, may not expose the distance
> > > of nodes that are not yet onlined, because those resources are not yet
> > > allocated to the OS instance. Such architectures have other means to
> > > provide valid distance data for the current platform.
> > >
> > > For example, the distance info from numactl on a fully populated 8-node
> > > system at boot may look like this:
> > >
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >   0:  10  20  40  40  40  40  40  40
> > >   1:  20  10  40  40  40  40  40  40
> > >   2:  40  40  10  20  40  40  40  40
> > >   3:  40  40  20  10  40  40  40  40
> > >   4:  40  40  40  40  10  20  40  40
> > >   5:  40  40  40  40  20  10  40  40
> > >   6:  40  40  40  40  40  40  10  20
> > >   7:  40  40  40  40  40  40  20  10
> > >
> > > However, when the same system is booted with only two nodes online, the
> > > numa topology will look like:
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > >
> > > What node_distance(0, 3) returns, where node 0 is online and node 3 is
> > > offline, may be implementation dependent. On POWER it returns
> > > LOCAL_DISTANCE (10). Hence at boot the scheduler would assume that the
> > > max distance between nodes is 20. However, that would not be true.
> > >
> > > When the remaining nodes are onlined and CPUs from those nodes are
> > > hotplugged, the max node distance would be 40.
> > >
> > > To handle such scenarios, let the scheduler allow architectures to
> > > populate the distance map. Architectures that want to populate the
> > > distance map can override arch_populate_distance_map().
> >
> > Why? Why can't your node_distance() DTRT? The arch interface is
> > nr_node_ids and node_distance(), I don't see why we need something new
> > and then replace one special use of it.
> >
> > By virtue of you being able to actually implement this new hook, you
> > supposedly can actually do node_distance() right too.
>
> Because for an offline node, the arch interface code doesn't have the
> info. As far as I know/understand, on POWER, unless there is an active
> memory or CPU that's getting onlined, the arch can't fetch the correct
> node distance.
>
> Taking the above example: if node 3 is offline, then the node_distance
> of (3, X), where X is anything other than 3, is not reliable. The
> moment node 3 is onlined, the node distance is reliable.
>
> This problem would not happen even on POWER if all the nodes had either
> memory or CPUs active at the time of boot.

But then how can you implement this new hook? Going by the fact that
both nr_node_ids and distance_ref_points_depth are fixed, how many
possible __node_distance() configurations are there left?

The example provided above does not suggest there's much room for
alternatives, and hence little actual need for this new interface.
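
The point being made here can be sketched standalone, under the
assumption (consistent with the 8-node example above) that distances
start at LOCAL_DISTANCE and double per associativity level; the depth
value is hypothetical:

#include <stdio.h>

#define LOCAL_DISTANCE 10

int main(void)
{
	int depth = 2;	/* stands in for distance_ref_points_depth */
	int d = LOCAL_DISTANCE;

	/* With depth fixed, the whole set of values node_distance()
	 * could ever return is enumerable at boot, offline nodes or
	 * not: 10, 20, 40. */
	for (int level = 0; level <= depth; level++, d *= 2)
		printf("possible node distance: %d\n", d);

	return 0;
}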