LinuxLists.cc - [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-04 20:02:18

Subject: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

When we hotplug a CPU in a memoryless/cpuless node,
the kernel crashes when it rebuilds the sched_domains data.

I reproduce this problem on POWER and with a pseries VM, with the following
QEMU parameters:

-machine pseries -enable-kvm -m 8192 \
-smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
-numa node,nodeid=0,cpus=0-1,mem=0 \
-numa node,nodeid=1,cpus=2-3,mem=8192 \
-numa node,nodeid=2,cpus=4-5,mem=0 \
-numa node,nodeid=3,cpus=6-7,mem=0

Then I can trigger the crash by hotplugging a CPU on node-id 3:

(qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3

Built 2 zonelists, mobility grouping on. Total pages: 130162
Policy zone: Normal
WARNING: workqueue cpumask: online intersect > possible intersect
BUG: Kernel NULL pointer dereference at 0x00000400
Faulting instruction address: 0xc000000000170edc
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
Workqueue: events cpuset_hotplug_workfn
NIP: c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
REGS: c000000003e931a0 TRAP: 0380 Not tainted (5.0.0-rc6+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22284028 XER: 00000000
CFAR: c000000000170f20 IRQMASK: 0
GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
LR [c000000000170f98] destroy_sched_domain+0x28/0x90
Call Trace:
[c000000003e93430] [000000000000102f] 0x102f (unreliable)
[c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
[c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
[c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
[c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
[c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
[c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
[c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
[c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
[c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
[c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
[c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
---[ end trace f992c4a7d47d602a ]---

Kernel panic - not syncing: Fatal exception

This happens in free_sched_groups() because the linked list of the
sched_groups is corrupted. Here what happens when we hotplug the CPU:

- build_sched_groups() builds a sched_groups linked list for
sched_domain D1, with only one entry A, refcount=1

D1: A(ref=1)

- build_sched_groups() builds a sched_groups linked list for
sched_domain D2, with the same entry A

D2: A(ref=2)

- build_sched_groups() builds a sched_groups linked list for
sched_domain D3, with the same entry A and a new entry B:

D3: A(ref=3) -> B(ref=1)

- destroy_sched_domain() is called for D1:

D1: A(ref=3) -> B(ref=1) and as ref is 1, memory of B is released,
but A->next always points to B

- destroy_sched_domain() is called for D3:

D3: A(ref=2) -> B(ref=0)

kernel crashes when it tries to use data inside B, as the memory has been
corrupted as it has been freed, the linked list (next) is broken too.

This problem appears with commit 051f3ca02e46
("sched/topology: Introduce NUMA identity node sched domain").

If I compare function calls sequence before and after this commit I can see
in the working case build_overlap_sched_groups() is called instead of
build_sched_groups() and in this case the reference counters have all the
same value and the linked list can be correctly unallocated.
The involved commit has introduced the node domain, and in the case of
powerpc the node domain can overlap, whereas it should not happen.

This happens because initially powerpc code computes
sched_domains_numa_masks of offline nodes as if they were merged with
node 0 (because firmware doesn't provide the distance information for
memoryless/cpuless nodes):

node 0 1 2 3
0: 10 40 10 10
1: 40 10 40 40
2: 10 40 10 10
3: 10 40 10 10

We should have:

node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10

And once a new CPU is added, node is onlined, numa masks are updated
but initial set bits are not cleared. This explains why nodes can overlap.

This patch changes the initial code to not initialize the distance for
offline nodes. The distances will be updated when node will become online
(on CPU hotplug) as it is already done.

This patch has been tested on powerpc but not on the other architectures.
They are impacted because the modified part is in the common code.
All comments are welcome (how to move the change to powerpc specific code
or if the other architectures can work with this change).

Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Gibson <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nathan Fontenot <[email protected]>
Cc: Michael Bringmann <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Laurent Vivier <[email protected]>
---

Notes:
v3: fix the root cause of the problem (sched numa mask initialization)
v2: add scheduler maintainers in the CC: list

kernel/sched/topology.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3f35ba1d8fde..24831b86533b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1622,8 +1622,10 @@ void sched_init_numa(void)
return;

sched_domains_numa_masks[i][j] = mask;
+ if (!node_state(j, N_ONLINE))
+ continue;

- for_each_node(k) {
+ for_each_online_node(k) {
if (node_distance(j, k) > sched_domains_numa_distance[i])
continue;

--
2.20.1

2019-03-05 13:54:54

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

On Mon, Mar 04, 2019 at 08:59:52PM +0100, Laurent Vivier wrote:
> This happens because initially powerpc code computes
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (because firmware doesn't provide the distance information for
> memoryless/cpuless nodes):
>
> node 0 1 2 3
> 0: 10 40 10 10
> 1: 40 10 40 40
> 2: 10 40 10 10
> 3: 10 40 10 10

*groan*... what does it do for things like percpu memory? ISTR the
per-cpu chunks are all allocated early too. Having them all use memory
out of node-0 would seem sub-optimal.

> We should have:
>
> node 0 1 2 3
> 0: 10 40 40 40
> 1: 40 10 40 40
> 2: 40 40 10 40
> 3: 40 40 40 10

Can it happen that it introduces a new distance in the table? One that
hasn't been seen before? This example only has 10 and 40, but suppose
the new node lands at distance 20 (or 80); can such a thing happen?

If not; why not?

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
> return;
>
> sched_domains_numa_masks[i][j] = mask;
> + if (!node_state(j, N_ONLINE))
> + continue;
>
> - for_each_node(k) {
> + for_each_online_node(k) {
> if (node_distance(j, k) > sched_domains_numa_distance[i])
> continue;
>

So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
but that in turn relies on the sched_domain_numa_levels thing to stay
accurate.

This all seems very fragile and unfortunate.

2019-03-15 11:15:04

by Laurent Vivier

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

On 04/03/2019 20:59, Laurent Vivier wrote:
> When we hotplug a CPU in a memoryless/cpuless node,
> the kernel crashes when it rebuilds the sched_domains data.
>
> I reproduce this problem on POWER and with a pseries VM, with the following
> QEMU parameters:
>
> -machine pseries -enable-kvm -m 8192 \
> -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
> -numa node,nodeid=0,cpus=0-1,mem=0 \
> -numa node,nodeid=1,cpus=2-3,mem=8192 \
> -numa node,nodeid=2,cpus=4-5,mem=0 \
> -numa node,nodeid=3,cpus=6-7,mem=0
>
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
>
> (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
>
> Built 2 zonelists, mobility grouping on. Total pages: 130162
> Policy zone: Normal
> WARNING: workqueue cpumask: online intersect > possible intersect
> BUG: Kernel NULL pointer dereference at 0x00000400
> Faulting instruction address: 0xc000000000170edc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
> Workqueue: events cpuset_hotplug_workfn
> NIP: c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
> REGS: c000000003e931a0 TRAP: 0380 Not tainted (5.0.0-rc6+)
> MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22284028 XER: 00000000
> CFAR: c000000000170f20 IRQMASK: 0
> GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
> GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
> GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
> GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
> GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
> GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
> GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
> GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
> NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
> LR [c000000000170f98] destroy_sched_domain+0x28/0x90
> Call Trace:
> [c000000003e93430] [000000000000102f] 0x102f (unreliable)
> [c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
> [c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
> [c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
> [c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
> [c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
> [c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
> [c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
> [c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
> [c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
> [c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
> [c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> 2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
> ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
> ---[ end trace f992c4a7d47d602a ]---
>
> Kernel panic - not syncing: Fatal exception
>
> This happens in free_sched_groups() because the linked list of the
> sched_groups is corrupted. Here what happens when we hotplug the CPU:
>
> - build_sched_groups() builds a sched_groups linked list for
> sched_domain D1, with only one entry A, refcount=1
>
> D1: A(ref=1)
>
> - build_sched_groups() builds a sched_groups linked list for
> sched_domain D2, with the same entry A
>
> D2: A(ref=2)
>
> - build_sched_groups() builds a sched_groups linked list for
> sched_domain D3, with the same entry A and a new entry B:
>
> D3: A(ref=3) -> B(ref=1)
>
> - destroy_sched_domain() is called for D1:
>
> D1: A(ref=3) -> B(ref=1) and as ref is 1, memory of B is released,
> but A->next always points to B
>
> - destroy_sched_domain() is called for D3:
>
> D3: A(ref=2) -> B(ref=0)
>
> kernel crashes when it tries to use data inside B, as the memory has been
> corrupted as it has been freed, the linked list (next) is broken too.
>
> This problem appears with commit 051f3ca02e46
> ("sched/topology: Introduce NUMA identity node sched domain").
>
> If I compare function calls sequence before and after this commit I can see
> in the working case build_overlap_sched_groups() is called instead of
> build_sched_groups() and in this case the reference counters have all the
> same value and the linked list can be correctly unallocated.
> The involved commit has introduced the node domain, and in the case of
> powerpc the node domain can overlap, whereas it should not happen.
>
> This happens because initially powerpc code computes
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (because firmware doesn't provide the distance information for
> memoryless/cpuless nodes):
>
> node 0 1 2 3
> 0: 10 40 10 10
> 1: 40 10 40 40
> 2: 10 40 10 10
> 3: 10 40 10 10
>
> We should have:
>
> node 0 1 2 3
> 0: 10 40 40 40
> 1: 40 10 40 40
> 2: 40 40 10 40
> 3: 40 40 40 10
>
> And once a new CPU is added, node is onlined, numa masks are updated
> but initial set bits are not cleared. This explains why nodes can overlap.
>
> This patch changes the initial code to not initialize the distance for
> offline nodes. The distances will be updated when node will become online
> (on CPU hotplug) as it is already done.
>
> This patch has been tested on powerpc but not on the other architectures.
> They are impacted because the modified part is in the common code.
> All comments are welcome (how to move the change to powerpc specific code
> or if the other architectures can work with this change).
>
> Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> Cc: Suravee Suthikulpanit <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: David Gibson <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Nathan Fontenot <[email protected]>
> Cc: Michael Bringmann <[email protected]>
> Cc: [email protected]
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Laurent Vivier <[email protected]>
> ---
>
> Notes:
> v3: fix the root cause of the problem (sched numa mask initialization)
> v2: add scheduler maintainers in the CC: list
>
> kernel/sched/topology.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
> return;
>
> sched_domains_numa_masks[i][j] = mask;
> + if (!node_state(j, N_ONLINE))
> + continue;
>
> - for_each_node(k) {
> + for_each_online_node(k) {
> if (node_distance(j, k) > sched_domains_numa_distance[i])
> continue;
>
>

Another way to avoid the nodes overlapping for the offline nodes at
startup is to ensure the default values don't define a distance that
merge all offline nodes into node 0.

A powerpc specific patch can workaround the kernel crash by doing this:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87f0dd0..3ba29bb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
struct device_node *memory;
int default_nid = 0;
unsigned long i;
+ int nid, dist;

if (numa_enabled == 0) {
printk(KERN_WARNING "NUMA disabled by user\n");
@@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)

dbg("NUMA associativity depth for CPU/Memory: %d\n",
min_common_depth);

+ for (nid = 0; nid < MAX_NUMNODES; nid ++)
+ for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
+ distance_lookup_table[nid][dist] = nid;
+
/*
* Even though we connect cpus to numa domains later in SMP
* init, we need to know the node ids now. This is because

Any comment?

If this is not the good way to do, does someone have a better idea how
to fix the kernel crash?

Thanks,
Laurent

2019-03-15 12:26:58

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

On Fri, Mar 15, 2019 at 12:12:45PM +0100, Laurent Vivier wrote:

> Another way to avoid the nodes overlapping for the offline nodes at
> startup is to ensure the default values don't define a distance that
> merge all offline nodes into node 0.
>
> A powerpc specific patch can workaround the kernel crash by doing this:
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 87f0dd0..3ba29bb 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
> struct device_node *memory;
> int default_nid = 0;
> unsigned long i;
> + int nid, dist;
>
> if (numa_enabled == 0) {
> printk(KERN_WARNING "NUMA disabled by user\n");
> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
>
> dbg("NUMA associativity depth for CPU/Memory: %d\n",
> min_common_depth);
>
> + for (nid = 0; nid < MAX_NUMNODES; nid ++)
> + for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
> + distance_lookup_table[nid][dist] = nid;
> +
> /*
> * Even though we connect cpus to numa domains later in SMP
> * init, we need to know the node ids now. This is because

What does that actually do? That is, what does it make the distance
table look like before and after you bring up the CPUs?

> Any comment?

Well, I had a few questions here:

[email protected]

that I've not yet seen answers to.

2019-03-15 13:06:08

by Laurent Vivier

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

On 15/03/2019 13:25, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 12:12:45PM +0100, Laurent Vivier wrote:
>
>> Another way to avoid the nodes overlapping for the offline nodes at
>> startup is to ensure the default values don't define a distance that
>> merge all offline nodes into node 0.
>>
>> A powerpc specific patch can workaround the kernel crash by doing this:
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 87f0dd0..3ba29bb 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
>> struct device_node *memory;
>> int default_nid = 0;
>> unsigned long i;
>> + int nid, dist;
>>
>> if (numa_enabled == 0) {
>> printk(KERN_WARNING "NUMA disabled by user\n");
>> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
>>
>> dbg("NUMA associativity depth for CPU/Memory: %d\n",
>> min_common_depth);
>>
>> + for (nid = 0; nid < MAX_NUMNODES; nid ++)
>> + for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
>> + distance_lookup_table[nid][dist] = nid;
>> +
>> /*
>> * Even though we connect cpus to numa domains later in SMP
>> * init, we need to know the node ids now. This is because
>
> What does that actually do? That is, what does it make the distance
> table look like before and after you bring up the CPUs?

By default the table is full of 0. When a CPU is brought up the value is
read from the device-tree and the table is updated. What I've seen is
this value is common for 2 nodes at a given level if they share the level.
So as the table is initialized with 0, all offline nodes (no memory no
cpu) are merged with node 0.
My fix initializes the table with unique values for each node, so by
default no nodes are mixed.

>
>> Any comment?
>
> Well, I had a few questions here:
>
> [email protected]
>
> that I've not yet seen answers to.

I didn't answer because:

- I thought this was not the good way to fix the problem as you said "it
seems very fragile and unfortunate",

- I don't have the answers, I'd really like someone from IBM that knows
well the NUMA part of powerpc answers to these questions... and perhaps
find a better solution.

Thanks,
Laurent

2019-03-18 10:50:04

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

> > node 0 (because firmware doesn't provide the distance information for
> > memoryless/cpuless nodes):
> >
> > node 0 1 2 3
> > 0: 10 40 10 10
> > 1: 40 10 40 40
> > 2: 10 40 10 10
> > 3: 10 40 10 10
>
> *groan*... what does it do for things like percpu memory? ISTR the
> per-cpu chunks are all allocated early too. Having them all use memory
> out of node-0 would seem sub-optimal.

In the specific failing case, there is only one node with memory; all other
nodes are cpu only nodes.

However in the generic case since its just a cpu hotplug ops, the memory
allocated for per-cpu chunks allocated early would remain.

May be Michael Ellerman can correct me here.

>
> > We should have:
> >
> > node 0 1 2 3
> > 0: 10 40 40 40
> > 1: 40 10 40 40
> > 2: 40 40 10 40
> > 3: 40 40 40 10
>
> Can it happen that it introduces a new distance in the table? One that
> hasn't been seen before? This example only has 10 and 40, but suppose
> the new node lands at distance 20 (or 80); can such a thing happen?
>
> If not; why not?

Yes distances can be 20, 40 or 80. There is nothing that makes the node
distance to be 40 always.

> So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
> but that in turn relies on the sched_domain_numa_levels thing to stay
> accurate.
>
> This all seems very fragile and unfortunate.
>

Any reasons why this is fragile?

--
Thanks and Regards
Srikar Dronamraju

2019-03-18 11:08:40

by Srikar Dronamraju

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

* Laurent Vivier <[email protected]> [2019-03-15 12:12:45]:

>
> Another way to avoid the nodes overlapping for the offline nodes at
> startup is to ensure the default values don't define a distance that
> merge all offline nodes into node 0.
>
> A powerpc specific patch can workaround the kernel crash by doing this:
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 87f0dd0..3ba29bb 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
> struct device_node *memory;
> int default_nid = 0;
> unsigned long i;
> + int nid, dist;
>
> if (numa_enabled == 0) {
> printk(KERN_WARNING "NUMA disabled by user\n");
> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
>
> dbg("NUMA associativity depth for CPU/Memory: %d\n",
> min_common_depth);
>
> + for (nid = 0; nid < MAX_NUMNODES; nid ++)
> + for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
> + distance_lookup_table[nid][dist] = nid;
> +

The only reason, this would have worked in the specific case, is because we
are overriding the distance_lookup_table with a unique distance.
So node_distance for any other node other than itself will return max
distance which is 40 in this case. (since distance_ref_points_depth is 2)

I am not sure if this will work if the node distance between the two nodes
happens to be 20.

> /*
> * Even though we connect cpus to numa domains later in SMP
> * init, we need to know the node ids now. This is because
>

--
Thanks and Regards
Srikar Dronamraju

2019-03-18 11:28:40

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

On Mon, Mar 18, 2019 at 04:17:30PM +0530, Srikar Dronamraju wrote:
> > > node 0 (because firmware doesn't provide the distance information for
> > > memoryless/cpuless nodes):
> > >
> > > node 0 1 2 3
> > > 0: 10 40 10 10
> > > 1: 40 10 40 40
> > > 2: 10 40 10 10
> > > 3: 10 40 10 10
> >
> > *groan*... what does it do for things like percpu memory? ISTR the
> > per-cpu chunks are all allocated early too. Having them all use memory
> > out of node-0 would seem sub-optimal.
>
> In the specific failing case, there is only one node with memory; all other
> nodes are cpu only nodes.
>
> However in the generic case since its just a cpu hotplug ops, the memory
> allocated for per-cpu chunks allocated early would remain.

What do you do in the case where there's multiple nodes with memory, but
only one with CPUs on?

Do you then still allocate the per-cpu memory for the CPUs that will
appear on that second node on node0?

> > > We should have:
> > >
> > > node 0 1 2 3
> > > 0: 10 40 40 40
> > > 1: 40 10 40 40
> > > 2: 40 40 10 40
> > > 3: 40 40 40 10
> >
> > Can it happen that it introduces a new distance in the table? One that
> > hasn't been seen before? This example only has 10 and 40, but suppose
> > the new node lands at distance 20 (or 80); can such a thing happen?
> >
> > If not; why not?
>
> Yes distances can be 20, 40 or 80. There is nothing that makes the node
> distance to be 40 always.

This,

> > So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
> > but that in turn relies on the sched_domain_numa_levels thing to stay
> > accurate.
> >
> > This all seems very fragile and unfortunate.
> >
>
> Any reasons why this is fragile?

breaks that patch. The code assumes all the numa distances are known at
boot. If you add distances later, it comes unstuck.

It's not like you're actually changing the interconnects around at
runtime. Node topology really should be known at boot time.

What I _think_ the x86 BIOS does is, for each empty socket, iterate as
many logical CPUs (non-present) as it finds on Socket-0 (or whatever
socket is the boot socket).

Those non-present CPUs are assigned to their respective nodes. And
if/when a physical CPU is placed on the socket and the CPUs onlined, it
all 'works' (see ACPI SRAT).

I'm not entirely sure what happens on x86 when it boots with say a
10-core part and you then fill an empty socket with a 20-core part, I
suspect we simply will not use more than 10, we'll not have space
reserved in the Linux cpumasks for them anyway.