When we hotplug a CPU into a memoryless/cpuless node, the kernel crashes while
rebuilding the sched_domains data.
I can reproduce this problem on POWER with a pseries VM, using the following
QEMU parameters:
-machine pseries -enable-kvm -m 8192 \
-smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
-numa node,nodeid=0,cpus=0-1,mem=0 \
-numa node,nodeid=1,cpus=2-3,mem=8192 \
-numa node,nodeid=2,cpus=4-5,mem=0 \
-numa node,nodeid=3,cpus=6-7,mem=0
Then I can trigger the crash by hotplugging a CPU on node-id 3:
(qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
Built 2 zonelists, mobility grouping on. Total pages: 130162
Policy zone: Normal
WARNING: workqueue cpumask: online intersect > possible intersect
BUG: Kernel NULL pointer dereference at 0x00000400
Faulting instruction address: 0xc000000000170edc
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
Workqueue: events cpuset_hotplug_workfn
NIP: c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
REGS: c000000003e931a0 TRAP: 0380 Not tainted (5.0.0-rc6+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22284028 XER: 00000000
CFAR: c000000000170f20 IRQMASK: 0
GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
LR [c000000000170f98] destroy_sched_domain+0x28/0x90
Call Trace:
[c000000003e93430] [000000000000102f] 0x102f (unreliable)
[c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
[c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
[c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
[c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
[c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
[c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
[c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
[c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
[c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
[c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
[c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
---[ end trace f992c4a7d47d602a ]---
Kernel panic - not syncing: Fatal exception
This happens in free_sched_groups() because the sched_groups linked list is
corrupted. Here is what happens when we hotplug the CPU:
- build_sched_groups() builds a sched_groups linked list for sched_domain D1,
with only one entry A, refcount=1
D1: A(ref=1)
- build_sched_groups() builds a sched_groups linked list for sched_domain D2,
with the same entry A
D2: A(ref=2)
- build_sched_groups() builds a sched_groups linked list for sched_domain D3,
with the same entry A and a new entry B:
D3: A(ref=3) -> B(ref=1)
- destroy_sched_domain() is called for D1:
  D1: A(ref=3) -> B(ref=1)
  B's refcount drops from 1 to 0 (and A's drops to 2), so the memory of B is
  released, but A->next still points to B
- destroy_sched_domain() is called for D3:
  D3: A(ref=2) -> B(ref=0)
  The kernel crashes when it tries to use the data inside B: the memory has
  already been freed (and possibly reused), so the linked list (next pointer)
  is broken.
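To see why the stale A->next is fatal, here is roughly the shape of the list
walk done by free_sched_groups() (paraphrased from memory of
kernel/sched/topology.c, not a verbatim copy): once B has been freed through
another domain's destruction, the do/while loop still follows A->next into the
freed memory.

	static void free_sched_groups(struct sched_group *sg, int free_sgc)
	{
		struct sched_group *tmp, *first;

		if (!sg)
			return;

		first = sg;
		do {
			/* remember the next entry before sg may be freed */
			tmp = sg->next;

			if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
				kfree(sg->sgc);

			/* last reference dropped: free the group ... */
			if (atomic_dec_and_test(&sg->ref))
				kfree(sg);

			/* ... but another list's ->next may still point here */
			sg = tmp;
		} while (sg != first);
	}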
This problem appears with commit 051f3ca02e46
("sched/topology: Introduce NUMA identity node sched domain").
If I compare the sequence of function calls before and after this commit, I can
see that in the working case build_overlap_sched_groups() is called instead of
build_sched_groups(); in that case the reference counters all have the same
value and the linked list can be freed correctly.
The problem happens because the patch "sched/topology: Introduce NUMA
identity node sched domain" removed the SDTL_OVERLAP flag from the first
topology level when it introduced the NODE domain (and thus
build_sched_groups() is used instead of build_overlap_sched_groups()).
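For reference, as far as I remember, SDTL_OVERLAP on a topology level is what
makes sd_init() set SD_OVERLAP on the resulting sched_domain, and SD_OVERLAP
is what selects the overlap variant when the groups are built. From memory,
the dispatch in build_sched_domains() looks roughly like this (paraphrased,
not a verbatim copy of kernel/sched/topology.c):

	/* Build the groups for the domains */
	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			sd->span_weight = cpumask_weight(sched_domain_span(sd));
			if (sd->flags & SD_OVERLAP) {
				if (build_overlap_sched_groups(sd, i))
					goto error;
			} else {
				if (build_sched_groups(sd, i))
					goto error;
			}
		}
	}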
As I don't see any reason to remove this flag (and the removal is not
documented in that commit), this patch re-introduces the SDTL_OVERLAP flag for
the first level. This fixes the problem described above, and a CPU can again be
hotplugged without crashing the kernel.
Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Gibson <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nathan Fontenot <[email protected]>
Cc: Michael Bringmann <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Laurent Vivier <[email protected]>
---
Notes:
v2: add scheduler maintainers to the Cc: list
kernel/sched/topology.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3f35ba1d8fde..372278605f0d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1651,6 +1651,7 @@ void sched_init_numa(void)
*/
tl[i++] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
+ .flags = SDTL_OVERLAP,
.numa_level = 0,
SD_INIT_NAME(NODE)
};
--
2.20.1
On Wed, Feb 20, 2019 at 05:55:20PM +0100, Laurent Vivier wrote:
> index 3f35ba1d8fde..372278605f0d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1651,6 +1651,7 @@ void sched_init_numa(void)
> */
> tl[i++] = (struct sched_domain_topology_level){
> .mask = sd_numa_mask,
> + .flags = SDTL_OVERLAP,
This makes no sense whatsoever. The NUMA identity node should not have
overlap with other domains.
Are you sure this is not because of the utterly broken powerpc nonsense
where they move CPUs between nodes?
> .numa_level = 0,
> SD_INIT_NAME(NODE)
> };
> --
> 2.20.1
>
On 20/02/2019 18:08, Peter Zijlstra wrote:
> On Wed, Feb 20, 2019 at 05:55:20PM +0100, Laurent Vivier wrote:
>> index 3f35ba1d8fde..372278605f0d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1651,6 +1651,7 @@ void sched_init_numa(void)
>> */
>> tl[i++] = (struct sched_domain_topology_level){
>> .mask = sd_numa_mask,
>> + .flags = SDTL_OVERLAP,
>
> This makes no sense whatsoever. The NUMA identity node should not have
> overlap with other domains.
>
> Are you sure this is not because of the utterly broken powerpc nonsense
> where they move CPUs between nodes?
No, I'm not sure. This is why I've Cc'ed the powerpc folks. My conclusion is
based only on comparing the behaviour before and after the commit.
I've tested some patches from the powerpc ML, but they don't fix this problem:
powerpc/numa: Perform full re-add of CPU for PRRN/VPHN topology update
powerpc/pseries: Perform full re-add of CPU for topology update post-migration
So the only reason I can see for a corrupted sched_group list is that the
sched_domain_span() function doesn't return a correct cpumask for the domain
once a new CPU is added (see the debug sketch below).
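If it helps to check that theory, here is a hypothetical debug helper (not
part of the patch; it assumes it is built somewhere inside kernel/sched/ so
that for_each_domain() and sched_domain_span() are visible) that dumps the
span of every domain level of one CPU, so the spans can be compared before and
after the hotplug:

	/* Hypothetical debug aid: print the cpumask spanned by each
	 * sched_domain level of the given CPU. */
	static void dump_domain_spans(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd)
			pr_info("CPU%d: span=%*pbl\n", cpu,
				cpumask_pr_args(sched_domain_span(sd)));
		rcu_read_unlock();
	}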
Thanks,
Laurent