Subject: Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node
To: linux-kernel@vger.kernel.org
Cc: Suravee Suthikulpanit, Srikar Dronamraju, Borislav Petkov,
    David Gibson, Michael Ellerman, Nathan Fontenot, Michael Bringmann,
    linuxppc-dev@lists.ozlabs.org, Ingo Molnar, Peter Zijlstra
References: <20190304195952.16879-1-lvivier@redhat.com>
From: Laurent Vivier
Message-ID: <77b142fe-0886-1510-28bd-d432ea2c796a@redhat.com>
Date: Fri, 15 Mar 2019 12:12:45 +0100
In-Reply-To: <20190304195952.16879-1-lvivier@redhat.com>

On 04/03/2019 20:59, Laurent Vivier wrote:
> When a CPU is hotplugged in a memoryless/cpuless node, the kernel
> crashes while it rebuilds the sched_domains data.
>
> I can reproduce this problem on POWER with a pseries VM, using the
> following QEMU parameters:
>
>   -machine pseries -enable-kvm -m 8192 \
>   -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
>   -numa node,nodeid=0,cpus=0-1,mem=0 \
>   -numa node,nodeid=1,cpus=2-3,mem=8192 \
>   -numa node,nodeid=2,cpus=4-5,mem=0 \
>   -numa node,nodeid=3,cpus=6-7,mem=0
>
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
>
>   (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
>
> Built 2 zonelists, mobility grouping on.  Total pages: 130162
> Policy zone: Normal
> WARNING: workqueue cpumask: online intersect > possible intersect
> BUG: Kernel NULL pointer dereference at 0x00000400
> Faulting instruction address: 0xc000000000170edc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
> Workqueue: events cpuset_hotplug_workfn
> NIP: c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
> REGS: c000000003e931a0 TRAP: 0380  Not tainted (5.0.0-rc6+)
> MSR: 8000000000009033 CR: 22284028 XER: 00000000
> CFAR: c000000000170f20 IRQMASK: 0
> GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
> GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
> GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
> GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
> GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
> GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
> GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
> GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
> NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
> LR [c000000000170f98] destroy_sched_domain+0x28/0x90
> Call Trace:
> [c000000003e93430] [000000000000102f] 0x102f (unreliable)
> [c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
> [c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
> [c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
> [c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
> [c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
> [c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
> [c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
> [c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
> [c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
> [c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
> [c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> 2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
> ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
> ---[ end trace f992c4a7d47d602a ]---
>
> Kernel panic - not syncing: Fatal exception
>
> The crash happens in free_sched_groups() because the linked list of
> sched_groups is corrupted. Here is what happens when we hotplug the CPU:
>
> - build_sched_groups() builds a sched_groups linked list for
>   sched_domain D1, with a single entry A, refcount=1:
>
>   D1: A(ref=1)
>
> - build_sched_groups() builds a sched_groups linked list for
>   sched_domain D2, reusing the same entry A:
>
>   D2: A(ref=2)
>
> - build_sched_groups() builds a sched_groups linked list for
>   sched_domain D3, reusing entry A and adding a new entry B:
>
>   D3: A(ref=3) -> B(ref=1)
>
> - destroy_sched_domain() is called for D1:
>
>   D1: A(ref=3) -> B(ref=1)
>
>   As B's refcount is 1, it drops to 0 and the memory of B is released,
>   but A->next still points to B.
>
> - destroy_sched_domain() is called for D3:
>
>   D3: A(ref=2) -> B(ref=0)
>
>   The kernel crashes when it tries to use the data inside B: the memory
>   has already been freed, so the linked list (next) is broken too.
>
> This problem appears with commit 051f3ca02e46
> ("sched/topology: Introduce NUMA identity node sched domain").
>
> Comparing the function call sequences before and after this commit
> shows that in the working case build_overlap_sched_groups() is called
> instead of build_sched_groups(); in that case the reference counters
> all have the same value and the linked list can be deallocated
> correctly. The commit introduced the node domain, and on powerpc the
> node domains can overlap, which must not happen.
>
> They overlap because the powerpc code initially computes the
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (because the firmware doesn't provide distance information for
> memoryless/cpuless nodes):
>
>   node   0   1   2   3
>     0:  10  40  10  10
>     1:  40  10  40  40
>     2:  10  40  10  10
>     3:  10  40  10  10
>
> We should have:
>
>   node   0   1   2   3
>     0:  10  40  40  40
>     1:  40  10  40  40
>     2:  40  40  10  40
>     3:  40  40  40  10
>
> Once a new CPU is added and its node is onlined, the NUMA masks are
> updated, but the initially set bits are not cleared. This explains why
> the node domains can overlap.
>
> This patch changes the initialization code to not set up distances for
> offline nodes. The distances are updated when a node comes online (on
> CPU hotplug), as is already done today.
>
> This patch has been tested on powerpc but not on the other
> architectures. They are affected too, because the modified code is
> common. All comments are welcome (on how to move the change into
> powerpc-specific code, or on whether the other architectures can work
> with this change).
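To make the reference counting walkthrough above easier to follow, here
is a minimal userspace model of the corruption (my own simplified
types, not the kernel's; free_sched_groups() is only loosely mirrored).
Built with "gcc -fsanitize=address model.c", the second destroy reports
the use-after-free described above:

/*
 * Userspace model of the shared sched_group list corruption.
 * The refcount values follow the D1/D2/D3 walkthrough above.
 */
#include <stdlib.h>

struct group {                  /* stands in for struct sched_group */
        struct group *next;     /* circular list, as in the kernel  */
        int ref;
};

/* walk the circular list once, drop one reference per group and free
 * a group when its count reaches zero, like free_sched_groups() */
static void put_groups(struct group *first)
{
        struct group *sg = first, *tmp;

        do {
                tmp = sg->next;
                if (--sg->ref == 0)
                        free(sg);
                sg = tmp;
        } while (sg != first);
}

int main(void)
{
        struct group *a = calloc(1, sizeof(*a));
        struct group *b = calloc(1, sizeof(*b));

        /* D1 and D2 are built: both lists are just A */
        a->next = a;
        a->ref = 2;

        /* D3 is built: A is reused and B is linked behind it,
         * A(ref=3) -> B(ref=1); this rewrites A->next for D1 too */
        a->ref++;
        a->next = b;
        b->next = a;
        b->ref = 1;

        put_groups(a);  /* destroy D1: B drops to 0 and is freed,
                           but A->next still points to it */
        put_groups(a);  /* destroy D3: walks into the freed B --
                           the use-after-free that crashes the kernel */
        return 0;
}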
>
> Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> Cc: Suravee Suthikulpanit
> Cc: Srikar Dronamraju
> Cc: Borislav Petkov
> Cc: David Gibson
> Cc: Michael Ellerman
> Cc: Nathan Fontenot
> Cc: Michael Bringmann
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Signed-off-by: Laurent Vivier
> ---
>
> Notes:
>     v3: fix the root cause of the problem (sched NUMA mask initialization)
>     v2: add the scheduler maintainers to the Cc: list
>
>  kernel/sched/topology.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
>  				return;
>
>  			sched_domains_numa_masks[i][j] = mask;
> +			if (!node_state(j, N_ONLINE))
> +				continue;
>
> -			for_each_node(k) {
> +			for_each_online_node(k) {
>  				if (node_distance(j, k) > sched_domains_numa_distance[i])
>  					continue;
>

Another way to avoid the offline nodes overlapping at startup is to
ensure the default values don't define a distance that merges all
offline nodes into node 0. A powerpc-specific patch can work around the
kernel crash by doing this:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87f0dd0..3ba29bb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	int nid, dist;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
 
 	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
 
+	for (nid = 0; nid < MAX_NUMNODES; nid++)
+		for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
+			distance_lookup_table[nid][dist] = nid;
+
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
 	 * init, we need to know the node ids now. This is because

Any comments? If this is not the right way to do it, does someone have
a better idea of how to fix the kernel crash?

Thanks,
Laurent
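P.S. For anyone who wants to look at the mask construction without a
pseries VM, below is a small userspace sketch of the per-level loop in
sched_init_numa(), using the two 4-node distance tables from the patch
description (the names and the bitmask representation are mine, not the
kernel's). With the broken table, nodes 0, 2 and 3 all end up with the
same mask at the identity level, i.e. they overlap; with the expected
table, each node only sees itself:

#include <stdio.h>

#define NR_NODES 4

/* what powerpc initially reports: offline nodes 2 and 3 look merged
 * with node 0 */
static const int dist_broken[NR_NODES][NR_NODES] = {
        { 10, 40, 10, 10 },
        { 40, 10, 40, 40 },
        { 10, 40, 10, 10 },
        { 10, 40, 10, 10 },
};

/* what we should have once every node is online */
static const int dist_fixed[NR_NODES][NR_NODES] = {
        { 10, 40, 40, 40 },
        { 40, 10, 40, 40 },
        { 40, 40, 10, 40 },
        { 40, 40, 40, 10 },
};

/* bitmask of the nodes within 'max_dist' of node j -- the inner
 * for_each_node(k) loop of sched_init_numa(), with a bitmask standing
 * in for the cpumask */
static unsigned int numa_mask(const int d[NR_NODES][NR_NODES],
                              int j, int max_dist)
{
        unsigned int mask = 0;
        int k;

        for (k = 0; k < NR_NODES; k++)
                if (d[j][k] <= max_dist)
                        mask |= 1u << k;
        return mask;
}

int main(void)
{
        int j;

        /* level 0 is the identity distance (10): the masks must not
         * overlap there */
        for (j = 0; j < NR_NODES; j++)
                printf("node %d: broken=0x%x fixed=0x%x\n", j,
                       numa_mask(dist_broken, j, 10),
                       numa_mask(dist_fixed, j, 10));
        /* broken: nodes 0, 2 and 3 all get 0xd (overlap);
           fixed:  0x1, 0x2, 0x4, 0x8 */
        return 0;
}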