Subject: Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
To: Michal Hocko, Bjorn Helgaas
CC: Will Deacon, Catalin Marinas, Greg Kroah-Hartman, "Rafael J. Wysocki", Jarkko Sakkinen, linux-arm, Linux Kernel Mailing List, Andrew Morton
References: <1527768879-88161-1-git-send-email-xiexiuqi@huawei.com> <1527768879-88161-2-git-send-email-xiexiuqi@huawei.com> <20180606154516.GL6631@arm.com> <20180607105514.GA13139@dhcp22.suse.cz>
From: Hanjun Guo
Message-ID: <5ed798a0-6c9c-086e-e5e8-906f593ca33e@huawei.com>
Date: Thu, 7 Jun 2018 19:55:53 +0800
In-Reply-To: <20180607105514.GA13139@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2018/6/7 18:55, Michal Hocko wrote:
> On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
>> [+cc akpm, linux-mm, linux-pci]
>>
>> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon wrote:
>>>
>>> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
>>>> A numa system may return node which is not online.
>>>> For example, a numa node:
>>>> 1) without memory
>>>> 2) NR_CPUS is very small, and the cpus on the node are not brought up
>>>>
>>>> In this situation, we use NUMA_NO_NODE to avoid oops.
>>>>
>>>> [ 25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
>>>> [ 25.740982] Mem abort info:
>>>> [ 25.743762] ESR = 0x96000005
>>>> [ 25.746803] Exception class = DABT (current EL), IL = 32 bits
>>>> [ 25.752711] SET = 0, FnV = 0
>>>> [ 25.755751] EA = 0, S1PTW = 0
>>>> [ 25.758878] Data abort info:
>>>> [ 25.761745] ISV = 0, ISS = 0x00000005
>>>> [ 25.765568] CM = 0, WnR = 0
>>>> [ 25.768521] [0000000000001988] user address but active_mm is swapper
>>>> [ 25.774861] Internal error: Oops: 96000005 [#1] SMP
>>>> [ 25.779724] Modules linked in:
>>>> [ 25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
>>>> [ 25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
>>>> [ 25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
>>>> [ 25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
>>>> [ 25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
>>>> [ 25.813252] sp : ffff00000996f660
>>>> [ 25.816553] x29: ffff00000996f660 x28: 0000000000000000
>>>> [ 25.821852] x27: 00000000014012c0 x26: 0000000000000000
>>>> [ 25.827150] x25: 0000000000000003 x24: ffff000008099eac
>>>> [ 25.832449] x23: 0000000000400000 x22: 0000000000000000
>>>> [ 25.837747] x21: 0000000000000001 x20: 0000000000000000
>>>> [ 25.843045] x19: 0000000000400000 x18: 0000000000010e00
>>>> [ 25.848343] x17: 000000000437f790 x16: 0000000000000020
>>>> [ 25.853641] x15: 0000000000000000 x14: 6549435020524541
>>>> [ 25.858939] x13: 20454d502067756c x12: 0000000000000000
>>>> [ 25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
>>>> [ 25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
>>>> [ 25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
>>>> [ 25.880132] x5 : 0000000000000000 x4 : 0000000000000081
>>>> [ 25.885430] x3 : 0000000000000000 x2 : 0000000000000000
>>>> [ 25.890728] x1 : 0000000000000001 x0 : 0000000000001980
>>>> [ 25.896027] Process swapper/0 (pid: 1, stack limit = 0x (ptrval))
>>>> [ 25.902712] Call trace:
>>>> [ 25.905146] __alloc_pages_nodemask+0xf0/0xe70
>>>> [ 25.909577] allocate_slab+0x94/0x590
>>>> [ 25.913225] new_slab+0x68/0xc8
>>>> [ 25.916353] ___slab_alloc+0x444/0x4f8
>>>> [ 25.920088] __slab_alloc+0x50/0x68
>>>> [ 25.923562] kmem_cache_alloc_node_trace+0xe8/0x230
>>>> [ 25.928426] pci_acpi_scan_root+0x94/0x278
>>>> [ 25.932510] acpi_pci_root_add+0x228/0x4b0
>>>> [ 25.936593] acpi_bus_attach+0x10c/0x218
>>>> [ 25.940501] acpi_bus_attach+0xac/0x218
>>>> [ 25.944323] acpi_bus_attach+0xac/0x218
>>>> [ 25.948144] acpi_bus_scan+0x5c/0xc0
>>>> [ 25.951708] acpi_scan_init+0xf8/0x254
>>>> [ 25.955443] acpi_init+0x310/0x37c
>>>> [ 25.958831] do_one_initcall+0x54/0x208
>>>> [ 25.962653] kernel_init_freeable+0x244/0x340
>>>> [ 25.966999] kernel_init+0x18/0x118
>>>> [ 25.970474] ret_from_fork+0x10/0x1c
>>>> [ 25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
>>>> [ 25.980162] ---[ end trace 64f0893eb21ec283 ]---
>>>> [ 25.984765] Kernel panic - not syncing: Fatal exception
>>>>
>>>> Signed-off-by: Xie XiuQi
>>>> Tested-by: Huiqiang Wang
>>>> Cc: Hanjun Guo
>>>> Cc: Tomasz Nowicki
>>>> Cc: Xishi Qiu
>>>> ---
>>>>  arch/arm64/kernel/pci.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
>>>> index 0e2ea1c..e17cc45 100644
>>>> --- a/arch/arm64/kernel/pci.c
>>>> +++ b/arch/arm64/kernel/pci.c
>>>> @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>>>>  	struct pci_bus *bus, *child;
>>>>  	struct acpi_pci_root_ops *root_ops;
>>>>
>>>> +	if (node != NUMA_NO_NODE && !node_online(node))
>>>> +		node = NUMA_NO_NODE;
>>>> +
>>>
>>> This really feels like a bodge, but it does appear to be what other
>>> architectures do, so:
>>>
>>> Acked-by: Will Deacon
>>
>> I agree, this doesn't feel like something we should be avoiding in the
>> caller of kzalloc_node().
>>
>> I would not expect kzalloc_node() to return memory that's offline, no
>> matter what node we told it to allocate from. I could imagine it
>> returning failure, or returning memory from a node that *is* online,
>> but returning a pointer to offline memory seems broken.
>>
>> Are we putting memory that's offline in the free list? I don't know
>> where to look to figure this out.
>
> I am not sure I have the full context but pci_acpi_scan_root calls
> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> and that should fall back to whatever node that is online. Offline node
> shouldn't keep any pages behind. So there must be something else going
> on here and the patch is not the right way to handle it. What does
> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?

The whole context is:

The system is booted with a NUMA node that has no memory attached to it
(a memory-less NUMA node), and with NR_CPUS smaller than the number of
CPUs present in the MADT. The CPUs on this memory-less node are therefore
not brought up, and the node is never onlined (although the SRAT does
describe it).

Devices attached to this NUMA node, such as the PCI host bridge, still
return a valid NUMA node via _PXM, but that node is not actually online,
which leads to this issue.

Thanks
Hanjun
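
For illustration only, here is a minimal userspace sketch of the check the quoted patch adds. It is not kernel code. The online-node bitmask, MAX_NODES, and the sketch_* helper names are assumptions made up for this example; only NUMA_NO_NODE (-1 in the kernel) and the shape of the condition come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)  /* same sentinel value the kernel uses */
#define MAX_NODES    8     /* hypothetical limit, for this sketch only */

/*
 * Hypothetical stand-in for the kernel's node_online():
 * bit n of online_mask set means node n is online.
 */
static bool sketch_node_online(unsigned int online_mask, int node)
{
	assert(node >= 0 && node < MAX_NODES);
	return (online_mask >> node) & 1u;
}

/*
 * Same condition as the quoted patch: a node reported by firmware
 * (for example via _PXM) is kept only if it is online; otherwise
 * the caller falls back to NUMA_NO_NODE ("no preference").
 */
static int sketch_sanitize_node(unsigned int online_mask, int node)
{
	if (node != NUMA_NO_NODE && !sketch_node_online(online_mask, node))
		node = NUMA_NO_NODE;
	return node;
}
```

With only node 0 online (mask 0x1), sketch_sanitize_node(0x1, 1) yields NUMA_NO_NODE, which corresponds to the case above where _PXM names a node that the SRAT describes but that was never onlined.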