From: Pingfan Liu
To: x86@kernel.org
Cc: Pingfan Liu, Michal Hocko, Dave Hansen, Mike Rapoport, Tony Luck,
 Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, "H. Peter Anvin", Andrew Morton, Vlastimil Babka,
 Oscar Salvador, Pavel Tatashin, Mel Gorman, Benjamin Herrenschmidt,
 Michael Ellerman, Stephen Rothwell, Qian Cai, Barret Rhoden,
 Bjorn Helgaas, David Rientjes, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH 2/2] x86/numa: instance all parsed numa node
Date: Fri, 5 Jul 2019 12:15:43 +0800
Message-Id: <1562300143-11671-2-git-send-email-kernelfans@gmail.com>
X-Mailer: git-send-email 2.7.5
In-Reply-To: <1562300143-11671-1-git-send-email-kernelfans@gmail.com>
References: <1562300143-11671-1-git-send-email-kernelfans@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

I hit a bug on an AMD machine when using kexec -l with the nr_cpus=4
option. The nr_cpus option is used to speed up the kdump process, so this
is not a rare case.

It turns out that some pgdats are not instantiated when nr_cpus is
specified; e.g., on x86 they are not initialized by
init_cpu_to_node()->init_memory_less_node().
But device->numa_node is still used as the preferred_nid parameter of
__alloc_pages_nodemask(), which causes a NULL dereference through
  ac->zonelist = node_zonelist(preferred_nid, gfp_mask);

Although this bug was detected on x86, it should affect all archs: on a
machine with a memory-less NUMA node, if nr_cpus prevents the node from
being instantiated, a device on that node that tries to allocate memory
using device->numa_node will hit it.

This patch fixes it by instantiating all parsed NUMA nodes on x86 (for
more detail, please refer to sections I and II).

I. Notes about the crash info:
-1 kexec -l with nr_cpus=4
-2 system info
  NUMA node0 CPU(s):     0,8,16,24
  NUMA node1 CPU(s):     2,10,18,26
  NUMA node2 CPU(s):     4,12,20,28
  NUMA node3 CPU(s):     6,14,22,30
  NUMA node4 CPU(s):     1,9,17,25
  NUMA node5 CPU(s):     3,11,19,27
  NUMA node6 CPU(s):     5,13,21,29
  NUMA node7 CPU(s):     7,15,23,31
-3 panic stack
[...]
[    5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[    5.729187] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[    5.735187] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[    5.741168] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[    5.747189] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[    5.754061] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[    5.760727] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[    5.766955] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[    5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[    5.773618] PGD 0 P4D 0
[    5.773618] Oops: 0000 [#1] SMP NOPTI
[    5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[    5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    5.773618] Call Trace:
[    5.773618]  new_slab+0xa9/0x570
[    5.773618]  ___slab_alloc+0x375/0x540
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  __slab_alloc+0x1c/0x38
[    5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  devm_kmalloc+0x28/0x60
[    5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  really_probe+0x73/0x420
[    5.773618]  driver_probe_device+0x115/0x130
[    5.773618]  __driver_attach+0x103/0x110
[    5.773618]  ? driver_probe_device+0x130/0x130
[    5.773618]  bus_for_each_dev+0x67/0xc0
[    5.773618]  ? klist_add_tail+0x3b/0x70
[    5.773618]  bus_add_driver+0x41/0x260
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  driver_register+0x5b/0xe0
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  do_one_initcall+0x4e/0x1d4
[    5.773618]  ? init_setup+0x25/0x28
[    5.773618]  kernel_init_freeable+0x1c1/0x26e
[    5.773618]  ? loglevel+0x5b/0x5b
[    5.773618]  ? rest_init+0xb0/0xb0
[    5.773618]  kernel_init+0xa/0x110
[    5.773618]  ret_from_fork+0x22/0x40
[    5.773618] Modules linked in:
[    5.773618] CR2: 0000000000002088
[    5.773618] ---[ end trace 1030c9120a03d081 ]---
[...]
-4 other notes about the reproduction of this bug:
On my test machine, this bug is masked by commit 0d76bcc960e6 ("Revert
"ACPI/PCI: Pay attention to device-specific _PXM node values""), but the
problem caused by dev->numa_node can still be exposed through other paths.

II. History

I made an initial attempt in [1], which deferred the instantiation of
offline nodes. Later Michal suggested a fix [2], which considers only
nodes with memory as online. Besides fixing this bug, that patch also
aimed at excluding memory-less nodes as candidates when iterating the
zones. Unfortunately, that method conflicts with the scheduler code, which
assumes that a node with CPUs is online too. You can find the breakage
with "git grep for_each_online_node | grep sched" or in the discussion at
the tail of [3].

Since Michal has no time to continue with this issue, I am picking it up
again. This patch drops the change to the "node online" definition in [2],
i.e. a node is still considered online if it has either CPUs or memory,
and keeps the remaining main idea of [2]: initializing all parsed nodes on
x86. Other archs need extra dedicated effort.

[1]: https://patchwork.kernel.org/patch/10738733/
[2]: https://lkml.org/lkml/2019/2/13/253
[3]: https://lore.kernel.org/lkml/20190528182011.GG1658@dhcp22.suse.cz/T/

Signed-off-by: Pingfan Liu
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Mike Rapoport
Cc: Tony Luck
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Mel Gorman
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Stephen Rothwell
Cc: Qian Cai
Cc: Barret Rhoden
Cc: Bjorn Helgaas
Cc: David Rientjes
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 17 ++++++++++++-----
 mm/page_alloc.c    | 11 ++++++++---
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b48d507..5f5b558 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -732,6 +732,15 @@ static void __init init_memory_less_node(int nid)
 	 */
 }
 
+static void __init init_parsed_rest_node(void)
+{
+	int node;
+
+	for_each_node_mask(node, node_possible_map)
+		if (!node_online(node))
+			init_memory_less_node(node);
+}
+
 /*
  * Setup early cpu_to_node.
  *
@@ -752,6 +761,7 @@ void __init init_cpu_to_node(void)
 	u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
 
 	BUG_ON(cpu_to_apicid == NULL);
+	init_parsed_rest_node();
 
 	for_each_possible_cpu(cpu) {
 		int node = numa_cpu_node(cpu);
@@ -759,11 +769,8 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node)) {
-			init_memory_less_node(node);
-			node_set_online(nid);
-		}
-
+		if (!node_online(node))
+			node_set_online(node);
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8a..5d8db00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,10 +5662,15 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/* In rare case, node_zonelist() hits offline node */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
-
-			build_zonelists(pgdat);
+			/*
+			 * This condition can be removed on archs, with all
+			 * possible node instanced.
+			 */
+			if (pgdat)
+				build_zonelists(pgdat);
 		}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
--
2.7.5