From: Ben Hutchings
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: akpm@linux-foundation.org, Denis Kirjanov, "Anshuman Khandual", "Miroslav Benes", "Linus Torvalds", "Michal Hocko"
Date: Tue, 02 Apr 2019 14:38:28 +0100
Subject: [PATCH 3.16 89/99] mm, memory_hotplug: do not clear numa_node association after hot_remove
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
3.16.65-rc1 review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko

commit 46a3679b8190101e4ebdfe252ef79e6150a4f2ac upstream.

Per-cpu numa_node provides a default node for each possible cpu.  The
association gets initialized during boot, when the architecture-specific
code explores cpu->NUMA affinity.  When the whole NUMA node is removed,
though, we clear this association:

	try_offline_node
	  check_and_unmap_cpu_on_node
	    unmap_cpu_on_node
	      numa_clear_node
	        numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE.  This is problematic for two reasons.  First,
it is fragile, because __alloc_pages_node would simply blow up on an
out-of-bounds access.  We have encountered this when loading the kvm
module:

  BUG: unable to handle kernel paging request at 00000000000021c0
  IP: __alloc_pages_nodemask+0x93/0xb70
  PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
  Oops: 0000 [#1] SMP
  [...]
  CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W    4.4.156-94.64-default #1
  RIP: __alloc_pages_nodemask+0x93/0xb70
  RSP: 0018:ffff887354493b40 EFLAGS: 00010202
  RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
  RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
  R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
  R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
  FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
   hardware_setup+0x781/0x849 [kvm_intel]
   kvm_arch_hardware_setup+0x28/0x190 [kvm]
   kvm_init+0x7c/0x2d0 [kvm]
   vmx_init+0x1e/0x32c [kvm_intel]
   do_one_initcall+0xca/0x1f0
   do_init_module+0x5a/0x1d7
   load_module+0x1393/0x1c90
   SYSC_finit_module+0x70/0xa0
   entry_SYSCALL_64_fastpath+0x1e/0xb7
  DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

This was on an older kernel, but the code is basically the same in the
current Linus tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask,
which would recognize NUMA_NO_NODE, or alloc_pages_node, which would
translate it to numa_mem_id, but that is wrong as well, because it would
use the cpu affinity of the local CPU, which might be quite far from the
original node.  It is also reasonable to expect that cpu_to_node will
provide a sane value, and there might be many more callers like that.

The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined.  We do not
want to lose that link, as there is no arch-independent way to get it from
the early boot time AFAICS.

Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues.  NODE_DATA(nid) is not deallocated, so it
will stay in place, and if anybody wants to allocate from that node then a
fallback node will be used.

Thanks to Vlastimil Babka for his live-system debugging skills that helped
debug the issue.
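As a rough illustration of the first problem, here is a stand-alone,
user-space sketch (not kernel code; cpu_node[], node_data[] and
MAX_NUMNODES below are simplified stand-ins for the per-cpu numa_node
default, NODE_DATA() and the real node count): once the association has
been cleared, the NUMA_NO_NODE (-1) value returned by a cpu_to_node()-style
lookup becomes an out-of-bounds array index, which is the class of access
behind the oops above.

/* Stand-alone illustration; user space, not kernel code. */
#include <stdio.h>

#define MAX_NUMNODES	4
#define NUMA_NO_NODE	(-1)

/* Simplified stand-ins for the per-cpu numa_node default and NODE_DATA(). */
static int cpu_node[8] = { 0, 0, 1, 1, 2, 2, 3, 3 };
static const char *node_data[MAX_NUMNODES] = { "node0", "node1", "node2", "node3" };

static int cpu_to_node(int cpu)
{
	return cpu_node[cpu];
}

int main(void)
{
	int cpu = 5;
	int nid;

	/* What the removed unmap_cpu_on_node() effectively did for the
	 * cpus of a hot-removed node: clear the association. */
	cpu_node[cpu] = NUMA_NO_NODE;

	nid = cpu_to_node(cpu);
	if (nid < 0 || nid >= MAX_NUMNODES) {
		/* Without this check, node_data[nid] below would be
		 * node_data[-1]: an out-of-bounds access of the same kind
		 * as the paging fault at 0x21c0 in the oops above. */
		fprintf(stderr, "cpu%d has no node association\n", cpu);
		return 1;
	}
	printf("cpu%d -> %s\n", cpu, node_data[nid]);
	return 0;
}

With the association kept, as this patch does, cpu_to_node() keeps
returning a valid node id, and an allocation targeting the offlined node
simply falls back to another node.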
Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Signed-off-by: Michal Hocko
Debugged-by: Vlastimil Babka
Reported-by: Miroslav Benes
Acked-by: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Ben Hutchings
---
 mm/memory_hotplug.c | 30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1880,34 +1880,6 @@ static int check_cpu_on_node(pg_data_t *
 	return 0;
 }
 
-static void unmap_cpu_on_node(pg_data_t *pgdat)
-{
-#ifdef CONFIG_ACPI_NUMA
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		if (cpu_to_node(cpu) == pgdat->node_id)
-			numa_clear_node(cpu);
-#endif
-}
-
-static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
-{
-	int ret;
-
-	ret = check_cpu_on_node(pgdat);
-	if (ret)
-		return ret;
-
-	/*
-	 * the node will be offlined when we come here, so we can clear
-	 * the cpu_to_node() now.
-	 */
-
-	unmap_cpu_on_node(pgdat);
-	return 0;
-}
-
 /**
  * try_offline_node
  *
@@ -1941,7 +1913,7 @@ void try_offline_node(int nid)
 		return;
 	}
 
-	if (check_and_unmap_cpu_on_node(pgdat))
+	if (check_cpu_on_node(pgdat))
 		return;
 
 	/*