From: Michal Hocko
To: Andrew Morton
Cc: , LKML, David Hildenbrand, Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini, Michal Hocko
Subject: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
Date: Thu, 27 Jan 2022 09:53:01 +0100
Message-Id: <20220127085305.20890-3-mhocko@kernel.org>
In-Reply-To: <20220127085305.20890-1-mhocko@kernel.org>
References: <20220127085305.20890-1-mhocko@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Michal Hocko

We have had several reports [1][2][3] that the page allocator blows up
when an allocation from a possible node is requested. The underlying
reason is that NODE_DATA for the specific node is not allocated.

NUMA specific initialization is arch specific and it can vary a lot.
E.g. x86 tries to initialize all nodes that have some cpu affinity (see
init_cpu_to_node) but this can be insufficient because the node might be
cpuless, for example.

One way to address this problem would be to check for !node_online nodes
when trying to get a zonelist and silently fall back to another node.
Unfortunately, that adds a branch to the allocator hot path and it
doesn't handle any other potential NODE_DATA users.

This patch takes a different approach (following a lead of [3]) and
preallocates pgdat for all possible nodes in arch independent code -
free_area_init. All uninitialized nodes are treated as memoryless nodes.
node_state of the node is not changed because that would lead to other
side effects - e.g. sysfs representation of such a node - and from past
discussions [4] it is known that some tools might have problems
digesting that.
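
As an illustration only, the approach described above can be modelled in
userspace C. This is a sketch and not the kernel code: every name in it
(model_system, model_pgdat, model_init_all_nodes) is hypothetical,
calloc stands in for the early boot allocator, and the "online" array
stands in for node_state. It walks all possible nodes, gives every node
that the platform left uninitialized an empty descriptor, and leaves
that node's online state untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_MAX_NODES 4

struct model_pgdat {
	int node_id;
	unsigned long present_pages;	/* stays 0 for a memoryless node */
};

struct model_system {
	bool possible[MODEL_MAX_NODES];	/* node may exist on this machine */
	bool online[MODEL_MAX_NODES];	/* platform initialized the node */
	struct model_pgdat *node_data[MODEL_MAX_NODES];
};

/*
 * Walk every possible node instead of only the online ones. A node that
 * the platform did not initialize gets a zeroed descriptor so that later
 * lookups never see NULL. Its online flag is deliberately not changed.
 * Returns the number of descriptors that had to be preallocated.
 */
static int model_init_all_nodes(struct model_system *sys)
{
	int nid, preallocated = 0;

	for (nid = 0; nid < MODEL_MAX_NODES; nid++) {
		struct model_pgdat *pgdat;

		if (!sys->possible[nid])
			continue;

		if (sys->online[nid]) {
			/* The platform already provided a descriptor. */
			assert(sys->node_data[nid] != NULL);
			continue;
		}

		pgdat = calloc(1, sizeof(*pgdat));
		if (!pgdat)
			continue;	/* node stays without a descriptor */

		pgdat->node_id = nid;
		sys->node_data[nid] = pgdat;
		preallocated++;
	}

	return preallocated;
}
```

With node 0 possible and online and node 1 possible but offline, the
function preallocates exactly one descriptor (for node 1), and node 1
still reports as offline afterwards.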

Newly allocated pgdat only gets a minimal initialization and the rest of
the work is expected to be done by the memory hotplug - hotadd_new_pgdat
(renamed to hotadd_init_pgdat).

generic_alloc_nodedata is changed to use the memblock allocator because
neither page nor slab allocators are available at the stage when all
pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
use the early boot allocator. The only arch specific implementation is
ia64 and that is changed to use the early allocator as well.

Reported-by: Alexey Makhalov
Tested-by: Alexey Makhalov
Reported-by: Nico Pache
Acked-by: Rafael Aquini
Tested-by: Rafael Aquini
Signed-off-by: Michal Hocko

[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
[2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
[3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
---
 arch/ia64/mm/discontig.c       |  4 ++--
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 21 +++++++++------------
 mm/page_alloc.c                | 34 +++++++++++++++++++++++++++++++---
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 8dc8a554f774..dd0cf4834eaa 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -608,11 +608,11 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
-pg_data_t *arch_alloc_nodedata(int nid)
+pg_data_t * __init arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
 
-	return kzalloc(size, GFP_KERNEL);
+	return memblock_alloc(size, SMP_CACHE_BYTES);
 }
 
 void arch_free_nodedata(pg_data_t *pgdat)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4355983b364d..cdd66bfdf855 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -44,7 +44,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..fc991831d296 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1162,19 +1162,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-static pg_data_t __ref *hotadd_new_pgdat(int nid)
+static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
 	struct pglist_data *pgdat;
 
 	pgdat = NODE_DATA(nid);
-	if (!pgdat) {
-		pgdat = arch_alloc_nodedata(nid);
-		if (!pgdat)
-			return NULL;
 
+	/*
+	 * NODE_DATA is preallocated (free_area_init) but its internal
+	 * state is not allocated completely. Add missing pieces.
+	 * Completely offline nodes stay around and they just need
+	 * reintialization.
+	 */
+	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
-		arch_refresh_nodedata(nid, pgdat);
 	} else {
 		int cpu;
 		/*
@@ -1193,8 +1195,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
 		}
 	}
 
-	/* we can use NODE_DATA(nid) from here */
-	pgdat->node_id = nid;
 	pgdat->node_start_pfn = 0;
 
 	/* init node's zones as empty zones, we don't have any present pages.*/
@@ -1246,7 +1246,7 @@ static int __try_online_node(int nid, bool set_node_online)
 	if (node_online(nid))
 		return 0;
 
-	pgdat = hotadd_new_pgdat(nid);
+	pgdat = hotadd_init_pgdat(nid);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
 		ret = -ENOMEM;
@@ -1445,9 +1445,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	return ret;
 error:
-	/* rollback pgdat allocation and others */
-	if (new_node)
-		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
 error_mem_hotplug_end:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..1a05669044d3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6402,7 +6402,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
 			build_zonelists(pgdat);
@@ -8096,8 +8100,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
+
+			/* Allocator not initialized yet */
+			pgdat = arch_alloc_nodedata(nid);
+			if (!pgdat) {
+				pr_err("Cannot allocate %zuB for node %d.\n",
+						sizeof(*pgdat), nid);
+				continue;
+			}
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_memoryless_node(nid);
+			/*
+			 * not marking this node online because we do not want to
+			 * confuse userspace by sysfs files/directories for node
+			 * without any memory attached to it (see topology_init)
+			 * The pgdat will get fully initialized when a memory is
+			 * hotpluged into it by hotadd_init_pgdat
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
2.30.2
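
For readers following the hotadd_init_pgdat hunk, the check against
&boot_nodestats can be modelled in userspace as below. This is a sketch
under stated assumptions, not the kernel implementation: the sketch_*
names are hypothetical, a plain array stands in for per-cpu data, calloc
stands in for alloc_percpu, and the reset in the else branch is an
assumption because that branch is only partly visible in the hunk. A
preallocated node still points at a shared boot-time placeholder, so the
first hot-add swaps in a real allocation, while a later hot-add keeps
that allocation and only clears it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define SKETCH_NR_CPUS 2

struct sketch_nodestat {
	long counters[SKETCH_NR_CPUS];	/* stand-in for per-cpu counters */
};

/* Shared placeholder that every preallocated node points at after boot. */
static struct sketch_nodestat sketch_boot_stats;

struct sketch_pgdat {
	int node_id;
	struct sketch_nodestat *stats;
};

/* Returns 0 on success, -1 if the allocation fails. */
static int sketch_hotadd_init(struct sketch_pgdat *pgdat)
{
	if (pgdat->stats == &sketch_boot_stats) {
		/* First hot-add: replace the boot placeholder. */
		struct sketch_nodestat *stats = calloc(1, sizeof(*stats));

		if (!stats)
			return -1;
		pgdat->stats = stats;
	} else {
		/* Node was set up before: keep the allocation, drop stale data. */
		memset(pgdat->stats, 0, sizeof(*pgdat->stats));
	}

	return 0;
}
```

Calling sketch_hotadd_init twice on the same descriptor allocates once;
the second call returns the same stats pointer with its counters zeroed.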