Received: by 2002:a05:7412:b995:b0:f9:9502:5bb8 with SMTP id it21csp1013461rdb; Fri, 22 Dec 2023 11:36:56 -0800 (PST) X-Google-Smtp-Source: AGHT+IHShFzPRRhMqTvnjn80iGRzxxHKH4fgttskqtp7R/rTJ1xi8Daw+NUVFcZxq30bOgxPJ6vE X-Received: by 2002:a05:6a20:144e:b0:195:57c3:5ebd with SMTP id a14-20020a056a20144e00b0019557c35ebdmr352352pzi.48.1703273816509; Fri, 22 Dec 2023 11:36:56 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1703273816; cv=none; d=google.com; s=arc-20160816; b=JfPlUQ209BdjaWeJ9+tboyCYv7LYledL9FB7xivfWbBCdR5dq18I9C5AdD9m3PrhFc Oi3UqdK7wsnePDX0p0fNn9EQgbXRz6KOz2hQKYlb5EPvwzapxnZgPbFzkwB3kgy66W2j Y/T2ou5QfpGoUZWAwdf/SEb5ibekjueKv8QBTtSbjLvYxKgoEO+qpiXuhMVAgJtA1Vcy udB2/+avpl8HpgMjPrq5QfgGe4y0ECJsbp1nEyP0YsJ+f8BBC7Uy6zPw3GMSGsFd9oZL TVev+DsQJnpUYv63eberHtvi/SwUPQLgInG2cbMz1tzib7YU/Ke92HKHEQIPbGwF5WnT VCSg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :date:subject:cc:to:from:dkim-signature; bh=Ksbp/Zxa/a+NFoOb3VtUCnlki1kzhpOKLW4fIgeVn2s=; fh=xprWcyUKTLi4rhte7gFCvduD673j12awuFkCPnYDQVY=; b=ixPD6tJyBJ1xK2GS3TEmqqfn4As+3MKHo7us3/okFT0SroC1JcXOzym8DZVSijsGXI BUvj2oZNu9XD1TY1mD/wUPB+VTj/2igJH/SyaAeDAtuLIrDzCPqQj/So1KzzpDCXhC7B g9gPwksAYCYKsGVJhTWHmQAUIfyQSE4QD8H5FOtJX6/tKTSKcCG7ObXVpxqQ6mx2ecBX wK57mnPN8kyVJ+mKfG4wEuroFVmMp+fUaNe+y/J9rvMlMJB2wun8yEePjOpwitIpju2c uJHc90BVGb2QxfrR/XsfdssAyje4VgGIO/nuQkqozpDcy8dlydyZDe/7VucbFW4AC94Q Zz7w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@amazon.com header.s=amazon201209 header.b=N2w9h4L1; spf=pass (google.com: domain of linux-kernel+bounces-10017-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-10017-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id b1-20020a655cc1000000b005c1ce3c9628si3565457pgt.752.2023.12.22.11.36.56 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Dec 2023 11:36:56 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-10017-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@amazon.com header.s=amazon201209 header.b=N2w9h4L1; spf=pass (google.com: domain of linux-kernel+bounces-10017-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-10017-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 2DA7E28382E for ; Fri, 22 Dec 2023 19:36:56 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 323F82E82A; Fri, 22 Dec 2023 19:36:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="N2w9h4L1" X-Original-To: linux-kernel@vger.kernel.org Received: from smtp-fw-33001.amazon.com (smtp-fw-33001.amazon.com [207.171.190.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B72142C84E; Fri, 22 Dec 2023 19:36:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1703273797; x=1734809797; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Ksbp/Zxa/a+NFoOb3VtUCnlki1kzhpOKLW4fIgeVn2s=; b=N2w9h4L12u6e8sbK/pKBvpP/zQCi+JIcwGeLpA0nEXAUPMM0FuFrLo1f F4BtXaTGv9sa+kKR+ylcONw6ZKZ2nkajVoA4WDZkJUnK6orRreazeOvrs yjq2zKsK+b4pIRHy6SJwk63ZR3JXyF3iPj4KZM2na8AORdfX3pwAsDYmj 8=; X-IronPort-AV: E=Sophos;i="6.04,297,1695686400"; d="scan'208";a="318656934" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-pdx-2c-m6i4x-fad5e78e.us-west-2.amazon.com) ([10.43.8.6]) by smtp-border-fw-33001.sea14.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 22 Dec 2023 19:36:27 +0000 Received: from smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev (pdx2-ws-svc-p26-lb5-vlan3.pdx.amazon.com [10.39.38.70]) by email-inbound-relay-pdx-2c-m6i4x-fad5e78e.us-west-2.amazon.com (Postfix) with ESMTPS id 37F27A0E8B; Fri, 22 Dec 2023 19:36:17 +0000 (UTC) Received: from EX19MTAUWC001.ant.amazon.com [10.0.7.35:58867] by smtpin.naws.us-west-2.prod.farcaster.email.amazon.dev [10.0.47.55:2525] with esmtp (Farcaster) id 5fa57003-fc1c-4663-be2b-92c0c0b01216; Fri, 22 Dec 2023 19:36:17 +0000 (UTC) X-Farcaster-Flow-ID: 5fa57003-fc1c-4663-be2b-92c0c0b01216 Received: from EX19D020UWC004.ant.amazon.com (10.13.138.149) by EX19MTAUWC001.ant.amazon.com (10.250.64.174) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Fri, 22 Dec 2023 19:36:17 +0000 Received: from dev-dsk-graf-1a-5ce218e4.eu-west-1.amazon.com (10.253.83.51) by EX19D020UWC004.ant.amazon.com (10.13.138.149) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Fri, 22 Dec 2023 19:36:13 +0000 From: Alexander Graf To: CC: , , , , , , , Eric Biederman , "H. Peter Anvin" , Andy Lutomirski , Peter Zijlstra , "Rob Herring" , Steven Rostedt , "Andrew Morton" , Mark Rutland , "Tom Lendacky" , Ashish Kalra , James Gowans , Stanislav Kinsburskii , , , , Anthony Yznaga , Usama Arif , David Woodhouse , Benjamin Herrenschmidt Subject: [PATCH v2 01/17] mm,memblock: Add support for scratch memory Date: Fri, 22 Dec 2023 19:35:51 +0000 Message-ID: <20231222193607.15474-2-graf@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20231222193607.15474-1-graf@amazon.com> References: <20231222193607.15474-1-graf@amazon.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D035UWA004.ant.amazon.com (10.13.139.109) To EX19D020UWC004.ant.amazon.com (10.13.138.149) Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit With KHO (Kexec HandOver), we need a way to ensure that the new kernel does not allocate memory on top of any memory regions that the previous kernel was handing over. But to know where those are, we need to include them in the reserved memblocks array which may not be big enough to hold all allocations. To resize the array, we need to allocate memory. That brings us into a catch 22 situation. The solution to that is the scratch region: a safe region to operate in. KHO provides a "scratch region" as part of its metadata. This scratch region is a single, contiguous memory block that we know does not contain any KHO allocations. We can exclusively allocate from there until we finish kernel initialization to a point where it knows about all the KHO memory reservations. We introduce a new memblock_set_scratch_only() function that allows KHO to indicate that any memblock allocation must happen from the scratch region. Later, we may want to perform another KHO kexec. For that, we reuse the same scratch region. To ensure that no eventually handed over data gets allocated inside that scratch region, we flip the semantics of the scratch region with memblock_clear_scratch_only(): After that call, no allocations may happen from scratch memblock regions. We will lift that restriction in the next patch. Signed-off-by: Alexander Graf --- include/linux/memblock.h | 19 +++++++++++++ mm/Kconfig | 4 +++ mm/memblock.c | 61 +++++++++++++++++++++++++++++++++++++++- 3 files changed, 83 insertions(+), 1 deletion(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index ae3bde302f70..14043f5b696f 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -42,6 +42,10 @@ extern unsigned long long max_possible_pfn; * kernel resource tree. * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are * not initialized (only for reserved regions). + * @MEMBLOCK_SCRATCH: memory region that kexec can pass to the next kernel in + * handover mode. During early boot, we do not know about all memory reservations + * yet, so we get scratch memory from the previous kernel that we know is good + * to use. It is the only memory that allocations may happen from in this phase. */ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ @@ -50,6 +54,7 @@ enum memblock_flags { MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */ MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */ MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */ + MEMBLOCK_SCRATCH = 0x20, /* scratch memory for kexec handover */ }; /** @@ -129,6 +134,8 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); int memblock_mark_nomap(phys_addr_t base, phys_addr_t size); int memblock_clear_nomap(phys_addr_t base, phys_addr_t size); int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size); +int memblock_mark_scratch(phys_addr_t base, phys_addr_t size); +int memblock_clear_scratch(phys_addr_t base, phys_addr_t size); void memblock_free_all(void); void memblock_free(void *ptr, size_t size); @@ -273,6 +280,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m) return m->flags & MEMBLOCK_DRIVER_MANAGED; } +static inline bool memblock_is_scratch(struct memblock_region *m) +{ + return m->flags & MEMBLOCK_SCRATCH; +} + int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn, unsigned long *end_pfn); void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, @@ -610,5 +622,12 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { } static inline void memtest_report_meminfo(struct seq_file *m) { } #endif +#ifdef CONFIG_MEMBLOCK_SCRATCH +void memblock_set_scratch_only(void); +void memblock_clear_scratch_only(void); +#else +static inline void memblock_set_scratch_only(void) { } +static inline void memblock_clear_scratch_only(void) { } +#endif #endif /* _LINUX_MEMBLOCK_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 57cd378c73d6..384369e40f10 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -513,6 +513,10 @@ config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP config HAVE_MEMBLOCK_PHYS_MAP bool +# Enable memblock support for scratch memory which is needed for KHO +config MEMBLOCK_SCRATCH + bool + config HAVE_FAST_GUP depends on MMU bool diff --git a/mm/memblock.c b/mm/memblock.c index 5a88d6d24d79..e89e6c8f9d75 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -106,6 +106,13 @@ unsigned long min_low_pfn; unsigned long max_pfn; unsigned long long max_possible_pfn; +#ifdef CONFIG_MEMBLOCK_SCRATCH +/* When set to true, only allocate from MEMBLOCK_SCRATCH ranges */ +static bool scratch_only; +#else +#define scratch_only false +#endif + static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock; static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock; #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP @@ -168,6 +175,10 @@ bool __init_memblock memblock_has_mirror(void) static enum memblock_flags __init_memblock choose_memblock_flags(void) { + /* skip non-scratch memory for kho early boot allocations */ + if (scratch_only) + return MEMBLOCK_SCRATCH; + return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE; } @@ -643,7 +654,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type, #ifdef CONFIG_NUMA WARN_ON(nid != memblock_get_region_node(rgn)); #endif - WARN_ON(flags != rgn->flags); + WARN_ON(flags != (rgn->flags & ~MEMBLOCK_SCRATCH)); nr_new++; if (insert) { if (start_rgn == -1) @@ -890,6 +901,18 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size) } #endif +#ifdef CONFIG_MEMBLOCK_SCRATCH +__init_memblock void memblock_set_scratch_only(void) +{ + scratch_only = true; +} + +__init_memblock void memblock_clear_scratch_only(void) +{ + scratch_only = false; +} +#endif + /** * memblock_setclr_flag - set or clear flag for a memory region * @type: memblock type to set/clear flag for @@ -1015,6 +1038,33 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t MEMBLOCK_RSRV_NOINIT); } +/** + * memblock_mark_scratch - Mark a memory region with flag MEMBLOCK_SCRATCH. + * @base: the base phys addr of the region + * @size: the size of the region + * + * Only memory regions marked with %MEMBLOCK_SCRATCH will be considered for + * allocations during early boot with kexec handover. + * + * Return: 0 on success, -errno on failure. + */ +int __init_memblock memblock_mark_scratch(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_SCRATCH); +} + +/** + * memblock_clear_scratch - Clear flag MEMBLOCK_SCRATCH for a specified region. + * @base: the base phys addr of the region + * @size: the size of the region + * + * Return: 0 on success, -errno on failure. + */ +int __init_memblock memblock_clear_scratch(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_SCRATCH); +} + static bool should_skip_region(struct memblock_type *type, struct memblock_region *m, int nid, int flags) @@ -1046,6 +1096,14 @@ static bool should_skip_region(struct memblock_type *type, if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m)) return true; + /* In early alloc during kho, we can only consider scratch allocations */ + if ((flags & MEMBLOCK_SCRATCH) && !memblock_is_scratch(m)) + return true; + + /* Leave scratch memory alone after scratch-only phase */ + if (!(flags & MEMBLOCK_SCRATCH) && memblock_is_scratch(m)) + return true; + return false; } @@ -2211,6 +2269,7 @@ static const char * const flagname[] = { [ilog2(MEMBLOCK_MIRROR)] = "MIRROR", [ilog2(MEMBLOCK_NOMAP)] = "NOMAP", [ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG", + [ilog2(MEMBLOCK_SCRATCH)] = "SCRATCH", }; static int memblock_debug_show(struct seq_file *m, void *private) -- 2.40.1 Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879