From: Kai Huang
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, seanjc@google.com, pbonzini@redhat.com,
    dave.hansen@intel.com, dan.j.williams@intel.com,
    rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com,
    ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com,
    tony.luck@intel.com, peterz@infradead.org, ak@linux.intel.com,
    isaku.yamahata@intel.com, chao.gao@intel.com,
    sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com,
    sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
Subject: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
Date: Mon, 21 Nov 2022 13:26:32 +1300
Message-Id: <9b545148275b14a8c7edef1157f8ec44dc8116ee.1668988357.git.kai.huang@intel.com>

TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
memory regions that can possibly be used by the TDX module, but these
regions are not automatically usable by the TDX module.
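To make the coverage rule concrete, here is a minimal user-space sketch of how a PFN range can be checked against a CMR table: a range counts as convertible only if it lies entirely inside a single CMR. The names model_cmr, model_range_covered_by_cmr() and MODEL_PAGE_SHIFT are hypothetical stand-ins (assuming 4 KiB pages), not kernel or TDX-module interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: 4 KiB pages, as with PAGE_SHIFT on x86-64. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-in for one CMR entry: a byte-granular base/size pair. */
struct model_cmr {
	uint64_t base;	/* physical start address, bytes */
	uint64_t size;	/* length, bytes */
};

/*
 * Return true if the half-open PFN range [start_pfn, end_pfn) lies entirely
 * inside a single CMR. Each CMR is tested on its own, so a range spanning
 * two adjacent CMRs is rejected.
 */
bool model_range_covered_by_cmr(const struct model_cmr *cmrs, size_t ncmrs,
				uint64_t start_pfn, uint64_t end_pfn)
{
	assert(start_pfn < end_pfn);	/* callers pass non-empty ranges */

	for (size_t i = 0; i < ncmrs; i++) {
		uint64_t cmr_start_pfn = cmrs[i].base >> MODEL_PAGE_SHIFT;
		uint64_t cmr_end_pfn = (cmrs[i].base + cmrs[i].size) >> MODEL_PAGE_SHIFT;

		if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
			return true;
	}
	return false;
}
```

As in the per-CMR loop of pfn_range_covered_by_cmr() in this patch, a range that spans two adjacent CMRs is rejected even though every page in it is convertible; that is the conservative choice.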
As a step of initializing the TDX module, the kernel needs to choose a
list of memory regions (out of the convertible memory regions) that the
TDX module can use, and pass those regions to the TDX module. Once this
is done, those "TDX-usable" memory regions are fixed for the module's
lifetime: no more TDX-usable memory can be added to the TDX module after
that.

The initial support of TDX guests will only allocate TDX guest memory
from the global page allocator. To keep things simple, this initial
implementation simply guarantees that all pages in the page allocator are
TDX memory. To achieve this, use all system memory in the core-mm at the
time of initializing the TDX module as TDX memory, and at the same time
refuse to add any non-TDX memory via memory hotplug.

Specifically, walk through all memory regions managed by memblock and add
them to a global list of "TDX-usable" memory regions, which is fixed
after module initialization (or empty if initialization fails). To reject
non-TDX memory in memory hotplug, add an additional check in
arch_add_memory() that the new region is fully covered by a single region
in the "TDX-usable" memory region list.

Note this requires all memory regions in memblock to be TDX convertible
memory when the TDX module is initialized. This holds in practice if no
new memory has been hot-added before the TDX module is initialized,
since all DIMMs present at boot are TDX convertible memory. If any new
memory has been hot-added, initializing the TDX module will fail because
that memory region is not covered by any CMR.

This can be enhanced in the future, e.g. by allowing non-TDX memory to be
added to a separate NUMA node. In that case, "TDX-capable" nodes and
"non-TDX-capable" nodes can co-exist, but the kernel/userspace needs to
guarantee that memory pages for TDX guests are always allocated from the
"TDX-capable" nodes.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.
A non-buggy BIOS should never support hot-removal of any convertible
memory. This implementation doesn't handle ACPI memory removal but relies
on the BIOS to behave correctly.

Signed-off-by: Kai Huang
---

v6 -> v7:
 - Changed to use all system memory in memblock at the time of
   initializing the TDX module as TDX memory
 - Added memory hotplug support

---
 arch/x86/Kconfig            |   1 +
 arch/x86/include/asm/tdx.h  |   3 +
 arch/x86/mm/init_64.c       |  10 ++
 arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
 4 files changed, 197 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dd333b46fafb..b36129183035 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks. This option enables necessary TDX
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d688228f3151..71169ecefabf 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
 int tdx_enable(void);
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_enable(void) { return -ENODEV; }
+static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
+		unsigned long end_pfn) { return true; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..900341333d7e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@
 #include
 #include
 #include
+#include
 
 #include "mm_internal.h"
 
@@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/*
+	 * For now if TDX is enabled, all pages in the page allocator
+	 * must be TDX memory, which is a fixed set of memory regions
+	 * that are passed to the TDX module.  Reject the new region
+	 * if it is not TDX memory to keep the above true.
+	 */
+	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
+		return -EINVAL;
+
 	init_memory_mapping(start, start + size, params->pgprot);
 
 	return add_pages(nid, start_pfn, nr_pages, params);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 43227af25e44..32af86e31c47 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -16,6 +16,11 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
+#include
 #include
 #include
 #include
@@ -34,6 +39,13 @@ enum tdx_module_status_t {
 	TDX_MODULE_SHUTDOWN,
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	int nid;
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 tdx_keyid_num __ro_after_init;
 
@@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
 static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
 static int tdx_cmr_num;
 
+/* All TDX-usable memory regions */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * Detect TDX private KeyIDs to see whether TDX has been enabled by the
  * BIOS. Both initializing the TDX module and running TDX guest require
@@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
 	return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
 }
 
+/* Check whether the given pfn range is covered by any CMR or not. */
+static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
+				     unsigned long end_pfn)
+{
+	int i;
+
+	for (i = 0; i < tdx_cmr_num; i++) {
+		struct cmr_info *cmr = &tdx_cmr_array[i];
+		unsigned long cmr_start_pfn;
+		unsigned long cmr_end_pfn;
+
+		cmr_start_pfn = cmr->base >> PAGE_SHIFT;
+		cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
+
+		if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add a memory region on a given node as a TDX memory block.  The caller
+ * must make sure all memory regions are added in address ascending order
+ * and don't overlap.
+ */
+static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
+			    int nid)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+	tmb->nid = nid;
+
+	list_add_tail(&tmb->list, &tdx_memlist);
+	return 0;
+}
+
+static void free_tdx_memory(void)
+{
+	while (!list_empty(&tdx_memlist)) {
+		struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Add all memblock memory regions to the @tdx_memlist as TDX memory.
+ * Must be called with get_online_mems() already called by the caller.
+ */
+static int build_tdx_memory(void)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, nid, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+		/*
+		 * The first 1MB may not be reported as TDX convertible
+		 * memory.  Manually exclude it from TDX memory.
+		 *
+		 * This is fine as the first 1MB is already reserved in
+		 * reserve_real_mode() and won't end up in ZONE_DMA as a
+		 * free page anyway.
+		 */
+		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/* Verify memory is truly TDX convertible memory */
+		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
+			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
+					start_pfn << PAGE_SHIFT,
+					end_pfn << PAGE_SHIFT);
+			return -EINVAL;
+		}
+
+		/*
+		 * Add the memory regions as TDX memory.  The regions in
+		 * memblock are already guaranteed to be in address
+		 * ascending order and not to overlap.
+		 */
+		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memory();
+	return ret;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -357,12 +473,56 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * All memory regions that can be used by the TDX module must be
+	 * passed to the TDX module during the module initialization.
+	 * Once this is done, all "TDX-usable" memory regions are fixed
+	 * during the module's runtime.
+	 *
+	 * The initial support of TDX guests only allocates memory from
+	 * the global page allocator.  To keep things simple, for now
+	 * just make sure all pages in the page allocator are TDX memory.
+	 *
+	 * To achieve this, use all system memory in the core-mm at the
+	 * time of initializing the TDX module as TDX memory, and at the
+	 * same time, reject any new memory in memory hot-add.
+	 *
+	 * This works because, in practice, all boot-time present DIMMs are
+	 * TDX convertible memory.  However if any new memory is hot-added
+	 * before initializing the TDX module, the initialization will
+	 * fail because that memory is not covered by any CMR.
+	 *
+	 * This can be enhanced in the future, e.g. by allowing adding or
+	 * onlining non-TDX memory to a separate node, in which case the
+	 * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
+	 * together -- the userspace/kernel just needs to make sure pages
+	 * for TDX guests must come from those "TDX-capable" nodes.
+	 *
+	 * Build the list of TDX memory regions as mentioned above so
+	 * they can be passed to the TDX module later.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memory();
+	if (ret)
+		goto out;
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
 	 */
 	ret = -EINVAL;
 out:
+	/*
+	 * Memory hotplug checks the hot-added memory region against the
+	 * @tdx_memlist to see if the region is TDX memory.
+	 *
+	 * Do put_online_mems() here to make sure any modification to
+	 * @tdx_memlist is done while holding the memory hotplug read
+	 * lock, so that the memory hotplug path can just check the
+	 * @tdx_memlist w/o holding the @tdx_module_lock which may cause
+	 * deadlock.
+	 */
+	put_online_mems();
 	return ret;
 }
 
@@ -485,3 +645,26 @@ int tdx_enable(void)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdx_enable);
+
+/*
+ * Check whether the given range is TDX memory.  Must be called between
+ * mem_hotplug_begin()/mem_hotplug_done().
+ */
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/* Empty list means TDX isn't enabled successfully */
+	if (list_empty(&tdx_memlist))
+		return true;
+
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		/*
+		 * The new range is TDX memory if it is fully covered
+		 * by any TDX memory block.
+		 */
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
-- 
2.38.1
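The memblock walk and the arch_add_memory() check described in the changelog above can be modeled in user space, with fixed-size arrays standing in for memblock and for the @tdx_memlist linked list, and with CMRs given directly as PFN ranges. Everything here (struct pfn_range, struct usable_list, build_usable_list(), hotplug_range_ok(), MODEL_MAX_REGIONS) is a hypothetical illustration assuming 4 KiB pages; it sketches the logic and is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: 4 KiB pages, a fixed-capacity list instead of tdx_memlist. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_1M_PFN		((uint64_t)(1024 * 1024) >> MODEL_PAGE_SHIFT)
#define MODEL_MAX_REGIONS	16

struct pfn_range {
	uint64_t start_pfn;	/* inclusive */
	uint64_t end_pfn;	/* exclusive */
};

struct usable_list {
	struct pfn_range r[MODEL_MAX_REGIONS];
	size_t n;		/* n == 0 models "TDX not (successfully) initialized" */
};

/* True if [start_pfn, end_pfn) fits entirely inside one element of @set. */
static bool covered_by_one(const struct pfn_range *set, size_t n,
			   uint64_t start_pfn, uint64_t end_pfn)
{
	for (size_t i = 0; i < n; i++)
		if (start_pfn >= set[i].start_pfn && end_pfn <= set[i].end_pfn)
			return true;
	return false;
}

/*
 * Walk the system ranges (ascending, non-overlapping, like memblock),
 * drop everything below 1MB, require each remainder to sit inside one
 * CMR and record it. Returns 0 on success, -1 on failure; on failure the
 * list is left empty.
 */
int build_usable_list(struct usable_list *out,
		      const struct pfn_range *sys, size_t nsys,
		      const struct pfn_range *cmrs, size_t ncmrs)
{
	out->n = 0;
	for (size_t i = 0; i < nsys; i++) {
		uint64_t start = sys[i].start_pfn > MODEL_1M_PFN ?
				 sys[i].start_pfn : MODEL_1M_PFN;
		uint64_t end = sys[i].end_pfn;

		if (start >= end)
			continue;	/* range lies entirely below 1MB */

		if (!covered_by_one(cmrs, ncmrs, start, end) ||
		    out->n == MODEL_MAX_REGIONS) {
			out->n = 0;	/* discard partial results */
			return -1;
		}
		out->r[out->n++] = (struct pfn_range){ start, end };
	}
	return 0;
}

/*
 * Hot-add check: an empty list accepts any range; otherwise the new
 * range must be fully covered by a single recorded region.
 */
bool hotplug_range_ok(const struct usable_list *list,
		      uint64_t start_pfn, uint64_t end_pfn)
{
	assert(start_pfn < end_pfn);	/* a hot-added range is never empty */

	if (list->n == 0)
		return true;
	return covered_by_one(list->r, list->n, start_pfn, end_pfn);
}
```

One deliberate difference: this sketch empties the list on every failure path, so a failed build leaves hotplug unrestricted. In the patch, the CMR-coverage failure in build_tdx_memory() returns -EINVAL without calling free_tdx_memory(), so entries added before the failing region remain on @tdx_memlist.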