2022-11-21 00:49:07

by Kai Huang

Subject: [PATCH v7 00/20] TDX host kernel support

Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
host and certain physical attacks.  The TDX specs are available at [1].

This series provides the initial support to enable TDX, with the minimal
code needed to allow KVM to create and run TDX guests.  KVM support for
TDX is being developed separately [2].  A new "userspace inaccessible
memfd" approach to support TDX private memory is also being developed
[3].  KVM will only support the new "userspace inaccessible memfd" as
TDX guest memory.

This series doesn't aim to support all functionalities (e.g. exposing the
TDX module via /sysfs), and doesn't aim to resolve everything perfectly.
In particular, the way "TDX-usable" memory is chosen and memory hotplug
is handled is kept simple: this series just makes sure all pages in the
page allocator are TDX memory.

A better solution, suggested by Kirill, is similar to the per-node memory
encryption flag in series [4].  Similarly, a per-node TDX flag can
be added so both "TDX-capable" and "non-TDX-capable" nodes can co-exist.
By exposing the TDX flag to userspace via /sysfs, userspace can then
use NUMA APIs to bind TDX guests to the "TDX-capable" nodes.

For more information please refer to the "Kernel policy on TDX memory"
and "Memory hotplug" sections below.  Huang, Ying is working on this
"per-node TDX flag" support and will post another series independently.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
[email protected] following Kirill's suggestion so MM experts can also
help to provide comments.)

Also, other optimizations will be posted as follow-up once this initial
TDX support is upstreamed.

Hi Dave, Dan, Kirill, Ying (and Intel reviewers),

Please kindly help to review, and I would appreciate Reviewed-by or
Acked-by tags if the patches look good to you.

This series has been reviewed by Isaku, who is developing the KVM TDX
patches.  Kirill has also reviewed a couple of patches.

I would also highly appreciate it if anyone else could help to review
this series.

----- Changelog history: ------

- v6 -> v7:

- Added memory hotplug support.
- Changed how to choose the list of "TDX-usable" memory regions from at
kernel boot time to TDX module initialization time.
- Addressed comments received in previous versions. (Andi/Dave).
- Improved the commit message and the comments of the kexec() support
patch; the patch now handles returning PAMTs back to the kernel when TDX
module initialization fails.  Please also see the "Kexec()" section below.
- Changed the documentation patch accordingly.
- For all others please see individual patch changelog history.

- v5 -> v6:

- Removed ACPI CPU/memory hotplug patches. (Intel internal discussion)
- Removed patch to disable driver-managed memory hotplug (Intel
internal discussion).
- Added one patch to introduce enum type for TDX supported page size
level to replace the hard-coded values in TDX guest code (Dave).
- Added one patch to make TDX depend on X2APIC being enabled (Dave).
- Added one patch to build all boot-time present memory regions as TDX
memory during kernel boot.
- Added Reviewed-by from others to some patches.
- For all others please see individual patch changelog history.

- v4 -> v5:

This is essentially a resend of v4.  Sorry I forgot to consult
get_maintainer.pl when sending out v4, so I forgot to add the linux-acpi
and linux-mm mailing lists and the relevant people for 4 new patches.

There are also very minor code and commit message updates from v4:

- Rebased to latest tip/x86/tdx.
- Fixed a checkpatch issue that I missed in v4.
- Removed an obsoleted comment that I missed in patch 6.
- Very minor update to the commit message of patch 12.

For other changes to individual patches since v3, please refer to the
changelog history of individual patches (I just used v3 -> v5 since
there's basically no code change in v4).

- v3 -> v4 (addressed Dave's comments, and other comments from others):

- Simplified SEAMRR and TDX keyID detection.
- Added patches to handle ACPI CPU hotplug.
- Added patches to handle ACPI memory hotplug and driver managed memory
hotplug.
- Removed tdx_detect() and used a single tdx_init() instead.
- Removed detecting TDX module via P-SEAMLDR.
- Changed from using e820 to using memblock to convert system RAM to TDX
memory.
- Excluded legacy PMEM from TDX memory.
- Removed the patch adding a boot-time command line to disable TDX.
- Addressed comments for other individual patches (please see individual
patches).
- Improved the documentation patch based on the new implementation.

- v2 -> v3:

- Addressed comments from Isaku.
- Fixed memory leak and unnecessary function argument in the patch to
configure the key for the global keyid (patch 17).
- Slightly enhanced the patch to get TDX module and CMR
information (patch 09).
- Fixed an unintended change in the patch to allocate PAMT (patch 13).
- Addressed comments from Kevin:
- Slight improvement on the commit message of patch 03.
- Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
seamrr_enabled() (patch 04).
- Changed documentation patch to add TDX host kernel support materials
to Documentation/x86/tdx.rst together with the TDX guest stuff, instead
of a standalone file (patch 21).
- Very minor improvement in commit messages.

- RFC (v1) -> v2:
- Rebased to Kirill's latest TDX guest code.
- Fixed two issues that are related to finding all RAM memory regions
based on e820.
- Minor improvement on comments and commit messages.

v6:
https://lore.kernel.org/linux-mm/[email protected]/T/

v5:
https://lore.kernel.org/lkml/[email protected]/T/

v3:
https://lore.kernel.org/lkml/[email protected]/T/

v2:
https://lore.kernel.org/lkml/[email protected]/T/

RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated region as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which use a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to the TDX module's.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction. This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized. This series assumes the TDX module is loaded
by BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter 13, "Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".
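
As a rough roadmap, the overall initialization flow implemented by this
series looks like the sketch below (a simplified outline only: error
handling, locking and CPU-scoping are omitted, and the steps follow the
patch subjects in this series):

  /* Simplified outline of init_tdx_module() as built up by this series */
  static int init_tdx_module(void)
  {
          /* 1) TDH.SYS.INIT: global initialization (also detects the module) */
          /* 2) TDH.SYS.LP.INIT: logical-cpu scope init on all online CPUs */
          /* 3) TDH.SYS.INFO: get TDX module and CMR information */
          /* 4) Build the list of TDX-usable memory regions from memblock */
          /* 5) Construct TDMRs (PAMTs, reserved areas) to cover those regions */
          /* 6) TDH.SYS.CONFIG: configure the module with TDMRs and global KeyID */
          /* 7) TDH.SYS.KEY.CONFIG: configure the global KeyID on all packages */
          /* 8) TDH.SYS.TDMR.INIT: initialize all TDMRs */
          return 0;
  }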

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run. This
series implements the runtime initialization.

This series adds a function tdx_enable() to allow the caller to initialize
TDX at runtime:

  if (tdx_enable())
          goto no_tdx;
  /* TDX is ready to create TD guests. */

This approach has the following pros:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is only consumed
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel (see the caller-side sketch after this list).

3) It is more flexible for supporting "TDX module runtime update" (not
in this series).  After updating to a new module at runtime, the kernel
needs to go through the initialization process again.
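
For illustration, a caller (KVM) would do something along these lines.
This is a hypothetical caller-side sketch: kvm_hardware_enable_all() is
a made-up stand-in for whatever performs VMXON; only tdx_enable() is
from this series:

  static int enable_tdx(void)
  {
          int r;

          /* Hypothetical: VMXON must be done before making SEAMCALLs */
          r = kvm_hardware_enable_all();
          if (r)
                  return r;

          /* On-demand, one-time TDX module initialization */
          return tdx_enable();
  }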

2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
never support hotpluggable CPU devices and/or deliver ACPI CPU hotplug
events to the kernel.  This series doesn't handle physical (ACPI) CPU
hotplug at all but depends on the BIOS to behave correctly.

Note TDX works with CPU logical online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as usable
for TDX private memory.

The initial support of TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory, to guarantee all pages in the page allocator are TDX pages.

4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed during the
module's runtime.  No more "TDX-usable" memory can be added to the TDX
module after that.

To guarantee that all pages in the page allocator are TDX pages, this
series simply chooses to reject any non-TDX-usable memory in memory
hotplug.

This _will_ be enhanced in the future after the first submission.  The
direction we are heading is to allow adding/onlining non-TDX memory to
separate NUMA nodes so that both "TDX-capable" nodes and
"non-TDX-capable" nodes can co-exist.  The TDX flag can be exposed to
userspace via /sysfs so userspace can bind TDX guests to "TDX-capable"
nodes via NUMA ABIs.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support hot-removal
of any convertible memory.  This implementation doesn't handle ACPI
memory removal but depends on the BIOS to behave correctly.

5. Kexec()

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) part of the memory pages are
still TDX private pages (i.e. metadata used by the TDX module, and any
TDX guest memory if kexec() happens while a TDX guest is alive);
2) there might be dirty cachelines associated with TDX private pages.

Just like SME, TDX hosts require special cache flushing before kexec().
Similar to the SME handling, the kernel uses wbinvd() to flush the cache
in stop_this_cpu() when TDX is enabled.
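
A minimal sketch of that handling (illustrative only; the actual change
is in patch "x86/virt/tdx: Flush cache in kexec() when TDX is enabled"):

  /* In stop_this_cpu(), next to the existing SME handling: */
  if (cpu_feature_enabled(X86_FEATURE_SME) || platform_tdx_enabled())
          native_wbinvd();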

This series doesn't convert all TDX private pages back to normal, for
the following reasons:

1) The kernel doesn't have existing infrastructure to track which pages
are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
them (cache flush + using MOVDIR64B to clear the page) in kexec() can
be time consuming.
3) The new kernel will almost exclusively use KeyID 0 to access memory.
KeyID 0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd-party
kernel ever supports MKTME, it should use MOVDIR64B to clear the page
with the new MKTME KeyID (just like TDX does) before using it (a sketch
follows below).
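
For reference, clearing a page with MOVDIR64B could look like the sketch
below.  This is illustrative only; it assumes the kernel's movdir64b()
helper (whose destination parameter type has varied across kernel
versions) and a 64-byte aligned zero source buffer:

  /* Clear a 4K page in 64-byte chunks using MOVDIR64B (sketch) */
  static void movdir64b_clear_page(void *page)
  {
          static const char zeros[64] __aligned(64);
          unsigned long off;

          for (off = 0; off < PAGE_SIZE; off += 64)
                  movdir64b(page + off, zeros);
  }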

Also, if the old kernel has ever enabled TDX, the new kernel cannot use
TDX again: when the new kernel goes through the TDX module
initialization process, it will fail immediately at the first step.

Ideally, it would be better to shut down the TDX module in kexec(), but
there's no guarantee that CPUs are in VMX operation during kexec(), so
just leave the module open.

== Reference ==

[1]: TDX specs
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

[2]: KVM TDX basic feature support
https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/lkml/[email protected]/T/

[4]: per-node memory encryption flag
https://lore.kernel.org/linux-mm/[email protected]/t/


Kai Huang (20):
x86/tdx: Define TDX supported page sizes as macros
x86/virt/tdx: Detect TDX during kernel boot
x86/virt/tdx: Disable TDX if X2APIC is not enabled
x86/virt/tdx: Add skeleton to initialize TDX on demand
x86/virt/tdx: Implement functions to make SEAMCALL
x86/virt/tdx: Shut down TDX module in case of error
x86/virt/tdx: Do TDX module global initialization
x86/virt/tdx: Do logical-cpu scope TDX module initialization
x86/virt/tdx: Get information about TDX module and TDX-capable memory
x86/virt/tdx: Use all system memory when initializing TDX module as
TDX memory
x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
memory regions
x86/virt/tdx: Create TDMRs to cover all TDX memory regions
x86/virt/tdx: Allocate and set up PAMTs for TDMRs
x86/virt/tdx: Set up reserved areas for all TDMRs
x86/virt/tdx: Reserve TDX module global KeyID
x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
x86/virt/tdx: Configure global KeyID on all packages
x86/virt/tdx: Initialize all TDMRs
x86/virt/tdx: Flush cache in kexec() when TDX is enabled
Documentation/x86: Add documentation for TDX host support

Documentation/x86/tdx.rst | 181 +++-
arch/x86/Kconfig | 15 +
arch/x86/Makefile | 2 +
arch/x86/coco/tdx/tdx.c | 6 +-
arch/x86/include/asm/tdx.h | 30 +
arch/x86/kernel/process.c | 8 +-
arch/x86/mm/init_64.c | 10 +
arch/x86/virt/Makefile | 2 +
arch/x86/virt/vmx/Makefile | 2 +
arch/x86/virt/vmx/tdx/Makefile | 2 +
arch/x86/virt/vmx/tdx/seamcall.S | 52 ++
arch/x86/virt/vmx/tdx/tdx.c | 1422 ++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 118 +++
arch/x86/virt/vmx/tdx/tdxcall.S | 19 +-
14 files changed, 1852 insertions(+), 17 deletions(-)
create mode 100644 arch/x86/virt/Makefile
create mode 100644 arch/x86/virt/vmx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 00e07cfbdf0b232f7553f0175f8f4e8d792f7e90
--
2.38.1



2022-11-21 00:49:34

by Kai Huang

Subject: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate
all memory regions that can possibly be used by the TDX module, but they
are not automatically usable by the TDX module.  As a step of
initializing the TDX module, the kernel needs to choose a list of memory
regions (out of the convertible memory regions) that the TDX module can
use, and pass those regions to the TDX module.  Once this is done, those
"TDX-usable" memory regions are fixed during the module's lifetime.  No
more TDX-usable memory can be added to the TDX module after that.

The initial support of TDX guests will only allocate TDX guest memory
from the global page allocator.  To keep things simple, this initial
implementation simply guarantees all pages in the page allocator are TDX
memory.  To achieve this, use all system memory in the core-mm at the
time of initializing the TDX module as TDX memory, and at the same time,
refuse to add any non-TDX-memory in memory hotplug.

Specifically, walk through all memory regions managed by memblock and
add them to a global list of "TDX-usable" memory regions, which is a
fixed list after the module initialization (or empty if initialization
fails).  To reject non-TDX-memory in memory hotplug, add an additional
check in arch_add_memory() to verify whether the new region is covered
by any region in the "TDX-usable" memory region list.

Note this requires all memory regions in memblock to be TDX convertible
memory when initializing the TDX module.  This is true in practice if no
new memory has been hot-added before initializing the TDX module, since
in practice all boot-time present DIMMs are TDX convertible memory.  If
any new memory has been hot-added, then initializing the TDX module will
fail because that memory region is not covered by any CMR.

This can be enhanced in the future, e.g. by allowing adding non-TDX
memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
needs to guarantee memory pages for TDX guests are always allocated from
the "TDX-capable" nodes.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support hot-removal
of any convertible memory.  This implementation doesn't handle ACPI
memory removal but depends on the BIOS to behave correctly.

Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Changed to use all system memory in memblock at the time of
initializing the TDX module as TDX memory.
- Added memory hotplug support.

---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tdx.h | 3 +
arch/x86/mm/init_64.c | 10 ++
arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
4 files changed, 197 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dd333b46fafb..b36129183035 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
depends on X86_64
depends on KVM_INTEL
depends on X86_X2APIC
+ select ARCH_KEEP_MEMBLOCK
help
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. This option enables necessary TDX
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d688228f3151..71169ecefabf 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
#ifdef CONFIG_INTEL_TDX_HOST
bool platform_tdx_enabled(void);
int tdx_enable(void);
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
#else /* !CONFIG_INTEL_TDX_HOST */
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_enable(void) { return -ENODEV; }
+static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
+ unsigned long end_pfn) { return true; }
#endif /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..900341333d7e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@
#include <asm/uv/uv.h>
#include <asm/setup.h>
#include <asm/ftrace.h>
+#include <asm/tdx.h>

#include "mm_internal.h"

@@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

+ /*
+ * For now if TDX is enabled, all pages in the page allocator
+ * must be TDX memory, which is a fixed set of memory regions
+ * that are passed to the TDX module. Reject the new region
+ * if it is not TDX memory to guarantee above is true.
+ */
+ if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
+ return -EINVAL;
+
init_memory_mapping(start, start + size, params->pgprot);

return add_pages(nid, start_pfn, nr_pages, params);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 43227af25e44..32af86e31c47 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -16,6 +16,11 @@
#include <linux/smp.h>
#include <linux/atomic.h>
#include <linux/align.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/memblock.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/apic.h>
@@ -34,6 +39,13 @@ enum tdx_module_status_t {
TDX_MODULE_SHUTDOWN,
};

+struct tdx_memblock {
+ struct list_head list;
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ int nid;
+};
+
static u32 tdx_keyid_start __ro_after_init;
static u32 tdx_keyid_num __ro_after_init;

@@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
static int tdx_cmr_num;

+/* All TDX-usable memory regions */
+static LIST_HEAD(tdx_memlist);
+
/*
* Detect TDX private KeyIDs to see whether TDX has been enabled by the
* BIOS. Both initializing the TDX module and running TDX guest require
@@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
}

+/* Check whether the given pfn range is covered by any CMR or not. */
+static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < tdx_cmr_num; i++) {
+ struct cmr_info *cmr = &tdx_cmr_array[i];
+ unsigned long cmr_start_pfn;
+ unsigned long cmr_end_pfn;
+
+ cmr_start_pfn = cmr->base >> PAGE_SHIFT;
+ cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
+
+ if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Add a memory region on a given node as a TDX memory block.  The caller
+ * must ensure all memory regions are added in address ascending order
+ * and don't overlap.
+ */
+static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
+ int nid)
+{
+ struct tdx_memblock *tmb;
+
+ tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+ if (!tmb)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&tmb->list);
+ tmb->start_pfn = start_pfn;
+ tmb->end_pfn = end_pfn;
+ tmb->nid = nid;
+
+ list_add_tail(&tmb->list, &tdx_memlist);
+ return 0;
+}
+
+static void free_tdx_memory(void)
+{
+ while (!list_empty(&tdx_memlist)) {
+ struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
+ struct tdx_memblock, list);
+
+ list_del(&tmb->list);
+ kfree(tmb);
+ }
+}
+
+/*
+ * Add all memblock memory regions to the @tdx_memlist as TDX memory.
+ * The caller must have done get_online_mems().
+ */
+static int build_tdx_memory(void)
+{
+ unsigned long start_pfn, end_pfn;
+ int i, nid, ret;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+ /*
+ * The first 1MB may not be reported as TDX convertible
+ * memory.  Manually exclude it from TDX memory.
+ *
+ * This is fine as the first 1MB is already reserved in
+ * reserve_real_mode() and won't end up in ZONE_DMA as a
+ * free page anyway.
+ */
+ start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
+ if (start_pfn >= end_pfn)
+ continue;
+
+ /* Verify memory is truly TDX convertible memory */
+ if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
+ pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
+ start_pfn << PAGE_SHIFT,
+ end_pfn << PAGE_SHIFT);
+ return -EINVAL;
+ }
+
+ /*
+ * Add the memory regions as TDX memory. The regions in
+ * memblock are already guaranteed to be in address
+ * ascending order and don't overlap.
+ */
+ ret = add_tdx_memblock(start_pfn, end_pfn, nid);
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ free_tdx_memory();
+ return ret;
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -357,12 +473,56 @@ static int init_tdx_module(void)
if (ret)
goto out;

+ /*
+ * All memory regions that can be used by the TDX module must be
+ * passed to the TDX module during the module initialization.
+ * Once this is done, all "TDX-usable" memory regions are fixed
+ * during module's runtime.
+ *
+ * The initial support of TDX guests only allocates memory from
+ * the global page allocator. To keep things simple, for now
+ * just make sure all pages in the page allocator are TDX memory.
+ *
+ * To achieve this, use all system memory in the core-mm at the
+ * time of initializing the TDX module as TDX memory, and at the
+ * same time, reject any new memory in memory hot-add.
+ *
+ * This works because in practice all boot-time present DIMMs
+ * are TDX convertible memory.  However, if any new memory is
+ * hot-added before initializing the TDX module, the
+ * initialization will fail because that memory is not covered
+ * by any CMR.
+ *
+ * This can be enhanced in the future, i.e. by allowing adding or
+ * onlining non-TDX memory to a separate node, in which case the
+ * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
+ * together -- the userspace/kernel just needs to make sure pages
+ * for TDX guests must come from those "TDX-capable" nodes.
+ *
+ * Build the list of TDX memory regions as mentioned above so
+ * they can be passed to the TDX module later.
+ */
+ get_online_mems();
+
+ ret = build_tdx_memory();
+ if (ret)
+ goto out;
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
*/
ret = -EINVAL;
out:
+ /*
+ * Memory hotplug checks the hot-added memory region against the
+ * @tdx_memlist to see if the region is TDX memory.
+ *
+ * Do put_online_mems() here to make sure any modification to
+ * @tdx_memlist is done while holding the memory hotplug read
+ * lock, so that the memory hotplug path can just check the
+ * @tdx_memlist w/o holding the @tdx_module_lock which may cause
+ * deadlock.
+ */
+ put_online_mems();
return ret;
}

@@ -485,3 +645,26 @@ int tdx_enable(void)
return ret;
}
EXPORT_SYMBOL_GPL(tdx_enable);
+
+/*
+ * Check whether the given range is TDX memory. Must be called between
+ * mem_hotplug_begin()/mem_hotplug_done().
+ */
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct tdx_memblock *tmb;
+
+ /* Empty list means TDX wasn't enabled successfully */
+ if (list_empty(&tdx_memlist))
+ return true;
+
+ list_for_each_entry(tmb, &tdx_memlist, list) {
+ /*
+ * The new range is TDX memory if it is fully covered
+ * by any TDX memory block.
+ */
+ if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+ return true;
+ }
+ return false;
+}
--
2.38.1


2022-11-21 00:53:31

by Kai Huang

Subject: [PATCH v7 08/20] x86/virt/tdx: Do logical-cpu scope TDX module initialization

After the global module initialization, the next step is logical-cpu
scope module initialization. Logical-cpu initialization requires
calling TDH.SYS.LP.INIT on all BIOS-enabled CPUs. This SEAMCALL can run
concurrently on all CPUs.

Use the helper introduced for shutting down the module to do logical-cpu
scope initialization.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 14 ++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 15 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f292292313bd..2cf7090667aa 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -199,6 +199,15 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
on_each_cpu(seamcall_smp_call_function, sc, true);
}

+static int tdx_module_init_cpus(void)
+{
+ struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
+
+ seamcall_on_each_cpu(&sc);
+
+ return atomic_read(&sc.err);
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -218,6 +227,11 @@ static int init_tdx_module(void)
if (ret)
goto out;

+ /* Logical-cpu scope initialization */
+ ret = tdx_module_init_cpus();
+ if (ret)
+ goto out;
+
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 0b415805c921..9ba11808bd45 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -16,6 +16,7 @@
* TDX module SEAMCALL leaf functions
*/
#define TDH_SYS_INIT 33
+#define TDH_SYS_LP_INIT 35
#define TDH_SYS_LP_SHUTDOWN 44

/*
--
2.38.1


2022-11-21 00:53:44

by Kai Huang

Subject: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

As the last step of constructing TDMRs, set up reserved areas for all
TDMRs.  For each TDMR, put all memory holes within this TDMR into its
reserved areas.  Also, for all PAMTs which overlap with this TDMR, put
the overlapping parts into reserved areas too (a worked example follows
below).
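
As a worked example (hypothetical layout, not taken from the patch):
assume a TDMR [0, 3GB) built over two TDX memory regions [1MB, 2GB+4MB)
and [2GB+8MB, 3GB).  The memory holes [0, 1MB) and [2GB+4MB, 2GB+8MB)
become reserved areas.  If some TDMR's PAMT happens to be allocated at,
say, [2GB+64MB, 2GB+100MB), that overlapping range becomes a reserved
area too, and the final list is sorted in address ascending order as
TDX requires.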

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- No change.

v5 -> v6:
- Rebase due to using 'tdx_memblock' instead of memblock.
- Split tdmr_set_up_rsvd_areas() into two functions to handle memory
hole and PAMT respectively.
- Added Isaku's Reviewed-by.


---
arch/x86/virt/vmx/tdx/tdx.c | 190 +++++++++++++++++++++++++++++++++++-
1 file changed, 188 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9d76e70de46e..1fbf33f2f210 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -21,6 +21,7 @@
#include <linux/memblock.h>
#include <linux/minmax.h>
#include <linux/sizes.h>
+#include <linux/sort.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/apic.h>
@@ -767,6 +768,187 @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array,
return pamt_npages;
}

+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx,
+ u64 addr, u64 size)
+{
+ struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+ int idx = *p_idx;
+
+ /* Reserved area must be 4K aligned in offset and size */
+ if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+ return -EINVAL;
+
+ /* Cannot exceed maximum reserved areas supported by TDX */
+ if (idx >= tdx_sysinfo.max_reserved_per_tdmr)
+ return -E2BIG;
+
+ rsvd_areas[idx].offset = addr - tdmr->base;
+ rsvd_areas[idx].size = size;
+
+ *p_idx = idx + 1;
+
+ return 0;
+}
+
+static int tdmr_set_up_memory_hole_rsvd_areas(struct tdmr_info *tdmr,
+ int *rsvd_idx)
+{
+ struct tdx_memblock *tmb;
+ u64 prev_end;
+ int ret;
+
+ /* Mark holes between memory regions as reserved */
+ prev_end = tdmr_start(tdmr);
+ list_for_each_entry(tmb, &tdx_memlist, list) {
+ u64 start, end;
+
+ start = tmb->start_pfn << PAGE_SHIFT;
+ end = tmb->end_pfn << PAGE_SHIFT;
+
+ /* Break if this region is after the TDMR */
+ if (start >= tdmr_end(tdmr))
+ break;
+
+ /* Exclude regions before this TDMR */
+ if (end < tdmr_start(tdmr))
+ continue;
+
+ /*
+ * Skip if no hole exists before this region. "<=" is
+ * used because one memory region might span two TDMRs
+ * (when the previous TDMR covers part of this region).
+ * In this case the start address of this region is
+ * smaller than the start address of the second TDMR.
+ *
+ * Update the prev_end to the end of this region where
+ * the possible memory hole starts.
+ */
+ if (start <= prev_end) {
+ prev_end = end;
+ continue;
+ }
+
+ /* Add the hole before this region */
+ ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+ start - prev_end);
+ if (ret)
+ return ret;
+
+ prev_end = end;
+ }
+
+ /* Add the hole after the last region if it exists. */
+ if (prev_end < tdmr_end(tdmr)) {
+ ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+ tdmr_end(tdmr) - prev_end);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int tdmr_set_up_pamt_rsvd_areas(struct tdmr_info *tdmr, int *rsvd_idx,
+ struct tdmr_info *tdmr_array,
+ int tdmr_num)
+{
+ int i, ret;
+
+ /*
+ * If any PAMT overlaps with this TDMR, the overlapping part
+ * must also be put into the reserved areas.  Walk over all
+ * TDMRs to find out those overlapping PAMTs and put them to
+ * reserved areas.
+ */
+ for (i = 0; i < tdmr_num; i++) {
+ struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i);
+ unsigned long pamt_start_pfn, pamt_npages;
+ u64 pamt_start, pamt_end;
+
+ tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages);
+ /* Each TDMR must already have PAMT allocated */
+ WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn);
+
+ pamt_start = pamt_start_pfn << PAGE_SHIFT;
+ pamt_end = pamt_start + (pamt_npages << PAGE_SHIFT);
+
+ /* Skip PAMTs outside of the given TDMR */
+ if ((pamt_end <= tdmr_start(tdmr)) ||
+ (pamt_start >= tdmr_end(tdmr)))
+ continue;
+
+ /* Only mark the part within the TDMR as reserved */
+ if (pamt_start < tdmr_start(tdmr))
+ pamt_start = tdmr_start(tdmr);
+ if (pamt_end > tdmr_end(tdmr))
+ pamt_end = tdmr_end(tdmr);
+
+ ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start,
+ pamt_end - pamt_start);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+ struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+ struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+ if (r1->offset + r1->size <= r2->offset)
+ return -1;
+ if (r1->offset >= r2->offset + r2->size)
+ return 1;
+
+ /* Reserved areas must not overlap.  The caller must guarantee that. */
+ WARN_ON_ONCE(1);
+ return -1;
+}
+
+/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
+static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr,
+ struct tdmr_info *tdmr_array,
+ int tdmr_num)
+{
+ int ret, rsvd_idx = 0;
+
+ /* Put all memory holes within the TDMR into reserved areas */
+ ret = tdmr_set_up_memory_hole_rsvd_areas(tdmr, &rsvd_idx);
+ if (ret)
+ return ret;
+
+ /* Put all (overlapping) PAMTs within the TDMR into reserved areas */
+ ret = tdmr_set_up_pamt_rsvd_areas(tdmr, &rsvd_idx, tdmr_array, tdmr_num);
+ if (ret)
+ return ret;
+
+ /* TDX requires reserved areas listed in address ascending order */
+ sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+ rsvd_area_cmp_func, NULL);
+
+ return 0;
+}
+
+static int tdmrs_set_up_rsvd_areas_all(struct tdmr_info *tdmr_array,
+ int tdmr_num)
+{
+ int i;
+
+ for (i = 0; i < tdmr_num; i++) {
+ int ret;
+
+ ret = tdmr_set_up_rsvd_areas(tdmr_array_entry(tdmr_array, i),
+ tdmr_array, tdmr_num);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
/*
* Construct an array of TDMRs to cover all TDX memory ranges.
* The actual number of TDMRs is kept to @tdmr_num.
@@ -783,8 +965,12 @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
if (ret)
goto err;

- /* Return -EINVAL until constructing TDMRs is done */
- ret = -EINVAL;
+ ret = tdmrs_set_up_rsvd_areas_all(tdmr_array, *tdmr_num);
+ if (ret)
+ goto err_free_pamts;
+
+ return 0;
+err_free_pamts:
tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
err:
return ret;
--
2.38.1


2022-11-21 00:53:44

by Kai Huang

Subject: [PATCH v7 05/20] x86/virt/tdx: Implement functions to make SEAMCALL

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction. This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

The TDX module defines a set of SEAMCALL leaf functions to allow the
host to initialize it, and to create and run protected VMs. SEAMCALL
leaf functions use an ABI different from the x86-64 system-v ABI.
Instead, they share the same ABI with the TDCALL leaf functions.

Implement a function __seamcall() to allow the host to make SEAMCALL
to SEAM software using the TDX_MODULE_CALL macro which is the common
assembly for both SEAMCALL and TDCALL.

The SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD
when the CPU is not in VMX operation.  The current TDX_MODULE_CALL macro
doesn't handle either of them, and there's no way to check whether the
CPU is in VMX operation or not.

Initializing the TDX module is done at runtime on demand, and it is up
to the caller to ensure the CPU is in VMX operation before making a
SEAMCALL.  To avoid an Oops when the caller mistakenly tries to
initialize the TDX module while the CPU is not in VMX operation, extend
the TDX_MODULE_CALL macro to handle #UD (and also #GP, which can
theoretically still happen when TDX isn't actually enabled by the BIOS,
i.e. due to a BIOS bug).

Introduce two new TDX error codes for #UD and #GP respectively so the
caller can distinguish them.  Also, opportunistically put the new TDX
error codes and the existing TDX_SEAMCALL_VMFAILINVALID under the
INTEL_TDX_HOST Kconfig option as they are only used when it is on.

As __seamcall() can potentially return multiple error codes, besides
the actual SEAMCALL leaf function return code, also introduce a wrapper
function seamcall() to convert the __seamcall() error code to a kernel
error code, so the caller doesn't need to duplicate the code checking
the return value of __seamcall() and converting it to a kernel error
code.
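
For example, the module global initialization patch later in this series
uses the wrapper like this (TDH.SYS.INIT takes no input and produces no
additional output):

  ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
  if (ret)
          goto out;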

Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- No change.

v5 -> v6:
- Added code to handle #UD and #GP (Dave).
- Moved the seamcall() wrapper function to this patch, and used a
temporary __always_unused to avoid compile warning (Dave).

- v3 -> v5 (no feedback on v4):
- Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
SEAMCALL itself fails.
- Improve the changelog.

---
arch/x86/include/asm/tdx.h | 9 ++++++
arch/x86/virt/vmx/tdx/Makefile | 2 +-
arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 8 +++++
arch/x86/virt/vmx/tdx/tdxcall.S | 19 ++++++++++--
6 files changed, 129 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 05fc89d9742a..d688228f3151 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,10 @@
#include <asm/ptrace.h>
#include <asm/shared/tdx.h>

+#ifdef CONFIG_INTEL_TDX_HOST
+
+#include <asm/trapnr.h>
+
/*
* SW-defined error codes.
*
@@ -18,6 +22,11 @@
#define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))

+#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
+#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
+
+#endif
+
#ifndef __ASSEMBLY__

/*
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 93ca8b73e1f1..38d534f2c113 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-y += tdx.o
+obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f81be6b9c133
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ * (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function. Additional
+ * output operands are saved in @out (if it is provided by the caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * used by the P-SEAMLDR or the TDX
+ * module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+ FRAME_BEGIN
+ TDX_MODULE_CALL host=1
+ FRAME_END
+ RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 28c187b8726f..b06c1a2bc9cb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -124,6 +124,48 @@ bool platform_tdx_enabled(void)
return !!tdx_keyid_num;
}

+/*
+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
+ * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
+ * leaf function return code and the additional output respectively if
+ * not NULL.
+ */
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ u64 *seamcall_ret,
+ struct tdx_module_output *out)
+{
+ u64 sret;
+
+ sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+ /* Save SEAMCALL return code if caller wants it */
+ if (seamcall_ret)
+ *seamcall_ret = sret;
+
+ /* SEAMCALL was successful */
+ if (!sret)
+ return 0;
+
+ switch (sret) {
+ case TDX_SEAMCALL_GP:
+ /*
+ * platform_tdx_enabled() is checked to be true
+ * before making any SEAMCALL.
+ */
+ WARN_ON_ONCE(1);
+ fallthrough;
+ case TDX_SEAMCALL_VMFAILINVALID:
+ /* Return -ENODEV if the TDX module is not loaded. */
+ return -ENODEV;
+ case TDX_SEAMCALL_UD:
+ /* Return -EINVAL if CPU isn't in VMX operation. */
+ return -EINVAL;
+ default:
+ /* Return -EIO if the actual SEAMCALL leaf failed. */
+ return -EIO;
+ }
+}
+
/*
* Detect and initialize the TDX module.
*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d00074abcb20..92a8de957dc7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -12,4 +12,12 @@
/* MSR to report KeyID partitioning between MKTME and TDX */
#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087

+/*
+ * Do not put any hardware-defined TDX structure representations below
+ * this comment!
+ */
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
#endif
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..757b0c34be10 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <asm/asm-offsets.h>
#include <asm/tdx.h>
+#include <asm/asm.h>

/*
* TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
/* Leave input param 2 in RDX */

.if \host
+1:
seamcall
/*
* SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,10 +59,23 @@
* This value will never be used as actual SEAMCALL error code as
* it is from the Reserved status code class.
*/
- jnc .Lno_vmfailinvalid
+ jnc .Lseamcall_out
mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+ jmp .Lseamcall_out
+2:
+ /*
+ * SEAMCALL caused #GP or #UD.  When we reach here, %eax contains
+ * the trap number.  Convert the trap number to the TDX error
+ * code by setting TDX_SW_ERROR in the high 32 bits of %rax.
+ *
+ * Note we cannot OR TDX_SW_ERROR directly into %rax, as the OR
+ * instruction only accepts a 32-bit immediate at most.
+ */
+ mov $TDX_SW_ERROR, %r12
+ orq %r12, %rax

+ _ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:
.else
tdcall
.endif
--
2.38.1


2022-11-21 00:53:55

by Kai Huang

Subject: [PATCH v7 07/20] x86/virt/tdx: Do TDX module global initialization

The first step of initializing the module is to call TDH.SYS.INIT once
on any logical cpu to do module global initialization. Do the module
global initialization.

It also detects the TDX module, as seamcall() returns -ENODEV when the
module is not loaded.

Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Improved changelog.

---
arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++--
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5db1a05cb4bd..f292292313bd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -208,8 +208,23 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
*/
static int init_tdx_module(void)
{
- /* The TDX module hasn't been detected */
- return -ENODEV;
+ int ret;
+
+ /*
+ * Call TDH.SYS.INIT to do the global initialization of
+ * the TDX module. It also detects the module.
+ */
+ ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+ if (ret)
+ goto out;
+
+ /*
+ * Return -EINVAL until all steps of TDX module initialization
+ * process are done.
+ */
+ ret = -EINVAL;
+out:
+ return ret;
}

static void shutdown_tdx_module(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 215cc1065d78..0b415805c921 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -15,6 +15,7 @@
/*
* TDX module SEAMCALL leaf functions
*/
+#define TDH_SYS_INIT 33
#define TDH_SYS_LP_SHUTDOWN 44

/*
--
2.38.1


2022-11-21 00:54:31

by Kai Huang

Subject: [PATCH v7 12/20] x86/virt/tdx: Create TDMRs to cover all TDX memory regions

The kernel configures TDX-usable memory regions by passing an array of
"TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the
information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.

Create a number of TDMRs to cover all TDX memory regions.  To keep it
simple, always try to create one TDMR for each memory region.  As the
first step, only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and its size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans a 1GB boundary and the former part is already
covered by the previous TDMR, just create a new TDMR for the remaining
part.

TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there are more memory regions to cover (a worked
example follows below).
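
A worked example (hypothetical layout): with two TDX memory regions
[1MB, 2GB+4MB) and [2GB+8MB, 4GB), aligning the first region gives
TDMR 0 = [0, 3GB) (ALIGN_DOWN(1MB) = 0, ALIGN_UP(2GB+4MB) = 3GB).  The
second region aligns to [2GB, 4GB); everything below 3GB is already
covered by TDMR 0, so a second TDMR covers only the remainder:
TDMR 1 = [3GB, 4GB).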

Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- No change.

v5 -> v6:
- Rebase due to using 'tdx_memblock' instead of memblock.

- v3 -> v5 (no feedback on v4):
- Removed allocating TDMR individually.
- Improved changelog by using Dave's words.
- Made TDMR_START() and TDMR_END() as static inline function.

---
arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++-
1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 26048c6b0170..57b448de59a0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -445,6 +445,24 @@ static int build_tdx_memory(void)
return ret;
}

+/* TDMR must be 1GB aligned */
+#define TDMR_ALIGNMENT BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
+
+/* Align up and down the address to TDMR boundary */
+#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT)
+
+static inline u64 tdmr_start(struct tdmr_info *tdmr)
+{
+ return tdmr->base;
+}
+
+static inline u64 tdmr_end(struct tdmr_info *tdmr)
+{
+ return tdmr->base + tdmr->size;
+}
+
/* Calculate the actual TDMR_INFO size */
static inline int cal_tdmr_size(void)
{
@@ -492,14 +510,98 @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz)
return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
}

+static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array,
+ int idx)
+{
+ return (struct tdmr_info *)((unsigned long)tdmr_array +
+ cal_tdmr_size() * idx);
+}
+
+/*
+ * Create TDMRs to cover all TDX memory regions. The actual number
+ * of TDMRs is set to @tdmr_num.
+ */
+static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
+{
+ struct tdx_memblock *tmb;
+ int tdmr_idx = 0;
+
+ /*
+ * Loop over TDX memory regions and create TDMRs to cover them.
+ * To keep it simple, always try to use one TDMR to cover
+ * one memory region.
+ */
+ list_for_each_entry(tmb, &tdx_memlist, list) {
+ struct tdmr_info *tdmr;
+ u64 start, end;
+
+ tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
+ start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT);
+ end = TDMR_ALIGN_UP(tmb->end_pfn << PAGE_SHIFT);
+
+ /*
+ * If the current TDMR's size hasn't been initialized,
+ * it is a new TDMR to cover the new memory region.
+ * Otherwise, the current TDMR has already covered the
+ * previous memory region. In the latter case, check
+ * whether the current memory region has been fully or
+ * partially covered by the current TDMR, since TDMR is
+ * 1G aligned.
+ */
+ if (tdmr->size) {
+ /*
+ * Loop to the next memory region if the current
+ * block has already been fully covered by the
+ * current TDMR.
+ */
+ if (end <= tdmr_end(tdmr))
+ continue;
+
+ /*
+ * If part of the current memory region has
+ * already been covered by the current TDMR,
+ * skip the already covered part.
+ */
+ if (start < tdmr_end(tdmr))
+ start = tdmr_end(tdmr);
+
+ /*
+ * Create a new TDMR to cover the current memory
+ * region, or the remaining part of it.
+ */
+ tdmr_idx++;
+ if (tdmr_idx >= tdx_sysinfo.max_tdmrs)
+ return -E2BIG;
+
+ tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
+ }
+
+ tdmr->base = start;
+ tdmr->size = end - start;
+ }
+
+ /* @tdmr_idx is always the index of last valid TDMR. */
+ *tdmr_num = tdmr_idx + 1;
+
+ return 0;
+}
+
/*
* Construct an array of TDMRs to cover all TDX memory ranges.
* The actual number of TDMRs is kept to @tdmr_num.
*/
static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
{
+ int ret;
+
+ ret = create_tdmrs(tdmr_array, tdmr_num);
+ if (ret)
+ goto err;
+
/* Return -EINVAL until constructing TDMRs is done */
- return -EINVAL;
+ ret = -EINVAL;
+err:
+ return ret;
}

/*
--
2.38.1


2022-11-21 00:54:51

by Kai Huang

Subject: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory. This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module. PAMTs are not reserved by hardware
up front. They must be allocated by the kernel and then given to the
TDX module.

TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TD guest early during system boot to get those
PAMTs allocated early, but the only real fix is to add a boot option to
allocate or reserve PAMTs during kernel boot.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR. If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

- Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
the total number of reserved areas consumed for PAMTs.
- Try to first allocate PAMT from the local node of the TDMR for better
NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.
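
As a sizing example, assuming a 16-byte PAMT entry (the actual entry
size is reported by the TDX module via 'pamt_entry_size'), a 1GB TDMR
needs:

  4K PAMT: (1GB / 4KB) entries * 16B = 4MB  (~1/256th of the TDMR)
  2M PAMT: (1GB / 2MB) entries * 16B = 8KB
  1G PAMT: (1GB / 1GB) entries * 16B = 16B -> 4KB after 4K alignment

i.e. a single physically contiguous allocation of roughly 4MB + 12KB
covers all three PAMTs.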

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Changes due to using macros instead of 'enum' for TDX supported page
sizes.

v5 -> v6:
- Rebase due to using 'tdx_memblock' instead of memblock.
- 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
- Improved comment around tdmr_get_nid() (Dave).
- Improved comment in tdmr_set_up_pamt() around breaking the PAMT
into PAMTs for 4K/2M/1G (Dave).
- tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).

- v3 -> v5 (no feedback on v4):
- Used memblock to get the NUMA node for given TDMR.
- Removed tdmr_get_pamt_sz() helper but use open-code instead.
- Changed to use 'switch .. case..' for each TDX supported page size in
tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
- Added printing out memory used for PAMT allocation when TDX module is
initialized successfully.
- Explained downside of alloc_contig_pages() in changelog.
- Addressed other minor comments.


---
arch/x86/Kconfig | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 191 ++++++++++++++++++++++++++++++++++++
2 files changed, 192 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b36129183035..b86a333b860f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1960,6 +1960,7 @@ config INTEL_TDX_HOST
depends on KVM_INTEL
depends on X86_X2APIC
select ARCH_KEEP_MEMBLOCK
+ depends on CONTIG_ALLOC
help
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 57b448de59a0..9d76e70de46e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -586,6 +586,187 @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
return 0;
}

+/*
+ * Calculate PAMT size given a TDMR and a page size. The returned
+ * PAMT size is always aligned up to a 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
+{
+ unsigned long pamt_sz, nr_pamt_entries;
+
+ switch (pgsz) {
+ case TDX_PS_4K:
+ nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+ break;
+ case TDX_PS_2M:
+ nr_pamt_entries = tdmr->size >> PMD_SHIFT;
+ break;
+ case TDX_PS_1G:
+ nr_pamt_entries = tdmr->size >> PUD_SHIFT;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return 0;
+ }
+
+ pamt_sz = nr_pamt_entries * tdx_sysinfo.pamt_entry_size;
+ /* TDX requires PAMT size must be 4K aligned */
+ pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+ return pamt_sz;
+}
+
+/*
+ * Pick a NUMA node on which to allocate this TDMR's metadata.
+ *
+ * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
+ * not be. If the TDMR covers more than one node, just use the _first_
+ * one. This can lead to small areas of off-node metadata for some
+ * memory.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr)
+{
+ struct tdx_memblock *tmb;
+
+ /* Find the first memory region covered by the TDMR */
+ list_for_each_entry(tmb, &tdx_memlist, list) {
+ if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
+ return tmb->nid;
+ }
+
+ /*
+ * Fall back to allocating the TDMR's metadata from node 0 when
+ * no TDX memory block can be found. This should never happen
+ * since TDMRs originate from TDX memory blocks.
+ */
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr)
+{
+ unsigned long pamt_base[TDX_PS_1G + 1];
+ unsigned long pamt_size[TDX_PS_1G + 1];
+ unsigned long tdmr_pamt_base;
+ unsigned long tdmr_pamt_size;
+ struct page *pamt;
+ int pgsz, nid;
+
+ nid = tdmr_get_nid(tdmr);
+
+ /*
+ * Calculate the PAMT size for each TDX supported page size
+ * and the total PAMT size.
+ */
+ tdmr_pamt_size = 0;
+ for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
+ pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
+ tdmr_pamt_size += pamt_size[pgsz];
+ }
+
+ /*
+ * Allocate one chunk of physically contiguous memory for all
+ * PAMTs. This helps minimize the PAMT's use of reserved areas
+ * in overlapped TDMRs.
+ */
+ pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
+ nid, &node_online_map);
+ if (!pamt)
+ return -ENOMEM;
+
+ /*
+ * Break the contiguous allocation back up into the
+ * individual PAMTs for each page size.
+ */
+ tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+ for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) {
+ pamt_base[pgsz] = tdmr_pamt_base;
+ tdmr_pamt_base += pamt_size[pgsz];
+ }
+
+ tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+ tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
+ tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
+ tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
+ tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
+ tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+
+ return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
+ unsigned long *pamt_npages)
+{
+ unsigned long pamt_base, pamt_sz;
+
+ /*
+ * The PAMT was allocated in one contiguous unit. The 4K PAMT
+ * should always point to the beginning of that allocation.
+ */
+ pamt_base = tdmr->pamt_4k_base;
+ pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+ *pamt_pfn = pamt_base >> PAGE_SHIFT;
+ *pamt_npages = pamt_sz >> PAGE_SHIFT;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+ unsigned long pamt_pfn, pamt_npages;
+
+ tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
+
+ /* Do nothing if PAMT hasn't been allocated for this TDMR */
+ if (!pamt_npages)
+ return;
+
+ if (WARN_ON_ONCE(!pamt_pfn))
+ return;
+
+ free_contig_range(pamt_pfn, pamt_npages);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+ int i;
+
+ for (i = 0; i < tdmr_num; i++)
+ tdmr_free_pamt(tdmr_array_entry(tdmr_array, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < tdmr_num; i++) {
+ ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i));
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+ return ret;
+}
+
+static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array,
+ int tdmr_num)
+{
+ unsigned long pamt_npages = 0;
+ int i;
+
+ for (i = 0; i < tdmr_num; i++) {
+ unsigned long pfn, npages;
+
+ tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages);
+ pamt_npages += npages;
+ }
+
+ return pamt_npages;
+}
+
/*
* Construct an array of TDMRs to cover all TDX memory ranges.
* The actual number of TDMRs is kept in @tdmr_num.
@@ -598,8 +779,13 @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
if (ret)
goto err;

+ ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
+ if (ret)
+ goto err;
+
/* Return -EINVAL until constructing TDMRs is done */
ret = -EINVAL;
+ tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
err:
return ret;
}
@@ -686,6 +872,11 @@ static int init_tdx_module(void)
* process are done.
*/
ret = -EINVAL;
+ if (ret)
+ tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+ else
+ pr_info("%lu pages allocated for PAMT.\n",
+ tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
out_free_tdmrs:
/*
* The array of TDMRs is freed whether the initialization is
--
2.38.1


2022-11-21 00:55:39

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 20/20] Documentation/x86: Add documentation for TDX host support

Add documentation for TDX host kernel support. There is already one
file, Documentation/x86/tdx.rst, containing documentation for TDX guest
internals. Reuse it for TDX host kernel support as well.

Introduce a new top-level menu "TDX Guest Support", move the existing
material under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Changed "TDX Memory Policy" and "Kexec()" sections.

---
Documentation/x86/tdx.rst | 181 +++++++++++++++++++++++++++++++++++---
1 file changed, 170 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index dc8d9fd2c3f7..35092e7c60f7 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,165 @@ encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed to by the SEAM Range Register (SEAMRR). A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs. TDX reserves part of the MKTME
+KeyIDs as TDX private KeyIDs, which are only accessible within the SEAM
+mode. The BIOS is responsible for partitioning the KeyID space between
+legacy MKTME KeyIDs and TDX private KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized. The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot. The dmesg line below shows TDX has been enabled by the BIOS::
+
+ [..] tdx: TDX enabled by BIOS. TDX private KeyID range: [16, 64).
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect the TDX module. The kernel detects it
+by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction. The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly ~1/256th of system RAM as
+'metadata' for the TDX memory. It also takes additional CPU time to
+initialize that metadata along with the TDX module itself. Neither is
+trivial, so the kernel initializes the TDX module at runtime on demand.
+The caller calls tdx_enable() to initialize the TDX module::
+
+ ret = tdx_enable();
+ if (ret)
+ goto no_tdx;
+ // TDX is ready to use
+
+Initializing the TDX module requires all logical CPUs to be online.
+tdx_enable() internally temporarily disables CPU hotplug to prevent any
+CPU from going offline, but the caller still needs to guarantee all
+present CPUs are online before calling tdx_enable().
+
+Also, tdx_enable() requires all CPUs to already be in VMX operation (a
+requirement for making SEAMCALLs). Currently, tdx_enable() doesn't
+handle VMXON internally, but depends on the caller to guarantee that.
+So far KVM is the only user of TDX, and KVM already handles VMXON.
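+
+As an illustration only, the expected calling pattern can be sketched
+as below. This is a minimal sketch rather than KVM's actual code, and
+the VMXON helper named here is hypothetical and owned by the caller::
+
+ /* Caller's policy: guarantee all present CPUs are online */
+ if (num_online_cpus() != num_present_cpus())
+ return -EBUSY;
+
+ caller_vmxon_all_online_cpus(); /* hypothetical helper */
+
+ ret = tdx_enable();
+ if (ret)
+ goto no_tdx;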
+
+Users can consult dmesg to see whether the TDX module is present, and
+whether it has been initialized.
+
+If the TDX module is not loaded, dmesg shows the following::
+
+ [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like the following::
+
+ [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+ [..] tdx: 65667 pages allocated for PAMT.
+ [..] tdx: TDX module initialized.
+
+If the TDX module failed to initialize, dmesg shows the following::
+
+ [..] tdx: Failed to initialize TDX module. Shut it down.
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate
+all memory regions that can possibly be used by the TDX module, but they
+are not automatically usable by the TDX module. As a step of initializing
+the TDX module, the kernel needs to choose a list of memory regions (out
+from convertible memory regions) that the TDX module can use and pass
+those regions to the TDX module. Once this is done, those "TDX-usable"
+memory regions are fixed for the module's lifetime. No more TDX-usable
+memory can be added to the TDX module after that.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory. Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and at the same time, refuses to add any non-TDX memory
+via memory hotplug.
+
+This can be enhanced in the future, i.e. by allowing adding non-TDX
+memory to a separate NUMA node. In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Note TDX assumes convertible memory is always physically present during
+the machine's runtime. A non-buggy BIOS should never support hot-removal of
+any convertible memory. This implementation doesn't handle ACPI memory
+removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX. A non-buggy BIOS should never support hot-add/removal of
+physical CPUs. Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note TDX works with logical CPU online/offline, thus the kernel still
+allows offlining a logical CPU and onlining it again.
+
+Kexec()
+~~~~~~~
+
+There are two problems in terms of using kexec() to boot to a new kernel
+when the old kernel has enabled TDX: 1) Part of the memory pages are
+still TDX private pages (i.e. metadata used by the TDX module, and any
+TDX guest memory if kexec() is executed when there's live TDX guests).
+2) There might be dirty cachelines associated with TDX private pages.
+
+Because the hardware doesn't guarantee cache coherency among different
+KeyIDs, the old kernel needs to flush cache (of TDX private pages)
+before booting to the new kernel. Also, the kernel doesn't convert all
+TDX private pages back to normal because of below considerations:
+
+1) The kernel doesn't have existing infrastructure to track which pages
+ are TDX private pages.
+2) The number of TDX private pages can be large, and converting all of
+ them (cache flush + using MOVDIR64B to clear the page) can be time
+ consuming.
+3) The new kernel will use KeyID 0 almost exclusively to access memory.
+ KeyID 0 doesn't support integrity-check, so it's OK.
+4) The kernel doesn't (and may never) support MKTME. If any 3rd party
+ kernel ever supports MKTME, it should do MOVDIR64B to clear the page
+ with the new MKTME KeyID (just like TDX does) before using it.
+
+The current TDX module architecture doesn't play nicely with kexec().
+The TDX module can only be initialized once during its lifetime, and
+there is no SEAMCALL to reset the module to give a new clean slate to
+the new kernel. Therefore, ideally, if the module is ever initialized,
+it's better to shut down the module. The new kernel won't be able to
+use TDX anyway (as it needs to go through the TDX module initialization
+process which will fail immediately at the first step).
+
+However, there's no guarantee CPUs are in VMX operation during kexec(),
+so it's impractical to shut down the module. Currently, the kernel just
+leaves the module in an open state.
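+
+The flush itself is hooked into stop_this_cpu(), mirroring the existing
+SME flush there. A minimal sketch of the check, as added by the kexec()
+support patch::
+
+ if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
+ native_wbinvd();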
+
+TDX Guest Support
+=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +179,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
-==================
+------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +189,7 @@ Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.

Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
@@ -41,7 +200,7 @@ Instruction-based #VE
- CPUID*

Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +211,7 @@ Instruction-based #GP
- RDMSR*,WRMSR*

RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

@@ -73,7 +232,7 @@ trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

CPUID Behavior
---------------
+~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +252,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.

#VE on Memory Accesses
-======================
+----------------------

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
@@ -107,7 +266,7 @@ entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.

#VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +286,7 @@ be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
@@ -145,7 +304,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
to handle the exception.

Linux #VE handler
-=================
+-----------------

Just like page faults or #GP's, #VE exceptions can be either handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +326,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.

MMIO handling
-=============
+-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +348,7 @@ MMIO access via other means (like structure overlays) may result in an
oops.

Shared Memory Conversions
-=========================
+-------------------------

All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
--
2.38.1


2022-11-21 00:57:49

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 16/20] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID

After the TDX-usable memory regions are constructed in an array of TDMRs
and the global KeyID is reserved, configure them in the TDX module using
the TDH.SYS.CONFIG SEAMCALL. TDH.SYS.CONFIG can only be called once, and
can be done on any logical cpu.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 2 ++
2 files changed, 39 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e2cbeeb7f0dc..3a032930e58a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -979,6 +979,37 @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
return ret;
}

+static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
+ u64 global_keyid)
+{
+ u64 *tdmr_pa_array;
+ int i, array_sz;
+ int ret;
+
+ /*
+ * TDMR_INFO entries are configured to the TDX module via an
+ * array of the physical addresses of each TDMR_INFO. The TDX
+ * module requires the array itself to be 512-byte aligned.
+ * Round up the array size to 512-byte aligned so the buffer
+ * allocated by kzalloc() will meet the alignment requirement.
+ */
+ array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
+ tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
+ if (!tdmr_pa_array)
+ return -ENOMEM;
+
+ for (i = 0; i < tdmr_num; i++)
+ tdmr_pa_array[i] = __pa(tdmr_array_entry(tdmr_array, i));
+
+ ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num,
+ global_keyid, 0, NULL, NULL);
+
+ /* Free the array as it is not required anymore. */
+ kfree(tdmr_pa_array);
+
+ return ret;
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -1062,11 +1093,17 @@ static int init_tdx_module(void)
*/
tdx_global_keyid = tdx_keyid_start;

+ /* Pass the TDMRs and the global KeyID to the TDX module */
+ ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid);
+ if (ret)
+ goto out_free_pamts;
+
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
*/
ret = -EINVAL;
+out_free_pamts:
if (ret)
tdmrs_free_pamt_all(tdmr_array, tdmr_num);
else
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index a737f2b51474..c26bab2555ca 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -19,6 +19,7 @@
#define TDH_SYS_INIT 33
#define TDH_SYS_LP_INIT 35
#define TDH_SYS_LP_SHUTDOWN 44
+#define TDH_SYS_CONFIG 45

struct cmr_info {
u64 base;
@@ -86,6 +87,7 @@ struct tdmr_reserved_area {
} __packed;

#define TDMR_INFO_ALIGNMENT 512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512

struct tdmr_info {
u64 base;
--
2.38.1


2022-11-21 01:18:33

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory
Region" (CMR). During boot, the firmware builds a list of all of the
memory ranges which can provide the TDX security guarantees. The list
of these ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory. This metadata essentially
serves as the 'struct page' for the TDX module. The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory. If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR). Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment. To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes. If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.
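
As an illustration, consider a hypothetical machine whose convertible
memory consists of [0, 640K) and [1M, 4G). A single 1G-aligned TDMR
[0, 4G) could cover both ranges, with one reserved area placed over the
[640K, 1M) hole (and, if the TDMR's own PAMT were allocated inside
[0, 4G), another reserved area covering that PAMT).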

Let's summarize the concepts:

CMR - Firmware-enumerated physical ranges that support TDX. CMRs are
4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
TDX. 1G granularity and alignment required. Each TDMR has
reserved areas where TDX memory holes and overlapping PAMTs can
be placed.
PAMT - Physically contiguous TDX metadata. One table for each page size
per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G
PAMT.
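
As a worked example (assuming the TDX module reports a 16-byte PAMT
entry size in TDSYSINFO_STRUCT), a 1G TDMR needs:

  4K PAMT: (1G / 4K) entries * 16 bytes = 4M
  2M PAMT: (1G / 2M) entries * 16 bytes = 8K
  1G PAMT: (1G / 1G) entries * 16 bytes = 16 bytes, rounded up to 4K

i.e. ~4M in total, which is roughly 1/256th of the 1G TDMR.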

As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing an array of TDMRs to the TDX module.

Constructing the array of TDMRs consists of the following steps:

1) Create TDMRs to cover all memory regions that the TDX module can use;
2) Allocate and set up PAMT for each TDMR;
3) Set up reserved areas for each TDMR.

Add a placeholder to construct TDMRs to do the above steps after all
TDX memory regions are verified to be truly convertible. Always free
TDMRs at the end of the initialization (whether successful or not)
as TDMRs are only used during the initialization.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Improved commit message to explain 'int' overflow cannot happen
in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave.

v5 -> v6:
- construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is
used instead of memblock.
- Added Isaku's Reviewed-by.

- v3 -> v5 (no feedback on v4):
- Moved calculating TDMR size to this patch.
- Changed to use alloc_pages_exact() to allocate buffer for all TDMRs
once, instead of allocating each TDMR individually.
- Removed "crypto protection" in the changelog.
- -EFAULT -> -EINVAL in couple of places.


---
arch/x86/virt/vmx/tdx/tdx.c | 83 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++
2 files changed, 106 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 32af86e31c47..26048c6b0170 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -445,6 +445,63 @@ static int build_tdx_memory(void)
return ret;
}

+/* Calculate the actual TDMR_INFO size */
+static inline int cal_tdmr_size(void)
+{
+ int tdmr_sz;
+
+ /*
+ * The actual size of TDMR_INFO depends on the maximum number
+ * of reserved areas.
+ *
+ * Note: for TDX1.0 the max_reserved_per_tdmr is 16, and
+ * TDMR_INFO size is aligned up to 512-byte. Even if it is
+ * extended in the future, it would be insane if TDMR_INFO
+ * becomes larger than 4K. The tdmr_sz here should never
+ * overflow.
+ */
+ tdmr_sz = sizeof(struct tdmr_info);
+ tdmr_sz += sizeof(struct tdmr_reserved_area) *
+ tdx_sysinfo.max_reserved_per_tdmr;
+
+ /*
+ * TDX requires each TDMR_INFO to be 512-byte aligned. Always
+ * round up TDMR_INFO size to the 512-byte boundary.
+ */
+ return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+}
+
+static struct tdmr_info *alloc_tdmr_array(int *array_sz)
+{
+ /*
+ * TDX requires each TDMR_INFO to be 512-byte aligned.
+ * Use alloc_pages_exact() to allocate all TDMRs at once.
+ * Each TDMR_INFO will still be 512-byte aligned since
+ * cal_tdmr_size() always returns 512-byte aligned size.
+ */
+ *array_sz = cal_tdmr_size() * tdx_sysinfo.max_tdmrs;
+
+ /*
+ * Zero the buffer so 'struct tdmr_info::size' can be
+ * used to determine whether a TDMR is valid.
+ *
+ * Note: for TDX1.0 the max_tdmrs is 64 and TDMR_INFO size
+ * is 512-byte. Even if they are extended in the future, it
+ * would be insane if the total size exceeds 4MB.
+ */
+ return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
+}
+
+/*
+ * Construct an array of TDMRs to cover all TDX memory ranges.
+ * The actual number of TDMRs is kept in @tdmr_num.
+ */
+static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
+{
+ /* Return -EINVAL until constructing TDMRs is done */
+ return -EINVAL;
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -454,6 +511,9 @@ static int build_tdx_memory(void)
*/
static int init_tdx_module(void)
{
+ struct tdmr_info *tdmr_array;
+ int tdmr_array_sz;
+ int tdmr_num;
int ret;

/*
@@ -506,11 +566,34 @@ static int init_tdx_module(void)
ret = build_tdx_memory();
if (ret)
goto out;
+
+ /* Prepare enough space to construct TDMRs */
+ tdmr_array = alloc_tdmr_array(&tdmr_array_sz);
+ if (!tdmr_array) {
+ ret = -ENOMEM;
+ goto out_free_tdx_mem;
+ }
+
+ /* Construct TDMRs to cover all TDX memory ranges */
+ ret = construct_tdmrs(tdmr_array, &tdmr_num);
+ if (ret)
+ goto out_free_tdmrs;
+
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
*/
ret = -EINVAL;
+out_free_tdmrs:
+ /*
+ * The array of TDMRs is freed whether the initialization is
+ * successful or not. They are not needed anymore after the
+ * module initialization.
+ */
+ free_pages_exact(tdmr_array, tdmr_array_sz);
+out_free_tdx_mem:
+ if (ret)
+ free_tdx_memory();
out:
/*
* Memory hotplug checks the hot-added memory region against the
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 8e273756098c..a737f2b51474 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -80,6 +80,29 @@ struct tdsysinfo_struct {
};
} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);

+struct tdmr_reserved_area {
+ u64 offset;
+ u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT 512
+
+struct tdmr_info {
+ u64 base;
+ u64 size;
+ u64 pamt_1g_base;
+ u64 pamt_1g_size;
+ u64 pamt_2m_base;
+ u64 pamt_2m_size;
+ u64 pamt_4k_base;
+ u64 pamt_4k_size;
+ /*
+ * Actual number of reserved areas depends on
+ * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+ */
+ struct tdmr_reserved_area reserved_areas[0];
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
--
2.38.1


2022-11-21 01:22:26

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 19/20] x86/virt/tdx: Flush cache in kexec() when TDX is enabled

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages (i.e. metadata used by the TDX module, and any
TDX guest memory if kexec() happens when there's any TDX guest alive).
2) There might be dirty cachelines associated with TDX private pages.

Because the hardware doesn't guarantee cache coherency among different
KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
before booting to the new kernel. Also, reading a TDX private page using
any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
Therefore ideally, the kernel should convert all TDX private pages back
to normal before booting to the new kernel.

However, this implementation doesn't convert TDX private pages back to
normal in kexec() because of the following considerations:

1) The kernel doesn't have existing infrastructure to track which pages
are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
them (cache flush + using MOVDIR64B to clear the page) in kexec() can
be time consuming.
3) The new kernel will use KeyID 0 almost exclusively to access memory.
KeyID 0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME. If any 3rd party
kernel ever supports MKTME, it should do MOVDIR64B to clear the page
with the new MKTME KeyID (just like TDX does) before using it.

Therefore, this implementation just flushes cache to make sure there are
no stale dirty cachelines associated with any TDX private KeyIDs before
booting to the new kernel, otherwise they may silently corrupt the new
kernel.

Following SME support, use wbinvd() to flush cache in stop_this_cpu().
Theoretically, cache flush is only needed when the TDX module has been
initialized. However, initializing the TDX module is done on demand at
runtime, and reading the module status requires taking a mutex. Instead,
just check whether TDX has been enabled by the BIOS, and flush the cache
if so.

Also, the current TDX module doesn't play nicely with kexec(). The TDX
module can only be initialized once during its lifetime, and there is no
ABI to reset the module to give a new clean slate to the new kernel.
Therefore ideally, if the TDX module is ever initialized, it's better
to shut it down. The new kernel won't be able to use TDX anyway (as it
needs to go through the TDX module initialization process which will
fail immediately at the first step).

However, shutting down the TDX module requires all CPUs to be in VMX
operation, but there's no such guarantee as kexec() can happen at any
time (i.e. when KVM is not even loaded). So just do nothing but leave
the TDX module open.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Improved changelog to explain why don't convert TDX private pages back
to normal.

---
arch/x86/kernel/process.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c21b7347a26d..0cc84977dc62 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
*
* Test the CPUID bit directly because the machine might've cleared
* X86_FEATURE_SME due to cmdline options.
+ *
+ * Similar to SME, if the TDX module is ever initialized, the
+ * cachelines associated with any TDX private KeyID must be flushed
+ * before transitioning to the new kernel. The TDX module is
+ * initialized on demand, and reading its status requires taking a
+ * mutex. Instead, just check whether TDX has been enabled by the
+ * BIOS, and flush the cache if so.
*/
- if (cpuid_eax(0x8000001f) & BIT(0))
+ if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
native_wbinvd();
for (;;) {
/*
--
2.38.1


2022-11-21 01:28:11

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros

TDX supports 4K, 2M and 1G page sizes. The corresponding values are
defined by the TDX module spec and used as TDX module ABI. They are
used in try_accept_one() when the TDX guest tries to accept a page;
however, try_accept_one() currently uses hard-coded magic values.

Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one(). TDX host support will need to use them too.

Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:

- Removed the helper to convert kernel page level to TDX page level.
- Changed to use macro to define TDX supported page sizes.

---
arch/x86/coco/tdx/tdx.c | 6 +++---
arch/x86/include/asm/tdx.h | 9 +++++++++
2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index cfd4c95b9f04..7fa7fb54f438 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -722,13 +722,13 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
*/
switch (pg_level) {
case PG_LEVEL_4K:
- page_size = 0;
+ page_size = TDX_PS_4K;
break;
case PG_LEVEL_2M:
- page_size = 1;
+ page_size = TDX_PS_2M;
break;
case PG_LEVEL_1G:
- page_size = 2;
+ page_size = TDX_PS_1G;
break;
default:
return false;
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..e9a3f4a6fba1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,15 @@

#ifndef __ASSEMBLY__

+/*
+ * TDX supported page sizes (4K/2M/1G).
+ *
+ * Those values are part of the TDX module ABI. Do not change them.
+ */
+#define TDX_PS_4K 0
+#define TDX_PS_2M 1
+#define TDX_PS_1G 2
+
/*
* Used to gather the output registers values of the TDCALL and SEAMCALL
* instructions when requesting services from the TDX module.
--
2.38.1


2022-11-21 01:28:37

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 18/20] x86/virt/tdx: Initialize all TDMRs

Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
TDX initialization.

All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
the memory pages can be used by the TDX module. The time to initialize
TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT
internally initializes the PAMT entries using the global KeyID.

To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only
initializes an (implementation-specific) subset of PAMT entries of one
TDMR in one invocation. The caller needs to call TDH.SYS.TDMR.INIT
iteratively until all PAMT entries of the given TDMR are initialized.

TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they
are initializing different TDMRs. To keep it simple, just initialize
all TDMRs one by one. On a 2-socket machine with 2.2GHz CPUs and 64GB
of memory, each TDH.SYS.TDMR.INIT takes roughly a couple of microseconds
on average, and it takes roughly dozens of milliseconds to complete the
initialization of all TDMRs while the system is idle.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Removed need_resched() check. -- Andi.

---
arch/x86/virt/vmx/tdx/tdx.c | 69 ++++++++++++++++++++++++++++++++++---
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 99d1be5941a7..9bcdb30b7a80 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1066,6 +1066,65 @@ static int config_global_keyid(void)
return seamcall_on_each_package_serialized(&sc);
}

+/* Initialize one TDMR */
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+ u64 next;
+
+ /*
+ * Initializing PAMT entries might be time-consuming (in
+ * proportion to the size of the requested TDMR). To avoid long
+ * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
+ * an (implementation-defined) subset of PAMT entries in one
+ * invocation.
+ *
+ * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
+ * of the requested TDMR are initialized (if next-to-initialize
+ * address matches the end address of the TDMR).
+ */
+ do {
+ struct tdx_module_output out;
+ int ret;
+
+ ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
+ &out);
+ if (ret)
+ return ret;
+ /*
+ * RDX contains 'next-to-initialize' address if
+ * TDH.SYS.TDMR.INIT succeeded.
+ */
+ next = out.rdx;
+ /* Allow scheduling when needed */
+ cond_resched();
+ } while (next < tdmr->base + tdmr->size);
+
+ return 0;
+}
+
+/* Initialize all TDMRs */
+static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+ int i;
+
+ /*
+ * Initialize TDMRs one-by-one for simplicity, though the TDX
+ * architecture does allow different TDMRs to be initialized in
+ * parallel on multiple CPUs. Parallel initialization could
+ * be added later when the time spent in the serialized scheme
+ * becomes a real concern.
+ */
+ for (i = 0; i < tdmr_num; i++) {
+ int ret;
+
+ ret = init_tdmr(tdmr_array_entry(tdmr_array, i));
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -1172,11 +1231,11 @@ static int init_tdx_module(void)
if (ret)
goto out_free_pamts;

- /*
- * Return -EINVAL until all steps of TDX module initialization
- * process are done.
- */
- ret = -EINVAL;
+ /* Initialize TDMRs to complete the TDX module initialization */
+ ret = init_tdmrs(tdmr_array, tdmr_num);
+ if (ret)
+ goto out_free_pamts;
+
out_free_pamts:
if (ret) {
/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 768d097412ab..891691b1ea50 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -19,6 +19,7 @@
#define TDH_SYS_INFO 32
#define TDH_SYS_INIT 33
#define TDH_SYS_LP_INIT 35
+#define TDH_SYS_TDMR_INIT 36
#define TDH_SYS_LP_SHUTDOWN 44
#define TDH_SYS_CONFIG 45

--
2.38.1


2022-11-21 01:39:36

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR). During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees. The list of these
ranges, along with TDX module information, is available to the kernel by
querying the TDX module via TDH.SYS.INFO SEAMCALL.

The host kernel can choose whether or not to use all convertible memory
regions as TDX-usable memory. Before the TDX module is ready to create
any TDX guests, the kernel needs to configure the TDX-usable memory
regions by passing an array of "TD Memory Regions" (TDMRs) to the TDX
module. Constructing the TDMR array requires information about both the
TDX module (TDSYSINFO_STRUCT) and the Convertible Memory Regions. Call
TDH.SYS.INFO to get this information in preparation.

Use static variables for both the TDSYSINFO_STRUCT and the CMR array to
avoid having to pass them as function arguments when constructing the
TDMR array. They are too big to be put on the stack anyway. Also, KVM
needs to use the TDSYSINFO_STRUCT to create TDX guests.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Simplified the check of CMRs due to the fact that TDX actually
verifies CMRs (that are passed by the BIOS) before enabling TDX.
- Changed the function name from check_cmrs() -> trim_empty_cmrs().
- Added CMR page aligned check so that later patch can just get the PFN
using ">> PAGE_SHIFT".

v5 -> v6:
- Added to also print TDX module's attribute (Isaku).
- Removed all arguments in tdx_get_sysinfo() to use static variables
of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used
directly in other functions in later patches.
- Added Isaku's Reviewed-by.

- v3 -> v5 (no feedback on v4):
- Renamed sanitize_cmrs() to check_cmrs().
- Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array
actual size returned by TDH.SYS.INFO.
- Changed -EFAULT to -EINVAL in couple places.
- Added comments around tdx_sysinfo and tdx_cmr_array saying they are
used by TDH.SYS.INFO ABI.
- Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function
arguments in tdx_get_sysinfo().
- Changed to only print BIOS-CMR when check_cmrs() fails.

---
arch/x86/virt/vmx/tdx/tdx.c | 125 ++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 61 ++++++++++++++++++
2 files changed, 186 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2cf7090667aa..43227af25e44 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,7 @@
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/atomic.h>
+#include <linux/align.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/apic.h>
@@ -40,6 +41,11 @@ static enum tdx_module_status_t tdx_module_status;
/* Prevent concurrent attempts on TDX detection and initialization */
static DEFINE_MUTEX(tdx_module_lock);

+/* Below two are used in TDH.SYS.INFO SEAMCALL ABI */
+static struct tdsysinfo_struct tdx_sysinfo;
+static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
+static int tdx_cmr_num;
+
/*
* Detect TDX private KeyIDs to see whether TDX has been enabled by the
* BIOS. Both initializing the TDX module and running TDX guest require
@@ -208,6 +214,121 @@ static int tdx_module_init_cpus(void)
return atomic_read(&sc.err);
}

+static inline bool is_cmr_empty(struct cmr_info *cmr)
+{
+ return !cmr->size;
+}
+
+static inline bool is_cmr_ok(struct cmr_info *cmr)
+{
+ /* CMR must be page aligned */
+ return IS_ALIGNED(cmr->base, PAGE_SIZE) &&
+ IS_ALIGNED(cmr->size, PAGE_SIZE);
+}
+
+static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
+ const char *name)
+{
+ int i;
+
+ for (i = 0; i < cmr_num; i++) {
+ struct cmr_info *cmr = &cmr_array[i];
+
+ pr_info("%s : [0x%llx, 0x%llx)\n", name,
+ cmr->base, cmr->base + cmr->size);
+ }
+}
+
+/* Check CMRs reported by TDH.SYS.INFO, and trim tail empty CMRs. */
+static int trim_empty_cmrs(struct cmr_info *cmr_array, int *actual_cmr_num)
+{
+ struct cmr_info *cmr;
+ int i, cmr_num;
+
+ /*
+ * Intel TDX module spec, 20.7.3 CMR_INFO:
+ *
+ * TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
+ * array of CMR_INFO entries. The CMRs are sorted from the
+ * lowest base address to the highest base address, and they
+ * are non-overlapping.
+ *
+ * This implies that the BIOS may generate invalid empty entries
+ * if the total number of CMRs is less than 32. They need to be
+ * skipped manually.
+ *
+ * CMRs also must be 4K aligned. TDX doesn't trust the BIOS: it
+ * actually verifies CMRs before it gets enabled, so anything that
+ * doesn't meet the above indicates a kernel bug (or broken TDX).
+ */
+ /* Set cmr_num before any 'goto err', as the error path prints CMRs */
+ cmr_num = *actual_cmr_num;
+
+ cmr = &cmr_array[0];
+ /* There must be at least one valid CMR */
+ if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr)))
+ goto err;
+
+ for (i = 1; i < cmr_num; i++) {
+ struct cmr_info *cmr = &cmr_array[i];
+ struct cmr_info *prev_cmr = NULL;
+
+ /* Skip further empty CMRs */
+ if (is_cmr_empty(cmr))
+ break;
+
+ /*
+ * Do sanity check anyway to make sure CMRs:
+ * - are 4K aligned
+ * - don't overlap
+ * - are in address ascending order.
+ */
+ if (WARN_ON_ONCE(!is_cmr_ok(cmr)))
+ goto err;
+
+ prev_cmr = &cmr_array[i - 1];
+ if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) >
+ cmr->base))
+ goto err;
+ }
+
+ /* Update the actual number of CMRs */
+ *actual_cmr_num = i;
+
+ /* Print kernel checked CMRs */
+ print_cmrs(cmr_array, *actual_cmr_num, "Kernel-checked-CMR");
+
+ return 0;
+err:
+ pr_info("[TDX broken ?]: Invalid CMRs detected\n");
+ print_cmrs(cmr_array, cmr_num, "BIOS-CMR");
+ return -EINVAL;
+}
+
+static int tdx_get_sysinfo(void)
+{
+ struct tdx_module_output out;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
+
+ ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
+ __pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
+ if (ret)
+ return ret;
+
+ /* R9 contains the number of entries actually written to the CMR array. */
+ tdx_cmr_num = out.r9;
+
+ pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+ tdx_sysinfo.attributes, tdx_sysinfo.vendor_id,
+ tdx_sysinfo.major_version, tdx_sysinfo.minor_version,
+ tdx_sysinfo.build_date, tdx_sysinfo.build_num);
+
+ /*
+ * trim_empty_cmrs() updates the actual number of CMRs by
+ * dropping all tail empty CMRs.
+ */
+ return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -232,6 +353,10 @@ static int init_tdx_module(void)
if (ret)
goto out;

+ ret = tdx_get_sysinfo();
+ if (ret)
+ goto out;
+
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9ba11808bd45..8e273756098c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -15,10 +15,71 @@
/*
* TDX module SEAMCALL leaf functions
*/
+#define TDH_SYS_INFO 32
#define TDH_SYS_INIT 33
#define TDH_SYS_LP_INIT 35
#define TDH_SYS_LP_SHUTDOWN 44

+struct cmr_info {
+ u64 base;
+ u64 size;
+} __packed;
+
+#define MAX_CMRS 32
+#define CMR_INFO_ARRAY_ALIGNMENT 512
+
+struct cpuid_config {
+ u32 leaf;
+ u32 sub_leaf;
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE 1024
+#define TDSYSINFO_STRUCT_ALIGNMENT 1024
+
+struct tdsysinfo_struct {
+ /* TDX-SEAM Module Info */
+ u32 attributes;
+ u32 vendor_id;
+ u32 build_date;
+ u16 build_num;
+ u16 minor_version;
+ u16 major_version;
+ u8 reserved0[14];
+ /* Memory Info */
+ u16 max_tdmrs;
+ u16 max_reserved_per_tdmr;
+ u16 pamt_entry_size;
+ u8 reserved1[10];
+ /* Control Struct Info */
+ u16 tdcs_base_size;
+ u8 reserved2[2];
+ u16 tdvps_base_size;
+ u8 tdvps_xfam_dependent_size;
+ u8 reserved3[9];
+ /* TD Capabilities */
+ u64 attributes_fixed0;
+ u64 attributes_fixed1;
+ u64 xfam_fixed0;
+ u64 xfam_fixed1;
+ u8 reserved4[32];
+ u32 num_cpuid_config;
+ /*
+ * The actual number of CPUID_CONFIG depends on above
+ * 'num_cpuid_config'. The size of 'struct tdsysinfo_struct'
+ * is 1024B defined by TDX architecture. Use a union with
+ * specific padding to make 'sizeof(struct tdsysinfo_struct)'
+ * equal to 1024.
+ */
+ union {
+ struct cpuid_config cpuid_configs[0];
+ u8 reserved5[892];
+ };
+} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
+
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
--
2.38.1


2022-11-21 01:43:26

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

After the array of TDMRs and the global KeyID are configured to the TDX
module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
on all packages.

TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package, and
it cannot run concurrently on different CPUs. Implement a helper to run
the SEAMCALL on one cpu for each package, one package at a time, and use it to
configure the global KeyID on all packages.

Intel hardware doesn't guarantee cache coherency across different
KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
with KeyID 0) before the TDX module uses the global KeyID to access the
PAMT. Following the TDX module specification, flush cache before
configuring the global KeyID on all packages.

Given the PAMT size can be large (~1/256th of system RAM), just use
WBINVD on all CPUs to flush.

Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
used the global KeyID to write any PAMT. Therefore, need to use WBINVD
to flush cache before freeing the PAMTs back to the kernel. Note using
MOVDIR64B (which changes the page's associated KeyID from the old TDX
private KeyID back to KeyID 0, which is used by the kernel) to clear
PAMTs isn't needed, as KeyID 0 doesn't support integrity check.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Improved changelog and comment to explain why MOVDIR64B isn't used
when returning PAMTs back to the kernel.

---
arch/x86/virt/vmx/tdx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 3a032930e58a..99d1be5941a7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -224,6 +224,46 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
on_each_cpu(seamcall_smp_call_function, sc, true);
}

+/*
+ * Call one SEAMCALL on one (any) cpu for each physical package in a
+ * serialized way. Return immediately if the SEAMCALL fails on any cpu.
+ *
+ * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
+ * to be atomic, but for simplicity just reuse it instead of adding
+ * a new one.
+ */
+static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
+{
+ cpumask_var_t packages;
+ int cpu, ret = 0;
+
+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+ return -ENOMEM;
+
+ for_each_online_cpu(cpu) {
+ if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+ packages))
+ continue;
+
+ ret = smp_call_function_single(cpu, seamcall_smp_call_function,
+ sc, true);
+ if (ret)
+ break;
+
+ /*
+ * Doesn't have to use atomic_read(), but it doesn't
+ * hurt either.
+ */
+ ret = atomic_read(&sc->err);
+ if (ret)
+ break;
+ }
+
+ free_cpumask_var(packages);
+ return ret;
+}
+
static int tdx_module_init_cpus(void)
{
struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
@@ -1010,6 +1050,22 @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
return ret;
}

+static int config_global_keyid(void)
+{
+ struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
+
+ /*
+ * Configure the key of the global KeyID on all packages by
+ * calling TDH.SYS.KEY.CONFIG on all packages in a serialized
+ * way as it cannot run concurrently on different CPUs.
+ *
+ * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
+ * a recoverable error). Assume this is exceedingly rare and
+ * just return error if encountered instead of retrying.
+ */
+ return seamcall_on_each_package_serialized(&sc);
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -1098,15 +1154,44 @@ static int init_tdx_module(void)
if (ret)
goto out_free_pamts;

+ /*
+ * Hardware doesn't guarantee cache coherency across different
+ * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
+ * (associated with KeyID 0) before the TDX module can use the
+ * global KeyID to access the PAMT. Given PAMTs are potentially
+ * large (~1/256th of system RAM), just use WBINVD on all cpus
+ * to flush the cache.
+ *
+ * Follow the TDX spec to flush cache before configuring the
+ * global KeyID on all packages.
+ */
+ wbinvd_on_all_cpus();
+
+ /* Config the key of global KeyID on all packages */
+ ret = config_global_keyid();
+ if (ret)
+ goto out_free_pamts;
+
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
*/
ret = -EINVAL;
out_free_pamts:
- if (ret)
+ if (ret) {
+ /*
+ * Part of PAMT may already have been initialized by
+ * TDX module. Flush cache before returning PAMT back
+ * to the kernel.
+ *
+ * Note there's no need to do MOVDIR64B (which changes
+ * the page's associated KeyID from the old TDX private
+ * KeyID back to KeyID 0, which is used by the kernel),
+ * as KeyID 0 doesn't support integrity check.
+ */
+ wbinvd_on_all_cpus();
tdmrs_free_pamt_all(tdmr_array, tdmr_num);
- else
+ } else
pr_info("%lu pages allocated for PAMT.\n",
tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
out_free_tdmrs:
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index c26bab2555ca..768d097412ab 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -15,6 +15,7 @@
/*
* TDX module SEAMCALL leaf functions
*/
+#define TDH_SYS_KEY_CONFIG 31
#define TDH_SYS_INFO 32
#define TDH_SYS_INIT 33
#define TDH_SYS_LP_INIT 35
--
2.38.1


2022-11-21 01:45:24

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 15/20] x86/virt/tdx: Reserve TDX module global KeyID

TDX module initialization requires using one TDX private KeyID as the
global KeyID to protect the TDX module metadata. The global KeyID is
configured in the TDX module along with the TDMRs.

Just reserve the first TDX private KeyID as the global KeyID. Keep the
global KeyID in a static variable, as KVM will need to use it too.

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1fbf33f2f210..e2cbeeb7f0dc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -62,6 +62,9 @@ static int tdx_cmr_num;
/* All TDX-usable memory regions */
static LIST_HEAD(tdx_memlist);

+/* TDX module global KeyID. Used in TDH.SYS.CONFIG ABI. */
+static u32 tdx_global_keyid;
+
/*
* Detect TDX private KeyIDs to see whether TDX has been enabled by the
* BIOS. Both initializing the TDX module and running TDX guest require
@@ -1053,6 +1056,12 @@ static int init_tdx_module(void)
if (ret)
goto out_free_tdmrs;

+ /*
+ * Reserve the first TDX KeyID as global KeyID to protect
+ * TDX module metadata.
+ */
+ tdx_global_keyid = tdx_keyid_start;
+
/*
* Return -EINVAL until all steps of TDX module initialization
* process are done.
--
2.38.1


2022-11-21 01:46:22

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection of VMs. The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.
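
For example, if the BIOS configures 15 MKTME KeyIDs and 48 TDX private
KeyIDs, then KeyID 0 is used by TME, KeyIDs [1, 16) are legacy MKTME
KeyIDs, and KeyIDs [16, 64) are the TDX private KeyIDs.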

TDX doesn't trust the BIOS. During machine boot, TDX verifies the TDX
private KeyIDs are consistently and correctly programmed by the BIOS
across all CPU packages before it enables TDX on any CPU core. A valid
TDX private KeyID range on the BSP indicates TDX has been enabled by
the BIOS; otherwise, the BIOS is buggy.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests. The TDX module will be initialized at
runtime by the user (i.e. KVM) on demand.

Add a new early_initcall(tdx_init) to do TDX early boot initialization.
Only detect TDX private KeyIDs for now. Some other early checks will
follow. Also add a new function to report whether TDX has been
enabled by the BIOS (i.e. the TDX private KeyID range is valid). Kexec()
will also need it to determine whether it needs to flush dirty cachelines
that are
associated with any TDX private KeyIDs before booting to the new kernel.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support). So far KVM is the only user of TDX. Make the new config
option depend on KVM_INTEL.

Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- No change.

v5 -> v6:
- Removed SEAMRR detection to make code simpler.
- Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
- Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).


---
arch/x86/Kconfig | 12 +++++
arch/x86/Makefile | 2 +
arch/x86/include/asm/tdx.h | 7 +++
arch/x86/virt/Makefile | 2 +
arch/x86/virt/vmx/Makefile | 2 +
arch/x86/virt/vmx/tdx/Makefile | 2 +
arch/x86/virt/vmx/tdx/tdx.c | 95 ++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 15 ++++++
8 files changed, 137 insertions(+)
create mode 100644 arch/x86/virt/Makefile
create mode 100644 arch/x86/virt/vmx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..cced4ef3bfb2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1953,6 +1953,18 @@ config X86_SGX

If unsure, say N.

+config INTEL_TDX_HOST
+ bool "Intel Trust Domain Extensions (TDX) host support"
+ depends on CPU_SUP_INTEL
+ depends on X86_64
+ depends on KVM_INTEL
+ help
+ Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+ host and certain physical attacks. This option enables necessary TDX
+ support in host kernel to run protected VMs.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 415a5d138de4..38d3e8addc5f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -246,6 +246,8 @@ archheaders:

libs-y += arch/x86/lib/

+core-y += arch/x86/virt/
+
# drivers-y are linked after core-y
drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
drivers-$(CONFIG_PCI) += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e9a3f4a6fba1..51c4222a13ae 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -98,5 +98,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
return -ENODEV;
}
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else /* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST) += tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..93ca8b73e1f1
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..982d9c453b6b
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trust Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+#include "tdx.h"
+
+static u32 tdx_keyid_start __ro_after_init;
+static u32 tdx_keyid_num __ro_after_init;
+
+/*
+ * Detect TDX private KeyIDs to see whether TDX has been enabled by the
+ * BIOS. Both initializing the TDX module and running TDX guest require
+ * TDX private KeyID.
+ *
+ * TDX doesn't trust BIOS. TDX verifies all configurations from BIOS
+ * are correct before enabling TDX on any core. TDX requires the BIOS
+ * to correctly and consistently program TDX private KeyIDs on all CPU
+ * packages. Unless there is a BIOS bug, detecting a valid TDX private
+ * KeyID range on the BSP indicates TDX has been enabled by the BIOS.
+ * If there's such a BIOS bug, it will be caught later when initializing the
+ * TDX module.
+ */
+static int __init detect_tdx(void)
+{
+ int ret;
+
+ /*
+ * IA32_MKTME_KEYID_PARTITIONING:
+ * Bit [31:0]: Number of MKTME KeyIDs.
+ * Bit [63:32]: Number of TDX private KeyIDs.
+ */
+ ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &tdx_keyid_start,
+ &tdx_keyid_num);
+ if (ret)
+ return -ENODEV;
+
+ if (!tdx_keyid_num)
+ return -ENODEV;
+
+ /*
+ * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
+ * KeyIDs start after the last MKTME KeyID.
+ */
+ tdx_keyid_start++;
+
+ pr_info("TDX enabled by BIOS. TDX private KeyID range: [%u, %u)\n",
+ tdx_keyid_start, tdx_keyid_start + tdx_keyid_num);
+
+ return 0;
+}
+
+static void __init clear_tdx(void)
+{
+ tdx_keyid_start = tdx_keyid_num = 0;
+}
+
+static int __init tdx_init(void)
+{
+ if (detect_tdx())
+ return -ENODEV;
+
+ /*
+ * Initializing the TDX module requires one TDX private KeyID.
+ * If there's only one TDX KeyID then after module initialization
+ * KVM won't be able to run any TDX guest, which makes the whole
+ * thing worthless. Just disable TDX in this case.
+ */
+ if (tdx_keyid_num < 2) {
+ pr_info("Disable TDX as there's only one TDX private KeyID available.\n");
+ goto no_tdx;
+ }
+
+ return 0;
+no_tdx:
+ clear_tdx();
+ return -ENODEV;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+ return !!tdx_keyid_num;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..d00074abcb20
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability. The
+ * architectural definitions come first.
+ */
+
+/* MSR to report KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
+
+#endif
--
2.38.1


2022-11-21 01:46:44

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
[1]. This bug allows an attacker to use the APIC MMIO interface to
extract data from the SGX enclave.

TDX is not immune to this either. Check X2APIC early and disable TDX
if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC.

[1]: https://aepicleak.com/aepicleak.pdf

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- Changed to use "Link" for the two lore links to get rid of checkpatch
warning.

---
arch/x86/Kconfig | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cced4ef3bfb2..dd333b46fafb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
depends on CPU_SUP_INTEL
depends on X86_64
depends on KVM_INTEL
+ depends on X86_X2APIC
help
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 982d9c453b6b..8d943bdc8335 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -12,6 +12,7 @@
#include <linux/printk.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
+#include <asm/apic.h>
#include <asm/tdx.h>
#include "tdx.h"

@@ -81,6 +82,16 @@ static int __init tdx_init(void)
goto no_tdx;
}

+ /*
+ * TDX requires X2APIC being enabled to prevent potential data
+ * leak via APIC MMIO registers. Just disable TDX if not using
+ * X2APIC.
+ */
+ if (!x2apic_enabled()) {
+ pr_info("Disable TDX as X2APIC is not enabled.\n");
+ goto no_tdx;
+ }
+
return 0;
no_tdx:
clear_tdx();
--
2.38.1


2022-11-21 01:48:24

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

Before the TDX module can be used to create and run TDX guests, it must
be loaded and properly initialized. The TDX module is expected to be
loaded by the BIOS, and to be initialized by the kernel.

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). The host
kernel communicates with the TDX module via a new SEAMCALL instruction.
The TDX module implements a set of SEAMCALL leaf functions to allow the
host kernel to initialize it.

The TDX module can be initialized only once in its lifetime. Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach: defer initializing TDX until there is a real need
(e.g. when requested by KVM). This approach has the following advantages:

1) It avoids consuming the memory that must be allocated by the kernel
and given to the TDX module as metadata (~1/256th of the TDX-usable
memory), and also saves the CPU cycles of initializing the TDX module
(and the metadata) when TDX is not used at all.

2) It is more flexible in supporting TDX module runtime updates in the
future (after updating the TDX module, it needs to be initialized
again).

3) It avoids having to do a "temporary" solution to handle VMXON in the
core (non-KVM) kernel for now. This is because SEAMCALL requires the
CPU to be in VMX operation (i.e. VMXON has been done), but currently
only KVM handles VMXON. Adding VMXON support to the core kernel isn't
trivial. More importantly, in the long term a reference-based approach
will likely be needed in the core kernel, as more kernel components are
likely to need TDX support as well. Allowing KVM to initialize the TDX
module avoids having to handle VMXON during kernel boot for now.

Add a placeholder tdx_enable() to detect and initialize the TDX module
on demand, with a state machine protected by mutex to support concurrent
calls from multiple callers.
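
As a usage sketch (illustrative only -- the real consumer is the separate
KVM TDX series), a caller would do something like:

	/*
	 * Sketch: the caller must ensure all CPUs are online and in
	 * VMX operation (VMXON has been done) before calling.
	 */
	ret = tdx_enable();
	if (ret)
		return ret;	/* TDX not usable */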

The TDX module will be initialized in multiple steps defined by the TDX
module (a rough sketch follows the list below):

1) Global initialization;
2) Logical-CPU scope initialization;
3) Enumerate the TDX module capabilities and platform configuration;
4) Configure the TDX module about TDX usable memory ranges and global
KeyID information;
5) Package-scope configuration for the global KeyID;
6) Initialize usable memory ranges based on 4).
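
As a rough sketch (not part of this patch; all helper names below are
illustrative assumptions for how later patches might fill in the
init_tdx_module() placeholder):

static int init_tdx_module(void)
{
	int ret;

	/* 1) Global initialization (TDH.SYS.INIT) */
	ret = tdx_module_init_global();
	if (ret)
		return ret;

	/* 2) Logical-CPU scope initialization (TDH.SYS.LP.INIT) */
	ret = tdx_module_init_cpus();
	if (ret)
		return ret;

	/* 3) Enumerate module capabilities and platform configuration */
	ret = tdx_get_sysinfo();
	if (ret)
		return ret;

	/* 4) Configure the module: usable memory ranges and global KeyID */
	ret = config_tdx_module();
	if (ret)
		return ret;

	/* 5) Package-scope configuration for the global KeyID */
	ret = config_global_keyid();
	if (ret)
		return ret;

	/* 6) Initialize the usable memory ranges chosen in 4) */
	return init_tdmrs();
}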

The TDX module can also be shut down at any time during its lifetime.
In case of any error during the initialization process, shut down the
module. It's pointless to leave the module in any intermediate state
during the initialization.

Both logical CPU scope initialization and shutting down the TDX module
require calling SEAMCALL on all boot-time present CPUs. For simplicity
just temporarily disable CPU hotplug during the module initialization.

Note TDX architecturally doesn't support physical CPU hot-add/removal.
A non-buggy BIOS should never support ACPI CPU hot-add/removal. This
implementation doesn't explicitly handle ACPI CPU hot-add/removal but
depends on the BIOS to do the right thing.

Reviewed-by: Chao Gao <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- No change.

v5 -> v6:
- Added code to set status to TDX_MODULE_NONE if TDX module is not
loaded (Chao)
- Added Chao's Reviewed-by.
- Improved comments around cpus_read_lock().

- v3->v5 (no feedback on v4):
- Removed the check that SEAMRR and TDX KeyID have been detected on
all present cpus.
- Removed tdx_detect().
- Added num_online_cpus() to MADT-enabled CPUs check within the CPU
hotplug lock and return early with error message.
- Improved dmesg printing for TDX module detection and initialization.

---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/virt/vmx/tdx/tdx.c | 150 ++++++++++++++++++++++++++++++++++++
2 files changed, 152 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 51c4222a13ae..05fc89d9742a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -101,8 +101,10 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,

#ifdef CONFIG_INTEL_TDX_HOST
bool platform_tdx_enabled(void);
+int tdx_enable(void);
#else /* !CONFIG_INTEL_TDX_HOST */
static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_enable(void) { return -ENODEV; }
#endif /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8d943bdc8335..28c187b8726f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -10,15 +10,34 @@
#include <linux/types.h>
#include <linux/init.h>
#include <linux/printk.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/apic.h>
#include <asm/tdx.h>
#include "tdx.h"

+/* TDX module status during initialization */
+enum tdx_module_status_t {
+ /* TDX module hasn't been detected and initialized */
+ TDX_MODULE_UNKNOWN,
+ /* TDX module is not loaded */
+ TDX_MODULE_NONE,
+ /* TDX module is initialized */
+ TDX_MODULE_INITIALIZED,
+ /* TDX module is shut down due to initialization error */
+ TDX_MODULE_SHUTDOWN,
+};
+
static u32 tdx_keyid_start __ro_after_init;
static u32 tdx_keyid_num __ro_after_init;

+static enum tdx_module_status_t tdx_module_status;
+/* Prevent concurrent attempts on TDX detection and initialization */
+static DEFINE_MUTEX(tdx_module_lock);
+
/*
* Detect TDX private KeyIDs to see whether TDX has been enabled by the
* BIOS. Both initializing the TDX module and running TDX guest require
@@ -104,3 +123,134 @@ bool platform_tdx_enabled(void)
{
return !!tdx_keyid_num;
}
+
+/*
+ * Detect and initialize the TDX module.
+ *
+ * Return -ENODEV when the TDX module is not loaded, 0 when it
+ * is successfully initialized, or other error when it fails to
+ * initialize.
+ */
+static int init_tdx_module(void)
+{
+ /* The TDX module hasn't been detected */
+ return -ENODEV;
+}
+
+static void shutdown_tdx_module(void)
+{
+ /* TODO: Shut down the TDX module */
+}
+
+static int __tdx_enable(void)
+{
+ int ret;
+
+ /*
+ * Initializing the TDX module requires doing SEAMCALL on all
+ * boot-time present CPUs. For simplicity temporarily disable
+ * CPU hotplug to prevent any CPU from going offline during
+ * the initialization.
+ */
+ cpus_read_lock();
+
+ /*
+ * Check whether all boot-time present CPUs are online and
+ * return early with a message so the user can be aware.
+ *
+ * Note a non-buggy BIOS should never support physical (ACPI)
+ * CPU hotplug when TDX is enabled, and all boot-time present
+ * CPUs should be enabled in MADT, so there should be no
+ * disabled_cpus and num_processors won't change at runtime
+ * either.
+ */
+ if (disabled_cpus || num_online_cpus() != num_processors) {
+ pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = init_tdx_module();
+ if (ret == -ENODEV) {
+ pr_info("TDX module is not loaded.\n");
+ tdx_module_status = TDX_MODULE_NONE;
+ goto out;
+ }
+
+ /*
+ * Shut down the TDX module in case of any error during the
+ * initialization process. It's meaningless to leave the TDX
+ * module in any intermediate state of the initialization process.
+ *
+ * Shutting down the module also requires doing SEAMCALL on all
+ * MADT-enabled CPUs. Do it while CPU hotplug is disabled.
+ *
+ * Return all errors during the initialization as -EFAULT as the
+ * module is always shut down.
+ */
+ if (ret) {
+ pr_info("Failed to initialize TDX module. Shut it down.\n");
+ shutdown_tdx_module();
+ tdx_module_status = TDX_MODULE_SHUTDOWN;
+ ret = -EFAULT;
+ goto out;
+ }
+
+ pr_info("TDX module initialized.\n");
+ tdx_module_status = TDX_MODULE_INITIALIZED;
+out:
+ cpus_read_unlock();
+
+ return ret;
+}
+
+/**
+ * tdx_enable - Enable TDX by initializing the TDX module
+ *
+ * The caller must make sure all CPUs are online and in VMX operation
+ * before calling this function. CPU hotplug is temporarily disabled
+ * internally to prevent any CPU from going offline.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * 0: The TDX module has been successfully initialized.
+ * * -ENODEV: The TDX module is not loaded, or TDX is not supported.
+ * * -EINVAL: The TDX module cannot be initialized because certain
+ * conditions are not met (e.g. when not all MADT-enabled
+ * CPUs are online).
+ * * -EFAULT: Other internal fatal errors, or the TDX module is in
+ * shutdown mode due to it failed to initialize in previous
+ * attempts.
+ */
+int tdx_enable(void)
+{
+ int ret;
+
+ if (!platform_tdx_enabled())
+ return -ENODEV;
+
+ mutex_lock(&tdx_module_lock);
+
+ switch (tdx_module_status) {
+ case TDX_MODULE_UNKNOWN:
+ ret = __tdx_enable();
+ break;
+ case TDX_MODULE_NONE:
+ ret = -ENODEV;
+ break;
+ case TDX_MODULE_INITIALIZED:
+ ret = 0;
+ break;
+ default:
+ WARN_ON_ONCE(tdx_module_status != TDX_MODULE_SHUTDOWN);
+ ret = -EFAULT;
+ break;
+ }
+
+ mutex_unlock(&tdx_module_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
--
2.38.1


2022-11-21 01:48:59

by Kai Huang

[permalink] [raw]
Subject: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

TDX supports shutting down the TDX module at any time during its
lifetime. After the module is shut down, no further TDX module SEAMCALL
leaf function calls can be made to the module on any logical cpu.

Shut down the TDX module in case of any error during the initialization
process. It's pointless to leave the TDX module in some intermediate state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
CPUs and use it to shut down the module. Later logical-cpu scope module
initialization will use it too.
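
As a rough sketch (not part of this patch; the TDH_SYS_LP_INIT leaf
number and the function name are assumptions), the later logical-cpu
scope initialization could reuse the same mechanism:

static int tdx_module_init_cpus(void)
{
	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };

	seamcall_on_each_cpu(&sc);

	return atomic_read(&sc.err);
}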

Reviewed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
---

v6 -> v7:
- No change.

v5 -> v6:
- Removed the seamcall() wrapper to previous patch (Dave).

- v3 -> v5 (no feedback on v4):
- Added a wrapper of __seamcall() to print error code if SEAMCALL fails.
- Made the seamcall_on_each_cpu() void.
- Removed 'seamcall_ret' and 'tdx_module_out' from
'struct seamcall_ctx', as they must be local variable.
- Added the comments to tdx_init() and one paragraph to changelog to
explain the caller should handle VMXON.
- Called out after shut down, no "TDX module" SEAMCALL can be made.

---
arch/x86/virt/vmx/tdx/tdx.c | 43 +++++++++++++++++++++++++++++++++----
arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b06c1a2bc9cb..5db1a05cb4bd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,8 @@
#include <linux/mutex.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/apic.h>
@@ -124,15 +126,27 @@ bool platform_tdx_enabled(void)
return !!tdx_keyid_num;
}

+/*
+ * Data structure to make SEAMCALL on multiple CPUs concurrently.
+ * @err is set to -EFAULT when SEAMCALL fails on any cpu.
+ */
+struct seamcall_ctx {
+ u64 fn;
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ atomic_t err;
+};
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
* leaf function return code and the additional output respectively if
* not NULL.
*/
-static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
- u64 *seamcall_ret,
- struct tdx_module_output *out)
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ u64 *seamcall_ret, struct tdx_module_output *out)
{
u64 sret;

@@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
}
}

+static void seamcall_smp_call_function(void *data)
+{
+ struct seamcall_ctx *sc = data;
+ int ret;
+
+ ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
+ if (ret)
+ atomic_set(&sc->err, -EFAULT);
+}
+
+/*
+ * Call the SEAMCALL on all online CPUs concurrently. The caller must
+ * check @sc->err to determine whether any SEAMCALL failed on any cpu.
+ */
+static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+ on_each_cpu(seamcall_smp_call_function, sc, true);
+}
+
/*
* Detect and initialize the TDX module.
*
@@ -181,7 +214,9 @@ static int init_tdx_module(void)

static void shutdown_tdx_module(void)
{
- /* TODO: Shut down the TDX module */
+ struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+ seamcall_on_each_cpu(&sc);
}

static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 92a8de957dc7..215cc1065d78 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -12,6 +12,11 @@
/* MSR to report KeyID partitioning between MKTME and TDX */
#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087

+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN 44
+
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
--
2.38.1


Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot



On 11/20/22 4:26 PM, Kai Huang wrote:
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. A CPU-attested software module
> called 'the TDX module' runs inside a new isolated memory range as a
> trusted hypervisor to manage and run protected VMs.
>
> Pre-TDX Intel hardware has support for a memory encryption architecture
> called MKTME. The memory encryption hardware underpinning MKTME is also
> used for Intel TDX. TDX ends up "stealing" some of the physical address
> space from the MKTME architecture for crypto-protection to VMs. The
> BIOS is responsible for partitioning the "KeyID" space between legacy
> MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
> KeyIDs' or 'TDX KeyIDs' for short.
>
> TDX doesn't trust the BIOS. During machine boot, TDX verifies the TDX
> private KeyIDs are consistently and correctly programmed by the BIOS
> across all CPU packages before it enables TDX on any CPU core. A valid
> TDX private KeyID range on BSP indicates TDX has been enabled by the
> BIOS, otherwise the BIOS is buggy.
>
> The TDX module is expected to be loaded by the BIOS when it enables TDX,
> but the kernel needs to properly initialize it before it can be used to
> create and run any TDX guests. The TDX module will be initialized at
> runtime by the user (i.e. KVM) on demand.
>
> Add a new early_initcall(tdx_init) to do TDX early boot initialization.
> Only detect TDX private KeyIDs for now. Some other early checks will
> follow up. Also add a new function to report whether TDX has been
> enabled by BIOS (TDX private KeyID range is valid). Kexec() will also
> need it to determine whether it needs to flush dirty cachelines that are
> associated with any TDX private KeyIDs before booting to the new kernel.
>
> To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
> TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
> to opt in to TDX host kernel support (to distinguish it from TDX guest
> kernel support). So far KVM is the only user of TDX. Make the new config
> option depend on KVM_INTEL.
>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v6 -> v7:
> - No change.
>
> v5 -> v6:
> - Removed SEAMRR detection to make code simpler.
> - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
> - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).
>
>
> ---
> arch/x86/Kconfig | 12 +++++
> arch/x86/Makefile | 2 +
> arch/x86/include/asm/tdx.h | 7 +++
> arch/x86/virt/Makefile | 2 +
> arch/x86/virt/vmx/Makefile | 2 +
> arch/x86/virt/vmx/tdx/Makefile | 2 +
> arch/x86/virt/vmx/tdx/tdx.c | 95 ++++++++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 15 ++++++
> 8 files changed, 137 insertions(+)
> create mode 100644 arch/x86/virt/Makefile
> create mode 100644 arch/x86/virt/vmx/Makefile
> create mode 100644 arch/x86/virt/vmx/tdx/Makefile
> create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
> create mode 100644 arch/x86/virt/vmx/tdx/tdx.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 67745ceab0db..cced4ef3bfb2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1953,6 +1953,18 @@ config X86_SGX
>
> If unsure, say N.
>
> +config INTEL_TDX_HOST
> + bool "Intel Trust Domain Extensions (TDX) host support"
> + depends on CPU_SUP_INTEL
> + depends on X86_64
> + depends on KVM_INTEL
> + help
> + Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> + host and certain physical attacks. This option enables necessary TDX
> + support in host kernel to run protected VMs.
> +
> + If unsure, say N.
> +
> config EFI
> bool "EFI runtime service support"
> depends on ACPI
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 415a5d138de4..38d3e8addc5f 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -246,6 +246,8 @@ archheaders:
>
> libs-y += arch/x86/lib/
>
> +core-y += arch/x86/virt/
> +
> # drivers-y are linked after core-y
> drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
> drivers-$(CONFIG_PCI) += arch/x86/pci/
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index e9a3f4a6fba1..51c4222a13ae 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -98,5 +98,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> return -ENODEV;
> }
> #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +bool platform_tdx_enabled(void);
> +#else /* !CONFIG_INTEL_TDX_HOST */
> +static inline bool platform_tdx_enabled(void) { return false; }
> +#endif /* CONFIG_INTEL_TDX_HOST */
> +
> #endif /* !__ASSEMBLY__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
> new file mode 100644
> index 000000000000..1e36502cd738
> --- /dev/null
> +++ b/arch/x86/virt/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += vmx/
> diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
> new file mode 100644
> index 000000000000..feebda21d793
> --- /dev/null
> +++ b/arch/x86/virt/vmx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_INTEL_TDX_HOST) += tdx/
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> new file mode 100644
> index 000000000000..93ca8b73e1f1
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += tdx.o
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..982d9c453b6b
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,95 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2022 Intel Corporation.
> + *
> + * Intel Trust Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/printk.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/tdx.h>
> +#include "tdx.h"
> +
> +static u32 tdx_keyid_start __ro_after_init;
> +static u32 tdx_keyid_num __ro_after_init;
> +
> +/*
> + * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> + * BIOS. Both initializing the TDX module and running TDX guest require
> + * TDX private KeyID.
> + *
> + * TDX doesn't trust BIOS. TDX verifies all configurations from BIOS
> + * are correct before enabling TDX on any core. TDX requires the BIOS
> + * to correctly and consistently program TDX private KeyIDs on all CPU
> + * packages. Unless there is a BIOS bug, detecting a valid TDX private
> + * KeyID range on BSP indicates TDX has been enabled by the BIOS. If
> + * there's such BIOS bug, it will be caught later when initializing the
> + * TDX module.
> + */
> +static int __init detect_tdx(void)
> +{
> + int ret;
> +
> + /*
> + * IA32_MKTME_KEYID_PARTITIONING:
> + * Bit [31:0]: Number of MKTME KeyIDs.
> + * Bit [63:32]: Number of TDX private KeyIDs.
> + */
> + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &tdx_keyid_start,
> + &tdx_keyid_num);
> + if (ret)
> + return -ENODEV;
> +
> + if (!tdx_keyid_num)
> + return -ENODEV;
> +
> + /*
> + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
> + * KeyIDs start after the last MKTME KeyID.
> + */
> + tdx_keyid_start++;

It is not very clear why you increment tdx_keyid_start. What we read from
MSR_IA32_MKTME_KEYID_PARTITIONING is not the correct start keyid?

Also why is this global variable? At least in this patch, there seems to
be no use case.

> +
> + pr_info("TDX enabled by BIOS. TDX private KeyID range: [%u, %u)\n",
> + tdx_keyid_start, tdx_keyid_start + tdx_keyid_num);
> +
> + return 0;
> +}
> +
> +static void __init clear_tdx(void)
> +{
> + tdx_keyid_start = tdx_keyid_num = 0;
> +}
> +
> +static int __init tdx_init(void)
> +{
> + if (detect_tdx())
> + return -ENODEV;
> +
> + /*
> + * Initializing the TDX module requires one TDX private KeyID.
> + * If there's only one TDX KeyID then after module initialization
> + * KVM won't be able to run any TDX guest, which makes the whole
> + * thing worthless. Just disable TDX in this case.
> + */
> + if (tdx_keyid_num < 2) {
> + pr_info("Disable TDX as there's only one TDX private KeyID available.\n");
> + goto no_tdx;
> + }
> +
> + return 0;
> +no_tdx:
> + clear_tdx();
> + return -ENODEV;
> +}
> +early_initcall(tdx_init);
> +
> +/* Return whether the BIOS has enabled TDX */
> +bool platform_tdx_enabled(void)
> +{
> + return !!tdx_keyid_num;
> +}
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..d00074abcb20
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +/*
> + * This file contains both macros and data structures defined by the TDX
> + * architecture and Linux defined software data structures and functions.
> + * The two should not be mixed together for better readability. The
> + * architectural definitions come first.
> + */
> +
> +/* MSR to report KeyID partitioning between MKTME and TDX */
> +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
> +
> +#endif

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros



On 11/20/22 4:26 PM, Kai Huang wrote:
> +/*
> + * TDX supported page sizes (4K/2M/1G).
> + *
> + * Those values are part of the TDX module ABI. Do not change them.

It would be better if you include specification version and section
title.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled



On 11/20/22 4:26 PM, Kai Huang wrote:
> The MMIO/xAPIC interface has some problems, most notably the APIC LEAK

"some problems" looks more generic. May be we can be specific here. Like
it has security issues?

> [1]. This bug allows an attacker to use the APIC MMIO interface to
> extract data from the SGX enclave.
>
> TDX is not immune from this either. Early check X2APIC and disable TDX
> if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC.
>
> [1]: https://aepicleak.com/aepicleak.pdf
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v6 -> v7:
> - Changed to use "Link" for the two lore links to get rid of checkpatch
> warning.
>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cced4ef3bfb2..dd333b46fafb 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
> depends on CPU_SUP_INTEL
> depends on X86_64
> depends on KVM_INTEL
> + depends on X86_X2APIC
> help
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 982d9c453b6b..8d943bdc8335 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -12,6 +12,7 @@
> #include <linux/printk.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> +#include <asm/apic.h>
> #include <asm/tdx.h>
> #include "tdx.h"
>
> @@ -81,6 +82,16 @@ static int __init tdx_init(void)
> goto no_tdx;
> }
>
> + /*
> + * TDX requires X2APIC being enabled to prevent potential data
> + * leak via APIC MMIO registers. Just disable TDX if not using
> + * X2APIC.

Remove the double space.

> + */
> + if (!x2apic_enabled()) {
> + pr_info("Disable TDX as X2APIC is not enabled.\n");

pr_warn()?

> + goto no_tdx;
> + }
> +
> return 0;
> no_tdx:
> clear_tdx();

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-11-21 06:38:13

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

Kai Huang <[email protected]> writes:

> TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
> memory regions that can possibly be used by the TDX module, but they are
> not automatically usable by the TDX module. As a step of initializing
> the TDX module, the kernel needs to choose a list of memory regions (out
> from convertible memory regions) that the TDX module can use and pass
> those regions to the TDX module. Once this is done, those "TDX-usable"
> memory regions are fixed during module's lifetime. No more TDX-usable
> memory can be added to the TDX module after that.
>
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator. To keep things simple, this initial
> implementation simply guarantees all pages in the page allocator are TDX
> memory. To achieve this, use all system memory in the core-mm at the
> time of initializing the TDX module as TDX memory, and at the same time,
> refuse to add any non-TDX-memory in the memory hotplug.
>
> Specifically, walk through all memory regions managed by memblock and
> add them to a global list of "TDX-usable" memory regions, which is a
> fixed list after the module initialization (or empty if initialization
> fails). To reject non-TDX-memory in memory hotplug, add an additional
> check in arch_add_memory() to check whether the new region is covered by
> any region in the "TDX-usable" memory region list.
>
> Note this requires all memory regions in memblock are TDX convertible
> memory when initializing the TDX module. This is true in practice if no
> new memory has been hot-added before initializing the TDX module, since
> in practice all boot-time present DIMMs are TDX convertible memory. If
> any new memory has been hot-added, then initializing the TDX module will
> fail because that memory region is not covered by any CMR.
>
> This can be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
>
> Note TDX assumes convertible memory is always physically present during
> machine's runtime. A non-buggy BIOS should never support hot-removal of
> any convertible memory. This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v6 -> v7:
> - Changed to use all system memory in memblock at the time of
> initializing the TDX module as TDX memory
> - Added memory hotplug support
>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/tdx.h | 3 +
> arch/x86/mm/init_64.c | 10 ++
> arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
> 4 files changed, 197 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> depends on X86_64
> depends on KVM_INTEL
> depends on X86_X2APIC
> + select ARCH_KEEP_MEMBLOCK
> help
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d688228f3151..71169ecefabf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> #ifdef CONFIG_INTEL_TDX_HOST
> bool platform_tdx_enabled(void);
> int tdx_enable(void);
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
> #else /* !CONFIG_INTEL_TDX_HOST */
> static inline bool platform_tdx_enabled(void) { return false; }
> static inline int tdx_enable(void) { return -ENODEV; }
> +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> + unsigned long end_pfn) { return true; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..900341333d7e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
> #include <asm/uv/uv.h>
> #include <asm/setup.h>
> #include <asm/ftrace.h>
> +#include <asm/tdx.h>
>
> #include "mm_internal.h"
>
> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> + /*
> + * For now if TDX is enabled, all pages in the page allocator
> + * must be TDX memory, which is a fixed set of memory regions
> + * that are passed to the TDX module. Reject the new region
> + * if it is not TDX memory to guarantee the above is true.
> + */
> + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> + return -EINVAL;
> +
> init_memory_mapping(start, start + size, params->pgprot);
>
> return add_pages(nid, start_pfn, nr_pages, params);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 43227af25e44..32af86e31c47 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -16,6 +16,11 @@
> #include <linux/smp.h>
> #include <linux/atomic.h>
> #include <linux/align.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/apic.h>
> @@ -34,6 +39,13 @@ enum tdx_module_status_t {
> TDX_MODULE_SHUTDOWN,
> };
>
> +struct tdx_memblock {
> + struct list_head list;
> + unsigned long start_pfn;

Why not use 'phys_addr_t'?

> + unsigned long end_pfn;
> + int nid;
> +};
> +
> static u32 tdx_keyid_start __ro_after_init;
> static u32 tdx_keyid_num __ro_after_init;
>
> @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
> static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> static int tdx_cmr_num;
>
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
> /*
> * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> * BIOS. Both initializing the TDX module and running TDX guest require
> @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
> return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> }
>
> +/* Check whether the given pfn range is covered by any CMR or not. */
> +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> + unsigned long end_pfn)
> +{
> + int i;
> +
> + for (i = 0; i < tdx_cmr_num; i++) {
> + struct cmr_info *cmr = &tdx_cmr_array[i];
> + unsigned long cmr_start_pfn;
> + unsigned long cmr_end_pfn;
> +
> + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;

Why not use PHYS_PFN() or PFN_DOWN()? And PFN_PHYS() in reverse direction?

> +
> + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Add a memory region on a given node as a TDX memory block. The
> + * caller must make sure all memory regions are added in address
> + * ascending order and don't overlap.
> + */
> +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> + int nid)
> +{
> + struct tdx_memblock *tmb;
> +
> + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> + if (!tmb)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&tmb->list);
> + tmb->start_pfn = start_pfn;
> + tmb->end_pfn = end_pfn;
> + tmb->nid = nid;
> +
> + list_add_tail(&tmb->list, &tdx_memlist);
> + return 0;
> +}
> +
> +static void free_tdx_memory(void)
> +{
> + while (!list_empty(&tdx_memlist)) {
> + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> + struct tdx_memblock, list);
> +
> + list_del(&tmb->list);
> + kfree(tmb);
> + }
> +}
> +
> +/*
> + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> + * Must be called when get_online_mems() is called by the caller.
> + */
> +static int build_tdx_memory(void)
> +{
> + unsigned long start_pfn, end_pfn;
> + int i, nid, ret;
> +
> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> + /*
> + * The first 1MB may not be reported as TDX convertible
> + * memory. Manually exclude it from TDX memory.
> + *
> + * This is fine as the first 1MB is already reserved in
> + * reserve_real_mode() and won't end up in ZONE_DMA as
> + * free pages anyway.
> + */
> + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> + if (start_pfn >= end_pfn)
> + continue;

How about check whether first 1MB is reserved instead of depending on
the corresponding code isn't changed? Via for_each_reserved_mem_range()?

> +
> + /* Verify memory is truly TDX convertible memory */
> + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
> + start_pfn << PAGE_SHIFT,
> + end_pfn << PAGE_SHIFT);
> + return -EINVAL;
> + }
> +
> + /*
> + * Add the memory regions as TDX memory. The regions in
> + * memblock has already guaranteed they are in address
> + * ascending order and don't overlap.
> + */
> + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> + if (ret)
> + goto err;
> + }
> +
> + return 0;
> +err:
> + free_tdx_memory();
> + return ret;
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -357,12 +473,56 @@ static int init_tdx_module(void)
> if (ret)
> goto out;
>
> + /*
> + * All memory regions that can be used by the TDX module must be
> + * passed to the TDX module during the module initialization.
> + * Once this is done, all "TDX-usable" memory regions are fixed
> + * during module's runtime.
> + *
> + * The initial support of TDX guests only allocates memory from
> + * the global page allocator. To keep things simple, for now
> + * just make sure all pages in the page allocator are TDX memory.
> + *
> > + * To achieve this, use all system memory in the core-mm at the
> > + * time of initializing the TDX module as TDX memory, and at the
> > + * same time, reject any new memory in memory hot-add.
> + *
> > + * This works because in practice all boot-time present DIMMs are
> > + * TDX convertible memory. However, if any new memory is hot-added
> > + * before initializing the TDX module, the initialization will
> > + * fail because that memory is not covered by any CMR.
> + *
> + * This can be enhanced in the future, i.e. by allowing adding or
> + * onlining non-TDX memory to a separate node, in which case the
> + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> + * together -- the userspace/kernel just needs to make sure pages
> + * for TDX guests must come from those "TDX-capable" nodes.
> + *
> + * Build the list of TDX memory regions as mentioned above so
> + * they can be passed to the TDX module later.
> + */
> + get_online_mems();
> +
> + ret = build_tdx_memory();
> + if (ret)
> + goto out;
> /*
> * Return -EINVAL until all steps of TDX module initialization
> * process are done.
> */
> ret = -EINVAL;
> out:
> + /*
> + * Memory hotplug checks the hot-added memory region against the
> + * @tdx_memlist to see if the region is TDX memory.
> + *
> + * Do put_online_mems() here to make sure any modification to
> + * @tdx_memlist is done while holding the memory hotplug read
> + * lock, so that the memory hotplug path can just check the
> + * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> + * deadlock.
> + */
> + put_online_mems();
> return ret;
> }
>
> @@ -485,3 +645,26 @@ int tdx_enable(void)
> return ret;
> }
> EXPORT_SYMBOL_GPL(tdx_enable);
> +
> +/*
> + * Check whether the given range is TDX memory. Must be called between
> + * mem_hotplug_begin()/mem_hotplug_done().

Must be called with mem_hotplug_lock write-locked.

> + */
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + struct tdx_memblock *tmb;
> +
> + /* Empty list means TDX isn't enabled successfully */
> + if (list_empty(&tdx_memlist))
> + return true;
> +
> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + /*
> + * The new range is TDX memory if it is fully covered
> + * by any TDX memory block.
> + */
> + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> + return true;
> + }
> + return false;
> +}

Best Regards,
Huang, Ying

2022-11-21 09:24:17

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros

On Sun, 2022-11-20 at 18:52 -0800, Sathyanarayanan Kuppuswamy wrote:
>
> On 11/20/22 4:26 PM, Kai Huang wrote:
> > +/*
> > + * TDX supported page sizes (4K/2M/1G).
> > + *
> > + * Those values are part of the TDX module ABI. Do not change them.
>
> It would be better if you include specification version and section
> title.
>

Such as below?

"Those values are part of the TDX module ABI (section "Physical Page Size", TDX
module 1.0 spec). Do not change them."

Btw, Dave mentioned we should not put the "section numbers" in the comment:

https://lore.kernel.org/lkml/[email protected]/

I was trying to follow.

2022-11-21 09:35:28

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory


> >
> > +struct tdx_memblock {
> > + struct list_head list;
> > + unsigned long start_pfn;
>
> Why not use 'phys_addr_t'?

TDX memory must be page aligned, so start_pfn/end_pfn would be easier.

>
> > + unsigned long end_pfn;
> > + int nid;
> > +};
> > +
> >

[...]

> >
> > +/* Check whether the given pfn range is covered by any CMR or not. */
> > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > + unsigned long end_pfn)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < tdx_cmr_num; i++) {
> > + struct cmr_info *cmr = &tdx_cmr_array[i];
> > + unsigned long cmr_start_pfn;
> > + unsigned long cmr_end_pfn;
> > +
> > + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
>
> Why not use PHYS_PFN() or PFN_DOWN()? And PFN_PHYS() in reverse direction?

Didn't know them. Will use them.
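
For reference, a sketch of what the suggested cleanup might look like
(same logic as the quoted code above, just using the PFN helpers from
<linux/pfn.h>):

static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
				     unsigned long end_pfn)
{
	int i;

	for (i = 0; i < tdx_cmr_num; i++) {
		struct cmr_info *cmr = &tdx_cmr_array[i];

		/* PHYS_PFN() replaces the open-coded ">> PAGE_SHIFT" */
		if (start_pfn >= PHYS_PFN(cmr->base) &&
		    end_pfn <= PHYS_PFN(cmr->base + cmr->size))
			return true;
	}

	return false;
}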


[...]

> > +
> > +/*
> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > + * Must be called when get_online_mems() is called by the caller.
> > + */
> > +static int build_tdx_memory(void)
> > +{
> > + unsigned long start_pfn, end_pfn;
> > + int i, nid, ret;
> > +
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > + /*
> > + * The first 1MB may not be reported as TDX convertible
> > + * memory. Manually exclude it from TDX memory.
> > + *
> > + * This is fine as the first 1MB is already reserved in
> > + * reserve_real_mode() and won't end up in ZONE_DMA as
> > + * free pages anyway.
> > + */
> > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > + if (start_pfn >= end_pfn)
> > + continue;
>
> How about check whether first 1MB is reserved instead of depending on
> the corresponding code isn't changed? Via for_each_reserved_mem_range()?

IIUC, some reserved memory can be freed to the page allocator directly, e.g.
kernel init code/data. I feel it's not safe to just assume reserved memory will
never end up in the page allocator. Otherwise we could have used
for_each_free_mem_range().
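
(For context, a rough sketch of the alternative mentioned above --
for_each_free_mem_range() walks only free memblock ranges, so reserved
regions, including the first 1MB, never show up. The sketch assumes the
ranges are page-aligned:)

	phys_addr_t start, end;
	int nid, ret;
	u64 i;

	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
				&start, &end, &nid) {
		ret = add_tdx_memblock(PHYS_PFN(start), PHYS_PFN(end), nid);
		if (ret)
			break;
	}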

[...]

>
> > +/*
> > + * Check whether the given range is TDX memory. Must be called between
> > + * mem_hotplug_begin()/mem_hotplug_done().
>
> Must be called with mem_hotplug_lock write-locked.
>

Will do.

2022-11-21 10:06:21

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

On Sun, 2022-11-20 at 19:51 -0800, Sathyanarayanan Kuppuswamy wrote:
>
> On 11/20/22 4:26 PM, Kai Huang wrote:
> > The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
>
> "some problems" looks more generic. May be we can be specific here. Like
> it has security issues?

It was quoted from below upstream commit id (I only kept the one that I quoted
to save space):

commit b8d1d163604bd1e600b062fb00de5dc42baa355f (tag: x86_apic_for_v6.1_rc1,
tip/x86/apic)
Author: Daniel Sneddon <[email protected]>
Date: Tue Aug 16 16:19:42 2022 -0700

x86/apic: Don't disable x2APIC if locked

....

The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
[1]. This bug allows an attacker to use the APIC MMIO interface to
extract data from the SGX enclave.

....

[1]: https://aepicleak.com/aepicleak.pdf

Signed-off-by: Daniel Sneddon <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Neelima Krishnan <[email protected]>
Link:
https://lkml.kernel.org/r/[email protected]


>
> > [1]. This bug allows an attacker to use the APIC MMIO interface to
> > extract data from the SGX enclave.
> >
> > TDX is not immune from this either. Early check X2APIC and disable TDX
> > if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC.
> >
> > [1]: https://aepicleak.com/aepicleak.pdf
> >
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> >
> > v6 -> v7:
> > - Changed to use "Link" for the two lore links to get rid of checkpatch
> > warning.
> >
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
> > 2 files changed, 12 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index cced4ef3bfb2..dd333b46fafb 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
> > depends on CPU_SUP_INTEL
> > depends on X86_64
> > depends on KVM_INTEL
> > + depends on X86_X2APIC
> > help
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. This option enables necessary TDX
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 982d9c453b6b..8d943bdc8335 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -12,6 +12,7 @@
> > #include <linux/printk.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > +#include <asm/apic.h>
> > #include <asm/tdx.h>
> > #include "tdx.h"
> >
> > @@ -81,6 +82,16 @@ static int __init tdx_init(void)
> > goto no_tdx;
> > }
> >
> > + /*
> > + * TDX requires X2APIC being enabled to prevent potential data
> > + * leak via APIC MMIO registers. Just disable TDX if not using
> > + * X2APIC.
>
> Remove the double space.

Sorry which "double space"?

>
> > + */
> > + if (!x2apic_enabled()) {
> > + pr_info("Disable TDX as X2APIC is not enabled.\n");
>
> pr_warn()?

Why?

2022-11-21 10:46:49

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot


> > +static u32 tdx_keyid_start __ro_after_init;
> > +static u32 tdx_keyid_num __ro_after_init;
> > +
> > +/*
> > + * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> > + * BIOS. Both initializing the TDX module and running TDX guest require
> > + * TDX private KeyID.
> > + *
> > + * TDX doesn't trust BIOS. TDX verifies all configurations from BIOS
> > + * are correct before enabling TDX on any core. TDX requires the BIOS
> > + * to correctly and consistently program TDX private KeyIDs on all CPU
> > + * packages. Unless there is a BIOS bug, detecting a valid TDX private
> > + * KeyID range on BSP indicates TDX has been enabled by the BIOS. If
> > + * there's such BIOS bug, it will be caught later when initializing the
> > + * TDX module.
> > + */
> > +static int __init detect_tdx(void)
> > +{
> > + int ret;
> > +
> > + /*
> > + * IA32_MKTME_KEYID_PARTITIONING:
> > + * Bit [31:0]: Number of MKTME KeyIDs.
> > + * Bit [63:32]: Number of TDX private KeyIDs.
> > + */
> > + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &tdx_keyid_start,
> > + &tdx_keyid_num);
> > + if (ret)
> > + return -ENODEV;
> > +
> > + if (!tdx_keyid_num)
> > + return -ENODEV;
> > +
> > + /*
> > + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
> > + * KeyIDs start after the last MKTME KeyID.
> > + */
> > + tdx_keyid_start++;
>
> It is not very clear why you increment tdx_keyid_start. 
>

Please see above comment around rdmsr_safe():

/*
* IA32_MKTME_KEYID_PARTITIONING:
* Bit [31:0]: Number of MKTME KeyIDs.
* Bit [63:32]: Number of TDX private KeyIDs.
*/

And the comment right above 'tdx_keyid_start++':

/*
* KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
* KeyIDs start after the last MKTME KeyID.
*/

Do the above two comments answer the question?

Or I can be more explicit, such as below?

"
Now tdx_keyid_start is the last MKTME KeyID. TDX private KeyIDs start after the
last MKTME KeyID. Increase tdx_keyid_start by 1 to set it to the first TDX
private KeyID.
"


> What we read from
> MSR_IA32_MKTME_KEYID_PARTITIONING is not the correct start keyid?


TDX verifies the TDX private KeyID range is configured consistently across all
packages. Any wrong KeyID range means a BIOS bug, and such a bug will cause TDX
to not be enabled -- later TDX module initialization will catch this.

>
> Also why is this global variable? At least in this patch, there seems to
> be no use case.

platform_tdx_enabled() uses tdx_keyid_num to determine whether TDX is enabled
by the BIOS.

Also, in the changelog I can add "both initializing the TDX module and creating
TDX guests will need to use TDX private KeyIDs".

But I also have a comment saying something similar around ...

>
> > + /*
> > + * Initializing the TDX module requires one TDX private KeyID.
> > + * If there's only one TDX KeyID then after module initialization
> > + * KVM won't be able to run any TDX guest, which makes the whole
> > + * thing worthless. Just disable TDX in this case.
> > + */
> > + if (tdx_keyid_num < 2) {
> > + pr_info("Disable TDX as there's only one TDX private KeyID available.\n");
> > + goto no_tdx;
> > + }
> > +

... here.

Subject: Re: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros



On 11/21/22 1:15 AM, Huang, Kai wrote:
> On Sun, 2022-11-20 at 18:52 -0800, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 11/20/22 4:26 PM, Kai Huang wrote:
>>> +/*
>>> + * TDX supported page sizes (4K/2M/1G).
>>> + *
>>> + * Those values are part of the TDX module ABI. Do not change them.
>>
>> It would be better if you include specification version and section
>> title.
>>
>
> Such as below?
>
> "Those values are part of the TDX module ABI (section "Physical Page Size", TDX
> module 1.0 spec). Do not change them."

Yes.

>
> Btw, Dave mentioned we should not put the "section numbers" to the comment:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> I was trying to follow.

Yes. That's why I suggested putting the section title.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-11-21 18:22:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros

On 11/20/22 18:52, Sathyanarayanan Kuppuswamy wrote:
> On 11/20/22 4:26 PM, Kai Huang wrote:
>> +/*
>> + * TDX supported page sizes (4K/2M/1G).
>> + *
>> + * Those values are part of the TDX module ABI. Do not change them.
> It would be better if you include specification version and section
> title.

I actually think TDX code, in general, spends way too much time quoting
and referring to the spec.

Also, why quote the version? Do we quote the SDM version when we add
new SDM-defined architecture?

It's just busywork that bloats the kernel and adds noise. Please focus
on adding value to the comments that came from your brain and not just
pasting boilerplate gunk over and over.

Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled



On 11/21/22 1:44 AM, Huang, Kai wrote:
> On Sun, 2022-11-20 at 19:51 -0800, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 11/20/22 4:26 PM, Kai Huang wrote:
>>> The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
>>
>> "some problems" looks more generic. May be we can be specific here. Like
>> it has security issues?
>
> It was quoted from below upstream commit id (I only kept the one that I quoted
> to save space):

Ok.

>
> commit b8d1d163604bd1e600b062fb00de5dc42baa355f (tag: x86_apic_for_v6.1_rc1,
> tip/x86/apic)
> Author: Daniel Sneddon <[email protected]>
> Date: Tue Aug 16 16:19:42 2022 -0700
>
> x86/apic: Don't disable x2APIC if locked
>
> ....
>
> The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
> [1]. This bug allows an attacker to use the APIC MMIO interface to
> extract data from the SGX enclave.
>
> ....
>
> [1]: https://aepicleak.com/aepicleak.pdf
>
> Signed-off-by: Daniel Sneddon <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
> Tested-by: Neelima Krishnan <[email protected]>
> Link:
> https://lkml.kernel.org/r/[email protected]
>
>
>>
>>> [1]. This bug allows an attacker to use the APIC MMIO interface to
>>> extract data from the SGX enclave.
>>>
>>> TDX is not immune from this either. Early check X2APIC and disable TDX
>>> if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC.
>>>
>>> [1]: https://aepicleak.com/aepicleak.pdf
>>>
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Signed-off-by: Kai Huang <[email protected]>
>>> ---
>>>
>>> v6 -> v7:
>>> - Changed to use "Link" for the two lore links to get rid of checkpatch
>>> warning.
>>>
>>> ---
>>> arch/x86/Kconfig | 1 +
>>> arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
>>> 2 files changed, 12 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index cced4ef3bfb2..dd333b46fafb 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
>>> depends on CPU_SUP_INTEL
>>> depends on X86_64
>>> depends on KVM_INTEL
>>> + depends on X86_X2APIC
>>> help
>>> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>>> host and certain physical attacks. This option enables necessary TDX
>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> index 982d9c453b6b..8d943bdc8335 100644
>>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -12,6 +12,7 @@
>>> #include <linux/printk.h>
>>> #include <asm/msr-index.h>
>>> #include <asm/msr.h>
>>> +#include <asm/apic.h>
>>> #include <asm/tdx.h>
>>> #include "tdx.h"
>>>
>>> @@ -81,6 +82,16 @@ static int __init tdx_init(void)
>>> goto no_tdx;
>>> }
>>>
>>> + /*
>>> + * TDX requires X2APIC being enabled to prevent potential data
>>> + * leak via APIC MMIO registers. Just disable TDX if not using
>>> + * X2APIC.
>>
>> Remove the double space.
>
> Sorry which "double space"?

Before "Just disable". It looked like a double space. Is it not?

>
>>
>>> + */
>>> + if (!x2apic_enabled()) {
>>> + pr_info("Disable TDX as X2APIC is not enabled.\n");
>>
>> pr_warn()?
>
> Why?

Since it is a failure case, I thought pr_warn would be better. It is up
to you.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot



On 11/21/22 1:37 AM, Huang, Kai wrote:
>> Also why is this global variable? At least in this patch, there seems to
>> be no use case.
> Platform_tdx_enabled() uses tdx_keyid_num to determine whether TDX is enabled by
> BIOS.
>
> Also, in the changlog I can add "both initializing the TDX module and creating
> TDX guest will need to use TDX private KeyID".
>
> But I also have a comment saying something similar around ...
>

I am asking about tdx_keyid_start. It is mainly used in detect_tdx(). Maybe you
declared it as global in preparation for the next patches, but that is not
explained in the changelog.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-11-22 00:38:27

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros

On Mon, 2022-11-21 at 15:48 -0800, Hansen, Dave wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > +/*
> > + * TDX supported page sizes (4K/2M/1G).
> > + *
> > + * Those values are part of the TDX module ABI. Do not change them.
> > + */
> > +#define TDX_PS_4K 0
> > +#define TDX_PS_2M 1
> > +#define TDX_PS_1G 2
>
> That comment can just be:
>
> /* TDX supported page sizes from the TDX module ABI. */
>
> I think folks understand that the kernel can't willy nilly change ABI
> values.

Thanks. Will do.

2022-11-22 00:38:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 01/20] x86/tdx: Define TDX supported page sizes as macros

On 11/20/22 16:26, Kai Huang wrote:
> +/*
> + * TDX supported page sizes (4K/2M/1G).
> + *
> + * Those values are part of the TDX module ABI. Do not change them.
> + */
> +#define TDX_PS_4K 0
> +#define TDX_PS_2M 1
> +#define TDX_PS_1G 2

That comment can just be:

/* TDX supported page sizes from the TDX module ABI. */

I think folks understand that the kernel can't willy nilly change ABI
values.

2022-11-22 00:40:21

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

On Mon, 2022-11-21 at 15:46 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
> > [1]. This bug allows an attacker to use the APIC MMIO interface to
> > extract data from the SGX enclave.
> >
> > TDX is not immune from this either. Check X2APIC early and disable TDX
> > if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC.
>
> This makes no sense.
>
> This is TDX host code. TDX hosts are untrusted. Zero of the TDX
> security guarantees are provided by the host.
>
> What is the benefit of disabling TDX from the host if x2APIC is not
> enabled? It can't be for security reasons since the host does not help
> provide TDX security guarantees. It also can't be for SGX because SGX
> doesn't depend on the OS doing anything in order to be secure.

Agreed. Although in practice I think some hardening in the kernel would
raise the attack bar a bit.

>
> So, this boils down to the most fundamental of questions you need to
> answer about every patch:
>
> What does this code do?
>
> What end-user-visible effect is there if this code is not present?

Considering the TDX host cannot be trusted (i.e. it can be attacked/modified),
I agree the check isn't needed.

I was following your suggestion in the patch which handles "x2apic locked" case:

https://lore.kernel.org/lkml/[email protected]/

I guess I misunderstood your point.

Reading that discussion again, if I understand correctly, you just want to make
INTEL_TDX_HOST depend on X86_X2APIC?

How about still having a patch to make INTEL_TDX_HOST depend on X86_X2APIC but
with something below in the changelog?

"
TDX capable platforms are locked to X2APIC mode and cannot fall back to the
legacy xAPIC mode when TDX is enabled by the BIOS. It doesn't make sense to
turn on INTEL_TDX_HOST while X86_X2APIC is not enabled. Make INTEL_TDX_HOST
depend on X86_X2APIC.
"

2022-11-22 00:51:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

On 11/20/22 16:26, Kai Huang wrote:
> The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
> [1]. This bug allows an attacker to use the APIC MMIO interface to
> extract data from the SGX enclave.
>
> TDX is not immune from this either. Check X2APIC early and disable TDX
> if X2APIC is not enabled, and make INTEL_TDX_HOST depend on X86_X2APIC.

This makes no sense.

This is TDX host code. TDX hosts are untrusted. Zero of the TDX
security guarantees are provided by the host.

What is the benefit of disabling TDX from the host if x2APIC is not
enabled? It can't be for security reasons since the host does not help
provide TDX security guarantees. It also can't be for SGX because SGX
doesn't depend on the OS doing anything in order to be secure.

So, this boils down to the most fundamental of questions you need to
answer about every patch:

What does this code do?

What end-user-visible effect is there if this code is not present?

2022-11-22 00:56:17

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

On Mon, 2022-11-21 at 14:00 -0800, Sathyanarayanan Kuppuswamy wrote:
> > > >  
> > > > + /*
> > > > + * TDX requires X2APIC being enabled to prevent potential data
> > > > + * leak via APIC MMIO registers.  Just disable TDX if not using
> > > > + * X2APIC.
> > >
> > > Remove the double space.
> >
> > Sorry which "double space"?
>
> Before Just disable. It looked like double space. Is it not?

There are a bunch of examples in the upstream kernel using a "double space",
both in changelogs and comments.

2022-11-22 00:56:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

On 11/21/22 16:30, Huang, Kai wrote:
>
> How about still having a patch to make INTEL_TDX_HOST depend on X86_X2APIC but
> with something below in the changelog?
>
> "
> TDX capable platforms are locked to X2APIC mode and cannot fall back to the
> legacy xAPIC mode when TDX is enabled by the BIOS. It doesn't make sense to
> turn on INTEL_TDX_HOST while X86_X2APIC is not enabled. Make INTEL_TDX_HOST
> depend on X86_X2APIC.

That's fine and it makes logical sense as a dependency. TDX host
support requires x2APIC. Period.

2022-11-22 01:10:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot

On 11/20/22 16:26, Kai Huang wrote:
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. A CPU-attested software module
> called 'the TDX module' runs inside a new isolated memory range as a
> trusted hypervisor to manage and run protected VMs.
>
> Pre-TDX Intel hardware has support for a memory encryption architecture
> called MKTME. The memory encryption hardware underpinning MKTME is also
> used for Intel TDX. TDX ends up "stealing" some of the physical address
> space from the MKTME architecture for crypto-protection to VMs. The
> BIOS is responsible for partitioning the "KeyID" space between legacy
> MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
> KeyIDs' or 'TDX KeyIDs' for short.
>
> TDX doesn't trust the BIOS. During machine boot, TDX verifies the TDX
> private KeyIDs are consistently and correctly programmed by the BIOS
> across all CPU packages before it enables TDX on any CPU core. A valid
> TDX private KeyID range on BSP indicates TDX has been enabled by the
> BIOS, otherwise the BIOS is buggy.
>
> The TDX module is expected to be loaded by the BIOS when it enables TDX,
> but the kernel needs to properly initialize it before it can be used to
> create and run any TDX guests. The TDX module will be initialized at
> runtime by the user (i.e. KVM) on demand.

Calling KVM "the user" is a stretch. How about we give actual user
facts instead of filling this with i.e.'s when there's only one actual
way it happens?

The TDX module will be initialized by the KVM subsystem when
<insert actual trigger description here>.

> Add a new early_initcall(tdx_init) to do TDX early boot initialization.
> Only detect TDX private KeyIDs for now. Some other early checks will
> follow up.

Just say what this patch is doing. Don't try to

> Also add a new function to report whether TDX has been
> enabled by BIOS (TDX private KeyID range is valid). Kexec() will also
> need it to determine whether need to flush dirty cachelines that are
> associated with any TDX private KeyIDs before booting to the new kernel.

That last sentence doesn't parse correctly.

> To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
> TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
> to opt in to TDX host kernel support (to distinguish it from TDX guest kernel
> support). So far KVM is the only user of TDX. Make the new config
> option depend on KVM_INTEL.
..
> +config INTEL_TDX_HOST
> + bool "Intel Trust Domain Extensions (TDX) host support"
> + depends on CPU_SUP_INTEL
> + depends on X86_64
> + depends on KVM_INTEL
> + help
> + Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> + host and certain physical attacks. This option enables necessary TDX
> + support in host kernel to run protected VMs.
> +
> + If unsure, say N.
> +
> config EFI
> bool "EFI runtime service support"
> depends on ACPI
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 415a5d138de4..38d3e8addc5f 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -246,6 +246,8 @@ archheaders:
>
> libs-y += arch/x86/lib/
>
> +core-y += arch/x86/virt/
> +
> # drivers-y are linked after core-y
> drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
> drivers-$(CONFIG_PCI) += arch/x86/pci/
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index e9a3f4a6fba1..51c4222a13ae 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -98,5 +98,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> return -ENODEV;
> }
> #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +bool platform_tdx_enabled(void);
> +#else /* !CONFIG_INTEL_TDX_HOST */
> +static inline bool platform_tdx_enabled(void) { return false; }
> +#endif /* CONFIG_INTEL_TDX_HOST */
> +
> #endif /* !__ASSEMBLY__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
> new file mode 100644
> index 000000000000..1e36502cd738
> --- /dev/null
> +++ b/arch/x86/virt/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += vmx/
> diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
> new file mode 100644
> index 000000000000..feebda21d793
> --- /dev/null
> +++ b/arch/x86/virt/vmx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_INTEL_TDX_HOST) += tdx/
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> new file mode 100644
> index 000000000000..93ca8b73e1f1
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += tdx.o
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..982d9c453b6b
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,95 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2022 Intel Corporation.
> + *
> + * Intel Trusted Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/printk.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/tdx.h>
> +#include "tdx.h"
> +
> +static u32 tdx_keyid_start __ro_after_init;
> +static u32 tdx_keyid_num __ro_after_init;
> +
> +/*
> + * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> + * BIOS. Both initializing the TDX module and running TDX guest require
> + * TDX private KeyID.

This comment is not right, sorry.

Talk about the function at a *HIGH* level. Don't talk about every
little detailed facet of the function. That's what the code is there for.

> + * TDX doesn't trust BIOS. TDX verifies all configurations from BIOS
> + * are correct before enabling TDX on any core. TDX requires the BIOS
> + * to correctly and consistently program TDX private KeyIDs on all CPU
> + * packages. Unless there is a BIOS bug, detecting a valid TDX private
> + * KeyID range on BSP indicates TDX has been enabled by the BIOS. If
> + * there's such BIOS bug, it will be caught later when initializing the
> + * TDX module.
> + */

I have no idea what that comment is doing. Can it just be removed?

> +static int __init detect_tdx(void)
> +{
> + int ret;
> +
> + /*
> + * IA32_MKTME_KEYID_PARTITIONING:
> + * Bit [31:0]: Number of MKTME KeyIDs.
> + * Bit [63:32]: Number of TDX private KeyIDs.
> + */
> + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &tdx_keyid_start,
> + &tdx_keyid_num);

'tdx_keyid_start' appears to be named wrong.

> + if (ret)
> + return -ENODEV;
> +
> + if (!tdx_keyid_num)
> + return -ENODEV;
> +
> + /*
> + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
> + * KeyIDs start after the last MKTME KeyID.
> + */

Is the TME key a "MKTME KeyID"?

> + tdx_keyid_start++;

... and this confirms it.

This probably should be:

u32 nr_mktme_keyids;

ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING,
&nr_mktme_keyids,
&tdx_keyid_num);
...

/* TDX KeyIDs start after the last MKTME KeyID */
tdx_keyid_start = nr_mktme_keyids + 1;

See how that makes actual logical sense and barely even needs the comment?


> + pr_info("TDX enabled by BIOS. TDX private KeyID range: [%u, %u)\n",
> + tdx_keyid_start, tdx_keyid_start + tdx_keyid_num);
> +
> + return 0;
> +}
> +
> +static void __init clear_tdx(void)
> +{
> + tdx_keyid_start = tdx_keyid_num = 0;
> +}

This is where a comment is needed and can actually help.

/*
* tdx_keyid_start/num indicate that TDX is uninitialized. This
* is used in TDX initialization error paths to take it from
* initialized -> uninitialized.
*/

> +static int __init tdx_init(void)
> +{
> + if (detect_tdx())
> + return -ENODEV;

This reads as:

if tdx is detected:
return error


So, first, why bother having detect_tdx() return fancy -ERRNO codes if
they're going to be thrown away? You could at *least* do:


int err;

err = tdx_record_keyid_partitioning();
if (err)
return err;

Note how tdx_record_keyid_partitioning() actually talks about what the
function does. There's also a recent trend in x86 land not to put
obvious prefixes on functions. That would make the naming more or less
record_keyid_partitioning().

I kinda like the consistent prefixes but Boris doesn't.

> + /*
> + * Initializing the TDX module requires one TDX private KeyID.
> + * If there's only one TDX KeyID then after module initialization
> + * KVM won't be able to run any TDX guest, which makes the whole
> + * thing worthless. Just disable TDX in this case.
> + */
> + if (tdx_keyid_num < 2) {
> + pr_info("Disable TDX as there's only one TDX private KeyID available.\n");
> + goto no_tdx;
> + }

'tdx_keyid_num' is a crummy name. Here it reads like, "if the tdx keyid
number is < 2". Which is wrong. A better name would be: nr_tdx_keyids

That's also a horrible error message. You have:

+#define pr_fmt(fmt) "tdx: " fmt

so that message will look like:

tdx: Disable TDX as there's only one TDX private KeyID available.

How many 'TDX' strings do we need in one message? How about:

pr_info("initialization failed: too few private KeyIDs available
(%d).\n", nr_tdx_keyids;

That gives a lot more information and removes the two redundant TDX strings.

> +no_tdx:
> + clear_tdx();
> + return -ENODEV;
> +}
> +early_initcall(tdx_init);
> +
> +/* Return whether the BIOS has enabled TDX */
> +bool platform_tdx_enabled(void)
> +{
> + return !!tdx_keyid_num;
> +}
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..d00074abcb20
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +/*
> + * This file contains both macros and data structures defined by the TDX
> + * architecture and Linux defined software data structures and functions.
> + * The two should not be mixed together for better readability. The
> + * architectural definitions come first.
> + */
> +
> +/* MSR to report KeyID partitioning between MKTME and TDX */
> +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
> +
> +#endif


2022-11-22 02:11:11

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

"Huang, Kai" <[email protected]> writes:

[...]

>
>> > +
>> > +/*
>> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
>> > + * Must be called when get_online_mems() is called by the caller.
>> > + */
>> > +static int build_tdx_memory(void)
>> > +{
>> > + unsigned long start_pfn, end_pfn;
>> > + int i, nid, ret;
>> > +
>> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>> > + /*
>> > + * The first 1MB may not be reported as TDX convertible
>> > + * memory. Manually exclude them as TDX memory.
>> > + *
>> > + * This is fine as the first 1MB is already reserved in
>> > + * reserve_real_mode() and won't end up to ZONE_DMA as
>> > + * free page anyway.
>> > + */
>> > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
>> > + if (start_pfn >= end_pfn)
>> > + continue;
>>
>> How about check whether first 1MB is reserved instead of depending on
>> the corresponding code isn't changed? Via for_each_reserved_mem_range()?
>
> IIUC, some reserved memory can be freed to the page allocator directly, i.e. kernel
> init code/data. I feel it's not safe to just assume reserved memory will never
> be in the page allocator. Otherwise we have for_each_free_mem_range() to use.

Yes. memblock reserve information isn't perfect. But I still think
that checking whether the first 1MB is reserved in memblock is better
than just assuming it. Or, we can check whether the pages of the
first 1MB are reserved via checking struct page directly?

[...]

Best Regards,
Huang, Ying

2022-11-22 02:15:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 03/20] x86/virt/tdx: Disable TDX if X2APIC is not enabled

On Mon, 2022-11-21 at 16:44 -0800, Dave Hansen wrote:
> On 11/21/22 16:30, Huang, Kai wrote:
> >
> > How about still having a patch to make INTEL_TDX_HOST depend on X86_X2APIC but
> > with something below in the changelog?
> >
> > "
> > TDX capable platforms are locked to X2APIC mode and cannot fall back to the
> > legacy xAPIC mode when TDX is enabled by the BIOS. It doesn't make sense to
> > turn on INTEL_TDX_HOST while X86_X2APIC is not enabled. Make INTEL_TDX_HOST
> > depend on X86_X2APIC.
>
> That's fine and it makes logical sense as a dependency. TDX host
> support requires x2APIC. Period.
>
Thanks. Perhaps I can reuse your second sentence in the changelog:

"
TDX capable platforms are locked to X2APIC mode and cannot fall back to the
legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support requires
x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC.
"

2022-11-22 09:43:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
> CPUs and use it to shut down the module. Later logical-cpu scope module
> initialization will use it too.

Uhh, those requirements ^ are not met by this:

> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> + on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

Consider:

CPU0 CPU1 CPU2

local_irq_disable()
...
seamcall_on_each_cpu()
send-IPIs to 0 and 2
<IPI>
runs local seamcall
(seamcall done)
waits for 0 and 2
<has an NMI delay things>
runs seamcall
clears CSD_LOCK
</IPI>
... spinning ...

local_irq_enable()
<IPI>
runs seamcall
clears CSD_LOCK
*FINALLY* observes CSD_LOCK cleared on
all CPU and continues
</IPI>

IOW, they all 3 run seamcall at different times.

Either the Changelog is broken or this TDX crud is worse crap than I
thought possible, because the only way to actually meet that requirement
as stated is stop_machine().
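
As a minimal sketch of what a literal all-CPUs-in-lock-step requirement
would need (illustrative only; seamcall_ctx and seamcall() as in the
series, the function names here are hypothetical):

static int seamcall_stop_fn(void *data)
{
	struct seamcall_ctx *sc = data;

	/* Under stop_machine(), every online CPU enters here in lock-step */
	if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL))
		atomic_set(&sc->err, -EFAULT);

	return 0;
}

static int seamcall_on_each_cpu_sync(struct seamcall_ctx *sc)
{
	/* Freeze the machine and run the callback on all online CPUs */
	return stop_machine(seamcall_stop_fn, sc, cpu_online_mask);
}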


2022-11-22 10:01:45

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory


> > > > +/*
> > > > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > > > + * Must be called when get_online_mems() is called by the caller.
> > > > + */
> > > > +static int build_tdx_memory(void)
> > > > +{
> > > > + unsigned long start_pfn, end_pfn;
> > > > + int i, nid, ret;
> > > > +
> > > > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > > > + /*
> > > > + * The first 1MB may not be reported as TDX convertible
> > > > + * memory. Manually exclude them as TDX memory.
> > > > + *
> > > > + * This is fine as the first 1MB is already reserved in
> > > > + * reserve_real_mode() and won't end up to ZONE_DMA as
> > > > + * free page anyway.
> > > > + */
> > > > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > > > + if (start_pfn >= end_pfn)
> > > > + continue;
> > >
> > > How about check whether first 1MB is reserved instead of depending on
> > > the corresponding code isn't changed? Via for_each_reserved_mem_range()?
> >
> > IIUC, some reserved memory can be freed to the page allocator directly, i.e. kernel
> > init code/data. I feel it's not safe to just assume reserved memory will never
> > be in the page allocator. Otherwise we have for_each_free_mem_range() to use.
>
> Yes. memblock reserve information isn't perfect. But I still think
> that checking whether the first 1MB is reserved in memblock is better
> than just assuming it. Or, we can check whether the pages of the
> first 1MB are reserved via checking struct page directly?
>

Sorry, I am a little bit confused about what you want to achieve here. Do you
want to add some sanity check to make sure the first 1MB is indeed not in the
page allocator?

IIUC, that is indeed true. Please see the comment at the call to
reserve_real_mode() in setup_arch(). Also please see efi_free_boot_services(),
which doesn't free boot services memory if it is below 1MB.

Also, my understanding is the kernel's intention is to always reserve the first 1MB:

/*
* Don't free memory under 1M for two reasons:
* - BIOS might clobber it
* - Crash kernel needs it to be reserved
*/

So if any page in the first 1MB ended up in the page allocator, it would be a
kernel bug not related to TDX, correct?
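
A minimal form of the check being discussed might look like the sketch
below (hypothetical; it assumes memblock data is still available when
this runs, and memblock_is_region_reserved() only tests for
intersection, so it is a coarse sanity check rather than proof of full
coverage):

#include <linux/memblock.h>
#include <linux/sizes.h>

static void __init check_low_1mb_reserved(void)
{
	/* True if [0, 1MB) intersects any memblock reserved region */
	if (!memblock_is_region_reserved(0, SZ_1M))
		pr_warn("first 1MB is not reserved in memblock\n");
}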


2022-11-22 10:09:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> +/*
> + * Data structure to make SEAMCALL on multiple CPUs concurrently.
> + * @err is set to -EFAULT when SEAMCALL fails on any cpu.
> + */
> +struct seamcall_ctx {
> + u64 fn;
> + u64 rcx;
> + u64 rdx;
> + u64 r8;
> + u64 r9;
> + atomic_t err;
> +};

> @@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> }
> }
>
> +static void seamcall_smp_call_function(void *data)
> +{
> + struct seamcall_ctx *sc = data;
> + int ret;
> +
> + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
> + if (ret)
> + atomic_set(&sc->err, -EFAULT);
> +}

Can someone explain this usage of atomic_t to me, please?
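
One way to read the objection: there is no read-modify-write here, so a
plain int written with WRITE_ONCE() would express the same sticky error
flag without implying atomicity is needed. An illustrative alternative,
not from the series:

struct seamcall_ctx {
	u64 fn;
	u64 rcx, rdx, r8, r9;
	int err;	/* plain int: only ever overwritten with -EFAULT */
};

static void seamcall_smp_call_function(void *data)
{
	struct seamcall_ctx *sc = data;

	if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL))
		WRITE_ONCE(sc->err, -EFAULT);
}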

2022-11-22 10:13:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 05/20] x86/virt/tdx: Implement functions to make SEAMCALL

On Mon, Nov 21, 2022 at 01:26:27PM +1300, Kai Huang wrote:
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
> +{

What's the point of a 'static __always_unused' function again? Other
than to test the DCE pass of a linker, that is?

2022-11-22 10:41:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Mon, Nov 21, 2022 at 01:26:32PM +1300, Kai Huang wrote:

> +static int build_tdx_memory(void)
> +{
> + unsigned long start_pfn, end_pfn;
> + int i, nid, ret;
> +
> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> + /*
> + * The first 1MB may not be reported as TDX convertible
> + * memory. Manually exclude them as TDX memory.
> + *
> + * This is fine as the first 1MB is already reserved in
> + * reserve_real_mode() and won't end up to ZONE_DMA as
> + * free page anyway.
> + */
> + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> + if (start_pfn >= end_pfn)
> + continue;
> +
> + /* Verify memory is truly TDX convertible memory */
> + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
> + start_pfn << PAGE_SHIFT,
> + end_pfn << PAGE_SHIFT);
> + return -EINVAL;

Given how tdx_cc_memory_compatible() below relies on tdx_memlist being
> empty; this error path is wrong and should goto err.

> + }
> +
> + /*
> + * Add the memory regions as TDX memory. The regions in
> + * memblock has already guaranteed they are in address
> + * ascending order and don't overlap.
> + */
> + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> + if (ret)
> + goto err;
> + }
> +
> + return 0;
> +err:
> + free_tdx_memory();
> + return ret;
> +}

> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + struct tdx_memblock *tmb;
> +
> + /* Empty list means TDX isn't enabled successfully */
> + if (list_empty(&tdx_memlist))
> + return true;
> +
> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + /*
> + * The new range is TDX memory if it is fully covered
> + * by any TDX memory block.
> + */
> + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> + return true;
> + }
> + return false;
> +}

2022-11-22 10:44:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Mon, Nov 21, 2022 at 01:26:26PM +1300, Kai Huang wrote:
> +static int __tdx_enable(void)
> +{
> + int ret;
> +
> + /*
> + * Initializing the TDX module requires doing SEAMCALL on all
> + * boot-time present CPUs. For simplicity temporarily disable
> + * CPU hotplug to prevent any CPU from going offline during
> + * the initialization.
> + */
> + cpus_read_lock();
> +
> + /*
> + * Check whether all boot-time present CPUs are online and
> + * return early with a message so the user can be aware.
> + *
> + * Note a non-buggy BIOS should never support physical (ACPI)
> + * CPU hotplug when TDX is enabled, and all boot-time present
> + * CPU should be enabled in MADT, so there should be no
> + * disabled_cpus and num_processors won't change at runtime
> + * either.
> + */
> + if (disabled_cpus || num_online_cpus() != num_processors) {
> + pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n");
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + ret = init_tdx_module();
> + if (ret == -ENODEV) {
> + pr_info("TDX module is not loaded.\n");
> + tdx_module_status = TDX_MODULE_NONE;
> + goto out;
> + }
> +
> + /*
> + * Shut down the TDX module in case of any error during the
> + * initialization process. It's meaningless to leave the TDX
> + * module in any middle state of the initialization process.
> + *
> + * Shutting down the module also requires doing SEAMCALL on all
> + * MADT-enabled CPUs. Do it while CPU hotplug is disabled.
> + *
> + * Return all errors during the initialization as -EFAULT as the
> + * module is always shut down.
> + */
> + if (ret) {
> + pr_info("Failed to initialize TDX module. Shut it down.\n");
> + shutdown_tdx_module();
> + tdx_module_status = TDX_MODULE_SHUTDOWN;
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + pr_info("TDX module initialized.\n");
> + tdx_module_status = TDX_MODULE_INITIALIZED;
> +out:
> + cpus_read_unlock();
> +
> + return ret;
> +}

Uhm.. so if we've offlined all the SMT siblings because of some
speculation fail or other, this TDX thing will fail to initialize?

Because as I understand it; this TDX initialization happens some random
time after boot, when the first (TDX using) KVM instance gets created,
long after the speculation mitigations are enforced.

2022-11-22 11:13:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> +/*
> + * Call the SEAMCALL on all online CPUs concurrently. Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> + on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

Suppose the user has NOHZ_FULL configured, and is already running
userspace that will terminate on interrupt (this is desired feature for
NOHZ_FULL), guess how happy they'll be if someone, on another partition,
manages to tickle this TDX gunk?


2022-11-22 11:15:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Tue, Nov 22 2022 at 10:02, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:26PM +1300, Kai Huang wrote:
>> + cpus_read_unlock();
>> +
>> + return ret;
>> +}
>
> Uhm.. so if we've offlined all the SMT siblings because of some
> speculation fail or other, this TDX thing will fail to initialize?
>
> Because as I understand it; this TDX initialization happens some random
> time after boot, when the first (TDX using) KVM instance gets created,
> long after the speculation mitigations are enforced.

Correct. Aside from that, it's completely unclear from the changelog why
TDX needs to run the seamcall on _all_ present CPUs and why it cannot
handle CPUs being hotplugged later.

It's pretty much obvious that a TDX guest can only run on CPUs where
the seam module has been initialized, but where does the requirement
come from that _ALL_ CPUs must be initialized and _ALL_ CPUs must be
able to run TDX guests?

I just went and read through the documentation again.

"1. After loading the Intel TDX module, the host VMM should call the
TDH.SYS.INIT function to globally initialize the module.

2. The host VMM should then call the TDH.SYS.LP.INIT function on each
logical processor. TDH.SYS.LP.INIT is intended to initialize the
module within the scope of the Logical Processor (LP)."

This clearly tells me, that:

1) TDX must be globally initialized (once)

2) TDX must be initialized on each logical processor on which TDX
root/non-root operation should be executed

But it does not define any requirement for doing this on all logical
processors and for preventing physical hotplug (neither for CPUs nor for
memory).
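
Restating the quoted spec flow as a sketch (TDH_SYS_INIT, tdx_cpu_mask
and tdx_lp_init_fn are illustrative names, not from the series):

static int tdx_module_init_per_spec(void)
{
	int cpu, ret;

	/* 1) Global initialization, exactly once (TDH.SYS.INIT) */
	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
	if (ret)
		return ret;

	/* 2) LP-scope init, only on CPUs that will run TDX (TDH.SYS.LP.INIT) */
	for_each_cpu(cpu, tdx_cpu_mask)
		smp_call_function_single(cpu, tdx_lp_init_fn, NULL, true);

	return 0;
}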

Nothing in the TDX specs and docs mentions physical hotplug or a
requirement for invoking seamcall on the world.

Thanks,

tglx


2022-11-22 12:08:52

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Tue, 2022-11-22 at 11:10 +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:32PM +1300, Kai Huang wrote:
>
> > +static int build_tdx_memory(void)
> > +{
> > + unsigned long start_pfn, end_pfn;
> > + int i, nid, ret;
> > +
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > + /*
> > + * The first 1MB may not be reported as TDX convertible
> > + * memory. Manually exclude them as TDX memory.
> > + *
> > + * This is fine as the first 1MB is already reserved in
> > + * reserve_real_mode() and won't end up to ZONE_DMA as
> > + * free page anyway.
> > + */
> > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > + if (start_pfn >= end_pfn)
> > + continue;
> > +
> > + /* Verify memory is truly TDX convertible memory */
> > + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> > + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
> > + start_pfn << PAGE_SHIFT,
> > + end_pfn << PAGE_SHIFT);
> > + return -EINVAL;
>
> Given how tdx_cc_memory_compatible() below relies on tdx_memlist being
> empty; this error path is wrong and should goto err.

Oops. Thanks for catching.

Also thanks for the review! It is too late today for me; I'll catch up with
the others tomorrow.

>
> > + }
> > +
> > + /*
> > + * Add the memory regions as TDX memory. The regions in
> > + * memblock has already guaranteed they are in address
> > + * ascending order and don't overlap.
> > + */
> > + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> > + if (ret)
> > + goto err;
> > + }
> > +
> > + return 0;
> > +err:
> > + free_tdx_memory();
> > + return ret;
> > +}
>
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + struct tdx_memblock *tmb;
> > +
> > + /* Empty list means TDX isn't enabled successfully */
> > + if (list_empty(&tdx_memlist))
> > + return true;
> > +
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + /*
> > + * The new range is TDX memory if it is fully covered
> > + * by any TDX memory block.
> > + */
> > + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > + return true;
> > + }
> > + return false;
> > +}
>

2022-11-22 12:37:46

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot

> >
> > The TDX module is expected to be loaded by the BIOS when it enables TDX,
> > but the kernel needs to properly initialize it before it can be used to
> > create and run any TDX guests. The TDX module will be initialized at
> > runtime by the user (i.e. KVM) on demand.
>
> Calling KVM "the user" is a stretch. How about we give actual user
> facts instead of filling this with i.e.'s when there's only one actual
> way it happens?
>
> The TDX module will be initialized by the KVM subsystem when
> <insert actual trigger description here>.
>
> > Add a new early_initcall(tdx_init) to do TDX early boot initialization.
> > Only detect TDX private KeyIDs for now. Some other early checks will
> > follow up.
>
> Just say what this patch is doing. Don't try to
>
> > Also add a new function to report whether TDX has been
> > enabled by BIOS (TDX private KeyID range is valid). Kexec() will also
> > need it to determine whether need to flush dirty cachelines that are
> > associated with any TDX private KeyIDs before booting to the new kernel.
>
> That last sentence doesn't parse correctly.

Will do all of the above. Please see the updated patch at the bottom.


[...]

> > +/*
> > + * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> > + * BIOS. Both initializing the TDX module and running TDX guest require
> > + * TDX private KeyID.
>
> This comment is not right, sorry.
>
> Talk about the function at a *HIGH* level. Don't talk about every
> little detailed facet of the function. That's what the code is there for.
>
> > + * TDX doesn't trust BIOS. TDX verifies all configurations from BIOS
> > + * are correct before enabling TDX on any core. TDX requires the BIOS
> > + * to correctly and consistently program TDX private KeyIDs on all CPU
> > + * packages. Unless there is a BIOS bug, detecting a valid TDX private
> > + * KeyID range on BSP indicates TDX has been enabled by the BIOS. If
> > + * there's such BIOS bug, it will be caught later when initializing the
> > + * TDX module.
> > + */
>
> I have no idea what that comment is doing. Can it just be removed?

Will remove this part and update the entire comment.

Also will address all your other comments. Please see the updated patch.

[...]

>
> > + /*
> > + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
> > + * KeyIDs start after the last MKTME KeyID.
> > + */
>
> Is the TME key a "MKTME KeyID"?

I don't think so. Hardware handles TME KeyID 0 differently from non-0 MKTME
KeyIDs. And PCONFIG only accepts non-0 KeyIDs.

>
> > +static void __init clear_tdx(void)
> > +{
> > + tdx_keyid_start = tdx_keyid_num = 0;
> > +}
>
> This is where a comment is needed and can actually help.
>
> /*
> * tdx_keyid_start/num indicate that TDX is uninitialized. This
> * is used in TDX initialization error paths to take it from
> * initialized -> uninitialized.
> */
>

Just want to point out that after removing the !x2apic_enabled() check, the
only thing needed here is to detect/record the TDX KeyIDs.

And the purpose of this TDX boot-time initialization code is to provide the
platform_tdx_enabled() function for kexec() to use.

To distinguish boot-time TDX initialization from runtime TDX module
initialization, how about change the comment to below?

static void __init clear_tdx(void)
{
/*
* tdx_keyid_start and nr_tdx_keyids indicate that TDX is not
* enabled by the BIOS. This is used in TDX boot-time
* initialization error paths to take it from enabled to not
* enabled.
*/
tdx_keyid_start = nr_tdx_keyids = 0;
}

[...]

And below is the updated patch. How does it look to you?

==========================================================

x86/virt/tdx: Detect TDX during kernel boot

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs. The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

TDX doesn't trust the BIOS. During machine boot, TDX verifies the TDX
private KeyIDs are consistently and correctly programmed by the BIOS
across all CPU packages before it enables TDX on any CPU core. A valid
TDX private KeyID range on BSP indicates TDX has been enabled by the
BIOS, otherwise the BIOS is buggy.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests. The TDX module will be initialized by
the KVM subsystem when the KVM module is loaded.

Add a new early_initcall(tdx_init) to detect the TDX private KeyIDs.
Both TDX module initialization and creating TDX guests require the use of a
TDX private KeyID. Also add a function to report whether TDX is enabled by
the BIOS (TDX KeyID range is valid). Similar to AMD SME, kexec() will
use it to determine whether cache flush is needed.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (to distinguish it from TDX guest kernel
support). So far only KVM uses TDX. Make the new config option depend
on KVM_INTEL.

Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kai Huang <[email protected]>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..cced4ef3bfb2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1953,6 +1953,18 @@ config X86_SGX

If unsure, say N.

+config INTEL_TDX_HOST
+ bool "Intel Trust Domain Extensions (TDX) host support"
+ depends on CPU_SUP_INTEL
+ depends on X86_64
+ depends on KVM_INTEL
+ help
+ Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+ host and certain physical attacks. This option enables necessary TDX
+ support in host kernel to run protected VMs.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 415a5d138de4..38d3e8addc5f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -246,6 +246,8 @@ archheaders:

libs-y += arch/x86/lib/

+core-y += arch/x86/virt/
+
# drivers-y are linked after core-y
drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
drivers-$(CONFIG_PCI) += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 25fd6070dc0b..4dfe2e794411 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
return -ENODEV;
}
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else /* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST) += tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..93ca8b73e1f1
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..a60611448111
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+#include "tdx.h"
+
+static u32 tdx_keyid_start __ro_after_init;
+static u32 nr_tdx_keyids __ro_after_init;
+
+static int __init record_keyid_partitioning(void)
+{
+ u32 nr_mktme_keyids;
+ int ret;
+
+ /*
+ * IA32_MKTME_KEYID_PARTITIONING:
+ * Bit [31:0]: Number of MKTME KeyIDs.
+ * Bit [63:32]: Number of TDX private KeyIDs.
+ */
+ ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &nr_mktme_keyids,
+ &nr_tdx_keyids);
+ if (ret)
+ return -ENODEV;
+
+ if (!nr_tdx_keyids)
+ return -ENODEV;
+
+ /* TDX KeyIDs start after the last MKTME KeyID. */
+ tdx_keyid_start++;
+
+ pr_info("enabled: private KeyID range [%u, %u)\n",
+ tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
+
+ return 0;
+}
+
+static void __init clear_tdx(void)
+{
+ /*
+ * tdx_keyid_start and nr_tdx_keyids indicate that TDX is not
+ * enabled by the BIOS. This is used in TDX boot-time
+ * initialization error paths to take it from enabled to not
+ * enabled.
+ */
+ tdx_keyid_start = nr_tdx_keyids = 0;
+}
+
+static int __init tdx_init(void)
+{
+ int err;
+
+ err = record_keyid_partitioning();
+ if (err)
+ return err;
+
+ /*
+ * Initializing the TDX module requires one TDX private KeyID.
+ * If there's only one TDX KeyID then after module initialization
+ * KVM won't be able to run any TDX guest, which makes the whole
+ * thing worthless. Just disable TDX in this case.
+ */
+ if (nr_tdx_keyids < 2) {
+ pr_info("initialization failed: too few private KeyIDs available
(%d).\n",
+ nr_tdx_keyids);
+ goto no_tdx;
+ }
+
+ return 0;
+no_tdx:
+ clear_tdx();
+ return -ENODEV;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+ return !!nr_tdx_keyids;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..d00074abcb20
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability. The
+ * architectural definitions come first.
+ */
+
+/* MSR to report KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
+
+#endif
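
For context, a sketch of the eventual kexec-side consumer of
platform_tdx_enabled() mentioned in the changelog above (hypothetical,
not part of this patch):

	/*
	 * On the kexec path: flush caches that may hold dirty
	 * cachelines associated with TDX private KeyIDs.
	 */
	if (platform_tdx_enabled())
		native_wbinvd();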


2022-11-22 15:37:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:

> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>
>> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
>> CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
>> CPUs and use it to shut down the module. Later logical-cpu scope module
>> initialization will use it too.
>
> Uhh, those requirements ^ are not met by this:

Can run concurrently != Must run concurrently

The documentation clearly says "can run concurrently" as quoted above.

Thanks,

tglx

2022-11-22 15:40:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On 11/22/22 02:31, Thomas Gleixner wrote:
> Nothing in the TDX specs and docs mentions physical hotplug or a
> requirement for invoking seamcall on the world.

The TDX module source is actually out there[1] for us to look at. It's
in a lovely, convenient zip file, but you can read it if sufficiently
motivated.

It has this lovely nugget in it:

WARNING!!! Proprietary License!! Avert your virgin eyes!!!

> if (tdx_global_data_ptr->num_of_init_lps < tdx_global_data_ptr->num_of_lps)
> {
> TDX_ERROR("Num of initialized lps %d is smaller than total num of lps %d\n",
> tdx_global_data_ptr->num_of_init_lps, tdx_global_data_ptr->num_of_lps);
> retval = TDX_SYS_CONFIG_NOT_PENDING;
> goto EXIT;
> }

tdx_global_data_ptr->num_of_init_lps is incremented at TDH.SYS.INIT
time. That if() is called at TDH.SYS.CONFIG time to help bring the
module up.

So, I think you're right. I don't see the docs that actually *explain*
this "you must seamcall all the things" requirement.

1.
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html


2022-11-22 15:50:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/22/22 01:13, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>> +/*
>> + * Call the SEAMCALL on all online CPUs concurrently. Caller to check
>> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
>> + */
>> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
>> +{
>> + on_each_cpu(seamcall_smp_call_function, sc, true);
>> +}
>
> Suppose the user has NOHZ_FULL configured, and is already running
> userspace that will terminate on interrupt (this is desired feature for
> NOHZ_FULL), guess how happy they'll be if someone, on another partition,
> manages to tickle this TDX gunk?

Yeah, they'll be none too happy.

But, what do we do?

There are technical solutions like detecting if NOHZ_FULL is in play and
refusing to initialize TDX. There are also non-technical solutions like
telling folks in the documentation that they better modprobe kvm early
if they want to do TDX, or their NOHZ_FULL apps will pay.

We could also force the TDX module to be loaded early in boot before
NOHZ_FULL is in play, but that would waste memory on TDX metadata even
if TDX is never used.

How do NOHZ_FULL folks deal with late microcode updates, for example?
Those are roughly equally disruptive to all CPUs.
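
A sketch of the first option, detecting NOHZ_FULL and refusing to
initialize (illustrative only; whether this is the right policy is
exactly the open question):

#include <linux/tick.h>

static int __tdx_enable(void)
{
	/*
	 * Module initialization IPIs every CPU; bail out rather than
	 * disturb isolated NOHZ_FULL CPUs at some random time after boot.
	 */
	if (tick_nohz_full_enabled()) {
		pr_err("Not initializing TDX: NOHZ_FULL is enabled.\n");
		return -EBUSY;
	}

	/* ... rest of the initialization as in the patch ... */
	return 0;
}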

2022-11-22 16:12:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/22/22 01:20, Peter Zijlstra wrote:
> Either the Changelog is broken or this TDX crud is worse crap than I
> thought possible, because the only way to actually meet that requirement
> as stated is stop_machine().

I think the changelog is broken. I don't see anything in the TDX module
spec about "the SEAMCALL can run concurrently on different CPUs".
Shutdown, as far as I can tell, just requires that the shutdown seamcall
be run once on each CPU. Concurrency and ordering don't seem to matter
at all.
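
If that reading is right, "once on each CPU" with no concurrency
requirement could be as simple as the sketch below (function name
illustrative):

static void seamcall_on_each_cpu_serial(struct seamcall_ctx *sc)
{
	int cpu;

	cpus_read_lock();
	/* One CPU at a time; no ordering or concurrency constraints */
	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, seamcall_smp_call_function,
					 sc, true);
	cpus_read_unlock();
}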

2022-11-22 17:24:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot

On 11/22/22 03:28, Huang, Kai wrote:
>>> + /*
>>> + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
>>> + * KeyIDs start after the last MKTME KeyID.
>>> + */
>>
>> Is the TME key a "MKTME KeyID"?
>
> I don't think so. Hardware handles TME KeyID 0 differently from non-0 MKTME
> KeyIDs. And PCONFIG only accepts non-0 KeyIDs.

Let's say we have 4 MKTME hardware bits, we'd have:

0: TME Key
1->3: MKTME Keys
4->7: TDX Private Keys

First, the MSR values:

> + * IA32_MKTME_KEYID_PARTITIONING:
> + * Bit [31:0]: Number of MKTME KeyIDs.
> + * Bit [63:32]: Number of TDX private KeyIDs.

These would be:

Bit [31:0]  = 3
Bit [63:32] = 4

And in the end the variables:

tdx_keyid_start would be 4 and tdx_keyid_num would be 4.

Right?

That's a bit wonky for my brain because I guess I know too much about
the internal implementation and how the key space is split up. I guess
I (wrongly) expected Bit[31:0]==Bit[63:32].
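
The same example in code-comment form, following the MSR layout quoted
above (values hypothetical):

/*
 * 4 MKTME hardware bits, split as above:
 *
 *   IA32_MKTME_KEYID_PARTITIONING:
 *     Bit [31:0]  (nr_mktme_keyids) = 3;  MKTME KeyIDs 1-3 (KeyID 0 is TME)
 *     Bit [63:32] (nr_tdx_keyids)   = 4;  TDX KeyIDs 4-7
 *
 *   tdx_keyid_start = nr_mktme_keyids + 1;                  == 4
 *   last TDX KeyID  = tdx_keyid_start + nr_tdx_keyids - 1;  == 7
 */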



>>> +static void __init clear_tdx(void)
>>> +{
>>> + tdx_keyid_start = tdx_keyid_num = 0;
>>> +}
>>
>> This is where a comment is needed and can actually help.
>>
>> /*
>> * tdx_keyid_start/num indicate that TDX is uninitialized. This
>> * is used in TDX initialization error paths to take it from
>> * initialized -> uninitialized.
>> */
>
> Just want to point out after removing the !x2apic_enabled() check, the only
> thing need to do here is to detect/record the TDX KeyIDs.
>
> And the purpose of this TDX boot-time initialization code is to provide
> platform_tdx_enabled() function so that kexec() can use.
>
> To distinguish boot-time TDX initialization from runtime TDX module
> initialization, how about change the comment to below?
>
> static void __init clear_tdx(void)
> {
> /*
> * tdx_keyid_start and nr_tdx_keyids indicate that TDX is not
> * enabled by the BIOS. This is used in TDX boot-time
> * initialization error paths to take it from enabled to not
> * enabled.
> */
> tdx_keyid_start = nr_tdx_keyids = 0;
> }
>
> [...]

I honestly have no idea what "boot-time TDX initialization" is versus
"runtime TDX module initialization". This doesn't hel.

> And below is the updated patch. How does it look to you?

Let's see...

...
> +static u32 tdx_keyid_start __ro_after_init;
> +static u32 nr_tdx_keyids __ro_after_init;
> +
> +static int __init record_keyid_partitioning(void)
> +{
> + u32 nr_mktme_keyids;
> + int ret;
> +
> + /*
> + * IA32_MKTME_KEYID_PARTITIONING:
> + * Bit [31:0]: Number of MKTME KeyIDs.
> + * Bit [63:32]: Number of TDX private KeyIDs.
> + */
> + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &nr_mktme_keyids,
> + &nr_tdx_keyids);
> + if (ret)
> + return -ENODEV;
> +
> + if (!nr_tdx_keyids)
> + return -ENODEV;
> +
> + /* TDX KeyIDs start after the last MKTME KeyID. */
> + tdx_keyid_start++;

tdx_keyid_start is uninitialized here. So, it'd be 0, then ++'d.

Kai, please take a moment and slow down. This isn't a race. I offered
some replacement code here, which you've discarded, missed or ignored
and in the process broken this code.

This approach just wastes reviewer time. It's not working for me.

I'm going to make a suggestion (aka. a demand): You can post these
patches at most once a week. You get a whole week to (carefully)
incorporate reviewer feedback, make the patch better, and post a new
version. Need more time? Go ahead and take it. Take as much time as
you want.


2022-11-22 17:57:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22 2022 at 07:20, Dave Hansen wrote:
> On 11/22/22 01:20, Peter Zijlstra wrote:
>> Either the Changelog is broken or this TDX crud is worse crap than I
>> thought possible, because the only way to actually meet that requirement
>> as stated is stop_machine().
>
> I think the changelog is broken. I don't see anything in the TDX module
> spec about "the SEAMCALL can run concurrently on different CPUs".
> Shutdown, as far as I can tell, just requires that the shutdown seamcall
> be run once on each CPU. Concurrency and ordering don't seem to matter
> at all.

You're right. The 'can concurrently run' thing is for LP.INIT:

4.2.2. LP-Scope Initialization: TDH.SYS.LP.INIT

TDH.SYS.LP.INIT is intended to perform LP-scope, core-scope and
package-scope initialization of the Intel TDX module. It can be called
only after TDH.SYS.INIT completes successfully, and it can run
concurrently on multiple LPs.

2022-11-22 18:29:53

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 05/20] x86/virt/tdx: Implement functions to make SEAMCALL

On 11/20/22 16:26, Kai Huang wrote:
> TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> mode runs only the TDX module itself or other code to load the TDX
> module.
>
> The host kernel communicates with SEAM software via a new SEAMCALL
> instruction. This is conceptually similar to a guest->host hypercall,
> except it is made from the host to SEAM software instead.
>
> The TDX module defines a set of SEAMCALL leaf functions to allow the
> host to initialize it, and to create and run protected VMs. SEAMCALL
> leaf functions use an ABI different from the x86-64 system-v ABI.
> Instead, they share the same ABI with the TDCALL leaf functions.

I may have suggested this along the way, but the mention of the sysv ABI
is just confusing here. This is enough for a changelog:

The TDX module establishes a new SEAMCALL ABI which allows the
host to initialize the module and to manage VMs.

Kill the rest.

> Implement a function __seamcall() to allow the host to make SEAMCALL
> to SEAM software using the TDX_MODULE_CALL macro which is the common
> assembly for both SEAMCALL and TDCALL.

In general, I dislike mentioning function names in changelogs. Keep
this high-level, like:

Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very
similar to the TDCALL ABI and leverages much of the TDCALL
infrastructure.

> SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> CPU is not in VMX operation. The current TDX_MODULE_CALL macro doesn't
> handle any of them. There's no way to check whether the CPU is in VMX
> operation or not.

What is SEAMRR?

Why even mention this behavior in the changelog? Is this a problem?
Does it have a solution?

> Initializing the TDX module is done at runtime on demand, and it depends
> on the caller to ensure CPU is in VMX operation before making SEAMCALL.
> To avoid getting Oops when the caller mistakenly tries to initialize the
> TDX module when CPU is not in VMX operation, extend the TDX_MODULE_CALL
> macro to handle #UD (and also #GP, which can theoretically still happen
> when TDX isn't actually enabled by the BIOS, i.e. due to BIOS bug).

I'm not completely sure this is worth it. If the BIOS lies, we oops.
There are lots of ways that the BIOS lying can make the kernel oops.
What's one more?

> Introduce two new TDX error codes for #UD and #GP respectively so the
> caller can distinguish. Also, opportunistically put the new TDX error
> codes and the existing TDX_SEAMCALL_VMFAILINVALID into INTEL_TDX_HOST
> Kconfig option as they are only used when it is on.
>
> As __seamcall() can potentially return multiple error codes, besides the
> actual SEAMCALL leaf function return code, also introduce a wrapper
> function seamcall() to convert the __seamcall() error code to the kernel
> error code, so the caller doesn't need to duplicate the code to check
> return value of __seamcall() and return kernel error code accordingly.


> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 05fc89d9742a..d688228f3151 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,6 +8,10 @@
> #include <asm/ptrace.h>
> #include <asm/shared/tdx.h>
>
> +#ifdef CONFIG_INTEL_TDX_HOST
> +
> +#include <asm/trapnr.h>
> +
> /*
> * SW-defined error codes.
> *
> @@ -18,6 +22,11 @@
> #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
> #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
>
> +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
> +
> +#endif
> +
> #ifndef __ASSEMBLY__
>
> /*
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 93ca8b73e1f1..38d534f2c113 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,2 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -obj-y += tdx.o
> +obj-y += tdx.o seamcall.o
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> new file mode 100644
> index 000000000000..f81be6b9c133
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/frame.h>
> +
> +#include "tdxcall.S"
> +
> +/*
> + * __seamcall() - Host-side interface functions to SEAM software module
> + * (the P-SEAMLDR or the TDX module).
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
> + * or the completion status of the SEAMCALL leaf function. Additional
> + * output operands are saved in @out (if it is provided by the caller).
> + *
> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __seamcall() function ABI:
> + *
> + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI) - Input parameter 1, moved to RCX
> + * @rdx (RDX) - Input parameter 2, moved to RDX
> + * @r8 (RCX) - Input parameter 3, moved to R8
> + * @r9 (R8) - Input parameter 4, moved to R9
> + *
> + * @out (R9) - struct tdx_module_output pointer
> + * stored temporarily in R12 (not
> + * used by the P-SEAMLDR or the TDX
> + * module). It can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL, or
> + * TDX_SEAMCALL_VMFAILINVALID.
> + */
> +SYM_FUNC_START(__seamcall)
> + FRAME_BEGIN
> + TDX_MODULE_CALL host=1
> + FRAME_END
> + RET
> +SYM_FUNC_END(__seamcall)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 28c187b8726f..b06c1a2bc9cb 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -124,6 +124,48 @@ bool platform_tdx_enabled(void)
> return !!tdx_keyid_num;
> }
>
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
> +{
> + u64 sret;
> +
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +
> + /* Save SEAMCALL return code if caller wants it */
> + if (seamcall_ret)
> + *seamcall_ret = sret;
> +
> + /* SEAMCALL was successful */
> + if (!sret)
> + return 0;
> +
> + switch (sret) {
> + case TDX_SEAMCALL_GP:
> + /*
> + * platform_tdx_enabled() is checked to be true
> + * before making any SEAMCALL.
> + */

This doesn't make any sense. "platform_tdx_enabled() is checked"???

Do you mean that it *should* be checked, and probably wasn't, which is
what caused the error?

> + WARN_ON_ONCE(1);
> + fallthrough;
> + case TDX_SEAMCALL_VMFAILINVALID:
> + /* Return -ENODEV if the TDX module is not loaded. */
> + return -ENODEV;

Pro tip: you don't need to rewrite code in comments. If the code
literally says, "return -ENODEV", there is very little value in writing
virtually identical bytes "Return -ENODEV" in the comment.

2022-11-22 18:42:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On 11/20/22 16:26, Kai Huang wrote:
> 2) It is more flexible to support TDX module runtime updating in the
> future (after updating the TDX module, it needs to be initialized
> again).

I hate this generic blabber about "more flexible". There's a *REASON*
it's more flexible, so let's talk about the reasons, please.

It's really something like this, right?

The TDX module design allows it to be updated while the system
is running. The update procedure shares quite a few steps with
this "on demand" loading mechanism. The hope is that much of
this "on demand" mechanism can be shared with a future "update"
mechanism. A boot-time TDX module implementation would not be
able to share much code with the update mechanism.


> 3) It avoids having to do a "temporary" solution to handle VMXON in the
> core (non-KVM) kernel for now. This is because SEAMCALL requires CPU
> being in VMX operation (VMXON is done), but currently only KVM handles
> VMXON. Adding VMXON support to the core kernel isn't trivial. More
> importantly, from long-term a reference-based approach is likely needed
> in the core kernel as more kernel components are likely needed to
> support TDX as well. Allowing KVM to initialize the TDX module avoids
> having to handle VMXON during kernel boot for now.

There are a lot of words in there.

3) Loading the TDX module requires VMX to be enabled. Currently, only
the kernel KVM code mucks with VMX enabling. If the TDX module were
to be initialized separately from KVM (like at boot), the boot code
would need to be taught how to muck with VMX enabling and KVM would
need to be taught how to cope with that. Making KVM itself
responsible for TDX initialization lets the rest of the kernel stay
blissfully unaware of VMX.

> Add a placeholder tdx_enable() to detect and initialize the TDX module
> on demand, with a state machine protected by mutex to support concurrent
> calls from multiple callers.

As opposed to concurrent calls from one caller? ;)

> The TDX module will be initialized in multi-steps defined by the TDX
> module:
>
> 1) Global initialization;
> 2) Logical-CPU scope initialization;
> 3) Enumerate the TDX module capabilities and platform configuration;
> 4) Configure the TDX module about TDX usable memory ranges and global
> KeyID information;
> 5) Package-scope configuration for the global KeyID;
> 6) Initialize usable memory ranges based on 4).

This would actually be a nice place to call out the SEAMCALL names and
mention that each of these steps involves a set of SEAMCALLs.
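
Maybe something like this, if I have the leaf names right for this
series:

	1) Global initialization (TDH.SYS.INIT);
	2) Logical-CPU scope initialization (TDH.SYS.LP.INIT) on each CPU;
	3) Enumerate the TDX module capabilities and platform
	   configuration (TDH.SYS.INFO);
	4) Configure the TDX module about TDX usable memory ranges and
	   the global KeyID (TDH.SYS.CONFIG);
	5) Package-scope configuration for the global KeyID
	   (TDH.SYS.KEY.CONFIG);
	6) Initialize the usable memory ranges (TDH.SYS.TDMR.INIT).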

> The TDX module can also be shut down at any time during its lifetime.
> In case of any error during the initialization process, shut down the
> module. It's pointless to leave the module in any intermediate state
> during the initialization.
>
> Both logical CPU scope initialization and shutting down the TDX module
> require calling SEAMCALL on all boot-time present CPUs. For simplicity
> just temporarily disable CPU hotplug during the module initialization.

You might want to more precisely define "boot-time present CPUs". The
boot of *what*?

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 8d943bdc8335..28c187b8726f 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -10,15 +10,34 @@
> #include <linux/types.h>
> #include <linux/init.h>
> #include <linux/printk.h>
> +#include <linux/mutex.h>
> +#include <linux/cpu.h>
> +#include <linux/cpumask.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/apic.h>
> #include <asm/tdx.h>
> #include "tdx.h"
>
> +/* TDX module status during initialization */
> +enum tdx_module_status_t {
> + /* TDX module hasn't been detected and initialized */
> + TDX_MODULE_UNKNOWN,
> + /* TDX module is not loaded */
> + TDX_MODULE_NONE,
> + /* TDX module is initialized */
> + TDX_MODULE_INITIALIZED,
> + /* TDX module is shut down due to initialization error */
> + TDX_MODULE_SHUTDOWN,
> +};

Are these part of the ABI or just a purely OS-side construct?

> static u32 tdx_keyid_start __ro_after_init;
> static u32 tdx_keyid_num __ro_after_init;
>
> +static enum tdx_module_status_t tdx_module_status;
> +/* Prevent concurrent attempts on TDX detection and initialization */
> +static DEFINE_MUTEX(tdx_module_lock);
> +
> /*
> * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> * BIOS. Both initializing the TDX module and running TDX guest require
> @@ -104,3 +123,134 @@ bool platform_tdx_enabled(void)
> {
> return !!tdx_keyid_num;
> }
> +
> +/*
> + * Detect and initialize the TDX module.
> + *
> + * Return -ENODEV when the TDX module is not loaded, 0 when it
> + * is successfully initialized, or other error when it fails to
> + * initialize.
> + */
> +static int init_tdx_module(void)
> +{
> + /* The TDX module hasn't been detected */
> + return -ENODEV;
> +}
> +
> +static void shutdown_tdx_module(void)
> +{
> + /* TODO: Shut down the TDX module */
> +}
> +
> +static int __tdx_enable(void)
> +{
> + int ret;
> +
> + /*
> + * Initializing the TDX module requires doing SEAMCALL on all
> + * boot-time present CPUs. For simplicity temporarily disable
> + * CPU hotplug to prevent any CPU from going offline during
> + * the initialization.
> + */
> + cpus_read_lock();
> +
> + /*
> + * Check whether all boot-time present CPUs are online and
> + * return early with a message so the user can be aware.
> + *
> + * Note a non-buggy BIOS should never support physical (ACPI)
> + * CPU hotplug when TDX is enabled, and all boot-time present
> + * CPU should be enabled in MADT, so there should be no
> + * disabled_cpus and num_processors won't change at runtime
> + * either.
> + */

Again, there are a lot of words in that comment, but I'm not sure why
it's here. Despite all the whinging about ACPI, doesn't it boil down to:

The TDX module itself establishes its own concept of how many
logical CPUs there are in the system when it is loaded. The
module will reject initialization attempts unless the kernel
runs TDX initialization code on every last CPU.

Ensure that the kernel is able to run code on all known logical
CPUs.

and these checks are just to see if the kernel has shot itself in the
foot and *KNOWS* that it is currently unable to run code on some
logical CPU?

> + if (disabled_cpus || num_online_cpus() != num_processors) {
> + pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n");
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + ret = init_tdx_module();
> + if (ret == -ENODEV) {

Why check for -ENODEV exclusively? Is there some other nonzero error
code that indicates success?

> + pr_info("TDX module is not loaded.\n");
> + tdx_module_status = TDX_MODULE_NONE;
> + goto out;
> + }
> +
> + /*
> + * Shut down the TDX module in case of any error during the
> + * initialization process. It's meaningless to leave the TDX
> + * module in any middle state of the initialization process.
> + *
> + * Shutting down the module also requires doing SEAMCALL on all
> + * MADT-enabled CPUs. Do it while CPU hotplug is disabled.
> + *
> + * Return all errors during the initialization as -EFAULT as the
> + * module is always shut down.
> + */
> + if (ret) {
> + pr_info("Failed to initialize TDX module. Shut it down.\n");

"Shut it down" seems wrong here. That could be interpreted as "I have
already shut it down". "Shutting down" seems better.

> + shutdown_tdx_module();
> + tdx_module_status = TDX_MODULE_SHUTDOWN;
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + pr_info("TDX module initialized.\n");
> + tdx_module_status = TDX_MODULE_INITIALIZED;
> +out:
> + cpus_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * tdx_enable - Enable TDX by initializing the TDX module
> + *
> + * Caller to make sure all CPUs are online and in VMX operation before
> + * calling this function. CPU hotplug is temporarily disabled internally
> + * to prevent any cpu from going offline.

"cpu" or "CPU"?

> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * 0: The TDX module has been successfully initialized.
> + * * -ENODEV: The TDX module is not loaded, or TDX is not supported.
> + * * -EINVAL: The TDX module cannot be initialized due to certain
> + * conditions not being met (i.e. when not all MADT-enabled
> + * CPUs are online).
> + * * -EFAULT: Other internal fatal errors, or the TDX module is in
> + * shutdown mode due to it failed to initialize in previous
> + * attempts.
> + */

I honestly don't think all these error codes mean anything. They're
plumbed nowhere and the use of -EFAULT is just plain wrong.

Nobody can *DO* anything with these anyway.

Just give one error code and make sure that you have pr_info()'s around
to make it clear what went wrong. Then just do -EINVAL universally.
Remove all the nonsense comments.
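
A completely untested sketch of what I mean (__tdx_enable() here
standing in for the state machine below):

	int tdx_enable(void)
	{
		int ret;

		if (!platform_tdx_enabled()) {
			pr_info("TDX not enabled by BIOS\n");
			return -EINVAL;
		}

		mutex_lock(&tdx_module_lock);
		ret = __tdx_enable();
		mutex_unlock(&tdx_module_lock);

		/*
		 * Callers can't do anything useful with the details.
		 * The pr_*()s in the init path say what went wrong:
		 */
		return ret ? -EINVAL : 0;
	}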

> +int tdx_enable(void)
> +{
> + int ret;
> +
> + if (!platform_tdx_enabled())
> + return -ENODEV;
> +
> + mutex_lock(&tdx_module_lock);
> +
> + switch (tdx_module_status) {
> + case TDX_MODULE_UNKNOWN:
> + ret = __tdx_enable();
> + break;
> + case TDX_MODULE_NONE:
> + ret = -ENODEV;
> + break;

TDX_MODULE_NONE should probably be called TDX_MODULE_NOT_LOADED. A
comment would also be nice:

/* The BIOS did not load the module. No way to fix that. */

> + case TDX_MODULE_INITIALIZED:

/* Already initialized, great, tell the caller: */

> + ret = 0;
> + break;
> + default:
> + WARN_ON_ONCE(tdx_module_status != TDX_MODULE_SHUTDOWN);
> + ret = -EFAULT;
> + break;
> + }

I don't get what that default: is for or what it has to do with
TDX_MODULE_SHUTDOWN.
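
If the intent is "shut down means don't retry", I'd rather see the
states spelled out so the compiler can warn about unhandled ones.
Untested sketch:

	switch (tdx_module_status) {
	case TDX_MODULE_UNKNOWN:
		ret = __tdx_enable();
		break;
	case TDX_MODULE_NOT_LOADED:
		/* The BIOS did not load the module.  No way to fix that: */
		ret = -ENODEV;
		break;
	case TDX_MODULE_INITIALIZED:
		/* Already initialized, great, tell the caller: */
		ret = 0;
		break;
	case TDX_MODULE_SHUTDOWN:
		/* Failed before and was shut down.  Don't bother retrying: */
		ret = -EINVAL;
		break;
	}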


> + mutex_unlock(&tdx_module_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_enable);

2022-11-22 19:32:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22, 2022 at 07:14:14AM -0800, Dave Hansen wrote:
> On 11/22/22 01:13, Peter Zijlstra wrote:
> > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> >> +/*
> >> + * Call the SEAMCALL on all online CPUs concurrently. Caller to check
> >> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> >> + */
> >> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> >> +{
> >> + on_each_cpu(seamcall_smp_call_function, sc, true);
> >> +}
> >
> > Suppose the user has NOHZ_FULL configured, and is already running
> > userspace that will terminate on interrupt (this is desired feature for
> > NOHZ_FULL), guess how happy they'll be if someone, on another partition,
> > manages to tickle this TDX gunk?
>
> Yeah, they'll be none too happy.
>
> But, what do we do?

Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
all TDX using tasks.

> There are technical solutions like detecting if NOHZ_FULL is in play and
> refusing to initialize TDX. There are also non-technical solutions like
> telling folks in the documentation that they better modprobe kvm early
> if they want to do TDX, or their NOHZ_FULL apps will pay.

Surely modprobe kvm isn't the point where TDX gets loaded? Because
that's on boot for everybody due to all the auto-probing nonsense.

I was expecting TDX to not get initialized until the first TDX using KVM
instance is created. Am I wrong?

> We could also force the TDX module to be loaded early in boot before
> NOHZ_FULL is in play, but that would waste memory on TDX metadata even
> if TDX is never used.

I'm thinking it makes sense to have a tdx={off,on-demand,force} toggle
anyway.

> How do NOHZ_FULL folks deal with late microcode updates, for example?
> Those are roughly equally disruptive to all CPUs.

I imagine they don't do that -- in fact I would recommend we make the
whole late loading thing mutually exclusive with nohz_full; can't have
both.

2022-11-22 19:34:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/22/22 11:13, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 07:14:14AM -0800, Dave Hansen wrote:
>> On 11/22/22 01:13, Peter Zijlstra wrote:
>>> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>>>> +/*
>>>> + * Call the SEAMCALL on all online CPUs concurrently. Caller to check
>>>> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
>>>> + */
>>>> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
>>>> +{
>>>> + on_each_cpu(seamcall_smp_call_function, sc, true);
>>>> +}
>>>
>>> Suppose the user has NOHZ_FULL configured, and is already running
>>> userspace that will terminate on interrupt (this is desired feature for
>>> NOHZ_FULL), guess how happy they'll be if someone, on another partition,
>>> manages to tickle this TDX gunk?
>>
>> Yeah, they'll be none too happy.
>>
>> But, what do we do?
>
> Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> all TDX using tasks.

I don't think that works. As I mentioned to Thomas elsewhere, you don't
just need to initialize TDX on the CPUs where it is used. Before the
module will start working you need to initialize it on *all* the CPUs it
knows about. The module itself has a little counter where it tracks
this and will refuse to start being useful until it gets called
thoroughly enough.

>> There are technical solutions like detecting if NOHZ_FULL is in play and
>> refusing to initialize TDX. There are also non-technical solutions like
>> telling folks in the documentation that they better modprobe kvm early
>> if they want to do TDX, or their NOHZ_FULL apps will pay.
>
> Surely modprobe kvm isn't the point where TDX gets loaded? Because
> that's on boot for everybody due to all the auto-probing nonsense.
>
> I was expecting TDX to not get initialized until the first TDX using KVM
> instance is created. Am I wrong?

I went looking for it in this series to prove you wrong. I failed. :)

tdx_enable() is buried in here somewhere:

> https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

I don't have the patience to dig it out today, so I guess we'll have Kai
tell us.

>> We could also force the TDX module to be loaded early in boot before
>> NOHZ_FULL is in play, but that would waste memory on TDX metadata even
>> if TDX is never used.
>
> I'm thinking it makes sense to have a tdx={off,on-demand,force} toggle
> anyway.

Yep, that makes total sense. Kai had one in an earlier version but I
made him throw it out because it wasn't *strictly* required and this set
is fat enough.

>> How do NOHZ_FULL folks deal with late microcode updates, for example?
>> Those are roughly equally disruptive to all CPUs.
>
> I imagine they don't do that -- in fact I would recommend we make the
> whole late loading thing mutually exclusive with nohz_full; can't have
> both.

So, if we just use schedule_on_cpu() for now and have the TDX code wait,
will a NOHZ_FULL task just block the schedule_on_cpu() indefinitely?

That doesn't seem like _horrible_ behavior to start off with for a
minimal series.

2022-11-22 19:35:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22, 2022 at 10:57:52AM -0800, Dave Hansen wrote:

> To me, this starts to veer way too far into internal implementation details.
>
> Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
> to shut down the TDX module.

We really need to let go of the whole 'all BIOS-enabled CPUs' thing.

2022-11-22 19:36:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/20/22 16:26, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime. After the module is shut down, no further TDX module SEAMCALL
> leaf functions can be made to the module on any logical cpu.
>
> Shut down the TDX module in case of any error during the initialization
> process. It's pointless to leave the TDX module in some middle state.
>
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
> CPUs and use it to shut down the module. Later logical-cpu scope module
> initialization will use it too.

To me, this starts to veer way too far into internal implementation details.

Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
to shut down the TDX module.

This is also the point where you should talk about the new
infrastructure. Why do you need a new 'struct seamcall_something'?
What makes it special?

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b06c1a2bc9cb..5db1a05cb4bd 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,8 @@
> #include <linux/mutex.h>
> #include <linux/cpu.h>
> #include <linux/cpumask.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/apic.h>
> @@ -124,15 +126,27 @@ bool platform_tdx_enabled(void)
> return !!tdx_keyid_num;
> }
>
> +/*
> + * Data structure to make SEAMCALL on multiple CPUs concurrently.
> + * @err is set to -EFAULT when SEAMCALL fails on any cpu.
> + */
> +struct seamcall_ctx {
> + u64 fn;
> + u64 rcx;
> + u64 rdx;
> + u64 r8;
> + u64 r9;
> + atomic_t err;
> +};
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> * leaf function return code and the additional output respectively if
> * not NULL.
> */
> -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> - u64 *seamcall_ret,
> - struct tdx_module_output *out)
> +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret, struct tdx_module_output *out)
> {
> u64 sret;
>
> @@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> }
> }
>
> +static void seamcall_smp_call_function(void *data)
> +{
> + struct seamcall_ctx *sc = data;
> + int ret;
> +
> + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
> + if (ret)
> + atomic_set(&sc->err, -EFAULT);
> +}

The atomic_t is kinda silly. I guess it's not *that* wasteful though.

I think it would have actually been a lot more clear if instead of
containing an errno it was a *count* of the number of encountered errors.

An "atomic_set()" where everyone is overwriting each other is a bit
counterintuitive. It's OK here, of course, but it still looks goofy.

If this were:

atomic_inc(&sc->nr_errors);

it would be a lot more clear that *anyone* can increment and that it
truly is shared.
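
i.e. something like this (untested, with 'err' renamed accordingly):

	static void seamcall_smp_call_function(void *data)
	{
		struct seamcall_ctx *sc = data;

		/* Any failure just bumps the shared error count: */
		if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
			     NULL, NULL))
			atomic_inc(&sc->nr_errors);
	}

and then the callers check atomic_read(&sc->nr_errors) instead of a
magic -EFAULT.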

> +/*
> + * Call the SEAMCALL on all online CPUs concurrently. Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> + on_each_cpu(seamcall_smp_call_function, sc, true);
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -181,7 +214,9 @@ static int init_tdx_module(void)
>
> static void shutdown_tdx_module(void)
> {
> - /* TODO: Shut down the TDX module */
> + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> + seamcall_on_each_cpu(&sc);
> }


The seamcall_on_each_cpu() function is silly as-is. Either collapse the
functions or note in the changelog why this is not as silly as it looks.

> static int __tdx_enable(void)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 92a8de957dc7..215cc1065d78 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -12,6 +12,11 @@
> /* MSR to report KeyID partitioning between MKTME and TDX */
> #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
>
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN 44
> +
> /*
> * Do not put any hardware-defined TDX structure representations below
> * this comment!

2022-11-22 19:38:35

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 07/20] x86/virt/tdx: Do TDX module global initialization

On 11/20/22 16:26, Kai Huang wrote:
> The first step of initializing the module is to call TDH.SYS.INIT once
> on any logical cpu to do module global initialization. Do the module
> global initialization.
>
> It also detects the TDX module, as seamcall() returns -ENODEV when the
> module is not loaded.

Part of making a good patch set is telling a bit of a story. In patch
4, you laid out 6 steps necessary to initialize TDX. On top of that,
there is infrastructure. It would be great to lay all of that out in a
way that folks can actually follow along.

For instance, it would be great to tell the reader here that this patch
is an inflection point. It is transitioning out of the infrastructure
(patches 1->6) and into the actual "multi-steps" of initialization that
the module spec requires.

This patch is *TOTALLY* different from the one before it because it
actually _starts_ to do something useful.

But, you wouldn't know it from the changelog.

> arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++--
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 5db1a05cb4bd..f292292313bd 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -208,8 +208,23 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> */
> static int init_tdx_module(void)
> {
> - /* The TDX module hasn't been detected */
> - return -ENODEV;
> + int ret;
> +
> + /*
> + * Call TDH.SYS.INIT to do the global initialization of
> + * the TDX module. It also detects the module.
> + */
> + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> + if (ret)
> + goto out;

Please also note that the 0's are all just unused parameters. They mean
nothing.

> +
> + /*
> + * Return -EINVAL until all steps of TDX module initialization
> + * process are done.
> + */
> + ret = -EINVAL;
> +out:
> + return ret;
> }

It might be a bit unconventional, but can you imagine how well it would
tell the story if this comment said:

/*
* TODO:
* - Logical-CPU scope initialization (TDH_SYS_LP_INIT)
* - Enumerate capabilities and platform configuration (TDH_SYS_CONFIG)
...
*/

and then each of the following patches that *did* those things removed
the TODO line from the list.

That TODO list could have been added in patch 4.
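
For this patch, the result might look something like this (untested,
leaf names as defined in this series):

	static int init_tdx_module(void)
	{
		int ret;

		/*
		 * TDH.SYS.INIT does the global initialization of the
		 * TDX module.  It also detects the module.
		 */
		ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
		if (ret)
			return ret;

		/*
		 * TODO:
		 *  - Logical-CPU scope initialization (TDH_SYS_LP_INIT)
		 *  - Enumerate capabilities and platform configuration
		 *    (TDH_SYS_INFO)
		 *  - Configure usable memory and the global KeyID
		 *    (TDH_SYS_CONFIG)
		 *  ...
		 */
		return -EINVAL;
	}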

> static void shutdown_tdx_module(void)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 215cc1065d78..0b415805c921 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -15,6 +15,7 @@
> /*
> * TDX module SEAMCALL leaf functions
> */
> +#define TDH_SYS_INIT 33
> #define TDH_SYS_LP_SHUTDOWN 44
>
> /*

2022-11-22 19:48:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:

> > Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> > all TDX using tasks.
>
> I don't think that works. As I mentioned to Thomas elsewhere, you don't
> just need to initialize TDX on the CPUs where it is used. Before the
> module will start working you need to initialize it on *all* the CPUs it
> knows about. The module itself has a little counter where it tracks
> this and will refuse to start being useful until it gets called
> thoroughly enough.

That's bloody terrible, that is. How are we going to make that work with
the SMT mitigation crud that forces the SMT sibling offline?

Then the counters don't match and TDX won't work.

Can we get this limitation removed and simply let the module throw a
wobbly (error) when someone tries to use TDX without that logical CPU
having been properly initialized?

2022-11-22 19:49:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
>
> > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> >
> >> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> >> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> >> CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
> >> CPUs and use it to shut down the module. Later logical-cpu scope module
> >> initialization will use it too.
> >
> > Uhh, those requirements ^ are not met by this:
>
> Can run concurrently != Must run concurrently
>
> The documentation clearly says "can run concurrently" as quoted above.

The next sentence says: "Implement a mechanism to run SEAMCALL
concurrently" -- it does not.

Anyway, since we're all in agreement there is no such requirement at
all, a schedule_on_each_cpu() might be more appropriate, there is no
reason to use IPIs and spin-waiting for any of this.

That said; perhaps we should grow:

schedule_on_cpu(struct cpumask *cpus, work_func_t func);

to only disturb a given mask of CPUs.
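
Cribbing from schedule_on_each_cpu(), completely untested:

	int schedule_on_cpu(struct cpumask *cpus, work_func_t func)
	{
		struct work_struct __percpu *works;
		int cpu;

		works = alloc_percpu(struct work_struct);
		if (!works)
			return -ENOMEM;

		cpus_read_lock();
		/* Queue work only on the CPUs we're allowed to disturb: */
		for_each_cpu_and(cpu, cpus, cpu_online_mask) {
			struct work_struct *work = per_cpu_ptr(works, cpu);

			INIT_WORK(work, func);
			schedule_work_on(cpu, work);
		}
		for_each_cpu_and(cpu, cpus, cpu_online_mask)
			flush_work(per_cpu_ptr(works, cpu));
		cpus_read_unlock();

		free_percpu(works);
		return 0;
	}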

2022-11-22 19:53:26

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 22, 2022, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> > On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> >
> > > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> > >
> > >> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> > >> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> > >> CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
> > >> CPUs and use it to shut down the module. Later logical-cpu scope module
> > >> initialization will use it too.
> > >
> > > Uhh, those requirements ^ are not met by this:
> >
> > Can run concurrently != Must run concurrently
> >
> > The documentation clearly says "can run concurrently" as quoted above.
>
> The next sentence says: "Implement a mechanism to run SEAMCALL
> concurrently" -- it does not.
>
> Anyway, since we're all in agreement there is no such requirement at
> all, a schedule_on_each_cpu() might be more appropriate, there is no
> reason to use IPIs and spin-waiting for any of this.

Backing up a bit, what's the reason for _any_ of this? The changelog says

It's pointless to leave the TDX module in some middle state.

but IMO it's just as pointless to do a shutdown unless the kernel benefits in
some meaningful way. And IIUC, TDH.SYS.LP.SHUTDOWN does nothing more than change
the SEAM VMCS.HOST_RIP to point to an error trampoline. E.g. it's not like doing
a shutdown lets the kernel reclaim memory that was gifted to the TDX module.

In other words, this is just a really expensive way of changing a function pointer,
and the only way it would ever benefit the kernel is if there is a kernel bug that
leads to trying to use TDX after a fatal error. And even then, the only difference
seems to be that subsequent bogus SEAMCALLs would get a more unique error message.

2022-11-22 20:28:14

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Tue, Nov 22, 2022, Thomas Gleixner wrote:
> On Tue, Nov 22 2022 at 07:35, Dave Hansen wrote:
>
> > On 11/22/22 02:31, Thomas Gleixner wrote:
> >> Nothing in the TDX specs and docs mentions physical hotplug or a
> >> requirement for invoking seamcall on the world.
> >
> > The TDX module source is actually out there[1] for us to look at. It's
> > in a lovely, convenient zip file, but you can read it if sufficiently
> > motivated.
>
> zip file? Version control from the last millennium?
>
> The whole thing wants to be @github with a proper change history if
> Intel wants anyone to trust this and take it seriously.
>
> /me refrains from ranting about the outrageous license choice.

Let me know if you grab pitchforks and torches, I'll join the mob :-)

2022-11-22 20:40:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Tue, Nov 22 2022 at 07:35, Dave Hansen wrote:

> On 11/22/22 02:31, Thomas Gleixner wrote:
>> Nothing in the TDX specs and docs mentions physical hotplug or a
>> requirement for invoking seamcall on the world.
>
> The TDX module source is actually out there[1] for us to look at. It's
> in a lovely, convenient zip file, but you can read it if sufficiently
> motivated.

zip file? Version control from the last millennium?

The whole thing wants to be @github with a proper change history if
Intel wants anyone to trust this and take it seriously.

/me refrains from ranting about the outrageous license choice.

> It has this lovely nugget in it:
>
> WARNING!!! Proprietary License!! Avert your virgin eyes!!!

It's probably not the only reasons to avert the eyes.

>> if (tdx_global_data_ptr->num_of_init_lps < tdx_global_data_ptr->num_of_lps)
>> {
>> TDX_ERROR("Num of initialized lps %d is smaller than total num of lps %d\n",
>> tdx_global_data_ptr->num_of_init_lps, tdx_global_data_ptr->num_of_lps);
>> retval = TDX_SYS_CONFIG_NOT_PENDING;
>> goto EXIT;
>> }
>
> tdx_global_data_ptr->num_of_init_lps is incremented at TDH.SYS.INIT
> time. That if() is called at TDH.SYS.CONFIG time to help bring the
> module up.
>
> So, I think you're right. I don't see the docs that actually *explain*
> this "you must seamcall all the things" requirement.

The code actually enforces this.

At TDH.SYS.INIT which is the first operation it gets the total number
of LPs from the sysinfo table:

src/vmm_dispatcher/api_calls/tdh_sys_init.c:

tdx_global_data_ptr->num_of_lps = sysinfo_table_ptr->mcheck_fields.tot_num_lps;

Then TDH.SYS.LP.INIT increments the count of initialized LPs.

src/vmm_dispatcher/api_calls/tdh_sys_lp_init.c:

increment_num_of_lps(tdx_global_data_ptr)
_lock_xadd_32b(&tdx_global_data_ptr->num_of_init_lps, 1);

Finally TDH.SYS.CONFIG checks whether _ALL_ LPs have been initialized.

src/vmm_dispatcher/api_calls/tdh_sys_config.c:

if (tdx_global_data_ptr->num_of_init_lps < tdx_global_data_ptr->num_of_lps)

Clearly that's nowhere spelled out in the documentation, but I don't
buy the 'architecturally required' argument at all. It's an
implementation detail of the TDX module.

Technically there is IMO ZERO requirement to do so.

1) The TDX module is global

2) Seam-root and Seam-non-root operation are strictly a LP property.

The only architectural prerequisite for using Seam on a LP is that
obviously the encryption/decryption mechanics have been initialized
on the package to which the LP belongs.

I can see why it might be complicated to add/remove an LP after the
fact of initialization, but technically it should be possible.

TDX/Seam is not that special.

But what's absolutely annoying is that the documentation lacks any
information about the choice of enforcement which has been hardcoded
into the Seam module for whatever reasons.

Maybe I overlooked it, but then it's definitely well hidden.

Thanks,

tglx

2022-11-22 23:40:30

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 02/20] x86/virt/tdx: Detect TDX during kernel boot

On Tue, 2022-11-22 at 08:50 -0800, Dave Hansen wrote:
> On 11/22/22 03:28, Huang, Kai wrote:
> > > > + /*
> > > > + * KeyID 0 is for TME. MKTME KeyIDs start from 1. TDX private
> > > > + * KeyIDs start after the last MKTME KeyID.
> > > > + */
> > >
> > > Is the TME key a "MKTME KeyID"?
> >
> > I don't think so. Hardware handles TME KeyID 0 differently from non-0 MKTME
> > KeyIDs. And PCONFIG only accept non-0 KeyIDs.
>
> Let's say we have 4 MKTME hardware bits, we'd have:
>
> 0: TME Key
> 1->3: MKTME Keys
> 4->7: TDX Private Keys
>
> First, the MSR values:
>
> > + * IA32_MKTME_KEYID_PARTITIONING:
> > + * Bit [31:0]: Number of MKTME KeyIDs.
> > + * Bit [63:32]: Number of TDX private KeyIDs.
>
> These would be:
>
> Bit [ 31:0] = 3
> Bit [63:32] = 4
>
> And in the end the variables:
>
> tdx_keyid_start would be 4 and tdx_keyid_num would be 4.
>
> Right?

Yes.

>
> That's a bit wonky for my brain because I guess I know too much about
> the internal implementation and how the key space is split up. I guess
> I (wrongly) expected Bit[31:0]==Bit[63:32].

The spec says Bit[31:0] only reports the number of MKTME KeyIDs, and it
does exclude KeyID 0.

My machine has 6 hardware bits in total (that is KeyID 0 ~ 63), and the upper 48
KeyIDs are reserved to TDX. In my case:

[Bit 31:0] = 15
[Bit 63:32] = 48

And tdx_keyid_start and nr_tdx_keyids are 16 and 48.

The TDX KeyID range: [16, 63], or [16, 64).

So [Bit 31:0] reports only "NUM_MKTME_KIDS", which excludes KeyID 0.

>
>
>
> > > > +static void __init clear_tdx(void)
> > > > +{
> > > > + tdx_keyid_start = tdx_keyid_num = 0;
> > > > +}
> > >
> > > This is where a comment is needed and can actually help.
> > >
> > > /*
> > > * tdx_keyid_start/num indicate that TDX is uninitialized. This
> > > * is used in TDX initialization error paths to take it from
> > > * initialized -> uninitialized.
> > > */
> >
> > Just want to point out after removing the !x2apic_enabled() check, the only
> > thing need to do here is to detect/record the TDX KeyIDs.
> >
> > And the purpose of this TDX boot-time initialization code is to provide
> > platform_tdx_enabled() function so that kexec() can use.
> >
> > To distinguish boot-time TDX initialization from runtime TDX module
> > initialization, how about change the comment to below?
> >
> > static void __init clear_tdx(void)
> > {
> > /*
> > * tdx_keyid_start and nr_tdx_keyids indicate that TDX is not
> > * enabled by the BIOS. This is used in TDX boot-time
> > * initialization error paths to take it from enabled to not
> > * enabled.
> > */
> > tdx_keyid_start = nr_tdx_keyids = 0;
> > }
> >
> > [...]
>
> I honestly have no idea what "boot-time TDX initialization" is versus
> "runtime TDX module initialization". This doesn't hel.

I'll use your original comment.

>
> > And below is the updated patch. How does it look to you?
>
> Let's see...
>
> ...
> > +static u32 tdx_keyid_start __ro_after_init;
> > +static u32 nr_tdx_keyids __ro_after_init;
> > +
> > +static int __init record_keyid_partitioning(void)
> > +{
> > + u32 nr_mktme_keyids;
> > + int ret;
> > +
> > + /*
> > + * IA32_MKTME_KEYID_PARTITIONING:
> > + * Bit [31:0]: Number of MKTME KeyIDs.
> > + * Bit [63:32]: Number of TDX private KeyIDs.
> > + */
> > + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &nr_mktme_keyids,
> > + &nr_tdx_keyids);
> > + if (ret)
> > + return -ENODEV;
> > +
> > + if (!nr_tdx_keyids)
> > + return -ENODEV;
> > +
> > + /* TDX KeyIDs start after the last MKTME KeyID. */
> > + tdx_keyid_start++;
>
> tdx_keyid_start is uninitialized here. So, it'd be 0, then ++'d.
>
> Kai, please take a moment and slow down. This isn't a race. I offered
> some replacement code here, which you've discarded, missed or ignored
> and in the process broken this code.
>
> This approach just wastes reviewer time. It's not working for me.

Apology. I missed it this time.

>
> I'm going to make a suggestion (aka. a demand): You can post these
> patches at most once a week. You get a whole week to (carefully)
> incorporate reviewer feedback, make the patch better, and post a new
> version. Need more time? Go ahead and take it. Take as much time as
> you want.
>

Yes will follow.

2022-11-23 00:08:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On 11/20/22 16:26, Kai Huang wrote:
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums. Not all memory
> satisfies these requirements.
>
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR). During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees. The list of these
> ranges, along with TDX module information, is available to the kernel by
> querying the TDX module via TDH.SYS.INFO SEAMCALL.

I think the last sentence goes too far. What does it matter what the
name of the SEAMCALL is? Who cares at this point? It's in the patch.
Scroll down two pages if you really care.

> The host kernel can choose whether or not to use all convertible memory
> regions as TDX-usable memory. Before the TDX module is ready to create
> any TDX guests, the kernel needs to configure the TDX-usable memory
> regions by passing an array of "TD Memory Regions" (TDMRs) to the TDX
> module. Constructing the TDMR array requires information of both the
> TDX module (TDSYSINFO_STRUCT) and the Convertible Memory Regions. Call
> TDH.SYS.INFO to get this information as a preparation.

That last sentence is kinda goofy. I think there's a way to distill
this whole thing down more effectively.

CMRs tell the kernel which memory is TDX compatible. The kernel
takes CMRs and constructs "TD Memory Regions" (TDMRs). TDMRs
let the kernel grant TDX protections to some or all of the CMR
areas.

> Use static variables for both TDSYSINFO_STRUCT and CMR array to avoid

I find it very useful to be precise when referring to code. Your code
says 'tdsysinfo_struct', yet this says 'TDSYSINFO_STRUCT'. Why the
difference?

> having to pass them as function arguments when constructing the TDMR
> array. And they are too big to be put to the stack anyway. Also, KVM
> needs to use the TDSYSINFO_STRUCT to create TDX guests.

This is also a great place to mention that the tdsysinfo_struct contains
a *lot* of gunk which will not be used for a bit or that may never get
used.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2cf7090667aa..43227af25e44 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -15,6 +15,7 @@
> #include <linux/cpumask.h>
> #include <linux/smp.h>
> #include <linux/atomic.h>
> +#include <linux/align.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/apic.h>
> @@ -40,6 +41,11 @@ static enum tdx_module_status_t tdx_module_status;
> /* Prevent concurrent attempts on TDX detection and initialization */
> static DEFINE_MUTEX(tdx_module_lock);
>
> +/* Below two are used in TDH.SYS.INFO SEAMCALL ABI */
> +static struct tdsysinfo_struct tdx_sysinfo;
> +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> +static int tdx_cmr_num;
> +
> /*
> * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> * BIOS. Both initializing the TDX module and running TDX guest require
> @@ -208,6 +214,121 @@ static int tdx_module_init_cpus(void)
> return atomic_read(&sc.err);
> }
>
> +static inline bool is_cmr_empty(struct cmr_info *cmr)
> +{
> + return !cmr->size;
> +}
> +
> +static inline bool is_cmr_ok(struct cmr_info *cmr)
> +{
> + /* CMR must be page aligned */
> + return IS_ALIGNED(cmr->base, PAGE_SIZE) &&
> + IS_ALIGNED(cmr->size, PAGE_SIZE);
> +}
> +
> +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
> + const char *name)
> +{
> + int i;
> +
> + for (i = 0; i < cmr_num; i++) {
> + struct cmr_info *cmr = &cmr_array[i];
> +
> + pr_info("%s : [0x%llx, 0x%llx)\n", name,
> + cmr->base, cmr->base + cmr->size);
> + }
> +}
> +
> +/* Check CMRs reported by TDH.SYS.INFO, and trim tail empty CMRs. */
> +static int trim_empty_cmrs(struct cmr_info *cmr_array, int *actual_cmr_num)
> +{
> + struct cmr_info *cmr;
> + int i, cmr_num;
> +
> + /*
> + * Intel TDX module spec, 20.7.3 CMR_INFO:
> + *
> + * TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> + * array of CMR_INFO entries. The CMRs are sorted from the
> + * lowest base address to the highest base address, and they
> + * are non-overlapping.
> + *
> + * This implies that BIOS may generate invalid empty entries
> + * if total CMRs are less than 32. Need to skip them manually.
> + *
> + * CMR also must be 4K aligned. TDX doesn't trust BIOS. TDX
> + * actually verifies CMRs before it gets enabled, so anything
> + * doesn't meet above means kernel bug (or TDX is broken).
> + */

I dislike comments like this that describe all the code below. Can't
you simply put the comment near the code that implements it?

> + cmr = &cmr_array[0];
> + /* There must be at least one valid CMR */
> + if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr)))
> + goto err;
> +
> + cmr_num = *actual_cmr_num;
> + for (i = 1; i < cmr_num; i++) {
> + struct cmr_info *cmr = &cmr_array[i];
> + struct cmr_info *prev_cmr = NULL;
> +
> + /* Skip further empty CMRs */
> + if (is_cmr_empty(cmr))
> + break;
> +
> + /*
> + * Do sanity check anyway to make sure CMRs:
> + * - are 4K aligned
> + * - don't overlap
> + * - are in address ascending order.
> + */
> + if (WARN_ON_ONCE(!is_cmr_ok(cmr)))
> + goto err;

Why does cmr_array[0] get a pass on the empty and sanity checks?

> + prev_cmr = &cmr_array[i - 1];
> + if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) >
> + cmr->base))
> + goto err;
> + }
> +
> + /* Update the actual number of CMRs */
> + *actual_cmr_num = i;

That comment is not helpful. Yes, this is literally updating the number
of CMRs. Literally. That's the "what". But, the "why" is important.
Why is it doing this?

> + /* Print kernel checked CMRs */
> + print_cmrs(cmr_array, *actual_cmr_num, "Kernel-checked-CMR");

This is the point where I start to lose patience with these comments.
These are just a waste of space.

Also, I saw the loop above check 'cmr_num' CMRs for is_cmr_ok(). Now,
it'll print an 'actual_cmr_num' number of CMRs as being
"kernel-checked". Why? That makes zero sense.

> + return 0;
> +err:
> + pr_info("[TDX broken ?]: Invalid CMRs detected\n");
> + print_cmrs(cmr_array, cmr_num, "BIOS-CMR");
> + return -EINVAL;
> +}
> +
> +static int tdx_get_sysinfo(void)
> +{
> + struct tdx_module_output out;
> + int ret;
> +
> + BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
> +
> + ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
> + __pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
> + if (ret)
> + return ret;
> +
> + /* R9 contains the actual entries written the CMR array. */
> + tdx_cmr_num = out.r9;
> +
> + pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> + tdx_sysinfo.attributes, tdx_sysinfo.vendor_id,
> + tdx_sysinfo.major_version, tdx_sysinfo.minor_version,
> + tdx_sysinfo.build_date, tdx_sysinfo.build_num);

This is a case where a little bit of vertical alignment will go a long way:

> + tdx_sysinfo.attributes, tdx_sysinfo.vendor_id,
> + tdx_sysinfo.major_version, tdx_sysinfo.minor_version,
> + tdx_sysinfo.build_date, tdx_sysinfo.build_num);

> +
> + /*
> + * trim_empty_cmrs() updates the actual number of CMRs by
> + * dropping all tail empty CMRs.
> + */
> + return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> +}

Why does this both need to respect the "tdx_cmr_num = out.r9" value
*and* trim the empty ones? Couldn't it just ignore the "tdx_cmr_num =
out.r9" value and just trim the empty ones either way? It's not like
there is a billion of them. It would simplify the code for sure.
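
Something like this (untested, plus the overlap check if you really
want it) seems like it would do:

	/* Check CMRs and count them; the first empty CMR ends the list: */
	static int check_cmrs(struct cmr_info *cmr_array, int *nr_cmrs)
	{
		int i;

		for (i = 0; i < MAX_CMRS; i++) {
			struct cmr_info *cmr = &cmr_array[i];

			if (is_cmr_empty(cmr))
				break;

			if (!is_cmr_ok(cmr))
				return -EINVAL;
		}

		/* There must be at least one valid CMR: */
		if (!i)
			return -EINVAL;

		*nr_cmrs = i;
		return 0;
	}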

> /*
> * Detect and initialize the TDX module.
> *
> @@ -232,6 +353,10 @@ static int init_tdx_module(void)
> if (ret)
> goto out;
>
> + ret = tdx_get_sysinfo();
> + if (ret)
> + goto out;
> +
> /*
> * Return -EINVAL until all steps of TDX module initialization
> * process are done.
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 9ba11808bd45..8e273756098c 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -15,10 +15,71 @@
> /*
> * TDX module SEAMCALL leaf functions
> */
> +#define TDH_SYS_INFO 32
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35
> #define TDH_SYS_LP_SHUTDOWN 44
>
> +struct cmr_info {
> + u64 base;
> + u64 size;
> +} __packed;
> +
> +#define MAX_CMRS 32
> +#define CMR_INFO_ARRAY_ALIGNMENT 512
> +
> +struct cpuid_config {
> + u32 leaf;
> + u32 sub_leaf;
> + u32 eax;
> + u32 ebx;
> + u32 ecx;
> + u32 edx;
> +} __packed;
> +
> +#define TDSYSINFO_STRUCT_SIZE 1024
> +#define TDSYSINFO_STRUCT_ALIGNMENT 1024
> +
> +struct tdsysinfo_struct {
> + /* TDX-SEAM Module Info */
> + u32 attributes;
> + u32 vendor_id;
> + u32 build_date;
> + u16 build_num;
> + u16 minor_version;
> + u16 major_version;
> + u8 reserved0[14];
> + /* Memory Info */
> + u16 max_tdmrs;
> + u16 max_reserved_per_tdmr;
> + u16 pamt_entry_size;
> + u8 reserved1[10];
> + /* Control Struct Info */
> + u16 tdcs_base_size;
> + u8 reserved2[2];
> + u16 tdvps_base_size;
> + u8 tdvps_xfam_dependent_size;
> + u8 reserved3[9];
> + /* TD Capabilities */
> + u64 attributes_fixed0;
> + u64 attributes_fixed1;
> + u64 xfam_fixed0;
> + u64 xfam_fixed1;
> + u8 reserved4[32];
> + u32 num_cpuid_config;
> + /*
> + * The actual number of CPUID_CONFIG depends on above
> + * 'num_cpuid_config'. The size of 'struct tdsysinfo_struct'
> + * is 1024B defined by TDX architecture. Use a union with
> + * specific padding to make 'sizeof(struct tdsysinfo_struct)'
> + * equal to 1024.
> + */
> + union {
> + struct cpuid_config cpuid_configs[0];
> + u8 reserved5[892];
> + };

Can you double check what the "right" way to do variable arrays is these
days? I thought the [0] method was discouraged.

Also, it isn't *really* 892 bytes of reserved space, right? Anything
that's not cpuid_configs[] is reserved, I presume. Could you try to be
more precise there?
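
I *think* the modern way to put a flexible array inside a union is
DECLARE_FLEX_ARRAY(), i.e. something like:

	union {
		DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
		u8 reserved5[892];
	};

assuming 892 is even the right number of bytes to keep.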

> +} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
> +
> /*
> * Do not put any hardware-defined TDX structure representations below
> * this comment!

2022-11-23 00:59:33

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Tue, 2022-11-22 at 21:03 +0100, Thomas Gleixner wrote:
> > >      if (tdx_global_data_ptr->num_of_init_lps < tdx_global_data_ptr->num_of_lps)
> > >      {
> > >          TDX_ERROR("Num of initialized lps %d is smaller than total num of lps %d\n",
> > >                      tdx_global_data_ptr->num_of_init_lps, tdx_global_data_ptr->num_of_lps);
> > >          retval = TDX_SYS_CONFIG_NOT_PENDING;
> > >          goto EXIT;
> > >      }
> >
> > tdx_global_data_ptr->num_of_init_lps is incremented at TDH.SYS.INIT
> > time.  That if() is called at TDH.SYS.CONFIG time to help bring the
> > module up.
> >
> > So, I think you're right.  I don't see the docs that actually *explain*
> > this "you must seamcall all the things" requirement.
>
> The code actually enforces this.
>
> At TDH.SYS.INIT which is the first operation it gets the total number
> of LPs from the sysinfo table:
>
> src/vmm_dispatcher/api_calls/tdh_sys_init.c:
>
>     tdx_global_data_ptr->num_of_lps = sysinfo_table_ptr->mcheck_fields.tot_num_lps;
>
> Then TDH.SYS.LP.INIT increments the count of initialized LPs.
>
> src/vmm_dispatcher/api_calls/tdh_sys_lp_init.c:
>
>     increment_num_of_lps(tdx_global_data_ptr)
>        _lock_xadd_32b(&tdx_global_data_ptr->num_of_init_lps, 1);
>
> Finally TDH.SYS.CONFIG checks whether _ALL_ LPs have been initialized.
>
> src/vmm_dispatcher/api_calls/tdh_sys_config.c:
>
>     if (tdx_global_data_ptr->num_of_init_lps < tdx_global_data_ptr->num_of_lps)
>
> Clearly that's nowhere spelled out in the documentation, but I don't
> buy the 'architecturally required' argument at all. It's an
> implementation detail of the TDX module.

Hi Thomas,

Thanks for review!

I agree that at the hardware level there shouldn't be such a requirement (not
100% sure though), but from the kernel's perspective, "the implementation
detail of the TDX module" is sort of an "architectural requirement" -- at
least the Intel arch guys think so.

>
> Technically there is IMO ZERO requirement to do so.
>
>  1) The TDX module is global
>
>  2) Seam-root and Seam-non-root operation are strictly a LP property.
>
>     The only architectural prerequisite for using Seam on a LP is that
>     obviously the encryption/decryption mechanics have been initialized
>     on the package to which the LP belongs.
>
> I can see why it might be complicated to add/remove an LP after the
> fact of initialization, but technically it should be possible.

"kernel soft offline" actually isn't an issue. We can bring down a logical cpu
after it gets initialized and then bring it up again.

Only add/removal of a physical CPU will cause problems:

TDX MCHECK verifies all boot-time present cpus to make sure they are TDX-
compatible before it enables TDX in hardware. MCHECK cannot run on a hot-added
CPU, so TDX cannot support physical CPU hotplug.

We tried to get it clarified in the specification, and below is what the TDX
module arch guys agreed to put into the TDX module spec (just checked -- it's
not in the latest public spec yet, but they said it will be in the next
release):

"
4.1.3.2. CPU Configuration

During platform boot, MCHECK verifies all logical CPUs to ensure they meet TDX’s
security and certain functionality requirements, and MCHECK passes the following
CPU configuration information to the NP-SEAMLDR, P-SEAMLDR and the TDX Module:

· Total number of logical processors in the platform.
· Total number of installed packages in the platform.
· A table of per-package CPU family, model and stepping etc.
identification, as enumerated by CPUID(1).EAX.
The above information is static and does not change after platform boot and
MCHECK run.

Note: TDX doesn’t support adding or removing CPUs from TDX security
perimeter, as checked by MCHECK. BIOS should prevent CPUs from being hot-added
or hot-removed after platform boots.

The TDX module performs additional checks of the CPU’s configuration and
supported features, by reading MSRs and CPUID information as described in the
following sections.
"

>
> TDX/Seam is not that special.
>
> But what's absolutely annoying is that the documentation lacks any
> information about the choice of enforcement which has been hardcoded
> into the Seam module for whatever reasons.
>
> Maybe I overlooked it, but then it's definitely well hidden.

It depends on the TDX module implementation, which the TDX arch guys
consider "architectural", I think.


2022-11-23 01:02:37

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On 11/20/22 16:26, Kai Huang wrote:
> TDX reports a list of "Convertible Memory Region" (CMR) to indicate all
> memory regions that can possibly be used by the TDX module, but they are
> not automatically usable to the TDX module. As a step of initializing
> the TDX module, the kernel needs to choose a list of memory regions (out
> from convertible memory regions) that the TDX module can use and pass
> those regions to the TDX module. Once this is done, those "TDX-usable"
> memory regions are fixed during module's lifetime. No more TDX-usable
> memory can be added to the TDX module after that.
>
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator. To keep things simple, this initial
> implementation simply guarantees all pages in the page allocator are TDX
> memory. To achieve this, use all system memory in the core-mm at the
> time of initializing the TDX module as TDX memory, and in the meantime,
> refuse to add any non-TDX-memory in the memory hotplug.
>
> Specifically, walk through all memory regions managed by memblock and
> add them to a global list of "TDX-usable" memory regions, which is a
> fixed list after the module initialization (or empty if initialization
> fails). To reject non-TDX-memory in memory hotplug, add an additional
> check in arch_add_memory() to check whether the new region is covered by
> any region in the "TDX-usable" memory region list.
>
> Note this requires all memory regions in memblock are TDX convertible
> memory when initializing the TDX module. This is true in practice if no
> new memory has been hot-added before initializing the TDX module, since
> in practice all boot-time present DIMM is TDX convertible memory. If
> any new memory has been hot-added, then initializing the TDX module will
> fail because that memory region is not covered by any CMR.
>
> This can be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
>
> Note TDX assumes convertible memory is always physically present during
> machine's runtime. A non-buggy BIOS should never support hot-removal of
> any convertible memory. This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.

My eyes glazed over about halfway through that. Can you try to trim it
down a bit, or at least try to summarize it better up front?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> depends on X86_64
> depends on KVM_INTEL
> depends on X86_X2APIC
> + select ARCH_KEEP_MEMBLOCK
> help
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d688228f3151..71169ecefabf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> #ifdef CONFIG_INTEL_TDX_HOST
> bool platform_tdx_enabled(void);
> int tdx_enable(void);
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
> #else /* !CONFIG_INTEL_TDX_HOST */
> static inline bool platform_tdx_enabled(void) { return false; }
> static inline int tdx_enable(void) { return -ENODEV; }
> +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> + unsigned long end_pfn) { return true; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..900341333d7e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
> #include <asm/uv/uv.h>
> #include <asm/setup.h>
> #include <asm/ftrace.h>
> +#include <asm/tdx.h>
>
> #include "mm_internal.h"
>
> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> + /*
> + * For now if TDX is enabled, all pages in the page allocator

s/For now//

> + * must be TDX memory, which is a fixed set of memory regions
> + * that are passed to the TDX module. Reject the new region
> + * if it is not TDX memory to guarantee above is true.
> + */
> + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> + return -EINVAL;

There's a real art to making a right-size comment. I don't think this
needs to be any more than:

/*
* Not all memory is compatible with TDX. Reject
* the addition of any incompatible memory.
*/

If you want to write a treatise, do it in Documentation or at the
tdx_cc_memory_compatible() definition.

> init_memory_mapping(start, start + size, params->pgprot);
>
> return add_pages(nid, start_pfn, nr_pages, params);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 43227af25e44..32af86e31c47 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -16,6 +16,11 @@
> #include <linux/smp.h>
> #include <linux/atomic.h>
> #include <linux/align.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/apic.h>
> @@ -34,6 +39,13 @@ enum tdx_module_status_t {
> TDX_MODULE_SHUTDOWN,
> };
>
> +struct tdx_memblock {
> + struct list_head list;
> + unsigned long start_pfn;
> + unsigned long end_pfn;
> + int nid;
> +};

Why does the nid matter?

> static u32 tdx_keyid_start __ro_after_init;
> static u32 tdx_keyid_num __ro_after_init;
>
> @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
> static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> static int tdx_cmr_num;
>
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
> /*
> * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> * BIOS. Both initializing the TDX module and running TDX guest require
> @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
> return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> }
>
> +/* Check whether the given pfn range is covered by any CMR or not. */
> +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> + unsigned long end_pfn)
> +{
> + int i;
> +
> + for (i = 0; i < tdx_cmr_num; i++) {
> + struct cmr_info *cmr = &tdx_cmr_array[i];
> + unsigned long cmr_start_pfn;
> + unsigned long cmr_end_pfn;
> +
> + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> +
> + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> + return true;
> + }

What if the pfn range overlaps two CMRs? It will never pass any
individual overlap test and will return false.

> + return false;
> +}
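One possible fix -- a sketch only, relying on the spec's guarantee (quoted
later in this thread) that CMRs are sorted by base address and don't
overlap -- is to trim the head of the range as each CMR covers it, so a
range spanning multiple contiguous CMRs still passes:

	static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
					     unsigned long end_pfn)
	{
		int i;

		for (i = 0; i < tdx_cmr_num; i++) {
			struct cmr_info *cmr = &tdx_cmr_array[i];
			unsigned long cmr_start_pfn = cmr->base >> PAGE_SHIFT;
			unsigned long cmr_end_pfn =
				(cmr->base + cmr->size) >> PAGE_SHIFT;

			/* Trim off any leading part covered by this CMR. */
			if (start_pfn >= cmr_start_pfn && start_pfn < cmr_end_pfn)
				start_pfn = cmr_end_pfn;

			/* The whole range has been covered. */
			if (start_pfn >= end_pfn)
				return true;
		}

		/* Some part of the range is outside every CMR. */
		return false;
	}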
> +
> +/*
> + * Add a memory region on a given node as a TDX memory block. The caller
> + * to make sure all memory regions are added in address ascending order

s/to/must/

> + * and don't overlap.
> + */
> +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> + int nid)
> +{
> + struct tdx_memblock *tmb;
> +
> + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> + if (!tmb)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&tmb->list);
> + tmb->start_pfn = start_pfn;
> + tmb->end_pfn = end_pfn;
> + tmb->nid = nid;
> +
> + list_add_tail(&tmb->list, &tdx_memlist);
> + return 0;
> +}
> +
> +static void free_tdx_memory(void)

This is named a bit too generically. How about free_tdx_memlist() or
something?

> +{
> + while (!list_empty(&tdx_memlist)) {
> + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> + struct tdx_memblock, list);
> +
> + list_del(&tmb->list);
> + kfree(tmb);
> + }
> +}
> +
> +/*
> + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> + * Must be called when get_online_mems() is called by the caller.
> + */

Again, this explains the "what", but not the "why".

/*
* Ensure that all memblock memory regions are convertible to TDX
* memory. Once this has been established, stash the memblock
* ranges off in a secondary structure because $REASONS.
*/

Which makes me wonder: Why do you even need a secondary structure here?
What's wrong with the memblocks themselves?

> +static int build_tdx_memory(void)
> +{
> + unsigned long start_pfn, end_pfn;
> + int i, nid, ret;
> +
> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> + /*
> + * The first 1MB may not be reported as TDX convertible
> + * memory. Manually exclude them as TDX memory.

I don't like the "may not" here very much.

> + * This is fine as the first 1MB is already reserved in
> + * reserve_real_mode() and won't end up to ZONE_DMA as
> + * free page anyway.

^ free pages

> + */

This is way too wishy washy. The TDX module may or may not... Then, it
doesn't matter since reserve_real_mode() does it anyway...

Then it goes and adds code to skip it!

> + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> + if (start_pfn >= end_pfn)
> + continue;


Please just put a dang stake in the ground. If the other code deals
with this, then explain *why* more is needed here.

> + /* Verify memory is truly TDX convertible memory */
> + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
> + start_pfn << PAGE_SHIFT,
> + end_pfn << PAGE_SHIFT);
> + return -EINVAL;

... no 'goto err'? This leaks all the previous add_tdx_memblock()
structures, right?
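
Presumably the fix is just to route the failure through the existing cleanup
path, roughly:

	/* Sketch: take the error path so earlier entries get freed. */
	if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
		pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
			start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
		ret = -EINVAL;
		goto err;	/* free_tdx_memory() cleans up */
	}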

> + }
> +
> + /*
> + * Add the memory regions as TDX memory. The regions in
> + * memblock are already guaranteed to be in address
> + * ascending order and don't overlap.
> + */
> + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> + if (ret)
> + goto err;
> + }
> +
> + return 0;
> +err:
> + free_tdx_memory();
> + return ret;
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -357,12 +473,56 @@ static int init_tdx_module(void)
> if (ret)
> goto out;
>
> + /*
> + * All memory regions that can be used by the TDX module must be
> + * passed to the TDX module during the module initialization.
> + * Once this is done, all "TDX-usable" memory regions are fixed
> + * during module's runtime.
> + *
> + * The initial support of TDX guests only allocates memory from
> + * the global page allocator. To keep things simple, for now
> + * just make sure all pages in the page allocator are TDX memory.
> + *
> + * To achieve this, use all system memory in the core-mm at the
> + * time of initializing the TDX module as TDX memory, and in the
> + * meantime, reject any new memory in memory hot-add.
> + *
> + * This works because, in practice, all boot-time present DIMMs
> + * are TDX convertible memory. However, if any new memory is
> + * hot-added before initializing the TDX module, the
> + * initialization will fail because that memory is not covered
> + * by CMR.
> + *
> + * This can be enhanced in the future, i.e. by allowing adding or
> + * onlining non-TDX memory to a separate node, in which case the
> + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> + * together -- the userspace/kernel just needs to make sure pages
> + * for TDX guests must come from those "TDX-capable" nodes.
> + *
> + * Build the list of TDX memory regions as mentioned above so
> + * they can be passed to the TDX module later.
> + */

This is mostly Documentation/, not a code comment. Please clean it up.

> + get_online_mems();
> +
> + ret = build_tdx_memory();
> + if (ret)
> + goto out;
> /*
> * Return -EINVAL until all steps of TDX module initialization
> * process are done.
> */
> ret = -EINVAL;
> out:
> + /*
> + * Memory hotplug checks the hot-added memory region against the
> + * @tdx_memlist to see if the region is TDX memory.
> + *
> + * Do put_online_mems() here to make sure any modification to
> + * @tdx_memlist is done while holding the memory hotplug read
> + * lock, so that the memory hotplug path can just check the
> + * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> + * deadlock.
> + */

I'm honestly not following any of that.

> + put_online_mems();
> return ret;
> }
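
One way to read the intended locking -- an interpretation, not something
this thread confirms -- is:

	/*
	 * init path                          memory hotplug path
	 * ---------                          -------------------
	 * mutex_lock(&tdx_module_lock)       mem_hotplug_begin()
	 * get_online_mems()  <-- serializes    arch_add_memory()
	 * ...modify tdx_memlist...               tdx_cc_memory_compatible()
	 * put_online_mems()                        (reads tdx_memlist)
	 *
	 * Every writer of @tdx_memlist holds the memory hotplug read lock,
	 * so the hotplug path can safely read the list under its own lock
	 * without ever taking @tdx_module_lock -- avoiding lock inversion.
	 */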
>
> @@ -485,3 +645,26 @@ int tdx_enable(void)
> return ret;
> }
> EXPORT_SYMBOL_GPL(tdx_enable);
> +
> +/*
> + * Check whether the given range is TDX memory. Must be called between
> + * mem_hotplug_begin()/mem_hotplug_done().
> + */
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + struct tdx_memblock *tmb;
> +
> + /* Empty list means TDX isn't enabled successfully */
> + if (list_empty(&tdx_memlist))
> + return true;
> +
> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + /*
> + * The new range is TDX memory if it is fully covered
> + * by any TDX memory block.
> + */
> + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> + return true;

Same bug. What if the start/end_pfn range is covered by more than one
tdx_memblock?

> + }
> + return false;
> +}
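
The same trimming approach would work for the list walk, assuming the
tdx_memblocks stay in address-ascending, non-overlapping order (which
add_tdx_memblock() requires of its caller) -- again just a sketch:

	list_for_each_entry(tmb, &tdx_memlist, list) {
		/* Trim off any leading part covered by this block. */
		if (start_pfn >= tmb->start_pfn && start_pfn < tmb->end_pfn)
			start_pfn = tmb->end_pfn;

		/* The whole range has been covered. */
		if (start_pfn >= end_pfn)
			return true;
	}
	return false;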

2022-11-23 01:13:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/22/22 16:58, Huang, Kai wrote:
> On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
>>> I was expecting TDX to not get initialized until the first TDX using KVM
>>> instance is created. Am I wrong?
>> I went looking for it in this series to prove you wrong. I failed. :)
>>
>> tdx_enable() is buried in here somewhere:
>>
>>> https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/
>> I don't have the patience to dig it out today, so I guess we'll have Kai
>> tell us.
> It will be done when KVM module is loaded, but not when the first TDX guest is
> created.

Why is it done that way?

Can it be changed to delay TDX initialization until the first TDX guest
needs to run?

2022-11-23 01:25:26

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, 2022-11-22 at 20:33 +0100, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
>
> > > Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> > > all TDX using tasks.
> >
> > I don't think that works. As I mentioned to Thomas elsewhere, you don't
> > just need to initialize TDX on the CPUs where it is used. Before the
> > module will start working you need to initialize it on *all* the CPUs it
> > knows about. The module itself has a little counter where it tracks
> > this and will refuse to start being useful until it gets called
> > thoroughly enough.
>
> That's bloody terrible, that is. How are we going to make that work with
> the SMT mitigation crud that forces the SMT sibling offline?
>
> Then the counters don't match and TDX won't work.
>
> Can we get this limitation removed and simply let the module throw a
> wobbly (error) when someone tries to use TDX without that logical CPU
> having been properly initialized?

Dave kindly helped to raise this issue and I'll follow up with TDX module guys
to see whether we can remove/ease such limitation.

Thanks!

2022-11-23 01:33:11

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, 2022-11-22 at 20:14 +0100, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 10:57:52AM -0800, Dave Hansen wrote:
>
> > To me, this starts to veer way too far into internal implementation details.
> >
> > Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
> > to shut down the TDX module.
>
> We really need to let go of the whole 'all BIOS-enabled CPUs' thing.

As replied in another email I'll follow up with TDX module guys to see whether
we can remove/ease such limitation.

Thanks!

2022-11-23 01:34:43

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > I was expecting TDX to not get initialized until the first TDX using KVM
> > instance is created. Am I wrong?
>
> I went looking for it in this series to prove you wrong.  I failed.  :)
>
> tdx_enable() is buried in here somewhere:
>
> > https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/
>
> I don't have the patience to dig it out today, so I guess we'll have Kai
> tell us.

It will be done when KVM module is loaded, but not when the first TDX guest is
created.

2022-11-23 01:35:10

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Wed, 2022-11-23 at 00:30 +0000, Huang, Kai wrote:
> >
> > Clearly that's nowhere spelled out in the documentation, but I don't
> > buy the 'architecturaly required' argument not at all. It's an
> > implementation detail of the TDX module.
>
> Hi Thomas,
>
> Thanks for review!
>
> I agree on hardware level there shouldn't be such requirement (not 100% sure
> though), but I guess from kernel's perspective, "the implementation detail of
> the TDX module" is sort of "architectural requirement" -- at least Intel arch
> guys think so I guess.

Let me double check with the TDX module folks and figure out the root of the
requirement.

Thanks.

2022-11-23 01:35:57

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, 2022-11-22 at 17:04 -0800, Dave Hansen wrote:
> On 11/22/22 16:58, Huang, Kai wrote:
> > On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > > > I was expecting TDX to not get initialized until the first TDX using KVM
> > > > instance is created. Am I wrong?
> > > I went looking for it in this series to prove you wrong. I failed. :)
> > >
> > > tdx_enable() is buried in here somewhere:
> > >
> > > > https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/
> > > I don't have the patience to dig it out today, so I guess we'll have Kai
> > > tell us.
> > It will be done when KVM module is loaded, but not when the first TDX guest is
> > created.
>
> Why is it done that way?
>
> Can it be changed to delay TDX initialization until the first TDX guest
> needs to run?
>

Sean suggested.

Hi Sean, could you comment?

(I'll dig out the link of that suggestion if Sean didn't reply)

2022-11-23 10:06:09

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, 2022-11-22 at 19:31 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, Peter Zijlstra wrote:
> > On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> > > On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> > >
> > > > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> > > >
> > > > > Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> > > > > BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> > > > > CPUs. Implement a mechanism to run SEAMCALL concurrently on all online
> > > > > CPUs and use it to shut down the module. Later logical-cpu scope module
> > > > > initialization will use it too.
> > > >
> > > > Uhh, those requirements ^ are not met by this:
> > >
> > > Can run concurrently != Must run concurrently
> > >
> > > The documentation clearly says "can run concurrently" as quoted above.
> >
> > The next sentence says: "Implement a mechanism to run SEAMCALL
> > concurrently" -- it does not.
> >
> > Anyway, since we're all in agreement there is no such requirement at
> > all, a schedule_on_each_cpu() might be more appropriate, there is no
> > reason to use IPIs and spin-waiting for any of this.
>
> Backing up a bit, what's the reason for _any_ of this? The changelog says
>
> It's pointless to leave the TDX module in some middle state.
>
> but IMO it's just as pointless to do a shutdown unless the kernel benefits in
> some meaningful way. And IIUC, TDH.SYS.LP.SHUTDOWN does nothing more than change
> the SEAM VMCS.HOST_RIP to point to an error trampoline. E.g. it's not like doing
> a shutdown lets the kernel reclaim memory that was gifted to the TDX module.

The metadata memory has been freed back to the kernel in case of error before
shutting down the module.

>
> In other words, this is just a really expensive way of changing a function pointer,
> and the only way it would ever benefit the kernel is if there is a kernel bug that
> leads to trying to use TDX after a fatal error. And even then, the only difference
> seems to be that subsequent bogus SEAMCALLs would get a more unique error message.

The only issue with leaving the module open is, like you said, that bogus
SEAMCALLs can still be made. There's a slight risk that those SEAMCALLs may
actually be able to do something that crashes the kernel (e.g. if the module
is shut down due to a TDH.SYS.TDMR.INIT error, further bogus TDH.SYS.TDMR.INIT
calls can still corrupt the metadata).

But this belongs to the "kernel bug" or "kernel is under attack" category.
Neither of them should be a concern of TDX: the host kernel is outside the
TCB, and preventing DoS attacks is not part of TDX anyway.

Also, in case of kexec(), we cannot shut down the module either (in this
implementation, because CPUs may not be in VMX operation when kexec() runs).

So I agree with you, it's fine to not shut down the module.

Hi maintainers, does this look good to you?
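
For reference, if the shutdown path were kept, Peter's schedule_on_each_cpu()
suggestion might look roughly like this (a sketch only; the
TDH_SYS_LP_SHUTDOWN constant and the error plumbing are assumptions, and each
CPU must already be in VMX operation for the SEAMCALL to succeed):

	static atomic_t seamcall_err;

	static void seamcall_shutdown_fn(struct work_struct *work)
	{
		if (__seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, NULL))
			atomic_set(&seamcall_err, -EIO);
	}

	static int shutdown_tdx_module(void)
	{
		int ret;

		atomic_set(&seamcall_err, 0);
		/*
		 * Run the SEAMCALL from workqueue context on every online
		 * CPU and wait for completion -- no IPIs, no spin-waiting.
		 */
		ret = schedule_on_each_cpu(seamcall_shutdown_fn);

		return ret ?: atomic_read(&seamcall_err);
	}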

2022-11-23 10:06:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Tue, Nov 22, 2022 at 04:21:38PM -0800, Dave Hansen wrote:
> > + /*
> > + * All memory regions that can be used by the TDX module must be
> > + * passed to the TDX module during the module initialization.
> > + * Once this is done, all "TDX-usable" memory regions are fixed
> > + * during module's runtime.
> > + *
> > + * The initial support of TDX guests only allocates memory from
> > + * the global page allocator. To keep things simple, for now
> > + * just make sure all pages in the page allocator are TDX memory.
> > + *
> > + * To achieve this, use all system memory in the core-mm at the
> > + * time of initializing the TDX module as TDX memory, and in the
> > + * meantime, reject any new memory in memory hot-add.
> > + *
> > + * This works because, in practice, all boot-time present DIMMs
> > + * are TDX convertible memory. However, if any new memory is
> > + * hot-added before initializing the TDX module, the
> > + * initialization will fail because that memory is not covered
> > + * by CMR.
> > + *
> > + * This can be enhanced in the future, i.e. by allowing adding or
> > + * onlining non-TDX memory to a separate node, in which case the
> > + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> > + * together -- the userspace/kernel just needs to make sure pages
> > + * for TDX guests must come from those "TDX-capable" nodes.
> > + *
> > + * Build the list of TDX memory regions as mentioned above so
> > + * they can be passed to the TDX module later.
> > + */
>
> This is mostly Documentation/, not a code comment. Please clean it up.

So personally, I *vastly* prefer code comments over this Documentation/
cesspit. Putting things in Documentation/ is a bit like an
old-folks-home, neatly out of the way to (bit)rot in peace.

And that whole .rst disease is making it unreadable for anybody that
still knows how to use a text editor :-(

2022-11-23 10:25:32

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 05/20] x86/virt/tdx: Implement functions to make SEAMCALL

On Tue, 2022-11-22 at 10:06 +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:27PM +1300, Kai Huang wrote:
> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > + u64 *seamcall_ret,
> > + struct tdx_module_output *out)
> > +{
>
> What's the point of a 'static __always_unused' function again? Other
> than to test the DCE pass of a linker, that is?
>

It is used to avoid a compile warning, since with this patch alone the
function doesn't have any caller yet. Without the __always_unused, the
compiler will complain.

Originally it was in the patch "Shut down TDX module in case of error" where it
was first called. Dave suggested moving it out:

https://lore.kernel.org/all/[email protected]/

2022-11-23 10:55:02

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Tue, 2022-11-22 at 10:05 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > 2) It is more flexible to support TDX module runtime updating in the
> > future (after updating the TDX module, it needs to be initialized
> > again).
>
> I hate this generic blabber about "more flexible". There's a *REASON*
> it's more flexible, so let's talk about the reasons, please.
>
> It's really something like this, right?
>
> The TDX module design allows it to be updated while the system
> is running. The update procedure shares quite a few steps with
> this "on demand" loading mechanism. The hope is that much of
> this "on demand" mechanism can be shared with a future "update"
> mechanism. A boot-time TDX module implementation would not be
> able to share much code with the update mechanism.

Yes. Thanks.

>
>
> > 3) It avoids having to do a "temporary" solution to handle VMXON in the
> > core (non-KVM) kernel for now. This is because SEAMCALL requires CPU
> > being in VMX operation (VMXON is done), but currently only KVM handles
> > VMXON. Adding VMXON support to the core kernel isn't trivial. More
> > importantly, from long-term a reference-based approach is likely needed
> > in the core kernel as more kernel components are likely needed to
> > support TDX as well. Allow KVM to initialize the TDX module avoids
> > having to handle VMXON during kernel boot for now.
>
> There are a lot of words in there.
>
> 3) Loading the TDX module requires VMX to be enabled. Currently, only
> the kernel KVM code mucks with VMX enabling. If the TDX module were
> to be initialized separately from KVM (like at boot), the boot code
> would need to be taught how to muck with VMX enabling and KVM would
> need to be taught how to cope with that. Making KVM itself
> responsible for TDX initialization lets the rest of the kernel stay
> blissfully unaware of VMX.

Thanks.

>
> > Add a placeholder tdx_enable() to detect and initialize the TDX module
> > on demand, with a state machine protected by mutex to support concurrent
> > calls from multiple callers.
>
> As opposed to concurrent calls from one caller? ;)

How about below?

"
Add a placeholder tdx_enable() to initialize the TDX module on demand. So far
KVM will be the only caller, but other kernel components will need to use it too
in the future. Just use a mutex-protected state machine to make sure the module
initialization can only be done once.
"

>
> > The TDX module will be initialized in multi-steps defined by the TDX
> > module:
> >
> > 1) Global initialization;
> > 2) Logical-CPU scope initialization;
> > 3) Enumerate the TDX module capabilities and platform configuration;
> > 4) Configure the TDX module about TDX usable memory ranges and global
> > KeyID information;
> > 5) Package-scope configuration for the global KeyID;
> > 6) Initialize usable memory ranges based on 4).
>
> This would actually be a nice place to call out the SEAMCALL names and
> mention that each of these steps involves a set of SEAMCALLs.

How about below?

"
The TDX module will be initialized in multi-steps defined by the TDX module and
each of those steps involves a specific SEAMCALL:
1) Global initialization using TDH.SYS.INIT.
2) Logical-CPU scope initialization using TDH.SYS.LP.INIT.
3) Enumerate the TDX module capabilities and TDX-capable memory information 
using TDH.SYS.INFO.
4) Configure the TDX module with TDX-usable memory regions and the global
KeyID information using TDH.SYS.CONFIG.
5) Package-scope configuration for the global KeyID using TDH.SYS.KEY.CONFIG.
6) Initialize TDX-usable memory regions using TDH.SYS.TDMR.INIT.

Before step 4), the kernel needs to build a set of TDX-usable memory regions,
and construct data structures to cover those regions.
"

>
> > The TDX module can also be shut down at any time during its lifetime.
> > In case of any error during the initialization process, shut down the
> > module. It's pointless to leave the module in any intermediate state
> > during the initialization.
> >
> > Both logical CPU scope initialization and shutting down the TDX module
> > require calling SEAMCALL on all boot-time present CPUs. For simplicity
> > just temporarily disable CPU hotplug during the module initialization.
>
> You might want to more precisely define "boot-time present CPUs". The
> boot of *what*?

How about using "BIOS-enabled CPUs" instead? If OK, I'll use it consistently
across this series.

>
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 8d943bdc8335..28c187b8726f 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -10,15 +10,34 @@
> > #include <linux/types.h>
> > #include <linux/init.h>
> > #include <linux/printk.h>
> > +#include <linux/mutex.h>
> > +#include <linux/cpu.h>
> > +#include <linux/cpumask.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > #include <asm/apic.h>
> > #include <asm/tdx.h>
> > #include "tdx.h"
> >
> > +/* TDX module status during initialization */
> > +enum tdx_module_status_t {
> > + /* TDX module hasn't been detected and initialized */
> > + TDX_MODULE_UNKNOWN,
> > + /* TDX module is not loaded */
> > + TDX_MODULE_NONE,
> > + /* TDX module is initialized */
> > + TDX_MODULE_INITIALIZED,
> > + /* TDX module is shut down due to initialization error */
> > + TDX_MODULE_SHUTDOWN,
> > +};
>
> Are these part of the ABI or just a purely OS-side construct?

Purely OS-side construct. I'll explicitly call out in the comment.

>
> > static u32 tdx_keyid_start __ro_after_init;
> > static u32 tdx_keyid_num __ro_after_init;
> >
> > +static enum tdx_module_status_t tdx_module_status;
> > +/* Prevent concurrent attempts on TDX detection and initialization */
> > +static DEFINE_MUTEX(tdx_module_lock);
> > +
> > /*
> > * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> > * BIOS. Both initializing the TDX module and running TDX guest require
> > @@ -104,3 +123,134 @@ bool platform_tdx_enabled(void)
> > {
> > return !!tdx_keyid_num;
> > }
> > +
> > +/*
> > + * Detect and initialize the TDX module.
> > + *
> > + * Return -ENODEV when the TDX module is not loaded, 0 when it
> > + * is successfully initialized, or other error when it fails to
> > + * initialize.
> > + */
> > +static int init_tdx_module(void)
> > +{
> > + /* The TDX module hasn't been detected */
> > + return -ENODEV;
> > +}
> > +
> > +static void shutdown_tdx_module(void)
> > +{
> > + /* TODO: Shut down the TDX module */
> > +}
> > +
> > +static int __tdx_enable(void)
> > +{
> > + int ret;
> > +
> > + /*
> > + * Initializing the TDX module requires doing SEAMCALL on all
> > + * boot-time present CPUs. For simplicity temporarily disable
> > + * CPU hotplug to prevent any CPU from going offline during
> > + * the initialization.
> > + */
> > + cpus_read_lock();
> > +
> > + /*
> > + * Check whether all boot-time present CPUs are online and
> > + * return early with a message so the user can be aware.
> > + *
> > + * Note a non-buggy BIOS should never support physical (ACPI)
> > + * CPU hotplug when TDX is enabled, and all boot-time present
> > + * CPU should be enabled in MADT, so there should be no
> > + * disabled_cpus and num_processors won't change at runtime
> > + * either.
> > + */
>
> Again, there are a lot of words in that comment, but I'm not sure why
> it's here. Despite all the whinging about ACPI, doesn't it boil down to:
>
> The TDX module itself establishes its own concept of how many
> logical CPUs there are in the system when it is loaded.  
>

This isn't accurate. TDX MCHECK records the total number of logical CPUs when
the BIOS enables TDX. This happens before the TDX module is loaded. In fact
the TDX module only gets this information from a secret location.

> The
> module will reject initialization attempts unless the kernel
> runs TDX initialization code on every last CPU.
>
> Ensure that the kernel is able to run code on all known logical
> CPUs.

How about:

TDX itself establishes its own concept of how many logical CPUs there
are in the system when it gets enabled by the BIOS. The module will
reject initialization attempts unless the kernel runs TDX
initialization code on every last CPU.

Ensure that the kernel is able to run code on all known logical CPUs.

>
> and these checks are just to see if the kernel has shot itself in the
> foot and is *KNOWS* that it is currently unable to run code on some
> logical CPU?

Yes.

>
> > + if (disabled_cpus || num_online_cpus() != num_processors) {
> > + pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n");
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + ret = init_tdx_module();
> > + if (ret == -ENODEV) {
>
> Why check for -ENODEV exclusively? Is there some other error nonzero
> code that indicates success?

The idea is to print out "TDX module not loaded" to separate it from other
errors, so that the user can get a better idea when something goes wrong.

>
> > + pr_info("TDX module is not loaded.\n");
> > + tdx_module_status = TDX_MODULE_NONE;
> > + goto out;
> > + }
> > +
> > + /*
> > + * Shut down the TDX module in case of any error during the
> > + * initialization process. It's meaningless to leave the TDX
> > + * module in any middle state of the initialization process.
> > + *
> > + * Shutting down the module also requires doing SEAMCALL on all
> > + * MADT-enabled CPUs. Do it while CPU hotplug is disabled.
> > + *
> > + * Return all errors during the initialization as -EFAULT as the
> > + * module is always shut down.
> > + */
> > + if (ret) {
> > + pr_info("Failed to initialize TDX module. Shut it down.\n");
>
> "Shut it down" seems wrong here. That could be interpreted as "I have
> already shut it down". "Shutting down" seems better.

Will change to "Shutting down" if we still want to keep the shut down patch
(please see my other reply to Sean).

>
> > + shutdown_tdx_module();
> > + tdx_module_status = TDX_MODULE_SHUTDOWN;
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + pr_info("TDX module initialized.\n");
> > + tdx_module_status = TDX_MODULE_INITIALIZED;
> > +out:
> > + cpus_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +/**
> > + * tdx_enable - Enable TDX by initializing the TDX module
> > + *
> > + * Caller to make sure all CPUs are online and in VMX operation before
> > + * calling this function. CPU hotplug is temporarily disabled internally
> > + * to prevent any cpu from going offline.
>
> "cpu" or "CPU"?
>
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * 0: The TDX module has been successfully initialized.
> > + * * -ENODEV: The TDX module is not loaded, or TDX is not supported.
> > + * * -EINVAL: The TDX module cannot be initialized because certain
> > + * conditions are not met (i.e. when not all MADT-enabled
> > + * CPUs are online).
> > + * * -EFAULT: Other internal fatal errors, or the TDX module is in
> > + * shutdown mode due to it failed to initialize in previous
> > + * attempts.
> > + */
>
> I honestly don't think all these error codes mean anything. They're
> plumbed nowhere and the use of -EFAULT is just plain wrong.
>
> Nobody can *DO* anything with these anyway.
>
> Just give one error code and make sure that you have pr_info()'s around
> to make it clear what went wrong. Then just do -EINVAL universally.
> Remove all the nonsense comments.

OK.

> > +int tdx_enable(void)
> > +{
> > + int ret;
> > +
> > + if (!platform_tdx_enabled())
> > + return -ENODEV;
> > +
> > + mutex_lock(&tdx_module_lock);
> > +
> > + switch (tdx_module_status) {
> > + case TDX_MODULE_UNKNOWN:
> > + ret = __tdx_enable();
> > + break;
> > + case TDX_MODULE_NONE:
> > + ret = -ENODEV;
> > + break;
>
> TDX_MODULE_NONE should probably be called TDX_MODULE_NOT_LOADED. A
> comment would also be nice:
>
> /* The BIOS did not load the module. No way to fix that. */
>
> > + case TDX_MODULE_INITIALIZED:
>
> /* Already initialized, great, tell the caller: */

Will do.

>
> > + ret = 0;
> > + break;
> > + default:
> > + WARN_ON_ONCE(tdx_module_status != TDX_MODULE_SHUTDOWN);
> > + ret = -EFAULT;
> > + break;
> > + }
>
> I don't get what that default: is for or what it has to do with
> TDX_MODULE_SHUTDOWN.

I meant we can only have 4 possible states, and the default case must be the
TDX_MODULE_SHUTDOWN state.

I think I can just remove that WARN()?

>
>
> > + mutex_unlock(&tdx_module_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_enable);
>

2022-11-23 11:11:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

Kai!

On Wed, Nov 23 2022 at 00:30, Kai Huang wrote:
> On Tue, 2022-11-22 at 21:03 +0100, Thomas Gleixner wrote:
>> Clearly that's nowhere spelled out in the documentation, but I don't
>> buy the 'architecturaly required' argument not at all. It's an
>> implementation detail of the TDX module.
>
> I agree on hardware level there shouldn't be such requirement (not 100% sure
> though), but I guess from kernel's perspective, "the implementation detail of
> the TDX module" is sort of "architectural requirement"

Sure, but then it needs to be clearly documented so.

> -- at least Intel arch guys think so I guess.

Intel "arch" guys? You might look up the meanings of "arch" in a
dictionary. LKML is not twatter.

>> Technically there is IMO ZERO requirement to do so.
>>
>>  1) The TDX module is global
>>
>>  2) Seam-root and Seam-non-root operation are strictly a LP property.
>>
>>     The only architectural prerequisite for using Seam on a LP is that
>>     obviously the encryption/decryption mechanics have been initialized
>>     on the package to which the LP belongs.
>>
>> I can see why it might be complicated to add/remove an LP after
>> initialization fact, but technically it should be possible.
>
> "kernel soft offline" actually isn't an issue. We can bring down a logical cpu
> after it gets initialized and then bring it up again.

That's the whole point where this discussion started: _AFTER_ it gets
initialized.

Which means that, e.g. adding "nosmt" to the kernel command line will
make TDX fail hard because at the point where TDX is initialized the
hyperthreads are not 'soft' online and cannot respond to anything, but
the BIOS already accounted them.

This is just wrong as we all know that "nosmt" is sadly one of the
obvious counter measures for the never ending flood of speculation
issues.

> Only add/removal of physical cpu will cause problem: 

You wish.

> TDX MCHECK verifies all boot-time present cpus to make sure they are TDX-
> compatible before it enables TDX in hardware. MCHECK cannot run on hot-added
> CPU, so TDX cannot support physical CPU hotplug.

TDX can rightfully impose the limitation that it only executes on CPUs,
which are known at boot/initialization time, and only utilizes "known"
memory. That's it, but that does not enforce to prevent physical hotplug
in general.

> We tried to get it clarified in the specification, and below is what TDX/module
> arch guys agreed to put to the TDX module spec (just checked it's not in latest
> public spec yet, but they said it will be in next release):
>
> "
> 4.1.3.2. CPU Configuration
>
> During platform boot, MCHECK verifies all logical CPUs to ensure they
> meet TDX’s

That MCHECK falls into the category of security voodoo.

It needs to verify _ALL_ logical CPUs to ensure that Intel did not put
different models and steppings into a package or what?

> security and certain functionality requirements, and MCHECK passes the following
> CPU configuration information to the NP-SEAMLDR, P-SEAMLDR and the TDX Module:
>
> · Total number of logical processors in the platform.

You surely need MCHECK for this

> · Total number of installed packages in the platform.

and for this...

> · A table of per-package CPU family, model and stepping etc.
> identification, as enumerated by CPUID(1).EAX.
> The above information is static and does not change after platform boot and
> MCHECK run.
>
> Note: TDX doesn’t support adding or removing CPUs from TDX security
> perimeter, as checked by MCHECK.

More security voodoo. The TDX security perimeter has nothing to do with
adding or removing CPUs on a system. If that'd be true then TDX is a
complete fail.

> BIOS should prevent CPUs from being hot-added or hot-removed after
> platform boots.

If the BIOS does not prevent it, then TDX and the Seam module will not
even notice unless the OS tries to invoke seamcall() on a newly plugged
CPU.

A newly added CPU and newly added memory should not have any impact on
the TDX integrity of the already running system. If they have then
again, TDX is broken by design.

> The TDX module performs additional checks of the CPU’s configuration and
> supported features, by reading MSRs and CPUID information as described in the
> following sections.

to ensure that the MCHECK generated table is still valid at the point
where TDX is initialized?

That said, this documentation is at least better than the existing void,
but that does not make it technically correct.

Thanks,

tglx

2022-11-23 11:52:15

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 05/20] x86/virt/tdx: Implement functions to make SEAMCALL

On Tue, 2022-11-22 at 10:20 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
> > mode runs only the TDX module itself or other code to load the TDX
> > module.
> >
> > The host kernel communicates with SEAM software via a new SEAMCALL
> > instruction. This is conceptually similar to a guest->host hypercall,
> > except it is made from the host to SEAM software instead.
> >
> > The TDX module defines a set of SEAMCALL leaf functions to allow the
> > host to initialize it, and to create and run protected VMs. SEAMCALL
> > leaf functions use an ABI different from the x86-64 system-v ABI.
> > Instead, they share the same ABI with the TDCALL leaf functions.
>
> I may have suggested this along the way, but the mention of the sysv ABI
> is just confusing here. This is enough for a changelog:
>
> The TDX module establishes a new SEAMCALL ABI which allows the
> host to initialize the module and to manage VMs.
>
> Kill the rest.

Thanks will do.

>
> > Implement a function __seamcall() to allow the host to make SEAMCALL
> > to SEAM software using the TDX_MODULE_CALL macro which is the common
> > assembly for both SEAMCALL and TDCALL.
>
> In general, I dislike mentioning function names in changelogs. Keep
> this high-level, like:
>
> Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very
> similar to the TDCALL ABI and leverages much of the TDCALL
> infrastructure.

Will do.

>
> > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > CPU is not in VMX operation. The current TDX_MODULE_CALL macro doesn't
> > handle any of them. There's no way to check whether the CPU is in VMX
> > operation or not.
>
> What is SEAMRR?

Sorry it is a leftover. Should be "when TDX isn't enabled".

>
> Why even mention this behavior in the changelog. Is this a problem?
> Does it have a solution?

My intention was to provide some background on why the TDX_MODULE_CALL
macro is extended to handle #UD and #GP, as mentioned below.

>
> > Initializing the TDX module is done at runtime on demand, and it depends
> > on the caller to ensure CPU is in VMX operation before making SEAMCALL.
> > To avoid getting Oops when the caller mistakenly tries to initialize the
> > TDX module when CPU is not in VMX operation, extend the TDX_MODULE_CALL
> > macro to handle #UD (and also #GP, which can theoretically still happen
> > when TDX isn't actually enabled by the BIOS, i.e. due to BIOS bug).
>
> I'm not completely sure this is worth it. If the BIOS lies, we oops.
> There are lots of ways that the BIOS lying can make the kernel oops.
> What's one more?

I agree. But if we want to handle #UD, then #GP won't cause an oops anymore,
so I just added an error code for #GP too.

Or perhaps we can change to below: ?

"... extend the TDX_MODULE_CALL to handle #UD (and opportunistically #GP since
they share the same assembly)."

Or other suggestions?

>
> > Introduce two new TDX error codes for #UD and #GP respectively so the
> > caller can distinguish. Also, Opportunistically put the new TDX error
> > codes and the existing TDX_SEAMCALL_VMFAILINVALID into INTEL_TDX_HOST
> > Kconfig option as they are only used when it is on.
> >
> > As __seamcall() can potentially return multiple error codes, besides the
> > actual SEAMCALL leaf function return code, also introduce a wrapper
> > function seamcall() to convert the __seamcall() error code to the kernel
> > error code, so the caller doesn't need to duplicate the code to check
> > return value of __seamcall() and return kernel error code accordingly.
>
>

[...]

> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > + u64 *seamcall_ret,
> > + struct tdx_module_output *out)
> > +{
> > + u64 sret;
> > +
> > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +
> > + /* Save SEAMCALL return code if caller wants it */
> > + if (seamcall_ret)
> > + *seamcall_ret = sret;
> > +
> > + /* SEAMCALL was successful */
> > + if (!sret)
> > + return 0;
> > +
> > + switch (sret) {
> > + case TDX_SEAMCALL_GP:
> > + /*
> > + * platform_tdx_enabled() is checked to be true
> > + * before making any SEAMCALL.
> > + */
>
> This doesn't make any sense. "platform_tdx_enabled() is checked"???
>
> Do you mean that it *should* be checked and probably wasn't which is
> what caused the error?

I meant tdx_enable() already calls platform_tdx_enabled() to check whether BIOS
has enabled TDX at the very beginning before making any SEAMCALL, so
theoretically #GP should not happen unless there's a BIOS bug. I thought a
WARN() could help catch that.

>
> > + WARN_ON_ONCE(1);
> > + fallthrough;
> > + case TDX_SEAMCALL_VMFAILINVALID:
> > + /* Return -ENODEV if the TDX module is not loaded. */
> > + return -ENODEV;
>
> Pro tip: you don't need to rewrite code in comments. If the code
> literally says, "return -ENODEV", there is very little value in writing
> virtually identical bytes "Return -ENODEV" in the comment.
>

Indeed. Thanks for the tip! I'll update those comments.

2022-11-23 12:41:08

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Wed, 2022-11-23 at 12:05 +0100, Thomas Gleixner wrote:
> Kai!
>
> On Wed, Nov 23 2022 at 00:30, Kai Huang wrote:
> > On Tue, 2022-11-22 at 21:03 +0100, Thomas Gleixner wrote:
> > > Clearly that's nowhere spelled out in the documentation, but I don't
> > > buy the 'architecturaly required' argument not at all. It's an
> > > implementation detail of the TDX module.
> >
> > I agree on hardware level there shouldn't be such requirement (not 100% sure
> > though), but I guess from kernel's perspective, "the implementation detail of
> > the TDX module" is sort of "architectural requirement"
>
> Sure, but then it needs to be clearly documented so.
>
> > -- at least Intel arch guys think so I guess.
>
> Intel "arch" guys? You might look up the meanings of "arch" in a
> dictionary. LKML is not twatter.
>
> > > Technically there is IMO ZERO requirement to do so.
> > >
> > >  1) The TDX module is global
> > >
> > >  2) Seam-root and Seam-non-root operation are strictly a LP property.
> > >
> > >     The only architectural prerequisite for using Seam on a LP is that
> > >     obviously the encryption/decryption mechanics have been initialized
> > >     on the package to which the LP belongs.
> > >
> > > I can see why it might be complicated to add/remove an LP after
> > > initialization fact, but technically it should be possible.
> >
> > "kernel soft offline" actually isn't an issue. We can bring down a logical cpu
> > after it gets initialized and then bring it up again.
>
> That's the whole point where this discussion started: _AFTER_ it gets
> initialized.
>
> Which means that, e.g. adding "nosmt" to the kernel command line will
> make TDX fail hard because at the point where TDX is initialized the
> hyperthreads are not 'soft' online and cannot respond to anything, but
> the BIOS already accounted them.
>
> This is just wrong as we all know that "nosmt" is sadly one of the
> obvious counter measures for the never ending flood of speculation
> issues.

Agree. As said in my other replies, I'll follow up with TDX module guys on
this.

>
> > Only add/removal of physical cpu will cause problem: 
>
> You wish.
>
> > TDX MCHECK verifies all boot-time present cpus to make sure they are TDX-
> > compatible before it enables TDX in hardware. MCHECK cannot run on hot-added
> > CPU, so TDX cannot support physical CPU hotplug.
>
> TDX can rightfully impose the limitation that it only executes on CPUs,
> which are known at boot/initialization time, and only utilizes "known"
> memory. That's it, but that does not enforce to prevent physical hotplug
> in general.

Adding physical CPUs should be no problem, I guess; they just cannot run TDX.
The TDX architecture doesn't expect them to run TDX code anyway.

But would physical removal of a boot-time present CPU cause a problem? TDX
MCHECK checks/records boot-time present CPUs. If a CPU is removed and then a
new one is added, then TDX still treats it as TDX-compatible, but it may
actually not be.

So if this happens, from functionality's point of view, it can break. I think
TDX still wants to guarantee TDX code can work correctly on "TDX recorded" CPUs.

Also, I am not sure whether there's any security issue if a malicious kernel
tries to run TDX code on such removed-then-added CPU.

This seems like a TDX architecture problem rather than a kernel policy issue.

>
> > We tried to get it clarified in the specification, and below is what TDX/module
> > arch guys agreed to put to the TDX module spec (just checked it's not in latest
> > public spec yet, but they said it will be in next release):
> >
> > "
> > 4.1.3.2. CPU Configuration
> >
> > During platform boot, MCHECK verifies all logical CPUs to ensure they
> > meet TDX’s
>
> That MCHECK falls into the category of security voodoo.
>
> It needs to verify _ALL_ logical CPUs to ensure that Intel did not put
> different models and steppings into a package or what?

I am guessing so.

>
> > security and certain functionality requirements, and MCHECK passes the following
> > CPU configuration information to the NP-SEAMLDR, P-SEAMLDR and the TDX Module:
> >
> > · Total number of logical processors in the platform.
>
> You surely need MCHECK for this
>
> > · Total number of installed packages in the platform.
>
> and for this...
>
> > · A table of per-package CPU family, model and stepping etc.
> > identification, as enumerated by CPUID(1).EAX.
> > The above information is static and does not change after platform boot and
> > MCHECK run.
> >
> > Note: TDX doesn’t support adding or removing CPUs from TDX security
> > perimeter, as checked my MCHECK.
>
> More security voodoo. The TDX security perimeter has nothing to do with
> adding or removing CPUs on a system. If that'd be true then TDX is a
> complete fail.

No argument here.

> > BIOS should prevent CPUs from being hot-added or hot-removed after
> > platform boots.
>
> If the BIOS does not prevent it, then TDX and the Seam module will not
> even notice unless the OS tries to invoke seamcall() on a newly plugged
> CPU.
>
> A newly added CPU and newly added memory should not have any impact on
> the TDX integrity of the already running system. If they have then
> again, TDX is broken by design.

No argument here either.

>
> > The TDX module performs additional checks of the CPU’s configuration and
> > supported features, by reading MSRs and CPUID information as described in the
> > following sections.
>
> to ensure that the MCHECK generated table is still valid at the point
> where TDX is initialized?

I think it is trying to say:

MCHECK doesn't do all the verifications. Some verifications are deferred to the
TDX module to check when it gets initialized.

>
> That said, this documentation is at least better than the existing void,
> but that does not make it technically correct.
>
> Thanks,
>
> tglx

2022-11-23 12:41:09

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 07/20] x86/virt/tdx: Do TDX module global initialization

On Tue, 2022-11-22 at 11:14 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > The first step of initializing the module is to call TDH.SYS.INIT once
> > on any logical cpu to do module global initialization. Do the module
> > global initialization.
> >
> > It also detects the TDX module, as seamcall() returns -ENODEV when the
> > module is not loaded.
>
> Part of making a good patch set is telling a bit of a story. In patch
> 4, you laid out 6 steps necessary to initialize TDX. On top of that,
> there is infrastructure It would be great to lay that out in a way that
> folks can actually follow along.
>
> For instance, it would be great to tell the reader here that this patch
> is an inflection point. It is transitioning out of the infrastructure
> (patches 1->6) and into the actual "multi-steps" of initialization that
> the module spec requires.
>
> This patch is *TOTALLY* different from the one before it because it
> actually _starts_ to do something useful.
>
> But, you wouldn't know it from the changelog.

I'll try to enhance the changelogs to make them more connected. Right now I
don't have a clear idea of how best to write them.

>
> > arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++--
> > arch/x86/virt/vmx/tdx/tdx.h | 1 +
> > 2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 5db1a05cb4bd..f292292313bd 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -208,8 +208,23 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > */
> > static int init_tdx_module(void)
> > {
> > - /* The TDX module hasn't been detected */
> > - return -ENODEV;
> > + int ret;
> > +
> > + /*
> > + * Call TDH.SYS.INIT to do the global initialization of
> > + * the TDX module. It also detects the module.
> > + */
> > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > + if (ret)
> > + goto out;
>
> Please also note that the 0's are all just unused parameters. They mean
> nothing.

Will add to the comment.

>
> > +
> > + /*
> > + * Return -EINVAL until all steps of TDX module initialization
> > + * process are done.
> > + */
> > + ret = -EINVAL;
> > +out:
> > + return ret;
> > }
>
> It might be a bit unconventional, but can you imagine how well it would
> tell the story if this comment said:
>
> /*
> * TODO:
> * - Logical-CPU scope initialization (TDH_SYS_INIT_LP)
> * - Enumerate capabilities and platform configuration
> (TDH_SYS_CONFIG)
> ...
> */
>
> and then each of the following patches that *did* those things removed
> the TODO line from the list.
>
> That TODO list could have been added in patch 4.
>

Thanks for suggestion. Will do.

I think I can do this for the "construct TDMRs" related patches too.
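
Concretely, that TODO-list style might look something like this in the
skeleton patch (a sketch; the step/SEAMCALL pairing follows the list
discussed earlier in this thread), with each later patch deleting the line
it implements:

	static int init_tdx_module(void)
	{
		/*
		 * TODO:
		 *  - Logical-CPU scope initialization (TDH.SYS.LP.INIT).
		 *  - Get module and convertible memory info (TDH.SYS.INFO).
		 *  - Build the list of TDX-usable memory regions.
		 *  - Construct TDMRs to cover those regions.
		 *  - Configure the TDMRs and global KeyID (TDH.SYS.CONFIG).
		 *  - Configure the global KeyID on all packages
		 *    (TDH.SYS.KEY.CONFIG).
		 *  - Initialize the TDMRs (TDH.SYS.TDMR.INIT).
		 */
		return -EINVAL;
	}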

2022-11-23 12:42:34

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On Tue, 2022-11-22 at 15:39 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > TDX provides increased levels of memory confidentiality and integrity.
> > This requires special hardware support for features like memory
> > encryption and storage of memory integrity checksums. Not all memory
> > satisfies these requirements.
> >
> > As a result, TDX introduced the concept of a "Convertible Memory Region"
> > (CMR). During boot, the firmware builds a list of all of the memory
> > ranges which can provide the TDX security guarantees. The list of these
> > ranges, along with TDX module information, is available to the kernel by
> > querying the TDX module via TDH.SYS.INFO SEAMCALL.
>
> I think the last sentence goes too far. What does it matter what the
> name of the SEAMCALL is? Who cares at this point? It's in the patch.
> Scroll down two pages if you really care.

I'll remove "via TDH.SYS.INFO SEAMCALL".

>
> > The host kernel can choose whether or not to use all convertible memory
> > regions as TDX-usable memory. Before the TDX module is ready to create
> > any TDX guests, the kernel needs to configure the TDX-usable memory
> > regions by passing an array of "TD Memory Regions" (TDMRs) to the TDX
> > module. Constructing the TDMR array requires information of both the
> > TDX module (TDSYSINFO_STRUCT) and the Convertible Memory Regions. Call
> > TDH.SYS.INFO to get this information as a preparation.
>
> That last sentence is kinda goofy. I think there's a way to distill this
> whole thing down more effectively.
>
> CMRs tell the kernel which memory is TDX compatible. The kernel
> takes CMRs and constructs "TD Memory Regions" (TDMRs). TDMRs
> let the kernel grant TDX protections to some or all of the CMR
> areas.

Will do.

But it seems we should still mention "Constructing TDMRs requires information of
both the TDX module (TDSYSINFO_STRUCT) and the CMRs"? The reason is to justify
"use static to avoid having to pass them as function arguments when constructing
TDMRs" below.

>
> > Use static variables for both TDSYSINFO_STRUCT and CMR array to avoid
>
> I find it very useful to be precise when referring to code. Your code
> says 'tdsysinfo_struct', yet this says 'TDSYSINFO_STRUCT'. Why the
> difference?

Here I actually didn't intend to refer to any code. In the above paragraph
(that is going to be replaced with yours), I mentioned "TDSYSINFO_STRUCT" to
explain what does "information of the TDX module" actually refer to, since
TDSYSINFO_STRUCT is used in the spec.

What's your preference?

>
> > having to pass them as function arguments when constructing the TDMR
> > array. And they are too big to be put to the stack anyway. Also, KVM
> > needs to use the TDSYSINFO_STRUCT to create TDX guests.
>
> This is also a great place to mention that the tdsysinfo_struct contains
> a *lot* of gunk which will not be used for a bit or that may never get
> used.

Perhaps below?

"Note many members in tdsysinfo_struct' are not used by the kernel".

Btw, may I ask why it matters?

[...]


> > +
> > +/* Check CMRs reported by TDH.SYS.INFO, and trim tail empty CMRs. */
> > +static int trim_empty_cmrs(struct cmr_info *cmr_array, int *actual_cmr_num)
> > +{
> > + struct cmr_info *cmr;
> > + int i, cmr_num;
> > +
> > + /*
> > + * Intel TDX module spec, 20.7.3 CMR_INFO:
> > + *
> > + * TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> > + * array of CMR_INFO entries. The CMRs are sorted from the
> > + * lowest base address to the highest base address, and they
> > + * are non-overlapping.
> > + *
> > + * This implies that BIOS may generate invalid empty entries
> > + * if total CMRs are less than 32. Need to skip them manually.
> > + *
> > + * CMR also must be 4K aligned. TDX doesn't trust BIOS. TDX
> > + * actually verifies CMRs before it gets enabled, so anything
> > + * that doesn't meet the above means a kernel bug (or TDX is
> > + * broken).
> > + */
>
> I dislike comments like this that describe all the code below. Can't
> you simply put the comment near the code that implements it?

Will do.

>
> > + cmr = &cmr_array[0];
> > + /* There must be at least one valid CMR */
> > + if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr)))
> > + goto err;
> > +
> > + cmr_num = *actual_cmr_num;
> > + for (i = 1; i < cmr_num; i++) {
> > + struct cmr_info *cmr = &cmr_array[i];
> > + struct cmr_info *prev_cmr = NULL;
> > +
> > + /* Skip further empty CMRs */
> > + if (is_cmr_empty(cmr))
> > + break;
> > +
> > + /*
> > + * Do sanity check anyway to make sure CMRs:
> > + * - are 4K aligned
> > + * - don't overlap
> > + * - are in address ascending order.
> > + */
> > + if (WARN_ON_ONCE(!is_cmr_ok(cmr)))
> > + goto err;
>
> Why does cmr_array[0] get a pass on the empty and sanity checks?

TDX MCHECK verifies CMRs before enabling TDX, so there must be at least one
valid CMR.

And cmr_array[0] is checked before this loop.

>
> > + prev_cmr = &cmr_array[i - 1];
> > + if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) >
> > + cmr->base))
> > + goto err;
> > + }
> > +
> > + /* Update the actual number of CMRs */
> > + *actual_cmr_num = i;
>
> That comment is not helpful. Yes, this is literally updating the number
> of CMRs. Literally. That's the "what". But, the "why" is important.
> Why is it doing this?

When building the list of "TDX-usable" memory regions, the kernel verifies those
regions against CMRs to see whether they are truly convertible memory.

How about adding a comment like below:

/*
* When the kernel builds the TDX-usable memory regions, it verifies
* they are truly convertible memory by checking them against CMRs.
* Update the actual number of CMRs to skip those empty CMRs.
*/

Also, I think printing CMRs in the dmesg is helpful. Printing empty (zero)
CMRs would just add meaningless lines to the dmesg.

>
> > + /* Print kernel checked CMRs */
> > + print_cmrs(cmr_array, *actual_cmr_num, "Kernel-checked-CMR");
>
> This is the point where I start to lose patience with these comments.
> These are just a waste of space.

Sorry, will remove.

>
> Also, I saw the loop above check 'cmr_num' CMRs for is_cmr_ok(). Now,
> it'll print an 'actual_cmr_num=1' number of CMRs as being
> "kernel-checked". Why? That makes zero sense.

The loop quits when it sees an empty CMR. I think there's no need to check
further CMRs as they must be empty (TDX MCHECK verifies CMRs).

>
> > + return 0;
> > +err:
> > + pr_info("[TDX broken ?]: Invalid CMRs detected\n");
> > + print_cmrs(cmr_array, cmr_num, "BIOS-CMR");
> > + return -EINVAL;
> > +}
> > +
> > +static int tdx_get_sysinfo(void)
> > +{
> > + struct tdx_module_output out;
> > + int ret;
> > +
> > + BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
> > +
> > + ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
> > + __pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
> > + if (ret)
> > + return ret;
> > +
> > + /* R9 contains the actual entries written to the CMR array. */
> > + tdx_cmr_num = out.r9;
> > +
> > + pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> > + tdx_sysinfo.attributes, tdx_sysinfo.vendor_id,
> > + tdx_sysinfo.major_version, tdx_sysinfo.minor_version,
> > + tdx_sysinfo.build_date, tdx_sysinfo.build_num);
>
> This is a case where a little bit of vertical alignment will go a long way:
>
> > + tdx_sysinfo.attributes, tdx_sysinfo.vendor_id,
> > + tdx_sysinfo.major_version, tdx_sysinfo.minor_version,
> > + tdx_sysinfo.build_date, tdx_sysinfo.build_num);

Thanks, will do.

>
> > +
> > + /*
> > + * trim_empty_cmrs() updates the actual number of CMRs by
> > + * dropping all tail empty CMRs.
> > + */
> > + return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> > +}
>
> Why does this both need to respect the "tdx_cmr_num = out.r9" value
> *and* trim the empty ones? Couldn't it just ignore the "tdx_cmr_num =
> out.r9" value and just trim the empty ones either way? It's not like
> there is a billion of them. It would simplify the code for sure.

OK. Since the spec says MAX_CMRS is 32, I can use 32 instead of reading it out
from R9.

[...]

> > +struct cpuid_config {
> > + u32 leaf;
> > + u32 sub_leaf;
> > + u32 eax;
> > + u32 ebx;
> > + u32 ecx;
> > + u32 edx;
> > +} __packed;
> > +
> > +#define TDSYSINFO_STRUCT_SIZE 1024
> > +#define TDSYSINFO_STRUCT_ALIGNMENT 1024
> > +
> > +struct tdsysinfo_struct {
> > + /* TDX-SEAM Module Info */
> > + u32 attributes;
> > + u32 vendor_id;
> > + u32 build_date;
> > + u16 build_num;
> > + u16 minor_version;
> > + u16 major_version;
> > + u8 reserved0[14];
> > + /* Memory Info */
> > + u16 max_tdmrs;
> > + u16 max_reserved_per_tdmr;
> > + u16 pamt_entry_size;
> > + u8 reserved1[10];
> > + /* Control Struct Info */
> > + u16 tdcs_base_size;
> > + u8 reserved2[2];
> > + u16 tdvps_base_size;
> > + u8 tdvps_xfam_dependent_size;
> > + u8 reserved3[9];
> > + /* TD Capabilities */
> > + u64 attributes_fixed0;
> > + u64 attributes_fixed1;
> > + u64 xfam_fixed0;
> > + u64 xfam_fixed1;
> > + u8 reserved4[32];
> > + u32 num_cpuid_config;
> > + /*
> > + * The actual number of CPUID_CONFIG depends on above
> > + * 'num_cpuid_config'. The size of 'struct tdsysinfo_struct'
> > + * is 1024B defined by TDX architecture. Use a union with
> > + * specific padding to make 'sizeof(struct tdsysinfo_struct)'
> > + * equal to 1024.
> > + */
> > + union {
> > + struct cpuid_config cpuid_configs[0];
> > + u8 reserved5[892];
> > + };
>
> Can you double check what the "right" way to do variable arrays is these
> days? I thought the [0] method was discouraged.
>
> Also, it isn't *really* 892 bytes of reserved space, right? Anything
> that's not cpuid_configs[] is reserved, I presume. Could you try to be
> more precise there?

I'll study this first and get back to you. Thanks.

The intention is to make sure the structure size is 1024B, so that the static
variable will have enough space for the TDX module to write.
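If the [0] style is indeed discouraged these days, one option might be the
following (just a sketch, assuming DECLARE_FLEX_ARRAY() is available in the
target tree, since a plain C99 flexible array member cannot sit directly
inside a union):

	union {
		DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
		u8 reserved5[892];
	};

The reserved5 padding would still force sizeof(struct tdsysinfo_struct) to
be 1024 bytes.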

2022-11-23 16:35:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Wed, Nov 23, 2022, Huang, Kai wrote:
> On Tue, 2022-11-22 at 17:04 -0800, Dave Hansen wrote:
> > On 11/22/22 16:58, Huang, Kai wrote:
> > > On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > > > > I was expecting TDX to not get initialized until the first TDX using KVM
> > > > > instance is created. Am I wrong?
> > > > I went looking for it in this series to prove you wrong. I failed.
> > > >
> > > > tdx_enable() is buried in here somewhere:
> > > >
> > > > > https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/
> > > > I don't have the patience to dig it out today, so I guess we'll have Kai
> > > > tell us.
> > > It will be done when KVM module is loaded, but not when the first TDX guest is
> > > created.
> >
> > Why is it done that way?
> >
> > Can it be changed to delay TDX initialization until the first TDX guest
> > needs to run?
> >
>
> Sean suggested.
>
Hi Sean, could you comment?

Waiting until the first TDX guest is created would result in false advertising,
as KVM wouldn't know whether or not TDX is actually supported until that first
VM is created. If we can guarantee that TDH.SYS.INIT will fail if and only if
there is a kernel bug, then I would be ok deferring the "enabling" until the
first VM is created.

2022-11-23 17:01:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On 11/23/22 03:40, Huang, Kai wrote:
> On Tue, 2022-11-22 at 15:39 -0800, Dave Hansen wrote:
>> That last sentence is kinda goofy. I think there's a way to distill this
>> whole thing down more effectively.
>>
>> CMRs tell the kernel which memory is TDX compatible. The kernel
>> takes CMRs and constructs "TD Memory Regions" (TDMRs). TDMRs
>> let the kernel grant TDX protections to some or all of the CMR
>> areas.
>
> Will do.
>
> But it seems we should still mention "Constructing TDMRs requires information
> about both the TDX module (TDSYSINFO_STRUCT) and the CMRs"? The reason is to justify
> "use static to avoid having to pass them as function arguments when constructing
> TDMRs" below.

In a changelog, no. You do *NOT* use super technical language in
changelogs if not super necessary. Mentioning "TDSYSINFO_STRUCT" here
is useless. The *MOST* you would do for a good changelog is:

The kernel takes CMRs (plus a little more metadata) and
constructs "TD Memory Regions" (TDMRs).

You just need to talk about things at a high level in mostly
non-technical language so that folks know the structure of the code
below. It's not a replacement for the code, the comments, *OR* the TDX
module specification.

I'm also not quite sure that this justifies the static variables anyway.
They could be dynamically allocated and passed around, for instance.

>>> Use static variables for both TDSYSINFO_STRUCT and CMR array to avoid
>>
>> I find it very useful to be precise when referring to code. Your code
>> says 'tdsysinfo_struct', yet this says 'TDSYSINFO_STRUCT'. Why the
>> difference?
>
> Here I actually didn't intend to refer to any code. In the above paragraph
> (that is going to be replaced with yours), I mentioned "TDSYSINFO_STRUCT" to
> explain what "information about the TDX module" actually refers to, since
> TDSYSINFO_STRUCT is used in the spec.
>
> What's your preference?

Kill all mentions to TDSYSINFO_STRUCT whatsoever in the changelog.
Write comprehensible English.

>>> having to pass them as function arguments when constructing the TDMR
>>> array. And they are too big to be put on the stack anyway. Also, KVM
>>> needs to use the TDSYSINFO_STRUCT to create TDX guests.
>>
>> This is also a great place to mention that the tdsysinfo_struct contains
>> a *lot* of gunk which will not be used for a bit or that may never get
>> used.
>
> Perhaps below?
>
> "Note many members in tdsysinfo_struct' are not used by the kernel".
>
> Btw, may I ask why it matters?

Because you're adding a massive structure with all kinds of fields.
Those fields mostly aren't used. That could be from an error in this
series, or because they will be used later or because they will *never*
be used.

>>> + cmr = &cmr_array[0];
>>> + /* There must be at least one valid CMR */
>>> + if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr)))
>>> + goto err;
>>> +
>>> + cmr_num = *actual_cmr_num;
>>> + for (i = 1; i < cmr_num; i++) {
>>> + struct cmr_info *cmr = &cmr_array[i];
>>> + struct cmr_info *prev_cmr = NULL;
>>> +
>>> + /* Skip further empty CMRs */
>>> + if (is_cmr_empty(cmr))
>>> + break;
>>> +
>>> + /*
>>> + * Do sanity check anyway to make sure CMRs:
>>> + * - are 4K aligned
>>> + * - don't overlap
>>> + * - are in address ascending order.
>>> + */
>>> + if (WARN_ON_ONCE(!is_cmr_ok(cmr)))
>>> + goto err;
>>
>> Why does cmr_array[0] get a pass on the empty and sanity checks?
>
> TDX MCHECK verifies CMRs before enabling TDX, so there must be at least one
> valid CMR.
>
> And cmr_array[0] is checked before this loop.

I think you're confusing two separate things. MCHECK ensures that there
is convertible memory. The CMRs that this code looks at are software
(TD module) defined and created structures that the OS and the module share.

This cmr_array[] structure is not created by MCHECK.

Go look at your code. Consider what will happen if cmr_array[0] is
empty or !is_cmr_ok(). Then consider what will happen if cmr_array[1]
has the same happen.

Does that end result really justify having separate code for
cmr_array[0] and cmr_array[>0]?

>>> + prev_cmr = &cmr_array[i - 1];
>>> + if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) >
>>> + cmr->base))
>>> + goto err;
>>> + }
>>> +
>>> + /* Update the actual number of CMRs */
>>> + *actual_cmr_num = i;
>>
>> That comment is not helpful. Yes, this is literally updating the number
>> of CMRs. Literally. That's the "what". But, the "why" is important.
>> Why is it doing this?
>
> When building the list of "TDX-usable" memory regions, the kernel verifies those
> regions against CMRs to see whether they are truly convertible memory.
>
> How about adding a comment like below:
>
> /*
> * When the kernel builds the TDX-usable memory regions, it verifies
> * they are truly convertible memory by checking them against CMRs.
> * Update the actual number of CMRs to skip those empty CMRs.
> */
>
> Also, I think printing CMRs in the dmesg is helpful. Printing empty (zero)
> CMRs would just add meaningless lines to the dmesg.

So it's just about printing them?

Then put a dang switch to the print function that says "print them all"
or not.
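Something like this (a sketch; is_cmr_empty() and the exact format are
whatever the existing helper already uses):

	static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs,
			       const char *name, bool print_empty)
	{
		int i;

		for (i = 0; i < nr_cmrs; i++) {
			struct cmr_info *cmr = &cmr_array[i];

			/* Skip empty entries unless the caller wants them */
			if (is_cmr_empty(cmr) && !print_empty)
				continue;

			pr_info("%s: [0x%llx, 0x%llx)\n", name,
				cmr->base, cmr->base + cmr->size);
		}
	}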

...
>> Also, I saw the loop above check 'cmr_num' CMRs for is_cmr_ok(). Now,
>> it'll print an 'actual_cmr_num=1' number of CMRs as being
>> "kernel-checked". Why? That makes zero sense.
>
> The loop quits when it sees an empty CMR. I think there's no need to check
> further CMRs as they must be empty (TDX MCHECK verifies CMRs).

OK, so you're going to get some more homework here. Please explain to
me how MCHECK and the CMR array that comes out of the TDX module are
related. How does the output from MCHECK get turned into the in-memory
cmr_array[], step by step?

At this point, I fear that you're offering up MCHECK like it's a bag of
magic beans rather than really truly thinking about the cmr_array[] data
structure. How is it generated? How might it be broken? Who might
break it? And what should the kernel do about it?


>>> +
>>> + /*
>>> + * trim_empty_cmrs() updates the actual number of CMRs by
>>> + * dropping all tail empty CMRs.
>>> + */
>>> + return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
>>> +}
>>
>> Why does this both need to respect the "tdx_cmr_num = out.r9" value
>> *and* trim the empty ones? Couldn't it just ignore the "tdx_cmr_num =
>> out.r9" value and just trim the empty ones either way? It's not like
>> there is a billion of them. It would simplify the code for sure.
>
> OK. Since the spec says MAX_CMRS is 32, I can use 32 instead of reading
> it out from R9.

But then you still have the "trimming" code. Why not just trust "r9"
and then axe all the trimming code? Heck, and most of the sanity checks.

This code could be a *lot* smaller.
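Roughly this shape (a sketch, reusing the print_cmrs() variant from above):

	static int tdx_get_sysinfo(void)
	{
		struct tdx_module_output out;
		int ret;

		ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo),
			       TDSYSINFO_STRUCT_SIZE, __pa(tdx_cmr_array),
			       MAX_CMRS, NULL, &out);
		if (ret)
			return ret;

		/* Trust the module: R9 is the number of valid CMRs */
		tdx_cmr_num = out.r9;

		print_cmrs(tdx_cmr_array, tdx_cmr_num, "CMR", false);
		return 0;
	}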

2022-11-23 17:05:43

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On 11/23/22 02:18, Huang, Kai wrote:
>> Again, there are a lot of words in that comment, but I'm not sure why
>> it's here. Despite all the whinging about ACPI, doesn't it boil down to:
>>
>> The TDX module itself establishes its own concept of how many
>> logical CPUs there are in the system when it is loaded.
>>
> This isn't accurate. TDX MCHECK records the total number of logical CPUs when
> the BIOS enables TDX. This happens before the TDX module is loaded. In fact
> the TDX module only gets this information from a secret location.

Kai, this is the point where I lose patience with the conversation
around this series. I'll paste you the line of code where the TDX
module literally "establishes its own concept of how many logical CPUs
there are in the system":

> //NUM_LPS
> tdx_global_data_ptr->num_of_lps = sysinfo_table_ptr->mcheck_fields.tot_num_lps;

Yes, this is derived directly from MCHECK. But, this concept is
separate from MCHECK. We have seen zero actual facts from you or other
folks at Intel that this is anything other than an arbitrary choice made
for the convenience of the TDX module. It _might_ not be this way. I'm
open to hearing those facts and changing my position on this.

But, please bring facts, not random references to "secret locations"
that aren't that secret.

2022-11-23 17:39:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/23/22 08:20, Sean Christopherson wrote:
>>> Why is it done that way?
>>>
>>> Can it be changed to delay TDX initialization until the first TDX guest
>>> needs to run?
>>>
>> Sean suggested.
>>
>> Hi Sean, could you comment?
> Waiting until the first TDX guest is created would result in false advertising,
> as KVM wouldn't know whether or not TDX is actually supported until that first
> VM is created. If we can guarantee that TDH.SYS.INIT will fail if and only if
> there is a kernel bug, then I would be ok deferring the "enabling" until the
> first VM is created.

There's no way we can guarantee _that_. For one, the PAMT* allocations
can always fail. I guess we could ask sysadmins to fire up a guest to
"prime" things, but that seems a little silly. Maybe that would work as
the initial implementation that we merge, but I suspect our users will
demand more determinism, maybe a boot or module parameter.

* Physical Address Metadata Table, a large physically contiguous data
structure, the rough equivalent of 'struct page' for the TDX module

2022-11-23 17:47:15

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Wed, Nov 23, 2022, Dave Hansen wrote:
> On 11/23/22 08:20, Sean Christopherson wrote:
> >>> Why is it done that way?
> >>>
> >>> Can it be changed to delay TDX initialization until the first TDX guest
> >>> needs to run?
> >>>
> >> Sean suggested.
> >>
> >> Hi Sean, could you comment?
> > Waiting until the first TDX guest is created would result in false advertising,
> > as KVM wouldn't know whether or not TDX is actually supported until that first
> > VM is created. If we can guarantee that TDH.SYS.INIT will fail if and only if
> > there is a kernel bug, then I would be ok deferring the "enabling" until the
> > first VM is created.
>
> There's no way we can guarantee _that_. For one, the PAMT* allocations
> can always fail. I guess we could ask sysadmins to fire up a guest to
> "prime" things, but that seems a little silly. Maybe that would work as
> the initial implementation that we merge, but I suspect our users will
> demand more determinism, maybe a boot or module parameter.

Oh, you mean all of TDX initialization? I thought "initialization" here meant
just doing tdx_enable().

Yeah, that's not going to be a viable option. Aside from lacking determinism,
it would be all too easy to end up on a system with fragmented memory that can't
allocate the PAMTs post-boot.

2022-11-23 18:58:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/23/22 09:37, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, Dave Hansen wrote:
>> There's no way we can guarantee _that_. For one, the PAMT* allocations
>> can always fail. I guess we could ask sysadmins to fire up a guest to
>> "prime" things, but that seems a little silly. Maybe that would work as
>> the initial implementation that we merge, but I suspect our users will
>> demand more determinism, maybe a boot or module parameter.
> Oh, you mean all of TDX initialization? I thought "initialization" here meant
> just doing tdx_enable().

Yes, but the first call to tdx_enable() does TDH_SYS_INIT and all the
subsequent work to get the module going.

> Yeah, that's not going to be a viable option. Aside from lacking determinism,
> it would be all too easy to end up on a system with fragmented memory that can't
> allocate the PAMTs post-boot.

For now, the post-boot runtime PAMT allocations are the one and only way
that TDX can be initialized. I pushed for it to be done this way.
Here's why:

Doing tdx_enable() is relatively slow and it eats up a non-zero amount
of physically contiguous RAM for metadata (~1/256th or ~0.4% of RAM).
Systems that support TDX but will never run TDX guests should not pay
that cost.
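For scale: on a 1TB host that works out to roughly 4GB of physically
contiguous allocations.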

That means that we either make folks opt-in at boot-time or we try to
make a best effort at runtime to do the metadata allocations.

From my perspective, the best-effort stuff is absolutely needed. Users
are going to forget the command-line opt in and there's no harm in
_trying_ the big allocations even if they fail.

Second, in reality, the "real" systems that can run TDX guests are
probably not going to sit around fragmenting memory for a month before
they run their first guest. They're going to run one shortly after they
boot when memory isn't fragmented and the best-effort allocation will
work really well.

Third, if anyone *REALLY* cared to make it reliable *and* wanted to sit
around fragmenting memory for a month, they could just start a TDX guest
and kill it to get TDX initialized. This isn't ideal. But, to me, it
beats defining some new, separate ABI (or boot/module option) to do it.

So, let's have those discussions. Long-term, what *is* the most
reliable way to get the TDX module loaded with 100% determinism? What
new ABI or interfaces are needed? Also, is that 100% determinism
required the moment this series is merged? Or, can we work up to it?

I think it can wait until this particular series is farther along.

2022-11-23 19:47:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Wed, Nov 23, 2022, Dave Hansen wrote:
> On 11/23/22 09:37, Sean Christopherson wrote:
> > On Wed, Nov 23, 2022, Dave Hansen wrote:
> >> There's no way we can guarantee _that_. For one, the PAMT* allocations
> >> can always fail. I guess we could ask sysadmins to fire up a guest to
> >> "prime" things, but that seems a little silly. Maybe that would work as
> >> the initial implementation that we merge, but I suspect our users will
> >> demand more determinism, maybe a boot or module parameter.
> > Oh, you mean all of TDX initialization? I thought "initialization" here meant
> > just doing tdx_enable().
>
> Yes, but the first call to tdx_enable() does TDH_SYS_INIT and all the
> subsequent work to get the module going.

Ah, sorry, I misread the diff. Actually applied the patches this time...

> > Yeah, that's not going to be a viable option. Aside from lacking determinism,
> > it would be all too easy to end up on a system with fragmented memory that can't
> > allocate the PAMTs post-boot.
>
> For now, the post-boot runtime PAMT allocations are the one and only way
> that TDX can be initialized. I pushed for it to be done this way.
> Here's why:
>
> Doing tdx_enable() is relatively slow and it eats up a non-zero amount
> of physically contiguous RAM for metadata (~1/256th or ~0.4% of RAM).
> Systems that support TDX but will never run TDX guests should not pay
> that cost.
>
> That means that we either make folks opt-in at boot-time or we try to
> make a best effort at runtime to do the metadata allocations.
>
> From my perspective, the best-effort stuff is absolutely needed. Users
> are going to forget the command-line opt in

Eh, any sufficiently robust deployment should be able to ensure that its kernels
boot with "required" command-line options.

> and there's no harm in _trying_ the big allocations even if they fail.

No, but there is "harm" if a host can't provide the functionality the control
plane thinks it supports.

> Second, in reality, the "real" systems that can run TDX guests are
> probably not going to sit around fragmenting memory for a month before
> they run their first guest. They're going to run one shortly after they
> boot when memory isn't fragmented and the best-effort allocation will
> work really well.

I don't think this will hold true. Long term, we (Google) want to have a common
pool for non-TDX and TDX VMs. Forcing TDX VMs to use a dedicated pool of hosts
makes it much more difficult to react to demand, e.g. if we carve out N hosts for
TDX, but only use 10% of those hosts, then that's a lot of wasted capacity/money.
IIRC, people have discussed dynamically reconfiguring hosts so that systems could
be moved in/out of a dedicated pool, but that's still suboptimal, e.g. would
require emptying a host to reboot+reconfigure.

If/when we end up with a common pool, then it's very likely that we could have a
TDX-capable host go weeks/months before launching its first TDX VM.

> Third, if anyone *REALLY* cared to make it reliable *and* wanted to sit
> around fragmenting memory for a month, they could just start a TDX guest
> and kill it to get TDX initialized. This isn't ideal. But, to me, it
> beats defining some new, separate ABI (or boot/module option) to do it.

That's a hack. I have no objection to waiting until KVM is _loaded_ to initialize
TDX, but waiting until KVM_CREATE_VM is not acceptable. Use cases aside, KVM's ABI
would be a mess, e.g. KVM couldn't use KVM_CHECK_EXTENSION or any other /dev/kvm
ioctl() to enumerate TDX support.

> So, let's have those discussions. Long-term, what *is* the most
> reliable way to get the TDX module loaded with 100% determinism? What
> new ABI or interfaces are needed? Also, is that 100% determinism
> required the moment this series is merged? Or, can we work up to it?

I don't think we (Google again) strictly need 100% determinism with respect to
enabling TDX, what's most important is that if a host says it's TDX-capable, then
it really is TDX-capable. I'm sure we'll treat "failure to load" as an error,
but it doesn't necessarily need to be a fatal error as the host can still run in
a degraded state (no idea if we'll actually do that though).

> I think it can wait until this particular series is farther along.

For an opt-in kernel param, agreed. That can be added later, e.g. if it turns
out that the PAMT allocation failure rate is too high.

2022-11-23 22:15:01

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 04/20] x86/virt/tdx: Add skeleton to initialize TDX on demand

On Wed, 2022-11-23 at 08:58 -0800, Dave Hansen wrote:
> On 11/23/22 02:18, Huang, Kai wrote:
> > > Again, there are a lot of words in that comment, but I'm not sure why
> > > it's here. Despite all the whinging about ACPI, doesn't it boil down to:
> > >
> > > The TDX module itself establishes its own concept of how many
> > > logical CPUs there are in the system when it is loaded.
> > >
> > This isn't accurate. TDX MCHECK records the total number of logical CPUs when
> > the BIOS enables TDX. This happens before the TDX module is loaded. In fact
> > the TDX module only gets this information from a secret location.
>
> Kai, this is the point where I lose patience with the conversation
> around this series. I'll paste you the line of code where the TDX
> module literally "establishes its own concept of how many logical CPUs
> there are in the system":
>
> > //NUM_LPS
> > tdx_global_data_ptr->num_of_lps = sysinfo_table_ptr->mcheck_fields.tot_num_lps;
>
> Yes, this is derived directly from MCHECK. But, this concept is
> separate from MCHECK. We have seen zero actual facts from you or other
> folks at Intel that this is anything other than an arbitrary choice made
> for the convenience of the TDX module. It _might_ not be this way. I'm
> open to hearing those facts and changing my position on this.
>
> But, please bring facts, not random references to "secret locations"
> that aren't that secret.

Agreed. Thanks for providing the comment; I will use it.

2022-11-23 22:41:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On 11/20/22 16:26, Kai Huang wrote:
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums. Not all memory
> satisfies these requirements.
>
> As a result, the TDX introduced the concept of a "Convertible Memory

s/the TDX introduced/TDX introduces/

> Region" (CMR). During boot, the firmware builds a list of all of the
> memory ranges which can provide the TDX security guarantees. The list
> of these ranges is available to the kernel by querying the TDX module.
>
> The TDX architecture needs additional metadata to record things like
> which TD guest "owns" a given page of memory. This metadata essentially
> serves as the 'struct page' for the TDX module. The space for this
> metadata is not reserved by the hardware up front and must be allocated
> by the kernel and given to the TDX module.
>
> Since this metadata consumes space, the VMM can choose whether or not to
> allocate it for a given area of convertible memory. If it chooses not
> to, the memory cannot receive TDX protections and can not be used by TDX
> guests as private memory.
>
> For every memory region that the VMM wants to use as TDX memory, it sets
> up a "TD Memory Region" (TDMR). Each TDMR represents a physically
> contiguous convertible range and must also have its own physically
> contiguous metadata table, referred to as a Physical Address Metadata
> Table (PAMT), to track status for each page in the TDMR range.
>
> Unlike a CMR, each TDMR requires 1G granularity and alignment. To
> support physical RAM areas that don't meet those strict requirements,
> each TDMR permits a number of internal "reserved areas" which can be
> placed over memory holes. If PAMT metadata is placed within a TDMR it
> must be covered by one of these reserved areas.
>
> Let's summarize the concepts:
>
> CMR - Firmware-enumerated physical ranges that support TDX. CMRs are
> 4K aligned.
> TDMR - Physical address range which is chosen by the kernel to support
> TDX. 1G granularity and alignment required. Each TDMR has
> reserved areas where TDX memory holes and overlapping PAMTs can
> be put into.

s/put into/represented/

> PAMT - Physically contiguous TDX metadata. One table for each page size
> per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G
> PAMT.
>
> As one step of initializing the TDX module, the kernel configures
> TDX-usable memory regions by passing an array of TDMRs to the TDX module.
>
> Constructing the array of TDMRs consists of the following steps:
>
> 1) Create TDMRs to cover all memory regions that the TDX module can use;

Slight tweak:

1) Create TDMRs to cover all memory regions that the TDX module will use
for TD memory

The TDX module "uses" more memory than strictly the TMDR's.

> 2) Allocate and set up PAMT for each TDMR;
> 3) Set up reserved areas for each TDMR.

s/Set up/Designate/

> Add a placeholder to construct TDMRs to do the above steps after all
> TDX memory regions are verified to be truly convertible. Always free
> TDMRs at the end of the initialization (no matter successful or not)
> as TDMRs are only used during the initialization.

The changelog here actually looks really good to me so far.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 32af86e31c47..26048c6b0170 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -445,6 +445,63 @@ static int build_tdx_memory(void)
> return ret;
> }
>
> +/* Calculate the actual TDMR_INFO size */
> +static inline int cal_tdmr_size(void)

I think we can spare the bytes to add "culate" in the function name so
we don't think these are California TDMRs.

> +{
> + int tdmr_sz;
> +
> + /*
> + * The actual size of TDMR_INFO depends on the maximum number
> + * of reserved areas.
> + *
> + * Note: for TDX1.0 the max_reserved_per_tdmr is 16, and
> + * TDMR_INFO size is aligned up to a 512-byte boundary. Even
> + * if it is extended in the future, it would be insane if TDMR_INFO
> + * becomes larger than 4K. The tdmr_sz here should never
> + * overflow.
> + */
> + tdmr_sz = sizeof(struct tdmr_info);
> + tdmr_sz += sizeof(struct tdmr_reserved_area) *
> + tdx_sysinfo.max_reserved_per_tdmr;

First, I think 'tdx_sysinfo' should probably be a local variable in
init_tdx_module() and have its address passed in here. Having global
variables always makes it more opaque about who is initializing it.

Second, if this code is making assumptions about
'max_reserved_per_tdmr', then let's actually add assertions or sanity
checks. For instance:

if (tdx_sysinfo.max_reserved_per_tdmr > MAX_TDMRS)
return -1;

or even:

if (tdmr_sz > PAGE_SIZE)
return -1;

It does almost no good to just assert what the limits are in a comment.

> + /*
> + * TDX requires each TDMR_INFO to be 512-byte aligned. Always
> + * round up TDMR_INFO size to the 512-byte boundary.
> + */

<sigh> More silly comments.

The place to document this is TDMR_INFO_ALIGNMENT. If anyone wants to
know what the alignment is, exactly, they can look at the definition.
They don't need to be told *TWICE* what TDMR_INFO_ALIGNMENT #defines to
in one comment.
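i.e., say it once, at the definition, something like:

	/* TDX requires TDMR_INFO to be 512-byte aligned */
	#define TDMR_INFO_ALIGNMENT	512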

> + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> +}
> +
> +static struct tdmr_info *alloc_tdmr_array(int *array_sz)
> +{
> + /*
> + * TDX requires each TDMR_INFO to be 512-byte aligned.
> + * Use alloc_pages_exact() to allocate all TDMRs at once.
> + * Each TDMR_INFO will still be 512-byte aligned since
> + * cal_tdmr_size() always returns 512-byte aligned size.
> + */

OK, I think you're just trolling me now. Two *MORE* mentions of the
512-byte alignment?

> + *array_sz = cal_tdmr_size() * tdx_sysinfo.max_tdmrs;
> +
> + /*
> + * Zero the buffer so 'struct tdmr_info::size' can be
> + * used to determine whether a TDMR is valid.
> + *
> + * Note: for TDX1.0 the max_tdmrs is 64 and TDMR_INFO size
> + * is 512 bytes. Even if they are extended in the future, it
> + * would be insane if the total size exceeded 4MB.
> + */
> + return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
> +}

This looks massively over complicated.

Get rid of this function entirely. Then create:

static int tdmr_array_size(void)
{
return tdmr_size_single() * tdx_sysinfo.max_tdmrs;
}

The *caller* can do:

tdmr_array = alloc_pages_exact(tdmr_array_size(),
GFP_KERNEL | __GFP_ZERO);
if (!tdmr_array) {
...

Then the error path is:

free_pages_exact(tdmr_array, tdmr_array_size());

Then, there are no size pointers going back and forth. Easy peasy. I'm
OK with a little arithmetic being repeated.

> +/*
> + * Construct an array of TDMRs to cover all TDX memory ranges.
> + * The actual number of TDMRs is kept to @tdmr_num.
> + */
> +static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> +{
> + /* Return -EINVAL until constructing TDMRs is done */
> + return -EINVAL;
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -454,6 +511,9 @@ static int build_tdx_memory(void)
> */
> static int init_tdx_module(void)
> {
> + struct tdmr_info *tdmr_array;
> + int tdmr_array_sz;
> + int tdmr_num;

I tend to write these like:

"tdmr_num" is the number of *a* TDMR.

"nr_tdmrs" is the number of TDMRs.

> int ret;
>
> /*
> @@ -506,11 +566,34 @@ static int init_tdx_module(void)
> ret = build_tdx_memory();
> if (ret)
> goto out;
> +
> + /* Prepare enough space to construct TDMRs */
> + tdmr_array = alloc_tdmr_array(&tdmr_array_sz);
> + if (!tdmr_array) {
> + ret = -ENOMEM;
> + goto out_free_tdx_mem;
> + }
> +
> + /* Construct TDMRs to cover all TDX memory ranges */
> + ret = construct_tdmrs(tdmr_array, &tdmr_num);
> + if (ret)
> + goto out_free_tdmrs;
> +
> /*
> * Return -EINVAL until all steps of TDX module initialization
> * process are done.
> */
> ret = -EINVAL;
> +out_free_tdmrs:
> + /*
> + * The array of TDMRs is freed no matter the initialization is
> + * successful or not. They are not needed anymore after the
> + * module initialization.
> + */
> + free_pages_exact(tdmr_array, tdmr_array_sz);
> +out_free_tdx_mem:
> + if (ret)
> + free_tdx_memory();
> out:
> /*
> * Memory hotplug checks the hot-added memory region against the
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 8e273756098c..a737f2b51474 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -80,6 +80,29 @@ struct tdsysinfo_struct {
> };
> } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
>
> +struct tdmr_reserved_area {
> + u64 offset;
> + u64 size;
> +} __packed;
> +
> +#define TDMR_INFO_ALIGNMENT 512
> +
> +struct tdmr_info {
> + u64 base;
> + u64 size;
> + u64 pamt_1g_base;
> + u64 pamt_1g_size;
> + u64 pamt_2m_base;
> + u64 pamt_2m_size;
> + u64 pamt_4k_base;
> + u64 pamt_4k_size;
> + /*
> + * Actual number of reserved areas depends on
> + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
> + */
> + struct tdmr_reserved_area reserved_areas[0];
> +} __packed __aligned(TDMR_INFO_ALIGNMENT);
> +
> /*
> * Do not put any hardware-defined TDX structure representations below
> * this comment!

2022-11-23 22:49:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 12/20] x86/virt/tdx: Create TDMRs to cover all TDX memory regions

On 11/20/22 16:26, Kai Huang wrote:
> The kernel configures TDX-usable memory regions by passing an array of
> "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the
> information of the base/size of a memory region, the base/size of the
> associated Physical Address Metadata Table (PAMT) and a list of reserved
> areas in the region.
>
> Create a number of TDMRs to cover all TDX memory regions. To keep it
> simple, always try to create one TDMR for each memory region. As the
> first step only set up the base/size for each TDMR.
>
> Each TDMR must be 1G aligned and the size must be in 1G granularity.
> This implies that one TDMR could cover multiple memory regions. If a
> memory region spans the 1GB boundary and the former part is already
> covered by the previous TDMR, just create a new TDMR for the remaining
> part.
>
> TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs
> are consumed but there are more memory regions to cover.

Good changelog. This patch is doing *one* thing.

> arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 103 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 26048c6b0170..57b448de59a0 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -445,6 +445,24 @@ static int build_tdx_memory(void)
> return ret;
> }
>
> > +/* TDMR must be 1GB aligned */
> +#define TDMR_ALIGNMENT BIT_ULL(30)
> +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
> +
> +/* Align up and down the address to TDMR boundary */
> +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT)
> +
> +static inline u64 tdmr_start(struct tdmr_info *tdmr)
> +{
> + return tdmr->base;
> +}

I'm always skeptical that it's a good idea to take this in code:

tdmr->base

and make it this:

tdmr_start(tdmr)

because the helper is *LESS* compact than the open-coded form! I hope
I'm proven wrong.

> +static inline u64 tdmr_end(struct tdmr_info *tdmr)
> +{
> + return tdmr->base + tdmr->size;
> +}
> +
> /* Calculate the actual TDMR_INFO size */
> static inline int cal_tdmr_size(void)
> {
> @@ -492,14 +510,98 @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz)
> return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
> }
>
> +static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array,
> + int idx)
> +{
> + return (struct tdmr_info *)((unsigned long)tdmr_array +
> + cal_tdmr_size() * idx);
> +}

FWIW, I think it's probably a bad idea to have 'struct tdmr_info *'
types floating around since:

tdmr_info_array[0]

works, but:

tdmr_info_array[1]

will blow up in your face. It would almost make sense to have

struct tdmr_info_list {
struct tdmr_info *first_tdmr;
}

and then pass around pointers to the 'struct tdmr_info_list'. Maybe
that's overkill, but it is kinda silly to call something an array if []
doesn't work on it.
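To make it concrete, a sketch (reusing your cal_tdmr_size()):

	struct tdmr_info_list {
		struct tdmr_info *first_tdmr;
	};

	static struct tdmr_info *tdmr_entry(struct tdmr_info_list *list,
					    int idx)
	{
		/* Entries are cal_tdmr_size() apart, not sizeof() apart */
		return (struct tdmr_info *)((unsigned long)list->first_tdmr +
					    cal_tdmr_size() * idx);
	}

Then everything goes through tdmr_entry() and nobody is tempted to use []
on it.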

> +/*
> + * Create TDMRs to cover all TDX memory regions. The actual number
> + * of TDMRs is set to @tdmr_num.
> + */
> +static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> +{
> + struct tdx_memblock *tmb;
> + int tdmr_idx = 0;
> +
> + /*
> + * Loop over TDX memory regions and create TDMRs to cover them.
> + * To keep it simple, always try to use one TDMR to cover
> + * one memory region.
> + */

This seems like it might tend to under-utilize TDMRs. I'm sure this is
done for simplicity, but is it OK? Why is it OK? How are you sure this
won't bite us later?

> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + struct tdmr_info *tdmr;
> + u64 start, end;
> +
> + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
> + start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT);
> + end = TDMR_ALIGN_UP(tmb->end_pfn << PAGE_SHIFT);

Nit: a little vertical alignment can make this much more readable:

start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT);
end = TDMR_ALIGN_UP (tmb->end_pfn << PAGE_SHIFT);

> +
> + /*
> + * If the current TDMR's size hasn't been initialized,
> + * it is a new TDMR to cover the new memory region.
> + * Otherwise, the current TDMR has already covered the
> + * previous memory region. In the latter case, check
> + * whether the current memory region has been fully or
> + * partially covered by the current TDMR, since TDMR is
> + * 1G aligned.
> + */

Again, we have a comment over an if() block that describes what the
individual steps in the block do. *Plus* each individual step is
*ALREADY* commented. What purpose does this comment serve?

> + if (tdmr->size) {
> + /*
> + * Loop to the next memory region if the current
> + * block has already been fully covered by the
> + * current TDMR.
> + */
> + if (end <= tdmr_end(tdmr))
> + continue;
> +
> + /*
> + * If part of the current memory region has
> + * already been covered by the current TDMR,
> + * skip the already covered part.
> + */
> + if (start < tdmr_end(tdmr))
> + start = tdmr_end(tdmr);
> +
> + /*
> + * Create a new TDMR to cover the current memory
> + * region, or the remaining part of it.
> + */
> + tdmr_idx++;
> + if (tdmr_idx >= tdx_sysinfo.max_tdmrs)
> + return -E2BIG;
> +
> + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
> + }
> +
> + tdmr->base = start;
> + tdmr->size = end - start;
> + }
> +
> + /* @tdmr_idx is always the index of last valid TDMR. */
> + *tdmr_num = tdmr_idx + 1;
> +
> + return 0;
> +}

Seems like a positive return value could be the number of populated
TDMRs. That would get rid of the int* argument.
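i.e., a sketch of the calling convention:

	nr_tdmrs = create_tdmrs(tdmr_array);
	if (nr_tdmrs < 0)
		goto err;

with create_tdmrs() returning 'tdmr_idx + 1' on success instead of writing
through the pointer.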

> /*
> * Construct an array of TDMRs to cover all TDX memory ranges.
> * The actual number of TDMRs is kept to @tdmr_num.
> */

OK, so something else allocated the 'tdmr_array' and it's being passed
in here to fill it out. "construct" and "create" are both near synonyms
for "allocate", which isn't even being done here.

We want something here that will make it clear that this function is
taking an already populated list of TDMRs and filling it out.
"fill_tmdrs()" seems like it might be a better choice.

This is also a place where better words can help. If the function is
called "construct", then there's *ZERO* value in using the same word in
the comment. Using a word that is a close synonym but that can contrast
it with something different would be really nice, say, "populate".

This is also a place where the calling convention can be used to add
clarity. If you implicitly use a global variable, you have to explain
that. But, if you pass *in* a variable, it's a lot more clear.

Take this, for instance:

/*
* Take the memory referenced in @tdx_memlist and populate the
* preallocated @tmdr_array, following all the special alignment
* and size rules for TDMR.
*/
static int fill_out_tdmrs(struct list_head *tdx_memlist,
struct tdmr_info *tdmr_array)
{
...

That's 100% crystal clear about what's going on. You know what the
inputs are and the outputs. You also know why this is even necessary.
It's implied a bit, but it's because TDMRs have special rules about
size/alignment and tdx_memlists do not.

> static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> {
> + int ret;
> +
> + ret = create_tdmrs(tdmr_array, tdmr_num);
> + if (ret)
> + goto err;
> +
> /* Return -EINVAL until constructing TDMRs is done */
> - return -EINVAL;
> + ret = -EINVAL;
> +err:
> + return ret;
> }
>
> /*

2022-11-23 23:08:16

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On Wed, 2022-11-23 at 08:44 -0800, Dave Hansen wrote:
> > On 11/23/22 03:40, Huang, Kai wrote:
> > > > On Tue, 2022-11-22 at 15:39 -0800, Dave Hansen wrote:
> > > > > > That last sentence is kinda goofy. I think there's a way to distill this
> > > > > > whole thing down more effectively.
> > > > > >
> > > > > > CMRs tell the kernel which memory is TDX compatible. The kernel
> > > > > > takes CMRs and constructs "TD Memory Regions" (TDMRs). TDMRs
> > > > > > let the kernel grant TDX protections to some or all of the CMR
> > > > > > areas.
> > > >
> > > > Will do.
> > > >
> > > > But it seems we should still mention "Constructing TDMRs requires information
> > > > about both the TDX module (TDSYSINFO_STRUCT) and the CMRs"? The reason is to justify
> > > > "use static to avoid having to pass them as function arguments when constructing
> > > > TDMRs" below.
> >
> > In a changelog, no. You do *NOT* use super technical language in
> > changelogs if not super necessary. Mentioning "TDSYSINFO_STRUCT" here
> > is useless. The *MOST* you would do for a good changelog is:
> >
> > The kernel takes CMRs (plus a little more metadata) and
> > constructs "TD Memory Regions" (TDMRs).
> >
> > You just need to talk about things at a high level in mostly
> > non-technical language so that folks know the structure of the code
> > below. It's not a replacement for the code, the comments, *OR* the TDX
> > module specification.
> >
> > I'm also not quite sure that this justifies the static variables anyway.
> > They could be dynamically allocated and passed around, for instance.

I see. Thanks for explaining.

> >
> > > > > > > > Use static variables for both TDSYSINFO_STRUCT and CMR array to avoid
> > > > > >
> > > > > > I find it very useful to be precise when referring to code. Your code
> > > > > > says 'tdsysinfo_struct', yet this says 'TDSYSINFO_STRUCT'. Why the
> > > > > > difference?
> > > >
> > > > Here I actually didn't intend to refer to any code. In the above paragraph
> > > > (that is going to be replaced with yours), I mentioned "TDSYSINFO_STRUCT" to
> > > > explain what "information about the TDX module" actually refers to, since
> > > > TDSYSINFO_STRUCT is used in the spec.
> > > >
> > > > What's your preference?
> >
> > Kill all mentions to TDSYSINFO_STRUCT whatsoever in the changelog.
> > Write comprehensible English.

OK.

> >
> > > > > > > > having to pass them as function arguments when constructing the TDMR
> > > > > > > > array. And they are too big to be put to the stack anyway. Also, KVM
> > > > > > > > needs to use the TDSYSINFO_STRUCT to create TDX guests.
> > > > > >
> > > > > > This is also a great place to mention that the tdsysinfo_struct contains
> > > > > > a *lot* of gunk which will not be used for a bit or that may never get
> > > > > > used.
> > > >
> > > > Perhaps below?
> > > >
> > > > "Note many members in tdsysinfo_struct' are not used by the kernel".
> > > >
> > > > Btw, may I ask why it matters?
> >
> > Because you're adding a massive structure with all kinds of fields.
> > Those fields mostly aren't used. That could be from an error in this
> > series, or because they will be used later or because they will *never*
> > be used.

OK.

> >
> > > > > > > > + cmr = &cmr_array[0];
> > > > > > > > + /* There must be at least one valid CMR */
> > > > > > > > + if (WARN_ON_ONCE(is_cmr_empty(cmr) || !is_cmr_ok(cmr)))
> > > > > > > > + goto err;
> > > > > > > > +
> > > > > > > > + cmr_num = *actual_cmr_num;
> > > > > > > > + for (i = 1; i < cmr_num; i++) {
> > > > > > > > + struct cmr_info *cmr = &cmr_array[i];
> > > > > > > > + struct cmr_info *prev_cmr = NULL;
> > > > > > > > +
> > > > > > > > + /* Skip further empty CMRs */
> > > > > > > > + if (is_cmr_empty(cmr))
> > > > > > > > + break;
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Do sanity check anyway to make sure CMRs:
> > > > > > > > + * - are 4K aligned
> > > > > > > > + * - don't overlap
> > > > > > > > + * - are in address ascending order.
> > > > > > > > + */
> > > > > > > > + if (WARN_ON_ONCE(!is_cmr_ok(cmr)))
> > > > > > > > + goto err;
> > > > > >
> > > > > > Why does cmr_array[0] get a pass on the empty and sanity checks?
> > > >
> > > > TDX MCHECK verifies CMRs before enabling TDX, so there must be at least one
> > > > valid CMR.
> > > >
> > > > And cmr_array[0] is checked before this loop.
> >
> > I think you're confusing two separate things. MCHECK ensures that there
> > is convertible memory. The CMRs that this code looks at are software
> > (TD module) defined and created structures that the OS and the module share.

Not sure whether I completely got your point, but the CMRs are generated by the
BIOS, then verified and stored by the MCHECK. Thus the CMR structure is also
meaningful to the BIOS and the MCHECK; it is not defined and created by the TDX
module.

There are a couple of places in the TDX module spec which say this. One example
is "Table 3.1: Typical Intel TDX Module Platform-Scope Initialization Sequence"
and "13.1.1. Initialization and Configuration Flow". They both mention:

"BIOS configures Convertible Memory Regions (CMRs); MCHECK checks them and
securely stores the information."

Also, "20.8.3 CMR_INFO":

"CMR_INFO is designed to provide information about a Convertible Memory Range
(CMR), as configured by BIOS and checked and stored securely by MCHECK."

> >
> > This cmr_array[] structure is not created by MCHECK.

Right.

But TDH.SYS.INFO only "Retrieve Intel TDX module information and convertible
memory (CMR) information." by writing CMRs to the buffer provided by the kernel
(cmr_array[]).

So my understanding is the entries in the cmr_array[] are just the same CMRs
that are verified by the MCHECK.

> >
> > Go look at your code. Consider what will happen if cmr_array[0] is
> > empty or !is_cmr_ok(). Then consider what will happen if cmr_array[1]
> > has the same happen.
> >
> > Does that end result really justify having separate code for
> > cmr_array[0] and cmr_array[>0]?

One slight difference is cmr_array[0] must be valid, but cmr_array[>0] can be
empty. And for cmr_array[>0] we also have an additional check against the
previous one.

> >
> > > > > > > > + prev_cmr = &cmr_array[i - 1];
> > > > > > > > + if (WARN_ON_ONCE((prev_cmr->base + prev_cmr->size) >
> > > > > > > > + cmr->base))
> > > > > > > > + goto err;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + /* Update the actual number of CMRs */
> > > > > > > > + *actual_cmr_num = i;
> > > > > >
> > > > > > That comment is not helpful. Yes, this is literally updating the number
> > > > > > of CMRs. Literally. That's the "what". But, the "why" is important.
> > > > > > Why is it doing this?
> > > >
> > > > When building the list of "TDX-usable" memory regions, the kernel verifies those
> > > > regions against CMRs to see whether they are truly convertible memory.
> > > >
> > > > How about adding a comment like below:
> > > >
> > > > /*
> > > > * When the kernel builds the TDX-usable memory regions, it verifies
> > > > * they are truly convertible memory by checking them against CMRs.
> > > > * Update the actual number of CMRs to skip those empty CMRs.
> > > > */
> > > >
> > > > Also, I think printing CMRs in the dmesg is helpful. Printing empty (zero)
> > > > CMRs would just add meaningless lines to the dmesg.
> >
> > So it's just about printing them?
> >
> > Then put a dang switch to the print function that says "print them all"
> > or not.

Yes, can do. Currently "print them all" is only done when the CMR sanity check
fails. We can unconditionally "print valid CMRs" if we don't need that check.

> >
> > ...
> > > > > > Also, I saw the loop above check 'cmr_num' CMRs for is_cmr_ok(). Now,
> > > > > > it'll print an 'actual_cmr_num=1' number of CMRs as being
> > > > > > "kernel-checked". Why? That makes zero sense.
> > > >
> > > > The loop quits when it sees an empty CMR. I think there's no need to check
> > > > further CMRs as they must be empty (TDX MCHECK verifies CMRs).
> >
> > OK, so you're going to get some more homework here. Please explain to
> > me how MCHECK and the CMR array that comes out of the TDX module are
> > related. How does the output from MCHECK get turned into the in-memory
> > cmr_array[], step by step?
> >

(Please also see my above reply)

1. BIOS generates the CMRs and passes them to the MCHECK
2. MCHECK verifies CMRs and stores the "CMR table in a pre-defined location in
SEAMRR’s SEAMCFG region so it can be read later and trusted by the Intel TDX
module" (13.1.4.1 Intel TDX ISA Background: Convertible Memory Ranges (CMRs)).
3. TDH.SYS.INFO copies the CMRs to the buffer provided by the kernel
(cmr_array[]).

> > At this point, I fear that you're offering up MCHECK like it's a bag of
> > magic beans rather than really truly thinking about the cmr_array[] data
> > structure. How is it generated? How might it be broken? Who might
> > break it? And what should the kernel do about it?

Only a kernel bug can break the cmr_array[], I think. As described in "13.1.4.1
Intel TDX ISA Background: Convertible Memory Ranges (CMRs)", MCHECK should have
guaranteed that:
- there must be at least one CMR
- CMRs are page aligned
- CMRs don't overlap and are in ascending address order

The only legal wrinkle is that there might be empty CMRs at the tail of the
cmr_array[], following one or more valid CMRs.
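If it helps, the whole check could be folded into one loop that treats
cmr_array[0] like any other entry. Just a sketch of the idea (the function
name is made up):

	/* Verify the MCHECK-guaranteed invariants in one pass */
	static bool cmr_array_sane(struct cmr_info *cmr_array, int nr_cmrs)
	{
		u64 prev_end = 0;
		int i;

		for (i = 0; i < nr_cmrs; i++) {
			struct cmr_info *cmr = &cmr_array[i];

			/* Empty CMRs are only legal at the tail */
			if (is_cmr_empty(cmr))
				break;

			/* 4K aligned, ascending order, no overlap */
			if (!PAGE_ALIGNED(cmr->base) ||
			    !PAGE_ALIGNED(cmr->size) ||
			    cmr->base < prev_end)
				return false;

			prev_end = cmr->base + cmr->size;
		}

		/* There must be at least one valid CMR */
		return i > 0;
	}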

> >
> >
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * trim_empty_cmrs() updates the actual number of CMRs by
> > > > > > > > + * dropping all tail empty CMRs.
> > > > > > > > + */
> > > > > > > > + return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> > > > > > > > +}
> > > > > >
> > > > > > Why does this both need to respect the "tdx_cmr_num = out.r9" value
> > > > > > *and* trim the empty ones? Couldn't it just ignore the "tdx_cmr_num =
> > > > > > out.r9" value and just trim the empty ones either way? It's not like
> > > > > > there is a billion of them. It would simplify the code for sure.
> > > >
> > > > OK. Since the spec says MAX_CMRS is 32, I can use 32 instead of reading
> > > > it out from R9.
> >
> > But then you still have the "trimming" code. Why not just trust "r9"
> > and then axe all the trimming code? Heck, and most of the sanity checks.
> >
> > This code could be a *lot* smaller.

As I said, the only problem is that there might be empty CMRs at the tail of
the cmr_array[], following one or more valid CMRs.

But we could also do nothing here and just skip empty CMRs when comparing the
memory regions against them (in the next patch).

Or, we don't even need to explicitly check memory regions against CMRs. If the
memory regions we provide in the TDMRs don't fall into CMRs, then
TDH.SYS.CONFIG will fail. We can just depend on that SEAMCALL to do the check.


2022-11-23 23:08:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 11/20/22 16:26, Kai Huang wrote:
> The TDX module uses additional metadata to record things like which
> guest "owns" a given page of memory. This metadata, referred as
> Physical Address Metadata Table (PAMT), essentially serves as the
> 'struct page' for the TDX module. PAMTs are not reserved by hardware
> up front. They must be allocated by the kernel and then given to the
> TDX module.

... during module initialization.

> TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must
> be a physically contiguous area from a Convertible Memory Region (CMR).
> However, the PAMTs which track pages in one TDMR do not need to reside
> within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with
> any TDMR, the overlapping part must be reported as a reserved area in
> that particular TDMR.
>
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).
> The downside is alloc_contig_pages() may fail at runtime. One (bad)
> mitigation is to launch a TD guest early during system boot to get those
> PAMTs allocated at an early time, but the only way to fix it is to add a boot
> option to allocate or reserve PAMTs during kernel boot.

FWIW, we all agree that this is a bad permanent way to leave things.
You can call me out here as proposing that this wart be left in place
while this series is merged and is a detail we can work on afterward
with new module params, boot options, Kconfig or whatever.

> TDX only supports a limited number of reserved areas per TDMR to cover
> both PAMTs and memory holes within the given TDMR. If many PAMTs are
> allocated within a single TDMR, the reserved areas may not be sufficient
> to cover all of them.
>
> Adopt the following policies when allocating PAMTs for a given TDMR:
>
> - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> the total number of reserved areas consumed for PAMTs.
> - Try to first allocate PAMT from the local node of the TDMR for better
> NUMA locality.
>
> Also dump out how many pages are allocated for PAMTs when the TDX module
> is initialized successfully.

... this helps answer the eternal "where did all my memory go?" questions.
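
(Back-of-envelope, assuming a 16-byte pamt_entry_size purely for
illustration: the 4K-level PAMT needs one entry per 4K page, i.e. 16 bytes
per 4096 bytes covered -- the 1/256th mentioned above, or ~4MB per GB of
TDMR. Definitely big enough to go looking for in dmesg.)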

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b36129183035..b86a333b860f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1960,6 +1960,7 @@ config INTEL_TDX_HOST
> depends on KVM_INTEL
> depends on X86_X2APIC
> select ARCH_KEEP_MEMBLOCK
> + depends on CONTIG_ALLOC
> help
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 57b448de59a0..9d76e70de46e 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -586,6 +586,187 @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> return 0;
> }
>
> +/*
> + * Calculate PAMT size given a TDMR and a page size. The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
> +{
> + unsigned long pamt_sz, nr_pamt_entries;
> +
> + switch (pgsz) {
> + case TDX_PS_4K:
> + nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> + break;
> + case TDX_PS_2M:
> + nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> + break;
> + case TDX_PS_1G:
> + nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + return 0;
> + }
> +
> + pamt_sz = nr_pamt_entries * tdx_sysinfo.pamt_entry_size;
> + /* TDX requires PAMT size must be 4K aligned */
> + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> + return pamt_sz;
> +}
> +
> +/*
> + * Pick a NUMA node on which to allocate this TDMR's metadata.
> + *
> + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> + * not be. If the TDMR covers more than one node, just use the _first_
> + * one. This can lead to small areas of off-node metadata for some
> + * memory.
> + */
> +static int tdmr_get_nid(struct tdmr_info *tdmr)
> +{
> + struct tdx_memblock *tmb;
> +
> + /* Find the first memory region covered by the TDMR */
> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> + return tmb->nid;
> + }

Aha, the first use of tmb->nid! I wondered why that was there.

> +
> + /*
> + * Fall back to allocating the TDMR's metadata from node 0 when
> + * no TDX memory block can be found. This should never happen
> + * since TDMRs originate from TDX memory blocks.
> + */
> + WARN_ON_ONCE(1);

That's probably better as a pr_warn() or something. A backtrace and all
that jazz seems a bit overly dramatic for this.
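
Something like this would do (illustrative only):

	pr_warn_once("TDMR [0x%llx, 0x%llx): no TDX memory block found, using node 0 for metadata\n",
		     tdmr_start(tdmr), tdmr_end(tdmr));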

> + return 0;
> +}
The rest of this actually looks fine. It's nearing ack'able state.

2022-11-24 00:09:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 15/20] x86/virt/tdx: Reserve TDX module global KeyID

On 11/20/22 16:26, Kai Huang wrote:
> @@ -1053,6 +1056,12 @@ static int init_tdx_module(void)
> if (ret)
> goto out_free_tdmrs;
>
> + /*
> + * Reserve the first TDX KeyID as global KeyID to protect
> + * TDX module metadata.
> + */
> + tdx_global_keyid = tdx_keyid_start;

This doesn't "reserve" squat.

You could argue that it "picks", "chooses", or "designates" the
'tdx_global_keyid', but where is the "reservation"?

2022-11-24 00:16:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

> +static int tdmr_set_up_memory_hole_rsvd_areas(struct tdmr_info *tdmr,
> + int *rsvd_idx)
> +{

This needs a comment.

This is another case where it's confusing to be passing around 'struct
tdmr_info *'. Is this *A* TDMR or an array?


/*
* Go through tdx_memlist to find holes between memory areas. If any of
* those holes fall within @tdmr, set up a TDMR reserved area to cover
* the hole.
*/
static int tdmr_populate_rsvd_holes(struct list_head *tdx_memlist,
struct tdmr_info *tdmr,
int *rsvd_idx)

> + struct tdx_memblock *tmb;
> + u64 prev_end;
> + int ret;
> +
> + /* Mark holes between memory regions as reserved */
> + prev_end = tdmr_start(tdmr);

I'm having a hard time following this, especially the mixing of
semantics between 'prev_end' both pointing to tdmr and to tmb addresses.

Here, 'prev_end' logically represents the last address which we know has
been handled. All of the holes in the addresses below it have been
dealt with. It is safe to set here to tdmr_start() because this
function call is uniquely tasked with setting up reserved areas in
'tdmr'. So, it can safely consider any holes before tdmr_start(tdmr) as
being handled.
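
For example, with a TDMR covering [0, 1G) and tdx_memlist entries
[1M, 256M) and [512M, 1G), the loop reserves the holes [0, 1M) and
[256M, 512M), and the after-loop check adds nothing because 'prev_end'
finishes at 1G.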

But, dang, there's a lot of complexity there.

First, the:

/* Mark holes between memory regions as reserved */

comment is misleading. It has *ZILCH* to do with the "prev_end =
tdmr_start(tdmr);" assignment.

This at least needs:

/* Start looking for reserved blocks at the beginning of the TDMR: */
prev_end = tdmr_start(tdmr);

but I also get the feeling that 'prev_end' is a crummy variable name. I
don't have any better suggestions at the moment.

> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + u64 start, end;
> +
> + start = tmb->start_pfn << PAGE_SHIFT;
> + end = tmb->end_pfn << PAGE_SHIFT;
> +

More alignment opportunities:

start = tmb->start_pfn << PAGE_SHIFT;
end   = tmb->end_pfn   << PAGE_SHIFT;


> + /* Break if this region is after the TDMR */
> + if (start >= tdmr_end(tdmr))
> + break;
> +
> + /* Exclude regions before this TDMR */
> + if (end < tdmr_start(tdmr))
> + continue;
> +
> + /*
> + * Skip if no hole exists before this region. "<=" is
> + * used because one memory region might span two TDMRs
> + * (when the previous TDMR covers part of this region).
> + * In this case the start address of this region is
> + * smaller than the start address of the second TDMR.
> + *
> + * Update the prev_end to the end of this region where
> + * the possible memory hole starts.
> + */

Can't this just be:

/*
* Skip over memory areas that
* have already been dealt with.
*/

> + if (start <= prev_end) {
> + prev_end = end;
> + continue;
> + }
> +
> + /* Add the hole before this region */
> + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> + start - prev_end);
> + if (ret)
> + return ret;
> +
> + prev_end = end;
> + }
> +
> + /* Add the hole after the last region if it exists. */
> + if (prev_end < tdmr_end(tdmr)) {
> + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> + tdmr_end(tdmr) - prev_end);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int tdmr_set_up_pamt_rsvd_areas(struct tdmr_info *tdmr, int *rsvd_idx,
> + struct tdmr_info *tdmr_array,
> + int tdmr_num)
> +{
> + int i, ret;
> +
> + /*
> + * If any PAMT overlaps with this TDMR, the overlapping part
> + * must also be put into the reserved areas. Walk over all
> + * TDMRs to find those overlapping PAMTs and put them into
> + * reserved areas.
> + */
> + for (i = 0; i < tdmr_num; i++) {
> + struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i);
> + unsigned long pamt_start_pfn, pamt_npages;
> + u64 pamt_start, pamt_end;
> +
> + tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages);
> + /* Each TDMR must already have PAMT allocated */
> + WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn);
> +
> + pamt_start = pamt_start_pfn << PAGE_SHIFT;
> + pamt_end = pamt_start + (pamt_npages << PAGE_SHIFT);
> +
> + /* Skip PAMTs outside of the given TDMR */
> + if ((pamt_end <= tdmr_start(tdmr)) ||
> + (pamt_start >= tdmr_end(tdmr)))
> + continue;
> +
> + /* Only mark the part within the TDMR as reserved */
> + if (pamt_start < tdmr_start(tdmr))
> + pamt_start = tdmr_start(tdmr);
> + if (pamt_end > tdmr_end(tdmr))
> + pamt_end = tdmr_end(tdmr);
> +
> + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start,
> + pamt_end - pamt_start);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +/* Compare function called by sort() for TDMR reserved areas */
> +static int rsvd_area_cmp_func(const void *a, const void *b)
> +{
> + struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
> + struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
> +
> + if (r1->offset + r1->size <= r2->offset)
> + return -1;
> + if (r1->offset >= r2->offset + r2->size)
> + return 1;
> +
> + /* Reserved areas cannot overlap. The caller must guarantee this. */
> + WARN_ON_ONCE(1);
> + return -1;
> +}
> +
> +/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
> +static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr,
> + struct tdmr_info *tdmr_array,
> + int tdmr_num)
> +{
> + int ret, rsvd_idx = 0;
> +
> + /* Put all memory holes within the TDMR into reserved areas */
> + ret = tdmr_set_up_memory_hole_rsvd_areas(tdmr, &rsvd_idx);
> + if (ret)
> + return ret;
> +
> + /* Put all (overlapping) PAMTs within the TDMR into reserved areas */
> + ret = tdmr_set_up_pamt_rsvd_areas(tdmr, &rsvd_idx, tdmr_array, tdmr_num);
> + if (ret)
> + return ret;
> +
> + /* TDX requires reserved areas listed in address ascending order */
> + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
> + rsvd_area_cmp_func, NULL);

Ugh, and I guess we don't know where the PAMTs will be ordered with
respect to holes, so sorting is the easiest way to do this.

<snip>

2022-11-24 00:36:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On 11/20/22 16:26, Kai Huang wrote:
> After the array of TDMRs and the global KeyID are configured to the TDX
> module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
> on all packages.
>
> TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
> it cannot run concurrently on different CPUs. Implement a helper to
> run SEAMCALL on one cpu for each package one by one, and use it to
> configure the global KeyID on all packages.

This has the same problems as SYS.LP.INIT. It's basically snake oil in
some TDX configurations.

This really only needs to be done when the TDX module has memory
mappings on a socket for which it needs to use the "global KeyID". If
there's no PAMT on a socket, there are probably no allocations there to
speak of and no *technical* reason to call TDH.SYS.KEY.CONFIG on that
socket. At least none I can see.

So, let's check up on this requirement as well. This could also turn
out to be a real pain if all the CPUs on a socket are offline.

> Intel hardware doesn't guarantee cache coherency across different
> KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
> with KeyID 0) before the TDX module uses the global KeyID to access the
> PAMT. Following the TDX module specification, flush cache before
> configuring the global KeyID on all packages.

I think it's probably worth an aside here about why TDX security isn't
dependent on this step. I *think* it boils down to the memory integrity
protections. If the caches aren't flushed, a dirty KeyID-0 cacheline
could be written back to RAM. The TDX module would come along later and
read the cacheline using KeyID-whatever, get an integrity mismatch,
machine check, and then everyone would be sad.

Everyone is sad, but TDX security remains intact because memory
integrity saved us.

Is it memory integrity or the TD bit, actually?

> Given the PAMT size can be large (~1/256th of system RAM), just use
> WBINVD on all CPUs to flush.

<sigh>

> Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> used the global KeyID to write any PAMT. Therefore, need to use WBINVD
> to flush cache before freeing the PAMTs back to the kernel. Note using
> MOVDIR64B (which changes the page's associated KeyID from the old TDX
> private KeyID back to KeyID 0, which is used by the kernel) to clear
> PAMTs isn't needed, as KeyID 0 doesn't support integrity check.

I hope this is covered in the code well.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 3a032930e58a..99d1be5941a7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -224,6 +224,46 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> on_each_cpu(seamcall_smp_call_function, sc, true);
> }
>
> +/*
> + * Call one SEAMCALL on one (any) cpu for each physical package in
> + * a serialized way. Return immediately if the SEAMCALL fails on
> + * any cpu.
> + *
> + * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
> + * to be atomic, but for simplicity just reuse it instead of adding
> + * a new one.
> + */
> +static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
> +{
> + cpumask_var_t packages;
> + int cpu, ret = 0;
> +
> + if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> + return -ENOMEM;
> +
> + for_each_online_cpu(cpu) {
> + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
> + packages))
> + continue;
> +
> + ret = smp_call_function_single(cpu, seamcall_smp_call_function,
> + sc, true);
> + if (ret)
> + break;
> +
> + /*
> + * Doesn't have to use atomic_read(), but it doesn't
> + * hurt either.
> + */

I don't think you need to cover this twice. Just do it in one comment.

> + ret = atomic_read(&sc->err);
> + if (ret)
> + break;
> + }
> +
> + free_cpumask_var(packages);
> + return ret;
> +}
> +
> static int tdx_module_init_cpus(void)
> {
> struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
> @@ -1010,6 +1050,22 @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
> return ret;
> }
>
> +static int config_global_keyid(void)
> +{
> + struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
> +
> + /*
> + * Configure the key of the global KeyID on all packages by
> + * calling TDH.SYS.KEY.CONFIG on all packages in a serialized
> + * way as it cannot run concurrently on different CPUs.
> + *
> + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
> + * a recoverable error). Assume this is exceedingly rare and
> + * just return error if encountered instead of retrying.
> + */
> + return seamcall_on_each_package_serialized(&sc);
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -1098,15 +1154,44 @@ static int init_tdx_module(void)
> if (ret)
> goto out_free_pamts;
>
> + /*
> + * Hardware doesn't guarantee cache coherency across different
> + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> + * (associated with KeyID 0) before the TDX module can use the
> + * global KeyID to access the PAMT. Given PAMTs are potentially
> + * large (~1/256th of system RAM), just use WBINVD on all cpus
> + * to flush the cache.
> + *
> + * Follow the TDX spec to flush cache before configuring the
> + * global KeyID on all packages.
> + */
> + wbinvd_on_all_cpus();
> +
> + /* Config the key of global KeyID on all packages */
> + ret = config_global_keyid();
> + if (ret)
> + goto out_free_pamts;
> +
> /*
> * Return -EINVAL until all steps of TDX module initialization
> * process are done.
> */
> ret = -EINVAL;
> out_free_pamts:
> - if (ret)
> + if (ret) {
> + /*
> + * Part of PAMT may already have been initialized by

s/initialized/written/

> + * TDX module. Flush cache before returning PAMT back
> + * to the kernel.
> + *
> + * Note there's no need to do MOVDIR64B (which changes
> + * the page's associated KeyID from the old TDX private
> + * KeyID back to KeyID 0, which is used by the kernel),
> + * as KeyID 0 doesn't support integrity check.
> + */

MOVDIR64B is the tiniest of implementation details and also not the only
way to initialize memory integrity metadata.

Just keep this high level:

* No need to worry about memory integrity checks here.
* KeyID 0 has integrity checking disabled.

> + wbinvd_on_all_cpus();
> tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> - else
> + } else
> pr_info("%lu pages allocated for PAMT.\n",
> tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
> out_free_tdmrs:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index c26bab2555ca..768d097412ab 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -15,6 +15,7 @@
> /*
> * TDX module SEAMCALL leaf functions
> */
> +#define TDH_SYS_KEY_CONFIG 31
> #define TDH_SYS_INFO 32
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35

2022-11-24 00:49:01

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 16/20] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID

On 11/20/22 16:26, Kai Huang wrote:
> After the TDX-usable memory regions are constructed in an array of TDMRs
> and the global KeyID is reserved, configure them to the TDX module using
> TDH.SYS.CONFIG SEAMCALL. TDH.SYS.CONFIG can only be called once and can
> be done on any logical cpu.
>
> Reviewed-by: Isaku Yamahata <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 2 ++
> 2 files changed, 39 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index e2cbeeb7f0dc..3a032930e58a 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -979,6 +979,37 @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> return ret;
> }
>
> +static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
> + u64 global_keyid)
> +{
> + u64 *tdmr_pa_array;
> + int i, array_sz;
> + u64 ret;
> +
> + /*
> + * TDMR_INFO entries are configured to the TDX module via an
> + * array of the physical address of each TDMR_INFO. TDX module
> + * requires the array itself to be 512-byte aligned. Round up
> + * the array size to 512-byte aligned so the buffer allocated
> + * by kzalloc() will meet the alignment requirement.
> + */

Aagh. Return of (a different) 512-byte aligned structure.

> + array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
> + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);

Just to be clear, all that chatter about alignment is because the
*START* of the array has to be aligned. Right? I see alignment for
'array_sz', but that's not the start of the array.

tdmr_pa_array is the start of the array. Where is *THAT* aligned?

How does rounding up the size make kzalloc() magically know how to align
the *START* of the allocation?

Because I'm actually questioning my own sanity at this point, I went and
double-checked the docs (Documentation/core-api/memory-allocation.rst):

> The address of a chunk allocated with `kmalloc` is aligned to at least
> ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the
> alignment is also guaranteed to be at least the respective size.

Hint #1: ARCH_KMALLOC_MINALIGN is way less than 512.
Hint #2: tdmr_num is not guaranteed to be a power of two.
Hint #3: Comments don't actually affect the allocation
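
For illustration only, leaning on the power-of-two rule quoted above, one
way to actually get an aligned *START* would be to round the allocation
size itself up to a power of two no smaller than the required alignment:

	array_sz = roundup_pow_of_two(max_t(size_t, tdmr_num * sizeof(u64),
					    TDMR_INFO_PA_ARRAY_ALIGNMENT));
	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);

Or just allocate a whole page and be done with it.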

<snip>

2022-11-24 00:49:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 18/20] x86/virt/tdx: Initialize all TDMRs

On 11/20/22 16:26, Kai Huang wrote:
> Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
> TDX initialization.
>
> All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
> the memory pages can be used by the TDX module. The time to initialize
> TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT
> internally initializes the PAMT entries using the global KeyID.
>
> To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only
> initializes an (implementation-specific) subset of PAMT entries of one
> TDMR in one invocation. The caller needs to call TDH.SYS.TDMR.INIT
> iteratively until all PAMT entries of the given TDMR are initialized.
>
> TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they
> are initializing different TDMRs. To keep it simple, just initialize
> all TDMRs one by one. On a 2-socket machine with 2.2GHz CPUs and 64GB
> memory, each TDH.SYS.TDMR.INIT takes roughly a couple of microseconds on
> average, and it takes roughly dozens of milliseconds to complete the
> initialization of all TDMRs while the system is idle.

Any chance you could say TDH.SYS.TDMR.INIT a few more times in there? ;)

Seriously, please try to trim that down. If you talk about the
implementation in detail in the code comments, don't cover it in detail
in the changelog too.

Also, this is a momentous patch, right? It's the last piece. Maybe
worth calling that out.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 99d1be5941a7..9bcdb30b7a80 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1066,6 +1066,65 @@ static int config_global_keyid(void)
> return seamcall_on_each_package_serialized(&sc);
> }
>
> +/* Initialize one TDMR */

Does this comment add value? Even if it does, is it better than naming
the dang function init_one_tdmr()?

> +static int init_tdmr(struct tdmr_info *tdmr)
> +{
> + u64 next;
> +
> + /*
> + * Initializing PAMT entries might be time-consuming (in
> + * proportion to the size of the requested TDMR). To avoid long
> + * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
> + * an (implementation-defined) subset of PAMT entries in one
> + * invocation.
> + *
> + * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
> + * of the requested TDMR are initialized (if next-to-initialize
> + * address matches the end address of the TDMR).
> + */

The PAMT discussion here is, IMNHO, silly. If this were about
initializing PAMTs, then it should be renamed init_pamts() and the
SEAMCALL should be called PAMT_INIT or something. It's not, and the ABI
is built around the TDMR and *its* addresses.

Let's chop this comment down:

/*
* Initializing a TDMR can be time consuming. To avoid long
* SEAMCALLs, the TDX module may only initialize a part of the
* TDMR in each call.
*/

Talk about the looping logic in the loop.

> + do {
> + struct tdx_module_output out;
> + int ret;
> +
> + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> + &out);
> + if (ret)
> + return ret;
> + /*
> + * RDX contains 'next-to-initialize' address if
> + * TDH.SYS.TDMR.INIT succeeded.
> + */
> + next = out.rdx;
> + /* Allow scheduling when needed */

That comment is a wee bit superfluous, don't you think?

> + cond_resched();

/* Keep making SEAMCALLs until the TDMR is done */

> + } while (next < tdmr->base + tdmr->size);
> +
> + return 0;
> +}
> +
> +/* Initialize all TDMRs */

Does this comment add value?

> +static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num)
> +{
> + int i;
> +
> + /*
> + * Initialize TDMRs one-by-one for simplicity, though the TDX
> + * architecture does allow different TDMRs to be initialized in
> + * parallel on multiple CPUs. Parallel initialization could
> + * be added later when the time spent in the serialized scheme
> + * becomes a real concern.
> + */

Trim down the comment:

/*
* This operation is costly. It can be parallelized,
* but keep it simple for now.
*/

> + for (i = 0; i < tdmr_num; i++) {
> + int ret;
> +
> + ret = init_tdmr(tdmr_array_entry(tdmr_array, i));
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -1172,11 +1231,11 @@ static int init_tdx_module(void)
> if (ret)
> goto out_free_pamts;
>
> - /*
> - * Return -EINVAL until all steps of TDX module initialization
> - * process are done.
> - */
> - ret = -EINVAL;
> + /* Initialize TDMRs to complete the TDX module initialization */
> + ret = init_tdmrs(tdmr_array, tdmr_num);
> + if (ret)
> + goto out_free_pamts;
> +
> out_free_pamts:
> if (ret) {
> /*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 768d097412ab..891691b1ea50 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -19,6 +19,7 @@
> #define TDH_SYS_INFO 32
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35
> +#define TDH_SYS_TDMR_INIT 36
> #define TDH_SYS_LP_SHUTDOWN 44
> #define TDH_SYS_CONFIG 45
>

2022-11-24 01:23:16

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

"Huang, Kai" <[email protected]> writes:

>> > > > +/*
>> > > > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
>> > > > + * Must be called when get_online_mems() is called by the caller.
>> > > > + */
>> > > > +static int build_tdx_memory(void)
>> > > > +{
>> > > > + unsigned long start_pfn, end_pfn;
>> > > > + int i, nid, ret;
>> > > > +
>> > > > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>> > > > + /*
>> > > > + * The first 1MB may not be reported as TDX convertible
>> > > > + * memory. Manually exclude them as TDX memory.
>> > > > + *
>> > > > + * This is fine as the first 1MB is already reserved in
>> > > > + * reserve_real_mode() and won't end up to ZONE_DMA as
>> > > > + * free page anyway.
>> > > > + */
>> > > > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
>> > > > + if (start_pfn >= end_pfn)
>> > > > + continue;
>> > >
>> > > How about checking whether the first 1MB is reserved instead of depending on
>> > > the corresponding code not being changed? Via for_each_reserved_mem_range()?
>> >
>> > IIUC, some reserved memory can be freed to the page allocator directly, e.g.
>> > kernel init code/data. I feel it's not safe to just assume reserved memory will
>> > never be in the page allocator. Otherwise, there is for_each_free_mem_range() we
>> > could use.
>>
>> Yes. memblock reserve information isn't perfect. But I still think
>> that checking whether the first 1MB is reserved in memblock is better than
>> just assuming it. Or, we could check whether the pages of the first 1MB
>> are reserved by checking struct page directly?
>>
>
> Sorry, I am a little bit confused about what you want to achieve here. Do you
> want to add some sanity check to make sure the first 1MB is indeed not in the
> page allocator?
>
> IIUC, it is indeed true. Please see the comment at the reserve_real_mode() call
> in setup_arch(). Also please see efi_free_boot_services(), which doesn't free
> boot services memory if it is below 1MB.
>
> Also, my understanding is the kernel's intention is to always reserve the first 1MB:
>
> /*
> * Don't free memory under 1M for two reasons:
> * - BIOS might clobber it
> * - Crash kernel needs it to be reserved
> */
>
> So if any page in the first 1MB ended up in the page allocator, it would be a
> kernel bug which is not related to TDX, correct?

I suggest adding some code to verify this. It's possible for that code
to be changed in the future (although the possibility is low), and the TDX
code may not be changed at the same time. The verifying code here could
catch that, so we could then make changes accordingly.
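
Just to sketch what I mean (assuming, as I believe is the case, that
for_each_reserved_mem_range() walks the reserved ranges in ascending
address order):

	/* Return true if [0, 1M) is fully covered by memblock reserved ranges */
	static bool first_1mb_is_reserved(void)
	{
		phys_addr_t start, end, covered_to = 0;
		u64 i;

		for_each_reserved_mem_range(i, &start, &end) {
			if (start > covered_to)
				break;		/* found a gap below 1MB */
			covered_to = max(covered_to, end);
			if (covered_to >= SZ_1M)
				return true;
		}
		return false;
	}

Then build_tdx_memory() could WARN_ON(!first_1mb_is_reserved()) before it
skips the first 1MB.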

Best Regards,
Huang, Ying

2022-11-24 01:57:12

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > TDX reports a list of "Convertible Memory Region" (CMR) to indicate all
> > memory regions that can possibly be used by the TDX module, but they are
> > not automatically usable to the TDX module. As a step of initializing
> > the TDX module, the kernel needs to choose a list of memory regions (out
> > from convertible memory regions) that the TDX module can use and pass
> > those regions to the TDX module. Once this is done, those "TDX-usable"
> > memory regions are fixed during module's lifetime. No more TDX-usable
> > memory can be added to the TDX module after that.
> >
> > The initial support of TDX guests will only allocate TDX guest memory
> > from the global page allocator. To keep things simple, this initial
> > implementation simply guarantees all pages in the page allocator are TDX
> > memory. To achieve this, use all system memory in the core-mm at the
> > time of initializing the TDX module as TDX memory, and in the meantime,
> > refuse to add any non-TDX-memory in the memory hotplug.
> >
> > Specifically, walk through all memory regions managed by memblock and
> > add them to a global list of "TDX-usable" memory regions, which is a
> > fixed list after the module initialization (or empty if initialization
> > fails). To reject non-TDX-memory in memory hotplug, add an additional
> > check in arch_add_memory() to check whether the new region is covered by
> > any region in the "TDX-usable" memory region list.
> >
> > Note this requires all memory regions in memblock are TDX convertible
> > memory when initializing the TDX module. This is true in practice if no
> > new memory has been hot-added before initializing the TDX module, since
> > in practice all boot-time present DIMMs are TDX convertible memory. If
> > any new memory has been hot-added, then initializing the TDX module will
> > fail because that memory region is not covered by any CMR.
> >
> > This can be enhanced in the future, i.e. by allowing adding non-TDX
> > memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > needs to guarantee memory pages for TDX guests are always allocated from
> > the "TDX-capable" nodes.
> >
> > Note TDX assumes convertible memory is always physically present during
> > machine's runtime. A non-buggy BIOS should never support hot-removal of
> > any convertible memory. This implementation doesn't handle ACPI memory
> > removal but depends on the BIOS to behave correctly.
>
> My eyes glazed over about halfway through that. Can you try to trim it
> down a bit, or at least try to summarize it better up front?

Will do.

>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index dd333b46fafb..b36129183035 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> > depends on X86_64
> > depends on KVM_INTEL
> > depends on X86_X2APIC
> > + select ARCH_KEEP_MEMBLOCK
> > help
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. This option enables necessary TDX
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index d688228f3151..71169ecefabf 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> > #ifdef CONFIG_INTEL_TDX_HOST
> > bool platform_tdx_enabled(void);
> > int tdx_enable(void);
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
> > #else /* !CONFIG_INTEL_TDX_HOST */
> > static inline bool platform_tdx_enabled(void) { return false; }
> > static inline int tdx_enable(void) { return -ENODEV; }
> > +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> > + unsigned long end_pfn) { return true; }
> > #endif /* CONFIG_INTEL_TDX_HOST */
> >
> > #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 3f040c6e5d13..900341333d7e 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -55,6 +55,7 @@
> > #include <asm/uv/uv.h>
> > #include <asm/setup.h>
> > #include <asm/ftrace.h>
> > +#include <asm/tdx.h>
> >
> > #include "mm_internal.h"
> >
> > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > unsigned long start_pfn = start >> PAGE_SHIFT;
> > unsigned long nr_pages = size >> PAGE_SHIFT;
> >
> > + /*
> > + * For now if TDX is enabled, all pages in the page allocator
>
> s/For now//

Will do.

>
> > + * must be TDX memory, which is a fixed set of memory regions
> > + * that are passed to the TDX module. Reject the new region
> > + * if it is not TDX memory to guarantee above is true.
> > + */
> > + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > + return -EINVAL;
>
> There's a real art to making a right-size comment. I don't think this
> needs to be any more than:
>
> /*
> * Not all memory is compatible with TDX. Reject
> * the addition of any incompatible memory.
> */

Thanks.

>
> If you want to write a treatise, do it in Documentation or at the
> tdx_cc_memory_compatible() definition.
>
> > init_memory_mapping(start, start + size, params->pgprot);
> >
> > return add_pages(nid, start_pfn, nr_pages, params);
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 43227af25e44..32af86e31c47 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -16,6 +16,11 @@
> > #include <linux/smp.h>
> > #include <linux/atomic.h>
> > #include <linux/align.h>
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/memblock.h>
> > +#include <linux/minmax.h>
> > +#include <linux/sizes.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > #include <asm/apic.h>
> > @@ -34,6 +39,13 @@ enum tdx_module_status_t {
> > TDX_MODULE_SHUTDOWN,
> > };
> >
> > +struct tdx_memblock {
> > + struct list_head list;
> > + unsigned long start_pfn;
> > + unsigned long end_pfn;
> > + int nid;
> > +};
>
> Why does the nid matter?

It is used to find the node for the PAMT allocation for a given TDMR.

>
> > static u32 tdx_keyid_start __ro_after_init;
> > static u32 tdx_keyid_num __ro_after_init;
> >
> > @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
> > static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> > static int tdx_cmr_num;
> >
> > +/* All TDX-usable memory regions */
> > +static LIST_HEAD(tdx_memlist);
> > +
> > /*
> > * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> > * BIOS. Both initializing the TDX module and running TDX guest require
> > @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
> > return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> > }
> >
> > +/* Check whether the given pfn range is covered by any CMR or not. */
> > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > + unsigned long end_pfn)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < tdx_cmr_num; i++) {
> > + struct cmr_info *cmr = &tdx_cmr_array[i];
> > + unsigned long cmr_start_pfn;
> > + unsigned long cmr_end_pfn;
> > +
> > + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> > +
> > + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> > + return true;
> > + }
>
> What if the pfn range overlaps two CMRs? It will never pass any
> individual overlap test and will return false.

We could only return true if the two CMRs are contiguous.

I cannot think of a reason why a reasonable BIOS would generate contiguous
CMRs. Perhaps one case is two contiguous NUMA nodes? For this case, memblock
has made sure no memory region crosses NUMA nodes, so the start_pfn/end_pfn
here should always be within one node. Perhaps we could add a comment for this
case?

Anyway, I am not sure whether the "contiguous CMRs" case is worth considering.

>
> > + return false;
> > +}
> > +
> > +/*
> > + * Add a memory region on a given node as a TDX memory block. The caller
> > + * to make sure all memory regions are added in address ascending order
>
> s/to/must/

Thanks.

>
> > + * and don't overlap.
> > + */
> > +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> > + int nid)
> > +{
> > + struct tdx_memblock *tmb;
> > +
> > + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > + if (!tmb)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&tmb->list);
> > + tmb->start_pfn = start_pfn;
> > + tmb->end_pfn = end_pfn;
> > + tmb->nid = nid;
> > +
> > + list_add_tail(&tmb->list, &tdx_memlist);
> > + return 0;
> > +}
> > +
> > +static void free_tdx_memory(void)
>
> This is named a bit too generically. How about free_tdx_memlist() or
> something?

Will use free_tdx_memlist(). Do you want to also change build_tdx_memory() to
build_tdx_memlist()?

>
> > +{
> > + while (!list_empty(&tdx_memlist)) {
> > + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> > + struct tdx_memblock, list);
> > +
> > + list_del(&tmb->list);
> > + kfree(tmb);
> > + }
> > +}
> > +
> > +/*
> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > + * Must be called when get_online_mems() is called by the caller.
> > + */
>
> Again, this explains the "what", but not the "why".
>
> /*
> * Ensure that all memblock memory regions are convertible to TDX
> * memory. Once this has been established, stash the memblock
> * ranges off in a secondary structure because $REASONS.
> */
>
> Which makes me wonder: Why do you even need a secondary structure here?
> What's wrong with the memblocks themselves?

One reason is the new region has already been added to memblock before calling
arch_add_memory(), so we cannot compare the new region against memblock.

The other reason is that memblock is updated by memory hotplug, so it really
isn't a set of *fixed* memory regions, which TDX requires. Having TDX's own
tdx_memlist can support such a case: after module initialization, some memory
can be hot-removed and then hot-added again, because the hot-added range will
be covered by @tdx_memlist.

>
> > +static int build_tdx_memory(void)
> > +{
> > + unsigned long start_pfn, end_pfn;
> > + int i, nid, ret;
> > +
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > + /*
> > + * The first 1MB may not be reported as TDX convertible
> > + * memory. Manually exclude them as TDX memory.
>
> I don't like the "may not" here very much.
>
> > + * This is fine as the first 1MB is already reserved in
> > + * reserve_real_mode() and won't end up to ZONE_DMA as
> > + * free page anyway.
>
> ^ free pages
>
> > + */
>
> This is way too wishy-washy. The TDX module may or may not... Then, it
> doesn't matter since reserve_real_mode() does it anyway...
>
> Then it goes and adds code to skip it!
>
> > + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > + if (start_pfn >= end_pfn)
> > + continue;
>
>
> Please just put a dang stake in the ground. If the other code deals
> with this, then explain *why* more is needed here.

How about adding the following before the 'for_each_mem_pfn_range()' loop:

/*
* Some reserved pages in memblock (e.g. kernel init code/data) are
* freed to the page allocator directly. Use for_each_mem_pfn_range()
* instead of for_each_free_mem_range() to make sure all pages in the
* page allocator are covered as TDX memory.
*/

It explains why for_each_mem_pfn_range() is used.

And here, before skipping the first 1MB, we add the following:

/*
* The first 1MB is not reported as TDX covertible memory.
* Although the first 1MB is always reserved and won't end up
* to the page allocator, it is still in memblock's memory
* regions. Skip them manually to exclude them as TDX memory.
*/

>
> > + /* Verify memory is truly TDX convertible memory */
> > + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> > + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
> > + start_pfn << PAGE_SHIFT,
> > + end_pfn << PAGE_SHIFT);
> > + return -EINVAL;
>
> ... no 'goto err'? This leaks all the previous add_tdx_memblock()
> structures, right?

Right. It's a leftover from the old code. Will fix.

>
> > + }
> > +
> > + /*
> > + * Add the memory regions as TDX memory. The regions in
> > + * memblock has already guaranteed they are in address
> > + * ascending order and don't overlap.
> > + */
> > + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> > + if (ret)
> > + goto err;
> > + }
> > +
> > + return 0;
> > +err:
> > + free_tdx_memory();
> > + return ret;
> > +}
> > +
> > /*
> > * Detect and initialize the TDX module.
> > *
> > @@ -357,12 +473,56 @@ static int init_tdx_module(void)
> > if (ret)
> > goto out;
> >
> > + /*
> > + * All memory regions that can be used by the TDX module must be
> > + * passed to the TDX module during the module initialization.
> > + * Once this is done, all "TDX-usable" memory regions are fixed
> > + * during module's runtime.
> > + *
> > + * The initial support of TDX guests only allocates memory from
> > + * the global page allocator. To keep things simple, for now
> > + * just make sure all pages in the page allocator are TDX memory.
> > + *
> > + * To achieve this, use all system memory in the core-mm at the
> > + * time of initializing the TDX module as TDX memory, and in the
> > + * meantime, reject any new memory in memory hot-add.
> > + *
> > + * This works because, in practice, all boot-time present DIMMs are TDX
> > + * convertible memory. However if any new memory is hot-added
> > + * before initializing the TDX module, the initialization will
> > + * fail because that memory is not covered by any CMR.
> > + *
> > + * This can be enhanced in the future, i.e. by allowing adding or
> > + * onlining non-TDX memory to a separate node, in which case the
> > + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> > + * together -- the userspace/kernel just needs to make sure pages
> > + * for TDX guests must come from those "TDX-capable" nodes.
> > + *
> > + * Build the list of TDX memory regions as mentioned above so
> > + * they can be passed to the TDX module later.
> > + */
>
> This is mostly Documentation/, not a code comment. Please clean it up.

Will try to clean up.

>
> > + get_online_mems();
> > +
> > + ret = build_tdx_memory();
> > + if (ret)
> > + goto out;
> > /*
> > * Return -EINVAL until all steps of TDX module initialization
> > * process are done.
> > */
> > ret = -EINVAL;
> > out:
> > + /*
> > + * Memory hotplug checks the hot-added memory region against the
> > + * @tdx_memlist to see if the region is TDX memory.
> > + *
> > + * Do put_online_mems() here to make sure any modification to
> > + * @tdx_memlist is done while holding the memory hotplug read
> > + * lock, so that the memory hotplug path can just check the
> > + * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> > + * deadlock.
> > + */
>
> I'm honestly not following any of that.

How about:

/*
* Make sure tdx_cc_memory_compatible() either sees a fixed set of
* memory regions in @tdx_memlist, or an empty list.
*/

>
> > + put_online_mems();
> > return ret;
> > }
> >
> > @@ -485,3 +645,26 @@ int tdx_enable(void)
> > return ret;
> > }
> > EXPORT_SYMBOL_GPL(tdx_enable);
> > +
> > +/*
> > + * Check whether the given range is TDX memory. Must be called between
> > + * mem_hotplug_begin()/mem_hotplug_done().
> > + */
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + struct tdx_memblock *tmb;
> > +
> > + /* Empty list means TDX isn't enabled successfully */
> > + if (list_empty(&tdx_memlist))
> > + return true;
> > +
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + /*
> > + * The new range is TDX memory if it is fully covered
> > + * by any TDX memory block.
> > + */
> > + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > + return true;
>
> Same bug. What if the start/end_pfn range is covered by more than one
> tdx_memblock?

We may want to return true if the tdx_memblocks are contiguous.

However, I don't think this will happen?

tdx_memblock is from memblock, and when two memory regions in memblock are
contiguous, they must have different nodes or different flags.

My understanding is the hot-added memory region here cannot cross NUMA nodes,
nor have different flags, correct?

>
> > + }
> > + return false;
> > +}
>

2022-11-24 01:58:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On 11/23/22 17:04, Huang, Kai wrote:
> On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
>>> +struct tdx_memblock {
>>> + struct list_head list;
>>> + unsigned long start_pfn;
>>> + unsigned long end_pfn;
>>> + int nid;
>>> +};
>>
>> Why does the nid matter?
>
> It is used to find the node for the PAMT allocation for a given TDMR.

... which is in this patch?

You can't just plop unused and unmentioned nuggets in the code. Remove
it until it is needed.


>>> +/* Check whether the given pfn range is covered by any CMR or not. */
>>> +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
>>> + unsigned long end_pfn)
>>> +{
>>> + int i;
>>> +
>>> + for (i = 0; i < tdx_cmr_num; i++) {
>>> + struct cmr_info *cmr = &tdx_cmr_array[i];
>>> + unsigned long cmr_start_pfn;
>>> + unsigned long cmr_end_pfn;
>>> +
>>> + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
>>> + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
>>> +
>>> + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
>>> + return true;
>>> + }
>>
>> What if the pfn range overlaps two CMRs? It will never pass any
>> individual overlap test and will return false.
>
> We could only return true if the two CMRs are contiguous.
>
> I cannot think of a reason why a reasonable BIOS would generate contiguous
> CMRs.

Because it can?

We don't just try and randomly assign what we think is reasonable or
not. First and foremost, we need to ask whether the configuration in
question is allowed by the spec.

Would it be a *valid* thing to have two adjacent CMRs? Does the TDX
module spec disallow it?

> Perhaps one case is two contiguous NUMA nodes? For this case, memblock
> has made sure no memory region crosses NUMA nodes, so the start_pfn/end_pfn
> here should always be within one node. Perhaps we could add a comment for this
> case?

<cough> numa=off <cough>

> Anyway, I am not sure whether the "contiguous CMRs" case is worth considering.

I am sure. You need to consider it.
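
FWIW, handling it takes very little code. A sketch, reusing the
tdx_cmr_array/tdx_cmr_num bits from earlier in the series, instead of the
single containment test:

	static bool pfn_range_covered_by_cmrs(unsigned long start_pfn,
					      unsigned long end_pfn)
	{
		unsigned long covered_to = start_pfn;
		int i;

		/* CMRs are sorted and non-overlapping (MCHECK's guarantee) */
		for (i = 0; i < tdx_cmr_num; i++) {
			struct cmr_info *cmr = &tdx_cmr_array[i];
			unsigned long cmr_start = cmr->base >> PAGE_SHIFT;
			unsigned long cmr_end = (cmr->base + cmr->size) >> PAGE_SHIFT;

			if (cmr_start > covered_to)
				break;	/* a gap: the range is not covered */
			if (cmr_end > covered_to)
				covered_to = cmr_end;
			if (covered_to >= end_pfn)
				return true;
		}
		return false;
	}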

>>> + * and don't overlap.
>>> + */
>>> +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
>>> + int nid)
>>> +{
>>> + struct tdx_memblock *tmb;
>>> +
>>> + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
>>> + if (!tmb)
>>> + return -ENOMEM;
>>> +
>>> + INIT_LIST_HEAD(&tmb->list);
>>> + tmb->start_pfn = start_pfn;
>>> + tmb->end_pfn = end_pfn;
>>> + tmb->nid = nid;
>>> +
>>> + list_add_tail(&tmb->list, &tdx_memlist);
>>> + return 0;
>>> +}
>>> +
>>> +static void free_tdx_memory(void)
>>
>> This is named a bit too generically. How about free_tdx_memlist() or
>> something?
>
> Will use free_tdx_memlist(). Do you want to also change build_tdx_memory() to
> build_tdx_memlist()?

Does it build a memlist?

>>> +{
>>> + while (!list_empty(&tdx_memlist)) {
>>> + struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
>>> + struct tdx_memblock, list);
>>> +
>>> + list_del(&tmb->list);
>>> + kfree(tmb);
>>> + }
>>> +}
>>> +
>>> +/*
>>> + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
>>> + * Must be called when get_online_mems() is called by the caller.
>>> + */
>>
>> Again, this explains the "what", but not the "why".
>>
>> /*
>> * Ensure that all memblock memory regions are convertible to TDX
>> * memory. Once this has been established, stash the memblock
>> * ranges off in a secondary structure because $REASONS.
>> */
>>
>> Which makes me wonder: Why do you even need a secondary structure here?
>> What's wrong with the memblocks themselves?
>
> One reason is the new region has already been added to memblock before calling
> arch_add_memory(), so we cannot compare the new region against memblock.
>
> The other reason is that memblock is updated by memory hotplug, so it really
> isn't a set of *fixed* memory regions, which TDX requires. Having TDX's own
> tdx_memlist can support such a case: after module initialization, some memory
> can be hot-removed and then hot-added again, because the hot-added range will
> be covered by @tdx_memlist.

OK, that's fair enough. Memblocks change and we don't lock out changes
to memblocks. But, the TDX setup is based on memblocks at one point in
time, so we effectively need to know what their state was at that point
in time.

>>> +static int build_tdx_memory(void)
>>> +{
>>> + unsigned long start_pfn, end_pfn;
>>> + int i, nid, ret;
>>> +
>>> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>>> + /*
>>> + * The first 1MB may not be reported as TDX convertible
>>> + * memory. Manually exclude them as TDX memory.
>>
>> I don't like the "may not" here very much.
>>
>>> + * This is fine as the first 1MB is already reserved in
>>> + * reserve_real_mode() and won't end up to ZONE_DMA as
>>> + * free page anyway.
>>
>> ^ free pages
>>
>>> + */
>>
>> This is way too wishy-washy. The TDX module may or may not... Then, it
>> doesn't matter since reserve_real_mode() does it anyway...
>>
>> Then it goes and adds code to skip it!
>>
>>> + start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
>>> + if (start_pfn >= end_pfn)
>>> + continue;
>>
>>
>> Please just put a dang stake in the ground. If the other code deals
>> with this, then explain *why* more is needed here.
>
> How about adding the following before the 'for_each_mem_pfn_range()' loop:
>
> /*
> * Some reserved pages in memblock (e.g. kernel init code/data) are
> * freed to the page allocator directly. Use for_each_mem_pfn_range()
> * instead of for_each_free_mem_range() to make sure all pages in the
> * page allocator are covered as TDX memory.
> */
>
> It explains why for_each_mem_pfn_range() is used.

I actually wasn't asking about the for_each_mem_pfn_range() use.

> And here, before skipping the first 1MB, we add the following:
>
> /*
> * The first 1MB is not reported as TDX covertible memory.
> * Although the first 1MB is always reserved and won't end up
> * to the page allocator, it is still in memblock's memory
> * regions. Skip them manually to exclude them as TDX memory.
> */

That looks OK, with the spelling fixed.

>>> + /* Verify memory is truly TDX convertible memory */
>>> + if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
>>> + pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memorry.\n",
>>> + start_pfn << PAGE_SHIFT,
>>> + end_pfn << PAGE_SHIFT);
>>> + return -EINVAL;
>>
>> ... no 'goto err'? This leaks all the previous add_tdx_memblock()
>> structures, right?
>
> Right. It's a leftover from the old code. Will fix.
>
>>
>>> + }
>>> +
>>> + /*
>>> + * Add the memory regions as TDX memory. The regions in
>>> + * memblock has already guaranteed they are in address
>>> + * ascending order and don't overlap.
>>> + */
>>> + ret = add_tdx_memblock(start_pfn, end_pfn, nid);
>>> + if (ret)
>>> + goto err;
>>> + }
>>> +
>>> + return 0;
>>> +err:
>>> + free_tdx_memory();
>>> + return ret;
>>> +}
>>> +
>>> /*
>>> * Detect and initialize the TDX module.
>>> *
>>> @@ -357,12 +473,56 @@ static int init_tdx_module(void)
>>> if (ret)
>>> goto out;
>>>
>>> + /*
>>> + * All memory regions that can be used by the TDX module must be
>>> + * passed to the TDX module during the module initialization.
>>> + * Once this is done, all "TDX-usable" memory regions are fixed
>>> + * during module's runtime.
>>> + *
>>> + * The initial support of TDX guests only allocates memory from
>>> + * the global page allocator. To keep things simple, for now
>>> + * just make sure all pages in the page allocator are TDX memory.
>>> + *
>>> + * To achieve this, use all system memory in the core-mm at the
>>> + * time of initializing the TDX module as TDX memory, and in the
>>> + * meantime, reject any new memory in memory hot-add.
>>> + *
>>> + * This works because, in practice, all boot-time present DIMMs are TDX
>>> + * convertible memory. However if any new memory is hot-added
>>> + * before initializing the TDX module, the initialization will
>>> + * fail because that memory is not covered by any CMR.
>>> + *
>>> + * This can be enhanced in the future, i.e. by allowing adding or
>>> + * onlining non-TDX memory to a separate node, in which case the
>>> + * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
>>> + * together -- the userspace/kernel just needs to make sure pages
>>> + * for TDX guests must come from those "TDX-capable" nodes.
>>> + *
>>> + * Build the list of TDX memory regions as mentioned above so
>>> + * they can be passed to the TDX module later.
>>> + */
>>
>> This is mostly Documentation/, not a code comment. Please clean it up.
>
> Will try to clean up.
>
>>
>>> + get_online_mems();
>>> +
>>> + ret = build_tdx_memory();
>>> + if (ret)
>>> + goto out;
>>> /*
>>> * Return -EINVAL until all steps of TDX module initialization
>>> * process are done.
>>> */
>>> ret = -EINVAL;
>>> out:
>>> + /*
>>> + * Memory hotplug checks the hot-added memory region against the
>>> + * @tdx_memlist to see if the region is TDX memory.
>>> + *
>>> + * Do put_online_mems() here to make sure any modification to
>>> + * @tdx_memlist is done while holding the memory hotplug read
>>> + * lock, so that the memory hotplug path can just check the
>>> + * @tdx_memlist w/o holding the @tdx_module_lock which may cause
>>> + * deadlock.
>>> + */
>>
>> I'm honestly not following any of that.
>
> How about:
>
> /*
> * Make sure tdx_cc_memory_compatible() either sees a fixed set of
> * memory regions in @tdx_memlist, or an empty list.
> */

That's a comment for the lock side, not the unlock side. It should be:

/*
* @tdx_memlist is written here and read at memory hotplug time.
* Lock out memory hotplug code while building it.
*/

>>> + put_online_mems();
>>> return ret;
>>> }
>>>
>>> @@ -485,3 +645,26 @@ int tdx_enable(void)
>>> return ret;
>>> }
>>> EXPORT_SYMBOL_GPL(tdx_enable);
>>> +
>>> +/*
>>> + * Check whether the given range is TDX memory. Must be called between
>>> + * mem_hotplug_begin()/mem_hotplug_done().
>>> + */
>>> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
>>> +{
>>> + struct tdx_memblock *tmb;
>>> +
>>> + /* Empty list means TDX isn't enabled successfully */
>>> + if (list_empty(&tdx_memlist))
>>> + return true;
>>> +
>>> + list_for_each_entry(tmb, &tdx_memlist, list) {
>>> + /*
>>> + * The new range is TDX memory if it is fully covered
>>> + * by any TDX memory block.
>>> + */
>>> + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
>>> + return true;
>>
>> Same bug. What if the start/end_pfn range is covered by more than one
>> tdx_memblock?
>
> We may want to return true if the tdx_memblocks are contiguous.
>
> However, I don't think this will happen?
>
> tdx_memblock is from memblock, and when two memory regions in memblock are
> contiguous, they must have different nodes or different flags.
>
> My understanding is the hot-added memory region here cannot cross NUMA nodes,
> nor have different flags, correct?

I'm not sure what flags are in this context.

2022-11-24 02:01:18

by Dan Williams

[permalink] [raw]
Subject: RE: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

Kai Huang wrote:
> TDX reports a list of "Convertible Memory Region" (CMR) to indicate all
> memory regions that can possibly be used by the TDX module, but they are
> not automatically usable to the TDX module. As a step of initializing
> the TDX module, the kernel needs to choose a list of memory regions (out
> from convertible memory regions) that the TDX module can use and pass
> those regions to the TDX module. Once this is done, those "TDX-usable"
> memory regions are fixed during module's lifetime. No more TDX-usable
> memory can be added to the TDX module after that.
>
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator. To keep things simple, this initial
> implementation simply guarantees all pages in the page allocator are TDX
> memory. To achieve this, use all system memory in the core-mm at the
> time of initializing the TDX module as TDX memory, and in the meantime,
> refuse to add any non-TDX-memory in the memory hotplug.
>
> Specifically, walk through all memory regions managed by memblock and
> add them to a global list of "TDX-usable" memory regions, which is a
> fixed list after the module initialization (or empty if initialization
> fails). To reject non-TDX-memory in memory hotplug, add an additional
> check in arch_add_memory() to check whether the new region is covered by
> any region in the "TDX-usable" memory region list.
>
> Note this requires all memory regions in memblock to be TDX convertible
> memory when initializing the TDX module. This is true in practice if no
> new memory has been hot-added before initializing the TDX module, since
> in practice all boot-time present DIMMs are TDX convertible memory. If
> any new memory has been hot-added, then initializing the TDX module will
> fail because that memory region is not covered by any CMR.
>
> This can be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
>
> Note TDX assumes convertible memory is always physically present during
> machine's runtime. A non-buggy BIOS should never support hot-removal of
> any convertible memory. This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v6 -> v7:
> - Changed to use all system memory in memblock at the time of
> initializing the TDX module as TDX memory
> - Added memory hotplug support
>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/tdx.h | 3 +
> arch/x86/mm/init_64.c | 10 ++
> arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
> 4 files changed, 197 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> depends on X86_64
> depends on KVM_INTEL
> depends on X86_X2APIC
> + select ARCH_KEEP_MEMBLOCK
> help
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d688228f3151..71169ecefabf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> #ifdef CONFIG_INTEL_TDX_HOST
> bool platform_tdx_enabled(void);
> int tdx_enable(void);
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
> #else /* !CONFIG_INTEL_TDX_HOST */
> static inline bool platform_tdx_enabled(void) { return false; }
> static inline int tdx_enable(void) { return -ENODEV; }
> +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> + unsigned long end_pfn) { return true; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..900341333d7e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
> #include <asm/uv/uv.h>
> #include <asm/setup.h>
> #include <asm/ftrace.h>
> +#include <asm/tdx.h>
>
> #include "mm_internal.h"
>
> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> + /*
> + * For now if TDX is enabled, all pages in the page allocator
> + * must be TDX memory, which is a fixed set of memory regions
> + * that are passed to the TDX module. Reject the new region
> + * if it is not TDX memory to guarantee above is true.
> + */
> + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> + return -EINVAL;

arch_add_memory() does not add memory to the page allocator. For
example, memremap_pages() uses arch_add_memory() and explicitly does not
release the memory to the page allocator. This check belongs in
add_memory_resource() to prevent new memory that violates TDX from being
onlined. Hopefully there is also an option to disable TDX from the
kernel boot command line to recover memory-hotplug without needing to
boot into the BIOS to toggle TDX.

2022-11-24 02:55:17

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Wed, 2022-11-23 at 17:22 -0800, Hansen, Dave wrote:
> On 11/23/22 17:04, Huang, Kai wrote:
> > On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
> > > > +struct tdx_memblock {
> > > > + struct list_head list;
> > > > + unsigned long start_pfn;
> > > > + unsigned long end_pfn;
> > > > + int nid;
> > > > +};
> > >
> > > Why does the nid matter?
> >
> > It is used to find the node for the PAMT allocation for a given TDMR.
>
> ... which is in this patch?
>
> You can't just plop unused and unmentioned nuggets in the code. Remove
> it until it is needed.

OK. I'll move it to the PAMT allocation patch.

>
>
> > > > +/* Check whether the given pfn range is covered by any CMR or not. */
> > > > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > > > + unsigned long end_pfn)
> > > > +{
> > > > + int i;
> > > > +
> > > > + for (i = 0; i < tdx_cmr_num; i++) {
> > > > + struct cmr_info *cmr = &tdx_cmr_array[i];
> > > > + unsigned long cmr_start_pfn;
> > > > + unsigned long cmr_end_pfn;
> > > > +
> > > > + cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > > > + cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> > > > +
> > > > + if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> > > > + return true;
> > > > + }
> > >
> > > What if the pfn range overlaps two CMRs? It will never pass any
> > > individual overlap test and will return false.
> >
> > We can only return true if the two CMRs are contiguous.
> >
> > I cannot think of a reason that a reasonable BIOS would generate contiguous
> > CMRs.
>
> Because it can?
>
> We don't just try and randomly assign what we think is reasonable or
> not. First and foremost, we need to ask whether the configuration in
> question is allowed by the spec.
>
> Would it be a *valid* thing to have two adjacent CMRs? Does the TDX
> module spec disallow it?

No, the TDX module doesn't disallow it, IIUC. The spec only says they don't
overlap.

>
> > Perhaps one reason is two contiguous NUMA nodes? For this case, memblock
> > has made sure no memory region could cross NUMA nodes, so the start_pfn/end_pfn
> > here should always be within one node. Perhaps we can add a comment for this
> > case?
>
> <cough> numa=off <cough>
>
> > Anyway I am not sure whether it is worth considering the "contiguous CMRs" case.
>
> I am sure. You need to consider it.

OK.

Also, as mentioned in another reply to patch "Get information about TDX module
and TDX-capable memory", we can depend on TDH.SYS.CONFIG to return failure and
don't necessarily need to sanity-check that all memory regions are CMR memory.
This way we can just remove the above sanity-check code here.

What do you think?

>
> > > > + * and don't overlap.
> > > > + */
> > > > +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> > > > + int nid)
> > > > +{
> > > > + struct tdx_memblock *tmb;
> > > > +
> > > > + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > > > + if (!tmb)
> > > > + return -ENOMEM;
> > > > +
> > > > + INIT_LIST_HEAD(&tmb->list);
> > > > + tmb->start_pfn = start_pfn;
> > > > + tmb->end_pfn = end_pfn;
> > > > + tmb->nid = nid;
> > > > +
> > > > + list_add_tail(&tmb->list, &tdx_memlist);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void free_tdx_memory(void)
> > >
> > > This is named a bit too generically. How about free_tdx_memlist() or
> > > something?
> >
> > Will use free_tdx_memlist(). Do you want to also change build_tdx_memory() to
> > build_tdx_memlist()?
>
> Does it build a memlist?

Yes.


[...]

>
> I actually wasn't asking about the for_each_mem_pfn_range() use.
>
> > And here before skipping first 1MB, we add below:
> >
> > /*
> > * The first 1MB is not reported as TDX covertible memory.
> > * Although the first 1MB is always reserved and won't end up
> > * to the page allocator, it is still in memblock's memory
> > * regions. Skip them manually to exclude them as TDX memory.
> > */
>
> That looks OK, with the spelling fixed.

Yes "covertible" -> "convertible".


[...]

> > > > out:
> > > > + /*
> > > > + * Memory hotplug checks the hot-added memory region against the
> > > > + * @tdx_memlist to see if the region is TDX memory.
> > > > + *
> > > > + * Do put_online_mems() here to make sure any modification to
> > > > + * @tdx_memlist is done while holding the memory hotplug read
> > > > + * lock, so that the memory hotplug path can just check the
> > > > + * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> > > > + * deadlock.
> > > > + */
> > >
> > > I'm honestly not following any of that.
> >
> > How about:
> >
> > /*
> > * Make sure tdx_cc_memory_compatible() either sees a fixed set of
> > * memory regions in @tdx_memlist, or an empty list.
> > */
>
> That's a comment for the lock side, not the unlock side. It should be:
>
> /*
> * @tdx_memlist is written here and read at memory hotplug time.
> * Lock out memory hotplug code while building it.
> */

Thanks.

>
> > > > + put_online_mems();
> > > > return ret;
> > > > }
> > > >
> > > > @@ -485,3 +645,26 @@ int tdx_enable(void)
> > > > return ret;
> > > > }
> > > > EXPORT_SYMBOL_GPL(tdx_enable);
> > > > +
> > > > +/*
> > > > + * Check whether the given range is TDX memory. Must be called between
> > > > + * mem_hotplug_begin()/mem_hotplug_done().
> > > > + */
> > > > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > > > +{
> > > > + struct tdx_memblock *tmb;
> > > > +
> > > > + /* Empty list means TDX isn't enabled successfully */
> > > > + if (list_empty(&tdx_memlist))
> > > > + return true;
> > > > +
> > > > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > + /*
> > > > + * The new range is TDX memory if it is fully covered
> > > > + * by any TDX memory block.
> > > > + */
> > > > + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > + return true;
> > >
> > > Same bug. What if the start/end_pfn range is covered by more than one
> > > tdx_memblock?
> >
> > We may want to return true if tdx_memblocks are contiguous.
> >
> > However I don't think this will happen?
> >
> > tdx_memblock is from memblock, and when two memory regions in memblock are
> > contiguous, they must have different node, or flags.
> >
> > My understanding is the hot-added memory region here cannot across NUMA nodes,
> > nor have different flags, correct?
>
> I'm not sure what flags are in this context.
>

The flags in 'struct memblock_region':

enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
};

/**
* struct memblock_region - represents a memory region
* @base: base address of the region
* @size: size of the region
* @flags: memory region attributes
* @nid: NUMA node id
*/
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
enum memblock_flags flags;
#ifdef CONFIG_NUMA
int nid;
#endif
};
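
For illustration, a version of the check that tolerates a range spanning
multiple contiguous TDX memory blocks might look like this (a sketch assuming
@tdx_memlist is sorted by address and non-overlapping; not the posted code):

bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
{
	struct tdx_memblock *tmb;

	/* Empty list means TDX isn't enabled successfully */
	if (list_empty(&tdx_memlist))
		return true;

	list_for_each_entry(tmb, &tdx_memlist, list) {
		if (start_pfn < tmb->start_pfn)
			return false;		/* gap below the range start */
		if (start_pfn >= tmb->end_pfn)
			continue;		/* block entirely below the range */
		if (end_pfn <= tmb->end_pfn)
			return true;		/* remainder fully covered */
		start_pfn = tmb->end_pfn;	/* covered so far, keep walking */
	}

	return false;
}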


2022-11-24 09:56:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Wed, Nov 23, 2022 at 05:50:37PM -0800, Dan Williams wrote:

> arch_add_memory() does not add memory to the page allocator. For
> example, memremap_pages() uses arch_add_memory() and explicitly does not
> release the memory to the page allocator. This check belongs in
> add_memory_resource() to prevent new memory that violates TDX from being
> onlined. Hopefully there is also an option to disable TDX from the
> kernel boot command line to recover memory-hotplug without needing to
> boot into the BIOS to toggle TDX.

So I've been pushing for all this to either require: tdx=force on the
cmdline to boot-time enable, or delay all the memory allocation to the
first KVM/TDX instance being created.

That is, by default, none of this crud should ever trigger and consume
memory if you're not using TDX (most of us really).

(every machine I have loads kvm.ko unconditionally -- even if I never
use KVM, so kvm.ko load time is not a valid point in time to do TDX
enablement).


2022-11-24 10:11:02

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> >  
> > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >    unsigned long start_pfn = start >> PAGE_SHIFT;
> >    unsigned long nr_pages = size >> PAGE_SHIFT;
> >  
> > + /*
> > + * For now if TDX is enabled, all pages in the page allocator
> > + * must be TDX memory, which is a fixed set of memory regions
> > + * that are passed to the TDX module.  Reject the new region
> > + * if it is not TDX memory to guarantee above is true.
> > + */
> > + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > + return -EINVAL;
>
> arch_add_memory() does not add memory to the page allocator.  For
> example, memremap_pages() uses arch_add_memory() and explicitly does not
> release the memory to the page allocator. 

Indeed. Sorry I missed this.

> This check belongs in
> add_memory_resource() to prevent new memory that violates TDX from being
> onlined. 

This would require adding another 'arch_cc_memory_compatible()' to the common
add_memory_resource() (I actually had such a patch a long time ago to work with
the memremap_pages() case you mentioned above).
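
Roughly, such a check might look like this inside add_memory_resource() (a
sketch; 'arch_cc_memory_compatible()' is the hypothetical hook named above):

+	/*
+	 * Reject any new memory region that the architecture cannot
+	 * accept, e.g. non-TDX-compatible memory when TDX is enabled.
+	 */
+	if (!arch_cc_memory_compatible(PFN_DOWN(res->start),
+				       PFN_UP(res->start + resource_size(res))))
+		return -EINVAL;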

How about, instead, adding a memory notifier to the TDX code, and rejecting
online of TDX-incompatible memory (something like below)? The benefit is this
keeps the TDX code self-contained and won't pollute the common mm code:

+static int tdx_memory_notifier(struct notifier_block *nb,
+			       unsigned long action, void *v)
+{
+	struct memory_notify *mn = v;
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_OK;
+
+	/*
+	 * Not all memory is compatible with TDX. Reject
+	 * online of any incompatible memory.
+	 */
+	return tdx_cc_memory_compatible(mn->start_pfn,
+			mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+	.notifier_call = tdx_memory_notifier,
+};
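
(For completeness: the notifier would presumably be registered once during TDX
host setup, roughly like the sketch below; the init function name is made up.)

+static int __init tdx_memory_notifier_init(void)
+{
+	if (!platform_tdx_enabled())
+		return 0;
+
+	/* From now on, reject onlining of TDX-incompatible memory */
+	return register_memory_notifier(&tdx_memory_nb);
+}
+early_initcall(tdx_memory_notifier_init);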

> Hopefully there is also an option to disable TDX from the
> kernel boot command line to recover memory-hotplug without needing to
> boot into the BIOS to toggle TDX.

I am fine with that.

Hi Dave, is it OK to include such a command line option in this initial TDX
submission?

2022-11-24 10:19:46

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Thu, 2022-11-24 at 10:26 +0100, Peter Zijlstra wrote:
> On Wed, Nov 23, 2022 at 05:50:37PM -0800, Dan Williams wrote:
>
> > arch_add_memory() does not add memory to the page allocator. For
> > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > release the memory to the page allocator. This check belongs in
> > add_memory_resource() to prevent new memory that violates TDX from being
> > onlined. Hopefully there is also an option to disable TDX from the
> > kernel boot command line to recover memory-hotplug without needing to
> > boot into the BIOS to toggle TDX.
>
> So I've been pushing for all this to either require: tdx=force on the
> cmdline to boot-time enable, or delay all the memory allocation to the
> first KVM/TDX instance being created.
>
> That is, by default, none of this crud should ever trigger and consume
> memory if you're not using TDX (most of us really).
>
> (every machine I have loads kvm.ko unconditionally -- even if I never
> > use KVM, so kvm.ko load time is not a valid point in time to do TDX
> enablement).
>

Thanks for the input. I am fine with 'tdx=force'.

Although I'd like to point out that KVM will have a module parameter
'enable_tdx'.

Hi Dave, Sean, do you have any comments?
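
For reference, a boot-time opt-in along the lines Peter suggested could be a
simple __setup() handler (a sketch; the exact parameter name and semantics are
not settled in this thread):

static bool tdx_force __ro_after_init;

static int __init tdx_param_setup(char *str)
{
	/* "tdx=force": opt in to TDX module initialization at boot */
	if (str && !strcmp(str, "force"))
		tdx_force = true;
	return 1;
}
__setup("tdx=", tdx_param_setup);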

2022-11-24 10:29:54

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Wed, 2022-11-23 at 14:17 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > TDX provides increased levels of memory confidentiality and integrity.
> > This requires special hardware support for features like memory
> > encryption and storage of memory integrity checksums. Not all memory
> > satisfies these requirements.
> >
> > As a result, the TDX introduced the concept of a "Convertible Memory
>
> s/the TDX introduced/TDX introduces/
>
> > Region" (CMR). During boot, the firmware builds a list of all of the
> > memory ranges which can provide the TDX security guarantees. The list
> > of these ranges is available to the kernel by querying the TDX module.
> >
> > The TDX architecture needs additional metadata to record things like
> > which TD guest "owns" a given page of memory. This metadata essentially
> > serves as the 'struct page' for the TDX module. The space for this
> > metadata is not reserved by the hardware up front and must be allocated
> > by the kernel and given to the TDX module.
> >
> > Since this metadata consumes space, the VMM can choose whether or not to
> > allocate it for a given area of convertible memory. If it chooses not
> > to, the memory cannot receive TDX protections and can not be used by TDX
> > guests as private memory.
> >
> > For every memory region that the VMM wants to use as TDX memory, it sets
> > up a "TD Memory Region" (TDMR). Each TDMR represents a physically
> > contiguous convertible range and must also have its own physically
> > contiguous metadata table, referred to as a Physical Address Metadata
> > Table (PAMT), to track status for each page in the TDMR range.
> >
> > Unlike a CMR, each TDMR requires 1G granularity and alignment. To
> > support physical RAM areas that don't meet those strict requirements,
> > each TDMR permits a number of internal "reserved areas" which can be
> > placed over memory holes. If PAMT metadata is placed within a TDMR it
> > must be covered by one of these reserved areas.
> >
> > Let's summarize the concepts:
> >
> > CMR - Firmware-enumerated physical ranges that support TDX. CMRs are
> > 4K aligned.
> > TDMR - Physical address range which is chosen by the kernel to support
> > TDX. 1G granularity and alignment required. Each TDMR has
> > reserved areas where TDX memory holes and overlapping PAMTs can
> > be put into.
>
> s/put into/represented/
>
> > PAMT - Physically contiguous TDX metadata. One table for each page size
> > per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G
> > PAMT.
> >
> > As one step of initializing the TDX module, the kernel configures
> > TDX-usable memory regions by passing an array of TDMRs to the TDX module.
> >
> > Constructing the array of TDMRs consists of the following steps:
> >
> > 1) Create TDMRs to cover all memory regions that the TDX module can use;
>
> Slight tweak:
>
> 1) Create TDMRs to cover all memory regions that the TDX module will use
> for TD memory
>
> The TDX module "uses" more memory than strictly the TMDR's.
>
> > 2) Allocate and set up PAMT for each TDMR;
> > 3) Set up reserved areas for each TDMR.
>
> s/Set up/Designate/

Thanks. All above will be addressed.

>
> > Add a placeholder to construct TDMRs to do the above steps after all
> > TDX memory regions are verified to be truly convertible. Always free
> > TDMRs at the end of the initialization (no matter successful or not)
> > as TDMRs are only used during the initialization.
>
> The changelog here actually looks really good to me so far.
>
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 32af86e31c47..26048c6b0170 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -445,6 +445,63 @@ static int build_tdx_memory(void)
> > return ret;
> > }
> >
> > +/* Calculate the actual TDMR_INFO size */
> > +static inline int cal_tdmr_size(void)
>
> I think we can spare the bytes to add "culate" in the function name so
> we don't think these are California TDMRs.

Sure will do.

>
> > +{
> > + int tdmr_sz;
> > +
> > + /*
> > + * The actual size of TDMR_INFO depends on the maximum number
> > + * of reserved areas.
> > + *
> > + * Note: for TDX1.0 the max_reserved_per_tdmr is 16, and
> > + * TDMR_INFO size is aligned up to 512-byte. Even it is
> > + * extended in the future, it would be insane if TDMR_INFO
> > + * becomes larger than 4K. The tdmr_sz here should never
> > + * overflow.
> > + */
> > + tdmr_sz = sizeof(struct tdmr_info);
> > + tdmr_sz += sizeof(struct tdmr_reserved_area) *
> > + tdx_sysinfo.max_reserved_per_tdmr;
>
> First, I think 'tdx_sysinfo' should probably be a local variable in
> init_tdx_module() and have its address passed in here. Having global
> variables always makes it more opaque about who is initializing it.
>
> Second, if this code is making assumptions about
> 'max_reserved_per_tdmr', then let's actually add assertions or sanity
> checks. For instance:
>
> if (tdx_sysinfo.max_reserved_per_tdmr > MAX_TDMRS)
> return -1;
>
> or even:
>
> if (tdmr_sz > PAGE_SIZE)
> return -1;

I can add this.

>
> It does almost no good to just assert what the limits are in a comment.
>
> > + /*
> > + * TDX requires each TDMR_INFO to be 512-byte aligned. Always
> > + * round up TDMR_INFO size to the 512-byte boundary.
> > + */
>
> <sigh> More silly comments.
>
> The place to document this is TDMR_INFO_ALIGNMENT. If anyone wants to
> know what the alignment is, exactly, they can look at the definition.
> They don't need to be told *TWICE* what TDMR_INFO_ALIGNMENT #defines to
> in one comment.

I see. Then I think we don't even need this comment since the name of
TDMR_INFO_ALIGNMENT already implies it?

>
> > + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> > +}
> > +
> > +static struct tdmr_info *alloc_tdmr_array(int *array_sz)
> > +{
> > + /*
> > + * TDX requires each TDMR_INFO to be 512-byte aligned.
> > + * Use alloc_pages_exact() to allocate all TDMRs at once.
> > + * Each TDMR_INFO will still be 512-byte aligned since
> > + * cal_tdmr_size() always returns 512-byte aligned size.
> > + */
>
> OK, I think you're just trolling me now. Two *MORE* mentions of the
> 512-byte alignment?

I'll remove.

>
> > + *array_sz = cal_tdmr_size() * tdx_sysinfo.max_tdmrs;
> > +
> > + /*
> > + * Zero the buffer so 'struct tdmr_info::size' can be
> > + * used to determine whether a TDMR is valid.
> > + *
> > + * Note: for TDX1.0 the max_tdmrs is 64 and TDMR_INFO size
> > + * is 512-byte. Even they are extended in the future, it
> > + * would be insane if the total size exceeds 4MB.
> > + */
> > + return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
> > +}
>
> This looks massively over complicated.
>
> Get rid of this function entirely. Then create:
>
> static int tdmr_array_size(void)
> {
> return tdmr_size_single() * tdx_sysinfo.max_tdmrs;
> }
>
> The *caller* can do:
>
> tdmr_array = alloc_pages_exact(tdmr_array_size(),
> GFP_KERNEL | __GFP_ZERO);
> if (!tdmr_array) {
> ...
>
> Then the error path is:
>
> free_pages_exact(tdmr_array, tdmr_array_size());
>
> Then, there are no size pointers going back and forth. Easy peasy. I'm
> OK with a little arithmetic being repeated.

Yes. Will do.

>
> > +/*
> > + * Construct an array of TDMRs to cover all TDX memory ranges.
> > + * The actual number of TDMRs is kept to @tdmr_num.
> > + */
> > +static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> > +{
> > + /* Return -EINVAL until constructing TDMRs is done */
> > + return -EINVAL;
> > +}
> > +
> > /*
> > * Detect and initialize the TDX module.
> > *
> > @@ -454,6 +511,9 @@ static int build_tdx_memory(void)
> > */
> > static int init_tdx_module(void)
> > {
> > + struct tdmr_info *tdmr_array;
> > + int tdmr_array_sz;
> > + int tdmr_num;
>
> I tend to write these like:
>
> "tdmr_num" is the number of *a* TDMR.
>
> "nr_tdmrs" is the number of TDMRs.

Indeed. Will do.

2022-11-24 12:06:16

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 12/20] x86/virt/tdx: Create TDMRs to cover all TDX memory regions


> > +static inline u64 tdmr_start(struct tdmr_info *tdmr)
> > +{
> > + return tdmr->base;
> > +}
>
> I'm always skeptical that it's a good idea to take this in code:
>
> tdmr->base
>
> and make it this:
>
> tdmr_start(tdmr)
>
> because the helper is *LESS* compact than the open-coded form! I hope
> I'm proven wrong.

IIUC you prefer using tdmr->base directly. Will do.

>
> > +static inline u64 tdmr_end(struct tdmr_info *tdmr)
> > +{
> > + return tdmr->base + tdmr->size;
> > +}
> > +
> > /* Calculate the actual TDMR_INFO size */
> > static inline int cal_tdmr_size(void)
> > {
> > @@ -492,14 +510,98 @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz)
> > return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
> > }
> >
> > +static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array,
> > + int idx)
> > +{
> > + return (struct tdmr_info *)((unsigned long)tdmr_array +
> > + cal_tdmr_size() * idx);
> > +}
>
> FWIW, I think it's probably a bad idea to have 'struct tdmr_info *'
> types floating around since:
>
> tmdr_info_array[0]
>
> works, but:
>
> tmdr_info_array[1]
>
> will blow up in your face. It would almost make sense to have
>
> struct tdmr_info_list {
> struct tdmr_info *first_tdmr;
> }
>
> and then pass around pointers to the 'struct tdmr_info_list'. Maybe
> that's overkill, but it is kinda silly to call something an array if []
> doesn't work on it.

Then should I introduce 'struct tdmr_info_list' in the previous patch (which
allocates enough space for the tdmr_array), and add functions to allocate/free
this tdmr_info_list?
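
Presumably something along these lines (a sketch of Dave's suggestion; the
helper name tdmr_entry() is an assumption):

struct tdmr_info_list {
	struct tdmr_info *first_tdmr;
};

/* Index into the list using the real (aligned) TDMR_INFO size */
static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, int idx)
{
	return (void *)tdmr_list->first_tdmr + cal_tdmr_size() * idx;
}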

>
> > +/*
> > + * Create TDMRs to cover all TDX memory regions. The actual number
> > + * of TDMRs is set to @tdmr_num.
> > + */
> > +static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> > +{
> > + struct tdx_memblock *tmb;
> > + int tdmr_idx = 0;
> > +
> > + /*
> > + * Loop over TDX memory regions and create TDMRs to cover them.
> > + * To keep it simple, always try to use one TDMR to cover
> > + * one memory region.
> > + */
>
> This seems like it might tend to under-utilize TDMRs. I'm sure this is
> done for simplicity, but is it OK? Why is it OK? How are you sure this
> won't bite us later?

In practice the maximum number of TDMRs is 64. In reality we have never met a
machine that could result in so many memory regions, and typically 20 TDMRs is
big enough to cover them.

But if the user uses 'memmap' to deliberately create a bunch of discrete memory
regions, then we can run out of TDMRs. But I think we can blame the user in
this case.

How about adding a comment?

/*
 * In practice TDX1.0 supports 64 TDMRs, which should be big enough
 * to cover all memory regions in reality if the admin doesn't use
 * 'memmap' to create a bunch of discrete memory regions.
 */

>
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + struct tdmr_info *tdmr;
> > + u64 start, end;
> > +
> > + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
> > + start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT);
> > + end = TDMR_ALIGN_UP(tmb->end_pfn << PAGE_SHIFT);
>
> Nit: a little vertical alignment can make this much more readable:
>
> start = TDMR_ALIGN_DOWN(tmb->start_pfn << PAGE_SHIFT);
> end = TDMR_ALIGN_UP (tmb->end_pfn << PAGE_SHIFT);

Sure.

Btw, Ying suggested we can use PHYS_PFN() for

<phys> >> PAGE_SHIFT

and PFN_PHYS() for

<pfn> << PAGE_SHIFT

Should I apply them to this entire series?
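
Applied to the snippet above, that substitution would read (just a sketch):

	start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
	end   = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));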

>
> > +
> > + /*
> > + * If the current TDMR's size hasn't been initialized,
> > + * it is a new TDMR to cover the new memory region.
> > + * Otherwise, the current TDMR has already covered the
> > + * previous memory region. In the latter case, check
> > + * whether the current memory region has been fully or
> > + * partially covered by the current TDMR, since TDMR is
> > + * 1G aligned.
> > + */
>
> Again, we have a comment over a if() block that describes what the
> individual steps in the block do. *Plus* each individual step is
> *ALREADY* commented. What purpose does this comment serve?

I think the check of 'if (tdmr->size)' is still worth commenting. The last
sentence can be removed -- as you said, it is kinda duplicated with the
individual comments within the if().

>
> > + if (tdmr->size) {
> > + /*
> > + * Loop to the next memory region if the current
> > + * block has already been fully covered by the
> > + * current TDMR.
> > + */
> > + if (end <= tdmr_end(tdmr))
> > + continue;
> > +
> > + /*
> > + * If part of the current memory region has
> > + * already been covered by the current TDMR,
> > + * skip the already covered part.
> > + */
> > + if (start < tdmr_end(tdmr))
> > + start = tdmr_end(tdmr);
> > +
> > + /*
> > + * Create a new TDMR to cover the current memory
> > + * region, or the remaining part of it.
> > + */
> > + tdmr_idx++;
> > + if (tdmr_idx >= tdx_sysinfo.max_tdmrs)
> > + return -E2BIG;
> > +
> > + tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
> > + }
> > +
> > + tdmr->base = start;
> > + tdmr->size = end - start;
> > + }
> > +
> > + /* @tdmr_idx is always the index of last valid TDMR. */
> > + *tdmr_num = tdmr_idx + 1;
> > +
> > + return 0;
> > +}
>
> Seems like a positive return value could be the number of populated
> TDMRs. That would get rid of the int* argument.

Yes we can. I'll make the function return -E2BIG, or the actual number of
TDMRs.

Btw, I think it's better to print out an error message in case of -E2BIG so
the user can easily tell the reason for the failure? Something like this:

	if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
		pr_info("not enough TDMRs to cover all TDX memory regions\n");
		return -E2BIG;
	}
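
The caller could then look roughly like this (a sketch; it assumes the int*
argument is dropped so create_tdmrs() returns the TDMR count on success):

	ret = create_tdmrs(tdmr_array);
	if (ret < 0)
		goto out_free_tdmrs;
	tdmr_num = ret;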

>
> > /*
> > * Construct an array of TDMRs to cover all TDX memory ranges.
> > * The actual number of TDMRs is kept to @tdmr_num.
> > */
>
> OK, so something else allocated the 'tdmr_array' and it's being passed
> in here to fill it out. "construct" and "create" are both near synonyms
> for "allocate", which isn't even being done here.
>
> We want something here that will make it clear that this function is
> taking an already populated list of TDMRs and filling it out.
> "fill_tmdrs()" seems like it might be a better choice.
>
> This is also a place where better words can help. If the function is
> called "construct", then there's *ZERO* value in using the same word in
> the comment. Using a word that is a close synonym but that can contrast
> it with something different would be really nice, say:

Thanks for the tip!
>
> This is also a place where the calling convention can be used to add
> clarity. If you implicitly use a global variable, you have to explain
> that. But, if you pass *in* a variable, it's a lot more clear.
>
> Take this, for instance:
>
> /*
> * Take the memory referenced in @tdx_memlist and populate the
> * preallocated @tmdr_array, following all the special alignment
> * and size rules for TDMR.
> */
> static int fill_out_tdmrs(struct list_head *tdx_memlist,
> struct tdmr_info *tdmr_array)
> {
> ...
>
> That's 100% crystal clear about what's going on. You know what the
> inputs are and the outputs. You also know why this is even necessary.
> It's implied a bit, but it's because TDMRs have special rules about
> size/alignment and tdx_memlists do not.

Agreed. Let me try this out.

2022-11-24 12:07:12

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Wed, 2022-11-23 at 14:57 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > The TDX module uses additional metadata to record things like which
> > guest "owns" a given page of memory. This metadata, referred as
> > Physical Address Metadata Table (PAMT), essentially serves as the
> > 'struct page' for the TDX module. PAMTs are not reserved by hardware
> > up front. They must be allocated by the kernel and then given to the
> > TDX module.
>
> ... during module initialization.

Thanks.

>
> > TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
> > (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must
> > be a physically contiguous area from a Convertible Memory Region (CMR).
> > However, the PAMTs which track pages in one TDMR do not need to reside
> > within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with
> > any TDMR, the overlapping part must be reported as a reserved area in
> > that particular TDMR.
> >
> > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > and it may be potentially large (~1/256th of the size of the given TDMR).
> > The downside is alloc_contig_pages() may fail at runtime. One (bad)
> > mitigation is to launch a TD guest early during system boot to get those
> > PAMTs allocated at early time, but the only way to fix is to add a boot
> > option to allocate or reserve PAMTs during kernel boot.
>
> FWIW, we all agree that this is a bad permanent way to leave things.
> You can call me out here as proposing that this wart be left in place
> while this series is merged and is a detail we can work on afterword
> with new module params, boot options, Kconfig or whatever.

Sorry, do you mean to call this out in the cover letter, or in this changelog?

> > TDX only supports a limited number of reserved areas per TDMR to cover
> > both PAMTs and memory holes within the given TDMR. If many PAMTs are
> > allocated within a single TDMR, the reserved areas may not be sufficient
> > to cover all of them.
> >
> > Adopt the following policies when allocating PAMTs for a given TDMR:
> >
> > - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> > the total number of reserved areas consumed for PAMTs.
> > - Try to first allocate PAMT from the local node of the TDMR for better
> > NUMA locality.
> >
> > Also dump out how many pages are allocated for PAMTs when the TDX module
> > is initialized successfully.
>
> ... this helps answer the eternal "where did all my memory go?" questions.

Will add to the comment.

[...]


> > +/*
> > + * Pick a NUMA node on which to allocate this TDMR's metadata.
> > + *
> > + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> > + * not be. If the TDMR covers more than one node, just use the _first_
> > + * one. This can lead to small areas of off-node metadata for some
> > + * memory.
> > + */
> > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > +{
> > + struct tdx_memblock *tmb;
> > +
> > + /* Find the first memory region covered by the TDMR */
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> > + return tmb->nid;
> > + }
>
> Aha, the first use of tmb->nid! I wondered why that was there.

As you suggested I'll introduce the nid member of 'tdx_memblock' in this patch.

>
> > +
> > + /*
> > + * Fall back to allocating the TDMR's metadata from node 0 when
> > + * no TDX memory block can be found. This should never happen
> > + * since TDMRs originate from TDX memory blocks.
> > + */
> > + WARN_ON_ONCE(1);
>
> That's probably better a pr_warn() or something. A backtrace and all
> that jazz seems a bit overly dramatic for this.

How about the below?

	pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to node 0.\n",
		tdmr->base, tdmr->base + tdmr->size);




2022-11-24 12:07:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Wed, 2022-11-23 at 14:17 -0800, Dave Hansen wrote:
> First, I think 'tdx_sysinfo' should probably be a local variable in
> init_tdx_module() and have its address passed in here.  Having global
> variables always makes it more opaque about who is initializing it.

Sorry I missed to respond this.

Using local variable for 'tdx_sysinfo' will cause a build warning:

https://lore.kernel.org/lkml/[email protected]/

So instead we can have a local variable for a pointer to 'tdx_sysinfo', and
dynamically allocate memory for it.

KVM will need to use it, though. So I think eventually we will need to have a
global variable (either tdx_sysinfo itself, or a pointer to it). But this can
be done in a separate patch.

CMRs can be done in the same way (KVM doesn't need to use CMRs, but perhaps some
day we may want to expose them to /sysfs, etc).

What's your preference?

2022-11-24 22:47:01

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On Wed, 2022-11-23 at 16:28 -0800, Dave Hansen wrote:
> > On 11/20/22 16:26, Kai Huang wrote:
> > > > After the array of TDMRs and the global KeyID are configured to the TDX
> > > > module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
> > > > on all packages.
> > > >
> > > > TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
> > > > it cannot run concurrently on different CPUs. Implement a helper to
> > > > run SEAMCALL on one cpu for each package one by one, and use it to
> > > > configure the global KeyID on all packages.
> >
> > This has the same problems as SYS.LP.INIT. It's basically snake oil in
> > some TDX configurations.
> >
> > This really only needs to be done when the TDX module has memory
> > mappings on a socket for which it needs to use the "global KeyID".  
> >

I think so.

> > If
> > there's no PAMT on a socket, there are probably no allocations there to
> > speak of and no *technical* reason to call TDH.SYS.KEY.CONFIG on that
> > socket. At least none I can see.

Right, PAMT is the main user. Besides PAMT, each TDX guest has a "Trust Domain
Root" (TDR) page, and this TDR page is also encrypted using the global KeyID.

> >
> > So, let's check up on this requirement as well. This could also turn
> > out to be a real pain if all the CPUs on a socket are offline.

Yes I'll also check.

> >
> > > > Intel hardware doesn't guarantee cache coherency across different
> > > > KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
> > > > with KeyID 0) before the TDX module uses the global KeyID to access the
> > > > PAMT. Following the TDX module specification, flush cache before
> > > > configuring the global KeyID on all packages.
> >
> > I think it's probably worth an aside here about why TDX security isn't
> > dependent on this step. I *think* it boils down to the memory integrity
> > protections. If the caches aren't flushed, a dirty KeyID-0 cacheline
> > could be written back to RAM. The TDX module would come along later and
> > read the cacheline using KeyID-whatever, get an integrity mismatch,
> > machine check, and then everyone would be sad.
> >
> > Everyone is sad, but TDX security remains intact because memory
> > integrity saved us.
> >
> > Is it memory integrity or the TD bit, actually?

For this case, I think memory integrity.

The TD bit is mainly used to prevent the host kernel from being able to read
the integrity checksum (MAC) of TD memory, which could enable a potential
brute-force attack on the MAC.

Specifically, there is an attack where the host kernel establishes a shared
(non-TDX private) KeyID mapping, and repeatedly uses different keys via that
mapping to speculatively read TDX guest memory. W/o the TD bit, the hardware
always performs the integrity check and thus there's a possibility that the
host eventually finds a key which matches the integrity check.

To prevent such a case, the TD bit is added. Hardware checks for a TD bit
match first, and only performs the integrity check _after_ the TD bit matches.
So in the above case, the host kernel speculatively reading TDX memory via a
shared KeyID mapping will always get a TD bit mismatch, and thus won't be able
to mount the above attack.

> > > > Given the PAMT size can be large (~1/256th of system RAM), just use
> > > > WBINVD on all CPUs to flush.
> >
> > <sigh>
> >
> > > > Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> > > > used the global KeyID to write any PAMT. Therefore, need to use WBINVD
> > > > to flush cache before freeing the PAMTs back to the kernel. Note using
> > > > MOVDIR64B (which changes the page's associated KeyID from the old TDX
> > > > private KeyID back to KeyID 0, which is used by the kernel) to clear
> > > > PMATs isn't needed, as the KeyID 0 doesn't support integrity check.
> >
> > I hope this is covered in the code well.
> >
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index 3a032930e58a..99d1be5941a7 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -224,6 +224,46 @@ static void seamcall_on_each_cpu(struct
> > > > seamcall_ctx *sc)
> > > >   on_each_cpu(seamcall_smp_call_function, sc, true);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Call one SEAMCALL on one (any) cpu for each physical package in
> > > > + * serialized way. Return immediately in case of any error if
> > > > + * SEAMCALL fails on any cpu.
> > > > + *
> > > > + * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
> > > > + * to be atomic, but for simplicity just reuse it instead of adding
> > > > + * a new one.
> > > > + */
> > > > +static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
> > > > +{
> > > > + cpumask_var_t packages;
> > > > + int cpu, ret = 0;
> > > > +
> > > > + if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> > > > + return -ENOMEM;
> > > > +
> > > > + for_each_online_cpu(cpu) {
> > > > + if
> > > > (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
> > > > + packages))
> > > > + continue;
> > > > +
> > > > + ret = smp_call_function_single(cpu,
> > > > seamcall_smp_call_function,
> > > > + sc, true);
> > > > + if (ret)
> > > > + break;
> > > > +
> > > > + /*
> > > > + * Doesn't have to use atomic_read(), but it doesn't
> > > > + * hurt either.
> > > > + */
> >
> > I don't think you need to cover this twice. Just do it in one comment.

Right. I'll remove. I'll be more careful next time to avoid such pattern.

> >
> > > > + ret = atomic_read(&sc->err);
> > > > + if (ret)
> > > > + break;
> > > > + }
> > > > +
> > > > + free_cpumask_var(packages);
> > > > + return ret;
> > > > +}
> > > > +
> > > > static int tdx_module_init_cpus(void)
> > > > {
> > > > struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
> > > > @@ -1010,6 +1050,22 @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
> > > > return ret;
> > > > }
> > > >
> > > > +static int config_global_keyid(void)
> > > > +{
> > > > + struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
> > > > +
> > > > + /*
> > > > + * Configure the key of the global KeyID on all packages by
> > > > + * calling TDH.SYS.KEY.CONFIG on all packages in a serialized
> > > > + * way as it cannot run concurrently on different CPUs.
> > > > + *
> > > > + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
> > > > + * a recoverable error). Assume this is exceedingly rare and
> > > > + * just return error if encountered instead of retrying.
> > > > + */
> > > > + return seamcall_on_each_package_serialized(&sc);
> > > > +}
> > > > +
> > > > /*
> > > > * Detect and initialize the TDX module.
> > > > *
> > > > @@ -1098,15 +1154,44 @@ static int init_tdx_module(void)
> > > > if (ret)
> > > > goto out_free_pamts;
> > > >
> > > > + /*
> > > > + * Hardware doesn't guarantee cache coherency across different
> > > > + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> > > > + * (associated with KeyID 0) before the TDX module can use the
> > > > + * global KeyID to access the PAMT. Given PAMTs are potentially
> > > > + * large (~1/256th of system RAM), just use WBINVD on all cpus
> > > > + * to flush the cache.
> > > > + *
> > > > + * Follow the TDX spec to flush cache before configuring the
> > > > + * global KeyID on all packages.
> > > > + */
> > > > + wbinvd_on_all_cpus();
> > > > +
> > > > + /* Config the key of global KeyID on all packages */
> > > > + ret = config_global_keyid();
> > > > + if (ret)
> > > > + goto out_free_pamts;
> > > > +
> > > > /*
> > > > * Return -EINVAL until all steps of TDX module initialization
> > > > * process are done.
> > > > */
> > > > ret = -EINVAL;
> > > > out_free_pamts:
> > > > - if (ret)
> > > > + if (ret) {
> > > > + /*
> > > > + * Part of PAMT may already have been initialized by
> >
> > s/initialized/written/

Will do.

> >
> > > > + * TDX module. Flush cache before returning PAMT back
> > > > + * to the kernel.
> > > > + *
> > > > + * Note there's no need to do MOVDIR64B (which changes
> > > > + * the page's associated KeyID from the old TDX private
> > > > + * KeyID back to KeyID 0, which is used by the kernel),
> > > > + * as KeyID 0 doesn't support integrity check.
> > > > + */
> >
> > MOVDIR64B is the tiniest of implementation details and also not the only
> > way to initialize memory integrity metadata.
> >
> > Just keep this high level:
> >
> > * No need to worry about memory integrity checks here.
> > * KeyID 0 has integrity checking disabled.

Will do. Thanks.

2022-11-24 22:47:28

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 15/20] x86/virt/tdx: Reserve TDX module global KeyID

On Wed, 2022-11-23 at 15:40 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > @@ -1053,6 +1056,12 @@ static int init_tdx_module(void)
> > if (ret)
> > goto out_free_tdmrs;
> >
> > + /*
> > + * Reserve the first TDX KeyID as global KeyID to protect
> > + * TDX module metadata.
> > + */
> > + tdx_global_keyid = tdx_keyid_start;
>
> This doesn't "reserve" squat.
>
> You could argue that it "picks", "chooses", or "designates" the
> 'tdx_global_keyid', but where is the "reservation"?

Right. I'll change to use "choose".

2022-11-25 00:25:35

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On Thu, 2022-11-24 at 22:28 +0000, Huang, Kai wrote:
> > > > > Intel hardware doesn't guarantee cache coherency across different
> > > > > KeyIDs.  The kernel needs to flush PAMT's dirty cachelines (associated
> > > > > with KeyID 0) before the TDX module uses the global KeyID to access
> > > > > the
> > > > > PAMT.  Following the TDX module specification, flush cache before
> > > > > configuring the global KeyID on all packages.
> > >
> > > I think it's probably worth an aside here about why TDX security isn't
> > > dependent on this step.  I *think* it boils down to the memory integrity
> > > protections.  If the caches aren't flushed, a dirty KeyID-0 cacheline
> > > could be written back to RAM.  The TDX module would come along later and
> > > read the cacheline using KeyID-whatever, get an integrity mismatch,
> > > machine check, and then everyone would be sad.
> > >
> > > Everyone is sad, but TDX security remains intact because memory
> > > integrity saved us.
> > >
> > > Is it memory integrity or the TD bit, actually?
>
> For this case, I think memory integrity.

Sorry, thinking about it again, I was wrong. It should be the TD bit, since
the TD bit check happens before the integrity check.

So the flow is:

1) A dirty KeyID-0 cacheline is written back to RAM, which clears the TD bit.
2) The TDX module reads the PAMT using the TDX private KeyID, which causes a
TD bit mismatch and, in this case, results in poison.
3) Further consumption of the poisoned data results in #MC.

>
> The TD bit is mainly used to prevent the host kernel from being able to read
> the integrity checksum (MAC) of TD memory, which could enable a potential
> brute-force attack on the MAC.
>
> Specifically, there is an attack where the host kernel establishes a shared
> (non-TDX private) KeyID mapping, and repeatedly uses different keys via that
> mapping to speculatively read TDX guest memory. W/o the TD bit, the hardware
> always performs the integrity check and thus there's a possibility that the
> host eventually finds a key which matches the integrity check.
>
> To prevent such a case, the TD bit is added. Hardware checks for a TD bit
> match first, and only performs the integrity check _after_ the TD bit
> matches. So in the above case, the host kernel speculatively reading TDX
> memory via a shared KeyID mapping will always get a TD bit mismatch, and
> thus won't be able to mount the above attack.

2022-11-25 01:13:45

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 16/20] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID

On Wed, 2022-11-23 at 15:56 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > After the TDX-usable memory regions are constructed in an array of TDMRs
> > and the global KeyID is reserved, configure them to the TDX module using
> > TDH.SYS.CONFIG SEAMCALL. TDH.SYS.CONFIG can only be called once and can
> > be done on any logical cpu.
> >
> > Reviewed-by: Isaku Yamahata <[email protected]>
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> > arch/x86/virt/vmx/tdx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
> > arch/x86/virt/vmx/tdx/tdx.h | 2 ++
> > 2 files changed, 39 insertions(+)
> >
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index e2cbeeb7f0dc..3a032930e58a 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -979,6 +979,37 @@ static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
> > return ret;
> > }
> >
> > +static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
> > + u64 global_keyid)
> > +{
> > + u64 *tdmr_pa_array;
> > + int i, array_sz;
> > + u64 ret;
> > +
> > + /*
> > + * TDMR_INFO entries are configured to the TDX module via an
> > + * array of the physical address of each TDMR_INFO. TDX module
> > + * requires the array itself to be 512-byte aligned. Round up
> > + * the array size to 512-byte aligned so the buffer allocated
> > + * by kzalloc() will meet the alignment requirement.
> > + */
>
> Aagh. Return of (a different) 512-byte aligned structure.
>
> > + array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
> > + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
>
> Just to be clear, all that chatter about alignment is because the
> *START* of the array has to be aligned. Right?  
>

Correct.

> I see alignment for
> 'array_sz', but that's not the start of the array.
>
> tdmr_pa_array is the start of the array. Where is *THAT* aligned?

The alignment is assumed to be guaranteed based on the Documentation you quoted
below.

>
> How does rounding up the size make kzalloc() magically know how to align
> the *START* of the allocation?
>
> Because I'm actually questioning my own sanity at this point, I went and
> double-checked the docs (Documentation/core-api/memory-allocation.rst):
>
> > The address of a chunk allocated with `kmalloc` is aligned to at least
> > ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the
> > alignment is also guaranteed to be at least the respective size.
>
> Hint #1: ARCH_KMALLOC_MINALIGN is way less than 512.
> Hint #2: tdmr_num is not guaranteed to be a power of two.

tdmr_num * sizeof(u64) is not guaranteed to be a power of two.

> Hint #3: Comments don't actually affect the allocation

Sorry, I don't understand Hint #3.

Perhaps I should just allocate one page so it must be 512-byte aligned?

/*
 * TDMR_INFO entries are configured to the TDX module via an array
 * of physical address of each TDMR_INFO. The TDX module requires
 * the array itself to be 512-byte aligned. Just allocate a page
 * to use it as the array so the alignment can be guaranteed. The
 * page will be freed after TDH.SYS.CONFIG anyway.
 */


2022-11-25 01:51:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 16/20] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID

On 11/24/22 16:59, Huang, Kai wrote:
> On Wed, 2022-11-23 at 15:56 -0800, Dave Hansen wrote:
>> On 11/20/22 16:26, Kai Huang wrote:
>>> + array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
>>> + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
>>
>> Just to be clear, all that chatter about alignment is because the
>> *START* of the array has to be aligned. Right?
>
> Correct.
>
>> I see alignment for
>> 'array_sz', but that's not the start of the array.
>>
>> tdmr_pa_array is the start of the array. Where is *THAT* aligned?
>
> The alignment is assumed to be guaranteed based on the Documentation you quoted
> below.

I'm feeling kinda dense today, being Thanksgiving and all. Could you
please walk me through, step-by-step how you get kzalloc() to give you
an allocation where the start address is 512-byte aligned?

...
> Perhaps I should just allocate one page so it must be 512-byte aligned?
>
> /*
>  * TDMR_INFO entries are configured to the TDX module via an array
>  * of physical address of each TDMR_INFO. The TDX module requires
>  * the array itself to be 512-byte aligned. Just allocate a page
>  * to use it as the array so the alignment can be guaranteed. The
>  * page will be freed after TDH.SYS.CONFIG anyway.
>  */

Kai, I just plain can't believe at this point that comments like this
are still being written. I _thought_ I was very clear before that if
you have a constant, say:

#define FOO_ALIGN 512

and you want to align foo, you can just do:

foo = ALIGN(foo, FOO_ALIGN);

You don't need to mention the 512-byte alignment again. The #define is
good enough.


2022-11-25 02:32:18

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 16/20] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID

On Thu, 2022-11-24 at 17:18 -0800, Dave Hansen wrote:
> On 11/24/22 16:59, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 15:56 -0800, Dave Hansen wrote:
> > > On 11/20/22 16:26, Kai Huang wrote:
> > > > + array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
> > > > + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
> > >
> > > Just to be clear, all that chatter about alignment is because the
> > > *START* of the array has to be aligned. Right?
> >
> > Correct.
> >
> > > I see alignment for
> > > 'array_sz', but that's not the start of the array.
> > >
> > > tdmr_pa_array is the start of the array. Where is *THAT* aligned?
> >
> > The alignment is assumed to be guaranteed based on the Documentation you quoted
> > below.
>
> I'm feeling kinda dense today, being Thanksgiving and all. Could you
> please walk me through, step-by-step how you get kzalloc() to give you
> an allocation where the start address is 512-byte aligned?

Sorry, I am not good at math. I forgot that although 512 is a power of two,
n * 512 may not be a power of two.

The code works because in practice tdmr_num is never larger than 64, so
tdmr_num * sizeof(u64) cannot be larger than 512.

So if we want to handle an array size larger than 4K, should we use
alloc_pages_exact() to allocate it?
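
For reference, the alloc_pages_exact() variant is page-aligned (hence 512-byte
aligned) regardless of the array size (a sketch, not the posted code):

	array_sz = tdmr_num * sizeof(u64);
	tdmr_pa_array = alloc_pages_exact(array_sz, GFP_KERNEL | __GFP_ZERO);
	if (!tdmr_pa_array)
		return -ENOMEM;

	/* ... pass __pa(tdmr_pa_array) to TDH.SYS.CONFIG ... */

	free_pages_exact(tdmr_pa_array, array_sz);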

>
> ...
> > Perhaps I should just allocate one page so it must be 512-byte aligned?
> >
> > /*
> >  * TDMR_INFO entries are configured to the TDX module via an array
> >  * of physical address of each TDMR_INFO. The TDX module requires
> >  * the array itself to be 512-byte aligned. Just allocate a page
> >  * to use it as the array so the alignment can be guaranteed. The
> >  * page will be freed after TDH.SYS.CONFIG anyway.
> >  */
>
> Kai, I just plain can't believe at this point that comments like this
> are still being written. I _thought_ I was very clear before that if
> you have a constant, say:
>
> #define FOO_ALIGN 512
>
> and you want to align foo, you can just do:
>
> foo = ALIGN(foo, FOO_ALIGN);
>
> You don't need to mention the 512-byte alignment again. The #define is
> good enough.
>
>

My fault. I thought that by changing the code to allocate a page we wouldn't
need to do 'foo = ALIGN(foo, FOO_ALIGN)', and so the comment could be useful.

Thanks for responding on Thanksgiving day.

2022-11-25 03:01:22

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 18/20] x86/virt/tdx: Initialize all TDMRs

On Wed, 2022-11-23 at 16:42 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
> > TDX initialization.
> >
> > All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
> > the memory pages can be used by the TDX module. The time to initialize
> > TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT
> > internally initializes the PAMT entries using the global KeyID.
> >
> > To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only
> > initializes an (implementation-specific) subset of PAMT entries of one
> > TDMR in one invocation. The caller needs to call TDH.SYS.TDMR.INIT
> > iteratively until all PAMT entries of the given TDMR are initialized.
> >
> > TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they
> > are initializing different TDMRs. To keep it simple, just initialize
> > all TDMRs one by one. On a 2-socket machine with 2.2G CPUs and 64GB
> > memory, each TDH.SYS.TDMR.INIT roughly takes couple of microseconds on
> > average, and it takes roughly dozens of milliseconds to complete the
> > initialization of all TDMRs while system is idle.
>
> Any chance you could say TDH.SYS.TDMR.INIT a few more times in there? ;)

I am a bad changelog writer.

>
> Seriously, please try to trim that down. If you talk about the
> implementation in detail in the code comments, don't cover it in detail
> in the changelog too.

Yes, I will use this tip to trim it down.  Thanks.

>
> Also, this is a momentous patch, right? It's the last piece. Maybe
> worth calling that out.

Yes, this is the last step of initializing the TDX module.  It is sort of
mentioned in the first sentence of this changelog:

Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
TDX initialization.

But perhaps it can be stated more explicitly.

>
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 99d1be5941a7..9bcdb30b7a80 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1066,6 +1066,65 @@ static int config_global_keyid(void)
> > return seamcall_on_each_package_serialized(&sc);
> > }
> >
> > +/* Initialize one TDMR */
>
> Does this comment add value? Even if it does, it is better than naming
> the dang function init_one_tdmr()?

Sorry, I will try my best to avoid this type of comment in the future.

>
> > +static int init_tdmr(struct tdmr_info *tdmr)
> > +{
> > + u64 next;
> > +
> > + /*
> > + * Initializing PAMT entries might be time-consuming (in
> > + * proportion to the size of the requested TDMR). To avoid long
> > + * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
> > + * an (implementation-defined) subset of PAMT entries in one
> > + * invocation.
> > + *
> > + * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
> > + * of the requested TDMR are initialized (if next-to-initialize
> > + * address matches the end address of the TDMR).
> > + */
>
> The PAMT discussion here is, IMNHO silly. If this were about
> initializing PAMT's then it should be renamed init_pamts() and the
> SEAMCALL should be called PAMT_INIT or something. It's not, and the ABI
> is built around the TDMR and *its* addresses.

Agreed.

>
> Let's chop this comment down:
>
> /*
> * Initializing a TDMR can be time consuming. To avoid long
> * SEAMCALLs, the TDX module may only initialize a part of the
> * TDMR in each call.
> */
>
> Talk about the looping logic in the loop.

Agreed. Thanks.

>
> > + do {
> > + struct tdx_module_output out;
> > + int ret;
> > +
> > + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> > + &out);
> > + if (ret)
> > + return ret;
> > + /*
> > + * RDX contains 'next-to-initialize' address if
> > + * TDH.SYS.TDMR.INIT succeeded.
> > + */
> > + next = out.rdx;
> > + /* Allow scheduling when needed */
>
> That comment is a wee bit superfluous, don't you think?

Indeed.

>
> > + cond_resched();
>
> /* Keep making SEAMCALLs until the TDMR is done */
>
> > + } while (next < tdmr->base + tdmr->size);
> > +
> > + return 0;
> > +}
> > +
> > +/* Initialize all TDMRs */
>
> Does this comment add value?

No. Will remove.

>
> > +static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num)
> > +{
> > + int i;
> > +
> > + /*
> > + * Initialize TDMRs one-by-one for simplicity, though the TDX
> > + * architecture does allow different TDMRs to be initialized in
> > + * parallel on multiple CPUs. Parallel initialization could
> > + * be added later when the time spent in the serialized scheme
> > + * becomes a real concern.
> > + */
>
> Trim down the comment:
>
> /*
> * This operation is costly. It can be parallelized,
> * but keep it simple for now.
> */

Thanks.
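For reference, folding the suggested comments back in, the loop might end up
roughly like this (a sketch only; seamcall() and struct tdx_module_output are
the wrappers used elsewhere in this series):

static int init_tdmr(struct tdmr_info *tdmr)
{
	u64 next;

	/*
	 * Initializing a TDMR can be time consuming.  To avoid long
	 * SEAMCALLs, the TDX module may only initialize a part of the
	 * TDMR in each call.
	 */
	do {
		struct tdx_module_output out;
		int ret;

		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0,
			       NULL, &out);
		if (ret)
			return ret;

		/* RDX contains the next-to-initialize address. */
		next = out.rdx;

		cond_resched();
		/* Keep making SEAMCALLs until the TDMR is done */
	} while (next < tdmr->base + tdmr->size);

	return 0;
}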

[...]


2022-11-25 09:44:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On 24.11.22 10:06, Huang, Kai wrote:
> On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
>>>
>>> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>    unsigned long start_pfn = start >> PAGE_SHIFT;
>>>    unsigned long nr_pages = size >> PAGE_SHIFT;
>>>
>>> + /*
>>> + * For now if TDX is enabled, all pages in the page allocator
>>> + * must be TDX memory, which is a fixed set of memory regions
>>> + * that are passed to the TDX module.  Reject the new region
>>> + * if it is not TDX memory to guarantee above is true.
>>> + */
>>> + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
>>> + return -EINVAL;
>>
>> arch_add_memory() does not add memory to the page allocator.  For
>> example, memremap_pages() uses arch_add_memory() and explicitly does not
>> release the memory to the page allocator.
>
> Indeed. Sorry I missed this.
>
>> This check belongs in
>> add_memory_resource() to prevent new memory that violates TDX from being
>> onlined.
>
> This would require adding another 'arch_cc_memory_compatible()' to the common
> add_memory_resource() (I actually long time ago had such patch to work with the
> memremap_pages() you mentioned above).
>
> How about adding a memory_notifier to the TDX code, and reject online of TDX
> incompatible memory (something like below)? The benefit is this is TDX code
> self contained and won't pollute the common mm code:
>
> +static int tdx_memory_notifier(struct notifier_block *nb,
> + unsigned long action, void *v)
> +{
> + struct memory_notify *mn = v;
> +
> + if (action != MEM_GOING_ONLINE)
> + return NOTIFY_OK;
> +
> + /*
> + * Not all memory is compatible with TDX. Reject
> + * online of any incompatible memory.
> + */
> + return tdx_cc_memory_compatible(mn->start_pfn,
> + mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> + .notifier_call = tdx_memory_notifier,
> +};

With mhp_memmap_on_memory() some memory might already be touched during
add_memory() (because part of the hotplug memory is used for holding the
memmap), not when actually onlining memory. So in that case, this would
be too late.

add_memory_resource() sounds better, even though I detest such TDX
special handling in common code.

--
Thanks,

David / dhildenb

2022-11-28 08:59:09

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
> On 24.11.22 10:06, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> > > >
> > > > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > > >    unsigned long start_pfn = start >> PAGE_SHIFT;
> > > >    unsigned long nr_pages = size >> PAGE_SHIFT;
> > > >
> > > > + /*
> > > > + * For now if TDX is enabled, all pages in the page allocator
> > > > + * must be TDX memory, which is a fixed set of memory regions
> > > > + * that are passed to the TDX module.  Reject the new region
> > > > + * if it is not TDX memory to guarantee above is true.
> > > > + */
> > > > + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > > > + return -EINVAL;
> > >
> > > arch_add_memory() does not add memory to the page allocator.  For
> > > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > > release the memory to the page allocator.
> >
> > Indeed. Sorry I missed this.
> >
> > > This check belongs in
> > > add_memory_resource() to prevent new memory that violates TDX from being
> > > onlined.
> >
> > This would require adding another 'arch_cc_memory_compatible()' to the common
> > add_memory_resource() (I actually long time ago had such patch to work with the
> > memremap_pages() you mentioned above).
> >
> > How about adding a memory_notifier to the TDX code, and reject online of TDX
> > incompatible memory (something like below)? The benefit is this is TDX code
> > self contained and won't pollute the common mm code:
> >
> > +static int tdx_memory_notifier(struct notifier_block *nb,
> > + unsigned long action, void *v)
> > +{
> > + struct memory_notify *mn = v;
> > +
> > + if (action != MEM_GOING_ONLINE)
> > + return NOTIFY_OK;
> > +
> > + /*
> > + * Not all memory is compatible with TDX. Reject
> > + * online of any incompatible memory.
> > + */
> > + return tdx_cc_memory_compatible(mn->start_pfn,
> > + mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> > +}
> > +
> > +static struct notifier_block tdx_memory_nb = {
> > + .notifier_call = tdx_memory_notifier,
> > +};
>
> With mhp_memmap_on_memory() some memory might already be touched during
> add_memory() (because part of the hotplug memory is used for holding the
> memmap), not when actually onlining memory. So in that case, this would
> be too late.

Hi David,

Thanks for the review!

Right.  The memmap pages are added to the zone before online_pages(), but IIUC
those memmap pages will never be free pages and thus won't be allocated by the
page allocator, correct?  Therefore in practice there won't be a problem even
if they are not TDX compatible memory.

>
> add_memory_resource() sounds better, even though I disgust such TDX
> special handling in common code.
>

Given the above, do you still prefer changing add_memory_resource()?  If so
I'll modify it accordingly.

2022-11-28 09:25:55

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

On Wed, 2022-11-23 at 15:39 -0800, Dave Hansen wrote:
> > +static int tdmr_set_up_memory_hole_rsvd_areas(struct tdmr_info *tdmr,
> > + int *rsvd_idx)
> > +{
>
> This needs a comment.
>
> This is another case where it's confusing to be passing around 'struct
> tdmr_info *'. Is this *A* TDMR or an array?

It is for one TDMR, not an array.  All functions in the form of tdmr_xxx() are
operations on one TDMR.

But I agree it's not clear.

>
> /*
>  * Go through tdx_memlist to find holes between memory areas. If any of
>  * those holes fall within @tdmr, set up a TDMR reserved area to cover
>  * the hole.
>  */
> static int tdmr_populate_rsvd_holes(struct list_head *tdx_memlist,
> struct tdmr_info *tdmr,
> int *rsvd_idx)

Thanks!

Should I also change the below function 'tdmr_set_up_pamt_rsvd_areas()' in the
same way, i.e. to tdmr_populate_rsvd_pamts()?

Actually, there are two more functions in this patch: tdmr_set_up_rsvd_areas()
and tdmrs_set_up_rsvd_areas_all(). Should I also change them to
tdmr_populate_rsvd_areas() and tdmrs_populate_rsvd_areas_all()?

>
> > + struct tdx_memblock *tmb;
> > + u64 prev_end;
> > + int ret;
> > +
> > + /* Mark holes between memory regions as reserved */
> > + prev_end = tdmr_start(tdmr);
>
> I'm having a hard time following this, especially the mixing of
> semantics between 'prev_end' both pointing to tdmr and to tmb addresses.
>
> Here, 'prev_end' logically represents the last address which we know has
> been handled. All of the holes in the addresses below it have been
> dealt with. It is safe to set here to tdmr_start() because this
> function call is uniquely tasked with setting up reserved areas in
> 'tdmr'. So, it can safely consider any holes before tdmr_start(tdmr) as
> being handled.

Yes.

>
> But, dang, there's a lot of complexity there.
>
> First, the:
>
> /* Mark holes between memory regions as reserved */
>
> comment is misleading. It has *ZILCH* to do with the "prev_end =
> tdmr_start(tdmr);" assignment.
>
> This at least needs:
>
>    /* Start looking for reserved blocks at the beginning of the TDMR: */
>    prev_end = tdmr_start(tdmr);

Right. Sorry for the bad comment.

>
> but I also get the feeling that 'prev_end' is a crummy variable name. I
> don't have any better suggestions at the moment.
>
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + u64 start, end;
> > +
> > + start = tmb->start_pfn << PAGE_SHIFT;
> > + end = tmb->end_pfn << PAGE_SHIFT;
> > +
>
> More alignment opportunities:
>
> start = tmb->start_pfn << PAGE_SHIFT;
> end = tmb->end_pfn << PAGE_SHIFT;

Should I use PFN_PHYS()?  Then it looks like we don't need this alignment.

>
>
> > + /* Break if this region is after the TDMR */
> > + if (start >= tdmr_end(tdmr))
> > + break;
> > +
> > + /* Exclude regions before this TDMR */
> > + if (end < tdmr_start(tdmr))
> > + continue;
> > +
> > + /*
> > + * Skip if no hole exists before this region. "<=" is
> > + * used because one memory region might span two TDMRs
> > + * (when the previous TDMR covers part of this region).
> > + * In this case the start address of this region is
> > + * smaller than the start address of the second TDMR.
> > + *
> > + * Update the prev_end to the end of this region where
> > + * the possible memory hole starts.
> > + */
>
> Can't this just be:
>
> /*
> * Skip over memory areas that
> * have already been dealt with.
> */

I think so. This actually also covers the "two contiguous memory regions" case,
which isn't mentioned in the original comment.

>
> > + if (start <= prev_end) {
> > + prev_end = end;
> > + continue;
> > + }
> > +
> > + /* Add the hole before this region */
> > + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> > + start - prev_end);
> > + if (ret)
> > + return ret;
> > +
> > + prev_end = end;
> > + }
> > +
> > + /* Add the hole after the last region if it exists. */
> > + if (prev_end < tdmr_end(tdmr)) {
> > + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> > + tdmr_end(tdmr) - prev_end);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int tdmr_set_up_pamt_rsvd_areas(struct tdmr_info *tdmr, int
> > *rsvd_idx,
> > + struct tdmr_info *tdmr_array,
> > + int tdmr_num)
> > +{
> > + int i, ret;
> > +
> > + /*
> > + * If any PAMT overlaps with this TDMR, the overlapping part
> > + * must also be put to the reserved area too. Walk over all
> > + * TDMRs to find out those overlapping PAMTs and put them to
> > + * reserved areas.
> > + */
> > + for (i = 0; i < tdmr_num; i++) {
> > + struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i);
> > + unsigned long pamt_start_pfn, pamt_npages;
> > + u64 pamt_start, pamt_end;
> > +
> > + tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages);
> > + /* Each TDMR must already have PAMT allocated */
> > + WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn);
> > +
> > + pamt_start = pamt_start_pfn << PAGE_SHIFT;
> > + pamt_end = pamt_start + (pamt_npages << PAGE_SHIFT);
> > +
> > + /* Skip PAMTs outside of the given TDMR */
> > + if ((pamt_end <= tdmr_start(tdmr)) ||
> > + (pamt_start >= tdmr_end(tdmr)))
> > + continue;
> > +
> > + /* Only mark the part within the TDMR as reserved */
> > + if (pamt_start < tdmr_start(tdmr))
> > + pamt_start = tdmr_start(tdmr);
> > + if (pamt_end > tdmr_end(tdmr))
> > + pamt_end = tdmr_end(tdmr);
> > +
> > + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start,
> > + pamt_end - pamt_start);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/* Compare function called by sort() for TDMR reserved areas */
> > +static int rsvd_area_cmp_func(const void *a, const void *b)
> > +{
> > + struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
> > + struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
> > +
> > + if (r1->offset + r1->size <= r2->offset)
> > + return -1;
> > + if (r1->offset >= r2->offset + r2->size)
> > + return 1;
> > +
> > + /* Reserved areas cannot overlap. The caller should guarantee. */
> > + WARN_ON_ONCE(1);
> > + return -1;
> > +}
> > +
> > +/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
> > +static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr,
> > + struct tdmr_info *tdmr_array,
> > + int tdmr_num)
> > +{
> > + int ret, rsvd_idx = 0;
> > +
> > + /* Put all memory holes within the TDMR into reserved areas */
> > + ret = tdmr_set_up_memory_hole_rsvd_areas(tdmr, &rsvd_idx);
> > + if (ret)
> > + return ret;
> > +
> > + /* Put all (overlapping) PAMTs within the TDMR into reserved areas
> > */
> > + ret = tdmr_set_up_pamt_rsvd_areas(tdmr, &rsvd_idx, tdmr_array,
> > tdmr_num);
> > + if (ret)
> > + return ret;
> > +
> > + /* TDX requires reserved areas listed in address ascending order */
> > + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct
> > tdmr_reserved_area),
> > + rsvd_area_cmp_func, NULL);
>
> Ugh, and I guess we don't know where the PAMTs will be ordered with
> respect to holes, so sorting is the easiest way to do this.
>
> <snip>

Right.
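Pulling the suggested rename, the trimmed comments, and the PFN_PHYS()
conversion together, the reworked helper might read roughly as follows (a
sketch; tdmr_start(), tdmr_end() and tdmr_add_rsvd_area() are the helpers
from this series):

/*
 * Go through @tdx_memlist to find holes between memory areas.  If any
 * of those holes falls within @tdmr, set up a TDMR reserved area to
 * cover the hole.
 */
static int tdmr_populate_rsvd_holes(struct list_head *tdx_memlist,
				    struct tdmr_info *tdmr,
				    int *rsvd_idx)
{
	struct tdx_memblock *tmb;
	u64 prev_end;
	int ret;

	/* Start looking for reserved blocks at the beginning of the TDMR: */
	prev_end = tdmr_start(tdmr);
	list_for_each_entry(tmb, tdx_memlist, list) {
		u64 start = PFN_PHYS(tmb->start_pfn);
		u64 end = PFN_PHYS(tmb->end_pfn);

		/* Break if this region is after the TDMR */
		if (start >= tdmr_end(tdmr))
			break;

		/* Exclude regions before this TDMR */
		if (end < tdmr_start(tdmr))
			continue;

		/* Skip over memory areas that have already been dealt with. */
		if (start <= prev_end) {
			prev_end = end;
			continue;
		}

		/* Add the hole before this region */
		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
					 start - prev_end);
		if (ret)
			return ret;

		prev_end = end;
	}

	/* Add the hole after the last region if it exists. */
	if (prev_end < tdmr_end(tdmr)) {
		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
					 tdmr_end(tdmr) - prev_end);
		if (ret)
			return ret;
	}

	return 0;
}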

2022-11-28 09:47:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On 28.11.22 09:38, Huang, Kai wrote:
> On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
>> On 24.11.22 10:06, Huang, Kai wrote:
>>> On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
>>>>>
>>>>> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>>>    unsigned long start_pfn = start >> PAGE_SHIFT;
>>>>>    unsigned long nr_pages = size >> PAGE_SHIFT;
>>>>>
>>>>> + /*
>>>>> + * For now if TDX is enabled, all pages in the page allocator
>>>>> + * must be TDX memory, which is a fixed set of memory regions
>>>>> + * that are passed to the TDX module.  Reject the new region
>>>>> + * if it is not TDX memory to guarantee above is true.
>>>>> + */
>>>>> + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
>>>>> + return -EINVAL;
>>>>
>>>> arch_add_memory() does not add memory to the page allocator.  For
>>>> example, memremap_pages() uses arch_add_memory() and explicitly does not
>>>> release the memory to the page allocator.
>>>
>>> Indeed. Sorry I missed this.
>>>
>>>> This check belongs in
>>>> add_memory_resource() to prevent new memory that violates TDX from being
>>>> onlined.
>>>
>>> This would require adding another 'arch_cc_memory_compatible()' to the common
>>> add_memory_resource() (I actually long time ago had such patch to work with the
>>> memremap_pages() you mentioned above).
>>>
>>> How about adding a memory_notifier to the TDX code, and reject online of TDX
>>> incompatible memory (something like below)? The benefit is this is TDX code
>>> self contained and won't pollute the common mm code:
>>>
>>> +static int tdx_memory_notifier(struct notifier_block *nb,
>>> + unsigned long action, void *v)
>>> +{
>>> + struct memory_notify *mn = v;
>>> +
>>> + if (action != MEM_GOING_ONLINE)
>>> + return NOTIFY_OK;
>>> +
>>> + /*
>>> + * Not all memory is compatible with TDX. Reject
>>> + * online of any incompatible memory.
>>> + */
>>> + return tdx_cc_memory_compatible(mn->start_pfn,
>>> + mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
>>> +}
>>> +
>>> +static struct notifier_block tdx_memory_nb = {
>>> + .notifier_call = tdx_memory_notifier,
>>> +};
>>
>> With mhp_memmap_on_memory() some memory might already be touched during
>> add_memory() (because part of the hotplug memory is used for holding the
>> memmap), not when actually onlining memory. So in that case, this would
>> be too late.
>
> Hi David,
>
> Thanks for the review!
>
> Right. The memmap pages are added to the zone before online_pages(), but IIUC
> those memmap pages will never be free pages thus won't be allocated by the page
> allocator, correct? Therefore in practice there won't be problem even they are
> not TDX compatible memory.

But that memory will be read/written. Isn't that an issue, for example,
if memory doesn't get accepted etc?

--
Thanks,

David / dhildenb

2022-11-28 09:51:41

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Mon, 2022-11-28 at 09:43 +0100, David Hildenbrand wrote:
> On 28.11.22 09:38, Huang, Kai wrote:
> > On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
> > > On 24.11.22 10:06, Huang, Kai wrote:
> > > > On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> > > > > >
> > > > > > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > > > > >    unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > >    unsigned long nr_pages = size >> PAGE_SHIFT;
> > > > > >
> > > > > > + /*
> > > > > > + * For now if TDX is enabled, all pages in the page allocator
> > > > > > + * must be TDX memory, which is a fixed set of memory regions
> > > > > > + * that are passed to the TDX module.  Reject the new region
> > > > > > + * if it is not TDX memory to guarantee above is true.
> > > > > > + */
> > > > > > + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > > > > > + return -EINVAL;
> > > > >
> > > > > arch_add_memory() does not add memory to the page allocator.  For
> > > > > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > > > > release the memory to the page allocator.
> > > >
> > > > Indeed. Sorry I missed this.
> > > >
> > > > > This check belongs in
> > > > > add_memory_resource() to prevent new memory that violates TDX from being
> > > > > onlined.
> > > >
> > > > This would require adding another 'arch_cc_memory_compatible()' to the common
> > > > add_memory_resource() (I actually long time ago had such patch to work with the
> > > > memremap_pages() you mentioned above).
> > > >
> > > > How about adding a memory_notifier to the TDX code, and reject online of TDX
> > > > incompatible memory (something like below)? The benefit is this is TDX code
> > > > self contained and won't pollute the common mm code:
> > > >
> > > > +static int tdx_memory_notifier(struct notifier_block *nb,
> > > > + unsigned long action, void *v)
> > > > +{
> > > > + struct memory_notify *mn = v;
> > > > +
> > > > + if (action != MEM_GOING_ONLINE)
> > > > + return NOTIFY_OK;
> > > > +
> > > > + /*
> > > > + * Not all memory is compatible with TDX. Reject
> > > > + * online of any incompatible memory.
> > > > + */
> > > > + return tdx_cc_memory_compatible(mn->start_pfn,
> > > > + mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> > > > +}
> > > > +
> > > > +static struct notifier_block tdx_memory_nb = {
> > > > + .notifier_call = tdx_memory_notifier,
> > > > +};
> > >
> > > With mhp_memmap_on_memory() some memory might already be touched during
> > > add_memory() (because part of the hotplug memory is used for holding the
> > > memmap), not when actually onlining memory. So in that case, this would
> > > be too late.
> >
> > Hi David,
> >
> > Thanks for the review!
> >
> > Right. The memmap pages are added to the zone before online_pages(), but IIUC
> > those memmap pages will never be free pages thus won't be allocated by the page
> > allocator, correct? Therefore in practice there won't be problem even they are
> > not TDX compatible memory.
>
> But that memory will be read/written. Isn't that an issue, for example,
> if memory doesn't get accepted etc?
>

Sorry, I don't quite understand what "if memory doesn't get accepted" means.
Do you mean accepted by the TDX module?

Only the host kernel will read/write those memmap pages. The TDX module won't
(as they won't be allocated to be used as TDX guest memory or TDX module
metadata). So it's fine.


2022-11-28 10:05:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On 28.11.22 10:21, Huang, Kai wrote:
> On Mon, 2022-11-28 at 09:43 +0100, David Hildenbrand wrote:
>> On 28.11.22 09:38, Huang, Kai wrote:
>>> On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
>>>> On 24.11.22 10:06, Huang, Kai wrote:
>>>>> On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
>>>>>>>
>>>>>>> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>>>>>    unsigned long start_pfn = start >> PAGE_SHIFT;
>>>>>>>    unsigned long nr_pages = size >> PAGE_SHIFT;
>>>>>>>
>>>>>>> + /*
>>>>>>> + * For now if TDX is enabled, all pages in the page allocator
>>>>>>> + * must be TDX memory, which is a fixed set of memory regions
>>>>>>> + * that are passed to the TDX module.  Reject the new region
>>>>>>> + * if it is not TDX memory to guarantee above is true.
>>>>>>> + */
>>>>>>> + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
>>>>>>> + return -EINVAL;
>>>>>>
>>>>>> arch_add_memory() does not add memory to the page allocator.  For
>>>>>> example, memremap_pages() uses arch_add_memory() and explicitly does not
>>>>>> release the memory to the page allocator.
>>>>>
>>>>> Indeed. Sorry I missed this.
>>>>>
>>>>>> This check belongs in
>>>>>> add_memory_resource() to prevent new memory that violates TDX from being
>>>>>> onlined.
>>>>>
>>>>> This would require adding another 'arch_cc_memory_compatible()' to the common
>>>>> add_memory_resource() (I actually long time ago had such patch to work with the
>>>>> memremap_pages() you mentioned above).
>>>>>
>>>>> How about adding a memory_notifier to the TDX code, and reject online of TDX
>>>>> incompatible memory (something like below)? The benefit is this is TDX code
>>>>> self contained and won't pollute the common mm code:
>>>>>
>>>>> +static int tdx_memory_notifier(struct notifier_block *nb,
>>>>> + unsigned long action, void *v)
>>>>> +{
>>>>> + struct memory_notify *mn = v;
>>>>> +
>>>>> + if (action != MEM_GOING_ONLINE)
>>>>> + return NOTIFY_OK;
>>>>> +
>>>>> + /*
>>>>> + * Not all memory is compatible with TDX. Reject
>>>>> + * online of any incompatible memory.
>>>>> + */
>>>>> + return tdx_cc_memory_compatible(mn->start_pfn,
>>>>> + mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
>>>>> +}
>>>>> +
>>>>> +static struct notifier_block tdx_memory_nb = {
>>>>> + .notifier_call = tdx_memory_notifier,
>>>>> +};
>>>>
>>>> With mhp_memmap_on_memory() some memory might already be touched during
>>>> add_memory() (because part of the hotplug memory is used for holding the
>>>> memmap), not when actually onlining memory. So in that case, this would
>>>> be too late.
>>>
>>> Hi David,
>>>
>>> Thanks for the review!
>>>
>>> Right. The memmap pages are added to the zone before online_pages(), but IIUC
>>> those memmap pages will never be free pages thus won't be allocated by the page
>>> allocator, correct? Therefore in practice there won't be problem even they are
>>> not TDX compatible memory.
>>
>> But that memory will be read/written. Isn't that an issue, for example,
>> if memory doesn't get accepted etc?
>>
>
> Sorry I don't quite understand "if memory doesn't get accepted" mean. Do you
> mean accepted by the TDX module?
>
> Only the host kernel will read/write those memmap pages. The TDX module won't
> (as they won't be allocated to be used as TDX guest memory or TDX module
> metadata). So it's fine.

Oh, so we're not also considering hotplugging memory to a TDX VM that
might not be backed by TDX. Got it.

So what you want to prevent is getting !TDX memory exposed to the buddy
such that it won't accidentally get allocated for a TDX guest, correct?

In that case, memory notifiers would indeed be fine.

Thanks!

--
Thanks,

David / dhildenb

2022-11-28 10:06:54

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On Mon, 2022-11-28 at 10:26 +0100, David Hildenbrand wrote:
> On 28.11.22 10:21, Huang, Kai wrote:
> > On Mon, 2022-11-28 at 09:43 +0100, David Hildenbrand wrote:
> > > On 28.11.22 09:38, Huang, Kai wrote:
> > > > On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
> > > > > On 24.11.22 10:06, Huang, Kai wrote:
> > > > > > On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> > > > > > > >
> > > > > > > > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > > > > > > >    unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > > > >    unsigned long nr_pages = size >> PAGE_SHIFT;
> > > > > > > >
> > > > > > > > + /*
> > > > > > > > + * For now if TDX is enabled, all pages in the page allocator
> > > > > > > > + * must be TDX memory, which is a fixed set of memory regions
> > > > > > > > + * that are passed to the TDX module.  Reject the new region
> > > > > > > > + * if it is not TDX memory to guarantee above is true.
> > > > > > > > + */
> > > > > > > > + if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > > > > > > > + return -EINVAL;
> > > > > > >
> > > > > > > arch_add_memory() does not add memory to the page allocator.  For
> > > > > > > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > > > > > > release the memory to the page allocator.
> > > > > >
> > > > > > Indeed. Sorry I missed this.
> > > > > >
> > > > > > > This check belongs in
> > > > > > > add_memory_resource() to prevent new memory that violates TDX from being
> > > > > > > onlined.
> > > > > >
> > > > > > This would require adding another 'arch_cc_memory_compatible()' to the common
> > > > > > add_memory_resource() (I actually long time ago had such patch to work with the
> > > > > > memremap_pages() you mentioned above).
> > > > > >
> > > > > > How about adding a memory_notifier to the TDX code, and reject online of TDX
> > > > > > incompatible memory (something like below)? The benefit is this is TDX code
> > > > > > self contained and won't pollute the common mm code:
> > > > > >
> > > > > > +static int tdx_memory_notifier(struct notifier_block *nb,
> > > > > > + unsigned long action, void *v)
> > > > > > +{
> > > > > > + struct memory_notify *mn = v;
> > > > > > +
> > > > > > + if (action != MEM_GOING_ONLINE)
> > > > > > + return NOTIFY_OK;
> > > > > > +
> > > > > > + /*
> > > > > > + * Not all memory is compatible with TDX. Reject
> > > > > > + * online of any incompatible memory.
> > > > > > + */
> > > > > > + return tdx_cc_memory_compatible(mn->start_pfn,
> > > > > > + mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> > > > > > +}
> > > > > > +
> > > > > > +static struct notifier_block tdx_memory_nb = {
> > > > > > + .notifier_call = tdx_memory_notifier,
> > > > > > +};
> > > > >
> > > > > With mhp_memmap_on_memory() some memory might already be touched during
> > > > > add_memory() (because part of the hotplug memory is used for holding the
> > > > > memmap), not when actually onlining memory. So in that case, this would
> > > > > be too late.
> > > >
> > > > Hi David,
> > > >
> > > > Thanks for the review!
> > > >
> > > > Right. The memmap pages are added to the zone before online_pages(), but IIUC
> > > > those memmap pages will never be free pages thus won't be allocated by the page
> > > > allocator, correct? Therefore in practice there won't be problem even they are
> > > > not TDX compatible memory.
> > >
> > > But that memory will be read/written. Isn't that an issue, for example,
> > > if memory doesn't get accepted etc?
> > >
> >
> > Sorry I don't quite understand "if memory doesn't get accepted" mean. Do you
> > mean accepted by the TDX module?
> >
> > Only the host kernel will read/write those memmap pages. The TDX module won't
> > (as they won't be allocated to be used as TDX guest memory or TDX module
> > metadata). So it's fine.
>
> Oh, so we're not also considering hotplugging memory to a TDX VM that
> might not be backed by TDX. Got it.
>
> So what you want to prevent is getting !TDX memory exposed to the buddy
> such that it won't accidentally get allocated for a TDX guest, correct?

Yes correct.

>
> In that case, memory notifiers would indeed be fine.
>
> Thanks!
>

Thanks.
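For completeness, a minimal sketch of how the agreed-upon notifier could be
wired up; register_memory_notifier() from <linux/memory.h> is the standard
hook, while the function name and initcall level here are just placeholders:

static int __init tdx_hotplug_init(void)
{
	/*
	 * Reject MEM_GOING_ONLINE of TDX-incompatible memory before
	 * such pages can reach the page allocator (tdx_memory_nb is
	 * the notifier_block proposed above).
	 */
	return register_memory_notifier(&tdx_memory_nb);
}
arch_initcall(tdx_hotplug_init);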

2022-11-28 14:38:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

On 11/28/22 01:14, Huang, Kai wrote:
> On Wed, 2022-11-23 at 15:39 -0800, Dave Hansen wrote:
...
>> /*
>>  * Go through tdx_memlist to find holes between memory areas. If any of
>>  * those holes fall within @tdmr, set up a TDMR reserved area to cover
>>  * the hole.
>>  */
>> static int tdmr_populate_rsvd_holes(struct list_head *tdx_memlist,
>> struct tdmr_info *tdmr,
>> int *rsvd_idx)
>
> Thanks!
>
> Should I also change below function 'tdmr_set_up_pamt_rsvd_areas()' to, i.e.
> tdmr_populate_rsvd_pamts()?
>
> Actually, there are two more functions in this patch: tdmr_set_up_rsvd_areas()
> and tdmrs_set_up_rsvd_areas_all(). Should I also change them to
> tdmr_populate_rsvd_areas() and tdmrs_populate_rsvd_areas_all()?

I don't know. I'll look at the naming again once I see it all together.

>> but I also get the feeling that 'prev_end' is a crummy variable name. I
>> don't have any better suggestions at the moment.
>>
>>> + list_for_each_entry(tmb, &tdx_memlist, list) {
>>> + u64 start, end;
>>> +
>>> + start = tmb->start_pfn << PAGE_SHIFT;
>>> + end = tmb->end_pfn << PAGE_SHIFT;
>>> +
>>
>> More alignment opportunities:
>>
>> start = tmb->start_pfn << PAGE_SHIFT;
>> end = tmb->end_pfn << PAGE_SHIFT;
>
> Should I use PFN_PHYS()? Then looks we don't need this alignment.

Sure.
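(PFN_PHYS() from <linux/pfn.h> wraps the same shift, so the two assignments
collapse to:)

	start = PFN_PHYS(tmb->start_pfn);
	end = PFN_PHYS(tmb->end_pfn);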

2022-11-28 16:40:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On 11/24/22 04:02, Huang, Kai wrote:
> On Wed, 2022-11-23 at 14:17 -0800, Dave Hansen wrote:
>> First, I think 'tdx_sysinfo' should probably be a local variable in
>> init_tdx_module() and have its address passed in here. Having global
>> variables always makes it more opaque about who is initializing it.
> Sorry I missed to respond this.
>
> Using local variable for 'tdx_sysinfo' will cause a build warning:
>
> https://lore.kernel.org/lkml/[email protected]/

Having it be local scope is a lot more important than having it be on
stack. Just declare it local to the function but keep it off the stack.
No need to dynamically allocate it, even.

2022-11-28 16:58:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 11/24/22 03:46, Huang, Kai wrote:
> On Wed, 2022-11-23 at 14:57 -0800, Dave Hansen wrote:
>> On 11/20/22 16:26, Kai Huang wrote:
>>> Use alloc_contig_pages() since PAMT must be a physically contiguous area
>>> and it may be potentially large (~1/256th of the size of the given TDMR).
>>> The downside is alloc_contig_pages() may fail at runtime. One (bad)
>>> mitigation is to launch a TD guest early during system boot to get those
>>> PAMTs allocated at early time, but the only way to fix is to add a boot
>>> option to allocate or reserve PAMTs during kernel boot.
>>
>> FWIW, we all agree that this is a bad permanent way to leave things.
>> You can call me out here as proposing that this wart be left in place
>> while this series is merged and is a detail we can work on afterword
>> with new module params, boot options, Kconfig or whatever.
>
> Sorry do you mean to call out in the cover letter, or in this changelog?

Cover letter would be best. But, a note in the changelog that it is
imperfect and will be improved on later would also be nice.

>>> + /*
>>> + * Fall back to allocating the TDMR's metadata from node 0 when
>>> + * no TDX memory block can be found. This should never happen
>>> + * since TDMRs originate from TDX memory blocks.
>>> + */
>>> + WARN_ON_ONCE(1);
>>
>> That's probably better a pr_warn() or something. A backtrace and all
>> that jazz seems a bit overly dramatic for this.
>
> How about below?
>
> pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT
> allocation, fallback to use node 0.\n");

I actually try to make these somewhat mirror the code. For instance, if
you are searching using *just* the start TDMR address, then the message
should only talk about the start address. Also, it's not trying to find
a *node* per se. It's trying to find a 'tmb'. So, if someone wanted to
debug this problem, they would actually want to dump out the tmbs.

But, back to the loop that this message describes:

> + /* Find the first memory region covered by the TDMR */
> + list_for_each_entry(tmb, &tdx_memlist, list) {
> + if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> + return tmb->nid;
> + }

That loop is funky. It's not obvious at *all* why it even works.

1. A 'tmb' describes "real" memory. It never covers holes.
2. This code is trying to find *a* 'tmb' to place a structure in. It
needs real memory to place this, of course.
3. A 'tdmr' may include holes and may not have a 'tmb' at either its
start or end address
4. A 'tdmr' is expected to cover one or more 'tmb's. If there were no
'tmb's, then the TDMR is going to be marked as all reserved and is
effectively being wasted.
5. A 'tdmr' may cover more than one NUMA node. If this happens, it is
ok to get memory from any of those nodes for that tdmr's PAMT.

I'd include this comment on the loop:

A TDMR must cover at least part of one TMB. That TMB will end
after the TDMR begins. But, that TMB may have started before
the TDMR. Find the next 'tmb' that _ends_ after this TDMR
begins. Ignore 'tmb' start addresses. They are irrelevant.

Maybe even a little ASCII diagram about the different tmb configurations
that this can find:

| TDMR1 | TDMR2 |
|---tmb---|
|tmb|
|------tmb-------|
|------tmb-------|

I'd also include this on the function:

/*
* Locate a NUMA node which should hold the allocation of the @tdmr
* PAMT. This node will have some memory covered by the TDMR. The
* relative amount of memory covered is not considered.
*/
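Putting those pieces together, the lookup might read roughly like this (a
sketch; the function name tdmr_get_nid() and the exact warning text are
placeholders):

/*
 * Locate a NUMA node which should hold the allocation of the @tdmr
 * PAMT.  This node will have some memory covered by the TDMR.  The
 * relative amount of memory covered is not considered.
 */
static int tdmr_get_nid(struct tdmr_info *tdmr)
{
	struct tdx_memblock *tmb;

	/*
	 * A TDMR must cover at least part of one TMB.  That TMB will
	 * end after the TDMR begins.  But, that TMB may have started
	 * before the TDMR.  Find the next 'tmb' that _ends_ after
	 * this TDMR begins.  Ignore 'tmb' start addresses.  They are
	 * irrelevant.
	 */
	list_for_each_entry(tmb, &tdx_memlist, list) {
		if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
			return tmb->nid;
	}

	/*
	 * Fall back to node 0.  This should never happen since TDMRs
	 * originate from TDX memory blocks.
	 */
	pr_warn("TDMR starting at 0x%llx: no memory block found, falling back to node 0 for PAMT allocation\n",
		tdmr_start(tdmr));
	return 0;
}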


2022-11-28 22:40:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On 11/28/22 14:13, Huang, Kai wrote:
> Apologize I am not entirely sure whether I fully got your point. Do you mean
> something like below?
...

No, something like this:

static int init_tdx_module(void)
{
	static struct tdsysinfo_struct tdx_sysinfo; /* too rotund for the stack */
	...
	tdx_get_sysinfo(&tdx_sysinfo, ...);
	...

But, also, seriously, 3k on the stack is *fine* if you can shut up the
warnings. This isn't going to be a deep call stack to begin with.

2022-11-28 22:44:44

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

On Mon, 2022-11-28 at 05:18 -0800, Dave Hansen wrote:
> On 11/28/22 01:14, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 15:39 -0800, Dave Hansen wrote:
> ...
> > > /*
> > >  * Go through tdx_memlist to find holes between memory areas.  If any of
> > >  * those holes fall within @tdmr, set up a TDMR reserved area to cover
> > >  * the hole.
> > >  */
> > > static int tdmr_populate_rsvd_holes(struct list_head *tdx_memlist,
> > >       struct tdmr_info *tdmr,
> > >       int *rsvd_idx)
> >
> > Thanks!
> >
> > Should I also change below function 'tdmr_set_up_pamt_rsvd_areas()' to, i.e.
> > tdmr_populate_rsvd_pamts()?
> >
> > Actually, there are two more functions in this patch: tdmr_set_up_rsvd_areas()
> > and tdmrs_set_up_rsvd_areas_all().  Should I also change them to
> > tdmr_populate_rsvd_areas() and tdmrs_populate_rsvd_areas_all()?
>
> I don't know.  I'll look at the naming again once I see it all together.

Sure. I'll only change the one you mentioned above and keep the others for the
next version.

2022-11-28 22:58:37

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Mon, 2022-11-28 at 08:39 -0800, Dave Hansen wrote:
> On 11/24/22 03:46, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 14:57 -0800, Dave Hansen wrote:
> > > On 11/20/22 16:26, Kai Huang wrote:
> > > > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > > > and it may be potentially large (~1/256th of the size of the given TDMR).
> > > > The downside is alloc_contig_pages() may fail at runtime. One (bad)
> > > > mitigation is to launch a TD guest early during system boot to get those
> > > > PAMTs allocated at early time, but the only way to fix is to add a boot
> > > > option to allocate or reserve PAMTs during kernel boot.
> > >
> > > FWIW, we all agree that this is a bad permanent way to leave things.
> > > You can call me out here as proposing that this wart be left in place
> > > while this series is merged and is a detail we can work on afterword
> > > with new module params, boot options, Kconfig or whatever.
> >
> > Sorry do you mean to call out in the cover letter, or in this changelog?
>
> Cover letter would be best. But, a note in the changelog that it is
> imperfect and will be improved on later would also be nice.

Thanks, will do both.

>
> > > > + /*
> > > > + * Fall back to allocating the TDMR's metadata from node 0 when
> > > > + * no TDX memory block can be found. This should never happen
> > > > + * since TDMRs originate from TDX memory blocks.
> > > > + */
> > > > + WARN_ON_ONCE(1);
> > >
> > > That's probably better a pr_warn() or something. A backtrace and all
> > > that jazz seems a bit overly dramatic for this.
> >
> > How about below?
> >
> > pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT
> > allocation, fallback to use node 0.\n");
>
> I actually try to make these somewhat mirror the code. For instance, if
> you are searching using *just* the start TDMR address, then the message
> should only talk about the start address. Also, it's not trying to find
> a *node* per se. It's trying to find a 'tmb'. So, if someone wanted to
> debug this problem, they would actually want to dump out the tmbs.
>
> But, back to the loop that this message describes:
>
> > + /* Find the first memory region covered by the TDMR */
> > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > + if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> > + return tmb->nid;
> > + }
>
> That loop is funky. It's not obvious at *all* why it even works.
>
> 1. A 'tmb' describes "real" memory. It never covers holes.
> 2. This code is trying to find *a* 'tmb' to place a structure in. It
> needs real memory to place this, of course.
> 3. A 'tdmr' may include holes and may not have a 'tmb' at either its
> start or end address
> 4. A 'tdmr' is expected to cover one or more 'tmb's. If there were no
> 'tmb's, then the TDMR is going to be marked as all reserved and is
> effectively being wasted.
> 5. A 'tdmr' may cover more than one NUMA node. If this happens, it is
> ok to get memory from any of those nodes for that tdmr's PAMT.

Right.

>
> I'd include this comment on the loop:
>
> A TDMR must cover at least part of one TMB. That TMB will end
> after the TDMR begins. But, that TMB may have started before
> the TDMR. Find the next 'tmb' that _ends_ after this TDMR
> begins. Ignore 'tmb' start addresses. They are irrelevant.

Thanks. Will do.

However, I am not sure I quite understand the "next 'tmb'" part.

>
> Maybe even a little ASCII diagram about the different tmb configurations
> that this can find:
>
> | TDMR1 | TDMR2 |
> |---tmb---|
> |tmb|
> |------tmb-------| <- case 3)
> |------tmb-------| <- case 4

Thanks for the diagram!

But IIUC it seems the above cases 3) and 4) are actually not possible, since
when one TDMR is created, its end is always rounded up to the end of the TMB
it tries to cover (the rounded-up end may cover the entirety or only part of
other TMBs, though).

1G 2G
TDMR1 | TDMR2 |

|--tmb1--| |--tmb2--| |-tmb3-|

node 0 | node 1

>
> I'd also include this on the function:
>
> /*
> * Locate a NUMA node which should hold the allocation of the @tdmr
> * PAMT. This node will have some memory covered by the TDMR. The
> * relative amount of memory covered is not considered.
> */
>
>

Thanks. Will do.

2022-11-28 22:58:41

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Mon, 2022-11-28 at 07:59 -0800, Dave Hansen wrote:
> On 11/24/22 04:02, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 14:17 -0800, Dave Hansen wrote:
> > > First, I think 'tdx_sysinfo' should probably be a local variable in
> > > init_tdx_module() and have its address passed in here. Having global
> > > variables always makes it more opaque about who is initializing it.
> > Sorry I missed to respond this.
> >
> > Using local variable for 'tdx_sysinfo' will cause a build warning:
> >
> > https://lore.kernel.org/lkml/[email protected]/
>
> Having it be local scope is a lot more important than having it be on
> stack. Just declare it local to the function but keep it off the stack.
> No need to dynamically allocate it, even.

Apologies, I am not entirely sure whether I fully got your point.  Do you mean
something like below?

static struct tdsysinfo_struct tdx_sysinfo;

static int tdmr_size_single(int max_reserved_per_tdmr)
{
	...
}

static int tdmr_array_size(struct tdsysinfo_struct *sysinfo)
{
	return tdmr_size_single(sysinfo->max_reserved_per_tdmr) *
		sysinfo->max_tdmrs;
}

static int init_tdx_module(void)
{
	...
	tdx_get_sysinfo(&tdx_sysinfo, ...);
	...

	tdmr_array = alloc_pages_exact(tdmr_array_size(&tdx_sysinfo),
				       GFP_KERNEL | __GFP_ZERO);
	...

	construct_tdmrs(tdmr_array, &nr_tdmrs, &tdx_sysinfo);
	...
}


2022-11-28 23:03:33

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Mon, 2022-11-28 at 14:19 -0800, Dave Hansen wrote:
> On 11/28/22 14:13, Huang, Kai wrote:
> > Apologize I am not entirely sure whether I fully got your point. Do you mean
> > something like below?
> ...
>
> No, something like this:
>
> static int init_tdx_module(void)
> {
> static struct tdsysinfo_struct tdx_sysinfo; /* too rotund for the stack */
> ...
> tdx_get_sysinfo(&tdx_sysinfo, ...);
> ...
>
> But, also, seriously, 3k on the stack is *fine* if you can shut up the
> warnings. This isn't going to be a deep call stack to begin with.
>

Let me try to find out whether it is possible to silence the warning.  If I
cannot, then I'll use your above way.  Thanks!

2022-11-28 23:05:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 11/28/22 14:48, Huang, Kai wrote:
>> Maybe even a little ASCII diagram about the different tmb configurations
>> that this can find:
>>
> > > | TDMR1 | TDMR2 |
>> |---tmb---|
>> |tmb|
>> |------tmb-------| <- case 3)
>> |------tmb-------| <- case 4
> Thanks for the diagram!
>
> But IIUC it seems the above case 3) and 4) are actually not possible, since when
> one TDMR is created, it's end is always rounded up to the end of TMB it tries to
> cover (the rounded-up end may cover the entire or only partial of other TMBs,
> though).

OK, but at the same time, we shouldn't *STRICTLY* specialize every
single little chunk of this code to be aware of every other tiny little
implementation detail.

Let's say tomorrow's code has lots of TDMRs left, but fills up one
TDMR's reserved areas and has to "split" it. Want to bet on whether the
person that adds that patch will be able to find this code and fix it up?

Or, say that the TDMR creation algorithm changes and they're not done in
order of ascending physical address.

This code actually gets easier and more obvious if you ignore the other
details.

2022-11-28 23:06:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

On 11/28/22 14:24, Huang, Kai wrote:
>> I don't know. I'll look at the naming again once I see it all together.
> Sure. I'll only change the one you mentioned above and keep the others for the
> next version.

The alternative is that you can look at what I suggested, try and learn
something from it, and try to apply that elsewhere in the series before
you post again.

2022-11-28 23:25:18

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 14/20] x86/virt/tdx: Set up reserved areas for all TDMRs

On Mon, 2022-11-28 at 14:58 -0800, Hansen, Dave wrote:
> On 11/28/22 14:24, Huang, Kai wrote:
> > > I don't know. I'll look at the naming again once I see it all together.
> > Sure. I'll only change the one you mentioned above and keep the others for the
> > next version.
>
> The alternative is that you can look at what I suggested, try and learn
> something from it, and try to apply that elsewhere in the series before
> you post again.

Yeah surely will do.

2022-11-28 23:26:16

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Mon, 2022-11-28 at 14:56 -0800, Hansen, Dave wrote:
> On 11/28/22 14:48, Huang, Kai wrote:
> > > Maybe even a little ASCII diagram about the different tmb configurations
> > > that this can find:
> > >
> > > > TDMR1 | TDMR2 |
> > > |---tmb---|
> > > |tmb|
> > > |------tmb-------| <- case 3)
> > > |------tmb-------| <- case 4
> > Thanks for the diagram!
> >
> > But IIUC it seems the above case 3) and 4) are actually not possible, since when
> > one TDMR is created, it's end is always rounded up to the end of TMB it tries to
> > cover (the rounded-up end may cover the entire or only partial of other TMBs,
> > though).
>
> OK, but at the same time, we shouldn't *STRICTLY* specialize every
> single little chunk of this code to be aware of every other tiny little
> implementation detail.
>
> Let's say tomorrow's code has lots of TDMRs left, but fills up one
> TDMR's reserved areas and has to "split" it. Want to bet on whether the
> person that adds that patch will be able to find this code and fix it up?

Yeah good point.

>
> Or, say that the TDMR creation algorithm changes and they're not done in
> order of ascending physical address.
>
> This code actually gets easier and more obvious if you ignore the other
> details.

Agreed. Thanks.


2022-11-29 22:51:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On 11/22/22 11:33, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
>>> Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
>>> all TDX using tasks.
>> I don't think that works. As I mentioned to Thomas elsewhere, you don't
>> just need to initialize TDX on the CPUs where it is used. Before the
>> module will start working you need to initialize it on *all* the CPUs it
>> knows about. The module itself has a little counter where it tracks
>> this and will refuse to start being useful until it gets called
>> thoroughly enough.
> That's bloody terrible, that is. How are we going to make that work with
> the SMT mitigation crud that forces the SMT sibilng offline?
>
> Then the counters don't match and TDX won't work.
>
> Can we get this limitation removed and simply let the module throw a
> wobbly (error) when someone tries and use TDX without that logical CPU
> having been properly initialized?

It sounds like we can at least punt the limitation away from the OS's
purview.

There's actually a multi-step process to get a "real" TDX module loaded.
There's a fancy ACM (Authenticated Code Module) that's invoked via
GETSEC[ENTERACCS] and an intermediate module loader. That dance used to
be done in the kernel, but we talked the BIOS guys into doing it instead.

I believe these per-logical-CPU checks _can_ also be punted out of the
TDX module itself and delegated to one of these earlier module loading
phases that the BIOS drives.

I'm still a _bit_ skeptical that the checks are needed in the first
place. But, as long as they're hidden from the OS, I don't see a need
to be too cranky about it.

In the end, we could just plain stop doing the TDH.SYS.LP.INIT code in
the kernel.

Unless someone screams, I'll ask the BIOS and TDX module folks to look
into this.

2022-11-30 03:45:20

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages


On 11/21/2022 8:26 AM, Kai Huang wrote:
> After the array of TDMRs and the global KeyID are configured to the TDX
> module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
> on all packages.
>
> TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
> it cannot run concurrently on different CPUs. Implement a helper to
> run SEAMCALL on one cpu for each package one by one, and use it to
> configure the global KeyID on all packages.
>
> Intel hardware doesn't guarantee cache coherency across different
> KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
> with KeyID 0) before the TDX module uses the global KeyID to access the
> PAMT. Following the TDX module specification, flush cache before
> configuring the global KeyID on all packages.
>
> Given the PAMT size can be large (~1/256th of system RAM), just use
> WBINVD on all CPUs to flush.
>
> Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> used the global KeyID to write any PAMT. Therefore, need to use WBINVD
> to flush cache before freeing the PAMTs back to the kernel. Note using
> MOVDIR64B (which changes the page's associated KeyID from the old TDX
> private KeyID back to KeyID 0, which is used by the kernel)

It seems not accurate to say MOVDIR64B changes the page's associated KeyID.
It just uses the current KeyID for memory operations.


> to clear
> PAMTs isn't needed, as the KeyID 0 doesn't support integrity check.

For integrity check, is KeyID 0 special, or does it just have the same
behavior as a non-zero shared KeyID (if any)?

By saying "KeyID 0 doesn't support integrity check", is it because of 
the implementation of this patch set or hardware behavior?

According to the Architecture Specification 1.0 of the TDX Module
(344425-004US), a shared KeyID could also enable integrity check.


>
> Reviewed-by: Isaku Yamahata <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v6 -> v7:
> - Improved changelog and comment to explain why MOVDIR64B isn't used
> when returning PAMTs back to the kernel.
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> 2 files changed, 88 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 3a032930e58a..99d1be5941a7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -224,6 +224,46 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> on_each_cpu(seamcall_smp_call_function, sc, true);
> }
>
> +/*
> + * Call one SEAMCALL on one (any) cpu for each physical package in
> + * serialized way. Return immediately in case of any error if
> + * SEAMCALL fails on any cpu.
> + *
> + * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
> + * to be atomic, but for simplicity just reuse it instead of adding
> + * a new one.
> + */
> +static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
> +{
> + cpumask_var_t packages;
> + int cpu, ret = 0;
> +
> + if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> + return -ENOMEM;
> +
> + for_each_online_cpu(cpu) {
> + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
> + packages))
> + continue;
> +
> + ret = smp_call_function_single(cpu, seamcall_smp_call_function,
> + sc, true);
> + if (ret)
> + break;
> +
> + /*
> + * Doesn't have to use atomic_read(), but it doesn't
> + * hurt either.
> + */
> + ret = atomic_read(&sc->err);
> + if (ret)
> + break;
> + }
> +
> + free_cpumask_var(packages);
> + return ret;
> +}
> +
> static int tdx_module_init_cpus(void)
> {
> struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
> @@ -1010,6 +1050,22 @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
> return ret;
> }
>
> +static int config_global_keyid(void)
> +{
> + struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
> +
> + /*
> + * Configure the key of the global KeyID on all packages by
> + * calling TDH.SYS.KEY.CONFIG on all packages in a serialized
> + * way as it cannot run concurrently on different CPUs.
> + *
> + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
> + * a recoverable error). Assume this is exceedingly rare and
> + * just return error if encountered instead of retrying.
> + */
> + return seamcall_on_each_package_serialized(&sc);
> +}
> +
> /*
> * Detect and initialize the TDX module.
> *
> @@ -1098,15 +1154,44 @@ static int init_tdx_module(void)
> if (ret)
> goto out_free_pamts;
>
> + /*
> + * Hardware doesn't guarantee cache coherency across different
> + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> + * (associated with KeyID 0) before the TDX module can use the
> + * global KeyID to access the PAMT. Given PAMTs are potentially
> + * large (~1/256th of system RAM), just use WBINVD on all cpus
> + * to flush the cache.
> + *
> + * Follow the TDX spec to flush cache before configuring the
> + * global KeyID on all packages.
> + */
> + wbinvd_on_all_cpus();
> +
> + /* Config the key of global KeyID on all packages */
> + ret = config_global_keyid();
> + if (ret)
> + goto out_free_pamts;
> +
> /*
> * Return -EINVAL until all steps of TDX module initialization
> * process are done.
> */
> ret = -EINVAL;
> out_free_pamts:
> - if (ret)
> + if (ret) {
> + /*
> + * Part of PAMT may already have been initialized by
> + * TDX module. Flush cache before returning PAMT back
> + * to the kernel.
> + *
> + * Note there's no need to do MOVDIR64B (which changes
> + * the page's associated KeyID from the old TDX private
> + * KeyID back to KeyID 0, which is used by the kernel),
> + * as KeyID 0 doesn't support integrity check.
> + */
> + wbinvd_on_all_cpus();
> tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> - else
> + } else
> pr_info("%lu pages allocated for PAMT.\n",
> tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
> out_free_tdmrs:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index c26bab2555ca..768d097412ab 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -15,6 +15,7 @@
> /*
> * TDX module SEAMCALL leaf functions
> */
> +#define TDH_SYS_KEY_CONFIG 31
> #define TDH_SYS_INFO 32
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35

2022-11-30 08:47:43

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On Wed, 2022-11-30 at 11:35 +0800, Binbin Wu wrote:
> On 11/21/2022 8:26 AM, Kai Huang wrote:
> > After the array of TDMRs and the global KeyID are configured to the TDX
> > module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
> > on all packages.
> >
> > TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
> > it cannot run concurrently on different CPUs. Implement a helper to
> > run SEAMCALL on one cpu for each package one by one, and use it to
> > configure the global KeyID on all packages.
> >
> > Intel hardware doesn't guarantee cache coherency across different
> > KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
> > with KeyID 0) before the TDX module uses the global KeyID to access the
> > PAMT. Following the TDX module specification, flush cache before
> > configuring the global KeyID on all packages.
> >
> > Given the PAMT size can be large (~1/256th of system RAM), just use
> > WBINVD on all CPUs to flush.
> >
> > Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> > used the global KeyID to write any PAMT. Therefore, need to use WBINVD
> > to flush cache before freeing the PAMTs back to the kernel. Note using
> > MOVDIR64B (which changes the page's associated KeyID from the old TDX
> > private KeyID back to KeyID 0, which is used by the kernel)
>
> It seems inaccurate to say that MOVDIR64B changes the page's associated KeyID.
> It just uses the current KeyID for memory operations.

The "write" to the memory changes the page's associated KeyID to the KeyID that
does the "write". A more accurate expression perhaps should be MOVDIR64B +
MFENSE, but I think it doesn't matter in changelog.
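
For illustration only (this is not in the series), a MOVDIR64B-based page
clear would look roughly like the untested sketch below; it assumes the
movdir64b() helper from <asm/special_insns.h>, and the function name is
hypothetical:

static void movdir64b_clear_page(void *page)
{
	static const u8 zeros[64] = {};
	unsigned long i;

	/*
	 * MOVDIR64B does a 64-byte direct store; the destination must be
	 * 64-byte aligned, which 'page + i' is for a page-aligned page.
	 */
	for (i = 0; i < PAGE_SIZE; i += 64)
		movdir64b(page + i, zeros);

	/* Direct stores are weakly ordered, so fence before the page is reused */
	mb();
}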

>
>
> > to clear
> > PAMTs isn't needed, as KeyID 0 doesn't support integrity check.
>
> For integrity checking, is KeyID 0 special, or does it just have the same
> behavior as a non-zero shared KeyID (if any)?

KeyID 0 is the TME KeyID. It is special: hardware treats KeyID 0 differently.

>
> By saying "KeyID 0 doesn't support integrity check", is that because of
> the implementation of this patch set or hardware behavior?

It's hardware behaviour.

>
> According to the Architecture Specification 1.0 of the TDX Module
> (344425-004US), a shared KeyID could also enable integrity checking.
>

KeyID 0 is different from a non-zero MKTME shared KeyID. For example, you cannot
do PCONFIG on KeyID 0.

2022-11-30 11:32:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 06/20] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Nov 29 2022 at 13:40, Dave Hansen wrote:
> On 11/22/22 11:33, Peter Zijlstra wrote:
>> Can we get this limitation removed and simply let the module throw a
>> wobbly (error) when someone tries to use TDX without that logical CPU
>> having been properly initialized?
>
> It sounds like we can at least punt the limitation away from the OS's
> purview.
>
> There's actually a multi-step process to get a "real" TDX module loaded.
> There's a fancy ACM (Authenticated Code Module) that's invoked via
> GETSEC[ENTERACCS] and an intermediate module loader. That dance used to
> be done in the kernel, but we talked the BIOS guys into doing it instead.
>
> I believe these per-logical-CPU checks _can_ also be punted out of the
> TDX module itself and delegated to one of these earlier module loading
> phases that the BIOS drives.
>
> I'm still a _bit_ skeptical that the checks are needed in the first
> place. But, as long as they're hidden from the OS, I don't see a need
> to be too cranky about it.

Right.

> In the end, we could just plain stop doing the TDH.SYS.LP.INIT code in
> the kernel.

Which in turn makes all the problems we discussed go away.

> Unless someone screams, I'll ask the BIOS and TDX module folks to look
> into this.

Yes, please.

Thanks,

tglx

2022-11-30 14:34:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On Wed, Nov 30, 2022 at 08:34:46AM +0000, Huang, Kai wrote:
> On Wed, 2022-11-30 at 11:35 +0800, Binbin Wu wrote:
> > On 11/21/2022 8:26 AM, Kai Huang wrote:
> > > After the array of TDMRs and the global KeyID are configured to the TDX
> > > module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
> > > on all packages.
> > >
> > > TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
> > > it cannot run concurrently on different CPUs. Implement a helper to
> > > run SEAMCALL on one cpu for each package one by one, and use it to
> > > configure the global KeyID on all packages.
> > >
> > > Intel hardware doesn't guarantee cache coherency across different
> > > KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
> > > with KeyID 0) before the TDX module uses the global KeyID to access the
> > > PAMT. Following the TDX module specification, flush cache before
> > > configuring the global KeyID on all packages.
> > >
> > > Given the PAMT size can be large (~1/256th of system RAM), just use
> > > WBINVD on all CPUs to flush.
> > >
> > > Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> > > used the global KeyID to write any PAMT. Therefore, need to use WBINVD
> > > to flush cache before freeing the PAMTs back to the kernel. Note using
> > > MOVDIR64B (which changes the page's associated KeyID from the old TDX
> > > private KeyID back to KeyID 0, which is used by the kernel)
> >
> > It seems inaccurate to say that MOVDIR64B changes the page's associated KeyID.
> > It just uses the current KeyID for memory operations.
>
> The "write" to the memory changes the page's associated KeyID to the KeyID that
> does the "write". A more accurate expression perhaps should be MOVDIR64B +
> MFENSE, but I think it doesn't matter in changelog.

MOVDIR64B changes the KeyID for the cache line, not the page. Integrity is
tracked on a per-cacheline basis.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-30 15:34:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On 11/30/22 00:34, Huang, Kai wrote:
> On Wed, 2022-11-30 at 11:35 +0800, Binbin Wu wrote:
>> On 11/21/2022 8:26 AM, Kai Huang wrote:
>>> After the array of TDMRs and the global KeyID are configured to the TDX
>>> module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
>>> on all packages.
>>>
>>> TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package. And
>>> it cannot run concurrently on different CPUs. Implement a helper to
>>> run SEAMCALL on one cpu for each package one by one, and use it to
>>> configure the global KeyID on all packages.
>>>
>>> Intel hardware doesn't guarantee cache coherency across different
>>> KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated
>>> with KeyID 0) before the TDX module uses the global KeyID to access the
>>> PAMT. Following the TDX module specification, flush cache before
>>> configuring the global KeyID on all packages.
>>>
>>> Given the PAMT size can be large (~1/256th of system RAM), just use
>>> WBINVD on all CPUs to flush.
>>>
>>> Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
>>> used the global KeyID to write any PAMT. Therefore, need to use WBINVD
>>> to flush cache before freeing the PAMTs back to the kernel. Note using
>>> MOVDIR64B (which changes the page's associated KeyID from the old TDX
>>> private KeyID back to KeyID 0, which is used by the kernel)
>>
>> It seems inaccurate to say that MOVDIR64B changes the page's associated KeyID.
>> It just uses the current KeyID for memory operations.
>
> The "write" to the memory changes the page's associated KeyID to the KeyID that
> does the "write". A more accurate expression perhaps should be MOVDIR64B +
> MFENSE, but I think it doesn't matter in changelog.

Just delete it from the changelog. It's a distraction. I'm not even
sure it's *necessary* to do any memory content conversion after the TDX
module has written gunk.

There won't be any integrity issues because integrity errors don't do
anything for KeyID-0 (no #MC).

I _think_ the reads of the page using KeyID-0 will see abort page
semantics. That's *FINE*.

2022-11-30 17:42:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

On 11/20/22 16:26, Kai Huang wrote:
> After the array of TDMRs and the global KeyID are configured to the TDX
> module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
> on all packages.

I want to circle back to this because it potentially has the same class
of issue that TDH.SYS.LP.INIT had. So, here's some more background
followed by the key question: is TDH.SYS.KEY.CONFIG too strict? Should
we explore relaxing it?

Here's the very long-winded way of asking the same thing:

This key is used to protect TDX module memory which is too large to fit
into the limited range-register-protected (SMRR) areas that most of the
module uses. Right now, that metadata includes the Physical Address
Metadata Tables (PAMT) and "TD Root" (TDR) pages. Using this "global
KeyID" provides stronger isolation and integrity protection for these
structures than is provided by KeyID-0.

The "global KeyID" only strictly needs to be programmed into a memory
controller if a PAMT or TDR page is allocated in memory attached to
that controller. However, the TDX module currently requires that
TDH.SYS.KEY.CONFIG be executed on one processor in each package. This
is true even if there is no TDX Memory Region (TDMR) attached to that
package.

This was likely done for simplicity in the TDX module. It currently has
no NUMA awareness (or even trusted NUMA metadata) and no ability to
correlate processor packages with the memory attached to their memory
controllers.

The TDH.SYS.KEY.CONFIG design is actually pretty similar to Kirill's
MKTME implementation[1]. Basically blast the KeyID configuration out to
one processor in each package, regardless of whether the KeyID will ever
get used on that package.

While this requirement from the TDX module is _slightly_ too strict, I'm
not quite as worried about it as I was about the *super* strict
TDH.SYS.LP.INIT requirements. It's a lot harder and more rare to have
an entire package of CPUs unavailable versus a single logical CPU.
There is, for instance, no side-channel mitigation that disables an
entire package worth of CPUs. I'm not even sure if we allow an entire
package worth of NOHZ_FULL-indisposed processors.

I'm happy to go run the same drill for TDH.SYS.KEY.CONFIG that we did
for TDH.SYS.LP.INIT. Basically, can we relax the too-strict
restrictions? But, I'm not sure anyone will ever reap a practical
benefit from it. I'm tempted to just leave it as-is.

Does anyone feel differently?

1.
https://lore.kernel.org/lkml/[email protected]/T/#m936f260a345284687f8e929675f68f3d514725f5


2022-11-30 21:04:17

by Kai Huang

[permalink] [raw]
Subject: RE: [PATCH v7 17/20] x86/virt/tdx: Configure global KeyID on all packages

> On 11/30/22 00:34, Huang, Kai wrote:
> > On Wed, 2022-11-30 at 11:35 +0800, Binbin Wu wrote:
> >> On 11/21/2022 8:26 AM, Kai Huang wrote:
> >>> After the array of TDMRs and the global KeyID are configured to the
> >>> TDX module, use TDH.SYS.KEY.CONFIG to configure the key of the
> >>> global KeyID on all packages.
> >>>
> >>> TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package.
> >>> And it cannot run concurrently on different CPUs. Implement a
> >>> helper to run SEAMCALL on one cpu for each package one by one, and
> >>> use it to configure the global KeyID on all packages.
> >>>
> >>> Intel hardware doesn't guarantee cache coherency across different
> >>> KeyIDs. The kernel needs to flush PAMT's dirty cachelines
> >>> (associated with KeyID 0) before the TDX module uses the global
> >>> KeyID to access the PAMT. Following the TDX module specification,
> >>> flush cache before configuring the global KeyID on all packages.
> >>>
> >>> Given the PAMT size can be large (~1/256th of system RAM), just use
> >>> WBINVD on all CPUs to flush.
> >>>
> >>> Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already
> >>> have used the global KeyID to write any PAMT. Therefore, need to
> >>> use WBINVD to flush cache before freeing the PAMTs back to the
> >>> kernel. Note using MOVDIR64B (which changes the page's associated
> >>> KeyID from the old TDX private KeyID back to KeyID 0, which is used
> >>> by the kernel)
> >>
> >> It seems inaccurate to say that MOVDIR64B changes the page's associated
> >> KeyID. It just uses the current KeyID for memory operations.
> >
> > The "write" to the memory changes the page's associated KeyID to the
> > KeyID that does the "write". A more accurate expression perhaps
> > should be MOVDIR64B + MFENSE, but I think it doesn't matter in changelog.
>
> Just delete it from the changelog. It's a distraction. I'm not even sure it's
> *necessary* to do any memory content conversion after the TDX module has
> written gunk.
>
> There won't be any integrity issues because integrity errors don't do anything for
> KeyID-0 (no #MC).
>
> I _think_ the reads of the page using KeyID-0 will see abort page semantics.
> That's *FINE*.

I'll remove this from the changelog. Thanks!

2022-11-30 22:35:37

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

On 11/24/22 02:02, Huang, Kai wrote:
> Thanks for the input. I am fine with 'tdx=force'.
>
> Although, I'd like to point out KVM will have a module parameter 'enable_tdx'.
>
> Hi Dave, Sean, do you have any comments?

That's fine. Just keep it out of the initial implementation. Ignore it
for now.

2022-12-02 11:33:38

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On Wed, 2022-11-23 at 22:53 +0000, Huang, Kai wrote:
> > > > > > > > > +
> > > > > > > > > +   /*
> > > > > > > > > +    * trim_empty_cmrs() updates the actual number of CMRs by
> > > > > > > > > +    * dropping all tail empty CMRs.
> > > > > > > > > +    */
> > > > > > > > > +   return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> > > > > > > > > +}
> > > > > > >
> > > > > > > Why does this both need to respect the "tdx_cmr_num = out.r9"
> > > > > > > value
> > > > > > > *and* trim the empty ones?  Couldn't it just ignore the
> > > > > > > "tdx_cmr_num =
> > > > > > > out.r9" value and just trim the empty ones either way?  It's not
> > > > > > > like
> > > > > > > there is a billion of them.  It would simplify the code for sure.
> > > > >
> > > > > OK.  Since the spec says MAX_CMRs is 32, I can use 32 instead of
> > > > > reading it out from R9.
> > >
> > > But then you still have the "trimming" code.  Why not just trust "r9"
> > > and then axe all the trimming code?  Heck, and most of the sanity checks.
> > >
> > > This code could be a *lot* smaller.
>
> As I said, the only problem is that there might be empty CMRs at the tail of
> the cmr_array[] following one or more valid CMRs.

Hi Dave,

Probably I forgot to mention that "r9" in practice always returns 32, so there
will be empty CMRs at the tail of the cmr_array[].

>
> But we could also do nothing here and just skip empty CMRs when comparing the
> memory regions against them (in the next patch).
>
> Or, we don't even need to explicitly check the memory regions against the
> CMRs. If the memory regions that we provide in the TDMRs don't fall into a
> CMR, then TDH.SYS.CONFIG will fail. We can just depend on the SEAMCALL to do
> that.

Sorry to ping, but do you have any comments here?

How about we just don't do any check of TDX memory regions against CMRs, and
just let the TDH.SYS.CONFIG SEAMCALL determine that?

2022-12-02 11:34:00

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On Tue, 2022-11-22 at 15:39 -0800, Dave Hansen wrote:
> > +#define TDSYSINFO_STRUCT_SIZE 1024
> > +#define TDSYSINFO_STRUCT_ALIGNMENT 1024
> > +
> > +struct tdsysinfo_struct {
> > + /* TDX-SEAM Module Info */
> > + u32 attributes;
> > + u32 vendor_id;
> > + u32 build_date;
> > + u16 build_num;
> > + u16 minor_version;
> > + u16 major_version;
> > + u8 reserved0[14];
> > + /* Memory Info */
> > + u16 max_tdmrs;
> > + u16 max_reserved_per_tdmr;
> > + u16 pamt_entry_size;
> > + u8 reserved1[10];
> > + /* Control Struct Info */
> > + u16 tdcs_base_size;
> > + u8 reserved2[2];
> > + u16 tdvps_base_size;
> > + u8 tdvps_xfam_dependent_size;
> > + u8 reserved3[9];
> > + /* TD Capabilities */
> > + u64 attributes_fixed0;
> > + u64 attributes_fixed1;
> > + u64 xfam_fixed0;
> > + u64 xfam_fixed1;
> > + u8 reserved4[32];
> > + u32 num_cpuid_config;
> > + /*
> > + * The actual number of CPUID_CONFIG depends on above
> > + * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
> > + * is 1024B defined by TDX architecture.  Use a union with
> > + * specific padding to make 'sizeof(struct tdsysinfo_struct)'
> > + * equal to 1024.
> > + */
> > + union {
> > + struct cpuid_config cpuid_configs[0];
> > + u8 reserved5[892];
> > + };
>
> Can you double check what the "right" way to do variable arrays is these
> days?  I thought the [0] method was discouraged.
>
> Also, it isn't *really* 892 bytes of reserved space, right?  Anything
> that's not cpuid_configs[] is reserved, I presume.  Could you try to be
> more precise there?

Hi Dave,

I did some searching, and I think we should use the DECLARE_FLEX_ARRAY() macro?

And also, to address your concern that not all 892 bytes are reserved, how
about the below:

union {
- struct cpuid_config cpuid_configs[0];
- u8 reserved5[892];
+ DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
+ u8 padding[892];
};
} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);

The goal is to make the size of 'struct tdsysinfo_struct' 1024B so we can use a
static variable for it, while at the same time it still has 1024B (enough
space) for TDH.SYS.INFO to write to.
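
(For reference: if I read <linux/stddef.h> correctly, DECLARE_FLEX_ARRAY()
wraps the flexible array member in an anonymous struct, which is what makes
it legal as a member of a union; roughly:

#define __DECLARE_FLEX_ARRAY(TYPE, NAME) \
	struct { \
		struct { } __empty_ ## NAME; \
		TYPE NAME[]; \
	}

so the size of the union above would still be determined by the 'padding'
member.)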

2022-12-02 17:36:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On 12/2/22 03:19, Huang, Kai wrote:
> Probably I forgot to mention that "r9" in practice always returns 32, so there
> will be empty CMRs at the tail of the cmr_array[].

Right, so the r9 value is basically useless. I bet the code gets
simpler if you just ignore it.

>> But we could also do nothing here and just skip empty CMRs when comparing the
>> memory regions against them (in the next patch).
>>
>> Or, we don't even need to explicitly check the memory regions against the
>> CMRs. If the memory regions that we provide in the TDMRs don't fall into a
>> CMR, then TDH.SYS.CONFIG will fail. We can just depend on the SEAMCALL to do
>> that.
>
> Sorry to ping, but do you have any comments here?
>
> How about we just don't do any check of TDX memory regions against CMRs, and
> just let the TDH.SYS.CONFIG SEAMCALL determine that?

Right, if we screw it up TDH.SYS.CONFIG SEAMCALL will fail. We don't
need to add more code to detect that failure ourselves. TDX is screwed
either way.

2022-12-02 17:37:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On 12/2/22 03:11, Huang, Kai wrote:
> And also, to address your concern that not all 892 bytes are reserved, how
> about the below:
>
> union {
> - struct cpuid_config cpuid_configs[0];
> - u8 reserved5[892];
> + DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
> + u8 padding[892];
> };
> } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
>
> The goal is to make the size of 'struct tdsysinfo_struct' 1024B so we can use
> a static variable for it, while at the same time it still has 1024B (enough
> space) for TDH.SYS.INFO to write to.

I just don't like the open-coded sizes.

For instance, wouldn't it be great if you didn't have to know the size
of *ANYTHING* else to properly size the '892'?

Maybe we just need some helpers to hide the gunk:

#define DECLARE_PADDED_STRUCT(type, name, alignment) \
struct type##_padded { \
union { \
struct type name; \
u8 padding[alignment]; \
}; \
} name##_padded;

#define PADDED_STRUCT(name) (name##_padded.name)

That can get used like this:

DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
TDSYSINFO_STRUCT_ALIGNMENT);

struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
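
If you want a build-time sanity check on the padding math, something like
this (illustrative only, using the names above plus <linux/build_bug.h>;
TDSYSINFO_STRUCT_SIZE is the 1024 from the spec) would catch a size mismatch:

static_assert(sizeof(struct tdsysinfo_struct_padded) ==
	      TDSYSINFO_STRUCT_SIZE);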

2022-12-02 22:09:02

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On Fri, 2022-12-02 at 09:06 -0800, Dave Hansen wrote:
> On 12/2/22 03:11, Huang, Kai wrote:
> > And also to address you concern that not all 892 bytes are reserved, how about
> > below:
> >
> > union {
> > - struct cpuid_config cpuid_configs[0];
> > - u8 reserved5[892];
> > + DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
> > + u8 padding[892];
> > };
> > } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
> >
> > The goal is to make the size of 'struct tdsysinfo_struct' 1024B so we can
> > use a static variable for it, while at the same time it still has 1024B
> > (enough space) for TDH.SYS.INFO to write to.
>
> I just don't like the open-coded sizes.
>
> For instance, wouldn't it be great if you didn't have to know the size
> of *ANYTHING* else to properly size the '892'?
>
> Maybe we just need some helpers to hide the gunk:
>
> #define DECLARE_PADDED_STRUCT(type, name, alignment) \
> struct type##_padded { \
> union { \
> struct type name; \
> u8 padding[alignment]; \
> }; \
> } name##_padded;
>
> #define PADDED_STRUCT(name) (name##_padded.name)
>
> That can get used like this:
>
> DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> TDSYSINFO_STRUCT_ALIGNMENT);
>
> struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);

Thanks. Will try out this way.

2022-12-02 22:57:42

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 09/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory

On Fri, 2022-12-02 at 09:25 -0800, Dave Hansen wrote:
> On 12/2/22 03:19, Huang, Kai wrote:
> > Probably I forgot to mention that "r9" in practice always returns 32, so there
> > will be empty CMRs at the tail of the cmr_array[].
>
> Right, so the r9 value is basically useless. I bet the code gets
> simpler if you just ignore it.
>
> > > But we could also do nothing here and just skip empty CMRs when comparing
> > > the memory regions against them (in the next patch).
> > >
> > > Or, we don't even need to explicitly check the memory regions against the
> > > CMRs. If the memory regions that we provide in the TDMRs don't fall into a
> > > CMR, then TDH.SYS.CONFIG will fail. We can just depend on the SEAMCALL to
> > > do that.
> >
> > Sorry to ping, but do you have any comments here?
> >
> > How about we just don't do any check of TDX memory regions against CMRs, and
> > just let the TDH.SYS.CONFIG SEAMCALL determine that?
>
> Right, if we screw it up TDH.SYS.CONFIG SEAMCALL will fail. We don't
> need to add more code to detect that failure ourselves. TDX is screwed
> either way.

Will do. Thanks.

2022-12-07 12:40:20

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Mon, 2022-11-28 at 22:50 +0000, Huang, Kai wrote:
> On Mon, 2022-11-28 at 14:19 -0800, Dave Hansen wrote:
> > On 11/28/22 14:13, Huang, Kai wrote:
> > > Apologies, I am not entirely sure whether I fully got your point. Do you
> > > mean something like below?
> > ...
> >
> > No, something like this:
> >
> > static int init_tdx_module(void)
> > {
> > static struct tdsysinfo_struct tdx_sysinfo; /* too rotund for the stack */
> > ...
> > tdx_get_sysinfo(&tdx_sysinfo, ...);
> > ...
> >
> > But, also, seriously, 3k on the stack is *fine* if you can shut up the
> > warnings. This isn't going to be a deep call stack to begin with.
> >
>
> Let me try to find out whether it is possible to silence the warning. If I
> cannot, then I'll use the way above. Thanks!

Hi Dave,

Sorry to ask again.

Adding the build flag below to the Makefile can silence the warning (it
overrides the default CONFIG_FRAME_WARN limit for tdx.o):

index 38d534f2c113..f8a40d15fdfc 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS_tdx.o += -Wframe-larger-than=4096

So to confirm: you want to add this flag to the Makefile and just make
tdx_sysinfo and tdx_cmr_array local variables?

Another reason I am asking again is that 'tdx_global_keyid' in this series
could also be a local variable in init_tdx_module(), but currently it is
static (as KVM will need it too). If I change it to a local variable in the
patch "x86/virt/tdx: Reserve TDX module global KeyID" like below:

--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -50,9 +50,6 @@ static DEFINE_MUTEX(tdx_module_lock);
/* All TDX-usable memory regions */
static LIST_HEAD(tdx_memlist);

-/* TDX module global KeyID. Used in TDH.SYS.CONFIG ABI. */
-static u32 tdx_global_keyid;
-
/*
* tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
* This is used in TDX initialization error paths to take it from
@@ -928,6 +925,7 @@ static int init_tdx_module(void)
__aligned(CMR_INFO_ARRAY_ALIGNMENT);
struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
struct tdmr_info_list tdmr_list;
+ u32 global_keyid;
int ret;

ret = tdx_get_sysinfo(sysinfo, cmr_array);
@@ -964,7 +962,7 @@ static int init_tdx_module(void)
* Pick the first TDX KeyID as global KeyID to protect
* TDX module metadata.
*/
- tdx_global_keyid = tdx_keyid_start;
+ global_keyid = tdx_keyid_start;

I got a warning for this particular patch:

arch/x86/virt/vmx/tdx/tdx.c: In function ‘init_tdx_module’:
arch/x86/virt/vmx/tdx/tdx.c:928:13: warning: variable ‘global_keyid’ set but not
used [-Wunused-but-set-variable]
928 | u32 global_keyid;
| ^~~~~~~~~~~~

To get rid of this warning, we need to merge this patch into the later patch
(which configures the TDMRs and the global KeyID to the TDX module).

Should I make tdx_global_keyid a local variable too and merge the patch
"x86/virt/tdx: Reserve TDX module global KeyID" into the later patch?

2022-12-08 13:13:07

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Wed, 2022-12-07 at 11:47 +0000, Huang, Kai wrote:
> On Mon, 2022-11-28 at 22:50 +0000, Huang, Kai wrote:
> > On Mon, 2022-11-28 at 14:19 -0800, Dave Hansen wrote:
> > > On 11/28/22 14:13, Huang, Kai wrote:
> > > > Apologies, I am not entirely sure whether I fully got your point. Do
> > > > you mean something like below?
> > > ...
> > >
> > > No, something like this:
> > >
> > > static int init_tdx_module(void)
> > > {
> > > static struct tdsysinfo_struct tdx_sysinfo; /* too rotund for the stack */
> > > ...
> > > tdx_get_sysinfo(&tdx_sysinfo, ...);
> > > ...
> > >
> > > But, also, seriously, 3k on the stack is *fine* if you can shut up the
> > > warnings. This isn't going to be a deep call stack to begin with.
> > >
> >
> > Let me try to find out whether it is possible to silence the warning. If I
> > cannot, then I'll use the way above. Thanks!
>
> Hi Dave,
>
> Sorry to ask again.
>
> Adding the build flag below to the Makefile can silence the warning (it
> overrides the default CONFIG_FRAME_WARN limit for tdx.o):
>
> index 38d534f2c113..f8a40d15fdfc 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0-only
> +CFLAGS_tdx.o += -Wframe-larger-than=4096
>
> So to confirm: you want to add this flag to the Makefile and just make
> tdx_sysinfo and tdx_cmr_array local variables?

Hi Dave,

I found that if I declare TDSYSINFO_STRUCT and CMR_ARRAY as local variables (on
the stack), TDH.SYS.INFO fails in my testing with an 'invalid operand' error
for the address of TDSYSINFO_STRUCT. If I declare them as static, the SEAMCALL
works.

I haven't looked into the reason yet, but I suspect the address isn't aligned
(I used __pa() to get the physical address). I'll take a look and report back.

In the meantime, do you have any comments? Should I still pursue keeping them
as local variables on the stack?

Thanks.

>
> Another reason I am asking again is that 'tdx_global_keyid' in this series
> could also be a local variable in init_tdx_module(), but currently it is
> static (as KVM will need it too). If I change it to a local variable in the
> patch "x86/virt/tdx: Reserve TDX module global KeyID" like below:
>
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -50,9 +50,6 @@ static DEFINE_MUTEX(tdx_module_lock);
> /* All TDX-usable memory regions */
> static LIST_HEAD(tdx_memlist);
>
> -/* TDX module global KeyID. Used in TDH.SYS.CONFIG ABI. */
> -static u32 tdx_global_keyid;
> -
> /*
> * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
> * This is used in TDX initialization error paths to take it from
> @@ -928,6 +925,7 @@ static int init_tdx_module(void)
> __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
> struct tdmr_info_list tdmr_list;
> + u32 global_keyid;
> int ret;
>
> ret = tdx_get_sysinfo(sysinfo, cmr_array);
> @@ -964,7 +962,7 @@ static int init_tdx_module(void)
> * Pick the first TDX KeyID as global KeyID to protect
> * TDX module metadata.
> */
> - tdx_global_keyid = tdx_keyid_start;
> + global_keyid = tdx_keyid_start;
>
> I got a warning for this particular patch:
>
> arch/x86/virt/vmx/tdx/tdx.c: In function ‘init_tdx_module’:
> arch/x86/virt/vmx/tdx/tdx.c:928:13: warning: variable ‘global_keyid’ set but not
> used [-Wunused-but-set-variable]
> 928 | u32 global_keyid;
> | ^~~~~~~~~~~~
>
> To get rid of this warning, we need to merge this patch into the later patch
> (which configures the TDMRs and the global KeyID to the TDX module).
>
> Should I make tdx_global_keyid a local variable too and merge the patch
> "x86/virt/tdx: Reserve TDX module global KeyID" into the later patch?

And for this one, if we merge the two patches then in fact we can just remove
'tdx_global_keyid' and use 'tdx_keyid_start' directly. I have already done it
this way. If you have any comments, please let me know. Thanks for your time.

2022-12-08 15:37:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On 12/8/22 04:56, Huang, Kai wrote:
> I haven't looked into the reason yet, but I suspect the address isn't aligned
> (I used __pa() to get the physical address). I'll take a look and report back.
>
> In the meantime, do you have any comments? Should I still pursue keeping them
> as local variables on the stack?

Yes, you should investigate the reason for the failure and try to
understand both the success and the failure cases.

2022-12-08 23:37:12

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 11/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions

On Thu, 2022-12-08 at 06:58 -0800, Dave Hansen wrote:
> On 12/8/22 04:56, Huang, Kai wrote:
> > I haven't looked into the reason yet, but I suspect the address isn't aligned
> > (I used __pa() to get the physical address). I'll take a look and report back.
> >
> > In the meantime, do you have any comments? Should I still pursue keeping them
> > as local variables on the stack?
>
> Yes, you should investigate the reason for the failure and try to
> understand both the success and the failure cases.

Hi Dave,

Learned something new from Kirill today.

The reason is not alignment; it is wrong to use __pa() to get the physical
address of a function's local variable on the stack. Kirill told me the kernel
stack can now be allocated via vmalloc(), and __pa() does not work on vmalloc
addresses.
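
Roughly, the difference is (an illustrative sketch only, not code from the
series):

	struct tdsysinfo_struct sysinfo;	/* may live on a vmalloc()'ed stack */
	phys_addr_t pa;

	pa = __pa(&sysinfo);			/* wrong: assumes a direct-map address */
	pa = slow_virt_to_phys(&sysinfo);	/* ok: walks the page tables */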

I changed the code to use slow_virt_to_phys(), tried it in my testing, and it
now works.

So I'll switch to slow_virt_to_phys() for the next version. You can take a
look at the new version to see if that is what you wanted.

Thanks for your time.