2022-04-06 15:27:30

by Kai Huang

Subject: [PATCH v3 00/21] TDX host kernel support

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. This series provides support for
initializing the TDX module in the host kernel. KVM support for TDX is
being developed separately[1].

The code has been tested on a couple of TDX-capable machines. I would
consider it ready for review. I would highly appreciate it if anyone
could help review this series (from the high-level design to the
detailed implementation). For Intel reviewers (CC'ed), please help to
review, and I would appreciate Reviewed-by or Acked-by tags if the
patches look good to you.

Thanks in advance.

This series is based on Kirill's TDX guest series[2]. The reason is
that the host-side SEAMCALL implementation can share the TDCALL
implementation introduced in the TDX guest series.

You can find TDX related specs here:
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

You can also find this series in the repo below on GitHub:
https://github.com/intel/tdx/tree/host-upstream

Changelog history:

- v2 -> v3:

 - Rebased to latest TDX guest code, which is based on 5.18-rc1.
 - Addressed comments from Isaku:
   - Fixed a memory leak and an unnecessary function argument in the
     patch to configure the key for the global KeyID (patch 17).
   - Slightly enhanced the patch to get TDX module and CMR
     information (patch 09).
   - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
   - Slightly improved the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add the TDX host kernel support
   materials to Documentation/x86/tdx.rst together with the TDX guest
   stuff, instead of in a standalone file (patch 21).

- RFC (v1) -> v2:
 - Rebased to Kirill's latest TDX guest code.
 - Fixed two issues related to finding all RAM memory regions based
   on e820.
 - Minor improvements to comments and commit messages.

V2:
https://lore.kernel.org/lkml/[email protected]/T/
RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. To support TDX, a new CPU mode called
Secure Arbitration Mode (SEAM) is added to Intel processors.

SEAM is an extension to the existing VMX architecture. It defines a new
VMX root operation (SEAM VMX root) and a new VMX non-root operation (SEAM
VMX non-root).

SEAM VMX root operation is designed to host a CPU-attested software
module called the 'TDX module' which implements functions to manage
crypto-protected VMs called Trust Domains (TD). SEAM VMX root is also
designed to host a CPU-attested software module called the 'Intel
Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.

The host kernel transits to either the P-SEAMLDR or the TDX module via
a new SEAMCALL instruction. SEAMCALLs are host-side interface functions
defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
instruction. They are similar to hypercalls, except they are made by
the host kernel to the SEAM software modules.
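
The raw instruction is implemented in assembly (patch 03); the C
prototype used throughout this series is:

        /* From arch/x86/virt/vmx/tdx/tdx.h in this series */
        struct tdx_module_output;
        u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
                       struct tdx_module_output *out);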

TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to
crypto-protect TD guests. TDX reserves part of the MKTME KeyID space as
TDX private KeyIDs, which can only be used by software running in SEAM.
The physical address bits used to encode a TDX private KeyID are treated
as reserved bits when not in SEAM operation. The partitioning of MKTME
KeyIDs and TDX private KeyIDs is configured by BIOS.
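
As a rough sketch of how the kernel reads this partitioning (the MSR
constant below is an assumption of this write-up; the helper macros are
taken from later in this series):

        /*
         * Sketch only: the MSR name/number is assumed here.  The
         * TDX_KEYID_START()/TDX_KEYID_NUM() macros appear later in
         * this series: the lower 32 bits of the partitioning value
         * encode the number of MKTME KeyIDs (TDX KeyIDs start right
         * after them), the upper 32 bits the number of TDX KeyIDs.
         */
        u64 keyid_part;

        rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
        tdx_keyid_start = TDX_KEYID_START(keyid_part);
        tdx_keyid_num   = TDX_KEYID_NUM(keyid_part);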

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized using SEAMCALLs defined by TDX architecture.
This series assumes both the P-SEAMLDR and the TDX module are loaded by
BIOS before the kernel boots.

There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
Instead, detect the P-SEAMLDR by calling its SEAMLDR.INFO SEAMCALL: if
this SEAMCALL succeeds, the P-SEAMLDR is loaded, and the P-SEAMLDR
information it returns further tells whether the TDX module is loaded.

The TDX module is initialized in multiple steps:

1) Global initialization;
2) Logical-CPU scope initialization;
3) Enumerate the TDX module capabilities;
4) Configure the TDX module with usable memory ranges and
global KeyID information;
5) Package-scope configuration for the global KeyID;
6) Initialize TDX metadata for usable memory ranges based on 4).

Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (i.e. in
the MADT table) logical cpus, otherwise step 4) fails. Step 5) requires
calling a SEAMCALL on at least one cpu in each package.

The TDX module can also be shut down at any time during its lifetime,
by calling a SEAMCALL on all "BIOS-enabled" logical cpus.
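
Put together, and using illustrative helper names (the actual helpers
are introduced across the patches in this series), the initialization
flow looks roughly like:

        /* Sketch only: helper names are illustrative. */
        static int init_tdx_module(void)
        {
                int ret;

                /* 1) Global initialization, once on any cpu */
                ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
                if (ret)
                        return ret;
                /* 2) Logical-cpu scope initialization on all cpus */
                ret = tdx_module_init_cpus();
                if (ret)
                        return ret;
                /* 3) Enumerate module capabilities and CMRs */
                ret = tdx_get_sysinfo();
                if (ret)
                        return ret;
                /* 4) Configure TDMRs and the global KeyID */
                ret = config_tdx_module();
                if (ret)
                        return ret;
                /* 5) Per-package configuration of the global KeyID */
                ret = config_global_keyid();
                if (ret)
                        return ret;
                /* 6) Initialize TDX metadata for all TDMRs */
                return init_tdmrs();
        }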

== Design Considerations ==

1. Lazy TDX module initialization on-demand by caller

None of the steps in the TDX module initialization process needs to be
done during kernel boot. This series doesn't initialize TDX at boot
time, but instead provides two functions to allow the caller to detect
and initialize TDX on demand:

        if (tdx_detect())
                goto no_tdx;
        if (tdx_init())
                goto no_tdx;

This approach has the following pros:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata (for example, roughly 4GB on a machine with 1TB of RAM).
Enabling TDX on demand means this memory is consumed only when TDX is
truly needed (i.e. when KVM wants to create TD guests).

2) Both detecting and initializing the TDX module require calling
SEAMCALL. However, SEAMCALL requires the CPU to already be in VMX
operation (VMXON has been done). So far, KVM is the only user of TDX,
and it already handles VMXON/VMXOFF. Therefore, letting KVM initialize
TDX on demand avoids handling VMXON/VMXOFF (which is not that trivial)
in the core kernel. Also, in the long term, a reference-based
VMXON/VMXOFF approach is likely needed, since more kernel components
will need to handle VMXON/VMXOFF (a conceptual sketch follows after
this list).

3) It is more flexible for supporting "TDX module runtime update" (not
in this series). After updating to the new module at runtime, the
kernel needs to go through the initialization process again. For the
new module, it's possible that the metadata allocated for the old
module cannot be reused and needs to be re-allocated.
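
For the reference-based VMXON/VMXOFF approach mentioned in 2), a
conceptual sketch might look like the below. All names are hypothetical
and VMXON is per-cpu in reality, so a real implementation would need
per-cpu awareness; this only illustrates the refcounting idea:

        /* Hypothetical sketch; not part of this series. */
        static DEFINE_MUTEX(vmxon_lock);
        static int vmxon_refcount;

        int tdx_get_vmxon(void)
        {
                int ret = 0;

                mutex_lock(&vmxon_lock);
                if (!vmxon_refcount)
                        ret = vmxon_all_cpus();        /* hypothetical */
                if (!ret)
                        vmxon_refcount++;
                mutex_unlock(&vmxon_lock);
                return ret;
        }

        void tdx_put_vmxon(void)
        {
                mutex_lock(&vmxon_lock);
                if (!--vmxon_refcount)
                        vmxoff_all_cpus();             /* hypothetical */
                mutex_unlock(&vmxon_lock);
        }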

2. Kernel policy on TDX memory

The host kernel is responsible for choosing which memory regions can be
used as TDX memory, and for configuring those memory regions to the TDX
module using an array of "TD Memory Regions" (TDMR), a data structure
defined by the TDX architecture.
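
For reference, a rough sketch of TDMR_INFO, reconstructed from the
fields this series uses (the TDX module spec is authoritative; note the
eight u64 header fields match the 64-byte TDMR_RSVD_START offset used
in patch 14):

        /* Sketch reconstructed from fields used in this series. */
        struct tdmr_reserved_area {
                u64 offset;        /* relative to TDMR base */
                u64 size;
        } __packed;

        struct tdmr_info {
                u64 base;          /* 1G aligned */
                u64 size;          /* in 1G granularity */
                u64 pamt_1g_base;
                u64 pamt_1g_size;
                u64 pamt_2m_base;
                u64 pamt_2m_size;
                u64 pamt_4k_base;
                u64 pamt_4k_size;
                /* up to 'max_reserved_per_tdmr' entries, ascending */
                struct tdmr_reserved_area reserved_areas[];
        } __packed __aligned(TDMR_INFO_ALIGNMENT);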

The first generation of TDX essentially guarantees that all system RAM
memory regions (excluding the memory below 1MB) can be used as TDX
memory. To avoid having to modify the page allocator to distinguish TDX
and non-TDX allocation, this series chooses to use all system RAM as TDX
memory.

The e820 table is used to find all system RAM entries. Following
e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
types are treated as TDX memory, and contiguous ranges in the same NUMA
node are merged together (similar to memblock_add()) before trimming the
non-page-aligned part.

x86 legacy PMEMs (E820_TYPE_PRAM) are also unconditionally treated as
TDX memory, as underneath they are RAM and can potentially be used as
TD guest memory.

Memblock is not used to find all RAM regions because: 1) it is gone
after the kernel boots; 2) it doesn't include legacy PMEM.

3. Memory hotplug

The first generation of TDX architecturally doesn't support memory
hotplug. And the first generation of TDX-capable platforms don't support
physical memory hotplug. Since it physically cannot happen, this series
doesn't add any check in ACPI memory hotplug code path to disable it.

A special case of memory hotplug is adding NVDIMM as system RAM using
the kmem driver. However, the first generation of TDX-capable platforms
cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
happen either.

Another case is that the admin can use the 'memmap' kernel command line
to create legacy PMEMs and use them as TD guest memory, or,
theoretically, can use the kmem driver to add them as system RAM. To
avoid having to change memory hotplug code to prevent this from
happening, this series always includes legacy PMEMs when constructing
TDMRs so they are also TDX memory.

4. CPU hotplug

The first generation of TDX architecturally doesn't support ACPI CPU
hotplug. All logical cpus are enabled by BIOS in the MADT table. Also,
the first generation of TDX-capable platforms don't support ACPI CPU
hotplug either. Since this physically cannot happen, this series
doesn't add any check in the ACPI CPU hotplug code path to disable it.

Also, only TDX module initialization requires all BIOS-enabled cpus to
be online. After the initialization, any logical cpu can be brought
down and brought up again later. Therefore this series doesn't change
logical CPU hotplug either.

5. TDX interaction with kexec()

If TDX is ever enabled and/or used to run any TD guests, the cachelines
of TDX private memory, including PAMTs, used by the TDX module need to
be flushed before transiting to the new kernel, otherwise they may
silently corrupt the new kernel. Similar to SME, this series flushes
the cache in stop_this_cpu().
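
Conceptually, the change mirrors the existing SME handling in
stop_this_cpu() (sketch only; the TDX condition below is a hypothetical
stand-in for whatever state the series actually tracks):

        /* Sketch of the idea in arch/x86/kernel/process.c. */
        if (boot_cpu_has(X86_FEATURE_SME) || tdx_private_memory_used())
                native_wbinvd();        /* flush cache before kexec */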

The TDX module can be initialized only once during its lifetime. The
first generation of TDX doesn't have an interface to reset the TDX
module to an uninitialized state so that it can be initialized again.

This implies:

- If the old kernel fails to initialize TDX, the new kernel cannot
use TDX either, unless the new kernel fixes the bug which led to the
initialization failure in the old kernel and can resume from where
the old kernel stopped. This requires certain coordination between
the two kernels.

- If the old kernel has initialized TDX successfully, the new kernel
may be able to use TDX if the two kernels have exactly the same
configuration of the TDX module. It further requires the new kernel
to reserve the TDX metadata pages (allocated by the old kernel) in
its page allocator. It also requires coordination between the two
kernels. Furthermore, if kexec() is done when there are active TD
guests running, the new kernel cannot use TDX, because it's extremely
hard for the old kernel to pass all TDX private pages to the new
kernel.

Given that, this series doesn't support TDX after kexec() (unless the
old kernel didn't attempt to initialize TDX at all).

Also, this series doesn't shut down the TDX module but leaves it open
across kexec(). This is because shutting down the TDX module requires
the CPU to be in VMX operation, and there's no guarantee of that during
kexec(). Leaving the TDX module open is not ideal, but it is OK since
the new kernel won't be able to use TDX anyway (therefore the TDX
module won't run at all).

[1] https://lore.kernel.org/lkml/772b20e270b3451aea9714260f2c40ddcc4afe80.1646422845.git.isaku.yamahata@intel.com/T/
[2] https://github.com/intel/tdx/tree/guest-upstream


Kai Huang (21):
x86/virt/tdx: Detect SEAM
x86/virt/tdx: Detect TDX private KeyIDs
x86/virt/tdx: Implement the SEAMCALL base function
x86/virt/tdx: Add skeleton for detecting and initializing TDX on
demand
x86/virt/tdx: Detect P-SEAMLDR and TDX module
x86/virt/tdx: Shut down TDX module in case of error
x86/virt/tdx: Do TDX module global initialization
x86/virt/tdx: Do logical-cpu scope TDX module initialization
x86/virt/tdx: Get information about TDX module and convertible memory
x86/virt/tdx: Add placeholder to convert all system RAM as TDX memory
x86/virt/tdx: Choose to use all system RAM as TDX memory
x86/virt/tdx: Create TDMRs to cover all system RAM
x86/virt/tdx: Allocate and set up PAMTs for TDMRs
x86/virt/tdx: Set up reserved areas for all TDMRs
x86/virt/tdx: Reserve TDX module global KeyID
x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
x86/virt/tdx: Configure global KeyID on all packages
x86/virt/tdx: Initialize all TDMRs
x86: Flush cache of TDX private memory during kexec()
x86/virt/tdx: Add kernel command line to opt-in TDX host support
Documentation/x86: Add documentation for TDX host support

.../admin-guide/kernel-parameters.txt | 6 +
Documentation/x86/tdx.rst | 326 +++-
arch/x86/Kconfig | 14 +
arch/x86/Makefile | 2 +
arch/x86/include/asm/tdx.h | 15 +
arch/x86/kernel/cpu/intel.c | 3 +
arch/x86/kernel/process.c | 15 +-
arch/x86/virt/Makefile | 2 +
arch/x86/virt/vmx/Makefile | 2 +
arch/x86/virt/vmx/tdx/Makefile | 2 +
arch/x86/virt/vmx/tdx/seamcall.S | 52 +
arch/x86/virt/vmx/tdx/tdx.c | 1717 +++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 137 ++
13 files changed, 2279 insertions(+), 14 deletions(-)
create mode 100644 arch/x86/virt/Makefile
create mode 100644 arch/x86/virt/vmx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

--
2.35.1


2022-04-06 15:44:46

by Kai Huang

Subject: [PATCH v3 14/21] x86/virt/tdx: Set up reserved areas for all TDMRs

As the last step of constructing TDMRs, create the reserved area
information for the memory region holes in each TDMR. If any PAMT (or
part of one) resides within a particular TDMR, also mark it as
reserved.

All reserved areas in each TDMR must be in ascending address order, as
required by the TDX architecture.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 148 +++++++++++++++++++++++++++++++++++-
1 file changed, 146 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1b807dcbc101..bf0d13644898 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,7 @@
#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/math.h>
+#include <linux/sort.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
@@ -1112,6 +1113,145 @@ static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
return -ENOMEM;
}

+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx,
+ u64 addr, u64 size)
+{
+ struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+ int idx = *p_idx;
+
+ /* Reserved area must be 4K aligned in offset and size */
+ if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+ return -EINVAL;
+
+ /* Cannot exceed maximum reserved areas supported by TDX */
+ if (idx >= tdx_sysinfo.max_reserved_per_tdmr)
+ return -E2BIG;
+
+ rsvd_areas[idx].offset = addr - tdmr->base;
+ rsvd_areas[idx].size = size;
+
+ *p_idx = idx + 1;
+
+ return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+ struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+ struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+ if (r1->offset + r1->size <= r2->offset)
+ return -1;
+ if (r1->offset >= r2->offset + r2->size)
+ return 1;
+
+ /* Reserved areas cannot overlap. Caller should guarantee. */
+ WARN_ON(1);
+ return -1;
+}
+
+/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
+static int tdmr_setup_rsvd_areas(struct tdmr_info *tdmr,
+ struct tdmr_info **tdmr_array,
+ int tdmr_num)
+{
+ u64 start, end, prev_end;
+ int rsvd_idx, i, ret = 0;
+
+ /* Mark holes between e820 RAM entries as reserved */
+ rsvd_idx = 0;
+ prev_end = TDMR_START(tdmr);
+ e820_for_each_mem(i, start, end) {
+ /* Break if this entry is after the TDMR */
+ if (start >= TDMR_END(tdmr))
+ break;
+
+ /* Exclude entries before this TDMR */
+ if (end < TDMR_START(tdmr))
+ continue;
+
+ /*
+ * Skip if no hole exists before this entry. "<=" is
+ * used because one e820 entry might span two TDMRs.
+ * In that case the start address of this entry is
+ * smaller than the start address of the second TDMR.
+ */
+ if (start <= prev_end) {
+ prev_end = end;
+ continue;
+ }
+
+ /* Add the hole before this e820 entry */
+ ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+ start - prev_end);
+ if (ret)
+ return ret;
+
+ prev_end = end;
+ }
+
+ /* Add the hole after the last RAM entry if it exists. */
+ if (prev_end < TDMR_END(tdmr)) {
+ ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+ TDMR_END(tdmr) - prev_end);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Walk over all TDMRs to find out whether any PAMT falls into
+ * the given TDMR. If yes, mark it as reserved too.
+ */
+ for (i = 0; i < tdmr_num; i++) {
+ struct tdmr_info *tmp = tdmr_array[i];
+ u64 pamt_start, pamt_end;
+
+ pamt_start = tmp->pamt_4k_base;
+ pamt_end = pamt_start + tmp->pamt_4k_size +
+ tmp->pamt_2m_size + tmp->pamt_1g_size;
+
+ /* Skip PAMTs outside of the given TDMR */
+ if ((pamt_end <= TDMR_START(tdmr)) ||
+ (pamt_start >= TDMR_END(tdmr)))
+ continue;
+
+ /* Only mark the part within the TDMR as reserved */
+ if (pamt_start < TDMR_START(tdmr))
+ pamt_start = TDMR_START(tdmr);
+ if (pamt_end > TDMR_END(tdmr))
+ pamt_end = TDMR_END(tdmr);
+
+ ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, pamt_start,
+ pamt_end - pamt_start);
+ if (ret)
+ return ret;
+ }
+
+ /* TDX requires reserved areas listed in address ascending order */
+ sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+ rsvd_area_cmp_func, NULL);
+
+ return 0;
+}
+
+static int tdmrs_setup_rsvd_areas_all(struct tdmr_info **tdmr_array,
+ int tdmr_num)
+{
+ int i;
+
+ for (i = 0; i < tdmr_num; i++) {
+ int ret;
+
+ ret = tdmr_setup_rsvd_areas(tdmr_array[i], tdmr_array,
+ tdmr_num);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
{
int ret;
@@ -1128,8 +1268,12 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
if (ret)
goto err_free_tdmrs;

- /* Return -EFAULT until constructing TDMRs is done */
- ret = -EFAULT;
+ ret = tdmrs_setup_rsvd_areas_all(tdmr_array, *tdmr_num);
+ if (ret)
+ goto err_free_pamts;
+
+ return 0;
+err_free_pamts:
tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
err_free_tdmrs:
free_tdmrs(tdmr_array, *tdmr_num);
--
2.35.1

2022-04-06 15:51:57

by Kai Huang

Subject: [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization

Do the TDX module global initialization, which requires calling
TDH.SYS.INIT once on any logical cpu.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index faf8355965a5..5c2f3a30be2f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -463,11 +463,20 @@ static int __tdx_detect(void)

static int init_tdx_module(void)
{
+ int ret;
+
+ /* TDX module global initialization */
+ ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+ if (ret)
+ goto out;
+
/*
* Return -EFAULT until all steps of TDX module
* initialization are done.
*/
- return -EFAULT;
+ ret = -EFAULT;
+out:
+ return ret;
}

static void shutdown_tdx_module(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index dcc1f6dfe378..f0983b1936d8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -38,6 +38,7 @@ struct p_seamldr_info {
/*
* TDX module SEAMCALL leaf functions
*/
+#define TDH_SYS_INIT 33
#define TDH_SYS_LP_SHUTDOWN 44

struct tdx_module_output;
--
2.35.1

2022-04-06 15:52:56

by Kai Huang

Subject: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error

TDX supports shutting down the TDX module at any time during its
lifetime. After the TDX module is shut down, no further SEAMCALL can be
made on any logical cpu.

Shut down the TDX module in case any error happens during the
initialization process. It's pointless to leave the TDX module in some
intermediate state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled cpus, and this SEAMCALL can run concurrently on different
cpus. Implement a mechanism to run a SEAMCALL concurrently on all online
cpus. Logical-cpu scope initialization will use it too.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 674867bccc14..faf8355965a5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -11,6 +11,8 @@
#include <linux/cpumask.h>
#include <linux/mutex.h>
#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
@@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
return 0;
}

+/* Data structure to make SEAMCALL on multiple CPUs concurrently */
+struct seamcall_ctx {
+ u64 fn;
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ atomic_t err;
+ u64 seamcall_ret;
+ struct tdx_module_output out;
+};
+
+static void seamcall_smp_call_function(void *data)
+{
+ struct seamcall_ctx *sc = data;
+ int ret;
+
+ ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
+ &sc->seamcall_ret, &sc->out);
+ if (ret)
+ atomic_set(&sc->err, ret);
+}
+
+/*
+ * Call the SEAMCALL on all online cpus concurrently.
+ * Return error if SEAMCALL fails on any cpu.
+ */
+static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+ on_each_cpu(seamcall_smp_call_function, sc, true);
+ return atomic_read(&sc->err);
+}
+
static inline bool p_seamldr_ready(void)
{
return !!p_seamldr_info.p_seamldr_ready;
@@ -437,7 +472,10 @@ static int init_tdx_module(void)

static void shutdown_tdx_module(void)
{
- /* TODO: Shut down the TDX module */
+ struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+ seamcall_on_each_cpu(&sc);
+
tdx_module_status = TDX_MODULE_SHUTDOWN;
}

diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 6990c93198b3..dcc1f6dfe378 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -35,6 +35,11 @@ struct p_seamldr_info {
#define P_SEAMLDR_SEAMCALL_BASE BIT_ULL(63)
#define P_SEAMCALL_SEAMLDR_INFO (P_SEAMLDR_SEAMCALL_BASE | 0x0)

+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN 44
+
struct tdx_module_output;
u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out);
--
2.35.1

2022-04-06 16:00:41

by Kai Huang

Subject: [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization

Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.
TDH.SYS.LP.INIT can be called concurrently on all cpus.

Following global initialization, do the logical-cpu scope initialization
by calling TDH.SYS.LP.INIT on all online cpus. Whether all BIOS-enabled
cpus are online is not checked here for simplicity. The caller of
tdx_init() should guarantee all BIOS-enabled cpus are online.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 13 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5c2f3a30be2f..ef2718423f0f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -461,6 +461,13 @@ static int __tdx_detect(void)
return -ENODEV;
}

+static int tdx_module_init_cpus(void)
+{
+ struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
+
+ return seamcall_on_each_cpu(&sc);
+}
+
static int init_tdx_module(void)
{
int ret;
@@ -470,6 +477,11 @@ static int init_tdx_module(void)
if (ret)
goto out;

+ /* Logical-cpu scope initialization */
+ ret = tdx_module_init_cpus();
+ if (ret)
+ goto out;
+
/*
* Return -EFAULT until all steps of TDX module
* initialization are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f0983b1936d8..b8cfdd6e12f3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -39,6 +39,7 @@ struct p_seamldr_info {
* TDX module SEAMCALL leaf functions
*/
#define TDH_SYS_INIT 33
+#define TDH_SYS_LP_INIT 35
#define TDH_SYS_LP_SHUTDOWN 44

struct tdx_module_output;
--
2.35.1

2022-04-06 16:51:10

by Kai Huang

Subject: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM

The kernel configures TDX usable memory regions to the TDX module via
an array of "TD Memory Region" (TDMR). Each TDMR entry (TDMR_INFO)
contains the information of the base/size of a memory region, the
base/size of the associated Physical Address Metadata Table (PAMT) and
a list of reserved areas in the region.

Create a number of TDMRs according to the verified e820 RAM entries.
As the first step only set up the base/size information for each TDMR.

Each TDMR must be 1G aligned and its size must be in 1G granularity.
This implies that one TDMR could cover multiple e820 RAM entries. If a
RAM entry spans a 1GB boundary and the former part is already covered
by the previous TDMR, just create a new TDMR for the latter part.

TDX only supports a limited number of TDMRs (currently 64). Abort the
TDMR construction process when the number of TDMRs exceeds this
limitation.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
1 file changed, 138 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6b0c51aaa7f2..82534e70df96 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -54,6 +54,18 @@
((u32)(((_keyid_part) & 0xffffffffull) + 1))
#define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))

+/* TDMR must be 1GB aligned */
+#define TDMR_ALIGNMENT BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
+
+/* Align up and down the address to TDMR boundary */
+#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT)
+
+/* TDMR's start and end address */
+#define TDMR_START(_tdmr) ((_tdmr)->base)
+#define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size)
+
/*
* TDX module status during initialization
*/
@@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
return 0;
}

+/* The starting offset of reserved areas within TDMR_INFO */
+#define TDMR_RSVD_START 64
+
+static struct tdmr_info *__alloc_tdmr(void)
+{
+ int tdmr_sz;
+
+ /*
+ * TDMR_INFO's actual size depends on maximum number of reserved
+ * areas that one TDMR supports.
+ */
+ tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
+ sizeof(struct tdmr_reserved_area);
+
+ /*
+ * TDX requires TDMR_INFO to be 512 aligned. Always align up
+ * TDMR_INFO size to 512 so the memory allocated via kzalloc()
+ * can meet the alignment requirement.
+ */
+ tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+
+ return kzalloc(tdmr_sz, GFP_KERNEL);
+}
+
+/* Create a new TDMR at given index in the TDMR array */
+static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
+{
+ struct tdmr_info *tdmr;
+
+ if (WARN_ON_ONCE(tdmr_array[idx]))
+ return NULL;
+
+ tdmr = __alloc_tdmr();
+ tdmr_array[idx] = tdmr;
+
+ return tdmr;
+}
+
static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
{
int i;
@@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
}
}

+/*
+ * Create TDMRs to cover all RAM entries in e820_table. The created
+ * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
+ * number of TDMRs. All entries in @tdmr_array must be initially NULL.
+ */
+static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
+{
+ struct tdmr_info *tdmr;
+ u64 start, end;
+ int i, tdmr_idx;
+ int ret = 0;
+
+ tdmr_idx = 0;
+ tdmr = alloc_tdmr(tdmr_array, 0);
+ if (!tdmr)
+ return -ENOMEM;
+ /*
+ * Loop over all RAM entries in e820 and create TDMRs to cover
+ * them. To keep it simple, always try to use one TDMR to cover
+ * one RAM entry.
+ */
+ e820_for_each_mem(i, start, end) {
+ start = TDMR_ALIGN_DOWN(start);
+ end = TDMR_ALIGN_UP(end);
+
+ /*
+ * If the current TDMR's size hasn't been initialized, it
+ * is a new allocated TDMR to cover the new RAM entry.
+ * Otherwise the current TDMR already covers the previous
+ * RAM entry. In the latter case, check whether the
+ * current RAM entry has been fully or partially covered
+ * by the current TDMR, since TDMR is 1G aligned.
+ */
+ if (tdmr->size) {
+ /*
+ * Loop to next RAM entry if the current entry
+ * is already fully covered by the current TDMR.
+ */
+ if (end <= TDMR_END(tdmr))
+ continue;
+
+ /*
+ * If part of current RAM entry has already been
+ * covered by current TDMR, skip the already
+ * covered part.
+ */
+ if (start < TDMR_END(tdmr))
+ start = TDMR_END(tdmr);
+
+ /*
+ * Create a new TDMR to cover the current RAM
+ * entry, or the remaining part of it.
+ */
+ tdmr_idx++;
+ if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
+ ret = -E2BIG;
+ goto err;
+ }
+ tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
+ if (!tdmr) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ }
+
+ tdmr->base = start;
+ tdmr->size = end - start;
+ }
+
+ /* @tdmr_idx is always the index of last valid TDMR. */
+ *tdmr_num = tdmr_idx + 1;
+
+ return 0;
+err:
+ /*
+ * Clean up already allocated TDMRs in case of error. @tdmr_idx
+ * indicates the last TDMR that wasn't created successfully,
+ * therefore only needs to free @tdmr_idx TDMRs.
+ */
+ free_tdmrs(tdmr_array, tdmr_idx);
+ return ret;
+}
+
static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
{
int ret;
@@ -834,8 +967,13 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
if (ret)
goto err;

+ ret = create_tdmrs(tdmr_array, tdmr_num);
+ if (ret)
+ goto err;
+
/* Return -EFAULT until constructing TDMRs is done */
ret = -EFAULT;
+ free_tdmrs(tdmr_array, *tdmr_num);
err:
return ret;
}
--
2.35.1

2022-04-06 16:51:18

by Kai Huang

Subject: [PATCH v3 15/21] x86/virt/tdx: Reserve TDX module global KeyID

TDX module initialization requires using one TDX private KeyID as the
global KeyID to crypto-protect TDX metadata. The global KeyID is
configured to the TDX module along with the TDMRs.

Just reserve the first TDX private KeyID as the global KeyID.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bf0d13644898..ecd65f7014e2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -112,6 +112,9 @@ static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMEN
static int tdx_cmr_num;
static struct tdsysinfo_struct tdx_sysinfo;

+/* TDX global KeyID to protect TDX metadata */
+static u32 tdx_global_keyid;
+
static bool __seamrr_enabled(void)
{
return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -1320,6 +1323,12 @@ static int init_tdx_module(void)
if (ret)
goto out_free_tdmrs;

+ /*
+ * Reserve the first TDX KeyID as global KeyID to protect
+ * TDX module metadata.
+ */
+ tdx_global_keyid = tdx_keyid_start;
+
/*
* Return -EFAULT until all steps of TDX module
* initialization are done.
--
2.35.1

2022-04-06 16:51:27

by Kai Huang

Subject: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module

The P-SEAMLDR (persistent SEAM loader) is the first software module that
runs in SEAM VMX root, responsible for loading and updating the TDX
module. Both the P-SEAMLDR and the TDX module are expected to be loaded
before host kernel boots.

There is no CPUID or MSR to detect whether the P-SEAMLDR or the TDX
module has been loaded. The SEAMCALL instruction fails with
VMfailInvalid when the target SEAM software module is not loaded, so
SEAMCALL can be used to detect whether the P-SEAMLDR and the TDX module
are loaded.

Detect the P-SEAMLDR and the TDX module by calling the SEAMLDR.INFO
SEAMCALL to get the P-SEAMLDR information. If the SEAMCALL succeeds,
the P-SEAMLDR information further tells whether the TDX module is
loaded or not.

Also add a wrapper of __seamcall() to make SEAMCALLs to the P-SEAMLDR
and the TDX module with additional defensive checks on SEAMRR and
CR4.VMXE, since both detecting and initializing the TDX module require
the caller of TDX to handle VMXON.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 175 +++++++++++++++++++++++++++++++++++-
arch/x86/virt/vmx/tdx/tdx.h | 31 +++++++
2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 53093d4ad458..674867bccc14 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,7 +15,9 @@
#include <asm/msr.h>
#include <asm/cpufeature.h>
#include <asm/cpufeatures.h>
+#include <asm/virtext.h>
#include <asm/tdx.h>
+#include "tdx.h"

/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
#define MTRR_CAP_SEAMRR BIT(15)
@@ -74,6 +76,8 @@ static enum tdx_module_status_t tdx_module_status;
/* Prevent concurrent attempts on TDX detection and initialization */
static DEFINE_MUTEX(tdx_module_lock);

+static struct p_seamldr_info p_seamldr_info;
+
static bool __seamrr_enabled(void)
{
return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -229,6 +233,160 @@ static bool tdx_keyid_sufficient(void)
return tdx_keyid_num >= 2;
}

+/*
+ * All error codes of both the P-SEAMLDR and the TDX module SEAMCALLs
+ * have bit 63 set if SEAMCALL fails.
+ */
+#define SEAMCALL_LEAF_ERROR(_ret) ((_ret) & BIT_ULL(63))
+
+/**
+ * seamcall - make SEAMCALL to the P-SEAMLDR or the TDX module with
+ * additional check on SEAMRR and CR4.VMXE
+ *
+ * @fn: SEAMCALL leaf number.
+ * @rcx: Input operand RCX.
+ * @rdx: Input operand RDX.
+ * @r8: Input operand R8.
+ * @r9: Input operand R9.
+ * @seamcall_ret: SEAMCALL completion status (can be NULL).
+ * @out: Additional output operands (can be NULL).
+ *
+ * Wrapper of __seamcall() to make SEAMCALL to the P-SEAMLDR or the TDX
+ * module with additional defensive check on SEAMRR and CR4.VMXE. Caller
+ * to make sure SEAMRR is enabled and CPU is already in VMX operation
+ * before calling this function.
+ *
+ * Unlike __seamcall(), it returns kernel error code instead of SEAMCALL
+ * completion status, which is returned via @seamcall_ret if desired.
+ *
+ * Return:
+ *
+ * * -ENODEV: SEAMCALL failed with VMfailInvalid, or SEAMRR is not enabled.
+ * * -EPERM: CR4.VMXE is not enabled
+ * * -EFAULT: SEAMCALL failed
+ * * -0: SEAMCALL succeeded
+ */
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ u64 *seamcall_ret, struct tdx_module_output *out)
+{
+ u64 ret;
+
+ if (WARN_ON_ONCE(!seamrr_enabled()))
+ return -ENODEV;
+
+ /*
+ * SEAMCALL instruction requires CPU being already in VMX
+ * operation (VMXON has been done), otherwise it causes #UD.
+ * Sanity check whether CR4.VMXE has been enabled.
+ *
+ * Note VMX being enabled in CR4 doesn't mean CPU is already
+ * in VMX operation, but unfortunately there's no way to do
+ * such check. However in practice enabling CR4.VMXE and
+ * doing VMXON are done together (for now) so in practice it
+ * checks whether VMXON has been done.
+ *
+ * Preemption is disabled during the CR4.VMXE check and the
+ * actual SEAMCALL so VMX doesn't get disabled by other threads
+ * due to scheduling.
+ */
+ preempt_disable();
+ if (WARN_ON_ONCE(!cpu_vmx_enabled())) {
+ preempt_enable_no_resched();
+ return -EPERM;
+ }
+
+ ret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+ preempt_enable_no_resched();
+
+ /*
+ * Convert SEAMCALL error code to kernel error code:
+ * - -ENODEV: VMfailInvalid
+ * - -EFAULT: SEAMCALL failed
+ * - 0: SEAMCALL was successful
+ */
+ if (ret == TDX_SEAMCALL_VMFAILINVALID)
+ return -ENODEV;
+
+ /* Save the completion status if caller wants to use it */
+ if (seamcall_ret)
+ *seamcall_ret = ret;
+
+ /*
+ * TDX module SEAMCALLs may also return non-zero completion
+ * status codes but w/o bit 63 set. Those codes are treated
+ * as additional information/warning while the SEAMCALL is
+ * treated as completed successfully. Return 0 in this case.
+ * Caller can use @seamcall_ret to get the additional code
+ * when it is desired.
+ */
+ if (SEAMCALL_LEAF_ERROR(ret)) {
+ pr_err("SEAMCALL leaf %llu failed: 0x%llx\n", fn, ret);
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+static inline bool p_seamldr_ready(void)
+{
+ return !!p_seamldr_info.p_seamldr_ready;
+}
+
+static inline bool tdx_module_ready(void)
+{
+ /*
+ * SEAMLDR_INFO.SEAM_READY indicates whether TDX module
+ * is (loaded and) ready for SEAMCALL.
+ */
+ return p_seamldr_ready() && !!p_seamldr_info.seam_ready;
+}
+
+/*
+ * Detect whether the P-SEAMLDR has been loaded by calling SEAMLDR.INFO
+ * SEAMCALL to get the P-SEAMLDR information, which further tells whether
+ * the TDX module has been loaded and ready for SEAMCALL. Caller to make
+ * sure only calling this function when CPU is already in VMX operation.
+ */
+static int detect_p_seamldr(void)
+{
+ int ret;
+
+ /*
+ * SEAMCALL fails with VMfailInvalid when SEAM software is not
+ * loaded, in which case seamcall() returns -ENODEV. Use this
+ * to detect the P-SEAMLDR.
+ *
+ * Note the P-SEAMLDR SEAMCALL also fails with VMfailInvalid when
+ * the P-SEAMLDR is already busy with another SEAMCALL. But this
+ * won't happen here as this function is only called once.
+ */
+ ret = seamcall(P_SEAMCALL_SEAMLDR_INFO, __pa(&p_seamldr_info),
+ 0, 0, 0, NULL, NULL);
+ if (ret) {
+ if (ret == -ENODEV)
+ pr_info("P-SEAMLDR is not loaded.\n");
+ else
+ pr_info("Failed to detect P-SEAMLDR.\n");
+
+ return ret;
+ }
+
+ /*
+ * If SEAMLDR.INFO was successful, it must be ready for SEAMCALL.
+ * Otherwise it's either kernel or firmware bug.
+ */
+ if (WARN_ON_ONCE(!p_seamldr_ready()))
+ return -ENODEV;
+
+ pr_info("P-SEAMLDR: version 0x%x, vendor_id: 0x%x, build_date: %u, build_num %u, major %u, minor %u\n",
+ p_seamldr_info.version, p_seamldr_info.vendor_id,
+ p_seamldr_info.build_date, p_seamldr_info.build_num,
+ p_seamldr_info.major, p_seamldr_info.minor);
+
+ return 0;
+}
+
static int __tdx_detect(void)
{
/* The TDX module is not loaded if SEAMRR is disabled */
@@ -247,7 +405,22 @@ static int __tdx_detect(void)
goto no_tdx_module;
}

- /* Return -ENODEV until the TDX module is detected */
+ /*
+ * For simplicity any error during detect_p_seamldr() marks
+ * TDX module as not loaded.
+ */
+ if (detect_p_seamldr())
+ goto no_tdx_module;
+
+ if (!tdx_module_ready()) {
+ pr_info("TDX module is not loaded.\n");
+ goto no_tdx_module;
+ }
+
+ pr_info("TDX module detected.\n");
+ tdx_module_status = TDX_MODULE_LOADED;
+ return 0;
+
no_tdx_module:
tdx_module_status = TDX_MODULE_NONE;
return -ENODEV;
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9d5b6f554c20..6990c93198b3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -3,6 +3,37 @@
#define _X86_VIRT_TDX_H

#include <linux/types.h>
+#include <linux/compiler.h>
+
+/*
+ * TDX architectural data structures
+ */
+
+#define P_SEAMLDR_INFO_ALIGNMENT 256
+
+struct p_seamldr_info {
+ u32 version;
+ u32 attributes;
+ u32 vendor_id;
+ u32 build_date;
+ u16 build_num;
+ u16 minor;
+ u16 major;
+ u8 reserved0[2];
+ u32 acm_x2apicid;
+ u8 reserved1[4];
+ u8 seaminfo[128];
+ u8 seam_ready;
+ u8 seam_debug;
+ u8 p_seamldr_ready;
+ u8 reserved2[88];
+} __packed __aligned(P_SEAMLDR_INFO_ALIGNMENT);
+
+/*
+ * P-SEAMLDR SEAMCALL leaf function
+ */
+#define P_SEAMLDR_SEAMCALL_BASE BIT_ULL(63)
+#define P_SEAMCALL_SEAMLDR_INFO (P_SEAMLDR_SEAMCALL_BASE | 0x0)

struct tdx_module_output;
u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
--
2.35.1

2022-04-06 16:51:43

by Kai Huang

Subject: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

The TDX module is essentially a CPU-attested software module running
in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
host and certain physical attacks. The TDX module implements the
functions to build, tear down and start execution of the protected VMs
called Trusted Domains (TD). Before the TDX module can be used to
create and run TD guests, it must be loaded into the SEAM Range Register
(SEAMRR) and properly initialized. The TDX module is expected to be
loaded by BIOS before booting to the kernel, and the kernel is expected
to detect and initialize it, using the SEAMCALLs defined by TDX
architecture.

The TDX module can be initialized only once in its lifetime. Instead
of always initializing it at boot time, this implementation chooses an
on-demand approach, deferring TDX initialization until there is a real
need (e.g. when requested by KVM). This avoids consuming the memory
that must be allocated by the kernel and given to the TDX module as
metadata (~1/256th of the TDX-usable memory), and also saves the time
of initializing the TDX module (and the metadata) when TDX is not used
at all. Initializing the TDX module at runtime on demand is also more
flexible for supporting TDX module runtime updates in the future (after
updating the TDX module, it needs to be initialized again).

Introduce two placeholders tdx_detect() and tdx_init() to detect and
initialize the TDX module on demand, with a state machine introduced to
orchestrate the entire process (in case of multiple callers).

To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs. The
TDX module is reported as not loaded if either SEAMRR is not enabled,
or there are not enough TDX private KeyIDs to create any TD guest. The
TDX module itself requires one global TDX private KeyID to
crypto-protect its metadata.

And tdx_init() is currently empty. The TDX module will be initialized
in multi-steps defined by the TDX architecture:

1) Global initialization;
2) Logical-CPU scope initialization;
3) Enumerate the TDX module capabilities and platform configuration;
4) Configure the TDX module with usable memory ranges and global
KeyID information;
5) Package-scope configuration for the global KeyID;
6) Initialize usable memory ranges based on 4).

The TDX module can also be shut down at any time during its lifetime.
In case of any error during the initialization process, shut down the
module. It's pointless to leave the module in any intermediate state
during the initialization.

SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
operation (VMXON has been done), otherwise it generates #UD. So far
only KVM handles VMXON/VMXOFF. Choose to not handle VMXON/VMXOFF in
tdx_detect() and tdx_init() but depend on the caller to guarantee that,
since so far KVM is the only user of TDX. In the long term, more kernel
components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
module runtime update), so a reference-based approach to do VMXON/VMXOFF
is likely needed.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/include/asm/tdx.h | 4 +
arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
2 files changed, 226 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1f29813b1646..c8af2ba6bb8a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,

#ifdef CONFIG_INTEL_TDX_HOST
void tdx_detect_cpu(struct cpuinfo_x86 *c);
+int tdx_detect(void);
+int tdx_init(void);
#else
static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
+static inline int tdx_detect(void) { return -ENODEV; }
+static inline int tdx_init(void) { return -ENODEV; }
#endif /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ba2210001ea8..53093d4ad458 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -9,6 +9,8 @@

#include <linux/types.h>
#include <linux/cpumask.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
@@ -45,12 +47,33 @@
((u32)(((_keyid_part) & 0xffffffffull) + 1))
#define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))

+/*
+ * TDX module status during initialization
+ */
+enum tdx_module_status_t {
+ /* TDX module status is unknown */
+ TDX_MODULE_UNKNOWN,
+ /* TDX module is not loaded */
+ TDX_MODULE_NONE,
+ /* TDX module is loaded, but not initialized */
+ TDX_MODULE_LOADED,
+ /* TDX module is fully initialized */
+ TDX_MODULE_INITIALIZED,
+ /* TDX module is shutdown due to error during initialization */
+ TDX_MODULE_SHUTDOWN,
+};
+
/* BIOS must configure SEAMRR registers for all cores consistently */
static u64 seamrr_base, seamrr_mask;

static u32 tdx_keyid_start;
static u32 tdx_keyid_num;

+static enum tdx_module_status_t tdx_module_status;
+
+/* Prevent concurrent attempts on TDX detection and initialization */
+static DEFINE_MUTEX(tdx_module_lock);
+
static bool __seamrr_enabled(void)
{
return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
detect_seam(c);
detect_tdx_keyids(c);
}
+
+static bool seamrr_enabled(void)
+{
+ /*
+ * To detect any BIOS misconfiguration among cores, all logical
+ * cpus must have been brought up at least once. This is true
+ * unless 'maxcpus' kernel command line is used to limit the
+ * number of cpus to be brought up during boot time. However
+ * 'maxcpus' is basically an invalid operation mode due to the
+ * MCE broadcast problem, and it should not be used on a TDX
+ * capable machine. Just do paranoid check here and do not
+ * report SEAMRR as enabled in this case.
+ */
+ if (!cpumask_equal(&cpus_booted_once_mask,
+ cpu_present_mask))
+ return false;
+
+ return __seamrr_enabled();
+}
+
+static bool tdx_keyid_sufficient(void)
+{
+ if (!cpumask_equal(&cpus_booted_once_mask,
+ cpu_present_mask))
+ return false;
+
+ /*
+ * TDX requires at least two KeyIDs: one global KeyID to
+ * protect the metadata of the TDX module and one or more
+ * KeyIDs to run TD guests.
+ */
+ return tdx_keyid_num >= 2;
+}
+
+static int __tdx_detect(void)
+{
+ /* The TDX module is not loaded if SEAMRR is disabled */
+ if (!seamrr_enabled()) {
+ pr_info("SEAMRR not enabled.\n");
+ goto no_tdx_module;
+ }
+
+ /*
+ * Also do not report the TDX module as loaded if there are
+ * not enough TDX private KeyIDs to run any TD guests.
+ */
+ if (!tdx_keyid_sufficient()) {
+ pr_info("Number of TDX private KeyIDs too small: %u.\n",
+ tdx_keyid_num);
+ goto no_tdx_module;
+ }
+
+ /* Return -ENODEV until the TDX module is detected */
+no_tdx_module:
+ tdx_module_status = TDX_MODULE_NONE;
+ return -ENODEV;
+}
+
+static int init_tdx_module(void)
+{
+ /*
+ * Return -EFAULT until all steps of TDX module
+ * initialization are done.
+ */
+ return -EFAULT;
+}
+
+static void shutdown_tdx_module(void)
+{
+ /* TODO: Shut down the TDX module */
+ tdx_module_status = TDX_MODULE_SHUTDOWN;
+}
+
+static int __tdx_init(void)
+{
+ int ret;
+
+ /*
+ * Logical-cpu scope initialization requires calling one SEAMCALL
+ * on all logical cpus enabled by BIOS. Shutting down the TDX
+ * module also has such a requirement. Furthermore, configuring
+ * the key of the global KeyID requires calling one SEAMCALL for
+ * each package. For simplicity, disable CPU hotplug in the whole
+ * initialization process.
+ *
+ * It's perhaps better to check whether all BIOS-enabled cpus are
+ * online before starting initializing, and return early if not.
+ * But none of 'possible', 'present' and 'online' CPU masks
+ * represents BIOS-enabled cpus. For example, 'possible' mask is
+ * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
+ * Just let the SEAMCALL fail if not all BIOS-enabled cpus are
+ * online.
+ */
+ cpus_read_lock();
+
+ ret = init_tdx_module();
+
+ /*
+ * Shut down the TDX module in case of any error during the
+ * initialization process. It's meaningless to leave the TDX
+ * module in any middle state of the initialization process.
+ */
+ if (ret)
+ shutdown_tdx_module();
+
+ cpus_read_unlock();
+
+ return ret;
+}
+
+/**
+ * tdx_detect - Detect whether the TDX module has been loaded
+ *
+ * Detect whether the TDX module has been loaded and ready for
+ * initialization. Only call this function when all cpus are
+ * already in VMX operation.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * -0: The TDX module has been loaded and ready for
+ * initialization.
+ * * -ENODEV: The TDX module is not loaded.
+ * * -EPERM: CPU is not in VMX operation.
+ * * -EFAULT: Other internal fatal errors.
+ */
+int tdx_detect(void)
+{
+ int ret;
+
+ mutex_lock(&tdx_module_lock);
+
+ switch (tdx_module_status) {
+ case TDX_MODULE_UNKNOWN:
+ ret = __tdx_detect();
+ break;
+ case TDX_MODULE_NONE:
+ ret = -ENODEV;
+ break;
+ case TDX_MODULE_LOADED:
+ case TDX_MODULE_INITIALIZED:
+ ret = 0;
+ break;
+ case TDX_MODULE_SHUTDOWN:
+ ret = -EFAULT;
+ break;
+ default:
+ WARN_ON(1);
+ ret = -EFAULT;
+ }
+
+ mutex_unlock(&tdx_module_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_detect);
+
+/**
+ * tdx_init - Initialize the TDX module
+ *
+ * Initialize the TDX module to make it ready to run TD guests. This
+ * function should be called after tdx_detect() returns successful.
+ * Only call this function when all cpus are online and are in VMX
+ * operation. CPU hotplug is temporarily disabled internally.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * -0: The TDX module has been successfully initialized.
+ * * -ENODEV: The TDX module is not loaded.
+ * * -EPERM: The CPU which does SEAMCALL is not in VMX operation.
+ * * -EFAULT: Other internal fatal errors.
+ */
+int tdx_init(void)
+{
+ int ret;
+
+ mutex_lock(&tdx_module_lock);
+
+ switch (tdx_module_status) {
+ case TDX_MODULE_NONE:
+ ret = -ENODEV;
+ break;
+ case TDX_MODULE_LOADED:
+ ret = __tdx_init();
+ break;
+ case TDX_MODULE_INITIALIZED:
+ ret = 0;
+ break;
+ default:
+ ret = -EFAULT;
+ break;
+ }
+ mutex_unlock(&tdx_module_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_init);
--
2.35.1

2022-04-06 16:51:53

by Kai Huang

Subject: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. To support TDX, a new CPU mode called
Secure Arbitration Mode (SEAM) is added to Intel processors.

SEAM is an extension to the VMX architecture to define a new VMX root
operation (SEAM VMX root) and a new VMX non-root operation (SEAM VMX
non-root). SEAM VMX root operation is designed to host a CPU-attested
software module called the 'TDX module' which implements functions to
manage crypto-protected VMs called Trust Domains (TD). It is also
designed to host a CPU-attested software module called the 'Intel
Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX
module.

Software modules in SEAM VMX root run in a memory region defined by the
SEAM range register (SEAMRR). So the first step in detecting Intel TDX
is to detect the validity of SEAMRR.

The presence of SEAMRR is reported via a new SEAMRR bit (15) of the
IA32_MTRRCAP MSR. The SEAMRR range registers consist of a pair of MSRs:

IA32_SEAMRR_PHYS_BASE and IA32_SEAMRR_PHYS_MASK

BIOS is expected to configure SEAMRR with the same value across all
cores. In case of BIOS misconfiguration, detect and compare SEAMRR
on all cpus.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
crypto-protect TD guests. Part of the MKTME KeyID space is reserved as
"TDX private KeyIDs", or "TDX KeyIDs" for short. Similar to detecting
SEAMRR, detecting TDX private KeyIDs also needs to be done on all cpus
to detect any BIOS misconfiguration.

To start supporting TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support. Add a function to detect all TDX preliminaries
(SEAMRR, TDX private KeyIDs) for a given cpu when it is brought up. As
the first step, detect the validity of SEAMRR.

Also add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt in to TDX
host kernel support (to distinguish it from TDX guest kernel support).

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/Kconfig | 12 ++++
arch/x86/Makefile | 2 +
arch/x86/include/asm/tdx.h | 9 +++
arch/x86/kernel/cpu/intel.c | 3 +
arch/x86/virt/Makefile | 2 +
arch/x86/virt/vmx/Makefile | 2 +
arch/x86/virt/vmx/tdx/Makefile | 2 +
arch/x86/virt/vmx/tdx/tdx.c | 102 +++++++++++++++++++++++++++++++++
8 files changed, 134 insertions(+)
create mode 100644 arch/x86/virt/Makefile
create mode 100644 arch/x86/virt/vmx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7021ec725dd3..9113bf09f358 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1967,6 +1967,18 @@ config X86_SGX

If unsure, say N.

+config INTEL_TDX_HOST
+ bool "Intel Trust Domain Extensions (TDX) host support"
+ default n
+ depends on CPU_SUP_INTEL
+ depends on X86_64
+ help
+ Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+ host and certain physical attacks. This option enables necessary TDX
+ support in host kernel to run protected VMs.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 63d50f65b828..2ca3a2a36dc5 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -234,6 +234,8 @@ head-y += arch/x86/kernel/platform-quirks.o

libs-y += arch/x86/lib/

+core-y += arch/x86/virt/
+
# drivers-y are linked after core-y
drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
drivers-$(CONFIG_PCI) += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..1f29813b1646 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,8 @@

#ifndef __ASSEMBLY__

+#include <asm/processor.h>
+
/*
* Used to gather the output registers values of the TDCALL and SEAMCALL
* instructions when requesting services from the TDX module.
@@ -87,5 +89,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
return -ENODEV;
}
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+void tdx_detect_cpu(struct cpuinfo_x86 *c);
+#else
+static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..b142a640fb8e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -26,6 +26,7 @@
#include <asm/resctrl.h>
#include <asm/numa.h>
#include <asm/thermal.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <linux/topology.h>
@@ -715,6 +716,8 @@ static void init_intel(struct cpuinfo_x86 *c)
if (cpu_has(c, X86_FEATURE_TME))
detect_tme(c);

+ tdx_detect_cpu(c);
+
init_intel_misc_features(c);

if (tsx_ctrl_state == TSX_CTRL_ENABLE)
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..1e1fcd7d3bd1
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..1bd688684716
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..03f35c75f439
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cpumask.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/cpufeature.h>
+#include <asm/cpufeatures.h>
+#include <asm/tdx.h>
+
+/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
+#define MTRR_CAP_SEAMRR BIT(15)
+
+/* Core-scope Intel SEAMRR base and mask registers. */
+#define MSR_IA32_SEAMRR_PHYS_BASE 0x00001400
+#define MSR_IA32_SEAMRR_PHYS_MASK 0x00001401
+
+#define SEAMRR_PHYS_BASE_CONFIGURED BIT_ULL(3)
+#define SEAMRR_PHYS_MASK_ENABLED BIT_ULL(11)
+#define SEAMRR_PHYS_MASK_LOCKED BIT_ULL(10)
+
+#define SEAMRR_ENABLED_BITS \
+ (SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
+
+/* BIOS must configure SEAMRR registers for all cores consistently */
+static u64 seamrr_base, seamrr_mask;
+
+static bool __seamrr_enabled(void)
+{
+ return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
+}
+
+static void detect_seam_bsp(struct cpuinfo_x86 *c)
+{
+ u64 mtrrcap, base, mask;
+
+ /* SEAMRR is reported via MTRRcap */
+ if (!boot_cpu_has(X86_FEATURE_MTRR))
+ return;
+
+ rdmsrl(MSR_MTRRcap, mtrrcap);
+ if (!(mtrrcap & MTRR_CAP_SEAMRR))
+ return;
+
+ rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
+ if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
+ pr_info("SEAMRR base is not configured by BIOS\n");
+ return;
+ }
+
+ rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+ if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
+ pr_info("SEAMRR is not enabled by BIOS\n");
+ return;
+ }
+
+ seamrr_base = base;
+ seamrr_mask = mask;
+}
+
+static void detect_seam_ap(struct cpuinfo_x86 *c)
+{
+ u64 base, mask;
+
+ /*
+ * Don't bother to detect this AP if SEAMRR is not
+ * enabled after earlier detections.
+ */
+ if (!__seamrr_enabled())
+ return;
+
+ rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
+ rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+
+ if (base == seamrr_base && mask == seamrr_mask)
+ return;
+
+ pr_err("Inconsistent SEAMRR configuration by BIOS\n");
+ /* Mark SEAMRR as disabled. */
+ seamrr_base = 0;
+ seamrr_mask = 0;
+}
+
+static void detect_seam(struct cpuinfo_x86 *c)
+{
+ if (c == &boot_cpu_data)
+ detect_seam_bsp(c);
+ else
+ detect_seam_ap(c);
+}
+
+void tdx_detect_cpu(struct cpuinfo_x86 *c)
+{
+ detect_seam(c);
+}
--
2.35.1

2022-04-06 16:51:57

by Kai Huang

[permalink] [raw]
Subject: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

In order to provide crypto protection to guests, the TDX module uses
additional metadata to record things like which guest "owns" a given
page of memory. This metadata, referred as Physical Address Metadata
Table (PAMT), essentially serves as the 'struct page' for the TDX
module. PAMTs are not reserved by hardware upfront. They must be
allocated by the kernel and then given to the TDX module.

TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.
Each PAMT must be a physically contiguous area from the Convertible
Memory Regions (CMR). However, the PAMTs which track pages in one TDMR
do not need to reside within that TDMR but can be anywhere in CMRs.
If one PAMT overlaps with any TDMR, the overlapping part must be
reported as a reserved area in that particular TDMR.

Use alloc_contig_pages() since each PAMT must be a physically contiguous
area and may potentially be large (~1/256th of the size of the given TDMR).
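
As a rough illustration of the sizes involved (assuming the 16-byte PAMT
entry size that current TDX modules report -- the actual value is
enumerated from the module at runtime), a 1GB TDMR needs roughly:

	4K PAMT:  (1GB / 4KB) * 16B = 4MB	(the ~1/256th above)
	2M PAMT:  (1GB / 2MB) * 16B = 8KB
	1G PAMT:  (1GB / 1GB) * 16B = 16B	(rounded up to 4KB)

so the 4K PAMT dominates, and all three still fit in one contiguous
allocation.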

The current version of TDX supports at most 16 reserved areas per TDMR
to cover both PAMTs and potential memory holes within the TDMR. If many
PAMTs are allocated within a single TDMR, 16 reserved areas may not be
sufficient to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

- Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
the total number of reserved areas consumed for PAMTs.
- Try to first allocate PAMT from the local node of the TDMR for better
NUMA locality.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
2 files changed, 166 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7414625b938f..ff68d0829bd7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
depends on CPU_SUP_INTEL
depends on X86_64
select NUMA_KEEP_MEMINFO if NUMA
+ depends on CONTIG_ALLOC
help
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 82534e70df96..1b807dcbc101 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -21,6 +21,7 @@
#include <asm/cpufeatures.h>
#include <asm/virtext.h>
#include <asm/e820/api.h>
+#include <asm/pgtable.h>
#include <asm/tdx.h>
#include "tdx.h"

@@ -66,6 +67,16 @@
#define TDMR_START(_tdmr) ((_tdmr)->base)
#define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size)

+/* Page sizes supported by TDX */
+enum tdx_page_sz {
+ TDX_PG_4K = 0,
+ TDX_PG_2M,
+ TDX_PG_1G,
+ TDX_PG_MAX,
+};
+
+#define TDX_HPAGE_SHIFT 9
+
/*
* TDX module status during initialization
*/
@@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
return ret;
}

+/* Calculate PAMT size given a TDMR and a page size */
+static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
+ enum tdx_page_sz pgsz)
+{
+ unsigned long pamt_sz;
+
+ pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
+ tdx_sysinfo.pamt_entry_size;
+ /* PAMT size must be 4K aligned */
+ pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+ return pamt_sz;
+}
+
+/* Calculate the size of all PAMTs for a TDMR */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
+{
+ enum tdx_page_sz pgsz;
+ unsigned long pamt_sz;
+
+ pamt_sz = 0;
+ for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
+ pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
+
+ return pamt_sz;
+}
+
+/*
+ * Locate the NUMA node containing the start of the given TDMR's first
+ * RAM entry. The given TDMR may also cover memory in other NUMA nodes.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr)
+{
+ u64 start, end;
+ int i;
+
+ /* Find the first RAM entry covered by the TDMR */
+ e820_for_each_mem(i, start, end)
+ if (end > TDMR_START(tdmr))
+ break;
+
+ /*
+ * One TDMR must cover at least one (or partial) RAM entry,
+ * otherwise it is a kernel bug. WARN_ON_ONCE() in this case.
+ */
+ if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
+ return 0;
+
+ /*
+ * The first RAM entry may be partially covered by the previous
+ * TDMR. In this case, use TDMR's start to find the NUMA node.
+ */
+ if (start < TDMR_START(tdmr))
+ start = TDMR_START(tdmr);
+
+ return phys_to_target_node(start);
+}
+
+static int tdmr_setup_pamt(struct tdmr_info *tdmr)
+{
+ unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
+ unsigned long pamt_sz[TDX_PG_MAX];
+ unsigned long pamt_npages;
+ struct page *pamt;
+ enum tdx_page_sz pgsz;
+ int nid;
+
+ /*
+ * Allocate one chunk of physically contiguous memory for all
+ * PAMTs. This helps minimize the PAMT's use of reserved areas
+ * in overlapped TDMRs.
+ */
+ nid = tdmr_get_nid(tdmr);
+ pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
+ pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
+ &node_online_map);
+ if (!pamt)
+ return -ENOMEM;
+
+ /* Calculate PAMT base and size for all supported page sizes. */
+ tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+ for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
+ unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
+
+ pamt_base[pgsz] = tdmr_pamt_base;
+ pamt_sz[pgsz] = sz;
+
+ tdmr_pamt_base += sz;
+ }
+
+ tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
+ tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
+ tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
+ tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
+ tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
+ tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];
+
+ return 0;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+ unsigned long pamt_pfn, pamt_sz;
+
+ pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;
+ pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+ /* Do nothing if PAMT hasn't been allocated for this TDMR */
+ if (!pamt_sz)
+ return;
+
+ if (WARN_ON(!pamt_pfn))
+ return;
+
+ free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+ int i;
+
+ for (i = 0; i < tdmr_num; i++)
+ tdmr_free_pamt(tdmr_array[i]);
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+ int i, ret;
+
+ for (i = 0; i < tdmr_num; i++) {
+ ret = tdmr_setup_pamt(tdmr_array[i]);
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+ return -ENOMEM;
+}
+
static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
{
int ret;
@@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
if (ret)
goto err;

+ ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
+ if (ret)
+ goto err_free_tdmrs;
+
/* Return -EFAULT until constructing TDMRs is done */
ret = -EFAULT;
+ tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
+err_free_tdmrs:
free_tdmrs(tdmr_array, *tdmr_num);
err:
return ret;
@@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
* initialization are done.
*/
ret = -EFAULT;
+ /*
+ * Free PAMTs allocated in construct_tdmrs() when TDX module
+ * initialization fails.
+ */
+ if (ret)
+ tdmrs_free_pamt_all(tdmr_array, tdmr_num);
out_free_tdmrs:
/*
* TDMRs are only used during initializing TDX module. Always
--
2.35.1

2022-04-06 16:53:09

by Kai Huang

[permalink] [raw]
Subject: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function

Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are isolated from legacy VMX root
and VMX non-root mode.

A CPU-attested software module (called the 'TDX module') runs in SEAM
VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
SEAM VMX root is also used to host another CPU-attested software module
(called the 'P-SEAMLDR') to load and update the TDX module.

Host kernel transits to either the P-SEAMLDR or the TDX module via the
new SEAMCALL instruction. SEAMCALL leaf functions are host-side
interface functions defined by the P-SEAMLDR and the TDX module around
the new SEAMCALL instruction. They are similar to a hypercall, except
they are made by host kernel to the SEAM software.

SEAMCALL leaf functions use an ABI different from the x86-64 System V
ABI. Instead, they share the same ABI with the TDCALL leaf functions.
%rax is used to carry both the SEAMCALL leaf function number (input) and
the completion status code (output). Additional GPRs (%rcx, %rdx,
%r8->%r11) may be further used as both input and output operands in
individual leaf functions.

Implement a C function __seamcall() to do SEAMCALL leaf functions using
the assembly macro used by __tdx_module_call() (the implementation of
TDCALL leaf functions). The only exception not covered here is TDENTER
leaf function which takes all GPRs and XMM0-XMM15 as both input and
output. The caller of TDENTER should implement its own logic to call
TDENTER directly instead of using this function.

SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
root, and it can fail with VMfailInvalid, for instance, when the SEAM
software module is not loaded. The C function __seamcall() returns
TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
code of SEAMCALLs, to uniquely represent this case.
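
As a purely illustrative sketch (the leaf number 0 below is made up; the
real SEAMCALL wrappers come in later patches), a caller would use it like:

	u64 ret;
	struct tdx_module_output out;

	/* Leaf number in the first argument, leaf operands in the rest */
	ret = __seamcall(0, 0, 0, 0, 0, &out);
	if (ret == TDX_SEAMCALL_VMFAILINVALID) {
		/* No SEAM software (P-SEAMLDR/TDX module) is loaded */
	} else if (ret) {
		/* Leaf-specific completion status code */
	}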

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/Makefile | 2 +-
arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 11 +++++++
3 files changed, 64 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 1bd688684716..fd577619620e 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..327961b2dd5a
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ * (the P-SEAMLDR or the TDX module)
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI. Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
+ * the SEAMCALL. Additional output operands are saved in @out (if it is
+ * provided by caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * used by the P-SEAMLDR or the TDX
+ * module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+ FRAME_BEGIN
+ TDX_MODULE_CALL host=1
+ FRAME_END
+ ret
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..9d5b6f554c20
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/types.h>
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+#endif
--
2.35.1

2022-04-06 16:53:26

by Kai Huang

[permalink] [raw]
Subject: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.

A new MSR (IA32_MKTME_KEYID_PARTITIONING) enumerates how the MKTME
"KeyID" space is partitioned between TDX and legacy MKTME.
KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs'
for short.

The new MSR is per package and BIOS is responsible for partitioning
MKTME KeyIDs and TDX KeyIDs consistently among all packages.
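
Put concretely, the detection boils down to the following (a sketch
equivalent to the TDX_KEYID_START()/TDX_KEYID_NUM() macros added below):

	u64 keyid_part;

	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
	/* Bits 31:0: number of MKTME KeyIDs; TDX KeyIDs start after them */
	tdx_keyid_start = (u32)(keyid_part & 0xffffffffull) + 1;
	/* Bits 63:32: number of TDX private KeyIDs */
	tdx_keyid_num = (u32)(keyid_part >> 32);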

Detect TDX private KeyIDs in preparation for initializing TDX. Similar
to detecting SEAMRR, do the detection on all cpus to catch any potential
BIOS misconfiguration among packages.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 72 +++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 03f35c75f439..ba2210001ea8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -29,9 +29,28 @@
#define SEAMRR_ENABLED_BITS \
(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)

+/*
+ * Intel Trusted Domain CPU Architecture Extension spec:
+ *
+ * IA32_MKTME_KEYID_PARTITIONING:
+ *
+ * Bit [31:0]: number of MKTME KeyIDs.
+ * Bit [63:32]: number of TDX private KeyIDs.
+ *
+ * TDX private KeyIDs start after the last MKTME KeyID.
+ */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
+
+#define TDX_KEYID_START(_keyid_part) \
+ ((u32)(((_keyid_part) & 0xffffffffull) + 1))
+#define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))
+
/* BIOS must configure SEAMRR registers for all cores consistently */
static u64 seamrr_base, seamrr_mask;

+static u32 tdx_keyid_start;
+static u32 tdx_keyid_num;
+
static bool __seamrr_enabled(void)
{
return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -96,7 +115,60 @@ static void detect_seam(struct cpuinfo_x86 *c)
detect_seam_ap(c);
}

+static void detect_tdx_keyids_bsp(struct cpuinfo_x86 *c)
+{
+ u64 keyid_part;
+
+ /* TDX is built on MKTME, which is based on TME */
+ if (!boot_cpu_has(X86_FEATURE_TME))
+ return;
+
+ if (rdmsrl_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &keyid_part))
+ return;
+
+ /* If MSR value is 0, TDX is not enabled by BIOS. */
+ if (!keyid_part)
+ return;
+
+ tdx_keyid_num = TDX_KEYID_NUM(keyid_part);
+ if (!tdx_keyid_num)
+ return;
+
+ tdx_keyid_start = TDX_KEYID_START(keyid_part);
+}
+
+static void detect_tdx_keyids_ap(struct cpuinfo_x86 *c)
+{
+ u64 keyid_part;
+
+ /*
+ * Don't bother to detect this AP if TDX KeyIDs are
+ * not detected or cleared after earlier detections.
+ */
+ if (!tdx_keyid_num)
+ return;
+
+ rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
+
+ if ((tdx_keyid_start == TDX_KEYID_START(keyid_part)) &&
+ (tdx_keyid_num == TDX_KEYID_NUM(keyid_part)))
+ return;
+
+ pr_err("Inconsistent TDX KeyID configuration among packages by BIOS\n");
+ tdx_keyid_start = 0;
+ tdx_keyid_num = 0;
+}
+
+static void detect_tdx_keyids(struct cpuinfo_x86 *c)
+{
+ if (c == &boot_cpu_data)
+ detect_tdx_keyids_bsp(c);
+ else
+ detect_tdx_keyids_ap(c);
+}
+
void tdx_detect_cpu(struct cpuinfo_x86 *c)
{
detect_seam(c);
+ detect_tdx_keyids(c);
}
--
2.35.1

2022-04-15 06:33:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, 2022-04-06 at 16:49 +1200, Kai Huang wrote:
> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This series provides support for
> initializing the TDX module in the host kernel. KVM support for TDX is
> being developed separately[1].
>
> The code has been tested on couple of TDX-capable machines. I would
> consider it as ready for review. I highly appreciate if anyone can help
> to review this series (from high level design to detail implementations).
> For Intel reviewers (CC'ed), please help to review, and I would
> appreciate Reviewed-by or Acked-by tags if the patches look good to you.

Hi Intel reviewers,

Kindly ping. Could you help to review?

--
Thanks,
-Kai


2022-04-19 14:28:05

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs

On Mon, 2022-04-18 at 22:39 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > + * Intel Trusted Domain CPU Architecture Extension spec:
>
> In TDX guest code, we have been using TDX as "Intel Trust Domain
> Extensions". It also aligns with spec. Maybe you should change
> your patch set to use the same.
>

Yeah will change to use "Intel Trust Domain ...". Thanks.

--
Thanks,
-Kai


Subject: Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs



On 4/5/22 9:49 PM, Kai Huang wrote:
> detect_seam(c);
> + detect_tdx_keyids(c);

Do you want to add some return value to detect_seam() and not
proceed if it fails?

In case if this function is going to be extended by future
patch set, maybe do the same for detect_tdx_keyids()?

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-19 18:26:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM


> > +
> > +static void detect_seam(struct cpuinfo_x86 *c)
> > +{
>
> why not do this check directly in tdx_detect_cpu()?

The second patch will detect TDX KeyID too. I suppose you are saying below is
better?

void tdx_detect_cpu(struct cpuinfo_x86 *c)
{
if (c == &boot_cpu_data) {
detect_seam_bsp(c);
detect_tdx_keyids_bsp(c);
} else {
detect_seam_ap(c);
detect_tdx_keyids_ap(c);
}
}

I personally don't see how the above is better than the current way. Instead,
I think keeping the SEAM and TDX KeyID detection code in their own functions
is more flexible for future extension (if needed).


>
> > + if (c == &boot_cpu_data)
> > + detect_seam_bsp(c);
> > + else
> > + detect_seam_ap(c);
> > +}
> > +
> > +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > +{
> > + detect_seam(c);
> > +}
>

2022-04-19 19:01:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On Mon, Apr 18, 2022, Sathyanarayanan Kuppuswamy wrote:
> > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > +{
> > + u64 base, mask;
> > +
> > + /*
> > + * Don't bother to detect this AP if SEAMRR is not
> > + * enabled after earlier detections.
> > + */
> > + if (!__seamrr_enabled())
> > + return;
> > +
> > + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > +
> > + if (base == seamrr_base && mask == seamrr_mask)
> > + return;
> > +
> > + pr_err("Inconsistent SEAMRR configuration by BIOS\n");
>
> Do we need to panic for SEAM config issue (for security)?

No, clearing seamrr_mask will effectively prevent the kernel from attempting to
use TDX or any other feature that might depend on SEAM. Panicking because the
user's BIOS is crappy would be kicking them while they're down.

As for security, it's the TDX Module's responsibility to validate the security
properties of the system, the kernel only cares about not dying/crashing.

> > + /* Mark SEAMRR as disabled. */
> > + seamrr_base = 0;
> > + seamrr_mask = 0;
> > +}

2022-04-20 16:01:28

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs

On Mon, 2022-04-18 at 22:42 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > detect_seam(c);
> > + detect_tdx_keyids(c);
>
> Do you want to add some return value to detect_seam() and not
> proceed if it fails?

I don't think this function needs to return a value. However it may make sense
to stop detecting TDX KeyIDs when SEAMRR is detected as not enabled on some cpu
(i.e. on the BSP when SEAMRR is not enabled by BIOS, or on any AP when there's
a BIOS bug where BIOS doesn't configure SEAMRR consistently on all cpus). The
reason is TDX KeyIDs can only be accessed by software running in SEAM mode. So
if the SEAMRR configuration is broken, the TDX KeyID configuration probably is
broken too.

However detect_tdx_keyids() essentially only uses rdmsr_safe() to read some MSR,
so if there's any problem, rdmsr_safe() will catch it. And SEAMRR is always
checked before doing any TDX related stuff later, therefore in practice there
will be no problem. But anyway I guess there's no harm in adding an additional
SEAMRR check in detect_tdx_keyids(). I'll think more on this. Thanks.
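
Something like the below, roughly (an untested sketch; detect_seam() runs
first in tdx_detect_cpu(), so seamrr_base/seamrr_mask are already settled
by the time this check runs):

	static void detect_tdx_keyids(struct cpuinfo_x86 *c)
	{
		/* TDX KeyIDs are only usable by software running in SEAM */
		if (!__seamrr_enabled())
			return;

		if (c == &boot_cpu_data)
			detect_tdx_keyids_bsp(c);
		else
			detect_tdx_keyids_ap(c);
	}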

>
> In case if this function is going to be extended by future
> patch set, maybe do the same for detect_tdx_keyids()?
>

I'd prefer leaving this the current way until there's a real need.

--
Thanks,
-Kai


Subject: Re: [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization



On 4/5/22 9:49 PM, Kai Huang wrote:
> Do the TDX module global initialization which requires calling
> TDH.SYS.INIT once on any logical cpu.

IMO, you could add some more background details to this commit log. Like
why you are doing it and what it does?. I know that you already
explained some background in previous patches. But including brief
details here will help to review the commit without checking the
previous commits.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-21 21:22:09

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function

On Wed, 2022-04-20 at 00:29 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/19/22 9:16 PM, Kai Huang wrote:
> > On Tue, 2022-04-19 at 07:07 -0700, Sathyanarayanan Kuppuswamy wrote:
> > >
> > > On 4/5/22 9:49 PM, Kai Huang wrote:
> > > > SEAMCALL leaf functions use an ABI different from the x86-64 System V
> > > > ABI. Instead, they share the same ABI with the TDCALL leaf functions.
> > >
> > > TDCALL is a new term for this patch set. Maybe add some detail about
> > > it in ()?.
> > >
> > > >
> >
> > TDCALL implementation is already in tip/tdx. This series will be rebased to it.
> > I don't think we need to explain more about something that is already in the tip
> > tree?
>
> Since you have already expanded terms like TD,TDX and SEAM in this patch
> set, I thought you wanted to explain TDX terms to make it easy for new
> readers. So to keep it uniform, I have suggested adding some brief
> details about the TDCALL.
>
>

All right. I can add one sentence to explain it.

--
Thanks,
-Kai


Subject: Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function



On 4/5/22 9:49 PM, Kai Huang wrote:
> SEAMCALL leaf functions use an ABI different from the x86-64 System V
> ABI. Instead, they share the same ABI with the TDCALL leaf functions.

TDCALL is a new term for this patch set. Maybe add some detail about
it in ()?.

> %rax is used to carry both the SEAMCALL leaf function number (input) and
> the completion status code (output). Additional GPRs (%rcx, %rdx,
> %r8->%r11) may be further used as both input and output operands in
> individual leaf functions.



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-22 07:27:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On 4/19/22 21:37, Kai Huang wrote:
> On Tue, 2022-04-19 at 07:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> The TDX module is essentially a CPU-attested software module running
>>> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
>>> host and certain physical attacks. The TDX module implements the
>> /s/host/hosts
> I don't quite get. Could you explain why there are multiple hosts?

This one is an arbitrary language tweak. This:

to protect VMs from malicious host and certain physical attacks.

could also be written:

to protect VMs from malicious host attacks and certain physical
attacks.

But, it's somewhat more compact to do what was written. I agree that the
language is a bit clumsy and could be cleaned up, but just doing
s/host/hosts/ doesn't really improve anything.

2022-04-22 14:12:33

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

>
> > > > +
> > > > +/**
> > > > + * tdx_detect - Detect whether the TDX module has been loaded
> > > > + *
> > > > + * Detect whether the TDX module has been loaded and ready for
> > > > + * initialization. Only call this function when all cpus are
> > > > + * already in VMX operation.
> > > > + *
> > > > + * This function can be called in parallel by multiple callers.
> > > > + *
> > > > + * Return:
> > > > + *
> > > > + * * -0: The TDX module has been loaded and ready for
> > > > + * initialization.
> > > > + * * -ENODEV: The TDX module is not loaded.
> > > > + * * -EPERM: CPU is not in VMX operation.
> > > > + * * -EFAULT: Other internal fatal errors.
> > > > + */
> > > > +int tdx_detect(void)
> > >
> > > Will this function be used separately or always along with
> > > tdx_init()?
> >
> > The caller should first use tdx_detect() and then use tdx_init(). If the
> > caller only uses tdx_detect(), then the TDX module won't be initialized
> > (unless another caller does so). If the caller calls tdx_init() before
> > tdx_detect(), it will get an error.
> >
>
> I just checked your patch set to understand where you are using
> tdx_detect()/tdx_init(). But I did not find any callers. Did I miss it?
> or it is not used in your patch set?
>

No you didn't. They are not called in this series. The KVM series, which is
under the upstream process by Isaku, will call them. Dave once said having no
caller is fine in this particular case since people know KVM is going to use
them. In the cover letter I also mentioned KVM support is under development in
another series. In the next version's cover letter, I'll explicitly call out
that this series doesn't have callers of them but depends on KVM to call them.
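
For reference, the expected (hypothetical, KVM-side) usage would look
something like:

	/* All cpus must already be in VMX operation (VMXON done) */
	ret = tdx_detect();
	if (!ret)
		ret = tdx_init();
	if (ret)
		/* TDX is unusable: not loaded, shut down, etc. */
		return ret;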


--
Thanks,
-Kai


Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand



On 4/5/22 9:49 PM, Kai Huang wrote:
> The TDX module is essentially a CPU-attested software module running
> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> host and certain physical attacks. The TDX module implements the

/s/host/hosts

> functions to build, tear down and start execution of the protected VMs
> called Trusted Domains (TD). Before the TDX module can be used to

/s/Trusted/Trust

> create and run TD guests, it must be loaded into the SEAM Range Register
> (SEAMRR) and properly initialized. The TDX module is expected to be
> loaded by BIOS before booting to the kernel, and the kernel is expected
> to detect and initialize it, using the SEAMCALLs defined by TDX
> architecture.
>
> The TDX module can be initialized only once in its lifetime. Instead
> of always initializing it at boot time, this implementation chooses an
> on-demand approach to initialize TDX until there is a real need (e.g
> when requested by KVM). This avoids consuming the memory that must be
> allocated by kernel and given to the TDX module as metadata (~1/256th of

allocated by the kernel

> the TDX-usable memory), and also saves the time of initializing the TDX
> module (and the metadata) when TDX is not used at all. Initializing the
> TDX module at runtime on-demand also is more flexible to support TDX
> module runtime updating in the future (after updating the TDX module, it
> needs to be initialized again).
>
> Introduce two placeholders tdx_detect() and tdx_init() to detect and
> initialize the TDX module on demand, with a state machine introduced to
> orchestrate the entire process (in case of multiple callers).
>
> To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs. The
> TDX module is reported as not loaded if either SEAMRR is not enabled, or
> there are not enough TDX private KeyIDs to create any TD guest. The TDX
> module itself requires one global TDX private KeyID to crypto protect
> its metadata.
>
> And tdx_init() is currently empty. The TDX module will be initialized
> in multi-steps defined by the TDX architecture:
>
> 1) Global initialization;
> 2) Logical-CPU scope initialization;
> 3) Enumerate the TDX module capabilities and platform configuration;
> 4) Configure the TDX module about usable memory ranges and global
> KeyID information;
> 5) Package-scope configuration for the global KeyID;
> 6) Initialize usable memory ranges based on 4).
>
> The TDX module can also be shut down at any time during its lifetime.
> In case of any error during the initialization process, shut down the
> module. It's pointless to leave the module in any intermediate state
> during the initialization.
>
> SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
> operation (VMXON has been done), otherwise it generates #UD. So far
> only KVM handles VMXON/VMXOFF. Choose to not handle VMXON/VMXOFF in
> tdx_detect() and tdx_init() but depend on the caller to guarantee that,
> since so far KVM is the only user of TDX. In the long term, more kernel
> components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
> module runtime update), so a reference-based approach to do VMXON/VMXOFF
> is likely needed.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
> arch/x86/include/asm/tdx.h | 4 +
> arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 226 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 1f29813b1646..c8af2ba6bb8a 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>
> #ifdef CONFIG_INTEL_TDX_HOST
> void tdx_detect_cpu(struct cpuinfo_x86 *c);
> +int tdx_detect(void);
> +int tdx_init(void);
> #else
> static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
> +static inline int tdx_detect(void) { return -ENODEV; }
> +static inline int tdx_init(void) { return -ENODEV; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ba2210001ea8..53093d4ad458 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -9,6 +9,8 @@
>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> +#include <linux/mutex.h>
> +#include <linux/cpu.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/cpufeature.h>
> @@ -45,12 +47,33 @@
> ((u32)(((_keyid_part) & 0xffffffffull) + 1))
> #define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))
>
> +/*
> + * TDX module status during initialization
> + */
> +enum tdx_module_status_t {
> + /* TDX module status is unknown */
> + TDX_MODULE_UNKNOWN,
> + /* TDX module is not loaded */
> + TDX_MODULE_NONE,
> + /* TDX module is loaded, but not initialized */
> + TDX_MODULE_LOADED,
> + /* TDX module is fully initialized */
> + TDX_MODULE_INITIALIZED,
> + /* TDX module is shutdown due to error during initialization */
> + TDX_MODULE_SHUTDOWN,
> +};
> +

Maybe adding these states when you really need them will make
more sense. Currently this patch only uses the SHUTDOWN and
NONE states. Other state usage is not very clear.

> /* BIOS must configure SEAMRR registers for all cores consistently */
> static u64 seamrr_base, seamrr_mask;
>
> static u32 tdx_keyid_start;
> static u32 tdx_keyid_num;
>
> +static enum tdx_module_status_t tdx_module_status;
> +
> +/* Prevent concurrent attempts on TDX detection and initialization */
> +static DEFINE_MUTEX(tdx_module_lock);

Any possible concurrent usage models?

> +
> static bool __seamrr_enabled(void)
> {
> return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> @@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
> detect_seam(c);
> detect_tdx_keyids(c);
> }
> +
> +static bool seamrr_enabled(void)
> +{
> + /*
> + * To detect any BIOS misconfiguration among cores, all logical
> + * cpus must have been brought up at least once. This is true
> + * unless 'maxcpus' kernel command line is used to limit the
> + * number of cpus to be brought up during boot time. However
> + * 'maxcpus' is basically an invalid operation mode due to the
> + * MCE broadcast problem, and it should not be used on a TDX
> + * capable machine. Just do paranoid check here and do not

a paranoid check

> + * report SEAMRR as enabled in this case.
> + */
> + if (!cpumask_equal(&cpus_booted_once_mask,
> + cpu_present_mask))
> + return false;
> +
> + return __seamrr_enabled();
> +}
> +
> +static bool tdx_keyid_sufficient(void)
> +{
> + if (!cpumask_equal(&cpus_booted_once_mask,
> + cpu_present_mask))
> + return false;
> +
> + /*
> + * TDX requires at least two KeyIDs: one global KeyID to
> + * protect the metadata of the TDX module and one or more
> + * KeyIDs to run TD guests.
> + */
> + return tdx_keyid_num >= 2;
> +}
> +
> +static int __tdx_detect(void)
> +{
> + /* The TDX module is not loaded if SEAMRR is disabled */
> + if (!seamrr_enabled()) {
> + pr_info("SEAMRR not enabled.\n");
> + goto no_tdx_module;
> + }
> +
> + /*
> + * Also do not report the TDX module as loaded if there's
> + * not enough TDX private KeyIDs to run any TD guests.
> + */

You are not returning TDX_MODULE_LOADED under any current
scenarios. So I think the above comment is not accurate.

> + if (!tdx_keyid_sufficient()) {
> + pr_info("Number of TDX private KeyIDs too small: %u.\n",
> + tdx_keyid_num);
> + goto no_tdx_module;
> + }
> +
> + /* Return -ENODEV until the TDX module is detected */
> +no_tdx_module:
> + tdx_module_status = TDX_MODULE_NONE;
> + return -ENODEV;
> +}
> +
> +static int init_tdx_module(void)
> +{
> + /*
> + * Return -EFAULT until all steps of TDX module
> + * initialization are done.
> + */
> + return -EFAULT;
> +}
> +
> +static void shutdown_tdx_module(void)
> +{
> + /* TODO: Shut down the TDX module */
> + tdx_module_status = TDX_MODULE_SHUTDOWN;
> +}
> +
> +static int __tdx_init(void)
> +{
> + int ret;
> +
> + /*
> + * Logical-cpu scope initialization requires calling one SEAMCALL
> + * on all logical cpus enabled by BIOS. Shutting down the TDX
> + * module also has such requirement. Further more, configuring

such a requirement

> + * the key of the global KeyID requires calling one SEAMCALL for
> + * each package. For simplicity, disable CPU hotplug in the whole
> + * initialization process.
> + *
> + * It's perhaps better to check whether all BIOS-enabled cpus are
> + * online before starting initializing, and return early if not.
> + * But none of 'possible', 'present' and 'online' CPU masks
> + * represents BIOS-enabled cpus. For example, 'possible' mask is
> + * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> + * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> + * online.
> + */
> + cpus_read_lock();
> +
> + ret = init_tdx_module();
> +
> + /*
> + * Shut down the TDX module in case of any error during the
> + * initialization process. It's meaningless to leave the TDX
> + * module in any middle state of the initialization process.
> + */
> + if (ret)
> + shutdown_tdx_module();
> +
> + cpus_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * tdx_detect - Detect whether the TDX module has been loaded
> + *
> + * Detect whether the TDX module has been loaded and ready for
> + * initialization. Only call this function when all cpus are
> + * already in VMX operation.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0: The TDX module has been loaded and ready for
> + * initialization.
> + * * -ENODEV: The TDX module is not loaded.
> + * * -EPERM: CPU is not in VMX operation.
> + * * -EFAULT: Other internal fatal errors.
> + */
> +int tdx_detect(void)

Will this function be used separately or always along with
tdx_init()?

> +{
> + int ret;
> +
> + mutex_lock(&tdx_module_lock);
> +
> + switch (tdx_module_status) {
> + case TDX_MODULE_UNKNOWN:
> + ret = __tdx_detect();
> + break;
> + case TDX_MODULE_NONE:
> + ret = -ENODEV;
> + break;
> + case TDX_MODULE_LOADED:
> + case TDX_MODULE_INITIALIZED:
> + ret = 0;
> + break;
> + case TDX_MODULE_SHUTDOWN:
> + ret = -EFAULT;
> + break;
> + default:
> + WARN_ON(1);
> + ret = -EFAULT;
> + }
> +
> + mutex_unlock(&tdx_module_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_detect);
> +
> +/**
> + * tdx_init - Initialize the TDX module

If it is for TDX module initialization, why not call it
tdx_module_init()? If not, update the description
appropriately.

> + *
> + * Initialize the TDX module to make it ready to run TD guests. This
> + * function should be called after tdx_detect() returns successful.
> + * Only call this function when all cpus are online and are in VMX
> + * operation. CPU hotplug is temporarily disabled internally.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0: The TDX module has been successfully initialized.
> + * * -ENODEV: The TDX module is not loaded.
> + * * -EPERM: The CPU which does SEAMCALL is not in VMX operation.
> + * * -EFAULT: Other internal fatal errors.
> + */

Do you return different error values just for debug prints, or are there
other uses for them?

> +int tdx_init(void)
> +{
> + int ret;
> +
> + mutex_lock(&tdx_module_lock);
> +
> + switch (tdx_module_status) {
> + case TDX_MODULE_NONE:
> + ret = -ENODEV;
> + break;
> + case TDX_MODULE_LOADED:

> + ret = __tdx_init();
> + break;
> + case TDX_MODULE_INITIALIZED:
> + ret = 0;
> + break;
> + default:
> + ret = -EFAULT;
> + break;
> + }
> + mutex_unlock(&tdx_module_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_init);

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-22 18:10:59

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function

On Tue, 2022-04-19 at 07:07 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > SEAMCALL leaf functions use an ABI different from the x86-64 System V
> > ABI. Instead, they share the same ABI with the TDCALL leaf functions.
>
> TDCALL is a new term for this patch set. Maybe add some detail about
> it in ()?.
>
> >

TDCALL implementation is already in tip/tdx. This series will be rebased to it.
I don't think we need to explain more about something that is already in the tip
tree?


--
Thanks,
-Kai


2022-04-22 18:14:10

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization

On Wed, 2022-04-20 at 15:27 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > Do the TDX module global initialization which requires calling
> > TDH.SYS.INIT once on any logical cpu.
>
> IMO, you could add some more background details to this commit log. Like
> why you are doing it and what it does?. I know that you already
> explained some background in previous patches. But including brief
> details here will help to review the commit without checking the
> previous commits.
>

OK I guess I can add "the first step is global initialization", etc.

--
Thanks,
-Kai


Subject: Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function



On 4/19/22 9:16 PM, Kai Huang wrote:
> On Tue, 2022-04-19 at 07:07 -0700, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> SEAMCALL leaf functions use an ABI different from the x86-64 System V
>>> ABI. Instead, they share the same ABI with the TDCALL leaf functions.
>>
>> TDCALL is a new term for this patch set. Maybe add some detail about
>> it in ()?.
>>
>>>
>
> TDCALL implementation is already in tip/tdx. This series will be rebased to it.
> I don't think we need to explain more about something that is already in the tip
> tree?

Since you have already expanded terms like TD,TDX and SEAM in this patch
set, I thought you wanted to explain TDX terms to make it easy for new
readers. So to keep it uniform, I have suggested adding some brief
details about the TDCALL.

But I am fine either way.

>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-22 19:12:21

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On Tue, 2022-04-19 at 07:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > The TDX module is essentially a CPU-attested software module running
> > in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> > host and certain physical attacks. The TDX module implements the
>
> /s/host/hosts

I don't quite get. Could you explain why there are multiple hosts?

>
> > functions to build, tear down and start execution of the protected VMs
> > called Trusted Domains (TD). Before the TDX module can be used to
>
> /s/Trusted/Trust

Thanks.

>
> > create and run TD guests, it must be loaded into the SEAM Range Register
> > (SEAMRR) and properly initialized. The TDX module is expected to be
> > loaded by BIOS before booting to the kernel, and the kernel is expected
> > to detect and initialize it, using the SEAMCALLs defined by TDX
> > architecture.
> >
> > The TDX module can be initialized only once in its lifetime. Instead
> > of always initializing it at boot time, this implementation chooses an
> > on-demand approach to initialize TDX until there is a real need (e.g
> > when requested by KVM). This avoids consuming the memory that must be
> > allocated by kernel and given to the TDX module as metadata (~1/256th of
>
> allocated by the kernel

Ok.

>
> > the TDX-usable memory), and also saves the time of initializing the TDX
> > module (and the metadata) when TDX is not used at all. Initializing the
> > TDX module at runtime on-demand also is more flexible to support TDX
> > module runtime updating in the future (after updating the TDX module, it
> > needs to be initialized again).
> >
> > Introduce two placeholders tdx_detect() and tdx_init() to detect and
> > initialize the TDX module on demand, with a state machine introduced to
> > orchestrate the entire process (in case of multiple callers).
> >
> > To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs. The
> > TDX module is reported as not loaded if either SEAMRR is not enabled, or
> > there are not enough TDX private KeyIDs to create any TD guest. The TDX
> > module itself requires one global TDX private KeyID to crypto protect
> > its metadata.
> >
> > And tdx_init() is currently empty. The TDX module will be initialized
> > in multi-steps defined by the TDX architecture:
> >
> > 1) Global initialization;
> > 2) Logical-CPU scope initialization;
> > 3) Enumerate the TDX module capabilities and platform configuration;
> > 4) Configure the TDX module about usable memory ranges and global
> > KeyID information;
> > 5) Package-scope configuration for the global KeyID;
> > 6) Initialize usable memory ranges based on 4).
> >
> > The TDX module can also be shut down at any time during its lifetime.
> > In case of any error during the initialization process, shut down the
> > module. It's pointless to leave the module in any intermediate state
> > during the initialization.
> >
> > SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
> > operation (VMXON has been done), otherwise it generates #UD. So far
> > only KVM handles VMXON/VMXOFF. Choose to not handle VMXON/VMXOFF in
> > tdx_detect() and tdx_init() but depend on the caller to guarantee that,
> > since so far KVM is the only user of TDX. In the long term, more kernel
> > components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
> > module runtime update), so a reference-based approach to do VMXON/VMXOFF
> > is likely needed.
> >
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> > arch/x86/include/asm/tdx.h | 4 +
> > arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 226 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 1f29813b1646..c8af2ba6bb8a 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> >
> > #ifdef CONFIG_INTEL_TDX_HOST
> > void tdx_detect_cpu(struct cpuinfo_x86 *c);
> > +int tdx_detect(void);
> > +int tdx_init(void);
> > #else
> > static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
> > +static inline int tdx_detect(void) { return -ENODEV; }
> > +static inline int tdx_init(void) { return -ENODEV; }
> > #endif /* CONFIG_INTEL_TDX_HOST */
> >
> > #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index ba2210001ea8..53093d4ad458 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -9,6 +9,8 @@
> >
> > #include <linux/types.h>
> > #include <linux/cpumask.h>
> > +#include <linux/mutex.h>
> > +#include <linux/cpu.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > #include <asm/cpufeature.h>
> > @@ -45,12 +47,33 @@
> > ((u32)(((_keyid_part) & 0xffffffffull) + 1))
> > #define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))
> >
> > +/*
> > + * TDX module status during initialization
> > + */
> > +enum tdx_module_status_t {
> > + /* TDX module status is unknown */
> > + TDX_MODULE_UNKNOWN,
> > + /* TDX module is not loaded */
> > + TDX_MODULE_NONE,
> > + /* TDX module is loaded, but not initialized */
> > + TDX_MODULE_LOADED,
> > + /* TDX module is fully initialized */
> > + TDX_MODULE_INITIALIZED,
> > + /* TDX module is shutdown due to error during initialization */
> > + TDX_MODULE_SHUTDOWN,
> > +};
> > +
>
> Maybe adding these states when you really need them will make
> more sense. Currently this patch only uses the SHUTDOWN and
> NONE states. Other state usage is not very clear.

They are all used in tdx_detect() and tdx_init(), no?

>
> > /* BIOS must configure SEAMRR registers for all cores consistently */
> > static u64 seamrr_base, seamrr_mask;
> >
> > static u32 tdx_keyid_start;
> > static u32 tdx_keyid_num;
> >
> > +static enum tdx_module_status_t tdx_module_status;
> > +
> > +/* Prevent concurrent attempts on TDX detection and initialization */
> > +static DEFINE_MUTEX(tdx_module_lock);
>
> Any possible concurrent usage models?

tdx_detect() and tdx_init() are called on demand by callers, so it's possible
multiple callers can call into them concurrently.

>
> > +
> > static bool __seamrr_enabled(void)
> > {
> > return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > @@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > detect_seam(c);
> > detect_tdx_keyids(c);
> > }
> > +
> > +static bool seamrr_enabled(void)
> > +{
> > + /*
> > + * To detect any BIOS misconfiguration among cores, all logical
> > + * cpus must have been brought up at least once. This is true
> > + * unless 'maxcpus' kernel command line is used to limit the
> > + * number of cpus to be brought up during boot time. However
> > + * 'maxcpus' is basically an invalid operation mode due to the
> > + * MCE broadcast problem, and it should not be used on a TDX
> > + * capable machine. Just do paranoid check here and do not
>
> a paranoid check

Ok.

>
> > + * report SEAMRR as enabled in this case.
> > + */
> > + if (!cpumask_equal(&cpus_booted_once_mask,
> > + cpu_present_mask))
> > + return false;
> > +
> > + return __seamrr_enabled();
> > +}
> > +
> > +static bool tdx_keyid_sufficient(void)
> > +{
> > + if (!cpumask_equal(&cpus_booted_once_mask,
> > + cpu_present_mask))
> > + return false;
> > +
> > + /*
> > + * TDX requires at least two KeyIDs: one global KeyID to
> > + * protect the metadata of the TDX module and one or more
> > + * KeyIDs to run TD guests.
> > + */
> > + return tdx_keyid_num >= 2;
> > +}
> > +
> > +static int __tdx_detect(void)
> > +{
> > + /* The TDX module is not loaded if SEAMRR is disabled */
> > + if (!seamrr_enabled()) {
> > + pr_info("SEAMRR not enabled.\n");
> > + goto no_tdx_module;
> > + }
> > +
> > + /*
> > + * Also do not report the TDX module as loaded if there's
> > + * not enough TDX private KeyIDs to run any TD guests.
> > + */
>
> You are not returning TDX_MODULE_LOADED under any current
> scenarios. So I think the above comment is not accurate.

This comment is to explain the logic behind the TDX KeyID check below. I don't
see how it is related to your comments?

This patch is pretty much a placeholder to express the idea of how
tdx_detect() and tdx_init() are going to be implemented. Below, after the
tdx_keyid_sufficient() check, I also have a comment explaining that the module
hasn't been detected yet, which means there will be code to detect the module
here, and at that time, logically, this function will return
TDX_MODULE_LOADED. I don't see how this is hard to understand?

>
> > + if (!tdx_keyid_sufficient()) {
> > + pr_info("Number of TDX private KeyIDs too small: %u.\n",
> > + tdx_keyid_num);
> > + goto no_tdx_module;
> > + }
> > +
> > + /* Return -ENODEV until the TDX module is detected */
> > +no_tdx_module:
> > + tdx_module_status = TDX_MODULE_NONE;
> > + return -ENODEV;
> > +}
> > +
> > +static int init_tdx_module(void)
> > +{
> > + /*
> > + * Return -EFAULT until all steps of TDX module
> > + * initialization are done.
> > + */
> > + return -EFAULT;
> > +}
> > +
> > +static void shutdown_tdx_module(void)
> > +{
> > + /* TODO: Shut down the TDX module */
> > + tdx_module_status = TDX_MODULE_SHUTDOWN;
> > +}
> > +
> > +static int __tdx_init(void)
> > +{
> > + int ret;
> > +
> > + /*
> > + * Logical-cpu scope initialization requires calling one SEAMCALL
> > + * on all logical cpus enabled by BIOS. Shutting down the TDX
> > + * module also has such requirement. Further more, configuring
>
> such a requirement

Thanks.

>
> > + * the key of the global KeyID requires calling one SEAMCALL for
> > + * each package. For simplicity, disable CPU hotplug in the whole
> > + * initialization process.
> > + *
> > + * It's perhaps better to check whether all BIOS-enabled cpus are
> > + * online before starting initializing, and return early if not.
> > + * But none of 'possible', 'present' and 'online' CPU masks
> > + * represents BIOS-enabled cpus. For example, 'possible' mask is
> > + * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> > + * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> > + * online.
> > + */
> > + cpus_read_lock();
> > +
> > + ret = init_tdx_module();
> > +
> > + /*
> > + * Shut down the TDX module in case of any error during the
> > + * initialization process. It's meaningless to leave the TDX
> > + * module in any middle state of the initialization process.
> > + */
> > + if (ret)
> > + shutdown_tdx_module();
> > +
> > + cpus_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +/**
> > + * tdx_detect - Detect whether the TDX module has been loaded
> > + *
> > + * Detect whether the TDX module has been loaded and ready for
> > + * initialization. Only call this function when all cpus are
> > + * already in VMX operation.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0: The TDX module has been loaded and ready for
> > + * initialization.
> > + * * -ENODEV: The TDX module is not loaded.
> > + * * -EPERM: CPU is not in VMX operation.
> > + * * -EFAULT: Other internal fatal errors.
> > + */
> > +int tdx_detect(void)
>
> Will this function be used separately or always along with
> tdx_init()?

The caller should first use tdx_detect() and then use tdx_init(). If the
caller only uses tdx_detect(), then the TDX module won't be initialized
(unless another caller does so). If the caller calls tdx_init() before
tdx_detect(), it will get an error.

>
> > +{
> > + int ret;
> > +
> > + mutex_lock(&tdx_module_lock);
> > +
> > + switch (tdx_module_status) {
> > + case TDX_MODULE_UNKNOWN:
> > + ret = __tdx_detect();
> > + break;
> > + case TDX_MODULE_NONE:
> > + ret = -ENODEV;
> > + break;
> > + case TDX_MODULE_LOADED:
> > + case TDX_MODULE_INITIALIZED:
> > + ret = 0;
> > + break;
> > + case TDX_MODULE_SHUTDOWN:
> > + ret = -EFAULT;
> > + break;
> > + default:
> > + WARN_ON(1);
> > + ret = -EFAULT;
> > + }
> > +
> > + mutex_unlock(&tdx_module_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_detect);
> > +
> > +/**
> > + * tdx_init - Initialize the TDX module
>
> If it is for TDX module initialization, why not call it
> tdx_module_init()? If not, update the description
> appropriately.

Besides doing the actual module initialization, it also drives a state machine.

But point taken, and I'll try to refine the description. Thanks.

>
> > + *
> > + * Initialize the TDX module to make it ready to run TD guests. This
> > + * function should be called after tdx_detect() returns successful.
> > + * Only call this function when all cpus are online and are in VMX
> > + * operation. CPU hotplug is temporarily disabled internally.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0: The TDX module has been successfully initialized.
> > + * * -ENODEV: The TDX module is not loaded.
> > + * * -EPERM: The CPU which does SEAMCALL is not in VMX operation.
> > + * * -EFAULT: Other internal fatal errors.
> > + */
>
> Do you return different error values just for debug prints, or are there
> other uses for them?

The caller can distinguish them and act differently. Even without a specific
use right now, I think it's better to return different error codes to reflect
different error reasons.
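
For instance, a hypothetical caller could do:

	ret = tdx_detect();
	switch (ret) {
	case 0:
		break;		/* Loaded; proceed to tdx_init() */
	case -ENODEV:
		break;		/* No TDX; fall back to plain VMX */
	case -EPERM:
		break;		/* Caller didn't do VMXON first */
	default:
		break;		/* -EFAULT: internal error, give up */
	}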



--
Thanks,
-Kai


Subject: Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs



On 4/5/22 9:49 PM, Kai Huang wrote:
> + * Intel Trusted Domain CPU Architecture Extension spec:

In TDX guest code, we have been using TDX as "Intel Trust Domain
Extensions". It also aligns with spec. Maybe you should change
your patch set to use the same.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand



On 4/19/22 9:37 PM, Kai Huang wrote:
> On Tue, 2022-04-19 at 07:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> The TDX module is essentially a CPU-attested software module running
>>> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
>>> host and certain physical attacks. The TDX module implements the
>>
>> /s/host/hosts
>
> I don't quite get. Could you explain why there are multiple hosts?

Sorry, I misread it. It is correct, so ignore it.

>
>>

>>> +
>>> +/**
>>> + * tdx_detect - Detect whether the TDX module has been loaded
>>> + *
>>> + * Detect whether the TDX module has been loaded and ready for
>>> + * initialization. Only call this function when all cpus are
>>> + * already in VMX operation.
>>> + *
>>> + * This function can be called in parallel by multiple callers.
>>> + *
>>> + * Return:
>>> + *
>>> + * * -0: The TDX module has been loaded and ready for
>>> + * initialization.
>>> + * * -ENODEV: The TDX module is not loaded.
>>> + * * -EPERM: CPU is not in VMX operation.
>>> + * * -EFAULT: Other internal fatal errors.
>>> + */
>>> +int tdx_detect(void)
>>
>> Will this function be used separately or always along with
>> tdx_init()?
>
> The caller should first use tdx_detect() and then use tdx_init(). If caller
> only uses tdx_detect(), then TDX module won't be initialized (unless other
> caller does this). If caller calls tdx_init() before tdx_detect(), it will get
> error.
>

I just checked your patch set to understand where you are using
tdx_detect()/tdx_init(), but I did not find any callers. Did I miss it,
or is it not used in your patch set?

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM



On 4/5/22 9:49 PM, Kai Huang wrote:
> +/* BIOS must configure SEAMRR registers for all cores consistently */
> +static u64 seamrr_base, seamrr_mask;
> +
> +static bool __seamrr_enabled(void)
> +{
> + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> +}
> +
> +static void detect_seam_bsp(struct cpuinfo_x86 *c)
> +{
> + u64 mtrrcap, base, mask;
> +
> + /* SEAMRR is reported via MTRRcap */
> + if (!boot_cpu_has(X86_FEATURE_MTRR))
> + return;
> +
> + rdmsrl(MSR_MTRRcap, mtrrcap);
> + if (!(mtrrcap & MTRR_CAP_SEAMRR))
> + return;
> +
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> + if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
> + pr_info("SEAMRR base is not configured by BIOS\n");
> + return;
> + }
> +
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> + if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
> + pr_info("SEAMRR is not enabled by BIOS\n");
> + return;
> + }
> +
> + seamrr_base = base;
> + seamrr_mask = mask;
> +}
> +
> +static void detect_seam_ap(struct cpuinfo_x86 *c)
> +{
> + u64 base, mask;
> +
> + /*
> + * Don't bother to detect this AP if SEAMRR is not
> + * enabled after earlier detections.
> + */
> + if (!__seamrr_enabled())
> + return;
> +
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> +
> + if (base == seamrr_base && mask == seamrr_mask)
> + return;
> +
> + pr_err("Inconsistent SEAMRR configuration by BIOS\n");

Do we need to panic for SEAM config issue (for security)?

> + /* Mark SEAMRR as disabled. */
> + seamrr_base = 0;
> + seamrr_mask = 0;
> +}
> +
> +static void detect_seam(struct cpuinfo_x86 *c)
> +{

why not do this check directly in tdx_detect_cpu()?

> + if (c == &boot_cpu_data)
> + detect_seam_bsp(c);
> + else
> + detect_seam_ap(c);
> +}
> +
> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> +{
> + detect_seam(c);
> +}

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error



On 4/5/22 9:49 PM, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime. After TDX module is shut down, no further SEAMCALL can be
> made on any logical cpu.
>
> Shut down the TDX module in case of any error during the
> initialization process. It's pointless to leave the TDX module in some
> middle state.
>
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all

Maybe adding a specification reference will help.

> BIOS-enabled cpus, and the SEAMCALL can run concurrently on different
> cpus. Implement a mechanism to run SEAMCALL concurrently on all online

From TDX Module spec, sec 13.4.1 titled "Shutdown Initiated by the Host
VMM (as Part of Module Update)",

TDH.SYS.LP.SHUTDOWN is designed to set state variables to block all
SEAMCALLs on the current LP and all SEAMCALL leaf functions except
TDH.SYS.LP.SHUTDOWN on the other LPs.

As per above spec reference, executing TDH.SYS.LP.SHUTDOWN in
one LP prevents all SEAMCALL leaf functions on all other LPs. If so,
why execute it on all CPUs?

> cpus. Logical-cpu scope initialization will use it too.

Concurrent SEAMCALL support seems to be useful for other SEAMCALL
types as well. If you agree, I think it would be better if you move
it out to a separate common patch.

>
> Signed-off-by: Kai Huang <[email protected]>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
> 2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 674867bccc14..faf8355965a5 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -11,6 +11,8 @@
> #include <linux/cpumask.h>
> #include <linux/mutex.h>
> #include <linux/cpu.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/cpufeature.h>
> @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> return 0;
> }
>
> +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> +struct seamcall_ctx {
> + u64 fn;
> + u64 rcx;
> + u64 rdx;
> + u64 r8;
> + u64 r9;
> + atomic_t err;
> + u64 seamcall_ret;
> + struct tdx_module_output out;
> +};
> +
> +static void seamcall_smp_call_function(void *data)
> +{
> + struct seamcall_ctx *sc = data;
> + int ret;
> +
> + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> + &sc->seamcall_ret, &sc->out);
> + if (ret)
> + atomic_set(&sc->err, ret);
> +}
> +
> +/*
> + * Call the SEAMCALL on all online cpus concurrently.
> + * Return error if SEAMCALL fails on any cpu.
> + */
> +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> + on_each_cpu(seamcall_smp_call_function, sc, true);
> + return atomic_read(&sc->err);
> +}
> +
> static inline bool p_seamldr_ready(void)
> {
> return !!p_seamldr_info.p_seamldr_ready;
> @@ -437,7 +472,10 @@ static int init_tdx_module(void)
>
> static void shutdown_tdx_module(void)
> {
> - /* TODO: Shut down the TDX module */
> + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> + seamcall_on_each_cpu(&sc);

Maybe check the error and WARN_ON() on failure?

> +
> tdx_module_status = TDX_MODULE_SHUTDOWN;
> }
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 6990c93198b3..dcc1f6dfe378 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -35,6 +35,11 @@ struct p_seamldr_info {
> #define P_SEAMLDR_SEAMCALL_BASE BIT_ULL(63)
> #define P_SEAMCALL_SEAMLDR_INFO (P_SEAMLDR_SEAMCALL_BASE | 0x0)
>
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN 44
> +
> struct tdx_module_output;
> u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out);

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization



On 4/5/22 9:49 PM, Kai Huang wrote:
> Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
> BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.

IIUC, this change handles the logical CPU initialization part of TDX module
initialization. So why talk about TDH.SYS.CONFIG failure here? Are they
related?

> TDH.SYS.LP.INIT can be called concurrently on all cpus.

IMO, if you move the following paragraph to the beginning, it is easier
to understand "what" and "why" part of this change.
>
> Following global initialization, do the logical-cpu scope initialization
> by calling TDH.SYS.LP.INIT on all online cpus. Whether all BIOS-enabled
> cpus are online is not checked here for simplicity. The caller of
> tdx_init() should guarantee all BIOS-enabled cpus are online.

Include a specification reference for TDX module initialization and
TDH.SYS.LP.INIT.

The TDX module spec, section 22.2.35 (TDH.SYS.LP.INIT Leaf), mentions
some environment requirements. I don't see you checking for them here.
Are they already met?



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-26 03:09:35

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error


> >
> > Prevent all SEAMCALLs on other LPs except TDH.SYS.LP.SHUTDOWN. The spec defines
> > shutting down the TDX module as running this SEAMCALL on all LPs, so why just
> > run on a single cpu? What's the benefit?
>
> If executing it in one LP prevents SEAMCALLs on all other LPs, I am
> trying to understand why the spec recommends running it on all LPs?

Please see 3.1.2 Intel TDX Module Shutdown and Update

The "shutdown" case requires "Execute On" on "Each LP".

Also, TDH.SYS.LP.SHUTDOWN describes this as a shutdown on the *current* LP.

>
> But the following explanation answers my query. I recommend making a
> note about it in the commit log or comments.

Is the above enough to address your question?



--
Thanks,
-Kai


Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error



On 4/25/22 4:41 PM, Kai Huang wrote:
> On Sat, 2022-04-23 at 08:39 -0700, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> TDX supports shutting down the TDX module at any time during its
>>> lifetime. After TDX module is shut down, no further SEAMCALL can be
>>> made on any logical cpu.
>>>
>>> Shut down the TDX module in case of any error during the
>>> initialization process. It's pointless to leave the TDX module in some
>>> middle state.
>>>
>>> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>>
>> Maybe adding a specification reference will help.
>
> How about adding the reference to the code comment? Here we just need some
> factual description. Adding the reference to the code comment also allows
> people to find the relevant part in the spec easily when they are looking at
> the actual code (i.e. after the code is merged to upstream). Otherwise people
> need to do a git blame and find the exact commit message for that.

If it is not a hassle, you can add references both in code and at the
end of the commit log. Adding two more lines to the commit log should
not be difficult.

I think it is fine either way. Your choice.

>
>>
>>> BIOS-enabled cpus, and the SEAMCALL can run concurrently on different
>>> cpus. Implement a mechanism to run SEAMCALL concurrently on all online
>>
>> From TDX Module spec, sec 13.4.1 titled "Shutdown Initiated by the Host
>> VMM (as Part of Module Update)",
>>
>> TDH.SYS.LP.SHUTDOWN is designed to set state variables to block all
>> SEAMCALLs on the current LP and all SEAMCALL leaf functions except
>> TDH.SYS.LP.SHUTDOWN on the other LPs.
>>
>> As per above spec reference, executing TDH.SYS.LP.SHUTDOWN in
>> one LP prevents all SEAMCALL leaf functions on all other LPs. If so,
>> why execute it on all CPUs?
>
> Prevent all SEAMCALLs on other LPs except TDH.SYS.LP.SHUTDOWN. The spec defines
> shutting down the TDX module as running this SEAMCALL on all LPs, so why just
> run on a single cpu? What's the benefit?

If executing it in one LP prevents SEAMCALLs on all other LPs, I am
trying to understand why the spec recommends running it on all LPs?

But the following explanation answers my query. I recommend making a
note about it in the commit log or comments.

>
> Also, the spec mentions that for runtime update, "SEAMLDR can check that
> TDH.SYS.SHUTDOWN has been executed on all LPs". Runtime update isn't supported
> in this series, but it can leverage the existing code if we run SEAMCALL on all
> LPs to shut down the module as the spec suggests. Why just run on a single cpu?
>
>>
>>> cpus. Logical-cpu scope initialization will use it too.
>>
>> Concurrent SEAMCALL support seems to be useful for other SEAMCALL
>> types as well. If you agree, I think it would be better if you move
>> it out to a separate common patch.
>
> There are a couple of problems with doing that:
>
> - All the functions are static in this tdx.c. Introducing them separately in a
> dedicated patch would result in compile warnings about those static functions
> being unused.
> - I have received comments from others that I can add those functions when they
> are first used. Given those functions are not large, I prefer this way too.

Ok

>
>>
>>>
>>> Signed-off-by: Kai Huang <[email protected]>
>>> ---
>>> arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
>>> arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
>>> 2 files changed, 44 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> index 674867bccc14..faf8355965a5 100644
>>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -11,6 +11,8 @@
>>> #include <linux/cpumask.h>
>>> #include <linux/mutex.h>
>>> #include <linux/cpu.h>
>>> +#include <linux/smp.h>
>>> +#include <linux/atomic.h>
>>> #include <asm/msr-index.h>
>>> #include <asm/msr.h>
>>> #include <asm/cpufeature.h>
>>> @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>>> return 0;
>>> }
>>>
>>> +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
>>> +struct seamcall_ctx {
>>> + u64 fn;
>>> + u64 rcx;
>>> + u64 rdx;
>>> + u64 r8;
>>> + u64 r9;
>>> + atomic_t err;
>>> + u64 seamcall_ret;
>>> + struct tdx_module_output out;
>>> +};
>>> +
>>> +static void seamcall_smp_call_function(void *data)
>>> +{
>>> + struct seamcall_ctx *sc = data;
>>> + int ret;
>>> +
>>> + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
>>> + &sc->seamcall_ret, &sc->out);
>>> + if (ret)
>>> + atomic_set(&sc->err, ret);
>>> +}
>>> +
>>> +/*
>>> + * Call the SEAMCALL on all online cpus concurrently.
>>> + * Return error if SEAMCALL fails on any cpu.
>>> + */
>>> +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
>>> +{
>>> + on_each_cpu(seamcall_smp_call_function, sc, true);
>>> + return atomic_read(&sc->err);
>>> +}
>>> +
>>> static inline bool p_seamldr_ready(void)
>>> {
>>> return !!p_seamldr_info.p_seamldr_ready;
>>> @@ -437,7 +472,10 @@ static int init_tdx_module(void)
>>>
>>> static void shutdown_tdx_module(void)
>>> {
>>> - /* TODO: Shut down the TDX module */
>>> + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
>>> +
>>> + seamcall_on_each_cpu(&sc);
>>
>> Maybe check the error and WARN_ON() on failure?
>
> When a SEAMCALL fails, the error code is actually printed out (please see the
> previous patch), so I thought there's no need to WARN_ON() here (and in some
> other similar places). I am not sure an additional WARN_ON() would help?

OK. I missed that part.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-04-26 10:45:58

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization

On Sat, 2022-04-23 at 18:27 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
> > BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.
>
> IIUC, this change handles the logical CPU initialization part of TDX module
> initialization. So why talk about TDH.SYS.CONFIG failure here? Are they
> related?

They are a little bit related but I think I can remove it. Thanks.

>
> > TDH.SYS.LP.INIT can be called concurrently on all cpus.
>
> IMO, if you move the following paragraph to the beginning, it is easier
> to understand "what" and "why" part of this change.

OK.

> >
> > Following global initialization, do the logical-cpu scope initialization
> > by calling TDH.SYS.LP.INIT on all online cpus. Whether all BIOS-enabled
> > cpus are online is not checked here for simplicity. The caller of
> > tdx_init() should guarantee all BIOS-enabled cpus are online.
>
> Include a specification reference for TDX module initialization and
> TDH.SYS.LP.INIT.
>
> The TDX module spec, section 22.2.35 (TDH.SYS.LP.INIT Leaf), mentions
> some environment requirements. I don't see you checking for them here.
> Are they already met?
>

Good catch. I missed it, and I'll look into it. Thanks.


--
Thanks,
-Kai


2022-04-26 20:12:14

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error

On Sat, 2022-04-23 at 08:39 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > TDX supports shutting down the TDX module at any time during its
> > lifetime. After TDX module is shut down, no further SEAMCALL can be
> > made on any logical cpu.
> >
> > Shut down the TDX module in case of any error during the
> > initialization process. It's pointless to leave the TDX module in some
> > middle state.
> >
> > Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>
> Maybe adding a specification reference will help.

How about adding the reference to the code comment? Here we just need some
factual description. Adding the reference to the code comment also allows
people to find the relevant part in the spec easily when they are looking at
the actual code (i.e. after the code is merged to upstream). Otherwise people
need to do a git blame and find the exact commit message for that.
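
For instance, something like below (using the spec section you quoted):

	/*
	 * See TDX module spec, sec 13.4.1 "Shutdown Initiated by the
	 * Host VMM (as Part of Module Update)".
	 */
	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };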

>
> > BIOS-enabled cpus, and the SEAMCALL can run concurrently on different
> > cpus. Implement a mechanism to run SEAMCALL concurrently on all online
>
> From TDX Module spec, sec 13.4.1 titled "Shutdown Initiated by the Host
> VMM (as Part of Module Update)",
>
> TDH.SYS.LP.SHUTDOWN is designed to set state variables to block all
> SEAMCALLs on the current LP and all SEAMCALL leaf functions except
> TDH.SYS.LP.SHUTDOWN on the other LPs.
>
> As per above spec reference, executing TDH.SYS.LP.SHUTDOWN in
> one LP prevents all SEAMCALL leaf functions on all other LPs. If so,
> why execute it on all CPUs?

Prevent all SEAMCALLs on other LPs except TDH.SYS.LP.SHUTDOWN. The spec defines
shutting down the TDX module as running this SEAMCALL on all LPs, so why just
run on a single cpu? What's the benefit?

Also, the spec mentions that for runtime update, "SEAMLDR can check that
TDH.SYS.SHUTDOWN has been executed on all LPs". Runtime update isn't supported
in this series, but it can leverage the existing code if we run SEAMCALL on all
LPs to shut down the module as the spec suggests. Why just run on a single cpu?

>
> > cpus. Logical-cpu scope initialization will use it too.
>
> Concurrent SEAMCALL support seems to be useful for other SEAMCALL
> types as well. If you agree, I think it would be better if you move
> it out to a separate common patch.

There are a couple of problems with doing that:

- All the functions are static in this tdx.c. Introducing them separately in a
dedicated patch would result in compile warnings about those static functions
being unused.
- I have received comments from others that I can add those functions when they
are first used. Given those functions are not large, I prefer this way too.

>
> >
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> > arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
> > arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
> > 2 files changed, 44 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 674867bccc14..faf8355965a5 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -11,6 +11,8 @@
> > #include <linux/cpumask.h>
> > #include <linux/mutex.h>
> > #include <linux/cpu.h>
> > +#include <linux/smp.h>
> > +#include <linux/atomic.h>
> > #include <asm/msr-index.h>
> > #include <asm/msr.h>
> > #include <asm/cpufeature.h>
> > @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > return 0;
> > }
> >
> > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > +struct seamcall_ctx {
> > + u64 fn;
> > + u64 rcx;
> > + u64 rdx;
> > + u64 r8;
> > + u64 r9;
> > + atomic_t err;
> > + u64 seamcall_ret;
> > + struct tdx_module_output out;
> > +};
> > +
> > +static void seamcall_smp_call_function(void *data)
> > +{
> > + struct seamcall_ctx *sc = data;
> > + int ret;
> > +
> > + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > + &sc->seamcall_ret, &sc->out);
> > + if (ret)
> > + atomic_set(&sc->err, ret);
> > +}
> > +
> > +/*
> > + * Call the SEAMCALL on all online cpus concurrently.
> > + * Return error if SEAMCALL fails on any cpu.
> > + */
> > +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > +{
> > + on_each_cpu(seamcall_smp_call_function, sc, true);
> > + return atomic_read(&sc->err);
> > +}
> > +
> > static inline bool p_seamldr_ready(void)
> > {
> > return !!p_seamldr_info.p_seamldr_ready;
> > @@ -437,7 +472,10 @@ static int init_tdx_module(void)
> >
> > static void shutdown_tdx_module(void)
> > {
> > - /* TODO: Shut down the TDX module */
> > + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> > +
> > + seamcall_on_each_cpu(&sc);
>
> Maybe check the error and WARN_ON() on failure?

When a SEAMCALL fails, the error code is actually printed out (please see the
previous patch), so I thought there's no need to WARN_ON() here (and in some
other similar places). I am not sure an additional WARN_ON() would help?

--
Thanks,
-Kai


2022-04-27 02:16:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error

On 4/5/22 21:49, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime. After TDX module is shut down, no further SEAMCALL can be
> made on any logical cpu.

Is this strictly true?

I thought SEAMCALLs were used for the P-SEAMLDR too.

> Shut down the TDX module in case of any error during the
> initialization process. It's pointless to leave the TDX module in some
> middle state.
>
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled cpus, and the SEAMCALL can run concurrently on different
> cpus. Implement a mechanism to run SEAMCALL concurrently on all online
> cpus. Logical-cpu scope initialization will use it too.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
> arch/x86/virt/vmx/tdx/tdx.h | 5 +++++
> 2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 674867bccc14..faf8355965a5 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -11,6 +11,8 @@
> #include <linux/cpumask.h>
> #include <linux/mutex.h>
> #include <linux/cpu.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/cpufeature.h>
> @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> return 0;
> }
>
> +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> +struct seamcall_ctx {
> + u64 fn;
> + u64 rcx;
> + u64 rdx;
> + u64 r8;
> + u64 r9;
> + atomic_t err;
> + u64 seamcall_ret;
> + struct tdx_module_output out;
> +};
> +
> +static void seamcall_smp_call_function(void *data)
> +{
> + struct seamcall_ctx *sc = data;
> + int ret;
> +
> + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> + &sc->seamcall_ret, &sc->out);
> + if (ret)
> + atomic_set(&sc->err, ret);
> +}
> +
> +/*
> + * Call the SEAMCALL on all online cpus concurrently.
> + * Return error if SEAMCALL fails on any cpu.
> + */
> +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> + on_each_cpu(seamcall_smp_call_function, sc, true);
> + return atomic_read(&sc->err);
> +}

Why bother returning something that's not read?

> static inline bool p_seamldr_ready(void)
> {
> return !!p_seamldr_info.p_seamldr_ready;
> @@ -437,7 +472,10 @@ static int init_tdx_module(void)
>
> static void shutdown_tdx_module(void)
> {
> - /* TODO: Shut down the TDX module */
> + struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> + seamcall_on_each_cpu(&sc);
> +
> tdx_module_status = TDX_MODULE_SHUTDOWN;
> }
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 6990c93198b3..dcc1f6dfe378 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -35,6 +35,11 @@ struct p_seamldr_info {
> #define P_SEAMLDR_SEAMCALL_BASE BIT_ULL(63)
> #define P_SEAMCALL_SEAMLDR_INFO (P_SEAMLDR_SEAMCALL_BASE | 0x0)
>
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN 44
> +
> struct tdx_module_output;
> u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> struct tdx_module_output *out);

2022-04-27 03:13:29

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> On 4/26/22 16:12, Kai Huang wrote:
> > Hi Dave,
> >
> > Thanks for review!
> >
> > On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > > > +config INTEL_TDX_HOST
> > > > + bool "Intel Trust Domain Extensions (TDX) host support"
> > > > + default n
> > > > + depends on CPU_SUP_INTEL
> > > > + depends on X86_64
> > > > + help
> > > > + Intel Trust Domain Extensions (TDX) protects guest VMs from
> > > > malicious
> > > > + host and certain physical attacks. This option enables necessary
> > > > TDX
> > > > + support in host kernel to run protected VMs.
> > > > +
> > > > + If unsure, say N.
> > >
> > > Nothing about KVM?
> >
> > I'll add KVM into the context. How about below?
> >
> > "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. This option enables necessary TDX
> > support in host kernel to allow KVM to run protected VMs called Trust
> > Domains (TD)."
>
> What about a dependency? Isn't this dead code without CONFIG_KVM=y/m?

Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM. But so far KVM is the only
user of TDX, so in practice the code is dead w/o KVM.

What's your opinion?
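
(For reference, the dependency itself would be trivial to add, e.g. a
sketch on top of the existing Kconfig entry:

	config INTEL_TDX_HOST
		bool "Intel Trust Domain Extensions (TDX) host support"
		depends on CPU_SUP_INTEL
		depends on X86_64
		depends on KVM_INTEL

but conceptually it would tie the TDX module to a single user.)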

>
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > new file mode 100644
> > > > index 000000000000..03f35c75f439
> > > > --- /dev/null
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -0,0 +1,102 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * Copyright(c) 2022 Intel Corporation.
> > > > + *
> > > > + * Intel Trusted Domain Extensions (TDX) support
> > > > + */
> > > > +
> > > > +#define pr_fmt(fmt) "tdx: " fmt
> > > > +
> > > > +#include <linux/types.h>
> > > > +#include <linux/cpumask.h>
> > > > +#include <asm/msr-index.h>
> > > > +#include <asm/msr.h>
> > > > +#include <asm/cpufeature.h>
> > > > +#include <asm/cpufeatures.h>
> > > > +#include <asm/tdx.h>
> > > > +
> > > > +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
> > > > +#define MTRR_CAP_SEAMRR BIT(15)
> > > > +
> > > > +/* Core-scope Intel SEAMRR base and mask registers. */
> > > > +#define MSR_IA32_SEAMRR_PHYS_BASE 0x00001400
> > > > +#define MSR_IA32_SEAMRR_PHYS_MASK 0x00001401
> > > > +
> > > > +#define SEAMRR_PHYS_BASE_CONFIGURED BIT_ULL(3)
> > > > +#define SEAMRR_PHYS_MASK_ENABLED BIT_ULL(11)
> > > > +#define SEAMRR_PHYS_MASK_LOCKED BIT_ULL(10)
> > > > +
> > > > +#define SEAMRR_ENABLED_BITS \
> > > > + (SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
> > > > +
> > > > +/* BIOS must configure SEAMRR registers for all cores consistently */
> > > > +static u64 seamrr_base, seamrr_mask;
> > > > +
> > > > +static bool __seamrr_enabled(void)
> > > > +{
> > > > + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > > > +}
> > >
> > > But there's no case where seamrr_mask is non-zero and where
> > > _seamrr_enabled(). Why bother checking the SEAMRR_ENABLED_BITS?
> >
> > seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
> > is 0. It will also be cleared when BIOS mis-configuration is detected on any
> > AP. SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.
>
> The point is that this could be:
>
> return !!seamrr_mask;

The SEAMRR_MASK MSR defines "ENABLED" and "LOCKED" bits. Explicitly
checking the two bits, instead of !!seamrr_mask, rules out other incorrect
configurations. For instance, we should not treat SEAMRR as enabled if only
the "ENABLED" bit is set, or only the "LOCKED" bit is set.
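
For instance (a hypothetical misconfigured value):

	u64 mask = SEAMRR_PHYS_MASK_ENABLED;	/* "LOCKED" bit not set */

	/*
	 * !!mask is true, which would wrongly report SEAMRR as enabled.
	 * (mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS is false,
	 * which correctly reports SEAMRR as disabled.
	 */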

>
>
> > > > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > > > +{
> > > > + u64 base, mask;
> > > > +
> > > > + /*
> > > > + * Don't bother to detect this AP if SEAMRR is not
> > > > + * enabled after earlier detections.
> > > > + */
> > > > + if (!__seamrr_enabled())
> > > > + return;
> > > > +
> > > > + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > > > + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > > > +
> > >
> > > This is the place for a comment about why the values have to be equal.
> >
> > I'll add below:
> >
> > /* BIOS must configure SEAMRR consistently across all cores */
>
> What happens if the BIOS doesn't do this? What actually breaks? In
> other words, do we *NEED* error checking here?

AFAICT the spec doesn't explicitly mention what will happen if BIOS doesn't
configure them consistently among cores. But for safety I think it's better to
detect it.

>
> > > > + if (base == seamrr_base && mask == seamrr_mask)
> > > > + return;
> > > > +
> > > > + pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> > > > + /* Mark SEAMRR as disabled. */
> > > > + seamrr_base = 0;
> > > > + seamrr_mask = 0;
> > > > +}
> > > > +
> > > > +static void detect_seam(struct cpuinfo_x86 *c)
> > > > +{
> > > > + if (c == &boot_cpu_data)
> > > > + detect_seam_bsp(c);
> > > > + else
> > > > + detect_seam_ap(c);
> > > > +}
> > > > +
> > > > +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > > > +{
> > > > + detect_seam(c);
> > > > +}
> > >
> > > The extra function looks a bit silly here now. Maybe this gets filled
> > > out later, but it's goofy-looking here.
> >
> > Thomas suggested putting all TDX detection related code in one function call, so I
> > added tdx_detect_cpu(). I'll move this to the next patch when detecting TDX
> > KeyIDs.
>
> That's fine, or just add a comment or a changelog sentence about this
> being filled out later.

There's already one sentence in the changelog:

"......Add a function to detect all TDX preliminaries (SEAMRR, TDX private
KeyIDs) for a given cpu when it is brought up. As the first step, detect the
validity of SEAMRR."

Does this look good to you?

2022-04-27 09:11:40

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > SEAM VMX root operation is designed to host a CPU-attested, software
> > module called the 'TDX module' which implements functions to manage
> > crypto protected VMs called Trust Domains (TD). SEAM VMX root is also
>
> "crypto protected"? What the heck is that?

How about "crypto-protected"? I googled and it seems it is used by someone
else.

>
> > designed to host a CPU-attested, software module called the 'Intel
> > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> >
> > Host kernel transits to either the P-SEAMLDR or the TDX module via a new
>
> ^ The

Thanks.

>
> > SEAMCALL instruction. SEAMCALLs are host-side interface functions
> > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> > instruction. They are similar to a hypercall, except they are made by
> > host kernel to the SEAM software modules.
>
> This is still missing some important high-level things, like that the
> TDX module is protected from the untrusted VMM. Heck, it forgets to
> mention that the VMM itself is untrusted and the TDX module replaces
> things that the VMM usually does.
>
> It would also be nice to mention here how this compares with SEV-SNP.
> Where is the TDX module in that design? Why doesn't SEV need all this code?
>
> > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> > protect TD guests. TDX reserves part of MKTME KeyID space as TDX private
> > KeyIDs, which can only be used by software runs in SEAM. The physical
>
> ^ which

Thanks.

>
> > address bits for encoding TDX private KeyID are treated as reserved bits
> > when not in SEAM operation. The partitioning of MKTME KeyIDs and TDX
> > private KeyIDs is configured by BIOS.
> >
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized using SEAMCALLs defined by TDX architecture.
> > This series assumes both the P-SEAMLDR and the TDX module are loaded by
> > BIOS before the kernel boots.
> >
> > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> > SEAMCALL to detect P-SEAMLDR. The success of this SEAMCALL means the
> > P-SEAMLDR is loaded. The P-SEAMLDR information returned by this
> > SEAMCALL further tells whether TDX module is loaded.
>
> There's a bit of information missing here. The kernel might not know
> the state of things being loaded. A previous kernel might have loaded
> it and left it in an unknown state.
>
> > The TDX module is initialized in multiple steps:
> >
> > 1) Global initialization;
> > 2) Logical-CPU scope initialization;
> > 3) Enumerate the TDX module capabilities;
> > 4) Configure the TDX module about usable memory ranges and
> > global KeyID information;
> > 5) Package-scope configuration for the global KeyID;
> > 6) Initialize TDX metadata for usable memory ranges based on 4).
> >
> > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> > table) logical cpus, otherwise step 4) will fail. Step 5) requires
> > calling SEAMCALL on at least one cpu on all packages.
> >
> > TDX module can also be shut down at any time during module's lifetime, by
> > calling SEAMCALL on all "BIOS-enabled" logical cpus.
> >
> > == Design Considerations ==
> >
> > 1. Lazy TDX module initialization on-demand by caller
>
> This doesn't really tell us what "lazy" is or what the alternatives are.
>
> There are basically two ways the TDX module could be loaded. Either:
> * In early boot
> or
> * At runtime just before the first TDX guest is run
>
> This series implements the runtime loading.

OK will do.

>
> > None of the steps in the TDX module initialization process must be done
> > during kernel boot. This series doesn't initialize TDX at boot time, but
> > instead, provides two functions to allow caller to detect and initialize
> > TDX on demand:
> >
> > if (tdx_detect())
> > goto no_tdx;
> > if (tdx_init())
> > goto no_tdx;
> >
> > This approach has below pros:
> >
> > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> > metadata. Enabling TDX on demand allows only to consume this memory when
> > TDX is truly needed (i.e. when KVM wants to create TD guests).
> >
> > 2) Both detecting and initializing the TDX module require calling
> > SEAMCALL. However, SEAMCALL requires CPU being already in VMX operation
> > (VMXON has been done). So far, KVM is the only user of TDX, and it
> > already handles VMXON/VMXOFF. Therefore, letting KVM to initialize TDX
> > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> > core-kernel. Also, in long term, likely a reference based VMXON/VMXOFF
> > approach is needed since more kernel components will need to handle
> > VMXON/VMXOFF.
> >
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series). After updating to the new module at runtime, kernel needs
> > to go through the initialization process again. For the new module,
> > it's possible the metadata allocated for the old module cannot be reused
> > for the new module, and needs to be re-allocated again.
> >
> > 2. Kernel policy on TDX memory
> >
> > Host kernel is responsible for choosing which memory regions can be used
> > as TDX memory, and configuring those memory regions to the TDX module by
> > using an array of "TD Memory Regions" (TDMR), which is a data structure
> > defined by TDX architecture.
>
>
> This is putting the cart before the horse. Don't define the details up
> front.
>
> The TDX architecture allows the VMM to designate specific memory
> as usable for TDX private memory. This series chooses to
> designate _all_ system RAM as TDX to avoid having to modify the
> page allocator to distinguish TDX and non-TDX-capable memory
>
> ... then go on to explain the details.

Thanks. Will update.

>
> > The first generation of TDX essentially guarantees that all system RAM
> > memory regions (excluding the memory below 1MB) can be used as TDX
> > memory. To avoid having to modify the page allocator to distinguish TDX
> > and non-TDX allocation, this series chooses to use all system RAM as TDX
> > memory.
> >
> > E820 table is used to find all system RAM entries. Following
> > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> > types are treated as TDX memory, and contiguous ranges in the same NUMA
> > node are merged together (similar to memblock_add()) before trimming the
> > non-page-aligned part.
>
> This e820 cruft is too much detail for a cover letter. In general, once
> you start talking about individual functions, you've gone too far in the
> cover letter.

Will remove.

>
> > 3. Memory hotplug
> >
> > The first generation of TDX architecturally doesn't support memory
> > hotplug. And the first generation of TDX-capable platforms don't support
> > physical memory hotplug. Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> >
> > A special case of memory hotplug is adding NVDIMM as system RAM using
> > kmem driver. However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
>
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

I forgot to add below (which is in the documentation patch):

"This can be enhanced when future generation of TDX starts to support ACPI
memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
same platform."

Is this acceptable?

>
> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM. To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> >
> > 4. CPU hotplug
> >
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either. Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.
> >
> > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > online. After the initialization, any logical cpu can be brought down
> > and brought up to online again later. Therefore this series doesn't
> > change logical CPU hotplug either.
> >
> > 5. TDX interaction with kexec()
> >
> > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > of TDX private memory, including PAMTs, used by TDX module need to be
> > flushed before transiting to the new kernel otherwise they may silently
> > corrupt the new kernel. Similar to SME, this series flushes cache in
> > stop_this_cpu().
>
> What does this have to do with kexec()? What's a PAMT?

The point is the dirty cachelines of TDX private memory must be flushed
otherwise they may silently corrupt the new kexec()-ed kernel.

Will use "TDX metadata" instead of "PAMT". The former has already been
mentioned above.
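
For reference, the flush is done similarly to SME, i.e. with a wbinvd in
stop_this_cpu(). A sketch (the condition/helper name below is
hypothetical):

	/* in stop_this_cpu(), next to the existing SME cache flush */
	if (platform_tdx_enabled())	/* hypothetical helper */
		native_wbinvd();	/* flush dirty TDX private cachelines */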

>
> > The TDX module can be initialized only once during its lifetime. The
> > first generation of TDX doesn't have interface to reset TDX module to
>
> ^ an

Thanks.

>
> > uninitialized state so it can be initialized again.
> >
> > This implies:
> >
> > - If the old kernel fails to initialize TDX, the new kernel cannot
> > use TDX too unless the new kernel fixes the bug which leads to
> > initialization failure in the old kernel and can resume from where
> > the old kernel stops. This requires certain coordination between
> > the two kernels.
>
> OK, but what does this *MEAN*?

This means we need to extend the information which the old kernel passes to the
new kernel. But I don't think it's feasible. I'll refine this kexec() section
to make it more concise in the next version.

>
> > - If the old kernel has initialized TDX successfully, the new kernel
> > may be able to use TDX if the two kernels have the exactly same
> > configurations on the TDX module. It further requires the new kernel
> > to reserve the TDX metadata pages (allocated by the old kernel) in
> > its page allocator. It also requires coordination between the two
> > kernels. Furthermore, if kexec() is done when there are active TD
> > guests running, the new kernel cannot use TDX because it's extremely
> > hard for the old kernel to pass all TDX private pages to the new
> > kernel.
> >
> > Given that, this series doesn't support TDX after kexec() (except the
> > old kernel doesn't attempt to initialize TDX at all).
> >
> > And this series doesn't shut down TDX module but leaves it open during
> > kexec(). It is because shutting down TDX module requires CPU being in
> > VMX operation but there's no guarantee of this during kexec(). Leaving
> > the TDX module open is not the best case, but it is OK since the new
> > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > at all).
>
> tl;dr: kexec() doesn't work with this code.
>
> Right?
>
> That doesn't seem good.

It can work, in my understanding. We just need to flush the cache before
booting into the new kernel.


--
Thanks,
-Kai


2022-04-27 09:44:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

> +config INTEL_TDX_HOST
> + bool "Intel Trust Domain Extensions (TDX) host support"
> + default n
> + depends on CPU_SUP_INTEL
> + depends on X86_64
> + help
> + Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> + host and certain physical attacks. This option enables necessary TDX
> + support in host kernel to run protected VMs.
> +
> + If unsure, say N.

Nothing about KVM?

...
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..03f35c75f439
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2022 Intel Corporation.
> + *
> + * Intel Trusted Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/cpumask.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/cpufeature.h>
> +#include <asm/cpufeatures.h>
> +#include <asm/tdx.h>
> +
> +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
> +#define MTRR_CAP_SEAMRR BIT(15)
> +
> +/* Core-scope Intel SEAMRR base and mask registers. */
> +#define MSR_IA32_SEAMRR_PHYS_BASE 0x00001400
> +#define MSR_IA32_SEAMRR_PHYS_MASK 0x00001401
> +
> +#define SEAMRR_PHYS_BASE_CONFIGURED BIT_ULL(3)
> +#define SEAMRR_PHYS_MASK_ENABLED BIT_ULL(11)
> +#define SEAMRR_PHYS_MASK_LOCKED BIT_ULL(10)
> +
> +#define SEAMRR_ENABLED_BITS \
> + (SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
> +
> +/* BIOS must configure SEAMRR registers for all cores consistently */
> +static u64 seamrr_base, seamrr_mask;
> +
> +static bool __seamrr_enabled(void)
> +{
> + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> +}

But there's no case where seamrr_mask is non-zero and where
_seamrr_enabled(). Why bother checking the SEAMRR_ENABLED_BITS?

> +static void detect_seam_bsp(struct cpuinfo_x86 *c)
> +{
> + u64 mtrrcap, base, mask;
> +
> + /* SEAMRR is reported via MTRRcap */
> + if (!boot_cpu_has(X86_FEATURE_MTRR))
> + return;
> +
> + rdmsrl(MSR_MTRRcap, mtrrcap);
> + if (!(mtrrcap & MTRR_CAP_SEAMRR))
> + return;
> +
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> + if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
> + pr_info("SEAMRR base is not configured by BIOS\n");
> + return;
> + }
> +
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> + if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
> + pr_info("SEAMRR is not enabled by BIOS\n");
> + return;
> + }
> +
> + seamrr_base = base;
> + seamrr_mask = mask;
> +}

Comment, please.

/*
* Stash the boot CPU's MSR values so that AP values
* can be checked for consistency.
*/


> +static void detect_seam_ap(struct cpuinfo_x86 *c)
> +{
> + u64 base, mask;
> +
> + /*
> + * Don't bother to detect this AP if SEAMRR is not
> + * enabled after earlier detections.
> + */
> + if (!__seamrr_enabled())
> + return;
> +
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> +

This is the place for a comment about why the values have to be equal.

> + if (base == seamrr_base && mask == seamrr_mask)
> + return;
> +
> + pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> + /* Mark SEAMRR as disabled. */
> + seamrr_base = 0;
> + seamrr_mask = 0;
> +}
> +
> +static void detect_seam(struct cpuinfo_x86 *c)
> +{
> + if (c == &boot_cpu_data)
> + detect_seam_bsp(c);
> + else
> + detect_seam_ap(c);
> +}
> +
> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> +{
> + detect_seam(c);
> +}

The extra function looks a bit silly here now. Maybe this gets filled
out later, but it's goofy-looking here.

2022-04-27 09:45:53

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

Hi Dave,

Thanks for review!

On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > +config INTEL_TDX_HOST
> > + bool "Intel Trust Domain Extensions (TDX) host support"
> > + default n
> > + depends on CPU_SUP_INTEL
> > + depends on X86_64
> > + help
> > + Intel Trust Domain Extensions (TDX) protects guest VMs from
> > malicious
> > + host and certain physical attacks. This option enables necessary
> > TDX
> > + support in host kernel to run protected VMs.
> > +
> > + If unsure, say N.
>
> Nothing about KVM?

I'll add KVM into the context. How about below?

"Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. This option enables necessary TDX
support in host kernel to allow KVM to run protected VMs called Trust
Domains (TD)."

>
> ...
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > new file mode 100644
> > index 000000000000..03f35c75f439
> > --- /dev/null
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright(c) 2022 Intel Corporation.
> > + *
> > + * Intel Trusted Domain Extensions (TDX) support
> > + */
> > +
> > +#define pr_fmt(fmt) "tdx: " fmt
> > +
> > +#include <linux/types.h>
> > +#include <linux/cpumask.h>
> > +#include <asm/msr-index.h>
> > +#include <asm/msr.h>
> > +#include <asm/cpufeature.h>
> > +#include <asm/cpufeatures.h>
> > +#include <asm/tdx.h>
> > +
> > +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
> > +#define MTRR_CAP_SEAMRR BIT(15)
> > +
> > +/* Core-scope Intel SEAMRR base and mask registers. */
> > +#define MSR_IA32_SEAMRR_PHYS_BASE 0x00001400
> > +#define MSR_IA32_SEAMRR_PHYS_MASK 0x00001401
> > +
> > +#define SEAMRR_PHYS_BASE_CONFIGURED BIT_ULL(3)
> > +#define SEAMRR_PHYS_MASK_ENABLED BIT_ULL(11)
> > +#define SEAMRR_PHYS_MASK_LOCKED BIT_ULL(10)
> > +
> > +#define SEAMRR_ENABLED_BITS \
> > + (SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
> > +
> > +/* BIOS must configure SEAMRR registers for all cores consistently */
> > +static u64 seamrr_base, seamrr_mask;
> > +
> > +static bool __seamrr_enabled(void)
> > +{
> > + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > +}
>
> But there's no case where seamrr_mask is non-zero and where
> _seamrr_enabled(). Why bother checking the SEAMRR_ENABLED_BITS?

seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
is 0. It will also be cleared when BIOS mis-configuration is detected on any
AP. SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.

>
> > +static void detect_seam_bsp(struct cpuinfo_x86 *c)
> > +{
> > + u64 mtrrcap, base, mask;
> > +
> > + /* SEAMRR is reported via MTRRcap */
> > + if (!boot_cpu_has(X86_FEATURE_MTRR))
> > + return;
> > +
> > + rdmsrl(MSR_MTRRcap, mtrrcap);
> > + if (!(mtrrcap & MTRR_CAP_SEAMRR))
> > + return;
> > +
> > + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > + if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
> > + pr_info("SEAMRR base is not configured by BIOS\n");
> > + return;
> > + }
> > +
> > + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > + if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
> > + pr_info("SEAMRR is not enabled by BIOS\n");
> > + return;
> > + }
> > +
> > + seamrr_base = base;
> > + seamrr_mask = mask;
> > +}
>
> Comment, please.
>
> /*
> * Stash the boot CPU's MSR values so that AP values
> * can be checked for consistency.
> */
>

Thanks. Will add.

>
> > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > +{
> > + u64 base, mask;
> > +
> > + /*
> > + * Don't bother to detect this AP if SEAMRR is not
> > + * enabled after earlier detections.
> > + */
> > + if (!__seamrr_enabled())
> > + return;
> > +
> > + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > +
>
> This is the place for a comment about why the values have to be equal.

I'll add below:

/* BIOS must configure SEAMRR consistently across all cores */

>
> > + if (base == seamrr_base && mask == seamrr_mask)
> > + return;
> > +
> > + pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> > + /* Mark SEAMRR as disabled. */
> > + seamrr_base = 0;
> > + seamrr_mask = 0;
> > +}
> > +
> > +static void detect_seam(struct cpuinfo_x86 *c)
> > +{
> > + if (c == &boot_cpu_data)
> > + detect_seam_bsp(c);
> > + else
> > + detect_seam_ap(c);
> > +}
> > +
> > +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > +{
> > + detect_seam(c);
> > +}
>
> The extra function looks a bit silly here now. Maybe this gets filled
> out later, but it's goofy-looking here.

Thomas suggested putting all TDX detection related code in one function call, so I
added tdx_detect_cpu(). I'll move this to the next patch when detecting TDX
KeyIDs.

2022-04-27 10:05:25

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module

On Tue, 2022-04-26 at 13:56 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > The P-SEAMLDR (persistent SEAM loader) is the first software module that
> > runs in SEAM VMX root, responsible for loading and updating the TDX
> > module. Both the P-SEAMLDR and the TDX module are expected to be loaded
> > before host kernel boots.
>
> Why bother with the P-SEAMLDR here at all? The kernel isn't loading the
> TDX module in this series. Why not just call into the TDX module directly?

It's not absolutely needed in this series. I chose to detect the P-SEAMLDR
because detecting it also detects the TDX module, and eventually we will need
to support the P-SEAMLDR because TDX module runtime update uses the
P-SEAMLDR's SEAMCALL to do that.

Also, even for this series, detecting the P-SEAMLDR allows us to provide the P-
SEAMLDR information to the user at a basic level in dmesg:

[..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209,
build_num 160, major 1, minor 0

This may be useful to users, but it's not a hard requirement for this series.


--
Thanks,
-Kai


2022-04-27 10:08:41

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function

On Tue, 2022-04-26 at 13:37 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It
> > defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
> > operation (SEAM VMX non-root) which are isolated from legacy VMX root
> > and VMX non-root mode.
>
> I feel like this is too much detail for an opening paragraph.
>
> > A CPU-attested software module (called the 'TDX module') runs in SEAM
> > VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
> > SEAM VMX root is also used to host another CPU-attested software module
> > (called the 'P-SEAMLDR') to load and update the TDX module.
> > > Host kernel transits to either the P-SEAMLDR or the TDX module via the
> > new SEAMCALL instruction. SEAMCALL leaf functions are host-side
> > interface functions defined by the P-SEAMLDR and the TDX module around
> > the new SEAMCALL instruction. They are similar to a hypercall, except
> > they are made by host kernel to the SEAM software.
>
> I think you can get rid of about half of this changelog so far and make
> it more clear in the process with this:
>
> TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).
> This mode runs only the TDX module itself or other code needed
> to load the TDX module.
>
> The host kernel communicates with SEAM software via a new
> SEAMCALL instruction. This is conceptually similar to
> a guest->host hypercall, except it is made from the host to SEAM
> software instead.

Thank you!

>
> This is a technical document, but you're writing too technically for my
> taste and focusing on the low-level details rather than the high-level
> concepts. What do I care that SEAM is two modes and what their names
> are at this juncture? Are those details necesarry to get me to
> understand what a SEAMCALL is or what this patch implements?

Thanks for the point. I'll revisit this series based on this in next version.

>
> > SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> > ABI. Instead, they share the same ABI with the TDCALL leaf functions.
> > %rax is used to carry both the SEAMCALL leaf function number (input) and
> > the completion status code (output). Additional GPRs (%rcx, %rdx,
> > %r8->%r11) may be further used as both input and output operands in
> > individual leaf functions.
> >
> > Implement a C function __seamcall()
>
> Your "C function" looks a bit like assembly to me.

Will change to (I saw the TDX guest patches use a similar way):

Add a generic interface to do SEAMCALL leaf functions, using the
assembly macro used by __tdx_module_call().

>
> > to do SEAMCALL leaf functions using
> > the assembly macro used by __tdx_module_call() (the implementation of
> > TDCALL leaf functions). The only exception not covered here is TDENTER
> > leaf function which takes all GPRs and XMM0-XMM15 as both input and
> > output. The caller of TDENTER should implement its own logic to call
> > TDENTER directly instead of using this function.
>
> I have no idea why this paragraph is here or what it is trying to tell me.

Will get rid of the rest of the stuff.

>
> > SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
> > root, and it can fail with VMfailInvalid, for instance, when the SEAM
> > software module is not loaded. The C function __seamcall() returns
> > TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
> > code of SEAMCALLs, to uniquely represent this case.
>
> Again, I'm lost. Why is this detail here? I don't even see
> TDX_SEAMCALL_VMFAILINVALID in the patch.

Will remove.

>
> > diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> > index 1bd688684716..fd577619620e 100644
> > --- a/arch/x86/virt/vmx/tdx/Makefile
> > +++ b/arch/x86/virt/vmx/tdx/Makefile
> > @@ -1,2 +1,2 @@
> > # SPDX-License-Identifier: GPL-2.0-only
> > -obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o
> > +obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o seamcall.o
> > diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> > new file mode 100644
> > index 000000000000..327961b2dd5a
> > --- /dev/null
> > +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> > @@ -0,0 +1,52 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#include <linux/linkage.h>
> > +#include <asm/frame.h>
> > +
> > +#include "tdxcall.S"
> > +
> > +/*
> > + * __seamcall() - Host-side interface functions to SEAM software module
> > + * (the P-SEAMLDR or the TDX module)
> > + *
> > + * Transform function call register arguments into the SEAMCALL register
> > + * ABI. Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
> > + * the SEAMCALL. Additional output operands are saved in @out (if it is
> > + * provided by caller).
>
> This needs to say:
>
> Returns TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails.

OK.


--
Thanks,
-Kai


2022-04-27 10:15:05

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On Tue, 2022-04-26 at 13:53 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > The TDX module is essentially a CPU-attested software module running
> > in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> > host and certain physical attacks. The TDX module implements the
> > functions to build, tear down and start execution of the protected VMs
> > called Trusted Domains (TD). Before the TDX module can be used to
> > create and run TD guests, it must be loaded into the SEAM Range Register
> > (SEAMRR) and properly initialized.
>
> The module isn't loaded into a register, right?
>
> It's loaded into a memory area pointed to *by* the register.

Yes. Should be below:

"..., it must be loaded into the memory area pointed to by the SEAM Ranger
Register (SEAMRR) ...".

>
> > The TDX module is expected to be
> > loaded by BIOS before booting to the kernel, and the kernel is expected
> > to detect and initialize it, using the SEAMCALLs defined by TDX
> > architecture.
>
> Wait a sec... So, what was all this gobleygook about TDX module loading
> and SEAMRR's if the kernel just has the TDX module *handed* to it
> already loaded?
>
> It looks to me like you wrote all of this before the TDX module was
> being loaded by the BIOS and neglected to go and update these changelogs.

Those were written on purpose after we changed to loading the TDX module in the
BIOS. In the code, seamrr_enabled() is checked as the first step of detecting
the TDX module, so I thought it would be better to briefly explain that the
TDX module needs to be loaded into the SEAMRR range.

>
> > The TDX module can be initialized only once in its lifetime. Instead
> > of always initializing it at boot time, this implementation chooses an
> > on-demand approach, deferring TDX initialization until there is a real need
> > (e.g. when requested by KVM). This avoids consuming the memory that must be
> > allocated by the kernel and given to the TDX module as metadata (~1/256th of
> > the TDX-usable memory), and also saves the time of initializing the TDX
> > module (and the metadata) when TDX is not used at all. Initializing the
> > TDX module on-demand at runtime is also more flexible for supporting TDX
> > module runtime updates in the future (after updating the TDX module, it
> > needs to be initialized again).
> >
> > Introduce two placeholders tdx_detect() and tdx_init() to detect and
> > initialize the TDX module on demand, with a state machine introduced to
> > orchestrate the entire process (in case of multiple callers).
> >
> > To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs. The
> > TDX module is reported as not loaded if either SEAMRR is not enabled, or
> > there are not enough TDX private KeyIDs to create any TD guest. The TDX
> > module itself requires one global TDX private KeyID to crypto protect
> > its metadata.
>
> This is stepping over the line into telling me what the code does
> instead of why.

Will remove.

[...]

> > +
> > +static bool seamrr_enabled(void)
> > +{
> > + /*
> > + * To detect any BIOS misconfiguration among cores, all logical
> > + * cpus must have been brought up at least once. This is true
> > + * unless 'maxcpus' kernel command line is used to limit the
> > + * number of cpus to be brought up during boot time. However
> > + * 'maxcpus' is basically an invalid operation mode due to the
> > + * MCE broadcast problem, and it should not be used on a TDX
> > + * capable machine. Just do paranoid check here and do not
> > + * report SEAMRR as enabled in this case.
> > + */
> > + if (!cpumask_equal(&cpus_booted_once_mask,
> > + cpu_present_mask))
> > + return false;
> > +
> > + return __seamrr_enabled();
> > +}
> > +
> > +static bool tdx_keyid_sufficient(void)
> > +{
> > + if (!cpumask_equal(&cpus_booted_once_mask,
> > + cpu_present_mask))
> > + return false;
>
> I'd move this cpumask_equal() to a helper.

Sorry, just to double-confirm: do you want something like the below?

static bool tdx_detected_on_all_cpus(void)
{
	/*
	 * To detect any BIOS misconfiguration among cores, all logical
	 * cpus must have been brought up at least once. This is true
	 * unless 'maxcpus' kernel command line is used to limit the
	 * number of cpus to be brought up during boot time. However
	 * 'maxcpus' is basically an invalid operation mode due to the
	 * MCE broadcast problem, and it should not be used on a TDX
	 * capable machine. Just do paranoid check here and do not
	 * report SEAMRR as enabled in this case.
	 */
	return cpumask_equal(&cpus_booted_once_mask, cpu_present_mask);
}

static bool seamrr_enabled(void)
{
	if (!tdx_detected_on_all_cpus())
		return false;

	return __seamrr_enabled();
}

static bool tdx_keyid_sufficient(void)
{
	if (!tdx_detected_on_all_cpus())
		return false;

	...
}

>
> > + /*
> > + * TDX requires at least two KeyIDs: one global KeyID to
> > + * protect the metadata of the TDX module and one or more
> > + * KeyIDs to run TD guests.
> > + */
> > + return tdx_keyid_num >= 2;
> > +}
> > +
> > +static int __tdx_detect(void)
> > +{
> > + /* The TDX module is not loaded if SEAMRR is disabled */
> > + if (!seamrr_enabled()) {
> > + pr_info("SEAMRR not enabled.\n");
> > + goto no_tdx_module;
> > + }
>
> Why even bother with the SEAMRR stuff? It sounded like you can "ping"
> the module with SEAMCALL. Why not just use that directly?

SEAMCALL will cause #GP if SEAMRR is not enabled. We should check whether
SEAMRR is enabled before making SEAMCALL.
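
I.e. the intent is roughly the below (sketch only, using the seamcall()
C helper that a later patch in this series adds):

	/* Guard every SEAMCALL behind the SEAMRR check ... */
	if (!seamrr_enabled())
		return -ENODEV;

	/* ... only now is it safe to "ping" the module via SEAMCALL. */
	ret = seamcall(fn, 0, 0, 0, 0, NULL, NULL);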

>
> > + /*
> > + * Also do not report the TDX module as loaded if there's
> > + * not enough TDX private KeyIDs to run any TD guests.
> > + */
> > + if (!tdx_keyid_sufficient()) {
> > + pr_info("Number of TDX private KeyIDs too small: %u.\n",
> > + tdx_keyid_num);
> > + goto no_tdx_module;
> > + }
> > +
> > + /* Return -ENODEV until the TDX module is detected */
> > +no_tdx_module:
> > + tdx_module_status = TDX_MODULE_NONE;
> > + return -ENODEV;
> > +}
> > +
> > +static int init_tdx_module(void)
> > +{
> > + /*
> > + * Return -EFAULT until all steps of TDX module
> > + * initialization are done.
> > + */
> > + return -EFAULT;
> > +}
> > +
> > +static void shutdown_tdx_module(void)
> > +{
> > + /* TODO: Shut down the TDX module */
> > + tdx_module_status = TDX_MODULE_SHUTDOWN;
> > +}
> > +
> > +static int __tdx_init(void)
> > +{
> > + int ret;
> > +
> > + /*
> > + * Logical-cpu scope initialization requires calling one SEAMCALL
> > + * on all logical cpus enabled by BIOS. Shutting down the TDX
> > + * module also has such requirement. Further more, configuring
> > + * the key of the global KeyID requires calling one SEAMCALL for
> > + * each package. For simplicity, disable CPU hotplug in the whole
> > + * initialization process.
> > + *
> > + * It's perhaps better to check whether all BIOS-enabled cpus are
> > + * online before starting initializing, and return early if not.
>
> But you did some of this cpumask checking above. Right?

The above check only guarantees SEAMRR/TDX KeyIDs have been detected on all
present cpus. The 'present' cpumask doesn't necessarily cover all BIOS-enabled
CPUs.

>
> > + * But none of 'possible', 'present' and 'online' CPU masks
> > + * represents BIOS-enabled cpus. For example, 'possible' mask is
> > + * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> > + * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> > + * online.
> > + */
> > + cpus_read_lock();
> > +
> > + ret = init_tdx_module();
> > +
> > + /*
> > + * Shut down the TDX module in case of any error during the
> > + * initialization process. It's meaningless to leave the TDX
> > + * module in any middle state of the initialization process.
> > + */
> > + if (ret)
> > + shutdown_tdx_module();
> > +
> > + cpus_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +/**
> > + * tdx_detect - Detect whether the TDX module has been loaded
> > + *
> > + * Detect whether the TDX module has been loaded and ready for
> > + * initialization. Only call this function when all cpus are
> > + * already in VMX operation.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0: The TDX module has been loaded and ready for
> > + * initialization.
>
> "-0", eh?

Sorry. Originally I meant to have:

- 0:
- -ENODEV:
...

Then I changed it to:

0:
-ENODEV:
...

But I forgot to remove the '-' before the 0. I'll remove it.
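
So the Return section of the comment will end up as:

 * Return:
 *
 * * 0:		The TDX module has been loaded and ready for
 *		initialization.
 * * -ENODEV:	The TDX module is not loaded.
 * * -EPERM:	CPU is not in VMX operation.
 * * -EFAULT:	Other internal fatal errors.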

>
> > + * * -ENODEV: The TDX module is not loaded.
> > + * * -EPERM: CPU is not in VMX operation.
> > + * * -EFAULT: Other internal fatal errors.
> > + */
> > +int tdx_detect(void)
> > +{
> > + int ret;
> > +
> > + mutex_lock(&tdx_module_lock);
> > +
> > + switch (tdx_module_status) {
> > + case TDX_MODULE_UNKNOWN:
> > + ret = __tdx_detect();
> > + break;
> > + case TDX_MODULE_NONE:
> > + ret = -ENODEV;
> > + break;
> > + case TDX_MODULE_LOADED:
> > + case TDX_MODULE_INITIALIZED:
> > + ret = 0;
> > + break;
> > + case TDX_MODULE_SHUTDOWN:
> > + ret = -EFAULT;
> > + break;
> > + default:
> > + WARN_ON(1);
> > + ret = -EFAULT;
> > + }
> > +
> > + mutex_unlock(&tdx_module_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_detect);
> > +
> > +/**
> > + * tdx_init - Initialize the TDX module
> > + *
> > + * Initialize the TDX module to make it ready to run TD guests. This
> > + * function should be called after tdx_detect() returns successful.
> > + * Only call this function when all cpus are online and are in VMX
> > + * operation. CPU hotplug is temporarily disabled internally.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0: The TDX module has been successfully initialized.
> > + * * -ENODEV: The TDX module is not loaded.
> > + * * -EPERM: The CPU which does SEAMCALL is not in VMX operation.
> > + * * -EFAULT: Other internal fatal errors.
> > + */
> > +int tdx_init(void)
> > +{
> > + int ret;
> > +
> > + mutex_lock(&tdx_module_lock);
> > +
> > + switch (tdx_module_status) {
> > + case TDX_MODULE_NONE:
> > + ret = -ENODEV;
> > + break;
> > + case TDX_MODULE_LOADED:
> > + ret = __tdx_init();
> > + break;
> > + case TDX_MODULE_INITIALIZED:
> > + ret = 0;
> > + break;
> > + default:
> > + ret = -EFAULT;
> > + break;
> > + }
> > + mutex_unlock(&tdx_module_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_init);
>
> Why does this need both a tdx_detect() and a tdx_init()? Shouldn't the
> interface from outside just be "get TDX up and running, please?"

We can have a single tdx_init(). However tdx_init() can be heavy, and having a
separate, lightweight tdx_detect() may be useful if the caller wants to separate
"detecting the TDX module" and "initializing the TDX module", i.e. to do
something in between.

However tdx_detect() basically only detects the P-SEAMLDR. If we move P-SEAMLDR
detection to tdx_init(), or we get rid of P-SEAMLDR detection completely, then
we don't need tdx_detect() anymore. We can expose seamrr_enabled() and the TDX
KeyID variables or functions so the caller can use them to see whether it should
do TDX related stuff and then call tdx_init().
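
E.g. the caller side could then look something like the below (sketch
only, assuming seamrr_enabled() and tdx_keyid_sufficient() get exposed):

	/* KVM (or another future user) decides whether to bring up TDX */
	if (!seamrr_enabled() || !tdx_keyid_sufficient())
		return -ENODEV;

	return tdx_init();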


--
Thanks,
-Kai


2022-04-27 10:20:32

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/5/22 21:49, Kai Huang wrote:
> SEAM VMX root operation is designed to host a CPU-attested, software
> module called the 'TDX module' which implements functions to manage
> crypto protected VMs called Trust Domains (TD). SEAM VMX root is also

"crypto protected"? What the heck is that?

> designed to host a CPU-attested, software module called the 'Intel
> Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
>
> Host kernel transits to either the P-SEAMLDR or the TDX module via a new

^ The

> SEAMCALL instruction. SEAMCALLs are host-side interface functions
> defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> instruction. They are similar to a hypercall, except they are made by
> host kernel to the SEAM software modules.

This is still missing some important high-level things, like that the
TDX module is protected from the untrusted VMM. Heck, it forgets to
mention that the VMM itself is untrusted and the TDX module replaces
things that the VMM usually does.

It would also be nice to mention here how this compares with SEV-SNP.
Where is the TDX module in that design? Why doesn't SEV need all this code?

> TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> protect TD guests. TDX reserves part of MKTME KeyID space as TDX private
> KeyIDs, which can only be used by software runs in SEAM. The physical

^ which

> address bits for encoding TDX private KeyID are treated as reserved bits
> when not in SEAM operation. The partitioning of MKTME KeyIDs and TDX
> private KeyIDs is configured by BIOS.
>
> Before being able to manage TD guests, the TDX module must be loaded
> and properly initialized using SEAMCALLs defined by TDX architecture.
> This series assumes both the P-SEAMLDR and the TDX module are loaded by
> BIOS before the kernel boots.
>
> There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> Instead, they can be detected by using the P-SEAMLDR's SEAMLDR.INFO
> SEAMCALL. The success of this SEAMCALL means the P-SEAMLDR is loaded.
> The P-SEAMLDR information returned by this SEAMCALL further tells
> whether the TDX module is loaded.

There's a bit of information missing here. The kernel might not know
the state of things being loaded. A previous kernel might have loaded
it and left it in an unknown state.

> The TDX module is initialized in multiple steps:
>
> 1) Global initialization;
> 2) Logical-CPU scope initialization;
> 3) Enumerate the TDX module capabilities;
> 4) Configure the TDX module about usable memory ranges and
> global KeyID information;
> 5) Package-scope configuration for the global KeyID;
> 6) Initialize TDX metadata for usable memory ranges based on 4).
>
> Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> table) logical cpus, otherwise step 4) will fail. Step 5) requires
> calling SEAMCALL on at least one cpu on all packages.
>
> The TDX module can also be shut down at any time during its lifetime, by
> calling SEAMCALL on all "BIOS-enabled" logical cpus.
>
> == Design Considerations ==
>
> 1. Lazy TDX module initialization on-demand by caller

This doesn't really tell us what "lazy" is or what the alternatives are.

There are basically two ways the TDX module could be loaded. Either:
* In early boot
or
* At runtime just before the first TDX guest is run

This series implements the runtime loading.

> None of the steps in the TDX module initialization process needs to be done
> during kernel boot. This series doesn't initialize TDX at boot time, but
> instead, provides two functions to allow caller to detect and initialize
> TDX on demand:
>
> if (tdx_detect())
> goto no_tdx;
> if (tdx_init())
> goto no_tdx;
>
> This approach has below pros:
>
> 1) Initializing the TDX module requires reserving ~1/256th of system RAM as
> metadata. Enabling TDX on demand means this memory is only consumed when
> TDX is truly needed (i.e. when KVM wants to create TD guests).
>
> 2) Both detecting and initializing the TDX module require calling
> SEAMCALL. However, SEAMCALL requires the CPU to already be in VMX operation
> (VMXON has been done). So far, KVM is the only user of TDX, and it
> already handles VMXON/VMXOFF. Therefore, letting KVM initialize TDX
> on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in the
> core kernel. Also, in the long term, a reference-based VMXON/VMXOFF
> approach is likely needed since more kernel components will need to handle
> VMXON/VMXOFF.
>
> 3) It is more flexible to support "TDX module runtime update" (not in
> this series). After updating to the new module at runtime, kernel needs
> to go through the initialization process again. For the new module,
> it's possible the metadata allocated for the old module cannot be reused
> for the new module, and needs to be re-allocated.
>
> 2. Kernel policy on TDX memory
>
> The host kernel is responsible for choosing which memory regions can be used
> as TDX memory, and configuring those memory regions to the TDX module by
> using an array of "TD Memory Regions" (TDMR), which is a data structure
> defined by TDX architecture.


This is putting the cart before the horse. Don't define the details up
front.

The TDX architecture allows the VMM to designate specific memory
as usable for TDX private memory. This series chooses to
designate _all_ system RAM as TDX to avoid having to modify the
page allocator to distinguish TDX and non-TDX-capable memory

... then go on to explain the details.

> The first generation of TDX essentially guarantees that all system RAM
> memory regions (excluding the memory below 1MB) can be used as TDX
> memory. To avoid having to modify the page allocator to distinguish TDX
> and non-TDX allocation, this series chooses to use all system RAM as TDX
> memory.
>
> E820 table is used to find all system RAM entries. Following
> e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> types are treated as TDX memory, and contiguous ranges in the same NUMA
> node are merged together (similar to memblock_add()) before trimming the
> non-page-aligned part.

This e820 cruft is too much detail for a cover letter. In general, once
you start talking about individual functions, you've gone too far in the
cover letter.

> 3. Memory hotplug
>
> The first generation of TDX architecturally doesn't support memory
> hotplug. And the first generation of TDX-capable platforms don't support
> physical memory hotplug. Since it physically cannot happen, this series
> doesn't add any check in ACPI memory hotplug code path to disable it.
>
> A special case of memory hotplug is adding NVDIMM as system RAM using
> kmem driver. However the first generation of TDX-capable platforms
> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> happen either.

What prevents this code from today's code being run on tomorrow's
platforms and breaking these assumptions?

> Another case is that an admin can use the 'memmap' kernel command line to create
> legacy PMEMs and use them as TD guest memory, or theoretically, can use
> kmem driver to add them as system RAM. To avoid having to change memory
> hotplug code to prevent this from happening, this series always includes
> legacy PMEMs when constructing TDMRs so they are also TDX memory.
>
> 4. CPU hotplug
>
> The first generation of TDX architecturally doesn't support ACPI CPU
> hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> first generation of TDX-capable platforms don't support ACPI CPU hotplug
> either. Since this physically cannot happen, this series doesn't add any
> check in ACPI CPU hotplug code path to disable it.
>
> Also, only TDX module initialization requires all BIOS-enabled cpus are
> online. After the initialization, any logical cpu can be brought down
> and brought up to online again later. Therefore this series doesn't
> change logical CPU hotplug either.
>
> 5. TDX interaction with kexec()
>
> If TDX is ever enabled and/or used to run any TD guests, the cachelines
> of TDX private memory, including PAMTs, used by TDX module need to be
> flushed before transiting to the new kernel otherwise they may silently
> corrupt the new kernel. Similar to SME, this series flushes cache in
> stop_this_cpu().

What does this have to do with kexec()? What's a PAMT?

> The TDX module can be initialized only once during its lifetime. The
> first generation of TDX doesn't have interface to reset TDX module to

^ an

> uninitialized state so it can be initialized again.
>
> This implies:
>
> - If the old kernel fails to initialize TDX, the new kernel cannot
> use TDX either, unless the new kernel fixes the bug which led to
> initialization failure in the old kernel and can resume from where
> the old kernel stops. This requires certain coordination between
> the two kernels.

OK, but what does this *MEAN*?

> - If the old kernel has initialized TDX successfully, the new kernel
> may be able to use TDX if the two kernels have exactly the same
> configuration of the TDX module. It further requires the new kernel
> to reserve the TDX metadata pages (allocated by the old kernel) in
> its page allocator. It also requires coordination between the two
> kernels. Furthermore, if kexec() is done when there are active TD
> guests running, the new kernel cannot use TDX because it's extremely
> hard for the old kernel to pass all TDX private pages to the new
> kernel.
>
> Given that, this series doesn't support TDX after kexec() (unless the
> old kernel didn't attempt to initialize TDX at all).
>
> And this series doesn't shut down the TDX module but leaves it open during
> kexec(). This is because shutting down the TDX module requires the CPU to be
> in VMX operation but there's no guarantee of this during kexec(). Leaving
> the TDX module open is not the best case, but it is OK since the new
> kernel won't be able to use TDX anyway (therefore TDX module won't run
> at all).

tl;dr: kexec() doesn't work with this code.

Right?

That doesn't seem good.

2022-04-27 10:29:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function

On 4/5/22 21:49, Kai Huang wrote:
> Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It
> defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
> operation (SEAM VMX non-root) which are isolated from legacy VMX root
> and VMX non-root mode.

I feel like this is too much detail for an opening paragraph.

> A CPU-attested software module (called the 'TDX module') runs in SEAM
> VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
> SEAM VMX root is also used to host another CPU-attested software module
> (called the 'P-SEAMLDR') to load and update the TDX module.
>
> Host kernel transits to either the P-SEAMLDR or the TDX module via the
> new SEAMCALL instruction. SEAMCALL leaf functions are host-side
> interface functions defined by the P-SEAMLDR and the TDX module around
> the new SEAMCALL instruction. They are similar to a hypercall, except
> they are made by host kernel to the SEAM software.

I think you can get rid of about half of this changelog so far and make
it more clear in the process with this:

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).
This mode runs only the TDX module itself or other code needed
to load the TDX module.

The host kernel communicates with SEAM software via a new
SEAMCALL instruction. This is conceptually similar to
a guest->host hypercall, except it is made from the host to SEAM
software instead.

This is a technical document, but you're writing too technically for my
taste and focusing on the low-level details rather than the high-level
concepts. What do I care that SEAM is two modes and what their names
are at this juncture? Are those details necessary to get me to
understand what a SEAMCALL is or what this patch implements?

> SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> ABI. Instead, they share the same ABI with the TDCALL leaf functions.
> %rax is used to carry both the SEAMCALL leaf function number (input) and
> the completion status code (output). Additional GPRs (%rcx, %rdx,
> %r8->%r11) may be further used as both input and output operands in
> individual leaf functions.
>
> Implement a C function __seamcall()

Your "C function" looks a bit like assembly to me.

> to do SEAMCALL leaf functions using
> the assembly macro used by __tdx_module_call() (the implementation of
> TDCALL leaf functions). The only exception not covered here is TDENTER
> leaf function which takes all GPRs and XMM0-XMM15 as both input and
> output. The caller of TDENTER should implement its own logic to call
> TDENTER directly instead of using this function.

I have no idea why this paragraph is here or what it is trying to tell me.

> SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
> root, and it can fail with VMfailInvalid, for instance, when the SEAM
> software module is not loaded. The C function __seamcall() returns
> TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
> code of SEAMCALLs, to uniquely represent this case.

Again, I'm lost. Why is this detail here? I don't even see
TDX_SEAMCALL_VMFAILINVALID in the patch.

> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 1bd688684716..fd577619620e 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,2 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o
> +obj-$(CONFIG_INTEL_TDX_HOST) += tdx.o seamcall.o
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> new file mode 100644
> index 000000000000..327961b2dd5a
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/frame.h>
> +
> +#include "tdxcall.S"
> +
> +/*
> + * __seamcall() - Host-side interface functions to SEAM software module
> + * (the P-SEAMLDR or the TDX module)
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI. Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
> + * the SEAMCALL. Additional output operands are saved in @out (if it is
> + * provided by caller).

This needs to say:

Returns TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails.

> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __seamcall() function ABI:
> + *
> + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI) - Input parameter 1, moved to RCX
> + * @rdx (RDX) - Input parameter 2, moved to RDX
> + * @r8 (RCX) - Input parameter 3, moved to R8
> + * @r9 (R8) - Input parameter 4, moved to R9
> + *
> + * @out (R9) - struct tdx_module_output pointer
> + * stored temporarily in R12 (not
> + * used by the P-SEAMLDR or the TDX
> + * module). It can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL, or
> + * TDX_SEAMCALL_VMFAILINVALID.
> + */
> +SYM_FUNC_START(__seamcall)
> + FRAME_BEGIN
> + TDX_MODULE_CALL host=1
> + FRAME_END
> + ret
> +SYM_FUNC_END(__seamcall)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..9d5b6f554c20
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +#include <linux/types.h>
> +
> +struct tdx_module_output;
> +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + struct tdx_module_output *out);
> +
> +#endif

2022-04-27 10:41:11

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error

On Tue, 2022-04-26 at 13:59 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > TDX supports shutting down the TDX module at any time during its
> > lifetime. After TDX module is shut down, no further SEAMCALL can be
> > made on any logical cpu.
>
> Is this strictly true?
>
> I thought SEAMCALLs were used for the P-SEAMLDR too.

Sorry, will change it to: no TDX module SEAMCALL can be made on any logical cpu.

[...]

> >
> > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > +struct seamcall_ctx {
> > + u64 fn;
> > + u64 rcx;
> > + u64 rdx;
> > + u64 r8;
> > + u64 r9;
> > + atomic_t err;
> > + u64 seamcall_ret;
> > + struct tdx_module_output out;
> > +};
> > +
> > +static void seamcall_smp_call_function(void *data)
> > +{
> > + struct seamcall_ctx *sc = data;
> > + int ret;
> > +
> > + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > + &sc->seamcall_ret, &sc->out);
> > + if (ret)
> > + atomic_set(&sc->err, ret);
> > +}
> > +
> > +/*
> > + * Call the SEAMCALL on all online cpus concurrently.
> > + * Return error if SEAMCALL fails on any cpu.
> > + */
> > +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > +{
> > + on_each_cpu(seamcall_smp_call_function, sc, true);
> > + return atomic_read(&sc->err);
> > +}
>
> Why bother returning something that's not read?

It's not needed. I'll make it void.

Callers can check seamcall_ctx::err directly if they want to know whether any
error happened.
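
I.e. something like:

	/* Callers check sc->err themselves after this returns */
	static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
	{
		on_each_cpu(seamcall_smp_call_function, sc, true);
	}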



--
Thanks,
-Kai


2022-04-27 10:55:17

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On Wed, 2022-04-27 at 00:22 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Kai Huang wrote:
> > On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> > > On 4/26/22 16:12, Kai Huang wrote:
> > > > Hi Dave,
> > > >
> > > > Thanks for review!
> > > >
> > > > On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > > > > > +config INTEL_TDX_HOST
> > > > > > + bool "Intel Trust Domain Extensions (TDX) host support"
> > > > > > + default n
> > > > > > + depends on CPU_SUP_INTEL
> > > > > > + depends on X86_64
> > > > > > + help
> > > > > > + Intel Trust Domain Extensions (TDX) protects guest VMs from
> > > > > > malicious
> > > > > > + host and certain physical attacks. This option enables necessary
> > > > > > TDX
> > > > > > + support in host kernel to run protected VMs.
> > > > > > +
> > > > > > + If unsure, say N.
> > > > >
> > > > > Nothing about KVM?
> > > >
> > > > I'll add KVM into the context. How about below?
> > > >
> > > > "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > > host and certain physical attacks. This option enables necessary TDX
> > > > support in host kernel to allow KVM to run protected VMs called Trust
> > > > Domains (TD)."
> > >
> > > What about a dependency? Isn't this dead code without CONFIG_KVM=y/m?
> >
> > Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> > make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM. But so far KVM is the only
> > user of TDX, so in practice the code is dead w/o KVM.
> >
> > What's your opinion?
>
Take a dependency on CONFIG_KVM_INTEL; there's already precedent for this specific
> case of a feature that can't possibly have an in-kernel user. See
> arch/x86/kernel/cpu/feat_ctl.c, which in the (very) unlikely event IA32_FEATURE_CONTROL
> is left unlocked by BIOS, will deliberately disable VMX if CONFIG_KVM_INTEL=n.

Thanks. Fine to me.
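
I.e. the Kconfig entry will become something like:

	config INTEL_TDX_HOST
		bool "Intel Trust Domain Extensions (TDX) host support"
		default n
		depends on CPU_SUP_INTEL
		depends on X86_64
		depends on KVM_INTEL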

--
Thanks,
-Kai


2022-04-27 10:57:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module

On 4/5/22 21:49, Kai Huang wrote:
> The P-SEAMLDR (persistent SEAM loader) is the first software module that
> runs in SEAM VMX root, responsible for loading and updating the TDX
> module. Both the P-SEAMLDR and the TDX module are expected to be loaded
> before host kernel boots.

Why bother with the P-SEAMLDR here at all? The kernel isn't loading the
TDX module in this series. Why not just call into the TDX module directly?

2022-04-27 11:02:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On 4/26/22 16:12, Kai Huang wrote:
> Hi Dave,
>
> Thanks for review!
>
> On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
>>> +config INTEL_TDX_HOST
>>> + bool "Intel Trust Domain Extensions (TDX) host support"
>>> + default n
>>> + depends on CPU_SUP_INTEL
>>> + depends on X86_64
>>> + help
>>> + Intel Trust Domain Extensions (TDX) protects guest VMs from
>>> malicious
>>> + host and certain physical attacks. This option enables necessary
>>> TDX
>>> + support in host kernel to run protected VMs.
>>> +
>>> + If unsure, say N.
>>
>> Nothing about KVM?
>
> I'll add KVM into the context. How about below?
>
> "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> support in host kernel to allow KVM to run protected VMs called Trust
> Domains (TD)."

What about a dependency? Isn't this dead code without CONFIG_KVM=y/m?

>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> new file mode 100644
>>> index 000000000000..03f35c75f439
>>> --- /dev/null
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -0,0 +1,102 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Copyright(c) 2022 Intel Corporation.
>>> + *
>>> + * Intel Trusted Domain Extensions (TDX) support
>>> + */
>>> +
>>> +#define pr_fmt(fmt) "tdx: " fmt
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/cpumask.h>
>>> +#include <asm/msr-index.h>
>>> +#include <asm/msr.h>
>>> +#include <asm/cpufeature.h>
>>> +#include <asm/cpufeatures.h>
>>> +#include <asm/tdx.h>
>>> +
>>> +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
>>> +#define MTRR_CAP_SEAMRR BIT(15)
>>> +
>>> +/* Core-scope Intel SEAMRR base and mask registers. */
>>> +#define MSR_IA32_SEAMRR_PHYS_BASE 0x00001400
>>> +#define MSR_IA32_SEAMRR_PHYS_MASK 0x00001401
>>> +
>>> +#define SEAMRR_PHYS_BASE_CONFIGURED BIT_ULL(3)
>>> +#define SEAMRR_PHYS_MASK_ENABLED BIT_ULL(11)
>>> +#define SEAMRR_PHYS_MASK_LOCKED BIT_ULL(10)
>>> +
>>> +#define SEAMRR_ENABLED_BITS \
>>> + (SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
>>> +
>>> +/* BIOS must configure SEAMRR registers for all cores consistently */
>>> +static u64 seamrr_base, seamrr_mask;
>>> +
>>> +static bool __seamrr_enabled(void)
>>> +{
>>> + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
>>> +}
>>
>> But there's no case where seamrr_mask is non-zero and where
>> _seamrr_enabled(). Why bother checking the SEAMRR_ENABLED_BITS?
>
> seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
> is 0. It will also be cleared when BIOS mis-configuration is detected on any
> AP. SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.

The point is that this could be:

return !!seamrr_mask;


>>> +static void detect_seam_ap(struct cpuinfo_x86 *c)
>>> +{
>>> + u64 base, mask;
>>> +
>>> + /*
>>> + * Don't bother to detect this AP if SEAMRR is not
>>> + * enabled after earlier detections.
>>> + */
>>> + if (!__seamrr_enabled())
>>> + return;
>>> +
>>> + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
>>> + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
>>> +
>>
>> This is the place for a comment about why the values have to be equal.
>
> I'll add below:
>
> /* BIOS must configure SEAMRR consistently across all cores */

What happens if the BIOS doesn't do this? What actually breaks? In
other words, do we *NEED* error checking here?

>>> + if (base == seamrr_base && mask == seamrr_mask)
>>> + return;
>>> +
>>> + pr_err("Inconsistent SEAMRR configuration by BIOS\n");
>>> + /* Mark SEAMRR as disabled. */
>>> + seamrr_base = 0;
>>> + seamrr_mask = 0;
>>> +}
>>> +
>>> +static void detect_seam(struct cpuinfo_x86 *c)
>>> +{
>>> + if (c == &boot_cpu_data)
>>> + detect_seam_bsp(c);
>>> + else
>>> + detect_seam_ap(c);
>>> +}
>>> +
>>> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
>>> +{
>>> + detect_seam(c);
>>> +}
>>
>> The extra function looks a bit silly here now. Maybe this gets filled
>> out later, but it's goofy-looking here.
>
> Thomas suggested putting all TDX detection related code in one function
> call, so I added tdx_detect_cpu(). I'll move this to the next patch when
> detecting TDX KeyIDs.

That's fine, or just add a comment or a changelog sentence about this
being filled out later.
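
E.g. something like:

	void tdx_detect_cpu(struct cpuinfo_x86 *c)
	{
		/*
		 * Detection of other TDX preliminaries (e.g. TDX
		 * KeyIDs) is added here by later patches.
		 */
		detect_seam(c);
	}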

2022-04-27 11:04:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On Wed, Apr 27, 2022, Kai Huang wrote:
> On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> > On 4/26/22 16:12, Kai Huang wrote:
> > > Hi Dave,
> > >
> > > Thanks for review!
> > >
> > > On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > > > > +config INTEL_TDX_HOST
> > > > > + bool "Intel Trust Domain Extensions (TDX) host support"
> > > > > + default n
> > > > > + depends on CPU_SUP_INTEL
> > > > > + depends on X86_64
> > > > > + help
> > > > > + Intel Trust Domain Extensions (TDX) protects guest VMs from
> > > > > malicious
> > > > > + host and certain physical attacks. This option enables necessary
> > > > > TDX
> > > > > + support in host kernel to run protected VMs.
> > > > > +
> > > > > + If unsure, say N.
> > > >
> > > > Nothing about KVM?
> > >
> > > I'll add KVM into the context. How about below?
> > >
> > > "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > host and certain physical attacks. This option enables necessary TDX
> > > support in host kernel to allow KVM to run protected VMs called Trust
> > > Domains (TD)."
> >
> > What about a dependency? Isn't this dead code without CONFIG_KVM=y/m?
>
> Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM. But so far KVM is the only
> user of TDX, so in practice the code is dead w/o KVM.
>
> What's your opinion?

Take a dependency on CONFIG_KVM_INTEL; there's already precedent for this specific
case of a feature that can't possibly have an in-kernel user. See
arch/x86/kernel/cpu/feat_ctl.c, which in the (very) unlikely event IA32_FEATURE_CONTROL
is left unlocked by BIOS, will deliberately disable VMX if CONFIG_KVM_INTEL=n.

2022-04-27 11:27:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On 4/5/22 21:49, Kai Huang wrote:
> The TDX module is essentially a CPU-attested software module running
> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> host and certain physical attacks. The TDX module implements the
> functions to build, tear down and start execution of the protected VMs
> called Trusted Domains (TD). Before the TDX module can be used to
> create and run TD guests, it must be loaded into the SEAM Range Register
> (SEAMRR) and properly initialized.

The module isn't loaded into a register, right?

It's loaded into a memory area pointed to *by* the register.

> The TDX module is expected to be
> loaded by BIOS before booting to the kernel, and the kernel is expected
> to detect and initialize it, using the SEAMCALLs defined by TDX
> architecture.

Wait a sec... So, what was all this gobleygook about TDX module loading
and SEAMRR's if the kernel just has the TDX module *handed* to it
already loaded?

It looks to me like you wrote all of this before the TDX module was
being loaded by the BIOS and neglected to go and update these changelogs.

> The TDX module can be initialized only once in its lifetime. Instead
> of always initializing it at boot time, this implementation chooses an
> on-demand approach, deferring TDX initialization until there is a real need
> (e.g. when requested by KVM). This avoids consuming the memory that must be
> allocated by the kernel and given to the TDX module as metadata (~1/256th of
> the TDX-usable memory), and also saves the time of initializing the TDX
> module (and the metadata) when TDX is not used at all. Initializing the
> TDX module on-demand at runtime is also more flexible for supporting TDX
> module runtime updates in the future (after updating the TDX module, it
> needs to be initialized again).
>
> Introduce two placeholders tdx_detect() and tdx_init() to detect and
> initialize the TDX module on demand, with a state machine introduced to
> orchestrate the entire process (in case of multiple callers).
>
> To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs. The
> TDX module is reported as not loaded if either SEAMRR is not enabled, or
> there are not enough TDX private KeyIDs to create any TD guest. The TDX
> module itself requires one global TDX private KeyID to crypto protect
> its metadata.

This is stepping over the line into telling me what the code does
instead of why.

> And tdx_init() is currently empty. The TDX module will be initialized
> in multi-steps defined by the TDX architecture:
>
> 1) Global initialization;
> 2) Logical-CPU scope initialization;
> 3) Enumerate the TDX module capabilities and platform configuration;
> 4) Configure the TDX module about usable memory ranges and global
> KeyID information;
> 5) Package-scope configuration for the global KeyID;
> 6) Initialize usable memory ranges based on 4).
>
> The TDX module can also be shut down at any time during its lifetime.
> In case of any error during the initialization process, shut down the
> module. It's pointless to leave the module in any intermediate state
> during the initialization.
>
> SEAMCALL requires SEAMRR to be enabled and the CPU to already be in VMX
> operation (VMXON has been done), otherwise it generates #UD. So far
> only KVM handles VMXON/VMXOFF. Choose to not handle VMXON/VMXOFF in
> tdx_detect() and tdx_init() but depend on the caller to guarantee that,
> since so far KVM is the only user of TDX. In the long term, more kernel
> components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
> module runtime update), so a reference-based approach to do VMXON/VMXOFF
> is likely needed.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
> arch/x86/include/asm/tdx.h | 4 +
> arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 226 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 1f29813b1646..c8af2ba6bb8a 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>
> #ifdef CONFIG_INTEL_TDX_HOST
> void tdx_detect_cpu(struct cpuinfo_x86 *c);
> +int tdx_detect(void);
> +int tdx_init(void);
> #else
> static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
> +static inline int tdx_detect(void) { return -ENODEV; }
> +static inline int tdx_init(void) { return -ENODEV; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ba2210001ea8..53093d4ad458 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -9,6 +9,8 @@
>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> +#include <linux/mutex.h>
> +#include <linux/cpu.h>
> #include <asm/msr-index.h>
> #include <asm/msr.h>
> #include <asm/cpufeature.h>
> @@ -45,12 +47,33 @@
> ((u32)(((_keyid_part) & 0xffffffffull) + 1))
> #define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))
>
> +/*
> + * TDX module status during initialization
> + */
> +enum tdx_module_status_t {
> + /* TDX module status is unknown */
> + TDX_MODULE_UNKNOWN,
> + /* TDX module is not loaded */
> + TDX_MODULE_NONE,
> + /* TDX module is loaded, but not initialized */
> + TDX_MODULE_LOADED,
> + /* TDX module is fully initialized */
> + TDX_MODULE_INITIALIZED,
> + /* TDX module is shutdown due to error during initialization */
> + TDX_MODULE_SHUTDOWN,
> +};
> +
> /* BIOS must configure SEAMRR registers for all cores consistently */
> static u64 seamrr_base, seamrr_mask;
>
> static u32 tdx_keyid_start;
> static u32 tdx_keyid_num;
>
> +static enum tdx_module_status_t tdx_module_status;
> +
> +/* Prevent concurrent attempts on TDX detection and initialization */
> +static DEFINE_MUTEX(tdx_module_lock);
> +
> static bool __seamrr_enabled(void)
> {
> return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> @@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
> detect_seam(c);
> detect_tdx_keyids(c);
> }
> +
> +static bool seamrr_enabled(void)
> +{
> + /*
> + * To detect any BIOS misconfiguration among cores, all logical
> + * cpus must have been brought up at least once. This is true
> + * unless 'maxcpus' kernel command line is used to limit the
> + * number of cpus to be brought up during boot time. However
> + * 'maxcpus' is basically an invalid operation mode due to the
> + * MCE broadcast problem, and it should not be used on a TDX
> + * capable machine. Just do paranoid check here and do not
> + * report SEAMRR as enabled in this case.
> + */
> + if (!cpumask_equal(&cpus_booted_once_mask,
> + cpu_present_mask))
> + return false;
> +
> + return __seamrr_enabled();
> +}
> +
> +static bool tdx_keyid_sufficient(void)
> +{
> + if (!cpumask_equal(&cpus_booted_once_mask,
> + cpu_present_mask))
> + return false;

I'd move this cpumask_equal() to a helper.

> + /*
> + * TDX requires at least two KeyIDs: one global KeyID to
> + * protect the metadata of the TDX module and one or more
> + * KeyIDs to run TD guests.
> + */
> + return tdx_keyid_num >= 2;
> +}
> +
> +static int __tdx_detect(void)
> +{
> + /* The TDX module is not loaded if SEAMRR is disabled */
> + if (!seamrr_enabled()) {
> + pr_info("SEAMRR not enabled.\n");
> + goto no_tdx_module;
> + }

Why even bother with the SEAMRR stuff? It sounded like you can "ping"
the module with SEAMCALL. Why not just use that directly?

> + /*
> + * Also do not report the TDX module as loaded if there's
> + * not enough TDX private KeyIDs to run any TD guests.
> + */
> + if (!tdx_keyid_sufficient()) {
> + pr_info("Number of TDX private KeyIDs too small: %u.\n",
> + tdx_keyid_num);
> + goto no_tdx_module;
> + }
> +
> + /* Return -ENODEV until the TDX module is detected */
> +no_tdx_module:
> + tdx_module_status = TDX_MODULE_NONE;
> + return -ENODEV;
> +}
> +
> +static int init_tdx_module(void)
> +{
> + /*
> + * Return -EFAULT until all steps of TDX module
> + * initialization are done.
> + */
> + return -EFAULT;
> +}
> +
> +static void shutdown_tdx_module(void)
> +{
> + /* TODO: Shut down the TDX module */
> + tdx_module_status = TDX_MODULE_SHUTDOWN;
> +}
> +
> +static int __tdx_init(void)
> +{
> + int ret;
> +
> + /*
> + * Logical-cpu scope initialization requires calling one SEAMCALL
> + * on all logical cpus enabled by BIOS. Shutting down the TDX
> + * module also has such requirement. Further more, configuring
> + * the key of the global KeyID requires calling one SEAMCALL for
> + * each package. For simplicity, disable CPU hotplug in the whole
> + * initialization process.
> + *
> + * It's perhaps better to check whether all BIOS-enabled cpus are
> + * online before starting initializing, and return early if not.

But you did some of this cpumask checking above. Right?

> + * But none of 'possible', 'present' and 'online' CPU masks
> + * represents BIOS-enabled cpus. For example, 'possible' mask is
> + * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> + * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> + * online.
> + */
> + cpus_read_lock();
> +
> + ret = init_tdx_module();
> +
> + /*
> + * Shut down the TDX module in case of any error during the
> + * initialization process. It's meaningless to leave the TDX
> + * module in any middle state of the initialization process.
> + */
> + if (ret)
> + shutdown_tdx_module();
> +
> + cpus_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * tdx_detect - Detect whether the TDX module has been loaded
> + *
> + * Detect whether the TDX module has been loaded and ready for
> + * initialization. Only call this function when all cpus are
> + * already in VMX operation.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0: The TDX module has been loaded and ready for
> + * initialization.

"-0", eh?

> + * * -ENODEV: The TDX module is not loaded.
> + * * -EPERM: CPU is not in VMX operation.
> + * * -EFAULT: Other internal fatal errors.
> + */
> +int tdx_detect(void)
> +{
> + int ret;
> +
> + mutex_lock(&tdx_module_lock);
> +
> + switch (tdx_module_status) {
> + case TDX_MODULE_UNKNOWN:
> + ret = __tdx_detect();
> + break;
> + case TDX_MODULE_NONE:
> + ret = -ENODEV;
> + break;
> + case TDX_MODULE_LOADED:
> + case TDX_MODULE_INITIALIZED:
> + ret = 0;
> + break;
> + case TDX_MODULE_SHUTDOWN:
> + ret = -EFAULT;
> + break;
> + default:
> + WARN_ON(1);
> + ret = -EFAULT;
> + }
> +
> + mutex_unlock(&tdx_module_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_detect);
> +
> +/**
> + * tdx_init - Initialize the TDX module
> + *
> + * Initialize the TDX module to make it ready to run TD guests. This
> + * function should be called after tdx_detect() returns successful.
> + * Only call this function when all cpus are online and are in VMX
> + * operation. CPU hotplug is temporarily disabled internally.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0: The TDX module has been successfully initialized.
> + * * -ENODEV: The TDX module is not loaded.
> + * * -EPERM: The CPU which does SEAMCALL is not in VMX operation.
> + * * -EFAULT: Other internal fatal errors.
> + */
> +int tdx_init(void)
> +{
> + int ret;
> +
> + mutex_lock(&tdx_module_lock);
> +
> + switch (tdx_module_status) {
> + case TDX_MODULE_NONE:
> + ret = -ENODEV;
> + break;
> + case TDX_MODULE_LOADED:
> + ret = __tdx_init();
> + break;
> + case TDX_MODULE_INITIALIZED:
> + ret = 0;
> + break;
> + default:
> + ret = -EFAULT;
> + break;
> + }
> + mutex_unlock(&tdx_module_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_init);

Why does this need both a tdx_detect() and a tdx_init()? Shouldn't the
interface from outside just be "get TDX up and running, please?"

2022-04-27 14:46:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On 4/26/22 16:49, Kai Huang wrote:
> On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
>> What about a dependency? Isn't this dead code without CONFIG_KVM=y/m?
>
> Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM. But so far KVM is the only
> user of TDX, so in practice the code is dead w/o KVM.
>
> What's your opinion?

You're stuck in some really weird fantasy world. Sure, we can dream up
more than one user of the TDX module. But, in the real world, there's
only one. Plus, code can have multiple dependencies!

depends on FOO || BAR

This TDX cruft is dead code in today's real-world kernel without KVM.
You should add a dependency.

>>>>> +static bool __seamrr_enabled(void)
>>>>> +{
>>>>> + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
>>>>> +}
>>>>
>>>> But there's no case where seamrr_mask is non-zero and where
>>>> _seamrr_enabled(). Why bother checking the SEAMRR_ENABLED_BITS?
>>>
>>> seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
>>> is 0. It will also be cleared when BIOS mis-configuration is detected on any
>>> AP. SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.
>>
>> The point is that this could be:
>>
>> return !!seamrr_mask;
>
The definition of the SEAMRR_MASK MSR includes "ENABLED" and "LOCKED" bits.
Explicitly checking the two bits, instead of !!seamrr_mask, rules out other
incorrect configurations. For instance, we should not treat SEAMRR as
enabled if only the "ENABLED" bit or only the "LOCKED" bit is set.

You're confusing two different things:
* The state of the variable
* The actual correct hardware state

The *VARIABLE* can't be non-zero and also denote that SEAMRR is enabled.
Does this *CODE* ever set ENABLED or LOCKED without each other?

>>>>> +static void detect_seam_ap(struct cpuinfo_x86 *c)
>>>>> +{
>>>>> + u64 base, mask;
>>>>> +
>>>>> + /*
>>>>> + * Don't bother to detect this AP if SEAMRR is not
>>>>> + * enabled after earlier detections.
>>>>> + */
>>>>> + if (!__seamrr_enabled())
>>>>> + return;
>>>>> +
>>>>> + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
>>>>> + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
>>>>> +
>>>>
>>>> This is the place for a comment about why the values have to be equal.
>>>
>>> I'll add below:
>>>
>>> /* BIOS must configure SEAMRR consistently across all cores */
>>
>> What happens if the BIOS doesn't do this? What actually breaks? In
>> other words, do we *NEED* error checking here?
>
> AFAICT the spec doesn't explicitly mention what will happen if BIOS doesn't
> configure them consistently among cores. But for safety I think it's better
> to detect this.

Safety? Safety of what?

>>>>> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
>>>>> +{
>>>>> + detect_seam(c);
>>>>> +}
>>>>
>>>> The extra function looks a bit silly here now. Maybe this gets filled
>>>> out later, but it's goofy-looking here.
>>>
>>> Thomas suggested putting all TDX detection related code in one function
>>> call, so I added tdx_detect_cpu(). I'll move this to the next patch when
>>> detecting TDX KeyIDs.
>>
>> That's fine, or just add a comment or a changelog sentence about this
>> being filled out later.
>
> There's already one sentence in the changelog:
>
> "......Add a function to detect all TDX preliminaries (SEAMRR, TDX private
> KeyIDs) for a given cpu when it is brought up. As the first step, detect the
> validity of SEAMRR."
>
> Does this look good to you?

No, that doesn't provide enough context.

There are two single-line wrapper functions. One calls the other. That
looks entirely silly in this patch. You need to explain the silliness,
explicitly.

2022-04-27 14:48:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module

On 4/26/22 17:01, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:56 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> The P-SEAMLDR (persistent SEAM loader) is the first software module that
>>> runs in SEAM VMX root, responsible for loading and updating the TDX
>>> module. Both the P-SEAMLDR and the TDX module are expected to be loaded
>>> before host kernel boots.
>>
>> Why bother with the P-SEAMLDR here at all? The kernel isn't loading the
>> TDX module in this series. Why not just call into the TDX module directly?
>
> It's not absolutely needed in this series. I choose to detect P-SEAMLDR because
> detecting it can also detect the TDX module, and eventually we will need to
> support P-SEAMLDR because the TDX module runtime update uses P-SEAMLDR's
> SEAMCALL to do that.
>
> Also, even for this series, detecting the P-SEAMLDR allows us to provide
> the P-SEAMLDR information to the user at a basic level in dmesg:
>
> [..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209,
> build_num 160, major 1, minor 0
>
> This may be useful to users, but it's not a hard requirement for this series.

We've had a lot of problems in general with this code trying to do too
much at once. I thought we agreed that this was going to only contain
the minimum code to make TDX functional. It seems to be creeping to
grow bigger and bigger.

Am I remembering this wrong?

2022-04-27 15:15:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On 4/26/22 17:43, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:53 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
...
>>> +static bool tdx_keyid_sufficient(void)
>>> +{
>>> + if (!cpumask_equal(&cpus_booted_once_mask,
>>> + cpu_present_mask))
>>> + return false;
>>
>> I'd move this cpumask_equal() to a helper.
>
> Sorry to double confirm, do you want something like:
>
> static bool tdx_detected_on_all_cpus(void)
> {
> 	/*
> 	 * To detect any BIOS misconfiguration among cores, all logical
> 	 * cpus must have been brought up at least once. This is true
> 	 * unless 'maxcpus' kernel command line is used to limit the
> 	 * number of cpus to be brought up during boot time. However
> 	 * 'maxcpus' is basically an invalid operation mode due to the
> 	 * MCE broadcast problem, and it should not be used on a TDX
> 	 * capable machine. Just do paranoid check here and do not
> 	 * report SEAMRR as enabled in this case.
> 	 */
> 	return cpumask_equal(&cpus_booted_once_mask, cpu_present_mask);
> }

That's logically the right idea, but I hate the name since the actual
test has nothing to do with TDX being detected. The comment is also
rather verbose and rambling.

It should be named something like:

all_cpus_booted()

and with a comment like this:

/*
* To initialize TDX, the kernel needs to run some code on every
* present CPU. Detect cases where present CPUs have not been
* booted, like when maxcpus=N is used.
*/
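
IOW, the helper ends up as something like:

	static bool all_cpus_booted(void)
	{
		/*
		 * To initialize TDX, the kernel needs to run some code on
		 * every present CPU. Detect cases where present CPUs have
		 * not been booted, like when maxcpus=N is used.
		 */
		return cpumask_equal(&cpus_booted_once_mask,
				     cpu_present_mask);
	}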

> static bool seamrr_enabled(void)
> {
> 	if (!tdx_detected_on_all_cpus())
> 		return false;
>
> 	return __seamrr_enabled();
> }
>
> static bool tdx_keyid_sufficient(void)
> {
> 	if (!tdx_detected_on_all_cpus())
> 		return false;
>
> 	...
> }

Although, looking at those, it's *still* unclear why you need this. I
assume it's because some later TDX SEAMCALL will fail if you get this
wrong, and you want to be able to provide a better error message.

*BUT* this code doesn't actually provide halfway reasonable error
messages. If someone uses maxcpus=99, then this code will report:

pr_info("SEAMRR not enabled.\n");

right? That's bonkers.

>>> + /*
>>> + * TDX requires at least two KeyIDs: one global KeyID to
>>> + * protect the metadata of the TDX module and one or more
>>> + * KeyIDs to run TD guests.
>>> + */
>>> + return tdx_keyid_num >= 2;
>>> +}
>>> +
>>> +static int __tdx_detect(void)
>>> +{
>>> + /* The TDX module is not loaded if SEAMRR is disabled */
>>> + if (!seamrr_enabled()) {
>>> + pr_info("SEAMRR not enabled.\n");
>>> + goto no_tdx_module;
>>> + }
>>
>> Why even bother with the SEAMRR stuff? It sounded like you can "ping"
>> the module with SEAMCALL. Why not just use that directly?
>
> SEAMCALL will cause #GP if SEAMRR is not enabled. We should check whether
> SEAMRR is enabled before making SEAMCALL.

So... You could actually get rid of all this code. If SEAMCALL #GP's,
then you say, "Whoops, the firmware didn't load the TDX module
correctly, sorry."

Why is all this code here? What is it for?

>>> + /*
>>> + * Also do not report the TDX module as loaded if there's
> >>> + * not enough TDX private KeyIDs to run any TD guests.
>>> + */
>>> + if (!tdx_keyid_sufficient()) {
>>> + pr_info("Number of TDX private KeyIDs too small: %u.\n",
>>> + tdx_keyid_num);
>>> + goto no_tdx_module;
>>> + }
>>> +
>>> + /* Return -ENODEV until the TDX module is detected */
>>> +no_tdx_module:
>>> + tdx_module_status = TDX_MODULE_NONE;
>>> + return -ENODEV;
>>> +}

Again, if someone uses maxcpus=1234 and we get down here, then it
reports to the user:

Number of TDX private KeyIDs too small: ...

???? When the root of the problem has nothing to do with KeyIDs.

>>> +static int init_tdx_module(void)
>>> +{
>>> + /*
>>> + * Return -EFAULT until all steps of TDX module
>>> + * initialization are done.
>>> + */
>>> + return -EFAULT;
>>> +}
>>> +
>>> +static void shutdown_tdx_module(void)
>>> +{
>>> + /* TODO: Shut down the TDX module */
>>> + tdx_module_status = TDX_MODULE_SHUTDOWN;
>>> +}
>>> +
>>> +static int __tdx_init(void)
>>> +{
>>> + int ret;
>>> +
>>> + /*
>>> + * Logical-cpu scope initialization requires calling one SEAMCALL
>>> + * on all logical cpus enabled by BIOS. Shutting down the TDX
> >>> + * module also has such requirement. Furthermore, configuring
>>> + * the key of the global KeyID requires calling one SEAMCALL for
>>> + * each package. For simplicity, disable CPU hotplug in the whole
>>> + * initialization process.
>>> + *
>>> + * It's perhaps better to check whether all BIOS-enabled cpus are
>>> + * online before starting initializing, and return early if not.
>>
>> But you did some of this cpumask checking above. Right?
>
> The above check only guarantees SEAMRR/TDX KeyID has been detected on all
> present cpus. The 'present' cpumask doesn't equal all BIOS-enabled CPUs.

I have no idea what this is saying. In general, I have no idea what the
comment is saying. It makes zero sense. The locking pattern for stuff
like this is:

cpus_read_lock();

for_each_online_cpu(cpu)
	do_something(cpu);

cpus_read_unlock();

because you need to make sure that you don't miss "do_something()" on a
CPU that comes online during the loop.

But, now that I think about it, all of the checks I've seen so far are
for *booted* CPUs. While the lock (I assume) would keep new CPUs from
booting, it doesn't do any good really since the "cpus_booted_once_mask"
bits are only set and not cleared. A CPU doesn't un-become booted once.

Again, we seem to have a long, verbose comment that says very little and
only confuses me.

...
>> Why does this need both a tdx_detect() and a tdx_init()? Shouldn't the
>> interface from outside just be "get TDX up and running, please?"
>
> We can have a single tdx_init(). However tdx_init() can be heavy, and having a
> separate non-heavy tdx_detect() may be useful if caller wants to separate
> "detecting the TDX module" and "initializing the TDX module", i.e. to do
> something in the middle.

<Sigh> So, this "design" went unmentioned, *and* I can't review if the
actual callers of this need the functionality or not because they're not
in this series.

> However tdx_detect() basically only detects the P-SEAMLDR. If we move P-SEAMLDR
> detection to tdx_init(), or we get rid of P-SEAMLDR completely, then we don't
> need tdx_detect() anymore. We can expose seamrr_enabled() and TDX KeyID
> variables or functions so the caller can use them to see whether it should do
> TDX related stuff and then call tdx_init().

I don't think you've made a strong case for why P-SEAMLDR detection is
even necessary in this series.

2022-04-27 22:58:25

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM

On Wed, 2022-04-27 at 07:22 -0700, Dave Hansen wrote:
> On 4/26/22 16:49, Kai Huang wrote:
> > On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> > > What about a dependency? Isn't this dead code without CONFIG_KVM=y/m?
> >
> > Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> > make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM. But so far KVM is the only
> > user of TDX, so in practice the code is dead w/o KVM.
> >
> > What's your opinion?
>
> You're stuck in some really weird fantasy world. Sure, we can dream up
> more than one user of the TDX module. But, in the real world, there's
> only one. Plus, code can have multiple dependencies!
>
> depends on FOO || BAR
>
> This TDX cruft is dead code in today's real-world kernel without KVM.
> You should add a dependency.

Will add a dependency on CONFIG_KVM_INTEL.

>
> > > > > > +static bool __seamrr_enabled(void)
> > > > > > +{
> > > > > > + return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > > > > > +}
> > > > >
> > > > > But there's no case where seamrr_mask is non-zero and where
> > > > > _seamrr_enabled(). Why bother checking the SEAMRR_ENABLED_BITS?
> > > >
> > > > seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
> > > > is 0. It will also be cleared when BIOS mis-configuration is detected on any
> > > > AP. SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.
> > >
> > > The point is that this could be:
> > >
> > > return !!seamrr_mask;
> >
> > The definition of this SEAMRR_MASK MSR defines "ENABLED" and "LOCKED" bits.
> > Explicitly checking the two bits, instead of !!seamrr_mask, rules out other
> > incorrect configurations. For instance, we should not treat SEAMRR as
> > enabled if we only have the "ENABLED" bit set or the "LOCKED" bit set.
>
> You're confusing two different things:
> * The state of the variable
> * The actual correct hardware state
>
> The *VARIABLE* can't be non-zero and also denote that SEAMRR is enabled.
> Does this *CODE* ever set ENABLED or LOCKED without each other?

OK. Will just use !!seamrr_mask. I thought explicitly checking
SEAMRR_ENABLED_BITS would be clearer.

>
> > > > > > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > > > > > +{
> > > > > > + u64 base, mask;
> > > > > > +
> > > > > > + /*
> > > > > > + * Don't bother to detect this AP if SEAMRR is not
> > > > > > + * enabled after earlier detections.
> > > > > > + */
> > > > > > + if (!__seamrr_enabled())
> > > > > > + return;
> > > > > > +
> > > > > > + rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > > > > > + rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > > > > > +
> > > > >
> > > > > This is the place for a comment about why the values have to be equal.
> > > >
> > > > I'll add below:
> > > >
> > > > /* BIOS must configure SEAMRR consistently across all cores */
> > >
> > > What happens if the BIOS doesn't do this? What actually breaks? In
> > > other words, do we *NEED* error checking here?
> >
> > AFAICT the spec doesn't explicitly mention what will happen if BIOS doesn't
> > configure them consistently among cores. But for safety I think it's better to
> > detect.
>
> Safety? Safety of what?

I'll ask TDX architect people and get back to you.

I'll also ask what will happen if TDX KeyID isn't configured consistently across
packages. Currently TDX KeyID is also detected on all cpus (existing
detect_tme() also detects MKTME KeyID bits on all cpus).

2022-04-27 23:02:20

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module

On Wed, 2022-04-27 at 07:24 -0700, Dave Hansen wrote:
> On 4/26/22 17:01, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:56 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > The P-SEAMLDR (persistent SEAM loader) is the first software module that
> > > > runs in SEAM VMX root, responsible for loading and updating the TDX
> > > > module. Both the P-SEAMLDR and the TDX module are expected to be loaded
> > > > before host kernel boots.
> > >
> > > Why bother with the P-SEAMLDR here at all? The kernel isn't loading the
> > > TDX module in this series. Why not just call into the TDX module directly?
> >
> > It's not absolutely needed in this series. I chose to detect the P-SEAMLDR because
> > detecting it can also detect the TDX module, and eventually we will need to
> > support P-SEAMLDR because the TDX module runtime update uses P-SEAMLDR's
> > SEAMCALL to do that.
> >
> > Also, even for this series, detecting the P-SEAMLDR allows us to provide the P-
> > SEAMLDR information to the user at a basic level in dmesg:
> >
> > [..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209,
> > build_num 160, major 1, minor 0
> >
> > This may be useful to users, but it's not a hard requirement for this series.
>
> We've had a lot of problems in general with this code trying to do too
> much at once. I thought we agreed that this was going to only contain
> the minimum code to make TDX functional. It seems to be creeping to
> grow bigger and bigger.
>
> Am I remembering this wrong?

OK. I'll remove the P-SEAMLDR related code.

--
Thanks,
-Kai


2022-04-27 23:04:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/26/22 18:15, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> SEAM VMX root operation is designed to host a CPU-attested, software
>>> module called the 'TDX module' which implements functions to manage
>>> crypto protected VMs called Trust Domains (TD). SEAM VMX root is also
>>
>> "crypto protected"? What the heck is that?
>
> How about "crypto-protected"? I googled and it seems it is used by someone
> else.

Cryptography itself doesn't provide (much) protection in the TDX
architecture. TDX guests are isolated from the VMM in ways that
traditional guests are not, but that has almost nothing to do with
cryptography.

Is it cryptography that keeps the host from reading guest private data
in the clear? Is it cryptography that keeps the host from reading guest
ciphertext? Does cryptography enforce the extra rules of Secure-EPT?

>>> 3. Memory hotplug
>>>
>>> The first generation of TDX architecturally doesn't support memory
>>> hotplug. And the first generation of TDX-capable platforms don't support
>>> physical memory hotplug. Since it physically cannot happen, this series
>>> doesn't add any check in ACPI memory hotplug code path to disable it.
>>>
>>> A special case of memory hotplug is adding NVDIMM as system RAM using
>>> kmem driver. However the first generation of TDX-capable platforms
>>> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
>>> happen either.
>>
>> What prevents this code from today's code being run on tomorrow's
>> platforms and breaking these assumptions?
>
> I forgot to add below (which is in the documentation patch):
>
> "This can be enhanced when future generation of TDX starts to support ACPI
> memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> same platform."
>
> Is this acceptable?

No, Kai.

You're basically saying: *this* code doesn't work with feature A, B and
C. Then, you're pivoting to say that it doesn't matter because one
version of Intel's hardware doesn't support A, B, or C.

I don't care about this *ONE* version of the hardware. I care about
*ALL* the hardware that this code will ever support. *ALL* the hardware
on which this code will run.

In 5 years, if someone takes this code and runs it on Intel hardware
with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

You can't just ignore the problems because they're not present on one
version of the hardware.

>>> Another case is admin can use 'memmap' kernel command line to create
>>> legacy PMEMs and use them as TD guest memory, or theoretically, can use
>>> kmem driver to add them as system RAM. To avoid having to change memory
> > > hotplug code to prevent this from happening, this series always includes
>>> legacy PMEMs when constructing TDMRs so they are also TDX memory.
>>>
>>> 4. CPU hotplug
>>>
>>> The first generation of TDX architecturally doesn't support ACPI CPU
>>> hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
>>> first generation of TDX-capable platforms don't support ACPI CPU hotplug
>>> either. Since this physically cannot happen, this series doesn't add any
>>> check in ACPI CPU hotplug code path to disable it.
>>>
>>> Also, only TDX module initialization requires all BIOS-enabled cpus are
>>> online. After the initialization, any logical cpu can be brought down
>>> and brought up to online again later. Therefore this series doesn't
>>> change logical CPU hotplug either.
>>>
>>> 5. TDX interaction with kexec()
>>>
>>> If TDX is ever enabled and/or used to run any TD guests, the cachelines
>>> of TDX private memory, including PAMTs, used by TDX module need to be
>>> flushed before transiting to the new kernel otherwise they may silently
>>> corrupt the new kernel. Similar to SME, this series flushes cache in
>>> stop_this_cpu().
>>
>> What does this have to do with kexec()? What's a PAMT?
>
> The point is the dirty cachelines of TDX private memory must be flushed
> otherwise they may silently corrupt the new kexec()-ed kernel.
>
> Will use "TDX metadata" instead of "PAMT". The former has already been
> mentioned above.

Longer description for the patch itself:

TDX memory encryption is built on top of MKTME which uses physical
address aliases to designate encryption keys. This architecture is not
cache coherent. Software is responsible for flushing the CPU caches
when memory changes keys. When kexec()'ing, memory can be repurposed
from TDX use to non-TDX use, changing the effective encryption key.

Cover-letter-level description:

Just like SME, TDX hosts require special cache flushing before kexec().
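
For reference, the SME precedent alluded to here is a low-level feature check
plus WBINVD in stop_this_cpu(); a TDX analogue might look roughly like this
(a sketch; platform_has_tdx() is a hypothetical stand-in for whatever
low-level test is settled on later in the thread):

	/* In stop_this_cpu(), before the CPU is parked (sketch) */
	if (platform_has_tdx())		/* hypothetical check */
		native_wbinvd();	/* flush dirty TDX-private cachelines */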

>>> uninitialized state so it can be initialized again.
>>>
>>> This implies:
>>>
>>> - If the old kernel fails to initialize TDX, the new kernel cannot
>>> use TDX too unless the new kernel fixes the bug which leads to
>>> initialization failure in the old kernel and can resume from where
>>> the old kernel stops. This requires certain coordination between
>>> the two kernels.
>>
>> OK, but what does this *MEAN*?
>
> This means we need to extend the information which the old kernel passes to the
> new kernel. But I don't think it's feasible. I'll refine this kexec() section
> to make it more concise next version.
>
>>
>>> - If the old kernel has initialized TDX successfully, the new kernel
>>> may be able to use TDX if the two kernels have the exactly same
>>> configurations on the TDX module. It further requires the new kernel
>>> to reserve the TDX metadata pages (allocated by the old kernel) in
>>> its page allocator. It also requires coordination between the two
>>> kernels. Furthermore, if kexec() is done when there are active TD
>>> guests running, the new kernel cannot use TDX because it's extremely
>>> hard for the old kernel to pass all TDX private pages to the new
>>> kernel.
>>>
>>> Given that, this series doesn't support TDX after kexec() (except when
>>> the old kernel doesn't attempt to initialize TDX at all).
>>>
>>> And this series doesn't shut down TDX module but leaves it open during
>>> kexec(). It is because shutting down TDX module requires CPU being in
>>> VMX operation but there's no guarantee of this during kexec(). Leaving
>>> the TDX module open is not the best case, but it is OK since the new
>>> kernel won't be able to use TDX anyway (therefore TDX module won't run
>>> at all).
>>
>> tl;dr: kexec() doesn't work with this code.
>>
>> Right?
>>
>> That doesn't seem good.
>
> It can work in my understanding. We just need to flush cache before booting to
> the new kernel.

What about all the concerns about TDX module configuration changing?

2022-04-28 01:00:55

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> On 4/26/22 18:15, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > SEAM VMX root operation is designed to host a CPU-attested, software
> > > > module called the 'TDX module' which implements functions to manage
> > > > crypto protected VMs called Trust Domains (TD). SEAM VMX root is also
> > >
> > > "crypto protected"? What the heck is that?
> >
> > How about "crypto-protected"? I googled and it seems it is used by someone
> > else.
>
> Cryptography itself doesn't provide (much) protection in the TDX
> architecture. TDX guests are isolated from the VMM in ways that
> traditional guests are not, but that has almost nothing to do with
> cryptography.
>
> Is it cryptography that keeps the host from reading guest private data
> in the clear? Is it cryptography that keeps the host from reading guest
> ciphertext? Does cryptography enforce the extra rules of Secure-EPT?

OK will change to "protected VMs" in this entire series.

>
> > > > 3. Memory hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support memory
> > > > hotplug. And the first generation of TDX-capable platforms don't support
> > > > physical memory hotplug. Since it physically cannot happen, this series
> > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > >
> > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> > > > kmem driver. However the first generation of TDX-capable platforms
> > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > happen either.
> > >
> > > What prevents this code from today's code being run on tomorrow's
> > > platforms and breaking these assumptions?
> >
> > I forgot to add below (which is in the documentation patch):
> >
> > "This can be enhanced when future generation of TDX starts to support ACPI
> > memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> > same platform."
> >
> > Is this acceptable?
>
> No, Kai.
>
> You're basically saying: *this* code doesn't work with feature A, B and
> C. Then, you're pivoting to say that it doesn't matter because one
> version of Intel's hardware doesn't support A, B, or C.
>
> I don't care about this *ONE* version of the hardware. I care about
> *ALL* the hardware that this code will ever support. *ALL* the hardware
> on which this code will run.
>
> In 5 years, if someone takes this code and runs it on Intel hardware
> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

I thought we could state in the documentation that this code only works on TDX
machines that don't have the above capabilities (SPR for now), and change both
the code and the documentation when we add support for those features in the
future.

If someone takes this code 5 years later, he/she should take a look at the
documentation and figure out that a newer kernel is needed if the machine
supports those features.

I'll think about design solutions if the above doesn't look good to you.

>
> You can't just ignore the problems because they're not present on one
> version of the hardware.
>
> > > > Another case is admin can use 'memmap' kernel command line to create
> > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > kmem driver to add them as system RAM. To avoid having to change memory
> > > > hotplug code to prevent this from happening, this series always includes
> > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > > >
> > > > 4. CPU hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > either. Since this physically cannot happen, this series doesn't add any
> > > > check in ACPI CPU hotplug code path to disable it.
> > > >
> > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > > > online. After the initialization, any logical cpu can be brought down
> > > > and brought up to online again later. Therefore this series doesn't
> > > > change logical CPU hotplug either.
> > > >
> > > > 5. TDX interaction with kexec()
> > > >
> > > > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > > > of TDX private memory, including PAMTs, used by TDX module need to be
> > > > flushed before transiting to the new kernel otherwise they may silently
> > > > corrupt the new kernel. Similar to SME, this series flushes cache in
> > > > stop_this_cpu().
> > >
> > > What does this have to do with kexec()? What's a PAMT?
> >
> > The point is the dirty cachelines of TDX private memory must be flushed
> > otherwise they may silently corrupt the new kexec()-ed kernel.
> >
> > Will use "TDX metadata" instead of "PAMT". The former has already been
> > mentioned above.
>
> Longer description for the patch itself:
>
> TDX memory encryption is built on top of MKTME which uses physical
> address aliases to designate encryption keys. This architecture is not
> cache coherent. Software is responsible for flushing the CPU caches
> when memory changes keys. When kexec()'ing, memory can be repurposed
> from TDX use to non-TDX use, changing the effective encryption key.
>
> Cover-letter-level description:
>
> Just like SME, TDX hosts require special cache flushing before kexec().

Thanks.

>
> > > > uninitialized state so it can be initialized again.
> > > >
> > > > This implies:
> > > >
> > > > - If the old kernel fails to initialize TDX, the new kernel cannot
> > > > use TDX too unless the new kernel fixes the bug which leads to
> > > > initialization failure in the old kernel and can resume from where
> > > > the old kernel stops. This requires certain coordination between
> > > > the two kernels.
> > >
> > > OK, but what does this *MEAN*?
> >
> > This means we need to extend the information which the old kernel passes to the
> > new kernel. But I don't think it's feasible. I'll refine this kexec() section
> > to make it more concise next version.
> >
> > >
> > > > - If the old kernel has initialized TDX successfully, the new kernel
> > > > may be able to use TDX if the two kernels have the exactly same
> > > > configurations on the TDX module. It further requires the new kernel
> > > > to reserve the TDX metadata pages (allocated by the old kernel) in
> > > > its page allocator. It also requires coordination between the two
> > > > kernels. Furthermore, if kexec() is done when there are active TD
> > > > guests running, the new kernel cannot use TDX because it's extremely
> > > > hard for the old kernel to pass all TDX private pages to the new
> > > > kernel.
> > > >
> > > > Given that, this series doesn't support TDX after kexec() (except when
> > > > the old kernel doesn't attempt to initialize TDX at all).
> > > >
> > > > And this series doesn't shut down TDX module but leaves it open during
> > > > kexec(). It is because shutting down TDX module requires CPU being in
> > > > VMX operation but there's no guarantee of this during kexec(). Leaving
> > > > the TDX module open is not the best case, but it is OK since the new
> > > > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > > > at all).
> > >
> > > tl;dr: kexec() doesn't work with this code.
> > >
> > > Right?
> > >
> > > That doesn't seem good.
> >
> > It can work in my understanding. We just need to flush cache before booting to
> > the new kernel.
>
> What about all the concerns about TDX module configuration changing?
>

Leaving the TDX module in the fully initialized state, or in the shutdown
state (in case of an error during its initialization), for the new kernel is
fine. If the new kernel doesn't use TDX at all, then the TDX module won't
access memory using its global TDX KeyID. If the new kernel wants to use TDX,
it will fail on the very first SEAMCALL when it tries to initialize the TDX
module, and won't use SEAMCALL to call the TDX module again. If the new kernel
doesn't follow this, then it has a bug, or it is malicious, in which case it
can potentially corrupt the data. But I don't think we need to consider this
case: if the new kernel is malicious, it can corrupt data anyway.

Does this make sense?

Are there any other concerns that I missed?

--
Thanks,
-Kai


2022-04-28 01:28:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/27/22 17:37, Kai Huang wrote:
> On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
>> In 5 years, if someone takes this code and runs it on Intel hardware
>> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
>
> I thought we could state in the documentation that this code only works on TDX
> machines that don't have the above capabilities (SPR for now), and change both
> the code and the documentation when we add support for those features in the
> future.
>
> If someone takes this code 5 years later, he/she should take a look at the
> documentation and figure out that a newer kernel is needed if the machine
> supports those features.
>
> I'll think about design solutions if the above doesn't look good to you.

No, it doesn't look good to me.

You can't just say:

/*
* This code will eat puppies if used on systems with hotplug.
*/

and merrily await the puppy bloodbath.

If it's not compatible, then you have to *MAKE* it not compatible in a
safe, controlled way.
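
One safe, controlled way to do that is a memory hotplug notifier that vetoes
hot-add while TDX is in use; a minimal sketch (platform_tdx_enabled() is a
hypothetical predicate):

#include <linux/memory.h>

static int tdx_memory_notifier(struct notifier_block *nb,
			       unsigned long action, void *v)
{
	/* Veto any memory hot-add once TDX is enabled */
	if (action == MEM_GOING_ONLINE && platform_tdx_enabled())
		return NOTIFY_BAD;

	return NOTIFY_OK;
}

static struct notifier_block tdx_memory_nb = {
	.notifier_call = tdx_memory_notifier,
};

/* ... at init time: register_memory_notifier(&tdx_memory_nb); */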

>> You can't just ignore the problems because they're not present on one
>> version of the hardware.

Please, please read this again ^^

>> What about all the concerns about TDX module configuration changing?
>
> Leaving the TDX module in the fully initialized state, or in the shutdown
> state (in case of an error during its initialization), for the new kernel is
> fine. If the new kernel doesn't use TDX at all, then the TDX module won't
> access memory using its global TDX KeyID. If the new kernel wants to use TDX,
> it will fail on the very first SEAMCALL when it tries to initialize the TDX
> module, and won't use SEAMCALL to call the TDX module again. If the new kernel
> doesn't follow this, then it has a bug, or it is malicious, in which case it
> can potentially corrupt the data. But I don't think we need to consider this
> case: if the new kernel is malicious, it can corrupt data anyway.
>
> Does this make sense?

No, I'm pretty lost. But, I'll look at the next version of this with
fresh eyes and hopefully you'll have had time to streamline the text by
then.

2022-04-28 04:51:24

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <[email protected]> wrote:
[..]
> > 3. Memory hotplug
> >
> > The first generation of TDX architecturally doesn't support memory
> > hotplug. And the first generation of TDX-capable platforms don't support
> > physical memory hotplug. Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> >
> > A special case of memory hotplug is adding NVDIMM as system RAM using

Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...

> > kmem driver. However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
>
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

The assumption is already broken today with NVDIMM-N. The lack of
DDR-T support on TDX enabled platforms has zero effect on DDR-based
persistent memory solutions. In other words, please describe the
actual software and hardware conflicts at play here, and do not make
the mistake of assuming that "no DDR-T support on TDX platforms" ==
"no NVDIMM support".

> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM. To avoid having to change memory
> > hotplug code to prevent this from happening, this series always includes
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.

I am not sure what you are trying to say here?

> > 4. CPU hotplug
> >
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either. Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.

What are the actual challenges posed to TDX with respect to CPU hotplug?

> > Also, only TDX module initialization requires all BIOS-enabled cpus are

Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
/proc/cpuinfo for example.

2022-04-28 06:50:00

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> On 4/27/22 17:37, Kai Huang wrote:
> > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> >
> > I thought we could state in the documentation that this code only works on TDX
> > machines that don't have the above capabilities (SPR for now), and change both
> > the code and the documentation when we add support for those features in the
> > future.
> >
> > If someone takes this code 5 years later, he/she should take a look at the
> > documentation and figure out that a newer kernel is needed if the machine
> > supports those features.
> >
> > I'll think about design solutions if the above doesn't look good to you.
>
> No, it doesn't look good to me.
>
> You can't just say:
>
> /*
> * This code will eat puppies if used on systems with hotplug.
> */
>
> and merrily await the puppy bloodbath.
>
> If it's not compatible, then you have to *MAKE* it not compatible in a
> safe, controlled way.
>
> > > You can't just ignore the problems because they're not present on one
> > > version of the hardware.
>
> Please, please read this again ^^

OK. I'll think about solutions and come back later.

>
> > > What about all the concerns about TDX module configuration changing?
> >
> > Leaving the TDX module in the fully initialized state, or in the shutdown
> > state (in case of an error during its initialization), for the new kernel is
> > fine. If the new kernel doesn't use TDX at all, then the TDX module won't
> > access memory using its global TDX KeyID. If the new kernel wants to use TDX,
> > it will fail on the very first SEAMCALL when it tries to initialize the TDX
> > module, and won't use SEAMCALL to call the TDX module again. If the new kernel
> > doesn't follow this, then it has a bug, or it is malicious, in which case it
> > can potentially corrupt the data. But I don't think we need to consider this
> > case: if the new kernel is malicious, it can corrupt data anyway.
> >
> > Does this make sense?
>
> No, I'm pretty lost. But, I'll look at the next version of this with
> fresh eyes and hopefully you'll have had time to streamline the text by
> then.

OK thanks.

--
Thanks,
-Kai


2022-04-28 07:51:44

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On Wed, 2022-04-27 at 07:49 -0700, Dave Hansen wrote:
> On 4/26/22 17:43, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:53 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> ...
> > > > +static bool tdx_keyid_sufficient(void)
> > > > +{
> > > > + if (!cpumask_equal(&cpus_booted_once_mask,
> > > > + cpu_present_mask))
> > > > + return false;
> > >
> > > I'd move this cpumask_equal() to a helper.
> >
> > Sorry to double-check, do you want something like:
> >
> > static bool tdx_detected_on_all_cpus(void)
> > {
> > /*
> > * To detect any BIOS misconfiguration among cores, all logical
> > * cpus must have been brought up at least once. This is true
> > * unless 'maxcpus' kernel command line is used to limit the
> > * number of cpus to be brought up during boot time. However
> > * 'maxcpus' is basically an invalid operation mode due to the
> > * MCE broadcast problem, and it should not be used on a TDX
> > * capable machine. Just do paranoid check here and do not
> > * report SEAMRR as enabled in this case.
> > */
> > return cpumask_equal(&cpus_booted_once_mask, cpu_present_mask);
> > }
>
> That's logically the right idea, but I hate the name since the actual
> test has nothing to do with TDX being detected. The comment is also
> rather verbose and rambling.
>
> It should be named something like:
>
> all_cpus_booted()
>
> and with a comment like this:
>
> /*
> * To initialize TDX, the kernel needs to run some code on every
> * present CPU. Detect cases where present CPUs have not been
> * booted, like when maxcpus=N is used.
> */

Thank you.

>
> > static bool seamrr_enabled(void)
> > {
> > if (!tdx_detected_on_all_cpus())
> > return false;
> >
> > return __seamrr_enabled();
> > }
> >
> > static bool tdx_keyid_sufficient()
> > {
> > if (!tdx_detected_on_all_cpus())
> > return false;
> >
> > ...
> > }
>
> Although, looking at those, it's *still* unclear why you need this. I
> assume it's because some later TDX SEAMCALL will fail if you get this
> wrong, and you want to be able to provide a better error message.
>
> *BUT* this code doesn't actually provide halfway reasonable error
> messages. If someone uses maxcpus=99, then this code will report:
>
> pr_info("SEAMRR not enabled.\n");
>
> right? That's bonkers.

Right, this isn't good.

I think we can use pr_info_once() when all_cpus_booted() returns false, and get
rid of printing "SEAMRR not enabled" in seamrr_enabled(). How about below?

static bool seamrr_enabled(void)
{
	if (!all_cpus_booted())
		pr_info_once("Not all present CPUs have been booted. Report SEAMRR as not enabled.\n");

	return __seamrr_enabled();
}

And we don't print "SEAMRR not enabled".

>
> > > > + /*
> > > > + * TDX requires at least two KeyIDs: one global KeyID to
> > > > + * protect the metadata of the TDX module and one or more
> > > > + * KeyIDs to run TD guests.
> > > > + */
> > > > + return tdx_keyid_num >= 2;
> > > > +}
> > > > +
> > > > +static int __tdx_detect(void)
> > > > +{
> > > > + /* The TDX module is not loaded if SEAMRR is disabled */
> > > > + if (!seamrr_enabled()) {
> > > > + pr_info("SEAMRR not enabled.\n");
> > > > + goto no_tdx_module;
> > > > + }
> > >
> > > Why even bother with the SEAMRR stuff? It sounded like you can "ping"
> > > the module with SEAMCALL. Why not just use that directly?
> >
> > SEAMCALL will cause #GP if SEAMRR is not enabled. We should check whether
> > SEAMRR is enabled before making SEAMCALL.
>
> So... You could actually get rid of all this code. If SEAMCALL #GP's,
> then you say, "Whoops, the firmware didn't load the TDX module
> correctly, sorry."

Yes we can just use the first SEAMCALL (TDH.SYS.INIT) to detect whether TDX
module is loaded. If SEAMCALL is successful, the module is loaded.
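
In other words, detection could collapse to a single probing call, roughly
(a sketch; seamcall() and TDH_SYS_INIT stand in for the series' actual
wrapper and leaf number):

	/* Probe for a loaded TDX module by issuing the first SEAMCALL */
	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL);
	if (ret)
		return -ENODEV;	/* no TDX module loaded */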

One problem is that currently the patch to flush the cache for kexec() uses
seamrr_enabled() and tdx_keyid_sufficient() to determine whether we need to
flush the cache. The reason is that, similar to SME, the flush is done in
stop_this_cpu(), but the status of TDX module initialization is protected by a
mutex, so we cannot use the TDX module status in stop_this_cpu() to determine
whether to flush.

If that patch makes sense, I think we still need to detect SEAMRR?

>
> Why is all this code here? What is it for?
>
> > > > + /*
> > > > + * Also do not report the TDX module as loaded if there's
> > > > + * not enough TDX private KeyIDs to run any TD guests.
> > > > + */
> > > > + if (!tdx_keyid_sufficient()) {
> > > > + pr_info("Number of TDX private KeyIDs too small: %u.\n",
> > > > + tdx_keyid_num);
> > > > + goto no_tdx_module;
> > > > + }
> > > > +
> > > > + /* Return -ENODEV until the TDX module is detected */
> > > > +no_tdx_module:
> > > > + tdx_module_status = TDX_MODULE_NONE;
> > > > + return -ENODEV;
> > > > +}
>
> Again, if someone uses maxcpus=1234 and we get down here, then it
> reports to the user:
>
> Number of TDX private KeyIDs too small: ...
>
> ???? When the root of the problem has nothing to do with KeyIDs.

Thanks for catching this. Similar to seamrr_enabled() above.

>
> > > > +static int init_tdx_module(void)
> > > > +{
> > > > + /*
> > > > + * Return -EFAULT until all steps of TDX module
> > > > + * initialization are done.
> > > > + */
> > > > + return -EFAULT;
> > > > +}
> > > > +
> > > > +static void shutdown_tdx_module(void)
> > > > +{
> > > > + /* TODO: Shut down the TDX module */
> > > > + tdx_module_status = TDX_MODULE_SHUTDOWN;
> > > > +}
> > > > +
> > > > +static int __tdx_init(void)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + /*
> > > > + * Logical-cpu scope initialization requires calling one SEAMCALL
> > > > + * on all logical cpus enabled by BIOS. Shutting down the TDX
> > > > + * module also has such requirement. Furthermore, configuring
> > > > + * the key of the global KeyID requires calling one SEAMCALL for
> > > > + * each package. For simplicity, disable CPU hotplug in the whole
> > > > + * initialization process.
> > > > + *
> > > > + * It's perhaps better to check whether all BIOS-enabled cpus are
> > > > + * online before starting initializing, and return early if not.
> > >
> > > But you did some of this cpumask checking above. Right?
> >
> > The above check only guarantees SEAMRR/TDX KeyID has been detected on all
> > present cpus. The 'present' cpumask doesn't equal all BIOS-enabled CPUs.
>
> I have no idea what this is saying. In general, I have no idea what the
> comment is saying. It makes zero sense. The locking pattern for stuff
> like this is:
>
> cpus_read_lock();
>
> for_each_online_cpu(cpu)
> 	do_something(cpu);
>
> cpus_read_unlock();
>
> because you need to make sure that you don't miss "do_something()" on a
> CPU that comes online during the loop.

I don't want any CPU going offline, so "do_something" will be done on all
online CPUs.

>
> But, now that I think about it, all of the checks I've seen so far are
> for *booted* CPUs. While the lock (I assume) would keep new CPUs from
> booting, it doesn't do any good really since the "cpus_booted_once_mask"
> bits are only set and not cleared. A CPU doesn't un-become booted once.
>
> Again, we seem to have a long, verbose comment that says very little and
> only confuses me.

How about below:

"During initializing the TDX module, one step requires some SEAMCALL must be
done on all logical cpus enabled by BIOS, otherwise a later step will fail.
Disable CPU hotplug during the initialization process to prevent any CPU going
offline during initializing the TDX module. Note it is caller's responsibility
to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
are online."


>
> ...
> > > Why does this need both a tdx_detect() and a tdx_init()? Shouldn't the
> > > interface from outside just be "get TDX up and running, please?"
> >
> > We can have a single tdx_init(). However tdx_init() can be heavy, and having a
> > separate non-heavy tdx_detect() may be useful if caller wants to separate
> > "detecting the TDX module" and "initializing the TDX module", i.e. to do
> > something in the middle.
>
> <Sigh> So, this "design" went unmentioned, *and* I can't review if the
> actual callers of this need the functionality or not because they're not
> in this series.

I'll remove tdx_detect(). Currently KVM doesn't do anything between
tdx_detect() and tdx_init().

https://lore.kernel.org/lkml/[email protected]/T/#mc7d5bb37107131b65ca7142b418b3e17da36a9ca

>
> > However tdx_detect() basically only detects the P-SEAMLDR. If we move P-SEAMLDR
> > detection to tdx_init(), or we get rid of P-SEAMLDR completely, then we don't
> > need tdx_detect() anymore. We can expose seamrr_enabled() and TDX KeyID
> > variables or functions so the caller can use them to see whether it should do
> > TDX related stuff and then call tdx_init().
>
> I don't think you've made a strong case for why P-SEAMLDR detection is
> even necessary in this series.

Will remove P-SEAMLDR code and tdx_detect().

2022-04-28 10:06:19

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <[email protected]> wrote:
> [..]
> > > 3. Memory hotplug
> > >
> > > The first generation of TDX architecturally doesn't support memory
> > > hotplug. And the first generation of TDX-capable platforms don't support
> > > physical memory hotplug. Since it physically cannot happen, this series
> > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > >
> > > A special case of memory hotplug is adding NVDIMM as system RAM using
>
> Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
>
> > > kmem driver. However the first generation of TDX-capable platforms
> > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > happen either.
> >
> > What prevents this code from today's code being run on tomorrow's
> > platforms and breaking these assumptions?
>
> The assumption is already broken today with NVDIMM-N. The lack of
> DDR-T support on TDX enabled platforms has zero effect on DDR-based
> persistent memory solutions. In other words, please describe the
> actual software and hardware conflicts at play here, and do not make
> the mistake of assuming that "no DDR-T support on TDX platforms" ==
> "no NVDIMM support".

Sorry, I got this information from the planning team or execution team, I
guess. I was told NVDIMM and TDX cannot "co-exist" on the first generation of
TDX-capable machines; "co-exist" means they cannot be turned on simultaneously
on the same platform. I am also not aware of NVDIMM-N, nor of the difference
between DDR-based and DDR-T-based persistent memory. Could you give some more
background here so I can take a look?

>
> > > Another case is admin can use 'memmap' kernel command line to create
> > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > kmem driver to add them as system RAM. To avoid having to change memory
> > > hotplug code to prevent this from happening, this series always includes
> > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
>
> I am not sure what you are trying to say here?

We want to always make sure the memory managed by the page allocator is TDX
memory. So if the legacy PMEMs are unconditionally configured as TDX memory,
then we don't need to prevent them from being added as system memory via the
kmem driver.

>
> > > 4. CPU hotplug
> > >
> > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > either. Since this physically cannot happen, this series doesn't add any
> > > check in ACPI CPU hotplug code path to disable it.
>
> What are the actual challenges posed to TDX with respect to CPU hotplug?

During the TDX module initialization, there is a step to call SEAMCALL on all
logical cpus to initialize per-cpu TDX stuff. TDX doesn't support initializing
new hot-added CPUs after that initialization. There are MCHECK/BIOS changes to
enforce this check too, I guess, but I don't know the details.

>
> > > Also, only TDX module initialization requires all BIOS-enabled cpus are
>
> Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> /proc/cpuinfo for example.

It means the CPUs with the "enabled" bit set in the MADT table.


--
Thanks,
-Kai


2022-04-29 00:00:08

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On Thu, 2022-04-28 at 07:27 -0700, Dave Hansen wrote:
> On 4/27/22 17:00, Kai Huang wrote:
> > On Wed, 2022-04-27 at 07:49 -0700, Dave Hansen wrote:
> > I think we can use pr_info_once() when all_cpus_booted() returns false, and get
> > rid of printing "SEAMRR not enabled" in seamrr_enabled(). How about below?
> >
> > static bool seamrr_enabled(void)
> > {
> > 	if (!all_cpus_booted())
> > 		pr_info_once("Not all present CPUs have been booted. Report SEAMRR as not enabled.\n");
> >
> > 	return __seamrr_enabled();
> > }
> >
> > And we don't print "SEAMRR not enabled".
>
> That's better, but even better than that would be removing all that
> SEAMRR gunk in the first place.

Agreed.

> > > > > > + /*
> > > > > > + * TDX requires at least two KeyIDs: one global KeyID to
> > > > > > + * protect the metadata of the TDX module and one or more
> > > > > > + * KeyIDs to run TD guests.
> > > > > > + */
> > > > > > + return tdx_keyid_num >= 2;
> > > > > > +}
> > > > > > +
> > > > > > +static int __tdx_detect(void)
> > > > > > +{
> > > > > > + /* The TDX module is not loaded if SEAMRR is disabled */
> > > > > > + if (!seamrr_enabled()) {
> > > > > > + pr_info("SEAMRR not enabled.\n");
> > > > > > + goto no_tdx_module;
> > > > > > + }
> > > > >
> > > > > Why even bother with the SEAMRR stuff? It sounded like you can "ping"
> > > > > the module with SEAMCALL. Why not just use that directly?
> > > >
> > > > SEAMCALL will cause #GP if SEAMRR is not enabled. We should check whether
> > > > SEAMRR is enabled before making SEAMCALL.
> > >
> > > So... You could actually get rid of all this code. If SEAMCALL #GP's,
> > > then you say, "Whoops, the firmware didn't load the TDX module
> > > correctly, sorry."
> >
> > Yes we can just use the first SEAMCALL (TDH.SYS.INIT) to detect whether TDX
> > module is loaded. If SEAMCALL is successful, the module is loaded.
> >
> > One problem is that currently the patch to flush the cache for kexec() uses
> > seamrr_enabled() and tdx_keyid_sufficient() to determine whether we need to
> > flush the cache. The reason is that, similar to SME, the flush is done in
> > stop_this_cpu(), but the status of TDX module initialization is protected by a
> > mutex, so we cannot use the TDX module status in stop_this_cpu() to determine
> > whether to flush.
> >
> > If that patch makes sense, I think we still need to detect SEAMRR?
>
> Please go look at stop_this_cpu() closely. What are the AMD folks doing
> for SME exactly? Do they, for instance, do the WBINVD when the kernel
> used SME? No, they just use a pretty low-level check if the processor
> supports SME.
>
> Doing the same kind of thing for TDX is fine. You could check the MTRR
> MSR bits that tell you if SEAMRR is supported and then read the MSR
> directly. You could check the CPUID enumeration for MKTME or
> CPUID.B.0.EDX (I'm not even sure what this is but the SEAMCALL spec says
> it is part of SEAMCALL operation).

I am not sure about this CPUID either.

>
> Just like the SME test, it doesn't even need to be precise. It just
> needs to be 100% accurate in that it is *ALWAYS* set for any system that
> might have dirtied cache aliases.
>
> I'm not sure why you are so fixated on SEAMRR specifically for this.

I see. I think I can simply use the MTRR.SEAMRR bit check. If the CPU supports
SEAMRR, then basically it supports MKTME.

Does this look good to you?
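
For concreteness, a minimal sketch of that check, assuming SEAMRR support is
enumerated via bit 15 of IA32_MTRRCAP (the exact bit position is an assumption
here, not confirmed in the thread):

#define MTRR_CAP_SEAMRR	BIT(15)	/* assumed SEAMRR enumeration bit */

static bool seamrr_supported(void)
{
	u64 mtrr_cap;

	if (!boot_cpu_has(X86_FEATURE_MTRR))
		return false;

	rdmsrl(MSR_MTRRcap, mtrr_cap);
	return !!(mtrr_cap & MTRR_CAP_SEAMRR);
}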


>
>
> ...
> > "During initializing the TDX module, one step requires some SEAMCALL must be
> > done on all logical cpus enabled by BIOS, otherwise a later step will fail.
> > Disable CPU hotplug during the initialization process to prevent any CPU going
> > offline during initializing the TDX module. Note it is caller's responsibility
> > to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
> > are online."
>
> But, what if a CPU went offline just before this lock was taken? What
> if the caller make sure all present CPUs are online, makes the call,
> then a CPU is taken offline. The lock wouldn't do any good.
>
> What purpose does the lock serve?

I thought cpus_read_lock() can prevent any CPU from going offline, no?


--
Thanks,
-Kai


2022-04-29 07:33:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM

On 4/5/22 21:49, Kai Huang wrote:
> The kernel configures TDX usable memory regions to the TDX module via
> an array of "TD Memory Region" (TDMR).

One bit of language that's repeated in these changelogs that I don't
like is "configure ... to". I think that's a misuse of the word
configure. I'd say something more like:

The kernel configures TDX-usable memory regions by passing an
array of "TD Memory Regions" (TDMRs) to the TDX module.

Could you please take a look over this series and reword those?

> Each TDMR entry (TDMR_INFO)
> contains the information of the base/size of a memory region, the
> base/size of the associated Physical Address Metadata Table (PAMT) and
> a list of reserved areas in the region.
>
> Create a number of TDMRs according to the verified e820 RAM entries.
> As the first step only set up the base/size information for each TDMR.
>
> TDMR must be 1G aligned and the size must be in 1G granularity. This

^ Each

> implies that one TDMR could cover multiple e820 RAM entries. If a RAM
> entry spans the 1GB boundary and the former part is already covered by
> the previous TDMR, just create a new TDMR for the latter part.
>
> TDX only supports a limited number of TDMRs (currently 64). Abort the
> TDMR construction process when the number of TDMRs exceeds this
> limitation.

... and what does this *MEAN*? Is TDX disabled? Does it throw away the
RAM? Does it eat puppies?

> arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 138 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 6b0c51aaa7f2..82534e70df96 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -54,6 +54,18 @@
> ((u32)(((_keyid_part) & 0xffffffffull) + 1))
> #define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))
>
> +/* TDMR must be 1gb aligned */
> +#define TDMR_ALIGNMENT BIT_ULL(30)
> +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
> +
> +/* Align up and down the address to TDMR boundary */
> +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT)
> +
> +/* TDMR's start and end address */
> +#define TDMR_START(_tdmr) ((_tdmr)->base)
> +#define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size)

Make these 'static inline's please. #defines are only for constants or
things that can't use real functions.
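
The static inline form would presumably be:

/* TDMR's start and end address */
static inline u64 tdmr_start(struct tdmr_info *tdmr)
{
	return tdmr->base;
}

static inline u64 tdmr_end(struct tdmr_info *tdmr)
{
	return tdmr->base + tdmr->size;
}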

> /*
> * TDX module status during initialization
> */
> @@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
> return 0;
> }
>
> +/* The starting offset of reserved areas within TDMR_INFO */
> +#define TDMR_RSVD_START 64

^ extra whitespace

> +static struct tdmr_info *__alloc_tdmr(void)
> +{
> + int tdmr_sz;
> +
> + /*
> + * TDMR_INFO's actual size depends on maximum number of reserved
> + * areas that one TDMR supports.
> + */
> + tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
> + sizeof(struct tdmr_reserved_area);

You have a structure for this. I know this because it's the return type
of the function. You have TDMR_RSVD_START available via the structure
itself. So, derive that 64 either via:

sizeof(struct tdmr_info)

or,

offsetof(struct tdmr_info, reserved_areas);

Which would make things look like this:

tdmr_base_sz = sizeof(struct tdmr_info);
tdmr_reserved_area_sz = sizeof(struct tdmr_reserved_area) *
tdx_sysinfo.max_reserved_per_tdmr;

tdmr_sz = tdmr_base_sz + tdmr_reserved_area_sz;

Could you explain why on earth you felt the need for the TDMR_RSVD_START
#define?

> + /*
> + * TDX requires TDMR_INFO to be 512 aligned. Always align up

Again, 512 what? 512 pages? 512 hippos?

> + * TDMR_INFO size to 512 so the memory allocated via kzalloc()
> + * can meet the alignment requirement.
> + */
> + tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> +
> + return kzalloc(tdmr_sz, GFP_KERNEL);
> +}
> +
> +/* Create a new TDMR at given index in the TDMR array */
> +static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
> +{
> + struct tdmr_info *tdmr;
> +
> + if (WARN_ON_ONCE(tdmr_array[idx]))
> + return NULL;
> +
> + tdmr = __alloc_tdmr();
> + tdmr_array[idx] = tdmr;
> +
> + return tdmr;
> +}
> +
> static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> {
> int i;
> @@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> }
> }
>
> +/*
> + * Create TDMRs to cover all RAM entries in e820_table. The created
> + * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
> + * number of TDMRs. All entries in @tdmr_array must be initially NULL.
> + */
> +static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> +{
> + struct tdmr_info *tdmr;
> + u64 start, end;
> + int i, tdmr_idx;
> + int ret = 0;
> +
> + tdmr_idx = 0;
> + tdmr = alloc_tdmr(tdmr_array, 0);
> + if (!tdmr)
> + return -ENOMEM;
> + /*
> + * Loop over all RAM entries in e820 and create TDMRs to cover
> + * them. To keep it simple, always try to use one TDMR to cover
> + * one RAM entry.
> + */
> + e820_for_each_mem(i, start, end) {
> + start = TDMR_ALIGN_DOWN(start);
> + end = TDMR_ALIGN_UP(end);
^ vertically align those ='s, please.


> + /*
> + * If the current TDMR's size hasn't been initialized, it
> + * is a new allocated TDMR to cover the new RAM entry.
> + * Otherwise the current TDMR already covers the previous
> + * RAM entry. In the latter case, check whether the
> + * current RAM entry has been fully or partially covered
> + * by the current TDMR, since TDMR is 1G aligned.
> + */
> + if (tdmr->size) {
> + /*
> + * Loop to next RAM entry if the current entry
> + * is already fully covered by the current TDMR.
> + */
> + if (end <= TDMR_END(tdmr))
> + continue;

This loop is actually pretty well commented and looks OK. The
TDMR_END() construct even adds to readability. *BUT*, the

> + /*
> + * If part of current RAM entry has already been
> + * covered by current TDMR, skip the already
> + * covered part.
> + */
> + if (start < TDMR_END(tdmr))
> + start = TDMR_END(tdmr);
> +
> + /*
> + * Create a new TDMR to cover the current RAM
> + * entry, or the remaining part of it.
> + */
> + tdmr_idx++;
> + if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
> + ret = -E2BIG;
> + goto err;
> + }
> + tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
> + if (!tdmr) {
> + ret = -ENOMEM;
> + goto err;
> + }

This is a bit verbose for this loop. Why not just hide the 'max_tdmrs'
inside the alloc_tdmr() function? That will make this loop smaller and
easier to read.
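
Something along these lines, presumably (a sketch; note the caller would then
see NULL for both the too-many-TDMRs and out-of-memory cases):

static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
{
	struct tdmr_info *tdmr;

	/* Enforce the TDX limit on the number of TDMRs here */
	if (idx >= tdx_sysinfo.max_tdmrs)
		return NULL;

	if (WARN_ON_ONCE(tdmr_array[idx]))
		return NULL;

	tdmr = __alloc_tdmr();
	tdmr_array[idx] = tdmr;

	return tdmr;
}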

> + }
> +
> + tdmr->base = start;
> + tdmr->size = end - start;
> + }
> +
> + /* @tdmr_idx is always the index of last valid TDMR. */
> + *tdmr_num = tdmr_idx + 1;
> +
> + return 0;
> +err:
> + /*
> + * Clean up already allocated TDMRs in case of error. @tdmr_idx
> + * indicates the last TDMR that wasn't created successfully,
> + * therefore only needs to free @tdmr_idx TDMRs.
> + */
> + free_tdmrs(tdmr_array, tdmr_idx);
> + return ret;
> +}
> +
> static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> {
> int ret;
> @@ -834,8 +967,13 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> if (ret)
> goto err;
>
> + ret = create_tdmrs(tdmr_array, tdmr_num);
> + if (ret)
> + goto err;
> +
> /* Return -EFAULT until constructing TDMRs is done */
> ret = -EFAULT;
> + free_tdmrs(tdmr_array, *tdmr_num);
> err:
> return ret;
> }

2022-04-29 08:15:14

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM

On Thu, 2022-04-28 at 09:22 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > The kernel configures TDX usable memory regions to the TDX module via
> > an array of "TD Memory Region" (TDMR).
>
> One bit of language that's repeated in these changelogs that I don't
> like is "configure ... to". I think that's a misuse of the word
> configure. I'd say something more like:
>
> The kernel configures TDX-usable memory regions by passing an
> array of "TD Memory Regions" (TDMRs) to the TDX module.
>
> Could you please take a look over this series and reword those?

Thanks, will do.

>
> > Each TDMR entry (TDMR_INFO)
> > contains the information of the base/size of a memory region, the
> > base/size of the associated Physical Address Metadata Table (PAMT) and
> > a list of reserved areas in the region.
> >
> > Create a number of TDMRs according to the verified e820 RAM entries.
> > As the first step only set up the base/size information for each TDMR.
> >
> > TDMR must be 1G aligned and the size must be in 1G granularity. This
>
> ^ Each

OK.

>
> > implies that one TDMR could cover multiple e820 RAM entries. If a RAM
> > entry spans the 1GB boundary and the former part is already covered by
> > the previous TDMR, just create a new TDMR for the latter part.
> >
> > TDX only supports a limited number of TDMRs (currently 64). Abort the
> > TDMR construction process when the number of TDMRs exceeds this
> > limitation.
>
> ... and what does this *MEAN*? Is TDX disabled? Does it throw away the
> RAM? Does it eat puppies?

How about:

TDX only supports a limited number of TDMRs. Simply return error when
the number of TDMRs exceeds the limitation. TDX is disabled in this
case.

>
> > arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 138 insertions(+)
> >
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 6b0c51aaa7f2..82534e70df96 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -54,6 +54,18 @@
> > ((u32)(((_keyid_part) & 0xffffffffull) + 1))
> > #define TDX_KEYID_NUM(_keyid_part) ((u32)((_keyid_part) >> 32))
> >
> > +/* TDMR must be 1gb aligned */
> > +#define TDMR_ALIGNMENT BIT_ULL(30)
> > +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT)
> > +
> > +/* Align up and down the address to TDMR boundary */
> > +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> > +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT)
> > +
> > +/* TDMR's start and end address */
> > +#define TDMR_START(_tdmr) ((_tdmr)->base)
> > +#define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size)
>
> Make these 'static inline's please. #defines are only for constants or
> things that can't use real functions.

OK.
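
For reference, the 'static inline' versions would be direct transliterations
of the two macros, something like:

	static inline u64 tdmr_start(struct tdmr_info *tdmr)
	{
		return tdmr->base;
	}

	static inline u64 tdmr_end(struct tdmr_info *tdmr)
	{
		return tdmr->base + tdmr->size;
	}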

>
> > /*
> > * TDX module status during initialization
> > */
> > @@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
> > return 0;
> > }
> >
> > +/* The starting offset of reserved areas within TDMR_INFO */
> > +#define TDMR_RSVD_START 64
>
> ^ extra whitespace

Will remove.

>
> > +static struct tdmr_info *__alloc_tdmr(void)
> > +{
> > + int tdmr_sz;
> > +
> > + /*
> > + * TDMR_INFO's actual size depends on maximum number of reserved
> > + * areas that one TDMR supports.
> > + */
> > + tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
> > + sizeof(struct tdmr_reserved_area);
>
> You have a structure for this. I know this because it's the return type
> of the function. You have TDMR_RSVD_START available via the structure
> itself. So, derive that 64 either via:
>
> sizeof(struct tdmr_info)
>
> or,
>
> offsetof(struct tdmr_info, reserved_areas);
>
> Which would make things look like this:
>
> tdmr_base_sz = sizeof(struct tdmr_info);
> tdmr_reserved_area_sz = sizeof(struct tdmr_reserved_area) *
> tdx_sysinfo.max_reserved_per_tdmr;
>
> tdmr_sz = tdmr_base_sz + tdmr_reserved_area_sz;
>
> Could you explain why on earth you felt the need for the TDMR_RSVD_START
> #define?

Will use sizeof(struct tdmr_info). Thanks for the tip.
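
For context, sizeof() works here because of the TDMR_INFO layout defined
earlier in the series: eight u64 fields (64 bytes) followed by the flexible
reserved-area array, so sizeof(struct tdmr_info) equals
offsetof(struct tdmr_info, reserved_areas) equals 64. A sketch of that layout
(field names follow the series; the authoritative definition lives in an
earlier patch):

	struct tdmr_reserved_area {
		u64 offset;
		u64 size;
	} __packed;

	struct tdmr_info {
		u64 base;
		u64 size;
		u64 pamt_1g_base;
		u64 pamt_1g_size;
		u64 pamt_2m_base;
		u64 pamt_2m_size;
		u64 pamt_4k_base;
		u64 pamt_4k_size;
		/* 64 bytes to here; reserved areas follow immediately */
		struct tdmr_reserved_area reserved_areas[0];
	} __packed __aligned(TDMR_INFO_ALIGNMENT);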

>
> > + /*
> > + * TDX requires TDMR_INFO to be 512 aligned. Always align up
>
> Again, 512 what? 512 pages? 512 hippos?

Will change to 512-byte aligned.

>
> > + * TDMR_INFO size to 512 so the memory allocated via kzalloc()
> > + * can meet the alignment requirement.
> > + */
> > + tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> > +
> > + return kzalloc(tdmr_sz, GFP_KERNEL);
> > +}
> > +
> > +/* Create a new TDMR at given index in the TDMR array */
> > +static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
> > +{
> > + struct tdmr_info *tdmr;
> > +
> > + if (WARN_ON_ONCE(tdmr_array[idx]))
> > + return NULL;
> > +
> > + tdmr = __alloc_tdmr();
> > + tdmr_array[idx] = tdmr;
> > +
> > + return tdmr;
> > +}
> > +
> > static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> > {
> > int i;
> > @@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> > }
> > }
> >
> > +/*
> > + * Create TDMRs to cover all RAM entries in e820_table. The created
> > + * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
> > + * number of TDMRs. All entries in @tdmr_array must be initially NULL.
> > + */
> > +static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> > +{
> > + struct tdmr_info *tdmr;
> > + u64 start, end;
> > + int i, tdmr_idx;
> > + int ret = 0;
> > +
> > + tdmr_idx = 0;
> > + tdmr = alloc_tdmr(tdmr_array, 0);
> > + if (!tdmr)
> > + return -ENOMEM;
> > + /*
> > + * Loop over all RAM entries in e820 and create TDMRs to cover
> > + * them. To keep it simple, always try to use one TDMR to cover
> > + * one RAM entry.
> > + */
> > + e820_for_each_mem(i, start, end) {
> > + start = TDMR_ALIGN_DOWN(start);
> > + end = TDMR_ALIGN_UP(end);
> ^ vertically align those ='s, please.

OK.

>
>
> > + /*
> > + * If the current TDMR's size hasn't been initialized, it
> > + * is a newly allocated TDMR to cover the new RAM entry.
> > + * Otherwise the current TDMR already covers the previous
> > + * RAM entry. In the latter case, check whether the
> > + * current RAM entry has been fully or partially covered
> > + * by the current TDMR, since TDMR is 1G aligned.
> > + */
> > + if (tdmr->size) {
> > + /*
> > + * Loop to next RAM entry if the current entry
> > + * is already fully covered by the current TDMR.
> > + */
> > + if (end <= TDMR_END(tdmr))
> > + continue;
>
> This loop is actually pretty well commented and looks OK. The
> TDMR_END() construct even adds to readability. *BUT*, the
>
> > + /*
> > + * If part of current RAM entry has already been
> > + * covered by current TDMR, skip the already
> > + * covered part.
> > + */
> > + if (start < TDMR_END(tdmr))
> > + start = TDMR_END(tdmr);
> > +
> > + /*
> > + * Create a new TDMR to cover the current RAM
> > + * entry, or the remaining part of it.
> > + */
> > + tdmr_idx++;
> > + if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
> > + ret = -E2BIG;
> > + goto err;
> > + }
> > + tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
> > + if (!tdmr) {
> > + ret = -ENOMEM;
> > + goto err;
> > + }
>
> This is a bit verbose for this loop. Why not just hide the 'max_tdmrs'
> inside the alloc_tdmr() function? That will make this loop smaller and
> easier to read.

Based on the suggestion, I'll change to use alloc_pages_exact() to allocate all
TDMRs at once, so there's no need to allocate each TDMR separately here. I'll
remove alloc_tdmr() but keep the max_tdmrs check here.
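
A sketch of that direction, using the sizes computed above (alloc_tdmr_array()
and tdmr_info_size() are illustrative names, not the final patch;
tdmr_info_size() stands in for the sizeof()-based computation discussed
earlier):

	/* One zeroed, physically contiguous chunk for the whole TDMR array */
	static struct tdmr_info *alloc_tdmr_array(int *array_sz)
	{
		/*
		 * Pad each entry to TDMR_INFO_ALIGNMENT (512 bytes) so every
		 * TDMR_INFO in the array stays properly aligned; the chunk
		 * returned by alloc_pages_exact() is page aligned already.
		 */
		*array_sz = ALIGN(tdmr_info_size(), TDMR_INFO_ALIGNMENT) *
			    tdx_sysinfo.max_tdmrs;

		return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
	}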



--
Thanks,
-Kai


2022-04-29 09:46:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On 4/27/22 17:00, Kai Huang wrote:
> On Wed, 2022-04-27 at 07:49 -0700, Dave Hansen wrote:
> I think we can use pr_info_once() when all_cpus_booted() returns false, and get
> rid of printing "SEAMRR not enabled" in seamrr_enabled(). How about below?
>
> static bool seamrr_enabled(void)
> {
> 	if (!all_cpus_booted())
> 		pr_info_once("Not all present CPUs have been booted. Report SEAMRR as not enabled");
>
> 	return __seamrr_enabled();
> }
>
> And we don't print "SEAMRR not enabled".

That's better, but even better than that would be removing all that
SEAMRR gunk in the first place.

>>>>> + /*
>>>>> + * TDX requires at least two KeyIDs: one global KeyID to
>>>>> + * protect the metadata of the TDX module and one or more
>>>>> + * KeyIDs to run TD guests.
>>>>> + */
>>>>> + return tdx_keyid_num >= 2;
>>>>> +}
>>>>> +
>>>>> +static int __tdx_detect(void)
>>>>> +{
>>>>> + /* The TDX module is not loaded if SEAMRR is disabled */
>>>>> + if (!seamrr_enabled()) {
>>>>> + pr_info("SEAMRR not enabled.\n");
>>>>> + goto no_tdx_module;
>>>>> + }
>>>>
>>>> Why even bother with the SEAMRR stuff? It sounded like you can "ping"
>>>> the module with SEAMCALL. Why not just use that directly?
>>>
>>> SEAMCALL will cause #GP if SEAMRR is not enabled. We should check whether
>>> SEAMRR is enabled before making SEAMCALL.
>>
>> So... You could actually get rid of all this code. if SEAMCALL #GP's,
>> then you say, "Whoops, the firmware didn't load the TDX module
>> correctly, sorry."
>
> Yes we can just use the first SEAMCALL (TDH.SYS.INIT) to detect whether TDX
> module is loaded. If SEAMCALL is successful, the module is loaded.
>
> One problem is currently the patch to flush cache for kexec() uses
> seamrr_enabled() and tdx_keyid_sufficient() to determine whether we need to
> flush the cache. The reason is, similar to SME, the flush is done in
> stop_this_cpu(), but the status of TDX module initialization is protected by
> mutex, so we cannot use TDX module status in stop_this_cpu() to determine
> whether to flush.
>
> If that patch makes sense, I think we still need to detect SEAMRR?

Please go look at stop_this_cpu() closely. What are the AMD folks doing
for SME exactly? Do they, for instance, only do the WBINVD when the kernel
used SME? No, they just use a pretty low-level check of whether the processor
supports SME.

Doing the same kind of thing for TDX is fine. You could check the MTRR
MSR bits that tell you if SEAMRR is supported and then read the MSR
directly. You could check the CPUID enumeration for MKTME or
CPUID.B.0.EDX (I'm not even sure what this is but the SEAMCALL spec says
it is part of SEAMCALL operation).

Just like the SME test, it doesn't even need to be precise. It just
needs to be 100% accurate in that it is *ALWAYS* set for any system that
might have dirtied cache aliases.

I'm not sure why you are so fixated on SEAMRR specifically for this.
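
For comparison, a minimal sketch of that kind of low-level check, modeled on
the SME test in stop_this_cpu(); the SEAMRR enumeration bit (MTRRcap bit 15)
follows this patch series and should be treated as an assumption here, not
settled ABI:

	static bool tdx_may_have_dirty_cache_aliases(void)
	{
		u64 mtrrcap;

		/* Imprecise, but always set on machines that can run SEAM */
		if (rdmsrl_safe(MSR_MTRRcap, &mtrrcap))
			return false;

		return !!(mtrrcap & BIT(15));	/* SEAMRR supported */
	}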


...
> "During initializing the TDX module, one step requires some SEAMCALL must be
> done on all logical cpus enabled by BIOS, otherwise a later step will fail.
> Disable CPU hotplug during the initialization process to prevent any CPU going
> offline during initializing the TDX module. Note it is caller's responsibility
> to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
> are online."

But, what if a CPU went offline just before this lock was taken? What
if the caller makes sure all present CPUs are online, makes the call,
then a CPU is taken offline. The lock wouldn't do any good.

What purpose does the lock serve?

2022-04-29 10:20:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On 4/28/22 17:11, Kai Huang wrote:
> This is true. So I think w/o taking the lock is also fine, as the TDX module
> initialization is a state machine. If any cpu goes offline during logical-cpu
> level initialization and TDH.SYS.LP.INIT isn't done on that cpu, then later the
> TDH.SYS.CONFIG will fail. Similarly, if any cpu going offline causes
> TDH.SYS.KEY.CONFIG to not be done for any package, then TDH.SYS.TDMR.INIT will
> fail.

Right. The worst-case scenario, if someone is mucking around with CPU
hotplug during TDX initialization, is that TDX initialization will fail.

We *can* fix some of this at least and provide coherent error messages
with a pattern like this:

	cpus_read_lock();
	// check that all MADT-enumerated CPUs are online
	tdx_init();
	cpus_read_unlock();

That, of course, *does* prevent CPUs from going offline during
tdx_init(). It also provides a nice place for an error message:

pr_warn("You offlined a CPU then want to use TDX? Sod off.\n");

> A problem (I realized it exists in current implementation too) is shutting down
> the TDX module, which requires calling TDH.SYS.LP.SHUTDOWN on all BIOS-enabled
> cpus. Kernel can do this SEAMCALL at most for all present cpus. However when
> any cpu is offline, this SEAMCALL won't be called on it, and it seems we need to
> add a new CPU hotplug callback to call this SEAMCALL when the cpu is online again.

Hold on a sec. If you call TDH.SYS.LP.SHUTDOWN on any CPU, then TDX
stops working everywhere, right? But, if someone offlines one CPU, we
don't want TDX to stop working everywhere.

2022-04-29 10:43:01

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 4/5/22 21:49, Kai Huang wrote:
> In order to provide crypto protection to guests, the TDX module uses
> additional metadata to record things like which guest "owns" a given
> page of memory. This metadata, referred to as Physical Address Metadata
> Table (PAMT), essentially serves as the 'struct page' for the TDX
> module. PAMTs are not reserved by hardware upfront. They must be
> allocated by the kernel and then given to the TDX module.
>
> TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.

s/respectively//

> Each PAMT must be a physically contiguous area from the Convertible

^ s/the/a/

> Memory Regions (CMR). However, the PAMTs which track pages in one TDMR
> do not need to reside within that TDMR but can be anywhere in CMRs.
> If one PAMT overlaps with any TDMR, the overlapping part must be
> reported as a reserved area in that particular TDMR.
>
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).

This is also a good place to note the downsides of using
alloc_contig_pages().

> The current version of TDX supports at most 16 reserved areas per TDMR
> to cover both PAMTs and potential memory holes within the TDMR. If many
> PAMTs are allocated within a single TDMR, 16 reserved areas may not be
> sufficient to cover all of them.
>
> Adopt the following policies when allocating PAMTs for a given TDMR:
>
> - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> the total number of reserved areas consumed for PAMTs.
> - Try to first allocate PAMT from the local node of the TDMR for better
> NUMA locality.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 166 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7414625b938f..ff68d0829bd7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
> depends on CPU_SUP_INTEL
> depends on X86_64
> select NUMA_KEEP_MEMINFO if NUMA
> + depends on CONTIG_ALLOC
> help
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. This option enables necessary TDX
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 82534e70df96..1b807dcbc101 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -21,6 +21,7 @@
> #include <asm/cpufeatures.h>
> #include <asm/virtext.h>
> #include <asm/e820/api.h>
> +#include <asm/pgtable.h>
> #include <asm/tdx.h>
> #include "tdx.h"
>
> @@ -66,6 +67,16 @@
> #define TDMR_START(_tdmr) ((_tdmr)->base)
> #define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size)
>
> +/* Page sizes supported by TDX */
> +enum tdx_page_sz {
> + TDX_PG_4K = 0,
> + TDX_PG_2M,
> + TDX_PG_1G,
> + TDX_PG_MAX,
> +};

Is that =0 required? I thought the first enum was defined to be 0.

> +#define TDX_HPAGE_SHIFT 9
> +
> /*
> * TDX module status during initialization
> */
> @@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> return ret;
> }
>
> +/* Calculate PAMT size given a TDMR and a page size */
> +static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> + enum tdx_page_sz pgsz)
> +{
> + unsigned long pamt_sz;
> +
> + pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
> + tdx_sysinfo.pamt_entry_size;

That 'pgsz' thing is just hideous. I'd *much* rather see something like
this:

static int tdx_page_size_shift(enum tdx_page_sz page_sz)
{
	switch (page_sz) {
	case TDX_PG_4K:
		return PAGE_SHIFT;
	...
	}
}

That's easy to figure out what's going on.
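
Completed, that sketch might look like this (the 2M/1G arms and the -EINVAL
default are additions, not from the original mail):

	static int tdx_page_size_shift(enum tdx_page_sz page_sz)
	{
		switch (page_sz) {
		case TDX_PG_4K:
			return PAGE_SHIFT;
		case TDX_PG_2M:
			return PAGE_SHIFT + 9;		/* 2MB */
		case TDX_PG_1G:
			return PAGE_SHIFT + 18;		/* 1GB */
		default:
			return -EINVAL;
		}
	}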

> + /* PAMT size must be 4K aligned */
> + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> + return pamt_sz;
> +}
> +
> +/* Calculate the size of all PAMTs for a TDMR */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
> +{
> + enum tdx_page_sz pgsz;
> + unsigned long pamt_sz;
> +
> + pamt_sz = 0;
> + for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
> + pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
> +
> + return pamt_sz;
> +}

But, there are 3 separate pointers pointing to 3 separate PAMTs. Why do
they all have to be contiguously allocated?
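
For a sense of scale, assuming the 16-byte PAMT entry size the first TDX
module reports: a 1GB TDMR needs (2^30 >> 12) * 16 = 4MB for the 4K PAMT,
(2^30 >> 21) * 16 = 8KB for the 2M PAMT, and a single 16-byte entry (rounded
up to 4KB) for the 1G PAMT, i.e. roughly the "~1/256th" from the changelog,
dominated by the 4K table.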

> +/*
> + * Locate the NUMA node containing the start of the given TDMR's first
> + * RAM entry. The given TDMR may also cover memory in other NUMA nodes.
> + */

Please add a sentence or two on the implications here of what this means
when it happens. Also, the joining of e820 regions seems like it might
span NUMA nodes. What prevents that code from just creating one large
e820 area that leads to one large TDMR and horrible NUMA affinity for
these structures?

> +static int tdmr_get_nid(struct tdmr_info *tdmr)
> +{
> + u64 start, end;
> + int i;
> +
> + /* Find the first RAM entry covered by the TDMR */
> + e820_for_each_mem(i, start, end)
> + if (end > TDMR_START(tdmr))
> + break;

Brackets around the big loop, please.

> + /*
> + * One TDMR must cover at least one (or partial) RAM entry,
> > + * otherwise it is a kernel bug. WARN_ON() in this case.
> + */
> + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> + return 0;
> +
> + /*
> + * The first RAM entry may be partially covered by the previous
> + * TDMR. In this case, use TDMR's start to find the NUMA node.
> + */
> + if (start < TDMR_START(tdmr))
> + start = TDMR_START(tdmr);
> +
> + return phys_to_target_node(start);
> +}
> +
> +static int tdmr_setup_pamt(struct tdmr_info *tdmr)
> +{
> + unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
> + unsigned long pamt_sz[TDX_PG_MAX];
> + unsigned long pamt_npages;
> + struct page *pamt;
> + enum tdx_page_sz pgsz;
> + int nid;

Sooooooooooooooooooo close to reverse Christmas tree, but no cigar.
Please fix it.

> + /*
> + * Allocate one chunk of physically contiguous memory for all
> + * PAMTs. This helps minimize the PAMT's use of reserved areas
> + * in overlapped TDMRs.
> + */

Ahh, this explains it. Considering that tdmr_get_pamt_sz() is really
just two lines of code, I'd probably just remove the helper and open-code it
here. Then you only have one place to comment on it.

> + nid = tdmr_get_nid(tdmr);
> + pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
> + pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
> + &node_online_map);
> + if (!pamt)
> + return -ENOMEM;
> +
> + /* Calculate PAMT base and size for all supported page sizes. */
> + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> + for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> + unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
> +
> + pamt_base[pgsz] = tdmr_pamt_base;
> + pamt_sz[pgsz] = sz;
> +
> + tdmr_pamt_base += sz;
> + }
> +
> + tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> + tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
> + tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> + tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
> + tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> + tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];

This would all vertically align nicely if you renamed pamt_sz -> pamt_size.

> + return 0;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> + unsigned long pamt_pfn, pamt_sz;
> +
> + pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;

Comment, please:

/*
* The PAMT was allocated in one contiguous unit. The 4k PAMT
* should always point to the beginning of that allocation.
*/

> + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> + /* Do nothing if PAMT hasn't been allocated for this TDMR */
> + if (!pamt_sz)
> + return;
> +
> + if (WARN_ON(!pamt_pfn))
> + return;
> +
> + free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> +{
> + int i;
> +
> + for (i = 0; i < tdmr_num; i++)
> + tdmr_free_pamt(tdmr_array[i]);
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)

"set_up", please, not "setup".

> +{
> + int i, ret;
> +
> + for (i = 0; i < tdmr_num; i++) {
> + ret = tdmr_setup_pamt(tdmr_array[i]);
> + if (ret)
> + goto err;
> + }
> +
> + return 0;
> +err:
> + tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> + return -ENOMEM;
> +}
> +
> static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> {
> int ret;
> @@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> if (ret)
> goto err;
>
> + ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
> + if (ret)
> + goto err_free_tdmrs;
> +
> /* Return -EFAULT until constructing TDMRs is done */
> ret = -EFAULT;
> + tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
> +err_free_tdmrs:
> free_tdmrs(tdmr_array, *tdmr_num);
> err:
> return ret;
> @@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
> * initialization are done.
> */
> ret = -EFAULT;
> + /*
> + * Free PAMTs allocated in construct_tdmrs() when TDX module
> + * initialization fails.
> + */
> + if (ret)
> + tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> out_free_tdmrs:
> /*
> * TDMRs are only used during initializing TDX module. Always

In a follow-on patch, I'd like this to dump out (in a pr_debug() or
pr_info()) how much memory is consumed by PAMT allocations.
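
Sketch of such a printout (tdmrs_count_pamt_pages() is a hypothetical helper
summing pamt_*_size across all TDMRs):

	pr_info("%lu pages allocated for PAMT for all TDMRs.\n",
		tdmrs_count_pamt_pages(tdmr_array, tdmr_num));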

2022-04-29 14:15:02

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, 2022-04-28 at 19:58 -0700, Dan Williams wrote:
> On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <[email protected]> wrote:
> >
> > On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> > > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <[email protected]> wrote:
> > > [..]
> > > > > 3. Memory hotplug
> > > > >
> > > > > The first generation of TDX architecturally doesn't support memory
> > > > > hotplug. And the first generation of TDX-capable platforms don't support
> > > > > physical memory hotplug. Since it physically cannot happen, this series
> > > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > > >
> > > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> > >
> > > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> > >
> > > > > kmem driver. However the first generation of TDX-capable platforms
> > > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > > happen either.
> > > >
> > > > What prevents this code from today's code being run on tomorrow's
> > > > platforms and breaking these assumptions?
> > >
> > > The assumption is already broken today with NVDIMM-N. The lack of
> > > DDR-T support on TDX enabled platforms has zero effect on DDR-based
> > > persistent memory solutions. In other words, please describe the
> > > actual software and hardware conflicts at play here, and do not make
> > > the mistake of assuming that "no DDR-T support on TDX platforms" ==
> > > "no NVDIMM support".
> >
> > Sorry, I got this information from the planning team or execution team, I
> > guess. I was told NVDIMM and TDX cannot "co-exist" on the first generation of
> > TDX-capable machines: "co-exist" means they cannot be turned on simultaneously
> > on the same platform. I am also not aware of NVDIMM-N, nor of the difference
> > between DDR-based and DDR-T-based persistent memory. Could you give some more
> > background here so I can take a look?
>
> My rough understanding is that TDX makes use of metadata communicated
> "on the wire" for DDR, but that infrastructure is not there for DDR-T.
> However, there are plenty of DDR based NVDIMMs that use super-caps /
> batteries and flash to save contents. I believe the concern for TDX is
> that the kernel needs to know not use TDX accepted PMEM as PMEM
> because the contents saved by the DIMM's onboard energy source are
> unreadable outside of a TD.
>
> Here is one of the links that comes up in a search for NVDIMM-N.
>
> https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019

Thanks for the info. I need some more time to digest those different types of
DDRs and NVDIMMs. However I guess they are not quite relevant since TDX has a
concept of "Convertible Memory Region". Please see below.

>
> >
> > >
> > > > > Another case is admin can use 'memmap' kernel command line to create
> > > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > > kmem driver to add them as system RAM. To avoid having to change memory
> > > > > hotplug code to prevent this from happening, this series always include
> > > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > >
> > > I am not sure what you are trying to say here?
> >
> > We want to always make sure the memory managed by page allocator is TDX memory.
>
> That only seems possible if the kernel is given a TDX capable physical
> address map at the beginning of time.

Yes TDX architecture has a concept "Convertible Memory Region" (CMR). The memory
used by TDX must be convertible memory. BIOS generates an array of CMR entries
during boot and they are verified by MCHECK. CMRs are static during machine's
runtime.

>
> > So if the legacy PMEMs are unconditionally configured as TDX memory, then we
> > don't need to prevent them from being added as system memory via kmem driver.
>
> I think that is too narrow of a focus.
>
> Does a memory map exist for the physical address ranges that are TDX
> capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is
> ambiguous beyond the point of utility across the industry's entire
> range of confidential computing memory capabilities.
>
> One strawman would be an ACPI table with contents like:
>
> struct acpi_protected_memory {
> struct range range;
> uuid_t platform_mem_crypto_capability;
> };
>
> With some way to map those uuids to a set of platform vendor specific
> constraints and specifications. Some would be shared across
> confidential computing vendors, some might be unique. Otherwise, I do
> not see how you enforce the expectation of "all memory in the page
> allocator is TDX capable". 
>

Please see above. TDX has CMR.

> The other alternative is that *none* of the
> memory in the page allocator is TDX capable and a special memory
> allocation device is used to map memory for TDs. In either case a map
> of all possible TDX memory is needed and the discussion above seems
> like an incomplete / "hopeful" proposal about the memory dax_kmem, or
> other sources, might online. 

Yes we are also developing a new memfd based approach to support TD guest
memory. Please see my another reply to you.


> See the CXL CEDT CFMWS (CXL Fixed Memory
> Window Structure) as an example of an ACPI table that sets the
> kernel's expectations about how a physical address range might be
> used.
>
> https://www.computeexpresslink.org/spec-landing

Thanks for the info. I'll take a look to get some background.

>
> >
> > >
> > > > > 4. CPU hotplug
> > > > >
> > > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > > either. Since this physically cannot happen, this series doesn't add any
> > > > > check in ACPI CPU hotplug code path to disable it.
> > >
> > > What are the actual challenges posed to TDX with respect to CPU hotplug?
> >
> > During the TDX module initialization, there is a step to call SEAMCALL on all
> > logical cpus to initialize per-cpu TDX stuff. TDX doesn't support initializing
> > new hot-added CPUs after the initialization. There are MCHECK/BIOS changes
> > to enforce this check too, I guess, but I don't know the details.
>
> Is there an ACPI table that indicates CPU-x passed the check? Or since
> the BIOS is invoked in the CPU-online path, is it trusted to suppress
> those events for CPUs outside of the mcheck domain?

No, the TDX module (and the P-SEAMLDR) internally maintains some data to record
the total number of LPs and packages, and which logical cpu has been
initialized, etc.

I asked Intel guys whether BIOS would suppress an ACPI CPU hotplug event but I
never got a concrete answer. I'll try again.

>
> > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > >
> > > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> > > /proc/cpuinfo for example.
> >
> > It means the CPUs with "enable" bit set in the MADT table.
>
> That just indicates the present CPUs, and then a hot-add event
> changes the state of now-present CPUs to enabled. Per the above, is the
> BIOS responsible for rejecting those new CPUs, or is the kernel?

I'll ask BIOS guys again to see whether BIOS will suppress ACPI CPU hotplug
event. But I think we can have a simple patch to reject ACPI CPU hotplug if
platform is TDX-capable?

Or do you think we don't need to explicitly reject ACPI CPU hotplug if we can
confirm with BIOS guys that it will suppress on TDX capable machine?

--
Thanks,
-Kai


2022-04-29 14:30:35

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, 2022-04-28 at 20:04 -0700, Dan Williams wrote:
> On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <[email protected]> wrote:
> >
> > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > >
> > > > > I thought we could document this in the documentation saying that this code can
> > > > > only work on TDX machines that don't have above capabilities (SPR for now). We
> > > > > can change the code and the documentation when we add the support of those
> > > > > features in the future, and update the documentation.
> > > > >
> > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > machine support those features.
> > > > >
> > > > > I'll think about design solutions if above doesn't look good for you.
> > > >
> > > > No, it doesn't look good to me.
> > > >
> > > > You can't just say:
> > > >
> > > > /*
> > > > * This code will eat puppies if used on systems with hotplug.
> > > > */
> > > >
> > > > and merrily await the puppy bloodbath.
> > > >
> > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > safe, controlled way.
> > > >
> > > > > > You can't just ignore the problems because they're not present on one
> > > > > > version of the hardware.
> > > >
> > > > Please, please read this again ^^
> > >
> > > OK. I'll think about solutions and come back later.
> > > >
> >
> > Hi Dave,
> >
> > I think we have two approaches to handle memory hotplug interaction with the TDX
> > module initialization.
> >
> > The first approach is simple. We just block memory from being added as system
> > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > can add some arch-specific-check to __add_memory_resource() and reject the new
> > memory resource if platform supports TDX. __add_memory_resource() is called by
> > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > NVDIMM as system RAM and normal ACPI memory hotplug [2].
>
> What if the memory being added *is* TDX capable? What if someone
> wanted to manage a memory range as soft-reserved and move it back and
> forth from the core-mm to device access. That should be perfectly
> acceptable as long as the memory is TDX capable.

Please see below.

>
> > The second approach is relatively more complicated. Instead of directly
> > rejecting the new memory resource in __add_memory_resource(), we check whether
> > the memory resource can be added based on CMR and the TDX module initialization
> > status. This is feasible as with the latest public P-SEAMLDR spec, we can get
> > CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info
> > during kernel boots. And in __add_memory_resource() we do below check:
> >
> > 	tdx_init_disable();	/* similar to cpu_hotplug_disable() */
> > 	if (tdx_module_initialized())
> > 		// reject memory hotplug
> > 	else if (new_memory_resource NOT in CMRs)
> > 		// reject memory hotplug
> > 	else
> > 		allow memory hotplug
> > 	tdx_init_enable();	/* similar to cpu_hotplug_enable() */
> >
> > tdx_init_disable() temporarily disables TDX module initialization by trying to
> > grab the mutex. If the TDX module initialization is already on going, then it
> > waits until it completes.
> >
> > This should work better for future platforms, but would requires non-trivial
> > more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> > CMR using SEAMCALL. A side advantage is with VMXON in core-kernel we can
> > shutdown the TDX module in kexec().
> >
> > But for this series I think the second approach is overkill and we can choose to
> > use the first simple approach?
>
> This still sounds like it is trying to solve symptoms and not the root
> problem. Why must the core-mm never have non-TDX memory when VMs are
> fine to operate with either core-mm pages or memory from other sources
> like hugetlbfs and device-dax?

Basically we don't want to modify page allocator API to distinguish TDX and non-
TDX allocation. For instance, we don't want a new GFP_TDX.

There's another series by Chao, "KVM: mm: fd-based approach for supporting
KVM guest private memory", which essentially allows KVM to ask the guest memory
backend to allocate pages w/o having to mmap() them into userspace.

https://lore.kernel.org/kvm/[email protected]/

More specifically, memfd will support a new MFD_INACCESSIBLE flag when it is
created so all pages associated with this memfd will be TDX capable memory. The
backend will need to implement a new memfile_notifier_ops to allow KVM to get
and put the memory page.

struct memfile_pfn_ops {
	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
	void (*put_unlock_pfn)(unsigned long pfn);
};

With that, it is backend's responsibility to implement get_lock_pfn() callback
in which the backend needs to ensure a TDX private page is allocated.

For TD guests, KVM should enforce using only those fd-based backends. I am not
sure whether anonymous pages should be supported anymore.

Sean, please correct me if I am wrong?

Currently only shmem is extended to support it. By ensuring pages in the page
allocator are all TDX memory, shmem can be extended easily to support TD guests.
 
If device-dax and hugetlbfs want to support TD guests then they should implement
those callbacks and ensure only TDX memory is allocated. For instance, when a
future TDX supports NVDIMM (i.e. NVDIMM is included in the CMRs), device-dax
pages can be included as TDX memory when initializing the TDX module, and
device-dax can implement its own callbacks to support allocating pages for TD
guests.

But TDX architecture can be changed to support memory hotplug in a more graceful
way in the future. For instance, it can choose to support dynamically adding
any convertible memory as TDX memory *after* TDX module initialization. But
this is just my brainstorming.

Anyway, for now, since only shmem (or + anonymous pages) can be used to create
TD guests, I think we can just reject any memory hot-add when the platform
supports TDX, as described in the first simple approach. Eventually we may need
something
like the second approach but TDX architecture can evolve too.


--
Thanks,
-Kai


2022-04-29 14:39:06

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <[email protected]> wrote:
>
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > >
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now). We
> > > > can change the code and the documentation when we add the support of those
> > > > features in the future, and update the documentation.
> > > >
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > >
> > > > I'll think about design solutions if above doesn't look good for you.
> > >
> > > No, it doesn't look good to me.
> > >
> > > You can't just say:
> > >
> > > /*
> > > * This code will eat puppies if used on systems with hotplug.
> > > */
> > >
> > > and merrily await the puppy bloodbath.
> > >
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > >
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > >
> > > Please, please read this again ^^
> >
> > OK. I'll think about solutions and come back later.
> > >
>
> Hi Dave,
>
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.
>
> The first approach is simple. We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX. __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

What if the memory being added *is* TDX capable? What if someone
wanted to manage a memory range as soft-reserved and move it back and
forth from the core-mm to device access. That should be perfectly
acceptable as long as the memory is TDX capable.

> The second approach is relatively more complicated. Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status. This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info
> during kernel boots. And in __add_memory_resource() we do below check:
>
> 	tdx_init_disable();	/* similar to cpu_hotplug_disable() */
> 	if (tdx_module_initialized())
> 		// reject memory hotplug
> 	else if (new_memory_resource NOT in CMRs)
> 		// reject memory hotplug
> 	else
> 		allow memory hotplug
> 	tdx_init_enable();	/* similar to cpu_hotplug_enable() */
>
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex. If the TDX module initialization is already on going, then it
> waits until it completes.
>
> This should work better for future platforms, but would requires non-trivial
> more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using SEAMCALL. A side advantage is with VMXON in core-kernel we can
> shutdown the TDX module in kexec().
>
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?

This still sounds like it is trying to solve symptoms and not the root
problem. Why must the core-mm never have non-TDX memory when VMs are
fine to operate with either core-mm pages or memory from other sources
like hugetlbfs and device-dax?

2022-04-29 15:17:33

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On Thu, 2022-04-28 at 16:53 -0700, Dave Hansen wrote:
> On 4/28/22 16:44, Kai Huang wrote:
> > > Just like the SME test, it doesn't even need to be precise. It just
> > > needs to be 100% accurate in that it is *ALWAYS* set for any system that
> > > might have dirtied cache aliases.
> > >
> > > I'm not sure why you are so fixated on SEAMRR specifically for this.
> > I see. I think I can simply use the MTRR.SEAMRR bit check. If the CPU
> > supports SEAMRR, then basically it supports MKTME.
> >
> > Does this look good to you?
>
> Sure, fine, as long as it comes with a coherent description that
> explains why the check is good enough.
>
> > > > "During initializing the TDX module, one step requires some SEAMCALL must be
> > > > done on all logical cpus enabled by BIOS, otherwise a later step will fail.
> > > > Disable CPU hotplug during the initialization process to prevent any CPU going
> > > > offline during initializing the TDX module. Note it is caller's responsibility
> > > > to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
> > > > are online."
> > > But, what if a CPU went offline just before this lock was taken? What
> > > if the caller makes sure all present CPUs are online, makes the call,
> > > then a CPU is taken offline. The lock wouldn't do any good.
> > >
> > > What purpose does the lock serve?
> > I thought cpus_read_lock() can prevent any CPU from going offline, no?
>
> It doesn't prevent squat before the lock is taken, though.

This is true. So I think w/o taking the lock is also fine, as the TDX module
initialization is a state machine. If any cpu goes offline during logical-cpu
level initialization and TDH.SYS.LP.INIT isn't done on that cpu, then later the
TDH.SYS.CONFIG will fail. Similarly, if any cpu going offline causes
TDH.SYS.KEY.CONFIG to not be done for any package, then TDH.SYS.TDMR.INIT will
fail.

A problem (I realized it exists in current implementation too) is shutting down
the TDX module, which requires calling TDH.SYS.LP.SHUTDOWN on all BIOS-enabled
cpus. Kernel can do this SEAMCALL at most for all present cpus. However when
any cpu is offline, this SEAMCALL won't be called on it, and it seems we need to
add a new CPU hotplug callback to call this SEAMCALL when the cpu is online again.

Any suggestion? Thanks!


--
Thanks,
-Kai


2022-04-29 19:20:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On 4/28/22 16:44, Kai Huang wrote:
>> Just like the SME test, it doesn't even need to be precise. It just
>> needs to be 100% accurate in that it is *ALWAYS* set for any system that
>> might have dirtied cache aliases.
>>
>> I'm not sure why you are so fixated on SEAMRR specifically for this.
> I see. I think I can simply use the MTRR.SEAMRR bit check. If the CPU
> supports SEAMRR, then basically it supports MKTME.
>
> Does this look good to you?

Sure, fine, as long as it comes with a coherent description that
explains why the check is good enough.

>>> "During initializing the TDX module, one step requires some SEAMCALL must be
>>> done on all logical cpus enabled by BIOS, otherwise a later step will fail.
>>> Disable CPU hotplug during the initialization process to prevent any CPU going
>>> offline during initializing the TDX module. Note it is caller's responsibility
>>> to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
>>> are online."
>> But, what if a CPU went offline just before this lock was taken? What
>> if the caller makes sure all present CPUs are online, makes the call,
>> then a CPU is taken offline. The lock wouldn't do any good.
>>
>> What purpose does the lock serve?
> I thought cpus_read_lock() can prevent any CPU from going offline, no?

It doesn't prevent squat before the lock is taken, though.

2022-04-29 21:32:05

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, Apr 29, 2022 at 7:39 AM Dave Hansen <[email protected]> wrote:
>
> On 4/28/22 19:58, Dan Williams wrote:
> > That only seems possible if the kernel is given a TDX capable physical
> > address map at the beginning of time.
>
> TDX actually brings along its own memory map. The "EAS"[1] has a lot
> of info on it, if you know where to find it. Here's the relevant chunk:
>
> CMR - Convertible Memory Range -
> A range of physical memory configured by BIOS and verified by
> MCHECK. MCHECK verification is intended to help ensure that a
> CMR may be used to hold TDX memory pages encrypted with a
> private HKID.
>
> So, the BIOS has the platform knowledge to enumerate this range. It
> stashes the information off somewhere that the TDX module can find it.
> Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the
> TDX module and gets the list of CMRs.
>
> The OS then has to reconcile this CMR "memory map" against the regular
> old BIOS-provided memory map, tossing out any memory regions which are
> RAM, but not covered by a CMR, or disabling TDX entirely.
>
> Fun, eh?

Yes, I want to challenge the idea that all core-mm memory must be TDX
capable. Instead, this feels more like something that wants a
hugetlbfs / dax-device like capability to ask the kernel to gather /
set-aside the enumerated TDX memory out of all the general purpose
memory it knows about and then VMs use that ABI to get access to
convertible memory. Trying to ensure that all page allocator memory is
TDX capable feels too restrictive with all the different ways pfns can
get into the allocator.

2022-04-29 21:43:30

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Fri, Apr 29, 2022, Dave Hansen wrote:
> On 4/29/22 00:46, Kai Huang wrote:
> > On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> >> This is also a good place to note the downsides of using
> >> alloc_contig_pages().
> >
> > For instance:
> >
> > The allocation may fail when memory usage is under pressure.
>
> It's not really memory pressure, though. The larger the allocation, the
> more likely it is to fail. The more likely it is that the kernel can't
> free the memory or that if you need 1GB of contiguous memory that
> 999.996MB gets freed, but there is one stubborn page left.
>
> alloc_contig_pages() can and will fail. The only mitigation which is
> guaranteed to avoid this is doing the allocation at boot. But, you're
> not doing that to avoid wasting memory on every TDX system that doesn't
> use TDX.
>
> A *good* way (although not foolproof) is to launch a TDX VM early in
> boot before memory gets fragmented or consumed. You might even want to
> recommend this in the documentation.

What about providing a kernel param to tell the kernel to do the allocation during
boot? Or maybe a sysfs knob to reserve/free the memory, a la nr_overcommit_hugepages?

I suspect that most/all deployments that actually want to use TDX would much prefer
to eat the overhead if TDX VMs are never scheduled on the host, as opposed to having
to deal with a host in a TDX pool not actually being able to run TDX VMs.

2022-04-30 14:07:51

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <[email protected]> wrote:
>
> On 4/29/22 10:48, Dan Williams wrote:
> >> But, neither of those really help with, say, a device-DAX mapping of
> >> TDX-*IN*capable memory handed to KVM. The "new syscall" would just
> >> throw up its hands and leave users with the same result: TDX can't be
> >> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> >> device-DAX because they don't respect the NUMA policy ABI.
> > They do have "target_node" attributes to associate node specific
> > metadata, and could certainly express target_node capabilities in its
> > own ABI. Then it's just a matter of making pfn_to_nid() do the right
> > thing so KVM kernel side can validate the capabilities of all inbound
> > pfns.
>
> Let's walk through how this would work with today's kernel on tomorrow's
> hardware, without KVM validating PFNs:
>
> 1. daxaddr mmap("/dev/dax1234")
> 2. kvmfd = open("/dev/kvm")
> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr });

At least for a file-backed mapping, the capability lookup could be done
here; no need to wait for the fault.

> 4. guest starts running
> 5. guest touches 'daxaddr'
> 6. Page fault handler maps 'daxaddr'
> 7. KVM finds new 'daxaddr' PTE
> 8. TDX code tries to add physical address to Secure-EPT
> 9. TDX "SEAMCALL" fails because page is not convertible
> 10. Guest dies
>
> All we can do to improve on that is call something that pledges to only
> map convertible memory at 'daxaddr'. We can't *actually* validate the
> physical addresses at mmap() time or even
> KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
> allocated.
>
> Those pledges are hard for anonymous memory though. To fulfill the
> pledge, we not only have to validate that the NUMA policy is compatible
> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> policy that might undermine the pledge.

I think it's less that the kernel needs to enforce a pledge and more
that an interface is needed to communicate the guest death reason.
I.e. "here is the impossible thing you asked for, next time set this
policy to avoid this problem".

2022-04-30 15:39:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/29/22 08:18, Dan Williams wrote:
> Yes, I want to challenge the idea that all core-mm memory must be TDX
> capable. Instead, this feels more like something that wants a
> hugetlbfs / dax-device like capability to ask the kernel to gather /
> set-aside the enumerated TDX memory out of all the general purpose
> memory it knows about and then VMs use that ABI to get access to
> convertible memory. Trying to ensure that all page allocator memory is
> TDX capable feels too restrictive with all the different ways pfns can
> get into the allocator.

The KVM users are the problem here. They use a variety of ABIs to get
memory and then hand it to KVM. KVM basically just consumes the
physical addresses from the page tables.

Also, there's no _practical_ problem here today. I can't actually think
of a case where any memory that ends up in the allocator on today's TDX
systems is not TDX capable.

Tomorrow's systems are going to be the problem. They'll (presumably)
have a mix of CXL devices that will have varying capabilities. Some
will surely lack the metadata storage for checksums and TD-owner bits.
TDX use will be *safe* on those systems: if you take this code and run
it on one of tomorrow's systems, it will notice the TDX-incompatible memory
and will disable TDX.

The only way around this that I can see is to introduce ABI today that
anticipates the needs of the future systems. We could require that all
the KVM memory be "validated" before handing it to TDX. Maybe a new
syscall that says: "make sure this mapping works for TDX". It could be
new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

But, neither of those really help with, say, a device-DAX mapping of
TDX-*IN*capable memory handed to KVM. The "new syscall" would just
throw up its hands and leave users with the same result: TDX can't be
used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
device-DAX because they don't respect the NUMA policy ABI.

I'm open to ideas here. If there's a viable ABI we can introduce to
train TDX users today that will work tomorrow too, I'm all for it.

2022-04-30 16:25:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM

On 4/29/22 00:24, Kai Huang wrote:
> On Thu, 2022-04-28 at 09:22 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> implies that one TDMR could cover multiple e820 RAM entries. If a RAM
>>> entry spans the 1GB boundary and the former part is already covered by
>>> the previous TDMR, just create a new TDMR for the latter part.
>>>
>>> TDX only supports a limited number of TDMRs (currently 64). Abort the
>>> TDMR construction process when the number of TDMRs exceeds this
>>> limitation.
>>
>> ... and what does this *MEAN*? Is TDX disabled? Does it throw away the
>> RAM? Does it eat puppies?
>
> How about:
>
> TDX only supports a limited number of TDMRs. Simply return error when
> the number of TDMRs exceeds the limitation. TDX is disabled in this
> case.

Better, but two things there that need to be improved. This is a cover
letter. Talking at the function level ("return error") is too
low-level. It's also slipping into passive mode "is disabled". Fixing
those, it looks like this:

TDX only supports a limited number of TDMRs. Disable TDX if all
TDMRs are consumed but there is more RAM to cover.

2022-05-01 10:38:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 4/29/22 00:46, Kai Huang wrote:
> On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
>> This is also a good place to note the downsides of using
>> alloc_contig_pages().
>
> For instance:
>
> The allocation may fail when memory usage is under pressure.

It's not really memory pressure, though. The larger the allocation, the
more likely it is to fail. The more likely it is that the kernel can't
free the memory or that if you need 1GB of contiguous memory that
999.996MB gets freed, but there is one stubborn page left.

alloc_contig_pages() can and will fail. The only mitigation which is
guaranteed to avoid this is doing the allocation at boot. But, you're
not doing that to avoid wasting memory on every TDX system that doesn't
use TDX.

A *good* way (although not foolproof) is to launch a TDX VM early in
boot before memory gets fragmented or consumed. You might even want to
recommend this in the documentation.

>>> +/*
>>> + * Locate the NUMA node containing the start of the given TDMR's first
>>> + * RAM entry. The given TDMR may also cover memory in other NUMA nodes.
>>> + */
>>
>> Please add a sentence or two on the implications here of what this means
>> when it happens. Also, the joining of e820 regions seems like it might
>> span NUMA nodes. What prevents that code from just creating one large
>> e820 area that leads to one large TDMR and horrible NUMA affinity for
>> these structures?
>
> How about adding:
>
> When TDMR is created, it stops spanning at NUMA boundary.

I actually don't know what that means at all. I was thinking of
something like this.

/*
* Pick a NUMA node on which to allocate this TDMR's metadata.
*
* This is imprecise since TDMRs are 1GB aligned and NUMA nodes might
* not be. If the TDMR covers more than one node, just use the _first_
* one. This can lead to small areas of off-node metadata for some
* memory.
*/

>>> +static int tdmr_get_nid(struct tdmr_info *tdmr)
>>> +{
>>> + u64 start, end;
>>> + int i;
>>> +
>>> + /* Find the first RAM entry covered by the TDMR */

There's something else missing in here. Why not just do:

return phys_to_target_node(TDMR_START(tdmr));

This would explain it:

/*
* The beginning of the TDMR might not point to RAM.
* Find its first RAM address, from which its node can
* be found.
*/

>>> + e820_for_each_mem(i, start, end)
>>> + if (end > TDMR_START(tdmr))
>>> + break;
>>
>> Brackets around the big loop, please.
>
> OK.
>
>>
>>> + /*
>>> + * One TDMR must cover at least one (or partial) RAM entry,
>>> + * otherwise it is a kernel bug. WARN_ON() in this case.
>>> + */
>>> + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
>>> + return 0;

This really means "no RAM found for this TDMR", right? Can we say that,
please.


>>> + /*
>>> + * Allocate one chunk of physically contiguous memory for all
>>> + * PAMTs. This helps minimize the PAMT's use of reserved areas
>>> + * in overlapped TDMRs.
>>> + */
>>
>> Ahh, this explains it. Considering that tdmr_get_pamt_sz() is really
>> just two lines of code, I'd probably just remove the helper and open-code it
>> here. Then you only have one place to comment on it.
>
> It has a loop and internally calls __tdmr_get_pamt_sz(). It doesn't look
> like it fits if we open-code it here.
>
> How about move this comment to tdmr_get_pamt_sz()?

I thought about that. But tdmr_get_pamt_sz() isn't itself doing any
allocation so it doesn't make a whole lot of logical sense. This is a
place where a helper _can_ be removed. Remove it, please.
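
For the record, open-coded it would look roughly like this (untested
sketch, keeping __tdmr_get_pamt_sz() from the patch):

	/*
	 * Allocate one chunk of physically contiguous memory for all
	 * PAMTs. This helps minimize the PAMT's use of reserved areas
	 * in overlapped TDMRs.
	 */
	pamt_npages = 0;
	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
		pamt_npages += __tdmr_get_pamt_sz(tdmr, pgsz) >> PAGE_SHIFT;

	pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
				  &node_online_map);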

2022-05-01 22:42:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/29/22 14:20, Dan Williams wrote:
> Is there something already like this today for people that, for
> example, attempt to use PCI BAR mappings as memory? Or does KVM simply
> allow for garbage-in garbage-out?

I'm just guessing, but I _assume_ those garbage PCI BAR mappings are how
KVM does device passthrough.

I know that some KVM users even use mem= to chop down the kernel-owned
'struct page'-backed memory, then have a kind of /dev/mem driver to let
the memory get mapped back into userspace. KVM is happy to pass through
those mappings.

2022-05-02 11:56:32

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, Apr 29, 2022 at 10:18 AM Dave Hansen <[email protected]> wrote:
>
> On 4/29/22 08:18, Dan Williams wrote:
> > Yes, I want to challenge the idea that all core-mm memory must be TDX
> > capable. Instead, this feels more like something that wants a
> > hugetlbfs / dax-device like capability to ask the kernel to gather /
> > set-aside the enumerated TDX memory out of all the general purpose
> > memory it knows about and then VMs use that ABI to get access to
> > convertible memory. Trying to ensure that all page allocator memory is
> > TDX capable feels too restrictive with all the different ways pfns can
> > get into the allocator.
>
> The KVM users are the problem here. They use a variety of ABIs to get
> memory and then hand it to KVM. KVM basically just consumes the
> physical addresses from the page tables.
>
> Also, there's no _practical_ problem here today. I can't actually think
> of a case where any memory that ends up in the allocator on today's TDX
> systems is not TDX capable.
>
> Tomorrow's systems are going to be the problem. They'll (presumably)
> have a mix of CXL devices that will have varying capabilities. Some
> will surely lack the metadata storage for checksums and TD-owner bits.
> TDX use will be *safe* on those systems: if you take this code and run
> it on one of tomorrow's systems, it will notice the TDX-incompatible memory
> and will disable TDX.
>
> The only way around this that I can see is to introduce ABI today that
> anticipates the needs of the future systems. We could require that all
> the KVM memory be "validated" before handing it to TDX. Maybe a new
> syscall that says: "make sure this mapping works for TDX". It could be
> new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

Yes, node-id seems the only reasonable handle that can be used, and it
does not seem too onerous for a KVM user to have to set a node policy
preferring all the TDX / confidential-computing capable nodes.

> But, neither of those really help with, say, a device-DAX mapping of
> TDX-*IN*capable memory handed to KVM. The "new syscall" would just
> throw up its hands and leave users with the same result: TDX can't be
> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> device-DAX because they don't respect the NUMA policy ABI.

They do have "target_node" attributes to associate node specific
metadata, and could certainly express target_node capabilities in its
own ABI. Then it's just a matter of making pfn_to_nid() do the right
thing so KVM kernel side can validate the capabilities of all inbound
pfns.
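
I.e. the KVM-side check could be as simple as (sketch; node_tdx_capable()
is a made-up stand-in for whatever per-node capability attribute grows
out of this):

static bool kvm_pfn_tdx_capable(kvm_pfn_t pfn)
{
	/* pfn_to_nid() resolves the pfn to its (target) node */
	return node_tdx_capable(pfn_to_nid(pfn));
}

...run over every pfn KVM is about to hand to the TDX module.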

> I'm open to ideas here. If there's a viable ABI we can introduce to
> train TDX users today that will work tomorrow too, I'm all for it.

In general, expressing NUMA node perf and node capabilities is
something Linux needs to get better at. HMAT data for example still
exists as sideband information ignored by numactl, but it feels
inevitable that perf and capability details become more of a first
class citizen for applications that have these mem-allocation-policy
constraints in the presence of disparate memory types.

2022-05-02 14:07:13

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <[email protected]> wrote:
>
> On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <[email protected]> wrote:
> > [..]
> > > > 3. Memory hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support memory
> > > > hotplug. And the first generation of TDX-capable platforms don't support
> > > > physical memory hotplug. Since it physically cannot happen, this series
> > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > >
> > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> >
> > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> >
> > > > kmem driver. However the first generation of TDX-capable platforms
> > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > happen either.
> > >
> > > What prevents this code from today's code being run on tomorrow's
> > > platforms and breaking these assumptions?
> >
> > The assumption is already broken today with NVDIMM-N. The lack of
> > DDR-T support on TDX enabled platforms has zero effect on DDR-based
> > persistent memory solutions. In other words, please describe the
> > actual software and hardware conflicts at play here, and do not make
> > the mistake of assuming that "no DDR-T support on TDX platforms" ==
> > "no NVDIMM support".
>
> Sorry, I got this information from the planning team or execution team, I
> guess. I was told NVDIMM and TDX cannot "co-exist" on the first generation of
> TDX-capable machines. "Co-exist" means they cannot be turned on simultaneously
> on the same platform. I am also not aware of NVDIMM-N, nor of the difference
> between DDR-based and DDR-T-based persistent memory. Could you give some more
> background here so I can take a look?

My rough understanding is that TDX makes use of metadata communicated
"on the wire" for DDR, but that infrastructure is not there for DDR-T.
However, there are plenty of DDR based NVDIMMs that use super-caps /
batteries and flash to save contents. I believe the concern for TDX is
that the kernel needs to know not to use TDX-accepted PMEM as PMEM
because the contents saved by the DIMM's onboard energy source are
unreadable outside of a TD.

Here is one of the links that comes up in a search for NVDIMM-N.

https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019

>
> >
> > > > Another case is admin can use 'memmap' kernel command line to create
> > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > kmem driver to add them as system RAM. To avoid having to change memory
> > > > hotplug code to prevent this from happening, this series always include
> > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> >
> > I am not sure what you are trying to say here?
>
> We want to always make sure the memory managed by page allocator is TDX memory.

That only seems possible if the kernel is given a TDX capable physical
address map at the beginning of time.

> So if the legacy PMEMs are unconditionally configured as TDX memory, then we
> don't need to prevent them from being added as system memory via kmem driver.

I think that is too narrow of a focus.

Does a memory map exist for the physical address ranges that are TDX
capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is
ambiguous beyond the point of utility across the industry's entire
range of confidential computing memory capabilities.

One strawman would be an ACPI table with contents like:

struct acpi_protected_memory {
struct range range;
uuid_t platform_mem_crypto_capability;
};

With some way to map those uuids to a set of platform vendor specific
constraints and specifications. Some would be shared across
confidential computing vendors, some might be unique. Otherwise, I do
not see how you enforce the expectation of "all memory in the page
allocator is TDX capable". The other alternative is that *none* of the
memory in the page allocator is TDX capable and a special memory
allocation device is used to map memory for TDs. In either case a map
of all possible TDX memory is needed and the discussion above seems
like an incomplete / "hopeful" proposal about the memory that dax_kmem, or
other sources, might online. See the CXL CEDT CFMWS (CXL Fixed Memory
Window Structure) as an example of an ACPI table that sets the
kernel's expectations about how a physical address range might be
used.

https://www.computeexpresslink.org/spec-landing

>
> >
> > > > 4. CPU hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > either. Since this physically cannot happen, this series doesn't add any
> > > > check in ACPI CPU hotplug code path to disable it.
> >
> > What are the actual challenges posed to TDX with respect to CPU hotplug?
>
> During the TDX module initialization, there is a step to call SEAMCALL on all
> logical cpus to initialize per-cpu TDX stuff. TDX doesn't support initializing
> newly hot-added CPUs after the initialization. There are MCHECK/BIOS changes
> to enforce this check too, I guess, but I don't know the details.

Is there an ACPI table that indicates CPU-x passed the check? Or since
the BIOS is invoked in the CPU-online path, is it trusted to suppress
those events for CPUs outside of the mcheck domain?

> > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> >
> > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> > /proc/cpuinfo for example.
>
> It means the CPUs with "enable" bit set in the MADT table.

That just indicates the present CPUs, and then a hot-add event
changes the state of newly present CPUs to enabled. Per the above, is the
BIOS responsible for rejecting those new CPUs, or is the kernel?

2022-05-02 22:43:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 4/29/22 11:19, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Dave Hansen wrote:
>> On 4/29/22 07:30, Sean Christopherson wrote:
>>> On Fri, Apr 29, 2022, Dave Hansen wrote:
>> ...
>>>> A *good* way (although not foolproof) is to launch a TDX VM early
>>>> in boot before memory gets fragmented or consumed. You might even
>>>> want to recommend this in the documentation.
>>>
>>> What about providing a kernel param to tell the kernel to do the
>>> allocation during boot?
>>
>> I think that's where we'll end up eventually. But, I also want to defer
>> that discussion until after we have something merged.
>>
>> Right now, allocating the PAMTs precisely requires running the TDX
>> module. Running the TDX module requires VMXON. VMXON is only done by
>> KVM. KVM isn't necessarily there during boot. So, it's hard to do
>> precisely today without a bunch of mucking with VMX.
>
> Meh, it's hard only if we ignore the fact that the PAMT entry size isn't going
> to change for a given TDX module, and is extremely unlikely to change in general.
>
> Odds are good the kernel can hardcode a sane default and Just Work. Or provide
> the assumed size of a PAMT entry via module param. If the size ends up being
> wrong, log an error, free the reserved memory, and move on with TDX setup with
> the correct size.

Sure. The boot param could be:

tdx_reserve_whatever=auto

and then it can be overridden if necessary. I just don't want to have
kernel binaries that are only good as paperweights if Intel decides it
needs another byte of metadata.
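
The parsing side is trivial. Something like this would do it (sketch;
the parameter name is still just the placeholder above):

static unsigned long tdx_pamt_entry_size;	/* 0 == "auto" */

static int __init tdx_reserve_setup(char *str)
{
	if (!strcmp(str, "auto"))
		tdx_pamt_entry_size = 0;
	else if (kstrtoul(str, 0, &tdx_pamt_entry_size))
		pr_err("bad tdx_reserve_whatever= value '%s'\n", str);
	return 1;
}
__setup("tdx_reserve_whatever=", tdx_reserve_setup);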

>> You can arm-wrestle the distro folks who hate adding command-line tweaks
>> when the time comes. ;)
>
> Sure, you just find me the person that's going to run TDX guests with an
> off-the-shelf distro kernel :-D

Well, everyone gets their kernel from upstream eventually and everyone
watches upstream.

But, in all seriousness, do you really expect TDX to remain solely in
the non-distro-kernel crowd forever? I expect the fancy cloud
providers (with custom kernels) who care the most to deploy TDX first.
But, things will trickle down to the distro crowd over time.

2022-05-02 23:21:55

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, 2022-04-29 at 11:34 -0700, Dave Hansen wrote:
> On 4/29/22 10:48, Dan Williams wrote:
> > > But, neither of those really help with, say, a device-DAX mapping of
> > > TDX-*IN*capable memory handed to KVM. The "new syscall" would just
> > > throw up its hands and leave users with the same result: TDX can't be
> > > used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> > > device-DAX because they don't respect the NUMA policy ABI.
> > They do have "target_node" attributes to associate node specific
> > metadata, and could certainly express target_node capabilities in its
> > own ABI. Then it's just a matter of making pfn_to_nid() do the right
> > thing so KVM kernel side can validate the capabilities of all inbound
> > pfns.
>
> Let's walk through how this would work with today's kernel on tomorrow's
> hardware, without KVM validating PFNs:
>
> 1. daxaddr mmap("/dev/dax1234")
> 2. kvmfd = open("/dev/kvm")
> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr });
> 4. guest starts running
> 5. guest touches 'daxaddr'
> 6. Page fault handler maps 'daxaddr'
> 7. KVM finds new 'daxaddr' PTE
> 8. TDX code tries to add physical address to Secure-EPT
> 9. TDX "SEAMCALL" fails because page is not convertible
> 10. Guest dies
>
> All we can do to improve on that is call something that pledges to only
> map convertible memory at 'daxaddr'. We can't *actually* validate the
> physical addresses at mmap() time or even
> KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
> allocated.
>
> Those pledges are hard for anonymous memory though. To fulfill the
> pledge, we not only have to validate that the NUMA policy is compatible
> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> policy that might undermine the pledge.

Hi Dave,

There's another series done by Chao, "KVM: mm: fd-based approach for supporting
KVM guest private memory", which essentially allows KVM to ask the guest memory
backend to allocate pages w/o having to mmap() them into userspace. Please see
my reply below:

https://lore.kernel.org/lkml/[email protected]/T/#mf9bf10a63eaaf0968c46ab33bdaf06bd2cfdd948

My understanding is that for TDX guests, KVM will be required to use the new
mechanism. So when TDX supports NVDIMM in the future, dax can be extended to
support the new mechanism so that NVDIMM can be used as a TD guest memory
backend.

Sean, Paolo, Isaku, Chao,

Please correct me if I am wrong?

--
Thanks,
-Kai


2022-05-02 23:41:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/28/22 19:58, Dan Williams wrote:
> That only seems possible if the kernel is given a TDX capable physical
> address map at the beginning of time.

TDX actually brings along its own memory map. The "EAS" [1] has a lot
of info on it, if you know where to find it. Here's the relevant chunk:

CMR - Convertible Memory Range -
A range of physical memory configured by BIOS and verified by
MCHECK. MCHECK verification is intended to help ensure that a
CMR may be used to hold TDX memory pages encrypted with a
private HKID.

So, the BIOS has the platform knowledge to enumerate this range. It
stashes the information off somewhere that the TDX module can find it.
Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.INFO) to the
TDX module and gets the list of CMRs.

The OS then has to reconcile this CMR "memory map" against the regular
old BIOS-provided memory map, tossing out any memory regions which are
RAM, but not covered by a CMR, or disabling TDX entirely.

Fun, eh?
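
The reconciliation check itself is conceptually simple; something along
these lines (sketch; tdx_cmr_array[]/tdx_cmr_num stand in for whatever
TDH.SYS.INFO hands back):

/* Is the range [start, end) fully covered by a single CMR? */
static bool range_covered_by_cmr(u64 start, u64 end)
{
	int i;

	for (i = 0; i < tdx_cmr_num; i++) {
		struct cmr_info *cmr = &tdx_cmr_array[i];

		if (start >= cmr->base && end <= cmr->base + cmr->size)
			return true;
	}
	return false;
}

Any e820 RAM range that fails that check either gets tossed or disables
TDX entirely.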

I'm still grappling with how this series handles the policy of what
memory to throw away when.

1.
https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf

2022-05-03 00:08:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 5/1/22 22:59, Kai Huang wrote:
> On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> How about adding below in the changelog:
>
> "
> However using alloc_contig_pages() to allocate large physically contiguous
> memory at runtime may fail. The larger the allocation, the more likely it is to
> fail. Due to the fragmentation, the kernel may need to move pages out of the
> to-be-allocated contiguous memory range but it may fail to move even the last
> stubborn page. A good way (although not foolproof) is to launch a TD VM early
> in boot to get PAMTs allocated before memory gets fragmented or consumed.
> "

Better, although it's getting a bit off topic for this changelog.

Just be short and sweet:

1. the allocation can fail
2. Launch a VM early to (badly) mitigate this
3. the only way to fix it is to add a boot option


>>>>> + /*
>>>>> + * One TDMR must cover at least one (or partial) RAM entry,
>>>>> + * otherwise it is a kernel bug. WARN_ON() in this case.
>>>>> + */
>>>>> + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
>>>>> + return 0;
>>
>> This really means "no RAM found for this TDMR", right? Can we say that,
>> please.
>
> OK will add it. How about:
>
> /*
> * No RAM found for this TDMR. WARN() in this case, as it
> * cannot happen unless there is a kernel bug.
> */

The only useful information in that comment is the first sentence. The
gibberish about WARN() is patently obvious from the next two lines of code.

*WHY* can't this happen? How might it have actually happened?

2022-05-03 00:11:41

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > In order to provide crypto protection to guests, the TDX module uses
> > additional metadata to record things like which guest "owns" a given
> > page of memory. This metadata, referred as Physical Address Metadata
> > Table (PAMT), essentially serves as the 'struct page' for the TDX
> > module. PAMTs are not reserved by hardware upfront. They must be
> > allocated by the kernel and then given to the TDX module.
> >
> > TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region"
> > (TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.
>
> s/respectively//
>

Will remove.

> > Each PAMT must be a physically contiguous area from the Convertible
>
> ^ s/the/a/

OK.

>
> > Memory Regions (CMR). However, the PAMTs which track pages in one TDMR
> > do not need to reside within that TDMR but can be anywhere in CMRs.
> > If one PAMT overlaps with any TDMR, the overlapping part must be
> > reported as a reserved area in that particular TDMR.
> >
> > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > and it may be potentially large (~1/256th of the size of the given TDMR).
>
> This is also a good place to note the downsides of using
> alloc_contig_pages().

For instance:

The allocation may fail when memory usage is under pressure.

?

>
> > The current version of TDX supports at most 16 reserved areas per TDMR
> > to cover both PAMTs and potential memory holes within the TDMR. If many
> > PAMTs are allocated within a single TDMR, 16 reserved areas may not be
> > sufficient to cover all of them.
> >
> > Adopt the following policies when allocating PAMTs for a given TDMR:
> >
> > - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> > the total number of reserved areas consumed for PAMTs.
> > - Try to first allocate PAMT from the local node of the TDMR for better
> > NUMA locality.
> >
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 166 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7414625b938f..ff68d0829bd7 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
> > depends on CPU_SUP_INTEL
> > depends on X86_64
> > select NUMA_KEEP_MEMINFO if NUMA
> > + depends on CONTIG_ALLOC
> > help
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. This option enables necessary TDX
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 82534e70df96..1b807dcbc101 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -21,6 +21,7 @@
> > #include <asm/cpufeatures.h>
> > #include <asm/virtext.h>
> > #include <asm/e820/api.h>
> > +#include <asm/pgtable.h>
> > #include <asm/tdx.h>
> > #include "tdx.h"
> >
> > @@ -66,6 +67,16 @@
> > #define TDMR_START(_tdmr) ((_tdmr)->base)
> > #define TDMR_END(_tdmr) ((_tdmr)->base + (_tdmr)->size)
> >
> > +/* Page sizes supported by TDX */
> > +enum tdx_page_sz {
> > + TDX_PG_4K = 0,
> > + TDX_PG_2M,
> > + TDX_PG_1G,
> > + TDX_PG_MAX,
> > +};
>
> Is that =0 required? I thought the first enum was defined to be 0.

No it's not required. Will remove.

>
> > +#define TDX_HPAGE_SHIFT 9
> > +
> > /*
> > * TDX module status during initialization
> > */
> > @@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> > return ret;
> > }
> >
> > +/* Calculate PAMT size given a TDMR and a page size */
> > +static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> > + enum tdx_page_sz pgsz)
> > +{
> > + unsigned long pamt_sz;
> > +
> > + pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
> > + tdx_sysinfo.pamt_entry_size;
>
> That 'pgsz' thing is just hideous. I'd *much* rather see something like
> this:
>
> static int tdx_page_size_shift(enum tdx_page_sz page_sz)
> {
> switch (page_sz) {
> case TDX_PG_4K:
> return PAGE_SHIFT;
> ...
> }
> }
>
> That's easy to figure out what's going on.

OK. Will do.
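
Filled in for all three page sizes, it would be something like (untested;
PMD_SHIFT/PUD_SHIFT match TDX's 2M/1G pages on x86_64):

static int tdx_page_size_shift(enum tdx_page_sz page_sz)
{
	switch (page_sz) {
	case TDX_PG_4K:
		return PAGE_SHIFT;
	case TDX_PG_2M:
		return PMD_SHIFT;
	case TDX_PG_1G:
		return PUD_SHIFT;
	default:
		WARN_ON_ONCE(1);
		return -EINVAL;
	}
}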

>
> > + /* PAMT size must be 4K aligned */
> > + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> > +
> > + return pamt_sz;
> > +}
> > +
> > +/* Calculate the size of all PAMTs for a TDMR */
> > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
> > +{
> > + enum tdx_page_sz pgsz;
> > + unsigned long pamt_sz;
> > +
> > + pamt_sz = 0;
> > + for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
> > + pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
> > +
> > + return pamt_sz;
> > +}
>
> But, there are 3 separate pointers pointing to 3 separate PAMTs. Why do
> they all have to be contiguously allocated?

It is also explained in the changelog (the last two paragraphs).

>
> > +/*
> > + * Locate the NUMA node containing the start of the given TDMR's first
> > + * RAM entry. The given TDMR may also cover memory in other NUMA nodes.
> > + */
>
> Please add a sentence or two on the implications here of what this means
> when it happens. Also, the joining of e820 regions seems like it might
> span NUMA nodes. What prevents that code from just creating one large
> e820 area that leads to one large TDMR and horrible NUMA affinity for
> these structures?

How about adding:

When TDMR is created, it stops spanning at the NUMA boundary.

>
> > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > +{
> > + u64 start, end;
> > + int i;
> > +
> > + /* Find the first RAM entry covered by the TDMR */
> > + e820_for_each_mem(i, start, end)
> > + if (end > TDMR_START(tdmr))
> > + break;
>
> Brackets around the big loop, please.

OK.

>
> > + /*
> > + * One TDMR must cover at least one (or partial) RAM entry,
> > > + * otherwise it is a kernel bug. WARN_ON() in this case.
> > + */
> > + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > + return 0;
> > +
> > + /*
> > + * The first RAM entry may be partially covered by the previous
> > + * TDMR. In this case, use TDMR's start to find the NUMA node.
> > + */
> > + if (start < TDMR_START(tdmr))
> > + start = TDMR_START(tdmr);
> > +
> > + return phys_to_target_node(start);
> > +}
> > +
> > +static int tdmr_setup_pamt(struct tdmr_info *tdmr)
> > +{
> > + unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
> > + unsigned long pamt_sz[TDX_PG_MAX];
> > + unsigned long pamt_npages;
> > + struct page *pamt;
> > + enum tdx_page_sz pgsz;
> > + int nid;
>
> Sooooooooooooooooooo close to reverse Christmas tree, but no cigar.
> Please fix it.

Will fix. Thanks.

>
> > + /*
> > + * Allocate one chunk of physically contiguous memory for all
> > + * PAMTs. This helps minimize the PAMT's use of reserved areas
> > + * in overlapped TDMRs.
> > + */
>
> Ahh, this explains it. Considering that tdmr_get_pamt_sz() is really
> just two lines of code, I'd probably just remove the helper and open-code it
> here. Then you only have one place to comment on it.

It has a loop and internally calls __tdmr_get_pamt_sz(). It doesn't look
like it fits if we open-code it here.

How about move this comment to tdmr_get_pamt_sz()?


>
> > + nid = tdmr_get_nid(tdmr);
> > + pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
> > + pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
> > + &node_online_map);
> > + if (!pamt)
> > + return -ENOMEM;
> > +
> > + /* Calculate PAMT base and size for all supported page sizes. */
> > + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> > + for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> > + unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
> > +
> > + pamt_base[pgsz] = tdmr_pamt_base;
> > + pamt_sz[pgsz] = sz;
> > +
> > + tdmr_pamt_base += sz;
> > + }
> > +
> > + tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> > + tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
> > + tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> > + tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
> > + tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> > + tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];
>
> This would all vertically align nicely if you renamed pamt_sz -> pamt_size.

OK.

>
> > + return 0;
> > +}
> > +
> > +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> > +{
> > + unsigned long pamt_pfn, pamt_sz;
> > +
> > + pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;
>
> Comment, please:
>
> /*
> * The PAMT was allocated in one contiguous unit. The 4k PAMT
> * should always point to the beginning of that allocation.
> */

Thanks will add.

>
> > + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> > +
> > + /* Do nothing if PAMT hasn't been allocated for this TDMR */
> > + if (!pamt_sz)
> > + return;
> > +
> > + if (WARN_ON(!pamt_pfn))
> > + return;
> > +
> > + free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
> > +}
> > +
> > +static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < tdmr_num; i++)
> > + tdmr_free_pamt(tdmr_array[i]);
> > +}
> > +
> > +/* Allocate and set up PAMTs for all TDMRs */
> > +static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
>
> "set_up", please, not "setup".

OK.

>
> > +{
> > + int i, ret;
> > +
> > + for (i = 0; i < tdmr_num; i++) {
> > + ret = tdmr_setup_pamt(tdmr_array[i]);
> > + if (ret)
> > + goto err;
> > + }
> > +
> > + return 0;
> > +err:
> > + tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> > + return -ENOMEM;
> > +}
> > +
> > static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> > {
> > int ret;
> > @@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> > if (ret)
> > goto err;
> >
> > + ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
> > + if (ret)
> > + goto err_free_tdmrs;
> > +
> > /* Return -EFAULT until constructing TDMRs is done */
> > ret = -EFAULT;
> > + tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
> > +err_free_tdmrs:
> > free_tdmrs(tdmr_array, *tdmr_num);
> > err:
> > return ret;
> > @@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
> > * initialization are done.
> > */
> > ret = -EFAULT;
> > + /*
> > + * Free PAMTs allocated in construct_tdmrs() when TDX module
> > + * initialization fails.
> > + */
> > + if (ret)
> > + tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> > out_free_tdmrs:
> > /*
> > * TDMRs are only used during initializing TDX module. Always
>
> In a follow-on patch, I'd like this to dump out (in a pr_debug() or
> pr_info()) how much memory is consumed by PAMT allocations.

OK.


2022-05-03 00:13:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> On 4/29/22 00:46, Kai Huang wrote:
> > On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> > > This is also a good place to note the downsides of using
> > > alloc_contig_pages().
> >
> > For instance:
> >
> > The allocation may fail when memory usage is under pressure.
>
> It's not really memory pressure, though. The larger the allocation, the
> more likely it is to fail. The more likely it is that the kernel can't
> free the memory, or that, if you need 1GB of contiguous memory,
> 999.996MB gets freed but there is one stubborn page left.
>
> alloc_contig_pages() can and will fail. The only mitigation which is
> guaranteed to avoid this is doing the allocation at boot. But, you're
> not doing that to avoid wasting memory on every TDX system that doesn't
> use TDX.
>
> A *good* way (although not foolproof) is to launch a TDX VM early in
> boot before memory gets fragmented or consumed. You might even want to
> recommend this in the documentation.

"launch a TDX VM early in boot" I suppose you mean having some boot-time service
which launches a TDX VM before we get the login interface. I'll put this in the
documentation.

How about adding below in the changelog:

"
However using alloc_contig_pages() to allocate large physically contiguous
memory at runtime may fail. The larger the allocation, the more likely it is to
fail. Due to the fragmentation, the kernel may need to move pages out of the
to-be-allocated contiguous memory range but it may fail to move even the last
stubborn page. A good way (although not foolproof) is to launch a TD VM early
in boot to get PAMTs allocated before memory gets fragmented or consumed.
"

>
> > > > +/*
> > > > + * Locate the NUMA node containing the start of the given TDMR's first
> > > > + * RAM entry. The given TDMR may also cover memory in other NUMA nodes.
> > > > + */
> > >
> > > Please add a sentence or two on the implications here of what this means
> > > when it happens. Also, the joining of e820 regions seems like it might
> > > span NUMA nodes. What prevents that code from just creating one large
> > > e820 area that leads to one large TDMR and horrible NUMA affinity for
> > > these structures?
> >
> > How about adding:
> >
> > When TDMR is created, it stops spanning at the NUMA boundary.
>
> I actually don't know what that means at all. I was thinking of
> something like this.
>
> /*
> * Pick a NUMA node on which to allocate this TDMR's metadata.
> *
> * This is imprecise since TDMRs are 1GB aligned and NUMA nodes might
> * not be. If the TDMR covers more than one node, just use the _first_
> * one. This can lead to small areas of off-node metadata for some
> * memory.
> */

Thanks.

>
> > > > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > > > +{
> > > > + u64 start, end;
> > > > + int i;
> > > > +
> > > > + /* Find the first RAM entry covered by the TDMR */
>
> There's something else missing in here. Why not just do:
>
> return phys_to_target_node(TDMR_START(tdmr));
>
> This would explain it:
>
> /*
> * The beginning of the TDMR might not point to RAM.
> * Find its first RAM address, with which its node can
> * be found.
> */

Will use this. Thanks.

>
> > > > + e820_for_each_mem(i, start, end)
> > > > + if (end > TDMR_START(tdmr))
> > > > + break;
> > >
> > > Brackets around the big loop, please.
> >
> > OK.
> >
> > >
> > > > + /*
> > > > + * One TDMR must cover at least one (or partial) RAM entry,
> > > > + * otherwise it is a kernel bug. WARN_ON() in this case.
> > > > + */
> > > > + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > > > + return 0;
>
> This really means "no RAM found for this TDMR", right? Can we say that,
> please.

OK will add it. How about:

/*
* No RAM found for this TDMR. WARN() in this case, as it
* cannot happen unless there is a kernel bug.
*/

>
>
> > > > + /*
> > > > + * Allocate one chunk of physically contiguous memory for all
> > > > + * PAMTs. This helps minimize the PAMT's use of reserved areas
> > > > + * in overlapped TDMRs.
> > > > + */
> > >
> > > Ahh, this explains it. Considering that tdmr_get_pamt_sz() is really
> > > just two lines of code, I'd probably just remove the helper and open-code it
> > > here. Then you only have one place to comment on it.
> >
> > It has a loop and internally calls __tdmr_get_pamt_sz(). It doesn't look
> > like it fits if we open-code it here.
> >
> > How about move this comment to tdmr_get_pamt_sz()?
>
> I thought about that. But tdmr_get_pamt_sz() isn't itself doing any
> allocation so it doesn't make a whole lot of logical sense. This is a
> place where a helper _can_ be removed. Remove it, please.

OK. Will remove the helper. Thanks.

--
Thanks,
-Kai


2022-05-03 00:21:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/29/22 11:47, Dan Williams wrote:
> On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <[email protected]> wrote:
>>
>> On 4/29/22 10:48, Dan Williams wrote:
>>>> But, neither of those really help with, say, a device-DAX mapping of
>>>> TDX-*IN*capable memory handed to KVM. The "new syscall" would just
>>>> throw up its hands and leave users with the same result: TDX can't be
>>>> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
>>>> device-DAX because they don't respect the NUMA policy ABI.
>>> They do have "target_node" attributes to associate node specific
>>> metadata, and could certainly express target_node capabilities in its
>>> own ABI. Then it's just a matter of making pfn_to_nid() do the right
>>> thing so KVM kernel side can validate the capabilities of all inbound
>>> pfns.
>>
>> Let's walk through how this would work with today's kernel on tomorrow's
>> hardware, without KVM validating PFNs:
>>
>> 1. daxaddr mmap("/dev/dax1234")
>> 2. kvmfd = open("/dev/kvm")
>> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr });
>
> At least for a file backed mapping the capability lookup could be done
> here, no need to wait for the fault.

For DAX mappings, sure. But, anything that's backed by page cache, you
can't know until the RAM is allocated.

...
>> Those pledges are hard for anonymous memory though. To fulfill the
>> pledge, we not only have to validate that the NUMA policy is compatible
>> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
>> policy that might undermine the pledge.
>
> I think it's less that the kernel needs to enforce a pledge and more
> that an interface is needed to communicate the guest death reason.
> I.e. "here is the impossible thing you asked for, next time set this
> policy to avoid this problem".

IF this code is booted on a system where non-TDX-capable memory is
discovered, do we:
1. Disable TDX, printk() some nasty message, then boot as normal
or,
2a. Boot normally with TDX enabled
2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure
(the "SEAMCALLs" which are the last line of defense and will reject
the request to add non-TDX-capable memory to a guest). Or maybe
an even earlier message.

For #1, if TDX is on, we are quite sure it will work. But, it will
probably throw up its hands on tomorrow's hardware. (This patch set).

For #2, TDX might break (guests get killed) at runtime on tomorrow's
hardware, but it also might be just fine. Users might be able to work
around things by, for instance, figuring out a NUMA policy which
excludes TDX-incapable memory. (I think what Dan is looking for)

Is that a fair summary?

2022-05-03 00:23:33

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > On 4/27/22 17:37, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > >
> > > I thought we could document this in the documentation saying that this code can
> > > only work on TDX machines that don't have above capabilities (SPR for now). We
> > > can change the code and the documentation when we add the support of those
> > > features in the future, and update the documentation.
> > >
> > > If 5 years later someone takes this code, he/she should take a look at the
> > > documentation and figure out that he/she should choose a newer kernel if the
> > > machine support those features.
> > >
> > > I'll think about design solutions if above doesn't look good for you.
> >
> > No, it doesn't look good to me.
> >
> > You can't just say:
> >
> > /*
> > * This code will eat puppies if used on systems with hotplug.
> > */
> >
> > and merrily await the puppy bloodbath.
> >
> > If it's not compatible, then you have to *MAKE* it not compatible in a
> > safe, controlled way.
> >
> > > > You can't just ignore the problems because they're not present on one
> > > > version of the hardware.
> >
> > Please, please read this again ^^
>
> OK. I'll think about solutions and come back later.
> >

Hi Dave,

I think we have two approaches to handle memory hotplug interaction with the TDX
module initialization.

The first approach is simple. We just block memory from being added as system
RAM managed by the page allocator when the platform supports TDX [1]. It seems
we can add an arch-specific check to __add_memory_resource() and reject the new
memory resource if the platform supports TDX. __add_memory_resource() is called
by both __add_memory() and add_memory_driver_managed(), so it prevents both
adding NVDIMM as system RAM and normal ACPI memory hotplug [2].
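
For illustration, the arch hook might look something like below (just a
sketch; arch_memory_add_allowed() and platform_tdx_enabled() are made-up
names):

	/* In __add_memory_resource(), before registering the resource: */
	if (!arch_memory_add_allowed())
		return -EPERM;

/* x86 side: no memory hot-add on TDX-capable platforms */
static bool arch_memory_add_allowed(void)
{
	/* "TDX-capable" as defined in [1] below */
	return !platform_tdx_enabled();
}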

The second approach is relatively more complicated. Instead of directly
rejecting the new memory resource in __add_memory_resource(), we check whether
the memory resource can be added based on CMR and the TDX module initialization
status. This is feasible as with the latest public P-SEAMLDR spec, we can get
CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info
during kernel boots. And in __add_memory_resource() we do below check:

tdx_init_disable(); /*similar to cpu_hotplug_disable() */
if (tdx_module_initialized())
// reject memory hotplug
else if (new_memory_resource NOT in CMRs)
// reject memory hotplug
else
allow memory hotplug
tdx_init_enable(); /*similar to cpu_hotplug_enable() */

tdx_init_disable() temporarily disables TDX module initialization by trying to
grab the mutex. If the TDX module initialization is already ongoing, then it
waits until it completes.

This should work better for future platforms, but would require non-trivially
more code, as we need to add VMXON/VMXOFF support to the core kernel to detect
CMRs using SEAMCALL. A side advantage is that with VMXON in the core kernel we
can shut down the TDX module in kexec().

But for this series I think the second approach is overkill and we can choose to
use the first simple approach?

Any suggestions?

[1] "Platform supports TDX" means SEAMRR is enabled and there are at least 2 TDX
keyIDs. Or we can just check that SEAMRR is enabled, as in practice SEAMRR being
enabled means the machine is TDX-capable, and for now a TDX-capable machine
doesn't support ACPI memory hotplug.

[2] It prevents adding legacy PMEM as system RAM too, but I think that's fine.
If the user wants legacy PMEM, then it is unlikely the user will add it back
and use it as system RAM. The user is also unlikely to use legacy PMEM as TD
guest memory directly, as TD guests are likely to use a new memfd backend which
allows private pages not accessible from userspace, so in this way we can
exclude legacy PMEM from TDMRs.

[3] Please refer to the SEAMLDR.SEAMINFO SEAMCALL in the latest P-SEAMLDR spec:
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf

2022-05-03 00:28:59

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, Apr 29, 2022 at 12:20 PM Dave Hansen <[email protected]> wrote:
>
> On 4/29/22 11:47, Dan Williams wrote:
> > On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <[email protected]> wrote:
> >>
> >> On 4/29/22 10:48, Dan Williams wrote:
> >>>> But, neither of those really help with, say, a device-DAX mapping of
> >>>> TDX-*IN*capable memory handed to KVM. The "new syscall" would just
> >>>> throw up its hands and leave users with the same result: TDX can't be
> >>>> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> >>>> device-DAX because they don't respect the NUMA policy ABI.
> >>> They do have "target_node" attributes to associate node specific
> >>> metadata, and could certainly express target_node capabilities in its
> >>> own ABI. Then it's just a matter of making pfn_to_nid() do the right
> >>> thing so KVM kernel side can validate the capabilities of all inbound
> >>> pfns.
> >>
> >> Let's walk through how this would work with today's kernel on tomorrow's
> >> hardware, without KVM validating PFNs:
> >>
> >> 1. daxaddr mmap("/dev/dax1234")
> >> 2. kvmfd = open("/dev/kvm")
> >> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr });
> >
> > At least for a file backed mapping the capability lookup could be done
> > here, no need to wait for the fault.
>
> For DAX mappings, sure. But, anything that's backed by page cache, you
> can't know until the RAM is allocated.
>
> ...
> >> Those pledges are hard for anonymous memory though. To fulfill the
> >> pledge, we not only have to validate that the NUMA policy is compatible
> >> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> >> policy that might undermine the pledge.
> >
> > I think it's less that the kernel needs to enforce a pledge and more
> > that an interface is needed to communicate the guest death reason.
> > I.e. "here is the impossible thing you asked for, next time set this
> > policy to avoid this problem".
>
> IF this code is booted on a system where non-TDX-capable memory is
> discovered, do we:
> 1. Disable TDX, printk() some nasty message, then boot as normal
> or,
> 2a. Boot normally with TDX enabled
> 2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure
> (the "SEAMCALLs" which are the last line of defense and will reject
> the request to add non-TDX-capable memory to a guest). Or maybe
> an even earlier message.
>
> For #1, if TDX is on, we are quite sure it will work. But, it will
> probably throw up its hands on tomorrow's hardware. (This patch set).
>
> For #2, TDX might break (guests get killed) at runtime on tomorrow's
> hardware, but it also might be just fine. Users might be able to work
> around things by, for instance, figuring out a NUMA policy which
> excludes TDX-incapable memory. (I think what Dan is looking for)
>
> Is that a fair summary?

Yes, just the option for TDX and non-TDX to live alongside each
other... although in the past I have argued to do option-1 and enforce
it at the lowest level [1], e.g. platform BIOS being responsible for
disabling CXL if CXL support for a given CPU security feature is
missing. However, I think end users will want to have their
confidential computing and capacity too. As long as that is not
precluded from being added after the fact, option-1 can be a way forward
until a concrete user for mixed mode shows up.

Is there something already like this today for people that, for
example, attempt to use PCI BAR mappings as memory? Or does KVM simply
allow for garbage-in garbage-out?

In the end the patches shouldn't talk about whether or not PMEM is
supported on a platform; that's irrelevant. What matters is
that misconfigurations can happen, should be rare to non-existent on
current platforms, and if it becomes a problem the kernel can grow ABI
to let userspace enumerate the conflicts.

[1]: https://lore.kernel.org/linux-cxl/CAPcyv4jMQbHYQssaDDDQFEbOR1v14VUnejcSwOP9VGUnZSsCKw@mail.gmail.com/

2022-05-03 00:35:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Fri, Apr 29, 2022, Dave Hansen wrote:
> On 4/29/22 07:30, Sean Christopherson wrote:
> > On Fri, Apr 29, 2022, Dave Hansen wrote:
> ...
> >> A *good* way (although not foolproof) is to launch a TDX VM early
> >> in boot before memory gets fragmented or consumed. You might even
> >> want to recommend this in the documentation.
> >
> > What about providing a kernel param to tell the kernel to do the
> > allocation during boot?
>
> I think that's where we'll end up eventually. But, I also want to defer
> that discussion until after we have something merged.
>
> Right now, allocating the PAMTs precisely requires running the TDX
> module. Running the TDX module requires VMXON. VMXON is only done by
> KVM. KVM isn't necessarily there during boot. So, it's hard to do
> precisely today without a bunch of mucking with VMX.

Meh, it's hard only if we ignore the fact that the PAMT entry size isn't going
to change for a given TDX module, and is extremely unlikely to change in general.

Odds are good the kernel can hardcode a sane default and Just Work. Or provide
the assumed size of a PAMT entry via module param. If the size ends up being
wrong, log an error, free the reserved memory, and move on with TDX setup with
the correct size.

> You can arm-wrestle the distro folks who hate adding command-line tweaks
> when the time comes. ;)

Sure, you just find me the person that's going to run TDX guests with an
off-the-shelf distro kernel :-D

2022-05-03 00:41:59

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On Mon, 2022-05-02 at 07:17 -0700, Dave Hansen wrote:
> On 5/1/22 22:59, Kai Huang wrote:
> > On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> > How about adding below in the changelog:
> >
> > "
> > However using alloc_contig_pages() to allocate large physically contiguous
> > memory at runtime may fail. The larger the allocation, the more likely it is to
> > fail. Due to the fragmentation, the kernel may need to move pages out of the
> > to-be-allocated contiguous memory range but it may fail to move even the last
> > stubborn page. A good way (although not foolproof) is to launch a TD VM early
> > in boot to get PAMTs allocated before memory gets fragmented or consumed.
> > "
>
> Better, although it's getting a bit off topic for this changelog.
>
> Just be short and sweet:
>
> 1. the allocation can fail
> 2. Launch a VM early to (badly) mitigate this
> 3. the only way to fix it is to add a boot option
>
OK. Thanks.

>
> > > > > > + /*
> > > > > > + * One TDMR must cover at least one (or partial) RAM entry,
> > > > > > + * otherwise it is a kernel bug. WARN_ON() in this case.
> > > > > > + */
> > > > > > + if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > > > > > + return 0;
> > >
> > > This really means "no RAM found for this TDMR", right? Can we say that,
> > > please.
> >
> > OK will add it. How about:
> >
> > /*
> > * No RAM found for this TDMR. WARN() in this case, as it
> > * cannot happen unless there is a kernel bug.
> > */
>
> The only useful information in that comment is the first sentence. The
> gibberish about WARN() is patently obvious from the next two lines of code.
>
> *WHY* can't this happen? How might it have actually happened?

When TDMRs are created, we have already made sure each TDMR covers at least
one (or part of one) RAM entry.

--
Thanks,
-Kai


2022-05-03 01:01:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 4/29/22 10:48, Dan Williams wrote:
>> But, neither of those really help with, say, a device-DAX mapping of
>> TDX-*IN*capable memory handed to KVM. The "new syscall" would just
>> throw up its hands and leave users with the same result: TDX can't be
>> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
>> device-DAX because they don't respect the NUMA policy ABI.
> They do have "target_node" attributes to associate node specific
> metadata, and could certainly express target_node capabilities in its
> own ABI. Then it's just a matter of making pfn_to_nid() do the right
> thing so KVM kernel side can validate the capabilities of all inbound
> pfns.

Let's walk through how this would work with today's kernel on tomorrow's
hardware, without KVM validating PFNs:

1. daxaddr mmap("/dev/dax1234")
2. kvmfd = open("/dev/kvm")
3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr });
4. guest starts running
5. guest touches 'daxaddr'
6. Page fault handler maps 'daxaddr'
7. KVM finds new 'daxaddr' PTE
8. TDX code tries to add physical address to Secure-EPT
9. TDX "SEAMCALL" fails because page is not convertible
10. Guest dies

All we can do to improve on that is call something that pledges to only
map convertible memory at 'daxaddr'. We can't *actually* validate the
physical addresses at mmap() time or even
KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
allocated.

Those pledges are hard for anonymous memory though. To fulfill the
pledge, we not only have to validate that the NUMA policy is compatible
at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
policy that might undermine the pledge.

2022-05-03 01:02:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

On 4/29/22 07:30, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Dave Hansen wrote:
...
>> A *good* way (although not foolproof) is to launch a TDX VM early
>> in boot before memory gets fragmented or consumed. You might even
>> want to recommend this in the documentation.
>
> What about providing a kernel param to tell the kernel to do the
> allocation during boot?

I think that's where we'll end up eventually. But, I also want to defer
that discussion until after we have something merged.

Right now, allocating the PAMTs precisely requires running the TDX
module. Running the TDX module requires VMXON. VMXON is only done by
KVM. KVM isn't necessarily there during boot. So, it's hard to do
precisely today without a bunch of mucking with VMX.

But, it would be really easy to do something less precise like:

tdx_reserve_ratio=255

...

u8 *pamt_reserve[MAX_NR_NODES];

for_each_online_node(n) {
	pamt_pages = (node_spanned_pages(n)/tdx_reserve_ratio) /
			PAGE_SIZE;
	pamt_reserve[n] = alloc_bootmem(pamt_pages);
}

Then have the TDX code use pamt_reserve[] instead of allocating more
memory when it is needed later.

That will work just fine as long as you know up front how much metadata
TDX needs. If the metadata requirements change in an updated TDX
module, the command-line will need to be updated to regain the
guarantee. But, it can still fall back to the best-effort code that is
in the series today.
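
In code, that fallback is just (sketch, reusing the pamt_reserve[] idea
from above and glossing over the u8 * vs struct page * mismatch):

	/* Use the boot-time reservation if the admin set one up... */
	pamt = pamt_reserve[nid];

	/* ...otherwise fall back to the runtime best-effort allocation */
	if (!pamt)
		pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
					  &node_online_map);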

In other words, I think we want what is in the series today no matter
what, and we'll want it forever. That's why it's the *one* way of doing
things now. I entirely agree that there will be TDX users that want a
stronger guarantee.

You can arm-wrestle the distro folks who hate adding command-line tweaks
when the time comes. ;)

2022-05-03 01:10:51

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand

On Thu, 2022-04-28 at 17:26 -0700, Dave Hansen wrote:
> On 4/28/22 17:11, Kai Huang wrote:
> > This is true. So I think w/o taking the lock is also fine, as the TDX module
> > initialization is a state machine. If any cpu goes offline during logical-cpu
> > level initialization and TDH.SYS.LP.INIT isn't done on that cpu, then later the
> > TDH.SYS.CONFIG will fail. Similarly, if any cpu going offline causes
> > TDH.SYS.KEY.CONFIG is not done for any package, then TDH.SYS.TDMR.INIT will
> > fail.
>
> Right. The worst-case scenario is someone is mucking around with CPU
> hotplug during TDX initialization is that TDX initialization will fail.
>
> We *can* fix some of this at least and provide coherent error messages
> with a pattern like this:
>
> cpus_read_lock();
> // check that all MADT-enumerated CPUs are online
> tdx_init();
> cpus_read_unlock();
>
> That, of course, *does* prevent CPUs from going offline during
> tdx_init(). It also provides a nice place for an error message:
>
> pr_warn("You offlined a CPU then want to use TDX? Sod off.\n");

Yes this is better.

The problem is how to check that all MADT-enumerated CPUs are online.

I checked the code, and it seems we can use 'num_processors + disabled_cpus' as
MADT-enumerated CPUs? In fact, there should be no 'disabled_cpus' for TDX, so I
think:

	if (disabled_cpus || num_processors != num_online_cpus()) {
		pr_err("Initializing the TDX module requires all MADT-enumerated CPUs being online.\n");
		return -EINVAL;
	}

But I may have a misunderstanding.

>
> > A problem (I realized it exists in current implementation too) is shutting down
> > the TDX module, which requires calling TDH.SYS.LP.SHUTDOWN on all BIOS-enabled
> > cpus. Kernel can do this SEAMCALL at most for all present cpus. However when
> > any cpu is offline, this SEAMCALL won't be called on it, and it seems we need to
> > add new CPU hotplug callback to call this SEAMCALL when the cpu is online again.
>
> Hold on a sec. If you call TDH.SYS.LP.SHUTDOWN on any CPU, then TDX
> stops working everywhere, right?
>

Yes.

But to shut down the TDX module, it's better to call LP.SHUTDOWN on all
logical cpus, as suggested by the spec.

> But, if someone offlines one CPU, we
> don't want TDX to stop working everywhere.

Right. I am talking about when initialization fails for any reason (e.g.
-ENOMEM); currently we shut down the TDX module in that case. When shutting
down the TDX module, we want to call LP.SHUTDOWN on all logical cpus. If any
CPU is offline when we do the shutdown, then LP.SHUTDOWN won't be called on
that cpu.

But as you suggested above, if we have an early check of whether all
MADT-enumerated CPUs are online, and return w/o shutting down the TDX module
if they are not, then whenever we do shut down the module, LP.SHUTDOWN will be
called on all cpus.

2022-05-04 01:08:16

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > >
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now). We
> > > > can change the code and the documentation when we add the support of those
> > > > features in the future, and update the documentation.
> > > >
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > >
> > > > I'll think about design solutions if above doesn't look good for you.
> > >
> > > No, it doesn't look good to me.
> > >
> > > You can't just say:
> > >
> > > /*
> > > * This code will eat puppies if used on systems with hotplug.
> > > */
> > >
> > > and merrily await the puppy bloodbath.
> > >
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > >
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > >
> > > Please, please read this again ^^
> >
> > OK. I'll think about solutions and come back later.
> > >
>
> Hi Dave,
>
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.
>
> The first approach is simple. We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX. __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

Hi Dave,

Trying to close on how to handle memory hotplug. Any comments on the below?

For the first approach, I forgot to think about the memory hot-remove case. If
we just reject adding a new memory resource when the platform is TDX-capable,
then once memory is hot-removed, we won't be able to add it back. My thinking
is we still want to support memory online/offline, because it is purely
software and has nothing to do with TDX. But if a memory resource can be put
offline, it seems we don't have any way to prevent it from being removed.

So if we do the above, then on future platforms where memory hotplug can
co-exist with TDX, ACPI hot-add and kmem hot-add will be prevented. However,
if some memory is hot-removed, it won't be able to be added back (even if it
is included in the CMRs, or in the TDMRs after the TDX module is initialized).

Is this behavior acceptable? Or perhaps I am misunderstanding something?

The second approach will behave more nicely, but I don't know whether it is
worth doing now.

Btw, the logic below for adding a new memory resource has a minor problem;
please see below...

>
> The second approach is relatively more complicated. Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status. This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info
> during kernel boots. And in __add_memory_resource() we do below check:
>
> tdx_init_disable();	/* similar to cpu_hotplug_disable() */
> if (tdx_module_initialized())
> 	// reject memory hotplug
> else if (new_memory_resource NOT in CMRs)
> 	// reject memory hotplug
> else
> 	// allow memory hotplug
> tdx_init_enable();	/* similar to cpu_hotplug_enable() */

...

Should be:

/* prevent racing with TDX module initialization */
tdx_init_disable();

if (tdx_module_initialized()) {
	if (new_memory_resource in TDMRs)
		// allow memory hot-add
	else
		// reject memory hot-add
} else if (new_memory_resource in CMRs) {
	// add new memory to TDX memory so it can be
	// included into TDMRs

	// allow memory hot-add
} else {
	// reject memory hot-add
}

tdx_init_enable();

And when the platform doesn't support TDX, always allow memory hot-add.
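
In real code, this could be an arch-specific hook called when a new memory
resource is about to be added; a rough sketch, with all the tdx_*() helper
names made up:

/* Hypothetical arch hook, called before a new memory resource is accepted. */
int arch_check_new_memory_resource(u64 start, u64 size)
{
	int ret = 0;

	/* Prevent racing with TDX module initialization. */
	tdx_init_disable();

	if (tdx_module_initialized()) {
		/* Only memory already covered by the TDMRs is usable. */
		if (!tdx_tdmr_covers(start, start + size))
			ret = -EINVAL;
	} else if (tdx_cmr_covers(start, start + size)) {
		/* Record it so it gets included into the TDMRs later. */
		ret = tdx_add_memory(start, start + size);
	} else {
		ret = -EINVAL;
	}

	tdx_init_enable();
	return ret;
}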


>
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex. If the TDX module initialization is already on going, then it
> waits until it completes.
>
> This should work better for future platforms, but would require non-trivially
> more code, as we need to add VMXON/VMXOFF support to the core kernel to detect
> CMRs using SEAMCALL. A side advantage is that with VMXON in the core kernel we
> can shut down the TDX module in kexec().
>
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?
>
> Any suggestions?
>
> [1] 'Platform supports TDX' means SEAMRR is enabled and there are at least 2
> TDX keyIDs. Or we can just check that SEAMRR is enabled, as in practice SEAMRR
> being enabled means the machine is TDX-capable, and for now a TDX-capable
> machine doesn't support ACPI memory hotplug.
>
> [2] It prevents adding legacy PMEM as system RAM too, but I think that's fine.
> If the user wants legacy PMEM, it is unlikely they will add it back and use it
> as system RAM. The user is also unlikely to use legacy PMEM as TD guest memory
> directly, as TD guests are likely to use a new memfd backend which allows
> private pages not accessible from userspace, so in this way we can exclude
> legacy PMEM from the TDMRs.
>
> [3] Please refer to SEAMLDR.SEAMINFO SEAMCALL in latest P-SEAMLDR spec:
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > > >

2022-05-04 08:25:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On 5/3/22 16:59, Kai Huang wrote:
> Should be:
>
> /* prevent racing with TDX module initialization */
> tdx_init_disable();
>
> if (tdx_module_initialized()) {
> 	if (new_memory_resource in TDMRs)
> 		// allow memory hot-add
> 	else
> 		// reject memory hot-add
> } else if (new_memory_resource in CMRs) {
> 	// add new memory to TDX memory so it can be
> 	// included into TDMRs
>
> 	// allow memory hot-add
> } else {
> 	// reject memory hot-add
> }
>
> tdx_init_enable();
>
> And when the platform doesn't support TDX, always allow memory hot-add.

I don't think it even needs to be *that* complicated.

It could just be winner take all: if TDX is initialized first, don't
allow memory hotplug. If memory hotplug happens first, don't allow TDX
to be initialized.

That's fine at least for a minimal patch set.

What you have up above is probably where you want to go eventually, but
it means doing things like augmenting the e820 since it's the single
source of truth for creating the TDMRs right now.


2022-05-05 12:39:29

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote:
> On 5/3/22 16:59, Kai Huang wrote:
> > Should be:
> >
> > /* prevent racing with TDX module initialization */
> > tdx_init_disable();
> >
> > if (tdx_module_initialized()) {
> > 	if (new_memory_resource in TDMRs)
> > 		// allow memory hot-add
> > 	else
> > 		// reject memory hot-add
> > } else if (new_memory_resource in CMRs) {
> > 	// add new memory to TDX memory so it can be
> > 	// included into TDMRs
> >
> > 	// allow memory hot-add
> > } else {
> > 	// reject memory hot-add
> > }
> >
> > tdx_init_enable();
> >
> > And when the platform doesn't support TDX, always allow memory hot-add.
>
> I don't think it even needs to be *that* complicated.
>
> It could just be winner take all: if TDX is initialized first, don't
> allow memory hotplug. If memory hotplug happens first, don't allow TDX
> to be initialized.
>
> That's fine at least for a minimal patch set.

OK. This should also work.

We will need tdx_init_disable(), which grabs the mutex to prevent TDX module
initialization from running concurrently, and disables TDX module
initialization once and for all.
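
A minimal sketch of that "winner take all" arbitration (all names made up):

static DEFINE_MUTEX(tdx_module_lock);
static bool tdx_module_disabled;	/* memory hotplug won */
static bool tdx_module_ready;		/* TDX won */

/* Called from the memory hotplug path. */
int tdx_init_disable(void)
{
	int ret = 0;

	mutex_lock(&tdx_module_lock);
	if (tdx_module_ready)
		ret = -EBUSY;		/* TDX was first: reject the hot-add */
	else
		tdx_module_disabled = true;
	mutex_unlock(&tdx_module_lock);

	return ret;
}

/* Called when KVM wants to use TDX. */
int tdx_init(void)
{
	int ret;

	mutex_lock(&tdx_module_lock);
	if (tdx_module_disabled)
		ret = -ENODEV;		/* hotplug was first: no TDX */
	else
		ret = __tdx_init();	/* sets tdx_module_ready on success */
	mutex_unlock(&tdx_module_lock);

	return ret;
}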


>
> What you have up above is probably where you want to go eventually, but
> it means doing things like augmenting the e820 since it's the single
> source of truth for creating the TDMRs right now.
>

Yes. But in this case, I am thinking we should probably switch from consulting
e820 to consulting memblock. The advantage of using e820 is that it's easy to
include legacy PMEM as TDX memory, but the disadvantage is (as you can see in
the e820_for_each_mem() loop) that I am actually merging contiguous RAM
entries of different types in order to be consistent with the behavior of
e820_memblock_setup(). This is not nice.

If memory hot-add and TDX can only have one winner, legacy PMEM actually won't
be used as TDX memory now anyway. The reason is a TDX guest will very likely
need to use the new fd-based backend (see my reply in other emails), not just
some random backend. To me it's totally fine not to support using legacy PMEM
directly as a TD guest backend (and if we create a TD with real NVDIMM as the
backend using dax, the TD cannot be created anyway). Given we cannot kmem
hot-add legacy PMEM back as system RAM, to me it's pointless to include legacy
PMEM in the TDMRs.

In this case, we can just create TDMRs based on memblock directly. One problem
is memblock will be gone after the kernel boots, but this can be solved either
by keeping the memblock around, or by building the TDX memory early while
memblock is still alive.

Btw, as it's likely we will need to support different sources of TDX memory
(CXL memory, etc.) eventually, I think we will need some data structures to
represent a TDX memory block, and APIs to add those blocks to the whole TDX
memory, so that TDX memory ranges from different sources can be added before
initializing the TDX module.

struct tdx_memblock {
	struct list_head list;
	phys_addr_t start;
	phys_addr_t end;
	int nid;
	...
};

struct tdx_memory {
	struct list_head tmb_list;
	...
};

int tdx_memory_add_memblock(start, end, nid, ...);

And the TDMRs can be created based on 'struct tdx_memory'.

For now, we only need to add memblock to TDX memory.
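
(One possible shape of the add API, error handling trimmed; whether 'tmem' is
passed in or is a single global is an open detail:)

int tdx_memory_add_memblock(struct tdx_memory *tmem, phys_addr_t start,
			    phys_addr_t end, int nid)
{
	struct tdx_memblock *tmb;

	tmb = kzalloc(sizeof(*tmb), GFP_KERNEL);
	if (!tmb)
		return -ENOMEM;

	tmb->start = start;
	tmb->end = end;
	tmb->nid = nid;

	/* Keep the list address-ordered if TDMR construction needs that. */
	list_add_tail(&tmb->list, &tmem->tmb_list);
	return 0;
}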

Any comments?

--
Thanks,
-Kai



2022-05-05 18:20:17

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, 2022-05-04 at 13:15 +1200, Kai Huang wrote:
> On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote:
> > On 5/3/22 16:59, Kai Huang wrote:
> > > Should be:
> > >
> > > /* prevent racing with TDX module initialization */
> > > tdx_init_disable();
> > >
> > > if (tdx_module_initialized()) {
> > > 	if (new_memory_resource in TDMRs)
> > > 		// allow memory hot-add
> > > 	else
> > > 		// reject memory hot-add
> > > } else if (new_memory_resource in CMRs) {
> > > 	// add new memory to TDX memory so it can be
> > > 	// included into TDMRs
> > >
> > > 	// allow memory hot-add
> > > } else {
> > > 	// reject memory hot-add
> > > }
> > >
> > > tdx_init_enable();
> > >
> > > And when the platform doesn't support TDX, always allow memory hot-add.
> >
> > I don't think it even needs to be *that* complicated.
> >
> > It could just be winner take all: if TDX is initialized first, don't
> > allow memory hotplug. If memory hotplug happens first, don't allow TDX
> > to be initialized.
> >
> > That's fine at least for a minimal patch set.
>
> OK. This should also work.
>
> We will need tdx_init_disable() which grabs the mutex to prevent TDX module
> initialization from running concurrently, and to disable TDX module
> initialization once for all.
>
>
> >
> > What you have up above is probably where you want to go eventually, but
> > it means doing things like augmenting the e820 since it's the single
> > source of truth for creating the TDMRs right now.
> >
>
> Yes. But in this case, I am thinking about probably we should switch from
> consulting e820 to consulting memblock. The advantage of using e820 is it's
> easy to include legacy PMEM as TDX memory, but the disadvantage is (as you can
> see in e820_for_each_mem() loop) I am actually merging contiguous different
> types of RAM entries in order to be consistent with the behavior of
> e820_memblock_setup(). This is not nice.
>
> If memory hot-add and TDX can only be one winner, legacy PMEM actually won't be
> used as TDX memory anyway now. The reason is TDX guest will very likely needing
> to use the new fd-based backend (see my reply in other emails), but not just
> some random backend. To me it's totally fine to not support using legacy PMEM
> directly as TD guest backend (and if we create a TD with real NVDIMM as backend
> using dax, the TD cannot be created anyway). Given we cannot kmem-hot-add
> legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM into
> TDMRs.
>
> In this case, we can just create TDMRs based on memblock directly. One problem
> is memblock will be gone after kernel boots, but this can be solved either by
> keeping the memblock, or build the TDX memory early when memblock is still
> alive.
>
> Btw, eventually, as it's likely we need to support different source of TDX
> memory (CLX memory, etc), I think eventually we will need some data structures
> to represent TDX memory block and APIs to add those blocks to the whole TDX
> memory so those TDX memory ranges from different source can be added before
> initializing the TDX module.
>
> struct tdx_memblock {
> 	struct list_head list;
> 	phys_addr_t start;
> 	phys_addr_t end;
> 	int nid;
> 	...
> };
>
> struct tdx_memory {
> 	struct list_head tmb_list;
> 	...
> };
>
> int tdx_memory_add_memblock(start, end, nid, ...);
>
> And the TDMRs can be created based on 'struct tdx_memory'.
>
> For now, we only need to add memblock to TDX memory.
>
> Any comments?
>

Hi Dave,

Sorry to ping (trying to close this).

Given we don't need to consider kmem hot-adding legacy PMEM after TDX module
initialization, I think for now it's totally fine to exclude legacy PMEMs from
the TDMRs. The worst case is that when someone tries to use them directly as a
TD guest backend, the TD will fail to be created. IMO that's acceptable, as
supposedly no one should use some random backend to run a TD.

I think, without needing to include legacy PMEM, it's better to get all TDX
memory blocks from memblock rather than e820. The pages managed by the page
allocator come from memblock anyway (excluding those from memory hotplug).

And I also think it makes more sense to introduce 'tdx_memblock' and
'tdx_memory' data structures to gather all TDX memory blocks during boot, when
memblock is still alive. When the TDX module is initialized at runtime, TDMRs
can be created based on the 'struct tdx_memory' which contains all the TDX
memory blocks we gathered from memblock during boot. This is also more
flexible for supporting TDX memory from other sources, such as CXL memory, in
the future.

Please let me know if you have any objections. Thanks!

--
Thanks,
-Kai



2022-05-05 23:35:40

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Tue, May 3, 2022 at 4:59 PM Kai Huang <[email protected]> wrote:
>
> On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > >
> > > > > I thought we could document this in the documentation saying that this code can
> > > > > only work on TDX machines that don't have above capabilities (SPR for now). We
> > > > > can change the code and the documentation when we add the support of those
> > > > > features in the future, and update the documentation.
> > > > >
> > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > machine support those features.
> > > > >
> > > > > I'll think about design solutions if above doesn't look good for you.
> > > >
> > > > No, it doesn't look good to me.
> > > >
> > > > You can't just say:
> > > >
> > > > /*
> > > > * This code will eat puppies if used on systems with hotplug.
> > > > */
> > > >
> > > > and merrily await the puppy bloodbath.
> > > >
> > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > safe, controlled way.
> > > >
> > > > > > You can't just ignore the problems because they're not present on one
> > > > > > version of the hardware.
> > > >
> > > > Please, please read this again ^^
> > >
> > > OK. I'll think about solutions and come back later.
> > > >
> >
> > Hi Dave,
> >
> > I think we have two approaches to handle memory hotplug interaction with the TDX
> > module initialization.
> >
> > The first approach is simple. We just block memory from being added as system
> > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > can add some arch-specific-check to __add_memory_resource() and reject the new
> > memory resource if platform supports TDX. __add_memory_resource() is called by
> > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > NVDIMM as system RAM and normal ACPI memory hotplug [2].
>
> Hi Dave,
>
> Trying to close on how to handle memory hotplug. Any comments on the below?
>
> For the first approach, I forgot to think about memory hot-remove case. If we
> just reject adding new memory resource when TDX is capable on the platform, then
> if the memory is hot-removed, we won't be able to add it back. My thinking is
> we still want to support memory online/offline because it is purely in software
> but has nothing to do with TDX. But if one memory resource can be put to
> offline, it seems we don't have any way to prevent it from being removed.
>
> So if we do above, on the future platforms when memory hotplug can co-exist with
> TDX, ACPI hot-add and kmem-hot-add memory will be prevented. However if some
> memory is hot-removed, it won't be able to be added back (even it is included in
> CMR, or TDMRs after TDX module is initialized).
>
> Is this behavior acceptable? Or perhaps I have misunderstanding?

Memory online at boot uses similar kernel paths as memory online at run time,
so it sounds like your question is confusing physical vs logical remove.
Consider the case of logical offline then re-online, where the proposed TDX
sanity check blocks the memory online, but then a new kernel is kexec'd and
that kernel trusts the memory as TD-convertible again just because it onlines
the memory in the boot path. For physical memory removal it seems up to the
platform to block that if it conflicts with TDX, not for the kernel to add the
extra assumption that logical offline / online is incompatible with TDX.

2022-05-06 04:50:21

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

[ add Mike ]


On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
[..]
>
> Hi Dave,
>
> Sorry to ping (trying to close this).
>
> Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> initialization, I think for now it's totally fine to exclude legacy PMEMs from
> TDMRs. The worst case is when someone tries to use them as TD guest backend
> directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> that no one should just use some random backend to run TD.

The platform will already do this, right? I don't understand why this
is trying to take proactive action versus documenting the error
conditions and steps someone needs to take to avoid unconvertible
memory. There is already the CONFIG_HMEM_REPORTING that describes
relative performance properties between initiators and targets, it
seems fitting to also add security properties between initiators and
targets so someone can enumerate the numa-mempolicy that avoids
unconvertible memory.

No special casing in hotplug code paths is needed.

>
> I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> blocks based on memblock, but not e820. The pages managed by page allocator are
> from memblock anyway (w/o those from memory hotplug).
>
> And I also think it makes more sense to introduce 'tdx_memblock' and
> 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> memblock is still alive. When TDX module is initialized during runtime, TDMRs
> can be created based on the 'struct tdx_memory' which contains all TDX memory
> blocks we gathered based on memblock during boot. This is also more flexible to
> support other TDX memory from other sources such as CLX memory in the future.
>
> Please let me know if you have any objection? Thanks!

It's already the case that x86 maintains sideband structures to
preserve memory after exiting the early memblock code. Mike, correct
me if I am wrong, but adding more is less desirable than just keeping
the memblock around?

2022-05-07 01:34:26

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, May 5, 2022 at 3:14 PM Kai Huang <[email protected]> wrote:
>
> Thanks for feedback!
>
> On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > [ add Mike ]
> >
> >
> > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > [..]
> > >
> > > Hi Dave,
> > >
> > > Sorry to ping (trying to close this).
> > >
> > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > > that no one should just use some random backend to run TD.
> >
> > The platform will already do this, right?
> >
>
> In the current v3 implementation, we don't have any code to handle memory
> hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> RAM using kmem driver. In order to guarantee all pages managed by page

That's the fundamental question I am asking: why "guarantee all pages managed
by the page allocator are TDX memory"? That seems overkill compared to
indicating the incompatibility after the fact.

> allocator are all TDX memory, the v3 implementation needs to always include
> legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system
> RAM, we can still guarantee all pages in page allocator are TDX memory.

Why?

> Of course, a side benefit of always including legacy PMEMs is people
> theoretically can use them directly as TD guest backend, but this is just a
> bonus but not something that we need to guarantee.
>
>
> > I don't understand why this
> > is trying to take proactive action versus documenting the error
> > conditions and steps someone needs to take to avoid unconvertible
> > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > relative performance properties between initiators and targets, it
> > seems fitting to also add security properties between initiators and
> > targets so someone can enumerate the numa-mempolicy that avoids
> > unconvertible memory.
>
> I don't think there's anything related to performance properties here. The only
> goal here is to make sure all pages in page allocator are TDX memory pages.

Please reconsider or re-clarify that goal.

>
> >
> > No special casing in hotplug code paths is needed.
> >
> > >
> > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > blocks based on memblock, but not e820. The pages managed by page allocator are
> > > from memblock anyway (w/o those from memory hotplug).
> > >
> > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > memblock is still alive. When TDX module is initialized during runtime, TDMRs
> > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > blocks we gathered based on memblock during boot. This is also more flexible to
> > > support other TDX memory from other sources such as CLX memory in the future.
> > >
> > > Please let me know if you have any objection? Thanks!
> >
> > It's already the case that x86 maintains sideband structures to
> > preserve memory after exiting the early memblock code.
> >
>
> May I ask what data structures are you referring to?

struct numa_meminfo.
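
(Which is roughly, from arch/x86/mm/numa_internal.h:)

struct numa_memblk {
	u64			start;
	u64			end;
	int			nid;
};

struct numa_meminfo {
	int			nr_blks;
	struct numa_memblk	blk[NR_NODE_MEMBLKS];
};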

> Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve
> memblock info during boot. It is also used to provide a common data structure
> that the "constructing TDMRs" code can work on. If you look at patch 11-14, the
> logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to
> construct TDMRs doesn't have hard dependency on e820. If we construct TDMRs
> based on a common 'tdx_memory' like below:
>
> int construct_tdmrs(struct tdx_memory *tmem, ...);
>
> It would be much easier to support other TDX memory resources in the future.

"in the future" is a prompt to ask "Why not wait until that future /
need arrives before adding new infrastructure?"

> The thing I am not sure is Dave wants to keep the code minimal (as this series
> is already very big in terms of LoC) to make TDX running, and for now in
> practice there's only system RAM during boot is TDX capable, so I am not sure we
> should introduce those structures now.
>
> > Mike, correct
> > me if I am wrong, but adding more is less desirable than just keeping
> > the memblock around?

2022-05-07 11:41:34

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, May 5, 2022 at 5:46 PM Kai Huang <[email protected]> wrote:
>
> On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > On Thu, May 5, 2022 at 3:14 PM Kai Huang <[email protected]> wrote:
> > >
> > > Thanks for feedback!
> > >
> > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > [ add Mike ]
> > > >
> > > >
> > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > > > [..]
> > > > >
> > > > > Hi Dave,
> > > > >
> > > > > Sorry to ping (trying to close this).
> > > > >
> > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > > > > that no one should just use some random backend to run TD.
> > > >
> > > > The platform will already do this, right?
> > > >
> > >
> > > In the current v3 implementation, we don't have any code to handle memory
> > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > RAM using kmem driver. In order to guarantee all pages managed by page
> >
> > That's the fundamental question I am asking why is "guarantee all
> > pages managed by page allocator are TDX memory". That seems overkill
> > compared to indicating the incompatibility after the fact.
>
> As I explained, the reason is I don't want to modify page allocator to
> distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> and GFP_TDX.

Right, TDX details do not belong at that level, but it will work
almost all the time if you do nothing to "guarantee" all TDX capable
pages all the time.

> KVM depends on host's page fault handler to allocate the page. In fact KVM only
> consumes PFN from host's page tables. For now only RAM is TDX memory. By
> guaranteeing all pages in page allocator is TDX memory, we can easily use
> anonymous pages as TD guest memory.

Again, TDX capable pages will be the overwhelming default, why are you
worried about cluttering the memory hotplug path for niche corner
cases?

Consider the fact that end users can break the kernel by specifying
invalid memmap= command line options. The memory hotplug code does not
take any steps to add safety in those cases because there are already
too many ways it can go wrong. TDX is just one more corner case where
the memmap= user needs to be careful. Otherwise, it is up to the
platform firmware to make sure everything in the base memory map is
TDX capable, and then all you need is documentation about the failure
mode when extending "System RAM" beyond that baseline.

> shmem to support a new fd-based backend which doesn't require having to mmap()
> TD guest memory to host userspace:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> Also, besides TD guest memory, there are some per-TD control data structures
> (which must be TDX memory too) need to be allocated for each TD. Normal memory
> allocation APIs can be used for such allocation if we guarantee all pages in
> page allocator is TDX memory.

You don't need that guarantee, just check it after the fact and fail
if that assertion fails. It should almost always be the case that it
succeeds, and if it doesn't then something special is happening with
that system and the end user has effectively opted out of TDX
operation.
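
(In code, that after-the-fact check could look something like the below, with
tdx_page_convertible() being a made-up helper:)

/* Sketch of the check-after-allocation approach. */
static struct page *alloc_td_control_page(void)
{
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return NULL;

	if (!tdx_page_convertible(page_to_pfn(page))) {
		/* Something special about this system: TDX is opted out. */
		__free_page(page);
		return NULL;
	}

	return page;
}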

> > > allocator are all TDX memory, the v3 implementation needs to always include
> > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system
> > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> >
> > Why?
>
> If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> system RAM using kmem driver, the assumption of "all pages in page allocator are
> TDX memory" is broken. A TD can be killed during runtime.

Yes, that is what the end user asked for. If they don't want that to
happen then the policy decision about using kmem needs to be updated
in userspace, not hard-coded towards TDX inside the kernel.

2022-05-08 08:56:14

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> [ add Mike ]
>
> On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> [..]
> >
> > Hi Dave,
> >
> > Sorry to ping (trying to close this).
> >
> > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > that no one should just use some random backend to run TD.
>
> The platform will already do this, right? I don't understand why this
> is trying to take proactive action versus documenting the error
> conditions and steps someone needs to take to avoid unconvertible
> memory. There is already the CONFIG_HMEM_REPORTING that describes
> relative performance properties between initiators and targets, it
> seems fitting to also add security properties between initiators and
> targets so someone can enumerate the numa-mempolicy that avoids
> unconvertible memory.
>
> No special casing in hotplug code paths is needed.
>
> >
> > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > blocks based on memblock, but not e820. The pages managed by page allocator are
> > from memblock anyway (w/o those from memory hotplug).
> >
> > And I also think it makes more sense to introduce 'tdx_memblock' and
> > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > memblock is still alive. When TDX module is initialized during runtime, TDMRs
> > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > blocks we gathered based on memblock during boot. This is also more flexible to
> > support other TDX memory from other sources such as CLX memory in the future.
> >
> > Please let me know if you have any objection? Thanks!
>
> It's already the case that x86 maintains sideband structures to
> preserve memory after exiting the early memblock code. Mike, correct
> me if I am wrong, but adding more is less desirable than just keeping
> the memblock around?

TBH, I didn't read the entire thread yet, but at first glance, keeping
memblock around is much more preferable than adding yet another { .start,
.end, .flags } data structure. To keep memblock after boot, all that is
needed is something like:

select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
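
(With memblock kept around, gathering the TDX memory ranges at init time could
then be roughly the below, where add_tdx_memblock() stands in for whatever the
series ends up with:)

/* Rough sketch: walk the kept memblock instead of new sideband structures. */
static int build_tdx_memory(void)
{
	phys_addr_t start, end;
	u64 i;
	int ret;

	for_each_mem_range(i, &start, &end) {
		ret = add_tdx_memblock(start, end);
		if (ret)
			return ret;
	}
	return 0;
}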

I'll take a closer look next week on the entire series, maybe I'm missing
some details.

--
Sincerely yours,
Mike.

2022-05-08 14:00:42

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Wed, 2022-05-04 at 07:31 -0700, Dan Williams wrote:
> On Tue, May 3, 2022 at 4:59 PM Kai Huang <[email protected]> wrote:
> >
> > On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > > >
> > > > > > I thought we could document this in the documentation saying that this code can
> > > > > > only work on TDX machines that don't have above capabilities (SPR for now). We
> > > > > > can change the code and the documentation when we add the support of those
> > > > > > features in the future, and update the documentation.
> > > > > >
> > > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > > machine support those features.
> > > > > >
> > > > > > I'll think about design solutions if above doesn't look good for you.
> > > > >
> > > > > No, it doesn't look good to me.
> > > > >
> > > > > You can't just say:
> > > > >
> > > > > /*
> > > > > * This code will eat puppies if used on systems with hotplug.
> > > > > */
> > > > >
> > > > > and merrily await the puppy bloodbath.
> > > > >
> > > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > > safe, controlled way.
> > > > >
> > > > > > > You can't just ignore the problems because they're not present on one
> > > > > > > version of the hardware.
> > > > >
> > > > > Please, please read this again ^^
> > > >
> > > > OK. I'll think about solutions and come back later.
> > > > >
> > >
> > > Hi Dave,
> > >
> > > I think we have two approaches to handle memory hotplug interaction with the TDX
> > > module initialization.
> > >
> > > The first approach is simple. We just block memory from being added as system
> > > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > > can add some arch-specific-check to __add_memory_resource() and reject the new
> > > memory resource if platform supports TDX. __add_memory_resource() is called by
> > > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > > NVDIMM as system RAM and normal ACPI memory hotplug [2].
> >
> > Hi Dave,
> >
> > Trying to close on how to handle memory hotplug. Any comments on the below?
> >
> > For the first approach, I forgot to think about memory hot-remove case. If we
> > just reject adding new memory resource when TDX is capable on the platform, then
> > if the memory is hot-removed, we won't be able to add it back. My thinking is
> > we still want to support memory online/offline because it is purely in software
> > but has nothing to do with TDX. But if one memory resource can be put to
> > offline, it seems we don't have any way to prevent it from being removed.
> >
> > So if we do above, on the future platforms when memory hotplug can co-exist with
> > TDX, ACPI hot-add and kmem-hot-add memory will be prevented. However if some
> > memory is hot-removed, it won't be able to be added back (even it is included in
> > CMR, or TDMRs after TDX module is initialized).
> >
> > Is this behavior acceptable? Or perhaps I have misunderstanding?
>
> Memory online at boot uses similar kernel paths as memory-online at
> run time, so it sounds like your question is confusing physical vs
> logical remove. Consider the case of logical offline then re-online
> where the proposed TDX sanity check blocks the memory online, but then
> a new kernel is kexec'd and that kernel again trusts the memory as TD
> convertible again just because it onlines the memory in the boot path.
> For physical memory remove it seems up to the platform to block that
> if it conflicts with TDX, not for the kernel to add extra assumptions
> that logical offline / online is incompatible with TDX.

Hi Dan,

No, we don't block memory online; we block memory add. The code I mentioned is
add_memory_resource(), while the memory online code path is
memory_block_online(). Or am I wrong?

--
Thanks,
-Kai



2022-05-08 15:03:03

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

Thanks for the feedback!

On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> [ add Mike ]
>
>
> On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> [..]
> >
> > Hi Dave,
> >
> > Sorry to ping (trying to close this).
> >
> > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > that no one should just use some random backend to run TD.
>
> The platform will already do this, right? 
>

In the current v3 implementation, we don't have any code to handle memory
hotplug, therefore nothing prevents people from adding legacy PMEMs as system
RAM using the kmem driver. In order to guarantee that all pages managed by the
page allocator are TDX memory, the v3 implementation needs to always include
legacy PMEMs as TDX memory, so that even if people truly do add legacy PMEMs
as system RAM, we can still guarantee all pages in the page allocator are TDX
memory.

Of course, a side benefit of always including legacy PMEMs is that people can
theoretically use them directly as a TD guest backend, but this is just a
bonus, not something that we need to guarantee.


> I don't understand why this
> is trying to take proactive action versus documenting the error
> conditions and steps someone needs to take to avoid unconvertible
> memory. There is already the CONFIG_HMEM_REPORTING that describes
> relative performance properties between initiators and targets, it
> seems fitting to also add security properties between initiators and
> targets so someone can enumerate the numa-mempolicy that avoids
> unconvertible memory.

I don't think there's anything related to performance properties here. The
only goal here is to make sure all pages in the page allocator are TDX memory
pages.

>
> No special casing in hotplug code paths is needed.
>
> >
> > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > blocks based on memblock, but not e820. The pages managed by page allocator are
> > from memblock anyway (w/o those from memory hotplug).
> >
> > And I also think it makes more sense to introduce 'tdx_memblock' and
> > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > memblock is still alive. When TDX module is initialized during runtime, TDMRs
> > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > blocks we gathered based on memblock during boot. This is also more flexible to
> > support other TDX memory from other sources such as CLX memory in the future.
> >
> > Please let me know if you have any objection? Thanks!
>
> It's already the case that x86 maintains sideband structures to
> preserve memory after exiting the early memblock code. 
>

May I ask which data structures you are referring to?

Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not just to preserve
memblock info during boot. They also provide a common data structure that the
"constructing TDMRs" code can work on. If you look at patches 11-14, the logic
around how to construct TDMRs (create TDMRs, allocate PAMTs, set up reserved
areas) doesn't have a hard dependency on e820. If we construct TDMRs based on
a common 'tdx_memory' like below:

int construct_tdmrs(struct tdx_memory *tmem, ...);

It would be much easier to support other TDX memory resources in the future.
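
(i.e. construct_tdmrs() would only need to walk that list; a sketch of the
idea, with the real steps living in patches 11-14:)

int construct_tdmrs(struct tdx_memory *tmem, struct tdmr_info *tdmr_array,
		    int *tdmr_num)
{
	struct tdx_memblock *tmb;

	list_for_each_entry(tmb, &tmem->tmb_list, list) {
		/* 1) create TDMRs covering [tmb->start, tmb->end) */
	}

	/* 2) allocate PAMTs for each TDMR */
	/* 3) set up reserved areas for memory holes and the PAMTs */

	return 0;
}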

The thing I am not sure about is that Dave wants to keep the code minimal (as
this series is already very big in terms of LoC) to get TDX running, and for
now in practice only system RAM present at boot is TDX-capable, so I am not
sure we should introduce those structures now.

> Mike, correct
> me if I am wrong, but adding more is less desirable than just keeping
> the memblock around?

2022-05-09 01:47:38

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 5:46 PM Kai Huang <[email protected]> wrote:
> >
> > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <[email protected]> wrote:
> > > >
> > > > Thanks for feedback!
> > > >
> > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > [ add Mike ]
> > > > >
> > > > >
> > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > > > > [..]
> > > > > >
> > > > > > Hi Dave,
> > > > > >
> > > > > > Sorry to ping (trying to close this).
> > > > > >
> > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > > > > > that no one should just use some random backend to run TD.
> > > > >
> > > > > The platform will already do this, right?
> > > > >
> > > >
> > > > In the current v3 implementation, we don't have any code to handle memory
> > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > RAM using kmem driver. In order to guarantee all pages managed by page
> > >
> > > That's the fundamental question I am asking why is "guarantee all
> > > pages managed by page allocator are TDX memory". That seems overkill
> > > compared to indicating the incompatibility after the fact.
> >
> > As I explained, the reason is I don't want to modify page allocator to
> > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > and GFP_TDX.
>
> Right, TDX details do not belong at that level, but it will work
> almost all the time if you do nothing to "guarantee" all TDX capable
> pages all the time.

What do you mean by "almost all the time"?

>
> > KVM depends on host's page fault handler to allocate the page. In fact KVM only
> > consumes PFN from host's page tables. For now only RAM is TDX memory. By
> > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > anonymous pages as TD guest memory.
>
> Again, TDX capable pages will be the overwhelming default, why are you
> worried about cluttering the memory hotplug path for nice corner
> cases.

First, perhaps I forgot to mention that there are two concepts of TDX memory,
so let me clarify:

1) Convertible Memory Regions (CMRs). These are reported by the BIOS (thus
static) to indicate which memory regions *can* be used as TDX memory. For now
this basically means all RAM present at boot.

2) TD Memory Regions (TDMRs). Memory pages in CMRs are not automatically
usable as TDX memory. The TDX module needs to be told which (convertible)
memory regions can be used as TDX memory. The kernel is responsible for
choosing the ranges and configuring them in the TDX module. If a convertible
memory page is not included in the TDMRs, the TDX module will report an error
when it is assigned to a TD.
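
(So conceptually, whether a convertible page is actually usable boils down to
a check like this made-up helper, ignoring reserved areas for brevity:)

static bool tdx_phys_covered_by_tdmrs(phys_addr_t pa)
{
	struct tdmr_info *tdmr;

	/* 'tdmr_list' is hypothetical; the exact bookkeeping may differ. */
	list_for_each_entry(tdmr, &tdmr_list, list)
		if (pa >= tdmr->base && pa < tdmr->base + tdmr->size)
			return true;

	return false;
}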

>
> Consider the fact that end users can break the kernel by specifying
> invalid memmap= command line options. The memory hotplug code does not
> take any steps to add safety in those cases because there are already
> too many ways it can go wrong. TDX is just one more corner case where
> the memmap= user needs to be careful. Otherwise, it is up to the
> platform firmware to make sure everything in the base memory map is
> TDX capable, and then all you need is documentation about the failure
> mode when extending "System RAM" beyond that baseline.

So the fact is, if we don't include legacy PMEMs in the TDMRs, and don't do
anything in memory hotplug, then if a user does kmem hot-add legacy PMEMs as
system RAM, a live TD may eventually be killed.

If such a case is a corner case that we don't need to guarantee, then even
better. And we have an additional reason why those legacy PMEMs don't need to
be in the TDMRs. As you suggested, we can add some documentation to point this
out.

But the reason we want some code check to prevent memory hotplug is, as Dave
said, we want this piece of code to work on *ANY* TDX-capable machine,
including future machines which may, for example, support NVDIMM/CXL memory as
TDX memory. If we don't do any code check in memory hotplug in this series,
then when this code runs on future platforms, a user can plug NVDIMM or CXL
memory as system RAM and thus break the assumption "all pages in the page
allocator are TDX memory", which potentially leads to live TDs being killed.

Dave said we need to guarantee this code can work on *ANY* TDX machine.
Documentation saying it only works on some platforms, and that you shouldn't
do certain things on other platforms, is not good enough:

https://lore.kernel.org/lkml/[email protected]/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471

>
>
> > shmem to support a new fd-based backend which doesn't require having to mmap()
> > TD guest memory to host userspace:
> >
> > https://lore.kernel.org/kvm/[email protected]/
> >
> > Also, besides TD guest memory, there are some per-TD control data structures
> > (which must be TDX memory too) need to be allocated for each TD. Normal memory
> > allocation APIs can be used for such allocation if we guarantee all pages in
> > page allocator is TDX memory.
>
> You don't need that guarantee, just check it after the fact and fail
> if that assertion fails. It should almost always be the case that it
> succeeds and if it doesn't then something special is happening with
> that system and the end user has effectively opt-ed out of TDX
> operation.

This doesn't guarantee consistent behaviour. For instance, one TD may be
created successfully while a second fails. We should provide a consistent
service.

The thing is, we need to configure some memory regions in the TDX module
anyway. To me there's no reason not to guarantee that all pages in the page
allocator are TDX memory.

>
> > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system
> > > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> > >
> > > Why?
> >
> > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > TDX memory" is broken. A TD can be killed during runtime.
>
> Yes, that is what the end user asked for. If they don't want that to
> happen then the policy decision about using kmem needs to be updated
> in userspace, not hard code that policy decision towards TDX inside
> the kernel.

This is also fine by me. But please also see Dave's comment above.

Thanks for all the valuable feedback!


--
Thanks,
-Kai



2022-05-09 01:58:09

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, May 5, 2022 at 6:47 PM Kai Huang <[email protected]> wrote:
>
> On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> > On Thu, May 5, 2022 at 5:46 PM Kai Huang <[email protected]> wrote:
> > >
> > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <[email protected]> wrote:
> > > > >
> > > > > Thanks for feedback!
> > > > >
> > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > > [ add Mike ]
> > > > > >
> > > > > >
> > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > > > > > [..]
> > > > > > >
> > > > > > > Hi Dave,
> > > > > > >
> > > > > > > Sorry to ping (trying to close this).
> > > > > > >
> > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > > > > > > that no one should just use some random backend to run TD.
> > > > > >
> > > > > > The platform will already do this, right?
> > > > > >
> > > > >
> > > > > In the current v3 implementation, we don't have any code to handle memory
> > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > > RAM using kmem driver. In order to guarantee all pages managed by page
> > > >
> > > > That's the fundamental question I am asking why is "guarantee all
> > > > pages managed by page allocator are TDX memory". That seems overkill
> > > > compared to indicating the incompatibility after the fact.
> > >
> > > As I explained, the reason is I don't want to modify page allocator to
> > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > > and GFP_TDX.
> >
> > Right, TDX details do not belong at that level, but it will work
> > almost all the time if you do nothing to "guarantee" all TDX capable
> > pages all the time.
>
> What do you mean by "almost all the time"?
>
> >
> > > KVM depends on host's page fault handler to allocate the page. In fact KVM only
> > > consumes PFN from host's page tables. For now only RAM is TDX memory. By
> > > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > > anonymous pages as TD guest memory.
> >
> > Again, TDX capable pages will be the overwhelming default, why are you
> > worried about cluttering the memory hotplug path for nice corner
> > cases.
>
> Firstly perhaps I forgot to mention there are two concepts about TDX memory, so
> let me clarify first:
>
> 1) Convertible Memory Regions (CMRs). This is reported by BIOS (thus static) to
> indicate which memory regions *can* be used as TDX memory. This basically means
> all RAM during boot for now.
>
> 2) TD Memory Regions (TDMRs). Memory pages in CMRs are not automatically TDX
> usable memory. The TDX module needs to be configured which (convertible) memory
> regions can be used as TDX memory. Kernel is responsible for choosing the
> ranges, and configure to the TDX module. If a convertible memory page is not
> included into TDMRs, the TDX module will report error when it is assigned to a
> TD.
>
> >
> > Consider the fact that end users can break the kernel by specifying
> > invalid memmap= command line options. The memory hotplug code does not
> > take any steps to add safety in those cases because there are already
> > too many ways it can go wrong. TDX is just one more corner case where
> > the memmap= user needs to be careful. Otherwise, it is up to the
> > platform firmware to make sure everything in the base memory map is
> > TDX capable, and then all you need is documentation about the failure
> > mode when extending "System RAM" beyond that baseline.
>
> So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do
> anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as
> system RAM, a live TD may eventually be killed.
>
> If such case is a corner case that we don't need to guarantee, then even better.
> And we have an additional reason that those legacy PMEMs don't need to be in
> TDMRs. As you suggested, we can add some documentation to point out.
>
> But the point we want to do some code check and prevent memory hotplug is, as
> Dave said, we want this piece of code to work on *ANY* TDX capable machines,
> including future machines which may, i.e. supports NVDIMM/CLX memory as TDX
> memory. If we don't do any code check in memory hotplug in this series, then
> when this code runs in future platforms, user can plug NVDIMM or CLX memory as
> system RAM thus break the assumption "all pages in page allocator are TDX
> memory", which eventually leads to live TDs being killed potentially.
>
> Dave said we need to guarantee this code can work on *ANY* TDX machines. Some
> documentation saying it only works one some platforms and you shouldn't do
> things on other platforms are not good enough:
>
> https://lore.kernel.org/lkml/[email protected]/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471

Yes, the incompatible cases cannot be ignored, but I disagree that
they actively need to be prevented. One way to achieve that is to
explicitly enumerate TDX capable memory and document how mempolicy can
be used to avoid killing TDs.

> > > shmem to support a new fd-based backend which doesn't require having to mmap()
> > > TD guest memory to host userspace:
> > >
> > > https://lore.kernel.org/kvm/[email protected]/
> > >
> > > Also, besides TD guest memory, there are some per-TD control data structures
> > > (which must be TDX memory too) need to be allocated for each TD. Normal memory
> > > allocation APIs can be used for such allocation if we guarantee all pages in
> > > page allocator is TDX memory.
> >
> > You don't need that guarantee, just check it after the fact and fail
> > if that assertion fails. It should almost always be the case that it
> > succeeds and if it doesn't then something special is happening with
> > that system and the end user has effectively opt-ed out of TDX
> > operation.
>
> This doesn't guarantee consistent behaviour. For instance, for one TD it can be
> created, while the second may fail. We should provide a consistent service.

Yes, there need to be enumeration and policy knobs to avoid failures;
hard-coded "no memory hotplug" hacks do not seem like the right enumeration
and policy knobs to me.

> The thing is anyway we need to configure some memory regions to the TDX module.
> To me there's no reason we don't want to guarantee all pages in page allocator
> are TDX memory.
>
> >
> > > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system
> > > > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> > > >
> > > > Why?
> > >
> > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > > TDX memory" is broken. A TD can be killed during runtime.
> >
> > Yes, that is what the end user asked for. If they don't want that to
> > happen then the policy decision about using kmem needs to be updated
> > in userspace, not hard code that policy decision towards TDX inside
> > the kernel.
>
> This is also fine to me. But please also see above Dave's comment.

Dave is right, the implementation cannot just ignore the conflict. To
me, enumeration plus error reporting allows for flexibility without
hard-coding policy in the kernel.

2022-05-09 03:05:05

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 3:14 PM Kai Huang <[email protected]> wrote:
> >
> > Thanks for feedback!
> >
> > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > [ add Mike ]
> > >
> > >
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > > [..]
> > > >
> > > > Hi Dave,
> > > >
> > > > Sorry to ping (trying to close this).
> > > >
> > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly
> > > > that no one should just use some random backend to run TD.
> > >
> > > The platform will already do this, right?
> > >
> >
> > In the current v3 implementation, we don't have any code to handle memory
> > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > RAM using kmem driver. In order to guarantee all pages managed by page
>
> That's the fundamental question I am asking why is "guarantee all
> pages managed by page allocator are TDX memory". That seems overkill
> compared to indicating the incompatibility after the fact.

As I explained, the reason is I don't want to modify the page allocator to
distinguish TDX and non-TDX allocations, for instance by having to add a
ZONE_TDX and GFP_TDX.

KVM depends on the host's page fault handler to allocate the page. In fact
KVM only consumes PFNs from the host's page tables. For now only RAM is TDX
memory. By guaranteeing all pages in the page allocator are TDX memory, we
can easily use anonymous pages as TD guest memory. This also allows us to
easily extend shmem to support a new fd-based backend which doesn't require
mmap()ing TD guest memory to host userspace:

https://lore.kernel.org/kvm/[email protected]/

Also, besides TD guest memory, there are some per-TD control data structures
(which must be TDX memory too) that need to be allocated for each TD. Normal
memory allocation APIs can be used for such allocations if we guarantee that
all pages in the page allocator are TDX memory.
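
As a rough sketch of what that guarantee buys (hypothetical names:
tdh_mng_addcx() merely stands in for the SEAMCALL wrapper that donates a
control page to the TDX module), a per-TD control page then needs no special
zone or GFP flag:

	static int td_add_control_page(u64 tdr_pa)
	{
		/* No ZONE_TDX/GFP_TDX: any page allocator page will do. */
		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

		if (!page)
			return -ENOMEM;

		/* Stand-in for the real donate-page SEAMCALL wrapper. */
		if (tdh_mng_addcx(tdr_pa, page_to_phys(page))) {
			__free_page(page);
			return -EIO;
		}
		return 0;
	}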

>
> > allocator are all TDX memory, the v3 implementation needs to always include
> > legacy PMEMs as TDX memory so that even if people really do add legacy PMEMs
> > as system RAM, we can still guarantee all pages in the page allocator are TDX memory.
>
> Why?

If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
system RAM using the kmem driver, the assumption that "all pages in the page
allocator are TDX memory" is broken. A TD can be killed at runtime.

>
> > Of course, a side benefit of always including legacy PMEMs is that people
> > can theoretically use them directly as TD guest backend, but this is just a
> > bonus, not something that we need to guarantee.
> >
> >
> > > I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> >
> > I don't think there's anything related to performance properties here. The only
> > goal here is to make sure all pages in the page allocator are TDX memory pages.
>
> Please reconsider or re-clarify that goal.
>
> >
> > >
> > > No special casing in hotplug code paths is needed.
> > >
> > > >
> > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > > blocks based on memblock, but not e820. The pages managed by page allocator are
> > > > from memblock anyway (w/o those from memory hotplug).
> > > >
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > > memblock is still alive. When TDX module is initialized during runtime, TDMRs
> > > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > > blocks we gathered based on memblock during boot. This is also more flexible to
> > > > support other TDX memory from other sources such as CXL memory in the future.
> > > >
> > > > Please let me know if you have any objection. Thanks!
> > >
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code.
> > >
> >
> > May I ask what data structures are you referring to?
>
> struct numa_meminfo.
>
> > Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not just to preserve
> > memblock info during boot. It is also used to provide a common data structure
> > that the "constructing TDMRs" code can work on. If you look at patches 11-14, the
> > logic (creating TDMRs, allocating PAMTs, setting up reserved areas) around how to
> > construct TDMRs doesn't have a hard dependency on e820. If we construct TDMRs
> > based on a common 'tdx_memory' like below:
> >
> > int construct_tdmrs(struct tdx_memory *tmem, ...);
> >
> > It would be much easier to support other TDX memory resources in the future.
>
> "in the future" is a prompt to ask "Why not wait until that future /
> need arrives before adding new infrastructure?"

Fine to me.

--
Thanks,
-Kai



2022-05-09 03:20:29

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote:
> On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> > [ add Mike ]
> >
> > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > [..]
> > >
> > > Hi Dave,
> > >
> > > Sorry to ping (trying to close this).
> > >
> > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > directly, the TD will fail to create. IMO it's acceptable, as supposedly
> > > no one should just use some random backend to run a TD.
> >
> > The platform will already do this, right? I don't understand why this
> > is trying to take proactive action versus documenting the error
> > conditions and steps someone needs to take to avoid unconvertible
> > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > relative performance properties between initiators and targets, it
> > seems fitting to also add security properties between initiators and
> > targets so someone can enumerate the numa-mempolicy that avoids
> > unconvertible memory.
> >
> > No special casing in hotplug code paths is needed.
> >
> > >
> > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > blocks based on memblock, but not e820. The pages managed by page allocator are
> > > from memblock anyway (w/o those from memory hotplug).
> > >
> > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > memblock is still alive. When TDX module is initialized during runtime, TDMRs
> > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > blocks we gathered based on memblock during boot. This is also more flexible to
> > > support other TDX memory from other sources such as CXL memory in the future.
> > >
> > > Please let me know if you have any objection. Thanks!
> >
> > It's already the case that x86 maintains sideband structures to
> > preserve memory after exiting the early memblock code. Mike, correct
> > me if I am wrong, but adding more is less desirable than just keeping
> > the memblock around?
>
> TBH, I didn't read the entire thread yet, but at first glance, keeping
> memblock around is much more preferable to adding yet another { .start,
> .end, .flags } data structure. To keep memblock after boot, all that is needed
> is something like
>
> select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
>
> I'll take a closer look next week on the entire series, maybe I'm missing
> some details.
>

Hi Mike,

Thanks for feedback.

Perhaps I haven't given a lot of details about the new TDX data structures, so
let me point out that the two new data structures 'struct tdx_memblock' and
'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
code only, which is pretty standalone. They are not supposed to be some basic
infrastructure that can be widely used by other random kernel components.

In fact, currently the only operation we need is to allow memblock to register
all memory regions as TDX memory blocks while memblock is still alive.
Therefore the new data structures can even be completely invisible to
other kernel components. For instance, TDX code can provide the below API w/o
exposing any data structures to other kernel components:

int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);

And we call the above API for each memory region in memblock while it is alive.
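
For example, a rough sketch of that registration (assuming the proposed API
above), walking memblock once during boot before it is discarded:

	static int __init tdx_register_boot_memory(void)
	{
		unsigned long start_pfn, end_pfn;
		int i, nid, ret;

		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
			/* tdx_add_memory_block() is the proposed API above. */
			ret = tdx_add_memory_block(PFN_PHYS(start_pfn),
						   PFN_PHYS(end_pfn), nid);
			if (ret)
				return ret;
		}
		return 0;
	}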

TDX code internally manages those memory regions via the new data structures
that I mentioned above, so we don't need to keep memblock after boot. The
advantage of this approach is that it is more flexible for supporting other
potential TDX memory resources (such as CXL memory) in the future.
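
One hypothetical shape for those internal structures (the actual series may
define them differently):

	/* Illustrative layout only. */
	struct tdx_memblock {
		struct list_head list;
		phys_addr_t start;
		phys_addr_t end;
		int nid;		/* for per-node PAMT allocation */
	};

	struct tdx_memory {
		struct list_head tmb_list;	/* list of tdx_memblock */
	};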

Otherwise, we can do as you suggested: select ARCH_KEEP_MEMBLOCK when
INTEL_TDX_HOST is on and have TDX code internally use the memblock API directly.

--
Thanks,
-Kai



2022-05-09 10:02:45

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Fri, 2022-05-06 at 08:57 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 6:47 PM Kai Huang <[email protected]> wrote:
> >
> > On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> > > On Thu, May 5, 2022 at 5:46 PM Kai Huang <[email protected]> wrote:
> > > >
> > > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <[email protected]> wrote:
> > > > > >
> > > > > > Thanks for feedback!
> > > > > >
> > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > > > [ add Mike ]
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > > > > > > [..]
> > > > > > > >
> > > > > > > > Hi Dave,
> > > > > > > >
> > > > > > > > Sorry to ping (trying to close this).
> > > > > > > >
> > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > > > > > > directly, the TD will fail to create. IMO it's acceptable, as supposedly
> > > > > > > > no one should just use some random backend to run a TD.
> > > > > > >
> > > > > > > The platform will already do this, right?
> > > > > > >
> > > > > >
> > > > > > In the current v3 implementation, we don't have any code to handle memory
> > > > > > hotplug, so nothing prevents people from adding legacy PMEMs as system
> > > > > > RAM using the kmem driver. In order to guarantee all pages managed by the page
> > > > >
> > > > > That's the fundamental question I am asking: why "guarantee all
> > > > > pages managed by the page allocator are TDX memory"? That seems overkill
> > > > > compared to indicating the incompatibility after the fact.
> > > >
> > > > As I explained, the reason is that I don't want to modify the page allocator to
> > > > distinguish TDX from non-TDX allocations, for instance by having to add a ZONE_TDX
> > > > and a GFP_TDX.
> > >
> > > Right, TDX details do not belong at that level, but it will work
> > > almost all the time if you do nothing to "guarantee" all TDX capable
> > > pages all the time.
> >
> > "almost all the time" do you mean?
> >
> > >
> > > > KVM depends on the host's page fault handler to allocate the page. In fact KVM
> > > > only consumes PFNs from the host's page tables. For now only RAM is TDX memory.
> > > > By guaranteeing that all pages in the page allocator are TDX memory, we can
> > > > easily use anonymous pages as TD guest memory.
> > >
> > > Again, TDX capable pages will be the overwhelming default; why are you
> > > worried about cluttering the memory hotplug path for niche corner
> > > cases?
> >
> > First, perhaps I forgot to mention that there are two concepts of TDX memory, so
> > let me clarify:
> >
> > 1) Convertible Memory Regions (CMRs). This is reported by BIOS (thus static) to
> > indicate which memory regions *can* be used as TDX memory. This basically means
> > all RAM during boot for now.
> >
> > 2) TD Memory Regions (TDMRs). Memory pages in CMRs are not automatically
> > TDX-usable memory. The TDX module needs to be configured with which
> > (convertible) memory regions can be used as TDX memory. The kernel is
> > responsible for choosing the ranges and configuring them into the TDX module.
> > If a convertible memory page is not included in TDMRs, the TDX module will
> > report an error when it is assigned to a TD.
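
(For illustration, a sketch of what a CMR-based convertibility check could
look like; the layout and names here are illustrative, with the CMR list
populated from the TDX module's TDH.SYS.INFO SEAMCALL:)

	struct cmr_info {
		u64 base;
		u64 size;
	};

	static struct cmr_info cmrs[32];	/* the spec bounds the CMR count */
	static int nr_cmrs;

	static bool range_is_convertible(u64 start, u64 end)
	{
		int i;

		for (i = 0; i < nr_cmrs; i++) {
			if (start >= cmrs[i].base &&
			    end <= cmrs[i].base + cmrs[i].size)
				return true;
		}
		return false;
	}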
> >
> > >
> > > Consider the fact that end users can break the kernel by specifying
> > > invalid memmap= command line options. The memory hotplug code does not
> > > take any steps to add safety in those cases because there are already
> > > too many ways it can go wrong. TDX is just one more corner case where
> > > the memmap= user needs to be careful. Otherwise, it is up to the
> > > platform firmware to make sure everything in the base memory map is
> > > TDX capable, and then all you need is documentation about the failure
> > > mode when extending "System RAM" beyond that baseline.
> >
> > So the fact is, if we don't include legacy PMEMs in TDMRs, and don't do
> > anything in memory hotplug, then if a user kmem-hot-adds legacy PMEMs as
> > system RAM, a live TD may eventually be killed.
> >
> > If such a case is a corner case that we don't need to guarantee, then even
> > better. And we have an additional reason that those legacy PMEMs don't need to
> > be in TDMRs. As you suggested, we can add some documentation to point this out.
> >
> > But the point of doing some code check and preventing memory hotplug is, as
> > Dave said, that we want this piece of code to work on *ANY* TDX capable machine,
> > including future machines which may, e.g., support NVDIMM/CXL memory as TDX
> > memory. If we don't do any code check in memory hotplug in this series, then
> > when this code runs on future platforms, a user can plug NVDIMM or CXL memory as
> > system RAM and thus break the assumption "all pages in the page allocator are
> > TDX memory", which potentially leads to live TDs being killed.
> >
> > Dave said we need to guarantee this code can work on *ANY* TDX machine. Some
> > documentation saying it only works on some platforms and that you shouldn't do
> > things on other platforms is not good enough:
> >
> > https://lore.kernel.org/lkml/[email protected]/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
>
> Yes, the incompatible cases cannot be ignored, but I disagree that
> they actively need to be prevented. One way to achieve that is to
> explicitly enumerate TDX capable memory and document how mempolicy can
> be used to avoid killing TDs.

Hi Dan,

Thanks for feedback.

Could you elaborate on what "explicitly enumerate TDX capable memory" means?
How to enumerate exactly?

And for "document how mempolicy can be used to avoid killing TDs", what
mempolicy (and what error reporting, which you mentioned below) are you
referring to?

I skipped replying to your two replies below as I think they refer to the
same "enumerate" and "mempolicy" that I am asking about above.

>
> > > > shmem to support a new fd-based backend which doesn't require mmap()ing
> > > > TD guest memory into host userspace:
> > > >
> > > > https://lore.kernel.org/kvm/[email protected]/
> > > >
> > > > Also, besides TD guest memory, there are some per-TD control data structures
> > > > (which must be TDX memory too) that need to be allocated for each TD. Normal
> > > > memory allocation APIs can be used for such allocations if we guarantee that
> > > > all pages in the page allocator are TDX memory.
> > >
> > > You don't need that guarantee; just check it after the fact and fail
> > > if that assertion fails. It should almost always be the case that it
> > > succeeds, and if it doesn't then something special is happening with
> > > that system and the end user has effectively opted out of TDX
> > > operation.
> >
> > This doesn't guarantee consistent behaviour. For instance, one TD may be
> > created successfully while a second fails. We should provide a consistent service.
>
> Yes, there need to be enumeration and policy knobs to avoid failures;
> hard-coded "no memory hotplug" hacks do not seem like the right enumeration
> and policy knobs to me.
>
> > The thing is, we need to configure some memory regions for the TDX module
> > anyway. To me there's no reason not to guarantee that all pages in the page
> > allocator are TDX memory.
> >
> > >
> > > > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > > > legacy PMEMs as TDX memory so that even if people really do add legacy PMEMs
> > > > > > as system RAM, we can still guarantee all pages in the page allocator are TDX memory.
> > > > >
> > > > > Why?
> > > >
> > > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > > > system RAM using the kmem driver, the assumption that "all pages in the page
> > > > allocator are TDX memory" is broken. A TD can be killed at runtime.
> > >
> > > Yes, that is what the end user asked for. If they don't want that to
> > > happen then the policy decision about using kmem needs to be updated
> > > in userspace, not hard-coded toward TDX inside the kernel.
> >
> > This is also fine to me. But please also see Dave's comment above.
>
> Dave is right: the implementation cannot just ignore the conflict. To
> me, enumeration plus error reporting allows for flexibility without
> hard-coding policy in the kernel.


--
Thanks,
-Kai



2022-05-09 11:39:04

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Sun, May 08, 2022 at 10:00:39PM +1200, Kai Huang wrote:
> On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote:
> > On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> > > [ add Mike ]
> > >
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <[email protected]> wrote:
> > > [..]
> > > >
> > > > Hi Dave,
> > > >
> > > > Sorry to ping (trying to close this).
> > > >
> > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > TDMRs. The worst case is when someone tries to use them as TD guest backend
> > > > directly, the TD will fail to create. IMO it's acceptable, as supposedly
> > > > no one should just use some random backend to run a TD.
> > >
> > > The platform will already do this, right? I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> > >
> > > No special casing in hotplug code paths is needed.
> > >
> > > >
> > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > > blocks based on memblock, but not e820. The pages managed by page allocator are
> > > > from memblock anyway (w/o those from memory hotplug).
> > > >
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > > memblock is still alive. When TDX module is initialized during runtime, TDMRs
> > > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > > blocks we gathered based on memblock during boot. This is also more flexible to
> > > > support other TDX memory from other sources such as CXL memory in the future.
> > > >
> > > > Please let me know if you have any objection. Thanks!
> > >
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code. Mike, correct
> > > me if I am wrong, but adding more is less desirable than just keeping
> > > the memblock around?
> >
> > TBH, I didn't read the entire thread yet, but at first glance, keeping
> > memblock around is much more preferable to adding yet another { .start,
> > .end, .flags } data structure. To keep memblock after boot, all that is needed
> > is something like
> >
> > select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
> >
> > I'll take a closer look next week on the entire series, maybe I'm missing
> > some details.
> >
>
> Hi Mike,
>
> Thanks for feedback.
>
> Perhaps I haven't given a lot of details about the new TDX data structures, so
> let me point out that the two new data structures 'struct tdx_memblock' and
> 'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
> code only, which is pretty standalone. They are not supposed to be some basic
> infrastructure that can be widely used by other random kernel components.

We already have the "pretty standalone" numa_meminfo that originally was used
to set up the NUMA memory topology, but now it's used by other code as well.
And the e820 tables also contain similar data, and they supposedly should be
used only at boot time, but in reality there are too many callbacks into
e820 long after the system is booted.

So any additional memory representation will only add to the overall
complexity, and we'll have even more "eventually consistent" collections of
{ .start, .end, .flags } structures.

> In fact, currently the only operation we need is to allow memblock to register
> all memory regions as TDX memory blocks while memblock is still alive.
> Therefore the new data structures can even be completely invisible to
> other kernel components. For instance, TDX code can provide the below API w/o
> exposing any data structures to other kernel components:
>
> int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
>
> And we call the above API for each memory region in memblock while it is alive.
>
> TDX code internally manages those memory regions via the new data structures
> that I mentioned above, so we don't need to keep memblock after boot. The
> advantage of this approach is that it is more flexible for supporting other
> potential TDX memory resources (such as CXL memory) in the future.

Please let's keep things simple. If other TDX memory resources need
different handling, that can be implemented then. For now, just enable
ARCH_KEEP_MEMBLOCK and use memblock to track TDX memory.
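
For instance, a sketch of this suggestion (TDMR construction elided): with
"select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST" as quoted above, TDX
initialization can simply walk memblock at runtime instead of keeping a
private copy of the regions:

	static int tdx_build_tdmrs_from_memblock(void)
	{
		phys_addr_t start, end;
		u64 i;

		/* memblock stays alive thanks to ARCH_KEEP_MEMBLOCK. */
		for_each_mem_range(i, &start, &end) {
			/* cover [start, end) with a TDMR, allocate PAMT, ... */
		}
		return 0;
	}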

> Otherwise, we can do as you suggested: select ARCH_KEEP_MEMBLOCK when
> INTEL_TDX_HOST is on and have TDX code internally use the memblock API directly.
>
> --
> Thanks,
> -Kai
>
>

--
Sincerely yours,
Mike.

2022-05-10 00:20:17

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

> >
> > Hi Mike,
> >
> > Thanks for feedback.
> >
> > Perhaps I haven't given a lot of details about the new TDX data structures, so
> > let me point out that the two new data structures 'struct tdx_memblock' and
> > 'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
> > code only, which is pretty standalone. They are not supposed to be some basic
> > infrastructure that can be widely used by other random kernel components.
>
> We already have the "pretty standalone" numa_meminfo that originally was used
> to set up the NUMA memory topology, but now it's used by other code as well.
> And the e820 tables also contain similar data, and they supposedly should be
> used only at boot time, but in reality there are too many callbacks into
> e820 long after the system is booted.
>
> So any additional memory representation will only add to the overall
> complexity, and we'll have even more "eventually consistent" collections of
> { .start, .end, .flags } structures.
>
> > In fact, currently the only operation we need is to allow memblock to register
> > all memory regions as TDX memory blocks while memblock is still alive.
> > Therefore the new data structures can even be completely invisible to
> > other kernel components. For instance, TDX code can provide the below API w/o
> > exposing any data structures to other kernel components:
> >
> > int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
> >
> > And we call the above API for each memory region in memblock while it is alive.
> >
> > TDX code internally manages those memory regions via the new data structures
> > that I mentioned above, so we don't need to keep memblock after boot. The
> > advantage of this approach is that it is more flexible for supporting other
> > potential TDX memory resources (such as CXL memory) in the future.
>
> Please let's keep things simple. If other TDX memory resources need
> different handling, that can be implemented then. For now, just enable
> ARCH_KEEP_MEMBLOCK and use memblock to track TDX memory.
>

Looks good to me. Thanks for the feedback.

--
Thanks,
-Kai



2022-05-10 15:34:18

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] TDX host kernel support

> > >
> > > >
> > > > Consider the fact that end users can break the kernel by specifying
> > > > invalid memmap= command line options. The memory hotplug code does not
> > > > take any steps to add safety in those cases because there are already
> > > > too many ways it can go wrong. TDX is just one more corner case where
> > > > the memmap= user needs to be careful. Otherwise, it is up to the
> > > > platform firmware to make sure everything in the base memory map is
> > > > TDX capable, and then all you need is documentation about the failure
> > > > mode when extending "System RAM" beyond that baseline.
> > >
> > > So the fact is, if we don't include legacy PMEMs in TDMRs, and don't do
> > > anything in memory hotplug, then if a user kmem-hot-adds legacy PMEMs as
> > > system RAM, a live TD may eventually be killed.
> > >
> > > If such a case is a corner case that we don't need to guarantee, then even
> > > better. And we have an additional reason that those legacy PMEMs don't need to
> > > be in TDMRs. As you suggested, we can add some documentation to point this out.
> > >
> > > But the point of doing some code check and preventing memory hotplug is, as
> > > Dave said, that we want this piece of code to work on *ANY* TDX capable machine,
> > > including future machines which may, e.g., support NVDIMM/CXL memory as TDX
> > > memory. If we don't do any code check in memory hotplug in this series, then
> > > when this code runs on future platforms, a user can plug NVDIMM or CXL memory as
> > > system RAM and thus break the assumption "all pages in the page allocator are
> > > TDX memory", which potentially leads to live TDs being killed.
> > >
> > > Dave said we need to guarantee this code can work on *ANY* TDX machine. Some
> > > documentation saying it only works on some platforms and that you shouldn't do
> > > things on other platforms is not good enough:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
> >
> > Yes, the incompatible cases cannot be ignored, but I disagree that
> > they actively need to be prevented. One way to achieve that is to
> > explicitly enumerate TDX capable memory and document how mempolicy can
> > be used to avoid killing TDs.
>
> Hi Dan,
>
> Thanks for feedback.
>
> Could you elaborate on what "explicitly enumerate TDX capable memory" means?
> How to enumerate exactly?
>
> And for "document how mempolicy can be used to avoid killing TDs", what
> mempolicy (and what error reporting, which you mentioned below) are you
> referring to?
>
> I skipped replying to your two replies below as I think they refer to the
> same "enumerate" and "mempolicy" that I am asking about above.
>
>

Hi Dan,

I guess "explicitly enumerate TDX capable memory" means getting the Convertible
Memory Regions (CMR). And "document how mempolicy can be used to avoid killing
TDs" means we say something like below in the documentation?

Any non-TDX-capable memory hot-add will result in non-TDX-capable pages
potentially being allocated to a TD, in which case a TD may fail to be
created or a live TD may be killed at runtime.

And "error reporting" do you mean in memory hot-add code path, we check whether
the new memory resource is TDX capable, if not we print some error similar to
above message in documentation, but still allow the memory hot-add to happen?

Something like the below in add_memory_resource() (tdx_cmr_contains() is a
made-up helper for "the new resource is fully covered by the CMRs")?

	if (platform_has_tdx() &&
	    !tdx_cmr_contains(res->start, resource_size(res)))
		pr_err("Hot-adding non-TDX memory on a TDX capable system. A TD may fail to be created, or a live TD may be killed at runtime.\n");

	/* allow the memory hot-add anyway */


I have the below concerns with this approach:

1) I think we should provide a consistent service to the user. That is, either
we guarantee that a TD won't randomly fail to be created and a running TD won't
be killed at runtime, or we don't provide any TDX functionality at all. So I am
not sure that only "documenting how mempolicy can be used to avoid killing TDs"
is good enough.

2) The above code to check whether a new memory resource is in the CMRs or not
requires the kernel to get the CMRs during boot. However, getting the CMRs
requires calling SEAMCALL, which requires the kernel to support VMXON/VMXOFF.
VMXON/VMXOFF is currently only handled by KVM. We'd like to avoid adding
VMXON/VMXOFF to the core kernel now if not mandatory, as eventually we will
very likely need a reference-based approach to calling VMXON/VMXOFF. This part
is explained in the cover letter of this series.

Dave suggested that to keep things simple for now, we can use a "winner take
all" approach: if TDX is initialized first, don't allow memory hotplug. If
memory hotplug happens first, don't allow TDX to be initialized.

https://lore.kernel.org/lkml/[email protected]/T/#mfa6b5dcc536d8a7b78522f46ccd1230f84d52ae0

I think this is perhaps more reasonable as we are at least providing a
consistent service to the user. And with this approach we don't need to handle
VMXON/VMXOFF in the core kernel.
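
A minimal sketch of that "winner take all" arbitration (hypothetical names,
not the actual patch): whichever of TDX initialization and memory hotplug
happens first blocks the other.

	static DEFINE_MUTEX(tdx_arbitration_lock);
	static bool tdx_module_initialized;
	static bool non_tdx_memory_added;

	/* Called early in TDX module initialization. */
	int tdx_arbitrate_init(void)
	{
		int ret = 0;

		mutex_lock(&tdx_arbitration_lock);
		if (non_tdx_memory_added)
			ret = -EBUSY;	/* hotplug won: refuse TDX init */
		else
			tdx_module_initialized = true;
		mutex_unlock(&tdx_arbitration_lock);
		return ret;
	}

	/* Called from the memory hot-add path, e.g. add_memory_resource(). */
	int tdx_arbitrate_memory_hotplug(void)
	{
		int ret = 0;

		mutex_lock(&tdx_arbitration_lock);
		if (tdx_module_initialized)
			ret = -EPERM;	/* TDX won: refuse the hot-add */
		else
			non_tdx_memory_added = true;
		mutex_unlock(&tdx_arbitration_lock);
		return ret;
	}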

Comments?


--
Thanks,
-Kai



2022-05-18 16:20:20

by Sagi Shahar

[permalink] [raw]
Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error

On Tue, Apr 26, 2022 at 5:06 PM Kai Huang <[email protected]> wrote:
>
> On Tue, 2022-04-26 at 13:59 -0700, Dave Hansen wrote:
> > On 4/5/22 21:49, Kai Huang wrote:
> > > TDX supports shutting down the TDX module at any time during its
> > > lifetime. After TDX module is shut down, no further SEAMCALL can be
> > > made on any logical cpu.
> >
> > Is this strictly true?
> >
> > I thought SEAMCALLs were used for the P-SEAMLDR too.
>
> Sorry, I will change it to: no TDX module SEAMCALL can be made on any logical cpu.
>
> [...]
>
> > >
> > > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > > +struct seamcall_ctx {
> > > + u64 fn;
> > > + u64 rcx;
> > > + u64 rdx;
> > > + u64 r8;
> > > + u64 r9;
> > > + atomic_t err;
> > > + u64 seamcall_ret;
> > > + struct tdx_module_output out;
> > > +};
> > > +
> > > +static void seamcall_smp_call_function(void *data)
> > > +{
> > > + struct seamcall_ctx *sc = data;
> > > + int ret;
> > > +
> > > + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > > + &sc->seamcall_ret, &sc->out);

Are the seamcall_ret and out fields in seamcall_ctx going to be used?
Right now it looks like no one is going to read them.
If they are going to be used, then this is going to cause a race, since
the different CPUs are going to write concurrently to the same addresses
inside seamcall().
We should either use local memory and write it out with atomic_set(), as is
done for the err field, or hard-code NULL at the call site if they are
not going to be used.

> > > + if (ret)
> > > + atomic_set(&sc->err, ret);
> > > +}
> > > +
> > > +/*
> > > + * Call the SEAMCALL on all online cpus concurrently.
> > > + * Return error if SEAMCALL fails on any cpu.
> > > + */
> > > +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > > +{
> > > + on_each_cpu(seamcall_smp_call_function, sc, true);
> > > + return atomic_read(&sc->err);
> > > +}
> >
> > Why bother returning something that's not read?
>
> It's not needed. I'll make it void.
>
> Caller can check seamcall_ctx::err directly if they want to know whether any
> error happened.
>
>
>
> --
> Thanks,
> -Kai
>
>

Sagi

2022-05-19 05:15:58

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error

On Wed, 2022-05-18 at 09:19 -0700, Sagi Shahar wrote:
> On Tue, Apr 26, 2022 at 5:06 PM Kai Huang <[email protected]> wrote:
> >
> > On Tue, 2022-04-26 at 13:59 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > TDX supports shutting down the TDX module at any time during its
> > > > lifetime. After TDX module is shut down, no further SEAMCALL can be
> > > > made on any logical cpu.
> > >
> > > Is this strictly true?
> > >
> > > I thought SEAMCALLs were used for the P-SEAMLDR too.
> >
> > Sorry, I will change it to: no TDX module SEAMCALL can be made on any logical cpu.
> >
> > [...]
> >
> > > >
> > > > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > > > +struct seamcall_ctx {
> > > > + u64 fn;
> > > > + u64 rcx;
> > > > + u64 rdx;
> > > > + u64 r8;
> > > > + u64 r9;
> > > > + atomic_t err;
> > > > + u64 seamcall_ret;
> > > > + struct tdx_module_output out;
> > > > +};
> > > > +
> > > > +static void seamcall_smp_call_function(void *data)
> > > > +{
> > > > + struct seamcall_ctx *sc = data;
> > > > + int ret;
> > > > +
> > > > + ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > > > + &sc->seamcall_ret, &sc->out);
>
> Are the seamcall_ret and out fields in seamcall_ctx going to be used?
> Right now it looks like no one is going to read them.
> If they are going to be used, then this is going to cause a race, since
> the different CPUs are going to write concurrently to the same addresses
> inside seamcall().
> We should either use local memory and write it out with atomic_set(), as is
> done for the err field, or hard-code NULL at the call site if they are
> not going to be used.
> > > >

Thanks for catching this. Neither 'seamcall_ret' nor 'out' is actually used
in this series, but this needs to be improved for sure.

I think I can just remove them from the 'seamcall_ctx' for now, since they are
not used at all.
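
For illustration, a sketch of that cleanup (not the actual patch): drop the
shared 'seamcall_ret' and 'out' fields so concurrent CPUs never write to the
same addresses, and give each CPU a stack-local output instead.

	/* Sketch of the cleanup discussed above. */
	struct seamcall_ctx {
		u64 fn;
		u64 rcx;
		u64 rdx;
		u64 r8;
		u64 r9;
		atomic_t err;
	};

	static void seamcall_smp_call_function(void *data)
	{
		struct seamcall_ctx *sc = data;
		struct tdx_module_output out;	/* per-CPU, no shared write */
		int ret;

		ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
			       NULL, &out);
		if (ret)
			atomic_set(&sc->err, ret);
	}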

--
Thanks,
-Kai