Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. TDX specs are available in [1].
This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests. KVM support for TDX is being
developed separately[2]. A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed[3]. KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.
Also, the first few generations of TDX hardware have an erratum[4] and
require additional handling.
This series doesn't aim to support all functionalities, and doesn't aim
to resolve all things perfectly. All other optimizations will be posted
as follow-up once this initial TDX support is upstreamed.
(For memory hotplug, sorry for broadcasting widely but I cc'ed the
[email protected] following Kirill's suggestion so MM experts can also
help to provide comments.)
Hi Dave/Kirill/Tony/David and all,
Thanks for your review on the previous versions. Appreciate your review
on this version and any tag if patches look good to you. Thanks!
----- Changelog history: ------
- v11 -> v12:
- Addressed comments in v11 from Dave/Kirill/David and others.
- Collected review tags from Dave/Kirill/David and others.
- Split the SEAMCALL infrastructure patch into 2 patches for better
review.
- Added one more patch to keep TDMRs when module initialization is
successful, for better review.
v11: https://lore.kernel.org/lkml/[email protected]/T/
- v10 -> v11:
- Addressed comments in v10
- Added patches to handle TDX "partial write machine check" erratum.
- Added a new patch to handle running out of entropy in common code.
- Fixed a bug in kexec() support.
v10: https://lore.kernel.org/kvm/[email protected]/
- v9 -> v10:
- Changed the per-cpu initialization handling
- Gave up "ensuring all online cpus are TDX-runnable when TDX module
is initialized", and instead provide two basic functions, tdx_enable()
and tdx_cpu_enable(), to let the user of TDX make sure
tdx_cpu_enable() has been done successfully before using a particular
cpu for TDX.
- Thus, moved per-cpu initialization out of tdx_enable(). Now
tdx_enable() just assumes VMXON and tdx_cpu_enable() has been done
on all online cpus before calling it.
- Merged the tdx_enable() skeleton patch and per-cpu initialization
patch together to tell better story.
- Moved "SEAMCALL infrastructure" patch before the tdx_enable() patch.
v9: https://lore.kernel.org/lkml/[email protected]/
- v8 -> v9:
- Added patches to handle TDH.SYS.INIT and TDH.SYS.LP.INIT back.
- For other changes, please refer to the changelog history in individual
patches.
v8: https://lore.kernel.org/lkml/[email protected]/
- v7 -> v8:
- 200+ LOC removed (from 1800+ -> 1600+).
- Removed patches to do TDH.SYS.INIT and TDH.SYS.LP.INIT
(Dave/Peter/Thomas).
- Removed patch to shut down TDX module (Sean).
- For memory hotplug, changed to reject non-TDX memory from
arch_add_memory() to memory_notifier (Dan/David).
- Simplified the "skeleton patch" as a result of removing the
TDH.SYS.LP.INIT patch.
- Refined changelog/comments for most of the patches (to tell better
story, remove silly comments, etc) (Dave).
- Added new 'struct tdmr_info_list' struct, and changed all TDMR related
patches to use it (Dave).
- Effectively merged patch "Reserve TDX module global KeyID" and
"Configure TDX module with TDMRs and global KeyID", and removed the
static variable 'tdx_global_keyid', following Dave's suggestion on
making tdx_sysinfo local variable.
- For detailed changes please see individual patch changelog history.
v7: https://lore.kernel.org/lkml/[email protected]/T/
- v6 -> v7:
- Added memory hotplug support.
- Changed how to choose the list of "TDX-usable" memory regions from at
kernel boot time to TDX module initialization time.
- Addressed comments received in previous versions. (Andi/Dave).
- Improved the commit message and the comments of the kexec() support
patch, and the patch that handles returning PAMTs back to the kernel
when TDX module initialization fails. Please also see "kexec()" section
below.
- Changed the documentation patch accordingly.
- For all others please see individual patch changelog history.
v6: https://lore.kernel.org/lkml/[email protected]/T/
- v5 -> v6:
- Removed ACPI CPU/memory hotplug patches. (Intel internal discussion)
- Removed patch to disable driver-managed memory hotplug (Intel
internal discussion).
- Added one patch to introduce enum type for TDX supported page size
level to replace the hard-coded values in TDX guest code (Dave).
- Added one patch to make TDX depends on X2APIC being enabled (Dave).
- Added one patch to build all boot-time present memory regions as TDX
memory during kernel boot.
- Added Reviewed-by from others to some patches.
- For all others please see individual patch changelog history.
v5: https://lore.kernel.org/lkml/[email protected]/T/
- v4 -> v5:
This is essentially a resend of v4. Sorry I forgot to consult
get_maintainer.pl when sending out v4, so I forgot to add the linux-acpi
and linux-mm mailing lists and the relevant people for 4 new patches.
There are also very minor code and commit message update from v4:
- Rebased to latest tip/x86/tdx.
- Fixed a checkpatch issue that I missed in v4.
- Removed an obsoleted comment that I missed in patch 6.
- Very minor update to the commit message of patch 12.
For other changes to individual patches since v3, please refer to the
changelog history of individual patches (I just used v3 -> v5 since
there's basically no code change in v4).
v4: https://lore.kernel.org/lkml/98c84c31d8f062a0b50a69ef4d3188bc259f2af2.1654025431.git.kai.huang@intel.com/T/
- v3 -> v4 (addressed Dave's comments, and other comments from others):
- Simplified SEAMRR and TDX keyID detection.
- Added patches to handle ACPI CPU hotplug.
- Added patches to handle ACPI memory hotplug and driver managed memory
hotplug.
- Removed tdx_detect() but only use single tdx_init().
- Removed detecting TDX module via P-SEAMLDR.
- Changed from using e820 to using memblock to convert system RAM to TDX
memory.
- Excluded legacy PMEM from TDX memory.
- Removed the boot-time command line to disable TDX patch.
- Addressed comments for other individual patches (please see individual
patches).
- Improved the documentation patch based on the new implementation.
v3: https://lore.kernel.org/lkml/[email protected]/T/
- V2 -> v3:
- Addressed comments from Isaku.
- Fixed memory leak and unnecessary function argument in the patch to
configure the key for the global keyid (patch 17).
- Enhanced a little bit to the patch to get TDX module and CMR
information (patch 09).
- Fixed an unintended change in the patch to allocate PAMT (patch 13).
- Addressed comments from Kevin:
- Slightly improvement on commit message to patch 03.
- Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
seamrr_enabled() (patch 04).
- Changed documentation patch to add TDX host kernel support materials
to Documentation/x86/tdx.rst together with the TDX guest stuff, instead
of a standalone file (patch 21).
- Very minor improvement in commit messages.
v2: https://lore.kernel.org/lkml/[email protected]/T/
- RFC (v1) -> v2:
- Rebased to Kirill's latest TDX guest code.
- Fixed two issues that are related to finding all RAM memory regions
based on e820.
- Minor improvement on comments and commit messages.
v1: https://lore.kernel.org/lkml/[email protected]/T/
== Background ==
TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated region as a trusted hypervisor to create/run protected VMs.
TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.
TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection. There, the firmware
running on the secure processor plays a role similar to the TDX module.
The host kernel communicates with SEAM software via a new SEAMCALL
instruction. This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.
Before being able to manage TD guests, the TDX module must be loaded
and properly initialized. This series assumes the TDX module is loaded
by BIOS before the kernel boots.
How to initialize the TDX module is described in the TDX module 1.0
specification, chapter "13. Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".
== Design Considerations ==
1. Initialize the TDX module at runtime
There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run. This
series implements the runtime initialization.
Also, TDX requires a per-cpu initialization SEAMCALL to be done before
making any SEAMCALL on that cpu.
This series adds two functions: tdx_cpu_enable() and tdx_enable() to do
per-cpu initialization and module initialization respectively.
2. CPU hotplug
TDX doesn't support physical (ACPI) CPU hotplug. A non-buggy BIOS should
never support hotpluggable CPU devices or deliver ACPI CPU hotplug
events to the kernel. This series doesn't handle physical (ACPI) CPU
hotplug at all but depends on the BIOS to behave correctly.
Also, tdx_cpu_enable() will simply return an error for any hot-added cpu
if something insane happens.
Note TDX works with CPU logical online/offline, thus this series still
allows logical CPU online/offline.
3. Kernel policy on TDX memory
The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable. The TDX architecture
allows the VMM to designate specific convertible memory regions as usable
for TDX private memory.
The initial support of TDX guests will only allocate TDX private memory
from the global page allocator. This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory, to guarantee all pages in the page allocator are TDX pages.
4. Memory Hotplug
After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed during the
runtime. No more "TDX-usable" memory can be added to the TDX module
after that.
To uphold the above guarantee that all pages in the page allocator are
TDX pages, this series simply chooses to reject any non-TDX-usable
memory in memory hotplug.
5. Physical Memory Hotplug
Note TDX assumes convertible memory is always physically present during
the machine's runtime. A non-buggy BIOS should never support hot-removal
any convertible memory. This implementation doesn't handle ACPI memory
removal but depends on the BIOS to behave correctly.
Also, if something insane really happens, 4) makes sure either TDX
cannot be enabled or hot-added memory will be rejected after TDX gets
enabled.
6. Kexec()
Similar to AMD's SME, during kexec() the kernel needs to flush dirty
cachelines of TDX private memory, otherwise they may silently corrupt
the new kernel.
7. TDX erratum
The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check.
The fast warm reset reboot doesn't reset TDX private memory. With this
erratum, all TDX private pages need to be converted back to normal
before a fast warm reset reboot or booting to the new kernel in kexec().
Otherwise, the new kernel may get unexpected machine checks.
Under normal conditions, triggering the erratum in Linux requires some
kind of kernel bug involving relatively exotic memory writes to TDX
private memory, and will manifest via spurious-looking machine checks
when reading the affected memory. The machine check handler is improved
to deal with such machine checks.
[1]: TDX specs
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[2]: KVM TDX basic feature support
https://lore.kernel.org/kvm/[email protected]/T/#t
[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/kvm/[email protected]/
[4]: TDX erratum
https://cdrdv2.intel.com/v1/dl/getContent/772415?explicitVersion=true
Kai Huang (22):
x86/tdx: Define TDX supported page sizes as macros
x86/virt/tdx: Detect TDX during kernel boot
x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
x86/cpu: Detect TDX partial write machine check erratum
x86/virt/tdx: Add SEAMCALL infrastructure
x86/virt/tdx: Handle SEAMCALL running out of entropy error
x86/virt/tdx: Add skeleton to enable TDX on demand
x86/virt/tdx: Get information about TDX module and TDX-capable memory
x86/virt/tdx: Use all system memory when initializing TDX module as
TDX memory
x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
memory regions
x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
x86/virt/tdx: Allocate and set up PAMTs for TDMRs
x86/virt/tdx: Designate reserved areas for all TDMRs
x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
x86/virt/tdx: Configure global KeyID on all packages
x86/virt/tdx: Initialize all TDMRs
x86/kexec: Flush cache of TDX private memory
x86/virt/tdx: Keep TDMRs when module initialization is successful
x86/kexec(): Reset TDX private memory on platforms with TDX erratum
x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
x86/mce: Improve error log of kernel space TDX #MC due to erratum
Documentation/x86: Add documentation for TDX host support
Documentation/arch/x86/tdx.rst | 189 +++-
arch/x86/Kconfig | 15 +
arch/x86/Makefile | 2 +
arch/x86/coco/tdx/tdx.c | 6 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 3 +
arch/x86/include/asm/tdx.h | 26 +
arch/x86/kernel/cpu/intel.c | 17 +
arch/x86/kernel/cpu/mce/core.c | 33 +
arch/x86/kernel/machine_kexec_64.c | 9 +
arch/x86/kernel/process.c | 7 +-
arch/x86/kernel/reboot.c | 15 +
arch/x86/kernel/setup.c | 2 +
arch/x86/virt/Makefile | 2 +
arch/x86/virt/vmx/Makefile | 2 +
arch/x86/virt/vmx/tdx/Makefile | 2 +
arch/x86/virt/vmx/tdx/seamcall.S | 52 +
arch/x86/virt/vmx/tdx/tdx.c | 1542 ++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 151 +++
arch/x86/virt/vmx/tdx/tdxcall.S | 19 +-
20 files changed, 2078 insertions(+), 17 deletions(-)
create mode 100644 arch/x86/virt/Makefile
create mode 100644 arch/x86/virt/vmx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
create mode 100644 arch/x86/virt/vmx/tdx/tdx.h
base-commit: 94142c9d1bdf1c18027a42758ceb6bdd59a92012
--
2.40.1
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.
Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs. The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.
During machine boot, TDX microcode verifies that the BIOS has programmed
TDX private KeyIDs consistently and correctly across all CPU packages.
The MSRs are locked in this state after verification. This
is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration:
it indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.
The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests. The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.
Add a new early_initcall(tdx_init) to detect TDX by detecting TDX
private KeyIDs. Also add a function to report whether TDX is enabled by
the BIOS. Similar to AMD SME, kexec() will use it to determine whether
cache flush is needed.
The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata. Each TDX guest also needs a TDX KeyID for its
own protection. Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests. If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.
To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support). So far only KVM uses TDX. Make the new config option
depend on KVM_INTEL.
Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
v11 -> v12:
- Improve setting up guest's TDX keyID range (David)
- ++tdx_keyid_start -> tdx_keyid_start + 1
- --nr_tdx_keyids -> nr_tdx_keyids - 1
- 'return -ENODEV' instead of 'goto no_tdx' (Sathy)
- pr_info() -> pr_err() (Isaku)
- Added tags from Isaku/David
v10 -> v11 (David):
- "host kernel" -> "the host kernel"
- "protected VM" -> "confidential VM".
- Moved setting tdx_global_keyid to the end of tdx_init().
v9 -> v10:
- No change.
v8 -> v9:
- Moved MSR macro from local tdx.h to <asm/msr-index.h> (Dave).
- Moved reserving the TDX global KeyID from later patch to here.
- Changed 'tdx_keyid_start' and 'nr_tdx_keyids' to
'tdx_guest_keyid_start' and 'tdx_nr_guest_keyids' to represent KeyIDs
can be used by guest. (Dave)
- Slight changelog update according to above changes.
v7 -> v8: (address Dave's comments)
- Improved changelog:
- "KVM user" -> "The TDX module will be initialized by KVM when ..."
- Changed "tdx_int" part to "Just say what this patch is doing"
- Fixed the last sentence of "kexec()" paragraph
- detect_tdx() -> record_keyid_partitioning()
- Improved how to calculate tdx_keyid_start.
- tdx_keyid_num -> nr_tdx_keyids.
- Improved dmesg printing.
- Add comment to clear_tdx().
v6 -> v7:
- No change.
v5 -> v6:
- Removed SEAMRR detection to make code simpler.
- Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
- Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).
---
arch/x86/Kconfig | 12 +++++
arch/x86/Makefile | 2 +
arch/x86/include/asm/msr-index.h | 3 ++
arch/x86/include/asm/tdx.h | 7 +++
arch/x86/virt/Makefile | 2 +
arch/x86/virt/vmx/Makefile | 2 +
arch/x86/virt/vmx/tdx/Makefile | 2 +
arch/x86/virt/vmx/tdx/tdx.c | 90 ++++++++++++++++++++++++++++++++
8 files changed, 120 insertions(+)
create mode 100644 arch/x86/virt/Makefile
create mode 100644 arch/x86/virt/vmx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/Makefile
create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..191587f75810 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1952,6 +1952,18 @@ config X86_SGX
If unsure, say N.
+config INTEL_TDX_HOST
+ bool "Intel Trust Domain Extensions (TDX) host support"
+ depends on CPU_SUP_INTEL
+ depends on X86_64
+ depends on KVM_INTEL
+ help
+ Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+ host and certain physical attacks. This option enables necessary TDX
+ support in the host kernel to run confidential VMs.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b39975977c03..ec0e71d8fa30 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -252,6 +252,8 @@ archheaders:
libs-y += arch/x86/lib/
+core-y += arch/x86/virt/
+
# drivers-y are linked after core-y
drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
drivers-$(CONFIG_PCI) += arch/x86/pci/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..6d8f15b1552c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -523,6 +523,9 @@
#define MSR_RELOAD_PMC0 0x000014c1
#define MSR_RELOAD_FIXED_CTR0 0x00001309
+/* KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087
+
/*
* AMD64 MSRs. Not complete. See the architecture manual for a more
* complete list.
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 25fd6070dc0b..4dfe2e794411 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
return -ENODEV;
}
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else /* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST) += tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..93ca8b73e1f1
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..908590e85749
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2023 Intel Corporation.
+ *
+ * Intel Trust Domain Extensions (TDX) host support
+ */
+
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+
+static u32 tdx_global_keyid __ro_after_init;
+static u32 tdx_guest_keyid_start __ro_after_init;
+static u32 tdx_nr_guest_keyids __ro_after_init;
+
+static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
+ u32 *nr_tdx_keyids)
+{
+ u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
+ int ret;
+
+ /*
+ * IA32_MKTME_KEYID_PARTITIONING:
+ * Bit [31:0]: Number of MKTME KeyIDs.
+ * Bit [63:32]: Number of TDX private KeyIDs.
+ */
+ ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
+ &_nr_tdx_keyids);
+ if (ret)
+ return -ENODEV;
+
+ if (!_nr_tdx_keyids)
+ return -ENODEV;
+
+ /* TDX KeyIDs start after the last MKTME KeyID. */
+ _tdx_keyid_start = _nr_mktme_keyids + 1;
+
+ *tdx_keyid_start = _tdx_keyid_start;
+ *nr_tdx_keyids = _nr_tdx_keyids;
+
+ return 0;
+}
+
+static int __init tdx_init(void)
+{
+ u32 tdx_keyid_start, nr_tdx_keyids;
+ int err;
+
+ err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids);
+ if (err)
+ return err;
+
+ pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
+ tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
+
+ /*
+ * The TDX module itself requires one 'global KeyID' to protect
+ * its metadata. If there's only one TDX KeyID, there won't be
+ * any left for TDX guests thus there's no point to enable TDX
+ * at all.
+ */
+ if (nr_tdx_keyids < 2) {
+ pr_err("initialization failed: too few private KeyIDs available.\n");
+ return -ENODEV;
+ }
+
+ /*
+ * Just use the first TDX KeyID as the 'global KeyID' and
+ * leave the rest for TDX guests.
+ */
+ tdx_global_keyid = tdx_keyid_start;
+ tdx_guest_keyid_start = tdx_keyid_start + 1;
+ tdx_nr_guest_keyids = nr_tdx_keyids - 1;
+
+ return 0;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+ return !!tdx_global_keyid;
+}
--
2.40.1
There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages; 2) There might be dirty cachelines associated
with TDX private pages.
The first problem doesn't matter on platforms without the "partial write
machine check" erratum. KeyID 0 doesn't have an integrity check. If the
new kernel wants to use any non-zero KeyID, it needs to convert the
memory to that KeyID, and such conversion would work from any KeyID.
However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).
There are two things that the old kernel needs to do to achieve that:
1) Stop accessing TDX private memory mappings:
a. Stop making TDX module SEAMCALLs (TDX global KeyID);
b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.
For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
support. And in this way 1) happens for free as there's no TDX activity
between wbinvd() and the native_halt().
Flushing the cache in stop_this_cpu() only flushes the cache on the
remote cpus. On the rebooting cpu which does kexec(), unlike SME which
does the cache flush in relocate_kernel(), flush the cache right after
stopping the remote cpus in machine_shutdown().
There are two reasons to do so: 1) For TDX there's no need to defer
cache flush to relocate_kernel() because all TDX activities have been
stopped. 2) On the platforms with the above erratum the kernel must
convert all TDX private pages back to normal before booting to the new
kernel in kexec(), and flushing cache early allows the kernel to convert
memory early rather than having to muck with the relocate_kernel()
assembly.
Theoretically, the cache flush is only needed when the TDX module has
been initialized. However, initializing the TDX module is done on demand
at runtime, and it takes a mutex to read the module status. Just check
whether TDX is enabled by the BIOS instead when deciding whether to
flush the cache.
Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
---
v11 -> v12:
- Changed comment/changelog to say kernel doesn't try to handle fast
warm reset but depends on BIOS to enable workaround (Kirill)
- Added Kirill's tag
v10 -> v11:
- Fixed a bug that cache for rebooting cpu isn't flushed for TDX private
memory.
- Updated changelog accordingly.
v9 -> v10:
- No change.
v8 -> v9:
- Various changelog enhancement and fix (Dave).
- Improved comment (Dave).
v7 -> v8:
- Changelog:
- Removed "leave TDX module open" part due to shut down patch has been
removed.
v6 -> v7:
- Improved changelog to explain why don't convert TDX private pages back
to normal.
---
arch/x86/kernel/process.c | 7 ++++++-
arch/x86/kernel/reboot.c | 15 +++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dac41a0072ea..0ce66deb9bc8 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -780,8 +780,13 @@ void __noreturn stop_this_cpu(void *dummy)
*
* Test the CPUID bit directly because the machine might've cleared
* X86_FEATURE_SME due to cmdline options.
+ *
+ * The TDX module or guests might have left dirty cachelines
+ * behind. Flush them to avoid corruption from later writeback.
+ * Note that this flushes on all systems where TDX is possible,
+ * but does not actually check that TDX was in use.
*/
- if (cpuid_eax(0x8000001f) & BIT(0))
+ if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
native_wbinvd();
for (;;) {
/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 3adbe97015c1..ae7480a213a6 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -32,6 +32,7 @@
#include <asm/realmode.h>
#include <asm/x86_init.h>
#include <asm/efi.h>
+#include <asm/tdx.h>
/*
* Power off function, if any
@@ -695,6 +696,20 @@ void native_machine_shutdown(void)
local_irq_disable();
stop_other_cpus();
#endif
+ /*
+ * stop_other_cpus() has flushed all dirty cachelines of TDX
+ * private memory on remote cpus. Unlike SME, which does the
+ * cache flush on _this_ cpu in the relocate_kernel(), flush
+ * the cache for _this_ cpu here. This is because on the
+ * platforms with "partial write machine check" erratum the
+ * kernel needs to convert all TDX private pages back to normal
+ * before booting to the new kernel in kexec(), and the cache
+ * flush must be done before that. If the kernel took SME's way,
+ * it would have to muck with the relocate_kernel() assembly to
+ * do memory conversion.
+ */
+ if (platform_tdx_enabled())
+ native_wbinvd();
lapic_shutdown();
restore_boot_irq_mode();
--
2.40.1
TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This
mode runs only the TDX module itself or other code to load the TDX
module.
The host kernel communicates with SEAM software via a new SEAMCALL
instruction. This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead. The TDX
module establishes a new SEAMCALL ABI which allows the host to
initialize the module and to manage VMs.
Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar
to the TDCALL ABI and leverages much of the TDCALL infrastructure.
Also add a wrapper function for SEAMCALL to convert the SEAMCALL error
code to a kernel error code, and print out the SEAMCALL error code to
help the user understand what went wrong.
Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
---
v11 -> v12:
- Moved _ASM_EXT_TABLE() for #UD/#GP to a later patch for better patch
review, and removed related part from changelog.
- Minor code changes in seamcall() (David)
- Added Isaku's tag
v10 -> v11:
- No update
v9 -> v10:
- Make the TDX_SEAMCALL_{GP|UD} error codes unconditional but doesn't
define them when INTEL_TDX_HOST is enabled. (Dave)
- Slightly improved changelog to explain why add assembly code to handle
#UD and #GP.
v8 -> v9:
- Changed patch title (Dave).
- Enhanced seamcall() to include the cpu id to the error message when
SEAMCALL fails.
v7 -> v8:
- Improved changelog (Dave):
- Trim down some sentences (Dave).
- Removed __seamcall() and seamcall() function name and changed
accordingly (Dave).
- Improved the sentence explaining why to handle #GP (Dave).
- Added code to print out error message in seamcall(), following
the idea that tdx_enable() to return universal error and print out
error message to make clear what's going wrong (Dave). Also mention
this in changelog.
v6 -> v7:
- No change.
v5 -> v6:
- Added code to handle #UD and #GP (Dave).
- Moved the seamcall() wrapper function to this patch, and used a
temporary __always_unused to avoid compile warning (Dave).
- v3 -> v5 (no feedback on v4):
- Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
SEAMCALL itself fails.
- Improve the changelog.
---
arch/x86/virt/vmx/tdx/Makefile | 2 +-
arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 10 ++++++
4 files changed, 105 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
create mode 100644 arch/x86/virt/vmx/tdx/tdx.h
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 93ca8b73e1f1..38d534f2c113 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-y += tdx.o
+obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f81be6b9c133
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ * (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function. Additional
+ * output operands are saved in @out (if it is provided by the caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn (RDI) - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * used by the P-SEAMLDR or the TDX
+ * module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+ FRAME_BEGIN
+ TDX_MODULE_CALL host=1
+ FRAME_END
+ RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 908590e85749..f8233cba5931 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -12,14 +12,56 @@
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/printk.h>
+#include <linux/smp.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/tdx.h>
+#include "tdx.h"
static u32 tdx_global_keyid __ro_after_init;
static u32 tdx_guest_keyid_start __ro_after_init;
static u32 tdx_nr_guest_keyids __ro_after_init;
+/*
+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
+ * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
+ * leaf function return code and the additional output respectively if
+ * not NULL.
+ */
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ u64 *seamcall_ret,
+ struct tdx_module_output *out)
+{
+ u64 sret;
+ int cpu;
+
+ /* Need a stable CPU id for printing error message */
+ cpu = get_cpu();
+ sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+ put_cpu();
+
+ /* Save SEAMCALL return code if the caller wants it */
+ if (seamcall_ret)
+ *seamcall_ret = sret;
+
+ switch (sret) {
+ case 0:
+ /* SEAMCALL was successful */
+ return 0;
+ case TDX_SEAMCALL_VMFAILINVALID:
+ pr_err_once("module is not loaded.\n");
+ return -ENODEV;
+ default:
+ pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
+ cpu, fn, sret);
+ if (out)
+ pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
+ out->rcx, out->rdx, out->r8,
+ out->r9, out->r10, out->r11);
+ return -EIO;
+ }
+}
+
static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
u32 *nr_tdx_keyids)
{
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..48ad1a1ba737
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/types.h>
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+#endif
--
2.40.1
On platforms with the "partial write machine check" erratum, kexec()
needs to convert all TDX private pages back to normal before booting
into the new kernel. Otherwise, the new kernel may get an unexpected
machine check.
There's no existing infrastructure to track TDX private pages. Change
to keep TDMRs when module initialization is successful so that they can
be used to find PAMTs.
With this change, only put_online_mems() and freeing the buffers of
the TDSYSINFO_STRUCT and the CMR array still need to be done when
module initialization succeeds. Adjust the error handling to do them
explicitly on success, and unconditionally clean up the rest when
initialization fails.
Signed-off-by: Kai Huang <[email protected]>
---
v11 -> v12 (new patch):
- Defer keeping TDMRs logic to this patch for better review
- Improved error handling logic (Nikolay/Kirill in patch 15)
---
arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
1 file changed, 42 insertions(+), 42 deletions(-)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 52b7267ea226..85b24b2e9417 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
static LIST_HEAD(tdx_memlist);
+static struct tdmr_info_list tdx_tdmr_list;
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
@@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
static int init_tdx_module(void)
{
struct tdsysinfo_struct *sysinfo;
- struct tdmr_info_list tdmr_list;
struct cmr_info *cmr_array;
int ret;
@@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
goto out_put_tdxmem;
/* Allocate enough space for constructing TDMRs */
- ret = alloc_tdmr_list(&tdmr_list, sysinfo);
+ ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
if (ret)
goto out_free_tdxmem;
/* Cover all TDX-usable memory regions in TDMRs */
- ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
+ ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
if (ret)
goto out_free_tdmrs;
/* Pass the TDMRs and the global KeyID to the TDX module */
- ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
+ ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
if (ret)
goto out_free_pamts;
@@ -1118,51 +1119,50 @@ static int init_tdx_module(void)
goto out_reset_pamts;
/* Initialize TDMRs to complete the TDX module initialization */
- ret = init_tdmrs(&tdmr_list);
+ ret = init_tdmrs(&tdx_tdmr_list);
+ if (ret)
+ goto out_reset_pamts;
+
+ pr_info("%lu KBs allocated for PAMT.\n",
+ tdmrs_count_pamt_kb(&tdx_tdmr_list));
+
+ /*
+ * @tdx_memlist is written here and read at memory hotplug time.
+ * Lock out memory hotplug code while building it.
+ */
+ put_online_mems();
+ /*
+ * For now both @sysinfo and @cmr_array are only used during
+ * module initialization, so always free them.
+ */
+ free_page((unsigned long)sysinfo);
+
+ return 0;
out_reset_pamts:
- if (ret) {
- /*
- * Part of PAMTs may already have been initialized by the
- * TDX module. Flush cache before returning PAMTs back
- * to the kernel.
- */
- wbinvd_on_all_cpus();
- /*
- * According to the TDX hardware spec, if the platform
- * doesn't have the "partial write machine check"
- * erratum, any kernel read/write will never cause #MC
- * in kernel space, thus it's OK to not convert PAMTs
- * back to normal. But do the conversion anyway here
- * as suggested by the TDX spec.
- */
- tdmrs_reset_pamt_all(&tdmr_list);
- }
+ /*
+ * Part of PAMTs may already have been initialized by the
+ * TDX module. Flush cache before returning PAMTs back
+ * to the kernel.
+ */
+ wbinvd_on_all_cpus();
+ /*
+ * According to the TDX hardware spec, if the platform
+ * doesn't have the "partial write machine check"
+ * erratum, any kernel read/write will never cause #MC
+ * in kernel space, thus it's OK to not convert PAMTs
+ * back to normal. But do the conversion anyway here
+ * as suggested by the TDX spec.
+ */
+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
out_free_pamts:
- if (ret)
- tdmrs_free_pamt_all(&tdmr_list);
- else
- pr_info("%lu KBs allocated for PAMT.\n",
- tdmrs_count_pamt_kb(&tdmr_list));
+ tdmrs_free_pamt_all(&tdx_tdmr_list);
out_free_tdmrs:
- /*
- * Always free the buffer of TDMRs as they are only used during
- * module initialization.
- */
- free_tdmr_list(&tdmr_list);
+ free_tdmr_list(&tdx_tdmr_list);
out_free_tdxmem:
- if (ret)
- free_tdx_memlist(&tdx_memlist);
+ free_tdx_memlist(&tdx_memlist);
out_put_tdxmem:
- /*
- * @tdx_memlist is written here and read at memory hotplug time.
- * Lock out memory hotplug code while building it.
- */
put_online_mems();
out:
- /*
- * For now both @sysinfo and @cmr_array are only used during
- * module initialization, so always free them.
- */
free_page((unsigned long)sysinfo);
return ret;
}
--
2.40.1
TDX memory has integrity and confidentiality protections. Violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself. In other words,
the host kernel should never, itself, see machine checks induced by the
TDX integrity hardware.
Alas, the first few generations of TDX hardware have an erratum. A
partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.
Virtually all kernel memory access operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.
This problem is triggered by "partial" writes, where a write
transaction of less than a cacheline lands at the memory controller.
The CPU does these via non-temporal write instructions (like MOVNTI),
or through UC/WC memory mappings. The issue can also be triggered away
from the CPU by devices doing partial writes via DMA.
With this erratum, additional things need to be done. Similar to other
CPU bugs, use a CPU bug bit to indicate this erratum, and detect it
during early boot. Note this bug reflects the hardware, thus it is
detected regardless of whether the kernel is built with TDX support or
not.
Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
---
v11 -> v12:
- Added Kirill's tag
- Changed to detect the erratum in early_init_intel() (Kirill)
v10 -> v11:
- New patch
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/intel.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cb8ca46213be..dc8701f8d88b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -483,5 +483,6 @@
#define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
#define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
#define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
+#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 1c4639588ff9..e6c3107adc15 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -358,6 +358,21 @@ int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type)
}
EXPORT_SYMBOL_GPL(intel_microcode_sanity_check);
+static void check_tdx_erratum(struct cpuinfo_x86 *c)
+{
+ /*
+ * These CPUs have an erratum. A partial write from non-TD
+ * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
+ * private memory poisons that memory, and a subsequent read of
+ * that memory triggers #MC.
+ */
+ switch (c->x86_model) {
+ case INTEL_FAM6_SAPPHIRERAPIDS_X:
+ case INTEL_FAM6_EMERALDRAPIDS_X:
+ setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
+ }
+}
+
static void early_init_intel(struct cpuinfo_x86 *c)
{
u64 misc_enable;
@@ -509,6 +524,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
*/
if (detect_extended_topology_early(c) < 0)
detect_ht_early(c);
+
+ check_tdx_erratum(c);
}
static void bsp_init_intel(struct cpuinfo_x86 *c)
--
2.40.1
After the global KeyID has been configured on all packages, initialize
all TDMRs to make all TDX-usable memory regions that are passed to the
TDX module become usable.
This is the last step of initializing the TDX module.
Initializing TDMRs can be time-consuming on large memory systems as it
involves initializing all metadata entries for all pages that can be
used by TDX guests. Initializing different TDMRs can be parallelized.
For now to keep it simple, just initialize all TDMRs one by one. It can
be enhanced in the future.
Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
---
v11 -> v12:
- Added Kirill's tag
v10 -> v11:
- No update
v9 -> v10:
- Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.
v8 -> v9:
- Improved changelog to explain why initializing TDMRs can take a long
  time (Dave).
- Improved comments around 'next-to-initialize' address (Dave).
v7 -> v8: (Dave)
- Changelog:
- explicitly call out this is the last step of TDX module initialization.
- Trimmed down changelog by removing SEAMCALL name and details.
- Removed/trimmed down unnecessary comments.
- Other changes due to 'struct tdmr_info_list'.
v6 -> v7:
- Removed need_resched() check. -- Andi.
---
arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++-----
arch/x86/virt/vmx/tdx/tdx.h | 1 +
2 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f5d4dbc11aee..52b7267ea226 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -994,6 +994,56 @@ static int config_global_keyid(void)
return ret;
}
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+ u64 next;
+
+ /*
+ * Initializing a TDMR can be time consuming. To avoid long
+ * SEAMCALLs, the TDX module may only initialize a part of the
+ * TDMR in each call.
+ */
+ do {
+ struct tdx_module_output out;
+ int ret;
+
+ /* All 0's are unused parameters, they mean nothing. */
+ ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
+ &out);
+ if (ret)
+ return ret;
+ /*
+ * RDX contains 'next-to-initialize' address if
+ * TDH.SYS.TDMR.INIT did not fully complete and
+ * should be retried.
+ */
+ next = out.rdx;
+ cond_resched();
+ /* Keep making SEAMCALLs until the TDMR is done */
+ } while (next < tdmr->base + tdmr->size);
+
+ return 0;
+}
+
+static int init_tdmrs(struct tdmr_info_list *tdmr_list)
+{
+ int i;
+
+ /*
+ * This operation is costly. It can be parallelized,
+ * but keep it simple for now.
+ */
+ for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+ int ret;
+
+ ret = init_tdmr(tdmr_entry(tdmr_list, i));
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
static int init_tdx_module(void)
{
struct tdsysinfo_struct *sysinfo;
@@ -1067,14 +1117,8 @@ static int init_tdx_module(void)
if (ret)
goto out_reset_pamts;
- /*
- * TODO:
- *
- * - Initialize all TDMRs.
- *
- * Return error before all steps are done.
- */
- ret = -EINVAL;
+ /* Initialize TDMRs to complete the TDX module initialization */
+ ret = init_tdmrs(&tdmr_list);
out_reset_pamts:
if (ret) {
/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index a0438513bec0..f6b4e153890d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -25,6 +25,7 @@
#define TDH_SYS_INFO 32
#define TDH_SYS_INIT 33
#define TDH_SYS_LP_INIT 35
+#define TDH_SYS_TDMR_INIT 36
#define TDH_SYS_CONFIG 45
struct cmr_info {
--
2.40.1
On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
> +{
> + u64 sret;
> + int cpu;
> +
> + /* Need a stable CPU id for printing error message */
> + cpu = get_cpu();
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> + put_cpu();
> +
> + /* Save SEAMCALL return code if the caller wants it */
> + if (seamcall_ret)
> + *seamcall_ret = sret;
> +
> + switch (sret) {
> + case 0:
> + /* SEAMCALL was successful */
> + return 0;
> + case TDX_SEAMCALL_VMFAILINVALID:
> + pr_err_once("module is not loaded.\n");
> + return -ENODEV;
> + default:
> + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> + cpu, fn, sret);
> + if (out)
> + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> + out->rcx, out->rdx, out->r8,
> + out->r9, out->r10, out->r11);
This looks excessively noisy.
Don't we have SEAMCALL leafs that can fail in normal situations? Like
the TDX_OPERAND_BUSY error code, which indicates that the operation
will likely succeed on retry.
Or is that wrapper only used for never-fail SEAMCALLs? If so, please
document it.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, 2023-06-27 at 12:48 +0300, [email protected] wrote:
> On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > + u64 *seamcall_ret,
> > + struct tdx_module_output *out)
> > +{
> > + u64 sret;
> > + int cpu;
> > +
> > + /* Need a stable CPU id for printing error message */
> > + cpu = get_cpu();
> > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > + put_cpu();
> > +
> > + /* Save SEAMCALL return code if the caller wants it */
> > + if (seamcall_ret)
> > + *seamcall_ret = sret;
> > +
> > + switch (sret) {
> > + case 0:
> > + /* SEAMCALL was successful */
> > + return 0;
> > + case TDX_SEAMCALL_VMFAILINVALID:
> > + pr_err_once("module is not loaded.\n");
> > + return -ENODEV;
> > + default:
> > + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> > + cpu, fn, sret);
> > + if (out)
> > + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > + out->rcx, out->rdx, out->r8,
> > + out->r9, out->r10, out->r11);
>
> This look excessively noisy.
>
> Don't we have SEAMCALL leafs that can fail in normal situation? Like
> TDX_OPERAND_BUSY error code that indicate that operation likely will
> succeed on retry.
For TDX module initialization TDX_OPERAND_BUSY cannot happen. KVM may
have legitimate cases where BUSY can happen, e.g., KVM's TDP MMU
supports handling faults concurrently on different cpus, but that is
still under discussion. Also, KVM tends to use __seamcall() directly:
https://lore.kernel.org/lkml/3c2c142e14a04a833b47f77faecaa91899b472cd.1678643052.git.isaku.yamahata@intel.com/
I guess KVM doesn't want to print a message in all cases, as you said,
but for module initialization it is fine. Those error messages are
useful in case something goes wrong, and printing them in seamcall()
avoids duplicating the printing code in all callers.
>
> Or is that wrapper only used for never-fail SEAMCALLs? If so, please
> document it.
>
How about adding the below?
Use __seamcall() directly in cases where printing an error message
isn't desired, e.g., when a SEAMCALL can legally fail with BUSY and
the caller wants to retry.
On Tue, Jun 27, 2023 at 10:28:20AM +0000, Huang, Kai wrote:
> > Or is that wrapper only used for never-fail SEAMCALLs? If so, please
> > document it.
> >
>
> How about adding the below?
>
> Use __seamcall() directly in cases where printing an error message
> isn't desired, e.g., when a SEAMCALL can legally fail with BUSY and
> the caller wants to retry.
>
Looks good to me.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Jun 27, 2023 at 10:28:20AM +0000,
"Huang, Kai" <[email protected]> wrote:
> On Tue, 2023-06-27 at 12:48 +0300, [email protected] wrote:
> > On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> > > +/*
> > > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > > + * leaf function return code and the additional output respectively if
> > > + * not NULL.
> > > + */
> > > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > > + u64 *seamcall_ret,
> > > + struct tdx_module_output *out)
> > > +{
> > > + u64 sret;
> > > + int cpu;
> > > +
> > > + /* Need a stable CPU id for printing error message */
> > > + cpu = get_cpu();
> > > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > > + put_cpu();
> > > +
> > > + /* Save SEAMCALL return code if the caller wants it */
> > > + if (seamcall_ret)
> > > + *seamcall_ret = sret;
> > > +
> > > + switch (sret) {
> > > + case 0:
> > > + /* SEAMCALL was successful */
> > > + return 0;
> > > + case TDX_SEAMCALL_VMFAILINVALID:
> > > + pr_err_once("module is not loaded.\n");
> > > + return -ENODEV;
> > > + default:
> > > + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> > > + cpu, fn, sret);
> > > + if (out)
> > > + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > > + out->rcx, out->rdx, out->r8,
> > > + out->r9, out->r10, out->r11);
> >
> > This look excessively noisy.
> >
> > Don't we have SEAMCALL leafs that can fail in normal situation? Like
> > TDX_OPERAND_BUSY error code that indicate that operation likely will
> > succeed on retry.
>
> For TDX module initialization TDX_OPERAND_BUSY cannot happen. KVM may have
> legal cases that BUSY can happen, e.g., KVM's TDP MMU supports handling faults
> concurrently on different cpus, but that is still under discussion. Also KVM
> tends to use __seamcall() directly:
>
> https://lore.kernel.org/lkml/3c2c142e14a04a833b47f77faecaa91899b472cd.1678643052.git.isaku.yamahata@intel.com/
>
> I guess KVM doesn't want to print message in all cases as you said, but for
> module initialization is fine. Those error messages are useful in case
> something goes wrong, and printing them in seamcall() helps to reduce the code
> to print in all callers.
That's right. KVM wants to do its own error handling and error
messaging. Its requirement is different from that of TDX module
initialization. I don't see much benefit in unifying the function.
--
Isaku Yamahata <[email protected]>
>+/*
>+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>+ * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
>+ * leaf function return code and the additional output respectively if
>+ * not NULL.
>+ */
>+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>+ u64 *seamcall_ret,
>+ struct tdx_module_output *out)
>+{
>+ u64 sret;
>+ int cpu;
>+
>+ /* Need a stable CPU id for printing error message */
>+ cpu = get_cpu();
>+ sret = __seamcall(fn, rcx, rdx, r8, r9, out);
>+ put_cpu();
>+
>+ /* Save SEAMCALL return code if the caller wants it */
>+ if (seamcall_ret)
>+ *seamcall_ret = sret;
Hi Kai,
All callers in this series pass NULL for seamcall_ret. I am not sure
if you kept it intentionally.
>+
>+ switch (sret) {
>+ case 0:
>+ /* SEAMCALL was successful */
Nit: if you add
#define TDX_SUCCESS 0
and do
case TDX_SUCCESS:
return 0;
then the code becomes self-explanatory. i.e., you can drop the comment.
>+ return 0;
On Wed, 2023-06-28 at 11:09 +0800, Chao Gao wrote:
> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64
> > r9,
> > + u64 *seamcall_ret,
> > + struct tdx_module_output *out)
> > +{
> > + u64 sret;
> > + int cpu;
> > +
> > + /* Need a stable CPU id for printing error message */
> > + cpu = get_cpu();
> > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > + put_cpu();
> > +
> > + /* Save SEAMCALL return code if the caller wants it */
> > + if (seamcall_ret)
> > + *seamcall_ret = sret;
>
> Hi Kai,
>
> All callers in this series pass NULL for seamcall_ret. I am not sure
> if you kept it intentionally.
In this series, none of the callers needs seamcall_ret.
>
> > +
> > + switch (sret) {
> > + case 0:
> > + /* SEAMCALL was successful */
>
> Nit: if you add
>
> #define TDX_SUCCESS 0
>
> and do
>
> case TDX_SUCCESS:
> return 0;
>
> then the code becomes self-explanatory. i.e., you can drop the comment.
Using this, I ended up with the below:
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -23,6 +23,8 @@
#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
+#define TDX_SUCCESS 0
+
Hi Kirill/Dave/David,
Are you happy with this?
> >
> > 2. CPU hotplug
> >
> > DX doesn't support physical (ACPI) CPU hotplug. A non-buggy BIOS should
> ^^
>
> Need T here.
Thanks!
>
[...]
> > 4. Memory Hotplug
> >
> > After the kernel passes all "TDX-usable" memory regions to the TDX
> > module, the set of "TDX-usable" memory regions are fixed during module's
> > runtime. No more "TDX-usable" memory can be added to the TDX module
> > after that.
> >
> > To achieve above "to guarantee all pages in the page allocator are TDX
> > pages", this series simply choose to reject any non-TDX-usable memory in
> > memory hotplug.
> >
> > 5. Physical Memory Hotplug
> >
> > Note TDX assumes convertible memory is always physically present during
> > machine's runtime. A non-buggy BIOS should never support hot-removal of
> > any convertible memory. This implementation doesn't handle ACPI memory
> > removal but depends on the BIOS to behave correctly.
> >
> > Also, if something insane really happened, 4) makes sure either TDX
>
> Please remove "4)" if it has no specific meaning here.
>
It means the mechanism mentioned in "4. Memory hotplug".
>
On Tue, Jun 27, 2023 at 02:12:30AM +1200, Kai Huang wrote:
> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks. TDX specs are available in [1].
>
> This series is the initial support to enable TDX with minimal code to
> allow KVM to create and run TDX guests. KVM support for TDX is being
> developed separately[2]. A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed[3]. The KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.
>
> Also, a few first generations of TDX hardware have an erratum[4], and
> require additional handing.
>
> This series doesn't aim to support all functionalities, and doesn't aim
> to resolve all things perfectly. All other optimizations will be posted
> as follow-up once this initial TDX support is upstreamed.
>
> (For memory hotplug, sorry for broadcasting widely but I cc'ed the
> [email protected] following Kirill's suggestion so MM experts can also
> help to provide comments.)
.....
>
> == Design Considerations ==
>
> 1. Initialize the TDX module at runtime
>
> There are basically two ways the TDX module could be initialized: either
> in early boot, or at runtime before the first TDX guest is run. This
> series implements the runtime initialization.
>
> Also, TDX requires a per-cpu initialization SEAMCALL to be done before
> making any SEAMCALL on that cpu.
>
> This series adds two functions: tdx_cpu_enable() and tdx_enable() to do
> per-cpu initialization and module initialization respectively.
>
> 2. CPU hotplug
>
> DX doesn't support physical (ACPI) CPU hotplug. A non-buggy BIOS should
^^
Need T here.
> never support hotpluggable CPU devicee and/or deliver ACPI CPU hotplug
> event to the kernel. This series doesn't handle physical (ACPI) CPU
> hotplug at all but depends on the BIOS to behave correctly.
>
> Also, tdx_cpu_enable() will simply return error for any hot-added cpu if
> something insane happened.
>
> Note TDX works with CPU logical online/offline, thus this series still
> allows to do logical CPU online/offline.
>
> 3. Kernel policy on TDX memory
>
> The TDX module reports a list of "Convertible Memory Region" (CMR) to
> indicate which memory regions are TDX-capable. The TDX architecture
> allows the VMM to designate specific convertible memory regions as usable
> for TDX private memory.
>
> The initial support of TDX guests will only allocate TDX private memory
> from the global page allocator. This series chooses to designate _all_
> system RAM in the core-mm at the time of initializing TDX module as TDX
> memory to guarantee all pages in the page allocator are TDX pages.
>
> 4. Memory Hotplug
>
> After the kernel passes all "TDX-usable" memory regions to the TDX
> module, the set of "TDX-usable" memory regions are fixed during module's
> runtime. No more "TDX-usable" memory can be added to the TDX module
> after that.
>
> To achieve above "to guarantee all pages in the page allocator are TDX
> pages", this series simply choose to reject any non-TDX-usable memory in
> memory hotplug.
>
> 5. Physical Memory Hotplug
>
> Note TDX assumes convertible memory is always physically present during
> machine's runtime. A non-buggy BIOS should never support hot-removal of
> any convertible memory. This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.
>
> Also, if something insane really happened, 4) makes sure either TDX
Please remove "4)" if it has no specific meaning here.
> cannot be enabled or hot-added memory will be rejected after TDX gets
> enabled.
>
> 6. Kexec()
>
> Similar to AMD's SME, in kexec() kernel needs to flush dirty cachelines
> of TDX private memory otherwise they may silently corrupt the new kernel.
>
> 7. TDX erratum
>
> The first few generations of TDX hardware have an erratum. A partial
> write to a TDX private memory cacheline will silently "poison" the
> line. Subsequent reads will consume the poison and generate a machine
> check.
>
> The fast warm reset reboot doesn't reset TDX private memory. With this
> erratum, all TDX private pages needs to be converted back to normal
> before a fast warm reset reboot or booting to the new kernel in kexec().
> Otherwise, the new kernel may get unexpected machine check.
>
> In normal condition, triggering the erratum in Linux requires some kind
> of kernel bug involving relatively exotic memory writes to TDX private
> memory and will manifest via spurious-looking machine checks when
> reading the affected memory. Machine check handler is improved to deal
> with such machine check.
>
>
> [1]: TDX specs
> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
>
> [2]: KVM TDX basic feature support
> https://lore.kernel.org/kvm/[email protected]/T/#t
>
> [3]: KVM: mm: fd-based approach for supporting KVM
> https://lore.kernel.org/kvm/[email protected]/
>
> [4]: TDX erratum
> https://cdrdv2.intel.com/v1/dl/getContent/772415?explicitVersion=true
>
>
>
>
> Kai Huang (22):
> x86/tdx: Define TDX supported page sizes as macros
> x86/virt/tdx: Detect TDX during kernel boot
> x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
> x86/cpu: Detect TDX partial write machine check erratum
> x86/virt/tdx: Add SEAMCALL infrastructure
> x86/virt/tdx: Handle SEAMCALL running out of entropy error
> x86/virt/tdx: Add skeleton to enable TDX on demand
> x86/virt/tdx: Get information about TDX module and TDX-capable memory
> x86/virt/tdx: Use all system memory when initializing TDX module as
> TDX memory
> x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
> memory regions
> x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
> x86/virt/tdx: Allocate and set up PAMTs for TDMRs
> x86/virt/tdx: Designate reserved areas for all TDMRs
> x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
> x86/virt/tdx: Configure global KeyID on all packages
> x86/virt/tdx: Initialize all TDMRs
> x86/kexec: Flush cache of TDX private memory
> x86/virt/tdx: Keep TDMRs when module initialization is successful
> x86/kexec(): Reset TDX private memory on platforms with TDX erratum
> x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
> x86/mce: Improve error log of kernel space TDX #MC due to erratum
> Documentation/x86: Add documentation for TDX host support
>
> Documentation/arch/x86/tdx.rst | 189 +++-
> arch/x86/Kconfig | 15 +
> arch/x86/Makefile | 2 +
> arch/x86/coco/tdx/tdx.c | 6 +-
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/msr-index.h | 3 +
> arch/x86/include/asm/tdx.h | 26 +
> arch/x86/kernel/cpu/intel.c | 17 +
> arch/x86/kernel/cpu/mce/core.c | 33 +
> arch/x86/kernel/machine_kexec_64.c | 9 +
> arch/x86/kernel/process.c | 7 +-
> arch/x86/kernel/reboot.c | 15 +
> arch/x86/kernel/setup.c | 2 +
> arch/x86/virt/Makefile | 2 +
> arch/x86/virt/vmx/Makefile | 2 +
> arch/x86/virt/vmx/tdx/Makefile | 2 +
> arch/x86/virt/vmx/tdx/seamcall.S | 52 +
> arch/x86/virt/vmx/tdx/tdx.c | 1542 ++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 151 +++
> arch/x86/virt/vmx/tdx/tdxcall.S | 19 +-
> 20 files changed, 2078 insertions(+), 17 deletions(-)
> create mode 100644 arch/x86/virt/Makefile
> create mode 100644 arch/x86/virt/vmx/Makefile
> create mode 100644 arch/x86/virt/vmx/tdx/Makefile
> create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
> create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
> create mode 100644 arch/x86/virt/vmx/tdx/tdx.h
>
>
> base-commit: 94142c9d1bdf1c18027a42758ceb6bdd59a92012
> --
> 2.40.1
>
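The conversion obligation described in sections 6 and 7 of the cover letter can be modeled in a few lines. This is only an illustrative sketch: the `struct tdx_page`, the flush stub, and `reset_tdx_private_pages()` are invented for this example; the real kernel walks the TDMRs/PAMTs rather than a per-page list, and uses real cache-flush instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; the real kernel walks TDMRs/PAMTs instead. */
struct tdx_page {
	unsigned long pfn;
	bool private;	/* page was converted to TDX private memory */
};

/* Stand-in for flushing the page's cachelines (wbinvd in the real kernel). */
static void flush_page_cache(struct tdx_page *p)
{
	(void)p;
}

/*
 * Before a warm reset or kexec() on an erratum-affected platform, convert
 * every private page back to normal so the new kernel cannot consume a
 * poisoned cacheline and get a machine check.
 */
static void reset_tdx_private_pages(struct tdx_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!pages[i].private)
			continue;
		flush_page_cache(&pages[i]);
		pages[i].private = false;
	}
}
```

The point of the model is only the invariant: after the reset pass, no page is still private when control transfers to the new kernel.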
On 26.06.23 17:12, Kai Huang wrote:
> On the platforms with the "partial write machine check" erratum,
> kexec() needs to convert all TDX private pages back to normal before
> booting to the new kernel. Otherwise, the new kernel may get an
> unexpected machine check.
>
> There's no existing infrastructure to track TDX private pages. Change
> to keep TDMRs when module initialization is successful so that they can
> be used to find PAMTs.
>
> With this change, only put_online_mems() and freeing the buffer of the
> TDSYSINFO_STRUCT and CMR array still need to be done even when module
> initialization is successful. Adjust the error handling to explicitly
> do them when module initialization is successful and unconditionally
> clean up the rest when initialization fails.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v11 -> v12 (new patch):
> - Defer keeping TDMRs logic to this patch for better review
> - Improved error handling logic (Nikolay/Kirill in patch 15)
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
> 1 file changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 52b7267ea226..85b24b2e9417 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
> static LIST_HEAD(tdx_memlist);
>
> +static struct tdmr_info_list tdx_tdmr_list;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> static int init_tdx_module(void)
> {
> struct tdsysinfo_struct *sysinfo;
> - struct tdmr_info_list tdmr_list;
> struct cmr_info *cmr_array;
> int ret;
>
> @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
> goto out_put_tdxmem;
>
> /* Allocate enough space for constructing TDMRs */
> - ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> + ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
> if (ret)
> goto out_free_tdxmem;
>
> /* Cover all TDX-usable memory regions in TDMRs */
> - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
nit: Does it make sense to keep passing those global variables as
function parameters? Since those functions are static it's unlikely that
they are going to be used with any other argument, so we might as well
use the global directly. It makes the code somewhat easier to follow.
<snip>
On Wed, Jun 28, 2023 at 03:34:05AM +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 11:09 +0800, Chao Gao wrote:
> > > +/*
> > > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > > + * leaf function return code and the additional output respectively if
> > > + * not NULL.
> > > + */
> > > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64
> > > r9,
> > > + u64 *seamcall_ret,
> > > + struct tdx_module_output *out)
> > > +{
> > > + u64 sret;
> > > + int cpu;
> > > +
> > > + /* Need a stable CPU id for printing error message */
> > > + cpu = get_cpu();
> > > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > > + put_cpu();
> > > +
> > > + /* Save SEAMCALL return code if the caller wants it */
> > > + if (seamcall_ret)
> > > + *seamcall_ret = sret;
> >
> > Hi Kai,
> >
> > All callers in this series pass NULL for seamcall_ret. I am not sure if
> > you keep it intentionally.
>
> In this series all the callers don't need seamcall_ret.
I'm fine keeping it if it is needed by KVM TDX enabling. Otherwise, just
drop it.
> > > +
> > > + switch (sret) {
> > > + case 0:
> > > + /* SEAMCALL was successful */
> >
> > Nit: if you add
> >
> > #define TDX_SUCCESS 0
> >
> > and do
> >
> > case TDX_SUCCESS:
> > return 0;
> >
> > then the code becomes self-explanatory. i.e., you can drop the comment.
>
> If using this, I ended up with below:
>
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -23,6 +23,8 @@
> #define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> #define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
>
> +#define TDX_SUCCESS 0
> +
>
> Hi Kirill/Dave/David,
>
> Are you happy with this?
Sure, looks good.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Jun 27, 2023 at 02:12:48AM +1200, Kai Huang wrote:
> On the platforms with the "partial write machine check" erratum, the
> kexec() needs to convert all TDX private pages back to normal before
> booting to the new kernel. Otherwise, the new kernel may get unexpected
> machine check.
>
> There's no existing infrastructure to track TDX private pages. Change
> to keep TDMRs when module initialization is successful so that they can
> be used to find PAMTs.
>
> With this change, only put_online_mems() and freeing the buffer of the
> TDSYSINFO_STRUCT and CMR array still need to be done even when module
> initialization is successful. Adjust the error handling to explicitly
> do them when module initialization is successful and unconditionally
> clean up the rest when initialization fails.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v11 -> v12 (new patch):
> - Defer keeping TDMRs logic to this patch for better review
> - Improved error handling logic (Nikolay/Kirill in patch 15)
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
> 1 file changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 52b7267ea226..85b24b2e9417 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
> static LIST_HEAD(tdx_memlist);
>
> +static struct tdmr_info_list tdx_tdmr_list;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> static int init_tdx_module(void)
> {
> struct tdsysinfo_struct *sysinfo;
> - struct tdmr_info_list tdmr_list;
> struct cmr_info *cmr_array;
> int ret;
>
> @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
> goto out_put_tdxmem;
>
> /* Allocate enough space for constructing TDMRs */
> - ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> + ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
> if (ret)
> goto out_free_tdxmem;
>
> /* Cover all TDX-usable memory regions in TDMRs */
> - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
> if (ret)
> goto out_free_tdmrs;
>
> /* Pass the TDMRs and the global KeyID to the TDX module */
> - ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
> + ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
> if (ret)
> goto out_free_pamts;
>
> @@ -1118,51 +1119,50 @@ static int init_tdx_module(void)
> goto out_reset_pamts;
>
> /* Initialize TDMRs to complete the TDX module initialization */
> - ret = init_tdmrs(&tdmr_list);
> + ret = init_tdmrs(&tdx_tdmr_list);
> + if (ret)
> + goto out_reset_pamts;
> +
> + pr_info("%lu KBs allocated for PAMT.\n",
> + tdmrs_count_pamt_kb(&tdx_tdmr_list));
> +
> + /*
> + * @tdx_memlist is written here and read at memory hotplug time.
> + * Lock out memory hotplug code while building it.
> + */
> + put_online_mems();
> + /*
> + * For now both @sysinfo and @cmr_array are only used during
> + * module initialization, so always free them.
> + */
> + free_page((unsigned long)sysinfo);
> +
> + return 0;
> out_reset_pamts:
> - if (ret) {
> - /*
> - * Part of PAMTs may already have been initialized by the
> - * TDX module. Flush cache before returning PAMTs back
> - * to the kernel.
> - */
> - wbinvd_on_all_cpus();
> - /*
> - * According to the TDX hardware spec, if the platform
> - * doesn't have the "partial write machine check"
> - * erratum, any kernel read/write will never cause #MC
> - * in kernel space, thus it's OK to not convert PAMTs
> - * back to normal. But do the conversion anyway here
> - * as suggested by the TDX spec.
> - */
> - tdmrs_reset_pamt_all(&tdmr_list);
> - }
> + /*
> + * Part of PAMTs may already have been initialized by the
> + * TDX module. Flush cache before returning PAMTs back
> + * to the kernel.
> + */
> + wbinvd_on_all_cpus();
> + /*
> + * According to the TDX hardware spec, if the platform
> + * doesn't have the "partial write machine check"
> + * erratum, any kernel read/write will never cause #MC
> + * in kernel space, thus it's OK to not convert PAMTs
> + * back to normal. But do the conversion anyway here
> + * as suggested by the TDX spec.
> + */
> + tdmrs_reset_pamt_all(&tdx_tdmr_list);
> out_free_pamts:
> - if (ret)
> - tdmrs_free_pamt_all(&tdmr_list);
> - else
> - pr_info("%lu KBs allocated for PAMT.\n",
> - tdmrs_count_pamt_kb(&tdmr_list));
> + tdmrs_free_pamt_all(&tdx_tdmr_list);
> out_free_tdmrs:
> - /*
> - * Always free the buffer of TDMRs as they are only used during
> - * module initialization.
> - */
> - free_tdmr_list(&tdmr_list);
> + free_tdmr_list(&tdx_tdmr_list);
> out_free_tdxmem:
> - if (ret)
> - free_tdx_memlist(&tdx_memlist);
> + free_tdx_memlist(&tdx_memlist);
> out_put_tdxmem:
> - /*
> - * @tdx_memlist is written here and read at memory hotplug time.
> - * Lock out memory hotplug code while building it.
> - */
> put_online_mems();
> out:
> - /*
> - * For now both @sysinfo and @cmr_array are only used during
> - * module initialization, so always free them.
> - */
> free_page((unsigned long)sysinfo);
> return ret;
> }
This diff is extremely hard to follow, but I think the change to error
handling Nikolay proposed has to be applied to the function from the
beginning, not changed drastically in this patch.
--
Kiryl Shutsemau / Kirill A. Shutemov
On 28.06.23 15:23, [email protected] wrote:
> On Tue, Jun 27, 2023 at 02:12:48AM +1200, Kai Huang wrote:
>> On the platforms with the "partial write machine check" erratum, the
>> kexec() needs to convert all TDX private pages back to normal before
>> booting to the new kernel. Otherwise, the new kernel may get unexpected
>> machine check.
>>
>> There's no existing infrastructure to track TDX private pages. Change
>> to keep TDMRs when module initialization is successful so that they can
>> be used to find PAMTs.
>>
>> With this change, only put_online_mems() and freeing the buffer of the
>> TDSYSINFO_STRUCT and CMR array still need to be done even when module
>> initialization is successful. Adjust the error handling to explicitly
>> do them when module initialization is successful and unconditionally
>> clean up the rest when initialization fails.
>>
>> Signed-off-by: Kai Huang <[email protected]>
>> ---
>>
>> v11 -> v12 (new patch):
>> - Defer keeping TDMRs logic to this patch for better review
>> - Improved error handling logic (Nikolay/Kirill in patch 15)
>>
>> ---
>> arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
>> 1 file changed, 42 insertions(+), 42 deletions(-)
>>
>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> index 52b7267ea226..85b24b2e9417 100644
>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>> /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
>> static LIST_HEAD(tdx_memlist);
>>
>> +static struct tdmr_info_list tdx_tdmr_list;
>> +
>> /*
>> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
>> @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
>> static int init_tdx_module(void)
>> {
>> struct tdsysinfo_struct *sysinfo;
>> - struct tdmr_info_list tdmr_list;
>> struct cmr_info *cmr_array;
>> int ret;
>>
>> @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
>> goto out_put_tdxmem;
>>
>> /* Allocate enough space for constructing TDMRs */
>> - ret = alloc_tdmr_list(&tdmr_list, sysinfo);
>> + ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
>> if (ret)
>> goto out_free_tdxmem;
>>
>> /* Cover all TDX-usable memory regions in TDMRs */
>> - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
>> + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
>> if (ret)
>> goto out_free_tdmrs;
>>
>> /* Pass the TDMRs and the global KeyID to the TDX module */
>> - ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
>> + ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
>> if (ret)
>> goto out_free_pamts;
>>
>> @@ -1118,51 +1119,50 @@ static int init_tdx_module(void)
>> goto out_reset_pamts;
>>
>> /* Initialize TDMRs to complete the TDX module initialization */
>> - ret = init_tdmrs(&tdmr_list);
>> + ret = init_tdmrs(&tdx_tdmr_list);
>> + if (ret)
>> + goto out_reset_pamts;
>> +
>> + pr_info("%lu KBs allocated for PAMT.\n",
>> + tdmrs_count_pamt_kb(&tdx_tdmr_list));
>> +
>> + /*
>> + * @tdx_memlist is written here and read at memory hotplug time.
>> + * Lock out memory hotplug code while building it.
>> + */
>> + put_online_mems();
>> + /*
>> + * For now both @sysinfo and @cmr_array are only used during
>> + * module initialization, so always free them.
>> + */
>> + free_page((unsigned long)sysinfo);
>> +
>> + return 0;
>> out_reset_pamts:
>> - if (ret) {
>> - /*
>> - * Part of PAMTs may already have been initialized by the
>> - * TDX module. Flush cache before returning PAMTs back
>> - * to the kernel.
>> - */
>> - wbinvd_on_all_cpus();
>> - /*
>> - * According to the TDX hardware spec, if the platform
>> - * doesn't have the "partial write machine check"
>> - * erratum, any kernel read/write will never cause #MC
>> - * in kernel space, thus it's OK to not convert PAMTs
>> - * back to normal. But do the conversion anyway here
>> - * as suggested by the TDX spec.
>> - */
>> - tdmrs_reset_pamt_all(&tdmr_list);
>> - }
>> + /*
>> + * Part of PAMTs may already have been initialized by the
>> + * TDX module. Flush cache before returning PAMTs back
>> + * to the kernel.
>> + */
>> + wbinvd_on_all_cpus();
>> + /*
>> + * According to the TDX hardware spec, if the platform
>> + * doesn't have the "partial write machine check"
>> + * erratum, any kernel read/write will never cause #MC
>> + * in kernel space, thus it's OK to not convert PAMTs
>> + * back to normal. But do the conversion anyway here
>> + * as suggested by the TDX spec.
>> + */
>> + tdmrs_reset_pamt_all(&tdx_tdmr_list);
>> out_free_pamts:
>> - if (ret)
>> - tdmrs_free_pamt_all(&tdmr_list);
>> - else
>> - pr_info("%lu KBs allocated for PAMT.\n",
>> - tdmrs_count_pamt_kb(&tdmr_list));
>> + tdmrs_free_pamt_all(&tdx_tdmr_list);
>> out_free_tdmrs:
>> - /*
>> - * Always free the buffer of TDMRs as they are only used during
>> - * module initialization.
>> - */
>> - free_tdmr_list(&tdmr_list);
>> + free_tdmr_list(&tdx_tdmr_list);
>> out_free_tdxmem:
>> - if (ret)
>> - free_tdx_memlist(&tdx_memlist);
>> + free_tdx_memlist(&tdx_memlist);
>> out_put_tdxmem:
>> - /*
>> - * @tdx_memlist is written here and read at memory hotplug time.
>> - * Lock out memory hotplug code while building it.
>> - */
>> put_online_mems();
>> out:
>> - /*
>> - * For now both @sysinfo and @cmr_array are only used during
>> - * module initialization, so always free them.
>> - */
>> free_page((unsigned long)sysinfo);
>> return ret;
>> }
>
> This diff is extremely hard to follow, but I think the change to error
> handling Nikolay proposed has to be applied to the function from the
> beginning, not changed drastically in this patch.
>
I agree. That change should be broken across the various patches
introducing each piece of error handling.
On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
__always_inline perhaps? __always_unused seems wrong; worse, it's still
there at the end of the series:
$ quilt diff --combine - | grep seamcall
...
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
...
+ ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+ ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+ ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
+ ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
+ return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
+ ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
...
Definitely not unused.
> + u64 *seamcall_ret,
> + struct tdx_module_output *out)
This interface is atrocious :/ Why have these two ret values? Why can't
that live in a single space -- /me looks through the callers, and finds
seamcall_ret is unused :-(
Worse, the input (c,d,8,9) is a strict subset of the output
(c,d,8,9,10,11) so why isn't that a single thing used for both input and
output.
struct tdx_call {
u64 rcx, rdx, r8, r9, r10, r11;
};
static int __always_inline seamcall(u64 fn, struct tdx_call *regs)
{
}
struct tdx_regs regs = { };
ret = seamcall(THD_SYS_INIT, ®s);
struct tdx_regs regs = {
.rcx = sysinfo_pa, .rdx = TDXSYSINFO_STRUCT_SIZE,
.r8 = cmr_array_pa, .r9 = MAX_CMRS,
};
ret = seamcall(THD_SYS_INFO, ®s);
if (ret)
return ret;
print_cmrs(cmr_array, regs.r9);
/me looks more at this stuff and ... WTF!?!?
Can someone explain to me why __tdx_hypercall() is sane (per the above)
but then we grew __tdx_module_call() as an absolute abomination and are
apparently using that for seam too?
> +{
> + u64 sret;
> + int cpu;
> +
> + /* Need a stable CPU id for printing error message */
> + cpu = get_cpu();
And that's important because? Does having preemption off across the
seamcall make sense? Does it still make sense when you add a loop later?
> + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> + put_cpu();
> +
> + /* Save SEAMCALL return code if the caller wants it */
> + if (seamcall_ret)
> + *seamcall_ret = sret;
> +
> + switch (sret) {
> + case 0:
> + /* SEAMCALL was successful */
> + return 0;
> + case TDX_SEAMCALL_VMFAILINVALID:
> + pr_err_once("module is not loaded.\n");
> + return -ENODEV;
> + default:
> + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> + cpu, fn, sret);
> + if (out)
> + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> + out->rcx, out->rdx, out->r8,
> + out->r9, out->r10, out->r11);
At the very least this lacks { }, but it is quite horrendous coding
style.
Why switch() at all, would not:
if (!sret)
return 0;
if (sret == TDX_SEAMCALL_VMFAILINVALID) {
pr_nonsense();
return -ENODEV;
}
if (sret == TDX_SEAMCALL_GP) {
pr_nonsense();
return -ENODEV;
}
if (sret == TDX_SEAMCALL_UD) {
pr_nonsense();
return -EINVAL;
}
pr_nonsense();
return -EIO;
be much clearer and have less horrific indenting issues?
> + return -EIO;
> + }
> +}
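For reference, the flat mapping Peter sketches can be written as a small self-contained function. The status constants below are made-up stand-ins for the real TDX_SEAMCALL_* values (which live in asm/tdx.h), purely to show the control flow and the errno mapping being discussed:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in status values; the real ones are defined in asm/tdx.h. */
#define TDX_SUCCESS			0x0ULL
#define TDX_SEAMCALL_VMFAILINVALID	0x8000FF00FFFF0000ULL
#define TDX_SEAMCALL_GP			0x8000FF00FFFF0001ULL
#define TDX_SEAMCALL_UD			0x8000FF00FFFF0002ULL

/* Flat if-chain mapping a SEAMCALL status word to a kernel-style errno. */
static int seamcall_err_to_errno(uint64_t sret)
{
	if (sret == TDX_SUCCESS)
		return 0;
	if (sret == TDX_SEAMCALL_VMFAILINVALID)
		return -ENODEV;	/* TDX module is not loaded */
	if (sret == TDX_SEAMCALL_GP)
		return -ENODEV;	/* e.g. TDX disabled by BIOS */
	if (sret == TDX_SEAMCALL_UD)
		return -EINVAL;	/* e.g. CPU not in VMX operation */
	return -EIO;		/* any other module error */
}
```

Compared with the switch statement in the patch, each branch here stands alone and the error printing can hang off each test without deep indentation.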
On Wed, Jun 28, 2023 at 02:58:13PM +0200, Peter Zijlstra wrote:
> Can someone explain to me why __tdx_hypercall() is sane (per the above)
> but then we grew __tdx_module_call() as an absolute abomination and are
> apparently using that for seam too?
That is, why do we have two different TDCALL wrappers? Makes no sense.
On Wed, 2023-06-28 at 14:58 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
>
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>
> __always_inline perhaps? __always_unused seems wrong, worse it's still
> there at the end of the series:
>
> $ quilt diff --combine - | grep seamcall
> ...
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> ...
> + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> + ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> + ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
> + return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
> + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> ...
>
> Definitely not unused.
Thanks for reviewing!
Sorry, obviously I forgot to remove __always_unused in the patch that first
used seamcall(). I should be more careful. :(
>
> > + u64 *seamcall_ret,
> > + struct tdx_module_output *out)
>
> This interface is atrocious :/ Why have these two ret values? Why can't
> that live in a single space -- /me looks throught the callers, and finds
> seamcall_ret is unused :-(
I'll drop @seamcall_ret, as also suggested by Kirill.
>
> Worse, the input (c,d,8,9) is a strict subset of the output
> (c,d,8,9,10,11) so why isn't that a single thing used for both input and
> output.
>
> struct tdx_call {
> u64 rcx, rdx, r8, r9, r10, r11;
> };
>
> static int __always_inline seamcall(u64 fn, struct tdx_call *regs)
> {
> }
>
>
> struct tdx_regs regs = { };
> ret = seamcall(THD_SYS_INIT, ®s);
>
>
>
> struct tdx_regs regs = {
> .rcx = sysinfo_pa, .rdx = TDXSYSINFO_STRUCT_SIZE,
> .r8 = cmr_array_pa, .r9 = MAX_CMRS,
> };
> ret = seamcall(THD_SYS_INFO, ®s);
> if (ret)
> return ret;
>
> print_cmrs(cmr_array, regs.r9);
>
>
> /me looks more at this stuff and ... WTF!?!?
>
> Can someone explain to me why __tdx_hypercall() is sane (per the above)
> but then we grew __tdx_module_call() as an absolute abomination and are
> apparently using that for seam too?
>
>
Sorry, I don't know the story behind __tdx_hypercall().
For TDCALL and SEAMCALL, I believe one reason is that they can be used in
performance-critical paths. The @out is not always used, so putting all
outputs into a structure can reduce the number of function parameters. I
once had separate struct tdx_seamcall_input {} and struct tdx_seamcall_out
{} but that wasn't preferred.
Kirill, could you help to explain?
>
>
> > +{
> > + u64 sret;
> > + int cpu;
> > +
> > + /* Need a stable CPU id for printing error message */
> > + cpu = get_cpu();
>
> And that's important because?
>
I want a stable CPU id for error message printing.
> Does having preemption off across the
> seamcall make sense? Does it still make sense when you add a loop later?
SEAMCALL itself isn't interruptible, so I think having preemption off around
a SEAMCALL is fine. But I agree disabling preemption around multiple
SEAMCALLs isn't ideal. I'll change it to only disable preemption around one
SEAMCALL to get a correct CPU id for error printing.
>
> > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > + put_cpu();
> > +
> > + /* Save SEAMCALL return code if the caller wants it */
> > + if (seamcall_ret)
> > + *seamcall_ret = sret;
> > +
> > + switch (sret) {
> > + case 0:
> > + /* SEAMCALL was successful */
> > + return 0;
> > + case TDX_SEAMCALL_VMFAILINVALID:
> > + pr_err_once("module is not loaded.\n");
> > + return -ENODEV;
> > + default:
> > + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> > + cpu, fn, sret);
> > + if (out)
> > + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > + out->rcx, out->rdx, out->r8,
> > + out->r9, out->r10, out->r11);
>
> At the very least this lacks { }, but it is quite horrendous coding
> style.
>
> Why switch() at all, would not:
>
> if (!sret)
> return 0;
>
> if (sret == TDX_SEAMCALL_VMFAILINVALID) {
> pr_nonsense();
> return -ENODEV;
> }
>
> if (sret == TDX_SEAMCALL_GP) {
> pr_nonsense();
> return -ENODEV;
> }
>
> if (sret == TDX_SEAMCALL_UD) {
> pr_nonsense();
> return -EINVAL;
> }
>
> pr_nonsense();
> return -EIO;
>
> be much clearer and have less horrific indenting issues?
I can certainly change to this style. Thanks.
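To make Peter's single in/out structure concrete, here is a sketch fleshed out just enough to compile. The __seamcall() body below is a fake echo stub (the real one is assembly), and the register values are arbitrary; only the calling convention is the point:

```c
#include <assert.h>
#include <stdint.h>

/* One structure for both SEAMCALL inputs and outputs, as Peter suggests. */
struct tdx_call {
	uint64_t rcx, rdx, r8, r9, r10, r11;
};

/*
 * Fake __seamcall(): the real one is a thin assembly stub that loads the
 * input registers from @regs and stores the outputs back into it. This
 * stand-in just writes a plausible "output" so the convention can be seen.
 */
static uint64_t __seamcall(uint64_t fn, struct tdx_call *regs)
{
	(void)fn;
	regs->r9 = regs->r8 + 1;	/* pretend the module wrote a count */
	return 0;			/* TDX_SUCCESS */
}

/* Wrapper converting the SEAMCALL status to a kernel-style error code. */
static int seamcall(uint64_t fn, struct tdx_call *regs)
{
	return __seamcall(fn, regs) ? -5 /* -EIO */ : 0;
}
```

The caller then initializes only the registers it needs and reads outputs from the same structure, which removes both the long parameter list and the separate output struct.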
On Wed, 2023-06-28 at 14:50 +0300, [email protected] wrote:
> On Wed, Jun 28, 2023 at 03:34:05AM +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 11:09 +0800, Chao Gao wrote:
> > > > +/*
> > > > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > > > + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > > > + * leaf function return code and the additional output respectively if
> > > > + * not NULL.
> > > > + */
> > > > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64
> > > > r9,
> > > > + u64 *seamcall_ret,
> > > > + struct tdx_module_output *out)
> > > > +{
> > > > + u64 sret;
> > > > + int cpu;
> > > > +
> > > > + /* Need a stable CPU id for printing error message */
> > > > + cpu = get_cpu();
> > > > + sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > > > + put_cpu();
> > > > +
> > > > + /* Save SEAMCALL return code if the caller wants it */
> > > > + if (seamcall_ret)
> > > > + *seamcall_ret = sret;
> > >
> > > Hi Kai,
> > >
> > > All callers in this series pass NULL for seamcall_ret. I am no sure if
> > > you keep it intentionally.
> >
> > In this series all the callers doesn't need seamcall_ret.
>
> I'm fine keeping it if it is needed by KVM TDX enabling. Otherwise, just
> drop it.
No problem, I'll drop it. KVM is using __seamcall() anyway.
On Wed, 2023-06-28 at 15:54 +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 02:58:13PM +0200, Peter Zijlstra wrote:
>
> > Can someone explain to me why __tdx_hypercall() is sane (per the above)
> > but then we grew __tdx_module_call() as an absolute abomination and are
> > apparently using that for seam too?
>
> That is, why do we have two different TDCALL wrappers? Makes no sense.
>
I think the reason should be that TDCALL/SEAMCALL can be used in
performance-critical paths, but TDVMCALL isn't.
For example, SEAMCALLs are used in KVM's MMU code to handle page faults for
TDX private pages.
Kirill, could you help to clarify? Thanks.
On Wed, 2023-06-28 at 15:48 +0300, Nikolay Borisov wrote:
> > This diff is extremely hard to follow, but I think the change to error
> > handling Nikolay proposed has to be applied to the function from the
> > beginning, not changed drastically in this patch.
> >
>
>
> I agree. That change should be broken across the various patches
> introducing each piece of error handling.
No, I don't want to do this. The TDMRs only need to be kept if we do the
next patch (reset TDX memory); in the previous patches they are always
freed. We could add justification for keeping them in an earlier patch,
but I want to avoid such a pattern because I now believe it's not the
right way to organize patches: the justification would depend on the
later patch. If the later patch has something wrong and needs to be
updated, the justification can become invalid, and we would need to
adjust the previous patches accordingly. This could result in code
review frustration.
Specifically for this issue, if we always free TDMRs in the previous
patches, then it's just not right to do what you suggested there. Also,
now with dynamic allocation of the TDSYSINFO_STRUCT and CMR array, we need
to do 3 things when module initialization is successful:
put_online_mems();
kfree(sysinfo);
kfree(cmr_array);
return 0;
out_xxx:
....
put_online_mems();
kfree(sysinfo);
kfree(cmr_array);
return ret;
I can hardly say which is better. I am willing to do the above pattern if
you guys prefer, but I certainly don't want to mix this logic into previous
patches.
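The shape being debated can be shown in a minimal, compilable model. Everything here is invented for illustration (the counters, the single failure point): the idea is that the success path performs only the unconditional cleanups and returns, while the goto labels unwind purely on failure, with no `if (ret)` tests inside them:

```c
#include <assert.h>
#include <stdbool.h>

static int cleanups;		/* counts the always-needed cleanups */
static int error_cleanups;	/* counts the failure-only cleanups */

/* Model of init_tdx_module(): explicit success path, labels only for errors. */
static int init_module_model(bool fail_late)
{
	int ret = 0;

	/* ... allocations and SEAMCALLs would happen here ... */
	if (fail_late) {
		ret = -1;
		goto out_free;
	}

	/* Success: do only the unconditional cleanups, then return. */
	cleanups++;		/* e.g. put_online_mems() + free_page() */
	return 0;

out_free:
	error_cleanups++;	/* e.g. tdmrs_free_pamt_all() */
	cleanups++;		/* the same unconditional cleanups again */
	return ret;
}
```

The cost, as noted above, is that the unconditional cleanups appear twice (once on the success path, once at the tail of the error path); the benefit is that no cleanup label needs to test ret.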
On Wed, 2023-06-28 at 12:04 +0300, Nikolay Borisov wrote:
>
> On 26.06.23 17:12, Kai Huang wrote:
> > On the platforms with the "partial write machine check" erratum, the
> > kexec() needs to convert all TDX private pages back to normal before
> > booting to the new kernel. Otherwise, the new kernel may get unexpected
> > machine check.
> >
> > There's no existing infrastructure to track TDX private pages. Change
> > to keep TDMRs when module initialization is successful so that they can
> > be used to find PAMTs.
> >
> > With this change, only put_online_mems() and freeing the buffer of the
> > TDSYSINFO_STRUCT and CMR array still need to be done even when module
> > initialization is successful. Adjust the error handling to explicitly
> > do them when module initialization is successful and unconditionally
> > clean up the rest when initialization fails.
> >
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> >
> > v11 -> v12 (new patch):
> > - Defer keeping TDMRs logic to this patch for better review
> > - Improved error handling logic (Nikolay/Kirill in patch 15)
> >
> > ---
> > arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
> > 1 file changed, 42 insertions(+), 42 deletions(-)
> >
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 52b7267ea226..85b24b2e9417 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> > /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
> > static LIST_HEAD(tdx_memlist);
> >
> > +static struct tdmr_info_list tdx_tdmr_list;
> > +
> > /*
> > * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> > @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> > static int init_tdx_module(void)
> > {
> > struct tdsysinfo_struct *sysinfo;
> > - struct tdmr_info_list tdmr_list;
> > struct cmr_info *cmr_array;
> > int ret;
> >
> > @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
> > goto out_put_tdxmem;
> >
> > /* Allocate enough space for constructing TDMRs */
> > - ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> > + ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
> > if (ret)
> > goto out_free_tdxmem;
> >
> > /* Cover all TDX-usable memory regions in TDMRs */
> > - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> > + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
>
> nit: Does it make sense to keep passing those global variables are
> function parameters? Since those functions are static it's unlikely that
> they are going to be used with any other parameter so might as well use
> the parameter directly. It makes the code somewhat easier to follow.
>
I disagree. To me passing 'struct tdmr_info_list *tdmr_list' to
construct_tdmrs() as a parameter makes this function clearer:
It takes all TDX memory blocks and sysinfo, generates the TDMRs, and stores them
to the buffer specified in the tdmr_list. The internal logic doesn't need to
care whether any of those parameters are static or not.
On Wed, Jun 28, 2023 at 08:12:55AM +0000, Huang, Kai wrote:
> > >
> > > 2. CPU hotplug
> > >
> > > DX doesn't support physical (ACPI) CPU hotplug. A non-buggy BIOS should
> > ^^
> >
> > Need T here.
>
> Thanks!
>
> >
> [...]
>
> > > 4. Memory Hotplug
> > >
> > > After the kernel passes all "TDX-usable" memory regions to the TDX
> > > module, the set of "TDX-usable" memory regions are fixed during module's
> > > runtime. No more "TDX-usable" memory can be added to the TDX module
> > > after that.
> > >
> > > To achieve above "to guarantee all pages in the page allocator are TDX
> > > pages", this series simply choose to reject any non-TDX-usable memory in
> > > memory hotplug.
> > >
> > > 5. Physical Memory Hotplug
> > >
> > > Note TDX assumes convertible memory is always physically present during
> > > machine's runtime. A non-buggy BIOS should never support hot-removal of
> > > any convertible memory. This implementation doesn't handle ACPI memory
> > > removal but depends on the BIOS to behave correctly.
> > >
> > > Also, if something insane really happened, 4) makes sure either TDX
> >
> > Please remove "4)" if have no specific meaning here.
> >
>
> It means the mechanism mentioned in "4. Memory hotplug".
Ah I see, it's fine to me, thanks.
> >
On Wed, 2023-06-28 at 23:21 +0000, Huang, Kai wrote:
> > > + /* Need a stable CPU id for printing error message */
> > > + cpu = get_cpu();
> >
> > And that's important because?
> >
>
> I want to have a stable cpu for error message printing.
Sorry, I misunderstood your question.
I think having the CPU id on which the SEAMCALL failed in the dmesg would be
better? But it's not absolutely needed. I can remove it (thus remove
{get|put}_cpu()) if you prefer not to print?
On Wed, Jun 28, 2023 at 03:54:36PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 02:58:13PM +0200, Peter Zijlstra wrote:
>
> > Can someone explain to me why __tdx_hypercall() is sane (per the above)
> > but then we grew __tdx_module_call() as an absolute abomination and are
> > apparently using that for seam too?
>
> That is, why do we have two different TDCALL wrappers? Makes no sense.
__tdx_module_call() is the wrapper for TDCALL.
__tdx_hypercall() is the wrapper for TDG.VP.VMCALL leaf function of
TDCALL. The function is used often and it uses a wider range of registers
compared to the rest of the TDCALL functions.
--
Kiryl Shutsemau / Kirill A. Shutemov
>> then the code becomes self-explanatory. i.e., you can drop the comment.
>
> If using this, I ended up with below:
>
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -23,6 +23,8 @@
> #define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> #define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
>
> +#define TDX_SUCCESS 0
> +
>
> Hi Kirill/Dave/David,
>
> Are you happy with this?
Yes, all sounds good to me!
--
Cheers,
David / dhildenb
On 26.06.23 16:12, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections. Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself. In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.
>
> Alas, the first few generations of TDX hardware have an erratum. A
> partial write to a TDX private memory cacheline will silently "poison"
> the line. Subsequent reads will consume the poison and generate a
> machine check. According to the TDX hardware spec, neither of these
> things should have happened.
>
> Virtually all kernel memory access operations happen in full
> cachelines. In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
>
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> With this erratum, there are additional things that need to be done. Similar
> to other CPU bugs, use a CPU bug bit to indicate this erratum, and
> detect this erratum during early boot. Note this bug reflects the
> hardware thus it is detected regardless of whether the kernel is built
> with TDX support or not.
>
> Signed-off-by: Kai Huang <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> ---
Reviewed-by: David Hildenbrand <[email protected]>
--
Cheers,
David / dhildenb
On Tue, Jun 27, 2023 at 02:12:46AM +1200, Kai Huang wrote:
> After the global KeyID has been configured on all packages, initialize
> all TDMRs to make all TDX-usable memory regions that are passed to the
> TDX module become usable.
>
> This is the last step of initializing the TDX module.
>
> Initializing TDMRs can be time consuming on large memory systems as it
> involves initializing all metadata entries for all pages that can be
> used by TDX guests. Initializing different TDMRs can be parallelized.
> For now to keep it simple, just initialize all TDMRs one by one. It can
> be enhanced in the future.
Reviewed-by: Yuan Yao <[email protected]>
>
> Signed-off-by: Kai Huang <[email protected]>
> Reviewed-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> ---
>
> v11 -> v12:
> - Added Kirill's tag
>
> v10 -> v11:
> - No update
>
> v9 -> v10:
> - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.
>
> v8 -> v9:
> - Improved changelog to explain why initializing TDMRs can take a long
> time (Dave).
> - Improved comments around 'next-to-initialize' address (Dave).
>
> v7 -> v8: (Dave)
> - Changelog:
> - explicitly call out this is the last step of TDX module initialization.
> - Trimmed down changelog by removing SEAMCALL name and details.
> - Removed/trimmed down unnecessary comments.
> - Other changes due to 'struct tdmr_info_list'.
>
> v6 -> v7:
> - Removed need_resched() check. -- Andi.
>
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++-----
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> 2 files changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5d4dbc11aee..52b7267ea226 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -994,6 +994,56 @@ static int config_global_keyid(void)
> return ret;
> }
>
> +static int init_tdmr(struct tdmr_info *tdmr)
> +{
> + u64 next;
> +
> + /*
> + * Initializing a TDMR can be time consuming. To avoid long
> + * SEAMCALLs, the TDX module may only initialize a part of the
> + * TDMR in each call.
> + */
> + do {
> + struct tdx_module_output out;
> + int ret;
> +
> + /* All 0's are unused parameters, they mean nothing. */
> + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> + &out);
> + if (ret)
> + return ret;
> + /*
> + * RDX contains 'next-to-initialize' address if
> + * TDH.SYS.TDMR.INIT did not fully complete and
> + * should be retried.
> + */
> + next = out.rdx;
> + cond_resched();
> + /* Keep making SEAMCALLs until the TDMR is done */
> + } while (next < tdmr->base + tdmr->size);
> +
> + return 0;
> +}
> +
> +static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> +{
> + int i;
> +
> + /*
> + * This operation is costly. It can be parallelized,
> + * but keep it simple for now.
> + */
> + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> + int ret;
> +
> + ret = init_tdmr(tdmr_entry(tdmr_list, i));
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> static int init_tdx_module(void)
> {
> struct tdsysinfo_struct *sysinfo;
> @@ -1067,14 +1117,8 @@ static int init_tdx_module(void)
> if (ret)
> goto out_reset_pamts;
>
> - /*
> - * TODO:
> - *
> - * - Initialize all TDMRs.
> - *
> - * Return error before all steps are done.
> - */
> - ret = -EINVAL;
> + /* Initialize TDMRs to complete the TDX module initialization */
> + ret = init_tdmrs(&tdmr_list);
> out_reset_pamts:
> if (ret) {
> /*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index a0438513bec0..f6b4e153890d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -25,6 +25,7 @@
> #define TDH_SYS_INFO 32
> #define TDH_SYS_INIT 33
> #define TDH_SYS_LP_INIT 35
> +#define TDH_SYS_TDMR_INIT 36
> #define TDH_SYS_CONFIG 45
>
> struct cmr_info {
> --
> 2.40.1
>