Hi All,
Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the basic TDX guest
infrastructure support (including #VE handler support, and #VE support
for halt and CPUID). This is just a subset of the bare-minimum TDX
support patches required for a minimally functional TDX guest. Other
basic features like #VE support for I/O and MMIO, boot optimization
fixes, and shared-mm support will be submitted in separate patch sets.
To make reviewing easier we split the work into smaller series. This
series alone is not necessarily fully functional.
The host-side support patches, and support for advanced TD guest
features like attestation or debug mode, will be submitted at a later time.
Also, at this point the guest is not secure: there are known holes in
drivers, and the code has not yet been fully audited or fuzzed.
TDX has a lot of similarities to SEV. It enhances the confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space. With TDX the host cannot access guest memory,
so various functionality that would normally be done in KVM has moved
into the (paravirtualized) guest. Partly this is done using the
Virtualization Exception (#VE) and partly with direct paravirtual hooks.
The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.
Some of the key differences between a TD and a regular VM are:
1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE exception
into the guest TD for instructions that need to be emulated, disallowed
MSR accesses, etc.
3. By default memory is marked as private, and the TD selectively shares it with
the VMM based on need.
Note that the kernel will also need to be hardened against low-level inputs from
the now-untrusted host. This will be done in follow-on patches.
Please check the following info for more details about #VE on memory access:
== #VE on Memory Accesses ==
A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries.
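For illustration only, here is a minimal sketch of that selection. The
bit position and helper names are placeholders and not part of this
series; the real conversion flow also involves the TDX module:

  #include <linux/bits.h>		/* BIT_ULL() */
  #include <asm/pgtable_types.h>	/* pteval_t */

  /* Illustrative sketch: the TDX_SHARED_BIT position is a placeholder. */
  #define TDX_SHARED_BIT	BIT_ULL(51)

  static pteval_t example_mark_shared(pteval_t val)
  {
  	/* Shared: contents become visible to (and writable by) the VMM */
  	return val | TDX_SHARED_BIT;
  }

  static pteval_t example_mark_private(pteval_t val)
  {
  	/* Private: protected guest memory */
  	return val & ~TDX_SHARED_BIT;
  }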
=== #VE on Shared Pages ===
Accesses to shared mappings can cause #VEs. The hypervisor is in
control of when a #VE might occur, so the guest must be careful to only
reference shared pages when it is in a context that can safely handle a #VE.
However, shared mapping contents cannot be trusted since they are
writable by the hypervisor, so shared mappings are never used for
sensitive memory contents like stacks or kernel text. As a result, the
#VE-inducing property of shared mappings requires essentially no special
kernel handling in sensitive contexts like syscall entry or NMIs.
=== #VE on Private Pages ===
Some accesses to private mappings may cause #VEs. Before a mapping is
accepted (i.e. while it is in the SEPT_PENDING state), a reference causes
a #VE. But, after acceptance, references typically succeed.
The hypervisor can cause a private page reference to fail if it chooses
to move an accepted page to a "blocked" state. However, if it does
this, a page access will not generate a #VE. It will, instead, cause a
"TD Exit" where the hypervisor is required to handle the exception.
TDX-related documents can be found at the following link:
https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
Changes since v7:
* Included #VE details on memory access in cover letter.
* Change log is included per patch.
Changes since v6:
* Fixed #ifdef format as per Boris comment in patch titled
"x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT".
* Fixed commit log of "x86/tdx: Handle CPUID via #VE" as per Dave
Hansen review comments.
* Removed TDX changes from "x86/tdx: Add protected guest support for
TDX guest" and created "x86/tdx: Add Intel version of
cc_platform_has()"
* Rebased on top of Tom Lendacky's CC platform support patch series
https://lore.kernel.org/linux-iommu/f9951644147e27772bf4512325e8ba6472e363b7.1631141919.git.thomas.lendacky@amd.com/T/
* Rebased on top of v5.15-rc1.
Changes since v5:
* Moved patch titled "x86/tdx: Get TD execution environment
information via TDINFO" to the series which uses it.
* Rest of the change log is added per patch.
Changes since v4:
* Added a patch that adds TDX guest exception for CSTAR MSR.
* Rebased on top of Tom Lendacky's protected guest changes.
* Rest of the change log is added per patch.
Changes since v3:
* Moved generic protected guest changes from patch titled "x86:
Introduce generic protected guest abstraction" into a separate
patch outside this patchset. Also, TDX specific changes are
moved to patch titled "x86/tdx: Add protected guest support
for TDX guest"
* Rebased on top of v5.14-rc1.
* Rest of the change log is added per patch.
Changes since v1 (v2 is partial set submission):
* Patch titled "x86/x86: Add early_is_tdx_guest() interface" is moved
out of this series.
* Rest of the change log is added per patch.
Andi Kleen (1):
x86/tdx: Don't write CSTAR MSR on Intel
Kirill A. Shutemov (6):
x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
x86/traps: Add #VE support for TDX guest
x86/tdx: Add HLT support for TDX guest
x86/tdx: Wire up KVM hypercalls
x86/tdx: Add MSR support for TDX guest
x86/tdx: Handle CPUID via #VE
Kuppuswamy Sathyanarayanan (4):
x86/tdx: Introduce INTEL_TDX_GUEST config option
x86/cpufeatures: Add TDX Guest CPU feature
x86/tdx: Add Intel ARCH support to cc_platform_has()
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper
functions
arch/x86/Kconfig | 15 ++
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/irqflags.h | 46 ++--
arch/x86/include/asm/kvm_para.h | 22 ++
arch/x86/include/asm/mem_encrypt.h | 6 +
arch/x86/include/asm/paravirt.h | 20 +-
arch/x86/include/asm/paravirt_types.h | 3 +-
arch/x86/include/asm/tdx.h | 109 +++++++++
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/asm-offsets.c | 23 ++
arch/x86/kernel/cc_platform.c | 5 +
arch/x86/kernel/cpu/Makefile | 5 +
arch/x86/kernel/cpu/common.c | 11 +-
arch/x86/kernel/cpu/intel.c | 16 ++
arch/x86/kernel/head64.c | 8 +
arch/x86/kernel/idt.c | 6 +
arch/x86/kernel/paravirt.c | 3 +-
arch/x86/kernel/tdcall.S | 314 ++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 244 ++++++++++++++++++++
arch/x86/kernel/traps.c | 77 +++++++
include/linux/cc_platform.h | 9 +
23 files changed, 918 insertions(+), 33 deletions(-)
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/kernel/tdcall.S
create mode 100644 arch/x86/kernel/tdx.c
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, the features supported under CONFIG_PARAVIRT
are self-sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time-related operations.
For TDX guests as well, the paravirt calls under CONFIG_PARAVIRT meet
most of their requirements, except for the HLT and SAFE_HLT
paravirt calls, which are currently defined under
CONFIG_PARAVIRT_XXL.
Since enabling CONFIG_PARAVIRT_XXL is too bloated for platforms like
TDX guests, move the HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.
Moving the HLT and SAFE_HLT paravirt calls is not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
---
Changes since v7:
* None.
arch/x86/include/asm/irqflags.h | 46 ++++++++++++++++-----------
arch/x86/include/asm/paravirt.h | 20 ++++++------
arch/x86/include/asm/paravirt_types.h | 3 +-
arch/x86/kernel/paravirt.c | 3 +-
4 files changed, 41 insertions(+), 31 deletions(-)
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index c5ce9845c999..2a2ebf9af43e 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,27 +59,15 @@ static inline __cpuidle void native_halt(void)
#endif
-#ifdef CONFIG_PARAVIRT_XXL
-#include <asm/paravirt.h>
-#else
-#ifndef __ASSEMBLY__
-#include <linux/types.h>
+#ifdef CONFIG_PARAVIRT
-static __always_inline unsigned long arch_local_save_flags(void)
-{
- return native_save_fl();
-}
-
-static __always_inline void arch_local_irq_disable(void)
-{
- native_irq_disable();
-}
+# ifndef __ASSEMBLY__
+# include <asm/paravirt.h>
+# endif /* __ASSEMBLY__ */
-static __always_inline void arch_local_irq_enable(void)
-{
- native_irq_enable();
-}
+#else /* ! CONFIG_PARAVIRT */
+# ifndef __ASSEMBLY__
/*
* Used in the idle loop; sti takes one instruction cycle
* to complete:
@@ -97,6 +85,28 @@ static inline __cpuidle void halt(void)
{
native_halt();
}
+# endif /* __ASSEMBLY__ */
+
+#endif /* CONFIG_PARAVIRT */
+
+#ifndef CONFIG_PARAVIRT_XXL
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+static __always_inline unsigned long arch_local_save_flags(void)
+{
+ return native_save_fl();
+}
+
+static __always_inline void arch_local_irq_disable(void)
+{
+ native_irq_disable();
+}
+
+static __always_inline void arch_local_irq_enable(void)
+{
+ native_irq_enable();
+}
/*
* For spinlocks, etc:
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index da3a1ac82be5..d323a626c7a8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -97,6 +97,16 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
PVOP_VCALL1(mmu.exit_mmap, mm);
}
+static inline void arch_safe_halt(void)
+{
+ PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+ PVOP_VCALL0(irq.halt);
+}
+
#ifdef CONFIG_PARAVIRT_XXL
static inline void load_sp0(unsigned long sp0)
{
@@ -162,16 +172,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
}
-static inline void arch_safe_halt(void)
-{
- PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
- PVOP_VCALL0(irq.halt);
-}
-
static inline void wbinvd(void)
{
PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT(X86_FEATURE_XENPV));
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index d9d6b0203ec4..40082847f314 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -150,10 +150,9 @@ struct pv_irq_ops {
struct paravirt_callee_save save_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
void (*safe_halt)(void);
void (*halt)(void);
-#endif
} __no_randomize_layout;
struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 04cafc057bed..8cea6e75ba29 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -283,9 +283,10 @@ struct paravirt_patch_template pv_ops = {
.irq.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.irq.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
.irq.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+
.irq.safe_halt = native_safe_halt,
.irq.halt = native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
/* Mmu ops. */
.mmu.flush_tlb_user = native_flush_tlb_local,
--
2.25.1
Add INTEL_TDX_GUEST config option to selectively compile
TDX guest support.
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
---
Changes since v7:
* None
Changes since v6:
* None
Changes since v5:
* None
Changes since v4:
* None
Changes since v3:
* Removed PARAVIRT_XXL from the dependency list.
* Fixed typo in help content.
arch/x86/Kconfig | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2b2a9639d8ae..c42dd8a2d1f4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -865,6 +865,20 @@ config ACRN_GUEST
IOT with small footprint and real-time features. More details can be
found in https://projectacrn.org/.
+config INTEL_TDX_GUEST
+ bool "Intel Trusted Domain eXtensions Guest Support"
+ depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
+ depends on SECURITY
+ select X86_X2APIC
+ select SECURITY_LOCKDOWN_LSM
+ help
+ Provide support for running in a trusted domain on Intel processors
+ equipped with Trusted Domain eXtensions. TDX is a Intel technology
+ that extends VMX and Memory Encryption with a new kind of virtual
+ machine guest called Trust Domain (TD). A TD is designed to run in
+ a CPU mode that protects the confidentiality of TD memory contents
+ and the TD’s CPU state from other software, including VMM.
+
endif #HYPERVISOR_GUEST
source "arch/x86/Kconfig.cpu"
--
2.25.1
Add CPU feature detection for Trust Domain Extensions support. The TDX
feature adds capabilities to keep guest register state and memory
isolated from the hypervisor.
On TDX guest platforms, executing CPUID(eax=0x21, ecx=0) returns the
following values in EAX, EBX, ECX and EDX:
EAX: Maximum sub-leaf number: 0
EBX/EDX/ECX: Vendor string:
EBX = "Inte"
EDX = "lTDX"
ECX = " "
So when the above condition is true, set the X86_FEATURE_TDX_GUEST
feature cap bit.
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* Add comments for the order of tdx_early_init() call.
* Similar to AMD (sme_me_mask) added a global variable
is_tdx_guest for Intel TDX related checks.
* Changed pr_fmt from "x86/tdx: " to "tdx:"
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/tdx.h | 22 ++++++++++++++++
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/head64.c | 8 ++++++
arch/x86/kernel/tdx.c | 40 ++++++++++++++++++++++++++++++
5 files changed, 73 insertions(+)
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/kernel/tdx.c
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..84997abeb401 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
#define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
#define X86_FEATURE_PVUNLOCK ( 8*32+20) /* "" PV unlock function */
#define X86_FEATURE_VCPUPREEMPT ( 8*32+21) /* "" PV vcpu_is_preempted function */
+#define X86_FEATURE_TDX_GUEST ( 8*32+22) /* Trusted Domain Extensions Guest */
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..d552c7248bc7
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#include <linux/cpufeature.h>
+
+#define TDX_CPUID_LEAF_ID 0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+extern u64 is_tdx_guest;
+void __init tdx_early_init(void);
+
+#else
+
+#define is_tdx_guest 0ULL
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f91403a78594..159fccfece65 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,6 +29,7 @@ KASAN_SANITIZE_dumpstack_$(BITS).o := n
KASAN_SANITIZE_stacktrace.o := n
KASAN_SANITIZE_paravirt.o := n
KASAN_SANITIZE_sev.o := n
+KASAN_SANITIZE_tdx.o := n
# With some compiler versions the generated code results in boot hangs, caused
# by several compilation units. To be safe, disable all instrumentation.
@@ -127,6 +128,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f98c76a1d16c..97ce0d037387 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
#include <asm/extable.h>
#include <asm/trapnr.h>
#include <asm/sev.h>
+#include <asm/tdx.h>
/*
* Manage page tables very early on.
@@ -495,6 +496,13 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
copy_bootdata(__va(real_mode_data));
+ /*
+ * tdx_early_init() has dependency on command line parameters.
+ * So the order of calling it should be after copy_bootdata()
+ * (in which command line parameter is initialized).
+ */
+ tdx_early_init();
+
/*
* Load microcode early on BSP.
*/
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..ad3ff5925153
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <asm/tdx.h>
+
+/*
+ * Allocate it in the data region to avoid zeroing it during
+ * BSS initialization. It is mainly used in cc_platform_has()
+ * call during early boot call.
+ */
+u64 __section(".data") is_tdx_guest = 0;
+
+static void __init is_tdx_guest_init(void)
+{
+ u32 eax, sig[3];
+
+ if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
+ is_tdx_guest = 0;
+ return;
+ }
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
+
+ is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
+}
+
+void __init tdx_early_init(void)
+{
+ is_tdx_guest_init();
+
+ if (!is_tdx_guest)
+ return;
+
+ setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+ pr_info("Guest initialized\n");
+}
--
2.25.1
cc_platform_has() can be used to check for specific active confidential
computing attributes, like memory encryption. Intel platforms such as
Trust Domain Extensions (TDX) guests need this function to guard the
TDX-specific changes made in generic drivers.
So, extend cc_platform_has() and add support for an Intel architecture
variant (intel_cc_platform_has()). Also add TDX guest support to
intel_cc_platform_has().
Since intel_cc_platform_has() can be called with invalid %gs, it
cannot be instrumented. So compile it with KCOV/ASAN disabled.
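As an illustration of the intended usage (a hypothetical driver-side
sketch, not part of this patch), generic code can then gate its
TDX-specific behavior on the new attribute:

  #include <linux/cc_platform.h>

  /* Hypothetical example: guard a TDX-specific path in a generic driver */
  static void example_driver_setup(void)
  {
  	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
  		/* Running as a TDX guest: take the hardened/paravirt path */
  	} else {
  		/* Not a TDX guest: regular path */
  	}
  }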
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* Merged patches titled "x86/tdx: Add Intel ARCH support to cc_platform_has()" and
"x86/tdx: Add TDX guest support to intel_cc_platform_has()" into one patch.
* Used cpuid_has_tdx_guest() when adding Intel support to cc_platform_has().
Changes since v6:
* Used cc_platform_has() in place of prot_guest_has().
* Rebased on top of Tom Lendacky's CC platform support patch series.
https://lore.kernel.org/linux-iommu/f9951644147e27772bf4512325e8ba6472e363b7.1631141919.git.thomas.lendacky@amd.com/T/
Changes since v5:
* Replaced tdx_prot_guest_has() with intel_prot_guest_has() to
keep the Intel call non TDX specific.
* Added TDX guest support to intel_prot_guest_has().
Changes since v4:
* Rebased on top of Tom Lendacky's protected guest changes.
* Moved memory encryption related protected guest flags in
tdx_prot_guest_has() to the patch that actually uses them.
arch/x86/Kconfig | 1 +
arch/x86/include/asm/mem_encrypt.h | 6 ++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/cc_platform.c | 5 +++++
arch/x86/kernel/cpu/Makefile | 5 +++++
arch/x86/kernel/cpu/intel.c | 16 ++++++++++++++++
include/linux/cc_platform.h | 9 +++++++++
7 files changed, 43 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c42dd8a2d1f4..abb249dc829d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -871,6 +871,7 @@ config INTEL_TDX_GUEST
depends on SECURITY
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
+ select ARCH_HAS_CC_PLATFORM
help
Provide support for running in a trusted domain on Intel processors
equipped with Trusted Domain eXtensions. TDX is a Intel technology
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index ed954aa5c448..a73712b6ee0e 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -103,6 +103,12 @@ static inline u64 sme_get_me_mask(void)
return sme_me_mask;
}
+#if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_ARCH_HAS_CC_PLATFORM)
+bool intel_cc_platform_has(enum cc_attr attr);
+#else
+static inline bool intel_cc_platform_has(enum cc_attr attr) { return false; }
+#endif
+
#endif /* __ASSEMBLY__ */
#endif /* __X86_MEM_ENCRYPT_H__ */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 159fccfece65..8e5b49be65bd 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -28,6 +28,7 @@ KASAN_SANITIZE_dumpstack.o := n
KASAN_SANITIZE_dumpstack_$(BITS).o := n
KASAN_SANITIZE_stacktrace.o := n
KASAN_SANITIZE_paravirt.o := n
+KASAN_SANITIZE_cc_platform.o := n
KASAN_SANITIZE_sev.o := n
KASAN_SANITIZE_tdx.o := n
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 3c9bacd3c3f3..a84310ba1d8d 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -10,11 +10,16 @@
#include <linux/export.h>
#include <linux/cc_platform.h>
#include <linux/mem_encrypt.h>
+#include <linux/processor.h>
+
+#include <asm/tdx.h>
bool cc_platform_has(enum cc_attr attr)
{
if (sme_me_mask)
return amd_cc_platform_has(attr);
+ else if (is_tdx_guest)
+ return intel_cc_platform_has(attr);
return false;
}
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 637b499450d1..64d33160e377 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -16,6 +16,11 @@ KCOV_INSTRUMENT_perf_event.o := n
# As above, instrumenting secondary CPU boot code causes boot hangs.
KCSAN_SANITIZE_common.o := n
+# intel_cc_platform_has cannot be instrumented because it can be called
+# with invalid %gs
+KCOV_INSTRUMENT_intel.o := n
+KCSAN_SANITIZE_intel.o := n
+
# Make sure load_percpu_segment has no stackprotector
CFLAGS_common.o := -fno-stack-protector
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..b99ead877549 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -11,6 +11,7 @@
#include <linux/init.h>
#include <linux/uaccess.h>
#include <linux/delay.h>
+#include <linux/mem_encrypt.h>
#include <asm/cpufeature.h>
#include <asm/msr.h>
@@ -26,6 +27,7 @@
#include <asm/resctrl.h>
#include <asm/numa.h>
#include <asm/thermal.h>
+#include <asm/tdx.h>
#ifdef CONFIG_X86_64
#include <linux/topology.h>
@@ -60,6 +62,20 @@ static u64 msr_test_ctrl_cache __ro_after_init;
*/
static bool cpu_model_supports_sld __ro_after_init;
+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+bool intel_cc_platform_has(enum cc_attr attr)
+{
+ switch (attr) {
+ case CC_ATTR_GUEST_TDX:
+ return is_tdx_guest;
+ default:
+ return false;
+ }
+
+ return false;
+}
+#endif
+
/*
* Processors which have self-snooping capability can handle conflicting
* memory type across CPUs by snooping its own cache. However, there exists
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 253f3ea66cd8..26eb19f37d56 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -61,6 +61,15 @@ enum cc_attr {
* Examples include SEV-ES.
*/
CC_ATTR_GUEST_STATE_ENCRYPT,
+
+ /**
+ * @CC_ATTR_GUEST_TDX: Trusted Domain Extension Support
+ *
+ * The platform/OS is running as a TDX guest/virtual machine.
+ *
+ * Examples include Intel TDX.
+ */
+ CC_ATTR_GUEST_TDX,
};
#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
--
2.25.1
Guests communicate with VMMs using hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called TDCALL.
In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer (the TDX module) exists between host and guest to facilitate
secure communication. TDX guests communicate with the TDX module
using the TDCALL instruction.
Both TDX module and VMM communication use the TDCALL instruction. The
value of the RAX register when executing TDCALL determines the TDCALL
type: a value of 0 in RAX indicates a service request to the VMM, and
any other value indicates a TDX module call.
Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
__tdx_hypercall() - request services from the VMM.
__tdx_module_call() - communicate with the TDX Module.
Also define a wrapper, _tdx_hypercall(), to cover the common use case
of __tdx_hypercall() (it supplies a default output structure and BUG()s
on TDCALL failure). Since each use case of __tdx_module_call() is
different, it does not need a wrapper.
Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.
The rationale for choosing plain assembly over inline assembly is that
the __tdx_hypercall() implementation is over 70 lines of instructions
(with comments); implementing it in inline assembly would make it hard
to read.
Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.
For the registers used by the TDCALL instruction, please check the TDX
GHCI specification, sections 2.4 and 3.
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
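As an illustration, a hypothetical caller might use the helpers roughly
as below (a sketch only, not part of this patch; the leaf and
sub-function numbers are placeholders, and real callers such as
tdx_get_ve_info() appear later in this series):

  #include <asm/tdx.h>

  static u64 example_tdcall_usage(void)
  {
  	struct tdx_module_output mod_out = {0};
  	struct tdx_hypercall_output vmm_out = {0};
  	u64 err;

  	/* Service request to the TDX module (non-zero value in RAX) */
  	err = __tdx_module_call(1 /* placeholder leaf */, 0, 0, 0, 0, &mod_out);
  	if (err)
  		return err;

  	/* Service request to the VMM (RAX == 0, i.e. TDG.VP.VMCALL) */
  	err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, 1 /* placeholder sub-fn */,
  			      0, 0, 0, 0, &vmm_out);

  	/* TDCALL status is in err; the hypercall error code is in vmm_out.r10 */
  	return err ? err : vmm_out.r10;
  }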
Originally-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* None
Changes since v6:
* None
Changes since v5:
* Fixed commit log (vmcall -> VMCALL, vmlaunch -> VMLAUNCH).
* Fixed comments typo in __tdx_hypercall().
Changes since v4:
* None
Changes since v3:
* Fixed ARG7_SP_OFFSET value for CONFIG_FRAME_POINTER case.
* Fixed usage of we/you in comments.
Changes since v1:
* Fixed commit comments format issues as per review comments.
* Fixed empty line and other typo issues found in review.
* Removed do_tdx_hypercall() helper function and modified
__tdx_hypercall() to include do_tdx_hypercall() implementation
and to accept TDVMCALL type as argument.
* Since the number of arguments in __tdx_hypercall() is more than
6, it has been modified to get the 7th argument from the caller stack.
* Instead of leaving output pointer in register before making
TDCALL, stored it in a stack.
* Instead of triggering ud2 for TDCALL failures in __tdx_hypercall(),
it is modified to return the TDCALL status as a return value. We let
the caller add appropriate error info before triggering a fatal error. Also,
extended struct tdx_hypercall_output to store the r10 register, which
contains the hypercall error code.
* Included TDCALL ABI details in __tdx_module_call() and __tdx_hypercall.
* Removed the tdx_hypercall_out_r11() helper function since it is not
really useful.
* Added _tdx_hypercall() as a wrapper for __tdx_hypercall() with BUG_ON
check for TDCALL failure.
arch/x86/include/asm/tdx.h | 40 +++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 23 +++
arch/x86/kernel/tdcall.S | 282 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 23 +++
5 files changed, 369 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d552c7248bc7..8d83a719b90b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -4,14 +4,54 @@
#define _ASM_X86_TDX_H
#include <linux/cpufeature.h>
+#include <linux/types.h>
#define TDX_CPUID_LEAF_ID 0x21
+#define TDX_HYPERCALL_STANDARD 0
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+ u64 r10;
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};
#ifdef CONFIG_INTEL_TDX_GUEST
extern u64 is_tdx_guest;
void __init tdx_early_init(void);
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out);
+
#else
#define is_tdx_guest 0ULL
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8e5b49be65bd..62738f3ca768 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -129,7 +129,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index ecd3fd6993d1..0494ec01218d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
#include <xen/interface/xen.h>
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
@@ -68,6 +72,25 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offset for fields in tdx_module_output */
+ OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+ /* Offset for fields in tdx_hypercall_output */
+ OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_output, r10);
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..2e70133bebf2
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,282 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+#define TDG_R10 BIT(10)
+#define TDG_R11 BIT(11)
+#define TDG_R12 BIT(12)
+#define TDG_R13 BIT(13)
+#define TDG_R14 BIT(14)
+#define TDG_R15 BIT(15)
+
+#ifdef CONFIG_FRAME_POINTER
+/* Frame offset + 8 (for arg1) */
+#define ARG7_SP_OFFSET (FRAME_OFFSET + 0x08)
+#else
+#define ARG7_SP_OFFSET (0x08)
+#endif
+
+/*
+ * Expose registers R10-R15 to VMM. It is passed via RCX register
+ * to the TDX Module, which will be used by the TDX module to
+ * identify the list of registers exposed to VMM. Each bit in this
+ * mask represents a register ID. Bit field details can be found
+ * in TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDG_R10 | TDG_R11 | \
+ TDG_R12 | TDG_R13 | \
+ TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call() - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by TDCALL ABI and share it with the
+ * TDX module. If the TDCALL operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the TDCALL operation is returned
+ * back to the user as a function return value.
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - TDCALL Leaf number.
+ * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction error code.
+ * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module). It
+ * can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+
+ /*
+ * R12 will be used as temporary storage for
+ * struct tdx_module_output pointer. More
+ * details about struct tdx_module_output can
+ * be found in arch/x86/include/asm/tdx.h. Also
+ * note that registers R12-R15 are not used by
+ * TDCALL services supported by this helper
+ * function.
+ */
+
+ /* Callee saved, so preserve it */
+ push %r12
+
+ /*
+ * Push output pointer to stack, after TDCALL operation,
+ * it will be fetched into R12 register.
+ */
+ push %r9
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Move TDCALL Leaf ID to RAX */
+ mov %rdi, %rax
+ /* Move input 4 to R9 */
+ mov %r8, %r9
+ /* Move input 3 to R8 */
+ mov %rcx, %r8
+ /* Move input 1 to RCX */
+ mov %rsi, %rcx
+ /* Leave input param 2 in RDX */
+
+ tdcall
+
+ /* Fetch output pointer from stack to R12 */
+ pop %r12
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /*
+ * __tdx_module_call() can be initiated without an output pointer.
+ * So, check if caller provided an output struct before storing
+ * output registers.
+ */
+ test %r12, %r12
+ jz 1f
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDX_MODULE_rcx(%r12)
+ movq %rdx, TDX_MODULE_rdx(%r12)
+ movq %r8, TDX_MODULE_r8(%r12)
+ movq %r9, TDX_MODULE_r9(%r12)
+ movq %r10, TDX_MODULE_r10(%r12)
+ movq %r11, TDX_MODULE_r11(%r12)
+1:
+ /* Restore the state of R12 register */
+ pop %r12
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using TDCALL instruction.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by TDCALL ABI and share it with VMM
+ * via the TDX module. After TDCALL operation, output from the VMM is
+ * saved to the memory specified in the "out" (struct tdx_hypercall_output)
+ * pointer.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX - BITMAP which controls which part of TD Guest GPR
+ * is passed as-is to VMM and back.
+ * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
+ * specification. Non zero value indicates vendor
+ * specific ABI.
+ * R11 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12–R15 - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction status (Not related to hypercall
+ * output).
+ * R10 - Hypercall output error code.
+ * R11-R15 - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @type (RDI) - TD VMCALL type, moved to R10
+ * @fn (RSI) - TD VMCALL sub function, moved to R11
+ * @r12 (RDX) - Input parameter 1, moved to R12
+ * @r13 (RCX) - Input parameter 2, moved to R13
+ * @r14 (R8) - Input parameter 3, moved to R14
+ * @r15 (R9) - Input parameter 4, moved to R15
+ *
+ * @out (stack) - struct tdx_hypercall_output pointer (cannot be NULL)
+ *
+ * On successful completion, return TDCALL status or -EINVAL for invalid
+ * inputs.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+
+ /* Move argument 7 from caller stack to RAX */
+ movq ARG7_SP_OFFSET(%rsp), %rax
+
+ /* Check if caller provided an output struct */
+ test %rax, %rax
+ /* If out pointer is NULL, return -EINVAL */
+ jz 1f
+
+ /* Save callee-saved GPRs as mandated by the x86_64 ABI */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /*
+ * Save R9 and output pointer (rax) in stack, it will be used
+ * again when storing the output registers after TDCALL
+ * operation.
+ */
+ push %r9
+ push %rax
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+ xor %eax, %eax
+ /* Move TDVMCALL type (standard vs vendor) in R10 */
+ mov %rdi, %r10
+ /* Move TDVMCALL sub function id to R11 */
+ mov %rsi, %r11
+ /* Move input 1 to R12 */
+ mov %rdx, %r12
+ /* Move input 2 to R13 */
+ mov %rcx, %r13
+ /* Move input 3 to R14 */
+ mov %r8, %r14
+ /* Move input 4 to R15 */
+ mov %r9, %r15
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /* Restore output pointer to R9 */
+ pop %r9
+
+ /* Copy hypercall result registers to output struct: */
+ movq %r10, TDX_HYPERCALL_r10(%r9)
+ movq %r11, TDX_HYPERCALL_r11(%r9)
+ movq %r12, TDX_HYPERCALL_r12(%r9)
+ movq %r13, TDX_HYPERCALL_r13(%r9)
+ movq %r14, TDX_HYPERCALL_r14(%r9)
+ movq %r15, TDX_HYPERCALL_r15(%r9)
+
+ /*
+ * Zero out registers exposed to the VMM to avoid
+ * speculative execution with VMM-controlled values.
+ * This needs to include all registers present in
+ * TDVMCALL_EXPOSE_REGS_MASK (except R12-R15).
+ * R12-R15 context will be restored.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+
+ /* Restore state of R9 register */
+ pop %r9
+
+ /* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ jmp 2f
+1:
+ movq $(-EINVAL), %rax
+2:
+ FRAME_END
+
+ retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ad3ff5925153..d5fc2904facf 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -27,6 +27,29 @@ static void __init is_tdx_guest_init(void)
is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
}
+/*
+ * Wrapper for standard use of __tdx_hypercall with BUG_ON() check
+ * for TDCALL error.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out)
+{
+ struct tdx_hypercall_output outl = {0};
+ u64 err;
+
+ /* __tdx_hypercall() does not accept NULL output pointer */
+ if (!out)
+ out = &outl;
+
+ err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, fn, r12, r13, r14,
+ r15, out);
+
+ /* Non zero return value indicates buggy TDX module, so panic */
+ BUG_ON(err);
+
+ return out->r10;
+}
+
void __init tdx_early_init(void)
{
is_tdx_guest_init();
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to TD-shared memory, which includes MMIO
In the settings that Linux will run in, virtual exceptions are never
generated on accesses to normal, TD-private memory that has been
accepted.
The entry paths do not access TD-shared memory, MMIO regions or use
those specific MSRs, instructions, CPUID leaves that might generate #VE.
In addition, all interrupts including NMIs are blocked by the hardware
starting with #VE delivery until TDGETVEINFO is called. This eliminates
the chance of a #VE during the syscall gap or paranoid entry paths and
simplifies #VE handling.
After TDGETVEINFO, a #VE could happen in theory (e.g. through an NMI),
but it is not expected to because TDX expects NMIs not to trigger
#VEs. Another case where one could happen is if the #VE exception
handler itself panics, but in that case there are no guarantees on
anything anyway.
If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest which will result in an oops (and should eventually be a panic, as
it is expected panic_on_oops is set to 1 for TDX guests).
Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace. Later patches will add handling for specific #VE scenarios.
Convert unhandled #VEs (everything, until later in this series) so that
they appear just like a #GP, by calling ve_raise_fault() directly.
ve_raise_fault() is similar to the #GP handler: it is responsible for
sending SIGSEGV to userspace or making the CPU die, and for notifying
debuggers and other die-chain users.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* None
Changes since v6:
* None
Changes since v5:
* Fixed "We" usage in commit log and replaced cpu -> CPU.
* Renamed "tdg_" prefix with "tdx_".
* Removed TODO comment in tdg_handle_virtualization_exception() as
per Boris review comments.
* Added comments for ve_raise_fault().
Changes since v4:
* Since ve_raise_fault() is used only by TDX code, moved it
within #ifdef CONFIG_INTEL_TDX_GUEST.
Changes since v3:
* None
arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/tdx.h | 19 ++++++++
arch/x86/kernel/idt.c | 6 +++
arch/x86/kernel/tdx.c | 33 ++++++++++++++
arch/x86/kernel/traps.c | 77 +++++++++++++++++++++++++++++++++
5 files changed, 139 insertions(+)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8d83a719b90b..458a564dd4c2 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -39,6 +39,20 @@ struct tdx_hypercall_output {
u64 r15;
};
+/*
+ * Used by #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ u64 gla; /* Guest Linear (virtual) Address */
+ u64 gpa; /* Guest Physical (virtual) Address */
+ u32 instr_len;
+ u32 instr_info;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST
extern u64 is_tdx_guest;
@@ -52,6 +66,11 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
u64 r15, struct tdx_hypercall_output *out);
+unsigned long tdx_get_ve_info(struct ve_info *ve);
+
+int tdx_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve);
+
#else
#define is_tdx_guest 0ULL
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..a5eaae8e6c44 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};
/*
@@ -91,6 +94,9 @@ static const __initconst struct idt_data def_idts[] = {
INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
INTG(X86_TRAP_AC, asm_exc_alignment_check),
INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
#ifdef CONFIG_X86_32
TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d5fc2904facf..f7885c777a09 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,9 @@
#include <asm/tdx.h>
+/* TDX Module call Leaf IDs */
+#define TDGETVEINFO 3
+
/*
* Allocate it in the data region to avoid zeroing it during
* BSS initialization. It is mainly used in cc_platform_has()
@@ -50,6 +53,36 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}
+unsigned long tdx_get_ve_info(struct ve_info *ve)
+{
+ struct tdx_module_output out = {0};
+ u64 ret;
+
+ /*
+ * NMIs and machine checks are suppressed. Before this point any
+ * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+ * additional #VEs are permitted (but it is expected not to
+ * happen unless kernel panics).
+ */
+ ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = out.r10 & UINT_MAX;
+ ve->instr_info = out.r10 >> 32;
+
+ return ret;
+}
+
+int tdx_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EFAULT;
+}
+
void __init tdx_early_init(void)
{
is_tdx_guest_init();
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..152d1d3b9dc8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1140,6 +1141,82 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}
+#ifdef CONFIG_INTEL_TDX_GUEST
+#define VEFSTR "VE fault"
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+ struct task_struct *tsk = current;
+
+ if (user_mode(regs)) {
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * Not fixing up VDSO exceptions similar to #GP handler
+ * because it is expected that VDSO doesn't trigger #VE.
+ */
+ show_signal(tsk, SIGSEGV, "", VEFSTR, regs, error_code);
+ force_sig(SIGSEGV);
+ return;
+ }
+
+ /*
+ * Attempt to recover from #VE exception failure without
+ * triggering OOPS (useful for MSR read/write failures)
+ */
+ if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+ return;
+
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * To be potentially processing a kprobe fault and to trust the result
+ * from kprobe_running(), it should be non-preemptible.
+ */
+ if (!preemptible() &&
+ kprobe_running() &&
+ kprobe_fault_handler(regs, X86_TRAP_VE))
+ return;
+
+ /* Notify about #VE handling failure, useful for debugger hooks */
+ if (notify_die(DIE_GPF, VEFSTR, regs, error_code,
+ X86_TRAP_VE, SIGSEGV) == NOTIFY_STOP)
+ return;
+
+ /* Trigger OOPS and panic */
+ die_addr(VEFSTR, regs, error_code, 0);
+}
+
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+ int ret;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+ /*
+ * NMIs/Machine-checks/Interrupts will be in a disabled state
+ * till TDGETVEINFO TDCALL is executed. This prevents #VE
+ * nesting issue.
+ */
+ ret = tdx_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ if (!ret)
+ ret = tdx_handle_virtualization_exception(regs, &ve);
+ /*
+ * If tdx_handle_virtualization_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (ret)
+ ve_raise_fault(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
Per the Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 3.8,
TDVMCALL[Instruction.HLT] provides the HLT operation. Use it to implement
the halt() and safe_halt() paravirtualization calls.
The same TDX hypercall is used to handle the #VE exception due to
EXIT_REASON_HLT.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* Added section title to spec reference in commit log and comments.
* Added extra comments as per review suggestion.
Changes since v6:
* None
Changes since v5:
* Replaced sti with STI in commit log and comments.
* Added comments for _tdx_hypercall() usage in _tdx_halt().
* Added new helper function _tdx_halt() to contain common
code between tdx_halt() and tdx_safe_halt().
* Renamed tdg_->tdx_.
* Removed BUG_ON() and used WARN_ONCE() for HLT emulation
failure.
Changes since v4:
* Added exception for EXIT_REASON_HLT in __tdx_hypercall() to
enable interrupts using sti.
Changes since v3:
* None
arch/x86/kernel/tdcall.S | 30 ++++++++++++++++
arch/x86/kernel/tdx.c | 75 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 103 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 2e70133bebf2..1b9649ec2e29 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -40,6 +40,9 @@
*/
#define tdcall .byte 0x66,0x0f,0x01,0xcc
+/* HLT TDVMCALL sub-function ID */
+#define EXIT_REASON_HLT 12
+
/*
* __tdx_module_call() - Helper function used by TDX guests to request
* services from the TDX module (does not include VMM services).
@@ -240,6 +243,33 @@ SYM_FUNC_START(__tdx_hypercall)
movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+ /*
+ * For the idle loop STI needs to be called directly before
+ * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
+ * enables interrupts only one instruction later. If there
+ * are any instructions between the STI and the TDCALL for
+ * HLT then an interrupt could happen in that time, but the
+ * code would go back to sleep afterwards, which can cause
+ * longer delays.
+ *
+ * This leads to significant difference in network performance
+ * benchmarks. So add a special case for EXIT_REASON_HLT to
+ * trigger STI before TDCALL. But this change is not required
+ * for all HLT cases. So use R15 register value to identify the
+ * case which needs STI. So, if R11 is EXIT_REASON_HLT and R15
+ * is 1, then call STI before TDCALL instruction. Note that R15
+ * register is not required by TDCALL ABI when triggering the
+ * hypercall for EXIT_REASON_HLT case. So use it in software to
+ * select the STI case.
+ */
+ cmpl $EXIT_REASON_HLT, %r11d
+ jne skip_sti
+ cmpl $1, %r15d
+ jne skip_sti
+ /* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
+ xor %r15, %r15
+ sti
+skip_sti:
tdcall
/* Restore output pointer to R9 */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f7885c777a09..3d0416515506 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,7 @@
#define pr_fmt(fmt) "tdx: " fmt
#include <asm/tdx.h>
+#include <asm/vmx.h>
/* TDX Module call Leaf IDs */
#define TDGETVEINFO 3
@@ -53,6 +54,62 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}
+static __cpuidle void _tdx_halt(const bool irq_disabled, const bool do_sti)
+{
+ u64 ret;
+
+ /*
+ * Emulate HLT operation via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), sec 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+ *
+ * The VMM uses the "IRQ disabled" param to understand IRQ
+ * enabled status (RFLAGS.IF) of TD guest and determine
+ * whether or not it should schedule the halted vCPU if an
+ * IRQ becomes pending. E.g. if IRQs are disabled the VMM
+ * can keep the vCPU in virtual HLT, even if an IRQ is
+ * pending, without hanging/breaking the guest.
+ *
+ * do_sti parameter is used by __tdx_hypercall() to decide
+ * whether to call STI instruction before executing TDCALL
+ * instruction.
+ */
+ ret = _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0, do_sti, NULL);
+
+ /*
+ * Use WARN_ONCE() to report the failure. Since tdx_*halt() calls
+ * are also used in pv_ops, #VE error handler cannot be used to
+ * report the failure.
+ */
+ WARN_ONCE(ret, "HLT instruction emulation failed\n");
+}
+
+static __cpuidle void tdx_halt(void)
+{
+ const bool irq_disabled = irqs_disabled();
+ const bool do_sti = false;
+
+ /*
+ * Non safe halt is mainly used in CPU off-lining
+ * and the guest will stay in halt state. So,
+ * STI instruction call is not required (set
+ * do_sti as false).
+ */
+ _tdx_halt(irq_disabled, do_sti);
+}
+
+static __cpuidle void tdx_safe_halt(void)
+{
+ /*
+ * Since STI instruction will be called in __tdx_hypercall()
+ * set irq_disabled as false.
+ */
+ const bool irq_disabled = false;
+ const bool do_sti = true;
+
+ _tdx_halt(irq_disabled, do_sti);
+}
+
unsigned long tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out = {0};
@@ -79,8 +136,19 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
int tdx_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve)
{
- pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return -EFAULT;
+ switch (ve->exit_reason) {
+ case EXIT_REASON_HLT:
+ tdx_halt();
+ break;
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EFAULT;
+ }
+
+ /* After successful #VE handling, move the IP */
+ regs->ip += ve->instr_len;
+
+ return 0;
}
void __init tdx_early_init(void)
@@ -92,5 +160,8 @@ void __init tdx_early_init(void)
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+ pv_ops.irq.safe_halt = tdx_safe_halt;
+ pv_ops.irq.halt = tdx_halt;
+
pr_info("Guest initialized\n");
}
--
2.25.1
From: Andi Kleen <[email protected]>
On Intel CPUs writing the CSTAR MSR is not really needed: syscalls
from 32-bit code use SYSENTER, and 32-bit SYSCALL is an illegal opcode.
But the kernel wrote it anyway, even though it was ignored by
the CPU. Inside a TDX guest this write leads to a #VE, which in
turn triggers ve_raise_fault() due to the failed MSR write. Before
ve_raise_fault() recovers from this error, it prints an
ugly message at boot. Since such a warning is pointless for a
CSTAR MSR write failure, add an exception to skip the CSTAR MSR write
on Intel CPUs.
Signed-off-by: Andi Kleen <[email protected]>
---
Changes since v7:
* None.
Changes since v6:
* None.
Changes since v5:
* Fixed commit log as per review comments.
arch/x86/kernel/cpu/common.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0f8885949e8c..fd10f1044157 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1758,7 +1758,13 @@ void syscall_init(void)
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
#ifdef CONFIG_IA32_EMULATION
- wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
+ /*
+ * CSTAR is not needed on Intel because it doesn't support
+ * 32bit SYSCALL, but only SYSENTER. On a TDX guest
+ * it leads to a #GP.
+ */
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
/*
* This only works on Intel CPUs.
* On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
@@ -1770,7 +1776,8 @@ void syscall_init(void)
(unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
#else
- wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
When running virtualized, the CPUID instruction is handled differently
based on the leaf being accessed. The behavior depends only on the
leaf and applies equally to both kernel/ring-0 and userspace/ring-3
execution of CPUID. Historically, there are two basic classes:
* Leaves handled transparently to the guest
* Leaves handled by the VMM
In a typical guest without TDX, "handled by the VMM" leaves cause a
VMEXIT. TDX replaces these VMEXITs with a #VE exception in the guest.
The guest typically handles the #VE by making a hypercall to the VMM.
The TDX module specification [1], section titled "CPUID Virtualization"
talks about a few more classes of CPUID handling. But, for the purposes
of this patch, the "handled transparently" CPUID leaves are all lumped
together because the guest handling is the same.
[1] - https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* None
Changes since v6:
* Fixed commit log as per Dave Hansen comments.
* Added extra comments for _tdx_hypercall() usage.
Changes since v5:
* Fixed commit log as per review comments.
* Removed WARN_ON() in tdx_handle_cpuid().
* Renamed "tdg" prefix with "tdx".
Changes since v4:
* None
Changes since v3:
* None
arch/x86/kernel/tdx.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 062ac4720434..11e367228e96 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -145,6 +145,31 @@ static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
return ret ? -EIO : 0;
}
+static u64 tdx_handle_cpuid(struct pt_regs *regs)
+{
+ struct tdx_hypercall_output out = {0};
+ u64 ret;
+
+ /*
+ * Emulate CPUID instruction via hypercall. More info about
+ * ABI can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
+ */
+ ret = _tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
+
+ /*
+ * As per the TDX GHCI CPUID ABI, the R12-R15 registers contain the
+ * contents of EAX, EBX, ECX and EDX after the emulated CPUID
+ * instruction, so copy them back to pt_regs.
+ */
+ regs->ax = out.r12;
+ regs->bx = out.r13;
+ regs->cx = out.r14;
+ regs->dx = out.r15;
+
+ return ret;
+}
+
unsigned long tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out = {0};
@@ -188,6 +213,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_MSR_WRITE:
ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
break;
+ case EXIT_REASON_CPUID:
+ ret = tdx_handle_cpuid(regs);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests. Make
vendor-specific TDVMCALLs instead of VMCALL. This enables TDX guests to
run with KVM acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisors' hypercalls.
Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.
Also, add asm/tdx.h to asm/asm-prototypes.h so that the asm symbol's
checksum can be generated (needed to support CONFIG_MODVERSIONS),
which fixes:
WARNING: modpost: EXPORT symbol "__tdx_hypercall" [vmlinux] version \
generation failed, symbol will not be versioned.
[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* None
Changes since v6:
* Used cc_platform_has() in place of prot_guest_has().
Changes since v5:
* Added more info about version generation failed error in commit
log.
* Fixed commit log as per review comments.
* Removed CONFIG_INTEL_TDX_GUEST_KVM and used
CONFIG_KVM_GUEST/CONFIG_INTEL_TDX_GUEST for TDX KVM hypercall
implementation.
* Used EXPORT_SYMBOL_GPL for __tdx_hypercall() export.
Changes since v4:
* No functional changes.
Changes since v3:
* Fixed ASM sysmbol generation issue in tdcall.S by including tdx.h
in asm-prototypes.h
Changes since v1:
* Replaced is_tdx_guest() with prot_guest_has(PR_GUEST_TDX).
* Replaced tdx_kvm_hypercall{1-4} with single generic
function tdx_kvm_hypercall().
* Removed __tdx_hypercall_vendor_kvm() and re-used __tdx_hypercall().
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/kvm_para.h | 22 ++++++++++++++++++
arch/x86/include/asm/tdx.h | 32 +++++++++++++++++++++++++--
arch/x86/kernel/tdcall.S | 2 ++
4 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index 4cb726c71ed8..404add7ee720 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -6,6 +6,7 @@
#include <asm/page.h>
#include <asm/checksum.h>
#include <asm/mce.h>
+#include <asm/tdx.h>
#include <asm-generic/asm-prototypes.h>
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 69299878b200..e51977e77590 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -4,7 +4,9 @@
#include <asm/processor.h>
#include <asm/alternative.h>
+#include <asm/tdx.h>
#include <linux/interrupt.h>
+#include <linux/cc_platform.h>
#include <uapi/asm/kvm_para.h>
#ifdef CONFIG_KVM_GUEST
@@ -32,6 +34,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
+
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
@@ -42,6 +48,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
+
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
@@ -53,6 +63,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
+
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
@@ -64,6 +78,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
+
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -76,6 +94,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
+
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 458a564dd4c2..ebb97e082376 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -6,8 +6,9 @@
#include <linux/cpufeature.h>
#include <linux/types.h>
-#define TDX_CPUID_LEAF_ID 0x21
-#define TDX_HYPERCALL_STANDARD 0
+#define TDX_CPUID_LEAF_ID 0x21
+#define TDX_HYPERCALL_STANDARD 0
+#define TDX_HYPERCALL_VENDOR_KVM 0x4d564b2e584454 /* TDX.KVM */
/*
* Used in __tdx_module_call() helper function to gather the
@@ -78,4 +79,31 @@ static inline void tdx_early_init(void) { };
#endif /* CONFIG_INTEL_TDX_GUEST */
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
+static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ struct tdx_hypercall_output out;
+ u64 err;
+
+ err = __tdx_hypercall(TDX_HYPERCALL_VENDOR_KVM, nr, p1, p2,
+ p3, p4, &out);
+
+ /*
+ * A non-zero return value means a buggy TDX module, which is
+ * fatal, so use BUG_ON() to panic.
+ */
+ BUG_ON(err);
+
+ return out.r10;
+}
+#else
+static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 1b9649ec2e29..fa87f5e2cf29 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
#include <asm/asm.h>
#include <asm/frame.h>
#include <asm/unwind_hints.h>
+#include <asm/export.h>
#include <linux/linkage.h>
#include <linux/bits.h>
@@ -310,3 +311,4 @@ skip_sti:
retq
SYM_FUNC_END(__tdx_hypercall)
+EXPORT_SYMBOL_GPL(__tdx_hypercall);
--
2.25.1
From: "Kirill A. Shutemov" <[email protected]>
Use hypercall to emulate MSR read/write for TDX platform.
TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide MSR
operations.
RDMSR and WRMSR specification details can be found in
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, sec titled "TDG.VP.
VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v7:
* Removed tdx_is_context_switched_msr() support (since the list
is incomplete).
* Added section title to spec reference.
Changes since v6:
* None
Changes since v5:
* Renamed "tdg" prefix with "tdx".
* Added comments for _tdx_hypercall() usage in MSR read/write functions.
Change since v4:
* Removed You usage from commit log.
Changes since v3:
* None
arch/x86/kernel/tdx.c | 53 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 51 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3d0416515506..062ac4720434 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -110,6 +110,41 @@ static __cpuidle void tdx_safe_halt(void)
_tdx_halt(irq_disabled, do_sti);
}
+static u64 tdx_read_msr_safe(unsigned int msr, int *err)
+{
+ struct tdx_hypercall_output out = {0};
+ u64 ret;
+
+ /*
+ * Emulate the MSR read via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
+ */
+ ret = _tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
+
+ *err = ret ? -EIO : 0;
+
+ return out.r11;
+}
+
+static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
+ unsigned int high)
+{
+ u64 ret;
+
+ WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
+
+ /*
+ * Emulate the MSR write via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
+ */
+ ret = _tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
+ 0, 0, NULL);
+
+ return ret ? -EIO : 0;
+}
+
unsigned long tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out = {0};
@@ -136,19 +171,33 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
int tdx_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve)
{
+ unsigned long val;
+ int ret = 0;
+
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
tdx_halt();
break;
+ case EXIT_REASON_MSR_READ:
+ val = tdx_read_msr_safe(regs->cx, (unsigned int *)&ret);
+ if (!ret) {
+ regs->ax = (u32)val;
+ regs->dx = val >> 32;
+ }
+ break;
+ case EXIT_REASON_MSR_WRITE:
+ ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
}
/* After successful #VE handling, move the IP */
- regs->ip += ve->instr_len;
+ if (!ret)
+ regs->ip += ve->instr_len;
- return 0;
+ return ret;
}
void __init tdx_early_init(void)
--
2.25.1
On 10/4/21 7:51 PM, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c42dd8a2d1f4..abb249dc829d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -871,6 +871,7 @@ config INTEL_TDX_GUEST
> depends on SECURITY
> select X86_X2APIC
> select SECURITY_LOCKDOWN_LSM
> + select ARCH_HAS_CC_PLATFORM
Where is ARCH_HAS_CC_PLATFORM defined, please?
I can't seem to find it.
> help
> Provide support for running in a trusted domain on Intel processors
> equipped with Trusted Domain eXtensions. TDX is a Intel technology
thanks.
--
~Randy
On 10/4/21 7:51 PM, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2b2a9639d8ae..c42dd8a2d1f4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -865,6 +865,20 @@ config ACRN_GUEST
> IOT with small footprint and real-time features. More details can be
> found inhttps://projectacrn.org/.
>
> +config INTEL_TDX_GUEST
> + bool "Intel Trusted Domain eXtensions Guest Support"
> + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
> + depends on SECURITY
> + select X86_X2APIC
Apparently some Intel CPUs don't have the x2apic feature, since the
Kconfig help text for X86_X2APIC says:
This enables x2apic support on CPUs that have this feature.
so how is it safe to set/enable/select that kconfig symbol?
Will the x2apic code just safely not work if the h/w feature is
missing?
> + select SECURITY_LOCKDOWN_LSM
> + help
> + Provide support for running in a trusted domain on Intel processors
> + equipped with Trusted Domain eXtensions. TDX is a Intel technology
> + that extends VMX and Memory Encryption with a new kind of virtual
> + machine guest called Trust Domain (TD). A TD is designed to run in
> + a CPU mode that protects the confidentiality of TD memory contents
> + and the TD’s CPU state from other software, including VMM.
thanks.
--
~Randy
On 10/4/21 9:47 PM, Randy Dunlap wrote:
> On 10/4/21 7:51 PM, Kuppuswamy Sathyanarayanan wrote:
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index c42dd8a2d1f4..abb249dc829d 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -871,6 +871,7 @@ config INTEL_TDX_GUEST
>> depends on SECURITY
>> select X86_X2APIC
>> select SECURITY_LOCKDOWN_LSM
>> + select ARCH_HAS_CC_PLATFORM
>
> Where is ARCH_HAS_CC_PLATFORM defined, please?
> I can't seem to find it.
It has been merged into the tip tree:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=46b49b12f3fc5e1347dba37d4639e2165f447871
>
>> help
>> Provide support for running in a trusted domain on Intel processors
>> equipped with Trusted Domain eXtensions. TDX is a Intel technology
>
>
> thanks.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Mon, Oct 4, 2021 at 9:53 PM Randy Dunlap <[email protected]> wrote:
>
> On 10/4/21 7:51 PM, Kuppuswamy Sathyanarayanan wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 2b2a9639d8ae..c42dd8a2d1f4 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -865,6 +865,20 @@ config ACRN_GUEST
> > IOT with small footprint and real-time features. More details can be
> > found inhttps://projectacrn.org/.
> >
> > +config INTEL_TDX_GUEST
> > + bool "Intel Trusted Domain eXtensions Guest Support"
> > + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
> > + depends on SECURITY
> > + select X86_X2APIC
>
> Apparently some Intel CPUs don't have the x2apic feature, since the
> Kconfig help text for X86_X2APIC says:
>
> This enables x2apic support on CPUs that have this feature.
>
> so how is it safe to set/enable/select that kconfig symbol?
>
> Will the x2apic code just safely not work if the h/w feature is
> missing?
For the TDX guest, x2apic will be emulated. So it will exist in our
case. Even if x2apic or TDX guest is not supported by CPU, it will
boot just fine.
>
> > + select SECURITY_LOCKDOWN_LSM
> > + help
> > + Provide support for running in a trusted domain on Intel processors
> > + equipped with Trusted Domain eXtensions. TDX is a Intel technology
> > + that extends VMX and Memory Encryption with a new kind of virtual
> > + machine guest called Trust Domain (TD). A TD is designed to run in
> > + a CPU mode that protects the confidentiality of TD memory contents
> > + and the TD’s CPU state from other software, including VMM.
>
>
> thanks.
> --
> ~Randy
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Tue, Oct 05, 2021 at 06:29:43AM -0700, Sathyanarayanan Kuppuswamy Natarajan wrote:
> For the TDX guest, x2apic will be emulated. So it will exist in our
> case. Even if x2apic or TDX guest is not supported by CPU, it will
> boot just fine.
Sure but why is it "select" and not "depends"?
Why does X86_X2APIC need to be selected while the others are "depends"?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/5/21 6:29 AM, Sathyanarayanan Kuppuswamy Natarajan wrote:
> On Mon, Oct 4, 2021 at 9:53 PM Randy Dunlap <[email protected]> wrote:
>> On 10/4/21 7:51 PM, Kuppuswamy Sathyanarayanan wrote:
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 2b2a9639d8ae..c42dd8a2d1f4 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -865,6 +865,20 @@ config ACRN_GUEST
>>> IOT with small footprint and real-time features. More details can be
>>> found inhttps://projectacrn.org/.
>>>
>>> +config INTEL_TDX_GUEST
>>> + bool "Intel Trusted Domain eXtensions Guest Support"
>>> + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
>>> + depends on SECURITY
>>> + select X86_X2APIC
>> Apparently some Intel CPUs don't have the x2apic feature, since the
>> Kconfig help text for X86_X2APIC says:
>>
>> This enables x2apic support on CPUs that have this feature.
>>
>> so how is it safe to set/enable/select that kconfig symbol?
>>
>> Will the x2apic code just safely not work if the h/w feature is
>> missing?
> For the TDX guest, x2apic will be emulated. So it will exist in our
> case. Even if x2apic or TDX guest is not supported by CPU, it will
> boot just fine.
This doesn't really explain the "select X86_X2APIC", though.
You just said that TDX doesn't *require* X2APIC. So, why is it being
selected? What is the specific connection between TDX and X2APIC?
On Tue, Oct 05, 2021, Dave Hansen wrote:
> On 10/5/21 6:29 AM, Sathyanarayanan Kuppuswamy Natarajan wrote:
> > On Mon, Oct 4, 2021 at 9:53 PM Randy Dunlap <[email protected]> wrote:
> >> On 10/4/21 7:51 PM, Kuppuswamy Sathyanarayanan wrote:
> >>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >>> index 2b2a9639d8ae..c42dd8a2d1f4 100644
> >>> --- a/arch/x86/Kconfig
> >>> +++ b/arch/x86/Kconfig
> >>> @@ -865,6 +865,20 @@ config ACRN_GUEST
> >>> IOT with small footprint and real-time features. More details can be
> >>> found inhttps://projectacrn.org/.
> >>>
> >>> +config INTEL_TDX_GUEST
> >>> + bool "Intel Trusted Domain eXtensions Guest Support"
> >>> + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
> >>> + depends on SECURITY
> >>> + select X86_X2APIC
> >> Apparently some Intel CPUs don't have the x2apic feature, since the
> >> Kconfig help text for X86_X2APIC says:
> >>
> >> This enables x2apic support on CPUs that have this feature.
> >>
> >> so how is it safe to set/enable/select that kconfig symbol?
It's safe because the X86_X2APIC=y doesn't force x2APIC mode, it only enables
support for x2APIC mode. If the CPU doesn't support x2APIC the kernel will use
legacy xAPIC.
That said, using select instead of depends is silly.
> >> Will the x2apic code just safely not work if the h/w feature is
> >> missing?
> > For the TDX guest, x2apic will be emulated. So it will exist in our
> > case.
That's incorrect, TDX partially virtualizes (as opposed to fully emulates) x2APIC
and thus requires the CPU to support x2APIC.
> > Even if x2apic or TDX guest is not supported by CPU, it will boot just fine.
>
> This doesn't really explain the "select X86_X2APIC", though.
>
> You just said that TDX doesn't *require* X2APIC.
Well, TDX requires the guest to support x2APIC if the guest wants to do anything
useful.
10.9.1. Virtual APIC Mode
* Guest TDs must use virtualized x2APIC mode. xAPIC mode (using memory mapped
APIC access) is not allowed.
* Guest TD attempts to RDMSR or WRMSR the IA32_APIC_BASE MSR cause a #VE to the
guest TD. The guest TD cannot disable the APIC.
On 10/5/21 7:09 AM, Dave Hansen wrote:
>> For the TDX guest, x2apic will be emulated. So it will exist in our
>> case. Even if x2apic or TDX guest is not supported by CPU, it will
>> boot just fine.
> This doesn't really explain the "select X86_X2APIC", though.
>
> You just said that TDX doesn't*require* X2APIC. So, why is it being
I meant for a valid TD guest, x2APIC will *always* be emulated. It is
also specified in the spec.
Please check sec "TD Hardware" in Intel TDX Virtual Firmware Design Guide
or "Interrupt Handling and APIC Virtualization" section in Intel Trust
Domain Extensions Module specification.
For the case without x2APIC, TDX initialization should fail (hence TDX
will not be enabled). So in non-TDX mode, the kernel will boot fine. But
in TDX mode, the current behavior would be a kernel hang.
> selected? What is the specific connection between TDX and X2APIC?
x2APIC is used to manage interrupts in a virtualized environment (like a
TDX guest), so it is required for interrupt management.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 10/5/21 7:13 AM, Borislav Petkov wrote:
> On Tue, Oct 05, 2021 at 06:29:43AM -0700, Sathyanarayanan Kuppuswamy Natarajan wrote:
>> For the TDX guest, x2apic will be emulated. So it will exist in our
>> case. Even if x2apic or TDX guest is not supported by CPU, it will
>> boot just fine.
>
> Sure but why is it "select" and not "depends"?
>
> Why does X86_X2APIC need to be selected while the others are "depends"?
Since x2APIC will always exist in TDX guest case, we have used select to
enable the support. But since we have dependency on it, I think "depends"
might be a better choice.
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Tue, Oct 05, 2021 at 07:48:25AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Since x2APIC will always exist in TDX guest case, we have used select to
> enable the support. But since we have dependency on it, I think "depends"
> might be a better choice.
Right, and while we're on the subject, this looks silly to me too:
+ depends on SECURITY
...
+ select SECURITY_LOCKDOWN_LSM
because
Symbol: SECURITY_LOCKDOWN_LSM [=n]
│ Type : bool
│ Defined at security/lockdown/Kconfig:1
│ Prompt: Basic module for enforcing kernel lockdown
│ Depends on: SECURITY [=n]
^^^^^^^^^
so that symbol already depends on SECURITY.
And I have SECURITY=n in my config so I still have to go select SECURITY
by hand so that CONFIG_INTEL_TDX_GUEST becomes visible. And when I
select it, SECURITY_LOCKDOWN_LSM gets enabled too.
But since I have to go select SECURITY, I can just as well enable
SECURITY_LOCKDOWN_LSM in order to have TDX guest support.
IOW, I don't see the point for the evil "select"s - just make everything
depends on and be done with it.
Unless there's an aspect I'm missing...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Oct 04, 2021 at 07:51:55PM -0700, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index c5ce9845c999..2a2ebf9af43e 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -59,27 +59,15 @@ static inline __cpuidle void native_halt(void)
>
> #endif
>
> -#ifdef CONFIG_PARAVIRT_XXL
> -#include <asm/paravirt.h>
> -#else
> -#ifndef __ASSEMBLY__
> -#include <linux/types.h>
> +#ifdef CONFIG_PARAVIRT
>
> -static __always_inline unsigned long arch_local_save_flags(void)
> -{
> - return native_save_fl();
> -}
> -
> -static __always_inline void arch_local_irq_disable(void)
> -{
> - native_irq_disable();
> -}
> +# ifndef __ASSEMBLY__
> +# include <asm/paravirt.h>
> +# endif /* __ASSEMBLY__ */
>
> -static __always_inline void arch_local_irq_enable(void)
> -{
> - native_irq_enable();
> -}
> +#else /* ! CONFIG_PARAVIRT */
>
> +# ifndef __ASSEMBLY__
> /*
> * Used in the idle loop; sti takes one instruction cycle
> * to complete:
> @@ -97,6 +85,28 @@ static inline __cpuidle void halt(void)
> {
> native_halt();
> }
> +# endif /* __ASSEMBLY__ */
> +
> +#endif /* CONFIG_PARAVIRT */
> +
> +#ifndef CONFIG_PARAVIRT_XXL
> +#ifndef __ASSEMBLY__
There are a lot of '__ASSEMBLY__' #ifdefs which make this macro maze
extra hard to follow. Folding in something like this would help a lot:
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 2a2ebf9af43e..afecf7fa20d0 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -57,17 +57,12 @@ static inline __cpuidle void native_halt(void)
asm volatile("hlt": : :"memory");
}
-#endif
-
#ifdef CONFIG_PARAVIRT
-# ifndef __ASSEMBLY__
# include <asm/paravirt.h>
-# endif /* __ASSEMBLY__ */
#else /* ! CONFIG_PARAVIRT */
-# ifndef __ASSEMBLY__
/*
* Used in the idle loop; sti takes one instruction cycle
* to complete:
@@ -85,9 +80,9 @@ static inline __cpuidle void halt(void)
{
native_halt();
}
-# endif /* __ASSEMBLY__ */
#endif /* CONFIG_PARAVIRT */
+#endif /* __ASSEMBLY__ */
#ifndef CONFIG_PARAVIRT_XXL
#ifndef __ASSEMBLY__
On Mon, Oct 04, 2021 at 07:51:56PM -0700, Kuppuswamy Sathyanarayanan wrote:
> +config INTEL_TDX_GUEST
> + bool "Intel Trusted Domain eXtensions Guest Support"
...
> + Provide support for running in a trusted domain on Intel processors
> + equipped with Trusted Domain eXtensions. TDX is a Intel technology
I haven't seen this particular punctuation "eXtensions" anywhere. Intel
documentation writes it as "Extensions". Better to be consistent.
--
Josh
On Tue, Oct 05, 2021 at 07:29:12PM +0200, Borislav Petkov wrote:
> On Tue, Oct 05, 2021 at 07:48:25AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> > Since x2APIC will always exist in TDX guest case, we have used select to
> > enable the support. But since we have dependency on it, I think "depends"
> > might be a better choice.
>
> Right, and while we're on the subject, this looks silly to me too:
>
> + depends on SECURITY
> ...
> + select SECURITY_LOCKDOWN_LSM
>
> because
>
> Symbol: SECURITY_LOCKDOWN_LSM [=n]
> │ Type : bool
> │ Defined at security/lockdown/Kconfig:1
> │ Prompt: Basic module for enforcing kernel lockdown
> │ Depends on: SECURITY [=n]
> ^^^^^^^^^
>
> so that symbol already depends on SECURITY.
>
> And I have SECURITY=n in my config so I still have to go select SECURITY
> by hand so that CONFIG_INTEL_TDX_GUEST becomes visible. And when I
> select it, SECURITY_LOCKDOWN_LSM gets enabled too.
>
> But since I have to go select SECURITY, I can just as well enable
> SECURITY_LOCKDOWN_LSM in order to have TDX guest support.
>
> IOW, I don't see the point for the evil "select"s - just make everything
> depends on and be done with it.
>
> Unless there's an aspect I'm missing...
It would also be helpful to explain the dependencies (particularly
X86_X2APIC and SECURITY_LOCKDOWN_LSM) in the commit message.
--
Josh
On Tue, Oct 05, 2021, Josh Poimboeuf wrote:
> On Mon, Oct 04, 2021 at 07:51:56PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > +config INTEL_TDX_GUEST
> > + bool "Intel Trusted Domain eXtensions Guest Support"
>
> ...
>
> > + Provide support for running in a trusted domain on Intel processors
> > + equipped with Trusted Domain eXtensions. TDX is a Intel technology
>
> I haven't seen this particular punctuation "eXtensions" anywhere. Intel
> documentation writes it as "Extensions". Better to be consistent.
Heh, I suspect that one is my fault.
While we're nitpicking names, it's s/Trusted/Trust Domain Extensions.
On 10/5/21 1:21 PM, Josh Poimboeuf wrote:
> It would also be helpful to explain the dependencies (particularly
> X86_X2APIC and SECURITY_LOCKDOWN_LSM) in the commit message.
I think we no longer need CONFIG_SECURITY_LOCKDOWN_LSM. I will add some
details about X2APIC dependency.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 10/5/21 1:17 PM, Josh Poimboeuf wrote:
>> + Provide support for running in a trusted domain on Intel processors
>> + equipped with Trusted Domain eXtensions. TDX is a Intel technology
> I haven't seen this particular punctuation "eXtensions" anywhere. Intel
> documentation writes it as "Extensions". Better to be consistent.
Ok. I will fix this in next version.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 10/5/21 1:33 PM, Sean Christopherson wrote:
> On Tue, Oct 05, 2021, Josh Poimboeuf wrote:
>> On Mon, Oct 04, 2021 at 07:51:56PM -0700, Kuppuswamy Sathyanarayanan wrote:
>>> +config INTEL_TDX_GUEST
>>> + bool "Intel Trusted Domain eXtensions Guest Support"
>> ...
>>
>>> + Provide support for running in a trusted domain on Intel processors
>>> + equipped with Trusted Domain eXtensions. TDX is a Intel technology
>> I haven't seen this particular punctuation "eXtensions" anywhere. Intel
>> documentation writes it as "Extensions". Better to be consistent.
It's less about what Intel calls it and more about defining acronyms by
capitalizing the letters used in the acronym. This:
git log -p arch/x86/ | grep eXtension
shows at least SGX and MPX in the past have done the "eXtension" thing.
So, yes, please pick one. But, don't throw out "eXtension" because it
hasn't been used before. It has. I'd also advise against using the
Intel documentation as gospel here. It's *not* a good example to
blindly follow.
On Mon, Oct 04, 2021 at 07:51:57PM -0700, Kuppuswamy Sathyanarayanan wrote:
> @@ -495,6 +496,13 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
>
> copy_bootdata(__va(real_mode_data));
>
> + /*
> + * tdx_early_init() has dependency on command line parameters.
> + * So the order of calling it should be after copy_bootdata()
> + * (in which command line parameter is initialized).
> + */
> + tdx_early_init();
Which cmdline parameters are those?
> +/*
> + * Allocate it in the data region to avoid zeroing it during
> + * BSS initialization. It is mainly used in cc_platform_has()
> + * call during early boot call.
> + */
> +u64 __section(".data") is_tdx_guest = 0;
Or you could just give it a -1 value here to avoid the section
annotation. Not sure why it needs 64 bits, any reason it can't just be
bool?
> +
> +static void __init is_tdx_guest_init(void)
> +{
> + u32 eax, sig[3];
> +
> + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
> + is_tdx_guest = 0;
> + return;
> + }
> +
> + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
> +
> + is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
> +}
> +
> +void __init tdx_early_init(void)
> +{
> + is_tdx_guest_init();
> +
> + if (!is_tdx_guest)
> + return;
> +
> + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> +
> + pr_info("Guest initialized\n");
> +}
What's the point of having both 'is_tdx_guest' and
X86_FEATURE_TDX_GUEST? Are they not redundant?
--
Josh
On Mon, Oct 04, 2021 at 07:51:58PM -0700, Kuppuswamy Sathyanarayanan wrote:
> +#if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_ARCH_HAS_CC_PLATFORM)
> +bool intel_cc_platform_has(enum cc_attr attr);
> +#else
> +static inline bool intel_cc_platform_has(enum cc_attr attr) { return false; }
> +#endif
> +
I assume this needs a rebase on -tip since cc_platform.c already has an
empty version of this function (and it's static so it doesn't need to be
declared in a header).
> #endif /* __ASSEMBLY__ */
>
> #endif /* __X86_MEM_ENCRYPT_H__ */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 159fccfece65..8e5b49be65bd 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -28,6 +28,7 @@ KASAN_SANITIZE_dumpstack.o := n
> KASAN_SANITIZE_dumpstack_$(BITS).o := n
> KASAN_SANITIZE_stacktrace.o := n
> KASAN_SANITIZE_paravirt.o := n
> +KASAN_SANITIZE_cc_platform.o := n
> KASAN_SANITIZE_sev.o := n
> KASAN_SANITIZE_tdx.o := n
This change is already in -tip as well.
> + /**
> + * @CC_ATTR_GUEST_TDX: Trusted Domain Extension Support
> + *
> + * The platform/OS is running as a TDX guest/virtual machine.
> + *
> + * Examples include Intel TDX.
> + */
> + CC_ATTR_GUEST_TDX,
Examples of TDX include TDX? :-)
--
Josh
On Tue, Oct 05, 2021 at 02:04:23PM -0700, Josh Poimboeuf wrote:
> What's the point of having both 'is_tdx_guest' and
> X86_FEATURE_TDX_GUEST? Are they not redundant?
I think this should explain it:
https://lkml.kernel.org/r/[email protected]
but I've yet to go through these here patches and see what's happening
there.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/5/21 2:04 PM, Josh Poimboeuf wrote:
> On Mon, Oct 04, 2021 at 07:51:57PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> @@ -495,6 +496,13 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
>>
>> copy_bootdata(__va(real_mode_data));
>>
>> + /*
>> + * tdx_early_init() has dependency on command line parameters.
>> + * So the order of calling it should be after copy_bootdata()
>> + * (in which command line parameter is initialized).
>> + */
>> + tdx_early_init();
>
> Which cmdline parameters are those?
We have a few debug command line options like tdx_forced (force TDX
initialization) and tdx_disable_filter (disable TDX device filter
support). Support for these options has not been posted yet (we are
waiting to merge the initial support patches first). Since we need to
access command line options, we want to follow the above calling order.
>
>> +/*
>> + * Allocate it in the data region to avoid zeroing it during
>> + * BSS initialization. It is mainly used in cc_platform_has()
>> + * call during early boot call.
>> + */
>> +u64 __section(".data") is_tdx_guest = 0;
>
> Or you could just give it a -1 value here to avoid the section
> annotation. Not sure why it needs 64 bits, any reason it can't just be
> bool?
It can be bool. I can fix this in next version.
>
>> +
>> +static void __init is_tdx_guest_init(void)
>> +{
>> + u32 eax, sig[3];
>> +
>> + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
>> + is_tdx_guest = 0;
>> + return;
>> + }
>> +
>> + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
>> +
>> + is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
>> +}
>> +
>> +void __init tdx_early_init(void)
>> +{
>> + is_tdx_guest_init();
>> +
>> + if (!is_tdx_guest)
>> + return;
>> +
>> + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>> +
>> + pr_info("Guest initialized\n");
>> +}
>
> What's the point of having both 'is_tdx_guest' and
> X86_FEATURE_TDX_GUEST? Are they not redundant?
is_tdx_guest was mainly introduced to support cc_platform_has()
API in early boot calls (similar to sme_me_mask in AMD code).
Regarding FEATURE flag it will be useful for userspace tools to
check the TDX feature support.
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 10/5/21 2:16 PM, Josh Poimboeuf wrote:
> Examples of TDX include TDX?:-)
Intel TDX. Not sure whether there will be other vendors.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Mon, Oct 04, 2021 at 07:52:03PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Use hypercall to emulate MSR read/write for TDX platform.
>
> TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide MSR
> operations.
>
> RDMSR and WRMSR specification details can be found in
> Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
> Extensions (Intel TDX) specification, sec titled "TDG.VP.
> VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since v7:
> * Removed tdx_is_context_switched_msr() support (since the list
> is incomplete).
This fails to build:
arch/x86/kernel/tdx.c: In function ‘tdx_write_msr_safe’:
arch/x86/kernel/tdx.c:135:22: error: implicit declaration of function ‘tdx_is_context_switched_msr’ [-Werror=implicit-function-declaration]
135 | WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/asm-generic/bug.h:104:32: note: in definition of macro ‘WARN_ON_ONCE’
104 | int __ret_warn_on = !!(condition); \
| ^~~~~~~~~
--
Josh
On 10/5/21 4:22 PM, Josh Poimboeuf wrote:
> This fails to build:
>
> arch/x86/kernel/tdx.c: In function ‘tdx_write_msr_safe’:
> arch/x86/kernel/tdx.c:135:22: error: implicit declaration of function ‘tdx_is_context_switched_msr’ [-Werror=implicit-function-declaration]
> 135 | WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> ./include/asm-generic/bug.h:104:32: note: in definition of macro ‘WARN_ON_ONCE’
> 104 | int __ret_warn_on = !!(condition); \
> | ^~~~~~~~~
Sorry, it looks like the removal is split across two patches. So I did not catch
it in my compilation test. I will fix this and repost it.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Tue, Oct 05, 2021 at 02:41:35PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 10/5/21 2:04 PM, Josh Poimboeuf wrote:
> > On Mon, Oct 04, 2021 at 07:51:57PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > > @@ -495,6 +496,13 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
> > > copy_bootdata(__va(real_mode_data));
> > > + /*
> > > + * tdx_early_init() has dependency on command line parameters.
> > > + * So the order of calling it should be after copy_bootdata()
> > > + * (in which command line parameter is initialized).
> > > + */
> > > + tdx_early_init();
> >
> > Which cmdline parameters are those?
>
> We have a few debug command line options like tdx_forced (force TDX
> initialization) and tdx_disable_filter (disable TDX device filter
> support). Support for these options has not been posted yet (we are
> waiting to merge the initial support patches first). Since we need to
> access command line options, we want to follow the above calling order.
But until if/when those cmdline options are added, the comment is plain
wrong. At the very least, it should state the present state of things,
i.e. that a future dependency on cmdline parameters is expected.
> > > +/*
> > > + * Allocate it in the data region to avoid zeroing it during
> > > + * BSS initialization. It is mainly used in cc_platform_has()
> > > + * call during early boot call.
> > > + */
> > > +u64 __section(".data") is_tdx_guest = 0;
> >
> > Or you could just give it a -1 value here to avoid the section
> > annotation. Not sure why it needs 64 bits, any reason it can't just be
> > bool?
>
> It can be bool. I can fix this in next version.
Ok. maybe something like
bool is_tdx_guest = true;
along with the comment clarifying why the nonzero initializer is needed.
> > > +static void __init is_tdx_guest_init(void)
> > > +{
> > > + u32 eax, sig[3];
> > > +
> > > + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
> > > + is_tdx_guest = 0;
> > > + return;
> > > + }
> > > +
> > > + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
> > > +
> > > + is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
> > > +}
> > > +
> > > +void __init tdx_early_init(void)
> > > +{
> > > + is_tdx_guest_init();
> > > +
> > > + if (!is_tdx_guest)
> > > + return;
> > > +
> > > + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> > > +
> > > + pr_info("Guest initialized\n");
> > > +}
> >
> > What's the point of having both 'is_tdx_guest' and
> > X86_FEATURE_TDX_GUEST? Are they not redundant?
>
> is_tdx_guest was mainly introduced to support cc_platform_has()
> API in early boot calls (similar to sme_me_mask in AMD code).
> Regarding FEATURE flag it will be useful for userspace tools to
> check the TDX feature support.
FEATURE flags can also be checked in the kernel, with boot_cpu_has().
Or am I missing something?
--
Josh
On 10/5/21 8:42 PM, Josh Poimboeuf wrote:
>> is_tdx_guest was mainly introduced to support cc_platform_has()
>> API in early boot calls (similar to sme_me_mask in AMD code).
>> Regarding FEATURE flag it will be useful for userspace tools to
>> check the TDX feature support.
> FEATURE flags can also be checked in the kernel, with boot_cpu_has().
> Or am I missing something?
Yes, previously we have been using x86_feature_enabled() check in
cc_platform_has() call. Now with the introduction of is_tdx_guest
global variable, we don't use it in kernel. But I still want to
keep the feature flag for user space use case.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Tue, Oct 05, 2021 at 09:33:35PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 10/5/21 8:42 PM, Josh Poimboeuf wrote:
> > > is_tdx_guest was mainly introduced to support cc_platform_has()
> > > API in early boot calls (similar to sme_me_mask in AMD code).
> > > Regarding FEATURE flag it will be useful for userspace tools to
> > > check the TDX feature support.
> > FEATURE flags can also be checked in the kernel, with boot_cpu_has().
> > Or am I missing something?
>
> Yes, previously we have been using x86_feature_enabled() check in
> cc_platform_has() call. Now with the introduction of is_tdx_guest
> global variable, we don't use it in kernel. But I still want to
> keep the feature flag for user space use case.
I'm not suggesting getting rid of the feature flag. I'm suggesting
getting rid of the global variable. Is there a reason you can't check
the feature flag instead of checking the global variable?
--
Josh
On Mon, Oct 04, 2021 at 07:51:59PM -0700, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host. This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called TDCALL.
>
> In a TDX based VM, since VMM is an untrusted entity, a intermediary
> layer (TDX module) exists between host and guest to facilitate
> secure communication. TDX guests communicate with the TDX module
> using TDCALL instruction.
Most of the information in the second paragraph was already provided in
the first paragraph.
> +#ifdef CONFIG_FRAME_POINTER
> +/* Frame offset + 8 (for arg1) */
> +#define ARG7_SP_OFFSET (FRAME_OFFSET + 0x08)
> +#else
> +#define ARG7_SP_OFFSET (0x08)
> +#endif
No need for the #ifdef here, FRAME_OFFSET is already zero for
!FRAME_POINTER. So it can just be unconditional:
#define ARG7_SP_OFFSET (FRAME_OFFSET + 0x08)
> + * __tdx_hypercall() - Helper function used by TDX guests to request
> + * services from the VMM. All requests are made via the TDX module
> + * using TDCALL instruction.
> + *
> + * This function serves as a wrapper to move user call arguments to the
> + * correct registers as specified by TDCALL ABI and share it with VMM
> + * via the TDX module. After TDCALL operation, output from the VMM is
> + * saved to the memory specified in the "out" (struct tdx_hypercall_output)
> + * pointer.
> + *
> + *-------------------------------------------------------------------------
> + * TD VMCALL ABI:
> + *-------------------------------------------------------------------------
> + *
> + * Input Registers:
> + *
> + * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
> + * RCX - BITMAP which controls which part of TD Guest GPR
> + * is passed as-is to VMM and back.
> + * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
> + * specification. Non zero value indicates vendor
> + * specific ABI.
> + * R11 - VMCALL sub function number
> + * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
> + * R8-R9, R12–R15 - Same as above.
> + *
> + * Output Registers:
> + *
> + * RAX - TDCALL instruction status (Not related to hypercall
> + * output).
> + * R10 - Hypercall output error code.
> + * R11-R15 - Hypercall sub function specific output values.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __tdx_hypercall() function ABI:
> + *
> + * @type (RDI) - TD VMCALL type, moved to R10
> + * @fn (RSI) - TD VMCALL sub function, moved to R11
> + * @r12 (RDX) - Input parameter 1, moved to R12
> + * @r13 (RCX) - Input parameter 2, moved to R13
> + * @r14 (R8) - Input parameter 3, moved to R14
> + * @r15 (R9) - Input parameter 4, moved to R15
> + *
> + * @out (stack) - struct tdx_hypercall_output pointer (cannot be NULL)
> + *
> + * On successful completion, return TDCALL status or -EINVAL for invalid
> + * inputs.
> + */
The documentation and comments for all the asm code are excellent! If
only all kernel asm code were this readable!
I'm especially thankful that it's not all crammed into inline asm :-)
> +SYM_FUNC_START(__tdx_hypercall)
> + FRAME_BEGIN
> +
> + /* Move argument 7 from caller stack to RAX */
> + movq ARG7_SP_OFFSET(%rsp), %rax
> +
> + /* Check if caller provided an output struct */
> + test %rax, %rax
> + /* If out pointer is NULL, return -EINVAL */
> + jz 1f
> +
> + /* Save callee-saved GPRs as mandated by the x86_64 ABI */
> + push %r15
> + push %r14
> + push %r13
> + push %r12
> +
> + /*
> + * Save R9 and output pointer (rax) in stack, it will be used
> + * again when storing the output registers after TDCALL
> + * operation.
> + */
> + push %r9
> + push %rax
r9 is callee-clobbered, so we shouldn't need to push it here or pop it
("Restore state of R9 register" below) before returning.
> +/*
> + * Wrapper for standard use of __tdx_hypercall with BUG_ON() check
> + * for TDCALL error.
> + */
> +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
> + u64 r15, struct tdx_hypercall_output *out)
Instead of calling the args r12-r15, I'm thinking it would be clearer to
call them arg1-arg4. It's not the C code's job to worry about the
argument-to-register ABI mappings.
Same comment for the function declaration in the header file.
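Something like this, purely as an illustrative sketch of the rename (just
the declaration; the argument-to-register mapping stays inside
__tdx_hypercall()):

/* Sketch only: ABI-neutral parameter names at the C level. */
static inline u64 _tdx_hypercall(u64 fn, u64 arg1, u64 arg2, u64 arg3,
				 u64 arg4, struct tdx_hypercall_output *out);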
--
Josh
On Tue, Oct 05, 2021 at 10:03:09PM -0700, Josh Poimboeuf wrote:
> I'm not suggesting getting rid of the feature flag. I'm suggesting
> getting rid of the global variable. Is there a reason you can't check
> the feature flag instead of checking the global variable?
The reason is that cc_platform_has() is used too early, even before
get_cpu_cap(). So you need to have TDX guest detected even before
feature flags.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Oct 06, 2021 at 02:47:42PM +0200, Borislav Petkov wrote:
> On Tue, Oct 05, 2021 at 10:03:09PM -0700, Josh Poimboeuf wrote:
> > I'm not suggesting getting rid of the feature flag. I'm suggesting
> > getting rid of the global variable. Is there a reason you can't check
> > the feature flag instead of checking the global variable?
>
> The reason is that cc_platform_has() is used too early, even before
> get_cpu_cap(). So you need to have TDX guest detected even before
> feature flags.
Ok. At least it sounds like cc_platform_has() needs a comment stating
that caps can't be relied on, and why.
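Purely as an illustration (the wording is mine, not proposed verbatim),
the comment could say something like:

/*
 * Note: cc_platform_has() can be called before get_cpu_cap() has run,
 * so it must not rely on CPU feature flags (boot_cpu_has() and
 * friends). TDX guest detection therefore uses its own early-detected
 * state instead of X86_FEATURE_TDX_GUEST here.
 */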
--
Josh
On Mon, Oct 04, 2021 at 07:51:57PM -0700, Kuppuswamy Sathyanarayanan wrote:
> +/*
> + * Allocate it in the data region to avoid zeroing it during
> + * BSS initialization. It is mainly used in cc_platform_has()
> + * call during early boot call.
> + */
> +u64 __section(".data") is_tdx_guest = 0;
Actually this should be __ro_after_init, since it never changes after
it's initialized. As a bonus you won't have to worry about it getting
placed in BSS and the comment and initializer can go away.
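For illustration, a minimal sketch of that suggestion (keeping the current
name and non-static linkage, which are assumptions on my part):

/* Set once during early TDX detection, read-only afterwards. */
bool is_tdx_guest __ro_after_init;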
> +
> +static void __init is_tdx_guest_init(void)
> +{
> + u32 eax, sig[3];
> +
> + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
> + is_tdx_guest = 0;
> + return;
> + }
> +
> + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
> +
> + is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
> +}
> +
> +void __init tdx_early_init(void)
> +{
> + is_tdx_guest_init();
> +
> + if (!is_tdx_guest)
> + return;
> +
> + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
setup_force_cpu_cap() sets a bit in 'cpu_caps_set' which is in BSS,
which gets cleared later right? So this presumably doesn't fully work
as expected. Maybe it can be set in init_intel(), based on
'is_tdx_guest'.
--
Josh
On Wed, Oct 06, 2021 at 07:11:05AM -0700, Josh Poimboeuf wrote:
> Ok. At least it sounds like cc_platform_has() needs a comment stating
> that caps can't be relied on, and why.
I thought about turning the hunk here:
https://lkml.kernel.org/r/[email protected]
into a patch. But since we won't be using caps there, it would look
weird so I guess a comment's fine.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Oct 04, 2021 at 07:51:57PM -0700, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> new file mode 100644
> index 000000000000..ad3ff5925153
> --- /dev/null
> +++ b/arch/x86/kernel/tdx.c
> @@ -0,0 +1,40 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2020 Intel Corporation */
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +#include <asm/tdx.h>
> +
> +/*
> + * Allocate it in the data region to avoid zeroing it during
> + * BSS initialization. It is mainly used in cc_platform_has()
> + * call during early boot call.
> + */
> +u64 __section(".data") is_tdx_guest = 0;
> +
> +static void __init is_tdx_guest_init(void)
> +{
> + u32 eax, sig[3];
> +
> + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
> + is_tdx_guest = 0;
> + return;
> + }
> +
> + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
> +
> + is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
> +}
> +
> +void __init tdx_early_init(void)
> +{
> + is_tdx_guest_init();
> +
> + if (!is_tdx_guest)
> + return;
> +
> + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> +
> + pr_info("Guest initialized\n");
> +}
> --
What I meant was this (untested of course).
is_tdx_guest() is the accessor external code queries and you cache the
detected value in tdx_guest so that the one after the first one is
cheap.
/*
* Allocate it in the data region to avoid zeroing it during
* BSS initialization. It is mainly used in cc_platform_has()
* call during early boot call.
*
* States whether the kernel is running as a TDX guest.
*/
static int tdx_guest __ro_after_init = -1;
bool is_tdx_guest(void)
{
u32 eax, sig[3];
if (tdx_guest >= 0)
return tdx_guest;
if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
tdx_guest = 0;
return false;
}
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
tdx_guest = !memcmp("IntelTDX ", sig, 12);
return tdx_guest;
}
void __init tdx_early_init(void)
{
if (!is_tdx_guest())
return;
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
pr_info("Guest initialized\n");
}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/6/21 8:26 AM, Borislav Petkov wrote:
> On Mon, Oct 04, 2021 at 07:51:57PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> new file mode 100644
>> index 000000000000..ad3ff5925153
>> --- /dev/null
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -0,0 +1,40 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2020 Intel Corporation */
>> +
>> +#undef pr_fmt
>> +#define pr_fmt(fmt) "tdx: " fmt
>> +
>> +#include <asm/tdx.h>
>> +
>> +/*
>> + * Allocate it in the data region to avoid zeroing it during
>> + * BSS initialization. It is mainly used in cc_platform_has()
>> + * call during early boot call.
>> + */
>> +u64 __section(".data") is_tdx_guest = 0;
>> +
>> +static void __init is_tdx_guest_init(void)
>> +{
>> + u32 eax, sig[3];
>> +
>> + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
>> + is_tdx_guest = 0;
>> + return;
>> + }
>> +
>> + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
>> +
>> + is_tdx_guest = !memcmp("IntelTDX ", sig, 12);
>> +}
>> +
>> +void __init tdx_early_init(void)
>> +{
>> + is_tdx_guest_init();
>> +
>> + if (!is_tdx_guest)
>> + return;
>> +
>> + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>> +
>> + pr_info("Guest initialized\n");
>> +}
>> --
>
> What I meant was this (untested of course).
>
> is_tdx_guest() is the accessor external code queries and you cache the
> detected value in tdx_guest so that the one after the first one is
> cheap.
Yes. But Joerg Roedel, in his review, recommended using a variable
similar to sme_me_mask to avoid a function call on Intel platforms in
cc_platform_has().
"
This causes a function call on every Intel machine this code runs. is
there an easier to check whether TDX is enabled, like the sme_me_mask
check on AMD?
"
That's why I introduced the is_tdx_guest global variable in this
version.
>
> /*
> * Allocate it in the data region to avoid zeroing it during
> * BSS initialization. It is mainly used in cc_platform_has()
> * call during early boot call.
> *
> * States whether the kernel is running as a TDX guest.
> */
> static int tdx_guest __ro_after_init = -1;
>
> bool is_tdx_guest(void)
> {
> u32 eax, sig[3];
>
> if (tdx_guest >= 0)
> return tdx_guest;
>
> if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) {
> tdx_guest = 0;
> return false;
> }
>
> cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
>
> tdx_guest = !memcmp("IntelTDX ", sig, 12);
>
> return tdx_guest;
> }
>
> void __init tdx_early_init(void)
> {
> if (!is_tdx_guest())
> return;
>
> setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
> pr_info("Guest initialized\n");
> }
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Wed, Oct 06, 2021 at 08:43:48AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Yes. But Joerg Roedel, in his review, recommended using a variable
> similar to sme_me_mask to avoid a function call on Intel platforms in
> cc_platform_has().
That would normally make sense if this were a fast path. But I don't see
that being one. Besides, the subsequent runs will be the function call +
a couple of instructions:
# arch/x86/kernel/tdx.c:22: if (tdx_guest >= 0)
movl tdx_guest(%rip), %eax # tdx_guest,
testl %eax, %eax #
jns .L11 #,
...
.L11:
# arch/x86/kernel/tdx.c:23: return tdx_guest;
setne %al #, <retval>
# arch/x86/kernel/tdx.c:35: }
ret
which is not the end of the world.
And if it is really used in a fast path, then we can do a static
branch etc.
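Purely for illustration, a static-branch variant could look roughly like
this (untested sketch; tdx_guest_key and the helper names are made up,
and it assumes the key is only flipped after jump_label_init()):

#include <linux/jump_label.h>

/* Hypothetical example, not part of this series. */
static DEFINE_STATIC_KEY_FALSE(tdx_guest_key);

/* Call once TDX detection has run (after jump_label_init()). */
static void __init tdx_mark_guest(void)
{
	static_branch_enable(&tdx_guest_key);
}

/* Hot-path check: patched to a NOP or JMP, no load/compare at runtime. */
static __always_inline bool tdx_guest_fast(void)
{
	return static_branch_unlikely(&tdx_guest_key);
}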
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/5/21 10:53 PM, Josh Poimboeuf wrote:
> On Mon, Oct 04, 2021 at 07:51:59PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> Guests communicate with VMMs with hypercalls. Historically, these
>> are implemented using instructions that are known to cause VMEXITs
>> like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
>> expose guest state to the host. This prevents the old hypercall
>> mechanisms from working. So to communicate with VMM, TDX
>> specification defines a new instruction called TDCALL.
>>
>> In a TDX based VM, since VMM is an untrusted entity, a intermediary
>> layer (TDX module) exists between host and guest to facilitate
>> secure communication. TDX guests communicate with the TDX module
>> using TDCALL instruction.
>
> Most of the information in the second paragraph was already provided in
> the first paragraph.
The second paragraph talks about the TDX module, whereas the first one
talks about the VMM.
>
>> +#ifdef CONFIG_FRAME_POINTER
>> +/* Frame offset + 8 (for arg1) */
>> +#define ARG7_SP_OFFSET (FRAME_OFFSET + 0x08)
>> +#else
>> +#define ARG7_SP_OFFSET (0x08)
>> +#endif
>
> No need for the #ifdef here, FRAME_OFFSET is already zero for
> !FRAME_POINTER. So it can just be unconditional:
>
> #define ARG7_SP_OFFSET (FRAME_OFFSET + 0x08)
Right. I will remove the #ifdef in next version.
>
>> + * __tdx_hypercall() - Helper function used by TDX guests to request
>> + * services from the VMM. All requests are made via the TDX module
>> + * using TDCALL instruction.
>> + *
>> + * This function serves as a wrapper to move user call arguments to the
>> + * correct registers as specified by TDCALL ABI and share it with VMM
>> + * via the TDX module. After TDCALL operation, output from the VMM is
>> + * saved to the memory specified in the "out" (struct tdx_hypercall_output)
>> + * pointer.
>> + *
>> + *-------------------------------------------------------------------------
>> + * TD VMCALL ABI:
>> + *-------------------------------------------------------------------------
>> + *
>> + * Input Registers:
>> + *
>> + * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
>> + * RCX - BITMAP which controls which part of TD Guest GPR
>> + * is passed as-is to VMM and back.
>> + * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
>> + * specification. Non zero value indicates vendor
>> + * specific ABI.
>> + * R11 - VMCALL sub function number
>> + * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
>> + * R8-R9, R12–R15 - Same as above.
>> + *
>> + * Output Registers:
>> + *
>> + * RAX - TDCALL instruction status (Not related to hypercall
>> + * output).
>> + * R10 - Hypercall output error code.
>> + * R11-R15 - Hypercall sub function specific output values.
>> + *
>> + *-------------------------------------------------------------------------
>> + *
>> + * __tdx_hypercall() function ABI:
>> + *
>> + * @type (RDI) - TD VMCALL type, moved to R10
>> + * @fn (RSI) - TD VMCALL sub function, moved to R11
>> + * @r12 (RDX) - Input parameter 1, moved to R12
>> + * @r13 (RCX) - Input parameter 2, moved to R13
>> + * @r14 (R8) - Input parameter 3, moved to R14
>> + * @r15 (R9) - Input parameter 4, moved to R15
>> + *
>> + * @out (stack) - struct tdx_hypercall_output pointer (cannot be NULL)
>> + *
>> + * On successful completion, return TDCALL status or -EINVAL for invalid
>> + * inputs.
>> + */
>
> The documentation and comments for all the asm code are excellent! If
> only all kernel asm code were this readable!
>
> I'm especially thankful that it's not all crammed into inline asm :-)
Thanks. But the major credit should go to Dave Hansen (the detailed
comments are due to his recommendations in his review).
>
>> +SYM_FUNC_START(__tdx_hypercall)
>> + FRAME_BEGIN
>> +
>> + /* Move argument 7 from caller stack to RAX */
>> + movq ARG7_SP_OFFSET(%rsp), %rax
>> +
>> + /* Check if caller provided an output struct */
>> + test %rax, %rax
>> + /* If out pointer is NULL, return -EINVAL */
>> + jz 1f
>> +
>> + /* Save callee-saved GPRs as mandated by the x86_64 ABI */
>> + push %r15
>> + push %r14
>> + push %r13
>> + push %r12
>> +
>> + /*
>> + * Save R9 and output pointer (rax) in stack, it will be used
>> + * again when storing the output registers after TDCALL
>> + * operation.
>> + */
>> + push %r9
>> + push %rax
>
> r9 is callee-clobbered, so we shouldn't need to push it here or pop it
> ("Restore state of R9 register" below) before returning.
Makes sense. I will remove the push/pop for the R9 register.
>
>> +/*
>> + * Wrapper for standard use of __tdx_hypercall with BUG_ON() check
>> + * for TDCALL error.
>> + */
>> +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
>> + u64 r15, struct tdx_hypercall_output *out)
>
> Instead of calling the args r12-r15, I'm thinking it would be clearer to
> call them arg1-arg4. It's not the C code's job to worry about the
> argument-to-register ABI mappings.
>
> Same comment for the function declaration in the header file.
Using register names for the arguments makes it easier when implementing
the hypercalls based on the TDX ABI spec (which specifies which registers
are to be used).
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Tue, Oct 05, 2021 at 02:16:11PM -0700, Josh Poimboeuf wrote:
> I assume this needs a rebase on -tip since cc_platform.c already has an
> empty version of this function (and it's static so it doesn't need to be
> declared in a header).
Yes:
arch/x86/kernel/cc_platform.c:16:28: error: static declaration of ‘intel_cc_platform_has’ follows non-static declaration
16 | static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
| ^~~~~~~~~~~~~~~~~~~~~
In file included from ./include/linux/mem_encrypt.h:17,
from arch/x86/kernel/cc_platform.c:12:
./arch/x86/include/asm/mem_encrypt.h:105:6: note: previous declaration of ‘intel_cc_platform_has’ was here
105 | bool intel_cc_platform_has(enum cc_attr attr);
| ^~~~~~~~~~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:277: arch/x86/kernel/cc_platform.o] Error 1
make[1]: *** [scripts/Makefile.build:540: arch/x86/kernel] Error 2
make: *** [Makefile:1868: arch/x86] Error 2
make: *** Waiting for unfinished jobs....
I had already started that function there - please add all TDX logic in
cc_platform.c.
When you do your next set, you can use tip/master as a base. This should
be used for all x86 patchsets anyway.
> > + /**
> > + * @CC_ATTR_GUEST_TDX: Trusted Domain Extension Support
> > + *
> > + * The platform/OS is running as a TDX guest/virtual machine.
> > + *
> > + * Examples include Intel TDX.
> > + */
> > + CC_ATTR_GUEST_TDX,
>
> Examples of TDX include TDX? :-)
Yeah, so whether we should be naming the actual conf. computing
implementation came up during the cc_platform_has() review and looking
forward in this patchset:
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
you really need to test for TDX because you're doing a TDX-specific
hypercall.
Which brings me back to the fastpath use of is_tdx_guest(): this
looks to me like a fastpath use - dunno how often one needs to do TDX
hypercalls so I can imagine that for this, the is_tdx_guest() check
should use a static branch.
But only with numbers to show the need for it.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/6/21 11:02 AM, Borislav Petkov wrote:
> On Tue, Oct 05, 2021 at 02:16:11PM -0700, Josh Poimboeuf wrote:
>> I assume this needs a rebase on -tip since cc_platform.c already has an
>> empty version of this function (and it's static so it doesn't need to be
>> declared in a header).
>
> Yes:
>
> arch/x86/kernel/cc_platform.c:16:28: error: static declaration of ‘intel_cc_platform_has’ follows non-static declaration
> 16 | static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
> | ^~~~~~~~~~~~~~~~~~~~~
> In file included from ./include/linux/mem_encrypt.h:17,
> from arch/x86/kernel/cc_platform.c:12:
> ./arch/x86/include/asm/mem_encrypt.h:105:6: note: previous declaration of ‘intel_cc_platform_has’ was here
> 105 | bool intel_cc_platform_has(enum cc_attr attr);
> | ^~~~~~~~~~~~~~~~~~~~~
> make[2]: *** [scripts/Makefile.build:277: arch/x86/kernel/cc_platform.o] Error 1
> make[1]: *** [scripts/Makefile.build:540: arch/x86/kernel] Error 2
> make: *** [Makefile:1868: arch/x86] Error 2
> make: *** Waiting for unfinished jobs....
>
> I had already started that function there - please add all TDX logic in
> cc_platform.c.
>
> When you do your next set, you can use tip/master as a base. This should
> be used for all x86 patchsets anyway.
Yes. I have already rebased my patches on top of the tip branch. The next
version will not have this issue.
>
>>> + /**
>>> + * @CC_ATTR_GUEST_TDX: Trusted Domain Extension Support
>>> + *
>>> + * The platform/OS is running as a TDX guest/virtual machine.
>>> + *
>>> + * Examples include Intel TDX.
>>> + */
>>> + CC_ATTR_GUEST_TDX,
>>
>> Examples of TDX include TDX? :-)
>
> Yeah, so whether we should be naming the actual conf. computing
> implementation came up during the cc_platform_has() review and looking
> forward in this patchset:
>
> + if (cc_platform_has(CC_ATTR_GUEST_TDX))
> + return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
>
> you really need to test for TDX because you're doing a TDX-specific
> hypercall.
>
> Which brings me back to the fastpath use of is_tdx_guest(): this
> looks to me like a fastpath use - dunno how often one needs to do TDX
> hypercalls so I can imagine that for this, the is_tdx_guest() check
> should use a static branch.
>
> But only with numbers to show the need for it.
Compared to the TDX hypercall itself, an additional function call should not
take much time. IMO, we don't need a fast path for hypercalls.
Sean/Andi - Any comments?
>
> Thx.
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Mon, Oct 04, 2021 at 07:52:00PM -0700, Kuppuswamy Sathyanarayanan wrote:
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
> */
> INTG(X86_TRAP_PF, asm_exc_page_fault),
> #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
> };
>
> /*
> @@ -91,6 +94,9 @@ static const __initconst struct idt_data def_idts[] = {
> INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
> INTG(X86_TRAP_AC, asm_exc_alignment_check),
> INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
I don't think this 'def_idts' entry is necessary. The #VE IDT gate is
already initialized due to the 'early_idts' entry.
>
> #ifdef CONFIG_X86_32
> TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index d5fc2904facf..f7885c777a09 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -6,6 +6,9 @@
>
> #include <asm/tdx.h>
>
> +/* TDX Module call Leaf IDs */
> +#define TDGETVEINFO 3
This doesn't appear to be the canonical name anyway; can we at least
insert some underscores to make it more readable?
TDX_GET_VEINFO?
> @@ -1140,6 +1141,82 @@ DEFINE_IDTENTRY(exc_device_not_available)
> }
> }
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#define VEFSTR "VE fault"
Something more readable like VE_FAULT_STR?
--
Josh
On Mon, Oct 04, 2021 at 07:52:01PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
> + * Emulate HLT operation via hypercall. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface
> + * (GHCI), sec 3.8 TDG.VP.VMCALL<Instruction.HLT>.
> + *
> + * The VMM uses the "IRQ disabled" param to understand IRQ
> + * enabled status (RFLAGS.IF) of TD guest and determine
> + * whether or not it should schedule the halted vCPU if an
> + * IRQ becomes pending. E.g. if IRQs are disabled the VMM
> + * can keep the vCPU in virtual HLT, even if an IRQ is
> + * pending, without hanging/breaking the guest.
> + *
> + * do_sti parameter is used by __tdx_hypercall() to decide
> + * whether to call STI instruction before executing TDCALL
> + * instruction.
> + */
> + ret = _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0, do_sti, NULL);
> +
> + /*
> + * Use WARN_ONCE() to report the failure. Since tdx_*halt() calls
> + * are also used in pv_ops, #VE error handler cannot be used to
> + * report the failure.
> + */
> + WARN_ONCE(ret, "HLT instruction emulation failed\n");
I'm having trouble following this last comment. Does it mean there's no
way to return the error back to the #VE handler when this is called in
#VE context? Seems like that would be an easy problem to solve by
shuffling the functions a bit.
--
Josh
On Mon, Oct 04, 2021 at 07:52:02PM -0700, Kuppuswamy Sathyanarayanan wrote:
> static inline long kvm_hypercall0(unsigned int nr)
> {
> long ret;
> +
> + if (cc_platform_has(CC_ATTR_GUEST_TDX))
> + return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
Hm, I'm still thinking that figuring out a way to convert all the
cc_platform_has() stuff to normal cpu features -- or at least inline
wrappers around them -- would be really nice here. Then this could just
be a nice fast static_cpu_has().
--
Josh
On Wed, Oct 06, 2021 at 12:34:40PM -0700, Josh Poimboeuf wrote:
> Hm, I'm still thinking that figuring out a way to convert all the
> cc_platform_has() stuff to normal cpu features -- or at least inline
> wrappers around them -- would be really nice here. Then this could just
> be a nice fast static_cpu_has().
cpu_feature_enabled(X86_FEATURE_TDX_GUEST)
provided those don't happen too early...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
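In code, that suggestion would land roughly as below; this is only an illustration, and it assumes X86_FEATURE_TDX_GUEST (set via setup_force_cpu_cap() in the first patch) is in place before the first hypercall:

static inline long kvm_hypercall0(unsigned int nr)
{
        long ret;

        /* compile-time-patched feature check instead of a cc_platform_has() call */
        if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
                return tdx_kvm_hypercall(nr, 0, 0, 0, 0);

        asm volatile(KVM_HYPERCALL
                     : "=a"(ret)
                     : "a"(nr)
                     : "memory");
        return ret;
}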
On Mon, Oct 04, 2021 at 07:52:03PM -0700, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 3d0416515506..062ac4720434 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -110,6 +110,41 @@ static __cpuidle void tdx_safe_halt(void)
> _tdx_halt(irq_disabled, do_sti);
> }
>
> +static u64 tdx_read_msr_safe(unsigned int msr, int *err)
Here the kernel convention would probably be to return the error and
make 'val' an argument:
static int tdx_read_msr_safe(unsigned int msr, u64 *val)
> +{
> + struct tdx_hypercall_output out = {0};
> + u64 ret;
> +
> + /*
> + * Emulate the MSR read via hypercall. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface
> + * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
> + */
> + ret = _tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
> +
> + *err = ret ? -EIO : 0;
> +
> + return out.r11;
> +}
> +
> +static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
> + unsigned int high)
> +{
> + u64 ret;
> +
> + WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
This fails to build, tdx_is_context_switched_msr() is missing.
> @@ -136,19 +171,33 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
> int tdx_handle_virtualization_exception(struct pt_regs *regs,
> struct ve_info *ve)
> {
> + unsigned long val;
> + int ret = 0;
> +
> switch (ve->exit_reason) {
> case EXIT_REASON_HLT:
> tdx_halt();
> break;
> + case EXIT_REASON_MSR_READ:
> + val = tdx_read_msr_safe(regs->cx, (unsigned int *)&ret);
Why the 'unsigned int *' cast?
Also, 'val' should have the same type as the one returned by
tdx_read_msr_safe(). (Though it technically doesn't matter here.)
--
Josh
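For reference, the error-out-parameter convention Josh describes would make the read helper look roughly like this; a sketch only, reusing the _tdx_hypercall() wrapper quoted earlier in the thread, with -EIO kept as a placeholder error value:

static int tdx_read_msr_safe(unsigned int msr, u64 *val)
{
        struct tdx_hypercall_output out;

        /* TDG.VP.VMCALL<Instruction.RDMSR>, as in the original patch */
        if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
                return -EIO;

        *val = out.r11;
        return 0;
}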
On Mon, Oct 04, 2021 at 07:52:05PM -0700, Kuppuswamy Sathyanarayanan wrote:
> +static u64 tdx_handle_cpuid(struct pt_regs *regs)
> +{
> + struct tdx_hypercall_output out = {0};
> + u64 ret;
> +
> + /*
> + * Emulate CPUID instruction via hypercall. More info about
> + * ABI can be found in TDX Guest-Host-Communication Interface
> + * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
> + */
> + ret = _tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
> +
> + /*
> + * As per TDX GHCI CPUID ABI, r12-r15 registers contains contents of
> + * EAX, EBX, ECX, EDX registers after CPUID instruction execution.
> + * So copy the register contents back to pt_regs.
> + */
> + regs->ax = out.r12;
> + regs->bx = out.r13;
> + regs->cx = out.r14;
> + regs->dx = out.r15;
Does it still make sense to save the regs if _tdx_hypercall() returns an
error?
> +
> + return ret;
Also I'm wondering about error handling for all these _tdx_hypercall()
wrapper functions which are called by the #VE handler.
First, there are some inconsistencies in whether and how they return the
r10 error.
- _tdx_halt() warns and doesn't return anything.
- tdx_read_msr_safe() and tdx_write_msr_safe() convert all errors to -EIO.
- tdx_handle_cpuid() returns the raw vmcall error.
Second, as far as I can tell, the #VE handler doesn't check the actual
return code value, other than checking for non-zero. Should it at least
be printed in a warning?
--
Josh
On Mon, Oct 04, 2021 at 07:51:59PM -0700, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index ad3ff5925153..d5fc2904facf 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -27,6 +27,29 @@ static void __init is_tdx_guest_init(void)
> 	is_tdx_guest = !memcmp("IntelTDX    ", sig, 12);
> }
>
> +/*
> + * Wrapper for standard use of __tdx_hypercall with BUG_ON() check
> + * for TDCALL error.
> + */
> +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
> + u64 r15, struct tdx_hypercall_output *out)
> +{
> + struct tdx_hypercall_output outl = {0};
Why do you need to clear this when, if used, it will get completely
overwritten in __tdx_hypercall()?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/7/21 10:06 AM, Borislav Petkov wrote:
> Same question as before - why do you need to clear this @out thing above
> when __tdx_module_call() will overwrite it?
Now checking again, I think we don't need to initialize the @out pointer.
I will fix this in all cases.
But for the tdx_get_ve_info() case, we are updating the @ve pointer without
checking the TDCALL return value, and __tdx_module_call() will update @out
only if the TDCALL is successful. Currently, due to the @out = {0}
initialization, this logic does not cause any issue. But to fix it properly,
I need to check the TDCALL return value before updating the @ve values.
>
> What you should do instead is check that @ve pointer which you get
> passed in - it might be NULL.
In the current use of tdx_get_ve_info(), the @ve pointer can never be NULL.
But if you want to add this check for possible future issues, I can do it.
arch/x86/kernel/traps.c:1205: ret = tdx_get_ve_info(&ve);
arch/x86/kernel/tdx.c:945: if (tdx_get_ve_info(&ve))
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 10/7/21 2:33 AM, Borislav Petkov wrote:
> Why do you need to clear this when, if used, it will get completely
> overwritten in __tdx_hypercall()?
Yes, we don't need to clear it. I can remove the initialization
in the next version.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Thu, Oct 07, 2021 at 10:22:16AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Now checking again, I think we don't need to initialize the @out pointer.
> I will fix this in all cases.
>
> But for the tdx_get_ve_info() case, we are updating the @ve pointer without
> checking the TDCALL return value, and __tdx_module_call() will update @out
> only if the TDCALL is successful. Currently, due to the @out = {0}
> initialization, this logic does not cause any issue. But to fix it properly,
> I need to check the TDCALL return value before updating the @ve values.
Yes, that looked weird too, but I assumed that since you enforce @out to be
non-NULL, you know that you'll get ve-> populated.
However, yes, you need to check the retval for that other case:
/* Check for TDCALL success: 0 - Successful, otherwise failed */
test %rax, %rax
jnz 1f
> In the current use of tdx_get_ve_info(), the @ve pointer can never be NULL.
> But if you want to add this check for possible future issues, I can do it.
>
> arch/x86/kernel/traps.c:1205: ret = tdx_get_ve_info(&ve);
> arch/x86/kernel/tdx.c:945: if (tdx_get_ve_info(&ve))
Yes, you want to fail gracefully in the case where someone forgets to
pass in a proper pointer to tdx_get_ve_info() in the future. As it is
now, the guest would simply explode and that is not nice.
Also, that function doesn't need to return an unsigned long if all it
wants to return is a bool, from looking at its call sites.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
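Taken together, the shape being discussed for tdx_get_ve_info() might be roughly the following; this is a sketch only, and the return type, error values and whether to warn or fail silently are still being debated further down the thread:

int tdx_get_ve_info(struct ve_info *ve)
{
        struct tdx_module_output out;

        if (WARN_ON_ONCE(!ve))
                return -EINVAL;

        /* NMIs and machine checks stay suppressed until this TDGETVEINFO call */
        if (__tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out))
                return -EIO;

        ve->exit_reason = out.rcx;
        ve->exit_qual   = out.rdx;
        ve->gla         = out.r8;
        ve->gpa         = out.r9;
        ve->instr_len   = lower_32_bits(out.r10);
        ve->instr_info  = upper_32_bits(out.r10);

        return 0;
}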
On 10/6/21 12:17 PM, Josh Poimboeuf wrote:
> I'm having trouble following this last comment. Does it mean there's no
> way to return the error back to the #VE handler when this is called in
> #VE context? Seems like that would be an easy problem to solve by
> shuffling the functions a bit.
Currently the tdx_*halt() calls are used in both pv_ops and
tdx_handle_virtualization_exception(). If it were just
tdx_handle_virtualization_exception(), we could simply return the error
and let exc_virtualization_exception() handle it using ve_raise_fault().
But since pv_ops is also involved, instead of defining two different halt
handler functions for pv_ops and exc_virtualization_exception(), we just
use WARN_ON to report the error.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Mon, Oct 04, 2021 at 07:52:00PM -0700, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index d5fc2904facf..f7885c777a09 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -6,6 +6,9 @@
>
> #include <asm/tdx.h>
>
> +/* TDX Module call Leaf IDs */
> +#define TDGETVEINFO 3
> +
> /*
> * Allocate it in the data region to avoid zeroing it during
> * BSS initialization. It is mainly used in cc_platform_has()
> @@ -50,6 +53,36 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
> return out->r10;
> }
>
> +unsigned long tdx_get_ve_info(struct ve_info *ve)
> +{
> + struct tdx_module_output out = {0};
> + u64 ret;
> +
> + /*
> + * NMIs and machine checks are suppressed. Before this point any
> + * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
> + * additional #VEs are permitted (but it is expected not to
> + * happen unless kernel panics).
> + */
> + ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
Same question as before - why do you need to clear this @out thing above
when __tdx_module_call() will overwrite it?
What you should do instead is check that @ve pointer which you get
passed in - it might be NULL.
> + ve->exit_reason = out.rcx;
> + ve->exit_qual = out.rdx;
> + ve->gla = out.r8;
> + ve->gpa = out.r9;
> + ve->instr_len = out.r10 & UINT_MAX;
> + ve->instr_info = out.r10 >> 32;
> +
> + return ret;
>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/6/21 12:49 PM, Josh Poimboeuf wrote:
> On Mon, Oct 04, 2021 at 07:52:03PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 3d0416515506..062ac4720434 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -110,6 +110,41 @@ static __cpuidle void tdx_safe_halt(void)
>> _tdx_halt(irq_disabled, do_sti);
>> }
>>
>> +static u64 tdx_read_msr_safe(unsigned int msr, int *err)
>
> Here the kernel convention would probably be to return the error and
> make 'val' an argument:
>
> static int tdx_read_msr_safe(unsigned int msr, u64 *val)
Agreed. I will fix this in the next version.
>
>
>> +{
>> + struct tdx_hypercall_output out = {0};
>> + u64 ret;
>> +
>> + /*
>> + * Emulate the MSR read via hypercall. More info about ABI
>> + * can be found in TDX Guest-Host-Communication Interface
>> + * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
>> + */
>> + ret = _tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
>> +
>> + *err = ret ? -EIO : 0;
>> +
>> + return out.r11;
>> +}
>> +
>> +static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
>> + unsigned int high)
>> +{
>> + u64 ret;
>> +
>> + WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
>
> This fails to build, tdx_is_context_switched_msr() is missing.
>
>> @@ -136,19 +171,33 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
>> int tdx_handle_virtualization_exception(struct pt_regs *regs,
>> struct ve_info *ve)
>> {
>> + unsigned long val;
>> + int ret = 0;
>> +
>> switch (ve->exit_reason) {
>> case EXIT_REASON_HLT:
>> tdx_halt();
>> break;
>> + case EXIT_REASON_MSR_READ:
>> + val = tdx_read_msr_safe(regs->cx, (unsigned int *)&ret);
>
> Why the 'unsigned int *' cast?
>
> Also, 'val' should have the same type as the one returned by
> tdx_read_msr_safe(). (Though it technically doesn't matter here.)
I will change val's type to u64. It will not need any casting.
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 10/6/21 1:26 PM, Josh Poimboeuf wrote:
> On Mon, Oct 04, 2021 at 07:52:05PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> +static u64 tdx_handle_cpuid(struct pt_regs *regs)
>> +{
>> + struct tdx_hypercall_output out = {0};
>> + u64 ret;
>> +
>> + /*
>> + * Emulate CPUID instruction via hypercall. More info about
>> + * ABI can be found in TDX Guest-Host-Communication Interface
>> + * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
>> + */
>> + ret = _tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
>> +
>> + /*
>> + * As per TDX GHCI CPUID ABI, r12-r15 registers contains contents of
>> + * EAX, EBX, ECX, EDX registers after CPUID instruction execution.
>> + * So copy the register contents back to pt_regs.
>> + */
>> + regs->ax = out.r12;
>> + regs->bx = out.r13;
>> + regs->cx = out.r14;
>> + regs->dx = out.r15;
>
> Does it still make sense to save the regs if _tdx_hypercall() returns an
> error?
We don't need to save them in the failure case. I will add a check for the
error case in the next version.
>
>> +
>> + return ret;
>
> Also I'm wondering about error handling for all these _tdx_hypercall()
> wrapper functions which are called by the #VE handler.
>
> First, there are some inconsistencies in whether and how they return the
> r10 error.
Since we only care about the zero/non-zero return value, we did not
check for consistency. Maybe I can convert all handler return values
to bool.
>
> - _tdx_halt() warns and doesn't return anything.
Since the tdx_halt() handler is shared with pv_ops, we can't return anything
back (so we use WARN_ON to report the error).
>
> - tdx_read_msr_safe() and tdx_write_msr_safe() convert all errors to -EIO.
The return value does not matter; we only check for a zero/non-zero value in
tdx_handle_virtualization_exception(). We used -EIO to convey that it is
an I/O error.
>
> - tdx_handle_cpuid() returns the raw vmcall error.
>
> Second, as far as I can tell, the #VE handler doesn't check the actual
> return code value, other than checking for non-zero. Should it at least
> be printed in a warning?
I don't think this is required. We can use tracing to check the error code
or argument details in the failure case. Since we don't really use the error
value, I am planning to change the #VE handler return type to bool.
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
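Purely as an illustration of the "leave pt_regs untouched on failure" point, the CPUID handler might end up looking like this; the choice of return type and value is exactly what is still under discussion:

static u64 tdx_handle_cpuid(struct pt_regs *regs)
{
        struct tdx_hypercall_output out;
        u64 ret;

        /* TDG.VP.VMCALL<Instruction.CPUID>, as in the original patch */
        ret = _tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
        if (ret)
                return ret;     /* do not copy anything back on failure */

        regs->ax = out.r12;
        regs->bx = out.r13;
        regs->cx = out.r14;
        regs->dx = out.r15;

        return 0;
}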
On Mon, Oct 04, 2021 at 07:52:01PM -0700, Kuppuswamy Sathyanarayanan wrote:
> @@ -240,6 +243,33 @@ SYM_FUNC_START(__tdx_hypercall)
>
> movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>
> + /*
> + * For the idle loop STI needs to be called directly before
> + * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
> + * enables interrupts only one instruction later. If there
> + * are any instructions between the STI and the TDCALL for
> + * HLT then an interrupt could happen in that time, but the
> + * code would go back to sleep afterwards, which can cause
> + * longer delays.
> + *
> + * This leads to significant difference in network performance
> + * benchmarks. So add a special case for EXIT_REASON_HLT to
> + * trigger STI before TDCALL. But this change is not required
> + * for all HLT cases. So use R15 register value to identify the
> + * case which needs STI. So, if R11 is EXIT_REASON_HLT and R15
> + * is 1, then call STI before TDCALL instruction. Note that R15
> + * register is not required by TDCALL ABI when triggering the
> + * hypercall for EXIT_REASON_HLT case. So use it in software to
> + * select the STI case.
> + */
> + cmpl $EXIT_REASON_HLT, %r11d
> + jne skip_sti
> + cmpl $1, %r15d
> + jne skip_sti
> + /* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
> + xor %r15, %r15
Then you don't need to clear it either, right?
> + sti
> +skip_sti:
> tdcall
>
> /* Restore output pointer to R9 */
...
> +static __cpuidle void tdx_halt(void)
> +{
> + const bool irq_disabled = irqs_disabled();
> + const bool do_sti = false;
> +
> + /*
> + * Non safe halt is mainly used in CPU off-lining
> + * and the guest will stay in halt state. So,
> + * STI instruction call is not required (set
> + * do_sti as false).
> + */
Just like you've done below, put the comment over the variable assignment:
/*
* Non safe halt is mainly used in CPU offlining and the guest will stay in halt
* state. So, STI instruction call is not required.
*/
const bool do_sti = false;
> + _tdx_halt(irq_disabled, do_sti);
> +}
> +
> +static __cpuidle void tdx_safe_halt(void)
> +{
> + /*
> + * Since STI instruction will be called in __tdx_hypercall()
> + * set irq_disabled as false.
> + */
> + const bool irq_disabled = false;
> + const bool do_sti = true;
> +
> + _tdx_halt(irq_disabled, do_sti);
> +}
> +
> unsigned long tdx_get_ve_info(struct ve_info *ve)
> {
> struct tdx_module_output out = {0};
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/8/21 10:31 AM, Borislav Petkov wrote:
> On Mon, Oct 04, 2021 at 07:52:01PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> @@ -240,6 +243,33 @@ SYM_FUNC_START(__tdx_hypercall)
>>
>> movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>>
>> + /*
>> + * For the idle loop STI needs to be called directly before
>> + * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
>> + * enables interrupts only one instruction later. If there
>> + * are any instructions between the STI and the TDCALL for
>> + * HLT then an interrupt could happen in that time, but the
>> + * code would go back to sleep afterwards, which can cause
>> + * longer delays.
>> + *
>> + * This leads to significant difference in network performance
>> + * benchmarks. So add a special case for EXIT_REASON_HLT to
>> + * trigger STI before TDCALL. But this change is not required
>> + * for all HLT cases. So use R15 register value to identify the
>> + * case which needs STI. So, if R11 is EXIT_REASON_HLT and R15
>> + * is 1, then call STI before TDCALL instruction. Note that R15
>> + * register is not required by TDCALL ABI when triggering the
>> + * hypercall for EXIT_REASON_HLT case. So use it in software to
>> + * select the STI case.
>> + */
>> + cmpl $EXIT_REASON_HLT, %r11d
>> + jne skip_sti
>> + cmpl $1, %r15d
>> + jne skip_sti
>> + /* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
>> + xor %r15, %r15
>
> Then you don't need to clear it either, right?
Yes. As per the ABI, I don't need to clear it; it will not be used by the VMM.
But since that register content is shared with the VMM, I tried to keep the
unused register values at '0' (consistent with other hypercall use cases).
I am fine with removing it if you think it is unnecessary.
>
>> + sti
>> +skip_sti:
>> tdcall
>>
>> /* Restore output pointer to R9 */
>
> ...
>
>> +static __cpuidle void tdx_halt(void)
>> +{
>> + const bool irq_disabled = irqs_disabled();
>> + const bool do_sti = false;
>> +
>> + /*
>> + * Non safe halt is mainly used in CPU off-lining
>> + * and the guest will stay in halt state. So,
>> + * STI instruction call is not required (set
>> + * do_sti as false).
>> + */
>
> Just like you've done below, put the comment over the variable assignment:
>
> /*
> * Non safe halt is mainly used in CPU offlining and the guest will stay in halt
> * state. So, STI instruction call is not required.
> */
> const bool do_sti = false;
Will move it above in the next version.
>
>
>> + _tdx_halt(irq_disabled, do_sti);
>> +}
>> +
>> +static __cpuidle void tdx_safe_halt(void)
>> +{
>> + /*
>> + * Since STI instruction will be called in __tdx_hypercall()
>> + * set irq_disabled as false.
>> + */
>> + const bool irq_disabled = false;
>> + const bool do_sti = true;
>> +
>> + _tdx_halt(irq_disabled, do_sti);
>> +}
>> +
>> unsigned long tdx_get_ve_info(struct ve_info *ve)
>> {
>> struct tdx_module_output out = {0};
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Fri, Oct 08, 2021 at 10:38:22AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Yes. As per the ABI, I don't need to clear it; it will not be used by the VMM.
> But since that register content is shared with the VMM, I tried to keep the
> unused register values at '0' (consistent with other hypercall use cases).
>
> I am fine with removing it if you think it is unnecessary.
Yeah, you probably wanna clear it after all, just in case, as Dirk says.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Oct 5, 2021 at 10:54 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> The entry paths do not access TD-shared memory, MMIO regions or use
> those specific MSRs, instructions, CPUID leaves that might generate #VE.
> In addition, all interrupts including NMIs are blocked by the hardware
> starting with #VE delivery until TDGETVEINFO is called. This eliminates
> the chance of a #VE during the syscall gap or paranoid entry paths and
> simplifies #VE handling.
Hello
If this reasoning applies to #VE, I think it can be applied to SEV-ES's
#VC too. (I wish the entry code for #VC could be simplified, since I'm
moving some of the asm entry code to C code.)
And I'm sorry I haven't read all the emails.
Has the question asked by Andy Lutomirski been answered in any emails?
https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
Thanks
Lai
>
> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> but it is expected not to happen because TDX expects NMIs not to
> trigger #VEs. Another case where they could happen is if the #VE
> exception panics, but in this case there are no guarantees on anything
> anyways.
>
On Sat, Oct 09, 2021, Lai Jiangshan wrote:
> On Tue, Oct 5, 2021 at 10:54 AM Kuppuswamy Sathyanarayanan
> <[email protected]> wrote:
>
> >
> > The entry paths do not access TD-shared memory, MMIO regions or use
> > those specific MSRs, instructions, CPUID leaves that might generate #VE.
> > In addition, all interrupts including NMIs are blocked by the hardware
> > starting with #VE delivery until TDGETVEINFO is called. This eliminates
> > the chance of a #VE during the syscall gap or paranoid entry paths and
> > simplifies #VE handling.
Minor clarification: it eliminates the chance of a #VE during the syscall gap
_if the VMM is benign_. If the VMM is malicious, it can unmap and remap the
syscall page to induce an EPT Violation #VE due to the page not being accepted.
> Hello
>
> If this reasoning applies to #VE, I think it can be applied to SEV-ES's
> #VC too. (I wish the entry code for #VC could be simplified, since I'm
> moving some of the asm entry code to C code.)
>
> And I'm sorry I haven't read all the emails.
> Has the question asked by Andy Lutomirski been answered in any emails?
>
> https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
This question?
Can the hypervisor cause an already-accepted secure-EPT page to transition to
the unaccepted state?
Yep. I wrote the above before following the link, I should have guessed which
question it was :-)
IIRC, the proposed middle ground was to add a TDCALL and/or TDPARAMS setting that
would allow the guest to opt-out of EPT Violation #VE due to page not accepted,
and instead terminate the VM on such a condition. The caveat is that that would
require the kernel to never take a "page not accepted" #VE when doing lazy page
acceptance, but that was deemed doable.
That also raises the question of whether Andy's NAK applies to SEV-SNP without
support for "Enhanced SYSCALL Behavior"[*], otherwise SEV-SNP has the same "#VC
in syscall gap" attack.
[*] https://www.amd.com/system/files/TechDocs/57115.pdf
> Minor clarification: it eliminates the chance of a #VE during the syscall gap
> _if the VMM is benign_. If the VMM is malicious, it can unmap and remap the
> syscall page to induce an EPT Violation #VE due to the page not being accepted.
This has been addressed. The TDX module will support a mode that forbids
unmapping pages permanently, and Linux is going to check/enforce that
this mode is enabled. The patch for the check is not included in the
posted patches yet though.
>
> This question?
>
> Can the hypervisor cause an already-accepted secure-EPT page to transition to
> the unaccepted state?
>
> Yep. I wrote the above before following the link, I should have guessed which
> question it was :-)
>
> IIRC, the proposed middle ground was to add a TDCALL and/or TDPARAMS setting that
> would allow the guest to opt-out of EPT Violation #VE due to page not accepted,
It's a TDPARAMS setting
-Andi
On Thu, Oct 07, 2021 at 07:25:44PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> > - _tdx_halt() warns and doesn't return anything.
>
> Since tdx_halt handler is shared with pv_ops, we can't return anything
> back (so we use WARN_ON to report the error).
Saying the "current structure of the code doesn't allow it" is not a
valid reason.
You could for example have _tdx_halt() return an error, and have the #VE
handler call _tdx_halt() directly so it can receive the error.
> > - tdx_read_msr_safe() and tdx_write_msr_safe() convert all errors to -EIO.
>
> Return value does not matter. we only check for zero/non-zero value in
> tdx_handle_virtualization_exception(). we have used -EIO to convey that it is
> an IO error.
If I'm reading the GHCI spec right, the only possible error code for msr
access is VMCALL_INVALID_OPERAND. Which sounds like -EINVAL to me.
> > - tdx_handle_cpuid() returns the raw vmcall error.
> >
> > Second, as far as I can tell, the #VE handler doesn't check the actual
> > return code value, other than checking for non-zero. Should it at least
> > be printed in a warning?
>
> I don't think this is required. We can use trace to check the error code
> or argument details in failure case. Since we don't really use the
> error value, I am planning to change the #VE handler return type to
> bool.
In general the kernel prefers errno values instead of bool. If
hard-coding an error, it should probably be hard-coded to a valid kernel
errno from within the inner handler (_tdx_halt(), tdx_read_msr_safe(),
etc).
That prevents confusion and will give the freedom to add more return
codes in the future, in case the inner handler wants to eventually
report multiple error types, or the top-level handler wants to handle
error types differently, or trace them, or warn, etc.
--
Josh
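As a concrete illustration of that suggestion (hypothetical shape only; the posted series does not look exactly like this), _tdx_halt() could return an errno that the #VE handler consumes directly, while the pv_ops wrapper keeps the warning:

static int _tdx_halt(bool irq_disabled, bool do_sti)
{
        /* TDG.VP.VMCALL<Instruction.HLT>; map any failure to a kernel errno */
        if (_tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0, do_sti, NULL))
                return -EIO;

        return 0;
}

/* pv_ops path: nowhere to propagate the error, so warn here */
static __cpuidle void tdx_halt(void)
{
        WARN_ONCE(_tdx_halt(irqs_disabled(), false),
                  "HLT instruction emulation failed\n");
}

/* the #VE handler would call _tdx_halt() directly and get the errno back */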
On 10/7/21 10:06 AM, Borislav Petkov wrote:
> On Mon, Oct 04, 2021 at 07:52:00PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> +unsigned long tdx_get_ve_info(struct ve_info *ve)
>> +{
>> + struct tdx_module_output out = {0};
>> + u64 ret;
>> +
>> + /*
>> + * NMIs and machine checks are suppressed. Before this point any
>> + * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
>> + * additional #VEs are permitted (but it is expected not to
>> + * happen unless kernel panics).
>> + */
>> + ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
> Same question as before - why do you need to clear this @out thing above
> when __tdx_module_call() will overwrite it?
>
> What you should do instead is check that @ve pointer which you get
> passed in - it might be NULL.
Hi Borislav,
That ve_info really is specific to handling a fault. There's only one
call site:
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> + struct ve_info ve;
...
> + ret = tdx_get_ve_info(&ve);
I think it's equivalent to something like a 'pt_regs' or 'stack_info'
that we pass around in other exception handlers. It's always stack
allocated. It's never dynamically allocated and NULL is never passed
for some other semantic reason.
On Sun, Oct 17, 2021 at 10:15:22AM -0700, Dave Hansen wrote:
> I think it's equivalent to something like a 'pt_regs' or 'stack_info'
> that we pass around in other exception handlers. It's always stack
> allocated. It's never dynamically allocated and NULL is never passed
> for some other semantic reason.
Ok, but why is adding that check such a big deal?
Are you saying that nothing else will call tdx_get_ve_info() in the
future so we should trust the passed in *ve pointer blindly or should we
simply add that cheap check just in case.
I don't mind going without it, but I'm wondering why a little defensive
programming is a problem at all.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 10/18/21 3:53 AM, Borislav Petkov wrote:
> On Sun, Oct 17, 2021 at 10:15:22AM -0700, Dave Hansen wrote:
>> I think it's equivalent to something like a 'pt_regs' or 'stack_info'
>> that we pass around in other exception handlers. It's always stack
>> allocated. It's never dynamically allocated and NULL is never passed
>> for some other semantic reason.
> Ok, but why is adding that check such a big deal?
It's really not a big deal.
> Are you saying that nothing else will call tdx_get_ve_info() in the
> future so we should trust the passed in *ve pointer blindly or should we
> simply add that cheap check just in case.
>
> I don't mind going without it, but I'm wondering why a little defensive
> programming is a problem at all.
It's not a problem. I think I'd just rather see a splat (oops or
WARN_ON_ONCE()) than silently returning an error.
On Mon, Oct 18, 2021 at 07:05:21AM -0700, Dave Hansen wrote:
> It's not a problem. I think I'd just rather see a splat (oops or
> WARN_ON_ONCE()) than silently returning an error.
Ok, that makes sense too.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Oct 04, 2021, Kuppuswamy Sathyanarayanan wrote:
> [Isaku Yamahata: proposed KVM VENDOR string]
...
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 458a564dd4c2..ebb97e082376 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -6,8 +6,9 @@
> #include <linux/cpufeature.h>
> #include <linux/types.h>
>
> -#define TDX_CPUID_LEAF_ID 0x21
> -#define TDX_HYPERCALL_STANDARD 0
> +#define TDX_CPUID_LEAF_ID 0x21
> +#define TDX_HYPERCALL_STANDARD 0
> +#define TDX_HYPERCALL_VENDOR_KVM 0x4d564b2e584454 /* TDX.KVM */
...
> +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
> +static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> + unsigned long p2, unsigned long p3,
> + unsigned long p4)
> +{
> + struct tdx_hypercall_output out;
> + u64 err;
> +
> + err = __tdx_hypercall(TDX_HYPERCALL_VENDOR_KVM, nr, p1, p2,
> + p3, p4, &out);
Why use a magic string? There are already mechanisms for the host to announce
itself to the guest, i.e. the guest shouldn't be attempting these hypercalls unless
it knows it's running on KVM (or something that implements KVM's ABI, whatever
that may be).
The only use case I can think of is to support multiple flavors of hypercalls in
the VMM, e.g. to let KVM support both KVM and Hyper-V hypercalls when KVM is
masquerading as Hyper-V, but this magic value alone isn't sufficient.
And if there is a future where KVM wants to support multiple sets of hypercalls,
using the entire 64-bit GPR for a magic value is unnecessary and wasteful, e.g.
it requires an oversized MOV imm, reg.
Why not use a single high bit? Actually, looking at KVM's set of hypercalls,
the guest can literally pass @nr as is. The GHCI defines all non-zero values as
vendor owned, i.e. this needs to ensure only that @nr is non-zero. For whatever
reason, perhaps to avoid false positives if the guest forgets to fill the value,
KVM's hypercalls start at '1'.
Regardless of what is stuffed into r10 for the TDVMCALL, if it's anything other
than KVM's raw hypercall number, it absolutely must go into
include/uapi/linux/kvm_para.h and it should be done as a standalone commit,
because what you're proposing here is effectively committing KVM to supporting
an ABI. That is not remotely clear from the changelog.
+Isaku
Hi Isaku/Kirill,
On 11/5/21 1:59 PM, Sean Christopherson wrote:
> Why use a magic string? There are already mechanisms for the host to announce
> itself to the guest, i.e. the guest shouldn't be attempting these hypercalls unless
> it knows it's running on KVM (or something that implements KVM's ABI, whatever
> that may be).
>
> The only use case I can think of is to support multiple flavors of hypercalls in
> the VMM, e.g. to let KVM support both KVM and Hyper-V hypercalls when KVM is
> masquerading as Hyper-V, but this magic value alone isn't sufficient.
>
> And if there is a future where KVM wants to support multiple sets of hypercalls,
> using the entire 64-bit GPR for a magic value is unnecessary and wasteful, e.g.
> it requires an oversized MOV imm, reg.
>
> Why not use a single high bit? Actually, looking at KVM's set of hypercalls,
> the guest can literally pass @nr as is. The GHCI defines all non-zero values as
> vendor owned, i.e. this needs to ensure only that @nr is non-zero. For whatever
> reason, perhaps to avoid false positives if the guest forgets to fill the value,
> KVM's hypercalls start at '1'.
>
> Regardless of what is stuffed into r10 for the TDVMCALL, if it's anything other
> than KVM's raw hypercall number, it absolutely must go into
> include/uapi/linux/kvm_para.h and it should be done as a standalone commit,
> because what you're proposing here is effectively committing KVM to supporting
> an ABI. That is not remotely clear from the changelog.
Do you see any issues with using a single bit or just passing the @nr as
it is?
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer