Subject: [RFC v2 00/32] Add TDX Guest Support

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately. Support for advanced TD guest features like attestation
or debug mode will also be submitted separately. Note that at this
point the series is not secure: there are known holes in drivers, and
it has not yet been fully audited and fuzzed.

TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE exception
into the guest TD for instructions that need to be emulated, disallowed
MSR accesses, a subset of CPUID leaves, etc.
3. By default, memory is marked as private, and the TD selectively shares it
with the VMM as needed.
4. Remote attestation is supported to enable a third party (either the owner of
the workload or a user of the services provided by the workload) to establish
that the workload is running on an Intel-TDX-enabled platform located within a
TD prior to providing that workload data.

You can find TDX-related documents at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Changes since v1:
* Implemented tdcall() and tdvmcall() helper functions in assembly and renamed
them as __tdcall() and __tdvmcall().
* Added do_general_protection() helper function to re-use protection
code between #GP exception and TDX #VE exception handlers.
* Addressed syscall gap issue in #VE handler support (for details check
the commit log in "x86/traps: Add #VE support for TDX guest").
* Modified patch titled "x86/tdx: Handle port I/O" to re-use common
tdvmcall() helper function.
* Added error handling support to MADT CPU wakeup code.
* Introduced enum tdx_map_type to identify SHARED vs PRIVATE memory type.
* Enabled shared memory in IOAPIC driver.
* Added BINUTILS version info for TDCALL.
* Changed the TDVMCALL vendor id from 0 to "TDX.KVM".
* Replaced WARN() with pr_warn_ratelimited() in __tdvmcall() wrappers.
* Addressed review comments related to commit logs and code comments.
* Renamed patch titled "x86/topology: Disable CPU hotplug support for TDX
platforms" to "x86/topology: Disable CPU online/offline control for
TDX guest"
* Rebased on top of v5.12 kernel.


Erik Kaneda (1):
ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure

Isaku Yamahata (1):
x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (16):
x86/paravirt: Introduce CONFIG_PARAVIRT_XL
x86/tdx: Get TD execution environment information via TDINFO
x86/traps: Add #VE support for TDX guest
x86/tdx: Add HLT support for TDX guest
x86/tdx: Wire up KVM hypercalls
x86/tdx: Add MSR support for TDX guest
x86/tdx: Handle CPUID via #VE
x86/io: Allow to override inX() and outX() implementation
x86/tdx: Handle port I/O
x86/tdx: Handle in-kernel MMIO
x86/mm: Move force_dma_unencrypted() to common code
x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
x86/tdx: Make pages shared in ioremap()
x86/tdx: Add helper to do MapGPA TDVMCALL
x86/tdx: Make DMA pages shared
x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (10):
x86/tdx: Introduce INTEL_TDX_GUEST config option
x86/cpufeatures: Add TDX Guest CPU feature
x86/x86: Add is_tdx_guest() interface
x86/tdx: Add __tdcall() and __tdvmcall() helper functions
x86/traps: Add do_general_protection() helper function
x86/tdx: Handle MWAIT, MONITOR and WBINVD
ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
ACPI/table: Print MADT Wake table information
x86/acpi, x86/boot: Add multiprocessor wake-up support
x86/topology: Disable CPU online/offline control for TDX guest

Sean Christopherson (4):
x86/boot: Add a trampoline for APs booting in 64-bit mode
x86/boot: Avoid #VE during compressed boot for TDX platforms
x86/boot: Avoid unnecessary #VE during boot process
x86/tdx: Forcefully disable legacy PIC for TDX guests

arch/x86/Kconfig | 28 +-
arch/x86/boot/compressed/Makefile | 2 +
arch/x86/boot/compressed/head_64.S | 10 +-
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/boot/compressed/tdcall.S | 9 +
arch/x86/boot/compressed/tdx.c | 32 ++
arch/x86/include/asm/apic.h | 3 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/io.h | 24 +-
arch/x86/include/asm/irqflags.h | 38 +-
arch/x86/include/asm/kvm_para.h | 21 +
arch/x86/include/asm/paravirt.h | 22 +-
arch/x86/include/asm/paravirt_types.h | 3 +-
arch/x86/include/asm/pgtable.h | 3 +
arch/x86/include/asm/realmode.h | 1 +
arch/x86/include/asm/tdx.h | 176 +++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/acpi/boot.c | 79 ++++
arch/x86/kernel/apic/apic.c | 8 +
arch/x86/kernel/apic/io_apic.c | 12 +-
arch/x86/kernel/asm-offsets.c | 22 ++
arch/x86/kernel/head64.c | 3 +
arch/x86/kernel/head_64.S | 13 +-
arch/x86/kernel/idt.c | 6 +
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/pci-swiotlb.c | 2 +-
arch/x86/kernel/smpboot.c | 5 +
arch/x86/kernel/tdcall.S | 361 +++++++++++++++++
arch/x86/kernel/tdx-kvm.c | 45 +++
arch/x86/kernel/tdx.c | 480 +++++++++++++++++++++++
arch/x86/kernel/topology.c | 3 +-
arch/x86/kernel/traps.c | 81 ++--
arch/x86/mm/Makefile | 2 +
arch/x86/mm/ioremap.c | 8 +-
arch/x86/mm/mem_encrypt.c | 75 ----
arch/x86/mm/mem_encrypt_common.c | 85 ++++
arch/x86/mm/mem_encrypt_identity.c | 1 +
arch/x86/mm/pat/set_memory.c | 48 ++-
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 49 ++-
arch/x86/realmode/rm/trampoline_common.S | 5 +-
drivers/acpi/tables.c | 11 +
include/acpi/actbl2.h | 26 +-
45 files changed, 1654 insertions(+), 162 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdcall.S
create mode 100644 arch/x86/boot/compressed/tdx.c
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/kernel/tdcall.S
create mode 100644 arch/x86/kernel/tdx-kvm.c
create mode 100644 arch/x86/kernel/tdx.c
create mode 100644 arch/x86/mm/mem_encrypt_common.c

--
2.25.1


Subject: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

From: "Kirill A. Shutemov" <[email protected]>

Split off the halt paravirt calls from CONFIG_PARAVIRT_XXL into
a separate config option. It provides a middle ground for
not-so-deeply paravirtualized environments.

CONFIG_PARAVIRT_XL will be used by TDX, which needs a couple of paravirt
calls that were hidden under CONFIG_PARAVIRT_XXL; the rest of that
config would be bloat for TDX.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/Kconfig | 4 +++
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/include/asm/irqflags.h | 38 +++++++++++++++------------
arch/x86/include/asm/paravirt.h | 22 +++++++++-------
arch/x86/include/asm/paravirt_types.h | 3 ++-
arch/x86/kernel/paravirt.c | 4 ++-
arch/x86/mm/mem_encrypt_identity.c | 1 +
7 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..6b4b682af468 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,8 +783,12 @@ config PARAVIRT
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.

+config PARAVIRT_XL
+ bool
+
config PARAVIRT_XXL
bool
+ select PARAVIRT_XL

config PARAVIRT_DEBUG
bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..4b84abe43765 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
* paravirt and debugging variants are added.)
*/
#undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
#undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..1688841893d7 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)

#endif

-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_XL
#include <asm/paravirt.h>
#else
#ifndef __ASSEMBLY__
#include <linux/types.h>
-
-static __always_inline unsigned long arch_local_save_flags(void)
-{
- return native_save_fl();
-}
-
-static __always_inline void arch_local_irq_disable(void)
-{
- native_irq_disable();
-}
-
-static __always_inline void arch_local_irq_enable(void)
-{
- native_irq_enable();
-}
-
/*
* Used in the idle loop; sti takes one instruction cycle
* to complete:
@@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
{
native_halt();
}
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_XL */
+
+#ifndef CONFIG_PARAVIRT_XXL
+#ifndef __ASSEMBLY__
+
+static __always_inline unsigned long arch_local_save_flags(void)
+{
+ return native_save_fl();
+}
+
+static __always_inline void arch_local_irq_disable(void)
+{
+ native_irq_disable();
+}
+
+static __always_inline void arch_local_irq_enable(void)
+{
+ native_irq_enable();
+}

/*
* For spinlocks, etc:
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..2dbb6c9c7e98 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
PVOP_VCALL1(mmu.exit_mmap, mm);
}

+#ifdef CONFIG_PARAVIRT_XL
+static inline void arch_safe_halt(void)
+{
+ PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+ PVOP_VCALL0(irq.halt);
+}
+#endif
+
#ifdef CONFIG_PARAVIRT_XXL
static inline void load_sp0(unsigned long sp0)
{
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
}

-static inline void arch_safe_halt(void)
-{
- PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
- PVOP_VCALL0(irq.halt);
-}
-
static inline void wbinvd(void)
{
PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..5261fba47ba5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,7 +177,8 @@ struct pv_irq_ops {
struct paravirt_callee_save save_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_XL
void (*safe_halt)(void);
void (*halt)(void);
#endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..d6d0b363fe70 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
.irq.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.irq.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
.irq.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+#ifdef CONFIG_PARAVIRT_XL
.irq.safe_halt = native_safe_halt,
.irq.halt = native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_XL */

/* Mmu ops. */
.mmu.flush_tlb_user = native_flush_tlb_local,
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6c5eb6f3f14f..20d0cb116557 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,6 +24,7 @@
* be extended when new paravirt and debugging variants are added.)
*/
#undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS

--
2.25.1

Subject: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option

Add INTEL_TDX_GUEST config option to selectively compile
TDX guest support.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
---
arch/x86/Kconfig | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6b4b682af468..932e6d759ba7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -875,6 +875,21 @@ config ACRN_GUEST
IOT with small footprint and real-time features. More details can be
found in https://projectacrn.org/.

+config INTEL_TDX_GUEST
+ bool "Intel Trusted Domain eXtensions Guest Support"
+ depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
+ depends on SECURITY
+ select PARAVIRT_XL
+ select X86_X2APIC
+ select SECURITY_LOCKDOWN_LSM
+ help
+ Provide support for running in a trusted domain on Intel processors
+ equipped with Trusted Domain eXtensions. TDX is a new Intel
+ technology that extends VMX and Memory Encryption with a new kind of
+ virtual machine guest called Trust Domain (TD). A TD is designed to
+ run in a CPU mode that protects the confidentiality of TD memory
+ contents and the TD's CPU state from other software, including the VMM.
+
endif #HYPERVISOR_GUEST

source "arch/x86/Kconfig.cpu"
--
2.25.1

Subject: [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface

Add a helper function to detect TDX guest support. It will be used
to guard TDX-specific code.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/tdx.c | 32 +++++++++++++++++++++++++++++++
arch/x86/include/asm/tdx.h | 8 ++++++++
arch/x86/kernel/tdx.c | 6 ++++++
4 files changed, 47 insertions(+)
create mode 100644 arch/x86/boot/compressed/tdx.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..a2554621cefe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -96,6 +96,7 @@ ifdef CONFIG_X86_64
endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o

vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..0a87c1775b67
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include <asm/tdx.h>
+
+static int __ro_after_init tdx_guest = -1;
+
+static inline bool native_cpuid_has_tdx_guest(void)
+{
+ u32 eax = TDX_CPUID_LEAF_ID, signature[3] = {0};
+
+ if (native_cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+ return false;
+
+ native_cpuid(&eax, &signature[0], &signature[1], &signature[2]);
+
+ if (memcmp("IntelTDX    ", signature, 12))
+ return false;
+
+ return true;
+}
+
+bool is_tdx_guest(void)
+{
+ if (tdx_guest < 0)
+ tdx_guest = native_cpuid_has_tdx_guest();
+
+ return !!tdx_guest;
+}
+
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 679500e807f3..69af72d08d3d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -9,10 +9,18 @@

#include <asm/cpufeature.h>

+/* Common API to check TDX support in decompression and common kernel code. */
+bool is_tdx_guest(void);
+
void __init tdx_early_init(void);

#else // !CONFIG_INTEL_TDX_GUEST

+static inline bool is_tdx_guest(void)
+{
+ return false;
+}
+
static inline void tdx_early_init(void) { };

#endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f927e36769d5..6a7193fead08 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -19,6 +19,12 @@ static inline bool cpuid_has_tdx_guest(void)
return true;
}

+bool is_tdx_guest(void)
+{
+ return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+EXPORT_SYMBOL_GPL(is_tdx_guest);
+
void __init tdx_early_init(void)
{
if (!cpuid_has_tdx_guest())
--
2.25.1

Subject: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer (the TDX module) exists between host and guest to facilitate
secure communication. The "tdcall" instruction is used by the guest
to request services from the TDX module, and a variant of "tdcall"
(with specific arguments as defined by the GHCI) is used by the guest
to request services from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction).

__tdvmcall() - used to request services from the VMM.

__tdcall() - used to communicate with the TDX module.

Also define two additional wrappers, tdvmcall() and tdvmcall_out_r11(),
to cover common use cases of the __tdvmcall() function. Since each use
case of __tdcall() is different, such wrappers are not needed for it.

Implement __tdcall() and __tdvmcall() helper functions in assembly.
The rationale for choosing plain assembly over inline assembly is:

1. Since the number of lines of instructions (with comments) in the
__tdvmcall() implementation is over 70, using inline assembly to
implement it would make it hard to read.

2. Also, since many registers (R8-R15, R[A-D]X) are used in the
TDVMCALL/TDCALL operation, if all these registers were included in
inline assembly constraints, some older compilers might not be able
to meet this requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But this approach maximizes code reuse. The
same argument applies to the __tdvmcall() function as well.

The current implementation of __tdvmcall() includes error handling (ud2
on failure) in the assembly function instead of in a C wrapper
function. The reason for this choice is that, when adding support for
in/out instructions (refer to the patch titled "x86/tdx: Handle port I/O"
in this series), alternative_io() is used to substitute the in/out
instruction with __tdvmcall() calls. So the use of C wrappers is not
trivial in this case, because the input parameters would be in the wrong
registers and it is tricky to include proper buffer code to make this happen.

For the registers used by the TDCALL instruction, please check the TDX
GHCI specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/tdx.h | 26 +++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 22 ++++
arch/x86/kernel/tdcall.S | 200 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 36 ++++++
5 files changed, 285 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..6c3c71bb57a0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,38 @@
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdcall_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+struct tdvmcall_output {
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};

/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

void __init tdx_early_init(void);

+/* Helper function used to communicate with the TDX module */
+u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdcall_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdvmcall_output *out);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..4a9885a9a28b 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
#include <xen/interface/xen.h>
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offset for fields in tdcall_output */
+ OFFSET(TDCALL_rcx, tdcall_output, rcx);
+ OFFSET(TDCALL_rdx, tdcall_output, rdx);
+ OFFSET(TDCALL_r8, tdcall_output, r8);
+ OFFSET(TDCALL_r9, tdcall_output, r9);
+ OFFSET(TDCALL_r10, tdcall_output, r10);
+ OFFSET(TDCALL_r11, tdcall_output, r11);
+
+ /* Offset for fields in tdvmcall_output */
+ OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
+ OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
+ OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
+ OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
+ OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..81af70c2acbd
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+
+/*
+ * Expose registers R10-R15 to VMM (for bitfield info
+ * refer to TDX GHCI specification).
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK 0xfc00
+
+/*
+ * TDX guests use the TDCALL instruction to make
+ * hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdcall() - Used to communicate with the TDX module
+ *
+ * @arg1 (RDI) - TDCALL Leaf ID
+ * @arg2 (RSI) - Input parameter 1 passed to TDX module
+ * via register RCX
+ * @arg3 (RDX) - Input parameter 2 passed to TDX module
+ * via register RDX
+ * @arg4 (RCX) - Input parameter 3 passed to TDX module
+ * via register R8
+ * @arg5 (R8) - Input parameter 4 passed to TDX module
+ * via register R9
+ * @arg6 (R9) - struct tdcall_output pointer
+ *
+ * @out - Return status of tdcall via RAX.
+ *
+ * NOTE: This function should only be used for non-TDVMCALL
+ * use cases.
+ */
+SYM_FUNC_START(__tdcall)
+ FRAME_BEGIN
+
+ /* Save non-volatile GPRs that are exposed to the VMM. */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Move TDCALL Leaf ID to RAX */
+ mov %rdi, %rax
+ /* Move output pointer to R12 */
+ mov %r9, %r12
+ /* Move input param 4 to R9 */
+ mov %r8, %r9
+ /* Move input param 3 to R8 */
+ mov %rcx, %r8
+ /* Leave input param 2 in RDX */
+ /* Move input param 1 to RCX */
+ mov %rsi, %rcx
+
+ tdcall
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check for a TDCALL output struct */
+ test %r12, %r12
+ jz 1f
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDCALL_rcx(%r12)
+ movq %rdx, TDCALL_rdx(%r12)
+ movq %r8, TDCALL_r8(%r12)
+ movq %r9, TDCALL_r9(%r12)
+ movq %r10, TDCALL_r10(%r12)
+ movq %r11, TDCALL_r11(%r12)
+1:
+ /* Zero out registers exposed to the TDX Module. */
+ xor %rcx, %rcx
+ xor %rdx, %rdx
+ xor %r8d, %r8d
+ xor %r9d, %r9d
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+
+ /* Restore non-volatile GPRs that are exposed to the VMM. */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdcall)
+
+/*
+ * do_tdvmcall() - Used to communicate with the VMM.
+ *
+ * @arg1 (RDI) - TDVMCALL function, e.g. exit reason
+ * @arg2 (RSI) - Input parameter 1 passed to VMM
+ * via register R12
+ * @arg3 (RDX) - Input parameter 2 passed to VMM
+ * via register R13
+ * @arg4 (RCX) - Input parameter 3 passed to VMM
+ * via register R14
+ * @arg5 (R8) - Input parameter 4 passed to VMM
+ * via register R15
+ * @arg6 (R9) - struct tdvmcall_output pointer
+ *
+ * @out - Return status of tdvmcall(R10) via RAX.
+ *
+ */
+SYM_CODE_START_LOCAL(do_tdvmcall)
+ FRAME_BEGIN
+
+ /* Save non-volatile GPRs that are exposed to the VMM. */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
+ xor %eax, %eax
+ /* Move TDVMCALL function id (1st argument) to R11 */
+ mov %rdi, %r11
+ /* Move Input parameter 1-4 to R12-R15 */
+ mov %rsi, %r12
+ mov %rdx, %r13
+ mov %rcx, %r14
+ mov %r8, %r15
+ /* Leave tdvmcall output pointer in R9 */
+
+ /*
+ * Value of RCX is used by the TDX Module to determine which
+ * registers are exposed to VMM. Each bit in RCX represents a
+ * register id. You can find the bitmap details from TDX GHCI
+ * spec.
+ */
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * Check for TDCALL success: 0 - Successful, otherwise failed.
+ * If failed, there is an issue with TDX Module which is fatal
+ * for the guest. So panic.
+ */
+ test %rax, %rax
+ jnz 2f
+
+ /* Move TDVMCALL success/failure to RAX to return to user */
+ mov %r10, %rax
+
+ /* Check for TDVMCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check for a TDVMCALL output struct */
+ test %r9, %r9
+ jz 1f
+
+ /* Copy TDVMCALL result registers to output struct: */
+ movq %r11, TDVMCALL_r11(%r9)
+ movq %r12, TDVMCALL_r12(%r9)
+ movq %r13, TDVMCALL_r13(%r9)
+ movq %r14, TDVMCALL_r14(%r9)
+ movq %r15, TDVMCALL_r15(%r9)
+1:
+ /*
+ * Zero out registers exposed to the VMM to avoid
+ * speculative execution with VMM-controlled values.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+ xor %r12d, %r12d
+ xor %r13d, %r13d
+ xor %r14d, %r14d
+ xor %r15d, %r15d
+
+ /* Restore non-volatile GPRs that are exposed to the VMM. */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ FRAME_END
+ ret
+2:
+ ud2
+SYM_CODE_END(do_tdvmcall)
+
+/* Helper function for standard type of TDVMCALL */
+SYM_FUNC_START(__tdvmcall)
+ /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+ xor %r10, %r10
+ call do_tdvmcall
+ retq
+SYM_FUNC_END(__tdvmcall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..29c52128b9c0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,44 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2020 Intel Corporation */

+#define pr_fmt(fmt) "TDX: " fmt
+
#include <asm/tdx.h>

+/*
+ * Wrapper for use cases that check the error code and print a warning message.
+ */
+static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ u64 err;
+
+ err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output value (R11).
+ */
+static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+
+ struct tdvmcall_output out = {0};
+ u64 err;
+
+ err = __tdvmcall(fn, r12, r13, r14, r15, &out);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return out.r11;
+}
+
static inline bool cpuid_has_tdx_guest(void)
{
u32 eax, signature[3];
--
2.25.1

Subject: [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO

From: "Kirill A. Shutemov" <[email protected]>

Per the Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 2.4.2,
TDCALL[TDINFO] provides basic TD execution environment information
not provided by CPUID.

Call TDINFO during early boot so the information can be used in
subsequent system initialization.

The call reports which bit in the PFN is used to indicate that a page
is shared with the host, and the attributes of the TD, such as debug.

We don't save information about the number of CPUs, as there are no
users so far.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6c3c71bb57a0..c5a870cef0ae 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -10,6 +10,8 @@
#include <asm/cpufeature.h>
#include <linux/types.h>

+#define TDINFO 1
+
struct tdcall_output {
u64 rcx;
u64 rdx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 29c52128b9c0..b63275db1db9 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,14 @@
#define pr_fmt(fmt) "TDX: " fmt

#include <asm/tdx.h>
+#include <asm/vmx.h>
+
+#include <linux/cpu.h>
+
+static struct {
+ unsigned int gpa_width;
+ unsigned long attributes;
+} td_info __ro_after_init;

/*
* Wrapper for use case that checks for error code and print warning message.
@@ -61,6 +69,19 @@ bool is_tdx_guest(void)
}
EXPORT_SYMBOL_GPL(is_tdx_guest);

+static void tdg_get_info(void)
+{
+ u64 ret;
+ struct tdcall_output out = {0};
+
+ ret = __tdcall(TDINFO, 0, 0, 0, 0, &out);
+
+ BUG_ON(ret);
+
+ td_info.gpa_width = out.rcx & GENMASK(5, 0);
+ td_info.attributes = out.rdx;
+}
+
void __init tdx_early_init(void)
{
if (!cpuid_has_tdx_guest())
@@ -68,5 +89,7 @@ void __init tdx_early_init(void)

setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+ tdg_get_info();
+
pr_info("TDX guest is initialized\n");
}
--
2.25.1

Subject: [RFC v2 07/32] x86/traps: Add do_general_protection() helper function

The TDX guest #VE exception handler treats unsupported exceptions
as #GP. To handle the #GP, move the protection fault handler
code out of exc_general_protection() and create a new helper
function for it.

Also, since the exception handler is responsible for deciding when to
enable/disable IRQs, move the cond_local_irq_{enable,disable}() calls
out of do_general_protection().

This is a preparatory patch for adding #VE exception handler
support for TDX guests.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/traps.c | 51 ++++++++++++++++++++++-------------------
1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..213d4aa8e337 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -527,44 +527,28 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,

#define GPFSTR "general protection fault"

-DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+static void do_general_protection(struct pt_regs *regs, long error_code)
{
char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
enum kernel_gp_hint hint = GP_NO_HINT;
- struct task_struct *tsk;
+ struct task_struct *tsk = current;
unsigned long gp_addr;
int ret;

- cond_local_irq_enable(regs);
-
- if (static_cpu_has(X86_FEATURE_UMIP)) {
- if (user_mode(regs) && fixup_umip_exception(regs))
- goto exit;
- }
-
- if (v8086_mode(regs)) {
- local_irq_enable();
- handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
- local_irq_disable();
- return;
- }
-
- tsk = current;
-
if (user_mode(regs)) {
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;

if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
- goto exit;
+ return;

show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
force_sig(SIGSEGV);
- goto exit;
+ return;
}

if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
- goto exit;
+ return;

tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
@@ -576,11 +560,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
if (!preemptible() &&
kprobe_running() &&
kprobe_fault_handler(regs, X86_TRAP_GP))
- goto exit;
+ return;

ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
if (ret == NOTIFY_STOP)
- goto exit;
+ return;

if (error_code)
snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
@@ -601,8 +585,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
gp_addr = 0;

die_addr(desc, regs, error_code, gp_addr);
+}

-exit:
+DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+{
+ cond_local_irq_enable(regs);
+
+ if (static_cpu_has(X86_FEATURE_UMIP)) {
+ if (user_mode(regs) && fixup_umip_exception(regs)) {
+ cond_local_irq_disable(regs);
+ return;
+ }
+ }
+
+ if (v8086_mode(regs)) {
+ local_irq_enable();
+ handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
+ local_irq_disable();
+ return;
+ }
+
+ do_general_protection(regs, error_code);
cond_local_irq_disable(regs);
}

--
2.25.1

Subject: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

From: "Kirill A. Shutemov" <[email protected]>

The TDX module injects a #VE exception into the guest TD in cases of
disallowed instructions, disallowed MSR accesses and a subset of CPUID
leaves. The TDX module guarantees that no #VE is injected on an EPT
violation on guest physical addresses that are memory. We can still
get #VE on MMIO mappings. This avoids any problems with the "system
call gap".

Add basic infrastructure to handle #VE. If there is no handler for a
given #VE, it is an unexpected event (fault case); treat it as a
general protection fault and handle it via do_general_protection().

TDCALL[TDGETVEINFO] provides information about the #VE, such as the
exit reason.

A #VE cannot be nested before TDGETVEINFO is called; if it nests for
any reason, the TD shuts down. The TDX module guarantees that no NMIs
(or #MC or similar) can happen in this window. After TDGETVEINFO the
#VE handler can nest if needed, although we don't expect that to
happen normally.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/idtentry.h | 4 ++++
arch/x86/include/asm/tdx.h | 15 +++++++++++++
arch/x86/kernel/idt.c | 6 ++++++
arch/x86/kernel/tdx.c | 38 +++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 30 ++++++++++++++++++++++++++
5 files changed, 93 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c5a870cef0ae..1ca55d8e9963 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -11,6 +11,7 @@
#include <linux/types.h>

#define TDINFO 1
+#define TDGETVEINFO 3

struct tdcall_output {
u64 rcx;
@@ -29,6 +30,20 @@ struct tdvmcall_output {
u64 r15;
};

+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ u64 gla;
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve);
+
/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};

/*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
INTG(X86_TRAP_AC, asm_exc_alignment_check),
INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif

#ifdef CONFIG_X86_32
TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b63275db1db9..ccfcb07bfb2c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -82,6 +82,44 @@ static void tdg_get_info(void)
td_info.attributes = out.rdx;
}

+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+ u64 ret;
+ struct tdcall_output out = {0};
+
+ /*
+ * The #VE cannot be nested before TDGETVEINFO is called,
+ * if there is any reason for it to nest the TD would shut
+ * down. The TDX module guarantees that no NMIs (or #MC or
+ * similar) can happen in this window. After TDGETVEINFO
+ * the #VE handler can nest if needed, although we don't
+ * expect it to happen normally.
+ */
+
+ ret = __tdcall(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = out.r10 & UINT_MAX;
+ ve->instr_info = out.r10 >> 32;
+
+ return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve)
+{
+ /*
+ * TODO: Add handler support for various #VE exit
+ * reasons. It will be added by other patches in
+ * the series.
+ */
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EFAULT;
+}
+
void __init tdx_early_init(void)
{
if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 213d4aa8e337..64869aa88a5a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1140,6 +1141,35 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}

+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+ int ret;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+ /*
+ * Consume #VE info before re-enabling interrupts. It will be
+ * re-enabled after executing the TDGETVEINFO TDCALL.
+ */
+ ret = tdg_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ if (!ret)
+ ret = tdg_handle_virtualization_exception(regs, &ve);
+ /*
+ * If tdg_handle_virtualization_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (ret)
+ do_general_protection(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.25.1

Subject: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls

From: "Kirill A. Shutemov" <[email protected]>

KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.

[Isaku: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
arch/x86/include/asm/tdx.h | 39 ++++++++++++++++++++++++++++
arch/x86/kernel/tdcall.S | 7 +++++
arch/x86/kernel/tdx-kvm.c | 45 +++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 4 +++
5 files changed, 116 insertions(+)
create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
#include <asm/alternative.h>
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>

extern void kvmclock_init(void);

@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall0(nr);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall1(nr, p1);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall2(nr, p1, p2);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1ca55d8e9963..e0b3ed9e262c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -56,6 +56,16 @@ u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
/* Helper function used to request services from VMM */
u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
struct tdvmcall_output *out);
+u64 __tdvmcall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdvmcall_output *out);
+
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4);

#else // !CONFIG_INTEL_TDX_GUEST

@@ -66,6 +76,35 @@ static inline bool is_tdx_guest(void)

static inline void tdx_early_init(void) { };

+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+ unsigned long p2)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 81af70c2acbd..964bfd7fc682 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -11,6 +11,7 @@
* refer to TDX GHCI specification).
*/
#define TDVMCALL_EXPOSE_REGS_MASK 0xfc00
+#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */

/*
* TDX guests use the TDCALL instruction to make
@@ -198,3 +199,9 @@ SYM_FUNC_START(__tdvmcall)
call do_tdvmcall
retq
SYM_FUNC_END(__tdvmcall)
+
+SYM_FUNC_START(__tdvmcall_vendor_kvm)
+ movq $TDVMCALL_VENDOR_KVM, %r10
+ call do_tdvmcall
+ retq
+SYM_FUNC_END(__tdvmcall_vendor_kvm)
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..c4264e926712
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+
+static long tdvmcall_vendor(unsigned int fn, unsigned long r12,
+ unsigned long r13, unsigned long r14,
+ unsigned long r15)
+{
+ return __tdvmcall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return tdvmcall_vendor(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return tdvmcall_vendor(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+ return tdvmcall_vendor(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3)
+{
+ return tdvmcall_vendor(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4)
+{
+ return tdvmcall_vendor(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5169f72b6b3f..721c213d807d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,10 @@

#include <linux/cpu.h>

+#ifdef CONFIG_KVM_GUEST
+#include "tdx-kvm.c"
+#endif
+
static struct {
unsigned int gpa_width;
unsigned long attributes;
--
2.25.1

Subject: [RFC v2 09/32] x86/tdx: Add HLT support for TDX guest

From: "Kirill A. Shutemov" <[email protected]>

Per the Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 3.8,
TDVMCALL[Instruction.HLT] provides the HLT operation. Use it to
implement the halt() and safe_halt() paravirtualization calls.

The same TDVMCALL is used to handle a #VE exception due to
EXIT_REASON_HLT.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 44 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ccfcb07bfb2c..5169f72b6b3f 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -82,6 +82,27 @@ static void tdg_get_info(void)
td_info.attributes = out.rdx;
}

+static __cpuidle void tdg_halt(void)
+{
+ u64 ret;
+
+ ret = __tdvmcall(EXIT_REASON_HLT, 0, 0, 0, 0, NULL);
+
+ /* It should never fail */
+ BUG_ON(ret);
+}
+
+static __cpuidle void tdg_safe_halt(void)
+{
+ /*
+ * Enable interrupts next to the TDVMCALL to avoid
+ * performance degradation.
+ */
+ asm volatile("sti\n\t");
+
+ tdg_halt();
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -111,13 +132,19 @@ unsigned long tdg_get_ve_info(struct ve_info *ve)
int tdg_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve)
{
- /*
- * TODO: Add handler support for various #VE exit
- * reasons. It will be added by other patches in
- * the series.
- */
- pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return -EFAULT;
+ switch (ve->exit_reason) {
+ case EXIT_REASON_HLT:
+ tdg_halt();
+ break;
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EFAULT;
+ }
+
+ /* After successful #VE handling, move the IP */
+ regs->ip += ve->instr_len;
+
+ return 0;
}

void __init tdx_early_init(void)
@@ -129,5 +156,8 @@ void __init tdx_early_init(void)

tdg_get_info();

+ pv_ops.irq.safe_halt = tdg_safe_halt;
+ pv_ops.irq.halt = tdg_halt;
+
pr_info("TDX guest is initialized\n");
}
--
2.25.1

Subject: [RFC v2 14/32] x86/tdx: Handle port I/O

From: "Kirill A. Shutemov" <[email protected]>

Unroll string operations and handle port I/O through TDVMCALLs.
Also handle #VE due to I/O operations with the same TDVMCALLs.

Decompression code uses port I/O for earlyprintk. We must use
paravirt calls there too if we want to allow earlyprintk.

Decompression code cannot deal with alternatives: use branches
instead to implement the inX() and outX() helpers.

Since a call instruction is used in place of the in/out instruction,
the port argument has to be in a register; it cannot be an immediate
value as with in/out. So change the asm constraint flag from "Nd"
to "d".

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
---
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/tdcall.S | 9 ++
arch/x86/include/asm/io.h | 5 +-
arch/x86/include/asm/tdx.h | 46 ++++++++-
arch/x86/kernel/tdcall.S | 154 ++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 33 +++++++
6 files changed, 245 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index a2554621cefe..a944a2038797 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -97,6 +97,7 @@ endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o

vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..5ebb80d45ad8
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <asm/export.h>
+
+/* Do not export symbols in decompression code */
+#undef EXPORT_SYMBOL
+#define EXPORT_SYMBOL(sym)
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index ef7a686a55a9..30a3b30395ad 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -43,6 +43,7 @@
#include <asm/page.h>
#include <asm/early_ioremap.h>
#include <asm/pgtable_types.h>
+#include <asm/tdx.h>

#define build_mmio_read(name, size, type, reg, barrier) \
static inline type name(const volatile void __iomem *addr) \
@@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port) \
\
static inline void outs##bwl(int port, const void *addr, unsigned long count) \
{ \
- if (sev_key_active()) { \
+ if (sev_key_active() || is_tdx_guest()) { \
unsigned type *value = (unsigned type *)addr; \
while (count) { \
out##bwl(*value, port); \
@@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
\
static inline void ins##bwl(int port, void *addr, unsigned long count) \
{ \
- if (sev_key_active()) { \
+ if (sev_key_active() || is_tdx_guest()) { \
unsigned type *value = (unsigned type *)addr; \
while (count) { \
*value = in##bwl(port); \
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e0b3ed9e262c..b972c6531a53 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,8 @@

#define TDX_CPUID_LEAF_ID 0x21

+#ifndef __ASSEMBLY__
+
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
@@ -67,6 +69,48 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4);

+/* Decompression code doesn't know how to handle alternatives */
+#ifdef BOOT_COMPRESSED_MISC_H
+#define __out(bwl, bw) \
+do { \
+ if (is_tdx_guest()) { \
+ asm volatile("call tdg_out" #bwl : : \
+ "a"(value), "d"(port)); \
+ } else { \
+ asm volatile("out" #bwl " %" #bw "0, %w1" : : \
+ "a"(value), "Nd"(port)); \
+ } \
+} while (0)
+#define __in(bwl, bw) \
+do { \
+ if (is_tdx_guest()) { \
+ asm volatile("call tdg_in" #bwl : \
+ "=a"(value) : "d"(port)); \
+ } else { \
+ asm volatile("in" #bwl " %w1, %" #bw "0" : \
+ "=a"(value) : "Nd"(port)); \
+ } \
+} while (0)
+#else
+#define __out(bwl, bw) \
+ alternative_input("out" #bwl " %" #bw "1, %w2", \
+ "call tdg_out" #bwl, X86_FEATURE_TDX_GUEST, \
+ "a"(value), "d"(port))
+
+#define __in(bwl, bw) \
+ alternative_io("in" #bwl " %w2, %" #bw "0", \
+ "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST, \
+ "=a"(value), "d"(port))
+#endif
+
+void tdg_outb(unsigned char value, unsigned short port);
+void tdg_outw(unsigned short value, unsigned short port);
+void tdg_outl(unsigned int value, unsigned short port);
+
+unsigned char tdg_inb(unsigned short port);
+unsigned short tdg_inw(unsigned short port);
+unsigned int tdg_inl(unsigned short port);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
@@ -106,5 +150,5 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
}

#endif /* CONFIG_INTEL_TDX_GUEST */
-
+#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 964bfd7fc682..df4159bb5103 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
#include <asm/asm.h>
#include <asm/frame.h>
#include <asm/unwind_hints.h>
+#include <asm/export.h>

#include <linux/linkage.h>

@@ -12,6 +13,12 @@
*/
#define TDVMCALL_EXPOSE_REGS_MASK 0xfc00
#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
+#define EXIT_REASON_IO_INSTRUCTION 30
+/*
+ * Current size of struct tdvmcall_output is 40 bytes,
+ * but allocate double to account future changes.
+ */
+#define TDVMCALL_OUTPUT_SIZE 80

/*
* TDX guests use the TDCALL instruction to make
@@ -205,3 +212,150 @@ SYM_FUNC_START(__tdvmcall_vendor_kvm)
call do_tdvmcall
retq
SYM_FUNC_END(__tdvmcall_vendor_kvm)
+
+.macro io_save_registers
+ push %rbp
+ push %rbx
+ push %rcx
+ push %rdx
+ push %rdi
+ push %rsi
+ push %r8
+ push %r9
+ push %r10
+ push %r11
+ push %r12
+ push %r13
+ push %r14
+ push %r15
+.endm
+.macro io_restore_registers
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop %rsi
+ pop %rdi
+ pop %rdx
+ pop %rcx
+ pop %rbx
+ pop %rbp
+.endm
+
+/*
+ * tdg_out{b,w,l}() - Write given data to the specified port.
+ *
+ * @arg1 (RAX) - Value to be written (passed via R8 to do_tdvmcall()).
+ * @arg2 (RDX) - Port id (passed via RCX to do_tdvmcall()).
+ *
+ */
+SYM_FUNC_START(tdg_outb)
+ io_save_registers
+ xor %r8, %r8
+ /* Move data to R8 register */
+ mov %al, %r8b
+ /* Set data width to 1 byte */
+ mov $1, %rsi
+ jmp 1f
+
+SYM_FUNC_START(tdg_outw)
+ io_save_registers
+ xor %r8, %r8
+ /* Move data to R8 register */
+ mov %ax, %r8w
+ /* Set data width to 2 bytes */
+ mov $2, %rsi
+ jmp 1f
+
+SYM_FUNC_START(tdg_outl)
+ io_save_registers
+ xor %r8, %r8
+ /* Move data to R8 register */
+ mov %eax, %r8d
+ /* Set data width to 4 bytes */
+ mov $4, %rsi
+1:
+ /*
+ * Since io_save_registers does not save rax
+ * state, save it here so that we can preserve
+ * the caller register state.
+ */
+ push %rax
+
+ mov %rdx, %rcx
+ /* Set 1 in RDX to select out operation */
+ mov $1, %rdx
+ /* Set TDVMCALL function id in RDI */
+ mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+ /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+ xor %r10, %r10
+ /* Since we don't use tdvmcall output, set it to NULL */
+ xor %r9, %r9
+
+ call do_tdvmcall
+
+ pop %rax
+ io_restore_registers
+ ret
+SYM_FUNC_END(tdg_outb)
+SYM_FUNC_END(tdg_outw)
+SYM_FUNC_END(tdg_outl)
+EXPORT_SYMBOL(tdg_outb)
+EXPORT_SYMBOL(tdg_outw)
+EXPORT_SYMBOL(tdg_outl)
+
+/*
+ * tdg_in{b,w,l}() - Read data to the specified port.
+ *
+ * @arg1 (RDX) - Port id (passed via RCX to do_tdvmcall()).
+ *
+ * Returns data read via RAX register.
+ *
+ */
+SYM_FUNC_START(tdg_inb)
+ io_save_registers
+ /* Set data width to 1 byte */
+ mov $1, %rsi
+ jmp 1f
+
+SYM_FUNC_START(tdg_inw)
+ io_save_registers
+ /* Set data width to 2 bytes */
+ mov $2, %rsi
+ jmp 1f
+
+SYM_FUNC_START(tdg_inl)
+ io_save_registers
+ /* Set data width to 4 bytes */
+ mov $4, %rsi
+1:
+ mov %rdx, %rcx
+ /* Set 0 in RDX to select in operation */
+ mov $0, %rdx
+ /* Set TDVMCALL function id in RDI */
+ mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+ /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+ xor %r10, %r10
+ /* Allocate memory in stack for Output */
+ subq $TDVMCALL_OUTPUT_SIZE, %rsp
+ /* Move tdvmcall_output pointer to R9 */
+ movq %rsp, %r9
+
+ call do_tdvmcall
+
+ /* Move data read from port to RAX */
+ mov TDVMCALL_r11(%r9), %eax
+ /* Free allocated memory */
+ addq $TDVMCALL_OUTPUT_SIZE, %rsp
+ io_restore_registers
+ ret
+SYM_FUNC_END(tdg_inb)
+SYM_FUNC_END(tdg_inw)
+SYM_FUNC_END(tdg_inl)
+EXPORT_SYMBOL(tdg_inb)
+EXPORT_SYMBOL(tdg_inw)
+EXPORT_SYMBOL(tdg_inl)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e42e260df245..ec61f2f06c98 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -189,6 +189,36 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
regs->dx = out.r15;
}

+static void tdg_out(int size, int port, unsigned int value)
+{
+ tdvmcall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
+}
+
+static unsigned int tdg_in(int size, int port)
+{
+ return tdvmcall_out_r11(EXIT_REASON_IO_INSTRUCTION, size, 0, port, 0);
+}
+
+static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+ bool string = exit_qual & 16;
+ int out, size, port;
+
+ /* I/O strings ops are unrolled at build time. */
+ BUG_ON(string);
+
+ out = (exit_qual & 8) ? 0 : 1;
+ size = (exit_qual & 7) + 1;
+ port = exit_qual >> 16;
+
+ if (out) {
+ tdg_out(size, port, regs->ax);
+ } else {
+ regs->ax &= ~GENMASK(8 * size - 1, 0);
+ regs->ax |= tdg_in(size, port) & GENMASK(8 * size - 1, 0);
+ }
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -238,6 +268,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_CPUID:
tdg_handle_cpuid(regs);
break;
+ case EXIT_REASON_IO_INSTRUCTION:
+ tdg_handle_io(regs, ve->exit_qual);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1

Subject: [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest

From: "Kirill A. Shutemov" <[email protected]>

Operations on context-switched MSRs can be run natively. The rest of
the MSRs should be handled through TDVMCALLs.

TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide
the MSR operations.

You can find RDMSR and WRMSR details in the Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 3.10, 3.11.

Also, since the CSTAR MSR is not used by the SYSCALL instruction on
Intel CPUs, ignore accesses to it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 85 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 721c213d807d..5b16707b3577 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -107,6 +107,73 @@ static __cpuidle void tdg_safe_halt(void)
tdg_halt();
}

+static bool tdg_is_context_switched_msr(unsigned int msr)
+{
+ /* XXX: Update the list of context-switched MSRs */
+
+ switch (msr) {
+ case MSR_EFER:
+ case MSR_IA32_CR_PAT:
+ case MSR_FS_BASE:
+ case MSR_GS_BASE:
+ case MSR_KERNEL_GS_BASE:
+ case MSR_IA32_SYSENTER_CS:
+ case MSR_IA32_SYSENTER_EIP:
+ case MSR_IA32_SYSENTER_ESP:
+ case MSR_STAR:
+ case MSR_LSTAR:
+ case MSR_SYSCALL_MASK:
+ case MSR_IA32_XSS:
+ case MSR_TSC_AUX:
+ case MSR_IA32_BNDCFGS:
+ return true;
+ }
+ return false;
+}
+
+static u64 tdg_read_msr_safe(unsigned int msr, int *err)
+{
+ u64 ret;
+ struct tdvmcall_output out = {0};
+
+ WARN_ON_ONCE(tdg_is_context_switched_msr(msr));
+
+ /*
+ * Since CSTAR MSR is not used by Intel CPUs as SYSCALL
+ * instruction, just ignore it. Even raising TDVMCALL
+ * will lead to same result.
+ */
+ if (msr == MSR_CSTAR)
+ return 0;
+
+ ret = __tdvmcall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
+
+ *err = (ret) ? -EIO : 0;
+
+ return out.r11;
+}
+
+static int tdg_write_msr_safe(unsigned int msr, unsigned int low,
+ unsigned int high)
+{
+ u64 ret;
+
+ WARN_ON_ONCE(tdg_is_context_switched_msr(msr));
+
+ /*
+ * Since CSTAR MSR is not used by Intel CPUs as SYSCALL
+ * instruction, just ignore it. Even raising TDVMCALL
+ * will lead to same result.
+ */
+ if (msr == MSR_CSTAR)
+ return 0;
+
+ ret = __tdvmcall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
+ 0, 0, NULL);
+
+ return ret ? -EIO : 0;
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -136,19 +203,33 @@ unsigned long tdg_get_ve_info(struct ve_info *ve)
int tdg_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve)
{
+ unsigned long val;
+ int ret = 0;
+
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
tdg_halt();
break;
+ case EXIT_REASON_MSR_READ:
+ val = tdg_read_msr_safe(regs->cx, (unsigned int *)&ret);
+ if (!ret) {
+ regs->ax = val & UINT_MAX;
+ regs->dx = val >> 32;
+ }
+ break;
+ case EXIT_REASON_MSR_WRITE:
+ ret = tdg_write_msr_safe(regs->cx, regs->ax, regs->dx);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
}

/* After successful #VE handling, move the IP */
- regs->ip += ve->instr_len;
+ if (!ret)
+ regs->ip += ve->instr_len;

- return 0;
+ return ret;
}

void __init tdx_early_init(void)
--
2.25.1

Subject: [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO

From: "Kirill A. Shutemov" <[email protected]>

Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
exit reason.

For now we only handle the subset of instructions that the kernel uses
for MMIO operations. User-space access triggers SIGBUS.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 100 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ec61f2f06c98..3fe617978fc4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,8 @@

#include <asm/tdx.h>
#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */

#include <linux/cpu.h>

@@ -219,6 +221,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
}
}

+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+ unsigned long val)
+{
+ return tdvmcall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
+ write, addr, val);
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+ static const int regoff[] = {
+ offsetof(struct pt_regs, ax),
+ offsetof(struct pt_regs, cx),
+ offsetof(struct pt_regs, dx),
+ offsetof(struct pt_regs, bx),
+ offsetof(struct pt_regs, sp),
+ offsetof(struct pt_regs, bp),
+ offsetof(struct pt_regs, si),
+ offsetof(struct pt_regs, di),
+ offsetof(struct pt_regs, r8),
+ offsetof(struct pt_regs, r9),
+ offsetof(struct pt_regs, r10),
+ offsetof(struct pt_regs, r11),
+ offsetof(struct pt_regs, r12),
+ offsetof(struct pt_regs, r13),
+ offsetof(struct pt_regs, r14),
+ offsetof(struct pt_regs, r15),
+ };
+ int regno;
+
+ regno = X86_MODRM_REG(insn->modrm.value);
+ if (X86_REX_R(insn->rex_prefix.value))
+ regno += 8;
+
+ return (void *)regs + regoff[regno];
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ int size;
+ bool write;
+ unsigned long *reg;
+ struct insn insn;
+ unsigned long val = 0;
+
+ /*
+ * User mode would mean the kernel exposed a device directly
+ * to ring3, which shouldn't happen except for things like
+ * DPDK.
+ */
+ if (user_mode(regs)) {
+ pr_err("Unexpected user-mode MMIO access.\n");
+ force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+ return 0;
+ }
+
+ kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+ insn_get_length(&insn);
+ insn_get_opcode(&insn);
+
+ write = ve->exit_qual & 0x2;
+
+ size = insn.opnd_bytes;
+ switch (insn.opcode.bytes[0]) {
+ /* MOV r/m8 r8 */
+ case 0x88:
+ /* MOV r8 r/m8 */
+ case 0x8A:
+ /* MOV r/m8 imm8 */
+ case 0xC6:
+ size = 1;
+ break;
+ }
+
+ if (inat_has_immediate(insn.attr)) {
+ BUG_ON(!write);
+ val = insn.immediate.value;
+ tdg_mmio(size, write, ve->gpa, val);
+ return insn.length;
+ }
+
+ BUG_ON(!inat_has_modrm(insn.attr));
+
+ reg = get_reg_ptr(regs, &insn);
+
+ if (write) {
+ memcpy(&val, reg, size);
+ tdg_mmio(size, write, ve->gpa, val);
+ } else {
+ val = tdg_mmio(size, write, ve->gpa, val);
+ memset(reg, 0, size);
+ memcpy(reg, &val, size);
+ }
+ return insn.length;
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -271,6 +368,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_IO_INSTRUCTION:
tdg_handle_io(regs, ve->exit_qual);
break;
+ case EXIT_REASON_EPT_VIOLATION:
+ ve->instr_len = tdg_handle_mmio(regs, ve);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1
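
get_reg_ptr() above selects a pt_regs slot from ModRM.reg plus REX.R. The register-number computation can be modeled on its own; the bit positions used here (ModRM.reg in bits 5:3, REX.R in bit 2 of the REX prefix) follow the x86 instruction format that the kernel's X86_MODRM_REG()/X86_REX_R() helpers encode:

```c
#include <stdint.h>

/* ModRM.reg is bits 5:3 of the ModRM byte; REX.R (bit 2 of the REX
 * prefix, 0100WRXB) extends it to select r8-r15. This mirrors what
 * get_reg_ptr() computes before indexing its pt_regs offset table. */
static int modrm_regno(uint8_t modrm, uint8_t rex)
{
	int regno = (modrm >> 3) & 0x7;	/* X86_MODRM_REG() */

	if (rex & 0x4)			/* X86_REX_R() */
		regno += 8;
	return regno;
}
```

For example, ModRM byte 0x18 encodes reg=3 (rbx in the offset table), and the same byte with REX.R set selects r11.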

Subject: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

The list of unsupported instructions can be found in the Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2, and in the Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.
   
The TDX module (SEAM) already disables MWAIT/MONITOR support for
TD guests, so the CPUID feature flags for these instructions are
in the disabled state and the kernel should never use them.

If a TD guest still executes these instructions despite the above
preventive measures, warn about it from the #VE handler. Since the
WBINVD instruction is related to memory writeback and cache flushes,
it is mainly used in the context of I/O devices. Since TDX 1.0 does
not support non-virtualized I/O devices, skipping it should not cause
any fatal issue, but use WARN() to let users know about its usage.
For MWAIT/MONITOR, which are simply unsupported, also use WARN() to
report the unsupported usage.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/tdx.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3fe617978fc4..294dda5bf3f6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -371,6 +371,21 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+ case EXIT_REASON_WBINVD:
+ /*
+ * WBINVD is not supported inside TDX guests. All in-
+ * kernel uses should have been disabled.
+ */
+ WARN_ONCE(1, "TD Guest used unsupported WBINVD instruction\n");
+ break;
+ case EXIT_REASON_MONITOR_INSTRUCTION:
+ case EXIT_REASON_MWAIT_INSTRUCTION:
+ /*
+ * Something in the kernel used MONITOR or MWAIT despite
+ * X86_FEATURE_MWAIT being cleared for TDX guests.
+ */
+ WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1

Subject: [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure

ACPICA commit f1ee04207a212f6c519441e7e25397649ebc4cea

Add the Multiprocessor Wakeup Mailbox Structure definition. It is
needed to parse the MADT wake table.

Link: https://github.com/acpica/acpica/commit/f1ee0420
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
include/acpi/actbl2.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index b2362600b9ff..7dce422f6119 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -733,6 +733,20 @@ struct acpi_madt_multiproc_wakeup {
u64 base_address;
};

+#define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE 2032
+#define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE 2048
+
+struct acpi_madt_multiproc_wakeup_mailbox {
+ u16 command;
+ u16 reserved; /* reserved - must be zero */
+ u32 apic_id;
+ u64 wakeup_vector;
+ u8 reserved_os[ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE]; /* reserved for OS use */
+ u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE]; /* reserved for firmware use */
+};
+
+#define ACPI_MP_WAKE_COMMAND_WAKEUP 1
+
/*
* Common flags fields for MADT subtables
*/
--
2.25.1
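
The OS- and firmware-reserved sizes above imply a 4 KiB mailbox (16 header bytes + 2032 + 2048). A userspace mirror of the structure (field names copied from the patch, purely illustrative) lets the layout be verified with offsetof():

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct acpi_madt_multiproc_wakeup_mailbox, to
 * check the field layout adds up to one 4 KiB page: 2-byte command,
 * 2-byte reserved, 4-byte apic_id, 8-byte wakeup_vector, then the
 * OS- and firmware-reserved regions. */
#define MB_OS_SIZE	2032
#define MB_FW_SIZE	2048

struct mp_wakeup_mailbox {
	uint16_t command;
	uint16_t reserved;
	uint32_t apic_id;
	uint64_t wakeup_vector;
	uint8_t reserved_os[MB_OS_SIZE];
	uint8_t reserved_firmware[MB_FW_SIZE];
};
```

With natural alignment there is no padding: wakeup_vector lands at offset 8, the firmware region at 2048, and the whole structure is exactly 4096 bytes.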

Subject: [RFC v2 19/32] ACPI/table: Print MADT Wake table information

When the MADT is parsed, print the MADT wake table information as a
debug message. This is useful for debugging CPU boot issues related
to the MADT wake table.

Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
drivers/acpi/tables.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 9d581045acff..206df4ad8b2b 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -207,6 +207,17 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
}
break;

+ case ACPI_MADT_TYPE_MULTIPROC_WAKEUP:
+ {
+ struct acpi_madt_multiproc_wakeup *p;
+
+ p = (struct acpi_madt_multiproc_wakeup *) header;
+
+ pr_debug("MP Wake (Mailbox version[%d] base_address[%llx])\n",
+ p->mailbox_version, p->base_address);
+ }
+ break;
+
default:
pr_warn("Found unsupported MADT entry (type = 0x%x)\n",
header->type);
--
2.25.1

Subject: [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature

Add CPU feature detection for Trusted Domain Extensions support.
The TDX feature adds capabilities to keep guest register state and
memory isolated from the hypervisor.

For TDX guest platforms, executing CPUID(0x21, 0) returns the
following values in EAX, EBX, ECX and EDX.

EAX: Maximum sub-leaf number: 0
EBX/EDX/ECX: Vendor string:

EBX = "Inte"
EDX = "lTDX"
ECX = " "

So when the above condition is true, set the X86_FEATURE_TDX_GUEST
feature cap bit.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/tdx.h | 20 ++++++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/head64.c | 3 +++
arch/x86/kernel/tdx.c | 30 ++++++++++++++++++++++++++++++
5 files changed, 55 insertions(+)
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/kernel/tdx.c

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cc96e26d69f7..d883df70c27b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
#define X86_FEATURE_EPT_AD ( 8*32+17) /* Intel Extended Page Table access-dirty bit */
#define X86_FEATURE_VMCALL ( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
#define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
+#define X86_FEATURE_TDX_GUEST ( 8*32+20) /* Trusted Domain Extensions Guest */

/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..679500e807f3
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#define TDX_CPUID_LEAF_ID 0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#include <asm/cpufeature.h>
+
+void __init tdx_early_init(void);
+
+#else // !CONFIG_INTEL_TDX_GUEST
+
+static inline void tdx_early_init(void) { }
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2ddf08351f0b..ea111bf50691 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,6 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e9beb77cafd..75f2401cb5db 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
#include <asm/extable.h>
#include <asm/trapnr.h>
#include <asm/sev-es.h>
+#include <asm/tdx.h>

/*
* Manage page tables very early on.
@@ -491,6 +492,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

kasan_early_init();

+ tdx_early_init();
+
idt_setup_early_handler();

copy_bootdata(__va(real_mode_data));
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..f927e36769d5
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static inline bool cpuid_has_tdx_guest(void)
+{
+ u32 eax, signature[3];
+
+ if (cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+ return false;
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &signature[0],
+ &signature[1], &signature[2]);
+
+ if (memcmp("IntelTDX ", signature, 12))
+ return false;
+
+ return true;
+}
+
+void __init tdx_early_init(void)
+{
+ if (!cpuid_has_tdx_guest())
+ return;
+
+ setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+ pr_info("TDX guest is initialized\n");
+}
--
2.25.1
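
cpuid_has_tdx_guest() above compares the EBX/EDX/ECX words against the vendor string. A userspace model of that comparison (assuming a little-endian host and the 12-byte string "IntelTDX    " with four trailing spaces, as the commit message describes):

```c
#include <stdint.h>
#include <string.h>

/* CPUID(0x21, 0) returns the vendor string across EBX, EDX and ECX,
 * in that order: "Inte" + "lTDX" + "    ". Pack the three words the
 * way cpuid_has_tdx_guest() fills signature[] and memcmp() them
 * against the expected 12-byte string (little-endian host assumed). */
static int is_tdx_signature(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
	uint32_t sig[3] = { ebx, edx, ecx };

	return memcmp("IntelTDX    ", sig, 12) == 0;
}
```

On a little-endian host, EBX = 0x65746E49 lays down the bytes 'I','n','t','e', and so on for the other two words.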

Subject: [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support

As per ACPI specification r6.4, sec 5.2.12.19, a new substructure,
the multiprocessor wake-up structure, is added to the ACPI Multiple
APIC Description Table (MADT) to describe the mailbox. If the
platform firmware produces the multiprocessor wake-up structure,
the OS may use this new mailbox-based mechanism to wake up the APs.

Add ACPI MADT wake table parsing support for the x86 platform and,
if the MADT wake table is present, update apic->wakeup_secondary_cpu
with a new API which uses the MADT wake mailbox to wake up the APs.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/apic.h | 3 ++
arch/x86/kernel/acpi/boot.c | 79 +++++++++++++++++++++++++++++++++++++
arch/x86/kernel/apic/apic.c | 8 ++++
3 files changed, 90 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 412b51e059c8..3e94e1f402ea 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -487,6 +487,9 @@ static inline unsigned int read_apic_id(void)
return apic->get_apic_id(reg);
}

+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+
extern int default_apic_id_valid(u32 apicid);
extern int default_acpi_madt_oem_check(char *, char *);
extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 14cd3186dc77..fce2aa7d718f 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,9 @@ int acpi_fix_pin2_polarity __initdata;
static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
#endif

+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+static u64 acpi_mp_wake_mailbox_paddr;
+
#ifdef CONFIG_X86_IO_APIC
/*
* Locks related to IOAPIC hotplug
@@ -329,6 +332,52 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
return 0;
}

+static void acpi_mp_wake_mailbox_init(void)
+{
+ if (acpi_mp_wake_mailbox)
+ return;
+
+ acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+ sizeof(*acpi_mp_wake_mailbox), MEMREMAP_WB);
+}
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+ u8 timeout = 0xFF;
+
+ acpi_mp_wake_mailbox_init();
+
+ if (!acpi_mp_wake_mailbox)
+ return -EINVAL;
+
+ /*
+ * Mailbox memory is shared between firmware and OS. Firmware listens
+ * on the mailbox command address and, once it receives the wakeup
+ * command, boots the CPU associated with the given apicid. The
+ * apic_id and wakeup_vector fields therefore have to be set before
+ * the wakeup command is written. Use WRITE_ONCE to tell the compiler
+ * to preserve the order of the writes.
+ */
+ WRITE_ONCE(acpi_mp_wake_mailbox->apic_id, apicid);
+ WRITE_ONCE(acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+ WRITE_ONCE(acpi_mp_wake_mailbox->command, ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+ /*
+ * After writing the wakeup command, wait for up to 0xFF iterations
+ * for the firmware to reset the command field back to zero, which
+ * indicates successful reception of the command.
+ * NOTE: the timeout value of 255 was chosen based on our experiments.
+ *
+ * XXX: Change the timeout once the ACPI specification comes up with
+ * a standard maximum timeout value.
+ */
+ while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+ cpu_relax();
+
+ /* If timed out (timeout == 0), return error */
+ return timeout ? 0 : -EIO;
+}
+
#endif /*CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1086,6 +1135,30 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
}
+
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_madt_multiproc_wakeup *mp_wake;
+
+ if (acpi_mp_wake_mailbox)
+ return -EINVAL;
+
+ if (!IS_ENABLED(CONFIG_SMP))
+ return -ENODEV;
+
+ mp_wake = (struct acpi_madt_multiproc_wakeup *) header;
+ if (BAD_MADT_ENTRY(mp_wake, end))
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+ acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+ return 0;
+}
#endif /* CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1284,6 +1357,12 @@ static void __init acpi_process_madt(void)

smp_found_config = 1;
}
+
+ /*
+ * Parse MADT MP Wake entry.
+ */
+ acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+ acpi_parse_mp_wake, 1);
}
if (error == -EINVAL) {
/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 4f26700f314d..f1b90a4b89e8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2554,6 +2554,14 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
}
EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);

+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+ struct apic **drv;
+
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+ (*drv)->wakeup_secondary_cpu = handler;
+}
+
/*
* Override the generic EOI implementation with an optimized version.
* Only called during early boot when only one CPU is active and with
--
2.25.1
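
The wakeup handshake in acpi_wakeup_cpu() can be modeled in userspace. The firmware side here is a hypothetical callback standing in for the hardware, and the pre-decrement keeps timeout == 0 as the unambiguous timed-out state:

```c
#include <errno.h>
#include <stdint.h>

/* Model of the MADT wake handshake: write apic_id and wakeup_vector
 * before the command, then poll until "firmware" clears the command
 * field or the 255-iteration budget runs out. */
struct mailbox {
	uint16_t command;
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

#define MB_CMD_WAKEUP 1

static int wakeup_cpu(struct mailbox *mb, uint32_t apicid, uint64_t vector,
		      void (*firmware)(struct mailbox *))
{
	uint8_t timeout = 0xFF;

	mb->apic_id = apicid;
	mb->wakeup_vector = vector;
	mb->command = MB_CMD_WAKEUP;	/* must be the last store */

	while (mb->command && --timeout)
		firmware(mb);		/* stand-in for cpu_relax() + HW */

	return timeout ? 0 : -EIO;
}

/* Firmware that acknowledges immediately. */
static void fw_ack(struct mailbox *mb)
{
	mb->command = 0;
}

/* Firmware that never responds, to exercise the timeout path. */
static void fw_dead(struct mailbox *mb)
{
	(void)mb;
}
```

In the kernel the stores would go through WRITE_ONCE/READ_ONCE on memremap()ed shared memory; the plain assignments here are only safe because the model is single-threaded.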

Subject: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

From: "Kirill A. Shutemov" <[email protected]>

tdg_shared_mask() returns the mask that has to be set in a page
table entry to make the page shared with the VMM.

Also, note that we cannot combine the shared-mapping configuration
for AMD SME and Intel TDX guest platforms into a common function.
SME has to do it very early, in __startup_64(), as it sets the bit
on all memory except what is used for communication. TDX can
postpone it, as no shared mappings are needed in very early boot.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tdx.h | 6 ++++++
arch/x86/kernel/tdx.c | 9 +++++++++
3 files changed, 16 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67f99bf27729..5f92e8205de2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
select PARAVIRT_XL
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
+ select X86_MEM_ENCRYPT_COMMON
help
Provide support for running in a trusted domain on Intel processors
equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b972c6531a53..dc80cf7f7d08 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
unsigned short tdg_inw(unsigned short port);
unsigned int tdg_inl(unsigned short port);

+extern phys_addr_t tdg_shared_mask(void);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
@@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
return -ENODEV;
}

+static inline phys_addr_t tdg_shared_mask(void)
+{
+ return 0;
+}
#endif /* CONFIG_INTEL_TDX_GUEST */
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1f1bb98e1d38..7e391cd7aa2b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -76,6 +76,12 @@ bool is_tdx_guest(void)
}
EXPORT_SYMBOL_GPL(is_tdx_guest);

+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdg_shared_mask(void)
+{
+ return 1ULL << (td_info.gpa_width - 1);
+}
+
static void tdg_get_info(void)
{
u64 ret;
@@ -87,6 +93,9 @@ static void tdg_get_info(void)

td_info.gpa_width = out.rcx & GENMASK(5, 0);
td_info.attributes = out.rdx;
+
+ /* Exclude Shared bit from the __PHYSICAL_MASK */
+ physical_mask &= ~tdg_shared_mask();
}

static __cpuidle void tdg_halt(void)
--
2.25.1
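
tdg_shared_mask() above derives the Shared bit from the gpa_width reported by TDINFO. A minimal userspace sketch of the mask computation and its removal from the physical mask:

```c
#include <stdint.h>

/* The TDX Shared bit is the topmost guest-physical-address bit, so
 * for a given GPA width the mask is 1 << (gpa_width - 1). Removing
 * it mirrors what tdg_get_info() does to physical_mask. */
static uint64_t tdg_shared_mask(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

static uint64_t strip_shared_bit(uint64_t physical_mask,
				 unsigned int gpa_width)
{
	return physical_mask & ~tdg_shared_mask(gpa_width);
}
```

For a typical gpa_width of 52, the Shared bit is bit 51 and __PHYSICAL_MASK loses exactly that one bit.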

Subject: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

From: "Kirill A. Shutemov" <[email protected]>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, since DMA buffers are backed by normal memory. The DMA code
makes buffers shared with set_memory_decrypted().

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 +++
arch/x86/mm/ioremap.c | 8 +++++---
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..734e775605c0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,9 @@
#define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
#define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))

+/* Make the page accessible to the VMM */
+#define pgprot_tdg_shared(prot) __pgprot(pgprot_val(prot) | tdg_shared_mask())
+
#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
#include <asm/fpu/xstate.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..c0dac02f5b3f 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
}

/*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
*/
static unsigned int __ioremap_check_encrypted(struct resource *res)
{
- if (!sev_active())
+ if (!sev_active() && !is_tdx_guest())
return 0;

switch (res->desc) {
@@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+ else if (is_tdx_guest())
+ prot = pgprot_tdg_shared(prot);

switch (pcm) {
case _PAGE_CACHE_MODE_UC:
--
2.25.1

Subject: [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode

From: Sean Christopherson <[email protected]>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode. For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/realmode.h | 1 +
arch/x86/kernel/smpboot.c | 5 +++
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 49 +++++++++++++++++++++++-
arch/x86/realmode/rm/trampoline_common.S | 5 ++-
5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..5066c8b35e7c 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
#endif
#ifdef CONFIG_X86_64
+ u32 trampoline_start64;
u32 trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..27d8491d753a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1036,6 +1036,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
unsigned long boot_error = 0;
unsigned long timeout;

+#ifdef CONFIG_X86_64
+ if (is_tdx_guest())
+ start_ip = real_mode_header->trampoline_start64;
+#endif
+
idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
initial_code = (unsigned long)start_secondary;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long pa_sev_es_trampoline_start
#endif
#ifdef CONFIG_X86_64
+ .long pa_trampoline_start64
.long pa_trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3

# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr

+.Ldone_efer:
# Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+ movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

/*
@@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
ljmpl $__KERNEL_CS, $pa_startup_64
SYM_CODE_END(startup_32)

+SYM_CODE_START(pa_trampoline_compat)
+ /*
+ * In compatibility mode. Prep ESP and DX for startup_32, then disable
+ * paging and complete the switch to legacy 32-bit mode.
+ */
+ movl $rm_stack_end, %esp
+ movw $__KERNEL_DS, %dx
+
+ movl $(X86_CR0_NE | X86_CR0_PE), %eax
+ movl %eax, %cr0
+ ljmpl $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
.section ".text64","ax"
.code64
.balign 4
@@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
jmpq *tr_start(%rip)
SYM_CODE_END(startup_64)

+SYM_CODE_START(trampoline_start64)
+ /*
+ * APs start here on a direct transfer from 64-bit BIOS with identity
+ * mapped page tables. Load the kernel's GDT in order to gear down to
+ * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+ * segment registers. Load the zero IDT so any fault triggers a
+ * shutdown instead of jumping back into BIOS.
+ */
+ lidt tr_idt(%rip)
+ lgdt tr_gdt64(%rip)
+
+ ljmpl *tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
@@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
.quad 0x00cf93000000ffff # __KERNEL_DS
SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)

+SYM_DATA_START(tr_gdt64)
+ .short tr_gdt_end - tr_gdt - 1 # gdt limit
+ .long pa_tr_gdt
+ .long 0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+ .long pa_trampoline_compat
+ .short __KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
.bss
.balign PAGE_SIZE
SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..506d5897112a 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
.section ".rodata","a"
.balign 16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_START_LOCAL(tr_idt)
+ .short 0
+ .quad 0
+SYM_DATA_END(tr_idt)
--
2.25.1

Subject: [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms

From: Sean Christopherson <[email protected]>

Avoid operations which will inject a #VE during compressed
boot, which is obviously fatal for TDX platforms.

Details:

1. TDX module injects #VE if a TDX guest attempts to write
EFER. So skip the WRMSR to set EFER.LME=1 if it's already
set. TDX also forces EFER.LME=1, i.e. the branch will always
be taken and thus the #VE avoided.

2. The TDX module also injects a #VE if the guest attempts to clear
CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
boot. Setting CR0.NE should be a nop on all CPUs that support
64-bit mode.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 5 +++--
arch/x86/boot/compressed/pgtable.h | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..37c2f37d4a0d 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx

/* Enable PAE and LA57 (if required) paging modes */
@@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
pushl %eax

/* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

--
2.25.1
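
The btsl/jc sequence above skips the WRMSR when EFER.LME is already set, which is exactly the TDX case. A C model of that control flow (BTS sets CF to the bit's old value before setting it):

```c
#include <stdint.h>

#define EFER_LME (1ULL << 8)

/* Model of the trampoline's btsl/jc: BTS copies the old LME bit into
 * CF and then sets the bit; jc then skips the WRMSR when LME was
 * already set (TDX forces it to 1). Returns 1 if a WRMSR would be
 * issued, 0 if it is skipped. */
static int efer_lme_needs_wrmsr(uint64_t *efer)
{
	int was_set = !!(*efer & EFER_LME);	/* CF after btsl */

	*efer |= EFER_LME;			/* bts side effect */
	return !was_set;			/* jc 1f skips the wrmsr */
}
```

Either way EFER.LME ends up set in the register image; only the MSR write, and hence the #VE on TDX, is avoided.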

Subject: [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process

From: Sean Christopherson <[email protected]>

Skip writing EFER during secondary_startup_64() if the current value is
also the desired value. This avoids a #VE when running as a TDX guest,
as the TDX-Module does not allow writes to EFER (even when writing the
current, fixed value).

Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
when running as a TDX guest. The TDX-Module (effectively part of the
hypervisor) requires CR4.MCE to be set at all times and injects a #VE
if the guest attempts to clear CR4.MCE.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 5 ++++-
arch/x86/kernel/head_64.S | 13 +++++++++++--
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 37c2f37d4a0d..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
popl %ecx

/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ movl %cr4, %eax
+ /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
+ andl $X86_CR4_MCE, %eax
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:

/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ movq %cr4, %rcx
+ /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
+ andl $X86_CR4_MCE, %ecx
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */

+ /* Skip the WRMSR if the current value matches the desired value. */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
--
2.25.1
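
The CR4 computation above preserves only the current MCE bit (clearing it would #VE in a TDX guest) before OR-ing in the bits boot needs. A small model, using the architectural CR4 bit positions:

```c
#include <stdint.h>

#define X86_CR4_PAE	(1u << 5)
#define X86_CR4_MCE	(1u << 6)
#define X86_CR4_PGE	(1u << 7)
#define X86_CR4_LA57	(1u << 12)

/* Model of the CR4 setup in secondary_startup_64(): mask the current
 * value down to MCE alone, then OR in PAE and PGE, plus LA57 when
 * 5-level paging is enabled. Every other stale bit is cleared, but
 * MCE is never cleared. */
static uint32_t boot_cr4(uint32_t cur_cr4, int la57)
{
	uint32_t cr4 = cur_cr4 & X86_CR4_MCE;

	cr4 |= X86_CR4_PAE | X86_CR4_PGE;
	if (la57)
		cr4 |= X86_CR4_LA57;
	return cr4;
}
```

The compressed-boot trampoline does the same dance with PAE and optional LA57 instead of PAE|PGE.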

Subject: [RFC v2 30/32] x86/tdx: Make DMA pages shared

From: "Kirill A. Shutemov" <[email protected]>

Make force_dma_unencrypted() return true for TDX to get DMA pages mapped
as shared.

__set_memory_enc_dec() is now aware about TDX and sets Shared bit
accordingly following with relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private. If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
(or on the first access to the private GPA), in which case the TDX module
will hold the page in a non-present "pending" state until it is explicitly
accepted.

BUG() if TDACCEPTPAGE fails (except in the above case), as the guest is
completely hosed if it can't access memory.

Tested-by: Kai Huang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/tdx.h | 3 ++
arch/x86/kernel/tdx.c | 26 ++++++++++++++++-
arch/x86/mm/mem_encrypt_common.c | 4 +--
arch/x86/mm/pat/set_memory.c | 48 ++++++++++++++++++++++++++------
4 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4789798d7737..2794bf71e45c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -19,6 +19,9 @@ enum tdx_map_type {

#define TDINFO 1
#define TDGETVEINFO 3
+#define TDACCEPTPAGE 6
+
+#define TDX_PAGE_ALREADY_ACCEPTED 0x8000000000000001

struct tdcall_output {
u64 rcx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 074136473011..44dd12c693d0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,7 +100,8 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

-int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
{
u64 ret;

@@ -111,6 +112,29 @@ int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
return ret ? -EIO : 0;
}

+static void tdg_accept_page(phys_addr_t gpa)
+{
+ u64 ret;
+
+ ret = __tdcall(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+ BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+ int ret, i;
+
+ ret = __tdg_map_gpa(gpa, numpages, map_type);
+ if (ret || map_type == TDX_MAP_SHARED)
+ return ret;
+
+ for (i = 0; i < numpages; i++)
+ tdg_accept_page(gpa + i*PAGE_SIZE);
+
+ return 0;
+}
+
static __cpuidle void tdg_halt(void)
{
u64 ret;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..b6d93b0c5dcf 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -15,9 +15,9 @@
bool force_dma_unencrypted(struct device *dev)
{
/*
- * For SEV, all DMA must be to unencrypted/shared addresses.
+ * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
*/
- if (sev_active())
+ if (sev_active() || is_tdx_guest())
return true;

/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..ea78c7907847 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
#include <asm/proto.h>
#include <asm/memtype.h>
#include <asm/set_memory.h>
+#include <asm/tdx.h>

#include "../mm_internal.h"

@@ -1972,13 +1973,15 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
{
+ pgprot_t mem_protected_bits, mem_plain_bits;
struct cpa_data cpa;
+ enum tdx_map_type map_type;
int ret;

- /* Nothing to do if memory encryption is not active */
- if (!mem_encrypt_active())
+ /* Nothing to do if memory encryption and TDX are not active */
+ if (!mem_encrypt_active() && !is_tdx_guest())
return 0;

/* Should not be working on unaligned addresses */
@@ -1988,8 +1991,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
- cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
- cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+ if (is_tdx_guest()) {
+ mem_protected_bits = __pgprot(0);
+ mem_plain_bits = __pgprot(tdg_shared_mask());
+ } else {
+ mem_protected_bits = __pgprot(_PAGE_ENC);
+ mem_plain_bits = __pgprot(0);
+ }
+
+ if (protect) {
+ cpa.mask_set = mem_protected_bits;
+ cpa.mask_clr = mem_plain_bits;
+ map_type = TDX_MAP_PRIVATE;
+ } else {
+ cpa.mask_set = mem_plain_bits;
+ cpa.mask_clr = mem_protected_bits;
+ map_type = TDX_MAP_SHARED;
+ }
+
cpa.pgd = init_mm.pgd;

/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2018,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

/*
* Before changing the encryption attribute, we need to flush caches.
+ *
+ * For TDX we need to flush caches on private->shared. VMM is
+ * responsible for flushing on shared->private.
*/
- cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ if (is_tdx_guest()) {
+ if (map_type == TDX_MAP_SHARED)
+ cpa_flush(&cpa, 1);
+ } else {
+ cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ }

ret = __change_page_attr_set_clr(&cpa, 1);

@@ -2012,18 +2040,22 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);

+ if (!ret && is_tdx_guest()) {
+ ret = tdg_map_gpa(__pa(addr), numpages, map_type);
+ }
+
return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, true);
+ return __set_memory_protect(addr, numpages, true);
}
EXPORT_SYMBOL_GPL(set_memory_encrypted);

int set_memory_decrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, false);
+ return __set_memory_protect(addr, numpages, false);
}
EXPORT_SYMBOL_GPL(set_memory_decrypted);

--
2.25.1

Subject: [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest

From: "Kirill A. Shutemov" <[email protected]>

TDX doesn't allow DMA access to guest private memory. In order for
DMA to work properly in a TD guest, use SWIOTLB bounce buffers.

Move the AMD SEV initialization into common code and adapt it for TDX.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/io.h | 3 +-
arch/x86/kernel/pci-swiotlb.c | 2 +-
arch/x86/kernel/tdx.c | 3 ++
arch/x86/mm/mem_encrypt.c | 45 ------------------------------
arch/x86/mm/mem_encrypt_common.c | 47 ++++++++++++++++++++++++++++++++
5 files changed, 53 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 30a3b30395ad..658d9c2c2a9a 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -257,10 +257,11 @@ static inline void slow_down_io(void)

#endif

+extern struct static_key_false sev_enable_key;
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
#include <linux/jump_label.h>

-extern struct static_key_false sev_enable_key;
static inline bool sev_key_active(void)
{
return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..020e13749758 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
* buffers are allocated and used for devices that do not support
* the addressing range required for the encryption mask.
*/
- if (sme_active())
+ if (sme_active() || is_tdx_guest())
swiotlb = 1;

return swiotlb;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 44dd12c693d0..6b07e7b4a69c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,7 @@
#include <asm/vmx.h>
#include <asm/insn.h>
#include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>

#include <linux/cpu.h>

@@ -470,6 +471,8 @@ void __init tdx_early_init(void)

legacy_pic = &null_legacy_pic;

+ swiotlb_force = SWIOTLB_FORCE;
+
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
NULL, tdg_cpu_offline_prepare);

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 6f713c6a32b2..761a98904aa2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,48 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)

free_init_pages("unused decrypted", vaddr, vaddr_end);
}
-
-static void print_mem_encrypt_feature_info(void)
-{
- pr_info("AMD Memory Encryption Features active:");
-
- /* Secure Memory Encryption */
- if (sme_active()) {
- /*
- * SME is mutually exclusive with any of the SEV
- * features below.
- */
- pr_cont(" SME\n");
- return;
- }
-
- /* Secure Encrypted Virtualization */
- if (sev_active())
- pr_cont(" SEV");
-
- /* Encrypted Register State */
- if (sev_es_active())
- pr_cont(" SEV-ES");
-
- pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
- if (!sme_me_mask)
- return;
-
- /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
- swiotlb_update_mem_attributes();
-
- /*
- * With SEV, we need to unroll the rep string I/O instructions,
- * but SEV-ES supports them through the #VC handler.
- */
- if (sev_active() && !sev_es_active())
- static_branch_enable(&sev_enable_key);
-
- print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index b6d93b0c5dcf..625c15fa92f9 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
#include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
@@ -36,3 +37,49 @@ bool force_dma_unencrypted(struct device *dev)

return false;
}
+
+static void print_amd_mem_encrypt_feature_info(void)
+{
+ pr_info("AMD Memory Encryption Features active:");
+
+ /* Secure Memory Encryption */
+ if (sme_active()) {
+ /*
+ * SME is mutually exclusive with any of the SEV
+ * features below.
+ */
+ pr_cont(" SME\n");
+ return;
+ }
+
+ /* Secure Encrypted Virtualization */
+ if (sev_active())
+ pr_cont(" SEV");
+
+ /* Encrypted Register State */
+ if (sev_es_active())
+ pr_cont(" SEV-ES");
+
+ pr_cont("\n");
+}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+ if (!sme_me_mask && !is_tdx_guest())
+ return;
+
+ /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+ swiotlb_update_mem_attributes();
+
+ /*
+ * With SEV, we need to unroll the rep string I/O instructions,
+ * but SEV-ES supports them through the #VC handler.
+ */
+ if (sev_active() && !sev_es_active())
+ static_branch_enable(&sev_enable_key);
+
+ /* sme_me_mask !=0 means SME or SEV */
+ if (sme_me_mask)
+ print_amd_mem_encrypt_feature_info();
+}
--
2.25.1

Subject: [RFC v2 13/32] x86/io: Allow overriding the inX() and outX() implementation

From: "Kirill A. Shutemov" <[email protected]>

Allow overriding the implementation of the port I/O helpers. TDX
code will provide an implementation that redirects the helpers to
paravirt calls.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/io.h | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..ef7a686a55a9 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -271,18 +271,26 @@ static inline bool sev_key_active(void) { return false; }

#endif /* CONFIG_AMD_MEM_ENCRYPT */

+#ifndef __out
+#define __out(bwl, bw) \
+ asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
+#endif
+
+#ifndef __in
+#define __in(bwl, bw) \
+ asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
+#endif
+
#define BUILDIO(bwl, bw, type) \
static inline void out##bwl(unsigned type value, int port) \
{ \
- asm volatile("out" #bwl " %" #bw "0, %w1" \
- : : "a"(value), "Nd"(port)); \
+ __out(bwl, bw); \
} \
\
static inline unsigned type in##bwl(int port) \
{ \
unsigned type value; \
- asm volatile("in" #bwl " %w1, %" #bw "0" \
- : "=a"(value) : "Nd"(port)); \
+ __in(bwl, bw); \
return value; \
} \
\
--
2.25.1

Subject: [RFC v2 12/32] x86/tdx: Handle CPUID via #VE

From: "Kirill A. Shutemov" <[email protected]>

TDX has three classes of CPUID leaves: some CPUID leaves
are always handled by the CPU, others are handled by the TDX module,
and some others are handled by the VMM. Since the VMM cannot directly
intercept the instruction, these are reflected with a #VE exception
to the guest, which then converts it into a TDCALL to the VMM,
or handles it directly.

The TDX module EAS has a full list of the CPUID leaves which are handled
natively or by the TDX module, in section 16.2. Only unknown CPUID leaves
are handled by the #VE method. In practice this typically only applies to
the hypervisor-specific CPUID leaves unknown to the native CPU.

Therefore there is no risk of triggering this in early CPUID code, which
runs before the #VE handler is set up, because it never accesses
those exotic CPUID leaves.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5b16707b3577..e42e260df245 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -174,6 +174,21 @@ static int tdg_write_msr_safe(unsigned int msr, unsigned int low,
return ret ? -EIO : 0;
}

+static void tdg_handle_cpuid(struct pt_regs *regs)
+{
+ u64 ret;
+ struct tdvmcall_output out = {0};
+
+ ret = __tdvmcall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
+
+ WARN_ON(ret);
+
+ regs->ax = out.r12;
+ regs->bx = out.r13;
+ regs->cx = out.r14;
+ regs->dx = out.r15;
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -220,6 +235,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_MSR_WRITE:
ret = tdg_write_msr_safe(regs->cx, regs->ax, regs->dx);
break;
+ case EXIT_REASON_CPUID:
+ tdg_handle_cpuid(regs);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1

Subject: [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest

Per the Intel TDX Virtual Firmware Design Guide, sec 4.3.5 and
sec 9.4, all unused CPUs are put in a spinning state by
TDVF until the OS requests CPU bring-up via the mailbox address passed
by the ACPI MADT table. Since all unused CPUs are always in the
spinning state by default, there is no point in supporting a dynamic CPU
online/offline feature, so the current generation of TDVF does not
support CPU hotplug. It may be supported in the next generation.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
---
arch/x86/kernel/tdx.c | 14 ++++++++++++++
arch/x86/kernel/topology.c | 3 ++-
2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 294dda5bf3f6..ab1efa4d10e9 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -316,6 +316,17 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return insn.length;
}

+static int tdg_cpu_offline_prepare(unsigned int cpu)
+{
+ /*
+ * Per Intel TDX Virtual Firmware Design Guide,
+ * sec 4.3.5 and sec 9.4, Hotplug is not supported
+ * in TDX platforms. So don't support CPU
+ * offline feature once its turned on.
+ */
+ return -EOPNOTSUPP;
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -410,5 +421,8 @@ void __init tdx_early_init(void)
pv_ops.irq.safe_halt = tdg_safe_halt;
pv_ops.irq.halt = tdg_halt;

+ cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
+ NULL, tdg_cpu_offline_prepare);
+
pr_info("TDX guest is initialized\n");
}
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
index f5477eab5692..d879ea96d79c 100644
--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -34,6 +34,7 @@
#include <linux/irq.h>
#include <asm/io_apic.h>
#include <asm/cpu.h>
+#include <asm/tdx.h>

static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);

@@ -130,7 +131,7 @@ int arch_register_cpu(int num)
}
}
}
- if (num || cpu0_hotpluggable)
+ if ((num || cpu0_hotpluggable) && !is_tdx_guest())
per_cpu(cpu_devices, num).cpu.hotpluggable = 1;

return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
--
2.25.1

Subject: [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure

From: Erik Kaneda <[email protected]>

ACPICA commit b9eb6f3a19b816824d6f47a6bc86fd8ce690e04b

Link: https://github.com/acpica/acpica/commit/b9eb6f3a
Signed-off-by: Erik Kaneda <[email protected]>
Signed-off-by: Bob Moore <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/acpi/actbl2.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index d6478c430c99..b2362600b9ff 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -516,7 +516,8 @@ enum acpi_madt_type {
ACPI_MADT_TYPE_GENERIC_MSI_FRAME = 13,
ACPI_MADT_TYPE_GENERIC_REDISTRIBUTOR = 14,
ACPI_MADT_TYPE_GENERIC_TRANSLATOR = 15,
- ACPI_MADT_TYPE_RESERVED = 16 /* 16 and greater are reserved */
+ ACPI_MADT_TYPE_MULTIPROC_WAKEUP = 16,
+ ACPI_MADT_TYPE_RESERVED = 17 /* 17 and greater are reserved */
};

/*
@@ -723,6 +724,15 @@ struct acpi_madt_generic_translator {
u32 reserved2;
};

+/* 16: Multiprocessor wakeup (ACPI 6.4) */
+
+struct acpi_madt_multiproc_wakeup {
+ struct acpi_subtable_header header;
+ u16 mailbox_version;
+ u32 reserved; /* reserved - must be zero */
+ u64 base_address;
+};
+
/*
* Common flags fields for MADT subtables
*/
--
2.25.1

Subject: [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL

From: "Kirill A. Shutemov" <[email protected]>

The MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
shared memory. Shared GPA mappings can be used for
communication between the TD guest and the host VMM, for example for
paravirtualized IO.

The new helper tdg_map_gpa() provides access to the operation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/tdx.h | 13 +++++++++++++
arch/x86/kernel/tdx.c | 13 +++++++++++++
2 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index dc80cf7f7d08..4789798d7737 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,11 @@

#ifndef __ASSEMBLY__

+enum tdx_map_type {
+ TDX_MAP_PRIVATE,
+ TDX_MAP_SHARED,
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
@@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
unsigned int tdg_inl(unsigned short port);

extern phys_addr_t tdg_shared_mask(void);
+extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type);

#else // !CONFIG_INTEL_TDX_GUEST

@@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
{
return 0;
}
+
+static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ return -ENODEV;
+}
#endif /* CONFIG_INTEL_TDX_GUEST */
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 7e391cd7aa2b..074136473011 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -15,6 +15,8 @@
#include "tdx-kvm.c"
#endif

+#define TDVMCALL_MAP_GPA 0x10001
+
static struct {
unsigned int gpa_width;
unsigned long attributes;
@@ -98,6 +100,17 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+ u64 ret;
+
+ if (map_type == TDX_MAP_SHARED)
+ gpa |= tdg_shared_mask();
+
+ ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+ return ret ? -EIO : 0;
+}
+
static __cpuidle void tdg_halt(void)
{
u64 ret;
--
2.25.1

Subject: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to access guest memory. Any memory
that is required for communication with the VMM must be shared
explicitly by setting the bit in the page table entry. After
setting the shared bit, the conversion must be completed with the
MapGPA TDVMCALL. The call informs the VMM about the conversion and
makes it remove the GPA from the S-EPT mapping. The shared
memory is similar to unencrypted memory in AMD SME/SEV terminology,
but the underlying process of sharing/un-sharing the memory is
different for an Intel TDX guest.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, the AMD SME code forces force_dma_unencrypted()
to return true on platforms that support the AMD SEV feature. The
DMA memory allocation API uses it to trigger set_memory_decrypted()
on those platforms.

TDX is similar.  TDX architecturally prevents access to private
guest memory by anything other than the guest itself. This means that
any DMA buffers must be shared.

So move force_dma_unencrypted() out of the AMD-specific code.

It will be modified to return true for the Intel TDX guest platform,
similar to the AMD SEV feature.

Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features. This will be
selected by both AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in the DMA code; it has no
functional change.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/Kconfig | 8 +++++--
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mem_encrypt.c | 30 -------------------------
arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
4 files changed, 46 insertions(+), 32 deletions(-)
create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 932e6d759ba7..67f99bf27729 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1529,14 +1529,18 @@ config X86_CPA_STATISTICS
helps to determine the effectiveness of preserving large and huge
page mappings when mapping protections are changed.

+config X86_MEM_ENCRYPT_COMMON
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+ select DYNAMIC_PHYSICAL_MASK
+ def_bool n
+
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
- select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
- select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+ select X86_MEM_ENCRYPT_COMMON
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o

+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
+
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..6f713c6a32b2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
#include <linux/dma-direct.h>
#include <linux/swiotlb.h>
#include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>

#include <asm/tlbflush.h>
#include <asm/fixmap.h>
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
return sev_status & MSR_AMD64_SEV_ES_ENABLED;
}

-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
- /*
- * For SEV, all DMA must be to unencrypted addresses.
- */
- if (sev_active())
- return true;
-
- /*
- * For SME, all DMA must be to unencrypted addresses if the
- * device does not support DMA to addresses that include the
- * encryption mask.
- */
- if (sme_active()) {
- u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
- u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
- dev->bus_dma_limit);
-
- if (dma_dev_mask <= dma_enc_mask)
- return true;
- }
-
- return false;
-}
-
void __init mem_encrypt_free_decrypted_mem(void)
{
unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ /*
+ * For SEV, all DMA must be to unencrypted/shared addresses.
+ */
+ if (sev_active())
+ return true;
+
+ /*
+ * For SME, all DMA must be to unencrypted addresses if the
+ * device does not support DMA to addresses that include the
+ * encryption mask.
+ */
+ if (sme_active()) {
+ u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+ u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+ dev->bus_dma_limit);
+
+ if (dma_dev_mask <= dma_enc_mask)
+ return true;
+ }
+
+ return false;
+}
--
2.25.1

Subject: [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests

From: Sean Christopherson <[email protected]>

Disable the legacy PIC (8259) for TDX guests as the PIC cannot be
supported by the VMM. TDX Module does not allow direct IRQ injection,
and using posted interrupt style delivery requires the guest to EOI
the IRQ, which diverges from the legacy PIC behavior.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ab1efa4d10e9..1f1bb98e1d38 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,7 @@
#define pr_fmt(fmt) "TDX: " fmt

#include <asm/tdx.h>
+#include <asm/i8259.h>
#include <asm/vmx.h>
#include <asm/insn.h>
#include <linux/sched/signal.h> /* force_sig_fault() */
@@ -421,6 +422,8 @@ void __init tdx_early_init(void)
pv_ops.irq.safe_halt = tdg_safe_halt;
pv_ops.irq.halt = tdg_halt;

+ legacy_pic = &null_legacy_pic;
+
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
NULL, tdg_cpu_offline_prepare);

--
2.25.1

Subject: [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address

From: Isaku Yamahata <[email protected]>

The IOAPIC is emulated by KVM, which means its MMIO address is shared
with the host. Add the shared bit to the IOAPIC base address.
Most MMIO regions are handled by ioremap(), which already marks them
as shared for TDX guests, but the IOAPIC is an exception: it uses the
fixmap.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..2a01d4a82be7 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2675,6 +2675,14 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}

+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx, phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+ if (is_tdx_guest())
+ flags = pgprot_tdg_shared(flags);
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2715,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2836,7 +2844,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;

- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.25.1

2021-04-26 20:34:49

by Dave Hansen

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

> +/*
> + * Expose registers R10-R15 to VMM (for bitfield info
> + * refer to TDX GHCI specification).
> + */
> +#define TDVMCALL_EXPOSE_REGS_MASK 0xfc00

Why can't we do:

#define TDC_R10 BIT(18)
#define TDC_R11 BIT(19)

and:

#define TDVMCALL_EXPOSE_REGS_MASK (TDX_R10 | TDX_R11 | TDX_R12 ...

or at least:

#define TDVMCALL_EXPOSE_REGS_MASK BIT(18) | BIT(19) ...

?

> +/*
> + * TDX guests use the TDCALL instruction to make
> + * hypercalls to the VMM. It is supported in
> + * Binutils >= 2.36.
> + */
> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> +
> +/*
> + * __tdcall() - Used to communicate with the TDX module

Why is this function here? What does it do? Why do we need it?

I'd like this to actually talk about doing impedance matching between
the function call and TDCALL ABIs.

> + * @arg1 (RDI) - TDCALL Leaf ID
> + * @arg2 (RSI) - Input parameter 1 passed to TDX module
> + * via register RCX
> + * @arg2 (RDX) - Input parameter 2 passed to TDX module
> + * via register RDX
> + * @arg3 (RCX) - Input parameter 3 passed to TDX module
> + * via register R8
> + * @arg4 (R8) - Input parameter 4 passed to TDX module
> + * via register R9

The unnecessary repetition and verbosity actually make this harder to
read. This looks like it was easy to write, but not much effort is
being made to make it easy to consume. Could you please apply some
consideration to making it more readable?


> + * @arg5 (R9) - struct tdcall_output pointer
> + *
> + * @out - Return status of tdcall via RAX.

Don't comments usually just say "returns ... foo"? Also, the @params
usually refer to *REAL* variable names. Where the heck does "out" come
from? Why are you even putting argX? Shouldn't these be @'s be their
literal function argument names?

@rdi - Input parameter, moved to RCX

> + * NOTE: This function should only used for non TDVMCALL
> + * use cases
> + */
> +SYM_FUNC_START(__tdcall)
> + FRAME_BEGIN
> +
> + /* Save non-volatile GPRs that are exposed to the VMM. */
> + push %r15
> + push %r14
> + push %r13
> + push %r12

Why do we have to save these? Because they might be clobbered? If so,
let's say *THAT* instead of just "exposed". "Exposed" could mean "VMM
can read".

Also, this just told me that this function can't be used to talk to the
VMM. Why is this talking about exposure to the VMM?

> + /* Move TDCALL Leaf ID to RAX */
> + mov %rdi, %rax
> + /* Move output pointer to R12 */
> + mov %r9, %r12

I thought 'struct tdcall_output' was a purely software construct. Why
are we passing a pointer to it into TDCALL?

> + /* Move input param 4 to R9 */
> + mov %r8, %r9
> + /* Move input param 3 to R8 */
> + mov %rcx, %r8
> + /* Leave input param 2 in RDX */
> + /* Move input param 1 to RCX */
> + mov %rsi, %rcx

With a little work, this can be made a *LOT* more readable:

/* Mangle function call ABI into TDCALL ABI: */
mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
mov %r9, %r12 /* Move output pointer to R12 */
mov %r8, %r9 /* Move input 4 to R9 */
mov %rcx, %r8 /* Move input 3 to R8 */
mov %rsi, %rcx /* Move input 1 to RCX */
/* Leave input param 2 in RDX */


> + tdcall
> +
> + /* Check for TDCALL success: 0 - Successful, otherwise failed */
> + test %rax, %rax
> + jnz 1f
> +
> + /* Check for a TDCALL output struct */
> + test %r12, %r12
> + jz 1f

Does some universal status come back in r12? Aren't we dealing with a
VMM/SEAM-controlled register here? Isn't this dangerous?

> + /* Copy TDCALL result registers to output struct: */
> + movq %rcx, TDCALL_rcx(%r12)
> + movq %rdx, TDCALL_rdx(%r12)
> + movq %r8, TDCALL_r8(%r12)
> + movq %r9, TDCALL_r9(%r12)
> + movq %r10, TDCALL_r10(%r12)
> + movq %r11, TDCALL_r11(%r12)
> +1:
> + /* Zero out registers exposed to the TDX Module. */
> + xor %rcx, %rcx
> + xor %rdx, %rdx
> + xor %r8d, %r8d
> + xor %r9d, %r9d
> + xor %r10d, %r10d
> + xor %r11d, %r11d

... why?

> + /* Restore non-volatile GPRs that are exposed to the VMM. */
> + pop %r12
> + pop %r13
> + pop %r14
> + pop %r15
> +
> + FRAME_END
> + ret
> +SYM_FUNC_END(__tdcall)
> +
> +/*
> + * do_tdvmcall() - Used to communicate with the VMM.
> + *
> + * @arg1 (RDI) - TDVMCALL function, e.g. exit reason
> + * @arg2 (RSI) - Input parameter 1 passed to VMM
> + * via register R12
> + * @arg3 (RDX) - Input parameter 2 passed to VMM
> + * via register R13
> + * @arg4 (RCX) - Input parameter 3 passed to VMM
> + * via register R14
> + * @arg5 (R8) - Input parameter 4 passed to VMM
> + * via register R15
> + * @arg6 (R9) - struct tdvmcall_output pointer
> + *
> + * @out - Return status of tdvmcall(R10) via RAX.
> + *
> + */

Same comments on the sparse comment style.

> +SYM_CODE_START_LOCAL(do_tdvmcall)
> + FRAME_BEGIN
> +
> + /* Save non-volatile GPRs that are exposed to the VMM. */
> + push %r15
> + push %r14
> + push %r13
> + push %r12
> +
> + /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */

I think there needs to be some discussion of what TDCALL and TDVMCALL
are. They are named too similarly not to do so.

> + xor %eax, %eax
> + /* Move TDVMCALL function id (1st argument) to R11 */
> + mov %rdi, %r11
> + /* Move Input parameter 1-4 to R12-R15 */
> + mov %rsi, %r12
> + mov %rdx, %r13
> + mov %rcx, %r14
> + mov %r8, %r15
> + /* Leave tdvmcall output pointer in R9 */
> +
> + /*
> + * Value of RCX is used by the TDX Module to determine which
> + * registers are exposed to VMM. Each bit in RCX represents a
> + * register id. You can find the bitmap details from TDX GHCI
> + * spec.
> + */

This doesn't belong here. Put it along with the
TDVMCALL_EXPOSE_REGS_MASK, please.

> + movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> + tdcall
> +
> + /*
> + * Check for TDCALL success: 0 - Successful, otherwise failed.
> + * If failed, there is an issue with TDX Module which is fatal
> + * for the guest. So panic.
> + */
> + test %rax, %rax
> + jnz 2f

So, just to be clear: %RAX is under the control of the SEAM module. The
VMM has no control over it. Right?

Shouldn't we say that explicitly?

> + /* Move TDVMCALL success/failure to RAX to return to user */
> + mov %r10, %rax
> +
> + /* Check for TDVMCALL success: 0 - Successful, otherwise failed */
> + test %rax, %rax
> + jnz 1f
> +
> + /* Check for a TDVMCALL output struct */
> + test %r9, %r9
> + jz 1f

I'd also include a note that %r9 was neither writable nor its value
exposed to the VMM.

> + /* Copy TDVMCALL result registers to output struct: */
> + movq %r11, TDVMCALL_r11(%r9)
> + movq %r12, TDVMCALL_r12(%r9)
> + movq %r13, TDVMCALL_r13(%r9)
> + movq %r14, TDVMCALL_r14(%r9)
> + movq %r15, TDVMCALL_r15(%r9)
> +1:
> + /*
> + * Zero out registers exposed to the VMM to avoid
> + * speculative execution with VMM-controlled values.
> + */
> + xor %r10d, %r10d
> + xor %r11d, %r11d
> + xor %r12d, %r12d
> + xor %r13d, %r13d
> + xor %r14d, %r14d
> + xor %r15d, %r15d
> +
> + /* Restore non-volatile GPRs that are exposed to the VMM. */
> + pop %r12
> + pop %r13
> + pop %r14
> + pop %r15
> +
> + FRAME_END
> + ret
> +2:
> + ud2
> +SYM_CODE_END(do_tdvmcall)
> +
> +/* Helper function for standard type of TDVMCALL */
> +SYM_FUNC_START(__tdvmcall)
> + /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> + xor %r10, %r10
> + call do_tdvmcall
> + retq
> +SYM_FUNC_END(__tdvmcall)

Why do we need this helper? Why does it need to be in assembly?

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 6a7193fead08..29c52128b9c0 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -1,8 +1,44 @@
> // SPDX-License-Identifier: GPL-2.0
> /* Copyright (C) 2020 Intel Corporation */
>
> +#define pr_fmt(fmt) "TDX: " fmt
> +
> #include <asm/tdx.h>
>
> +/*
> + * Wrapper for use case that checks for error code and print warning message.
> + */

This comment isn't very useful. I can see the error check and warning
by reading the code.

> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> + u64 err;
> +
> + err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
> +
> + if (err)
> + pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
> + fn, err);
> +
> + return err;
> +}
> +
> +/*
> + * Wrapper for the semi-common case where we need single output value (R11).
> + */
> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +
> + struct tdvmcall_output out = {0};
> + u64 err;
> +
> + err = __tdvmcall(fn, r12, r13, r14, r15, &out);
> +
> + if (err)
> + pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
> + fn, err);
> +
> + return out.r11;
> +}

How do callers check for errors? Is the error value superfluously
returned in r11 and another output register?

2021-04-26 21:13:18

by Randy Dunlap

Subject: Re: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Add INTEL_TDX_GUEST config option to selectively compile
> TDX guest support.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> ---
> arch/x86/Kconfig | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6b4b682af468..932e6d759ba7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -875,6 +875,21 @@ config ACRN_GUEST
> IOT with small footprint and real-time features. More details can be
> found in https://projectacrn.org/.
>
> +config INTEL_TDX_GUEST
> + bool "Intel Trusted Domain eXtensions Guest Support"
> + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
> + depends on SECURITY
> + select PARAVIRT_XL
> + select X86_X2APIC
> + select SECURITY_LOCKDOWN_LSM
> + help
> + Provide support for running in a trusted domain on Intel processors
> + equipped with Trusted Domain eXtenstions. TDX is an new Intel

a new Intel

> + technology that extends VMX and Memory Encryption with a new kind of
> + virtual machine guest called Trust Domain (TD). A TD is designed to
> + run in a CPU mode that protects the confidentiality of TD memory
> + contents and the TD’s CPU state from other software, including VMM.
> +
> endif #HYPERVISOR_GUEST
>
> source "arch/x86/Kconfig.cpu"
>


--
~Randy

Subject: Re: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option



On 4/26/21 2:09 PM, Randy Dunlap wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> Add INTEL_TDX_GUEST config option to selectively compile
>> TDX guest support.
>>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Reviewed-by: Tony Luck <[email protected]>
>> ---
>> arch/x86/Kconfig | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 6b4b682af468..932e6d759ba7 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -875,6 +875,21 @@ config ACRN_GUEST
>> IOT with small footprint and real-time features. More details can be
>> found in https://projectacrn.org/.
>>
>> +config INTEL_TDX_GUEST
>> + bool "Intel Trusted Domain eXtensions Guest Support"
>> + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
>> + depends on SECURITY
>> + select PARAVIRT_XL
>> + select X86_X2APIC
>> + select SECURITY_LOCKDOWN_LSM
>> + help
>> + Provide support for running in a trusted domain on Intel processors
>> + equipped with Trusted Domain eXtenstions. TDX is an new Intel
>
> a new Intel
>

Good catch. I will fix it in the next version.

>> + technology that extends VMX and Memory Encryption with a new kind of
>> + virtual machine guest called Trust Domain (TD). A TD is designed to
>> + run in a CPU mode that protects the confidentiality of TD memory
>> + contents and the TD’s CPU state from other software, including VMM.
>> +
>> endif #HYPERVISOR_GUEST
>>
>> source "arch/x86/Kconfig.cpu"
>>
>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions



On 4/26/21 1:32 PM, Dave Hansen wrote:
>> +/*
>> + * Expose registers R10-R15 to VMM (for bitfield info
>> + * refer to TDX GHCI specification).
>> + */
>> +#define TDVMCALL_EXPOSE_REGS_MASK 0xfc00
>
> Why can't we do:
>
> #define TDC_R10 BIT(18)
> #define TDC_R11 BIT(19)
>
> and:
>
> #define TDVMCALL_EXPOSE_REGS_MASK (TDX_R10 | TDX_R11 | TDX_R12 ...
>
> or at least:
>
> #define TDVMCALL_EXPOSE_REGS_MASK BIT(18) | BIT(19) ...

If this is the preferred way, I will change it to use macros (TDX_Rxx).
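For what it's worth, the mask in the quoted hunk works out to exactly R10-R15. A sketch of the suggested macro style, assuming the GHCI RCX bitmap assigns one bit per GPR number (so R10 is bit 10):

```c
#include <assert.h>
#include <stdint.h>

/* One bit per GPR number in the RCX bitmap (assumed GHCI layout). */
#define TDX_R10 (1UL << 10)
#define TDX_R11 (1UL << 11)
#define TDX_R12 (1UL << 12)
#define TDX_R13 (1UL << 13)
#define TDX_R14 (1UL << 14)
#define TDX_R15 (1UL << 15)

/* Expose registers R10-R15 to the VMM. */
#define TDVMCALL_EXPOSE_REGS_MASK \
	(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15)

static uint64_t expose_regs_mask(void)
{
	return TDVMCALL_EXPOSE_REGS_MASK;
}
```

Composed this way the mask is self-documenting, and it still evaluates to the literal 0xfc00 used in the patch.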

>
> ?
>
>> +/*
>> + * TDX guests use the TDCALL instruction to make
>> + * hypercalls to the VMM. It is supported in
>> + * Binutils >= 2.36.
>> + */
>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>> +
>> +/*
>> + * __tdcall() - Used to communicate with the TDX module
>
> Why is this function here? What does it do? Why do we need it?

The __tdcall() function is used to request services from the TDX module.
Example use cases are TDREPORT, VEINFO, TDINFO, etc.

>
> I'd like this to actually talk about doing impedance matching between
> the function call and TDCALL ABIs.
>
>> + * @arg1 (RDI) - TDCALL Leaf ID
>> + * @arg2 (RSI) - Input parameter 1 passed to TDX module
>> + * via register RCX
>> + * @arg2 (RDX) - Input parameter 2 passed to TDX module
>> + * via register RDX
>> + * @arg3 (RCX) - Input parameter 3 passed to TDX module
>> + * via register R8
>> + * @arg4 (R8) - Input parameter 4 passed to TDX module
>> + * via register R9
>
> The unnecessary repetition and verbosity actually make this harder to
> read. This looks like it was easy to write, but not much effort is
> being made to make it easy to consume. Could you please apply some
> consideration to making it more readable?
>
>
>> + * @arg5 (R9) - struct tdcall_output pointer
>> + *
>> + * @out - Return status of tdcall via RAX.
>
> Don't comments usually just say "returns ... foo"? Also, the @params
> usually refer to *REAL* variable names. Where the heck does "out" come
> from? Why are you even putting argX? Shouldn't these @'s be their
> literal function argument names?

I have added this comment block to make it easier for us to understand
the register mapping between function arguments and the TDCALL ABI. But I got
your point. Usage of @arg1 or @out does not comply with the function comment
standards. I will fix this in the next version.

>
> @rdi - Input parameter, moved to RCX

I will use the above format to document function arguments.

>
>> + * NOTE: This function should only used for non TDVMCALL
>> + * use cases
>> + */
>> +SYM_FUNC_START(__tdcall)
>> + FRAME_BEGIN
>> +
>> + /* Save non-volatile GPRs that are exposed to the VMM. */
>> + push %r15
>> + push %r14
>> + push %r13
>> + push %r12
>
> Why do we have to save these? Because they might be clobbered? If so,
> let's say *THAT* instead of just "exposed". "Exposed" could mean "VMM
> can read".
>
> Also, this just told me that this function can't be used to talk to the
> VMM. Why is this talking about exposure to the VMM?

Although __tdcall() is only used to communicate with the TDX module and the
TDX module is not supposed to touch these registers, just to be on the safe
side, I have tried to save the context of registers R12-R15. Anyway, the
cycles used by these instructions are negligible compared to the tdcall itself.


>
>> + /* Move TDCALL Leaf ID to RAX */
>> + mov %rdi, %rax
>> + /* Move output pointer to R12 */
>> + mov %r9, %r12
>
> I thought 'struct tdcall_output' was a purely software construct. Why
> are we passing a pointer to it into TDCALL?

It's used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
function is concerned, it's just a block of memory (accessed using
base address + TDCALL_r* offsets).

>
>> + /* Move input param 4 to R9 */
>> + mov %r8, %r9
>> + /* Move input param 3 to R8 */
>> + mov %rcx, %r8
>> + /* Leave input param 2 in RDX */
>> + /* Move input param 1 to RCX */
>> + mov %rsi, %rcx
>
> With a little work, this can be made a *LOT* more readable:
>
> /* Mangle function call ABI into TDCALL ABI: */
> mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
> mov %r9, %r12 /* Move output pointer to R12 */
> mov %r8, %r9 /* Move input 4 to R9 */
> mov %rcx, %r8 /* Move input 3 to R8 */
> mov %rsi, %rcx /* Move input 1 to RCX */
> /* Leave input param 2 in RDX */

Ok. I will use your version.

>
>
>> + tdcall
>> +
>> + /* Check for TDCALL success: 0 - Successful, otherwise failed */
>> + test %rax, %rax
>> + jnz 1f
>> +
>> + /* Check for a TDCALL output struct */
>> + test %r12, %r12
>> + jz 1f
>
> Does some universal status come back in r12? Aren't we dealing with a
> VMM/SEAM-controlled register here? Isn't this dangerous?

R12 is the temporary register we have used to store the address of the
user-passed output pointer. We just check for the NULL condition here. R12
will not be used by the TDX module.

If you prefer, we can just push the output pointer to stack and get it
after we make the tdcall.

>
>> + /* Copy TDCALL result registers to output struct: */
>> + movq %rcx, TDCALL_rcx(%r12)
>> + movq %rdx, TDCALL_rdx(%r12)
>> + movq %r8, TDCALL_r8(%r12)
>> + movq %r9, TDCALL_r9(%r12)
>> + movq %r10, TDCALL_r10(%r12)
>> + movq %r11, TDCALL_r11(%r12)
>> +1:
>> + /* Zero out registers exposed to the TDX Module. */
>> + xor %rcx, %rcx
>> + xor %rdx, %rdx
>> + xor %r8d, %r8d
>> + xor %r9d, %r9d
>> + xor %r10d, %r10d
>> + xor %r11d, %r11d
>
> ... why?

These registers are used by the TDX Module. Why pass the stale values
back to the user? So we clear them here.

>
>> + /* Restore non-volatile GPRs that are exposed to the VMM. */
>> + pop %r12
>> + pop %r13
>> + pop %r14
>> + pop %r15
>> +
>> + FRAME_END
>> + ret
>> +SYM_FUNC_END(__tdcall)
>> +
>> +/*
>> + * do_tdvmcall() - Used to communicate with the VMM.
>> + *
>> + * @arg1 (RDI) - TDVMCALL function, e.g. exit reason
>> + * @arg2 (RSI) - Input parameter 1 passed to VMM
>> + * via register R12
>> + * @arg3 (RDX) - Input parameter 2 passed to VMM
>> + * via register R13
>> + * @arg4 (RCX) - Input parameter 3 passed to VMM
>> + * via register R14
>> + * @arg5 (R8) - Input parameter 4 passed to VMM
>> + * via register R15
>> + * @arg6 (R9) - struct tdvmcall_output pointer
>> + *
>> + * @out - Return status of tdvmcall(R10) via RAX.
>> + *
>> + */
>
> Same comments on the sparse comment style.

will fix it similar to __tdcall().

>
>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>> + FRAME_BEGIN
>> +
>> + /* Save non-volatile GPRs that are exposed to the VMM. */
>> + push %r15
>> + push %r14
>> + push %r13
>> + push %r12
>> +
>> + /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>
> I think there needs to be some discussion of what TDCALL and TDVMCALL
> are. They are named too similarly not to do so.

TDVMCALL is a sub-function of TDCALL (selected by setting the RAX register
to 0). TDVMCALL is used to request services from the VMM.

>
>> + xor %eax, %eax
>> + /* Move TDVMCALL function id (1st argument) to R11 */
>> + mov %rdi, %r11
>> + /* Move Input parameter 1-4 to R12-R15 */
>> + mov %rsi, %r12
>> + mov %rdx, %r13
>> + mov %rcx, %r14
>> + mov %r8, %r15
>> + /* Leave tdvmcall output pointer in R9 */
>> +
>> + /*
>> + * Value of RCX is used by the TDX Module to determine which
>> + * registers are exposed to VMM. Each bit in RCX represents a
>> + * register id. You can find the bitmap details from TDX GHCI
>> + * spec.
>> + */
>
> This doesn't belong here. Put it along with the
> TDVMCALL_EXPOSE_REGS_MASK, please.

Ok. I will do it.

>
>> + movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>> +
>> + tdcall
>> +
>> + /*
>> + * Check for TDCALL success: 0 - Successful, otherwise failed.
>> + * If failed, there is an issue with TDX Module which is fatal
>> + * for the guest. So panic.
>> + */
>> + test %rax, %rax
>> + jnz 2f
>
> So, just to be clear: %RAX is under the control of the SEAM module. The
> VMM has no control over it. Right?

AFAIK, the VMM will not touch it.

Sean, please confirm it.

>
> Shouldn't we say that explicitly?

I can add it to above comment.

>
>> + /* Move TDVMCALL success/failure to RAX to return to user */
>> + mov %r10, %rax
>> +
>> + /* Check for TDVMCALL success: 0 - Successful, otherwise failed */
>> + test %rax, %rax
>> + jnz 1f
>> +
>> + /* Check for a TDVMCALL output struct */
>> + test %r9, %r9
>> + jz 1f
>
> I'd also include a note that %r9 was neither writable nor its value
> exposed to the VMM.

will do it.

>
>> + /* Copy TDVMCALL result registers to output struct: */
>> + movq %r11, TDVMCALL_r11(%r9)
>> + movq %r12, TDVMCALL_r12(%r9)
>> + movq %r13, TDVMCALL_r13(%r9)
>> + movq %r14, TDVMCALL_r14(%r9)
>> + movq %r15, TDVMCALL_r15(%r9)
>> +1:
>> + /*
>> + * Zero out registers exposed to the VMM to avoid
>> + * speculative execution with VMM-controlled values.
>> + */
>> + xor %r10d, %r10d
>> + xor %r11d, %r11d
>> + xor %r12d, %r12d
>> + xor %r13d, %r13d
>> + xor %r14d, %r14d
>> + xor %r15d, %r15d
>> +
>> + /* Restore non-volatile GPRs that are exposed to the VMM. */
>> + pop %r12
>> + pop %r13
>> + pop %r14
>> + pop %r15
>> +
>> + FRAME_END
>> + ret
>> +2:
>> + ud2
>> +SYM_CODE_END(do_tdvmcall)
>> +
>> +/* Helper function for standard type of TDVMCALL */
>> +SYM_FUNC_START(__tdvmcall)
>> + /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>> + xor %r10, %r10
>> + call do_tdvmcall
>> + retq
>> +SYM_FUNC_END(__tdvmcall)
>
> Why do we need this helper? Why does it need to be in assembly?

It's simpler to do it in assembly. Also, grouping all register updates
in the same file will make it easier for us to read or debug issues. Another
reason is that we also call do_tdvmcall() from the in/out instruction use case.

>
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 6a7193fead08..29c52128b9c0 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -1,8 +1,44 @@
>> // SPDX-License-Identifier: GPL-2.0
>> /* Copyright (C) 2020 Intel Corporation */
>>
>> +#define pr_fmt(fmt) "TDX: " fmt
>> +
>> #include <asm/tdx.h>
>>
>> +/*
>> + * Wrapper for use case that checks for error code and print warning message.
>> + */
>
> This comment isn't very useful. I can see the error check and warning
> by reading the code.

It's just a helper function that covers the common case of checking for an
error and printing a warning message. If this comment is superfluous, I can
remove it.

>
>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> + u64 err;
>> +
>> + err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>> +
>> + if (err)
>> + pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>> + fn, err);
>> +
>> + return err;
>> +}
>> +
>> +/*
>> + * Wrapper for the semi-common case where we need single output value (R11).
>> + */
>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +
>> + struct tdvmcall_output out = {0};
>> + u64 err;
>> +
>> + err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>> +
>> + if (err)
>> + pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>> + fn, err);
>> +
>> + return out.r11;
>> +}
>
> How do callers check for errors? Is the error value superfluously
> returned in r11 and another output register?

We already check for errors in this helper function. Users of this function
only care about the output value (R11), mainly for the in/out use case.
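To make the error flow concrete, here is a minimal C sketch of the wrapper pattern being discussed, with __tdvmcall() replaced by a stub (the stub's success value and R11 result are illustrative only, not the real ABI): the helper consumes and logs the error, and the caller sees nothing but R11.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

struct tdvmcall_output {
	u64 r10, r11, r12, r13, r14, r15;
};

/* Stub standing in for the assembly helper: succeeds, returns 42 in R11. */
static u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
		      struct tdvmcall_output *out)
{
	if (out)
		out->r11 = 42;
	return 0;	/* 0 == TDVMCALL success */
}

/* Wrapper: callers get only R11; errors are logged, not propagated. */
static u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
	struct tdvmcall_output out = {0};
	u64 err = __tdvmcall(fn, r12, r13, r14, r15, &out);

	if (err)
		fprintf(stderr, "TDVMCALL fn:%llx failed with err:%llx\n",
			(unsigned long long)fn, (unsigned long long)err);
	return out.r11;
}
```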

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-04-26 23:18:26

by Dave Hansen

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>> +
>>> +/*
>>> + * __tdcall()  - Used to communicate with the TDX module
>>
>> Why is this function here?  What does it do?  Why do we need it?
>
> __tdcall() function is used to request services from the TDX Module.
> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.

I think there might be some misinterpretation of my question. What you
are describing is what *TDCALL* does. Why do we need a wrapper
function? What purpose does this wrapper function serve? Why do we
need this wrapper function?

>>> + * NOTE: This function should only used for non TDVMCALL
>>> + *       use cases
>>> + */
>>> +SYM_FUNC_START(__tdcall)
>>> +    FRAME_BEGIN
>>> +
>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>> +    push %r15
>>> +    push %r14
>>> +    push %r13
>>> +    push %r12
>>
>> Why do we have to save these?  Because they might be clobbered?  If so,
>> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
>> can read".
>>
>> Also, this just told me that this function can't be used to talk to the
>> VMM.  Why is this talking about exposure to the VMM?
>
> Although __tdcall() is only used to communicate with the TDX module and the
> TDX module is not supposed to touch these registers, just to be on the safe
> side, I have tried to save the context of registers R12-R15. Anyway, the
> cycles used by these instructions are negligible compared to the tdcall itself.

Why are you talking about the VMM if this is a call to the SEAM module?

Let's say someone is reading the TDCALL architecture spec. It will say
something like, "blah blah, in this case TDCALL will not modify
%r12->%r15". Then someone goes and looks at this code that basically
says (or implies) "save these before the SEAM module modifies them".
What is a coder to do?

Please remove the ambiguity, either by removing this superfluous
(according to the spec) code, or documenting why it is not superfluous.

>>> +    /* Move TDCALL Leaf ID to RAX */
>>> +    mov %rdi, %rax
>>> +    /* Move output pointer to R12 */
>>> +    mov %r9, %r12
>>
>> I thought 'struct tdcall_output' was a purely software construct.  Why
>> are we passing a pointer to it into TDCALL?
>
> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
> function is concerned, its just a block of memory (accessed using
> base address + TDCALL_r* offsets).

Is 'struct tdcall_output' a hardware architectural structure or a
software structure?

If it's a software structure, then why are we passing a pointer to a
software structure into a hardware ABI?

If it's a hardware architecture structure, where is the documentation
for it?

>>> +    tdcall
>>> +
>>> +    /* Check for TDCALL success: 0 - Successful, otherwise failed */
>>> +    test %rax, %rax
>>> +    jnz 1f
>>> +
>>> +    /* Check for a TDCALL output struct */
>>> +    test %r12, %r12
>>> +    jz 1f
>>
>> Does some universal status come back in r12?  Aren't we dealing with a
>> VMM/SEAM-controlled register here?  Isn't this dangerous?
>
> R12 is the temporary register we have used to store the address of user
> passed output pointer. We just check for NULL condition here. R12 will
> not be used by the TDX module.

OK, so how do you know this? Could you share your logic, please?

> If you prefer, we can just push the output pointer to stack and get it
> after we make the tdcall.

I prefer that the code be understandable and be written for a clear
purpose. If you're using r12 for temporary storage, I expect to see at
least one reference *SOMEWHERE* to its use as temporary storage. Right
now.... nothing.

>>> +    /* Copy TDCALL result registers to output struct: */
>>> +    movq %rcx, TDCALL_rcx(%r12)
>>> +    movq %rdx, TDCALL_rdx(%r12)
>>> +    movq %r8,  TDCALL_r8(%r12)
>>> +    movq %r9,  TDCALL_r9(%r12)
>>> +    movq %r10, TDCALL_r10(%r12)
>>> +    movq %r11, TDCALL_r11(%r12)
>>> +1:
>>> +    /* Zero out registers exposed to the TDX Module. */
>>> +    xor %rcx,  %rcx
>>> +    xor %rdx,  %rdx
>>> +    xor %r8d,  %r8d
>>> +    xor %r9d,  %r9d
>>> +    xor %r10d, %r10d
>>> +    xor %r11d, %r11d
>>
>> ... why?
>
> These registers are used by the TDX Module. Why pass the stale values
> back to the user? So we clear them here.

Please go look at some other assembly code in the kernel called from C.
Do those functions do this? Why? Why not? Do they care about
"passing stale values back up"?

>>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>>> +    FRAME_BEGIN
>>> +
>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>> +    push %r15
>>> +    push %r14
>>> +    push %r13
>>> +    push %r12
>>> +
>>> +    /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>>
>> I think there needs to be some discussion of what TDCALL and TDVMCALL
>> are.  They are named too similarly not to do so.
>
> TDVMCALL is the sub function of TDCALL (selected by setting RAX register
> to 0). TDVMCALL is used to request services from VMM.

Actually, I think these functions are horribly misnamed.

I think we should make them

__tdx_seam_call()
or __tdx_module_call()

and

__tdx_hypercall()


__tdcall()
and
__tdvmcall()

are really nonsensical in this context, especially since TDVMCALL is
implemented with the TDCALL instruction, but not the __tdcall() function.

>>> +/* Helper function for standard type of TDVMCALL */
>>> +SYM_FUNC_START(__tdvmcall)
>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>> +    xor %r10, %r10
>>> +    call do_tdvmcall
>>> +    retq
>>> +SYM_FUNC_END(__tdvmcall)
>>
>> Why do we need this helper?  Why does it need to be in assembly?
>
> Its simpler to do it in assembly. Also, grouping all register updates
> in the same file will make it easier for us to read or debug issues.
> Another
> reason is, we also call do_tdvmcall() from in/out instruction use case.

Sathya, I seem to have to reverse-engineer what you are doing for all
this stuff. Your answers to my questions are almost entirely orthogonal
to the things I really want to know. I guess I need to be more precise
with the questions I'm asking. But, this is yet another case where I
think the burden for this series continues to fall on the reviewer
rather than the submitter. Not the way I think it is best.

So, trying to reverse-engineer what you are doing here... it seems that
you can't *practically* call do_tdvmcall() directly because %r10 would
be garbage. That makes this (or a wrapper like it) required for every
practical call to do_tdvmcall().

But, even if that's the case, you need to *DOCUMENT* that up in
do_tdvmcall(): Hey, this function is worthless without something that
sets up %r10 before calling it.

I'm also not *SURE* this is simpler to do in assembly.
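For comparison, if do_tdvmcall() were given an explicit type parameter, the helper would collapse to a C one-liner. This is a sketch under the assumption of hypothetical C-callable signatures (the real do_tdvmcall() takes the type in %r10 and cannot be called from C as written):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t u64;

struct tdvmcall_output;

/* R10 selects the TDVMCALL type: 0 - standard, > 0 - vendor-specific. */
#define TDVMCALL_STANDARD 0

/* Stub standing in for a hypothetical C-callable do_tdvmcall(). */
static u64 do_tdvmcall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
		       u64 r15, struct tdvmcall_output *out)
{
	/* Real code would marshal the arguments into registers and TDCALL. */
	return type;	/* echo the type so the sketch is checkable */
}

/* The assembly helper under discussion, expressed as a C one-liner. */
static u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
		      struct tdvmcall_output *out)
{
	return do_tdvmcall(TDVMCALL_STANDARD, fn, r12, r13, r14, r15, out);
}
```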

>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>> index 6a7193fead08..29c52128b9c0 100644
>>> --- a/arch/x86/kernel/tdx.c
>>> +++ b/arch/x86/kernel/tdx.c
>>> @@ -1,8 +1,44 @@
>>>   // SPDX-License-Identifier: GPL-2.0
>>>   /* Copyright (C) 2020 Intel Corporation */
>>>   +#define pr_fmt(fmt) "TDX: " fmt
>>> +
>>>   #include <asm/tdx.h>
>>>   +/*
>>> + * Wrapper for use case that checks for error code and print warning
>>> message.
>>> + */
>>
>> This comment isn't very useful.  I can see the error check and warning
>> by reading the code.
>
> Its just a helper function that covers common case of checking for error
> and print the warning message. If this comment is superfluous, I can remove
> it.

I'd prefer that you actually write a comment about what the function is
doing, maybe:

/*
* Wrapper for simple hypercalls that only return a success/error code.
*/

... or *SOMETHING* that tells what its purpose in life is.

>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>> +{
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>> +
>>> +    if (err)
>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>> +                    fn, err);
>>> +
>>> +    return err;
>>> +}
>>> +
>>> +/*
>>> + * Wrapper for the semi-common case where we need single output
>>> value (R11).
>>> + */
>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64
>>> r14, u64 r15)
>>> +{
>>> +
>>> +    struct tdvmcall_output out = {0};
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>> +
>>> +    if (err)
>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>> +                    fn, err);
>>> +
>>> +    return out.r11;
>>> +}
>>
>> How do callers check for errors?  Is the error value superfluously
>> returned in r11 and another output register?
>
> We already check for error in this helper function. User of this function
> only cares about output value (R11). Mainly for in/out use case.

That's pretty valuable information.

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions



On 4/26/21 4:17 PM, Dave Hansen wrote:
> On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>>> +
>>>> +/*
>>>> + * __tdcall()  - Used to communicate with the TDX module
>>>
>>> Why is this function here?  What does it do?  Why do we need it?
>>
>> __tdcall() function is used to request services from the TDX Module.
>> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.
>
> I think there might be some misinterpretation of my question. What you
> are describing is what *TDCALL* does. Why do we need a wrapper
> function? What purpose does this wrapper function serve? Why do we
> need this wrapper function?
>

How about following explanation?

Helper function for the "tdcall" instruction, which can be used to request
services from the TDX module (does not include the VMM). A few examples of
valid TDX module services are "TDREPORT", "MEM PAGE ACCEPT", "VEINFO",
etc.

This function serves as a wrapper to move user call arguments to
the correct registers as specified by the "tdcall" ABI and shares them with
the TDX module. If the "tdcall" operation is successful and a
valid "struct tdcall_out" pointer is available (in the "out" argument),
the output from the TDX module (RCX, RDX, R8-R11) is saved to the memory
specified by the "out" pointer. The status of the "tdcall"
operation is also returned to the caller as the function return value.
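To supplement that description, the software-side contract can be sketched in C. The struct captures the registers the thread says are saved; the field order and the prototype shape are assumptions for illustration, not the real header:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef uint64_t u64;

/* Purely software structure: captures the TDCALL output registers. */
struct tdcall_output {
	u64 rcx;
	u64 rdx;
	u64 r8;
	u64 r9;
	u64 r10;
	u64 r11;
};

/*
 * Assembly helper (prototype only): returns the TDCALL status from RAX.
 * Mapping per the quoted comment block: leafid -> RAX, in1 -> RCX,
 * in2 -> RDX, in3 -> R8, in4 -> R9. 'out' is parked in a scratch
 * register across the instruction; it is never passed to the TDX module.
 */
u64 __tdcall(u64 leafid, u64 in1, u64 in2, u64 in3, u64 in4,
	     struct tdcall_output *out);
```

The TDCALL_r* offsets used by the assembly would then be generated from this struct (e.g. via asm-offsets), keeping the two views in sync.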

>>> Why do we have to save these?  Because they might be clobbered?  If so,
>>> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
>>> can read".
>>>
>>> Also, this just told me that this function can't be used to talk to the
>>> VMM.  Why is this talking about exposure to the VMM?
>>
>> Although __tdcall() is only used to communicate with the TDX module and the
>> TDX module is not supposed to touch these registers, just to be on the safe
>> side, I have tried to save the context of registers R12-R15. Anyway, the
>> cycles used by these instructions are negligible compared to the tdcall itself.
>
> Why are you talking about the VMM if this is a call to the SEAM module?
>
> Let's say someone is reading the TDCALL architecture spec. It will say
> something like, "blah blah, in this case TDCALL will not modify
> %r12->%r15". Then someone goes and looks at this code that basically
> says (or implies) "save these before the SEAM module modifies them".
> What is a coder to do?
>
> Please remove the ambiguity, either by removing this superfluous
> (according to the spec) code, or documenting why it is not superfluous.

Agreed. I will remove the register save/restore code.

>
>>>> +    /* Move TDCALL Leaf ID to RAX */
>>>> +    mov %rdi, %rax
>>>> +    /* Move output pointer to R12 */
>>>> +    mov %r9, %r12
>>>
>>> I thought 'struct tdcall_output' was a purely software construct.  Why
>>> are we passing a pointer to it into TDCALL?
>>
>> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
>> function is concerned, its just a block of memory (accessed using
>> base address + TDCALL_r* offsets).
>
> Is 'struct tdcall_output' a hardware architectural structure or a
> software structure?
>
> If it's a software structure, then why are we passing a pointer to a
> software structure into a hardware ABI?
>
> If it's a hardware architecture structure, where is the documentation
> for it?
>

I think there is a misunderstanding here. We don't share the tdcall_output
pointer with the TDX module. The current use cases of TDCALL (other than
TDVMCALL) do not use registers R12-R15. Since registers R12-R15 are free
and available, we use R12 as temporary storage to hold the tdcall_output
pointer.

I will add a comment about using it as temporary storage.


>
> I prefer that the code be understandable and be written for a clear
> purpose. If you're using r12 for temporary storage, I expect to see at
> least one reference *SOMEWHERE* to its use as temporary storage. Right
> now.... nothing.
>

I will add a reference to it.

>>>> +    /* Copy TDCALL result registers to output struct: */
>>>> +    movq %rcx, TDCALL_rcx(%r12)
>>>> +    movq %rdx, TDCALL_rdx(%r12)
>>>> +    movq %r8,  TDCALL_r8(%r12)
>>>> +    movq %r9,  TDCALL_r9(%r12)
>>>> +    movq %r10, TDCALL_r10(%r12)
>>>> +    movq %r11, TDCALL_r11(%r12)
>>>> +1:
>>>> +    /* Zero out registers exposed to the TDX Module. */
>>>> +    xor %rcx,  %rcx
>>>> +    xor %rdx,  %rdx
>>>> +    xor %r8d,  %r8d
>>>> +    xor %r9d,  %r9d
>>>> +    xor %r10d, %r10d
>>>> +    xor %r11d, %r11d
>>>
>>> ... why?
>>
>> These registers are used by the TDX Module. Why pass the stale values
>> back to the user? So we clear them here.
>
> Please go look at some other assembly code in the kernel called from C.
> Do those functions do this? Why? Why not? Do they care about
> "passing stale values back up"?
>

Maybe I am being overly cautious here. Since the TDX module is trusted
code, speculation attacks are not a consideration here. I will remove this
block of code.

>>>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>>>> +    FRAME_BEGIN
>>>> +
>>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>>> +    push %r15
>>>> +    push %r14
>>>> +    push %r13
>>>> +    push %r12
>>>> +
>>>> +    /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>>>
>>> I think there needs to be some discussion of what TDCALL and TDVMCALL
>>> are.  They are named too similarly not to do so.
>>
>> TDVMCALL is a sub-function of TDCALL (selected by setting the RAX register
>> to 0). TDVMCALL is used to request services from the VMM.
>
> Actually, I think these functions are horribly misnamed.
>
> I think we should make them
>
> __tdx_seam_call()
> or __tdx_module_call()
>
> and
>
> __tdx_hypercall()
>
>
> __tdcall()
> and
> __tdvmcall()
>
> are really nonsensical in this context, especially since TDVMCALL is
> implemented with the TDCALL instruction, but not the __tdcall() function.
>

TDVMCALL is a short form of "TDG.VP.VMCALL". This term comes from the
GHCI document; it can be read as "Trust Domain VMCALL". Maybe because
we are used to the GHCI spec, we don't find it confusing. I agree
that if you consider the "tdcall" instruction usage, it is confusing.

But if it's confusing for new readers and a rename is preferred,

do we need to rename the helper functions?

tdvmcall(), tdvmcall_out_r11()

Also, what about the output structs?

struct tdcall_output
struct tdvmcall_output

>>>> +/* Helper function for standard type of TDVMCALL */
>>>> +SYM_FUNC_START(__tdvmcall)
>>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>>> +    xor %r10, %r10
>>>> +    call do_tdvmcall
>>>> +    retq
>>>> +SYM_FUNC_END(__tdvmcall)
>>>
>>> Why do we need this helper?  Why does it need to be in assembly?
>>
>> Its simpler to do it in assembly. Also, grouping all register updates
>> in the same file will make it easier for us to read or debug issues.
>> Another
>> reason is, we also call do_tdvmcall() from in/out instruction use case.
>
> Sathya, I seem to have to reverse-engineer what you are doing for all
> this stuff. Your answers to my questions are almost entirely orthogonal
> to the things I really want to know. I guess I need to be more precise
> with the questions I'm asking. But, this is yet another case where I
> think the burden for this series continues to fall on the reviewer
> rather than the submitter. Not the way I think it is best.

I had assumed that you were aware of the reason for the existence of
the do_tdvmcall() helper function. It mainly exists to hold the code
common to the vendor-specific and standard types of TDVMCALL.

But that assumption was a mistake on my end. I will try to be more
thorough in my future replies.
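
A C model of that structure, with invented names and return values, just to
illustrate how do_tdvmcall() factors out the code shared by the standard and
vendor-specific wrappers:

```c
#include <stdint.h>

/*
 * Common body: in the real assembly, "type" lands in R10 (0 = standard,
 * > 0 = vendor-specific) before the tdcall is issued. Returning "type"
 * here is a stand-in for the real status code, purely for the demo.
 */
static uint64_t do_tdvmcall(uint64_t type, uint64_t fn, uint64_t arg)
{
	(void)fn; (void)arg;
	/* ... marshal registers, execute tdcall, unmarshal outputs ... */
	return type;
}

/* Standard TDVMCALL: type 0 */
static uint64_t __tdvmcall(uint64_t fn, uint64_t arg)
{
	return do_tdvmcall(0, fn, arg);
}

/* Vendor-specific TDVMCALL: type > 0 */
static uint64_t __tdvmcall_vendor(uint64_t type, uint64_t fn, uint64_t arg)
{
	return do_tdvmcall(type, fn, arg);
}
```

The point of the split is that only the one-line type setup differs between
the two entry points; everything else lives once in do_tdvmcall().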

>
> So, trying to reverse-engineer what you are doing here... it seems that
> you can't *practically* call do_tdvmcall() directly because %r10 would
> be garbage. That makes this (or a wrapper like it) required for every
> practical call to do_tdvmcall().
>
> But, even if that's the case, you need to *DOCUMENT* that up in
> do_tdvmcall(): Hey, this function is worthless without something that
> sets up %r10 before calling it.

Agreed. This needs to be documented. I will add it in the next version.

>
> I'm also not *SURE* this is simpler to do in assembly.
>
>>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>>> index 6a7193fead08..29c52128b9c0 100644
>>>> --- a/arch/x86/kernel/tdx.c
>>>> +++ b/arch/x86/kernel/tdx.c
>>>> @@ -1,8 +1,44 @@
>>>>   // SPDX-License-Identifier: GPL-2.0
>>>>   /* Copyright (C) 2020 Intel Corporation */
>>>>   +#define pr_fmt(fmt) "TDX: " fmt
>>>> +
>>>>   #include <asm/tdx.h>
>>>>   +/*
>>>> + * Wrapper for use case that checks for error code and print warning
>>>> message.
>>>> + */
>>>
>>> This comment isn't very useful.  I can see the error check and warning
>>> by reading the code.
>>
>> Its just a helper function that covers common case of checking for error
>> and print the warning message. If this comment is superfluous, I can remove
>> it.
>
> I'd prefer that you actually write a comment about what the function is
> doing, maybe:
>
> /*
> * Wrapper for simple hypercalls that only return a success/error code.
> */
>
> ... or *SOMETHING* that tells what its purpose in life is.

I will fix it in the next version.

>
>>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>>> +{
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>>> +
>>>> +    if (err)
>>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>>> +                    fn, err);
>>>> +
>>>> +    return err;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Wrapper for the semi-common case where we need single output
>>>> value (R11).
>>>> + */
>>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64
>>>> r14, u64 r15)
>>>> +{
>>>> +
>>>> +    struct tdvmcall_output out = {0};
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>>> +
>>>> +    if (err)
>>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>>> +                    fn, err);
>>>> +
>>>> +    return out.r11;
>>>> +}
>>>
>>> How do callers check for errors?  Is the error value superfluously
>>> returned in r11 and another output register?
>>
>> We already check for error in this helper function. User of this function
>> only cares about output value (R11). Mainly for in/out use case.
>
> That's pretty valuable information.

I will include this note in the function comment.
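
The contract described here — the helper checks (and logs) the error itself,
and callers consume only the R11 value — can be sketched in plain C (mocked
__tdvmcall, with fprintf standing in for pr_warn_ratelimited; the 0xabcd
output value is invented for illustration):

```c
#include <stdint.h>
#include <stdio.h>

struct tdvmcall_output {
	uint64_t r11, r12, r13, r14, r15;
};

/* Mock: always succeeds and reports 0xabcd in R11, for the demo only. */
static uint64_t __tdvmcall(uint64_t fn, uint64_t r12, uint64_t r13,
			   uint64_t r14, uint64_t r15,
			   struct tdvmcall_output *out)
{
	(void)fn; (void)r12; (void)r13; (void)r14; (void)r15;
	if (out)
		out->r11 = 0xabcd;
	return 0;
}

/* The error is checked here; callers only ever see the R11 output. */
static uint64_t tdvmcall_out_r11(uint64_t fn, uint64_t r12, uint64_t r13,
				 uint64_t r14, uint64_t r15)
{
	struct tdvmcall_output out = { 0 };
	uint64_t err = __tdvmcall(fn, r12, r13, r14, r15, &out);

	if (err)
		fprintf(stderr, "TDVMCALL fn:%llx failed with err:%llx\n",
			(unsigned long long)fn, (unsigned long long)err);
	return out.r11;
}
```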

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-04-27 14:33:09

by Dave Hansen

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

On 4/26/21 7:29 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/26/21 4:17 PM, Dave Hansen wrote:
>> On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>>>> +
>>>>> +/*
>>>>> + * __tdcall()  - Used to communicate with the TDX module
>>>>
>>>> Why is this function here?  What does it do?  Why do we need it?
>>>
>>> __tdcall() function is used to request services from the TDX Module.
>>> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.
>>
>> I think there might be some misinterpretation of my question.  What you
>> are describing is what *TDCALL* does.  Why do we need a wrapper
>> function?  What purpose does this wrapper function serve?  Why do we
>> need this wrapper function?
>>
> How about following explanation?
>
> Helper function for "tdcall" instruction, which can be used to request
> services from the TDX module (does not include VMM). Few examples of
> valid TDX module services are, "TDREPORT", "MEM PAGE ACCEPT", "VEINFO",
> etc.

Naming the services here is not useful. If I want to know who calls
this, I'll just literally do that: look up the callers of this function.

> This function serves as a wrapper to move user call arguments to
> the correct registers as specified by "tdcall" ABI and shares it with
> the TDX module.  If the "tdcall" operation is successful and a
> valid "struct tdcall_out" pointer is available (in "out" argument),
> output from the TDX module (RCX, RDX, R8-R11) is saved to the memory
> specified in the "out" pointer. Also the status of the "tdcall"
> operation is returned back to the user as a function return value.

I tend to prefer function comments that talk high-level about what the
function does rather than waste space on the exact registers used in the
ABI. I also tend not to talk about things that can be trivially grepped
for, like the callers of this function.

I'd trim the fat out of there, but it's generally OK, although too
rotund for my taste.

>>>>> +    /* Move TDCALL Leaf ID to RAX */
>>>>> +    mov %rdi, %rax
>>>>> +    /* Move output pointer to R12 */
>>>>> +    mov %r9, %r12
>>>>
>>>> I thought 'struct tdcall_output' was a purely software construct.  Why
>>>> are we passing a pointer to it into TDCALL?
>>>
>>> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
>>> function is concerned, its just a block of memory (accessed using
>>> base address + TDCALL_r* offsets).
>>
>> Is 'struct tdcall_output' a hardware architectural structure or a
>> software structure?
>>
>> If it's a software structure, then why are we passing a pointer to a
>> software structure into a hardware ABI?
>>
>> If it's a hardware architecture structure, where is the documentation
>> for it?
>>
>
> I think there is a misunderstanding here. We don't share the tdcall_output
> pointer with the TDX module. Current use cases of TDCALL (other than
> TDVMCALL)
> do not use registers from R12-R15. Since the registers R12-R15 are free and
> available, we are using R12 as temporary storage to hold the tdcall_output
> pointer.

In other words, 'struct tdcall_output' is a purely software concept.
However, its pointer is manipulated literally next to all of the TDCALL
register arguments and it has an *IDENTICAL* comment to all of those
other moves.

Please make it clear that %r12 is not being used at all for the TDCALL
instruction itself.

But, the bigger point here is that the code needs to be structured in a
way that makes the function and interactions clear. If I want to know
more about the "output pointer", where do I go? Do I go looking at the
calling functions or the TDINFO instruction reference?

...
>> Please go look at some other assembly code in the kernel called from C.
>>   Do those functions do this?  Why?  Why not?  Do they care about
>> "passing stale values back up"?
>
> Maybe I am being overly cautious here. Since the TDX module is trusted
> code, speculation attacks are not a consideration here. I will remove this
> block of code.

Caution is OK. Caution without explanation somewhere as to why it is
warranted is not.

> Do we need to rename the helper functions ?
>
> tdvmcall(), tdvmcall_out_r11()

Yes.

> Also what about output structs?
>
> struct tdcall_output
> struct tdvmcall_output

Yes, they need sane, straightforward names which are not confusing too.

>>>>> +/* Helper function for standard type of TDVMCALL */
>>>>> +SYM_FUNC_START(__tdvmcall)
>>>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>>>> +    xor %r10, %r10
>>>>> +    call do_tdvmcall
>>>>> +    retq
>>>>> +SYM_FUNC_END(__tdvmcall)
>>>>
>>>> Why do we need this helper?  Why does it need to be in assembly?
>>>
>>> Its simpler to do it in assembly. Also, grouping all register updates
>>> in the same file will make it easier for us to read or debug issues.
>>> Another
>>> reason is, we also call do_tdvmcall() from in/out instruction use case.
>>
>> Sathya, I seem to have to reverse-engineer what you are doing for all
>> this stuff.  Your answers to my questions are almost entirely orthogonal
>> to the things I really want to know.  I guess I need to be more precise
>> with the questions I'm asking.  But, this is yet another case where I
>> think the burden for this series continues to fall on the reviewer
>> rather than the submitter.  Not the way I think it is best.
>
> I have assumed that you are aware of reason for the existence of
> do_tdvmcall() helper function. It is mainly created to hold common
> code between vendor specific and standard type of tdvmcall's.

No, I was not aware of that. Remember, you're not doing this for *ME*.
You're doing it for the hundred other people that are going to look
over the code and who won't have been aware of your reasoning.

2021-04-27 17:33:02

by Borislav Petkov

Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

+ Jürgen.

On Mon, Apr 26, 2021 at 11:01:28AM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
> a separate config option. It provides a middle ground for
> not-so-deep paravirtulized environments.

Please introduce a spellchecker into your patch creation workflow.

Also, what does "not-so-deep" mean?

> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
> config would be a bloat for TDX.

Used how? Why is it bloat for TDX?

I'm sure that'll become clear in the remainder of the patches but you
should state it here so that it is clear why you're doing what you're
doing.

>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/Kconfig | 4 +++
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/include/asm/irqflags.h | 38 +++++++++++++++------------
> arch/x86/include/asm/paravirt.h | 22 +++++++++-------
> arch/x86/include/asm/paravirt_types.h | 3 ++-
> arch/x86/kernel/paravirt.c | 4 ++-
> arch/x86/mm/mem_encrypt_identity.c | 1 +
> 7 files changed, 44 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..6b4b682af468 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -783,8 +783,12 @@ config PARAVIRT
> over full virtualization. However, when run without a hypervisor
> the kernel is theoretically slower and slightly larger.
>
> +config PARAVIRT_XL
> + bool
> +
> config PARAVIRT_XXL
> bool
> + select PARAVIRT_XL
>
> config PARAVIRT_DEBUG
> bool "paravirt-ops debugging"
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 901ea5ebec22..4b84abe43765 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -9,6 +9,7 @@
> * paravirt and debugging variants are added.)
> */
> #undef CONFIG_PARAVIRT
> +#undef CONFIG_PARAVIRT_XL
> #undef CONFIG_PARAVIRT_XXL

So what happens if someone else needs even less pv and defines
CONFIG_PARAVIRT_L. Or _M? Or _S?

Are we going to teleport into a clothing store each time we look at
paravirt now? :)

So before this goes out of hand let's define explicitly, pls, what
XXL means and XL. And rename them. They could be called PARAVIRT_FULL
and PARAVIRT_HLT as apparently that thing is exposing only the PV ops
related to HLT.

Or something to that effect.

Dunno, maybe Jürgen has a better idea, leaving in the rest quoted for him.

Thx.

> #undef CONFIG_PARAVIRT_SPINLOCKS
> #undef CONFIG_KASAN
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 144d70ea4393..1688841893d7 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
>
> #endif
>
> -#ifdef CONFIG_PARAVIRT_XXL
> +#ifdef CONFIG_PARAVIRT_XL
> #include <asm/paravirt.h>
> #else
> #ifndef __ASSEMBLY__
> #include <linux/types.h>
> -
> -static __always_inline unsigned long arch_local_save_flags(void)
> -{
> - return native_save_fl();
> -}
> -
> -static __always_inline void arch_local_irq_disable(void)
> -{
> - native_irq_disable();
> -}
> -
> -static __always_inline void arch_local_irq_enable(void)
> -{
> - native_irq_enable();
> -}
> -
> /*
> * Used in the idle loop; sti takes one instruction cycle
> * to complete:
> @@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
> {
> native_halt();
> }
> +#endif /* !__ASSEMBLY__ */
> +#endif /* CONFIG_PARAVIRT_XL */
> +
> +#ifndef CONFIG_PARAVIRT_XXL
> +#ifndef __ASSEMBLY__
> +
> +static __always_inline unsigned long arch_local_save_flags(void)
> +{
> + return native_save_fl();
> +}
> +
> +static __always_inline void arch_local_irq_disable(void)
> +{
> + native_irq_disable();
> +}
> +
> +static __always_inline void arch_local_irq_enable(void)
> +{
> + native_irq_enable();
> +}
>
> /*
> * For spinlocks, etc:
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 4abf110e2243..2dbb6c9c7e98 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
> PVOP_VCALL1(mmu.exit_mmap, mm);
> }
>
> +#ifdef CONFIG_PARAVIRT_XL
> +static inline void arch_safe_halt(void)
> +{
> + PVOP_VCALL0(irq.safe_halt);
> +}
> +
> +static inline void halt(void)
> +{
> + PVOP_VCALL0(irq.halt);
> +}
> +#endif
> +
> #ifdef CONFIG_PARAVIRT_XXL
> static inline void load_sp0(unsigned long sp0)
> {
> @@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
> PVOP_VCALL1(cpu.write_cr4, x);
> }
>
> -static inline void arch_safe_halt(void)
> -{
> - PVOP_VCALL0(irq.safe_halt);
> -}
> -
> -static inline void halt(void)
> -{
> - PVOP_VCALL0(irq.halt);
> -}
> -
> static inline void wbinvd(void)
> {
> PVOP_VCALL0(cpu.wbinvd);
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index de87087d3bde..5261fba47ba5 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -177,7 +177,8 @@ struct pv_irq_ops {
> struct paravirt_callee_save save_fl;
> struct paravirt_callee_save irq_disable;
> struct paravirt_callee_save irq_enable;
> -
> +#endif
> +#ifdef CONFIG_PARAVIRT_XL
> void (*safe_halt)(void);
> void (*halt)(void);
> #endif
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index c60222ab8ab9..d6d0b363fe70 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
> .irq.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
> .irq.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
> .irq.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
> +#endif /* CONFIG_PARAVIRT_XXL */
> +#ifdef CONFIG_PARAVIRT_XL
> .irq.safe_halt = native_safe_halt,
> .irq.halt = native_halt,
> -#endif /* CONFIG_PARAVIRT_XXL */
> +#endif /* CONFIG_PARAVIRT_XL */
>
> /* Mmu ops. */
> .mmu.flush_tlb_user = native_flush_tlb_local,
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 6c5eb6f3f14f..20d0cb116557 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -24,6 +24,7 @@
> * be extended when new paravirt and debugging variants are added.)
> */
> #undef CONFIG_PARAVIRT
> +#undef CONFIG_PARAVIRT_XL
> #undef CONFIG_PARAVIRT_XXL
> #undef CONFIG_PARAVIRT_SPINLOCKS
>
> --
> 2.25.1
>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

Hi Dave,

On 4/27/21 7:29 AM, Dave Hansen wrote:
>> Do we need to rename the helper functions ?
>>
>> tdvmcall(), tdvmcall_out_r11()
> Yes.
>
>> Also what about output structs?
>>
>> struct tdcall_output
>> struct tdvmcall_output
> Yes, they need sane, straightforward names which are not confusing too.
>

The following is the rename diff. Please let me know if you agree with
the names used.

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6c3c71bb57a0..95a6a6c6061a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h

-struct tdcall_output {
+struct tdx_module_output {
u64 rcx;
u64 rdx;
u64 r8;
@@ -19,7 +19,7 @@ struct tdcall_output {
u64 r11;
};

-struct tdvmcall_output {
+struct tdx_hypercall_output {
u64 r11;
u64 r12;
u64 r13;
@@ -33,12 +33,12 @@ bool is_tdx_guest(void);
void __init tdx_early_init(void);

/* Helper function used to communicate with the TDX module */
-u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
- struct tdcall_output *out);
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);

/* Helper function used to request services from VMM */
-u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
- struct tdvmcall_output *out);
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdx_hypercall_output *out);

--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,11 +8,11 @@
/*
* Wrapper for use case that checks for error code and print warning message.
*/
-static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
u64 err;

- err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);

if (err)
pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
@@ -24,13 +24,14 @@ static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
/*
* Wrapper for the semi-common case where we need single output value (R11).
*/
-static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+ u64 r14, u64 r15)


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-04-27 19:21:53

by Dave Hansen

Subject: Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

On 4/27/21 12:18 PM, Kuppuswamy, Sathyanarayanan wrote:
> Following is the rename diff. Please let me know if you agree with the
> names used.

Looks fine at a glance, but the real key is how they look when they get used.

Subject: [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

Guests communicate with VMMs via hypercalls. Historically, these
were implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer (the TDX module) exists between the host and the guest to
facilitate secure communication. The "tdcall" instruction is used by the
guest to request services from the TDX module, and a variant of the
"tdcall" instruction (with specific arguments as defined by the GHCI) is
used by the guest to request services from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction):

__tdx_hypercall() - used to request services from the VMM.
__tdx_module_call() - used to communicate with the TDX module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11(), to cover the common use cases of the
__tdx_hypercall() function. Since each use case of __tdx_module_call()
is different, no such wrappers are needed for it.

Implement the __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing assembly over inline assembly is:

1. Since the number of lines of instructions (with comments) in the
__tdx_hypercall() implementation is over 70, using inline assembly
to implement it would make it hard to read.

2. Also, since many registers (R8-R15, R[A-D]X) are used in the
TDCALL operation, if all these registers were included in the inline
assembly constraints, some older compilers might not be able to meet
this requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But, this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() includes error handling
(ud2 on the failure case) in the assembly function instead of doing it in
a C wrapper function. The reason behind this choice is that, when adding
support for in/out instructions (refer to the patch titled "x86/tdx:
Handle port I/O" in this series), alternative_io() is used to substitute
the in/out instructions with __tdx_hypercall() calls. So the use of C
wrappers is not trivial in this case, because the input parameters would
be in the wrong registers and it's tricky to include the proper buffer
code to make this happen.

For the registers used by the TDCALL instruction, please check the TDX
GHCI specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Hi Dave,

This version includes all the fixes you suggested. Please let me know
your comments.

Changes since v1:
* Renamed __tdcall()/__tdvmcall() to
__tdx_module_call()/__tdx_hypercall().
* Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
* Removed unnecessary code in __tdcall() function.
* Fixed comments as per Dave's review.
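
As a quick sanity check on the BIT()-based TDVMCALL_EXPOSE_REGS_MASK
mentioned above (the TDG_R* names in the patch below expand to
BIT(10)..BIT(15)), the mask value works out to 0xFC00:

```c
#include <stdint.h>

/* Same shape as the kernel's BIT() macro for this range of bits. */
#define BIT(n) (1UL << (n))

/* Each set bit selects the GPR with that index for exposure to the VMM. */
#define TDVMCALL_EXPOSE_REGS_MASK \
	(BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15))

/* R10..R15 exposed => bits 10..15 set => 0xFC00. */
uint64_t tdvmcall_regs_mask(void)
{
	return TDVMCALL_EXPOSE_REGS_MASK;
}
```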

arch/x86/include/asm/tdx.h | 26 ++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 22 ++++
arch/x86/kernel/tdcall.S | 215 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 39 ++++++
5 files changed, 303 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..95a6a6c6061a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,38 @@
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+struct tdx_hypercall_output {
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};

/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

void __init tdx_early_init(void);

+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdx_hypercall_output *out);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
#include <xen/interface/xen.h>
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offsets for fields in struct tdx_module_output */
+ OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+ /* Offsets for fields in struct tdx_hypercall_output */
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..7e14b4a2312e
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10 BIT(10)
+#define TDG_R11 BIT(11)
+#define TDG_R12 BIT(12)
+#define TDG_R13 BIT(13)
+#define TDG_R14 BIT(14)
+#define TDG_R15 BIT(15)
+
+/*
+ * Expose registers R10-R15 to the VMM. This mask is passed to the
+ * TDX module via the RCX register, and the TDX module uses it to
+ * identify the list of registers exposed to the VMM. Each bit in the
+ * mask represents a register ID. You can find the bit field details
+ * in the TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDG_R10 | TDG_R11 | \
+ TDG_R12 | TDG_R13 | \
+ TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call() - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by the "tdcall" ABI and issues the
+ * TDCALL. If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in the "out"
+ * argument), output from the TDX module is saved to the memory pointed
+ * to by "out". The status of the "tdcall" operation is returned as the
+ * function's return value.
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ * use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+
+ /*
+ * R12 is used as temporary storage for the
+ * struct tdx_module_output pointer (defined in
+ * arch/x86/include/asm/tdx.h). This is safe
+ * because registers R12-R15 are not used by
+ * the TDCALL leaves supported by this helper.
+ */
+ push %r12 /* Callee saved, so preserve it */
+ mov %r9, %r12 /* Move output pointer to R12 */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
+ mov %r8, %r9 /* Move input 4 to R9 */
+ mov %rcx, %r8 /* Move input 3 to R8 */
+ mov %rsi, %rcx /* Move input 1 to RCX */
+ /* Leave input param 2 in RDX */
+
+ tdcall
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check for TDCALL output struct != NULL */
+ test %r12, %r12
+ jz 1f
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDX_MODULE_rcx(%r12)
+ movq %rdx, TDX_MODULE_rdx(%r12)
+ movq %r8, TDX_MODULE_r8(%r12)
+ movq %r9, TDX_MODULE_r9(%r12)
+ movq %r10, TDX_MODULE_r10(%r12)
+ movq %r11, TDX_MODULE_r11(%r12)
+1:
+ pop %r12 /* Restore the state of R12 register */
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall() - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using the "TDCALL" instruction.
+ *
+ * This function contains the code common to vendor-specific and
+ * standard TDX hypercalls. The caller has to set the TDVMCALL type
+ * in the R10 register before calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by the "tdcall" ABI and shares them
+ * with the VMM via the TDX module. If the "tdcall" operation is
+ * successful and a valid "struct tdx_hypercall_output" pointer is
+ * available (in the "out" argument), output from the VMM is saved to
+ * the memory pointed to by "out".
+ *
+ * @fn (RDI) - TDVMCALL function, moved to R11
+ * @r12 (RSI) - Input parameter 1, moved to R12
+ * @r13 (RDX) - Input parameter 2, moved to R13
+ * @r14 (RCX) - Input parameter 3, moved to R14
+ * @r15 (R8) - Input parameter 4, moved to R15
+ *
+ * @out (R9) - struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ * If the "tdcall" operation fails, panic.
+ *
+ */
+SYM_CODE_START_LOCAL(do_tdx_hypercall)
+ FRAME_BEGIN
+
+ /* Save non-volatile GPRs that are exposed to the VMM. */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+ mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+ mov %rsi, %r12 /* Move input 1 to R12 */
+ mov %rdx, %r13 /* Move input 2 to R13 */
+ mov %rcx, %r14 /* Move input 3 to R14 */
+ mov %r8, %r15 /* Move input 4 to R15 */
+ /* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * Check for TDCALL success: 0 - Successful, otherwise failed.
+ * A failure here indicates a problem with the TDX module, which
+ * is fatal for the guest, so panic. Note that RAX is controlled
+ * only by the TDX module and is not exposed to the VMM.
+ */
+ test %rax, %rax
+ jnz 2f
+
+ /* Move hypercall error code to RAX to return to user */
+ mov %r10, %rax
+
+ /* Check for hypercall success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check for hypercall output struct != NULL */
+ test %r9, %r9
+ jz 1f
+
+ /* Copy hypercall result registers to output struct: */
+ movq %r11, TDX_HYPERCALL_r11(%r9)
+ movq %r12, TDX_HYPERCALL_r12(%r9)
+ movq %r13, TDX_HYPERCALL_r13(%r9)
+ movq %r14, TDX_HYPERCALL_r14(%r9)
+ movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+ /*
+ * Zero out registers exposed to the VMM to avoid
+ * speculative execution with VMM-controlled values.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+ xor %r12d, %r12d
+ xor %r13d, %r13d
+ xor %r14d, %r14d
+ xor %r15d, %r15d
+
+ /* Restore non-volatile GPRs that are exposed to the VMM. */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ FRAME_END
+ ret
+2:
+ ud2
+SYM_CODE_END(do_tdx_hypercall)
+
+/* Helper function for standard type of TDVMCALL */
+SYM_FUNC_START(__tdx_hypercall)
+ /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+ xor %r10, %r10
+ call do_tdx_hypercall
+ retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2020 Intel Corporation */

+#define pr_fmt(fmt) "TDX: " fmt
+
#include <asm/tdx.h>

+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output
+ * value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly the IN and MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+ u64 r14, u64 r15)
+{
+ struct tdx_hypercall_output out = {0};
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return out.r11;
+}
+
static inline bool cpuid_has_tdx_guest(void)
{
u32 eax, signature[3];
--
2.25.1

Subject: Re: [RFC v2 00/32] Add TDX Guest Support

Hi Peter/Andy,

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Hi All,

Just a gentle ping. Please let me know your comments on this patch set.
I hope it addresses the concerns you raised in RFC v1.

>
> NOTE: This series is not ready for wide public review. It is being
> specifically posted so that Peter Z and other experts on the entry
> code can look for problems with the new exception handler (#VE).
> That's also why x86@ is not being spammed.
>
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn’t been fully audited
> and fuzzed yet.
>
> TDX has a lot of similarities to SEV. It enhances confidentiality
> of guest memory and state (like registers) and includes a new exception
> (#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
> yet), TDX limits the host's ability to effect changes in the guest
> physical address space.
>
> In contrast to the SEV code in the kernel, TDX guest memory is integrity
> protected and isolated; the host is prevented from accessing guest
> memory (even ciphertext).
>
> The TDX architecture also includes a new CPU mode called
> Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
> mode arbitrates interactions between host and guest and implements many of
> the guarantees of the TDX architecture.
>
> Some of the key differences between a TD and a regular VM are:
>
> 1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
> 2. A new #VE exception handler is added. The TDX module injects #VE exception
> to the guest TD in cases of instructions that need to be emulated, disallowed
> MSR accesses, subset of CPUID leaves, etc.
> 3. By default memory is marked as private, and TD will selectively share it with
> VMM based on need.
> 4. Remote attestation is supported to enable a third party (either the owner of
> the workload or a user of the services provided by the workload) to establish
> that the workload is running on an Intel-TDX-enabled platform located within a
> TD prior to providing that workload data.
>
> You can find TDX related documents in the following link.
>
> https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
>
> Changes since v1:
> * Implemented tdcall() and tdvmcall() helper functions in assembly and renamed
> them as __tdcall() and __tdvmcall().
> * Added do_general_protection() helper function to re-use protection
> code between #GP exception and TDX #VE exception handlers.
> * Addressed syscall gap issue in #VE handler support (for details check
> the commit log in "x86/traps: Add #VE support for TDX guest").
> * Modified patch titled "x86/tdx: Handle port I/O" to re-use common
> tdvmcall() helper function.
> * Added error handling support to MADT CPU wakeup code.
> * Introduced enum tdx_map_type to identify SHARED vs PRIVATE memory type.
> * Enabled shared memory in IOAPIC driver.
> * Added BINUTILS version info for TDCALL.
> * Changed the TDVMCALL vendor id from 0 to "TDX.KVM".
> * Replaced WARN() with pr_warn_ratelimited() in __tdvmcall() wrappers.
> * Fixed commit log and code comments related review comments.
> * Renamed patch titled # "x86/topology: Disable CPU hotplug support for TDX
> platforms" to "x86/topology: Disable CPU online/offline control for
> TDX guest"
> * Rebased on top of v5.12 kernel.
>
>
> Erik Kaneda (1):
> ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure
>
> Isaku Yamahata (1):
> x86/tdx: ioapic: Add shared bit for IOAPIC base address
>
> Kirill A. Shutemov (16):
> x86/paravirt: Introduce CONFIG_PARAVIRT_XL
> x86/tdx: Get TD execution environment information via TDINFO
> x86/traps: Add #VE support for TDX guest
> x86/tdx: Add HLT support for TDX guest
> x86/tdx: Wire up KVM hypercalls
> x86/tdx: Add MSR support for TDX guest
> x86/tdx: Handle CPUID via #VE
> x86/io: Allow to override inX() and outX() implementation
> x86/tdx: Handle port I/O
> x86/tdx: Handle in-kernel MMIO
> x86/mm: Move force_dma_unencrypted() to common code
> x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
> x86/tdx: Make pages shared in ioremap()
> x86/tdx: Add helper to do MapGPA TDVMCALL
> x86/tdx: Make DMA pages shared
> x86/kvm: Use bounce buffers for TD guest
>
> Kuppuswamy Sathyanarayanan (10):
> x86/tdx: Introduce INTEL_TDX_GUEST config option
> x86/cpufeatures: Add TDX Guest CPU feature
> x86/x86: Add is_tdx_guest() interface
> x86/tdx: Add __tdcall() and __tdvmcall() helper functions
> x86/traps: Add do_general_protection() helper function
> x86/tdx: Handle MWAIT, MONITOR and WBINVD
> ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
> ACPI/table: Print MADT Wake table information
> x86/acpi, x86/boot: Add multiprocessor wake-up support
> x86/topology: Disable CPU online/offline control for TDX guest
>
> Sean Christopherson (4):
> x86/boot: Add a trampoline for APs booting in 64-bit mode
> x86/boot: Avoid #VE during compressed boot for TDX platforms
> x86/boot: Avoid unnecessary #VE during boot process
> x86/tdx: Forcefully disable legacy PIC for TDX guests
>
> arch/x86/Kconfig | 28 +-
> arch/x86/boot/compressed/Makefile | 2 +
> arch/x86/boot/compressed/head_64.S | 10 +-
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/boot/compressed/pgtable.h | 2 +-
> arch/x86/boot/compressed/tdcall.S | 9 +
> arch/x86/boot/compressed/tdx.c | 32 ++
> arch/x86/include/asm/apic.h | 3 +
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/idtentry.h | 4 +
> arch/x86/include/asm/io.h | 24 +-
> arch/x86/include/asm/irqflags.h | 38 +-
> arch/x86/include/asm/kvm_para.h | 21 +
> arch/x86/include/asm/paravirt.h | 22 +-
> arch/x86/include/asm/paravirt_types.h | 3 +-
> arch/x86/include/asm/pgtable.h | 3 +
> arch/x86/include/asm/realmode.h | 1 +
> arch/x86/include/asm/tdx.h | 176 +++++++++
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/acpi/boot.c | 79 ++++
> arch/x86/kernel/apic/apic.c | 8 +
> arch/x86/kernel/apic/io_apic.c | 12 +-
> arch/x86/kernel/asm-offsets.c | 22 ++
> arch/x86/kernel/head64.c | 3 +
> arch/x86/kernel/head_64.S | 13 +-
> arch/x86/kernel/idt.c | 6 +
> arch/x86/kernel/paravirt.c | 4 +-
> arch/x86/kernel/pci-swiotlb.c | 2 +-
> arch/x86/kernel/smpboot.c | 5 +
> arch/x86/kernel/tdcall.S | 361 +++++++++++++++++
> arch/x86/kernel/tdx-kvm.c | 45 +++
> arch/x86/kernel/tdx.c | 480 +++++++++++++++++++++++
> arch/x86/kernel/topology.c | 3 +-
> arch/x86/kernel/traps.c | 81 ++--
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/ioremap.c | 8 +-
> arch/x86/mm/mem_encrypt.c | 75 ----
> arch/x86/mm/mem_encrypt_common.c | 85 ++++
> arch/x86/mm/mem_encrypt_identity.c | 1 +
> arch/x86/mm/pat/set_memory.c | 48 ++-
> arch/x86/realmode/rm/header.S | 1 +
> arch/x86/realmode/rm/trampoline_64.S | 49 ++-
> arch/x86/realmode/rm/trampoline_common.S | 5 +-
> drivers/acpi/tables.c | 11 +
> include/acpi/actbl2.h | 26 +-
> 45 files changed, 1654 insertions(+), 162 deletions(-)
> create mode 100644 arch/x86/boot/compressed/tdcall.S
> create mode 100644 arch/x86/boot/compressed/tdx.c
> create mode 100644 arch/x86/include/asm/tdx.h
> create mode 100644 arch/x86/kernel/tdcall.S
> create mode 100644 arch/x86/kernel/tdx-kvm.c
> create mode 100644 arch/x86/kernel/tdx.c
> create mode 100644 arch/x86/mm/mem_encrypt_common.c
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-06 15:01:04

by Kirill A. Shutemov

Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

On Tue, Apr 27, 2021 at 07:31:09PM +0200, Borislav Petkov wrote:
> Or something to that effect.

See the couple of attached patches. Do they look along the lines you wanted?

The first one renames PARAVIRT_XXL and the second one introduces
PARAVIRT_HLT.

--
Kirill A. Shutemov


Attachments:
0001-x86-paravirt-Rename-PARAVIRT_XXL-to-PARAVIRT_FULL.patch (20.51 kB)
0002-x86-paravirt-Introduce-CONFIG_PARAVIRT_HLT.patch (5.64 kB)

2021-05-07 21:22:11

by Dave Hansen

Subject: Re: [RFC v2 07/32] x86/traps: Add do_general_protection() helper function

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> TDX guest #VE exception handler treats unsupported exceptions

^ The

> as #GP. So to handle the #GP, move the protection fault handler

s/So to/To/

Also, it does not "treat them as #GP". It handles them in the same way
that a #GP is handled. There's a difference between literally making
them a #GP and having a similar end result. This description conflates
them.

> code to out of exc_general_protection() and create new helper
> function for it.

I wouldn't name the functions. Just say that you want the #GP behavior
from #VE so you need a common helper.

> Also since exception handler is responsible to decide when to

^ an

> turn on/off IRQ, move cond_local_irq_{enable/disable)() calls
> out of do_general_protection().

This paragraph doesn't really say anything meaningful. Yes, exception
handlers reenable interrupts. Try to *SAY* something about why they do
this and why you have to move the code around. Or, just axe it.

> This is a preparatory patch for adding #VE exception handler
> support for TDX guests.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/kernel/traps.c | 51 ++++++++++++++++++++++-------------------
> 1 file changed, 27 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 651e3e508959..213d4aa8e337 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -527,44 +527,28 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>
> #define GPFSTR "general protection fault"
>
> -DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> +static void do_general_protection(struct pt_regs *regs, long error_code)
> {
> char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
> enum kernel_gp_hint hint = GP_NO_HINT;
> - struct task_struct *tsk;
> + struct task_struct *tsk = current;
> unsigned long gp_addr;
> int ret;
>
> - cond_local_irq_enable(regs);
> -
> - if (static_cpu_has(X86_FEATURE_UMIP)) {
> - if (user_mode(regs) && fixup_umip_exception(regs))
> - goto exit;
> - }
> -
> - if (v8086_mode(regs)) {
> - local_irq_enable();
> - handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
> - local_irq_disable();
> - return;
> - }
> -
> - tsk = current;
> -
> if (user_mode(regs)) {
> tsk->thread.error_code = error_code;
> tsk->thread.trap_nr = X86_TRAP_GP;
>
> if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
> - goto exit;
> + return;
>
> show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
> force_sig(SIGSEGV);
> - goto exit;
> + return;
> }
>
> if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
> - goto exit;
> + return;
>
> tsk->thread.error_code = error_code;
> tsk->thread.trap_nr = X86_TRAP_GP;
> @@ -576,11 +560,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> if (!preemptible() &&
> kprobe_running() &&
> kprobe_fault_handler(regs, X86_TRAP_GP))
> - goto exit;
> + return;
>
> ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);

So... We're going to send signals based on #VE which use this bit in the
ABI which is documented as:

#define X86_TRAP_GP 13 /* General Protection Fault */

Considering that there is also a:

#define X86_TRAP_VE 20 /* Virtualization Exception */

this seems like a stretch.

Also, isn't there a lot of truly #GP-specific code in there, like
fixup_exception()? Why do you need to call that for #VE? How did you
decide what remains in the handler versus what gets separated out?

> if (ret == NOTIFY_STOP)
> - goto exit;
> + return;
>
> if (error_code)
> snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
> @@ -601,8 +585,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> gp_addr = 0;
>
> die_addr(desc, regs, error_code, gp_addr);
> +}
>
> -exit:
> +DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> +{
> + cond_local_irq_enable(regs);
> +
> + if (static_cpu_has(X86_FEATURE_UMIP)) {
> + if (user_mode(regs) && fixup_umip_exception(regs)) {
> + cond_local_irq_disable(regs);
> + return;
> + }
> + }
> +
> + if (v8086_mode(regs)) {
> + local_irq_enable();
> + handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
> + local_irq_disable();
> + return;
> + }
> +
> + do_general_protection(regs, error_code);
> cond_local_irq_disable(regs);
> }

2021-05-07 21:37:24

by Dave Hansen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
...
> The #VE cannot be nested before TDGETVEINFO is called, if there is any
> reason for it to nest the TD would shut down. The TDX module guarantees
> that no NMIs (or #MC or similar) can happen in this window. After
> TDGETVEINFO the #VE handler can nest if needed, although we don’t expect
> it to happen normally.

I think this description really needs some work. Does "The #VE cannot
be nested" mean that "hardware guarantees that #VE will not be
generated", or "the #VE must not be nested"?

What does "the TD would shut down" mean? I think you mean that instead
of delivering a nested #VE the hardware would actually exit to the host
and TDX would prevent the guest from being reentered. Right?

> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 5eb3bdf36a41..41a0732d5f68 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
> DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
> #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
> +#endif
> +
> /* Device interrupts common/spurious */
> DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
> #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index c5a870cef0ae..1ca55d8e9963 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -11,6 +11,7 @@
> #include <linux/types.h>
>
> #define TDINFO 1
> +#define TDGETVEINFO 3
>
> struct tdcall_output {
> u64 rcx;
> @@ -29,6 +30,20 @@ struct tdvmcall_output {
> u64 r15;
> };
>
> +struct ve_info {
> + u64 exit_reason;
> + u64 exit_qual;
> + u64 gla;
> + u64 gpa;
> + u32 instr_len;
> + u32 instr_info;
> +};

Is this an architectural structure or some software construct?

> +unsigned long tdg_get_ve_info(struct ve_info *ve);
> +
> +int tdg_handle_virtualization_exception(struct pt_regs *regs,
> + struct ve_info *ve);
> +
> /* Common API to check TDX support in decompression and common kernel code. */
> bool is_tdx_guest(void);
>
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index ee1a283f8e96..546b6b636c7d 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
> */
> INTG(X86_TRAP_PF, asm_exc_page_fault),
> #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
> };
>
> /*
> @@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
> INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
> INTG(X86_TRAP_AC, asm_exc_alignment_check),
> INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
>
> #ifdef CONFIG_X86_32
> TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b63275db1db9..ccfcb07bfb2c 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -82,6 +82,44 @@ static void tdg_get_info(void)
> td_info.attributes = out.rdx;
> }
>
> +unsigned long tdg_get_ve_info(struct ve_info *ve)
> +{
> + u64 ret;
> + struct tdcall_output out = {0};
> +
> + /*
> + * The #VE cannot be nested before TDGETVEINFO is called,
> + * if there is any reason for it to nest the TD would shut
> + * down. The TDX module guarantees that no NMIs (or #MC or
> + * similar) can happen in this window. After TDGETVEINFO
> + * the #VE handler can nest if needed, although we don’t
> + * expect it to happen normally.
> + */

I find that description a bit unsatisfying. Could we make this a bit
more concrete? By the way, what about *normal* interrupts?

Maybe we should talk about this in terms of *rules* that folks need to
follow. Maybe:

NMIs and machine checks are suppressed. Before this point any
#VE is fatal. After this point, NMIs and additional #VEs are
permitted.

> + ret = __tdcall(TDGETVEINFO, 0, 0, 0, 0, &out);
> +
> + ve->exit_reason = out.rcx;
> + ve->exit_qual = out.rdx;
> + ve->gla = out.r8;
> + ve->gpa = out.r9;
> + ve->instr_len = out.r10 & UINT_MAX;
> + ve->instr_info = out.r10 >> 32;
> +
> + return ret;
> +}
> +
> +int tdg_handle_virtualization_exception(struct pt_regs *regs,
> + struct ve_info *ve)
> +{
> + /*
> + * TODO: Add handler support for various #VE exit
> + * reasons. It will be added by other patches in
> + * the series.
> + */
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return -EFAULT;
> +}
> +
> void __init tdx_early_init(void)
> {
> if (!cpuid_has_tdx_guest())
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 213d4aa8e337..64869aa88a5a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -61,6 +61,7 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/vdso.h>
> +#include <asm/tdx.h>
>
> #ifdef CONFIG_X86_64
> #include <asm/x86_init.h>
> @@ -1140,6 +1141,35 @@ DEFINE_IDTENTRY(exc_device_not_available)
> }
> }
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> + struct ve_info ve;
> + int ret;
> +
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> + /*
> + * Consume #VE info before re-enabling interrupts. It will be
> + * re-enabled after executing the TDGETVEINFO TDCALL.
> + */

"It" is nebulous here. Is this talking about NMIs, or the
cond_local_irq_enable() that is "after" TDGETVEINFO?

> + ret = tdg_get_ve_info(&ve);
> +
> + cond_local_irq_enable(regs);
> +
> + if (!ret)
> + ret = tdg_handle_virtualization_exception(regs, &ve);
> + /*
> + * If tdg_handle_virtualization_exception() could not process
> + * it successfully, treat it as #GP(0) and handle it.
> + */
> + if (ret)
> + do_general_protection(regs, 0);
> +
> + cond_local_irq_disable(regs);
> +}
> +#endif
> +
> #ifdef CONFIG_X86_32
> DEFINE_IDTENTRY_SW(iret_error)
> {
>

2021-05-07 21:47:27

by Dave Hansen

Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.

How about:

KVM hypercalls use the "vmcall" or "vmmcall" instructions. Although the
ABI is similar, those instructions no longer function for TDX guests.
Make TDVMCALLs instead of VMCALL/VMMCALL.

> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..2fa85481520b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -6,6 +6,7 @@
> #include <asm/alternative.h>
> #include <linux/interrupt.h>
> #include <uapi/asm/kvm_para.h>
> +#include <asm/tdx.h>
>
> extern void kvmclock_init(void);
>
> @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
> static inline long kvm_hypercall0(unsigned int nr)
> {
> long ret;
> +
> + if (is_tdx_guest())
> + return tdx_kvm_hypercall0(nr);

... all of these look OK.

> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 81af70c2acbd..964bfd7fc682 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -11,6 +11,7 @@
> * refer to TDX GHCI specification).
> */
> #define TDVMCALL_EXPOSE_REGS_MASK 0xfc00
> +#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
>
> /*
> * TDX guests use the TDCALL instruction to make
> @@ -198,3 +199,9 @@ SYM_FUNC_START(__tdvmcall)
> call do_tdvmcall
> retq
> SYM_FUNC_END(__tdvmcall)
> +
> +SYM_FUNC_START(__tdvmcall_vendor_kvm)
> + movq $TDVMCALL_VENDOR_KVM, %r10
> + call do_tdvmcall
> + retq
> +SYM_FUNC_END(__tdvmcall_vendor_kvm)

Granted, this is not a ton of assembly. But, it does look a bit weird.
It needs a comment and/or a mention in the changelog.

R10 is not part of the function call ABI, but it is a part of the
TDVMCALL ABI. This little assembly wrapper lets us reuse do_tdvmcall()
for both KVM-specific hypercalls TDVMCALL_VENDOR_KVM and the more
generic __tdvmcalls.

> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -8,6 +8,10 @@
>
> #include <linux/cpu.h>
>
> +#ifdef CONFIG_KVM_GUEST
> +#include "tdx-kvm.c"
> +#endif
> +
> static struct {
> unsigned int gpa_width;
> unsigned long attributes;

I know KVM does weird stuff. But, this is *really* weird. Why are we
#including a .c file into another .c file?

2021-05-07 21:54:12

by Dave Hansen

Subject: Re: [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> exit reason.

This needs a bit of a history lesson. "In traditional VMs, MMIO tends
to be implemented by giving a guest access to a mapping which will
cause a VMEXIT on access. That's not possible in a TDX guest..."

> For now we only handle a subset of the instructions that the kernel
> uses for MMIO operations. User-space access triggers SIGBUS.

I still don't think that TDX guests should be doing things that they
*KNOW* will cause #VE, including MMIO. I really want to hear a more
discrete story about why this is the *best* way to do this for Linux
instead of just a hack from the Windows binary driver ecosystem that
seemed expedient.

2021-05-07 21:55:04

by Dave Hansen

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

> +++ b/arch/x86/mm/mem_encrypt_common.c
...
> +/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> +bool force_dma_unencrypted(struct device *dev)
> +{
> + /*
> + * For SEV, all DMA must be to unencrypted/shared addresses.
> + */
> + if (sev_active())
> + return true;
> +
> + /*
> + * For SME, all DMA must be to unencrypted addresses if the
> + * device does not support DMA to addresses that include the
> + * encryption mask.
> + */
> + if (sme_active()) {
> + u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
> + u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
> + dev->bus_dma_limit);
> +
> + if (dma_dev_mask <= dma_enc_mask)
> + return true;
> + }
> +
> + return false;
> +}

This doesn't seem much like common code to me. It seems like 100% SEV
code. Is this really where we want to move it?

2021-05-07 21:57:13

by Dave Hansen

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> static unsigned int __ioremap_check_encrypted(struct resource *res)
> {
> - if (!sev_active())
> + if (!sev_active() && !is_tdx_guest())
> return 0;

I think it's time to come up with a real name for all of the code that's
under: (sev_active() || is_tdx_guest()).

"encrypted" isn't it, for sure.

2021-05-07 22:40:17

by Andi Kleen

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()


On 5/7/2021 2:55 PM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> static unsigned int __ioremap_check_encrypted(struct resource *res)
>> {
>> - if (!sev_active())
>> + if (!sev_active() && !is_tdx_guest())
>> return 0;
> I think it's time to come up with a real name for all of the code that's
> under: (sev_active() || is_tdx_guest()).
>
> "encrypted" isn't it, for sure.

I called it protected_guest() in some other patches.

-Andi

2021-05-07 23:07:00

by Dave Hansen

Subject: Re: [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: Isaku Yamahata <[email protected]>
>
> IOAPIC is emulated by KVM which means its MMIO address is shared
> by host. Add shared bit for base address of IOAPIC.
> Most MMIO region is handled by ioremap which is already marked
> as shared for TDX guest platform, but IOAPIC is an exception which
> uses fixed map.

Ho hum... I guess I'll rewrite the changelog:

The kernel interacts with each bare-metal IOAPIC with a special MMIO
page. When running under KVM, the guest's IOAPICs are emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC mapping
as "shared" with the host. This ensures that TDX private protections
are not applied to the page, which allows the TDX host emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as shared.
However, the IOAPIC code does not use ioremap() and instead uses the
fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code. Ensure that
it marks IOAPIC pages as "shared". This replaces set_fixmap_nocache()
with __set_fixmap() since __set_fixmap() allows custom 'prot' values.

> arch/x86/kernel/apic/io_apic.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 73ff4dd426a8..2a01d4a82be7 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2675,6 +2675,14 @@ static struct resource * __init ioapic_setup_resources(void)
> return res;
> }
>
> +static void io_apic_set_fixmap_nocache(enum fixed_addresses idx, phys_addr_t phys)
> +{
> + pgprot_t flags = FIXMAP_PAGE_NOCACHE;
> + if (is_tdx_guest())
> + flags = pgprot_tdg_shared(flags);
> + __set_fixmap(idx, phys, flags);
> +}

^ This seems like it could at least use a one-liner comment.

> void __init io_apic_init_mappings(void)
> {
> unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
> @@ -2707,7 +2715,7 @@ void __init io_apic_init_mappings(void)
> __func__, PAGE_SIZE, PAGE_SIZE);
> ioapic_phys = __pa(ioapic_phys);
> }
> - set_fixmap_nocache(idx, ioapic_phys);
> + io_apic_set_fixmap_nocache(idx, ioapic_phys);
> apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
> __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
> ioapic_phys);
> @@ -2836,7 +2844,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
> ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
> ioapics[idx].mp_config.apicaddr = address;
>
> - set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
> + io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
> if (bad_ioapic_register(idx)) {
> clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
> return -ENODEV;
>

Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls



On 5/7/21 2:46 PM, Dave Hansen wrote:
> I know KVM does weird stuff. But, this is*really* weird. Why are we
> #including a .c file into another .c file?

I think Kirill implemented it this way to skip Makefile changes for it. I don't
see any other KVM direct dependencies in tdx.c.

I will fix it in next version.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-10 08:09:59

by Jürgen Groß

[permalink] [raw]
Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

On 27.04.21 19:31, Borislav Petkov wrote:
> + Jürgen.
>
> On Mon, Apr 26, 2021 at 11:01:28AM -0700, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
>> a separate config option. It provides a middle ground for
>> not-so-deep paravirtulized environments.
>
> Please introduce a spellchecker into your patch creation workflow.
>
> Also, what does "not-so-deep" mean?
>
>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>> config would be a bloat for TDX.
>
> Used how? Why is it bloat for TDX?

Is there any major downside to move the halt related pvops functions
from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?

I'd rather introduce a new PARAVIRT level only in case of multiple
pvops functions needed for a new guest type, or if a real hot path
would be affected.


Juergen



2021-05-10 16:09:08

by Jürgen Groß

[permalink] [raw]
Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

On 10.05.21 17:52, Andi Kleen wrote:
>>>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>>>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>>>> config would be a bloat for TDX.
>>>
>>> Used how? Why is it bloat for TDX?
>>
>> Is there any major downside to move the halt related pvops functions
>> from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?
>
> I think the main motivation is to get rid of all the page table related
> hooks for modern configurations. These are the bulk of the annotations
> and cause bloat and worse code. Shadow page tables are really obscure
> these days and very few people still need them, and it's totally
> reasonable to build even widely used distribution kernels without them.
> In contrast, most of the other hooks are comparatively few and also on
> comparatively slow paths, so they don't really matter too much.
>
> I think it would be ok to have a CONFIG_PARAVIRT that does not have page
> table support, and a separate config option for those (that could be
> eventually deprecated).
>
> But that would break existing .configs for those shadow stack users,
> that's why I think Kirill did it the other way around.

No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
other hypervisor's guests, supporting basically the TLB flush operations
and time related operations only. Adding the halt related operations to
PARAVIRT wouldn't break anything.


Juergen



2021-05-10 18:03:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

>>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>>> config would be a bloat for TDX.
>>
>> Used how? Why is it bloat for TDX?
>
> Is there any major downside to move the halt related pvops functions
> from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?

I think the main motivation is to get rid of all the page table related
hooks for modern configurations. These are the bulk of the annotations
and cause bloat and worse code. Shadow page tables are really obscure
these days and very few people still need them, and it's totally
reasonable to build even widely used distribution kernels without them.
In contrast, most of the other hooks are comparatively few and also on
comparatively slow paths, so they don't really matter too much.

I think it would be ok to have a CONFIG_PARAVIRT that does not have page
table support, and a separate config option for those (that could be
eventually deprecated).

But that would break existing .configs for those shadow page table users,
that's why I think Kirill did it the other way around.

-Andi


2021-05-10 22:00:22

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Mon, Apr 26, 2021 at 11:02 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: "Kirill A. Shutemov" <[email protected]>

While I do not expect that a patch in the middle of a series needs the
full introduction of all concepts, the high expectations this changelog
places on the reader's context make the patch actively painful to read.

Some connective tissue commentary to list assumptions and pointers to
definitions is needed to make this patch stand alone when a future
bisect lands on it and someone wonders where to get started debugging
it.

> Unroll string operations and handle port I/O through TDVMCALLs.
> Also handle #VE due to I/O operations with the same TDVMCALLs.

There is a mix of direct TDVMCALL usage and #VE handling. When and why
is either approach used?

> Decompression code uses port IO for earlyprintk. We must use
> paravirt calls there too if we want to allow earlyprintk.

What is the tradeoff between teaching the decompression code to handle
#VE (the implied assumption) vs teaching it to avoid #VE with direct
TDVMCALLs (the chosen direction)?

Rewrite without "we":

"Given the need to support earlyprintk for protected guests, deploy
paravirt calls for the in*() and out*() usage in the decompress code."

This raises the question of why the cover letter switched from
explicitly saying TDVMCALL to "paravirt" where it could be confused
with the typical paravirt helpers?

>
> Decompresion code cannot deal with alternatives: use branches

s/Decompresion/Decompression/

> instead to implement inX() and outX() helpers.
>
> Since we use call instruction in place of in/out instruction,
> the argument passed to call instruction has to be in a
> register, it cannot be an immediate value like in/out
> instruction. So change constraint flag from "Nd" to "d"

Rewrite without "we":

With the approach to use a "call" instruction as an alternative for an
"in/out" instruction it is no longer the case that the argument can be
an immediate value. Change the asm constraint flag from "Nd" to "d" to
accommodate.

>
> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> ---
> arch/x86/boot/compressed/Makefile | 1 +
> arch/x86/boot/compressed/tdcall.S | 9 ++
> arch/x86/include/asm/io.h | 5 +-
> arch/x86/include/asm/tdx.h | 46 ++++++++-
> arch/x86/kernel/tdcall.S | 154 ++++++++++++++++++++++++++++++

Why is this named "tdcall" when it is implementing tdvmcalls? I must
say those names don't really help me understand what they do. Can we
have Linux names that don't mandate keeping the spec terminology in my
brain's translation cache?

> arch/x86/kernel/tdx.c | 33 +++++++
> 6 files changed, 245 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/boot/compressed/tdcall.S
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index a2554621cefe..a944a2038797 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -97,6 +97,7 @@ endif
>
> vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
> +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
>
> vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
> efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
> diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
> new file mode 100644
> index 000000000000..5ebb80d45ad8
> --- /dev/null
> +++ b/arch/x86/boot/compressed/tdcall.S
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <asm/export.h>
> +
> +/* Do not export symbols in decompression code */
> +#undef EXPORT_SYMBOL
> +#define EXPORT_SYMBOL(sym)

What's wrong with the existing:

KBUILD_CFLAGS += -D__DISABLE_EXPORTS

...in arch/x86/boot/compressed/Makefile?

> +
> +#include "../../kernel/tdcall.S"
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index ef7a686a55a9..30a3b30395ad 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -43,6 +43,7 @@
> #include <asm/page.h>
> #include <asm/early_ioremap.h>
> #include <asm/pgtable_types.h>
> +#include <asm/tdx.h>
>
> #define build_mmio_read(name, size, type, reg, barrier) \
> static inline type name(const volatile void __iomem *addr) \
> @@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port) \
> \
> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
> { \
> - if (sev_key_active()) { \
> + if (sev_key_active() || is_tdx_guest()) { \

Is there a unified Linux name these can be given to stop the
proliferation of poor vendor names for similar concepts?


> unsigned type *value = (unsigned type *)addr; \
> while (count) { \
> out##bwl(*value, port); \
> @@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
> \
> static inline void ins##bwl(int port, void *addr, unsigned long count) \
> { \
> - if (sev_key_active()) { \
> + if (sev_key_active() || is_tdx_guest()) { \
> unsigned type *value = (unsigned type *)addr; \
> while (count) { \
> *value = in##bwl(port); \
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index e0b3ed9e262c..b972c6531a53 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -5,6 +5,8 @@
>
> #define TDX_CPUID_LEAF_ID 0x21
>
> +#ifndef __ASSEMBLY__
> +
> #ifdef CONFIG_INTEL_TDX_GUEST
>
> #include <asm/cpufeature.h>
> @@ -67,6 +69,48 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
> long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
> unsigned long p3, unsigned long p4);
>
> +/* Decompression code doesn't know how to handle alternatives */

Does it also not know how to handle #VE to keep it aligned with the
runtime code?

> +#ifdef BOOT_COMPRESSED_MISC_H
> +#define __out(bwl, bw) \
> +do { \
> + if (is_tdx_guest()) { \
> + asm volatile("call tdg_out" #bwl : : \
> + "a"(value), "d"(port)); \
> + } else { \
> + asm volatile("out" #bwl " %" #bw "0, %w1" : : \
> + "a"(value), "Nd"(port)); \
> + } \
> +} while (0)
> +#define __in(bwl, bw) \
> +do { \
> + if (is_tdx_guest()) { \
> + asm volatile("call tdg_in" #bwl : \
> + "=a"(value) : "d"(port)); \
> + } else { \
> + asm volatile("in" #bwl " %w1, %" #bw "0" : \
> + "=a"(value) : "Nd"(port)); \
> + } \
> +} while (0)
> +#else
> +#define __out(bwl, bw) \
> + alternative_input("out" #bwl " %" #bw "1, %w2", \
> + "call tdg_out" #bwl, X86_FEATURE_TDX_GUEST, \
> + "a"(value), "d"(port))
> +
> +#define __in(bwl, bw) \
> + alternative_io("in" #bwl " %w2, %" #bw "0", \
> + "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST, \
> + "=a"(value), "d"(port))

Outside the boot decompression code isn't this branch of the "ifdef
BOOT_COMPRESSED_MISC_H" handled by #VE? I also don't see any usage of
__{in,out}() in this patch.

> +#endif
> +
> +void tdg_outb(unsigned char value, unsigned short port);
> +void tdg_outw(unsigned short value, unsigned short port);
> +void tdg_outl(unsigned int value, unsigned short port);
> +
> +unsigned char tdg_inb(unsigned short port);
> +unsigned short tdg_inw(unsigned short port);
> +unsigned int tdg_inl(unsigned short port);
> +
> #else // !CONFIG_INTEL_TDX_GUEST
>
> static inline bool is_tdx_guest(void)
> @@ -106,5 +150,5 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> }
>
> #endif /* CONFIG_INTEL_TDX_GUEST */
> -
> +#endif /* __ASSEMBLY__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 964bfd7fc682..df4159bb5103 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
> #include <asm/asm.h>
> #include <asm/frame.h>
> #include <asm/unwind_hints.h>
> +#include <asm/export.h>
>
> #include <linux/linkage.h>
>
> @@ -12,6 +13,12 @@
> */
> #define TDVMCALL_EXPOSE_REGS_MASK 0xfc00
> #define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
> +#define EXIT_REASON_IO_INSTRUCTION 30
> +/*
> + * Current size of struct tdvmcall_output is 40 bytes,
> + * but allocate double to account future changes.

What future changes? Why could they not be handled as future code changes?

> + */
> +#define TDVMCALL_OUTPUT_SIZE 80

Perhaps "PAYLOAD_SIZE" since it is used for both input and output?

If the ABI does not include the size of the payload then how would
code detect if even 80 bytes was violated in the future?

>
> /*
> * TDX guests use the TDCALL instruction to make
> @@ -205,3 +212,150 @@ SYM_FUNC_START(__tdvmcall_vendor_kvm)
> call do_tdvmcall
> retq
> SYM_FUNC_END(__tdvmcall_vendor_kvm)
> +
> +.macro io_save_registers
> + push %rbp
> + push %rbx
> + push %rcx
> + push %rdx
> + push %rdi
> + push %rsi
> + push %r8
> + push %r9
> + push %r10
> + push %r11
> + push %r12
> + push %r13
> + push %r14
> + push %r15

Surely there's an existing macro for this pattern? Would
PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
would eliminate clearing of %r8.

> +.endm
> +.macro io_restore_registers
> + pop %r15
> + pop %r14
> + pop %r13
> + pop %r12
> + pop %r11
> + pop %r10
> + pop %r9
> + pop %r8
> + pop %rsi
> + pop %rdi
> + pop %rdx
> + pop %rcx
> + pop %rbx
> + pop %rbp
> +.endm
> +
> +/*
> + * tdg_out{b,w,l}() - Write given data to the specified port.
> + *
> + * @arg1 (RAX) - Value to be written (passed via R8 to do_tdvmcall()).
> + * @arg2 (RDX) - Port id (passed via RCX to do_tdvmcall()).
> + *
> + */
> +SYM_FUNC_START(tdg_outb)
> + io_save_registers
> + xor %r8, %r8
> + /* Move data to R8 register */
> + mov %al, %r8b
> + /* Set data width to 1 byte */
> + mov $1, %rsi
> + jmp 1f
> +
> +SYM_FUNC_START(tdg_outw)
> + io_save_registers
> + xor %r8, %r8
> + /* Move data to R8 register */
> + mov %ax, %r8w
> + /* Set data width to 2 bytes */
> + mov $2, %rsi
> + jmp 1f
> +
> +SYM_FUNC_START(tdg_outl)
> + io_save_registers
> + xor %r8, %r8
> + /* Move data to R8 register */
> + mov %eax, %r8d
> + /* Set data width to 4 bytes */
> + mov $4, %rsi
> +1:
> + /*
> + * Since io_save_registers does not save rax
> + * state, save it here so that we can preserve
> + * the caller register state.
> + */
> + push %rax
> +
> + mov %rdx, %rcx
> + /* Set 1 in RDX to select out operation */
> + mov $1, %rdx
> + /* Set TDVMCALL function id in RDI */
> + mov $EXIT_REASON_IO_INSTRUCTION, %rdi
> + /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> + xor %r10, %r10
> + /* Since we don't use tdvmcall output, set it to NULL */
> + xor %r9, %r9
> +
> + call do_tdvmcall
> +
> + pop %rax
> + io_restore_registers
> + ret
> +SYM_FUNC_END(tdg_outb)
> +SYM_FUNC_END(tdg_outw)
> +SYM_FUNC_END(tdg_outl)
> +EXPORT_SYMBOL(tdg_outb)
> +EXPORT_SYMBOL(tdg_outw)
> +EXPORT_SYMBOL(tdg_outl)
> +
> +/*
> + * tdg_in{b,w,l}() - Read data to the specified port.
> + *
> + * @arg1 (RDX) - Port id (passed via RCX to do_tdvmcall()).
> + *
> + * Returns data read via RAX register.
> + *
> + */
> +SYM_FUNC_START(tdg_inb)
> + io_save_registers
> + /* Set data width to 1 byte */
> + mov $1, %rsi
> + jmp 1f
> +
> +SYM_FUNC_START(tdg_inw)
> + io_save_registers
> + /* Set data width to 2 bytes */
> + mov $2, %rsi
> + jmp 1f
> +
> +SYM_FUNC_START(tdg_inl)
> + io_save_registers
> + /* Set data width to 4 bytes */
> + mov $4, %rsi
> +1:
> + mov %rdx, %rcx
> + /* Set 0 in RDX to select in operation */
> + mov $0, %rdx
> + /* Set TDVMCALL function id in RDI */
> + mov $EXIT_REASON_IO_INSTRUCTION, %rdi
> + /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> + xor %r10, %r10
> + /* Allocate memory in stack for Output */
> + subq $TDVMCALL_OUTPUT_SIZE, %rsp

Why is this leaf function responsibility? I would expect the core
do_tdvmcall (or whatever it is renamed to) helper to hide output
buffer payload handling. tdg_in* only wants 1, 2, or 4 bytes, not 40
bytes of payload to handle.

> + /* Move tdvmcall_output pointer to R9 */
> + movq %rsp, %r9
> +
> + call do_tdvmcall
> +
> + /* Move data read from port to RAX */
> + mov TDVMCALL_r11(%r9), %eax

"TDVMCALL_r11" is unreadable, what is that doing?

Shouldn't failed in* calls signal failure with an all ones result?

> + /* Free allocated memory */
> + addq $TDVMCALL_OUTPUT_SIZE, %rsp
> + io_restore_registers
> + ret
> +SYM_FUNC_END(tdg_inb)
> +SYM_FUNC_END(tdg_inw)
> +SYM_FUNC_END(tdg_inl)
> +EXPORT_SYMBOL(tdg_inb)
> +EXPORT_SYMBOL(tdg_inw)
> +EXPORT_SYMBOL(tdg_inl)
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e42e260df245..ec61f2f06c98 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -189,6 +189,36 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
> regs->dx = out.r15;
> }
>
> +static void tdg_out(int size, int port, unsigned int value)
> +{
> + tdvmcall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
> +}
> +
> +static unsigned int tdg_in(int size, int port)
> +{
> + return tdvmcall_out_r11(EXIT_REASON_IO_INSTRUCTION, size, 0, port, 0);
> +}
> +
> +static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> + bool string = exit_qual & 16;
> + int out, size, port;
> +
> + /* I/O strings ops are unrolled at build time. */
> + BUG_ON(string);
> +
> + out = (exit_qual & 8) ? 0 : 1;
> + size = (exit_qual & 7) + 1;
> + port = exit_qual >> 16;

This seems to be begging for exit_qual helpers to put symbolic names
on these operations.
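[Editor's note: such helpers might look like the hedged sketch below. All names are invented for illustration, not taken from a posted patch; the bit layout follows the open-coded decode above: bits 2:0 are size-1, bit 3 is direction (set = IN), bit 4 is string op, bits 31:16 are the port number.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical symbolic decoders for the I/O exit qualification. */
static inline int  ve_io_size(uint32_t q)      { return (q & 7) + 1; }
static inline bool ve_io_is_in(uint32_t q)     { return q & 8; }
static inline bool ve_io_is_string(uint32_t q) { return q & 16; }
static inline int  ve_io_port(uint32_t q)      { return q >> 16; }
```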

> +
> + if (out) {
> + tdg_out(size, port, regs->ax);
> + } else {
> + regs->ax &= ~GENMASK(8 * size, 0);
> + regs->ax |= tdg_in(size, port) & GENMASK(8 * size, 0);
> + }
> +}
> +
> unsigned long tdg_get_ve_info(struct ve_info *ve)
> {
> u64 ret;
> @@ -238,6 +268,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
> case EXIT_REASON_CPUID:
> tdg_handle_cpuid(regs);
> break;
> + case EXIT_REASON_IO_INSTRUCTION:
> + tdg_handle_io(regs, ve->exit_qual);
> + break;
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return -EFAULT;
> --
> 2.25.1
>

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/7/21 2:54 PM, Dave Hansen wrote:
> This doesn't seem much like common code to me. It seems like 100% SEV
> code. Is this really where we want to move it?

Both the SEV and TDX code need to enable
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define the
force_dma_unencrypted() function.

force_dma_unencrypted() is modified by the patch titled "x86/tdx: Make
DMA pages shared" to add TDX guest specific support.

Since both the SEV and TDX code use it, it is moved to a common file.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

Hi Dave,

On 5/7/21 3:38 PM, Andi Kleen wrote:
>
> On 5/7/2021 2:55 PM, Dave Hansen wrote:
>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>>   static unsigned int __ioremap_check_encrypted(struct resource *res)
>>>   {
>>> -    if (!sev_active())
>>> +    if (!sev_active() && !is_tdx_guest())
>>>           return 0;
>> I think it's time to come up with a real name for all of the code that's
>> under: (sev_active() || is_tdx_guest()).
>>
>> "encrypted" isn't it, for sure.
>
> I called it protected_guest() in some other patches.

If you are also fine with the above-mentioned function name, I can
include it in this series. Since we have many uses of this condition, it
will be useful to define it as a helper function.

>
> -Andi
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-10 22:24:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On 5/10/21 3:19 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/7/21 2:54 PM, Dave Hansen wrote:
>> This doesn't seem much like common code to me.  It seems like 100% SEV
>> code.  Is this really where we want to move it?
>
> Both the SEV and TDX code need to enable
> CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define the
> force_dma_unencrypted() function.
>
> force_dma_unencrypted() is modified by the patch titled "x86/tdx: Make
> DMA pages shared" to add TDX guest specific support.
>
> Since both the SEV and TDX code use it, it is moved to a common file.

That's not an excuse to have a bunch of AMD (or Intel) feature-specific
code in a file named "common". I'd make an attempt to keep them
separate and then call into the two separate functions *from* the common
function.

2021-05-10 22:34:47

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On 5/10/21 3:23 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>
>>>> -    if (!sev_active())
>>>> +    if (!sev_active() && !is_tdx_guest())
>>>>           return 0;
>>> I think it's time to come up with a real name for all of the code that's
>>> under: (sev_active() || is_tdx_guest()).
>>>
>>> "encrypted" isn't it, for sure.
>>
>> I called it protected_guest() in some other patches.
>
> If you are also fine with the above-mentioned function name, I can
> include it in this series. Since we have many uses of this condition,
> it will be useful to define it as a helper function.

FWIW, I think sev_active() has a horrible name. Shouldn't that be
"is_sev_guest()"? "sev_active()" could be read as "I'm a SEV host" or
"I'm a SEV guest" and "SEV is active".

protected_guest() seems fine to cover both, despite the horrid SEV
naming. It'll actually be nice to banish it from appearing in many of
its uses. :)

2021-05-10 22:53:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

+Boris, who has similar opinions on sev_active().

On Mon, May 10, 2021, Dave Hansen wrote:
> On 5/10/21 3:23 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>>
> >>>> -    if (!sev_active())
> >>>> +    if (!sev_active() && !is_tdx_guest())
> >>>>           return 0;
> >>> I think it's time to come up with a real name for all of the code that's
> >>> under: (sev_active() || is_tdx_guest()).
> >>>
> >>> "encrypted" isn't it, for sure.
> >>
> >> I called it protected_guest() in some other patches.
> >
> > If you are also fine with the above-mentioned function name, I can
> > include it in this series. Since we have many uses of this condition,
> > it will be useful to define it as a helper function.
>
> FWIW, I think sev_active() has a horrible name. Shouldn't that be
> "is_sev_guest()"? "sev_active()" could be read as "I'm a SEV host" or
> "I'm a SEV guest" and "SEV is active".

I can't find the thread offhand, but Boris proposed something along the lines of
cpu_has(), but specific to a given flavor of protected guest. IIRC, it was
sev_guest_has(SEV_ES) or something like that.

I 100% agree that we should have actual feature bits somewhere for the various
protected guest flavors.
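[Editor's note: one possible shape of the feature-bit idea, sketched in userspace C. All names and the initialization are invented here for illustration; Boris's actual proposal may look quite different.]

```c
#include <stdbool.h>

/* Hypothetical protected-guest capability bits. */
enum pg_feature {
	PG_MEM_ENCRYPT   = 1 << 0,	/* guest memory is encrypted */
	PG_VE_EXCEPTIONS = 1 << 1,	/* #VC/#VE-style exceptions  */
};

/* Assume detection found an encrypted guest without #VE for this sketch. */
static unsigned long pg_features = PG_MEM_ENCRYPT;

/* Code asks about features, not vendor technologies. */
static bool protected_guest_has(enum pg_feature f)
{
	return pg_features & f;
}
```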

> protected_guest() seems fine to cover both, despite the horrid SEV
> naming. It'll actually be nice to banish it from appearing in many of
> its uses. :)

2021-05-10 23:09:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O


On 5/10/2021 2:57 PM, Dan Williams wrote:
>
> There is a mix of direct TDVMCALL usage and #VE handling. When and why
> is either approach used?

For the really early code in the decompressor or the main kernel we
can't use #VE because the IDT needed for handling the exception is not
set up, and some other infrastructure needed by the handler is missing.
The early code needs to do port IO to be able to write the early serial
console. To keep it all common it ended up that all port IO is paravirt.
Actually for most of the main kernel port IO calls we could just use #VE
and it would result in smaller binaries, but then we would need to
annotate all early port IO with some special name. That's why port IO is
all TDCALL.

For some others, the only thing that really has to be #VE is MMIO,
because we don't want to annotate every MMIO read*/write* with an
alternative (which would result in incredible binary bloat). The others
have now mostly become direct calls.


>
>> Decompression code uses port IO for earlyprintk. We must use
>> paravirt calls there too if we want to allow earlyprintk.
> What is the tradeoff between teaching the decompression code to handle
> #VE (the implied assumption) vs teaching it to avoid #VE with direct
> TDVMCALLs (the chosen direction)?

The decompression code only really needs it to output something. But you
couldn't debug anything until #VE is set up. Also the decompression code
has a very basic environment that doesn't supply most kernel services,
and the #VE handler is relatively complicated. It would probably need to
be duplicated and the instruction decoder be ported to work in this
environment. It would be all a lot of work, just to make the debug
output work.

>
>> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> ---
>> arch/x86/boot/compressed/Makefile | 1 +
>> arch/x86/boot/compressed/tdcall.S | 9 ++
>> arch/x86/include/asm/io.h | 5 +-
>> arch/x86/include/asm/tdx.h | 46 ++++++++-
>> arch/x86/kernel/tdcall.S | 154 ++++++++++++++++++++++++++++++
> Why is this named "tdcall" when it is implementing tdvmcalls? I must
> say those names don't really help me understand what they do. Can we
> have Linux names that don't mandate keeping the spec terminology in my
> brain's translation cache?

The instruction is called TDCALL. It's always the same instruction

TDVMCALL is the variant when the host processes it (as opposed to the
TDX module), but it's just a different name space in the call number.



> Is there a unified Linux name these can be given to stop the
> proliferation of poor vendor names for similar concepts?

We could use protected_guest()


>
> Does it also not know how to handle #VE to keep it aligned with the
> runtime code?


Not sure I understand the question, but the decompression code supports
neither alternatives nor #VE. It's a very limited environment.

>
> Outside the boot decompression code isn't this branch of the "ifdef
> BOOT_COMPRESSED_MISC_H" handled by #VE? I also don't see any usage of
> __{in,out}() in this patch.

I thought it was all alternative after decompression, so the #VE code
shouldn't be called. We still have it for some reason though.


>
> Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
>
> If the ABI does not include the size of the payload then how would
> code detect if even 80 bytes was violated in the future?


The payload in memory is just a Linux concept. At the TDCALL level it's
only registers.


>
> Surely there's an existing macro for this pattern? Would
> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
> would eliminate clearing of %r8.


There used to be SAVE_ALL/SAVE_REGS, but they have all been removed in
some past refactorings.


-Andi

2021-05-10 23:35:51

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Mon, May 10, 2021 at 4:08 PM Andi Kleen <[email protected]> wrote:
>
>
> On 5/10/2021 2:57 PM, Dan Williams wrote:
> >
> > There is a mix of direct TDVMCALL usage and #VE handling. When and
> > why is either approach used?
>
> For the really early code in the decompressor or the main kernel we
> can't use #VE because the IDT needed for handling the exception is not
> set up, and some other infrastructure needed by the handler is missing.
> The early code needs to do port IO to be able to write the early serial
> console. To keep it all common it ended up that all port IO is paravirt.
> Actually for most of the main kernel port IO calls we could just use #VE
> and it would result in smaller binaries, but then we would need to
> annotate all early port IO with some special name. That's why port IO is
> all TDCALL.

Thanks Andi. Sathya, please include the above in the next posting.

>
> For some others, the only thing that really has to be #VE is MMIO,
> because we don't want to annotate every MMIO read*/write* with an
> alternative (which would result in incredible binary bloat). The others
> have now mostly become direct calls.
>
>
> >
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> The decompression code only really needs it to output something. But you
> couldn't debug anything until #VE is set up. Also the decompression code
> has a very basic environment that doesn't supply most kernel services,
> and the #VE handler is relatively complicated. It would probably need to
> be duplicated and the instruction decoder be ported to work in this
> environment. It would be all a lot of work, just to make the debug
> output work.
>
> >
> >> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> >> Signed-off-by: Kirill A. Shutemov <[email protected]>
> >> Reviewed-by: Andi Kleen <[email protected]>
> >> ---
> >> arch/x86/boot/compressed/Makefile | 1 +
> >> arch/x86/boot/compressed/tdcall.S | 9 ++
> >> arch/x86/include/asm/io.h | 5 +-
> >> arch/x86/include/asm/tdx.h | 46 ++++++++-
> >> arch/x86/kernel/tdcall.S | 154 ++++++++++++++++++++++++++++++
> > Why is this named "tdcall" when it is implementing tdvmcalls? I must
> > say those names don't really help me understand what they do. Can we
> > have Linux names that don't mandate keeping the spec terminology in my
> > brain's translation cache?
>
> The instruction is called TDCALL. It's always the same instruction.
>
> TDVMCALL is the variant when the host processes it (as opposed to the
> TDX module), but it's just a different name space in the call number.
>
>

Ok.

>
> > Is there a unified Linux name these can be given to stop the
> > proliferation of poor vendor names for similar concepts?
>
> We could use protected_guest()

Looks good.

>
>
> >
> > Does it also not know how to handle #VE to keep it aligned with the
> > runtime code?
>
>
> Not sure I understand the question, but the decompression code supports
> neither alternatives nor #VE. It's a very limited environment.

Yes, that addresses the question.

>
> >
> > Outside the boot decompression code isn't this branch of the "ifdef
> > BOOT_COMPRESSED_MISC_H" handled by #VE? I also don't see any usage of
> > __{in,out}() in this patch.
>
> I thought it was all alternative after decompression, so the #VE code
> shouldn't be called. We still have it for some reason though.

Right, I'm struggling to understand where these spurious in/out
instructions that are not replaced by the alternatives code are coming
from. Shouldn't those be dropped on the floor and warned
about rather than handled? I.e. shouldn't port-io instruction escapes
that would cause #VE be precluded at build-time?

>
>
> >
> > Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
> >
> > If the ABI does not include the size of the payload then how would
> > code detect if even 80 bytes was violated in the future?
>
>
> The payload in memory is just a Linux concept. At the TDCALL level it's
> only registers.
>

If it's only a Linux concept why does this code need to "prepare for
the future"?


> >
> > Surely there's an existing macro for this pattern? Would
> > PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
> > would eliminate clearing of %r8.
>
>
> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
> some past refactorings.

Not a huge deal, but at a minimum it seems a generic construct that
deserves to be declared centrally rather than tdx-guest-port-io local.

2021-05-11 00:04:03

by Andi Kleen

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Mon, May 10, 2021 at 04:34:34PM -0700, Dan Williams wrote:
> > > Outside the boot decompression code isn't this branch of the "ifdef
> > > BOOT_COMPRESSED_MISC_H" handled by #VE? I also don't see any usage of
> > > __{in,out}() in this patch.
> >
> > I thought it was all alternative after decompression, so the #VE code
> > shouldn't be called. We still have it for some reason though.
>
> Right, I'm struggling to understand where these spurious in/out
> instructions are coming from that are not replaced by the
> alternative's code?

There should be nothing in the main tree at least.

> Shouldn't those be dropped on the floor and warned
> about rather than handled?

It might be related to eventually handling them in ring 3, but
I believe we disallow that currently too and it's not all that useful
anyways. So yes it could be forbidden.

> I.e. shouldn't port-io instruction escapes
> that would cause #VE be precluded at build-time?

You mean in objtool? That would seem like overkill for a more theoretical
problem.

> > There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
> > some past refactorings.
>
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.

Yes I agree. We should just bring SAVE_ALL/SAVE_REGS back.

-Andi

2021-05-11 00:23:27

by Dan Williams

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Mon, May 10, 2021 at 5:01 PM Andi Kleen <ak@lin
[..]
> > I.e. shouldn't port-io instruction escapes
> > that would cause #VE be precluded at build-time?
>
> You mean in objtool? That would seem like overkill for a more theoretical
> problem.

Oh, sorry, no, I was not implying objtool overkill, just that the
mainline kernel should not be surprised by spurious instruction usage.

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O



On 5/10/21 4:34 PM, Dan Williams wrote:
> On Mon, May 10, 2021 at 4:08 PM Andi Kleen <[email protected]> wrote:
>>
>>
>> On 5/10/2021 2:57 PM, Dan Williams wrote:
>>>
>>> There is a mix of direct-TDVMCALL usage and handling #VE when and why
>>> is either approached used?
>>
>> For the really early code in the decompressor or the main kernel we
>> can't use #VE because the IDT needed for handling the exception is not
>> set up, and some other infrastructure needed by the handler is missing.
>> The early code needs to do port IO to be able to write the early serial
>> console. To keep it all common it ended up that all port IO is paravirt.
>> Actually for most the main kernel port IO calls we could just use #VE
>> and it would result in smaller binaries, but then we would need to
>> annotate all early portio with some special name. That's why port IO is
>> all TDCALL.
>
> Thanks Andi. Sathya, please include the above in the next posting.

Will include it.

>
>>
>> For some others the only thing that really has to be #VE is MMIO because
>> we don't want to annotate every MMIO read*/write* with an alternative
>> (which would result in incredible binary bloat) For the others they have
>> mostly become now direct calls.
>>
>>
>>>
>>>> Decompression code uses port IO for earlyprintk. We must use
>>>> paravirt calls there too if we want to allow earlyprintk.
>>> What is the tradeoff between teaching the decompression code to handle
>>> #VE (the implied assumption) vs teaching it to avoid #VE with direct
>>> TDVMCALLs (the chosen direction)?
>>
>> The decompression code only really needs it to output something. But you
>> couldn't debug anything until #VE is set up. Also the decompression code
>> has a very basic environment that doesn't supply most kernel services,
>> and the #VE handler is relatively complicated. It would probably need to
>> be duplicated and the instruction decoder be ported to work in this
>> environment. It would be all a lot of work, just to make the debug
>> output work.
>>
>>>
>>>> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
>>>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>>> Reviewed-by: Andi Kleen <[email protected]>
>>>> ---
>>>> arch/x86/boot/compressed/Makefile | 1 +
>>>> arch/x86/boot/compressed/tdcall.S | 9 ++
>>>> arch/x86/include/asm/io.h | 5 +-
>>>> arch/x86/include/asm/tdx.h | 46 ++++++++-
>>>> arch/x86/kernel/tdcall.S | 154 ++++++++++++++++++++++++++++++
>>> Why is this named "tdcall" when it is implementing tdvmcalls? I must
>>> say those names don't really help me understand what they do. Can we
>>> have Linux names that don't mandate keeping the spec terminology in my
>>> brain's translation cache?
>>
>> The instruction is called TDCALL. It's always the same instruction
>>
>> TDVMCALL is the variant when the host processes it (as opposed to the
>> TDX module), but it's just a different name space in the call number.
>>
>>
>
> Ok.
>
>>
>>> Is there a unified Linux name these can be given to stop the
>>> proliferation of poor vendor names for similar concepts?
>>
>> We could use protected_guest()
>
> Looks good.
>
>>
>>
>>>
>>> Does it also not know how to handle #VE to keep it aligned with the
>>> runtime code?
>>
>>
>> Not sure I understand the question, but the decompression code supports
>> neither alternatives nor #VE. It's a very limited environment.
>
> Yes, that addresses the question.
>
>>
>>>
>>> Outside the boot decompression code isn't this branch of the "ifdef
>>> BOOT_COMPRESSED_MISC_H" handled by #VE? I also don't see any usage of
>>> __{in,out}() in this patch.
>>
>> I thought it was all alternative after decompression, so the #VE code
>> shouldn't be called. We still have it for some reason though.
>
> Right, I'm struggling to understand where these spurious in/out
> instructions are coming from that are not replaced by the
> alternative's code? Shouldn't those be dropped on the floor and warned
> about rather than handled? I.e. shouldn't port-io instruction escapes
> that would cause #VE be precluded at build-time?
>
>>
>>
>>>
>>> Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
>>>
>>> If the ABI does not include the size of the payload then how would
>>> code detect if even 80 bytes was violated in the future?
>>
>>
>> The payload in memory is just a Linux concept. At the TDCALL level it's
>> only registers.
>>
>
> If it's only a Linux concept why does this code need to "prepare for
> the future"?

It is a software-only structure, created to group all the output
registers used by the VMM. You can find more details about it in the patch titled
# "[RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions"

It is mainly used by functions like __tdx_hypercall(), __tdx_hypercall_vendor_kvm()
and tdx_in{b,w,l}.

u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
		    struct tdx_hypercall_output *out);
u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
			       u64 r15, struct tdx_hypercall_output *out);

struct tdx_hypercall_output {
	u64 r11;
	u64 r12;
	u64 r13;
	u64 r14;
	u64 r15;
};


Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
by the TDX guest to request services (like RDMSR, WRMSR, GetQuote, etc.) from
the VMM using the TDCALL instruction. do_tdx_hypercall() is the helper function
(in tdcall.S) which actually implements this ABI.

As per the current ABI, the VMM will use registers R11-R15 to share the output
values with the guest. So we have defined the structure
struct tdx_hypercall_output to group all output registers and make it easier
to share them with users of the TDCALLs. This is a Linux-defined structure.

If there are any changes in the TDCALL ABI for the VMM, we might have to extend
this structure to accommodate new output registers. So if we
define TDVMCALL_OUTPUT_SIZE as 40, we will have to modify this value for
any future struct tdx_hypercall_output changes. So to avoid that, we have
allocated double the size.

Maybe I should define it as:

#define TDVMCALL_OUTPUT_SIZE sizeof(struct tdx_hypercall_output)

But currently we don't include asm/tdx.h (which defines
struct tdx_hypercall_output) in tdcall.S. So I have defined the size as a
constant value.

>
>
>>>
>>> Surely there's an existing macro for this pattern? Would
>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>> would eliminate clearing of %r8.
>>
>>
>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>> some past refactorings.
>
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O



On 5/10/21 4:34 PM, Dan Williams wrote:
>>> Surely there's an existing macro for this pattern? Would
>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>> would eliminate clearing of %r8.
>>
>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>> some past refactorings.
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.

I can define SAVE_ALL_REGS/RESTORE_ALL_REGS. Do you want me to move it outside
the TDX code? I don't know if there will be other users for it.



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-11 01:09:25

by Dan Williams

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
[..]
> It is mainly used by functions like __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
> and tdx_in{b,w,l}.
>
> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> struct tdx_hypercall_output *out);
> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
> u64 r15, struct tdx_hypercall_output *out);
>
> struct tdx_hypercall_output {
> u64 r11;
> u64 r12;
> u64 r13;
> u64 r14;
> u64 r15;
> };

Why is this by register name and not something like:

struct tdx_hypercall_payload {
	u64 data[5];
};

...because the code in this patch is reading the payload out of a
stack relative offset, not r11.

>
>
> Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
> by TDX guest to request services (like RDMSR, WRMSR,GetQuote, etc) from VMM
> using TDCALL instruction. do_tdx_hypercall() is the helper function (in
> tdcall.S) which actually implements this ABI.
>
> As per current ABI, VMM will use registers R11-R15 to share the output
> values with the guest.

Which ABI, __tdx_hypercall_vendor_kvm()? The code is putting the
payload on the stack, so I'm not sure what ABI you are referring to?


> So we have defined the structure
> struct tdx_hypercall_output to group all output registers and make it easier
> to share it with users of the TDCALLs. This is Linux defined structure.
>
> If there are any changes in TDCALL ABI for VMM, we might have to extend
> this structure to accommodate new output register changes. So if we
> define TDVMCALL_OUTPUT_SIZE as 40, we will have modify this value for
> any future struct tdx_hypercall_output changes. So to avoid it, we have
> allocated double the size.
>
> May be I should define it as,
>
> #define TDVMCALL_OUTPUT_SIZE sizeof(struct tdx_hypercall_output)

An arrangement like that seems more reasonable than a seemingly
arbitrary number and an ominous warning about things that may happen
in the future.

2021-05-11 01:25:10

by Dan Williams

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On Mon, Apr 26, 2021 at 11:02 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> When running as a TDX guest, there are a number of existing,
> privileged instructions that do not work. If the guest kernel
> uses these instructions, the hardware generates a #VE.
>
> You can find the list of unsupported instructions in Intel
> Trust Domain Extensions (Intel® TDX) Module specification,
> sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
> Specification for Intel TDX, sec 2.4.1.
>

Ah, better than the "handle port io" patch, these details at least
give the reader a chance.

> To prevent TD guests from using MWAIT/MONITOR instructions,
> support for these instructions is already disabled by the TDX
> module (SEAM). So the CPUID flags for these instructions should
> be in a disabled state.

Why does this not result in a #UD if the instruction is disabled by
SEAM? How is it possible to execute a disabled instruction (one
precluded by CPUID) to the point where it triggers #VE instead of #UD?

> After the above mentioned preventive measures, if TD guests still
> execute these instructions, add appropriate warning messages in the #VE
> handler. For the WBINVD instruction, since it's related to memory writeback
> and cache flushes, it's mainly used in the context of IO devices. Since
> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> not cause any fatal issues.

WBINVD is in a different class than MWAIT/MONITOR since it is not
identified by CPUID, it can't possibly have the same #UD behaviour.
It's not clear why WBINVD is included in the same patch as
MWAIT/MONITOR?

I disagree with the assertion that WBINVD is mainly used in the
context of I/O devices, it's also used for ACPI power management
paths. WBINVD dependent functionality should be dynamically disabled
rather than warned about.

Does a TDX guest support out-of-tree modules? The kernel is already
tainted when out-of-tree modules are loaded. In other words in-tree
modules preclude forbidden instructions because they can just be
audited, and out-of-tree modules are ok to trigger abrupt failure if
they attempt to use forbidden instructions.

> But to let users know about its usage, use
> WARN() to report it. For the MWAIT/MONITOR instructions, since they are
> unsupported, use WARN() to report the unsupported usage.

I'm not sure how useful warning is outside of a kernel developer's
debug environment. The kernel should know what instructions are
disabled and which are available. WBINVD in particular has potential
data integrity implications. Code that might lead to a WBINVD usage
should be disabled, not run all the way up to where WBINVD is
attempted and then trigger an after-the-fact WARN_ONCE().

The WBINVD change deserves to be split off from MWAIT/MONITOR, and
more thought needs to be put into where these spurious instruction
usages are arising.

>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> ---
> arch/x86/kernel/tdx.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 3fe617978fc4..294dda5bf3f6 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -371,6 +371,21 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
> case EXIT_REASON_EPT_VIOLATION:
> ve->instr_len = tdg_handle_mmio(regs, ve);
> break;
> + case EXIT_REASON_WBINVD:
> + /*
> + * WBINVD is not supported inside TDX guests. All in-
> + * kernel uses should have been disabled.
> + */
> + WARN_ONCE(1, "TD Guest used unsupported WBINVD instruction\n");
> + break;
> + case EXIT_REASON_MONITOR_INSTRUCTION:
> + case EXIT_REASON_MWAIT_INSTRUCTION:
> + /*
> + * Something in the kernel used MONITOR or MWAIT despite
> + * X86_FEATURE_MWAIT being cleared for TDX guests.
> + */
> + WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
> + break;
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return -EFAULT;
> --
> 2.25.1
>

2021-05-11 02:20:01

by Andi Kleen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

>> To prevent TD guest from using MWAIT/MONITOR instructions,
>> support for these instructions are already disabled by TDX
>> module (SEAM). So CPUID flags for these instructions should
>> be in disabled state.
> Why does this not result in a #UD if the instruction is disabled by
> SEAM?

It's just the TDX module (SEAM is the execution mode used by the TDX module)


> How is it possible to execute a disabled instruction (one
> precluded by CPUID) to the point where it triggers #VE instead of #UD?

That's how the TDX module works. It never injects anything else other
than #VE. You can still get other exceptions of course, but they won't
come from the TDX module.

>> After the above mentioned preventive measures, if TD guests still
>> execute these instructions, add appropriate warning messages in #VE
>> handler. For WBIND instruction, since it's related to memory writeback
>> and cache flushes, it's mainly used in context of IO devices. Since
>> TDX 1.0 does not support non-virtual I/O devices, skipping it should
>> not cause any fatal issues.
> WBINVD is in a different class than MWAIT/MONITOR since it is not
> identified by CPUID, it can't possibly have the same #UD behaviour.
> It's not clear why WBINVD is included in the same patch as
> MWAIT/MONITOR?

These are all instructions we never expect to execute, so
nothing special is needed for them. That's a unique class that logically
fits together.


>
> I disagree with the assertion that WBINVD is mainly used in the
> context of I/O devices, it's also used for ACPI power management
> paths.

You mean S3? That's of course also not supported inside TDX.


> WBINVD dependent functionality should be dynamically disabled
> rather than warned about.
>
> Does a TDX guest support out-of-tree modules? The kernel is already
> tainted when out-of-tree modules are loaded. In other words in-tree
> modules preclude forbidden instructions because they can just be
> audited, and out-of-tree modules are ok to trigger abrupt failure if
> they attempt to use forbidden instructions.

We already did a lot of bi^wdiscussion on this on the last review.

Originally we had a different handling, this was the result of previous
feedback.

It doesn't really matter because it should never happen.


>
>> But to let users know about its usage, use
>> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
>> unsupported use WARN() to report unsupported usage.
> I'm not sure how useful warning is outside of a kernel developer's
> debug environment. The kernel should know what instructions are
> disabled and which are available. WBINVD in particular has potential
> data integrity implications. Code that might lead to a WBINVD usage
> should be disabled, not run all the way up to where WBINVD is
> attempted and then trigger an after-the-fact WARN_ONCE().

We don't expect the warning to ever happen. Yes all of this will be
disabled. Nearly all are in code paths that cannot happen inside TDX
anyways due to missing PCI-IDs or different cpuids, and S3 is explicitly
disabled and would be impossible anyways due to lack of BIOS support.




>
> The WBINVD change deserves to be split off from MWAIT/MONITOR, and
> more thought needs to be put into where these spurious instruction
> usages are arising.

I disagree. We already spent a lot of cycles on this. WBINVD never makes
sense in current TDX and all the code will be disabled.


-Andi

2021-05-11 02:21:01

by Andi Kleen

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O


On 5/10/2021 5:56 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/10/21 4:34 PM, Dan Williams wrote:
>>>> Surely there's an existing macro for this pattern? Would
>>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>>> would eliminate clearing of %r8.
>>>
>>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>>> some past refactorings.
>> Not a huge deal, but at a minimum it seems a generic construct that
>> deserves to be declared centrally rather than tdx-guest-port-io local.
>
> I can define SAVE_ALL_REGS/RESTORE_ALL_REGS. Do you want to move it
> outside
> TDX code? I don't know if there will be other users for it?

The old name was SAVE_ALL / SAVE_REGS.

Yes please put it outside tdx code into some include file.

-Andi


Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O



On 5/10/21 6:07 PM, Dan Williams wrote:
> On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
> <[email protected]> wrote:
> [..]
>> It is mainly used by functions like __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
>> and tdx_in{b,w,l}.
>>
>> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>> struct tdx_hypercall_output *out);
>> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>> u64 r15, struct tdx_hypercall_output *out);
>>
>> struct tdx_hypercall_output {
>> u64 r11;
>> u64 r12;
>> u64 r13;
>> u64 r14;
>> u64 r15;
>> };
>
> Why is this by register name and not something like:
>
> struct tdx_hypercall_payload {
> u64 data[5];
> };
>
> ...because the code in this patch is reading the payload out of a
> stack relative offset, not r11.

Since this patch allocates this memory in ASM code, we read it via an
offset. If you look at other use cases in tdx.c, you will notice the use
of register names.

static void tdg_handle_cpuid(struct pt_regs *regs)
{
	u64 ret;
	struct tdx_hypercall_output out = {0};

	ret = __tdx_hypercall(EXIT_REASON_CPUID, regs->ax,
			      regs->cx, 0, 0, &out);

	WARN_ON(ret);

	regs->ax = out.r12;
	regs->bx = out.r13;
	regs->cx = out.r14;
	regs->dx = out.r15;
}

static u64 tdg_read_msr_safe(unsigned int msr, int *err)
{
	u64 ret;
	struct tdx_hypercall_output out = {0};

	WARN_ON_ONCE(tdg_is_context_switched_msr(msr));

	/*
	 * Since CSTAR MSR is not used by Intel CPUs as SYSCALL
	 * instruction, just ignore it. Even raising TDVMCALL
	 * will lead to same result.
	 */
	if (msr == MSR_CSTAR)
		return 0;

	ret = __tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);

	*err = ret ? -EIO : 0;

	return out.r11;
}


>
>>
>>
>> Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
>> by TDX guest to request services (like RDMSR, WRMSR,GetQuote, etc) from VMM
>> using TDCALL instruction. do_tdx_hypercall() is the helper function (in
>> tdcall.S) which actually implements this ABI.
>>
>> As per current ABI, VMM will use registers R11-R15 to share the output
>> values with the guest.
>
> Which ABI,

TDCALL ABI (see sections 3.1 to 3.12 and look for Output Operands in each TDVMCALL variant).

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

> __tdx_hypercall_vendor_kvm()? The code is putting the
> payload on the stack, so I'm not sure what ABI you are referring to?
>
>
>> So we have defined the structure
>> struct tdx_hypercall_output to group all output registers and make it easier
>> to share it with users of the TDCALLs. This is Linux defined structure.
>>
>> If there are any changes in TDCALL ABI for VMM, we might have to extend
>> this structure to accommodate new output register changes. So if we
>> define TDVMCALL_OUTPUT_SIZE as 40, we will have modify this value for
>> any future struct tdx_hypercall_output changes. So to avoid it, we have
>> allocated double the size.
>>
>> May be I should define it as,
>>
>> #define TDVMCALL_OUTPUT_SIZE sizeof(struct tdx_hypercall_output)
>
> An arrangement like that seems more reasonable than a seemingly
> arbitrary number and an ominous warning about things that may happen
> in the future.

I will use the above format.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD



On 5/10/21 7:17 PM, Andi Kleen wrote:
>>> To prevent TD guest from using MWAIT/MONITOR instructions,
>>> support for these instructions are already disabled by TDX
>>> module (SEAM). So CPUID flags for these instructions should
>>> be in disabled state.
>> Why does this not result in a #UD if the instruction is disabled by
>> SEAM?
>
> It's just the TDX module (SEAM is the execution mode used by the TDX module)

If it is disabled by the TDX module, we should never execute it. But if, for
some reason, we still come across this instruction (buggy TDX module?), we add
an appropriate warning in the #VE handler.

>
>
>> How is it possible to execute a disabled instruction (one
>> precluded by CPUID) to the point where it triggers #VE instead of #UD?
>
> That's how the TDX module works. It never injects anything else other than #VE. You can still get
> other exceptions of course, but they won't come from the TDX module.
>
>>> After the above mentioned preventive measures, if TD guests still
>>> execute these instructions, add appropriate warning messages in #VE
>>> handler. For WBIND instruction, since it's related to memory writeback
>>> and cache flushes, it's mainly used in context of IO devices. Since
>>> TDX 1.0 does not support non-virtual I/O devices, skipping it should
>>> not cause any fatal issues.
>> WBINVD is in a different class than MWAIT/MONITOR since it is not
>> identified by CPUID, it can't possibly have the same #UD behaviour.
>> It's not clear why WBINVD is included in the same patch as
>> MWAIT/MONITOR?
>
> Because these are all instructions we never expect to execute, so nothing special is needed for
> them. That's a unique class that logically fits together.

Yes, for all three of these instructions we don't need any special
handling code. So they are grouped together.

>
>
>>
>> I disagree with the assertion that WBINVD is mainly used in the
>> context of I/O devices, it's also used for ACPI power management
>> paths.
>
> You mean S3? That's of course also not supported inside TDX.
>
>
>>   WBINVD dependent functionality should be dynamically disabled
>> rather than warned about.
>>
>> Does a TDX guest support out-of-tree modules?  The kernel is already
>> tainted when out-of-tree modules are loaded. In other words in-tree
>> modules preclude forbidden instructions because they can just be
>> audited, and out-of-tree modules are ok to trigger abrupt failure if
>> they attempt to use forbidden instructions.
>
> We already did a lot of bi^wdiscussion on this on the last review.
>
> Originally we had a different handling, this was the result of previous feedback.
>
> It doesn't really matter because it should never happen.
>
>
>>
>>> But to let users know about its usage, use
>>> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
>>> unsupported use WARN() to report unsupported usage.
>> I'm not sure how useful warning is outside of a kernel developer's
>> debug environment. The kernel should know what instructions are
>> disabled and which are available. WBINVD in particular has potential
>> data integrity implications. Code that might lead to a WBINVD usage
>> should be disabled, not run all the way up to where WBINVD is
>> attempted and then trigger an after-the-fact WARN_ONCE().
>
> We don't expect the warning to ever happen. Yes all of this will be disabled. Nearly all are in code
> paths that cannot happen inside TDX anyways due to missing PCI-IDs or different cpuids, and S3 is
> explicitly disabled and would be impossible anyways due to lack of BIOS support.

We have added the WARN to let users know about its usage and fix it. By default
we should never hit this path.

>
>
>
>
>>
>> The WBINVD change deserves to be split off from MWAIT/MONITOR, and
>> more thought needs to be put into where these spurious instruction
>> usages are arising.
>
> I disagree. We already spent a lot of cycles on this. WBINVD makes never sense in current TDX and
> all the code will be disabled.

>
>
> -Andi
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-11 02:54:09

by Andi Kleen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD


On 5/10/2021 7:44 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/10/21 7:17 PM, Andi Kleen wrote:
>>>> To prevent TD guest from using MWAIT/MONITOR instructions,
>>>> support for these instructions are already disabled by TDX
>>>> module (SEAM). So CPUID flags for these instructions should
>>>> be in disabled state.
>>> Why does this not result in a #UD if the instruction is disabled by
>>> SEAM?
>>
>> It's just the TDX module (SEAM is the execution mode used by the TDX
>> module)
>
> If it is disabled by the TDX Module, we should never execute it. But
> for some
> reason, if we still come across this instruction (buggy TDX module?),
> we add
> appropriate warning in  #VE handler.

I think the only case where it could happen is if the kernel jumps to a
random address due to a bug and the destination happens to be these
instruction bytes. Of course it is exceedingly unlikely.

Or we make some mistake, but that's hopefully fixed quickly.


-Andi

2021-05-11 09:37:46

by Borislav Petkov

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On Mon, May 10, 2021 at 10:52:49PM +0000, Sean Christopherson wrote:
> I can't find the thread offhand, but Boris proposed something along the lines of
> cpu_has(), but specific to a given flavor of protected guest. IIRC, it was
> sev_guest_has(SEV_ES) or something like that.
>
> I 100% agree that we should have actual feature bits somewhere for the various
> protected guest flavors.

Preach brother! :)

/me goes and greps mailboxes...

ah, do you mean this, per chance:

https://lore.kernel.org/kvm/[email protected]/

?

And yes, this has "sev" in the name and dhansen makes sense to me in
wishing to unify all the protected guest feature queries under a common
name. And then depending on the vendor, that common name will call the
respective vendor's helper to answer the protected guest aspect asked
about.

This way, generic code will call

protected_guest_has()

or so and be nicely abstracted away from the underlying implementation.
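[Editorial note: a minimal sketch of the dispatch scheme being proposed here. All names below are hypothetical illustrations, not code from this series; a similar abstraction did eventually land upstream as cc_platform_has().]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical attribute and vendor names, for illustration only. */
enum protected_guest_attr {
	PATTR_GUEST_MEM_ENCRYPT,	 /* guest memory is encrypted */
	PATTR_GUEST_SHARED_MAPPING_INIT, /* ioremap() must mark pages shared */
};

enum guest_vendor {
	GUEST_VENDOR_NONE,
	GUEST_VENDOR_AMD_SEV,
	GUEST_VENDOR_INTEL_TDX,
};

static enum guest_vendor boot_guest_vendor = GUEST_VENDOR_NONE;

/* Vendor-specific answers; in the kernel these would live in sev.c/tdx.c. */
static bool sev_guest_has(enum protected_guest_attr attr)
{
	return attr == PATTR_GUEST_MEM_ENCRYPT;
}

static bool tdx_guest_has(enum protected_guest_attr attr)
{
	return true;	/* TDX: memory encrypted, ioremap() pages shared */
}

/* Generic code calls only this and stays vendor-agnostic. */
bool protected_guest_has(enum protected_guest_attr attr)
{
	switch (boot_guest_vendor) {
	case GUEST_VENDOR_AMD_SEV:
		return sev_guest_has(attr);
	case GUEST_VENDOR_INTEL_TDX:
		return tdx_guest_has(attr);
	default:
		return false;
	}
}
```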

Hohumm, yap, sounds nice to me.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-11 14:11:07

by Dave Hansen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On 5/10/21 6:23 PM, Dan Williams wrote:
>> To prevent TD guest from using MWAIT/MONITOR instructions,
>> support for these instructions are already disabled by TDX
>> module (SEAM). So CPUID flags for these instructions should
>> be in disabled state.
> Why does this not result in a #UD if the instruction is disabled by
> SEAM? How is it possible to execute a disabled instruction (one
> precluded by CPUID) to the point where it triggers #VE instead of #UD?

This is actually a vestige of VMX. It's quite possible today to have a
feature which isn't enumerated in CPUID which still exists and "works"
in the silicon. There are all kinds of pitfalls to doing this, but
folks evidently do it in public clouds all the time.

The CPUID virtualization basically just traps into the hypervisor and
lets the hypervisor set whatever register values it wants to appear when
CPUID "returns".

But, the controls for what instructions generate #UD are actually quite
separate and unrelated to CPUID itself.

2021-05-11 14:41:31

by Dave Hansen

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On 5/10/21 7:29 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/10/21 6:07 PM, Dan Williams wrote:
>> On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
>> <[email protected]> wrote:
>> [..]
>>> It is mainly used by functions like
>>> __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
>>> and tdx_in{b,w,l}.
>>>
>>> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>>>                       struct tdx_hypercall_output *out);
>>> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>>>                                  u64 r15, struct tdx_hypercall_output
>>> *out);
>>>
>>> struct tdx_hypercall_output {
>>>           u64 r11;
>>>           u64 r12;
>>>           u64 r13;
>>>           u64 r14;
>>>           u64 r15;
>>> };
>>
>> Why is this by register name and not something like:
>>
>> struct tdx_hypercall_payload {
>>    u64 data[5];
>> };
>>
>> ...because the code in this patch is reading the payload out of a
>> stack relative offset, not r11.
>
> Since this patch allocates this memory in ASM code, we read it via
> offset. If you see other use cases in tdx.c, you will notice the use
> of register names.

To what do you refer by "this patch allocates this memory in ASM
code"? Could you point to the specific ASM code that "allocates memory"?

Dan, I'll try to answer your question. TDX has both a "hypercall"
interface for guests to call into hosts and a "seamcall" interface where
guests or hosts can talk to the TDX/SEAM module.

Both of these represent an ABI which _resembles_ a system call ABI.
Values are placed in registers, including a "function" register which is
very similar to the system call number we place in RAX.

*But* those ABIs were actually designed to (IIRC) resemble the
Windows/Microsoft ABI, not the Linux ABI. So the register conventions
are unfamiliar. There is assembly code to convert between the ELF
function call ABI and the TDX ABIs.

For instance, if you are in C code and you call:

__tdx_hypercall_vendor_kvm(u64 fn, u64 r12, ...

The value for "fn" will be placed in RAX and "r12" will be placed in RDI
for the function call itself. The assembly code will, for instance,
take the "r12" *VARIABLE* and ensure it gets into the R12 *REGISTER* for
the hypercall.

The same thing happens on the output side. The TDX ABIs specify
"return" values in certain registers (r11-r15). However, those
registers are not preserved in our function return ABI. So, they must
be stashed off in memory into a place where the caller can retrieve them.

Rather than being unstructured "data[]", the value in
tdx_hypercall_output->r11 was actually in register R11 at some point.
If you look at the spec, you can see the functions that use R11.

Let's say there's a hypercall to check for whether puppies are cute.
Here's the kernel side:

bool tdx_hypercall_puppies_are_cute(void)
{
	struct tdx_hypercall_output out;
	u64 ret;

	ret = __tdx_hypercall_vendor_kvm(HOST_LIKES_PUPPIES, ..., &out);

	/* Did the hypercall even succeed? */
	if (ret != SUCCESS)
		return false;

	if (out.r11 & TDX_WHATEVER_CUTE_BIT)
		return true;

	/* Nope, I guess puppies are not cute */
	return false;
}

The spec would actually say, "Blah blah, puppies are cute if
TDX_WHATEVER_CUTE_BIT is set in r11". So, this whole setup actually
results in really nice C code that you can sit side-by-side with the
spec and see if they agree.

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O



On 5/11/21 7:39 AM, Dave Hansen wrote:
> To what do you refer by "this patch allocates this memory in ASM
> code"? Could you point to the specific ASM code that "allocates memory"?

We use 40 bytes of stack to store the output register values. It is in
the function tdg_inl().

subq $TDVMCALL_OUTPUT_SIZE, %rsp

+SYM_FUNC_START(tdg_inl)
+ io_save_registers
+ /* Set data width to 4 bytes */
+ mov $4, %rsi
+1:
+ mov %rdx, %rcx
+ /* Set 0 in RDX to select in operation */
+ mov $0, %rdx
+ /* Set TDVMCALL function id in RDI */
+ mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+ /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+ xor %r10, %r10
+ /* Allocate memory in stack for Output */
+ subq $TDVMCALL_OUTPUT_SIZE, %rsp
+ /* Move tdvmcall_output pointer to R9 */
+ movq %rsp, %r9
+
+ call do_tdvmcall

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-11 15:39:31

by Dave Hansen

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On 5/10/21 2:57 PM, Dan Williams wrote:
>> Decompression code uses port IO for earlyprintk. We must use
>> paravirt calls there too if we want to allow earlyprintk.
> What is the tradeoff between teaching the decompression code to handle
> #VE (the implied assumption) vs teaching it to avoid #VE with direct
> TDVMCALLs (the chosen direction)?

To me, the tradeoff is not just "teaching" the code to handle a #VE, but
ensuring that the entire architecture works.

Intentionally invoking a #VE is like making a function call that *MIGHT*
recurse on itself. Sure, you can try to come up with a story about
bounding the recursion. But, I don't see any semblance of that in this
series.

Exception-based recursion is really nasty because it's implicit, not
explicit. That's why I'm advocating for a design where the kernel never
intentionally causes a #VE: it never intentionally recurses without bounds.

2021-05-11 15:39:36

by Dan Williams

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On Mon, May 10, 2021 at 7:17 PM Andi Kleen <[email protected]> wrote:
>
> >> To prevent TD guest from using MWAIT/MONITOR instructions,
> >> support for these instructions are already disabled by TDX
> >> module (SEAM). So CPUID flags for these instructions should
> >> be in disabled state.
> > Why does this not result in a #UD if the instruction is disabled by
> > SEAM?
>
> It's just the TDX module (SEAM is the execution mode used by the TDX module)
>
>
> > How is it possible to execute a disabled instruction (one
> > precluded by CPUID) to the point where it triggers #VE instead of #UD?
>
> That's how the TDX module works. It never injects anything else other
> than #VE. You can still get other exceptions of course, but they won't
> come from the TDX module.
>
> >> After the above mentioned preventive measures, if TD guests still
> >> execute these instructions, add appropriate warning messages in #VE
> >> handler. For WBIND instruction, since it's related to memory writeback
> >> and cache flushes, it's mainly used in context of IO devices. Since
> >> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> >> not cause any fatal issues.
> > WBINVD is in a different class than MWAIT/MONITOR since it is not
> > identified by CPUID, it can't possibly have the same #UD behaviour.
> > It's not clear why WBINVD is included in the same patch as
> > MWAIT/MONITOR?
>
> Because these are all instructions we never expect to execute, so
> nothing special is needed for them. That's a unique class that logically
> fits together.
>
>
> >
> > I disagree with the assertion that WBINVD is mainly used in the
> > context of I/O devices, it's also used for ACPI power management
> > paths.
>
> You mean S3? That's of course also not supported inside TDX.
>
>
> > WBINVD dependent functionality should be dynamically disabled
> > rather than warned about.
> >
> > Does a TDX guest support out-of-tree modules? The kernel is already
> > tainted when out-of-tree modules are loaded. In other words in-tree
> > modules preclude forbidden instructions because they can just be
> > audited, and out-of-tree modules are ok to trigger abrupt failure if
> > they attempt to use forbidden instructions.
>
> We already did a lot of bi^wdiscussion on this on the last review.
>
> Originally we had a different handling, this was the result of previous
> feedback.
>
> It doesn't really matter because it should never happen.
>
>
> >
> >> But to let users know about its usage, use
> >> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
> >> unsupported use WARN() to report unsupported usage.
> > I'm not sure how useful warning is outside of a kernel developer's
> > debug environment. The kernel should know what instructions are
> > disabled and which are available. WBINVD in particular has potential
> > data integrity implications. Code that might lead to a WBINVD usage
> > should be disabled, not run all the way up to where WBINVD is
> > attempted and then trigger an after-the-fact WARN_ONCE().
>
> We don't expect the warning to ever happen. Yes all of this will be
> disabled. Nearly all are in code paths that cannot happen inside TDX
> anyways due to missing PCI-IDs or different cpuids, and S3 is explicitly
> disabled and would be impossible anyways due to lack of BIOS support.
>
>
>
>
> >
> > The WBINVD change deserves to be split off from MWAIT/MONITOR, and
> > more thought needs to be put into where these spurious instruction
> > usages are arising.
>
> I disagree. We already spent a lot of cycles on this. WBINVD never makes
> sense in current TDX and all the code will be disabled.

Why not just drop the patch if it continues to cause people to spend
cycles on it and it addresses a problem that will never happen?

2021-05-11 15:45:44

by Andi Kleen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD


On 5/11/2021 8:37 AM, Dan Williams wrote:
> Why not just drop the patch if it continues to cause people to spend
> cycles on it and it addresses a problem that will never happen?

We want to at least get some kind of warning if there is really a
mistake. Just dropping such an ability wouldn't seem right.

That's all that the patch does really.

-Andi


2021-05-11 15:47:29

by Dan Williams

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Tue, May 11, 2021 at 8:36 AM Dave Hansen <[email protected]> wrote:
>
> On 5/10/21 2:57 PM, Dan Williams wrote:
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> To me, the tradeoff is not just "teaching" the code to handle a #VE, but
> ensuring that the entire architecture works.
>
> Intentionally invoking a #VE is like making a function call that *MIGHT*
> recurse on itself. Sure, you can try to come up with a story about
> bounding the recursion. But, I don't see any semblance of that in this
> series.
>
> Exception-based recursion is really nasty because it's implicit, not
> explicit. That's why I'm advocating for a design where the kernel never
> intentionally causes a #VE: it never intentionally recurses without bounds.

Thanks Dave, this really helps.

2021-05-11 15:49:11

by Dave Hansen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On 5/11/21 8:37 AM, Dan Williams wrote:
>> I disagree. We already spent a lot of cycles on this. WBINVD never makes
>> sense in current TDX and all the code will be disabled.
> Why not just drop the patch if it continues to cause people to spend
> cycles on it and it addresses a problem that will never happen?

If someone calls WBINVD, we have a bug. Not a little bug, either. It
probably means there's some horribly confused kernel code that's now
facing broken cache coherency. To me, it's a textbook place to use
BUG_ON().

This also doesn't "address" the problem, it just helps produce a more
coherent warning message. It's why we have OOPS messages in the page
fault handler: it never makes any sense to dereference a NULL pointer,
yet we have code to make debugging them easier. It's well worth the ~20
lines of code that this costs us for ease of debugging.
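[Editorial note: a sketch of the handler shape being debated. The exit-reason numbers mirror the VMX exit reasons (which TDX reuses for #VE); the helper name and the fprintf() stand-in for WARN() are illustrative, not the actual patch.]

```c
#include <stdbool.h>
#include <stdio.h>

/* VMX exit-reason numbers, reused by TDX for #VE. */
#define EXIT_REASON_MWAIT_INSTRUCTION	36
#define EXIT_REASON_MONITOR_INSTRUCTION	39
#define EXIT_REASON_WBINVD		54

/*
 * Returns true if the #VE was handled; false falls back to the common
 * fatal path (the #GP-style oops that prints a backtrace and the
 * faulting instruction).
 */
static bool tdg_handle_ve(unsigned int exit_reason)
{
	switch (exit_reason) {
	case EXIT_REASON_WBINVD:
		/*
		 * "Can't happen": some code path thinks it needs a cache
		 * flush that the TDX module will not perform for us.
		 * Warn loudly to make the coherency bug debuggable.
		 */
		fprintf(stderr, "WARN: unexpected WBINVD in TDX guest\n");
		return true;
	case EXIT_REASON_MWAIT_INSTRUCTION:
	case EXIT_REASON_MONITOR_INSTRUCTION:
		/* Should have been precluded by the cleared CPUID bits. */
		fprintf(stderr, "WARN: MONITOR/MWAIT despite cleared CPUID\n");
		return true;
	default:
		return false;	/* unhandled: oops like a #GP would */
	}
}
```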

2021-05-11 15:51:37

by Dan Williams

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On Tue, May 11, 2021 at 8:45 AM Dave Hansen <[email protected]> wrote:
>
> On 5/11/21 8:37 AM, Dan Williams wrote:
> >> I disagree. We already spent a lot of cycles on this. WBINVD never makes
> >> sense in current TDX and all the code will be disabled.
> > Why not just drop the patch if it continues to cause people to spend
> > cycles on it and it addresses a problem that will never happen?
>
> If someone calls WBINVD, we have a bug. Not a little bug, either. It
> probably means there's some horribly confused kernel code that's now
> facing broken cache coherency. To me, it's a textbook place to use
> BUG_ON().
>
> This also doesn't "address" the problem, it just helps produce a more
> coherent warning message. It's why we have OOPS messages in the page
> fault handler: it never makes any sense to dereference a NULL pointer,
> yet we have code to make debugging them easier. It's well worth the ~20
> lines of code that this costs us for ease of debugging.

The 'default' case in this 'switch' prints the exit reason and faults,
can't that also trigger a backtrace that dumps the exception stack and
the faulting instruction? In other words shouldn't this just fail with
a common way to provide better debug on any unhandled #VE and not try
to continue running past something that "can't" happen?

2021-05-11 15:56:29

by Dave Hansen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> For WBIND instruction, since it's related to memory writeback

^ WBINVD

> and cache flushes, it's mainly used in context of IO devices. Since
> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> not cause any fatal issues. But
Do me a favor:

grep -ri wbinvd arch/x86/

How many I/O devices do you see?

Please get your ducks in a row here. Come up with a coherent changelog
about why the arch/x86 use of WBINVD doesn't apply to TDX guests.
Explain the audit that you did. You *DID* do an audit, right?

2021-05-11 15:56:53

by Andi Kleen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD


> The 'default' case in this 'switch' prints the exit reason and faults,
> can't that also trigger a backtrace that dumps the exception stack and
> the faulting instruction? In other words shouldn't this just fail with
> a common way to provide better debug on any unhandled #VE and not try
> to continue running past something that "can't" happen?

It will use the #GP common code which will do all the backtracing etc.

We didn't think we would need anything else than what #GP already does.

-Andi


2021-05-11 16:08:39

by Dave Hansen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On 5/11/21 8:52 AM, Andi Kleen wrote:
>> The 'default' case in this 'switch' prints the exit reason and faults,
>> can't that also trigger a backtrace that dumps the exception stack and
>> the faulting instruction? In other words shouldn't this just fail with
>> a common way to provide better debug on any unhandled #VE and not try
>> to continue running past something that "can't" happen?
>
> It will use the #GP common code which will do all the backtracing etc.
>
> We didn't think we would need anything else than what #GP already does.

How do these end up in practice? Do they still say "general protection
fault..."?

Isn't that really mean for anyone that goes trying to figure out what
caused these? If they see a "general protection fault" from WBINVD and
go digging in the SDM for how a #GP can come from WBINVD, won't they be
sorely disappointed?

2021-05-11 16:12:22

by Sean Christopherson

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On Tue, May 11, 2021, Dave Hansen wrote:
> On 5/10/21 6:23 PM, Dan Williams wrote:
> >> To prevent TD guest from using MWAIT/MONITOR instructions,
> >> support for these instructions are already disabled by TDX
> >> module (SEAM). So CPUID flags for these instructions should
> >> be in disabled state.
> > Why does this not result in a #UD if the instruction is disabled by
> > SEAM? How is it possible to execute a disabled instruction (one
> > precluded by CPUID) to the point where it triggers #VE instead of #UD?
>
> This is actually a vestige of VMX. It's quite possible today to have a
> feature which isn't enumerated in CPUID which still exists and "works"
> in the silicon.

No, virtualization holes are something else entirely.

MONITOR/MWAIT are a bit weird; they do have an enable bit in IA32_MISC_ENABLE,
but most VMMs don't context switch IA32_MISC_ENABLE (load guest value on entry,
load host value on exit) because that would add ~250 cycles to every host<->guest
transition. And IA32_MISC_ENABLE is shared between SMT siblings, which further
complicates loading the guest's value into hardware. In the end, it's easier to
leave MONITOR/MWAIT enabled in hardware and instead force a VM-Exit.
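[Editorial note: for concreteness, the enable bit Sean refers to is IA32_MISC_ENABLE[18] ("ENABLE MONITOR FSM", MSR 0x1a0). A trivial sketch of the check a VMM or guest would do on a saved MSR value; the helper name is illustrative.]

```c
#include <stdint.h>

#define MSR_IA32_MISC_ENABLE		0x000001a0
#define MSR_IA32_MISC_ENABLE_MWAIT_BIT	18	/* "ENABLE MONITOR FSM" */

/*
 * Given a value read from IA32_MISC_ENABLE (e.g. via rdmsr), report
 * whether MONITOR/MWAIT are architecturally enabled on that CPU.
 * CPUID.01H:ECX.MONITOR[bit 3] reflects this bit, not raw hardware
 * support, which is why clearing CPUID alone does not cause #UD.
 */
static int monitor_mwait_enabled(uint64_t misc_enable)
{
	return (misc_enable >> MSR_IA32_MISC_ENABLE_MWAIT_BIT) & 1;
}
```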

As for why TDX injects #VE instead of #UD, I suspect it's for the same reason
that KVM emulates MONITOR/MWAIT as nops instead of injecting a #UD. The CPUID
bit for MONITOR/MWAIT reflects their enabling in IA32_MISC_ENABLE, not raw
support in hardware. That means there's no definitive way to enumerate to BIOS
that MONITOR/MWAIT are not supported, e.g. AFAICT, EDKII blindly assumes it can
enable MONITOR/MWAIT in IA32_MISC_ENABLE. To justify #UD instead of #VE, TDX
would have to inject #GP on WRMSR to set IA32_MISC_ENABLE.ENABLE_MONITOR, and
even then there would be weirdness with respect to VMM behavior in response to
TDVMCALL(WRMSR) since the VMM could allow the virtual write. In the end, it's
again simpler to inject #VE.

> There are all kinds of pitfalls to doing this, but folks evidently do it in
> public clouds all the time.

Virtualization holes are when instructions/features are enumerated via CPUID,
but don't have a control to hide the feature from the guest (or in the case of
CET, multiple feature are buried behind a single control). So even if the VMM
hides the feature via CPUID, the guest can still _cleanly_ execute the
instruction if it's supported by the underlying hardware.

> The CPUID virtualization basically just traps into the hypervisor and
> lets the hypervisor set whatever register values it wants to appear when
> CPUID "returns".
>
> But, the controls for what instructions generate #UD are actually quite
> separate and unrelated to CPUID itself.

Eh, any sane VMM will accurately represent its virtual CPU model via CPUID
insofar as possible, there are just too many creaky corners in x86 to make things
100% bombproof.

2021-05-11 16:19:49

by Dave Hansen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On 5/11/21 9:09 AM, Sean Christopherson wrote:
>>> Why does this not result in a #UD if the instruction is disabled by
>>> SEAM? How is it possible to execute a disabled instruction (one
>>> precluded by CPUID) to the point where it triggers #VE instead of #UD?
>> This is actually a vestige of VMX. It's quite possible today to have a
>> feature which isn't enumerated in CPUID which still exists and "works"
>> in the silicon.
> No, virtualization holes are something else entirely.

I think the bigger point is that *CPUID* doesn't enable or disable
instructions in and of itself.

It can *reflect* enabling (like OSPKE), but nothing is actually enabled
or disabled via CPUID.

2021-05-11 17:09:39

by Andi Kleen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD



> How do these end up in practice? Do they still say "general protection
> fault..."?

Yes, but there's a #VE specific message before it that prints the exit
reason.


>
> Isn't that really mean for anyone that goes trying to figure out what
> caused these? If they see a "general protection fault" from WBINVD and
> go digging in the SDM for how a #GP can come from WBINVD, won't they be
> sorely disappointed?

They'll see both the message and also that it isn't a true #VE in the
backtrace.


-Andi


2021-05-11 17:45:07

by Dave Hansen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD

On 5/11/21 10:06 AM, Andi Kleen wrote:
>> How do these end up in practice?  Do they still say "general protection
>> fault..."?
>
> Yes, but there's a #VE specific message before it that prints the exit
> reason.
>
>> Isn't that really mean for anyone that goes trying to figure out what
>> caused these?  If they see a "general protection fault" from WBINVD and
>> go digging in the SDM for how a #GP can come from WBINVD, won't they be
>> sorely disappointed?
>
> They'll see both the message and also that it isn't a true #VE in the
> backtrace.

Is there a good reason for the enduring "general protection fault..."
message other than an aversion to refactoring the code?

2021-05-11 17:50:17

by Andi Kleen

Subject: Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD


> Is there a good reason for the enduring "general protection fault..."
> message other than an aversion to refactoring the code?

You're the first ever to think it's a problem.

We're assuming that kernel developers are smart enough to understand this.

Please I implore everyone to move on from this patch. This is my last
email on this topic.

-Andi

2021-05-12 06:18:35

by Dan Williams

Subject: Re: [RFC v2 14/32] x86/tdx: Handle port I/O

On Tue, May 11, 2021 at 8:36 AM Dave Hansen <[email protected]> wrote:
>
> On 5/10/21 2:57 PM, Dan Williams wrote:
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> To me, the tradeoff is not just "teaching" the code to handle a #VE, but
> ensuring that the entire architecture works.
>
> Intentionally invoking a #VE is like making a function call that *MIGHT*
> recurse on itself. Sure, you can try to come up with a story about
> bounding the recursion. But, I don't see any semblance of that in this
> series.
>
> Exception-based recursion is really nasty because it's implicit, not
> explicit. That's why I'm advocating for a design where the kernel never
> intentionally causes a #VE: it never intentionally recurses without bounds.

So this circles back to the common problem with the
mwait/monitor/wbinvd patch and this one. "Can't happen" #VE conditions
should be fatal. I.e. have a nice clear message about why the kernel
failed and halt. All the uses of these #VE triggering instructions can
be eliminated ahead of time with auditing and people that load
unaudited out-of-tree modules that trigger #VE get to keep the pieces.
Said pieces will be described to them by the #VE triggered fail
message. This isn't like split lock disable where the code is
difficult to audit.

What am I missing?

2021-05-12 12:11:18

by Kirill A. Shutemov

Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
> On 10.05.21 17:52, Andi Kleen wrote:
> > > > > CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
> > > > > calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
> > > > > config would be a bloat for TDX.
> > > >
> > > > Used how? Why is it bloat for TDX?
> > >
> > > Is there any major downside to move the halt related pvops functions
> > > from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?
> >
> > I think the main motivation is to get rid of all the page table related
> > hooks for modern configurations. These are the bulk of the annotations
> > and cause bloat and worse code. Shadow page tables are really obscure
> > these days and very few people still need them and it's totally
> > reasonable to build even widely used distribution kernels without them.
> > On contrast most of the other hooks are comparatively few and also on
> > comparatively slow paths, so don't really matter too much.
> >
> > I think it would be ok to have a CONFIG_PARAVIRT that does not have page
> > table support, and a separate config option for those (that could be
> > eventually deprecated).
> >
> > But that would break existing .configs for those shadow stack users,
> > that's why I think Kirill did it the other way around.
>
> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
> other hypervisor's guests, supporting basically the TLB flush operations
> and time related operations only. Adding the halt related operations to
> PARAVIRT wouldn't break anything.

Yeah, I think we can do this. It should be fine.

--
Kirill A. Shutemov

2021-05-12 13:19:54

by Kirill A. Shutemov

Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls

On Fri, May 07, 2021 at 05:59:34PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/7/21 2:46 PM, Dave Hansen wrote:
> > I know KVM does weird stuff. But, this is*really* weird. Why are we
> > #including a .c file into another .c file?
>
> I think Kirill implemented it this way to skip Makefile changes for it. I don't
> see any other KVM direct dependencies in tdx.c.
>
> I will fix it in next version.

This has to be compiled only for TDX+KVM.

--
Kirill A. Shutemov

2021-05-12 13:22:28

by Kirill A. Shutemov

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On Mon, May 10, 2021 at 03:23:29PM -0700, Dave Hansen wrote:
> On 5/10/21 3:19 PM, Kuppuswamy, Sathyanarayanan wrote:
> > On 5/7/21 2:54 PM, Dave Hansen wrote:
> >> This doesn't seem much like common code to me. It seems like 100% SEV
> >> code. Is this really where we want to move it?
> >
> > Both SEV and TDX code has requirement to enable
> > CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define force_dma_unencrypted()
> > function.
> >
> > force_dma_unencrypted() is modified by patch titled "x86/tdx: Make DMA
> > pages shared" to add TDX guest specific support.
> >
> > Since both SEV and TDX code uses it, its moved to common file.
>
> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
> code in a file named "common". I'd make an attempt to keep them
> separate and then call into the two separate functions *from* the common
> function.

But why? What good does the additional level of indirection bring?

It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
or Intel specific. If a function can cover both vendors I don't see the
point of additional complexity.

--
Kirill A. Shutemov

2021-05-12 13:23:44

by Peter Zijlstra

Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:

> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
> other hypervisor's guests, supporting basically the TLB flush operations
> and time related operations only. Adding the halt related operations to
> PARAVIRT wouldn't break anything.

Also, I don't think anything modern should actually ever hit any of the
HLT instructions, most everything should end up at an MWAIT.

Still, do we want to give arch_safe_halt() and halt() the
PVOP_ALT_VCALL0() treatment?

2021-05-12 13:25:49

by Andi Kleen

Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL


On 5/12/2021 6:18 AM, Peter Zijlstra wrote:
> On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
>
>> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
>> other hypervisor's guests, supporting basically the TLB flush operations
>> and time related operations only. Adding the halt related operations to
>> PARAVIRT wouldn't break anything.
> Also, I don't think anything modern should actually ever hit any of the
> HLT instructions, most everything should end up at an MWAIT.
>
> Still, do we want to give arch_safe_halt() and halt() the
> PVOP_ALT_VCALL0() treatment?

For performance reasons it's pointless to patch. HLT (and MWAIT) are
so slow anyways that using patching or an indirect pointer is completely
in the noise. So I would use whatever is cleanest in the code.

-Andi



2021-05-12 13:53:35

by Jürgen Groß

Subject: Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

On 12.05.21 15:24, Andi Kleen wrote:
>
> On 5/12/2021 6:18 AM, Peter Zijlstra wrote:
>> On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
>>
>>> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
>>> other hypervisor's guests, supporting basically the TLB flush operations
>>> and time related operations only. Adding the halt related operations to
>>> PARAVIRT wouldn't break anything.
>> Also, I don't think anything modern should actually ever hit any of the
>> HLT instructions, most everything should end up at an MWAIT.
>>
>> Still, do we want to give arch_safe_halt() and halt() the
>> PVOP_ALT_VCALL0() treatment?
>
> For performance reasons it's pointless to patch. HLT (and MWAIT) are
> so slow anyways that using patching or an indirect pointer is completely
> in the noise. So I would use whatever is cleanest in the code.

This would probably be x86_platform_ops.hyper hooks.


Juergen


Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls



On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
> This has to be compiled only for TDX+KVM.

Got it. So if we want to remove the "C" file include, we will have to
add #ifdef CONFIG_KVM_GUEST in Makefile.

ifdef CONFIG_KVM_GUEST
obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
endif

Dave, do you prefer above change over "C" file include?

#ifdef CONFIG_KVM_GUEST
#include "tdx-kvm.c"
#endif

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-12 14:34:37

by Dave Hansen

Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls

On 5/12/21 7:10 AM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
>> This has to be compiled only for TDX+KVM.
>
> Got it. So if we want to remove the "C" file include, we will have to
> add #ifdef CONFIG_KVM_GUEST in Makefile.
>
> ifdef CONFIG_KVM_GUEST
> obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
> endif

Is there truly no dependency between CONFIG_KVM_GUEST and
CONFIG_INTEL_TDX_GUEST?

If there isn't, then the way we do it is adding another (invisible)
Kconfig variable to express the dependency for tdx-kvm.o:

config INTEL_TDX_GUEST_KVM
bool
depends on KVM_GUEST && INTEL_TDX_GUEST
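With an invisible symbol like that, the Makefile side reduces to one line. A
sketch, assuming the INTEL_TDX_GUEST_KVM name proposed above (note that an
invisible symbol with no prompt also needs a default, e.g. "def_bool y", to
ever be enabled):

```make
# arch/x86/kernel/Makefile (sketch): the KVM_GUEST && INTEL_TDX_GUEST
# dependency now lives entirely in Kconfig, so no ifdef is needed here.
obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
```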

2021-05-12 16:38:24

by Dave Hansen

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
>> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
>> code in a file named "common". I'd make an attempt to keep them
>> separate and then call into the two separate functions *from* the common
>> function.
> But why? What good does the additional level of indirection bring?
>
> It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
> or Intel specific. If a function can cover both vendors I don't see a
> point for additional complexity.

Because the code is already separate. You're actually going to some
trouble to move the SEV-specific code and then combine it with the
TDX-specific code.

Anyway, please just give it a shot. Should take all of ten minutes. If
it doesn't work out in practice, fine. You'll have a good paragraph for
the changelog.

2021-05-12 18:04:02

by Sean Christopherson

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On Wed, May 12, 2021, Dave Hansen wrote:
> On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
> >> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
> >> code in a file named "common". I'd make an attempt to keep them
> >> separate and then call into the two separate functions *from* the common
> >> function.
> > But why? What good does the additional level of indirection bring?
> >
> > It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
> > or Intel specific. If a function can cover both vendors I don't see a
> > point for additional complexity.
>
> Because the code is already separate. You're actually going to some
> trouble to move the SEV-specific code and then combine it with the
> TDX-specific code.
>
> Anyway, please just give it a shot. Should take all of ten minutes. If
> it doesn't work out in practice, fine. You'll have a good paragraph for
> the changelog.

Or maybe wait to see how Boris' proposed protected_guest_has() pans out? E.g. if
we can do "protected_guest_has(MEMORY_ENCRYPTION)" or whatever, then the truly
common bits could be placed into common.c without any vendor-specific logic.

2021-05-13 02:57:59

by Dan Williams

Subject: Re: [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX.

Let's add a spec reference:

See section "4.1 ACPI-MADT-AP-Wakeup Table" in the Guest-Host
Communication Interface specification for TDX.

There is not much "wake protocol" in this patch, though; this
appears to be the end of the process, after the CPU has been messaged
to start.

> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode. For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/include/asm/realmode.h | 1 +
> arch/x86/kernel/smpboot.c | 5 +++
> arch/x86/realmode/rm/header.S | 1 +
> arch/x86/realmode/rm/trampoline_64.S | 49 +++++++++++++++++++++++-
> arch/x86/realmode/rm/trampoline_common.S | 5 ++-
> 5 files changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 5db5d083c873..5066c8b35e7c 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
> u32 sev_es_trampoline_start;
> #endif
> #ifdef CONFIG_X86_64
> + u32 trampoline_start64;
> u32 trampoline_pgd;
> #endif
> /* ACPI S3 wakeup */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 16703c35a944..27d8491d753a 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1036,6 +1036,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
> unsigned long boot_error = 0;
> unsigned long timeout;
>
> +#ifdef CONFIG_X86_64
> + if (is_tdx_guest())
> + start_ip = real_mode_header->trampoline_start64;
> +#endif

Perhaps wrap this into an inline helper in
arch/x86/include/asm/realmode.h so that this routine only does one
assignment to @start_ip at function entry?

> +
> idle->thread.sp = (unsigned long)task_pt_regs(idle);
> early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
> initial_code = (unsigned long)start_secondary;
> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
> index 8c1db5bf5d78..2eb62be6d256 100644
> --- a/arch/x86/realmode/rm/header.S
> +++ b/arch/x86/realmode/rm/header.S
> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
> .long pa_sev_es_trampoline_start
> #endif
> #ifdef CONFIG_X86_64
> + .long pa_trampoline_start64
> .long pa_trampoline_pgd;
> #endif
> /* ACPI S3 wakeup */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 84c5d1b33d10..12b734b1da8b 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
> movl %eax, %cr3
>
> # Set up EFER
> + movl $MSR_EFER, %ecx
> + rdmsr
> + cmp pa_tr_efer, %eax
> + jne .Lwrite_efer
> + cmp pa_tr_efer + 4, %edx
> + je .Ldone_efer
> +.Lwrite_efer:
> movl pa_tr_efer, %eax
> movl pa_tr_efer + 4, %edx
> - movl $MSR_EFER, %ecx
> wrmsr

Is this hunk just a performance optimization to save an unnecessary
wrmsr when it is pre-populated with the right value? Is it required
for this patch? If "yes", that was not clear to me from the changelog;
if "no", it seems like it belongs in a standalone optimization patch.

>
> +.Ldone_efer:
> # Enable paging and in turn activate Long Mode
> - movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
> + movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax

It seems setting X86_CR0_NE is redundant when coming through
pa_trampoline_compat; is this a standalone fix to make sure that
'numeric-error' is enabled before startup_64?

> movl %eax, %cr0
>
> /*
> @@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
> ljmpl $__KERNEL_CS, $pa_startup_64
> SYM_CODE_END(startup_32)
>
> +SYM_CODE_START(pa_trampoline_compat)
> + /*
> + * In compatibility mode. Prep ESP and DX for startup_32, then disable
> + * paging and complete the switch to legacy 32-bit mode.
> + */
> + movl $rm_stack_end, %esp
> + movw $__KERNEL_DS, %dx
> +
> + movl $(X86_CR0_NE | X86_CR0_PE), %eax
> + movl %eax, %cr0
> + ljmpl $__KERNEL32_CS, $pa_startup_32
> +SYM_CODE_END(pa_trampoline_compat)
> +
> .section ".text64","ax"
> .code64
> .balign 4
> @@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
> jmpq *tr_start(%rip)
> SYM_CODE_END(startup_64)
>
> +SYM_CODE_START(trampoline_start64)
> + /*
> + * APs start here on a direct transfer from 64-bit BIOS with identity
> + * mapped page tables. Load the kernel's GDT in order to gear down to
> + * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
> + * segment registers. Load the zero IDT so any fault triggers a
> + * shutdown instead of jumping back into BIOS.
> + */
> + lidt tr_idt(%rip)
> + lgdt tr_gdt64(%rip)
> +
> + ljmpl *tr_compat(%rip)
> +SYM_CODE_END(trampoline_start64)
> +
> .section ".rodata","a"
> # Duplicate the global descriptor table
> # so the kernel can live anywhere
> @@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
> .quad 0x00cf93000000ffff # __KERNEL_DS
> SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>
> +SYM_DATA_START(tr_gdt64)
> + .short tr_gdt_end - tr_gdt - 1 # gdt limit
> + .long pa_tr_gdt
> + .long 0
> +SYM_DATA_END(tr_gdt64)
> +
> +SYM_DATA_START(tr_compat)
> + .long pa_trampoline_compat
> + .short __KERNEL32_CS
> +SYM_DATA_END(tr_compat)
> +
> .bss
> .balign PAGE_SIZE
> SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
> index 5033e640f957..506d5897112a 100644
> --- a/arch/x86/realmode/rm/trampoline_common.S
> +++ b/arch/x86/realmode/rm/trampoline_common.S
> @@ -1,4 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> .section ".rodata","a"
> .balign 16
> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
> +SYM_DATA_START_LOCAL(tr_idt)
> + .short 0
> + .quad 0
> +SYM_DATA_END(tr_idt)

Curious, is the following not equivalent?

-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_LOCAL(tr_idt, .fill 1, 10, 0)

2021-05-13 03:06:13

by Dan Williams

Subject: Re: [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Avoid operations which will inject #VE during compressed
> boot, which is obviously fatal for TDX platforms.
>
> Details are,
>
> 1. TDX module injects #VE if a TDX guest attempts to write
> EFER. So skip the WRMSR to set EFER.LME=1 if it's already
> set. TDX also forces EFER.LME=1, i.e. the branch will always
> be taken and thus the #VE avoided.

Ah, here's the justification for that hunk in the previous patch. Are
you sure that hunk belongs in the trampoline patch?

>
> 2. TDX module also injects a #VE if the guest attempts to clear
> CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
> boot. Setting CR0.NE should be a nop on all CPUs that
> support 64-bit mode.

Ah, here's the justification for CR0.NE in the previous patch. Did
something go wrong in the patch splitting?

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/boot/compressed/head_64.S | 5 +++--
> arch/x86/boot/compressed/pgtable.h | 2 +-
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index e94874f4bbc1..37c2f37d4a0d 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
> movl $MSR_EFER, %ecx
> rdmsr
> btsl $_EFER_LME, %eax
> + jc 1f
> wrmsr
> - popl %edx
> +1: popl %edx
> popl %ecx
>
> /* Enable PAE and LA57 (if required) paging modes */
> @@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
> pushl %eax
>
> /* Enable paging again */
> - movl $(X86_CR0_PG | X86_CR0_PE), %eax
> + movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
> movl %eax, %cr0
>
> lret
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 6ff7e81b5628..cc9b2529a086 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,7 +6,7 @@
> #define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
>
> #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
> +#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
>
> #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
>
> --
> 2.25.1
>

2021-05-13 03:25:41

by Dan Williams

Subject: Re: [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Skip writing EFER during secondary_startup_64() if the current value is
> also the desired value. This avoids a #VE when running as a TDX guest,
> as the TDX-Module does not allow writes to EFER (even when writing the
> current, fixed value).
>
> Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
> when running as a TDX guest. The TDX-Module (effectively part of the
> hypervisor) requires CR4.MCE to be set at all times and injects a #VE
> if the guest attempts to clear CR4.MCE.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/boot/compressed/head_64.S | 5 ++++-
> arch/x86/kernel/head_64.S | 13 +++++++++++--
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index 37c2f37d4a0d..2d79e5f97360 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
> popl %ecx
>
> /* Enable PAE and LA57 (if required) paging modes */
> - movl $X86_CR4_PAE, %eax
> + movl %cr4, %eax
> + /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
> + andl $X86_CR4_MCE, %eax
> + orl $X86_CR4_PAE, %eax
> testl %edx, %edx
> jz 1f
> orl $X86_CR4_LA57, %eax
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 04bddaaba8e2..92c77cf75542 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> 1:
>
> /* Enable PAE mode, PGE and LA57 */
> - movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
> + movq %cr4, %rcx
> + /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
> + andl $X86_CR4_MCE, %ecx
> + orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
> #ifdef CONFIG_X86_5LEVEL
> testl $1, __pgtable_l5_enabled(%rip)
> jz 1f
> @@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> /* Setup EFER (Extended Feature Enable Register) */
> movl $MSR_EFER, %ecx
> rdmsr
> + movl %eax, %edx

Maybe comment that EFER is being saved here to check if the following
enables are nops, but not a big deal.

Reviewed-by: Dan Williams <[email protected]>

...modulo whether the EFER wrmsr avoidance in PATCH 21 should move here.

> btsl $_EFER_SCE, %eax /* Enable System Call */
> btl $20,%edi /* No Execute supported? */
> jnc 1f
> btsl $_EFER_NX, %eax
> btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1: wrmsr /* Make changes effective */
>
> + /* Skip the WRMSR if the current value matches the desired value. */
> +1: cmpl %edx, %eax
> + je 1f
> + xor %edx, %edx
> + wrmsr /* Make changes effective */
> +1:
> /* Setup cr0 */
> movl $CR0_STATE, %eax
> /* Make changes effective */
> --
> 2.25.1
>

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/12/21 8:53 AM, Sean Christopherson wrote:
> On Wed, May 12, 2021, Dave Hansen wrote:
>> On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
>>>> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
>>>> code in a file named "common". I'd make an attempt to keep them
>>>> separate and then call into the two separate functions *from* the common
>>>> function.
>>> But why? What good does the additional level of indirection bring?
>>>
>>> It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
>>> or Intel specific. If a function can cover both vendors I don't see a
>>> point for additional complexity.
>>
>> Because the code is already separate. You're actually going to some
>> trouble to move the SEV-specific code and then combine it with the
>> TDX-specific code.
>>
>> Anyway, please just give it a shot. Should take all of ten minutes. If
>> it doesn't work out in practice, fine. You'll have a good paragraph for
>> the changelog.
>
> Or maybe wait to see how Boris' proposed protected_guest_has() pans out? E.g. if
> we can do "protected_guest_has(MEMORY_ENCRYPTION)" or whatever, then the truly
> common bits could be placed into common.c without any vendor-specific logic.

How about the following abstraction? This patch was initially created to let us
use is_tdx_guest() outside of arch/x86 code, but I extended it to support bitmap flags.

commit 188bdd3c97e49020b2bda9efd992a22091423b85
Author: Kuppuswamy Sathyanarayanan <[email protected]>
Date: Wed May 12 11:35:13 2021 -0700

tdx: Introduce generic protected_guest abstraction

Add a generic way to check if we run as an encrypted guest,
without requiring x86-specific ifdefs. This can then be used in
non-architecture-specific code. Enable this when running under
TDX/SEV.

Also add helper functions to set/test encrypted guest feature
flags.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..98c30312555b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -956,6 +956,9 @@ config HAVE_ARCH_NVRAM_OPS
config ISA_BUS_API
def_bool ISA

+config ARCH_HAS_PROTECTED_GUEST
+ bool
+
#
# ABI hall of shame
#
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 07fb4df1d881..001487c21874 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
select PARAVIRT_XXL
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
+ select ARCH_HAS_PROTECTED_GUEST
help
Provide support for running in a trusted domain on Intel processors
equipped with Trusted Domain eXtensions. TDX is a new Intel
@@ -1537,6 +1538,7 @@ config AMD_MEM_ENCRYPT
select ARCH_USE_MEMREMAP_PROT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+ select ARCH_HAS_PROTECTED_GUEST
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ccab6cf91283..8260893c34ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -21,6 +21,7 @@
#include <linux/usb/xhci-dbgp.h>
#include <linux/static_call.h>
#include <linux/swiotlb.h>
+#include <linux/protected_guest.h>

#include <uapi/linux/mount.h>

@@ -107,6 +108,10 @@ static struct resource bss_resource = {
.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};

+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+DECLARE_BITMAP(protected_guest_flags, PROTECTED_GUEST_BITMAP_LEN);
+EXPORT_SYMBOL(protected_guest_flags);
+#endif

#ifdef CONFIG_X86_32
/* CPU data as detected by the assembly code in head_32.S */
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 04a780abb512..45b848ec8325 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -19,6 +19,7 @@
#include <linux/memblock.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/protected_guest.h>

#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
@@ -680,6 +681,9 @@ static void __init init_ghcb(int cpu)

data->ghcb_active = false;
data->backup_ghcb_active = false;
+
+ set_protected_guest_flag(GUEST_TYPE_SEV);
+ set_protected_guest_flag(MEMORY_ENCRYPTION);
}

void __init sev_es_init_vc_handling(void)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 4dfacde05f0c..d0207b990fe4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
#include <asm/vmx.h>

#include <linux/cpu.h>
+#include <linux/protected_guest.h>

static struct {
unsigned int gpa_width;
@@ -92,6 +93,9 @@ void __init tdx_early_init(void)

setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+ set_protected_guest_flag(GUEST_TYPE_TDX);
+ set_protected_guest_flag(MEMORY_ENCRYPTION);
+
tdg_get_info();

pr_info("TDX guest is initialized\n");
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..44e8c642654c
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,37 @@
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+#define PROTECTED_GUEST_BITMAP_LEN 128
+
+/* Protected Guest vendor types */
+#define GUEST_TYPE_TDX (1)
+#define GUEST_TYPE_SEV (2)
+
+/* Protected Guest features */
+#define MEMORY_ENCRYPTION (20)
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+extern DECLARE_BITMAP(protected_guest_flags, PROTECTED_GUEST_BITMAP_LEN);
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+ return test_bit(flag, protected_guest_flags);
+}
+
+static inline void set_protected_guest_flag(unsigned long flag)
+{
+ __set_bit(flag, protected_guest_flags);
+}
+
+static inline bool is_protected_guest(void)
+{
+ return protected_guest_has(GUEST_TYPE_TDX) ||
+ protected_guest_has(GUEST_TYPE_SEV);
+}
+#else
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+static inline void set_protected_guest_flag(unsigned long flag) { }
+static inline bool is_protected_guest(void) { return false; }
+#endif
+
+#endif


>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-13 22:52:52

by Dave Hansen

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>
> +#define PROTECTED_GUEST_BITMAP_LEN    128
> +
> +/* Protected Guest vendor types */
> +#define GUEST_TYPE_TDX            (1)
> +#define GUEST_TYPE_SEV            (2)
> +
> +/* Protected Guest features */
> +#define MEMORY_ENCRYPTION        (20)

I was assuming we'd reuse the X86_FEATURE infrastructure somehow. Is
there a good reason not to?

That gives us all the compile-time optimization (via
en/disabled-features.h) and static branches for "free".

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/13/21 10:49 AM, Dave Hansen wrote:
> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>> +
>> +/* Protected Guest vendor types */
>> +#define GUEST_TYPE_TDX            (1)
>> +#define GUEST_TYPE_SEV            (2)
>> +
>> +/* Protected Guest features */
>> +#define MEMORY_ENCRYPTION        (20)
>
> I was assuming we'd reuse the X86_FEATURE infrastructure somehow. Is
> there a good reason not to?

My assumption is that the protected guest abstraction can also be used by
non-x86 arches in the future. So I have tried to keep these definitions
in common code.


>
> That gives us all the compile-time optimization (via
> en/disabled-features.h) and static branches for "free".
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls



On 5/12/21 7:29 AM, Dave Hansen wrote:
> On 5/12/21 7:10 AM, Kuppuswamy, Sathyanarayanan wrote:
>> On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
>>> This has to be compiled only for TDX+KVM.
>>
>> Got it. So if we want to remove the "C" file include, we will have to
>> add ifdef CONFIG_KVM_GUEST in the Makefile.
>>
>> ifdef CONFIG_KVM_GUEST
>> obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
>> endif
>
> Is there truly no dependency between CONFIG_KVM_GUEST and
> CONFIG_INTEL_TDX_GUEST?

We want to re-use TDX code with other hypervisors/guests as well, so
we can't create a direct dependency on CONFIG_KVM_GUEST in Kconfig.

>
> If there isn't, then the way we do it is adding another (invisible)
> Kconfig variable to express the dependency for tdx-kvm.o:
>
> config INTEL_TDX_GUEST_KVM
> bool
> depends on KVM_GUEST && INTEL_TDX_GUEST

Currently it will only be used for the KVM hypercall code. Would it be
overkill to create a new config option instead of #ifdefs for this use case?
But if this is the preferred approach, I will go with your suggestion.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-13 23:31:28

by Dave Hansen

Subject: Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls

On 5/13/21 12:29 PM, Kuppuswamy, Sathyanarayanan wrote:
>> If there isn't, then the way we do it is adding another (invisible)
>> Kconfig variable to express the dependency for tdx-kvm.o:
>>
>> config INTEL_TDX_GUEST_KVM
>>     bool
>>     depends on KVM_GUEST && INTEL_TDX_GUEST
>
> Currently it will only be used for the KVM hypercall code. Would it be
> overkill to create a new config option instead of #ifdefs for this use case?
> But if this is the preferred approach, I will go with your suggestion.

You'll see this done lots of different (valid) ways across the kernel.
(#ifdef'd #including of C files is not one of them.)

*My* preference is to use Kconfig in the way I described. It keeps
makefiles and #ifdef's clean and obvious, relegating the logic to Kconfig.

2021-05-13 23:33:51

by Andi Kleen

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code


On 5/13/2021 10:49 AM, Dave Hansen wrote:
> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>> +
>> +/* Protected Guest vendor types */
>> +#define GUEST_TYPE_TDX            (1)
>> +#define GUEST_TYPE_SEV            (2)
>> +
>> +/* Protected Guest features */
>> +#define MEMORY_ENCRYPTION        (20)
> I was assuming we'd reuse the X86_FEATURE infrastructure somehow. Is
> there a good reason not to?


This is for generic code. It would be a gigantic lift, with lots of
refactoring, to move that out.

>
> That gives us all the compile-time optimization (via
> en/disabled-features.h) and static branches for "free".

There's no user so far which is anywhere near performance-critical, so
that would be total overkill.

BTW right now I'm not even sure we need the bitmap for anything, but I
guess it doesn't hurt.

-Andi



2021-05-13 23:34:54

by Andi Kleen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest


On 5/7/2021 2:36 PM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> ...
>> The #VE cannot be nested before TDGETVEINFO is called, if there is any
>> reason for it to nest the TD would shut down. The TDX module guarantees
>> that no NMIs (or #MC or similar) can happen in this window. After
>> TDGETVEINFO the #VE handler can nest if needed, although we don’t expect
>> it to happen normally.
> I think this description really needs some work. Does "The #VE cannot
> be nested" mean that "hardware guarantees that #VE will not be
> generated", or "the #VE must not be nested"?

The next half sentence answers this question..

"if there is any reason for it to nest the TD would shut down."

So it cannot nest.


>
> What does "the TD would shut down" mean? I think you mean that instead
> of delivering a nested #VE the hardware would actually exit to the host
> and TDX would prevent the guest from being reentered. Right?


Yes, that's a shutdown. I suppose we could add your sentence.


> I find that description a bit unsatisfying. Could we make this a bit
> more concrete?


I don't see what could be added. If you have concrete suggestions, please
just propose something.


> By the way, what about *normal* interrupts?


Normal interrupts are of course blocked, as in every other exception or
interrupt entry.

>
> Maybe we should talk about this in terms of *rules* that folks need to
> follow. Maybe:
>
> NMIs and machine checks are suppressed. Before this point any
> #VE is fatal. After this point, NMIs and additional #VEs are
> permitted.

Okay that's fine for me.


-Andi




2021-05-14 07:00:59

by Dave Hansen

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On 5/13/21 12:38 PM, Andi Kleen wrote:
>
> On 5/13/2021 10:49 AM, Dave Hansen wrote:
>> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>>> +
>>> +/* Protected Guest vendor types */
>>> +#define GUEST_TYPE_TDX            (1)
>>> +#define GUEST_TYPE_SEV            (2)
>>> +
>>> +/* Protected Guest features */
>>> +#define MEMORY_ENCRYPTION        (20)
>> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
>> there a good reason not to?
>
> This is for generic code. It would be a gigantic lift, with lots of
> refactoring, to move that out.

Ahh, forgot about that. The whole "x86/mm" subject threw me off.

>> That gives us all the compile-time optimization (via
>> en/disabled-features.h) and static branches for "free".
>
> There's no user so far which is anywhere near performance-critical, so
> that would be total overkill.

The *REALLY* nice thing is that it keeps you from having to create stub
functions or #ifdefs and yet the compiler can still optimize the code to
nothing.

Anyway, thanks for the clarification about it being in non-arch code.

2021-05-14 07:01:50

by Dave Hansen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On 5/13/21 12:47 PM, Andi Kleen wrote:
> "if there is any reason for it to nest the TD would shut down."

The TDX EAS says:

> If, when attempting to inject a #VE, the Intel TDX module discovers
> that the guest TD has not yet retrieved the information for a
> previous #VE (i.e., VE_INFO.VALID is not 0), the TDX module injects a
> #DF into the guest TD to indicate a #VE overrun.

How does that result in a shut down?

2021-05-14 07:02:44

by Dave Hansen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On 5/13/21 12:47 PM, Andi Kleen wrote:
> I don't see what could be added. If you have concrete suggestions please
> just propose something.

Oh, boy, I love writing changelogs! I was hoping that the TDX folks
would chip in to write their own changelogs, but oh well. You made my day!

--

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either userspace or the kernel:

* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to TD-shared memory, which includes MMIO

#VE exceptions are never generated on accesses to normal, TD-private memory.

The entry paths do not access TD-shared memory or use those specific
MSRs, instructions, or CPUID leaves. In addition, all interrupts including
NMIs are blocked by the hardware starting with #VE delivery until
TDGETVEINFO is called. This eliminates the chance of a #VE during the
syscall gap or paranoid entry paths and simplifies #VE handling.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest.

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace. Later patches will add handling for specific #VE scenarios.

Convert unhandled #VE's (everything, until later in this series) so that
they appear just like a #GP by calling do_general_protection() directly.

--

Did I miss anything?

2021-05-14 08:12:42

by Andi Kleen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest


On 5/13/2021 1:07 PM, Dave Hansen wrote:
> On 5/13/21 12:47 PM, Andi Kleen wrote:
>> "if there is any reason for it to nest the TD would shut down."
> The TDX EAS says:
>
>> If, when attempting to inject a #VE, the Intel TDX module discovers
>> that the guest TD has not yet retrieved the information for a
>> previous #VE (i.e., VE_INFO.VALID is not 0), the TDX module injects a
>> #DF into the guest TD to indicate a #VE overrun.
> How does that result in a shut down?


You're right. It's not a shutdown, but a panic. We'll need to fix the
comment and replace 'shutdown' with 'panic'.


-Andi






2021-05-18 23:29:17

by Sean Christopherson

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On Thu, May 13, 2021, Andi Kleen wrote:
>
> On 5/13/2021 10:49 AM, Dave Hansen wrote:
> > On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
> > > +#define PROTECTED_GUEST_BITMAP_LEN	128
> > > +
> > > +/* Protected Guest vendor types */
> > > +#define GUEST_TYPE_TDX			(1)
> > > +#define GUEST_TYPE_SEV			(2)
> > > +
> > > +/* Protected Guest features */
> > > +#define MEMORY_ENCRYPTION		(20)
> > I was assuming we'd reuse the X86_FEATURE infrastructure somehow. Is
> > there a good reason not to?
>
> This is for generic code. It would be a gigantic lift, with lots of
> refactoring, to move that out.

What generic code needs access to SEV vs. TDX? force_dma_unencrypted() is called
from generic code, but its implementation is x86 specific.

> > That gives us all the compile-time optimization (via
> > en/disabled-features.h) and static branches for "free".
>
> There's no user so far which is anywhere near performance critical, so that
> would be total overkill.

SEV already has the sev_enable_key static key that it uses for unrolling string
I/O, so there's at least one (debatable) case that wants to use static branches.

For SEV-ES and TDX, there's a better argument as using X86_FEATURE_* would unlock
alternatives.

> BTW right now I'm not even sure we need the bitmap for anything, but I guess
> it doesn't hurt.
>
> -Andi
>
>

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/17/21 11:16 AM, Sean Christopherson wrote:
> What generic code needs access to SEV vs. TDX? force_dma_unencrypted() is called
> from generic code, but its implementation is x86 specific.

When hardening the drivers for TDX usage, we will need to check
is_protected_guest() to add code specific to protected guests. Since this will
be outside arch/x86, we need a common framework for it.

A few examples:
* The ACPI sleep driver uses WBINVD (when doing cache flushes). We want to skip
it for TDX.
* Forcing virtio to use the DMA API when running with an untrusted host.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 00:45:53

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On 5/17/21 11:27 AM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/17/21 11:16 AM, Sean Christopherson wrote:
>> What generic code needs access to SEV vs. TDX?
>> force_dma_unencrypted() is called from generic code, but its
>> implementation is x86 specific.
>
> When the hardening the drivers for TDX usage, we will have
> requirement to check for is_protected_guest() to add code specific to
> protected guests. Since this will be outside arch/x86, we need common
> framework for it.

Just remember, a "common framework" doesn't mean that it can't be backed
by extremely arch-specific mechanisms.

For instance, there's a lot of pkey-specific code in mm/mprotect.c. It
still gets optimized away on x86 with all the goodness of X86_FEATUREs.

2021-05-19 00:45:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code

On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 11:27 AM, Kuppuswamy, Sathyanarayanan wrote:
> > On 5/17/21 11:16 AM, Sean Christopherson wrote:
> >> What generic code needs access to SEV vs. TDX?
> >> force_dma_unencrypted() is called from generic code, but its
> >> implementation is x86 specific.
> >
> > When the hardening the drivers for TDX usage, we will have
> > requirement to check for is_protected_guest() to add code specific to
> > protected guests. Since this will be outside arch/x86, we need common
> > framework for it.
>
> Just remember, a "common framework" doesn't mean that it can't be backed
> by extremely arch-specific mechanisms.
>
> For instance, there's a lot of pkey-specific code in mm/mprotect.c. It
> still gets optimized away on x86 with all the goodness of X86_FEATUREs.

Ya, exactly. Ideally, generic code shouldn't have to differentiate between SEV,
SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool is_protected_guest(void)" should
suffice. Under the hood, x86's implementation for is_protected_guest() can be
boot_cpu_has() checks (if we want).

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/17/21 11:37 AM, Sean Christopherson wrote:
>> Just remember, a "common framework" doesn't mean that it can't be backed
>> by extremely arch-specific mechanisms.
>>
>> For instance, there's a lot of pkey-specific code in mm/mprotect.c. It
>> still gets optimized away on x86 with all the goodness of X86_FEATUREs.
> Ya, exactly. Ideally, generic code shouldn't have to differentiate between SEV,
> SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool is_protected_guest(void)" should
> suffice. Under the hood, x86's implementation for is_protected_guest() can be
> boot_cpu_has() checks (if we want).

What about the use case of protected_guest_has(flag)? Do you want to call it
with X86_FEATURE_* flags outside arch/x86 code?


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 09:07:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code


On 5/17/2021 3:32 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/17/21 11:37 AM, Sean Christopherson wrote:
>>> Just remember, a "common framework" doesn't mean that it can't be
>>> backed
>>> by extremely arch-specific mechanisms.
>>>
>>> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
>>> still gets optimized away on x86 with all the goodness of X86_FEATUREs.
>> Ya, exactly.  Ideally, generic code shouldn't have to differentiate
>> between SEV,
>> SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool
>> is_protected_guest(void)" should
>> suffice.  Under the hood, x86's implementation for
>> is_protected_guest() can be
>> boot_cpu_has() checks (if we want).
>
> What about the use case of protected_guest_has(flag)? Do you want to
> call it with
> with X86_FEATURE_* flags outside arch/x86 code ?


I don't think we need any flags in the generic code. Just a simple bool
is enough.


-Andi


Subject: [RFC v2-fix 1/1] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT

From: "Kirill A. Shutemov" <[email protected]>

CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, the features supported under CONFIG_PARAVIRT
are self-sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time-related operations.

For a TDX guest as well, the paravirt calls under CONFIG_PARAVIRT
meet most of its requirements, except for the HLT and SAFE_HLT
paravirt calls, which are currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest-like
platforms, move the HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.

Moving the HLT and SAFE_HLT paravirt calls is not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
---

Changes since v1:
* Removed CONFIG_PARAVIRT_XXL
* Moved HLT and SAFE_HLT under CONFIG_PARAVIRT

arch/x86/include/asm/irqflags.h | 40 +++++++++++++++------------
arch/x86/include/asm/paravirt.h | 20 +++++++-------
arch/x86/include/asm/paravirt_types.h | 3 +-
arch/x86/kernel/paravirt.c | 4 ++-
4 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..6671744dbf3c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,6 +59,28 @@ static inline __cpuidle void native_halt(void)

#endif

+#ifndef CONFIG_PARAVIRT
+#ifndef __ASSEMBLY__
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline __cpuidle void arch_safe_halt(void)
+{
+ native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline __cpuidle void halt(void)
+{
+ native_halt();
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT */
+
#ifdef CONFIG_PARAVIRT_XXL
#include <asm/paravirt.h>
#else
@@ -80,24 +102,6 @@ static __always_inline void arch_local_irq_enable(void)
native_irq_enable();
}

-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static inline __cpuidle void arch_safe_halt(void)
-{
- native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static inline __cpuidle void halt(void)
-{
- native_halt();
-}
-
/*
* For spinlocks, etc:
*/
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..5d967bce8937 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,16 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
PVOP_VCALL1(mmu.exit_mmap, mm);
}

+static inline void arch_safe_halt(void)
+{
+ PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+ PVOP_VCALL0(irq.halt);
+}
+
#ifdef CONFIG_PARAVIRT_XXL
static inline void load_sp0(unsigned long sp0)
{
@@ -145,16 +155,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
}

-static inline void arch_safe_halt(void)
-{
- PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
- PVOP_VCALL0(irq.halt);
-}
-
static inline void wbinvd(void)
{
PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..68bf35ce6dd5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,10 +177,9 @@ struct pv_irq_ops {
struct paravirt_callee_save save_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
void (*safe_halt)(void);
void (*halt)(void);
-#endif
} __no_randomize_layout;

struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..b001f5aaee4a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
.irq.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.irq.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
.irq.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+
+ /* Irq HLT ops. */
.irq.safe_halt = native_safe_halt,
.irq.halt = native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */

/* Mmu ops. */
.mmu.flush_tlb_user = native_flush_tlb_local,
--
2.25.1


Subject: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls

From: "Kirill A. Shutemov" <[email protected]>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL.

[Isaku: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since RFC v2:
* Introduced INTEL_TDX_GUEST_KVM config for TDX+KVM related changes.
* Removed "C" include file.
* Fixed commit log as per Dave's comments.

arch/x86/Kconfig | 6 +++++
arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
arch/x86/include/asm/tdx.h | 41 ++++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/tdcall.S | 20 ++++++++++++++
arch/x86/kernel/tdx-kvm.c | 48 +++++++++++++++++++++++++++++++++
6 files changed, 137 insertions(+)
create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..768df1b98487 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
run in a CPU mode that protects the confidentiality of TD memory
contents and the TD’s CPU state from other software, including VMM.

+config INTEL_TDX_GUEST_KVM
+ def_bool y
+ depends on KVM_GUEST && INTEL_TDX_GUEST
+ help
+ This option enables KVM specific hypercalls in TDX guest.
+
endif #HYPERVISOR_GUEST

source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
#include <asm/alternative.h>
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>

extern void kvmclock_init(void);

@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall0(nr);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall1(nr, p1);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall2(nr, p1, p2);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..eb758b506dba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,45 @@ static inline void tdx_early_init(void) { };

#endif /* CONFIG_INTEL_TDX_GUEST */

+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out);
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+ unsigned long p2)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7966c10ea8d1..a90fec004844 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index a484c4aef6e6..3c57a1d67b79 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -25,6 +25,8 @@
TDG_R12 | TDG_R13 | \
TDG_R14 | TDG_R15 )

+#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
+
/*
* TDX guests use the TDCALL instruction to make requests to the
* TDX module and hypercalls to the VMM. It is supported in
@@ -213,3 +215,21 @@ SYM_FUNC_START(__tdx_hypercall)
call do_tdx_hypercall
retq
SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+ /*
+ * R10 is not part of the function call ABI, but it is a part
+ * of the TDVMCALL ABI. So set it before making call to the
+ * do_tdx_hypercall().
+ */
+ movq $TDVMCALL_VENDOR_KVM, %r10
+ call do_tdx_hypercall
+ retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..b21453a81e38
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static long tdx_kvm_hypercall(unsigned int fn, unsigned long r12,
+ unsigned long r13, unsigned long r14,
+ unsigned long r15)
+{
+ return __tdx_hypercall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+ return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3)
+{
+ return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4)
+{
+ return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
--
2.25.1


Subject: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest

From: "Kirill A. Shutemov" <[email protected]>

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

In the settings that Linux will run in, virtualization exceptions are never
generated on accesses to normal, TD-private memory that has been
accepted.

The entry paths do not access TD-shared memory, MMIO regions or use
those specific MSRs, instructions, CPUID leaves that might generate #VE.
In addition, all interrupts including NMIs are blocked by the hardware
starting with #VE delivery until TDGETVEINFO is called.  This eliminates
the chance of a #VE during the syscall gap or paranoid entry paths and
simplifies #VE handling.

After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
although we don't expect it to happen because we don't expect NMIs to
trigger #VEs. Another case where they could happen is if the #VE
exception panics, but in this case there are no guarantees on anything
anyways.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest which will result in an oops (and should eventually be a panic, as
we would like to set panic_on_oops to 1 for TDX guests).

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VE's (everything, until later in this series) so that
they appear just like a #GP by calling ve_raise_fault() directly.
ve_raise_fault() is similar to the #GP handler and is responsible for
sending SIGSEGV to user space, calling die() for kernel faults, and
notifying debuggers and other die-chain users.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v1:
* Removed [RFC v2 07/32] x86/traps: Add do_general_protection() helper function.
* Instead of reusing the #GP handler, defined a custom handler.
* Fixed commit log as per review comments.

arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/tdx.h | 20 ++++++++++
arch/x86/kernel/idt.c | 6 +++
arch/x86/kernel/tdx.c | 35 +++++++++++++++++
arch/x86/kernel/traps.c | 70 +++++++++++++++++++++++++++++++++
5 files changed, 135 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1d75be21a09b..8ab4067afefc 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -11,6 +11,7 @@
#include <linux/types.h>

#define TDINFO 1
+#define TDGETVEINFO 3

struct tdx_module_output {
u64 rcx;
@@ -29,6 +30,25 @@ struct tdx_hypercall_output {
u64 r15;
};

+/*
+ * Used by #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ u64 gla;
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve);
+
/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};

/*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
INTG(X86_TRAP_AC, asm_exc_alignment_check),
INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif

#ifdef CONFIG_X86_32
TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 4dfacde05f0c..b5fffbd86331 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -85,6 +85,41 @@ static void tdg_get_info(void)
td_info.attributes = out.rdx;
}

+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+ u64 ret;
+ struct tdx_module_output out = {0};
+
+ /*
+ * NMIs and machine checks are suppressed. Before this point any
+ * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+ * additional #VEs are permitted (but we don't expect them to
+ * happen unless you panic).
+ */
+ ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = out.r10 & UINT_MAX;
+ ve->instr_info = out.r10 >> 32;
+
+ return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve)
+{
+ /*
+ * TODO: Add handler support for various #VE exit
+ * reasons. It will be added by other patches in
+ * the series.
+ */
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EFAULT;
+}
+
void __init tdx_early_init(void)
{
if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..af8efa2e57ba 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1137,6 +1138,75 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}

+#define VEFSTR "VE fault"
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+ struct task_struct *tsk = current;
+
+ if (user_mode(regs)) {
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * Not fixing up VDSO exceptions similar to #GP handler
+ * because we don't expect the VDSO to trigger #VE.
+ */
+ show_signal(tsk, SIGSEGV, "", VEFSTR, regs, error_code);
+ force_sig(SIGSEGV);
+ return;
+ }
+
+
+ if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+ return;
+
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * To be potentially processing a kprobe fault and to trust the result
+ * from kprobe_running(), we have to be non-preemptible.
+ */
+ if (!preemptible() &&
+ kprobe_running() &&
+ kprobe_fault_handler(regs, X86_TRAP_VE))
+ return;
+
+ notify_die(DIE_GPF, VEFSTR, regs, error_code, X86_TRAP_VE, SIGSEGV);
+
+ die_addr(VEFSTR, regs, error_code, 0);
+}
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+ int ret;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+ /*
+ * NMIs/Machine-checks/Interrupts will be in a disabled state
+ * till TDGETVEINFO TDCALL is executed. This prevents #VE
+ * nesting issue.
+ */
+ ret = tdg_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ if (!ret)
+ ret = tdg_handle_virtualization_exception(regs, &ve);
+ /*
+ * If tdg_handle_virtualization_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (ret)
+ ve_raise_fault(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.25.1


Subject: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

From: Sean Christopherson <[email protected]>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX. You can find MADT MP wake protocol details in ACPI specification
r6.4, sec 5.2.12.19.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode. For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Removed X86_CR0_NE and EFER-related changes from this change
and moved them to the patch titled "x86/boot: Avoid #VE during
boot for TDX platforms".
* Fixed commit log as per Dan's suggestion.
* Added inline get_trampoline_start_ip() to set start_ip.

arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/include/asm/realmode.h | 10 +++++++
arch/x86/kernel/smpboot.c | 2 +-
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 38 ++++++++++++++++++++++++
arch/x86/realmode/rm/trampoline_common.S | 7 ++++-
6 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..3328c8edb200 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
#endif
#ifdef CONFIG_X86_64
+ u32 trampoline_start64;
u32 trampoline_pgd;
#endif
/* ACPI S3 wakeup */
@@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
real_mode_header = (struct real_mode_header *) __va(mem);
}

+static inline unsigned long get_trampoline_start_ip(void)
+{
+#ifdef CONFIG_X86_64
+ if (is_tdx_guest())
+ return real_mode_header->trampoline_start64;
+#endif
+ return real_mode_header->trampoline_start;
+}
+
void reserve_real_mode(void);

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..0b4dff5e67a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
int *cpu0_nmi_registered)
{
/* start_ip had better be page-aligned! */
- unsigned long start_ip = real_mode_header->trampoline_start;
+ unsigned long start_ip = get_trampoline_start_ip();

unsigned long boot_error = 0;
unsigned long timeout;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long pa_sev_es_trampoline_start
#endif
#ifdef CONFIG_X86_64
+ .long pa_trampoline_start64
.long pa_trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..754f8d2ac9e8 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
ljmpl $__KERNEL_CS, $pa_startup_64
SYM_CODE_END(startup_32)

+SYM_CODE_START(pa_trampoline_compat)
+ /*
+ * In compatibility mode. Prep ESP and DX for startup_32, then disable
+ * paging and complete the switch to legacy 32-bit mode.
+ */
+ movl $rm_stack_end, %esp
+ movw $__KERNEL_DS, %dx
+
+ movl $(X86_CR0_NE | X86_CR0_PE), %eax
+ movl %eax, %cr0
+ ljmpl $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
.section ".text64","ax"
.code64
.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
jmpq *tr_start(%rip)
SYM_CODE_END(startup_64)

+SYM_CODE_START(trampoline_start64)
+ /*
+ * APs start here on a direct transfer from 64-bit BIOS with identity
+ * mapped page tables. Load the kernel's GDT in order to gear down to
+ * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+ * segment registers. Load the zero IDT so any fault triggers a
+ * shutdown instead of jumping back into BIOS.
+ */
+ lidt tr_idt(%rip)
+ lgdt tr_gdt64(%rip)
+
+ ljmpl *tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
.quad 0x00cf93000000ffff # __KERNEL_DS
SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)

+SYM_DATA_START(tr_gdt64)
+ .short tr_gdt_end - tr_gdt - 1 # gdt limit
+ .long pa_tr_gdt
+ .long 0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+ .long pa_trampoline_compat
+ .short __KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
.bss
.balign PAGE_SIZE
SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..ade7db208e4e 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,9 @@
/* SPDX-License-Identifier: GPL-2.0 */
.section ".rodata","a"
.balign 16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/* .fill cannot be used for size > 8. So use short and quad */
+SYM_DATA_START_LOCAL(tr_idt)
+ .short 0
+ .quad 0
+SYM_DATA_END(tr_idt)
--
2.25.1


Subject: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

From: "Kirill A. Shutemov" <[email protected]>

In traditional VMs, MMIO is usually implemented by giving a
guest access to a mapping which will cause a VMEXIT on access.
That's not possible in a TDX guest. So use #VE to implement MMIO
support. In a TDX guest, MMIO triggers a #VE with the EPT_VIOLATION
exit reason.

For now we only handle a subset of instructions that the kernel
uses for MMIO operations. User-space access triggers SIGBUS.

Also, the reasons for supporting #VE based MMIO in a TDX guest are:

* MMIO is widely used and we'll have more drivers in the future.
* We don't want to annotate every TDX specific MMIO readl/writel etc.
* If we didn't annotate, we would need to add an alternative for every
MMIO access in the kernel (even though 99.9% will never be used on
TDX), which would be a complete waste and incredible binary bloat
for nothing.
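
For reference, the opcode and register decode that the handler relies on
can be sketched in plain C. This is an illustrative, self-contained model:
the macros mirror the kernel's X86_MODRM_REG()/X86_REX_R(), but
mmio_size() and reg_number() are hypothetical helper names, not kernel
APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Redefined locally so the sketch is self-contained. */
#define X86_MODRM_REG(modrm)	(((modrm) >> 3) & 0x7)
#define X86_REX_R(rex)		(((rex) >> 2) & 0x1)

/* MOV r/m8 variants always force a 1-byte access regardless of opnd_bytes. */
static int mmio_size(uint8_t opcode, int opnd_bytes)
{
	switch (opcode) {
	case 0x88:	/* MOV r/m8, r8 */
	case 0x8A:	/* MOV r8, r/m8 */
	case 0xC6:	/* MOV r/m8, imm8 */
		return 1;
	}
	return opnd_bytes;
}

/* REX.R extends the 3-bit ModRM reg field to reach r8-r15. */
static int reg_number(uint8_t modrm, uint8_t rex)
{
	int regno = X86_MODRM_REG(modrm);

	if (X86_REX_R(rex))
		regno += 8;
	return regno;
}
```

The register number indexes the regoff[] table in the patch below to find
the corresponding slot in struct pt_regs.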

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Fixed commit log as per Dave's review.

arch/x86/kernel/tdx.c | 100 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b9e3010987e0..9330c7a9ad69 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,8 @@

#include <asm/tdx.h>
#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */

#include <linux/cpu.h>
#include <linux/protected_guest.h>
@@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
}
}

+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+ unsigned long val)
+{
+ return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
+ write, addr, val);
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+ static const int regoff[] = {
+ offsetof(struct pt_regs, ax),
+ offsetof(struct pt_regs, cx),
+ offsetof(struct pt_regs, dx),
+ offsetof(struct pt_regs, bx),
+ offsetof(struct pt_regs, sp),
+ offsetof(struct pt_regs, bp),
+ offsetof(struct pt_regs, si),
+ offsetof(struct pt_regs, di),
+ offsetof(struct pt_regs, r8),
+ offsetof(struct pt_regs, r9),
+ offsetof(struct pt_regs, r10),
+ offsetof(struct pt_regs, r11),
+ offsetof(struct pt_regs, r12),
+ offsetof(struct pt_regs, r13),
+ offsetof(struct pt_regs, r14),
+ offsetof(struct pt_regs, r15),
+ };
+ int regno;
+
+ regno = X86_MODRM_REG(insn->modrm.value);
+ if (X86_REX_R(insn->rex_prefix.value))
+ regno += 8;
+
+ return (void *)regs + regoff[regno];
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ int size;
+ bool write;
+ unsigned long *reg;
+ struct insn insn;
+ unsigned long val = 0;
+
+ /*
+ * User mode would mean the kernel exposed a device directly
+ * to ring3, which shouldn't happen except for things like
+ * DPDK.
+ */
+ if (user_mode(regs)) {
+ pr_err("Unexpected user-mode MMIO access.\n");
+ force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+ return 0;
+ }
+
+ kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+ insn_get_length(&insn);
+ insn_get_opcode(&insn);
+
+ write = ve->exit_qual & 0x2;
+
+ size = insn.opnd_bytes;
+ switch (insn.opcode.bytes[0]) {
+ /* MOV r/m8 r8 */
+ case 0x88:
+ /* MOV r8 r/m8 */
+ case 0x8A:
+ /* MOV r/m8 imm8 */
+ case 0xC6:
+ size = 1;
+ break;
+ }
+
+ if (inat_has_immediate(insn.attr)) {
+ BUG_ON(!write);
+ val = insn.immediate.value;
+ tdg_mmio(size, write, ve->gpa, val);
+ return insn.length;
+ }
+
+ BUG_ON(!inat_has_modrm(insn.attr));
+
+ reg = get_reg_ptr(regs, &insn);
+
+ if (write) {
+ memcpy(&val, reg, size);
+ tdg_mmio(size, write, ve->gpa, val);
+ } else {
+ val = tdg_mmio(size, write, ve->gpa, val);
+ memset(reg, 0, size);
+ memcpy(reg, &val, size);
+ }
+ return insn.length;
+}
+
unsigned long tdg_get_ve_info(struct ve_info *ve)
{
u64 ret;
@@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_IO_INSTRUCTION:
tdg_handle_io(regs, ve->exit_qual);
break;
+ case EXIT_REASON_EPT_VIOLATION:
+ ve->instr_len = tdg_handle_mmio(regs, ve);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1


Subject: [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms

From: Sean Christopherson <[email protected]>

Avoid operations that would inject a #VE during the boot process,
which would obviously be fatal for TDX platforms.

Details are,

1. TDX module injects #VE if a TDX guest attempts to write
   EFER.
   
   Boot code updates EFER in following cases:
   
   * When enabling Long Mode configuration, EFER.LME bit will
     be set. Since TDX forces EFER.LME=1, we can skip updating
     it again. Check for EFER.LME before updating it and skip
     it if it is already set.

   * EFER is also updated to enable support for features like
     SYSCALL and No-Execute (NX) pages. In TDX, these
     features are set up by the TDX module. So check whether
     each is already enabled, and skip enabling it again.
   
2. TDX module also injects a #VE if the guest attempts to clear
   CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
   boot. Setting CR0.NE should be a no-op on all CPUs that
   support 64-bit mode.
   
3. The TDX-Module (effectively part of the hypervisor) requires
   CR4.MCE to be set at all times and injects a #VE if the guest
   attempts to clear CR4.MCE. So, preserve CR4.MCE instead of
   clearing it during boot to avoid #VE.
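
The register fix-ups above reduce to simple mask arithmetic, sketched
below in C. The bit values mirror the kernel's definitions, but
boot_cr4() and efer_needs_wrmsr() are hypothetical helpers written only
to illustrate the pattern the assembly implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the kernel's CR4 bit definitions. */
#define X86_CR4_MCE	(1UL << 6)
#define X86_CR4_PAE	(1UL << 5)

/*
 * CR4 fix-up: start from the current value, keep only MCE (clearing it
 * would inject a #VE under TDX), then OR in the bits the boot path needs.
 */
static unsigned long boot_cr4(unsigned long cur_cr4)
{
	return (cur_cr4 & X86_CR4_MCE) | X86_CR4_PAE;
}

/*
 * EFER fix-up: only issue the potentially #VE-generating WRMSR when the
 * desired value actually differs from the current one.
 */
static bool efer_needs_wrmsr(uint64_t cur, uint64_t desired)
{
	return cur != desired;
}
```

On non-TDX hardware the same sequence is harmless: if MCE was clear it
stays clear, and a skipped WRMSR to an already-correct EFER changes
nothing.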

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Merged Avoid #VE related changes together.
* [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot
for TDX platforms
* [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process.
* Fixed commit log as per review comments.

arch/x86/boot/compressed/head_64.S | 10 +++++++---
arch/x86/kernel/head_64.S | 13 +++++++++++--
arch/x86/realmode/rm/trampoline_64.S | 11 +++++++++--
3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,16 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx

/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ movl %cr4, %eax
+ /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
+ andl $X86_CR4_MCE, %eax
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
@@ -636,7 +640,7 @@ SYM_CODE_START(trampoline_32bit_src)
pushl %eax

/* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

lret
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:

/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ movq %cr4, %rcx
+ /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
+ andl $X86_CR4_MCE, %ecx
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */

+ /* Skip the WRMSR if the current value matches the desired value. */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 754f8d2ac9e8..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3

# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr

+.Ldone_efer:
# Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+ movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

/*
--
2.25.1


Subject: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to access guest memory. Any memory
that is required for communication with the VMM must be shared
explicitly by setting the shared bit in the page table entry. And,
after setting the shared bit, the conversion must be completed with
a MapGPA TDVMCALL. The call informs the VMM about the conversion and
makes it remove the GPA from the S-EPT mapping. The shared
memory is similar to unencrypted memory in AMD SME/SEV terminology,
but the underlying process of sharing/un-sharing the memory is
different for the Intel TDX guest platform.
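
The shared-bit handling amounts to mask arithmetic on the PTE. In this
hedged sketch the bit position is arbitrary (the real one is reported by
the TDX module and retrieved via tdg_shared_mask()), and the helper names
are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for tdg_shared_mask(); bit 51 is chosen arbitrarily here,
 * the real position comes from the TDX module via TDINFO.
 */
#define SHARED_MASK	(1ULL << 51)

/* Mark a page-table entry as shared with the VMM. */
static uint64_t pte_set_shared(uint64_t pte)
{
	return pte | SHARED_MASK;
}

/* Convert a shared PTE back to private (guest-only) memory. */
static uint64_t pte_set_private(uint64_t pte)
{
	return pte & ~SHARED_MASK;
}
```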

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To support this, the AMD SME code forces force_dma_unencrypted()
to return true on platforms that support the AMD SEV feature. The
DMA memory allocation API then uses it to trigger
set_memory_decrypted() for such platforms.

TDX is similar.  TDX architecturally prevents access to private
guest memory by anything other than the guest itself. This means
that any DMA buffers must be shared.

So create a new file, mem_encrypt_tdx.c, to hold TDX-specific memory
initialization code, and redefine force_dma_unencrypted() for the
TDX guest to return true so that DMA pages are mapped as shared.

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private.  If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
(or on the first access to the private GPA), in which case TDX-Module will
hold the page in a non-present "pending" state until it is explicitly
accepted.

BUG() if TDACCEPTPAGE fails (except the above case), as the guest is
completely hosed if it can't access memory. 
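
The map-then-accept flow described above can be modeled in a few lines of
C. This is a sketch under stated assumptions: stub_accept() stands in for
the real __tdx_module_call(TDACCEPTPAGE, ...), and the page-at-GPA-0
behavior simulates a page the VMM already accepted (the one tolerated
error case).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ				4096ULL
#define TDX_PAGE_ALREADY_ACCEPTED	0x8000000000000001ULL

static int accepted_pages;	/* instrumentation for the usage check */

/*
 * Stand-in for __tdx_module_call(TDACCEPTPAGE, gpa, ...). Pretend the
 * page at GPA 0 was already accepted; accept everything else.
 */
static uint64_t stub_accept(uint64_t gpa)
{
	if (gpa == 0)
		return TDX_PAGE_ALREADY_ACCEPTED;
	accepted_pages++;
	return 0;
}

/* Mirrors the shape of tdg_map_gpa(): accept only on the private path. */
static int map_gpa(uint64_t gpa, int numpages, int shared)
{
	int i;

	if (shared)
		return 0;	/* shared pages are never accepted by the guest */

	for (i = 0; i < numpages; i++) {
		uint64_t ret = stub_accept(gpa + i * PAGE_SZ);

		/* The kernel BUG()s here: any other failure means the
		 * guest cannot access its own memory. */
		if (ret && ret != TDX_PAGE_ALREADY_ACCEPTED)
			return -1;
	}
	return 0;
}
```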

Tested-by: Kai Huang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Since the common code between AMD-SEV and TDX is very minimal,
defining a new config (X86_MEM_ENCRYPT_COMMON) for common code
is not very useful. So created a separate file for Intel TDX
specific memory initialization (similar to AMD SEV).
* Removed patch titled "x86/mm: Move force_dma_unencrypted() to
common code" from this series. And merged required changes in
this patch.

arch/x86/Kconfig | 1 +
arch/x86/include/asm/tdx.h | 3 +++
arch/x86/kernel/tdx.c | 26 ++++++++++++++++++-
arch/x86/mm/Makefile | 1 +
arch/x86/mm/mem_encrypt_tdx.c | 19 ++++++++++++++
arch/x86/mm/pat/set_memory.c | 48 +++++++++++++++++++++++++++++------
6 files changed, 89 insertions(+), 9 deletions(-)
create mode 100644 arch/x86/mm/mem_encrypt_tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a055594e2664..69a98bcdc07a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
select ARCH_HAS_PROTECTED_GUEST
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select DYNAMIC_PHYSICAL_MASK
help
Provide support for running in a trusted domain on Intel processors
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f5e8088dabc5..4ad436cc2146 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -19,6 +19,9 @@ enum tdx_map_type {

#define TDINFO 1
#define TDGETVEINFO 3
+#define TDACCEPTPAGE 6
+
+#define TDX_PAGE_ALREADY_ACCEPTED 0x8000000000000001

struct tdx_module_output {
u64 rcx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 9ddb80adc034..caf8e4c5ddbc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,7 +100,8 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

-int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
{
u64 ret;

@@ -111,6 +112,29 @@ int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
return ret ? -EIO : 0;
}

+static void tdg_accept_page(phys_addr_t gpa)
+{
+ u64 ret;
+
+ ret = __tdx_module_call(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+ BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+ int ret, i;
+
+ ret = __tdg_map_gpa(gpa, numpages, map_type);
+ if (ret || map_type == TDX_MAP_SHARED)
+ return ret;
+
+ for (i = 0; i < numpages; i++)
+ tdg_accept_page(gpa + i*PAGE_SIZE);
+
+ return 0;
+}
+
static __cpuidle void tdg_halt(void)
{
u64 ret;
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..555dcc0cd087 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -55,3 +55,4 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += mem_encrypt_tdx.o
diff --git a/arch/x86/mm/mem_encrypt_tdx.c b/arch/x86/mm/mem_encrypt_tdx.c
new file mode 100644
index 000000000000..f394a43bf46d
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_tdx.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Intel TDX Memory Encryption Support
+ *
+ * Copyright (C) 2020 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+
+#include <asm/tdx.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ return is_tdx_guest();
+}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..ea78c7907847 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
#include <asm/proto.h>
#include <asm/memtype.h>
#include <asm/set_memory.h>
+#include <asm/tdx.h>

#include "../mm_internal.h"

@@ -1972,13 +1973,15 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
{
+ pgprot_t mem_protected_bits, mem_plain_bits;
struct cpa_data cpa;
+ enum tdx_map_type map_type;
int ret;

- /* Nothing to do if memory encryption is not active */
- if (!mem_encrypt_active())
+ /* Nothing to do if memory encryption and TDX are not active */
+ if (!mem_encrypt_active() && !is_tdx_guest())
return 0;

/* Should not be working on unaligned addresses */
@@ -1988,8 +1991,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
- cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
- cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+ if (is_tdx_guest()) {
+ mem_protected_bits = __pgprot(0);
+ mem_plain_bits = __pgprot(tdg_shared_mask());
+ } else {
+ mem_protected_bits = __pgprot(_PAGE_ENC);
+ mem_plain_bits = __pgprot(0);
+ }
+
+ if (protect) {
+ cpa.mask_set = mem_protected_bits;
+ cpa.mask_clr = mem_plain_bits;
+ map_type = TDX_MAP_PRIVATE;
+ } else {
+ cpa.mask_set = mem_plain_bits;
+ cpa.mask_clr = mem_protected_bits;
+ map_type = TDX_MAP_SHARED;
+ }
+
cpa.pgd = init_mm.pgd;

/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2018,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

/*
* Before changing the encryption attribute, we need to flush caches.
+ *
+ * For TDX we need to flush caches on private->shared. VMM is
+ * responsible for flushing on shared->private.
*/
- cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ if (is_tdx_guest()) {
+ if (map_type == TDX_MAP_SHARED)
+ cpa_flush(&cpa, 1);
+ } else {
+ cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ }

ret = __change_page_attr_set_clr(&cpa, 1);

@@ -2012,18 +2040,22 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);

+ if (!ret && is_tdx_guest()) {
+ ret = tdg_map_gpa(__pa(addr), numpages, map_type);
+ }
+
return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, true);
+ return __set_memory_protect(addr, numpages, true);
}
EXPORT_SYMBOL_GPL(set_memory_encrypted);

int set_memory_decrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, false);
+ return __set_memory_protect(addr, numpages, false);
}
EXPORT_SYMBOL_GPL(set_memory_decrypted);

--
2.25.1


Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/12/21 8:44 AM, Dave Hansen wrote:
> Because the code is already separate. You're actually going to some
> trouble to move the SEV-specific code and then combine it with the
> TDX-specific code.
>
> Anyway, please just give it a shot. Should take all of ten minutes. If
> it doesn't work out in practice, fine. You'll have a good paragraph for
> the changelog.

After reviewing the code again, I have noticed that we don't really have
much common code between AMD and TDX. So I don't see any justification for
creating this common layer. So, I have decided to drop this patch and move
Intel TDX specific memory encryption init code to patch titled "[RFC v2 30/32]
x86/tdx: Make DMA pages shared". This model is similar to how AMD-SEV
does the initialization.

I have sent the modified patch as reply to patch titled "[RFC v2 30/32]
x86/tdx: Make DMA pages shared". Please check and let me know your comments.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 11:18:57

by Dan Williams

Subject: Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
b4 recently gained support for partial series re-rolls [1], but I
think you would need to bump the version number [RFC PATCH v3 21/32]
and maintain the patch numbering. In this case, with changes moving
between patches and those other patches being squashed, any chance of
automated reconstruction of this series is likely lost.

Just wanted to note that for future reference in case you were hoping
to avoid resending full series in the future. For now, some more
comments below:

[1]: https://lore.kernel.org/tools/[email protected]/

On Mon, May 17, 2021 at 5:54 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX. You can find MADT MP wake protocol details in ACPI specification
> r6.4, sec 5.2.12.19.
>
> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode. For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since RFC v2:
> * Removed X86_CR0_NE and EFER related changes from this changes

This was only partially done, see below...

> and moved it to patch titled "x86/boot: Avoid #VE during
> boot for TDX platforms"
> * Fixed commit log as per Dan's suggestion.
> * Added inline get_trampoline_start_ip() to set start_ip.

You also added a comment to tr_idt, but didn't mention it here, so I
went to double check. Please take care to document all changes to the
patch from the previous review.

>
> arch/x86/boot/compressed/pgtable.h | 2 +-
> arch/x86/include/asm/realmode.h | 10 +++++++
> arch/x86/kernel/smpboot.c | 2 +-
> arch/x86/realmode/rm/header.S | 1 +
> arch/x86/realmode/rm/trampoline_64.S | 38 ++++++++++++++++++++++++
> arch/x86/realmode/rm/trampoline_common.S | 7 ++++-
> 6 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 6ff7e81b5628..cc9b2529a086 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,7 +6,7 @@
> #define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
>
> #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
> +#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
>
> #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 5db5d083c873..3328c8edb200 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
> u32 sev_es_trampoline_start;
> #endif
> #ifdef CONFIG_X86_64
> + u32 trampoline_start64;
> u32 trampoline_pgd;
> #endif
> /* ACPI S3 wakeup */
> @@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
> real_mode_header = (struct real_mode_header *) __va(mem);
> }
>
> +static inline unsigned long get_trampoline_start_ip(void)

I'd prefer this helper take a 'struct real_mode_header *rmh' as an
argument rather than assume a global variable.

> +{
> +#ifdef CONFIG_X86_64
> + if (is_tdx_guest())
> + return real_mode_header->trampoline_start64;
> +#endif
> + return real_mode_header->trampoline_start;
> +}
> +
> void reserve_real_mode(void);
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 16703c35a944..0b4dff5e67a9 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
> int *cpu0_nmi_registered)
> {
> /* start_ip had better be page-aligned! */
> - unsigned long start_ip = real_mode_header->trampoline_start;
> + unsigned long start_ip = get_trampoline_start_ip();
>
> unsigned long boot_error = 0;
> unsigned long timeout;
> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
> index 8c1db5bf5d78..2eb62be6d256 100644
> --- a/arch/x86/realmode/rm/header.S
> +++ b/arch/x86/realmode/rm/header.S
> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
> .long pa_sev_es_trampoline_start
> #endif
> #ifdef CONFIG_X86_64
> + .long pa_trampoline_start64
> .long pa_trampoline_pgd;
> #endif
> /* ACPI S3 wakeup */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 84c5d1b33d10..754f8d2ac9e8 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
> ljmpl $__KERNEL_CS, $pa_startup_64
> SYM_CODE_END(startup_32)
>
> +SYM_CODE_START(pa_trampoline_compat)
> + /*
> + * In compatibility mode. Prep ESP and DX for startup_32, then disable
> + * paging and complete the switch to legacy 32-bit mode.
> + */
> + movl $rm_stack_end, %esp
> + movw $__KERNEL_DS, %dx
> +
> + movl $(X86_CR0_NE | X86_CR0_PE), %eax

Before this patch the startup path did not touch X86_CR0_NE. I assume
it was added opportunistically for the TDX case? If it is to stay in
this patch it deserves a code comment / mention in the changelog, or
it needs to move to the other patch that fixes up the CR0 setup for
TDX.


> + movl %eax, %cr0
> + ljmpl $__KERNEL32_CS, $pa_startup_32
> +SYM_CODE_END(pa_trampoline_compat)
> +
> .section ".text64","ax"
> .code64
> .balign 4
> @@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
> jmpq *tr_start(%rip)
> SYM_CODE_END(startup_64)
>
> +SYM_CODE_START(trampoline_start64)
> + /*
> + * APs start here on a direct transfer from 64-bit BIOS with identity
> + * mapped page tables. Load the kernel's GDT in order to gear down to
> + * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
> + * segment registers. Load the zero IDT so any fault triggers a
> + * shutdown instead of jumping back into BIOS.
> + */
> + lidt tr_idt(%rip)
> + lgdt tr_gdt64(%rip)
> +
> + ljmpl *tr_compat(%rip)
> +SYM_CODE_END(trampoline_start64)
> +
> .section ".rodata","a"
> # Duplicate the global descriptor table
> # so the kernel can live anywhere
> @@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
> .quad 0x00cf93000000ffff # __KERNEL_DS
> SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>
> +SYM_DATA_START(tr_gdt64)
> + .short tr_gdt_end - tr_gdt - 1 # gdt limit
> + .long pa_tr_gdt
> + .long 0
> +SYM_DATA_END(tr_gdt64)
> +
> +SYM_DATA_START(tr_compat)
> + .long pa_trampoline_compat
> + .short __KERNEL32_CS
> +SYM_DATA_END(tr_compat)
> +
> .bss
> .balign PAGE_SIZE
> SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
> index 5033e640f957..ade7db208e4e 100644
> --- a/arch/x86/realmode/rm/trampoline_common.S
> +++ b/arch/x86/realmode/rm/trampoline_common.S
> @@ -1,4 +1,9 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> .section ".rodata","a"
> .balign 16
> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
> +
> +/* .fill cannot be used for size > 8. So use short and quad */

If there is to be a comment here it should be to clarify why @tr_idt
is 10 bytes, not necessarily a quirk of the assembler.

> +SYM_DATA_START_LOCAL(tr_idt)

The .fill restriction is only for @size, not @repeat. So, what's wrong
with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?

Subject: Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode



On 5/17/21 7:06 PM, Dan Williams wrote:
> I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
> b4 recently gained support for partial series re-rolls [1], but I
> think you would need to bump the version number [RFC PATCH v3 21/32]
> and maintain the patch numbering. In this case with changes moving
> between patches, and those other patches being squashed any chance of
> automated reconstruction of this series is likely lost.

Ok. I will make sure to bump the version in next partial re-roll.

If I am fixing this patch as per your comments, do I need to bump the
patch version for it as well?

>
> Just wanted to note that for future reference in case you were hoping
> to avoid resending full series in the future. For now, some more
> comments below:

Thanks.

>
> [1]: https://lore.kernel.org/tools/[email protected]/
>
> On Mon, May 17, 2021 at 5:54 PM Kuppuswamy Sathyanarayanan
> <[email protected]> wrote:
>>
>> From: Sean Christopherson <[email protected]>
>>
>> Add a trampoline for booting APs in 64-bit mode via a software handoff
>> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
>> by TDX. You can find MADT MP wake protocol details in ACPI specification
>> r6.4, sec 5.2.12.19.
>>
>> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
>> mode. For the GDT pointer, create a new entry as the existing storage
>> for the pointer occupies the zero entry in the GDT itself.
>>
>> Reported-by: Kai Huang <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> ---
>>
>> Changes since RFC v2:
>> * Removed X86_CR0_NE and EFER related changes from this changes
>
> This was only partially done, see below...
>
>> and moved it to patch titled "x86/boot: Avoid #VE during
>> boot for TDX platforms"
>> * Fixed commit log as per Dan's suggestion.
>> * Added inline get_trampoline_start_ip() to set start_ip.
>
> You also added a comment to tr_idt, but didn't mention it here, so I
> went to double check. Please take care to document all changes to the
> patch from the previous review.

Ok. I will make sure change log is current.

>
>>
>> arch/x86/boot/compressed/pgtable.h | 2 +-
>> arch/x86/include/asm/realmode.h | 10 +++++++
>> arch/x86/kernel/smpboot.c | 2 +-
>> arch/x86/realmode/rm/header.S | 1 +
>> arch/x86/realmode/rm/trampoline_64.S | 38 ++++++++++++++++++++++++
>> arch/x86/realmode/rm/trampoline_common.S | 7 ++++-
>> 6 files changed, 57 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
>> index 6ff7e81b5628..cc9b2529a086 100644
>> --- a/arch/x86/boot/compressed/pgtable.h
>> +++ b/arch/x86/boot/compressed/pgtable.h
>> @@ -6,7 +6,7 @@
>> #define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
>>
>> #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
>> -#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
>> +#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
>>
>> #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
>>
>> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
>> index 5db5d083c873..3328c8edb200 100644
>> --- a/arch/x86/include/asm/realmode.h
>> +++ b/arch/x86/include/asm/realmode.h
>> @@ -25,6 +25,7 @@ struct real_mode_header {
>> u32 sev_es_trampoline_start;
>> #endif
>> #ifdef CONFIG_X86_64
>> + u32 trampoline_start64;
>> u32 trampoline_pgd;
>> #endif
>> /* ACPI S3 wakeup */
>> @@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
>> real_mode_header = (struct real_mode_header *) __va(mem);
>> }
>>
>> +static inline unsigned long get_trampoline_start_ip(void)
>
> I'd prefer this helper take a 'struct real_mode_header *rmh' as an
> argument rather than assume a global variable.

I am fine with it. But existing inline functions also directly read/write
the real_mode_header. So I just followed the same format.

I will fix this in next version.

>
>> +{
>> +#ifdef CONFIG_X86_64
>> + if (is_tdx_guest())
>> + return real_mode_header->trampoline_start64;
>> +#endif
>> + return real_mode_header->trampoline_start;
>> +}
>> +
>> void reserve_real_mode(void);
>>
>> #endif /* __ASSEMBLY__ */
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index 16703c35a944..0b4dff5e67a9 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>> int *cpu0_nmi_registered)
>> {
>> /* start_ip had better be page-aligned! */
>> - unsigned long start_ip = real_mode_header->trampoline_start;
>> + unsigned long start_ip = get_trampoline_start_ip();
>>
>> unsigned long boot_error = 0;
>> unsigned long timeout;
>> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
>> index 8c1db5bf5d78..2eb62be6d256 100644
>> --- a/arch/x86/realmode/rm/header.S
>> +++ b/arch/x86/realmode/rm/header.S
>> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>> .long pa_sev_es_trampoline_start
>> #endif
>> #ifdef CONFIG_X86_64
>> + .long pa_trampoline_start64
>> .long pa_trampoline_pgd;
>> #endif
>> /* ACPI S3 wakeup */
>> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
>> index 84c5d1b33d10..754f8d2ac9e8 100644
>> --- a/arch/x86/realmode/rm/trampoline_64.S
>> +++ b/arch/x86/realmode/rm/trampoline_64.S
>> @@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
>> ljmpl $__KERNEL_CS, $pa_startup_64
>> SYM_CODE_END(startup_32)
>>
>> +SYM_CODE_START(pa_trampoline_compat)
>> + /*
>> + * In compatibility mode. Prep ESP and DX for startup_32, then disable
>> + * paging and complete the switch to legacy 32-bit mode.
>> + */
>> + movl $rm_stack_end, %esp
>> + movw $__KERNEL_DS, %dx
>> +
>> + movl $(X86_CR0_NE | X86_CR0_PE), %eax
>
> Before this patch the startup path did not touch X86_CR0_NE. I assume
> it was added opportunistically for the TDX case? If it is to stay in
> this patch it deserves a code comment / mention in the changelog, or
> it needs to move to the other patch that fixes up the CR0 setup for
> TDX.

I will move X86_CR0_NE related update to the patch that has other
X86_CR0_NE related updates.

>
>
>> + movl %eax, %cr0
>> + ljmpl $__KERNEL32_CS, $pa_startup_32
>> +SYM_CODE_END(pa_trampoline_compat)
>> +
>> .section ".text64","ax"
>> .code64
>> .balign 4
>> @@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
>> jmpq *tr_start(%rip)
>> SYM_CODE_END(startup_64)
>>
>> +SYM_CODE_START(trampoline_start64)
>> + /*
>> + * APs start here on a direct transfer from 64-bit BIOS with identity
>> + * mapped page tables. Load the kernel's GDT in order to gear down to
>> + * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
>> + * segment registers. Load the zero IDT so any fault triggers a
>> + * shutdown instead of jumping back into BIOS.
>> + */
>> + lidt tr_idt(%rip)
>> + lgdt tr_gdt64(%rip)
>> +
>> + ljmpl *tr_compat(%rip)
>> +SYM_CODE_END(trampoline_start64)
>> +
>> .section ".rodata","a"
>> # Duplicate the global descriptor table
>> # so the kernel can live anywhere
>> @@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
>> .quad 0x00cf93000000ffff # __KERNEL_DS
>> SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>>
>> +SYM_DATA_START(tr_gdt64)
>> + .short tr_gdt_end - tr_gdt - 1 # gdt limit
>> + .long pa_tr_gdt
>> + .long 0
>> +SYM_DATA_END(tr_gdt64)
>> +
>> +SYM_DATA_START(tr_compat)
>> + .long pa_trampoline_compat
>> + .short __KERNEL32_CS
>> +SYM_DATA_END(tr_compat)
>> +
>> .bss
>> .balign PAGE_SIZE
>> SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
>> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
>> index 5033e640f957..ade7db208e4e 100644
>> --- a/arch/x86/realmode/rm/trampoline_common.S
>> +++ b/arch/x86/realmode/rm/trampoline_common.S
>> @@ -1,4 +1,9 @@
>> /* SPDX-License-Identifier: GPL-2.0 */
>> .section ".rodata","a"
>> .balign 16
>> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
>> +
>> +/* .fill cannot be used for size > 8. So use short and quad */
>
> If there is to be a comment here it should be to clarify why @tr_idt
> is 10 bytes, not necessarily a quirk of the assembler.

Got it. I will fix the comment or remove it.

>
>> +SYM_DATA_START_LOCAL(tr_idt)
>
> The .fill restriction is only for @size, not @repeat. So, what's wrong
> with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?

Any reason to prefer the above change over the previous code?

SYM_DATA_START_LOCAL(tr_idt)
.short 0
.quad 0
SYM_DATA_END(tr_idt)

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 11:50:39

by Dan Williams

Subject: Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

On Mon, May 17, 2021 at 7:53 PM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
>
>
> On 5/17/21 7:06 PM, Dan Williams wrote:
> > I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
> > b4 recently gained support for partial series re-rolls [1], but I
> > think you would need to bump the version number [RFC PATCH v3 21/32]
> > and maintain the patch numbering. In this case with changes moving
> > between patches, and those other patches being squashed any chance of
> > automated reconstruction of this series is likely lost.
>
> Ok. I will make sure to bump the version in next partial re-roll.
>
> If I am fixing this patch as per your comments, do I need bump the
> patch version for it as well?

I don't think it matters too much in this case, as I don't think I can
use b4 to assemble this series; consider this future reference for other
patch sets. That said, I wouldn't mind a link to your work-in-progress
branch to see all the changes together in one place.

[..]
> > I'd prefer this helper take a 'struct real_mode_header *rmh' as an
> > argument rather than assume a global variable.
>
> I am fine with it. But the existing inline functions also directly read/write
> the real_mode_header, so I just followed the same format.

I notice the SEV-ES code passes an @rmh variable around for this purpose.

[..]
> > If there is to be a comment here it should be to clarify why @tr_idt
> > is 10 bytes, not necessarily a quirk of the assembler.
>
> Got it. I will fix the comment or remove it.
>
> >
> >> +SYM_DATA_START_LOCAL(tr_idt)
> >
> > The .fill restriction is only for @size, not @repeat. So, what's wrong
> > with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?
>
> Any reason to prefer the above change over the previous code?

What I'm really after is capturing why this size needs to be adjusted
for future reference. Maybe it's plainly obvious to someone who has
worked with this code, but it was not immediately obvious to me.

>
> SYM_DATA_START_LOCAL(tr_idt)
> .short 0
> .quad 0
> SYM_DATA_END(tr_idt)

This format implies that tr_idt is reserving space for 2 distinct data
structure attributes of those sizes, can you just put those names here
as comments? Otherwise the .fill format is more compact.

2021-05-19 18:18:41

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> In traditional VMs, MMIO tends to be implemented by giving a
> guest access to a mapping which will cause a VMEXIT on access.
> That's not possible in TDX guest.

Why is it not possible?

> So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> with EPT_VIOLATION exit reason.

What does the #VE handler do to resolve the exception?

> For now we only handle a subset of instructions that the kernel
> uses for MMIO operations. User-space access triggers SIGBUS.

How do you know which instructions the kernel uses? How do you know
that the compiler won't change them?

I guess the kernel won't boot far if this happens, but this still sounds
like trial-and-error programming.

> Also, reasons for supporting #VE based MMIO in TDX guest are,
>
> * MMIO is widely used and we'll have more drivers in the future.

OK, but you've also made a big deal about having to go explicitly audit
these drivers. I would imagine converting these over to stop using MMIO
would be _relatively_ minor compared to a big security audit and new
fuzzing infrastructure.

> * We don't want to annotate every TDX specific MMIO readl/writel etc.

^ TDX-specific

> * If we didn't annotate we would need to add an alternative to every
> MMIO access in the kernel (even though 99.9% will never be used on
> TDX) which would be a complete waste and incredible binary bloat
> for nothing.

That sounds like something objective we can measure. Does this cost 1
byte of extra text per readl/writel? 10? 100?

You're also being rather indirect about what solutions you ruled out.
Why not just say: we considered doing ____, but ruled that out because
it would have required ____. Above you just tell us what the solution
required without mentioning the solution.

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b9e3010987e0..9330c7a9ad69 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -5,6 +5,8 @@
>
> #include <asm/tdx.h>
> #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <linux/sched/signal.h> /* force_sig_fault() */
>
> #include <linux/cpu.h>
> #include <linux/protected_guest.h>
> @@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
> }
> }
>
> +static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
> + unsigned long val)
> +{
> + return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
> + write, addr, val);
> +}
> +
> +static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
> +{
> + static const int regoff[] = {
> + offsetof(struct pt_regs, ax),
> + offsetof(struct pt_regs, cx),
> + offsetof(struct pt_regs, dx),
> + offsetof(struct pt_regs, bx),
> + offsetof(struct pt_regs, sp),
> + offsetof(struct pt_regs, bp),
> + offsetof(struct pt_regs, si),
> + offsetof(struct pt_regs, di),
> + offsetof(struct pt_regs, r8),
> + offsetof(struct pt_regs, r9),
> + offsetof(struct pt_regs, r10),
> + offsetof(struct pt_regs, r11),
> + offsetof(struct pt_regs, r12),
> + offsetof(struct pt_regs, r13),
> + offsetof(struct pt_regs, r14),
> + offsetof(struct pt_regs, r15),
> + };
> + int regno;
> +
> + regno = X86_MODRM_REG(insn->modrm.value);
> + if (X86_REX_R(insn->rex_prefix.value))
> + regno += 8;
> +
> + return (void *)regs + regoff[regno];
> +}

Was there a reason you copied and pasted this from get_reg_offset()
instead of refactoring? This looks like almost entirely a subset of
get_reg_offset().

> +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> + int size;
> + bool write;
> + unsigned long *reg;
> + struct insn insn;
> + unsigned long val = 0;
> +
> + /*
> + * User mode would mean the kernel exposed a device directly
> + * to ring3, which shouldn't happen except for things like
> + * DPDK.
> + */

Uhh....

https://www.kernel.org/doc/html/v4.14/driver-api/uio-howto.html

I thought there were more than a few ways that userspace could get
access to MMIO mappings.

Also, do most people know what DPDK is? Should we even be talking about
silly out-of-tree kernel bypass schemes in kernel comments?

> + if (user_mode(regs)) {
> + pr_err("Unexpected user-mode MMIO access.\n");
> + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);

extra space ^

Is a non-ratelimited pr_err() appropriate here? I guess there shouldn't
be any MMIO passthrough to userspace on these systems.

> + return 0;
> + }
> +
> + kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> + insn_get_length(&insn);
> + insn_get_opcode(&insn);
> +
> + write = ve->exit_qual & 0x2;
> +
> + size = insn.opnd_bytes;
> + switch (insn.opcode.bytes[0]) {
> + /* MOV r/m8 r8 */
> + case 0x88:
> + /* MOV r8 r/m8 */
> + case 0x8A:
> + /* MOV r/m8 imm8 */
> + case 0xC6:

FWIW, I find that *REALLY* hard to read.

Check out is_string_insn() for a more readable example.

Oh, and I misread that. I read it as "these are all the opcodes we care
about". When, in fact, I _think_ it's all the opcodes that don't have a
size in insn.opnd_bytes.

Could you spell that out, please?

> + size = 1;
> + break;
> + }
> +
> + if (inat_has_immediate(insn.attr)) {
> + BUG_ON(!write);
> + val = insn.immediate.value;

This is pretty interesting. This won't work with implicit accesses. I
guess the limited opcodes above limit how much imprecision will result.
But, it would still be nice to hear something about that.

For instance, if someone pointed a mid-level page table to MMIO, we'd
get a va->gpa that had zero to do with the instruction. Granted, that's
only going to happen if something bonkers is going on, but maybe I'm
missing some simpler cases of implicit accesses.

> + tdg_mmio(size, write, ve->gpa, val);

What happens if this is an MMIO operation that *partially* touches MMIO
and partially touches normal memory? Let's say I wrote two bytes
(0x1234), starting at the last byte of a RAM page that ran over into an
MMIO page. The fault would occur trying to write 0x34 to the MMIO, but
the instruction cracking would result in trying to write 0x1234 into the
MMIO.

It doesn't seem *that* outlandish that an MMIO might cross a page
boundary. Would this work for a two-byte MMIO that crosses a page?

> + return insn.length;
> + }
> +
> + BUG_ON(!inat_has_modrm(insn.attr));

A comment would be nice here about the BUG_ON().

It would also be nice to give a high-level view of what's going on and
what we know about the instruction at this point.

> + reg = get_reg_ptr(regs, &insn);
> +
> + if (write) {
> + memcpy(&val, reg, size);
> + tdg_mmio(size, write, ve->gpa, val);
> + } else {
> + val = tdg_mmio(size, write, ve->gpa, val);
> + memset(reg, 0, size);
> + memcpy(reg, &val, size);
> + }
> + return insn.length;
> +}
> +
> unsigned long tdg_get_ve_info(struct ve_info *ve)
> {
> u64 ret;
> @@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
> case EXIT_REASON_IO_INSTRUCTION:
> tdg_handle_io(regs, ve->exit_qual);
> break;
> + case EXIT_REASON_EPT_VIOLATION:
> + ve->instr_len = tdg_handle_mmio(regs, ve);
> + break;
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return -EFAULT;
>


2021-05-19 18:20:08

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest

On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> although we don't expect it to happen because we don't expect NMIs to
> trigger #VEs. Another case where they could happen is if the #VE
> exception panics, but in this case there are no guarantees on anything
> anyways.

This implies: "we do not expect any NMI to do MMIO". Is that true? Why?

2021-05-19 18:22:13

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls

Question for KVM folks: Should all of these guest patches say:
"x86/tdx/guest:" or something? It seems like that would put us all in
the right frame of mind as we review these. It's kinda easy (for me at
least) to get lost about which side I'm looking at sometimes.

On 5/17/21 5:15 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor specififc TDVMCALLs

"vendor-specific"

Hyphen and spelling ^

> instead of VMCALL.

This would also be a great place to say:

This enables TDX guests to run with KVM acting as the hypervisor. TDX
guests running under other hypervisors will continue to use those
hypervisors' hypercalls.

> [Isaku: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

This SoB chain is odd. Kirill wrote this, sent it to Isaku, who sent it
to Sathya?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9e0e0ff76bab..768df1b98487 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
> run in a CPU mode that protects the confidentiality of TD memory
> contents and the TD’s CPU state from other software, including VMM.
>
> +config INTEL_TDX_GUEST_KVM
> + def_bool y
> + depends on KVM_GUEST && INTEL_TDX_GUEST
> + help
> + This option enables KVM specific hypercalls in TDX guest.

For something that's not user-visible, I'd probably just add a Kconfig
comment rather than help text.

...
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 7966c10ea8d1..a90fec004844 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>
> obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
> obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o

Is the indentation consistent with the other items near "tdx-kvm.o" in
the Makefile?

...
> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
> +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
> + unsigned long p3, unsigned long p4)
> +{
> + return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
> +}
> +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);

I always forget that KVM code is goofy and needs to have things in C
files so you can export the symbols. Could you add a sentence to the
changelog to this effect?

Code-wise, this is fine. Just a few tweaks and I'll be happy to ack
this one.

2021-05-19 18:22:22

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On 5/18/21 8:56 AM, Andi Kleen wrote:
> On 5/18/2021 8:00 AM, Dave Hansen wrote:
>> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> In traditional VMs, MMIO tends to be implemented by giving a
>>> guest access to a mapping which will cause a VMEXIT on access.
>>> That's not possible in TDX guest.
>> Why is it not possible?
>
> For one, the TDX module doesn't support uncached mappings (IgnorePAT is
> always 1).

Actually, I was thinking more along the lines of why the architecture
doesn't have VMEXITs: VMEXITs expose guest state to the host and VMMs
use that state to emulate MMIO. TDX guests don't trust the host and
can't have that arbitrary state exposed to the host. So, they sanitize
the state in the #VE handler and make a *controlled* transition into the
host with a TDCALL rather than an uncontrolled VMEXIT.

>>> For now we only handle a subset of instructions that the kernel
>>> uses for MMIO operations. User-space access triggers SIGBUS.
>> How do you know which instructions the kernel uses?
>
> They're all in MMIO macros.

I've heard exactly the opposite from the TDX team in the past. What I
remember was a claim that one can not just leverage the MMIO macros as a
single point to avoid MMIO. I remember being told that not all code in
the kernel that does MMIO uses these macros. APIC MMIO's were called
out as a place that does not use the MMIO macros.

I'm confused now.

>>   How do you know that the compiler won't change them?
>
> The macros try hard to prevent that because it would likely break real
> MMIO too.
>
> Besides it works for others, like AMD-SEV today and of course all the
> hypervisors that do the same.

That would be some excellent information for the changelog.

2021-05-19 18:22:34

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


On 5/18/2021 8:00 AM, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> In traditional VMs, MMIO tends to be implemented by giving a
>> guest access to a mapping which will cause a VMEXIT on access.
>> That's not possible in TDX guest.
> Why is it not possible?

For one, the TDX module doesn't support uncached mappings (IgnorePAT is
always 1).




>
>> For now we only handle a subset of instructions that the kernel
>> uses for MMIO operations. User-space access triggers SIGBUS.
> How do you know which instructions the kernel uses?

They're all in MMIO macros.


> How do you know
> that the compiler won't change them?

The macros try hard to prevent that because it would likely break real
MMIO too.

Besides it works for others, like AMD-SEV today and of course all the
hypervisors that do the same.




> That sounds like something objective we can measure. Does this cost 1
> byte of extra text per readl/writel? 10? 100?

Alternatives are at least a pointer, but also the extra alternative
code. It's definitely more than 10, I would guess 40+



>
> I thought there were more than a few ways that userspace could get
> access to MMIO mappings.

Yes and they will all fault in TDX guests.


>> + if (user_mode(regs)) {
>> + pr_err("Unexpected user-mode MMIO access.\n");
>> + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
> extra space ^
>
> Is a non-ratelimited pr_err() appropriate here? I guess there shouldn't
> be any MMIO passthrough to userspace on these systems.
Yes rate limiting makes sense.


2021-05-19 18:22:45

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest

On 5/18/21 8:45 AM, Andi Kleen wrote:
>
> On 5/18/2021 8:11 AM, Dave Hansen wrote:
>> On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
>>> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
>>> although we don't expect it to happen because we don't expect NMIs to
>>> trigger #VEs. Another case where they could happen is if the #VE
>>> exception panics, but in this case there are no guarantees on anything
>>> anyways.
>> This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?
>
> Only drivers that are not supported in TDX anyways could do it (mainly
> watchdog drivers)

No APIC access either?

Also, shouldn't we have at least a:

WARN_ON_ONCE(in_nmi());

if we don't expect (or handle well) #VE in NMIs?

2021-05-19 18:22:57

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On 5/18/21 9:10 AM, Andi Kleen wrote:
> I'm not aware of any other places that would do MMIO without using the
> standard io.h macros, although it might happen in theory on x86 (but
> would likely break on some other architectures)

Can we please connect all of the dots and turn this into a coherent
changelog?

* In-kernel MMIO is handled via exceptions (#VE) and instruction
cracking
* Arbitrary MMIO instructions are not handled (and would result in...)
* The limited set of MMIO instructions that are handled are known and
come from the io.h macros, ultimately build_mmio_read/write().
* This approach is also used for SEV-ES???
* Some x86 code that avoids the MMIO code is known to exist (APIC).
But, this code is not used in TDX guests

BTW, in perusing arch/x86/include/asm/io.h, I was reminded of movdir64b.
That seems like one we'd want to take care of sooner rather than later.
Or, do we expect the first folks who expose a movdir64b-using driver to
TDX to go and update this code?

Also, the sev_key_active() stuff in there makes me nervous. Does this
scheme work with these:

> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
> static inline void ins##bwl(int port, void *addr, unsigned long count) \

?

2021-05-19 18:23:06

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


>>>> For now we only handle a subset of instructions that the kernel
>>>> uses for MMIO operations. User-space access triggers SIGBUS.
>>> How do you know which instructions the kernel uses?
>> They're all in MMIO macros.
> I've heard exactly the opposite from the TDX team in the past. What I
> remember was a claim that one can not just leverage the MMIO macros as a
> single point to avoid MMIO. I remember being told that not all code in
> the kernel that does MMIO uses these macros. APIC MMIO's were called
> out as a place that does not use the MMIO macros.

Yes x86 APIC has its own macros, but we don't use the MMIO based APIC,
only X2APIC in TDX.

I'm not aware of any other places that would do MMIO without using the
standard io.h macros, although it might happen in theory on x86 (but
would likely break on some other architectures)


-Andi



2021-05-19 18:23:06

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > In traditional VMs, MMIO tends to be implemented by giving a
> > guest access to a mapping which will cause a VMEXIT on access.
> > That's not possible in TDX guest.
>
> Why is it not possible?

It is possible, and in fact KVM will cause a VM-Exit on the first access to a
given MMIO page. The problem is that guest state is inaccessible and so the VMM
cannot do the front end of MMIO instruction emulation.

> > So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> > with EPT_VIOLATION exit reason.

It's more accurate to say that the VMM will configure EPT entries for pages that
require instruction emulation to cause #VE.

> What does the #VE handler do to resolve the exception?
>
> > For now we only handle a subset of instructions that the kernel
> > uses for MMIO operations. User-space access triggers SIGBUS.
>
> How do you know which instructions the kernel uses? How do you know
> that the compiler won't change them?
>
> I guess the kernel won't boot far if this happens, but this still sounds
> like trial-and-error programming.

If a driver accesses MMIO through a struct overlay, all bets are off. The I/O
APIC code does this, but that problem is "solved" by forcefully disabling the
I/O APIC since it's useless for the current incarnation of TDX. IIRC, some of
the console code also accesses MMIO via a struct (or maybe just through generic
C code), and the compiler does indeed employ a wider variety of instructions.

So yeah, whack-a-mole.

> > Also, reasons for supporting #VE based MMIO in TDX guest are,
> >
> > * MMIO is widely used and we'll have more drivers in the future.
>
> OK, but you've also made a big deal about having to go explicitly audit
> these drivers. I would imagine converting these over to stop using MMIO
> would be _relatively_ minor compared

For drivers that use the kernel's macros, converting them to use TDVMCALL
directly will be trivial and shouldn't even require any modifications to the
driver. For drivers that use a struct overlay or generic C code, the "conversion"
could require a complete rewrite of the driver.

> to a big security audit and new fuzzing infrastructure.
>
> > * We don't want to annotate every TDX specific MMIO readl/writel etc.
>
> ^ TDX-specific
>
> > * If we didn't annotate we would need to add an alternative to every
> > MMIO access in the kernel (even though 99.9% will never be used on
> > TDX) which would be a complete waste and incredible binary bloat
> > for nothing.
>
> That sounds like something objective we can measure. Does this cost 1
> byte of extra text per readl/writel? 10? 100?

Agreed. And IMO, it's worth converting the common case (macros) if the overhead
is acceptable, while leaving the #VE handling in place for non-standard code.

> > +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)

...

> > + return 0;
> > + }
> > +
> > + kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> > + insn_get_length(&insn);
> > + insn_get_opcode(&insn);
> > +
> > + write = ve->exit_qual & 0x2;
> > +
> > + size = insn.opnd_bytes;
> > + switch (insn.opcode.bytes[0]) {
> > + /* MOV r/m8 r8 */
> > + case 0x88:
> > + /* MOV r8 r/m8 */
> > + case 0x8A:
> > + /* MOV r/m8 imm8 */
> > + case 0xC6:
>
> FWIW, I find that *REALLY* hard to read.

Why does this code exist at all? TDX and SEV-ES absolutely must share code for
handling MMIO reflection. It will require a fair amount of refactoring to move
the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
two separate versions of the opcode cracking.

Ditto for string I/O in vc_handle_ioio().

> What happens if this is an MMIO operation that *partially* touches MMIO
> and partially touches normal memory? Let's say I wrote two bytes
> (0x1234), starting at the last byte of a RAM page that ran over into an
> MMIO page. The fault would occur trying to write 0x34 to the MMIO, but
> the instruction cracking would result in trying to write 0x1234 into the
> MMIO.
>
> It doesn't seem *that* outlandish that an MMIO might cross a page
> boundary. Would this work for a two-byte MMIO that crosses a page?

I'm pretty sure we can get away with panic (kernel) and SIGBUS (userspace) on
a reflected memory access that splits a page. Yes, it's theoretically possible
and probably even "works", but practically speaking no emulated MMIO device is
going to have a single logical register/thing split a page, and I can't think of
any reason to allow accessing multiple registers/things across a page split.

The existing SEV-ES #VC handlers appear to be missing page split checks, so that
needs to be fixed.

2021-05-19 18:23:14

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest


> No APIC access either?


It's all X2APIC inside TDX which uses MSRs

>
> Also, shouldn't we have at least a:
>
> WARN_ON_ONCE(in_nmi());
>
> if we don't expect (or handle well) #VE in NMIs?

We handle it perfectly fine. It's just not needed.


-Andi


2021-05-19 18:23:27

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls

On Tue, May 18, 2021, Dave Hansen wrote:
> Question for KVM folks: Should all of these guest patches say:
> "x86/tdx/guest:" or something?

x86/tdx is fine. The KVM convention is to use "KVM: xxx:" for KVM host code and
"x86/kvm" for KVM guest code. E.g. for KVM TDX host code, the subjects will be
"KVM: x86:", "KVM: VMX:" or "KVM: TDX:".

The one I really don't like is using "tdg_" as the acronym for guest functions.
I find that really confusing and grep-unfriendly.

2021-05-19 18:23:36

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest


On 5/18/2021 8:11 AM, Dave Hansen wrote:
> On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
>> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
>> although we don't expect it to happen because we don't expect NMIs to
>> trigger #VEs. Another case where they could happen is if the #VE
>> exception panics, but in this case there are no guarantees on anything
>> anyways.
> This implies: "we do not expect any NMI to do MMIO". Is that true? Why?

Only drivers that are not supported in TDX anyways could do it (mainly
watchdog drivers)

panic is an exception, but that has been already covered.

-Andi




2021-05-19 18:25:49

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021, Andi Kleen wrote:
> On 5/18/2021 8:00 AM, Dave Hansen wrote:
> > That sounds like something objective we can measure. Does this cost 1
> > byte of extra text per readl/writel? 10? 100?
>
> Alternatives are at least a pointer, but also the extra alternative code.
> It's definitely more than 10, I would guess 40+

The extra bytes for .altinstructions is very different than the extra bytes for
the code itself. The .altinstructions section is freed after init, so yes it
bloats the kernel size a bit, but the runtime footprint is unaffected by the
patching metadata.

IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.

The other option to explore is to hook/patch IO_COND(), which can be done with
negligible overhead because the helpers that use IO_COND() are not inlined. In a
TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
And if there are TDX VMMs that want to deploy virtio-mmio, hooking
drivers/virtio/virtio_mmio.c directly would be a viable option.

2021-05-19 18:26:39

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


> Or, do we expect the first folks who expose a movdir64b-using driver to
> TDX to go and update this code?

That's what we want to do.


>
> Also, the sev_key_active() stuff in there makes me nervous. Does this
> scheme work with these:
>
>> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>> static inline void ins##bwl(int port, void *addr, unsigned long count) \
> ?


This is not MMIO, but port IO. We do similar changes as AMD for TDX.


-Andi


2021-05-19 18:26:57

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


On 5/18/2021 9:10 AM, Andi Kleen wrote:
>
>>>>> For now we only handle a subset of instructions that the kernel
>>>>> uses for MMIO operations. User-space access triggers SIGBUS.
>>>> How do you know which instructions the kernel uses?
>>> They're all in MMIO macros.
>> I've heard exactly the opposite from the TDX team in the past. What I
>> remember was a claim that one can not just leverage the MMIO macros as a
>> single point to avoid MMIO.  I remember being told that not all code in
>> the kernel that does MMIO uses these macros.  APIC MMIO's were called
>> out as a place that does not use the MMIO macros.
>
> Yes x86 APIC has its own macros, but we don't use the MMIO based APIC,
> only X2APIC in TDX.

I must correct myself here. We actually use #VE to handle MSRs, or at
least those that are not context switched by the TDX module. So there
can be #VE nested in NMI in normal operation, since MSR accesses in NMI
can happen.

I don't think it needs any changes to the code -- this should all work
-- but we need to update the commit log to document this case.


-Andi



2021-05-19 18:27:01

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


> The extra bytes for .altinstructions is very different than the extra bytes for
> the code itself. The .altinstructions section is freed after init, so yes it
> bloats the kernel size a bit, but the runtime footprint is unaffected by the
> patching metadata.
>
> IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
>
> The other option to explore is to hook/patch IO_COND(), which can be done with
> negligible overhead because the helpers that use IO_COND() are not inlined. In a
> TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> drivers/virtio/virtio_mmio.c directly would be a viable option.

Yes but what's the point of all that?

Even if it's only 3 bytes we still have a lot of MMIO all over the
kernel which never needs it.

And I don't even see what TDX (or SEV which already does the decoding
and has been merged) would get out of it. We handle all the #VEs just
fine. And the instruction handling code is fairly straightforward too.

Besides instruction decoding works fine for all the existing
hypervisors. All we really want to do is to do the same thing as KVM
would do.

-Andi


2021-05-19 18:27:17

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On 5/18/21 10:21 AM, Andi Kleen wrote:
> Besides instruction decoding works fine for all the existing
> hypervisors. All we really want to do is to do the same thing as KVM
> would do.

Dumb question of the day: If you want to do the same thing that KVM
does, why don't you share more code with KVM? Wouldn't you, for
instance, need to crack the same instruction opcodes?

I'd feel a lot better about this if you said:

Listen, this doesn't work for everything. But, it will run
every single driver as a TDX guest that KVM can handle as a
host. So, if the TDX code is broken, so is the KVM host code.

2021-05-19 18:28:21

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


>>> * If we didn't annotate we would need to add an alternative to every
>>> MMIO access in the kernel (even though 99.9% will never be used on
>>> TDX) which would be a complete waste and incredible binary bloat
>>> for nothing.
>> That sounds like something objective we can measure. Does this cost 1
>> byte of extra text per readl/writel? 10? 100?
> Agreed. And IMO, it's worth converting the common case (macros) if the overhead
> is acceptable, while leaving the #VE handling in place for non-standard code.

We have many millions of lines of MMIO-using driver code in the kernel,
99.99% of which never runs in TDX. I don't see any point in impacting
everything for this. That would go against all good code-change hygiene
practices, and would also just be bloat.

But we also don't want to touch every driver, for similar reasons.

What I think would make sense is to convert something to a direct TDCALL
if we figure out the extra #VE is a real life performance problem. AFAIK
the only candidate that I have in mind for this is the virtio doorbell
write (and potentially later its VMBus equivalent). But we should really
only do that if some measurements show it's needed.



> Why does this code exist at all? TDX and SEV-ES absolutely must share code for
> handling MMIO reflection. It will require a fair amount of refactoring to move
> the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> two separate versions of the opcode cracking.

While that's true on the high level, all the low level details are
different. We looked at unifying at some point, but it would have been a
callback hell. I don't think unifying would make anything cleaner.

Besides, the bulk of the decoding work is already unified in the common
x86 instruction decoder. The actual actions are different, and the code
fetching is also different, so on the rest there isn't that much to unify.


> The existing SEV-ES #VC handlers appear to be missing page split checks, so that
> needs to be fixed.

Only if anyone in the kernel actually relies on it?


-Andi


2021-05-19 18:29:01

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021, Andi Kleen wrote:
>
> > The extra bytes for .altinstructions is very different than the extra bytes for
> > the code itself. The .altinstructions section is freed after init, so yes it
> > bloats the kernel size a bit, but the runtime footprint is unaffected by the
> > patching metadata.
> >
> > IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
> >
> > The other option to explore is to hook/patch IO_COND(), which can be done with
> > negligible overhead because the helpers that use IO_COND() are not inlined. In a
> > TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> > majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> > And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> > drivers/virtio/virtio_mmio.c directly would be a viable option.
>
> Yes but what's the point of all that?

Patching IO_COND() is relatively low effort. With some clever refactoring, I
suspect the net lines of code added would be less than 10. That seems like a
worthwhile effort to avoid millions of faults over the lifetime of the guest.

> Even if it's only 3 bytes we still have a lot of MMIO all over the kernel
> which never needs it.
>
> And I don't even see what TDX (or SEV which already does the decoding and
> has been merged) would get out of it. We handle all the #VEs just fine. And
> the instruction handling code is fairly straight forward too.
>
> Besides instruction decoding works fine for all the existing hypervisors.
> All we really want to do is to do the same thing as KVM would do.

Heh, trust me, you don't want to do the same thing KVM does :-)

2021-05-19 18:30:09

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/18/21 10:21 AM, Andi Kleen wrote:
> > Besides instruction decoding works fine for all the existing
> > hypervisors. All we really want to do is to do the same thing as KVM
> > would do.
>
> Dumb question of the day: If you want to do the same thing that KVM
> does, why don't you share more code with KVM? Wouldn't you, for
> instance, need to crack the same instruction opcodes?

Pulling in all of KVM's emulator is a bad idea from a security perspective. That
could be mitigated to some extent by teaching the emulator to emulate only select
instructions, but it'd still be much higher risk than a barebones guest-specific
implementation. Because old Intel CPUs don't support unrestricted guest, the set
of instructions that KVM _can_ emulate in total is far, far larger than what is
needed for MMIO.

Allowed instructions aside, KVM needs to handle a large number of things a TDX/SEV
guest does not, e.g. segmentation, CPUID model, A/D bit updates, and so on and
so forth.

Refactoring KVM's emulator would also be a monumental task.

2021-05-19 18:31:19

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021, Andi Kleen wrote:
> > Why does this code exist at all? TDX and SEV-ES absolutely must share code for
> > handling MMIO reflection. It will require a fair amount of refactoring to move
> > the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> > two separate versions of the opcode cracking.
>
> While that's true on the high level, all the low level details are
> different. We looked at unifying at some point, but it would have been a
> callback hell. I don't think unifying would make anything cleaner.

How hard did you look? The only part that _must_ be different between SEV and
TDX is the hypercall itself, which is wholly contained at the very end of
vc_do_mmio().

Despite vc_slow_virt_to_phys() taking a pointer to the ghcb, it's unused and
thus the function is 100% generic.

The ghcb->shared_buffer usage throughout the upper levels can be eliminated by
refactoring the stack to take a "u64 *val", since MMIO accesses are currently
bounded to 8 bytes.

> Besides the bulk of the decoding work is already unified in the common x86
> instruction decoder. The actual actions are different, and the code fetching
> is also different

Huh? What do you mean by "actual actions"? Why is the code fetch different?

2021-05-19 18:33:19

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared

On Mon, May 17, 2021, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Intel TDX doesn't allow VMM to access guest memory. Any memory
^
|- private

And to be pedantic, the VMM can _access_ guest private memory all it wants, it
just can't decrypt guest private memory.

> that is required for communication with VMM must be shared
> explicitly by setting the bit in page table entry. And, after
> setting the shared bit, the conversion must be completed with
> MapGPA TDVMCALL. The call informs VMM about the conversion and
> makes it remove the GPA from the S-EPT mapping.

The VMM is _not_ required to remove the GPA from the S-EPT. E.g. if the VMM
wants to, it can leave a 2mb private page intact and create a 4kb shared page
translation within the same range (ignoring the shared bit).

> The shared memory is similar to unencrypted memory in AMD SME/SEV
> terminology but the underlying process of sharing/un-sharing the memory is
> different for Intel TDX guest platform.
>
> SEV assumes that I/O devices can only do DMA to "decrypted"
> physical addresses without the C-bit set.  In order for the CPU
> to interact with this memory, the CPU needs a decrypted mapping.
> To add this support, AMD SME code forces force_dma_unencrypted()
> to return true for platforms that support AMD SEV feature. It will
> be used for DMA memory allocation API to trigger
> set_memory_decrypted() for platforms that support AMD SEV feature.
>
> TDX is similar.  TDX architecturally prevents access to private

TDX doesn't prevent accesses. If hardware _prevented_ accesses then we wouldn't
have to deal with the #MC mess.

> guest memory by anything other than the guest itself. This means
> that any DMA buffers must be shared.
>
> So create a new file mem_encrypt_tdx.c to hold TDX specific memory
> initialization code, and re-define force_dma_unencrypted() for
> TDX guest and make it return true to get DMA pages mapped as shared.
>
> __set_memory_enc_dec() is now aware about TDX and sets Shared bit
> accordingly following with relevant TDVMCALL.
>
> Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when

This should call out that the current TDX spec only supports 4kb AUG/ACCEPT.

On that topic... are there plans to support 2mb and/or 1gb TDH.MEM.PAGE.AUG? If
so, will TDG.MEM.PAGE.ACCEPT also support 2mb/1gb granularity?

> converting memory to private.  If the VMM uses a common pool for private
> and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
> (or on the first access to the private GPA),

What the VMM does or does not do is irrelevant. What matters is what the VMM is
_allowed_ to do without violating the GHCI. Specifically, the VMM is allowed to
unmap a private page in response to MAP_GPA to convert to a shared page.

If the GPA (range) was already mapped as an active, private page, the host
VMM may remove the private page from the TD by following the “Removing TD
Private Pages” sequence in the Intel TDX-module specification [3] to safely
block the mapping(s), flush the TLB and cache, and remove the mapping(s).

That would also provide a nice segue into the "already accepted" error below.

> in which case TDX-Module will hold the page in a non-present "pending" state
> until it is explicitly accepted.
>
> BUG() if TDACCEPTPAGE fails (except the above case)

What above case? The code handles the case where the page was already accepted,
but the changelog doesn't talk about that at all.

> as the guest is completely hosed if it can't access memory. 


2021-05-19 18:33:56

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


On 5/18/2021 11:22 AM, Sean Christopherson wrote:
> On Tue, May 18, 2021, Andi Kleen wrote:
>>> The extra bytes for .altinstructions is very different than the extra bytes for
>>> the code itself. The .altinstructions section is freed after init, so yes it
>>> bloats the kernel size a bit, but the runtime footprint is unaffected by the
>>> patching metadata.
>>>
>>> IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
>>>
>>> The other option to explore is to hook/patch IO_COND(), which can be done with
>>> negligible overhead because the helpers that use IO_COND() are not inlined. In a
>>> TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
>>> majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
>>> And if there are TDX VMMs that want to deploy virtio-mmio, hooking
>>> drivers/virtio/virtio_mmio.c directly would be a viable option.
>> Yes but what's the point of all that?
> Patching IO_COND() is relatively low effort. With some clever refactoring, I
> suspect the net lines of code added would be less than 10. That seems like a
> worthwhile effort to avoid millions of faults over the lifetime of the guest.

AFAIK IO_COND is only for iomap users. But most drivers don't even use
iomap. virtio doesn't for example, and that's really the only case we
currently care about.

Also millions of faults is nothing for a CPU.

The only case I can see it making sense is the virtio (and vmbus) door
bells. Everything else should be slow path anyways.

But doing that now would be premature optimization and that's usually a
bad idea. If it's a problem we can fix it later.


>
>> Even if it's only 3 bytes we still have a lot of MMIO all over the kernel
>> which never needs it.
>>
>> And I don't even see what TDX (or SEV which already does the decoding and
>> has been merged) would get out of it. We handle all the #VEs just fine. And
>> the instruction handling code is fairly straight forward too.
>>
>> Besides instruction decoding works fine for all the existing hypervisors.
>> All we really want to do is to do the same thing as KVM would do.
> Heh, trust me, you don't want to do the same thing KVM does :-)

We want the same behavior.

Yes probably not the same code.


-Andi



Subject: Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls



On 5/18/21 8:51 AM, Dave Hansen wrote:
> Question for KVM folks: Should all of these guest patches say:
> "x86/tdx/guest:" or something? It seems like that would put us all in
> the right frame of mind as we review these. It's kinda easy (for me at
> least) to get lost about which side I'm looking at sometimes.
>
> On 5/17/21 5:15 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
>> Although the ABI is similar, those instructions no longer
>> function for TDX guests. Make vendor specififc TDVMCALLs
>
> "vendor-specific"
>
> Hyphen and spelling ^

I will fix it next version.

>
>> instead of VMCALL.
>
> This would also be a great place to say:
>
> This enables TDX guests to run with KVM acting as the hypervisor. TDX
> guests running under other hypervisors will continue to use those
> hypervisors hypercalls.

I will include it.

>
>> [Isaku: proposed KVM VENDOR string]
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Signed-off-by: Isaku Yamahata <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>
> This SoB chain is odd. Kirill wrote this, sent it to Isaku, who sent it
> to Sathya?

Initially we used "0" as the vendor ID for KVM, but Isaku proposed a new
value for it and sent a patch to fix it. I did not want to carry it as a
separate patch (for a one-line change), so I merged his change into this
patch and added his Signed-off-by with the comment ([Isaku: proposed KVM VENDOR string]).

+#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */


>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 9e0e0ff76bab..768df1b98487 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
>> run in a CPU mode that protects the confidentiality of TD memory
>> contents and the TD’s CPU state from other software, including VMM.
>>
>> +config INTEL_TDX_GUEST_KVM
>> + def_bool y
>> + depends on KVM_GUEST && INTEL_TDX_GUEST
>> + help
>> + This option enables KVM specific hypercalls in TDX guest.
>
> For something that's not user-visible, I'd probably just add a Kconfig
> comment rather than help text.

If it is the preferred approach, I can remove it.

>
> ...
>> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>> index 7966c10ea8d1..a90fec004844 100644
>> --- a/arch/x86/kernel/Makefile
>> +++ b/arch/x86/kernel/Makefile
>> @@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>>
>> obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
>> obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
>> +obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
>
> Is the indentation consistent with the other items near "tdx-kvm.o" in
> the Makefile?

Yes. For longer config names, common indentation is not maintained. Please
check the PMEM example.

126 obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
127 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
128
129 obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
130 obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
131 obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o


>
> ...
>> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
>> +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>> + unsigned long p3, unsigned long p4)
>> +{
>> + return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
>> +}
>> +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
>
> I always forget that KVM code is goofy and needs to have things in C
> files so you can export the symbols. Could you add a sentence to the
> changelog to this effect?
>
> Code-wise, this is fine. Just a few tweaks and I'll be happy to ack
> this one.

Will add it.

Since KVM hypercall functions can be included and called
from kernel modules, export the tdx_kvm_hypercall*() functions
to avoid symbol errors.


>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 18:34:34

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls

On 5/18/21 1:12 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> [Isaku: proposed KVM VENDOR string]
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>> Reviewed-by: Andi Kleen <[email protected]>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan
>>> <[email protected]>
>>
>> This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
>> to Sathya?
>
> Initially we have used "0" as vendor ID for KVM. But Isaku proposed a new
> value for it and sent a patch to fix it. But, I did not want to carry it as
> separate patch (for one line change). So I have merged his change with
> this patch, and added his signed-off with comment ([Isaku: proposed KVM
> VENDOR string])
>
> +#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */

That's a combined Co-developed-by+Signed-off-by situation. You don't
add a bare SoB for that.

But, seriously, you don't need to preserve a SoB for a one-line patch.
Just pull the line in and make a note in the changelog.

2021-05-19 18:35:22

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO



> virtio-pci, which is going to be used by pretty much all traditional VMs, uses iomap.
> See vp_get(), vp_set(), and all the vp_io{read,write}*() wrappers.

That's true. But there are still all the other users. So it doesn't
solve the problem. In the end I'm fairly sure we would need to patch
readl/writel and friends.

-Andi


2021-05-19 18:35:27

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On 5/18/21 1:20 PM, Andi Kleen wrote:
>
> On 5/18/2021 10:46 AM, Dave Hansen wrote:
>> On 5/18/21 10:21 AM, Andi Kleen wrote:
>>> Besides instruction decoding works fine for all the existing
>>> hypervisors. All we really want to do is to do the same thing as KVM
>>> would do.
>> Dumb question of the day: If you want to do the same thing that KVM
>> does, why don't you share more code with KVM?  Wouldn't you, for
>> instance, need to crack the same instruction opcodes?
>
> We're talking about ~60 lines of code that calls an established
> standard library.
>
> https://github.com/intel/tdx/blob/8c20c364d1f52e432181d142054b1c2efa0ae6d3/arch/x86/kernel/tdx.c#L490
>
> You're proposing a gigantic refactoring to avoid 60 lines of
> straightforward code.
>
> That's not a practical proposal.

Hi Andi,

I'm not actually trying to propose things. I'm really just trying to
get an idea why the implementation ended up how it did. I actually
entirely respect the position that the KVM code is a monster and
shouldn't get reused. That seems totally reasonable.

What isn't reasonable is the lack of documentation of these design
decisions in the changelogs. My goal here is to raise the quality of
the changelogs so that other reviewers and maintainers don't have to ask
these questions when they perform their reviews.

This is honestly the best way I know to help get this code merged as
soon as possible. If I'm not helping, please let me know. I'm happy to
spend my time elsewhere.

2021-05-19 18:36:42

by Sean Christopherson

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021, Andi Kleen wrote:
>
> On 5/18/2021 11:22 AM, Sean Christopherson wrote:
> > On Tue, May 18, 2021, Andi Kleen wrote:
> > > > The extra bytes for .altinstructions is very different than the extra bytes for
> > > > the code itself. The .altinstructions section is freed after init, so yes it
> > > > bloats the kernel size a bit, but the runtime footprint is unaffected by the
> > > > patching metadata.
> > > >
> > > > IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
> > > >
> > > > The other option to explore is to hook/patch IO_COND(), which can be done with
> > > > negligible overhead because the helpers that use IO_COND() are not inlined. In a
> > > > TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> > > > majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> > > > And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> > > > drivers/virtio/virtio_mmio.c directly would be a viable option.
> > > Yes but what's the point of all that?
> > Patching IO_COND() is relatively low effort. With some clever refactoring, I
> > suspect the net lines of code added would be less than 10. That seems like a
> > worthwhile effort to avoid millions of faults over the lifetime of the guest.
>
> AFAIK IO_COND is only for iomap users. But most drivers don't even use
> iomap. virtio doesn't for example, and that's really the only case we
> currently care about.

virtio-pci, which is going to be used by pretty much all traditional VMs, uses iomap.
See vp_get(), vp_set(), and all the vp_io{read,write}*() wrappers.

2021-05-19 18:36:51

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


> I'm not actually trying to propose things. I'm really just trying to
> get an idea why the implementation ended up how it did. I actually
> entirely respect the position that the KVM code is a monster and
> shouldn't get reused. That seems totally reasonable.

Mainly because it's relatively simple and straightforward to do it this
way. Yes, I know, that's a shocking concept, but sometimes it works even
in Linux code.

>
> What isn't reasonable is the lack of documentation of these design
> decisions in the changelogs. My goal here is to raise the quality of
> the changelogs so that other reviewers and maintainers don't have to ask
> these questions when they perform their reviews.
>
> This is honestly the best way I know to help get this code merged as
> soon as possible. If I'm not helping, please let me know. I'm happy to
> spend my time elsewhere.

I'm sure the commit logs can be improved and I appreciate your feedback.


I don't think every commit log needs to be an extended essay meandering
all over the possible design space, talking about everything that could
have been and wasn't. The way code is normally written is that we don't
do an exhaustive search of possible options, but instead we pick a
reasonable path and as long as that works and doesn't have too many
problems we just stick to it. The commit log reflects that single path
chosen, with only rare exceptions that talk about dead ends.

In this case you can even see that multiple independent efforts (AMD and
Intel) arrived at fairly similar implementations, so the path chosen
wasn't really that strange or non-obvious.

Also overall I would appreciate if people would focus more on the code
than the commit logs. Commit logs are important, but in the end what
really matters is that the code is correct.

-Andi



Subject: [RFC v2-fix-v2 1/1] x86/tdx: Wire up KVM hypercalls

From: "Kirill A. Shutemov" <[email protected]>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL. This enables TDX guests to run with KVM
acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisors'
hypercalls.

Since KVM hypercall functions can be included and called
from kernel modules, export tdx_kvm_hypercall*() functions
to avoid symbol errors.

[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix:
* Removed "user help" for INTEL_TDX_GUEST_KVM config option
and added a comment for it.
* Added details about exporting symbols in the commit log.
* Removed Isaku's sign-off.

Changes since RFC v2:
* Introduced INTEL_TDX_GUEST_KVM config for TDX+KVM related changes.
* Removed "C" include file.
* Fixed commit log as per Dave's comments.

arch/x86/Kconfig | 5 ++++
arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
arch/x86/include/asm/tdx.h | 41 ++++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/tdcall.S | 20 ++++++++++++++
arch/x86/kernel/tdx-kvm.c | 48 +++++++++++++++++++++++++++++++++
6 files changed, 136 insertions(+)
create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..15e66a99dd41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
run in a CPU mode that protects the confidentiality of TD memory
contents and the TD’s CPU state from other software, including VMM.

+# This option enables KVM specific hypercalls in TDX guest.
+config INTEL_TDX_GUEST_KVM
+ def_bool y
+ depends on KVM_GUEST && INTEL_TDX_GUEST
+
endif #HYPERVISOR_GUEST

source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
#include <asm/alternative.h>
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>

extern void kvmclock_init(void);

@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall0(nr);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall1(nr, p1);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall2(nr, p1, p2);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..eb758b506dba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,45 @@ static inline void tdx_early_init(void) { };

#endif /* CONFIG_INTEL_TDX_GUEST */

+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out);
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+ unsigned long p2)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7966c10ea8d1..a90fec004844 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index a484c4aef6e6..3c57a1d67b79 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -25,6 +25,8 @@
TDG_R12 | TDG_R13 | \
TDG_R14 | TDG_R15 )

+#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
+
/*
* TDX guests use the TDCALL instruction to make requests to the
* TDX module and hypercalls to the VMM. It is supported in
@@ -213,3 +215,21 @@ SYM_FUNC_START(__tdx_hypercall)
call do_tdx_hypercall
retq
SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+ /*
+ * R10 is not part of the function call ABI, but it is a part
+ * of the TDVMCALL ABI. So set it before calling
+ * do_tdx_hypercall().
+ */
+ movq $TDVMCALL_VENDOR_KVM, %r10
+ call do_tdx_hypercall
+ retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..b21453a81e38
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static long tdx_kvm_hypercall(unsigned int fn, unsigned long r12,
+ unsigned long r13, unsigned long r14,
+ unsigned long r15)
+{
+ return __tdx_hypercall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+ return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3)
+{
+ return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4)
+{
+ return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
--
2.25.1

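As an aside, the TDVMCALL_VENDOR_KVM value in the patch above (0x4d564b2e584454) is just the ASCII string "TDX.KVM" packed little-endian into a 64-bit integer. A quick user-space sketch (not kernel code; the helper name is illustrative) that derives the constant:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack an ASCII vendor string (up to 8 bytes) into a little-endian
 * u64 -- the same layout an x86 load of the string bytes would
 * produce. Illustrative helper only; not part of the patch.
 */
uint64_t tdx_vendor_id(const char *s)
{
	uint64_t id = 0;
	size_t len = strlen(s);

	for (size_t i = 0; i < len && i < 8; i++)
		id |= (uint64_t)(unsigned char)s[i] << (8 * i);
	return id;
}
```

tdx_vendor_id("TDX.KVM") yields 0x4d564b2e584454, matching the #define in tdcall.S.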

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls



On 5/18/21 1:19 PM, Dave Hansen wrote:
> But, seriously, you don't need to preserve a SoB for a one-line patch.
> Just pull the line in and make a note in the changelog.

Ok. Makes sense. I will leave the comment and remove SOB from Isaku.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared



On 5/18/21 12:55 PM, Sean Christopherson wrote:
> On Mon, May 17, 2021, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> Intel TDX doesn't allow VMM to access guest memory. Any memory
> ^
> |- private
>
> And to be pedantic, the VMM can _access_ guest private memory all it wants, it
> just can't decrypt guest private memory.

Ok. I will use "guest private memory".

>
>> that is required for communication with VMM must be shared
>> explicitly by setting the bit in page table entry. And, after
>> setting the shared bit, the conversion must be completed with
>> MapGPA TDVMCALL. The call informs VMM about the conversion and
>> makes it remove the GPA from the S-EPT mapping.
>
> The VMM is _not_ required to remove the GPA from the S-EPT. E.g. if the VMM
> wants to, it can leave a 2mb private page intact and create a 4kb shared page
> translation within the same range (ignoring the shared bit).

So would removing "makes it remove the GPA from the S-EPT mapping"
be sufficient? Or do you want to add more detail?


>
>> The shared memory is similar to unencrypted memory in AMD SME/SEV
>> terminology but the underlying process of sharing/un-sharing the memory is
>> different for Intel TDX guest platform.
>>
>> SEV assumes that I/O devices can only do DMA to "decrypted"
>> physical addresses without the C-bit set.  In order for the CPU
>> to interact with this memory, the CPU needs a decrypted mapping.
>> To add this support, AMD SME code forces force_dma_unencrypted()
>> to return true for platforms that support AMD SEV feature. It will
>> be used for DMA memory allocation API to trigger
>> set_memory_decrypted() for platforms that support AMD SEV feature.
>>
>> TDX is similar.  TDX architecturally prevents access to private
>
> TDX doesn't prevent accesses. If hardware _prevented_ accesses then we wouldn't
> have to deal with the #MC mess.
How about following change?

"TDX is similar. TDX architecturally prevents access to private guest memory by
anything other than the guest itself. This means that any DMA buffers must be
shared."

modified to =>

"TDX is similar. In TDX architecture, the private guest memory is encrypted, which
prevents anything other than guest from accessing/modifying it. So to communicate
with I/O devices, we need to create decrypted mapping and make the pages shared."

>
>> guest memory by anything other than the guest itself. This means
>> that any DMA buffers must be shared.
>>
>> So create a new file mem_encrypt_tdx.c to hold TDX specific memory
>> initialization code, and re-define force_dma_unencrypted() for
>> TDX guest and make it return true to get DMA pages mapped as shared.
>>
>> __set_memory_enc_dec() is now aware about TDX and sets Shared bit
>> accordingly following with relevant TDVMCALL.
>>
>> Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when
>
> This should call out that the current TDX spec only supports 4kb AUG/ACCEPT.

Ok. I will add this spec detail.

>
> On that topic... are there plans to support 2mb and/or 1gb TDH.MEM.PAGE.AUG? If
> so, will TDG.MEM.PAGE.ACCEPT also support 2mb/1gb granularity?
>
>> converting memory to private.  If the VMM uses a common pool for private
>> and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
>> (or on the first access to the private GPA),
>
> What the VMM does or does not do is irrelevant. What matters is what the VMM is
> _allowed_ to do without violating the GHCI. Specifically, the VMM is allowed to
> unmap a private page in response to MAP_GPA to convert to a shared page.
>
> If the GPA (range) was already mapped as an active, private page, the host
> VMM may remove the private page from the TD by following the “Removing TD
> Private Pages” sequence in the Intel TDX-module specification [3] to safely
> block the mapping(s), flush the TLB and cache, and remove the mapping(s).
>
> That would also provide a nice segue into the "already accepted" error below.

Ok. I will add the above detail.

>
>> in which case TDX-Module will hold the page in a non-present "pending" state
>> until it is explicitly accepted.
>>
>> BUG() if TDACCEPTPAGE fails (except the above case)
>
> What above case? The code handles the case where the page was already accepted,
> but the changelog doesn't talk about that at all.

I think it referred to the "already accepted" page case. With your above
suggestion, we can ignore this error. Or I can change it to:

BUG() if TDACCEPTPAGE fails (except "previously accepted page" case)

>
>> as the guest is completely hosed if it can't access memory.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 18:44:01

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared

On 5/18/21 3:12 PM, Kuppuswamy, Sathyanarayanan wrote:
> "TDX is similar. In TDX architecture, the private guest memory is
> encrypted, which prevents anything other than guest from
> accessing/modifying it. So to communicate with I/O devices, we need
> to create decrypted mapping and make the pages shared."

That's actually even more wrong. :(

Check out "Machine Check Architecture Background" in the TDX
architecture spec.

Modification is totally permitted in the architecture. A host can write
all day long to guest memory. Depending on how you use the word,
"access" can also include writes.

TDX really just prevents guests from *consuming* the gunk that an
attacker might write.

Also, don't say "decrypted". The memory is probably still TME-enabled
and probably encrypted on the DIMM. It's still encrypted even if
shared, it's just using the TME key, not the TD key.

2021-05-19 18:48:15

by Dave Hansen

Subject: Re: [RFC v2-fix-v2 1/1] x86/tdx: Wire up KVM hypercalls

On 5/18/21 2:19 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor-specific TDVMCALLs
> instead of VMCALL. This enables TDX guests to run with KVM
> acting as the hypervisor. TDX guests running under other
> hypervisors will continue to use those hypervisor's
> hypercalls.

Well, I screwed this up when I typed it too, but it is:

TDX guests running under other hypervisors will continue
to use those hypervisors' hypercalls.

I hate how that reads, but oh well.

> Since KVM hypercall functions can be included and called
> from kernel modules, export tdx_kvm_hypercall*() functions
> to avoid symbol errors

No, you're not avoiding errors, you're exporting the symbol so it can be
*USED*. The error comes from it not being exported.

It also helps to be specific here: Export tdx_kvm_hypercall*() to make
the symbols visible to kvm.ko.

> [Isaku Yamahata: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

Reviewed-by: Dave Hansen <[email protected]>

Also, FWIW, if you did this in the header:

+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}

You could get away with just exporting tdx_kvm_hypercall() instead of 4
symbols. The rest of the code would look the same.

Subject: [RFC v2-fix-v3 1/1] x86/tdx: Wire up KVM hypercalls

From: "Kirill A. Shutemov" <[email protected]>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL. This enables TDX guests to run with KVM
acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisors'
hypercalls.

Since KVM driver can be built as a kernel module, export
tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.

[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/Kconfig | 5 +++
arch/x86/include/asm/kvm_para.h | 21 ++++++++++
arch/x86/include/asm/tdx.h | 68 +++++++++++++++++++++++++++++++++
arch/x86/kernel/tdcall.S | 26 +++++++++++++
4 files changed, 120 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..15e66a99dd41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
run in a CPU mode that protects the confidentiality of TD memory
contents and the TD’s CPU state from other software, including VMM.

+# This option enables KVM specific hypercalls in TDX guest.
+config INTEL_TDX_GUEST_KVM
+ def_bool y
+ depends on KVM_GUEST && INTEL_TDX_GUEST
+
endif #HYPERVISOR_GUEST

source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
#include <asm/alternative.h>
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>

extern void kvmclock_init(void);

@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall0(nr);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall1(nr, p1);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall2(nr, p1, p2);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
+
+ if (is_tdx_guest())
+ return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..3d8d977e52f0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,72 @@ static inline void tdx_early_init(void) { };

#endif /* CONFIG_INTEL_TDX_GUEST */

+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out);
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return __tdx_hypercall_vendor_kvm(nr, 0, 0, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return __tdx_hypercall_vendor_kvm(nr, p1, 0, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+ unsigned long p2)
+{
+ return __tdx_hypercall_vendor_kvm(nr, p1, p2, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3)
+{
+ return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, 0, NULL);
+}
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, p4, NULL);
+}
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+ unsigned long p2)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3)
+{
+ return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 2dfecdae38bb..27355fb80aeb 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
#include <asm/asm.h>
#include <asm/frame.h>
#include <asm/unwind_hints.h>
+#include <asm/export.h>

#include <linux/linkage.h>
#include <linux/bits.h>
@@ -25,6 +26,8 @@
TDG_R12 | TDG_R13 | \
TDG_R14 | TDG_R15 )

+#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
+
/*
* TDX guests use the TDCALL instruction to make requests to the
* TDX module and hypercalls to the VMM. It is supported in
@@ -212,3 +215,26 @@ SYM_FUNC_START(__tdx_hypercall)
FRAME_END
retq
SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+ FRAME_BEGIN
+ /*
+ * R10 is not part of the function call ABI, but it is a part
+ * of the TDVMCALL ABI. So set it before calling
+ * do_tdx_hypercall().
+ */
+ movq $TDVMCALL_VENDOR_KVM, %r10
+ call do_tdx_hypercall
+ FRAME_END
+ retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+
+EXPORT_SYMBOL(__tdx_hypercall_vendor_kvm);
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
--
2.25.1
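The kvm_hypercall*() changes in the patch above all follow one pattern: divert to the TDX path before the VMCALL asm is ever reached. A stubbed user-space sketch of that dispatch (is_tdx_guest() and both backends are faked here so the control flow can be seen in isolation):

```c
#include <stdbool.h>

/* Fake runtime flag standing in for the kernel's is_tdx_guest(). */
bool fake_tdx_guest;

static bool is_tdx_guest(void)
{
	return fake_tdx_guest;
}

/* Stub backends: the real ones issue TDCALL and VMCALL. */
static long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
{
	return 1000 + nr + p1;	/* marker: took the TDX path */
}

static long vmcall_hypercall1(unsigned int nr, unsigned long p1)
{
	return 2000 + nr + p1;	/* marker: took the VMCALL path */
}

/*
 * Mirrors the patched kvm_hypercall1(): TDX guests branch out
 * before reaching the VMCALL instruction, which would not work
 * for them.
 */
long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
	if (is_tdx_guest())
		return tdx_kvm_hypercall1(nr, p1);
	return vmcall_hypercall1(nr, p1);
}
```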


Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Wire up KVM hypercalls

Sorry, I missed including the change log.

* Removed tdx-kvm.c and implemented tdx_kvm_hypercall*() functions in tdx.h
* Exported __tdx_hypercall_vendor_kvm() symbol for kvm.ko.
* Fixed commit log as per Dave's suggestion.
* Added Reviewed-by from Dave
* Added FRAME_BEGIN/FRAME_END for __tdx_hypercall_vendor_kvm() to fix
compiler warnings.

On Tue, May 18, 2021 at 6:17 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: "Kirill A. Shutemov" <[email protected]>
>
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor-specific TDVMCALLs
> instead of VMCALL. This enables TDX guests to run with KVM
> acting as the hypervisor. TDX guests running under other
> hypervisors will continue to use those hypervisors'
> hypercalls.
>
> Since KVM driver can be built as a kernel module, export
> tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.
>
> [Isaku Yamahata: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Dave Hansen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/Kconfig | 5 +++
> arch/x86/include/asm/kvm_para.h | 21 ++++++++++
> arch/x86/include/asm/tdx.h | 68 +++++++++++++++++++++++++++++++++
> arch/x86/kernel/tdcall.S | 26 +++++++++++++
> 4 files changed, 120 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9e0e0ff76bab..15e66a99dd41 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
> run in a CPU mode that protects the confidentiality of TD memory
> contents and the TD’s CPU state from other software, including VMM.
>
> +# This option enables KVM specific hypercalls in TDX guest.
> +config INTEL_TDX_GUEST_KVM
> + def_bool y
> + depends on KVM_GUEST && INTEL_TDX_GUEST
> +
> endif #HYPERVISOR_GUEST
>
> source "arch/x86/Kconfig.cpu"
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..2fa85481520b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -6,6 +6,7 @@
> #include <asm/alternative.h>
> #include <linux/interrupt.h>
> #include <uapi/asm/kvm_para.h>
> +#include <asm/tdx.h>
>
> extern void kvmclock_init(void);
>
> @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
> static inline long kvm_hypercall0(unsigned int nr)
> {
> long ret;
> +
> + if (is_tdx_guest())
> + return tdx_kvm_hypercall0(nr);
> +
> asm volatile(KVM_HYPERCALL
> : "=a"(ret)
> : "a"(nr)
> @@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
> static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
> {
> long ret;
> +
> + if (is_tdx_guest())
> + return tdx_kvm_hypercall1(nr, p1);
> +
> asm volatile(KVM_HYPERCALL
> : "=a"(ret)
> : "a"(nr), "b"(p1)
> @@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
> unsigned long p2)
> {
> long ret;
> +
> + if (is_tdx_guest())
> + return tdx_kvm_hypercall2(nr, p1, p2);
> +
> asm volatile(KVM_HYPERCALL
> : "=a"(ret)
> : "a"(nr), "b"(p1), "c"(p2)
> @@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
> unsigned long p2, unsigned long p3)
> {
> long ret;
> +
> + if (is_tdx_guest())
> + return tdx_kvm_hypercall3(nr, p1, p2, p3);
> +
> asm volatile(KVM_HYPERCALL
> : "=a"(ret)
> : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
> @@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> unsigned long p4)
> {
> long ret;
> +
> + if (is_tdx_guest())
> + return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
> +
> asm volatile(KVM_HYPERCALL
> : "=a"(ret)
> : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 8ab4067afefc..3d8d977e52f0 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -73,4 +73,72 @@ static inline void tdx_early_init(void) { };
>
> #endif /* CONFIG_INTEL_TDX_GUEST */
>
> +#ifdef CONFIG_INTEL_TDX_GUEST_KVM
> +u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
> + u64 r15, struct tdx_hypercall_output *out);
> +
> +/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall0(unsigned int nr)
> +{
> + return __tdx_hypercall_vendor_kvm(nr, 0, 0, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
> +{
> + return __tdx_hypercall_vendor_kvm(nr, p1, 0, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
> + unsigned long p2)
> +{
> + return __tdx_hypercall_vendor_kvm(nr, p1, p2, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
> + unsigned long p2, unsigned long p3)
> +{
> + return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> + unsigned long p2, unsigned long p3,
> + unsigned long p4)
> +{
> + return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, p4, NULL);
> +}
> +#else
> +static inline long tdx_kvm_hypercall0(unsigned int nr)
> +{
> + return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
> +{
> + return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
> + unsigned long p2)
> +{
> + return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
> + unsigned long p2, unsigned long p3)
> +{
> + return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> + unsigned long p2, unsigned long p3,
> + unsigned long p4)
> +{
> + return -ENODEV;
> +}
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> +
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 2dfecdae38bb..27355fb80aeb 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
> #include <asm/asm.h>
> #include <asm/frame.h>
> #include <asm/unwind_hints.h>
> +#include <asm/export.h>
>
> #include <linux/linkage.h>
> #include <linux/bits.h>
> @@ -25,6 +26,8 @@
> TDG_R12 | TDG_R13 | \
> TDG_R14 | TDG_R15 )
>
> +#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454 /* "TDX.KVM" */
> +
> /*
> * TDX guests use the TDCALL instruction to make requests to the
> * TDX module and hypercalls to the VMM. It is supported in
> @@ -212,3 +215,26 @@ SYM_FUNC_START(__tdx_hypercall)
> FRAME_END
> retq
> SYM_FUNC_END(__tdx_hypercall)
> +
> +#ifdef CONFIG_INTEL_TDX_GUEST_KVM
> +
> +/*
> + * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
> + * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
> + * (TDVMCALL_VENDOR_KVM).
> + */
> +SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
> + FRAME_BEGIN
> + /*
> + * R10 is not part of the function call ABI, but it is a part
> + * of the TDVMCALL ABI. So set it before calling
> + * do_tdx_hypercall().
> + */
> + movq $TDVMCALL_VENDOR_KVM, %r10
> + call do_tdx_hypercall
> + FRAME_END
> + retq
> +SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
> +
> +EXPORT_SYMBOL(__tdx_hypercall_vendor_kvm);
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> --
> 2.25.1
>


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

Hi Dave,

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov"<[email protected]>
>
> tdx_shared_mask() returns the mask that has to be set in a page
> table entry to make page shared with VMM.
>
> Also, note that we cannot club shared mapping configuration between
> AMD SME and Intel TDX Guest platforms in common function. SME has
> to do it very early in __startup_64() as it sets the bit on all
> memory, except what is used for communication. TDX can postpone as
> we don't need any shared mapping in very early boot.
>
> Signed-off-by: Kirill A. Shutemov<[email protected]>
> Reviewed-by: Andi Kleen<[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>

Any comments on this patch?

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

Guests communicate with VMMs via hypercalls. Historically, these
have been implemented using instructions that are known to cause
VMEXITs, like vmcall or vmlaunch. However, with TDX, VMEXITs no
longer expose guest state to the host. This prevents the old
hypercall mechanisms from working. So, to communicate with the VMM,
the TDX specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest to
facilitate secure communication. The "tdcall" instruction is used by
the guest to request services from the TDX module, and a variant of
"tdcall" (with specific arguments as defined by the GHCI) is used by
the guest to request services from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
   
__tdx_hypercall() - can be used to request services from the VMM.
__tdx_module_call() - can be used to communicate with the TDX module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11(), to cover common use cases of the
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, we don't need such wrappers for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing assembly over inline assembly is:

1. The __tdx_hypercall() implementation is over 70 lines of
instructions (with comments); implementing it in inline assembly
would make it hard to read.

2. Many registers (R8-R15, R[A-D]X) are used in the TDCALL
operation. If all of these registers were included in the inline
assembly constraints, some older compilers might not be able to
meet this requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() includes error
handling (ud2 on the failure case) in the assembly function instead
of doing it in a C wrapper. The reason for this choice is that, when
adding support for in/out instructions (see the patch titled
"x86/tdx: Handle port I/O" in this series), alternative_io() is used
to substitute in/out instructions with __tdx_hypercall() calls. Using
C wrappers is not trivial in this case because the input parameters
would be in the wrong registers and it's tricky to include proper
buffer code to make this happen.

For registers used by TDCALL instruction, please check TDX GHCI
specification, sec 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
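To make the calling convention concrete, here is a stubbed user-space sketch of how a caller consumes struct tdx_hypercall_output from __tdx_hypercall(). The stub fills the struct in place of the real assembly, and the fake register values and the success convention (0 in RAX) are illustrative assumptions, not kernel code:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/*
 * Mirrors the struct from the patch: output register values of the
 * TDCALL when requesting services from the VMM.
 */
struct tdx_hypercall_output {
	u64 r11;
	u64 r12;
	u64 r13;
	u64 r14;
	u64 r15;
};

/*
 * Stub for the assembly helper: the real __tdx_hypercall() issues
 * TDCALL and stores the output registers through 'out', which may
 * be NULL when the caller only wants the return code. The values
 * stored here are fake, for illustration only.
 */
u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
		    struct tdx_hypercall_output *out)
{
	if (out) {
		out->r11 = fn;
		out->r12 = r12;
		out->r13 = r13;
		out->r14 = r14;
		out->r15 = r15;
	}
	return 0;	/* assume 0 means success, as in the TDCALL ABI */
}

/*
 * Common-case wrapper in the spirit of tdx_hypercall() from the
 * patch: no output registers needed, just the return code.
 */
u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
	return __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
}
```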

Originally-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
 * Renamed __tdcall()/__tdvmcall() to __tdx_module_call()/__tdx_hypercall().
 * Renamed reg offsets from TDCALL_rx to TDX_MODULE_rx.
 * Renamed reg offsets from TDVMCALL_rx to TDX_HYPERCALL_rx.
 * Renamed struct tdcall_output to struct tdx_module_output.
 * Renamed struct tdvmcall_output to struct tdx_hypercall_output.
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary push/pop sequence in __tdcall() function.
 * Fixed comments as per Dave's review.

arch/x86/include/asm/tdx.h | 38 ++++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 22 ++++
arch/x86/kernel/tdcall.S | 222 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 39 ++++++
5 files changed, 322 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..211b9d66b1b1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};

/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

void __init tdx_early_init(void);

+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdx_hypercall_output *out);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
#include <xen/interface/xen.h>
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offset for fields in tdcall_output */
+ OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+ /* Offset for fields in tdvmcall_output */
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..a67c595e4169
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,222 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10 BIT(10)
+#define TDG_R11 BIT(11)
+#define TDG_R12 BIT(12)
+#define TDG_R13 BIT(13)
+#define TDG_R14 BIT(14)
+#define TDG_R15 BIT(15)
+
+/*
+ * This mask selects registers R10-R15 to expose to the VMM. It is
+ * passed to the TDX module via the RCX register; the TDX module uses
+ * it to identify the registers exposed to the VMM. Each bit in the
+ * mask represents a register ID. See the TDX GHCI specification for
+ * the bit field details.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDG_R10 | TDG_R11 | \
+ TDG_R12 | TDG_R13 | \
+ TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. The instruction is only
+ * supported by binutils >= 2.36, so open-code it here.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call() - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by the "tdcall" ABI and invokes the
+ * TDX module. If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in the "out" argument),
+ * output from the TDX module is saved to the memory specified by the
+ * "out" pointer. The status of the "tdcall" operation is returned to
+ * the caller as the function return value.
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ * use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+
+ /*
+ * R12 is used as temporary storage for the
+ * struct tdx_module_output pointer (see
+ * arch/x86/include/asm/tdx.h for details).
+ * Note that registers R12-R15 are not used
+ * by the TDCALL services supported by this
+ * helper function.
+ */
+ push %r12 /* Callee saved, so preserve it */
+ mov %r9, %r12 /* Move output pointer to R12 */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
+ mov %r8, %r9 /* Move input 4 to R9 */
+ mov %rcx, %r8 /* Move input 3 to R8 */
+ mov %rsi, %rcx /* Move input 1 to RCX */
+ /* Leave input param 2 in RDX */
+
+ tdcall
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check for TDCALL output struct != NULL */
+ test %r12, %r12
+ jz 1f
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDX_MODULE_rcx(%r12)
+ movq %rdx, TDX_MODULE_rdx(%r12)
+ movq %r8, TDX_MODULE_r8(%r12)
+ movq %r9, TDX_MODULE_r9(%r12)
+ movq %r10, TDX_MODULE_r10(%r12)
+ movq %r11, TDX_MODULE_r11(%r12)
+1:
+ pop %r12 /* Restore the state of R12 register */
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall() - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using the TDCALL instruction.
+ *
+ * This function contains the code common to vendor-specific and
+ * standard TDX hypercalls. The caller of this function has to set
+ * the TDVMCALL type in the R10 register before calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by the "tdcall" ABI and shares them
+ * with the VMM via the TDX module. If the "tdcall" operation is
+ * successful and a valid "struct tdx_hypercall_output" pointer is
+ * available (in the "out" argument), output from the VMM is saved to
+ * the memory specified by the "out" pointer.
+ *
+ * @fn (RDI) - TDVMCALL function, moved to R11
+ * @r12 (RSI) - Input parameter 1, moved to R12
+ * @r13 (RDX) - Input parameter 2, moved to R13
+ * @r14 (RCX) - Input parameter 3, moved to R14
+ * @r15 (R8) - Input parameter 4, moved to R15
+ *
+ * @out (R9) - struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return the TDX hypercall error code.
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+ /* Save non-volatile GPRs that are exposed to the VMM. */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+ mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+ mov %rsi, %r12 /* Move input 1 to R12 */
+ mov %rdx, %r13 /* Move input 2 to R13 */
+ mov %rcx, %r14 /* Move input 3 to R14 */
+ mov %r8, %r15 /* Move input 4 to R15 */
+ /* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * Non-zero RAX values indicate a failure of the TDCALL instruction
+ * itself, which is fatal for the guest, hence the ud2 below. This
+ * value is unrelated to the hypercall result in R10. Also note that
+ * RAX is controlled only by the TDX module and not exposed to the VMM.
+ */
+ test %rax, %rax
+ jnz 2f
+
+ /* Move hypercall error code to RAX to return to user */
+ mov %r10, %rax
+
+ /* Check for hypercall success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check if the caller provided an output struct */
+ test %r9, %r9
+ jz 1f
+
+ /* Copy hypercall result registers to output struct: */
+ movq %r11, TDX_HYPERCALL_r11(%r9)
+ movq %r12, TDX_HYPERCALL_r12(%r9)
+ movq %r13, TDX_HYPERCALL_r13(%r9)
+ movq %r14, TDX_HYPERCALL_r14(%r9)
+ movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+ /*
+ * Zero out registers exposed to the VMM to avoid speculative
+ * execution with VMM-controlled values. This needs to include
+ * all registers present in TDVMCALL_EXPOSE_REGS_MASK.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+ xor %r12d, %r12d
+ xor %r13d, %r13d
+ xor %r14d, %r14d
+ xor %r15d, %r15d
+
+ /* Restore non-volatile GPRs that are exposed to the VMM. */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ ret
+2:
+ ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for standard TDVMCALLs. This assembly wrapper
+ * reuses do_tdx_hypercall() for standard hypercalls
+ * (R10 is set to zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+ /*
+ * R10 is not part of the function call ABI, but it is part of
+ * the TDVMCALL ABI. Set it to 0 for a standard TDVMCALL before
+ * calling do_tdx_hypercall().
+ */
+ xor %r10, %r10
+ call do_tdx_hypercall
+ FRAME_END
+ retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2020 Intel Corporation */

+#define pr_fmt(fmt) "TDX: " fmt
+
#include <asm/tdx.h>

+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return err;
+}
+
+/*
+ * Wrapper for the semi-common case where a single output value
+ * (R11) is needed. Callers of this function do not care about
+ * the hypercall error code (mainly for IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+ u64 r14, u64 r15)
+{
+ struct tdx_hypercall_output out = {0};
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return out.r11;
+}
+
static inline bool cpuid_has_tdx_guest(void)
{
u32 eax, signature[3];
--
2.25.1


Subject: Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

Hi Dave,

On 5/18/21 10:58 PM, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".
>
> In TDX based VM, since VMM is an untrusted entity, a intermediary
> layer (TDX module) exists between host and guest to facilitate the
> secure communication. And "tdcall" instruction  is used by the guest
> to request services from TDX module. And a variant of "tdcall"
> instruction (with specific arguments as defined by GHCI) is used by
> the guest to request services from  VMM via the TDX module.
>
> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>
> __tdx_hypercall() - function can be used to request services from
> the VMM.
> __tdx_module_call()  - function can be used to communicate with the
> TDX Module.
>
> Also define two additional wrappers, tdx_hypercall() and
> tdx_hypercall_out_r11() to cover common use cases of
> __tdx_hypercall() function. Since each use case of
> __tdx_module_call() is different, we don't need such wrappers for it.
>
> Implement __tdx_module_call() and __tdx_hypercall() helper functions
> in assembly.
>
> Rationale behind choosing to use assembly over inline assembly are,
>
> 1. Since the number of lines of instructions (with comments) in
> __tdx_hypercall() implementation is over 70, using inline assembly
> to implement it will make it hard to read.
>
> 2. Also, since many registers (R8-R15, R[A-D]X)) will be used in
> TDCALL operation, if all these registers are included in in-line
> assembly constraints, some of the older compilers may not
> be able to meet this requirement.
>
> Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
> use the same set of argument registers. The implementation here picks
> the current worst-case scenario for TDCALL (4 registers). For TDCALLs
> with fewer than 4 arguments, there will end up being a few superfluous
> (cheap) instructions.  But, this approach maximizes code reuse. The
> same argument applies to __tdx_hypercall() function as well.
>
> Current implementation of __tdx_hypercall() includes error handling
> (ud2 on failure case) in assembly function instead of doing it in C
> wrapper function. The reason behind this choice is, when adding support
> for in/out instructions (refer to patch titled "x86/tdx: Handle port
> I/O" in this series), we use alternative_io() to substitute in/out
> instruction with  __tdx_hypercall() calls. So use of C wrappers is not
> trivial in this case because the input parameters will be in the wrong
> registers and it's tricky to include proper buffer code to make this
> happen.
>
> For registers used by TDCALL instruction, please check TDX GHCI
> specification, sec 2.4 and 3.
>
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
>
> Originally-by: Sean Christopherson<[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>

I did send it as in-reply-to message id [email protected] (your
last reply mail id), but for some reason it's not detected as a reply to the original
patch "[RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions".

I am not sure what's going on, but please review it as a reply to the original patch.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 20:13:38

by Dave Hansen

Subject: Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

On 5/18/21 10:58 PM, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".
>
> In TDX based VM, since VMM is an untrusted entity, a intermediary

"In a TDX-based VM..."

> layer (TDX module) exists between host and guest to facilitate the
> secure communication. And "tdcall" instruction  is used by the guest
> to request services from TDX module. And a variant of "tdcall"
> instruction (with specific arguments as defined by GHCI) is used by
> the guest to request services from  VMM via the TDX module.

I'd just say:

TDX guests communicate with the TDX module and with the VMM
using a new instruction: TDCALL.

The rest of that is noise.

> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>    
> __tdx_hypercall() - function can be used to request services from
> the VMM.
> __tdx_module_call()  - function can be used to communicate with the
> TDX Module.

s/function can be used to//

> Also define two additional wrappers, tdx_hypercall() and
> tdx_hypercall_out_r11() to cover common use cases of
> __tdx_hypercall() function. Since each use case of
> __tdx_module_call() is different, we don't need such wrappers for it.
>
> Implement __tdx_module_call() and __tdx_hypercall() helper functions
> in assembly.
>
> Rationale behind choosing to use assembly over inline assembly are,
>
> 1. Since the number of lines of instructions (with comments) in
> __tdx_hypercall() implementation is over 70, using inline assembly
> to implement it will make it hard to read.
>    
> 2. Also, since many registers (R8-R15, R[A-D]X)) will be used in
> TDCALL operation, if all these registers are included in in-line
> assembly constraints, some of the older compilers may not
> be able to meet this requirement.

Was this "older compiler" argument really the reason?

> Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
> use the same set of argument registers. The implementation here picks
> the current worst-case scenario for TDCALL (4 registers). For TDCALLs
> with fewer than 4 arguments, there will end up being a few superfluous
> (cheap) instructions.  But, this approach maximizes code reuse. The
> same argument applies to __tdx_hypercall() function as well.
>
> Current implementation of __tdx_hypercall() includes error handling
> (ud2 on failure case) in assembly function instead of doing it in C
> wrapper function. The reason behind this choice is, when adding support
> for in/out instructions (refer to patch titled "x86/tdx: Handle port
> I/O" in this series), we use alternative_io() to substitute in/out
> instruction with  __tdx_hypercall() calls. So use of C wrappers is not
> trivial in this case because the input parameters will be in the wrong
> registers and it's tricky to include proper buffer code to make this
> happen.
>
> For registers used by TDCALL instruction, please check TDX GHCI
> specification, sec 2.4 and 3.
>
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
>
> Originally-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

For what it's worth, that changelog really starts to ramble after the
"rationale" part.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 69af72d08d3d..211b9d66b1b1 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,12 +8,50 @@
> #ifdef CONFIG_INTEL_TDX_GUEST
>
> #include <asm/cpufeature.h>
> +#include <linux/types.h>
> +
> +/*
> + * Used in __tdx_module_call() helper function to gather the
> + * output registers values of TDCALL instruction when requesting

There's something wrong in this sentence. It needs to be "output
register values" or "output registers' values".

> + * services from the TDX module. This is software only structure
> + * and not related to TDX module/VMM.
> + */
> +struct tdx_module_output {
> + u64 rcx;
> + u64 rdx;
> + u64 r8;
> + u64 r9;
> + u64 r10;
> + u64 r11;
> +};
> +
> +/*
> + * Used in __tdx_hypercall() helper function to gather the
> + * output registers values of TDCALL instruction when requesting
> + * services from the VMM. This is software only structure
> + * and not related to TDX module/VMM.
> + */
> +struct tdx_hypercall_output {
> + u64 r11;
> + u64 r12;
> + u64 r13;
> + u64 r14;
> + u64 r15;
> +};
>
> /* Common API to check TDX support in decompression and common kernel code. */
> bool is_tdx_guest(void);
>
> void __init tdx_early_init(void);
>
> +/* Helper function used to communicate with the TDX module */
> +u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + struct tdx_module_output *out);
> +
> +/* Helper function used to request services from VMM */
> +u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> + struct tdx_hypercall_output *out);
> +
> #else // !CONFIG_INTEL_TDX_GUEST
>
> static inline bool is_tdx_guest(void)
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index ea111bf50691..7966c10ea8d1 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
> obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>
> obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
> -obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
>
> obj-$(CONFIG_EISA) += eisa.o
> obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 60b9f42ce3c1..e6b3bb983992 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -23,6 +23,10 @@
> #include <xen/interface/xen.h>
> #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#include <asm/tdx.h>
> +#endif
> +
> #ifdef CONFIG_X86_32
> # include "asm-offsets_32.c"
> #else
> @@ -75,6 +79,24 @@ static void __used common(void)
> OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
> #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + BLANK();
> + /* Offset for fields in tdcall_output */
> + OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
> + OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
> + OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
> + OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
> + OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
> + OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
> +
> + /* Offset for fields in tdvmcall_output */
> + OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
> + OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
> + OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
> + OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
> + OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
> +#endif
> +
> BLANK();
> OFFSET(BP_scratch, boot_params, scratch);
> OFFSET(BP_secure_boot, boot_params, secure_boot);
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> new file mode 100644
> index 000000000000..a67c595e4169
> --- /dev/null
> +++ b/arch/x86/kernel/tdcall.S
> @@ -0,0 +1,222 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <asm/asm-offsets.h>
> +#include <asm/asm.h>
> +#include <asm/frame.h>
> +#include <asm/unwind_hints.h>
> +
> +#include <linux/linkage.h>
> +#include <linux/bits.h>
> +
> +#define TDG_R10 BIT(10)
> +#define TDG_R11 BIT(11)
> +#define TDG_R12 BIT(12)
> +#define TDG_R13 BIT(13)
> +#define TDG_R14 BIT(14)
> +#define TDG_R15 BIT(15)
> +
> +/*
> + * Expose registers R10-R15 to VMM. It is passed via RCX register
> + * to the TDX Module, which will be used by the TDX module to
> + * identify the list of registers exposed to VMM. Each bit in this
> + * mask represents a register ID. You can find the bit field details
> + * in TDX GHCI specification.
> + */
> +#define TDVMCALL_EXPOSE_REGS_MASK ( TDG_R10 | TDG_R11 | \
> + TDG_R12 | TDG_R13 | \
> + TDG_R14 | TDG_R15 )
> +
> +/*
> + * TDX guests use the TDCALL instruction to make requests to the
> + * TDX module and hypercalls to the VMM. It is supported in
> + * Binutils >= 2.36.
> + */
> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> +
> +/*
> + * __tdx_module_call() - Helper function used by TDX guests to request
> + * services from the TDX module (does not include VMM services).
> + *
> + * This function serves as a wrapper to move user call arguments to the
> + * correct registers as specified by "tdcall" ABI and shares it with the
> + * TDX module.  And if the "tdcall" operation is successful and a valid

It's frequently taught to never start a sentence with "And" in formal
writing. You use it fairly frequently. Simply removing it increases
readability, IMNHO.

> + * "struct tdx_module_output" pointer is available (in "out" argument),
> + * output from the TDX module is saved to the memory specified in the
> + * "out" pointer. Also the status of the "tdcall" operation is returned
> + * back to the user as a function return value.
> + *
> + * @fn (RDI) - TDCALL Leaf ID, moved to RAX
> + * @rcx (RSI) - Input parameter 1, moved to RCX
> + * @rdx (RDX) - Input parameter 2, moved to RDX
> + * @r8 (RCX) - Input parameter 3, moved to R8
> + * @r9 (R8) - Input parameter 4, moved to R9
> + *
> + * @out (R9) - struct tdx_module_output pointer
> + * stored temporarily in R12 (not
> + * shared with the TDX module)
> + *
> + * Return status of tdcall via RAX.
> + *
> + * NOTE: This function should not be used for TDX hypercall
> + * use cases.
> + */
> +SYM_FUNC_START(__tdx_module_call)
> + FRAME_BEGIN
> +
> + /*
> + * R12 will be used as temporary storage for
> + * struct tdx_module_output pointer. You can
> + * find struct tdx_module_output details in
> + * arch/x86/include/asm/tdx.h. Also note that
> + * registers R12-R15 are not used by TDCALL
> + * services supported by this helper function.
> + */
> + push %r12 /* Callee saved, so preserve it */
> + mov %r9, %r12 /* Move output pointer to R12 */
> +
> + /* Mangle function call ABI into TDCALL ABI: */
> + mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
> + mov %r8, %r9 /* Move input 4 to R9 */
> + mov %rcx, %r8 /* Move input 3 to R8 */
> + mov %rsi, %rcx /* Move input 1 to RCX */
> + /* Leave input param 2 in RDX */
> +
> + tdcall
> +
> + /* Check for TDCALL success: 0 - Successful, otherwise failed */
> + test %rax, %rax
> + jnz 1f
> +
> + /* Check for TDCALL output struct != NULL */
> + test %r12, %r12
> + jz 1f
> +
> + /* Copy TDCALL result registers to output struct: */
> + movq %rcx, TDX_MODULE_rcx(%r12)
> + movq %rdx, TDX_MODULE_rdx(%r12)
> + movq %r8, TDX_MODULE_r8(%r12)
> + movq %r9, TDX_MODULE_r9(%r12)
> + movq %r10, TDX_MODULE_r10(%r12)
> + movq %r11, TDX_MODULE_r11(%r12)
> +1:
> + pop %r12 /* Restore the state of R12 register */
> +
> + FRAME_END
> + ret
> +SYM_FUNC_END(__tdx_module_call)
> +
> +/*
> + * do_tdx_hypercall() - Helper function used by TDX guests to request
> + * services from the VMM. All requests are made via the TDX module
> + * using "TDCALL" instruction.
> + *
> + * This function is created to contain common between vendor specific

This sentence seems wrong. Common... what?

> + * and standard type tdx hypercalls. So the caller of this function had

Please capitalize "tdx" consistently.

> + * to set the TDVMCALL type in the R10 register before calling it.

> + * This function serves as a wrapper to move user call arguments to the
> + * correct registers as specified by "tdcall" ABI and shares it with VMM
> + * via the TDX module. And if the "tdcall" operation is successful and a
> + * valid "struct tdx_hypercall_output" pointer is available (in "out"
> + * argument), output from the VMM is saved to the memory specified in the
> + * "out" pointer. 
> + *
> + * @fn (RDI) - TDVMCALL function, moved to R11
> + * @r12 (RSI) - Input parameter 1, moved to R12
> + * @r13 (RDX) - Input parameter 2, moved to R13
> + * @r14 (RCX) - Input parameter 3, moved to R14
> + * @r15 (R8) - Input parameter 4, moved to R15
> + *
> + * @out (R9) - struct tdx_hypercall_output pointer
> + *
> + * On successful completion, return TDX hypercall error code.
> + * If the "tdcall" operation fails, panic.
> + *
> + */

This sounds scary. Can you try to differentiate a hypercall failure from
a "tdcall" failure?

Actually, I think that's done OK below. Just remove this mention of
panic().

> +SYM_FUNC_START_LOCAL(do_tdx_hypercall)
> + /* Save non-volatile GPRs that are exposed to the VMM. */
> + push %r15
> + push %r14
> + push %r13
> + push %r12
> +
> + /* Leave hypercall output pointer in R9, it's not clobbered by VMM */
> +
> + /* Mangle function call ABI into TDCALL ABI: */
> + xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
> + mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
> + mov %rsi, %r12 /* Move input 1 to R12 */
> + mov %rdx, %r13 /* Move input 2 to R13 */
> + mov %rcx, %r14 /* Move input 1 to R14 */
> + mov %r8, %r15 /* Move input 1 to R15 */
> + /* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
> +
> + movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> + tdcall
> +
> + /*
> + * Check for TDCALL success: 0 - Successful, otherwise failed.
> + * If failed, there is an issue with TDX Module which is fatal
> + * for the guest. So panic. Also note that RAX is controlled
> + * only by the TDX module and not exposed to VMM.
> + */

I'd probably just say:

/*
* Non-zero RAX values indicate a failure of TDCALL itself.
* Panic for those. This value is unrelated to the hypercall
* result in R10.
*/

> + test %rax, %rax
> + jnz 2f
> +
> + /* Move hypercall error code to RAX to return to user */
> + mov %r10, %rax
> +
> + /* Check for hypercall success: 0 - Successful, otherwise failed */
> + test %rax, %rax
> + jnz 1f
> +
> + /* Check for hypercall output struct != NULL */

This is a great example of a comment that's not using its space wisely.
If you're reading this, you *KNOW* that it's checking for NULL. But
what does that *MEAN*?

Why not:

/* Check if caller provided an output struct */

> + test %r9, %r9
> + jz 1f
> +
> + /* Copy hypercall result registers to output struct: */
> + movq %r11, TDX_HYPERCALL_r11(%r9)
> + movq %r12, TDX_HYPERCALL_r12(%r9)
> + movq %r13, TDX_HYPERCALL_r13(%r9)
> + movq %r14, TDX_HYPERCALL_r14(%r9)
> + movq %r15, TDX_HYPERCALL_r15(%r9)
> +1:
> + /*
> + * Zero out registers exposed to the VMM to avoid
> + * speculative execution with VMM-controlled values.
> + */

You can even say:

This needs to include all registers present in
TDVMCALL_EXPOSE_REGS_MASK

> + xor %r10d, %r10d
> + xor %r11d, %r11d
> + xor %r12d, %r12d
> + xor %r13d, %r13d
> + xor %r14d, %r14d
> + xor %r15d, %r15d
> +
> + /* Restore non-volatile GPRs that are exposed to the VMM. */
> + pop %r12
> + pop %r13
> + pop %r14
> + pop %r15
> +
> + ret
> +2:
> + ud2
> +SYM_FUNC_END(do_tdx_hypercall)
> +
> +/*
> + * Helper function for for standard type of TDVMCALLs. This assembly
> + * wrapper lets us reuse do_tdvmcall() for standard type of hypercalls
> + * (R10 is set as zero).
> + */

Remember, no "us", "we" in changelogs or comments.

> +SYM_FUNC_START(__tdx_hypercall)
> + FRAME_BEGIN
> + /*
> + * R10 is not part of the function call ABI, but it is a part
> + * of the TDVMCALL ABI. So set it 0 for standard type TDVMCALL
> + * before making call to the do_tdx_hypercall().
> + */
> + xor %r10, %r10
> + call do_tdx_hypercall
> + FRAME_END
> + retq
> +SYM_FUNC_END(__tdx_hypercall)

The rest of it is fine. Probably just one more rev to beef up the
comments and changelogs.

2021-05-19 20:15:40

by Dave Hansen

Subject: Re: [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMALL

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
> shared memory mappings. Shared GPA mappings can be used for
> communication between TD guest and host VMM, for example for
> paravirtualized IO.

As usual, I hate the changelog. This appears to just be regurgitating
the spec.

Is this just for part of converting an existing mapping between private
and shared? If so, please say that.

> The new helper tdx_map_gpa() provides access to the operation.

<sigh> You got your own name wrong. It's tdg_map_gpa() in the patch.

BTW, I agree with Sean on this one: "tdg" is a horrible prefix. You
just proved Sean's point by mistyping it. *EVERYONE* is going to repeat
that mistake: tdg -> tdx.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index dc80cf7f7d08..4789798d7737 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -7,6 +7,11 @@
>
> #ifndef __ASSEMBLY__
>
> +enum tdx_map_type {
> + TDX_MAP_PRIVATE,
> + TDX_MAP_SHARED,
> +};

I like the enum, but please call out that this is a software construct,
not a part of any hardware or VMM ABI.

> #ifdef CONFIG_INTEL_TDX_GUEST
>
> #include <asm/cpufeature.h>
> @@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
> unsigned int tdg_inl(unsigned short port);
>
> extern phys_addr_t tdg_shared_mask(void);
> +extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
> + enum tdx_map_type map_type);
>
> #else // !CONFIG_INTEL_TDX_GUEST
>
> @@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
> {
> return 0;
> }
> +
> +static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
> + enum tdx_map_type map_type)
> +{
> + return -ENODEV;
> +}

FWIW, you could probably get away with just inlining tdg_map_gpa():

static inline int tdg_map_gpa(phys_addr_t gpa, int numpages, ...
{
u64 ret;

if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
return -ENODEV;

if (map_type == TDX_MAP_SHARED)
gpa |= tdg_shared_mask();

ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, ...

return ret ? -EIO : 0;
}

Then you don't have three copies of the function signature that can get
out of sync.
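To flesh out the suggestion, a complete userspace sketch of the inlined approach might look like this. Note this is illustrative only: tdvmcall() is stubbed out, and the GPA width and PAGE_SIZE_SIM values are assumptions, not the real ABI:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative stand-ins; not the kernel's real constants or ABI. */
#define TDVMCALL_MAP_GPA	0x10001
#define GPA_WIDTH		52	/* hypothetical td_info.gpa_width */
#define PAGE_SIZE_SIM		4096

enum tdx_map_type { TDX_MAP_PRIVATE, TDX_MAP_SHARED };

static uint64_t last_gpa;	/* captured by the stub for inspection */

/* Stub standing in for the real tdvmcall(); it always "succeeds". */
static uint64_t tdvmcall_stub(uint64_t fn, uint64_t gpa, uint64_t len)
{
	(void)fn;
	(void)len;
	last_gpa = gpa;
	return 0;
}

static uint64_t tdg_shared_mask_sim(void)
{
	return 1ULL << (GPA_WIDTH - 1);
}

/* Mirrors the shape of the suggested inline helper. */
static int tdg_map_gpa_sim(uint64_t gpa, int numpages, enum tdx_map_type type)
{
	uint64_t ret;

	if (type == TDX_MAP_SHARED)
		gpa |= tdg_shared_mask_sim();

	ret = tdvmcall_stub(TDVMCALL_MAP_GPA, gpa,
			    (uint64_t)PAGE_SIZE_SIM * numpages);
	return ret ? -EIO : 0;
}
```

The stub makes it easy to see the one interesting transformation: shared requests get the shared bit OR'd into the GPA, private requests do not.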

> #endif /* CONFIG_INTEL_TDX_GUEST */
> #endif /* __ASSEMBLY__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 7e391cd7aa2b..074136473011 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -15,6 +15,8 @@
> #include "tdx-kvm.c"
> #endif
>
> +#define TDVMCALL_MAP_GPA 0x10001
> +
> static struct {
> unsigned int gpa_width;
> unsigned long attributes;
> @@ -98,6 +100,17 @@ static void tdg_get_info(void)
> physical_mask &= ~tdg_shared_mask();
> }
>
> +int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
> +{
> + u64 ret;
> +
> + if (map_type == TDX_MAP_SHARED)
> + gpa |= tdg_shared_mask();
> +
> + ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
> + return ret ? -EIO : 0;
> +}

The naming Intel chose here is nasty. This doesn't "map" anything. It
modifies an existing mapping from what I can tell. We could name it
much better than the spec, perhaps:

tdx_hcall_gpa_intent()

BTW, all of these hypercalls need a consistent prefix.

It also needs a comment:

/*
* Inform the VMM of the guest's intent for this physical page:
* shared with the VMM or private to the guest. The VMM is
* expected to change its mapping of the page in response.
*
* Note: shared->private conversions require further guest
* action to accept the page.
*/

The intent here is important. It makes it clear that this function
really only plays a role in the conversion process.

2021-05-19 20:16:26

by Dave Hansen

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> tdx_shared_mask() returns the mask that has to be set in a page
> table entry to make page shared with VMM.

Here's a rewrite:

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX uses a
single bit in guests to communicate whether a physical page should be
protected by TDX as private memory (bit set to 0) or unprotected and
shared with the VMM (bit set to 1).

Add a helper, tdg_shared_mask() (bad name please fix it) to generate the
mask. The processor enumerates its physical address width to include
the shared bit, which means it gets included in __PHYSICAL_MASK by default.

Remove the shared mask from 'physical_mask' since any bits in
tdg_shared_mask() are not used for physical addresses in page table entries.

--

BTW, do you find it confusing that the subject says: '__PHYSICAL_MASK'
and yet the code only modifies 'physical_mask'?

> Also, note that we cannot club shared mapping configuration between
> AMD SME and Intel TDX Guest platforms in common function. SME has
> to do it very early in __startup_64() as it sets the bit on all
> memory, except what is used for communication. TDX can postpone as
> we don't need any shared mapping in very early boot.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/tdx.h | 6 ++++++
> arch/x86/kernel/tdx.c | 9 +++++++++
> 3 files changed, 16 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 67f99bf27729..5f92e8205de2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
> select PARAVIRT_XL
> select X86_X2APIC
> select SECURITY_LOCKDOWN_LSM
> + select X86_MEM_ENCRYPT_COMMON
> help
> Provide support for running in a trusted domain on Intel processors
> equipped with Trusted Domain eXtensions. TDX is a new Intel
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b972c6531a53..dc80cf7f7d08 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
> unsigned short tdg_inw(unsigned short port);
> unsigned int tdg_inl(unsigned short port);
>
> +extern phys_addr_t tdg_shared_mask(void);
> +
> #else // !CONFIG_INTEL_TDX_GUEST
>
> static inline bool is_tdx_guest(void)
> @@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> return -ENODEV;
> }
>
> +static inline phys_addr_t tdg_shared_mask(void)
> +{
> + return 0;
> +}
> #endif /* CONFIG_INTEL_TDX_GUEST */
> #endif /* __ASSEMBLY__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 1f1bb98e1d38..7e391cd7aa2b 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -76,6 +76,12 @@ bool is_tdx_guest(void)
> }
> EXPORT_SYMBOL_GPL(is_tdx_guest);
>
> +/* The highest bit of a guest physical address is the "sharing" bit */
> +phys_addr_t tdg_shared_mask(void)
> +{
> + return 1ULL << (td_info.gpa_width - 1);
> +}

Why not just inline this thing? Functions don't get any smaller than
that. Or does it not get used anywhere else? Or are you concerned
about exporting td_info?
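For what it's worth, the arithmetic in tdg_shared_mask() and the physical_mask update is simple enough to sanity-check in userspace (a sketch, not kernel code; the gpa_width of 52 is just an example value):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors tdg_shared_mask(): the top GPA bit is the "sharing" bit. */
static uint64_t shared_mask(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

/* What "physical_mask &= ~tdg_shared_mask()" does to the mask. */
static uint64_t strip_shared_bit(uint64_t physical_mask,
				 unsigned int gpa_width)
{
	return physical_mask & ~shared_mask(gpa_width);
}
```

With gpa_width = 52, the mask covering bits 0-51 loses bit 51 and becomes a 51-bit physical address mask.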

> static void tdg_get_info(void)
> {
> u64 ret;
> @@ -87,6 +93,9 @@ static void tdg_get_info(void)
>
> td_info.gpa_width = out.rcx & GENMASK(5, 0);
> td_info.attributes = out.rdx;
> +
> + /* Exclude Shared bit from the __PHYSICAL_MASK */
> + physical_mask &= ~tdg_shared_mask();
> }
>
> static __cpuidle void tdg_halt(void)
>


2021-05-19 20:18:22

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms

On 5/17/21 5:59 PM, Kuppuswamy Sathyanarayanan wrote:
> From: Sean Christopherson <[email protected]>
>
> Avoid operations which will inject #VE during boot process,
> which is obviously fatal for TDX platforms.

It's not "obviously fatal". We actually have early exception handlers.
Please give an actual reason. "They're easy to avoid, and that sure
beats handling the exceptions" is a perfectly fine reason.

> Details are,
>
> 1. TDX module injects #VE if a TDX guest attempts to write
>    EFER.
>    
>    Boot code updates EFER in following cases:
>    
>    * When enabling Long Mode configuration, EFER.LME bit will
>      be set. Since TDX forces EFER.LME=1, we can skip updating
>      it again. Check for EFER.LME before updating it and skip
>      it if it is already set.
>
>    * EFER is also updated to enable support for features like
>      System call and No Execute page setting. In TDX, these
>      features are set up by the TDX module. So check whether
>      it is already enabled, and skip enabling it again.
>    
> 2. TDX module also injects a #VE if the guest attempts to clear
>    CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
> boot. Setting CR0.NE should be a nop on all CPUs that
>    support 64-bit mode.
>    
> 3. The TDX-Module (effectively part of the hypervisor) requires

So, after we've mentioned the TDX module a few times, *NOW* we feel the
need to explain what it is? I'm also baffled by this little aside.
Literally the WHOLE POINT FOR SEAM TO EXIST is that it is NOT PART OF
THE HYPERVISOR. The whole point. Literally.

>    CR4.MCE to be set at all times and injects a #VE if the guest
>    attempts to clear CR4.MCE. So, preserve CR4.MCE instead of
>    clearing it during boot to avoid #VE.

This is a good example of a changelog run amok. It doesn't need to be
an English language reproduction of the code. This is getting close.

This can all be replaced and improved with a high-level discussion of
what is going on:

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are
all in the correct state before the kernel loads, which means
the kernel has no need to modify them.

The conditions we need to avoid are:
1. Any writes to the EFER MSR
2. Clearing CR0.NE
3. Clearing CR4.MCE
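For the record, the CR4 handling in the patch below boils down to a simple bit manipulation. Here's a userspace sketch (the bit positions follow the x86 SDM; the function name is made up for illustration):

```c
#include <assert.h>

/* Bit positions per the x86 SDM / processor-flags.h. */
#define X86_CR4_PAE	(1UL << 5)
#define X86_CR4_MCE	(1UL << 6)
#define X86_CR4_LA57	(1UL << 12)

/*
 * Mirrors the andl/orl sequence in the trampoline: clear every bit
 * except CR4.MCE (which TDX requires to stay set), then set PAE,
 * plus LA57 when 5-level paging is in use.
 */
static unsigned long trampoline_cr4(unsigned long old_cr4, int la57)
{
	unsigned long cr4 = old_cr4 & X86_CR4_MCE;

	cr4 |= X86_CR4_PAE;
	if (la57)
		cr4 |= X86_CR4_LA57;
	return cr4;
}
```

Any stray bit other than MCE in the old value is dropped, which matches the "clear all bits except CR4.MCE" intent.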

> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index e94874f4bbc1..2d79e5f97360 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -616,12 +616,16 @@ SYM_CODE_START(trampoline_32bit_src)
> movl $MSR_EFER, %ecx
> rdmsr
> btsl $_EFER_LME, %eax
> + jc 1f
> wrmsr
> - popl %edx
> +1: popl %edx

A comment would be nice:

/* Avoid writing EFER if no change was made (for TDX guest) */

> popl %ecx
>
> /* Enable PAE and LA57 (if required) paging modes */
> - movl $X86_CR4_PAE, %eax
> + movl %cr4, %eax
> + /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
> + andl $X86_CR4_MCE, %eax

Maybe I'm just dense today, but I was boggling about what this 'andl' is
actually doing. This would help:

/*
* Clear all bits except CR4.MCE, which is preserved.
* Clearing CR4.MCE will #VE in TDX guests.
*/

> + orl $X86_CR4_PAE, %eax
> testl %edx, %edx
> jz 1f
> orl $X86_CR4_LA57, %eax
> @@ -636,7 +640,7 @@ SYM_CODE_START(trampoline_32bit_src)
> pushl %eax
>
> /* Enable paging again */
> - movl $(X86_CR0_PG | X86_CR0_PE), %eax
> + movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
> movl %eax, %cr0

Shouldn't we also comment the X86_CR0_NE?

/* Enable paging again. Avoid clearing X86_CR0_NE for TDX. */

> lret
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 04bddaaba8e2..92c77cf75542 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> 1:
>
> /* Enable PAE mode, PGE and LA57 */
> - movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
> + movq %cr4, %rcx
> + /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */
> + andl $X86_CR4_MCE, %ecx
> + orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx

Ditto on the comment from above about clearing/preserving bits.

> #ifdef CONFIG_X86_5LEVEL
> testl $1, __pgtable_l5_enabled(%rip)
> jz 1f
> @@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> /* Setup EFER (Extended Feature Enable Register) */
> movl $MSR_EFER, %ecx
> rdmsr
> + movl %eax, %edx

Comment, please.

> btsl $_EFER_SCE, %eax /* Enable System Call */
> btl $20,%edi /* No Execute supported? */
> jnc 1f
> btsl $_EFER_NX, %eax
> btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1: wrmsr /* Make changes effective */
>
> + /* Skip the WRMSR if the current value matches the desired value. */

If I read this comment in 5 years, I'm going to ask "Why bother?".
Please mention TDX.

> +1: cmpl %edx, %eax
> + je 1f
> + xor %edx, %edx
> + wrmsr /* Make changes effective */
> +1:
> /* Setup cr0 */
> movl $CR0_STATE, %eax
> /* Make changes effective */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 754f8d2ac9e8..12b734b1da8b 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
> movl %eax, %cr3
>
> # Set up EFER
> + movl $MSR_EFER, %ecx
> + rdmsr
> + cmp pa_tr_efer, %eax
> + jne .Lwrite_efer
> + cmp pa_tr_efer + 4, %edx

Comment, please:

# Skip EFER writes to avoid faults in TDX guests

> + je .Ldone_efer
> +.Lwrite_efer:
> movl pa_tr_efer, %eax
> movl pa_tr_efer + 4, %edx
> - movl $MSR_EFER, %ecx
> wrmsr
>
> +.Ldone_efer:
> # Enable paging and in turn activate Long Mode
> - movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
> + movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
> movl %eax, %cr0
>
> /*
>


Subject: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

Guests communicate with VMMs using hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest to
facilitate secure communication. TDX guests communicate with the TDX
module and with the VMM using a new instruction: TDCALL.

Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
   
__tdx_hypercall() - request services from the VMM.
__tdx_module_call()  - communicate with the TDX Module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11() to cover common use cases of
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, we don't need such wrappers for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing raw assembly over inline assembly is that
the __tdx_hypercall() implementation is over 70 lines of instructions
(with comments); implementing it in inline assembly would make it
hard to read.
   
Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But, this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() includes error
handling (ud2 on the failure case) in the assembly function instead of
doing it in a C wrapper function. The reason behind this choice is
that when adding support for in/out instructions (refer to the patch
titled "x86/tdx: Handle port I/O" in this series), we use
alternative_io() to substitute the in/out instruction with
__tdx_hypercall() calls. The use of C wrappers is not trivial in that
case because the input parameters would be in the wrong registers and
it's tricky to include the proper fix-up code to make this happen.

For registers used by TDCALL instruction, please check TDX GHCI
specification, sec 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix-v1:
* Fixed commit log and comment corrections as suggested by Dave.

arch/x86/include/asm/tdx.h | 38 ++++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 22 ++++
arch/x86/kernel/tdcall.S | 223 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 39 ++++++
5 files changed, 323 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..fcd42119a287 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};

/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

void __init tdx_early_init(void);

+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdx_hypercall_output *out);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
#include <xen/interface/xen.h>
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offset for fields in tdcall_output */
+ OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+ /* Offset for fields in tdvmcall_output */
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..b06e8b62dfe2
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10 BIT(10)
+#define TDG_R11 BIT(11)
+#define TDG_R12 BIT(12)
+#define TDG_R13 BIT(13)
+#define TDG_R14 BIT(14)
+#define TDG_R15 BIT(15)
+
+/*
+ * Expose registers R10-R15 to the VMM. This mask is passed to the
+ * TDX module in RCX and is used by the TDX module to identify the
+ * list of registers exposed to the VMM. Each bit in this mask
+ * represents a register ID. The bit field details can be found in
+ * the TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDG_R10 | TDG_R11 | \
+ TDG_R12 | TDG_R13 | \
+ TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call() - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module. If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also, the status of the "tdcall" operation is
+ * returned to the caller as the function return value.
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ * use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+
+ /*
+ * R12 will be used as temporary storage for
+ * struct tdx_module_output pointer. You can
+ * find struct tdx_module_output details in
+ * arch/x86/include/asm/tdx.h. Also note that
+ * registers R12-R15 are not used by TDCALL
+ * services supported by this helper function.
+ */
+ push %r12 /* Callee saved, so preserve it */
+ mov %r9, %r12 /* Move output pointer to R12 */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
+ mov %r8, %r9 /* Move input 4 to R9 */
+ mov %rcx, %r8 /* Move input 3 to R8 */
+ mov %rsi, %rcx /* Move input 1 to RCX */
+ /* Leave input param 2 in RDX */
+
+ tdcall
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check if caller provided an output struct */
+ test %r12, %r12
+ jz 1f
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDX_MODULE_rcx(%r12)
+ movq %rdx, TDX_MODULE_rdx(%r12)
+ movq %r8, TDX_MODULE_r8(%r12)
+ movq %r9, TDX_MODULE_r9(%r12)
+ movq %r10, TDX_MODULE_r10(%r12)
+ movq %r11, TDX_MODULE_r11(%r12)
+1:
+ pop %r12 /* Restore the state of R12 register */
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall() - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function contains the code common to both vendor-specific
+ * and standard TDX hypercalls. So the caller of this
+ * function has to set the TDVMCALL type in the R10 register before
+ * calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. If the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer. 
+ *
+ * @fn (RDI) - TDVMCALL function, moved to R11
+ * @r12 (RSI) - Input parameter 1, moved to R12
+ * @r13 (RDX) - Input parameter 2, moved to R13
+ * @r14 (RCX) - Input parameter 3, moved to R14
+ * @r15 (R8) - Input parameter 4, moved to R15
+ *
+ * @out (R9) - struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ *
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+ /* Save non-volatile GPRs that are exposed to the VMM. */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+ mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+ mov %rsi, %r12 /* Move input 1 to R12 */
+ mov %rdx, %r13 /* Move input 2 to R13 */
 mov %rcx, %r14 /* Move input 3 to R14 */
 mov %r8, %r15 /* Move input 4 to R15 */
+ /* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * Non-zero RAX values indicate a failure of TDCALL itself.
+ * Panic for those. This value is unrelated to the hypercall
+ * result in R10.
+ */
+ test %rax, %rax
+ jnz 2f
+
+ /* Move hypercall error code to RAX to return to user */
+ mov %r10, %rax
+
+ /* Check for hypercall success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check if caller provided an output struct */
+ test %r9, %r9
+ jz 1f
+
+ /* Copy hypercall result registers to output struct: */
+ movq %r11, TDX_HYPERCALL_r11(%r9)
+ movq %r12, TDX_HYPERCALL_r12(%r9)
+ movq %r13, TDX_HYPERCALL_r13(%r9)
+ movq %r14, TDX_HYPERCALL_r14(%r9)
+ movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+ /*
+ * Zero out registers exposed to the VMM to avoid
+ * speculative execution with VMM-controlled values.
+ * This needs to include all registers present in
+ * TDVMCALL_EXPOSE_REGS_MASK.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+ xor %r12d, %r12d
+ xor %r13d, %r13d
+ xor %r14d, %r14d
+ xor %r15d, %r15d
+
+ /* Restore non-volatile GPRs that are exposed to the VMM. */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ ret
+2:
+ ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for standard type of TDVMCALLs. This assembly
+ * wrapper reuses do_tdx_hypercall() for standard type of hypercalls
+ * (R10 is set as zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+ /*
+ * R10 is not part of the function call ABI, but it is a part
+ * of the TDVMCALL ABI. So set it 0 for standard type TDVMCALL
+ * before making call to the do_tdx_hypercall().
+ */
+ xor %r10, %r10
+ call do_tdx_hypercall
+ FRAME_END
+ retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2020 Intel Corporation */

+#define pr_fmt(fmt) "TDX: " fmt
+
#include <asm/tdx.h>

+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output
+ * value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly the IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+ u64 r14, u64 r15)
+{
+
+ struct tdx_hypercall_output out = {0};
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return out.r11;
+}
+
static inline bool cpuid_has_tdx_guest(void)
{
u32 eax, signature[3];
--
2.25.1
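As a quick sanity check on the TDVMCALL_EXPOSE_REGS_MASK defined in tdcall.S above: exposing R10-R15 means setting bits 10-15 of the register-ID bitfield, i.e. the constant 0xfc00. A standalone sketch (not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1ULL << (n))

/* Same composition as TDVMCALL_EXPOSE_REGS_MASK in tdcall.S. */
static const uint64_t expose_mask =
	BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
```

Every register zeroed in the "avoid speculative execution" block (R10-R15) is exactly the set covered by this mask, which is the invariant the comment in the assembly asks for.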


Subject: Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions



On 5/19/21 8:31 AM, Dave Hansen wrote:
> Was this "older compiler" argument really the reason?

It is a speculation. I haven't tried to reproduce it with old compiler. So
I have removed that point.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-19 20:25:03

by Sean Christopherson

Subject: Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

On Wed, May 19, 2021, Kuppuswamy, Sathyanarayanan wrote:
>
> On 5/19/21 8:31 AM, Dave Hansen wrote:
> > Was this "older compiler" argument really the reason?
>
> It is a speculation. I haven't tried to reproduce it with old compiler. So
> I have removed that point.

It's not "older" compilers. gcc does not support R8-R15 as input/output
constraints, which means inline asm needs to do register shenanigans, and those
are horribly fragile because the compiler does not ensure register variables are
preserved outside of asm blobs. E.g. adding a print like so can corrupt r10,
which makes it an absolute nightmare to debug/trace flows that pass r8-r15 to
asm blobs since looking at the code the wrong way can break things.

register unsigned long r10 asm("r10") = __r10;

pr_info("TDCALL: RAX = %lx, R10 = %lx\n", rax, __r10);

asm volatile("tdcall"
             : "=a"(rax)
             : "a"(rax), "r"(r10));

2021-05-19 20:50:23

by Andi Kleen

Subject: Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions


On 5/19/2021 1:09 PM, Sean Christopherson wrote:
> On Wed, May 19, 2021, Kuppuswamy, Sathyanarayanan wrote:
>> On 5/19/21 8:31 AM, Dave Hansen wrote:
>>> Was this "older compiler" argument really the reason?
>> It is a speculation. I haven't tried to reproduce it with old compiler. So
>> I have removed that point.
> It's not "older" compilers. gcc does not support R8-R15 as input/output
> constraints,

Yes, that's true, but they can be listed in the clobbers. So it
usually just needs a mov from the input arguments into the clobbered
register, or a mov from the register into the output arguments.




2021-05-19 21:05:56

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO


On 5/18/2021 10:46 AM, Dave Hansen wrote:
> On 5/18/21 10:21 AM, Andi Kleen wrote:
>> Besides instruction decoding works fine for all the existing
>> hypervisors. All we really want to do is to do the same thing as KVM
>> would do.
> Dumb question of the day: If you want to do the same thing that KVM
> does, why don't you share more code with KVM? Wouldn't you, for
> instance, need to crack the same instruction opcodes?

We're talking about ~60 lines of code that call an established
standard library.

https://github.com/intel/tdx/blob/8c20c364d1f52e432181d142054b1c2efa0ae6d3/arch/x86/kernel/tdx.c#L490

You're proposing a gigantic refactoring to avoid 60 lines of
straightforward code.

That's not a practical proposal.

>
> I'd feel a lot better about this if you said:
>
> Listen, this doesn't work for everything. But, it will run
> every single driver as a TDX guest that KVM can handle as a
> host. So, if the TDX code is broken, so is the KVM host code.

I don't really know what problem you're trying to solve here. We only
have a small number of drivers and we tested them and they work fine.
There are special macros that limit the number of instructions. If there
are ever more instructions and the macros break somehow we'll add them.
There will be a clean error if it ever happens. We're not trying to
solve hypothetical problems here.

-Andi



Subject: Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

Hi Dan,

On 5/17/21 9:08 PM, Dan Williams wrote:
>> SYM_DATA_START_LOCAL(tr_idt)
>> .short 0
>> .quad 0
>> SYM_DATA_END(tr_idt)
> This format implies that tr_idt is reserving space for 2 distinct data
> structure attributes of those sizes, can you just put those names here
> as comments? Otherwise the .fill format is more compact.

Initially it's 6 bytes (2 bytes for the IDT limit, 4 bytes for the
32-bit linear start address). This patch extends it by another 4 bytes
to support 64-bit mode.

2 bytes IDT limit (.short)
8 bytes for 64 bit IDT start address (.quad)

This info is included in commit log. But I will add comment here as you
have mentioned.

Will the following comment do?

/* Use 10 bytes for IDT (in 64 bit mode), 8 bytes for IDT start address
2 bytes for IDT limit size */

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-20 00:41:34

by Dan Williams

Subject: Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

On Wed, May 19, 2021 at 5:19 PM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
> Hi Dan,
>
> On 5/17/21 9:08 PM, Dan Williams wrote:
> >> SYM_DATA_START_LOCAL(tr_idt)
> >> .short 0
> >> .quad 0
> >> SYM_DATA_END(tr_idt)
> > This format implies that tr_idt is reserving space for 2 distinct data
> > structure attributes of those sizes, can you just put those names here
> > as comments? Otherwise the .fill format is more compact.
>
> Initially its 6 bytes (2 bytes for IDT limit, 4 bytes for 32 bit linear
> start address). This patch extends it by another 4 bytes for supporting
> 64 bit mode.
>
> 2 bytes IDT limit (.short)
> 8 bytes for 64 bit IDT start address (.quad)
>
> This info is included in commit log. But I will add comment here as you
> have mentioned.

Thanks. I only read commit logs when code comments fail.

>
> Will following comment log do ?
>
> /* Use 10 bytes for IDT (in 64 bit mode), 8 bytes for IDT start address
> 2 bytes for IDT limit size */

I would clarify how the boot code uses this:

"When a bootloader hands off to the kernel in 32-bit mode an IDT with
a 2-byte limit and 4-byte base is needed. When a boot loader hands off
to a kernel in 64-bit mode, the base address extends to 8 bytes. Reserve
enough space for either scenario."
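The reserved layout under discussion can be sketched as a packed C struct. This is purely illustrative userspace code, not part of the kernel; the struct and message names are made up here:

```c
#include <stdint.h>

/*
 * Illustrative only: the space reserved at tr_idt. A 64-bit IDT
 * descriptor is a 2-byte limit followed by an 8-byte base; the
 * 32-bit handoff uses the same 2-byte limit plus a 4-byte base,
 * which fits inside the same 10-byte reservation.
 */
struct idt_descriptor_64 {
	uint16_t limit;	/* .short 0 */
	uint64_t base;	/* .quad 0  */
} __attribute__((packed));

_Static_assert(sizeof(struct idt_descriptor_64) == 10,
	       "64-bit IDT descriptor is 10 bytes");
```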

Subject: Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode



On 5/19/21 5:40 PM, Dan Williams wrote:
> I would clarify how the boot code uses this:
>
> "When a bootloader hands off to the kernel in 32-bit mode an IDT with
> a 2-byte limit and 4-byte base is needed. When a boot loader hands off
> to a kernel 64-bit mode the base address extends to 8-bytes. Reserve
> enough space for either scenario."

I will add it. Thanks.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

Hi Dave,

On 5/19/21 9:14 AM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> tdx_shared_mask() returns the mask that has to be set in a page
>> table entry to make page shared with VMM.
>
> Here's a rewrite:
>
> Just like MKTME, TDX reassigns bits of the physical address for
> metadata. MKTME used several bits for an encryption KeyID. TDX uses a
> single bit in guests to communicate whether a physical page should be
> protected by TDX as private memory (bit set to 0) or unprotected and
> shared with the VMM (bit set to 1).
>
> Add a helper, tdg_shared_mask() (bad name please fix it) to generate the

Initially we used the tdx_* prefix for the guest code. But when the
host-side code got merged in, we ran into many name conflicts. So to avoid
such issues in the future, we were asked not to use the "tdx_" prefix, and
our alternative choice was "tdg_".

Also, IMO, the "tdg" prefix is more meaningful for guest code (Trusted Domain Guest)
than "tdx" (Trusted Domain eXtensions). I know that it gets confusing
when grepping for TDX-related changes. But since these functions are only used
inside arch/x86, it should not be too confusing.

Even if a rename is requested, IMO, it is easier to do it in one patch than
to make changes in all the patches. So if it is required, we can do it later
once these initial patches are merged.

> mask. The processor enumerates its physical address width to include
> the shared bit, which means it gets included in __PHYSICAL_MASK by default.
>
> Remove the shared mask from 'physical_mask' since any bits in
> tdg_shared_mask() are not used for physical addresses in page table entries.
>
> --

Thanks. I will include it in next version.

>
> BTW, do you find it confusing that the subject says: '__PHYSICAL_MASK'
> and yet the code only modifies 'physical_mask'?
>
>> Also, note that we cannot club shared mapping configuration between
>> AMD SME and Intel TDX Guest platforms in common function. SME has
>> to do it very early in __startup_64() as it sets the bit on all
>> memory, except what is used for communication. TDX can postpone as
>> we don't need any shared mapping in very early boot.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> ---
>> arch/x86/Kconfig | 1 +
>> arch/x86/include/asm/tdx.h | 6 ++++++
>> arch/x86/kernel/tdx.c | 9 +++++++++
>> 3 files changed, 16 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 67f99bf27729..5f92e8205de2 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
>> select PARAVIRT_XL
>> select X86_X2APIC
>> select SECURITY_LOCKDOWN_LSM
>> + select X86_MEM_ENCRYPT_COMMON
>> help
>> Provide support for running in a trusted domain on Intel processors
>> equipped with Trusted Domain eXtenstions. TDX is an new Intel
>> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>> index b972c6531a53..dc80cf7f7d08 100644
>> --- a/arch/x86/include/asm/tdx.h
>> +++ b/arch/x86/include/asm/tdx.h
>> @@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
>> unsigned short tdg_inw(unsigned short port);
>> unsigned int tdg_inl(unsigned short port);
>>
>> +extern phys_addr_t tdg_shared_mask(void);
>> +
>> #else // !CONFIG_INTEL_TDX_GUEST
>>
>> static inline bool is_tdx_guest(void)
>> @@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
>> return -ENODEV;
>> }
>>
>> +static inline phys_addr_t tdg_shared_mask(void)
>> +{
>> + return 0;
>> +}
>> #endif /* CONFIG_INTEL_TDX_GUEST */
>> #endif /* __ASSEMBLY__ */
>> #endif /* _ASM_X86_TDX_H */
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 1f1bb98e1d38..7e391cd7aa2b 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -76,6 +76,12 @@ bool is_tdx_guest(void)
>> }
>> EXPORT_SYMBOL_GPL(is_tdx_guest);
>>
>> +/* The highest bit of a guest physical address is the "sharing" bit */
>> +phys_addr_t tdg_shared_mask(void)
>> +{
>> + return 1ULL << (td_info.gpa_width - 1);
>> +}
>
> Why not just inline this thing? Functions don't get any smaller than
> that. Or does it not get used anywhere else? Or are you concerned
> about exporting td_info?

We don't want to export td_info. It holds more information than just the
shared mask. Any reason for suggesting inline?

This function is only used in following files.

arch/x86/include/asm/pgtable.h:25:#define pgprot_tdg_shared(prot) __pgprot(pgprot_val(prot) |
tdg_shared_mask())
arch/x86/mm/pat/set_memory.c:1997: mem_plain_bits = __pgprot(tdg_shared_mask());
arch/x86/kernel/tdx.c:134:phys_addr_t tdg_shared_mask(void)
arch/x86/kernel/tdx.c:274: physical_mask &= ~tdg_shared_mask();


>
>> static void tdg_get_info(void)
>> {
>> u64 ret;
>> @@ -87,6 +93,9 @@ static void tdg_get_info(void)
>>
>> td_info.gpa_width = out.rcx & GENMASK(5, 0);
>> td_info.attributes = out.rdx;
>> +
>> + /* Exclude Shared bit from the __PHYSICAL_MASK */
>> + physical_mask &= ~tdg_shared_mask();
>> }
>>
>> static __cpuidle void tdg_halt(void)
>>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK



On 5/20/21 11:48 AM, Kuppuswamy, Sathyanarayanan wrote:
> BTW, do you find it confusing that the subject says: '__PHYSICAL_MASK'
> and yet the code only modifies 'physical_mask'?

"physical_mask" is defined as __PHYSICAL_MASK in page_types.h. MM code seems to
use __PHYSICAL_MASK for common usage. But for our use case, if it makes it more
readable, I am fine with using "physical_mask".

arch/x86/include/asm/page_types.h:57:#define __PHYSICAL_MASK physical_mask
arch/x86/mm/pat/memtype.c:560: return address & __PHYSICAL_MASK;

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-21 07:55:35

by Dave Hansen

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On 5/20/21 11:48 AM, Kuppuswamy, Sathyanarayanan wrote:
>>>   +/* The highest bit of a guest physical address is the "sharing"
>>> bit */
>>> +phys_addr_t tdg_shared_mask(void)
>>> +{
>>> +    return 1ULL << (td_info.gpa_width - 1);
>>> +}
>>
>> Why not just inline this thing?  Functions don't get any smaller than
>> that.  Or does it not get used anywhere else?  Or are you concerned
>> about exporting td_info?
>
> We don't want to export td_info. It has more information additional to
> shared mask details. Any reason for suggesting to use inline?

My favorite reason is that it eliminates the need for three declarations:
1. An extern for the header
2. A stub for the header
3. The real function in the .c file.

An inline removes two places that might get out of sync in some way and
eliminates the need to check two implementation sites when grepping.

Not in this case, but in general, inlines also result in faster, more
compact code since the compiler has more visibility into what the
function does at its call sites.

Not wanting to export td_info _is_ a reasonable argument, though.

2021-05-21 07:59:28

by Dave Hansen

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On 5/20/21 1:16 PM, Sean Christopherson wrote:
> On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
>> So what is your proposal? "tdx_guest_" / "tdx_host_" ?
> 1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
> to deal with the shared vs. private inversion and avoid tdg_shared_mask
> altogether.

One example here would be to keep a structure like:

struct protected_mem_config {
	unsigned long p_set_bits;
	unsigned long p_clear_bits;
};

Where 'p_set_bits' are the bits that need to be set to establish memory
protection and 'p_clear_bits' are the bits that need to be cleared.
physical_mask would clear both of them:

physical_mask &= ~(pmc.p_set_bits | pmc.p_clear_bits);

Then, in a place like __set_memory_enc_dec(), you would query whether
memory protection was in place or not:

+ if (protect) {
+ cpa.mask_set = pmc.p_set_bits;
+ cpa.mask_clr = pmc.p_clear_bits;
+ map_type = TDX_MAP_PRIVATE;
+ } else {
+ cpa.mask_set = pmc.p_clear_bits;
+ cpa.mask_clr = pmc.p_set_bits;
+ map_type = TDX_MAP_SHARED;
+ }

The is_tdx_guest() if()'s would just go away.

Basically, if there's a is_tdx_guest() check in common code, it's a
place that might need an abstraction.

This, for instance:

> + if (!ret && is_tdx_guest()) {
> + ret = tdg_map_gpa(__pa(addr), numpages, map_type);
> + }

could probably just be:

if (!ret && is_protected_guest()) {
ret = x86_vmm_protect(__pa(addr), numpages, protected);
}

Subject: Re: [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL



On 5/19/21 8:59 AM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
>> shared memory mappings. Shared GPA mappings can be used for
>> communication between TD guest and host VMM, for example for
>> paravirtualized IO.
>
> As usual, I hate the changelog. This appears to just be regurgitating
> the spec.
>
> Is this just for part of converting an existing mapping between private
> and shared? If so, please say that.
>

How about the following change?

x86/tdx: Add helper to do MapGPA hypercall

The MapGPA hypercall is used by TDX guests to request that the
VMM convert the existing mapping of a given GPA range between
private and shared.

tdx_hcall_gpa_intent() is the wrapper used for making the MapGPA
hypercall.


>> The new helper tdx_map_gpa() provides access to the operation.
>
> <sigh> You got your own name wrong. It's tdg_map_gpa() in the patch.

I can use tdx_hcall_gpa_intent().

>
> BTW, I agree with Sean on this one: "tdg" is a horrible prefix. You
> just proved Sean's point by mistyping it. *EVERYONE* is going to repeat
> that mistake: tdg -> tdx.
>
>> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>> index dc80cf7f7d08..4789798d7737 100644
>> --- a/arch/x86/include/asm/tdx.h
>> +++ b/arch/x86/include/asm/tdx.h
>> @@ -7,6 +7,11 @@
>>
>> #ifndef __ASSEMBLY__
>>
>> +enum tdx_map_type {
>> + TDX_MAP_PRIVATE,
>> + TDX_MAP_SHARED,
>> +};
>
> I like the enum, but please call out that this is a software construct,
> not a part of any hardware or VMM ABI.
>
>> #ifdef CONFIG_INTEL_TDX_GUEST
>>
>> #include <asm/cpufeature.h>
>> @@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
>> unsigned int tdg_inl(unsigned short port);
>>
>> extern phys_addr_t tdg_shared_mask(void);
>> +extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
>> + enum tdx_map_type map_type);
>>
>> #else // !CONFIG_INTEL_TDX_GUEST
>>
>> @@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
>> {
>> return 0;
>> }
>> +
>> +static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
>> + enum tdx_map_type map_type)
>> +{
>> + return -ENODEV;
>> +}
>
> FWIW, you could probably get away with just inlining tdg_map_gpa():
>
> static inline int tdg_map_gpa(phys_addr_t gpa, int numpages, ...
> {
> u64 ret;
>
> if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
> return -ENODEV;
>
> if (map_type == TDX_MAP_SHARED)
> gpa |= tdg_shared_mask();
>
> ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, ...
>
> return ret ? -EIO : 0;
> }
>
> Then you don't have three copies of the function signature that can get
> out of sync.

I agree that this simplifies the function definition. But there are
other TDX hypercall definitions in tdx.c, and I can't move all of them to
the header file. If possible, I would like to group all hypercalls in
the same place.

Also, IMO, it is better to hide hypercall implementation details in the
C file. For example, the user of the MapGPA hypercall does not care about
the TDVMCALL_MAP_GPA leaf ID value. If we inline this function, we have to
move such details to the header file.


>
>> #endif /* CONFIG_INTEL_TDX_GUEST */
>> #endif /* __ASSEMBLY__ */
>> #endif /* _ASM_X86_TDX_H */
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 7e391cd7aa2b..074136473011 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -15,6 +15,8 @@
>> #include "tdx-kvm.c"
>> #endif
>>
>> +#define TDVMCALL_MAP_GPA 0x10001
>> +
>> static struct {
>> unsigned int gpa_width;
>> unsigned long attributes;
>> @@ -98,6 +100,17 @@ static void tdg_get_info(void)
>> physical_mask &= ~tdg_shared_mask();
>> }
>>
>> +int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
>> +{
>> + u64 ret;
>> +
>> + if (map_type == TDX_MAP_SHARED)
>> + gpa |= tdg_shared_mask();
>> +
>> + ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
>> + return ret ? -EIO : 0;
>> +}
>
> The naming Intel chose here is nasty. This doesn't "map" anything. It
> modifies an existing mapping from what I can tell. We could name it
> much better than the spec, perhaps:
>
> tdx_hcall_gpa_intent()

I will use this function name in next version.

>
> BTW, all of these hypercalls need a consistent prefix.

I can include _hcall in the names of the other hypercall helpers as well.

>
> It also needs a comment:
>
> /*
> * Inform the VMM of the guest's intent for this physical page:
> * shared with the VMM or private to the guest. The VMM is
> * expected to change its mapping of the page in response.
> *
> * Note: shared->private conversions require further guest
> * action to accept the page.
> */
>
> The intent here is important. It makes it clear that this function
> really only plays a role in the conversion process.

Thanks. I will include it in next version.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-21 16:55:56

by Sean Christopherson

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> Hi Dave,
>
> On 5/19/21 9:14 AM, Dave Hansen wrote:
> > On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > >
> > > tdx_shared_mask() returns the mask that has to be set in a page
> > > table entry to make page shared with VMM.
> >
> > Here's a rewrite:
> >
> > Just like MKTME, TDX reassigns bits of the physical address for
> > metadata. MKTME used several bits for an encryption KeyID. TDX uses a
> > single bit in guests to communicate whether a physical page should be
> > protected by TDX as private memory (bit set to 0) or unprotected and
> > shared with the VMM (bit set to 1).
> >
> > Add a helper, tdg_shared_mask() (bad name please fix it) to generate the
>
> Initially we have used tdx_* prefix for the guest code. But when the code from
> host side got merged together, we came across many name conflicts.

Whatever the conflicts are, they are by no means an unsolvable problem. I am
more than happy to end up with slightly verbose names in KVM if that's what it
takes to avoid "tdg".

> So to avoid such issues in future, we were asked not to use the "tdx_" prefix
> and our alternative choice was "tdg_".

Who asked you not to use tdx_? More specifically, did that feedback come from a
maintainer (or anyone on-list), or was it an Intel-internal decision?

> Also, IMO, "tdg" prefix is more meaningful for guest code (Trusted Domain Guest)
> compared to "tdx" (Trusted Domain eXtensions). I know that it gets confusing
> when grepping for TDX related changes. But since these functions are only used
> inside arch/x86 it should not be too confusing.
>
> Even if rename is requested, IMO, it is easier to do it in one patch over
> making changes in all the patches. So if it is required, we can do it later
> once these initial patches were merged.

Hell no, we are not merging known bad crud that requires useless churn to get
things right.

2021-05-21 17:15:39

by Dave Hansen

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On 5/20/21 2:18 PM, Sean Christopherson wrote:
> I understand the desire to have a unique prefix, but tdg is _too_ close to
> tdx. I don't want to spend the next N years wondering if tdg is a typo or intended.


Sathya has even mis-typed "tdx" instead of "tdg" in his own
changelogs up to this point. That massively weakens the argument that
"tdg" is a good idea.

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK



On 5/20/21 2:23 PM, Dave Hansen wrote:
> Sathya has even mis-typed "tdx" instead of "tdg" in his own
> changelogs up to this point. That massively weakens the argument that
> "tdg" is a good idea.

It is not a typo. But when we did the initial rename from "tdx_" -> "tdg_",
somehow I missed the change log. That's why I am a bit reluctant
to go for another rename (since we would have to scan change logs, comments
and code in all the patches).

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK



On 5/20/21 12:33 PM, Sean Christopherson wrote:
>> Initially we have used tdx_* prefix for the guest code. But when the code from
>> host side got merged together, we came across many name conflicts.
> Whatever the conflicts are, they are by no means an unsolvable problem. I am
> more than happy to end up with slightly verbose names in KVM if that's what it
> takes to avoid "tdg".
>
>> So to avoid such issues in future, we were asked not to use the "tdx_" prefix
>> and our alternative choice was "tdg_".
> Who asked you not to use tdx_? More specifically, did that feedback come from a
> maintainer (or anyone on-list), or was it an Intel-internal decision?

It was Intel-internal feedback.

>
>> Also, IMO, "tdg" prefix is more meaningful for guest code (Trusted Domain Guest)
>> compared to "tdx" (Trusted Domain eXtensions). I know that it gets confusing
>> when grepping for TDX related changes. But since these functions are only used
>> inside arch/x86 it should not be too confusing.
>>
>> Even if rename is requested, IMO, it is easier to do it in one patch over
>> making changes in all the patches. So if it is required, we can do it later
>> once these initial patches were merged.
> Hell no, we are not merging known bad crud that requires useless churn to get
> things right.

So what is your proposal? "tdx_guest_" / "tdx_host_" ?

If there is supposed be a rename, lets wait till we know about maintainers
feedback as well. If possible I would prefer not to go through another
rename.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/11/21 2:35 AM, Borislav Petkov wrote:
> Preach brother!:)
>
> /me goes and greps mailboxes...
>
> ah, do you mean this, per chance:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> ?
>
> And yes, this has "sev" in the name and dhansen makes sense to me in
> wishing to unify all the protected guest feature queries under a common
> name. And then depending on the vendor, that common name will call the
> respective vendor's helper to answer the protected guest aspect asked
> about.
>
> This way, generic code will call
>
> protected_guest_has()
>
> or so and be nicely abstracted away from the underlying implementation.
>
> Hohumm, yap, sounds nice to me.
>
> Thx.

I see many variants of SEV/SME related checks in the common code path
between TDX and SEV/SME. Can a generic call like
protected_guest_has(MEMORY_ENCRYPTION) or is_protected_guest()
replace all these variants?

We will not be able to test AMD-related features. So I need to confirm
this with AMD code maintainers/developers before making this change.

arch/x86/include/asm/io.h:313: if (sev_key_active() || is_tdx_guest()) { \
arch/x86/include/asm/io.h:329: if (sev_key_active() || is_tdx_guest()) { \
arch/x86/kernel/pci-swiotlb.c:52: if (sme_active() || is_tdx_guest())
arch/x86/mm/ioremap.c:96: if (!sev_active() && !is_tdx_guest())
arch/x86/mm/pat/set_memory.c:1984: if (!mem_encrypt_active() && !is_tdx_guest())

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-21 19:38:47

by Sean Christopherson

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> So what is your proposal? "tdx_guest_" / "tdx_host_" ?

1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
to deal with the shared vs. private inversion and avoid tdg_shared_mask
altogether.

2. Steal what SEV-ES did for the #VC handlers and use ve_ as the prefix for
handlers.

3. Use tdx_ everywhere else and handle the conflicts on a case-by-case basis
with a healthy dose of common sense. E.g. there should be no need to worry
about "static __cpuidle void tdg_safe_halt(void)" colliding because neither
the guest nor KVM should be exposing tdx_safe_halt() outside of its
compilation unit.

2021-05-21 19:41:12

by Andi Kleen

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK


On 5/20/2021 1:16 PM, Sean Christopherson wrote:
> On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
>> So what is your proposal? "tdx_guest_" / "tdx_host_" ?
> 1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
> to deal with the shared vs. private inversion and avoid tdg_shared_mask
> altogether.
>
> 2. Steal what SEV-ES did for the #VC handlers and use ve_ as the prefix for
> handlers.
>
> 3. Use tdx_ everywhere else and handle the conflicts on a case-by-case basis
> with a healthy dose of common sense. E.g. there should be no need to worry
> about "static __cpuidle void tdg_safe_halt(void)" colliding because neither
> the guest nor KVM should be exposing tdx_safe_halt() outside of its
> compilation unit.


Sorry Sean, but your suggestion is against all good code hygiene
practices. Normally we try to pick unique prefixes for every module, and
trying to coordinate with lots of other code that is maintained by other
people is just a long term recipe for annoying merging problems.  Same
with coordinating with SEV-ES for ve_.

Is it really that hard to adjust your grep patterns?

I'm not against changing tdg_, but if it's changed it should be
something unique, and also not too long. Today tdg_ fits that criteria
nicely.


-Andi

2021-05-21 19:59:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On Thu, May 20, 2021, Andi Kleen wrote:
>
> On 5/20/2021 1:16 PM, Sean Christopherson wrote:
> > On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> > > So what is your proposal? "tdx_guest_" / "tdx_host_" ?
> > 1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
> > to deal with the shared vs. private inversion and avoid tdg_shared_mask
> > altogether.
> >
> > 2. Steal what SEV-ES did for the #VC handlers and use ve_ as the prefix for
> > handlers.
> >
> > 3. Use tdx_ everywhere else and handle the conflicts on a case-by-case basis
> > with a healthy dose of common sense. E.g. there should be no need to worry
> > about "static __cpuidle void tdg_safe_halt(void)" colliding because neither
> > the guest nor KVM should be exposing tdx_safe_halt() outside of its
> > compilation unit.
>
>
> Sorry Sean, but your suggestion is against all good code hygiene practices.
> Normally we try to pick unique prefixes for every module, and trying to
> coordinate with lots of other code that is maintained by other people is
> just a long term recipe for annoying merging problems. Same with
> coordinating with SEV-ES for ve_.

For ve_? SEV-ES uses vc_...

I'd buy that argument if series as a whole was consistent, but there are
individual function prototypes that aren't consistent, e.g.

+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)

a number of functions that use tdx_ instead of tdg_ (I'll give y'all a break on
is_tdx_guest()), the files are all tdx.{c,h}, the shortlogs all use x86/tdx, the
comments all use TDX, and so on and so forth.

I understand the desire to have a unique prefix, but tdg is _too_ close to
tdx. I don't want to spend the next N years wondering if tdg is a typo or intended.

2021-05-21 20:01:29

by Kirill A. Shutemov

Subject: Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

On Tue, May 18, 2021 at 06:17:04PM +0000, Sean Christopherson wrote:
> On Tue, May 18, 2021, Andi Kleen wrote:
> > > Why does this code exist at all? TDX and SEV-ES absolutely must share code for
> > > handling MMIO reflection. It will require a fair amount of refactoring to move
> > > the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> > > two separate versions of the opcode cracking.
> >
> > While that's true on the high level, all the low level details are
> > different. We looked at unifying at some point, but it would have been a
> > callback hell. I don't think unifying would make anything cleaner.
>
> How hard did you look? The only part that _must_ be different between SEV and
> TDX is the hypercall itself, which is wholly contained at the very end of
> vc_do_mmio().

I've come up with the code below. decode_mmio() can be shared with SEV.

I don't have a testing setup for AMD. I can do a blind patch, but it would
be much more productive if someone on AMD side could look into this.

Any opinions?

enum mmio_type {
MMIO_DECODE_FAILED,
MMIO_WRITE,
MMIO_WRITE_IMM,
MMIO_READ,
MMIO_READ_ZERO_EXTEND,
MMIO_READ_SIGN_EXTEND,
MMIO_MOVS,
};

static enum mmio_type decode_mmio(struct insn *insn, struct pt_regs *regs,
int *bytes)
{
int type = MMIO_DECODE_FAILED;

*bytes = 0;

switch (insn->opcode.bytes[0]) {
case 0x88: /* MOV m8,r8 */
*bytes = 1;
fallthrough;
case 0x89: /* MOV m16/m32/m64, r16/r32/r64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
type = MMIO_WRITE;
break;

case 0xc6: /* MOV m8, imm8 */
*bytes = 1;
fallthrough;
case 0xc7: /* MOV m16/m32/m64, imm16/imm32/imm64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
type = MMIO_WRITE_IMM;
break;

case 0x8a: /* MOV r8, m8 */
*bytes = 1;
fallthrough;
case 0x8b: /* MOV r16/r32/r64, m16/m32/m64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
type = MMIO_READ;
break;

case 0xa4: /* MOVS m8, m8 */
*bytes = 1;
fallthrough;
case 0xa5: /* MOVS m16/m32/m64, m16/m32/m64 */
if (!*bytes)
*bytes = insn->opnd_bytes;
type = MMIO_MOVS;
break;

case 0x0f: /* Two-byte instruction */
switch (insn->opcode.bytes[1]) {
case 0xb6: /* MOVZX r16/r32/r64, m8 */
*bytes = 1;
fallthrough;
case 0xb7: /* MOVZX r32/r64, m16 */
if (!*bytes)
*bytes = 2;
type = MMIO_READ_ZERO_EXTEND;
break;

case 0xbe: /* MOVSX r16/r32/r64, m8 */
*bytes = 1;
fallthrough;
case 0xbf: /* MOVSX r32/r64, m16 */
if (!*bytes)
*bytes = 2;
type = MMIO_READ_SIGN_EXTEND;
break;
}
break;
}

return type;
}

static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
int size;
unsigned long *reg;
struct insn insn;
unsigned long val = 0;

kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
insn_get_length(&insn);
insn_get_opcode(&insn);

reg = get_reg_ptr(&insn, regs);

switch (decode_mmio(&insn, regs, &size)) {
case MMIO_WRITE:
memcpy(&val, reg, size);
tdg_mmio(size, true, ve->gpa, val);
break;
case MMIO_WRITE_IMM:
val = insn.immediate.value;
tdg_mmio(size, true, ve->gpa, val);
break;
case MMIO_READ:
val = tdg_mmio(size, false, ve->gpa, val);
/* Zero-extend for 32-bit operation */
if (size == 4)
*reg = 0;
memcpy(reg, &val, size);
break;
case MMIO_READ_ZERO_EXTEND:
val = tdg_mmio(size, false, ve->gpa, val);

/* Zero extend based on operand size */
memset(reg, 0, insn.opnd_bytes);
memcpy(reg, &val, size);
break;
case MMIO_READ_SIGN_EXTEND:
val = tdg_mmio(size, false, ve->gpa, val);

/* Sign extend based on operand size */
if (val & (size == 1 ? 0x80 : 0x8000))
memset(reg, 0xff, insn.opnd_bytes);
else
memset(reg, 0, insn.opnd_bytes);
memcpy(reg, &val, size);
break;
case MMIO_MOVS:
case MMIO_DECODE_FAILED:
force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
return 0;
}

return insn.length;
}

--
Kirill A. Shutemov

2021-05-21 20:03:47

by Andi Kleen

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK


On 5/20/2021 2:28 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/20/21 2:23 PM, Dave Hansen wrote:
>> Sathya has even mis-typed "tdx" instead of "tdg" this in his own
>> changelogs up to this point.  That massively weakens the argument that
>> "tdg" is a good idea.
>
> It is not a typo. But when we did the initial rename from "tdx_" ->
> "tdg_", somehow I missed the change log. That's why I am a bit
> reluctant to go for another rename (since we would have to scan change
> logs, comments and code in all the patches).


Yes I agree. If there's another rename it should be after a full review
by all the maintainers. If there is still consensus that a rename is
needed then it can be done then.

And we'll just hope that Sean's brain will get used to tdg_ by then.


-Andi

Subject: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms

From: Sean Christopherson <[email protected]>

Avoid operations which will inject a #VE during the boot process.
They're easy to avoid, and avoiding them is less complex than
handling the exceptions.

There are a few MSRs and control register bits which the
kernel normally needs to modify during boot. But, TDX
disallows modification of these registers to help provide
consistent security guarantees (and avoid generating a #VE
when updating them). Fortunately, TDX ensures that these are
all in the correct state before the kernel loads, which means
the kernel has no need to modify them.

The conditions we need to avoid are:

* Any writes to the EFER MSR
* Clearing CR0.NE
* Clearing CR4.MCE

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix:
* Fixed commit and comments as per Dave and Dan's suggestions.
* Merged CR0.NE related change in pa_trampoline_compat() from patch
titled "x86/boot: Add a trampoline for APs booting in 64-bit mode"
to this patch. It belongs in this patch.
* Merged TRAMPOLINE_32BIT_CODE_SIZE related change from patch titled
"x86/boot: Add a trampoline for APs booting in 64-bit mode" to this
patch (since it was wrongly merged to that patch during patch split).

arch/x86/boot/compressed/head_64.S | 16 ++++++++++++----
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/kernel/head_64.S | 20 ++++++++++++++++++--
arch/x86/realmode/rm/trampoline_64.S | 23 +++++++++++++++++++----
4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..f848569e3fb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,20 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx

/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ movl %cr4, %eax
+ /*
+ * Clear all bits except CR4.MCE, which is preserved.
+ * Clearing CR4.MCE will #VE in TDX guests.
+ */
+ andl $X86_CR4_MCE, %eax
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
@@ -635,8 +643,8 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax

- /* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ /* Enable paging again. Avoid clearing X86_CR0_NE for TDX */
+ movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..6cf8d126b80a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:

/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ movq %cr4, %rcx
+ /*
+ * Clear all bits except CR4.MCE, which is preserved.
+ * Clearing CR4.MCE will #VE in TDX guests.
+ */
+ andl $X86_CR4_MCE, %ecx
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -229,13 +235,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ /*
+ * Preserve current value of EFER for comparison and to skip
+ * EFER writes if no change was made (for TDX guest)
+ */
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */

+ /* Avoid writing EFER if no change was made (for TDX guest) */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 957bb21ce105..cf14d0326a48 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,27 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3

# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ /*
+ * Skip writing to EFER if the register already has desiered
+ * value (to avoid #VE for TDX guest).
+ */
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr

- # Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+ /*
+ * Enable paging and in turn activate Long Mode. Avoid clearing
+ * X86_CR0_NE for TDX.
+ */
+ movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

/*
@@ -169,7 +183,8 @@ SYM_CODE_START(pa_trampoline_compat)
movl $rm_stack_end, %esp
movw $__KERNEL_DS, %dx

- movl $X86_CR0_PE, %eax
+ /* Avoid clearing X86_CR0_NE for TDX */
+ movl $(X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0
ljmpl $__KERNEL32_CS, $pa_startup_32
SYM_CODE_END(pa_trampoline_compat)
--
2.25.1

Subject: [RFC v2-fix-v2 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

From: Sean Christopherson <[email protected]>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX. You can find MADT MP wake protocol details in ACPI specification
r6.4, sec 5.2.12.19.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode. For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix:
* Passed rmh as argument to get_trampoline_start_ip().
* Added a comment line for get_trampoline_start_ip().
* Moved X86_CR0_NE change from pa_trampoline_compat() to patch
"x86/boot: Avoid #VE during boot for TDX platforms".
* Fixed comments for tr_idt as per Dan's comments.
* Moved TRAMPOLINE_32BIT_CODE_SIZE change to "x86/boot: Avoid #VE
during boot for TDX platforms" patch.

arch/x86/include/asm/realmode.h | 11 +++++++
arch/x86/kernel/smpboot.c | 2 +-
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 38 ++++++++++++++++++++++++
arch/x86/realmode/rm/trampoline_common.S | 12 +++++++-
5 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..0f707521b797 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
#endif
#ifdef CONFIG_X86_64
+ u32 trampoline_start64;
u32 trampoline_pgd;
#endif
/* ACPI S3 wakeup */
@@ -88,6 +89,16 @@ static inline void set_real_mode_mem(phys_addr_t mem)
real_mode_header = (struct real_mode_header *) __va(mem);
}

+/* Common helper function to get start IP address */
+static inline unsigned long get_trampoline_start_ip(struct real_mode_header *rmh)
+{
+#ifdef CONFIG_X86_64
+ if (is_tdx_guest())
+ return rmh->trampoline_start64;
+#endif
+ return rmh->trampoline_start;
+}
+
void reserve_real_mode(void);

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..659e8d011fe6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
int *cpu0_nmi_registered)
{
/* start_ip had better be page-aligned! */
- unsigned long start_ip = real_mode_header->trampoline_start;
+ unsigned long start_ip = get_trampoline_start_ip(real_mode_header);

unsigned long boot_error = 0;
unsigned long timeout;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long pa_sev_es_trampoline_start
#endif
#ifdef CONFIG_X86_64
+ .long pa_trampoline_start64
.long pa_trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..957bb21ce105 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
ljmpl $__KERNEL_CS, $pa_startup_64
SYM_CODE_END(startup_32)

+SYM_CODE_START(pa_trampoline_compat)
+ /*
+ * In compatibility mode. Prep ESP and DX for startup_32, then disable
+ * paging and complete the switch to legacy 32-bit mode.
+ */
+ movl $rm_stack_end, %esp
+ movw $__KERNEL_DS, %dx
+
+ movl $X86_CR0_PE, %eax
+ movl %eax, %cr0
+ ljmpl $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
.section ".text64","ax"
.code64
.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
jmpq *tr_start(%rip)
SYM_CODE_END(startup_64)

+SYM_CODE_START(trampoline_start64)
+ /*
+ * APs start here on a direct transfer from 64-bit BIOS with identity
+ * mapped page tables. Load the kernel's GDT in order to gear down to
+ * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+ * segment registers. Load the zero IDT so any fault triggers a
+ * shutdown instead of jumping back into BIOS.
+ */
+ lidt tr_idt(%rip)
+ lgdt tr_gdt64(%rip)
+
+ ljmpl *tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
.quad 0x00cf93000000ffff # __KERNEL_DS
SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)

+SYM_DATA_START(tr_gdt64)
+ .short tr_gdt_end - tr_gdt - 1 # gdt limit
+ .long pa_tr_gdt
+ .long 0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+ .long pa_trampoline_compat
+ .short __KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
.bss
.balign PAGE_SIZE
SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..4331c32c47f8 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0 */
.section ".rodata","a"
.balign 16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode, an
+ * IDT with a 2-byte limit and a 4-byte base is needed. When a
+ * bootloader hands off to the kernel in 64-bit mode, the base
+ * address extends to 8 bytes. Reserve enough space for either scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+ .short 0
+ .quad 0
+SYM_DATA_END(tr_idt)
--
2.25.1
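The two descriptor-pointer layouts the tr_idt comment describes can be sketched with packed C structs (illustrative only, not the kernel's definitions). The reserved `.short` plus `.quad` storage, 10 bytes, is large enough for either form:

```c
#include <stdint.h>

/* 32-bit handoff: LIDT consumes a 2-byte limit plus a 4-byte base. */
struct desc_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* 64-bit handoff: the base address extends to 8 bytes. */
struct desc_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

/* tr_idt reserves 2 + 8 = 10 bytes, covering both scenarios. */
```

This is why the old 6-byte `.fill 1, 6, 0` reservation had to grow once LIDT could also be executed in 64-bit mode.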

2021-05-21 20:21:12

by Dave Hansen

Subject: Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms

> Avoid operations which will inject #VE during boot process.
> They're easy to avoid and it is less complex than handling
> the exceptions.

This puts the solution before the problem. I'd also make sure to
clearly connect this solution to the problem. For instance, if you
refer to register "modification", ensure that you reflect that language
here. Don't call them "modifications" in one part of the changelog and
"operations" here. I'd also qualify them as "superfluous".

Please reorder this in the following form:

1. Background
2. Problem
3. Solution

Please do this for all of your patches.

> There are a few MSRs and control register bits which the
> kernel normally needs to modify during boot. But, TDX
> disallows modification of these registers to help provide
> consistent security guarantees ( and avoid generating #VE
> when updating them).

No, the TDX architecture does not avoid generating #VE. The *kernel*
does that. This sentence conflates those two things.

> Fortunately, TDX ensures that these are
> all in the correct state before the kernel loads, which means
> the kernel has no need to modify them.
>
> The conditions we need to avoid are:
>
> * Any writes to the EFER MSR
> * Clearing CR0.NE
> * Clearing CR4.MCE

Sathya, there have been repeated issues in your changelogs with "we's".
Remember, speak in imperative voice. Please fix this in your tooling
to find these so that reviewers don't have to.

> + /*
> + * Preserve current value of EFER for comparison and to skip
> + * EFER writes if no change was made (for TDX guest)
> + */
> + movl %eax, %edx
> btsl $_EFER_SCE, %eax /* Enable System Call */
> btl $20,%edi /* No Execute supported? */
> jnc 1f
> btsl $_EFER_NX, %eax
> btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1: wrmsr /* Make changes effective */
>
> + /* Avoid writing EFER if no change was made (for TDX guest) */
> +1: cmpl %edx, %eax
> + je 1f
> + xor %edx, %edx
> + wrmsr /* Make changes effective */
> +1:

Just curious, but what if this goes wrong? Say the TDX firmware didn't
set up EFER correctly and this code does the WRMSR. What ends up
happening? Do we get anything out on the console, or is it essentially
undebuggable?

>
> + /*
> + * Skip writing to EFER if the register already has desiered
> + * value (to avoid #VE for TDX guest).
> + */


spelling ^

There are lots of editors that can do spell checking, even in C
comments. You might want to look into that for your editor.

2021-05-21 20:21:35

by Tom Lendacky

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On 5/21/21 10:18 AM, Borislav Petkov wrote:
> On Thu, May 20, 2021 at 01:12:58PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> I see many variants of SEV/SME related checks in the common code path
>> between TDX and SEV/SME. Can a generic call like
>> protected_guest_has(MEMORY_ENCRYPTION) or is_protected_guest()
>> replace all these variants?
>
> It depends...
>
>> We will not be able to test AMD related features. So I need to confirm
>> it with AMD code maintainers/developers before making this change.
>
> Lemme add two to Cc.
>
> So looking at those examples, you guys are making it not very
> suspenseful for TDX - it is the same function in all. :)
>
>> arch/x86/include/asm/io.h:313: if (sev_key_active() || is_tdx_guest()) { \
>> arch/x86/include/asm/io.h:329: if (sev_key_active() || is_tdx_guest()) { \
>
> So I think the static key on the AMD side is not really needed and it
> could be replaced with
>
> sev_active() && !sev_es_active()
>
> i.e. SEV but not SEV-ES. A vendor-agnostic function would do here
> probably something like:
>
> protected_guest_has(ENC_UNROLL_STRING_IO)
>
> and inside it, it would do:
>
> if (AMD)
> amd_protected_guest_has(...)
> else if (Intel)
> intel_protected_guest_has(...)
> else
> WARN()
>
> and both vendors would each implement that function with the respective
> low-level query functions.
>
>> arch/x86/kernel/pci-swiotlb.c:52: if (sme_active() || is_tdx_guest())
>
> That can be probably
>
> protected_guest_has(ENC_HOST_MEM_ENCRYPT);
>
> as on AMD that means SME but not SEV. I guess on Intel you guys want to
> do bounce buffers in the guest? or so...

In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
when SEV support was added), we do:
if (sev_active())
swiotlb_force = SWIOTLB_FORCE;

TDX should be able to do a similar thing without having to touch
arch/x86/kernel/pci-swiotlb.c.

That would remove any confusion over SME being part of a
protected_guest_has() call.

>
>> arch/x86/mm/ioremap.c:96: if (!sev_active() && !is_tdx_guest())
>
> So that function should simply be replaced with:
>
> if (!(desc->flags & IORES_MAP_ENCRYPTED)) {
> /* ... comment bla explaining what this is... */
> if ((sev_active() || is_tdx_guest()) &&
> (res->desc != IORES_DESC_NONE &&
> res->desc != IORES_DESC_RESERVED))
> desc->flags |= IORES_MAP_ENCRYPTED;
> }

I kinda like the separate function, though.

>
> as to the first check I guess:
>
> protected_guest_has(ENC_GUEST_ENABLED)
>
> or so to mean, kernel is running as an encrypted guest...
>
>> arch/x86/mm/pat/set_memory.c:1984: if (!mem_encrypt_active() && !is_tdx_guest())
>
> That should probably be
>
> protected_guest_has(ENC_ACTIVE);
>
> to denote the generic "I'm running some sort of memory encryption..."

Except mem_encrypt_active() covers both SME and SEV, so
protected_guest_has() would be confusing.

Thanks,
Tom

>
> Yeah, this is all rough and should show the main idea - to have a
> vendor-agnostic accessor in such common code paths and then abstract
> away the differences in cpu/amd.c and cpu/intel.c, respectively and thus
> keep the code sane.
>
> How does that sound?
>
> ENC_ being an ENCryption prefix, ofc.
>

2021-05-21 20:22:51

by Borislav Petkov

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On Thu, May 20, 2021 at 01:12:58PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I see many variants of SEV/SME related checks in the common code path
> between TDX and SEV/SME. Can a generic call like
> protected_guest_has(MEMORY_ENCRYPTION) or is_protected_guest()
> replace all these variants?

It depends...

> We will not be able to test AMD related features. So I need to confirm
> it with AMD code maintainers/developers before making this change.

Lemme add two to Cc.

So looking at those examples, you guys are making it not very
suspenseful for TDX - it is the same function in all. :)

> arch/x86/include/asm/io.h:313: if (sev_key_active() || is_tdx_guest()) { \
> arch/x86/include/asm/io.h:329: if (sev_key_active() || is_tdx_guest()) { \

So I think the static key on the AMD side is not really needed and it
could be replaced with

sev_active() && !sev_es_active()

i.e. SEV but not SEV-ES. A vendor-agnostic function would do here
probably something like:

protected_guest_has(ENC_UNROLL_STRING_IO)

and inside it, it would do:

if (AMD)
amd_protected_guest_has(...)
else if (Intel)
intel_protected_guest_has(...)
else
WARN()

and both vendors would each implement that function with the respective
low-level query functions.
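
The dispatch sketched above could look roughly like this in C. Every name here (the ENC_ flags, the per-vendor helpers, the vendor selector) is hypothetical, taken from the rough sketch in this mail rather than from any existing kernel API, and the per-vendor queries are stubbed:

```c
#include <stdbool.h>
#include <stdio.h>

/* ENC_ being an ENCryption prefix, per the sketch above. */
enum enc_feature {
	ENC_UNROLL_STRING_IO,
	ENC_HOST_MEM_ENCRYPT,
	ENC_GUEST_ENABLED,
	ENC_ACTIVE,
};

enum cpu_vendor { VENDOR_AMD, VENDOR_INTEL, VENDOR_OTHER };
static enum cpu_vendor boot_cpu_vendor = VENDOR_INTEL; /* assumed */

/* Stubbed low-level queries; the real ones would consult
 * sev_active()/sev_es_active() on AMD and is_tdx_guest() on Intel. */
static bool amd_protected_guest_has(enum enc_feature f)
{
	return f == ENC_UNROLL_STRING_IO; /* SEV but not SEV-ES */
}

static bool intel_protected_guest_has(enum enc_feature f)
{
	(void)f;
	return true; /* TDX: same answer for every flag in these examples */
}

static bool protected_guest_has(enum enc_feature f)
{
	switch (boot_cpu_vendor) {
	case VENDOR_AMD:
		return amd_protected_guest_has(f);
	case VENDOR_INTEL:
		return intel_protected_guest_has(f);
	default:
		fprintf(stderr, "WARN: unknown vendor\n");
		return false;
	}
}
```

Common code paths would then call protected_guest_has(ENC_...) instead of open-coding sev_active() || is_tdx_guest() variants, keeping the vendor differences inside cpu/amd.c and cpu/intel.c.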

> arch/x86/kernel/pci-swiotlb.c:52: if (sme_active() || is_tdx_guest())

That can be probably

protected_guest_has(ENC_HOST_MEM_ENCRYPT);

as on AMD that means SME but not SEV. I guess on Intel you guys want to
do bounce buffers in the guest? or so...

> arch/x86/mm/ioremap.c:96: if (!sev_active() && !is_tdx_guest())

So that function should simply be replaced with:

if (!(desc->flags & IORES_MAP_ENCRYPTED)) {
/* ... comment bla explaining what this is... */
if ((sev_active() || is_tdx_guest()) &&
(res->desc != IORES_DESC_NONE &&
res->desc != IORES_DESC_RESERVED))
desc->flags |= IORES_MAP_ENCRYPTED;
}

as to the first check I guess:

protected_guest_has(ENC_GUEST_ENABLED)

or so to mean, kernel is running as an encrypted guest...

> arch/x86/mm/pat/set_memory.c:1984: if (!mem_encrypt_active() && !is_tdx_guest())

That should probably be

protected_guest_has(ENC_ACTIVE);

to denote the generic "I'm running some sort of memory encryption..."

Yeah, this is all rough and should show the main idea - to have a
vendor-agnostic accessor in such common code paths and then abstract
away the differences in cpu/amd.c and cpu/intel.c, respectively and thus
keep the code sane.

How does that sound?

ENC_ being an ENCryption prefix, ofc.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest

Hi Dave,

On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov"<[email protected]>
>
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the kernel:
>
>  * Specific instructions (WBINVD, for example)
>  * Specific MSR accesses
>  * Specific CPUID leaf accesses
>  * Access to TD-shared memory, which includes MMIO
>
> In the settings that Linux will run in, virtual exceptions are never
> generated on accesses to normal, TD-private memory that has been
> accepted.
>
> The entry paths do not access TD-shared memory, MMIO regions or use
> those specific MSRs, instructions, CPUID leaves that might generate #VE.
> In addition, all interrupts including NMIs are blocked by the hardware
> starting with #VE delivery until TDGETVEINFO is called.  This eliminates
> the chance of a #VE during the syscall gap or paranoid entry paths and
> simplifies #VE handling.
>
> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> although we don't expect it to happen because we don't expect NMIs to
> trigger #VEs. Another case where they could happen is if the #VE
> exception panics, but in this case there are no guarantees on anything
> anyways.
>
> If a guest kernel action which would normally cause a #VE occurs in the
> interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
> guest which will result in an oops (and should eventually be a panic, as
> we would like to set panic_on_oops to 1 for TDX guests).
>
> Add basic infrastructure to handle any #VE which occurs in the kernel or
> userspace.  Later patches will add handling for specific #VE scenarios.
>
> Convert unhandled #VE's (everything, until later in this series) so that
> they appear just like a #GP by calling ve_raise_fault() directly.
> ve_raise_fault() is similar to #GP handler and is responsible for
> sending SIGSEGV to userspace and cpu die and notifying debuggers and
> other die chain users.
>
> Co-developed-by: Sean Christopherson<[email protected]>
> Signed-off-by: Sean Christopherson<[email protected]>
> Signed-off-by: Kirill A. Shutemov<[email protected]>
> Reviewed-by: Andi Kleen<[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> ---

You have any other comments on this patch? If not, can you reply with your
Reviewed-by tag?

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-21 20:28:01

by Dan Williams

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest

On Tue, May 18, 2021 at 8:45 AM Andi Kleen <[email protected]> wrote:
>
>
> On 5/18/2021 8:11 AM, Dave Hansen wrote:
> > On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> >> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> >> although we don't expect it to happen because we don't expect NMIs to
> >> trigger #VEs. Another case where they could happen is if the #VE
> >> exception panics, but in this case there are no guarantees on anything
> >> anyways.
> > This implies: "we do not expect any NMI to do MMIO". Is that true? Why?
>
> Only drivers that are not supported in TDX anyways could do it (mainly
> watchdog drivers)

What about apei_{read,write}() for ACPI error handling? Those are
called in NMI to do MMIO accesses. It's not just watchdog drivers.

2021-05-21 20:28:17

by Sean Christopherson

Subject: Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms

On Fri, May 21, 2021, Dave Hansen wrote:
> > + /*
> > + * Preserve current value of EFER for comparison and to skip
> > + * EFER writes if no change was made (for TDX guest)
> > + */
> > + movl %eax, %edx
> > btsl $_EFER_SCE, %eax /* Enable System Call */
> > btl $20,%edi /* No Execute supported? */
> > jnc 1f
> > btsl $_EFER_NX, %eax
> > btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
> > -1: wrmsr /* Make changes effective */
> >
> > + /* Avoid writing EFER if no change was made (for TDX guest) */
> > +1: cmpl %edx, %eax
> > + je 1f
> > + xor %edx, %edx
> > + wrmsr /* Make changes effective */
> > +1:
>
> Just curious, but what if this goes wrong? Say the TDX firmware didn't
> set up EFER correctly and this code does the WRMSR.

By firmware, do you mean TDX-module, or guest firmware? EFER is read-only in a
TDX guest, i.e. the guest firmware can't change it either.

> What ends up happening? Do we get anything out on the console, or is it
> essentially undebuggable?

Assuming "firmware" means TDX-module, if TDX-Module botches EFER (and only EFER)
then odds are very, very good that the guest will never get to the kernel as it
will have died long before in guest BIOS.

If the bug is such that EFER is correct in hardware, but RDMSR returns the wrong
value (due to MSR interception), IIRC this will triple fault and so nothing will
get logged. But, the odds of that type of bug being hit in production are
practically zero because the EFER setup is very static, i.e. any such bug should
be hit during qualification of the VMM+TDX-Module.

In any case, even if a bug escapes, the shutdown is relatively easy to debug even
without logs because the failure will clearly point at the WRMSR (that info can be
had by running a debug TD or a debug TDX-Module). By TDX standards, debugging
shutdowns on a specific instruction is downright trivial :-).

2021-05-21 20:28:21

by Dave Hansen

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest

On 5/21/21 11:45 AM, Kuppuswamy, Sathyanarayanan wrote:
> You have any other comments on this patch? If not, can you reply with your
> Reviewed-by tag?

Sathya, I've been rather busy with your own patches and your colleagues'
TDX patches. I've clearly communicated to you which patches I plan to
provide a review for. I'll get to them, although not quite at the speed
you would like.

If you would like to get a quicker review, I'd highly suggest you go
find some of your TDX colleagues' code that needs its quality improved
and help by providing them reviews. Reviews are a two-way street, not
just a service provided by maintainers to contributors.

You could also make good use of your time by going back over all of the
review comments I've made up to this point and doing a pass over your
work to ensure that I don't have to continue to repeat myself and waste
review efforts. You could add a spell checker to your workflow, or
scripting to check for language conventions like avoiding "us" and "we".
You could also seek out help to raise the quality of your
communications. It isn't just reviewers that can help raise the quality
of your contributions.

Subject: Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms



On 5/21/21 9:11 AM, Dave Hansen wrote:
>> Avoid operations which will inject #VE during boot process.
>> They're easy to avoid and it is less complex than handling
>> the exceptions.
>
> This puts the solution before the problem. I'd also make sure to
> clearly connect this solution to the problem. For instance, if you
> refer to register "modification", ensure that you reflect that language
> here. Don't call them "modifications" in one part of the changelog and
> "operations" here. I'd also qualify them as "superfluous".
>
> Please reorder this in the following form:
>
> 1. Background
> 2. Problem
> 3. Solution
>
> Please do this for all of your patches.
>
>> There are a few MSRs and control register bits which the
>> kernel normally needs to modify during boot. But, TDX
>> disallows modification of these registers to help provide
>> consistent security guarantees ( and avoid generating #VE
>> when updating them).
>
> No, the TDX architecture does not avoid generating #VE. The *kernel*
> does that. This sentence conflates those two things.
>
>> Fortunately, TDX ensures that these are
>> all in the correct state before the kernel loads, which means
>> the kernel has no need to modify them.
>>
>> The conditions we need to avoid are:
>>
>> * Any writes to the EFER MSR
>> * Clearing CR0.NE
>> * Clearing CR4.MCE
>
> Sathya, there have been repeated issues in your changelogs with "we's".
> Remember, speak in imperative voice. Please fix this in your tooling
> to find these so that reviewers don't have to.

How about the following commit log?

Virtualization Exceptions (#VE) are delivered to TDX guests due
to specific guest actions like MSR writes, CPUID leaf accesses
or I/O accesses. But in early boot code, #VE
cannot be allowed because the required exception handler setup
support code is missing. If #VE is triggered without proper
handler support, it would lead to triple fault or kernel hang.
So, avoid operations which will inject #VE during boot process.
They're easy to avoid and it is less complex than handling the
exceptions.

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are all
in the correct state before the kernel loads, which means the
kernel has no need to modify them.

The conditions to avoid are:

* Any writes to the EFER MSR
* Clearing CR0.NE
* Clearing CR4.MCE

If the above conditions are not avoided, it would lead to a triple
fault or kernel hang.

>
>> + /*
>> + * Preserve current value of EFER for comparison and to skip
>> + * EFER writes if no change was made (for TDX guest)
>> + */
>> + movl %eax, %edx
>> btsl $_EFER_SCE, %eax /* Enable System Call */
>> btl $20,%edi /* No Execute supported? */
>> jnc 1f
>> btsl $_EFER_NX, %eax
>> btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
>> -1: wrmsr /* Make changes effective */
>>
>> + /* Avoid writing EFER if no change was made (for TDX guest) */
>> +1: cmpl %edx, %eax
>> + je 1f
>> + xor %edx, %edx
>> + wrmsr /* Make changes effective */
>> +1:
>
> Just curious, but what if this goes wrong? Say the TDX firmware didn't
> set up EFER correctly and this code does the WRMSR. What ends up
> happening?

It would lead to triple fault.

> Do we get anything out on the console, or is it essentially
> undebuggable?
>

We can still get logs with a debug TDX module. So it is still debuggable.

>>
>> + /*
>> + * Skip writing to EFER if the register already has desiered
>> + * value (to avoid #VE for TDX guest).
>> + */
>
>
> spelling ^
>
> There are lots of editors that can do spell checking, even in C
> comments. You might want to look into that for your editor.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest



On 5/21/21 12:15 PM, Dave Hansen wrote:
> On 5/21/21 11:45 AM, Kuppuswamy, Sathyanarayanan wrote:
>> You have any other comments on this patch? If not, can you reply with your
>> Reviewed-by tag?
>
> Sathya, I've been rather busy with your own patches and your colleagues'
> TDX patches. I've clearly communicated to you which patches I plan to
> provide a review for. I'll get to them, although not quite at the speed
> you would like.
>

My impression so far is, for TDX patch submissions, you usually reply to
the patch submission/comments in 1-2 days (sorry if this assumption is
incorrect). Since I did not see any major objections for this patch, I
was just checking with you to understand if this patch review is pending
due to something missing from my end. My intention was not to rush you,
but just to understand if it needs some work from my end.

Sorry if the reminder emails trouble you. Since we are aiming for v5.14
merge window, I am trying to avoid any delays from my end.

> If you would like to get a quicker review, I'd highly suggest you go
> find some of your TDX colleagues' code that needs its quality improved
> and help by providing them reviews. Reviews are a two-way street, not
> just a service provided by maintainers to contributors.
>
> You could also make good use of your time by going back over all of the
> review comments I've made up to this point and doing a pass over your
> work to ensure that I don't have to continue to repeat myself and waste
> review efforts.

I have considered your comments and fixed the common issues reported by
you in this patch-set. But when addressing recent comments and while
updating the commit log, some of these issues got introduced again. I will
try to avoid them in future.



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-21 20:41:46

by Dave Hansen

Subject: Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms

On 5/21/21 11:18 AM, Sean Christopherson wrote:
> On Fri, May 21, 2021, Dave Hansen wrote:
>>> + /*
>>> + * Preserve current value of EFER for comparison and to skip
>>> + * EFER writes if no change was made (for TDX guest)
>>> + */
>>> + movl %eax, %edx
>>> btsl $_EFER_SCE, %eax /* Enable System Call */
>>> btl $20,%edi /* No Execute supported? */
>>> jnc 1f
>>> btsl $_EFER_NX, %eax
>>> btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
>>> -1: wrmsr /* Make changes effective */
>>>
>>> + /* Avoid writing EFER if no change was made (for TDX guest) */
>>> +1: cmpl %edx, %eax
>>> + je 1f
>>> + xor %edx, %edx
>>> + wrmsr /* Make changes effective */
>>> +1:
>>
>> Just curious, but what if this goes wrong? Say the TDX firmware didn't
>> set up EFER correctly and this code does the WRMSR.
>
> By firmware, do you mean TDX-module, or guest firmware? EFER is read-only in a
> TDX guest, i.e. the guest firmware can't change it either.

I guess I was assuming that the trusted BIOS was going to do the setup
of EFER before it hands control over to the kernel. So, I *meant* the BIOS.

But, I see from below that it's probably the TDX-module that's
responsible for this behavior.

>> What ends up happening? Do we get anything out on the console, or is it
>> essentially undebuggable?
>
> Assuming "firmware" means TDX-module, if TDX-Module botches EFER (and only EFER)
> then odds are very, very good that the guest will never get to the kernel as it
> will have died long before in guest BIOS.
>
> If the bug is such that EFER is correct in hardware, but RDMSR returns the wrong
> value (due to MSR interception), IIRC this will triple fault and so nothing will
> get logged. But, the odds of that type of bug being hit in production are
> practically zero because the EFER setup is very static, i.e. any such bug should
> be hit during qualification of the VMM+TDX-Module.
>
> In any case, even if a bug escapes, the shutdown is relatively easy to debug even
> without logs because the failure will cleary point at the WRMSR (that info can be
> had by running a debug TD or a debug TDX-Module). By TDX standards, debugging
> shutdowns on a specific instruction is downright trivial :-).

That sounds sane to me. It would be nice to get this into the
changelog. Perhaps:

This theoretically makes guest boot more fragile. If, for
instance, EFER was set up incorrectly and a WRMSR was performed,
the resulting (unhandled) #VE would triple fault. However, this
is likely to trip up the guest BIOS long before control reaches
the kernel. In any case, these kinds of problems are unlikely
to occur in production environments, and developers have good
debug tools to fix them quickly.

That would put my mind at ease a bit.

Subject: Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms



On 5/21/21 11:30 AM, Dave Hansen wrote:
> That sounds sane to me. It would be nice to get this into the
> changelog. Perhaps:
>
> This theoretically makes guest boot more fragile. If, for
> instance, EFER was set up incorrectly and a WRMSR was performed,
> the resulting (unhandled) #VE would triple fault. However, this
> is likely to trip up the guest BIOS long before control reaches
> the kernel. In any case, these kinds of problems are unlikely
> to occur in production environments, and developers have good
> debug tools to fix them quickly.
>
> That would put my mind at ease a bit.

I can add it to change log.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-21 21:15:30

by Tom Lendacky

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/21/21 1:49 PM, Borislav Petkov wrote:
> On Fri, May 21, 2021 at 11:19:15AM -0500, Tom Lendacky wrote:
>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>> when SEV support was added), we do:
>> if (sev_active())
>> swiotlb_force = SWIOTLB_FORCE;
>>
>> TDX should be able to do a similar thing without having to touch
>> arch/x86/kernel/pci-swiotlb.c.
>>
>> That would remove any confusion over SME being part of a
>> protected_guest_has() call.
>
> Even better.
>
>> I kinda like the separate function, though.
>
> Only if you clean it up and get rid of the inverted logic and drop that
> silly switch-case.
>
>> Except mem_encrypt_active() covers both SME and SEV, so
>> protected_guest_has() would be confusing.
>
> I don't understand - the AMD-specific function amd_protected_guest_has()
> would return sme_me_mask just like mem_encrypt_active() does and we can
> get rid of latter.
>
> Or do you have a problem with the name protected_guest_has() containing
> "guest" while we're talking about SME here?

The latter.

>
> If so, feel free to suggest a better one - the name does not have to
> have "guest" in it.

Let me see if I can come up with something that will make sense.

Thanks,
Tom

>
> Thx.
>
>

2021-05-21 21:27:09

by Dan Williams

Subject: Re: [RFC v2-fix-v2 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode

On Fri, May 21, 2021 at 7:40 AM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX. You can find MADT MP wake protocol details in ACPI specification
> r6.4, sec 5.2.12.19.
>
> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode. For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since RFC v2-fix:
> * Passed rmh as argument to get_trampoline_start_ip().
> * Added a comment line for get_trampoline_start_ip().
> * Moved X86_CR0_NE change from pa_trampoline_compat() to patch
> "x86/boot: Avoid #VE during boot for TDX platforms".
> * Fixed comments for tr_idt as per Dan's comments.
> * Moved TRAMPOLINE_32BIT_CODE_SIZE change to "x86/boot: Avoid #VE
> during boot for TDX platforms" patch.

Thanks, looks good, no more comments from me:

Reviewed-by: Dan Williams <[email protected]>

2021-05-21 21:27:48

by Borislav Petkov

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On Fri, May 21, 2021 at 11:19:15AM -0500, Tom Lendacky wrote:
> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
> when SEV support was added), we do:
> if (sev_active())
> swiotlb_force = SWIOTLB_FORCE;
>
> TDX should be able to do a similar thing without having to touch
> arch/x86/kernel/pci-swiotlb.c.
>
> That would remove any confusion over SME being part of a
> protected_guest_has() call.

Even better.

> I kinda like the separate function, though.

Only if you clean it up and get rid of the inverted logic and drop that
silly switch-case.

> Except mem_encrypt_active() covers both SME and SEV, so
> protected_guest_has() would be confusing.

I don't understand - the AMD-specific function amd_protected_guest_has()
would return sme_me_mask just like mem_encrypt_active() does and we can
get rid of latter.

Or do you have a problem with the name protected_guest_has() containing
"guest" while we're talking about SME here?

If so, feel free to suggest a better one - the name does not have to
have "guest" in it.

Thx.


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-24 14:03:22

by Andi Kleen

Subject: Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest


>> Only drivers that are not supported in TDX anyways could do it (mainly
>> watchdog drivers)
> What about apei_{read,write}() for ACPI error handling? Those are
> called in NMI to do MMIO accesses. It's not just watchdog drivers.

We expect the APEI stuff to be filtered in the normal case to reduce the
attack surface. There's no use case for APEI error reporting in a
normally operating TDX guest.

But yes, that's why I wrote "mainly". It should work in any case; we
fully support #VE nesting after TDVEREPORT.

-Andi

Subject: [RFC v2-fix-v3 1/1] x86/boot: Avoid #VE during boot for TDX platforms

From: Sean Christopherson <[email protected]>

In TDX guests, a Virtualization Exception (#VE) is delivered in
response to specific guest actions such as MSR writes, certain
CPUID leaf accesses, or I/O accesses. But #VE must not be
triggered in early boot code, before the exception handler setup
code has run; an unhandled #VE leads to a triple fault or kernel
hang. So, avoid the operations that would inject #VE during the
boot process. They're easy to avoid, and avoiding them is less
complex than handling the exceptions.

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are all
in the correct state before the kernel loads, which means the
kernel has no need to modify them.

The conditions to avoid are:

  * Any writes to the EFER MSR
  * Clearing CR0.NE
  * Clearing CR4.MCE

This theoretically makes guest boot more fragile. If, for
instance, EFER was set up incorrectly and a WRMSR was performed,
the resulting (unhandled) #VE would triple fault. However, this
is likely to trip up the guest BIOS long before control reaches
the kernel. In any case, these kinds of problems are unlikely to
occur in production environments, and developers have good debug
tools to fix them quickly. 

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since RFC v2-fix-v2:
* Fixed commit log as per review comments.

Changes since RFC v2-fix:
* Fixed commit and comments as per Dave and Dan's suggestions.
* Merged CR0.NE related change in pa_trampoline_compat() from patch
titled "x86/boot: Add a trampoline for APs booting in 64-bit mode"
to this patch. It belongs in this patch.
* Merged TRAMPOLINE_32BIT_CODE_SIZE related change from patch titled
"x86/boot: Add a trampoline for APs booting in 64-bit mode" to this
patch (since it was wrongly merged to that patch during patch split).

arch/x86/boot/compressed/head_64.S | 16 ++++++++++++----
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/kernel/head_64.S | 20 ++++++++++++++++++--
arch/x86/realmode/rm/trampoline_64.S | 23 +++++++++++++++++++----
4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..f848569e3fb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,20 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx

/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ movl %cr4, %eax
+ /*
+ * Clear all bits except CR4.MCE, which is preserved.
+ * Clearing CR4.MCE will #VE in TDX guests.
+ */
+ andl $X86_CR4_MCE, %eax
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
@@ -635,8 +643,8 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax

- /* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ /* Enable paging again. Avoid clearing X86_CR0_NE for TDX */
+ movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..6cf8d126b80a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:

/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ movq %cr4, %rcx
+ /*
+ * Clear all bits except CR4.MCE, which is preserved.
+ * Clearing CR4.MCE will #VE in TDX guests.
+ */
+ andl $X86_CR4_MCE, %ecx
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -229,13 +235,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ /*
+ * Preserve current value of EFER for comparison and to skip
+ * EFER writes if no change was made (for TDX guest)
+ */
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */

+ /* Avoid writing EFER if no change was made (for TDX guest) */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 957bb21ce105..cf14d0326a48 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,27 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3

# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ /*
+ * Skip writing to EFER if the register already has the desired
+ * value (to avoid #VE for TDX guest).
+ */
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr

- # Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+ /*
+ * Enable paging and in turn activate Long Mode. Avoid clearing
+ * X86_CR0_NE for TDX.
+ */
+ movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

/*
@@ -169,7 +183,8 @@ SYM_CODE_START(pa_trampoline_compat)
movl $rm_stack_end, %esp
movw $__KERNEL_DS, %dx

- movl $X86_CR0_PE, %eax
+ /* Avoid clearing X86_CR0_NE for TDX */
+ movl $(X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0
ljmpl $__KERNEL32_CS, $pa_startup_32
SYM_CODE_END(pa_trampoline_compat)
--
2.25.1
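The EFER handling in the patch above boils down to: read the current value, compute the desired bits, and only issue WRMSR when something actually changed. Below is a minimal userspace C sketch of that decision, not kernel code; the bit positions for SCE (bit 0) and NX (bit 11) come from the x86 architecture, while the function names and the NX-support flag are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFER_SCE (1ull << 0)   /* System Call Extensions (_EFER_SCE) */
#define EFER_NX  (1ull << 11)  /* No-Execute enable (_EFER_NX) */

/* Compute the EFER value the kernel wants after early boot setup. */
static uint64_t efer_desired(uint64_t efer, bool nx_supported)
{
	efer |= EFER_SCE;
	if (nx_supported)
		efer |= EFER_NX;
	return efer;
}

/*
 * In a TDX guest the TDX module already set these bits, so the
 * comparison fails and the WRMSR (which would raise a #VE) is
 * skipped entirely.
 */
static bool efer_write_needed(uint64_t efer, bool nx_supported)
{
	return efer_desired(efer, nx_supported) != efer;
}
```

On legacy platforms EFER starts with the bits clear, so the write happens exactly as before; the skip path only fires when firmware already configured EFER correctly.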

Subject: [RFC v2-fix-v2 1/1] x86/tdx: ioapic: Add shared bit for IOAPIC base address

From: Isaku Yamahata <[email protected]>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as
shared. However, the IOAPIC code does not use ioremap() and instead
uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Fixed commit log and comment as per review comments.

arch/x86/kernel/apic/io_apic.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..810fc58e3c42 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2675,6 +2675,18 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}

+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+ phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ /* Set TDX guest shared bit in pgprot flags */
+ if (is_tdx_guest())
+ flags = pgprot_tdg_shared(flags);
+
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2719,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2836,7 +2848,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;

- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.25.1
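The fixmap helper in the patch above amounts to ORing a "shared" bit into the protection flags only when running as a TDX guest. Here is a toy userspace C model of that decision; note that the real shared bit is a platform-defined GPA bit, so bit 51 below is purely illustrative, as are the stand-in constants:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROT_NOCACHE   (1ull << 4)   /* stand-in for FIXMAP_PAGE_NOCACHE */
#define TDG_SHARED_BIT (1ull << 51)  /* illustrative position only */

/*
 * Model of io_apic_set_fixmap_nocache(): start from the normal
 * uncached fixmap protection and add the TDX shared bit so the
 * host can emulate accesses to the IOAPIC MMIO page.
 */
static uint64_t ioapic_prot(bool is_tdx_guest)
{
	uint64_t prot = PROT_NOCACHE;

	if (is_tdx_guest)
		prot |= TDG_SHARED_BIT;
	return prot;
}
```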

Subject: [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

You can find the list of unsupported instructions in Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.

To prevent TD guests from using MWAIT/MONITOR instructions,
the CPUID flags for these instructions are already disabled
by the TDX module.

Despite this, if a TD guest still executes one of these
instructions, print a one-time warning (WARN_ONCE()) from the #VE
handler and otherwise ignore the instruction. This matches KVM,
which also treats MWAIT/MONITOR as nops with a one-time warning
on unsupported platforms.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
---

Changes since RFC v2:
* Moved WBINVD related changes to a new patch.
* Fixed commit log as per review comments.

arch/x86/kernel/tdx.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3e961fdfdae0..3800c7cbace3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -511,6 +511,14 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+ case EXIT_REASON_MONITOR_INSTRUCTION:
+ case EXIT_REASON_MWAIT_INSTRUCTION:
+ /*
+ * Something in the kernel used MONITOR or MWAIT despite
+ * X86_FEATURE_MWAIT being cleared for TDX guests.
+ */
+ WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EFAULT;
--
2.25.1

Subject: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

Functionally only DMA devices can notice a side effect from
WBINVD's cache flushing. But, TDX does not support DMA,
because DMA typically needs uncached access for MMIO, and
the current TDX module always sets the IgnorePAT bit, which
prevents that.

So handle the WBINVD instruction as a nop. No warning is printed
for WBINVD because the ACPI reboot code uses it. This is the same
behavior as KVM, which only emulates WBINVD when the guest has
non-coherent DMA (VT-d) attached, and otherwise handles it as a
nop.

If TDX ever gains DMA support, a hypercall will be added to
implement WBINVD, similar to AMD SEV. But current TDX does not
support direct DMA.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Fixed commit log as per review comments.
* Removed WARN_ONCE for WBINVD #VE support.

arch/x86/kernel/tdx.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3800c7cbace3..21dec5bfc88e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -511,6 +511,12 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+ case EXIT_REASON_WBINVD:
+ /*
+ * Non-coherent DMA is not supported in a TDX guest,
+ * so ignore WBINVD and treat it as a nop.
+ */
+ break;
case EXIT_REASON_MONITOR_INSTRUCTION:
case EXIT_REASON_MWAIT_INSTRUCTION:
/*
--
2.25.1
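Taken together, the two patches above give the #VE handler a dispatch that nops WBINVD silently and nops MWAIT/MONITOR with a one-time warning, and fails on anything unexpected. A self-contained userspace C model of that dispatch follows; the enum values and names are stand-ins, not the real VMX exit-reason numbers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for the VMX exit-reason constants. */
enum exit_reason {
	EXIT_REASON_WBINVD,
	EXIT_REASON_MONITOR_INSTRUCTION,
	EXIT_REASON_MWAIT_INSTRUCTION,
	EXIT_REASON_OTHER,
};

/*
 * Model of the #VE dispatch: WBINVD is silently ignored (no
 * non-coherent DMA in a TDX guest), MWAIT/MONITOR are ignored
 * with a one-time warning (modeling WARN_ONCE()), and any other
 * exit reason is an error (-EFAULT in the real handler).
 */
static int handle_ve(enum exit_reason reason, bool *warned)
{
	switch (reason) {
	case EXIT_REASON_WBINVD:
		return 0;
	case EXIT_REASON_MONITOR_INSTRUCTION:
	case EXIT_REASON_MWAIT_INSTRUCTION:
		if (!*warned) {
			fprintf(stderr,
				"TD Guest used unsupported MWAIT/MONITOR\n");
			*warned = true;
		}
		return 0;
	default:
		return -1;
	}
}
```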

2021-05-24 23:42:03

by Dan Williams

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021 at 4:32 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> Functionally only DMA devices can notice a side effect from
> WBINVD's cache flushing. But, TDX does not support DMA,
> because DMA typically needs uncached access for MMIO, and
> the current TDX module always sets the IgnorePAT bit, which
> prevents that.

I thought we discussed that there are other considerations for wbinvd
besides DMA? In any event this paragraph is actively misleading
because it disregards ACPI and Persistent Memory secure-erase whose
usages of wbinvd have nothing to do with DMA. I would much prefer a
patch to shutdown all the known wbinvd users as a precursor to this
patch rather than assuming it's ok to simply ignore it. You have
mentioned that TDX does not need to use those paths, but rather than
assume they can't be used why not do the audit to explicitly disable
them? Otherwise this statement seems to imply that the audit has not
been done.

2021-05-24 23:46:52

by Dave Hansen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On 5/24/21 4:32 PM, Kuppuswamy Sathyanarayanan wrote:
> Functionally only DMA devices can notice a side effect from
> WBINVD's cache flushing.

This seems to be trying to make some kind of case that the only visible
effects from WBINVD are for DMA devices. That's flat out wrong. It
might be arguable that none of the other cases exist in a TDX guest, but
it doesn't excuse making such a broad statement without qualification.

Just grep in the kernel for a bunch of reasons this is wrong.

Where did this come from?

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest



On 5/24/21 4:39 PM, Dan Williams wrote:
>> Functionally only DMA devices can notice a side effect from
>> WBINVD's cache flushing. But, TDX does not support DMA,
>> because DMA typically needs uncached access for MMIO, and
>> the current TDX module always sets the IgnorePAT bit, which
>> prevents that.

> I thought we discussed that there are other considerations for wbinvd
> besides DMA? In any event this paragraph is actively misleading
> because it disregards ACPI and Persistent Memory secure-erase whose
> usages of wbinvd have nothing to do with DMA. I would much prefer a
> patch to shutdown all the known wbinvd users as a precursor to this
> patch rather than assuming it's ok to simply ignore it. You have
> mentioned that TDX does not need to use those paths, but rather than
> assume they can't be used why not do the audit to explicitly disable
> them? Otherwise this statement seems to imply that the audit has not
> been done.

But KVM also emulates WBINVD only if DMA is supported. Otherwise it
will be treated as noop.

static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
{
return kvm_arch_has_noncoherent_dma(vcpu->kvm);
}



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-25 00:37:54

by Andi Kleen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest



> I thought we discussed that there are other considerations for wbinvd
> besides DMA? In any event this paragraph is actively misleading
> because it disregards ACPI and Persistent Memory secure-erase whose
> usages of wbinvd have nothing to do with DMA.


In this case they would be broken in KVM too.


> I would much prefer a
> patch to shutdown all the known wbinvd users as a precursor to this
> patch rather than assuming it's ok to simply ignore it. You have
> mentioned that TDX does not need to use those paths, but rather than
> assume they can't be used why not do the audit to explicitly disable
> them? Otherwise this statement seems to imply that the audit has not
> been done.

We're not assuming it. We know it because KVM does it since forever.

All we want to do is do the same as KVM.

-Andi


2021-05-25 00:40:44

by Andi Kleen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest


On 5/24/2021 4:42 PM, Dave Hansen wrote:
> On 5/24/21 4:32 PM, Kuppuswamy Sathyanarayanan wrote:
>> Functionally only DMA devices can notice a side effect from
>> WBINVD's cache flushing.
> This seems to be trying to make some kind of case that the only visible
> effects from WBINVD are for DMA devices. That's flat out wrong. It
> might be arguable that none of the other cases exist in a TDX guest, but
> it doesn't excuse making such a broad statement without qualification.

We're describing a few sentences down that guests run with EPT
IgnorePAT=1, which is the qualification.

>
> Just grep in the kernel for a bunch of reasons this is wrong.
>
> Where did this come from?

Again the logic is very simple: TDX guest code is (mostly) about
replacing KVM code with in kernel code, so we're just doing the same as
KVM. You cannot get any more proven than that.


-Andi


2021-05-25 00:52:30

by Dan Williams

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
>
>
> On 5/24/21 4:39 PM, Dan Williams wrote:
> >> Functionally only DMA devices can notice a side effect from
> >> WBINVD's cache flushing. But, TDX does not support DMA,
> >> because DMA typically needs uncached access for MMIO, and
> >> the current TDX module always sets the IgnorePAT bit, which
> >> prevents that.
>
> > I thought we discussed that there are other considerations for wbinvd
> > besides DMA? In any event this paragraph is actively misleading
> > because it disregards ACPI and Persistent Memory secure-erase whose
> > usages of wbinvd have nothing to do with DMA. I would much prefer a
> > patch to shutdown all the known wbinvd users as a precursor to this
> > patch rather than assuming it's ok to simply ignore it. You have
> > mentioned that TDX does not need to use those paths, but rather than
> > assume they can't be used why not do the audit to explicitly disable
> > them? Otherwise this statement seems to imply that the audit has not
> > been done.
>
> But KVM also emulates WBINVD only if DMA is supported. Otherwise it
> will be treated as noop.
>
> static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> {
> return kvm_arch_has_noncoherent_dma(vcpu->kvm);
> }

That makes KVM also broken for the cases where wbinvd is needed, but
it does not make the description of this patch correct.

2021-05-25 00:56:22

by Sean Christopherson

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021, Dan Williams wrote:
> On Mon, May 24, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
> <[email protected]> wrote:
> >
> >
> >
> > On 5/24/21 4:39 PM, Dan Williams wrote:
> > >> Functionally only DMA devices can notice a side effect from
> > >> WBINVD's cache flushing. But, TDX does not support DMA,
> > >> because DMA typically needs uncached access for MMIO, and
> > >> the current TDX module always sets the IgnorePAT bit, which
> > >> prevents that.
> >
> > > I thought we discussed that there are other considerations for wbinvd
> > > besides DMA? In any event this paragraph is actively misleading
> > > because it disregards ACPI and Persistent Memory secure-erase whose
> > > usages of wbinvd have nothing to do with DMA. I would much prefer a
> > > patch to shutdown all the known wbinvd users as a precursor to this
> > > patch rather than assuming it's ok to simply ignore it. You have
> > > mentioned that TDX does not need to use those paths, but rather than
> > > assume they can't be used why not do the audit to explicitly disable
> > > them? Otherwise this statement seems to imply that the audit has not
> > > been done.
> >
> > But KVM also emulates WBINVD only if DMA is supported. Otherwise it
> > will be treated as noop.
> >
> > static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> > {
> > return kvm_arch_has_noncoherent_dma(vcpu->kvm);
> > }
>
> That makes KVM also broken for the cases where wbinvd is needed, but
> it does not make the description of this patch correct.

Yep! KVM has a long and dubious history of making things work for specific use
cases without strictly adhering to the architecture.

KVM also has to worry about malicious/buggy guests, e.g. letting the guest do
WBINVD at will would be a massive noisy neighbor problem (at best), while
ratelimiting might unnecessarily harm legitimate use case. I.e. KVM has a
somewhat sane reason for "emulating" WBINVD as a nop.

And FWIW, IIRC all modern hardware has a coherent IOMMU, though that could be me
making things up.

2021-05-25 00:58:31

by Dan Williams

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021 at 5:40 PM Andi Kleen <[email protected]> wrote:
>
>
> On 5/24/2021 4:42 PM, Dave Hansen wrote:
> > On 5/24/21 4:32 PM, Kuppuswamy Sathyanarayanan wrote:
> >> Functionally only DMA devices can notice a side effect from
> >> WBINVD's cache flushing.
> > This seems to be trying to make some kind of case that the only visible
> > effects from WBINVD are for DMA devices. That's flat out wrong. It
> > might be arguable that none of the other cases exist in a TDX guest, but
> > it doesn't excuse making such a broad statement without qualification.
>
> We're describing a few sentences down that guests run with EPT
> IgnorePAT=1, which is the qualification.
>
> >
> > Just grep in the kernel for a bunch of reasons this is wrong.
> >
> > Where did this come from?
>
> Again the logic is very simple: TDX guest code is (mostly) about
> replacing KVM code with in kernel code, so we're just doing the same as
> KVM. You cannot get any more proven than that.
>

I have no problem pointing at KVM as to why the risk is mitigated, but
I do have a problem with misrepresenting the scope of the risk.

2021-05-25 01:03:29

by Andi Kleen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest


> That makes KVM also broken for the cases where wbinvd is needed,


Or maybe your analysis is wrong?


> but
> it does not make the description of this patch correct.

If KVM was broken I'm sure we would hear about it.

The ACPI cases are for S3, which is not supported in guests, or for the
old style manual IO port C6, which isn't supported either.

The persistent memory cases would require working DMA mappings, which we
currently don't support. If DMA mappings were added we would need to
para virtualized WBINVD, like the comments say.

AFAIK all the rest is for some caching attribute change, which is not
possible in KVM (because it uses EPT.IgnorePAT=1) nor in TDX (which does
the same). Some are for MTRR which is completely disabled if you're
running under EPT.

-Andi

2021-05-25 01:48:44

by Dan Williams

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021 at 6:02 PM Andi Kleen <[email protected]> wrote:
>
>
> > That makes KVM also broken for the cases where wbinvd is needed,
>
>
> Or maybe your analysis is wrong?

I'm well aware of the fact that wbinvd is problematic for hypervisors
and is an attack vector for a guest to DOS the host.

>
>
> > but
> > it does not make the description of this patch correct.
>
> If KVM was broken I'm sure we would hear about it.

KVM does not try to support the cases where wbinvd being unavailable
would break the system. That is not the claim being made in this
patch.

> The ACPI cases are for S3, which is not supported in guests, or for the
> old style manual IO port C6, which isn't supported either.

> The persistent memory cases would require working DMA mappings,

No, that analysis is wrong. The wbinvd audit would have found that
persistent memory secure-erase and unlock, which have nothing to do
with DMA, need wbinvd to ensure that the CPU has not retained a copy
of the PMEM contents from before the unlock happened, and to make
sure that any data that was meant to be destroyed by an erasure is
not retained in cache.

> which we
> currently don't support. If DMA mappings were added we would need to
> para virtualized WBINVD, like the comments say.
>
> AFAIK all the rest is for some caching attribute change, which is not
> possible in KVM (because it uses EPT.IgnorePAT=1) nor in TDX (which does
> the same). Some are for MTRR which is completely disabled if you're
> running under EPT.

It's fine to not support the above cases, I am asking for the
explanation to demonstrate the known risks and the known mitigations.
IgnorePAT is not the mitigation, the mitigation is an audit to
describe why the known users are unlikely to be triggered. Even better
would be an addition patch that does something like:

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4b80150e4afa..a6b13a1ae319 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -170,6 +170,9 @@ static int __nvdimm_security_unlock(struct nvdimm *nvdimm)
const void *data;
int rc;

+ if (is_protected_guest())
+ return -ENXIO;
+
/* The bus lock should be held at the top level of the call stack */
lockdep_assert_held(&nvdimm_bus->reconfig_mutex);

...to explicitly error out a wbinvd use case before data is altered
and wbinvd is needed.

2021-05-25 02:47:38

by Andi Kleen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest


On 5/24/2021 6:45 PM, Dan Williams wrote:
>
>>
>>> but
>>> it does not make the description of this patch correct.
>> If KVM was broken I'm sure we would hear about it.
> KVM does not try to support the cases where wbinvd being unavailable
> would break the system. That is not the claim being made in this
> patch.

I thought we made that claim.


"We just want to be the same as KVM"

>
>> The ACPI cases are for S3, which is not supported in guests, or for the
>> old style manual IO port C6, which isn't supported either.
>> The persistent memory cases would require working DMA mappings,
> No, that analysis is wrong. The wbinvd audit would have found that
> persistent memory secure-erase and unlock, which has nothing to do
> with DMA, needs wbinvd to ensure that the CPU has not retained a copy
> of the PMEM contents from before the unlock happened and it needs to
> make sure that any data that was meant to be destroyed by an erasure
> is not retained in cache.

But that's all not supported in TDX.

And the only way it could work in KVM is when there is some DMA, likely
at least an IOMMU, e.g. to set up the persistent memory. That's what I
meant by working DMA mappings.

Otherwise KVM would be really broken, but I don't really believe that
without some real evidence.


>
> It's fine to not support the above cases, I am asking for the
> explanation to demonstrate the known risks and the known mitigations.

The analysis is that all this stuff that you are worried about cannot be
enabled in a TDX guest.

(it would be a nightmare if it could, we would need to actually make it
secure against a malicious host)

> IgnorePAT is not the mitigation, the mitigation is an audit to
> describe why the known users are unlikely to be triggered. Even better
> would be an addition patch that does something like:
>
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 4b80150e4afa..a6b13a1ae319 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -170,6 +170,9 @@ static int __nvdimm_security_unlock(struct nvdimm *nvdimm)
> const void *data;
> int rc;
>
> + if (is_protected_guest())
> + return -ENXIO;
> +
> /* The bus lock should be held at the top level of the call stack */
> lockdep_assert_held(&nvdimm_bus->reconfig_mutex);
>
> ...to explicitly error out a wbinvd use case before data is altered
> and wbinvd is needed.

I don't see any point of all of this. We really just want to be the same
as KVM. Not get into the business of patching a bazillion sub systems
that cannot be used in TDX anyways.

-Andi

2021-05-25 02:47:58

by Dan Williams

Subject: Re: [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR

On Mon, May 24, 2021 at 4:32 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> When running as a TDX guest, there are a number of existing,
> privileged instructions that do not work. If the guest kernel
> uses these instructions, the hardware generates a #VE.
>
> You can find the list of unsupported instructions in Intel
> Trust Domain Extensions (Intel® TDX) Module specification,
> sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
> Specification for Intel TDX, sec 2.4.1.
>
> To prevent TD guests from using MWAIT/MONITOR instructions,
> the CPUID flags for these instructions are already disabled
> by the TDX module.
>
> After the above mentioned preventive measures, if TD guests
> still execute these instructions, add an appropriate warning
> message (WARN_ONCE()) in the #VE handler. This handling behavior
> is the same as KVM's (which also treats MWAIT/MONITOR as nops,
> warning once on unsupported platforms).
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> ---
>
> Changes since RFC v2:
> * Moved WBINVD related changes to a new patch.
> * Fixed commit log as per review comments.

Looks good.

Reviewed-by: Dan Williams <[email protected]>

2021-05-25 02:51:58

by Dan Williams

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021 at 7:13 PM Andi Kleen <[email protected]> wrote:
[..]
> > ...to explicitly error out a wbinvd use case before data is altered
> > and wbinvd is needed.
>
> I don't see any point of all of this. We really just want to be the same
> as KVM. Not get into the business of patching a bazillion sub systems
> that cannot be used in TDX anyways.

Please let's not start this patch off with dubious claims of safety
afforded by IgnorePAT. Instead make the true argument that wbinvd is
known to be problematic in guests and for that reason many bare metal
use cases that require wbinvd have not been ported to guests (like
PMEM unlock), and others that only use wbinvd to opportunistically
enforce a cache state (like ACPI sleep states) do not see ill effects
from missing wbinvd. Given KVM ships with a policy to elide wbinvd in
many scenarios, adopt the same policy for TDX guests.

2021-05-25 03:30:53

by Andi Kleen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest


On 5/24/2021 7:49 PM, Dan Williams wrote:
> On Mon, May 24, 2021 at 7:13 PM Andi Kleen <[email protected]> wrote:
> [..]
>>> ...to explicitly error out a wbinvd use case before data is altered
>>> and wbinvd is needed.
>> I don't see any point of all of this. We really just want to be the same
>> as KVM. Not get into the business of patching a bazillion sub systems
>> that cannot be used in TDX anyways.
> Please let's not start this patch off with dubious claims of safety
> afforded by IgnorePAT. Instead make the true argument that wbinvd is
> known to be problematic in guests

That's just another reason to not support WBINVD, but I don't think it's
the main reason. The main reason is that it is simply not needed, unless
you do DMA in some form.

(and yes I consider direct mapping of persistent memory with a complex
setup procedure a form of DMA -- my guess is that the reason that it
works in KVM is that it somehow activates the DMA code paths in KVM)

IMNSHO that's the true reason.

> and for that reason many bare metal
> use cases that require wbinvd have not been ported to guests (like
> PMEM unlock), and others that only use wbinvd to opportunistically
> enforce a cache state (like ACPI sleep states)

ACPI sleep states are not supported or needed in virtualization. They
are mostly obsolete on real hardware too.


> do not see ill effects
> from missing wbinvd. Given KVM ships with a policy to elide wbinvd in
> many scenarios adopt the same policy for TDX guests.

2021-05-25 03:43:52

by Dan Williams

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On Mon, May 24, 2021 at 8:27 PM Andi Kleen <[email protected]> wrote:
>
>
> On 5/24/2021 7:49 PM, Dan Williams wrote:
> > On Mon, May 24, 2021 at 7:13 PM Andi Kleen <[email protected]> wrote:
> > [..]
> >>> ...to explicitly error out a wbinvd use case before data is altered
> >>> and wbinvd is needed.
> >> I don't see any point of all of this. We really just want to be the same
> >> as KVM. Not get into the business of patching a bazillion sub systems
> >> that cannot be used in TDX anyways.
> > Please let's not start this patch off with dubious claims of safety
> > afforded by IgnorePAT. Instead make the true argument that wbinvd is
> > known to be problematic in guests
>
> That's just another reason to not support WBINVD, but I don't think it's
> the main reason. The main reason is that it is simply not needed, unless
> you do DMA in some form.
>
> (and yes I consider direct mapping of persistent memory with a complex
> setup procedure a form of DMA -- my guess is that the reason that it
> works in KVM is that it somehow activates the DMA code paths in KVM)

No, it doesn't. Simply, no one has tried to pass through the security
interface of a bare metal nvdimm to a guest, or enabled the security
commands in a virtualized nvdimm. If a guest supports a memory map it
supports PMEM; I struggle to see DMA anywhere in that equation.

>
> IMNSHO that's the true reason.

I do see why it would be attractive if IgnorePAT was a solid signal to
ditch wbinvd support. However, it simply isn't, and to date nothing
has cared to trip over that gap.

2021-05-25 04:34:18

by Dave Hansen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest

On 5/24/21 7:13 PM, Andi Kleen wrote:
> I don't see any point of all of this. We really just want to be the same
> as KVM. Not get into the business of patching a bazillion sub systems
> that cannot be used in TDX anyways.

Andi, there's a fundamental difference between KVM the hypervisor and a
TDX guest: KVM the hypervisor runs unknown guests, and lots of them.

TD guest support as a whole has to handle one thing: running *one* Linux
kernel. Further, the guest support shares a source tree with that
kernel. TD guest support doesn't have to run random binaries for which
there is no source. All of the source is *RIGHT* *THERE*.

The only reason TD guest support would have to fall back to KVM's dirty
tricks is a desire to treat the rest of the kernel like a black box.
KVM frankly has no other choice. TD guest support has all the choices
in the world.

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

Hi,

On 5/21/21 2:14 PM, Tom Lendacky wrote:
>
>
> On 5/21/21 1:49 PM, Borislav Petkov wrote:
>> On Fri, May 21, 2021 at 11:19:15AM -0500, Tom Lendacky wrote:
>>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>>> when SEV support was added), we do:
>>> if (sev_active())
>>> swiotlb_force = SWIOTLB_FORCE;
>>>
>>> TDX should be able to do a similar thing without having to touch
>>> arch/x86/kernel/pci-swiotlb.c.
>>>
>>> That would remove any confusion over SME being part of a
>>> protected_guest_has() call.
>>
>> Even better.
>>
>>> I kinda like the separate function, though.
>>
>> Only if you clean it up and get rid of the inverted logic and drop that
>> silly switch-case.
>>
>>> Except mem_encrypt_active() covers both SME and SEV, so
>>> protected_guest_has() would be confusing.
>>
>> I don't understand - the AMD-specific function amd_protected_guest_has()
>> would return sme_me_mask just like mem_encrypt_active() does and we can
>> get rid of latter.
>>
>> Or do you have a problem with the name protected_guest_has() containing
>> "guest" while we're talking about SME here?
>
> The latter.
>
>>
>> If so, feel free to suggest a better one - the name does not have to
>> have "guest" in it.
>
> Let me see if I can come up with something that will make sense.
>
> Thanks,
> Tom
>
>>
>> Thx.
>>
>>

The following is a sample implementation. Please let me know your
comments.

tdx: Introduce generic protected_guest abstraction

Add a generic way to check if we are running as an encrypted guest,
without requiring x86-specific ifdefs. This can then be used in
non-architecture-specific code.

The is_protected_guest() helper function can be implemented using
arch-specific CPU feature flags.

protected_guest_has() is used to check for protected-guest
feature flags.

Originally-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..98c30312555b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -956,6 +956,9 @@ config HAVE_ARCH_NVRAM_OPS
config ISA_BUS_API
def_bool ISA

+config ARCH_HAS_PROTECTED_GUEST
+ bool
+
#
# ABI hall of shame
#
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bc91c4aa7ce4..2f31613be965 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
select X86_MEM_ENCRYPT_COMMON
+ select ARCH_HAS_PROTECTED_GUEST
help
Provide support for running in a trusted domain on Intel processors
equipped with Trust Domain Extensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/protected_guest.h b/arch/x86/include/asm/protected_guest.h
new file mode 100644
index 000000000000..b2838e58ce94
--- /dev/null
+++ b/arch/x86/include/asm/protected_guest.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_PROTECTED_GUEST
+#define _ASM_PROTECTED_GUEST 1
+
+#include <asm/cpufeature.h>
+#include <asm/tdx.h>
+
+/* Only include through linux/protected_guest.h */
+
+static inline bool is_protected_guest(void)
+{
+ return boot_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+ if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+ return tdx_protected_guest_has(flag);
+
+ return false;
+}
+
+#endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 175cebb7bf94..d894111f49ea 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -147,6 +147,7 @@ do { \
extern phys_addr_t tdg_shared_mask(void);
extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
enum tdx_map_type map_type);
+bool tdx_protected_guest_has(unsigned long flag);

#else // !CONFIG_INTEL_TDX_GUEST

@@ -167,6 +168,11 @@ static inline int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
{
return -ENODEV;
}
+
+static inline bool tdx_protected_guest_has(unsigned long flag)
+{
+ return false;
+}
#endif /* CONFIG_INTEL_TDX_GUEST */

#ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index c613c89d0d6a..cbb893412b43 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -11,6 +11,7 @@
#include <linux/sched/signal.h> /* force_sig_fault() */
#include <linux/swiotlb.h>
#include <linux/security.h>
+#include <linux/protected_guest.h>

#include <linux/cpu.h>

@@ -122,6 +123,23 @@ bool is_tdx_guest(void)
}
EXPORT_SYMBOL_GPL(is_tdx_guest);

+bool tdx_protected_guest_has(unsigned long flag)
+{
+ if (!is_tdx_guest())
+ return false;
+
+ switch (flag) {
+ case VM_MEM_ENCRYPT:
+ case VM_MEM_ENCRYPT_ACTIVE:
+ case VM_UNROLL_STRING_IO:
+ case VM_HOST_MEM_ENCRYPT:
+ return true;
+ }
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
+
/* The highest bit of a guest physical address is the "sharing" bit */
phys_addr_t tdg_shared_mask(void)
{
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..f362eea39bd8
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define VM_MEM_ENCRYPT 0x100
+/* Encryption support is active */
+#define VM_MEM_ENCRYPT_ACTIVE 0x101
+/* Support for unrolled string IO */
+#define VM_UNROLL_STRING_IO 0x102
+/* Support for host memory encryption */
+#define VM_HOST_MEM_ENCRYPT 0x103
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+#include <asm/protected_guest.h>
+#else
+static inline bool is_protected_guest(void) { return false; }
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+#endif
+
+#endif


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-26 02:52:33

by Andi Kleen

Subject: Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest


On 5/24/2021 8:40 PM, Dan Williams wrote:
> On Mon, May 24, 2021 at 8:27 PM Andi Kleen <[email protected]> wrote:
>>
>> On 5/24/2021 7:49 PM, Dan Williams wrote:
>>> On Mon, May 24, 2021 at 7:13 PM Andi Kleen <[email protected]> wrote:
>>> [..]
>>>>> ...to explicitly error out a wbinvd use case before data is altered
>>>>> and wbinvd is needed.
>>>> I don't see any point of all of this. We really just want to be the same
>>>> as KVM. Not get into the business of patching a bazillion sub systems
>>>> that cannot be used in TDX anyways.
>>> Please let's not start this patch off with dubious claims of safety
>>> afforded by IgnorePAT. Instead make the true argument that wbinvd is
>>> known to be problematic in guests
>> That's just another reason to not support WBINVD, but I don't think it's
>> the main reason. The main reason is that it is simply not needed, unless
>> you do DMA in some form.
>>
>> (and yes I consider direct mapping of persistent memory with a complex
>> setup procedure a form of DMA -- my guess is that the reason that it
>> works in KVM is that it somehow activates the DMA code paths in KVM)
> No, it doesn't. Simply no one has tried to pass through the security
> interface of bare metal nvdimm to a guest, or enabled the security
> commands in a virtualized nvdimm.

Maybe a better term would be "external side effects": something in the
IO domain that can notice a difference.

> If a guest supports a memory map it supports PMEM I struggle to see DMA anywhere in that equation.

Okay, if that happens to a TDX guest we have to start emulating WBINVD.
But right now we don't need it.

I guess we can add a comment that says

"if someone wants to implement NVDIMM secure delete they would also need
to implement this new hypercall"

>
>> IMNSHO that's the true reason.
> I do see why it would be attractive if IgnorePAT was a solid signal to
> ditch wbinvd support. However, it simply isn't, and to date nothing
> has cared trip over that gap.


I think we're getting into angels on a pinhead here.

The key point is that current TDX does not need WBINVD. I believe we
agree on that.


-Andi


Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/21/21 9:19 AM, Tom Lendacky wrote:
> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
> when SEV support was added), we do:
> if (sev_active())
> swiotlb_force = SWIOTLB_FORCE;
>
> TDX should be able to do a similar thing without having to touch
> arch/x86/kernel/pci-swiotlb.c.
>
> That would remove any confusion over SME being part of a
> protected_guest_has() call.

You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?

41 int __init pci_swiotlb_detect_4gb(void)
42 {
43 /* don't initialize swiotlb if iommu=off (no_iommu=1) */
44 if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
45 swiotlb = 1;
46
47 /*
48 * If SME is active then swiotlb will be set to 1 so that bounce
49 * buffers are allocated and used for devices that do not support
50 * the addressing range required for the encryption mask.
51 */
52 if (sme_active() || is_tdx_guest())
53 swiotlb = 1;


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/26/21 3:14 PM, Tom Lendacky wrote:
> On 5/26/21 5:02 PM, Tom Lendacky wrote:
>> On 5/26/21 4:37 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>>
>>> On 5/21/21 9:19 AM, Tom Lendacky wrote:
>>>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>>>> when SEV support was added), we do:
>>>>     if (sev_active())
>>>>         swiotlb_force = SWIOTLB_FORCE;
>>>>
>>>> TDX should be able to do a similar thing without having to touch
>>>> arch/x86/kernel/pci-swiotlb.c.
>>>>
>>>> That would remove any confusion over SME being part of a
>>>> protected_guest_has() call.
>>>
>>> You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?
>>
>> No, the sme_active() check is required to make sure that SWIOTLB is
>> available under SME. Encrypted DMA is supported under SME if the device
>> supports 64-bit DMA. But if the device doesn't support 64-bit DMA and the
>> IOMMU is not active, then DMA will be bounced through SWIOTLB.
>>
>> As compared to SEV, where all DMA has to be bounced through SWIOTLB or
>> unencrypted memory. For that, swiotlb_force is used.
>
> I should probably add that SME is memory encryption support for
> host/hypervisor/bare-metal, while SEV is memory encryption support for
> virtualization.

Got it. Thanks for clarification.

>
> Thanks,
> Tom
>
>>
>> Thanks,
>> Tom
>>
>>>
>>>  41 int __init pci_swiotlb_detect_4gb(void)
>>>  42 {
>>>  43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
>>>  44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
>>>  45                 swiotlb = 1;
>>>  46
>>>  47         /*
>>>  48          * If SME is active then swiotlb will be set to 1 so that bounce
>>>  49          * buffers are allocated and used for devices that do not support
>>>  50          * the addressing range required for the encryption mask.
>>>  51          */
>>>  52         if (sme_active() || is_tdx_guest())
>>>  53                 swiotlb = 1;
>>>
>>>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-27 02:44:51

by Tom Lendacky

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/26/21 4:37 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/21/21 9:19 AM, Tom Lendacky wrote:
>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>> when SEV support was added), we do:
>>     if (sev_active())
>>         swiotlb_force = SWIOTLB_FORCE;
>>
>> TDX should be able to do a similar thing without having to touch
>> arch/x86/kernel/pci-swiotlb.c.
>>
>> That would remove any confusion over SME being part of a
>> protected_guest_has() call.
>
> You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?

No, the sme_active() check is required to make sure that SWIOTLB is
available under SME. Encrypted DMA is supported under SME if the device
supports 64-bit DMA. But if the device doesn't support 64-bit DMA and the
IOMMU is not active, then DMA will be bounced through SWIOTLB.

As compared to SEV, where all DMA has to be bounced through SWIOTLB or
unencrypted memory. For that, swiotlb_force is used.

Thanks,
Tom

>
>  41 int __init pci_swiotlb_detect_4gb(void)
>  42 {
>  43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
>  44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
>  45                 swiotlb = 1;
>  46
>  47         /*
>  48          * If SME is active then swiotlb will be set to 1 so that bounce
>  49          * buffers are allocated and used for devices that do not support
>  50          * the addressing range required for the encryption mask.
>  51          */
>  52         if (sme_active() || is_tdx_guest())
>  53                 swiotlb = 1;
>
>

2021-05-27 02:46:16

by Tom Lendacky

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On 5/26/21 5:02 PM, Tom Lendacky wrote:
> On 5/26/21 4:37 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 5/21/21 9:19 AM, Tom Lendacky wrote:
>>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>>> when SEV support was added), we do:
>>>     if (sev_active())
>>>         swiotlb_force = SWIOTLB_FORCE;
>>>
>>> TDX should be able to do a similar thing without having to touch
>>> arch/x86/kernel/pci-swiotlb.c.
>>>
>>> That would remove any confusion over SME being part of a
>>> protected_guest_has() call.
>>
>> You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?
>
> No, the sme_active() check is required to make sure that SWIOTLB is
> available under SME. Encrypted DMA is supported under SME if the device
> supports 64-bit DMA. But if the device doesn't support 64-bit DMA and the
> IOMMU is not active, then DMA will be bounced through SWIOTLB.
>
> As compared to SEV, where all DMA has to be bounced through SWIOTLB or
> unencrypted memory. For that, swiotlb_force is used.

I should probably add that SME is memory encryption support for
host/hypervisor/bare-metal, while SEV is memory encryption support for
virtualization.

Thanks,
Tom

>
> Thanks,
> Tom
>
>>
>>  41 int __init pci_swiotlb_detect_4gb(void)
>>  42 {
>>  43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
>>  44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
>>  45                 swiotlb = 1;
>>  46
>>  47         /*
>>  48          * If SME is active then swiotlb will be set to 1 so that bounce
>>  49          * buffers are allocated and used for devices that do not support
>>  50          * the addressing range required for the encryption mask.
>>  51          */
>>  52         if (sme_active() || is_tdx_guest())
>>  53                 swiotlb = 1;
>>
>>

Subject: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest

From: "Kirill A. Shutemov" <[email protected]>

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

In the settings that Linux will run in, virtualization exceptions are never
generated on accesses to normal, TD-private memory that has been
accepted.

The entry paths do not access TD-shared memory, MMIO regions or use
those specific MSRs, instructions, CPUID leaves that might generate #VE.
In addition, all interrupts including NMIs are blocked by the hardware
starting with #VE delivery until TDGETVEINFO is called.  This eliminates
the chance of a #VE during the syscall gap or paranoid entry paths and
simplifies #VE handling.

After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
but it is expected not to happen because TDX expects NMIs not to
trigger #VEs. Another case where they could happen is if the #VE
handler itself panics, but in that case there are no guarantees on
anything anyway.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest which will result in an oops (and should eventually be a panic, as
we would like to set panic_on_oops to 1 for TDX guests).

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VEs (everything, until later in this series) so that
they appear just like a #GP by calling ve_raise_fault() directly.
ve_raise_fault() is similar to the #GP handler and is responsible for
sending SIGSEGV to userspace, invoking the die path for kernel faults,
and notifying debuggers and other die-chain users.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix:
* No code changes (Added Tony to "To" list)

Changes since v1:
* Removed [RFC v2 07/32] x86/traps: Add do_general_protection() helper function.
* Instead of reusing the #GP handler, defined a custom handler.
* Fixed commit log as per review comments.

arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/tdx.h | 19 +++++++++
arch/x86/kernel/idt.c | 6 +++
arch/x86/kernel/tdx.c | 36 +++++++++++++++++
arch/x86/kernel/traps.c | 69 +++++++++++++++++++++++++++++++++
5 files changed, 134 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index fcd42119a287..a451786496a0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -39,6 +39,25 @@ struct tdx_hypercall_output {
u64 r15;
};

+/*
+ * Used by #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ u64 gla;
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve);
+
/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};

/*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
INTG(X86_TRAP_AC, asm_exc_alignment_check),
INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif

#ifdef CONFIG_X86_32
TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e4383b416ef3..527d2638ddae 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -10,6 +10,7 @@

/* TDX Module call Leaf IDs */
#define TDINFO 1
+#define TDGETVEINFO 3

static struct {
unsigned int gpa_width;
@@ -87,6 +88,41 @@ static void tdg_get_info(void)
td_info.attributes = out.rdx;
}

+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+ u64 ret;
+ struct tdx_module_output out = {0};
+
+ /*
+ * NMIs and machine checks are suppressed. Before this point any
+ * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+ * additional #VEs are permitted (but we don't expect them to
+ * happen unless you panic).
+ */
+ ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = out.r10 & UINT_MAX;
+ ve->instr_info = out.r10 >> 32;
+
+ return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+ struct ve_info *ve)
+{
+ /*
+ * TODO: Add handler support for various #VE exit
+ * reasons. It will be added by other patches in
+ * the series.
+ */
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return -EFAULT;
+}
+
void __init tdx_early_init(void)
{
if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..043608943c3b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1137,6 +1138,74 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}

+#define VEFSTR "VE fault"
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+ struct task_struct *tsk = current;
+
+ if (user_mode(regs)) {
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * Not fixing up VDSO exceptions similar to #GP handler
+ * because we don't expect the VDSO to trigger #VE.
+ */
+ show_signal(tsk, SIGSEGV, "", VEFSTR, regs, error_code);
+ force_sig(SIGSEGV);
+ return;
+ }
+
+ if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+ return;
+
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * To be potentially processing a kprobe fault and to trust the result
+ * from kprobe_running(), we have to be non-preemptible.
+ */
+ if (!preemptible() &&
+ kprobe_running() &&
+ kprobe_fault_handler(regs, X86_TRAP_VE))
+ return;
+
+ notify_die(DIE_GPF, VEFSTR, regs, error_code, X86_TRAP_VE, SIGSEGV);
+
+ die_addr(VEFSTR, regs, error_code, 0);
+}
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+ int ret;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+ /*
+	 * NMIs, machine checks and interrupts are kept disabled until
+	 * the TDGETVEINFO TDCALL is executed. This prevents #VE
+	 * nesting issues.
+ */
+ ret = tdg_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ if (!ret)
+ ret = tdg_handle_virtualization_exception(regs, &ve);
+ /*
+ * If tdg_handle_virtualization_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (ret)
+ ve_raise_fault(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.25.1

Subject: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host.  This prevents the old hypercall
mechanisms from working. So to communicate with VMM, TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer (TDX module) exists between host and guest to facilitate
secure communication. TDX guests communicate with the TDX module and
with the VMM using a new instruction: TDCALL.

Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
   
__tdx_hypercall() - request services from the VMM.
__tdx_module_call()  - communicate with the TDX Module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11() to cover common use cases of
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, it does not need
multiple wrappers.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing plain assembly over inline assembly is
that the __tdx_hypercall() implementation is over 70 lines of
instructions (with comments); implementing it in inline assembly
would make it hard to read.
   
Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions.  But, this approach maximizes code reuse. The
same argument applies to __tdx_hypercall() function as well.

For registers used by TDCALL instruction, please check TDX GHCI
specification, sec 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix-v1:
* No code changes (Adding Tony to "To" list)

Changes since RFC v2:
 * Renamed __tdcall()/__tdvmcall() to __tdx_module_call()/__tdx_hypercall().
 * Renamed reg offsets from TDCALL_rx to TDX_MODULE_rx.
 * Renamed reg offsets from TDVMCALL_rx to TDX_HYPERCALL_rx.
 * Renamed struct tdcall_output to struct tdx_module_output.
 * Renamed struct tdvmcall_output to struct tdx_hypercall_output.
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary push/pop sequence in __tdcall() function.
 * Fixed comments as per Dave's review.

arch/x86/include/asm/tdx.h | 38 ++++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 22 ++++
arch/x86/kernel/tdcall.S | 223 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 38 ++++++
5 files changed, 322 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..fcd42119a287 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the TDX module. This is a software-only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the VMM. This is a software-only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_output {
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};

/* Common API to check TDX support in decompression and common kernel code. */
bool is_tdx_guest(void);

void __init tdx_early_init(void);

+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+ struct tdx_hypercall_output *out);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o

obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..70cafbae4fea 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
#include <xen/interface/xen.h>
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offset for fields in tdx_module_output */
+ OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+ /* Offset for fields in tdx_hypercall_output */
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..b06e8b62dfe2
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10 BIT(10)
+#define TDG_R11 BIT(11)
+#define TDG_R12 BIT(12)
+#define TDG_R13 BIT(13)
+#define TDG_R14 BIT(14)
+#define TDG_R15 BIT(15)
+
+/*
+ * Mask of registers R10-R15 exposed to the VMM. It is passed via the
+ * RCX register to the TDX module, which uses it to identify the set
+ * of registers exposed to the VMM. Each bit in this mask represents
+ * a register ID. You can find the bit-field details in the TDX GHCI
+ * specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDG_R10 | TDG_R11 | \
+ TDG_R12 | TDG_R13 | \
+ TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call() - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module. If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the "tdcall" operation is returned
+ * back to the user as a function return value.
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ * use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+
+ /*
+ * R12 will be used as temporary storage for
+ * struct tdx_module_output pointer. You can
+ * find struct tdx_module_output details in
+ * arch/x86/include/asm/tdx.h. Also note that
+ * registers R12-R15 are not used by TDCALL
+ * services supported by this helper function.
+ */
+ push %r12 /* Callee saved, so preserve it */
+ mov %r9, %r12 /* Move output pointer to R12 */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ mov %rdi, %rax /* Move TDCALL Leaf ID to RAX */
+ mov %r8, %r9 /* Move input 4 to R9 */
+ mov %rcx, %r8 /* Move input 3 to R8 */
+ mov %rsi, %rcx /* Move input 1 to RCX */
+ /* Leave input param 2 in RDX */
+
+ tdcall
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check if caller provided an output struct */
+ test %r12, %r12
+ jz 1f
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDX_MODULE_rcx(%r12)
+ movq %rdx, TDX_MODULE_rdx(%r12)
+ movq %r8, TDX_MODULE_r8(%r12)
+ movq %r9, TDX_MODULE_r9(%r12)
+ movq %r10, TDX_MODULE_r10(%r12)
+ movq %r11, TDX_MODULE_r11(%r12)
+1:
+ pop %r12 /* Restore the state of R12 register */
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall() - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function is created to contain common code between vendor
+ * specific and standard type TDX hypercalls. So the caller of this
+ * function has to set the TDVMCALL type in the R10 register before
+ * calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. If the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer. 
+ *
+ * @fn (RDI) - TDVMCALL function, moved to R11
+ * @r12 (RSI) - Input parameter 1, moved to R12
+ * @r13 (RDX) - Input parameter 2, moved to R13
+ * @r14 (RCX) - Input parameter 3, moved to R14
+ * @r15 (R8) - Input parameter 4, moved to R15
+ *
+ * @out (R9) - struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ *
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+ /* Save non-volatile GPRs that are exposed to the VMM. */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+ mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+ mov %rsi, %r12 /* Move input 1 to R12 */
+ mov %rdx, %r13 /* Move input 2 to R13 */
+ mov %rcx, %r14 /* Move input 3 to R14 */
+ mov %r8, %r15 /* Move input 4 to R15 */
+ /* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * Non-zero RAX values indicate a failure of TDCALL itself.
+ * Panic for those. This value is unrelated to the hypercall
+ * result in R10.
+ */
+ test %rax, %rax
+ jnz 2f
+
+ /* Move hypercall error code to RAX to return to user */
+ mov %r10, %rax
+
+ /* Check for hypercall success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz 1f
+
+ /* Check if caller provided an output struct */
+ test %r9, %r9
+ jz 1f
+
+ /* Copy hypercall result registers to output struct: */
+ movq %r11, TDX_HYPERCALL_r11(%r9)
+ movq %r12, TDX_HYPERCALL_r12(%r9)
+ movq %r13, TDX_HYPERCALL_r13(%r9)
+ movq %r14, TDX_HYPERCALL_r14(%r9)
+ movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+ /*
+ * Zero out registers exposed to the VMM to avoid
+ * speculative execution with VMM-controlled values.
+ * This needs to include all registers present in
+ * TDVMCALL_EXPOSE_REGS_MASK.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+ xor %r12d, %r12d
+ xor %r13d, %r13d
+ xor %r14d, %r14d
+ xor %r15d, %r15d
+
+ /* Restore non-volatile GPRs that are exposed to the VMM. */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ ret
+2:
+ ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for the standard type of TDVMCALLs. This assembly
+ * wrapper reuses do_tdx_hypercall() for standard hypercalls
+ * (R10 is set to zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+ /*
+ * R10 is not part of the function call ABI, but it is a part
+ * of the TDVMCALL ABI. So set it to 0 for the standard type of TDVMCALL
+ * before making call to the do_tdx_hypercall().
+ */
+ xor %r10, %r10
+ call do_tdx_hypercall
+ FRAME_END
+ retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5e70617e9877..97b54317f799 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,46 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2020 Intel Corporation */

+#define pr_fmt(fmt) "TDX: " fmt
+
#include <asm/tdx.h>

+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return err;
+}
+
+/*
+ * Wrapper for the semi-common case where the user needs a single
+ * output value (R11). Callers of this function do not care about
+ * the hypercall error code (mainly for IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+ u64 r14, u64 r15)
+{
+ struct tdx_hypercall_output out = {0};
+ u64 err;
+
+ err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+ if (err)
+ pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+ fn, err);
+
+ return out.r11;
+}
+
static inline bool cpuid_has_tdx_guest(void)
{
u32 eax, signature[3];
--
2.25.1

Subject: [RFC v2-fix-v3 1/1] x86/tdx: Ignore WBINVD instruction for TDX guest

Functionally only devices outside the CPU (such as DMA devices,
or persistent memory for flushing) can notice the external side
effects from WBINVD's cache flushing for write back mappings. One
exception here is MKTME, but that is not visible outside the TDX
module and not possible inside a TDX guest.

Currently TDX does not support DMA, because DMA typically needs
uncached access for MMIO, and the current TDX module always sets
the IgnorePAT bit, which prevents that.

Persistent memory is also currently not supported. There are some
other cases that use WBINVD, such as the legacy ACPI sleeps, but
these are all not supported in virtualization and there are better
mechanisms inside a guest anyways. The guests usually are not
aware of power management. Another code path that uses WBINVD is
the MTRR driver, but EPT/virtualization always disables MTRRs so
those are not needed. This all implies WBINVD is not needed with
current TDX. 

So handle the WBINVD instruction as a nop. Currently, the #VE
exception handler does not include any warning for WBINVD handling
because the ACPI reboot code uses it. This is the same behavior as
KVM, which only allows WBINVD in a guest when the guest supports
VT-d (=DMA), but just handles it as a nop if it doesn't.

If TDX ever gets DMA support, or persistent memory support, or
some other devices that can observe flushing side effects, a
hypercall can be added to implement it similar to AMD-SEV. But
current TDX does not need it.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-2:
* Added more details to commit log and comments to address
review comments.

Changes since RFC v2:
* Fixed commit log as per review comments.
* Removed WARN_ONCE for WBINVD #VE support.

arch/x86/kernel/tdx.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index da5c9cd08299..775ae090b625 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -455,6 +455,13 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+ case EXIT_REASON_WBINVD:
+ /*
+ * Non-coherent DMA, persistent memory, MTRRs and
+ * outdated ACPI sleeps are not supported in a TDX guest,
+ * so ignore WBINVD and treat it as a nop.
+ */
+ break;
case EXIT_REASON_MONITOR_INSTRUCTION:
case EXIT_REASON_MWAIT_INSTRUCTION:
/*
--
2.25.1

Subject: [RFC v2-fix-v1 0/3] x86/tdx: Handle port I/O

This patchset addresses the review comments in the patch titled
"[RFC v2 14/32] x86/tdx: Handle port I/O". Since it requires
patch split, sending these together.

Changes since RFC v2:
 * Removed assembly implementation of port IO emulation code
   and modified __in/__out IO helpers to directly call C function
   for in/out instruction emulation in decompression code.
 * Added helper function tdx_get_iosize() to make it easier for
   calling tdg_out()/tdg_in() C functions from decompression code.
 * Added support for early exception handler to support IO
   instruction emulation in early boot kernel code.
 * Removed alternative_ usage and made kernel only use #VE based
   IO instruction emulation support outside the decompression module.
 * Added support for protection_guest_has() API to generalize
   AMD SEV/TDX specific initialization code in common drivers.
 * Fixed commit log and comments as per review comments.


Andi Kleen (1):
x86/tdx: Handle early IO operations

Kirill A. Shutemov (1):
x86/tdx: Handle port I/O

Kuppuswamy Sathyanarayanan (1):
tdx: Introduce generic protected_guest abstraction

arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/tdcall.S | 3 +
arch/x86/boot/compressed/tdx.c | 28 ++++++
arch/x86/include/asm/io.h | 7 +-
arch/x86/include/asm/protected_guest.h | 24 +++++
arch/x86/include/asm/tdx.h | 60 ++++++++++++-
arch/x86/kernel/head64.c | 4 +
arch/x86/kernel/tdx.c | 116 +++++++++++++++++++++++++
include/linux/protected_guest.h | 23 +++++
11 files changed, 267 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdcall.S
create mode 100644 arch/x86/include/asm/protected_guest.h
create mode 100644 include/linux/protected_guest.h

--
2.25.1

Subject: Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code



On 5/17/21 6:28 PM, Kuppuswamy, Sathyanarayanan wrote:
>> Because the code is already separate.  You're actually going to some
>> trouble to move the SEV-specific code and then combine it with the
>> TDX-specific code.
>>
>> Anyway, please just give it a shot.  Should take all of ten minutes.  If
>> it doesn't work out in practice, fine.  You'll have a good paragraph for
>> the changelog.
>
> After reviewing the code again, I have noticed that we don't really have
> much common code between AMD and TDX. So I don't see any justification for
> creating this common layer. So, I have decided to drop this patch and move
> Intel TDX specific memory encryption init code to patch titled "[RFC v2 30/32]
> x86/tdx: Make DMA pages shared". This model is similar to how AMD-SEV
> does the initialization.
>
> I have sent the modified patch as reply to patch titled "[RFC v2 30/32]
> x86/tdx: Make DMA pages shared". Please check and let me know your comment

My method of using a separate initialization file for Intel-only code will not
work if we want to support both AMD SEV and TDX guests in the same binary.
So please ignore my previous reply. I will address the issue as per your
original comments and send you an updated patch.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: [RFC v2-fix-v1 1/1] x86/tdx: Add helper to do MapGPA hypercall

From: "Kirill A. Shutemov" <[email protected]>

The MapGPA hypercall is used by TDX guests to request that the VMM
convert the existing mapping of a given GPA range between
private and shared.

tdx_hcall_gpa_intent() is the wrapper used for making the MapGPA
hypercall.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Renamed tdg_map_gpa() to tdx_hcall_gpa_intent().
* Fixed commit log and comments as per review comments.

arch/x86/include/asm/tdx.h | 17 +++++++++++++++++
arch/x86/kernel/tdx.c | 24 ++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a93528246595..eb9fa5f4d0e3 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,15 @@

#ifndef __ASSEMBLY__

+/*
+ * Page mapping type enum. This is a software construct, not
+ * part of any hardware or VMM ABI.
+ */
+enum tdx_map_type {
+ TDX_MAP_PRIVATE,
+ TDX_MAP_SHARED,
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

#include <asm/cpufeature.h>
@@ -119,6 +128,8 @@ do { \
#endif

extern phys_addr_t tdg_shared_mask(void);
+extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type);

#else // !CONFIG_INTEL_TDX_GUEST

@@ -143,6 +154,12 @@ static inline phys_addr_t tdg_shared_mask(void)
{
return 0;
}
+
+static inline int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ return -ENODEV;
+}
#endif /* CONFIG_INTEL_TDX_GUEST */

#ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 0c8b10b78f32..a8ebd2d10093 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -16,6 +16,9 @@
#define TDINFO 1
#define TDGETVEINFO 3

+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001
+
#define VE_GET_IO_TYPE(exit_qual) (((exit_qual) & 8) ? 0 : 1)
#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
@@ -122,6 +125,27 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

+/*
+ * Inform the VMM of the guest's intent for this physical page:
+ * shared with the VMM or private to the guest. The VMM is
+ * expected to change its mapping of the page in response.
+ *
+ * Note: shared->private conversions require further guest
+ * action to accept the page.
+ */
+int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ u64 ret;
+
+ if (map_type == TDX_MAP_SHARED)
+ gpa |= tdg_shared_mask();
+
+ ret = tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+
+ return ret ? -EIO : 0;
+}
+
static __cpuidle void tdg_halt(void)
{
u64 ret;
--
2.25.1

Subject: [RFC v2-fix-v1 1/1] x86/mm: Move force_dma_unencrypted() to common code

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow VMM to access guest private memory. Any
memory that is required for communication with VMM must be shared
explicitly by setting the bit in page table entry. After setting
the shared bit, the conversion must be completed with the MapGPA TDVMCALL.
The call informs VMM about the conversion between private/shared
mappings. The shared memory is similar to unencrypted memory in AMD
SME/SEV terminology but the underlying process of sharing/un-sharing
the memory is different for Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, AMD SME code forces force_dma_unencrypted()
to return true for platforms that support AMD SEV feature. It will
be used for DMA memory allocation API to trigger
set_memory_decrypted() for platforms that support AMD SEV feature.

TDX is similar. So, to communicate with I/O devices, related pages
need to be marked as shared. As mentioned above, shared memory in
TDX architecture is similar to decrypted memory in AMD SME/SEV. So
similar to AMD SEV, force_dma_unencrypted() has to be forced to return
true. This support is added in other patches in this series.

So move force_dma_unencrypted() out of AMD specific code and call
AMD specific (amd_force_dma_unencrypted()) initialization function
from it. force_dma_unencrypted() will be modified by later patches
to include Intel TDX guest platform specific initialization.

Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has
to be selected by all x86 memory encryption features. This will be
selected by both AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in DMA code and has no
functional change.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Instead of moving all the contents of force_dma_unencrypted() to
mem_encrypt_common.c, create sub function for AMD and call it
from common code.
* Fixed commit log as per review comments.

arch/x86/Kconfig | 8 ++++++--
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mem_encrypt.c | 4 ++--
arch/x86/mm/mem_encrypt_common.c | 22 ++++++++++++++++++++++
4 files changed, 32 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fc588a64d1a0..7bc371d8ad7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1531,14 +1531,18 @@ config X86_CPA_STATISTICS
helps to determine the effectiveness of preserving large and huge
page mappings when mapping protections are changed.

+config X86_MEM_ENCRYPT_COMMON
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+ select DYNAMIC_PHYSICAL_MASK
+ def_bool n
+
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
- select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
- select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+ select X86_MEM_ENCRYPT_COMMON
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o

+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
+
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..ae4f3924f98f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -390,8 +390,8 @@ bool noinstr sev_es_active(void)
return sev_status & MSR_AMD64_SEV_ES_ENABLED;
}

-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
+/* Override for DMA direct allocation check - AMD specific initialization */
+bool amd_force_dma_unencrypted(struct device *dev)
{
/*
* For SEV, all DMA must be to unencrypted addresses.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..5ebf04482feb
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory Encryption Support Common Code
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <[email protected]>
+ */
+
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+bool amd_force_dma_unencrypted(struct device *dev);
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ if (sev_active() || sme_active())
+ return amd_force_dma_unencrypted(dev);
+
+ return false;
+}
--
2.25.1

2021-05-27 17:34:50

by Luck, Tony

Subject: RE: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".

You use all caps TDCALL everywhere else in this commit message.
Looks odd to have quoted lower case here.

> In a TDX based VM, since VMM is an untrusted entity, a intermediary
> layer (TDX module) exists between host and guest to facilitate the
> secure communication. TDX guests communicate with the TDX module and
> with the VMM using a new instruction: TDCALL.

Seems both repeat what was in the first paragraph, but also fail to
explain how this TDCALL is different from that first TDCALL.

> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>   
> __tdx_hypercall() - request services from the VMM.
> __tdx_module_call()  - communicate with the TDX Module.

Looking at the code, the hypercall can return an error if TDCALL fails,
but module_call forces a panic with UD2 on error. This difference isn't
explained anywhere.

-Tony

Subject: Re: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions



On 5/27/21 8:25 AM, Luck, Tony wrote:
>> Guests communicate with VMMs with hypercalls. Historically, these
>> are implemented using instructions that are known to cause VMEXITs
>> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
>> expose guest state to the host.  This prevents the old hypercall
>> mechanisms from working. So to communicate with VMM, TDX
>> specification defines a new instruction called "tdcall".
>
> You use all caps TDCALL everywhere else in this commit message.
> Looks odd to have quoted lower case here.

I will use TDCALL uniformly.

>
>> In a TDX based VM, since VMM is an untrusted entity, an intermediary
>> layer (TDX module) exists between host and guest to facilitate the
>> secure communication. TDX guests communicate with the TDX module and
>> with the VMM using a new instruction: TDCALL.
>
> Seems both repeat what was in the first paragraph, but also fail to
> explain how this TDCALL is different from that first TDCALL.

Both cases use the TDCALL instruction. The arguments we pass determine
the type of TDCALL (one used to communicate with the TDX module vs. one
used to communicate with the VMM).

I can modify the description to convey the difference between both
cases.

>
>> Implement common helper functions to communicate with the TDX Module
>> and VMM (using TDCALL instruction).
>>
>> __tdx_hypercall() - request services from the VMM.
>> __tdx_module_call()  - communicate with the TDX Module.
>
> Looking at the code, the hypercall can return an error if TDCALL fails,
> but module_call forces a panic with UD2 on error. This difference isn't
> explained anywhere.

I think you meant the hypercall will panic vs. the module call will not.

In the hypercall case, since we use the same TDCALL instruction, we have
two return values. One is for TDCALL failure (at the TDX module level)
and the other is the return value from the VMM. So in the hypercall case,
we return the VMM value to the caller but panic on TDCALL failures. As
per the TDX spec, for the hypercall use case, if everything is in order,
TDCALL will never fail. If we notice a TDCALL failure, it means we are
working with a broken TDX module. So we panic.
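(A minimal userspace C model of the semantics described above: the TDCALL
status and the VMM return value are separate, and a failing TDCALL is fatal.
The names, types, and values here are illustrative only, not the kernel's
actual API.)

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical model of the two return values described above:
 * tdcall_status is the result of the TDCALL instruction itself
 * (at the TDX module level), vmm_ret is the VMM's answer to the
 * hypercall.
 */
struct hcall_result {
	uint64_t tdcall_status;	/* 0 == TDCALL itself succeeded */
	uint64_t vmm_ret;	/* VMM return value, handed to the caller */
};

static uint64_t do_hypercall(struct hcall_result res)
{
	/*
	 * A failing TDCALL means the TDX module itself is broken;
	 * the guest cannot continue safely, so treat it as fatal
	 * (the kernel panics via ud2 at this point).
	 */
	if (res.tdcall_status != 0)
		abort();

	/* Otherwise the VMM's return value goes back to the caller. */
	return res.vmm_ret;
}
```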

> -Tony
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-27 18:44:04

by Luck, Tony

Subject: RE: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

>> Looking at the code, the hypercall can return an error if TDCALL fails,
>> but module_call forces a panic with UD2 on error. This difference isn't
>> explained anywhere.
>
> I think you meant the hypercall will panic vs. the module call will not.

yes

> In the hypercall case, since we use the same TDCALL instruction, we have
> two return values. One is for TDCALL failure (at the TDX module level)
> and the other is the return value from the VMM. So in the hypercall case,
> we return the VMM value to the caller but panic on TDCALL failures. As
> per the TDX spec, for the hypercall use case, if everything is in order,
> TDCALL will never fail. If we notice a TDCALL failure, it means we are
> working with a broken TDX module. So we panic.

Add a comment in the .S file right before that ud2 explaining this. That
should help anyone tracking down that panic understand that the problem
is in the TDX module.

Otherwise looks ok.

Reviewed-by: Tony Luck <[email protected]>

-Tony

2021-05-27 21:45:43

by Luck, Tony

Subject: RE: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest

+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ u64 gla;
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};

I guess that "gla" = Guest Linear Address ... which is a very "Intel" way of
describing what everyone else would call a Guest Virtual Address.

I don't feel strongly about this though. If this has already been hashed
out then stick with this name.

Otherwise:

Reviewed-by: Tony Luck <[email protected]>

-Tony

2021-05-27 23:23:28

by Sean Christopherson

Subject: Re: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest

On Thu, May 27, 2021, Luck, Tony wrote:
> +struct ve_info {
> + u64 exit_reason;
> + u64 exit_qual;
> + u64 gla;
> + u64 gpa;
> + u32 instr_len;
> + u32 instr_info;
> +};
>
> I guess that "gla" = Guest Linear Address ... which is a very "Intel" way of
> describing what everyone else would call a Guest Virtual Address.
>
> I don't feel strongly about this though. If this has already been hashed
> out then stick with this name.

The "real" #VE information area that TDX is usurping is an architectural struct
that defines exit_reason, exit_qual, gla, and gpa, and those fields in turn come
directly from their corresponding VMCS fields with longer versions of the same
names, e.g. ve_info->gla is a reflection of vmcs.GUEST_LINEAR_ADDRESS.

So normally I would agree that the "linear" terminology is obnoxious, but in
this specific case I think it's warranted.

2021-05-27 23:23:29

by Dave Hansen

Subject: Re: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest

On 5/27/21 9:24 AM, Sean Christopherson wrote:
> On Thu, May 27, 2021, Luck, Tony wrote:
>> +struct ve_info {
>> + u64 exit_reason;
>> + u64 exit_qual;
>> + u64 gla;
>> + u64 gpa;
>> + u32 instr_len;
>> + u32 instr_info;
>> +};
>>
>> I guess that "gla" = Guest Linear Address ... which is a very "Intel" way of
>> describing what everyone else would call a Guest Virtual Address.
>>
>> I don't feel strongly about this though. If this has already been hashed
>> out then stick with this name.
> The "real" #VE information area that TDX is usurping is an architectural struct
> that defines exit_reason, exit_qual, gla, and gpa, and those fields in turn come
> directly from their corresponding VMCS fields with longer versions of the same
> names, e.g. ve_info->gla is a reflection of vmcs.GUEST_LINEAR_ADDRESS.
>
> So normally I would agree that the "linear" terminology is obnoxious, but in
> this specific case I think it's warranted.

The architectural name needs to be *somewhere*. But, we do diverge from
the naming in plenty of places. The architectural name "XSTATE_BV" is
called xstate.xfeatures in the FPU code, for instance.

In this case, the _least_ we can do is:

u64 gla; /* Guest Linear (virtual) Address */

although I also wouldn't mind if we did something like:

u64 guest_vaddr; /* Guest Linear Address (gla) in the spec */

either.

Subject: [RFC v2-fix-v4 1/1] x86/boot: Avoid #VE during boot for TDX platforms

From: Sean Christopherson <[email protected]>

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are all
in the correct state before the kernel loads, which means the
kernel has no need to modify them.

The conditions to avoid are:

  * Any writes to the EFER MSR
  * Clearing CR0.NE
  * Clearing CR4.MCE

This theoretically makes guest boot more fragile. If, for
instance, EFER was set up incorrectly and a WRMSR was performed,
it will trigger an early exception panic, or a triple fault if it
happens before early exceptions are set up. However, this is
likely to trip up the guest BIOS long before control reaches the
kernel. In any case, these kinds of problems are unlikely to
occur in production environments, and developers have good debug
tools to fix them quickly.
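(The write-avoidance pattern used throughout the patch below can be sketched
in userspace C as "read, compute, write only if changed". The function name,
return convention, and the idea of modeling the MSR as a plain variable are
illustrative assumptions, not the actual kernel code.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFER_SCE (1ULL << 0)	/* System Call Extensions */
#define EFER_NX  (1ULL << 11)	/* No-Execute enable */

/*
 * Model of the "rdmsr, compare, wrmsr only if changed" pattern the
 * patch uses to avoid a WRMSR (and thus a #VE in a TDX guest) when
 * EFER already holds the desired value. Returns true if a write
 * would have been issued.
 */
static bool efer_setup(uint64_t *efer, bool nx_supported)
{
	uint64_t old = *efer;		/* rdmsr */
	uint64_t val = old | EFER_SCE;	/* enable system calls */

	if (nx_supported)
		val |= EFER_NX;

	if (val == old)
		return false;		/* skip wrmsr: no #VE risk */

	*efer = val;			/* wrmsr */
	return true;
}
```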

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since RFC v2-fix-v3:
* Removed unnecessary contents from commit log. No code changes.

Changes since RFC v2-fix-v2:
* Fixed commit log as per review comments.

Changes since RFC v2-fix:
* Fixed commit and comments as per Dave and Dan's suggestions.
* Merged CR0.NE related change in pa_trampoline_compat() from patch
titled "x86/boot: Add a trampoline for APs booting in 64-bit mode"
to this patch. It belongs in this patch.
* Merged TRAMPOLINE_32BIT_CODE_SIZE related change from patch titled
"x86/boot: Add a trampoline for APs booting in 64-bit mode" to this
patch (since it was wrongly merged to that patch during patch split).

arch/x86/boot/compressed/head_64.S | 16 ++++++++++++----
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/kernel/head_64.S | 20 ++++++++++++++++++--
arch/x86/realmode/rm/trampoline_64.S | 23 +++++++++++++++++++----
4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..f848569e3fb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,20 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx

/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ movl %cr4, %eax
+ /*
+ * Clear all bits except CR4.MCE, which is preserved.
+ * Clearing CR4.MCE will #VE in TDX guests.
+ */
+ andl $X86_CR4_MCE, %eax
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
@@ -635,8 +643,8 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax

- /* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ /* Enable paging again. Avoid clearing X86_CR0_NE for TDX */
+ movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..6cf8d126b80a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:

/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ movq %cr4, %rcx
+ /*
+ * Clear all bits except CR4.MCE, which is preserved.
+ * Clearing CR4.MCE will #VE in TDX guests.
+ */
+ andl $X86_CR4_MCE, %ecx
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -229,13 +235,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ /*
+ * Preserve current value of EFER for comparison and to skip
+ * EFER writes if no change was made (for TDX guest)
+ */
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */

+ /* Avoid writing EFER if no change was made (for TDX guest) */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 957bb21ce105..f121f5e29d50 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,27 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3

# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ /*
+ * Skip writing to EFER if the register already has desired
+ * value (to avoid #VE for the TDX guest).
+ */
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr

- # Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+ /*
+ * Enable paging and in turn activate Long Mode. Avoid clearing
+ * X86_CR0_NE for TDX.
+ */
+ movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0

/*
@@ -169,7 +183,8 @@ SYM_CODE_START(pa_trampoline_compat)
movl $rm_stack_end, %esp
movw $__KERNEL_DS, %dx

- movl $X86_CR0_PE, %eax
+ /* Avoid clearing X86_CR0_NE for TDX */
+ movl $(X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0
ljmpl $__KERNEL32_CS, $pa_startup_32
SYM_CODE_END(pa_trampoline_compat)
--
2.25.1

2021-05-31 17:00:38

by Borislav Petkov

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On Tue, May 25, 2021 at 11:21:21AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Following is the sample implementation. Please let me know your
> comments.

Doesn't look like what I suggested here:

https://lkml.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/31/21 8:13 AM, Borislav Petkov wrote:
> On Tue, May 25, 2021 at 11:21:21AM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> Following is the sample implementation. Please let me know your
>> comments.
>
> Doesn't look like what I suggested here:
>
> https://lkml.kernel.org/r/[email protected]

IIUC, the following are your design suggestions:

1. Define generic flags.

I think the following flags are defined as you have suggested.

+++ b/include/linux/protected_guest.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define VM_MEM_ENCRYPT 0x100
+/* Encryption support is active */
+#define VM_MEM_ENCRYPT_ACTIVE 0x101
+/* Support for unrolled string IO */
+#define VM_UNROLL_STRING_IO 0x102
+/* Support for host memory encryption */
+#define VM_HOST_MEM_ENCRYPT 0x103

2. Define generic functions and allow calls to arch specific implementations.

For the above requirement, instead of calling arch specific functions from
include/linux/protected_guest.h, I have directly included the arch specific
header in linux/protected_guest.h:

+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+#include <asm/protected_guest.h>
+#else
+static inline bool is_protected_guest(void) { return false; }
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+#endif

3. Arch specific implementations respond to protected_guest_has() calls, right?

I think the above requirement is satisfied in the following implementation.

+++ b/arch/x86/include/asm/protected_guest.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_PROTECTED_GUEST
+#define _ASM_PROTECTED_GUEST 1
+
+#include <asm/cpufeature.h>
+#include <asm/tdx.h>
+
+/* Only include through linux/protected_guest.h */
+
+static inline bool is_protected_guest(void)
+{
+ return boot_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+ if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+ return tdx_protected_guest_has(flag);
+
+ return false;
+}
+

Did I misunderstand anything? Please let me know your comments.



>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-31 17:58:28

by Borislav Petkov

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On Mon, May 31, 2021 at 10:32:44AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I think the above requirement is satisfied in the following implementation.

Well, I suggested a single protected_guest_has() function which does:

if (AMD)
amd_protected_guest_has(...)
else if (Intel)
intel_protected_guest_has(...)
else
WARN()

where amd_protected_guest_has() is implemented in arch/x86/kernel/sev.c
and intel_protected_guest_has() is implemented in, as far as I can
follow your paths in the diff, in arch/x86/kernel/tdx.c.

No is_protected_guest() and no ARCH_HAS_PROTECTED_GUEST.

Just the above controlled by CONFIG_INTEL_TDX_GUEST or whatever
the TDX config item is gonna end up being and on the AMD side by
CONFIG_AMD_MEM_ENCRYPT.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
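
(The single-dispatch shape Boris describes above could be sketched in
userspace C as follows. The vendor enum, the flag value, and the per-vendor
stub bodies are purely illustrative stand-ins, not the real sev.c/tdx.c
implementations.)

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value, mirroring the VM_MEM_ENCRYPT idea above. */
#define VM_MEM_ENCRYPT 0x100

enum vendor { VENDOR_AMD, VENDOR_INTEL, VENDOR_OTHER };

/*
 * Stand-ins for the per-vendor implementations that would live in
 * arch/x86/kernel/sev.c and arch/x86/kernel/tdx.c respectively.
 */
static bool amd_protected_guest_has(unsigned long flag)
{
	return flag == VM_MEM_ENCRYPT;
}

static bool intel_protected_guest_has(unsigned long flag)
{
	return flag == VM_MEM_ENCRYPT;
}

/* Single entry point, dispatching on the CPU vendor as suggested. */
static bool protected_guest_has(enum vendor v, unsigned long flag)
{
	switch (v) {
	case VENDOR_AMD:
		return amd_protected_guest_has(flag);
	case VENDOR_INTEL:
		return intel_protected_guest_has(flag);
	default:
		return false;	/* the kernel version would WARN() here */
	}
}
```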

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()



On 5/31/21 10:55 AM, Borislav Petkov wrote:
> On Mon, May 31, 2021 at 10:32:44AM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> I think the above requirement is satisfied in the following implementation.
>
> Well, I suggested a single protected_guest_has() function which does:
>
> if (AMD)
> amd_protected_guest_has(...)
> else if (Intel)
> intel_protected_guest_has(...)
> else
> WARN()
>
> where amd_protected_guest_has() is implemented in arch/x86/kernel/sev.c
> and intel_protected_guest_has() is implemented in, as far as I can
> follow your paths in the diff, in arch/x86/kernel/tdx.c.
>
> No is_protected_guest()

is_protected_guest() is a helper function added to check the VM guest type
(protected or normal). Andi is going to add some security hardening code to
virtio and some other generic drivers. He wants a helper function to
selectively enable them for all protected guests. Since these are generic
drivers, we need a generic (non arch specific) helper call.
is_protected_guest() is proposed for this purpose.

We can also use protected_guest_has(VM_VIRTIO_SECURE_FIX) or something
similar for this purpose. Andi, any comments?

> and no ARCH_HAS_PROTECTED_GUEST.

IMHO, it's better to use the above generic config option in the common
header file (linux/protected_guest.h). Any architecture that implements
the protected guest feature can enable it. This will help us hide arch
specific config options in an arch specific header file.

This seems to be a cleaner solution than including arch specific
CONFIG options in the common header file (linux/protected_guest.h):

#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
#include <asm/protected_guest.h>
#else
blah
#endif

is better than

#ifdef (AMD)
amd_call()
#endif

#ifdef (INTEL)
intel_call()
#endif

#ifdef (ARM)
arm_call()
#endif


>
> Just the above controlled by CONFIG_INTEL_TDX_GUEST or whatever
> the TDX config item is gonna end up being and on the AMD side by
> CONFIG_AMD_MEM_ENCRYPT.
>
> Thx.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-05-31 19:16:09

by Borislav Petkov

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

On Mon, May 31, 2021 at 11:45:38AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> We can also use protected_guest_has(VM_VIRTIO_SECURE_FIX) or something
> similar for this purpose. Andi, any comments?

protected_guest_has() is enough for that - no need for two functions.

> IMHO, it's better to use the above generic config option in the common
> header file (linux/protected_guest.h). Any architecture that implements
> the protected guest feature can enable it. This will help us hide arch
> specific config options in an arch specific header file.

You define empty function stubs for when the arch config option is not
enabled. Everything else is unnecessary. When another architecture needs
this, then another architecture will generalize it like it is usually
done.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-31 21:47:15

by Kirill A. Shutemov

Subject: Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

On Thu, May 20, 2021 at 01:56:13PM -0700, Dave Hansen wrote:
> On 5/20/21 1:16 PM, Sean Christopherson wrote:
> > On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> >> So what is your proposal? "tdx_guest_" / "tdx_host_" ?
> > 1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
> > to deal with the shared vs. private inversion and avoid tdg_shared_mask
> > altogether.
>
> One example here would be to keep a structure like:
>
> struct protected_mem_config
> {
> unsigned long p_set_bits;
> unsigned long p_clear_bits;
> }
>
> Where 'p_set_bits' are the bits that need to be set to establish memory
> protection and 'p_clear_bits' are the bits that need to be cleared.
> physical_mask would clear both of them:
>
> physical_mask &= ~(pmc.p_set_bits | pmc.p_clear_bits);

To me it looks like abstraction for the sake of abstraction. More levels
of indirection without clear benefit. It doesn't add any readability:
would you know what 'p_set_bits' stands for in two months? I'm not sure.

I would rather leave an explicit check for the protection flavour. It
provides better context for the reader.

--
Kirill A. Shutemov

Subject: [RFC v2-fix-v3 1/1] x86/tdx: ioapic: Add shared bit for IOAPIC base address

From: Isaku Yamahata <[email protected]>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as
shared. However, the IOAPIC code does not use ioremap() and instead
uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix-v2:
* Replaced is_tdx_guest() call with protected_guest_has() call.
* Used pgprot_protected_guest() instead of prot_tdg_shared().

Changes since RFC v2:
* Fixed commit log and comment as per review comment

arch/x86/kernel/apic/io_apic.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..9c0dff0d7aa4 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/memblock.h>
#include <linux/msi.h>
+#include <linux/protected_guest.h>

#include <asm/irqdomain.h>
#include <asm/io.h>
@@ -2675,6 +2676,18 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}

+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+ phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ /* Set TDX guest shared bit in pgprot flags */
+ if (protected_guest_has(VM_SHARED_MAPPING_INIT))
+ flags = pgprot_protected_guest(flags);
+
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2720,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2836,7 +2849,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;

- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.25.1

Subject: [RFC v2-fix-v1 1/1] x86/kvm: Use bounce buffers for TD guest

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to directly access guest private
memory. Any memory that is required for communication with the
VMM must be shared explicitly. The same rule applies for any
DMA to and from a TDX guest. All DMA pages have to be marked as
shared pages. A generic way to achieve this without any changes
to device drivers is to use the SWIOTLB framework.

This method of handling is similar to AMD SEV, so extend this
support to TDX guests as well. Also, since there is some common
code between AMD SEV and TDX guests in mem_encrypt_init(), move it
to mem_encrypt_common.c and call the AMD specific init function
from it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Fixed commit log as per review comments.
* Instead of moving all AMD related changes to mem_encrypt_common.c,
created a AMD specific helper function amd_mem_encrypt_init() and
called it from mem_encrypt_init().
* Removed redundant changes in arch/x86/kernel/pci-swiotlb.c.

arch/x86/include/asm/mem_encrypt_common.h | 2 ++
arch/x86/kernel/tdx.c | 3 +++
arch/x86/mm/mem_encrypt.c | 5 +----
arch/x86/mm/mem_encrypt_common.c | 16 ++++++++++++++++
4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..48d98a3d64fd 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,13 @@

#ifdef CONFIG_AMD_MEM_ENCRYPT
bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
#else /* CONFIG_AMD_MEM_ENCRYPT */
static inline bool amd_force_dma_unencrypted(struct device *dev)
{
return false;
}
+static inline void amd_mem_encrypt_init(void) {}
#endif /* CONFIG_AMD_MEM_ENCRYPT */

#endif
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e84ae4f302b8..31aa47ba8f91 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,7 @@
#include <asm/vmx.h>
#include <asm/insn.h>
#include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>

#include <linux/cpu.h>
#include <linux/protected_guest.h>
@@ -536,6 +537,8 @@ void __init tdx_early_init(void)

legacy_pic = &null_legacy_pic;

+ swiotlb_force = SWIOTLB_FORCE;
+
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
NULL, tdg_cpu_offline_prepare);

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5a81f73dd61e..073f2105b4af 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -467,14 +467,11 @@ static void print_mem_encrypt_feature_info(void)
}

/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
{
if (!sme_me_mask)
return;

- /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
- swiotlb_update_mem_attributes();
-
/*
* With SEV, we need to unroll the rep string I/O instructions,
* but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 661c9457c02e..24c9117547b4 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,6 +9,7 @@

#include <asm/mem_encrypt_common.h>
#include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
@@ -21,3 +22,18 @@ bool force_dma_unencrypted(struct device *dev)

return false;
}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+ /*
+ * For TDX guest or SEV/SME, call into SWIOTLB to update
+ * the SWIOTLB DMA buffers
+ */
+ if (sme_me_mask || protected_guest_has(VM_MEM_ENCRYPT))
+ swiotlb_update_mem_attributes();
+
+ if (sme_me_mask)
+ amd_mem_encrypt_init();
+}
+
--
2.25.1

Subject: [RFC v2-fix-v2 1/1] x86/tdx: Make DMA pages shared

From: "Kirill A. Shutemov" <[email protected]>

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).
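
(A hypothetical sketch of the shared-bit arithmetic described above. The
actual bit position is discovered at runtime via TDINFO; bit 51 and the
helper names here are purely illustrative.)

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assume, for illustration only, that the shared bit is GPA bit 51.
 * In a real TDX guest this comes from the TDINFO module call, and
 * the kernel stores it in tdg_shared_mask().
 */
#define TDG_SHARED_BIT (1ULL << 51)

/* Mark a guest physical address as shared with the VMM. */
static uint64_t mark_shared(uint64_t gpa)
{
	return gpa | TDG_SHARED_BIT;
}

/* Mark a guest physical address as TDX-protected private memory. */
static uint64_t mark_private(uint64_t gpa)
{
	return gpa & ~TDG_SHARED_BIT;
}
```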

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDX hypercall.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range
when converting memory to private. The 4k page size limit is due
to a current TDX spec restriction. Also, if the GPA (range) was
already mapped as an active, private page, the host VMM may remove
the private page from the TD by following the "Removing TD Private
Pages" sequence in the Intel TDX-module specification [1] to safely
block the mapping(s), flush the TLB and cache, and remove the
mapping(s).

BUG() if TDACCEPTPAGE fails (except the "previously accepted page"
case), as the guest is completely hosed if it can't access memory.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

Tested-by: Kai Huang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix:
* Using a separate file for Intel TDX specific memory initialization
breaks binary compatibility (between AMD/TDX). So reverted back to the
older version.
* Fixed the commit log to reflect the above change.
* Replaced is_tdx_guest() checks with appropriate
protected_guest_has() checks.
* Used tdx_hcall_gpa_intent() instead of __tdg_map_gpa() call.
* Removed __tdg_map_gpa() helper function and added tdg_accept_page()
related changes to tdx_hcall_gpa_intent().
* Used pgprot_pg_shared_mask() macro for __pgprot(tdg_shared_mask()).
* Fixed commit log as per review comments.

Changes since RFC v2:
* Since the common code between AMD-SEV and TDX is very minimal,
defining a new config (X86_MEM_ENCRYPT_COMMON) for common code
is not very useful. So created a separate file for Intel TDX
specific memory initialization (similar to AMD SEV).
* Removed patch titled "x86/mm: Move force_dma_unencrypted() to
common code" from this series. And merged required changes in
this patch.

arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/tdx.c | 34 ++++++++++++++++++-----
arch/x86/mm/mem_encrypt_common.c | 3 +++
arch/x86/mm/pat/set_memory.c | 46 +++++++++++++++++++++++++++-----
4 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7988e1fc2ce9..87c93815c4d7 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -24,6 +24,7 @@
/* Make the page accesable by VMM for protected guests */
#define pgprot_protected_guest(prot) __pgprot(pgprot_val(prot) | \
tdg_shared_mask())
+#define pgprot_pg_shared_mask() __pgprot(tdg_shared_mask())

#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 07610eab1c64..e84ae4f302b8 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -15,10 +15,14 @@
/* TDX Module call Leaf IDs */
#define TDINFO 1
#define TDGETVEINFO 3
+#define TDACCEPTPAGE 6

/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001

+/* TDX Module call error codes */
+#define TDX_PAGE_ALREADY_ACCEPTED 0x8000000000000001
+
#define VE_GET_IO_TYPE(exit_qual) (((exit_qual) & 8) ? 0 : 1)
#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
@@ -126,25 +130,43 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

+static void tdg_accept_page(phys_addr_t gpa)
+{
+ u64 ret;
+
+ ret = __tdx_module_call(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+ BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
/*
* Inform the VMM of the guest's intent for this physical page:
* shared with the VMM or private to the guest. The VMM is
* expected to change its mapping of the page in response.
- *
- * Note: shared->private conversions require further guest
- * action to accept the page.
*/
int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
enum tdx_map_type map_type)
{
- u64 ret;
+ u64 ret = 0;
+ int i;

if (map_type == TDX_MAP_SHARED)
gpa |= tdg_shared_mask();

- ret = tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+ if (tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0))
+ ret = -EIO;

- return ret ? -EIO : 0;
+ if (ret || map_type == TDX_MAP_SHARED)
+ return ret;
+
+ /*
+ * For shared->private conversion, accept the page using TDACCEPTPAGE
+ * TDX module call.
+ */
+ for (i = 0; i < numpages; i++)
+ tdg_accept_page(gpa + i * PAGE_SIZE);
+
+ return 0;
}

static __cpuidle void tdg_halt(void)
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 4a9a4d5f36cd..661c9457c02e 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -16,5 +16,8 @@ bool force_dma_unencrypted(struct device *dev)
if (sev_active() || sme_active())
return amd_force_dma_unencrypted(dev);

+ if (protected_guest_has(VM_MEM_ENCRYPT))
+ return true;
+
return false;
}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..56ea2079cc36 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
#include <asm/proto.h>
#include <asm/memtype.h>
#include <asm/set_memory.h>
+#include <asm/tdx.h>

#include "../mm_internal.h"

@@ -1972,13 +1973,16 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
{
+ pgprot_t mem_protected_bits, mem_plain_bits;
struct cpa_data cpa;
+ enum tdx_map_type map_type;
int ret;

/* Nothing to do if memory encryption is not active */
- if (!mem_encrypt_active())
+ if (!mem_encrypt_active() &&
+ !protected_guest_has(VM_MEM_ENCRYPT_ACTIVE))
return 0;

/* Should not be working on unaligned addresses */
@@ -1988,8 +1992,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
- cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
- cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+ if (protected_guest_has(VM_SHARED_MAPPING_INIT)) {
+ mem_protected_bits = __pgprot(0);
+ mem_plain_bits = pgprot_pg_shared_mask();
+ } else {
+ mem_protected_bits = __pgprot(_PAGE_ENC);
+ mem_plain_bits = __pgprot(0);
+ }
+
+ if (protect) {
+ cpa.mask_set = mem_protected_bits;
+ cpa.mask_clr = mem_plain_bits;
+ map_type = TDX_MAP_PRIVATE;
+ } else {
+ cpa.mask_set = mem_plain_bits;
+ cpa.mask_clr = mem_protected_bits;
+ map_type = TDX_MAP_SHARED;
+ }
+
cpa.pgd = init_mm.pgd;

/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2019,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

/*
* Before changing the encryption attribute, we need to flush caches.
+ *
+ * For TDX we need to flush caches on private->shared. VMM is
+ * responsible for flushing on shared->private.
*/
- cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ if (is_tdx_guest()) {
+ if (map_type == TDX_MAP_SHARED)
+ cpa_flush(&cpa, 1);
+ } else {
+ cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ }

ret = __change_page_attr_set_clr(&cpa, 1);

@@ -2012,18 +2041,21 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);

+ if (!ret && protected_guest_has(VM_SHARED_MAPPING_INIT))
+ ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
+
return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, true);
+ return __set_memory_protect(addr, numpages, true);
}
EXPORT_SYMBOL_GPL(set_memory_encrypted);

int set_memory_decrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, false);
+ return __set_memory_protect(addr, numpages, false);
}
EXPORT_SYMBOL_GPL(set_memory_decrypted);

--
2.25.1

Subject: [RFC v2-fix-v1 1/1] x86/tdx: Make pages shared in ioremap()

From: "Kirill A. Shutemov" <[email protected]>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, as DMA buffers are backed by normal memory. The DMA code makes
buffers shared with set_memory_decrypted().

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2:
* Replaced is_tdx_guest() checks with protected_guest_has() calls.
* Renamed pgprot_tdg_shared() to pgprot_protected_guest()

arch/x86/include/asm/pgtable.h | 4 ++++
arch/x86/mm/ioremap.c | 9 ++++++---
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..7988e1fc2ce9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,10 @@
#define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
#define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))

+/* Make the page accessible by VMM for protected guests */
+#define pgprot_protected_guest(prot) __pgprot(pgprot_val(prot) | \
+ tdg_shared_mask())
+
#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
#include <asm/fpu/xstate.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..f0d31f6fd98c 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -17,6 +17,7 @@
#include <linux/mem_encrypt.h>
#include <linux/efi.h>
#include <linux/pgtable.h>
+#include <linux/protected_guest.h>

#include <asm/set_memory.h>
#include <asm/e820/api.h>
@@ -87,12 +88,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
}

/*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in the TDX case) because the whole memory there is already encrypted.
*/
static unsigned int __ioremap_check_encrypted(struct resource *res)
{
- if (!sev_active())
+ if (!sev_active() && !protected_guest_has(VM_MEM_ENCRYPT))
return 0;

switch (res->desc) {
@@ -244,6 +245,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+ else if (protected_guest_has(VM_SHARED_MAPPING_INIT))
+ prot = pgprot_protected_guest(prot);

switch (pcm) {
case _PAGE_CACHE_MODE_UC:
--
2.25.1

Subject: [RFC v2-fix-v1 1/1] x86/tdx: Exclude Shared bit from physical_mask

From: "Kirill A. Shutemov" <[email protected]>

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

Add a helper, tdg_shared_mask() to generate the mask. The processor
enumerates its physical address width to include the shared bit, which
means it gets included in __PHYSICAL_MASK by default.

Remove the shared mask from 'physical_mask' since any bits in
tdg_shared_mask() are not used for physical addresses in page table
entries.

Also, note that the shared mapping configuration cannot be combined
into a common function for the AMD SME and Intel TDX guest platforms.
SME has to do it very early, in __startup_64(), as it sets the bit on
all memory except what is used for communication. TDX can postpone it,
as it doesn't need any shared mappings in very early boot.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC-v2:
* Renamed __PHYSICAL_MASK to physical_mask in commit subject.
* Fixed commit log as per review comments.

arch/x86/Kconfig | 1 +
arch/x86/include/asm/tdx.h | 6 ++++++
arch/x86/kernel/tdx.c | 9 +++++++++
3 files changed, 16 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7bc371d8ad7d..7e7ac99c4f4c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
select ARCH_HAS_PROTECTED_GUEST
+ select X86_MEM_ENCRYPT_COMMON
help
Provide support for running in a trusted domain on Intel processors
equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index dfdb303ef7e2..0808cbbde045 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -118,6 +118,8 @@ do { \
} while (0)
#endif

+extern phys_addr_t tdg_shared_mask(void);
+
#else // !CONFIG_INTEL_TDX_GUEST

static inline bool is_tdx_guest(void)
@@ -137,6 +139,10 @@ static inline bool tdg_early_handle_ve(struct pt_regs *regs)
return false;
}

+static inline phys_addr_t tdg_shared_mask(void)
+{
+ return 0;
+}
#endif /* CONFIG_INTEL_TDX_GUEST */

#ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 02a3273b09d2..29d4b06535ce 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -101,6 +101,12 @@ bool tdx_protected_guest_has(unsigned long flag)
}
EXPORT_SYMBOL_GPL(tdx_protected_guest_has);

+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdg_shared_mask(void)
+{
+ return 1ULL << (td_info.gpa_width - 1);
+}
+
static void tdg_get_info(void)
{
u64 ret;
@@ -112,6 +118,9 @@ static void tdg_get_info(void)

td_info.gpa_width = out.rcx & GENMASK(5, 0);
td_info.attributes = out.rdx;
+
+ /* Exclude Shared bit from the __PHYSICAL_MASK */
+ physical_mask &= ~tdg_shared_mask();
}

static __cpuidle void tdg_halt(void)
--
2.25.1

Subject: [RFC v2-fix-v2 1/1] x86/mm: Move force_dma_unencrypted() to common code

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to access guest private memory. Any
memory that is required for communication with the VMM must be shared
explicitly by setting the bit in the page table entry. After setting
the shared bit, the conversion must be completed with the MapGPA
TDVMCALL. The call informs the VMM about the conversion between
private/shared mappings. The shared memory is similar to unencrypted
memory in AMD SME/SEV terminology, but the underlying process of
sharing/un-sharing the memory is different for the Intel TDX guest
platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, the AMD SME code makes force_dma_unencrypted()
return true on platforms that support the AMD SEV feature. The DMA
memory allocation API then uses it to trigger set_memory_decrypted()
on those platforms.

TDX is similar. To communicate with I/O devices, the related pages
need to be marked as shared. As mentioned above, shared memory in the
TDX architecture is similar to decrypted memory in AMD SME/SEV. So,
as with AMD SEV, force_dma_unencrypted() has to be forced to return
true. This support is added in other patches in this series.

So move force_dma_unencrypted() out of the AMD-specific code and call
the AMD-specific amd_force_dma_unencrypted() function from it.
force_dma_unencrypted() will be modified by later patches to include
Intel TDX guest platform-specific initialization.

Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has
to be selected by all x86 memory encryption features. This will be
selected by both AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in the DMA code and has no
functional change.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since RFC v2-fix-v1:
* Added mem_encrypt_common.h and moved common encryption
related function declarations to it.
Changes since RFC v2:
* Instead of moving all the contents of force_dma_unencrypted() to
mem_encrypt_common.c, create sub function for AMD and call it
from common code.
* Fixed commit log as per review comments.

arch/x86/Kconfig | 8 ++++++--
arch/x86/include/asm/mem_encrypt_common.h | 18 ++++++++++++++++++
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mem_encrypt.c | 5 +++--
arch/x86/mm/mem_encrypt_common.c | 20 ++++++++++++++++++++
5 files changed, 49 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fc588a64d1a0..7bc371d8ad7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1531,14 +1531,18 @@ config X86_CPA_STATISTICS
helps to determine the effectiveness of preserving large and huge
page mappings when mapping protections are changed.

+config X86_MEM_ENCRYPT_COMMON
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+ select DYNAMIC_PHYSICAL_MASK
+ def_bool n
+
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
- select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
- select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+ select X86_MEM_ENCRYPT_COMMON
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
new file mode 100644
index 000000000000..697bc40a4e3d
--- /dev/null
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
+#define _ASM_X86_MEM_ENCRYPT_COMMON_H
+
+#include <linux/mem_encrypt.h>
+#include <linux/device.h>
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool amd_force_dma_unencrypted(struct device *dev);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline bool amd_force_dma_unencrypted(struct device *dev)
+{
+ return false;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o

+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
+
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..5a81f73dd61e 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -29,6 +29,7 @@
#include <asm/processor-flags.h>
#include <asm/msr.h>
#include <asm/cmdline.h>
+#include <asm/mem_encrypt_common.h>

#include "mm_internal.h"

@@ -390,8 +391,8 @@ bool noinstr sev_es_active(void)
return sev_status & MSR_AMD64_SEV_ES_ENABLED;
}

-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
+/* Override for DMA direct allocation check - AMD specific initialization */
+bool amd_force_dma_unencrypted(struct device *dev)
{
/*
* For SEV, all DMA must be to unencrypted addresses.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..4a9a4d5f36cd
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory Encryption Support Common Code
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <[email protected]>
+ */
+
+#include <asm/mem_encrypt_common.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ if (sev_active() || sme_active())
+ return amd_force_dma_unencrypted(dev);
+
+ return false;
+}
--
2.25.1

Subject: Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()

Hi,

On 5/31/21 12:14 PM, Borislav Petkov wrote:
>> We can also use protected_guest_has(VM_VIRTIO_SECURE_FIX) or something
>> similar for this purpose. Andi, any comments?
> protected_guest_has() is enough for that - no need for two functions.
>
>> IMHO, its better to use above generic config option in common header
>> file (linux/protected_guest.h). Any architecture that implements
>> protected guest feature can enable it. This will help is hide arch
>> specific config options in arch specific header file.
> You define empty function stubs for when the arch config option is not
> enabled. Everything else is unnecessary. When another architecture needs
> this, then another architecture will generalize it like it is usually
> done.

Please check the updated version in email titled "[RFC v2-fix-v2 1/1] x86:
Introduce generic protected guest abstraction".

We can continue the rest of the discussion in that email.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: [RFC v2-fix-v2 0/2] x86/tdx: Handle in-kernel MMIO

This patchset addresses the review comments on the patch titled
"x86/tdx: Handle in-kernel MMIO". Since it requires a
patch split, these patches are sent together.

Changes since RFC v2-fix:
* Introduced "x86/sev-es: Abstract out MMIO instruction
decoding" patch for sharing common code between TDX
and SEV.
* Modified TDX MMIO code to utilize common shared functions.
* Modified commit log to reflect latest changes and to
address review comments.

Changes since RFC v2:
* Fixed commit log as per Dave's review.

Kirill A. Shutemov (2):
x86/sev-es: Abstract out MMIO instruction decoding
x86/tdx: Handle in-kernel MMIO

arch/x86/include/asm/insn-eval.h | 13 +++
arch/x86/kernel/sev.c | 171 ++++++++-----------------------
arch/x86/kernel/tdx.c | 108 +++++++++++++++++++
arch/x86/lib/insn-eval.c | 102 ++++++++++++++++++
4 files changed, 263 insertions(+), 131 deletions(-)

--
2.25.1

2021-06-05 03:38:27

by Dan Williams

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Ignore WBINVD instruction for TDX guest

On Wed, May 26, 2021 at 9:38 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> Functionally only devices outside the CPU (such as DMA devices,
> or persistent memory for flushing) can notice the external side
> effects from WBINVD's cache flushing for write back mappings. One
> exception here is MKTME, but that is not visible outside the TDX
> module and not possible inside a TDX guest.
>
> Currently TDX does not support DMA, because DMA typically needs
> uncached access for MMIO, and the current TDX module always sets
> the IgnorePAT bit, which prevents that.
>
> Persistent memory is also currently not supported. There are some
> other cases that use WBINVD, such as the legacy ACPI sleeps, but
> these are all not supported in virtualization and there are better
> mechanisms inside a guest anyways. The guests usually are not
> aware of power management. Another code path that uses WBINVD is
> the MTRR driver, but EPT/virtualization always disables MTRRs so
> those are not needed. This all implies WBINVD is not needed with
> current TDX.
>
> So handle the WBINVD instruction as nop. Currently, #VE exception
> handler does not include any warning for WBINVD handling because
> ACPI reboot code uses it. This is the same behavior as KVM. It
> only allows WBINVD in a guest when the guest supports VT-d (=DMA),
> but just handles it as a nop if it doesn't .
>
> If TDX ever gets DMA support, or persistent memory support, or
> some other devices that can observe flushing side effects, a
> hypercall can be added to implement it similar to AMD-SEV. But
> current TDX does not need it.

Please just drop this patch. It serves no purpose especially when the
assertion is that nothing in TDX will miss WBINVD. Why would Linux
merge a patch that has no claimed end user benefit? If the only known
usage of WBINVD in a TDX guest is the ACPI reboot path then add an
is_protected_guest() to that one usage.

If a TDX guest runs an unexpected WBINVD that's a bug that needs a kernel fix.

2021-06-08 18:14:30

by Andi Kleen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest


On 6/8/2021 10:53 AM, Dave Hansen wrote:
> On 6/8/21 10:48 AM, Sean Christopherson wrote:
>> On Tue, Jun 08, 2021, Dave Hansen wrote:
>>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>>> +#ifdef CONFIG_INTEL_TDX_GUEST
>>>> +DEFINE_IDTENTRY(exc_virtualization_exception)
>>>> +{
>>>> + struct ve_info ve;
>>>> + int ret;
>>>> +
>>>> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
>>>> +
>>>> + /*
>>>> + * Consume #VE info before re-enabling interrupts. It will be
>>>> + * re-enabled after executing the TDGETVEINFO TDCALL.
>>>> + */
>>>> + ret = tdg_get_ve_info(&ve);
>>> Is it safe to have *anything* before the tdg_get_ve_info()? For
>>> instance, say that RCU_LOCKDEP_WARN() triggers. Will anything in there
>>> do MMIO?
>> I doubt it's safe, anything that's doing printing has the potential to trigger
>> #VE. Even if we can prove it's safe for all possible paths, I can't think of a
>> reason to allow anything that's not absolutely necessary before retrieving the
>> #VE info.
> What about tracing? Can I plop a kprobe in here or turn on ftrace?

I believe neither does mmio/msr normally (except maybe ftrace+tp_printk,
but that will likely work because it shouldn't recurse more than once
due to ftrace's reentry protection)

-Andi

2021-06-08 18:20:22

by Andy Lutomirski

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest



On Tue, Jun 8, 2021, at 11:15 AM, Dave Hansen wrote:
> On 6/8/21 11:12 AM, Andi Kleen wrote:
> > I believe neither does mmio/msr normally (except maybe
> > ftrace+tp_printk, but that will likely work because it shouldn't
> > recurse more than once due to ftrace's reentry protection)
>
> Can it do MMIO:
>
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> =======> HERE
> > + ret = tdg_get_ve_info(&ve);
>
> Recursion isn't the problem. It would double-fault there, right?
>

We should do the get_ve_info in a noinstr region.

2021-06-08 18:20:58

by Andi Kleen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest


On 6/8/2021 11:15 AM, Dave Hansen wrote:
> On 6/8/21 11:12 AM, Andi Kleen wrote:
>> I believe neither does mmio/msr normally (except maybe
>> ftrace+tp_printk, but that will likely work because it shouldn't
>> recurse more than once due to ftrace's reentry protection)
> Can it do MMIO:
>
>> +DEFINE_IDTENTRY(exc_virtualization_exception)
>> +{
> =======> HERE
>> + ret = tdg_get_ve_info(&ve);
> Recursion isn't the problem. It would double-fault there, right?

Yes, that's right. tp_printk already has a lot of other corner cases that
break, though, so it's not a real issue.

-Andi

Subject: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

The current TDX spec does not have support to emulate the WBINVD
instruction. So, add support to skip the WBINVD instruction in
drivers that are currently enabled in the TDX guest.

Functionally only devices outside the CPU (such as DMA devices,
or persistent memory for flushing) can notice the external side
effects from WBINVD's cache flushing for write back mappings.
One exception here is MKTME, but that is not visible outside
the TDX module and not possible inside a TDX guest.

Currently TDX does not support DMA, because DMA typically needs
uncached access for MMIO, and the current TDX module always
sets the IgnorePAT bit, which prevents that.
   
Persistent memory is also currently not supported. Another code
path that uses WBINVD is the MTRR driver, but EPT/virtualization
always disables MTRRs so those are not needed. This all implies
WBINVD is not needed with current TDX.

So, most drivers/code paths that use the WBINVD instruction are
already disabled for TDX guest platforms via config options or BIOS.
The following drivers use the WBINVD instruction and are still
enabled for TDX guests:
   
drivers/acpi/sleep.c
drivers/acpi/acpica/hwsleep.c
   
Since the cache is always coherent in TDX guests, making WBINVD a
no-op should not cause any issues. This behavior is the same as for
a KVM guest.
   
Also, hwsleep shouldn't happen for a TDX guest because the TDX
BIOS won't enable it, but it's better to disable it anyway.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix-v2:
* Instead of handling WBINVD #VE exception as nop, we skip its
usage in currently enabled drivers.
* Adapted commit log for above change.

arch/x86/kernel/tdx.c | 1 +
drivers/acpi/acpica/hwsleep.c | 12 +++++++++---
drivers/acpi/sleep.c | 26 +++++++++++++++++++++++---
include/linux/protected_guest.h | 2 ++
4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1caf9fa5bb30..e33928131e6a 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,6 +100,7 @@ bool tdx_protected_guest_has(unsigned long flag)
case PR_GUEST_MEM_ENCRYPT_ACTIVE:
case PR_GUEST_UNROLL_STRING_IO:
case PR_GUEST_SHARED_MAPPING_INIT:
+ case PR_GUEST_DISABLE_WBINVD:
return true;
}

diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
index 14baa13bf848..9d40df1b8a74 100644
--- a/drivers/acpi/acpica/hwsleep.c
+++ b/drivers/acpi/acpica/hwsleep.c
@@ -9,6 +9,7 @@
*****************************************************************************/

#include <acpi/acpi.h>
+#include <linux/protected_guest.h>
#include "accommon.h"

#define _COMPONENT ACPI_HARDWARE
@@ -108,9 +109,14 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
pm1a_control |= sleep_enable_reg_info->access_bit_mask;
pm1b_control |= sleep_enable_reg_info->access_bit_mask;

- /* Flush caches, as per ACPI specification */
-
- ACPI_FLUSH_CPU_CACHE();
+ /*
+ * The WBINVD instruction is not supported in a TDX
+ * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+ * WBINVD, skip cache flushes for TDX guests.
+ */
+ if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+ /* Flush caches, as per ACPI specification */
+ ACPI_FLUSH_CPU_CACHE();

status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
if (status == AE_CTRL_TERMINATE) {
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index df386571da98..3d6c213481f0 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -18,6 +18,7 @@
#include <linux/acpi.h>
#include <linux/module.h>
#include <linux/syscore_ops.h>
+#include <linux/protected_guest.h>
#include <asm/io.h>
#include <trace/events/power.h>

@@ -71,7 +72,14 @@ static int acpi_sleep_prepare(u32 acpi_state)
acpi_set_waking_vector(acpi_wakeup_address);

}
- ACPI_FLUSH_CPU_CACHE();
+
+ /*
+ * The WBINVD instruction is not supported in a TDX
+ * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+ * WBINVD, skip cache flushes for TDX guests.
+ */
+ if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+ ACPI_FLUSH_CPU_CACHE();
#endif
printk(KERN_INFO PREFIX "Preparing to enter system sleep state S%d\n",
acpi_state);
@@ -566,7 +574,13 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
u32 acpi_state = acpi_target_sleep_state;
int error;

- ACPI_FLUSH_CPU_CACHE();
+ /*
+ * The WBINVD instruction is not supported in a TDX
+ * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+ * WBINVD, skip cache flushes for TDX guests.
+ */
+ if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+ ACPI_FLUSH_CPU_CACHE();

trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
switch (acpi_state) {
@@ -899,7 +913,13 @@ static int acpi_hibernation_enter(void)
{
acpi_status status = AE_OK;

- ACPI_FLUSH_CPU_CACHE();
+ /*
+ * The WBINVD instruction is not supported in a TDX
+ * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+ * WBINVD, skip cache flushes for TDX guests.
+ */
+ if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+ ACPI_FLUSH_CPU_CACHE();

/* This shouldn't return. If it returns, we have a problem */
status = acpi_enter_sleep_state(ACPI_STATE_S4);
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
index adfa62e2615e..0ec4dab86f67 100644
--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h
@@ -18,6 +18,8 @@
#define PR_GUEST_HOST_MEM_ENCRYPT 0x103
/* Support for shared mapping initialization (after early init) */
#define PR_GUEST_SHARED_MAPPING_INIT 0x104
+/* Support to disable WBINVD */
+#define PR_GUEST_DISABLE_WBINVD 0x105

#if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)

--
2.25.1

2021-06-08 21:45:51

by Dan Williams

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

[ add Rafael and linux-acpi ]

On Tue, Jun 8, 2021 at 2:35 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> Current TDX spec does not have support to emulate the WBINVD
> instruction. So, add support to skip WBINVD instruction in
> drivers that are currently enabled in the TDX guest.
>
> Functionally only devices outside the CPU (such as DMA devices,
> or persistent memory for flushing) can notice the external side
> effects from WBINVD's cache flushing for write back mappings.
> One exception here is MKTME, but that is not visible outside
> the TDX module and not possible inside a TDX guest.
>
> Currently TDX does not support DMA, because DMA typically needs
> uncached access for MMIO, and the current TDX module always
> sets the IgnorePAT bit, which prevents that.
>
> Persistent memory is also currently not supported. Another code
> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> always disables MTRRs so those are not needed. This all implies
> WBINVD is not needed with current TDX.
>
> So, most drivers/code-paths that use wbinvd instructions are
> already disabled for TDX guest platforms via config-option/BIOS.
> Following are the list of drivers that use wbinvd instruction
> and are still enabled for TDX guests.
>
> drivers/acpi/sleep.c
> drivers/acpi/acpica/hwsleep.c
>
> Since cache is always coherent in TDX guests, making wbinvd as
> noop should not cause any issues. This behavior is the same as
> KVM guest.
>
> Also, hwsleep shouldn't happen for TDX guest because the TDX
> BIOS won't enable it, but it's better to disable it anyways
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since RFC v2-fix-v2:
> * Instead of handling WBINVD #VE exception as nop, we skip its
> usage in currently enabled drivers.
> * Adapted commit log for above change.
>
> arch/x86/kernel/tdx.c | 1 +
> drivers/acpi/acpica/hwsleep.c | 12 +++++++++---
> drivers/acpi/sleep.c | 26 +++++++++++++++++++++++---
> include/linux/protected_guest.h | 2 ++
> 4 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 1caf9fa5bb30..e33928131e6a 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -100,6 +100,7 @@ bool tdx_protected_guest_has(unsigned long flag)
> case PR_GUEST_MEM_ENCRYPT_ACTIVE:
> case PR_GUEST_UNROLL_STRING_IO:
> case PR_GUEST_SHARED_MAPPING_INIT:
> + case PR_GUEST_DISABLE_WBINVD:
> return true;
> }
>
> diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
> index 14baa13bf848..9d40df1b8a74 100644
> --- a/drivers/acpi/acpica/hwsleep.c
> +++ b/drivers/acpi/acpica/hwsleep.c
> @@ -9,6 +9,7 @@
> *****************************************************************************/
>
> #include <acpi/acpi.h>
> +#include <linux/protected_guest.h>
> #include "accommon.h"
>
> #define _COMPONENT ACPI_HARDWARE
> @@ -108,9 +109,14 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
> pm1a_control |= sleep_enable_reg_info->access_bit_mask;
> pm1b_control |= sleep_enable_reg_info->access_bit_mask;
>
> - /* Flush caches, as per ACPI specification */
> -
> - ACPI_FLUSH_CPU_CACHE();
> + /*
> + * WBINVD instruction is not supported in TDX
> + * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> + * WBINVD, skip cache flushes for TDX guests.
> + */
> + if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> + /* Flush caches, as per ACPI specification */
> + ACPI_FLUSH_CPU_CACHE();
>
> status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
> if (status == AE_CTRL_TERMINATE) {
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index df386571da98..3d6c213481f0 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -18,6 +18,7 @@
> #include <linux/acpi.h>
> #include <linux/module.h>
> #include <linux/syscore_ops.h>
> +#include <linux/protected_guest.h>
> #include <asm/io.h>
> #include <trace/events/power.h>
>
> @@ -71,7 +72,14 @@ static int acpi_sleep_prepare(u32 acpi_state)
> acpi_set_waking_vector(acpi_wakeup_address);
>
> }
> - ACPI_FLUSH_CPU_CACHE();
> +
> + /*
> + * WBINVD instruction is not supported in TDX
> + * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> + * WBINVD, skip cache flushes for TDX guests.
> + */
> + if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> + ACPI_FLUSH_CPU_CACHE();
> #endif
> printk(KERN_INFO PREFIX "Preparing to enter system sleep state S%d\n",
> acpi_state);
> @@ -566,7 +574,13 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
> u32 acpi_state = acpi_target_sleep_state;
> int error;
>
> - ACPI_FLUSH_CPU_CACHE();
> + /*
> + * WBINVD instruction is not supported in TDX
> + * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> + * WBINVD, skip cache flushes for TDX guests.
> + */
> + if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> + ACPI_FLUSH_CPU_CACHE();
>
> trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
> switch (acpi_state) {
> @@ -899,7 +913,13 @@ static int acpi_hibernation_enter(void)
> {
> acpi_status status = AE_OK;
>
> - ACPI_FLUSH_CPU_CACHE();
> + /*
> + * WBINVD instruction is not supported in TDX
> + * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> + * WBINVD, skip cache flushes for TDX guests.
> + */
> + if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> + ACPI_FLUSH_CPU_CACHE();
>
> /* This shouldn't return. If it returns, we have a problem */
> status = acpi_enter_sleep_state(ACPI_STATE_S4);
> diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
> index adfa62e2615e..0ec4dab86f67 100644
> --- a/include/linux/protected_guest.h
> +++ b/include/linux/protected_guest.h
> @@ -18,6 +18,8 @@
> #define PR_GUEST_HOST_MEM_ENCRYPT 0x103
> /* Support for shared mapping initialization (after early init) */
> #define PR_GUEST_SHARED_MAPPING_INIT 0x104
> +/* Support to disable WBINVD */
> +#define PR_GUEST_DISABLE_WBINVD 0x105
>
> #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
>
> --
> 2.25.1
>

2021-06-08 22:39:06

by Andi Kleen

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest


On 6/8/2021 3:17 PM, Dave Hansen wrote:
> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
>> Persistent memory is also currently not supported. Another code
>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>> always disables MTRRs so those are not needed. This all implies
>> WBINVD is not needed with current TDX.
> It's one thing to declare something unsupported. It's quite another to
> declare it unsupported and then back it up with code to ensure that any
> attempted use is thwarted.
>
> This patch certainly shows us half of the solution. But, to be
> complete, we also need to see the other half: where is the patch


We had multiple patches to handle it earlier (by ignoring it, which is
the right way and is deployed successfully everywhere in KVM), but you
all didn't like them.

So they got removed.

You can't have your cake and eat it too. Either you have the
ignore-or-warn patches or you have panic.

In this iteration you have panic (through the exception handler), except
we explicitly ignore it for the cases we know can happen (which is
reboot).


> or
> documentation for why it is not *possible* to encounter persistent
> memory in a TDX guest?


I thought we already went over this ad nauseam.

The current TDX VMMs don't support anything else than plain DRAM.

If there is support for anything else in the future we'll need to add a
new GHCI call that implements WBINVD through the host, but right now we
don't need it.

-Andi.

2021-06-08 22:55:41

by Dave Hansen

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On 6/8/21 3:36 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 6/8/21 3:17 PM, Dave Hansen wrote:
>> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
>>> Persistent memory is also currently not supported. Another code
>>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>>> always disables MTRRs so those are not needed. This all implies
>>> WBINVD is not needed with current TDX.
>>
>> It's one thing to declare something unsupported.  It's quite another to
>> declare it unsupported and then back it up with code to ensure that any
>> attempted use is thwarted.
>
> Only audited and supported drivers will be allowed to enumerate after
> the device filter support patch is merged. Until we merge that patch, if
> any of these unsupported features (with WBINVD usage) are enabled in TDX,
> it will lead to sigfault (due to an unhandled #VE).

A kernel driver using WBINVD will "sigfault"? I'm not sure what that
means. How does the kernel "sigfault"?

> In this patch we only create an exception for the ACPI sleep driver code.
> If the commit log is confusing, I can remove the information about other
> unsupported features (with WBINVD usage).

Yes, the changelog is horribly confusing. But simply removing this
information is insufficient to rectify the deficiency.

I've lost trust that due diligence will be performed on this series on
its own. I've seen too many broken promises and too many holes.

Here's what I want to see: a list of all of the unique call sites for
WBINVD in the kernel. I want a written down methodology for how the
list of call sites was generated. I want to see an item-by-item list of
why those call sites are unreachable with the TDX guest code. It might
be because they've been patched in this patch, or the driver has been
disabled, or because the TDX architecture spec would somehow prohibit
the situation where it might be needed. But, there needs to be a list,
and you have to show your work. If you refer to code from this series
as helping to prevent WBINVD, then it has to be earlier in this series,
not in some other series and not later in this series.

Just eyeballing it, there are ~50 places in the kernel that need auditing.

Right now, we mostly have indiscriminate hand-waving about this not
being a problem. It's a hard NAK from me on this patch until this audit
is in place.

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/8/21 3:53 PM, Dave Hansen wrote:
> On 6/8/21 3:36 PM, Kuppuswamy, Sathyanarayanan wrote:
>> On 6/8/21 3:17 PM, Dave Hansen wrote:
>>> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:

>
> A kernel driver using WBINVD will "sigfault"? I'm not sure what that
> means. How does the kernel "sigfault"?

Sorry, an unsupported #VE is handled similarly to a #GP fault.

>
>> In this patch we only create an exception for the ACPI sleep driver code.
>> If the commit log is confusing, I can remove the information about other
>> unsupported features (with WBINVD usage).
>
> Yes, the changelog is horribly confusing. But simply removing this
> information is insufficient to rectify the deficiency.

I will remove all the unrelated information from this commit log. As long as
the commit log *only* talks about and handles the exception for the ACPI
sleep driver, it should be acceptable to you, right? I will also add a note
that if any other feature with WBINVD usage is enabled, it would lead to a
#GP fault.

>
> I've lost trust that due diligence will be performed on this series on
> its own. I've seen too many broken promises and too many holes.
>
> Here's what I want to see: a list of all of the unique call sites for
> WBINVD in the kernel. I want a written down methodology for how the
> list of call sites was generated. I want to see an item-by-item list of
> why those call sites are unreachable with the TDX guest code. It might
> be because they've been patched in this patch, or the driver has been
> disabled, or because the TDX architecture spec would somehow prohibit
> the situation where it might be needed. But, there needs to be a list,
> and you have to show your work. If you refer to code from this series
> as helping to prevent WBINVD, then it has to be earlier in this series,
> not in some other series and not later in this series.
>
> Just eyeballing it, there are ~50 places in the kernel that need auditing.
>
> Right now, we mostly have indiscriminate hand-waving about this not
> being a problem. It's a hard NAK from me on this patch until this audit
> is in place.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-06-08 23:06:59

by Andi Kleen

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

> A kernel driver using WBINVD will "sigfault"? I'm not sure what that
> means. How does the kernel "sigfault"?

It panics. Please, you know exactly what Sathya meant because you've
read the code.

>
> Here's what I want to see: a list of all of the unique call sites for
> WBINVD in the kernel. I want a written down methodology for how the
> list of call sites was generated. I want to see an item-by-item list of
> why those call sites are unreachable with the TDX guest code. It might
> be because they've been patched in this patch, or the driver has been
> disabled, or because the TDX architecture spec would somehow prohibit
> the situation where it might be needed. But, there needs to be a list,
> and you have to show your work. If you refer to code from this series
> as helping to prevent WBINVD, then it has to be earlier in this series,
> not in some other series and not later in this series.

Sorry, this is ridiculous. We're not in a make-work project here. We're
doing practical engineering, not making our lives artificially complicated.

If that is what is required, then the change requests to NOT ignore but
patch every site were just not practical.

>
> Just eyeballing it, there are ~50 places in the kernel that need auditing.
>
> Right now, we mostly have indiscriminate hand-waving about this not
> being a problem. It's a hard NAK from me on this patch until this audit
> is in place.


Okay then we just go back to ignore like the rest of the KVM world.

That's what we had originally and it's fine because it's exactly what
KVM does, which is all we want.

It was the sane thing to do and it's still the sane thing to do because
it has been always done this way.

-Andi

2021-06-08 23:22:29

by Dan Williams

Subject: Re: [RFC v2-fix-v4 1/1] x86/boot: Avoid #VE during boot for TDX platforms

On Thu, May 27, 2021 at 2:25 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> There are a few MSRs and control register bits which the kernel
> normally needs to modify during boot. But, TDX disallows
> modification of these registers to help provide consistent
> security guarantees. Fortunately, TDX ensures that these are all
> in the correct state before the kernel loads, which means the
> kernel has no need to modify them.
>
> The conditions to avoid are:
>
> * Any writes to the EFER MSR
> * Clearing CR0.NE
> * Clearing CR4.MCE
>
> This theoretically makes guest boot more fragile. If, for
> instance, EFER was set up incorrectly and a WRMSR was performed,
> it will trigger an early exception panic, or a triple fault if it
> happens before early exceptions are set up. However, this is likely to
> trip up the guest BIOS long before control reaches the kernel. In
> any case, these kinds of problems are unlikely to occur in
> production environments, and developers have good debug
> tools to fix them quickly.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

Looks good to me:

Reviewed-by: Dan Williams <[email protected]>

2021-06-09 06:43:39

by Sean Christopherson

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On Tue, Jun 08, 2021, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > + struct ve_info ve;
> > + int ret;
> > +
> > + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > +
> > + /*
> > + * Consume #VE info before re-enabling interrupts. It will be
> > + * re-enabled after executing the TDGETVEINFO TDCALL.
> > + */
> > + ret = tdg_get_ve_info(&ve);
>
> Is it safe to have *anything* before the tdg_get_ve_info()? For
> instance, say that RCU_LOCKDEP_WARN() triggers. Will anything in there
> do MMIO?

I doubt it's safe, anything that's doing printing has the potential to trigger
#VE. Even if we can prove it's safe for all possible paths, I can't think of a
reason to allow anything that's not absolutely necessary before retrieving the
#VE info.

2021-06-09 06:47:50

by Dave Hansen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On 6/8/21 11:12 AM, Andi Kleen wrote:
> I believe neither does mmio/msr normally (except maybe
> ftrace+tp_printk, but that will likely work because it shouldn't
> recurse more than once due to ftrace's reentry protection)

Can it do MMIO:

> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
=======> HERE
> + ret = tdg_get_ve_info(&ve);

Recursion isn't the problem. It would double-fault there, right?

2021-06-09 08:43:20

by Dave Hansen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> + struct ve_info ve;
> + int ret;
> +
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> + /*
> + * Consume #VE info before re-enabling interrupts. It will be
> + * re-enabled after executing the TDGETVEINFO TDCALL.
> + */
> + ret = tdg_get_ve_info(&ve);

Is it safe to have *anything* before the tdg_get_ve_info()? For
instance, say that RCU_LOCKDEP_WARN() triggers. Will anything in there
do MMIO?

2021-06-09 08:56:10

by Dave Hansen

Subject: Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest

On 6/8/21 10:48 AM, Sean Christopherson wrote:
> On Tue, Jun 08, 2021, Dave Hansen wrote:
>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>> +#ifdef CONFIG_INTEL_TDX_GUEST
>>> +DEFINE_IDTENTRY(exc_virtualization_exception)
>>> +{
>>> + struct ve_info ve;
>>> + int ret;
>>> +
>>> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
>>> +
>>> + /*
>>> + * Consume #VE info before re-enabling interrupts. It will be
>>> + * re-enabled after executing the TDGETVEINFO TDCALL.
>>> + */
>>> + ret = tdg_get_ve_info(&ve);
>> Is it safe to have *anything* before the tdg_get_ve_info()? For
>> instance, say that RCU_LOCKDEP_WARN() triggers. Will anything in there
>> do MMIO?
> I doubt it's safe, anything that's doing printing has the potential to trigger
> #VE. Even if we can prove it's safe for all possible paths, I can't think of a
> reason to allow anything that's not absolutely necessary before retrieving the
> #VE info.

What about tracing? Can I plop a kprobe in here or turn on ftrace?

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/8/21 3:17 PM, Dave Hansen wrote:
> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
>> Persistent memory is also currently not supported. Another code
>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>> always disables MTRRs so those are not needed. This all implies
>> WBINVD is not needed with current TDX.
>
> It's one thing to declare something unsupported. It's quite another to
> declare it unsupported and then back it up with code to ensure that any
> attempted use is thwarted.

Only audited and supported drivers will be allowed to enumerate after
the device filter support patch is merged. Until we merge that patch, if
any of these unsupported features (with WBINVD usage) are enabled in TDX,
it will lead to sigfault (due to an unhandled #VE).

In this patch we only create an exception for the ACPI sleep driver code.
If the commit log is confusing, I can remove the information about other
unsupported features (with WBINVD usage).

>
> This patch certainly shows us half of the solution. But, to be
> complete, we also need to see the other half: where is the patch or
> documentation for why it is not *possible* to encounter persistent
> memory in a TDX guest?
>
> BTW, "persistent memory" is much more than Intel Optane DCPMM.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-06-09 16:53:05

by Dave Hansen

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
> Persistent memory is also currently not supported. Another code
> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> always disables MTRRs so those are not needed. This all implies
> WBINVD is not needed with current TDX.

It's one thing to declare something unsupported. It's quite another to
declare it unsupported and then back it up with code to ensure that any
attempted use is thwarted.

This patch certainly shows us half of the solution. But, to be
complete, we also need to see the other half: where is the patch or
documentation for why it is not *possible* to encounter persistent
memory in a TDX guest?

BTW, "persistent memory" is much more than Intel Optane DCPMM.

2021-06-09 16:55:55

by Dave Hansen

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On 6/8/21 4:32 PM, Dan Williams wrote:
>> Persistent memory is also currently not supported. Another code
>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>> always disables MTRRs so those are not needed. This all implies
>> WBINVD is not needed with current TDX.
> Let's drop the last three paragraphs and just say something like:
> "This is one of a series of patches to usages of wbinvd for protected
> guests. For now this just addresses the one known path that TDX
> executes, ACPI reboot. Its usage can be elided because FOO reason and
> all the other ACPI_FLUSH_CPU_CACHE usages can be elided because BAR
> reason"

A better effort at transparency can be made here:

This patches the one WBINVD instance which has been encountered
in practice: ACPI reboot. Assume no other instance will be
encountered.

2021-06-09 16:57:05

by Dan Williams

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021 at 2:35 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> The current TDX spec does not have support to emulate the WBINVD
> instruction. So, add support to skip the WBINVD instruction in
> drivers that are currently enabled in the TDX guest.
>
> Functionally only devices outside the CPU (such as DMA devices,
> or persistent memory for flushing) can notice the external side
> effects from WBINVD's cache flushing for write back mappings.
> One exception here is MKTME, but that is not visible outside
> the TDX module and not possible inside a TDX guest.
>
> Currently TDX does not support DMA, because DMA typically needs
> uncached access for MMIO, and the current TDX module always
> sets the IgnorePAT bit, which prevents that.
>
> Persistent memory is also currently not supported. Another code
> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> always disables MTRRs so those are not needed. This all implies
> WBINVD is not needed with current TDX.

Let's drop the last three paragraphs and just say something like:
"This is one of a series of patches to usages of wbinvd for protected
guests. For now this just addresses the one known path that TDX
executes, ACPI reboot. Its usage can be elided because FOO reason and
all the other ACPI_FLUSH_CPU_CACHE usages can be elided because BAR
reason"

>
> So, most drivers/code-paths that use the WBINVD instruction are
> already disabled for TDX guest platforms via config options/BIOS.
> The following is the list of drivers that use the WBINVD instruction
> and are still enabled for TDX guests.
>
> drivers/acpi/sleep.c
> drivers/acpi/acpica/hwsleep.c
>
> Since the cache is always coherent in TDX guests, making WBINVD
> a no-op should not cause any issues. This behavior is the same as
> for KVM guests.
>
> Also, hwsleep shouldn't happen for a TDX guest because the TDX
> BIOS won't enable it, but it's better to disable it anyway.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since RFC v2-fix-v2:
> * Instead of handling WBINVD #VE exception as nop, we skip its
> usage in currently enabled drivers.
> * Adapted commit log for above change.
>
> arch/x86/kernel/tdx.c | 1 +
> drivers/acpi/acpica/hwsleep.c | 12 +++++++++---
> drivers/acpi/sleep.c | 26 +++++++++++++++++++++++---
> include/linux/protected_guest.h | 2 ++
> 4 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 1caf9fa5bb30..e33928131e6a 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -100,6 +100,7 @@ bool tdx_protected_guest_has(unsigned long flag)
> case PR_GUEST_MEM_ENCRYPT_ACTIVE:
> case PR_GUEST_UNROLL_STRING_IO:
> case PR_GUEST_SHARED_MAPPING_INIT:
> + case PR_GUEST_DISABLE_WBINVD:
> return true;
> }
>
> diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
> index 14baa13bf848..9d40df1b8a74 100644
> --- a/drivers/acpi/acpica/hwsleep.c
> +++ b/drivers/acpi/acpica/hwsleep.c
> @@ -9,6 +9,7 @@
> *****************************************************************************/
>
> #include <acpi/acpi.h>
> +#include <linux/protected_guest.h>
> #include "accommon.h"
>
> #define _COMPONENT ACPI_HARDWARE
> @@ -108,9 +109,14 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
> pm1a_control |= sleep_enable_reg_info->access_bit_mask;
> pm1b_control |= sleep_enable_reg_info->access_bit_mask;
>
> - /* Flush caches, as per ACPI specification */
> -
> - ACPI_FLUSH_CPU_CACHE();
> + /*
> + * WBINVD instruction is not supported in TDX
> + * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> + * WBINVD, skip cache flushes for TDX guests.
> + */
> > + if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> + /* Flush caches, as per ACPI specification */
> + ACPI_FLUSH_CPU_CACHE();

ACPICA uses OS abstractions like ACPI_FLUSH_CPU_CACHE and Linux
patches rarely (never?) change ACPICA directly. If you want to change
ACPICA it goes through the ACPICA project first and is then
"Linux-ized", but in this case I believe you do not need to go that
path. Instead, this wants to change the definition of
ACPI_FLUSH_CPU_CACHE() directly in arch/x86/include/asm/acenv.h and
explain why the other ACPI cache flushing paths / requirements do not
apply to TDX guests.

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/8/21 5:07 PM, Dan Williams wrote:
> That works too, but I assume if ACPI_FLUSH_CPU_CACHE() itself is going
> to be changed rather than sprinkling protected_guest_has() checks in a
> few places it will need to assert why changing all of those at once is
> correct. Otherwise I expect Rafael to ask why this global change of
> the ACPI_FLUSH_CPU_CACHE() policy is ok.

Yes. I am fixing it as below.

--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
#define _ASM_X86_ACENV_H

#include <asm/special_insns.h>
+#include <asm/protected_guest.h>

/* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+#define ACPI_FLUSH_CPU_CACHE() \
+do { \
+ if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD)) \
+ wbinvd(); \
+} while (0)


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-06-09 17:00:25

by Dan Williams

Subject: Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021 at 4:38 PM Dave Hansen <[email protected]> wrote:
>
> On 6/8/21 4:32 PM, Dan Williams wrote:
> >> Persistent memory is also currently not supported. Another code
> >> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> >> always disables MTRRs so those are not needed. This all implies
> >> WBINVD is not needed with current TDX.
> > Let's drop the last three paragraphs and just say something like:
> > "This is one of a series of patches to usages of wbinvd for protected
> > guests. For now this just addresses the one known path that TDX
> > executes, ACPI reboot. Its usage can be elided because FOO reason and
> > all the other ACPI_FLUSH_CPU_CACHE usages can be elided because BAR
> > reason"
>
> A better effort at transparency can be made here:
>
> This patches the one WBINVD instance which has been encountered
> in practice: ACPI reboot. Assume no other instance will be
> encountered.
>

That works too, but I assume if ACPI_FLUSH_CPU_CACHE() itself is going
to be changed rather than sprinkling protected_guest_has() checks in a
few places it will need to assert why changing all of those at once is
correct. Otherwise I expect Rafael to ask why this global change of
the ACPI_FLUSH_CPU_CACHE() policy is ok.