2022-02-24 16:41:57

by Kirill A. Shutemov

Subject: [PATCHv4 00/30] TDX Guest: TDX core support

Hi All,

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs
from the host and from physical attacks by isolating the guest register
state and by encrypting the guest memory. In TDX, a special TDX module
sits between the host and the guest; it runs in a special mode and
manages the guest/host separation.

Please review and consider applying.

More details of TDX guests can be found in Documentation/x86/tdx.rst.

All dependencies of the patchset are in Linus' tree now.

SEV/TDX comparison:
-------------------

TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to make changes in the guest
physical address space.

TDX/VM comparison:
------------------

Some of the key differences between a TD and a regular VM are:

1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE exception
into the guest TD for instructions that need to be emulated, disallowed
MSR accesses, etc.
3. By default, memory is marked as private, and the TD selectively shares it with
the VMM as needed (see the sketch below).
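
For illustration of point 3, a minimal sketch of the guest-side flow using the
existing set_memory_decrypted()/set_memory_encrypted() interface that this
series wires up for TDX. The helper below is hypothetical and only shows the
intended usage, it is not part of the series:

#include <linux/set_memory.h>
#include <linux/gfp.h>

/* Hypothetical example: share one page with the VMM, then make it private again. */
static int example_share_page_with_vmm(void)
{
        unsigned long addr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
        int ret;

        if (!addr)
                return -ENOMEM;

        /* Set the shared bit in the page table entry and inform the VMM (MapGPA) */
        ret = set_memory_decrypted(addr, 1);
        if (ret)
                goto out_free;

        /* ... use the page for communication with the VMM ... */

        /* Convert the page back to private before reusing it for sensitive data */
        ret = set_memory_encrypted(addr, 1);
out_free:
        free_page(addr);
        return ret;
}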

You can find TDX-related documents at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Git tree:

https://github.com/intel/tdx.git guest-upstream

Previous version:

https://lore.kernel.org/r/[email protected]

Changes from v3:
- Rebased on top of merged x86/coco patches
- Sanity build-time check for TDX detection (Cyrill Gorcunov)
- Correction in the documentation regarding #VE for CPUID
Changes from v2:
- Move TDX-Guest-specific code under arch/x86/coco/
- Code shared between host and guest is under arch/x86/virt/
- Fix handling CR4.MCE for !CONFIG_X86_MCE
- A separate patch to clarify CR0.NE situation
- Use u8/u16/u32 for port I/O handler
- Rework TDCALL helpers:
+ consolidation between guest and host
+ clearer interface
+ A new tdx_module_call() that panics if TDCALL fails
- Rework MMIO handling to improve readability
- New generic API to deal with encryption masks
- Move tdx_early_init() before copy_bootdata() (again)
- Rework #VE handling to share more code with the #GP handler
- Rework __set_memory_enc_pgtable() to provide proper abstraction for both
SME/SEV and TDX cases.
- Fix warning on build with X86_MEM_ENCRYPT=y
- ... and more
Changes from v1:
- Rebased to tip/master (94985da003a4).
- Address feedback from Borislav and Josh.
- Wire up KVM hypercalls. Needed to send IPI.

Andi Kleen (1):
x86/tdx: Handle early boot port I/O

Isaku Yamahata (1):
x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (18):
x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y
x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
x86/tdx: Extend the confidential computing API to support TDX guests
x86/tdx: Exclude shared bit from __PHYSICAL_MASK
x86/traps: Add #VE support for TDX guest
x86/tdx: Add HLT support for TDX guests
x86/tdx: Add MSR support for TDX guests
x86/tdx: Handle CPUID via #VE
x86/tdx: Handle in-kernel MMIO
x86: Adjust types used in port I/O helpers
x86: Consolidate port I/O helpers
x86/boot: Allow to hook up alternative port I/O helpers
x86/boot/compressed: Support TDX guest port I/O at decompression time
x86/boot: Set CR0.NE early and keep it set during the boot
x86/tdx: Make pages shared in ioremap()
x86/mm/cpa: Add support for TDX shared memory
x86/kvm: Use bounce buffers for TD guest
ACPICA: Avoid cache flush on TDX guest

Kuppuswamy Sathyanarayanan (8):
x86/tdx: Detect running as a TDX guest in early boot
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper
functions
x86/tdx: Detect TDX at early kernel decompression time
x86/tdx: Add port I/O emulation
x86/tdx: Wire up KVM hypercalls
x86/acpi, x86/boot: Add multiprocessor wake-up support
x86/topology: Disable CPU online/offline control for TDX guests
Documentation/x86: Document TDX kernel architecture

Sean Christopherson (2):
x86/boot: Add a trampoline for booting APs via firmware handoff
x86/boot: Avoid #VE during boot for TDX platforms

Documentation/x86/index.rst | 1 +
Documentation/x86/tdx.rst | 196 ++++++++
arch/x86/Kconfig | 15 +
arch/x86/boot/a20.c | 14 +-
arch/x86/boot/boot.h | 35 +-
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/head_64.S | 27 +-
arch/x86/boot/compressed/misc.c | 26 +-
arch/x86/boot/compressed/misc.h | 4 +-
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/boot/compressed/tdcall.S | 3 +
arch/x86/boot/compressed/tdx.c | 98 ++++
arch/x86/boot/compressed/tdx.h | 15 +
arch/x86/boot/cpuflags.c | 3 +-
arch/x86/boot/cpuflags.h | 1 +
arch/x86/boot/early_serial_console.c | 28 +-
arch/x86/boot/io.h | 28 ++
arch/x86/boot/main.c | 4 +
arch/x86/boot/pm.c | 10 +-
arch/x86/boot/tty.c | 4 +-
arch/x86/boot/video-vga.c | 6 +-
arch/x86/boot/video.h | 8 +-
arch/x86/coco/Makefile | 2 +
arch/x86/coco/core.c | 14 +-
arch/x86/coco/tdcall.S | 197 ++++++++
arch/x86/coco/tdx.c | 594 +++++++++++++++++++++++
arch/x86/include/asm/acenv.h | 16 +-
arch/x86/include/asm/apic.h | 7 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/io.h | 42 +-
arch/x86/include/asm/kvm_para.h | 22 +
arch/x86/include/asm/mem_encrypt.h | 6 +-
arch/x86/include/asm/realmode.h | 1 +
arch/x86/include/asm/shared/io.h | 34 ++
arch/x86/include/asm/shared/tdx.h | 37 ++
arch/x86/include/asm/tdx.h | 82 ++++
arch/x86/kernel/acpi/boot.c | 118 +++++
arch/x86/kernel/apic/apic.c | 10 +
arch/x86/kernel/apic/io_apic.c | 15 +-
arch/x86/kernel/asm-offsets.c | 19 +
arch/x86/kernel/head64.c | 7 +
arch/x86/kernel/head_64.S | 28 +-
arch/x86/kernel/idt.c | 3 +
arch/x86/kernel/process.c | 4 +
arch/x86/kernel/smpboot.c | 12 +-
arch/x86/kernel/traps.c | 138 +++++-
arch/x86/mm/ioremap.c | 5 +
arch/x86/mm/mem_encrypt.c | 9 +-
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 57 ++-
arch/x86/realmode/rm/trampoline_common.S | 12 +-
arch/x86/realmode/rm/wakemain.c | 14 +-
arch/x86/virt/tdxcall.S | 91 ++++
include/linux/cc_platform.h | 10 +
kernel/cpu.c | 7 +
57 files changed, 1997 insertions(+), 159 deletions(-)
create mode 100644 Documentation/x86/tdx.rst
create mode 100644 arch/x86/boot/compressed/tdcall.S
create mode 100644 arch/x86/boot/compressed/tdx.c
create mode 100644 arch/x86/boot/compressed/tdx.h
create mode 100644 arch/x86/boot/io.h
create mode 100644 arch/x86/coco/tdcall.S
create mode 100644 arch/x86/coco/tdx.c
create mode 100644 arch/x86/include/asm/shared/io.h
create mode 100644 arch/x86/include/asm/shared/tdx.h
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/virt/tdxcall.S

--
2.34.1


2022-02-24 16:42:07

by Kirill A. Shutemov

Subject: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

So far, AMD_MEM_ENCRYPT is the only user of X86_MEM_ENCRYPT. TDX will be
the second. With TDX support enabled, mem_encrypt.c can be built without
AMD_MEM_ENCRYPT, which triggers a warning:

arch/x86/mm/mem_encrypt.c:69:13: warning: no previous prototype for
function 'mem_encrypt_init' [-Wmissing-prototypes]

Fix it by moving mem_encrypt_init() declaration outside of #ifdef
CONFIG_AMD_MEM_ENCRYPT.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Fixes: 20f07a044a76 ("x86/sev: Move common memory encryption code to mem_encrypt.c")
Acked-by: David Rientjes <[email protected]>
---
arch/x86/include/asm/mem_encrypt.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index e2c6f433ed10..88ceaf3648b3 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -49,9 +49,6 @@ void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages,

void __init mem_encrypt_free_decrypted_mem(void);

-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
void __init sev_es_init_vc_handling(void);

#define __bss_decrypted __section(".bss..decrypted")
@@ -89,6 +86,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }

#endif /* CONFIG_AMD_MEM_ENCRYPT */

+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
/*
* The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
* writing to or comparing values from the cr3 register. Having the
--
2.34.1

2022-02-24 16:42:23

by Kirill A. Shutemov

Subject: [PATCHv4 13/30] x86: Adjust types used in port I/O helpers

Change port I/O helpers to use u8/u16/u32 instead of unsigned
char/short/int for values. Use u16 instead of int for port number.

This aligns the helpers with the implementation in the boot stub in
preparation for consolidation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/io.h | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6d91ecb8026..638c1a2a82e0 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -258,37 +258,37 @@ static inline void slow_down_io(void)
#endif

#define BUILDIO(bwl, bw, type) \
-static inline void out##bwl(unsigned type value, int port) \
+static inline void out##bwl(type value, u16 port) \
{ \
asm volatile("out" #bwl " %" #bw "0, %w1" \
: : "a"(value), "Nd"(port)); \
} \
\
-static inline unsigned type in##bwl(int port) \
+static inline type in##bwl(u16 port) \
{ \
- unsigned type value; \
+ type value; \
asm volatile("in" #bwl " %w1, %" #bw "0" \
: "=a"(value) : "Nd"(port)); \
return value; \
} \
\
-static inline void out##bwl##_p(unsigned type value, int port) \
+static inline void out##bwl##_p(type value, u16 port) \
{ \
out##bwl(value, port); \
slow_down_io(); \
} \
\
-static inline unsigned type in##bwl##_p(int port) \
+static inline type in##bwl##_p(u16 port) \
{ \
- unsigned type value = in##bwl(port); \
+ type value = in##bwl(port); \
slow_down_io(); \
return value; \
} \
\
-static inline void outs##bwl(int port, const void *addr, unsigned long count) \
+static inline void outs##bwl(u16 port, const void *addr, unsigned long count) \
{ \
if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
- unsigned type *value = (unsigned type *)addr; \
+ type *value = (type *)addr; \
while (count) { \
out##bwl(*value, port); \
value++; \
@@ -301,10 +301,10 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
} \
} \
\
-static inline void ins##bwl(int port, void *addr, unsigned long count) \
+static inline void ins##bwl(u16 port, void *addr, unsigned long count) \
{ \
if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
- unsigned type *value = (unsigned type *)addr; \
+ type *value = (type *)addr; \
while (count) { \
*value = in##bwl(port); \
value++; \
@@ -317,9 +317,9 @@ static inline void ins##bwl(int port, void *addr, unsigned long count) \
} \
}

-BUILDIO(b, b, char)
-BUILDIO(w, w, short)
-BUILDIO(l, , int)
+BUILDIO(b, b, u8)
+BUILDIO(w, w, u16)
+BUILDIO(l, , u32)

#define inb inb
#define inw inw
--
2.34.1

2022-02-24 16:42:32

by Kirill A. Shutemov

Subject: [PATCHv4 20/30] x86/boot: Add a trampoline for booting APs via firmware handoff

From: Sean Christopherson <[email protected]>

Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the start up IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX, in which the VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".

Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(because the new model boots the AP directly in 64-bit mode). To handle
it, extend the trampoline code to support a 64-bit mode firmware handoff.
Also, extend the IDT and GDT pointers to support the 64-bit mode handoff.

There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.

The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.
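
For reference, a rough sketch of how a wakeup_secondary_cpu_64() callback
could drive the ACPI 6.4 mailbox. This is not part of this patch (the actual
parser and wakeup routine come in the next patch); field names follow the
"Multiprocessor Wakeup Mailbox Structure", and the struct/function names here
are made up for illustration:

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/barrier.h>
#include <asm/processor.h>

/* Illustrative layout of the start of the ACPI 6.4 wakeup mailbox */
struct example_mp_wakeup_mailbox {
        u16 command;            /* 1 == wakeup, cleared as acknowledgement */
        u16 reserved;
        u32 apic_id;
        u64 wakeup_vector;      /* 64-bit entry point, e.g. trampoline_start64 */
};

static struct example_mp_wakeup_mailbox *mailbox; /* mapped from the MADT entry */

static int example_wakeup_secondary_cpu_64(int apicid, unsigned long start_eip)
{
        WRITE_ONCE(mailbox->apic_id, apicid);
        WRITE_ONCE(mailbox->wakeup_vector, start_eip);

        /* Publish the command last so the AP sees a fully populated mailbox */
        smp_store_release(&mailbox->command, 1);

        /* The woken AP acknowledges by clearing the command field */
        while (READ_ONCE(mailbox->command))
                cpu_relax();    /* a real implementation would add a timeout */

        return 0;
}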

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/apic.h | 2 ++
arch/x86/include/asm/realmode.h | 1 +
arch/x86/kernel/smpboot.c | 12 ++++++--
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 38 ++++++++++++++++++++++++
arch/x86/realmode/rm/trampoline_common.S | 12 +++++++-
6 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 48067af94678..35006e151774 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -328,6 +328,8 @@ struct apic {

/* wakeup_secondary_cpu */
int (*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
+ /* wakeup secondary CPU using 64-bit wakeup point */
+ int (*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);

void (*inquire_remote_apic)(int apicid);

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 331474b150f1..fd6f6e5b755a 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
#endif
#ifdef CONFIG_X86_64
+ u32 trampoline_start64;
u32 trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 617012f4619f..6269dd126dba 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1088,6 +1088,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
unsigned long boot_error = 0;
unsigned long timeout;

+#ifdef CONFIG_X86_64
+ /* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+ if (apic->wakeup_secondary_cpu_64)
+ start_ip = real_mode_header->trampoline_start64;
+#endif
idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
initial_code = (unsigned long)start_secondary;
@@ -1129,11 +1134,14 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,

/*
* Wake up a CPU in difference cases:
- * - Use the method in the APIC driver if it's defined
+ * - Use a method from the APIC driver if one defined, with wakeup
+ * straight to 64-bit mode preferred over wakeup to RM.
* Otherwise,
* - Use an INIT boot APIC message for APs or NMI for BSP.
*/
- if (apic->wakeup_secondary_cpu)
+ if (apic->wakeup_secondary_cpu_64)
+ boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+ else if (apic->wakeup_secondary_cpu)
boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
else
boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long pa_sev_es_trampoline_start
#endif
#ifdef CONFIG_X86_64
+ .long pa_trampoline_start64
.long pa_trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index cc8391f86cdb..ae112a91592f 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
ljmpl $__KERNEL_CS, $pa_startup_64
SYM_CODE_END(startup_32)

+SYM_CODE_START(pa_trampoline_compat)
+ /*
+ * In compatibility mode. Prep ESP and DX for startup_32, then disable
+ * paging and complete the switch to legacy 32-bit mode.
+ */
+ movl $rm_stack_end, %esp
+ movw $__KERNEL_DS, %dx
+
+ movl $X86_CR0_PE, %eax
+ movl %eax, %cr0
+ ljmpl $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
.section ".text64","ax"
.code64
.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
jmpq *tr_start(%rip)
SYM_CODE_END(startup_64)

+SYM_CODE_START(trampoline_start64)
+ /*
+ * APs start here on a direct transfer from 64-bit BIOS with identity
+ * mapped page tables. Load the kernel's GDT in order to gear down to
+ * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+ * segment registers. Load the zero IDT so any fault triggers a
+ * shutdown instead of jumping back into BIOS.
+ */
+ lidt tr_idt(%rip)
+ lgdt tr_gdt64(%rip)
+
+ ljmpl *tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
.quad 0x00cf93000000ffff # __KERNEL_DS
SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)

+SYM_DATA_START(tr_gdt64)
+ .short tr_gdt_end - tr_gdt - 1 # gdt limit
+ .long pa_tr_gdt
+ .long 0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+ .long pa_trampoline_compat
+ .short __KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
.bss
.balign PAGE_SIZE
SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..4331c32c47f8 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0 */
.section ".rodata","a"
.balign 16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode an
+ * IDT with a 2-byte limit and 4-byte base is needed. When a boot
+ * loader hands off to a kernel 64-bit mode the base address
+ * extends to 8-bytes. Reserve enough space for either scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+ .short 0
+ .quad 0
+SYM_DATA_END(tr_idt)
--
2.34.1

2022-02-24 16:42:33

by Kirill A. Shutemov

Subject: [PATCHv4 23/30] x86/boot: Avoid #VE during boot for TDX platforms

From: Sean Christopherson <[email protected]>

There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.

The conditions to avoid are:

* Any writes to the EFER MSR
* Clearing CR4.MCE

This theoretically makes the guest boot more fragile. If, for instance,
EFER was set up incorrectly and a WRMSR is performed, it will trigger an
early exception panic, or a triple fault if it happens before early
exceptions are set up. However, this is likely to trip up the guest
BIOS long before control reaches the kernel. In any case, these kinds
of problems are unlikely to occur in production environments, and
developers have good debug tools to fix them quickly.

Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.
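
For clarity, the EFER half of the pattern the assembly below implements,
expressed as a C sketch. The real change is assembly-only; the helper name is
made up, while rdmsrl()/wrmsrl(), EFER_SCE/EFER_NX and boot_cpu_has() are the
usual kernel helpers and constants:

#include <asm/msr.h>
#include <asm/msr-index.h>
#include <asm/cpufeature.h>

/* Sketch only: the real logic lives in head_64.S / trampoline_64.S assembly */
static void example_setup_efer(void)
{
        u64 efer, want;

        rdmsrl(MSR_EFER, efer);
        want = efer | EFER_SCE;                 /* enable SYSCALL */
        if (boot_cpu_has(X86_FEATURE_NX))
                want |= EFER_NX;                /* enable No-Execute if supported */

        /* Skip the WRMSR, which would #VE on TDX, if nothing needs to change */
        if (want != efer)
                wrmsrl(MSR_EFER, want);
}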

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/head_64.S | 20 ++++++++++++++++++--
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/kernel/head_64.S | 28 ++++++++++++++++++++++++++--
arch/x86/realmode/rm/trampoline_64.S | 13 ++++++++++++-
5 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d2f45e58e846..98efb35ed7b1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,7 @@ config INTEL_TDX_GUEST
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
select DYNAMIC_PHYSICAL_MASK
+ select X86_MCE
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d0c3d33f3542..6d903b2fc544 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -643,12 +643,28 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx

+#ifdef CONFIG_X86_MCE
+ /*
+ * Preserve CR4.MCE if the kernel will enable #MC support.
+ * Clearing MCE may fault in some environments (that also force #MC
+ * support). Any machine check that occurs before #MC support is fully
+ * configured will crash the system regardless of the CR4.MCE value set
+ * here.
+ */
+ movl %cr4, %eax
+ andl $X86_CR4_MCE, %eax
+#else
+ movl $0, %eax
+#endif
+
/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..184b7468ea76 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -140,8 +140,22 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

+#ifdef CONFIG_X86_MCE
+ /*
+ * Preserve CR4.MCE if the kernel will enable #MC support.
+ * Clearing MCE may fault in some environments (that also force #MC
+ * support). Any machine check that occurs before #MC support is fully
+ * configured will crash the system regardless of the CR4.MCE value set
+ * here.
+ */
+ movq %cr4, %rcx
+ andl $X86_CR4_MCE, %ecx
+#else
+ movl $0, %ecx
+#endif
+
/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -246,13 +260,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ /*
+ * Preserve current value of EFER for comparison and to skip
+ * EFER writes if no change was made (for TDX guest)
+ */
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */

+ /* Avoid writing EFER if no change was made (for TDX guest) */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index d380f2d1fd23..e38d61d6562e 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,11 +143,22 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3

# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ /*
+ * Skip writing to EFER if the register already has desired
+ * value (to avoid #VE for the TDX guest).
+ */
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr

+.Ldone_efer:
# Enable paging and in turn activate Long Mode.
movl $CR0_STATE, %eax
movl %eax, %cr0
--
2.34.1

2022-02-24 16:42:49

by Kirill A. Shutemov

Subject: [PATCHv4 19/30] x86/tdx: Wire up KVM hypercalls

From: Kuppuswamy Sathyanarayanan <[email protected]>

KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests.

Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX
guests to run with KVM acting as the hypervisor.

Among other things, the KVM hypercall is used to send IPIs (see the example below).

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall() to make the symbol visible to kvm.ko.
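
For example, the existing PV IPI path in the guest boils down to something
like the call below (arguments simplified; KVM_HC_SEND_IPI is the standard KVM
hypercall number, the wrapper function is made up). With this patch the
kvm_hypercall4() is transparently routed through tdx_kvm_hypercall()/TDVMCALL
instead of executing VMCALL:

#include <linux/kernel.h>
#include <asm/kvm_para.h>
#include <uapi/linux/kvm_para.h>

/* Simplified sketch of a guest-side caller, not part of this patch */
static void example_send_ipi(unsigned long bitmap_low, unsigned long bitmap_high,
                             unsigned long min_apic_id, unsigned long icr)
{
        long ret;

        /* On TDX, kvm_hypercall4() now issues a TDVMCALL to the host */
        ret = kvm_hypercall4(KVM_HC_SEND_IPI, bitmap_low, bitmap_high,
                             min_apic_id, icr);
        WARN_ONCE(ret < 0, "KVM_HC_SEND_IPI failed: %ld\n", ret);
}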

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/coco/tdx.c | 17 +++++++++++++++++
arch/x86/include/asm/kvm_para.h | 22 ++++++++++++++++++++++
arch/x86/include/asm/tdx.h | 11 +++++++++++
3 files changed, 50 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 0d2a4c947a6c..6306ef19584f 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -48,6 +48,23 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
return __tdx_hypercall(&args, 0);
}

+#ifdef CONFIG_KVM_GUEST
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = nr,
+ .r11 = p1,
+ .r12 = p2,
+ .r13 = p3,
+ .r14 = p4,
+ };
+
+ return __tdx_hypercall(&args, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
+#endif
+
static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out)
{
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 56935ebb1dfe..57bc74e112f2 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,6 +7,8 @@
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>

+#include <asm/tdx.h>
+
#ifdef CONFIG_KVM_GUEST
bool kvm_check_and_clear_guest_paused(void);
#else
@@ -32,6 +34,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
static inline long kvm_hypercall0(unsigned int nr)
{
long ret;
+
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr)
@@ -42,6 +48,10 @@ static inline long kvm_hypercall0(unsigned int nr)
static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
{
long ret;
+
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1)
@@ -53,6 +63,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
{
long ret;
+
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2)
@@ -64,6 +78,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
{
long ret;
+
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -76,6 +94,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
unsigned long p4)
{
long ret;
+
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
: "=a"(ret)
: "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ba0f8c2b185c..6a97d42b0de9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,5 +67,16 @@ static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }

#endif /* CONFIG_INTEL_TDX_GUEST */

+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+ unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
+ unsigned long p2, unsigned long p3,
+ unsigned long p4)
+{
+ return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
--
2.34.1

2022-02-24 16:42:49

by Kirill A. Shutemov

Subject: [PATCHv4 29/30] ACPICA: Avoid cache flush on TDX guest

ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states. It is
required to prevent data loss.

While running inside TDX guest, the kernel can bypass cache flushing.
Changing sleep state in a virtual machine doesn't affect the host system
sleep state and cannot lead to data loss.

The approach can be generalized to all guest kernels, but, to be
cautious, let's limit it to TDX for now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/acenv.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d19deca6dd27 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -13,7 +13,21 @@

/* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+/*
+ * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
+ * It is required to prevent data loss.
+ *
+ * While running inside TDX guest, the kernel can bypass cache flushing.
+ * Changing sleep state in a virtual machine doesn't affect the host system
+ * sleep state and cannot lead to data loss.
+ *
+ * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
+ */
+#define ACPI_FLUSH_CPU_CACHE() \
+do { \
+ if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) \
+ wbinvd(); \
+} while (0)

int __acpi_acquire_global_lock(unsigned int *lock);
int __acpi_release_global_lock(unsigned int *lock);
--
2.34.1

2022-02-24 16:42:51

by Kirill A. Shutemov

Subject: [PATCHv4 30/30] Documentation/x86: Document TDX kernel architecture

From: Kuppuswamy Sathyanarayanan <[email protected]>

Document the TDX guest architecture details like #VE support,
shared memory, etc.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/x86/index.rst | 1 +
Documentation/x86/tdx.rst | 196 ++++++++++++++++++++++++++++++++++++
2 files changed, 197 insertions(+)
create mode 100644 Documentation/x86/tdx.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index f498f1d36cd3..382e53ca850a 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -24,6 +24,7 @@ x86-specific Documentation
intel-iommu
intel_txt
amd-memory-encryption
+ tdx
pti
mds
microcode
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
new file mode 100644
index 000000000000..a0b603ac49ca
--- /dev/null
+++ b/Documentation/x86/tdx.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs
+from the host and physical attacks by isolating the guest register
+state and by encrypting the guest memory. In TDX, a special TDX module
+sits between the host and the guest, and runs in a special mode and
+manages the guest/host separation.
+
+Since the host cannot directly access guest registers or memory, much
+normal functionality of a hypervisor (such as trapping MMIO, some MSRs,
+some CPUIDs, and some other instructions) has to be moved into the
+guest. This is implemented using a Virtualization Exception (#VE) that
+is handled by the guest kernel. Some #VEs are handled inside the guest
+kernel, but some require the hypervisor (VMM) to be involved. The TD
+hypercall mechanism allows TD guests to call the TDX module or hypervisor
+functions.
+
+#VE Exceptions:
+===============
+
+In TDX guests, #VE exceptions are delivered in the following
+scenarios:
+
+* Execution of certain instructions (see list below)
+* Certain MSR accesses.
+* CPUID usage (only for certain leaves)
+* Shared memory access (including MMIO)
+
+#VE due to instruction execution
+---------------------------------
+
+Intel TDX disallows execution of certain instructions in non-root
+mode. Executing these instructions leads to a #VE or a #GP.
+
+The details are as follows.
+
+The list of instructions that can cause a #VE:
+
+* String I/O (INS, OUTS), IN, OUT
+* HLT
+* MONITOR, MWAIT
+* WBINVD, INVD
+* VMCALL
+
+The list of instructions that can cause a #GP:
+
+* All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+ VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+* ENCLS, ENCLV
+* GETSEC
+* RSM
+* ENQCMD
+
+#VE due to MSR access
+----------------------
+
+In a TDX guest, MSR access behavior can be categorized as:
+
+* Natively supported (also called "context switched MSRs")
+ No special handling is required for these MSRs in TDX guests.
+* #GP triggered
+ A disallowed MSR read/write leads to a #GP.
+* #VE triggered
+ All MSRs that are neither natively supported nor disallowed
+ (#GP triggering) trigger a #VE. Access to these MSRs has to be
+ emulated using TDCALL.
+
+See the Intel TDX Module Specification, section "MSR Virtualization", for the
+complete list of MSRs that fall under the categories above.
+
+#VE due to CPUID instruction
+----------------------------
+
+In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized by
+the TDX module, while some trigger #VE. Whether a leaf/sub-leaf triggers #VE
+is defined in the TDX spec.
+
+The VMM configures, at TD initialization time (using TDH.MNG.INIT), whether
+the feature bits in a specific leaf/sub-leaf are exposed to the TD guest.
+
+#VE on Memory Accesses
+----------------------
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared. It selects the behavior with a bit in its page table
+entries.
+
+#VE on Shared Pages
+-------------------
+
+Access to shared mappings can cause a #VE. The hypervisor controls whether
+an access to a shared mapping causes a #VE, so the guest must only reference
+shared pages where it can safely handle a #VE and avoid nested #VEs.
+
+The content of shared mappings is not trusted, since shared memory is writable
+by the hypervisor.
+like stacks or kernel text, only for I/O buffers and MMIO regions. The kernel
+will not encounter shared mappings in sensitive contexts like syscall entry
+or NMIs.
+
+#VE on Private Pages
+--------------------
+
+Some accesses to private mappings may cause #VEs. Before a mapping is
+accepted (AKA in the SEPT_PENDING state), a reference would cause a #VE.
+But, after acceptance, references typically succeed.
+
+The hypervisor can cause a private page reference to fail if it chooses
+to move an accepted page to a "blocked" state. However, if it does
+this, page access will not generate a #VE. It will, instead, cause a
+"TD Exit" where the hypervisor is required to handle the exception.
+
+Linux #VE handler
+-----------------
+
+Both user and kernel #VE exceptions are handled by the tdx_handle_virt_exception()
+handler. If the exception is handled successfully, the instruction pointer is
+advanced past the faulting instruction. If handling fails, the #VE is treated as
+a regular exception and handled via the fixup handlers.
+
+In TD guests, the #VE nesting problem (a #VE triggered before the current one
+is handled, also known as the syscall gap issue) is addressed by the TDX module
+ensuring that interrupts, including NMIs, are blocked: the hardware blocks them
+from #VE delivery until TDGETVEINFO is called.
+
+The kernel must avoid triggering #VE in entry paths: do not touch TD-shared
+memory, including MMIO regions, and do not use #VE triggering MSRs,
+instructions, or CPUID leaves that might generate #VE.
+
+MMIO handling:
+==============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
+mapping which will cause a VMEXIT on access, and then the VMM emulates the
+access. That's not possible in TDX guests because VMEXIT will expose the
+register state to the host. TDX guests don't trust the host and can't have
+their state exposed to the host.
+
+In TDX the MMIO regions are instead configured to trigger a #VE
+exception in the guest. The guest #VE handler then emulates the MMIO
+instructions inside the guest and converts them into a controlled TDCALL
+to the host, rather than completely exposing the state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can be
+accessed with any instruction that accesses memory. However, the
+introduced instruction decoding method is limited. It is only designed
+to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in
+MMIO_DECODE_FAILED and an oops.
+
+Shared memory:
+==============
+
+Intel TDX doesn't allow the VMM to access guest private memory. Any
+memory that is required for communication with the VMM must be shared
+explicitly by setting the bit in the page table entry. The shared bit
+can be enumerated with TDX_GET_INFO.
+
+After setting the shared bit, the conversion must be completed with the
+MapGPA hypercall. The call informs the VMM about the conversion between
+private/shared mappings.
+
+set_memory_decrypted() converts a range of pages to shared.
+set_memory_encrypted() converts memory back to private.
+
+Device drivers are the primary users of shared memory, but there's no
+need to touch every driver. DMA buffers and ioremap()'ed regions are
+converted to shared automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocations, the DMA buffer gets converted at
+allocation time. Check force_dma_unencrypted() for details.
+
+References
+==========
+
+More details about the TDX module (and how it handles MSR, memory access,
+IO, CPUID, etc.) can be found at,
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
+
+More details about TDX hypercall and TDX module call ABI can be found
+at,
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
+
+More details about TDVF requirements can be found at,
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
--
2.34.1

2022-02-24 16:42:52

by Kirill A. Shutemov

Subject: [PATCHv4 15/30] x86/boot: Allow to hook up alternative port I/O helpers

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, kernel emulates these instructions using hypercalls.

But during early boot, on the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.

Add a way to hook up alternative port I/O helpers in the boot stub.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/a20.c | 14 +++++++-------
arch/x86/boot/boot.h | 2 +-
arch/x86/boot/compressed/misc.c | 18 ++++++++++++------
arch/x86/boot/compressed/misc.h | 2 +-
arch/x86/boot/early_serial_console.c | 28 ++++++++++++++--------------
arch/x86/boot/io.h | 28 ++++++++++++++++++++++++++++
arch/x86/boot/main.c | 4 ++++
arch/x86/boot/pm.c | 10 +++++-----
arch/x86/boot/tty.c | 4 ++--
arch/x86/boot/video-vga.c | 6 +++---
arch/x86/boot/video.h | 8 +++++---
arch/x86/realmode/rm/wakemain.c | 14 +++++++++-----
12 files changed, 91 insertions(+), 47 deletions(-)
create mode 100644 arch/x86/boot/io.h

diff --git a/arch/x86/boot/a20.c b/arch/x86/boot/a20.c
index a2b6b428922a..7f6dd5cc4670 100644
--- a/arch/x86/boot/a20.c
+++ b/arch/x86/boot/a20.c
@@ -25,7 +25,7 @@ static int empty_8042(void)
while (loops--) {
io_delay();

- status = inb(0x64);
+ status = pio_ops.inb(0x64);
if (status == 0xff) {
/* FF is a plausible, but very unlikely status */
if (!--ffs)
@@ -34,7 +34,7 @@ static int empty_8042(void)
if (status & 1) {
/* Read and discard input data */
io_delay();
- (void)inb(0x60);
+ (void)pio_ops.inb(0x60);
} else if (!(status & 2)) {
/* Buffers empty, finished! */
return 0;
@@ -99,13 +99,13 @@ static void enable_a20_kbc(void)
{
empty_8042();

- outb(0xd1, 0x64); /* Command write */
+ pio_ops.outb(0xd1, 0x64); /* Command write */
empty_8042();

- outb(0xdf, 0x60); /* A20 on */
+ pio_ops.outb(0xdf, 0x60); /* A20 on */
empty_8042();

- outb(0xff, 0x64); /* Null command, but UHCI wants it */
+ pio_ops.outb(0xff, 0x64); /* Null command, but UHCI wants it */
empty_8042();
}

@@ -113,10 +113,10 @@ static void enable_a20_fast(void)
{
u8 port_a;

- port_a = inb(0x92); /* Configuration port A */
+ port_a = pio_ops.inb(0x92); /* Configuration port A */
port_a |= 0x02; /* Enable A20 */
port_a &= ~0x01; /* Do not reset machine */
- outb(port_a, 0x92);
+ pio_ops.outb(port_a, 0x92);
}

/*
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 22a474c5b3e8..bd8f640ca15f 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,10 +23,10 @@
#include <linux/edd.h>
#include <asm/setup.h>
#include <asm/asm.h>
-#include <asm/shared/io.h>
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
+#include "io.h"

/* Useful macros */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 2b1169869b96..c0711b18086a 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -47,6 +47,8 @@ void *memmove(void *dest, const void *src, size_t n);
*/
struct boot_params *boot_params;

+struct port_io_ops pio_ops;
+
memptr free_mem_ptr;
memptr free_mem_end_ptr;

@@ -103,10 +105,12 @@ static void serial_putchar(int ch)
{
unsigned timeout = 0xffff;

- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
+ --timeout) {
cpu_relax();
+ }

- outb(ch, early_serial_base + TXR);
+ pio_ops.outb(ch, early_serial_base + TXR);
}

void __putstr(const char *s)
@@ -152,10 +156,10 @@ void __putstr(const char *s)
boot_params->screen_info.orig_y = y;

pos = (x + cols * y) * 2; /* Update cursor position */
- outb(14, vidport);
- outb(0xff & (pos >> 9), vidport+1);
- outb(15, vidport);
- outb(0xff & (pos >> 1), vidport+1);
+ pio_ops.outb(14, vidport);
+ pio_ops.outb(0xff & (pos >> 9), vidport+1);
+ pio_ops.outb(15, vidport);
+ pio_ops.outb(0xff & (pos >> 1), vidport+1);
}

void __puthex(unsigned long value)
@@ -370,6 +374,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;

+ init_io_ops();
+
/*
* Detect TDX guest environment.
*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 8a253e85f990..ea71cf3d64e1 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -26,7 +26,6 @@
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
-#include <asm/shared/io.h>

#include "tdx.h"

@@ -35,6 +34,7 @@

#define BOOT_BOOT_H
#include "../ctype.h"
+#include "../io.h"

#ifdef CONFIG_X86_64
#define memptr long
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c3de8b..03e43d770571 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -28,17 +28,17 @@ static void early_serial_init(int port, int baud)
unsigned char c;
unsigned divisor;

- outb(0x3, port + LCR); /* 8n1 */
- outb(0, port + IER); /* no interrupt */
- outb(0, port + FCR); /* no fifo */
- outb(0x3, port + MCR); /* DTR + RTS */
+ pio_ops.outb(0x3, port + LCR); /* 8n1 */
+ pio_ops.outb(0, port + IER); /* no interrupt */
+ pio_ops.outb(0, port + FCR); /* no fifo */
+ pio_ops.outb(0x3, port + MCR); /* DTR + RTS */

divisor = 115200 / baud;
- c = inb(port + LCR);
- outb(c | DLAB, port + LCR);
- outb(divisor & 0xff, port + DLL);
- outb((divisor >> 8) & 0xff, port + DLH);
- outb(c & ~DLAB, port + LCR);
+ c = pio_ops.inb(port + LCR);
+ pio_ops.outb(c | DLAB, port + LCR);
+ pio_ops.outb(divisor & 0xff, port + DLL);
+ pio_ops.outb((divisor >> 8) & 0xff, port + DLH);
+ pio_ops.outb(c & ~DLAB, port + LCR);

early_serial_base = port;
}
@@ -104,11 +104,11 @@ static unsigned int probe_baud(int port)
unsigned char lcr, dll, dlh;
unsigned int quot;

- lcr = inb(port + LCR);
- outb(lcr | DLAB, port + LCR);
- dll = inb(port + DLL);
- dlh = inb(port + DLH);
- outb(lcr, port + LCR);
+ lcr = pio_ops.inb(port + LCR);
+ pio_ops.outb(lcr | DLAB, port + LCR);
+ dll = pio_ops.inb(port + DLL);
+ dlh = pio_ops.inb(port + DLH);
+ pio_ops.outb(lcr, port + LCR);
quot = (dlh << 8) | dll;

return BASE_BAUD / quot;
diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
new file mode 100644
index 000000000000..8a53947ef70e
--- /dev/null
+++ b/arch/x86/boot/io.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_IO_H
+#define BOOT_IO_H
+
+#include <asm/shared/io.h>
+
+struct port_io_ops {
+ u8 (*inb)(u16 port);
+ u16 (*inw)(u16 port);
+ u32 (*inl)(u16 port);
+ void (*outb)(u8 v, u16 port);
+ void (*outw)(u16 v, u16 port);
+ void (*outl)(u32 v, u16 port);
+};
+
+extern struct port_io_ops pio_ops;
+
+static inline void init_io_ops(void)
+{
+ pio_ops.inb = inb;
+ pio_ops.inw = inw;
+ pio_ops.inl = inl;
+ pio_ops.outb = outb;
+ pio_ops.outw = outw;
+ pio_ops.outl = outl;
+}
+
+#endif
diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c
index e3add857c2c9..447a797891be 100644
--- a/arch/x86/boot/main.c
+++ b/arch/x86/boot/main.c
@@ -17,6 +17,8 @@

struct boot_params boot_params __attribute__((aligned(16)));

+struct port_io_ops pio_ops;
+
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */

@@ -133,6 +135,8 @@ static void init_heap(void)

void main(void)
{
+ init_io_ops();
+
/* First, copy the boot header into the "zeropage" */
copy_boot_params();

diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c
index 40031a614712..4180b6a264c9 100644
--- a/arch/x86/boot/pm.c
+++ b/arch/x86/boot/pm.c
@@ -25,7 +25,7 @@ static void realmode_switch_hook(void)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
- outb(0x80, 0x70); /* Disable NMI */
+ pio_ops.outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
@@ -35,9 +35,9 @@ static void realmode_switch_hook(void)
*/
static void mask_all_interrupts(void)
{
- outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
+ pio_ops.outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
- outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
+ pio_ops.outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}

@@ -46,9 +46,9 @@ static void mask_all_interrupts(void)
*/
static void reset_coprocessor(void)
{
- outb(0, 0xf0);
+ pio_ops.outb(0, 0xf0);
io_delay();
- outb(0, 0xf1);
+ pio_ops.outb(0, 0xf1);
io_delay();
}

diff --git a/arch/x86/boot/tty.c b/arch/x86/boot/tty.c
index f7eb976b0a4b..ee8700682801 100644
--- a/arch/x86/boot/tty.c
+++ b/arch/x86/boot/tty.c
@@ -29,10 +29,10 @@ static void __section(".inittext") serial_putchar(int ch)
{
unsigned timeout = 0xffff;

- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();

- outb(ch, early_serial_base + TXR);
+ pio_ops.outb(ch, early_serial_base + TXR);
}

static void __section(".inittext") bios_putchar(int ch)
diff --git a/arch/x86/boot/video-vga.c b/arch/x86/boot/video-vga.c
index 4816cb9cf996..17baac542ee7 100644
--- a/arch/x86/boot/video-vga.c
+++ b/arch/x86/boot/video-vga.c
@@ -131,7 +131,7 @@ static void vga_set_80x43(void)
/* I/O address of the VGA CRTC */
u16 vga_crtc(void)
{
- return (inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
+ return (pio_ops.inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
}

static void vga_set_480_scanlines(void)
@@ -148,10 +148,10 @@ static void vga_set_480_scanlines(void)
out_idx(0xdf, crtc, 0x12); /* Vertical display end */
out_idx(0xe7, crtc, 0x15); /* Vertical blank start */
out_idx(0x04, crtc, 0x16); /* Vertical blank end */
- csel = inb(0x3cc);
+ csel = pio_ops.inb(0x3cc);
csel &= 0x0d;
csel |= 0xe2;
- outb(csel, 0x3c2);
+ pio_ops.outb(csel, 0x3c2);
}

static void vga_set_vertical_end(int lines)
diff --git a/arch/x86/boot/video.h b/arch/x86/boot/video.h
index 04bde0bb2003..87a5f726e731 100644
--- a/arch/x86/boot/video.h
+++ b/arch/x86/boot/video.h
@@ -15,6 +15,8 @@

#include <linux/types.h>

+#include "boot.h"
+
/*
* This code uses an extended set of video mode numbers. These include:
* Aliases for standard modes
@@ -96,13 +98,13 @@ extern int graphic_mode; /* Graphics mode with linear frame buffer */
/* Accessing VGA indexed registers */
static inline u8 in_idx(u16 port, u8 index)
{
- outb(index, port);
- return inb(port+1);
+ pio_ops.outb(index, port);
+ return pio_ops.inb(port+1);
}

static inline void out_idx(u8 v, u16 port, u8 index)
{
- outw(index+(v << 8), port);
+ pio_ops.outw(index+(v << 8), port);
}

/* Writes a value to an indexed port and then reads the port again */
diff --git a/arch/x86/realmode/rm/wakemain.c b/arch/x86/realmode/rm/wakemain.c
index 1d6437e6d2ba..b49404d0d63c 100644
--- a/arch/x86/realmode/rm/wakemain.c
+++ b/arch/x86/realmode/rm/wakemain.c
@@ -17,18 +17,18 @@ static void beep(unsigned int hz)
} else {
u16 div = 1193181/hz;

- outb(0xb6, 0x43); /* Ctr 2, squarewave, load, binary */
+ pio_ops.outb(0xb6, 0x43); /* Ctr 2, squarewave, load, binary */
io_delay();
- outb(div, 0x42); /* LSB of counter */
+ pio_ops.outb(div, 0x42); /* LSB of counter */
io_delay();
- outb(div >> 8, 0x42); /* MSB of counter */
+ pio_ops.outb(div >> 8, 0x42); /* MSB of counter */
io_delay();

enable = 0x03; /* Turn on speaker */
}
- inb(0x61); /* Dummy read of System Control Port B */
+ pio_ops.inb(0x61); /* Dummy read of System Control Port B */
io_delay();
- outb(enable, 0x61); /* Enable timer 2 output to speaker */
+ pio_ops.outb(enable, 0x61); /* Enable timer 2 output to speaker */
io_delay();
}

@@ -62,8 +62,12 @@ static void send_morse(const char *pattern)
}
}

+struct port_io_ops pio_ops;
+
void main(void)
{
+ init_io_ops();
+
/* Kill machine if structures are wrong */
if (wakeup_header.real_magic != 0x12345678)
while (1)
--
2.34.1

2022-02-24 16:42:56

by Kirill A. Shutemov

Subject: [PATCHv4 27/30] x86/kvm: Use bounce buffers for TD guest

Intel TDX doesn't allow the VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies for any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.

Force SWIOTLB on the TD guest and make the SWIOTLB buffer shared by
generalizing mem_encrypt_init() to cover TDX.
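
For illustration, an unmodified driver keeps using the regular DMA API; on a
TD guest the streaming mapping below is transparently bounced through the
now-shared SWIOTLB buffer, and the coherent allocation is converted to shared
based on force_dma_unencrypted(). This is a sketch only, the driver function
and device are hypothetical:

#include <linux/dma-mapping.h>

/* Hypothetical driver snippet; 'dev' comes from the normal probe path */
static int example_dma_usage(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;
        void *vaddr;

        /* Streaming DMA: data is bounced via the shared SWIOTLB buffer */
        handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;
        /* ... program 'handle' into the device and wait for completion ... */
        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);

        /* Coherent DMA: the allocation itself is converted to a shared mapping */
        vaddr = dma_alloc_coherent(dev, PAGE_SIZE, &handle, GFP_KERNEL);
        if (!vaddr)
                return -ENOMEM;
        /* ... use 'vaddr'/'handle' ... */
        dma_free_coherent(dev, PAGE_SIZE, vaddr, handle);

        return 0;
}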

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/coco/core.c | 1 +
arch/x86/coco/tdx.c | 3 +++
arch/x86/mm/mem_encrypt.c | 9 ++++++++-
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 98efb35ed7b1..1312cefb927d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -885,7 +885,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
- select DYNAMIC_PHYSICAL_MASK
+ select X86_MEM_ENCRYPT
select X86_MCE
help
Support running as a guest under Intel TDX. Without this support,
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 9778cf4c6901..b10326f91d4f 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -22,6 +22,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
case CC_ATTR_GUEST_UNROLL_STRING_IO:
case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_MEM_ENCRYPT:
+ case CC_ATTR_MEM_ENCRYPT:
return true;
default:
return false;
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index da2ae399ea71..d33f65a58d7b 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -5,6 +5,7 @@
#define pr_fmt(fmt) "tdx: " fmt

#include <linux/cpufeature.h>
+#include <linux/swiotlb.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -587,5 +588,7 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;

+ swiotlb_force = SWIOTLB_FORCE;
+
pr_info("Guest detected\n");
}
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 50d209939c66..10ee40b5204b 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -42,7 +42,14 @@ bool force_dma_unencrypted(struct device *dev)

static void print_mem_encrypt_feature_info(void)
{
- pr_info("AMD Memory Encryption Features active:");
+ pr_info("Memory Encryption Features active:");
+
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ pr_cont(" Intel TDX\n");
+ return;
+ }
+
+ pr_cont("AMD ");

/* Secure Memory Encryption */
if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
--
2.34.1

2022-02-24 16:43:08

by Kirill A. Shutemov

Subject: [PATCHv4 16/30] x86/boot/compressed: Support TDX guest port I/O at decompression time

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.

But during early boot, at the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to use hypercalls directly, bypassing #VE
handling.

Hook up TDX-specific port I/O helpers if booting in TDX environment.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/tdcall.S | 3 ++
arch/x86/boot/compressed/tdx.c | 71 +++++++++++++++++++++++++++++++
arch/x86/include/asm/shared/tdx.h | 29 +++++++++++++
arch/x86/include/asm/tdx.h | 24 -----------
5 files changed, 104 insertions(+), 25 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 732f6b21ecbd..8fd0e6ae2e1f 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,7 +101,7 @@ ifdef CONFIG_X86_64
endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o

vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..59b80ab6b41c
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../coco/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index dec68c184358..499ccbd71312 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -2,6 +2,10 @@

#include "../cpuflags.h"
#include "../string.h"
+#include "../io.h"
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>

#include <asm/shared/tdx.h>

@@ -12,6 +16,66 @@ bool early_is_tdx_guest(void)
return tdx_guest_detected;
}

+static inline unsigned int tdx_io_in(int size, u16 port)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_IO_INSTRUCTION,
+ .r12 = size,
+ .r13 = 0,
+ .r14 = port,
+ };
+
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return UINT_MAX;
+
+ return args.r11;
+}
+
+static inline void tdx_io_out(int size, u16 port, u32 value)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_IO_INSTRUCTION,
+ .r12 = size,
+ .r13 = 1,
+ .r14 = port,
+ .r15 = value,
+ };
+
+ __tdx_hypercall(&args, 0);
+}
+
+static inline u8 tdx_inb(u16 port)
+{
+ return tdx_io_in(1, port);
+}
+
+static inline u16 tdx_inw(u16 port)
+{
+ return tdx_io_in(2, port);
+}
+
+static inline u32 tdx_inl(u16 port)
+{
+ return tdx_io_in(4, port);
+}
+
+static inline void tdx_outb(u8 value, u16 port)
+{
+ tdx_io_out(1, port, value);
+}
+
+static inline void tdx_outw(u16 value, u16 port)
+{
+ tdx_io_out(2, port, value);
+}
+
+static inline void tdx_outl(u32 value, u16 port)
+{
+ tdx_io_out(4, port, value);
+}
+
void early_tdx_detect(void)
{
u32 eax, sig[3];
@@ -24,4 +88,11 @@ void early_tdx_detect(void)

/* Cache TDX guest feature status */
tdx_guest_detected = true;
+
+ pio_ops.inb = tdx_inb;
+ pio_ops.inw = tdx_inw;
+ pio_ops.inl = tdx_inl;
+ pio_ops.outb = tdx_outb;
+ pio_ops.outw = tdx_outw;
+ pio_ops.outl = tdx_outl;
}
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 8209ba9ffe1a..51bce6351124 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -2,7 +2,36 @@
#ifndef _ASM_X86_SHARED_TDX_H
#define _ASM_X86_SHARED_TDX_H

+#include <linux/bits.h>
+#include <linux/types.h>
+
+#define TDX_HYPERCALL_STANDARD 0
+
+#define TDX_HCALL_HAS_OUTPUT BIT(0)
+#define TDX_HCALL_ISSUE_STI BIT(1)
+
#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "

+#ifndef __ASSEMBLY__
+
+/*
+ * Used in __tdx_hypercall() to pass down and get back registers' values of
+ * the TDCALL instruction when requesting services from the VMM.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_args {
+ u64 r10;
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
+
+#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d4c28b9f2fbc..54803cb6ccf5 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,16 +3,10 @@
#ifndef _ASM_X86_TDX_H
#define _ASM_X86_TDX_H

-#include <linux/bits.h>
#include <linux/init.h>
#include <asm/ptrace.h>
#include <asm/shared/tdx.h>

-#define TDX_HYPERCALL_STANDARD 0
-
-#define TDX_HCALL_HAS_OUTPUT BIT(0)
-#define TDX_HCALL_ISSUE_STI BIT(1)
-
#define TDX_SEAMCALL_VMFAILINVALID 0x8000FF00FFFF0000ULL

#ifndef __ASSEMBLY__
@@ -32,21 +26,6 @@ struct tdx_module_output {
u64 r11;
};

-/*
- * Used in __tdx_hypercall() to pass down and get back registers' values of
- * the TDCALL instruction when requesting services from the VMM.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_hypercall_args {
- u64 r10;
- u64 r11;
- u64 r12;
- u64 r13;
- u64 r14;
- u64 r15;
-};
-
/*
* Used by the #VE exception handler to gather the #VE exception
* info from the TDX module. This is a software only structure
@@ -71,9 +50,6 @@ void __init tdx_early_init(void);
u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out);

-/* Used to request services from the VMM */
-u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
-
void tdx_get_ve_info(struct ve_info *ve);

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
--
2.34.1

2022-02-24 16:43:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 06/30] x86/tdx: Exclude shared bit from __PHYSICAL_MASK

In TDX guests, memory is protected from host access by default. If a
guest needs to communicate with the VMM (like in the I/O use case), it
uses a single bit in the physical address to communicate the
protected/shared attribute of the given page.

In the x86 ARCH code, the __PHYSICAL_MASK macro represents the width of
the physical address in the given architecture. It is used in creating
the physical PAGE_MASK for address bits in the kernel. Since in a TDX
guest a single bit is used as metadata, it needs to be excluded from the
valid physical address bits to avoid using incorrect address bits in the
kernel.

Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK.
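
For illustration only (not part of the patch): assuming a hypothetical
GPA width of 52 reported by TDINFO, the shared bit is bit 51 and the
adjustment keeps only bits 0..50. A minimal userspace sketch of the same
arithmetic, with the starting mask value chosen arbitrarily:

#include <stdint.h>
#include <stdio.h>

/* Userspace re-implementation of the kernel macro, for illustration. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

int main(void)
{
	unsigned int gpa_width = 52;			/* hypothetical TDINFO value */
	uint64_t physical_mask = GENMASK_ULL(51, 0);	/* example: 52 address bits */

	/* The shared bit is bit (gpa_width - 1); mask it and anything above. */
	physical_mask &= GENMASK_ULL(gpa_width - 2, 0);

	/* Prints 0x0007ffffffffffff: bit 51 is no longer a valid address bit. */
	printf("physical_mask = %#018llx\n", (unsigned long long)physical_mask);
	return 0;
}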

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/coco/tdx.c | 8 ++++++++
2 files changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 93e67842e369..d2f45e58e846 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -885,6 +885,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
+ select DYNAMIC_PHYSICAL_MASK
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 74c6e68dd1b3..14c085930b5f 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -74,6 +74,14 @@ void __init tdx_early_init(void)

cc_set_vendor(CC_VENDOR_INTEL);

+ /*
+ * All bits above GPA width are reserved and kernel treats shared bit
+ * as flag, not as part of physical address.
+ *
+ * Adjust physical mask to only cover valid GPA bits.
+ */
+ physical_mask &= GENMASK_ULL(td_info.gpa_width - 2, 0);
+
/*
* The highest bit of a guest physical address is the "sharing" bit.
* Set it for shared pages and clear it for private pages.
--
2.34.1

2022-02-24 16:43:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 17/30] x86/tdx: Add port I/O emulation

From: Kuppuswamy Sathyanarayanan <[email protected]>

TDX hypervisors cannot emulate instructions directly. This includes
port I/O, which is normally emulated in the hypervisor. All port I/O
instructions inside TDX trigger the #VE exception in the guest and
are normally emulated there.

Use a hypercall to emulate port I/O. Extend tdx_handle_virt_exception()
to handle the #VE due to port I/O instructions.

String I/O operations are not supported in TDX. Unroll them by declaring
the CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute.
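
As a worked example (the values are illustrative and not taken from the
patch): the VMX exit qualification for port I/O encodes size - 1 in bits
2:0, the direction in bit 3, the string flag in bit 4 and the port number
in bits 31:16, which is what the VE_* macros below extract. A small
userspace sketch of the decoding for a hypothetical 1-byte IN from port
0x3f8:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical exit qualification: 1-byte IN from port 0x3f8. */
	uint32_t exit_qual = (0x3f8u << 16) | (1u << 3) | (1 - 1);

	int size = (exit_qual & 0x7) + 1;	/* bits 2:0 = size - 1 */
	int in   = !!(exit_qual & (1u << 3));	/* bit 3 = direction   */
	int str  = !!(exit_qual & (1u << 4));	/* bit 4 = string op   */
	int port = exit_qual >> 16;		/* bits 31:16 = port   */

	/* Prints: size=1 in=1 string=0 port=0x3f8 */
	printf("size=%d in=%d string=%d port=%#x\n", size, in, str, port);
	return 0;
}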

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/core.c | 7 +++++-
arch/x86/coco/tdx.c | 54 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 9113baebbfd2..5615b75e6fc6 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -18,7 +18,12 @@ static u64 cc_mask __ro_after_init;

static bool intel_cc_platform_has(enum cc_attr attr)
{
- return false;
+ switch (attr) {
+ case CC_ATTR_GUEST_UNROLL_STRING_IO:
+ return true;
+ default:
+ return false;
+ }
}

/*
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 15519e498679..2e342760b1d2 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -19,6 +19,12 @@
#define EPT_READ 0
#define EPT_WRITE 1

+/* See Exit Qualification for I/O Instructions in VMX documentation */
+#define VE_IS_IO_IN(e) ((e) & BIT(3))
+#define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1)
+#define VE_GET_PORT_NUM(e) ((e) >> 16)
+#define VE_IS_IO_STRING(e) ((e) & BIT(4))
+
static struct {
unsigned int gpa_width;
unsigned long attributes;
@@ -292,6 +298,52 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return true;
}

+/*
+ * Emulate I/O using hypercall.
+ *
+ * Assumes the IO instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ *
+ * Return True on success or False on failure.
+ */
+static bool handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_IO_INSTRUCTION,
+ };
+ int size, port;
+ u64 mask;
+ bool in, ret;
+
+ if (VE_IS_IO_STRING(exit_qual))
+ return false;
+
+ in = VE_IS_IO_IN(exit_qual);
+ size = VE_GET_IO_SIZE(exit_qual);
+ port = VE_GET_PORT_NUM(exit_qual);
+ mask = GENMASK(BITS_PER_BYTE * size, 0);
+
+ args.r12 = size;
+ args.r13 = !in;
+ args.r14 = port;
+ args.r15 = in ? 0 : regs->ax;
+
+ /*
+ * Emulate the I/O read/write via hypercall. More info about
+ * ABI can be found in TDX Guest-Host-Communication Interface
+ * (GHCI) section titled "TDG.VP.VMCALL<Instruction.IO>".
+ */
+ ret = !__tdx_hypercall(&args, in ? TDX_HCALL_HAS_OUTPUT : 0);
+ if (!ret || !in)
+ return ret;
+
+ regs->ax &= ~mask;
+ regs->ax |= ret ? args.r11 & mask : UINT_MAX;
+
+ return ret;
+}
+
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -347,6 +399,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
return handle_cpuid(regs);
case EXIT_REASON_EPT_VIOLATION:
return handle_mmio(regs, ve);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return handle_io(regs, ve->exit_qual);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return false;
--
2.34.1

2022-02-24 16:43:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
by the TDX module while some trigger #VE.

Implement the #VE handling for EXIT_REASON_CPUID by handling it via
the hypercall, which in turn lets the TDX module handle it by invoking
the host VMM.

More details on CPUID Virtualization can be found in the TDX module
specification, the section titled "CPUID Virtualization".

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx.c | 41 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 89992593a209..fd78b81a951d 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -154,6 +154,36 @@ static bool write_msr(struct pt_regs *regs)
return !__tdx_hypercall(&args, 0);
}

+static bool handle_cpuid(struct pt_regs *regs)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_CPUID,
+ .r12 = regs->ax,
+ .r13 = regs->cx,
+ };
+
+ /*
+ * Emulate the CPUID instruction via a hypercall. More info about
+ * ABI can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
+ */
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return false;
+
+ /*
+ * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
+ * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
+ * So copy the register contents back to pt_regs.
+ */
+ regs->ax = args.r12;
+ regs->bx = args.r13;
+ regs->cx = args.r14;
+ regs->dx = args.r15;
+
+ return true;
+}
+
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -186,8 +216,13 @@ void tdx_get_ve_info(struct ve_info *ve)
*/
static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
{
- pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return false;
+ switch (ve->exit_reason) {
+ case EXIT_REASON_CPUID:
+ return handle_cpuid(regs);
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
+ }
}

/* Handle the kernel #VE */
@@ -200,6 +235,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
return read_msr(regs);
case EXIT_REASON_MSR_WRITE:
return write_msr(regs);
+ case EXIT_REASON_CPUID:
+ return handle_cpuid(regs);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return false;
--
2.34.1

2022-02-24 16:43:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 12/30] x86/tdx: Detect TDX at early kernel decompression time

From: Kuppuswamy Sathyanarayanan <[email protected]>

The early decompression code does port I/O for its console output. But
handling the decompression-time port I/O demands a different approach
from normal runtime because the IDT required to support #VE-based port
I/O emulation is not yet set up. Paravirtualizing I/O calls during
the decompression step is acceptable because the decompression code is
small enough that patching it will not bloat the image size much.

To support port I/O in the decompression code, TDX must be detected
before the decompression code might do port I/O. Detect whether the
kernel runs in a TDX guest.

Add an early_is_tdx_guest() interface to query the cached TDX guest
status in the decompression code.

The actual port I/O paravirtualization will come later in the series.
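
For illustration (not part of the patch): CPUID leaf 0x21 returns the
"IntelTDX    " signature in EBX, EDX, ECX order, which is why the
detection code below passes &sig[0], &sig[2], &sig[1] to cpuid_count()
so that the array's memory order matches the string. A userspace sketch
of the comparison, with register values a TDX guest would hypothetically
see:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TDX_IDENT "IntelTDX    "

int main(void)
{
	uint32_t ebx = 0x65746e49;	/* "Inte" */
	uint32_t edx = 0x5844546c;	/* "lTDX" */
	uint32_t ecx = 0x20202020;	/* "    " */
	uint32_t sig[3];

	/* Same memory layout as the patch: sig[0]=EBX, sig[1]=EDX, sig[2]=ECX. */
	sig[0] = ebx;
	sig[1] = edx;
	sig[2] = ecx;

	printf("TDX guest detected: %s\n",
	       memcmp(TDX_IDENT, sig, sizeof(sig)) ? "no" : "yes");
	return 0;
}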

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/misc.c | 8 ++++++++
arch/x86/boot/compressed/misc.h | 2 ++
arch/x86/boot/compressed/tdx.c | 27 +++++++++++++++++++++++++++
arch/x86/boot/compressed/tdx.h | 15 +++++++++++++++
arch/x86/boot/cpuflags.c | 3 +--
arch/x86/boot/cpuflags.h | 1 +
arch/x86/include/asm/shared/tdx.h | 8 ++++++++
arch/x86/include/asm/tdx.h | 4 +---
9 files changed, 64 insertions(+), 5 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdx.c
create mode 100644 arch/x86/boot/compressed/tdx.h
create mode 100644 arch/x86/include/asm/shared/tdx.h

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6115274fe10f..732f6b21ecbd 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,6 +101,7 @@ ifdef CONFIG_X86_64
endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o

vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index a4339cb2d247..2b1169869b96 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;

+ /*
+ * Detect TDX guest environment.
+ *
+ * It has to be done before console_init() in order to use
+ * paravirtualized port I/O operations if needed.
+ */
+ early_tdx_detect();
+
console_init();

/*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 16ed360b6692..0d8e275a9d96 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,8 @@
#include <asm/bootparam.h>
#include <asm/desc_defs.h>

+#include "tdx.h"
+
#define BOOT_CTYPE_H
#include <linux/acpi.h>

diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..dec68c184358
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../cpuflags.h"
+#include "../string.h"
+
+#include <asm/shared/tdx.h>
+
+static bool tdx_guest_detected;
+
+bool early_is_tdx_guest(void)
+{
+ return tdx_guest_detected;
+}
+
+void early_tdx_detect(void)
+{
+ u32 eax, sig[3];
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
+
+ BUILD_BUG_ON(sizeof(sig) != sizeof(TDX_IDENT) - 1);
+ if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+ return;
+
+ /* Cache TDX guest feature status */
+ tdx_guest_detected = true;
+}
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
new file mode 100644
index 000000000000..a7bff6ae002e
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_TDX_H
+#define BOOT_COMPRESSED_TDX_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+void early_tdx_detect(void);
+bool early_is_tdx_guest(void);
+#else
+static inline void early_tdx_detect(void) { };
+static inline bool early_is_tdx_guest(void) { return false; }
+#endif
+
+#endif /* BOOT_COMPRESSED_TDX_H */
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index a0b75f73dc63..a83d67ec627d 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -71,8 +71,7 @@ int has_eflag(unsigned long mask)
# define EBX_REG "=b"
#endif

-static inline void cpuid_count(u32 id, u32 count,
- u32 *a, u32 *b, u32 *c, u32 *d)
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
{
asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
"cpuid \n\t"
diff --git a/arch/x86/boot/cpuflags.h b/arch/x86/boot/cpuflags.h
index 2e20814d3ce3..475b8fde90f7 100644
--- a/arch/x86/boot/cpuflags.h
+++ b/arch/x86/boot/cpuflags.h
@@ -17,5 +17,6 @@ extern u32 cpu_vendor[3];

int has_eflag(unsigned long mask);
void get_cpuflags(void);
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d);

#endif
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
new file mode 100644
index 000000000000..8209ba9ffe1a
--- /dev/null
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_TDX_H
+#define _ASM_X86_SHARED_TDX_H
+
+#define TDX_CPUID_LEAF_ID 0x21
+#define TDX_IDENT "IntelTDX "
+
+#endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e6e23ade53a6..d4c28b9f2fbc 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -6,9 +6,7 @@
#include <linux/bits.h>
#include <linux/init.h>
#include <asm/ptrace.h>
-
-#define TDX_CPUID_LEAF_ID 0x21
-#define TDX_IDENT "IntelTDX "
+#include <asm/shared/tdx.h>

#define TDX_HYPERCALL_STANDARD 0

--
2.34.1

2022-02-24 16:43:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 04/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

From: Kuppuswamy Sathyanarayanan <[email protected]>

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called TDCALL.

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer -- the TDX module -- facilitates secure communication between the
host and the guest. The TDX module is loaded like firmware into a special
CPU mode called SEAM. TDX guests communicate with the TDX module using
the TDCALL instruction.

A guest uses TDCALL to communicate with both the TDX module and the VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A variant of TDCALL used to communicate
with the VMM is called TDVMCALL.

Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).

__tdx_hypercall() - Used by the guest to request services from the
VMM (via TDVMCALL).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL).

Also define an additional wrapper, _tdx_hypercall(), which adds error
handling for TDCALL failures.
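
A caller-side sketch of how the new interface is meant to be used
(illustrative only, not taken from the series; the sub-function number
is hypothetical and the snippet assumes <asm/tdx.h> and <linux/errno.h>
are available):

/* Request a hypothetical VMM service and check only the return code. */
static int example_vmm_request(u64 sub_fn, u64 arg)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,	/* standard TDVMCALL ABI     */
		.r11 = sub_fn,			/* hypothetical sub-function */
		.r12 = arg,			/* sub-function argument     */
	};

	/* No output beyond the return code is needed, so flags are 0. */
	if (__tdx_hypercall(&args, 0))
		return -EIO;

	return 0;
}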

The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.

Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.

For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".

Based on previous patch by Sean Christopherson.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/Makefile | 2 +-
arch/x86/coco/tdcall.S | 184 ++++++++++++++++++++++++++++++++++
arch/x86/coco/tdx.c | 18 ++++
arch/x86/include/asm/tdx.h | 27 +++++
arch/x86/kernel/asm-offsets.c | 10 ++
5 files changed, 240 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/coco/tdcall.S

diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index 32f4c6e6f199..14af5412e3cd 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -5,4 +5,4 @@ CFLAGS_core.o += -fno-stack-protector

obj-y += core.o

-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o tdcall.o
diff --git a/arch/x86/coco/tdcall.S b/arch/x86/coco/tdcall.S
new file mode 100644
index 000000000000..c4dd9468e7d9
--- /dev/null
+++ b/arch/x86/coco/tdcall.S
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+#include "../virt/tdxcall.S"
+
+/*
+ * Bitmasks of exposed registers (with VMM).
+ */
+#define TDX_R10 BIT(10)
+#define TDX_R11 BIT(11)
+#define TDX_R12 BIT(12)
+#define TDX_R13 BIT(13)
+#define TDX_R14 BIT(14)
+#define TDX_R15 BIT(15)
+
+/*
+ * These registers are clobbered to hold arguments for each
+ * TDVMCALL. They are safe to expose to the VMM.
+ * Each bit in this mask represents a register ID. Bit field
+ * details can be found in TDX GHCI specification, section
+ * titled "TDCALL [TDG.VP.VMCALL] leaf".
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \
+ TDX_R12 | TDX_R13 | \
+ TDX_R14 | TDX_R15 )
+
+/*
+ * __tdx_module_call() - Used by TDX guests to request services from
+ * the TDX module (does not include VMM services).
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI. After TDCALL operation, TDX module output is saved
+ * in @out (if it is provided by the user)
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - TDCALL Leaf number.
+ * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction error code.
+ * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module). It
+ * can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+ TDX_MODULE_CALL host=0
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Make hypercalls to a TDX VMM.
+ *
+ * Transforms values in function call argument struct tdx_hypercall_args @args
+ * into the TDCALL register ABI. After TDCALL operation, VMM output is saved
+ * back in @args.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX - BITMAP which controls which part of TD Guest GPR
+ * is passed as-is to the VMM and back.
+ * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
+ * specification. Non zero value indicates vendor
+ * specific ABI.
+ * R11 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12-R15 - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction status (Not related to hypercall
+ * output).
+ * R10 - Hypercall output error code.
+ * R11-R15 - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @args (RDI) - struct tdx_hypercall_args for input and output
+ * @flags (RSI) - TDX_HCALL_* flags
+ *
+ * On successful completion, return the hypercall error code.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+
+ /* Save callee-saved GPRs as mandated by the x86_64 ABI */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+ xor %eax, %eax
+
+ /* Copy hypercall registers from arg struct: */
+ movq TDX_HYPERCALL_r10(%rdi), %r10
+ movq TDX_HYPERCALL_r11(%rdi), %r11
+ movq TDX_HYPERCALL_r12(%rdi), %r12
+ movq TDX_HYPERCALL_r13(%rdi), %r13
+ movq TDX_HYPERCALL_r14(%rdi), %r14
+ movq TDX_HYPERCALL_r15(%rdi), %r15
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * The TDVMCALL leaf is not supposed to fail. If it fails, something
+ * is horribly wrong with the TDX module. Stop the world.
+ */
+ testq %rax, %rax
+ jne .Lpanic
+
+ /* TDVMCALL leaf return code is in R10 */
+ movq %r10, %rax
+
+ /* Copy hypercall result registers to arg struct if needed */
+ testq $TDX_HCALL_HAS_OUTPUT, %rsi
+ jz .Lout
+
+ movq %r10, TDX_HYPERCALL_r10(%rdi)
+ movq %r11, TDX_HYPERCALL_r11(%rdi)
+ movq %r12, TDX_HYPERCALL_r12(%rdi)
+ movq %r13, TDX_HYPERCALL_r13(%rdi)
+ movq %r14, TDX_HYPERCALL_r14(%rdi)
+ movq %r15, TDX_HYPERCALL_r15(%rdi)
+.Lout:
+ /*
+ * Zero out registers exposed to the VMM to avoid speculative execution
+ * with VMM-controlled values. This needs to include all registers
+ * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
+ * context will be restored.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+
+ /* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ FRAME_END
+
+ retq
+.Lpanic:
+ ud2
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 00898e3eb77f..17365fd40ba2 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -7,6 +7,24 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>

+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = fn,
+ .r12 = r12,
+ .r13 = r13,
+ .r14 = r14,
+ .r15 = r15,
+ };
+
+ return __tdx_hypercall(&args, 0);
+}
+
void __init tdx_early_init(void)
{
u32 eax, sig[3];
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2f8cb1e53e77..557227e40da9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,11 +3,16 @@
#ifndef _ASM_X86_TDX_H
#define _ASM_X86_TDX_H

+#include <linux/bits.h>
#include <linux/init.h>

#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "

+#define TDX_HYPERCALL_STANDARD 0
+
+#define TDX_HCALL_HAS_OUTPUT BIT(0)
+
#define TDX_SEAMCALL_VMFAILINVALID 0x8000FF00FFFF0000ULL

#ifndef __ASSEMBLY__
@@ -27,10 +32,32 @@ struct tdx_module_output {
u64 r11;
};

+/*
+ * Used in __tdx_hypercall() to pass down and get back registers' values of
+ * the TDCALL instruction when requesting services from the VMM.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_args {
+ u64 r10;
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

void __init tdx_early_init(void);

+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 7dca52f5cfc6..0b465e7d0a2f 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -74,6 +74,16 @@ static void __used common(void)
OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
OFFSET(TDX_MODULE_r11, tdx_module_output, r11);

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
--
2.34.1

2022-02-24 16:43:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 25/30] x86/tdx: Make pages shared in ioremap()

In TDX guests, guest memory is protected from host access. If a guest
performs I/O, it needs to explicitly share the I/O memory with the host.

Map all ioremap()ed pages that are not backed by normal memory
(IORES_DESC_NONE or IORES_DESC_RESERVED) as shared.

Since TDX memory encryption support is similar to AMD SEV architecture,
reuse the infrastructure from AMD SEV code.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/ioremap.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 026031b3b782..a5d4ec1afca2 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -242,10 +242,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
* If the page being mapped is in memory and SEV is active then
* make sure the memory encryption attribute is enabled in the
* resulting mapping.
+ * In TDX guests, memory is marked private by default. If encryption
+ * is not requested (using encrypted), explicitly set decrypt
+ * attribute in all IOREMAPPED memory.
*/
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+ else
+ prot = pgprot_decrypted(prot);

switch (pcm) {
case _PAGE_CACHE_MODE_UC:
--
2.34.1

2022-02-24 16:43:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 08/30] x86/tdx: Add HLT support for TDX guests

The HLT instruction is a privileged instruction; executing it stops
instruction execution and places the processor in a HALT state. It
is used in the kernel for cases like reboot, the idle loop and exception
fixup handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in the Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing the HLT instruction will generate a #VE, which
is used to emulate the HLT instruction. But #VE-based emulation will not
work for the safe_halt() flavor, because it requires the STI instruction
to be executed just before the TDCALL. Since the idle loop is the only
user of the safe_halt() variant, handle it as a special case.

To avoid the default safe_halt() path in the idle loop, define
tdx_safe_halt() and use it to override the "x86_idle" function pointer
for a valid TDX guest.

Alternative choices like PV ops have been considered for adding
safe_halt() support. But they were rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in a TDX guest just for
the safe_halt() use case is not worth the cost.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdcall.S | 13 ++++++++
arch/x86/coco/tdx.c | 66 ++++++++++++++++++++++++++++++++++++--
arch/x86/include/asm/tdx.h | 4 +++
arch/x86/kernel/process.c | 4 +++
4 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdcall.S b/arch/x86/coco/tdcall.S
index c4dd9468e7d9..3c35a056974d 100644
--- a/arch/x86/coco/tdcall.S
+++ b/arch/x86/coco/tdcall.S
@@ -138,6 +138,19 @@ SYM_FUNC_START(__tdx_hypercall)

movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx

+ /*
+ * For the idle loop STI needs to be called directly before the TDCALL
+ * that enters idle (EXIT_REASON_HLT case). STI instruction enables
+ * interrupts only one instruction later. If there is a window between
+ * STI and the instruction that emulates the HALT state, there is a
+ * chance for interrupts to happen in this window, which can delay the
+ * HLT operation indefinitely. Since this is not the desired
+ * result, conditionally call STI before TDCALL.
+ */
+ testq $TDX_HCALL_ISSUE_STI, %rsi
+ jz .Lskip_sti
+ sti
+.Lskip_sti:
tdcall

/*
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 86a2f35e7308..0a2e6be0cdae 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <asm/coco.h>
#include <asm/tdx.h>
+#include <asm/vmx.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
@@ -59,6 +60,62 @@ static void get_info(void)
td_info.attributes = out.rdx;
}

+static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_HLT,
+ .r12 = irq_disabled,
+ };
+
+ /*
+ * Emulate HLT operation via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+ *
+ * The VMM uses the "IRQ disabled" param to understand IRQ
+ * enabled status (RFLAGS.IF) of the TD guest and to determine
+ * whether or not it should schedule the halted vCPU if an
+ * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
+ * can keep the vCPU in virtual HLT, even if an IRQ is
+ * pending, without hanging/breaking the guest.
+ */
+ return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
+}
+
+static bool handle_halt(void)
+{
+ /*
+ * Since non safe halt is mainly used in CPU offlining
+ * and the guest will always stay in the halt state, don't
+ * call the STI instruction (set do_sti as false).
+ */
+ const bool irq_disabled = irqs_disabled();
+ const bool do_sti = false;
+
+ if (__halt(irq_disabled, do_sti))
+ return false;
+
+ return true;
+}
+
+void __cpuidle tdx_safe_halt(void)
+{
+ /*
+ * For do_sti=true case, __tdx_hypercall() function enables
+ * interrupts using the STI instruction before the TDCALL. So
+ * set irq_disabled as false.
+ */
+ const bool irq_disabled = false;
+ const bool do_sti = true;
+
+ /*
+ * Use WARN_ONCE() to report the failure.
+ */
+ if (__halt(irq_disabled, do_sti))
+ WARN_ONCE(1, "HLT instruction emulation failed\n");
+}
+
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -98,8 +155,13 @@ static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
/* Handle the kernel #VE */
static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
- pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return false;
+ switch (ve->exit_reason) {
+ case EXIT_REASON_HLT:
+ return handle_halt();
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
+ }
}

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 34cf998ad534..e6e23ade53a6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -13,6 +13,7 @@
#define TDX_HYPERCALL_STANDARD 0

#define TDX_HCALL_HAS_OUTPUT BIT(0)
+#define TDX_HCALL_ISSUE_STI BIT(1)

#define TDX_SEAMCALL_VMFAILINVALID 0x8000FF00FFFF0000ULL

@@ -79,9 +80,12 @@ void tdx_get_ve_info(struct ve_info *ve);

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);

+void tdx_safe_halt(void);
+
#else

static inline void tdx_early_init(void) { };
+static inline void tdx_safe_halt(void) { };

#endif /* CONFIG_INTEL_TDX_GUEST */

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e131d71b3cae..2e90d57cf86e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,7 @@
#include <asm/proto.h>
#include <asm/frame.h>
#include <asm/unwind.h>
+#include <asm/tdx.h>

#include "process.h"

@@ -873,6 +874,9 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
} else if (prefer_mwait_c1_over_halt(c)) {
pr_info("using mwait in idle threads\n");
x86_idle = mwait_idle;
+ } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ pr_info("using TDX aware idle routine\n");
+ x86_idle = tdx_safe_halt;
} else
x86_idle = default_idle;
}
--
2.34.1

2022-02-24 16:43:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 14/30] x86: Consolidate port I/O helpers

There are two implementations of port I/O helpers: one in the kernel and
one in the boot stub.

Move the helpers required for both to <asm/shared/io.h> and use the one
implementation everywhere.
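
For reference (the expansion is written out by hand here and is not part
of the patch), BUILDIO(b, b, u8) in the new shared header generates
roughly the following pair of helpers:

/* Hand-expanded from BUILDIO(b, b, u8), for illustration only. */
static inline void outb(u8 value, u16 port)
{
	asm volatile("outb %b0, %w1" : : "a"(value), "Nd"(port));
}

static inline u8 inb(u16 port)
{
	u8 value;

	asm volatile("inb %w1, %b0" : "=a"(value) : "Nd"(port));
	return value;
}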

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/boot/boot.h | 35 +-------------------------------
arch/x86/boot/compressed/misc.h | 2 +-
arch/x86/include/asm/io.h | 22 ++------------------
arch/x86/include/asm/shared/io.h | 34 +++++++++++++++++++++++++++++++
4 files changed, 38 insertions(+), 55 deletions(-)
create mode 100644 arch/x86/include/asm/shared/io.h

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 34c9dbb6a47d..22a474c5b3e8 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,6 +23,7 @@
#include <linux/edd.h>
#include <asm/setup.h>
#include <asm/asm.h>
+#include <asm/shared/io.h>
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
@@ -35,40 +36,6 @@ extern struct boot_params boot_params;

#define cpu_relax() asm volatile("rep; nop")

-/* Basic port I/O */
-static inline void outb(u8 v, u16 port)
-{
- asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u8 inb(u16 port)
-{
- u8 v;
- asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outw(u16 v, u16 port)
-{
- asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u16 inw(u16 port)
-{
- u16 v;
- asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outl(u32 v, u16 port)
-{
- asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u32 inl(u16 port)
-{
- u32 v;
- asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
static inline void io_delay(void)
{
const u16 DELAY_PORT = 0x80;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..8a253e85f990 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -22,11 +22,11 @@
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
-#include <linux/io.h>
#include <asm/page.h>
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
+#include <asm/shared/io.h>

#include "tdx.h"

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 638c1a2a82e0..a1eb218a49f8 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -44,6 +44,7 @@
#include <asm/page.h>
#include <asm/early_ioremap.h>
#include <asm/pgtable_types.h>
+#include <asm/shared/io.h>

#define build_mmio_read(name, size, type, reg, barrier) \
static inline type name(const volatile void __iomem *addr) \
@@ -258,20 +259,6 @@ static inline void slow_down_io(void)
#endif

#define BUILDIO(bwl, bw, type) \
-static inline void out##bwl(type value, u16 port) \
-{ \
- asm volatile("out" #bwl " %" #bw "0, %w1" \
- : : "a"(value), "Nd"(port)); \
-} \
- \
-static inline type in##bwl(u16 port) \
-{ \
- type value; \
- asm volatile("in" #bwl " %w1, %" #bw "0" \
- : "=a"(value) : "Nd"(port)); \
- return value; \
-} \
- \
static inline void out##bwl##_p(type value, u16 port) \
{ \
out##bwl(value, port); \
@@ -320,10 +307,8 @@ static inline void ins##bwl(u16 port, void *addr, unsigned long count) \
BUILDIO(b, b, u8)
BUILDIO(w, w, u16)
BUILDIO(l, , u32)
+#undef BUILDIO

-#define inb inb
-#define inw inw
-#define inl inl
#define inb_p inb_p
#define inw_p inw_p
#define inl_p inl_p
@@ -331,9 +316,6 @@ BUILDIO(l, , u32)
#define insw insw
#define insl insl

-#define outb outb
-#define outw outw
-#define outl outl
#define outb_p outb_p
#define outw_p outw_p
#define outl_p outl_p
diff --git a/arch/x86/include/asm/shared/io.h b/arch/x86/include/asm/shared/io.h
new file mode 100644
index 000000000000..6707cd555f0c
--- /dev/null
+++ b/arch/x86/include/asm/shared/io.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_IO_H
+#define _ASM_X86_SHARED_IO_H
+
+#include <linux/types.h>
+
+#define BUILDIO(bwl, bw, type) \
+static inline void out##bwl(type value, u16 port) \
+{ \
+ asm volatile("out" #bwl " %" #bw "0, %w1" \
+ : : "a"(value), "Nd"(port)); \
+} \
+ \
+static inline type in##bwl(u16 port) \
+{ \
+ type value; \
+ asm volatile("in" #bwl " %w1, %" #bw "0" \
+ : "=a"(value) : "Nd"(port)); \
+ return value; \
+}
+
+BUILDIO(b, b, u8)
+BUILDIO(w, w, u16)
+BUILDIO(l, , u32)
+#undef BUILDIO
+
+#define inb inb
+#define inw inw
+#define inl inl
+#define outb outb
+#define outw outw
+#define outl outl
+
+#endif
--
2.34.1

2022-02-24 16:43:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to unmapped pages (EPT violation)

In the settings that Linux will run in, virtualization exceptions are
never generated on accesses to normal, TD-private memory that has been
accepted.

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

If a guest kernel action which would normally cause a #VE occurs in
the interrupt-disabled region before TDGETVEINFO, a #DF (fault
exception) is delivered to the guest which will result in an oops.

Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP by calling
ve_raise_fault() directly. The ve_raise_fault() function is similar
to the #GP handler: it is responsible for sending SIGSEGV to userspace,
invoking die() in the kernel, and notifying debuggers and other
die-chain users.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx.c | 60 ++++++++++++++
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/tdx.h | 21 +++++
arch/x86/kernel/idt.c | 3 +
arch/x86/kernel/traps.c | 138 ++++++++++++++++++++++++++------
5 files changed, 203 insertions(+), 23 deletions(-)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 14c085930b5f..86a2f35e7308 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -10,6 +10,7 @@

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
+#define TDX_GET_VEINFO 3

static struct {
unsigned int gpa_width;
@@ -58,6 +59,65 @@ static void get_info(void)
td_info.attributes = out.rdx;
}

+void tdx_get_ve_info(struct ve_info *ve)
+{
+ struct tdx_module_output out;
+
+ /*
+ * Retrieve the #VE info from the TDX module, which also clears the "#VE
+ * valid" flag. This must be done before anything else as any #VE that
+ * occurs while the valid flag is set, i.e. before the previous #VE info
+ * was consumed, is morphed to a #DF by the TDX module. Note, the TDX
+ * module also treats virtual NMIs as inhibited if the #VE valid flag is
+ * set, e.g. so that NMI=>#VE will not result in a #DF.
+ */
+ tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
+
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = lower_32_bits(out.r10);
+ ve->instr_info = upper_32_bits(out.r10);
+}
+
+/*
+ * Handle the user initiated #VE.
+ *
+ * For example, executing the CPUID instruction from user space
+ * is a valid case and hence the resulting #VE has to be handled.
+ *
+ * For dis-allowed or invalid #VE just return failure.
+ */
+static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
+}
+
+/* Handle the kernel #VE */
+static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+ bool ret;
+
+ if (user_mode(regs))
+ ret = virt_exception_user(regs, ve);
+ else
+ ret = virt_exception_kernel(regs, ve);
+
+ /* After successful #VE handling, move the IP */
+ if (ret)
+ regs->ip += ve->instr_len;
+
+ return ret;
+}
+
void __init tdx_early_init(void)
{
u32 eax, sig[3];
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 557227e40da9..34cf998ad534 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,7 @@

#include <linux/bits.h>
#include <linux/init.h>
+#include <asm/ptrace.h>

#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "
@@ -47,6 +48,22 @@ struct tdx_hypercall_args {
u64 r15;
};

+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ /* Guest Linear (virtual) Address */
+ u64 gla;
+ /* Guest Physical Address */
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

void __init tdx_early_init(void);
@@ -58,6 +75,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
/* Used to request services from the VMM */
u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);

+void tdx_get_ve_info(struct ve_info *ve);
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..1da074123c16 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};

/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7ef00dee35be..b2510af38158 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -62,6 +62,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -611,13 +612,43 @@ static bool try_fixup_enqcmd_gp(void)
#endif
}

+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
+ unsigned long error_code, const char *str)
+{
+ int ret;
+
+ if (fixup_exception(regs, trapnr, error_code, 0))
+ return true;
+
+ current->thread.error_code = error_code;
+ current->thread.trap_nr = trapnr;
+
+ /*
+ * To be potentially processing a kprobe fault and to trust the result
+ * from kprobe_running(), we have to be non-preemptible.
+ */
+ if (!preemptible() && kprobe_running() &&
+ kprobe_fault_handler(regs, trapnr))
+ return true;
+
+ ret = notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV);
+ return ret == NOTIFY_STOP;
+}
+
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
+ unsigned long error_code, const char *str)
+{
+ current->thread.error_code = error_code;
+ current->thread.trap_nr = trapnr;
+ show_signal(current, SIGSEGV, "", str, regs, error_code);
+ force_sig(SIGSEGV);
+}
+
DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
{
char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
enum kernel_gp_hint hint = GP_NO_HINT;
- struct task_struct *tsk;
unsigned long gp_addr;
- int ret;

if (user_mode(regs) && try_fixup_enqcmd_gp())
return;
@@ -636,40 +667,21 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
return;
}

- tsk = current;
-
if (user_mode(regs)) {
if (fixup_iopl_exception(regs))
goto exit;

- tsk->thread.error_code = error_code;
- tsk->thread.trap_nr = X86_TRAP_GP;
-
if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
goto exit;

- show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
- force_sig(SIGSEGV);
+ gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
goto exit;
}

if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
goto exit;

- tsk->thread.error_code = error_code;
- tsk->thread.trap_nr = X86_TRAP_GP;
-
- /*
- * To be potentially processing a kprobe fault and to trust the result
- * from kprobe_running(), we have to be non-preemptible.
- */
- if (!preemptible() &&
- kprobe_running() &&
- kprobe_fault_handler(regs, X86_TRAP_GP))
- goto exit;
-
- ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
- if (ret == NOTIFY_STOP)
+ if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
goto exit;

if (error_code)
@@ -1267,6 +1279,86 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}

+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+ if (user_mode(regs)) {
+ gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
+ return;
+ }
+
+ if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
+ return;
+
+ die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ * * Specific instructions (WBINVD, for example)
+ * * Specific MSR accesses
+ * * Specific CPUID leaf accesses
+ * * Access to unmapped pages (EPT violation)
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted.
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard to debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory,
+ * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
+ * that might generate #VE. VMM can remove memory from TD at any point,
+ * but access to unaccepted (or missing) private memory leads to VM
+ * termination, not to #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
+ * exception) is delivered to the guest which will result in an oops.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+
+ /*
+ * NMIs/Machine-checks/Interrupts will be in a disabled state
+ * till TDGETVEINFO TDCALL is executed. This ensures that VE
+ * info cannot be overwritten by a nested #VE.
+ */
+ tdx_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ /*
+ * If tdx_handle_virt_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (!tdx_handle_virt_exception(regs, &ve))
+ ve_raise_fault(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.34.1

2022-02-24 16:43:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 05/30] x86/tdx: Extend the confidential computing API to support TDX guests

Confidential Computing (CC) features (like string I/O unroll support,
memory encryption/decryption support, etc.) are conditionally enabled
in the kernel using the cc_platform_has() API. Since TDX guests also
need to use these CC features, extend the cc_platform_has() API and add
support for TDX guest-specific CC attributes.

Like AMD SME/SEV, TDX uses a bit in the page table entry to indicate the
encryption status of the page, but the polarity of the mask is the
opposite of AMD's: if the bit is set, the page is accessible to the VMM.

Which bit in the page table entry is used to indicate the shared/private
state is determined by using the TDINFO TDCALL.
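
To illustrate the polarity difference (the bit position and PTE value
are hypothetical, not taken from the patch): cc_mkenc() must clear the
shared bit on Intel but set the C-bit on AMD, and cc_mkdec() does the
opposite. A minimal userspace sketch of that behaviour:

#include <stdint.h>
#include <stdio.h>

/* Simplified model of the cc_mkenc()/cc_mkdec() polarity. */
static const uint64_t cc_mask = 1ULL << 51;	/* hypothetical shared/C-bit */

static uint64_t mkenc_intel(uint64_t val) { return val & ~cc_mask; }
static uint64_t mkdec_intel(uint64_t val) { return val | cc_mask; }
static uint64_t mkenc_amd(uint64_t val)   { return val | cc_mask; }
static uint64_t mkdec_amd(uint64_t val)   { return val & ~cc_mask; }

int main(void)
{
	uint64_t pte = 0x163;	/* example PTE flag bits, address omitted */

	printf("Intel: enc=%#llx dec=%#llx\n",
	       (unsigned long long)mkenc_intel(pte),
	       (unsigned long long)mkdec_intel(pte));
	printf("AMD:   enc=%#llx dec=%#llx\n",
	       (unsigned long long)mkenc_amd(pte),
	       (unsigned long long)mkdec_amd(pte));
	return 0;
}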

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/coco/core.c | 4 ++++
arch/x86/coco/tdx.c | 43 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c346d66b51fc..93e67842e369 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
bool "Intel TDX (Trust Domain Extensions) - Guest Support"
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
+ select ARCH_HAS_CC_PLATFORM
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index fc1365dd927e..9113baebbfd2 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -90,6 +90,8 @@ u64 cc_mkenc(u64 val)
switch (vendor) {
case CC_VENDOR_AMD:
return val | cc_mask;
+ case CC_VENDOR_INTEL:
+ return val & ~cc_mask;
default:
return val;
}
@@ -100,6 +102,8 @@ u64 cc_mkdec(u64 val)
switch (vendor) {
case CC_VENDOR_AMD:
return val & ~cc_mask;
+ case CC_VENDOR_INTEL:
+ return val | cc_mask;
default:
return val;
}
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 17365fd40ba2..74c6e68dd1b3 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -5,8 +5,17 @@
#define pr_fmt(fmt) "tdx: " fmt

#include <linux/cpufeature.h>
+#include <asm/coco.h>
#include <asm/tdx.h>

+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO 1
+
+static struct {
+ unsigned int gpa_width;
+ unsigned long attributes;
+} td_info __ro_after_init;
+
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
@@ -25,6 +34,30 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
return __tdx_hypercall(&args, 0);
}

+static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out)
+{
+ if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
+ panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
+}
+
+static void get_info(void)
+{
+ struct tdx_module_output out;
+
+ /*
+ * TDINFO TDX module call is used to get the TD execution environment
+ * information like GPA width, number of available vcpus, debug mode
+ * information, etc. More details about the ABI can be found in TDX
+ * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+ * [TDG.VP.INFO].
+ */
+ tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+ td_info.gpa_width = out.rcx & GENMASK(5, 0);
+ td_info.attributes = out.rdx;
+}
+
void __init tdx_early_init(void)
{
u32 eax, sig[3];
@@ -37,5 +70,15 @@ void __init tdx_early_init(void)

setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+ get_info();
+
+ cc_set_vendor(CC_VENDOR_INTEL);
+
+ /*
+ * The highest bit of a guest physical address is the "sharing" bit.
+ * Set it for shared pages and clear it for private pages.
+ */
+ cc_set_mask(BIT_ULL(td_info.gpa_width - 1));
+
pr_info("Guest detected\n");
}
--
2.34.1

2022-02-24 16:43:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 18/30] x86/tdx: Handle early boot port I/O

From: Andi Kleen <[email protected]>

TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting the
accesses into TDCALLs to the host.

But before IDT handlers are set up, port I/O cannot be emulated using
the normal kernel #VE handler. To support #VE-based emulation during
this boot window, add minimal early #VE handler support to the early
exception handlers. This is similar to what AMD SEV does. It is
mainly needed for earlyprintk's serial driver, and potentially for
the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exception() (like
normal exception failures).
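
For orientation, port I/O #VEs are converted into TDVMCALLs elsewhere
in the series (handle_io(), not shown in this patch). A rough sketch of
what an OUT emulation looks like is below; the register assignments
follow the GHCI convention and are paraphrased here, so treat the exact
field layout as an assumption rather than the series' literal code:

	/* Illustrative sketch of emulating OUT via TDVMCALL */
	static bool example_handle_out(u16 port, int size, u32 value)
	{
		struct tdx_hypercall_args args = {
			.r10 = TDX_HYPERCALL_STANDARD,
			.r11 = EXIT_REASON_IO_INSTRUCTION,
			.r12 = size,		/* 1, 2 or 4 bytes */
			.r13 = 1,		/* 1 = write (OUT), 0 = read (IN) */
			.r14 = port,
			.r15 = value,
		};

		return !__tdx_hypercall(&args, 0);
	}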

Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/coco/tdx.c | 16 ++++++++++++++++
arch/x86/include/asm/tdx.h | 4 ++++
arch/x86/kernel/head64.c | 3 +++
3 files changed, 23 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 2e342760b1d2..0d2a4c947a6c 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -344,6 +344,22 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual)
return ret;
}

+/*
+ * Early #VE exception handler. Only handles a subset of port I/O.
+ * Intended only for earlyprintk. Returns false on failure.
+ */
+__init bool tdx_early_handle_ve(struct pt_regs *regs)
+{
+ struct ve_info ve;
+
+ tdx_get_ve_info(&ve);
+
+ if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
+ return false;
+
+ return handle_io(regs, ve.exit_qual);
+}
+
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 54803cb6ccf5..ba0f8c2b185c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -56,11 +56,15 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);

void tdx_safe_halt(void);

+bool tdx_early_handle_ve(struct pt_regs *regs);
+
#else

static inline void tdx_early_init(void) { };
static inline void tdx_safe_halt(void) { };

+static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 6dff50c3edd6..ecbf50e5b8e0 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -417,6 +417,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
return;

+ if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+ return;
+
early_fixup_exception(regs, trapnr);
}

--
2.34.1

2022-02-24 16:44:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 24/30] x86/topology: Disable CPU online/offline control for TDX guests

From: Kuppuswamy Sathyanarayanan <[email protected]>

Unlike regular VMs, TDX guests use the firmware hand-off wakeup method
to wake up the APs during the boot process. This wakeup model uses a
mailbox to communicate with firmware to bring up the APs. As per the
design, this mailbox can only be used once for a given AP, which means
that after an AP has booted, the same mailbox cannot be used to
offline/online it again. More details about this requirement can be
found in the Intel TDX Virtual Firmware Design Guide, sections titled
"AP initialization in OS" and "Hotplug Device".

Since the architecture does not support any method of offlining the
CPUs, disable CPU hotplug support in the kernel.

Since this hotplug disable feature can be re-used by other VM guests,
add a new CC attribute CC_ATTR_HOTPLUG_DISABLED and use it to disable
the hotplug support.

With hotplug disabled, /sys/devices/system/cpu/cpuX/online sysfs option
will not exist for TDX guests.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/core.c | 1 +
include/linux/cc_platform.h | 10 ++++++++++
kernel/cpu.c | 7 +++++++
3 files changed, 18 insertions(+)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 5615b75e6fc6..54344122e2fe 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,6 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
{
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
+ case CC_ATTR_HOTPLUG_DISABLED:
return true;
default:
return false;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index efd8205282da..691494bbaf5a 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -72,6 +72,16 @@ enum cc_attr {
* Examples include TDX guest & SEV.
*/
CC_ATTR_GUEST_UNROLL_STRING_IO,
+
+ /**
+ * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+ *
+ * The platform/OS is running as a guest/virtual machine that does
+ * not support the CPU hotplug feature.
+ *
+ * Examples include TDX Guest.
+ */
+ CC_ATTR_HOTPLUG_DISABLED,
};

#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f39eb0b52dfe..c94f00fa34d3 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -34,6 +34,7 @@
#include <linux/scs.h>
#include <linux/percpu-rwsem.h>
#include <linux/cpuset.h>
+#include <linux/cc_platform.h>

#include <trace/events/power.h>
#define CREATE_TRACE_POINTS
@@ -1185,6 +1186,12 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,

static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
{
+ /*
+ * If the platform does not support hotplug, report it explicitly to
+ * differentiate it from a transient offlining failure.
+ */
+ if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+ return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
return _cpu_down(cpu, 0, target);
--
2.34.1

2022-02-24 16:44:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address

From: Isaku Yamahata <[email protected]>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

ioremap()-created mappings, such as those used by virtio, will be marked
as shared by default. However, the IOAPIC code does not use ioremap() and
instead uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

AMD SEV gets IOAPIC pages shared because FIXMAP_PAGE_NOCACHE has the
_ENC bit clear. TDX has to set the bit to share the page with the host.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c1bb384935b0..d775f58a3c3e 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/memblock.h>
#include <linux/msi.h>
+#include <linux/cc_platform.h>

#include <asm/irqdomain.h>
#include <asm/io.h>
@@ -65,6 +66,7 @@
#include <asm/irq_remapping.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
+#include <asm/pgtable.h>

#define for_each_ioapic(idx) \
for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
@@ -2677,6 +2679,15 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}

+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+ phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ flags = pgprot_decrypted(flags);
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2709,7 +2720,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2838,7 +2849,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;

- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.34.1

2022-02-24 16:44:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 02/30] x86/tdx: Detect running as a TDX guest in early boot

From: Kuppuswamy Sathyanarayanan <[email protected]>

In preparation for extending the cc_platform_has() API to support TDX
guests, use the CPUID instruction to detect TDX guest support in the
early boot code (via tdx_early_init()). Since copy_bootdata() is the
first user of the cc_platform_has() API, detect the TDX guest status
before it runs.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set it when
running on a TDX guest platform.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 12 ++++++++++++
arch/x86/coco/Makefile | 2 ++
arch/x86/coco/tdx.c | 23 +++++++++++++++++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/include/asm/tdx.h | 21 +++++++++++++++++++++
arch/x86/kernel/head64.c | 4 ++++
7 files changed, 70 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/coco/tdx.c
create mode 100644 arch/x86/include/asm/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 57a4e0285a80..c346d66b51fc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -880,6 +880,18 @@ config ACRN_GUEST
IOT with small footprint and real-time features. More details can be
found in https://projectacrn.org/.

+config INTEL_TDX_GUEST
+ bool "Intel TDX (Trust Domain Extensions) - Guest Support"
+ depends on X86_64 && CPU_SUP_INTEL
+ depends on X86_X2APIC
+ help
+ Support running as a guest under Intel TDX. Without this support,
+ the guest kernel can not boot or run under TDX.
+ TDX includes memory encryption and integrity capabilities
+ which protect the confidentiality and integrity of guest
+ memory contents and CPU state. TDX guests are protected from
+ some attacks from the VMM.
+
endif #HYPERVISOR_GUEST

source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index c1ead00017a7..32f4c6e6f199 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -4,3 +4,5 @@ KASAN_SANITIZE_core.o := n
CFLAGS_core.o += -fno-stack-protector

obj-y += core.o
+
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
new file mode 100644
index 000000000000..00898e3eb77f
--- /dev/null
+++ b/arch/x86/coco/tdx.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2021-2022 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/cpufeature.h>
+#include <asm/tdx.h>
+
+void __init tdx_early_init(void)
+{
+ u32 eax, sig[3];
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
+
+ BUILD_BUG_ON(sizeof(sig) != sizeof(TDX_IDENT) - 1);
+ if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+ return;
+
+ setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+ pr_info("Guest detected\n");
+}
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5cd22090e53d..cacc8dde854b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
#define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
#define X86_FEATURE_PVUNLOCK ( 8*32+20) /* "" PV unlock function */
#define X86_FEATURE_VCPUPREEMPT ( 8*32+21) /* "" PV vcpu_is_preempted function */
+#define X86_FEATURE_TDX_GUEST ( 8*32+22) /* Intel Trust Domain Extensions Guest */

/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 1231d63f836d..b37de8268c9a 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -68,6 +68,12 @@
# define DISABLE_SGX (1 << (X86_FEATURE_SGX & 31))
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+# define DISABLE_TDX_GUEST 0
+#else
+# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -79,7 +85,7 @@
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
#define DISABLED_MASK7 (DISABLE_PTI)
-#define DISABLED_MASK8 0
+#define DISABLED_MASK8 (DISABLE_TDX_GUEST)
#define DISABLED_MASK9 (DISABLE_SMAP|DISABLE_SGX)
#define DISABLED_MASK10 0
#define DISABLED_MASK11 0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..ba8042ce61c2
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021-2022 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#include <linux/init.h>
+
+#define TDX_CPUID_LEAF_ID 0x21
+#define TDX_IDENT "IntelTDX "
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+
+#else
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4f5ecbbaae77..6dff50c3edd6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
#include <asm/extable.h>
#include <asm/trapnr.h>
#include <asm/sev.h>
+#include <asm/tdx.h>

/*
* Manage page tables very early on.
@@ -514,6 +515,9 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

idt_setup_early_handler();

+ /* Needed before cc_platform_has() can be used for TDX */
+ tdx_early_init();
+
copy_bootdata(__va(real_mode_data));

/*
--
2.34.1

2022-02-24 16:44:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 21/30] x86/acpi, x86/boot: Add multiprocessor wake-up support

From: Kuppuswamy Sathyanarayanan <[email protected]>

TDX cannot use the INIT/SIPI protocol to bring up secondary CPUs
because it requires assistance from the untrusted VMM.

For platforms that do not support INIT/SIPI, ACPI defines a wakeup
model (using a mailbox) via the MADT multiprocessor wakeup structure.
More details about it can be found in the ACPI specification v6.4,
section titled "Multiprocessor Wakeup Structure". If the platform
firmware produces the multiprocessor wakeup structure, the OS may use
this new mailbox-based mechanism to wake up the APs.

Add ACPI MADT wake structure parsing support for the x86 platform and,
if the MADT wake table is present, update apic->wakeup_secondary_cpu_64
with a new API which uses the MADT wake mailbox to wake up the APs.
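
For reference, the mailbox that the code below pokes is laid out as in
ACPICA's actbl2.h; the definition is reproduced from memory here, so
field names and reserved-area sizes should be double-checked against
the header:

	struct acpi_madt_multiproc_wakeup_mailbox {
		u16 command;			/* 0 = noop, 1 = wakeup */
		u16 reserved;
		u32 apic_id;
		u64 wakeup_vector;
		u8  reserved_os[2032];		/* reserved for OS use */
		u8  reserved_firmware[2048];	/* reserved for firmware use */
	};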

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/apic.h | 5 ++
arch/x86/kernel/acpi/boot.c | 118 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/apic/apic.c | 10 +++
3 files changed, 133 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 35006e151774..bd8ae0a7010a 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -490,6 +490,11 @@ static inline unsigned int read_apic_id(void)
return apic->get_apic_id(reg);
}

+#ifdef CONFIG_X86_64
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+#endif
+
extern int default_apic_id_valid(u32 apicid);
extern int default_acpi_madt_oem_check(char *, char *);
extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 5b6d1a95776f..99518eac2bbc 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,15 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
static bool acpi_support_online_capable;
#endif

+#ifdef CONFIG_X86_64
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+/* Lock to protect mailbox (acpi_mp_wake_mailbox) from parallel access */
+static DEFINE_SPINLOCK(mailbox_lock);
+#endif
+
#ifdef CONFIG_X86_IO_APIC
/*
* Locks related to IOAPIC hotplug
@@ -336,6 +345,84 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
return 0;
}

+#ifdef CONFIG_X86_64
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+ static physid_mask_t apic_id_wakemap = PHYSID_MASK_NONE;
+ u8 timeout;
+
+ /* Remap mailbox memory only for the first call to acpi_wakeup_cpu() */
+ if (physids_empty(apic_id_wakemap)) {
+ acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+ sizeof(*acpi_mp_wake_mailbox),
+ MEMREMAP_WB);
+ }
+
+ /*
+ * According to the ACPI specification r6.4, section titled
+ * "Multiprocessor Wakeup Structure" the mailbox-based wakeup
+ * mechanism cannot be used more than once for the same CPU.
+ * Skip wakeups if they are attempted more than once.
+ */
+ if (physid_isset(apicid, apic_id_wakemap)) {
+ pr_err("CPU already awake (APIC ID %x), skipping wakeup\n",
+ apicid);
+ return -EINVAL;
+ }
+
+ spin_lock(&mailbox_lock);
+
+ /*
+ * Mailbox memory is shared between firmware and OS. Firmware will
+ * listen on mailbox command address, and once it receives the wakeup
+ * command, CPU associated with the given apicid will be booted.
+ *
+ * The values of apic_id and wakeup_vector have to be set before
+ * updating the wakeup command. To let the compiler preserve the
+ * order of writes, use smp_store_release().
+ */
+ smp_store_release(&acpi_mp_wake_mailbox->apic_id, apicid);
+ smp_store_release(&acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+ /*
+ * After writing the wakeup command, wait for a maximum of 0xFF loops
+ * for the firmware to reset the command field back to zero, indicating
+ * successful reception of the command.
+ * NOTE: 0xFF as the timeout value was chosen based on our experiments.
+ *
+ * XXX: Change the timeout once ACPI specification comes up with
+ * standard maximum timeout value.
+ */
+ timeout = 0xFF;
+ while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+ cpu_relax();
+
+ /* If timed out (timeout == 0), return error */
+ if (!timeout) {
+ /*
+ * XXX: Is there a recovery path after timeout is hit?
+ * Spec is unclear. Reset command to 0 if timeout is hit.
+ */
+ acpi_mp_wake_mailbox->command = 0;
+ spin_unlock(&mailbox_lock);
+ return -EIO;
+ }
+
+ /*
+ * If the CPU wakeup process is successful, store the
+ * status in apic_id_wakemap to prevent re-wakeup
+ * requests.
+ */
+ physid_set(apicid, apic_id_wakemap);
+
+ spin_unlock(&mailbox_lock);
+
+ return 0;
+}
+#endif
#endif /*CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1083,6 +1170,29 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
}
+
+#ifdef CONFIG_X86_64
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_madt_multiproc_wakeup *mp_wake;
+
+ if (!IS_ENABLED(CONFIG_SMP))
+ return -ENODEV;
+
+ mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+ if (BAD_MADT_ENTRY(mp_wake, end))
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+ acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+ return 0;
+}
+#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1278,6 +1388,14 @@ static void __init acpi_process_madt(void)

smp_found_config = 1;
}
+
+#ifdef CONFIG_X86_64
+ /*
+ * Parse MADT MP Wake entry.
+ */
+ acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+ acpi_parse_mp_wake, 1);
+#endif
}
if (error == -EINVAL) {
/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b70344bf6600..3c8f2c797a98 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2551,6 +2551,16 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
}
EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);

+#ifdef CONFIG_X86_64
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+ struct apic **drv;
+
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+ (*drv)->wakeup_secondary_cpu_64 = handler;
+}
+#endif
+
/*
* Override the generic EOI implementation with an optimized version.
* Only called during early boot when only one CPU is active and with
--
2.34.1

2022-02-24 16:44:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 09/30] x86/tdx: Add MSR support for TDX guests

Use hypercall to emulate MSR read/write for the TDX platform.

There are two viable approaches for doing MSRs in a TD guest:

1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
do. Some will succeed, others will cause a #VE. All of those that
cause a #VE will be handled with a TDCALL.
2. Use paravirt infrastructure. The paravirt hook has to keep a list
of which MSRs would cause a #VE and use a TDCALL. All other MSRs
execute RDMSR/WRMSR instructions directly.

The second option can be ruled out because the list of MSRs would be
challenging to maintain. That leaves option #1 as the only viable
solution for the minimal TDX support.

For performance-critical MSR writes (like TSC_DEADLINE), future patches
will replace the WRMSR/#VE sequence with the direct TDCALL.

RDMSR and WRMSR specification details can be found in the
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, sections titled
"TDG.VP.VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
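
As a forward-looking sketch only (not part of this patch), the direct
TDCALL mentioned above for performance-critical MSR writes could reuse
the same GHCI sub-function inside tdx.c; the helper name below is made
up for illustration:

	/* Hypothetical direct WRMSR hypercall, skipping the #VE round trip */
	static void example_direct_wrmsr(unsigned int msr, u64 val)
	{
		if (_tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, val, 0, 0))
			WARN_ONCE(1, "direct WRMSR hypercall failed\n");
	}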

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 0a2e6be0cdae..89992593a209 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -116,6 +116,44 @@ void __cpuidle tdx_safe_halt(void)
WARN_ONCE(1, "HLT instruction emulation failed\n");
}

+static bool read_msr(struct pt_regs *regs)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_MSR_READ,
+ .r12 = regs->cx,
+ };
+
+ /*
+ * Emulate the MSR read via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
+ */
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return false;
+
+ regs->ax = lower_32_bits(args.r11);
+ regs->dx = upper_32_bits(args.r11);
+ return true;
+}
+
+static bool write_msr(struct pt_regs *regs)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_MSR_WRITE,
+ .r12 = regs->cx,
+ .r13 = (u64)regs->dx << 32 | regs->ax,
+ };
+
+ /*
+ * Emulate the MSR write via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
+ */
+ return !__tdx_hypercall(&args, 0);
+}
+
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -158,6 +196,10 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
return handle_halt();
+ case EXIT_REASON_MSR_READ:
+ return read_msr(regs);
+ case EXIT_REASON_MSR_WRITE:
+ return write_msr(regs);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return false;
--
2.34.1

2022-02-24 16:44:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 11/30] x86/tdx: Handle in-kernel MMIO

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access, with the VMM then emulating the
instruction that caused the VMEXIT. That's not possible for a TDX VM.

To emulate an instruction an emulator needs two things:

- R/W access to the register file to read/modify instruction arguments
and see the RIP of the faulting instruction.

- Read access to the memory where the instruction is placed, to see
what to emulate. In this case it is guest kernel text.

Neither is available to the VMM in the TDX environment:

- The register file is never exposed to the VMM. When a TD exits to the
module, it saves registers into the state-save area allocated for that
TD. The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.

- Memory is encrypted with a TD-private key. The CPU disallows software
other than the TDX module and TDs from making memory accesses using
the private key.

In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled hypercall
to the host.

MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.

readX()/writeX() helpers limit the range of instructions which can trigger
MMIO. This makes MMIO instruction emulation feasible. Raw access to an MMIO
region allows the compiler to generate whatever instruction it wants.
Supporting all possible instructions is a task of a different scope.

MMIO access with anything other than helpers from io.h may result in
MMIO_DECODE_FAILED and an oops.

AMD SEV has the same limitations to MMIO handling.
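
To make the supported/unsupported distinction concrete, here is an
illustrative driver-side fragment (the MMIO address and register
offsets are made up):

	static void example_mmio_poke(void)
	{
		void __iomem *base = ioremap(0xfebd0000, 0x1000);
		u32 v;

		if (!base)
			return;

		v = readl(base + 0x10);		/* decoded and emulated via #VE */
		writel(v | 0x1, base + 0x10);	/* decoded and emulated via #VE */

		/*
		 * Risky in a TDX guest: a raw dereference lets the compiler
		 * pick an instruction that insn_decode_mmio() may not handle,
		 * ending in MMIO_DECODE_FAILED and an oops:
		 *
		 *	*(volatile u32 *)(base + 0x10) = v;
		 */

		iounmap(base);
	}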

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, any paravirtual approach would require patching approximately
120k call sites. With a conservative overhead estimate of 5 bytes per
call site (CALL instruction), this bloats the code by roughly 600k.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests. Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch. This will be implemented in the
future, removing the bulk of MMIO #VEs.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx.c | 110 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 110 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index fd78b81a951d..15519e498679 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -8,11 +8,17 @@
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3

+/* MMIO direction */
+#define EPT_READ 0
+#define EPT_WRITE 1
+
static struct {
unsigned int gpa_width;
unsigned long attributes;
@@ -184,6 +190,108 @@ static bool handle_cpuid(struct pt_regs *regs)
return true;
}

+static bool mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = EXIT_REASON_EPT_VIOLATION,
+ .r12 = size,
+ .r13 = EPT_READ,
+ .r14 = addr,
+ .r15 = *val,
+ };
+
+ if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+ return false;
+ *val = args.r11;
+ return true;
+}
+
+static bool mmio_write(int size, unsigned long addr, unsigned long val)
+{
+ return !_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE,
+ addr, val);
+}
+
+static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ char buffer[MAX_INSN_SIZE];
+ unsigned long *reg, val;
+ struct insn insn = {};
+ enum mmio_type mmio;
+ int size, extend_size;
+ u8 extend_val = 0;
+
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return false;
+
+ if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+ return false;
+
+ mmio = insn_decode_mmio(&insn, &size);
+ if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+ return false;
+
+ if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+ reg = insn_get_modrm_reg_ptr(&insn, regs);
+ if (!reg)
+ return false;
+ }
+
+ ve->instr_len = insn.length;
+
+ switch (mmio) {
+ case MMIO_WRITE:
+ memcpy(&val, reg, size);
+ return mmio_write(size, ve->gpa, val);
+ case MMIO_WRITE_IMM:
+ val = insn.immediate.value;
+ return mmio_write(size, ve->gpa, val);
+ case MMIO_READ:
+ case MMIO_READ_ZERO_EXTEND:
+ case MMIO_READ_SIGN_EXTEND:
+ break;
+ case MMIO_MOVS:
+ case MMIO_DECODE_FAILED:
+ return false;
+ default:
+ BUG();
+ }
+
+ /* Handle reads */
+ if (!mmio_read(size, ve->gpa, &val))
+ return false;
+
+ switch (mmio) {
+ case MMIO_READ:
+ /* Zero-extend for 32-bit operation */
+ extend_size = size == 4 ? sizeof(*reg) : 0;
+ break;
+ case MMIO_READ_ZERO_EXTEND:
+ /* Zero extend based on operand size */
+ extend_size = insn.opnd_bytes;
+ break;
+ case MMIO_READ_SIGN_EXTEND:
+ /* Sign extend based on operand size */
+ extend_size = insn.opnd_bytes;
+ if (size == 1 && val & BIT(7))
+ extend_val = 0xFF;
+ else if (size > 1 && val & BIT(15))
+ extend_val = 0xFF;
+ break;
+ case MMIO_MOVS:
+ case MMIO_DECODE_FAILED:
+ return false;
+ default:
+ BUG();
+ }
+
+ if (extend_size)
+ memset(reg, extend_val, extend_size);
+ memcpy(reg, &val, size);
+ return true;
+}
+
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -237,6 +345,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
return write_msr(regs);
case EXIT_REASON_CPUID:
return handle_cpuid(regs);
+ case EXIT_REASON_EPT_VIOLATION:
+ return handle_mmio(regs, ve);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return false;
--
2.34.1

2022-02-24 16:44:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 22/30] x86/boot: Set CR0.NE early and keep it set during the boot

A TDX guest requires CR0.NE to be set. Clearing the bit triggers a #GP(0).

If CR0.NE is 0, the MS-DOS compatibility mode for handling floating-point
exceptions is selected. In this mode, the software exception handler for
floating-point exceptions is invoked externally using the processor’s
FERR#, INTR, and IGNNE# pins.

Using FERR# and IGNNE# to handle floating-point exceptions is deprecated.
CR0.NE=0 also limits newer processors to operate with one logical
processor active.

The kernel uses the CR0_STATE constant to initialize CR0. It has the
NE bit set. But during early boot the kernel takes a more ad-hoc
approach to setting bits in the register.

Make CR0 initialization consistent, deriving the initial value of CR0
from CR0_STATE.
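
For reference, CR0_STATE lives in arch/x86/include/asm/processor-flags.h
and is quoted from memory below (worth re-checking against the tree);
the relevant point is that X86_CR0_NE is part of it:

	#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
				 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
				 X86_CR0_PG)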

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 7 ++++---
arch/x86/realmode/rm/trampoline_64.S | 8 ++++----
2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index fd9441f40457..d0c3d33f3542 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -289,7 +289,7 @@ SYM_FUNC_START(startup_32)
pushl %eax

/* Enter paged protected Mode, activating Long Mode */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
+ movl $CR0_STATE, %eax
movl %eax, %cr0

/* Jump from 32bit compatibility mode into 64bit mode. */
@@ -662,8 +662,9 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax

- /* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ /* Enable paging again. */
+ movl %cr0, %eax
+ btsl $X86_CR0_PG_BIT, %eax
movl %eax, %cr0

lret
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index ae112a91592f..d380f2d1fd23 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -70,7 +70,7 @@ SYM_CODE_START(trampoline_start)
movw $__KERNEL_DS, %dx # Data segment descriptor

# Enable protected mode
- movl $X86_CR0_PE, %eax # protected mode (PE) bit
+ movl $(CR0_STATE & ~X86_CR0_PG), %eax
movl %eax, %cr0 # into protected mode

# flush prefetch and jump to startup_32
@@ -148,8 +148,8 @@ SYM_CODE_START(startup_32)
movl $MSR_EFER, %ecx
wrmsr

- # Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+ # Enable paging and in turn activate Long Mode.
+ movl $CR0_STATE, %eax
movl %eax, %cr0

/*
@@ -169,7 +169,7 @@ SYM_CODE_START(pa_trampoline_compat)
movl $rm_stack_end, %esp
movw $__KERNEL_DS, %dx

- movl $X86_CR0_PE, %eax
+ movl $(CR0_STATE & ~X86_CR0_PG), %eax
movl %eax, %cr0
ljmpl $__KERNEL32_CS, $pa_startup_32
SYM_CODE_END(pa_trampoline_compat)
--
2.34.1

2022-02-24 16:44:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> So far, AMD_MEM_ENCRYPT is the only user of X86_MEM_ENCRYPT. TDX will be
> the second. It will make mem_encrypt.c build without AMD_MEM_ENCRYPT,
> which triggers a warning:
>
> arch/x86/mm/mem_encrypt.c:69:13: warning: no previous prototype for
> function 'mem_encrypt_init' [-Wmissing-prototypes]
>
> Fix it by moving mem_encrypt_init() declaration outside of #ifdef
> CONFIG_AMD_MEM_ENCRYPT.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Fixes: 20f07a044a76 ("x86/sev: Move common memory encryption code to mem_encrypt.c")
> Acked-by: David Rientjes <[email protected]>

Hey Kirill,

The last time you posted this, I ack'd it:

Acked-by: Dave Hansen <[email protected]>

but that didn't make it into this version. It saves me the trouble of
looking through this patch again when I was fine with it last week.

The ack still stands if you're interested in incorporating it. ;)

2022-02-24 16:45:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 26/30] x86/mm/cpa: Add support for TDX shared memory

Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.

It is a two-step process: the guest sets the shared bit in the page
table entry and notifies the VMM about the change. The notification
happens using the MapGPA hypercall.

Conversion back to private memory requires clearing the shared bit,
notifying the VMM with the MapGPA hypercall, and then accepting the
memory with the AcceptPage TDX module call.

Provide TDX versions of the x86_platform.guest.* callbacks. This makes
__set_memory_enc_pgtable() work correctly in a TDX guest.
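
As a usage sketch (not part of the patch), a driver that needs a
VMM-visible buffer goes through the generic set_memory_decrypted() /
set_memory_encrypted() interface, which reaches
__set_memory_enc_pgtable() and, from there, the callbacks installed
below; error handling is trimmed for brevity:

	unsigned long vaddr = __get_free_pages(GFP_KERNEL, 0);

	/* Private -> shared: set the shared bit, then MapGPA */
	set_memory_decrypted(vaddr, 1);

	/* ... exchange data with the VMM ... */

	/* Shared -> private: clear the bit, MapGPA, then TDX_ACCEPT_PAGE */
	set_memory_encrypted(vaddr, 1);
	free_pages(vaddr, 0);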

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/core.c | 1 +
arch/x86/coco/tdx.c | 101 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 102 insertions(+)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 54344122e2fe..9778cf4c6901 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -21,6 +21,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
case CC_ATTR_HOTPLUG_DISABLED:
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
return true;
default:
return false;
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 6306ef19584f..da2ae399ea71 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -10,10 +10,15 @@
#include <asm/vmx.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
+#include <asm/x86_init.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3
+#define TDX_ACCEPT_PAGE 6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001

/* MMIO direction */
#define EPT_READ 0
@@ -456,6 +461,98 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
return ret;
}

+static bool tdx_tlb_flush_required(bool enc)
+{
+ /*
+ * TDX guest is responsible for flushing the TLB on private->shared
+ * transition. VMM is responsible for flushing on shared->private.
+ */
+ return !enc;
+}
+
+static bool tdx_cache_flush_required(void)
+{
+ return true;
+}
+
+static bool accept_page(phys_addr_t gpa, enum pg_level pg_level)
+{
+ /*
+ * Pass the page physical address to the TDX module to accept the
+ * pending, private page.
+ *
+ * Bits 2:0 of GPA encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+ */
+ switch (pg_level) {
+ case PG_LEVEL_4K:
+ break;
+ case PG_LEVEL_2M:
+ gpa |= 1;
+ break;
+ case PG_LEVEL_1G:
+ gpa |= 2;
+ break;
+ default:
+ return false;
+ }
+
+ return !__tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest. The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+ phys_addr_t start = __pa(vaddr);
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+ if (!enc) {
+ start |= cc_mkdec(0);
+ end |= cc_mkdec(0);
+ }
+
+ /*
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>"
+ */
+ if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+ return false;
+
+ /* private->shared conversion requires only MapGPA call */
+ if (!enc)
+ return true;
+
+ /*
+ * For shared->private conversion, accept the page using
+ * TDX_ACCEPT_PAGE TDX module call.
+ */
+ while (start < end) {
+ /* Try if 1G page accept is possible */
+ if (!(start & ~PUD_MASK) && end - start >= PUD_SIZE &&
+ accept_page(start, PG_LEVEL_1G)) {
+ start += PUD_SIZE;
+ continue;
+ }
+
+ /* Try if 2M page accept is possible */
+ if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
+ accept_page(start, PG_LEVEL_2M)) {
+ start += PMD_SIZE;
+ continue;
+ }
+
+ if (!accept_page(start, PG_LEVEL_4K))
+ return false;
+ start += PAGE_SIZE;
+ }
+
+ return true;
+}
+
void __init tdx_early_init(void)
{
u32 eax, sig[3];
@@ -486,5 +583,9 @@ void __init tdx_early_init(void)
*/
cc_set_mask(BIT_ULL(td_info.gpa_width - 1));

+ x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
+ x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
+ x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
+
pr_info("Guest detected\n");
}
--
2.34.1

2022-02-24 18:23:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 04/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> + tdcall
> +
> + /*
> + * TDVMCALL leaf does not suppose to fail. If it fails something
> + * is horribly wrong with TDX module. Stop the world.
> + */
> + testq %rax, %rax
> + jne .Lpanic

This should be:

"A TDVMCALL is not supposed to fail."

I also wish this was mentioning something about the difference between a
failure and return code.

/*
* %rax==0 indicates a failure of the TDVMCALL mechanism itself
* and that something has gone horribly wrong with the TDX
* module.
*
* The return status of the hypercall operation is separate
* (in %r10). Hypercall errors are a part of normal operation
* and are handled by callers.
*/

I've been confused by this exact thing multiple times over the months
that I've been looking at this code. I think it deserves a good comment.

2022-02-24 18:44:32

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 05/30] x86/tdx: Extend the confidential computing API to support TDX guests

> +/* TDX module Call Leaf IDs */
> +#define TDX_GET_INFO 1
> +
> +static struct {
> + unsigned int gpa_width;
> + unsigned long attributes;
> +} td_info __ro_after_init;

This defines part of an ABI (TDX_GET_INFO) and then a structure right
next to it which is not part of the ABI. That's really confusing.

It's further muddied by "attributes" being unused in this patch. Why
bother declaring it and assigning a value to it that won't be used? Why
even *have* a structure for a single value?

> /*
> * Wrapper for standard use of __tdx_hypercall with no output aside from
> * return code.
> @@ -25,6 +34,30 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> return __tdx_hypercall(&args, 0);
> }
>
> +static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> + struct tdx_module_output *out)
> +{
> + if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
> + panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
> +}
> +
> +static void get_info(void)
> +{
> + struct tdx_module_output out;
> +
> + /*
> + * TDINFO TDX module call is used to get the TD execution environment
> + * information like GPA width, number of available vcpus, debug mode
> + * information, etc. More details about the ABI can be found in TDX
> + * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
> + * [TDG.VP.INFO].
> + */
> + tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
> +
> + td_info.gpa_width = out.rcx & GENMASK(5, 0);
> + td_info.attributes = out.rdx;
> +}

This left me wondering two things. First, why this bothers storing
'gpa_width' when it's only used to generate a mask? Why not just store
the mask in the structure?

Second, why have the global 'td_info' instead of just declaring it on
the stack. It is only used within a single function. Having it on the
stack is *REALLY* nice because the code ends up looking like:

struct foo foo;
get_info(&foo);
cc_set_bar(foo.bar);

The dependencies and scope are just stupidly obvious if you do that.

> void __init tdx_early_init(void)
> {
> u32 eax, sig[3];
> @@ -37,5 +70,15 @@ void __init tdx_early_init(void)
>
> setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
> + get_info();
> +
> + cc_set_vendor(CC_VENDOR_INTEL);
> +
> + /*
> + * The highest bit of a guest physical address is the "sharing" bit.
> + * Set it for shared pages and clear it for private pages.
> + */
> + cc_set_mask(BIT_ULL(td_info.gpa_width - 1));
> +
> pr_info("Guest detected\n");
> }
I really want to start acking these things and get them moved along to
the next step. There are a few too many open questions, though.

2022-02-24 19:31:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCHv4 09/30] x86/tdx: Add MSR support for TDX guests

On Thu, Feb 24, 2022, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> > index 0a2e6be0cdae..89992593a209 100644
> > --- a/arch/x86/coco/tdx.c
> > +++ b/arch/x86/coco/tdx.c
> > @@ -116,6 +116,44 @@ void __cpuidle tdx_safe_halt(void)
> > WARN_ONCE(1, "HLT instruction emulation failed\n");
> > }
> >
> > +static bool read_msr(struct pt_regs *regs)
> > +{
> > + struct tdx_hypercall_args args = {
> > + .r10 = TDX_HYPERCALL_STANDARD,
> > + .r11 = EXIT_REASON_MSR_READ,
>
> Just a minor note: these "EXIT_REASON_FOO"'s in r11 are effectively
> *the* hypercall being made, right?
>
> The hypercall is being made in response to what would have otherwise
> been a MSR read VMEXIT. But, it's a *bit* goofy to see them here when
> the TDX guest isn't doing any kind of VMEXIT.

But the TDX guest is doing a VM-Exit, that's all TDCALL is, an exit to the host.
r10 states that this is a GHCI-standard hypercall, r11 holds the reason why the
guest is exiting to the host. The guest could pretty it up by redefining all the
VM-Exit reasons as TDX_REQUEST_MSR_READ or whatever, but IMO diverging from
directly using EXIT_REASON_* will be annoying in the long run, e.g. will make it
more difficult to grep KVM + kernel to understand the end-to-end flow.

2022-02-24 20:11:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 09/30] x86/tdx: Add MSR support for TDX guests

On 2/24/22 11:04, Sean Christopherson wrote:
> On Thu, Feb 24, 2022, Dave Hansen wrote:
>> On 2/24/22 07:56, Kirill A. Shutemov wrote:
>>> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
>>> index 0a2e6be0cdae..89992593a209 100644
>>> --- a/arch/x86/coco/tdx.c
>>> +++ b/arch/x86/coco/tdx.c
>>> @@ -116,6 +116,44 @@ void __cpuidle tdx_safe_halt(void)
>>> WARN_ONCE(1, "HLT instruction emulation failed\n");
>>> }
>>>
>>> +static bool read_msr(struct pt_regs *regs)
>>> +{
>>> + struct tdx_hypercall_args args = {
>>> + .r10 = TDX_HYPERCALL_STANDARD,
>>> + .r11 = EXIT_REASON_MSR_READ,
>> Just a minor note: these "EXIT_REASON_FOO"'s in r11 are effectively
>> *the* hypercall being made, right?
>>
>> The hypercall is being made in response to what would have otherwise
>> been a MSR read VMEXIT. But, it's a *bit* goofy to see them here when
>> the TDX guest isn't doing any kind of VMEXIT.
> But the TDX guest is doing a VM-Exit, that's all TDCALL is, an exit to the host.
> r10 states that this is a GHCI-standard hypercall, r11 holds the reason why the
> guest is exiting to the host. The guest could pretty it up by redefining all the
> VM-Exit reasons as TDX_REQUEST_MSR_READ or whatever, but IMO diverging from
> directly using EXIT_REASON_* will be annoying in the long run, e.g. will make it
> more difficult to grep KVM + kernel to understand the end-to-end flow.

I understand that it looks like an "exit" if you know how it's
implemented, know the history and squint at it funny. But, r11 is not
an exit reason. It's a hypercall number that just sometimes happens to
also take an exit reason as a convention. Don't confuse that with "r11
*is* an exit reason".

Heck, look at the GHCI spec. Does it simply cede some of the
sub-function space and map them directly to VMEXIT reasons? Nope. It
goes to the trouble of individually defining them:

12 Instruction.HLT
30 Instruction.IO
31 Instruction.RDMSR
32 Instruction.WRMSR
48 #VE.RequestMMIO
65 Instruction.PCONFIG

I'm not saying we need 15 new #defines. It would be really nice like
you say to be able to connect the host and guest sides with a grep. I
wouldn't hate if we did something like:

.r11 = hcall_func(EXIT_REASON_MSR_READ),

That retains greppability, but also tells you that r11 is a function
number. It even gives a nice place to stick a comment to say what the
heck is going on:

/*
* The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
* independently from but are currently matched 1:1 with VMX
* EXIT_REASONs. Reusing the KVM EXIT_REASON macros makes it easier to
* connect the host and guest sides of these calls.
*/
static u64 hcall_func(u64 exit_reason)
{
return exit_reason;
}

Like I said, this is all a minor note. But, things like this really go
a long way for folks like me who don't spend our days looking at the KVM
code or thinking deeply about how the hypercall is implemented.

2022-02-24 21:26:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> {
> - pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> - return false;
> + switch (ve->exit_reason) {
> + case EXIT_REASON_CPUID:
> + return handle_cpuid(regs);
> + default:
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> + }
> }

What does this mean for userspace? What kinds of things are we ceding
to the (untrusted) VMM to supply to userspace?

> /* Handle the kernel #VE */
> @@ -200,6 +235,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> return read_msr(regs);
> case EXIT_REASON_MSR_WRITE:
> return write_msr(regs);
> + case EXIT_REASON_CPUID:
> + return handle_cpuid(regs);
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return false;
What kinds of random CPUID uses in the kernel at runtime need this
handling? Is it really OK that we let the VMM inject arbitrary CPUID
values into random CPUID uses in the kernel... silently?

Is this better than just returning 0's, for instance?

2022-02-24 21:27:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 12/30] x86/tdx: Detect TDX at early kernel decompression time

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> The early decompression code does port I/O for its console output. But,
> handling the decompression-time port I/O demands a different approach
> from normal runtime because the IDT required to support #VE based port
> I/O emulation is not yet set up. Paravirtualizing I/O calls during
> the decompression step is acceptable because the decompression code size is
> small enough and hence patching it will not bloat the image size a lot.

It's not the *decompression* code size that matters. It's that there
aren't a lot of call sites to the I/O instructions. Right?

> To support port I/O in decompression code, TDX must be detected before
> the decompression code might do port I/O. Detect whether the kernel runs
> in a TDX guest.
>
> Add an early_is_tdx_guest() interface to query the cached TDX guest
> status in the decompression code.

Nit: I was a bit surprised by the minor cpuid() munging and the new
shared/tdx.h header. They look sane, but you can reduce reviewer
surprise by adding a sentence or two of changelog material.

> The actual port I/O paravirtualization will come later in the series.
>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Acked-by: Dave Hansen <[email protected]>

2022-02-24 22:29:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the
> kernel:
>
> * Specific instructions (WBINVD, for example)
> * Specific MSR accesses
> * Specific CPUID leaf accesses
> * Access to unmapped pages (EPT violation)

Considering that you're talking partly about userspace, it would be nice
to talk about what "unmapped" really means here.

> In the settings that Linux will run in, virtualization exceptions are
> never generated on accesses to normal, TD-private memory that has been
> accepted.

This is getting into nit territory. But, at this point a normal reader
has no idea what "accepted" memory is.

> Syscall entry code has a critical window where the kernel stack is not
> yet set up. Any exception in this window leads to hard to debug issues
> and can be exploited for privilege escalation. Exceptions in the NMI
> entry code also cause issues. Returning from the exception handler with
> IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
>
> For these reasons, the kernel avoids #VEs during the syscall gap and
> the NMI entry code. Entry code paths do not access TD-shared memory,
> MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> that might generate #VE. VMM can remove memory from TD at any point,
> but access to unaccepted (or missing) private memory leads to VM
> termination, not to #VE.
>
> Similarly to page faults and breakpoints, #VEs are allowed in NMI
> handlers once the kernel is ready to deal with nested NMIs.
>
> During #VE delivery, all interrupts, including NMIs, are blocked until
> TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> the VE info.
>
> If a guest kernel action which would normally cause a #VE occurs in
> the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> exception) is delivered to the guest which will result in an oops.
>
> Add basic infrastructure to handle any #VE which occurs in the kernel
> or userspace. Later patches will add handling for specific #VE
> scenarios.
>
> For now, convert unhandled #VE's (everything, until later in this
> series) so that they appear just like a #GP by calling the
> ve_raise_fault() directly. The ve_raise_fault() function is similar
> to #GP handler and is responsible for sending SIGSEGV to userspace
> and CPU die and notifying debuggers and other die chain users.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/coco/tdx.c | 60 ++++++++++++++
> arch/x86/include/asm/idtentry.h | 4 +
> arch/x86/include/asm/tdx.h | 21 +++++
> arch/x86/kernel/idt.c | 3 +
> arch/x86/kernel/traps.c | 138 ++++++++++++++++++++++++++------
> 5 files changed, 203 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index 14c085930b5f..86a2f35e7308 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -10,6 +10,7 @@
>
> /* TDX module Call Leaf IDs */
> #define TDX_GET_INFO 1
> +#define TDX_GET_VEINFO 3
>
> static struct {
> unsigned int gpa_width;
> @@ -58,6 +59,65 @@ static void get_info(void)
> td_info.attributes = out.rdx;
> }
>
> +void tdx_get_ve_info(struct ve_info *ve)
> +{
> + struct tdx_module_output out;
> +
> + /*
> + * Retrieve the #VE info from the TDX module, which also clears the "#VE
> + * valid" flag. This must be done before anything else as any #VE that
> + * occurs while the valid flag is set, i.e. before the previous #VE info
> + * was consumed, is morphed to a #DF by the TDX module.


That's a really weird sentence. It doesn't really parse for me. It
might be the misplaced comma after "consumed,".

For what it's worth, I think "i.e." and "e.g." have been over used in
the TDX text (sorry Sean). They lead to really weird sentence structure.

> + * Note, the TDX
> + * module also treats virtual NMIs as inhibited if the #VE valid flag is
> + * set, e.g. so that NMI=>#VE will not result in a #DF.
> + */

Are we missing anything valuable if we just trim the comment down to
something like:

	/*
	 * Called during #VE handling to retrieve the #VE info from the
	 * TDX module.
	 *
	 * This should be done early in #VE handling.  A "nested" #VE
	 * which occurs before this will raise a #DF and is not
	 * recoverable.
	 */

For what it's worth, I don't think we care who "morphs" things. We just
care about the fallout.

> + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);

How about a one-liner below here:

/* Interrupts and NMIs can be delivered again. */

> + ve->exit_reason = out.rcx;
> + ve->exit_qual = out.rdx;
> + ve->gla = out.r8;
> + ve->gpa = out.r9;
> + ve->instr_len = lower_32_bits(out.r10);
> + ve->instr_info = upper_32_bits(out.r10);
> +}
> +
> +/*
> + * Handle the user initiated #VE.
> + *
> + * For example, executing the CPUID instruction from user space
> + * is a valid case and hence the resulting #VE has to be handled.
> + *
> + * For dis-allowed or invalid #VE just return failure.
> + */

This is just insane to have in the series at this point. It says that
the "#VE has to be handled" and then doesn't handle it!

> +static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> +{
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> +}
> +
> +/* Handle the kernel #VE */
> +static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> +{
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> +}
> +
> +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> +{
> + bool ret;
> +
> + if (user_mode(regs))
> + ret = virt_exception_user(regs, ve);
> + else
> + ret = virt_exception_kernel(regs, ve);
> +
> + /* After successful #VE handling, move the IP */
> + if (ret)
> + regs->ip += ve->instr_len;
> +
> + return ret;
> +}

At this point in the series, these three functions can be distilled down to:

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
{
	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);

	return false;
}

> void __init tdx_early_init(void)
> {
> u32 eax, sig[3];
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 1345088e9902..8ccc81d653b3 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
> DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
> #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
> +#endif
> +
> /* Device interrupts common/spurious */
> DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
> #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 557227e40da9..34cf998ad534 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -5,6 +5,7 @@
>
> #include <linux/bits.h>
> #include <linux/init.h>
> +#include <asm/ptrace.h>
>
> #define TDX_CPUID_LEAF_ID 0x21
> #define TDX_IDENT "IntelTDX "
> @@ -47,6 +48,22 @@ struct tdx_hypercall_args {
> u64 r15;
> };
>
> +/*
> + * Used by the #VE exception handler to gather the #VE exception
> + * info from the TDX module. This is a software only structure
> + * and not part of the TDX module/VMM ABI.
> + */
> +struct ve_info {
> + u64 exit_reason;
> + u64 exit_qual;
> + /* Guest Linear (virtual) Address */
> + u64 gla;
> + /* Guest Physical (virtual) Address */
> + u64 gpa;

"Physical (virtual) Address"?

> + u32 instr_len;
> + u32 instr_info;
> +};
> +
> #ifdef CONFIG_INTEL_TDX_GUEST
>
> void __init tdx_early_init(void);
> @@ -58,6 +75,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> /* Used to request services from the VMM */
> u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
>
> +void tdx_get_ve_info(struct ve_info *ve);
> +
> +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
> +
> #else
>
> static inline void tdx_early_init(void) { };
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index df0fa695bb09..1da074123c16 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
> */
> INTG(X86_TRAP_PF, asm_exc_page_fault),
> #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
> };
>
> /*
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7ef00dee35be..b2510af38158 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -62,6 +62,7 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/vdso.h>
> +#include <asm/tdx.h>
>
> #ifdef CONFIG_X86_64
> #include <asm/x86_init.h>
> @@ -611,13 +612,43 @@ static bool try_fixup_enqcmd_gp(void)
> #endif
> }
>
> +static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
> + unsigned long error_code, const char *str)
> +{
> + int ret;
> +
> + if (fixup_exception(regs, trapnr, error_code, 0))
> + return true;
> +
> + current->thread.error_code = error_code;
> + current->thread.trap_nr = trapnr;
> +
> + /*
> + * To be potentially processing a kprobe fault and to trust the result
> + * from kprobe_running(), we have to be non-preemptible.
> + */
> + if (!preemptible() && kprobe_running() &&
> + kprobe_fault_handler(regs, trapnr))
> + return true;
> +
> + ret = notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV);
> + return ret == NOTIFY_STOP;
> +}
> +
> +static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
> + unsigned long error_code, const char *str)
> +{
> + current->thread.error_code = error_code;
> + current->thread.trap_nr = trapnr;
> + show_signal(current, SIGSEGV, "", str, regs, error_code);
> + force_sig(SIGSEGV);
> +}
> +
> DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> {
> char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
> enum kernel_gp_hint hint = GP_NO_HINT;
> - struct task_struct *tsk;
> unsigned long gp_addr;
> - int ret;
>
> if (user_mode(regs) && try_fixup_enqcmd_gp())
> return;
> @@ -636,40 +667,21 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> return;
> }
>
> - tsk = current;
> -
> if (user_mode(regs)) {
> if (fixup_iopl_exception(regs))
> goto exit;
>
> - tsk->thread.error_code = error_code;
> - tsk->thread.trap_nr = X86_TRAP_GP;
> -
> if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
> goto exit;
>
> - show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
> - force_sig(SIGSEGV);
> + gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
> goto exit;
> }
>
> if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
> goto exit;
>
> - tsk->thread.error_code = error_code;
> - tsk->thread.trap_nr = X86_TRAP_GP;
> -
> - /*
> - * To be potentially processing a kprobe fault and to trust the result
> - * from kprobe_running(), we have to be non-preemptible.
> - */
> - if (!preemptible() &&
> - kprobe_running() &&
> - kprobe_fault_handler(regs, X86_TRAP_GP))
> - goto exit;
> -
> - ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
> - if (ret == NOTIFY_STOP)
> + if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
> goto exit;
>
> if (error_code)
> @@ -1267,6 +1279,86 @@ DEFINE_IDTENTRY(exc_device_not_available)
> }
> }

I'm glad the exc_general_protection() code is getting refactored and not
copied. That's nice. The refactoring really needs to be in a separate
patch, though.

> +#ifdef CONFIG_INTEL_TDX_GUEST
> +
> +#define VE_FAULT_STR "VE fault"
> +
> +static void ve_raise_fault(struct pt_regs *regs, long error_code)
> +{
> + if (user_mode(regs)) {
> + gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
> + return;
> + }
> +
> + if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
> + return;
> +
> + die_addr(VE_FAULT_STR, regs, error_code, 0);
> +}
> +
> +/*
> + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> + * specific guest actions which may happen in either user space or the
> + * kernel:
> + *
> + * * Specific instructions (WBINVD, for example)
> + * * Specific MSR accesses
> + * * Specific CPUID leaf accesses
> + * * Access to unmapped pages (EPT violation)
> + *
> + * In the settings that Linux will run in, virtualization exceptions are
> + * never generated on accesses to normal, TD-private memory that has been
> + * accepted.

This actually makes a lot more sense as a code comment than changelog.
It would be really nice to circle back here and actually refer to the
functions that accept memory.

> + * Syscall entry code has a critical window where the kernel stack is not
> + * yet set up. Any exception in this window leads to hard to debug issues
> + * and can be exploited for privilege escalation. Exceptions in the NMI
> + * entry code also cause issues. Returning from the exception handler with
> + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> + *
> + * For these reasons, the kernel avoids #VEs during the syscall gap and
> + * the NMI entry code. Entry code paths do not access TD-shared memory,
> + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> + * that might generate #VE. VMM can remove memory from TD at any point,
> + * but access to unaccepted (or missing) private memory leads to VM
> + * termination, not to #VE.
> + *
> + * Similarly to page faults and breakpoints, #VEs are allowed in NMI
> + * handlers once the kernel is ready to deal with nested NMIs.
> + *
> + * During #VE delivery, all interrupts, including NMIs, are blocked until
> + * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> + * the VE info.
> + *
> + * If a guest kernel action which would normally cause a #VE occurs in
> + * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> + * exception) is delivered to the guest which will result in an oops.
> + */
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> + struct ve_info ve;
> +
> + /*
> + * NMIs/Machine-checks/Interrupts will be in a disabled state
> + * till TDGETVEINFO TDCALL is executed. This ensures that VE
> + * info cannot be overwritten by a nested #VE.
> + */
> + tdx_get_ve_info(&ve);
> +
> + cond_local_irq_enable(regs);
> +
> + /*
> + * If tdx_handle_virt_exception() could not process
> + * it successfully, treat it as #GP(0) and handle it.
> + */
> + if (!tdx_handle_virt_exception(regs, &ve))
> + ve_raise_fault(regs, 0);
> +
> + cond_local_irq_disable(regs);
> +}
> +
> +#endif
> +
> #ifdef CONFIG_X86_32
> DEFINE_IDTENTRY_SW(iret_error)
> {

2022-02-24 22:43:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 09/30] x86/tdx: Add MSR support for TDX guests

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> Use hypercall to emulate MSR read/write for the TDX platform.
>
> There are two viable approaches for doing MSRs in a TD guest:
>
> 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
> do. Some will succeed, others will cause a #VE. All of those that
> cause a #VE will be handled with a TDCALL.
> 2. Use paravirt infrastructure. The paravirt hook has to keep a list
> of which MSRs would cause a #VE and use a TDCALL. All other MSRs
> execute RDMSR/WRMSR instructions directly.
>
> The second option can be ruled out because the list of MSRs was
> challenging to maintain. That leaves option #1 as the only viable
> solution for the minimal TDX support.
>
> For performance-critical MSR writes (like TSC_DEADLINE), future patches
> will replace the WRMSR/#VE sequence with the direct TDCALL.

This will still leave us with a list of non-#VE-inducing MSRs. That's
not great. But, if we miss an MSR in the performance-critical list, the
result is a slow WRMSR->#VE. If we miss an MSR in the paravirt
approach, we induce a fatal #VE.

Please add something to that effect if you revise this patch.
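
For illustration, the direct-TDCALL fast path being referred to could look
roughly like the sketch below.  The helper name and its wiring are
assumptions of mine, not something in this series; it just shows the
WRMSR->#VE->hypercall round trip being skipped for one known MSR.

static void tdx_fast_wrmsr(unsigned int msr, u64 val)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_MSR_WRITE,
		.r12 = msr,
		.r13 = val,
	};

	/* A failure here means the VMM refused a write it normally allows. */
	if (__tdx_hypercall(&args, 0))
		WARN_ONCE(1, "Direct MSR write hypercall failed: %#x\n", msr);
}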

> RDMSR and WRMSR specification details can be found in
> Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
> Extensions (Intel TDX) specification, sec titled "TDG.VP.
> VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
>
> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/coco/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index 0a2e6be0cdae..89992593a209 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -116,6 +116,44 @@ void __cpuidle tdx_safe_halt(void)
> WARN_ONCE(1, "HLT instruction emulation failed\n");
> }
>
> +static bool read_msr(struct pt_regs *regs)
> +{
> + struct tdx_hypercall_args args = {
> + .r10 = TDX_HYPERCALL_STANDARD,
> + .r11 = EXIT_REASON_MSR_READ,

Just a minor note: these "EXIT_REASON_FOO"'s in r11 are effectively
*the* hypercall being made, right?

The hypercall is being made in response to what would have otherwise
been a MSR read VMEXIT. But, it's a *bit* goofy to see them here when
the TDX guest isn't doing any kind of VMEXIT.

I wish there were some clarity around it, but it's not a deal breaker.

> + .r12 = regs->cx,
> + };
> +
> + /*
> + * Emulate the MSR read via hypercall. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface
> + * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
> + */
> + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> + return false;
> +
> + regs->ax = lower_32_bits(args.r11);
> + regs->dx = upper_32_bits(args.r11);
> + return true;
> +}
> +
> +static bool write_msr(struct pt_regs *regs)
> +{
> + struct tdx_hypercall_args args = {
> + .r10 = TDX_HYPERCALL_STANDARD,
> + .r11 = EXIT_REASON_MSR_WRITE,
> + .r12 = regs->cx,
> + .r13 = (u64)regs->dx << 32 | regs->ax,
> + };
> +
> + /*
> + * Emulate the MSR write via hypercall. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface
> + * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
> + */
> + return !__tdx_hypercall(&args, 0);
> +}
> +
> void tdx_get_ve_info(struct ve_info *ve)
> {
> struct tdx_module_output out;
> @@ -158,6 +196,10 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> switch (ve->exit_reason) {
> case EXIT_REASON_HLT:
> return handle_halt();
> + case EXIT_REASON_MSR_READ:
> + return read_msr(regs);
> + case EXIT_REASON_MSR_WRITE:
> + return write_msr(regs);
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return false;

I still think it's annoying that all these WRMSR's are turned into #VE,
but this does seem like the best approach given the architecture that we
have. Having the optimized ones seems like a good compromise.

Acked-by: Dave Hansen <[email protected]>

2022-02-24 23:26:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 16/30] x86/boot/compressed: Support TDX guest port I/O at decompression time

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> @@ -24,4 +88,11 @@ void early_tdx_detect(void)
>
> /* Cache TDX guest feature status */
> tdx_guest_detected = true;
> +
> + pio_ops.inb = tdx_inb;
> + pio_ops.inw = tdx_inw;
> + pio_ops.inl = tdx_inl;
> + pio_ops.outb = tdx_outb;
> + pio_ops.outw = tdx_outw;
> + pio_ops.outl = tdx_outl;
> }

I guess the kernel isn't going to get far if any of this goes wrong.
But, I do kinda wish that code ^^ was connected to the below code somehow:

> +static inline void init_io_ops(void)
> +{
> + pio_ops.inb = inb;
> + pio_ops.inw = inw;
> + pio_ops.inl = inl;
> + pio_ops.outb = outb;
> + pio_ops.outw = outw;
> + pio_ops.outl = outl;
> +}

Maybe just a comment would do it. Or, maybe init_io_ops() should just
be called init_default_io_ops(). I think this would do:

	/*
	 * Use the normal I/O instructions by default.
	 * TDX guests override these to use hypercalls.
	 */

if it went in init_io_ops() from the last patch.
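
Put together, the default setup might read like this.  This is only a
sketch: the struct layout and prototypes are inferred from the assignments
quoted above, with the suggested rename applied.

struct port_io_ops {
	u8   (*inb)(u16 port);
	u16  (*inw)(u16 port);
	u32  (*inl)(u16 port);
	void (*outb)(u8 value, u16 port);
	void (*outw)(u16 value, u16 port);
	void (*outl)(u32 value, u16 port);
};

static struct port_io_ops pio_ops;

static inline void init_default_io_ops(void)
{
	/*
	 * Use the normal I/O instructions by default.
	 * TDX guests override these to use hypercalls.
	 */
	pio_ops.inb  = inb;
	pio_ops.inw  = inw;
	pio_ops.inl  = inl;
	pio_ops.outb = outb;
	pio_ops.outw = outw;
	pio_ops.outl = outl;
}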

>
> +static inline unsigned int tdx_io_in(int size, u16 port)
> +{
> + struct tdx_hypercall_args args = {
> + .r10 = TDX_HYPERCALL_STANDARD,
> + .r11 = EXIT_REASON_IO_INSTRUCTION,
> + .r12 = size,
> + .r13 = 0,

^ munged whitespace?

> + .r14 = port,
> + };
> +
> + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> + return UINT_MAX;
> +
> + return args.r11;
> +}

With that fixed:

Acked-by: Dave Hansen <[email protected]>

2022-02-25 00:08:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 13/30] x86: Adjust types used in port I/O helpers

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> Change port I/O helpers to use u8/u16/u32 instead of unsigned
> char/short/int for values. Use u16 instead of int for port number.
>
> It aligns the helpers with implementation in boot stub in preparation
> for consolidation.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Acked-by: Dave Hansen <[email protected]>

2022-02-25 00:24:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 17/30] x86/tdx: Add port I/O emulation

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> TDX hypervisors cannot emulate instructions directly. This includes
> port I/O which is normally emulated in the hypervisor. All port I/O
> instructions inside TDX trigger the #VE exception in the guest and
> would be normally emulated there.
>
> Use a hypercall to emulate port I/O. Extend the
> tdx_handle_virt_exception() and add support to handle the #VE due to
> port I/O instructions.
>
> String I/O operations are not supported in TDX. Unroll them by declaring
> CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/coco/core.c | 7 +++++-
> arch/x86/coco/tdx.c | 54 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 60 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index 9113baebbfd2..5615b75e6fc6 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -18,7 +18,12 @@ static u64 cc_mask __ro_after_init;
>
> static bool intel_cc_platform_has(enum cc_attr attr)
> {
> - return false;
> + switch (attr) {
> + case CC_ATTR_GUEST_UNROLL_STRING_IO:
> + return true;
> + default:
> + return false;
> + }
> }
>
> /*
> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index 15519e498679..2e342760b1d2 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -19,6 +19,12 @@
> #define EPT_READ 0
> #define EPT_WRITE 1
>
> +/* See Exit Qualification for I/O Instructions in VMX documentation */
> +#define VE_IS_IO_IN(e) ((e) & BIT(3))
> +#define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1)
> +#define VE_GET_PORT_NUM(e) ((e) >> 16)
> +#define VE_IS_IO_STRING(e) ((e) & BIT(4))
> +
> static struct {
> unsigned int gpa_width;
> unsigned long attributes;
> @@ -292,6 +298,52 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> return true;
> }
>
> +/*
> + * Emulate I/O using hypercall.
> + *
> + * Assumes the IO instruction was using ax, which is enforced
> + * by the standard io.h macros.
> + *
> + * Return True on success or False on failure.
> + */
> +static bool handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> + struct tdx_hypercall_args args = {
> + .r10 = TDX_HYPERCALL_STANDARD,
> + .r11 = EXIT_REASON_IO_INSTRUCTION,
> + };
> + int size, port;
> + u64 mask;
> + bool in, ret;
> +
> + if (VE_IS_IO_STRING(exit_qual))
> + return false;
> +
> + in = VE_IS_IO_IN(exit_qual);
> + size = VE_GET_IO_SIZE(exit_qual);
> + port = VE_GET_PORT_NUM(exit_qual);
> + mask = GENMASK(BITS_PER_BYTE * size, 0);
> +
> + args.r12 = size;
> + args.r13 = !in;
> + args.r14 = port;
> + args.r15 = in ? 0 : regs->ax;
> +
> + /*
> + * Emulate the I/O read/write via hypercall. More info about
> + * ABI can be found in TDX Guest-Host-Communication Interface
> + * (GHCI) section titled "TDG.VP.VMCALL<Instruction.IO>".
> + */
> + ret = !__tdx_hypercall(&args, in ? TDX_HCALL_HAS_OUTPUT : 0);
> + if (!ret || !in)
> + return ret;
> +
> + regs->ax &= ~mask;
> + regs->ax |= ret ? args.r11 & mask : UINT_MAX;
> +
> + return ret;
> +}


I can't help but think this would be clearer if the in and !in sides
were separated:

	if (!in) {
		args.r15 = regs->ax;
		ret = !__tdx_hypercall(&args, 0);
	} else {
		args.r15 = 0;
		ret = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT);
		regs->ax &= ~mask;
		if (ret)
			regs->ax |= args.r11 & mask;
		else
			regs->ax |= UINT_MAX;
	}

	return ret;

The ->ax mangling also needs a comment. It looks like it's trying to
inject -1 when there's a failure. Even if the roots of this are in the
TDX spec(s), it would be nice to express the intent of this code.

I also really dislike using 'ret' as a bool return type. Wouldn't this
be about a billion times clearer if 'ret' was renamed to 'success'?

Right now, this code is effectively doing:

if (!ret)
return error;

Which is actually functionally correct here. All other int-typed 'ret'
code in the kernel do this:

if (ret)
return error;

which would be functionally _wrong_. Can you please take mercy on my
little brain and at least make the code look different if it behaves the
opposite from the same literal code in other spots in the kernel?

Heck, even 'bool handled' would be pretty nice.
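
Folding the restructuring, the rename and a comment about the -1 behavior
together, the function might end up looking like the sketch below.  This is
not the actual patch; the mask here covers exactly 'size' bytes and the
failure value is the masked all-ones pattern rather than a bare UINT_MAX.

static bool handle_io(struct pt_regs *regs, u32 exit_qual)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_IO_INSTRUCTION,
	};
	int size, port;
	bool in, success;
	u64 mask;

	if (VE_IS_IO_STRING(exit_qual))
		return false;

	in   = VE_IS_IO_IN(exit_qual);
	size = VE_GET_IO_SIZE(exit_qual);
	port = VE_GET_PORT_NUM(exit_qual);
	mask = GENMASK(BITS_PER_BYTE * size - 1, 0);

	args.r12 = size;
	args.r13 = !in;
	args.r14 = port;

	/* See GHCI section "TDG.VP.VMCALL<Instruction.IO>" for the ABI. */
	if (!in) {
		args.r15 = regs->ax;
		return !__tdx_hypercall(&args, 0);
	}

	success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT);

	/*
	 * Mirror real port I/O: a failed IN reads back as all-ones in the
	 * destination register.
	 */
	regs->ax &= ~mask;
	regs->ax |= success ? args.r11 & mask : mask;

	return success;
}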

> void tdx_get_ve_info(struct ve_info *ve)
> {
> struct tdx_module_output out;
> @@ -347,6 +399,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> return handle_cpuid(regs);
> case EXIT_REASON_EPT_VIOLATION:
> return handle_mmio(regs, ve);
> + case EXIT_REASON_IO_INSTRUCTION:
> + return handle_io(regs, ve->exit_qual);
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return false;

2022-02-25 01:26:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 18/30] x86/tdx: Handle early boot port I/O

I wish this was telling more of a story. There *is* a story to be told
and this series is really missing an opportunity to tell it. The last
three patches do the same logical thing: add support for I/O
instructions when running as a TDX guest. But, the three subjects call
it: "Support", "Add" and "Handle". All three talk about "port I/O", but
in different ways.

Imagine you had the subjects be:

x86/boot: Port I/O: add decompression-time support for TDX
x86/tdx: Port I/O: add runtime hypercalls
x86/tdx: Port I/O: add early boot support

That makes it be visually *obvious* what's going on. All three are
covering the same ground: "Port I/O". They're all adding something. In
succession they add the same basic thing for
{decompression,runtime,early} code.

I mentioned this exact thing to *somebody* about this exact part of the
series, who knows when. But, it still bugs me...

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> From: Andi Kleen <[email protected]>
>
> TDX guests cannot do port I/O directly. The TDX module triggers a #VE
> exception to let the guest kernel emulate port I/O by converting them
> into TDCALLs to call the host.

As part of telling the story, it would be best to refer to the code that
you introduced in the last few patches. "At runtime..." could hearken
back to the subject from two patches ago.

Anyway, the code is fine.

Acked-by: Dave Hansen <[email protected]>

2022-02-25 01:42:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 05/30] x86/tdx: Extend the confidential computing API to support TDX guests

On 2/24/22 15:54, Kirill A. Shutemov wrote:
>
>> Second, why have the global 'td_info' instead of just declaring it on
>> the stack. It is only used within a single function. Having it on the
>> stack is *REALLY* nice because the code ends up looking like:
>>
>> struct foo foo;
>> get_info(&foo);
>> cc_set_bar(foo.bar);
>>
>> The dependencies and scope are just stupidly obvious if you do that.
> Okay, I will rework it with plain gpa_width on stack and get_info(&gpa_width);
> Attributes will be needed after core enabling, so I will drop it from
> here.

I don't mind the 'struct tdx_info' if there's going to be more stuff in
it soon-ish. Having a single member is fine for now. Just make it
clear that the seamcall returns a bunch of stuff and only a subset of it
is used right now.
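
For reference, the on-stack rework being agreed on would presumably end up
looking something like this.  It is only a sketch; the GPA-width extraction
and the cc_set_mask() usage follow my reading of the patch under discussion.

static void get_info(unsigned int *gpa_width)
{
	struct tdx_module_output out;

	/*
	 * TDINFO returns a bunch of TD-scope metadata; only the GPA width
	 * is consumed here for now.
	 */
	tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);

	*gpa_width = out.rcx & GENMASK(5, 0);
}

void __init tdx_early_init(void)
{
	unsigned int gpa_width;

	/* ... TDX detection elided ... */

	get_info(&gpa_width);

	/* The highest GPA bit differentiates shared from private mappings. */
	cc_set_mask(BIT_ULL(gpa_width - 1));
}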

2022-02-25 01:55:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 15/30] x86/boot: Allow to hook up alternative port I/O helpers

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> Port I/O instructions trigger #VE in the TDX environment. In response to
> the exception, kernel emulates these instructions using hypercalls.
>
> But during early boot, on the decompression stage, it is cumbersome to
> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> handling.
>
> Add a way to hook up alternative port I/O helpers in the boot stub.

I'd add one more sentence to that:

... with a new pio_ops structure. For now, set the ops structure to
just call the normal I/O operation functions.

But, either way:

Acked-by: Dave Hansen <[email protected]>

2022-02-25 02:41:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 05/30] x86/tdx: Extend the confidential computing API to support TDX guests

On Thu, Feb 24, 2022 at 09:54:16AM -0800, Dave Hansen wrote:
>
> This left me wondering two things. First, why this bothers storing
> 'gpa_width' when it's only used to generate a mask? Why not just store
> the mask in the structure?

It was needed when tdx_shared_mask() was a thing. It takes a pair of
fresh eyes to break the inertia.

> Second, why have the global 'td_info' instead of just declaring it on
> the stack. It is only used within a single function. Having it on the
> stack is *REALLY* nice because the code ends up looking like:
>
> struct foo foo;
> get_info(&foo);
> cc_set_bar(foo.bar);
>
> The dependencies and scope are just stupidly obvious if you do that.

Okay, I will rework it with plain gpa_width on stack and get_info(&gpa_width);
Attributes will be needed after core enabling, so I will drop it from
here.

--
Kirill A. Shutemov

2022-02-25 02:53:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 11/30] x86/tdx: Handle in-kernel MMIO

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> which will cause a VMEXIT on access and then the VMM emulating the
> instruction that caused the VMEXIT. That's not possible for TDX VM.
>
> To emulate an instruction an emulator needs two things:
>
> - R/W access to the register file to read/modify instruction arguments
> and see RIP of the faulted instruction.
>
> - Read access to memory where instruction is placed to see what to
> emulate. In this case it is guest kernel text.
>
> Both of them are not available to VMM in TDX environment:
>
> - Register file is never exposed to VMM. When a TD exits to the module,
> it saves registers into the state-save area allocated for that TD.
> The module then scrubs these registers before returning execution
> control to the VMM, to help prevent leakage of TD state.
>
> - Memory is encrypted a TD-private key. The CPU disallows software
> other than the TDX module and TDs from making memory accesses using
> the private key.
>
> In TDX the MMIO regions are instead configured to trigger a #VE
> exception in the guest. The guest #VE handler then emulates the MMIO
> instruction inside the guest and converts it into a controlled hypercall
> to the host.

Nit on the changelog: This never really comes out and explicitly says
what *this* patch does. It never transitions into imperative voice. Maybe:

In TDX, MMIO regions are configured by ____ to trigger a #VE
exception in the guest.

Add #VE handling that emulates the MMIO instruction inside the
guest and converts it into a controlled hypercall.

I found this next transition jarring. Maybe add a section title:

=== Limitations of this approach ===

> MMIO addresses can be used with any CPU instruction that accesses
> memory. Address only MMIO accesses done via io.h helpers, such as
> 'readl()' or 'writeq()'.

Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO accesses are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.

> readX()/writeX() helpers limit the range of instructions which can trigger
> MMIO. It makes MMIO instruction emulation feasible. Raw access to a MMIO
> region allows the compiler to generate whatever instruction it wants.
> Supporting all possible instructions is a task of a different scope.

The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.

MMIO accesses performed without the io.h helpers are at the mercy of
the compiler. Compilers can and will generate a much broader set of
instructions which cannot practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.

This means that TDX guests *must* use the io.h helpers to access MMIO.

This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.

---

I found a few things lacking in that description. How's that for a rewrite?


> === Potential alternative approaches ===
>
> == Paravirtualizing all MMIO ==
>
> An alternative to letting MMIO induce a #VE exception is to avoid
> the #VE in the first place. Similar to the port I/O case, it is
> theoretically possible to paravirtualize MMIO accesses.
>
> Like the exception-based approach offered here, a fully paravirtualized
> approach would be limited to MMIO users that leverage common
> infrastructure like the io.h macros.
>
> However, any paravirtual approach would be patching approximately
> 120k call sites. With a conservative overhead estimation of 5 bytes per
> call site (CALL instruction), it leads to bloating code by 600k.

There's one important detail missing there:

Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call.

> Many drivers will never be used in the TDX environment and the bloat
> cannot be justified.
>
> == Patching TDX drivers ==
>
> Rather than touching the entire kernel, it might also be possible to
> just go after drivers that use MMIO in TDX guests. Right now, that's
> limited only to virtio and some x86-specific drivers.
>
> All virtio MMIO appears to be done through a single function, which
> makes virtio eminently easy to patch. This will be implemented in the
> future, removing the bulk of MMIO #VEs.

Given what is written here, this sounds like a great solution, especially
compared to all the instruction decoding nastiness. What's wrong with it?

> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index fd78b81a951d..15519e498679 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -8,11 +8,17 @@
> #include <asm/coco.h>
> #include <asm/tdx.h>
> #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <asm/insn-eval.h>
>
> /* TDX module Call Leaf IDs */
> #define TDX_GET_INFO 1
> #define TDX_GET_VEINFO 3
>
> +/* MMIO direction */
> +#define EPT_READ 0
> +#define EPT_WRITE 1
> +
> static struct {
> unsigned int gpa_width;
> unsigned long attributes;
> @@ -184,6 +190,108 @@ static bool handle_cpuid(struct pt_regs *regs)
> return true;
> }
>
> +static bool mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> + struct tdx_hypercall_args args = {
> + .r10 = TDX_HYPERCALL_STANDARD,
> + .r11 = EXIT_REASON_EPT_VIOLATION,
> + .r12 = size,
> + .r13 = EPT_READ,
> + .r14 = addr,
> + .r15 = *val,
> + };
> +
> + if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> + return false;
> + *val = args.r11;
> + return true;
> +}
> +
> +static bool mmio_write(int size, unsigned long addr, unsigned long val)
> +{
> + return !_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE,
> + addr, val);
> +}
> +
> +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> + char buffer[MAX_INSN_SIZE];
> + unsigned long *reg, val;
> + struct insn insn = {};
> + enum mmio_type mmio;
> + int size, extend_size;
> + u8 extend_val = 0;
> +
> + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> + return false;
> +
> + if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> + return false;
> +
> + mmio = insn_decode_mmio(&insn, &size);
> + if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> + return false;
> +
> + if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> + reg = insn_get_modrm_reg_ptr(&insn, regs);
> + if (!reg)
> + return false;
> + }
> +
> + ve->instr_len = insn.length;
> +
> + switch (mmio) {
> + case MMIO_WRITE:
> + memcpy(&val, reg, size);
> + return mmio_write(size, ve->gpa, val);
> + case MMIO_WRITE_IMM:
> + val = insn.immediate.value;
> + return mmio_write(size, ve->gpa, val);
> + case MMIO_READ:
> + case MMIO_READ_ZERO_EXTEND:
> + case MMIO_READ_SIGN_EXTEND:
> + break;
> + case MMIO_MOVS:
> + case MMIO_DECODE_FAILED:
> + return false;
> + default:
> + BUG();
> + }

Given the huge description above, it's borderline criminal to not
discuss what could lead to this BUG().

It could literally be some minor tweak in the compiler that caused a
non-io.h-using MMIO access to be converted to an instruction that
can't be decoded.

Could we spend a few lines of comments to help out the future poor sod
that sees "kernel bug at foo.c:1234"? Maybe:

	/*
	 * MMIO was accessed with an instruction that could not
	 * be decoded. It was likely not using io.h helpers or
	 * accessed MMIO accidentally.
	 */

> + /* Handle reads */
> + if (!mmio_read(size, ve->gpa, &val))
> + return false;
> +
> + switch (mmio) {
> + case MMIO_READ:
> + /* Zero-extend for 32-bit operation */
> + extend_size = size == 4 ? sizeof(*reg) : 0;
> + break;
> + case MMIO_READ_ZERO_EXTEND:
> + /* Zero extend based on operand size */
> + extend_size = insn.opnd_bytes;
> + break;
> + case MMIO_READ_SIGN_EXTEND:
> + /* Sign extend based on operand size */
> + extend_size = insn.opnd_bytes;
> + if (size == 1 && val & BIT(7))
> + extend_val = 0xFF;
> + else if (size > 1 && val & BIT(15))
> + extend_val = 0xFF;
> + break;
> + case MMIO_MOVS:
> + case MMIO_DECODE_FAILED:
> + return false;
> + default:
> + BUG();
> + }
> +
> + if (extend_size)
> + memset(reg, extend_val, extend_size);
> + memcpy(reg, &val, size);
> + return true;
> +}
> +
> void tdx_get_ve_info(struct ve_info *ve)
> {
> struct tdx_module_output out;
> @@ -237,6 +345,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> return write_msr(regs);
> case EXIT_REASON_CPUID:
> return handle_cpuid(regs);
> + case EXIT_REASON_EPT_VIOLATION:
> + return handle_mmio(regs, ve);
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return false;

2022-02-25 03:56:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 08/30] x86/tdx: Add HLT support for TDX guests

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> The HLT instruction is a privileged instruction, executing it stops
> instruction execution and places the processor in a HALT state. It
> is used in kernel for cases like reboot, idle loop and exception fixup
> handlers. For the idle case, interrupts will be enabled (using STI)
> before the HLT instruction (this is also called safe_halt()).
>
> To support the HLT instruction in TDX guests, it needs to be emulated
> using TDVMCALL (hypercall to VMM). More details about it can be found
> in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
> Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
>
> In TDX guests, executing HLT instruction will generate a #VE, which is
> used to emulate the HLT instruction. But #VE based emulation will not
> work for the safe_halt() flavor, because it requires STI instruction to
> be executed just before the TDCALL. Since idle loop is the only user of
> safe_halt() variant, handle it as a special case.
>
> To avoid *safe_halt() call in the idle function, define the
> tdx_guest_idle() and use it to override the "x86_idle" function pointer
> for a valid TDX guest.
>
> Alternative choices like PV ops have been considered for adding
> safe_halt() support. But it was rejected because HLT paravirt calls
> only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
> safe_halt() use case is not worth the cost.

Thanks for all the history and background here.

> diff --git a/arch/x86/coco/tdcall.S b/arch/x86/coco/tdcall.S
> index c4dd9468e7d9..3c35a056974d 100644
> --- a/arch/x86/coco/tdcall.S
> +++ b/arch/x86/coco/tdcall.S
> @@ -138,6 +138,19 @@ SYM_FUNC_START(__tdx_hypercall)
>
> movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>
> + /*
> + * For the idle loop STI needs to be called directly before the TDCALL
> + * that enters idle (EXIT_REASON_HLT case). STI instruction enables
> + * interrupts only one instruction later. If there is a window between
> + * STI and the instruction that emulates the HALT state, there is a
> + * chance for interrupts to happen in this window, which can delay the
> + * HLT operation indefinitely. Since this is the not the desired
> + * result, conditionally call STI before TDCALL.
> + */
> + testq $TDX_HCALL_ISSUE_STI, %rsi
> + jz .Lskip_sti
> + sti
> +.Lskip_sti:
> tdcall
>
> /*
> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index 86a2f35e7308..0a2e6be0cdae 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -7,6 +7,7 @@
> #include <linux/cpufeature.h>
> #include <asm/coco.h>
> #include <asm/tdx.h>
> +#include <asm/vmx.h>
>
> /* TDX module Call Leaf IDs */
> #define TDX_GET_INFO 1
> @@ -59,6 +60,62 @@ static void get_info(void)
> td_info.attributes = out.rdx;
> }
>
> +static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
> +{
> + struct tdx_hypercall_args args = {
> + .r10 = TDX_HYPERCALL_STANDARD,
> + .r11 = EXIT_REASON_HLT,
> + .r12 = irq_disabled,
> + };
> +
> + /*
> + * Emulate HLT operation via hypercall. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface
> + * (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
> + *
> + * The VMM uses the "IRQ disabled" param to understand IRQ
> + * enabled status (RFLAGS.IF) of the TD guest and to determine
> + * whether or not it should schedule the halted vCPU if an
> + * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
> + * can keep the vCPU in virtual HLT, even if an IRQ is
> + * pending, without hanging/breaking the guest.
> + */
> + return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
> +}
> +
> +static bool handle_halt(void)
> +{
> + /*
> + * Since non safe halt is mainly used in CPU offlining
> + * and the guest will always stay in the halt state, don't
> + * call the STI instruction (set do_sti as false).
> + */
> + const bool irq_disabled = irqs_disabled();
> + const bool do_sti = false;
> +
> + if (__halt(irq_disabled, do_sti))
> + return false;
> +
> + return true;
> +}

One other note: I really do like the silly:

const bool do_sti = false;

variables as opposed to doing gunk like:

__halt(irq_disabled, false));

Thanks for doing that.

Acked-by: Dave Hansen <[email protected]>

2022-02-25 05:37:07

by David Laight

[permalink] [raw]
Subject: RE: [PATCHv4 11/30] x86/tdx: Handle in-kernel MMIO

From: David Laight
> Sent: 25 February 2022 02:23
>
> From: Dave Hansen
> > Sent: 24 February 2022 20:12
> ...
> > === Limitations of this approach ===
> >
> > > MMIO addresses can be used with any CPU instruction that accesses
> > > memory. Address only MMIO accesses done via io.h helpers, such as
> > > 'readl()' or 'writeq()'.
> >
> > Any CPU instruction that accesses memory can also be used to access
> > MMIO. However, by convention, MMIO access are typically performed via
> > io.h helpers such as 'readl()' or 'writeq()'.
> >
> > > readX()/writeX() helpers limit the range of instructions which can trigger
> > > MMIO. It makes MMIO instruction emulation feasible. Raw access to a MMIO
> > > region allows the compiler to generate whatever instruction it wants.
> > > Supporting all possible instructions is a task of a different scope.
> >
> > The io.h helpers intentionally use a limited set of instructions when
> > accessing MMIO. This known, limited set of instructions makes MMIO
> > instruction decoding and emulation feasible in KVM hosts and SEV guests
> > today.
> >
> > MMIO accesses are performed without the io.h helpers are at the mercy of
> > the compiler. Compilers can and will generate a much more broad set of
> > instructions which can not practically be decoded and emulated. TDX
> > guests will oops if they encounter one of these decoding failures.
> >
> > This means that TDX guests *must* use the io.h helpers to access MMIO.
> >
> > This requirement is not new. Both KVM hosts and AMD SEV guests have the
> > same limitations on MMIO access.
>
> Am I reading the last sentence correctly?
> Normally (on x86 at least) a driver can mmap() PCIe addresses directly
> into a user process.
> This lets a user process directly issue PCIe read/write bus cycles.
> These can be any instructions at all.
> I don't think we've had any issues doing that in normal VMs.

Actually we won't have been exposing PCIe devices to VMs.

> Or is this emulation only applying to specific PCIe slaves?
>
> David


2022-02-25 07:40:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 17/30] x86/tdx: Add port I/O emulation

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> @@ -347,6 +399,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> return handle_cpuid(regs);
> case EXIT_REASON_EPT_VIOLATION:
> return handle_mmio(regs, ve);
> + case EXIT_REASON_IO_INSTRUCTION:
> + return handle_io(regs, ve->exit_qual);

Sorry to keep throwing random new things at this patch set. Thanks for
bearing with me.

Is there anything to keep these port I/O #VE's from occurring in
userspace? It's not how things are normally done, but is there
something fundamental to keep ioperm() and friends from working in TDX
guests?

As it stands with this set, userspace would probably
1. Succeed with the ioperm()
2. Do a port I/O instruction
3. Trigger a #VE
4. Get killed by the SIGSEGV that came from the #VE handler

That's not a horrible state of affairs. But, if this *can* happen, it
might be nice to just refuse the ioperm() in the first place.
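
If refusing turns out to be the right call, one possible shape would be an
early bail-out in the syscall path.  This is purely a sketch, and the cc
attribute named here is made up, not an existing one:

/* Sketch for arch/x86/kernel/ioport.c; CC_ATTR_GUEST_NO_PORT_IO is hypothetical. */
long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
{
	/*
	 * Userspace port I/O would only end in a #VE-induced SIGSEGV, so
	 * refuse to hand out I/O bitmap permissions in such guests.
	 */
	if (cc_platform_has(CC_ATTR_GUEST_NO_PORT_IO))
		return -EPERM;

	/* ... existing I/O bitmap handling ... */
	return 0;
}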

2022-02-25 07:56:24

by David Laight

[permalink] [raw]
Subject: RE: [PATCHv4 11/30] x86/tdx: Handle in-kernel MMIO

From: Dave Hansen
> Sent: 24 February 2022 20:12
...
> === Limitations of this approach ===
>
> > MMIO addresses can be used with any CPU instruction that accesses
> > memory. Address only MMIO accesses done via io.h helpers, such as
> > 'readl()' or 'writeq()'.
>
> Any CPU instruction that accesses memory can also be used to access
> MMIO. However, by convention, MMIO access are typically performed via
> io.h helpers such as 'readl()' or 'writeq()'.
>
> > readX()/writeX() helpers limit the range of instructions which can trigger
> > MMIO. It makes MMIO instruction emulation feasible. Raw access to a MMIO
> > region allows the compiler to generate whatever instruction it wants.
> > Supporting all possible instructions is a task of a different scope.
>
> The io.h helpers intentionally use a limited set of instructions when
> accessing MMIO. This known, limited set of instructions makes MMIO
> instruction decoding and emulation feasible in KVM hosts and SEV guests
> today.
>
> MMIO accesses are performed without the io.h helpers are at the mercy of
> the compiler. Compilers can and will generate a much more broad set of
> instructions which can not practically be decoded and emulated. TDX
> guests will oops if they encounter one of these decoding failures.
>
> This means that TDX guests *must* use the io.h helpers to access MMIO.
>
> This requirement is not new. Both KVM hosts and AMD SEV guests have the
> same limitations on MMIO access.

Am I reading the last sentence correctly?
Normally (on x86 at least) a driver can mmap() PCIe addresses directly
into a user process.
This lets a user process directly issue PCIe read/write bus cycles.
These can be any instructions at all.
I don't think we've had any issues doing that in normal VMs.

Or is this emulation only applying to specific PCIe slaves?

David


2022-02-25 15:25:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 04/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions

On Thu, Feb 24, 2022 at 09:01:42AM -0800, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > + tdcall
> > +
> > + /*
> > + * TDVMCALL leaf does not suppose to fail. If it fails something
> > + * is horribly wrong with TDX module. Stop the world.
> > + */
> > + testq %rax, %rax
> > + jne .Lpanic
>
> This should be:
>
> "A TDVMCALL is not supposed to fail."
>
> I also wish this was mentioning something about the difference between a
> failure and return code.
>
> /*
> * %rax==0 indicates a failure of the TDVMCALL mechanism itself
> * and that something has gone horribly wrong with the TDX
> * module.
> *
> * The return status of the hypercall operation is separate
> * (in %r10). Hypercall errors are a part of normal operation
> * and are handled by callers.
> */
>
> I've been confused by this exact thing multiple times over the months
> that I've been looking at this code. I think it deserves a good comment.

Sure. The updated patch is below.

From 3952126750fe5bd8ef94015e036292f48883e369 Mon Sep 17 00:00:00 2001
From: Kuppuswamy Sathyanarayanan <[email protected]>
Date: Sun, 21 Feb 2021 22:24:44 -0800
Subject: [PATCH] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper
functions

Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.

In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.

A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A variant of TDCALL used to communicate
with the VMM is called TDVMCALL.

Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).

__tdx_hypercall() - Used by the guest to request services from the
VMM (via TDVMCALL).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL).

Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.

The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.

Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.

For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".

Based on previous patch by Sean Christopherson.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/Makefile | 2 +-
arch/x86/coco/tdcall.S | 188 ++++++++++++++++++++++++++++++++++
arch/x86/coco/tdx.c | 18 ++++
arch/x86/include/asm/tdx.h | 27 +++++
arch/x86/kernel/asm-offsets.c | 10 ++
5 files changed, 244 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/coco/tdcall.S

diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index 32f4c6e6f199..14af5412e3cd 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -5,4 +5,4 @@ CFLAGS_core.o += -fno-stack-protector

obj-y += core.o

-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o tdcall.o
diff --git a/arch/x86/coco/tdcall.S b/arch/x86/coco/tdcall.S
new file mode 100644
index 000000000000..4767e0b5f0d9
--- /dev/null
+++ b/arch/x86/coco/tdcall.S
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+#include "../virt/tdxcall.S"
+
+/*
+ * Bitmasks of exposed registers (with VMM).
+ */
+#define TDX_R10 BIT(10)
+#define TDX_R11 BIT(11)
+#define TDX_R12 BIT(12)
+#define TDX_R13 BIT(13)
+#define TDX_R14 BIT(14)
+#define TDX_R15 BIT(15)
+
+/*
+ * These registers are clobbered to hold arguments for each
+ * TDVMCALL. They are safe to expose to the VMM.
+ * Each bit in this mask represents a register ID. Bit field
+ * details can be found in TDX GHCI specification, section
+ * titled "TDCALL [TDG.VP.VMCALL] leaf".
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \
+ TDX_R12 | TDX_R13 | \
+ TDX_R14 | TDX_R15 )
+
+/*
+ * __tdx_module_call() - Used by TDX guests to request services from
+ * the TDX module (does not include VMM services).
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI. After TDCALL operation, TDX module output is saved
+ * in @out (if it is provided by the user)
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - TDCALL Leaf number.
+ * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction error code.
+ * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module). It
+ * can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+ TDX_MODULE_CALL host=0
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Make hypercalls to a TDX VMM.
+ *
+ * Transforms values in function call argument struct tdx_hypercall_args @args
+ * into the TDCALL register ABI. After TDCALL operation, VMM output is saved
+ * back in @args.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX - BITMAP which controls which part of TD Guest GPR
+ * is passed as-is to the VMM and back.
+ * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
+ * specification. Non zero value indicates vendor
+ * specific ABI.
+ * R11 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12-R15 - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction status (Not related to hypercall
+ * output).
+ * R10 - Hypercall output error code.
+ * R11-R15 - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @args (RDI) - struct tdx_hypercall_args for input and output
+ * @flags (RSI) - TDX_HCALL_* flags
+ *
+ * On successful completion, return the hypercall error code.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+
+ /* Save callee-saved GPRs as mandated by the x86_64 ABI */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+ xor %eax, %eax
+
+ /* Copy hypercall registers from arg struct: */
+ movq TDX_HYPERCALL_r10(%rdi), %r10
+ movq TDX_HYPERCALL_r11(%rdi), %r11
+ movq TDX_HYPERCALL_r12(%rdi), %r12
+ movq TDX_HYPERCALL_r13(%rdi), %r13
+ movq TDX_HYPERCALL_r14(%rdi), %r14
+ movq TDX_HYPERCALL_r15(%rdi), %r15
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /*
+ * RAX==0 indicates a failure of the TDVMCALL mechanism itself and that
+ * something has gone horribly wrong with the TDX module.
+ *
+ * The return status of the hypercall operation is in a separate
+ * register (in R10). Hypercall errors are a part of normal operation
+ * and are handled by callers.
+ */
+ testq %rax, %rax
+ jne .Lpanic
+
+ /* TDVMCALL leaf return code is in R10 */
+ movq %r10, %rax
+
+ /* Copy hypercall result registers to arg struct if needed */
+ testq $TDX_HCALL_HAS_OUTPUT, %rsi
+ jz .Lout
+
+ movq %r10, TDX_HYPERCALL_r10(%rdi)
+ movq %r11, TDX_HYPERCALL_r11(%rdi)
+ movq %r12, TDX_HYPERCALL_r12(%rdi)
+ movq %r13, TDX_HYPERCALL_r13(%rdi)
+ movq %r14, TDX_HYPERCALL_r14(%rdi)
+ movq %r15, TDX_HYPERCALL_r15(%rdi)
+.Lout:
+ /*
+ * Zero out registers exposed to the VMM to avoid speculative execution
+ * with VMM-controlled values. This needs to include all registers
+ * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
+ * context will be restored.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+
+ /* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ FRAME_END
+
+ retq
+.Lpanic:
+ ud2
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 00898e3eb77f..17365fd40ba2 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -7,6 +7,24 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>

+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = fn,
+ .r12 = r12,
+ .r13 = r13,
+ .r14 = r14,
+ .r15 = r15,
+ };
+
+ return __tdx_hypercall(&args, 0);
+}
+
void __init tdx_early_init(void)
{
u32 eax, sig[3];
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e59c7960cc0d..d2a7e4ef4c1f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,11 +3,16 @@
#ifndef _ASM_X86_TDX_H
#define _ASM_X86_TDX_H

+#include <linux/bits.h>
#include <linux/init.h>

#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "

+#define TDX_HYPERCALL_STANDARD 0
+
+#define TDX_HCALL_HAS_OUTPUT BIT(0)
+
/*
* SW-defined error codes.
*
@@ -33,10 +38,32 @@ struct tdx_module_output {
u64 r11;
};

+/*
+ * Used in __tdx_hypercall() to pass down and get back registers' values of
+ * the TDCALL instruction when requesting services from the VMM.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_args {
+ u64 r10;
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

void __init tdx_early_init(void);

+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 7dca52f5cfc6..0b465e7d0a2f 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -74,6 +74,16 @@ static void __used common(void)
OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
OFFSET(TDX_MODULE_r11, tdx_module_output, r11);

+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
--
Kirill A. Shutemov
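
As an illustration only (not part of this patch): a later consumer of the
_tdx_hypercall() wrapper, such as the HLT handler added further down the
series, would look roughly like the sketch below. The sub-function number
assumes the GHCI reuses the VMX exit-reason values for instruction emulation,
and the real patch may pass additional state, so treat this as a sketch.

static bool handle_halt(void)
{
	/*
	 * irq_disabled lands in R12 and tells the VMM whether interrupts
	 * were off when HLT was executed; the remaining arguments are
	 * unused for this hypercall.
	 */
	const bool irq_disabled = irqs_disabled();

	return !_tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0, 0);
}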

2022-02-25 20:45:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 30/30] Documentation/x86: Document TDX kernel architecture

> +#VE Exceptions:
> +===============
> +
> +In TDX guests, #VE Exceptions are delivered to TDX guests in following
> +scenarios:
> +
> +* Execution of certain instructions (see list below)
> +* Certain MSR accesses.
> +* CPUID usage (only for certain leaves)
> +* Shared memory access (including MMIO)

This makes it sound like *ALL* MMIO will cause a #VE. Is this strictly
true? I didn't see anything in the spec that completely disallowed a
host from passing through an MMIO range to a guest in a shared memory
range. Granted, the host can unilaterally make that range start causing
a #VE at any time. But, is MMIO itself disallowed? Or, do guests just
have to be *prepared* for a #VE when accessing something that might be MMIO?

> +#VE due to instruction execution
> +---------------------------------
> +
> +Intel TDX dis-allows execution of certain instructions in non-root

^ disallows

> +mode. Execution of these instructions would lead to #VE or #GP.


Some instruction behavior changes when running inside a TDX guest.
These are typically instructions that would have been trapped by a
hypervisor and emulated. In a TDX guest, these instructions either lead
to a #VE or #GP.

* Instructions that always cause a #VE:

> +* String I/O (INS, OUTS), IN, OUT
> +* HLT
> +* MONITOR, MWAIT
> +* WBINVD, INVD
> +* VMCALL

* Instructions that always cause a #GP:

> +* All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
> + VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
> +* ENCLS, ENCLV

^ ENCLU

I don't think there's an "ENCLV" instruction.

> +* GETSEC
> +* RSM
> +* ENQCMD


* Instructions that conditionally cause a #VE (more details below)
* WRMSR, RDMSR (details below)
* CPUID
> +#VE due to MSR access
> +----------------------

<Sigh> The title of this section is #VE. Then, it talks about how #GP's
are triggered.

> +In TDX guest, MSR access behavior can be categorized as,
> +
> +* Native supported (also called "context switched MSR")
> + No special handling is required for these MSRs in TDX guests.
> +* #GP triggered
> + Dis-allowed MSR read/write would lead to #GP.
> +* #VE triggered
> + All MSRs that are not natively supported or dis-allowed
> + (triggers #GP) will trigger #VE. To support access to
> + these MSRs, it needs to be emulated using TDCALL.

This is really struggling to do anything useful. I mean, it says: "look
there are three categories." It defines the third category as
"everything not in the other two". <sigh> That's just a waste of bytes.

--

MSR access behavior falls into three categories:

* #GP generated
* #VE generated
* MSR "just works"

In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest _can_ try to handle the #GP
with a hypercall but it is unlikely to succeed.

The #VE MSRs are typically able to be handled by the hypervisor. Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling. They
might be implemented by directly passing through the MSR to the hardware
or by trapping and handling in the TDX module. Other than possibly
being slow, these MSRs appear to function just as they would on bare metal.
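
To make the middle (#VE) category concrete, this is roughly how such an MSR
read gets forwarded to the hypervisor. It is a sketch built on the
__tdx_hypercall() helpers from earlier in this series, not text from the
patch, and it assumes the GHCI reuses the VMX exit-reason value as the
hypercall sub-function number:

static bool read_msr(struct pt_regs *regs)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_MSR_READ,	/* hypercall sub-function */
		.r12 = regs->cx,		/* MSR index */
	};

	/*
	 * Ask the VMM to emulate RDMSR. A non-zero return code means the
	 * VMM refused and the caller raises #GP instead.
	 */
	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
		return false;

	/* The emulated MSR value comes back in R11. */
	regs->ax = lower_32_bits(args.r11);
	regs->dx = upper_32_bits(args.r11);
	return true;
}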

> +Look Intel TDX Module Specification, sec "MSR Virtualization" for the complete
> +list of MSRs that fall under the categories above.

Could we try to write some actual coherent text here, please? This
isn't even a complete sentence.

> +#VE due to CPUID instruction
> +----------------------------
> +
> +In TDX guests, most of CPUID leaf/sub-leaf combinations are virtualized by
> +the TDX module while some trigger #VE. Whether the leaf/sub-leaf triggers #VE
> +defined in the TDX spec.
> +
> +VMM during the TD initialization time (using TDH.MNG.INIT) configures if
> +a feature bits in specific leaf-subleaf are exposed to TD guest or not.

This needs to *say* something. Otherwise, it's just useless bytes.
Basically, this is a long-winded way of saying "if you want to know
anything about CPUID, look at the TDX spec".

What do we want the reader to take away from this?

> +#VE on Memory Accesses
> +----------------------
> +
> +A TD guest is in control of whether its memory accesses are treated as
> +private or shared. It selects the behavior with a bit in its page table
> +entries.

... and what?

Why does this matter? What does it have to do with #VE?

> +#VE on Shared Pages
> +-------------------
> +
> +Access to shared mappings can cause a #VE. The hypervisor controls whether
> +access of shared mapping causes a #VE, so the guest must be careful to only
> +reference shared pages it can safely handle a #VE, avoid nested #VEs.
> +
> +Content of shared mapping is not trusted since shared memory is writable
> +by the hypervisor. Shared mappings are never used for sensitive memory content
> +like stacks or kernel text, only for I/O buffers and MMIO regions. The kernel
> +will not encounter shared mappings in sensitive contexts like syscall entry
> +or NMIs.
> +
> +#VE on Private Pages
> +--------------------
> +
> +Some accesses to private mappings may cause #VEs. Before a mapping is
> +accepted (AKA in the SEPT_PENDING state), a reference would cause a #VE.
> +But, after acceptance, references typically succeed.
> +
> +The hypervisor can cause a private page reference to fail if it chooses
> +to move an accepted page to a "blocked" state. However, if it does
> +this, page access will not generate a #VE. It will, instead, cause a
> +"TD Exit" where the hypervisor is required to handle the exception.
> +
> +Linux #VE handler
> +-----------------
> +
> +Both user/kernel #VE exceptions are handled by the tdx_handle_virt_exception()
> +handler. If successfully handled, the instruction pointer is incremented to
> +complete the handling process. If failed to handle, it is treated as a regular
> +exception and handled via fixup handlers.
> +
> +In TD guests, #VE nesting (a #VE triggered before handling the current one
> +or AKA syscall gap issue) problem is handled by TDX module ensuring that
> +interrupts, including NMIs, are blocked. The hardware blocks interrupts
> +starting with #VE delivery until TDGETVEINFO is called.
> +
> +The kernel must avoid triggering #VE in entry paths: do not touch TD-shared
> +memory, including MMIO regions, and do not use #VE triggering MSRs,
> +instructions, or CPUID leaves that might generate #VE.
> +
> +MMIO handling:
> +==============
> +
> +In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
> +mapping which will cause a VMEXIT on access, and then the VMM emulates the
> +access. That's not possible in TDX guests because VMEXIT will expose the
> +register state to the host. TDX guests don't trust the host and can't have
> +their state exposed to the host.
> +
> +In TDX the MMIO regions are instead configured to trigger a #VE
> +exception in the guest. The guest #VE handler then emulates the MMIO
> +instructions inside the guest and converts them into a controlled TDCALL
> +to the host, rather than completely exposing the state to the host.
> +
> +MMIO addresses on x86 are just special physical addresses. They can be
> +accessed with any instruction that accesses memory. However, the
> +introduced instruction decoding method is limited. It is only designed
> +to decode instructions like those generated by io.h macros.
> +
> +MMIO access via other means (like structure overlays) may result in
> +MMIO_DECODE_FAILED and an oops.
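
To illustrate the distinction (example only; the device layout, mmio_base and
phys_addr below are made up):

struct hw_regs {		/* hypothetical device register layout */
	u32 status;
	u32 ctrl;
};

void __iomem *mmio_base = ioremap(phys_addr, PAGE_SIZE);

/* Fine: generated by the io.h helpers the #VE decoder is designed for. */
u32 status = readl(mmio_base + offsetof(struct hw_regs, status));

/*
 * Risky: a structure overlay lets the compiler pick an arbitrary memory
 * access instruction, which the limited decoder may reject with
 * MMIO_DECODE_FAILED and an oops.
 */
struct hw_regs __iomem *regs = mmio_base;
u32 ctrl = regs->ctrl;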
> +
> +Shared memory:
> +==============
> +
> +Intel TDX doesn't allow the VMM to access guest private memory. Any
> +memory that is required for communication with VMM must be shared
> +explicitly by setting the bit in the page table entry. The shared bit
> +can be enumerated with TDX_GET_INFO.
> +
> +After setting the shared bit, the conversion must be completed with
> +MapGPA hypercall. The call informs the VMM about the conversion between
> +private/shared mappings.
> +
> +set_memory_decrypted() converts a range of pages to shared.
> +set_memory_encrypted() converts memory back to private.
> +
> +Device drivers are the primary user of shared memory, but there's no
> +need in touching every driver. DMA buffers and ioremap()'ed regions are
> +converted to shared automatically.
> +
> +TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
> +converted to shared on boot.
> +
> +For coherent DMA allocation, the DMA buffer gets converted on the
> +allocation. Check force_dma_unencrypted() for details.
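
To make the conversion flow concrete, a minimal sketch (illustration only;
the function and its error handling are made up):

static int example_share_page(void)
{
	unsigned long vaddr = __get_free_page(GFP_KERNEL);

	if (!vaddr)
		return -ENOMEM;

	/* Sets the shared bit in the PTE and issues the MapGPA hypercall. */
	if (set_memory_decrypted(vaddr, 1)) {
		free_page(vaddr);
		return -EIO;
	}

	/* ... the page is now visible to the VMM, e.g. as an I/O buffer ... */

	/* Convert back to private before returning it to the allocator. */
	set_memory_encrypted(vaddr, 1);
	free_page(vaddr);
	return 0;
}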
> +


> +More details about TDX module (and its response for MSR, memory access,
> +IO, CPUID etc) can be found at,
> +
> +https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> +
> +More details about TDX hypercall and TDX module call ABI can be found
> +at,
> +
> +https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
> +
> +More details about TDVF requirements can be found at,
> +
> +https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf

None of these are stable URLs. Let's just get rid of them.

2022-02-25 22:07:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest

On 2/25/22 11:30, Kirill A. Shutemov wrote:
> On Thu, Feb 24, 2022 at 10:36:02AM -0800, Dave Hansen wrote:
>> On 2/24/22 07:56, Kirill A. Shutemov wrote:
>>> Virtualization Exceptions (#VE) are delivered to TDX guests due to
>>> specific guest actions which may happen in either user space or the
>>> kernel:
>>>
>>> * Specific instructions (WBINVD, for example)
>>> * Specific MSR accesses
>>> * Specific CPUID leaf accesses
>>> * Access to unmapped pages (EPT violation)
>>
>> Considering that you're talking partly about userspace, it would be nice
>> to talk about what "unmapped" really means here.
>
> I'm not sure what you want to see here. Doesn't EPT violation describe it?
>
> It can happen to userspace too, but we don't expect it to be used and we
> SIGSEGV the process if it happens.

How about just:

* Access to specific guest physical addresses

That makes it clear that we're not really talking about userspace
unmapped pages.

...
>>> + * module also treats virtual NMIs as inhibited if the #VE valid flag is
>>> + * set, e.g. so that NMI=>#VE will not result in a #DF.
>>> + */
>>
>> Are we missing anything valuable if we just trim the comment down to
>> something like:
>>
>> /*
>> * Called during #VE handling to retrieve the #VE info from the
>> * TDX module.
>> *
>> * This should be called early in #VE handling. A "nested"
>> * #VE which occurs before this will raise a #DF and is not
>> * recoverable.
>> */
>
> This variant of the comment lost information about #VE-valid flag and
> doesn't describe how NMI is inhibited.

IMNHO, the "#VE valid" flag is a super-fine implementation detail. I'd
personally deal with that in Documentation or the changelog instead of a
comment.

>>> +#ifdef CONFIG_INTEL_TDX_GUEST
>>> +
>>> +#define VE_FAULT_STR "VE fault"
>>> +
>>> +static void ve_raise_fault(struct pt_regs *regs, long error_code)
>>> +{
>>> + if (user_mode(regs)) {
>>> + gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
>>> + return;
>>> + }
>>> +
>>> + if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
>>> + return;
>>> +
>>> + die_addr(VE_FAULT_STR, regs, error_code, 0);
>>> +}
>>> +
>>> +/*
>>> + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
>>> + * specific guest actions which may happen in either user space or the
>>> + * kernel:
>>> + *
>>> + * * Specific instructions (WBINVD, for example)
>>> + * * Specific MSR accesses
>>> + * * Specific CPUID leaf accesses
>>> + * * Access to unmapped pages (EPT violation)
>>> + *
>>> + * In the settings that Linux will run in, virtualization exceptions are
>>> + * never generated on accesses to normal, TD-private memory that has been
>>> + * accepted.
>>
>> This actually makes a lot more sense as a code comment than changelog.
>> It would be really nice to circle back here and actually refer to the
>> functions that accept memory.
>
> We don't have such functions at this point in the patchset. Do you want
> the comment to be updated once we get them introduced?

Yes, please. Supplement the comment when the functions are introduced
later.

2022-02-26 01:44:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 30/30] Documentation/x86: Document TDX kernel architecture

On 2/24/22 07:56, Kirill A. Shutemov wrote:
> +List of instructions that can cause a #VE is,
> +
> +* String I/O (INS, OUTS), IN, OUT

Also: String I/O != Port I/O

I'm just going to throw this into a shared document and start rewriting it.

2022-02-26 02:01:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Feb 24, 2022 at 10:36:02AM -0800, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > specific guest actions which may happen in either user space or the
> > kernel:
> >
> > * Specific instructions (WBINVD, for example)
> > * Specific MSR accesses
> > * Specific CPUID leaf accesses
> > * Access to unmapped pages (EPT violation)
>
> Considering that you're talking partly about userspace, it would be nice
> to talk about what "unmapped" really means here.

I'm not sure what you want to see here. Doesn't EPT violation describe it?

It can happen to userspace too, but we don't expect it to be used and we
SIGSEGV the process if it happens.

> > In the settings that Linux will run in, virtualization exceptions are
> > never generated on accesses to normal, TD-private memory that has been
> > accepted.
>
> This is getting into nit territory. But, at this point a normal reader
> has no idea what "accepted" memory is.

I will add: "(prepared to be used in the TD)". Okay?

> > @@ -58,6 +59,65 @@ static void get_info(void)
> > td_info.attributes = out.rdx;
> > }
> >
> > +void tdx_get_ve_info(struct ve_info *ve)
> > +{
> > + struct tdx_module_output out;
> > +
> > + /*
> > + * Retrieve the #VE info from the TDX module, which also clears the "#VE
> > + * valid" flag. This must be done before anything else as any #VE that
> > + * occurs while the valid flag is set, i.e. before the previous #VE info
> > + * was consumed, is morphed to a #DF by the TDX module.
>
>
> That's a really weird sentence. It doesn't really parse for me. It
> might be the misplaced comma after "consumed,".
>
> For what it's worth, I think "i.e." and "e.g." have been over used in
> the TDX text (sorry Sean). They lead to really weird sentence structure.
>
> Note, the TDX
> > + * module also treats virtual NMIs as inhibited if the #VE valid flag is
> > + * set, e.g. so that NMI=>#VE will not result in a #DF.
> > + */
>
> Are we missing anything valuable if we just trim the comment down to
> something like:
>
> /*
> * Called during #VE handling to retrieve the #VE info from the
> * TDX module.
> *
> * This should be called early in #VE handling. A "nested"
> * #VE which occurs before this will raise a #DF and is not
> * recoverable.
> */

This variant of the comment loses the information about the #VE-valid flag
and doesn't describe how NMIs are inhibited.

Sean proposed this wording as reply to Thomas' questions:

http://lore.kernel.org/r/[email protected]

Do we need to keep the info?

> For what it's worth, I don't think we care who "morphs" things. We just
> care about the fallout.
>
> > + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
>
> How about a one-liner below here:
>
> /* Interrupts and NMIs can be delivered again. */
>
> > + ve->exit_reason = out.rcx;
> > + ve->exit_qual = out.rdx;
> > + ve->gla = out.r8;
> > + ve->gpa = out.r9;
> > + ve->instr_len = lower_32_bits(out.r10);
> > + ve->instr_info = upper_32_bits(out.r10);
> > +}
> > +
> > +/*
> > + * Handle the user initiated #VE.
> > + *
> > + * For example, executing the CPUID instruction from user space
> > + * is a valid case and hence the resulting #VE has to be handled.
> > + *
> > + * For dis-allowed or invalid #VE just return failure.
> > + */
>
> This is just insane to have in the series at this point. It says that
> the "#VE has to be handled" and then doesn't handle it!

I can't see why it's a big deal, but okay.

> > +static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > + return false;
> > +}
> > +
> > +/* Handle the kernel #VE */
> > +static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > + return false;
> > +}
> > +
> > +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > + bool ret;
> > +
> > + if (user_mode(regs))
> > + ret = virt_exception_user(regs, ve);
> > + else
> > + ret = virt_exception_kernel(regs, ve);
> > +
> > + /* After successful #VE handling, move the IP */
> > + if (ret)
> > + regs->ip += ve->instr_len;
> > +
> > + return ret;
> > +}
>
> At this point in the series, these three functions can be distilled down to:
>
> bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> {
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>
> return false;
> }

I will do as you want, but I don't feel it is right.

The patch adds a little more infrastructure that makes following patches
cleaner.


> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +
> > +#define VE_FAULT_STR "VE fault"
> > +
> > +static void ve_raise_fault(struct pt_regs *regs, long error_code)
> > +{
> > + if (user_mode(regs)) {
> > + gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
> > + return;
> > + }
> > +
> > + if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
> > + return;
> > +
> > + die_addr(VE_FAULT_STR, regs, error_code, 0);
> > +}
> > +
> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + * * Specific instructions (WBINVD, for example)
> > + * * Specific MSR accesses
> > + * * Specific CPUID leaf accesses
> > + * * Access to unmapped pages (EPT violation)
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
>
> This actually makes a lot more sense as a code comment than changelog.
> It would be really nice to circle back here and actually refer to the
> functions that accept memory.

We don't have such functions at this point in the patchset. Do you want
the comment to be updated once we get them introduced?
>
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard to debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * the NMI entry code. Entry code paths do not access TD-shared memory,
> > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> > + * that might generate #VE. VMM can remove memory from TD at any point,
> > + * but access to unaccepted (or missing) private memory leads to VM
> > + * termination, not to #VE.
> > + *
> > + * Similarly to page faults and breakpoints, #VEs are allowed in NMI
> > + * handlers once the kernel is ready to deal with nested NMIs.
> > + *
> > + * During #VE delivery, all interrupts, including NMIs, are blocked until
> > + * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> > + * the VE info.
> > + *
> > + * If a guest kernel action which would normally cause a #VE occurs in
> > + * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> > + * exception) is delivered to the guest which will result in an oops.
> > + */
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > + struct ve_info ve;
> > +
> > + /*
> > + * NMIs/Machine-checks/Interrupts will be in a disabled state
> > + * till TDGETVEINFO TDCALL is executed. This ensures that VE
> > + * info cannot be overwritten by a nested #VE.
> > + */
> > + tdx_get_ve_info(&ve);
> > +
> > + cond_local_irq_enable(regs);
> > +
> > + /*
> > + * If tdx_handle_virt_exception() could not process
> > + * it successfully, treat it as #GP(0) and handle it.
> > + */
> > + if (!tdx_handle_virt_exception(regs, &ve))
> > + ve_raise_fault(regs, 0);
> > +
> > + cond_local_irq_disable(regs);
> > +}
> > +
> > +#endif
> > +
> > #ifdef CONFIG_X86_32
> > DEFINE_IDTENTRY_SW(iret_error)
> > {
>

--
Kirill A. Shutemov

2022-02-26 21:37:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 09/30] x86/tdx: Add MSR support for TDX guests

On Thu, Feb 24, 2022 at 10:52:23AM -0800, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > Use hypercall to emulate MSR read/write for the TDX platform.
> >
> > There are two viable approaches for doing MSRs in a TD guest:
> >
> > 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
> > do. Some will succeed, others will cause a #VE. All of those that
> > cause a #VE will be handled with a TDCALL.
> > 2. Use paravirt infrastructure. The paravirt hook has to keep a list
> > of which MSRs would cause a #VE and use a TDCALL. All other MSRs
> > execute RDMSR/WRMSR instructions directly.
> >
> > The second option can be ruled out because the list of MSRs was
> > challenging to maintain. That leaves option #1 as the only viable
> > solution for the minimal TDX support.
> >
> > For performance-critical MSR writes (like TSC_DEADLINE), future patches
> > will replace the WRMSR/#VE sequence with the direct TDCALL.
>
> This will still leave us with a list of non-#VE-inducing MSRs. That's
> not great.

Em. No. TSC_DEADLINE is a #VE-inducing MSR. So we will only maintain a list
of performance-critical MSR writes that do trigger #VE.

Here's how we do it now:

https://github.com/intel/tdx/commit/2cea8becaa5a287c93266c01fc7f2a4ed53c509d

The idea is that if the MSR is in the list, we go for a direct TDVMCALL.
Otherwise we go for a native WRMSR, which may or may not trigger #VE.
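
Roughly, the shape is (a sketch; tdx_perf_critical_msr() and tdx_hcall_wrmsr()
are hypothetical names, only native_wrmsr() is the existing instruction path):

static void tdx_write_msr(unsigned int msr, u32 low, u32 high)
{
	if (tdx_perf_critical_msr(msr))		/* on the short list */
		tdx_hcall_wrmsr(msr, ((u64)high << 32) | low);	/* direct TDVMCALL */
	else
		native_wrmsr(msr, low, high);	/* may #VE; handled via TDCALL */
}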

> But, if we miss an MSR in the performance-critical list, the
> result is a slow WRMSR->#VE. If we miss an MSR in the paravirt
> approach, we induce a fatal #VE.
>
> Please add something to that effect if you revise this patch.

I'm not sure explaining mechanism of a future patch is a good idea.
It may change before it gets implemented.

--
Kirill A. Shutemov

2022-02-27 01:53:44

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On Thu, Feb 24, 2022 at 11:04:04AM -0800, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> > {
> > - pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > - return false;
> > + switch (ve->exit_reason) {
> > + case EXIT_REASON_CPUID:
> > + return handle_cpuid(regs);
> > + default:
> > + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > + return false;
> > + }
> > }
>
> What does this mean for userspace? What kinds of things are we ceding
> to the (untrusted) VMM to supply to userspace?

Here's what I see called from userspace.
CPUID(AX=0x2)
CPUID(AX=0xb, CX=0x0)
CPUID(AX=0xb, CX=0x1)
CPUID(AX=0x40000000, CX=0xfffaba17)
CPUID(AX=0x80000007, CX=0x121)

> > /* Handle the kernel #VE */
> > @@ -200,6 +235,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> > return read_msr(regs);
> > case EXIT_REASON_MSR_WRITE:
> > return write_msr(regs);
> > + case EXIT_REASON_CPUID:
> > + return handle_cpuid(regs);
> > default:
> > pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > return false;
> What kinds of random CPUID uses in the kernel at runtime need this
> handling?

CPUID(AX=0x2)
CPUID(AX=0x6, CX=0x0)
CPUID(AX=0xb, CX=0x0)
CPUID(AX=0xb, CX=0x1)
CPUID(AX=0xb, CX=0x2)
CPUID(AX=0xf, CX=0x0)
CPUID(AX=0xf, CX=0x1)
CPUID(AX=0x10, CX=0x0)
CPUID(AX=0x10, CX=0x1)
CPUID(AX=0x10, CX=0x2)
CPUID(AX=0x10, CX=0x3)
CPUID(AX=0x16, CX=0x0)
CPUID(AX=0x1f, CX=0x0)
CPUID(AX=0x40000000, CX=0x0)
CPUID(AX=0x40000000, CX=0xfffaba17)
CPUID(AX=0x40000001, CX=0x0)
CPUID(AX=0x80000002, CX=0x0)
CPUID(AX=0x80000003, CX=0x0)
CPUID(AX=0x80000004, CX=0x0)
CPUID(AX=0x80000007, CX=0x0)
CPUID(AX=0x80000007, CX=0x121)

> Is it really OK that we let the VMM inject arbitrary CPUID
> values into random CPUID uses in the kernel... silently?

We realise that this is a possible attack vector and plan to implement
proper filtering. But it is beyond core enabling.

> Is this better than just returning 0's, for instance?

Plain 0 injection breaks the boot. A more complicated solution is needed.


--
Kirill A. Shutemov

2022-02-28 00:04:15

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 15/30] x86/boot: Allow to hook up alternative port I/O helpers

On Thu, Feb 24, 2022 at 06:56:15PM +0300, Kirill A. Shutemov wrote:
> Port I/O instructions trigger #VE in the TDX environment. In response to
> the exception, kernel emulates these instructions using hypercalls.
>
> But during early boot, on the decompression stage, it is cumbersome to
> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> handling.
>
> Add a way to hook up alternative port I/O helpers in the boot stub.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

I think you missed my comment from v3. Repeating it here:

At least from reading the commit message it's not self-evident why #VE
handling would be worse, especially since there's already #VC support in
boot. It would help to give more info about that in the commit message.

The current approach also seems fragile, doesn't it require all future
code to remember to not do i/o directly? How do we make sure that
doesn't happen going forward?

How does it fail if some code accidentally does i/o directly? Or
triggers #VE some other way? Is the error understandable and
actionable?

--
Josh

2022-02-28 06:37:00

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHv4 29/30] ACPICA: Avoid cache flush on TDX guest

On Sun, Feb 27, 2022 at 2:05 PM Josh Poimboeuf <[email protected]> wrote:
>
> On Thu, Feb 24, 2022 at 06:56:29PM +0300, Kirill A. Shutemov wrote:
> > +/*
> > + * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
> > + * It is required to prevent data loss.
> > + *
> > + * While running inside TDX guest, the kernel can bypass cache flushing.
> > + * Changing sleep state in a virtual machine doesn't affect the host system
> > + * sleep state and cannot lead to data loss.
> > + *
> > + * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
> > + */
> > +#define ACPI_FLUSH_CPU_CACHE() \
> > +do { \
> > + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) \
> > + wbinvd(); \
> > +} while (0)
>
> If it's safe, why not do it for all VMs? Is there something specific
> about TDX which makes this more obviously known to be safe than for
> regular VMs?
>
> The patch description and the above comment make it sound like "we're
> not really sure this is safe, so we'll just use TDX as a testing ground
> for the idea." Which doesn't really inspire a lot of confidence in the
> stability of TD sleep states.

Agree, why is this marked as "TODO"? The cache flushes associated with
ACPI sleep states are to flush cache before bare metal power loss to
CPU caches and bare metal transition of DDR in self-refresh mode. If a
cache flush is required it is the responsibility of the hypervisor.
Either it is safe for all guests or it is unsafe for all guests, not
TD specific.

2022-02-28 06:37:47

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On Thu, Feb 24, 2022 at 06:56:01PM +0300, Kirill A. Shutemov wrote:
> So far, AMD_MEM_ENCRYPT is the only user of X86_MEM_ENCRYPT. TDX will be
> the second. It will make mem_encrypt.c build without AMD_MEM_ENCRYPT,
> which triggers a warning:
>
> arch/x86/mm/mem_encrypt.c:69:13: warning: no previous prototype for
> function 'mem_encrypt_init' [-Wmissing-prototypes]
>
> Fix it by moving mem_encrypt_init() declaration outside of #ifdef
> CONFIG_AMD_MEM_ENCRYPT.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Fixes: 20f07a044a76 ("x86/sev: Move common memory encryption code to mem_encrypt.c")
> Acked-by: David Rientjes <[email protected]>

The patch title, warning, and "Fixes" tag tend to give the impression
this is fixing a real user-visible bug. But the bug is theoretical, as
it's not possible to enable X86_MEM_ENCRYPT without AMD_MEM_ENCRYPT,
until patch 27.

IMO it would be preferable to just squash this change with patch 27.

Having it as a separate patch is also fine, but it shouldn't be
described as a fix or use the Fixes tag. It's more of a preparatory
patch.

--
Josh

2022-02-28 07:31:50

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 29/30] ACPICA: Avoid cache flush on TDX guest

On Thu, Feb 24, 2022 at 06:56:29PM +0300, Kirill A. Shutemov wrote:
> +/*
> + * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
> + * It is required to prevent data loss.
> + *
> + * While running inside TDX guest, the kernel can bypass cache flushing.
> + * Changing sleep state in a virtual machine doesn't affect the host system
> + * sleep state and cannot lead to data loss.
> + *
> + * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
> + */
> +#define ACPI_FLUSH_CPU_CACHE() \
> +do { \
> + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) \
> + wbinvd(); \
> +} while (0)

If it's safe, why not do it for all VMs? Is there something specific
about TDX which makes this more obviously known to be safe than for
regular VMs?

The patch description and the above comment make it sound like "we're
not really sure this is safe, so we'll just use TDX as a testing ground
for the idea." Which doesn't really inspire a lot of confidence in the
stability of TD sleep states.

--
Josh

2022-02-28 08:18:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 17/30] x86/tdx: Add port I/O emulation

On 2/27/22 17:16, Kirill A. Shutemov wrote:
> Anyway, it is in our plans to sort it out, but it is not in scope of core
> enabling. Let's make it functional first.

Yeah, but we need to know what these plans are. There's still a _bit_
too much hand-waving and "trust us" going on in this set.

If this can induce extra SIGSEV's in userspace that aren't possible in
non-TDX systems, please call that out.

For instance, something like this in the changelog of this patch would
be really nice:

== Userspace Implications ==

The ioperm() facility allows userspace access to I/O
instructions like inb/outb. Among other things, this allows
writing userspace device drivers.

This series has no special handling for ioperm(). Users
will be able to successfully request I/O permissions but will
induce a #VE on their first I/O instruction. If this is
undesirable users can <add advice here about LOCKDOWN_IOPORT>

More robust handling of this situation (denying ioperm() in
all TDX guests) will be addressed in follow-on work.

That says: This causes a problem. The problem looks like this. It can
be addressed now by doing $FOO or later by doing $BAR.

But, the *problem* needs to be called out. That way, folks can actually
think about the problem rather than just reading a happy changelog that
neglects to mention any of the problems that the patch leaves in its wake.

The same goes for the CPUID mess. I'm not demanding a full solution in
the patch or the series even. But, what I am demanding is a full
_problem_ disclosure.

2022-02-28 08:37:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 17/30] x86/tdx: Add port I/O emulation

On Thu, Feb 24, 2022 at 07:59:51PM -0800, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > @@ -347,6 +399,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> > return handle_cpuid(regs);
> > case EXIT_REASON_EPT_VIOLATION:
> > return handle_mmio(regs, ve);
> > + case EXIT_REASON_IO_INSTRUCTION:
> > + return handle_io(regs, ve->exit_qual);
>
> Sorry to keep throwing random new things at this patch set. Thanks for
> bearing with me.
>
> Is there anything to keep these port I/O #VE's from occurring in
> userspace? It's not how things are normally done, but is there
> something fundamental to keep ioperm() and friends from working in TDX
> guests?
>
> As it stands with this set, userspace would probably
> 1. Succeed with the ioperm()
> 2. Do a port I/O instruction
> 3. Trigger a #VE
> 4. Get killed by the SIGSEGV that came from the #VE handler
>
> That's not a horrible state of affairs. But, if this *can* happen, it
> might be nice to just refuse the ioperm() in the first place.

Right, there's a way to get port I/O from userspace and we do not intend to
support it. And, yes, ioperm() is the right place to do this.

We considered making it happen via the security lockdown mechanism. It
already blocks port I/O (LOCKDOWN_IOPORT) and does more stuff that can be
considered useful for a paranoid guest. I'm not sure it is the right way to
go. We will see.

Anyway, it is in our plans to sort it out, but it is not in scope of core
enabling. Let's make it functional first.
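
For the record, the simplest shape of the ioperm() denial would be something
like the check below; whether it ends up in ksys_ioperm() or behind
LOCKDOWN_IOPORT is exactly the open question:

/* Hypothetical placement: early in ksys_ioperm(), before touching the bitmap. */
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
	return -EPERM;	/* userspace port I/O is not supported in TDX guests */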

--
Kirill A. Shutemov

2022-02-28 17:33:44

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 15/30] x86/boot: Allow to hook up alternative port I/O helpers

On Sun, Feb 27, 2022 at 02:02:19PM -0800, Josh Poimboeuf wrote:
> On Thu, Feb 24, 2022 at 06:56:15PM +0300, Kirill A. Shutemov wrote:
> > Port I/O instructions trigger #VE in the TDX environment. In response to
> > the exception, kernel emulates these instructions using hypercalls.
> >
> > But during early boot, on the decompression stage, it is cumbersome to
> > deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> > handling.
> >
> > Add a way to hook up alternative port I/O helpers in the boot stub.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
>
> I think you missed my comment from v3.

I did not miss it, but I failed to acknowledge it.

To me it is a judgement call. Either way has a right to live.
I talked to Borislav about this and we suggested keeping it as is. Rework later
as needed.

> Repeating it here:
>
> At least from reading the commit message it's not self-evident why #VE
> handling would be worse, especially since there's already #VC support in
> boot. It would help to give more info about that in the commit message.
>
> The current approach also seems fragile, doesn't it require all future
> code to remember to not do i/o directly? How do we make sure that
> doesn't happen going forward?
>
> How does it fail if some code accidentally does i/o directly? Or
> triggers #VE some other way? Is the error understandable and
> actionable?

Dealing with failure in decompression code is a pain. We don't have the usual
infrastructure there. The patch deals with port I/O, which is the only way
to communicate an issue to the user. If it fails for whatever reason, we are
screwed, and that doesn't depend on how it was implemented.

--
Kirill A. Shutemov

2022-02-28 17:45:35

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On Mon, Feb 28, 2022 at 08:51:20AM -0800, Dave Hansen wrote:
> On 2/28/22 08:40, Josh Poimboeuf wrote:
>> maintainer-tip.rst seems to disagree with you:
> >>
> >> A Fixes tag should be added even for changes which do not need to be
> >> backported to stable kernels, i.e. when addressing a recently introduced
> >> issue which only affects tip or the current head of mainline.
> >>
> >> I will leave it as is.
> > How does that disagree with me?
> >
> > The "Fixes" tag is for bug fixes. If it's not possible to trigger the
> > warning and there's no user impact, it's not a bug.
>
> Does having Fixes: *break* anything?

People rely on the "Fixes:" tag for actual bug fixes. Using it here --
along with the rest of the "this is fixing a bug" tone of the title and
description -- is guaranteed to confuse stable maintainers and distros
doing backports.

Again, if nothing's broken from the standpoint of the user then it's not
a bug and shouldn't be reported that way.

> If not, I think I'd generally rather have the metadata with more
> information as opposed to less information.

I would call it misinformation. How is that useful?

It's ok to reference a related commit in the patch description itself.
Just don't abuse the "Fixes" tag.

But IMO, for the least amount of confusion, it makes more sense to
squash this with the patch which actually requires it.

--
Josh

2022-02-28 17:46:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 29/30] ACPICA: Avoid cache flush on TDX guest

On 2/28/22 08:37, Kirill A. Shutemov wrote:
>> Agree, why is this marked as "TODO"? The cache flushes associated with
>> ACPI sleep states are to flush cache before bare metal power loss to
>> CPU caches and bare metal transition of DDR in self-refresh mode. If a
>> cache flush is required it is the responsibility of the hypervisor.
>> Either it is safe for all guests or it is unsafe for all guests, not
>> TD specific.
> Do we have "any VM" check? I can't find it right away.

Yes:

> #define X86_FEATURE_HYPERVISOR ( 4*32+31) /* Running on a hypervisor */

I'm pretty sure an earlier version of this patch used it.
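
For reference, the generalized form being discussed would look roughly like
this (assuming, as questioned above, that skipping the flush is safe for all
guests and not just TDX):

#define ACPI_FLUSH_CPU_CACHE()					\
do {								\
	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
		wbinvd();					\
} while (0)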

2022-02-28 17:48:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 29/30] ACPICA: Avoid cache flush on TDX guest

On Sun, Feb 27, 2022 at 05:34:45PM -0800, Dan Williams wrote:
> On Sun, Feb 27, 2022 at 2:05 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Thu, Feb 24, 2022 at 06:56:29PM +0300, Kirill A. Shutemov wrote:
> > > +/*
> > > + * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
> > > + * It is required to prevent data loss.
> > > + *
> > > + * While running inside TDX guest, the kernel can bypass cache flushing.
> > > + * Changing sleep state in a virtual machine doesn't affect the host system
> > > + * sleep state and cannot lead to data loss.
> > > + *
> > > + * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
> > > + */
> > > +#define ACPI_FLUSH_CPU_CACHE() \
> > > +do { \
> > > + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) \
> > > + wbinvd(); \
> > > +} while (0)
> >
> > If it's safe, why not do it for all VMs? Is there something specific
> > about TDX which makes this more obviously known to be safe than for
> > regular VMs?
> >
> > The patch description and the above comment make it sound like "we're
> > not really sure this is safe, so we'll just use TDX as a testing ground
> > for the idea." Which doesn't really inspire a lot of confidence in the
> > stability of TD sleep states.
>
> Agree, why is this marked as "TODO"? The cache flushes associated with
> ACPI sleep states are to flush cache before bare metal power loss to
> CPU caches and bare metal transition of DDR in self-refresh mode. If a
> cache flush is required it is the responsibility of the hypervisor.
> Either it is safe for all guests or it is unsafe for all guests, not
> TD specific.

Do we have "any VM" check? I can't find it right away.

--
Kirill A. Shutemov

2022-02-28 17:50:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On 2/28/22 08:40, Josh Poimboeuf wrote:
>> maintainer-tip.rst seems to disagree with you:
>>
>> A Fixes tag should be added even for changes which do not need to be
>> backported to stable kernels, i.e. when addressing a recently introduced
>> issue which only affects tip or the current head of mainline.
>>
>> I will leave it as is.
> How does that disagree with me?
>
> The "Fixes" tag is for bug fixes. If it's not possible to trigger the
> warning and there's no user impact, it's not a bug.

Does having Fixes: *break* anything?

If not, I think I'd generally rather have the metadata with more
information as opposed to less information.

2022-02-28 17:52:15

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 15/30] x86/boot: Allow to hook up alternative port I/O helpers

On Mon, Feb 28, 2022 at 07:33:53PM +0300, Kirill A. Shutemov wrote:
> On Sun, Feb 27, 2022 at 02:02:19PM -0800, Josh Poimboeuf wrote:
> > On Thu, Feb 24, 2022 at 06:56:15PM +0300, Kirill A. Shutemov wrote:
> > > Port I/O instructions trigger #VE in the TDX environment. In response to
> > > the exception, kernel emulates these instructions using hypercalls.
> > >
> > > But during early boot, on the decompression stage, it is cumbersome to
> > > deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> > > handling.
> > >
> > > Add a way to hook up alternative port I/O helpers in the boot stub.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> >
> > I think you missed my comment from v3.
>
> I did not miss it, but I failed to acknowledge it.
>
> To me it is a judgement call. Either way has a right to live.
> I talked to Borislav about this and we suggested keeping it as is. Rework later
> as needed.
>
> > Repeating it here:
> >
> > At least from reading the commit message it's not self-evident why #VE
> > handling would be worse, especially since there's already #VC support in
> > boot. It would help to give more info about that in the commit message.
> >
> > The current approach also seems fragile, doesn't it require all future
> > code to remember to not do i/o directly? How do we make sure that
> > doesn't happen going forward?
> >
> > How does it fail if some code accidentally does i/o directly? Or
> > triggers #VE some other way? Is the error understandable and
> > actionable?
>
> Dealing with failure in decompression code is a pain. We don't have the usual
> infrastructure there. The patch deals with port I/O, which is the only way
> to communicate an issue to the user. If it fails for whatever reason, we are
> screwed, and that doesn't depend on how it was implemented.

In the patch description, please address all of my concerns and
questions.

--
Josh

2022-02-28 17:57:05

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On Mon, Feb 28, 2022 at 07:20:56PM +0300, Kirill A. Shutemov wrote:
> On Sun, Feb 27, 2022 at 02:01:30PM -0800, Josh Poimboeuf wrote:
> > On Thu, Feb 24, 2022 at 06:56:01PM +0300, Kirill A. Shutemov wrote:
> > > So far, AMD_MEM_ENCRYPT is the only user of X86_MEM_ENCRYPT. TDX will be
> > > the second. It will make mem_encrypt.c build without AMD_MEM_ENCRYPT,
> > > which triggers a warning:
> > >
> > > arch/x86/mm/mem_encrypt.c:69:13: warning: no previous prototype for
> > > function 'mem_encrypt_init' [-Wmissing-prototypes]
> > >
> > > Fix it by moving mem_encrypt_init() declaration outside of #ifdef
> > > CONFIG_AMD_MEM_ENCRYPT.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Fixes: 20f07a044a76 ("x86/sev: Move common memory encryption code to mem_encrypt.c")
> > > Acked-by: David Rientjes <[email protected]>
> >
> > The patch title, warning, and "Fixes" tag tend to give the impression
> > this is fixing a real user-visible bug. But the bug is theoretical, as
> > it's not possible to enable X86_MEM_ENCRYPT without AMD_MEM_ENCRYPT,
> > until patch 27.
> >
> > IMO it would be preferable to just squash this change with patch 27.
> >
> > Having it as a separate patch is also fine, but it shouldn't be
> > described as a fix or use the Fixes tag. It's more of a preparatory
> > patch.
>
> maintainer-tip.rst seems to disagree with you:
>
> A Fixes tag should be added even for changes which do not need to be
> backported to stable kernels, i.e. when addressing a recently introduced
> issue which only affects tip or the current head of mainline.
>
> I will leave it as is.

How does that disagree with me?

The "Fixes" tag is for bug fixes. If it's not possible to trigger the
warning and there's no user impact, it's not a bug.

--
Josh

2022-02-28 17:58:51

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv4 29/30] ACPICA: Avoid cache flush on TDX guest

On Mon, Feb 28, 2022 at 07:37:13PM +0300, Kirill A. Shutemov wrote:
> On Sun, Feb 27, 2022 at 05:34:45PM -0800, Dan Williams wrote:
> > On Sun, Feb 27, 2022 at 2:05 PM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > On Thu, Feb 24, 2022 at 06:56:29PM +0300, Kirill A. Shutemov wrote:
> > > > +/*
> > > > + * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
> > > > + * It is required to prevent data loss.
> > > > + *
> > > > + * While running inside TDX guest, the kernel can bypass cache flushing.
> > > > + * Changing sleep state in a virtual machine doesn't affect the host system
> > > > + * sleep state and cannot lead to data loss.
> > > > + *
> > > > + * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
> > > > + */
> > > > +#define ACPI_FLUSH_CPU_CACHE() \
> > > > +do { \
> > > > + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) \
> > > > + wbinvd(); \
> > > > +} while (0)
> > >
> > > If it's safe, why not do it for all VMs? Is there something specific
> > > about TDX which makes this more obviously known to be safe than for
> > > regular VMs?
> > >
> > > The patch description and the above comment make it sound like "we're
> > > not really sure this is safe, so we'll just use TDX as a testing ground
> > > for the idea." Which doesn't really inspire a lot of confidence in the
> > > stability of TD sleep states.
> >
> > Agree, why is this marked as "TODO"? The cache flushes associated with
> > ACPI sleep states are to flush cache before bare metal power loss to
> > CPU caches and bare metal transition of DDR in self-refresh mode. If a
> > cache flush is required it is the responsibility of the hypervisor.
> > Either it is safe for all guests or it is unsafe for all guests, not
> > TD specific.
>
> Do we have "any VM" check? I can't find it right away.

X86_FEATURE_HYPERVISOR

--
Josh

2022-02-28 18:02:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On 2/26/22 17:07, Kirill A. Shutemov wrote:
> On Thu, Feb 24, 2022 at 11:04:04AM -0800, Dave Hansen wrote:
>> On 2/24/22 07:56, Kirill A. Shutemov wrote:
>>> static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
>>> {
>>> - pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>>> - return false;
>>> + switch (ve->exit_reason) {
>>> + case EXIT_REASON_CPUID:
>>> + return handle_cpuid(regs);
>>> + default:
>>> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>>> + return false;
>>> + }
>>> }
>>
>> What does this mean for userspace? What kinds of things are we ceding
>> to the (untrusted) VMM to supply to userspace?
>
> Here's what I see called from userspace.
> CPUID(AX=0x2)
> CPUID(AX=0xb, CX=0x0)
> CPUID(AX=0xb, CX=0x1)
> CPUID(AX=0x40000000, CX=0xfffaba17)
> CPUID(AX=0x80000007, CX=0x121)

Hi Kirill,

I'm not quite sure what to make of this. Is this an *exhaustive* list
of CPUID values? Or is this an example of what you see on one system
and one boot of userspace?

What I really want to get at is what this *means*.

For instance, maybe all of these are in the hypervisor CPUID space.
Those basically *must* be supplied by the hypervisor.

>>> /* Handle the kernel #VE */
>>> @@ -200,6 +235,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>>> return read_msr(regs);
>>> case EXIT_REASON_MSR_WRITE:
>>> return write_msr(regs);
>>> + case EXIT_REASON_CPUID:
>>> + return handle_cpuid(regs);
>>> default:
>>> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>>> return false;
>> What kinds of random CPUID uses in the kernel at runtime need this
>> handling?
>
> CPUID(AX=0x2)
> CPUID(AX=0x6, CX=0x0)
> CPUID(AX=0xb, CX=0x0)
> CPUID(AX=0xb, CX=0x1)
> CPUID(AX=0xb, CX=0x2)
> CPUID(AX=0xf, CX=0x0)
> CPUID(AX=0xf, CX=0x1)
> CPUID(AX=0x10, CX=0x0)
> CPUID(AX=0x10, CX=0x1)
> CPUID(AX=0x10, CX=0x2)
> CPUID(AX=0x10, CX=0x3)
> CPUID(AX=0x16, CX=0x0)
> CPUID(AX=0x1f, CX=0x0)
> CPUID(AX=0x40000000, CX=0x0)
> CPUID(AX=0x40000000, CX=0xfffaba17)
> CPUID(AX=0x40000001, CX=0x0)
> CPUID(AX=0x80000002, CX=0x0)
> CPUID(AX=0x80000003, CX=0x0)
> CPUID(AX=0x80000004, CX=0x0)
> CPUID(AX=0x80000007, CX=0x0)
> CPUID(AX=0x80000007, CX=0x121)

OK, that's a good list. I guess I need to decode those and make sense
of them. Any help would be appreciated and would speed along this
review process.

>> Is it really OK that we let the VMM inject arbitrary CPUID
>> values into random CPUID uses in the kernel... silently?
>
> We realise that this is a possible attack vector and plan to implement
> proper filtering. But it is beyond core enabling.
>
>> Is this better than just returning 0's, for instance?
>
> Plain 0 injection breaks the boot. A more complicated solution is needed.

OK, so we're leaving the kernel open to something that might be an
attack vector: we know that we don't know how this might be bad. It's a
"known unknown"[1].

That doesn't seem *horrible*. But, it also doesn't seem great. There
are a lot of ways to address the situation. But, simply not mentioning
it in the changelog or cover letter isn't a great way to handle it.

Where do we *want* this to be? I'll take a stab at it:

In a perfect world, we'd simply keep a list of things that come from the
hypervisor and others where the kernel wants to provide the CPUID data.
But, there are always going to be new uses of CPUID. There's no way we
can keep a 100% complete list.

That means that the kernel needs to handle unknown CPUID use. There are
currently no known stupidly simple solutions like "return all 0's".

The other simplest solution is to just call into the hypervisor no
matter what the CPUID use is. This puts the kernel at the mercy of the
hypervisor to some unknown degree. But, it is OK given a benign hypervisor.

The kernel can WARN() or taint on the situation for now until you
develop a more robust list of items that can be deferred to the
hypervisor.

1. https://en.wikipedia.org/wiki/There_are_known_knowns

2022-02-28 18:04:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On Sun, Feb 27, 2022 at 02:01:30PM -0800, Josh Poimboeuf wrote:
> On Thu, Feb 24, 2022 at 06:56:01PM +0300, Kirill A. Shutemov wrote:
> > So far, AMD_MEM_ENCRYPT is the only user of X86_MEM_ENCRYPT. TDX will be
> > the second. It will make mem_encrypt.c build without AMD_MEM_ENCRYPT,
> > which triggers a warning:
> >
> > arch/x86/mm/mem_encrypt.c:69:13: warning: no previous prototype for
> > function 'mem_encrypt_init' [-Wmissing-prototypes]
> >
> > Fix it by moving mem_encrypt_init() declaration outside of #ifdef
> > CONFIG_AMD_MEM_ENCRYPT.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Fixes: 20f07a044a76 ("x86/sev: Move common memory encryption code to mem_encrypt.c")
> > Acked-by: David Rientjes <[email protected]>
>
> The patch title, warning, and "Fixes" tag tend to give the impression
> this is fixing a real user-visible bug. But the bug is theoretical, as
> it's not possible to enable X86_MEM_ENCRYPT without AMD_MEM_ENCRYPT,
> until patch 27.
>
> IMO it would be preferable to just squash this change with patch 27.
>
> Having it as a separate patch is also fine, but it shouldn't be
> described as a fix or use the Fixes tag. It's more of a preparatory
> patch.

maintainer-tip.rst seems to disagree with you:

A Fixes tag should be added even for changes which do not need to be
backported to stable kernels, i.e. when addressing a recently introduced
issue which only affects tip or the current head of mainline.

I will leave it as is.

--
Kirill A. Shutemov

2022-02-28 23:30:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On 2/28/22 14:53, Kirill A. Shutemov wrote:
> On Mon, Feb 28, 2022 at 08:41:38AM -0800, Dave Hansen wrote:
>>> We realise that this is a possible attack vector and plan to implement
>>> proper filtering. But it is beyond core enabling.
>>>
>>>> Is this better than just returning 0's, for instance?
>>> Plain 0 injection breaks the boot. A more complicated solution is needed.
>> OK, so we're leaving the kernel open to something that might be an
>> attack vector: we know that we don't know how this might be bad. It's a
>> "known unknown"[1].
> I looked deeper. The only CPUIDs that are actually required are from the
> hypervisor range (the range is reserved and will never be used by the CPU, so
> hypervisors adopt it for their own use).
>
> So this filtering makes the kernel boot (I didn't test much beyond that).
>
> /*
> * Only allow VMM to control range reserved for hypervisor
> * communication.
> *
> * Return all-zeros for any CPUID outside the range.
> */
> if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
> regs->ax = regs->bx = regs->cx = regs->dx = 0;
> return true;
> }
>
> We may tighten the range further (only a few leaves from the range are
> actually used during boot), but this should be good enough for this
> stage of enabling.

Seems sane to me. This closes off basically any ability for the VMM to
confuse the guest with CPUID values except for the ones that *must* be
hypervisor-controlled.

Does this, in practice, keep TDX guests from detecting any features that
it supports today?

2022-02-28 23:46:57

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On Mon, Feb 28, 2022 at 08:41:38AM -0800, Dave Hansen wrote:
> > We realise that this is a possible attack vector and plan to implement
> > proper filtering. But it is beyond core enabling.
> >
> >> Is this better than just returning 0's, for instance?
> >
> > Plain 0 injection breaks the boot. A more complicated solution is needed.
>
> OK, so we're leaving the kernel open to something that might be an
> attack vector: we know that we don't know how this might be bad. It's a
> "known unknown"[1].

I looked deeper. The only CPUIDs that are actually required are from the
hypervisor range (the range is reserved and will never be used by the CPU, so
hypervisors adopt it for their own use).

So this filtering makes the kernel boot (I didn't test much beyond that).

/*
* Only allow VMM to control range reserved for hypervisor
* communication.
*
* Return all-zeros for any CPUID outside the range.
*/
if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
regs->ax = regs->bx = regs->cx = regs->dx = 0;
return true;
}

We may tighten the range further (only a few leaves from the range are
actually used during boot), but this should be good enough for this
stage of enabling.
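
Folded into the #VE handler, the check sits at the top of handle_cpuid();
the surrounding function shape below is inferred from the CPUID patch earlier
in the series, so treat it as a sketch rather than the final code:

static bool handle_cpuid(struct pt_regs *regs)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_CPUID,	/* hypercall sub-function */
		.r12 = regs->ax,
		.r13 = regs->cx,
	};

	/*
	 * Only allow the VMM to control the range reserved for hypervisor
	 * communication; everything else reads as all-zeros.
	 */
	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
		regs->ax = regs->bx = regs->cx = regs->dx = 0;
		return true;
	}

	/* Ask the VMM to emulate CPUID; the outputs come back in R12-R15. */
	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
		return false;

	regs->ax = args.r12;
	regs->bx = args.r13;
	regs->cx = args.r14;
	regs->dx = args.r15;

	return true;
}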

Comments?

--
Kirill A. Shutemov

2022-03-01 00:32:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On Mon, Feb 28, 2022 at 03:05:26PM -0800, Dave Hansen wrote:
> On 2/28/22 14:53, Kirill A. Shutemov wrote:
> > On Mon, Feb 28, 2022 at 08:41:38AM -0800, Dave Hansen wrote:
> >>> We realise that this is a possible attack vector and plan to implement
> >>> proper filtering. But it is beyond core enabling.
> >>>
> >>>> Is this better than just returning 0's, for instance?
> >>> Plain 0 injection breaks the boot. A more complicated solution is needed.
> >> OK, so we're leaving the kernel open to something that might be an
> >> attack vector: we know that we don't know how this might be bad. It's a
> >> "known unknown"[1].
> > I looked deeper. The only CPUIDs that actually required are from the
> > hypervisor range (the range is reserved and never will be used by CPU, so
> > hypervisors adopt it for own use).
> >
> > So this filtering makes the kernel boot (I didn't test much beyond that).
> >
> > /*
> >  * Only allow VMM to control range reserved for hypervisor
> >  * communication.
> >  *
> >  * Return all-zeros for any CPUID outside the range.
> >  */
> > if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
> > 	regs->ax = regs->bx = regs->cx = regs->dx = 0;
> > 	return true;
> > }
> >
> > We may tighten the range further (only a few leaves from the range are
> > actually used during boot), but this should be good enough for this
> > stage of enabling.
>
> Seems sane to me. This closes off basically any ability for the VMM to
> confuse the guest with CPUID values except for the ones that *must* be
> hypervisor-controlled.
>
> Does this, in practice, keep TDX guests from detecting any features that
> they support today?

I scanned through the list of CPUID leaves that are probed via #VE during
boot and they are related to cache/TLB hierarchy enumeration, thermal and
topology. Without cache/TLB enumeration we may miss some optimizations.
Topology can be problematic: we may lose the ability to communicate the
configuration, I dunno.
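
If I map those categories to the standard leaf numbers, it is roughly the
set below. This is my assumption, not an audited list, and the macro names
are made up purely for readability:

/*
 * Leaves likely probed via #VE during boot (an assumption based on the
 * categories above, not an exhaustive audit):
 */
#define LEAF_CACHE_TLB_DESC	0x02	/* cache/TLB descriptors */
#define LEAF_CACHE_PARAMS	0x04	/* deterministic cache parameters */
#define LEAF_THERMAL_POWER	0x06	/* thermal and power management */
#define LEAF_EXT_TOPOLOGY	0x0b	/* extended topology enumeration */

All of these fall outside the hypervisor range, so with the filter they
would come back as all-zeros.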

Shouldn't be a show-stopper.

--
Kirill A. Shutemov

2022-03-01 01:23:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 10/30] x86/tdx: Handle CPUID via #VE

On 2/28/22 15:31, Kirill A. Shutemov wrote:
>> Does this, in practice, keep TDX guests from detecting any features that
>> they support today?
> I scanned through the list of CPUID leaves that are probed via #VE during
> boot and they are related to cache/TLB hierarchy enumeration, thermal and
> topology. Without cache/TLB enumeration we may miss some optimizations.
> Topology can be problematic: we may lose the ability to communicate the
> configuration, I dunno.
>
> Shouldn't be a show-stopper.

I can live with that for an initial TDX guest series. Does that bother
anyone else?

2022-03-01 09:47:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv4 01/30] x86/mm: Fix warning on build with X86_MEM_ENCRYPT=y

On Mon, Feb 28, 2022 at 09:11:59AM -0800, Josh Poimboeuf wrote:
> People rely on the "Fixes:" tag for actual bug fixes. Using it here --
> along with the rest of the "this is fixing a bug" tone of the title and
> description -- is guaranteed to confuse stable maintainers and distros
> doing backports.

Yes, they very much do. There are even automatic tools which look at Fixes
and give people work. So yeah, pls don't take this lightly.

> I would call it misinformation. How is that useful?
>
> It's ok to reference a related commit in the patch description itself.
> Just don't abuse the "Fixes" tag.
>
> But IMO, for the least amount of confusion, it makes more sense to
> squash this with the patch which actually requires it.

Full ack. Please squash it into the patch which causes the issue. And
frankly, I don't understand why you guys are making such a fuss about it
and why this needs to be a separate patch, at all?!

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-03-03 00:15:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 11/30] x86/tdx: Handle in-kernel MMIO

On Thu, Feb 24, 2022 at 12:11:54PM -0800, Dave Hansen wrote:
>
> I found a few things lacking in that description. How's that for a rewrite?

Looks great, thanks.
> > == Patching TDX drivers ==
> >
> > Rather than touching the entire kernel, it might also be possible to
> > just go after drivers that use MMIO in TDX guests. Right now, that's
> > limited only to virtio and some x86-specific drivers.
> >
> > All virtio MMIO appears to be done through a single function, which
> > makes virtio eminently easy to patch. This will be implemented in the
> > future, removing the bulk of MMIO #VEs.
>
> Given what is written here, this sounds like a great solution especially
> compared to all the instruction decoding nastiness. What's wrong with it?

This will not cover non-virtio users. So #VE-based MMIO will remain as a
fallback mechanism.
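
To sketch the idea (not the actual future patch; both function names below
are made up, and the feature check assumes the X86_FEATURE_TDX_GUEST bit
added by the detection patch), the single virtio MMIO accessor could branch
to a direct hypercall-based helper when running as a TDX guest, so the
common case never takes a #VE:

static u32 virtio_mmio_read32(void __iomem *addr)
{
	/*
	 * Hypothetical: tdx_mmio_read32() stands in for a helper that
	 * issues the MMIO read via TDVMCALL instead of faulting into
	 * the #VE handler.
	 */
	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return tdx_mmio_read32(addr);

	return readl(addr);
}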

> > +	switch (mmio) {
> > +	case MMIO_WRITE:
> > +		memcpy(&val, reg, size);
> > +		return mmio_write(size, ve->gpa, val);
> > +	case MMIO_WRITE_IMM:
> > +		val = insn.immediate.value;
> > +		return mmio_write(size, ve->gpa, val);
> > +	case MMIO_READ:
> > +	case MMIO_READ_ZERO_EXTEND:
> > +	case MMIO_READ_SIGN_EXTEND:
> > +		break;
> > +	case MMIO_MOVS:
> > +	case MMIO_DECODE_FAILED:
> > +		return false;
> > +	default:
> > +		BUG();
> > +	}
>
> Given the huge description above, it's borderline criminal to not
> discuss what could lead to this BUG().

This BUG() actually covers the "Unknown insn_decode_mmio() decode value" case.
I will add a comment there.


> It could literally be some minor tweak in the compiler that caused a
> non-io.h-using MMIO access to get converted over to an instruction that
> can't be decoded.
>
> Could we spend a few lines of comments to help out the future poor sod
> that sees "kernel bug at foo.c:1234"? Maybe:
>
> /*
>  * MMIO was accessed with an instruction that could not
>  * be decoded. It was likely not using io.h helpers or
>  * accessed MMIO accidentally.
>  */

Thanks, I will use the comment for MMIO_DECODE_FAILED handling.
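
So the end result could look roughly like this (just a sketch of how both
comments would be folded in, not the final patch):

	switch (mmio) {
	case MMIO_WRITE:
		memcpy(&val, reg, size);
		return mmio_write(size, ve->gpa, val);
	case MMIO_WRITE_IMM:
		val = insn.immediate.value;
		return mmio_write(size, ve->gpa, val);
	case MMIO_READ:
	case MMIO_READ_ZERO_EXTEND:
	case MMIO_READ_SIGN_EXTEND:
		break;
	case MMIO_MOVS:
	case MMIO_DECODE_FAILED:
		/*
		 * MMIO was accessed with an instruction that could not
		 * be decoded. It was likely not using io.h helpers or
		 * accessed MMIO accidentally.
		 */
		return false;
	default:
		/* Unknown insn_decode_mmio() decode value */
		BUG();
	}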

--
Kirill A. Shutemov