Hi All,
Intel's Trust Domain Extensions (TDX) protects confidential guest VMs
from the host and from physical attacks by isolating the guest register
state and by encrypting the guest memory. In TDX, a special TDX module
sits between the host and the guest, runs in a special CPU mode, and
manages the guest/host separation.
More details of TDX guests can be found in Documentation/x86/tdx.rst.
This cover letter is structured as follows:
1. Previous versions/changes : Details about the previous submissions
   and their patchset structure.
2. Current Patch-set Structure : Details about the current patchset
   and its structure.
3. Patch-set dependency : Information about this patch series'
   dependencies.
4. SEV/TDX comparison : Differences and similarities between SEV and
   TDX guests.
5. TDX/VM comparison : Differences between TDX guests and regular VMs.
Previous versions/changes:
--------------------------
Previously, the TDX patch series was submitted as 4 separate patch
sets. You can find them at the links below. Based on review feedback,
and to make the next review easier, we have re-organized the patchsets
into two sets (cleanup/core). Details are explained in the "Current
Patch-set Structure" section.
Initial support:
https://lore.kernel.org/lkml/20211009053747.1694419-1-sathyanarayanan.kuppuswamy@linux.intel.com/
#VE support:
https://lore.kernel.org/lkml/20211005204136.1812078-1-sathyanarayanan.kuppuswamy@linux.intel.com/
Boot support:
https://lore.kernel.org/lkml/20211005230550.1819406-1-sathyanarayanan.kuppuswamy@linux.intel.com/
Shared-mm support:
https://lore.kernel.org/lkml/20211009003711.1390019-1-sathyanarayanan.kuppuswamy@linux.intel.com/
Current Patch-set Structure:
----------------------------
The minimal code that can run real userspace is spread across 30
patches in 2 sets:
1. TDX Infrastructure/cleanup set (4 patches)
2. TDX core support (26 patches)
The "Infrastructure/cleanup set" includes infrastructure changes made
to generic code (such as code shared with the AMD SEV support) in
preparation for adding TDX guest support. It was posted as 2 different
patch series (links are included below).
Share common features between AMD SEV / TDX guest:
https://lore.kernel.org/r/[email protected]
Skip CSTAR MSR on Intel:
https://lore.kernel.org/lkml/20211116005103.2929441-1-sathyanarayanan.kuppuswamy@linux.intel.com/
The "TDX core support" patches add the TDX guest infrastructure to the
Linux kernel (#VE support for port I/O, MMIO, CPUID, etc.), detection
support, and some boot fixes.
This series adds that TDX core support.
Patch-set dependency:
---------------------
This series depends on the MMIO decoding patch set:
https://lore.kernel.org/lkml/[email protected]/
It also depends on the "TDX Infrastructure/cleanup set" above.
SEV/TDX comparison:
-------------------
TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to make changes in the guest
physical address space.
TDX/VM comparison:
------------------
Some of the key differences between a TD and a regular VM are:
1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE
exception into the guest TD for instructions that need to be emulated,
disallowed MSR accesses, etc.
3. By default, memory is marked as private, and the TD selectively shares it
with the VMM based on need.
You can find TDX-related documents at the following link.
https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
Andi Kleen (1):
x86/tdx: Early boot handling of port I/O
Isaku Yamahata (1):
x86/tdx: ioapic: Add shared bit for IOAPIC base address
Kirill A. Shutemov (13):
x86/traps: Add #VE support for TDX guest
x86/tdx: Add HLT support for TDX guests (#VE approach)
x86/tdx: Add MSR support for TDX guests
x86/tdx: Handle CPUID via #VE
x86/tdx: Handle in-kernel MMIO
x86/tdx: Get page shared bit info from the TDX Module
x86/tdx: Exclude shared bit from __PHYSICAL_MASK
x86/tdx: Make pages shared in ioremap()
x86/tdx: Add helper to convert memory between shared and private
x86/mm/cpa: Add support for TDX shared memory
x86/kvm: Use bounce buffers for TD guest
ACPICA: Avoid cache flush on TDX guest
x86/tdx: Warn about unexpected WBINVD
Kuppuswamy Sathyanarayanan (9):
x86/tdx: Detect running as a TDX guest in early boot
x86/tdx: Extend the cc_platform_has() API to support TDX guests
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper
functions
x86/tdx: Detect TDX at early kernel decompression time
x86/tdx: Support TDX guest port I/O at decompression time
x86/tdx: Add port I/O emulation
x86/acpi, x86/boot: Add multiprocessor wake-up support
x86/topology: Disable CPU online/offline control for TDX guests
Documentation/x86: Document TDX kernel architecture
Sean Christopherson (2):
x86/boot: Add a trampoline for booting APs via firmware handoff
x86/boot: Avoid #VE during boot for TDX platforms
Documentation/x86/index.rst | 1 +
Documentation/x86/tdx.rst | 194 ++++++++
arch/x86/Kconfig | 16 +
arch/x86/boot/compressed/Makefile | 2 +
arch/x86/boot/compressed/head_64.S | 25 +-
arch/x86/boot/compressed/misc.c | 8 +
arch/x86/boot/compressed/misc.h | 6 +
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/boot/compressed/tdcall.S | 3 +
arch/x86/boot/compressed/tdx.c | 33 ++
arch/x86/boot/compressed/tdx.h | 60 +++
arch/x86/boot/cpuflags.c | 13 +-
arch/x86/boot/cpuflags.h | 2 +
arch/x86/include/asm/acenv.h | 16 +-
arch/x86/include/asm/apic.h | 7 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/io.h | 16 +-
arch/x86/include/asm/pgtable.h | 19 +-
arch/x86/include/asm/realmode.h | 1 +
arch/x86/include/asm/tdx.h | 110 +++++
arch/x86/kernel/Makefile | 4 +
arch/x86/kernel/acpi/boot.c | 114 +++++
arch/x86/kernel/apic/apic.c | 10 +
arch/x86/kernel/apic/io_apic.c | 18 +-
arch/x86/kernel/asm-offsets.c | 20 +
arch/x86/kernel/cc_platform.c | 20 +-
arch/x86/kernel/head64.c | 7 +
arch/x86/kernel/head_64.S | 24 +-
arch/x86/kernel/idt.c | 3 +
arch/x86/kernel/process.c | 7 +
arch/x86/kernel/smpboot.c | 12 +-
arch/x86/kernel/tdcall.S | 302 ++++++++++++
arch/x86/kernel/tdx.c | 586 +++++++++++++++++++++++
arch/x86/kernel/traps.c | 79 +++
arch/x86/mm/ioremap.c | 5 +
arch/x86/mm/mem_encrypt.c | 36 +-
arch/x86/mm/pat/set_memory.c | 39 +-
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 63 ++-
arch/x86/realmode/rm/trampoline_common.S | 12 +-
include/linux/cc_platform.h | 19 +
kernel/cpu.c | 3 +
44 files changed, 1892 insertions(+), 39 deletions(-)
create mode 100644 Documentation/x86/tdx.rst
create mode 100644 arch/x86/boot/compressed/tdcall.S
create mode 100644 arch/x86/boot/compressed/tdx.c
create mode 100644 arch/x86/boot/compressed/tdx.h
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/kernel/tdcall.S
create mode 100644 arch/x86/kernel/tdx.c
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
The cc_platform_has() API is used in the kernel to enable confidential
computing features. Since a TDX guest is a confidential computing
platform, it also needs to use this API.

In preparation for extending the cc_platform_has() API to support TDX
guests, use the CPUID instruction to detect TDX guest support in the
early boot code (via tdx_early_init()). Since copy_bootdata() is the
first user of the cc_platform_has() API, detect the TDX guest status
before it runs.

Since the cc_platform_has() API will be used frequently across the boot
code, detect the TDX guest status once and cache the result, instead of
repeatedly issuing the CPUID instruction. Add a function
(is_tdx_guest()) that reads the cached TDX guest status for use in the
CC APIs.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit on a valid TDX guest platform. This feature bit will be used for
TDX-specific handling in areas of the arch code where a function call
to check for TDX guest status is not cost-effective (for example, TDX
hypercall support).
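The register-to-buffer mapping behind the detection can be illustrated
with a small userspace sketch (a hypothetical helper, not kernel code,
assuming the "IntelTDX" signature padded with spaces to 12 bytes):
CPUID leaf 0x21 returns the vendor string in EBX, EDX and ECX, which
is why the registers are stored as sig[0], sig[2], sig[1] before the
memcmp.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TDX_CPUID_LEAF_ID 0x21

/* Hypothetical helper mirroring the detection logic: CPUID(0x21)
 * returns the 12-byte vendor string in EBX, EDX, ECX (in that order),
 * so EBX lands in sig[0], EDX in sig[1] and ECX in sig[2] before the
 * byte-wise compare. */
static int tdx_sig_matches(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3];

	sig[0] = ebx;
	sig[2] = ecx;
	sig[1] = edx;

	return memcmp("IntelTDX    ", sig, 12) == 0;
}
```

The test values below are just the little-endian encodings of "Inte",
"lTDX" and four spaces.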
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 13 +++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++-
arch/x86/include/asm/tdx.h | 22 +++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/head64.c | 4 +++
arch/x86/kernel/tdx.c | 34 ++++++++++++++++++++++++
7 files changed, 82 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/tdx.h
create mode 100644 arch/x86/kernel/tdx.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 793e9b42ace0..a61ac6f8821a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -872,6 +872,19 @@ config ACRN_GUEST
IOT with small footprint and real-time features. More details can be
found in https://projectacrn.org/.
+# TDX guest uses X2APIC for interrupt management.
+config INTEL_TDX_GUEST
+ bool "Intel TDX (Trust Domain Extensions) - Guest Support"
+ depends on X86_64 && CPU_SUP_INTEL
+ depends on X86_X2APIC
+ help
+ Support running as a guest under Intel TDX. Without this support,
+ the guest kernel can not boot or run under TDX.
+ TDX includes memory encryption and integrity capabilities
+ which protect the confidentiality and integrity of guest
+ memory contents and CPU state. TDX guests are protected from
+ potential attacks from the VMM.
+
endif #HYPERVISOR_GUEST
source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d5b5f2ab87a0..fb178544fd21 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
#define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
#define X86_FEATURE_PVUNLOCK ( 8*32+20) /* "" PV unlock function */
#define X86_FEATURE_VCPUPREEMPT ( 8*32+21) /* "" PV vcpu_is_preempted function */
+#define X86_FEATURE_TDX_GUEST ( 8*32+22) /* Intel Trust Domain Extensions Guest */
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..f556086e6093 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -65,6 +65,12 @@
# define DISABLE_SGX (1 << (X86_FEATURE_SGX & 31))
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+# define DISABLE_TDX_GUEST 0
+#else
+# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -76,7 +82,7 @@
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
#define DISABLED_MASK7 (DISABLE_PTI)
-#define DISABLED_MASK8 0
+#define DISABLED_MASK8 (DISABLE_TDX_GUEST)
#define DISABLED_MASK9 (DISABLE_SMAP|DISABLE_SGX)
#define DISABLED_MASK10 0
#define DISABLED_MASK11 0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..686168941f92
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021-2022 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#include <linux/init.h>
+
+#define TDX_CPUID_LEAF_ID 0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+bool is_tdx_guest(void);
+
+#else
+
+static inline void tdx_early_init(void) { };
+static inline bool is_tdx_guest(void) { return false; }
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2ff3e600f426..64f9babcfd95 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -130,6 +130,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index fc5371a7e9d1..66deb2611dc5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
#include <asm/extable.h>
#include <asm/trapnr.h>
#include <asm/sev.h>
+#include <asm/tdx.h>
/*
* Manage page tables very early on.
@@ -498,6 +499,9 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
idt_setup_early_handler();
+ /* Needed before cc_platform_has() can be used for TDX: */
+ tdx_early_init();
+
copy_bootdata(__va(real_mode_data));
/*
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..d32d9d9946d8
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2021-2022 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/cpufeature.h>
+#include <asm/tdx.h>
+
+static bool tdx_guest_detected __ro_after_init;
+
+bool is_tdx_guest(void)
+{
+ return tdx_guest_detected;
+}
+
+void __init tdx_early_init(void)
+{
+ u32 eax, sig[3];
+
+ if (cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+ return;
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
+
+ if (memcmp("IntelTDX    ", sig, 12))
+ return;
+
+ tdx_guest_detected = true;
+
+ setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+ pr_info("Guest detected\n");
+}
--
2.32.0
From: Andi Kleen <[email protected]>
TDX guests cannot do port I/O directly. Instead, the TDX module
triggers a #VE exception to let the guest kernel emulate port I/O by
converting the I/O instructions into TDCALLs to the host.

But before the IDT handlers are set up, port I/O cannot be emulated
using the normal kernel #VE handlers. To support #VE-based emulation
during this boot window, add minimal early #VE handler support to the
early exception handlers. This is similar to what AMD SEV does, and is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver (although the latter is expected not to be used).

The early handler only supports I/O-related #VE exceptions. Unhandled
or failed exceptions fall through to early_fixup_exception() (like
normal exception failures).
This early handler enables the use of normal in*/out* macros without
patching them for every driver. Since there is no expectation that
early port I/O is performance-critical, the #VE emulation cost is worth
the simplicity benefit of not patching the port I/O usage in early
code. There are also no concerns with nesting, since there should be
no NMIs or interrupts this early.
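For reference, the I/O exit-qualification fields the early handler
consumes can be decoded with a few helpers. This is a standalone
sketch of the bit layout documented in the Intel SDM ("Exit
Qualification for I/O Instructions"), with hypothetical helper names,
not the kernel's actual macros:

```c
#include <assert.h>
#include <stdint.h>

/* Bits 2:0 encode (access size - 1). */
static unsigned int ve_io_size(uint64_t q)  { return (unsigned int)(q & 7) + 1; }
/* Bit 3 encodes the direction: 1 = IN, 0 = OUT. */
static int ve_io_is_in(uint64_t q)          { return (int)((q >> 3) & 1); }
/* Bit 4 is set for string instructions (INS/OUTS). */
static int ve_io_is_string(uint64_t q)      { return (int)((q >> 4) & 1); }
/* Bits 31:16 hold the port number. */
static unsigned int ve_io_port(uint64_t q)  { return (unsigned int)((q >> 16) & 0xffff); }
```

For example, an `inb` from the COM1 port 0x3f8 produces an exit
qualification with the direction bit set and the port in the high
field.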
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/tdx.h | 4 ++++
arch/x86/kernel/head64.c | 3 +++
arch/x86/kernel/tdx.c | 17 +++++++++++++++++
3 files changed, 24 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 3be9d0e9f7a0..d2ffc9a6ba53 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -74,12 +74,16 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
void tdx_guest_idle(void);
+bool tdx_early_handle_ve(struct pt_regs *regs);
+
#else
static inline void tdx_early_init(void) { };
static inline bool is_tdx_guest(void) { return false; }
static inline void tdx_guest_idle(void) { };
+static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+
#endif /* CONFIG_INTEL_TDX_GUEST */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 66deb2611dc5..d42996a28722 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -415,6 +415,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
return;
+ if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+ return;
+
early_fixup_exception(regs, trapnr);
}
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 00bf02bc9838..82e848006e3e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -304,6 +304,23 @@ static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
return !ret;
}
+/*
+ * Early #VE exception handler. Only handles a subset of port I/O.
+ * Intended only for earlyprintk. If failed, return false.
+ */
+__init bool tdx_early_handle_ve(struct pt_regs *regs)
+{
+ struct ve_info ve;
+
+ if (tdx_get_ve_info(&ve))
+ return false;
+
+ if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
+ return false;
+
+ return tdx_handle_io(regs, ve.exit_qual);
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
--
2.32.0
Intel TDX doesn't allow the VMM to access guest private memory. Any
memory that is required for communication with the VMM must be shared
explicitly by setting a bit in the page table entry. Which bit in the
page table entry indicates the shared/private state can be determined
using the TDINFO TDCALL (a call to the TDX module).

Fetch and save the guest TD execution environment information at
initialization time. The next patch will use this information.

More details about the TDINFO TDCALL can be found in the
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, section titled "TDCALL [TDG.VP.INFO]".
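To illustrate how a later patch will consume gpa_width, here is a
minimal sketch (helper names are hypothetical, and the "topmost GPA
bit" convention is the one described in the GHCI spec): the
shared/private indicator is the highest guest-physical-address bit,
and the usable physical-address mask excludes it.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's GENMASK_ULL(). */
#define GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* The shared bit is the top bit of the guest physical address space. */
static uint64_t td_shared_mask(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

/* Usable physical address bits, excluding the shared bit. */
static uint64_t td_physical_mask(unsigned int gpa_width)
{
	return GENMASK_ULL(gpa_width - 2, 0);
}
```

With the common gpa_width of 48, the shared bit is bit 47 and the
physical mask covers bits 46:0, so the two never overlap.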
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 82e848006e3e..b55bc1879d1e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -11,6 +11,7 @@
#include <asm/insn-eval.h>
/* TDX Module Call Leaf IDs */
+#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3
/* See Exit Qualification for I/O Instructions in VMX documentation */
@@ -19,6 +20,12 @@
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
#define VE_IS_IO_STRING(exit_qual) ((exit_qual) & 16 ? 1 : 0)
+/* Guest TD execution environment information */
+static struct {
+ unsigned int gpa_width;
+ unsigned long attributes;
+} td_info __ro_after_init;
+
static bool tdx_guest_detected __ro_after_init;
/*
@@ -45,6 +52,28 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}
+static void tdx_get_info(void)
+{
+ struct tdx_module_output out;
+ u64 ret;
+
+ /*
+ * TDINFO TDX Module call is used to get the TD execution environment
+ * information like GPA width, number of available vcpus, debug mode
+ * information, etc. More details about the ABI can be found in TDX
+ * Guest-Host-Communication Interface (GHCI), sec 2.4.2 TDCALL
+ * [TDG.VP.INFO].
+ */
+ ret = __tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+ /* Non zero return value indicates buggy TDX module, so panic */
+ if (ret)
+ panic("TDINFO TDCALL failed (Buggy TDX module!)\n");
+
+ td_info.gpa_width = out.rcx & GENMASK(5, 0);
+ td_info.attributes = out.rdx;
+}
+
static __cpuidle u64 _tdx_halt(const bool irq_disabled, const bool do_sti)
{
/*
@@ -448,5 +477,7 @@ void __init tdx_early_init(void)
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+ tdx_get_info();
+
pr_info("Guest detected\n");
}
--
2.32.0
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to unmapped pages (EPT violation)
In the settings that Linux will run in, virtualization exceptions are
never generated on accesses to normal, TD-private memory that has been
accepted.
The #VE handler implementation is simplified by the fact that the
entry paths do not trigger #VE and that the handler may not be
interrupted. Specifically, the implementation assumes that the entry
paths do not access TD-shared memory or MMIO regions, and do not use
MSRs, instructions, or CPUID leaves that might generate #VE.
Interrupts, including NMIs, are blocked by the hardware starting with
#VE delivery until TDGETVEINFO is called. All of this combined
eliminates the chance of a #VE during the syscall gap or the paranoid
entry paths.
After TDGETVEINFO, a #VE could happen in theory (e.g. through an NMI),
but it is expected not to, because TDX expects NMIs not to trigger
#VEs. Another case where a #VE could happen is if the #VE handler
itself panics, but since the platform is already in a panic state at
that point, a nested #VE is not a concern.
If a guest kernel action which would normally cause a #VE occurs in
the interrupt-disabled region before TDGETVEINFO, a #DF (double
fault) is delivered to the guest, which will result in an oops (and
should eventually be a panic, as panic_on_oops is expected to be set
to 1 for TDX guests).
Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.
For now, handle unhandled #VEs (everything, until later in this
series) as if they were a #GP by calling ve_raise_fault() directly.
The ve_raise_fault() function is similar to the #GP handler and is
responsible for sending SIGSEGV to userspace, making the CPU die, and
notifying debuggers and other die-chain users.
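The R10 packing returned by TDGETVEINFO, and the resulting RIP advance
on successful handling, can be sketched with hypothetical helpers
mirroring the ve_info fields above: the instruction length lives in
the low 32 bits of R10 and the instruction info in the high 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of R10: length of the instruction that caused the #VE. */
static uint32_t ve_instr_len(uint64_t r10)  { return (uint32_t)r10; }
/* High 32 bits of R10: instruction information. */
static uint32_t ve_instr_info(uint64_t r10) { return (uint32_t)(r10 >> 32); }

/* On successful emulation, RIP is moved past the faulting instruction. */
static uint64_t ve_next_ip(uint64_t ip, uint64_t r10)
{
	return ip + ve_instr_len(r10);
}
```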
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/tdx.h | 21 +++++++++
arch/x86/kernel/idt.c | 3 ++
arch/x86/kernel/tdx.c | 66 +++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 79 +++++++++++++++++++++++++++++++++
5 files changed, 173 insertions(+)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 78bfd7dc9b2f..8c33d7439c08 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -4,6 +4,7 @@
#define _ASM_X86_TDX_H
#include <linux/init.h>
+#include <asm/ptrace.h>
#define TDX_CPUID_LEAF_ID 0x21
#define TDX_HYPERCALL_STANDARD 0
@@ -38,6 +39,22 @@ struct tdx_hypercall_output {
u64 r15;
};
+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ /* Guest Linear (virtual) Address */
+ u64 gla;
+ /* Guest Physical (virtual) Address */
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
@@ -51,6 +68,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
u64 r15, struct tdx_hypercall_output *out);
+bool tdx_get_ve_info(struct ve_info *ve);
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+
#else
static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..1da074123c16 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};
/*
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1cc850fd03ff..b6d0e45e6589 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,9 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>
+/* TDX Module Call Leaf IDs */
+#define TDX_GET_VEINFO 3
+
static bool tdx_guest_detected __ro_after_init;
/*
@@ -33,6 +36,69 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}
+bool tdx_get_ve_info(struct ve_info *ve)
+{
+ struct tdx_module_output out;
+
+ /*
+ * NMIs and machine checks are suppressed. Before this point any
+ * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+ * additional #VEs are permitted (but it is expected not to
+ * happen unless kernel panics).
+ */
+ if (__tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out))
+ return false;
+
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = lower_32_bits(out.r10);
+ ve->instr_info = upper_32_bits(out.r10);
+
+ return true;
+}
+
+/*
+ * Handle the user initiated #VE.
+ *
+ * For example, executing the CPUID instruction from the user
+ * space is a valid case and hence the resulting #VE had to
+ * be handled.
+ *
+ * For dis-allowed or invalid #VE just return failure.
+ *
+ * Return True on success and False on failure.
+ */
+static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
+}
+
+/* Handle the kernel #VE */
+static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+ bool ret;
+
+ if (user_mode(regs))
+ ret = tdx_virt_exception_user(regs, ve);
+ else
+ ret = tdx_virt_exception_kernel(regs, ve);
+
+ /* After successful #VE handling, move the IP */
+ if (ret)
+ regs->ip += ve->instr_len;
+
+ return ret;
+}
+
bool is_tdx_guest(void)
{
return tdx_guest_detected;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..24791a8bcd63 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1212,6 +1213,84 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+ struct task_struct *tsk = current;
+
+ if (user_mode(regs)) {
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+ show_signal(tsk, SIGSEGV, "", VE_FAULT_STR, regs, error_code);
+ force_sig(SIGSEGV);
+ return;
+ }
+
+ /*
+ * Attempt to recover from #VE exception failure without
+ * triggering OOPS (useful for MSR read/write failures)
+ */
+ if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+ return;
+
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_VE;
+
+ /*
+ * To be potentially processing a kprobe fault and to trust the result
+ * from kprobe_running(), it should be non-preemptible.
+ */
+ if (!preemptible() && kprobe_running() &&
+ kprobe_fault_handler(regs, X86_TRAP_VE))
+ return;
+
+ /* Notify about #VE handling failure, useful for debugger hooks */
+ if (notify_die(DIE_GPF, VE_FAULT_STR, regs, error_code,
+ X86_TRAP_VE, SIGSEGV) == NOTIFY_STOP)
+ return;
+
+ /* Trigger OOPS and panic */
+ die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * In TDX guests, specific MSRs, instructions, CPUID leaves, shared
+ * memory access triggers #VE. The tdx_handle_virt_exception() will
+ * try to handle the #VE using appropriate hypercalls. For unhandled
+ * or failed #VEs, attempt recovery using fixups or raise fault if
+ * failed.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+ bool ret;
+
+ /*
+ * NMIs/Machine-checks/Interrupts will be in a disabled state
+ * till TDGETVEINFO TDCALL is executed. This ensures that VE
+ * info cannot be overwritten by a nested #VE.
+ */
+ ret = tdx_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ if (ret)
+ ret = tdx_handle_virt_exception(regs, &ve);
+ /*
+ * If tdx_handle_virt_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (!ret)
+ ve_raise_fault(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.32.0
Intel TDX doesn't allow the VMM to directly access guest private
memory. Any memory that is required for communication with the VMM
must be shared explicitly. The same rule applies to any DMA to and
from the TDX guest: all DMA pages have to be marked as shared pages. A
generic way to achieve this without any changes to device drivers is
to use the SWIOTLB framework.

Force SWIOTLB on in TD guests and make the SWIOTLB buffer shared by
generalizing mem_encrypt_init() to cover TDX.
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cc_platform.c | 1 +
arch/x86/kernel/tdx.c | 3 +++
arch/x86/mm/mem_encrypt.c | 9 ++++++++-
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 4a3064bf1eb5..013224679b98 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -21,6 +21,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_TDX:
case CC_ATTR_GUEST_MEM_ENCRYPT:
+ case CC_ATTR_MEM_ENCRYPT:
return true;
default:
return false;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 9ef3cf0879d3..2175336d1a2a 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,7 @@
#define pr_fmt(fmt) "tdx: " fmt
#include <linux/cpufeature.h>
+#include <linux/swiotlb.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
#include <asm/insn.h>
@@ -575,5 +576,7 @@ void __init tdx_early_init(void)
*/
physical_mask &= GENMASK_ULL(td_info.gpa_width - 2, 0);
+ swiotlb_force = SWIOTLB_FORCE;
+
pr_info("Guest detected\n");
}
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 8b9de7e478c6..3a4230226388 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -69,7 +69,14 @@ bool force_dma_unencrypted(struct device *dev)
static void print_mem_encrypt_feature_info(void)
{
- pr_info("AMD Memory Encryption Features active:");
+ pr_info("Memory Encryption Features active:");
+
+ if (is_tdx_guest()) {
+ pr_cont(" Intel TDX\n");
+ return;
+ }
+
+ pr_cont("AMD ");
/* Secure Memory Encryption */
if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
--
2.32.0
TDX steals a bit from the physical address and uses it to indicate
whether the page is private to the guest (bit clear) or unprotected
and shared with the VMM (bit set).

AMD SEV uses a similar scheme, repurposing a bit from the physical
address to indicate encrypted or decrypted pages.

The kernel already has the infrastructure to deal with
encrypted/decrypted pages for AMD SEV. Modify __set_memory_enc_pgtable()
to make it aware of TDX.

After modifying page table entries, the kernel needs to notify the VMM
about the change with tdx_hcall_request_gpa_type().
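Conceptually, the private-to-shared conversion just toggles the shared
bit in the page table entry before the VMM is notified. A hypothetical
sketch (not the kernel's pgprot helpers), analogous to
pgprot_encrypted()/pgprot_decrypted() for SEV:

```c
#include <assert.h>
#include <stdint.h>

/* Set or clear the (platform-provided) shared bit in a raw PTE value:
 * bit set means the page is shared with the VMM, bit clear means it
 * stays private to the guest. */
static uint64_t pte_set_shared(uint64_t pte, uint64_t shared_mask, int shared)
{
	return shared ? (pte | shared_mask) : (pte & ~shared_mask);
}
```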
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Tested-by: Kai Huang <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cc_platform.c | 1 +
arch/x86/mm/pat/set_memory.c | 39 +++++++++++++++++++++++++++++++----
2 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index a0fc329edc35..4a3064bf1eb5 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -20,6 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
case CC_ATTR_GUEST_UNROLL_STRING_IO:
case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_TDX:
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
return true;
default:
return false;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index b4072115c8ef..3a89966c30a9 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -32,6 +32,7 @@
#include <asm/set_memory.h>
#include <asm/hyperv-tlfs.h>
#include <asm/mshyperv.h>
+#include <asm/tdx.h>
#include "../mm_internal.h"
@@ -1983,12 +1984,21 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}
+static pgprot_t pgprot_cc_mask(bool enc)
+{
+ if (enc)
+ return pgprot_encrypted(__pgprot(0));
+ else
+ return pgprot_decrypted(__pgprot(0));
+}
+
/*
* __set_memory_enc_pgtable() is used for the hypervisors that get
* informed about "encryption" status via page tables.
*/
static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
{
+ enum tdx_map_type map_type;
struct cpa_data cpa;
int ret;
@@ -1999,8 +2009,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
- cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
- cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+ cpa.mask_set = pgprot_cc_mask(enc);
+ cpa.mask_clr = pgprot_cc_mask(!enc);
+ map_type = enc ? TDX_MAP_PRIVATE : TDX_MAP_SHARED;
+
cpa.pgd = init_mm.pgd;
/* Must avoid aliasing mappings in the highmem code */
@@ -2008,9 +2021,17 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
vm_unmap_aliases();
/*
- * Before changing the encryption attribute, we need to flush caches.
+ * Before changing the encryption attribute, flush caches.
+ *
+ * For TDX, guest is responsible for flushing caches on private->shared
+ * transition. VMM is responsible for flushing on shared->private.
*/
- cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+ if (map_type == TDX_MAP_SHARED)
+ cpa_flush(&cpa, 1);
+ } else {
+ cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ }
ret = __change_page_attr_set_clr(&cpa, 1);
@@ -2023,6 +2044,16 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);
+ /*
+ * For TDX Guest, raise hypercall to request memory mapping
+ * change with the VMM.
+ */
+ if (!ret && cc_platform_has(CC_ATTR_GUEST_TDX)) {
+ ret = tdx_hcall_request_gpa_type(__pa(addr),
+ __pa(addr) + numpages * PAGE_SIZE,
+ map_type);
+ }
+
/*
* Notify hypervisor that a given memory range is mapped encrypted
* or decrypted.
--
2.32.0
From: Isaku Yamahata <[email protected]>
The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.
When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.
Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as
shared. However, the IOAPIC code does not use ioremap() and instead
uses the fixmap mechanism.
Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c1bb384935b0..d2fef5893e41 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/memblock.h>
#include <linux/msi.h>
+#include <linux/cc_platform.h>
#include <asm/irqdomain.h>
#include <asm/io.h>
@@ -65,6 +66,7 @@
#include <asm/irq_remapping.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
+#include <asm/pgtable.h>
#define for_each_ioapic(idx) \
for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
@@ -2677,6 +2679,18 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+ phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ /* Set TDX guest shared bit in pgprot flags */
+ if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ flags = pgprot_decrypted(flags);
+
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2709,7 +2723,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2838,7 +2852,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;
- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
The early decompression code does port I/O for its console output. But
handling decompression-time port I/O demands a different approach from
normal runtime because the IDT required to support #VE-based port I/O
emulation is not yet set up. Paravirtualizing I/O calls during the
decompression step is acceptable because the decompression code is small
enough that patching it will not bloat the image size significantly.
To support port I/O in the decompression code, TDX must be detected before
the decompression code might do port I/O. Add support for detecting TDX
guest status before console_init() in extract_kernel(). Detecting it
before console_init() is early enough for patching port I/O.
Add an early_is_tdx_guest() interface to get the cached TDX guest
status in the decompression code.
The actual port I/O paravirtualization will come later in the series.
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/misc.c | 8 ++++++++
arch/x86/boot/compressed/misc.h | 2 ++
arch/x86/boot/compressed/tdx.c | 33 +++++++++++++++++++++++++++++++
arch/x86/boot/compressed/tdx.h | 16 +++++++++++++++
arch/x86/boot/cpuflags.c | 13 ++++++++++--
arch/x86/boot/cpuflags.h | 2 ++
7 files changed, 73 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdx.c
create mode 100644 arch/x86/boot/compressed/tdx.h
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 431bf7f846c3..22a2a6cc2ab4 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -98,6 +98,7 @@ ifdef CONFIG_X86_64
endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index a4339cb2d247..1b07755a5efd 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;
+ /*
+ * Detect if we are running in TDX guest environment.
+ *
+ * It has to be done before console_init() to use paravirtualized
+ * port I/O operations if needed.
+ */
+ early_tdx_detect();
+
console_init();
/*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 16ed360b6692..0d8e275a9d96 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,8 @@
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
+#include "tdx.h"
+
#define BOOT_CTYPE_H
#include <linux/acpi.h>
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..2ee7d0bdf907
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include "../cpuflags.h"
+#include "../string.h"
+
+#define TDX_CPUID_LEAF_ID 0x21
+#define TDX_IDENT		"IntelTDX    "
+
+static bool tdx_guest_detected;
+
+void early_tdx_detect(void)
+{
+ u32 eax, sig[3];
+
+ if (cpuid_max_leaf() < TDX_CPUID_LEAF_ID)
+ return;
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
+
+ if (memcmp(TDX_IDENT, sig, 12))
+ return;
+
+ /* Cache TDX guest feature status */
+ tdx_guest_detected = true;
+}
+
+bool early_is_tdx_guest(void)
+{
+ return tdx_guest_detected;
+}
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
new file mode 100644
index 000000000000..18970c09512e
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021 Intel Corporation */
+#ifndef BOOT_COMPRESSED_TDX_H
+#define BOOT_COMPRESSED_TDX_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+void early_tdx_detect(void);
+bool early_is_tdx_guest(void);
+#else
+static inline void early_tdx_detect(void) { };
+static inline bool early_is_tdx_guest(void) { return false; }
+#endif
+
+#endif
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index a0b75f73dc63..7f97bf88c540 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -71,8 +71,7 @@ int has_eflag(unsigned long mask)
# define EBX_REG "=b"
#endif
-static inline void cpuid_count(u32 id, u32 count,
- u32 *a, u32 *b, u32 *c, u32 *d)
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
{
asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t"
"cpuid \n\t"
@@ -82,6 +81,16 @@ static inline void cpuid_count(u32 id, u32 count,
);
}
+/* Get maximum leaf value using CPUID(0) */
+u32 cpuid_max_leaf(void)
+{
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0, 0, &eax, &ebx, &ecx, &edx);
+
+ return eax;
+}
+
#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)
void get_cpuflags(void)
diff --git a/arch/x86/boot/cpuflags.h b/arch/x86/boot/cpuflags.h
index 2e20814d3ce3..dea67090a22f 100644
--- a/arch/x86/boot/cpuflags.h
+++ b/arch/x86/boot/cpuflags.h
@@ -17,5 +17,7 @@ extern u32 cpu_vendor[3];
int has_eflag(unsigned long mask);
void get_cpuflags(void);
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d);
+u32 cpuid_max_leaf(void);
#endif
--
2.32.0
In TDX guests, memory is protected from host access by default. If a
guest needs to communicate with the VMM (as in the I/O use case), it uses
a single bit in the physical address to communicate the protected/shared
attribute of the given page.
In the x86 architecture code, the __PHYSICAL_MASK macro represents the
width of the physical address for the given architecture. It is used in
creating the physical PAGE_MASK for address bits in the kernel. Since a
TDX guest uses a single bit as metadata, that bit needs to be excluded
from the valid physical address bits to avoid using incorrect address
bits in the kernel.
Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK.
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/tdx.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fc4771a07fc0..fe0382f20445 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
select X86_MCE
+ select DYNAMIC_PHYSICAL_MASK
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b55bc1879d1e..cbfacc2af8bb 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -479,5 +479,13 @@ void __init tdx_early_init(void)
tdx_get_info();
+ /*
+ * All bits above GPA width are reserved and kernel treats shared bit
+ * as flag, not as part of physical address.
+ *
+ * Adjust physical mask to only cover valid GPA bits.
+ */
+ physical_mask &= GENMASK_ULL(td_info.gpa_width - 2, 0);
+
pr_info("Guest detected\n");
}
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called TDCALL.
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer (TDX module) exists in the CPU to facilitate secure communication
between the host and the guest. TDX guests communicate with the TDX module
using the TDCALL instruction.
A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A variant of TDCALL used to communicate
with the VMM is called TDVMCALL.
Add generic interfaces to communicate with the TDX Module and VMM
(using the TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from the
VMM (via TDVMCALL).
__tdx_module_call() - Used to communicate with the TDX Module (via
TDCALL).
Also define an additional wrapper, _tdx_hypercall(), which adds error
handling support for TDCALL failures.
The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.
For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
Originally-by: Sean Christopherson <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/tdx.h | 39 +++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/asm-offsets.c | 20 +++
arch/x86/kernel/tdcall.S | 270 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/tdx.c | 24 +++
5 files changed, 354 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/tdcall.S
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 686168941f92..78bfd7dc9b2f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -6,12 +6,51 @@
#include <linux/init.h>
#define TDX_CPUID_LEAF_ID 0x21
+#define TDX_HYPERCALL_STANDARD 0
+
+/*
+ * Used in __tdx_module_call() to gather the output registers'
+ * values of the TDCALL instruction when requesting services from
+ * the TDX module. This is a software only structure and not part
+ * of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() to gather the output registers' values
+ * of the TDCALL instruction when requesting services from the VMM.
+ * This is a software only structure and not part of the TDX
+ * module/VMM ABI.
+ */
+struct tdx_hypercall_output {
+ u64 r10;
+ u64 r11;
+ u64 r12;
+ u64 r13;
+ u64 r14;
+ u64 r15;
+};
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
bool is_tdx_guest(void);
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out);
+
#else
static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8c9a9214dd34..59578a6488ad 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -133,7 +133,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o
obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index ecd3fd6993d1..a00a45e411a7 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
#include <asm/bootparam.h>
#include <asm/suspend.h>
#include <asm/tlbflush.h>
+#include <asm/tdx.h>
#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -68,6 +69,25 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ BLANK();
+ /* Offset for fields in tdx_module_output */
+ OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_output, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_output, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+ /* Offset for fields in tdx_hypercall_output */
+ OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_output, r10);
+ OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+ OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+ OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+ OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+ OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..ee52dde01b24
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,270 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+/*
+ * Bitmasks of exposed registers (with VMM).
+ */
+#define TDX_R10 BIT(10)
+#define TDX_R11 BIT(11)
+#define TDX_R12 BIT(12)
+#define TDX_R13 BIT(13)
+#define TDX_R14 BIT(14)
+#define TDX_R15 BIT(15)
+
+/* Frame offset + 8 (for arg1) */
+#define ARG7_SP_OFFSET (FRAME_OFFSET + 0x08)
+
+/*
+ * These registers are clobbered to hold arguments for each
+ * TDVMCALL. They are safe to expose to the VMM.
+ * Each bit in this mask represents a register ID. Bit field
+ * details can be found in TDX GHCI specification, section
+ * titled "TDCALL [TDG.VP.VMCALL] leaf".
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \
+ TDX_R12 | TDX_R13 | \
+ TDX_R14 | TDX_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call() - Used by TDX guests to request services from
+ * the TDX module (does not include VMM services).
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI. After TDCALL operation, TDX Module output is saved
+ * in @out (if it is provided by the user)
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX - TDCALL Leaf number.
+ * RCX,RDX,R8-R9 - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction error code.
+ * RCX,RDX,R8-R11 - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn (RDI) - TDCALL Leaf ID, moved to RAX
+ * @rcx (RSI) - Input parameter 1, moved to RCX
+ * @rdx (RDX) - Input parameter 2, moved to RDX
+ * @r8 (RCX) - Input parameter 3, moved to R8
+ * @r9 (R8) - Input parameter 4, moved to R9
+ *
+ * @out (R9) - struct tdx_module_output pointer
+ * stored temporarily in R12 (not
+ * shared with the TDX module). It
+ * can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+ FRAME_BEGIN
+
+ /*
+ * R12 will be used as temporary storage for
+ * struct tdx_module_output pointer. Since R12-R15
+ * registers are not used by TDCALL services supported
+ * by this function, it can be reused.
+ */
+
+ /* Callee saved, so preserve it */
+ push %r12
+
+ /*
+ * Push output pointer to stack.
+ * After the TDCALL operation, it will be fetched
+ * into R12 register.
+ */
+ push %r9
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Move TDCALL Leaf ID to RAX */
+ mov %rdi, %rax
+ /* Move input 4 to R9 */
+ mov %r8, %r9
+ /* Move input 3 to R8 */
+ mov %rcx, %r8
+ /* Move input 1 to RCX */
+ mov %rsi, %rcx
+ /* Leave input param 2 in RDX */
+
+ tdcall
+
+ /*
+ * Fetch output pointer from stack to R12 (It is used
+ * as temporary storage)
+ */
+ pop %r12
+
+ /* Check for TDCALL success: 0 - Successful, otherwise failed */
+ test %rax, %rax
+ jnz mcall_done
+
+ /*
+ * Since this function can be initiated without an output pointer,
+ * check if caller provided an output struct before storing
+ * output registers.
+ */
+ test %r12, %r12
+ jz mcall_done
+
+ /* Copy TDCALL result registers to output struct: */
+ movq %rcx, TDX_MODULE_rcx(%r12)
+ movq %rdx, TDX_MODULE_rdx(%r12)
+ movq %r8, TDX_MODULE_r8(%r12)
+ movq %r9, TDX_MODULE_r9(%r12)
+ movq %r10, TDX_MODULE_r10(%r12)
+ movq %r11, TDX_MODULE_r11(%r12)
+
+mcall_done:
+ /* Restore the state of R12 register */
+ pop %r12
+
+ FRAME_END
+ ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Make hypercalls to a TDX VMM.
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI. After TDCALL operation, VMM output is saved in @out.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX - BITMAP which controls which part of TD Guest GPR
+ * is passed as-is to the VMM and back.
+ * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
+ * specification. Non zero value indicates vendor
+ * specific ABI.
+ * R11 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12-R15 - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX - TDCALL instruction status (Not related to hypercall
+ * output).
+ * R10 - Hypercall output error code.
+ * R11-R15 - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @type (RDI) - TD VMCALL type, moved to R10
+ * @fn (RSI) - TD VMCALL sub function, moved to R11
+ * @r12 (RDX) - Input parameter 1, moved to R12
+ * @r13 (RCX) - Input parameter 2, moved to R13
+ * @r14 (R8) - Input parameter 3, moved to R14
+ * @r15 (R9) - Input parameter 4, moved to R15
+ *
+ * @out (stack) - struct tdx_hypercall_output pointer (cannot be NULL)
+ *
+ * On successful completion, return TDCALL status or -EINVAL for invalid
+ * inputs.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+ FRAME_BEGIN
+
+ /* Move argument 7 from caller stack to RAX */
+ movq ARG7_SP_OFFSET(%rsp), %rax
+
+ /* Check if caller provided an output struct */
+ test %rax, %rax
+ /* If out pointer is NULL, return -EINVAL */
+ jz ret_err
+
+ /* Save callee-saved GPRs as mandated by the x86_64 ABI */
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+
+ /*
+ * Save output pointer (rax) in stack, it will be used
+ * again when storing the output registers after the
+ * TDCALL operation.
+ */
+ push %rax
+
+ /* Mangle function call ABI into TDCALL ABI: */
+ /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+ xor %eax, %eax
+ /* Move TDVMCALL type (standard vs vendor) in R10 */
+ mov %rdi, %r10
+ /* Move TDVMCALL sub function id to R11 */
+ mov %rsi, %r11
+ /* Move input 1 to R12 */
+ mov %rdx, %r12
+ /* Move input 2 to R13 */
+ mov %rcx, %r13
+ /* Move input 3 to R14 */
+ mov %r8, %r14
+ /* Move input 4 to R15 */
+ mov %r9, %r15
+
+ movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+ tdcall
+
+ /* Restore output pointer to R9 */
+ pop %r9
+
+ /* Copy hypercall result registers to output struct: */
+ movq %r10, TDX_HYPERCALL_r10(%r9)
+ movq %r11, TDX_HYPERCALL_r11(%r9)
+ movq %r12, TDX_HYPERCALL_r12(%r9)
+ movq %r13, TDX_HYPERCALL_r13(%r9)
+ movq %r14, TDX_HYPERCALL_r14(%r9)
+ movq %r15, TDX_HYPERCALL_r15(%r9)
+
+ /*
+ * Zero out registers exposed to the VMM to avoid
+ * speculative execution with VMM-controlled values.
+ * This needs to include all registers present in
+ * TDVMCALL_EXPOSE_REGS_MASK (except R12-R15).
+ * R12-R15 context will be restored.
+ */
+ xor %r10d, %r10d
+ xor %r11d, %r11d
+
+ /* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ jmp hcall_done
+ret_err:
+ movq $(-EINVAL), %rax
+hcall_done:
+ FRAME_END
+
+ retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d32d9d9946d8..1cc850fd03ff 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -9,6 +9,30 @@
static bool tdx_guest_detected __ro_after_init;
+/*
+ * Wrapper for standard use of __tdx_hypercall with panic report
+ * for TDCALL error.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
+ u64 r15, struct tdx_hypercall_output *out)
+{
+ struct tdx_hypercall_output dummy_out;
+ u64 err;
+
+ /* __tdx_hypercall() does not accept NULL output pointer */
+ if (!out)
+ out = &dummy_out;
+
+ err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, fn, r12, r13, r14,
+ r15, out);
+
+ /* Non-zero return value indicates buggy TDX module, so panic */
+ if (err)
+ panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);
+
+ return out->r10;
+}
+
bool is_tdx_guest(void)
{
return tdx_guest_detected;
--
2.32.0
ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states. It is
required to prevent data loss.
While running inside TDX guest, the kernel can bypass cache flushing.
Changing sleep state in a virtual machine doesn't affect the host system
sleep state and cannot lead to data loss.
The approach can be generalized to all guest kernels, but, to be
cautious, let's limit it to TDX for now.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/acenv.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d19deca6dd27 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -13,7 +13,21 @@
/* Asm macros */
-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+/*
+ * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
+ * It is required to prevent data loss.
+ *
+ * While running inside TDX guest, the kernel can bypass cache flushing.
+ * Changing sleep state in a virtual machine doesn't affect the host system
+ * sleep state and cannot lead to data loss.
+ *
+ * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
+ */
+#define ACPI_FLUSH_CPU_CACHE() \
+do { \
+ if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) \
+ wbinvd(); \
+} while (0)
int __acpi_acquire_global_lock(unsigned int *lock);
int __acpi_release_global_lock(unsigned int *lock);
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
Unlike regular VMs, TDX guests use the firmware hand-off wakeup method
to wake up the APs during the boot process. This wakeup model uses a
mailbox to communicate with firmware to bring up the APs. As per the
design, this mailbox can only be used once for the given AP, which means
after the APs are booted, the same mailbox cannot be used to
offline/online the given AP. More details about this requirement can be
found in the Intel TDX Virtual Firmware Design Guide, in the sections
titled "AP initialization in OS" and "Hotplug Device".
Since the architecture does not support any method of offlining the
CPUs, disable CPU hotplug support in the kernel.
Since this hotplug disable feature can be re-used by other VM guests,
add a new CC attribute CC_ATTR_HOTPLUG_DISABLED and use it to disable
the hotplug support.
With hotplug disabled, /sys/devices/system/cpu/cpuX/online sysfs option
will not exist for TDX guests.
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cc_platform.c | 7 ++++++-
include/linux/cc_platform.h | 10 ++++++++++
kernel/cpu.c | 3 +++
3 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 890452a85dae..2ed8652ab042 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -16,8 +16,13 @@
static bool intel_cc_platform_has(enum cc_attr attr)
{
- if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
+ switch (attr) {
+ case CC_ATTR_GUEST_UNROLL_STRING_IO:
+ case CC_ATTR_HOTPLUG_DISABLED:
return true;
+ default:
+ return false;
+ }
return false;
}
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index f47f0c9edb3b..63b15108bc85 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -72,6 +72,16 @@ enum cc_attr {
* Examples include TDX Guest & SEV.
*/
CC_ATTR_GUEST_UNROLL_STRING_IO,
+
+ /**
+ * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+ *
+ * The platform/OS is running as a guest/virtual machine and does
+ * not support the CPU hotplug feature.
+ *
+ * Examples include TDX Guest.
+ */
+ CC_ATTR_HOTPLUG_DISABLED,
};
#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 192e43a87407..3c323aebd5b1 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -33,6 +33,7 @@
#include <linux/slab.h>
#include <linux/percpu-rwsem.h>
#include <linux/cpuset.h>
+#include <linux/cc_platform.h>
#include <trace/events/power.h>
#define CREATE_TRACE_POINTS
@@ -1178,6 +1179,8 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
{
+ if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+ return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
return _cpu_down(cpu, 0, target);
--
2.32.0
Use hypercall to emulate MSR read/write for the TDX platform.
There are two viable approaches for doing MSRs in a TD guest:
1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
do. Some will succeed, others will cause a #VE. All of those that
cause a #VE will be handled with a TDCALL.
2. Use paravirt infrastructure. The paravirt hook has to keep a list
of which MSRs would cause a #VE and use a TDCALL. All other MSRs
execute RDMSR/WRMSR instructions directly.
The second option was ruled out because the list of MSRs would be
challenging to maintain. That leaves option #1 as the only viable
solution for the minimal TDX support.
For performance-critical MSR writes (like TSC_DEADLINE), future patches
will replace the WRMSR/#VE sequence with the direct TDCALL.
RDMSR and WRMSR specification details can be found in
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, sec titled "TDG.VP.
VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 44 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6749ca3b2e3d..8be8090ca19f 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -97,6 +97,39 @@ void __cpuidle tdx_guest_idle(void)
tdx_safe_halt();
}
+static bool tdx_read_msr_safe(unsigned int msr, u64 *val)
+{
+ struct tdx_hypercall_output out;
+
+ /*
+ * Emulate the MSR read via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
+ */
+ if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
+ return false;
+
+ *val = out.r11;
+
+ return true;
+}
+
+static bool tdx_write_msr_safe(unsigned int msr, unsigned int low,
+ unsigned int high)
+{
+ u64 ret;
+
+ /*
+ * Emulate the MSR write via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
+ */
+ ret = _tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
+ 0, 0, NULL);
+
+ return ret ? false : true;
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -141,11 +174,22 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
bool ret = false;
+ u64 val;
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
ret = tdx_halt();
break;
+ case EXIT_REASON_MSR_READ:
+ ret = tdx_read_msr_safe(regs->cx, &val);
+ if (ret) {
+ regs->ax = lower_32_bits(val);
+ regs->dx = upper_32_bits(val);
+ }
+ break;
+ case EXIT_REASON_MSR_WRITE:
+ ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.32.0
In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
by the TDX module while some trigger #VE.
Implement the #VE handling for EXIT_REASON_CPUID by forwarding the
leaf/sub-leaf via the hypercall, which in turn lets the TDX module
handle it by invoking the host VMM.
More details on CPUID Virtualization can be found in the TDX module
specification [1], the section titled "CPUID Virtualization".
[1] - https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 40 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 8be8090ca19f..e1c757d1720c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -130,6 +130,31 @@ static bool tdx_write_msr_safe(unsigned int msr, unsigned int low,
return ret ? false : true;
}
+static bool tdx_handle_cpuid(struct pt_regs *regs)
+{
+ struct tdx_hypercall_output out;
+
+ /*
+ * Emulate the CPUID instruction via a hypercall. More info about
+ * ABI can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
+ */
+ if (_tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out))
+ return false;
+
+ /*
+ * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
+ * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
+ * So copy the register contents back to pt_regs.
+ */
+ regs->ax = out.r12;
+ regs->bx = out.r13;
+ regs->cx = out.r14;
+ regs->dx = out.r15;
+
+ return true;
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -166,8 +191,18 @@ bool tdx_get_ve_info(struct ve_info *ve)
*/
static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
{
- pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return false;
+ bool ret = false;
+
+ switch (ve->exit_reason) {
+ case EXIT_REASON_CPUID:
+ ret = tdx_handle_cpuid(regs);
+ break;
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ break;
+ }
+
+ return ret;
}
/* Handle the kernel #VE */
@@ -190,6 +225,9 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
case EXIT_REASON_MSR_WRITE:
ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
break;
+ case EXIT_REASON_CPUID:
+ ret = tdx_handle_cpuid(regs);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.32.0
The HLT instruction is a privileged instruction; executing it stops
instruction execution and places the processor in a HALT state. It
is used in the kernel for cases like reboot, the idle loop and exception
fixup handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing the HLT instruction will generate a #VE, which
is used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires the STI instruction
to be executed just before the TDCALL. Since the idle loop is the only
user of the safe_halt() variant, handle it as a special case.
To avoid the safe_halt() call in the default idle function, define
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support, but were rejected: HLT paravirt calls only exist
under PARAVIRT_XXL, and enabling PARAVIRT_XXL in a TDX guest just for
the safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/tdx.h | 3 ++
arch/x86/kernel/process.c | 7 ++++
arch/x86/kernel/tdcall.S | 32 ++++++++++++++++
arch/x86/kernel/tdx.c | 75 +++++++++++++++++++++++++++++++++++++-
4 files changed, 115 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8c33d7439c08..3be9d0e9f7a0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -72,10 +72,13 @@ bool tdx_get_ve_info(struct ve_info *ve);
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+void tdx_guest_idle(void);
+
#else
static inline void tdx_early_init(void) { };
static inline bool is_tdx_guest(void) { return false; }
+static inline void tdx_guest_idle(void) { };
#endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e9ee8b526319..273e4266b2c1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,7 @@
#include <asm/proto.h>
#include <asm/frame.h>
#include <asm/unwind.h>
+#include <asm/tdx.h>
#include "process.h"
@@ -864,6 +865,12 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
if (x86_idle || boot_option_idle_override == IDLE_POLL)
return;
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ x86_idle = tdx_guest_idle;
+ pr_info("using TDX aware idle routine\n");
+ return;
+ }
+
if (boot_cpu_has_bug(X86_BUG_AMD_E400)) {
pr_info("using AMD E400 aware idle routine\n");
x86_idle = amd_e400_idle;
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index ee52dde01b24..e19187048be8 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
#include <asm/asm.h>
#include <asm/frame.h>
#include <asm/unwind_hints.h>
+#include <uapi/asm/vmx.h>
#include <linux/linkage.h>
#include <linux/bits.h>
@@ -39,6 +40,13 @@
*/
#define tdcall .byte 0x66,0x0f,0x01,0xcc
+/*
+ * Used in the __tdx_hypercall() function to test R15 register content
+ * and optionally include the STI instruction before the TDCALL
+ * instruction (for EXIT_REASON_HLT case).
+ */
+#define do_sti 0x01
+
/*
* __tdx_module_call() - Used by TDX guests to request services from
* the TDX module (does not include VMM services).
@@ -231,6 +239,30 @@ SYM_FUNC_START(__tdx_hypercall)
movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+ /*
+ * For the idle loop STI needs to be called directly before
+ * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
+ * instruction enables interrupts only one instruction later.
+ * If there is a window between STI and the instruction that
+ * emulates the HALT state, there is a chance for interrupts to
+ * happen in this window, which can delay the HLT operation
+ * indefinitely. Since this is not the desired result, add
+ * support to conditionally call STI before TDCALL.
+ *
+ * Since STI instruction is only required for the idle case
+ * (a special case of EXIT_REASON_HLT), use the r15 register
+ * value to identify it. Since the R15 register is not used
+ * by the VMM as per EXIT_REASON_HLT ABI, re-use it in
+ * software to identify the STI case.
+ */
+ cmpl $EXIT_REASON_HLT, %r11d
+ jne skip_sti
+ cmpl $do_sti, %r15d
+ jne skip_sti
+ /* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
+ xor %r15, %r15
+ sti
+skip_sti:
tdcall
/* Restore output pointer to R9 */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b6d0e45e6589..6749ca3b2e3d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,7 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>
+#include <asm/vmx.h>
/* TDX Module Call Leaf IDs */
#define TDX_GET_VEINFO 3
@@ -36,6 +37,66 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}
+static __cpuidle u64 _tdx_halt(const bool irq_disabled, const bool do_sti)
+{
+ /*
+ * Emulate HLT operation via hypercall. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), sec 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+ *
+ * The VMM uses the "IRQ disabled" param to understand IRQ
+ * enabled status (RFLAGS.IF) of the TD guest and to determine
+ * whether or not it should schedule the halted vCPU if an
+ * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
+ * can keep the vCPU in virtual HLT, even if an IRQ is
+ * pending, without hanging/breaking the guest.
+ *
+ * do_sti parameter is used by the __tdx_hypercall() to decide
+ * whether to call the STI instruction before executing the
+ * TDCALL instruction.
+ */
+ return _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0,
+ do_sti, NULL);
+}
+
+static bool tdx_halt(void)
+{
+ /*
+ * Since non safe halt is mainly used in CPU offlining
+ * and the guest will always stay in the halt state, don't
+ * call the STI instruction (set do_sti as false).
+ */
+ const bool irq_disabled = irqs_disabled();
+ const bool do_sti = false;
+
+ if (_tdx_halt(irq_disabled, do_sti))
+ return false;
+
+ return true;
+}
+
+static __cpuidle void tdx_safe_halt(void)
+{
+ /*
+ * For do_sti=true case, __tdx_hypercall() function enables
+ * interrupts using the STI instruction before the TDCALL. So
+ * set irq_disabled as false.
+ */
+ const bool irq_disabled = false;
+ const bool do_sti = true;
+
+ /*
+ * Use WARN_ONCE() to report the failure.
+ */
+ if (_tdx_halt(irq_disabled, do_sti))
+ WARN_ONCE(1, "HLT instruction emulation failed\n");
+}
+
+void __cpuidle tdx_guest_idle(void)
+{
+ tdx_safe_halt();
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -79,8 +140,18 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
/* Handle the kernel #VE */
static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
- pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return false;
+ bool ret = false;
+
+ switch (ve->exit_reason) {
+ case EXIT_REASON_HLT:
+ ret = tdx_halt();
+ break;
+ default:
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ break;
+ }
+
+ return ret;
}
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
--
2.32.0
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible in TDX guests
because it requires exposing guest register and memory state to a
potentially malicious VMM.
In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled
hypercall to the host.
MMIO addresses can be used with any CPU instruction that accesses
memory. This patch, however, covers only MMIO accesses done via io.h
helpers, such as 'readl()' or 'writeq()'.
MMIO access via other means (like structure overlays) may result in
MMIO_DECODE_FAILED and an oops.
AMD SEV has the same limitations to MMIO handling.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered by this patch, a fully
paravirtualized approach would be limited to MMIO users that leverage
common infrastructure like the io.h macros.
However, any paravirtual approach would require patching approximately
120k call sites. With a conservative estimate of 5 bytes of overhead per
call site (a CALL instruction), that bloats the code by roughly 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests. Right now, that's
limited only to virtio and some x86-specific drivers.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch. Future patches will implement this
idea, removing the bulk of MMIO #VEs.
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 110 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 110 insertions(+)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e1c757d1720c..b04802b4b69e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,8 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
/* TDX Module Call Leaf IDs */
#define TDX_GET_VEINFO 3
@@ -155,6 +157,108 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
return true;
}
+static bool tdx_mmio(int size, bool write, unsigned long addr,
+ unsigned long *val)
+{
+ struct tdx_hypercall_output out;
+ u64 err;
+
+ err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+ addr, *val, &out);
+ if (err)
+ return true;
+
+ *val = out.r11;
+ return false;
+}
+
+static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+ return tdx_mmio(size, false, addr, val);
+}
+
+static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
+{
+ return tdx_mmio(size, true, addr, val);
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ char buffer[MAX_INSN_SIZE];
+ unsigned long *reg, val = 0;
+ struct insn insn = {};
+ enum mmio_type mmio;
+ int size;
+ u8 sign_byte;
+ bool err;
+
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return -EFAULT;
+
+ insn_init(&insn, buffer, MAX_INSN_SIZE, 1);
+ insn_get_length(&insn);
+
+ mmio = insn_decode_mmio(&insn, &size);
+ if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+ return -EFAULT;
+
+ if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+ reg = insn_get_modrm_reg_ptr(&insn, regs);
+ if (!reg)
+ return -EFAULT;
+ }
+
+ switch (mmio) {
+ case MMIO_WRITE:
+ memcpy(&val, reg, size);
+ err = tdx_mmio_write(size, ve->gpa, &val);
+ break;
+ case MMIO_WRITE_IMM:
+ val = insn.immediate.value;
+ err = tdx_mmio_write(size, ve->gpa, &val);
+ break;
+ case MMIO_READ:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+ /* Zero-extend for 32-bit operation */
+ if (size == 4)
+ *reg = 0;
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_READ_ZERO_EXTEND:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+
+ /* Zero extend based on operand size */
+ memset(reg, 0, insn.opnd_bytes);
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_READ_SIGN_EXTEND:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+
+ if (size == 1)
+ sign_byte = (val & 0x80) ? 0xff : 0x00;
+ else
+ sign_byte = (val & 0x8000) ? 0xff : 0x00;
+
+ /* Sign extend based on operand size */
+ memset(reg, sign_byte, insn.opnd_bytes);
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_MOVS:
+ case MMIO_DECODE_FAILED:
+ return -EFAULT;
+ }
+
+ if (err)
+ return -EFAULT;
+ return insn.length;
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -228,6 +332,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
case EXIT_REASON_CPUID:
ret = tdx_handle_cpuid(regs);
break;
+ case EXIT_REASON_EPT_VIOLATION:
+ ve->instr_len = tdx_handle_mmio(regs, ve);
+ ret = ve->instr_len > 0;
+ if (!ret)
+ pr_warn_once("MMIO failed\n");
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.32.0
From: Sean Christopherson <[email protected]>
Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the start up IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX, in which the VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like the TDX virtual BIOS) helps boot
the APs. More details about this wakeup model can be found in the ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".
Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(which hands off the AP in 64-bit mode). To handle it, extend the
trampoline code to support a 64-bit mode firmware handoff. Also, extend
the IDT and GDT pointers to support the 64-bit mode handoff.
There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.
The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.
Reported-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/apic.h | 2 ++
arch/x86/include/asm/realmode.h | 1 +
arch/x86/kernel/smpboot.c | 12 ++++++--
arch/x86/realmode/rm/header.S | 1 +
arch/x86/realmode/rm/trampoline_64.S | 38 ++++++++++++++++++++++++
arch/x86/realmode/rm/trampoline_common.S | 12 +++++++-
6 files changed, 63 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 48067af94678..35006e151774 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -328,6 +328,8 @@ struct apic {
/* wakeup_secondary_cpu */
int (*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
+ /* wakeup secondary CPU using 64-bit wakeup point */
+ int (*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);
void (*inquire_remote_apic)(int apicid);
diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..5066c8b35e7c 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
#endif
#ifdef CONFIG_X86_64
+ u32 trampoline_start64;
u32 trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ac2909f0cab3..180615a99cb5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1077,6 +1077,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
unsigned long boot_error = 0;
unsigned long timeout;
+#ifdef CONFIG_X86_64
+ /* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+ if (apic->wakeup_secondary_cpu_64)
+ start_ip = real_mode_header->trampoline_start64;
+#endif
idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
initial_code = (unsigned long)start_secondary;
@@ -1118,11 +1123,14 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
/*
* Wake up a CPU in difference cases:
- * - Use the method in the APIC driver if it's defined
+ * - Use a method from the APIC driver if one is defined, with wakeup
+ * straight to 64-bit mode preferred over wakeup to RM.
* Otherwise,
* - Use an INIT boot APIC message for APs or NMI for BSP.
*/
- if (apic->wakeup_secondary_cpu)
+ if (apic->wakeup_secondary_cpu_64)
+ boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+ else if (apic->wakeup_secondary_cpu)
boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
else
boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long pa_sev_es_trampoline_start
#endif
#ifdef CONFIG_X86_64
+ .long pa_trampoline_start64
.long pa_trampoline_pgd;
#endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index cc8391f86cdb..ae112a91592f 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
ljmpl $__KERNEL_CS, $pa_startup_64
SYM_CODE_END(startup_32)
+SYM_CODE_START(pa_trampoline_compat)
+ /*
+ * In compatibility mode. Prep ESP and DX for startup_32, then disable
+ * paging and complete the switch to legacy 32-bit mode.
+ */
+ movl $rm_stack_end, %esp
+ movw $__KERNEL_DS, %dx
+
+ movl $X86_CR0_PE, %eax
+ movl %eax, %cr0
+ ljmpl $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
.section ".text64","ax"
.code64
.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
jmpq *tr_start(%rip)
SYM_CODE_END(startup_64)
+SYM_CODE_START(trampoline_start64)
+ /*
+ * APs start here on a direct transfer from 64-bit BIOS with identity
+ * mapped page tables. Load the kernel's GDT in order to gear down to
+ * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+ * segment registers. Load the zero IDT so any fault triggers a
+ * shutdown instead of jumping back into BIOS.
+ */
+ lidt tr_idt(%rip)
+ lgdt tr_gdt64(%rip)
+
+ ljmpl *tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
.quad 0x00cf93000000ffff # __KERNEL_DS
SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
+SYM_DATA_START(tr_gdt64)
+ .short tr_gdt_end - tr_gdt - 1 # gdt limit
+ .long pa_tr_gdt
+ .long 0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+ .long pa_trampoline_compat
+ .short __KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
.bss
.balign PAGE_SIZE
SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..4331c32c47f8 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0 */
.section ".rodata","a"
.balign 16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode, an
+ * IDT with a 2-byte limit and a 4-byte base is needed. When a
+ * bootloader hands off to a kernel in 64-bit mode, the base
+ * address extends to 8 bytes. Reserve enough space for either scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+ .short 0
+ .quad 0
+SYM_DATA_END(tr_idt)
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
TDX hypervisors cannot emulate instructions directly. This includes
port I/O, which is normally emulated in the hypervisor. All port I/O
instructions inside TDX trigger the #VE exception in the guest and
are emulated there.
Use a hypercall to emulate port I/O. Extend tdx_handle_virt_exception()
and add support to handle the #VE due to port I/O instructions.
String I/O operations are not supported in TDX. Unroll them by declaring
the CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute.
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cc_platform.c | 3 +++
arch/x86/kernel/tdx.c | 48 +++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index e291e071aa63..890452a85dae 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -16,6 +16,9 @@
static bool intel_cc_platform_has(enum cc_attr attr)
{
+ if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
+ return true;
+
return false;
}
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b04802b4b69e..00bf02bc9838 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,12 @@
/* TDX Module Call Leaf IDs */
#define TDX_GET_VEINFO 3
+/* See Exit Qualification for I/O Instructions in VMX documentation */
+#define VE_IS_IO_IN(exit_qual) (((exit_qual) & 8) ? 1 : 0)
+#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
+#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
+#define VE_IS_IO_STRING(exit_qual) ((exit_qual) & 16 ? 1 : 0)
+
static bool tdx_guest_detected __ro_after_init;
/*
@@ -259,6 +265,45 @@ static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return insn.length;
}
+/*
+ * Emulate I/O using hypercall.
+ *
+ * Assumes the IO instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ *
+ * Return True on success or False on failure.
+ */
+static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+ struct tdx_hypercall_output out;
+ int size, port, ret;
+ u64 mask;
+ bool in;
+
+ if (VE_IS_IO_STRING(exit_qual))
+ return false;
+
+ in = VE_IS_IO_IN(exit_qual);
+ size = VE_GET_IO_SIZE(exit_qual);
+ port = VE_GET_PORT_NUM(exit_qual);
+ mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
+
+ /*
+ * Emulate the I/O read/write via hypercall. More info about
+ * ABI can be found in TDX Guest-Host-Communication Interface
+ * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.IO>".
+ */
+ ret = _tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, !in, port,
+ in ? 0 : regs->ax, &out);
+ if (!in)
+ return !ret;
+
+ regs->ax &= ~mask;
+ regs->ax |= ret ? UINT_MAX : out.r11 & mask;
+
+ return !ret;
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -338,6 +383,9 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
if (!ret)
pr_warn_once("MMIO failed\n");
break;
+ case EXIT_REASON_IO_INSTRUCTION:
+ ret = tdx_handle_io(regs, ve->exit_qual);
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
Port IO triggers a #VE exception in TDX guests. During normal runtime,
the kernel will handle those exceptions for any port IO.
But for the early code in the decompressor, #VE cannot be used because
the IDT needed for handling the exception is not set up.
Replace IN/OUT instructions with TDX IO hypercalls by defining the
helper macros __in/__out and by re-defining them in the decompressor
code. Also, since the TDX IO hypercall requires an IO size parameter,
allow the __in/__out macros to accept size as an input parameter.
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/misc.h | 8 ++++--
arch/x86/boot/compressed/tdcall.S | 3 +++
arch/x86/boot/compressed/tdx.h | 44 +++++++++++++++++++++++++++++++
arch/x86/include/asm/io.h | 16 ++++++++---
5 files changed, 66 insertions(+), 6 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdcall.S
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 22a2a6cc2ab4..1bfe30ebadbe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -99,6 +99,7 @@ endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..6502adf71a2f 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -19,6 +19,12 @@
/* cpu_feature_enabled() cannot be used this early */
#define USE_EARLY_PGTABLE_L5
+/*
+ * Redefine __in/__out macros via tdx.h before including
+ * linux/io.h.
+ */
+#include "tdx.h"
+
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
@@ -28,8 +34,6 @@
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
-#include "tdx.h"
-
#define BOOT_CTYPE_H
#include <linux/acpi.h>
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..aafadc136c88
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
index 18970c09512e..5d39608a2af4 100644
--- a/arch/x86/boot/compressed/tdx.h
+++ b/arch/x86/boot/compressed/tdx.h
@@ -6,8 +6,52 @@
#include <linux/types.h>
#ifdef CONFIG_INTEL_TDX_GUEST
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>
+#include <asm/tdx.h>
+
void early_tdx_detect(void);
bool early_is_tdx_guest(void);
+
+static inline unsigned int tdx_io_in(int size, int port)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 0, port, 0, &out);
+
+ return out.r10 ? UINT_MAX : out.r11;
+}
+
+static inline void tdx_io_out(int size, int port, u64 value)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 1, port, value, &out);
+}
+
+#define __out(bwl, bw, sz) \
+do { \
+ if (early_is_tdx_guest()) { \
+ tdx_io_out(sz, port, value); \
+ } else { \
+ asm volatile("out" #bwl " %" #bw "0, %w1" : : \
+ "a"(value), "Nd"(port)); \
+ } \
+} while (0)
+
+#define __in(bwl, bw, sz) \
+do { \
+ if (early_is_tdx_guest()) { \
+ value = tdx_io_in(sz, port); \
+ } else { \
+ asm volatile("in" #bwl " %w1, %" #bw "0" : \
+ "=a"(value) : "Nd"(port)); \
+ } \
+} while (0)
+
#else
static inline void early_tdx_detect(void) { };
static inline bool early_is_tdx_guest(void) { return false; }
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6d91ecb8026..f5bb8972b4b2 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -257,18 +257,26 @@ static inline void slow_down_io(void)
#endif
+#ifndef __out
+#define __out(bwl, bw, sz) \
+ asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
+#endif
+
+#ifndef __in
+#define __in(bwl, bw, sz) \
+ asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
+#endif
+
#define BUILDIO(bwl, bw, type) \
static inline void out##bwl(unsigned type value, int port) \
{ \
- asm volatile("out" #bwl " %" #bw "0, %w1" \
- : : "a"(value), "Nd"(port)); \
+ __out(bwl, bw, sizeof(type)); \
} \
\
static inline unsigned type in##bwl(int port) \
{ \
unsigned type value; \
- asm volatile("in" #bwl " %w1, %" #bw "0" \
- : "=a"(value) : "Nd"(port)); \
+ __in(bwl, bw, sizeof(type)); \
return value; \
} \
\
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
Document the TDX guest architecture details like #VE support,
shared memory, etc.
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/x86/index.rst | 1 +
Documentation/x86/tdx.rst | 194 ++++++++++++++++++++++++++++++++++++
2 files changed, 195 insertions(+)
create mode 100644 Documentation/x86/tdx.rst
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index f498f1d36cd3..382e53ca850a 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -24,6 +24,7 @@ x86-specific Documentation
intel-iommu
intel_txt
amd-memory-encryption
+ tdx
pti
mds
microcode
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
new file mode 100644
index 000000000000..8c9cf1a5bfb8
--- /dev/null
+++ b/Documentation/x86/tdx.rst
@@ -0,0 +1,194 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs
+from the host and physical attacks by isolating the guest register
+state and by encrypting the guest memory. In TDX, a special TDX module
+sits between the host and the guest, and runs in a special mode and
+manages the guest/host separation.
+
+Since the host cannot directly access guest registers or memory, much
+normal functionality of a hypervisor (such as trapping MMIO, some MSRs,
+some CPUIDs, and some other instructions) has to be moved into the
+guest. This is implemented using a Virtualization Exception (#VE) that
+is handled by the guest kernel. Some #VEs are handled inside the guest
+kernel, but some require the hypervisor (VMM) to be involved. The TD
+hypercall mechanism allows TD guests to call into the TDX module or
+the hypervisor.
+
+#VE Exceptions:
+===============
+
+#VE exceptions are delivered to TDX guests in the following
+scenarios:
+
+* Execution of certain instructions (see list below)
+* Certain MSR accesses.
+* CPUID usage (only for certain leaves)
+* Shared memory access (including MMIO)
+
+#VE due to instruction execution
+---------------------------------
+
+Intel TDX disallows execution of certain instructions in non-root
+mode. Executing these instructions leads to a #VE or a #GP, as
+detailed below.
+
+Instructions that cause a #VE:
+
+* String I/O (INS, OUTS), IN, OUT
+* HLT
+* MONITOR, MWAIT
+* WBINVD, INVD
+* VMCALL
+
+Instructions that cause a #GP:
+
+* All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+ VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+* ENCLS, ENCLV
+* GETSEC
+* RSM
+* ENQCMD
+
+#VE due to MSR access
+----------------------
+
+In TDX guests, MSR accesses fall into one of the following categories:
+
+* Natively supported (also called "context switched MSRs")
+  No special handling is required for these MSRs in TDX guests.
+* #GP triggered
+  Reads and writes of disallowed MSRs lead to a #GP.
+* #VE triggered
+  All MSRs that are neither natively supported nor disallowed
+  (#GP-triggering) trigger a #VE. Access to these MSRs must be
+  emulated using TDCALL.
+
+See the Intel TDX Module Specification, section "MSR Virtualization", for the complete
+list of MSRs that fall under the categories above.
+
+#VE due to CPUID instruction
+----------------------------
+
+In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized by
+the TDX module while some trigger #VE. The CPUID leaf/sub-leaf
+combinations that trigger #VE are configured by the VMM at TD
+initialization time (using TDH.MNG.INIT).
+
+#VE on Memory Accesses
+----------------------
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared. It selects the behavior with a bit in its page table
+entries.
+
+#VE on Shared Pages
+-------------------
+
+Access to shared mappings can cause a #VE. The hypervisor controls whether
+an access to a shared mapping causes a #VE, so the guest must be careful to
+touch shared pages only in contexts where it can safely handle a #VE, to
+avoid nested #VEs.
+
+The content of shared mappings is not trusted, since shared memory is
+writable by the hypervisor. Shared mappings are never used for sensitive
+memory content like stacks or kernel text, only for I/O buffers and MMIO
+regions. The kernel will not encounter shared mappings in sensitive
+contexts like syscall entry or NMIs.
+
+#VE on Private Pages
+--------------------
+
+Some accesses to private mappings may cause #VEs. Before a mapping is
+accepted (i.e. while it is in the SEPT_PENDING state), a reference causes
+a #VE. After acceptance, references typically succeed.
+
+The hypervisor can cause a private page reference to fail if it chooses
+to move an accepted page to a "blocked" state. However, if it does
+this, page access will not generate a #VE. It will, instead, cause a
+"TD Exit" where the hypervisor is required to handle the exception.
+
+Linux #VE handler
+-----------------
+
+Both user and kernel #VE exceptions are handled by the
+tdx_handle_virt_exception() handler. If the exception is handled
+successfully, the instruction pointer is incremented to complete the
+handling process. Otherwise, it is treated as a regular exception and
+handled via the fixup handlers.
+
+In TD guests, the #VE nesting problem (a #VE triggered before the current
+one has been handled, also known as the syscall gap issue) is handled by
+the TDX module ensuring that interrupts, including NMIs, are blocked. The
+hardware blocks interrupts from #VE delivery until TDGETVEINFO is called.
+
+The kernel must avoid triggering a #VE in entry paths: do not touch
+TD-shared memory, including MMIO regions, and do not use MSRs,
+instructions, or CPUID leaves that might generate a #VE.
+
+MMIO handling:
+==============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
+mapping which will cause a VMEXIT on access, and then the VMM emulates the
+access. That's not possible in TDX guests because VMEXIT will expose the
+register state to the host. TDX guests don't trust the host and can't have
+their state exposed to the host.
+
+In TDX the MMIO regions are instead configured to trigger a #VE
+exception in the guest. The guest #VE handler then emulates the MMIO
+instructions inside the guest and converts them into a controlled TDCALL
+to the host, rather than completely exposing the state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can be
+accessed with any instruction that accesses memory. However, the
+introduced instruction decoding method is limited. It is only designed
+to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in
+MMIO_DECODE_FAILED and an oops.
+
+Shared memory:
+==============
+
+Intel TDX doesn't allow the VMM to access guest private memory. Any
+memory that is required for communication with the VMM must be shared
+explicitly by setting the shared bit in the page table entry. The
+position of the shared bit can be enumerated with TDX_GET_INFO.
+
+After setting the shared bit, the conversion must be completed with the
+MapGPA hypercall. The call informs the VMM about the conversion between
+private and shared mappings.
+
+set_memory_decrypted() converts a range of pages to shared.
+set_memory_encrypted() converts memory back to private.
+
+Device drivers are the primary users of shared memory, but there is no
+need to touch every driver. DMA buffers and ioremap()'ed regions are
+converted to shared automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocation, the DMA buffer gets converted on the
+allocation. Check force_dma_unencrypted() for details.
+
+References
+==========
+
+More details about the TDX module (and its handling of MSR, memory
+access, IO, CPUID, etc.) can be found at:
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
+
+More details about the TDX hypercall and TDX module call ABI can be
+found at:
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
+
+More details about TDVF requirements can be found at:
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
Confidential Computing (CC) features (like string I/O unroll support,
memory encryption/decryption support, etc) are conditionally enabled
in the kernel using cc_platform_has() API. Since TDX guests also need
to use these CC features, extend cc_platform_has() API and add TDX
guest-specific CC attributes support.
Use the is_tdx_guest() API to detect TDX guest status and return
TDX-specific CC attributes. To enable the use of CC APIs in TDX guests,
select ARCH_HAS_CC_PLATFORM in the CONFIG_INTEL_TDX_GUEST case.
This is a preparatory patch and just creates the framework for adding
TDX guest specific CC attributes.
Since the is_tdx_guest() function (via the cc_platform_has() API) is used
in early boot code, disable the instrumentation flags and the function
tracer. This follows the precedent set by AMD SEV and cc_platform.c.
Since intel_cc_platform_has() function only gets triggered when
is_tdx_guest() is true (valid CONFIG_INTEL_TDX_GUEST case), remove the
redundant #ifdef in intel_cc_platform_has().
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/Makefile | 3 +++
arch/x86/kernel/cc_platform.c | 9 ++++-----
3 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a61ac6f8821a..8e781f166030 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -877,6 +877,7 @@ config INTEL_TDX_GUEST
bool "Intel TDX (Trust Domain Extensions) - Guest Support"
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
+ select ARCH_HAS_CC_PLATFORM
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 64f9babcfd95..8c9a9214dd34 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -22,6 +22,7 @@ CFLAGS_REMOVE_early_printk.o = -pg
CFLAGS_REMOVE_head64.o = -pg
CFLAGS_REMOVE_sev.o = -pg
CFLAGS_REMOVE_cc_platform.o = -pg
+CFLAGS_REMOVE_tdx.o = -pg
endif
KASAN_SANITIZE_head$(BITS).o := n
@@ -31,6 +32,7 @@ KASAN_SANITIZE_stacktrace.o := n
KASAN_SANITIZE_paravirt.o := n
KASAN_SANITIZE_sev.o := n
KASAN_SANITIZE_cc_platform.o := n
+KASAN_SANITIZE_tdx.o := n
# With some compiler versions the generated code results in boot hangs, caused
# by several compilation units. To be safe, disable all instrumentation.
@@ -50,6 +52,7 @@ KCOV_INSTRUMENT := n
CFLAGS_head$(BITS).o += -fno-stack-protector
CFLAGS_cc_platform.o += -fno-stack-protector
+CFLAGS_tdx.o += -fno-stack-protector
CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index cc1ffe710dd2..e291e071aa63 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -12,14 +12,11 @@
#include <linux/mem_encrypt.h>
#include <asm/processor.h>
+#include <asm/tdx.h>
-static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
+static bool intel_cc_platform_has(enum cc_attr attr)
{
-#ifdef CONFIG_INTEL_TDX_GUEST
return false;
-#else
- return false;
-#endif
}
/*
@@ -67,6 +64,8 @@ bool cc_platform_has(enum cc_attr attr)
{
if (sme_me_mask)
return amd_cc_platform_has(attr);
+ else if (is_tdx_guest())
+ return intel_cc_platform_has(attr);
return false;
}
--
2.32.0
From: Sean Christopherson <[email protected]>
There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.
The conditions to avoid are:
* Any writes to the EFER MSR
* Clearing CR0.NE
* Clearing CR4.MCE
This theoretically makes guest boot more fragile. If, for instance, EFER
was set up incorrectly and a WRMSR was performed, it would trigger an
early exception panic, or a triple fault if it happened before the early
exception handlers were set up. However, this is likely to trip up the
guest BIOS long before control reaches the kernel. In any case, these
kinds of problems are unlikely to occur in production environments, and
developers have good debug tools to fix them quickly.
Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/head_64.S | 25 +++++++++++++++++++++----
arch/x86/boot/compressed/pgtable.h | 2 +-
arch/x86/kernel/head_64.S | 24 ++++++++++++++++++++++--
arch/x86/realmode/rm/trampoline_64.S | 27 +++++++++++++++++++++++----
5 files changed, 68 insertions(+), 11 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8e781f166030..fc4771a07fc0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -878,6 +878,7 @@ config INTEL_TDX_GUEST
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
+ select X86_MCE
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 572c535cf45b..dbbfafba553f 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -643,12 +643,25 @@ SYM_CODE_START(trampoline_32bit_src)
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+ jc 1f
wrmsr
- popl %edx
+1: popl %edx
popl %ecx
/* Enable PAE and LA57 (if required) paging modes */
- movl $X86_CR4_PAE, %eax
+ movl %cr4, %eax
+
+#ifdef CONFIG_X86_MCE
+ /*
+ * Preserve CR4.MCE if the kernel will enable #MC support. Clearing
+ * MCE may fault in some environments (that also force #MC support).
+ * Any machine check that occurs before #MC support is fully configured
+ * will crash the system regardless of the CR4.MCE value set here.
+ */
+ andl $X86_CR4_MCE, %eax
+#endif
+ orl $X86_CR4_PAE, %eax
testl %edx, %edx
jz 1f
orl $X86_CR4_LA57, %eax
@@ -662,8 +675,12 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax
- /* Enable paging again */
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ /*
+ * Enable paging again. Keep CR0.NE set, FERR# is no longer used
+ * to handle x87 FPU errors and clearing NE may fault in some
+ * environments.
+ */
+ movl $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0
lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d8b3ebd2bb85..f503e97945d3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,17 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:
/* Enable PAE mode, PGE and LA57 */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
+ movq %cr4, %rcx
+#ifdef CONFIG_X86_MCE
+ /*
+ * Preserve CR4.MCE if the kernel will enable #MC support. Clearing
+ * MCE may fault in some environments (that also force #MC support).
+ * Any machine check that occurs before #MC support is fully configured
+ * will crash the system regardless of the CR4.MCE value set here.
+ */
+ andl $X86_CR4_MCE, %ecx
+#endif
+ orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
testl $1, __pgtable_l5_enabled(%rip)
jz 1f
@@ -229,13 +239,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
+ /*
+ * Preserve current value of EFER for comparison and to skip
+ * EFER writes if no change was made (for TDX guest)
+ */
+ movl %eax, %edx
btsl $_EFER_SCE, %eax /* Enable System Call */
btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr /* Make changes effective */
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
/* Setup cr0 */
movl $CR0_STATE, %eax
/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index ae112a91592f..170f248d5769 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,28 @@ SYM_CODE_START(startup_32)
movl %eax, %cr3
# Set up EFER
+ movl $MSR_EFER, %ecx
+ rdmsr
+ /*
+ * Skip writing to EFER if the register already has desired
+ * value (to avoid #VE for the TDX guest).
+ */
+ cmp pa_tr_efer, %eax
+ jne .Lwrite_efer
+ cmp pa_tr_efer + 4, %edx
+ je .Ldone_efer
+.Lwrite_efer:
movl pa_tr_efer, %eax
movl pa_tr_efer + 4, %edx
- movl $MSR_EFER, %ecx
wrmsr
- # Enable paging and in turn activate Long Mode
- movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+ /*
+ * Enable paging and in turn activate Long Mode. Keep CR0.NE set, FERR#
+ * is no longer used to handle x87 FPU errors and clearing NE may fault
+ * in some environments.
+ */
+ movl $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0
/*
@@ -169,7 +184,11 @@ SYM_CODE_START(pa_trampoline_compat)
movl $rm_stack_end, %esp
movw $__KERNEL_DS, %dx
- movl $X86_CR0_PE, %eax
+ /*
+ * Keep CR0.NE set, FERR# is no longer used to handle x87 FPU errors
+ * and clearing NE may fault in some environments.
+ */
+ movl $(X86_CR0_NE | X86_CR0_PE), %eax
movl %eax, %cr0
ljmpl $__KERNEL32_CS, $pa_startup_32
SYM_CODE_END(pa_trampoline_compat)
--
2.32.0
WBINVD causes a #VE in TDX guests. There's no reliable way to emulate it:
the kernel can ask the VMM for assistance, but the VMM is untrusted and
can ignore the request.
Fortunately, there is no use case for WBINVD inside TDX guests.
Warn about any unexpected WBINVD.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 2175336d1a2a..4da69bc760e1 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -521,6 +521,10 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
case EXIT_REASON_IO_INSTRUCTION:
ret = tdx_handle_io(regs, ve->exit_qual);
break;
+ case EXIT_REASON_WBINVD:
+ WARN_ONCE(1, "Unexpected WBINVD\n");
+ ret = true;
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.32.0
From: Kuppuswamy Sathyanarayanan <[email protected]>
TDX cannot use the INIT/SIPI protocol to bring up secondary CPUs because
it requires assistance from the untrusted VMM.
For platforms that do not support SIPI/INIT, ACPI defines a wakeup
model (using mailbox) via MADT multiprocessor wakeup structure. More
details about it can be found in ACPI specification v6.4, the section
titled "Multiprocessor Wakeup Structure". If the platform firmware
produces the multiprocessor wakeup structure, then the OS may use this
new mailbox-based mechanism to wake up the APs.
Add ACPI MADT wake structure parsing support for the x86 platform and,
if the MADT wake table is present, update apic->wakeup_secondary_cpu_64
with a new API that uses the MADT wake mailbox to wake up secondary CPUs.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/apic.h | 5 ++
arch/x86/kernel/acpi/boot.c | 114 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/apic/apic.c | 10 ++++
3 files changed, 129 insertions(+)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 35006e151774..bd8ae0a7010a 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -490,6 +490,11 @@ static inline unsigned int read_apic_id(void)
return apic->get_apic_id(reg);
}
+#ifdef CONFIG_X86_64
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+#endif
+
extern int default_apic_id_valid(u32 apicid);
extern int default_acpi_madt_oem_check(char *, char *);
extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 5b6d1a95776f..af204a217575 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,15 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
static bool acpi_support_online_capable;
#endif
+#ifdef CONFIG_X86_64
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+/* Lock to protect mailbox (acpi_mp_wake_mailbox) from parallel access */
+static DEFINE_SPINLOCK(mailbox_lock);
+#endif
+
#ifdef CONFIG_X86_IO_APIC
/*
* Locks related to IOAPIC hotplug
@@ -336,6 +345,80 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
return 0;
}
+#ifdef CONFIG_X86_64
+/* Wake up a secondary CPU via the ACPI MADT mailbox wakeup mechanism */
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+ static physid_mask_t apic_id_wakemap = PHYSID_MASK_NONE;
+ unsigned long flags;
+ u8 timeout;
+
+ /* Remap mailbox memory only for the first call to acpi_wakeup_cpu() */
+ if (physids_empty(apic_id_wakemap)) {
+ acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+ sizeof(*acpi_mp_wake_mailbox),
+ MEMREMAP_WB);
+ }
+
+ /*
+ * According to the ACPI specification r6.4, section titled
+ * "Multiprocessor Wakeup Structure" the mailbox-based wakeup
+ * mechanism cannot be used more than once for the same CPU.
+ * Skip wakeups if they are attempted more than once.
+ */
+ if (physid_isset(apicid, apic_id_wakemap)) {
+ pr_err("CPU already awake (APIC ID %x), skipping wakeup\n",
+ apicid);
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&mailbox_lock, flags);
+
+ /*
+ * Mailbox memory is shared between firmware and OS. Firmware will
+ * listen on mailbox command address, and once it receives the wakeup
+ * command, CPU associated with the given apicid will be booted.
+ *
+ * The values of apic_id and wakeup_vector have to be set before
+ * updating the wakeup command. To make the compiler preserve the order
+ * of writes, use smp_store_release().
+ */
+ smp_store_release(&acpi_mp_wake_mailbox->apic_id, apicid);
+ smp_store_release(&acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+ /*
+ * After writing the wakeup command, wait for a maximum timeout of 0xFF
+ * loops for the firmware to reset the command field back to zero,
+ * indicating successful reception of the command.
+ * NOTE: 0xFF as timeout value is decided based on our experiments.
+ *
+ * XXX: Change the timeout once ACPI specification comes up with
+ * standard maximum timeout value.
+ */
+ timeout = 0xFF;
+ while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+ cpu_relax();
+
+ /* If timed out (timeout == 0), return error */
+ if (!timeout) {
+ spin_unlock_irqrestore(&mailbox_lock, flags);
+ return -EIO;
+ }
+
+ /*
+ * If the CPU wakeup process is successful, store the
+ * status in apic_id_wakemap to prevent re-wakeup
+ * requests.
+ */
+ physid_set(apicid, apic_id_wakemap);
+
+ spin_unlock_irqrestore(&mailbox_lock, flags);
+
+ return 0;
+}
+#endif
#endif /*CONFIG_X86_LOCAL_APIC */
#ifdef CONFIG_X86_IO_APIC
@@ -1083,6 +1166,29 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
}
+
+#ifdef CONFIG_X86_64
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_madt_multiproc_wakeup *mp_wake;
+
+ if (!IS_ENABLED(CONFIG_SMP))
+ return -ENODEV;
+
+ mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+ if (BAD_MADT_ENTRY(mp_wake, end))
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+ acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+ return 0;
+}
+#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */
#ifdef CONFIG_X86_IO_APIC
@@ -1278,6 +1384,14 @@ static void __init acpi_process_madt(void)
smp_found_config = 1;
}
+
+#ifdef CONFIG_X86_64
+ /*
+ * Parse MADT MP Wake entry.
+ */
+ acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+ acpi_parse_mp_wake, 1);
+#endif
}
if (error == -EINVAL) {
/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b70344bf6600..3c8f2c797a98 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2551,6 +2551,16 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
}
EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);
+#ifdef CONFIG_X86_64
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+ struct apic **drv;
+
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+ (*drv)->wakeup_secondary_cpu_64 = handler;
+}
+#endif
+
/*
* Override the generic EOI implementation with an optimized version.
* Only called during early boot when only one CPU is active and with
--
2.32.0
Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.
It is a two-step process: the guest sets the shared bit in the page
table entry and then notifies the VMM about the change. The notification
happens using the MapGPA hypercall.
Conversion back to private memory requires clearing the shared bit,
notifying the VMM with the MapGPA hypercall, and then accepting the
memory with the AcceptPage TDX module call.
Provide a helper to do conversion between shared and private memory.
It is going to be used by the following patch.
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/tdx.h | 18 +++++++++
arch/x86/kernel/tdx.c | 79 ++++++++++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 286adda40fb7..20114af47db9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -55,6 +55,15 @@ struct ve_info {
u32 instr_info;
};
+/*
+ * Page mapping types. This is software construct not part of any hardware
+ * or VMM ABI.
+ */
+enum tdx_map_type {
+ TDX_MAP_PRIVATE,
+ TDX_MAP_SHARED,
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
@@ -78,6 +87,9 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
phys_addr_t tdx_shared_mask(void);
+int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end,
+ enum tdx_map_type map_type);
+
#else
static inline void tdx_early_init(void) { };
@@ -87,6 +99,12 @@ static inline void tdx_guest_idle(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
static inline phys_addr_t tdx_shared_mask(void) { return 0; }
+static inline int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end,
+ enum tdx_map_type map_type)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_INTEL_TDX_GUEST */
#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d4aae2f139a8..9ef3cf0879d3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,10 @@
/* TDX Module Call Leaf IDs */
#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3
+#define TDX_ACCEPT_PAGE 6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001
/* See Exit Qualification for I/O Instructions in VMX documentation */
#define VE_IS_IO_IN(exit_qual) (((exit_qual) & 8) ? 1 : 0)
@@ -83,6 +87,81 @@ static void tdx_get_info(void)
td_info.attributes = out.rdx;
}
+static bool tdx_accept_page(phys_addr_t gpa, enum pg_level pg_level)
+{
+ /*
+ * Pass the page physical address to the TDX module to accept the
+ * pending, private page.
+ *
+	 * Bits 2:0 of the GPA encode the page size: 0 - 4K, 1 - 2M, 2 - 1G.
+ */
+ switch (pg_level) {
+ case PG_LEVEL_4K:
+ break;
+ case PG_LEVEL_2M:
+ gpa |= 1;
+ break;
+ case PG_LEVEL_1G:
+ gpa |= 2;
+ break;
+ default:
+ return true;
+ }
+
+ return __tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest. The VMM is expected to change its mapping
+ * of the page in response.
+ */
+int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end,
+ enum tdx_map_type map_type)
+{
+ u64 ret;
+
+ if (end <= start)
+ return -EINVAL;
+
+ if (map_type == TDX_MAP_SHARED) {
+ start |= tdx_shared_mask();
+ end |= tdx_shared_mask();
+ }
+
+ /*
+	 * Notify the VMM about the page mapping conversion. More info
+	 * about the ABI can be found in the TDX Guest-Host-Communication
+	 * Interface (GHCI) spec, section "TDG.VP.VMCALL<MapGPA>".
+ */
+ ret = _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0, NULL);
+
+ if (ret)
+ ret = -EIO;
+
+ if (ret || map_type == TDX_MAP_SHARED)
+ return ret;
+
+ /*
+	 * For shared->private conversion, accept the page using the
+	 * TDX_ACCEPT_PAGE TDX module call.
+ */
+ while (start < end) {
+ /* Try 2M page accept first if possible */
+ if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
+ !tdx_accept_page(start, PG_LEVEL_2M)) {
+ start += PMD_SIZE;
+ continue;
+ }
+
+ if (tdx_accept_page(start, PG_LEVEL_4K))
+ return -EIO;
+ start += PAGE_SIZE;
+ }
+
+ return 0;
+}
+
static __cpuidle u64 _tdx_halt(const bool irq_disabled, const bool do_sti)
{
/*
--
2.32.0
In TDX guests, guest memory is protected from host access. If a guest
performs I/O, it needs to explicitly share the I/O memory with the host.
Make all ioremap()ed pages that are not backed by normal memory
(IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.
Since TDX memory encryption support is similar to AMD SEV architecture,
reuse the infrastructure from AMD SEV code. Introduce CC_ATTR_GUEST_TDX
to add TDX-specific changes to the AMD SEV/SME memory encryption code.
Add tdx_shared_mask() interface to get the TDX guest shared bitmask.
pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio). Export
both pgprot_encrypted() and pgprot_decrypted().
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pgtable.h | 19 +++++++++++++------
arch/x86/include/asm/tdx.h | 3 +++
arch/x86/kernel/cc_platform.c | 1 +
arch/x86/kernel/tdx.c | 9 +++++++++
arch/x86/mm/ioremap.c | 5 +++++
arch/x86/mm/mem_encrypt.c | 27 +++++++++++++++++++++++++++
include/linux/cc_platform.h | 9 +++++++++
8 files changed, 68 insertions(+), 7 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fe0382f20445..f5df53b6c80d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,7 +879,7 @@ config INTEL_TDX_GUEST
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
select X86_MCE
- select DYNAMIC_PHYSICAL_MASK
+ select X86_MEM_ENCRYPT
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..019cb2f97c20 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -15,12 +15,6 @@
cachemode2protval(_PAGE_CACHE_MODE_UC_MINUS))) \
: (prot))
-/*
- * Macros to add or remove encryption attribute
- */
-#define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
-#define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))
-
#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
#include <asm/pkru.h>
@@ -36,6 +30,19 @@ void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);
+/*
+ * Macros to add or remove encryption attribute
+ */
+#ifdef CONFIG_X86_MEM_ENCRYPT
+pgprot_t pgprot_cc_encrypted(pgprot_t prot);
+pgprot_t pgprot_cc_decrypted(pgprot_t prot);
+#define pgprot_encrypted(prot) pgprot_cc_encrypted(prot)
+#define pgprot_decrypted(prot) pgprot_cc_decrypted(prot)
+#else
+#define pgprot_encrypted(prot) (prot)
+#define pgprot_decrypted(prot) (prot)
+#endif
+
#ifdef CONFIG_DEBUG_WX
#define debug_checkwx() ptdump_walk_pgd_level_checkwx()
#define debug_checkwx_user() ptdump_walk_user_pgd_level_checkwx()
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d2ffc9a6ba53..286adda40fb7 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -76,6 +76,8 @@ void tdx_guest_idle(void);
bool tdx_early_handle_ve(struct pt_regs *regs);
+phys_addr_t tdx_shared_mask(void);
+
#else
static inline void tdx_early_init(void) { };
@@ -83,6 +85,7 @@ static inline bool is_tdx_guest(void) { return false; }
static inline void tdx_guest_idle(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+static inline phys_addr_t tdx_shared_mask(void) { return 0; }
#endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 2ed8652ab042..a0fc329edc35 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -19,6 +19,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
case CC_ATTR_HOTPLUG_DISABLED:
+ case CC_ATTR_GUEST_TDX:
return true;
default:
return false;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index cbfacc2af8bb..d4aae2f139a8 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -52,6 +52,15 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}
+/*
+ * The highest bit of a guest physical address is the "sharing" bit.
+ * Set it for shared pages and clear it for private pages.
+ */
+phys_addr_t tdx_shared_mask(void)
+{
+ return BIT_ULL(td_info.gpa_width - 1);
+}
+
static void tdx_get_info(void)
{
struct tdx_module_output out;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 026031b3b782..a5d4ec1afca2 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -242,10 +242,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
* If the page being mapped is in memory and SEV is active then
* make sure the memory encryption attribute is enabled in the
* resulting mapping.
+	 * In TDX guests, memory is marked private by default. If encryption
+	 * is not requested (via the 'encrypted' argument), explicitly set
+	 * the decrypted attribute on all ioremap()ed memory.
*/
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+ else
+ prot = pgprot_decrypted(prot);
switch (pcm) {
case _PAGE_CACHE_MODE_UC:
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 50d209939c66..8b9de7e478c6 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -14,6 +14,33 @@
#include <linux/mem_encrypt.h>
#include <linux/virtio_config.h>
+#include <asm/tdx.h>
+
+/*
+ * Set or unset encryption attribute in vendor agnostic way.
+ */
+pgprot_t pgprot_cc_encrypted(pgprot_t prot)
+{
+ if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ return __pgprot(__sme_set(pgprot_val(prot)));
+ else if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
+
+ return prot;
+}
+EXPORT_SYMBOL_GPL(pgprot_cc_encrypted);
+
+pgprot_t pgprot_cc_decrypted(pgprot_t prot)
+{
+ if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ return __pgprot(__sme_clr(pgprot_val(prot)));
+ else if (cc_platform_has(CC_ATTR_GUEST_TDX))
+ return __pgprot(pgprot_val(prot) | tdx_shared_mask());
+
+ return prot;
+}
+EXPORT_SYMBOL_GPL(pgprot_cc_decrypted);
+
/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
{
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 63b15108bc85..5fed077cc5f4 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -82,6 +82,15 @@ enum cc_attr {
* Examples include TDX Guest.
*/
CC_ATTR_HOTPLUG_DISABLED,
+
+ /**
+ * @CC_ATTR_GUEST_TDX: Trust Domain Extension Support
+ *
+ * The platform/OS is running as a TDX guest/virtual machine.
+ *
+ * Examples include Intel TDX.
+ */
+ CC_ATTR_GUEST_TDX = 0x100,
};
#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
--
2.32.0
On Tue, Dec 14, 2021 at 06:02:39PM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> cc_platform_has() API is used in the kernel to enable confidential
> computing features. Since TDX guest is a confidential computing
> platform, it also needs to use this API.
>
> In preparation of extending cc_platform_has() API to support TDX guest,
> use CPUID instruction to detect for TDX guests support in the early
" ... to detect support for TDX guests... "
> boot code (via tdx_early_init()). Since copy_bootdata() is the first
> user of cc_platform_has() API, detect the TDX guest status before it.
...
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 793e9b42ace0..a61ac6f8821a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -872,6 +872,19 @@ config ACRN_GUEST
> IOT with small footprint and real-time features. More details can be
> found in https://projectacrn.org/.
>
> +# TDX guest uses X2APIC for interrupt management.
For whom is that comment and who's going to see it? Is that comment
supposed to explain the "depends on X86_X2APIC" below?
> +config INTEL_TDX_GUEST
> + bool "Intel TDX (Trust Domain Extensions) - Guest Support"
> + depends on X86_64 && CPU_SUP_INTEL
> + depends on X86_X2APIC
> + help
> + Support running as a guest under Intel TDX. Without this support,
> + the guest kernel can not boot or run under TDX.
> + TDX includes memory encryption and integrity capabilities
> + which protect the confidentiality and integrity of guest
> + memory contents and CPU state. TDX guests are protected from
> + potential attacks from the VMM.
> +
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 07:18:14PM +0100, Borislav Petkov wrote:
> > In preparation of extending cc_platform_has() API to support TDX guest,
> > use CPUID instruction to detect for TDX guests support in the early
>
> " ... to detect support for TDX guests... "
Right, thanks.
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 793e9b42ace0..a61ac6f8821a 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -872,6 +872,19 @@ config ACRN_GUEST
> > IOT with small footprint and real-time features. More details can be
> > found in https://projectacrn.org/.
> >
> > +# TDX guest uses X2APIC for interrupt management.
>
> For whom is that comment and who's going to see it? Is that comment
> supposed to explain the "depends on X86_X2APIC" below?
Yes.
But I think it should be pretty self-explanatory. I'll drop it.
> > +config INTEL_TDX_GUEST
> > + bool "Intel TDX (Trust Domain Extensions) - Guest Support"
> > + depends on X86_64 && CPU_SUP_INTEL
> > + depends on X86_X2APIC
> > + help
> > + Support running as a guest under Intel TDX. Without this support,
> > + the guest kernel can not boot or run under TDX.
> > + TDX includes memory encryption and integrity capabilities
> > + which protect the confidentiality and integrity of guest
> > + memory contents and CPU state. TDX guests are protected from
> > + potential attacks from the VMM.
> > +
--
Kirill A. Shutemov
On Tue, Dec 14, 2021 at 11:21:06PM +0300, Kirill A. Shutemov wrote:
> But I think it should be pretty self-explanatory. I'll drop it.
Right, otherwise we'd have to document all the depends in Kconfig which
would be a pointless exercise in useless work.
:-)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:40PM +0300, Kirill A. Shutemov wrote:
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -22,6 +22,7 @@ CFLAGS_REMOVE_early_printk.o = -pg
> CFLAGS_REMOVE_head64.o = -pg
> CFLAGS_REMOVE_sev.o = -pg
> CFLAGS_REMOVE_cc_platform.o = -pg
> +CFLAGS_REMOVE_tdx.o = -pg
> endif
>
> KASAN_SANITIZE_head$(BITS).o := n
> @@ -31,6 +32,7 @@ KASAN_SANITIZE_stacktrace.o := n
> KASAN_SANITIZE_paravirt.o := n
> KASAN_SANITIZE_sev.o := n
> KASAN_SANITIZE_cc_platform.o := n
> +KASAN_SANITIZE_tdx.o := n
>
> # With some compiler versions the generated code results in boot hangs, caused
> # by several compilation units. To be safe, disable all instrumentation.
> @@ -50,6 +52,7 @@ KCOV_INSTRUMENT := n
>
> CFLAGS_head$(BITS).o += -fno-stack-protector
> CFLAGS_cc_platform.o += -fno-stack-protector
> +CFLAGS_tdx.o += -fno-stack-protector
>
> CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
Don't these Makefile changes belong in patch 1, which adds tdx.c?
--
Josh
On Tue, Dec 14, 2021 at 06:02:46PM +0300, Kirill A. Shutemov wrote:
> @@ -155,6 +157,108 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
> return true;
> }
>
> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> + unsigned long *val)
> +{
> + struct tdx_hypercall_output out;
> + u64 err;
> +
> + err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> + addr, *val, &out);
> + if (err)
> + return true;
> +
> + *val = out.r11;
> + return false;
> +}
> +
> +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> + return tdx_mmio(size, false, addr, val);
> +}
> +
> +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> +{
> + return tdx_mmio(size, true, addr, val);
> +}
These bool functions return false on success. Conversely, other
functions in this file return true on success. That inconsistency is
really confusing for the callers and is bound to introduce bugs
eventually.
> +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
Similarly, tdx_handle_mmio() returns (int) 0 for success, while other
tdx_handle_*() functions return (bool) true for success. Also
confusing.
The most robust option would be for all the functions to follow the
typical kernel convention of returning (int) 0 on success. It works for
99.99% of the kernel. Why mess with success? (pun intended)
Otherwise it's just pointless added cognitive overhead, trying to keep
track of what success means, for each individual function.
--
Josh
On Wed, Dec 15, 2021 at 03:19:04PM -0800, Josh Poimboeuf wrote:
> On Tue, Dec 14, 2021 at 06:02:40PM +0300, Kirill A. Shutemov wrote:
> > --- a/arch/x86/kernel/Makefile
> > +++ b/arch/x86/kernel/Makefile
> > @@ -22,6 +22,7 @@ CFLAGS_REMOVE_early_printk.o = -pg
> > CFLAGS_REMOVE_head64.o = -pg
> > CFLAGS_REMOVE_sev.o = -pg
> > CFLAGS_REMOVE_cc_platform.o = -pg
> > +CFLAGS_REMOVE_tdx.o = -pg
> > endif
> >
> > KASAN_SANITIZE_head$(BITS).o := n
> > @@ -31,6 +32,7 @@ KASAN_SANITIZE_stacktrace.o := n
> > KASAN_SANITIZE_paravirt.o := n
> > KASAN_SANITIZE_sev.o := n
> > KASAN_SANITIZE_cc_platform.o := n
> > +KASAN_SANITIZE_tdx.o := n
> >
> > # With some compiler versions the generated code results in boot hangs, caused
> > # by several compilation units. To be safe, disable all instrumentation.
> > @@ -50,6 +52,7 @@ KCOV_INSTRUMENT := n
> >
> > CFLAGS_head$(BITS).o += -fno-stack-protector
> > CFLAGS_cc_platform.o += -fno-stack-protector
> > +CFLAGS_tdx.o += -fno-stack-protector
> >
> > CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
>
> Don't these Makefile changes belong in patch 1, which adds tdx.c?
Removing of the instrumentation is required because is_tdx_guest() is
called from cc_platform_has().
Commit message tries to communicate this:
Since is_tdx_guest() function (through cc_platform_has() API) is used in
the early boot code, disable the instrumentation flags and function
tracer. This is similar to AMD SEV and cc_platform.c.
--
Kirill A. Shutemov
On Wed, Dec 15, 2021 at 03:31:16PM -0800, Josh Poimboeuf wrote:
> On Tue, Dec 14, 2021 at 06:02:46PM +0300, Kirill A. Shutemov wrote:
> > @@ -155,6 +157,108 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
> > return true;
> > }
> >
> > +static bool tdx_mmio(int size, bool write, unsigned long addr,
> > + unsigned long *val)
> > +{
> > + struct tdx_hypercall_output out;
> > + u64 err;
> > +
> > + err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> > + addr, *val, &out);
> > + if (err)
> > + return true;
> > +
> > + *val = out.r11;
> > + return false;
> > +}
> > +
> > +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> > +{
> > + return tdx_mmio(size, false, addr, val);
> > +}
> > +
> > +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> > +{
> > + return tdx_mmio(size, true, addr, val);
> > +}
>
> These bool functions return false on success. Conversely, other
> functions in this file return true on success. That inconsistency is
> really confusing for the callers and is bound to introduce bugs
> eventually.
>
> > +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>
> Similarly, tdx_handle_mmio() returns (int) 0 for success, while other
> tdx_handle_*() functions return (bool) true for success. Also
> confusing.
>
> The most robust option would be for all the functions to follow the
> typical kernel convention of returning (int) 0 on success. It works for
> 99.99% of the kernel. Why mess with success? (pun intended)
>
> Otherwise it's just pointless added cognitive overhead, trying to keep
> track of what success means, for each individual function.
Okay, fair enough. I will make them consistent.
--
Kirill A. Shutemov
On Thu, Dec 16, 2021 at 02:35:17AM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 15, 2021 at 03:19:04PM -0800, Josh Poimboeuf wrote:
> > On Tue, Dec 14, 2021 at 06:02:40PM +0300, Kirill A. Shutemov wrote:
> > > --- a/arch/x86/kernel/Makefile
> > > +++ b/arch/x86/kernel/Makefile
> > > @@ -22,6 +22,7 @@ CFLAGS_REMOVE_early_printk.o = -pg
> > > CFLAGS_REMOVE_head64.o = -pg
> > > CFLAGS_REMOVE_sev.o = -pg
> > > CFLAGS_REMOVE_cc_platform.o = -pg
> > > +CFLAGS_REMOVE_tdx.o = -pg
> > > endif
> > >
> > > KASAN_SANITIZE_head$(BITS).o := n
> > > @@ -31,6 +32,7 @@ KASAN_SANITIZE_stacktrace.o := n
> > > KASAN_SANITIZE_paravirt.o := n
> > > KASAN_SANITIZE_sev.o := n
> > > KASAN_SANITIZE_cc_platform.o := n
> > > +KASAN_SANITIZE_tdx.o := n
> > >
> > > # With some compiler versions the generated code results in boot hangs, caused
> > > # by several compilation units. To be safe, disable all instrumentation.
> > > @@ -50,6 +52,7 @@ KCOV_INSTRUMENT := n
> > >
> > > CFLAGS_head$(BITS).o += -fno-stack-protector
> > > CFLAGS_cc_platform.o += -fno-stack-protector
> > > +CFLAGS_tdx.o += -fno-stack-protector
> > >
> > > CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
> >
> > Don't these Makefile changes belong in patch 1, which adds tdx.c?
>
> Removing of the instrumentation is required because is_tdx_guest() is
> called from cc_platform_has().
>
> Commit message tries to communicate this:
>
> Since is_tdx_guest() function (through cc_platform_has() API) is used in
> the early boot code, disable the instrumentation flags and function
> tracer. This is similar to AMD SEV and cc_platform.c.
Ah, that's what I get for skimming the patch description.
--
Josh
On Tue, Dec 14, 2021 at 06:02:40PM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> Confidential Computing (CC) features (like string I/O unroll support,
> memory encryption/decryption support, etc) are conditionally enabled
> in the kernel using cc_platform_has() API. Since TDX guests also need
> to use these CC features, extend cc_platform_has() API and add TDX
> guest-specific CC attributes support.
>
> Use is_tdx_guest() API to detect for the TDX guest status and return
> TDX-specific CC attributes. To enable use of CC APIs in the TDX guest,
> select ARCH_HAS_CC_PLATFORM in the CONFIG_INTEL_TDX_GUEST case.
>
> This is a preparatory patch and just creates the framework for adding
> TDX guest specific CC attributes.
>
> Since is_tdx_guest() function (through cc_platform_has() API) is used in
> the early boot code, disable the instrumentation flags and function
> tracer. This is similar to AMD SEV and cc_platform.c.
>
> Since intel_cc_platform_has() function only gets triggered when
"... only gets called... "
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:41PM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
> expose the guest state to the host. This prevents the old hypercall
> mechanisms from working. So, to communicate with VMM, TDX
> specification defines a new instruction called TDCALL.
>
> In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> layer (TDX module) exists in the CPU to facilitate secure communication
in the CPU?!
I think you wanna say, "it is loaded like a firmware into a special CPU
mode called SEAM..." or so.
> between the host and the guest. TDX guests communicate with the TDX module
> using the TDCALL instruction.
>
> A guest uses TDCALL to communicate with both the TDX module and VMM.
> The value of the RAX register when executing the TDCALL instruction is
> used to determine the TDCALL type. A variant of TDCALL used to communicate
> with the VMM is called TDVMCALL.
>
> Add generic interfaces to communicate with the TDX Module and VMM
"module"
> (using the TDCALL instruction).
>
> __tdx_hypercall() - Used by the guest to request services from the
> VMM (via TDVMCALL).
> __tdx_module_call() - Used to communicate with the TDX Module (via
> TDCALL).
"module". No need to capitalize every word like in CPU manuals.
>
> Also define an additional wrapper _tdx_hypercall(), which adds error
> handling support for the TDCALL failure.
>
> The __tdx_module_call() and __tdx_hypercall() helper functions are
> implemented in assembly in a .S file. The TDCALL ABI requires
> shuffling arguments in and out of registers, which proved to be
> awkward with inline assembly.
>
> Just like syscalls, not all TDVMCALL use cases need to use the same
> number of argument registers. The implementation here picks the current
> worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
> than 4 arguments, there will end up being a few superfluous (cheap)
> instructions. But, this approach maximizes code reuse.
>
> For registers used by the TDCALL instruction, please check TDX GHCI
> specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
> Interface".
>
> https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
>
> Originally-by: Sean Christopherson <[email protected]>
Just state that in free text in the commit message:
"Based on a previous patch by Sean... "
...
> + /*
> + * Since this function can be initiated without an output pointer,
> + * check if caller provided an output struct before storing
> + * output registers.
> + */
> + test %r12, %r12
> + jz mcall_done
All those local label names need to be prefixed with .L so that they
don't appear in the vmlinux symbol table unnecessarily:
jz .Lno_output_struct
> +
> + /* Copy TDCALL result registers to output struct: */
> + movq %rcx, TDX_MODULE_rcx(%r12)
> + movq %rdx, TDX_MODULE_rdx(%r12)
> + movq %r8, TDX_MODULE_r8(%r12)
> + movq %r9, TDX_MODULE_r9(%r12)
> + movq %r10, TDX_MODULE_r10(%r12)
> + movq %r11, TDX_MODULE_r11(%r12)
> +
> +mcall_done:
.Lno_output_struct:
Ditto below.
> + /* Restore the state of R12 register */
> + pop %r12
> +
> + FRAME_END
> + ret
> +SYM_FUNC_END(__tdx_module_call)
> +
> +/*
> + * __tdx_hypercall() - Make hypercalls to a TDX VMM.
> + *
> + * Transforms function call register arguments into the TDCALL
> + * register ABI. After TDCALL operation, VMM output is saved in @out.
> + *
> + *-------------------------------------------------------------------------
> + * TD VMCALL ABI:
> + *-------------------------------------------------------------------------
> + *
> + * Input Registers:
> + *
> + * RAX - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
> + * RCX - BITMAP which controls which part of TD Guest GPR
> + * is passed as-is to the VMM and back.
> + * R10 - Set 0 to indicate TDCALL follows standard TDX ABI
> + * specification. Non zero value indicates vendor
> + * specific ABI.
> + * R11 - VMCALL sub function number
> + * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments.
> + * R8-R9, R12–R15 - Same as above.
^
massage_diff: Warning: Unicode char [–] (0x2013) in line: + * R8-R9, R12–R15 - Same as above.
All those other '-' are char 0x2d but this one has probably happened due
to copy-paste or whatnot. The second time I see UTF-8 chars in a patch
today.
> +SYM_FUNC_START(__tdx_hypercall)
> + FRAME_BEGIN
> +
> + /* Move argument 7 from caller stack to RAX */
> + movq ARG7_SP_OFFSET(%rsp), %rax
> +
> + /* Check if caller provided an output struct */
> + test %rax, %rax
> + /* If out pointer is NULL, return -EINVAL */
> + jz ret_err
> +
> + /* Save callee-saved GPRs as mandated by the x86_64 ABI */
> + push %r15
> + push %r14
> + push %r13
> + push %r12
> +
> + /*
> + * Save output pointer (rax) in stack, it will be used
"... on the stack... "
> + * again when storing the output registers after the
> + * TDCALL operation.
> + */
> + push %rax
> +
> + /* Mangle function call ABI into TDCALL ABI: */
> + /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
> + xor %eax, %eax
> + /* Move TDVMCALL type (standard vs vendor) in R10 */
> + mov %rdi, %r10
> + /* Move TDVMCALL sub function id to R11 */
> + mov %rsi, %r11
> + /* Move input 1 to R12 */
> + mov %rdx, %r12
> + /* Move input 2 to R13 */
> + mov %rcx, %r13
> + /* Move input 3 to R14 */
> + mov %r8, %r14
> + /* Move input 4 to R15 */
> + mov %r9, %r15
> +
> + movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> + tdcall
> +
> + /* Restore output pointer to R9 */
> + pop %r9
> +
> + /* Copy hypercall result registers to output struct: */
> + movq %r10, TDX_HYPERCALL_r10(%r9)
> + movq %r11, TDX_HYPERCALL_r11(%r9)
> + movq %r12, TDX_HYPERCALL_r12(%r9)
> + movq %r13, TDX_HYPERCALL_r13(%r9)
> + movq %r14, TDX_HYPERCALL_r14(%r9)
> + movq %r15, TDX_HYPERCALL_r15(%r9)
> +
> + /*
> + * Zero out registers exposed to the VMM to avoid
> + * speculative execution with VMM-controlled values.
> + * This needs to include all registers present in
> + * TDVMCALL_EXPOSE_REGS_MASK (except R12-R15).
> + * R12-R15 context will be restored.
> + */
> + xor %r10d, %r10d
> + xor %r11d, %r11d
> +
> + /* Restore callee-saved GPRs as mandated by the x86_64 ABI */
> + pop %r12
> + pop %r13
> + pop %r14
> + pop %r15
> +
> + jmp hcall_done
> +ret_err:
> + movq $(-EINVAL), %rax
What are the brackets for?
> +hcall_done:
> + FRAME_END
> +
> + retq
> +SYM_FUNC_END(__tdx_hypercall)
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index d32d9d9946d8..1cc850fd03ff 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -9,6 +9,30 @@
>
> static bool tdx_guest_detected __ro_after_init;
>
> +/*
> + * Wrapper for standard use of __tdx_hypercall with panic report
> + * for TDCALL error.
> + */
> +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
> + u64 r15, struct tdx_hypercall_output *out)
> +{
> + struct tdx_hypercall_output dummy_out;
> + u64 err;
> +
> + /* __tdx_hypercall() does not accept NULL output pointer */
> + if (!out)
> + out = &dummy_out;
> +
> + err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, fn, r12, r13, r14,
> + r15, out);
> +
> + /* Non zero return value indicates buggy TDX module, so panic */
> + if (err)
> + panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);
Use a standard formatted pattern pls:
/* Non zero return value indicates buggy TDX module, so panic */
err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, fn, r12, r13, r14, r15, out);
if (err)
panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);
> +
> + return out->r10;
> +}
> +
> bool is_tdx_guest(void)
> {
> return tdx_guest_detected;
> --
> 2.32.0
>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 12/14/21 9:02 AM, Kirill A. Shutemov wrote:
> In TDX guests, guest memory is protected from host access. If a guest
> performs I/O, it needs to explicitly share the I/O memory with the host.
>
> Make all ioremap()ed pages that are not backed by normal memory
> (IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.
>
> Since TDX memory encryption support is similar to AMD SEV architecture,
> reuse the infrastructure from AMD SEV code. Introduce CC_ATTR_GUEST_TDX
> to add TDX-specific changes to the AMD SEV/SME memory encryption code.
>
> Add tdx_shared_mask() interface to get the TDX guest shared bitmask.
>
> pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio). Export
> both pgprot_encrypted() and pgprot_decrypted().
>
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -14,6 +14,33 @@
> #include <linux/mem_encrypt.h>
> #include <linux/virtio_config.h>
>
> +#include <asm/tdx.h>
> +
> +/*
> + * Set or unset encryption attribute in vendor agnostic way.
> + */
> +pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> +{
> + if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> + return __pgprot(__sme_set(pgprot_val(prot)));
> + else if (cc_platform_has(CC_ATTR_GUEST_TDX))
> + return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> +
Hmmm... I believe this breaks SEV guests. __sme_set() uses sme_me_mask
which is used for both SME and SEV. With the current checks, an SEV guest
will end up never setting an encrypted address through this path. Ditto
below on the decrypted path.
Thanks,
Tom
> + return prot;
> +}
> +EXPORT_SYMBOL_GPL(pgprot_cc_encrypted);
> +
> +pgprot_t pgprot_cc_decrypted(pgprot_t prot)
> +{
> + if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> + return __pgprot(__sme_clr(pgprot_val(prot)));
> + else if (cc_platform_has(CC_ATTR_GUEST_TDX))
> + return __pgprot(pgprot_val(prot) | tdx_shared_mask());
> +
> + return prot;
> +}
> +EXPORT_SYMBOL_GPL(pgprot_cc_decrypted);
> +
> /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> bool force_dma_unencrypted(struct device *dev)
> {
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index 63b15108bc85..5fed077cc5f4 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -82,6 +82,15 @@ enum cc_attr {
> * Examples include TDX Guest.
> */
> CC_ATTR_HOTPLUG_DISABLED,
> +
> + /**
> + * @CC_ATTR_GUEST_TDX: Trust Domain Extension Support
> + *
> + * The platform/OS is running as a TDX guest/virtual machine.
> + *
> + * Examples include Intel TDX.
> + */
> + CC_ATTR_GUEST_TDX = 0x100,
> };
>
> #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
>
On Tue, Dec 21, 2021 at 08:11:45PM +0100, Borislav Petkov wrote:
> On Tue, Dec 14, 2021 at 06:02:41PM +0300, Kirill A. Shutemov wrote:
> > From: Kuppuswamy Sathyanarayanan <[email protected]>
> >
> > Guests communicate with VMMs with hypercalls. Historically, these
> > are implemented using instructions that are known to cause VMEXITs
> > like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
> > expose the guest state to the host. This prevents the old hypercall
> > mechanisms from working. So, to communicate with VMM, TDX
> > specification defines a new instruction called TDCALL.
> >
> > In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> > layer (TDX module) exists in the CPU to facilitate secure communication
>
> in the CPU?!
>
> I think you wanna say, "it is loaded like a firmware into a special CPU
> mode called SEAM..." or so.
What about this?
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.
Does it look fine?
> > (using the TDCALL instruction).
> >
> > __tdx_hypercall() - Used by the guest to request services from the
> > VMM (via TDVMCALL).
> > __tdx_module_call() - Used to communicate with the TDX Module (via
> > TDCALL).
>
> "module". No need to capitalize every word like in CPU manuals.
Okay, I will change it globally over the whole patchset.
> > Originally-by: Sean Christopherson <[email protected]>
>
> Just state that in free text in the commit message:
>
> "Based on a previous patch by Sean... "
Okay.
> > + /*
> > + * Since this function can be initiated without an output pointer,
> > + * check if caller provided an output struct before storing
> > + * output registers.
> > + */
> > + test %r12, %r12
> > + jz mcall_done
>
> All those local label names need to be prefixed with .L so that they
> don't appear in the vmlinux symbol table unnecessarily:
>
> jz .Lno_output_struct
Ah, okay. I did not know about special treatment for .L labels.
Again, will check whole patchset.
--
Kirill A. Shutemov
On Wed, Dec 22, 2021 at 11:26:59AM -0600, Tom Lendacky wrote:
> On 12/14/21 9:02 AM, Kirill A. Shutemov wrote:
> > In TDX guests, guest memory is protected from host access. If a guest
> > performs I/O, it needs to explicitly share the I/O memory with the host.
> >
> > Make all ioremap()ed pages that are not backed by normal memory
> > (IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.
> >
> > Since TDX memory encryption support is similar to AMD SEV architecture,
> > reuse the infrastructure from AMD SEV code. Introduce CC_ATTR_GUEST_TDX
> > to add TDX-specific changes to the AMD SEV/SME memory encryption code.
> >
> > Add tdx_shared_mask() interface to get the TDX guest shared bitmask.
> >
> > pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio). Export
> > both pgprot_encrypted() and pgprot_decrypted().
> >
>
> > --- a/arch/x86/mm/mem_encrypt.c
> > +++ b/arch/x86/mm/mem_encrypt.c
> > @@ -14,6 +14,33 @@
> > #include <linux/mem_encrypt.h>
> > #include <linux/virtio_config.h>
> > +#include <asm/tdx.h>
> > +
> > +/*
> > + * Set or unset encryption attribute in vendor agnostic way.
> > + */
> > +pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> > +{
> > + if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> > + return __pgprot(__sme_set(pgprot_val(prot)));
> > + else if (cc_platform_has(CC_ATTR_GUEST_TDX))
> > + return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> > +
>
> Hmmm... I believe this breaks SEV guests. __sme_set() uses sme_me_mask which
> is used for both SME and SEV. With the current checks, an SEV guest will end
> up never setting an encrypted address through this path. Ditto below on the
> decrypted path.
Hm, okay. What if I rewrite code like this:
pgprot_t pgprot_cc_encrypted(pgprot_t prot)
{
if (cc_platform_has(CC_ATTR_GUEST_TDX))
return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
else
return __pgprot(__sme_set(pgprot_val(prot)));
}
I believe it should cover all cases, right?
--
Kirill A. Shutemov
On Thu, Dec 23, 2021 at 07:55:48PM +0300, Kirill A. Shutemov wrote:
> What about this?
>
> In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> layer -- TDX module -- facilitates secure communication between the host
> and the guest. TDX module is loaded like a firmware into a special CPU
> mode called SEAM. TDX guests communicate with the TDX module using the
> TDCALL instruction.
>
> Does it look fine?
Yap, thx.
> Ah, okay. I did not know about special treatment for .L labels.
> Again, will check whole patchset.
Yeah, those are local labels. From the gas manpage:
-L
--keep-locals
Keep (in the symbol table) local symbols. These symbols start with system-
specific local label prefixes, typically .L for ELF systems or L for
traditional a.out systems.
Apparently, one can even add one's own prefix for local labels too:
-local-prefix=prefix
Mark all labels with the specified prefix as local. But such a label can be
marked global explicitly in the code. This option does not change the default
local label prefix ".L"; it just adds a new one.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 12/23/21 9:15 AM, Kirill A. Shutemov wrote:
>>> +pgprot_t pgprot_cc_encrypted(pgprot_t prot)
>>> +{
>>> + if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
>>> + return __pgprot(__sme_set(pgprot_val(prot)));
>>> + else if (cc_platform_has(CC_ATTR_GUEST_TDX))
>>> + return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
>>> +
>> Hmmm... I believe this breaks SEV guests. __sme_set() uses sme_me_mask which
>> is used for both SME and SEV. With the current checks, an SEV guest will end
>> up never setting an encrypted address through this path. Ditto below on the
>> decrypted path.
> Hm, okay. What if I rewrite code like this:
>
> pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> {
> if (cc_platform_has(CC_ATTR_GUEST_TDX))
> return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> else
> return __pgprot(__sme_set(pgprot_val(prot)));
> }
>
> I believe it should cover all cases, right?
I _think_ that should be fine for now. But, it does expose that
__sme_set() is weird because it gets used on non-SME systems while
tdx_shared_mask() is only used on TDX systems.
Ideally, we'd eventually get to something close to what you had originally:
pgprot_t pgprot_cc_encrypted(pgprot_t prot)
{
if (cc_platform_has(CC_ATTR_GUEST_TDX))
return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
if (cc_platform_has(CC_ATTR_SME_SOMETHING??))
return __pgprot(pgprot_val(prot) | sme_me_mask);
return prot;
}
CC_ATTR_SME_SOMETHING would get set when sme_me_mask is initialized to
something non-zero. That will keep folks from falling into the same
trap that you did in the long term.
The SEV code wasn't crazy for doing what it did when it was the only
game in town. But, now that TDX is joining the party, we need to make
sure that SEV isn't special.
On Tue, Dec 14, 2021 at 06:02:42PM +0300, Kirill A. Shutemov wrote:
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the
> kernel:
>
> * Specific instructions (WBINVD, for example)
> * Specific MSR accesses
> * Specific CPUID leaf accesses
> * Access to unmapped pages (EPT violation)
>
> In the settings that Linux will run in, virtual exceptions are never
> generated on accesses to normal, TD-private memory that has been
> accepted.
>
> The #VE handler implementation is simplified by the fact that entry
> paths do not trigger #VE and that the handler may not be interrupted.
> Specifically, the implementation assumes that the entry paths do not
> access TD-shared memory, MMIO regions, use #VE triggering MSRs,
> instructions, or CPUID leaves that might generate #VE. Interrupts,
> including NMIs, are blocked by the hardware starting with #VE delivery
> until TDGETVEINFO is called. All of this combined eliminates the
> chance of a #VE during the syscall gap, or paranoid entry paths.
>
> After TDGETVEINFO, #VE could happen in theory (e.g. through an NMI),
> but it is expected not to happen because TDX expects NMIs not to
> trigger #VEs. Another case where #VE could happen is if the #VE
> exception panics, but in this case, since the platform is already in
> a panic state, nested #VE is not a concern.
>
> If a guest kernel action which would normally cause a #VE occurs in
> the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> exception) is delivered to the guest which will result in an oops
> (and should eventually be a panic, as it is expected panic_on_oops is
> set to 1 for TDX guests).
So until here there are a lot of expectations and assumptions. What
happens if those are violated?
What happens if the NMI handler triggers a #VE after all? Or where is it
enforced that TDX guests should set panic_on_oops?
It all reads really weird, like the TDX guest is a big bird which simply
sticks its head in the sand in the face of danger...
...
> +/*
> + * Handle the user initiated #VE.
> + *
> + * For example, executing the CPUID instruction from the user
"... from userspace... " no "the"
> + * space is a valid case and hence the resulting #VE had to
s/had/has/
> + * be handled.
> + *
> + * For dis-allowed or invalid #VE just return failure.
> + *
> + * Return True on success and False on failure.
You lost me here - function returns false unconditionally. And that
bla about CPUID from user being a valid case doesn't really look like
one when I look at the code. Especially since ve_raise_fault() sends a
SIGSEGV for user #VEs.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Dec 23, 2021 at 11:45:19AM -0800, Dave Hansen wrote:
> CC_ATTR_SME_SOMETHING would get set when sme_me_mask is initialized to
> something non-zero. That will keep folks from falling into the same
> trap that you did in the long term.
I guess CC_ATTR_MEM_ENCRYPT which basically says generic memory
encryption...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Dec 23, 2021 at 08:53:27PM +0100, Borislav Petkov wrote:
> On Thu, Dec 23, 2021 at 11:45:19AM -0800, Dave Hansen wrote:
> > CC_ATTR_SME_SOMETHING would get set when sme_me_mask is initialized to
> > something non-zero. That will keep folks from falling into the same
> > trap that you did in the long term.
>
> I guess CC_ATTR_MEM_ENCRYPT which basically says generic memory
> encryption...
Except CC_ATTR_MEM_ENCRYPT is true for TDX too, so it will also depend on
check order. It is fragile.
Frankly, a naked sme_me_mask check would be better. Hm?
--
Kirill A. Shutemov
On Thu, Dec 23, 2021 at 11:56:04PM +0300, Kirill A. Shutemov wrote:
> Except CC_ATTR_MEM_ENCRYPT is true for TDX too, so it will also depend on
> check order. It is fragile.
So the query you wanna do is:
if (memory encryption in use)
use mask;
and the mask you use depends on whether it is SEV or TDX. Right?
If so, you can either do a cc_get_mask() function which gives you either
the SEV or TDX mask or simply do:
if (CC_ATTR_MEM_ENCRYPT) {
if (CC_ATTR_GUEST_TDX)
mask = tdx_shared_mask();
else if (sme_me_mask)
mask = sme_me_mask;
}
Yeah, sme_me_mask has become synonymous with the kernel running as an AMD
confidential guest. I need to think about how to make this cleaner...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 12/23/21 17:55, Kirill A. Shutemov wrote:
> In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> layer -- TDX module -- facilitates secure communication between the host
> and the guest. TDX module is loaded like a firmware into a special CPU
> mode called SEAM. TDX guests communicate with the TDX module using the
> TDCALL instruction.
>
> Does it look fine?
Looks good but I wouldn't say "like a firmware". The TDX module is the
"real" hypervisor, it's not firmware.
Paolo
On Fri, Dec 24, 2021 at 10:16:16AM +0100, Paolo Bonzini wrote:
> On 12/23/21 17:55, Kirill A. Shutemov wrote:
> > In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> > layer -- TDX module -- facilitates secure communication between the host
> > and the guest. TDX module is loaded like a firmware into a special CPU
> > mode called SEAM. TDX guests communicate with the TDX module using the
> > TDCALL instruction.
> >
> > Does it look fine?
>
> Looks good but I wouldn't say "like a firmware". The TDX module is the
> "real" hypervisor, it's not firmware.
We are talking about the way it gets loaded, not about its functionality.
>
> Paolo
>
--
Kirill A. Shutemov
On Thu, Dec 23, 2021 at 10:09:26PM +0100, Borislav Petkov wrote:
> On Thu, Dec 23, 2021 at 11:56:04PM +0300, Kirill A. Shutemov wrote:
> > Except CC_ATTR_MEM_ENCRYPT is true for TDX too, so it will also depend on
> > check order. It is fragile.
>
> So the query you wanna do is:
>
> if (memory encryption in use)
> use mask;
>
> and the mask you use depends on whether it is SEV or TDX. Right?
>
> If so, you can either do a cc_get_mask() function which gives you either
> the SEV or TDX mask or simply do:
>
> if (CC_ATTR_MEM_ENCRYPT) {
> if (CC_ATTR_GUEST_TDX)
> mask = tdx_shared_mask();
> else if (sme_me_mask)
> mask = sme_me_mask;
> }
>
> Yeah, sme_me_mask has become synonymous with the kernel running as a AMD
> confidential guest. I need to think about how to make this cleaner...
Okay. Meanwhile I leave it this way:
pgprot_t pgprot_cc_encrypted(pgprot_t prot)
{
if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_TDX))
return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
else if (sme_me_mask)
return __pgprot(__sme_set(pgprot_val(prot)));
else
WARN_ON_ONCE(1);
}
return prot;
}
EXPORT_SYMBOL_GPL(pgprot_cc_encrypted);
pgprot_t pgprot_cc_decrypted(pgprot_t prot)
{
if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_TDX))
return __pgprot(pgprot_val(prot) | tdx_shared_mask());
else if (sme_me_mask)
return __pgprot(__sme_clr(pgprot_val(prot)));
else
WARN_ON_ONCE(1);
}
return prot;
}
EXPORT_SYMBOL_GPL(pgprot_cc_decrypted);
--
Kirill A. Shutemov
On Fri, Dec 24, 2021 at 02:03:00PM +0300, Kirill A. Shutemov wrote:
> Okay. Meanwhile I leave it this way:
>
> pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> {
> if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> if (cc_platform_has(CC_ATTR_GUEST_TDX))
> return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> else if (sme_me_mask)
> return __pgprot(__sme_set(pgprot_val(prot)));
> else
> WARN_ON_ONCE(1);
I'm wondering if defining a generic cc_attr especially for this:
if (cc_platform_has(CC_ATTR_MEMORY_SHARING))
to mean, the CC guest needs to do special stuff in order to share memory
with the host (naming sucks, ofc) would be cleaner?
Because then
1. you can return whatever you need to, in the vendor-specific
cc_platform_has() and
2. this can be a separate attribute as I'm assuming it probably will be
used in a couple of places where sharing info with the host is needed.
Hmmm.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 27, 2021 at 12:51:21PM +0100, Borislav Petkov wrote:
> On Fri, Dec 24, 2021 at 02:03:00PM +0300, Kirill A. Shutemov wrote:
> > Okay. Meanwhile I leave it this way:
> >
> > pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> > {
> > if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> > if (cc_platform_has(CC_ATTR_GUEST_TDX))
> > return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> > else if (sme_me_mask)
> > return __pgprot(__sme_set(pgprot_val(prot)));
> > else
> > WARN_ON_ONCE(1);
>
> I'm wondering if defining a generic cc_attr especially for this:
>
> if (cc_platform_has(CC_ATTR_MEMORY_SHARING))
>
> to mean, the CC guest needs to do special stuff in order to share memory
> with the host (naming sucks, ofc) would be cleaner?
Looks like CC_ATTR_MEM_ENCRYPT already does this. The attribute doesn't
have much meaning beyond that, no?
--
Kirill A. Shutemov
On 12/24/21 5:03 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 23, 2021 at 10:09:26PM +0100, Borislav Petkov wrote:
>> On Thu, Dec 23, 2021 at 11:56:04PM +0300, Kirill A. Shutemov wrote:
>>> Except CC_ATTR_MEM_ENCRYPT is true for TDX too, so it will also depend on
>>> check order. It is fragile.
>>
>> So the query you wanna do is:
>>
>> if (memory encryption in use)
>> use mask;
>>
>> and the mask you use depends on whether it is SEV or TDX. Right?
>>
>> If so, you can either do a cc_get_mask() function which gives you either
>> the SEV or TDX mask or simply do:
>>
>> if (CC_ATTR_MEM_ENCRYPT) {
>> if (CC_ATTR_GUEST_TDX)
>> mask = tdx_shared_mask();
>> else if (sme_me_mask)
>> mask = sme_me_mask;
>> }
>>
>> Yeah, sme_me_mask has become synonymous with the kernel running as a AMD
>> confidential guest. I need to think about how to make this cleaner...
>
> Okay. Meanwhile I leave it this way:
>
> pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> {
> if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> if (cc_platform_has(CC_ATTR_GUEST_TDX))
> return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> else if (sme_me_mask)
> return __pgprot(__sme_set(pgprot_val(prot)));
> else
> WARN_ON_ONCE(1);
> }
>
> return prot;
> }
> EXPORT_SYMBOL_GPL(pgprot_cc_encrypted);
Why can't this follow the cc_platform_has() logic and maybe even live in
the cc_platform.c file (though there might be issues with that, I haven't
really looked)?
if (sme_me_mask)
return __pgprot(__sme_set(pgprot_val(prot)));
else if (is_tdx_guest())
return return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
return prot;
and maybe even call it cc_pgprot_encrypted()?
Just a thought.
Thanks,
Tom
>
> pgprot_t pgprot_cc_decrypted(pgprot_t prot)
> {
> if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> if (cc_platform_has(CC_ATTR_GUEST_TDX))
> return __pgprot(pgprot_val(prot) | tdx_shared_mask());
> else if (sme_me_mask)
> return __pgprot(__sme_clr(pgprot_val(prot)));
> else
> WARN_ON_ONCE(1);
> }
>
> return prot;
> }
> EXPORT_SYMBOL_GPL(pgprot_cc_decrypted);
>
On Mon, Dec 27, 2021 at 05:14:36PM +0300, Kirill A. Shutemov wrote:
> On Mon, Dec 27, 2021 at 12:51:21PM +0100, Borislav Petkov wrote:
> > On Fri, Dec 24, 2021 at 02:03:00PM +0300, Kirill A. Shutemov wrote:
> > > Okay. Meanwhile I leave it this way:
> > >
> > > pgprot_t pgprot_cc_encrypted(pgprot_t prot)
> > > {
> > > if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> > > if (cc_platform_has(CC_ATTR_GUEST_TDX))
> > > return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> > > else if (sme_me_mask)
> > > return __pgprot(__sme_set(pgprot_val(prot)));
> > > else
> > > WARN_ON_ONCE(1);
> >
> > I'm wondering if defining a generic cc_attr especially for this:
> >
> > if (cc_platform_has(CC_ATTR_MEMORY_SHARING))
> >
> > to mean, the CC guest needs to do special stuff in order to share memory
> > with the host (naming sucks, ofc) would be cleaner?
>
> Looks like CC_ATTR_MEM_ENCRYPT already does this. The attribute doesn't
> have much meaning beyond that, no?
It means that *some* memory encryption - guest or host - is in use.
But my point about removing the outer check is bull - you need the
TDX/SEV checks too to figure out which mask to use.
So, reading Tom's latest email, having
cc_pgprot_encrypted(prot)
and
cc_pgprot_decrypted(prot)
in cc_platform.c and which hide all that logic inside doesn't sound like
a bad idea. And cc_platform.c already looks at sme_me_mask and we do
that there for the early path so I guess that's probably halfway fine...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:43PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index e9ee8b526319..273e4266b2c1 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -46,6 +46,7 @@
> #include <asm/proto.h>
> #include <asm/frame.h>
> #include <asm/unwind.h>
> +#include <asm/tdx.h>
>
> #include "process.h"
>
> @@ -864,6 +865,12 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
> if (x86_idle || boot_option_idle_override == IDLE_POLL)
> return;
>
> + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> + x86_idle = tdx_guest_idle;
> + pr_info("using TDX aware idle routine\n");
> + return;
> + }
Why isn't this part of the following if-else if-else if... noodle?
> +
> if (boot_cpu_has_bug(X86_BUG_AMD_E400)) {
> pr_info("using AMD E400 aware idle routine\n");
> x86_idle = amd_e400_idle;
This one starting here: ^^^^
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index ee52dde01b24..e19187048be8 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
> #include <asm/asm.h>
> #include <asm/frame.h>
> #include <asm/unwind_hints.h>
> +#include <uapi/asm/vmx.h>
>
> #include <linux/linkage.h>
> #include <linux/bits.h>
> @@ -39,6 +40,13 @@
> */
> #define tdcall .byte 0x66,0x0f,0x01,0xcc
>
> +/*
> + * Used in the __tdx_hypercall() function to test R15 register content
> + * and optionally include the STI instruction before the TDCALL
> + * instruction (for EXIT_REASON_HLT case).
"Used in __tdx_hypercall() to determine whether to enable interrupts
before issuing TDCALL for the EXIT_REASON_HLT case."
Plain and simple.
> + */
> +#define do_sti 0x01
#define ENABLE_IRQS_BEFORE_HLT 0x01
and when you call it that, you don't even need the comment above it
because the name is self-explanatory.
> +
> /*
> * __tdx_module_call() - Used by TDX guests to request services from
> * the TDX module (does not include VMM services).
> @@ -231,6 +239,30 @@ SYM_FUNC_START(__tdx_hypercall)
>
> movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>
> + /*
> + * For the idle loop STI needs to be called directly before
> + * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
> + * instruction enables interrupts only one instruction later.
> + * If there is a window between STI and the instruction that
> + * emulates the HALT state, there is a chance for interrupts to
> + * happen in this window, which can delay the HLT operation
> + * indefinitely. Since this is the not the desired result, add
> + * support to conditionally call STI before TDCALL.
"add support"?
> + *
> + * Since STI instruction is only required for the idle case
> + * (a special case of EXIT_REASON_HLT), use the r15 register
> + * value to identify it. Since the R15 register is not used
> + * by the VMM as per EXIT_REASON_HLT ABI, re-use it in
> + * software to identify the STI case.
> + */
> + cmpl $EXIT_REASON_HLT, %r11d
> + jne skip_sti
> + cmpl $do_sti, %r15d
> + jne skip_sti
> + /* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
> + xor %r15, %r15
> + sti
> +skip_sti:
.Lskip_sti:
> tdcall
>
> /* Restore output pointer to R9 */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b6d0e45e6589..6749ca3b2e3d 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -6,6 +6,7 @@
>
> #include <linux/cpufeature.h>
> #include <asm/tdx.h>
> +#include <asm/vmx.h>
>
> /* TDX Module Call Leaf IDs */
> #define TDX_GET_VEINFO 3
...
> +void __cpuidle tdx_guest_idle(void)
> +{
> + tdx_safe_halt();
> +}
That wrapper looks useless...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Dec 23, 2021 at 08:45:40PM +0100, Borislav Petkov wrote:
> What happens if the NMI handler triggers a #VE after all? Or where is it
> enforced that TDX guests should set panic_on_oops?
Kernel will handle the #VE normally inside NMI handler. (We tested it once
again, just in case.)
The critical part is that #VE must not be triggered in NMI entry code,
before kernel is ready to handle nested NMIs.
#VE cannot possibly happen there: no #VE-inducing instructions, code and
data are in guest private memory.
VMM can remove private memory from under us, but access to unaccepted (or
missing) private memory leads to VM termination, not to #VE.
The situation is similar to NMIs vs. breakpoints.
> > + * be handled.
> > + *
> > + * For dis-allowed or invalid #VE just return failure.
> > + *
> > + * Return True on success and False on failure.
>
> You lost me here - function returns false unconditionally. And that
> bla about CPUID from user being a valid case doesn't really look like
> one when I look at the code. Especially since ve_raise_fault() sends a
> SIGSEGV for user #VEs.
tdx_virt_exception_user()/tdx_virt_exception_kernel() will be populated by
following patches. The patch adds generic infrastructure for #VE handling.
--
Kirill A. Shutemov
On Tue, Dec 28, 2021 at 07:39:46PM +0100, Borislav Petkov wrote:
> But my point about removing the outer check is bull - you need the
> TDX/SEV checks too to figure out which mask to use.
>
> So, reading Tom's latest email, having
>
> cc_pgprot_encrypted(prot)
>
> and
> cc_pgprot_decrypted(prot)
>
> in cc_platform.c and which hide all that logic inside doesn't sound like
> a bad idea. And cc_platform.c already looks at sme_me_mask and we do
> that there for the early path so I guess that's probably halfway fine...
Okay, will go this path.
--
Kirill A. Shutemov
On Wed, Dec 29, 2021 at 02:31:12AM +0300, Kirill A. Shutemov wrote:
> On Thu, Dec 23, 2021 at 08:45:40PM +0100, Borislav Petkov wrote:
> > What happens if the NMI handler triggers a #VE after all? Or where is it
> > enforced that TDX guests should set panic_on_oops?
>
> Kernel will handle the #VE normally inside NMI handler. (We tested it once
> again, just in case.)
>
> The critical part is that #VE must not be triggered in NMI entry code,
> before kernel is ready to handle nested NMIs.
Well, I can't read that in the commit message, maybe it needs expanding
on that aspect?
What I read is:
"Interrupts, including NMIs, are blocked by the hardware starting with
#VE delivery until TDGETVEINFO is called."
but this simply means that *if* you get a #VE anywhere, NMIs are masked
until TDGETVEINFO.
If you get a #VE during the NMI entry code, then you're toast...
> #VE cannot possibly happen there: no #VE-inducing instructions, code and
> data are in guest private memory.
Right, that. So we cannot get a #VE there.
> VMM can remove private memory from under us, but access to unaccepted (or
> missing) private memory leads to VM termination, not to #VE.
And that can't trigger a #VE either.
So I'm confused...
It sounds like you wanna say: no #VEs should happen during the NMI entry
code because of <raisins> and in order to prevent those, we don't use
insns causing #VE, etc. And private pages removed by the VM will simply
terminate the guest.
So what's up?
> tdx_virt_exception_user()/tdx_virt_exception_kernel() will be populated by
> following patches. The patch adds generic infrastructure for #VE handling.
Yeah, you either need to state that somewhere or keep changing those
functions as they evolve in the patchset. As it is, it just confuses
reviewers.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:44PM +0300, Kirill A. Shutemov wrote:
> +static bool tdx_read_msr_safe(unsigned int msr, u64 *val)
> +{
> + struct tdx_hypercall_output out;
> +
> + /*
> + * Emulate the MSR read via hypercall. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface
> + * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
> + */
> + if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
> + return false;
> +
> + *val = out.r11;
> +
> + return true;
> +}
> +
> +static bool tdx_write_msr_safe(unsigned int msr, unsigned int low,
Why the "_safe" suffix?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Dec 29, 2021, Borislav Petkov wrote:
> On Wed, Dec 29, 2021 at 02:31:12AM +0300, Kirill A. Shutemov wrote:
> What I read is:
>
> "Interrupts, including NMIs, are blocked by the hardware starting with
> #VE delivery until TDGETVEINFO is called."
FWIW, virtual/guest NMIs are blocked by the TDX module until pending #VE info
is retrieved via TDGETVEINFO. Hardware has nothing to do with that behavior.
> but this simply means that *if* you get a #VE anywhere, NMIs are masked
> until TDGETVEINFO.
Yep.
> If you get a #VE during the NMI entry code, then you're toast...
Yes? The rules would be the same as whatever existing rules we have for taking
#DBs in NMI, but that's because of the subsequent IRET unblocking NMIs, not because
there's anything special about #VE. Pending NMIs are blocked by the regular NMI
status (unblocked by IRET) _and_ by an unread #VE info.
The unread #VE info clause in NMI blocking is purely to prevent an NMI from being
injected before the guest's #VE handler can do TDGETVEINFO, otherwise a #VE at
_any_ point in the NMI handler would be fatal due to it clobbering the unread #VE
info (it'd be a similar problem to SEV-ES's GHCB juggling).
On Wed, Dec 29, 2021 at 05:07:34PM +0000, Sean Christopherson wrote:
> FWIW, virtual/guest NMIs are blocked by the TDX module until pending #VE info
> is retrieved via TDGETVEINFO. Hardware has nothing to do with that behavior.
The TDX module can block NMIs?! Can we get that functionality exported
to baremetal too pls? Then we can get rid of the NMI nesting crap.
> Yes? The rules would be the same as whatever existing rules we have for taking
> #DBs in NMI, but that's because the subsequent IRET unblocking NMIs, not because
> there's anything special about #VE. Pending NMIs are blocked by the regular NMI
> status (unblocked by IRET) _and_ by an unread #VE info.
>
> The unread #VE info clause in NMI blocking is purely to prevent an NMI from being
> injected before the guest's #VE handler can do TDGETVEINFO, otherwise a #VE at
> _any_ point in the NMI handler would be fatal due to it clobbering the unread #VE
> info (it'd be a similar problem to SEV-ES's GHCB juggling).
I guess this is what Kirill means by:
"The critical part is that #VE must not be triggered in NMI entry code,
before kernel is ready to handle nested NMIs."
I read that as "you die if you get it then" but it sounds like it is
"it'll overwrite #VE info and you'll probably die eventually." Or so.
Yeah, this commit message text - and I actually think it is much more
important than something to put just in a commit message - needs a lot
more scrubbing and should go somewhere - maybe over the #VE handler -
to explain what the situation is wrt NMIs and #VEs. And those
formulations about a TDX guest expecting stuff to be a certain way are
just silly.
A TDX guest either enforces them or throws its hands in the air and does
not boot.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Dec 29, 2021, Borislav Petkov wrote:
> On Wed, Dec 29, 2021 at 05:07:34PM +0000, Sean Christopherson wrote:
> > FWIW, virtual/guest NMIs are blocked by the TDX module until pending #VE info
> > is retrieved via TDGETVEINFO. Hardware has nothing to do with that behavior.
>
> The TDX module can block NMIs?!
It blocks _virtual_ NMIs, which simply means that it doesn't inject an NMI until
NMIs are unblocked _in the guest_. Hardware NMIs that arrive in the guest are
never blocked and will trigger an exit to the host.
Any hypervisor can do the same, but it requires a contract between the guest and
the hypervisor to define when NMIs are unblocked. TDX extends the historical x86
contract with the #VE info clause, but again that doesn't help with nested NMIs.
> Can we get that functionality exported to baremetal too pls? Then we can get
> rid of the NMI nesting crap.
I believe that's being addressed with FRED[*]. ERET{S,U} unblock NMIs iff a magic
bit is set on the stack, and that magic bit is set by hardware only when delivering
NMIs. I.e. so long as the NMI handler doesn't deliberately set the bit when
returning from other faults/events, NMIs will remain blocked until the NMI handler
returns.
[*] https://www.intel.com/content/www/us/en/develop/download/flexible-return-and-event-delivery-specification.html
On 12/28/21 3:31 PM, Kirill A. Shutemov wrote:
> On Thu, Dec 23, 2021 at 08:45:40PM +0100, Borislav Petkov wrote:
>> What happens if the NMI handler triggers a #VE after all? Or where is it
>> enforced that TDX guests should set panic_on_oops?
> Kernel will handle the #VE normally inside NMI handler. (We tested it once
> again, just in case.)
>
> The critical part is that #VE must not be triggered in NMI entry code,
> before kernel is ready to handle nested NMIs.
>
> #VE cannot possibly happen there: no #VE-inducing instructions, code and
> data are in guest private memory.
...
> The situation is similar to NMIs vs. breakpoints.
Or page faults for that matter.
Page faults are architecturally permitted to occur in the NMI entry
path. But, there's no facility to handle them. The kernel (mostly
easily) avoids doing things that might cause page faults in the NMI
entry path.
The same goes for #VE's in the same path. A guest is written to avoid
#VE in the NMI entry. If they happen in that path, there's a bug somewhere.
I wouldn't go as far as to say "#VE cannot possibly happen there (NMI
entry code)". They *CAN* happen there, but the kernel is doing
everything it can to avoid them.
On Wed, Dec 29, 2021 at 12:29:51PM +0100, Borislav Petkov wrote:
> On Wed, Dec 29, 2021 at 02:31:12AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Dec 23, 2021 at 08:45:40PM +0100, Borislav Petkov wrote:
> > > What happens if the NMI handler triggers a #VE after all? Or where is it
> > > enforced that TDX guests should set panic_on_oops?
> >
> > Kernel will handle the #VE normally inside NMI handler. (We tested it once
> > again, just in case.)
> >
> > The critical part is that #VE must not be triggered in NMI entry code,
> > before kernel is ready to handle nested NMIs.
>
> Well, I can't read that in the commit message, maybe it needs expanding
> on that aspect?
>
> What I read is:
>
> "Interrupts, including NMIs, are blocked by the hardware starting with
> #VE delivery until TDGETVEINFO is called."
>
> but this simply means that *if* you get a #VE anywhere, NMIs are masked
> until TDGETVEINFO.
>
> If you get a #VE during the NMI entry code, then you're toast...
Hm. Two sentences above the one you quoted describe (maybe badly? I dunno)
why #VE doesn't happen in entry paths. Maybe it's not clear it covers the NMI
entry path too.
What if I replace the paragraph with these two:
Kernel avoids #VEs during syscall gap and NMI entry code. Entry code
paths do not access TD-shared memory, MMIO regions, use #VE triggering
MSRs, instructions, or CPUID leaves that might generate #VE. Similarly,
to page faults and breakpoints, #VEs are allowed in NMI handlers once
kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until kernel reads the VE
info.
Is it better?
> > tdx_virt_exception_user()/tdx_virt_exception_kernel() will be populated by
> > following patches. The patch adds generic infrastructure for #VE handling.
>
> Yeah, you either need to state that somewhere or keep changing those
> functions as they evolve in the patchset. As it is, it just confuses
> reviewers.
Commit message already has this:
Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.
I'm not sure what needs to be changed.
--
Kirill A. Shutemov
On Thu, Dec 30, 2021 at 11:05:00AM +0300, Kirill A. Shutemov wrote:
> Hm. Two sentences above the one you quoted describe (maybe badly? I dunno)
> why #VE doesn't happen in entry paths. Maybe it's not clear it covers the NMI
> entry path too.
>
> What if I replace the paragraph with these two:
>
> Kernel avoids #VEs during syscall gap and NMI entry code.
because? Explain why here.
> Entry code
> paths do not access TD-shared memory, MMIO regions, use #VE triggering
> MSRs, instructions, or CPUID leaves that might generate #VE. Similarly,
> to page faults and breakpoints, #VEs are allowed in NMI handlers once
> kernel is ready to deal with nested NMIs.
>
> During #VE delivery, all interrupts, including NMIs, are blocked until
> TDGETVEINFO is called. It prevents #VE nesting until kernel reads the VE
> info.
This alludes somewhat to the why above.
Now, I hear that TDX doesn't generate #VE anymore for the case where the
HV might have unmapped/made non-private the page which contains the NMI
entry code.
Explain that here too pls.
And then stick that text over exc_virtualization_exception() so that it
is clear what's going on and that it can be easily found.
And then you still need to deal with
"(and should eventually be a panic, as it is expected panic_on_oops is
set to 1 for TDX guests)."
You can say what is expected to be done by the TDX guest owner in some
how-to doc but if those expectations are not met, then the guest should
simply die. Not we expect this and hope that users will do it, but
actually enforce it.
> Commit message already has this:
>
> Add basic infrastructure to handle any #VE which occurs in the kernel
> or userspace. Later patches will add handling for specific #VE
> scenarios.
>
> I'm not sure what needs to be changed.
That:
+ * Return True on success and False on failure.
+ */
+static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+ return false;
Kill the wrong comment.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Dec 30, 2021 at 11:53:39AM +0100, Borislav Petkov wrote:
> On Thu, Dec 30, 2021 at 11:05:00AM +0300, Kirill A. Shutemov wrote:
> > Hm. Two sentences above the one you quoted describe (maybe badly? I dunno)
> > why #VE doesn't happen in entry paths. Maybe it's not clear it covers the NMI
> > entry path too.
> >
> > What if I replace the paragraph with these two:
> >
> > Kernel avoids #VEs during syscall gap and NMI entry code.
>
> because? Explain why here.
Okay.
>
> > Entry code
> > paths do not access TD-shared memory, MMIO regions, use #VE triggering
> > MSRs, instructions, or CPUID leaves that might generate #VE. Similarly,
> > to page faults and breakpoints, #VEs are allowed in NMI handlers once
> > kernel is ready to deal with nested NMIs.
> >
> > During #VE delivery, all interrupts, including NMIs, are blocked until
> > TDGETVEINFO is called. It prevents #VE nesting until kernel reads the VE
> > info.
>
> This alludes somewhat to the why above.
It addresses the apparent issue with nested #VEs. I consider it to be
separate from the issue of exceptions in the entry code.
> Now, I hear that TDX doesn't generate #VE anymore for the case where the
> HV might have unmapped/made non-private the page which contains the NMI
> entry code.
>
> Explain that here too pls.
Okay.
> And then stick that text over exc_virtualization_exception() so that it
> is clear what's going on and that it can be easily found.
Will do.
>
> And then you still need to deal with
>
> "(and should eventually be a panic, as it is expected panic_on_oops is
> set to 1 for TDX guests)."
I will drop this. Forcing panic_on_oops is out of scope for the patch.
The updated commit message is below. Let me know if something is unclear.
----------------------------8<-------------------------------------------
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to unmapped pages (EPT violation)
In the settings that Linux will run in, virtual exceptions are never
generated on accesses to normal, TD-private memory that has been
accepted.
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. IRET from the exception handle will
re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly, to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
If a guest kernel action which would normally cause a #VE occurs in
the interrupt-disabled region before TDGETVEINFO, a #DF (double fault
exception) is delivered to the guest which will result in an oops.
Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
--
Kirill A. Shutemov
On Thu, Dec 30, 2021 at 06:41:27PM +0300, Kirill A. Shutemov wrote:
> The updated commit message is below. Let me know if something is unclear.
>
> ----------------------------8<-------------------------------------------
>
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the
> kernel:
>
> * Specific instructions (WBINVD, for example)
> * Specific MSR accesses
> * Specific CPUID leaf accesses
> * Access to unmapped pages (EPT violation)
>
> In the settings that Linux will run in, virtual exceptions are never
virtualization exceptions
> generated on accesses to normal, TD-private memory that has been
> accepted.
>
> Syscall entry code has a critical window where the kernel stack is not
> yet set up. Any exception in this window leads to hard to debug issues
> and can be exploited for privilege escalation. Exceptions in the NMI
> entry code also cause issues. IRET from the exception handle will
"Returning from the exception handler with IRET will... "
> re-enable NMIs and nested NMI will corrupt the NMI stack.
>
> For these reasons, the kernel avoids #VEs during the syscall gap and
> the NMI entry code. Entry code paths do not access TD-shared memory,
> MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> that might generate #VE. VMM can remove memory from TD at any point,
> but access to unaccepted (or missing) private memory leads to VM
> termination, not to #VE.
>
> Similarly, to page faults and breakpoints, #VEs are allowed in NMI
"Similarly to" - no comma.
> handlers once the kernel is ready to deal with nested NMIs.
>
> During #VE delivery, all interrupts, including NMIs, are blocked until
> TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> the VE info.
>
> If a guest kernel action which would normally cause a #VE occurs in
> the interrupt-disabled region before TDGETVEINFO, a #DF (double fault
> exception) is delivered to the guest which will result in an oops.
That up to here can go over the #VE handler.
> Add basic infrastructure to handle any #VE which occurs in the kernel
> or userspace. Later patches will add handling for specific #VE
> scenarios.
>
> For now, convert unhandled #VE's (everything, until later in this
> series) so that they appear just like a #GP by calling the
> ve_raise_fault() directly. The ve_raise_fault() function is similar
> to #GP handler and is responsible for sending SIGSEGV to userspace
> and CPU die and notifying debuggers and other die chain users.
Yap, better.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:45PM +0300, Kirill A. Shutemov wrote:
> In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
> by the TDX module while some trigger #VE.
>
> Implement the #VE handling for EXIT_REASON_CPUID by handing it through
> the hypercall, which in turn lets the TDX module handle it by invoking
> the host VMM.
>
> More details on CPUID Virtualization can be found in the TDX module
> specification [1], the section titled "CPUID Virtualization".
The exact name and section should be enough to find the spec document
because...
>
> [1] - https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
... those links are never stable and become stale eventually. Just save
yourself the effort of adding them to commit messages.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 27, 2021 at 09:07:10AM -0600, Tom Lendacky wrote:
> Why can't this follow the cc_platform_has() logic and maybe even live in
> the cc_platform.c file (though there might be issues with that, I haven't
> really looked)?
There's an issue with declaring cc_pgprot_encrypted()/cc_pgprot_decrypted()
in cc_platform.h. It requires pgprot_t to be defined, and attempting to
include the relevant header leads to circular dependencies.
Moreover, pgprot_t is defined in different headers, depending on the
architecture.
I'm not sure how to unwind this dependency hell. Any clues?
--
Kirill A. Shutemov
On Mon, Jan 03, 2022 at 05:17:05PM +0300, Kirill A. Shutemov wrote:
> I'm not sure how to unwind this dependency hell. Any clues?
Forward-declaration maybe?
I.e., something like
struct task_struct;
at the top of arch/x86/include/asm/switch_to.h, for example...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Jan 03, 2022 at 03:29:44PM +0100, Borislav Petkov wrote:
> On Mon, Jan 03, 2022 at 05:17:05PM +0300, Kirill A. Shutemov wrote:
> > I'm not sure how to unwind this dependency hell. Any clues?
>
> Forward-declaration maybe?
>
> I.e., something like
>
> struct task_struct;
>
> at the top of arch/x86/include/asm/switch_to.h, for example...
Forward-declaration only works if you refer to the struct/union by pointer,
not by value.
And pgprot_t is not always a struct and when it is a struct it is
anonymous.
See "git grep 'typedef.*pgprot_t;'".
--
Kirill A. Shutemov
On 1/3/22 7:15 AM, Kirill A. Shutemov wrote:
> On Mon, Jan 03, 2022 at 03:29:44PM +0100, Borislav Petkov wrote:
>> On Mon, Jan 03, 2022 at 05:17:05PM +0300, Kirill A. Shutemov wrote:
>>> I'm not sure how to unwind this dependency hell. Any clues?
>> Forward-declaration maybe?
>>
>> I.e., something like
>>
>> struct task_struct;
>>
>> at the top of arch/x86/include/asm/switch_to.h, for example...
> Forward-declaration only works if you refer to the struct/union by pointer,
> not by value.
>
> And pgprot_t is not always a struct and when it is a struct it is
> anonymous.
>
> See "git grep 'typedef.*pgprot_t;'".
In the end, the new functions get used like this:
prot = pgprot_decrypted(prot);
I think they _could_ be:
pgprot_set_decrypted(&prot);
Which would let you have a declaration like this:
extern void pgprot_cc_set_decrypted(pgprot_t *prot);
It does not exactly give me warm and fuzzy feelings, but it would work
around the header problem.
On Mon, Jan 03, 2022 at 08:50:12AM -0800, Dave Hansen wrote:
> On 1/3/22 7:15 AM, Kirill A. Shutemov wrote:
> > On Mon, Jan 03, 2022 at 03:29:44PM +0100, Borislav Petkov wrote:
> >> On Mon, Jan 03, 2022 at 05:17:05PM +0300, Kirill A. Shutemov wrote:
> >>> I'm not sure how to unwind this dependency hell. Any clues?
> >> Forward-declaration maybe?
> >>
> >> I.e., something like
> >>
> >> struct task_struct;
> >>
> >> at the top of arch/x86/include/asm/switch_to.h, for example...
> > Forward-declaration only works if you refer to the struct/union by pointer,
> > not by value.
> >
> > And pgprot_t is not always a struct and when it is a struct it is
> > anonymous.
> >
> > See "git grep 'typedef.*pgprot_t;'".
>
> In the end, the new functions get used like this:
>
> prot = pgprot_decrypted(prot);
>
> I think they _could_ be:
>
> pgprot_set_decrypted(&prot);
>
> Which would let you have a declaration like this:
>
> extern void pgprot_cc_set_decrypted(pgprot_t *prot);
>
> It does not exactly give me warm and fuzzy feelings, but it would work
> around the header problem.
Apart from being ugly, I don't see how it solves anything. How would you
forward-declare a typedef?
--
Kirill A. Shutemov
On Mon, Jan 03, 2022 at 09:10:59PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jan 03, 2022 at 08:50:12AM -0800, Dave Hansen wrote:
> > On 1/3/22 7:15 AM, Kirill A. Shutemov wrote:
> > > On Mon, Jan 03, 2022 at 03:29:44PM +0100, Borislav Petkov wrote:
> > >> On Mon, Jan 03, 2022 at 05:17:05PM +0300, Kirill A. Shutemov wrote:
> > >>> I'm not sure how to unwind this dependency hell. Any clues?
> > >> Forward-declaration maybe?
> > >>
> > >> I.e., something like
> > >>
> > >> struct task_struct;
> > >>
> > >> at the top of arch/x86/include/asm/switch_to.h, for example...
> > > Forward-declaration only works if you refer to the struct/union by pointer,
> > > not by value.
> > >
> > > And pgprot_t is not always a struct and when it is a struct it is
> > > anonymous.
> > >
> > > See "git grep 'typedef.*pgprot_t;'".
> >
> > In the end, the new functions get used like this:
> >
> > prot = pgprot_decrypted(prot);
> >
> > I think they _could_ be:
> >
> > pgprot_set_decrypted(&prot);
> >
> > Which would let you have a declaration like this:
> >
> > extern void pgprot_cc_set_decrypted(pgprot_t *prot);
> >
> > It does not exactly give me warm and fuzzy feelings, but it would work
> > around the header problem.
>
> Apart from being ugly, I don't see how it solves anything. How would you
> forward-declare a typedef?
I see two possible options (I hate both): leave it defined in the per-arch
<asm/pgtable.h> or move it to <linux/mm.h> next to its user in
io_remap_pfn_range().
--
Kirill A. Shutemov
On 1/4/22 11:14 AM, Kirill A. Shutemov wrote:
> I see two possible options (I hate both): leave it defined in the per-arch
> <asm/pgtable.h> or move it to <linux/mm.h> next to its user in
> io_remap_pfn_range().
Could we do an asm-generic/pgprot.h that was basically just:
typedef struct { unsigned long pgprot; } pgprot_t;
That would cover probably 80% of the architectures. The rest of
them could define an actual asm/pgprot.h.
It doesn't seem to be *that* much work, although it is a bit of a shame
that pgtable-types.h doesn't fix this already. I've attached something
that compiles on s390 (representing a random non-x86 architecture) and x86.
On Tue, Jan 04, 2022 at 12:36:06PM -0800, Dave Hansen wrote:
> On 1/4/22 11:14 AM, Kirill A. Shutemov wrote:
> > I see two possible options (I hate both): leave it defined in the per-arch
> > <asm/pgtable.h> or move it to <linux/mm.h> next to its user in
> > io_remap_pfn_range().
>
> Could we do an asm-generic/pgprot.h that was basically just:
>
> typedef struct { unsigned long pgprot; } pgprot_t;
>
> That would cover probably 80% of the architectures. The rest of
> them could define an actual asm/pgprot.h.
A file per typedef looks like overkill.
Maybe an <asm-generic/types.h> that is included from <asm/types.h> would be
easier to justify. Some archs already have <asm/types.h>.
It is not as simple as your patches, though. See below.
>
> It doesn't seem to be *that* much work, although it is a bit of a shame
> that pgtable-types.h doesn't fix this already. I've attached something
> that compiles on s390 (representing a random non-x86 architecture) and x86.
> diff -puN arch/sparc/include/asm/page_32.h~pgprot-generic arch/sparc/include/asm/page_32.h
> --- a/arch/sparc/include/asm/page_32.h~pgprot-generic 2022-01-04 12:00:31.651180536 -0800
> +++ b/arch/sparc/include/asm/page_32.h 2022-01-04 12:00:31.659180446 -0800
> @@ -10,6 +10,7 @@
> #define _SPARC_PAGE_H
>
> #include <linux/const.h>
> +#include <asm-generic/pgprot.h>
>
> #define PAGE_SHIFT 12
> #define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> @@ -57,7 +58,6 @@ typedef struct { unsigned long iopte; }
> typedef struct { unsigned long pmd; } pmd_t;
> typedef struct { unsigned long pgd; } pgd_t;
> typedef struct { unsigned long ctxd; } ctxd_t;
> -typedef struct { unsigned long pgprot; } pgprot_t;
> typedef struct { unsigned long iopgprot; } iopgprot_t;
>
> #define pte_val(x) ((x).pte)
> @@ -85,7 +85,6 @@ typedef unsigned long iopte_t;
> typedef unsigned long pmd_t;
> typedef unsigned long pgd_t;
> typedef unsigned long ctxd_t;
> -typedef unsigned long pgprot_t;
> typedef unsigned long iopgprot_t;
>
> #define pte_val(x) (x)
Any arch that uses the STRICT_MM_TYPECHECKS hacks will get broken if compiled
without the define (as sparc is by default).
It's fixable with more crunch (and more build bugs). I think we can use the
approach from posix_types.h, where the asm-generic version defines the type
only if it was not defined by the arch code.
Is that the way we want to go?
--
Kirill A. Shutemov
On 1/4/22 4:31 PM, Kirill A. Shutemov wrote:
> On Tue, Jan 04, 2022 at 12:36:06PM -0800, Dave Hansen wrote:
>> @@ -57,7 +58,6 @@ typedef struct { unsigned long iopte; }
>> typedef struct { unsigned long pmd; } pmd_t;
>> typedef struct { unsigned long pgd; } pgd_t;
>> typedef struct { unsigned long ctxd; } ctxd_t;
>> -typedef struct { unsigned long pgprot; } pgprot_t;
>> typedef struct { unsigned long iopgprot; } iopgprot_t;
>>
>> #define pte_val(x) ((x).pte)
>> @@ -85,7 +85,6 @@ typedef unsigned long iopte_t;
>> typedef unsigned long pmd_t;
>> typedef unsigned long pgd_t;
>> typedef unsigned long ctxd_t;
>> -typedef unsigned long pgprot_t;
>> typedef unsigned long iopgprot_t;
>>
>> #define pte_val(x) (x)
>
> Any arch that uses the STRICT_MM_TYPECHECKS hacks will get broken if compiled
> without the define (as sparc is by default).
My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
pgprot_t" produces better code, but "typedef struct { unsigned long
pgprot; } pgprot_t;" produces better type checking.
I just compiled these patches on sparc with no issues.
...
> Is that the way we want to go?
I _think_ this was all a result of some review feedback from Tom
Lendacky about where the encryption-modifying pgprot helpers got placed
in the code. I don't feel strongly about it, but I'm not quite sure
that this is worth the trouble.
I'd be curious what Tom thinks now that he's gotten a peek at what it's
going to take to address his concerns.
On Tue, Jan 04, 2022 at 04:43:09PM -0800, Dave Hansen wrote:
> On 1/4/22 4:31 PM, Kirill A. Shutemov wrote:
> > On Tue, Jan 04, 2022 at 12:36:06PM -0800, Dave Hansen wrote:
> >> @@ -57,7 +58,6 @@ typedef struct { unsigned long iopte; }
> >> typedef struct { unsigned long pmd; } pmd_t;
> >> typedef struct { unsigned long pgd; } pgd_t;
> >> typedef struct { unsigned long ctxd; } ctxd_t;
> >> -typedef struct { unsigned long pgprot; } pgprot_t;
> >> typedef struct { unsigned long iopgprot; } iopgprot_t;
> >>
> >> #define pte_val(x) ((x).pte)
> >> @@ -85,7 +85,6 @@ typedef unsigned long iopte_t;
> >> typedef unsigned long pmd_t;
> >> typedef unsigned long pgd_t;
> >> typedef unsigned long ctxd_t;
> >> -typedef unsigned long pgprot_t;
> >> typedef unsigned long iopgprot_t;
> >>
> >> #define pte_val(x) (x)
> >
> > Any arch that uses the STRICT_MM_TYPECHECKS hacks will get broken if compiled
> > without the define (as sparc is by default).
>
> My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
> pgprot_t" produces better code, but "typedef struct { unsigned long
> pgprot; } pgprot_t;" produces better type checking.
Apart from pgprot_t, __pgprot() and pgrot_val() helpers are defined
differently depending on STRICT_MM_TYPECHECKS.
> I just compiled these patches on sparc with no issues.
Hm. I can't see how
#define pgprot_val(x) (x)
can work to access the value of a pgprot_t defined as a struct.
--
Kirill A. Shutemov
On Wed, Jan 05, 2022 at 03:57:20AM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 04, 2022 at 04:43:09PM -0800, Dave Hansen wrote:
> > On 1/4/22 4:31 PM, Kirill A. Shutemov wrote:
> > > On Tue, Jan 04, 2022 at 12:36:06PM -0800, Dave Hansen wrote:
> > >> @@ -57,7 +58,6 @@ typedef struct { unsigned long iopte; }
> > >> typedef struct { unsigned long pmd; } pmd_t;
> > >> typedef struct { unsigned long pgd; } pgd_t;
> > >> typedef struct { unsigned long ctxd; } ctxd_t;
> > >> -typedef struct { unsigned long pgprot; } pgprot_t;
> > >> typedef struct { unsigned long iopgprot; } iopgprot_t;
> > >>
> > >> #define pte_val(x) ((x).pte)
> > >> @@ -85,7 +85,6 @@ typedef unsigned long iopte_t;
> > >> typedef unsigned long pmd_t;
> > >> typedef unsigned long pgd_t;
> > >> typedef unsigned long ctxd_t;
> > >> -typedef unsigned long pgprot_t;
> > >> typedef unsigned long iopgprot_t;
> > >>
> > >> #define pte_val(x) (x)
> > >
> > > Any arch that uses the STRICT_MM_TYPECHECKS hacks will get broken if compiled
> > > without the define (as sparc is by default).
> >
> > My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
> > pgprot_t" produces better code, but "typedef struct { unsigned long
> > pgprot; } pgprot_t;" produces better type checking.
>
> Apart from pgprot_t, __pgprot() and pgrot_val() helpers are defined
> differently depending on STRICT_MM_TYPECHECKS.
>
> > I just compiled these patches on sparc with no issues.
>
> Hm. I can't see how
>
> #define pgprot_val(x) (x)
>
> can work to access the value of a pgprot_t defined as a struct.
Ah. I guess you compiled 64-bit sparc, right? STRICT_MM_TYPECHECKS is the
default there, unlike 32-bit.
--
Kirill A. Shutemov
On 1/4/22 4:57 PM, Kirill A. Shutemov wrote:
>> My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
>> pgprot_t" produces better code, but "typedef struct { unsigned long
>> pgprot; } pgprot_t;" produces better type checking.
> Apart from pgprot_t, __pgprot() and pgrot_val() helpers are defined
> differently depending on STRICT_MM_TYPECHECKS.
>
>> I just compiled these patches on sparc with no issues.
> Hm. I can't see how
>
> #define pgprot_val(x) (x)
>
> can work to access the value of a pgprot_t defined as a struct.
Oh, I must just be compiling with the strict type checks on all the
time. I do really wonder if these are useful these days or if the hacks
were for ancient compilers.
In any case, this would be pretty easy to fix by just removing the
!STRICT_MM_TYPECHECKS pgprot_val() and defining the STRICT_MM_TYPECHECKS
universally.
On Tue, Jan 04, 2022 at 05:38:25PM -0800, Dave Hansen wrote:
> On 1/4/22 4:57 PM, Kirill A. Shutemov wrote:
> >> My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
> >> pgprot_t" produces better code, but "typedef struct { unsigned long
> >> pgprot; } pgprot_t;" produces better type checking.
> > Apart from pgprot_t, __pgprot() and pgrot_val() helpers are defined
> > differently depending on STRICT_MM_TYPECHECKS.
> >
> >> I just compiled these patches on sparc with no issues.
> > Hm. I can't see how
> >
> > #define pgprot_val(x) (x)
> >
> > can work to access value for the pgprot_t defined as a struct.
>
> Oh, I must just be compiling with the strict type checks on all the
> time. I do really wonder if these are useful these days or if the hacks
> were for ancient compilers.
>
> In any case, this would be pretty easy to fix by just removing the
> !STRICT_MM_TYPECHECKS pgprot_val() and defining the STRICT_MM_TYPECHECKS
> universally.
There's a comment in 32-bit Sparc giving the reason STRICT_MM_TYPECHECKS is
not used there:
/* passing structs on the Sparc slow us down tremendously... */
The comment predates git, so I don't know whether it still has merit or
whether newer compilers deal with it better.
bloat-o-meter shows a non-trivial difference:
Total: Before=5342261, After=5344025, chg +0.03%
but I'm not sure whether that translates into a performance loss.
David, is the comment still relevant?
--
Kirill A. Shutemov
On Tue, Dec 14, 2021 at 06:02:46PM +0300, Kirill A. Shutemov wrote:
> In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> which will cause a VMEXIT on access and then the VMM emulating the
> instruction that caused the VMEXIT. That's not possible in TDX guests
> because it requires exposing guest register and memory state to
> potentially malicious VMM.
What does that mean exactly? Aren't TDX registers encrypted just like
SEV-ES ones? If so, they can't really be exposed...
> In TDX the MMIO regions are instead configured to trigger a #VE
> exception in the guest. The guest #VE handler then emulates the MMIO
> instruction inside the guest and converts them into a controlled
s/them/it/
> hypercall to the host.
>
> MMIO addresses can be used with any CPU instruction that accesses the
s/the //
> memory. This patch, however, covers only MMIO accesses done via io.h
"Here are covered only the MMIO accesses ... "
> helpers, such as 'readl()' or 'writeq()'.
>
> MMIO access via other means (like structure overlays) may result in
> MMIO_DECODE_FAILED and an oops.
Why? They won't cause a EXIT_REASON_EPT_VIOLATION #VE or?
> AMD SEV has the same limitations to MMIO handling.
See, the other guy is no better here. :-P
> === Potential alternative approaches ===
>
> == Paravirtualizing all MMIO ==
>
> An alternative to letting MMIO induce a #VE exception is to avoid
> the #VE in the first place. Similar to the port I/O case, it is
> theoretically possible to paravirtualize MMIO accesses.
>
> Like the exception-based approach offered by this patch, a fully
"... offered here, a fully ..."
> paravirtualized approach would be limited to MMIO users that leverage
> common infrastructure like the io.h macros.
>
> However, any paravirtual approach would be patching approximately
> 120k call sites. With a conservative overhead estimation of 5 bytes per
> call site (CALL instruction), it leads to bloating code by 600k.
>
> Many drivers will never be used in the TDX environment and the bloat
> cannot be justified.
I like the conservative approach here.
> == Patching TDX drivers ==
>
> Rather than touching the entire kernel, it might also be possible to
> just go after drivers that use MMIO in TDX guests. Right now, that's
> limited only to virtio and some x86-specific drivers.
>
> All virtio MMIO appears to be done through a single function, which
> makes virtio eminently easy to patch. Future patches will implement this
> idea,
"This will be implemented in the future, ... "
> +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> + char buffer[MAX_INSN_SIZE];
> + unsigned long *reg, val = 0;
> + struct insn insn = {};
> + enum mmio_type mmio;
> + int size;
> + u8 sign_byte;
> + bool err;
> +
> + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> + return -EFAULT;
> +
> + insn_init(&insn, buffer, MAX_INSN_SIZE, 1);
> + insn_get_length(&insn);
There is insn_decode() - see how it is used and use it here pls.
> + case MMIO_READ_SIGN_EXTEND:
> + err = tdx_mmio_read(size, ve->gpa, &val);
> + if (err)
> + break;
> +
> + if (size == 1)
> + sign_byte = (val & 0x80) ? 0xff : 0x00;
> + else
> + sign_byte = (val & 0x8000) ? 0xff : 0x00;
> +
> + /* Sign extend based on operand size */
> + memset(reg, sign_byte, insn.opnd_bytes);
> + memcpy(reg, &val, size);
> + break;
You can simplify this a bit:
case MMIO_READ_SIGN_EXTEND: {
u8 sign_byte = 0, msb = 7;
err = tdx_mmio_read(size, ve->gpa, &val);
if (err)
break;
if (size > 1)
msb = 15;
if (val & BIT(msb))
sign_byte = -1;
/* Sign extend based on operand size */
memset(reg, sign_byte, insn.opnd_bytes);
memcpy(reg, &val, size);
break;
}
> + case MMIO_MOVS:
> + case MMIO_DECODE_FAILED:
> + return -EFAULT;
> + }
> +
> + if (err)
> + return -EFAULT;
<---- newline here.
> + return insn.length;
> +}
> +
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 1/4/22 6:43 PM, Dave Hansen wrote:
> On 1/4/22 4:31 PM, Kirill A. Shutemov wrote:
>> On Tue, Jan 04, 2022 at 12:36:06PM -0800, Dave Hansen wrote:
>>> @@ -57,7 +58,6 @@ typedef struct { unsigned long iopte; }
>>> typedef struct { unsigned long pmd; } pmd_t;
>>> typedef struct { unsigned long pgd; } pgd_t;
>>> typedef struct { unsigned long ctxd; } ctxd_t;
>>> -typedef struct { unsigned long pgprot; } pgprot_t;
>>> typedef struct { unsigned long iopgprot; } iopgprot_t;
>>>
>>> #define pte_val(x) ((x).pte)
>>> @@ -85,7 +85,6 @@ typedef unsigned long iopte_t;
>>> typedef unsigned long pmd_t;
>>> typedef unsigned long pgd_t;
>>> typedef unsigned long ctxd_t;
>>> -typedef unsigned long pgprot_t;
>>> typedef unsigned long iopgprot_t;
>>>
>>> #define pte_val(x) (x)
>>
>> Any arch that uses the STRICT_MM_TYPECHECKS hacks will get broken if compiled
>> without the define (as sparc is by default).
>
> My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
> pgprot_t" produces better code, but "typedef struct { unsigned long
> pgprot; } pgprot_t;" produces better type checking.
>
> I just compiled these patches on sparc with no issues.
>
> ...
>> Is this the way we want to go?
>
> I _think_ this was all a result of some review feedback from Tom
> Lendacky about where the encryption-modifying pgprot helpers got placed
> in the code. I don't feel strongly about it, but I'm not quite sure
> that this is worth the trouble.
>
> I'd be curious what Tom thinks now that he's gotten a peek at what it's
> going to take to address his concerns.
I have vague memories of pgprot_t and what a pain it could be, which is
why my feedback suggested putting it in cc_platform.c, but said there
might be issues :)
I'm fine with it living somewhere else, just thought it would be nice to
have everything consolidated, if possible.
Thanks,
Tom
>
On Wed, Jan 05, 2022 at 11:37:58AM +0100, Borislav Petkov wrote:
> On Tue, Dec 14, 2021 at 06:02:46PM +0300, Kirill A. Shutemov wrote:
> > In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> > which will cause a VMEXIT on access and then the VMM emulating the
> > instruction that caused the VMEXIT. That's not possible in TDX guests
> > because it requires exposing guest register and memory state to
> > potentially malicious VMM.
>
> What does that mean exactly? Aren't TDX registers encrypted just like
> SEV-ES ones? If so, they can't really be exposed...
Not encrypted, but saved/restored by the TDX module. But yes, they cannot be
exposed (without guest intent).
I'm talking here about *why* the traditional way to handle MMIO -- on the VMM
side -- doesn't work for TDX. It's not safe with an untrusted VMM.
> > In TDX the MMIO regions are instead configured to trigger a #VE
> > exception in the guest. The guest #VE handler then emulates the MMIO
> > instruction inside the guest and converts them into a controlled
>
> s/them/it/
>
> > hypercall to the host.
> >
> > MMIO addresses can be used with any CPU instruction that accesses the
>
> s/the //
>
> > memory. This patch, however, covers only MMIO accesses done via io.h
>
> "Here are covered only the MMIO accesses ... "
>
> > helpers, such as 'readl()' or 'writeq()'.
> >
> > MMIO access via other means (like structure overlays) may result in
> > MMIO_DECODE_FAILED and an oops.
>
> Why? They won't cause a EXIT_REASON_EPT_VIOLATION #VE or?
readX()/writeX() helpers limit the range of instructions which can trigger
MMIO. That makes MMIO instruction emulation feasible. Raw access to an MMIO
region allows the compiler to generate whatever instruction it wants.
Supporting all possible instructions is a task of a different scope.
>
> > AMD SEV has the same limitations to MMIO handling.
>
> See, the other guy is no better here. :-P
... but it works fine :P
> > +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > + char buffer[MAX_INSN_SIZE];
> > + unsigned long *reg, val = 0;
> > + struct insn insn = {};
> > + enum mmio_type mmio;
> > + int size;
> > + u8 sign_byte;
> > + bool err;
> > +
> > + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> > + return -EFAULT;
> > +
> > + insn_init(&insn, buffer, MAX_INSN_SIZE, 1);
> > + insn_get_length(&insn);
>
> There is insn_decode() - see how it is used and use it here pls.
Right, missed that.
> > + case MMIO_READ_SIGN_EXTEND:
> > + err = tdx_mmio_read(size, ve->gpa, &val);
> > + if (err)
> > + break;
> > +
> > + if (size == 1)
> > + sign_byte = (val & 0x80) ? 0xff : 0x00;
> > + else
> > + sign_byte = (val & 0x8000) ? 0xff : 0x00;
> > +
> > + /* Sign extend based on operand size */
> > + memset(reg, sign_byte, insn.opnd_bytes);
> > + memcpy(reg, &val, size);
> > + break;
>
> You can simplify this a bit:
>
> case MMIO_READ_SIGN_EXTEND: {
> u8 sign_byte = 0, msb = 7;
>
> err = tdx_mmio_read(size, ve->gpa, &val);
> if (err)
> break;
>
> if (size > 1)
> msb = 15;
>
> if (val & BIT(msb))
> sign_byte = -1;
>
> /* Sign extend based on operand size */
> memset(reg, sign_byte, insn.opnd_bytes);
> memcpy(reg, &val, size);
> break;
> }
Okay, will do.
--
Kirill A. Shutemov
On Wed, Jan 05, 2022 at 08:16:49AM -0600, Tom Lendacky wrote:
> On 1/4/22 6:43 PM, Dave Hansen wrote:
> > On 1/4/22 4:31 PM, Kirill A. Shutemov wrote:
> > > On Tue, Jan 04, 2022 at 12:36:06PM -0800, Dave Hansen wrote:
> > > > @@ -57,7 +58,6 @@ typedef struct { unsigned long iopte; }
> > > > typedef struct { unsigned long pmd; } pmd_t;
> > > > typedef struct { unsigned long pgd; } pgd_t;
> > > > typedef struct { unsigned long ctxd; } ctxd_t;
> > > > -typedef struct { unsigned long pgprot; } pgprot_t;
> > > > typedef struct { unsigned long iopgprot; } iopgprot_t;
> > > > #define pte_val(x) ((x).pte)
> > > > @@ -85,7 +85,6 @@ typedef unsigned long iopte_t;
> > > > typedef unsigned long pmd_t;
> > > > typedef unsigned long pgd_t;
> > > > typedef unsigned long ctxd_t;
> > > > -typedef unsigned long pgprot_t;
> > > > typedef unsigned long iopgprot_t;
> > > > #define pte_val(x) (x)
> > >
> > > Any arch that uses STRICT_MM_TYPECHECKS hacks will get broken if compiled
> > > without the define (as sparc by default).
> >
> > My read of STRICT_MM_TYPECHECKS was that "typedef unsigned long
> > pgprot_t" produces better code, but "typedef struct { unsigned long
> > pgprot; } pgprot_t;" produces better type checking.
> >
> > I just compiled these patches on sparc with no issues.
> >
> > ...
> > > Is this the way we want to go?
> >
> > I _think_ this was all a result of some review feedback from Tom
> > Lendacky about where the encryption-modifying pgprot helpers got placed
> > in the code. I don't feel strongly about it, but I'm not quite sure
> > that this is worth the trouble.
> >
> > I'd be curious what Tom thinks now that he's gotten a peek at what it's
> > going to take to address his concerns.
>
> I have vague memories of pgprot_t and what a pain it could be, which is why
> my feedback suggested putting it in cc_platform.c, but said there might be
> issues :)
>
> I'm fine with it living somewhere else, just thought it would be nice to
> have everything consolidated, if possible.
In this case I would rather leave it in <asm/pgtable.h>. We can still
rename it to cc_pgprot_decrypted()/cc_pgprot_encrypted().
--
Kirill A. Shutemov
On Wed, Dec 15, 2021 at 03:31:16PM -0800, Josh Poimboeuf wrote:
> > +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>
> Similarly, tdx_handle_mmio() returns (int) 0 for success, while other
> tdx_handle_*() functions return (bool) true for success. Also
> confusing.
Looked at this again; you read it wrong. tdx_handle_mmio() returns the size of
the instruction it handled, so we can advance RIP, and <= 0 on error. That is
consistent with other #VE handlers that return positive (true) on success.
--
Kirill A. Shutemov
On Wed, Jan 05, 2022 at 06:43:11PM +0300, Kirill A. Shutemov wrote:
> Not encrypted, but saved/restored by the TDX module. But yes, they cannot be
> exposed (without guest intent).
>
> I'm talking here about *why* the traditional way to handle MMIO -- on the VMM
> side -- doesn't work for TDX. It's not safe with an untrusted VMM.
Lemme see if I understand this correctly: TDX module saves/restores
guest registers so a malicious hypervisor cannot access them? And that's
why you can't do the traditional way MMIO is done?
> readX()/writeX() helpers limit the range of instructions which can trigger
> MMIO. That makes MMIO instruction emulation feasible. Raw access to an MMIO
> region allows the compiler to generate whatever instruction it wants.
> Supporting all possible instructions is a task of a different scope.
Yap, please add that to the commit message.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:47PM +0300, Kirill A. Shutemov wrote:
> @@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> lines = boot_params->screen_info.orig_video_lines;
> cols = boot_params->screen_info.orig_video_cols;
>
> + /*
> + * Detect if we are running in TDX guest environment.
Please use passive voice: no "we" or "I", etc.
> + *
> + * It has to be done before console_init() to use paravirtualized
^
in order
...
> +void early_tdx_detect(void)
> +{
> + u32 eax, sig[3];
> +
> + if (cpuid_max_leaf() < TDX_CPUID_LEAF_ID)
What's the use of that helper?
AFAICT, none, because you call cpuid_count() below anyway. And you use that
helper only here.
IOW, you can simply use cpuid_count() and not add it.
> + return;
> +
> + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
> +
> + if (memcmp(TDX_IDENT, sig, 12))
> + return;
> +
> + /* Cache TDX guest feature status */
> + tdx_guest_detected = true;
> +}
...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Jan 07, 2022 at 02:46:41PM +0100, Borislav Petkov wrote:
> On Wed, Jan 05, 2022 at 06:43:11PM +0300, Kirill A. Shutemov wrote:
> > Not encrypted, but saved/restored by the TDX module. But yes, they cannot be
> > exposed (without guest intent).
> >
> > I'm talking here about *why* the traditional way to handle MMIO -- on the VMM
> > side -- doesn't work for TDX. It's not safe with an untrusted VMM.
>
> Lemme see if I understand this correctly: TDX module saves/restores
> guest registers so a malicious hypervisor cannot access them? And that's
> why you can't do the traditional way MMIO is done?
To emulate an instruction, the emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and to see the RIP of the faulted instruction.
- Read access to the memory where the instruction is placed, to see what to
emulate. In this case it is guest kernel text.
Neither of them is available to the VMM in a TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- Memory is encrypted with a TD-private key. The CPU disallows software other
than the TDX module and TDs from making memory accesses using the
private key.
>
> > readX()/writeX() helpers limit the range of instructions which can trigger
> > MMIO. That makes MMIO instruction emulation feasible. Raw access to an MMIO
> > region allows the compiler to generate whatever instruction it wants.
> > Supporting all possible instructions is a task of a different scope.
>
> Yap, please add that to the commit message.
Okay.
--
Kirill A. Shutemov
On Fri, Jan 07, 2022 at 08:49:26PM +0300, Kirill A. Shutemov wrote:
> To emulate an instruction, the emulator needs two things:
>
> - R/W access to the register file to read/modify instruction arguments
> and to see the RIP of the faulted instruction.
>
> - Read access to the memory where the instruction is placed, to see what to
> emulate. In this case it is guest kernel text.
>
> Neither of them is available to the VMM in a TDX environment:
>
> - Register file is never exposed to VMM. When a TD exits to the module,
> it saves registers into the state-save area allocated for that TD.
> The module then scrubs these registers before returning execution
> control to the VMM, to help prevent leakage of TD state.
>
> - Memory is encrypted with a TD-private key. The CPU disallows software other
> than the TDX module and TDs from making memory accesses using the
> private key.
Thanks, that's very helpful info. It would be nice to have it in the
commit message.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 14, 2021 at 06:02:48PM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> Port IO triggers a #VE exception in TDX guests. During normal runtime,
> the kernel will handle those exceptions for any port IO.
>
> But for the early code in the decompressor, #VE cannot be used because
> the IDT needed for handling the exception is not set up.
... yet.
Well, we're setting up an IDT twice in
arch/x86/boot/compressed/idt_64.c
as early as startup_64 for SEV. And the second stage one
do_boot_stage2_vc() handles port IO too.
Can't you hook in your VE handler there too?
> Replace IN/OUT instructions with TDX IO hypercalls by defining helper
> macros __in/__out and by re-defining them in the decompressor code.
> Also, since TDX IO hypercall requires an IO size parameter, allow
> __in/__out macros to accept size as an input parameter.
Please end function/macro names with parentheses. I think in this
particular case you wanna say
"__in*()/__out*() macros"
When a function is mentioned in the changelog, either the text body or the
subject line, please use the format 'function_name()'. Omitting the
brackets after the function name can be ambiguous::
Subject: subsys/component: Make reservation_count static
reservation_count is only used in reservation_stats. Make it static.
The variant with brackets is more precise::
Subject: subsys/component: Make reservation_count() static
reservation_count() is only called from reservation_stats(). Make it
static.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Jan 13, 2022 at 02:51:54PM +0100, Borislav Petkov wrote:
> On Tue, Dec 14, 2021 at 06:02:48PM +0300, Kirill A. Shutemov wrote:
> > From: Kuppuswamy Sathyanarayanan <[email protected]>
> >
> > Port IO triggers a #VE exception in TDX guests. During normal runtime,
> > the kernel will handle those exceptions for any port IO.
> >
> > But for the early code in the decompressor, #VE cannot be used because
> > the IDT needed for handling the exception is not set up.
> ... yet.
>
> Well, we're setting up an IDT twice in
>
> arch/x86/boot/compressed/idt_64.c
>
> as early as startup_64 for SEV. And the second stage one
> do_boot_stage2_vc() handles port IO too.
>
> Can't you hook in your VE handler there too?
We certainly can. But do we want to?
IIUC, SEV has to handle #VC very early to deal with CPUID, and covering
port I/O via the exception as well is a logical step.
We had some back and forth on #VE vs direct hypercalls for port I/O in the
decompressor and we settled on direct hypercalls.
Adding all the plumbing for #VE just to deal with port I/O was considered
overkill. And I expect debugging exception infrastructure is harder than
debugging direct hypercalls.
Do you see it differently? Do you want to switch to #VE here?
> > Replace IN/OUT instructions with TDX IO hypercalls by defining helper
> > macros __in/__out and by re-defining them in the decompressor code.
> > Also, since TDX IO hypercall requires an IO size parameter, allow
> > __in/__out macros to accept size as an input parameter.
>
> Please end function/macro names with parentheses. I think in this
> particular case you wanna say
>
> "__in*()/__out*() macros"
>
> When a function is mentioned in the changelog, either the text body or the
> subject line, please use the format 'function_name()'. Omitting the
> brackets after the function name can be ambiguous::
>
> Subject: subsys/component: Make reservation_count static
>
> reservation_count is only used in reservation_stats. Make it static.
>
> The variant with brackets is more precise::
>
> Subject: subsys/component: Make reservation_count() static
>
> reservation_count() is only called from reservation_stats(). Make it
> static.
Okay, got it.
--
Kirill A. Shutemov
On Sat, Jan 15, 2022 at 04:01:55AM +0300, Kirill A. Shutemov wrote:
> Do you see it differently? Do you want to switch to #VE here?
I'm just comparing to what SEV does and wondering why you guys do it
differently. But if you think hypercalls are easier, fine by me.
The thing I don't like about that patch is that you're mixing up kernel proper
io helpers with the decompressor code instead of modifying the ones in
arch/x86/boot/boot.h.
We need to hammer out how the code sharing between kernel proper and the
decompressor should be done but that ain't it, especially if there are
already special io helpers in the decompressor.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Sat, Jan 15, 2022 at 01:16:00PM +0100, Borislav Petkov wrote:
> On Sat, Jan 15, 2022 at 04:01:55AM +0300, Kirill A. Shutemov wrote:
> > Do you see it differently? Do you want to switch to #VE here?
>
> I'm just comparing to what SEV does and wondering why you guys do it
> differently. But if you think hypercalls are easier, fine by me.
>
> The thing I don't like about that patch is that you're mixing up kernel proper
> io helpers with the decompressor code instead of modifying the ones in
> arch/x86/boot/boot.h.
arch/x86/boot and arch/x86/boot/compressed are separate linking domains.
boot/ uses its own implementation while boot/compressed uses the implementation
from <asm/io.h>. Decoupling boot/compressed from <asm/io.h> requires a hack.
See #define _ACPI_IO_H_ below.
And even after that we cannot directly use the implementation in boot/ since
we would need to make it aware of TDX. That's not needed beyond
boot/compressed.
> We need to hammer out how the code sharing between kernel proper and the
> decompressor should be done but that ain't it, especially if there are
> already special io helpers in the decompressor.
What about the patch below?
I've added another (yes, a third) implementation of outb()/inb() for
boot/compressed (we don't need the rest of the io helpers there).
Looks cleaner to me.
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 22a2a6cc2ab4..1bfe30ebadbe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -99,6 +99,7 @@ endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/early_serial_console.c b/arch/x86/boot/compressed/early_serial_console.c
index 261e81fb9582..1b842d04e687 100644
--- a/arch/x86/boot/compressed/early_serial_console.c
+++ b/arch/x86/boot/compressed/early_serial_console.c
@@ -1,4 +1,5 @@
#include "misc.h"
+#include "io.h"
int early_serial_base;
diff --git a/arch/x86/boot/compressed/io.h b/arch/x86/boot/compressed/io.h
new file mode 100644
index 000000000000..5e9de1e781d7
--- /dev/null
+++ b/arch/x86/boot/compressed/io.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_IO_H
+#define BOOT_COMPRESSED_IO_H
+
+#include "tdx.h"
+
+static inline void outb(u8 v, u16 port)
+{
+ if (early_is_tdx_guest())
+ tdx_io_out(1, port, v);
+ else
+ asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
+}
+
+static inline u8 inb(u16 port)
+{
+ u8 v;
+ if (early_is_tdx_guest())
+ v = tdx_io_in(1, port);
+ else
+ asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
+ return v;
+}
+
+#endif
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index d8373d766672..dd97d9ca73db 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -15,6 +15,8 @@
#include "misc.h"
#include "error.h"
#include "pgtable.h"
+#include "tdx.h"
+#include "io.h"
#include "../string.h"
#include "../voffset.h"
#include <asm/bootparam_utils.h>
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..f3c10ae33c45 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -22,13 +22,13 @@
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
-#include <linux/io.h>
#include <asm/page.h>
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
-#include "tdx.h"
+/* Avoid pulling outb()/inb() from <asm/io.h> */
+#define _ACPI_IO_H_
#define BOOT_CTYPE_H
#include <linux/acpi.h>
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..aafadc136c88
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
index 18970c09512e..6d6799c1daec 100644
--- a/arch/x86/boot/compressed/tdx.h
+++ b/arch/x86/boot/compressed/tdx.h
@@ -6,11 +6,37 @@
#include <linux/types.h>
#ifdef CONFIG_INTEL_TDX_GUEST
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>
+#include <asm/tdx.h>
+
void early_tdx_detect(void);
bool early_is_tdx_guest(void);
+
+static inline unsigned int tdx_io_in(int size, int port)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 0, port, 0, &out);
+
+ return out.r10 ? UINT_MAX : out.r11;
+}
+
+static inline void tdx_io_out(int size, int port, u64 value)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 1, port, value, &out);
+}
+
#else
static inline void early_tdx_detect(void) { };
static inline bool early_is_tdx_guest(void) { return false; }
+static inline unsigned int tdx_io_in(int size, int port) { return 0; }
+static inline void tdx_io_out(int size, int port, u64 value) { }
#endif
#endif
--
Kirill A. Shutemov
On Mon, Jan 17, 2022 at 05:39:20PM +0300, Kirill A. Shutemov wrote:
> arch/x86/boot and arch/x86/boot/compressed are separate linking domains.
> boot/ uses its own implementation while boot/compressed uses the implementation
> from <asm/io.h>. Decoupling boot/compressed from <asm/io.h> requires a hack.
> See #define _ACPI_IO_H_ below.
I am painfully aware. And the need to share code with kernel proper has
grown quite the nasties in the meantime.
So, we talked about what to do here recently and the suggestion was to
librarize common functionality so that
1. it can be shared between the two.
2. changes in the kernel proper headers do not break the boot stubs.
So, instead of yet another duplication, I think what we should do is
start growing a shared/ header namespace, i.e.,
arch/x86/include/asm/shared/
for example, and put there common, well, shared, functionality between
boot stubs and kernel proper. Stuff which is basic and generic enough so
that it can be shared by both.
That would be a prepatch.
Then, ontop, I'm wondering if it would be cleaner to have in/out
function pointers in the boot stub which are assigned by default to
those __in/__out generic shared handlers and then early_tdx_detect()
would assign to them tdx_io_{in,out} when it detects it is running as a
TDX guest.
Hmmm...?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Jan 17, 2022 at 07:32:51PM +0100, Borislav Petkov wrote:
> On Mon, Jan 17, 2022 at 05:39:20PM +0300, Kirill A. Shutemov wrote:
> > arch/x86/boot and arch/x86/boot/compressed are separate linking domains.
> > boot/ uses its own implementation while boot/compressed uses the implementation
> > from <asm/io.h>. Decoupling boot/compressed from <asm/io.h> requires a hack.
> > See #define _ACPI_IO_H_ below.
>
> I am painfully aware. And the need to share code with kernel proper has
> grown quite the nasties in the meantime.
>
> So, we talked about what to do here recently and the suggestion was to
> librarize common functionality so that
>
> 1. it can be shared between the two.
> 2. changes in the kernel proper headers do not break the boot stubs.
>
> So, instead of yet another duplication, I think what we should do is
> start growing a shared/ header namespace, i.e.,
>
> arch/x86/include/asm/shared/
>
> for example, and put there common, well, shared, functionality between
> boot stubs and kernel proper. Stuff which is basic and generic enough so
> that it can be shared by both.
>
> That would be a prepatch.
>
> Then, ontop, I'm wondering if it would be cleaner to have in/out
> function pointers in the boot stub which are assigned by default to
> those __in/__out generic shared handlers and then early_tdx_detect()
> would assign to them tdx_io_{in,out} when it detects it is running as a
> TDX guest.
Could you take a look at whether the diff below is the right direction?
If yes, I will prepare proper patches. My plan is 3 patches: introduce
<asm/shared/io.h>, add 'struct port_io_ops' for early boot, and hook up an
alternative 'struct port_io_ops' for TDX.
diff --git a/arch/x86/boot/a20.c b/arch/x86/boot/a20.c
index a2b6b428922a..c83a0ae0d1df 100644
--- a/arch/x86/boot/a20.c
+++ b/arch/x86/boot/a20.c
@@ -12,6 +12,7 @@
*/
#include "boot.h"
+#include "io.h"
#define MAX_8042_LOOPS 100000
#define MAX_8042_FF 32
@@ -25,7 +26,7 @@ static int empty_8042(void)
while (loops--) {
io_delay();
- status = inb(0x64);
+ status = port_io_ops.inb(0x64);
if (status == 0xff) {
/* FF is a plausible, but very unlikely status */
if (!--ffs)
@@ -34,7 +35,7 @@ static int empty_8042(void)
if (status & 1) {
/* Read and discard input data */
io_delay();
- (void)inb(0x60);
+ (void)port_io_ops.inb(0x60);
} else if (!(status & 2)) {
/* Buffers empty, finished! */
return 0;
@@ -99,13 +100,13 @@ static void enable_a20_kbc(void)
{
empty_8042();
- outb(0xd1, 0x64); /* Command write */
+ port_io_ops.outb(0xd1, 0x64); /* Command write */
empty_8042();
- outb(0xdf, 0x60); /* A20 on */
+ port_io_ops.outb(0xdf, 0x60); /* A20 on */
empty_8042();
- outb(0xff, 0x64); /* Null command, but UHCI wants it */
+ port_io_ops.outb(0xff, 0x64); /* Null command, but UHCI wants it */
empty_8042();
}
@@ -113,10 +114,10 @@ static void enable_a20_fast(void)
{
u8 port_a;
- port_a = inb(0x92); /* Configuration port A */
+ port_a = port_io_ops.inb(0x92); /* Configuration port A */
port_a |= 0x02; /* Enable A20 */
port_a &= ~0x01; /* Do not reset machine */
- outb(port_a, 0x92);
+ port_io_ops.outb(port_a, 0x92);
}
/*
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 34c9dbb6a47d..27ce7cef13aa 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -18,11 +18,14 @@
#ifndef __ASSEMBLY__
+#undef CONFIG_PARAVIRT
+
#include <linux/stdarg.h>
#include <linux/types.h>
#include <linux/edd.h>
#include <asm/setup.h>
#include <asm/asm.h>
+#include <asm/shared/io.h>
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
@@ -35,40 +38,6 @@ extern struct boot_params boot_params;
#define cpu_relax() asm volatile("rep; nop")
-/* Basic port I/O */
-static inline void outb(u8 v, u16 port)
-{
- asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u8 inb(u16 port)
-{
- u8 v;
- asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outw(u16 v, u16 port)
-{
- asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u16 inw(u16 port)
-{
- u16 v;
- asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outl(u32 v, u16 port)
-{
- asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u32 inl(u16 port)
-{
- u32 v;
- asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
static inline void io_delay(void)
{
const u16 DELAY_PORT = 0x80;
diff --git a/arch/x86/boot/compressed/io.h b/arch/x86/boot/compressed/io.h
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index d8373d766672..48de56f2219d 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -15,9 +15,12 @@
#include "misc.h"
#include "error.h"
#include "pgtable.h"
+#include "tdx.h"
+#include "io.h"
#include "../string.h"
#include "../voffset.h"
#include <asm/bootparam_utils.h>
+#include <asm/shared/io.h>
/*
* WARNING!!
@@ -47,6 +50,8 @@ void *memmove(void *dest, const void *src, size_t n);
*/
struct boot_params *boot_params;
+struct port_io_ops port_io_ops;
+
memptr free_mem_ptr;
memptr free_mem_end_ptr;
@@ -103,10 +108,12 @@ static void serial_putchar(int ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((port_io_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
+ --timeout) {
cpu_relax();
+ }
- outb(ch, early_serial_base + TXR);
+ port_io_ops.outb(ch, early_serial_base + TXR);
}
void __putstr(const char *s)
@@ -152,10 +159,10 @@ void __putstr(const char *s)
boot_params->screen_info.orig_y = y;
pos = (x + cols * y) * 2; /* Update cursor position */
- outb(14, vidport);
- outb(0xff & (pos >> 9), vidport+1);
- outb(15, vidport);
- outb(0xff & (pos >> 1), vidport+1);
+ port_io_ops.outb(14, vidport);
+ port_io_ops.outb(0xff & (pos >> 9), vidport+1);
+ port_io_ops.outb(15, vidport);
+ port_io_ops.outb(0xff & (pos >> 1), vidport+1);
}
void __puthex(unsigned long value)
@@ -370,6 +377,15 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;
+ port_io_ops = (const struct port_io_ops){
+ .inb = inb,
+ .inw = inw,
+ .inl = inl,
+ .outb = outb,
+ .outw = outw,
+ .outl = outl,
+ };
+
/*
* Detect TDX guest environment.
*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 6502adf71a2f..74951befb240 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -19,26 +19,23 @@
/* cpu_feature_enabled() cannot be used this early */
#define USE_EARLY_PGTABLE_L5
-/*
- * Redefine __in/__out macros via tdx.h before including
- * linux/io.h.
- */
-#include "tdx.h"
-
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
-#include <linux/io.h>
#include <asm/page.h>
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
+/* Avoid pulling outb()/inb() from <asm/io.h> */
+#define _ACPI_IO_H_
+
#define BOOT_CTYPE_H
#include <linux/acpi.h>
#define BOOT_BOOT_H
#include "../ctype.h"
+#include "../io.h"
#ifdef CONFIG_X86_64
#define memptr long
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 50c8145bd0f3..e0a8b054006f 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -5,12 +5,71 @@
#include "../cpuflags.h"
#include "../string.h"
+#include "../io.h"
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>
+#include <asm/shared/io.h>
+#include <asm/tdx.h>
#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "
static bool tdx_guest_detected;
+bool early_is_tdx_guest(void)
+{
+ return tdx_guest_detected;
+}
+
+static inline unsigned int tdx_io_in(int size, int port)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 0, port, 0, &out);
+
+ return out.r10 ? UINT_MAX : out.r11;
+}
+
+static inline void tdx_io_out(int size, int port, u64 value)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 1, port, value, &out);
+}
+
+static inline unsigned char tdx_inb(int port)
+{
+ return tdx_io_in(1, port);
+}
+
+static inline unsigned short tdx_inw(int port)
+{
+ return tdx_io_in(2, port);
+}
+
+static inline unsigned int tdx_inl(int port)
+{
+ return tdx_io_in(4, port);
+}
+
+static inline void tdx_outb(unsigned char value, int port)
+{
+ tdx_io_out(1, port, value);
+}
+
+static inline void tdx_outw(unsigned short value, int port)
+{
+ tdx_io_out(2, port, value);
+}
+
+static inline void tdx_outl(unsigned int value, int port)
+{
+ tdx_io_out(4, port, value);
+}
+
void early_tdx_detect(void)
{
u32 eax, sig[3];
@@ -22,9 +81,13 @@ void early_tdx_detect(void)
/* Cache TDX guest feature status */
tdx_guest_detected = true;
-}
-bool early_is_tdx_guest(void)
-{
- return tdx_guest_detected;
+ port_io_ops = (struct port_io_ops) {
+ .inb = tdx_inb,
+ .inw = tdx_inw,
+ .inl = tdx_inl,
+ .outb = tdx_outb,
+ .outw = tdx_outw,
+ .outl = tdx_outl,
+ };
}
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
index 5d39608a2af4..613b2aa0986f 100644
--- a/arch/x86/boot/compressed/tdx.h
+++ b/arch/x86/boot/compressed/tdx.h
@@ -7,51 +7,9 @@
#ifdef CONFIG_INTEL_TDX_GUEST
-#include <vdso/limits.h>
-#include <uapi/asm/vmx.h>
-#include <asm/tdx.h>
-
void early_tdx_detect(void);
bool early_is_tdx_guest(void);
-static inline unsigned int tdx_io_in(int size, int port)
-{
- struct tdx_hypercall_output out;
-
- __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
- size, 0, port, 0, &out);
-
- return out.r10 ? UINT_MAX : out.r11;
-}
-
-static inline void tdx_io_out(int size, int port, u64 value)
-{
- struct tdx_hypercall_output out;
-
- __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
- size, 1, port, value, &out);
-}
-
-#define __out(bwl, bw, sz) \
-do { \
- if (early_is_tdx_guest()) { \
- tdx_io_out(sz, port, value); \
- } else { \
- asm volatile("out" #bwl " %" #bw "0, %w1" : : \
- "a"(value), "Nd"(port)); \
- } \
-} while (0)
-
-#define __in(bwl, bw, sz) \
-do { \
- if (early_is_tdx_guest()) { \
- value = tdx_io_in(sz, port); \
- } else { \
- asm volatile("in" #bwl " %w1, %" #bw "0" : \
- "=a"(value) : "Nd"(port)); \
- } \
-} while (0)
-
#else
static inline void early_tdx_detect(void) { };
static inline bool early_is_tdx_guest(void) { return false; }
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c3de8b..0afe624db9f6 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -4,6 +4,7 @@
* included from both the compressed kernel and the regular kernel.
*/
#include "boot.h"
+#include "io.h"
#define DEFAULT_SERIAL_PORT 0x3f8 /* ttyS0 */
@@ -28,17 +29,17 @@ static void early_serial_init(int port, int baud)
unsigned char c;
unsigned divisor;
- outb(0x3, port + LCR); /* 8n1 */
- outb(0, port + IER); /* no interrupt */
- outb(0, port + FCR); /* no fifo */
- outb(0x3, port + MCR); /* DTR + RTS */
+ port_io_ops.outb(0x3, port + LCR); /* 8n1 */
+ port_io_ops.outb(0, port + IER); /* no interrupt */
+ port_io_ops.outb(0, port + FCR); /* no fifo */
+ port_io_ops.outb(0x3, port + MCR); /* DTR + RTS */
divisor = 115200 / baud;
- c = inb(port + LCR);
- outb(c | DLAB, port + LCR);
- outb(divisor & 0xff, port + DLL);
- outb((divisor >> 8) & 0xff, port + DLH);
- outb(c & ~DLAB, port + LCR);
+ c = port_io_ops.inb(port + LCR);
+ port_io_ops.outb(c | DLAB, port + LCR);
+ port_io_ops.outb(divisor & 0xff, port + DLL);
+ port_io_ops.outb((divisor >> 8) & 0xff, port + DLH);
+ port_io_ops.outb(c & ~DLAB, port + LCR);
early_serial_base = port;
}
@@ -104,11 +105,11 @@ static unsigned int probe_baud(int port)
unsigned char lcr, dll, dlh;
unsigned int quot;
- lcr = inb(port + LCR);
- outb(lcr | DLAB, port + LCR);
- dll = inb(port + DLL);
- dlh = inb(port + DLH);
- outb(lcr, port + LCR);
+ lcr = port_io_ops.inb(port + LCR);
+ port_io_ops.outb(lcr | DLAB, port + LCR);
+ dll = port_io_ops.inb(port + DLL);
+ dlh = port_io_ops.inb(port + DLH);
+ port_io_ops.outb(lcr, port + LCR);
quot = (dlh << 8) | dll;
return BASE_BAUD / quot;
diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
new file mode 100644
index 000000000000..934666ec8a22
--- /dev/null
+++ b/arch/x86/boot/io.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_IO_H
+#define BOOT_IO_H
+
+#include <linux/compiler_attributes.h>
+
+struct port_io_ops {
+ unsigned char (*inb)(int port);
+ unsigned short (*inw)(int port);
+ unsigned int (*inl)(int port);
+ void (*outb)(unsigned char v, int port);
+ void (*outw)(unsigned short v, int port);
+ void (*outl)(unsigned int v, int port);
+};
+
+extern struct port_io_ops port_io_ops;
+
+#endif
diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c
index e3add857c2c9..f0b8539d9a62 100644
--- a/arch/x86/boot/main.c
+++ b/arch/x86/boot/main.c
@@ -13,10 +13,13 @@
#include <linux/build_bug.h>
#include "boot.h"
+#include "io.h"
#include "string.h"
struct boot_params boot_params __attribute__((aligned(16)));
+struct port_io_ops port_io_ops;
+
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
@@ -133,6 +136,15 @@ static void init_heap(void)
void main(void)
{
+ port_io_ops = (struct port_io_ops){
+ .inb = inb,
+ .inw = inw,
+ .inl = inl,
+ .outb = outb,
+ .outw = outw,
+ .outl = outl,
+ };
+
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c
index 40031a614712..b25e21a1e0c9 100644
--- a/arch/x86/boot/pm.c
+++ b/arch/x86/boot/pm.c
@@ -11,6 +11,7 @@
*/
#include "boot.h"
+#include "io.h"
#include <asm/segment.h>
/*
@@ -25,7 +26,7 @@ static void realmode_switch_hook(void)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
- outb(0x80, 0x70); /* Disable NMI */
+ port_io_ops.outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
@@ -35,9 +36,9 @@ static void realmode_switch_hook(void)
*/
static void mask_all_interrupts(void)
{
- outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
+ port_io_ops.outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
- outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
+ port_io_ops.outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}
@@ -46,9 +47,9 @@ static void mask_all_interrupts(void)
*/
static void reset_coprocessor(void)
{
- outb(0, 0xf0);
+ port_io_ops.outb(0, 0xf0);
io_delay();
- outb(0, 0xf1);
+ port_io_ops.outb(0, 0xf1);
io_delay();
}
diff --git a/arch/x86/boot/tty.c b/arch/x86/boot/tty.c
index f7eb976b0a4b..4134dee678f6 100644
--- a/arch/x86/boot/tty.c
+++ b/arch/x86/boot/tty.c
@@ -12,6 +12,7 @@
*/
#include "boot.h"
+#include "io.h"
int early_serial_base;
@@ -29,10 +30,10 @@ static void __section(".inittext") serial_putchar(int ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((port_io_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();
- outb(ch, early_serial_base + TXR);
+ port_io_ops.outb(ch, early_serial_base + TXR);
}
static void __section(".inittext") bios_putchar(int ch)
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f5bb8972b4b2..ceee33b07dbf 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -38,6 +38,7 @@
#define ARCH_HAS_IOREMAP_WC
#define ARCH_HAS_IOREMAP_WT
+#include <asm/shared/io.h>
#include <linux/string.h>
#include <linux/compiler.h>
#include <linux/cc_platform.h>
@@ -241,114 +242,6 @@ extern void native_io_delay(void);
extern int io_delay_type;
extern void io_delay_init(void);
-#if defined(CONFIG_PARAVIRT)
-#include <asm/paravirt.h>
-#else
-
-static inline void slow_down_io(void)
-{
- native_io_delay();
-#ifdef REALLY_SLOW_IO
- native_io_delay();
- native_io_delay();
- native_io_delay();
-#endif
-}
-
-#endif
-
-#ifndef __out
-#define __out(bwl, bw, sz) \
- asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
-#endif
-
-#ifndef __in
-#define __in(bwl, bw, sz) \
- asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
-#endif
-
-#define BUILDIO(bwl, bw, type) \
-static inline void out##bwl(unsigned type value, int port) \
-{ \
- __out(bwl, bw, sizeof(type)); \
-} \
- \
-static inline unsigned type in##bwl(int port) \
-{ \
- unsigned type value; \
- __in(bwl, bw, sizeof(type)); \
- return value; \
-} \
- \
-static inline void out##bwl##_p(unsigned type value, int port) \
-{ \
- out##bwl(value, port); \
- slow_down_io(); \
-} \
- \
-static inline unsigned type in##bwl##_p(int port) \
-{ \
- unsigned type value = in##bwl(port); \
- slow_down_io(); \
- return value; \
-} \
- \
-static inline void outs##bwl(int port, const void *addr, unsigned long count) \
-{ \
- if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
- unsigned type *value = (unsigned type *)addr; \
- while (count) { \
- out##bwl(*value, port); \
- value++; \
- count--; \
- } \
- } else { \
- asm volatile("rep; outs" #bwl \
- : "+S"(addr), "+c"(count) \
- : "d"(port) : "memory"); \
- } \
-} \
- \
-static inline void ins##bwl(int port, void *addr, unsigned long count) \
-{ \
- if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
- unsigned type *value = (unsigned type *)addr; \
- while (count) { \
- *value = in##bwl(port); \
- value++; \
- count--; \
- } \
- } else { \
- asm volatile("rep; ins" #bwl \
- : "+D"(addr), "+c"(count) \
- : "d"(port) : "memory"); \
- } \
-}
-
-BUILDIO(b, b, char)
-BUILDIO(w, w, short)
-BUILDIO(l, , int)
-
-#define inb inb
-#define inw inw
-#define inl inl
-#define inb_p inb_p
-#define inw_p inw_p
-#define inl_p inl_p
-#define insb insb
-#define insw insw
-#define insl insl
-
-#define outb outb
-#define outw outw
-#define outl outl
-#define outb_p outb_p
-#define outw_p outw_p
-#define outl_p outl_p
-#define outsb outsb
-#define outsw outsw
-#define outsl outsl
-
extern void *xlate_dev_mem_ptr(phys_addr_t phys);
extern void unxlate_dev_mem_ptr(phys_addr_t phys, void *addr);
diff --git a/arch/x86/include/asm/shared/io.h b/arch/x86/include/asm/shared/io.h
new file mode 100644
index 000000000000..12369bc72dcf
--- /dev/null
+++ b/arch/x86/include/asm/shared/io.h
@@ -0,0 +1,107 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_IO_H
+#define _ASM_X86_SHARED_IO_H
+
+#include <linux/cc_platform.h>
+
+extern void native_io_delay(void);
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+static inline void slow_down_io(void)
+{
+ native_io_delay();
+#ifdef REALLY_SLOW_IO
+ native_io_delay();
+ native_io_delay();
+ native_io_delay();
+#endif
+}
+#endif /* CONFIG_PARAVIRT */
+
+#define BUILDIO(bwl, bw, type) \
+static inline void out##bwl(unsigned type value, int port) \
+{ \
+ asm volatile("out" #bwl " %" #bw "0, %w1" \
+ : : "a"(value), "Nd"(port)); \
+} \
+ \
+static inline unsigned type in##bwl(int port) \
+{ \
+ unsigned type value; \
+ asm volatile("in" #bwl " %w1, %" #bw "0" \
+ : "=a"(value) : "Nd"(port)); \
+ return value; \
+} \
+ \
+static inline void out##bwl##_p(unsigned type value, int port) \
+{ \
+ out##bwl(value, port); \
+ slow_down_io(); \
+} \
+ \
+static inline unsigned type in##bwl##_p(int port) \
+{ \
+ unsigned type value = in##bwl(port); \
+ slow_down_io(); \
+ return value; \
+} \
+ \
+static inline void outs##bwl(int port, const void *addr, unsigned long count) \
+{ \
+ if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
+ unsigned type *value = (unsigned type *)addr; \
+ while (count) { \
+ out##bwl(*value, port); \
+ value++; \
+ count--; \
+ } \
+ } else { \
+ asm volatile("rep; outs" #bwl \
+ : "+S"(addr), "+c"(count) \
+ : "d"(port) : "memory"); \
+ } \
+} \
+ \
+static inline void ins##bwl(int port, void *addr, unsigned long count) \
+{ \
+ if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) { \
+ unsigned type *value = (unsigned type *)addr; \
+ while (count) { \
+ *value = in##bwl(port); \
+ value++; \
+ count--; \
+ } \
+ } else { \
+ asm volatile("rep; ins" #bwl \
+ : "+D"(addr), "+c"(count) \
+ : "d"(port) : "memory"); \
+ } \
+}
+
+BUILDIO(b, b, char)
+BUILDIO(w, w, short)
+BUILDIO(l, , int)
+
+#define inb inb
+#define inw inw
+#define inl inl
+#define inb_p inb_p
+#define inw_p inw_p
+#define inl_p inl_p
+#define insb insb
+#define insw insw
+#define insl insl
+
+#define outb outb
+#define outw outw
+#define outl outl
+#define outb_p outb_p
+#define outw_p outw_p
+#define outl_p outl_p
+#define outsb outsb
+#define outsw outsw
+#define outsl outsl
+
+#endif
--
Kirill A. Shutemov
Raising hpa and tglx to To: for the general direction.
Full mail is at
https://lore.kernel.org/r/[email protected]
On Wed, Jan 19, 2022 at 02:53:26PM +0300, Kirill A. Shutemov wrote:
> Could you take a look if the diff below is the right direction?
>
> If yes, I will prepare proper patches. My plan is 3 patches: introduce
> <asm/shared/io.h>, add 'struct port_io_ops' for early boot, hook up
> alternative 'struct port_io_ops' for TDX.
Makes sense.
> diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
> index 34c9dbb6a47d..27ce7cef13aa 100644
> --- a/arch/x86/boot/boot.h
> +++ b/arch/x86/boot/boot.h
> @@ -18,11 +18,14 @@
>
> #ifndef __ASSEMBLY__
>
> +#undef CONFIG_PARAVIRT
Yeah, this is the stuff I'd like to avoid in boot/. Can we get rid of
any ifdeffery in the shared/ namespace?
I see this slow_down_io()-enforced CONFIG_PARAVIRT ifdeffery and that
should not be there but in the ...asm/io.h kernel proper header. In the
shared header we should have only basic functions which are shared by
all.
For the same reason I don't think the shared header should have those
if (cc_platform_has... branches but just the basic bits with the asm
wrappers.
Hmmm.
> diff --git a/arch/x86/boot/compressed/io.h b/arch/x86/boot/compressed/io.h
> new file mode 100644
> index 000000000000..e69de29bb2d1
That one's empty.
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index d8373d766672..48de56f2219d 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -15,9 +15,12 @@
> #include "misc.h"
> #include "error.h"
> #include "pgtable.h"
> +#include "tdx.h"
> +#include "io.h"
> #include "../string.h"
> #include "../voffset.h"
> #include <asm/bootparam_utils.h>
> +#include <asm/shared/io.h>
>
> /*
> * WARNING!!
> @@ -47,6 +50,8 @@ void *memmove(void *dest, const void *src, size_t n);
> */
> struct boot_params *boot_params;
>
> +struct port_io_ops port_io_ops;
Call that pio_ops, to differentiate it from the struct name.
> +
> memptr free_mem_ptr;
> memptr free_mem_end_ptr;
>
> @@ -103,10 +108,12 @@ static void serial_putchar(int ch)
> {
> unsigned timeout = 0xffff;
>
> - while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
> + while ((port_io_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
> + --timeout) {
> cpu_relax();
> + }
>
> - outb(ch, early_serial_base + TXR);
> + port_io_ops.outb(ch, early_serial_base + TXR);
> }
>
> void __putstr(const char *s)
> @@ -152,10 +159,10 @@ void __putstr(const char *s)
> boot_params->screen_info.orig_y = y;
>
> pos = (x + cols * y) * 2; /* Update cursor position */
> - outb(14, vidport);
> - outb(0xff & (pos >> 9), vidport+1);
> - outb(15, vidport);
> - outb(0xff & (pos >> 1), vidport+1);
> + port_io_ops.outb(14, vidport);
> + port_io_ops.outb(0xff & (pos >> 9), vidport+1);
> + port_io_ops.outb(15, vidport);
> + port_io_ops.outb(0xff & (pos >> 1), vidport+1);
> }
>
> void __puthex(unsigned long value)
> @@ -370,6 +377,15 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> lines = boot_params->screen_info.orig_video_lines;
> cols = boot_params->screen_info.orig_video_cols;
>
> + port_io_ops = (const struct port_io_ops){
> + .inb = inb,
> + .inw = inw,
> + .inl = inl,
> + .outb = outb,
> + .outw = outw,
> + .outl = outl,
> + };
Why here and not statically defined above?
> +
> /*
> * Detect TDX guest environment.
> *
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 6502adf71a2f..74951befb240 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -19,26 +19,23 @@
> /* cpu_feature_enabled() cannot be used this early */
> #define USE_EARLY_PGTABLE_L5
>
> -/*
> - * Redefine __in/__out macros via tdx.h before including
> - * linux/io.h.
> - */
> -#include "tdx.h"
> -
> #include <linux/linkage.h>
> #include <linux/screen_info.h>
> #include <linux/elf.h>
> -#include <linux/io.h>
> #include <asm/page.h>
> #include <asm/boot.h>
> #include <asm/bootparam.h>
> #include <asm/desc_defs.h>
>
> +/* Avoid pulling outb()/inb() from <asm/io.h> */
> +#define _ACPI_IO_H_
> +
This too. I think those shared headers should contain the basic
functionality which all stages share and can include without ifdeffery.
If they have to change that functionality, they will have to define
their own io.h header, extend it there by using the basic one and then
use it.
This way we'll always have the common, shared stuff in, well, a shared
header.
IMNSVHO.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Jan 19, 2022 at 02:35:09PM +0100, Borislav Petkov wrote:
> Raise hpa and tglx to To: for the general direction.
>
> Full mail is at
>
> https://lore.kernel.org/r/[email protected]
>
> On Wed, Jan 19, 2022 at 02:53:26PM +0300, Kirill A. Shutemov wrote:
> > Could you take a look if the diff below is the right direction?
> >
> > If yes, I will prepare proper patches. My plan is 3 patches: introduce
> > <asm/shared/io.h>, add 'struct port_io_ops' for early boot, hook up
> > alternative 'struct port_io_ops' for TDX.
>
> Makes sense.
>
> > diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
> > index 34c9dbb6a47d..27ce7cef13aa 100644
> > --- a/arch/x86/boot/boot.h
> > +++ b/arch/x86/boot/boot.h
> > @@ -18,11 +18,14 @@
> >
> > #ifndef __ASSEMBLY__
> >
> > +#undef CONFIG_PARAVIRT
>
> Yeah, this is the stuff I'd like to avoid in boot/. Can we get rid of
> any ifdeffery in the shared/ namespace?
>
> I see this slow_down_io()-enforced CONFIG_PARAVIRT ifdeffery and that
> should not be there but in the ...asm/io.h kernel proper header. In the
> shared header we should have only basic functions which are shared by
> all.
>
> For the same reason I don't think the shared header should have those
> if (cc_platform_has... branches but just the basic bits with the asm
> wrappers.
Okay, makes sense. See the patch below.
> > @@ -370,6 +377,15 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> > lines = boot_params->screen_info.orig_video_lines;
> > cols = boot_params->screen_info.orig_video_cols;
> >
> > + port_io_ops = (const struct port_io_ops){
> > + .inb = inb,
> > + .inw = inw,
> > + .inl = inl,
> > + .outb = outb,
> > + .outw = outw,
> > + .outl = outl,
> > + };
>
> Why here and not statically defined above?
I tried to define
const struct port_io_ops default_pio_ops = {
.inb = inb,
.inw = inw,
.inl = inl,
.outb = outb,
.outw = outw,
.outl = outl,
};
make pio_ops a pointer and assign it here:
pio_ops = &default_pio_ops;
But it leads to an issue on linking:
ld.lld: error: Unexpected run-time relocations (.rela) detected!
I'm not sure how to deal with it. Any clues?
An updated patch:
diff --git a/arch/x86/boot/a20.c b/arch/x86/boot/a20.c
index a2b6b428922a..506e4dc0d519 100644
--- a/arch/x86/boot/a20.c
+++ b/arch/x86/boot/a20.c
@@ -12,6 +12,7 @@
*/
#include "boot.h"
+#include "io.h"
#define MAX_8042_LOOPS 100000
#define MAX_8042_FF 32
@@ -25,7 +26,7 @@ static int empty_8042(void)
while (loops--) {
io_delay();
- status = inb(0x64);
+ status = pio_ops.inb(0x64);
if (status == 0xff) {
/* FF is a plausible, but very unlikely status */
if (!--ffs)
@@ -34,7 +35,7 @@ static int empty_8042(void)
if (status & 1) {
/* Read and discard input data */
io_delay();
- (void)inb(0x60);
+ (void)pio_ops.inb(0x60);
} else if (!(status & 2)) {
/* Buffers empty, finished! */
return 0;
@@ -99,13 +100,13 @@ static void enable_a20_kbc(void)
{
empty_8042();
- outb(0xd1, 0x64); /* Command write */
+ pio_ops.outb(0xd1, 0x64); /* Command write */
empty_8042();
- outb(0xdf, 0x60); /* A20 on */
+ pio_ops.outb(0xdf, 0x60); /* A20 on */
empty_8042();
- outb(0xff, 0x64); /* Null command, but UHCI wants it */
+ pio_ops.outb(0xff, 0x64); /* Null command, but UHCI wants it */
empty_8042();
}
@@ -113,10 +114,10 @@ static void enable_a20_fast(void)
{
u8 port_a;
- port_a = inb(0x92); /* Configuration port A */
+ port_a = pio_ops.inb(0x92); /* Configuration port A */
port_a |= 0x02; /* Enable A20 */
port_a &= ~0x01; /* Do not reset machine */
- outb(port_a, 0x92);
+ pio_ops.outb(port_a, 0x92);
}
/*
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 34c9dbb6a47d..22a474c5b3e8 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,6 +23,7 @@
#include <linux/edd.h>
#include <asm/setup.h>
#include <asm/asm.h>
+#include <asm/shared/io.h>
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
@@ -35,40 +36,6 @@ extern struct boot_params boot_params;
#define cpu_relax() asm volatile("rep; nop")
-/* Basic port I/O */
-static inline void outb(u8 v, u16 port)
-{
- asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u8 inb(u16 port)
-{
- u8 v;
- asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outw(u16 v, u16 port)
-{
- asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u16 inw(u16 port)
-{
- u16 v;
- asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outl(u32 v, u16 port)
-{
- asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u32 inl(u16 port)
-{
- u32 v;
- asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
static inline void io_delay(void)
{
const u16 DELAY_PORT = 0x80;
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 22a2a6cc2ab4..1bfe30ebadbe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -99,6 +99,7 @@ endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index d8373d766672..f0906b39d3d7 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -15,9 +15,11 @@
#include "misc.h"
#include "error.h"
#include "pgtable.h"
+#include "tdx.h"
#include "../string.h"
#include "../voffset.h"
#include <asm/bootparam_utils.h>
+#include <asm/shared/io.h>
/*
* WARNING!!
@@ -47,6 +49,8 @@ void *memmove(void *dest, const void *src, size_t n);
*/
struct boot_params *boot_params;
+struct port_io_ops pio_ops;
+
memptr free_mem_ptr;
memptr free_mem_end_ptr;
@@ -103,10 +107,12 @@ static void serial_putchar(int ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
+ --timeout) {
cpu_relax();
+ }
- outb(ch, early_serial_base + TXR);
+ pio_ops.outb(ch, early_serial_base + TXR);
}
void __putstr(const char *s)
@@ -152,10 +158,10 @@ void __putstr(const char *s)
boot_params->screen_info.orig_y = y;
pos = (x + cols * y) * 2; /* Update cursor position */
- outb(14, vidport);
- outb(0xff & (pos >> 9), vidport+1);
- outb(15, vidport);
- outb(0xff & (pos >> 1), vidport+1);
+ pio_ops.outb(14, vidport);
+ pio_ops.outb(0xff & (pos >> 9), vidport+1);
+ pio_ops.outb(15, vidport);
+ pio_ops.outb(0xff & (pos >> 1), vidport+1);
}
void __puthex(unsigned long value)
@@ -370,6 +376,15 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;
+ pio_ops = (const struct port_io_ops){
+ .inb = inb,
+ .inw = inw,
+ .inl = inl,
+ .outb = outb,
+ .outw = outw,
+ .outl = outl,
+ };
+
/*
* Detect TDX guest environment.
*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..bf0fd98d5ce7 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -22,19 +22,17 @@
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
-#include <linux/io.h>
#include <asm/page.h>
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
-#include "tdx.h"
-
#define BOOT_CTYPE_H
#include <linux/acpi.h>
#define BOOT_BOOT_H
#include "../ctype.h"
+#include "../io.h"
#ifdef CONFIG_X86_64
#define memptr long
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..aafadc136c88
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 50c8145bd0f3..07357147753f 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -5,12 +5,71 @@
#include "../cpuflags.h"
#include "../string.h"
+#include "../io.h"
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>
+#include <asm/shared/io.h>
+#include <asm/tdx.h>
#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "
static bool tdx_guest_detected;
+bool early_is_tdx_guest(void)
+{
+ return tdx_guest_detected;
+}
+
+static inline unsigned int tdx_io_in(int size, int port)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 0, port, 0, &out);
+
+ return out.r10 ? UINT_MAX : out.r11;
+}
+
+static inline void tdx_io_out(int size, int port, u64 value)
+{
+ struct tdx_hypercall_output out;
+
+ __tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+ size, 1, port, value, &out);
+}
+
+static inline unsigned char tdx_inb(int port)
+{
+ return tdx_io_in(1, port);
+}
+
+static inline unsigned short tdx_inw(int port)
+{
+ return tdx_io_in(2, port);
+}
+
+static inline unsigned int tdx_inl(int port)
+{
+ return tdx_io_in(4, port);
+}
+
+static inline void tdx_outb(unsigned char value, int port)
+{
+ tdx_io_out(1, port, value);
+}
+
+static inline void tdx_outw(unsigned short value, int port)
+{
+ tdx_io_out(2, port, value);
+}
+
+static inline void tdx_outl(unsigned int value, int port)
+{
+ tdx_io_out(4, port, value);
+}
+
void early_tdx_detect(void)
{
u32 eax, sig[3];
@@ -22,9 +81,13 @@ void early_tdx_detect(void)
/* Cache TDX guest feature status */
tdx_guest_detected = true;
-}
-bool early_is_tdx_guest(void)
-{
- return tdx_guest_detected;
+ pio_ops = (struct port_io_ops) {
+ .inb = tdx_inb,
+ .inw = tdx_inw,
+ .inl = tdx_inl,
+ .outb = tdx_outb,
+ .outw = tdx_outw,
+ .outl = tdx_outl,
+ };
}
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c3de8b..5ccc3d1b8f97 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -4,6 +4,7 @@
* included from both the compressed kernel and the regular kernel.
*/
#include "boot.h"
+#include "io.h"
#define DEFAULT_SERIAL_PORT 0x3f8 /* ttyS0 */
@@ -28,17 +29,17 @@ static void early_serial_init(int port, int baud)
unsigned char c;
unsigned divisor;
- outb(0x3, port + LCR); /* 8n1 */
- outb(0, port + IER); /* no interrupt */
- outb(0, port + FCR); /* no fifo */
- outb(0x3, port + MCR); /* DTR + RTS */
+ pio_ops.outb(0x3, port + LCR); /* 8n1 */
+ pio_ops.outb(0, port + IER); /* no interrupt */
+ pio_ops.outb(0, port + FCR); /* no fifo */
+ pio_ops.outb(0x3, port + MCR); /* DTR + RTS */
divisor = 115200 / baud;
- c = inb(port + LCR);
- outb(c | DLAB, port + LCR);
- outb(divisor & 0xff, port + DLL);
- outb((divisor >> 8) & 0xff, port + DLH);
- outb(c & ~DLAB, port + LCR);
+ c = pio_ops.inb(port + LCR);
+ pio_ops.outb(c | DLAB, port + LCR);
+ pio_ops.outb(divisor & 0xff, port + DLL);
+ pio_ops.outb((divisor >> 8) & 0xff, port + DLH);
+ pio_ops.outb(c & ~DLAB, port + LCR);
early_serial_base = port;
}
@@ -104,11 +105,11 @@ static unsigned int probe_baud(int port)
unsigned char lcr, dll, dlh;
unsigned int quot;
- lcr = inb(port + LCR);
- outb(lcr | DLAB, port + LCR);
- dll = inb(port + DLL);
- dlh = inb(port + DLH);
- outb(lcr, port + LCR);
+ lcr = pio_ops.inb(port + LCR);
+ pio_ops.outb(lcr | DLAB, port + LCR);
+ dll = pio_ops.inb(port + DLL);
+ dlh = pio_ops.inb(port + DLH);
+ pio_ops.outb(lcr, port + LCR);
quot = (dlh << 8) | dll;
return BASE_BAUD / quot;
diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
new file mode 100644
index 000000000000..f75edad63eaa
--- /dev/null
+++ b/arch/x86/boot/io.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_IO_H
+#define BOOT_IO_H
+
+#include <linux/compiler_attributes.h>
+
+struct port_io_ops {
+ unsigned char (*inb)(int port);
+ unsigned short (*inw)(int port);
+ unsigned int (*inl)(int port);
+ void (*outb)(unsigned char v, int port);
+ void (*outw)(unsigned short v, int port);
+ void (*outl)(unsigned int v, int port);
+};
+
+extern struct port_io_ops pio_ops;
+
+#endif
diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c
index e3add857c2c9..ae2eed21f62e 100644
--- a/arch/x86/boot/main.c
+++ b/arch/x86/boot/main.c
@@ -13,10 +13,13 @@
#include <linux/build_bug.h>
#include "boot.h"
+#include "io.h"
#include "string.h"
struct boot_params boot_params __attribute__((aligned(16)));
+struct port_io_ops pio_ops;
+
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
@@ -133,6 +136,15 @@ static void init_heap(void)
void main(void)
{
+ pio_ops = (struct port_io_ops){
+ .inb = inb,
+ .inw = inw,
+ .inl = inl,
+ .outb = outb,
+ .outw = outw,
+ .outl = outl,
+ };
+
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c
index 40031a614712..26d9c5b34b9f 100644
--- a/arch/x86/boot/pm.c
+++ b/arch/x86/boot/pm.c
@@ -11,6 +11,7 @@
*/
#include "boot.h"
+#include "io.h"
#include <asm/segment.h>
/*
@@ -25,7 +26,7 @@ static void realmode_switch_hook(void)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
- outb(0x80, 0x70); /* Disable NMI */
+ pio_ops.outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
@@ -35,9 +36,9 @@ static void realmode_switch_hook(void)
*/
static void mask_all_interrupts(void)
{
- outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
+ pio_ops.outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
- outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
+ pio_ops.outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}
@@ -46,9 +47,9 @@ static void mask_all_interrupts(void)
*/
static void reset_coprocessor(void)
{
- outb(0, 0xf0);
+ pio_ops.outb(0, 0xf0);
io_delay();
- outb(0, 0xf1);
+ pio_ops.outb(0, 0xf1);
io_delay();
}
diff --git a/arch/x86/boot/tty.c b/arch/x86/boot/tty.c
index f7eb976b0a4b..7675ed9afa55 100644
--- a/arch/x86/boot/tty.c
+++ b/arch/x86/boot/tty.c
@@ -12,6 +12,7 @@
*/
#include "boot.h"
+#include "io.h"
int early_serial_base;
@@ -29,10 +30,10 @@ static void __section(".inittext") serial_putchar(int ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();
On Wed, Jan 19, 2022 at 06:49:25PM +0300, Kirill A. Shutemov wrote:
> const struct port_io_ops default_pio_ops = {
> .inb = inb,
> .inw = inw,
> .inl = inl,
> .outb = outb,
> .outw = outw,
> .outl = outl,
> };
>
> make pio_ops a pointer and assign it here:
>
> pio_ops = &default_pio_ops;
>
> But it leads to an issue on linking:
>
> ld.lld: error: Unexpected run-time relocations (.rela) detected!
So the above generates absolute relocations of type R_X86_64_64
Relocation section '.rela.data.rel.local' at offset 0x5c18 contains 6 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000000 000200000001 R_X86_64_64 0000000000000000 .text + 10
000000000008 000200000001 R_X86_64_64 0000000000000000 .text + 30
000000000010 000200000001 R_X86_64_64 0000000000000000 .text + 50
000000000018 000200000001 R_X86_64_64 0000000000000000 .text + 0
000000000020 000200000001 R_X86_64_64 0000000000000000 .text + 20
000000000028 000200000001 R_X86_64_64 0000000000000000 .text + 40
and the linker doesn't like them probably because of the mcmodel we use
but I need to talk to toolchain folks first to make sense of what's
going on...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Jan 19, 2022 at 08:46:24PM +0100, Borislav Petkov wrote:
> On Wed, Jan 19, 2022 at 06:49:25PM +0300, Kirill A. Shutemov wrote:
> > const struct port_io_ops default_pio_ops = {
> > .inb = inb,
> > .inw = inw,
> > .inl = inl,
> > .outb = outb,
> > .outw = outw,
> > .outl = outl,
> > };
> >
> > make pio_ops a pointer and assign it here:
> >
> > pio_ops = &default_pio_ops;
> >
> > But it leads to an issue on linking:
> >
> > ld.lld: error: Unexpected run-time relocations (.rela) detected!
>
> So the above generates absolute relocations of type R_X86_64_64
>
> Relocation section '.rela.data.rel.local' at offset 0x5c18 contains 6 entries:
> Offset Info Type Sym. Value Sym. Name + Addend
> 000000000000 000200000001 R_X86_64_64 0000000000000000 .text + 10
> 000000000008 000200000001 R_X86_64_64 0000000000000000 .text + 30
> 000000000010 000200000001 R_X86_64_64 0000000000000000 .text + 50
> 000000000018 000200000001 R_X86_64_64 0000000000000000 .text + 0
> 000000000020 000200000001 R_X86_64_64 0000000000000000 .text + 20
> 000000000028 000200000001 R_X86_64_64 0000000000000000 .text + 40
>
> and the linker doesn't like them probably because of the mcmodel we use
> but I need to talk to toolchain folks first to make sense of what's
> going on...
JFYI, the message comes from ASSERT in vmlinux.lds.S.
I assume for now I can proceed with the assignment that works, right?
It can be changed later once we figure out what is going on.
--
Kirill A. Shutemov
On Wed, Jan 19, 2022 at 11:08:41PM +0300, Kirill A. Shutemov wrote:
> > Relocation section '.rela.data.rel.local' at offset 0x5c18 contains 6 entries:
^^^^^^
> JFYI, the message comes from ASSERT in vmlinux.lds.S.
Yah, because those relocations are put in a .rela section and that one
matches.
And looking at which commit added it:
527afc212231 ("x86/boot: Check that there are no run-time relocations")
the removed comment kinda explains it - the decompressor kernel cannot
handle runtime relocations. Obviously.
Now we need to figure out how to avoid those...
> I assume for now I can proceed with the assignment that works, right?
> It can be changed later once we figure out what is going on.
Right.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
There are two implementations of port I/O helpers: one in the kernel and
one in the boot stub.
Move the helpers required for both to <asm/shared/io.h> and use the one
implementation everywhere.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/boot.h | 35 +-------------------------------
arch/x86/boot/compressed/misc.h | 2 +-
arch/x86/include/asm/io.h | 22 ++------------------
arch/x86/include/asm/shared/io.h | 32 +++++++++++++++++++++++++++++
4 files changed, 36 insertions(+), 55 deletions(-)
create mode 100644 arch/x86/include/asm/shared/io.h
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 34c9dbb6a47d..22a474c5b3e8 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,6 +23,7 @@
#include <linux/edd.h>
#include <asm/setup.h>
#include <asm/asm.h>
+#include <asm/shared/io.h>
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
@@ -35,40 +36,6 @@ extern struct boot_params boot_params;
#define cpu_relax() asm volatile("rep; nop")
-/* Basic port I/O */
-static inline void outb(u8 v, u16 port)
-{
- asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u8 inb(u16 port)
-{
- u8 v;
- asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outw(u16 v, u16 port)
-{
- asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u16 inw(u16 port)
-{
- u16 v;
- asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
-static inline void outl(u32 v, u16 port)
-{
- asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u32 inl(u16 port)
-{
- u32 v;
- asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
- return v;
-}
-
static inline void io_delay(void)
{
const u16 DELAY_PORT = 0x80;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..8a253e85f990 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -22,11 +22,11 @@
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
-#include <linux/io.h>
#include <asm/page.h>
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
+#include <asm/shared/io.h>
#include "tdx.h"
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6d91ecb8026..8ce0a40379de 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -44,6 +44,7 @@
#include <asm/page.h>
#include <asm/early_ioremap.h>
#include <asm/pgtable_types.h>
+#include <asm/shared/io.h>
#define build_mmio_read(name, size, type, reg, barrier) \
static inline type name(const volatile void __iomem *addr) \
@@ -258,20 +259,6 @@ static inline void slow_down_io(void)
#endif
#define BUILDIO(bwl, bw, type) \
-static inline void out##bwl(unsigned type value, int port) \
-{ \
- asm volatile("out" #bwl " %" #bw "0, %w1" \
- : : "a"(value), "Nd"(port)); \
-} \
- \
-static inline unsigned type in##bwl(int port) \
-{ \
- unsigned type value; \
- asm volatile("in" #bwl " %w1, %" #bw "0" \
- : "=a"(value) : "Nd"(port)); \
- return value; \
-} \
- \
static inline void out##bwl##_p(unsigned type value, int port) \
{ \
out##bwl(value, port); \
@@ -320,10 +307,8 @@ static inline void ins##bwl(int port, void *addr, unsigned long count) \
BUILDIO(b, b, char)
BUILDIO(w, w, short)
BUILDIO(l, , int)
+#undef BUILDIO
-#define inb inb
-#define inw inw
-#define inl inl
#define inb_p inb_p
#define inw_p inw_p
#define inl_p inl_p
@@ -331,9 +316,6 @@ BUILDIO(l, , int)
#define insw insw
#define insl insl
-#define outb outb
-#define outw outw
-#define outl outl
#define outb_p outb_p
#define outw_p outw_p
#define outl_p outl_p
diff --git a/arch/x86/include/asm/shared/io.h b/arch/x86/include/asm/shared/io.h
new file mode 100644
index 000000000000..f17247f6c471
--- /dev/null
+++ b/arch/x86/include/asm/shared/io.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_IO_H
+#define _ASM_X86_SHARED_IO_H
+
+#define BUILDIO(bwl, bw, type) \
+static inline void out##bwl(unsigned type value, int port) \
+{ \
+ asm volatile("out" #bwl " %" #bw "0, %w1" \
+ : : "a"(value), "Nd"(port)); \
+} \
+ \
+static inline unsigned type in##bwl(int port) \
+{ \
+ unsigned type value; \
+ asm volatile("in" #bwl " %w1, %" #bw "0" \
+ : "=a"(value) : "Nd"(port)); \
+ return value; \
+}
+
+BUILDIO(b, b, char)
+BUILDIO(w, w, short)
+BUILDIO(l, , int)
+#undef BUILDIO
+
+#define inb inb
+#define inw inw
+#define inl inl
+#define outb outb
+#define outw outw
+#define outl outl
+
+#endif
--
2.34.1
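As a side note, the token-pasting trick that BUILDIO relies on can be sketched in plain user-space C. This is a hypothetical model, not the kernel code: a plain array stands in for the I/O port space (the real helpers execute privileged in/out instructions), and the asm-constraint argument ('bw') is dropped since there is no inline assembly here.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real I/O port space; the kernel's helpers issue
 * in/out instructions instead of touching memory. */
static uint32_t fake_ports[65536];

/* Token pasting (out##bwl, in##bwl) generates one out/in helper pair
 * per access size, just as BUILDIO does in <asm/shared/io.h>. */
#define BUILDIO(bwl, type)                                  \
static inline void out##bwl(unsigned type value, int port)  \
{                                                           \
	fake_ports[port] = (unsigned type)value;            \
}                                                           \
                                                            \
static inline unsigned type in##bwl(int port)               \
{                                                           \
	return (unsigned type)fake_ports[port];             \
}

BUILDIO(b, char)	/* generates outb()/inb()  */
BUILDIO(w, short)	/* generates outw()/inw()  */
BUILDIO(l, )		/* generates outl()/inl(), "unsigned" == unsigned int */
#undef BUILDIO
```

Running the expansion through `gcc -E` is an easy way to see the six generated helper definitions.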
Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.
But during early boot, in the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.
Add a way to hook up alternative port I/O helpers in the boot stub.
All port I/O operations are routed via 'pio_ops'. By default, 'pio_ops'
is initialized with the native port I/O implementations.
This is a preparation patch. The next patch will override 'pio_ops' if
the kernel is booted in a TDX environment.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/a20.c | 14 ++++++-------
arch/x86/boot/boot.h | 2 +-
arch/x86/boot/compressed/misc.c | 18 +++++++++++------
arch/x86/boot/compressed/misc.h | 2 +-
arch/x86/boot/early_serial_console.c | 28 +++++++++++++-------------
arch/x86/boot/io.h | 30 ++++++++++++++++++++++++++++
arch/x86/boot/main.c | 4 ++++
arch/x86/boot/pm.c | 10 +++++-----
arch/x86/boot/tty.c | 4 ++--
arch/x86/boot/video-vga.c | 6 +++---
arch/x86/boot/video.h | 8 +++++---
arch/x86/realmode/rm/wakemain.c | 14 ++++++++-----
12 files changed, 93 insertions(+), 47 deletions(-)
create mode 100644 arch/x86/boot/io.h
diff --git a/arch/x86/boot/a20.c b/arch/x86/boot/a20.c
index a2b6b428922a..7f6dd5cc4670 100644
--- a/arch/x86/boot/a20.c
+++ b/arch/x86/boot/a20.c
@@ -25,7 +25,7 @@ static int empty_8042(void)
while (loops--) {
io_delay();
- status = inb(0x64);
+ status = pio_ops.inb(0x64);
if (status == 0xff) {
/* FF is a plausible, but very unlikely status */
if (!--ffs)
@@ -34,7 +34,7 @@ static int empty_8042(void)
if (status & 1) {
/* Read and discard input data */
io_delay();
- (void)inb(0x60);
+ (void)pio_ops.inb(0x60);
} else if (!(status & 2)) {
/* Buffers empty, finished! */
return 0;
@@ -99,13 +99,13 @@ static void enable_a20_kbc(void)
{
empty_8042();
- outb(0xd1, 0x64); /* Command write */
+ pio_ops.outb(0xd1, 0x64); /* Command write */
empty_8042();
- outb(0xdf, 0x60); /* A20 on */
+ pio_ops.outb(0xdf, 0x60); /* A20 on */
empty_8042();
- outb(0xff, 0x64); /* Null command, but UHCI wants it */
+ pio_ops.outb(0xff, 0x64); /* Null command, but UHCI wants it */
empty_8042();
}
@@ -113,10 +113,10 @@ static void enable_a20_fast(void)
{
u8 port_a;
- port_a = inb(0x92); /* Configuration port A */
+ port_a = pio_ops.inb(0x92); /* Configuration port A */
port_a |= 0x02; /* Enable A20 */
port_a &= ~0x01; /* Do not reset machine */
- outb(port_a, 0x92);
+ pio_ops.outb(port_a, 0x92);
}
/*
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 22a474c5b3e8..bd8f640ca15f 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,10 +23,10 @@
#include <linux/edd.h>
#include <asm/setup.h>
#include <asm/asm.h>
-#include <asm/shared/io.h>
#include "bitops.h"
#include "ctype.h"
#include "cpuflags.h"
+#include "io.h"
/* Useful macros */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index d8373d766672..cc47cf239c67 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -47,6 +47,8 @@ void *memmove(void *dest, const void *src, size_t n);
*/
struct boot_params *boot_params;
+struct port_io_ops pio_ops;
+
memptr free_mem_ptr;
memptr free_mem_end_ptr;
@@ -103,10 +105,12 @@ static void serial_putchar(int ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
+ --timeout) {
cpu_relax();
+ }
- outb(ch, early_serial_base + TXR);
+ pio_ops.outb(ch, early_serial_base + TXR);
}
void __putstr(const char *s)
@@ -152,10 +156,10 @@ void __putstr(const char *s)
boot_params->screen_info.orig_y = y;
pos = (x + cols * y) * 2; /* Update cursor position */
- outb(14, vidport);
- outb(0xff & (pos >> 9), vidport+1);
- outb(15, vidport);
- outb(0xff & (pos >> 1), vidport+1);
+ pio_ops.outb(14, vidport);
+ pio_ops.outb(0xff & (pos >> 9), vidport+1);
+ pio_ops.outb(15, vidport);
+ pio_ops.outb(0xff & (pos >> 1), vidport+1);
}
void __puthex(unsigned long value)
@@ -370,6 +374,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;
+ init_io_ops();
+
/*
* Detect TDX guest environment.
*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 8a253e85f990..ea71cf3d64e1 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -26,7 +26,6 @@
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
-#include <asm/shared/io.h>
#include "tdx.h"
@@ -35,6 +34,7 @@
#define BOOT_BOOT_H
#include "../ctype.h"
+#include "../io.h"
#ifdef CONFIG_X86_64
#define memptr long
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c3de8b..03e43d770571 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -28,17 +28,17 @@ static void early_serial_init(int port, int baud)
unsigned char c;
unsigned divisor;
- outb(0x3, port + LCR); /* 8n1 */
- outb(0, port + IER); /* no interrupt */
- outb(0, port + FCR); /* no fifo */
- outb(0x3, port + MCR); /* DTR + RTS */
+ pio_ops.outb(0x3, port + LCR); /* 8n1 */
+ pio_ops.outb(0, port + IER); /* no interrupt */
+ pio_ops.outb(0, port + FCR); /* no fifo */
+ pio_ops.outb(0x3, port + MCR); /* DTR + RTS */
divisor = 115200 / baud;
- c = inb(port + LCR);
- outb(c | DLAB, port + LCR);
- outb(divisor & 0xff, port + DLL);
- outb((divisor >> 8) & 0xff, port + DLH);
- outb(c & ~DLAB, port + LCR);
+ c = pio_ops.inb(port + LCR);
+ pio_ops.outb(c | DLAB, port + LCR);
+ pio_ops.outb(divisor & 0xff, port + DLL);
+ pio_ops.outb((divisor >> 8) & 0xff, port + DLH);
+ pio_ops.outb(c & ~DLAB, port + LCR);
early_serial_base = port;
}
@@ -104,11 +104,11 @@ static unsigned int probe_baud(int port)
unsigned char lcr, dll, dlh;
unsigned int quot;
- lcr = inb(port + LCR);
- outb(lcr | DLAB, port + LCR);
- dll = inb(port + DLL);
- dlh = inb(port + DLH);
- outb(lcr, port + LCR);
+ lcr = pio_ops.inb(port + LCR);
+ pio_ops.outb(lcr | DLAB, port + LCR);
+ dll = pio_ops.inb(port + DLL);
+ dlh = pio_ops.inb(port + DLH);
+ pio_ops.outb(lcr, port + LCR);
quot = (dlh << 8) | dll;
return BASE_BAUD / quot;
diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
new file mode 100644
index 000000000000..640daa3925fb
--- /dev/null
+++ b/arch/x86/boot/io.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_IO_H
+#define BOOT_IO_H
+
+#include <asm/shared/io.h>
+
+struct port_io_ops {
+ unsigned char (*inb)(int port);
+ unsigned short (*inw)(int port);
+ unsigned int (*inl)(int port);
+ void (*outb)(unsigned char v, int port);
+ void (*outw)(unsigned short v, int port);
+ void (*outl)(unsigned int v, int port);
+};
+
+extern struct port_io_ops pio_ops;
+
+static inline void init_io_ops(void)
+{
+ pio_ops = (struct port_io_ops){
+ .inb = inb,
+ .inw = inw,
+ .inl = inl,
+ .outb = outb,
+ .outw = outw,
+ .outl = outl,
+ };
+}
+
+#endif
diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c
index e3add857c2c9..447a797891be 100644
--- a/arch/x86/boot/main.c
+++ b/arch/x86/boot/main.c
@@ -17,6 +17,8 @@
struct boot_params boot_params __attribute__((aligned(16)));
+struct port_io_ops pio_ops;
+
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
@@ -133,6 +135,8 @@ static void init_heap(void)
void main(void)
{
+ init_io_ops();
+
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c
index 40031a614712..4180b6a264c9 100644
--- a/arch/x86/boot/pm.c
+++ b/arch/x86/boot/pm.c
@@ -25,7 +25,7 @@ static void realmode_switch_hook(void)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
- outb(0x80, 0x70); /* Disable NMI */
+ pio_ops.outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}
@@ -35,9 +35,9 @@ static void realmode_switch_hook(void)
*/
static void mask_all_interrupts(void)
{
- outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
+ pio_ops.outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
io_delay();
- outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
+ pio_ops.outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
io_delay();
}
@@ -46,9 +46,9 @@ static void mask_all_interrupts(void)
*/
static void reset_coprocessor(void)
{
- outb(0, 0xf0);
+ pio_ops.outb(0, 0xf0);
io_delay();
- outb(0, 0xf1);
+ pio_ops.outb(0, 0xf1);
io_delay();
}
diff --git a/arch/x86/boot/tty.c b/arch/x86/boot/tty.c
index f7eb976b0a4b..ee8700682801 100644
--- a/arch/x86/boot/tty.c
+++ b/arch/x86/boot/tty.c
@@ -29,10 +29,10 @@ static void __section(".inittext") serial_putchar(int ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();
- outb(ch, early_serial_base + TXR);
+ pio_ops.outb(ch, early_serial_base + TXR);
}
static void __section(".inittext") bios_putchar(int ch)
diff --git a/arch/x86/boot/video-vga.c b/arch/x86/boot/video-vga.c
index 4816cb9cf996..17baac542ee7 100644
--- a/arch/x86/boot/video-vga.c
+++ b/arch/x86/boot/video-vga.c
@@ -131,7 +131,7 @@ static void vga_set_80x43(void)
/* I/O address of the VGA CRTC */
u16 vga_crtc(void)
{
- return (inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
+ return (pio_ops.inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
}
static void vga_set_480_scanlines(void)
@@ -148,10 +148,10 @@ static void vga_set_480_scanlines(void)
out_idx(0xdf, crtc, 0x12); /* Vertical display end */
out_idx(0xe7, crtc, 0x15); /* Vertical blank start */
out_idx(0x04, crtc, 0x16); /* Vertical blank end */
- csel = inb(0x3cc);
+ csel = pio_ops.inb(0x3cc);
csel &= 0x0d;
csel |= 0xe2;
- outb(csel, 0x3c2);
+ pio_ops.outb(csel, 0x3c2);
}
static void vga_set_vertical_end(int lines)
diff --git a/arch/x86/boot/video.h b/arch/x86/boot/video.h
index 04bde0bb2003..87a5f726e731 100644
--- a/arch/x86/boot/video.h
+++ b/arch/x86/boot/video.h
@@ -15,6 +15,8 @@
#include <linux/types.h>
+#include "boot.h"
+
/*
* This code uses an extended set of video mode numbers. These include:
* Aliases for standard modes
@@ -96,13 +98,13 @@ extern int graphic_mode; /* Graphics mode with linear frame buffer */
/* Accessing VGA indexed registers */
static inline u8 in_idx(u16 port, u8 index)
{
- outb(index, port);
- return inb(port+1);
+ pio_ops.outb(index, port);
+ return pio_ops.inb(port+1);
}
static inline void out_idx(u8 v, u16 port, u8 index)
{
- outw(index+(v << 8), port);
+ pio_ops.outw(index+(v << 8), port);
}
/* Writes a value to an indexed port and then reads the port again */
diff --git a/arch/x86/realmode/rm/wakemain.c b/arch/x86/realmode/rm/wakemain.c
index 1d6437e6d2ba..b49404d0d63c 100644
--- a/arch/x86/realmode/rm/wakemain.c
+++ b/arch/x86/realmode/rm/wakemain.c
@@ -17,18 +17,18 @@ static void beep(unsigned int hz)
} else {
u16 div = 1193181/hz;
- outb(0xb6, 0x43); /* Ctr 2, squarewave, load, binary */
+ pio_ops.outb(0xb6, 0x43); /* Ctr 2, squarewave, load, binary */
io_delay();
- outb(div, 0x42); /* LSB of counter */
+ pio_ops.outb(div, 0x42); /* LSB of counter */
io_delay();
- outb(div >> 8, 0x42); /* MSB of counter */
+ pio_ops.outb(div >> 8, 0x42); /* MSB of counter */
io_delay();
enable = 0x03; /* Turn on speaker */
}
- inb(0x61); /* Dummy read of System Control Port B */
+ pio_ops.inb(0x61); /* Dummy read of System Control Port B */
io_delay();
- outb(enable, 0x61); /* Enable timer 2 output to speaker */
+ pio_ops.outb(enable, 0x61); /* Enable timer 2 output to speaker */
io_delay();
}
@@ -62,8 +62,12 @@ static void send_morse(const char *pattern)
}
}
+struct port_io_ops pio_ops;
+
void main(void)
{
+ init_io_ops();
+
/* Kill machine if structures are wrong */
if (wakeup_header.real_magic != 0x12345678)
while (1)
--
2.34.1
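The indirection this patch introduces can be sketched in user-space C. Everything below is illustrative, not the kernel's code: the ops struct is trimmed to one in/out pair, the backends write a plain array instead of doing real port I/O, and the tdx_* names and the init_io_ops() parameter are invented for the sketch (the real init_io_ops() takes no argument; the TDX override arrives in the next patch).

```c
#include <assert.h>

/* Stand-in for the I/O port space. */
static unsigned char fake_ports[65536];

/* "Native" backend: direct access, like the inline in/out helpers. */
static unsigned char native_inb(int port) { return fake_ports[port]; }
static void native_outb(unsigned char v, int port) { fake_ports[port] = v; }

/* Stand-in for a paravirtualized backend that would use hypercalls
 * instead of in/out instructions; it just counts invocations here. */
static int hypercalls;
static unsigned char tdx_inb(int port) { hypercalls++; return fake_ports[port]; }
static void tdx_outb(unsigned char v, int port) { hypercalls++; fake_ports[port] = v; }

/* All port I/O is routed through a struct of function pointers. */
struct port_io_ops {
	unsigned char (*inb)(int port);
	void (*outb)(unsigned char v, int port);
};

static struct port_io_ops pio_ops;

static void init_io_ops(int tdx_guest)
{
	/* Member-by-member assignment, as the discussion below settles on. */
	pio_ops.inb  = tdx_guest ? tdx_inb  : native_inb;
	pio_ops.outb = tdx_guest ? tdx_outb : native_outb;
}
```

Callers only ever use pio_ops.inb()/pio_ops.outb(), so switching the backend is a matter of repointing the struct before any I/O happens.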
On Thu, Jan 20, 2022 at 05:15:43AM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
> new file mode 100644
> index 000000000000..640daa3925fb
> --- /dev/null
> +++ b/arch/x86/boot/io.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef BOOT_IO_H
> +#define BOOT_IO_H
> +
> +#include <asm/shared/io.h>
> +
> +struct port_io_ops {
> + unsigned char (*inb)(int port);
> + unsigned short (*inw)(int port);
> + unsigned int (*inl)(int port);
> + void (*outb)(unsigned char v, int port);
> + void (*outw)(unsigned short v, int port);
> + void (*outl)(unsigned int v, int port);
> +};
> +
> +extern struct port_io_ops pio_ops;
> +
> +static inline void init_io_ops(void)
> +{
> + pio_ops = (struct port_io_ops){
> + .inb = inb,
> + .inw = inw,
> + .inl = inl,
> + .outb = outb,
> + .outw = outw,
> + .outl = outl,
> + };
> +}
> +
> +#endif
It works fine on x86-64, but breaks on i386:
ld: Unexpected run-time relocations (.rel) detected!
I'll change it to
pio_ops.inb = inb;
pio_ops.inw = inw;
pio_ops.inl = inl;
pio_ops.outb = outb;
pio_ops.outw = outw;
pio_ops.outl = outl;
It works, but I hate that I don't really have control here. I have no clue
why the compiler generates different code after the change. It is very
fragile. Do we really have no way to tell the compiler to avoid relocations here?
--
Kirill A. Shutemov
On Thu, Jan 20, 2022 at 07:38:26PM +0300, Kirill A. Shutemov wrote:
> On Thu, Jan 20, 2022 at 05:15:43AM +0300, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
> > new file mode 100644
> > index 000000000000..640daa3925fb
> > --- /dev/null
> > +++ b/arch/x86/boot/io.h
> > @@ -0,0 +1,30 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef BOOT_IO_H
> > +#define BOOT_IO_H
> > +
> > +#include <asm/shared/io.h>
> > +
> > +struct port_io_ops {
> > + unsigned char (*inb)(int port);
> > + unsigned short (*inw)(int port);
> > + unsigned int (*inl)(int port);
> > + void (*outb)(unsigned char v, int port);
> > + void (*outw)(unsigned short v, int port);
> > + void (*outl)(unsigned int v, int port);
> > +};
> > +
> > +extern struct port_io_ops pio_ops;
> > +
> > +static inline void init_io_ops(void)
> > +{
> > + pio_ops = (struct port_io_ops){
> > + .inb = inb,
> > + .inw = inw,
> > + .inl = inl,
> > + .outb = outb,
> > + .outw = outw,
> > + .outl = outl,
> > + };
> > +}
> > +
> > +#endif
>
> It works fine on x86-64, but breaks on i386:
>
> ld: Unexpected run-time relocations (.rel) detected!
>
> I'll change it to
>
> pio_ops.inb = inb;
> pio_ops.inw = inw;
> pio_ops.inl = inl;
> pio_ops.outb = outb;
> pio_ops.outw = outw;
> pio_ops.outl = outl;
>
> It works, but I hate that I don't really have control here. I have no clue
> why the compiler generates different code after the change. It is very
> fragile.
>
> Do we really have no way to tell the compiler to avoid relocations here?
This one:
pio_ops = (struct port_io_ops){
.inb = inb,
.inw = inw,
.inl = inl,
.outb = outb,
.outw = outw,
.outl = outl,
};
... actually allocates an anonymous struct in the .data section, which is
memcpy'ed at runtime when the assignment occurs. That anonymous struct
has .data -> .text relocations which have to be resolved at runtime
because the distance between .data and .text isn't constant.
The working version:
pio_ops.inb = inb;
pio_ops.inw = inw;
pio_ops.inl = inl;
pio_ops.outb = outb;
pio_ops.outw = outw;
pio_ops.outl = outl;
... only needs .text -> .text relocations which can be resolved at link
time.
--
Josh
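The two forms can be put side by side in a small sketch. In ordinary user space both behave identically; the difference Josh describes only shows up in the relocations the compiler emits (inspectable with readelf -r on the object file), which matters in the decompressor because it cannot process run-time relocations. Names below are illustrative.

```c
#include <assert.h>

static int f1(void) { return 1; }
static int f2(void) { return 2; }

struct ops {
	int (*a)(void);
	int (*b)(void);
};

static struct ops by_literal, by_member;

static void init_by_literal(void)
{
	/* Compound-literal assignment: the compiler may materialize the
	 * initializer as an anonymous image in .data and copy it here,
	 * leaving .data -> .text relocations to resolve at run time. */
	by_literal = (struct ops){ .a = f1, .b = f2 };
}

static void init_by_member(void)
{
	/* Member-by-member assignment: the addresses are formed by
	 * instructions in .text, so only link-time-resolvable
	 * .text -> .text relocations are needed. */
	by_member.a = f1;
	by_member.b = f2;
}
```

Either way the resulting struct is the same; only the second form avoids the run-time relocations that trip the ASSERT in vmlinux.lds.S.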
On Thu, Jan 20, 2022 at 01:13:47PM -0800, Josh Poimboeuf wrote:
> This one:
>
> pio_ops = (struct port_io_ops){
> .inb = inb,
> .inw = inw,
> .inl = inl,
> .outb = outb,
> .outw = outw,
> .outl = outl,
> };
>
> ... actually allocates an anonymous struct in the .data section, which is
> memcpy'ed at runtime when the assignment occurs. That anonymous struct
> has .data -> .text relocations which have to be resolved at runtime
> because the distance between .data and .text isn't constant.
Yap, and this is the key point - the decompressor kernel is a -pie
executable, so it needs to resolve .data section relocations at *runtime*,
but we don't have a dynamic linker during early boot.
We could patch things at early boot by going through the .data runtime
relocations and patching in the target locations, but that would probably
be too much just so that we can do those struct initializers.
And, I'm being told, global .data section things should be avoided, if
possible.
> The working version:
>
> pio_ops.inb = inb;
> pio_ops.inw = inw;
> pio_ops.inl = inl;
> pio_ops.outb = outb;
> pio_ops.outw = outw;
> pio_ops.outl = outl;
>
> ... only needs .text -> .text relocations which can be resolved at link
> time.
So yeah, we can simply do this and forget about it.
If someone is bored and wants to fixup such runtime relocations at,
well, runtime, sure. But until then...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette