2021-07-02 22:06:58

by Isaku Yamahata

Subject: [RFC PATCH v2 00/69] KVM: X86: TDX support

From: Isaku Yamahata <[email protected]>

* What's TDX?
TDX stands for Trust Domain Extensions, which isolates VMs from the
virtual-machine manager (VMM)/hypervisor and any other software on the
platform [1]. For details, see the specifications [2], [3], [4], [5], [6], [7].


* The goal of this RFC patch
The purpose of this post is to get early feedback on the high-level design of
the KVM enhancements for TDX. Coding details (variable naming etc.) are not
yet settled, and this patch series is incomplete (not working); hence RFC.
Although multiple software components need to be updated (not only KVM, but
also QEMU, the guest Linux kernel and the virtual BIOS), this series includes
only the KVM/VMM part. For those curious about the changes to the other
components, public repositories are available on GitHub. [8], [9]


* Patch organization
Patch 66 is the main change. The preceding patches (01-65) refactor the code
and introduce additional hooks.

- 01-12: Preparations: introduce architectural constants, refactor code and
export symbols needed by the following patches.
- 13-40: Introduce the new type of VM and allow the coexistence of multiple
VM types; allow/disallow KVM ioctls where appropriate. In particular,
convert per-system ioctls into per-VM ioctls.
- 41-65: Refactor KVM VMX/MMU and add new hooks for the Secure EPT.
- 66: Main patch adding "basic" support for building/running TDX.
- 67: Trace points.
- 68-69: Documentation.

* TODOs
The following major features are omitted to keep this patch series small:

- Loading/initializing the TDX module
split out from this patch series.
- Unmapping private pages
Kirill's patch will be integrated to show how KVM will utilize it.
- QEMU gdb stub support
- Large page support
- Guest PMU support
- TDP MMU support
- and more

Changes from v1:
- rebase to v5.13
- drop loading/initialization of the TDX module
- catch up with updates to the related specifications
- rework the C wrapper functions that invoke SEAMCALL
- various code cleanups

[1] TDX specification
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
[3] Intel CPU Architectural Extensions Specification
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 EAS
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
[5] Intel TDX Loader Interface Specification
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
[7] Intel TDX Virtual Firmware Design Guide
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
[8] intel public github
kvm TDX branch: https://github.com/intel/tdx/tree/kvm
TDX guest branch: https://github.com/intel/tdx/tree/guest
qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
https://github.com/tianocore/edk2-staging/tree/TDVF

Isaku Yamahata (11):
KVM: TDX: introduce config for KVM TDX support
KVM: X86: move kvm_cpu_vmxon() from vmx.c to virtext.h
KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86
KVM: TDX: add a helper function for kvm to call seamcall
KVM: TDX: add trace point before/after TDX SEAMCALLs
KVM: TDX: Print the name of SEAMCALL status code
KVM: Add per-VM flag to mark read-only memory as unsupported
KVM: x86: add per-VM flags to disable SMI/INIT/SIPI
KVM: TDX: add trace point for TDVMCALL and SEPT operation
KVM: TDX: add document on TDX MODULE
Documentation/virtual/kvm: Add Trust Domain Extensions(TDX)

Kai Huang (2):
KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level
routes
cpu/hotplug: Document that TDX also depends on booting CPUs once

Rick Edgecombe (1):
KVM: x86: Add infrastructure for stolen GPA bits

Sean Christopherson (53):
KVM: TDX: Add TDX "architectural" error codes
KVM: TDX: Add architectural definitions for structures and values
KVM: TDX: define and export helper functions for KVM TDX support
KVM: TDX: Add C wrapper functions for TDX SEAMCALLs
KVM: Export kvm_io_bus_read for use by TDX for PV MMIO
KVM: Enable hardware before doing arch VM initialization
KVM: x86: Split core of hypercall emulation to helper function
KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO
KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default
KVM: Add infrastructure and macro to mark VM as bugged
KVM: Export kvm_make_all_cpus_request() for use in marking VMs as
bugged
KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the
VM
KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID
KVM: Add max_vcpus field in common 'struct kvm'
KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()
KVM: x86: Introduce "protected guest" concept and block disallowed
ioctls
KVM: x86: Add per-VM flag to disable direct IRQ injection
KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE
KVM: x86: Add flag to mark TSC as immutable (for TDX)
KVM: Add per-VM flag to disable dirty logging of memslots for TDs
KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()
KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP
KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
behavior
KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
KVM: x86: Add option to force LAPIC expiration wait
KVM: x86: Add guest_supported_xss placholder
KVM: Export kvm_is_reserved_pfn() for use by TDX
KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
KVM: x86/mmu: Allow non-zero init value for shadow PTE
KVM: x86/mmu: Refactor shadow walk in __direct_map() to reduce
indentation
KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
KVM: x86/mmu: Frame in support for private/inaccessible shadow pages
KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
KVM: VMX: Modify NMI and INTR handlers to take intr_info as param
KVM: VMX: Move NMI/exception handler to common helper
KVM: x86/mmu: Allow per-VM override of the TDP max page level
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: VMX: Define EPT Violation architectural bits
KVM: VMX: Define VMCS encodings for shared EPT pointer
KVM: VMX: Add 'main.c' to wrap VMX and TDX
KVM: VMX: Move setting of EPT MMU masks to common VT-x code
KVM: VMX: Move register caching logic to common code
KVM: TDX: Define TDCALL exit reason
KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs
KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h
KVM: VMX: MOVE GDT and IDT accessors to common code
KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX
code
KVM: TDX: Add "basic" support for building and running Trust Domains

Xiaoyao Li (2):
KVM: TDX: Introduce pr_seamcall_ex_ret_info() to print more info when
SEAMCALL fails
KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

Documentation/virt/kvm/api.rst | 6 +-
Documentation/virt/kvm/intel-tdx.rst | 441 ++++++
Documentation/virt/kvm/tdx-module.rst | 48 +
arch/arm64/include/asm/kvm_host.h | 3 -
arch/arm64/kvm/arm.c | 7 +-
arch/arm64/kvm/vgic/vgic-init.c | 6 +-
arch/x86/Kbuild | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/kvm-x86-ops.h | 8 +
arch/x86/include/asm/kvm_boot.h | 30 +
arch/x86/include/asm/kvm_host.h | 55 +-
arch/x86/include/asm/virtext.h | 25 +
arch/x86/include/asm/vmx.h | 17 +
arch/x86/include/uapi/asm/kvm.h | 60 +
arch/x86/include/uapi/asm/vmx.h | 7 +-
arch/x86/kernel/asm-offsets_64.c | 15 +
arch/x86/kvm/Kconfig | 11 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/boot/Makefile | 6 +
arch/x86/kvm/boot/seam/tdx_common.c | 242 +++
arch/x86/kvm/boot/seam/tdx_common.h | 13 +
arch/x86/kvm/ioapic.c | 4 +
arch/x86/kvm/irq_comm.c | 13 +-
arch/x86/kvm/lapic.c | 7 +-
arch/x86/kvm/lapic.h | 2 +-
arch/x86/kvm/mmu.h | 31 +-
arch/x86/kvm/mmu/mmu.c | 526 +++++--
arch/x86/kvm/mmu/mmu_internal.h | 3 +
arch/x86/kvm/mmu/paging_tmpl.h | 25 +-
arch/x86/kvm/mmu/spte.c | 15 +-
arch/x86/kvm/mmu/spte.h | 18 +-
arch/x86/kvm/svm/svm.c | 18 +-
arch/x86/kvm/trace.h | 138 ++
arch/x86/kvm/vmx/common.h | 178 +++
arch/x86/kvm/vmx/main.c | 1098 ++++++++++++++
arch/x86/kvm/vmx/posted_intr.c | 6 +
arch/x86/kvm/vmx/seamcall.S | 64 +
arch/x86/kvm/vmx/seamcall.h | 68 +
arch/x86/kvm/vmx/tdx.c | 1958 +++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 267 ++++
arch/x86/kvm/vmx/tdx_arch.h | 370 +++++
arch/x86/kvm/vmx/tdx_errno.h | 202 +++
arch/x86/kvm/vmx/tdx_ops.h | 218 +++
arch/x86/kvm/vmx/tdx_stubs.c | 45 +
arch/x86/kvm/vmx/vmcs.h | 11 -
arch/x86/kvm/vmx/vmenter.S | 146 ++
arch/x86/kvm/vmx/vmx.c | 509 ++-----
arch/x86/kvm/x86.c | 285 +++-
include/linux/kvm_host.h | 51 +-
include/uapi/linux/kvm.h | 2 +
kernel/cpu.c | 4 +
tools/arch/x86/include/uapi/asm/kvm.h | 55 +
tools/include/uapi/linux/kvm.h | 2 +
virt/kvm/kvm_main.c | 44 +-
54 files changed, 6717 insertions(+), 672 deletions(-)
create mode 100644 Documentation/virt/kvm/intel-tdx.rst
create mode 100644 Documentation/virt/kvm/tdx-module.rst
create mode 100644 arch/x86/include/asm/kvm_boot.h
create mode 100644 arch/x86/kvm/boot/Makefile
create mode 100644 arch/x86/kvm/boot/seam/tdx_common.c
create mode 100644 arch/x86/kvm/boot/seam/tdx_common.h
create mode 100644 arch/x86/kvm/vmx/common.h
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/seamcall.S
create mode 100644 arch/x86/kvm/vmx/seamcall.h
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx.h
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
create mode 100644 arch/x86/kvm/vmx/tdx_stubs.c

--
2.25.1


2021-07-02 22:07:05

by Isaku Yamahata

Subject: [RFC PATCH v2 07/69] KVM: TDX: define and export helper functions for KVM TDX support

From: Sean Christopherson <[email protected]>

NOTE: This patch only exists to make the series compile. A separate patch
series that loads/initializes the TDX module will replace it.

Define and export four helper functions commonly used for KVM TDX support
and SEAMLDR: tdx_get_sysinfo(), tdx_seamcall_on_each_pkg(),
tdx_keyid_alloc() and tdx_keyid_free(). The SEAMLDR logic initializes
them at boot, and KVM TDX uses these functions to get system information,
operate on package-wide resources, and allocate/free TDX private key IDs.
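For illustration only, here is a minimal userspace sketch of the keyID
allocation semantics described above. It is not the kernel code (which uses an
IDA); KEYIDS_START and NR_KEYIDS are made-up stand-ins for the BIOS-reserved
TDX keyID range.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the BIOS-reserved TDX keyID range. */
#define KEYIDS_START 64
#define NR_KEYIDS    4

static bool keyid_used[KEYIDS_START + NR_KEYIDS];

/*
 * Mirrors tdx_keyid_alloc(): the first keyID in the range is reserved
 * for the global key, so only [start + 1, start + nr - 1] is handed out.
 */
static int keyid_alloc(void)
{
	int k;

	for (k = KEYIDS_START + 1; k <= KEYIDS_START + NR_KEYIDS - 1; k++) {
		if (!keyid_used[k]) {
			keyid_used[k] = true;
			return k;
		}
	}
	return -EBUSY;	/* range exhausted */
}

/* Mirrors tdx_keyid_free(): ignore zero/negative (never-allocated) IDs. */
static void keyid_free(int keyid)
{
	if (keyid <= 0)
		return;
	keyid_used[keyid] = false;
}
```

With a 4-keyID range starting at 64, only IDs 65-67 are allocatable, since 64
is reserved for the global key.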

Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/Kbuild | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/kvm_boot.h | 30 +++++
arch/x86/kvm/boot/Makefile | 6 +
arch/x86/kvm/boot/seam/tdx_common.c | 167 ++++++++++++++++++++++++++++
arch/x86/kvm/boot/seam/tdx_common.h | 13 +++
6 files changed, 219 insertions(+)
create mode 100644 arch/x86/include/asm/kvm_boot.h
create mode 100644 arch/x86/kvm/boot/Makefile
create mode 100644 arch/x86/kvm/boot/seam/tdx_common.c
create mode 100644 arch/x86/kvm/boot/seam/tdx_common.h

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 30dec019756b..4f35eaad7468 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -4,6 +4,7 @@ obj-y += entry/
obj-$(CONFIG_PERF_EVENTS) += events/

obj-$(CONFIG_KVM) += kvm/
+obj-$(subst m,y,$(CONFIG_KVM)) += kvm/boot/

# Xen paravirtualization support
obj-$(CONFIG_XEN) += xen/
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ac37830ae941..fe5cfc013444 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -230,6 +230,8 @@
#define X86_FEATURE_FLEXPRIORITY ( 8*32+ 2) /* Intel FlexPriority */
#define X86_FEATURE_EPT ( 8*32+ 3) /* Intel Extended Page Table */
#define X86_FEATURE_VPID ( 8*32+ 4) /* Intel Virtual Processor ID */
+#define X86_FEATURE_SEAM ( 8*32+ 5) /* "" Secure Arbitration Mode */
+#define X86_FEATURE_TDX ( 8*32+ 6) /* Intel Trust Domain Extensions */

#define X86_FEATURE_VMMCALL ( 8*32+15) /* Prefer VMMCALL to VMCALL */
#define X86_FEATURE_XENPV ( 8*32+16) /* "" Xen paravirtual guest */
diff --git a/arch/x86/include/asm/kvm_boot.h b/arch/x86/include/asm/kvm_boot.h
new file mode 100644
index 000000000000..3d58d4109566
--- /dev/null
+++ b/arch/x86/include/asm/kvm_boot.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_X86_KVM_BOOT_H
+#define _ASM_X86_KVM_BOOT_H
+
+#include <linux/cpumask.h>
+#include <linux/mutex.h>
+#include <linux/smp.h>
+#include <linux/types.h>
+#include <asm/processor.h>
+
+#ifdef CONFIG_KVM_INTEL_TDX
+
+/*
+ * Return pointer to TDX system info (TDSYSINFO_STRUCT) if TDX has been
+ * successfully initialized, or NULL.
+ */
+struct tdsysinfo_struct;
+const struct tdsysinfo_struct *tdx_get_sysinfo(void);
+
+extern u32 tdx_seam_keyid __ro_after_init;
+
+int tdx_seamcall_on_each_pkg(int (*fn)(void *), void *param);
+
+/* TDX keyID allocation functions */
+extern int tdx_keyid_alloc(void);
+extern void tdx_keyid_free(int keyid);
+
+#endif
+
+#endif /* _ASM_X86_KVM_BOOT_H */
diff --git a/arch/x86/kvm/boot/Makefile b/arch/x86/kvm/boot/Makefile
new file mode 100644
index 000000000000..a85eb5af90d5
--- /dev/null
+++ b/arch/x86/kvm/boot/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+asflags-y += -I$(srctree)/arch/x86/kvm
+ccflags-y += -I$(srctree)/arch/x86/kvm
+
+obj-$(CONFIG_KVM_INTEL_TDX) += seam/tdx_common.o
diff --git a/arch/x86/kvm/boot/seam/tdx_common.c b/arch/x86/kvm/boot/seam/tdx_common.c
new file mode 100644
index 000000000000..d803dbd11693
--- /dev/null
+++ b/arch/x86/kvm/boot/seam/tdx_common.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Common functions/symbols for SEAMLDR and KVM. */
+
+#include <linux/cpuhotplug.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/idr.h>
+
+#include <asm/kvm_boot.h>
+
+#include "vmx/tdx_arch.h"
+
+/*
+ * TDX system information returned by TDSYSINFO.
+ */
+struct tdsysinfo_struct tdx_tdsysinfo;
+
+/* KeyID range reserved to TDX by BIOS */
+u32 tdx_keyids_start;
+u32 tdx_nr_keyids;
+
+u32 tdx_seam_keyid __ro_after_init;
+EXPORT_SYMBOL_GPL(tdx_seam_keyid);
+
+/* TDX keyID pool */
+static DEFINE_IDA(tdx_keyid_pool);
+
+static int *tdx_package_masters __ro_after_init;
+
+static int tdx_starting_cpu(unsigned int cpu)
+{
+ int pkg = topology_physical_package_id(cpu);
+
+ /*
+ * If this package doesn't have a master CPU for IPI operation, use this
+ * CPU as package master.
+ */
+ if (tdx_package_masters && tdx_package_masters[pkg] == -1)
+ tdx_package_masters[pkg] = cpu;
+
+ return 0;
+}
+
+static int tdx_dying_cpu(unsigned int cpu)
+{
+ int pkg = topology_physical_package_id(cpu);
+ int other;
+
+ if (!tdx_package_masters || tdx_package_masters[pkg] != cpu)
+ return 0;
+
+ /*
+ * If offlining cpu was used as package master, find other online cpu on
+ * this package.
+ */
+ tdx_package_masters[pkg] = -1;
+ for_each_online_cpu(other) {
+ if (other == cpu)
+ continue;
+ if (topology_physical_package_id(other) != pkg)
+ continue;
+
+ tdx_package_masters[pkg] = other;
+ break;
+ }
+
+ return 0;
+}
+
+/*
+ * Setup one-cpu-per-pkg array to do package-scoped SEAMCALLs. The array is
+ * only necessary if there are multiple packages.
+ */
+int __init init_package_masters(void)
+{
+ int cpu, pkg, nr_filled, nr_pkgs;
+
+ nr_pkgs = topology_max_packages();
+ if (nr_pkgs == 1)
+ return 0;
+
+ tdx_package_masters = kcalloc(nr_pkgs, sizeof(int), GFP_KERNEL);
+ if (!tdx_package_masters)
+ return -ENOMEM;
+
+ memset(tdx_package_masters, -1, nr_pkgs * sizeof(int));
+
+ nr_filled = 0;
+ for_each_online_cpu(cpu) {
+ pkg = topology_physical_package_id(cpu);
+ if (tdx_package_masters[pkg] >= 0)
+ continue;
+
+ tdx_package_masters[pkg] = cpu;
+ if (++nr_filled == topology_max_packages())
+ break;
+ }
+
+ if (WARN_ON(nr_filled != topology_max_packages())) {
+ kfree(tdx_package_masters);
+ return -EIO;
+ }
+
+ if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "tdx/cpu:starting",
+ tdx_starting_cpu, tdx_dying_cpu) < 0) {
+ kfree(tdx_package_masters);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+int tdx_seamcall_on_each_pkg(int (*fn)(void *), void *param)
+{
+ int ret = 0;
+ int i;
+
+ cpus_read_lock();
+ if (!tdx_package_masters) {
+ ret = fn(param);
+ goto out;
+ }
+
+ for (i = 0; i < topology_max_packages(); i++) {
+ ret = smp_call_on_cpu(tdx_package_masters[i], fn, param, 1);
+ if (ret)
+ break;
+ }
+
+out:
+ cpus_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_seamcall_on_each_pkg);
+
+const struct tdsysinfo_struct *tdx_get_sysinfo(void)
+{
+ if (boot_cpu_has(X86_FEATURE_TDX))
+ return &tdx_tdsysinfo;
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
+
+int tdx_keyid_alloc(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TDX))
+ return -EINVAL;
+
+ if (WARN_ON_ONCE(!tdx_keyids_start || !tdx_nr_keyids))
+ return -EINVAL;
+
+ /* The first keyID is reserved for the global key. */
+ return ida_alloc_range(&tdx_keyid_pool, tdx_keyids_start + 1,
+ tdx_keyids_start + tdx_nr_keyids - 1,
+ GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(tdx_keyid_alloc);
+
+void tdx_keyid_free(int keyid)
+{
+ if (keyid <= 0)
+ return;
+
+ ida_free(&tdx_keyid_pool, keyid);
+}
+EXPORT_SYMBOL_GPL(tdx_keyid_free);
diff --git a/arch/x86/kvm/boot/seam/tdx_common.h b/arch/x86/kvm/boot/seam/tdx_common.h
new file mode 100644
index 000000000000..6f94ebb2b815
--- /dev/null
+++ b/arch/x86/kvm/boot/seam/tdx_common.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* common functions/symbols used by SEAMLDR and KVM */
+
+#ifndef __BOOT_SEAM_TDX_COMMON_H
+#define __BOOT_SEAM_TDX_COMMON_H
+
+extern struct tdsysinfo_struct tdx_tdsysinfo;
+extern u32 tdx_keyids_start;
+extern u32 tdx_nr_keyids;
+
+int __init init_package_masters(void);
+
+#endif /* __BOOT_SEAM_TDX_COMMON_H */
--
2.25.1

2021-07-02 22:07:12

by Isaku Yamahata

Subject: [RFC PATCH v2 09/69] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs

From: Sean Christopherson <[email protected]>

The TDX SEAMCALL interface is defined in [1], 20.2 "Host-Side (SEAMCALL)
Interface Functions". Define C wrapper functions for the SEAMCALLs that
later patches will use.

[1] TDX Module spec
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
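The wrappers all follow one pattern: a fixed leaf number plus operands
marshalled into the SEAMCALL input registers. A hedged userspace sketch of
that pattern, with a mock seamcall() that just records its arguments (the
real one is an assembly trampoline added elsewhere in the series):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;
typedef u64 hpa_t;

/* Mock: record the leaf and first two operands of the last SEAMCALL. */
static u64 last_op, last_rcx, last_rdx;

static u64 seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
		    void *ex)
{
	last_op = op;
	last_rcx = rcx;
	last_rdx = rdx;
	(void)r8; (void)r9; (void)r10; (void)ex;
	return 0;	/* TDX_SUCCESS */
}

#define TDH_MNG_CREATE 9

/* Same shape as the wrapper in tdx_ops.h. */
static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
{
	return seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
}
```

The point of the thin wrappers is that callers deal in typed arguments (TDR
address, HKID) rather than raw register values.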

Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx_ops.h | 205 +++++++++++++++++++++++++++++++++++++
1 file changed, 205 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..8afcffa267dc
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "seamcall.h"
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+ return seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, ex);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+ return seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+ return seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+ return seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+ return seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MNG_RD, tdr, field, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mng_wr(hpa_t tdr, u64 field, u64 val, u64 mask,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MNG_WR, tdr, field, val, mask, 0, ex);
+}
+
+static inline u64 tdh_phymem_page_rd(hpa_t addr, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_PHYMEM_PAGE_RD, addr, 0, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_phymem_page_wr(hpa_t addr, u64 val, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_PHYMEM_PAGE_WR, addr, val, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_PAGE_DEMOTE, gpa | level, tdr, page, 0, 0, ex);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+ return seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+ return seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+ return seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+ return seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+ return seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_promote(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_PAGE_PROMOTE, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_phymem_page_rdmd(hpa_t page, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_PHYMEM_PAGE_RDMD, page, 0, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_SEPT_RD, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_VP_RD, tdvpr, field, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+ return seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page, struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_SEPT_REMOVE, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+ return seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+ return seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+ return seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+ return seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_sept_wr(hpa_t tdr, gpa_t gpa, int level, u64 val,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_MEM_SEPT_WR, gpa | level, tdr, val, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+ struct tdx_ex_ret *ex)
+{
+ return seamcall(TDH_VP_WR, tdvpr, field, val, mask, 0, ex);
+}
+
+#endif /* __KVM_X86_TDX_OPS_H */
--
2.25.1

2021-07-02 22:07:17

by Isaku Yamahata

Subject: [RFC PATCH v2 05/69] KVM: TDX: Add architectural definitions for structures and values

From: Sean Christopherson <[email protected]>

Add structures and values that are architecturally defined in
[1] chapter 18 "ABI Reference: Data Types" and in [1] 20.2.1 "SEAMCALL
Instruction (Common)", Table 20.4 "SEAMCALL Instruction Leaf Numbers
Definition".

[1] TDX Module Spec
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
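Among the definitions below are the TDX control-structure field access codes:
a class ID in bits 63:56 and a 32-bit field identifier in bits 31:0. A
minimal userspace restatement of that encoding (the constants mirror the
patch; GENMASK_ULL is expanded by hand):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Class ID in bits 63:56, field identifier in bits 31:0. */
#define TDX_CLASS_SHIFT	56
#define TDX_FIELD_MASK	0xffffffffULL	/* GENMASK_ULL(31, 0) */

#define BUILD_TDX_FIELD(class, field) \
	(((u64)(class) << TDX_CLASS_SHIFT) | ((u64)(field) & TDX_FIELD_MASK))

/* Class 0: VMCS fields, keyed by the VMCS field encoding. */
#define TDVPS_VMCS(field)	BUILD_TDX_FIELD(0, (field))
/* Class 19: MSRs, keyed by MSR index. */
#define TDVPS_MSR(msr)		BUILD_TDX_FIELD(19, (msr))
```

For example, a VMCS field keeps its encoding unchanged (class 0), while an
MSR access code carries class 19 in the top byte.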

Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Co-developed-by: Chao Gao <[email protected]>
Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx_arch.h | 307 ++++++++++++++++++++++++++++++++++++
1 file changed, 307 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..57e9ea4a7fad
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,307 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define SEAMCALL_TDH_VP_ENTER 0
+#define SEAMCALL_TDH_MNG_ADDCX 1
+#define SEAMCALL_TDH_MEM_PAGE_ADD 2
+#define SEAMCALL_TDH_MEM_SEPT_ADD 3
+#define SEAMCALL_TDH_VP_ADDCX 4
+#define SEAMCALL_TDH_MEM_PAGE_AUG 6
+#define SEAMCALL_TDH_MEM_RANGE_BLOCK 7
+#define SEAMCALL_TDH_MNG_KEY_CONFIG 8
+#define SEAMCALL_TDH_MNG_CREATE 9
+#define SEAMCALL_TDH_VP_CREATE 10
+#define SEAMCALL_TDH_MNG_RD 11
+#define SEAMCALL_TDH_PHYMEM_PAGE_RD 12
+#define SEAMCALL_TDH_MNG_WR 13
+#define SEAMCALL_TDH_PHYMEM_PAGE_WR 14
+#define SEAMCALL_TDH_MEM_PAGE_DEMOTE 15
+#define SEAMCALL_TDH_MR_EXTEND 16
+#define SEAMCALL_TDH_MR_FINALIZE 17
+#define SEAMCALL_TDH_VP_FLUSH 18
+#define SEAMCALL_TDH_MNG_VPFLUSHDONE 19
+#define SEAMCALL_TDH_MNG_KEY_FREEID 20
+#define SEAMCALL_TDH_MNG_INIT 21
+#define SEAMCALL_TDH_VP_INIT 22
+#define SEAMCALL_TDH_MEM_PAGE_PROMOTE 23
+#define SEAMCALL_TDH_PHYMEM_PAGE_RDMD 24
+#define SEAMCALL_TDH_MEM_SEPT_RD 25
+#define SEAMCALL_TDH_VP_RD 26
+#define SEAMCALL_TDH_MNG_KEY_RECLAIMID 27
+#define SEAMCALL_TDH_PHYMEM_PAGE_RECLAIM 28
+#define SEAMCALL_TDH_MEM_PAGE_REMOVE 29
+#define SEAMCALL_TDH_MEM_SEPT_REMOVE 30
+#define SEAMCALL_TDH_SYS_KEY_CONFIG 31
+#define SEAMCALL_TDH_SYS_INFO 32
+#define SEAMCALL_TDH_SYS_INIT 33
+#define SEAMCALL_TDH_SYS_LP_INIT 35
+#define SEAMCALL_TDH_SYS_TDMR_INIT 36
+#define SEAMCALL_TDH_MEM_TRACK 38
+#define SEAMCALL_TDH_MEM_RANGE_UNBLOCK 39
+#define SEAMCALL_TDH_PHYMEM_CACHE_WB 40
+#define SEAMCALL_TDH_PHYMEM_PAGE_WBINVD 41
+#define SEAMCALL_TDH_MEM_SEPT_WR 42
+#define SEAMCALL_TDH_VP_WR 43
+#define SEAMCALL_TDH_SYS_LP_SHUTDOWN 44
+#define SEAMCALL_TDH_SYS_CONFIG 45
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO 0x10000
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDG_VP_VMCALL_GET_QUOTE 0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_CLASS_SHIFT 56
+#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
+
+#define BUILD_TDX_FIELD(class, field) \
+ (((u64)(class) << TDX_CLASS_SHIFT) | ((u64)(field) & TDX_FIELD_MASK))
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field) BUILD_TDX_FIELD(0, (field))
+
+/*
+ * @offset is the offset (in bytes) from the beginning of the architectural
+ * virtual APIC page.
+ */
+#define TDVPS_APIC(offset) BUILD_TDX_FIELD(1, (offset))
+
+/* @gpr is the index of a general purpose register, e.g. eax=0 */
+#define TDVPS_GPR(gpr) BUILD_TDX_FIELD(16, (gpr))
+
+#define TDVPS_DR(dr) BUILD_TDX_FIELD(17, (0 + (dr)))
+
+enum tdx_guest_other_state {
+ TD_VCPU_XCR0 = 32,
+ TD_VCPU_IWK_ENCKEY0 = 64,
+ TD_VCPU_IWK_ENCKEY1,
+ TD_VCPU_IWK_ENCKEY2,
+ TD_VCPU_IWK_ENCKEY3,
+ TD_VCPU_IWK_INTKEY0 = 68,
+ TD_VCPU_IWK_INTKEY1,
+ TD_VCPU_IWK_FLAGS = 70,
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field) BUILD_TDX_FIELD(17, (field))
+
+/* @msr is the MSR index */
+#define TDVPS_MSR(msr) BUILD_TDX_FIELD(19, (msr))
+
+/* Management class fields */
+enum tdx_guest_management {
+ TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_guest_management */
+#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(32, (field))
+
+#define TDX1_NR_TDCX_PAGES 4
+#define TDX1_NR_TDVPX_PAGES 5
+
+#define TDX1_MAX_NR_CPUID_CONFIGS 6
+#define TDX1_MAX_NR_CMRS 32
+#define TDX1_MAX_NR_TDMRS 64
+#define TDX1_MAX_NR_RSVD_AREAS 16
+#define TDX1_PAMT_ENTRY_SIZE 16
+#define TDX1_EXTENDMR_CHUNKSIZE 256
+
+struct tdx_cpuid_config {
+ u32 leaf;
+ u32 sub_leaf;
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+struct tdx_cpuid_value {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+#define TDX1_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
+#define TDX1_TD_ATTRIBUTE_PKS BIT_ULL(30)
+#define TDX1_TD_ATTRIBUTE_KL BIT_ULL(31)
+#define TDX1_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+ u64 attributes;
+ u64 xfam;
+ u32 max_vcpus;
+ u32 reserved0;
+
+ u64 eptp_controls;
+ u64 exec_controls;
+ u16 tsc_frequency;
+ u8 reserved1[38];
+
+ u64 mrconfigid[6];
+ u64 mrowner[6];
+ u64 mrownerconfig[6];
+ u64 reserved2[4];
+
+ union {
+ struct tdx_cpuid_value cpuid_values[0];
+ u8 reserved3[768];
+ };
+} __packed __aligned(1024);
+
+/* Guest uses MAX_PA for GPAW when set. */
+#define TDX1_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
+
+/*
+ * TDX1 requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. TDX-SEAM
+ * can only program frequencies that are multiples of 25MHz. The frequency
+ * must be between 100MHz and 10GHz (inclusive).
+ */
+#define TDX1_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
+#define TDX1_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
+#define TDX1_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
+#define TDX1_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
+
+struct tdmr_reserved_area {
+ u64 offset;
+ u64 size;
+} __packed;
+
+#define TDX_TDMR_ADDR_ALIGNMENT 512
+#define TDX_TDMR_INFO_ALIGNMENT 512
+struct tdmr_info {
+ u64 base;
+ u64 size;
+ u64 pamt_1g_base;
+ u64 pamt_1g_size;
+ u64 pamt_2m_base;
+ u64 pamt_2m_size;
+ u64 pamt_4k_base;
+ u64 pamt_4k_size;
+ struct tdmr_reserved_area reserved_areas[TDX1_MAX_NR_RSVD_AREAS];
+} __packed __aligned(TDX_TDMR_INFO_ALIGNMENT);
+
+#define TDX_CMR_INFO_ARRAY_ALIGNMENT 512
+struct cmr_info {
+ u64 base;
+ u64 size;
+} __packed;
+
+#define TDX_TDSYSINFO_STRUCT_ALIGNMENT 1024
+struct tdsysinfo_struct {
+ /* TDX-SEAM Module Info */
+ u32 attributes;
+ u32 vendor_id;
+ u32 build_date;
+ u16 build_num;
+ u16 minor_version;
+ u16 major_version;
+ u8 reserved0[14];
+ /* Memory Info */
+ u16 max_tdmrs;
+ u16 max_reserved_per_tdmr;
+ u16 pamt_entry_size;
+ u8 reserved1[10];
+ /* Control Struct Info */
+ u16 tdcs_base_size;
+ u8 reserved2[2];
+ u16 tdvps_base_size;
+ u8 tdvps_xfam_dependent_size;
+ u8 reserved3[9];
+ /* TD Capabilities */
+ u64 attributes_fixed0;
+ u64 attributes_fixed1;
+ u64 xfam_fixed0;
+ u64 xfam_fixed1;
+ u8 reserved4[32];
+ u32 num_cpuid_config;
+ union {
+ struct tdx_cpuid_config cpuid_configs[0];
+ u8 reserved5[892];
+ };
+} __packed __aligned(TDX_TDSYSINFO_STRUCT_ALIGNMENT);
+
+struct tdx_ex_ret {
+ union {
+ /* Used to retrieve values from hardware. */
+ struct {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+ };
+ /* Functions that walk SEPT */
+ struct {
+ u64 septe;
+ struct {
+ u64 level :3;
+ u64 sept_reserved_0 :5;
+ u64 state :8;
+ u64 sept_reserved_1 :48;
+ };
+ };
+ /* TDH_MNG_{RD,WR} return the TDR, field code, and value. */
+ struct {
+ u64 tdr;
+ u64 field;
+ u64 field_val;
+ };
+ /* TDH_MNG_{RD,WR}MEM return the address and its value. */
+ struct {
+ u64 addr;
+ u64 val;
+ };
+ /* TDH_PHYMEM_PAGE_RDMD and TDH_PHYMEM_PAGE_RECLAIM return page metadata. */
+ struct {
+ u64 page_type;
+ u64 owner;
+ u64 page_size;
+ };
+ /*
+ * TDH_SYS_INFO returns the buffer address and its size, and the
+ * CMR_INFO address and its number of entries.
+ */
+ struct {
+ u64 buffer;
+ u64 nr_bytes;
+ u64 cmr_info;
+ u64 nr_cmr_entries;
+ };
+ /*
+ * TDH_MNG_INIT and TDH_SYS_INIT return CPUID info on error. Note, only
+ * the leaf and subleaf are valid on TDH_MNG_INIT error.
+ */
+ struct {
+ u32 leaf;
+ u32 subleaf;
+ u32 eax_mask;
+ u32 ebx_mask;
+ u32 ecx_mask;
+ u32 edx_mask;
+ u32 eax_val;
+ u32 ebx_val;
+ u32 ecx_val;
+ u32 edx_val;
+ };
+ /* TDH_SYS_TDMR_INIT returns the input PA and next PA. */
+ struct {
+ u64 prev;
+ u64 next;
+ };
+ };
+};
+
+#endif /* __KVM_X86_TDX_ARCH_H */
--
2.25.1

2021-07-02 22:07:20

by Isaku Yamahata


From: Sean Christopherson <[email protected]>

Add TDX completion status codes for SEAMCALL and TDG.VP.VMCALL.

TDX-SEAM uses bits 31:0 of RAX to return additional information, so these
SEAMCALL error codes will only exactly match RAX[63:32]; see [1], Section
17.1, "Interface Function Completion Status Codes". Completion status
codes for TDG.VP.VMCALL are defined in [2], Chapter 3, "TDG.VP.VMCALL
Interface".

[1] Intel TDX Module Spec
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

[2] TDX Guest-Host Communication interface for Intel Trust Domain Extensions
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx_errno.h | 106 +++++++++++++++++++++++++++++++++++
1 file changed, 106 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..675acea412c9
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_SUCCESS 0x0000000000000000
+#define TDX_NON_RECOVERABLE_VCPU 0x4000000100000000
+#define TDX_NON_RECOVERABLE_TD 0x4000000200000000
+#define TDX_INTERRUPTED_RESUMABLE 0x8000000300000000
+#define TDX_INTERRUPTED_RESTARTABLE 0x8000000400000000
+#define TDX_NON_RECOVERABLE_TD_FATAL 0x4000000500000000
+#define TDX_INVALID_RESUMPTION 0xC000000600000000
+#define TDX_NON_RECOVERABLE_TD_NO_APIC 0xC000000700000000
+#define TDX_OPERAND_INVALID 0xC000010000000000
+#define TDX_OPERAND_ADDR_RANGE_ERROR 0xC000010100000000
+#define TDX_OPERAND_BUSY 0x8000020000000000
+#define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000
+#define TDX_SYS_BUSY 0x8000020200000000
+#define TDX_PAGE_METADATA_INCORRECT 0xC000030000000000
+#define TDX_PAGE_ALREADY_FREE 0x0000030100000000
+#define TDX_PAGE_NOT_OWNED_BY_TD 0xC000030200000000
+#define TDX_PAGE_NOT_FREE 0xC000030300000000
+#define TDX_TD_ASSOCIATED_PAGES_EXIST 0xC000040000000000
+#define TDX_SYSINIT_NOT_PENDING 0xC000050000000000
+#define TDX_SYSINIT_NOT_DONE 0xC000050100000000
+#define TDX_SYSINITLP_NOT_DONE 0xC000050200000000
+#define TDX_SYSINITLP_DONE 0xC000050300000000
+#define TDX_SYS_NOT_READY 0xC000050500000000
+#define TDX_SYS_SHUTDOWN 0xC000050600000000
+#define TDX_SYSCONFIG_NOT_DONE 0xC000050700000000
+#define TDX_TD_NOT_INITIALIZED 0xC000060000000000
+#define TDX_TD_INITIALIZED 0xC000060100000000
+#define TDX_TD_NOT_FINALIZED 0xC000060200000000
+#define TDX_TD_FINALIZED 0xC000060300000000
+#define TDX_TD_FATAL 0xC000060400000000
+#define TDX_TD_NON_DEBUG 0xC000060500000000
+#define TDX_TDCX_NUM_INCORRECT 0xC000061000000000
+#define TDX_VCPU_STATE_INCORRECT 0xC000070000000000
+#define TDX_VCPU_ASSOCIATED 0x8000070100000000
+#define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000
+#define TDX_TDVPX_NUM_INCORRECT 0xC000070300000000
+#define TDX_NO_VALID_VE_INFO 0xC000070400000000
+#define TDX_MAX_VCPUS_EXCEEDED 0xC000070500000000
+#define TDX_TSC_ROLLBACK 0xC000070600000000
+#define TDX_FIELD_NOT_WRITABLE 0xC000072000000000
+#define TDX_FIELD_NOT_READABLE 0xC000072100000000
+#define TDX_TD_VMCS_FIELD_NOT_INITIALIZED 0xC000073000000000
+#define TDX_KEY_GENERATION_FAILED 0x8000080000000000
+#define TDX_TD_KEYS_NOT_CONFIGURED 0x8000081000000000
+#define TDX_KEY_STATE_INCORRECT 0xC000081100000000
+#define TDX_KEY_CONFIGURED 0x0000081500000000
+#define TDX_WBCACHE_NOT_COMPLETE 0x8000081700000000
+#define TDX_HKID_NOT_FREE 0xC000082000000000
+#define TDX_NO_HKID_READY_TO_WBCACHE 0x0000082100000000
+#define TDX_WBCACHE_RESUME_ERROR 0xC000082300000000
+#define TDX_FLUSHVP_NOT_DONE 0x8000082400000000
+#define TDX_NUM_ACTIVATED_HKIDS_NOT_SUPPORTED 0xC000082500000000
+#define TDX_INCORRECT_CPUID_VALUE 0xC000090000000000
+#define TDX_BOOT_NT4_SET 0xC000090100000000
+#define TDX_INCONSISTENT_CPUID_FIELD 0xC000090200000000
+#define TDX_CPUID_LEAF_1F_FORMAT_UNRECOGNIZED 0xC000090400000000
+#define TDX_INVALID_WBINVD_SCOPE 0xC000090500000000
+#define TDX_INVALID_PKG_ID 0xC000090600000000
+#define TDX_CPUID_LEAF_NOT_SUPPORTED 0xC000090800000000
+#define TDX_SMRR_NOT_LOCKED 0xC000091000000000
+#define TDX_INVALID_SMRR_CONFIGURATION 0xC000091100000000
+#define TDX_SMRR_OVERLAPS_CMR 0xC000091200000000
+#define TDX_SMRR_LOCK_NOT_SUPPORTED 0xC000091300000000
+#define TDX_SMRR_NOT_SUPPORTED 0xC000091400000000
+#define TDX_INCONSISTENT_MSR 0xC000092000000000
+#define TDX_INCORRECT_MSR_VALUE 0xC000092100000000
+#define TDX_SEAMREPORT_NOT_AVAILABLE 0xC000093000000000
+#define TDX_PERF_COUNTERS_ARE_PEBS_ENABLED 0x8000094000000000
+#define TDX_INVALID_TDMR 0xC0000A0000000000
+#define TDX_NON_ORDERED_TDMR 0xC0000A0100000000
+#define TDX_TDMR_OUTSIDE_CMRS 0xC0000A0200000000
+#define TDX_TDMR_ALREADY_INITIALIZED 0x00000A0300000000
+#define TDX_INVALID_PAMT 0xC0000A1000000000
+#define TDX_PAMT_OUTSIDE_CMRS 0xC0000A1100000000
+#define TDX_PAMT_OVERLAP 0xC0000A1200000000
+#define TDX_INVALID_RESERVED_IN_TDMR 0xC0000A2000000000
+#define TDX_NON_ORDERED_RESERVED_IN_TDMR 0xC0000A2100000000
+#define TDX_CMR_LIST_INVALID 0xC0000A2200000000
+#define TDX_EPT_WALK_FAILED 0xC0000B0000000000
+#define TDX_EPT_ENTRY_FREE 0xC0000B0100000000
+#define TDX_EPT_ENTRY_NOT_FREE 0xC0000B0200000000
+#define TDX_EPT_ENTRY_NOT_PRESENT 0xC0000B0300000000
+#define TDX_EPT_ENTRY_NOT_LEAF 0xC0000B0400000000
+#define TDX_EPT_ENTRY_LEAF 0xC0000B0500000000
+#define TDX_GPA_RANGE_NOT_BLOCKED 0xC0000B0600000000
+#define TDX_GPA_RANGE_ALREADY_BLOCKED 0x00000B0700000000
+#define TDX_TLB_TRACKING_NOT_DONE 0xC0000B0800000000
+#define TDX_EPT_INVALID_PROMOTE_CONDITIONS 0xC0000B0900000000
+#define TDX_PAGE_ALREADY_ACCEPTED 0x00000B0A00000000
+#define TDX_PAGE_SIZE_MISMATCH 0xC0000B0B00000000
+
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDG_VP_VMCALL_SUCCESS 0x0000000000000000
+#define TDG_VP_VMCALL_INVALID_OPERAND 0x8000000000000000
+#define TDG_VP_VMCALL_TDREPORT_FAILED 0x8000000000000001
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
--
2.25.1

2021-07-02 22:07:21

by Isaku Yamahata

Subject: [RFC PATCH v2 15/69] KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO

From: Sean Christopherson <[email protected]>

Export the kvm_mmio tracepoint so that kvm_intel.ko can use it in a later
patch to trace TDX paravirtualized MMIO.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 795f83a1cf9a..cc45b2c47672 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11962,6 +11962,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);

EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
--
2.25.1

2021-07-02 22:07:23

by Isaku Yamahata

Subject: [RFC PATCH v2 16/69] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default

From: Sean Christopherson <[email protected]>

Zap only leaf SPTEs when deleting/moving a memslot by default, and add a
module param to allow reverting to the old behavior of zapping all SPTEs,
at all levels and in all memslots, whenever any memslot is updated.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8d5876dfc6b7..5b8a640f8042 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -85,6 +85,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
static bool __read_mostly force_flush_and_sync_on_reuse;
module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);

+static bool __read_mostly memslot_update_zap_all;
+module_param(memslot_update_zap_all, bool, 0444);
+
/*
* When setting this variable to true it enables Two-Dimensional-Paging
* where the hardware walks 2 page tables:
@@ -5480,11 +5483,27 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
}

+static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ /*
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required; in the
+ * worst case we'll have unused shadow pages lying around until they are
+ * recycled due to age or when the VM is destroyed.
+ */
+ write_lock(&kvm->mmu_lock);
+ slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
+ KVM_MAX_HUGEPAGE_LEVEL, true);
+ write_unlock(&kvm->mmu_lock);
+}
+
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
{
- kvm_mmu_zap_all_fast(kvm);
+ if (memslot_update_zap_all)
+ kvm_mmu_zap_all_fast(kvm);
+ else
+ kvm_mmu_zap_memslot(kvm, slot);
}

void kvm_mmu_init_vm(struct kvm *kvm)
--
2.25.1

2021-07-02 22:07:28

by Isaku Yamahata

Subject: [RFC PATCH v2 23/69] KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()

From: Sean Christopherson <[email protected]>

Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run(). This
allows a future patch to allow synchronizing select state for protected
VMs.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7110d48cbc1..271245ffc67c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9729,7 +9729,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
goto out;
}

- if (kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) {
+ if ((kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) ||
+ (kvm_run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS)) {
r = -EINVAL;
goto out;
}
@@ -10264,9 +10265,6 @@ static void store_regs(struct kvm_vcpu *vcpu)

static int sync_regs(struct kvm_vcpu *vcpu)
{
- if (vcpu->run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS)
- return -EINVAL;
-
if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
__set_regs(vcpu, &vcpu->run->s.regs.regs);
vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
--
2.25.1

2021-07-02 22:07:40

by Isaku Yamahata

Subject: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

From: Sean Christopherson <[email protected]>

Add a vm_type field to the VM and a capability that allows userspace to
query which VM types are supported by KVM.

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/uapi/asm/kvm.h | 4 ++++
arch/x86/kvm/svm/svm.c | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 6 ++++++
arch/x86/kvm/x86.c | 9 ++++++++-
include/uapi/linux/kvm.h | 2 ++
tools/arch/x86/include/uapi/asm/kvm.h | 4 ++++
tools/include/uapi/linux/kvm.h | 2 ++
9 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e7bef91cee04..01457da0162b 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -18,6 +18,7 @@ KVM_X86_OP_NULL(hardware_unsetup)
KVM_X86_OP_NULL(cpu_has_accelerated_tpr)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP(vm_init)
KVM_X86_OP_NULL(vm_destroy)
KVM_X86_OP(vcpu_create)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 80b943e4ab6d..301b10172cbf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -975,6 +975,7 @@ struct kvm_x86_msr_filter {
#define APICV_INHIBIT_REASON_X2APIC 5

struct kvm_arch {
+ unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -1207,6 +1208,7 @@ struct kvm_x86_ops {
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

+ bool (*is_vm_type_supported)(unsigned long vm_type);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0662f644aad9..8341ec720b3f 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -490,4 +490,8 @@ struct kvm_pmu_event_filter {
#define KVM_PMU_EVENT_ALLOW 0
#define KVM_PMU_EVENT_DENY 1

+#define KVM_X86_LEGACY_VM 0
+#define KVM_X86_SEV_ES_VM 1
+#define KVM_X86_TDX_VM 2
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 25c72925eb8a..286a49b09269 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4422,6 +4422,11 @@ static void svm_vm_destroy(struct kvm *kvm)
sev_vm_destroy(kvm);
}

+static bool svm_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_LEGACY_VM;
+}
+
static int svm_vm_init(struct kvm *kvm)
{
if (!pause_filter_count || !pause_filter_thresh)
@@ -4448,6 +4453,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_free = svm_free_vcpu,
.vcpu_reset = svm_vcpu_reset,

+ .is_vm_type_supported = svm_is_vm_type_supported,
.vm_size = sizeof(struct kvm_svm),
.vm_init = svm_vm_init,
.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6c043a160b30..84c2df824ecc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6951,6 +6951,11 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
return err;
}

+static bool vmx_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_LEGACY_VM;
+}
+
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

@@ -7605,6 +7610,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.cpu_has_accelerated_tpr = report_flexpriority,
.has_emulated_msr = vmx_has_emulated_msr,

+ .is_vm_type_supported = vmx_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9244d1d560d5..d7110d48cbc1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3995,6 +3995,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
else
r = 0;
break;
+ case KVM_CAP_VM_TYPES:
+ r = BIT(KVM_X86_LEGACY_VM);
+ if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
+ r |= BIT(KVM_X86_TDX_VM);
+ break;
default:
break;
}
@@ -10746,9 +10751,11 @@ void kvm_arch_free_vm(struct kvm *kvm)

int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
- if (type)
+ if (!static_call(kvm_x86_is_vm_type_supported)(type))
return -EINVAL;

+ kvm->arch.vm_type = type;
+
INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 79d9c44d1ad7..52b3e212037a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1084,6 +1084,8 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
#define KVM_CAP_PTP_KVM 198

+#define KVM_CAP_VM_TYPES 1000
+
#ifdef KVM_CAP_IRQ_ROUTING

struct kvm_irq_routing_irqchip {
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 0662f644aad9..8341ec720b3f 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -490,4 +490,8 @@ struct kvm_pmu_event_filter {
#define KVM_PMU_EVENT_ALLOW 0
#define KVM_PMU_EVENT_DENY 1

+#define KVM_X86_LEGACY_VM 0
+#define KVM_X86_SEV_ES_VM 1
+#define KVM_X86_TDX_VM 2
+
#endif /* _ASM_X86_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 79d9c44d1ad7..52b3e212037a 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1084,6 +1084,8 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
#define KVM_CAP_PTP_KVM 198

+#define KVM_CAP_VM_TYPES 1000
+
#ifdef KVM_CAP_IRQ_ROUTING

struct kvm_irq_routing_irqchip {
--
2.25.1

2021-07-02 22:07:44

by Isaku Yamahata

Subject: [RFC PATCH v2 24/69] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls

From: Sean Christopherson <[email protected]>

Add 'guest_state_protected' to mark a VM's state as being protected by
hardware/firmware, e.g. SEV-ES or TDX-SEAM. Use the flag to disallow
ioctls() and/or flows that attempt to access protected state.

Return an error if userspace attempts to get/set register state for a
protected VM, e.g. a non-debug TDX guest. KVM can't provide sane data;
it's userspace's responsibility to avoid attempting to access guest state
when it's known to be inaccessible.

Retrieving vCPU events is the one exception, as the userspace VMM is
allowed to inject NMIs.

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 104 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 271245ffc67c..b89845dfb679 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4297,6 +4297,10 @@ static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)

static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
{
+ /* TODO: use more precise flag */
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
kvm_make_request(KVM_REQ_SMI, vcpu);

return 0;
@@ -4343,6 +4347,10 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
unsigned bank_num = mcg_cap & 0xff;
u64 *banks = vcpu->arch.mce_banks;

+ /* TODO: use more precise flag */
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (mce->bank >= bank_num || !(mce->status & MCI_STATUS_VAL))
return -EINVAL;
/*
@@ -4438,7 +4446,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
events->interrupt.nr = vcpu->arch.interrupt.nr;
events->interrupt.soft = 0;
- events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+ if (!vcpu->arch.guest_state_protected)
+ events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);

events->nmi.injected = vcpu->arch.nmi_injected;
events->nmi.pending = vcpu->arch.nmi_pending != 0;
@@ -4467,11 +4476,17 @@ static void kvm_smm_changed(struct kvm_vcpu *vcpu);
static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events)
{
- if (events->flags & ~(KVM_VCPUEVENT_VALID_NMI_PENDING
- | KVM_VCPUEVENT_VALID_SIPI_VECTOR
- | KVM_VCPUEVENT_VALID_SHADOW
- | KVM_VCPUEVENT_VALID_SMM
- | KVM_VCPUEVENT_VALID_PAYLOAD))
+ u32 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING |
+ KVM_VCPUEVENT_VALID_SIPI_VECTOR |
+ KVM_VCPUEVENT_VALID_SHADOW |
+ KVM_VCPUEVENT_VALID_SMM |
+ KVM_VCPUEVENT_VALID_PAYLOAD;
+
+ /* TODO: introduce more precise flag */
+ if (vcpu->arch.guest_state_protected)
+ allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING;
+
+ if (events->flags & ~allowed_flags)
return -EINVAL;

if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
@@ -4552,17 +4567,22 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
return 0;
}

-static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
- struct kvm_debugregs *dbgregs)
+static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
+ struct kvm_debugregs *dbgregs)
{
unsigned long val;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db));
kvm_get_dr(vcpu, 6, &val);
dbgregs->dr6 = val;
dbgregs->dr7 = vcpu->arch.dr7;
dbgregs->flags = 0;
memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
+
+ return 0;
}

static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
@@ -4576,6 +4596,9 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
if (!kvm_dr7_valid(dbgregs->dr7))
return -EINVAL;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
memcpy(vcpu->arch.db, dbgregs->db, sizeof(vcpu->arch.db));
kvm_update_dr0123(vcpu);
vcpu->arch.dr6 = dbgregs->dr6;
@@ -4671,11 +4694,14 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
}
}

-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
- struct kvm_xsave *guest_xsave)
+static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+ struct kvm_xsave *guest_xsave)
{
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (!vcpu->arch.guest_fpu)
- return;
+ return 0;

if (boot_cpu_has(X86_FEATURE_XSAVE)) {
memset(guest_xsave, 0, sizeof(struct kvm_xsave));
@@ -4687,6 +4713,8 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
XFEATURE_MASK_FPSSE;
}
+
+ return 0;
}

#define XSAVE_MXCSR_OFFSET 24
@@ -4697,6 +4725,9 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
u64 xstate_bv;
u32 mxcsr;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (!vcpu->arch.guest_fpu)
return 0;

@@ -4722,18 +4753,22 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
return 0;
}

-static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
- struct kvm_xcrs *guest_xcrs)
+static int kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
+ struct kvm_xcrs *guest_xcrs)
{
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
guest_xcrs->nr_xcrs = 0;
- return;
+ return 0;
}

guest_xcrs->nr_xcrs = 1;
guest_xcrs->flags = 0;
guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK;
guest_xcrs->xcrs[0].value = vcpu->arch.xcr0;
+ return 0;
}

static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
@@ -4741,6 +4776,9 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
{
int i, r = 0;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (!boot_cpu_has(X86_FEATURE_XSAVE))
return -EINVAL;

@@ -5011,7 +5049,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
case KVM_GET_DEBUGREGS: {
struct kvm_debugregs dbgregs;

- kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
+ r = kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
+ if (r)
+ break;

r = -EFAULT;
if (copy_to_user(argp, &dbgregs,
@@ -5037,7 +5077,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (!u.xsave)
break;

- kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
+ r = kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
+ if (r)
+ break;

r = -EFAULT;
if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave)))
@@ -5061,7 +5103,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (!u.xcrs)
break;

- kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
+ r = kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
+ if (r)
+ break;

r = -EFAULT;
if (copy_to_user(argp, u.xcrs,
@@ -9735,6 +9779,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
goto out;
}

+ if (vcpu->arch.guest_state_protected &&
+ (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs)) {
+ r = -EINVAL;
+ goto out;
+ }
+
if (kvm_run->kvm_dirty_regs) {
r = sync_regs(vcpu);
if (r != 0)
@@ -9765,7 +9815,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)

out:
kvm_put_guest_fpu(vcpu);
- if (kvm_run->kvm_valid_regs)
+ if (kvm_run->kvm_valid_regs && !vcpu->arch.guest_state_protected)
store_regs(vcpu);
post_kvm_run_save(vcpu);
kvm_sigset_deactivate(vcpu);
@@ -9812,6 +9862,9 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)

int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
vcpu_load(vcpu);
__get_regs(vcpu, regs);
vcpu_put(vcpu);
@@ -9852,6 +9905,9 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)

int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
vcpu_load(vcpu);
__set_regs(vcpu, regs);
vcpu_put(vcpu);
@@ -9912,6 +9968,9 @@ static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
struct kvm_sregs *sregs)
{
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
vcpu_load(vcpu);
__get_sregs(vcpu, sregs);
vcpu_put(vcpu);
@@ -10112,6 +10171,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
{
int ret;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
vcpu_load(vcpu);
ret = __set_sregs(vcpu, sregs);
vcpu_put(vcpu);
@@ -10205,6 +10267,9 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
struct fxregs_state *fxsave;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (!vcpu->arch.guest_fpu)
return 0;

@@ -10228,6 +10293,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
struct fxregs_state *fxsave;

+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+
if (!vcpu->arch.guest_fpu)
return 0;

--
2.25.1

2021-07-02 22:07:54

by Isaku Yamahata

Subject: [RFC PATCH v2 21/69] KVM: Add max_vcpus field in common 'struct kvm'

From: Sean Christopherson <[email protected]>

Move max_vcpus out of the arm64-specific "struct kvm_arch" and into the
common "struct kvm" so that any architecture can impose a per-VM vCPU
limit; kvm_vm_ioctl_create_vcpu() now checks against kvm->max_vcpus
(initialized to KVM_MAX_VCPUS) instead of the global constant.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 3 ---
arch/arm64/kvm/arm.c | 7 ++-----
arch/arm64/kvm/vgic/vgic-init.c | 6 +++---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 3 ++-
5 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..96a0dc3a8780 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -106,9 +106,6 @@ struct kvm_arch {
/* VTCR_EL2 value for this VM */
u64 vtcr;

- /* The maximum number of vCPUs depends on the used GIC model */
- int max_vcpus;
-
/* Interrupt controller */
struct vgic_dist vgic;

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e720148232a0..a46306cf3106 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -145,7 +145,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_vgic_early_init(kvm);

/* The maximum number of VCPUs is limited by the host's GIC model */
- kvm->arch.max_vcpus = kvm_arm_default_max_vcpus();
+ kvm->max_vcpus = kvm_arm_default_max_vcpus();

set_default_spectre(kvm);

@@ -220,7 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_MAX_VCPUS:
case KVM_CAP_MAX_VCPU_ID:
if (kvm)
- r = kvm->arch.max_vcpus;
+ r = kvm->max_vcpus;
else
r = kvm_arm_default_max_vcpus();
break;
@@ -299,9 +299,6 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
return -EBUSY;

- if (id >= kvm->arch.max_vcpus)
- return -EINVAL;
-
return 0;
}

diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 58cbda00e56d..089ac00c55d7 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -97,11 +97,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
ret = 0;

if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
- kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS;
+ kvm->max_vcpus = VGIC_V2_MAX_CPUS;
else
- kvm->arch.max_vcpus = VGIC_V3_MAX_CPUS;
+ kvm->max_vcpus = VGIC_V3_MAX_CPUS;

- if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) {
+ if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) {
ret = -E2BIG;
goto out_unlock;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e87f07c5c601..ddd4d0f68cdf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -544,6 +544,7 @@ struct kvm {
* and is accessed atomically.
*/
atomic_t online_vcpus;
+ int max_vcpus;
int created_vcpus;
int last_boosted_vcpu;
struct list_head vm_list;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dc752d0bd3ec..52d40ea75749 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -910,6 +910,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
mutex_init(&kvm->irq_lock);
mutex_init(&kvm->slots_lock);
INIT_LIST_HEAD(&kvm->devices);
+ kvm->max_vcpus = KVM_MAX_VCPUS;

BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);

@@ -3329,7 +3330,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
return -EINVAL;

mutex_lock(&kvm->lock);
- if (kvm->created_vcpus == KVM_MAX_VCPUS) {
+ if (kvm->created_vcpus >= kvm->max_vcpus) {
mutex_unlock(&kvm->lock);
return -EINVAL;
}
--
2.25.1

2021-07-02 22:07:59

by Isaku Yamahata

Subject: [RFC PATCH v2 20/69] KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID

From: Sean Christopherson <[email protected]>

Exit to userspace with -EIO and mark the VM as bugged if the retried
page fault still returns RET_PF_INVALID. Hitting RET_PF_INVALID twice
indicates a KVM bug, at which point the VM's state can no longer be
trusted, so bug the VM instead of merely warning.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5b8a640f8042..0dc4bf34ce9c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5091,7 +5091,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
if (r == RET_PF_INVALID) {
r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
lower_32_bits(error_code), false);
- if (WARN_ON_ONCE(r == RET_PF_INVALID))
+ if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
}

--
2.25.1

2021-07-02 22:08:01

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 25/69] KVM: x86: Add per-VM flag to disable direct IRQ injection

From: Sean Christopherson <[email protected]>

Add a per-VM flag to disallow direct IRQ injection, which is not
supported by TDX.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 301b10172cbf..be329f0a7054 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1063,6 +1063,7 @@ struct kvm_arch {
bool exception_payload_enabled;

bool bus_lock_detection_enabled;
+ bool irq_injection_disallowed;

/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
u32 user_space_msr_mask;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b89845dfb679..4070786f17d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4264,7 +4264,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
struct kvm_interrupt *irq)
{
- if (irq->irq >= KVM_NR_INTERRUPTS)
+ if (irq->irq >= KVM_NR_INTERRUPTS ||
+ vcpu->kvm->arch.irq_injection_disallowed)
return -EINVAL;

if (!irqchip_in_kernel(vcpu->kvm)) {
@@ -8562,6 +8563,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
{
return vcpu->run->request_interrupt_window &&
+ !vcpu->kvm->arch.irq_injection_disallowed &&
likely(!pic_in_kernel(vcpu->kvm));
}

--
2.25.1

2021-07-02 22:08:06

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 27/69] KVM: x86: Add flag to mark TSC as immutable (for TDX)

From: Sean Christopherson <[email protected]>

The TSC for TDX1 guests is fixed at TD creation time. Add tsc_immutable
to reflect that the TSC of the guest cannot be changed in any way, and
use it to short-circuit all paths that lead to one of the myriad TSC
adjustment flows.

Suggested-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 35 +++++++++++++++++++++++++--------
2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09e51c5e86b3..5d6143643cd1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1044,6 +1044,7 @@ struct kvm_arch {
int audit_point;
#endif

+ bool tsc_immutable;
bool backwards_tsc_observed;
bool boot_vcpu_runs_old_kvmclock;
u32 bsp_vcpu_id;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 681fc3be2b2b..cd9407982366 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2184,7 +2184,9 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
u64 ratio;

/* Guest TSC same frequency as host TSC? */
- if (!scale) {
+ if (!scale || vcpu->kvm->arch.tsc_immutable) {
+ if (scale)
+ pr_warn_ratelimited("Guest TSC immutable, scaling not supported\n");
vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
return 0;
}
@@ -2360,6 +2362,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
bool already_matched;
bool synchronizing = false;

+ if (WARN_ON_ONCE(vcpu->kvm->arch.tsc_immutable))
+ return;
+
raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = kvm_compute_tsc_offset(vcpu, data);
ns = get_kvmclock_base_ns();
@@ -2791,6 +2796,10 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
u8 pvclock_flags;
bool use_master_clock;

+ /* Unable to update guest time if the TSC is immutable. */
+ if (ka->tsc_immutable)
+ return 0;
+
kernel_ns = 0;
host_tsc = 0;

@@ -4142,7 +4151,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");

- if (kvm_check_tsc_unstable()) {
+ if (kvm_check_tsc_unstable() &&
+ !vcpu->kvm->arch.tsc_immutable) {
u64 offset = kvm_compute_tsc_offset(vcpu,
vcpu->arch.last_guest_tsc);
kvm_vcpu_write_tsc_offset(vcpu, offset);
@@ -4156,7 +4166,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
* On a host with synchronized TSC, there is no need to update
* kvmclock on vcpu->cpu migration
*/
- if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+ if ((!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1) &&
+ !vcpu->kvm->arch.tsc_immutable)
kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
@@ -5126,10 +5137,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_TSC_KHZ: {
- u32 user_tsc_khz;
+ u32 user_tsc_khz = (u32)arg;

r = -EINVAL;
- user_tsc_khz = (u32)arg;
+ if (vcpu->kvm->arch.tsc_immutable)
+ goto out;

if (kvm_has_tsc_control &&
user_tsc_khz >= kvm_max_guest_tsc_khz)
@@ -10499,9 +10511,12 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)

if (mutex_lock_killable(&vcpu->mutex))
return;
- vcpu_load(vcpu);
- kvm_synchronize_tsc(vcpu, 0);
- vcpu_put(vcpu);
+
+ if (!kvm->arch.tsc_immutable) {
+ vcpu_load(vcpu);
+ kvm_synchronize_tsc(vcpu, 0);
+ vcpu_put(vcpu);
+ }

/* poll control enabled by default */
vcpu->arch.msr_kvm_poll_control = 1;
@@ -10696,6 +10711,10 @@ int kvm_arch_hardware_enable(void)
if (backwards_tsc) {
u64 delta_cyc = max_tsc - local_tsc;
list_for_each_entry(kvm, &vm_list, vm_list) {
+ if (kvm->arch.tsc_immutable) {
+ pr_warn_ratelimited("Backwards TSC observed and guest with immutable TSC active\n");
+ continue;
+ }
kvm->arch.backwards_tsc_observed = true;
kvm_for_each_vcpu(i, vcpu, kvm) {
vcpu->arch.tsc_offset_adjustment += delta_cyc;
--
2.25.1

2021-07-02 22:08:11

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 28/69] KVM: Add per-VM flag to mark read-only memory as unsupported

From: Isaku Yamahata <[email protected]>

Add a flag for TDX to mark read-only memory as unsupported and propagate
it to KVM_MEM_READONLY so that RO memory can be reported as unsupported
on a per-VM basis via KVM_CAP_READONLY_MEM. TDX1 doesn't expose
permission bits to the VMM in the SEPT tables, i.e. doesn't support
read-only private memory.

Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 4 +++-
include/linux/kvm_host.h | 4 ++++
virt/kvm/kvm_main.c | 8 +++++---
3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd9407982366..87212d7563ae 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3897,7 +3897,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ASYNC_PF_INT:
case KVM_CAP_GET_TSC_KHZ:
case KVM_CAP_KVMCLOCK_CTRL:
- case KVM_CAP_READONLY_MEM:
case KVM_CAP_HYPERV_TIME:
case KVM_CAP_IOAPIC_POLARITY_IGNORED:
case KVM_CAP_TSC_DEADLINE_TIMER:
@@ -4009,6 +4008,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
r |= BIT(KVM_X86_TDX_VM);
break;
+ case KVM_CAP_READONLY_MEM:
+ r = kvm && kvm->readonly_mem_unsupported ? 0 : 1;
+ break;
default:
break;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ddd4d0f68cdf..7ee7104b4b59 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -597,6 +597,10 @@ struct kvm {
unsigned int max_halt_poll_ns;
u32 dirty_ring_size;

+#ifdef __KVM_HAVE_READONLY_MEM
+ bool readonly_mem_unsupported;
+#endif
+
bool vm_bugged;
};

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 52d40ea75749..63d0c2833913 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1258,12 +1258,14 @@ static void update_memslots(struct kvm_memslots *slots,
}
}

-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

#ifdef __KVM_HAVE_READONLY_MEM
- valid_flags |= KVM_MEM_READONLY;
+ if (!kvm->readonly_mem_unsupported)
+ valid_flags |= KVM_MEM_READONLY;
#endif

if (mem->flags & ~valid_flags)
@@ -1436,7 +1438,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;

- r = check_memory_region_flags(mem);
+ r = check_memory_region_flags(kvm, mem);
if (r)
return r;

--
2.25.1

2021-07-02 22:08:20

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 29/69] KVM: Add per-VM flag to disable dirty logging of memslots for TDs

From: Sean Christopherson <[email protected]>

Add a per-VM flag for TDX to mark dirty logging as unsupported.

Suggested-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7ee7104b4b59..74bc55df7a5d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -597,6 +597,7 @@ struct kvm {
unsigned int max_halt_poll_ns;
u32 dirty_ring_size;

+ bool dirty_log_unsupported;
#ifdef __KVM_HAVE_READONLY_MEM
bool readonly_mem_unsupported;
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 63d0c2833913..8b075b5e7303 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1261,7 +1261,10 @@ static void update_memslots(struct kvm_memslots *slots,
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem)
{
- u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+ u32 valid_flags = 0;
+
+ if (!kvm->dirty_log_unsupported)
+ valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;

#ifdef __KVM_HAVE_READONLY_MEM
if (!kvm->readonly_mem_unsupported)
--
2.25.1

2021-07-02 22:08:27

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

From: Rick Edgecombe <[email protected]>

Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
perspective) to a single GPA (from a memslot perspective). GPA alising
will be used to repurpose GPA bits as attribute bits, e.g. to expose an
execute-only permission bit to the guest. To keep the implementation
simple (relatively speaking), GPA aliasing is only supported via TDP.

Today KVM assumes two things that are broken by GPA aliasing.
1. GPAs coming from hardware can be simply shifted to get the GFNs.
2. GPA bits 51:MAXPHYADDR are reserved to zero.

With GPA aliasing, translating a GPA to GFN requires masking off the
repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.

To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
that is, bits stolen from the GPA to act as new virtualized attribute
bits. A bit in the mask will cause the MMU code to create aliases of the
GPA. It can also be used to extract the GFN from a GPA delivered by a
TDP fault.

To handle case (1) from above, retain any stolen bits when passing a GPA
in KVM's MMU code, but strip them when converting to a GFN so that the
GFN contains only the "real" GFN, i.e. never has repurposed bits set.

GFNs (without stolen bits) continue to be used to:
-Specify physical memory by userspace via memslots
-Map GPAs to TDP PTEs via RMAP
-Specify dirty tracking and write protection
-Look up MTRR types
-Inject async page faults

Since there are now multiple aliases for the same underlying GPA, when
userspace memory backing the memslots is paged out, both aliases need to
be modified. Fortunately this happens automatically: because the rmap
supports multiple mappings of the same GFN (as needed for shadow paging),
adding/removing each alias PTE with its GFN ensures that
kvm_handle_hva() based operations are applied to both aliases.

In the case of the rmap being removed in the future, the needed
information could be recovered by iterating over the stolen bits and
walking the TDP page tables.

For TLB flushes that are address based, make sure to flush both aliases
in the stolen bits case.

Only support stolen bits in 64-bit guest paging modes (long, PAE).
Features that use this infrastructure should restrict the stolen bits to
exclude the other paging modes. Don't support stolen bits for shadow EPT.

Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 26 ++++++++++
arch/x86/kvm/mmu/mmu.c | 86 ++++++++++++++++++++++-----------
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 25 ++++++----
4 files changed, 101 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 88d0ed5225a4..69b82857acdb 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -232,4 +232,30 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
int kvm_mmu_post_init_vm(struct kvm *kvm);
void kvm_mmu_pre_destroy_vm(struct kvm *kvm);

+static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
+{
+ /* Currently there are no stolen bits in KVM */
+ return 0;
+}
+
+static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
+{
+ return kvm_gfn_stolen_mask(vcpu->kvm);
+}
+
+static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
+{
+ return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
+}
+
+static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
+{
+ return kvm_gpa_stolen_mask(vcpu->kvm);
+}
+
+static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
+}
+
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0dc4bf34ce9c..990ee645b8a2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -188,27 +188,37 @@ static inline bool kvm_available_flush_tlb_with_range(void)
return kvm_x86_ops.tlb_remote_flush_with_range;
}

-static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
- struct kvm_tlb_range *range)
-{
- int ret = -ENOTSUPP;
-
- if (range && kvm_x86_ops.tlb_remote_flush_with_range)
- ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
-
- if (ret)
- kvm_flush_remote_tlbs(kvm);
-}
-
void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
u64 start_gfn, u64 pages)
{
struct kvm_tlb_range range;
+ u64 gfn_stolen_mask;
+
+ if (!kvm_available_flush_tlb_with_range())
+ goto generic_flush;
+
+ /*
+ * Fall back to the big hammer flush if there is more than one
+ * GPA alias that needs to be flushed.
+ */
+ gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
+ if (hweight64(gfn_stolen_mask) > 1)
+ goto generic_flush;

range.start_gfn = start_gfn;
range.pages = pages;
+ if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
+ goto generic_flush;
+
+ if (!gfn_stolen_mask)
+ return;

- kvm_flush_remote_tlbs_with_range(kvm, &range);
+ range.start_gfn |= gfn_stolen_mask;
+ static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
+ return;
+
+generic_flush:
+ kvm_flush_remote_tlbs(kvm);
}

bool is_nx_huge_page_enabled(void)
@@ -1949,14 +1959,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
}

-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
- gfn_t gfn,
- gva_t gaddr,
- unsigned level,
- int direct,
- unsigned int access)
+static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ gfn_t gfn_stolen_bits,
+ gva_t gaddr,
+ unsigned int level,
+ int direct,
+ unsigned int access)
{
bool direct_mmu = vcpu->arch.mmu->direct_map;
+ gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
union kvm_mmu_page_role role;
struct hlist_head *sp_list;
unsigned quadrant;
@@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.quadrant = quadrant;
}

- sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+ sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
- if (sp->gfn != gfn) {
+ if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
collisions++;
continue;
}
@@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
sp = kvm_mmu_alloc_page(vcpu, direct);

sp->gfn = gfn;
+ sp->gfn_stolen_bits = gfn_stolen_bits;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
if (!direct) {
@@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
return sp;
}

+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+ gva_t gaddr, unsigned int level,
+ int direct, unsigned int access)
+{
+ return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
+}
+
static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, hpa_t root,
u64 addr)
@@ -2637,7 +2657,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,

gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
- if (!slot)
+
+ /* Don't map private memslots for stolen bits */
+ if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
return -1;

ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
@@ -2827,7 +2849,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
struct kvm_shadow_walk_iterator it;
struct kvm_mmu_page *sp;
int level, req_level, ret;
- gfn_t gfn = gpa >> PAGE_SHIFT;
+ gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
+ gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
+ gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
gfn_t base_gfn = gfn;

if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
@@ -2852,8 +2876,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

drop_large_spte(vcpu, it.sptep);
if (!is_shadow_present_pte(*it.sptep)) {
- sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
- it.level - 1, true, ACC_ALL);
+ sp = __kvm_mmu_get_page(vcpu, base_gfn,
+ gfn_stolen_bits, it.addr,
+ it.level - 1, true, ACC_ALL);

link_shadow_page(vcpu, it.sptep, sp);
if (is_tdp && huge_page_disallowed &&
@@ -3689,6 +3714,13 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
return true;

+ /* Don't expose aliases for no slot GFNs or private memslots */
+ if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
+ !kvm_is_visible_memslot(slot)) {
+ *pfn = KVM_PFN_NOSLOT;
+ return false;
+ }
+
/* Don't expose private memslots to L2. */
if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
*pfn = KVM_PFN_NOSLOT;
@@ -3723,7 +3755,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
bool write = error_code & PFERR_WRITE_MASK;
bool map_writable;

- gfn_t gfn = gpa >> PAGE_SHIFT;
+ gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
unsigned long mmu_seq;
kvm_pfn_t pfn;
hva_t hva;
@@ -3833,7 +3865,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
max_level > PG_LEVEL_4K;
max_level--) {
int page_num = KVM_PAGES_PER_HPAGE(max_level);
- gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
+ gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);

if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
break;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d64ccb417c60..c896ec9f3159 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -46,6 +46,7 @@ struct kvm_mmu_page {
*/
union kvm_mmu_page_role role;
gfn_t gfn;
+ gfn_t gfn_stolen_bits;

u64 *spt;
/* hold the gfn of each spte inside spt */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 823a5919f9fa..439dc141391b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -25,7 +25,8 @@
#define guest_walker guest_walker64
#define FNAME(name) paging##64_##name
#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
- #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+ #define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
+ PT64_LVL_ADDR_MASK(lvl))
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -44,7 +45,7 @@
#define guest_walker guest_walker32
#define FNAME(name) paging##32_##name
#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
- #define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
+ #define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
#define PT_LEVEL_BITS PT32_LEVEL_BITS
@@ -58,7 +59,7 @@
#define guest_walker guest_walkerEPT
#define FNAME(name) ept_##name
#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
- #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+ #define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -75,7 +76,7 @@
#define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)

#define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
-#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
+#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)

/*
* The guest_walker structure emulates the behavior of the hardware page
@@ -96,9 +97,9 @@ struct guest_walker {
struct x86_exception fault;
};

-static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
+static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
{
- return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
+ return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
}

static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
@@ -366,7 +367,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
--walker->level;

index = PT_INDEX(addr, walker->level);
- table_gfn = gpte_to_gfn(pte);
+ table_gfn = gpte_to_gfn(vcpu, pte);
offset = index * sizeof(pt_element_t);
pte_gpa = gfn_to_gpa(table_gfn) + offset;

@@ -432,7 +433,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
if (unlikely(errcode))
goto error;

- gfn = gpte_to_gfn_lvl(pte, walker->level);
+ gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;

if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
@@ -537,12 +538,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gfn_t gfn;
kvm_pfn_t pfn;

+ WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
+
if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
return false;

pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);

- gfn = gpte_to_gfn(gpte);
+ gfn = gpte_to_gfn(vcpu, gpte);
pte_access = sp->role.access & FNAME(gpte_access)(gpte);
FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
@@ -652,6 +655,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,

direct_access = gw->pte_access;

+ WARN_ON(addr & vcpu_gpa_stolen_mask(vcpu));
+
top_level = vcpu->arch.mmu->root_level;
if (top_level == PT32E_ROOT_LEVEL)
top_level = PT32_ROOT_LEVEL;
@@ -1067,7 +1072,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
continue;
}

- gfn = gpte_to_gfn(gpte);
+ gfn = gpte_to_gfn(vcpu, gpte);
pte_access = sp->role.access;
pte_access &= FNAME(gpte_access)(gpte);
FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
--
2.25.1

2021-07-02 22:08:27

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 35/69] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()

From: Sean Christopherson <[email protected]>

Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
destruction path, which needs to first put the VM into a teardown state,
then free per-vCPU resources, and finally free per-VM resources.

Note, this knowingly creates a discrepancy in nomenclature for SVM as
svm_vm_teardown() invokes avic_vm_destroy() and sev_vm_destroy().
Moving the now-misnamed functions or renaming them is left to a future
patch so as not to introduce a functional change for SVM.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm/svm.c | 8 +++++++-
arch/x86/kvm/vmx/vmx.c | 12 ++++++++++++
arch/x86/kvm/x86.c | 3 ++-
5 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 01457da0162b..95ae4a71cfbc 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP(vm_init)
+KVM_X86_OP(vm_teardown)
KVM_X86_OP_NULL(vm_destroy)
KVM_X86_OP(vcpu_create)
KVM_X86_OP(vcpu_free)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e3abf077f328..96e6cd95d884 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1217,6 +1217,7 @@ struct kvm_x86_ops {
bool (*is_vm_type_supported)(unsigned long vm_type);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
+ void (*vm_teardown)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);

/* Create, but do not attach this VCPU */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 286a49b09269..bcc3fc4872a3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4416,12 +4416,17 @@ static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
sev_vcpu_deliver_sipi_vector(vcpu, vector);
}

-static void svm_vm_destroy(struct kvm *kvm)
+static void svm_vm_teardown(struct kvm *kvm)
{
avic_vm_destroy(kvm);
sev_vm_destroy(kvm);
}

+static void svm_vm_destroy(struct kvm *kvm)
+{
+
+}
+
static bool svm_is_vm_type_supported(unsigned long type)
{
return type == KVM_X86_LEGACY_VM;
@@ -4456,6 +4461,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.is_vm_type_supported = svm_is_vm_type_supported,
.vm_size = sizeof(struct kvm_svm),
.vm_init = svm_vm_init,
+ .vm_teardown = svm_vm_teardown,
.vm_destroy = svm_vm_destroy,

.prepare_guest_switch = svm_prepare_guest_switch,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 84c2df824ecc..36756a356704 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6995,6 +6995,16 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}

+static void vmx_vm_teardown(struct kvm *kvm)
+{
+
+}
+
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+
+}
+
static int __init vmx_check_processor_compat(void)
{
struct vmcs_config vmcs_conf;
@@ -7613,6 +7623,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.is_vm_type_supported = vmx_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
+ .vm_teardown = vmx_vm_teardown,
+ .vm_destroy = vmx_vm_destroy,

.vcpu_create = vmx_create_vcpu,
.vcpu_free = vmx_free_vcpu,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da9f1081cb03..4b436cae1732 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11043,7 +11043,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
mutex_unlock(&kvm->slots_lock);
}
- static_call_cond(kvm_x86_vm_destroy)(kvm);
+ static_call(kvm_x86_vm_teardown)(kvm);
kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
kvm_pic_destroy(kvm);
kvm_ioapic_destroy(kvm);
@@ -11054,6 +11054,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_page_track_cleanup(kvm);
kvm_xen_destroy_vm(kvm);
kvm_hv_destroy_vm(kvm);
+ static_call_cond(kvm_x86_vm_destroy)(kvm);
}

void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
--
2.25.1

2021-07-02 22:08:30

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 44/69] KVM: x86/mmu: Refactor shadow walk in __direct_map() to reduce indentation

From: Sean Christopherson <[email protected]>

Employ a 'continue' to reduce the indentation for linking a new shadow
page during __direct_map() in preparation for linking private pages.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1c40dfd05979..0259781cee6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2910,16 +2910,15 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
break;

drop_large_spte(vcpu, it.sptep);
- if (!is_shadow_present_pte(*it.sptep)) {
- sp = __kvm_mmu_get_page(vcpu, base_gfn,
- gfn_stolen_bits, it.addr,
- it.level - 1, true, ACC_ALL);
-
- link_shadow_page(vcpu, it.sptep, sp);
- if (is_tdp && huge_page_disallowed &&
- req_level >= it.level)
- account_huge_nx_page(vcpu->kvm, sp);
- }
+ if (is_shadow_present_pte(*it.sptep))
+ continue;
+
+ sp = __kvm_mmu_get_page(vcpu, base_gfn, gfn_stolen_bits,
+ it.addr, it.level - 1, true, ACC_ALL);
+
+ link_shadow_page(vcpu, it.sptep, sp);
+ if (is_tdp && huge_page_disallowed && req_level >= it.level)
+ account_huge_nx_page(vcpu->kvm, sp);
}

ret = mmu_set_spte(vcpu, it.sptep, ACC_ALL,
--
2.25.1

2021-07-02 22:08:32

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 30/69] KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level routes

From: Kai Huang <[email protected]>

Add a flag to let TDX disallow the in-kernel I/O APIC, level triggered
routes for a userspace I/O APIC, and anything else that relies on being
able to intercept EOIs. TDX-SEAM does not allow intercepting EOIs.

Note, technically KVM could partially emulate the I/O APIC by allowing
only edge triggered interrupts, but that adds a lot of complexity for
basically zero benefit. Ideally KVM wouldn't even allow I/O APIC route
reservation, but disabling that is a train wreck for QEMU.

Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/ioapic.c | 4 ++++
arch/x86/kvm/irq_comm.c | 9 +++++++--
arch/x86/kvm/lapic.c | 3 ++-
arch/x86/kvm/x86.c | 6 ++++++
5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5d6143643cd1..f373d672b4ac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1054,6 +1054,7 @@ struct kvm_arch {

enum kvm_irqchip_mode irqchip_mode;
u8 nr_reserved_ioapic_pins;
+ bool eoi_intercept_unsupported;

bool disabled_lapic_found;

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 698969e18fe3..e2de6e552d25 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -311,6 +311,10 @@ void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm)
{
if (!ioapic_in_kernel(kvm))
return;
+
+ if (WARN_ON_ONCE(kvm->arch.eoi_intercept_unsupported))
+ return;
+
kvm_make_scan_ioapic_request(kvm);
}

diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index d5b72a08e566..bcfac99db579 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -123,7 +123,12 @@ EXPORT_SYMBOL_GPL(kvm_set_msi_irq);
static inline bool kvm_msi_route_invalid(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *e)
{
- return kvm->arch.x2apic_format && (e->msi.address_hi & 0xff);
+ struct msi_msg msg = { .address_lo = e->msi.address_lo,
+ .address_hi = e->msi.address_hi,
+ .data = e->msi.data };
+ return (kvm->arch.eoi_intercept_unsupported &&
+ msg.arch_data.is_level) ||
+ (kvm->arch.x2apic_format && (msg.address_hi & 0xff));
}

int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
@@ -385,7 +390,7 @@ int kvm_setup_empty_irq_routing(struct kvm *kvm)

void kvm_arch_post_irq_routing_update(struct kvm *kvm)
{
- if (!irqchip_split(kvm))
+ if (!irqchip_split(kvm) || kvm->arch.eoi_intercept_unsupported)
return;
kvm_make_scan_ioapic_request(kvm);
}
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 17fa4ab1b834..977a704e3ff1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -278,7 +278,8 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
if (old)
call_rcu(&old->rcu, kvm_apic_map_free);

- kvm_make_scan_ioapic_request(kvm);
+ if (!kvm->arch.eoi_intercept_unsupported)
+ kvm_make_scan_ioapic_request(kvm);
}

static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87212d7563ae..92204bbc7ea5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5744,6 +5744,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
goto create_irqchip_unlock;

r = -EINVAL;
+ if (kvm->arch.eoi_intercept_unsupported)
+ goto create_irqchip_unlock;
+
if (kvm->created_vcpus)
goto create_irqchip_unlock;

@@ -5774,6 +5777,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
u.pit_config.flags = KVM_PIT_SPEAKER_DUMMY;
goto create_pit;
case KVM_CREATE_PIT2:
+ r = -EINVAL;
+ if (kvm->arch.eoi_intercept_unsupported)
+ goto out;
r = -EFAULT;
if (copy_from_user(&u.pit_config, argp,
sizeof(struct kvm_pit_config)))
--
2.25.1

2021-07-02 22:08:36

by Isaku Yamahata

Subject: [RFC PATCH v2 48/69] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX

From: Sean Christopherson <[email protected]>

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path. This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c3e241088851..8bf1c6dbac78 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -125,6 +125,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa, err, prefault);
}

+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level);
+
/*
* Currently, we have two sorts of write-protection, a) the first one
* write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 96a99450f114..82db62753acb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4098,6 +4098,31 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
max_level, true, &pfn);
}

+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level)
+{
+ kvm_pfn_t pfn;
+ int r;
+
+ if (mmu_topup_memory_caches(vcpu, false))
+ return KVM_PFN_ERR_FAULT;
+
+ /*
+ * Loop on the page fault path to handle the case where an mmu_notifier
+ * invalidation triggers RET_PF_RETRY. In the normal page fault path,
+ * KVM needs to resume the guest in case the invalidation changed any
+ * of the page fault properties, i.e. the gpa or error code. For this
+ * path, the gpa and error code are fixed by the caller, and the caller
+ * expects failure if and only if the page fault can't be fixed.
+ */
+ do {
+ r = direct_page_fault(vcpu, gpa, error_code, false, max_level,
+ true, &pfn);
+ } while (r == RET_PF_RETRY && !is_error_noslot_pfn(pfn));
+ return pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
static void nonpaging_init_context(struct kvm_vcpu *vcpu,
struct kvm_mmu *context)
{
--
2.25.1

2021-07-02 22:08:37

by Isaku Yamahata

Subject: [RFC PATCH v2 49/69] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param

From: Sean Christopherson <[email protected]>

Pass intr_info to the NMI and INTR handlers instead of pulling it from
vcpu_vmx in preparation for sharing the bulk of the handlers with TDX.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7ce15a2c3490..e08f85c93e55 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6404,25 +6404,24 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
kvm_after_interrupt(vcpu);
}

-static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
{
const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
- u32 intr_info = vmx_get_intr_info(&vmx->vcpu);

/* if exit due to PF check for async PF */
if (is_page_fault(intr_info))
- vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
/* Handle machine checks before interrupts are enabled */
else if (is_machine_check(intr_info))
kvm_machine_check();
/* We need to handle NMIs before interrupts are enabled */
else if (is_nmi(intr_info))
- handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
+ handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
}

-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
gate_desc *desc = (gate_desc *)host_idt_base + vector;

@@ -6438,9 +6437,9 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);

if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu);
+ handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_nmi_irqoff(vmx);
+ handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
}

/*
--
2.25.1

2021-07-02 22:08:38


by Isaku Yamahata

Subject: [RFC PATCH v2 45/69] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()

From: Sean Christopherson <[email protected]>

Return the old SPTE when clearing a SPTE and push the "old SPTE present"
check to the caller. Private shadow page support will use the old SPTE
in rmap_remove() to determine whether or not there is a linked private
shadow page.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0259781cee6a..6b0c8c84aabe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -542,9 +542,9 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* Rules for using mmu_spte_clear_track_bits:
* It sets the sptep from present to nonpresent, and track the
* state bits, it is used to clear the last level sptep.
- * Returns non-zero if the PTE was previously valid.
+ * Returns the old PTE.
*/
-static int mmu_spte_clear_track_bits(u64 *sptep)
+static u64 mmu_spte_clear_track_bits(u64 *sptep)
{
kvm_pfn_t pfn;
u64 old_spte = *sptep;
@@ -555,7 +555,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
old_spte = __update_clear_spte_slow(sptep, shadow_init_value);

if (!is_shadow_present_pte(old_spte))
- return 0;
+ return old_spte;

pfn = spte_to_pfn(old_spte);

@@ -572,7 +572,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
if (is_dirty_spte(old_spte))
kvm_set_pfn_dirty(pfn);

- return 1;
+ return old_spte;
}

/*
@@ -1104,7 +1104,9 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)

static void drop_spte(struct kvm *kvm, u64 *sptep)
{
- if (mmu_spte_clear_track_bits(sptep))
+ u64 old_spte = mmu_spte_clear_track_bits(sptep);
+
+ if (is_shadow_present_pte(old_spte))
rmap_remove(kvm, sptep);
}

--
2.25.1

2021-07-02 22:08:43

by Isaku Yamahata

Subject: [RFC PATCH v2 47/69] KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()

From: Sean Christopherson <[email protected]>

When adding pages prior to boot, TDX will need the resulting host pfn so
that it can be passed to TDADDPAGE (TDX-SEAM always works with physical
addresses as it has its own page tables). Start plumbing pfn back up
the page fault stack.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c343e546b7c5..96a99450f114 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3969,14 +3969,14 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
}

static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
- bool prefault, int max_level, bool is_tdp)
+ bool prefault, int max_level, bool is_tdp,
+ kvm_pfn_t *pfn)
{
bool write = error_code & PFERR_WRITE_MASK;
bool map_writable;

gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
unsigned long mmu_seq;
- kvm_pfn_t pfn;
hva_t hva;
int r;

@@ -3996,11 +3996,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
+ if (try_async_pf(vcpu, prefault, gfn, gpa, pfn, &hva,
write, &map_writable))
return RET_PF_RETRY;

- if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
+ if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, *pfn, ACC_ALL, &r))
return r;

r = RET_PF_RETRY;
@@ -4010,7 +4010,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
else
write_lock(&vcpu->kvm->mmu_lock);

- if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
+ if (!is_noslot_pfn(*pfn) &&
+ mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
goto out_unlock;
r = make_mmu_pages_available(vcpu);
if (r)
@@ -4018,28 +4019,30 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
- pfn, prefault);
+ *pfn, prefault);
else
- r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
- prefault, is_tdp);
+ r = __direct_map(vcpu, gpa, error_code, map_writable, max_level,
+ *pfn, prefault, is_tdp);

out_unlock:
if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(pfn);
+ kvm_release_pfn_clean(*pfn);
return r;
}

static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
u32 error_code, bool prefault)
{
+ kvm_pfn_t pfn;
+
pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);

/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
- PG_LEVEL_2M, false);
+ PG_LEVEL_2M, false, &pfn);
}

int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
@@ -4078,6 +4081,7 @@ EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
bool prefault)
{
+ kvm_pfn_t pfn;
int max_level;

for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
@@ -4091,7 +4095,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
}

return direct_page_fault(vcpu, gpa, error_code, prefault,
- max_level, true);
+ max_level, true, &pfn);
}

static void nonpaging_init_context(struct kvm_vcpu *vcpu,
--
2.25.1

2021-07-02 22:08:43

by Isaku Yamahata

Subject: [RFC PATCH v2 46/69] KVM: x86/mmu: Frame in support for private/inaccessible shadow pages

From: Sean Christopherson <[email protected]>

Add kvm_x86_ops hooks to set/clear private SPTEs, i.e. SEPT entries, and
to link/free private shadow pages, i.e. non-leaf SEPT pages.

Because SEAMCALLs are bloody expensive, and because KVM's MMU is already
complex enough, TDX's SEPT will mirror KVM's shadow pages instead of
replacing them outright. This costs extra memory, but is simpler and
far more performant.

Add a separate list for tracking active private shadow pages. Zapping
and freeing SEPT entries are subject to very different rules than normal
pages/memory; they need to be preserved (along with their shadow page
counterparts) when KVM gets trigger happy, e.g. zaps everything during a
memslot update.

Zap any aliases of a GPA when mapping in a guest that supports guest
private GPAs. This is necessary to avoid integrity failures with TDX
due to pointing shared and private GPAs at the same HPA.

Do not prefetch private pages (this should probably be a property of the
VM).

TDX needs one more bit in the SPTE for SPTE_PRIVATE_ZAPPED. Steal bit 62
from the MMIO generation and repurpose it, for non-present SPTEs, as
SPTE_PRIVATE_ZAPPED.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 6 +
arch/x86/include/asm/kvm_host.h | 24 ++-
arch/x86/kvm/mmu.h | 3 +-
arch/x86/kvm/mmu/mmu.c | 292 ++++++++++++++++++++++++-----
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/mmu/spte.h | 16 +-
arch/x86/kvm/x86.c | 4 +-
7 files changed, 293 insertions(+), 54 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 95ae4a71cfbc..3074c1bfa3dc 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -88,6 +88,12 @@ KVM_X86_OP(set_tss_addr)
KVM_X86_OP(set_identity_map_addr)
KVM_X86_OP(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(set_private_spte)
+KVM_X86_OP(drop_private_spte)
+KVM_X86_OP(zap_private_spte)
+KVM_X86_OP(unzap_private_spte)
+KVM_X86_OP(link_private_sp)
+KVM_X86_OP(free_private_sp)
KVM_X86_OP_NULL(has_wbinvd_exit)
KVM_X86_OP(write_l1_tsc_offset)
KVM_X86_OP(get_exit_info)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c641654307c6..9631b985ebdc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -385,6 +385,7 @@ struct kvm_mmu {
struct kvm_mmu_page *sp);
void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
hpa_t root_hpa;
+ hpa_t private_root_hpa;
gpa_t root_pgd;
union kvm_mmu_role mmu_role;
u8 root_level;
@@ -424,6 +425,7 @@ struct kvm_mmu {
u8 last_nonleaf_level;

bool nx;
+ bool no_prefetch;

u64 pdptrs[4]; /* pae */
};
@@ -639,6 +641,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_gfn_array_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ struct kvm_mmu_memory_cache mmu_private_sp_cache;

/*
* QEMU userspace and the guest each have their own FPU state.
@@ -989,6 +992,7 @@ struct kvm_arch {
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
+ struct list_head private_mmu_pages;
struct list_head zapped_obsolete_pages;
struct list_head lpage_disallowed_mmu_pages;
struct kvm_page_track_notifier_node mmu_sp_tracker;
@@ -1137,6 +1141,8 @@ struct kvm_arch {
*/
spinlock_t tdp_mmu_pages_lock;
#endif /* CONFIG_X86_64 */
+
+ gfn_t gfn_shared_mask;
};

struct kvm_vm_stat {
@@ -1319,6 +1325,17 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);

+ void (*set_private_spte)(struct kvm_vcpu *vcpu, gfn_t gfn, int level,
+ kvm_pfn_t pfn);
+ void (*drop_private_spte)(struct kvm *kvm, gfn_t gfn, int level,
+ kvm_pfn_t pfn);
+ void (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, int level);
+ void (*unzap_private_spte)(struct kvm *kvm, gfn_t gfn, int level);
+ int (*link_private_sp)(struct kvm_vcpu *vcpu, gfn_t gfn, int level,
+ void *private_sp);
+ int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, int level,
+ void *private_sp);
+
bool (*has_wbinvd_exit)(void);

/* Returns actual tsc_offset set in active VMCS */
@@ -1490,7 +1507,8 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *memslot);
void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
struct kvm_memory_slot *memslot);
-void kvm_mmu_zap_all(struct kvm *kvm);
+void kvm_mmu_zap_all_active(struct kvm *kvm);
+void kvm_mmu_zap_all_private(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
@@ -1656,7 +1674,9 @@ static inline int __kvm_irq_line_state(unsigned long *irq_state,

#define KVM_MMU_ROOT_CURRENT BIT(0)
#define KVM_MMU_ROOT_PREVIOUS(i) BIT(1+i)
-#define KVM_MMU_ROOTS_ALL (~0UL)
+#define KVM_MMU_ROOT_PRIVATE BIT(1+KVM_MMU_NUM_PREV_ROOTS)
+#define KVM_MMU_ROOTS_ALL ((u32)(~KVM_MMU_ROOT_PRIVATE))
+#define KVM_MMU_ROOTS_ALL_INC_PRIVATE (KVM_MMU_ROOTS_ALL | KVM_MMU_ROOT_PRIVATE)

int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level);
void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6ec8d9fdff35..c3e241088851 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -235,8 +235,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm);

static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
{
- /* Currently there are no stolen bits in KVM */
- return 0;
+ return kvm->arch.gfn_shared_mask;
}

static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6b0c8c84aabe..c343e546b7c5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -544,15 +544,15 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* state bits, it is used to clear the last level sptep.
* Returns the old PTE.
*/
-static u64 mmu_spte_clear_track_bits(u64 *sptep)
+static u64 __mmu_spte_clear_track_bits(u64 *sptep, u64 clear_value)
{
kvm_pfn_t pfn;
u64 old_spte = *sptep;

if (!spte_has_volatile_bits(old_spte))
- __update_clear_spte_fast(sptep, shadow_init_value);
+ __update_clear_spte_fast(sptep, clear_value);
else
- old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
+ old_spte = __update_clear_spte_slow(sptep, clear_value);

if (!is_shadow_present_pte(old_spte))
return old_spte;
@@ -575,6 +575,11 @@ static u64 mmu_spte_clear_track_bits(u64 *sptep)
return old_spte;
}

+static inline u64 mmu_spte_clear_track_bits(u64 *sptep)
+{
+ return __mmu_spte_clear_track_bits(sptep, shadow_init_value);
+}
+
/*
* Rules for using mmu_spte_clear_no_track:
* Directly clear spte without caring the state bits of sptep,
@@ -681,6 +686,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
int start, end, i, r;

+ if (vcpu->kvm->arch.gfn_shared_mask) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
+
if (shadow_init_value)
start = kvm_mmu_memory_cache_nr_free_objects(mc);

@@ -722,6 +734,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
@@ -863,6 +876,23 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
return slot;
}

+static inline bool __is_private_gfn(struct kvm *kvm, gfn_t gfn_stolen_bits)
+{
+ gfn_t gfn_shared_mask = kvm->arch.gfn_shared_mask;
+
+ return gfn_shared_mask && !(gfn_shared_mask & gfn_stolen_bits);
+}
+
+static inline bool is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn_stolen_bits)
+{
+ return __is_private_gfn(vcpu->kvm, gfn_stolen_bits);
+}
+
+static inline bool is_private_spte(struct kvm *kvm, u64 *sptep)
+{
+ return __is_private_gfn(kvm, sptep_to_sp(sptep)->gfn_stolen_bits);
+}
+
/*
* About rmap_head encoding:
*
@@ -1014,7 +1044,7 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
return pte_list_add(vcpu, spte, rmap_head);
}

-static void rmap_remove(struct kvm *kvm, u64 *spte)
+static void rmap_remove(struct kvm *kvm, u64 *spte, u64 old_spte)
{
struct kvm_mmu_page *sp;
gfn_t gfn;
@@ -1024,6 +1054,10 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
rmap_head = gfn_to_rmap(kvm, gfn, sp);
__pte_list_remove(spte, rmap_head);
+
+ if (__is_private_gfn(kvm, sp->gfn_stolen_bits))
+ static_call(kvm_x86_drop_private_spte)(
+ kvm, gfn, sp->role.level - 1, spte_to_pfn(old_spte));
}

/*
@@ -1061,7 +1095,8 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
iter->pos = 0;
sptep = iter->desc->sptes[iter->pos];
out:
- BUG_ON(!is_shadow_present_pte(*sptep));
+ BUG_ON(!is_shadow_present_pte(*sptep) &&
+ !is_zapped_private_pte(*sptep));
return sptep;
}

@@ -1106,8 +1141,9 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
{
u64 old_spte = mmu_spte_clear_track_bits(sptep);

- if (is_shadow_present_pte(old_spte))
- rmap_remove(kvm, sptep);
+ if (is_shadow_present_pte(old_spte) ||
+ is_zapped_private_pte(old_spte))
+ rmap_remove(kvm, sptep, old_spte);
}


@@ -1330,28 +1366,67 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
}

-static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot)
+static bool kvm_mmu_zap_private_spte(struct kvm *kvm, u64 *sptep)
+{
+ struct kvm_mmu_page *sp;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+
+ /* Skip the lookup if the VM doesn't support private memory. */
+ if (likely(!kvm->arch.gfn_shared_mask))
+ return false;
+
+ sp = sptep_to_sp(sptep);
+ if (!__is_private_gfn(kvm, sp->gfn_stolen_bits))
+ return false;
+
+ gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+ pfn = spte_to_pfn(*sptep);
+
+ static_call(kvm_x86_zap_private_spte)(kvm, gfn, sp->role.level - 1);
+
+ __mmu_spte_clear_track_bits(sptep,
+ SPTE_PRIVATE_ZAPPED | pfn << PAGE_SHIFT);
+ return true;
+}
+
+static bool __kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;

- while ((sptep = rmap_get_first(rmap_head, &iter))) {
- rmap_printk("spte %p %llx.\n", sptep, *sptep);
+restart:
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);
+
+ if (is_zapped_private_pte(*sptep))
+ continue;

- pte_list_remove(rmap_head, sptep);
flush = true;
+
+ /* Keep the rmap if the private SPTE couldn't be zapped. */
+ if (kvm_mmu_zap_private_spte(kvm, sptep))
+ continue;
+
+ pte_list_remove(rmap_head, sptep);
+ goto restart;
}

return flush;
}

+static inline bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot)
+{
+ return __kvm_zap_rmapp(kvm, rmap_head);
+}
+
static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn, int level,
pte_t unused)
{
- return kvm_zap_rmapp(kvm, rmap_head, slot);
+ return __kvm_zap_rmapp(kvm, rmap_head);
}

static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
@@ -1374,6 +1449,9 @@ static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,

need_flush = 1;

+ /* Private page relocation is not yet supported. */
+ KVM_BUG_ON(is_private_spte(kvm, sptep), kvm);
+
if (pte_write(pte)) {
pte_list_remove(rmap_head, sptep);
goto restart;
@@ -1600,7 +1678,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, unsigned long nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
}

-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
@@ -1608,6 +1686,11 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
free_page((unsigned long)sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
+ if (sp->private_sp &&
+ !static_call(kvm_x86_free_private_sp)(kvm, sp->gfn, sp->role.level,
+ sp->private_sp))
+ free_page((unsigned long)sp->private_sp);
+
kmem_cache_free(mmu_page_header_cache, sp);
}

@@ -1638,7 +1721,8 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
mmu_spte_clear_no_track(parent_pte);
}

-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
+ int direct, bool private)
{
struct kvm_mmu_page *sp;

@@ -1654,7 +1738,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
* comments in kvm_zap_obsolete_pages().
*/
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
- list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+ if (private)
+ list_add(&sp->link, &vcpu->kvm->arch.private_mmu_pages);
+ else
+ list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
return sp;
}
@@ -2066,7 +2153,8 @@ static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,

++vcpu->kvm->stat.mmu_cache_miss;

- sp = kvm_mmu_alloc_page(vcpu, direct);
+ sp = kvm_mmu_alloc_page(vcpu, direct,
+ is_private_gfn(vcpu, gfn_stolen_bits));

sp->gfn = gfn;
sp->gfn_stolen_bits = gfn_stolen_bits;
@@ -2133,8 +2221,13 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, u64 addr)
{
- shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root_hpa,
- addr);
+ hpa_t root;
+
+ if (tdp_enabled && is_private_gfn(vcpu, addr >> PAGE_SHIFT))
+ root = vcpu->arch.mmu->private_root_hpa;
+ else
+ root = vcpu->arch.mmu->root_hpa;
+ shadow_walk_init_using_root(iterator, vcpu, root, addr);
}

static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
@@ -2211,7 +2304,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
struct kvm_mmu_page *child;

pte = *spte;
- if (is_shadow_present_pte(pte)) {
+ if (is_shadow_present_pte(pte) || is_zapped_private_pte(pte)) {
if (is_last_spte(pte, sp->role.level)) {
drop_spte(kvm, spte);
if (is_large_pte(pte))
@@ -2220,6 +2313,9 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
drop_parent_pte(child, spte);

+ if (!is_shadow_present_pte(pte))
+ return 0;
+
/*
* Recursively zap nested TDP SPs, parentless SPs are
* unlikely to be used again in the near future. This
@@ -2370,7 +2466,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,

list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
- kvm_mmu_free_page(sp);
+ kvm_mmu_free_page(kvm, sp);
}
}

@@ -2603,6 +2699,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
int set_spte_ret;
int ret = RET_PF_FIXED;
bool flush = false;
+ u64 pte = *sptep;

pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);
@@ -2612,25 +2709,27 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
return RET_PF_EMULATE;
}

- if (is_shadow_present_pte(*sptep)) {
+ if (is_shadow_present_pte(pte)) {
/*
* If we overwrite a PTE page pointer with a 2MB PMD, unlink
* the parent of the now unreachable PTE.
*/
- if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
+ if (level > PG_LEVEL_4K && !is_large_pte(pte)) {
struct kvm_mmu_page *child;
- u64 pte = *sptep;

child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
drop_parent_pte(child, sptep);
flush = true;
- } else if (pfn != spte_to_pfn(*sptep)) {
+ } else if (pfn != spte_to_pfn(pte)) {
pgprintk("hfn old %llx new %llx\n",
- spte_to_pfn(*sptep), pfn);
+ spte_to_pfn(pte), pfn);
drop_spte(vcpu->kvm, sptep);
flush = true;
} else
was_rmapped = 1;
+ } else if (is_zapped_private_pte(pte)) {
+ WARN_ON(pfn != spte_to_pfn(pte));
+ was_rmapped = 1;
}

set_spte_ret = set_spte(vcpu, sptep, pte_access, level, gfn, pfn,
@@ -2875,6 +2974,52 @@ void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
}
}

+static void kvm_mmu_link_private_sp(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp)
+{
+ void *p = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_sp_cache);
+
+ if (!static_call(kvm_x86_link_private_sp)(vcpu, sp->gfn,
+ sp->role.level, p))
+ sp->private_sp = p;
+ else
+ free_page((unsigned long)p);
+}
+
+static void kvm_mmu_zap_alias_spte(struct kvm_vcpu *vcpu, gfn_t gfn,
+ gpa_t gpa_alias)
+{
+ struct kvm_shadow_walk_iterator it;
+ struct kvm_rmap_head *rmap_head;
+ struct kvm *kvm = vcpu->kvm;
+ struct rmap_iterator iter;
+ struct kvm_mmu_page *sp;
+ u64 *sptep;
+
+ for_each_shadow_entry(vcpu, gpa_alias, it) {
+ if (!is_shadow_present_pte(*it.sptep))
+ break;
+ }
+
+ sp = sptep_to_sp(it.sptep);
+ if (!is_last_spte(*it.sptep, sp->role.level))
+ return;
+
+ rmap_head = gfn_to_rmap(kvm, gfn, sp);
+ if (__kvm_zap_rmapp(kvm, rmap_head))
+ kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+
+ if (!is_private_gfn(vcpu, sp->gfn_stolen_bits))
+ return;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ if (!is_zapped_private_pte(*sptep))
+ continue;
+
+ drop_spte(kvm, sptep);
+ }
+}
+
static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int map_writable, int max_level, kvm_pfn_t pfn,
bool prefault, bool is_tdp)
@@ -2890,10 +3035,19 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
gfn_t base_gfn = gfn;
+ bool is_private = is_private_gfn(vcpu, gfn_stolen_bits);
+ bool is_zapped_pte;

if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
return RET_PF_RETRY;

+ if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)) {
+ if (is_private)
+ return -EFAULT;
+ } else if (vcpu->kvm->arch.gfn_shared_mask) {
+ kvm_mmu_zap_alias_spte(vcpu, gfn, gpa ^ gpa_stolen_mask);
+ }
+
level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
huge_page_disallowed, &req_level);

@@ -2921,15 +3075,30 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
link_shadow_page(vcpu, it.sptep, sp);
if (is_tdp && huge_page_disallowed && req_level >= it.level)
account_huge_nx_page(vcpu->kvm, sp);
+ if (is_private)
+ kvm_mmu_link_private_sp(vcpu, sp);
}

+ is_zapped_pte = is_zapped_private_pte(*it.sptep);
+
ret = mmu_set_spte(vcpu, it.sptep, ACC_ALL,
write, level, base_gfn, pfn, prefault,
map_writable);
if (ret == RET_PF_SPURIOUS)
return ret;

- direct_pte_prefetch(vcpu, it.sptep);
+ if (!is_private) {
+ if (!vcpu->arch.mmu->no_prefetch)
+ direct_pte_prefetch(vcpu, it.sptep);
+ } else if (!WARN_ON_ONCE(ret != RET_PF_FIXED)) {
+ if (is_zapped_pte)
+ static_call(kvm_x86_unzap_private_spte)(vcpu->kvm, gfn,
+ level - 1);
+ else
+ static_call(kvm_x86_set_private_spte)(vcpu, gfn, level,
+ pfn);
+ }
+
++vcpu->stat.pf_fixed;
return ret;
}
@@ -3210,7 +3379,9 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
VALID_PAGE(mmu->prev_roots[i].hpa))
break;

- if (i == KVM_MMU_NUM_PREV_ROOTS)
+ if (i == KVM_MMU_NUM_PREV_ROOTS &&
+ (!(roots_to_free & KVM_MMU_ROOT_PRIVATE) ||
+ !VALID_PAGE(mmu->private_root_hpa)))
return;
}

@@ -3239,6 +3410,9 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
mmu->root_pgd = 0;
}

+ if (roots_to_free & KVM_MMU_ROOT_PRIVATE)
+ mmu_free_root_page(kvm, &mmu->private_root_hpa, &invalid_list);
+
kvm_mmu_commit_zap_page(kvm, &invalid_list);
write_unlock(&kvm->mmu_lock);
}
@@ -3256,12 +3430,14 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
return ret;
}

-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
- u8 level, bool direct)
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn,
+ gfn_t gfn_stolen_bits, gva_t gva, u8 level,
+ bool direct)
{
struct kvm_mmu_page *sp;

- sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+ sp = __kvm_mmu_get_page(vcpu, gfn, gfn_stolen_bits, gva, level, direct,
+ ACC_ALL);
++sp->root_count;

return __pa(sp->spt);
@@ -3271,6 +3447,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u8 shadow_root_level = mmu->shadow_root_level;
+ gfn_t gfn_shared = vcpu->kvm->arch.gfn_shared_mask;
hpa_t root;
unsigned i;
int r;
@@ -3284,9 +3461,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
mmu->root_hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
- mmu->root_hpa = root;
+ if (gfn_shared && !VALID_PAGE(vcpu->arch.mmu->private_root_hpa)) {
+ root = mmu_alloc_root(vcpu, 0, 0, 0, shadow_root_level, true);
+ vcpu->arch.mmu->private_root_hpa = root;
+ }
+ root = mmu_alloc_root(vcpu, 0, gfn_shared, 0, shadow_root_level, true);
+ vcpu->arch.mmu->root_hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
+ WARN_ON_ONCE(gfn_shared);
+
if (WARN_ON_ONCE(!mmu->pae_root)) {
r = -EIO;
goto out_unlock;
@@ -3295,7 +3478,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

- root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
+ root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
i << 30, PT32_ROOT_LEVEL, true);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_mask;
@@ -3354,8 +3537,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
* write-protect the guests page table root.
*/
if (mmu->root_level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->shadow_root_level, false);
+ root = mmu_alloc_root(vcpu, root_gfn, 0, 0,
+ vcpu->arch.mmu->shadow_root_level, false);
mmu->root_hpa = root;
goto set_root_pgd;
}
@@ -3393,7 +3576,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
root_gfn = pdptrs[i] >> PAGE_SHIFT;
}

- root = mmu_alloc_root(vcpu, root_gfn, i << 30,
+ root = mmu_alloc_root(vcpu, root_gfn, 0, i << 30,
PT32_ROOT_LEVEL, false);
mmu->pae_root[i] = root | pm_mask;
}
@@ -4325,7 +4508,6 @@ reset_ept_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
(6 & (access) ? 64 : 0) | \
(7 & (access) ? 128 : 0))

-
static void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
{
@@ -4953,14 +5135,19 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
return r;
}

-void kvm_mmu_unload(struct kvm_vcpu *vcpu)
+static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, u32 roots_to_free)
{
- kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, roots_to_free);
WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
- kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, roots_to_free);
WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
}

+void kvm_mmu_unload(struct kvm_vcpu *vcpu)
+{
+ __kvm_mmu_unload(vcpu, KVM_MMU_ROOTS_ALL);
+}
+
static bool need_remote_flush(u64 old, u64 new)
{
if (!is_shadow_present_pte(old))
@@ -5365,8 +5552,10 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
int i;

mmu->root_hpa = INVALID_PAGE;
+ mmu->private_root_hpa = INVALID_PAGE;
mmu->root_pgd = 0;
mmu->translate_gpa = translate_gpa;
+ mmu->no_prefetch = false;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;

@@ -5694,6 +5883,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
sp = sptep_to_sp(sptep);
pfn = spte_to_pfn(*sptep);

+ /* Private page dirty logging is not supported. */
+ KVM_BUG_ON(is_private_spte(kvm, sptep), kvm);
+
/*
* We cannot do huge page mapping for indirect shadow pages,
* which are found on the last rmap (level = 1) when not using
@@ -5784,7 +5976,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
}

-void kvm_mmu_zap_all(struct kvm *kvm)
+static void __kvm_mmu_zap_all(struct kvm *kvm, struct list_head *mmu_pages)
{
struct kvm_mmu_page *sp, *node;
LIST_HEAD(invalid_list);
@@ -5792,7 +5984,7 @@ void kvm_mmu_zap_all(struct kvm *kvm)

write_lock(&kvm->mmu_lock);
restart:
- list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+ list_for_each_entry_safe(sp, node, mmu_pages, link) {
if (WARN_ON(sp->role.invalid))
continue;
if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
@@ -5800,7 +5992,6 @@ void kvm_mmu_zap_all(struct kvm *kvm)
if (cond_resched_rwlock_write(&kvm->mmu_lock))
goto restart;
}
-
kvm_mmu_commit_zap_page(kvm, &invalid_list);

if (is_tdp_mmu_enabled(kvm))
@@ -5809,6 +6000,17 @@ void kvm_mmu_zap_all(struct kvm *kvm)
write_unlock(&kvm->mmu_lock);
}

+void kvm_mmu_zap_all_active(struct kvm *kvm)
+{
+ __kvm_mmu_zap_all(kvm, &kvm->arch.active_mmu_pages);
+}
+
+void kvm_mmu_zap_all_private(struct kvm *kvm)
+{
+ __kvm_mmu_zap_all(kvm, &kvm->arch.private_mmu_pages);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_zap_all_private);
+
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
{
WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
@@ -6028,7 +6230,7 @@ unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)

void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
- kvm_mmu_unload(vcpu);
+ __kvm_mmu_unload(vcpu, KVM_MMU_ROOTS_ALL_INC_PRIVATE);
free_mmu_pages(&vcpu->arch.root_mmu);
free_mmu_pages(&vcpu->arch.guest_mmu);
mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c896ec9f3159..43df47a24d13 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -51,6 +51,8 @@ struct kvm_mmu_page {
u64 *spt;
/* hold the gfn of each spte inside spt */
gfn_t *gfns;
+ /* associated private shadow page, e.g. SEPT page */
+ void *private_sp;
/* Currently serving as active root */
union {
int root_count;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f88cf3db31c7..6c4feabcef56 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -14,6 +14,9 @@
*/
#define SPTE_MMU_PRESENT_MASK BIT_ULL(11)

+/* Masks that used to track metadata for not-present SPTEs. */
+#define SPTE_PRIVATE_ZAPPED BIT_ULL(62)
+
/*
* TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
* be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
@@ -101,11 +104,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
#undef SHADOW_ACC_TRACK_SAVED_MASK

/*
- * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is an 18 bit subset of
* the memslots generation and is derived as follows:
*
* Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
- * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 8-17 of the MMIO generation are propagated to spte bits 52-61
*
* The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
* the MMIO generation number, as doing so would require stealing a bit from
@@ -119,7 +122,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
#define MMIO_SPTE_GEN_LOW_END 10

#define MMIO_SPTE_GEN_HIGH_START 52
-#define MMIO_SPTE_GEN_HIGH_END 62
+#define MMIO_SPTE_GEN_HIGH_END 61

#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
MMIO_SPTE_GEN_LOW_START)
@@ -132,7 +135,7 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)

/* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 10);

#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
@@ -260,6 +263,11 @@ static inline bool is_access_track_spte(u64 spte)
return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
}

+static inline bool is_zapped_private_pte(u64 pte)
+{
+ return !!(pte & SPTE_PRIVATE_ZAPPED);
+}
+
static inline bool is_large_pte(u64 pte)
{
return pte & PT_PAGE_SIZE_MASK;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 92d5a6649a21..a8299add443f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10874,6 +10874,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)

INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ INIT_LIST_HEAD(&kvm->arch.private_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
@@ -11299,7 +11300,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_mmu_zap_all(kvm);
+ /* Zapping private pages must be deferred until VM destruction. */
+ kvm_mmu_zap_all_active(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
2.25.1
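The spte.h change above shrinks the MMIO generation from 19 to 18 bits because bit 62 is stolen for SPTE_PRIVATE_ZAPPED. The new bit budget can be sanity-checked with a small userspace model (illustrative Python, not kernel code; the constants are copied from the patch):

```python
def genmask(high, low):
    """Inclusive bit mask [low, high], like the kernel's GENMASK_ULL()."""
    return ((1 << (high - low + 1)) - 1) << low

SPTE_PRIVATE_ZAPPED = 1 << 62          # the newly stolen bit
MMIO_SPTE_GEN_LOW_START, MMIO_SPTE_GEN_LOW_END = 3, 10
MMIO_SPTE_GEN_HIGH_START, MMIO_SPTE_GEN_HIGH_END = 52, 61

low_mask = genmask(MMIO_SPTE_GEN_LOW_END, MMIO_SPTE_GEN_LOW_START)
high_mask = genmask(MMIO_SPTE_GEN_HIGH_END, MMIO_SPTE_GEN_HIGH_START)
low_bits = MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1
high_bits = MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1

# 8 + 10 = 18 generation bits, down from 19.
assert low_bits == 8 and high_bits == 10
# The stolen bit no longer overlaps either generation field.
assert (SPTE_PRIVATE_ZAPPED & (low_mask | high_mask)) == 0
```

This mirrors the updated static_assert in the patch, which now checks for 8 low and 10 high generation bits.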

2021-07-02 22:08:49

by Isaku Yamahata

Subject: [RFC PATCH v2 39/69] KVM: x86: Add guest_supported_xss placeholder

From: Sean Christopherson <[email protected]>

Add a per-vCPU placeholder for the guest's supported XSS so that the
TDX configuration code doesn't need to hack in manual computation of the
supported XSS. KVM XSS enabling is currently being upstreamed, i.e.
guest_supported_xss will no longer be a placeholder by the time TDX is
ready for upstreaming (hopefully).

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7822b531a5e2..c641654307c6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -656,6 +656,7 @@ struct kvm_vcpu_arch {

u64 xcr0;
u64 guest_supported_xcr0;
+ u64 guest_supported_xss;

struct kvm_pio_request pio;
void *pio_data;
--
2.25.1

2021-07-02 22:08:53

by Isaku Yamahata

Subject: [RFC PATCH v2 38/69] KVM: x86: Add option to force LAPIC expiration wait

From: Sean Christopherson <[email protected]>

Add an option to skip the IRR check in kvm_wait_lapic_expire(). This
will be used by TDX to wait if there is an outstanding notification for
a TD, i.e. a virtual interrupt is being triggered via posted interrupt
processing. KVM TDX doesn't emulate PI processing, i.e. there will
never be a bit set in IRR/ISR, so the default behavior for APICv of
querying the IRR doesn't work as intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/lapic.c | 4 ++--
arch/x86/kvm/lapic.h | 2 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 977a704e3ff1..3cfc0485a46e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1622,12 +1622,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
}

-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
{
if (lapic_in_kernel(vcpu) &&
vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
vcpu->arch.apic->lapic_timer.timer_advance_ns &&
- lapic_timer_int_injected(vcpu))
+ (force_wait || lapic_timer_int_injected(vcpu)))
__kvm_wait_lapic_expire(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 997c45a5963a..2bd32d86ad6f 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)

bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);

-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);

void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
unsigned long *vcpu_bitmap);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bcc3fc4872a3..b12bfdbc394b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3774,7 +3774,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
clgi();
kvm_load_guest_xsave_state(vcpu);

- kvm_wait_lapic_expire(vcpu);
+ kvm_wait_lapic_expire(vcpu, false);

/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 36756a356704..7ce15a2c3490 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6727,7 +6727,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (enable_preemption_timer)
vmx_update_hv_timer(vcpu);

- kvm_wait_lapic_expire(vcpu);
+ kvm_wait_lapic_expire(vcpu, false);

/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
--
2.25.1
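The new wait condition can be sketched as a userspace model (illustrative Python; the argument names are descriptive stand-ins for the vCPU fields the real kvm_wait_lapic_expire() checks):

```python
def should_wait_lapic_expire(in_kernel_lapic, expired_tscdeadline,
                             timer_advance_ns, timer_int_injected,
                             force_wait):
    """Model of the predicate in kvm_wait_lapic_expire() after this patch."""
    return bool(in_kernel_lapic and expired_tscdeadline and
                timer_advance_ns and
                (force_wait or timer_int_injected))

# TDX never sets a bit in the vIRR/ISR, so timer_int_injected is always
# false; passing force_wait=True is what still allows the wait.
assert should_wait_lapic_expire(True, 1, 1000, False, True)
assert not should_wait_lapic_expire(True, 1, 1000, False, False)
```

VMX and SVM keep the old behavior by passing force_wait=False, as the call-site changes in the diff show.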

2021-07-02 22:08:54

by Isaku Yamahata

Subject: [RFC PATCH v2 37/69] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()

From: Sean Christopherson <[email protected]>

Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
interrupt to support TDX's usage of APICv. Unlike VMX, TDX doesn't have
access to vmcs.GUEST_INTR_STATUS and so can't emulate posted-interrupt
processing; KVM instead needs to generate a real posted interrupt and,
more importantly, can't manually move requested interrupts into the vIRR
(which it also doesn't have access to).

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f1d5e0a53640..92d5a6649a21 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11341,7 +11341,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)

if (kvm_arch_interrupt_allowed(vcpu) &&
(kvm_cpu_has_interrupt(vcpu) ||
- kvm_guest_apic_has_interrupt(vcpu)))
+ kvm_guest_apic_has_interrupt(vcpu) ||
+ (vcpu->arch.apicv_active &&
+ kvm_x86_ops.dy_apicv_has_pending_interrupt(vcpu))))
return true;

if (kvm_hv_has_stimer_pending(vcpu))
--
2.25.1

2021-07-02 22:09:02

by Isaku Yamahata

Subject: [RFC PATCH v2 52/69] KVM: VMX: Split out guts of EPT violation to common/exposed function

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 29 +++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 33 +++++----------------------------
2 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 81c73f30d01d..9e5865b05d47 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -5,8 +5,11 @@
#include <linux/kvm_host.h>

#include <asm/traps.h>
+#include <asm/vmx.h>

+#include "mmu.h"
#include "vmcs.h"
+#include "vmx.h"
#include "x86.h"

extern unsigned long vmx_host_idt_base;
@@ -49,4 +52,30 @@ static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
}

+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long exit_qualification)
+{
+ u64 error_code;
+
+ /* Is it a read fault? */
+ error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+ ? PFERR_USER_MASK : 0;
+ /* Is it a write fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+ ? PFERR_WRITE_MASK : 0;
+ /* Is it a fetch fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+ ? PFERR_FETCH_MASK : 0;
+ /* ept page table entry is present? */
+ error_code |= (exit_qualification &
+ (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
+ EPT_VIOLATION_EXECUTABLE))
+ ? PFERR_PRESENT_MASK : 0;
+
+ error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
+ PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+ return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 452d4d1400db..8a104a54121b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5328,11 +5328,10 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)

static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
- unsigned long exit_qualification;
- gpa_t gpa;
- u64 error_code;
+ unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
+ gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);

- exit_qualification = vmx_get_exit_qual(vcpu);
+ trace_kvm_page_fault(gpa, exit_qualification);

/*
* EPT violation happened while executing iret from NMI,
@@ -5341,31 +5340,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
* AAK134, BY25.
*/
if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
- enable_vnmi &&
- (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+ enable_vnmi && (exit_qualification & INTR_INFO_UNBLOCK_NMI))
vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);

- gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
- trace_kvm_page_fault(gpa, exit_qualification);
-
- /* Is it a read fault? */
- error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
- ? PFERR_USER_MASK : 0;
- /* Is it a write fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
- ? PFERR_WRITE_MASK : 0;
- /* Is it a fetch fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
- ? PFERR_FETCH_MASK : 0;
- /* ept page table entry is present? */
- error_code |= (exit_qualification &
- (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
- EPT_VIOLATION_EXECUTABLE))
- ? PFERR_PRESENT_MASK : 0;
-
- error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
- PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
vcpu->arch.exit_qualification = exit_qualification;

/*
@@ -5379,7 +5356,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);

- return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
}

static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
--
2.25.1
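The exit-qualification to page-fault error-code translation that this patch moves into __vmx_handle_ept_violation() can be modeled in userspace (illustrative Python; the bit positions follow the kernel's asm/vmx.h and PFERR_* definitions):

```python
# EPT violation exit-qualification bits (per asm/vmx.h)
EPT_VIOLATION_ACC_READ       = 1 << 0
EPT_VIOLATION_ACC_WRITE      = 1 << 1
EPT_VIOLATION_ACC_INSTR      = 1 << 2
EPT_VIOLATION_READABLE       = 1 << 3
EPT_VIOLATION_WRITABLE       = 1 << 4
EPT_VIOLATION_EXECUTABLE     = 1 << 5
EPT_VIOLATION_GVA_TRANSLATED = 1 << 8

# Page-fault error-code bits (per arch/x86/include/asm/kvm_host.h)
PFERR_PRESENT_MASK     = 1 << 0
PFERR_WRITE_MASK       = 1 << 1
PFERR_USER_MASK        = 1 << 2
PFERR_FETCH_MASK       = 1 << 4
PFERR_GUEST_FINAL_MASK = 1 << 32
PFERR_GUEST_PAGE_MASK  = 1 << 33

def ept_qual_to_error_code(qual):
    """Mirror of __vmx_handle_ept_violation()'s error-code derivation."""
    ec = PFERR_USER_MASK if qual & EPT_VIOLATION_ACC_READ else 0
    if qual & EPT_VIOLATION_ACC_WRITE:
        ec |= PFERR_WRITE_MASK
    if qual & EPT_VIOLATION_ACC_INSTR:
        ec |= PFERR_FETCH_MASK
    # Any of R/W/X permissions set means the EPT entry was present.
    if qual & (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
               EPT_VIOLATION_EXECUTABLE):
        ec |= PFERR_PRESENT_MASK
    if qual & EPT_VIOLATION_GVA_TRANSLATED:
        ec |= PFERR_GUEST_FINAL_MASK
    else:
        ec |= PFERR_GUEST_PAGE_MASK
    return ec

# Example: a write fault on a present, already-translated GVA.
qual = (EPT_VIOLATION_ACC_WRITE | EPT_VIOLATION_WRITABLE |
        EPT_VIOLATION_GVA_TRANSLATED)
assert ept_qual_to_error_code(qual) == (
    PFERR_WRITE_MASK | PFERR_PRESENT_MASK | PFERR_GUEST_FINAL_MASK)
```

Since the logic is pure bit manipulation, hoisting it into the common header lets TDX reuse it without touching vmx.c internals.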

2021-07-02 22:09:06

by Isaku Yamahata

Subject: [RFC PATCH v2 50/69] KVM: VMX: Move NMI/exception handler to common helper

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 52 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 52 ++++++---------------------------------
2 files changed, 60 insertions(+), 44 deletions(-)
create mode 100644 arch/x86/kvm/vmx/common.h

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..81c73f30d01d
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/traps.h>
+
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
+
+static inline void vmx_handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
+ unsigned long entry)
+{
+ kvm_before_interrupt(vcpu);
+ vmx_do_interrupt_nmi_irqoff(entry);
+ kvm_after_interrupt(vcpu);
+}
+
+static inline void vmx_handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
+
+ /* if exit due to PF check for async PF */
+ if (is_page_fault(intr_info))
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* Handle machine checks before interrupts are enabled */
+ else if (is_machine_check(intr_info))
+ kvm_machine_check();
+ /* We need to handle NMIs before interrupts are enabled */
+ else if (is_nmi(intr_info))
+ vmx_handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+ gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+ if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+ "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
+ return;
+
+ vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e08f85c93e55..452d4d1400db 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -49,6 +49,7 @@
#include <asm/vmx.h>

#include "capabilities.h"
+#include "common.h"
#include "cpuid.h"
#include "evmcs.h"
#include "hyperv.h"
@@ -453,7 +454,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
vmx->segment_cache.bitmask = 0;
}

-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;

#if IS_ENABLED(CONFIG_HYPERV)
static bool __read_mostly enlightened_vmcs = true;
@@ -4120,7 +4121,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */

- vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */
+ vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base); /* 22.2.4 */

vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */

@@ -4830,7 +4831,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
intr_info = vmx_get_intr_info(vcpu);

if (is_machine_check(intr_info) || is_nmi(intr_info))
- return 1; /* handled by handle_exception_nmi_irqoff() */
+ return 1; /* handled by vmx_handle_exception_nmi_irqoff() */

if (is_invalid_opcode(intr_info))
return handle_ud(vcpu);
@@ -6394,52 +6395,15 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
}

-void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
- unsigned long entry)
-{
- kvm_before_interrupt(vcpu);
- vmx_do_interrupt_nmi_irqoff(entry);
- kvm_after_interrupt(vcpu);
-}
-
-static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
- const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-
- /* if exit due to PF check for async PF */
- if (is_page_fault(intr_info))
- vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
- /* Handle machine checks before interrupts are enabled */
- else if (is_machine_check(intr_info))
- kvm_machine_check();
- /* We need to handle NMIs before interrupts are enabled */
- else if (is_nmi(intr_info))
- handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
- u32 intr_info)
-{
- unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
- gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
- if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
- "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
- return;
-
- handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
-}
-
static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
}

/*
@@ -7780,7 +7744,7 @@ static __init int hardware_setup(void)
int r, ept_lpage_level;

store_idt(&dt);
- host_idt_base = dt.address;
+ vmx_host_idt_base = dt.address;

vmx_setup_user_return_msrs();

--
2.25.1

2021-07-02 22:09:07

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 42/69] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault

From: Sean Christopherson <[email protected]>

Explicitly check for an MMIO spte in the fast page fault flow. TDX will
use a not-present entry for MMIO sptes, which can be mistaken for an
access-tracked spte since both have SPTE_SPECIAL_MASK set.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 990ee645b8a2..631b92e6e9ba 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3060,7 +3060,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
break;

sp = sptep_to_sp(iterator.sptep);
- if (!is_last_spte(spte, sp->role.level))
+ if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
break;

/*
--
2.25.1

2021-07-02 22:09:25

by Isaku Yamahata

Subject: [RFC PATCH v2 34/69] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 12 ++++++++++++
2 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9791c4bb5198..e3abf077f328 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1377,7 +1377,9 @@ struct kvm_x86_ops {
int (*pre_leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
void (*enable_smi_window)(struct kvm_vcpu *vcpu);

+ int (*mem_enc_op_dev)(void __user *argp);
int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
+ int (*mem_enc_op_vcpu)(struct kvm_vcpu *vcpu, void __user *argp);
int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f7ae0a47e555..da9f1081cb03 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4109,6 +4109,12 @@ long kvm_arch_dev_ioctl(struct file *filp,
case KVM_GET_SUPPORTED_HV_CPUID:
r = kvm_ioctl_get_supported_hv_cpuid(NULL, argp);
break;
+ case KVM_MEMORY_ENCRYPT_OP:
+ r = -EINVAL;
+ if (!kvm_x86_ops.mem_enc_op_dev)
+ goto out;
+ r = kvm_x86_ops.mem_enc_op_dev(argp);
+ break;
default:
r = -EINVAL;
break;
@@ -5263,6 +5269,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
#endif
+ case KVM_MEMORY_ENCRYPT_OP:
+ r = -EINVAL;
+ if (!kvm_x86_ops.mem_enc_op_vcpu)
+ goto out;
+ r = kvm_x86_ops.mem_enc_op_vcpu(vcpu, argp);
+ break;
default:
r = -EINVAL;
}
--
2.25.1

2021-07-02 22:09:25

by Isaku Yamahata

Subject: [RFC PATCH v2 32/69] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID

From: Sean Christopherson <[email protected]>

Let userspace, or in the case of TDX, KVM itself, enable X2APIC even if
X2APIC is not reported as supported in the guest's CPU model. KVM
generally does not force specific ordering between ioctls(), and gating
X2APIC on CPUID would force userspace to configure CPUID before MSRs.
And for TDX, vCPUs
will always run with X2APIC enabled, e.g. KVM will want/need to enable
X2APIC from time zero.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3407870b6f44..c231a88d5946 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -437,8 +437,11 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
enum lapic_mode old_mode = kvm_get_apic_mode(vcpu);
enum lapic_mode new_mode = kvm_apic_mode(msr_info->data);
- u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff |
- (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) ? 0 : X2APIC_ENABLE);
+ u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff;
+
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
+ reserved_bits |= X2APIC_ENABLE;

if ((msr_info->data & reserved_bits) != 0 || new_mode == LAPIC_MODE_INVALID)
return 1;
--
2.25.1
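The reserved-bit logic after this patch can be sketched in userspace (illustrative Python; X2APIC_ENABLE is bit 10 of the IA32_APIC_BASE MSR and 0x2ff covers the always-reserved low bits, both per the kernel sources):

```python
X2APIC_ENABLE = 1 << 10   # IA32_APIC_BASE x2APIC enable bit

def apic_base_reserved_bits(host_initiated, guest_cpuid_has_x2apic,
                            gpa_reserved_bits=0):
    """Model of the mask built by kvm_set_apic_base() after this patch."""
    reserved = gpa_reserved_bits | 0x2ff
    # Only guest-initiated writes are gated on CPUID.
    if not host_initiated and not guest_cpuid_has_x2apic:
        reserved |= X2APIC_ENABLE
    return reserved

# A guest WRMSR may not set X2APIC_ENABLE without CPUID support...
assert apic_base_reserved_bits(False, False) & X2APIC_ENABLE
# ...but a host-initiated write, or a guest with CPUID support, may.
assert not (apic_base_reserved_bits(True, False) & X2APIC_ENABLE)
assert not (apic_base_reserved_bits(False, True) & X2APIC_ENABLE)
```

A write touching any bit in the returned mask is rejected with #GP semantics (the function returns 1), exactly as before the patch.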

2021-07-02 22:09:26

by Isaku Yamahata

Subject: [RFC PATCH v2 63/69] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 14 ++++++++++++++
arch/x86/kvm/vmx/vmx.c | 10 +---------
2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 755aaec85199..817ff3e74933 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -120,6 +120,20 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

+static inline u32 __vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+ u32 interruptibility;
+ int ret = 0;
+
+ interruptibility = vmread32(vcpu, GUEST_INTERRUPTIBILITY_INFO);
+ if (interruptibility & GUEST_INTR_STATE_STI)
+ ret |= KVM_X86_SHADOW_INT_STI;
+ if (interruptibility & GUEST_INTR_STATE_MOV_SS)
+ ret |= KVM_X86_SHADOW_INT_MOV_SS;
+
+ return ret;
+}
+
static inline u32 vmx_encode_ar_bytes(struct kvm_segment *var)
{
u32 ar;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d69d4dc7c071..d31cace67907 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1467,15 +1467,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)

u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
{
- u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
- int ret = 0;
-
- if (interruptibility & GUEST_INTR_STATE_STI)
- ret |= KVM_X86_SHADOW_INT_STI;
- if (interruptibility & GUEST_INTR_STATE_MOV_SS)
- ret |= KVM_X86_SHADOW_INT_MOV_SS;
-
- return ret;
+ return __vmx_get_interrupt_shadow(vcpu);
}

void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
--
2.25.1
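The translation in __vmx_get_interrupt_shadow() can be modeled in userspace (illustrative Python; values per the kernel headers — note the VMCS and KVM ABI encodings of STI vs. MOV SS blocking are swapped, which is why a bit-by-bit translation is needed at all):

```python
GUEST_INTR_STATE_STI    = 0x0001   # VMCS interruptibility-state bits
GUEST_INTR_STATE_MOV_SS = 0x0002

KVM_X86_SHADOW_INT_MOV_SS = 0x01   # KVM userspace ABI bits
KVM_X86_SHADOW_INT_STI    = 0x02

def get_interrupt_shadow(interruptibility):
    """Model of __vmx_get_interrupt_shadow()'s bit translation."""
    ret = 0
    if interruptibility & GUEST_INTR_STATE_STI:
        ret |= KVM_X86_SHADOW_INT_STI
    if interruptibility & GUEST_INTR_STATE_MOV_SS:
        ret |= KVM_X86_SHADOW_INT_MOV_SS
    return ret

assert get_interrupt_shadow(0) == 0
assert get_interrupt_shadow(GUEST_INTR_STATE_STI) == KVM_X86_SHADOW_INT_STI
assert get_interrupt_shadow(GUEST_INTR_STATE_STI |
                            GUEST_INTR_STATE_MOV_SS) == 0x03
```

The only functional difference in the patch is that the helper reads the field via vmread32(vcpu, ...) instead of vmcs_read32(), so a non-VMX caller can supply its own read path.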

2021-07-02 22:09:26

by Isaku Yamahata

Subject: [RFC PATCH v2 61/69] KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h

From: Sean Christopherson <[email protected]>

Move the AR_BYTES helpers to common.h so that future patches can reuse
them to decode/encode AR for TDX.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 41 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 47 ++++-----------------------------------
2 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index aa6a569b87d1..755aaec85199 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,6 +4,7 @@

#include <linux/kvm_host.h>

+#include <asm/kvm.h>
#include <asm/traps.h>
#include <asm/vmx.h>

@@ -119,4 +120,44 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

+static inline u32 vmx_encode_ar_bytes(struct kvm_segment *var)
+{
+ u32 ar;
+
+ if (var->unusable || !var->present)
+ ar = 1 << 16;
+ else {
+ ar = var->type & 15;
+ ar |= (var->s & 1) << 4;
+ ar |= (var->dpl & 3) << 5;
+ ar |= (var->present & 1) << 7;
+ ar |= (var->avl & 1) << 12;
+ ar |= (var->l & 1) << 13;
+ ar |= (var->db & 1) << 14;
+ ar |= (var->g & 1) << 15;
+ }
+
+ return ar;
+}
+
+static inline void vmx_decode_ar_bytes(u32 ar, struct kvm_segment *var)
+{
+ var->unusable = (ar >> 16) & 1;
+ var->type = ar & 15;
+ var->s = (ar >> 4) & 1;
+ var->dpl = (ar >> 5) & 3;
+ /*
+ * Some userspaces do not preserve unusable property. Since usable
+ * segment has to be present according to VMX spec we can use present
+ * property to amend userspace bug by making unusable segment always
+ * nonpresent. vmx_encode_ar_bytes() already marks nonpresent
+ * segment as unusable.
+ */
+ var->present = !var->unusable;
+ var->avl = (ar >> 12) & 1;
+ var->l = (ar >> 13) & 1;
+ var->db = (ar >> 14) & 1;
+ var->g = (ar >> 15) & 1;
+}
+
#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3c3bfc80d2bb..40843ca2fb33 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -365,8 +365,6 @@ static const struct kernel_param_ops vmentry_l1d_flush_ops = {
};
module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);

-static u32 vmx_segment_access_rights(struct kvm_segment *var);
-
void vmx_vmexit(void);

#define vmx_insn_failed(fmt...) \
@@ -2826,7 +2824,7 @@ static void fix_rmode_seg(int seg, struct kvm_segment *save)
vmcs_write16(sf->selector, var.selector);
vmcs_writel(sf->base, var.base);
vmcs_write32(sf->limit, var.limit);
- vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(&var));
+ vmcs_write32(sf->ar_bytes, vmx_encode_ar_bytes(&var));
}

static void enter_rmode(struct kvm_vcpu *vcpu)
@@ -3217,7 +3215,6 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u32 ar;

if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
*var = vmx->rmode.segs[seg];
@@ -3231,23 +3228,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->base = vmx_read_guest_seg_base(vmx, seg);
var->limit = vmx_read_guest_seg_limit(vmx, seg);
var->selector = vmx_read_guest_seg_selector(vmx, seg);
- ar = vmx_read_guest_seg_ar(vmx, seg);
- var->unusable = (ar >> 16) & 1;
- var->type = ar & 15;
- var->s = (ar >> 4) & 1;
- var->dpl = (ar >> 5) & 3;
- /*
- * Some userspaces do not preserve unusable property. Since usable
- * segment has to be present according to VMX spec we can use present
- * property to amend userspace bug by making unusable segment always
- * nonpresent. vmx_segment_access_rights() already marks nonpresent
- * segment as unusable.
- */
- var->present = !var->unusable;
- var->avl = (ar >> 12) & 1;
- var->l = (ar >> 13) & 1;
- var->db = (ar >> 14) & 1;
- var->g = (ar >> 15) & 1;
+ vmx_decode_ar_bytes(vmx_read_guest_seg_ar(vmx, seg), var);
}

static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
@@ -3273,26 +3254,6 @@ int vmx_get_cpl(struct kvm_vcpu *vcpu)
}
}

-static u32 vmx_segment_access_rights(struct kvm_segment *var)
-{
- u32 ar;
-
- if (var->unusable || !var->present)
- ar = 1 << 16;
- else {
- ar = var->type & 15;
- ar |= (var->s & 1) << 4;
- ar |= (var->dpl & 3) << 5;
- ar |= (var->present & 1) << 7;
- ar |= (var->avl & 1) << 12;
- ar |= (var->l & 1) << 13;
- ar |= (var->db & 1) << 14;
- ar |= (var->g & 1) << 15;
- }
-
- return ar;
-}
-
void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3327,7 +3288,7 @@ void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
var->type |= 0x1; /* Accessed */

- vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
+ vmcs_write32(sf->ar_bytes, vmx_encode_ar_bytes(var));

out:
vmx->emulation_required = emulation_required(vcpu);
@@ -3374,7 +3335,7 @@ static bool rmode_segment_valid(struct kvm_vcpu *vcpu, int seg)
var.dpl = 0x3;
if (seg == VCPU_SREG_CS)
var.type = 0x3;
- ar = vmx_segment_access_rights(&var);
+ ar = vmx_encode_ar_bytes(&var);

if (var.base != (var.selector << 4))
return false;
--
2.25.1

2021-07-02 22:09:26

by Isaku Yamahata

Subject: [RFC PATCH v2 59/69] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers

From: Sean Christopherson <[email protected]>

Stub in kvm_tdx, vcpu_tdx, their various accessors, and VMCS helpers.
The VMCS helpers, which rely on the stubs, will be used by preparatory
patches to move VMX functions for accessing VMCS state to common code.
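
The to_kvm_tdx()/to_tdx() accessors below are the standard container_of() pattern: the generic struct (kvm, kvm_vcpu) is embedded in the TDX wrapper, so the wrapper can be recovered from a pointer to the embedded member. A reduced, user-space sketch of the pattern (type names here are illustrative, not from the series):

```c
#include <assert.h>
#include <stddef.h>

/* container_of() as in the kernel, reduced to standard C. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct kvm_stub { int vm_type; };

struct kvm_tdx_stub {
	struct kvm_stub kvm;	/* embedded generic struct */
	unsigned long tdr_pa;	/* TDX-private state follows */
};

static struct kvm_tdx_stub *to_kvm_tdx_stub(struct kvm_stub *kvm)
{
	return container_of(kvm, struct kvm_tdx_stub, kvm);
}
```

Because the generic struct is the first member, the cast is effectively a no-op at runtime, which is why the accessors can be __always_inline one-liners.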

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.h | 170 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 170 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx.h

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
new file mode 100644
index 000000000000..c2849e0f4260
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -0,0 +1,170 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_H
+#define __KVM_X86_TDX_H
+
+#include <linux/list.h>
+#include <linux/kvm_host.h>
+
+#include "tdx_arch.h"
+#include "tdx_errno.h"
+#include "tdx_ops.h"
+
+#ifdef CONFIG_KVM_INTEL_TDX
+
+struct tdx_td_page {
+ unsigned long va;
+ hpa_t pa;
+ bool added;
+};
+
+struct kvm_tdx {
+ struct kvm kvm;
+
+ struct tdx_td_page tdr;
+ struct tdx_td_page tdcs[TDX1_NR_TDCX_PAGES];
+};
+
+struct vcpu_tdx {
+ struct kvm_vcpu vcpu;
+
+ struct tdx_td_page tdvpr;
+ struct tdx_td_page tdvpx[TDX1_NR_TDVPX_PAGES];
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+ return is_td(vcpu->kvm);
+}
+
+static inline bool is_debug_td(struct kvm_vcpu *vcpu)
+{
+ return !vcpu->arch.guest_state_protected;
+}
+
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+ return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+ return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+ BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
+ "Read/Write to TD VMCS *_HIGH fields not supported");
+
+ BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+ BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+ (((field) & 0x6000) == 0x2000 ||
+ ((field) & 0x6000) == 0x6000),
+ "Invalid TD VMCS access for 64-bit field");
+ BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+ ((field) & 0x6000) == 0x4000,
+ "Invalid TD VMCS access for 32-bit field");
+ BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+ ((field) & 0x6000) == 0x0000,
+ "Invalid TD VMCS access for 16-bit field");
+}
+
+static __always_inline void tdvps_gpr_check(u64 field, u8 bits)
+{
+ BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) >= NR_VCPU_REGS,
+ "Invalid TD guest GPR index");
+}
+
+static __always_inline void tdvps_apic_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_dr_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_state_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_msr_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
+ u32 field) \
+{ \
+ struct tdx_ex_ret ex_ret; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_rd(tdx->tdvpr.pa, TDVPS_##uclass(field), &ex_ret); \
+ if (unlikely(err)) { \
+ pr_err("TDH_VP_RD["#uclass".0x%x] failed: %s (0x%llx)\n", \
+ field, tdx_seamcall_error_name(err), err); \
+ return 0; \
+ } \
+ return (u##bits)ex_ret.r8; \
+} \
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx, \
+ u32 field, u##bits val) \
+{ \
+ struct tdx_ex_ret ex_ret; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), val, \
+ GENMASK_ULL(bits - 1, 0), &ex_ret); \
+ if (unlikely(err)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: %s (0x%llx)\n", \
+ field, (u64)val, tdx_seamcall_error_name(err), err); \
+} \
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ struct tdx_ex_ret ex_ret; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), bit, bit, \
+ &ex_ret); \
+ if (unlikely(err)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: %s (0x%llx)\n", \
+ field, bit, tdx_seamcall_error_name(err), err); \
+} \
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ struct tdx_ex_ret ex_ret; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), 0, bit, \
+ &ex_ret); \
+ if (unlikely(err)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: %s (0x%llx)\n", \
+ field, bit, tdx_seamcall_error_name(err), err); \
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
+TDX_BUILD_TDVPS_ACCESSORS(64, APIC, apic);
+TDX_BUILD_TDVPS_ACCESSORS(64, GPR, gpr);
+TDX_BUILD_TDVPS_ACCESSORS(64, DR, dr);
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE, state);
+TDX_BUILD_TDVPS_ACCESSORS(64, MSR, msr);
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
+#else
+
+struct kvm_tdx;
+struct vcpu_tdx;
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline bool is_debug_td(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+
+#endif /* CONFIG_KVM_INTEL_TDX */
+
+#endif /* __KVM_X86_TDX_H */
--
2.25.1

2021-07-02 22:09:27

by Isaku Yamahata

Subject: [RFC PATCH v2 56/69] KVM: VMX: Move setting of EPT MMU masks to common VT-x code

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 4 ++++
arch/x86/kvm/vmx/vmx.c | 4 ----
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c3fefa0e5a63..0d8d2a0a2979 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -34,6 +34,10 @@ static __init int vt_hardware_setup(void)
if (ret)
return ret;

+ if (enable_ept)
+ kvm_mmu_set_ept_masks(enable_ept_ad_bits,
+ cpu_has_vmx_ept_execute_only());
+
return 0;
}

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 77b2b2cf76db..e315a46d1566 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7644,10 +7644,6 @@ static __init int hardware_setup(void)

set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */

- if (enable_ept)
- kvm_mmu_set_ept_masks(enable_ept_ad_bits,
- cpu_has_vmx_ept_execute_only());
-
if (!enable_ept)
ept_lpage_level = 0;
else if (cpu_has_vmx_ept_1g_page())
--
2.25.1

2021-07-02 22:09:38

by Isaku Yamahata

Subject: [RFC PATCH v2 18/69] KVM: Export kvm_make_all_cpus_request() for use in marking VMs as bugged

From: Sean Christopherson <[email protected]>

Export kvm_make_all_cpus_request() and hoist the request helper
declarations up to the KVM_REQ_* definitions in preparation
for adding a "VM bugged" framework. The framework will add KVM_BUG()
and KVM_BUG_ON() as alternatives to full BUG()/BUG_ON() for cases where
KVM has definitely hit a bug (in itself or in silicon) and the VM is all
but guaranteed to be hosed. Marking a VM bugged will trigger a request
to all vCPUs to allow arch code to forcefully evict each vCPU from its
run loop.
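
The intended KVM_BUG_ON() semantics (latch a per-VM flag on first failure, broadcast a request to evict vCPUs, make later checks cheap) can be modeled outside the kernel. In this sketch WARN_ON_ONCE() is reduced to a flag check and the request broadcast is stubbed to a counter; it is an illustration of the once-latching behavior, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct vm { bool vm_bugged; int requests_sent; };

/* Stand-in for kvm_make_all_cpus_request(vm, KVM_REQ_VM_BUGGED). */
static void vm_broadcast_bugged(struct vm *vm)
{
	vm->requests_sent++;
}

static void vm_bugged(struct vm *vm)
{
	vm->vm_bugged = true;
	vm_broadcast_bugged(vm);
}

/* Models KVM_BUG_ON(): warn and latch the flag on the first failure only. */
#define VM_BUG_ON(cond, vm)				\
({							\
	bool __ret = (cond);				\
	if (__ret && !(vm)->vm_bugged) {		\
		fprintf(stderr, "VM bug hit\n");	\
		vm_bugged(vm);				\
	}						\
	__ret;						\
})
```

The macro still evaluates to the condition, so call sites can use it in `if (VM_BUG_ON(...))` exactly like BUG_ON() replacements.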

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 09618f8a1338..e87f07c5c601 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -158,6 +158,15 @@ static inline bool is_error_page(struct page *page)
})
#define KVM_ARCH_REQ(nr) KVM_ARCH_REQ_FLAGS(nr, 0)

+bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
+ struct kvm_vcpu *except,
+ unsigned long *vcpu_bitmap, cpumask_var_t tmp);
+bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
+bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
+ struct kvm_vcpu *except);
+bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req,
+ unsigned long *vcpu_bitmap);
+
#define KVM_USERSPACE_IRQ_SOURCE_ID 0
#define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1

@@ -616,7 +625,6 @@ struct kvm {
#define vcpu_err(vcpu, fmt, ...) \
kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)

-bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
static inline void kvm_vm_bugged(struct kvm *kvm)
{
kvm->vm_bugged = true;
@@ -955,14 +963,6 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif

-bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
- struct kvm_vcpu *except,
- unsigned long *vcpu_bitmap, cpumask_var_t tmp);
-bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
- struct kvm_vcpu *except);
-bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req,
- unsigned long *vcpu_bitmap);
-
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
long kvm_arch_vcpu_ioctl(struct file *filp,
--
2.25.1

2021-07-02 22:09:38

by Isaku Yamahata

Subject: [RFC PATCH v2 17/69] KVM: Add infrastructure and macro to mark VM as bugged

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 29 ++++++++++++++++++++++++++++-
virt/kvm/kvm_main.c | 10 +++++-----
2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8583ed3ff344..09618f8a1338 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -149,6 +149,7 @@ static inline bool is_error_page(struct page *page)
#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UNBLOCK 2
#define KVM_REQ_UNHALT 3
+#define KVM_REQ_VM_BUGGED (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQUEST_ARCH_BASE 8

#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -585,6 +586,8 @@ struct kvm {
pid_t userspace_pid;
unsigned int max_halt_poll_ns;
u32 dirty_ring_size;
+
+ bool vm_bugged;
};

#define kvm_err(fmt, ...) \
@@ -613,6 +616,31 @@ struct kvm {
#define vcpu_err(vcpu, fmt, ...) \
kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)

+bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
+static inline void kvm_vm_bugged(struct kvm *kvm)
+{
+ kvm->vm_bugged = true;
+ kvm_make_all_cpus_request(kvm, KVM_REQ_VM_BUGGED);
+}
+
+#define KVM_BUG(cond, kvm, fmt...) \
+({ \
+ int __ret = (cond); \
+ \
+ if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt)) \
+ kvm_vm_bugged(kvm); \
+ unlikely(__ret); \
+})
+
+#define KVM_BUG_ON(cond, kvm) \
+({ \
+ int __ret = (cond); \
+ \
+ if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
+ kvm_vm_bugged(kvm); \
+ unlikely(__ret); \
+})
+
static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
{
return !!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET);
@@ -930,7 +958,6 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
struct kvm_vcpu *except,
unsigned long *vcpu_bitmap, cpumask_var_t tmp);
-bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
struct kvm_vcpu *except);
bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 751d1f6890b0..dc752d0bd3ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3435,7 +3435,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
struct kvm_fpu *fpu = NULL;
struct kvm_sregs *kvm_sregs = NULL;

- if (vcpu->kvm->mm != current->mm)
+ if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
return -EIO;

if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
@@ -3641,7 +3641,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
void __user *argp = compat_ptr(arg);
int r;

- if (vcpu->kvm->mm != current->mm)
+ if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
return -EIO;

switch (ioctl) {
@@ -3707,7 +3707,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
{
struct kvm_device *dev = filp->private_data;

- if (dev->kvm->mm != current->mm)
+ if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
return -EIO;

switch (ioctl) {
@@ -3991,7 +3991,7 @@ static long kvm_vm_ioctl(struct file *filp,
void __user *argp = (void __user *)arg;
int r;

- if (kvm->mm != current->mm)
+ if (kvm->mm != current->mm || kvm->vm_bugged)
return -EIO;
switch (ioctl) {
case KVM_CREATE_VCPU:
@@ -4189,7 +4189,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
struct kvm *kvm = filp->private_data;
int r;

- if (kvm->mm != current->mm)
+ if (kvm->mm != current->mm || kvm->vm_bugged)
return -EIO;
switch (ioctl) {
case KVM_GET_DIRTY_LOG: {
--
2.25.1

2021-07-02 22:09:39

by Isaku Yamahata

Subject: [RFC PATCH v2 14/69] KVM: x86: Split core of hypercall emulation to helper function

From: Sean Christopherson <[email protected]>

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
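
The split keeps the register-independent core (32-bit masking plus dispatch on the hypercall number) reusable, so a TDX caller can feed arguments from its own register ABI instead of RAX/RBX/RCX/RDX/RSI. A standalone sketch of just the masking step; the hypercall number and error value are placeholders, not KVM's:

```c
#include <assert.h>
#include <stdint.h>

#define HC_ENOSYS 1000	/* placeholder for -KVM_ENOSYS */

/* ABI-agnostic core, mirroring __kvm_emulate_hypercall()'s masking. */
static unsigned long emulate_hypercall_core(unsigned long nr,
					    unsigned long a0,
					    unsigned long a1,
					    int op_64_bit)
{
	if (!op_64_bit) {
		/* 32-bit callers only pass the low halves of each register. */
		nr &= 0xFFFFFFFF;
		a0 &= 0xFFFFFFFF;
		a1 &= 0xFFFFFFFF;
	}

	switch (nr) {
	case 1:			/* illustrative "add" hypercall */
		return a0 + a1;
	default:
		return -HC_ENOSYS;
	}
}
```

Reading the registers and checking CPL stay in kvm_emulate_hypercall(), since both are ABI- and privilege-model-specific.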

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/kvm/x86.c | 55 ++++++++++++++++++++-------------
2 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9c7ced0e3171..80b943e4ab6d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1667,6 +1667,10 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu);
void kvm_request_apicv_update(struct kvm *kvm, bool activate,
unsigned long bit);

+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit);
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);

int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d11cf87674f3..795f83a1cf9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8406,26 +8406,15 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
return;
}

-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit)
{
- unsigned long nr, a0, a1, a2, a3, ret;
- int op_64_bit;
-
- if (kvm_xen_hypercall_enabled(vcpu->kvm))
- return kvm_xen_hypercall(vcpu);
-
- if (kvm_hv_hypercall_enabled(vcpu))
- return kvm_hv_hypercall(vcpu);
-
- nr = kvm_rax_read(vcpu);
- a0 = kvm_rbx_read(vcpu);
- a1 = kvm_rcx_read(vcpu);
- a2 = kvm_rdx_read(vcpu);
- a3 = kvm_rsi_read(vcpu);
+ unsigned long ret;

trace_kvm_hypercall(nr, a0, a1, a2, a3);

- op_64_bit = is_64_bit_mode(vcpu);
if (!op_64_bit) {
nr &= 0xFFFFFFFF;
a0 &= 0xFFFFFFFF;
@@ -8434,11 +8423,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
a3 &= 0xFFFFFFFF;
}

- if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
- ret = -KVM_EPERM;
- goto out;
- }
-
ret = -KVM_ENOSYS;

switch (nr) {
@@ -8475,6 +8459,35 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
ret = -KVM_ENOSYS;
break;
}
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+ int op_64_bit;
+
+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
+ return kvm_xen_hypercall(vcpu);
+
+ if (kvm_hv_hypercall_enabled(vcpu))
+ return kvm_hv_hypercall(vcpu);
+
+ nr = kvm_rax_read(vcpu);
+ a0 = kvm_rbx_read(vcpu);
+ a1 = kvm_rcx_read(vcpu);
+ a2 = kvm_rdx_read(vcpu);
+ a3 = kvm_rsi_read(vcpu);
+
+ op_64_bit = is_64_bit_mode(vcpu);
+
+ if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+ ret = -KVM_EPERM;
+ goto out;
+ }
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
out:
if (!op_64_bit)
ret = (u32)ret;
--
2.25.1

2021-07-02 22:09:48

by Isaku Yamahata

Subject: [RFC PATCH v2 11/69] KVM: TDX: Introduce pr_seamcall_ex_ret_info() to print more info when SEAMCALL fails

From: Xiaoyao Li <[email protected]>

Various SEAMCALLs may return additional info in RCX, RDX, R8, R9 and R10
when they complete. Introduce pr_seamcall_ex_ret_info() to parse that
additional info based on the returned SEAMCALL status code.

It only handles a few status codes for reference; handling for other
SEAMCALL status codes can be added on demand.
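
The dispatch works by masking off the low 32 bits of the return value, which carry case-specific detail, and switching on the high 32-bit status class. A minimal sketch of that classification; the mask matches TDX_SEAMCALL_STATUS_MASK and the two status values are taken from tdx_errno.h in this series:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
#define PAGE_ALREADY_ACCEPTED		0x00000B0A00000000ULL
#define PAGE_SIZE_MISMATCH		0xC0000B0B00000000ULL

/* Classify a SEAMCALL return code by its high 32-bit status class. */
static const char *status_name(uint64_t err)
{
	switch (err & SEAMCALL_STATUS_MASK) {
	case PAGE_ALREADY_ACCEPTED:
		return "TDX_PAGE_ALREADY_ACCEPTED";
	case PAGE_SIZE_MISMATCH:
		return "TDX_PAGE_SIZE_MISMATCH";
	default:
		return "Unknown SEAMCALL status code";
	}
}
```

pr_seamcall_ex_ret_info() uses the same masking to decide which tdx_ex_ret fields are meaningful before printing them.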

Signed-off-by: Xiaoyao Li <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/boot/seam/tdx_common.c | 54 +++++++++++++++++++++++++++++
arch/x86/kvm/vmx/seamcall.h | 6 ++--
2 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/boot/seam/tdx_common.c b/arch/x86/kvm/boot/seam/tdx_common.c
index 4fe352fb8586..a81a4672bf58 100644
--- a/arch/x86/kvm/boot/seam/tdx_common.c
+++ b/arch/x86/kvm/boot/seam/tdx_common.c
@@ -186,3 +186,57 @@ const char *tdx_seamcall_error_name(u64 error_code)
return "Unknown SEAMCALL status code";
}
EXPORT_SYMBOL_GPL(tdx_seamcall_error_name);
+
+static const char * const TDX_SEPT_ENTRY_STATES[] = {
+ "SEPT_FREE",
+ "SEPT_BLOCKED",
+ "SEPT_PENDING",
+ "SEPT_PENDING_BLOCKED",
+ "SEPT_PRESENT"
+};
+
+void pr_seamcall_ex_ret_info(u64 op, u64 error_code, struct tdx_ex_ret *ex_ret)
+{
+ if (!ex_ret)
+ return;
+
+ switch (error_code & TDX_SEAMCALL_STATUS_MASK) {
+ case TDX_INCORRECT_CPUID_VALUE:
+ pr_err("Expected CPUID [leaf 0x%x subleaf 0x%x]: "
+ "eax 0x%x check_mask 0x%x, ebx 0x%x check_mask 0x%x, "
+ "ecx 0x%x check_mask 0x%x, edx 0x%x check_mask 0x%x\n",
+ ex_ret->leaf, ex_ret->subleaf,
+ ex_ret->eax_val, ex_ret->eax_mask,
+ ex_ret->ebx_val, ex_ret->ebx_mask,
+ ex_ret->ecx_val, ex_ret->ecx_mask,
+ ex_ret->edx_val, ex_ret->edx_mask);
+ break;
+ case TDX_INCONSISTENT_CPUID_FIELD:
+ pr_err("Inconsistent CPUID [leaf 0x%x subleaf 0x%x]: "
+ "eax_mask 0x%x, ebx_mask 0x%x, ecx_mask %x, edx_mask 0x%x\n",
+ ex_ret->leaf, ex_ret->subleaf,
+ ex_ret->eax_mask, ex_ret->ebx_mask,
+ ex_ret->ecx_mask, ex_ret->edx_mask);
+ break;
+ case TDX_EPT_WALK_FAILED: {
+ const char *state;
+
+ if (ex_ret->state >= ARRAY_SIZE(TDX_SEPT_ENTRY_STATES))
+ state = "Invalid";
+ else
+ state = TDX_SEPT_ENTRY_STATES[ex_ret->state];
+
+ pr_err("Secure EPT walk error: SEPTE 0x%llx, level %d, %s\n",
+ ex_ret->septe, ex_ret->level, state);
+ break;
+ }
+ default:
+ /* TODO: print only meaningful registers depending on op */
+ pr_err("RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, "
+ "R10 0x%llx, R11 0x%llx\n",
+ ex_ret->rcx, ex_ret->rdx, ex_ret->r8, ex_ret->r9,
+ ex_ret->r10, ex_ret->r11);
+ break;
+ }
+}
+EXPORT_SYMBOL_GPL(pr_seamcall_ex_ret_info);
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
index fbb18aea1720..85eeedc06a4f 100644
--- a/arch/x86/kvm/vmx/seamcall.h
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -38,6 +38,7 @@ static inline u64 _seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
#endif

const char *tdx_seamcall_error_name(u64 error_code);
+void pr_seamcall_ex_ret_info(u64 op, u64 error_code, struct tdx_ex_ret *ex_ret);

static inline void __pr_seamcall_error(u64 op, const char *op_str,
u64 err, struct tdx_ex_ret *ex)
@@ -46,10 +47,7 @@ static inline void __pr_seamcall_error(u64 op, const char *op_str,
op_str, smp_processor_id(),
tdx_seamcall_error_name(err), err);
if (ex)
- pr_err_ratelimited(
- "RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
- (ex)->rcx, (ex)->rdx, (ex)->r8, (ex)->r9, (ex)->r10,
- (ex)->r11);
+ pr_seamcall_ex_ret_info(op, err, ex);
}

#define pr_seamcall_error(op, err, ex) \
--
2.25.1

2021-07-02 22:09:50

by Isaku Yamahata

Subject: [RFC PATCH v2 08/69] KVM: TDX: add trace point before/after TDX SEAMCALLs

From: Isaku Yamahata <[email protected]>

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/trace.h | 80 ++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/seamcall.h | 22 ++++++++-
arch/x86/kvm/vmx/tdx_arch.h | 47 ++++++++++++++++++
arch/x86/kvm/vmx/tdx_errno.h | 96 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 2 +
5 files changed, 246 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4f839148948b..c3398d0de9a7 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -8,6 +8,9 @@
#include <asm/clocksource.h>
#include <asm/pvclock-abi.h>

+#include "vmx/tdx_arch.h"
+#include "vmx/tdx_errno.h"
+
#undef TRACE_SYSTEM
#define TRACE_SYSTEM kvm

@@ -659,6 +662,83 @@ TRACE_EVENT(kvm_nested_vmexit_inject,
__entry->exit_int_info, __entry->exit_int_info_err)
);

+/*
+ * Tracepoint for the start of TDX SEAMCALLs.
+ */
+TRACE_EVENT(kvm_tdx_seamcall_enter,
+ TP_PROTO(int cpuid, __u64 op, __u64 rcx, __u64 rdx, __u64 r8,
+ __u64 r9, __u64 r10),
+ TP_ARGS(cpuid, op, rcx, rdx, r8, r9, r10),
+
+ TP_STRUCT__entry(
+ __field( int, cpuid )
+ __field( __u64, op )
+ __field( __u64, rcx )
+ __field( __u64, rdx )
+ __field( __u64, r8 )
+ __field( __u64, r9 )
+ __field( __u64, r10 )
+ ),
+
+ TP_fast_assign(
+ __entry->cpuid = cpuid;
+ __entry->op = op;
+ __entry->rcx = rcx;
+ __entry->rdx = rdx;
+ __entry->r8 = r8;
+ __entry->r9 = r9;
+ __entry->r10 = r10;
+ ),
+
+ TP_printk("cpu: %d op: %s rcx: 0x%llx rdx: 0x%llx r8: 0x%llx r9: 0x%llx r10: 0x%llx",
+ __entry->cpuid,
+ __print_symbolic(__entry->op, TDX_SEAMCALL_OP_CODES),
+ __entry->rcx, __entry->rdx, __entry->r8,
+ __entry->r9, __entry->r10)
+);
+
+/*
+ * Tracepoint for the end of TDX SEAMCALLs.
+ */
+TRACE_EVENT(kvm_tdx_seamcall_exit,
+ TP_PROTO(int cpuid, __u64 op, __u64 err, __u64 rcx, __u64 rdx, __u64 r8,
+ __u64 r9, __u64 r10, __u64 r11),
+ TP_ARGS(cpuid, op, err, rcx, rdx, r8, r9, r10, r11),
+
+ TP_STRUCT__entry(
+ __field( int, cpuid )
+ __field( __u64, op )
+ __field( __u64, err )
+ __field( __u64, rcx )
+ __field( __u64, rdx )
+ __field( __u64, r8 )
+ __field( __u64, r9 )
+ __field( __u64, r10 )
+ __field( __u64, r11 )
+ ),
+
+ TP_fast_assign(
+ __entry->cpuid = cpuid;
+ __entry->op = op;
+ __entry->err = err;
+ __entry->rcx = rcx;
+ __entry->rdx = rdx;
+ __entry->r8 = r8;
+ __entry->r9 = r9;
+ __entry->r10 = r10;
+ __entry->r11 = r11;
+ ),
+
+ TP_printk("cpu: %d op: %s err %s 0x%llx rcx: 0x%llx rdx: 0x%llx r8: 0x%llx r9: 0x%llx r10: 0x%llx r11: 0x%llx",
+ __entry->cpuid,
+ __print_symbolic(__entry->op, TDX_SEAMCALL_OP_CODES),
+ __print_symbolic(__entry->err & TDX_SEAMCALL_STATUS_MASK,
+ TDX_SEAMCALL_STATUS_CODES),
+ __entry->err,
+ __entry->rcx, __entry->rdx, __entry->r8,
+ __entry->r9, __entry->r10, __entry->r11)
+);
+
/*
* Tracepoint for nested #vmexit because of interrupt pending
*/
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
index a318940f62ed..2c83ab46eeac 100644
--- a/arch/x86/kvm/vmx/seamcall.h
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -9,12 +9,32 @@
#else

#ifndef seamcall
+#include "trace.h"
+
struct tdx_ex_ret;
asmlinkage u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
struct tdx_ex_ret *ex);

+static inline u64 _seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
+ struct tdx_ex_ret *ex)
+{
+ u64 err;
+
+ trace_kvm_tdx_seamcall_enter(smp_processor_id(), op,
+ rcx, rdx, r8, r9, r10);
+ err = __seamcall(op, rcx, rdx, r8, r9, r10, ex);
+ if (ex)
+ trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err, ex->rcx,
+ ex->rdx, ex->r8, ex->r9, ex->r10,
+ ex->r11);
+ else
+ trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err,
+ 0, 0, 0, 0, 0, 0);
+ return err;
+}
+
#define seamcall(op, rcx, rdx, r8, r9, r10, ex) \
- __seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
+ _seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
#endif

static inline void __pr_seamcall_error(u64 op, const char *op_str,
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 57e9ea4a7fad..559a63290c4d 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -51,6 +51,53 @@
#define SEAMCALL_TDH_SYS_LP_SHUTDOWN 44
#define SEAMCALL_TDH_SYS_CONFIG 45

+#define TDX_BUILD_OP_CODE(name) { SEAMCALL_ ## name, #name }
+
+#define TDX_SEAMCALL_OP_CODES \
+ TDX_BUILD_OP_CODE(TDH_VP_ENTER), \
+ TDX_BUILD_OP_CODE(TDH_MNG_ADDCX), \
+ TDX_BUILD_OP_CODE(TDH_MEM_PAGE_ADD), \
+ TDX_BUILD_OP_CODE(TDH_MEM_SEPT_ADD), \
+ TDX_BUILD_OP_CODE(TDH_VP_ADDCX), \
+ TDX_BUILD_OP_CODE(TDH_MEM_PAGE_AUG), \
+ TDX_BUILD_OP_CODE(TDH_MEM_RANGE_BLOCK), \
+ TDX_BUILD_OP_CODE(TDH_MNG_KEY_CONFIG), \
+ TDX_BUILD_OP_CODE(TDH_MNG_CREATE), \
+ TDX_BUILD_OP_CODE(TDH_VP_CREATE), \
+ TDX_BUILD_OP_CODE(TDH_MNG_RD), \
+ TDX_BUILD_OP_CODE(TDH_PHYMEM_PAGE_RD), \
+ TDX_BUILD_OP_CODE(TDH_MNG_WR), \
+ TDX_BUILD_OP_CODE(TDH_PHYMEM_PAGE_WR), \
+ TDX_BUILD_OP_CODE(TDH_MEM_PAGE_DEMOTE), \
+ TDX_BUILD_OP_CODE(TDH_MR_EXTEND), \
+ TDX_BUILD_OP_CODE(TDH_MR_FINALIZE), \
+ TDX_BUILD_OP_CODE(TDH_VP_FLUSH), \
+ TDX_BUILD_OP_CODE(TDH_MNG_VPFLUSHDONE), \
+ TDX_BUILD_OP_CODE(TDH_MNG_KEY_FREEID), \
+ TDX_BUILD_OP_CODE(TDH_MNG_INIT), \
+ TDX_BUILD_OP_CODE(TDH_VP_INIT), \
+ TDX_BUILD_OP_CODE(TDH_MEM_PAGE_PROMOTE), \
+ TDX_BUILD_OP_CODE(TDH_PHYMEM_PAGE_RDMD), \
+ TDX_BUILD_OP_CODE(TDH_MEM_SEPT_RD), \
+ TDX_BUILD_OP_CODE(TDH_VP_RD), \
+ TDX_BUILD_OP_CODE(TDH_MNG_KEY_RECLAIMID), \
+ TDX_BUILD_OP_CODE(TDH_PHYMEM_PAGE_RECLAIM), \
+ TDX_BUILD_OP_CODE(TDH_MEM_PAGE_REMOVE), \
+ TDX_BUILD_OP_CODE(TDH_MEM_SEPT_REMOVE), \
+ TDX_BUILD_OP_CODE(TDH_SYS_KEY_CONFIG), \
+ TDX_BUILD_OP_CODE(TDH_SYS_INFO), \
+ TDX_BUILD_OP_CODE(TDH_SYS_INIT), \
+ TDX_BUILD_OP_CODE(TDH_SYS_LP_INIT), \
+ TDX_BUILD_OP_CODE(TDH_SYS_TDMR_INIT), \
+ TDX_BUILD_OP_CODE(TDH_MEM_TRACK), \
+ TDX_BUILD_OP_CODE(TDH_MEM_RANGE_UNBLOCK), \
+ TDX_BUILD_OP_CODE(TDH_PHYMEM_CACHE_WB), \
+ TDX_BUILD_OP_CODE(TDH_PHYMEM_PAGE_WBINVD), \
+ TDX_BUILD_OP_CODE(TDH_MEM_SEPT_WR), \
+ TDX_BUILD_OP_CODE(TDH_VP_WR), \
+ TDX_BUILD_OP_CODE(TDH_SYS_LP_SHUTDOWN), \
+ TDX_BUILD_OP_CODE(TDH_SYS_CONFIG)
+
#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO 0x10000
#define TDG_VP_VMCALL_MAP_GPA 0x10001
#define TDG_VP_VMCALL_GET_QUOTE 0x10002
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 675acea412c9..90ee2b5364d6 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -2,6 +2,8 @@
#ifndef __KVM_X86_TDX_ERRNO_H
#define __KVM_X86_TDX_ERRNO_H

+#define TDX_SEAMCALL_STATUS_MASK 0xFFFFFFFF00000000
+
/*
* TDX SEAMCALL Status Codes (returned in RAX)
*/
@@ -96,6 +98,100 @@
#define TDX_PAGE_ALREADY_ACCEPTED 0x00000B0A00000000
#define TDX_PAGE_SIZE_MISMATCH 0xC0000B0B00000000

+#define TDX_BUILD_STATUS_CODE(name) { name, #name }
+
+#define TDX_SEAMCALL_STATUS_CODES \
+ TDX_BUILD_STATUS_CODE(TDX_SUCCESS), \
+ TDX_BUILD_STATUS_CODE(TDX_NON_RECOVERABLE_VCPU), \
+ TDX_BUILD_STATUS_CODE(TDX_NON_RECOVERABLE_TD), \
+ TDX_BUILD_STATUS_CODE(TDX_INTERRUPTED_RESUMABLE), \
+ TDX_BUILD_STATUS_CODE(TDX_INTERRUPTED_RESTARTABLE), \
+ TDX_BUILD_STATUS_CODE(TDX_NON_RECOVERABLE_TD_FATAL), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_RESUMPTION), \
+ TDX_BUILD_STATUS_CODE(TDX_NON_RECOVERABLE_TD_NO_APIC), \
+ TDX_BUILD_STATUS_CODE(TDX_OPERAND_INVALID), \
+ TDX_BUILD_STATUS_CODE(TDX_OPERAND_ADDR_RANGE_ERROR), \
+ TDX_BUILD_STATUS_CODE(TDX_OPERAND_BUSY), \
+ TDX_BUILD_STATUS_CODE(TDX_PREVIOUS_TLB_EPOCH_BUSY), \
+ TDX_BUILD_STATUS_CODE(TDX_SYS_BUSY), \
+ TDX_BUILD_STATUS_CODE(TDX_PAGE_METADATA_INCORRECT), \
+ TDX_BUILD_STATUS_CODE(TDX_PAGE_ALREADY_FREE), \
+ TDX_BUILD_STATUS_CODE(TDX_PAGE_NOT_OWNED_BY_TD), \
+ TDX_BUILD_STATUS_CODE(TDX_PAGE_NOT_FREE), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_ASSOCIATED_PAGES_EXIST), \
+ TDX_BUILD_STATUS_CODE(TDX_SYSINIT_NOT_PENDING), \
+ TDX_BUILD_STATUS_CODE(TDX_SYSINIT_NOT_DONE), \
+ TDX_BUILD_STATUS_CODE(TDX_SYSINITLP_NOT_DONE), \
+ TDX_BUILD_STATUS_CODE(TDX_SYSINITLP_DONE), \
+ TDX_BUILD_STATUS_CODE(TDX_SYS_NOT_READY), \
+ TDX_BUILD_STATUS_CODE(TDX_SYS_SHUTDOWN), \
+ TDX_BUILD_STATUS_CODE(TDX_SYSCONFIG_NOT_DONE), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_NOT_INITIALIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_INITIALIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_NOT_FINALIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_FINALIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_FATAL), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_NON_DEBUG), \
+ TDX_BUILD_STATUS_CODE(TDX_TDCX_NUM_INCORRECT), \
+ TDX_BUILD_STATUS_CODE(TDX_VCPU_STATE_INCORRECT), \
+ TDX_BUILD_STATUS_CODE(TDX_VCPU_ASSOCIATED), \
+ TDX_BUILD_STATUS_CODE(TDX_VCPU_NOT_ASSOCIATED), \
+ TDX_BUILD_STATUS_CODE(TDX_TDVPX_NUM_INCORRECT), \
+ TDX_BUILD_STATUS_CODE(TDX_NO_VALID_VE_INFO), \
+ TDX_BUILD_STATUS_CODE(TDX_MAX_VCPUS_EXCEEDED), \
+ TDX_BUILD_STATUS_CODE(TDX_TSC_ROLLBACK), \
+ TDX_BUILD_STATUS_CODE(TDX_FIELD_NOT_WRITABLE), \
+ TDX_BUILD_STATUS_CODE(TDX_FIELD_NOT_READABLE), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_VMCS_FIELD_NOT_INITIALIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_KEY_GENERATION_FAILED), \
+ TDX_BUILD_STATUS_CODE(TDX_TD_KEYS_NOT_CONFIGURED), \
+ TDX_BUILD_STATUS_CODE(TDX_KEY_STATE_INCORRECT), \
+ TDX_BUILD_STATUS_CODE(TDX_KEY_CONFIGURED), \
+ TDX_BUILD_STATUS_CODE(TDX_WBCACHE_NOT_COMPLETE), \
+ TDX_BUILD_STATUS_CODE(TDX_HKID_NOT_FREE), \
+ TDX_BUILD_STATUS_CODE(TDX_NO_HKID_READY_TO_WBCACHE), \
+ TDX_BUILD_STATUS_CODE(TDX_WBCACHE_RESUME_ERROR), \
+ TDX_BUILD_STATUS_CODE(TDX_FLUSHVP_NOT_DONE), \
+ TDX_BUILD_STATUS_CODE(TDX_NUM_ACTIVATED_HKIDS_NOT_SUPPORTED), \
+ TDX_BUILD_STATUS_CODE(TDX_INCORRECT_CPUID_VALUE), \
+ TDX_BUILD_STATUS_CODE(TDX_BOOT_NT4_SET), \
+ TDX_BUILD_STATUS_CODE(TDX_INCONSISTENT_CPUID_FIELD), \
+ TDX_BUILD_STATUS_CODE(TDX_CPUID_LEAF_1F_FORMAT_UNRECOGNIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_WBINVD_SCOPE), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_PKG_ID), \
+ TDX_BUILD_STATUS_CODE(TDX_CPUID_LEAF_NOT_SUPPORTED), \
+ TDX_BUILD_STATUS_CODE(TDX_SMRR_NOT_LOCKED), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_SMRR_CONFIGURATION), \
+ TDX_BUILD_STATUS_CODE(TDX_SMRR_OVERLAPS_CMR), \
+ TDX_BUILD_STATUS_CODE(TDX_SMRR_LOCK_NOT_SUPPORTED), \
+ TDX_BUILD_STATUS_CODE(TDX_SMRR_NOT_SUPPORTED), \
+ TDX_BUILD_STATUS_CODE(TDX_INCONSISTENT_MSR), \
+ TDX_BUILD_STATUS_CODE(TDX_INCORRECT_MSR_VALUE), \
+ TDX_BUILD_STATUS_CODE(TDX_SEAMREPORT_NOT_AVAILABLE), \
+ TDX_BUILD_STATUS_CODE(TDX_PERF_COUNTERS_ARE_PEBS_ENABLED), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_TDMR), \
+ TDX_BUILD_STATUS_CODE(TDX_NON_ORDERED_TDMR), \
+ TDX_BUILD_STATUS_CODE(TDX_TDMR_OUTSIDE_CMRS), \
+ TDX_BUILD_STATUS_CODE(TDX_TDMR_ALREADY_INITIALIZED), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_PAMT), \
+ TDX_BUILD_STATUS_CODE(TDX_PAMT_OUTSIDE_CMRS), \
+ TDX_BUILD_STATUS_CODE(TDX_PAMT_OVERLAP), \
+ TDX_BUILD_STATUS_CODE(TDX_INVALID_RESERVED_IN_TDMR), \
+ TDX_BUILD_STATUS_CODE(TDX_NON_ORDERED_RESERVED_IN_TDMR), \
+ TDX_BUILD_STATUS_CODE(TDX_CMR_LIST_INVALID), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_WALK_FAILED), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_ENTRY_FREE), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_ENTRY_NOT_FREE), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_ENTRY_NOT_PRESENT), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_ENTRY_NOT_LEAF), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_ENTRY_LEAF), \
+ TDX_BUILD_STATUS_CODE(TDX_GPA_RANGE_NOT_BLOCKED), \
+ TDX_BUILD_STATUS_CODE(TDX_GPA_RANGE_ALREADY_BLOCKED), \
+ TDX_BUILD_STATUS_CODE(TDX_TLB_TRACKING_NOT_DONE), \
+ TDX_BUILD_STATUS_CODE(TDX_EPT_INVALID_PROMOTE_CONDITIONS), \
+ TDX_BUILD_STATUS_CODE(TDX_PAGE_ALREADY_ACCEPTED), \
+ TDX_BUILD_STATUS_CODE(TDX_PAGE_SIZE_MISMATCH)
+
/*
* TDG.VP.VMCALL Status Codes (returned in R10)
*/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e0f4a46649d7..d11cf87674f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11970,6 +11970,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_update_request);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_tdx_seamcall_enter);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_tdx_seamcall_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
--
2.25.1

2021-07-02 22:09:57

by Isaku Yamahata

Subject: [RFC PATCH v2 65/69] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

From: Xiaoyao Li <[email protected]>

Introduce a per-VM field, initial_tsc_khz, to hold the default tsc_khz
used by kvm_arch_vcpu_create().

This field will be used by TDX, since the TSC frequency for a TD guest
is configured during the TD VM initialization phase.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a47e17892258..ae8b96e15e71 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1030,6 +1030,7 @@ struct kvm_arch {
u64 last_tsc_nsec;
u64 last_tsc_write;
u32 last_tsc_khz;
+ u32 initial_tsc_khz;
u64 cur_tsc_nsec;
u64 cur_tsc_write;
u64 cur_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a8299add443f..d3ebed784eac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10441,7 +10441,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
else
vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;

- kvm_set_tsc_khz(vcpu, max_tsc_khz);
+ kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.initial_tsc_khz);

r = kvm_mmu_create(vcpu);
if (r < 0)
@@ -10894,6 +10894,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
pvclock_update_vm_gtod_copy(kvm);

kvm->arch.guest_can_read_msr_platform_info = true;
+ kvm->arch.initial_tsc_khz = max_tsc_khz;

INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
--
2.25.1

2021-07-02 22:09:57

by Isaku Yamahata

Subject: [RFC PATCH v2 06/69] KVM: TDX: add a helper function for kvm to call seamcall

From: Isaku Yamahata <[email protected]>

Add a helper function for KVM to invoke SEAMCALL, and a helper macro to
check its return value. Later patches will use them.

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kernel/asm-offsets_64.c | 15 ++++++++
arch/x86/kvm/Makefile | 1 +
arch/x86/kvm/vmx/seamcall.S | 64 ++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/seamcall.h | 47 +++++++++++++++++++++++
4 files changed, 127 insertions(+)
create mode 100644 arch/x86/kvm/vmx/seamcall.S
create mode 100644 arch/x86/kvm/vmx/seamcall.h

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..c5908bcf3055 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -9,6 +9,11 @@
#include <asm/kvm_para.h>
#endif

+#ifdef CONFIG_KVM_INTEL_TDX
+#include <linux/kvm_types.h>
+#include "../kvm/vmx/tdx_arch.h"
+#endif
+
int main(void)
{
#ifdef CONFIG_PARAVIRT
@@ -25,6 +30,16 @@ int main(void)
BLANK();
#endif

+#ifdef CONFIG_KVM_INTEL_TDX
+ OFFSET(TDX_SEAM_rcx, tdx_ex_ret, rcx);
+ OFFSET(TDX_SEAM_rdx, tdx_ex_ret, rdx);
+ OFFSET(TDX_SEAM_r8, tdx_ex_ret, r8);
+ OFFSET(TDX_SEAM_r9, tdx_ex_ret, r9);
+ OFFSET(TDX_SEAM_r10, tdx_ex_ret, r10);
+ OFFSET(TDX_SEAM_r11, tdx_ex_ret, r11);
+ BLANK();
+#endif
+
#define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry)
ENTRY(bx);
ENTRY(cx);
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index c589db5d91b3..60f3e90fef8b 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -24,6 +24,7 @@ kvm-$(CONFIG_KVM_XEN) += xen.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
+kvm-intel-$(CONFIG_KVM_INTEL_TDX) += vmx/seamcall.o

kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o

diff --git a/arch/x86/kvm/vmx/seamcall.S b/arch/x86/kvm/vmx/seamcall.S
new file mode 100644
index 000000000000..08bb2b29deb7
--- /dev/null
+++ b/arch/x86/kvm/vmx/seamcall.S
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* ASM helper to call SEAMCALL for P-SEAMLDR, TDX module */
+
+#include <linux/linkage.h>
+
+#include <asm/alternative.h>
+#include <asm/asm-offsets.h>
+#include <asm/frame.h>
+#include <asm/asm.h>
+
+#include "seamcall.h"
+
+/*
+ * __seamcall - helper function to invoke SEAMCALL to request service
+ * of TDX module for KVM.
+ *
+ * @op (RDI) SEAMCALL leaf ID
+ * @rcx (RSI) input 1 (optional based on leaf ID)
+ * @rdx (RDX) input 2 (optional based on leaf ID)
+ * @r8 (RCX) input 3 (optional based on leaf ID)
+ * @r9 (R8) input 4 (optional based on leaf ID)
+ * @r10 (R9) input 5 (optional based on leaf ID)
+ * @ex (on stack) pointer to struct tdx_ex_ret; extra return values
+ * are stored there if non-NULL.
+ *
+ * @return RAX: completion code of P-SEAMLDR or TDX module
+ * 0 on success, non-0 on failure
+ * trapnumber on fault
+ */
+SYM_FUNC_START(__seamcall)
+ FRAME_BEGIN
+
+ /* shuffle registers from function call ABI to SEAMCALL ABI. */
+ movq %r9, %r10
+ movq %r8, %r9
+ movq %rcx, %r8
+ /* %rdx doesn't need shuffle. */
+ movq %rsi, %rcx
+ movq %rdi, %rax
+
+.Lseamcall:
+ seamcall
+ jmp .Lseamcall_ret
+.Lspurious_fault:
+ call kvm_spurious_fault
+.Lseamcall_ret:
+
+ movq (FRAME_OFFSET + 8)(%rsp), %rdi
+ testq %rdi, %rdi
+ jz 1f
+
+ /* If ex is non-NULL, store extra return values into it. */
+ movq %rcx, TDX_SEAM_rcx(%rdi)
+ movq %rdx, TDX_SEAM_rdx(%rdi)
+ movq %r8, TDX_SEAM_r8(%rdi)
+ movq %r9, TDX_SEAM_r9(%rdi)
+ movq %r10, TDX_SEAM_r10(%rdi)
+ movq %r11, TDX_SEAM_r11(%rdi)
+
+1:
+ FRAME_END
+ ret
+
+ _ASM_EXTABLE(.Lseamcall, .Lspurious_fault)
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
new file mode 100644
index 000000000000..a318940f62ed
--- /dev/null
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_VMX_SEAMCALL_H
+#define __KVM_VMX_SEAMCALL_H
+
+#ifdef __ASSEMBLY__
+
+#define seamcall .byte 0x66, 0x0f, 0x01, 0xcf
+
+#else
+
+#ifndef seamcall
+struct tdx_ex_ret;
+asmlinkage u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
+ struct tdx_ex_ret *ex);
+
+#define seamcall(op, rcx, rdx, r8, r9, r10, ex) \
+ __seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
+#endif
+
+static inline void __pr_seamcall_error(u64 op, const char *op_str,
+ u64 err, struct tdx_ex_ret *ex)
+{
+ pr_err_ratelimited("SEAMCALL[%s] failed on cpu %d: 0x%llx\n",
+ op_str, smp_processor_id(), (err));
+ if (ex)
+ pr_err_ratelimited(
+ "RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
+ (ex)->rcx, (ex)->rdx, (ex)->r8, (ex)->r9, (ex)->r10,
+ (ex)->r11);
+}
+
+#define pr_seamcall_error(op, err, ex) \
+ __pr_seamcall_error(SEAMCALL_##op, #op, (err), (ex))
+
+/* ex is a pointer to struct tdx_ex_ret or NULL. */
+#define TDX_ERR(err, op, ex) \
+({ \
+ u64 __ret_warn_on = WARN_ON_ONCE(err); \
+ \
+ if (unlikely(__ret_warn_on)) \
+ pr_seamcall_error(op, err, ex); \
+ __ret_warn_on; \
+})
+
+#endif
+
+#endif /* __KVM_VMX_SEAMCALL_H */
--
2.25.1

2021-07-02 22:10:07

by Isaku Yamahata

Subject: [RFC PATCH v2 67/69] KVM: TDX: add trace point for TDVMCALL and SEPT operation

From: Isaku Yamahata <[email protected]>

Signed-off-by: Yuan Yao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/trace.h | 58 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.c | 16 ++++++++++
arch/x86/kvm/vmx/tdx_arch.h | 9 ++++++
arch/x86/kvm/x86.c | 2 ++
4 files changed, 85 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index c3398d0de9a7..58631124f08d 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -739,6 +739,64 @@ TRACE_EVENT(kvm_tdx_seamcall_exit,
__entry->r9, __entry->r10, __entry->r11)
);

+/*
+ * Tracepoint for TDVMCALL from a TDX guest
+ */
+TRACE_EVENT(kvm_tdvmcall,
+ TP_PROTO(struct kvm_vcpu *vcpu, __u32 exit_reason,
+ __u64 p1, __u64 p2, __u64 p3, __u64 p4),
+ TP_ARGS(vcpu, exit_reason, p1, p2, p3, p4),
+
+ TP_STRUCT__entry(
+ __field( __u64, rip )
+ __field( __u32, exit_reason )
+ __field( __u64, p1 )
+ __field( __u64, p2 )
+ __field( __u64, p3 )
+ __field( __u64, p4 )
+ ),
+
+ TP_fast_assign(
+ __entry->rip = kvm_rip_read(vcpu);
+ __entry->exit_reason = exit_reason;
+ __entry->p1 = p1;
+ __entry->p2 = p2;
+ __entry->p3 = p3;
+ __entry->p4 = p4;
+ ),
+
+ TP_printk("rip: %llx reason: %s p1: %llx p2: %llx p3: %llx p4: %llx",
+ __entry->rip,
+ __print_symbolic(__entry->exit_reason,
+ TDG_VP_VMCALL_EXIT_REASONS),
+ __entry->p1, __entry->p2, __entry->p3, __entry->p4)
+);
+
+/*
+ * Tracepoint for SEPT related SEAMCALLs.
+ */
+TRACE_EVENT(kvm_sept_seamcall,
+ TP_PROTO(__u64 op, __u64 gpa, __u64 hpa, int level),
+ TP_ARGS(op, gpa, hpa, level),
+
+ TP_STRUCT__entry(
+ __field( __u64, op )
+ __field( __u64, gpa )
+ __field( __u64, hpa )
+ __field( int, level )
+ ),
+
+ TP_fast_assign(
+ __entry->op = op;
+ __entry->gpa = gpa;
+ __entry->hpa = hpa;
+ __entry->level = level;
+ ),
+
+ TP_printk("op: %llu gpa: 0x%llx hpa: 0x%llx level: %u",
+ __entry->op, __entry->gpa, __entry->hpa, __entry->level)
+);
+
/*
* Tracepoint for nested #vmexit because of interrupt pending
*/
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1aed4286ce0c..63130fb5a003 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -934,6 +934,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)

exit_reason = tdvmcall_exit_reason(vcpu);

+ trace_kvm_tdvmcall(vcpu, exit_reason,
+ tdvmcall_p1_read(vcpu), tdvmcall_p2_read(vcpu),
+ tdvmcall_p3_read(vcpu), tdvmcall_p4_read(vcpu));
+
switch (exit_reason) {
case EXIT_REASON_CPUID:
return tdx_emulate_cpuid(vcpu);
@@ -1011,11 +1015,15 @@ static void tdx_sept_set_private_spte(struct kvm_vcpu *vcpu, gfn_t gfn,

/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
if (is_td_finalized(kvm_tdx)) {
+ trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_PAGE_AUG, gpa, hpa, level);
+
err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &ex_ret);
SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_AUG, vcpu->kvm);
return;
}

+ trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_PAGE_ADD, gpa, hpa, level);
+
source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;

err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &ex_ret);
@@ -1039,6 +1047,8 @@ static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, int level,
return;

if (is_hkid_assigned(kvm_tdx)) {
+ trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_PAGE_REMOVE, gpa, hpa, level);
+
err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
if (SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_REMOVE, kvm))
return;
@@ -1063,6 +1073,8 @@ static int tdx_sept_link_private_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
struct tdx_ex_ret ex_ret;
u64 err;

+ trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_SEPT_ADD, gpa, hpa, level);
+
err = tdh_mem_spet_add(kvm_tdx->tdr.pa, gpa, level, hpa, &ex_ret);
if (SEPT_ERR(err, &ex_ret, TDH_MEM_SEPT_ADD, vcpu->kvm))
return -EIO;
@@ -1077,6 +1089,8 @@ static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, int level)
struct tdx_ex_ret ex_ret;
u64 err;

+ trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_RANGE_BLOCK, gpa, -1ull, level);
+
err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
SEPT_ERR(err, &ex_ret, TDH_MEM_RANGE_BLOCK, kvm);
}
@@ -1088,6 +1102,8 @@ static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn, int level)
struct tdx_ex_ret ex_ret;
u64 err;

+ trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_RANGE_UNBLOCK, gpa, -1ull, level);
+
err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
SEPT_ERR(err, &ex_ret, TDH_MEM_RANGE_UNBLOCK, kvm);
}
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 7258825b1e02..414b933a3b03 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -104,6 +104,15 @@
#define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004

+#define TDG_VP_VMCALL_EXIT_REASONS \
+ { TDG_VP_VMCALL_GET_TD_VM_CALL_INFO, \
+ "GET_TD_VM_CALL_INFO" }, \
+ { TDG_VP_VMCALL_MAP_GPA, "MAP_GPA" }, \
+ { TDG_VP_VMCALL_GET_QUOTE, "GET_QUOTE" }, \
+ { TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, \
+ "SETUP_EVENT_NOTIFY_INTERRUPT" }, \
+ VMX_EXIT_REASONS
+
/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
#define TDX_CLASS_SHIFT 56
#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ba69abcc663a..ad619c1b2a88 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12104,6 +12104,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_tdvmcall);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_sept_seamcall);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmenter_failed);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
--
2.25.1

2021-07-02 22:10:05

by Isaku Yamahata

Subject: [RFC PATCH v2 66/69] KVM: TDX: Add "basic" support for building and running Trust Domains

From: Sean Christopherson <[email protected]>

Add what is effectively a TDX-specific ioctl for initializing the guest
Trust Domain. Implement the functionality as a subcommand of
KVM_MEMORY_ENCRYPT_OP, analogous to how the ioctl is used by SVM to
manage SEV guests.

For easy compatibility with future versions of TDX-SEAM, add a
KVM-defined struct, kvm_tdx_capabilities, to track requirements and
capabilities for the overall system, and define a global instance to
serve as the canonical reference.

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Chao Gao <[email protected]>
Signed-off-by: Chao Gao <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Yuan Yao <[email protected]>
Signed-off-by: Yuan Yao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 56 +
arch/x86/include/uapi/asm/vmx.h | 3 +-
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/mmu/spte.c | 5 +-
arch/x86/kvm/vmx/common.h | 1 +
arch/x86/kvm/vmx/main.c | 375 ++++-
arch/x86/kvm/vmx/posted_intr.c | 6 +
arch/x86/kvm/vmx/tdx.c | 1942 +++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 97 ++
arch/x86/kvm/vmx/tdx_arch.h | 7 +
arch/x86/kvm/vmx/tdx_ops.h | 13 +
arch/x86/kvm/vmx/tdx_stubs.c | 45 +
arch/x86/kvm/vmx/vmenter.S | 146 ++
arch/x86/kvm/x86.c | 8 +-
tools/arch/x86/include/uapi/asm/kvm.h | 51 +
16 files changed, 2736 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx_stubs.c

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 8341ec720b3f..18950b226d00 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -494,4 +494,60 @@ struct kvm_pmu_event_filter {
#define KVM_X86_SEV_ES_VM 1
#define KVM_X86_TDX_VM 2

+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ __u32 id;
+ __u32 metadata;
+ __u64 data;
+};
+
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ __u32 padding;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
+struct kvm_tdx_init_vm {
+ __u32 max_vcpus;
+ __u32 tsc_khz;
+ __u64 attributes;
+ __u64 cpuid;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ __u64 reserved[43]; /* must be zero for future extensibility */
+};
+
+#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index ba5908dfc7c0..79843a0143d2 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -32,8 +32,9 @@
#define EXIT_REASON_EXCEPTION_NMI 0
#define EXIT_REASON_EXTERNAL_INTERRUPT 1
#define EXIT_REASON_TRIPLE_FAULT 2
-#define EXIT_REASON_INIT_SIGNAL 3
+#define EXIT_REASON_INIT_SIGNAL 3
#define EXIT_REASON_SIPI_SIGNAL 4
+#define EXIT_REASON_OTHER_SMI 6

#define EXIT_REASON_INTERRUPT_WINDOW 7
#define EXIT_REASON_NMI_WINDOW 8
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 8bf1c6dbac78..764283e0979d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -60,7 +60,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
}

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value);
void kvm_mmu_set_spte_init_value(u64 init_value);

void
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4ee6d7803f18..2f8486f79ddb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5163,6 +5163,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
out:
return r;
}
+EXPORT_SYMBOL_GPL(kvm_mmu_load);

static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, u32 roots_to_free)
{
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0b931f1c2210..076b489d56d5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -302,14 +302,15 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value)
{
shadow_user_mask = VMX_EPT_READABLE_MASK;
shadow_accessed_mask = has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
- shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | init_value;
shadow_acc_track_mask = VMX_EPT_RWX_MASK;
shadow_me_mask = 0ull;

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 817ff3e74933..a99fe68ad687 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -9,6 +9,7 @@
#include <asm/vmx.h>

#include "mmu.h"
+#include "tdx.h"
#include "vmcs.h"
#include "vmx.h"
#include "x86.h"
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8e03cb72b910..8d6bfe09d89f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -3,8 +3,21 @@

static struct kvm_x86_ops vt_x86_ops __initdata;

+#ifdef CONFIG_KVM_INTEL_TDX
+static bool __read_mostly enable_tdx = 1;
+module_param_named(tdx, enable_tdx, bool, 0444);
+#else
+#define enable_tdx 0
+#endif
+
#include "vmx.c"

+#ifdef CONFIG_KVM_INTEL_TDX
+#include "tdx.c"
+#else
+#include "tdx_stubs.c"
+#endif
+
static int __init vt_cpu_has_kvm_support(void)
{
return cpu_has_vmx();
@@ -23,6 +36,16 @@ static int __init vt_check_processor_compatibility(void)
if (ret)
return ret;

+ if (enable_tdx) {
+ /*
+ * Reject the entire module load if the per-cpu check fails, it
+ * likely indicates a hardware or system configuration issue.
+ */
+ ret = tdx_check_processor_compatibility();
+ if (ret)
+ return ret;
+ }
+
return 0;
}

@@ -34,20 +57,40 @@ static __init int vt_hardware_setup(void)
if (ret)
return ret;

- if (enable_ept)
+#ifdef CONFIG_KVM_INTEL_TDX
+ if (enable_tdx && tdx_hardware_setup(&vt_x86_ops))
+ enable_tdx = false;
+#endif
+
+ if (enable_ept) {
+ const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
kvm_mmu_set_ept_masks(enable_ept_ad_bits,
- cpu_has_vmx_ept_execute_only());
+ cpu_has_vmx_ept_execute_only(), init_value);
+ kvm_mmu_set_spte_init_value(init_value);
+ }

return 0;
}

static int vt_hardware_enable(void)
{
- return hardware_enable();
+ int ret;
+
+ ret = hardware_enable();
+ if (ret)
+ return ret;
+
+ if (enable_tdx)
+ tdx_hardware_enable();
+ return 0;
}

static void vt_hardware_disable(void)
{
+ /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+ if (enable_tdx)
+ tdx_hardware_disable();
+
hardware_disable();
}

@@ -58,62 +101,92 @@ static bool vt_cpu_has_accelerated_tpr(void)

static bool vt_is_vm_type_supported(unsigned long type)
{
- return type == KVM_X86_LEGACY_VM;
+ return type == KVM_X86_LEGACY_VM ||
+ (type == KVM_X86_TDX_VM && enable_tdx);
}

static int vt_vm_init(struct kvm *kvm)
{
+ if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+ return tdx_vm_init(kvm);
+
return vmx_vm_init(kvm);
}

static void vt_vm_teardown(struct kvm *kvm)
{
-
+ if (is_td(kvm))
+ return tdx_vm_teardown(kvm);
}

static void vt_vm_destroy(struct kvm *kvm)
{
-
+ if (is_td(kvm))
+ return tdx_vm_destroy(kvm);
}

static int vt_vcpu_create(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_create(vcpu);
+
return vmx_create_vcpu(vcpu);
}

static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_run(vcpu);
+
return vmx_vcpu_run(vcpu);
}

static void vt_vcpu_free(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_free(vcpu);
+
return vmx_free_vcpu(vcpu);
}

static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_reset(vcpu, init_event);
+
return vmx_vcpu_reset(vcpu, init_event);
}

static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_load(vcpu, cpu);
+
return vmx_vcpu_load(vcpu, cpu);
}

static void vt_vcpu_put(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_put(vcpu);
+
return vmx_vcpu_put(vcpu);
}

static int vt_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit(vcpu, fastpath);
+
return vmx_handle_exit(vcpu, fastpath);
}

static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit_irqoff(vcpu);
+
vmx_handle_exit_irqoff(vcpu);
}

@@ -129,21 +202,33 @@ static void vt_update_emulated_instruction(struct kvm_vcpu *vcpu)

static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_set_msr(vcpu, msr_info);
+
return vmx_set_msr(vcpu, msr_info);
}

static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
+ if (is_td_vcpu(vcpu))
+ return false;
+
return vmx_smi_allowed(vcpu, for_injection);
}

static int vt_pre_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
return vmx_pre_enter_smm(vcpu, smstate);
}

static int vt_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return 0;
+
return vmx_pre_leave_smm(vcpu, smstate);
}

@@ -156,6 +241,9 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn,
int insn_len)
{
+ if (is_td_vcpu(vcpu))
+ return false;
+
return vmx_can_emulate_instruction(vcpu, insn, insn_len);
}

@@ -164,11 +252,17 @@ static int vt_check_intercept(struct kvm_vcpu *vcpu,
enum x86_intercept_stage stage,
struct x86_exception *exception)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return X86EMUL_UNHANDLEABLE;
+
return vmx_check_intercept(vcpu, info, stage, exception);
}

static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return true;
+
return vmx_apic_init_signal_blocked(vcpu);
}

@@ -177,13 +271,43 @@ static void vt_migrate_timers(struct kvm_vcpu *vcpu)
vmx_migrate_timers(vcpu);
}

+static int vt_mem_enc_op_dev(void __user *argp)
+{
+ if (!enable_tdx)
+ return -EINVAL;
+
+ return tdx_dev_ioctl(argp);
+}
+
+static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
+{
+ if (!is_td(kvm))
+ return -ENOTTY;
+
+ return tdx_vm_ioctl(kvm, argp);
+}
+
+static int vt_mem_enc_op_vcpu(struct kvm_vcpu *vcpu, void __user *argp)
+{
+ if (!is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return tdx_vcpu_ioctl(vcpu, argp);
+}
+
static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_set_virtual_apic_mode(vcpu);
+
return vmx_set_virtual_apic_mode(vcpu);
}

static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_apicv_post_state_restore(vcpu);
+
return vmx_apicv_post_state_restore(vcpu);
}

@@ -194,31 +318,49 @@ static bool vt_check_apicv_inhibit_reasons(ulong bit)

static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
return vmx_hwapic_irr_update(vcpu, max_irr);
}

static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
return vmx_hwapic_isr_update(vcpu, max_isr);
}

static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return false;
+
return vmx_guest_apic_has_interrupt(vcpu);
}

static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return -1;
+
return vmx_sync_pir_to_irr(vcpu);
}

static int vt_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_deliver_posted_interrupt(vcpu, vector);
+
return vmx_deliver_posted_interrupt(vcpu, vector);
}

static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
return vmx_vcpu_after_set_cpuid(vcpu);
}

@@ -228,6 +370,9 @@ static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
*/
static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
{
+ if (kvm && is_td(kvm))
+ return tdx_is_emulated_msr(index, true);
+
return vmx_has_emulated_msr(kvm, index);
}

@@ -238,11 +383,23 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)

static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
+ /*
+ * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+ * guest state of a TD is obviously off limits. Deferring MSRs and DRs
+ * is pointless because TDX-SEAM needs to load *something* so as not to
+ * expose guest state.
+ */
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_prepare_switch_to_guest(vcpu);
}

static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_update_exception_bitmap(vcpu);
+
vmx_update_exception_bitmap(vcpu);
}

@@ -253,49 +410,76 @@ static int vt_get_msr_feature(struct kvm_msr_entry *msr)

static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_get_msr(vcpu, msr_info);
+
return vmx_get_msr(vcpu, msr_info);
}

static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_segment_base(vcpu, seg);
+
return vmx_get_segment_base(vcpu, seg);
}

static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
int seg)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_segment(vcpu, var, seg);
+
vmx_get_segment(vcpu, var, seg);
}

static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
int seg)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_set_segment(vcpu, var, seg);
}

static int vt_get_cpl(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_cpl(vcpu);
+
return vmx_get_cpl(vcpu);
}

static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu) && !is_debug_td(vcpu), vcpu->kvm))
+ return;
+
vmx_get_cs_db_l_bits(vcpu, db, l);
}

static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_set_cr0(vcpu, cr0);
}

static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}

static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_set_cr4(vcpu, cr4);
}

@@ -306,6 +490,9 @@ static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)

static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return -EIO;
+
return vmx_set_efer(vcpu, efer);
}

@@ -317,6 +504,9 @@ static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)

static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_set_idt(vcpu, dt);
}

@@ -328,16 +518,30 @@ static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)

static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_set_gdt(vcpu, dt);
}

static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_set_dr7(vcpu, val);
+
vmx_set_dr7(vcpu, val);
}

static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
{
+ /*
+ * MOV-DR exiting is always cleared for TD guests, even in debug mode.
+ * Thus KVM_DEBUGREG_WONT_EXIT can never be set and this path should
+ * never be reached for a TD vCPU.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_sync_dirty_debug_regs(vcpu);
}

@@ -349,31 +553,41 @@ static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)

switch (reg) {
case VCPU_REGS_RSP:
- vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+ vcpu->arch.regs[VCPU_REGS_RSP] = vmreadl(vcpu, GUEST_RSP);
break;
case VCPU_REGS_RIP:
- vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
+#ifdef CONFIG_KVM_INTEL_TDX
+ /*
+ * RIP can be read by tracepoints; stuff a bogus value and
+ * avoid a WARN/error.
+ */
+ if (unlikely(is_td_vcpu(vcpu) && !is_debug_td(vcpu))) {
+ vcpu->arch.regs[VCPU_REGS_RIP] = 0xdeadul << 48;
+ break;
+ }
+#endif
+ vcpu->arch.regs[VCPU_REGS_RIP] = vmreadl(vcpu, GUEST_RIP);
break;
case VCPU_EXREG_PDPTR:
- if (enable_ept)
+ if (enable_ept && !KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
ept_save_pdptrs(vcpu);
break;
case VCPU_EXREG_CR0:
guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;

vcpu->arch.cr0 &= ~guest_owned_bits;
- vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
+ vcpu->arch.cr0 |= vmreadl(vcpu, GUEST_CR0) & guest_owned_bits;
break;
case VCPU_EXREG_CR3:
if (is_unrestricted_guest(vcpu) ||
(enable_ept && is_paging(vcpu)))
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+ vcpu->arch.cr3 = vmreadl(vcpu, GUEST_CR3);
break;
case VCPU_EXREG_CR4:
guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;

vcpu->arch.cr4 &= ~guest_owned_bits;
- vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
+ vcpu->arch.cr4 |= vmreadl(vcpu, GUEST_CR4) & guest_owned_bits;
break;
default:
KVM_BUG_ON(1, vcpu->kvm);
@@ -383,159 +597,266 @@ static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)

static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_rflags(vcpu);
+
return vmx_get_rflags(vcpu);
}

static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_set_rflags(vcpu, rflags);
+
vmx_set_rflags(vcpu, rflags);
}

static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_flush_tlb(vcpu);
+
vmx_flush_tlb_all(vcpu);
}

static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_flush_tlb(vcpu);
+
vmx_flush_tlb_current(vcpu);
}

static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_flush_tlb_gva(vcpu, addr);
}

static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_flush_tlb_guest(vcpu);
}

static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_set_interrupt_shadow(vcpu, mask);
}

static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
{
- return vmx_get_interrupt_shadow(vcpu);
+ return __vmx_get_interrupt_shadow(vcpu);
}

static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
unsigned char *hypercall)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_patch_hypercall(vcpu, hypercall);
}

static void vt_inject_irq(struct kvm_vcpu *vcpu)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_inject_irq(vcpu);
}

static void vt_inject_nmi(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_inject_nmi(vcpu);
+
vmx_inject_nmi(vcpu);
}

static void vt_queue_exception(struct kvm_vcpu *vcpu)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu) && !is_debug_td(vcpu), vcpu->kvm))
+ return;
+
vmx_queue_exception(vcpu);
}

static void vt_cancel_injection(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_cancel_injection(vcpu);
}

static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
+ if (is_td_vcpu(vcpu))
+ return true;
+
return vmx_interrupt_allowed(vcpu, for_injection);
}

static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
+ /*
+ * TDX-SEAM manages NMI windows and NMI reinjection, and hides NMI
+ * blocking; all KVM can do is throw an NMI over the wall.
+ */
+ if (is_td_vcpu(vcpu))
+ return true;
+
return vmx_nmi_allowed(vcpu, for_injection);
}

static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
{
+ /*
+ * Assume NMIs are always unmasked. KVM could query PEND_NMI and treat
+ * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
+ * expensive and the end result is unchanged as the only relevant usage
+ * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
+ * only changes whether KVM or TDX-SEAM drops an NMI.
+ */
+ if (is_td_vcpu(vcpu))
+ return false;
+
return vmx_get_nmi_mask(vcpu);
}

static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_set_nmi_mask(vcpu, masked);
}

static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
{
+ /* TDX-SEAM handles NMI windows; KVM always reports NMIs as unblocked. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_enable_nmi_window(vcpu);
}

static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_enable_irq_window(vcpu);
}

static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_update_cr8_intercept(vcpu, tpr, irr);
}

static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
vmx_set_apic_access_page_addr(vcpu);
}

static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
vmx_refresh_apicv_exec_ctrl(vcpu);
}

static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
}

static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
+ /* TODO: Reject this and update Qemu, or eat it? */
+ if (is_td(kvm))
+ return 0;
+
return vmx_set_tss_addr(kvm, addr);
}

static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
{
+ /* TODO: Reject this and update Qemu, or eat it? */
+ if (is_td(kvm))
+ return 0;
+
return vmx_set_identity_map_addr(kvm, ident_addr);
}

static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
+ if (is_td_vcpu(vcpu)) {
+ if (is_mmio)
+ return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+ }
+
return vmx_get_mt_mask(vcpu, gfn, is_mmio);
}

static void vt_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2,
u32 *intr_info, u32 *error_code)
{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_exit_info(vcpu, info1, info2, intr_info,
+ error_code);

return vmx_get_exit_info(vcpu, info1, info2, intr_info, error_code);
}

static u64 vt_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
return vmx_write_l1_tsc_offset(vcpu, offset);
}

static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return __kvm_request_immediate_exit(vcpu);
+
vmx_request_immediate_exit(vcpu);
}

static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_sched_in(vcpu, cpu);
}

static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_update_cpu_dirty_logging(vcpu);
}

@@ -544,12 +865,16 @@ static int vt_pre_block(struct kvm_vcpu *vcpu)
if (pi_pre_block(vcpu))
return 1;

+ if (is_td_vcpu(vcpu))
+ return 0;
+
return vmx_pre_block(vcpu);
}

static void vt_post_block(struct kvm_vcpu *vcpu)
{
- vmx_post_block(vcpu);
+ if (!is_td_vcpu(vcpu))
+ vmx_post_block(vcpu);

pi_post_block(vcpu);
}
@@ -559,17 +884,26 @@ static void vt_post_block(struct kvm_vcpu *vcpu)
static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired)
{
+ if (is_td_vcpu(vcpu))
+ return -EINVAL;
+
return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
}

static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
vmx_cancel_hv_timer(vcpu);
}
#endif

static void vt_setup_mce(struct kvm_vcpu *vcpu)
{
+ if (is_td_vcpu(vcpu))
+ return;
+
vmx_setup_mce(vcpu);
}

@@ -706,6 +1040,10 @@ static struct kvm_x86_ops vt_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .mem_enc_op_dev = vt_mem_enc_op_dev,
+ .mem_enc_op = vt_mem_enc_op,
+ .mem_enc_op_vcpu = vt_mem_enc_op_vcpu,
};

static struct kvm_x86_init_ops vt_init_ops __initdata = {
@@ -722,6 +1060,9 @@ static int __init vt_init(void)
unsigned int vcpu_size = 0, vcpu_align = 0;
int r;

+ /* tdx_pre_kvm_init must be called before vmx_pre_kvm_init(). */
+ tdx_pre_kvm_init(&vcpu_size, &vcpu_align, &vt_x86_ops.vm_size);
+
vmx_pre_kvm_init(&vcpu_size, &vcpu_align);

r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
@@ -732,8 +1073,14 @@ static int __init vt_init(void)
if (r)
goto err_kvm_exit;

+ r = tdx_init();
+ if (r)
+ goto err_vmx_exit;
+
return 0;

+err_vmx_exit:
+ vmx_exit();
err_kvm_exit:
kvm_exit();
err_vmx_post_exit:
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..28d1c252c2e8 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -6,6 +6,7 @@

#include "lapic.h"
#include "posted_intr.h"
+#include "tdx.h"
#include "trace.h"
#include "vmx.h"

@@ -18,6 +19,11 @@ static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);

static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
+#ifdef CONFIG_KVM_INTEL_TDX
+ if (is_td_vcpu(vcpu))
+ return &(to_tdx(vcpu)->pi_desc);
+#endif
+
return &(to_vmx(vcpu)->pi_desc);
}

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
new file mode 100644
index 000000000000..1aed4286ce0c
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -0,0 +1,1942 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+#include <linux/kvm_host.h>
+#include <linux/jump_label.h>
+#include <linux/trace_events.h>
+#include <linux/pagemap.h>
+
+#include <asm/kvm_boot.h>
+#include <asm/virtext.h>
+
+#include "common.h"
+#include "cpuid.h"
+#include "lapic.h"
+#include "tdx.h"
+#include "tdx_errno.h"
+#include "tdx_ops.h"
+
+#include <trace/events/kvm.h>
+#include "trace.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+/* Capabilities of KVM + TDX-SEAM. */
+struct tdx_capabilities tdx_caps;
+
+static struct mutex *tdx_phymem_cache_wb_lock;
+static struct mutex *tdx_mng_key_config_lock;
+
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the approapriate TD vCPUS.
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
+static u64 hkid_mask __ro_after_init;
+static u8 hkid_start_pos __ro_after_init;
+
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+ pa &= ~hkid_mask;
+ pa |= (u64)hkid << hkid_start_pos;
+
+ return pa;
+}
+
+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rcx_read(vcpu);
+}
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rdx_read(vcpu);
+}
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+ return kvm_r8_read(vcpu);
+}
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+ return kvm_r9_read(vcpu);
+}
+
+#define BUILD_TDVMCALL_ACCESSORS(param, gpr) \
+static __always_inline \
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu) \
+{ \
+ return kvm_##gpr##_read(vcpu); \
+} \
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
+ unsigned long val) \
+{ \
+ kvm_##gpr##_write(vcpu, val); \
+}
+BUILD_TDVMCALL_ACCESSORS(p1, r12);
+BUILD_TDVMCALL_ACCESSORS(p2, r13);
+BUILD_TDVMCALL_ACCESSORS(p3, r14);
+BUILD_TDVMCALL_ACCESSORS(p4, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+ return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_exit_reason(struct kvm_vcpu *vcpu)
+{
+ return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+ long val)
+{
+ kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+ unsigned long val)
+{
+ kvm_r11_write(vcpu, val);
+}
+
+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+ return tdx->tdvpr.added;
+}
+
+static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->tdr.added;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->hkid >= 0;
+}
+
+static inline bool is_td_initialized(struct kvm *kvm)
+{
+ return !!kvm->max_vcpus;
+}
+
+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->finalized;
+}
+
+static void tdx_clear_page(unsigned long page)
+{
+ const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+ unsigned long i;
+
+ /* Zeroing the page is only necessary for systems with MKTME-i. */
+ if (!static_cpu_has(X86_FEATURE_MOVDIR64B))
+ return;
+
+ for (i = 0; i < 4096; i += 64)
+ /* MOVDIR64B [rdx], es:rdi */
+ asm (".byte 0x66, 0x0f, 0x38, 0xf8, 0x3a"
+ : : "d" (zero_page), "D" (page + i) : "memory");
+}
+
+static int __tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
+{
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+
+ err = tdh_phymem_page_reclaim(pa, &ex_ret);
+ if (TDX_ERR(err, TDH_PHYMEM_PAGE_RECLAIM, &ex_ret))
+ return -EIO;
+
+ if (do_wb) {
+ err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
+ if (TDX_ERR(err, TDH_PHYMEM_PAGE_WBINVD, NULL))
+ return -EIO;
+ }
+
+ tdx_clear_page(va);
+ return 0;
+}
+
+static int tdx_reclaim_page(unsigned long va, hpa_t pa)
+{
+ return __tdx_reclaim_page(va, pa, false, 0);
+}
+
+static int tdx_alloc_td_page(struct tdx_td_page *page)
+{
+ page->va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!page->va)
+ return -ENOMEM;
+
+ page->pa = __pa(page->va);
+ return 0;
+}
+
+static void tdx_add_td_page(struct tdx_td_page *page)
+{
+ WARN_ON_ONCE(page->added);
+ page->added = true;
+}
+
+static void tdx_reclaim_td_page(struct tdx_td_page *page)
+{
+ if (page->added) {
+ if (tdx_reclaim_page(page->va, page->pa))
+ return;
+
+ page->added = false;
+ }
+ free_page(page->va);
+}
+
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+ list_del(&to_tdx(vcpu)->cpu_list);
+
+ /*
+ * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
+ * otherwise a different CPU can see vcpu->cpu = -1 and add the vCPU
+ * to its list before it's deleted from this CPU's list.
+ */
+ smp_wmb();
+
+ vcpu->cpu = -1;
+}
+
+static void tdx_flush_vp(void *arg)
+{
+ struct kvm_vcpu *vcpu = arg;
+ u64 err;
+
+ /* Task migration can race with CPU offlining. */
+ if (vcpu->cpu != raw_smp_processor_id())
+ return;
+
+ err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
+ if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED))
+ TDX_ERR(err, TDH_VP_FLUSH, NULL);
+
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->cpu == -1)
+ return;
+
+ /*
+ * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
+ * list tracking still needs to be updated so that it's correct if/when
+ * the vCPU does get initialized.
+ */
+ if (is_td_vcpu_created(to_tdx(vcpu)))
+ smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
+ else
+ tdx_disassociate_vp(vcpu);
+}
+
+static int tdx_do_tdh_phymem_cache_wb(void *param)
+{
+ int cpu, cur_pkg;
+ u64 err = 0;
+
+ cpu = raw_smp_processor_id();
+ cur_pkg = topology_physical_package_id(cpu);
+
+ mutex_lock(&tdx_phymem_cache_wb_lock[cur_pkg]);
+ do {
+ err = tdh_phymem_cache_wb(!!err);
+ } while (err == TDX_INTERRUPTED_RESUMABLE);
+ mutex_unlock(&tdx_phymem_cache_wb_lock[cur_pkg]);
+
+ if (TDX_ERR(err, TDH_PHYMEM_CACHE_WB, NULL))
+ return -EIO;
+
+ return 0;
+}
+
+static void tdx_vm_teardown(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_vcpu *vcpu;
+ u64 err;
+ int i;
+
+ if (!is_hkid_assigned(kvm_tdx))
+ return;
+
+ if (!is_td_created(kvm_tdx))
+ goto free_hkid;
+
+ err = tdh_mng_key_reclaimid(kvm_tdx->tdr.pa);
+ if (TDX_ERR(err, TDH_MNG_KEY_RECLAIMID, NULL))
+ return;
+
+ kvm_for_each_vcpu(i, vcpu, (&kvm_tdx->kvm))
+ tdx_flush_vp_on_cpu(vcpu);
+
+ err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
+ if (TDX_ERR(err, TDH_MNG_VPFLUSHDONE, NULL))
+ return;
+
+ err = tdx_seamcall_on_each_pkg(tdx_do_tdh_phymem_cache_wb, NULL);
+
+ if (unlikely(err))
+ return;
+
+ err = tdh_mng_key_freeid(kvm_tdx->tdr.pa);
+ if (TDX_ERR(err, TDH_MNG_KEY_FREEID, NULL))
+ return;
+
+free_hkid:
+ tdx_keyid_free(kvm_tdx->hkid);
+ kvm_tdx->hkid = -1;
+}
+
+static void tdx_vm_destroy(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ int i;
+
+ /* Can't reclaim or free TD pages if teardown failed. */
+ if (is_hkid_assigned(kvm_tdx))
+ return;
+
+ kvm_mmu_zap_all_private(kvm);
+
+ for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
+ tdx_reclaim_td_page(&kvm_tdx->tdcs[i]);
+
+ if (kvm_tdx->tdr.added &&
+ __tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, true, tdx_seam_keyid))
+ return;
+
+ free_page(kvm_tdx->tdr.va);
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+ hpa_t *tdr_p = param;
+ int cpu, cur_pkg;
+ u64 err;
+
+ cpu = raw_smp_processor_id();
+ cur_pkg = topology_physical_package_id(cpu);
+
+ mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
+ do {
+ err = tdh_mng_key_config(*tdr_p);
+ } while (err == TDX_KEY_GENERATION_FAILED);
+ mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
+
+ if (TDX_ERR(err, TDH_MNG_KEY_CONFIG, NULL))
+ return -EIO;
+
+ return 0;
+}
+
+static int tdx_vm_init(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ int ret, i;
+ u64 err;
+
+ kvm->dirty_log_unsupported = true;
+ kvm->readonly_mem_unsupported = true;
+
+ kvm->arch.tsc_immutable = true;
+ kvm->arch.eoi_intercept_unsupported = true;
+ kvm->arch.smm_unsupported = true;
+ kvm->arch.init_sipi_unsupported = true;
+ kvm->arch.irq_injection_disallowed = true;
+ kvm->arch.mce_injection_disallowed = true;
+
+ /* TODO: Enable 2mb and 1gb large page support. */
+ kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+
+ kvm_apicv_init(kvm, true);
+
+ /* vCPUs can't be created until after KVM_TDX_INIT_VM. */
+ kvm->max_vcpus = 0;
+
+ kvm_tdx->hkid = tdx_keyid_alloc();
+ if (kvm_tdx->hkid < 0)
+ return -EBUSY;
+ if (WARN_ON_ONCE(kvm_tdx->hkid >> 16)) {
+ ret = -EIO;
+ goto free_hkid;
+ }
+
+ ret = tdx_alloc_td_page(&kvm_tdx->tdr);
+ if (ret)
+ goto free_hkid;
+
+ for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+ ret = tdx_alloc_td_page(&kvm_tdx->tdcs[i]);
+ if (ret)
+ goto free_tdcs;
+ }
+
+ ret = -EIO;
+ err = tdh_mng_create(kvm_tdx->tdr.pa, kvm_tdx->hkid);
+ if (TDX_ERR(err, TDH_MNG_CREATE, NULL))
+ goto free_tdcs;
+ tdx_add_td_page(&kvm_tdx->tdr);
+
+ ret = tdx_seamcall_on_each_pkg(tdx_do_tdh_mng_key_config, &kvm_tdx->tdr.pa);
+ if (ret)
+ goto teardown;
+
+ for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+ err = tdh_mng_addcx(kvm_tdx->tdr.pa, kvm_tdx->tdcs[i].pa);
+ if (TDX_ERR(err, TDH_MNG_ADDCX, NULL))
+ goto teardown;
+ tdx_add_td_page(&kvm_tdx->tdcs[i]);
+ }
+
+ /*
+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a
+ * dedicated ioctl() to configure the CPUID values for the TD.
+ */
+ return 0;
+
+ /*
+ * The sequence for freeing resources from a partially initialized TD
+ * varies based on where in the initialization flow failure occurred.
+ * Simply use the full teardown and destroy paths, which naturally
+ * handle partial initialization.
+ */
+teardown:
+ tdx_vm_teardown(kvm);
+ tdx_vm_destroy(kvm);
+ return ret;
+
+free_tdcs:
+ /* @i points at the TDCS page that failed allocation. */
+ for (--i; i >= 0; i--)
+ free_page(kvm_tdx->tdcs[i].va);
+
+ free_page(kvm_tdx->tdr.va);
+free_hkid:
+ tdx_keyid_free(kvm_tdx->hkid);
+ return ret;
+}
+
+static int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int cpu, ret, i;
+
+ ret = tdx_alloc_td_page(&tdx->tdvpr);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+ ret = tdx_alloc_td_page(&tdx->tdvpx[i]);
+ if (ret)
+ goto free_tdvpx;
+ }
+
+ vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+ vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH_GUEST;
+ vcpu->arch.cr0_guest_owned_bits = -1ul;
+ vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+ vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
+ vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+ vcpu->arch.guest_state_protected =
+ !(to_kvm_tdx(vcpu->kvm)->attributes & TDX1_TD_ATTRIBUTE_DEBUG);
+ vcpu->arch.root_mmu.no_prefetch = true;
+
+ tdx->pi_desc.nv = POSTED_INTR_VECTOR;
+ tdx->pi_desc.sn = 1;
+
+ cpu = get_cpu();
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ vcpu->cpu = cpu;
+ put_cpu();
+
+ return 0;
+
+free_tdvpx:
+ /* @i points at the TDVPX page that failed allocation. */
+ for (--i; i >= 0; i--)
+ free_page(tdx->tdvpx[i].va);
+
+ free_page(tdx->tdvpr.va);
+
+ return ret;
+}
+
+static void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->cpu != cpu) {
+ tdx_flush_vp_on_cpu(vcpu);
+
+ /*
+ * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+ * vcpu->cpu is read before tdx->cpu_list.
+ */
+ smp_rmb();
+
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ }
+
+ vmx_vcpu_pi_load(vcpu, cpu);
+}
+
+static void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ vmx_vcpu_pi_put(vcpu);
+}
+
+static void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int i;
+
+ /* Can't reclaim or free pages if teardown failed. */
+ if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
+ return;
+
+ for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
+ tdx_reclaim_td_page(&tdx->tdvpx[i]);
+
+ tdx_reclaim_td_page(&tdx->tdvpr);
+}
+
+static void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct msr_data apic_base_msr;
+ u64 err;
+ int i;
+
+ if (WARN_ON(init_event) || !vcpu->arch.apic)
+ goto td_bugged;
+
+ err = tdh_vp_create(kvm_tdx->tdr.pa, tdx->tdvpr.pa);
+ if (TDX_ERR(err, TDH_VP_CREATE, NULL))
+ goto td_bugged;
+ tdx_add_td_page(&tdx->tdvpr);
+
+ for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+ err = tdh_vp_addcx(tdx->tdvpr.pa, tdx->tdvpx[i].pa);
+ if (TDX_ERR(err, TDH_VP_ADDCX, NULL))
+ goto td_bugged;
+ tdx_add_td_page(&tdx->tdvpx[i]);
+ }
+
+ apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
+ if (kvm_vcpu_is_reset_bsp(vcpu))
+ apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
+ apic_base_msr.host_initiated = true;
+ if (WARN_ON(kvm_set_apic_base(vcpu, &apic_base_msr)))
+ goto td_bugged;
+
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+ return;
+
+td_bugged:
+ vcpu->kvm->vm_bugged = true;
+}
+
+static void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+}
+
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+ if (static_cpu_has(X86_FEATURE_XSAVE) &&
+ host_xcr0 != (kvm_tdx->xfam & supported_xcr0))
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+ if (static_cpu_has(X86_FEATURE_XSAVES) &&
+ host_xss != (kvm_tdx->xfam & supported_xss))
+ wrmsrl(MSR_IA32_XSS, host_xss);
+ if (static_cpu_has(X86_FEATURE_PKU) &&
+ (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+ __write_pkru(vcpu->arch.host_pkru);
+}
+
+u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
+
+static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+ struct vcpu_tdx *tdx)
+{
+ kvm_guest_enter_irqoff();
+
+ tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs,
+ tdx->tdvmcall.regs_mask);
+
+ kvm_guest_exit_irqoff();
+}
+
+static fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (unlikely(vcpu->kvm->vm_bugged)) {
+ tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
+ return EXIT_FASTPATH_NONE;
+ }
+
+ trace_kvm_entry(vcpu);
+
+ if (pi_test_on(&tdx->pi_desc)) {
+ apic->send_IPI_self(POSTED_INTR_VECTOR);
+
+ kvm_wait_lapic_expire(vcpu, true);
+ }
+
+ tdx_vcpu_enter_exit(vcpu, tdx);
+
+ tdx_restore_host_xsave_state(vcpu);
+
+ vmx_register_cache_reset(vcpu);
+
+ trace_kvm_exit((unsigned int)tdx->exit_reason.full, vcpu, KVM_ISA_VMX);
+
+ if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
+ return EXIT_FASTPATH_NONE;
+
+ if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+ tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+ else
+ tdx->tdvmcall.rcx = 0;
+
+ return EXIT_FASTPATH_NONE;
+}
+
+static void tdx_hardware_enable(void)
+{
+ INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
+}
+
+static void tdx_hardware_disable(void)
+{
+ int cpu = raw_smp_processor_id();
+ struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+ struct vcpu_tdx *tdx, *tmp;
+
+ /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+ list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
+ tdx_disassociate_vp(&tdx->vcpu);
+}
+
+static void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ u16 exit_reason = to_tdx(vcpu)->exit_reason.basic;
+
+ if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+ vmx_handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));
+ else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
+{
+ u32 intr_info = tdexit_intr_info(vcpu);
+
+ if (is_nmi(intr_info) || is_machine_check(intr_info))
+ return 1;
+
+ kvm_pr_unimpl("unexpected exception 0x%x\n", intr_info);
+ return -EFAULT;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.irq_exits;
+ return 1;
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->mmio_needed = 0;
+ return 0;
+}
+
+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+ u32 eax, ebx, ecx, edx;
+
+ eax = tdvmcall_p1_read(vcpu);
+ ecx = tdvmcall_p2_read(vcpu);
+
+ kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true);
+
+ tdvmcall_p1_write(vcpu, eax);
+ tdvmcall_p2_write(vcpu, ebx);
+ tdvmcall_p3_write(vcpu, ecx);
+ tdvmcall_p4_write(vcpu, edx);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+ return 1;
+}
+
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+ return kvm_vcpu_halt(vcpu);
+}
+
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ int ret;
+
+ BUG_ON(vcpu->arch.pio.count != 1);
+
+ ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+ vcpu->arch.pio.port, &val, 1);
+ WARN_ON(!ret);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, val);
+
+ return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ unsigned int port;
+ int size, ret;
+
+ ++vcpu->stat.io_exits;
+
+ size = tdvmcall_p1_read(vcpu);
+ port = tdvmcall_p3_read(vcpu);
+
+ if (size != 1 && size != 2 && size != 4) {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ if (!tdvmcall_p2_read(vcpu)) {
+ ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+ if (!ret)
+ vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+ else
+ tdvmcall_set_return_val(vcpu, val);
+ } else {
+ val = tdvmcall_p4_read(vcpu);
+ ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+ /* No need for a complete_userspace_io callback. */
+ vcpu->arch.pio.count = 0;
+ }
+ if (ret)
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return ret;
+}
+
+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+
+ nr = tdvmcall_exit_reason(vcpu);
+ a0 = tdvmcall_p1_read(vcpu);
+ a1 = tdvmcall_p2_read(vcpu);
+ a2 = tdvmcall_p3_read(vcpu);
+ a3 = tdvmcall_p4_read(vcpu);
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true);
+
+ tdvmcall_set_return_code(vcpu, ret);
+
+ return 1;
+}
+
+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+ unsigned long val = 0;
+ gpa_t gpa;
+ int size;
+
+ BUG_ON(vcpu->mmio_needed != 1);
+ vcpu->mmio_needed = 0;
+
+ if (!vcpu->mmio_is_write) {
+ gpa = vcpu->mmio_fragments[0].gpa;
+ size = vcpu->mmio_fragments[0].len;
+
+ memcpy(&val, vcpu->run->mmio.data, size);
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ }
+ return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+ unsigned long val)
+{
+ if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+ return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+ unsigned long val;
+
+ if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+ struct kvm_memory_slot *slot;
+ int size, write, r;
+ unsigned long val;
+ gpa_t gpa;
+
+ BUG_ON(vcpu->mmio_needed);
+
+ size = tdvmcall_p1_read(vcpu);
+ write = tdvmcall_p2_read(vcpu);
+ gpa = tdvmcall_p3_read(vcpu);
+ val = write ? tdvmcall_p4_read(vcpu) : 0;
+
+ /* Strip the shared bit, allow MMIO with and without it set. */
+ gpa &= ~(vcpu->kvm->arch.gfn_shared_mask << PAGE_SHIFT);
+
+ if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK) {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa >> PAGE_SHIFT);
+ if (slot && !(slot->flags & KVM_MEMSLOT_INVALID)) {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+ trace_kvm_fast_mmio(gpa);
+ return 1;
+ }
+
+ if (write)
+ r = tdx_mmio_write(vcpu, gpa, size, val);
+ else
+ r = tdx_mmio_read(vcpu, gpa, size);
+ if (!r) {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+ }
+
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_is_write = write;
+ vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+ vcpu->run->mmio.phys_addr = gpa;
+ vcpu->run->mmio.len = size;
+ vcpu->run->mmio.is_write = write;
+ vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+ if (write) {
+ memcpy(vcpu->run->mmio.data, &val, size);
+ } else {
+ vcpu->mmio_fragments[0].gpa = gpa;
+ vcpu->mmio_fragments[0].len = size;
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+ }
+ return 0;
+}
+
+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_p1_read(vcpu);
+ u64 data;
+
+ if (kvm_get_msr(vcpu, index, &data)) {
+ trace_kvm_msr_read_ex(index);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+ trace_kvm_msr_read(index, data);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, data);
+ return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_p1_read(vcpu);
+ u64 data = tdvmcall_p2_read(vcpu);
+
+ if (kvm_set_msr(vcpu, index, data)) {
+ trace_kvm_msr_write_ex(index, data);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ trace_kvm_msr_write(index, data);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+}
+
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+ gpa_t gpa = tdvmcall_p1_read(vcpu);
+ gpa_t size = tdvmcall_p2_read(vcpu);
+
+ if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
+ (gpa + size) < gpa ||
+ (gpa + size) > vcpu->kvm->arch.gfn_shared_mask << (PAGE_SHIFT + 1))
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ else
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+ return 1;
+}
+
+static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
+{
+ vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+ vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
+ vcpu->run->system_event.flags = tdvmcall_p1_read(vcpu);
+ return 0;
+}
+
+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ unsigned long exit_reason;
+
+ if (unlikely(tdx->tdvmcall.xmm_mask))
+ goto unsupported;
+
+ if (tdvmcall_exit_type(vcpu))
+ return tdx_emulate_vmcall(vcpu);
+
+ exit_reason = tdvmcall_exit_reason(vcpu);
+
+ switch (exit_reason) {
+ case EXIT_REASON_CPUID:
+ return tdx_emulate_cpuid(vcpu);
+ case EXIT_REASON_HLT:
+ return tdx_emulate_hlt(vcpu);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return tdx_emulate_io(vcpu);
+ case EXIT_REASON_MSR_READ:
+ return tdx_emulate_rdmsr(vcpu);
+ case EXIT_REASON_MSR_WRITE:
+ return tdx_emulate_wrmsr(vcpu);
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_emulate_mmio(vcpu);
+ case TDG_VP_VMCALL_MAP_GPA:
+ return tdx_map_gpa(vcpu);
+ case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+ return tdx_report_fatal_error(vcpu);
+ default:
+ break;
+ }
+
+unsupported:
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+}
+
+static void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+ int pgd_level)
+{
+ td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
+}
+
+#define SEPT_ERR(err, ex, op, kvm) \
+({ \
+ int __ret = KVM_BUG_ON(err, kvm); \
+ \
+ if (unlikely(__ret)) { \
+ pr_seamcall_error(op, err, ex); \
+ } \
+ __ret; \
+})
+
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+{
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+ int i;
+
+ for (i = 0; i < PAGE_SIZE; i += TDX1_EXTENDMR_CHUNKSIZE) {
+ err = tdh_mr_extend(kvm_tdx->tdr.pa, gpa + i, &ex_ret);
+ if (SEPT_ERR(err, &ex_ret, TDH_MR_EXTEND, &kvm_tdx->kvm))
+ break;
+ }
+}
+
+static void tdx_sept_set_private_spte(struct kvm_vcpu *vcpu, gfn_t gfn,
+ int level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ hpa_t hpa = pfn << PAGE_SHIFT;
+ gpa_t gpa = gfn << PAGE_SHIFT;
+ struct tdx_ex_ret ex_ret;
+ hpa_t source_pa;
+ u64 err;
+
+ if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
+ return;
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, vcpu->kvm))
+ return;
+
+ /* Pin the page, KVM doesn't yet support page migration. */
+ get_page(pfn_to_page(pfn));
+
+ /* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
+ if (is_td_finalized(kvm_tdx)) {
+ err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &ex_ret);
+ SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_AUG, vcpu->kvm);
+ return;
+ }
+
+ source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
+
+ err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &ex_ret);
+ if (!SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_ADD, vcpu->kvm) &&
+ (kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
+ tdx_measure_page(kvm_tdx, gpa);
+}
+
+static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, int level,
+ kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn << PAGE_SHIFT;
+ hpa_t hpa = pfn << PAGE_SHIFT;
+ hpa_t hpa_with_hkid;
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_NONE, kvm))
+ return;
+
+ if (is_hkid_assigned(kvm_tdx)) {
+ err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
+ if (SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_REMOVE, kvm))
+ return;
+
+ hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+ err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+ if (TDX_ERR(err, TDH_PHYMEM_PAGE_WBINVD, NULL))
+ return;
+ } else if (tdx_reclaim_page((unsigned long)__va(hpa), hpa)) {
+ return;
+ }
+
+ put_page(pfn_to_page(pfn));
+}
+
+static int tdx_sept_link_private_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
+ int level, void *sept_page)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ gpa_t gpa = gfn << PAGE_SHIFT;
+ hpa_t hpa = __pa(sept_page);
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+
+ err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gpa, level, hpa, &ex_ret);
+ if (SEPT_ERR(err, &ex_ret, TDH_MEM_SEPT_ADD, vcpu->kvm))
+ return -EIO;
+
+ return 0;
+}
+
+static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, int level)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn << PAGE_SHIFT;
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+
+ err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
+ SEPT_ERR(err, &ex_ret, TDH_MEM_RANGE_BLOCK, kvm);
+}
+
+static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn, int level)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn << PAGE_SHIFT;
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+
+ err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
+ SEPT_ERR(err, &ex_ret, TDH_MEM_RANGE_UNBLOCK, kvm);
+}
+
+static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, int level,
+ void *sept_page)
+{
+ /*
+ * free_private_sp() is (obviously) called when a shadow page is being
+ * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
+ */
+ if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
+ return -EINVAL;
+
+ return tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
+}
+
+static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx;
+ u64 err;
+
+ if (!is_td(kvm))
+ return -EOPNOTSUPP;
+
+ kvm_tdx = to_kvm_tdx(kvm);
+ kvm_tdx->tdh_mem_track = true;
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+ if (is_hkid_assigned(kvm_tdx) && is_td_finalized(kvm_tdx)) {
+ err = tdh_mem_track(to_kvm_tdx(kvm)->tdr.pa);
+ SEPT_ERR(err, NULL, TDH_MEM_TRACK, kvm);
+ }
+
+ WRITE_ONCE(kvm_tdx->tdh_mem_track, false);
+
+ return 0;
+}
+
+static void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ u64 root_hpa = mmu->root_hpa;
+
+ /* Flush the shared EPTP, if it's valid. */
+ if (VALID_PAGE(root_hpa))
+ ept_sync_context(construct_eptp(vcpu, root_hpa,
+ mmu->shadow_root_level));
+
+ while (READ_ONCE(kvm_tdx->tdh_mem_track))
+ cpu_relax();
+}
+
+static inline bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+ return !((gpa >> PAGE_SHIFT) & kvm->arch.gfn_shared_mask);
+}
+
+#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+ unsigned long exit_qual;
+
+ if (tdx_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu)))
+ exit_qual = TDX_SEPT_PFERR;
+ else
+ exit_qual = tdexit_exit_qual(vcpu);
+ trace_kvm_page_fault(tdexit_gpa(vcpu), exit_qual);
+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+ WARN_ON(1);
+
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+
+ return 0;
+}
+
+static int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath)
+{
+ union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+ if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+ kvm_pr_unimpl("TD exit due to %s, Exit Reason %d\n",
+ tdx_seamcall_error_name(exit_reason.full),
+ exit_reason.basic);
+ if (exit_reason.basic == EXIT_REASON_TRIPLE_FAULT)
+ return tdx_handle_triple_fault(vcpu);
+
+ goto unhandled_exit;
+ }
+
+ WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+ switch (exit_reason.basic) {
+ case EXIT_REASON_EXCEPTION_NMI:
+ return tdx_handle_exception(vcpu);
+ case EXIT_REASON_EXTERNAL_INTERRUPT:
+ return tdx_handle_external_interrupt(vcpu);
+ case EXIT_REASON_TDCALL:
+ return handle_tdvmcall(vcpu);
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_handle_ept_violation(vcpu);
+ case EXIT_REASON_EPT_MISCONFIG:
+ return tdx_handle_ept_misconfig(vcpu);
+ case EXIT_REASON_OTHER_SMI:
+ /*
+ * If we reach this point, it's not an MSMI.  A #SMI is delivered
+ * and handled right after SEAMRET, so nothing needs to be done in
+ * KVM.
+ */
+ return 1;
+ default:
+ break;
+ }
+
+unhandled_exit:
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = exit_reason.full;
+ return 0;
+}
+
+static void tdx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2,
+ u32 *intr_info, u32 *error_code)
+{
+ *info1 = tdexit_exit_qual(vcpu);
+ *info2 = tdexit_ext_exit_qual(vcpu);
+
+ *intr_info = tdexit_intr_info(vcpu);
+ *error_code = 0;
+}
+
+static int __init tdx_check_processor_compatibility(void)
+{
+ /* TDX-SEAM itself verifies compatibility on all CPUs. */
+ return 0;
+}
+
+static void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
+static void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ pi_clear_on(&tdx->pi_desc);
+ memset(tdx->pi_desc.pir, 0, sizeof(tdx->pi_desc.pir));
+}
+
+/*
+ * Send an interrupt to a vCPU via posted interrupts:
+ * 1. If the target vCPU is running (non-root mode), send a posted-interrupt
+ *    notification and the hardware will sync PIR to vIRR atomically.
+ * 2. If the target vCPU isn't running (root mode), kick it so that it picks
+ *    up the interrupt from PIR on the next VM-entry.
+ */
+static int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (pi_test_and_set_pir(vector, &tdx->pi_desc))
+ return 0;
+
+ /* If a previous notification has sent the IPI, nothing to do. */
+ if (pi_test_and_set_on(&tdx->pi_desc))
+ return 0;
+
+ if (vcpu != kvm_get_running_vcpu() &&
+ !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
+ kvm_vcpu_kick(vcpu);
+
+ return 0;
+}
+
+static int tdx_dev_ioctl(void __user *argp)
+{
+ struct kvm_tdx_capabilities __user *user_caps;
+ struct kvm_tdx_capabilities caps;
+ struct kvm_tdx_cmd cmd;
+
+ BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+ sizeof(struct tdx_cpuid_config));
+
+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
+ return -EFAULT;
+
+ if (cmd.metadata || cmd.id != KVM_TDX_CAPABILITIES)
+ return -EINVAL;
+
+ user_caps = (void __user *)cmd.data;
+ if (copy_from_user(&caps, user_caps, sizeof(caps)))
+ return -EFAULT;
+
+ if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
+ return -E2BIG;
+ caps.nr_cpuid_configs = tdx_caps.nr_cpuid_configs;
+
+ if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
+ tdx_caps.nr_cpuid_configs * sizeof(struct tdx_cpuid_config)))
+ return -EFAULT;
+
+ caps.attrs_fixed0 = tdx_caps.attrs_fixed0;
+ caps.attrs_fixed1 = tdx_caps.attrs_fixed1;
+ caps.xfam_fixed0 = tdx_caps.xfam_fixed0;
+ caps.xfam_fixed1 = tdx_caps.xfam_fixed1;
+
+ if (copy_to_user((void __user *)cmd.data, &caps, sizeof(caps)))
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * TDX-SEAM definitions for fixed{0,1} are inverted relative to VMX: the TDX
+ * definitions are sane, whereas the VMX definitions are backwards.
+ *
+ * if fixed0[i] == 0: val[i] must be 0
+ * if fixed1[i] == 1: val[i] must be 1
+ */
+static inline bool tdx_fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
+{
+ return ((val & fixed0) | fixed1) == val;
+}
+
+static struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(struct kvm_tdx *kvm_tdx,
+ u32 function, u32 index)
+{
+ struct kvm_cpuid_entry2 *e;
+ int i;
+
+ for (i = 0; i < kvm_tdx->cpuid_nent; i++) {
+ e = &kvm_tdx->cpuid_entries[i];
+
+ if (e->function == function && (e->index == index ||
+ !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
+ return e;
+ }
+ return NULL;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+ struct kvm_tdx_init_vm *init_vm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_cpuid_config *config;
+ struct kvm_cpuid_entry2 *entry;
+ struct tdx_cpuid_value *value;
+ u64 guest_supported_xcr0;
+ u64 guest_supported_xss;
+ u32 guest_tsc_khz;
+ int max_pa;
+ int i;
+
+ /* init_vm->reserved must be zero */
+ if (find_first_bit((unsigned long *)init_vm->reserved,
+ sizeof(init_vm->reserved) * 8) !=
+ sizeof(init_vm->reserved) * 8)
+ return -EINVAL;
+
+ td_params->attributes = init_vm->attributes;
+ td_params->max_vcpus = init_vm->max_vcpus;
+
+ /* TODO: Enforce consistent CPUID features for all vCPUs. */
+ for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
+ config = &tdx_caps.cpuid_configs[i];
+
+ entry = tdx_find_cpuid_entry(kvm_tdx, config->leaf,
+ config->sub_leaf);
+ if (!entry)
+ continue;
+
+ /*
+ * Non-configurable bits must be '0', even if they are fixed to
+ * '1' by TDX-SEAM, i.e. mask off non-configurable bits.
+ */
+ value = &td_params->cpuid_values[i];
+ value->eax = entry->eax & config->eax;
+ value->ebx = entry->ebx & config->ebx;
+ value->ecx = entry->ecx & config->ecx;
+ value->edx = entry->edx & config->edx;
+ }
+
+ entry = tdx_find_cpuid_entry(kvm_tdx, 0xd, 0);
+ if (entry)
+ guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
+ else
+ guest_supported_xcr0 = 0;
+ guest_supported_xcr0 &= supported_xcr0;
+
+ entry = tdx_find_cpuid_entry(kvm_tdx, 0xd, 1);
+ if (entry)
+ guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
+ else
+ guest_supported_xss = 0;
+ guest_supported_xss &= supported_xss;
+
+ max_pa = 36;
+ entry = tdx_find_cpuid_entry(kvm_tdx, 0x80000008, 0);
+ if (entry)
+ max_pa = entry->eax & 0xff;
+
+ td_params->eptp_controls = VMX_EPTP_MT_WB;
+
+ if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+ td_params->eptp_controls |= VMX_EPTP_PWL_5;
+ td_params->exec_controls |= TDX1_EXEC_CONTROL_MAX_GPAW;
+ } else {
+ td_params->eptp_controls |= VMX_EPTP_PWL_4;
+ }
+
+ if (!tdx_fixed_bits_valid(td_params->attributes,
+ tdx_caps.attrs_fixed0,
+ tdx_caps.attrs_fixed1))
+ return -EINVAL;
+
+ /* Setup td_params.xfam */
+ td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
+ if (!tdx_fixed_bits_valid(td_params->xfam,
+ tdx_caps.xfam_fixed0,
+ tdx_caps.xfam_fixed1))
+ return -EINVAL;
+
+ if (init_vm->tsc_khz)
+ guest_tsc_khz = init_vm->tsc_khz;
+ else
+ guest_tsc_khz = kvm->arch.initial_tsc_khz;
+
+ if (guest_tsc_khz < TDX1_MIN_TSC_FREQUENCY_KHZ ||
+ guest_tsc_khz > TDX1_MAX_TSC_FREQUENCY_KHZ) {
+ pr_warn_ratelimited("Illegal TD TSC %d kHz, it must be between [%d, %d] kHz\n",
+ guest_tsc_khz, TDX1_MIN_TSC_FREQUENCY_KHZ, TDX1_MAX_TSC_FREQUENCY_KHZ);
+ return -EINVAL;
+ }
+
+ td_params->tsc_frequency = TDX1_TSC_KHZ_TO_25MHZ(guest_tsc_khz);
+ if (TDX1_TSC_25MHZ_TO_KHZ(td_params->tsc_frequency) != guest_tsc_khz) {
+ pr_warn_ratelimited("TD TSC %d kHz not a multiple of 25 MHz\n", guest_tsc_khz);
+ if (init_vm->tsc_khz)
+ return -EINVAL;
+ }
+
+ BUILD_BUG_ON(sizeof(td_params->mrconfigid) !=
+ sizeof(init_vm->mrconfigid));
+ memcpy(td_params->mrconfigid, init_vm->mrconfigid,
+ sizeof(td_params->mrconfigid));
+ BUILD_BUG_ON(sizeof(td_params->mrowner) !=
+ sizeof(init_vm->mrowner));
+ memcpy(td_params->mrowner, init_vm->mrowner, sizeof(td_params->mrowner));
+ BUILD_BUG_ON(sizeof(td_params->mrownerconfig) !=
+ sizeof(init_vm->mrownerconfig));
+ memcpy(td_params->mrownerconfig, init_vm->mrownerconfig,
+ sizeof(td_params->mrownerconfig));
+
+ return 0;
+}
+
+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_cpuid2 __user *user_cpuid;
+ struct kvm_tdx_init_vm init_vm;
+ struct td_params *td_params;
+ struct tdx_ex_ret ex_ret;
+ struct kvm_cpuid2 cpuid;
+ int ret;
+ u64 err;
+
+ if (is_td_initialized(kvm))
+ return -EINVAL;
+
+ if (cmd->metadata)
+ return -EINVAL;
+
+ if (copy_from_user(&init_vm, (void __user *)cmd->data, sizeof(init_vm)))
+ return -EFAULT;
+
+ if (init_vm.max_vcpus > KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ user_cpuid = (void __user *)init_vm.cpuid;
+ if (copy_from_user(&cpuid, user_cpuid, sizeof(cpuid)))
+ return -EFAULT;
+
+ if (cpuid.nent > KVM_MAX_CPUID_ENTRIES)
+ return -E2BIG;
+
+ if (copy_from_user(&kvm_tdx->cpuid_entries, user_cpuid->entries,
+ cpuid.nent * sizeof(struct kvm_cpuid_entry2)))
+ return -EFAULT;
+
+ BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+ td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL_ACCOUNT);
+ if (!td_params)
+ return -ENOMEM;
+
+ kvm_tdx->cpuid_nent = cpuid.nent;
+
+ ret = setup_tdparams(kvm, td_params, &init_vm);
+ if (ret)
+ goto free_tdparams;
+
+ err = tdh_mng_init(kvm_tdx->tdr.pa, __pa(td_params), &ex_ret);
+ if (TDX_ERR(err, TDH_MNG_INIT, &ex_ret)) {
+ ret = -EIO;
+ goto free_tdparams;
+ }
+
+ kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+ kvm_tdx->attributes = td_params->attributes;
+ kvm_tdx->xfam = td_params->xfam;
+ kvm->max_vcpus = td_params->max_vcpus;
+ kvm->arch.initial_tsc_khz = TDX1_TSC_25MHZ_TO_KHZ(td_params->tsc_frequency);
+
+ if (td_params->exec_controls & TDX1_EXEC_CONTROL_MAX_GPAW)
+ kvm->arch.gfn_shared_mask = BIT_ULL(51) >> PAGE_SHIFT;
+ else
+ kvm->arch.gfn_shared_mask = BIT_ULL(47) >> PAGE_SHIFT;
+
+free_tdparams:
+ kfree(td_params);
+ if (ret)
+ kvm_tdx->cpuid_nent = 0;
+ return ret;
+}
+
+static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_tdx_init_mem_region region;
+ struct kvm_vcpu *vcpu;
+ struct page *page;
+ kvm_pfn_t pfn;
+ int idx, ret = 0;
+
+ /* The BSP vCPU must be created before initializing memory regions. */
+ if (!atomic_read(&kvm->online_vcpus))
+ return -EINVAL;
+
+ if (cmd->metadata & ~KVM_TDX_MEASURE_MEMORY_REGION)
+ return -EINVAL;
+
+ if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
+ return -EFAULT;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(region.source_addr, PAGE_SIZE))
+ return -EINVAL;
+ if (!IS_ALIGNED(region.gpa, PAGE_SIZE))
+ return -EINVAL;
+ if (!region.nr_pages)
+ return -EINVAL;
+ if (region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa)
+ return -EINVAL;
+ if (!tdx_is_private_gpa(kvm, region.gpa))
+ return -EINVAL;
+
+ vcpu = kvm_get_vcpu(kvm, 0);
+ if (mutex_lock_killable(&vcpu->mutex))
+ return -EINTR;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&kvm->srcu);
+
+ kvm_mmu_reload(vcpu);
+
+ while (region.nr_pages) {
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ /* Pin the source page. */
+ ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
+ if (ret < 0)
+ break;
+ if (ret != 1) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
+ (cmd->metadata & KVM_TDX_MEASURE_MEMORY_REGION);
+
+ pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+ PG_LEVEL_4K);
+ if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
+ ret = -EFAULT;
+ else
+ ret = 0;
+
+ put_page(page);
+ if (ret)
+ break;
+
+ region.source_addr += PAGE_SIZE;
+ region.gpa += PAGE_SIZE;
+ region.nr_pages--;
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ mutex_unlock(&vcpu->mutex);
+
+ if (copy_to_user((void __user *)cmd->data, &region, sizeof(region)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ err = tdh_mr_finalize(kvm_tdx->tdr.pa);
+ if (TDX_ERR(err, TDH_MR_FINALIZE, NULL))
+ return -EIO;
+
+ kvm_tdx->finalized = true;
+ return 0;
+}
+
+static int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_tdx_cmd tdx_cmd;
+ int r;
+
+ if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+ return -EFAULT;
+
+ mutex_lock(&kvm->lock);
+
+ switch (tdx_cmd.id) {
+ case KVM_TDX_INIT_VM:
+ r = tdx_td_init(kvm, &tdx_cmd);
+ break;
+ case KVM_TDX_INIT_MEM_REGION:
+ r = tdx_init_mem_region(kvm, &tdx_cmd);
+ break;
+ case KVM_TDX_FINALIZE_VM:
+ r = tdx_td_finalizemr(kvm);
+ break;
+ default:
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+ r = -EFAULT;
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+static int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct kvm_tdx_cmd cmd;
+ u64 err;
+
+ if (tdx->initialized)
+ return -EINVAL;
+
+ if (!is_td_initialized(vcpu->kvm) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
+ return -EFAULT;
+
+ if (cmd.metadata || cmd.id != KVM_TDX_INIT_VCPU)
+ return -EINVAL;
+
+ err = tdh_vp_init(tdx->tdvpr.pa, cmd.data);
+ if (TDX_ERR(err, TDH_VP_INIT, NULL))
+ return -EIO;
+
+ tdx->initialized = true;
+
+ td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
+ td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
+ td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
+ return 0;
+}
+
+static void tdx_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ /* TODO: Figure out exception bitmap for debug TD. */
+}
+
+static void tdx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ /* TODO: Add TDH_VP_WR(GUEST_DR7) for debug TDs. */
+ if (is_debug_td(vcpu))
+ return;
+
+ KVM_BUG_ON(val != DR7_FIXED_1, vcpu->kvm);
+}
+
+static int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+ return 0;
+
+ /*
+ * For debug TDs, tdx_get_cpl() may be called before the vCPU is
+ * initialized, i.e. before TDH_VP_RD is legal, if the vCPU is scheduled
+ * out. If this happens, simply return CPL0 to avoid TDH_VP_RD failure.
+ */
+ if (!to_tdx(vcpu)->initialized)
+ return 0;
+
+ return VMX_AR_DPL(td_vmcs_read32(to_tdx(vcpu), GUEST_SS_AR_BYTES));
+}
+
+static unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+ return 0;
+
+ return td_vmcs_read64(to_tdx(vcpu), GUEST_RFLAGS);
+}
+
+static void tdx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+ return;
+
+ /*
+ * TODO: This is currently disallowed by TDX-SEAM, which breaks
+ * single-step debug.
+ */
+ td_vmcs_write64(to_tdx(vcpu), GUEST_RFLAGS, rflags);
+}
+
+static bool tdx_is_emulated_msr(u32 index, bool write)
+{
+ switch (index) {
+ case MSR_IA32_UCODE_REV:
+ case MSR_IA32_ARCH_CAPABILITIES:
+ case MSR_IA32_POWER_CTL:
+ case MSR_MTRRcap:
+ case 0x200 ... 0x2ff:
+ case MSR_IA32_TSC_DEADLINE:
+ case MSR_IA32_MISC_ENABLE:
+ case MSR_KVM_STEAL_TIME:
+ case MSR_KVM_POLL_CONTROL:
+ case MSR_PLATFORM_INFO:
+ case MSR_MISC_FEATURES_ENABLES:
+ case MSR_IA32_MCG_CTL:
+ case MSR_IA32_MCG_STATUS:
+ case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(32) - 1:
+ return true;
+ case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+ /*
+ * x2APIC registers that are virtualized by the CPU can't be
+ * emulated; KVM doesn't have access to the virtual APIC page.
+ */
+ switch (index) {
+ case X2APIC_MSR(APIC_TASKPRI):
+ case X2APIC_MSR(APIC_PROCPRI):
+ case X2APIC_MSR(APIC_EOI):
+ case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+ return false;
+ default:
+ return true;
+ }
+ case MSR_IA32_APICBASE:
+ case MSR_EFER:
+ return !write;
+ default:
+ return false;
+ }
+}
+
+static int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_is_emulated_msr(msr->index, false))
+ return kvm_get_msr_common(vcpu, msr);
+ return 1;
+}
+
+static int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_is_emulated_msr(msr->index, true))
+ return kvm_set_msr_common(vcpu, msr);
+ return 1;
+}
+
+static u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ if (!is_debug_td(vcpu))
+ return 0;
+
+ return td_vmcs_read64(to_tdx(vcpu), GUEST_ES_BASE + seg * 2);
+}
+
+static void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (!is_debug_td(vcpu)) {
+ memset(var, 0, sizeof(*var));
+ return;
+ }
+
+ seg *= 2;
+ var->base = td_vmcs_read64(tdx, GUEST_ES_BASE + seg);
+ var->limit = td_vmcs_read32(tdx, GUEST_ES_LIMIT + seg);
+ var->selector = td_vmcs_read16(tdx, GUEST_ES_SELECTOR + seg);
+ vmx_decode_ar_bytes(td_vmcs_read32(tdx, GUEST_ES_AR_BYTES + seg), var);
+}
+
+static void tdx_cache_gprs(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int i;
+
+ if (!is_td_vcpu(vcpu) || !is_debug_td(vcpu))
+ return;
+
+ for (i = 0; i < NR_VCPU_REGS; i++) {
+ if (i == VCPU_REGS_RSP || i == VCPU_REGS_RIP)
+ continue;
+
+ vcpu->arch.regs[i] = td_gpr_read64(tdx, i);
+ }
+}
+
+static void tdx_flush_gprs(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int i;
+
+ if (!is_td_vcpu(vcpu) || KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+ return;
+
+ for (i = 0; i < NR_VCPU_REGS; i++)
+ td_gpr_write64(tdx, i, vcpu->arch.regs[i]);
+}
+
+static void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+ unsigned int *vcpu_align,
+ unsigned int *vm_size)
+{
+ *vcpu_size = sizeof(struct vcpu_tdx);
+ *vcpu_align = __alignof__(struct vcpu_tdx);
+
+ if (sizeof(struct kvm_tdx) > *vm_size)
+ *vm_size = sizeof(struct kvm_tdx);
+}
+
+static int __init tdx_init(void)
+{
+ return 0;
+}
+
+static int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+ int i, max_pkgs;
+ u32 max_pa;
+ const struct tdsysinfo_struct *tdsysinfo = tdx_get_sysinfo();
+
+ if (!enable_ept) {
+ pr_warn("Cannot enable TDX with EPT disabled\n");
+ return -EINVAL;
+ }
+
+ if (tdsysinfo == NULL) {
+ WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_TDX));
+ return -ENODEV;
+ }
+
+ if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
+ return -EIO;
+
+ tdx_caps.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE;
+ if (tdx_caps.tdcs_nr_pages != TDX1_NR_TDCX_PAGES)
+ return -EIO;
+
+ tdx_caps.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1;
+ if (tdx_caps.tdvpx_nr_pages != TDX1_NR_TDVPX_PAGES)
+ return -EIO;
+
+ tdx_caps.attrs_fixed0 = tdsysinfo->attributes_fixed0;
+ tdx_caps.attrs_fixed1 = tdsysinfo->attributes_fixed1;
+ tdx_caps.xfam_fixed0 = tdsysinfo->xfam_fixed0;
+ tdx_caps.xfam_fixed1 = tdsysinfo->xfam_fixed1;
+
+ tdx_caps.nr_cpuid_configs = tdsysinfo->num_cpuid_config;
+ if (tdx_caps.nr_cpuid_configs > TDX1_MAX_NR_CPUID_CONFIGS)
+ return -EIO;
+
+ memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
+ tdsysinfo->num_cpuid_config * sizeof(struct tdx_cpuid_config));
+
+ x86_ops->cache_gprs = tdx_cache_gprs;
+ x86_ops->flush_gprs = tdx_flush_gprs;
+
+ x86_ops->tlb_remote_flush = tdx_sept_tlb_remote_flush;
+ x86_ops->set_private_spte = tdx_sept_set_private_spte;
+ x86_ops->drop_private_spte = tdx_sept_drop_private_spte;
+ x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+ x86_ops->unzap_private_spte = tdx_sept_unzap_private_spte;
+ x86_ops->link_private_sp = tdx_sept_link_private_sp;
+ x86_ops->free_private_sp = tdx_sept_free_private_sp;
+
+ max_pkgs = topology_max_packages();
+ tdx_phymem_cache_wb_lock = kcalloc(max_pkgs, sizeof(*tdx_phymem_cache_wb_lock),
+ GFP_KERNEL);
+ tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
+ GFP_KERNEL);
+ if (!tdx_phymem_cache_wb_lock || !tdx_mng_key_config_lock) {
+ kfree(tdx_phymem_cache_wb_lock);
+ kfree(tdx_mng_key_config_lock);
+ return -ENOMEM;
+ }
+ for (i = 0; i < max_pkgs; i++) {
+ mutex_init(&tdx_phymem_cache_wb_lock[i]);
+ mutex_init(&tdx_mng_key_config_lock[i]);
+ }
+
+ max_pa = cpuid_eax(0x80000008) & 0xff;
+ hkid_start_pos = boot_cpu_data.x86_phys_bits;
+ hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
+
+ return 0;
+}
+
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index c2849e0f4260..dbb24896d87e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -8,6 +8,7 @@
#include "tdx_arch.h"
#include "tdx_errno.h"
#include "tdx_ops.h"
+#include "posted_intr.h"

#ifdef CONFIG_KVM_INTEL_TDX

@@ -22,6 +23,51 @@ struct kvm_tdx {

struct tdx_td_page tdr;
struct tdx_td_page tdcs[TDX1_NR_TDCX_PAGES];
+
+ u64 attributes;
+ u64 xfam;
+ int hkid;
+
+ int cpuid_nent;
+ struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
+
+ bool finalized;
+ bool tdh_mem_track;
+
+ hpa_t source_pa;
+
+ u64 tsc_offset;
+};
+
+union tdx_exit_reason {
+ struct {
+ /* 31:0 mirror the VMX Exit Reason format */
+ u64 basic : 16;
+ u64 reserved16 : 1;
+ u64 reserved17 : 1;
+ u64 reserved18 : 1;
+ u64 reserved19 : 1;
+ u64 reserved20 : 1;
+ u64 reserved21 : 1;
+ u64 reserved22 : 1;
+ u64 reserved23 : 1;
+ u64 reserved24 : 1;
+ u64 reserved25 : 1;
+ u64 reserved26 : 1;
+ u64 enclave_mode : 1;
+ u64 smi_pending_mtf : 1;
+ u64 smi_from_vmx_root : 1;
+ u64 reserved30 : 1;
+ u64 failed_vmentry : 1;
+
+ /* 63:32 are TDX specific */
+ u64 details_l1 : 8;
+ u64 class : 8;
+ u64 reserved61_48 : 14;
+ u64 non_recoverable : 1;
+ u64 error : 1;
+ };
+ u64 full;
};

struct vcpu_tdx {
@@ -29,6 +75,42 @@ struct vcpu_tdx {

struct tdx_td_page tdvpr;
struct tdx_td_page tdvpx[TDX1_NR_TDVPX_PAGES];
+
+ struct list_head cpu_list;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ union {
+ struct {
+ union {
+ struct {
+ u16 gpr_mask;
+ u16 xmm_mask;
+ };
+ u32 regs_mask;
+ };
+ u32 reserved;
+ };
+ u64 rcx;
+ } tdvmcall;
+
+ union tdx_exit_reason exit_reason;
+
+ bool initialized;
+};
+
+struct tdx_capabilities {
+ u8 tdcs_nr_pages;
+ u8 tdvpx_nr_pages;
+
+ u64 attrs_fixed0;
+ u64 attrs_fixed1;
+ u64 xfam_fixed0;
+ u64 xfam_fixed1;
+
+ u32 nr_cpuid_configs;
+ struct tdx_cpuid_config cpuid_configs[TDX1_MAX_NR_CPUID_CONFIGS];
};

static inline bool is_td(struct kvm *kvm)
@@ -154,6 +236,21 @@ TDX_BUILD_TDVPS_ACCESSORS(64, STATE, state);
TDX_BUILD_TDVPS_ACCESSORS(64, MSR, msr);
TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);

+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+ struct tdx_ex_ret ex_ret;
+ u64 err;
+
+ err = tdh_mng_rd(kvm_tdx->tdr.pa, TDCS_EXEC(field), &ex_ret);
+ if (unlikely(err)) {
+ pr_err("TDH_MNG_RD[EXEC.0x%x] failed: %s (0x%llx)\n", field,
+ tdx_seamcall_error_name(err), err);
+ WARN_ON(1);
+ return 0;
+ }
+ return ex_ret.r8;
+}
+
#else

struct kvm_tdx;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 559a63290c4d..7258825b1e02 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -150,6 +150,13 @@ enum tdx_guest_management {
/* @field is any of enum tdx_guest_management */
#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(32, (field))

+enum tdx_tdcs_execution_control {
+ TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field) BUILD_TDX_FIELD(17, (field))
+
#define TDX1_NR_TDCX_PAGES 4
#define TDX1_NR_TDVPX_PAGES 5

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 8afcffa267dc..219473235c97 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -6,34 +6,45 @@

#include <asm/asm.h>
#include <asm/kvm_host.h>
+#include <asm/cacheflush.h>

#include "seamcall.h"

+static inline void tdx_clflush_page(hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+}
+
static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
{
+ tdx_clflush_page(addr);
return seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
}

static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
struct tdx_ex_ret *ex)
{
+ tdx_clflush_page(hpa);
return seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, ex);
}

static inline u64 tdh_mem_spet_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
struct tdx_ex_ret *ex)
{
+ tdx_clflush_page(page);
return seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, ex);
}

static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
{
+ tdx_clflush_page(addr);
return seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, 0, NULL);
}

static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
struct tdx_ex_ret *ex)
{
+ tdx_clflush_page(hpa);
return seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, ex);
}

@@ -50,11 +61,13 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)

static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
{
+ tdx_clflush_page(tdr);
return seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
}

static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
{
+ tdx_clflush_page(tdvpr);
return seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, 0, NULL);
}

diff --git a/arch/x86/kvm/vmx/tdx_stubs.c b/arch/x86/kvm/vmx/tdx_stubs.c
new file mode 100644
index 000000000000..b666a92f6041
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_stubs.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kvm_host.h>
+
+static int tdx_vm_init(struct kvm *kvm) { return 0; }
+static void tdx_vm_teardown(struct kvm *kvm) {}
+static void tdx_vm_destroy(struct kvm *kvm) {}
+static int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return 0; }
+static void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+static void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+static fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static void tdx_hardware_enable(void) {}
+static void tdx_hardware_disable(void) {}
+static void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+static int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath) { return 0; }
+static int tdx_dev_ioctl(void __user *argp) { return -EINVAL; }
+static int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EINVAL; }
+static int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EINVAL; }
+static void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
+static void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long pgd,
+ int pgd_level) {}
+static void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+static void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
+static int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) { return -1; }
+static void tdx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2,
+ u32 *intr_info, u32 *error_code) {}
+static int __init tdx_check_processor_compatibility(void) { return 0; }
+static void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+ unsigned int *vcpu_align,
+ unsigned int *vm_size) {}
+static int __init tdx_init(void) { return 0; }
+static void tdx_update_exception_bitmap(struct kvm_vcpu *vcpu) {}
+static void tdx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) {}
+static int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+static unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+static void tdx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) {}
+static bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
+static int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0; }
+static void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg) {}
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 3a6461694fc2..4f9931db30ac 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,6 +2,7 @@
#include <linux/linkage.h>
#include <asm/asm.h>
#include <asm/bitsperlong.h>
+#include <asm/errno.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
#include <asm/segment.h>
@@ -28,6 +29,13 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif

+#ifdef CONFIG_KVM_INTEL_TDX
+#define TDENTER 0
+#define EXIT_REASON_TDCALL 77
+#define TDENTER_ERROR_BIT 63
+#include "seamcall.h"
+#endif
+
.section .noinstr.text, "ax"

/**
@@ -328,3 +336,141 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
pop %_ASM_BP
ret
SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
+
+#ifdef CONFIG_KVM_INTEL_TDX
+
+.pushsection .noinstr.text, "ax"
+
+/**
+ * __tdx_vcpu_run - Call SEAMCALL(TDENTER) to run a TD vcpu
+ * @tdvpr: physical address of TDVPR
+ * @regs: void * (to registers of TDVCPU)
+ * @gpr_mask: non-zero if guest registers need to be loaded prior to TDENTER
+ *
+ * Returns:
+ * TD-Exit Reason
+ *
+ * Note: KVM doesn't support using XMM in its hypercalls, it's the HyperV
+ * code's responsibility to save/restore XMM registers on TDVMCALL.
+ */
+SYM_FUNC_START(__tdx_vcpu_run)
+ push %rbp
+ mov %rsp, %rbp
+
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+ push %rbx
+
+ /* Save @regs, which is needed after TDENTER to capture output. */
+ push %rsi
+
+ /* Load @tdvpr to RCX */
+ mov %rdi, %rcx
+
+ /* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
+ test %dx, %dx
+ je 1f
+
+ /* Load @regs to RAX, which will be clobbered with $TDENTER anyways. */
+ mov %rsi, %rax
+
+ mov VCPU_RBX(%rax), %rbx
+ mov VCPU_RDX(%rax), %rdx
+ mov VCPU_RBP(%rax), %rbp
+ mov VCPU_RSI(%rax), %rsi
+ mov VCPU_RDI(%rax), %rdi
+
+ mov VCPU_R8 (%rax), %r8
+ mov VCPU_R9 (%rax), %r9
+ mov VCPU_R10(%rax), %r10
+ mov VCPU_R11(%rax), %r11
+ mov VCPU_R12(%rax), %r12
+ mov VCPU_R13(%rax), %r13
+ mov VCPU_R14(%rax), %r14
+ mov VCPU_R15(%rax), %r15
+
+ /* Load TDENTER to RAX. This kills the @regs pointer! */
+1: mov $TDENTER, %rax
+
+2: seamcall
+
+ /* Skip to the exit path if TDENTER failed. */
+ bt $TDENTER_ERROR_BIT, %rax
+ jc 4f
+
+ /* Temporarily save the TD-Exit reason. */
+ push %rax
+
+ /* check if TD-exit due to TDVMCALL */
+ cmp $EXIT_REASON_TDCALL, %ax
+
+ /* Reload @regs to RAX. */
+ mov 8(%rsp), %rax
+
+ /* Jump on non-TDVMCALL */
+ jne 3f
+
+ /* Save all output from SEAMCALL(TDENTER) */
+ mov %rbx, VCPU_RBX(%rax)
+ mov %rbp, VCPU_RBP(%rax)
+ mov %rsi, VCPU_RSI(%rax)
+ mov %rdi, VCPU_RDI(%rax)
+ mov %r10, VCPU_R10(%rax)
+ mov %r11, VCPU_R11(%rax)
+ mov %r12, VCPU_R12(%rax)
+ mov %r13, VCPU_R13(%rax)
+ mov %r14, VCPU_R14(%rax)
+ mov %r15, VCPU_R15(%rax)
+
+3: mov %rcx, VCPU_RCX(%rax)
+ mov %rdx, VCPU_RDX(%rax)
+ mov %r8, VCPU_R8 (%rax)
+ mov %r9, VCPU_R9 (%rax)
+
+ /*
+ * Clear all general purpose registers except RSP and RAX to prevent
+ * speculative use of the guest's values.
+ */
+ xor %rbx, %rbx
+ xor %rcx, %rcx
+ xor %rdx, %rdx
+ xor %rsi, %rsi
+ xor %rdi, %rdi
+ xor %rbp, %rbp
+ xor %r8, %r8
+ xor %r9, %r9
+ xor %r10, %r10
+ xor %r11, %r11
+ xor %r12, %r12
+ xor %r13, %r13
+ xor %r14, %r14
+ xor %r15, %r15
+
+ /* Restore the TD-Exit reason to RAX for return. */
+ pop %rax
+
+ /* "POP" @regs. */
+4: add $8, %rsp
+ pop %rbx
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ pop %rbp
+ ret
+
+5: cmpb $0, kvm_rebooting
+ je 6f
+ mov $-EFAULT, %rax
+ jmp 4b
+6: ud2
+ _ASM_EXTABLE(2b, 5b)
+
+SYM_FUNC_END(__tdx_vcpu_run)
+
+.popsection
+
+#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3ebed784eac..ba69abcc663a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -261,6 +261,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
};

u64 __read_mostly host_xcr0;
+EXPORT_SYMBOL_GPL(host_xcr0);
u64 __read_mostly supported_xcr0;
EXPORT_SYMBOL_GPL(supported_xcr0);

@@ -2187,9 +2188,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
u64 ratio;

/* Guest TSC same frequency as host TSC? */
- if (!scale || vcpu->kvm->arch.tsc_immutable) {
- if (scale)
- pr_warn_ratelimited("Guest TSC immutable, scaling not supported\n");
+ if (!scale) {
vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
return 0;
}
@@ -10214,7 +10213,8 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
{
int ret;

- if (vcpu->arch.guest_state_protected)
+ if (vcpu->arch.guest_state_protected ||
+ vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM)
return -EINVAL;

vcpu_load(vcpu);
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 8341ec720b3f..45cb24efe1ef 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -494,4 +494,55 @@ struct kvm_pmu_event_filter {
#define KVM_X86_SEV_ES_VM 1
#define KVM_X86_TDX_VM 2

+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ __u32 id;
+ __u32 metadata;
+ __u64 data;
+};
+
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
+struct kvm_tdx_init_vm {
+ __u32 max_vcpus;
+ __u32 reserved;
+ __u64 attributes;
+ __u64 cpuid;
+};
+
+#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1

2021-07-02 22:10:16

by Isaku Yamahata

Subject: [RFC PATCH v2 68/69] KVM: TDX: add document on TDX MODULE

From: Isaku Yamahata <[email protected]>

Add a document on how to integrate the TDX MODULE into the initrd so that
the TDX MODULE can be updated at kernel startup.
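The mechanism relies on the kernel's support for concatenated early cpio
archives; a minimal sketch of the concatenation step (throwaway paths and a
dummy payload, not the real module files) looks like:

```shell
# Build a tiny cpio archive and prepend it to an existing initrd blob,
# as the script in the document below does; the tail of the combined
# image is the untouched original initrd, which the kernel still parses.
tmp=$(mktemp -d)
mkdir -p "$tmp/tree/lib/firmware/intel-seam"
echo "dummy module payload" > "$tmp/tree/lib/firmware/intel-seam/libtdx.so"
(cd "$tmp/tree" && find . | cpio -o -H newc) > "$tmp/extra.cpio" 2>/dev/null
printf 'original-initrd-bytes' > "$tmp/initrd.orig"
cat "$tmp/extra.cpio" "$tmp/initrd.orig" > "$tmp/initrd.new"
tail -c 21 "$tmp/initrd.new"
```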

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/tdx-module.rst | 48 +++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
create mode 100644 Documentation/virt/kvm/tdx-module.rst

diff --git a/Documentation/virt/kvm/tdx-module.rst b/Documentation/virt/kvm/tdx-module.rst
new file mode 100644
index 000000000000..8beea8302f94
--- /dev/null
+++ b/Documentation/virt/kvm/tdx-module.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+TDX MODULE
+==========
+
+Integrating TDX MODULE into initrd
+==================================
+If TDX is enabled in KVM (CONFIG_KVM_INTEL_TDX=y), the kernel is able to load
+the TDX SEAM module from the initrd.
+The related files (seamldr.acm, libtdx.so and libtdx.so.sigstruct) need to be
+stored in the initrd.
+
+tdx-seam is a sample hook script for initramfs-tools.
+TDXSEAM_SRCDIR is the directory in the host file system that stores the files
+related to the TDX MODULE.
+
+Since how to prepare the initrd heavily depends on the distribution, here is an
+example of how to prepare one.
+(This is adapted from Documentation/x86/microcode.rst.)
+::
+ #!/bin/bash
+
+ if [ -z "$1" ]; then
+ echo "You need to supply an initrd file"
+ exit 1
+ fi
+
+ INITRD="$1"
+
+ DSTDIR=lib/firmware/intel-seam
+ TMPDIR=/tmp/initrd
+ LIBTDX="/lib/firmware/intel-seam/seamldr.acm /lib/firmware/intel-seam/libtdx.so /lib/firmware/intel-seam/libtdx.so.sigstruct"
+
+ rm -rf $TMPDIR
+
+ mkdir $TMPDIR
+ cd $TMPDIR
+ mkdir -p $DSTDIR
+
+ cp ${LIBTDX} ${DSTDIR}
+
+ find . | cpio -o -H newc > ../tdx-seam.cpio
+ cd ..
+ mv $INITRD $INITRD.orig
+ cat tdx-seam.cpio $INITRD.orig > $INITRD
+
+ rm -rf $TMPDIR
--
2.25.1

2021-07-02 22:10:23

by Isaku Yamahata

Subject: [RFC PATCH v2 36/69] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

From: Sean Christopherson <[email protected]>

Add a flag, KVM_DEBUGREG_AUTO_SWITCH_GUEST, to skip saving/restoring DRs
irrespective of any other flags. TDX-SEAM unconditionally saves and
restores guest DRs and resets them to the architectural INIT state on TD
exit. So, KVM needs to save host DRs before TD entry without restoring
guest DRs, and restore host DRs after TD exit.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Chao Gao <[email protected]>
Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 ++++++++---
arch/x86/kvm/x86.c | 3 ++-
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 96e6cd95d884..7822b531a5e2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -488,9 +488,14 @@ struct kvm_pmu {
struct kvm_pmu_ops;

enum {
- KVM_DEBUGREG_BP_ENABLED = 1,
- KVM_DEBUGREG_WONT_EXIT = 2,
- KVM_DEBUGREG_RELOAD = 4,
+ KVM_DEBUGREG_BP_ENABLED = BIT(0),
+ KVM_DEBUGREG_WONT_EXIT = BIT(1),
+ KVM_DEBUGREG_RELOAD = BIT(2),
+ /*
+	 * Guest debug registers are saved/restored by hardware on exit from
+	 * or entry to the guest. KVM needn't switch them.
+ */
+ KVM_DEBUGREG_AUTO_SWITCH_GUEST = BIT(3),
};

struct kvm_mtrr_range {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b436cae1732..f1d5e0a53640 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9441,7 +9441,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();

- if (unlikely(vcpu->arch.switch_db_regs)) {
+ if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH_GUEST)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -9473,6 +9473,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+ WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH_GUEST);
static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
kvm_update_dr0123(vcpu);
kvm_update_dr7(vcpu);
--
2.25.1

2021-07-02 22:10:25

by Isaku Yamahata

Subject: [RFC PATCH v2 40/69] KVM: Export kvm_is_reserved_pfn() for use by TDX

From: Sean Christopherson <[email protected]>

TDX will use kvm_is_reserved_pfn() to prevent installing a reserved PFN
into the SEPT. Or rather, to prevent such an attempt, as reserved PFNs are
not covered by TDMRs.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
virt/kvm/kvm_main.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b075b5e7303..dd6492b526c9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -188,6 +188,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)

return true;
}
+EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);

bool kvm_is_transparent_hugepage(kvm_pfn_t pfn)
{
--
2.25.1

2021-07-02 22:10:27

by Isaku Yamahata

Subject: [RFC PATCH v2 58/69] KVM: TDX: Define TDCALL exit reason

From: Sean Christopherson <[email protected]>

Define the TDCALL exit reason, which is carved out of the VMX exit
reason namespace, as the TDCALL exit from a TDX guest to TDX-SEAM is
really just a VM-Exit.

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/vmx.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 946d761adbd3..ba5908dfc7c0 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -91,6 +91,7 @@
#define EXIT_REASON_UMWAIT 67
#define EXIT_REASON_TPAUSE 68
#define EXIT_REASON_BUS_LOCK 74
+#define EXIT_REASON_TDCALL 77

#define VMX_EXIT_REASONS \
{ EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
@@ -153,7 +154,8 @@
{ EXIT_REASON_XRSTORS, "XRSTORS" }, \
{ EXIT_REASON_UMWAIT, "UMWAIT" }, \
{ EXIT_REASON_TPAUSE, "TPAUSE" }, \
- { EXIT_REASON_BUS_LOCK, "BUS_LOCK" }
+ { EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \
+ { EXIT_REASON_TDCALL, "TDCALL" }

#define VMX_EXIT_REASON_FLAGS \
{ VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
--
2.25.1

2021-07-02 22:10:28

by Isaku Yamahata

Subject: [RFC PATCH v2 43/69] KVM: x86/mmu: Allow non-zero init value for shadow PTE

From: Sean Christopherson <[email protected]>

TDX will run with EPT violation #VEs enabled, which means KVM needs to
set the "suppress #VE" bit in unused PTEs to avoid unintentionally
reflecting not-present EPT violations into the guest.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 50 +++++++++++++++++++++++++++++++++++------
arch/x86/kvm/mmu/spte.c | 10 +++++++++
arch/x86/kvm/mmu/spte.h | 2 ++
4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 69b82857acdb..6ec8d9fdff35 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -61,6 +61,7 @@ static __always_inline u64 rsvd_bits(int s, int e)

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_spte_init_value(u64 init_value);

void
reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 631b92e6e9ba..1c40dfd05979 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -550,9 +550,9 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
u64 old_spte = *sptep;

if (!spte_has_volatile_bits(old_spte))
- __update_clear_spte_fast(sptep, 0ull);
+ __update_clear_spte_fast(sptep, shadow_init_value);
else
- old_spte = __update_clear_spte_slow(sptep, 0ull);
+ old_spte = __update_clear_spte_slow(sptep, shadow_init_value);

if (!is_shadow_present_pte(old_spte))
return 0;
@@ -582,7 +582,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
*/
static void mmu_spte_clear_no_track(u64 *sptep)
{
- __update_clear_spte_fast(sptep, 0ull);
+ __update_clear_spte_fast(sptep, shadow_init_value);
}

static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -660,6 +660,42 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
local_irq_enable();
}

+static inline void kvm_init_shadow_page(void *page)
+{
+#ifdef CONFIG_X86_64
+ int ign;
+
+ asm volatile (
+ "rep stosq\n\t"
+ : "=c"(ign), "=D"(page)
+ : "a"(shadow_init_value), "c"(4096/8), "D"(page)
+ : "memory"
+ );
+#else
+ BUG();
+#endif
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+ int start, end, i, r;
+
+ if (shadow_init_value)
+ start = kvm_mmu_memory_cache_nr_free_objects(mc);
+
+ r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+
+ if (shadow_init_value) {
+ end = kvm_mmu_memory_cache_nr_free_objects(mc);
+ for (i = start; i < end; i++)
+ kvm_init_shadow_page(mc->objects[i]);
+ }
+ return 0;
+}
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -669,8 +705,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_shadow_page_cache(vcpu);
if (r)
return r;
if (maybe_indirect) {
@@ -3041,7 +3076,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
int ret = RET_PF_INVALID;
- u64 spte = 0ull;
+ u64 spte = shadow_init_value;
uint retry_count = 0;

if (!page_fault_can_be_fast(error_code))
@@ -5383,7 +5418,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ if (!shadow_init_value)
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 66d43cec0c31..0b931f1c2210 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -34,6 +34,7 @@ u64 __read_mostly shadow_mmio_access_mask;
u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;
+u64 __read_mostly shadow_init_value;

u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -211,6 +212,14 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
return new_spte;
}

+void kvm_mmu_set_spte_init_value(u64 init_value)
+{
+ if (WARN_ON(!IS_ENABLED(CONFIG_X86_64) && init_value))
+ init_value = 0;
+ shadow_init_value = init_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_spte_init_value);
+
static u8 kvm_get_shadow_phys_bits(void)
{
/*
@@ -355,6 +364,7 @@ void kvm_mmu_reset_all_pte_masks(void)
shadow_present_mask = PT_PRESENT_MASK;
shadow_acc_track_mask = 0;
shadow_me_mask = sme_me_mask;
+ shadow_init_value = 0;

shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITEABLE;
shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITEABLE;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index bca0ba11cccf..f88cf3db31c7 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -152,6 +152,8 @@ extern u64 __read_mostly shadow_mmio_access_mask;
extern u64 __read_mostly shadow_present_mask;
extern u64 __read_mostly shadow_me_mask;

+extern u64 __read_mostly shadow_init_value;
+
/*
* SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
* shadow_acc_track_mask is the set of bits to be cleared in non-accessed
--
2.25.1

2021-07-02 22:10:49

by Isaku Yamahata

Subject: [RFC PATCH v2 51/69] KVM: x86/mmu: Allow per-VM override of the TDP max page level

From: Sean Christopherson <[email protected]>

TODO: This is a tentative patch. Add large page support and then delete this patch.

Allow TDX to effectively disable large pages, as SEPT will initially
support only 4k pages.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9631b985ebdc..a47e17892258 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -989,6 +989,7 @@ struct kvm_arch {
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ int tdp_max_page_level;
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 82db62753acb..4ee6d7803f18 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4084,7 +4084,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
kvm_pfn_t pfn;
int max_level;

- for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
+ for (max_level = vcpu->kvm->arch.tdp_max_page_level;
max_level > PG_LEVEL_4K;
max_level--) {
int page_num = KVM_PAGES_PER_HPAGE(max_level);
@@ -5802,6 +5802,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
node->track_write = kvm_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
+
+ kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
}

void kvm_mmu_uninit_vm(struct kvm *kvm)
--
2.25.1

2021-07-02 22:10:54

by Isaku Yamahata

Subject: [RFC PATCH v2 57/69] KVM: VMX: Move register caching logic to common code

From: Sean Christopherson <[email protected]>

Move the guts of vmx_cache_reg() to vt_cache_reg() in preparation for
reusing the bulk of the code for TDX, which can access guest state for
debug TDs.

Use kvm_x86_ops.cache_reg() in ept_update_paging_mode_cr0() rather than
trying to expose vt_cache_reg() to vmx.c, even though it means taking a
retpoline. The code runs if and only if EPT is enabled but unrestricted
guest. Only one generation of CPU, Nehalem, supports EPT but not
unrestricted guest, and disabling unrestricted guest without also
disabling EPT is, to put it bluntly, dumb.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 37 +++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 42 +----------------------------------------
2 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0d8d2a0a2979..b619615f77de 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -341,7 +341,42 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)

static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
- vmx_cache_reg(vcpu, reg);
+ unsigned long guest_owned_bits;
+
+ kvm_register_mark_available(vcpu, reg);
+
+ switch (reg) {
+ case VCPU_REGS_RSP:
+ vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+ break;
+ case VCPU_REGS_RIP:
+ vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
+ break;
+ case VCPU_EXREG_PDPTR:
+ if (enable_ept)
+ ept_save_pdptrs(vcpu);
+ break;
+ case VCPU_EXREG_CR0:
+ guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
+
+ vcpu->arch.cr0 &= ~guest_owned_bits;
+ vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
+ break;
+ case VCPU_EXREG_CR3:
+ if (is_unrestricted_guest(vcpu) ||
+ (enable_ept && is_paging(vcpu)))
+ vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+ break;
+ case VCPU_EXREG_CR4:
+ guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
+
+ vcpu->arch.cr4 &= ~guest_owned_bits;
+ vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
+ break;
+ default:
+ KVM_BUG_ON(1, vcpu->kvm);
+ break;
+ }
}

static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e315a46d1566..3c3bfc80d2bb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2326,46 +2326,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return ret;
}

-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
-{
- unsigned long guest_owned_bits;
-
- kvm_register_mark_available(vcpu, reg);
-
- switch (reg) {
- case VCPU_REGS_RSP:
- vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
- break;
- case VCPU_REGS_RIP:
- vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
- break;
- case VCPU_EXREG_PDPTR:
- if (enable_ept)
- ept_save_pdptrs(vcpu);
- break;
- case VCPU_EXREG_CR0:
- guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
-
- vcpu->arch.cr0 &= ~guest_owned_bits;
- vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
- break;
- case VCPU_EXREG_CR3:
- if (is_unrestricted_guest(vcpu) ||
- (enable_ept && is_paging(vcpu)))
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
- break;
- case VCPU_EXREG_CR4:
- guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
-
- vcpu->arch.cr4 &= ~guest_owned_bits;
- vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
- break;
- default:
- KVM_BUG_ON(1, vcpu->kvm);
- break;
- }
-}
-
static __init int vmx_disabled_by_bios(void)
{
return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
@@ -3066,7 +3026,7 @@ static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
struct vcpu_vmx *vmx = to_vmx(vcpu);

if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
- vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
+ kvm_x86_ops.cache_reg(vcpu, VCPU_EXREG_CR3);
if (!(cr0 & X86_CR0_PG)) {
/* From paging/starting to nonpaging */
exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
--
2.25.1

2021-07-02 22:11:03

by Isaku Yamahata

Subject: [RFC PATCH v2 03/69] KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86

From: Isaku Yamahata <[email protected]>

This is preparation for TDX support.

Because the SEAMCALL instruction requires VMX to be enabled, the seamloader
needs to initialize a struct vmcs and load it before issuing SEAMCALL. [1] [2]
Move the definition of struct vmcs into a common x86 header,
arch/x86/include/asm/vmx.h, so that the seamloader code can share the same
definition.

[1] Intel Trust Domain CPU Architectural Extensions
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf

[2] TDX Module spec
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 11 +++++++++++
arch/x86/kvm/vmx/vmcs.h | 11 -----------
2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..035dfdafa2c1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -17,6 +17,17 @@
#include <uapi/asm/vmx.h>
#include <asm/vmxfeatures.h>

+struct vmcs_hdr {
+ u32 revision_id:31;
+ u32 shadow_vmcs:1;
+};
+
+struct vmcs {
+ struct vmcs_hdr hdr;
+ u32 abort;
+ char data[];
+};
+
#define VMCS_CONTROL_BIT(x) BIT(VMX_FEATURE_##x & 0x1f)

/*
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 1472c6c376f7..ac09bc4996a5 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -11,17 +11,6 @@

#include "capabilities.h"

-struct vmcs_hdr {
- u32 revision_id:31;
- u32 shadow_vmcs:1;
-};
-
-struct vmcs {
- struct vmcs_hdr hdr;
- u32 abort;
- char data[];
-};
-
DECLARE_PER_CPU(struct vmcs *, current_vmcs);

/*
--
2.25.1

2021-07-02 22:11:03

by Isaku Yamahata

Subject: [RFC PATCH v2 54/69] KVM: VMX: Define VMCS encodings for shared EPT pointer

From: Sean Christopherson <[email protected]>

Add the VMCS field encoding for the shared EPTP, which will be used by
TDX to have separate EPT walks for private GPAs (existing EPTP) versus
shared GPAs (new shared EPTP).

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 132981276a2f..56b3d32941fd 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -235,6 +235,8 @@ enum vmcs_field {
ENCLS_EXITING_BITMAP_HIGH = 0x0000202F,
TSC_MULTIPLIER = 0x00002032,
TSC_MULTIPLIER_HIGH = 0x00002033,
+ SHARED_EPT_POINTER = 0x0000203C,
+ SHARED_EPT_POINTER_HIGH = 0x0000203D,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
VMCS_LINK_POINTER = 0x00002800,
--
2.25.1

2021-07-02 22:11:08

by Isaku Yamahata

Subject: [RFC PATCH v2 55/69] KVM: VMX: Add 'main.c' to wrap VMX and TDX

From: Sean Christopherson <[email protected]>

Wrap the VMX kvm_x86_ops hooks in preparation for adding TDX, which can
coexist with VMX, i.e. KVM can run both VMs and TDs. Use 'vt' for the
naming scheme as a nod to VT-x and as a concatenation of VmxTdx.

Reported-by: kernel test robot <[email protected]>
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/main.c | 710 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 283 ++++------------
3 files changed, 768 insertions(+), 227 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 60f3e90fef8b..3c3815bd0f56 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -21,7 +21,7 @@ kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
kvm-$(CONFIG_KVM_XEN) += xen.o

-kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
+kvm-intel-y += vmx/main.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-intel-$(CONFIG_KVM_INTEL_TDX) += vmx/seamcall.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..c3fefa0e5a63
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,710 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+static struct kvm_x86_ops vt_x86_ops __initdata;
+
+#include "vmx.c"
+
+static int __init vt_cpu_has_kvm_support(void)
+{
+ return cpu_has_vmx();
+}
+
+static int __init vt_disabled_by_bios(void)
+{
+ return vmx_disabled_by_bios();
+}
+
+static int __init vt_check_processor_compatibility(void)
+{
+ int ret;
+
+ ret = vmx_check_processor_compat();
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static __init int vt_hardware_setup(void)
+{
+ int ret;
+
+ ret = hardware_setup();
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static int vt_hardware_enable(void)
+{
+ return hardware_enable();
+}
+
+static void vt_hardware_disable(void)
+{
+ hardware_disable();
+}
+
+static bool vt_cpu_has_accelerated_tpr(void)
+{
+ return report_flexpriority();
+}
+
+static bool vt_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_LEGACY_VM;
+}
+
+static int vt_vm_init(struct kvm *kvm)
+{
+ return vmx_vm_init(kvm);
+}
+
+static void vt_vm_teardown(struct kvm *kvm)
+{
+
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ return vmx_create_vcpu(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
+{
+ return vmx_vcpu_run(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ return vmx_free_vcpu(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ return vmx_vcpu_reset(vcpu, init_event);
+}
+
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ return vmx_vcpu_load(vcpu, cpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ return vmx_vcpu_put(vcpu);
+}
+
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath)
+{
+ return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ vmx_handle_exit_irqoff(vcpu);
+}
+
+static int vt_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+ return vmx_skip_emulated_instruction(vcpu);
+}
+
+static void vt_update_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+ vmx_update_emulated_instruction(vcpu);
+}
+
+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ return vmx_set_msr(vcpu, msr_info);
+}
+
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_pre_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+{
+ return vmx_pre_enter_smm(vcpu, smstate);
+}
+
+static int vt_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+{
+ return vmx_pre_leave_smm(vcpu, smstate);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+ /* RSM will cause a vmexit anyway. */
+ vmx_enable_smi_window(vcpu);
+}
+
+static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn,
+ int insn_len)
+{
+ return vmx_can_emulate_instruction(vcpu, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
+{
+ return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+ return vmx_apic_init_signal_blocked(vcpu);
+}
+
+static void vt_migrate_timers(struct kvm_vcpu *vcpu)
+{
+ vmx_migrate_timers(vcpu);
+}
+
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ return vmx_set_virtual_apic_mode(vcpu);
+}
+
+static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+ return vmx_apicv_post_state_restore(vcpu);
+}
+
+static bool vt_check_apicv_inhibit_reasons(ulong bit)
+{
+ return vmx_check_apicv_inhibit_reasons(bit);
+}
+
+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+ return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+{
+ return vmx_hwapic_isr_update(vcpu, max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ return vmx_guest_apic_has_interrupt(vcpu);
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+ return vmx_sync_pir_to_irr(vcpu);
+}
+
+static int vt_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
+{
+ return vmx_deliver_posted_interrupt(vcpu, vector);
+}
+
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ return vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+ return vmx_has_emulated_msr(kvm, index);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+ vmx_msr_filter_changed(vcpu);
+}
+
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ vmx_update_exception_bitmap(vcpu);
+}
+
+static int vt_get_msr_feature(struct kvm_msr_entry *msr)
+{
+ return vmx_get_msr_feature(msr);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ return vmx_get_msr(vcpu, msr_info);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+ return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+ vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ vmx_set_cr0(vcpu, cr0);
+}
+
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+ int pgd_level)
+{
+ vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ vmx_set_cr4(vcpu, cr4);
+}
+
+static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ return vmx_is_valid_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+ vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+ return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ vmx_set_rflags(vcpu, rflags);
+}
+
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+ vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+ vmx_flush_tlb_current(vcpu);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+ vmx_flush_tlb_guest(vcpu);
+}
+
+static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+ vmx_set_interrupt_shadow(vcpu, mask);
+}
+
+static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+ return vmx_get_interrupt_shadow(vcpu);
+}
+
+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+ unsigned char *hypercall)
+{
+ vmx_patch_hypercall(vcpu, hypercall);
+}
+
+static void vt_inject_irq(struct kvm_vcpu *vcpu)
+{
+ vmx_inject_irq(vcpu);
+}
+
+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ vmx_inject_nmi(vcpu);
+}
+
+static void vt_queue_exception(struct kvm_vcpu *vcpu)
+{
+ vmx_queue_exception(vcpu);
+}
+
+static void vt_cancel_injection(struct kvm_vcpu *vcpu)
+{
+ vmx_cancel_injection(vcpu);
+}
+
+static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ return vmx_interrupt_allowed(vcpu, for_injection);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+ return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+ vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+ vmx_enable_nmi_window(vcpu);
+}
+
+static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
+{
+ vmx_enable_irq_window(vcpu);
+}
+
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+ vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+ vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+ vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+ vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+ return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+ return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
+static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+ return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2,
+ u32 *intr_info, u32 *error_code)
+{
+
+ return vmx_get_exit_info(vcpu, info1, info2, intr_info, error_code);
+}
+
+static u64 vt_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+ return vmx_write_l1_tsc_offset(vcpu, offset);
+}
+
+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+ vmx_request_immediate_exit(vcpu);
+}
+
+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+ vmx_sched_in(vcpu, cpu);
+}
+
+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+ vmx_update_cpu_dirty_logging(vcpu);
+}
+
+static int vt_pre_block(struct kvm_vcpu *vcpu)
+{
+ if (pi_pre_block(vcpu))
+ return 1;
+
+ return vmx_pre_block(vcpu);
+}
+
+static void vt_post_block(struct kvm_vcpu *vcpu)
+{
+ vmx_post_block(vcpu);
+
+ pi_post_block(vcpu);
+}
+
+
+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
+{
+ return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+ vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
+static void vt_setup_mce(struct kvm_vcpu *vcpu)
+{
+ vmx_setup_mce(vcpu);
+}
+
+static struct kvm_x86_ops vt_x86_ops __initdata = {
+ .hardware_unsetup = hardware_unsetup,
+
+ .hardware_enable = vt_hardware_enable,
+ .hardware_disable = vt_hardware_disable,
+ .cpu_has_accelerated_tpr = vt_cpu_has_accelerated_tpr,
+ .has_emulated_msr = vt_has_emulated_msr,
+
+ .is_vm_type_supported = vt_is_vm_type_supported,
+ .vm_size = sizeof(struct kvm_vmx),
+ .vm_init = vt_vm_init,
+ .vm_teardown = vt_vm_teardown,
+ .vm_destroy = vt_vm_destroy,
+
+ .vcpu_create = vt_vcpu_create,
+ .vcpu_free = vt_vcpu_free,
+ .vcpu_reset = vt_vcpu_reset,
+
+ .prepare_guest_switch = vt_prepare_switch_to_guest,
+ .vcpu_load = vt_vcpu_load,
+ .vcpu_put = vt_vcpu_put,
+
+ .update_exception_bitmap = vt_update_exception_bitmap,
+ .get_msr_feature = vt_get_msr_feature,
+ .get_msr = vt_get_msr,
+ .set_msr = vt_set_msr,
+ .get_segment_base = vt_get_segment_base,
+ .get_segment = vt_get_segment,
+ .set_segment = vt_set_segment,
+ .get_cpl = vt_get_cpl,
+ .get_cs_db_l_bits = vt_get_cs_db_l_bits,
+ .set_cr0 = vt_set_cr0,
+ .is_valid_cr4 = vt_is_valid_cr4,
+ .set_cr4 = vt_set_cr4,
+ .set_efer = vt_set_efer,
+ .get_idt = vt_get_idt,
+ .set_idt = vt_set_idt,
+ .get_gdt = vt_get_gdt,
+ .set_gdt = vt_set_gdt,
+ .set_dr7 = vt_set_dr7,
+ .sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+ .cache_reg = vt_cache_reg,
+ .get_rflags = vt_get_rflags,
+ .set_rflags = vt_set_rflags,
+
+ .tlb_flush_all = vt_flush_tlb_all,
+ .tlb_flush_current = vt_flush_tlb_current,
+ .tlb_flush_gva = vt_flush_tlb_gva,
+ .tlb_flush_guest = vt_flush_tlb_guest,
+
+ .run = vt_vcpu_run,
+ .handle_exit = vt_handle_exit,
+ .skip_emulated_instruction = vt_skip_emulated_instruction,
+ .update_emulated_instruction = vt_update_emulated_instruction,
+ .set_interrupt_shadow = vt_set_interrupt_shadow,
+ .get_interrupt_shadow = vt_get_interrupt_shadow,
+ .patch_hypercall = vt_patch_hypercall,
+ .set_irq = vt_inject_irq,
+ .set_nmi = vt_inject_nmi,
+ .queue_exception = vt_queue_exception,
+ .cancel_injection = vt_cancel_injection,
+ .interrupt_allowed = vt_interrupt_allowed,
+ .nmi_allowed = vt_nmi_allowed,
+ .get_nmi_mask = vt_get_nmi_mask,
+ .set_nmi_mask = vt_set_nmi_mask,
+ .enable_nmi_window = vt_enable_nmi_window,
+ .enable_irq_window = vt_enable_irq_window,
+ .update_cr8_intercept = vt_update_cr8_intercept,
+ .set_virtual_apic_mode = vt_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vt_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vt_load_eoi_exitmap,
+ .apicv_post_state_restore = vt_apicv_post_state_restore,
+ .check_apicv_inhibit_reasons = vt_check_apicv_inhibit_reasons,
+ .hwapic_irr_update = vt_hwapic_irr_update,
+ .hwapic_isr_update = vt_hwapic_isr_update,
+ .guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
+ .sync_pir_to_irr = vt_sync_pir_to_irr,
+ .deliver_posted_interrupt = vt_deliver_posted_interrupt,
+ .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+ .set_tss_addr = vt_set_tss_addr,
+ .set_identity_map_addr = vt_set_identity_map_addr,
+ .get_mt_mask = vt_get_mt_mask,
+
+ .get_exit_info = vt_get_exit_info,
+
+ .vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
+
+ .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+ .write_l1_tsc_offset = vt_write_l1_tsc_offset,
+
+ .load_mmu_pgd = vt_load_mmu_pgd,
+
+ .check_intercept = vt_check_intercept,
+ .handle_exit_irqoff = vt_handle_exit_irqoff,
+
+ .request_immediate_exit = vt_request_immediate_exit,
+
+ .sched_in = vt_sched_in,
+
+ .cpu_dirty_log_size = PML_ENTITY_NUM,
+ .update_cpu_dirty_logging = vt_update_cpu_dirty_logging,
+
+ .pre_block = vt_pre_block,
+ .post_block = vt_post_block,
+
+ .pmu_ops = &intel_pmu_ops,
+ .nested_ops = &vmx_nested_ops,
+
+ .update_pi_irte = pi_update_irte,
+
+#ifdef CONFIG_X86_64
+ .set_hv_timer = vt_set_hv_timer,
+ .cancel_hv_timer = vt_cancel_hv_timer,
+#endif
+
+ .setup_mce = vt_setup_mce,
+
+ .smi_allowed = vt_smi_allowed,
+ .pre_enter_smm = vt_pre_enter_smm,
+ .pre_leave_smm = vt_pre_leave_smm,
+ .enable_smi_window = vt_enable_smi_window,
+
+ .can_emulate_instruction = vt_can_emulate_instruction,
+ .apic_init_signal_blocked = vt_apic_init_signal_blocked,
+ .migrate_timers = vt_migrate_timers,
+
+ .msr_filter_changed = vt_msr_filter_changed,
+ .complete_emulated_msr = kvm_complete_insn_gp,
+
+ .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+static struct kvm_x86_init_ops vt_init_ops __initdata = {
+ .cpu_has_kvm_support = vt_cpu_has_kvm_support,
+ .disabled_by_bios = vt_disabled_by_bios,
+ .check_processor_compatibility = vt_check_processor_compatibility,
+ .hardware_setup = vt_hardware_setup,
+
+ .runtime_ops = &vt_x86_ops,
+};
+
+static int __init vt_init(void)
+{
+ unsigned int vcpu_size = 0, vcpu_align = 0;
+ int r;
+
+ vmx_pre_kvm_init(&vcpu_size, &vcpu_align);
+
+ r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
+ if (r)
+ goto err_vmx_post_exit;
+
+ r = vmx_init();
+ if (r)
+ goto err_kvm_exit;
+
+ return 0;
+
+err_kvm_exit:
+ kvm_exit();
+err_vmx_post_exit:
+ vmx_post_kvm_exit();
+ return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+ vmx_exit();
+ kvm_exit();
+ vmx_post_kvm_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8a104a54121b..77b2b2cf76db 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2366,11 +2366,6 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
}
}

-static __init int cpu_has_kvm_support(void)
-{
- return cpu_has_vmx();
-}
-
static __init int vmx_disabled_by_bios(void)
{
return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
@@ -6891,11 +6886,6 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
return err;
}

-static bool vmx_is_vm_type_supported(unsigned long type)
-{
- return type == KVM_X86_LEGACY_VM;
-}
-
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

@@ -6935,16 +6925,6 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}

-static void vmx_vm_teardown(struct kvm *kvm)
-{
-
-}
-
-static void vmx_vm_destroy(struct kvm *kvm)
-{
-
-}
-
static int __init vmx_check_processor_compat(void)
{
struct vmcs_config vmcs_conf;
@@ -7447,9 +7427,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)

static int vmx_pre_block(struct kvm_vcpu *vcpu)
{
- if (pi_pre_block(vcpu))
- return 1;
-
if (kvm_lapic_hv_timer_in_use(vcpu))
kvm_lapic_switch_to_sw_timer(vcpu);

@@ -7460,8 +7437,6 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
{
if (kvm_x86_ops.set_hv_timer)
kvm_lapic_switch_to_hv_timer(vcpu);
-
- pi_post_block(vcpu);
}

static void vmx_setup_mce(struct kvm_vcpu *vcpu)
@@ -7552,142 +7527,6 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
return supported & BIT(bit);
}

-static struct kvm_x86_ops vmx_x86_ops __initdata = {
- .hardware_unsetup = hardware_unsetup,
-
- .hardware_enable = hardware_enable,
- .hardware_disable = hardware_disable,
- .cpu_has_accelerated_tpr = report_flexpriority,
- .has_emulated_msr = vmx_has_emulated_msr,
-
- .is_vm_type_supported = vmx_is_vm_type_supported,
- .vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
- .vm_teardown = vmx_vm_teardown,
- .vm_destroy = vmx_vm_destroy,
-
- .vcpu_create = vmx_create_vcpu,
- .vcpu_free = vmx_free_vcpu,
- .vcpu_reset = vmx_vcpu_reset,
-
- .prepare_guest_switch = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
-
- .update_exception_bitmap = vmx_update_exception_bitmap,
- .get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
-
- .tlb_flush_all = vmx_flush_tlb_all,
- .tlb_flush_current = vmx_flush_tlb_current,
- .tlb_flush_gva = vmx_flush_tlb_gva,
- .tlb_flush_guest = vmx_flush_tlb_guest,
-
- .run = vmx_vcpu_run,
- .handle_exit = vmx_handle_exit,
- .skip_emulated_instruction = vmx_skip_emulated_instruction,
- .update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
- .set_irq = vmx_inject_irq,
- .set_nmi = vmx_inject_nmi,
- .queue_exception = vmx_queue_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_post_state_restore = vmx_apicv_post_state_restore,
- .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_posted_interrupt = vmx_deliver_posted_interrupt,
- .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
-
- .get_exit_info = vmx_get_exit_info,
-
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
- .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
- .write_l1_tsc_offset = vmx_write_l1_tsc_offset,
-
- .load_mmu_pgd = vmx_load_mmu_pgd,
-
- .check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
-
- .request_immediate_exit = vmx_request_immediate_exit,
-
- .sched_in = vmx_sched_in,
-
- .cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
- .pre_block = vmx_pre_block,
- .post_block = vmx_post_block,
-
- .pmu_ops = &intel_pmu_ops,
- .nested_ops = &vmx_nested_ops,
-
- .update_pi_irte = pi_update_irte,
- .start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
- .setup_mce = vmx_setup_mce,
-
- .smi_allowed = vmx_smi_allowed,
- .pre_enter_smm = vmx_pre_enter_smm,
- .pre_leave_smm = vmx_pre_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
-
- .can_emulate_instruction = vmx_can_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
- .migrate_timers = vmx_migrate_timers,
-
- .msr_filter_changed = vmx_msr_filter_changed,
- .complete_emulated_msr = kvm_complete_insn_gp,
-
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
static __init void vmx_setup_user_return_msrs(void)
{

@@ -7768,16 +7607,16 @@ static __init int hardware_setup(void)
* using the APIC_ACCESS_ADDR VMCS field.
*/
if (!flexpriority_enabled)
- vmx_x86_ops.set_apic_access_page_addr = NULL;
+ vt_x86_ops.set_apic_access_page_addr = NULL;

if (!cpu_has_vmx_tpr_shadow())
- vmx_x86_ops.update_cr8_intercept = NULL;
+ vt_x86_ops.update_cr8_intercept = NULL;

#if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
&& enable_ept) {
- vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
- vmx_x86_ops.tlb_remote_flush_with_range =
+ vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
+ vt_x86_ops.tlb_remote_flush_with_range =
hv_remote_flush_tlb_with_range;
}
#endif
@@ -7792,7 +7631,7 @@ static __init int hardware_setup(void)

if (!cpu_has_vmx_apicv()) {
enable_apicv = 0;
- vmx_x86_ops.sync_pir_to_irr = NULL;
+ vt_x86_ops.sync_pir_to_irr = NULL;
}

if (cpu_has_vmx_tsc_scaling()) {
@@ -7827,7 +7666,7 @@ static __init int hardware_setup(void)
enable_pml = 0;

if (!enable_pml)
- vmx_x86_ops.cpu_dirty_log_size = 0;
+ vt_x86_ops.cpu_dirty_log_size = 0;

if (!cpu_has_vmx_preemption_timer())
enable_preemption_timer = false;
@@ -7854,9 +7693,9 @@ static __init int hardware_setup(void)
}

if (!enable_preemption_timer) {
- vmx_x86_ops.set_hv_timer = NULL;
- vmx_x86_ops.cancel_hv_timer = NULL;
- vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+ vt_x86_ops.set_hv_timer = NULL;
+ vt_x86_ops.cancel_hv_timer = NULL;
+ vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
}

kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler);
@@ -7887,15 +7726,6 @@ static __init int hardware_setup(void)
return r;
}

-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
- .cpu_has_kvm_support = cpu_has_kvm_support,
- .disabled_by_bios = vmx_disabled_by_bios,
- .check_processor_compatibility = vmx_check_processor_compat,
- .hardware_setup = hardware_setup,
-
- .runtime_ops = &vmx_x86_ops,
-};
-
static void vmx_cleanup_l1d_flush(void)
{
if (vmx_l1d_flush_pages) {
@@ -7906,45 +7736,13 @@ static void vmx_cleanup_l1d_flush(void)
l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
}

-static void vmx_exit(void)
-{
-#ifdef CONFIG_KEXEC_CORE
- RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
- synchronize_rcu();
-#endif
-
- kvm_exit();
-
-#if IS_ENABLED(CONFIG_HYPERV)
- if (static_branch_unlikely(&enable_evmcs)) {
- int cpu;
- struct hv_vp_assist_page *vp_ap;
- /*
- * Reset everything to support using non-enlightened VMCS
- * access later (e.g. when we reload the module with
- * enlightened_vmcs=0)
- */
- for_each_online_cpu(cpu) {
- vp_ap = hv_get_vp_assist_page(cpu);
-
- if (!vp_ap)
- continue;
-
- vp_ap->nested_control.features.directhypercall = 0;
- vp_ap->current_nested_vmcs = 0;
- vp_ap->enlighten_vmentry = 0;
- }
-
- static_branch_disable(&enable_evmcs);
- }
-#endif
- vmx_cleanup_l1d_flush();
-}
-module_exit(vmx_exit);
-
-static int __init vmx_init(void)
+static void __init vmx_pre_kvm_init(unsigned int *vcpu_size,
+ unsigned int *vcpu_align)
{
- int r, cpu;
+ if (sizeof(struct vcpu_vmx) > *vcpu_size)
+ *vcpu_size = sizeof(struct vcpu_vmx);
+ if (__alignof__(struct vcpu_vmx) > *vcpu_align)
+ *vcpu_align = __alignof__(struct vcpu_vmx);

#if IS_ENABLED(CONFIG_HYPERV)
/*
@@ -7972,18 +7770,45 @@ static int __init vmx_init(void)
}

if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vmx_x86_ops.enable_direct_tlbflush
+ vt_x86_ops.enable_direct_tlbflush
= hv_enable_direct_tlbflush;

} else {
enlightened_vmcs = false;
}
#endif
+}

- r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
- __alignof__(struct vcpu_vmx), THIS_MODULE);
- if (r)
- return r;
+static void vmx_post_kvm_exit(void)
+{
+#if IS_ENABLED(CONFIG_HYPERV)
+ if (static_branch_unlikely(&enable_evmcs)) {
+ int cpu;
+ struct hv_vp_assist_page *vp_ap;
+ /*
+ * Reset everything to support using non-enlightened VMCS
+ * access later (e.g. when we reload the module with
+ * enlightened_vmcs=0)
+ */
+ for_each_online_cpu(cpu) {
+ vp_ap = hv_get_vp_assist_page(cpu);
+
+ if (!vp_ap)
+ continue;
+
+ vp_ap->nested_control.features.directhypercall = 0;
+ vp_ap->current_nested_vmcs = 0;
+ vp_ap->enlighten_vmentry = 0;
+ }
+
+ static_branch_disable(&enable_evmcs);
+ }
+#endif
+}
+
+static int __init vmx_init(void)
+{
+ int r, cpu;

/*
* Must be called after kvm_init() so enable_ept is properly set
@@ -7993,10 +7818,8 @@ static int __init vmx_init(void)
* mitigation mode.
*/
r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
- if (r) {
- vmx_exit();
+ if (r)
return r;
- }

for_each_possible_cpu(cpu) {
INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
@@ -8020,4 +7843,12 @@ static int __init vmx_init(void)

return 0;
}
-module_init(vmx_init);
+
+static void vmx_exit(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+ RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
+ synchronize_rcu();
+#endif
+ vmx_cleanup_l1d_flush();
+}
--
2.25.1

2021-07-02 22:11:13

by Isaku Yamahata

Subject: [RFC PATCH v2 53/69] KVM: VMX: Define EPT Violation architectural bits

From: Sean Christopherson <[email protected]>

Define the EPT Violation #VE control bit, #VE info VMCS fields, and the
suppress #VE bit for EPT entries.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 035dfdafa2c1..132981276a2f 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -78,6 +78,7 @@ struct vmcs {
#define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
#define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
#define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
#define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
#define SECONDARY_EXEC_XSAVES VMCS_CONTROL_BIT(XSAVES)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -226,6 +227,8 @@ enum vmcs_field {
VMREAD_BITMAP_HIGH = 0x00002027,
VMWRITE_BITMAP = 0x00002028,
VMWRITE_BITMAP_HIGH = 0x00002029,
+ VE_INFO_ADDRESS = 0x0000202A,
+ VE_INFO_ADDRESS_HIGH = 0x0000202B,
XSS_EXIT_BITMAP = 0x0000202C,
XSS_EXIT_BITMAP_HIGH = 0x0000202D,
ENCLS_EXITING_BITMAP = 0x0000202E,
@@ -509,6 +512,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
--
2.25.1

2021-07-02 22:11:21

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 60/69] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs

From: Sean Christopherson <[email protected]>

Add a macro framework to hide VMX vs. TDX details of VMREAD and VMWRITE
so that VMX and TDX can share common flows, e.g. accessing descriptor
tables (DTs).

Note, the TDX paths are dead code at this time. There is no great way
to deal with the chicken-and-egg scenario of having things in place for
TDX without first having TDX.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 41 +++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 9e5865b05d47..aa6a569b87d1 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -11,6 +11,47 @@
#include "vmcs.h"
#include "vmx.h"
#include "x86.h"
+#include "tdx.h"
+
+#ifdef CONFIG_KVM_INTEL_TDX
+#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits) \
+static __always_inline type vmread##bits(struct kvm_vcpu *vcpu, \
+ unsigned long field) \
+{ \
+ if (unlikely(is_td_vcpu(vcpu))) { \
+ if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm)) \
+ return 0; \
+ return td_vmcs_read##tdbits(to_tdx(vcpu), field); \
+ } \
+ return vmcs_read##bits(field); \
+} \
+static __always_inline void vmwrite##bits(struct kvm_vcpu *vcpu, \
+ unsigned long field, type value) \
+{ \
+ if (unlikely(is_td_vcpu(vcpu))) { \
+ if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm)) \
+ return; \
+ return td_vmcs_write##tdbits(to_tdx(vcpu), field, value); \
+ } \
+ vmcs_write##bits(field, value); \
+}
+#else
+#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits) \
+static __always_inline type vmread##bits(struct kvm_vcpu *vcpu, \
+ unsigned long field) \
+{ \
+ return vmcs_read##bits(field); \
+} \
+static __always_inline void vmwrite##bits(struct kvm_vcpu *vcpu, \
+ unsigned long field, type value) \
+{ \
+ vmcs_write##bits(field, value); \
+}
+#endif /* CONFIG_KVM_INTEL_TDX */
+VT_BUILD_VMCS_HELPERS(u16, 16, 16);
+VT_BUILD_VMCS_HELPERS(u32, 32, 32);
+VT_BUILD_VMCS_HELPERS(u64, 64, 64);
+VT_BUILD_VMCS_HELPERS(unsigned long, l, 64);

extern unsigned long vmx_host_idt_base;
void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
--
2.25.1

2021-07-02 22:11:22

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 26/69] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE

From: Sean Christopherson <[email protected]>

Add a flag to disallow MCE injection and reject KVM_X86_SETUP_MCE with
-EINVAL when set. TDX doesn't support injecting exceptions, including
(virtual) #MCs.

Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 14 +++++++-------
2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be329f0a7054..09e51c5e86b3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1064,6 +1064,7 @@ struct kvm_arch {

bool bus_lock_detection_enabled;
bool irq_injection_disallowed;
+ bool mce_injection_disallowed;

/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
u32 user_space_msr_mask;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4070786f17d1..681fc3be2b2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4319,15 +4319,16 @@ static int vcpu_ioctl_tpr_access_reporting(struct kvm_vcpu *vcpu,
static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
u64 mcg_cap)
{
- int r;
unsigned bank_num = mcg_cap & 0xff, bank;

- r = -EINVAL;
+ if (vcpu->kvm->arch.mce_injection_disallowed)
+ return -EINVAL;
+
if (!bank_num || bank_num > KVM_MAX_MCE_BANKS)
- goto out;
+ return -EINVAL;
if (mcg_cap & ~(kvm_mce_cap_supported | 0xff | 0xff0000))
- goto out;
- r = 0;
+ return -EINVAL;
+
vcpu->arch.mcg_cap = mcg_cap;
/* Init IA32_MCG_CTL to all 1s */
if (mcg_cap & MCG_CTL_P)
@@ -4337,8 +4338,7 @@ static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
vcpu->arch.mce_banks[bank*4] = ~(u64)0;

static_call(kvm_x86_setup_mce)(vcpu);
-out:
- return r;
+ return 0;
}

static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
--
2.25.1

2021-07-02 22:11:23

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 33/69] KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()

From: Sean Christopherson <[email protected]>

Add hooks to cache and flush GPRs and invoke them from KVM_GET_REGS and
KVM_SET_REGS respectively. TDX will use the hooks to read/write GPRs
from TDX-SEAM on-demand (for debug TDs).

Cc: Tom Lendacky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 6 ++++++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 00333af724d7..9791c4bb5198 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1248,6 +1248,8 @@ struct kvm_x86_ops {
void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu);
void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value);
+ void (*cache_gprs)(struct kvm_vcpu *vcpu);
+ void (*flush_gprs)(struct kvm_vcpu *vcpu);
void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c231a88d5946..f7ae0a47e555 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9850,6 +9850,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)

static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
+ if (kvm_x86_ops.cache_gprs)
+ kvm_x86_ops.cache_gprs(vcpu);
+
if (vcpu->arch.emulate_regs_need_sync_to_vcpu) {
/*
* We are here if userspace calls get_regs() in the middle of
@@ -9924,6 +9927,9 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)

vcpu->arch.exception.pending = false;

+ if (kvm_x86_ops.flush_gprs)
+ kvm_x86_ops.flush_gprs(vcpu);
+
kvm_make_request(KVM_REQ_EVENT, vcpu);
}

--
2.25.1

2021-07-02 22:11:23

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 62/69] KVM: VMX: MOVE GDT and IDT accessors to common code

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 6 ++++--
arch/x86/kvm/vmx/vmx.c | 12 ------------
2 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b619615f77de..8e03cb72b910 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -311,7 +311,8 @@ static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)

static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
- vmx_get_idt(vcpu, dt);
+ dt->size = vmread32(vcpu, GUEST_IDTR_LIMIT);
+ dt->address = vmreadl(vcpu, GUEST_IDTR_BASE);
}

static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
@@ -321,7 +322,8 @@ static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)

static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
- vmx_get_gdt(vcpu, dt);
+ dt->size = vmread32(vcpu, GUEST_GDTR_LIMIT);
+ dt->address = vmreadl(vcpu, GUEST_GDTR_BASE);
}

static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 40843ca2fb33..d69d4dc7c071 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3302,24 +3302,12 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
*l = (ar >> 13) & 1;
}

-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
- dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
- dt->address = vmcs_readl(GUEST_IDTR_BASE);
-}
-
static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
vmcs_writel(GUEST_IDTR_BASE, dt->address);
}

-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
- dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
- dt->address = vmcs_readl(GUEST_GDTR_BASE);
-}
-
static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
--
2.25.1

2021-07-02 22:11:27

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 64/69] cpu/hotplug: Document that TDX also depends on booting CPUs once

From: Kai Huang <[email protected]>

Add a comment to explain that TDX also depends on booting logical CPUs
at least once.

TDSYSINITLP must be run on all CPUs, even software disabled CPUs in the
-nosmt case. Fortunately, current SMT handling for #MC already supports
booting all CPUs once; the to-be-disabled sibling is booted once (and
later put into deep C-state to honor SMT=off) to allow the init code to
set CR4.MCE and avoid an unwanted shutdown on a broadcasted MCE.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
kernel/cpu.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index e538518556f4..58377a03e9d6 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -446,6 +446,10 @@ static inline bool cpu_smt_allowed(unsigned int cpu)
* that the init code can get a chance to set CR4.MCE on each
* CPU. Otherwise, a broadcasted MCE observing CR4.MCE=0b on any
* core will shutdown the machine.
+ *
+ * Intel TDX also requires running TDH_SYS_LP_INIT on all logical CPUs
+ * during boot, booting all CPUs once allows TDX to play nice with
+ * 'nosmt'.
*/
return !cpumask_test_cpu(cpu, &cpus_booted_once_mask);
}
--
2.25.1

2021-07-02 22:11:27

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 19/69] KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM

From: Sean Christopherson <[email protected]>

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 23 ++++++++++++++---------
arch/x86/kvm/x86.c | 4 ++++
3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e088086f3de6..25c72925eb8a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1526,7 +1526,7 @@ static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
break;
default:
- WARN_ON_ONCE(1);
+ KVM_BUG_ON(1, vcpu->kvm);
}
}

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d73ba7a6ff8d..6c043a160b30 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2360,7 +2360,7 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
break;
default:
- WARN_ON_ONCE(1);
+ KVM_BUG_ON(1, vcpu->kvm);
break;
}
}
@@ -5062,6 +5062,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
return kvm_complete_insn_gp(vcpu, err);
case 3:
WARN_ON_ONCE(enable_unrestricted_guest);
+
err = kvm_set_cr3(vcpu, val);
return kvm_complete_insn_gp(vcpu, err);
case 4:
@@ -5087,14 +5088,13 @@ static int handle_cr(struct kvm_vcpu *vcpu)
}
break;
case 2: /* clts */
- WARN_ONCE(1, "Guest should always own CR0.TS");
- vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
- trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
- return kvm_skip_emulated_instruction(vcpu);
+ KVM_BUG(1, vcpu->kvm, "Guest always owns CR0.TS");
+ return -EIO;
case 1: /*mov from cr*/
switch (cr) {
case 3:
WARN_ON_ONCE(enable_unrestricted_guest);
+
val = kvm_read_cr3(vcpu);
kvm_register_write(vcpu, reg, val);
trace_kvm_cr_read(cr, val);
@@ -5404,7 +5404,9 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)

static int handle_nmi_window(struct kvm_vcpu *vcpu)
{
- WARN_ON_ONCE(!enable_vnmi);
+ if (KVM_BUG_ON(!enable_vnmi, vcpu->kvm))
+ return -EIO;
+
exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
++vcpu->stat.nmi_window_exits;
kvm_make_request(KVM_REQ_EVENT, vcpu);
@@ -5960,7 +5962,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
* below) should never happen as that means we incorrectly allowed a
* nested VM-Enter with an invalid vmcs12.
*/
- WARN_ON_ONCE(vmx->nested.nested_run_pending);
+ if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
+ return -EIO;

/* If guest state is invalid, start emulating */
if (vmx->emulation_required)
@@ -6338,7 +6341,9 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
int max_irr;
bool max_irr_updated;

- WARN_ON(!vcpu->arch.apicv_active);
+ if (KVM_BUG_ON(!vcpu->arch.apicv_active, vcpu->kvm))
+ return -EIO;
+
if (pi_test_on(&vmx->pi_desc)) {
pi_clear_on(&vmx->pi_desc);
/*
@@ -6421,7 +6426,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
gate_desc *desc = (gate_desc *)host_idt_base + vector;

- if (WARN_ONCE(!is_external_intr(intr_info),
+ if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
"KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc45b2c47672..9244d1d560d5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9153,6 +9153,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}

if (kvm_request_pending(vcpu)) {
+ if (kvm_check_request(KVM_REQ_VM_BUGGED, vcpu)) {
+ r = -EIO;
+ goto out;
+ }
if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
r = 0;
--
2.25.1

2021-07-02 22:11:27

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 31/69] KVM: x86: add per-VM flags to disable SMI/INIT/SIPI

From: Isaku Yamahata <[email protected]>

Add per-VM flags to let TDX disallow injecting interrupts with the
SMI/INIT/SIPI delivery modes, and add checks to reject the SMI and INIT
delivery modes when those flags are set.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/irq_comm.c | 4 ++++
arch/x86/kvm/x86.c | 3 +--
3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f373d672b4ac..00333af724d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1055,6 +1055,8 @@ struct kvm_arch {
enum kvm_irqchip_mode irqchip_mode;
u8 nr_reserved_ioapic_pins;
bool eoi_intercept_unsupported;
+ bool smm_unsupported;
+ bool init_sipi_unsupported;

bool disabled_lapic_found;

diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index bcfac99db579..396ccf086bdd 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -128,6 +128,10 @@ static inline bool kvm_msi_route_invalid(struct kvm *kvm,
.data = e->msi.data };
return (kvm->arch.eoi_intercept_unsupported &&
msg.arch_data.is_level) ||
+ (kvm->arch.smm_unsupported &&
+ msg.arch_data.delivery_mode == APIC_DELIVERY_MODE_SMI) ||
+ (kvm->arch.init_sipi_unsupported &&
+ msg.arch_data.delivery_mode == APIC_DELIVERY_MODE_INIT) ||
(kvm->arch.x2apic_format && (msg.address_hi & 0xff));
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 92204bbc7ea5..3407870b6f44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4311,8 +4311,7 @@ static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)

static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
{
- /* TODO: use more precise flag */
- if (vcpu->arch.guest_state_protected)
+ if (vcpu->kvm->arch.smm_unsupported)
return -EINVAL;

kvm_make_request(KVM_REQ_SMI, vcpu);
--
2.25.1

2021-07-02 22:11:38

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 13/69] KVM: Enable hardware before doing arch VM initialization

From: Sean Christopherson <[email protected]>

Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
accommodate Intel's TDX, which needs VMX to be enabled during VM init in
order to make SEAMCALLs.

This also provides consistent ordering between kvm_create_vm() and
kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
hardware_disable_all().

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
virt/kvm/kvm_main.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9e52fe999c92..751d1f6890b0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -923,7 +923,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
struct kvm_memslots *slots = kvm_alloc_memslots();

if (!slots)
- goto out_err_no_arch_destroy_vm;
+ goto out_err_no_disable;
/* Generations must be different for each address space. */
slots->generation = i;
rcu_assign_pointer(kvm->memslots[i], slots);
@@ -933,19 +933,19 @@ static struct kvm *kvm_create_vm(unsigned long type)
rcu_assign_pointer(kvm->buses[i],
kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT));
if (!kvm->buses[i])
- goto out_err_no_arch_destroy_vm;
+ goto out_err_no_disable;
}

kvm->max_halt_poll_ns = halt_poll_ns;

- r = kvm_arch_init_vm(kvm, type);
- if (r)
- goto out_err_no_arch_destroy_vm;
-
r = hardware_enable_all();
if (r)
goto out_err_no_disable;

+ r = kvm_arch_init_vm(kvm, type);
+ if (r)
+ goto out_err_no_arch_destroy_vm;
+
#ifdef CONFIG_HAVE_KVM_IRQFD
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
#endif
@@ -972,10 +972,10 @@ static struct kvm *kvm_create_vm(unsigned long type)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
out_err_no_mmu_notifier:
- hardware_disable_all();
-out_err_no_disable:
kvm_arch_destroy_vm(kvm);
out_err_no_arch_destroy_vm:
+ hardware_disable_all();
+out_err_no_disable:
WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
for (i = 0; i < KVM_NR_BUSES; i++)
kfree(kvm_get_bus(kvm, i));
--
2.25.1

2021-07-02 22:11:42

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 12/69] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO

From: Sean Christopherson <[email protected]>

kvm_intel.ko will use it later for TDX paravirtualized MMIO.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
virt/kvm/kvm_main.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 46fb042837d2..9e52fe999c92 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4606,6 +4606,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
r = __kvm_io_bus_read(vcpu, bus, &range, val);
return r < 0 ? r : 0;
}
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);

/* Caller must hold slots_lock. */
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
--
2.25.1

2021-07-02 22:11:46

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 10/69] KVM: TDX: Print the name of SEAMCALL status code

From: Isaku Yamahata <[email protected]>

The raw SEAMCALL error code is not intuitive enough to tell what went
wrong in the SEAMCALL; print the name of the status code along with it.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/boot/seam/tdx_common.c | 21 +++++++++++++++++++++
arch/x86/kvm/vmx/seamcall.h | 7 +++++--
2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/boot/seam/tdx_common.c b/arch/x86/kvm/boot/seam/tdx_common.c
index d803dbd11693..4fe352fb8586 100644
--- a/arch/x86/kvm/boot/seam/tdx_common.c
+++ b/arch/x86/kvm/boot/seam/tdx_common.c
@@ -9,6 +9,7 @@
#include <asm/kvm_boot.h>

#include "vmx/tdx_arch.h"
+#include "vmx/tdx_errno.h"

/*
* TDX system information returned by TDSYSINFO.
@@ -165,3 +166,23 @@ void tdx_keyid_free(int keyid)
ida_free(&tdx_keyid_pool, keyid);
}
EXPORT_SYMBOL_GPL(tdx_keyid_free);
+
+static struct tdx_seamcall_status {
+ u64 err_code;
+ const char *err_name;
+} tdx_seamcall_status_codes[] = {TDX_SEAMCALL_STATUS_CODES};
+
+const char *tdx_seamcall_error_name(u64 error_code)
+{
+ struct tdx_seamcall_status status;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tdx_seamcall_status_codes); i++) {
+ status = tdx_seamcall_status_codes[i];
+ if ((error_code & TDX_SEAMCALL_STATUS_MASK) == status.err_code)
+ return status.err_name;
+ }
+
+ return "Unknown SEAMCALL status code";
+}
+EXPORT_SYMBOL_GPL(tdx_seamcall_error_name);
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
index 2c83ab46eeac..fbb18aea1720 100644
--- a/arch/x86/kvm/vmx/seamcall.h
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -37,11 +37,14 @@ static inline u64 _seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
_seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
#endif

+const char *tdx_seamcall_error_name(u64 error_code);
+
static inline void __pr_seamcall_error(u64 op, const char *op_str,
u64 err, struct tdx_ex_ret *ex)
{
- pr_err_ratelimited("SEAMCALL[%s] failed on cpu %d: 0x%llx\n",
- op_str, smp_processor_id(), (err));
+ pr_err_ratelimited("SEAMCALL[%s] failed on cpu %d: %s (0x%llx)\n",
+ op_str, smp_processor_id(),
+ tdx_seamcall_error_name(err), err);
if (ex)
pr_err_ratelimited(
"RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
--
2.25.1

2021-07-02 22:12:03

by Isaku Yamahata

[permalink] [raw]
Subject: [RFC PATCH v2 69/69] Documentation/virtual/kvm: Add Trust Domain Extensions(TDX)

From: Isaku Yamahata <[email protected]>

Add documentation for Intel Trust Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/api.rst | 6 +-
Documentation/virt/kvm/intel-tdx.rst | 441 +++++++++++++++++++++++++++
2 files changed, 446 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virt/kvm/intel-tdx.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..b6a33a9bac87 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4297,7 +4297,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.

:Capability: basic
:Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error

@@ -4309,6 +4309,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virt/kvm/amd-memory-encryption.rst.

+Currently, this ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/intel-tdx.rst.
+
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------

diff --git a/Documentation/virt/kvm/intel-tdx.rst b/Documentation/virt/kvm/intel-tdx.rst
new file mode 100644
index 000000000000..6f2fbd2da243
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx.rst
@@ -0,0 +1,441 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. [1]
+For details, the specifications, [2], [3], [4], [5], [6], [7], are
+available.
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: system ioctl, vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is repurposed to be a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+ /* Trust Domain eXtension sub-ioctl() commands. */
+ enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+
+ KVM_TDX_CMD_NR_MAX,
+ };
+
+ struct kvm_tdx_cmd {
+ __u32 id; /* tdx_cmd_id */
+ __u32 metadata; /* sub command specific */
+ __u64 data; /* sub command specific */
+ };
+
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: system ioctl
+
+A subset of TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAM call
+is returned, which describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- metadata: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+
+::
+
+ struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+ };
+
+ struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+ };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX which corresponds to
+TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- metadata: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- reserved: must be 0
+
+::
+
+ struct kvm_tdx_init_vm {
+ __u32 max_vcpus;
+ __u32 reserved;
+ __u64 attributes;
+ __u64 cpuid; /* pointer to struct kvm_cpuid2 */
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ __u64 reserved[43]; /* must be zero for future extensibility */
+ };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX which corresponds to
+TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- metadata: must be 0
+- data: initial value of the guest TD VCPU RCX
+
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a contiguous memory region, which corresponds to the
+TDH.MEM.PAGE.ADD TDX SEAM call.
+If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement
+which corresponds to TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- metadata: flags
+ currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+
+::
+
+ #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+ struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+ };
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete measurement of the initial TD contents and mark the TD ready
+to run, which corresponds to the TDH.MR.FINALIZE TDX SEAM call.
+
+- id: KVM_TDX_FINALIZE_VM
+- metadata: ignored
+- data: ignored
+
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, new TDX ioctls need to be called. The
+control flow looks as follows.
+
+#. system wide capability check
+ * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+ * KVM_CAP_xxx: check other KVM extensions, same as in the normal KVM case.
+
+#. creating VM
+ * KVM_CREATE_VM
+ * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+ * KVM_CREATE_VCPU
+ * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+
+#. initializing guest memory
+ * allocate guest memory and initialize pages, same as in the normal KVM
+ case. In the TDX case, additionally parse and load TDVF into guest memory.
+ * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+ If the pages have initial contents as above, those pages need to be
+ added. Otherwise the contents will be lost and the guest sees zero pages.
+ * KVM_TDX_FINALIZE_VM: finalize the VM and its measurement.
+ This must come after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
+Loading TDX module
+==================
+
+Integrating TDX SEAM module into initrd
+---------------------------------------
+If TDX is enabled in KVM (CONFIG_KVM_INTEL_TDX=y), the kernel is able to
+load the TDX SEAM module from the initrd.
+The related modules (seamldr.acm, libtdx.so and libtdx.so.sigstruct) need
+to be stored in the initrd.
+
+tdx-seam is a sample hook script for initramfs-tools.
+TDXSEAM_SRCDIR is the directory in the host file system that stores the
+files related to the TDX SEAM module.
+
+Since how to prepare the initrd heavily depends on the distro, here's an
+example of how to prepare one.
+(This is adapted from Documentation/x86/microcode.rst.)
+
+::
+
+ #!/bin/bash
+
+ if [ -z "$1" ]; then
+ echo "You need to supply an initrd file"
+ exit 1
+ fi
+
+ INITRD="$1"
+
+ DSTDIR=lib/firmware/intel-seam
+ TMPDIR=/tmp/initrd
+ LIBTDX="/lib/firmware/intel-seam/seamldr.acm /lib/firmware/intel-seam/libtdx.so /lib/firmware/intel-seam/libtdx.so.sigstruct"
+
+ rm -rf $TMPDIR
+
+ mkdir $TMPDIR
+ cd $TMPDIR
+ mkdir -p $DSTDIR
+
+ cp ${LIBTDX} ${DSTDIR}
+
+ find . | cpio -o -H newc > ../tdx-seam.cpio
+ cd ..
+ mv $INITRD $INITRD.orig
+ cat tdx-seam.cpio $INITRD.orig > $INITRD
+
+ rm -rf $TMPDIR
+
+
+Design discussion
+=================
+
+the file location of the boot code
+----------------------------------
+The BSP launches the SEAM Loader to load the TDX module, and the TDX
+module is then initialized on all CPUs. The directory
+arch/x86/kvm/boot/seam is chosen to keep the related files in a nearby
+directory, so that future maintenance and enhancement can easily
+identify which files need to be kept in sync.
+
+- arch/x86/kvm/boot/seam: the current choice
+ Pros:
+ - The directory clearly indicates that the code is related to only KVM.
+ - Keep files near to the related code (KVM TDX code).
+ Cons:
+ - It doesn't follow the existing convention.
+
+Alternative:
+
+The alternative is to follow the existing convention.
+- arch/x86/kernel/cpu/
+ Pros:
+ - It follows the existing convention.
+ Cons:
+ - It's unclear that it's related to only KVM TDX.
+
+- drivers/firmware/
+ As the TDX module can be considered firmware, yet another choice is:
+ Pros:
+ - It follows the existing convention and clarifies that the TDX
+ module is firmware.
+ Cons:
+ - It's hard to tell that the firmware is only for KVM TDX.
+ - The files are far from the related code (KVM TDX).
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy (normal VMX) VMs and new TD VMs to
+coexist. Otherwise the benefits of VM flexibility would be eliminated.
+The main issue is that the logic of the kvm_x86_ops callbacks for TDX
+differs from that for VMX. On the other hand, the variable kvm_x86_ops
+is a single global variable, neither per-VM nor per-vcpu.
+
+Several points to be considered:
+ . No or minimal overhead when TDX is disabled (CONFIG_KVM_INTEL_TDX=n).
+ . Avoid the overhead of indirect calls via function pointers.
+ . Contain the changes under the arch/x86/kvm/vmx directory and share
+ logic with VMX for maintenance.
+ Even though the ways of operating on a VM (VMX instructions vs. TDX
+ SEAM calls) differ, the basic idea remains the same, so much of the
+ logic can be shared.
+ . Future maintenance:
+ No huge change to kvm_x86_ops is expected in the (near) future, so
+ a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+ Introduce dedicated file for arch/x86/kvm/vmx/main.c (the name,
+ main.c, is just chosen to show main entry points for callbacks.) and
+ wrapper functions around all the callbacks with
+ "if (is-tdx) tdx-callback() else vmx-callback()".
+
+ Pros:
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+ - When TDX is disabled (CONFIG_KVM_INTEL_TDX=n), the overhead is
+ optimized out.
+ - Micro optimization by avoiding function pointer.
+ Cons:
+ - Many boiler plates in arch/x86/kvm/vmx/main.c.
+
+Alternative:
+- Introduce another callback layer under arch/x86/kvm/vmx.
+ Pros:
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+  - Clear separation of callbacks.
+  Cons:
+  - Overhead in VMX even when TDX is disabled (CONFIG_KVM_INTEL_TDX=n).
+
+- Allow per-VM kvm_x86_ops callbacks instead of global kvm_x86_ops
+ Pros:
+  - Clear separation of callbacks.
+  Cons:
+  - Big change in common x86 code.
+  - Overhead in common code even when TDX is disabled
+    (CONFIG_KVM_INTEL_TDX=n).
+
+- Introduce new directory arch/x86/kvm/tdx
+ Pros:
+ - It clarifies that TDX is different from VMX.
+ Cons:
+  - Given the amount of logic shared with VMX, a separate directory
+    complicates code sharing.
+
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle the Secure/Shared-EPT. The
+high-level execution flow is mostly the same as in the normal EPT case:
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution (or emulate MMIO).
+The difference is that the S-EPT is read/written via TDX SEAMCALLs,
+which are expensive, instead of direct reads/writes of EPT entries.
+One GPA bit (bit 51 or bit 47) is repurposed to mean shared with the
+host (if set to 1) or private to the TD (if cleared to 0).
+
+- The current implementation
+  . Reuse the existing MMU code with minimal updates, because the
+    execution flow is mostly the same. The additional operation, a
+    TDX SEAMCALL to operate on the S-EPT, is handled by new hooks
+    added to kvm_x86_ops.
+  . For performance, minimize the TDX SEAMCALLs needed to operate on
+    the S-EPT. When getting the S-EPT pages/entry corresponding to a
+    faulting GPA, don't use a TDX SEAMCALL to read the S-EPT entry.
+    Instead, create a shadow copy in host memory: repurpose the
+    existing kvm_mmu_page as a shadow copy of the S-EPT and associate
+    the S-EPT with it.
+  . Treat the shared bit as an attribute: mask/unmask the bit where
+    necessary to keep the existing traversing code working.
+    Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+    for the special case.
+    = 0 for the non-TDX case
+    = bit 51 or bit 47 set for the TDX case
+
+ Pros:
+ - Large code reuse with minimal new hooks.
+  - The execution path stays the same.
+  Cons:
+  - Complicates the existing code.
+  - Repurposing kvm_mmu_page as a shadow of the Secure EPT can be
+    confusing.
+
+Alternative:
+- Replace direct reads/writes of EPT entries with TDX SEAMCALLs by
+  introducing callbacks on EPT entries.
+  Pros:
+  - Straightforward.
+  Cons:
+  - Too many touch points.
+  - Too slow due to TDX SEAMCALLs.
+  - Overhead even when TDX is disabled (CONFIG_KVM_INTEL_TDX=n).
+
+- Sprinkle "if (is-tdx)" for the TDX special cases
+  Pros:
+  - Straightforward.
+  Cons:
+  - The result is non-generic and ugly.
+  - Puts TDX-specific logic into common KVM MMU code.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+  Although not all the operations are memory encryption, repurpose
+  this ioctl to carry TDX-specific subcommands.
+ Pros:
+ - No major change in common x86 KVM code.
+ Cons:
+ - The operations aren't actually memory encryption, but operations
+ on TD VMs.
+
+Alternative:
+- Introduce new ioctl for guest protection like
+ KVM_GUEST_PROTECTION_OP and introduce subcommand for TDX.
+ Pros:
+ - Clean name.
+ Cons:
+ - One more new ioctl for guest protection.
+  - Possible confusion between KVM_MEMORY_ENCRYPT_OP and
+    KVM_GUEST_PROTECTION_OP.
+
+- Rename KVM_MEMORY_ENCRYPT_OP to KVM_GUEST_PROTECTION_OP and keep
+  KVM_MEMORY_ENCRYPT_OP as the same value for uapi compatibility:
+  "#define KVM_MEMORY_ENCRYPT_OP KVM_GUEST_PROTECTION_OP".
+  Pros:
+  - A more suitable name without adding a new ioctl.
+  Cons:
+  - May cause confusion for existing user programs.
+
+
+References
+==========
+
+.. [1] TDX specification
+ https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+ kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+ TDX guest branch: https://github.com/intel/tdx/tree/guest
+.. [9] tdvf
+ https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+ Enable Hardware Isolated VMs
+ https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+ Architectural Extensions for Hardware Virtual Machine Isolation
+ to Advance Confidential Computing in Public Clouds - Ravi Sahita
+ & Jun Nakajima, Intel Corporation
+ https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+ https://lkml.org/lkml/2020/10/20/66
--
2.25.1

2021-07-06 12:42:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 03/69] KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86

On 03/07/21 00:04, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> This is preparation for TDX support.
>
> Because SEAMCALL instruction requires VMX enabled, it needs to initialize
> struct vmcs and load it before SEAMCALL instruction.[1] [2] Move out the
> definition of vmcs into a common x86 header, arch/x86/include/asm/vmx.h, so
> that seamloader code can share the same definition.
>
> [1] Intel Trust Domain CPU Architectural Extensions
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
>
> [2] TDX Module spec
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/vmx.h | 11 +++++++++++
> arch/x86/kvm/vmx/vmcs.h | 11 -----------
> 2 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0ffaa3156a4e..035dfdafa2c1 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -17,6 +17,17 @@
> #include <uapi/asm/vmx.h>
> #include <asm/vmxfeatures.h>
>
> +struct vmcs_hdr {
> + u32 revision_id:31;
> + u32 shadow_vmcs:1;
> +};
> +
> +struct vmcs {
> + struct vmcs_hdr hdr;
> + u32 abort;
> + char data[];
> +};
> +
> #define VMCS_CONTROL_BIT(x) BIT(VMX_FEATURE_##x & 0x1f)
>
> /*
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index 1472c6c376f7..ac09bc4996a5 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -11,17 +11,6 @@
>
> #include "capabilities.h"
>
> -struct vmcs_hdr {
> - u32 revision_id:31;
> - u32 shadow_vmcs:1;
> -};
> -
> -struct vmcs {
> - struct vmcs_hdr hdr;
> - u32 abort;
> - char data[];
> -};
> -
> DECLARE_PER_CPU(struct vmcs *, current_vmcs);
>
> /*
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 12:59:08

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 06/69] KVM: TDX: add a helper function for kvm to call seamcall

On 03/07/21 00:04, [email protected] wrote:
> +
> +.Lseamcall:
> + seamcall
> + jmp .Lseamcall_ret
> +.Lspurious_fault:
> + call kvm_spurious_fault
> +.Lseamcall_ret:
> +
> + movq (FRAME_OFFSET + 8)(%rsp), %rdi
> + testq %rdi, %rdi
> + jz 1f
> +
> + /* If ex is non-NULL, store extra return values into it. */
> + movq %rcx, TDX_SEAM_rcx(%rdi)
> + movq %rdx, TDX_SEAM_rdx(%rdi)
> + movq %r8, TDX_SEAM_r8(%rdi)
> + movq %r9, TDX_SEAM_r9(%rdi)
> + movq %r10, TDX_SEAM_r10(%rdi)
> + movq %r11, TDX_SEAM_r11(%rdi)
> +
> +1:
> + FRAME_END
> + ret
> +
> + _ASM_EXTABLE(.Lseamcall, .Lspurious_fault)

Please use local labels and avoid unnecessary jmp, for example

1:
seamcall
movq (FRAME_OFFSET + 8)(%rsp), %rdi
testq %rdi, %rdi
jz 2f

/* If ex is non-NULL, store extra return values into it. */
movq %rcx, TDX_SEAM_rcx(%rdi)
movq %rdx, TDX_SEAM_rdx(%rdi)
movq %r8, TDX_SEAM_r8(%rdi)
movq %r9, TDX_SEAM_r9(%rdi)
movq %r10, TDX_SEAM_r10(%rdi)
movq %r11, TDX_SEAM_r11(%rdi)
2:
FRAME_END
ret
3:
/* Probably it helps to write an error code in %rax? */
movq $0x4000000500000000, %rax
cmpb $0, kvm_rebooting
jne 2b
ud2
_ASM_EXTABLE(1b, 3b)

Paolo

2021-07-06 13:25:34

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 08/69] KVM: TDX: add trace point before/after TDX SEAMCALLs

On 03/07/21 00:04, [email protected] wrote:
> + trace_kvm_tdx_seamcall_enter(smp_processor_id(), op,
> + rcx, rdx, r8, r9, r10);
> + err = __seamcall(op, rcx, rdx, r8, r9, r10, ex);
> + if (ex)
> + trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err, ex->rcx,
> + ex->rdx, ex->r8, ex->r9, ex->r10,
> + ex->r11);
> + else
> + trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err,
> + 0, 0, 0, 0, 0, 0);

Would it make sense to do the zeroing of ex directly in __seamcall in
case there is an error?

Otherwise looks good.

Paolo

2021-07-06 13:27:47

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 09/69] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs

On 03/07/21 00:04, [email protected] wrote:
> +static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
> +{
> + return seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
> +}
> +

Since you have wrappers anyway, I don't like having an extra macro level
just to remove the SEAMCALL_ prefix. It messes up editors that look up
the symbols.

Paolo

2021-07-06 13:28:53

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 12/69] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Later kvm_intel.ko will use it.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> virt/kvm/kvm_main.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 46fb042837d2..9e52fe999c92 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4606,6 +4606,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> r = __kvm_io_bus_read(vcpu, bus, &range, val);
> return r < 0 ? r : 0;
> }
> +EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>
> /* Caller must hold slots_lock. */
> int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 13:29:58

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 10/69] KVM: TDX: Print the name of SEAMCALL status code

On 03/07/21 00:04, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> SEAMCALL error code is not intuitive to tell what's wrong in the
> SEAMCALL, print the error code name along with it.
>
> Signed-off-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/boot/seam/tdx_common.c | 21 +++++++++++++++++++++
> arch/x86/kvm/vmx/seamcall.h | 7 +++++--
> 2 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/boot/seam/tdx_common.c b/arch/x86/kvm/boot/seam/tdx_common.c
> index d803dbd11693..4fe352fb8586 100644
> --- a/arch/x86/kvm/boot/seam/tdx_common.c
> +++ b/arch/x86/kvm/boot/seam/tdx_common.c
> @@ -9,6 +9,7 @@
> #include <asm/kvm_boot.h>
>
> #include "vmx/tdx_arch.h"
> +#include "vmx/tdx_errno.h"
>
> /*
> * TDX system information returned by TDSYSINFO.
> @@ -165,3 +166,23 @@ void tdx_keyid_free(int keyid)
> ida_free(&tdx_keyid_pool, keyid);
> }
> EXPORT_SYMBOL_GPL(tdx_keyid_free);
> +
> +static struct tdx_seamcall_status {
> + u64 err_code;
> + const char *err_name;
> +} tdx_seamcall_status_codes[] = {TDX_SEAMCALL_STATUS_CODES};
> +
> +const char *tdx_seamcall_error_name(u64 error_code)
> +{
> + struct tdx_seamcall_status status;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(tdx_seamcall_status_codes); i++) {
> + status = tdx_seamcall_status_codes[i];
> + if ((error_code & TDX_SEAMCALL_STATUS_MASK) == status.err_code)
> + return status.err_name;
> + }
> +
> + return "Unknown SEAMCALL status code";
> +}
> +EXPORT_SYMBOL_GPL(tdx_seamcall_error_name);
> diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
> index 2c83ab46eeac..fbb18aea1720 100644
> --- a/arch/x86/kvm/vmx/seamcall.h
> +++ b/arch/x86/kvm/vmx/seamcall.h
> @@ -37,11 +37,14 @@ static inline u64 _seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
> _seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
> #endif
>
> +const char *tdx_seamcall_error_name(u64 error_code);
> +
> static inline void __pr_seamcall_error(u64 op, const char *op_str,
> u64 err, struct tdx_ex_ret *ex)
> {
> - pr_err_ratelimited("SEAMCALL[%s] failed on cpu %d: 0x%llx\n",
> - op_str, smp_processor_id(), (err));
> + pr_err_ratelimited("SEAMCALL[%s] failed on cpu %d: %s (0x%llx)\n",
> + op_str, smp_processor_id(),
> + tdx_seamcall_error_name(err), err);
> if (ex)
> pr_err_ratelimited(
> "RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
>

You can squash it in the earlier patch that introduced __pr_seamcall_error.

Paolo

2021-07-06 13:30:55

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 13/69] KVM: Enable hardware before doing arch VM initialization

On 03/07/21 00:04, [email protected] wrote:
> This also provides consistent ordering between kvm_create_vm() and
> kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
> hardware_disable_all().
>
> Signed-off-by: Sean Christopherson<[email protected]>
> Signed-off-by: Isaku Yamahata<[email protected]>
> ---
> virt/kvm/kvm_main.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9e52fe999c92..751d1f6890b0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -923,7 +923,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> struct kvm_memslots *slots = kvm_alloc_memslots();
>
> if (!slots)
> - goto out_err_no_arch_destroy_vm;
> + goto out_err_no_disable;
> /* Generations must be different for each address space. */
> slots->generation = i;
> rcu_assign_pointer(kvm->memslots[i], slots);
> @@ -933,19 +933,19 @@ static struct kvm *kvm_create_vm(unsigned long typ

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:40:31

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 15/69] KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Later kvm_intel.ko will use it.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/x86.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 795f83a1cf9a..cc45b2c47672 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11962,6 +11962,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:40:33

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 65/69] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

On 03/07/21 00:05, [email protected] wrote:
> From: Xiaoyao Li <[email protected]>
>
> Introduce a per-vm variable initial_tsc_khz to hold the default tsc_khz
> for kvm_arch_vcpu_create().
>
> This field is going to be used by TDX since TSC frequency for TD guest
> is configured at TD VM initialization phase.
>
> Signed-off-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletion(-)

So this means disabling TSC frequency scaling on TDX. Would it make
sense to delay VM creation to a separate ioctl, similar to
KVM_ARM_VCPU_FINALIZE (KVM_VM_FINALIZE)?

Paolo

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a47e17892258..ae8b96e15e71 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1030,6 +1030,7 @@ struct kvm_arch {
> u64 last_tsc_nsec;
> u64 last_tsc_write;
> u32 last_tsc_khz;
> + u32 initial_tsc_khz;
> u64 cur_tsc_nsec;
> u64 cur_tsc_write;
> u64 cur_tsc_offset;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a8299add443f..d3ebed784eac 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10441,7 +10441,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> else
> vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
>
> - kvm_set_tsc_khz(vcpu, max_tsc_khz);
> + kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.initial_tsc_khz);
>
> r = kvm_mmu_create(vcpu);
> if (r < 0)
> @@ -10894,6 +10894,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> pvclock_update_vm_gtod_copy(kvm);
>
> kvm->arch.guest_can_read_msr_platform_info = true;
> + kvm->arch.initial_tsc_khz = max_tsc_khz;
>
> INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
> INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>

2021-07-06 14:40:33

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 23/69] KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
> sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run(). This
> allows a future patch to allow synchronizing select state for protected
> VMs.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/x86.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7110d48cbc1..271245ffc67c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9729,7 +9729,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> goto out;
> }
>
> - if (kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) {
> + if ((kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) ||
> + (kvm_run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS)) {
> r = -EINVAL;
> goto out;
> }
> @@ -10264,9 +10265,6 @@ static void store_regs(struct kvm_vcpu *vcpu)
>
> static int sync_regs(struct kvm_vcpu *vcpu)
> {
> - if (vcpu->run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS)
> - return -EINVAL;
> -
> if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
> __set_regs(vcpu, &vcpu->run->s.regs.regs);
> vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:40:33

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 34/69] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/x86.c | 12 ++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9791c4bb5198..e3abf077f328 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1377,7 +1377,9 @@ struct kvm_x86_ops {
> int (*pre_leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
> void (*enable_smi_window)(struct kvm_vcpu *vcpu);
>
> + int (*mem_enc_op_dev)(void __user *argp);
> int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
> + int (*mem_enc_op_vcpu)(struct kvm_vcpu *vcpu, void __user *argp);
> int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f7ae0a47e555..da9f1081cb03 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4109,6 +4109,12 @@ long kvm_arch_dev_ioctl(struct file *filp,
> case KVM_GET_SUPPORTED_HV_CPUID:
> r = kvm_ioctl_get_supported_hv_cpuid(NULL, argp);
> break;
> + case KVM_MEMORY_ENCRYPT_OP:
> + r = -EINVAL;
> + if (!kvm_x86_ops.mem_enc_op_dev)
> + goto out;
> + r = kvm_x86_ops.mem_enc_op_dev(argp);
> + break;
> default:
> r = -EINVAL;
> break;
> @@ -5263,6 +5269,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> break;
> }
> #endif
> + case KVM_MEMORY_ENCRYPT_OP:
> + r = -EINVAL;
> + if (!kvm_x86_ops.mem_enc_op_vcpu)
> + goto out;
> + r = kvm_x86_ops.mem_enc_op_vcpu(vcpu, argp);
> + break;
> default:
> r = -EINVAL;
> }
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:40:41

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 28/69] KVM: Add per-VM flag to mark read-only memory as unsupported

On 03/07/21 00:04, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a flag for TDX to flag RO memory as unsupported and propagate it to
> KVM_MEM_READONLY to allow reporting RO memory as unsupported on a per-VM
> basis. TDX1 doesn't expose permission bits to the VMM in the SEPT
> tables, i.e. doesn't support read-only private memory.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/x86.c | 4 +++-
> include/linux/kvm_host.h | 4 ++++
> virt/kvm/kvm_main.c | 8 +++++---
> 3 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cd9407982366..87212d7563ae 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3897,7 +3897,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ASYNC_PF_INT:
> case KVM_CAP_GET_TSC_KHZ:
> case KVM_CAP_KVMCLOCK_CTRL:
> - case KVM_CAP_READONLY_MEM:
> case KVM_CAP_HYPERV_TIME:
> case KVM_CAP_IOAPIC_POLARITY_IGNORED:
> case KVM_CAP_TSC_DEADLINE_TIMER:
> @@ -4009,6 +4008,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
> r |= BIT(KVM_X86_TDX_VM);
> break;
> + case KVM_CAP_READONLY_MEM:
> + r = kvm && kvm->readonly_mem_unsupported ? 0 : 1;
> + break;
> default:
> break;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ddd4d0f68cdf..7ee7104b4b59 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -597,6 +597,10 @@ struct kvm {
> unsigned int max_halt_poll_ns;
> u32 dirty_ring_size;
>
> +#ifdef __KVM_HAVE_READONLY_MEM
> + bool readonly_mem_unsupported;
> +#endif
> +
> bool vm_bugged;
> };
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 52d40ea75749..63d0c2833913 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1258,12 +1258,14 @@ static void update_memslots(struct kvm_memslots *slots,
> }
> }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> #ifdef __KVM_HAVE_READONLY_MEM
> - valid_flags |= KVM_MEM_READONLY;
> + if (!kvm->readonly_mem_unsupported)
> + valid_flags |= KVM_MEM_READONLY;
> #endif
>
> if (mem->flags & ~valid_flags)
> @@ -1436,7 +1438,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> int as_id, id;
> int r;
>
> - r = check_memory_region_flags(mem);
> + r = check_memory_region_flags(kvm, mem);
> if (r)
> return r;
>
>

For all these flags, which of these limitations will be common to SEV-ES
and SEV-SNP (ExtINT injection, MCE injection, changing TSC, read-only
memory, dirty logging)? Would it make sense to use vm_type instead of
all of them? I guess this also guides the choice of whether to use a
single vm-type for TDX and SEV-SNP or two. Probably two is better, and
there can be static inline bool functions to derive the support flags
from the vm-type.

Paolo

2021-07-06 14:40:57

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 63/69] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/common.h | 14 ++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 10 +---------
> 2 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 755aaec85199..817ff3e74933 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -120,6 +120,20 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
>
> +static inline u32 __vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
> +{
> + u32 interruptibility;
> + int ret = 0;
> +
> + interruptibility = vmread32(vcpu, GUEST_INTERRUPTIBILITY_INFO);
> + if (interruptibility & GUEST_INTR_STATE_STI)
> + ret |= KVM_X86_SHADOW_INT_STI;
> + if (interruptibility & GUEST_INTR_STATE_MOV_SS)
> + ret |= KVM_X86_SHADOW_INT_MOV_SS;
> +
> + return ret;
> +}
> +
> static inline u32 vmx_encode_ar_bytes(struct kvm_segment *var)
> {
> u32 ar;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d69d4dc7c071..d31cace67907 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1467,15 +1467,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>
> u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
> {
> - u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
> - int ret = 0;
> -
> - if (interruptibility & GUEST_INTR_STATE_STI)
> - ret |= KVM_X86_SHADOW_INT_STI;
> - if (interruptibility & GUEST_INTR_STATE_MOV_SS)
> - ret |= KVM_X86_SHADOW_INT_MOV_SS;
> -
> - return ret;
> + return __vmx_get_interrupt_shadow(vcpu);
> }
>
> void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
>

Is there any reason to add the __ version, since at this point
kvm_x86_ops is already pointing to vt_get_interrupt_shadow?

Paolo

2021-07-06 14:41:01

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 17/69] KVM: Add infrastructure and macro to mark VM as bugged

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> include/linux/kvm_host.h | 29 ++++++++++++++++++++++++++++-
> virt/kvm/kvm_main.c | 10 +++++-----
> 2 files changed, 33 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8583ed3ff344..09618f8a1338 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -149,6 +149,7 @@ static inline bool is_error_page(struct page *page)
> #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_UNBLOCK 2
> #define KVM_REQ_UNHALT 3
> +#define KVM_REQ_VM_BUGGED (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQUEST_ARCH_BASE 8
>
> #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -585,6 +586,8 @@ struct kvm {
> pid_t userspace_pid;
> unsigned int max_halt_poll_ns;
> u32 dirty_ring_size;
> +
> + bool vm_bugged;
> };
>
> #define kvm_err(fmt, ...) \
> @@ -613,6 +616,31 @@ struct kvm {
> #define vcpu_err(vcpu, fmt, ...) \
> kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
>
> +bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
> +static inline void kvm_vm_bugged(struct kvm *kvm)
> +{
> + kvm->vm_bugged = true;
> + kvm_make_all_cpus_request(kvm, KVM_REQ_VM_BUGGED);
> +}
> +
> +#define KVM_BUG(cond, kvm, fmt...) \
> +({ \
> + int __ret = (cond); \
> + \
> + if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt)) \
> + kvm_vm_bugged(kvm); \
> + unlikely(__ret); \
> +})
> +
> +#define KVM_BUG_ON(cond, kvm) \
> +({ \
> + int __ret = (cond); \
> + \
> + if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
> + kvm_vm_bugged(kvm); \
> + unlikely(__ret); \
> +})
> +
> static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
> {
> return !!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET);
> @@ -930,7 +958,6 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
> struct kvm_vcpu *except,
> unsigned long *vcpu_bitmap, cpumask_var_t tmp);
> -bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
> bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
> struct kvm_vcpu *except);
> bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 751d1f6890b0..dc752d0bd3ec 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3435,7 +3435,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
> struct kvm_fpu *fpu = NULL;
> struct kvm_sregs *kvm_sregs = NULL;
>
> - if (vcpu->kvm->mm != current->mm)
> + if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
> return -EIO;
>
> if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
> @@ -3641,7 +3641,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
> void __user *argp = compat_ptr(arg);
> int r;
>
> - if (vcpu->kvm->mm != current->mm)
> + if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
> return -EIO;
>
> switch (ioctl) {
> @@ -3707,7 +3707,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
> {
> struct kvm_device *dev = filp->private_data;
>
> - if (dev->kvm->mm != current->mm)
> + if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
> return -EIO;
>
> switch (ioctl) {
> @@ -3991,7 +3991,7 @@ static long kvm_vm_ioctl(struct file *filp,
> void __user *argp = (void __user *)arg;
> int r;
>
> - if (kvm->mm != current->mm)
> + if (kvm->mm != current->mm || kvm->vm_bugged)
> return -EIO;
> switch (ioctl) {
> case KVM_CREATE_VCPU:
> @@ -4189,7 +4189,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
> struct kvm *kvm = filp->private_data;
> int r;
>
> - if (kvm->mm != current->mm)
> + if (kvm->mm != current->mm || kvm->vm_bugged)
> return -EIO;
> switch (ioctl) {
> case KVM_GET_DIRTY_LOG: {
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:41:13

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 18/69] KVM: Export kvm_make_all_cpus_request() for use in marking VMs as bugged

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Export kvm_make_all_cpus_request() and hoist the request helper
> declarations up to the KVM_REQ_* definitions in preparation
> for adding a "VM bugged" framework. The framework will add KVM_BUG()
> and KVM_BUG_ON() as alternatives to full BUG()/BUG_ON() for cases where
> KVM has definitely hit a bug (in itself or in silicon) and the VM is all
> but guaranteed to be hosed. Marking a VM bugged will trigger a request
> to all vCPUs to allow arch code to forcefully evict each vCPU from its
> run loop.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>

This should be _before_ patch 17, not after.

Paolo

2021-07-06 14:41:12

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 36/69] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add a flag, KVM_DEBUGREG_AUTO_SWITCH_GUEST, to skip saving/restoring DRs
> irrespective of any other flags. TDX-SEAM unconditionally saves and
> restores guest DRs, and resets them to the architectural INIT state on TD
> exit. So, KVM needs to save host DRs before TD entry without restoring
> guest DRs, and restore host DRs after TD exit.
>
> Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
>
> Reported-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Chao Gao <[email protected]>
> Signed-off-by: Chao Gao <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 11 ++++++++---
> arch/x86/kvm/x86.c | 3 ++-
> 2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 96e6cd95d884..7822b531a5e2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -488,9 +488,14 @@ struct kvm_pmu {
> struct kvm_pmu_ops;
>
> enum {
> - KVM_DEBUGREG_BP_ENABLED = 1,
> - KVM_DEBUGREG_WONT_EXIT = 2,
> - KVM_DEBUGREG_RELOAD = 4,
> + KVM_DEBUGREG_BP_ENABLED = BIT(0),
> + KVM_DEBUGREG_WONT_EXIT = BIT(1),
> + KVM_DEBUGREG_RELOAD = BIT(2),
> + /*
> + * Guest debug registers are saved/restored by hardware on exit from
> + * or entry to the guest. KVM needn't switch them.
> + */
> + KVM_DEBUGREG_AUTO_SWITCH_GUEST = BIT(3),

Maybe remove "_GUEST"? Apart from that,

Reviewed-by: Paolo Bonzini <[email protected]>

Paolo

> };
>
> struct kvm_mtrr_range {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b436cae1732..f1d5e0a53640 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9441,7 +9441,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> if (test_thread_flag(TIF_NEED_FPU_LOAD))
> switch_fpu_return();
>
> - if (unlikely(vcpu->arch.switch_db_regs)) {
> + if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH_GUEST)) {
> set_debugreg(0, 7);
> set_debugreg(vcpu->arch.eff_db[0], 0);
> set_debugreg(vcpu->arch.eff_db[1], 1);
> @@ -9473,6 +9473,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> */
> if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
> WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
> + WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH_GUEST);
> static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
> kvm_update_dr0123(vcpu);
> kvm_update_dr7(vcpu);
>

2021-07-06 14:41:18

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 67/69] KVM: TDX: add trace point for TDVMCALL and SEPT operation

On 03/07/21 00:05, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Signed-off-by: Yuan Yao <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/trace.h | 58 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.c | 16 ++++++++++
> arch/x86/kvm/vmx/tdx_arch.h | 9 ++++++
> arch/x86/kvm/x86.c | 2 ++
> 4 files changed, 85 insertions(+)
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index c3398d0de9a7..58631124f08d 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -739,6 +739,64 @@ TRACE_EVENT(kvm_tdx_seamcall_exit,
> __entry->r9, __entry->r10, __entry->r11)
> );
>
> +/*
> + * Tracepoint for TDVMCALL from a TDX guest
> + */
> +TRACE_EVENT(kvm_tdvmcall,
> + TP_PROTO(struct kvm_vcpu *vcpu, __u32 exit_reason,
> + __u64 p1, __u64 p2, __u64 p3, __u64 p4),
> + TP_ARGS(vcpu, exit_reason, p1, p2, p3, p4),
> +
> + TP_STRUCT__entry(
> + __field( __u64, rip )
> + __field( __u32, exit_reason )
> + __field( __u64, p1 )
> + __field( __u64, p2 )
> + __field( __u64, p3 )
> + __field( __u64, p4 )
> + ),
> +
> + TP_fast_assign(
> + __entry->rip = kvm_rip_read(vcpu);
> + __entry->exit_reason = exit_reason;
> + __entry->p1 = p1;
> + __entry->p2 = p2;
> + __entry->p3 = p3;
> + __entry->p4 = p4;
> + ),
> +
> + TP_printk("rip: %llx reason: %s p1: %llx p2: %llx p3: %llx p4: %llx",
> + __entry->rip,
> + __print_symbolic(__entry->exit_reason,
> + TDG_VP_VMCALL_EXIT_REASONS),
> + __entry->p1, __entry->p2, __entry->p3, __entry->p4)
> +);
> +
> +/*
> + * Tracepoint for SEPT related SEAMCALLs.
> + */
> +TRACE_EVENT(kvm_sept_seamcall,
> + TP_PROTO(__u64 op, __u64 gpa, __u64 hpa, int level),
> + TP_ARGS(op, gpa, hpa, level),
> +
> + TP_STRUCT__entry(
> + __field( __u64, op )
> + __field( __u64, gpa )
> + __field( __u64, hpa )
> + __field( int, level )
> + ),
> +
> + TP_fast_assign(
> + __entry->op = op;
> + __entry->gpa = gpa;
> + __entry->hpa = hpa;
> + __entry->level = level;
> + ),
> +
> + TP_printk("op: %llu gpa: 0x%llx hpa: 0x%llx level: %u",
> + __entry->op, __entry->gpa, __entry->hpa, __entry->level)
> +);
> +
> /*
> * Tracepoint for nested #vmexit because of interrupt pending
> */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1aed4286ce0c..63130fb5a003 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -934,6 +934,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>
> exit_reason = tdvmcall_exit_reason(vcpu);
>
> + trace_kvm_tdvmcall(vcpu, exit_reason,
> + tdvmcall_p1_read(vcpu), tdvmcall_p2_read(vcpu),
> + tdvmcall_p3_read(vcpu), tdvmcall_p4_read(vcpu));
> +
> switch (exit_reason) {
> case EXIT_REASON_CPUID:
> return tdx_emulate_cpuid(vcpu);
> @@ -1011,11 +1015,15 @@ static void tdx_sept_set_private_spte(struct kvm_vcpu *vcpu, gfn_t gfn,
>
> /* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
> if (is_td_finalized(kvm_tdx)) {
> + trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_PAGE_AUG, gpa, hpa, level);
> +
> err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &ex_ret);
> SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_AUG, vcpu->kvm);
> return;
> }
>
> + trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_PAGE_ADD, gpa, hpa, level);
> +
> source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
>
> err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &ex_ret);
> @@ -1039,6 +1047,8 @@ static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, int level,
> return;
>
> if (is_hkid_assigned(kvm_tdx)) {
> + trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_PAGE_REMOVE, gpa, hpa, level);
> +
> err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
> if (SEPT_ERR(err, &ex_ret, TDH_MEM_PAGE_REMOVE, kvm))
> return;
> @@ -1063,6 +1073,8 @@ static int tdx_sept_link_private_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
> struct tdx_ex_ret ex_ret;
> u64 err;
>
> + trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_SEPT_ADD, gpa, hpa, level);
> +
> err = tdh_mem_spet_add(kvm_tdx->tdr.pa, gpa, level, hpa, &ex_ret);
> if (SEPT_ERR(err, &ex_ret, TDH_MEM_SEPT_ADD, vcpu->kvm))
> return -EIO;
> @@ -1077,6 +1089,8 @@ static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, int level)
> struct tdx_ex_ret ex_ret;
> u64 err;
>
> + trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_RANGE_BLOCK, gpa, -1ull, level);
> +
> err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
> SEPT_ERR(err, &ex_ret, TDH_MEM_RANGE_BLOCK, kvm);
> }
> @@ -1088,6 +1102,8 @@ static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn, int level)
> struct tdx_ex_ret ex_ret;
> u64 err;
>
> + trace_kvm_sept_seamcall(SEAMCALL_TDH_MEM_RANGE_UNBLOCK, gpa, -1ull, level);
> +
> err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gpa, level, &ex_ret);
> SEPT_ERR(err, &ex_ret, TDH_MEM_RANGE_UNBLOCK, kvm);
> }
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index 7258825b1e02..414b933a3b03 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -104,6 +104,15 @@
> #define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
> #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004
>
> +#define TDG_VP_VMCALL_EXIT_REASONS \
> + { TDG_VP_VMCALL_GET_TD_VM_CALL_INFO, \
> + "GET_TD_VM_CALL_INFO" }, \
> + { TDG_VP_VMCALL_MAP_GPA, "MAP_GPA" }, \
> + { TDG_VP_VMCALL_GET_QUOTE, "GET_QUOTE" }, \
> + { TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, \
> + "SETUP_EVENT_NOTIFY_INTERRUPT" }, \
> + VMX_EXIT_REASONS
> +
> /* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> #define TDX_CLASS_SHIFT 56
> #define TDX_FIELD_MASK GENMASK_ULL(31, 0)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ba69abcc663a..ad619c1b2a88 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12104,6 +12104,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_tdvmcall);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_sept_seamcall);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmenter_failed);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
>

Please split this into two parts, one for each tracepoint, and squash them
into the earlier patches that introduced handle_tdvmcall and tdx_sept_*.

Paolo

2021-07-06 14:41:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 20/69] KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5b8a640f8042..0dc4bf34ce9c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5091,7 +5091,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> if (r == RET_PF_INVALID) {
> r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> lower_32_bits(error_code), false);
> - if (WARN_ON_ONCE(r == RET_PF_INVALID))
> + if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> return -EIO;
> }
>
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:41:38

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 14/69] KVM: x86: Split core of hypercall emulation to helper function

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> By necessity, TDX will use a different register ABI for hypercalls.
> Break out the core functionality so that it may be reused for TDX.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +++
> arch/x86/kvm/x86.c | 55 ++++++++++++++++++++-------------
> 2 files changed, 38 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9c7ced0e3171..80b943e4ab6d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1667,6 +1667,10 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu);
> void kvm_request_apicv_update(struct kvm *kvm, bool activate,
> unsigned long bit);
>
> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> + unsigned long a0, unsigned long a1,
> + unsigned long a2, unsigned long a3,
> + int op_64_bit);
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>
> int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d11cf87674f3..795f83a1cf9a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8406,26 +8406,15 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
> return;
> }
>
> -int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> + unsigned long a0, unsigned long a1,
> + unsigned long a2, unsigned long a3,
> + int op_64_bit)
> {
> - unsigned long nr, a0, a1, a2, a3, ret;
> - int op_64_bit;
> -
> - if (kvm_xen_hypercall_enabled(vcpu->kvm))
> - return kvm_xen_hypercall(vcpu);
> -
> - if (kvm_hv_hypercall_enabled(vcpu))
> - return kvm_hv_hypercall(vcpu);
> -
> - nr = kvm_rax_read(vcpu);
> - a0 = kvm_rbx_read(vcpu);
> - a1 = kvm_rcx_read(vcpu);
> - a2 = kvm_rdx_read(vcpu);
> - a3 = kvm_rsi_read(vcpu);
> + unsigned long ret;
>
> trace_kvm_hypercall(nr, a0, a1, a2, a3);
>
> - op_64_bit = is_64_bit_mode(vcpu);
> if (!op_64_bit) {
> nr &= 0xFFFFFFFF;
> a0 &= 0xFFFFFFFF;
> @@ -8434,11 +8423,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> a3 &= 0xFFFFFFFF;
> }
>
> - if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> - ret = -KVM_EPERM;
> - goto out;
> - }
> -
> ret = -KVM_ENOSYS;
>
> switch (nr) {
> @@ -8475,6 +8459,35 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> ret = -KVM_ENOSYS;
> break;
> }
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> +
> +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> +{
> + unsigned long nr, a0, a1, a2, a3, ret;
> + int op_64_bit;
> +
> + if (kvm_xen_hypercall_enabled(vcpu->kvm))
> + return kvm_xen_hypercall(vcpu);
> +
> + if (kvm_hv_hypercall_enabled(vcpu))
> + return kvm_hv_hypercall(vcpu);
> +
> + nr = kvm_rax_read(vcpu);
> + a0 = kvm_rbx_read(vcpu);
> + a1 = kvm_rcx_read(vcpu);
> + a2 = kvm_rdx_read(vcpu);
> + a3 = kvm_rsi_read(vcpu);
> +
> + op_64_bit = is_64_bit_mode(vcpu);
> +
> + if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> + ret = -KVM_EPERM;
> + goto out;
> + }
> +
> + ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
> out:
> if (!op_64_bit)
> ret = (u32)ret;
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:41:51

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 40/69] KVM: Export kvm_is_reserved_pfn() for use by TDX

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX will use kvm_is_reserved_pfn() to prevent installing a reserved PFN
> into the SEPT. Or rather, to prevent such an attempt, as reserved PFNs are
> not covered by TDMRs.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> virt/kvm/kvm_main.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8b075b5e7303..dd6492b526c9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -188,6 +188,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>
> return true;
> }
> +EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);
>
> bool kvm_is_transparent_hugepage(kvm_pfn_t pfn)
> {
>

As before, there's no problem in squashing this into the patch that
introduces the use of kvm_is_reserved_pfn. You could also move
kvm_is_reserved_pfn and kvm_is_zone_device_pfn to a .h file.

Paolo

2021-07-06 14:42:52

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 16/69] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Zap only leaf SPTEs when deleting/moving a memslot by default, and add a
> module param to allow reverting to the old behavior of zapping all SPTEs
> at all levels and memslots when any memslot is updated.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8d5876dfc6b7..5b8a640f8042 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -85,6 +85,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
> static bool __read_mostly force_flush_and_sync_on_reuse;
> module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
>
> +static bool __read_mostly memslot_update_zap_all;
> +module_param(memslot_update_zap_all, bool, 0444);
> +
> /*
> * When setting this variable to true it enables Two-Dimensional-Paging
> * where the hardware walks 2 page tables:
> @@ -5480,11 +5483,27 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> }
>
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + write_lock(&kvm->mmu_lock);
> + slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
> + KVM_MAX_HUGEPAGE_LEVEL, true);
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (memslot_update_zap_all)
> + kvm_mmu_zap_all_fast(kvm);
> + else
> + kvm_mmu_zap_memslot(kvm, slot);
> }
>
> void kvm_mmu_init_vm(struct kvm *kvm)
>

This is the old patch that broke VFIO for some unknown reason. The
commit message should at least say why memslot_update_zap_all is not
true by default. Also, IIUC the bug is still there with NX hugepage splits
disabled, but what if the TDP MMU is enabled?

Paolo

2021-07-06 14:43:16

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 35/69] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()

On 03/07/21 00:04, [email protected] wrote:
> -static void svm_vm_destroy(struct kvm *kvm)
> +static void svm_vm_teardown(struct kvm *kvm)
> {
> avic_vm_destroy(kvm);
> sev_vm_destroy(kvm);
> }

Please keep "destroy" as is and use "free" for the final step.

> +static void svm_vm_destroy(struct kvm *kvm)
> +{
> +

Please remove the empty lines.

Paolo

> +}
> +
> static bool svm_is_vm_type_supported(unsigned long type)
> {
> return type == KVM_X86_LEGACY_VM;
> @@ -4456,6 +4461,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .is_vm_type_supported = svm_is_vm_type_supported,
> .vm_size = sizeof(struct kvm_svm),
> .vm_init = svm_vm_init,
> + .vm_teardown = svm_vm_teardown,
> .vm_destroy = svm_vm_destroy,
>
> .prepare_guest_switch = svm_prepare_guest_switch,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 84c2df824ecc..36756a356704 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6995,6 +6995,16 @@ static int vmx_vm_init(struct kvm *kvm)
> return 0;
> }
>
> +static void vmx_vm_teardown(struct kvm *kvm)
> +{
> +
> +}
> +
> +static void vmx_vm_destroy(struct kvm *kvm)
> +{
> +
> +}
> +
> static int __init vmx_check_processor_compat(void)
> {
> struct vmcs_config vmcs_conf;
> @@ -7613,6 +7623,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
> .is_vm_type_supported = vmx_is_vm_type_supported,
> .vm_size = sizeof(struct kvm_vmx),
> .vm_init = vmx_vm_init,
> + .vm_teardown = vmx_vm_teardown,
> + .vm_destroy = vmx_vm_destroy,
>
> .vcpu_create = vmx_create_vcpu,
> .vcpu_free = vmx_free_vcpu,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index da9f1081cb03..4b436cae1732 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11043,7 +11043,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
> mutex_unlock(&kvm->slots_lock);
> }
> - static_call_cond(kvm_x86_vm_destroy)(kvm);
> + static_call(kvm_x86_vm_teardown)(kvm);
> kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
> kvm_pic_destroy(kvm);
> kvm_ioapic_destroy(kvm);
> @@ -11054,6 +11054,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> kvm_page_track_cleanup(kvm);
> kvm_xen_destroy_vm(kvm);
> kvm_hv_destroy_vm(kvm);
> + static_call_cond(kvm_x86_vm_destroy)(kvm);
> }
>
> void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>

2021-07-06 14:43:16

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 68/69] KVM: TDX: add document on TDX MODULE

On 03/07/21 00:05, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a document on how to integrate TDX MODULE into initrd so that
> TDX MODULE can be updated on kernel startup.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> Documentation/virt/kvm/tdx-module.rst | 48 +++++++++++++++++++++++++++
> 1 file changed, 48 insertions(+)
> create mode 100644 Documentation/virt/kvm/tdx-module.rst
>
> diff --git a/Documentation/virt/kvm/tdx-module.rst b/Documentation/virt/kvm/tdx-module.rst
> new file mode 100644
> index 000000000000..8beea8302f94
> --- /dev/null
> +++ b/Documentation/virt/kvm/tdx-module.rst
> @@ -0,0 +1,48 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========
> +TDX MODULE
> +==========
> +
> +Integrating TDX MODULE into initrd
> +==================================
> +If TDX is enabled in KVM (CONFIG_KVM_INTEL_TDX=y), the kernel is able to
> +load the TDX module from the initrd.
> +The related modules (seamldr.acm, libtdx.so and libtdx.so.sigstruct) need to be
> +stored in initrd.
> +
> +tdx-seam is a sample hook script for initramfs-tools.
> +TDXSEAM_SRCDIR is the directory in the host file system that stores files related
> +to TDX MODULE.
> +
> +Since how to prepare the initrd heavily depends on the distro, here's an example of how
> +to prepare an initrd.
> +(Actually this is taken from Documentation/x86/microcode.rst)
> +::
> + #!/bin/bash
> +
> + if [ -z "$1" ]; then
> + echo "You need to supply an initrd file"
> + exit 1
> + fi
> +
> + INITRD="$1"
> +
> + DSTDIR=lib/firmware/intel-seam
> + TMPDIR=/tmp/initrd
> + LIBTDX="/lib/firmware/intel-seam/seamldr.acm /lib/firmware/intel-seam/libtdx.so /lib/firmware/intel-seam/libtdx.so.sigstruct"
> +
> + rm -rf $TMPDIR
> +
> + mkdir $TMPDIR
> + cd $TMPDIR
> + mkdir -p $DSTDIR
> +
> + cp ${LIBTDX} ${DSTDIR}
> +
> + find . | cpio -o -H newc > ../tdx-seam.cpio
> + cd ..
> + mv $INITRD $INITRD.orig
> + cat tdx-seam.cpio $INITRD.orig > $INITRD
> +
> + rm -rf $TMPDIR
>

I think this belongs in a different series that adds SEAM loading?

Paolo

2021-07-06 14:43:16

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 21/69] KVM: Add max_vcpus field in common 'struct kvm'

Please replace "Add" with "Move" and add a couple lines to the commit
message.

Paolo

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 3 ---
> arch/arm64/kvm/arm.c | 7 ++-----
> arch/arm64/kvm/vgic/vgic-init.c | 6 +++---
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 3 ++-
> 5 files changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 7cd7d5c8c4bc..96a0dc3a8780 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -106,9 +106,6 @@ struct kvm_arch {
> /* VTCR_EL2 value for this VM */
> u64 vtcr;
>
> - /* The maximum number of vCPUs depends on the used GIC model */
> - int max_vcpus;
> -
> /* Interrupt controller */
> struct vgic_dist vgic;
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e720148232a0..a46306cf3106 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -145,7 +145,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> kvm_vgic_early_init(kvm);
>
> /* The maximum number of VCPUs is limited by the host's GIC model */
> - kvm->arch.max_vcpus = kvm_arm_default_max_vcpus();
> + kvm->max_vcpus = kvm_arm_default_max_vcpus();
>
> set_default_spectre(kvm);
>
> @@ -220,7 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_MAX_VCPUS:
> case KVM_CAP_MAX_VCPU_ID:
> if (kvm)
> - r = kvm->arch.max_vcpus;
> + r = kvm->max_vcpus;
> else
> r = kvm_arm_default_max_vcpus();
> break;
> @@ -299,9 +299,6 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
> if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
> return -EBUSY;
>
> - if (id >= kvm->arch.max_vcpus)
> - return -EINVAL;
> -
> return 0;
> }
>
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index 58cbda00e56d..089ac00c55d7 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -97,11 +97,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
> ret = 0;
>
> if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
> - kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS;
> + kvm->max_vcpus = VGIC_V2_MAX_CPUS;
> else
> - kvm->arch.max_vcpus = VGIC_V3_MAX_CPUS;
> + kvm->max_vcpus = VGIC_V3_MAX_CPUS;
>
> - if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) {
> + if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) {
> ret = -E2BIG;
> goto out_unlock;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e87f07c5c601..ddd4d0f68cdf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -544,6 +544,7 @@ struct kvm {
> * and is accessed atomically.
> */
> atomic_t online_vcpus;
> + int max_vcpus;
> int created_vcpus;
> int last_boosted_vcpu;
> struct list_head vm_list;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dc752d0bd3ec..52d40ea75749 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -910,6 +910,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> mutex_init(&kvm->irq_lock);
> mutex_init(&kvm->slots_lock);
> INIT_LIST_HEAD(&kvm->devices);
> + kvm->max_vcpus = KVM_MAX_VCPUS;
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> @@ -3329,7 +3330,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
> return -EINVAL;
>
> mutex_lock(&kvm->lock);
> - if (kvm->created_vcpus == KVM_MAX_VCPUS) {
> + if (kvm->created_vcpus >= kvm->max_vcpus) {
> mutex_unlock(&kvm->lock);
> return -EINVAL;
> }
>

2021-07-06 14:43:28

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 33/69] KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add hooks to cache and flush GPRs and invoke them from KVM_GET_REGS and
> KVM_SET_REGS, respectively. TDX will use the hooks to read/write GPRs
> from TDX-SEAM on-demand (for debug TDs).
>
> Cc: Tom Lendacky <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/x86.c | 6 ++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 00333af724d7..9791c4bb5198 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1248,6 +1248,8 @@ struct kvm_x86_ops {
> void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu);
> void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value);
> + void (*cache_gprs)(struct kvm_vcpu *vcpu);
> + void (*flush_gprs)(struct kvm_vcpu *vcpu);
> void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
> void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c231a88d5946..f7ae0a47e555 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9850,6 +9850,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>
> static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> {
> + if (kvm_x86_ops.cache_gprs)
> + kvm_x86_ops.cache_gprs(vcpu);
> +
> if (vcpu->arch.emulate_regs_need_sync_to_vcpu) {
> /*
> * We are here if userspace calls get_regs() in the middle of
> @@ -9924,6 +9927,9 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>
> vcpu->arch.exception.pending = false;
>
> + if (kvm_x86_ops.flush_gprs)
> + kvm_x86_ops.flush_gprs(vcpu);
> +
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> }
>
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:43:42

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 38/69] KVM: x86: Add option to force LAPIC expiration wait

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add an option to skip the IRR check in kvm_wait_lapic_expire(). This
> will be used by TDX to wait if there is an outstanding notification for
> a TD, i.e. a virtual interrupt is being triggered via posted interrupt
> processing. KVM TDX doesn't emulate PI processing, i.e. there will
> never be a bit set in IRR/ISR, so the default behavior for APICv of
> querying the IRR doesn't work as intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>

Is there a better (existing after the previous patches) flag to test, or
possibly can it use vm_type following the suggestion I gave for patch 28?

Paolo

> ---
> arch/x86/kvm/lapic.c | 4 ++--
> arch/x86/kvm/lapic.h | 2 +-
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 2 +-
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 977a704e3ff1..3cfc0485a46e 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1622,12 +1622,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> }
>
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
> {
> if (lapic_in_kernel(vcpu) &&
> vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
> vcpu->arch.apic->lapic_timer.timer_advance_ns &&
> - lapic_timer_int_injected(vcpu))
> + (force_wait || lapic_timer_int_injected(vcpu)))
> __kvm_wait_lapic_expire(vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 997c45a5963a..2bd32d86ad6f 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
>
> bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
>
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);
>
> void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
> unsigned long *vcpu_bitmap);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index bcc3fc4872a3..b12bfdbc394b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3774,7 +3774,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
> clgi();
> kvm_load_guest_xsave_state(vcpu);
>
> - kvm_wait_lapic_expire(vcpu);
> + kvm_wait_lapic_expire(vcpu, false);
>
> /*
> * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 36756a356704..7ce15a2c3490 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6727,7 +6727,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> if (enable_preemption_timer)
> vmx_update_hv_timer(vcpu);
>
> - kvm_wait_lapic_expire(vcpu);
> + kvm_wait_lapic_expire(vcpu, false);
>
> /*
> * If this vCPU has touched SPEC_CTRL, restore the guest's value if
>

2021-07-06 14:43:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 32/69] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID

On 03/07/21 00:04, [email protected] wrote:
> Let userspace, or in the case of TDX, KVM itself, enable X2APIC even if
> X2APIC is not reported as supported in the guest's CPU model. KVM
> generally does not force specific ordering between ioctls(), but this
> check would force userspace to configure CPUID before MSRs.

You already have to do this, see for example MSR_IA32_PERF_CAPABILITIES:

struct kvm_msr_entry msr_ent = {.index = msr, .data = 0};

if (!msr_info->host_initiated)
return 1;
if (guest_cpuid_has(vcpu, X86_FEATURE_PDCM) && kvm_get_msr_feature(&msr_ent))
return 1;
if (data & ~msr_ent.data)
return 1;

Is this patch necessary? If not, I think it can be dropped.

Paolo

> And for TDX, vCPUs
> will always run with X2APIC enabled, e.g. KVM will want/need to enable
> X2APIC from time zero.

2021-07-06 14:44:03

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 19/69] KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 23 ++++++++++++++---------
> arch/x86/kvm/x86.c | 4 ++++
> 3 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e088086f3de6..25c72925eb8a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1526,7 +1526,7 @@ static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
> break;
> default:
> - WARN_ON_ONCE(1);
> + KVM_BUG_ON(1, vcpu->kvm);
> }
> }
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d73ba7a6ff8d..6c043a160b30 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2360,7 +2360,7 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
> break;
> default:
> - WARN_ON_ONCE(1);
> + KVM_BUG_ON(1, vcpu->kvm);
> break;
> }
> }
> @@ -5062,6 +5062,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
> return kvm_complete_insn_gp(vcpu, err);
> case 3:
> WARN_ON_ONCE(enable_unrestricted_guest);
> +
> err = kvm_set_cr3(vcpu, val);
> return kvm_complete_insn_gp(vcpu, err);
> case 4:
> @@ -5087,14 +5088,13 @@ static int handle_cr(struct kvm_vcpu *vcpu)
> }
> break;
> case 2: /* clts */
> - WARN_ONCE(1, "Guest should always own CR0.TS");
> - vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
> - trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
> - return kvm_skip_emulated_instruction(vcpu);
> + KVM_BUG(1, vcpu->kvm, "Guest always owns CR0.TS");
> + return -EIO;
> case 1: /*mov from cr*/
> switch (cr) {
> case 3:
> WARN_ON_ONCE(enable_unrestricted_guest);
> +
> val = kvm_read_cr3(vcpu);
> kvm_register_write(vcpu, reg, val);
> trace_kvm_cr_read(cr, val);
> @@ -5404,7 +5404,9 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>
> static int handle_nmi_window(struct kvm_vcpu *vcpu)
> {
> - WARN_ON_ONCE(!enable_vnmi);
> + if (KVM_BUG_ON(!enable_vnmi, vcpu->kvm))
> + return -EIO;
> +
> exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
> ++vcpu->stat.nmi_window_exits;
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> @@ -5960,7 +5962,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> * below) should never happen as that means we incorrectly allowed a
> * nested VM-Enter with an invalid vmcs12.
> */
> - WARN_ON_ONCE(vmx->nested.nested_run_pending);
> + if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
> + return -EIO;
>
> /* If guest state is invalid, start emulating */
> if (vmx->emulation_required)
> @@ -6338,7 +6341,9 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> int max_irr;
> bool max_irr_updated;
>
> - WARN_ON(!vcpu->arch.apicv_active);
> + if (KVM_BUG_ON(!vcpu->arch.apicv_active, vcpu->kvm))
> + return -EIO;
> +
> if (pi_test_on(&vmx->pi_desc)) {
> pi_clear_on(&vmx->pi_desc);
> /*
> @@ -6421,7 +6426,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> gate_desc *desc = (gate_desc *)host_idt_base + vector;
>
> - if (WARN_ONCE(!is_external_intr(intr_info),
> + if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
> "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
> return;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cc45b2c47672..9244d1d560d5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9153,6 +9153,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> }
>
> if (kvm_request_pending(vcpu)) {
> + if (kvm_check_request(KVM_REQ_VM_BUGGED, vcpu)) {
> + r = -EIO;
> + goto out;
> + }
> if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> r = 0;
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:44:07

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 39/69] KVM: x86: Add guest_supported_xss placeholder

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add a per-vcpu placeholder for the supported XSS of the guest so that the
> TDX configuration code doesn't need to hack in manual computation of the
> supported XSS. KVM XSS enabling is currently being upstreamed, i.e.
> guest_supported_xss will no longer be a placeholder by the time TDX is
> ready for upstreaming (hopefully).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7822b531a5e2..c641654307c6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -656,6 +656,7 @@ struct kvm_vcpu_arch {
>
> u64 xcr0;
> u64 guest_supported_xcr0;
> + u64 guest_supported_xss;
>
> struct kvm_pio_request pio;
> void *pio_data;
>

Please remove guest_supported_xss from patch 66 for now, instead.

Paolo

2021-07-06 14:45:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 55/69] KVM: VMX: Add 'main.c' to wrap VMX and TDX

On 03/07/21 00:05, [email protected] wrote:
> +#include "vmx.c"

What makes it particularly hard to have this as a separate .o file
rather than an #include?

Paolo

2021-07-06 14:45:27

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 56/69] KVM: VMX: Move setting of EPT MMU masks to common VT-x code

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 4 ++++
> arch/x86/kvm/vmx/vmx.c | 4 ----
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index c3fefa0e5a63..0d8d2a0a2979 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -34,6 +34,10 @@ static __init int vt_hardware_setup(void)
> if (ret)
> return ret;
>
> + if (enable_ept)
> + kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> + cpu_has_vmx_ept_execute_only());
> +
> return 0;
> }
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 77b2b2cf76db..e315a46d1566 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7644,10 +7644,6 @@ static __init int hardware_setup(void)
>
> set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>
> - if (enable_ept)
> - kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> - cpu_has_vmx_ept_execute_only());
> -
> if (!enable_ept)
> ept_lpage_level = 0;
> else if (cpu_has_vmx_ept_1g_page())
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:46:11

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 53/69] KVM: VMX: Define EPT Violation architectural bits

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Define the EPT Violation #VE control bit, #VE info VMCS fields, and the
> suppress #VE bit for EPT entries.

Better: "KVM: VMX: Define EPT Violation #VE architectural bits".

Paolo

> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/vmx.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 035dfdafa2c1..132981276a2f 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -78,6 +78,7 @@ struct vmcs {
> #define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
> #define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
> #define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
> +#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
> #define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
> #define SECONDARY_EXEC_XSAVES VMCS_CONTROL_BIT(XSAVES)
> #define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
> @@ -226,6 +227,8 @@ enum vmcs_field {
> VMREAD_BITMAP_HIGH = 0x00002027,
> VMWRITE_BITMAP = 0x00002028,
> VMWRITE_BITMAP_HIGH = 0x00002029,
> + VE_INFO_ADDRESS = 0x0000202A,
> + VE_INFO_ADDRESS_HIGH = 0x0000202B,
> XSS_EXIT_BITMAP = 0x0000202C,
> XSS_EXIT_BITMAP_HIGH = 0x0000202D,
> ENCLS_EXITING_BITMAP = 0x0000202E,
> @@ -509,6 +512,7 @@ enum vmcs_field {
> #define VMX_EPT_IPAT_BIT (1ull << 6)
> #define VMX_EPT_ACCESS_BIT (1ull << 8)
> #define VMX_EPT_DIRTY_BIT (1ull << 9)
> +#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
> #define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
> VMX_EPT_WRITABLE_MASK | \
> VMX_EPT_EXECUTABLE_MASK)
>

2021-07-06 14:46:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 57/69] KVM: VMX: Move register caching logic to common code

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Move the guts of vmx_cache_reg() to vt_cache_reg() in preparation for
> reusing the bulk of the code for TDX, which can access guest state for
> debug TDs.
>
> Use kvm_x86_ops.cache_reg() in ept_update_paging_mode_cr0() rather than
> trying to expose vt_cache_reg() to vmx.c, even though it means taking a
> retpoline. The code runs if and only if EPT is enabled but unrestricted
> guest is disabled. Only one generation of CPU, Nehalem, supports EPT but not
> unrestricted guest, and disabling unrestricted guest without also
> disabling EPT is, to put it bluntly, dumb.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 37 +++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.c | 42 +----------------------------------------
> 2 files changed, 37 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 0d8d2a0a2979..b619615f77de 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -341,7 +341,42 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
>
> static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> {
> - vmx_cache_reg(vcpu, reg);
> + unsigned long guest_owned_bits;
> +
> + kvm_register_mark_available(vcpu, reg);
> +
> + switch (reg) {
> + case VCPU_REGS_RSP:
> + vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
> + break;
> + case VCPU_REGS_RIP:
> + vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
> + break;
> + case VCPU_EXREG_PDPTR:
> + if (enable_ept)
> + ept_save_pdptrs(vcpu);
> + break;
> + case VCPU_EXREG_CR0:
> + guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
> +
> + vcpu->arch.cr0 &= ~guest_owned_bits;
> + vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
> + break;
> + case VCPU_EXREG_CR3:
> + if (is_unrestricted_guest(vcpu) ||
> + (enable_ept && is_paging(vcpu)))
> + vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> + break;
> + case VCPU_EXREG_CR4:
> + guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
> +
> + vcpu->arch.cr4 &= ~guest_owned_bits;
> + vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
> + break;
> + default:
> + KVM_BUG_ON(1, vcpu->kvm);
> + break;
> + }
> }
>
> static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index e315a46d1566..3c3bfc80d2bb 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2326,46 +2326,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return ret;
> }
>
> -static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> -{
> - unsigned long guest_owned_bits;
> -
> - kvm_register_mark_available(vcpu, reg);
> -
> - switch (reg) {
> - case VCPU_REGS_RSP:
> - vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
> - break;
> - case VCPU_REGS_RIP:
> - vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
> - break;
> - case VCPU_EXREG_PDPTR:
> - if (enable_ept)
> - ept_save_pdptrs(vcpu);
> - break;
> - case VCPU_EXREG_CR0:
> - guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
> -
> - vcpu->arch.cr0 &= ~guest_owned_bits;
> - vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
> - break;
> - case VCPU_EXREG_CR3:
> - if (is_unrestricted_guest(vcpu) ||
> - (enable_ept && is_paging(vcpu)))
> - vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> - break;
> - case VCPU_EXREG_CR4:
> - guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
> -
> - vcpu->arch.cr4 &= ~guest_owned_bits;
> - vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
> - break;
> - default:
> - KVM_BUG_ON(1, vcpu->kvm);
> - break;
> - }
> -}
> -
> static __init int vmx_disabled_by_bios(void)
> {
> return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
> @@ -3066,7 +3026,7 @@ static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
> - vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
> + kvm_x86_ops.cache_reg(vcpu, VCPU_EXREG_CR3);
> if (!(cr0 & X86_CR0_PG)) {
> /* From paging/starting to nonpaging */
> exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
>

This shows the problem with #including vmx.c. You should have a .h file
for both vmx.h and main.h (e.g. kvm_intel.h), so that here you can just
use vt_cache_reg.

Paolo

2021-07-06 14:48:05

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 61/69] KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Move the AR_BYTES helpers to common.h so that future patches can reuse
> them to decode/encode AR for TDX.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/common.h | 41 ++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 47 ++++-----------------------------------
> 2 files changed, 45 insertions(+), 43 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index aa6a569b87d1..755aaec85199 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -4,6 +4,7 @@
>
> #include <linux/kvm_host.h>
>
> +#include <asm/kvm.h>
> #include <asm/traps.h>
> #include <asm/vmx.h>
>
> @@ -119,4 +120,44 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
>
> +static inline u32 vmx_encode_ar_bytes(struct kvm_segment *var)
> +{
> + u32 ar;
> +
> + if (var->unusable || !var->present)
> + ar = 1 << 16;
> + else {
> + ar = var->type & 15;
> + ar |= (var->s & 1) << 4;
> + ar |= (var->dpl & 3) << 5;
> + ar |= (var->present & 1) << 7;
> + ar |= (var->avl & 1) << 12;
> + ar |= (var->l & 1) << 13;
> + ar |= (var->db & 1) << 14;
> + ar |= (var->g & 1) << 15;
> + }
> +
> + return ar;
> +}
> +
> +static inline void vmx_decode_ar_bytes(u32 ar, struct kvm_segment *var)
> +{
> + var->unusable = (ar >> 16) & 1;
> + var->type = ar & 15;
> + var->s = (ar >> 4) & 1;
> + var->dpl = (ar >> 5) & 3;
> + /*
> + * Some userspaces do not preserve unusable property. Since usable
> + * segment has to be present according to VMX spec we can use present
> + * property to amend userspace bug by making unusable segment always
> + * nonpresent. vmx_encode_ar_bytes() already marks nonpresent
> + * segment as unusable.
> + */
> + var->present = !var->unusable;
> + var->avl = (ar >> 12) & 1;
> + var->l = (ar >> 13) & 1;
> + var->db = (ar >> 14) & 1;
> + var->g = (ar >> 15) & 1;
> +}
> +
> #endif /* __KVM_X86_VMX_COMMON_H */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3c3bfc80d2bb..40843ca2fb33 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -365,8 +365,6 @@ static const struct kernel_param_ops vmentry_l1d_flush_ops = {
> };
> module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
>
> -static u32 vmx_segment_access_rights(struct kvm_segment *var);
> -
> void vmx_vmexit(void);
>
> #define vmx_insn_failed(fmt...) \
> @@ -2826,7 +2824,7 @@ static void fix_rmode_seg(int seg, struct kvm_segment *save)
> vmcs_write16(sf->selector, var.selector);
> vmcs_writel(sf->base, var.base);
> vmcs_write32(sf->limit, var.limit);
> - vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(&var));
> + vmcs_write32(sf->ar_bytes, vmx_encode_ar_bytes(&var));
> }
>
> static void enter_rmode(struct kvm_vcpu *vcpu)
> @@ -3217,7 +3215,6 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> - u32 ar;
>
> if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
> *var = vmx->rmode.segs[seg];
> @@ -3231,23 +3228,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> var->base = vmx_read_guest_seg_base(vmx, seg);
> var->limit = vmx_read_guest_seg_limit(vmx, seg);
> var->selector = vmx_read_guest_seg_selector(vmx, seg);
> - ar = vmx_read_guest_seg_ar(vmx, seg);
> - var->unusable = (ar >> 16) & 1;
> - var->type = ar & 15;
> - var->s = (ar >> 4) & 1;
> - var->dpl = (ar >> 5) & 3;
> - /*
> - * Some userspaces do not preserve unusable property. Since usable
> - * segment has to be present according to VMX spec we can use present
> - * property to amend userspace bug by making unusable segment always
> - * nonpresent. vmx_segment_access_rights() already marks nonpresent
> - * segment as unusable.
> - */
> - var->present = !var->unusable;
> - var->avl = (ar >> 12) & 1;
> - var->l = (ar >> 13) & 1;
> - var->db = (ar >> 14) & 1;
> - var->g = (ar >> 15) & 1;
> + vmx_decode_ar_bytes(vmx_read_guest_seg_ar(vmx, seg), var);
> }
>
> static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> @@ -3273,26 +3254,6 @@ int vmx_get_cpl(struct kvm_vcpu *vcpu)
> }
> }
>
> -static u32 vmx_segment_access_rights(struct kvm_segment *var)
> -{
> - u32 ar;
> -
> - if (var->unusable || !var->present)
> - ar = 1 << 16;
> - else {
> - ar = var->type & 15;
> - ar |= (var->s & 1) << 4;
> - ar |= (var->dpl & 3) << 5;
> - ar |= (var->present & 1) << 7;
> - ar |= (var->avl & 1) << 12;
> - ar |= (var->l & 1) << 13;
> - ar |= (var->db & 1) << 14;
> - ar |= (var->g & 1) << 15;
> - }
> -
> - return ar;
> -}
> -
> void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -3327,7 +3288,7 @@ void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
> var->type |= 0x1; /* Accessed */
>
> - vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
> + vmcs_write32(sf->ar_bytes, vmx_encode_ar_bytes(var));
>
> out:
> vmx->emulation_required = emulation_required(vcpu);
> @@ -3374,7 +3335,7 @@ static bool rmode_segment_valid(struct kvm_vcpu *vcpu, int seg)
> var.dpl = 0x3;
> if (seg == VCPU_SREG_CS)
> var.type = 0x3;
> - ar = vmx_segment_access_rights(&var);
> + ar = vmx_encode_ar_bytes(&var);
>
> if (var.base != (var.selector << 4))
> return false;
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:48:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 62/69] KVM: VMX: MOVE GDT and IDT accessors to common code

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 6 ++++--
> arch/x86/kvm/vmx/vmx.c | 12 ------------
> 2 files changed, 4 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b619615f77de..8e03cb72b910 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -311,7 +311,8 @@ static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>
> static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> - vmx_get_idt(vcpu, dt);
> + dt->size = vmread32(vcpu, GUEST_IDTR_LIMIT);
> + dt->address = vmreadl(vcpu, GUEST_IDTR_BASE);
> }
>
> static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> @@ -321,7 +322,8 @@ static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
>
> static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> - vmx_get_gdt(vcpu, dt);
> + dt->size = vmread32(vcpu, GUEST_GDTR_LIMIT);
> + dt->address = vmreadl(vcpu, GUEST_GDTR_BASE);
> }
>
> static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 40843ca2fb33..d69d4dc7c071 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3302,24 +3302,12 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
> *l = (ar >> 13) & 1;
> }
>
> -static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> -{
> - dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
> - dt->address = vmcs_readl(GUEST_IDTR_BASE);
> -}
> -
> static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
> vmcs_writel(GUEST_IDTR_BASE, dt->address);
> }
>
> -static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> -{
> - dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
> - dt->address = vmcs_readl(GUEST_GDTR_BASE);
> -}
> -
> static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:48:41

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

On 03/07/21 00:04, [email protected] wrote:
>
> struct kvm_arch {
> + unsigned long vm_type;

Also why not just int or u8?

Paolo

2021-07-06 14:50:02

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 60/69] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add a macro framework to hide VMX vs. TDX details of VMREAD and VMWRITE
> so that VMX and TDX can share common flows, e.g. accessing DTs.
>
> Note, the TDX paths are dead code at this time. There is no great way
> to deal with the chicken-and-egg scenario of having things in place for
> TDX without first having TDX.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/common.h | 41 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 9e5865b05d47..aa6a569b87d1 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -11,6 +11,47 @@
> #include "vmcs.h"
> #include "vmx.h"
> #include "x86.h"
> +#include "tdx.h"
> +
> +#ifdef CONFIG_KVM_INTEL_TDX

Is this #ifdef needed at all if tdx.h properly stubs is_td_vcpu (to
return false) and possibly declares a dummy version of
td_vmcs_read/td_vmcs_write?

Paolo

> +#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits) \
> +static __always_inline type vmread##bits(struct kvm_vcpu *vcpu, \
> + unsigned long field) \
> +{ \
> + if (unlikely(is_td_vcpu(vcpu))) { \
> + if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm)) \
> + return 0; \
> + return td_vmcs_read##tdbits(to_tdx(vcpu), field); \
> + } \
> + return vmcs_read##bits(field); \
> +} \
> +static __always_inline void vmwrite##bits(struct kvm_vcpu *vcpu, \
> + unsigned long field, type value) \
> +{ \
> + if (unlikely(is_td_vcpu(vcpu))) { \
> + if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm)) \
> + return; \
> + return td_vmcs_write##tdbits(to_tdx(vcpu), field, value); \
> + } \
> + vmcs_write##bits(field, value); \
> +}
> +#else
> +#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits) \
> +static __always_inline type vmread##bits(struct kvm_vcpu *vcpu, \
> + unsigned long field) \
> +{ \
> + return vmcs_read##bits(field); \
> +} \
> +static __always_inline void vmwrite##bits(struct kvm_vcpu *vcpu, \
> + unsigned long field, type value) \
> +{ \
> + vmcs_write##bits(field, value); \
> +}
> +#endif /* CONFIG_KVM_INTEL_TDX */
> +VT_BUILD_VMCS_HELPERS(u16, 16, 16);
> +VT_BUILD_VMCS_HELPERS(u32, 32, 32);
> +VT_BUILD_VMCS_HELPERS(u64, 64, 64);
> +VT_BUILD_VMCS_HELPERS(unsigned long, l, 64);
>
> extern unsigned long vmx_host_idt_base;
> void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
>

2021-07-06 14:50:10

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

On 03/07/21 00:04, [email protected] wrote:
> #define KVM_PMU_EVENT_DENY 1
>
> +#define KVM_X86_LEGACY_VM 0
> +#define KVM_X86_SEV_ES_VM 1
> +#define KVM_X86_TDX_VM 2
> +

SEV-ES is not needed, and TDX_VM might be reused for SEV-SNP. Also
"legacy VM" is not really the right name. Maybe NORMAL/TRUSTED?

Paolo

2021-07-06 14:51:03

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 24/69] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add 'guest_state_protected' to mark a VM's state as being protected by
> hardware/firmware, e.g. SEV-ES or TDX-SEAM. Use the flag to disallow
> ioctls() and/or flows that attempt to access protected state.
>
> Return an error if userspace attempts to get/set register state for a
> protected VM, e.g. a non-debug TDX guest. KVM can't provide sane data;
> it's userspace's responsibility to avoid attempting to read guest state
> when it's known to be inaccessible.
>
> Retrieving vCPU events is the one exception, as the userspace VMM is
> allowed to inject NMIs.
>
> Co-developed-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/x86.c | 104 +++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 86 insertions(+), 18 deletions(-)

Looks good, but it should be checked whether it breaks QEMU for SEV-ES.
Tom, can you help?

Paolo

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 271245ffc67c..b89845dfb679 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4297,6 +4297,10 @@ static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
>
> static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
> {
> + /* TODO: use more precise flag */
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> kvm_make_request(KVM_REQ_SMI, vcpu);
>
> return 0;
> @@ -4343,6 +4347,10 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
> unsigned bank_num = mcg_cap & 0xff;
> u64 *banks = vcpu->arch.mce_banks;
>
> + /* TODO: use more precise flag */
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (mce->bank >= bank_num || !(mce->status & MCI_STATUS_VAL))
> return -EINVAL;
> /*
> @@ -4438,7 +4446,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> events->interrupt.nr = vcpu->arch.interrupt.nr;
> events->interrupt.soft = 0;
> - events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> + if (!vcpu->arch.guest_state_protected)
> + events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>
> events->nmi.injected = vcpu->arch.nmi_injected;
> events->nmi.pending = vcpu->arch.nmi_pending != 0;
> @@ -4467,11 +4476,17 @@ static void kvm_smm_changed(struct kvm_vcpu *vcpu);
> static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
> struct kvm_vcpu_events *events)
> {
> - if (events->flags & ~(KVM_VCPUEVENT_VALID_NMI_PENDING
> - | KVM_VCPUEVENT_VALID_SIPI_VECTOR
> - | KVM_VCPUEVENT_VALID_SHADOW
> - | KVM_VCPUEVENT_VALID_SMM
> - | KVM_VCPUEVENT_VALID_PAYLOAD))
> + u32 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING |
> + KVM_VCPUEVENT_VALID_SIPI_VECTOR |
> + KVM_VCPUEVENT_VALID_SHADOW |
> + KVM_VCPUEVENT_VALID_SMM |
> + KVM_VCPUEVENT_VALID_PAYLOAD;
> +
> + /* TODO: introduce more precise flag */
> + if (vcpu->arch.guest_state_protected)
> + allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING;
> +
> + if (events->flags & ~allowed_flags)
> return -EINVAL;
>
> if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
> @@ -4552,17 +4567,22 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
> return 0;
> }
>
> -static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
> - struct kvm_debugregs *dbgregs)
> +static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
> + struct kvm_debugregs *dbgregs)
> {
> unsigned long val;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db));
> kvm_get_dr(vcpu, 6, &val);
> dbgregs->dr6 = val;
> dbgregs->dr7 = vcpu->arch.dr7;
> dbgregs->flags = 0;
> memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
> +
> + return 0;
> }
>
> static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
> @@ -4576,6 +4596,9 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
> if (!kvm_dr7_valid(dbgregs->dr7))
> return -EINVAL;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> memcpy(vcpu->arch.db, dbgregs->db, sizeof(vcpu->arch.db));
> kvm_update_dr0123(vcpu);
> vcpu->arch.dr6 = dbgregs->dr6;
> @@ -4671,11 +4694,14 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
> }
> }
>
> -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> - struct kvm_xsave *guest_xsave)
> +static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> + struct kvm_xsave *guest_xsave)
> {
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (!vcpu->arch.guest_fpu)
> - return;
> + return 0;
>
> if (boot_cpu_has(X86_FEATURE_XSAVE)) {
> memset(guest_xsave, 0, sizeof(struct kvm_xsave));
> @@ -4687,6 +4713,8 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
> XFEATURE_MASK_FPSSE;
> }
> +
> + return 0;
> }
>
> #define XSAVE_MXCSR_OFFSET 24
> @@ -4697,6 +4725,9 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> u64 xstate_bv;
> u32 mxcsr;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (!vcpu->arch.guest_fpu)
> return 0;
>
> @@ -4722,18 +4753,22 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> return 0;
> }
>
> -static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
> - struct kvm_xcrs *guest_xcrs)
> +static int kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
> + struct kvm_xcrs *guest_xcrs)
> {
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
> guest_xcrs->nr_xcrs = 0;
> - return;
> + return 0;
> }
>
> guest_xcrs->nr_xcrs = 1;
> guest_xcrs->flags = 0;
> guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK;
> guest_xcrs->xcrs[0].value = vcpu->arch.xcr0;
> + return 0;
> }
>
> static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
> @@ -4741,6 +4776,9 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
> {
> int i, r = 0;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (!boot_cpu_has(X86_FEATURE_XSAVE))
> return -EINVAL;
>
> @@ -5011,7 +5049,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> case KVM_GET_DEBUGREGS: {
> struct kvm_debugregs dbgregs;
>
> - kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
> + r = kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
> + if (r)
> + break;
>
> r = -EFAULT;
> if (copy_to_user(argp, &dbgregs,
> @@ -5037,7 +5077,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> if (!u.xsave)
> break;
>
> - kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
> + r = kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
> + if (r)
> + break;
>
> r = -EFAULT;
> if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave)))
> @@ -5061,7 +5103,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> if (!u.xcrs)
> break;
>
> - kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
> + r = kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
> + if (r)
> + break;
>
> r = -EFAULT;
> if (copy_to_user(argp, u.xcrs,
> @@ -9735,6 +9779,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> goto out;
> }
>
> + if (vcpu->arch.guest_state_protected &&
> + (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs)) {
> + r = -EINVAL;
> + goto out;
> + }
> +
> if (kvm_run->kvm_dirty_regs) {
> r = sync_regs(vcpu);
> if (r != 0)
> @@ -9765,7 +9815,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>
> out:
> kvm_put_guest_fpu(vcpu);
> - if (kvm_run->kvm_valid_regs)
> + if (kvm_run->kvm_valid_regs && !vcpu->arch.guest_state_protected)
> store_regs(vcpu);
> post_kvm_run_save(vcpu);
> kvm_sigset_deactivate(vcpu);
> @@ -9812,6 +9862,9 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>
> int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> {
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> vcpu_load(vcpu);
> __get_regs(vcpu, regs);
> vcpu_put(vcpu);
> @@ -9852,6 +9905,9 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>
> int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> {
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> vcpu_load(vcpu);
> __set_regs(vcpu, regs);
> vcpu_put(vcpu);
> @@ -9912,6 +9968,9 @@ static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
> int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
> struct kvm_sregs *sregs)
> {
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> vcpu_load(vcpu);
> __get_sregs(vcpu, sregs);
> vcpu_put(vcpu);
> @@ -10112,6 +10171,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
> {
> int ret;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> vcpu_load(vcpu);
> ret = __set_sregs(vcpu, sregs);
> vcpu_put(vcpu);
> @@ -10205,6 +10267,9 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
> {
> struct fxregs_state *fxsave;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (!vcpu->arch.guest_fpu)
> return 0;
>
> @@ -10228,6 +10293,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
> {
> struct fxregs_state *fxsave;
>
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> +
> if (!vcpu->arch.guest_fpu)
> return 0;
>
>
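The pattern in the hunk above — accessors changed from `void` to `int` so that register-state ioctls can be rejected when guest state is protected — can be sketched outside the kernel as a minimal C model. The struct and field names here are illustrative stand-ins for the KVM originals, not the real kernel types:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative stand-ins for the KVM structures touched by the patch. */
struct vcpu {
	bool guest_state_protected;	/* set for TDX/SEV-ES style protected guests */
	unsigned long db[4];
	unsigned long dr6, dr7;
};

struct debugregs {
	unsigned long db[4];
	unsigned long dr6, dr7;
	unsigned long flags;
};

/*
 * Models kvm_vcpu_ioctl_x86_get_debugregs() after the change: the accessor
 * returns an error instead of void, so the ioctl dispatcher can bail out
 * before copying anything to userspace.
 */
static int get_debugregs(struct vcpu *vcpu, struct debugregs *dbgregs)
{
	if (vcpu->guest_state_protected)
		return -EINVAL;	/* register state is not visible to the VMM */

	memcpy(dbgregs->db, vcpu->db, sizeof(vcpu->db));
	dbgregs->dr6 = vcpu->dr6;
	dbgregs->dr7 = vcpu->dr7;
	dbgregs->flags = 0;
	return 0;
}
```

The same guard is applied uniformly in the patch to the xsave, xcrs, regs, sregs and fpu accessors, and to the `kvm_valid_regs`/`kvm_dirty_regs` paths in `KVM_RUN`.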

2021-07-06 14:51:19

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

Based on the initial review, I think patches 2-3-17-18-19-20-23-49 can
already be merged for 5.15.

The next part should be the introduction of vm_types, blocking ioctls
depending on the vm_type (patches 24-31) and possibly applying this
block already to SEV-ES.

On 03/07/21 00:04, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> * What's TDX?
> TDX stands for Trust Domain Extensions which isolates VMs from the
> virtual-machine manager (VMM)/hypervisor and any other software on the
> platform. [1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
> available.
>
>
> * The goal of this RFC patch
> The purpose of this post is to get early feedback on high-level design issues of
> the KVM enhancement for TDX. Detailed coding issues (variable naming, etc.) are
> not addressed here. This patch series is incomplete (not working), hence the
> RFC. Although multiple software components need to be updated, not only KVM but
> also QEMU, guest Linux and the virtual BIOS, this series includes only the KVM
> VMM part. For those curious about changes to the other components, public
> repositories are available on github. [8], [9]
>
>
> * Patch organization
> Patch 66 is the main change. The preceding patches (01-65) refactor the code
> and introduce additional hooks.
>
> - 01-12: preparations: introduce architecture constants, refactor code, and
>          export symbols for the following patches.
> - 13-40: introduce the new VM type and allow the coexistence of multiple VM
>          types. Allow/disallow KVM ioctls where appropriate; in particular,
>          convert per-system ioctls to per-VM ioctls.
> - 41-65: refactoring KVM VMX/MMU and adding new hooks for Secure EPT.
> - 66: main patch to add "basic" support for building/running TDX.
> - 67: trace points for
> - 68-69: Documentation
>
> * TODOs
> The following major features are omitted to keep this patch series small.
>
> - load/initialize TDX module
> split out from this patch series.
> - unmapping private page
> Will integrate Kirill's patch to show how kvm will utilize it.
> - qemu gdb stub support
> - Large page support
> - guest PMU support
> - TDP MMU support
> - and more
>
> Changes from v1:
> - rebase to v5.13
> - drop load/initialization of TDX module
> - catch up the update of related specifications.
> - rework on C-wrapper function to invoke seamcall
> - various code clean up
>
> [1] TDX specification
> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
> [3] Intel CPU Architectural Extensions Specification
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 EAS
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
> [5] Intel TDX Loader Interface Specification
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> [7] Intel TDX Virtual Firmware Design Guide
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
> [8] intel public github
> kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> TDX guest branch: https://github.com/intel/tdx/tree/guest
> qemu TDX https://github.com/intel/qemu-tdx
> [9] TDVF
> https://github.com/tianocore/edk2-staging/tree/TDVF
>
> Isaku Yamahata (11):
> KVM: TDX: introduce config for KVM TDX support
> KVM: X86: move kvm_cpu_vmxon() from vmx.c to virtext.h
> KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86
> KVM: TDX: add a helper function for kvm to call seamcall
> KVM: TDX: add trace point before/after TDX SEAMCALLs
> KVM: TDX: Print the name of SEAMCALL status code
> KVM: Add per-VM flag to mark read-only memory as unsupported
> KVM: x86: add per-VM flags to disable SMI/INIT/SIPI
> KVM: TDX: add trace point for TDVMCALL and SEPT operation
> KVM: TDX: add document on TDX MODULE
> Documentation/virtual/kvm: Add Trust Domain Extensions(TDX)
>
> Kai Huang (2):
> KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level
> routes
> cpu/hotplug: Document that TDX also depends on booting CPUs once
>
> Rick Edgecombe (1):
> KVM: x86: Add infrastructure for stolen GPA bits
>
> Sean Christopherson (53):
> KVM: TDX: Add TDX "architectural" error codes
> KVM: TDX: Add architectural definitions for structures and values
> KVM: TDX: define and export helper functions for KVM TDX support
> KVM: TDX: Add C wrapper functions for TDX SEAMCALLs
> KVM: Export kvm_io_bus_read for use by TDX for PV MMIO
> KVM: Enable hardware before doing arch VM initialization
> KVM: x86: Split core of hypercall emulation to helper function
> KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO
> KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default
> KVM: Add infrastructure and macro to mark VM as bugged
> KVM: Export kvm_make_all_cpus_request() for use in marking VMs as
> bugged
> KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the
> VM
> KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID
> KVM: Add max_vcpus field in common 'struct kvm'
> KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
> KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()
> KVM: x86: Introduce "protected guest" concept and block disallowed
> ioctls
> KVM: x86: Add per-VM flag to disable direct IRQ injection
> KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE
> KVM: x86: Add flag to mark TSC as immutable (for TDX)
> KVM: Add per-VM flag to disable dirty logging of memslots for TDs
> KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
> KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()
> KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP
> KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
> KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> behavior
> KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
> KVM: x86: Add option to force LAPIC expiration wait
> KVM: x86: Add guest_supported_xss placholder
> KVM: Export kvm_is_reserved_pfn() for use by TDX
> KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> KVM: x86/mmu: Allow non-zero init value for shadow PTE
> KVM: x86/mmu: Refactor shadow walk in __direct_map() to reduce
> indentation
> KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
> KVM: x86/mmu: Frame in support for private/inaccessible shadow pages
> KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> KVM: VMX: Modify NMI and INTR handlers to take intr_info as param
> KVM: VMX: Move NMI/exception handler to common helper
> KVM: x86/mmu: Allow per-VM override of the TDP max page level
> KVM: VMX: Split out guts of EPT violation to common/exposed function
> KVM: VMX: Define EPT Violation architectural bits
> KVM: VMX: Define VMCS encodings for shared EPT pointer
> KVM: VMX: Add 'main.c' to wrap VMX and TDX
> KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> KVM: VMX: Move register caching logic to common code
> KVM: TDX: Define TDCALL exit reason
> KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs
> KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h
> KVM: VMX: MOVE GDT and IDT accessors to common code
> KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX
> code
> KVM: TDX: Add "basic" support for building and running Trust Domains
>
> Xiaoyao Li (2):
> KVM: TDX: Introduce pr_seamcall_ex_ret_info() to print more info when
> SEAMCALL fails
> KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
>
> Documentation/virt/kvm/api.rst | 6 +-
> Documentation/virt/kvm/intel-tdx.rst | 441 ++++++
> Documentation/virt/kvm/tdx-module.rst | 48 +
> arch/arm64/include/asm/kvm_host.h | 3 -
> arch/arm64/kvm/arm.c | 7 +-
> arch/arm64/kvm/vgic/vgic-init.c | 6 +-
> arch/x86/Kbuild | 1 +
> arch/x86/include/asm/cpufeatures.h | 2 +
> arch/x86/include/asm/kvm-x86-ops.h | 8 +
> arch/x86/include/asm/kvm_boot.h | 30 +
> arch/x86/include/asm/kvm_host.h | 55 +-
> arch/x86/include/asm/virtext.h | 25 +
> arch/x86/include/asm/vmx.h | 17 +
> arch/x86/include/uapi/asm/kvm.h | 60 +
> arch/x86/include/uapi/asm/vmx.h | 7 +-
> arch/x86/kernel/asm-offsets_64.c | 15 +
> arch/x86/kvm/Kconfig | 11 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/boot/Makefile | 6 +
> arch/x86/kvm/boot/seam/tdx_common.c | 242 +++
> arch/x86/kvm/boot/seam/tdx_common.h | 13 +
> arch/x86/kvm/ioapic.c | 4 +
> arch/x86/kvm/irq_comm.c | 13 +-
> arch/x86/kvm/lapic.c | 7 +-
> arch/x86/kvm/lapic.h | 2 +-
> arch/x86/kvm/mmu.h | 31 +-
> arch/x86/kvm/mmu/mmu.c | 526 +++++--
> arch/x86/kvm/mmu/mmu_internal.h | 3 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 +-
> arch/x86/kvm/mmu/spte.c | 15 +-
> arch/x86/kvm/mmu/spte.h | 18 +-
> arch/x86/kvm/svm/svm.c | 18 +-
> arch/x86/kvm/trace.h | 138 ++
> arch/x86/kvm/vmx/common.h | 178 +++
> arch/x86/kvm/vmx/main.c | 1098 ++++++++++++++
> arch/x86/kvm/vmx/posted_intr.c | 6 +
> arch/x86/kvm/vmx/seamcall.S | 64 +
> arch/x86/kvm/vmx/seamcall.h | 68 +
> arch/x86/kvm/vmx/tdx.c | 1958 +++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 267 ++++
> arch/x86/kvm/vmx/tdx_arch.h | 370 +++++
> arch/x86/kvm/vmx/tdx_errno.h | 202 +++
> arch/x86/kvm/vmx/tdx_ops.h | 218 +++
> arch/x86/kvm/vmx/tdx_stubs.c | 45 +
> arch/x86/kvm/vmx/vmcs.h | 11 -
> arch/x86/kvm/vmx/vmenter.S | 146 ++
> arch/x86/kvm/vmx/vmx.c | 509 ++-----
> arch/x86/kvm/x86.c | 285 +++-
> include/linux/kvm_host.h | 51 +-
> include/uapi/linux/kvm.h | 2 +
> kernel/cpu.c | 4 +
> tools/arch/x86/include/uapi/asm/kvm.h | 55 +
> tools/include/uapi/linux/kvm.h | 2 +
> virt/kvm/kvm_main.c | 44 +-
> 54 files changed, 6717 insertions(+), 672 deletions(-)
> create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> create mode 100644 Documentation/virt/kvm/tdx-module.rst
> create mode 100644 arch/x86/include/asm/kvm_boot.h
> create mode 100644 arch/x86/kvm/boot/Makefile
> create mode 100644 arch/x86/kvm/boot/seam/tdx_common.c
> create mode 100644 arch/x86/kvm/boot/seam/tdx_common.h
> create mode 100644 arch/x86/kvm/vmx/common.h
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/seamcall.S
> create mode 100644 arch/x86/kvm/vmx/seamcall.h
> create mode 100644 arch/x86/kvm/vmx/tdx.c
> create mode 100644 arch/x86/kvm/vmx/tdx.h
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> create mode 100644 arch/x86/kvm/vmx/tdx_stubs.c
>

2021-07-06 14:51:56

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 37/69] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
> interrupt to support TDX's usage of APICv. Unlike VMX, TDX doesn't have
> access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
> i.e. needs to generate a posted interrupt and more importantly can't
> manually move requested interrupts into the vIRR (which it also doesn't
> have access to).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/x86.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f1d5e0a53640..92d5a6649a21 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11341,7 +11341,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>
> if (kvm_arch_interrupt_allowed(vcpu) &&
> (kvm_cpu_has_interrupt(vcpu) ||
> - kvm_guest_apic_has_interrupt(vcpu)))
> + kvm_guest_apic_has_interrupt(vcpu) ||
> + (vcpu->arch.apicv_active &&
> + kvm_x86_ops.dy_apicv_has_pending_interrupt(vcpu))))
> return true;
>
> if (kvm_hv_has_stimer_pending(vcpu))
>

Please remove "dy_" from the name of the callback, and use the static
call. Also, if it makes sense, please consider using the same test as
for patch 38 to choose *between* either kvm_cpu_has_interrupt() +
kvm_guest_apic_has_interrupt() or
kvm_x86_ops.dy_apicv_has_pending_interrupt().
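Paolo's suggestion — renaming the callback without the "dy_" prefix and choosing *between* the hardware-assisted check and the software checks — could look roughly like the following userspace sketch. A plain function pointer stands in for the kernel's static-call machinery, and all names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vcpu {
	bool apicv_active;
};

/* Stand-in for the kvm_x86_ops callback (renamed without the "dy_" prefix). */
struct x86_ops {
	bool (*apicv_has_pending_interrupt)(struct vcpu *vcpu);
};

static struct x86_ops ops;

/* Legacy software checks, stubbed out for the sketch. */
static bool cpu_has_interrupt(struct vcpu *vcpu) { (void)vcpu; return false; }
static bool guest_apic_has_interrupt(struct vcpu *vcpu) { (void)vcpu; return false; }

/* Callback used in the test below: hardware always reports a pending interrupt. */
static bool always_pending(struct vcpu *vcpu) { (void)vcpu; return true; }

/*
 * Either/or structure as suggested: when APICv is active (and the callback
 * is implemented), consult the hardware-posted state; otherwise fall back
 * to the software checks.
 */
static bool vcpu_has_interrupt(struct vcpu *vcpu)
{
	if (vcpu->apicv_active && ops.apicv_has_pending_interrupt)
		return ops.apicv_has_pending_interrupt(vcpu);
	return cpu_has_interrupt(vcpu) || guest_apic_has_interrupt(vcpu);
}
```

In the kernel itself the dispatch would go through `static_call()` rather than a raw function pointer, per the review comment.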

Paolo

2021-07-06 14:52:03

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 54/69] KVM: VMX: Define VMCS encodings for shared EPT pointer

On 03/07/21 00:05, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add the VMCS field encoding for the shared EPTP, which will be used by
> TDX to have separate EPT walks for private GPAs (existing EPTP) versus
> shared GPAs (new shared EPTP).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/vmx.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 132981276a2f..56b3d32941fd 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -235,6 +235,8 @@ enum vmcs_field {
> ENCLS_EXITING_BITMAP_HIGH = 0x0000202F,
> TSC_MULTIPLIER = 0x00002032,
> TSC_MULTIPLIER_HIGH = 0x00002033,
> + SHARED_EPT_POINTER = 0x0000203C,
> + SHARED_EPT_POINTER_HIGH = 0x0000203D,
> GUEST_PHYSICAL_ADDRESS = 0x00002400,
> GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
> VMCS_LINK_POINTER = 0x00002800,
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:52:44

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 49/69] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Pass intr_info to the NMI and INTR handlers instead of pulling it from
> vcpu_vmx in preparation for sharing the bulk of the handlers with TDX.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7ce15a2c3490..e08f85c93e55 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6404,25 +6404,24 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
> kvm_after_interrupt(vcpu);
> }
>
> -static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
> +static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> {
> const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
> - u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
>
> /* if exit due to PF check for async PF */
> if (is_page_fault(intr_info))
> - vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> + vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> /* Handle machine checks before interrupts are enabled */
> else if (is_machine_check(intr_info))
> kvm_machine_check();
> /* We need to handle NMIs before interrupts are enabled */
> else if (is_nmi(intr_info))
> - handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
> + handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
> }
>
> -static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> +static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> + u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(vcpu);
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> gate_desc *desc = (gate_desc *)host_idt_base + vector;
>
> @@ -6438,9 +6437,9 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> - handle_external_interrupt_irqoff(vcpu);
> + handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
> else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
> - handle_exception_nmi_irqoff(vmx);
> + handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
> }
>
> /*
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:55:00

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

Based on the initial review, I think patches 2-3-17-18-19-20-23-49 can
already be merged for 5.15.

The next part should be the introduction of vm_types, blocking ioctls
depending on the vm_type (patches 24-31). Perhaps this blocking should
be applied already to SEV-ES, so that the corresponding code in QEMU can
be added early.

Paolo

On 03/07/21 00:04, [email protected] wrote:
> [...]

2021-07-06 14:55:11

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On 03/07/21 00:04, [email protected] wrote:
> From: Rick Edgecombe <[email protected]>
>
> Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> perspective) to a single GPA (from a memslot perspective). GPA aliasing
> will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> execute-only permission bit to the guest. To keep the implementation
> simple (relatively speaking), GPA aliasing is only supported via TDP.
>
> Today KVM assumes two things that are broken by GPA aliasing.
> 1. GPAs coming from hardware can be simply shifted to get the GFNs.
> 2. GPA bits 51:MAXPHYADDR are reserved to zero.
>
> With GPA aliasing, translating a GPA to GFN requires masking off the
> repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
>
> To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> that is, bits stolen from the GPA to act as new virtualized attribute
> bits. A bit in the mask will cause the MMU code to create aliases of the
> GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> fault.
>
> To handle case (1) from above, retain any stolen bits when passing a GPA
> in KVM's MMU code, but strip them when converting to a GFN so that the
> GFN contains only the "real" GFN, i.e. never has repurposed bits set.
>
> GFNs (without stolen bits) continue to be used to:
> -Specify physical memory by userspace via memslots
> -Map GPAs to TDP PTEs via RMAP
> -Specify dirty tracking and write protection
> -Look up MTRR types
> -Inject async page faults
>
> Since there are now multiple aliases for the same aliased GPA, when
> userspace memory backing the memslots is paged out, both aliases need to be
> modified. Fortunately this happens automatically. Since rmap supports
> multiple mappings for the same GFN for PTE shadowing based paging, by
> adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> operations will be applied to both aliases.
>
> In the case of the rmap being removed in the future, the needed
> information could be recovered by iterating over the stolen bits and
> walking the TDP page tables.
>
> For TLB flushes that are address based, make sure to flush both aliases
> in the stolen bits case.
>
> Only support stolen bits in 64 bit guest paging modes (long, PAE).
> Features that use this infrastructure should restrict the stolen bits to
> exclude the other paging modes. Don't support stolen bits for shadow EPT.
>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>

Looks good, but the commit message is obsolete.

Paolo

> arch/x86/kvm/mmu.h | 26 ++++++++++
> arch/x86/kvm/mmu/mmu.c | 86 ++++++++++++++++++++++-----------
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 ++++++----
> 4 files changed, 101 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 88d0ed5225a4..69b82857acdb 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -232,4 +232,30 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> int kvm_mmu_post_init_vm(struct kvm *kvm);
> void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
>
> +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> +{
> + /* Currently there are no stolen bits in KVM */
> + return 0;
> +}
> +
> +static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> + return kvm_gfn_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> +{
> + return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
> +}
> +
> +static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> + return kvm_gpa_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> + return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
> +}
> +
> #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0dc4bf34ce9c..990ee645b8a2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -188,27 +188,37 @@ static inline bool kvm_available_flush_tlb_with_range(void)
> return kvm_x86_ops.tlb_remote_flush_with_range;
> }
>
> -static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> - struct kvm_tlb_range *range)
> -{
> - int ret = -ENOTSUPP;
> -
> - if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> - ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> -
> - if (ret)
> - kvm_flush_remote_tlbs(kvm);
> -}
> -
> void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
> u64 start_gfn, u64 pages)
> {
> struct kvm_tlb_range range;
> + u64 gfn_stolen_mask;
> +
> + if (!kvm_available_flush_tlb_with_range())
> + goto generic_flush;
> +
> + /*
> + * Fall back to the big hammer flush if there is more than one
> + * GPA alias that needs to be flushed.
> + */
> + gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> + if (hweight64(gfn_stolen_mask) > 1)
> + goto generic_flush;
>
> range.start_gfn = start_gfn;
> range.pages = pages;
> + if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
> + goto generic_flush;
> +
> + if (!gfn_stolen_mask)
> + return;
>
> - kvm_flush_remote_tlbs_with_range(kvm, &range);
> + range.start_gfn |= gfn_stolen_mask;
> + static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
> + return;
> +
> +generic_flush:
> + kvm_flush_remote_tlbs(kvm);
> }
>
> bool is_nx_huge_page_enabled(void)
> @@ -1949,14 +1959,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
> __clear_sp_write_flooding_count(sptep_to_sp(spte));
> }
>
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> - gfn_t gfn,
> - gva_t gaddr,
> - unsigned level,
> - int direct,
> - unsigned int access)
> +static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> + gfn_t gfn,
> + gfn_t gfn_stolen_bits,
> + gva_t gaddr,
> + unsigned int level,
> + int direct,
> + unsigned int access)
> {
> bool direct_mmu = vcpu->arch.mmu->direct_map;
> + gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
> union kvm_mmu_page_role role;
> struct hlist_head *sp_list;
> unsigned quadrant;
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> role.quadrant = quadrant;
> }
>
> - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> - if (sp->gfn != gfn) {
> + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> collisions++;
> continue;
> }
> @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> sp = kvm_mmu_alloc_page(vcpu, direct);
>
> sp->gfn = gfn;
> + sp->gfn_stolen_bits = gfn_stolen_bits;
> sp->role = role;
> hlist_add_head(&sp->hash_link, sp_list);
> if (!direct) {
> @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> return sp;
> }
>
> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> + gva_t gaddr, unsigned int level,
> + int direct, unsigned int access)
> +{
> + return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
> +}
> +
> static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> struct kvm_vcpu *vcpu, hpa_t root,
> u64 addr)
> @@ -2637,7 +2657,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>
> gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
> slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
> - if (!slot)
> +
> + /* Don't map private memslots for stolen bits */
> + if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
> return -1;
>
> ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
> @@ -2827,7 +2849,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> struct kvm_shadow_walk_iterator it;
> struct kvm_mmu_page *sp;
> int level, req_level, ret;
> - gfn_t gfn = gpa >> PAGE_SHIFT;
> + gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
> + gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> + gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
> gfn_t base_gfn = gfn;
>
> if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> @@ -2852,8 +2876,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>
> drop_large_spte(vcpu, it.sptep);
> if (!is_shadow_present_pte(*it.sptep)) {
> - sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
> - it.level - 1, true, ACC_ALL);
> + sp = __kvm_mmu_get_page(vcpu, base_gfn,
> + gfn_stolen_bits, it.addr,
> + it.level - 1, true, ACC_ALL);
>
> link_shadow_page(vcpu, it.sptep, sp);
> if (is_tdp && huge_page_disallowed &&
> @@ -3689,6 +3714,13 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> return true;
>
> + /* Don't expose aliases for no slot GFNs or private memslots */
> + if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
> + !kvm_is_visible_memslot(slot)) {
> + *pfn = KVM_PFN_NOSLOT;
> + return false;
> + }
> +
> /* Don't expose private memslots to L2. */
> if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
> *pfn = KVM_PFN_NOSLOT;
> @@ -3723,7 +3755,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> bool write = error_code & PFERR_WRITE_MASK;
> bool map_writable;
>
> - gfn_t gfn = gpa >> PAGE_SHIFT;
> + gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
> unsigned long mmu_seq;
> kvm_pfn_t pfn;
> hva_t hva;
> @@ -3833,7 +3865,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> max_level > PG_LEVEL_4K;
> max_level--) {
> int page_num = KVM_PAGES_PER_HPAGE(max_level);
> - gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
> + gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);
>
> if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index d64ccb417c60..c896ec9f3159 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -46,6 +46,7 @@ struct kvm_mmu_page {
> */
> union kvm_mmu_page_role role;
> gfn_t gfn;
> + gfn_t gfn_stolen_bits;
>
> u64 *spt;
> /* hold the gfn of each spte inside spt */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 823a5919f9fa..439dc141391b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -25,7 +25,8 @@
> #define guest_walker guest_walker64
> #define FNAME(name) paging##64_##name
> #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> - #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
> + PT64_LVL_ADDR_MASK(lvl))
> #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> #define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -44,7 +45,7 @@
> #define guest_walker guest_walker32
> #define FNAME(name) paging##32_##name
> #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> - #define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
> #define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
> #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
> #define PT_LEVEL_BITS PT32_LEVEL_BITS
> @@ -58,7 +59,7 @@
> #define guest_walker guest_walkerEPT
> #define FNAME(name) ept_##name
> #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> - #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
> #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> #define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -75,7 +76,7 @@
> #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
>
> #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
>
> /*
> * The guest_walker structure emulates the behavior of the hardware page
> @@ -96,9 +97,9 @@ struct guest_walker {
> struct x86_exception fault;
> };
>
> -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
> {
> - return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> + return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
> }
>
> static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> @@ -366,7 +367,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> --walker->level;
>
> index = PT_INDEX(addr, walker->level);
> - table_gfn = gpte_to_gfn(pte);
> + table_gfn = gpte_to_gfn(vcpu, pte);
> offset = index * sizeof(pt_element_t);
> pte_gpa = gfn_to_gpa(table_gfn) + offset;
>
> @@ -432,7 +433,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> if (unlikely(errcode))
> goto error;
>
> - gfn = gpte_to_gfn_lvl(pte, walker->level);
> + gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
> gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
>
> if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> @@ -537,12 +538,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> gfn_t gfn;
> kvm_pfn_t pfn;
>
> + WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
> +
> if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
> return false;
>
> pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
>
> - gfn = gpte_to_gfn(gpte);
> + gfn = gpte_to_gfn(vcpu, gpte);
> pte_access = sp->role.access & FNAME(gpte_access)(gpte);
> FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
> @@ -652,6 +655,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
>
> direct_access = gw->pte_access;
>
> + WARN_ON(addr & vcpu_gpa_stolen_mask(vcpu));
> +
> top_level = vcpu->arch.mmu->root_level;
> if (top_level == PT32E_ROOT_LEVEL)
> top_level = PT32_ROOT_LEVEL;
> @@ -1067,7 +1072,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> continue;
> }
>
> - gfn = gpte_to_gfn(gpte);
> + gfn = gpte_to_gfn(vcpu, gpte);
> pte_access = sp->role.access;
> pte_access &= FNAME(gpte_access)(gpte);
> FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
>

2021-07-06 14:56:17

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 52/69] KVM: VMX: Split out guts of EPT violation to common/exposed function

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/common.h | 29 +++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 33 +++++----------------------------
> 2 files changed, 34 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 81c73f30d01d..9e5865b05d47 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -5,8 +5,11 @@
> #include <linux/kvm_host.h>
>
> #include <asm/traps.h>
> +#include <asm/vmx.h>
>
> +#include "mmu.h"
> #include "vmcs.h"
> +#include "vmx.h"
> #include "x86.h"
>
> extern unsigned long vmx_host_idt_base;
> @@ -49,4 +52,30 @@ static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
> }
>
> +static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> + unsigned long exit_qualification)
> +{
> + u64 error_code;
> +
> + /* Is it a read fault? */
> + error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
> + ? PFERR_USER_MASK : 0;
> + /* Is it a write fault? */
> + error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
> + ? PFERR_WRITE_MASK : 0;
> + /* Is it a fetch fault? */
> + error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
> + ? PFERR_FETCH_MASK : 0;
> + /* ept page table entry is present? */
> + error_code |= (exit_qualification &
> + (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
> + EPT_VIOLATION_EXECUTABLE))
> + ? PFERR_PRESENT_MASK : 0;
> +
> + error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
> + PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> +
> + return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +}
> +
> #endif /* __KVM_X86_VMX_COMMON_H */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 452d4d1400db..8a104a54121b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5328,11 +5328,10 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
>
> static int handle_ept_violation(struct kvm_vcpu *vcpu)
> {
> - unsigned long exit_qualification;
> - gpa_t gpa;
> - u64 error_code;
> + unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
> + gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>
> - exit_qualification = vmx_get_exit_qual(vcpu);
> + trace_kvm_page_fault(gpa, exit_qualification);
>
> /*
> * EPT violation happened while executing iret from NMI,
> @@ -5341,31 +5340,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * AAK134, BY25.
> */
> if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
> - enable_vnmi &&
> - (exit_qualification & INTR_INFO_UNBLOCK_NMI))
> + enable_vnmi && (exit_qualification & INTR_INFO_UNBLOCK_NMI))
> vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
>
> - gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> - trace_kvm_page_fault(gpa, exit_qualification);
> -
> - /* Is it a read fault? */
> - error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
> - ? PFERR_USER_MASK : 0;
> - /* Is it a write fault? */
> - error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
> - ? PFERR_WRITE_MASK : 0;
> - /* Is it a fetch fault? */
> - error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
> - ? PFERR_FETCH_MASK : 0;
> - /* ept page table entry is present? */
> - error_code |= (exit_qualification &
> - (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
> - EPT_VIOLATION_EXECUTABLE))
> - ? PFERR_PRESENT_MASK : 0;
> -
> - error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
> - PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> -
> vcpu->arch.exit_qualification = exit_qualification;
>
> /*
> @@ -5379,7 +5356,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
> return kvm_emulate_instruction(vcpu, 0);
>
> - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> + return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> }
>
> static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>

This should be in main.c, not in a header (and named
__vt_handle_ept_qualification).

Paolo

2021-07-06 14:56:35

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 42/69] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Explicitly check for an MMIO spte in the fast page fault flow. TDX will
> use a not-present entry for MMIO sptes, which can be mistaken for an
> access-tracked spte since both have SPTE_SPECIAL_MASK set.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 990ee645b8a2..631b92e6e9ba 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3060,7 +3060,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> break;
>
> sp = sptep_to_sp(iterator.sptep);
> - if (!is_last_spte(spte, sp->role.level))
> + if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
> break;
>
> /*
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:57:49

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 44/69] KVM: x86/mmu: Refactor shadow walk in __direct_map() to reduce indentation

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Employ a 'continue' to reduce the indentation for linking a new shadow
> page during __direct_map() in preparation for linking private pages.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1c40dfd05979..0259781cee6a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2910,16 +2910,15 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> break;
>
> drop_large_spte(vcpu, it.sptep);
> - if (!is_shadow_present_pte(*it.sptep)) {
> - sp = __kvm_mmu_get_page(vcpu, base_gfn,
> - gfn_stolen_bits, it.addr,
> - it.level - 1, true, ACC_ALL);
> -
> - link_shadow_page(vcpu, it.sptep, sp);
> - if (is_tdp && huge_page_disallowed &&
> - req_level >= it.level)
> - account_huge_nx_page(vcpu->kvm, sp);
> - }
> + if (is_shadow_present_pte(*it.sptep))
> + continue;
> +
> + sp = __kvm_mmu_get_page(vcpu, base_gfn, gfn_stolen_bits,
> + it.addr, it.level - 1, true, ACC_ALL);
> +
> + link_shadow_page(vcpu, it.sptep, sp);
> + if (is_tdp && huge_page_disallowed && req_level >= it.level)
> + account_huge_nx_page(vcpu->kvm, sp);
> }
>
> ret = mmu_set_spte(vcpu, it.sptep, ACC_ALL,
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:57:59

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 43/69] KVM: x86/mmu: Allow non-zero init value for shadow PTE

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX will run with EPT violation #VEs enabled, which means KVM needs to
> set the "suppress #VE" bit in unused PTEs to avoid unintentionally
> reflecting not-present EPT violations into the guest.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 50 +++++++++++++++++++++++++++++++++++------
> arch/x86/kvm/mmu/spte.c | 10 +++++++++
> arch/x86/kvm/mmu/spte.h | 2 ++
> 4 files changed, 56 insertions(+), 7 deletions(-)

Please ensure that this also works for tdp_mmu.c (if anything, consider
supporting TDX only for TDP MMU; it's quite likely that mmu.c support
for EPT/NPT will go away).

Paolo

> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 69b82857acdb..6ec8d9fdff35 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -61,6 +61,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
>
> void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> +void kvm_mmu_set_spte_init_value(u64 init_value);
>
> void
> reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 631b92e6e9ba..1c40dfd05979 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -550,9 +550,9 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
> u64 old_spte = *sptep;
>
> if (!spte_has_volatile_bits(old_spte))
> - __update_clear_spte_fast(sptep, 0ull);
> + __update_clear_spte_fast(sptep, shadow_init_value);
> else
> - old_spte = __update_clear_spte_slow(sptep, 0ull);
> + old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
>
> if (!is_shadow_present_pte(old_spte))
> return 0;
> @@ -582,7 +582,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
> */
> static void mmu_spte_clear_no_track(u64 *sptep)
> {
> - __update_clear_spte_fast(sptep, 0ull);
> + __update_clear_spte_fast(sptep, shadow_init_value);
> }
>
> static u64 mmu_spte_get_lockless(u64 *sptep)
> @@ -660,6 +660,42 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> local_irq_enable();
> }
>
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +#ifdef CONFIG_X86_64
> + int ign;
> +
> + asm volatile (
> + "rep stosq\n\t"
> + : "=c"(ign), "=D"(page)
> + : "a"(shadow_init_value), "c"(4096/8), "D"(page)
> + : "memory"
> + );
> +#else
> + BUG();
> +#endif
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> + int start, end, i, r;
> +
> + if (shadow_init_value)
> + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +
> + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> +
> + if (shadow_init_value) {
> + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> + for (i = start; i < end; i++)
> + kvm_init_shadow_page(mc->objects[i]);
> + }
> + return 0;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -669,8 +705,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_shadow_page_cache(vcpu);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -3041,7 +3076,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> struct kvm_shadow_walk_iterator iterator;
> struct kvm_mmu_page *sp;
> int ret = RET_PF_INVALID;
> - u64 spte = 0ull;
> + u64 spte = shadow_init_value;
> uint retry_count = 0;
>
> if (!page_fault_can_be_fast(error_code))
> @@ -5383,7 +5418,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + if (!shadow_init_value)
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 66d43cec0c31..0b931f1c2210 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -34,6 +34,7 @@ u64 __read_mostly shadow_mmio_access_mask;
> u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_mask;
> u64 __read_mostly shadow_acc_track_mask;
> +u64 __read_mostly shadow_init_value;
>
> u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -211,6 +212,14 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
> return new_spte;
> }
>
> +void kvm_mmu_set_spte_init_value(u64 init_value)
> +{
> + if (WARN_ON(!IS_ENABLED(CONFIG_X86_64) && init_value))
> + init_value = 0;
> + shadow_init_value = init_value;
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_spte_init_value);
> +
> static u8 kvm_get_shadow_phys_bits(void)
> {
> /*
> @@ -355,6 +364,7 @@ void kvm_mmu_reset_all_pte_masks(void)
> shadow_present_mask = PT_PRESENT_MASK;
> shadow_acc_track_mask = 0;
> shadow_me_mask = sme_me_mask;
> + shadow_init_value = 0;
>
> shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITEABLE;
> shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITEABLE;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index bca0ba11cccf..f88cf3db31c7 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -152,6 +152,8 @@ extern u64 __read_mostly shadow_mmio_access_mask;
> extern u64 __read_mostly shadow_present_mask;
> extern u64 __read_mostly shadow_me_mask;
>
> +extern u64 __read_mostly shadow_init_value;
> +
> /*
> * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
> * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
>

2021-07-06 14:59:16

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 45/69] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Return the old SPTE when clearing a SPTE and push the "old SPTE present"
> check to the caller. Private shadow page support will use the old SPTE
> in rmap_remove() to determine whether or not there is a linked private
> shadow page.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0259781cee6a..6b0c8c84aabe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -542,9 +542,9 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> * Rules for using mmu_spte_clear_track_bits:
> * It sets the sptep from present to nonpresent, and track the
> * state bits, it is used to clear the last level sptep.
> - * Returns non-zero if the PTE was previously valid.
> + * Returns the old PTE.
> */
> -static int mmu_spte_clear_track_bits(u64 *sptep)
> +static u64 mmu_spte_clear_track_bits(u64 *sptep)
> {
> kvm_pfn_t pfn;
> u64 old_spte = *sptep;
> @@ -555,7 +555,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
> old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
>
> if (!is_shadow_present_pte(old_spte))
> - return 0;
> + return old_spte;
>
> pfn = spte_to_pfn(old_spte);
>
> @@ -572,7 +572,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
> if (is_dirty_spte(old_spte))
> kvm_set_pfn_dirty(pfn);
>
> - return 1;
> + return old_spte;
> }
>
> /*
> @@ -1104,7 +1104,9 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>
> static void drop_spte(struct kvm *kvm, u64 *sptep)
> {
> - if (mmu_spte_clear_track_bits(sptep))
> + u64 old_spte = mmu_spte_clear_track_bits(sptep);
> +
> + if (is_shadow_present_pte(old_spte))
> rmap_remove(kvm, sptep);
> }
>
>

Reviewed-by: Paolo Bonzini <[email protected]>

2021-07-06 14:59:47

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 51/69] KVM: x86/mmu: Allow per-VM override of the TDP max page level

On 03/07/21 00:04, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TODO: This is tentative patch. Support large page and delete this patch.
>
> Allow TDX to effectively disable large pages, as SEPT will initially
> support only 4k pages.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 4 +++-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9631b985ebdc..a47e17892258 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -989,6 +989,7 @@ struct kvm_arch {
> unsigned long n_requested_mmu_pages;
> unsigned long n_max_mmu_pages;
> unsigned int indirect_shadow_pages;
> + int tdp_max_page_level;
> u8 mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> struct list_head active_mmu_pages;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 82db62753acb..4ee6d7803f18 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4084,7 +4084,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> kvm_pfn_t pfn;
> int max_level;
>
> - for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
> + for (max_level = vcpu->kvm->arch.tdp_max_page_level;
> max_level > PG_LEVEL_4K;
> max_level--) {
> int page_num = KVM_PAGES_PER_HPAGE(max_level);
> @@ -5802,6 +5802,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
> node->track_write = kvm_mmu_pte_write;
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
> +
> + kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
> }
>
> void kvm_mmu_uninit_vm(struct kvm *kvm)
>

Seems good enough for now.

Reviewed-by: Paolo Bonzini <[email protected]>

Paolo

2021-07-06 19:05:40

by Brijesh Singh

[permalink] [raw]
Subject: Re: [RFC PATCH v2 28/69] KVM: Add per-VM flag to mark read-only memory as unsupported



On 7/6/21 9:03 AM, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> Add a flag for TDX to flag RO memory as unsupported and propagate it to
>> KVM_MEM_READONLY to allow reporting RO memory as unsupported on a per-VM
>> basis.  TDX1 doesn't expose permission bits to the VMM in the SEPT
>> tables, i.e. doesn't support read-only private memory.
>>
>> Signed-off-by: Isaku Yamahata <[email protected]>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> ---
>>   arch/x86/kvm/x86.c       | 4 +++-
>>   include/linux/kvm_host.h | 4 ++++
>>   virt/kvm/kvm_main.c      | 8 +++++---
>>   3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cd9407982366..87212d7563ae 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3897,7 +3897,6 @@ int kvm_vm_ioctl_check_extension(struct kvm
>> *kvm, long ext)
>>       case KVM_CAP_ASYNC_PF_INT:
>>       case KVM_CAP_GET_TSC_KHZ:
>>       case KVM_CAP_KVMCLOCK_CTRL:
>> -    case KVM_CAP_READONLY_MEM:
>>       case KVM_CAP_HYPERV_TIME:
>>       case KVM_CAP_IOAPIC_POLARITY_IGNORED:
>>       case KVM_CAP_TSC_DEADLINE_TIMER:
>> @@ -4009,6 +4008,9 @@ int kvm_vm_ioctl_check_extension(struct kvm
>> *kvm, long ext)
>>           if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
>>               r |= BIT(KVM_X86_TDX_VM);
>>           break;
>> +    case KVM_CAP_READONLY_MEM:
>> +        r = kvm && kvm->readonly_mem_unsupported ? 0 : 1;
>> +        break;
>>       default:
>>           break;
>>       }
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index ddd4d0f68cdf..7ee7104b4b59 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -597,6 +597,10 @@ struct kvm {
>>       unsigned int max_halt_poll_ns;
>>       u32 dirty_ring_size;
>> +#ifdef __KVM_HAVE_READONLY_MEM
>> +    bool readonly_mem_unsupported;
>> +#endif
>> +
>>       bool vm_bugged;
>>   };
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 52d40ea75749..63d0c2833913 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1258,12 +1258,14 @@ static void update_memslots(struct
>> kvm_memslots *slots,
>>       }
>>   }
>> -static int check_memory_region_flags(const struct
>> kvm_userspace_memory_region *mem)
>> +static int check_memory_region_flags(struct kvm *kvm,
>> +                     const struct kvm_userspace_memory_region *mem)
>>   {
>>       u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>>   #ifdef __KVM_HAVE_READONLY_MEM
>> -    valid_flags |= KVM_MEM_READONLY;
>> +    if (!kvm->readonly_mem_unsupported)
>> +        valid_flags |= KVM_MEM_READONLY;
>>   #endif
>>       if (mem->flags & ~valid_flags)
>> @@ -1436,7 +1438,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>>       int as_id, id;
>>       int r;
>> -    r = check_memory_region_flags(mem);
>> +    r = check_memory_region_flags(kvm, mem);
>>       if (r)
>>           return r;
>>
>
> For all these flags, which of these limitations will be common to SEV-ES
> and SEV-SNP (ExtINT injection, MCE injection, changing TSC, read-only
> memory, dirty logging)?  Would it make sense to use vm_type instead of
> all of them?  I guess this also guides the choice of whether to use a
> single vm-type for TDX and SEV-SNP or two.  Probably two is better, and
> there can be static inline bool functions to derive the support flags
> from the vm-type.
>

SEV-ES does not need any of these flags. However, with SEV-SNP, we may
be able to use ExtINT injection.

-Brijesh

2021-07-08 15:21:47

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH v2 43/69] KVM: x86/mmu: Allow non-zero init value for shadow PTE

On Tue, Jul 06, 2021 at 04:56:07PM +0200,
Paolo Bonzini <[email protected]> wrote:

> On 03/07/21 00:04, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > TDX will run with EPT violation #VEs enabled, which means KVM needs to
> > set the "suppress #VE" bit in unused PTEs to avoid unintentionally
> > reflecting not-present EPT violations into the guest.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 50 +++++++++++++++++++++++++++++++++++------
> > arch/x86/kvm/mmu/spte.c | 10 +++++++++
> > arch/x86/kvm/mmu/spte.h | 2 ++
> > 4 files changed, 56 insertions(+), 7 deletions(-)
>
> Please ensure that this also works for tdp_mmu.c (if anything, consider
> supporting TDX only for TDP MMU; it's quite likely that mmu.c support for
> EPT/NPT will go away).

It's on my TODO list. Will address it.
--
Isaku Yamahata <[email protected]>

2021-07-08 15:23:26

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH v2 55/69] KVM: VMX: Add 'main.c' to wrap VMX and TDX

On Tue, Jul 06, 2021 at 04:43:22PM +0200,
Paolo Bonzini <[email protected]> wrote:

> On 03/07/21 00:05, [email protected] wrote:
> > +#include "vmx.c"
>
> What makes it particularly hard to have this as a separate .o file rather
> than an #include?

It's to let the compiler optimize the function calls of "if (tdx) tdx_xxx() else vmx_xxx()",
given the x86_ops static call story.
--
Isaku Yamahata <[email protected]>

2021-07-08 15:30:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 55/69] KVM: VMX: Add 'main.c' to wrap VMX and TDX

On 08/07/21 17:21, Isaku Yamahata wrote:
> On Tue, Jul 06, 2021 at 04:43:22PM +0200,
> Paolo Bonzini <[email protected]> wrote:
>
>> On 03/07/21 00:05, [email protected] wrote:
>>> +#include "vmx.c"
>>
>> What makes it particularly hard to have this as a separate .o file rather
>> than an #include?
>
> It's to let the compiler optimize the function calls of "if (tdx) tdx_xxx() else vmx_xxx()",
> given the x86_ops static call story.

As long as it's not an indirect call, not inlining tdx_xxx and vmx_xxx
is unlikely to give a lot of benefit.

What you could do is use a static branch that bypasses the
"is_tdx_vcpu/vm" check if no TDX VM is running. A similar technique is
used to bypass the test for in-kernel APIC if all VMs have it.

Paolo

2021-07-13 18:01:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 03/69] KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86

On Fri, Jul 02, 2021, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> This is preparation for TDX support.
>
> Because SEAMCALL instruction requires VMX enabled, it needs to initialize
> struct vmcs and load it before SEAMCALL instruction.[1] [2] Move out the
> definition of vmcs into a common x86 header, arch/x86/include/asm/vmx.h, so
> that seamloader code can share the same definition.
^^^^^^^^^^
SEAMLDR?

I don't have a strong preference on what we call it, but we should be consistent
in our usage.

Same comments as the first two patches: without seeing the actual SEAMLDR code
it's impossible to review this patch. I certainly have no objection to splitting
up this behemoth, but the series should be self-contained (within reason).

2021-07-13 18:13:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 65/69] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

Nit on the shortlog (applies to other patches too): please use "x86", not "X86".

Thanks!

2021-07-13 18:15:37

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 65/69] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:05, [email protected] wrote:
> > From: Xiaoyao Li <[email protected]>
> >
> > Introduce a per-vm variable initial_tsc_khz to hold the default tsc_khz
> > for kvm_arch_vcpu_create().
> >
> > This field is going to be used by TDX since TSC frequency for TD guest
> > is configured at TD VM initialization phase.
> >
> > Signed-off-by: Xiaoyao Li <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/x86.c | 3 ++-
> > 2 files changed, 3 insertions(+), 1 deletion(-)
>
> So this means disabling TSC frequency scaling on TDX. Would it make sense
> to delay VM creation to a separate ioctl, similar to KVM_ARM_VCPU_FINALIZE
> (KVM_VM_FINALIZE)?

There's an equivalent of that in the next mega-patch, the KVM_TDX_INIT_VM sub-ioctl
of KVM_MEMORY_ENCRYPT_OP. The TSC frequency for the guest gets provided at that
time.

2021-07-13 19:36:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/69] KVM: TDX: add trace point before/after TDX SEAMCALLs

On Fri, Jul 02, 2021, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/trace.h | 80 ++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/seamcall.h | 22 ++++++++-
> arch/x86/kvm/vmx/tdx_arch.h | 47 ++++++++++++++++++
> arch/x86/kvm/vmx/tdx_errno.h | 96 ++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 2 +
> 5 files changed, 246 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 4f839148948b..c3398d0de9a7 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -8,6 +8,9 @@
> #include <asm/clocksource.h>
> #include <asm/pvclock-abi.h>
>
> +#include "vmx/tdx_arch.h"
> +#include "vmx/tdx_errno.h"
> +
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM kvm
>
> @@ -659,6 +662,83 @@ TRACE_EVENT(kvm_nested_vmexit_inject,
> __entry->exit_int_info, __entry->exit_int_info_err)
> );
>
> +/*
> + * Tracepoint for the start of TDX SEAMCALLs.
> + */
> +TRACE_EVENT(kvm_tdx_seamcall_enter,

To avoid confusion, I think it makes sense to avoid "enter" and "exit". E.g.
my first reaction was that the tracepoint was specific to TDENTER. And under
the hood, SEAMCALL is technically an exit :-)

What about kvm_tdx_seamcall and kvm_tdx_seamret? If the seamret usage is too
much of a stretch, kvm_tdx_seamcall_begin/end?

> + TP_PROTO(int cpuid, __u64 op, __u64 rcx, __u64 rdx, __u64 r8,
> + __u64 r9, __u64 r10),
> + TP_ARGS(cpuid, op, rcx, rdx, r8, r9, r10),

"cpuid" is potentially confusing without looking at the caller. pcpu or pcpu_id
would be preferable.

> diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
> index a318940f62ed..2c83ab46eeac 100644
> --- a/arch/x86/kvm/vmx/seamcall.h
> +++ b/arch/x86/kvm/vmx/seamcall.h
> @@ -9,12 +9,32 @@
> #else
>
> #ifndef seamcall
> +#include "trace.h"
> +
> struct tdx_ex_ret;
> asmlinkage u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
> struct tdx_ex_ret *ex);
>
> +static inline u64 _seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
> + struct tdx_ex_ret *ex)
> +{
> + u64 err;
> +
> + trace_kvm_tdx_seamcall_enter(smp_processor_id(), op,
> + rcx, rdx, r8, r9, r10);
> + err = __seamcall(op, rcx, rdx, r8, r9, r10, ex);

What was the motivation behind switching from the macro magic[*] to a dedicated
asm subroutine? The macros are gross, but IMO they yielded much more readable
code for the upper level helpers, which is what people will look at the vast
majority of time. E.g.

static inline u64 tdh_sys_lp_shutdown(void)
{
return seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, 0, NULL);
}

static inline u64 tdh_mem_track(hpa_t tdr)
{
return seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, 0, NULL);
}

versus

static inline u64 tdsysshutdownlp(void)
{
seamcall_0(TDSYSSHUTDOWNLP);
}

static inline u64 tdtrack(hpa_t tdr)
{
seamcall_1(TDTRACK, tdr);
}


The new approach also generates very suboptimal code due to the need to shuffle
registers everywhere, e.g. gcc doesn't inline _seamcall because it's a whopping
200+ bytes.

[*] https://patchwork.kernel.org/project/kvm/patch/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/

> + if (ex)
> + trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err, ex->rcx,

smp_processor_id() is not stable since this code runs with IRQs and preemption
enabled, e.g. if the task is preempted between the tracepoint and the actual
SEAMCALL then the tracepoint may be wrong. There could also be weirdly "nested"
tracepoints since migrating the task will generate TDH_VP_FLUSH.

> + ex->rdx, ex->r8, ex->r9, ex->r10,
> + ex->r11);
> + else
> + trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err,
> + 0, 0, 0, 0, 0, 0);
> + return err;
> +}
> +
> #define seamcall(op, rcx, rdx, r8, r9, r10, ex) \
> - __seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
> + _seamcall(SEAMCALL_##op, (rcx), (rdx), (r8), (r9), (r10), (ex))
> #endif

2021-07-13 19:56:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/69] KVM: TDX: add trace point before/after TDX SEAMCALLs

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > + trace_kvm_tdx_seamcall_enter(smp_processor_id(), op,
> > + rcx, rdx, r8, r9, r10);
> > + err = __seamcall(op, rcx, rdx, r8, r9, r10, ex);
> > + if (ex)
> > + trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err, ex->rcx,
> > + ex->rdx, ex->r8, ex->r9, ex->r10,
> > + ex->r11);
> > + else
> > + trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err,
> > + 0, 0, 0, 0, 0, 0);
>
> Would it make sense to do the zeroing of ex directly in __seamcall in case
> there is an error?

A better option would be to pass "ex" into the tracepoint. tdx_arch.h is already
included by trace.h (though I'm not sure that's a good thing), and the cost of
checking ex against NULL over and over is a non-issue because it's buried in the
tracepoint, i.e. hidden behind a patch nop. The below reduces the footprint of
_seamcall by 100+ bytes of code, presumably due to avoiding even more register
shuffling (I didn't look too closely).

That said, I'm not sure adding generic tracepoints is a good idea. The flows
that truly benefit from tracepoints will likely want to provide more/different
information, e.g. the entry/exit flow already uses kvm_trace_entry/exit, and the
SEPT flows have dedicated tracepoints. For flows like tdh_vp_flush(), which
might benefit from a tracepoint, they'll only provide the host PA of the TDVPR,
which is rather useless on its own. It's probably possible to cross-reference
everything to understand what's going on, but it certainly won't be easy.

I can see the generic tracepoint being somewhat useful for debugging early
development and/or a new TDX module, but otherwise I think it will be mostly
overhead. E.g. if a TDX failure pops up in production, enabling the tracepoint
might not even be viable. And even for the cases where the tracepoint is useful,
I would be quite surprised if additional instrumentation wasn't needed to debug
non-trivial issues.


diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 58631124f08d..e2868f6d84f8 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -701,9 +701,8 @@ TRACE_EVENT(kvm_tdx_seamcall_enter,
* Tracepoint for the end of TDX SEAMCALLs.
*/
TRACE_EVENT(kvm_tdx_seamcall_exit,
- TP_PROTO(int cpuid, __u64 op, __u64 err, __u64 rcx, __u64 rdx, __u64 r8,
- __u64 r9, __u64 r10, __u64 r11),
- TP_ARGS(cpuid, op, err, rcx, rdx, r8, r9, r10, r11),
+ TP_PROTO(int cpuid, __u64 op, __u64 err, struct tdx_ex_ret *ex),
+ TP_ARGS(cpuid, op, err, ex),

TP_STRUCT__entry(
__field( int, cpuid )
@@ -721,12 +720,12 @@ TRACE_EVENT(kvm_tdx_seamcall_exit,
__entry->cpuid = cpuid;
__entry->op = op;
__entry->err = err;
- __entry->rcx = rcx;
- __entry->rdx = rdx;
- __entry->r8 = r8;
- __entry->r9 = r9;
- __entry->r10 = r10;
- __entry->r11 = r11;
+ __entry->rcx = ex ? ex->rcx : 0;
+ __entry->rdx = ex ? ex->rdx : 0;
+ __entry->r8 = ex ? ex->r8 : 0;
+ __entry->r9 = ex ? ex->r9 : 0;
+ __entry->r10 = ex ? ex->r10 : 0;
+ __entry->r11 = ex ? ex->r11 : 0;
),

TP_printk("cpu: %d op: %s err %s 0x%llx rcx: 0x%llx rdx: 0x%llx r8: 0x%llx r9: 0x%llx r10: 0x%llx r11: 0x%llx",
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
index 85eeedc06a4f..b2067f7e6a9d 100644
--- a/arch/x86/kvm/vmx/seamcall.h
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -23,13 +23,8 @@ static inline u64 _seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
trace_kvm_tdx_seamcall_enter(smp_processor_id(), op,
rcx, rdx, r8, r9, r10);
err = __seamcall(op, rcx, rdx, r8, r9, r10, ex);
- if (ex)
- trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err, ex->rcx,
- ex->rdx, ex->r8, ex->r9, ex->r10,
- ex->r11);
- else
- trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err,
- 0, 0, 0, 0, 0, 0);
+ trace_kvm_tdx_seamcall_exit(smp_processor_id(), op, err, ex);
+
return err;
}


2021-07-13 20:00:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 09/69] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > +static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
> > +{
> > + return seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
> > +}
> > +
>
> Since you have wrappers anyway, I don't like having an extra macro level
> just to remove the SEAMCALL_ prefix. It messes up editors that look up the
> symbols.

True. On the other hand, prefixing SEAMCALL_ over and over is tedious and adds
a lot of noise. What if we drop SEAMCALL_ from the #defines? The prefix made
sense when there was no additional namespace, e.g. instead of having a bare
TDCREATE, but that's mostly a non-issue now that all the SEAMCALLs are namespaced
with TDH_.

2021-07-13 20:23:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/69] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Zap only leaf SPTEs when deleting/moving a memslot by default, and add a
> > module param to allow reverting to the old behavior of zapping all SPTEs
> > at all levels and memslots when any memslot is updated.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++++-
> > 1 file changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8d5876dfc6b7..5b8a640f8042 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -85,6 +85,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
> > static bool __read_mostly force_flush_and_sync_on_reuse;
> > module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> > +static bool __read_mostly memslot_update_zap_all;
> > +module_param(memslot_update_zap_all, bool, 0444);
> > +
> > /*
> > * When setting this variable to true it enables Two-Dimensional-Paging
> > * where the hardware walks 2 page tables:
> > @@ -5480,11 +5483,27 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > }
> > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > +{
> > + /*
> > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > + * case scenario we'll have unused shadow pages lying around until they
> > + * are recycled due to age or when the VM is destroyed.
> > + */
> > + write_lock(&kvm->mmu_lock);
> > + slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
> > + KVM_MAX_HUGEPAGE_LEVEL, true);
> > + write_unlock(&kvm->mmu_lock);
> > +}
> > +
> > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> > struct kvm_page_track_notifier_node *node)
> > {
> > - kvm_mmu_zap_all_fast(kvm);
> > + if (memslot_update_zap_all)
> > + kvm_mmu_zap_all_fast(kvm);
> > + else
> > + kvm_mmu_zap_memslot(kvm, slot);
> > }
> > void kvm_mmu_init_vm(struct kvm *kvm)
> >
>
> This is the old patch that broke VFIO for some unknown reason.

Yes, my white whale :-/

> The commit message should at least say why memslot_update_zap_all is not true
> by default. Also, IIUC the bug still there with NX hugepage splits disabled,

I strongly suspect the bug is also there with hugepage splits enabled, it's just
masked and/or harder to hit.

> but what if the TDP MMU is enabled?

This should not be a module param. IIRC, the original code I wrote had it as a
per-VM flag that wasn't even exposed to the user, i.e. TDX guests always do the
partial flush and non-TDX guests always do the full flush. I think that's the
least awful approach if we can't figure out the underlying bug before TDX is
ready for inclusion.

2021-07-13 20:30:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 20/69] KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID

On Fri, Jul 02, 2021, [email protected] wrote:
> From: Sean Christopherson <[email protected]>

For a changelog:

Mark a VM as bugged instead of simply warning if the core page fault
handler unexpectedly returns RET_PF_INVALID. KVM's (undocumented) API
for the page fault path does not allow returning RET_PF_INVALID, e.g. a
fatal condition should be morphed to -errno.

> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---

2021-07-13 20:38:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 21/69] KVM: Add max_vcpus field in common 'struct kvm'

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> Please replace "Add" with "Move" and add a couple lines to the commit
> message.

Move arm's per-VM max_vcpus field into the generic "struct kvm", and use
it to check vcpus_created in the generic code instead of checking only
the hardcoded absolute KVM-wide max. x86 TDX guests will reuse the
generic check verbatim, as the max number of vCPUs for a TDX guest is
user defined at VM creation and immutable thereafter.

2021-07-13 20:41:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > struct kvm_arch {
> > + unsigned long vm_type;
>
> Also why not just int or u8?

Heh, because kvm_dev_ioctl_create_vm() takes an "unsigned long" for the type and
it felt wrong to store it as something else. Storing it as a smaller field should
be fine, I highly doubt we'll get to 256 types anytime soon :-)

I think kvm_x86_ops.is_vm_type_supported() should take the full size though.

2021-07-13 20:53:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 38/69] KVM: x86: Add option to force LAPIC expiration wait

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Add an option to skip the IRR check in kvm_wait_lapic_expire(). This
> > will be used by TDX to wait if there is an outstanding notification for
> > a TD, i.e. a virtual interrupt is being triggered via posted interrupt
> > processing. KVM TDX doesn't emulate PI processing, i.e. there will
> > never be a bit set in IRR/ISR, so the default behavior for APICv of
> > querying the IRR doesn't work as intended.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
>
> Is there a better (existing after the previous patches) flag to test, or
> possibly can it use vm_type following the suggestion I gave for patch 28?

Not sure if there's a "better" flag, but there's most definitely a flag somewhere
that will suffice :-)

2021-07-13 20:57:52

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 60/69] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:05, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Add a macro framework to hide VMX vs. TDX details of VMREAD and VMWRITE
> > so the VMX and TDX can shared common flows, e.g. accessing DTs.
> >
> > Note, the TDX paths are dead code at this time. There is no great way
> > to deal with the chicken-and-egg scenario of having things in place for
> > TDX without first having TDX.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/common.h | 41 +++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 41 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> > index 9e5865b05d47..aa6a569b87d1 100644
> > --- a/arch/x86/kvm/vmx/common.h
> > +++ b/arch/x86/kvm/vmx/common.h
> > @@ -11,6 +11,47 @@
> > #include "vmcs.h"
> > #include "vmx.h"
> > #include "x86.h"
> > +#include "tdx.h"
> > +
> > +#ifdef CONFIG_KVM_INTEL_TDX
>
> Is this #ifdef needed at all if tdx.h properly stubs is_td_vcpu (to return
> false) and possibly declares a dummy version of td_vmcs_read/td_vmcs_write?

IIRC, it requires dummy versions of is_debug_td() and all the ##bits variants of
td_vmcs_read/write(). I'm not sure if I ever actually tried that, e.g. to see
if the compiler completely elided the TDX crud when CONFIG_KVM_INTEL_TDX=n.

2021-07-13 21:00:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 52/69] KVM: VMX: Split out guts of EPT violation to common/exposed function

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > +static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> > + unsigned long exit_qualification)
> > +{

...

> > +}
> > +

...

> > @@ -5379,7 +5356,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
> > return kvm_emulate_instruction(vcpu, 0);
> > - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > + return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> > }
> > static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> >
>
> This should be in main.c, not in a header (and named
> __vt_handle_ept_qualification).

Yar, though I'm guessing you meant __vt_handle_ept_violation?

2021-07-13 21:05:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 51/69] KVM: x86/mmu: Allow per-VM override of the TDP max page level

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > TODO: This is tentative patch. Support large page and delete this patch.
> >
> > Allow TDX to effectively disable large pages, as SEPT will initially
> > support only 4k pages.

...

> Seems good enough for now.

Looks like SNP needs a dynamic check, i.e. a kvm_x86_ops hook, to handle an edge
case in the RMP. That's probably the better route given that this is a short-term
hack (hopefully :-D).

2021-07-13 21:13:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 63/69] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code

On Tue, Jul 06, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:05, [email protected] wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index d69d4dc7c071..d31cace67907 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1467,15 +1467,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> > u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
> > {
> > - u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
> > - int ret = 0;
> > -
> > - if (interruptibility & GUEST_INTR_STATE_STI)
> > - ret |= KVM_X86_SHADOW_INT_STI;
> > - if (interruptibility & GUEST_INTR_STATE_MOV_SS)
> > - ret |= KVM_X86_SHADOW_INT_MOV_SS;
> > -
> > - return ret;
> > + return __vmx_get_interrupt_shadow(vcpu);
> > }
> > void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> >
>
> Is there any reason to add the __ version, since at this point kvm_x86_ops
> is already pointing to vt_get_interrupt_shadow?

Yeah, no idea what I was thinking, the whole thing can be moved as is, just need
to delete the prototype in vmx.h.

2021-07-20 22:11:36

by Tom Lendacky

[permalink] [raw]
Subject: Re: [RFC PATCH v2 24/69] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls

On 7/6/21 8:59 AM, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
>> From: Sean Christopherson <[email protected]>
>>
>> Add 'guest_state_protected' to mark a VM's state as being protected by
>> hardware/firmware, e.g. SEV-ES or TDX-SEAM.  Use the flag to disallow
>> ioctls() and/or flows that attempt to access protected state.
>>
>> Return an error if userspace attempts to get/set register state for a
>> protected VM, e.g. a non-debug TDX guest.  KVM can't provide sane data,
>> it's userspace's responsibility to avoid attempting to read guest state
>> when it's known to be inaccessible.
>>
>> Retrieving vCPU events is the one exception, as the userspace VMM is
>> allowed to inject NMIs.
>>
>> Co-developed-by: Xiaoyao Li <[email protected]>
>> Signed-off-by: Xiaoyao Li <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Isaku Yamahata <[email protected]>
>> ---
>>   arch/x86/kvm/x86.c | 104 +++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 86 insertions(+), 18 deletions(-)
>
> Looks good, but it should be checked whether it breaks QEMU for SEV-ES.
>  Tom, can you help?

Sorry to take so long to get back to you... been really slammed, let me
look into this a bit more. But, some quick thoughts...

Offhand, the SMI isn't a problem since SEV-ES doesn't support SMM.

For kvm_vcpu_ioctl_x86_{get,set}_xsave(), can TDX use what was added for
SEV-ES:
ed02b213098a ("KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest")

Same for kvm_arch_vcpu_ioctl_{get,set}_fpu().

The changes to kvm_arch_vcpu_ioctl_{get,set}_sregs() might cause issues,
since there are specific things allowed in __{get,set}_sregs. But I'll
need to dig a bit more on that.

Thanks,
Tom

>
> Paolo
>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 271245ffc67c..b89845dfb679 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4297,6 +4297,10 @@ static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
>>     static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
>>   {
>> +    /* TODO: use more precise flag */
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       kvm_make_request(KVM_REQ_SMI, vcpu);
>>         return 0;
>> @@ -4343,6 +4347,10 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct
>> kvm_vcpu *vcpu,
>>       unsigned bank_num = mcg_cap & 0xff;
>>       u64 *banks = vcpu->arch.mce_banks;
>>   +    /* TODO: use more precise flag */
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (mce->bank >= bank_num || !(mce->status & MCI_STATUS_VAL))
>>           return -EINVAL;
>>       /*
>> @@ -4438,7 +4446,8 @@ static void
>> kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>>           vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
>>       events->interrupt.nr = vcpu->arch.interrupt.nr;
>>       events->interrupt.soft = 0;
>> -    events->interrupt.shadow =
>> static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>> +    if (!vcpu->arch.guest_state_protected)
>> +        events->interrupt.shadow =
>> static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>>         events->nmi.injected = vcpu->arch.nmi_injected;
>>       events->nmi.pending = vcpu->arch.nmi_pending != 0;
>> @@ -4467,11 +4476,17 @@ static void kvm_smm_changed(struct kvm_vcpu *vcpu);
>>   static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>>                             struct kvm_vcpu_events *events)
>>   {
>> -    if (events->flags & ~(KVM_VCPUEVENT_VALID_NMI_PENDING
>> -                  | KVM_VCPUEVENT_VALID_SIPI_VECTOR
>> -                  | KVM_VCPUEVENT_VALID_SHADOW
>> -                  | KVM_VCPUEVENT_VALID_SMM
>> -                  | KVM_VCPUEVENT_VALID_PAYLOAD))
>> +    u32 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING |
>> +                KVM_VCPUEVENT_VALID_SIPI_VECTOR |
>> +                KVM_VCPUEVENT_VALID_SHADOW |
>> +                KVM_VCPUEVENT_VALID_SMM |
>> +                KVM_VCPUEVENT_VALID_PAYLOAD;
>> +
>> +    /* TODO: introduce more precise flag */
>> +    if (vcpu->arch.guest_state_protected)
>> +        allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING;
>> +
>> +    if (events->flags & ~allowed_flags)
>>           return -EINVAL;
>>         if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
>> @@ -4552,17 +4567,22 @@ static int
>> kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>>       return 0;
>>   }
>>   -static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
>> -                         struct kvm_debugregs *dbgregs)
>> +static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
>> +                        struct kvm_debugregs *dbgregs)
>>   {
>>       unsigned long val;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db));
>>       kvm_get_dr(vcpu, 6, &val);
>>       dbgregs->dr6 = val;
>>       dbgregs->dr7 = vcpu->arch.dr7;
>>       dbgregs->flags = 0;
>>       memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
>> +
>> +    return 0;
>>   }
>>     static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
>> @@ -4576,6 +4596,9 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct
>> kvm_vcpu *vcpu,
>>       if (!kvm_dr7_valid(dbgregs->dr7))
>>           return -EINVAL;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       memcpy(vcpu->arch.db, dbgregs->db, sizeof(vcpu->arch.db));
>>       kvm_update_dr0123(vcpu);
>>       vcpu->arch.dr6 = dbgregs->dr6;
>> @@ -4671,11 +4694,14 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8
>> *src)
>>       }
>>   }
>>   -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
>> -                     struct kvm_xsave *guest_xsave)
>> +static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
>> +                    struct kvm_xsave *guest_xsave)
>>   {
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (!vcpu->arch.guest_fpu)
>> -        return;
>> +        return 0;
>>         if (boot_cpu_has(X86_FEATURE_XSAVE)) {
>>           memset(guest_xsave, 0, sizeof(struct kvm_xsave));
>> @@ -4687,6 +4713,8 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct
>> kvm_vcpu *vcpu,
>>           *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
>>               XFEATURE_MASK_FPSSE;
>>       }
>> +
>> +    return 0;
>>   }
>>     #define XSAVE_MXCSR_OFFSET 24
>> @@ -4697,6 +4725,9 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct
>> kvm_vcpu *vcpu,
>>       u64 xstate_bv;
>>       u32 mxcsr;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (!vcpu->arch.guest_fpu)
>>           return 0;
>>   @@ -4722,18 +4753,22 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct
>> kvm_vcpu *vcpu,
>>       return 0;
>>   }
>>   -static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
>> -                    struct kvm_xcrs *guest_xcrs)
>> +static int kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
>> +                       struct kvm_xcrs *guest_xcrs)
>>   {
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
>>           guest_xcrs->nr_xcrs = 0;
>> -        return;
>> +        return 0;
>>       }
>>         guest_xcrs->nr_xcrs = 1;
>>       guest_xcrs->flags = 0;
>>       guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK;
>>       guest_xcrs->xcrs[0].value = vcpu->arch.xcr0;
>> +    return 0;
>>   }
>>     static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
>> @@ -4741,6 +4776,9 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct
>> kvm_vcpu *vcpu,
>>   {
>>       int i, r = 0;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (!boot_cpu_has(X86_FEATURE_XSAVE))
>>           return -EINVAL;
>>   @@ -5011,7 +5049,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>       case KVM_GET_DEBUGREGS: {
>>           struct kvm_debugregs dbgregs;
>>   -        kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
>> +        r = kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
>> +        if (r)
>> +            break;
>>             r = -EFAULT;
>>           if (copy_to_user(argp, &dbgregs,
>> @@ -5037,7 +5077,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>           if (!u.xsave)
>>               break;
>>   -        kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
>> +        r = kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
>> +        if (r)
>> +            break;
>>             r = -EFAULT;
>>           if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave)))
>> @@ -5061,7 +5103,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>           if (!u.xcrs)
>>               break;
>>   -        kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
>> +        r = kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
>> +        if (r)
>> +            break;
>>             r = -EFAULT;
>>           if (copy_to_user(argp, u.xcrs,
>> @@ -9735,6 +9779,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>           goto out;
>>       }
>>   +    if (vcpu->arch.guest_state_protected &&
>> +        (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs)) {
>> +        r = -EINVAL;
>> +        goto out;
>> +    }
>> +
>>       if (kvm_run->kvm_dirty_regs) {
>>           r = sync_regs(vcpu);
>>           if (r != 0)
>> @@ -9765,7 +9815,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>     out:
>>       kvm_put_guest_fpu(vcpu);
>> -    if (kvm_run->kvm_valid_regs)
>> +    if (kvm_run->kvm_valid_regs && !vcpu->arch.guest_state_protected)
>>           store_regs(vcpu);
>>       post_kvm_run_save(vcpu);
>>       kvm_sigset_deactivate(vcpu);
>> @@ -9812,6 +9862,9 @@ static void __get_regs(struct kvm_vcpu *vcpu,
>> struct kvm_regs *regs)
>>     int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct
>> kvm_regs *regs)
>>   {
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       vcpu_load(vcpu);
>>       __get_regs(vcpu, regs);
>>       vcpu_put(vcpu);
>> @@ -9852,6 +9905,9 @@ static void __set_regs(struct kvm_vcpu *vcpu,
>> struct kvm_regs *regs)
>>     int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct
>> kvm_regs *regs)
>>   {
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       vcpu_load(vcpu);
>>       __set_regs(vcpu, regs);
>>       vcpu_put(vcpu);
>> @@ -9912,6 +9968,9 @@ static void __get_sregs(struct kvm_vcpu *vcpu,
>> struct kvm_sregs *sregs)
>>   int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
>>                     struct kvm_sregs *sregs)
>>   {
>> +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       vcpu_load(vcpu);
>>       __get_sregs(vcpu, sregs);
>>       vcpu_put(vcpu);
>> @@ -10112,6 +10171,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct
>> kvm_vcpu *vcpu,
>>   {
>>       int ret;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       vcpu_load(vcpu);
>>       ret = __set_sregs(vcpu, sregs);
>>       vcpu_put(vcpu);
>> @@ -10205,6 +10267,9 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu
>> *vcpu, struct kvm_fpu *fpu)
>>   {
>>       struct fxregs_state *fxsave;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (!vcpu->arch.guest_fpu)
>>           return 0;
>>   @@ -10228,6 +10293,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct
>> kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>>   {
>>       struct fxregs_state *fxsave;
>>   +    if (vcpu->arch.guest_state_protected)
>> +        return -EINVAL;
>> +
>>       if (!vcpu->arch.guest_fpu)
>>           return 0;
>>  
>

2021-07-26 05:33:36

by Xiaoyao Li

Subject: Re: [RFC PATCH v2 65/69] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

On 7/14/2021 2:14 AM, Sean Christopherson wrote:
> On Tue, Jul 06, 2021, Paolo Bonzini wrote:
>> On 03/07/21 00:05, [email protected] wrote:
>>> From: Xiaoyao Li <[email protected]>
>>>
>>> Introduce a per-vm variable initial_tsc_khz to hold the default tsc_khz
>>> for kvm_arch_vcpu_create().
>>>
>>> This field is going to be used by TDX since TSC frequency for TD guest
>>> is configured at TD VM initialization phase.
>>>
>>> Signed-off-by: Xiaoyao Li <[email protected]>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 1 +
>>> arch/x86/kvm/x86.c | 3 ++-
>>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> So this means disabling TSC frequency scaling on TDX.

No, TSC frequency scaling is still supported on TDX. It's just that the TSC
frequency for a TD guest must be configured at the VM level, not the vCPU level.

>> Would it make sense
>> to delay VM creation to a separate ioctl, similar to KVM_ARM_VCPU_FINALIZE
>> (KVM_VM_FINALIZE)?
>
> There's an equivalent of that in the next mega-patch, the KVM_TDX_INIT_VM sub-ioctl
> of KVM_MEMORY_ENCRYPT_OP. The TSC frequency for the guest gets provided at that
> time.
>

2021-07-26 12:58:33

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

On 03/07/21 00:04, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> * What's TDX?
> TDX stands for Trust Domain Extensions which isolates VMs from the
> virtual-machine manager (VMM)/hypervisor and any other software on the
> platform. [1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
> available.
>
>
> * The goal of this RFC patch
> The purpose of this post is to get feedback early on high level design issue of
> KVM enhancement for TDX. The detailed coding (variable naming etc) is not cared
> of. This patch series is incomplete (not working). So it's RFC. Although
> multiple software components, not only KVM but also QEMU, guest Linux and
> virtual bios, need to be updated, this includes only KVM VMM part. For those who
> are curious to changes to other component, there are public repositories at
> github. [8], [9]
>
>
> * Patch organization
> The patch 66 is main change. The preceding patches(1-65) The preceding
> patches(01-61) are refactoring the code and introducing additional hooks.
>
> - 01-12: They are preparations. introduce architecture constants, code
> refactoring, export symbols for following patches.
> - 13-40: start to introduce the new type of VM and allow the coexistence of
> multiple type of VM. allow/disallow KVM ioctl where
> appropriate. Especially make per-system ioctl to per-VM ioctl.
> - 41-65: refactoring KVM VMX/MMU and adding new hooks for Secure EPT.
> - 66: main patch to add "basic" support for building/running TDX.
> - 67: trace points for
> - 68-69: Documentation

Queued 2,3,17-20,23,44-45, thanks.

Paolo

> * TODOs
> Those major features are missing from this patch series to keep this patch
> series small.
>
> - load/initialize TDX module
> split out from this patch series.
> - unmapping private page
> Will integrate Kirill's patch to show how kvm will utilize it.
> - qemu gdb stub support
> - Large page support
> - guest PMU support
> - TDP MMU support
> - and more
>
> Changes from v1:
> - rebase to v5.13
> - drop load/initialization of TDX module
> - catch up the update of related specifications.
> - rework on C-wrapper function to invoke seamcall
> - various code clean up
>
> [1] TDX specification
> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
> [3] Intel CPU Architectural Extensions Specification
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 EAS
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
> [5] Intel TDX Loader Interface Specification
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> [7] Intel TDX Virtual Firmware Design Guide
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
> [8] intel public github
> kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> TDX guest branch: https://github.com/intel/tdx/tree/guest
> qemu TDX https://github.com/intel/qemu-tdx
> [9] TDVF
> https://github.com/tianocore/edk2-staging/tree/TDVF
>
> Isaku Yamahata (11):
> KVM: TDX: introduce config for KVM TDX support
> KVM: X86: move kvm_cpu_vmxon() from vmx.c to virtext.h
> KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86
> KVM: TDX: add a helper function for kvm to call seamcall
> KVM: TDX: add trace point before/after TDX SEAMCALLs
> KVM: TDX: Print the name of SEAMCALL status code
> KVM: Add per-VM flag to mark read-only memory as unsupported
> KVM: x86: add per-VM flags to disable SMI/INIT/SIPI
> KVM: TDX: add trace point for TDVMCALL and SEPT operation
> KVM: TDX: add document on TDX MODULE
> Documentation/virtual/kvm: Add Trust Domain Extensions(TDX)
>
> Kai Huang (2):
> KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level
> routes
> cpu/hotplug: Document that TDX also depends on booting CPUs once
>
> Rick Edgecombe (1):
> KVM: x86: Add infrastructure for stolen GPA bits
>
> Sean Christopherson (53):
> KVM: TDX: Add TDX "architectural" error codes
> KVM: TDX: Add architectural definitions for structures and values
> KVM: TDX: define and export helper functions for KVM TDX support
> KVM: TDX: Add C wrapper functions for TDX SEAMCALLs
> KVM: Export kvm_io_bus_read for use by TDX for PV MMIO
> KVM: Enable hardware before doing arch VM initialization
> KVM: x86: Split core of hypercall emulation to helper function
> KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO
> KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default
> KVM: Add infrastructure and macro to mark VM as bugged
> KVM: Export kvm_make_all_cpus_request() for use in marking VMs as
> bugged
> KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the
> VM
> KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID
> KVM: Add max_vcpus field in common 'struct kvm'
> KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
> KVM: x86: Hoist kvm_dirty_regs check out of sync_regs()
> KVM: x86: Introduce "protected guest" concept and block disallowed
> ioctls
> KVM: x86: Add per-VM flag to disable direct IRQ injection
> KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE
> KVM: x86: Add flag to mark TSC as immutable (for TDX)
> KVM: Add per-VM flag to disable dirty logging of memslots for TDs
> KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
> KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()
> KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP
> KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
> KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> behavior
> KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
> KVM: x86: Add option to force LAPIC expiration wait
> KVM: x86: Add guest_supported_xss placholder
> KVM: Export kvm_is_reserved_pfn() for use by TDX
> KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> KVM: x86/mmu: Allow non-zero init value for shadow PTE
> KVM: x86/mmu: Refactor shadow walk in __direct_map() to reduce
> indentation
> KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
> KVM: x86/mmu: Frame in support for private/inaccessible shadow pages
> KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> KVM: VMX: Modify NMI and INTR handlers to take intr_info as param
> KVM: VMX: Move NMI/exception handler to common helper
> KVM: x86/mmu: Allow per-VM override of the TDP max page level
> KVM: VMX: Split out guts of EPT violation to common/exposed function
> KVM: VMX: Define EPT Violation architectural bits
> KVM: VMX: Define VMCS encodings for shared EPT pointer
> KVM: VMX: Add 'main.c' to wrap VMX and TDX
> KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> KVM: VMX: Move register caching logic to common code
> KVM: TDX: Define TDCALL exit reason
> KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs
> KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h
> KVM: VMX: MOVE GDT and IDT accessors to common code
> KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX
> code
> KVM: TDX: Add "basic" support for building and running Trust Domains
>
> Xiaoyao Li (2):
> KVM: TDX: Introduce pr_seamcall_ex_ret_info() to print more info when
> SEAMCALL fails
> KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
>
> Documentation/virt/kvm/api.rst | 6 +-
> Documentation/virt/kvm/intel-tdx.rst | 441 ++++++
> Documentation/virt/kvm/tdx-module.rst | 48 +
> arch/arm64/include/asm/kvm_host.h | 3 -
> arch/arm64/kvm/arm.c | 7 +-
> arch/arm64/kvm/vgic/vgic-init.c | 6 +-
> arch/x86/Kbuild | 1 +
> arch/x86/include/asm/cpufeatures.h | 2 +
> arch/x86/include/asm/kvm-x86-ops.h | 8 +
> arch/x86/include/asm/kvm_boot.h | 30 +
> arch/x86/include/asm/kvm_host.h | 55 +-
> arch/x86/include/asm/virtext.h | 25 +
> arch/x86/include/asm/vmx.h | 17 +
> arch/x86/include/uapi/asm/kvm.h | 60 +
> arch/x86/include/uapi/asm/vmx.h | 7 +-
> arch/x86/kernel/asm-offsets_64.c | 15 +
> arch/x86/kvm/Kconfig | 11 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/boot/Makefile | 6 +
> arch/x86/kvm/boot/seam/tdx_common.c | 242 +++
> arch/x86/kvm/boot/seam/tdx_common.h | 13 +
> arch/x86/kvm/ioapic.c | 4 +
> arch/x86/kvm/irq_comm.c | 13 +-
> arch/x86/kvm/lapic.c | 7 +-
> arch/x86/kvm/lapic.h | 2 +-
> arch/x86/kvm/mmu.h | 31 +-
> arch/x86/kvm/mmu/mmu.c | 526 +++++--
> arch/x86/kvm/mmu/mmu_internal.h | 3 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 +-
> arch/x86/kvm/mmu/spte.c | 15 +-
> arch/x86/kvm/mmu/spte.h | 18 +-
> arch/x86/kvm/svm/svm.c | 18 +-
> arch/x86/kvm/trace.h | 138 ++
> arch/x86/kvm/vmx/common.h | 178 +++
> arch/x86/kvm/vmx/main.c | 1098 ++++++++++++++
> arch/x86/kvm/vmx/posted_intr.c | 6 +
> arch/x86/kvm/vmx/seamcall.S | 64 +
> arch/x86/kvm/vmx/seamcall.h | 68 +
> arch/x86/kvm/vmx/tdx.c | 1958 +++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 267 ++++
> arch/x86/kvm/vmx/tdx_arch.h | 370 +++++
> arch/x86/kvm/vmx/tdx_errno.h | 202 +++
> arch/x86/kvm/vmx/tdx_ops.h | 218 +++
> arch/x86/kvm/vmx/tdx_stubs.c | 45 +
> arch/x86/kvm/vmx/vmcs.h | 11 -
> arch/x86/kvm/vmx/vmenter.S | 146 ++
> arch/x86/kvm/vmx/vmx.c | 509 ++-----
> arch/x86/kvm/x86.c | 285 +++-
> include/linux/kvm_host.h | 51 +-
> include/uapi/linux/kvm.h | 2 +
> kernel/cpu.c | 4 +
> tools/arch/x86/include/uapi/asm/kvm.h | 55 +
> tools/include/uapi/linux/kvm.h | 2 +
> virt/kvm/kvm_main.c | 44 +-
> 54 files changed, 6717 insertions(+), 672 deletions(-)
> create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> create mode 100644 Documentation/virt/kvm/tdx-module.rst
> create mode 100644 arch/x86/include/asm/kvm_boot.h
> create mode 100644 arch/x86/kvm/boot/Makefile
> create mode 100644 arch/x86/kvm/boot/seam/tdx_common.c
> create mode 100644 arch/x86/kvm/boot/seam/tdx_common.h
> create mode 100644 arch/x86/kvm/vmx/common.h
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/seamcall.S
> create mode 100644 arch/x86/kvm/vmx/seamcall.h
> create mode 100644 arch/x86/kvm/vmx/tdx.c
> create mode 100644 arch/x86/kvm/vmx/tdx.h
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> create mode 100644 arch/x86/kvm/vmx/tdx_stubs.c
>

2021-07-28 17:34:21

by Sean Christopherson

Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

On Mon, Jul 26, 2021, Paolo Bonzini wrote:
> On 03/07/21 00:04, [email protected] wrote:
> > * Patch organization
> > The patch 66 is main change. The preceding patches(1-65) The preceding
> > patches(01-61) are refactoring the code and introducing additional hooks.
> >
> > - 01-12: They are preparations. introduce architecture constants, code
> > refactoring, export symbols for following patches.
> > - 13-40: start to introduce the new type of VM and allow the coexistence of
> > multiple type of VM. allow/disallow KVM ioctl where
> > appropriate. Especially make per-system ioctl to per-VM ioctl.
> > - 41-65: refactoring KVM VMX/MMU and adding new hooks for Secure EPT.
> > - 66: main patch to add "basic" support for building/running TDX.
> > - 67: trace points for
> > - 68-69: Documentation
>
> Queued 2,3,17-20,23,44-45, thanks.

I strongly object to merging these two until we see the new SEAMLDR code:

[RFC PATCH v2 02/69] KVM: X86: move kvm_cpu_vmxon() from vmx.c to virtext.h
[RFC PATCH v2 03/69] KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86

If the SEAMLDR code ends up being fully contained in KVM, then this is unnecessary
churn and exposes code outside of KVM that we may not want exposed (yet). E.g.
setting and clearing CR4.VMXE (in the fault path) in cpu_vmxon() may not be
necessary/desirable for SEAMLDR, we simply can't tell without seeing the code.

2021-07-31 01:06:23

by Erdem Aktas

Subject: Re: [RFC PATCH v2 05/69] KVM: TDX: Add architectural definitions for structures and values

On Fri, Jul 2, 2021 at 3:05 PM <[email protected]> wrote:
> +/* Management class fields */
> +enum tdx_guest_management {
> + TD_VCPU_PEND_NMI = 11,
> +};
> +
> +/* @field is any of enum tdx_guest_management */
> +#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(32, (field))

I am a little confused by this. According to the spec, PEND_NMI has
a field code of 0x200000000000000B.
I can understand that 0x20 is the class code and the PEND_NMI field code is 0xB.
On the other hand, for the LAST_EXIT_TSC the field code is 0xA00000000000000A.
Based on your code and the table in the spec, I can see that there is
an additional mask (1ULL<<63) for readonly fields.
Is this information correct and is this included in the spec? I tried
to find it but somehow I do not see it clearly defined.

> +#define TDX1_NR_TDCX_PAGES 4
> +#define TDX1_NR_TDVPX_PAGES 5
> +
> +#define TDX1_MAX_NR_CPUID_CONFIGS 6
Why is this just 6? I am looking at the CPUID table in the spec and
there are already more than 6 CPUID leaves there.

> +#define TDX1_MAX_NR_CMRS 32
> +#define TDX1_MAX_NR_TDMRS 64
> +#define TDX1_MAX_NR_RSVD_AREAS 16
> +#define TDX1_PAMT_ENTRY_SIZE 16
> +#define TDX1_EXTENDMR_CHUNKSIZE 256

I believe all of the defined variables above need to be enumerated
with TDH.SYS.INFO.

> +#define TDX_TDMR_ADDR_ALIGNMENT 512
Is TDX_TDMR_ADDR_ALIGNMENT used anywhere or is it just for completeness?

> +#define TDX_TDMR_INFO_ALIGNMENT 512
Why do we have an alignment of 512? I am assuming it is to make it
cache-line aligned for efficiency?


> +#define TDX_TDSYSINFO_STRUCT_ALIGNEMNT 1024

typo: ALIGNEMNT -> ALIGNMENT

-Erdem

2021-08-02 07:36:05

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

On 28/07/21 18:51, Sean Christopherson wrote:
> I strongly object to merging these two until we see the new SEAMLDR code:
>
> [RFC PATCH v2 02/69] KVM: X86: move kvm_cpu_vmxon() from vmx.c to virtext.h
> [RFC PATCH v2 03/69] KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86
>
> If the SEAMLDR code ends up being fully contained in KVM, then this is unnecessary
> churn and exposes code outside of KVM that we may not want exposed (yet). E.g.
> setting and clearing CR4.VMXE (in the fault path) in cpu_vmxon() may not be
> necessary/desirable for SEAMLDR, we simply can't tell without seeing the code.

Fair enough (though, for patch 2, it's a bit weird to have vmxoff in
virtext.h and not vmxon).

Paolo


2021-08-02 13:29:12

by Xiaoyao Li

Subject: Re: [RFC PATCH v2 05/69] KVM: TDX: Add architectural definitions for structures and values

On 7/31/2021 9:04 AM, Erdem Aktas wrote:
> On Fri, Jul 2, 2021 at 3:05 PM <[email protected]> wrote:
>> +/* Management class fields */
>> +enum tdx_guest_management {
>> + TD_VCPU_PEND_NMI = 11,
>> +};
>> +
>> +/* @field is any of enum tdx_guest_management */
>> +#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(32, (field))
>
> I am a little confused with this. According to the spec, PEND_NMI has
> a field code of 0x200000000000000B
> I can understand that 0x20 is the class code and the PEND_NMI field code is 0xB.
> On the other hand, for the LAST_EXIT_TSC the field code is 0xA00000000000000A.

> Based on your code and the table in the spec, I can see that there is
> an additional mask (1ULL<<63) for readonly fields

No, bit 63 is not for read-only fields, but for non_arch fields.

Please see 18.7.1 General definition

> Is this information correct and is this included in the spec? I tried
> to find it but somehow I do not see it clearly defined.
>
>> +#define TDX1_NR_TDCX_PAGES 4
>> +#define TDX1_NR_TDVPX_PAGES 5
>> +
>> +#define TDX1_MAX_NR_CPUID_CONFIGS 6
> Why is this just 6? I am looking at the CPUID table in the spec and
> there are already more than 6 CPUID leaves there.

This is the number of CPUID configs reported by TDH.SYS.INFO. Current KVM
only reports 6 leaves.

>> +#define TDX1_MAX_NR_CMRS 32
>> +#define TDX1_MAX_NR_TDMRS 64
>> +#define TDX1_MAX_NR_RSVD_AREAS 16
>> +#define TDX1_PAMT_ENTRY_SIZE 16
>> +#define TDX1_EXTENDMR_CHUNKSIZE 256
>
> I believe all of the defined variables above need to be enumerated
> with TDH.SYS.INFO.

No. Only TDX1_MAX_NR_TDMRS, TDX1_MAX_NR_RSVD_AREAS and
TDX1_PAMT_ENTRY_SIZE can be enumerated from TDH.SYS.INFO.

- TDX1_MAX_NR_CMRS is described in 18.6.3 CMR_INFO, which tells

TDH.SYS.INFO leaf function returns a MAX_CMRS(32) entry array
of CMR_INFO entries.

- TDX1_EXTENDMR_CHUNKSIZE is described in 20.2.23 TDH.MR.EXTEND

>> +#define TDX_TDMR_ADDR_ALIGNMENT 512
> Is TDX_TDMR_ADDR_ALIGNMENT used anywhere or is it just for completeness?

It's a leftover from the rebase. We will clean it up.

>> +#define TDX_TDMR_INFO_ALIGNMENT 512
> Why do we have alignment of 512, I am assuming to make it cache line
> size aligned for efficiency?

It should be a leftover too.

The SEAMCALL TDH.SYS.INFO requires each CMR_INFO entry in CMR_INFO_ARRAY to be
512B aligned.

>
>> +#define TDX_TDSYSINFO_STRUCT_ALIGNEMNT 1024
>
> typo: ALIGNEMNT -> ALIGNMENT
>
> -Erdem
>


2021-08-02 15:13:58

by Sean Christopherson

Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

On Mon, Aug 02, 2021, Paolo Bonzini wrote:
> On 28/07/21 18:51, Sean Christopherson wrote:
> > I strongly object to merging these two until we see the new SEAMLDR code:
> >
> > [RFC PATCH v2 02/69] KVM: X86: move kvm_cpu_vmxon() from vmx.c to virtext.h
> > [RFC PATCH v2 03/69] KVM: X86: move out the definition vmcs_hdr/vmcs from kvm to x86
> >
> > If the SEAMLDR code ends up being fully contained in KVM, then this is unnecessary
> > churn and exposes code outside of KVM that we may not want exposed (yet). E.g.
> > setting and clearing CR4.VMXE (in the fault path) in cpu_vmxon() may not be
> > necessary/desirable for SEAMLDR, we simply can't tell without seeing the code.
>
> Fair enough (though, for patch 2, it's a bit weird to have vmxoff in
> virtext.h and not vmxon).

I don't really disagree, but vmxoff is already in virtext.h for the emergency
reboot stuff. This series only touches the vmxon code.

2021-08-02 15:50:08

by Paolo Bonzini

Subject: Re: [RFC PATCH v2 00/69] KVM: X86: TDX support

On 02/08/21 17:12, Sean Christopherson wrote:
>> Fair enough (though, for patch 2, it's a bit weird to have vmxoff in
>> virtext.h and not vmxon).
> I don't really disagree, but vmxoff is already in virtext.h for the emergency
> reboot stuff. This series only touches the vmxon code.
>

Yes, that's what I meant. But anyway I've dequeued them.

Paolo


2021-08-04 23:22:43

by Sean Christopherson

Subject: Re: [RFC PATCH v2 05/69] KVM: TDX: Add architectural definitions for structures and values

On Wed, Aug 04, 2021, Erdem Aktas wrote:
> On Mon, Aug 2, 2021 at 6:25 AM Xiaoyao Li <[email protected]> wrote:
> > > Is this information correct and is this included in the spec? I tried
> > > to find it but somehow I do not see it clearly defined.
> > >
> > >> +#define TDX1_NR_TDCX_PAGES 4
> > >> +#define TDX1_NR_TDVPX_PAGES 5
> > >> +
> > >> +#define TDX1_MAX_NR_CPUID_CONFIGS 6
> > > Why is this just 6? I am looking at the CPUID table in the spec and
> > > there are already more than 6 CPUID leaves there.
> >
> > This is the number of CPUID config reported by TDH.SYS.INFO. Current KVM
> > only reports 6 leaves.
>
> I, personally, still think that it should be enumerated, rather than
> hardcoded. It is not clear to me why it is 6 and nothing in the spec
> says it will not change.

It's both hardcoded and enumerated. KVM's hardcoded value is specifically the
maximum value expected for TDX modules supporting the so-called "1.0" spec.
It's certainly possible a spec change could bump the maximum, but KVM will refuse
to use a module with higher maximums until Linux/KVM is updated to play nice with
the new module.

Having a hardcoded maximum allows for simpler and more efficient code, as loops and
arrays can be statically defined instead of having to pass around the enumerated
values.

And we'd want sanity checking anyways, e.g. if the TDX-module pulled a stupid and
reported that it needs 4000 TDCX pages. This approach gives KVM documented values
to sanity check, e.g. instead of arbitrary magic numbers.

The downside of this approach is that KVM will need to be updated to play nice
with a new module if any of these maximums are raised. But, IMO that's acceptable
because I can't imagine a scenario where anyone would want to load a TDX module
without first testing the daylights out of the specific kernel+TDX combination,
especially a TDX-module that by definition includes new features.

> > >> +#define TDX1_MAX_NR_CMRS 32
> > >> +#define TDX1_MAX_NR_TDMRS 64
> > >> +#define TDX1_MAX_NR_RSVD_AREAS 16
> > >> +#define TDX1_PAMT_ENTRY_SIZE 16
> > >> +#define TDX1_EXTENDMR_CHUNKSIZE 256
> > >
> > > I believe all of the defined variables above need to be enumerated
> > > with TDH.SYS.INFO.

And they are, though I believe the code for doing the actual SEAMCALL wasn't
posted in this series. The output is sanity checked by tdx_hardware_enable():

+ tdx_caps.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE;
+ if (tdx_caps.tdcs_nr_pages != TDX1_NR_TDCX_PAGES)
+ return -EIO;
+
+ tdx_caps.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1;
+ if (tdx_caps.tdvpx_nr_pages != TDX1_NR_TDVPX_PAGES)
+ return -EIO;
+
+ tdx_caps.attrs_fixed0 = tdsysinfo->attributes_fixed0;
+ tdx_caps.attrs_fixed1 = tdsysinfo->attributes_fixed1;
+ tdx_caps.xfam_fixed0 = tdsysinfo->xfam_fixed0;
+ tdx_caps.xfam_fixed1 = tdsysinfo->xfam_fixed1;
+
+ tdx_caps.nr_cpuid_configs = tdsysinfo->num_cpuid_config;
+ if (tdx_caps.nr_cpuid_configs > TDX1_MAX_NR_CPUID_CONFIGS)
+ return -EIO;
+

2021-08-05 01:43:26

by Erdem Aktas

Subject: Re: [RFC PATCH v2 05/69] KVM: TDX: Add architectural definitions for structures and values

On Mon, Aug 2, 2021 at 6:25 AM Xiaoyao Li <[email protected]> wrote:
>
> No. bit 63 is not for readonly fields, but for non_arch fields.
>
> Please see 18.7.1 General definition
Thank you so much! Makes sense.

> > Is this information correct and is this included in the spec? I tried
> > to find it but somehow I do not see it clearly defined.
> >
> >> +#define TDX1_NR_TDCX_PAGES 4
> >> +#define TDX1_NR_TDVPX_PAGES 5
> >> +
> >> +#define TDX1_MAX_NR_CPUID_CONFIGS 6
> > Why is this just 6? I am looking at the CPUID table in the spec and
> > there are already more than 6 CPUID leaves there.
>
> This is the number of CPUID config reported by TDH.SYS.INFO. Current KVM
> only reports 6 leaves.

I, personally, still think that it should be enumerated, rather than
hardcoded. It is not clear to me why it is 6 and nothing in the spec
says it will not change.

> >> +#define TDX1_MAX_NR_CMRS 32
> >> +#define TDX1_MAX_NR_TDMRS 64
> >> +#define TDX1_MAX_NR_RSVD_AREAS 16
> >> +#define TDX1_PAMT_ENTRY_SIZE 16
> >> +#define TDX1_EXTENDMR_CHUNKSIZE 256
> >
> > I believe all of the defined variables above need to be enumerated
> > with TDH.SYS.INFO.
>
> No. Only TDX1_MAX_NR_TDMRS, TDX1_MAX_NR_RSVD_AREAS and
> TDX1_PAMT_ENTRY_SIZE can be enumerated from TDH.SYS.INFO.
>
> - TDX1_MAX_NR_CMRS is described in 18.6.3 CMR_INFO, which tells
>
> TDH.SYS.INFO leaf function returns a MAX_CMRS(32) entry array
> of CMR_INFO entries.
>
> - TDX1_EXTENDMR_CHUNKSIZE is described in 20.2.23 TDH.MR.EXTEND

Thanks for the pointers for MAX_CMRS and TDX1_EXTENDMR_CHUNKSIZE.
Will the rest of it be enumerated or hardcoded?

> >> +#define TDX_TDMR_ADDR_ALIGNMENT 512
> > Is TDX_TDMR_ADDR_ALIGNMENT used anywhere or is it just for completeness?
>
> It's the leftover during rebase. We will clean it up.
Thanks!

> SEAMCALL TDH.SYS.INFO requires each cmr info in CMR_INFO_ARRAY to be
> 512B aligned

Makes sense, thanks for the explanation.

2021-08-05 13:33:29

by Kai Huang

Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Fri, 2 Jul 2021 15:04:47 -0700 [email protected] wrote:
> From: Rick Edgecombe <[email protected]>
>
> Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> perspective) to a single GPA (from a memslot perspective). GPA aliasing
> will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> execute-only permission bit to the guest. To keep the implementation
> simple (relatively speaking), GPA aliasing is only supported via TDP.
>
> Today KVM assumes two things that are broken by GPA aliasing.
> 1. GPAs coming from hardware can be simply shifted to get the GFNs.
> 2. GPA bits 51:MAXPHYADDR are reserved to zero.
>
> With GPA aliasing, translating a GPA to GFN requires masking off the
> repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
>
> To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> that is, bits stolen from the GPA to act as new virtualized attribute
> bits. A bit in the mask will cause the MMU code to create aliases of the
> GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> fault.
>
> To handle case (1) from above, retain any stolen bits when passing a GPA
> in KVM's MMU code, but strip them when converting to a GFN so that the
> GFN contains only the "real" GFN, i.e. never has repurposed bits set.
>
> GFNs (without stolen bits) continue to be used to:
> -Specify physical memory by userspace via memslots
> -Map GPAs to TDP PTEs via RMAP
> -Specify dirty tracking and write protection
> -Look up MTRR types
> -Inject async page faults
>
> Since there are now multiple aliases for the same aliased GPA, when
> userspace memory backing the memslots is paged out, both aliases need to be
> modified. Fortunately this happens automatically. Since rmap supports
> multiple mappings for the same GFN for PTE shadowing based paging, by
> adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> operations will be applied to both aliases.
>
> In the case of the rmap being removed in the future, the needed
> information could be recovered by iterating over the stolen bits and
> walking the TDP page tables.
>
> For TLB flushes that are address based, make sure to flush both aliases
> in the stolen bits case.
>
> Only support stolen bits in 64 bit guest paging modes (long, PAE).
> Features that use this infrastructure should restrict the stolen bits to
> exclude the other paging modes. Don't support stolen bits for shadow EPT.
>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 26 ++++++++++
> arch/x86/kvm/mmu/mmu.c | 86 ++++++++++++++++++++++-----------
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 ++++++----
> 4 files changed, 101 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 88d0ed5225a4..69b82857acdb 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -232,4 +232,30 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> int kvm_mmu_post_init_vm(struct kvm *kvm);
> void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
>
> +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> +{
> + /* Currently there are no stolen bits in KVM */
> + return 0;
> +}
> +
> +static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> + return kvm_gfn_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> +{
> + return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
> +}
> +
> +static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> + return kvm_gpa_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> + return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
> +}
> +
> #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0dc4bf34ce9c..990ee645b8a2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -188,27 +188,37 @@ static inline bool kvm_available_flush_tlb_with_range(void)
> return kvm_x86_ops.tlb_remote_flush_with_range;
> }
>
> -static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> - struct kvm_tlb_range *range)
> -{
> - int ret = -ENOTSUPP;
> -
> - if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> - ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> -
> - if (ret)
> - kvm_flush_remote_tlbs(kvm);
> -}
> -
> void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
> u64 start_gfn, u64 pages)
> {
> struct kvm_tlb_range range;
> + u64 gfn_stolen_mask;
> +
> + if (!kvm_available_flush_tlb_with_range())
> + goto generic_flush;
> +
> + /*
> + * Fall back to the big hammer flush if there is more than one
> + * GPA alias that needs to be flushed.
> + */
> + gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> + if (hweight64(gfn_stolen_mask) > 1)
> + goto generic_flush;
>
> range.start_gfn = start_gfn;
> range.pages = pages;
> + if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
> + goto generic_flush;
> +
> + if (!gfn_stolen_mask)
> + return;
>
> - kvm_flush_remote_tlbs_with_range(kvm, &range);
> + range.start_gfn |= gfn_stolen_mask;
> + static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
> + return;
> +
> +generic_flush:
> + kvm_flush_remote_tlbs(kvm);
> }
>
> bool is_nx_huge_page_enabled(void)
> @@ -1949,14 +1959,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
> __clear_sp_write_flooding_count(sptep_to_sp(spte));
> }
>
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> - gfn_t gfn,
> - gva_t gaddr,
> - unsigned level,
> - int direct,
> - unsigned int access)
> +static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> + gfn_t gfn,
> + gfn_t gfn_stolen_bits,
> + gva_t gaddr,
> + unsigned int level,
> + int direct,
> + unsigned int access)
> {
> bool direct_mmu = vcpu->arch.mmu->direct_map;
> + gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
> union kvm_mmu_page_role role;
> struct hlist_head *sp_list;
> unsigned quadrant;
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> role.quadrant = quadrant;
> }
>
> - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> - if (sp->gfn != gfn) {
> + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> collisions++;
> continue;
> }
> @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> sp = kvm_mmu_alloc_page(vcpu, direct);
>
> sp->gfn = gfn;
> + sp->gfn_stolen_bits = gfn_stolen_bits;
> sp->role = role;
> hlist_add_head(&sp->hash_link, sp_list);
> if (!direct) {
> @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> return sp;
> }


Sorry for replying to an old thread, but to me it looks weird to have gfn_stolen_bits
in 'struct kvm_mmu_page'. If I understand correctly, above code basically
means that GFN with different stolen bit will have different 'struct
kvm_mmu_page', but in the context of this patch, mappings with different
stolen bits still use the same root, which means gfn_stolen_bits doesn't make a
lot of sense at least for root page table.

Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
the context of TDX, since TDX requires two separate roots for private and
shared mappings.

So given we cannot tell whether the same root, or different roots should be
used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
'struct kvm_mmu_page' and use it to determine whether to allocate a new table
for the same GFN, but should use a new role (i.e role.private) to determine.

And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
memory.
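
A minimal standalone sketch of this role-based alternative (field names and
layout are hypothetical, not KVM's actual kvm_mmu_page_role): a role.private
bit plus the per-VM stolen mask is enough to recover a shadow page's stolen
bits without storing them.

```c
/*
 * Hypothetical sketch of the role-based alternative: recover a shadow
 * page's stolen bits from role.private plus per-VM state, instead of
 * carrying gfn_stolen_bits in struct kvm_mmu_page.  Field names and
 * layout are illustrative, not KVM's actual structures.
 */
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

union kvm_mmu_page_role {
	uint32_t word;
	struct {
		unsigned level:4;
		unsigned direct:1;
		unsigned private:1;	/* new: TDX private vs. shared */
		/* ... remaining role bits elided ... */
	};
};

/*
 * For TDX, a private mapping has the shared (stolen) bit clear, so a
 * private page contributes no stolen bits; a shared page contributes
 * the per-VM shared-bit mask.
 */
static inline gfn_t role_stolen_bits(union kvm_mmu_page_role role,
				     gfn_t vm_gfn_stolen_mask)
{
	return role.private ? 0 : vm_gfn_stolen_mask;
}
```

The per-VM mask here stands in for the 'gfn_stolen_mask' state mentioned later
in the thread; only the one role bit is stored per shadow page.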

>
> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> + gva_t gaddr, unsigned int level,
> + int direct, unsigned int access)
> +{
> + return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
> +}
> +
> static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> struct kvm_vcpu *vcpu, hpa_t root,
> u64 addr)
> @@ -2637,7 +2657,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>
> gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
> slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
> - if (!slot)
> +
> + /* Don't map private memslots for stolen bits */
> + if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
> return -1;
>
> ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
> @@ -2827,7 +2849,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> struct kvm_shadow_walk_iterator it;
> struct kvm_mmu_page *sp;
> int level, req_level, ret;
> - gfn_t gfn = gpa >> PAGE_SHIFT;
> + gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
> + gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> + gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
> gfn_t base_gfn = gfn;
>
> if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> @@ -2852,8 +2876,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>
> drop_large_spte(vcpu, it.sptep);
> if (!is_shadow_present_pte(*it.sptep)) {
> - sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
> - it.level - 1, true, ACC_ALL);
> + sp = __kvm_mmu_get_page(vcpu, base_gfn,
> + gfn_stolen_bits, it.addr,
> + it.level - 1, true, ACC_ALL);
>
> link_shadow_page(vcpu, it.sptep, sp);
> if (is_tdp && huge_page_disallowed &&
> @@ -3689,6 +3714,13 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> return true;
>
> + /* Don't expose aliases for no slot GFNs or private memslots */
> + if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
> + !kvm_is_visible_memslot(slot)) {
> + *pfn = KVM_PFN_NOSLOT;
> + return false;
> + }
> +
> /* Don't expose private memslots to L2. */
> if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
> *pfn = KVM_PFN_NOSLOT;
> @@ -3723,7 +3755,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> bool write = error_code & PFERR_WRITE_MASK;
> bool map_writable;
>
> - gfn_t gfn = gpa >> PAGE_SHIFT;
> + gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
> unsigned long mmu_seq;
> kvm_pfn_t pfn;
> hva_t hva;
> @@ -3833,7 +3865,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> max_level > PG_LEVEL_4K;
> max_level--) {
> int page_num = KVM_PAGES_PER_HPAGE(max_level);
> - gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
> + gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);
>
> if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index d64ccb417c60..c896ec9f3159 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -46,6 +46,7 @@ struct kvm_mmu_page {
> */
> union kvm_mmu_page_role role;
> gfn_t gfn;
> + gfn_t gfn_stolen_bits;
>
> u64 *spt;
> /* hold the gfn of each spte inside spt */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 823a5919f9fa..439dc141391b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -25,7 +25,8 @@
> #define guest_walker guest_walker64
> #define FNAME(name) paging##64_##name
> #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> - #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
> + PT64_LVL_ADDR_MASK(lvl))
> #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> #define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -44,7 +45,7 @@
> #define guest_walker guest_walker32
> #define FNAME(name) paging##32_##name
> #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> - #define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
> #define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
> #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
> #define PT_LEVEL_BITS PT32_LEVEL_BITS
> @@ -58,7 +59,7 @@
> #define guest_walker guest_walkerEPT
> #define FNAME(name) ept_##name
> #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> - #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
> #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> #define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -75,7 +76,7 @@
> #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
>
> #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
>
> /*
> * The guest_walker structure emulates the behavior of the hardware page
> @@ -96,9 +97,9 @@ struct guest_walker {
> struct x86_exception fault;
> };
>
> -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
> {
> - return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> + return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
> }
>
> static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> @@ -366,7 +367,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> --walker->level;
>
> index = PT_INDEX(addr, walker->level);
> - table_gfn = gpte_to_gfn(pte);
> + table_gfn = gpte_to_gfn(vcpu, pte);
> offset = index * sizeof(pt_element_t);
> pte_gpa = gfn_to_gpa(table_gfn) + offset;
>
> @@ -432,7 +433,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> if (unlikely(errcode))
> goto error;
>
> - gfn = gpte_to_gfn_lvl(pte, walker->level);
> + gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
> gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
>
> if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> @@ -537,12 +538,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> gfn_t gfn;
> kvm_pfn_t pfn;
>
> + WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
> +
> if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
> return false;
>
> pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
>
> - gfn = gpte_to_gfn(gpte);
> + gfn = gpte_to_gfn(vcpu, gpte);
> pte_access = sp->role.access & FNAME(gpte_access)(gpte);
> FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
> @@ -652,6 +655,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
>
> direct_access = gw->pte_access;
>
> + WARN_ON(addr & vcpu_gpa_stolen_mask(vcpu));
> +
> top_level = vcpu->arch.mmu->root_level;
> if (top_level == PT32E_ROOT_LEVEL)
> top_level = PT32_ROOT_LEVEL;
> @@ -1067,7 +1072,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> continue;
> }
>
> - gfn = gpte_to_gfn(gpte);
> + gfn = gpte_to_gfn(vcpu, gpte);
> pte_access = sp->role.access;
> pte_access &= FNAME(gpte_access)(gpte);
> FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> --
> 2.25.1
>

2021-08-05 20:53:05

by Sean Christopherson

Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Thu, Aug 05, 2021, Edgecombe, Rick P wrote:
> On Thu, 2021-08-05 at 17:39 +0000, Sean Christopherson wrote:
> > If we really want to reduce the memory footprint for the common case (TDP
> > MMU), the crud that's used only by indirect shadow pages could be shoved
> > into a different struct by abusing the struct layout and and wrapping
> > accesses to the indirect-only fields with casts/container_of and helpers,
> > e.g.
> >
> Wow, didn't realize classic MMU was that relegated already. Mostly an
> onlooker here, but does TDX actually need classic MMU support? Nice to
> have?

Gah, bad verbiage on my part. I didn't mean _the_ TDP MMU, I meant "MMU that uses
TDP". The "TDP MMU" is being enabled by default in upstream; whether or not TDX
needs to support the classic/legacy MMU is an unanswered question. From a
maintenance perspective, I'd love to say no, but from a "what does Google actually
want to use for TDX" perspective, I don't have a definitive answer yet :-/

> > The role is already factored into the collision logic.
>
> I mean how aliases of the same gfn don't necessarily collide and the
> collisions counter is only incremented if the gfn/stolen matches, but
> not if the role is different.

Ah. I still think it would Just Work. The collisions tracking is purely for
stats, presumably used in the past to see if the hash size was effective. If
a hash lookup collides then it should be accounted, _why_ it collided doesn't
factor in to the stats.

Your comments about the hash behavior being different still stand, though for
TDX they wouldn't matter in practice since KVM is hosed if it has both shared and
private versions of a shadow page.

2021-08-05 22:03:54

by Kai Huang

Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> On Thu, Aug 05, 2021, Kai Huang wrote:
> > On Fri, 2 Jul 2021 15:04:47 -0700 [email protected] wrote:
> > > From: Rick Edgecombe <[email protected]>
> > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > sp = kvm_mmu_alloc_page(vcpu, direct);
> > >
> > > sp->gfn = gfn;
> > > + sp->gfn_stolen_bits = gfn_stolen_bits;
> > > sp->role = role;
> > > hlist_add_head(&sp->hash_link, sp_list);
> > > if (!direct) {
> > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > return sp;
> > > }
> >
> >
> > Sorry for replying to an old thread,
>
> Ha, one month isn't old, it's barely even mature.
>
> > but to me it looks weird to have gfn_stolen_bits
> > in 'struct kvm_mmu_page'. If I understand correctly, above code basically
> > means that GFN with different stolen bit will have different 'struct
> > kvm_mmu_page', but in the context of this patch, mappings with different
> > stolen bits still use the same root,
>
> You're conflating "mapping" with "PTE". The GFN is a per-PTE value. Yes, there
> is a final GFN that is representative of the mapping, but more directly the final
> GFN is associated with the leaf PTE.
>
> TDX effectively adds the restriction that all PTEs used for a mapping must have
> the same shared/private status, so mapping and PTE are somewhat interchangeable
> when talking about stolen bits (the shared bit), but in the context of this patch,
> the stolen bits are a property of the PTE.

Yes, it is a property of the PTE; this is the reason I think it's weird to have
stolen bits in 'struct kvm_mmu_page'. Shouldn't stolen bits in 'struct
kvm_mmu_page' imply that all PTEs (whether leaf or not) share the same
stolen bit?

>
> Back to your statement, it's incorrect. PTEs (effectively mappings in TDX) with
> different stolen bits will _not_ use the same root. kvm_mmu_get_page() includes
> the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> stolen bits when looking for an existing shadow page at the target GFN.
>
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> role.quadrant = quadrant;
> }
>
> - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> - if (sp->gfn != gfn) {
> + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> collisions++;
> continue;
> }
>

This only works for non-root tables. But there's only one single
vcpu->arch.mmu->root_hpa; we don't have an array with one root for each
stolen bit (i.e. a loop in mmu_alloc_direct_roots()), so effectively all
stolen bits share one single root.

> > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > page table.
>
> It does make sense, even without a follow-up patch. In Rick's original series,
> stealing a bit for execute-only guest memory, there was only a single root. And
> except for TDX, there can only ever be a single root because the shared EPTP isn't
> usable, i.e. there's only the regular/private EPTP.
>
> > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > the context of TDX, since TDX requires two separate roots for private and
> > shared mappings.
>
> > So given we cannot tell whether the same root, or different roots should be
> > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > for the same GFN, but should use a new role (i.e role.private) to determine.
>
> A new role would work, too, but it has the disadvantage of not automagically
> working for all uses of stolen bits, e.g. XO support would have to add another
> role bit.

For each purpose of a particular stolen bit, a new role can be defined. For
instance, in __direct_map(), if you see the stolen bit is the TDX shared bit,
you don't set role.private (otherwise you set role.private). For XO, if you
see the stolen bit is XO, you set role.xo.

We already have the 'gfn_stolen_mask' info in the vcpu, so we just need to make
sure all code paths can find the actual stolen bit based on sp->role and the
vcpu (I haven't gone through them all, though; I assume the annoying part is
rmap).

>
> > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
> > memory.
>
> But I do like saving memory... One potentially bad idea would be to unionize
> gfn and stolen bits by shifting the stolen bits after they're extracted from the
> gpa, e.g.
>
> union {
> gfn_t gfn_and_stolen;
> struct {
> gfn_t gfn:52;
> gfn_t stolen:12;
> }
> };
>
> the downsides being that accessing just the gfn would require an additional masking
> operation, and the stolen bits wouldn't align with reality.

2021-08-05 22:44:59

by Sean Christopherson

Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Thu, Aug 05, 2021, Kai Huang wrote:
> On Fri, 2 Jul 2021 15:04:47 -0700 [email protected] wrote:
> > From: Rick Edgecombe <[email protected]>
> > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > sp = kvm_mmu_alloc_page(vcpu, direct);
> >
> > sp->gfn = gfn;
> > + sp->gfn_stolen_bits = gfn_stolen_bits;
> > sp->role = role;
> > hlist_add_head(&sp->hash_link, sp_list);
> > if (!direct) {
> > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > return sp;
> > }
>
>
> Sorry for replying to an old thread,

Ha, one month isn't old, it's barely even mature.

> but to me it looks weird to have gfn_stolen_bits
> in 'struct kvm_mmu_page'. If I understand correctly, above code basically
> means that GFN with different stolen bit will have different 'struct
> kvm_mmu_page', but in the context of this patch, mappings with different
> stolen bits still use the same root,

You're conflating "mapping" with "PTE". The GFN is a per-PTE value. Yes, there
is a final GFN that is representative of the mapping, but more directly the final
GFN is associated with the leaf PTE.

TDX effectively adds the restriction that all PTEs used for a mapping must have
the same shared/private status, so mapping and PTE are somewhat interchangeable
when talking about stolen bits (the shared bit), but in the context of this patch,
the stolen bits are a property of the PTE.

Back to your statement, it's incorrect. PTEs (effectively mappings in TDX) with
different stolen bits will _not_ use the same root. kvm_mmu_get_page() includes
the stolen bits in both the hash lookup and in the comparison, i.e. restores the
stolen bits when looking for an existing shadow page at the target GFN.

@@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.quadrant = quadrant;
}

- sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+ sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
- if (sp->gfn != gfn) {
+ if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
collisions++;
continue;
}

> which means gfn_stolen_bits doesn't make a lot of sense at least for root
> page table.

It does make sense, even without a follow-up patch. In Rick's original series,
stealing a bit for execute-only guest memory, there was only a single root. And
except for TDX, there can only ever be a single root because the shared EPTP isn't
usable, i.e. there's only the regular/private EPTP.

> Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> the context of TDX, since TDX requires two separate roots for private and
> shared mappings.

> So given we cannot tell whether the same root, or different roots should be
> used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> for the same GFN, but should use a new role (i.e role.private) to determine.

A new role would work, too, but it has the disadvantage of not automagically
working for all uses of stolen bits, e.g. XO support would have to add another
role bit.

> And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
> memory.

But I do like saving memory... One potentially bad idea would be to unionize
gfn and stolen bits by shifting the stolen bits after they're extracted from the
gpa, e.g.

union {
gfn_t gfn_and_stolen;
struct {
gfn_t gfn:52;
gfn_t stolen:12;
}
};

the downsides being that accessing just the gfn would require an additional masking
operation, and the stolen bits wouldn't align with reality.
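
The union proposed above and its downside can be sketched standalone
(field widths are the ones quoted, but the wrapper struct and accessor are
hypothetical, not actual KVM code): reading just the gfn member is a bit-field
extract, i.e. the extra masking operation, and the stolen bits no longer sit
at their real GPA positions.

```c
/*
 * Hypothetical sketch of the proposed union.  Note that 64-bit
 * bit-fields are a compiler extension (as used by the kernel), and the
 * low-to-high member layout assumed here is the usual little-endian
 * allocation order.
 */
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

struct sp_gfn {
	union {
		gfn_t gfn_and_stolen;
		struct {
			gfn_t gfn	: 52;
			gfn_t stolen	: 12;
		};
	};
};

static inline gfn_t sp_gfn_only(const struct sp_gfn *s)
{
	return s->gfn;	/* compiler emits the extra mask/shift */
}
```

A stolen bit that lives at, say, GPA bit 51 ends up stored in bits 52+ of
gfn_and_stolen, which is the "wouldn't align with reality" caveat.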

2021-08-05 23:18:38

by Edgecombe, Rick P

Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Thu, 2021-08-05 at 16:06 +0000, Sean Christopherson wrote:
> On Thu, Aug 05, 2021, Kai Huang wrote:
> > On Fri, 2 Jul 2021 15:04:47 -0700 [email protected] wrote:
> > > From: Rick Edgecombe <[email protected]>
> > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page
> > > *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > sp = kvm_mmu_alloc_page(vcpu, direct);
> > >
> > > sp->gfn = gfn;
> > > + sp->gfn_stolen_bits = gfn_stolen_bits;
> > > sp->role = role;
> > > hlist_add_head(&sp->hash_link, sp_list);
> > > if (!direct) {
> > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page
> > > *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > return sp;
> > > }
> >
> >
> > Sorry for replying to an old thread,
>
> Ha, one month isn't old, it's barely even mature.
>
> > but to me it looks weird to have gfn_stolen_bits
> > in 'struct kvm_mmu_page'. If I understand correctly, above code
> > basically
> > means that GFN with different stolen bit will have different
> > 'struct
> > kvm_mmu_page', but in the context of this patch, mappings with
> > different
> > stolen bits still use the same root,
>
> You're conflating "mapping" with "PTE". The GFN is a per-PTE
> value. Yes, there
> is a final GFN that is representative of the mapping, but more
> directly the final
> GFN is associated with the leaf PTE.
>
> TDX effectively adds the restriction that all PTEs used for a mapping must have
> the same shared/private status, so mapping and PTE are somewhat interchangeable
> when talking about stolen bits (the shared bit), but in the context of this patch,
> the stolen bits are a property of the PTE.
>
> Back to your statement, it's incorrect. PTEs (effectively mappings in TDX) with
> different stolen bits will _not_ use the same root. kvm_mmu_get_page() includes
> the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> stolen bits when looking for an existing shadow page at the target GFN.
>
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> role.quadrant = quadrant;
> }
>
> - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> - if (sp->gfn != gfn) {
> + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> collisions++;
> continue;
> }
>
> > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > page table.
>
> It does make sense, even without a follow-up patch. In Rick's original series,
> stealing a bit for execute-only guest memory, there was only a single root. And
> except for TDX, there can only ever be a single root because the shared EPTP isn't
> usable, i.e. there's only the regular/private EPTP.
>
> > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > the context of TDX, since TDX requires two separate roots for private and
> > shared mappings.
> > So given we cannot tell whether the same root, or different roots should be
> > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > for the same GFN, but should use a new role (i.e role.private) to determine.
>
> A new role would work, too, but it has the disadvantage of not automagically
> working for all uses of stolen bits, e.g. XO support would have to add another
> role bit.
>
> > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
> > memory.
>
> But I do like saving memory... One potentially bad idea would be to unionize
> gfn and stolen bits by shifting the stolen bits after they're extracted from the
> gpa, e.g.
>
> union {
> gfn_t gfn_and_stolen;
> struct {
> gfn_t gfn:52;
> gfn_t stolen:12;
> }
> };
>
> the downsides being that accessing just the gfn would require an additional masking
> operation, and the stolen bits wouldn't align with reality.
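
Sean's union sketch above can be exercised standalone. The following is a minimal userspace version (the `packed_gfn` name and `pack_gfn` helper are made up for the example, and the bitfield layout assumed is the one typical x86-64 GCC/Clang produce, with `gfn` in the low bits):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical standalone version of the union sketched above: the
 * stolen bits are shifted down next to the gfn after extraction from
 * the GPA, so reading just ->gfn implies a masking step. */
union packed_gfn {
	gfn_t gfn_and_stolen;
	struct {
		gfn_t gfn    : 52;
		gfn_t stolen : 12;
	};
};

static union packed_gfn pack_gfn(gfn_t gfn, gfn_t stolen)
{
	union packed_gfn u = { .gfn_and_stolen = 0 };

	u.gfn = gfn;        /* low 52 bits */
	u.stolen = stolen;  /* e.g. the TDX shared bit, shifted down */
	return u;
}
```

This illustrates the stated downside: the combined value reads for free, while the fields themselves no longer sit where the real GPA bits are.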

It definitely seems like the sp could be packed more efficiently.
One other idea is the stolen bits could just be recovered from the role
bits with a helper, like how the page fault error code stolen bits
encoding version of this works.

If the stolen bits are not fed into the hash calculation though it
would change the behavior a bit. Not sure if for better or worse. Also
the calculation of hash collisions would need to be aware.

FWIW, I kind of like something like Sean's proposal. It's a bit
convoluted, but there are more unused bits in the gfn than the role.
Also they are a little more related.

2021-08-06 00:11:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Thu, Aug 05, 2021, Edgecombe, Rick P wrote:
> On Thu, 2021-08-05 at 16:06 +0000, Sean Christopherson wrote:
> > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save
> > > some memory.
> >
> > But I do like saving memory... One potentially bad idea would be to
> > unionize gfn and stolen bits by shifting the stolen bits after they're
> > extracted from the gpa, e.g.
> >
> > union {
> > gfn_t gfn_and_stolen;
> > struct {
> > gfn_t gfn:52;
> > gfn_t stolen:12;
> > }
> > };
> >
> > the downsides being that accessing just the gfn would require an additional
> > masking operation, and the stolen bits wouldn't align with reality.
>
> It definitely seems like the sp could be packed more efficiently.

Yeah, in general it could be optimized. But for TDP/direct MMUs, we don't care
thaaat much because there are relatively few shadow pages, versus indirect MMUs
with thousands or tens of thousands of shadow pages. Of course, indirect MMUs
are also the most gluttonous due to the unsync_child_bitmap, gfns, write flooding
count, etc...

If we really want to reduce the memory footprint for the common case (TDP MMU),
the crud that's used only by indirect shadow pages could be shoved into a
different struct by abusing the struct layout and wrapping accesses to the
indirect-only fields with casts/container_of and helpers, e.g.

struct kvm_mmu_indirect_page {
struct kvm_mmu_page this;

gfn_t *gfns;
unsigned int unsync_children;
DECLARE_BITMAP(unsync_child_bitmap, 512);

#ifdef CONFIG_X86_32
/*
* Used out of the mmu-lock to avoid reading spte values while an
* update is in progress; see the comments in __get_spte_lockless().
*/
int clear_spte_count;
#endif

/* Number of writes since the last time traversal visited this page. */
atomic_t write_flooding_count;
}
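
The wrapper-struct idea relies on the common struct being the first member, so a container_of cast recovers the indirect-only fields. A minimal userspace sketch of that pattern (struct contents trimmed; the `to_indirect()` and `sp_unsync_children()` helpers are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* container_of as in the kernel, reduced to plain C */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Trimmed stand-ins for the structs in the sketch above. */
struct kvm_mmu_page {
	unsigned long gfn;
};

struct kvm_mmu_indirect_page {
	struct kvm_mmu_page this;   /* must stay the first member */
	unsigned int unsync_children;
};

/* Recover the wrapper from the embedded common struct. */
static struct kvm_mmu_indirect_page *to_indirect(struct kvm_mmu_page *sp)
{
	return container_of(sp, struct kvm_mmu_indirect_page, this);
}

/* Example of wrapping an access to an indirect-only field. */
static unsigned int sp_unsync_children(struct kvm_mmu_page *sp)
{
	return to_indirect(sp)->unsync_children;
}
```

The design point is that TDP-only shadow pages would allocate just `struct kvm_mmu_page`, and only indirect MMUs pay for the extra fields.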


> One other idea is the stolen bits could just be recovered from the role
> bits with a helper, like how the page fault error code stolen bits
> encoding version of this works.

As in, a generic "stolen_gfn_bits" in the role instead of a per-feature role bit?
That would avoid the problem of per-feature role bits leading to a pile of
marshalling code, and wouldn't suffer the masking cost when accessing ->gfn,
though I'm not sure that matters much.

> If the stolen bits are not fed into the hash calculation though it
> would change the behavior a bit. Not sure if for better or worse. Also
> the calculation of hash collisions would need to be aware.

The role is already factored into the collision logic.

> FWIW, I kind of like something like Sean's proposal. It's a bit
> convoluted, but there are more unused bits in the gfn than the role.

And tightly bound, i.e. there can't be more than gfn_t gfn+gfn_stolen bits.

> Also they are a little more related.


2021-08-06 00:42:58

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Thu, 2021-08-05 at 17:39 +0000, Sean Christopherson wrote:
> On Thu, Aug 05, 2021, Edgecombe, Rick P wrote:
> > On Thu, 2021-08-05 at 16:06 +0000, Sean Christopherson wrote:
> > > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could
> > > > also save
> > > > some memory.
> > >
> > > But I do like saving memory... One potentially bad idea would be to
> > > unionize gfn and stolen bits by shifting the stolen bits after they're
> > > extracted from the gpa, e.g.
> > >
> > > union {
> > > gfn_t gfn_and_stolen;
> > > struct {
> > > gfn_t gfn:52;
> > > gfn_t stolen:12;
> > > }
> > > };
> > >
> > > the downsides being that accessing just the gfn would require an additional
> > > masking operation, and the stolen bits wouldn't align with reality.
> >
> > It definitely seems like the sp could be packed more efficiently.
>
> Yeah, in general it could be optimized. But for TDP/direct MMUs, we don't care
> thaaat much because there are relatively few shadow pages, versus indirect MMUs
> with thousands or tens of thousands of shadow pages. Of course, indirect MMUs
> are also the most gluttonous due to the unsync_child_bitmap, gfns, write flooding
> count, etc...
>
> If we really want to reduce the memory footprint for the common case (TDP MMU),
> the crud that's used only by indirect shadow pages could be shoved into a
> different struct by abusing the struct layout and wrapping accesses to the
> indirect-only fields with casts/container_of and helpers, e.g.
>
Wow, didn't realize classic MMU was that relegated already. Mostly an
onlooker here, but does TDX actually need classic MMU support? Nice to
have?

> struct kvm_mmu_indirect_page {
> struct kvm_mmu_page this;
>
> gfn_t *gfns;
> unsigned int unsync_children;
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> #ifdef CONFIG_X86_32
> /*
> * Used out of the mmu-lock to avoid reading spte values while an
> * update is in progress; see the comments in __get_spte_lockless().
> */
> int clear_spte_count;
> #endif
>
> /* Number of writes since the last time traversal visited this page. */
> atomic_t write_flooding_count;
> }
>
>
> > One other idea is the stolen bits could just be recovered from the role
> > bits with a helper, like how the page fault error code stolen bits
> > encoding version of this works.
>
> As in, a generic "stolen_gfn_bits" in the role instead of a per-feature role bit?
> That would avoid the problem of per-feature role bits leading to a pile of
> marshalling code, and wouldn't suffer the masking cost when accessing ->gfn,
> though I'm not sure that matters much.
Well, I was thinking multiple types of aliases, like how the pf err code
stuff works, like this:

gfn_t stolen_bits(struct kvm *kvm, struct kvm_mmu_page *sp)
{
gfn_t stolen = 0;

if (sp->role.shared)
stolen |= kvm->arch.gfn_shared_mask;
if (sp->role.other_alias)
stolen |= kvm->arch.gfn_other_mask;

return stolen;
}

But yea, there really only needs to be one. Still bit shifting seems
better.

>
> > If the stolen bits are not fed into the hash calculation though it
> > would change the behavior a bit. Not sure if for better or worse. Also
> > the calculation of hash collisions would need to be aware.
>
> The role is already factored into the collision logic.
I mean how aliases of the same gfn don't necessarily collide and the
collisions counter is only incremented if the gfn/stolen matches, but
not if the role is different.

>
> > FWIW, I kind of like something like Sean's proposal. It's a bit
> > convoluted, but there are more unused bits in the gfn than the role.
>
> And tightly bound, i.e. there can't be more than gfn_t gfn+gfn_stolen bits.
>
> > Also they are a little more related.
>
>

2021-08-07 00:06:28

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Fri, Aug 06, 2021, Kai Huang wrote:
> On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > On Fri, 2 Jul 2021 15:04:47 -0700 [email protected] wrote:
> > > > From: Rick Edgecombe <[email protected]>
> > > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > sp = kvm_mmu_alloc_page(vcpu, direct);
> > > >
> > > > sp->gfn = gfn;
> > > > + sp->gfn_stolen_bits = gfn_stolen_bits;
> > > > sp->role = role;
> > > > hlist_add_head(&sp->hash_link, sp_list);
> > > > if (!direct) {
> > > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > return sp;
> > > > }
> > >
> > >
> > > Sorry for replying old thread,
> >
> > Ha, one month isn't old, it's barely even mature.
> >
> > > but to me it looks weird to have gfn_stolen_bits
> > > in 'struct kvm_mmu_page'. If I understand correctly, above code basically
> > > means that GFN with different stolen bit will have different 'struct
> > > kvm_mmu_page', but in the context of this patch, mappings with different
> > > stolen bits still use the same root,
> >
> > You're conflating "mapping" with "PTE". The GFN is a per-PTE value. Yes, there
> > is a final GFN that is representative of the mapping, but more directly the final
> > GFN is associated with the leaf PTE.
> >
> > TDX effectively adds the restriction that all PTEs used for a mapping must have
> > the same shared/private status, so mapping and PTE are somewhat interchangeable
> > when talking about stolen bits (the shared bit), but in the context of this patch,
> > the stolen bits are a property of the PTE.
>
> Yes it is a property of PTE, this is the reason that I think it's weird to have
> stolen bits in 'struct kvm_mmu_page'. Shouldn't stolen bits in 'struct
> kvm_mmu_page' imply that all PTEs (whether leaf or not) share the same
> stolen bit?

No, the stolen bits are the property of the shadow page. I'm using "PTE" above
to mean "PTE for this shadow page", not PTEs within the shadow page, if that makes
sense.

> > Back to your statement, it's incorrect. PTEs (effectively mappings in TDX) with
> > different stolen bits will _not_ use the same root. kvm_mmu_get_page() includes
> > the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> > stolen bits when looking for an existing shadow page at the target GFN.
> >
> > @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > role.quadrant = quadrant;
> > }
> >
> > - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> > for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > - if (sp->gfn != gfn) {
> > + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> > collisions++;
> > continue;
> > }
> >
>
> This only works for non-root table, but there's only one single
> vcpu->arch.mmu->root_hpa, we don't have an array to have one root for each
> stolen bit, i.e. do a loop in mmu_alloc_direct_roots(), so effectively all
> stolen bits share one single root.

Yes, and that's absolutely the required behavior for everything except for TDX
with its two EPTPs. E.g. any other implementation _must_ reject CR3s that set stolen
gfn bits.

> > > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > > page table.
> >
> > It does make sense, even without a follow-up patch. In Rick's original series,
> > stealing a bit for execute-only guest memory, there was only a single root. And
> > except for TDX, there can only ever be a single root because the shared EPTP isn't
> > usable, i.e. there's only the regular/private EPTP.
> >
> > > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > > the context of TDX, since TDX requires two separate roots for private and
> > > shared mappings.
> >
> > > So given we cannot tell whether the same root, or different roots should be
> > > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > > for the same GFN, but should use a new role (i.e role.private) to determine.
> >
> > A new role would work, too, but it has the disadvantage of not automagically
> > working for all uses of stolen bits, e.g. XO support would have to add another
> > role bit.
>
> For each purpose of particular stolen bit, a new role can be defined. For
> instance, in __direct_map(), if you see stolen bit is TDX shared bit, you don't
> set role.private (otherwise set role.private). For XO, if you see the stolen
> bit is XO, you set role.xo.
>
> We already have info of 'gfn_stolen_mask' in vcpu, so we just need to make sure
> all code paths can find the actual stolen bit based on sp->role and vcpu (I
> haven't gone through all though, assuming the annoying part is rmap).

Yes, and I'm not totally against the idea, but I'm also not 100% sold on it either,
yet... The idea of a 'private' flag is growing on me.

If we're treating the shared bit as an attribute bit, which we are, then it's
effectively an extension of role.access. Ditto for XO.

And looking at the code, I think it would be an improvement for TDX, as all of
the is_private_gfn() calls that operate on a shadow page would be simplified and
optimized as they wouldn't have to lookup both gfn_stolen_bits and the vcpu->kvm
mask of the shared bit.

Actually, the more I think about it, the more I like it. For TDX, there's no
risk of increased hash collisions, as we've already messed up if there's a
shared vs. private collision.

And for XO, if it ever makes its way upstream, I think we should flat out disallow
referencing XO addresses in non-leaf PTEs, i.e. make the XO permission bit reserved
in non-leaf PTEs. That would avoid any theoretical problems with the guest doing
something stupid by polluting all its upper level PxEs with XO. Collisions would
be purely limited to the case where the guest is intentionally creating an alternate
mapping, which should be a rare event (or the guest is compromised, which is also
hopefully a rare event).
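
The 'private' role flag being discussed could look roughly like the sketch below. The field names, the role layout, and the `sp_stolen_bits()` helper are all hypothetical; the point is only to show how the stolen bits would be recovered from the role plus the per-VM shared-bit mask instead of being stored per shadow page:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical per-VM state: which GFN bit is the TDX shared bit. */
struct kvm_arch_sketch {
	gfn_t gfn_shared_mask;
};

/* Hypothetical role with a 'private' attribute bit. */
union mmu_page_role_sketch {
	uint32_t word;
	struct {
		uint32_t level    : 4;
		uint32_t private_ : 1; /* page maps TDX private memory */
	};
};

/* For TDX, shared (non-private) mappings carry the shared bit as their
 * stolen gfn bits; private mappings carry none. */
static gfn_t sp_stolen_bits(const struct kvm_arch_sketch *arch,
			    union mmu_page_role_sketch role)
{
	return role.private_ ? 0 : arch->gfn_shared_mask;
}
```

This mirrors the trade-off in the thread: one role bit per alias type, plus a small helper, in exchange for dropping the `gfn_stolen_bits` field.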


2021-08-07 00:20:45

by Kai Huang

[permalink] [raw]
Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Fri, 6 Aug 2021 19:02:39 +0000 Sean Christopherson wrote:
> On Fri, Aug 06, 2021, Kai Huang wrote:
> > On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> > > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > > On Fri, 2 Jul 2021 15:04:47 -0700 [email protected] wrote:
> > > > > From: Rick Edgecombe <[email protected]>
> > > > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > > sp = kvm_mmu_alloc_page(vcpu, direct);
> > > > >
> > > > > sp->gfn = gfn;
> > > > > + sp->gfn_stolen_bits = gfn_stolen_bits;
> > > > > sp->role = role;
> > > > > hlist_add_head(&sp->hash_link, sp_list);
> > > > > if (!direct) {
> > > > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > > return sp;
> > > > > }
> > > >
> > > >
> > > > Sorry for replying old thread,
> > >
> > > Ha, one month isn't old, it's barely even mature.
> > >
> > > > but to me it looks weird to have gfn_stolen_bits
> > > > in 'struct kvm_mmu_page'. If I understand correctly, above code basically
> > > > means that GFN with different stolen bit will have different 'struct
> > > > kvm_mmu_page', but in the context of this patch, mappings with different
> > > > stolen bits still use the same root,
> > >
> > > You're conflating "mapping" with "PTE". The GFN is a per-PTE value. Yes, there
> > > is a final GFN that is representative of the mapping, but more directly the final
> > > GFN is associated with the leaf PTE.
> > >
> > > TDX effectively adds the restriction that all PTEs used for a mapping must have
> > > the same shared/private status, so mapping and PTE are somewhat interchangeable
> > > when talking about stolen bits (the shared bit), but in the context of this patch,
> > > the stolen bits are a property of the PTE.
> >
> > Yes it is a property of PTE, this is the reason that I think it's weird to have
> > stolen bits in 'struct kvm_mmu_page'. Shouldn't stolen bits in 'struct
> > kvm_mmu_page' imply that all PTEs (whether leaf or not) share the same
> > stolen bit?
>
> No, the stolen bits are the property of the shadow page. I'm using "PTE" above
> to mean "PTE for this shadow page", not PTEs within the shadow page, if that makes
> sense.

I see.

>
> > > Back to your statement, it's incorrect. PTEs (effectively mappings in TDX) with
> > > different stolen bits will _not_ use the same root. kvm_mmu_get_page() includes
> > > the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> > > stolen bits when looking for an existing shadow page at the target GFN.
> > >
> > > @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > role.quadrant = quadrant;
> > > }
> > >
> > > - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > > + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> > > for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > > - if (sp->gfn != gfn) {
> > > + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> > > collisions++;
> > > continue;
> > > }
> > >
> >
> > This only works for non-root table, but there's only one single
> > vcpu->arch.mmu->root_hpa, we don't have an array to have one root for each
> > stolen bit, i.e. do a loop in mmu_alloc_direct_roots(), so effectively all
> > stolen bits share one single root.
>
> Yes, and that's absolutely the required behavior for everything except for TDX
> with its two EPTPs. E.g. any other implementation _must_ reject CR3s that set stolen
> gfn bits.

OK. I was thinking gfn_stolen_bits in 'struct kvm_mmu_page' for the table
pointed to by CR3 should still make sense.

>
> > > > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > > > page table.
> > >
> > > It does make sense, even without a follow-up patch. In Rick's original series,
> > > stealing a bit for execute-only guest memory, there was only a single root. And
> > > except for TDX, there can only ever be a single root because the shared EPTP isn't
> > > usable, i.e. there's only the regular/private EPTP.
> > >
> > > > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > > > the context of TDX, since TDX requires two separate roots for private and
> > > > shared mappings.
> > >
> > > > So given we cannot tell whether the same root, or different roots should be
> > > > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > > > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > > > for the same GFN, but should use a new role (i.e role.private) to determine.
> > >
> > > A new role would work, too, but it has the disadvantage of not automagically
> > > working for all uses of stolen bits, e.g. XO support would have to add another
> > > role bit.
> >
> > For each purpose of particular stolen bit, a new role can be defined. For
> > instance, in __direct_map(), if you see stolen bit is TDX shared bit, you don't
> > set role.private (otherwise set role.private). For XO, if you see the stolen
> > bit is XO, you set role.xo.
> >
> > We already have info of 'gfn_stolen_mask' in vcpu, so we just need to make sure
> > all code paths can find the actual stolen bit based on sp->role and vcpu (I
> > haven't gone through all though, assuming the annoying part is rmap).
>
> Yes, and I'm not totally against the idea, but I'm also not 100% sold on it either,
> yet... The idea of a 'private' flag is growing on me.
>
> If we're treating the shared bit as an attribute bit, which we are, then it's
> effectively an extension of role.access. Ditto for XO.
>
> And looking at the code, I think it would be an improvement for TDX, as all of
> the is_private_gfn() calls that operate on a shadow page would be simplified and
> optimized as they wouldn't have to lookup both gfn_stolen_bits and the vcpu->kvm
> mask of the shared bit.
>
> Actually, the more I think about it, the more I like it. For TDX, there's no
> risk of increased hash collisions, as we've already messed up if there's a
> shared vs. private collision.
>
> And for XO, if it ever makes its way upstream, I think we should flat out disallow
> referencing XO addresses in non-leaf PTEs, i.e. make the XO permission bit reserved
> in non-leaf PTEs. That would avoid any theoretical problems with the guest doing
> something stupid by polluting all its upper level PxEs with XO. Collisions would
> be purely limited to the case where the guest is intentionally creating an alternate
> mapping, which should be a rare event (or the guest is compromised, which is also
> hopefully a rare event).
>
>

My main motivation is 'gfn_stolen_bits' doesn't quite make sense for 'struct
kvm_mmu_page' for root, plus it seems it's a little bit redundant at first
glance.

So could we have your final suggestion? :)

Thanks,
-Kai

2021-08-07 00:21:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Sat, Aug 07, 2021, Kai Huang wrote:
> So could we have your final suggestion? :)

Try the kvm_mmu_page_role.private approach. So long as it doesn't end up splattering
code everywhere, I think that's more aligned with how KVM generally wants to treat
the shared bit.

In the changelog for this patch, make it very clear that ensuring different aliases
get different shadow pages (if necessary) is the responsibility of each individual
feature that leverages stolen gfn bits.

2021-08-07 00:22:11

by Kai Huang

[permalink] [raw]
Subject: Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits

On Fri, 6 Aug 2021 22:09:35 +0000 Sean Christopherson wrote:
> On Sat, Aug 07, 2021, Kai Huang wrote:
> > So could we have your final suggestion? :)
>
> Try the kvm_mmu_page_role.private approach. So long as it doesn't end up splattering
> code everywhere, I think that's more aligned with how KVM generally wants to treat
> the shared bit.

OK.

>
> In the changelog for this patch, make it very clear that ensuring different aliases
> get different shadow pages (if necessary) is the responsibility of each individual
> feature that leverages stolen gfn bits.

Makes sense. Thanks.

2021-10-09 07:52:57

by Wang, Wei W

[permalink] [raw]
Subject: RE: [RFC PATCH v2 07/69] KVM: TDX: define and export helper functions for KVM TDX support

On Saturday, July 3, 2021 6:04 AM, Isaku Yamahata wrote:
> Subject: [RFC PATCH v2 07/69] KVM: TDX: define and export helper functions
> for KVM TDX support
> +/*
> + * Setup one-cpu-per-pkg array to do package-scoped SEAMCALLs. The
> +array is
> + * only necessary if there are multiple packages.
> + */
> +int __init init_package_masters(void)
> +{
> + int cpu, pkg, nr_filled, nr_pkgs;
> +
> + nr_pkgs = topology_max_packages();
> + if (nr_pkgs == 1)
> + return 0;
> +
> + tdx_package_masters = kcalloc(nr_pkgs, sizeof(int), GFP_KERNEL);


Where is the corresponding kfree() invoked? (except the one invoked on error conditions below)


> + if (!tdx_package_masters)
> + return -ENOMEM;
> +
> + memset(tdx_package_masters, -1, nr_pkgs * sizeof(int));
> +
> + nr_filled = 0;
> + for_each_online_cpu(cpu) {
> + pkg = topology_physical_package_id(cpu);
> + if (tdx_package_masters[pkg] >= 0)
> + continue;
> +
> + tdx_package_masters[pkg] = cpu;
> + if (++nr_filled == topology_max_packages())
> + break;
> + }
> +
> + if (WARN_ON(nr_filled != topology_max_packages())) {
> + kfree(tdx_package_masters);
> + return -EIO;
> + }

Best,
Wei
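
Wei's question is about the usual init/teardown pairing: an allocation made in an init path needs a matching free somewhere other than the error path. A minimal userspace sketch of the pattern (names mirror the patch, behavior simplified; the `free_package_masters()` counterpart is hypothetical, standing in for whatever teardown hook the series would add):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int *tdx_package_masters;

/* Simplified version of the allocation in init_package_masters(). */
static int init_package_masters(int nr_pkgs)
{
	tdx_package_masters = calloc(nr_pkgs, sizeof(int));
	if (!tdx_package_masters)
		return -1;
	/* -1 marks "no master CPU chosen for this package yet" */
	memset(tdx_package_masters, -1, nr_pkgs * sizeof(int));
	return 0;
}

/* The counterpart Wei is asking about: without a teardown hook like
 * this (module exit or shutdown path), the array is leaked. */
static void free_package_masters(void)
{
	free(tdx_package_masters);
	tdx_package_masters = NULL;
}
```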

2021-10-21 21:46:50

by Sagi Shahar

[permalink] [raw]
Subject: Re: [RFC PATCH v2 66/69] KVM: TDX: Add "basic" support for building and running Trust Domains

On Fri, Jul 2, 2021 at 3:06 PM, Isaku Yamahata
<[email protected]> wrote:
> Subject: [RFC PATCH v2 66/69] KVM: TDX: Add "basic" support for
> building and running Trust Domains
>
>
> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
> +{
> + gpa_t gpa = tdvmcall_p1_read(vcpu);
> + gpa_t size = tdvmcall_p2_read(vcpu);
> +
> + if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
> + (gpa + size) < gpa ||
> + (gpa + size) > vcpu->kvm->arch.gfn_shared_mask << (PAGE_SHIFT + 1))
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> + else
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +
> + return 1;
> +}

This function looks like a no-op in case of success. Is this
intentional? Is this mapping handled somewhere else later on?

2021-10-24 13:00:48

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC PATCH v2 66/69] KVM: TDX: Add "basic" support for building and running Trust Domains

On 10/22/2021 5:44 AM, Sagi Shahar wrote:
> On Fri, Jul 2, 2021 at 3:06 PM, Isaku Yamahata
> <[email protected]> wrote:
>> Subject: [RFC PATCH v2 66/69] KVM: TDX: Add "basic" support for
>> building and running Trust Domains
>>
>>
>> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> + gpa_t gpa = tdvmcall_p1_read(vcpu);
>> + gpa_t size = tdvmcall_p2_read(vcpu);
>> +
>> + if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
>> + (gpa + size) < gpa ||
>> + (gpa + size) > vcpu->kvm->arch.gfn_shared_mask << (PAGE_SHIFT + 1))
>> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
>> + else
>> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
>> +
>> + return 1;
>> +}
>
> This function looks like a no-op in case of success. Is this
> intentional? Is this mapping handled somewhere else later on?
>

Yes, it's intentional.

The mapping will actually be set up in the EPT violation handler when the GPA
is really accessed.
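
The validation in tdx_map_gpa() can be read in isolation: page alignment, no wraparound, and the range staying under the GPA-space limit derived from the shared-bit mask. A standalone sketch under those assumptions (the helper name and the mask value in the test are made up for the example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Standalone version of the checks in tdx_map_gpa() above:
 * gfn_shared_mask << (PAGE_SHIFT + 1) is the exclusive upper bound of
 * the guest physical address space, shared bit included. */
static bool map_gpa_range_valid(uint64_t gpa, uint64_t size,
				uint64_t gfn_shared_mask)
{
	uint64_t limit = gfn_shared_mask << (PAGE_SHIFT + 1);

	if (gpa % PAGE_SIZE || size % PAGE_SIZE)
		return false;           /* not 4K-aligned */
	if (gpa + size < gpa)
		return false;           /* wraparound */
	return gpa + size <= limit;
}
```

On success the hypercall only validates and acknowledges; as Xiaoyao notes, the actual mapping is deferred to the EPT violation path.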

2021-11-09 23:52:40

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC PATCH v2 24/69] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls

On 7/21/2021 6:08 AM, Tom Lendacky wrote:
> On 7/6/21 8:59 AM, Paolo Bonzini wrote:
>> On 03/07/21 00:04, [email protected] wrote:
>>> From: Sean Christopherson <[email protected]>
>>>
>>> Add 'guest_state_protected' to mark a VM's state as being protected by
>>> hardware/firmware, e.g. SEV-ES or TDX-SEAM.  Use the flag to disallow
>>> ioctls() and/or flows that attempt to access protected state.
>>>
>>> Return an error if userspace attempts to get/set register state for a
>>> protected VM, e.g. a non-debug TDX guest.  KVM can't provide sane data,
>>> it's userspace's responsibility to avoid attempting to read guest state
>>> when it's known to be inaccessible.
>>>
>>> Retrieving vCPU events is the one exception, as the userspace VMM is
>>> allowed to inject NMIs.
>>>
>>> Co-developed-by: Xiaoyao Li <[email protected]>
>>> Signed-off-by: Xiaoyao Li <[email protected]>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>> ---
>>>   arch/x86/kvm/x86.c | 104 +++++++++++++++++++++++++++++++++++++--------
>>>   1 file changed, 86 insertions(+), 18 deletions(-)
>>
>> Looks good, but it should be checked whether it breaks QEMU for SEV-ES.
>>  Tom, can you help?
>
> Sorry to take so long to get back to you... been really slammed, let me
> look into this a bit more. But, some quick thoughts...
>
> Offhand, the SMI isn't a problem since SEV-ES doesn't support SMM.
>
> For kvm_vcpu_ioctl_x86_{get,set}_xsave(), can TDX use what was added for
> SEV-ES:
> ed02b213098a ("KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest")
>
> Same for kvm_arch_vcpu_ioctl_{get,set}_fpu().

Tom,

I think what you did in this commit is not so correct. It just silently
ignores the ioctls instead of returning an error to userspace to tell
that this IOCTL is not valid for this VM. E.g., for
kvm_arch_vcpu_ioctl_get_fpu(), QEMU just gets a successful return with
the fpu being all zeros.

So Paolo, what's your take on this?

> The changes to kvm_arch_vcpu_ioctl_{get,set}_sregs() might cause issues,
> since there are specific things allowed in __{get,set}_sregs. But I'll
> need to dig a bit more on that.
>
> Thanks,
> Tom
>


2021-11-10 00:10:02

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 24/69] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls

On 11/9/21 14:37, Xiaoyao Li wrote:
>
> Tom,
>
> I think what you did in this commit is not so correct. It just silently
> ignores the ioctls instead of returning an error to userspace to tell
> that this IOCTL is not valid for this VM. E.g., for
> kvm_arch_vcpu_ioctl_get_fpu(), QEMU just gets a successful return with
> the fpu being all zeros.

Yes, it's a "cop out" that removes the need for more complex changes in
QEMU.

I think for the get/set registers ioctls
KVM_GET/SET_{REGS,SREGS,FPU,XSAVE,XCRS} we need to consider SEV-ES
backwards compatibility. This means, at least for now, only apply the
restriction to TDX (using a bool-returning function, see the review for
28/69).

For SMM, MCE, vCPU events and for kvm_valid/dirty_regs, it can be done
as in this patch.

Paolo

2021-11-10 01:49:23

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC PATCH v2 24/69] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls

On 11/10/2021 1:15 AM, Paolo Bonzini wrote:
> On 11/9/21 14:37, Xiaoyao Li wrote:
>>
>> Tom,
>>
>> I think what you did in this commit is not so correct. It just
>> silently ignores the ioctls instead of returning an error to userspace
>> to tell that this IOCTL is not valid for this VM. E.g., for
>> kvm_arch_vcpu_ioctl_get_fpu(), QEMU just gets a successful return with
>> the fpu being all zeros.
>
> Yes, it's a "cop out" that removes the need for more complex changes in
> QEMU.
>
> I think for the get/set registers ioctls
> KVM_GET/SET_{REGS,SREGS,FPU,XSAVE,XCRS} we need to consider SEV-ES
> backwards compatibility.  This means, at least for now, only apply the
> restriction to TDX (using a bool-returning function, see the review for
> 28/69).
>
> For SMM, MCE, vCPU events and for kvm_valid/dirty_regs, it can be done
> as in this patch.
>

thank you Paolo,

I will go with this direction.


2021-11-11 03:28:15

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

On 7/14/2021 4:39 AM, Sean Christopherson wrote:
> On Tue, Jul 06, 2021, Paolo Bonzini wrote:
>> On 03/07/21 00:04, [email protected] wrote:
>>> struct kvm_arch {
>>> + unsigned long vm_type;
>>
>> Also why not just int or u8?
>
> Heh, because kvm_dev_ioctl_create_vm() takes an "unsigned long" for the type and
> it felt wrong to store it as something else. Storing it as a smaller field should
> be fine, I highly doubt we'll get to 256 types anytime soon :-)

It's the bit position. We can get only 8 types with u8 actually.

>
> I think kvm_x86_ops.is_vm_type_supported() should take the full size though.
>


2021-11-11 07:28:42

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

On 11/11/21 04:28, Xiaoyao Li wrote:
>>
>> Heh, because kvm_dev_ioctl_create_vm() takes an "unsigned long" for
>> the type and
>> it felt wrong to store it as something else.  Storing it as a smaller
>> field should
>> be fine, I highly doubt we'll get to 256 types anytime soon :-)
>
> It's the bit position. We can get only 8 types with u8 actually.

Every architecture defines the meaning, and for x86 we can say it's not
a bit position.

Paolo


2021-11-11 08:29:11

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/69] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs

On 11/11/2021 3:28 PM, Paolo Bonzini wrote:
> On 11/11/21 04:28, Xiaoyao Li wrote:
>>>
>>> Heh, because kvm_dev_ioctl_create_vm() takes an "unsigned long" for
>>> the type and
>>> it felt wrong to store it as something else.  Storing it as a smaller
>>> field should
>>> be fine, I highly doubt we'll get to 256 types anytime soon :-)
>>
>> It's the bit position. We can get only 8 types with u8 actually.
>
> Every architecture defines the meaning, and for x86 we can say it's not
> a bit position.

Sorry, I find I was wrong. The types are not bit positions but values.

KVM_CAP_VM_TYPES reports the supported vm types using a bitmap in which
bit n represents type value n.

> Paolo
>
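
The convention settled on above, a plain type value in vm_type with KVM_CAP_VM_TYPES reporting support as a bitmap where bit n means "type value n is supported", can be sketched as follows (the enum names and the `vm_type_supported()` helper are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type values; they are plain values, not bit positions,
 * so even a u8 could hold up to 256 distinct types. */
enum vm_type_sketch {
	VM_TYPE_LEGACY = 0,
	VM_TYPE_PROTECTED = 1,
};

/* KVM_CAP_VM_TYPES-style report: bit n set means type value n works. */
static int vm_type_supported(uint64_t cap_bitmap, unsigned long vm_type)
{
	return vm_type < 64 && ((cap_bitmap >> vm_type) & 1);
}
```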