From: Isaku Yamahata <[email protected]>
Hi. Now that the TDX host kernel patch series has been posted, I've rebased
this patch series on top of it and made it work.
https://lore.kernel.org/lkml/[email protected]/
Changes from v4:
- rebased to the TDX host kernel patch series.
- include all the patches needed to make this patch series work.
- add [MARKER] patches to clearly mark where each patch layer starts.
Thanks,
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.
A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).
In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.
* The organization of this patch series
This patch series is on top of the patch series "TDX host kernel support":
https://lore.kernel.org/lkml/[email protected]/
This patch series is available at
https://github.com/intel/tdx/releases/tag/kvm-upstream
The corresponding patches to qemu are available at
https://github.com/intel/qemu-tdx/commits/tdx-upstream
The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.
The layers below are chosen so that the device model, for example qemu, can
exercise each layer step by step: check if TDX is supported, create a TD VM,
create a TD vcpu, allow the vcpu to run, populate TD guest private memory, and
handle vcpu exits/hypercalls/interrupts to run the TD fully.
TDX vcpu
interrupt/exits/hypercall<------------\
^ |
| |
TD finalization |
^ |
| |
TDX EPT violation<------------\ |
^ | |
| | |
TD vcpu enter/exit | |
^ | |
| | |
TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
^ | ^
| | |
TD VM creation/destruction \---------------KVM TDP MMU hooks
^ ^
| |
TDX architectural definitions KVM TDP refactoring for TDX
^ ^
| |
TDX, VMX <--------TDX host kernel KVM MMU GPA stolen bits
coexistence support
The following are explanations of each layer. Each layer has a dummy commit
whose subject starts with [MARKER]. It is intended to help identify where
each layer starts.
TDX host kernel support:
https://lore.kernel.org/lkml/[email protected]/
The guts of system-wide initialization of the TDX module. There is an
independent patch series for the host x86 code. TDX KVM patches call functions
that patch series provides to initialize the TDX module.
TDX, VMX coexistence:
Infrastructure to allow TDX to coexist with VMX and trigger the
initialization of the TDX module.
This layer starts with
"KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
Add TDX architectural definitions and helper functions
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
Guest TD creation/destruction: allocation and release of the TDX-specific
VM and vcpu structures. Create an initial guest memory image with TDX
measurement.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
Guest TD vcpu creation/destruction: allocation and release of the
TDX-specific vcpu structure.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
Create an initial guest memory image with TDX measurement. Handle
secure EPT violations to populate guest pages with TDX SEAMCALLs.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
entering into TD. Restore CPU state after exiting from TD.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
Handle various exits/hypercalls and allow interrupts to be injected so
that TD vcpu can continue running.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
KVM MMU GPA stolen bits:
Introduce a framework to handle the stolen (repurposed) bit of the GPA. TDX
repurposes one bit of the GPA to indicate whether it is shared or private. If
it's shared, it's the same as the conventional VMX EPT case and the VMM can
access shared guest pages. If it's private, it's handled by the Secure EPT and
the guest page is encrypted.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
TDX Secure EPT requires different constants, e.g. the initial EPT entry
value. Various refactoring for those differences.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
Introduce a framework to add hooks to the TDP MMU in addition to direct EPT
access. TDX adds Secure EPT, which is an enhancement to VMX EPT. Unlike
conventional VMX EPT, the CPU can't directly read/write Secure EPT. Instead,
TDX SEAMCALLs are used to operate on Secure EPT.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
Introduce a framework to handle switching guest pages between private and
shared. For a given GPA, a guest page can be assigned to either a private GPA
or a shared GPA, but not both at the same time. With the TDX MapGPA hypercall,
the guest TD converts GPA assignments from private (or shared) to shared (or
private).
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
Guest private memory requires different memory management in KVM. The
patch series proposes a way to do it and to integrate it with TDX KVM.
(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.
The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.
TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode. A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and it is encrypted using the guest TD's ephemeral private key. At a high
level, TDCS holds information for controlling TD operation as a whole:
execution controls, EPTP, MSR bitmaps, etc., which KVM needs to set up. Note
that MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant
to have the same value for all VCPUs of the same TD.
Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.
Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM. These control
structures are pointed to by fields in the TD VMCS.
The above means that 1) KVM needs to allocate different data structures for
TDs, 2) KVM can reuse the existing code for some operations on TDs, and 3) KVM
needs to define TD-specific handling for other operations, redirecting them to
TDX-specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback(); else
vmx_callback();".
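As a concrete example of this redirection, the wrapper below is one of the
vt_* callbacks added later in this series to arch/x86/kvm/vmx/main.c:

static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
	if (is_td_vcpu(vcpu))
		return tdx_flush_tlb(vcpu);

	vmx_flush_tlb_all(vcpu);
}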
* TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.
In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.
During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX. Since guest TDs usually require I/O and the data
exchange needs to be done via shared memory, KVM needs to use the current EPT
functionality even for TDs.
* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT. Since
the execution of such interface functions takes a much longer time than
accessing memory directly, in KVM we use the existing TDP code to mirror the
Secure EPT for the TD.
This way, we can effectively walk Secure EPT without using the TDX interface
functions.
* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.
* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs), and the virtual TSC frequency is
determined by the TD configuration, i.e. when the TD is created, not per VCPU.
Currently KVM owns TSC virtualization for VMs, but the TDX module does it for
TDs.
* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCE. Instead, a PV way is needed
for the TD to communicate with the VMM. For now, KVM silently ignores MCE
requests by the VMM. MSRs related to MCE (e.g., MCE bank registers) can be
naturally emulated by paravirtualizing MSR access.
For details, see the specifications [1], [2], [3], [4], [5], [6], [7].
* Restrictions or future work
Some features are not included to reduce patch size. Those features will be
addressed in future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more
* Prerequisites
It's required to load the TDX module and initialize it, which is out of the
scope of this patch series. Another independent patch series for the common x86
code is planned; it defines CONFIG_INTEL_TDX_HOST, which this patch series
uses. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX module is
initialized and the TDX module APIs for the TDX guest life cycle, like
tdh.mng.init, are ready for KVM to use.
Concretely, global initialization, LP (Logical Processor) initialization,
global configuration, key configuration, and TDMR and PAMT initialization are
done. The state of the TDX module is SYS_READY. Please refer to the TDX module
specification, the chapter "Intel TDX Module Lifecycle State Machine".
** Detecting the TDX module readiness.
The TDX host patch series implements the detection of the TDX module
availability and its initialization so that KVM can use it. It also manages the
Host KeyID (HKID) assigned to a guest TD.
The assumed APIs the TDX host patch series provides are listed below (a usage
sketch follows the list).
- int seamrr_enabled()
  Check if the required CPU feature (SEAM mode) is available. This only checks
  CPU feature availability. At this point, the TDX module may not be ready for
  KVM to use.
- int init_tdx(void);
  Initialize the TDX module so that it is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
  Return the system-wide information about the TDX module. NULL if the TDX
  module isn't initialized.
- u32 tdx_get_global_keyid(void);
  Return the global key ID that is used for the TDX module itself.
- int tdx_keyid_alloc(void);
  Allocate an HKID for a guest TD.
- void tdx_keyid_free(int keyid);
  Free the HKID of a guest TD.
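A minimal sketch of how KVM could consume these assumed APIs at hardware setup
time. The function name, return codes, and ordering are illustrative only; the
actual wiring is done by the patches in this series.

static int __init tdx_hardware_setup_sketch(void)
{
	const struct tdsysinfo_struct *tdsysinfo;

	if (!seamrr_enabled())
		return -ENODEV;		/* SEAM mode isn't available on this CPU */

	if (init_tdx())
		return -EIO;		/* the TDX module failed to initialize */

	tdsysinfo = tdx_get_sysinfo();
	if (!tdsysinfo)
		return -EIO;		/* the TDX module isn't initialized */

	/* tdsysinfo now describes the system-wide TDX module parameters. */
	return 0;
}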
(****)
* TDX KVM high-level design
- Host key ID management
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
It is assumed that the TDX host patch series implements the necessary
functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
void tdx_keyid_free(int keyid). A minimal sketch of the assumed allocation flow
follows.
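The sketch below is illustrative only: tdx_vm_assign_hkid_sketch() and
tdx_vm_release_hkid_sketch() are hypothetical helpers, while kvm_tdx->hkid and
tdx_keyid_alloc()/tdx_keyid_free() come from this series and the TDX host
series respectively.

static int tdx_vm_assign_hkid_sketch(struct kvm_tdx *kvm_tdx)
{
	int hkid = tdx_keyid_alloc();	/* provided by the TDX host series */

	if (hkid < 0)
		return hkid;

	/* Used later to configure the TD's memory encryption key. */
	kvm_tdx->hkid = hkid;
	return 0;
}

static void tdx_vm_release_hkid_sketch(struct kvm_tdx *kvm_tdx)
{
	/* Return the HKID when the guest TD is destroyed. */
	tdx_keyid_free(kvm_tdx->hkid);
}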
- Data structures and VM type
Because TDX is different from VMX, TDX defines its own VM/VCPU structures,
struct kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct
vcpu_vmx. To identify the VM, a VM type is introduced to specify which VM type,
VMX (default) or TDX, is used. A sketch of these structures is shown below.
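The following is a minimal sketch, not the actual definitions in this series:
struct vcpu_tdx embedding struct kvm_vcpu matches the later patches, while the
exact layout of struct kvm_tdx and the KVM_X86_TDX_VM check are assumptions
made for illustration.

struct kvm_tdx {
	struct kvm kvm;
	/* TDX-specific VM state: TDR/TDCS pages, HKID, attributes, ... */
};

struct vcpu_tdx {
	struct kvm_vcpu vcpu;
	/* TDX-specific vcpu state: TDVPR/TDVPX pages, exit reason, ... */
};

static inline bool is_td(struct kvm *kvm)
{
	return kvm->arch.vm_type == KVM_X86_TDX_VM;	/* assumed VM-type check */
}

static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
{
	return container_of(kvm, struct kvm_tdx, kvm);
}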
- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX-specific
VM/VCPU parameters, and set up the initial guest memory and measurement.
The creation of a TDX VM requires five additional operations on top of the
conventional VM creation (a rough userspace sketch follows the list).
- Get KVM system capability to check if TDX VM type is supported
- VM creation (KVM_CREATE_VM)
- New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
- New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
- VCPU creation (KVM_CREATE_VCPU)
- New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
- New: Initialize guest memory as boot state and extend the measurement with
the memory. KVM_TDX_INIT_MEM_REGION.
- New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
TDX VM contents.
- VCPU RUN (KVM_VCPU_RUN)
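As a rough userspace sketch of issuing one of these new commands: the real
struct kvm_tdx_cmd layout and the per-command payloads are defined by the uapi
headers added in this series; the simplified struct below and the helper name
are assumptions for illustration only.

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/kvm.h>

/* Simplified stand-in for the real struct kvm_tdx_cmd. */
struct tdx_cmd_sketch {
	__u32 id;	/* e.g. KVM_TDX_INIT_VM */
	__u32 flags;
	__u64 data;	/* pointer to the command-specific payload */
};

static int tdx_vm_ioctl_sketch(int vm_fd, __u32 id, void *payload)
{
	struct tdx_cmd_sketch cmd = {
		.id = id,
		.data = (__u64)(unsigned long)payload,
	};

	/* TDX-specific operations are multiplexed over KVM_MEMORY_ENCRYPT_OP. */
	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}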
- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it, for example by accessing CPU registers, injecting
exceptions, or accessing guest memory. Those operations are silently ignored,
returning zero or the initial reset value when requested via KVM API ioctls.
- VM/VCPU state and callbacks for TDX specific operations
Define TDX-specific VM state and VCPU state instead of the VMX ones. Redirect
operations to the TDX-specific callbacks, "if (tdx) tdx_op() else vmx_op()".
- Operations on the CPU state
Silently ignore operations on the guest state. For example, a write to CPU
registers is ignored and a read from CPU registers returns 0. A short sketch of
this pattern follows the list below.
. Ignore access to CPU registers except for allowed ones.
. TSC: add a check whether the TSC is immutable and return an error, because
  the KVM implementation updates the internal TSC state and it's difficult to
  back out those changes. Instead, skip the logic.
. Dirty logging: add a check whether dirty logging is supported.
. Exceptions/SMI/MCE/SIPI/INIT: silently ignore.
Note: virtual external interrupts and NMI can be injected into TDX guests.
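A minimal sketch of the "silently ignore for TD" pattern. The vt_set_dr7()
wrapper and the choice of callback here are illustrative; the actual set of
ignored accesses is defined by the "Add methods to ignore accesses to CPU
state" patch.

static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
	/* Guest debug registers are protected; silently drop the write. */
	if (is_td_vcpu(vcpu))
		return;

	vmx_set_dr7(vcpu, val);
}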
- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate
whether the guest physical address is private (the bit is cleared) or shared
(the bit is set). These bits are called stolen bits. The sketch at the end of
this section shows how a faulting GPA could be classified with this bit.
- Stolen bits framework
systematically tracks which guest physical address, shared or private, is
used.
- Shared EPT and secure EPT
There are two EPTs: the Shared EPT (the conventional one) and the
Secure EPT (the new one). The Shared EPT is handled the same as
before for GPAs with the stolen bit set. The Secure EPT points to
private guest pages. To resolve an EPT violation, KVM walks one of
the two EPTs based on the faulting GPA. Because it's costly to
access the Secure EPT with SEAMCALLs while walking EPTs for private
guest physical addresses, another private EPT is used as a mirror of
the Secure EPT with the existing logic, at the cost of extra memory.
The following depicts the relationship.
KVM | TDX module
| | |
-------------+---------- | |
| | | |
V V | |
shared GPA private GPA | |
CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
| | | |
| | | |
V V | V
shared EPT private EPT<-------mirror----->Secure EPT
| | | |
| \--------------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
|
non-encrypted memory | encrypted memory
|
- Operating on Secure EPT
Use the TDX module APIs to operate on the Secure EPT. To call the TDX APIs
while resolving EPT violations, add hooks for the additional operations and
wire them to the TDX backend.
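A minimal sketch of how the stolen GPA bit could be used to classify a faulting
GPA. The helper name and the gfn_shared_mask field are assumptions for
illustration, not the actual implementation in this series.

static inline bool kvm_is_private_gpa_sketch(struct kvm *kvm, gpa_t gpa)
{
	/*
	 * The stolen (shared) bit (bit 51 or bit 47, chosen per VM based on
	 * the guest physical address width) is assumed to be cached as a
	 * GFN mask at TD creation.
	 */
	gfn_t shared_mask = to_kvm_tdx(kvm)->gfn_shared_mask;	/* assumed field */

	/* Private if the stolen bit is clear, shared if it is set. */
	return !(gpa_to_gfn(gpa) & shared_mask);
}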
* References
[1] TDX specification
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
[3] Intel CPU Architectural Extensions Specification
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 EAS
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
[5] Intel TDX Loader Interface Specification
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
[7] Intel TDX Virtual Firmware Design Guide
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
[8] Intel public github
kvm TDX branch: https://github.com/intel/tdx/tree/kvm
TDX guest branch: https://github.com/intel/tdx/tree/guest
qemu TDX: https://github.com/intel/qemu-tdx
[9] TDVF
https://github.com/tianocore/edk2-staging/tree/TDVF
Chao Gao (1):
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
wrmsr
Isaku Yamahata (73):
x86/virt/tdx: export platform_has_tdx
KVM: TDX: Detect CPU feature on kernel module initialization
KVM: x86: Refactor KVM VMX module init/exit functions
KVM: TDX: Add placeholders for TDX VM/vcpu structure
x86/virt/tdx: Add a helper function to return system wide info about
TDX module
KVM: TDX: Add a function to initialize TDX module
KVM: TDX: Make TDX VM type supported
[MARKER] The start of TDX KVM patch series: TDX architectural
definitions
KVM: TDX: Define TDX architectural definitions
KVM: TDX: Add a function for KVM to invoke SEAMCALL
KVM: TDX: add a helper function for KVM to issue SEAMCALL
KVM: TDX: Add helper functions to print TDX SEAMCALL error
[MARKER] The start of TDX KVM patch series: TD VM creation/destruction
KVM: TDX: allocate per-package mutex
x86/cpu: Add helper functions to allocate/free MKTME keyid
KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
[MARKER] The start of TDX KVM patch series: TD vcpu
creation/destruction
KVM: TDX: allocate/free TDX vcpu structure
[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits
KVM: x86/mmu: introduce config for PRIVATE KVM MMU
[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
TDX
KVM: x86/mmu: Disallow fast page fault on private GPA
[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
[MARKER] The start of TDX KVM patch series: TDX EPT violation
KVM: TDX: TDP MMU TDX support
[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
KVM: x86/mmu: steal software usable bit for EPT to represent shared
page
KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping
KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT
KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
KVM: x86/mmu: Forcibly use TDP MMU for TDX
[MARKER] The start of TDX KVM patch series: TD finalization
KVM: TDX: Create initial guest memory
KVM: TDX: Finalize VM initialization
[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
KVM: TDX: Add helper assembly function to TDX vcpu
KVM: TDX: Implement TDX vcpu enter/exit path
KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
KVM: TDX: restore host xsave state when exit from the guest TD
KVM: TDX: restore user ret MSRs
[MARKER] The start of TDX KVM patch series: TD vcpu
exits/interrupts/hypercalls
KVM: TDX: complete interrupts after tdexit
KVM: TDX: restore debug store when TD exit
KVM: TDX: handle vcpu migration over logical processor
KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the
guest TD
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
behavior
KVM: TDX: Implement interrupt injection
KVM: TDX: Implements vcpu request_immediate_exit
KVM: TDX: Implement methods to inject NMI
KVM: TDX: Add a place holder to handle TDX VM exit
KVM: TDX: handle EXIT_REASON_OTHER_SMI
KVM: TDX: handle ept violation/misconfig exit
KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
KVM: TDX: Handle TDX PV CPUID hypercall
KVM: TDX: Handle TDX PV HLT hypercall
KVM: TDX: Handle TDX PV port io hypercall
KVM: TDX: Implement callbacks for MSR operations for TDX
KVM: TDX: Handle TDX PV rdmsr hypercall
KVM: TDX: Handle TDX PV wrmsr hypercall
KVM: TDX: Handle TDX PV report fatal error hypercall
KVM: TDX: Handle TDX PV map_gpa hypercall
KVM: TDX: Silently discard SMI request
KVM: TDX: Silently ignore INIT/SIPI
Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
Kai Huang (1):
KVM: x86: Introduce hooks to free VM callback prezap and vm_free
Rick Edgecombe (1):
KVM: x86: Add infrastructure for stolen GPA bits
Sean Christopherson (26):
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
KVM: Enable hardware before doing arch VM initialization
KVM: x86: Introduce vm_type to differentiate default VMs from
confidential VMs
KVM: TDX: Add TDX "architectural" error codes
KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
KVM: Add max_vcpus field in common 'struct kvm'
KVM: TDX: create/destroy VM structure
KVM: TDX: Do TDX specific vcpu initialization
KVM: x86/mmu: Disallow dirty logging for x86 TDX
KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
KVM: x86/mmu: Allow non-zero init value for shadow PTE
KVM: x86/mmu: Allow per-VM override of the TDP max page level
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: VMX: Move setting of EPT MMU masks to common VT-x code
KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
KVM: TDX: Add load_mmu_pgd method for TDX
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
KVM: x86: Add option to force LAPIC expiration wait
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
argument
KVM: VMX: Move NMI/exception handler to common helper
KVM: x86: Split core of hypercall emulation to helper function
KVM: TDX: Add a placeholder for handler of TDX hypercalls
(TDG.VP.VMCALL)
KVM: TDX: Handle TDX PV MMIO hypercall
KVM: TDX: Add methods to ignore accesses to CPU state
Xiaoyao Li (1):
KVM: TDX: initialize VM with TDX specific parameters
Yuan Yao (1):
KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c
Documentation/virt/kvm/api.rst | 24 +-
.../virt/kvm/intel-tdx-layer-status.rst | 33 +
Documentation/virt/kvm/intel-tdx.rst | 360 +++
Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 ++++
arch/arm64/include/asm/kvm_host.h | 3 -
arch/arm64/kvm/arm.c | 6 +-
arch/arm64/kvm/vgic/vgic-init.c | 6 +-
arch/x86/events/intel/ds.c | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 5 +
arch/x86/include/asm/kvm_host.h | 38 +-
arch/x86/include/asm/tdx.h | 61 +
arch/x86/include/asm/vmx.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 59 +
arch/x86/include/uapi/asm/vmx.h | 5 +-
arch/x86/kvm/Kconfig | 4 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/lapic.c | 25 +-
arch/x86/kvm/lapic.h | 2 +-
arch/x86/kvm/mmu.h | 65 +-
arch/x86/kvm/mmu/mmu.c | 232 +-
arch/x86/kvm/mmu/mmu_internal.h | 84 +
arch/x86/kvm/mmu/paging_tmpl.h | 25 +-
arch/x86/kvm/mmu/spte.c | 48 +-
arch/x86/kvm/mmu/spte.h | 40 +-
arch/x86/kvm/mmu/tdp_iter.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 642 ++++-
arch/x86/kvm/mmu/tdp_mmu.h | 16 +-
arch/x86/kvm/svm/svm.c | 10 +-
arch/x86/kvm/vmx/common.h | 155 ++
arch/x86/kvm/vmx/main.c | 1026 ++++++++
arch/x86/kvm/vmx/posted_intr.c | 8 +-
arch/x86/kvm/vmx/seamcall.S | 55 +
arch/x86/kvm/vmx/seamcall.h | 25 +
arch/x86/kvm/vmx/tdx.c | 2337 +++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 253 ++
arch/x86/kvm/vmx/tdx_arch.h | 158 ++
arch/x86/kvm/vmx/tdx_errno.h | 29 +
arch/x86/kvm/vmx/tdx_error.c | 22 +
arch/x86/kvm/vmx/tdx_ops.h | 174 ++
arch/x86/kvm/vmx/vmenter.S | 146 +
arch/x86/kvm/vmx/vmx.c | 619 ++---
arch/x86/kvm/vmx/x86_ops.h | 235 ++
arch/x86/kvm/x86.c | 123 +-
arch/x86/kvm/x86.h | 8 +
arch/x86/virt/tdxcall.S | 8 +-
arch/x86/virt/vmx/tdx.c | 50 +-
arch/x86/virt/vmx/tdx.h | 52 -
include/linux/kvm_host.h | 2 +
include/uapi/linux/kvm.h | 1 +
tools/arch/x86/include/uapi/asm/kvm.h | 59 +
tools/include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 35 +-
52 files changed, 7142 insertions(+), 706 deletions(-)
create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
create mode 100644 Documentation/virt/kvm/intel-tdx.rst
create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
create mode 100644 arch/x86/kvm/vmx/common.h
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/seamcall.S
create mode 100644 arch/x86/kvm/vmx/seamcall.h
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx.h
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/kvm/vmx/tdx_error.c
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
--
2.25.1
From: Isaku Yamahata <[email protected]>
Implement hooks of the TDP MMU for the TDX backend: TLB flush, TLB shootdown,
propagating changes of private EPT entries to the Secure EPT, and freeing
Secure EPT pages.
TLB flush handles both the shared EPT and the private EPT. It flushes the
shared EPT the same as VMX does. It also waits for the TDX TLB shootdown.
The hook to free a Secure EPT page unlinks the Secure EPT page from the
Secure EPT so that the page can be freed back to the OS.
Propagate entry changes to the Secure EPT. The possible entry changes are
present -> non-present (zapping) and non-present -> present (population). On
population, just link the Secure EPT page or the private guest page to the
Secure EPT with a TDX SEAMCALL.
Because the TDP MMU allows concurrent zapping/population, zapping requires a
synchronous TLB shootdown with the frozen EPT entry. It zaps the secure entry,
increments the TLB counter, sends an IPI to remote vcpus to trigger a TLB
flush, and then unlinks the private guest page from the Secure EPT.
For simplicity, batched zapping with an exclusive lock is handled as concurrent
zapping. Although it's inefficient, it can be optimized in the future.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 40 +++++-
arch/x86/kvm/vmx/tdx.c | 246 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 14 +++
arch/x86/kvm/vmx/tdx_ops.h | 3 +
arch/x86/kvm/vmx/x86_ops.h | 2 +
5 files changed, 301 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6969e3557bd4..f571b07c2aae 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,38 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
return vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_flush_tlb(vcpu);
+
+ vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_flush_tlb(vcpu);
+
+ vmx_flush_tlb_current(vcpu);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_flush_tlb_guest(vcpu);
+}
+
static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
@@ -162,10 +194,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_rflags = vmx_set_rflags,
.get_if_flag = vmx_get_if_flag,
- .tlb_flush_all = vmx_flush_tlb_all,
- .tlb_flush_current = vmx_flush_tlb_current,
- .tlb_flush_gva = vmx_flush_tlb_gva,
- .tlb_flush_guest = vmx_flush_tlb_guest,
+ .tlb_flush_all = vt_flush_tlb_all,
+ .tlb_flush_current = vt_flush_tlb_current,
+ .tlb_flush_gva = vt_flush_tlb_gva,
+ .tlb_flush_guest = vt_flush_tlb_guest,
.vcpu_pre_run = vmx_vcpu_pre_run,
.run = vmx_vcpu_run,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 51098e10b6a0..5d74ae001e4f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,7 +5,9 @@
#include "capabilities.h"
#include "x86_ops.h"
+#include "mmu.h"
#include "tdx.h"
+#include "vmx.h"
#include "x86.h"
#undef pr_fmt
@@ -272,6 +274,15 @@ int tdx_vm_init(struct kvm *kvm)
int ret, i;
u64 err;
+ /*
+ * To generate EPT violation to inject #VE instead of EPT MISCONFIG,
+ * set RWX=0.
+ */
+ kvm_mmu_set_mmio_spte_mask(kvm, 0, VMX_EPT_RWX_MASK, 0);
+
+ /* TODO: Enable 2mb and 1gb large page support. */
+ kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+
/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
kvm->max_vcpus = 0;
@@ -331,6 +342,8 @@ int tdx_vm_init(struct kvm *kvm)
tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
}
+ spin_lock_init(&kvm_tdx->seamcall_lock);
+
/*
* Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
* ioctl() to define the configure CPUID values for the TD.
@@ -501,6 +514,220 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
}
+static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ struct tdx_module_output out;
+ u64 err;
+
+ if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
+ return;
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return;
+
+ /* Pin the page, TDX KVM doesn't yet support page migration. */
+ get_page(pfn_to_page(pfn));
+
+ if (likely(is_td_finalized(kvm_tdx))) {
+ err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
+ return;
+ }
+}
+
+static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ spin_lock(&kvm_tdx->seamcall_lock);
+ __tdx_sept_set_private_spte(kvm, gfn, level, pfn);
+ spin_unlock(&kvm_tdx->seamcall_lock);
+}
+
+static void tdx_sept_drop_private_spte(
+ struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ hpa_t hpa_with_hkid;
+ struct tdx_module_output out;
+ u64 err = 0;
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return;
+
+ spin_lock(&kvm_tdx->seamcall_lock);
+ if (is_hkid_assigned(kvm_tdx)) {
+ err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
+ goto unlock;
+ }
+
+ hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+ err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ goto unlock;
+ }
+ } else
+ err = tdx_reclaim_page((unsigned long)__va(hpa), hpa);
+
+unlock:
+ spin_unlock(&kvm_tdx->seamcall_lock);
+
+ if (!err)
+ put_page(pfn_to_page(pfn));
+}
+
+static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *sept_page)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ hpa_t hpa = __pa(sept_page);
+ struct tdx_module_output out;
+ u64 err;
+
+ spin_lock(&kvm_tdx->seamcall_lock);
+ err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
+ spin_unlock(&kvm_tdx->seamcall_lock);
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ struct tdx_module_output out;
+ u64 err;
+
+ spin_lock(&kvm_tdx->seamcall_lock);
+ err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+ spin_unlock(&kvm_tdx->seamcall_lock);
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
+}
+
+static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *sept_page)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ int ret;
+
+ /*
+ * free_private_sp() is (obviously) called when a shadow page is being
+ * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
+ */
+ if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
+ return -EINVAL;
+
+ spin_lock(&kvm_tdx->seamcall_lock);
+ ret = tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
+ spin_unlock(&kvm_tdx->seamcall_lock);
+
+ return ret;
+}
+
+static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx;
+ u64 err;
+
+ if (!is_td(kvm))
+ return -EOPNOTSUPP;
+
+ kvm_tdx = to_kvm_tdx(kvm);
+ if (!is_hkid_assigned(kvm_tdx))
+ return 0;
+
+ /* If TD isn't finalized, it's before any vcpu running. */
+ if (unlikely(!is_td_finalized(kvm_tdx)))
+ return 0;
+
+ kvm_tdx->tdh_mem_track = true;
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+ err = tdh_mem_track(kvm_tdx->tdr.pa);
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+
+ WRITE_ONCE(kvm_tdx->tdh_mem_track, false);
+
+ return 0;
+}
+
+static void tdx_handle_changed_private_spte(
+ struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
+ kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page)
+{
+ WARN_ON(!is_td(kvm));
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ if (is_present) {
+ /* TDP MMU doesn't change present -> present */
+ WARN_ON(was_present);
+
+ /*
+ * Use different call to either set up middle level
+ * private page table, or leaf.
+ */
+ if (is_leaf)
+ tdx_sept_set_private_spte(kvm, gfn, level, new_pfn);
+ else {
+ WARN_ON(!sept_page);
+ if (tdx_sept_link_private_sp(kvm, gfn, level, sept_page))
+ /* failed to update Secure-EPT. */
+ WARN_ON(1);
+ }
+ } else if (was_leaf) {
+ /* non-present -> non-present doesn't make sense. */
+ WARN_ON(!was_present);
+
+ /*
+ * Zap private leaf SPTE. Zapping private table is done
+ * below in handle_removed_tdp_mmu_page().
+ */
+ tdx_sept_zap_private_spte(kvm, gfn, level);
+
+ /*
+ * TDX requires TLB tracking before dropping private page. Do
+ * it here, although it is also done later.
+ * If hkid isn't assigned, the guest is destroying and no vcpu
+ * runs further. TLB shootdown isn't needed.
+ *
+ * TODO: implement with_range version for optimization.
+ * kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+ * => tdx_sept_tlb_remote_flush_with_range(kvm, gfn,
+ * KVM_PAGES_PER_HPAGE(level));
+ */
+ if (is_hkid_assigned(to_kvm_tdx(kvm)))
+ kvm_flush_remote_tlbs(kvm);
+
+ tdx_sept_drop_private_spte(kvm, gfn, level, old_pfn);
+ }
+}
+
static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
@@ -736,6 +963,21 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return ret;
}
+void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ u64 root_hpa = mmu->root_hpa;
+
+ /* Flush the shared EPTP, if it's valid. */
+ if (VALID_PAGE(root_hpa))
+ ept_sync_context(construct_eptp(vcpu, root_hpa,
+ mmu->shadow_root_level));
+
+ while (READ_ONCE(kvm_tdx->tdh_mem_track))
+ cpu_relax();
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -901,6 +1143,10 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
hkid_start_pos = boot_cpu_data.x86_phys_bits;
hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
+ x86_ops->tlb_remote_flush = tdx_sept_tlb_remote_flush;
+ x86_ops->free_private_sp = tdx_sept_free_private_sp;
+ x86_ops->handle_changed_private_spte = tdx_handle_changed_private_spte;
+
return 0;
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index b32e068c51b4..906666c7c70b 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -29,9 +29,17 @@ struct kvm_tdx {
struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
bool finalized;
+ bool tdh_mem_track;
u64 tsc_offset;
unsigned long tsc_khz;
+
+ /*
+ * Lock to prevent seamcalls from running concurrently
+ * when TDP MMU is enabled, because TDP fault handler
+ * runs concurrently.
+ */
+ spinlock_t seamcall_lock;
};
struct vcpu_tdx {
@@ -166,6 +174,12 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
return out.r8;
}
+static __always_inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+ WARN_ON(level == PG_LEVEL_NONE);
+ return level - 1;
+}
+
#else
#define enable_tdx false
static inline int tdx_module_setup(void) { return -ENODEV; };
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index dc76b3a5cf96..cb40edc8c245 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -30,12 +30,14 @@ static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
struct tdx_module_output *out)
{
+ tdx_clflush_page(hpa);
return kvm_seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, out);
}
static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
struct tdx_module_output *out)
{
+ tdx_clflush_page(page);
return kvm_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, out);
}
@@ -48,6 +50,7 @@ static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
struct tdx_module_output *out)
{
+ tdx_clflush_page(hpa);
return kvm_seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, out);
}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ad9b1c883761..922a3799336e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -144,6 +144,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+void tdx_flush_tlb(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
#else
static inline void tdx_pre_kvm_init(
@@ -163,6 +164,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
#endif
--
2.25.1
From: Isaku Yamahata <[email protected]>
On entering/exiting a TDX vcpu, the preserved or clobbered CPU state is
different from the VMX case. Add TDX hooks to save/restore host/guest CPU
state. Save/restore the kernel GS base MSR.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 28 +++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 39 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 4 ++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 73 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2e5a7a72d560..f9d43f2de145 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,30 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
return vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ /*
+ * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+ * guest state of a TD is obviously off limits. Deferring MSRs and DRs
+ * is pointless because the TDX module needs to load *something* so as
+ * not to expose guest state.
+ */
+ if (is_td_vcpu(vcpu)) {
+ tdx_prepare_switch_to_guest(vcpu);
+ return;
+ }
+
+ vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_put(vcpu);
+
+ return vmx_vcpu_put(vcpu);
+}
+
static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -174,9 +198,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_free = vt_vcpu_free,
.vcpu_reset = vt_vcpu_reset,
- .prepare_guest_switch = vmx_prepare_switch_to_guest,
+ .prepare_guest_switch = vt_prepare_switch_to_guest,
.vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
+ .vcpu_put = vt_vcpu_put,
.update_exception_bitmap = vmx_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ebe4f9bf19e7..7a288aae03ba 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/cpu.h>
+#include <linux/mmu_context.h>
#include <asm/tdx.h>
@@ -407,6 +408,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.guest_state_protected =
!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ tdx->host_state_need_save = true;
+ tdx->host_state_need_restore = false;
+
return 0;
free_tdvpx:
@@ -420,6 +424,39 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return ret;
}
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (!tdx->host_state_need_save)
+ return;
+
+ if (likely(is_64bit_mm(current->mm)))
+ tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+ else
+ tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+ tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ tdx->host_state_need_save = true;
+ if (!tdx->host_state_need_restore)
+ return;
+
+ wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+ tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ vmx_vcpu_pi_put(vcpu);
+ tdx_prepare_switch_to_host(vcpu);
+}
+
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -535,6 +572,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(vcpu, tdx);
+ tdx->host_state_need_restore = true;
+
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
trace_kvm_exit(vcpu, KVM_ISA_VMX);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e950404ce5de..8b1cf9c158e3 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -84,6 +84,10 @@ struct vcpu_tdx {
union tdx_exit_reason exit_reason;
bool initialized;
+
+ bool host_state_need_save;
+ bool host_state_need_restore;
+ u64 msr_host_kernel_gs_base;
};
static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 44404dd25737..8b871c5f52cf 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -141,6 +141,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -162,6 +164,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1
From: Isaku Yamahata <[email protected]>
Several user ret MSRs are clobbered on TD exit. Restore those values on
TD exit and before returning to ring 3.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 54be5be1a06c..c1366aac7d96 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -550,6 +550,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->kvm->vm_bugged = true;
}
+struct tdx_uret_msr {
+ u32 msr;
+ unsigned int slot;
+ u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+ {.msr = MSR_SYSCALL_MASK,},
+ {.msr = MSR_STAR,},
+ {.msr = MSR_LSTAR,},
+ {.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_update_cache(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+ kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
+ tdx_uret_msrs[i].defval);
+}
+
static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -589,6 +611,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(vcpu, tdx);
+ tdx_user_return_update_cache();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;
@@ -1371,6 +1394,16 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
return -EIO;
+ for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+ tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+ if (tdx_uret_msrs[i].slot == -1) {
+ /* If any MSR isn't supported, it is a KVM bug */
+ pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+ tdx_uret_msrs[i].msr);
+ return -EIO;
+ }
+ }
+
max_pkgs = topology_max_packages();
tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
GFP_KERNEL);
--
2.25.1
From: Chao Gao <[email protected]>
Several MSRs are constant and only used in userspace (ring 3). But VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when the
kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.
The TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit. This makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware. This inconsistency needs to be
fixed. Otherwise, it may mislead kvm_on_user_return() into skipping the
restoration of some MSRs to the host's values. kvm_set_user_return_msr() can
help correct this case, but it is not optimal as it always does a wrmsr. So,
introduce a variation of kvm_set_user_return_msr() to update cached values and
skip that wrmsr.
Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 25 ++++++++++++++++++++-----
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8406f8b5ab74..b6396d11139e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1894,6 +1894,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
int kvm_add_user_return_msr(u32 msr);
int kvm_find_user_return_msr(u32 msr);
int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_update_cache(unsigned int index, u64 val);
static inline bool kvm_is_supported_user_return_msr(u32 msr)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 66400810d54f..45e8a02e99bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -427,6 +427,15 @@ static void kvm_user_return_msr_cpu_online(void)
}
}
+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+ if (!msrs->registered) {
+ msrs->urn.on_user_return = kvm_on_user_return;
+ user_return_notifier_register(&msrs->urn);
+ msrs->registered = true;
+ }
+}
+
int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
unsigned int cpu = smp_processor_id();
@@ -441,15 +450,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
return 1;
msrs->values[slot].curr = value;
- if (!msrs->registered) {
- msrs->urn.on_user_return = kvm_on_user_return;
- user_return_notifier_register(&msrs->urn);
- msrs->registered = true;
- }
+ kvm_user_return_register_notifier(msrs);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
+/* Update the cache, "curr", and register the notifier */
+void kvm_user_return_update_cache(unsigned int slot, u64 value)
+{
+ struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+ msrs->values[slot].curr = value;
+ kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
+
static void drop_user_return_notifiers(void)
{
unsigned int cpu = smp_processor_id();
--
2.25.1
From: Isaku Yamahata <[email protected]>
This patch implements running a TDX vcpu. Once a vcpu runs on a logical
processor (LP), the TDX vcpu is associated with it. When the TDX vcpu moves to
another LP, the TDX vcpu needs to flush its state on the previous LP. When
destroying a TDX vcpu, it needs to complete the flush and flush the CPU memory
cache. Track which LP the TDX vcpu ran on and flush it as necessary.
Do nothing on the sched_in event as TDX doesn't support pause loop exiting.
TDX vcpu execution requires restoring the PMU debug store after returning back
to KVM because the TDX module unconditionally resets the value. To reuse the
existing code, export perf_restore_debug_store.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
arch/x86/kvm/vmx/tdx.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
arch/x86/kvm/x86.c | 1 +
5 files changed, 79 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f571b07c2aae..2e5a7a72d560 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,14 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
return vmx_vcpu_reset(vcpu, init_event);
}
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_run(vcpu);
+
+ return vmx_vcpu_run(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -200,7 +208,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.tlb_flush_guest = vt_flush_tlb_guest,
.vcpu_pre_run = vmx_vcpu_pre_run,
- .run = vmx_vcpu_run,
+ .run = vt_vcpu_run,
.handle_exit = vmx_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 85d5f961d97e..ebe4f9bf19e7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -10,6 +10,9 @@
#include "vmx.h"
#include "x86.h"
+#include <trace/events/kvm.h>
+#include "trace.h"
+
#undef pr_fmt
#define pr_fmt(fmt) "tdx: " fmt
@@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->kvm->vm_bugged = true;
}
+u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
+
+static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+ struct vcpu_tdx *tdx)
+{
+ guest_enter_irqoff();
+ tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
+ guest_exit_irqoff();
+}
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (unlikely(vcpu->kvm->vm_bugged)) {
+ tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
+ return EXIT_FASTPATH_NONE;
+ }
+
+ trace_kvm_entry(vcpu);
+
+ tdx_vcpu_enter_exit(vcpu, tdx);
+
+ vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+ trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+ if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
+ return EXIT_FASTPATH_NONE;
+ return EXIT_FASTPATH_NONE;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index bf9865a88991..e950404ce5de 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -44,12 +44,45 @@ struct kvm_tdx {
spinlock_t seamcall_lock;
};
+union tdx_exit_reason {
+ struct {
+ /* 31:0 mirror the VMX Exit Reason format */
+ u64 basic : 16;
+ u64 reserved16 : 1;
+ u64 reserved17 : 1;
+ u64 reserved18 : 1;
+ u64 reserved19 : 1;
+ u64 reserved20 : 1;
+ u64 reserved21 : 1;
+ u64 reserved22 : 1;
+ u64 reserved23 : 1;
+ u64 reserved24 : 1;
+ u64 reserved25 : 1;
+ u64 bus_lock_detected : 1;
+ u64 enclave_mode : 1;
+ u64 smi_pending_mtf : 1;
+ u64 smi_from_vmx_root : 1;
+ u64 reserved30 : 1;
+ u64 failed_vmentry : 1;
+
+ /* 63:32 are TDX specific */
+ u64 details_l1 : 8;
+ u64 class : 8;
+ u64 reserved61_48 : 14;
+ u64 non_recoverable : 1;
+ u64 error : 1;
+ };
+ u64 full;
+};
+
struct vcpu_tdx {
struct kvm_vcpu vcpu;
struct tdx_td_page tdvpr;
struct tdx_td_page *tdvpx;
+ union tdx_exit_reason exit_reason;
+
bool initialized;
};
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 922a3799336e..44404dd25737 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -140,6 +140,7 @@ void tdx_vm_free(struct kvm *kvm);
int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -160,6 +161,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da411bcd8cbc..66400810d54f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -300,6 +300,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
};
u64 __read_mostly host_xcr0;
+EXPORT_SYMBOL_GPL(host_xcr0);
u64 __read_mostly supported_xcr0;
EXPORT_SYMBOL_GPL(supported_xcr0);
--
2.25.1
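For reference, a minimal user-space sketch of how the tdx_exit_reason layout
above decodes a raw value returned by __tdx_vcpu_run(); the sample value and
the collapsed reserved field are assumptions for illustration only.

    #include <stdint.h>
    #include <stdio.h>

    /* User-space copy of the union above; bits 16-30 are collapsed here. */
    union tdx_exit_reason_sketch {
            struct {
                    uint64_t basic          : 16; /* VMX basic exit reason */
                    uint64_t reserved       : 15;
                    uint64_t failed_vmentry : 1;
                    uint64_t details_l1     : 8;
                    uint64_t class          : 8;
                    uint64_t reserved61_48  : 14;
                    uint64_t non_recoverable: 1;
                    uint64_t error          : 1;
            };
            uint64_t full;
    };

    int main(void)
    {
            /* Made-up value: error bit set, basic reason 48 (EPT violation). */
            union tdx_exit_reason_sketch r = { .full = 0x8000000000000030ULL };

            printf("error=%u non_recoverable=%u basic=%u\n",
                   (unsigned)r.error, (unsigned)r.non_recoverable,
                   (unsigned)r.basic);
            return 0;
    }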
From: Isaku Yamahata <[email protected]>
This empty commit marks the start of the patch layer for TD vcpu
exits, interrupts, and hypercalls.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index e6af9ad4e23f..1aad7ceb0573 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -13,6 +13,7 @@ What qemu can do
- Qemu can create/destroy vcpu of TDX vm type.
- Qemu can populate initial guest memory image.
- Qemu can finalize guest TD.
+- Qemu can start to run vcpus, but they cannot make progress yet.
Patch Layer status
------------------
@@ -23,7 +24,7 @@ Patch Layer status
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
* TD finalization: Applied
-* TD vcpu enter/exit: Applying
+* TD vcpu enter/exit: Applied
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA stolen bits: Applied
--
2.25.1
From: Isaku Yamahata <[email protected]>
This corresponds to VMX __vmx_complete_interrupts(). Because TDX
virtualizes the vAPIC, KVM only needs to care about NMI injection.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c1366aac7d96..3cb2fbd1c12c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -550,6 +550,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->kvm->vm_bugged = true;
}
+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+ /* Avoid costly SEAMCALL if no nmi was injected */
+ if (vcpu->arch.nmi_injected)
+ vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+ TD_VCPU_PEND_NMI);
+}
+
struct tdx_uret_msr {
u32 msr;
unsigned int slot;
@@ -618,6 +626,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
trace_kvm_exit(vcpu, KVM_ISA_VMX);
+ tdx_complete_interrupts(vcpu);
+
if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
return EXIT_FASTPATH_NONE;
return EXIT_FASTPATH_NONE;
--
2.25.1
From: Isaku Yamahata <[email protected]>
Because the debug store is clobbered by TD entry/exit, restore it on TD exit.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/events/intel/ds.c | 1 +
arch/x86/kvm/vmx/tdx.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 376cc3d66094..cdba4227ad3b 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2256,3 +2256,4 @@ void perf_restore_debug_store(void)
wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
}
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3cb2fbd1c12c..37cf7d43435d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -620,6 +620,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(vcpu, tdx);
tdx_user_return_update_cache();
+ perf_restore_debug_store();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;
--
2.25.1
From: Sean Christopherson <[email protected]>
Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path. This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ac4540aa694d..bd93e7235812 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -205,6 +205,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return vcpu->arch.mmu->page_fault(vcpu, &fault);
}
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level);
+
/*
* Currently, we have two sorts of write-protection, a) the first one
* write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2d4a7d546e1..72d8f200c819 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4192,6 +4192,44 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level)
+{
+ int r;
+ struct kvm_page_fault fault = (struct kvm_page_fault) {
+ .addr = gpa,
+ .error_code = error_code,
+ .exec = error_code & PFERR_FETCH_MASK,
+ .write = error_code & PFERR_WRITE_MASK,
+ .present = error_code & PFERR_PRESENT_MASK,
+ .rsvd = error_code & PFERR_RSVD_MASK,
+ .user = error_code & PFERR_USER_MASK,
+ .prefetch = false,
+ .is_tdp = true,
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+ };
+
+ if (mmu_topup_memory_caches(vcpu, false))
+ return KVM_PFN_ERR_FAULT;
+
+ /*
+ * Loop on the page fault path to handle the case where an mmu_notifier
+ * invalidation triggers RET_PF_RETRY. In the normal page fault path,
+ * KVM needs to resume the guest in case the invalidation changed any
+ * of the page fault properties, i.e. the gpa or error code. For this
+ * path, the gpa and error code are fixed by the caller, and the caller
+ * expects failure if and only if the page fault can't be fixed.
+ */
+ do {
+ fault.max_level = max_level;
+ fault.req_level = PG_LEVEL_4K;
+ fault.goal_level = PG_LEVEL_4K;
+ r = direct_page_fault(vcpu, &fault);
+ } while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+ return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.25.1
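As a usage illustration, here is a hedged sketch (not part of the patch) of
how a KVM-side caller could fault in a single private 4K page with the new
helper, mirroring how the later KVM_TDX_INIT_MEM_REGION patch uses it; the
function name is hypothetical.

    /* Hypothetical example; kernel context assumed (mmu.h already included). */
    static kvm_pfn_t example_map_one_private_page(struct kvm_vcpu *vcpu, gpa_t gpa)
    {
            /* Write|user error code and a 4K mapping, as the TDX caller uses. */
            return kvm_mmu_map_tdp_page(vcpu, gpa,
                                        PFERR_WRITE_MASK | PFERR_USER_MASK,
                                        PG_LEVEL_4K);
    }

The caller then checks the returned pfn with is_error_noslot_pfn() to decide
whether the mapping succeeded.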
From: Isaku Yamahata <[email protected]>
For vcpu migration, in the case of VMX, the VMCS is flushed on the source
pcpu and loaded on the target pcpu. There are corresponding TDX SEAMCALL
APIs; call them on vcpu migration. The logic is mostly the same as VMX,
except that TDX SEAMCALLs are used.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 20 +++++++++++++--
arch/x86/kvm/vmx/tdx.c | 51 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
3 files changed, 71 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f9d43f2de145..2cd5ba0e8788 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -121,6 +121,14 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_vcpu_run(vcpu);
}
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_load(vcpu, cpu);
+
+ return vmx_vcpu_load(vcpu, cpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -162,6 +170,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}
+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_sched_in(vcpu, cpu);
+}
+
static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -199,7 +215,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_reset = vt_vcpu_reset,
.prepare_guest_switch = vt_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
+ .vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,
.update_exception_bitmap = vmx_update_exception_bitmap,
@@ -285,7 +301,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.request_immediate_exit = vmx_request_immediate_exit,
- .sched_in = vmx_sched_in,
+ .sched_in = vt_sched_in,
.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 37cf7d43435d..a6b1a8ce888d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -85,6 +85,18 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
return kvm_tdx->finalized;
}
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
+ * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+ * to its list before it's deleted from this CPU's list.
+ */
+ smp_wmb();
+
+ vcpu->cpu = -1;
+}
+
static void tdx_clear_page(unsigned long page)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -155,6 +167,39 @@ static void tdx_reclaim_td_page(struct tdx_td_page *page)
free_page(page->va);
}
+static void tdx_flush_vp(void *arg)
+{
+ struct kvm_vcpu *vcpu = arg;
+ u64 err;
+
+ /* Task migration can race with CPU offlining. */
+ if (vcpu->cpu != raw_smp_processor_id())
+ return;
+
+ /*
+ * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
+ * list tracking still needs to be updated so that it's correct if/when
+ * the vCPU does get initialized.
+ */
+ if (is_td_vcpu_created(to_tdx(vcpu))) {
+ err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
+ if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+ if (WARN_ON_ONCE(err))
+ pr_tdx_error(TDH_VP_FLUSH, err, NULL);
+ }
+ }
+
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ if (unlikely(vcpu->cpu == -1))
+ return;
+
+ smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
+}
+
static int tdx_do_tdh_phymem_cache_wb(void *param)
{
u64 err = 0;
@@ -425,6 +470,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return ret;
}
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (vcpu->cpu != cpu)
+ tdx_flush_vp_on_cpu(vcpu);
+}
+
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 8b871c5f52cf..ceafd6e18f4e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -143,6 +143,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -166,6 +167,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1
From: Isaku Yamahata <[email protected]>
When shutting down the machine, (VMX or TDX) vcpus need to be shut down on
each pcpu. Do the same for TDX with the TDX SEAMCALL APIs.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 23 +++++++++++--
arch/x86/kvm/vmx/tdx.c | 70 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.h | 2 ++
arch/x86/kvm/vmx/x86_ops.h | 4 +++
4 files changed, 95 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2cd5ba0e8788..882358ac270b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -13,6 +13,25 @@ static bool vt_is_vm_type_supported(unsigned long type)
return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
}
+static int vt_hardware_enable(void)
+{
+ int ret;
+
+ ret = vmx_hardware_enable();
+ if (ret)
+ return ret;
+
+ tdx_hardware_enable();
+ return 0;
+}
+
+static void vt_hardware_disable(void)
+{
+ /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+ tdx_hardware_disable();
+ vmx_hardware_disable();
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -199,8 +218,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_unsetup = vt_hardware_unsetup,
- .hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
+ .hardware_enable = vt_hardware_enable,
+ .hardware_disable = vt_hardware_disable,
.cpu_has_accelerated_tpr = report_flexpriority,
.has_emulated_msr = vmx_has_emulated_msr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a6b1a8ce888d..690298fb99c7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
static DEFINE_MUTEX(tdx_lock);
static struct mutex *tdx_mng_key_config_lock;
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
+ * Protected by interrupt mask. This list is manipulated in process context
+ * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
static u64 hkid_mask __ro_after_init;
static u8 hkid_start_pos __ro_after_init;
@@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
{
+ list_del(&to_tdx(vcpu)->cpu_list);
+
/*
* Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
* otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
@@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
}
+void tdx_hardware_enable(void)
+{
+ INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
+}
+
+void tdx_hardware_disable(void)
+{
+ int cpu = raw_smp_processor_id();
+ struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+ struct vcpu_tdx *tdx, *tmp;
+
+ /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+ list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
+ tdx_disassociate_vp(&tdx->vcpu);
+}
+
static void tdx_clear_page(unsigned long page)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
cpumask_var_t packages;
bool cpumask_allocated;
+ struct kvm_vcpu *vcpu;
u64 err;
int ret;
int i;
+ unsigned long j;
if (!is_hkid_assigned(kvm_tdx))
return;
@@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
return;
}
+ kvm_for_each_vcpu(j, vcpu, kvm)
+ tdx_flush_vp_on_cpu(vcpu);
+
+ mutex_lock(&tdx_lock);
+ err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
+ mutex_unlock(&tdx_lock);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+ return;
+ }
+
cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
for_each_online_cpu(i) {
if (cpumask_allocated &&
@@ -472,8 +511,22 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
- if (vcpu->cpu != cpu)
- tdx_flush_vp_on_cpu(vcpu);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->cpu == cpu)
+ return;
+
+ tdx_flush_vp_on_cpu(vcpu);
+
+ local_irq_disable();
+ /*
+ * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+ * vcpu->cpu is read before tdx->cpu_list.
+ */
+ smp_rmb();
+
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ local_irq_enable();
}
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -522,6 +575,19 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
tdx_reclaim_td_page(&tdx->tdvpx[i]);
kfree(tdx->tdvpx);
tdx_reclaim_td_page(&tdx->tdvpr);
+
+ /*
+ * kvm_free_vcpus()
+ * -> kvm_unload_vcpu_mmu()
+ *
+ * does vcpu_load() for every vcpu after they have already been
+ * disassociated from the per-cpu list by tdx_vm_teardown(). So they
+ * need to be disassociated again, otherwise the freed vcpu data will
+ * be accessed when list_{del,add}() is done on the associated_tdvcpus
+ * list later.
+ */
+ tdx_flush_vp_on_cpu(vcpu);
+ WARN_ON(vcpu->cpu != -1);
}
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 8b1cf9c158e3..180360a65545 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -81,6 +81,8 @@ struct vcpu_tdx {
struct tdx_td_page tdvpr;
struct tdx_td_page *tdvpx;
+ struct list_head cpu_list;
+
union tdx_exit_reason exit_reason;
bool initialized;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ceafd6e18f4e..aae0f4449ec5 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -132,6 +132,8 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
bool tdx_is_vm_type_supported(unsigned long type);
void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
+void tdx_hardware_enable(void);
+void tdx_hardware_disable(void);
int tdx_vm_init(struct kvm *kvm);
void tdx_mmu_prezap(struct kvm *kvm);
@@ -156,6 +158,8 @@ static inline void tdx_pre_kvm_init(
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
static inline void tdx_hardware_unsetup(void) {}
+static inline void tdx_hardware_enable(void) {}
+static inline void tdx_hardware_disable(void) {}
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_prezap(struct kvm *kvm) {}
--
2.25.1
From: Isaku Yamahata <[email protected]>
To protect the initial contents of the guest TD, the TDX module measures
the guest TD during the build process with a SHA-384 measurement. The
measurement of the guest TD contents must be completed before the guest
TD is ready to run.
Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
to run.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
tools/arch/x86/include/uapi/asm/kvm.h | 1 +
3 files changed, 23 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 77f46260d868..943219a08fcd 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cd726c41d362..85d5f961d97e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1103,6 +1103,24 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return ret;
}
+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ err = tdh_mr_finalize(kvm_tdx->tdr.pa);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
+ return -EIO;
+ }
+
+ kvm_tdx->finalized = true;
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1123,6 +1141,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_INIT_MEM_REGION:
r = tdx_init_mem_region(kvm, &tdx_cmd);
break;
+ case KVM_TDX_FINALIZE_VM:
+ r = tdx_td_finalizemr(kvm);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 77f46260d868..943219a08fcd 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
KVM_TDX_CMD_NR_MAX,
};
--
2.25.1
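A hedged user-space sketch of how a VMM such as qemu could issue the new
subcommand; the struct kvm_tdx_cmd layout (id/metadata/data) is assumed from
earlier in this series, and the command value 4 follows from the enum
ordering above.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define KVM_TDX_FINALIZE_VM 4   /* position in enum kvm_tdx_cmd_id above */

    /* Assumed layout; the kernel code above only uses id, metadata and data. */
    struct kvm_tdx_cmd_sketch {
            uint32_t id;
            uint32_t metadata;
            uint64_t data;
    };

    static int tdx_finalize_vm(int vm_fd)
    {
            struct kvm_tdx_cmd_sketch cmd;

            memset(&cmd, 0, sizeof(cmd));
            cmd.id = KVM_TDX_FINALIZE_VM;

            /* Issued after all KVM_TDX_INIT_MEM_REGION calls are done. */
            return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
    }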
From: Isaku Yamahata <[email protected]>
On exit from the guest TD, the XSAVE state is clobbered. Restore the host
XSAVE state on TD exit.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7a288aae03ba..54be5be1a06c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
#include <linux/cpu.h>
#include <linux/mmu_context.h>
+#include <asm/fpu/xcr.h>
#include <asm/tdx.h>
#include "capabilities.h"
@@ -549,6 +550,22 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->kvm->vm_bugged = true;
}
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+ if (static_cpu_has(X86_FEATURE_XSAVE) &&
+ host_xcr0 != (kvm_tdx->xfam & supported_xcr0))
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+ if (static_cpu_has(X86_FEATURE_XSAVES) &&
+ /* PT can be exposed to TD guest regardless of KVM's XSS support */
+ host_xss != (kvm_tdx->xfam & (supported_xss | XFEATURE_MASK_PT)))
+ wrmsrl(MSR_IA32_XSS, host_xss);
+ if (static_cpu_has(X86_FEATURE_PKU) &&
+ (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+ write_pkru(vcpu->arch.host_pkru);
+}
+
u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
@@ -572,6 +589,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(vcpu, tdx);
+ tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
--
2.25.1
From: Isaku Yamahata <[email protected]>
The TDX Guest-Hypervisor Communication Interface (GHCI) specification
defines the MapGPA hypercall for a guest TD to request that the host VMM
map a given GPA range as private or shared.
It means the guest TD will use the GPA as shared (or private); the GPA
won't be used the other way. The VMM should enforce the GPA usage, but it
doesn't have to map the GPA at the time of the hypercall.
- Zap the aliased region.
If a shared (or private) GPA is requested, zap the private (or shared)
GPA (modulo the shared bit).
- Record that the requested GPA is shared (or private) with
SPTE_PRIVATE_PROHIBIT in the SPTEs of both the shared and private EPT
tables.
- With SPTE_PRIVATE_PROHIBIT set, a shared GPA is allowed.
- With SPTE_PRIVATE_PROHIBIT cleared, a private GPA is allowed.
The reason to record SPTE_PRIVATE_PROHIBIT in both the shared and private
EPT is to keep the EPT violation path fast for normal guest TD execution
and to penalize the map_gpa hypercall instead.
If the guest TD faults on a GPA that isn't allowed (modulo the shared
bit), KVM doesn't resolve the EPT violation and lets the vcpu retry. The
vcpu will keep faulting until another vcpu maps the region with the
MapGPA hypercall. With the initial SPTE value (shadow_init_value),
SPTE_PRIVATE_PROHIBIT is cleared, so the default behavior doesn't change.
- Don't map the GPA.
The GPA is mapped on the next EPT violation.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 2 +
arch/x86/kvm/mmu/mmu.c | 56 ++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 208 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 3 +
4 files changed, 269 insertions(+)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b49841e4faaa..ac4540aa694d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -305,6 +305,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
+int kvm_mmu_map_gpa(struct kvm_vcpu *vcpu, gfn_t *startp, gfn_t end);
+
int kvm_mmu_post_init_vm(struct kvm *kvm);
void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0ec9548ff4dd..e2d4a7d546e1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6119,6 +6119,62 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
}
+int kvm_mmu_map_gpa(struct kvm_vcpu *vcpu, gfn_t *startp, gfn_t end)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ gfn_t start = *startp;
+ bool allow_private;
+ int ret;
+
+ if (!kvm_gfn_stolen_mask(kvm))
+ return -EOPNOTSUPP;
+
+ ret = mmu_topup_memory_caches(vcpu, false);
+ if (ret)
+ return ret;
+
+ allow_private = kvm_is_private_gfn(kvm, start);
+ start = kvm_gfn_unalias(kvm, start);
+ end = kvm_gfn_unalias(kvm, end);
+
+ mutex_lock(&kvm->slots_lock);
+ write_lock(&kvm->mmu_lock);
+
+ slots = __kvm_memslots(kvm, 0 /* only normal ram. not SMM. */);
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t s = max(start, memslot->base_gfn);
+ gfn_t e = min(end, memslot->base_gfn + memslot->npages);
+
+ if (WARN_ON_ONCE(s >= e))
+ continue;
+ if (is_tdp_mmu_enabled(kvm)) {
+ ret = kvm_tdp_mmu_map_gpa(vcpu, &s, e, allow_private);
+ if (ret) {
+ start = s;
+ break;
+ }
+ } else {
+ ret = -EOPNOTSUPP;
+ break;
+ }
+ }
+
+ write_unlock(&kvm->mmu_lock);
+ mutex_unlock(&kvm->slots_lock);
+
+ if (ret == -EAGAIN) {
+ if (allow_private)
+ *startp = kvm_gfn_private(kvm, start);
+ else
+ *startp = kvm_gfn_shared(kvm, start);
+ }
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_gpa);
+
static unsigned long
mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f6bd35831e32..b33ace3d4456 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -533,6 +533,13 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
WARN_ON(sp->gfn != gfn);
}
+ /*
+ * SPTE_PRIVATE_PROHIBIT is only changed by map_gpa that obtains
+ * write lock of mmu_lock.
+ */
+ WARN_ON(shared &&
+ (is_private_prohibit_spte(old_spte) !=
+ is_private_prohibit_spte(new_spte)));
static_call(kvm_x86_handle_changed_private_spte)(
kvm, gfn, level,
old_pfn, was_present, was_leaf,
@@ -1751,6 +1758,207 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
return spte_set;
}
+typedef void (*update_spte_t)(
+ struct kvm *kvm, struct tdp_iter *iter, bool allow_private);
+
+static int kvm_tdp_mmu_update_range(struct kvm_vcpu *vcpu, bool is_private,
+ gfn_t start, gfn_t end, gfn_t *nextp,
+ update_spte_t fn, bool allow_private)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct tdp_iter iter;
+ int ret = 0;
+
+ rcu_read_lock();
+ tdp_mmu_for_each_pte(iter, vcpu->arch.mmu, is_private, start, end) {
+ if (iter.level == PG_LEVEL_4K) {
+ fn(kvm, &iter, allow_private);
+ continue;
+ }
+
+ /*
+ * Whether the GPA is allowed to be private or shared is recorded at
+ * 4K granularity in the private leaf spte as SPTE_PRIVATE_PROHIBIT.
+ * Break large pages into 4K.
+ */
+ if (is_shadow_present_pte(iter.old_spte) &&
+ is_large_pte(iter.old_spte)) {
+ /*
+ * TODO: large page support.
+ * Doesn't support large page for TDX now
+ */
+ WARN_ON_ONCE(true);
+ tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
+ iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ }
+
+ if (!is_shadow_present_pte(iter.old_spte)) {
+ /*
+ * Guarantee that alloc_tdp_mmu_page() succeeds, which
+ * assumes page allocation from the cache always succeeds.
+ */
+ if (vcpu->arch.mmu_page_header_cache.nobjs == 0 ||
+ vcpu->arch.mmu_shadow_page_cache.nobjs == 0 ||
+ vcpu->arch.mmu_private_sp_cache.nobjs == 0) {
+ ret = -EAGAIN;
+ break;
+ }
+ /*
+ * write lock of mmu_lock is held. No other thread
+ * freezes SPTE.
+ */
+ if (!tdp_mmu_populate_nonleaf(
+ vcpu, &iter, is_private, false)) {
+ /* As the write lock is held, this case shouldn't happen. */
+ WARN_ON_ONCE(true);
+ ret = -EAGAIN;
+ break;
+ }
+ }
+ }
+ rcu_read_unlock();
+
+ if (ret == -EAGAIN)
+ *nextp = iter.next_last_level_gfn;
+
+ return ret;
+}
+
+static void kvm_tdp_mmu_update_shared_spte(
+ struct kvm *kvm, struct tdp_iter *iter, bool allow_private)
+{
+ u64 new_spte;
+
+ WARN_ON(kvm_is_private_gfn(kvm, iter->gfn));
+ if (allow_private) {
+ /* Zap SPTE and clear PRIVATE_PROHIBIT */
+ new_spte = shadow_init_value;
+ if (new_spte != iter->old_spte)
+ tdp_mmu_set_spte(kvm, iter, new_spte);
+ } else {
+ new_spte = iter->old_spte | SPTE_PRIVATE_PROHIBIT;
+ /* No side effect is needed */
+ if (new_spte != iter->old_spte)
+ WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+ }
+}
+
+static void kvm_tdp_mmu_update_private_spte(
+ struct kvm *kvm, struct tdp_iter *iter, bool allow_private)
+{
+ u64 new_spte;
+
+ WARN_ON(!kvm_is_private_gfn(kvm, iter->gfn));
+ if (allow_private) {
+ new_spte = iter->old_spte & ~SPTE_PRIVATE_PROHIBIT;
+ /* No side effect is needed */
+ if (new_spte != iter->old_spte)
+ WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+ } else {
+ if (is_shadow_present_pte(iter->old_spte)) {
+ /* Zap SPTE */
+ new_spte = shadow_init_value | SPTE_PRIVATE_PROHIBIT;
+ tdp_mmu_set_spte(kvm, iter, new_spte);
+ } else {
+ new_spte = iter->old_spte | SPTE_PRIVATE_PROHIBIT;
+ /* No side effect is needed */
+ WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+ }
+ }
+}
+
+/*
+ * Whether GPA is allowed to map private or shared is recorded in both private
+ * and shared leaf spte entry as SPTE_PRIVATE_PROHIBIT bit. They must match.
+ * private leaf spte entry
+ * - present: private mapping is allowed. (already mapped)
+ * - non-present: private mapping is allowed.
+ * - present | PRIVATE_PROHIBIT: invalid state.
+ * - non-present | SPTE_PRIVATE_PROHIBIT: shared mapping is allowed.
+ * may or may not be mapped as shared.
+ * shared leaf spte entry
+ * - present: invalid state
+ * - non-present: private mapping is allowed.
+ * - present | PRIVATE_PROHIBIT: shared mapping is allowed (already mapped)
+ * - non-present | PRIVATE_PROHIBIT: shared mapping is allowed.
+ *
+ * state change of private spte:
+ * map_gpa(private):
+ * private EPT entry: clear PRIVATE_PROHIBIT
+ * present: nop
+ * non-present: nop
+ * non-present | PRIVATE_PROHIBIT -> non-present
+ * share EPT entry: zap and clear PRIVATE_PROHIBIT
+ * any -> non-present
+ * map_gpa(shared):
+ * private EPT entry: zap and set PRIVATE_PROHIBIT
+ * present -> non-present | PRIVATE_PROHIBIT
+ * non-present -> non-present | PRIVATE_PROHIBIT
+ * non-present | PRIVATE_PROHIBIT: nop
+ * shared EPT entry: set PRIVATE_PROHIBIT
+ * present | PRIVATE_PROHIBIT: nop
+ * non-present -> non-present | PRIVATE_PROHIBIT
+ * non-present | PRIVATE_PROHIBIT: nop
+ * map(private GPA):
+ * private EPT entry: try to populate
+ * present: nop
+ * non-present -> present
+ * non-present | PRIVATE_PROHIBIT: nop. looping on EPT violation
+ * shared EPT entry: nop
+ * map(shared GPA):
+ * private EPT entry: nop
+ * shared EPT entry: populate
+ * present | PRIVATE_PROHIBIT: nop
+ * non-present | PRIVATE_PROHIBIT -> present | PRIVATE_PROHIBIT
+ * non-present: nop. looping on EPT violation
+ * zap(private GPA):
+ * private EPT entry: zap and keep PRIVATE_PROHIBIT
+ * present | PRIVATE_PROHIBIT -> non-present | PRIVATE_PROHIBIT
* non-present: nop as is_shadow_present_pte() is checked
+ * non-present | PRIVATE_PROHIBIT: nop by is_shadow_present_pte()
+ * shared EPT entry: nop
+ * zap(shared GPA):
+ * private EPT entry: nop
+ * shared EPT entry: zap and keep PRIVATE_PROHIBIT
+ * present | PRIVATE_PROHIBIT -> non-present | PRIVATE_PROHIBIT
+ * non-present | PRIVATE_PROHIBIT: nop
+ * non-present: nop.
+ */
+int kvm_tdp_mmu_map_gpa(struct kvm_vcpu *vcpu,
+ gfn_t *startp, gfn_t end, bool allow_private)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ gfn_t start = *startp;
+ gfn_t next;
+ int ret = 0;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ WARN_ON(start & kvm_gfn_stolen_mask(kvm));
+ WARN_ON(end & kvm_gfn_stolen_mask(kvm));
+
+ if (!VALID_PAGE(mmu->root_hpa) || !VALID_PAGE(mmu->private_root_hpa))
+ return -EINVAL;
+
+ next = end;
+ ret = kvm_tdp_mmu_update_range(
+ vcpu, false, kvm_gfn_shared(kvm, start), kvm_gfn_shared(kvm, end),
+ &next, kvm_tdp_mmu_update_shared_spte, allow_private);
+ if (ret) {
+ kvm_flush_remote_tlbs_with_address(kvm, start, next - start);
+ return ret;
+ }
+
+ ret = kvm_tdp_mmu_update_range(vcpu, true, start, end, &next,
+ kvm_tdp_mmu_update_private_spte, allow_private);
+ if (ret == -EAGAIN) {
+ *startp = next;
+ end = *startp;
+ }
+ kvm_flush_remote_tlbs_with_address(kvm, start, end - start);
+ return ret;
+}
+
/*
* Return the level of the lowest level SPTE added to sptes.
* That SPTE may be non-present.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 7c62f694a465..0f83960d92aa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -74,6 +74,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
int min_level);
+int kvm_tdp_mmu_map_gpa(struct kvm_vcpu *vcpu,
+ gfn_t *startp, gfn_t end, bool is_private);
+
static inline void kvm_tdp_mmu_walk_lockless_begin(void)
{
rcu_read_lock();
--
2.25.1
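To make the state machine documented in the patch above easier to follow,
here is a tiny self-contained model (plain user-space C, not KVM code) of one
GFN's pair of leaf SPTEs under map_gpa(); only presence and the
PRIVATE_PROHIBIT bit are modeled, and TLB flushes are ignored.

    #include <stdbool.h>
    #include <stdio.h>

    struct spte_model {
            bool present;
            bool private_prohibit;
    };

    /* One GFN has a private and a shared leaf SPTE. */
    struct gfn_model {
            struct spte_model priv;
            struct spte_model shared;
    };

    static void map_gpa(struct gfn_model *g, bool allow_private)
    {
            if (allow_private) {
                    /* private EPT: clear PROHIBIT; shared EPT: zap and clear. */
                    g->priv.private_prohibit = false;
                    g->shared.present = false;
                    g->shared.private_prohibit = false;
            } else {
                    /* private EPT: zap and set PROHIBIT; shared EPT: set PROHIBIT. */
                    g->priv.present = false;
                    g->priv.private_prohibit = true;
                    g->shared.private_prohibit = true;
            }
    }

    int main(void)
    {
            struct gfn_model g = { 0 };     /* at boot: private allowed, unmapped */

            map_gpa(&g, false);             /* guest TD converts the GFN to shared */
            printf("private allowed: %d, shared allowed: %d\n",
                   !g.priv.private_prohibit, g.shared.private_prohibit);
            return 0;
    }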
From: Isaku Yamahata <[email protected]>
SPTE_PRIVATE_PROHIBIT specifies whether a shared or private GPA is allowed.
It needs to be preserved when the EPT entry is zapped. Currently the zapped
EPT entry is unconditionally initialized to shadow_init_value, which clears
the SPTE_PRIVATE_PROHIBIT bit. To carry the SPTE_PRIVATE_PROHIBIT bit across
zapping, introduce a helper function that returns the initial value for a
zapped entry with the SPTE_PRIVATE_PROHIBIT bit preserved, and replace
shadow_init_value with it.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1949f81027a0..6d750563824d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -610,6 +610,12 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
return true;
}
+static u64 shadow_init_spte(u64 old_spte)
+{
+ return shadow_init_value |
+ (is_private_prohibit_spte(old_spte) ? SPTE_PRIVATE_PROHIBIT : 0);
+}
+
static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
@@ -641,7 +647,8 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* shadow_init_value (which sets "suppress #VE" bit) so it
* can be set when EPT table entries are zapped.
*/
- WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
+ WRITE_ONCE(*rcu_dereference(iter->sptep),
+ shadow_init_spte(iter->old_spte));
return true;
}
@@ -853,7 +860,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
if (!shared) {
/* see comments in tdp_mmu_zap_spte_atomic() */
- tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
+ tdp_mmu_set_spte(kvm, &iter,
+ shadow_init_spte(iter.old_spte));
flush = true;
} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
/*
@@ -1038,11 +1046,14 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
new_spte = make_mmio_spte(vcpu,
tdp_iter_gfn_unalias(vcpu->kvm, iter),
pte_access);
- else
+ else {
wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
tdp_iter_gfn_unalias(vcpu->kvm, iter),
fault->pfn, iter->old_spte, fault->prefetch,
true, fault->map_writable, &new_spte);
+ if (is_private_prohibit_spte(iter->old_spte))
+ new_spte |= SPTE_PRIVATE_PROHIBIT;
+ }
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1335,7 +1346,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See __handle_changed_spte().
*/
- tdp_mmu_set_spte(kvm, iter, shadow_init_value);
+ tdp_mmu_set_spte(kvm, iter, shadow_init_spte(iter->old_spte));
if (!pte_write(range->pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1
From: Isaku Yamahata <[email protected]>
Because guest memory is protected in TDX, creating the initial guest memory
requires a dedicated TDX module API, tdh_mem_page_add, instead of directly
copying the memory contents into guest memory as is done for the default VM
type. The KVM MMU page fault handler callback, private_page_add, handles it.
Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of the VM-scoped
KVM_MEMORY_ENCRYPT_OP. It assigns the guest page, copies the initial memory
contents into guest memory, and encrypts the guest memory. Optionally, it
also extends the memory measurement of the TDX guest. It calls the KVM MMU
page fault (EPT-violation) handler to trigger the callbacks for this.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 9 ++
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/vmx/tdx.c | 128 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 2 +
tools/arch/x86/include/uapi/asm/kvm.h | 9 ++
5 files changed, 149 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9702f0d95776..77f46260d868 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
KVM_TDX_CMD_NR_MAX,
};
@@ -574,4 +575,12 @@ struct kvm_tdx_init_vm {
__u64 reserved[43]; /* must be zero for future extensibility */
};
+#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 72d8f200c819..23c954035227 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5226,6 +5226,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
out:
return r;
}
+EXPORT_SYMBOL(kvm_mmu_load);
void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5d74ae001e4f..cd726c41d362 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -514,6 +514,21 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
}
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+{
+ struct tdx_module_output out;
+ u64 err;
+ int i;
+
+ for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+ err = tdh_mr_extend(kvm_tdx->tdr.pa, gpa + i, &out);
+ if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
+ pr_tdx_error(TDH_MR_EXTEND, err, &out);
+ break;
+ }
+ }
+}
+
static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -521,6 +536,7 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
hpa_t hpa = pfn_to_hpa(pfn);
gpa_t gpa = gfn_to_gpa(gfn);
struct tdx_module_output out;
+ hpa_t source_pa;
u64 err;
if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
@@ -533,12 +549,33 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
/* Pin the page, TDX KVM doesn't yet support page migration. */
get_page(pfn_to_page(pfn));
+ /* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
if (likely(is_td_finalized(kvm_tdx))) {
err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
if (KVM_BUG_ON(err, kvm))
pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
return;
}
+
+ /*
+ * In the case of the TDP MMU, the fault handler can run concurrently.
+ * Note that 'source_pa' is a TD-scope variable, meaning that if multiple
+ * threads reach here and all need to access 'source_pa', it will break.
+ * Fortunately this won't happen, because the TDH_MEM_PAGE_ADD path below
+ * is only used while the VM is being created, before it runs, via the
+ * KVM_TDX_INIT_MEM_REGION ioctl (which always uses vcpu 0's page table
+ * and is protected by vcpu->mutex).
+ */
+ WARN_ON(kvm_tdx->source_pa == INVALID_PAGE);
+ source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
+
+ err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &out);
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
+ else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
+ tdx_measure_page(kvm_tdx, gpa);
+
+ kvm_tdx->source_pa = INVALID_PAGE;
}
static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -978,6 +1015,94 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu)
cpu_relax();
}
+#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_tdx_init_mem_region region;
+ struct kvm_vcpu *vcpu;
+ struct page *page;
+ kvm_pfn_t pfn;
+ int idx, ret = 0;
+
+ /* The BSP vCPU must be created before initializing memory regions. */
+ if (!atomic_read(&kvm->online_vcpus))
+ return -EINVAL;
+
+ if (cmd->metadata & ~KVM_TDX_MEASURE_MEMORY_REGION)
+ return -EINVAL;
+
+ if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
+ return -EFAULT;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) ||
+ !IS_ALIGNED(region.gpa, PAGE_SIZE) ||
+ !region.nr_pages ||
+ region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+ !kvm_is_private_gpa(kvm, region.gpa) ||
+ !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
+ return -EINVAL;
+
+ vcpu = kvm_get_vcpu(kvm, 0);
+ if (mutex_lock_killable(&vcpu->mutex))
+ return -EINTR;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&kvm->srcu);
+
+ kvm_mmu_reload(vcpu);
+
+ while (region.nr_pages) {
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+
+ /* Pin the source page. */
+ ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
+ if (ret < 0)
+ break;
+ if (ret != 1) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
+ (cmd->metadata & KVM_TDX_MEASURE_MEMORY_REGION);
+
+ pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+ PG_LEVEL_4K);
+ if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
+ ret = -EFAULT;
+ else
+ ret = 0;
+
+ put_page(page);
+ if (ret)
+ break;
+
+ region.source_addr += PAGE_SIZE;
+ region.gpa += PAGE_SIZE;
+ region.nr_pages--;
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ mutex_unlock(&vcpu->mutex);
+
+ if (copy_to_user((void __user *)cmd->data, &region, sizeof(region)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -995,6 +1120,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_INIT_VM:
r = tdx_td_init(kvm, &tdx_cmd);
break;
+ case KVM_TDX_INIT_MEM_REGION:
+ r = tdx_init_mem_region(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 906666c7c70b..bf9865a88991 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -28,6 +28,8 @@ struct kvm_tdx {
int cpuid_nent;
struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
+ hpa_t source_pa;
+
bool finalized;
bool tdh_mem_track;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 9702f0d95776..77f46260d868 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
KVM_TDX_CMD_NR_MAX,
};
@@ -574,4 +575,12 @@ struct kvm_tdx_init_vm {
__u64 reserved[43]; /* must be zero for future extensibility */
};
+#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1
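A hedged user-space sketch of driving the new subcommand; struct
kvm_tdx_init_mem_region and the flag come from the uapi hunk above, while the
struct kvm_tdx_cmd layout (id/metadata/data) is an assumption carried over
from earlier in this series.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define KVM_TDX_INIT_MEM_REGION         3       /* enum kvm_tdx_cmd_id above */
    #define KVM_TDX_MEASURE_MEMORY_REGION   (1UL << 0)

    struct kvm_tdx_init_mem_region_sketch {
            uint64_t source_addr;
            uint64_t gpa;
            uint64_t nr_pages;
    };

    /* Assumed layout, matching the fields the kernel code dereferences. */
    struct kvm_tdx_cmd_sketch {
            uint32_t id;
            uint32_t metadata;
            uint64_t data;
    };

    static int tdx_add_initial_memory(int vm_fd, void *src, uint64_t gpa,
                                      uint64_t nr_pages)
    {
            struct kvm_tdx_init_mem_region_sketch region = {
                    .source_addr = (uintptr_t)src,  /* page-aligned source buffer */
                    .gpa = gpa,                     /* page-aligned private GPA */
                    .nr_pages = nr_pages,
            };
            struct kvm_tdx_cmd_sketch cmd = {
                    .id = KVM_TDX_INIT_MEM_REGION,
                    .metadata = KVM_TDX_MEASURE_MEMORY_REGION, /* extend MRTD too */
                    .data = (uintptr_t)&region,
            };

            return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
    }

Per the sanity checks in tdx_init_mem_region(), the BSP vcpu must already
have been created and the GPA range must be private.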
From: Isaku Yamahata <[email protected]>
This empty commit marks the start of the patch layer for TD finalization.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index d56b3890ddfe..3737b966ea07 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -21,11 +21,11 @@ Patch Layer status
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
-* TD finalization: Not yet
+* TD finalization: Applying
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA stolen bits: Applied
* KVM TDP refactoring for TDX: Applied
* KVM TDP MMU hooks: Applied
-* KVM TDP MMU MapGPA: Not yet
+* KVM TDP MMU MapGPA: Applied
--
2.25.1
From: Isaku Yamahata <[email protected]>
At this point, TDX supports only the TDP MMU and doesn't support the legacy
MMU. Forcibly use the TDP MMU for TDX regardless of the kernel parameter
that disables the TDP MMU.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b33ace3d4456..9df6aa4da202 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -16,7 +16,12 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
/* Initializes the TDP MMU for the VM, if enabled. */
bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
- if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
+ /*
+ * Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
+ * of TDX.
+ */
+ if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
+ (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
return false;
/* This should not be changed for the lifetime of the VM. */
--
2.25.1
From: Isaku Yamahata <[email protected]>
Use the SPTE_PRIVATE_PROHIBIT bit in the shared and private EPT to determine
which mapping, shared or private, is allowed. If the requested mapping isn't
allowed, return RET_PF_RETRY and wait for another vcpu to change it.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++++++++---
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 25dffdb488d1..9c37381a6762 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -223,7 +223,7 @@ extern u64 __read_mostly shadow_init_value;
static inline bool is_removed_spte(u64 spte)
{
- return spte == SHADOW_REMOVED_SPTE;
+ return (spte & ~SPTE_PRIVATE_PROHIBIT) == SHADOW_REMOVED_SPTE;
}
static inline bool is_private_prohibit_spte(u64 spte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6d750563824d..f6bd35831e32 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1038,9 +1038,25 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
WARN_ON(sp->role.level != fault->goal_level);
- /* TDX shared GPAs are no executable, enforce this for the SDV. */
- if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
- pte_access &= ~ACC_EXEC_MASK;
+ if (kvm_gfn_stolen_mask(vcpu->kvm)) {
+ if (is_private_spte(iter->sptep)) {
+ /*
+ * This GPA is not allowed to map as private. Let
+ * vcpu loop in page fault until another vcpu changes it
+ * by MapGPA hypercall.
+ */
+ if (fault->slot &&
+ is_private_prohibit_spte(iter->old_spte))
+ return RET_PF_RETRY;
+ } else {
+ /* This GPA is not allowed to map as shared. */
+ if (fault->slot &&
+ !is_private_prohibit_spte(iter->old_spte))
+ return RET_PF_RETRY;
+ /* TDX shared GPAs are not executable, enforce this. */
+ pte_access &= ~ACC_EXEC_MASK;
+ }
+ }
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu,
--
2.25.1
From: Isaku Yamahata <[email protected]>
With TDX, all GFNs are private at guest boot time. At run time, the guest TD
can explicitly change a GFN from private to shared, or vice versa, with the
MapGPA hypercall. Once specified, the given GFN can't be used the other way.
That is, if a guest tells KVM that a GFN is shared, it can't be used as
private, and vice versa.
KVM needs to record this. Steal a software-usable bit for it from the MMIO
generation bits.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e88f796724b4..25dffdb488d1 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -14,6 +14,9 @@
*/
#define SPTE_MMU_PRESENT_MASK BIT_ULL(11)
+/* Mask used to track shared GPAs */
+#define SPTE_PRIVATE_PROHIBIT BIT_ULL(62)
+
/*
* TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
* be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
@@ -124,7 +127,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
* the memslots generation and is derived as follows:
*
* Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
- * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 8-18 of the MMIO generation are propagated to spte bits 52-61
*
* The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
* the MMIO generation number, as doing so would require stealing a bit from
@@ -138,7 +141,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
#define MMIO_SPTE_GEN_LOW_END 10
#define MMIO_SPTE_GEN_HIGH_START 52
-#define MMIO_SPTE_GEN_HIGH_END 62
+#define MMIO_SPTE_GEN_HIGH_END 61
#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
MMIO_SPTE_GEN_LOW_START)
@@ -151,7 +154,7 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
/* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 10);
#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
@@ -208,6 +211,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(REMOVED_SPTE & SPTE_PRIVATE_PROHIBIT));
/*
* See above comment around REMOVED_SPTE. SHADOW_REMOVED_SPTE is the actual
@@ -222,6 +226,11 @@ static inline bool is_removed_spte(u64 spte)
return spte == SHADOW_REMOVED_SPTE;
}
+static inline bool is_private_prohibit_spte(u64 spte)
+{
+ return !!(spte & SPTE_PRIVATE_PROHIBIT);
+}
+
/*
* In some cases, we need to preserve the GFN of a non-present or reserved
* SPTE when we usurp the upper five bits of the physical address space to
--
2.25.1
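A small self-contained check, with the constants copied from the hunk above,
showing why MMIO_SPTE_GEN_HIGH_END shrinks from 62 to 61: bit 62 is now taken
by SPTE_PRIVATE_PROHIBIT and must stay outside the MMIO generation field.
GENMASK_ULL here is a local stand-in for the kernel macro.

    #include <assert.h>

    #define SPTE_PRIVATE_PROHIBIT           (1ULL << 62)
    #define MMIO_SPTE_GEN_HIGH_START        52
    #define MMIO_SPTE_GEN_HIGH_END          61
    /* Local stand-in for the kernel's GENMASK_ULL(). */
    #define GENMASK_ULL(h, l) \
            (((~0ULL) >> (63 - (h))) & ~((1ULL << (l)) - 1))
    #define MMIO_SPTE_GEN_HIGH_MASK \
            GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, MMIO_SPTE_GEN_HIGH_START)

    /* The stolen bit must not overlap the (now 10-bit) high generation field. */
    static_assert(!(SPTE_PRIVATE_PROHIBIT & MMIO_SPTE_GEN_HIGH_MASK),
                  "SPTE_PRIVATE_PROHIBIT overlaps the MMIO generation bits");

    int main(void) { return 0; }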
From: Isaku Yamahata <[email protected]>
This empty commit marks the start of the patch layer for KVM TDP MMU
MapGPA.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index a14355332d44..d56b3890ddfe 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -11,6 +11,7 @@ What qemu can do
- TDX VM TYPE is exposed to Qemu.
- Qemu can create/destroy guest of TDX vm type.
- Qemu can create/destroy vcpu of TDX vm type.
+- Qemu can populate initial guest memory image.
Patch Layer status
------------------
@@ -19,7 +20,7 @@ Patch Layer status
* TDX architectural definitions: Applied
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
-* TDX EPT violation: Applying
+* TDX EPT violation: Applied
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet
--
2.25.1
From: Sean Christopherson <[email protected]>
Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
interrupt to support TDX's usage of APICv. Unlike VMX, TDX doesn't have
access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
i.e. needs to generate a posted interrupt and more importantly can't
manually move requested interrupts into the vIRR (which it also doesn't
have access to).
Because pi_has_pending_interrupt() is a heavy operation that uses two atomic
test-bit operations and one atomic 256-bit bitmap check, introduce a new
callback for this check instead of reusing the dy_apicv_has_pending_interrupt()
callback, to avoid affecting the existing code.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/main.c | 9 +++++++++
arch/x86/kvm/x86.c | 5 ++++-
3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 489374a57b66..8dab9f16f559 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1491,6 +1491,7 @@ struct kvm_x86_ops {
void (*start_assignment)(struct kvm *kvm);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+ bool (*apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 882358ac270b..d75caf0d6861 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -148,6 +148,14 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
return vmx_vcpu_load(vcpu, cpu);
}
+static bool vt_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return pi_has_pending_interrupt(vcpu);
+
+ return false;
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -297,6 +305,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sync_pir_to_irr = vmx_sync_pir_to_irr,
.deliver_interrupt = vmx_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+ .apicv_has_pending_interrupt = vt_apicv_has_pending_interrupt,
.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 89d04cd64cd0..314ae43e07bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12111,7 +12111,10 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
if (kvm_arch_interrupt_allowed(vcpu) &&
(kvm_cpu_has_interrupt(vcpu) ||
- kvm_guest_apic_has_interrupt(vcpu)))
+ kvm_guest_apic_has_interrupt(vcpu) ||
+ (vcpu->arch.apicv_active &&
+ kvm_x86_ops.apicv_has_pending_interrupt &&
+ kvm_x86_ops.apicv_has_pending_interrupt(vcpu))))
return true;
if (kvm_hv_has_stimer_pending(vcpu))
--
2.25.1
From: Isaku Yamahata <[email protected]>
Now that we are able to inject interrupts into a TDX vcpu, the TDX vcpu is
ready to be blocked. Wire up the kvm x86 methods for blocking/unblocking the
vcpu for TDX. To unblock on pending events, the request-immediate-exit
method is also needed.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a0bcc4dca678..404a260796e4 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -280,6 +280,14 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
vmx_enable_irq_window(vcpu);
}
+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return __kvm_request_immediate_exit(vcpu);
+
+ vmx_request_immediate_exit(vcpu);
+}
+
static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -402,7 +410,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
- .request_immediate_exit = vmx_request_immediate_exit,
+ .request_immediate_exit = vt_request_immediate_exit,
.sched_in = vt_sched_in,
--
2.25.1
From: Sean Christopherson <[email protected]>
Add an option to skip the IRR check in kvm_wait_lapic_expire(). This
will be used by TDX to wait if there is an outstanding notification for
a TD, i.e. a virtual interrupt is being triggered via posted interrupt
processing. KVM TDX doesn't emulate PI processing, i.e. there will
never be a bit set in IRR/ISR, so the default behavior for APICv of
querying the IRR doesn't work as intended.
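For illustration, a minimal sketch of how a TDX vcpu-run path might use the
new parameter (tdx_vcpu_run() is only an assumed entry point here, not part of
this patch); passing force_wait=true keeps the timer-advance wait even though
no bit is ever set in the IRR for a TD:

  static fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
  {
          /*
           * A TD never has a bit set in its virtual IRR/ISR, so ask
           * kvm_wait_lapic_expire() to wait unconditionally.
           */
          kvm_wait_lapic_expire(vcpu, true);

          /* ... enter the TD via the TDX module (TDH.VP.ENTER) ... */

          return EXIT_FASTPATH_NONE;
  }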
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/lapic.c | 4 ++--
arch/x86/kvm/lapic.h | 2 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9322e6340a74..d49f029ef0e3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1620,12 +1620,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
}
-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
{
if (lapic_in_kernel(vcpu) &&
vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
vcpu->arch.apic->lapic_timer.timer_advance_ns &&
- lapic_timer_int_injected(vcpu))
+ (force_wait || lapic_timer_int_injected(vcpu)))
__kvm_wait_lapic_expire(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2b44e533fc8d..2a0119ef9e96 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);
void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
unsigned long *vcpu_bitmap);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c7eec23e9ebe..a46415845f48 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3766,7 +3766,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
clgi();
kvm_load_guest_xsave_state(vcpu);
- kvm_wait_lapic_expire(vcpu);
+ kvm_wait_lapic_expire(vcpu, false);
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 00f88aa25047..9b7bd52d19a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6838,7 +6838,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (enable_preemption_timer)
vmx_update_hv_timer(vcpu);
- kvm_wait_lapic_expire(vcpu);
+ kvm_wait_lapic_expire(vcpu, false);
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
--
2.25.1
From: Isaku Yamahata <[email protected]>
The TDX module API doesn't provide an API for the VMM to inject an INIT IPI
or SIPI. Instead, it defines a different protocol to boot application
processors. Ignore INIT and SIPI events for the TDX guest.
There are two options: 1) silently ignore the INIT/SIPI request, or 2) return
an error to the guest TD somehow. Given that the TDX guest is paravirtualized
to boot APs, option 1 is chosen for simplicity.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/lapic.c | 21 +++++++++++++++++----
arch/x86/kvm/vmx/main.c | 10 +++++++++-
arch/x86/kvm/x86.h | 5 +++++
3 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d49f029ef0e3..e27653d5e630 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2921,11 +2921,20 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
if (test_bit(KVM_APIC_INIT, &pe)) {
clear_bit(KVM_APIC_INIT, &apic->pending_events);
- kvm_vcpu_reset(vcpu, true);
- if (kvm_vcpu_is_bsp(apic->vcpu))
+ if (kvm_init_sipi_unsupported(vcpu->kvm))
+ /*
+ * TDX doesn't support INIT. Ignore INIT event. In the
+ * case of SIPI, the callback of
+ * vcpu_deliver_sipi_vector ignores it.
+ */
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
- else
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+ else {
+ kvm_vcpu_reset(vcpu, true);
+ if (kvm_vcpu_is_bsp(apic->vcpu))
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ else
+ vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+ }
}
if (test_bit(KVM_APIC_SIPI, &pe)) {
clear_bit(KVM_APIC_SIPI, &apic->pending_events);
@@ -2933,6 +2942,10 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
/* evaluate pending_events before reading the vector */
smp_rmb();
sipi_vector = apic->sipi_vector;
+ /*
+ * If INIT/SIPI isn't supported, the callback ignores the
+ * SIPI request.
+ */
kvm_x86_ops.vcpu_deliver_sipi_vector(vcpu, sipi_vector);
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
}
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 478aa63acefa..de9b4a270f20 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -264,6 +264,14 @@ static bool vt_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
return false;
}
+static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ kvm_vcpu_deliver_sipi_vector(vcpu, vector);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -586,7 +594,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.msr_filter_changed = vmx_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_sipi_vector = vt_vcpu_deliver_sipi_vector,
.mem_enc_op = vt_mem_enc_op,
.mem_enc_op_vcpu = vt_mem_enc_op_vcpu,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f15bf1c0aeb1..c789d72ab408 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -405,6 +405,11 @@ static inline void kvm_machine_check(void)
#endif
}
+static __always_inline bool kvm_init_sipi_unsupported(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
int kvm_spec_ctrl_test_value(u64 value);
--
2.25.1
From: Isaku Yamahata <[email protected]>
As the first step of TDX VM support, report to the device model, e.g. qemu,
that the TDX VM type is supported. The callback to create a guest TD is the
vm_init callback for KVM_CREATE_VM. Add a placeholder function and initialize
the TDX module on demand from that callback, because by the time it runs VMX
has been enabled by the hardware_enable callback (vmx_hardware_enable).
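As a rough illustration of how a device model is expected to exercise this
from userspace (passing the VM type as the KVM_CREATE_VM argument and the
interim -EOPNOTSUPP result are assumptions based on this patch, not a settled
ABI):

  #include <fcntl.h>
  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* KVM_X86_TDX_VM is assumed to come from the patched uapi headers. */
  static int try_create_td(int kvm_fd)
  {
          /* Ask for a TD instead of a default VM. */
          int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);

          if (vm_fd < 0)
                  /* -EOPNOTSUPP until TD creation is wired up by later patches. */
                  return -errno;

          return vm_fd;
  }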
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 24 ++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 5 +++++
arch/x86/kvm/vmx/vmx.c | 5 -----
arch/x86/kvm/vmx/x86_ops.h | 3 ++-
4 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 77da926ee505..8103d1c32cc9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -5,6 +5,12 @@
#include "vmx.h"
#include "nested.h"
#include "pmu.h"
+#include "tdx.h"
+
+static bool vt_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
+}
static __init int vt_hardware_setup(void)
{
@@ -19,6 +25,20 @@ static __init int vt_hardware_setup(void)
return 0;
}
+static int vt_vm_init(struct kvm *kvm)
+{
+ int ret;
+
+ if (is_td(kvm)) {
+ ret = tdx_module_setup();
+ if (ret)
+ return ret;
+ return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
+ }
+
+ return vmx_vm_init(kvm);
+}
+
struct kvm_x86_ops vt_x86_ops __initdata = {
.name = "kvm_intel",
@@ -29,9 +49,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.cpu_has_accelerated_tpr = report_flexpriority,
.has_emulated_msr = vmx_has_emulated_msr,
- .is_vm_type_supported = vmx_is_vm_type_supported,
+ .is_vm_type_supported = vt_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
+ .vm_init = vt_vm_init,
.vcpu_create = vmx_vcpu_create,
.vcpu_free = vmx_vcpu_free,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8adc87ad1807..e8d293a3c11c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -105,6 +105,11 @@ int tdx_module_setup(void)
return ret;
}
+bool tdx_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_TDX_VM && READ_ONCE(enable_tdx);
+}
+
static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
{
u32 max_pa;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3c7b3f245fee..7838cd177f0e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7079,11 +7079,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}
-bool vmx_is_vm_type_supported(unsigned long type)
-{
- return type == KVM_X86_DEFAULT_VM;
-}
-
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f7327bc73be0..78331dbc29f7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -25,7 +25,6 @@ void vmx_hardware_unsetup(void);
int vmx_hardware_enable(void);
void vmx_hardware_disable(void);
bool report_flexpriority(void);
-bool vmx_is_vm_type_supported(unsigned long type);
int vmx_vm_init(struct kvm *kvm);
int vmx_vcpu_create(struct kvm_vcpu *vcpu);
int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
@@ -130,10 +129,12 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
#ifdef CONFIG_INTEL_TDX_HOST
void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
unsigned int *vcpu_align, unsigned int *vm_size);
+bool tdx_is_vm_type_supported(unsigned long type);
void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
#else
static inline void tdx_pre_kvm_init(
unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
+static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
#endif
--
2.25.1
From: Sean Christopherson <[email protected]>
KVM accesses the Virtual Machine Control Structure (VMCS) with VMX
instructions to operate on a VM. TDX instead defines its own data structures
and TDX SEAMCALL APIs for the VMM to operate on a Trust Domain (TD).
Trust Domain Virtual Processor State (TDVPS) is the root control structure
of a TD VCPU. It helps the TDX module control the operation of the VCPU,
and holds the VCPU state while the VCPU is not running. TDVPS is opaque to
software and DMA access, accessible only by using the TDX module interface
functions (such as TDH.VP.RD, TDH.VP.WR, ...). TDVPS includes the TD VMCS, and
TD VMCS auxiliary structures, such as virtual APIC page, virtualization
exception information, etc. TDVPS is composed of Trust Domain Virtual
Processor Root (TDVPR) which is the root page of TDVPS and Trust Domain
Virtual Processor eXtension (TDVPX) pages which extend TDVPR to help
provide enough physical space for the logical TDVPS structure.
There is also a new structure: the Trust Domain Control Structure (TDCS) is
the main control structure of a guest TD, encrypted using the guest TD's
ephemeral private key. At a high level, TDCS holds information for controlling
TD operation as a whole: execution controls, EPTP, MSR bitmaps, etc. KVM
needs to set it up. Note that MSR bitmaps are held as part of TDCS (unlike
VMX) because they are meant to have the same value for all VCPUs of the
same TD. TDCS is a multi-page logical structure composed of multiple Trust
Domain Control Extension (TDCX) physical pages. Trust Domain Root (TDR) is
the root control structure of a guest TD and is encrypted using the TDX
global private key. It holds a minimal set of state variables that enable
guest TD control even during times when the TD's private key is not known,
or when the TD's key management state does not permit access to memory
encrypted using the TD's private key.
The following shows the relationship between those structures.
TDR--> TDCS per-TD
| \--> TDCX
\
\--> TDVPS per-TD VCPU
\--> TDVPR and TDVPX
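To make the ownership concrete, the sketch below shows one way KVM could track
the pages backing these opaque structures (the struct and field names here are
illustrative assumptions, not necessarily what later patches use):

  /* Per-TD: VM-scoped pages handed to the TDX module. */
  struct kvm_tdx {
          struct kvm kvm;

          unsigned long tdr_pa;           /* TDR: root page, TDX global key */
          unsigned long *tdcs_pa;         /* TDCS: array of TDCX pages      */
  };

  /* Per-vcpu: pages that back the logical TDVPS. */
  struct vcpu_tdx {
          struct kvm_vcpu vcpu;

          unsigned long tdvpr_pa;         /* TDVPR: root page of TDVPS      */
          unsigned long *tdvpx_pa;        /* TDVPX: extension pages         */
  };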
The existing global struct kvm_x86_ops already defines an interface which
fits with TDX. But kvm_x86_ops is a system-wide, not per-VM, structure. To
allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
"if (tdx) tdx_op() else vmx_op()" to switch between VMX and TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
both VMs and TDs. Use 'vt' for the naming scheme as a nod to VT-x and as a
concatenation of VmxTdx.
The current code looks as follows.
In vmx.c
static vmx_op() { ... }
static struct kvm_x86_ops vmx_x86_ops = {
.op = vmx_op,
initialization code
The eventually converted code will look like
In vmx.c, keep the VMX operations.
vmx_op() { ... }
VMX initialization
In tdx.c, define the TDX operations.
tdx_op() { ... }
TDX initialization
In x86_ops.h, declare the VMX and TDX operations.
vmx_op();
tdx_op();
In main.c, define common wrappers for VMX and TDX.
static vt_ops() { if (tdx) tdx_ops() else vmx_ops() }
static struct kvm_x86_ops vt_x86_ops = {
.op = vt_op,
initialization to call VMX and TDX initialization
Opportunistically, fix the naming inconsistency by renaming vmx_create_vcpu()
and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/main.c | 154 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 360 +++++++++++--------------------------
arch/x86/kvm/vmx/x86_ops.h | 126 +++++++++++++
4 files changed, 385 insertions(+), 257 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 30f244b64523..ee4d0999f20f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -22,7 +22,7 @@ kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
kvm-$(CONFIG_KVM_XEN) += xen.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
- vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
+ vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..b08ea9c42a11
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+ .name = "kvm_intel",
+
+ .hardware_unsetup = vmx_hardware_unsetup,
+
+ .hardware_enable = vmx_hardware_enable,
+ .hardware_disable = vmx_hardware_disable,
+ .cpu_has_accelerated_tpr = report_flexpriority,
+ .has_emulated_msr = vmx_has_emulated_msr,
+
+ .vm_size = sizeof(struct kvm_vmx),
+ .vm_init = vmx_vm_init,
+
+ .vcpu_create = vmx_vcpu_create,
+ .vcpu_free = vmx_vcpu_free,
+ .vcpu_reset = vmx_vcpu_reset,
+
+ .prepare_guest_switch = vmx_prepare_switch_to_guest,
+ .vcpu_load = vmx_vcpu_load,
+ .vcpu_put = vmx_vcpu_put,
+
+ .update_exception_bitmap = vmx_update_exception_bitmap,
+ .get_msr_feature = vmx_get_msr_feature,
+ .get_msr = vmx_get_msr,
+ .set_msr = vmx_set_msr,
+ .get_segment_base = vmx_get_segment_base,
+ .get_segment = vmx_get_segment,
+ .set_segment = vmx_set_segment,
+ .get_cpl = vmx_get_cpl,
+ .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+ .set_cr0 = vmx_set_cr0,
+ .is_valid_cr4 = vmx_is_valid_cr4,
+ .set_cr4 = vmx_set_cr4,
+ .set_efer = vmx_set_efer,
+ .get_idt = vmx_get_idt,
+ .set_idt = vmx_set_idt,
+ .get_gdt = vmx_get_gdt,
+ .set_gdt = vmx_set_gdt,
+ .set_dr7 = vmx_set_dr7,
+ .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+ .cache_reg = vmx_cache_reg,
+ .get_rflags = vmx_get_rflags,
+ .set_rflags = vmx_set_rflags,
+ .get_if_flag = vmx_get_if_flag,
+
+ .tlb_flush_all = vmx_flush_tlb_all,
+ .tlb_flush_current = vmx_flush_tlb_current,
+ .tlb_flush_gva = vmx_flush_tlb_gva,
+ .tlb_flush_guest = vmx_flush_tlb_guest,
+
+ .vcpu_pre_run = vmx_vcpu_pre_run,
+ .run = vmx_vcpu_run,
+ .handle_exit = vmx_handle_exit,
+ .skip_emulated_instruction = vmx_skip_emulated_instruction,
+ .update_emulated_instruction = vmx_update_emulated_instruction,
+ .set_interrupt_shadow = vmx_set_interrupt_shadow,
+ .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .patch_hypercall = vmx_patch_hypercall,
+ .set_irq = vmx_inject_irq,
+ .set_nmi = vmx_inject_nmi,
+ .queue_exception = vmx_queue_exception,
+ .cancel_injection = vmx_cancel_injection,
+ .interrupt_allowed = vmx_interrupt_allowed,
+ .nmi_allowed = vmx_nmi_allowed,
+ .get_nmi_mask = vmx_get_nmi_mask,
+ .set_nmi_mask = vmx_set_nmi_mask,
+ .enable_nmi_window = vmx_enable_nmi_window,
+ .enable_irq_window = vmx_enable_irq_window,
+ .update_cr8_intercept = vmx_update_cr8_intercept,
+ .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .apicv_post_state_restore = vmx_apicv_post_state_restore,
+ .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
+ .hwapic_irr_update = vmx_hwapic_irr_update,
+ .hwapic_isr_update = vmx_hwapic_isr_update,
+ .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .sync_pir_to_irr = vmx_sync_pir_to_irr,
+ .deliver_interrupt = vmx_deliver_interrupt,
+ .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+ .set_tss_addr = vmx_set_tss_addr,
+ .set_identity_map_addr = vmx_set_identity_map_addr,
+ .get_mt_mask = vmx_get_mt_mask,
+
+ .get_exit_info = vmx_get_exit_info,
+
+ .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+ .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+ .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+ .write_tsc_offset = vmx_write_tsc_offset,
+ .write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+ .load_mmu_pgd = vmx_load_mmu_pgd,
+
+ .check_intercept = vmx_check_intercept,
+ .handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+ .request_immediate_exit = vmx_request_immediate_exit,
+
+ .sched_in = vmx_sched_in,
+
+ .cpu_dirty_log_size = PML_ENTITY_NUM,
+ .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+ .pmu_ops = &intel_pmu_ops,
+ .nested_ops = &vmx_nested_ops,
+
+ .update_pi_irte = pi_update_irte,
+ .start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+ .set_hv_timer = vmx_set_hv_timer,
+ .cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+ .setup_mce = vmx_setup_mce,
+
+ .smi_allowed = vmx_smi_allowed,
+ .enter_smm = vmx_enter_smm,
+ .leave_smm = vmx_leave_smm,
+ .enable_smi_window = vmx_enable_smi_window,
+
+ .can_emulate_instruction = vmx_can_emulate_instruction,
+ .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .migrate_timers = vmx_migrate_timers,
+
+ .msr_filter_changed = vmx_msr_filter_changed,
+ .complete_emulated_msr = kvm_complete_insn_gp,
+
+ .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+ .cpu_has_kvm_support = vmx_cpu_has_kvm_support,
+ .disabled_by_bios = vmx_disabled_by_bios,
+ .check_processor_compatibility = vmx_check_processor_compat,
+ .hardware_setup = vmx_hardware_setup,
+ .handle_intel_pt_intr = NULL,
+
+ .runtime_ops = &vt_x86_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index efda5e4d6247..f6f5d0dac579 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -66,6 +66,7 @@
#include "vmcs12.h"
#include "vmx.h"
#include "x86.h"
+#include "x86_ops.h"
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
@@ -541,7 +542,7 @@ static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu)
return flexpriority_enabled && lapic_in_kernel(vcpu);
}
-static inline bool report_flexpriority(void)
+bool report_flexpriority(void)
{
return flexpriority_enabled;
}
@@ -1316,7 +1317,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
* Switches to specified vcpu, until a matching vcpu_put(), but assumes
* vcpu mutex is already taken.
*/
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1327,7 +1328,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx->host_debugctlmsr = get_debugctlmsr();
}
-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
vmx_vcpu_pi_put(vcpu);
@@ -1381,7 +1382,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
vmx->emulation_required = vmx_emulation_required(vcpu);
}
-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
{
return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
}
@@ -1487,8 +1488,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
return 0;
}
-static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
- void *insn, int insn_len)
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
{
/*
* Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1572,7 +1573,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
* Recognizes a pending MTF VM-exit and records the nested state for later
* delivery.
*/
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1595,7 +1596,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
vmx->nested.mtf_pending = false;
}
-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
vmx_update_emulated_instruction(vcpu);
return skip_emulated_instruction(vcpu);
@@ -1614,7 +1615,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
}
-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+void vmx_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned nr = vcpu->arch.exception.nr;
@@ -1727,12 +1728,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
return kvm_default_tsc_scaling_ratio;
}
-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
{
vmcs_write64(TSC_OFFSET, offset);
}
-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
{
vmcs_write64(TSC_MULTIPLIER, multiplier);
}
@@ -1756,7 +1757,7 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
return !(val & ~valid_bits);
}
-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
{
switch (msr->index) {
case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
@@ -1776,7 +1777,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -1954,7 +1955,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2267,7 +2268,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return ret;
}
-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
unsigned long guest_owned_bits;
@@ -2310,12 +2311,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
}
}
-static __init int cpu_has_kvm_support(void)
+__init int vmx_cpu_has_kvm_support(void)
{
return cpu_has_vmx();
}
-static __init int vmx_disabled_by_bios(void)
+__init int vmx_disabled_by_bios(void)
{
return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
!boot_cpu_has(X86_FEATURE_VMX);
@@ -2341,7 +2342,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
return -EFAULT;
}
-static int hardware_enable(void)
+int vmx_hardware_enable(void)
{
int cpu = raw_smp_processor_id();
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2382,7 +2383,7 @@ static void vmclear_local_loaded_vmcss(void)
__loaded_vmcs_clear(v);
}
-static void hardware_disable(void)
+void vmx_hardware_disable(void)
{
vmclear_local_loaded_vmcss();
@@ -2924,7 +2925,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
#endif
-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -2954,7 +2955,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}
-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 root_hpa = mmu->root_hpa;
@@ -2970,7 +2971,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
/*
* vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -2979,7 +2980,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
}
-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
/*
* vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3134,8 +3135,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
return eptp;
}
-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
- int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
struct kvm *kvm = vcpu->kvm;
bool update_guest_cr3 = true;
@@ -3163,8 +3163,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmcs_writel(GUEST_CR3, guest_cr3);
}
-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
/*
* We operate under the default treatment of SMM, so VMX cannot be
@@ -3280,7 +3279,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->g = (ar >> 15) & 1;
}
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
struct kvm_segment s;
@@ -3360,14 +3359,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}
-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);
to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
}
-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
@@ -3375,25 +3374,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
*l = (ar >> 13) & 1;
}
-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
dt->address = vmcs_readl(GUEST_IDTR_BASE);
}
-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
vmcs_writel(GUEST_IDTR_BASE, dt->address);
}
-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
dt->address = vmcs_readl(GUEST_GDTR_BASE);
}
-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -3889,7 +3888,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
}
}
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
void *vapic_page;
@@ -3909,7 +3908,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
return ((rvi & 0xf0) > (vppr & 0xf0));
}
-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 i;
@@ -4041,8 +4040,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
return 0;
}
-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
- int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
{
struct kvm_vcpu *vcpu = apic->vcpu;
@@ -4185,7 +4184,7 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}
-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4508,7 +4507,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmx->pi_desc.sn = 1;
}
-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4565,12 +4564,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vpid_sync_context(vmx->vpid);
}
-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
{
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
}
-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
{
if (!enable_vnmi ||
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4581,7 +4580,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
}
-static void vmx_inject_irq(struct kvm_vcpu *vcpu)
+void vmx_inject_irq(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
@@ -4609,7 +4608,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
vmx_clear_hlt(vcpu);
}
-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4687,7 +4686,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
GUEST_INTR_STATE_NMI));
}
-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -4709,7 +4708,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}
-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -4724,7 +4723,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !vmx_interrupt_blocked(vcpu);
}
-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
void __user *ret;
@@ -4744,7 +4743,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
return init_rmode_tss(kvm, ret);
}
-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
{
to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
return 0;
@@ -5023,8 +5022,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
return kvm_fast_pio(vcpu, size, port, in);
}
-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
{
/*
* Patch in the VMCALL instruction:
@@ -5234,7 +5232,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
return kvm_complete_insn_gp(vcpu, err);
}
-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
{
get_debugreg(vcpu->arch.db[0], 0);
get_debugreg(vcpu->arch.db[1], 1);
@@ -5253,7 +5251,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
set_debugreg(DR6_RESERVED, 6);
}
-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
vmcs_writel(GUEST_DR7, val);
}
@@ -5519,7 +5517,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 1;
}
-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (vmx_emulation_required_with_pending_exception(vcpu)) {
kvm_prepare_emulation_failure_exit(vcpu);
@@ -5756,9 +5754,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
- u64 *info1, u64 *info2,
- u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6191,7 +6188,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 0;
}
-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);
@@ -6279,7 +6276,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
: "eax", "ebx", "ecx", "edx");
}
-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
int tpr_threshold;
@@ -6349,7 +6346,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
}
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
struct page *page;
@@ -6377,7 +6374,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
put_page(page);
}
-static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
{
u16 status;
u8 old;
@@ -6411,7 +6408,7 @@ static void vmx_set_rvi(int vector)
}
}
-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
{
/*
* When running L2, updating RVI is only relevant when
@@ -6425,7 +6422,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
vmx_set_rvi(max_irr);
}
-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -6471,7 +6468,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
return max_irr;
}
-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
return;
@@ -6482,7 +6479,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6554,7 +6551,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
}
-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6571,7 +6568,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
* The kvm parameter can be NULL (module initialization, or invocation before
* VM creation). Be sure to check the kvm parameter before using it.
*/
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
{
switch (index) {
case MSR_IA32_SMBASE:
@@ -6692,7 +6689,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
IDT_VECTORING_ERROR_CODE);
}
-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
{
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -6788,7 +6785,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
guest_state_exit_irqoff();
}
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr4;
@@ -6969,7 +6966,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_exit_handlers_fastpath(vcpu);
}
-static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6980,7 +6977,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
free_loaded_vmcs(vmx->loaded_vmcs);
}
-static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct vmx_uret_msr *tsx_ctrl;
struct vcpu_vmx *vmx;
@@ -7085,7 +7082,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
{
if (!ple_gap)
kvm->arch.pause_in_guest = true;
@@ -7116,7 +7113,7 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}
-static int __init vmx_check_processor_compat(void)
+int __init vmx_check_processor_compat(void)
{
struct vmcs_config vmcs_conf;
struct vmx_capability vmx_cap;
@@ -7139,7 +7136,7 @@ static int __init vmx_check_processor_compat(void)
return 0;
}
-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
@@ -7328,7 +7325,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7433,7 +7430,7 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->req_immediate_exit = true;
}
@@ -7472,10 +7469,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
}
-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
- struct x86_instruction_info *info,
- enum x86_intercept_stage stage,
- struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
@@ -7540,8 +7537,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
return 0;
}
-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
- bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
{
struct vcpu_vmx *vmx;
u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -7580,13 +7577,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
return 0;
}
-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->hv_deadline_tsc = -1;
}
#endif
-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
if (!kvm_pause_in_guest(vcpu->kvm))
shrink_ple_window(vcpu);
@@ -7612,7 +7609,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
}
-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.mcg_cap & MCG_LMCE_P)
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -7622,7 +7619,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
~FEAT_CTL_LMCE_ENABLED;
}
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
/* we need a nested vmexit to enter SMM, postpone if run is pending */
if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -7630,7 +7627,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !is_smm(vcpu);
}
-static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7644,7 +7641,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
return 0;
}
-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int ret;
@@ -7665,17 +7662,17 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
return 0;
}
-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
{
/* RSM will cause a vmexit anyway. */
}
-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
}
-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
{
if (is_guest_mode(vcpu)) {
struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -7685,7 +7682,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
}
}
-static void hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
{
kvm_set_posted_intr_wakeup_handler(NULL);
@@ -7695,7 +7692,7 @@ static void hardware_unsetup(void)
free_kvm_area();
}
-static bool vmx_check_apicv_inhibit_reasons(ulong bit)
+bool vmx_check_apicv_inhibit_reasons(ulong bit)
{
ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
BIT(APICV_INHIBIT_REASON_ABSENT) |
@@ -7705,143 +7702,6 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
return supported & BIT(bit);
}
-static struct kvm_x86_ops vmx_x86_ops __initdata = {
- .name = "kvm_intel",
-
- .hardware_unsetup = hardware_unsetup,
-
- .hardware_enable = hardware_enable,
- .hardware_disable = hardware_disable,
- .cpu_has_accelerated_tpr = report_flexpriority,
- .has_emulated_msr = vmx_has_emulated_msr,
-
- .vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
-
- .vcpu_create = vmx_create_vcpu,
- .vcpu_free = vmx_free_vcpu,
- .vcpu_reset = vmx_vcpu_reset,
-
- .prepare_guest_switch = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
-
- .update_exception_bitmap = vmx_update_exception_bitmap,
- .get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
-
- .tlb_flush_all = vmx_flush_tlb_all,
- .tlb_flush_current = vmx_flush_tlb_current,
- .tlb_flush_gva = vmx_flush_tlb_gva,
- .tlb_flush_guest = vmx_flush_tlb_guest,
-
- .vcpu_pre_run = vmx_vcpu_pre_run,
- .run = vmx_vcpu_run,
- .handle_exit = vmx_handle_exit,
- .skip_emulated_instruction = vmx_skip_emulated_instruction,
- .update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
- .set_irq = vmx_inject_irq,
- .set_nmi = vmx_inject_nmi,
- .queue_exception = vmx_queue_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_post_state_restore = vmx_apicv_post_state_restore,
- .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
- .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
-
- .get_exit_info = vmx_get_exit_info,
-
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
- .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
-
- .load_mmu_pgd = vmx_load_mmu_pgd,
-
- .check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
-
- .request_immediate_exit = vmx_request_immediate_exit,
-
- .sched_in = vmx_sched_in,
-
- .cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
- .pmu_ops = &intel_pmu_ops,
- .nested_ops = &vmx_nested_ops,
-
- .update_pi_irte = pi_update_irte,
- .start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
- .setup_mce = vmx_setup_mce,
-
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
-
- .can_emulate_instruction = vmx_can_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
- .migrate_timers = vmx_migrate_timers,
-
- .msr_filter_changed = vmx_msr_filter_changed,
- .complete_emulated_msr = kvm_complete_insn_gp,
-
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
static unsigned int vmx_handle_intel_pt_intr(void)
{
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -7882,9 +7742,7 @@ static __init void vmx_setup_user_return_msrs(void)
kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
}
-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
{
unsigned long host_bndcfgs;
struct desc_ptr dt;
@@ -7944,16 +7802,16 @@ static __init int hardware_setup(void)
* using the APIC_ACCESS_ADDR VMCS field.
*/
if (!flexpriority_enabled)
- vmx_x86_ops.set_apic_access_page_addr = NULL;
+ vt_x86_ops.set_apic_access_page_addr = NULL;
if (!cpu_has_vmx_tpr_shadow())
- vmx_x86_ops.update_cr8_intercept = NULL;
+ vt_x86_ops.update_cr8_intercept = NULL;
#if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
&& enable_ept) {
- vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
- vmx_x86_ops.tlb_remote_flush_with_range =
+ vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
+ vt_x86_ops.tlb_remote_flush_with_range =
hv_remote_flush_tlb_with_range;
}
#endif
@@ -7969,7 +7827,7 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_apicv())
enable_apicv = 0;
if (!enable_apicv)
- vmx_x86_ops.sync_pir_to_irr = NULL;
+ vt_x86_ops.sync_pir_to_irr = NULL;
if (cpu_has_vmx_tsc_scaling()) {
kvm_has_tsc_control = true;
@@ -7996,7 +7854,7 @@ static __init int hardware_setup(void)
enable_pml = 0;
if (!enable_pml)
- vmx_x86_ops.cpu_dirty_log_size = 0;
+ vt_x86_ops.cpu_dirty_log_size = 0;
if (!cpu_has_vmx_preemption_timer())
enable_preemption_timer = false;
@@ -8023,9 +7881,9 @@ static __init int hardware_setup(void)
}
if (!enable_preemption_timer) {
- vmx_x86_ops.set_hv_timer = NULL;
- vmx_x86_ops.cancel_hv_timer = NULL;
- vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+ vt_x86_ops.set_hv_timer = NULL;
+ vt_x86_ops.cancel_hv_timer = NULL;
+ vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
}
kvm_mce_cap_supported |= MCG_LMCE_P;
@@ -8035,9 +7893,9 @@ static __init int hardware_setup(void)
if (!enable_ept || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
if (pt_mode == PT_MODE_HOST_GUEST)
- vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+ vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
else
- vmx_init_ops.handle_intel_pt_intr = NULL;
+ vt_init_ops.handle_intel_pt_intr = NULL;
setup_default_sgx_lepubkeyhash();
@@ -8061,16 +7919,6 @@ static __init int hardware_setup(void)
return r;
}
-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
- .cpu_has_kvm_support = cpu_has_kvm_support,
- .disabled_by_bios = vmx_disabled_by_bios,
- .check_processor_compatibility = vmx_check_processor_compat,
- .hardware_setup = hardware_setup,
- .handle_intel_pt_intr = NULL,
-
- .runtime_ops = &vmx_x86_ops,
-};
-
static void vmx_cleanup_l1d_flush(void)
{
if (vmx_l1d_flush_pages) {
@@ -8149,7 +7997,7 @@ static int __init vmx_init(void)
}
if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vmx_x86_ops.enable_direct_tlbflush
+ vt_x86_ops.enable_direct_tlbflush
= hv_enable_direct_tlbflush;
} else {
@@ -8157,8 +8005,8 @@ static int __init vmx_init(void)
}
#endif
- r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
- __alignof__(struct vcpu_vmx), THIS_MODULE);
+ r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
+ __alignof__(struct vcpu_vmx), THIS_MODULE);
if (r)
return r;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..40c64fb1f505
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,126 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/virtext.h>
+
+#include "x86.h"
+
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+__init int vmx_cpu_has_kvm_support(void);
+__init int vmx_disabled_by_bios(void);
+int __init vmx_check_processor_compat(void);
+__init int vmx_hardware_setup(void);
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+void vmx_hardware_unsetup(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+bool report_flexpriority(void);
+int vmx_vm_init(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(ulong bit);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_queue_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1
From: Isaku Yamahata <[email protected]>
Implement a VM-scoped subcommand to get system-wide parameters. Although
these are system-wide parameters, not per-VM ones, the subcommand is
VM-scoped because
- The device model needs the TDX system-wide parameters after creating a KVM VM.
- The subcommand requires the TDX module to be initialized. For lazy
  initialization of the TDX module, a VM-scoped ioctl is better.
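For illustration, a minimal userspace sketch of how a device model might query
these parameters. It assumes the sub-command is routed to tdx_vm_ioctl() via
KVM's KVM_MEMORY_ENCRYPT_OP vm ioctl (that wiring is not shown in this patch)
and that struct kvm_tdx_cmd only needs id and data filled in here:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper, not part of this patch. */
static struct kvm_tdx_capabilities *tdx_get_capabilities(int vm_fd)
{
	struct kvm_tdx_capabilities *caps;
	struct kvm_tdx_cmd cmd;
	__u32 nr = 16;	/* initial guess for the number of cpuid configs */

	for (;;) {
		caps = calloc(1, sizeof(*caps) +
				 nr * sizeof(struct kvm_tdx_cpuid_config));
		if (!caps)
			return NULL;
		caps->nr_cpuid_configs = nr;

		memset(&cmd, 0, sizeof(cmd));
		cmd.id = KVM_TDX_CAPABILITIES;
		cmd.data = (__u64)(uintptr_t)caps;

		/*
		 * On success the kernel overwrites nr_cpuid_configs with the
		 * real count and fills cpuid_configs[].
		 */
		if (!ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
			return caps;

		free(caps);
		if (errno != E2BIG)
			return NULL;
		nr *= 2;	/* buffer was too small, retry with more room */
	}
}

The retry-on--E2BIG loop matches the nr_cpuid_configs size check done by
tdx_capabilities() below.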
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 22 ++++++++++++++
arch/x86/kvm/vmx/tdx.c | 41 +++++++++++++++++++++++++++
tools/arch/x86/include/uapi/asm/kvm.h | 22 ++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2ad61caf4e0b..70f9be4ea575 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -530,6 +530,8 @@ struct kvm_pmu_event_filter {
/* Trust Domain eXtension sub-ioctl() commands. */
enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
KVM_TDX_CMD_NR_MAX,
};
@@ -539,4 +541,24 @@ struct kvm_tdx_cmd {
__u64 data;
};
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ __u32 padding;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8c67444d052a..20b45bb0b032 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -349,6 +349,44 @@ int tdx_vm_init(struct kvm *kvm)
return ret;
}
+static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx_capabilities __user *user_caps;
+ struct kvm_tdx_capabilities caps;
+
+ BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+ sizeof(struct tdx_cpuid_config));
+
+ WARN_ON(cmd->id != KVM_TDX_CAPABILITIES);
+ if (cmd->metadata)
+ return -EINVAL;
+
+ user_caps = (void __user *)cmd->data;
+ if (copy_from_user(&caps, user_caps, sizeof(caps)))
+ return -EFAULT;
+
+ if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
+ return -E2BIG;
+
+ caps = (struct kvm_tdx_capabilities) {
+ .attrs_fixed0 = tdx_caps.attrs_fixed0,
+ .attrs_fixed1 = tdx_caps.attrs_fixed1,
+ .xfam_fixed0 = tdx_caps.xfam_fixed0,
+ .xfam_fixed1 = tdx_caps.xfam_fixed1,
+ .nr_cpuid_configs = tdx_caps.nr_cpuid_configs,
+ .padding = 0,
+ };
+
+ if (copy_to_user(user_caps, &caps, sizeof(caps)))
+ return -EFAULT;
+ if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
+ tdx_caps.nr_cpuid_configs *
+ sizeof(struct tdx_cpuid_config)))
+ return -EFAULT;
+
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -360,6 +398,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
mutex_lock(&kvm->lock);
switch (tdx_cmd.id) {
+ case KVM_TDX_CAPABILITIES:
+ r = tdx_capabilities(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 2ad61caf4e0b..70f9be4ea575 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -530,6 +530,8 @@ struct kvm_pmu_event_filter {
/* Trust Domain eXtension sub-ioctl() commands. */
enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
KVM_TDX_CMD_NR_MAX,
};
@@ -539,4 +541,24 @@ struct kvm_tdx_cmd {
__u64 data;
};
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ __u32 padding;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1
From: Isaku Yamahata <[email protected]>
The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special, semi-arbitrary
constant to mark an SPTE as an intermediate value while one thread is
operating on it. For TDX (more precisely, to use #VE), the value must also
include the "suppress #VE" bit, which is shadow_init_value.
Define SHADOW_REMOVED_SPTE as shadow_init_value | REMOVED_SPTE, and replace
REMOVED_SPTE with SHADOW_REMOVED_SPTE so that the "suppress #VE" bit is
handled properly for TDX.
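As a concrete illustration, assuming shadow_init_value is just the EPT
"suppress #VE" bit (bit 63) when TDX is enabled and 0 otherwise (which is how
this series defines it; treat the exact value as an assumption here):

  REMOVED_SPTE        = 0x00000000000005a0
  shadow_init_value   = 0x8000000000000000
  SHADOW_REMOVED_SPTE = shadow_init_value | REMOVED_SPTE
                      = 0x80000000000005a0

Without TDX, shadow_init_value is 0 and SHADOW_REMOVED_SPTE degenerates to the
original REMOVED_SPTE value, so non-TDX behavior is unchanged.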
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 14 ++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++++++-------
2 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index bde843bce878..e88f796724b4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -194,7 +194,9 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
* If a thread running without exclusive control of the MMU lock must perform a
* multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
* non-present intermediate value. Other threads which encounter this value
- * should not modify the SPTE.
+ * should not modify the SPTE. When TDX is enabled, shadow_init_value, which
+ * has the "suppress #VE" bit set, is also set in the removed SPTE, because
+ * the TDX module always enables "EPT-violation #VE".
*
* Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
* bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
@@ -207,9 +209,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+/*
+ * See the comment above around REMOVED_SPTE. SHADOW_REMOVED_SPTE is the
+ * actual intermediate value set for a removed SPTE. When TDX is enabled, it
+ * also sets the "suppress #VE" bit; otherwise it equals REMOVED_SPTE.
+ */
+extern u64 __read_mostly shadow_init_value;
+#define SHADOW_REMOVED_SPTE (shadow_init_value | REMOVED_SPTE)
+
static inline bool is_removed_spte(u64 spte)
{
- return spte == REMOVED_SPTE;
+ return spte == SHADOW_REMOVED_SPTE;
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ebd0a02620e8..b6ec2f112c26 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -338,7 +338,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
* value to the removed SPTE value.
*/
for (;;) {
- old_child_spte = xchg(sptep, REMOVED_SPTE);
+ old_child_spte = xchg(sptep, SHADOW_REMOVED_SPTE);
if (!is_removed_spte(old_child_spte))
break;
cpu_relax();
@@ -365,10 +365,10 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
* the two branches consistent and simplifies
* the function.
*/
- WRITE_ONCE(*sptep, REMOVED_SPTE);
+ WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_child_spte, REMOVED_SPTE, level,
+ old_child_spte, SHADOW_REMOVED_SPTE, level,
shared);
}
@@ -537,7 +537,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* immediately installing a present entry in its place
* before the TLBs are flushed.
*/
- if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
+ if (!tdp_mmu_set_spte_atomic(kvm, iter, SHADOW_REMOVED_SPTE))
return false;
kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
@@ -550,8 +550,16 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* special removed SPTE value. No bookkeeping is needed
* here since the SPTE is going from non-present
* to non-present.
+ *
+ * Set the non-present value to shadow_init_value, rather
+ * than 0. This is because when TDX is enabled, the TDX
+ * module always enables "EPT-violation #VE", so KVM needs
+ * the "suppress #VE" bit set in EPT entries to get a real
+ * EPT violation rather than a TDVMCALL. Using
+ * shadow_init_value (which has "suppress #VE" set) keeps
+ * the bit set when EPT entries are zapped.
*/
- WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+ WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
return true;
}
@@ -748,7 +756,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
if (!shared) {
- tdp_mmu_set_spte(kvm, &iter, 0);
+ /* see comments in tdp_mmu_zap_spte_atomic() */
+ tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
flush = true;
} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
/*
@@ -1135,7 +1144,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See __handle_changed_spte().
*/
- tdp_mmu_set_spte(kvm, iter, 0);
+ tdp_mmu_set_spte(kvm, iter, shadow_init_value);
if (!pte_write(range->pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1
From: Sean Christopherson <[email protected]>
Explicitly check for an MMIO spte in the fast page fault flow. TDX will
use a not-present entry for MMIO sptes, which can be mistaken for an
access-tracked spte since both have SPTE_SPECIAL_MASK set.
The fast page fault path handles changes to access bits without taking
mmu_lock, for example clearing the write-protect bit for dirty page
tracking. MMIO emulation is handled in a slow path, so this doesn't affect
the default VM case.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b68191aa39bf..9907cb759fd1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3167,7 +3167,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
break;
sp = sptep_to_sp(sptep);
- if (!is_last_spte(spte, sp->role.level))
+ if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
break;
/*
--
2.25.1
From: Isaku Yamahata <[email protected]>
This empty commit is to mark the start of the patch series for TD vcpu
enter/exit.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 3737b966ea07..e6af9ad4e23f 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -12,6 +12,7 @@ What qemu can do
- Qemu can create/destroy guest of TDX vm type.
- Qemu can create/destroy vcpu of TDX vm type.
- Qemu can populate initial guest memory image.
+- Qemu can finalize guest TD.
Patch Layer status
------------------
@@ -21,8 +22,8 @@ Patch Layer status
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
-* TD finalization: Applying
-* TD vcpu enter/exit: Not yet
+* TD finalization: Applied
+* TD vcpu enter/exit: Applying
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA stolen bits: Applied
--
2.25.1
From: Isaku Yamahata <[email protected]>
TDX defines an API to run a TDX vcpu with its own ABI. Define an assembly
helper function that runs a TDX vcpu and hides the special ABI, so that C
code can call it with the normal function call ABI.
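For reference, the register ABI above implies a C-side declaration roughly
like the following. This is only a sketch; the exact prototype and its header
placement are added by a later patch and are assumptions here:

/*
 * Sketch only: tdvpr is the physical address of the TDVPR page, regs points
 * to the guest GPR array indexed by the VCPU_R* offsets, and a non-zero
 * regs_mask requests loading guest GPRs before TDENTER. The return value is
 * the TD-exit reason, with bit 63 (TDENTER_ERROR_BIT) set on SEAMCALL
 * failure, or (u64)-EFAULT when the CPU is rebooting (kvm_rebooting).
 */
u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);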
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/vmenter.S | 146 +++++++++++++++++++++++++++++++++++++
1 file changed, 146 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 435c187927c4..33dc5aa2f0db 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,6 +2,7 @@
#include <linux/linkage.h>
#include <asm/asm.h>
#include <asm/bitsperlong.h>
+#include <asm/errno.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
#include <asm/segment.h>
@@ -28,6 +29,13 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif
+#ifdef CONFIG_INTEL_TDX_HOST
+#define TDENTER 0
+#define EXIT_REASON_TDCALL 77
+#define TDENTER_ERROR_BIT 63
+#include "seamcall.h"
+#endif
+
.section .noinstr.text, "ax"
/**
@@ -328,3 +336,141 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
pop %_ASM_BP
RET
SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+.pushsection .noinstr.text, "ax"
+
+/**
+ * __tdx_vcpu_run - Call SEAMCALL(TDENTER) to run a TD vcpu
+ * @tdvpr: physical address of TDVPR
+ * @regs: void * (to registers of TDVCPU)
+ * @gpr_mask: non-zero if guest registers need to be loaded prior to TDENTER
+ *
+ * Returns:
+ * TD-Exit Reason
+ *
+ * Note: KVM doesn't support using XMM in its hypercalls, it's the HyperV
+ * code's responsibility to save/restore XMM registers on TDVMCALL.
+ */
+SYM_FUNC_START(__tdx_vcpu_run)
+ push %rbp
+ mov %rsp, %rbp
+
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+ push %rbx
+
+ /* Save @regs, which is needed after TDENTER to capture output. */
+ push %rsi
+
+ /* Load @tdvpr to RCX */
+ mov %rdi, %rcx
+
+ /* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
+ test %dx, %dx
+ je 1f
+
+ /* Load @regs to RAX, which will be clobbered with $TDENTER anyways. */
+ mov %rsi, %rax
+
+ mov VCPU_RBX(%rax), %rbx
+ mov VCPU_RDX(%rax), %rdx
+ mov VCPU_RBP(%rax), %rbp
+ mov VCPU_RSI(%rax), %rsi
+ mov VCPU_RDI(%rax), %rdi
+
+ mov VCPU_R8 (%rax), %r8
+ mov VCPU_R9 (%rax), %r9
+ mov VCPU_R10(%rax), %r10
+ mov VCPU_R11(%rax), %r11
+ mov VCPU_R12(%rax), %r12
+ mov VCPU_R13(%rax), %r13
+ mov VCPU_R14(%rax), %r14
+ mov VCPU_R15(%rax), %r15
+
+ /* Load TDENTER to RAX. This kills the @regs pointer! */
+1: mov $TDENTER, %rax
+
+2: seamcall
+
+ /* Skip to the exit path if TDENTER failed. */
+ bt $TDENTER_ERROR_BIT, %rax
+ jc 4f
+
+ /* Temporarily save the TD-Exit reason. */
+ push %rax
+
+ /* check if TD-exit due to TDVMCALL */
+ cmp $EXIT_REASON_TDCALL, %ax
+
+ /* Reload @regs to RAX. */
+ mov 8(%rsp), %rax
+
+ /* Jump on non-TDVMCALL */
+ jne 3f
+
+ /* Save all output from SEAMCALL(TDENTER) */
+ mov %rbx, VCPU_RBX(%rax)
+ mov %rbp, VCPU_RBP(%rax)
+ mov %rsi, VCPU_RSI(%rax)
+ mov %rdi, VCPU_RDI(%rax)
+ mov %r10, VCPU_R10(%rax)
+ mov %r11, VCPU_R11(%rax)
+ mov %r12, VCPU_R12(%rax)
+ mov %r13, VCPU_R13(%rax)
+ mov %r14, VCPU_R14(%rax)
+ mov %r15, VCPU_R15(%rax)
+
+3: mov %rcx, VCPU_RCX(%rax)
+ mov %rdx, VCPU_RDX(%rax)
+ mov %r8, VCPU_R8 (%rax)
+ mov %r9, VCPU_R9 (%rax)
+
+ /*
+ * Clear all general purpose registers except RSP and RAX to prevent
+ * speculative use of the guest's values.
+ */
+ xor %rbx, %rbx
+ xor %rcx, %rcx
+ xor %rdx, %rdx
+ xor %rsi, %rsi
+ xor %rdi, %rdi
+ xor %rbp, %rbp
+ xor %r8, %r8
+ xor %r9, %r9
+ xor %r10, %r10
+ xor %r11, %r11
+ xor %r12, %r12
+ xor %r13, %r13
+ xor %r14, %r14
+ xor %r15, %r15
+
+ /* Restore the TD-Exit reason to RAX for return. */
+ pop %rax
+
+ /* "POP" @regs. */
+4: add $8, %rsp
+ pop %rbx
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ pop %rbp
+ ret
+
+5: cmpb $0, kvm_rebooting
+ je 6f
+ mov $-EFAULT, %rax
+ jmp 4b
+6: ud2
+ _ASM_EXTABLE(2b, 5b)
+
+SYM_FUNC_END(__tdx_vcpu_run)
+
+.popsection
+
+#endif
--
2.25.1
A series of 104 patches is completely unreviewable; please split it into
reasonable chunks.
On 3/7/22 08:44, Christoph Hellwig wrote:
> A series of 104 patches is completely unreviewable; please split it into
> reasonable chunks.
It is split into 5-15 patch chunks, and I'm going to review it mostly
according to the separation. It's just posted together because it
doesn't really accomplish anything until all the chunks are merged together.
From the cover letter:
>> TDX, VMX coexistence:
>> Infrastructure to allow TDX to coexist with VMX and trigger the
>> initialization of the TDX module.
>> This layer starts with
>> "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
>> TDX architectural definitions:
>> Add TDX architectural definitions and helper functions
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
>> TD VM creation/destruction:
>> Guest TD creation/destroy allocation and releasing of TDX specific vm
>> and vcpu structure. Create an initial guest memory image with TDX
>> measurement.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
>> TD vcpu creation/destruction:
>> guest TD creation/destroy Allocation and releasing of TDX specific vm
>> and vcpu structure. Create an initial guest memory image with TDX
>> measurement.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
>> TDX EPT violation:
>> Create an initial guest memory image with TDX measurement. Handle
>> secure EPT violations to populate guest pages with TDX SEAMCALLs.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
>> TD vcpu enter/exit:
>> Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
>> entering into TD. Restore CPU state after exiting from TD.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
>> TD vcpu interrupts/exit/hypercall:
>> Handle various exits/hypercalls and allow interrupts to be injected so
>> that TD vcpu can continue running.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
>>
>> KVM MMU GPA stolen bits:
>> Introduce a framework to handle the stolen/repurposed bit of the GPA. TDX
>> repurposes a bit of the GPA to indicate shared or private. If it's shared,
>> it's the same as the conventional VMX EPT case. VMM can access shared
>> guest pages. If it's private, it's handled by Secure-EPT and the guest
>> page is encrypted.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
>> KVM TDP refactoring for TDX:
>> TDX Secure EPT requires different constants. e.g. initial value EPT
>> entry value etc. Various refactoring for those differences.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
>> KVM TDP MMU hooks:
>> Introduce a framework for the TDP MMU to add hooks in addition to direct EPT
>> access. TDX added Secure EPT, which is an enhancement to VMX EPT. Unlike
>> conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
>> use TDX SEAMCALLs to operate on Secure EPT.
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
>> KVM TDP MMU MapGPA:
>> Introduce framework to handle switching guest pages from private/shared
>> to shared/private. For a given GPA, a guest page can be assigned to a
>> private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
>> guest TD converts GPA assignments from private (or shared) to shared (or
>> private).
>> This layer starts with
>> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
Paolo
On Fri, 2022-03-04 at 11:48 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> As the first step of TDX VM support, report to the device model, e.g. qemu,
> that the TDX VM type is supported. The callback to create a guest TD is the
> vm_init callback for KVM_CREATE_VM. Add a placeholder function and call a
> function to initialize the TDX module on demand, because in that callback
> VMX is enabled by the hardware_enable callback (vmx_hardware_enable).
Should we put this patch at the end of the series, after all the changes
required to run a TD have been introduced? This patch essentially tells
userspace that KVM is ready to support a TD when it actually is not. This
might also cause a bisect issue, I suppose?
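For reference, this is roughly what a device model would hit at this point of
the series. A sketch only; it assumes the VM type is passed as the
KVM_CREATE_VM argument and that KVM_X86_TDX_VM comes from the series' uapi
headers:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd;

	if (kvm_fd < 0)
		return 1;

	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
	/*
	 * The TDX VM type is reported as supported, but vt_vm_init() still
	 * fails creation with -EOPNOTSUPP until the rest of the series
	 * lands, so this prints an error instead of yielding a usable TD.
	 */
	if (vm_fd < 0)
		perror("KVM_CREATE_VM(KVM_X86_TDX_VM)");
	return 0;
}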
--
Thanks,
-Kai
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 24 ++++++++++++++++++++++--
> arch/x86/kvm/vmx/tdx.c | 5 +++++
> arch/x86/kvm/vmx/vmx.c | 5 -----
> arch/x86/kvm/vmx/x86_ops.h | 3 ++-
> 4 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 77da926ee505..8103d1c32cc9 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -5,6 +5,12 @@
> #include "vmx.h"
> #include "nested.h"
> #include "pmu.h"
> +#include "tdx.h"
> +
> +static bool vt_is_vm_type_supported(unsigned long type)
> +{
> + return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
> +}
>
> static __init int vt_hardware_setup(void)
> {
> @@ -19,6 +25,20 @@ static __init int vt_hardware_setup(void)
> return 0;
> }
>
> +static int vt_vm_init(struct kvm *kvm)
> +{
> + int ret;
> +
> + if (is_td(kvm)) {
> + ret = tdx_module_setup();
> + if (ret)
> + return ret;
> + return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
> + }
> +
> + return vmx_vm_init(kvm);
> +}
> +
> struct kvm_x86_ops vt_x86_ops __initdata = {
> .name = "kvm_intel",
>
> @@ -29,9 +49,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .cpu_has_accelerated_tpr = report_flexpriority,
> .has_emulated_msr = vmx_has_emulated_msr,
>
> - .is_vm_type_supported = vmx_is_vm_type_supported,
> + .is_vm_type_supported = vt_is_vm_type_supported,
> .vm_size = sizeof(struct kvm_vmx),
> - .vm_init = vmx_vm_init,
> + .vm_init = vt_vm_init,
>
> .vcpu_create = vmx_vcpu_create,
> .vcpu_free = vmx_vcpu_free,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8adc87ad1807..e8d293a3c11c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -105,6 +105,11 @@ int tdx_module_setup(void)
> return ret;
> }
>
> +bool tdx_is_vm_type_supported(unsigned long type)
> +{
> + return type == KVM_X86_TDX_VM && READ_ONCE(enable_tdx);
> +}
> +
> static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> {
> u32 max_pa;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3c7b3f245fee..7838cd177f0e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7079,11 +7079,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> return err;
> }
>
> -bool vmx_is_vm_type_supported(unsigned long type)
> -{
> - return type == KVM_X86_DEFAULT_VM;
> -}
> -
> #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
> #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index f7327bc73be0..78331dbc29f7 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -25,7 +25,6 @@ void vmx_hardware_unsetup(void);
> int vmx_hardware_enable(void);
> void vmx_hardware_disable(void);
> bool report_flexpriority(void);
> -bool vmx_is_vm_type_supported(unsigned long type);
> int vmx_vm_init(struct kvm *kvm);
> int vmx_vcpu_create(struct kvm_vcpu *vcpu);
> int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> @@ -130,10 +129,12 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
> #ifdef CONFIG_INTEL_TDX_HOST
> void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
> unsigned int *vcpu_align, unsigned int *vm_size);
> +bool tdx_is_vm_type_supported(unsigned long type);
> void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> #else
> static inline void tdx_pre_kvm_init(
> unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
> +static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
> #endif
>
On 3/15/22 22:47, Kai Huang wrote:
>> The intention is that developers can exercise the new code step-by-step even if
>> the TDX KVM isn't complete.
> What is the purpose/value of allowing developers to exercise the new code
> step-by-step? Userspace cannot create a TD successfully anyway until all the
> patches are ready.
We can move this to the end when the patch is committed, but I think
there is value in showing that the series works (for partial definitions
of "work") at every step of the enablement process.
Paolo
On Mon, Mar 14, 2022 at 12:08:59PM +1300,
Kai Huang <[email protected]> wrote:
> On Fri, 2022-03-04 at 11:48 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > As the first step of TDX VM support, report to the device model, e.g. qemu,
> > that the TDX VM type is supported. The callback to create a guest TD is the
> > vm_init callback for KVM_CREATE_VM. Add a placeholder function and call a
> > function to initialize the TDX module on demand, because in that callback
> > VMX is enabled by the hardware_enable callback (vmx_hardware_enable).
>
> Should we put this patch at the end of the series, after all the changes
> required to run a TD have been introduced? This patch essentially tells
> userspace that KVM is ready to support a TD when it actually is not. This
> might also cause a bisect issue, I suppose?
The intention is that developers can exercise the new code step-by-step even if
the TDX KVM isn't complete.
How about introducing a new config option and removing it at the end of the patch series?
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 2b1548da00eb..a3287440aa9e 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -98,6 +98,20 @@ config X86_SGX_KVM
If unsure, say N.
+config X86_TDX_KVM_EXPERIMENTAL
+ bool "EXPERIMENTAL Trust Domain Extensions (TDX) KVM support"
+ default n
+ depends on INTEL_TDX_HOST
+ depends on KVM_INTEL
+ help
+ Enable experimental TDX KVM support. TDX KVM needs many patches and
+ the patches will be merged step by step, not all at once. Even while
+ TDX KVM support is incomplete, enable it so that developers can
+ exercise the TDX KVM code. TODO: Remove this option once the
+ (first step of) TDX KVM support is complete.
+
+ If unsure, say N.
+
config KVM_AMD
tristate "KVM for AMD processors support"
depends on KVM
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b16e2ed3b204..e31d6902e49c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -170,7 +170,11 @@ int tdx_module_setup(void)
bool tdx_is_vm_type_supported(unsigned long type)
{
+#ifdef CONFIG_X86_TDX_KVM_EXPERIMENTAL
return type == KVM_X86_TDX_VM && READ_ONCE(enable_tdx);
+#else
+ return false;
+#endif
}
static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
--
Isaku Yamahata <[email protected]>
On 3/4/22 20:48, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
> to operate on VM. TDX defines its data structure and TDX SEAMCALL APIs for
> VMM to operate on Trust Domain (TD) instead.
>
> Trust Domain Virtual Processor State (TDVPS) is the root control structure
> of a TD VCPU. It helps the TDX module control the operation of the VCPU,
> and holds the VCPU state while the VCPU is not running. TDVPS is opaque to
> software and DMA access, accessible only by using the TDX module interface
> functions (such as TDH.VP.RD, TDH.VP.WR, ...). TDVPS includes TD VMCS, and
> TD VMCS auxiliary structures, such as virtual APIC page, virtualization
> exception information, etc. TDVPS is composed of Trust Domain Virtual
> Processor Root (TDVPR) which is the root page of TDVPS and Trust Domain
> Virtual Processor eXtension (TDVPX) pages which extend TDVPR to help
> provide enough physical space for the logical TDVPS structure.
>
> Also, we have a new structure, Trust Domain Control Structure (TDCS) is the
> main control structure of a guest TD, and encrypted (using the guest TD's
> ephemeral private key). At a high level, TDCS holds information for
> controlling TD operation as a whole, execution, EPTP, MSR bitmaps, etc. KVM
> needs to set it up. Note that MSR bitmaps are held as part of TDCS (unlike
> VMX) because they are meant to have the same value for all VCPUs of the
> same TD. TDCS is a multi-page logical structure composed of multiple Trust
> Domain Control Extension (TDCX) physical pages. Trust Domain Root (TDR) is
> the root control structure of a guest TD and is encrypted using the TDX
> global private key. It holds a minimal set of state variables that enable
> guest TD control even during times when the TD's private key is not known,
> or when the TD's key management state does not permit access to memory
> encrypted using the TD's private key.
>
> The following shows the relationship between those structures.
>
> TDR--> TDCS per-TD
> | \--> TDCX
> \
> \--> TDVPS per-TD VCPU
> \--> TDVPR and TDVPX
>
> The existing global struct kvm_x86_ops already defines an interface which
> fits with TDX. But kvm_x86_ops is system-wide, not per-VM structure. To
> allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
> "if (tdx) tdx_op() else vmx_op()" to switch VMX or TDX at run time.
>
> To split the runtime switch, the VMX implementation, and the TDX
> implementation, add main.c, and move out the vmx_x86_ops hooks in
> preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
> both VMs and TDs. Use 'vt' for the naming scheme as a nod to VT-x and as a
> concatenation of VmxTdx.
>
> The current code looks as follows.
> In vmx.c
> static vmx_op() { ... }
> static struct kvm_x86_ops vmx_x86_ops = {
> .op = vmx_op,
> initialization code
>
> The eventually converted code will look like
> In vmx.c, keep the VMX operations.
> vmx_op() { ... }
> VMX initialization
> In tdx.c, define the TDX operations.
> tdx_op() { ... }
> TDX initialization
> In x86_ops.h, declare the VMX and TDX operations.
> vmx_op();
> tdx_op();
> In main.c, define common wrappers for VMX and TDX.
> static vt_ops() { if (tdx) tdx_ops() else vmx_ops() }
> static struct kvm_x86_ops vt_x86_ops = {
> .op = vt_op,
> initialization to call VMX and TDX initialization
>
> Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
> vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
>
> Co-developed-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/Makefile | 2 +-
> arch/x86/kvm/vmx/main.c | 154 ++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 360 +++++++++++--------------------------
> arch/x86/kvm/vmx/x86_ops.h | 126 +++++++++++++
> 4 files changed, 385 insertions(+), 257 deletions(-)
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/x86_ops.h
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 30f244b64523..ee4d0999f20f 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -22,7 +22,7 @@ kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
> kvm-$(CONFIG_KVM_XEN) += xen.o
>
> kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
> - vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
> + vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
> kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
>
> kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> new file mode 100644
> index 000000000000..b08ea9c42a11
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -0,0 +1,154 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/moduleparam.h>
> +
> +#include "x86_ops.h"
> +#include "vmx.h"
> +#include "nested.h"
> +#include "pmu.h"
> +
> +struct kvm_x86_ops vt_x86_ops __initdata = {
> + .name = "kvm_intel",
> +
> + .hardware_unsetup = vmx_hardware_unsetup,
> +
> + .hardware_enable = vmx_hardware_enable,
> + .hardware_disable = vmx_hardware_disable,
> + .cpu_has_accelerated_tpr = report_flexpriority,
> + .has_emulated_msr = vmx_has_emulated_msr,
> +
> + .vm_size = sizeof(struct kvm_vmx),
> + .vm_init = vmx_vm_init,
> +
> + .vcpu_create = vmx_vcpu_create,
> + .vcpu_free = vmx_vcpu_free,
> + .vcpu_reset = vmx_vcpu_reset,
> +
> + .prepare_guest_switch = vmx_prepare_switch_to_guest,
> + .vcpu_load = vmx_vcpu_load,
> + .vcpu_put = vmx_vcpu_put,
> +
> + .update_exception_bitmap = vmx_update_exception_bitmap,
> + .get_msr_feature = vmx_get_msr_feature,
> + .get_msr = vmx_get_msr,
> + .set_msr = vmx_set_msr,
> + .get_segment_base = vmx_get_segment_base,
> + .get_segment = vmx_get_segment,
> + .set_segment = vmx_set_segment,
> + .get_cpl = vmx_get_cpl,
> + .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
> + .set_cr0 = vmx_set_cr0,
> + .is_valid_cr4 = vmx_is_valid_cr4,
> + .set_cr4 = vmx_set_cr4,
> + .set_efer = vmx_set_efer,
> + .get_idt = vmx_get_idt,
> + .set_idt = vmx_set_idt,
> + .get_gdt = vmx_get_gdt,
> + .set_gdt = vmx_set_gdt,
> + .set_dr7 = vmx_set_dr7,
> + .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
> + .cache_reg = vmx_cache_reg,
> + .get_rflags = vmx_get_rflags,
> + .set_rflags = vmx_set_rflags,
> + .get_if_flag = vmx_get_if_flag,
> +
> + .tlb_flush_all = vmx_flush_tlb_all,
> + .tlb_flush_current = vmx_flush_tlb_current,
> + .tlb_flush_gva = vmx_flush_tlb_gva,
> + .tlb_flush_guest = vmx_flush_tlb_guest,
> +
> + .vcpu_pre_run = vmx_vcpu_pre_run,
> + .run = vmx_vcpu_run,
> + .handle_exit = vmx_handle_exit,
> + .skip_emulated_instruction = vmx_skip_emulated_instruction,
> + .update_emulated_instruction = vmx_update_emulated_instruction,
> + .set_interrupt_shadow = vmx_set_interrupt_shadow,
> + .get_interrupt_shadow = vmx_get_interrupt_shadow,
> + .patch_hypercall = vmx_patch_hypercall,
> + .set_irq = vmx_inject_irq,
> + .set_nmi = vmx_inject_nmi,
> + .queue_exception = vmx_queue_exception,
> + .cancel_injection = vmx_cancel_injection,
> + .interrupt_allowed = vmx_interrupt_allowed,
> + .nmi_allowed = vmx_nmi_allowed,
> + .get_nmi_mask = vmx_get_nmi_mask,
> + .set_nmi_mask = vmx_set_nmi_mask,
> + .enable_nmi_window = vmx_enable_nmi_window,
> + .enable_irq_window = vmx_enable_irq_window,
> + .update_cr8_intercept = vmx_update_cr8_intercept,
> + .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
> + .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
> + .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
> + .load_eoi_exitmap = vmx_load_eoi_exitmap,
> + .apicv_post_state_restore = vmx_apicv_post_state_restore,
> + .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
> + .hwapic_irr_update = vmx_hwapic_irr_update,
> + .hwapic_isr_update = vmx_hwapic_isr_update,
> + .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
> + .sync_pir_to_irr = vmx_sync_pir_to_irr,
> + .deliver_interrupt = vmx_deliver_interrupt,
> + .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
> +
> + .set_tss_addr = vmx_set_tss_addr,
> + .set_identity_map_addr = vmx_set_identity_map_addr,
> + .get_mt_mask = vmx_get_mt_mask,
> +
> + .get_exit_info = vmx_get_exit_info,
> +
> + .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
> +
> + .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> +
> + .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
> + .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
> + .write_tsc_offset = vmx_write_tsc_offset,
> + .write_tsc_multiplier = vmx_write_tsc_multiplier,
> +
> + .load_mmu_pgd = vmx_load_mmu_pgd,
> +
> + .check_intercept = vmx_check_intercept,
> + .handle_exit_irqoff = vmx_handle_exit_irqoff,
> +
> + .request_immediate_exit = vmx_request_immediate_exit,
> +
> + .sched_in = vmx_sched_in,
> +
> + .cpu_dirty_log_size = PML_ENTITY_NUM,
> + .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> +
> + .pmu_ops = &intel_pmu_ops,
> + .nested_ops = &vmx_nested_ops,
> +
> + .update_pi_irte = pi_update_irte,
> + .start_assignment = vmx_pi_start_assignment,
> +
> +#ifdef CONFIG_X86_64
> + .set_hv_timer = vmx_set_hv_timer,
> + .cancel_hv_timer = vmx_cancel_hv_timer,
> +#endif
> +
> + .setup_mce = vmx_setup_mce,
> +
> + .smi_allowed = vmx_smi_allowed,
> + .enter_smm = vmx_enter_smm,
> + .leave_smm = vmx_leave_smm,
> + .enable_smi_window = vmx_enable_smi_window,
> +
> + .can_emulate_instruction = vmx_can_emulate_instruction,
> + .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> + .migrate_timers = vmx_migrate_timers,
> +
> + .msr_filter_changed = vmx_msr_filter_changed,
> + .complete_emulated_msr = kvm_complete_insn_gp,
> +
> + .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> +};
> +
> +struct kvm_x86_init_ops vt_init_ops __initdata = {
> + .cpu_has_kvm_support = vmx_cpu_has_kvm_support,
> + .disabled_by_bios = vmx_disabled_by_bios,
> + .check_processor_compatibility = vmx_check_processor_compat,
> + .hardware_setup = vmx_hardware_setup,
> + .handle_intel_pt_intr = NULL,
> +
> + .runtime_ops = &vt_x86_ops,
> +};
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index efda5e4d6247..f6f5d0dac579 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -66,6 +66,7 @@
> #include "vmcs12.h"
> #include "vmx.h"
> #include "x86.h"
> +#include "x86_ops.h"
>
> MODULE_AUTHOR("Qumranet");
> MODULE_LICENSE("GPL");
> @@ -541,7 +542,7 @@ static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu)
> return flexpriority_enabled && lapic_in_kernel(vcpu);
> }
>
> -static inline bool report_flexpriority(void)
> +bool report_flexpriority(void)
> {
> return flexpriority_enabled;
> }
> @@ -1316,7 +1317,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
> * Switches to specified vcpu, until a matching vcpu_put(), but assumes
> * vcpu mutex is already taken.
> */
> -static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -1327,7 +1328,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> vmx->host_debugctlmsr = get_debugctlmsr();
> }
>
> -static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> {
> vmx_vcpu_pi_put(vcpu);
>
> @@ -1381,7 +1382,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> vmx->emulation_required = vmx_emulation_required(vcpu);
> }
>
> -static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
> +bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
> {
> return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
> }
> @@ -1487,8 +1488,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
> return 0;
> }
>
> -static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> - void *insn, int insn_len)
> +bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> + void *insn, int insn_len)
> {
> /*
> * Emulation of instructions in SGX enclaves is impossible as RIP does
> @@ -1572,7 +1573,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
> * Recognizes a pending MTF VM-exit and records the nested state for later
> * delivery.
> */
> -static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
> +void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
> {
> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -1595,7 +1596,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
> vmx->nested.mtf_pending = false;
> }
>
> -static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> +int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> {
> vmx_update_emulated_instruction(vcpu);
> return skip_emulated_instruction(vcpu);
> @@ -1614,7 +1615,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
> vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
> }
>
> -static void vmx_queue_exception(struct kvm_vcpu *vcpu)
> +void vmx_queue_exception(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> unsigned nr = vcpu->arch.exception.nr;
> @@ -1727,12 +1728,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
> return kvm_default_tsc_scaling_ratio;
> }
>
> -static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> +void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> {
> vmcs_write64(TSC_OFFSET, offset);
> }
>
> -static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
> +void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
> {
> vmcs_write64(TSC_MULTIPLIER, multiplier);
> }
> @@ -1756,7 +1757,7 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
> return !(val & ~valid_bits);
> }
>
> -static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
> +int vmx_get_msr_feature(struct kvm_msr_entry *msr)
> {
> switch (msr->index) {
> case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
> @@ -1776,7 +1777,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
> * Returns 0 on success, non-0 otherwise.
> * Assumes vcpu_load() was already called.
> */
> -static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> struct vmx_uret_msr *msr;
> @@ -1954,7 +1955,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
> * Returns 0 on success, non-0 otherwise.
> * Assumes vcpu_load() was already called.
> */
> -static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> struct vmx_uret_msr *msr;
> @@ -2267,7 +2268,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return ret;
> }
>
> -static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> +void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> {
> unsigned long guest_owned_bits;
>
> @@ -2310,12 +2311,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> }
> }
>
> -static __init int cpu_has_kvm_support(void)
> +__init int vmx_cpu_has_kvm_support(void)
> {
> return cpu_has_vmx();
> }
>
> -static __init int vmx_disabled_by_bios(void)
> +__init int vmx_disabled_by_bios(void)
> {
> return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
> !boot_cpu_has(X86_FEATURE_VMX);
> @@ -2341,7 +2342,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
> return -EFAULT;
> }
>
> -static int hardware_enable(void)
> +int vmx_hardware_enable(void)
> {
> int cpu = raw_smp_processor_id();
> u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
> @@ -2382,7 +2383,7 @@ static void vmclear_local_loaded_vmcss(void)
> __loaded_vmcs_clear(v);
> }
>
> -static void hardware_disable(void)
> +void vmx_hardware_disable(void)
> {
> vmclear_local_loaded_vmcss();
>
> @@ -2924,7 +2925,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
>
> #endif
>
> -static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
> +void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -2954,7 +2955,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
> return to_vmx(vcpu)->vpid;
> }
>
> -static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
> +void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> u64 root_hpa = mmu->root_hpa;
> @@ -2970,7 +2971,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
> vpid_sync_context(vmx_get_current_vpid(vcpu));
> }
>
> -static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> +void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> {
> /*
> * vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
> @@ -2979,7 +2980,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
> }
>
> -static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
> {
> /*
> * vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
> @@ -3134,8 +3135,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
> return eptp;
> }
>
> -static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> - int root_level)
> +void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
> {
> struct kvm *kvm = vcpu->kvm;
> bool update_guest_cr3 = true;
> @@ -3163,8 +3163,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> vmcs_writel(GUEST_CR3, guest_cr3);
> }
>
> -
> -static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> +bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> {
> /*
> * We operate under the default treatment of SMM, so VMX cannot be
> @@ -3280,7 +3279,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> var->g = (ar >> 15) & 1;
> }
>
> -static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> +u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> {
> struct kvm_segment s;
>
> @@ -3360,14 +3359,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
> }
>
> -static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> +void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> {
> __vmx_set_segment(vcpu, var, seg);
>
> to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
> }
>
> -static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
> +void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
> {
> u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
>
> @@ -3375,25 +3374,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
> *l = (ar >> 13) & 1;
> }
>
> -static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
> dt->address = vmcs_readl(GUEST_IDTR_BASE);
> }
>
> -static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
> vmcs_writel(GUEST_IDTR_BASE, dt->address);
> }
>
> -static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
> dt->address = vmcs_readl(GUEST_GDTR_BASE);
> }
>
> -static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> {
> vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
> vmcs_writel(GUEST_GDTR_BASE, dt->address);
> @@ -3889,7 +3888,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
> }
> }
>
> -static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
> +bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> void *vapic_page;
> @@ -3909,7 +3908,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
> return ((rvi & 0xf0) > (vppr & 0xf0));
> }
>
> -static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
> +void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> u32 i;
> @@ -4041,8 +4040,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> return 0;
> }
>
> -static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> - int trig_mode, int vector)
> +void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> + int trig_mode, int vector)
> {
> struct kvm_vcpu *vcpu = apic->vcpu;
>
> @@ -4185,7 +4184,7 @@ static u32 vmx_vmexit_ctrl(void)
> ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
> }
>
> -static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> +void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -4508,7 +4507,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
> vmx->pi_desc.sn = 1;
> }
>
> -static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -4565,12 +4564,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vpid_sync_context(vmx->vpid);
> }
>
> -static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
> +void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
> {
> exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
> }
>
> -static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
> +void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
> {
> if (!enable_vnmi ||
> vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
> @@ -4581,7 +4580,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
> exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
> }
>
> -static void vmx_inject_irq(struct kvm_vcpu *vcpu)
> +void vmx_inject_irq(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> uint32_t intr;
> @@ -4609,7 +4608,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
> vmx_clear_hlt(vcpu);
> }
>
> -static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
> +void vmx_inject_nmi(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -4687,7 +4686,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
> GUEST_INTR_STATE_NMI));
> }
>
> -static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> if (to_vmx(vcpu)->nested.nested_run_pending)
> return -EBUSY;
> @@ -4709,7 +4708,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
> (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> }
>
> -static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> if (to_vmx(vcpu)->nested.nested_run_pending)
> return -EBUSY;
> @@ -4724,7 +4723,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> return !vmx_interrupt_blocked(vcpu);
> }
>
> -static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> +int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> {
> void __user *ret;
>
> @@ -4744,7 +4743,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> return init_rmode_tss(kvm, ret);
> }
>
> -static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> +int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> {
> to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
> return 0;
> @@ -5023,8 +5022,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
> return kvm_fast_pio(vcpu, size, port, in);
> }
>
> -static void
> -vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> +void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> {
> /*
> * Patch in the VMCALL instruction:
> @@ -5234,7 +5232,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
> return kvm_complete_insn_gp(vcpu, err);
> }
>
> -static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
> +void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
> {
> get_debugreg(vcpu->arch.db[0], 0);
> get_debugreg(vcpu->arch.db[1], 1);
> @@ -5253,7 +5251,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
> set_debugreg(DR6_RESERVED, 6);
> }
>
> -static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
> +void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
> {
> vmcs_writel(GUEST_DR7, val);
> }
> @@ -5519,7 +5517,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> -static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
> +int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
> {
> if (vmx_emulation_required_with_pending_exception(vcpu)) {
> kvm_prepare_emulation_failure_exit(vcpu);
> @@ -5756,9 +5754,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> static const int kvm_vmx_max_exit_handlers =
> ARRAY_SIZE(kvm_vmx_exit_handlers);
>
> -static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> - u64 *info1, u64 *info2,
> - u32 *intr_info, u32 *error_code)
> +void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> + u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -6191,7 +6188,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> return 0;
> }
>
> -static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> +int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> {
> int ret = __vmx_handle_exit(vcpu, exit_fastpath);
>
> @@ -6279,7 +6276,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
> : "eax", "ebx", "ecx", "edx");
> }
>
> -static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> +void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> {
> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> int tpr_threshold;
> @@ -6349,7 +6346,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> vmx_update_msr_bitmap_x2apic(vcpu);
> }
>
> -static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> +void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> {
> struct page *page;
>
> @@ -6377,7 +6374,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> put_page(page);
> }
>
> -static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> +void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> {
> u16 status;
> u8 old;
> @@ -6411,7 +6408,7 @@ static void vmx_set_rvi(int vector)
> }
> }
>
> -static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> +void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> {
> /*
> * When running L2, updating RVI is only relevant when
> @@ -6425,7 +6422,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> vmx_set_rvi(max_irr);
> }
>
> -static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> int max_irr;
> @@ -6471,7 +6468,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> return max_irr;
> }
>
> -static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> +void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> {
> if (!kvm_vcpu_apicv_active(vcpu))
> return;
> @@ -6482,7 +6479,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
> }
>
> -static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> +void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -6554,7 +6551,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
> }
>
> -static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> +void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -6571,7 +6568,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> * The kvm parameter can be NULL (module initialization, or invocation before
> * VM creation). Be sure to check the kvm parameter before using it.
> */
> -static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
> +bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
> {
> switch (index) {
> case MSR_IA32_SMBASE:
> @@ -6692,7 +6689,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> IDT_VECTORING_ERROR_CODE);
> }
>
> -static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> +void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> {
> __vmx_complete_interrupts(vcpu,
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> @@ -6788,7 +6785,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> guest_state_exit_irqoff();
> }
>
> -static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> +fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> unsigned long cr4;
> @@ -6969,7 +6966,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> return vmx_exit_handlers_fastpath(vcpu);
> }
>
> -static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_free(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -6980,7 +6977,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
> free_loaded_vmcs(vmx->loaded_vmcs);
> }
>
> -static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
> +int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> {
> struct vmx_uret_msr *tsx_ctrl;
> struct vcpu_vmx *vmx;
> @@ -7085,7 +7082,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
> #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
> #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>
> -static int vmx_vm_init(struct kvm *kvm)
> +int vmx_vm_init(struct kvm *kvm)
> {
> if (!ple_gap)
> kvm->arch.pause_in_guest = true;
> @@ -7116,7 +7113,7 @@ static int vmx_vm_init(struct kvm *kvm)
> return 0;
> }
>
> -static int __init vmx_check_processor_compat(void)
> +int __init vmx_check_processor_compat(void)
> {
> struct vmcs_config vmcs_conf;
> struct vmx_capability vmx_cap;
> @@ -7139,7 +7136,7 @@ static int __init vmx_check_processor_compat(void)
> return 0;
> }
>
> -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
> u8 cache;
>
> @@ -7328,7 +7325,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
> -static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -7433,7 +7430,7 @@ static __init void vmx_set_cpu_caps(void)
> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> }
>
> -static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> +void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> {
> to_vmx(vcpu)->req_immediate_exit = true;
> }
> @@ -7472,10 +7469,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
> return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
> }
>
> -static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> - struct x86_instruction_info *info,
> - enum x86_intercept_stage stage,
> - struct x86_exception *exception)
> +int vmx_check_intercept(struct kvm_vcpu *vcpu,
> + struct x86_instruction_info *info,
> + enum x86_intercept_stage stage,
> + struct x86_exception *exception)
> {
> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>
> @@ -7540,8 +7537,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
> return 0;
> }
>
> -static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> - bool *expired)
> +int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> + bool *expired)
> {
> struct vcpu_vmx *vmx;
> u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
> @@ -7580,13 +7577,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> return 0;
> }
>
> -static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
> +void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
> {
> to_vmx(vcpu)->hv_deadline_tsc = -1;
> }
> #endif
>
> -static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
> {
> if (!kvm_pause_in_guest(vcpu->kvm))
> shrink_ple_window(vcpu);
> @@ -7612,7 +7609,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
> secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
> }
>
> -static void vmx_setup_mce(struct kvm_vcpu *vcpu)
> +void vmx_setup_mce(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.mcg_cap & MCG_LMCE_P)
> to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
> @@ -7622,7 +7619,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
> ~FEAT_CTL_LMCE_ENABLED;
> }
>
> -static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> /* we need a nested vmexit to enter SMM, postpone if run is pending */
> if (to_vmx(vcpu)->nested.nested_run_pending)
> @@ -7630,7 +7627,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> return !is_smm(vcpu);
> }
>
> -static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
> +int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> @@ -7644,7 +7641,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
> return 0;
> }
>
> -static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> +int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> int ret;
> @@ -7665,17 +7662,17 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> return 0;
> }
>
> -static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
> +void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
> {
> /* RSM will cause a vmexit anyway. */
> }
>
> -static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> {
> return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
> }
>
> -static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
> +void vmx_migrate_timers(struct kvm_vcpu *vcpu)
> {
> if (is_guest_mode(vcpu)) {
> struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
> @@ -7685,7 +7682,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
> }
> }
>
> -static void hardware_unsetup(void)
> +void vmx_hardware_unsetup(void)
> {
> kvm_set_posted_intr_wakeup_handler(NULL);
>
> @@ -7695,7 +7692,7 @@ static void hardware_unsetup(void)
> free_kvm_area();
> }
>
> -static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> +bool vmx_check_apicv_inhibit_reasons(ulong bit)
> {
> ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
> BIT(APICV_INHIBIT_REASON_ABSENT) |
> @@ -7705,143 +7702,6 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> return supported & BIT(bit);
> }
>
> -static struct kvm_x86_ops vmx_x86_ops __initdata = {
> - .name = "kvm_intel",
> -
> - .hardware_unsetup = hardware_unsetup,
> -
> - .hardware_enable = hardware_enable,
> - .hardware_disable = hardware_disable,
> - .cpu_has_accelerated_tpr = report_flexpriority,
> - .has_emulated_msr = vmx_has_emulated_msr,
> -
> - .vm_size = sizeof(struct kvm_vmx),
> - .vm_init = vmx_vm_init,
> -
> - .vcpu_create = vmx_create_vcpu,
> - .vcpu_free = vmx_free_vcpu,
> - .vcpu_reset = vmx_vcpu_reset,
> -
> - .prepare_guest_switch = vmx_prepare_switch_to_guest,
> - .vcpu_load = vmx_vcpu_load,
> - .vcpu_put = vmx_vcpu_put,
> -
> - .update_exception_bitmap = vmx_update_exception_bitmap,
> - .get_msr_feature = vmx_get_msr_feature,
> - .get_msr = vmx_get_msr,
> - .set_msr = vmx_set_msr,
> - .get_segment_base = vmx_get_segment_base,
> - .get_segment = vmx_get_segment,
> - .set_segment = vmx_set_segment,
> - .get_cpl = vmx_get_cpl,
> - .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
> - .set_cr0 = vmx_set_cr0,
> - .is_valid_cr4 = vmx_is_valid_cr4,
> - .set_cr4 = vmx_set_cr4,
> - .set_efer = vmx_set_efer,
> - .get_idt = vmx_get_idt,
> - .set_idt = vmx_set_idt,
> - .get_gdt = vmx_get_gdt,
> - .set_gdt = vmx_set_gdt,
> - .set_dr7 = vmx_set_dr7,
> - .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
> - .cache_reg = vmx_cache_reg,
> - .get_rflags = vmx_get_rflags,
> - .set_rflags = vmx_set_rflags,
> - .get_if_flag = vmx_get_if_flag,
> -
> - .tlb_flush_all = vmx_flush_tlb_all,
> - .tlb_flush_current = vmx_flush_tlb_current,
> - .tlb_flush_gva = vmx_flush_tlb_gva,
> - .tlb_flush_guest = vmx_flush_tlb_guest,
> -
> - .vcpu_pre_run = vmx_vcpu_pre_run,
> - .run = vmx_vcpu_run,
> - .handle_exit = vmx_handle_exit,
> - .skip_emulated_instruction = vmx_skip_emulated_instruction,
> - .update_emulated_instruction = vmx_update_emulated_instruction,
> - .set_interrupt_shadow = vmx_set_interrupt_shadow,
> - .get_interrupt_shadow = vmx_get_interrupt_shadow,
> - .patch_hypercall = vmx_patch_hypercall,
> - .set_irq = vmx_inject_irq,
> - .set_nmi = vmx_inject_nmi,
> - .queue_exception = vmx_queue_exception,
> - .cancel_injection = vmx_cancel_injection,
> - .interrupt_allowed = vmx_interrupt_allowed,
> - .nmi_allowed = vmx_nmi_allowed,
> - .get_nmi_mask = vmx_get_nmi_mask,
> - .set_nmi_mask = vmx_set_nmi_mask,
> - .enable_nmi_window = vmx_enable_nmi_window,
> - .enable_irq_window = vmx_enable_irq_window,
> - .update_cr8_intercept = vmx_update_cr8_intercept,
> - .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
> - .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
> - .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
> - .load_eoi_exitmap = vmx_load_eoi_exitmap,
> - .apicv_post_state_restore = vmx_apicv_post_state_restore,
> - .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
> - .hwapic_irr_update = vmx_hwapic_irr_update,
> - .hwapic_isr_update = vmx_hwapic_isr_update,
> - .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
> - .sync_pir_to_irr = vmx_sync_pir_to_irr,
> - .deliver_interrupt = vmx_deliver_interrupt,
> - .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
> -
> - .set_tss_addr = vmx_set_tss_addr,
> - .set_identity_map_addr = vmx_set_identity_map_addr,
> - .get_mt_mask = vmx_get_mt_mask,
> -
> - .get_exit_info = vmx_get_exit_info,
> -
> - .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
> -
> - .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> -
> - .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
> - .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
> - .write_tsc_offset = vmx_write_tsc_offset,
> - .write_tsc_multiplier = vmx_write_tsc_multiplier,
> -
> - .load_mmu_pgd = vmx_load_mmu_pgd,
> -
> - .check_intercept = vmx_check_intercept,
> - .handle_exit_irqoff = vmx_handle_exit_irqoff,
> -
> - .request_immediate_exit = vmx_request_immediate_exit,
> -
> - .sched_in = vmx_sched_in,
> -
> - .cpu_dirty_log_size = PML_ENTITY_NUM,
> - .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> -
> - .pmu_ops = &intel_pmu_ops,
> - .nested_ops = &vmx_nested_ops,
> -
> - .update_pi_irte = pi_update_irte,
> - .start_assignment = vmx_pi_start_assignment,
> -
> -#ifdef CONFIG_X86_64
> - .set_hv_timer = vmx_set_hv_timer,
> - .cancel_hv_timer = vmx_cancel_hv_timer,
> -#endif
> -
> - .setup_mce = vmx_setup_mce,
> -
> - .smi_allowed = vmx_smi_allowed,
> - .enter_smm = vmx_enter_smm,
> - .leave_smm = vmx_leave_smm,
> - .enable_smi_window = vmx_enable_smi_window,
> -
> - .can_emulate_instruction = vmx_can_emulate_instruction,
> - .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> - .migrate_timers = vmx_migrate_timers,
> -
> - .msr_filter_changed = vmx_msr_filter_changed,
> - .complete_emulated_msr = kvm_complete_insn_gp,
> -
> - .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> -};
> -
> static unsigned int vmx_handle_intel_pt_intr(void)
> {
> struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> @@ -7882,9 +7742,7 @@ static __init void vmx_setup_user_return_msrs(void)
> kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
> }
>
> -static struct kvm_x86_init_ops vmx_init_ops __initdata;
> -
> -static __init int hardware_setup(void)
> +__init int vmx_hardware_setup(void)
> {
> unsigned long host_bndcfgs;
> struct desc_ptr dt;
> @@ -7944,16 +7802,16 @@ static __init int hardware_setup(void)
> * using the APIC_ACCESS_ADDR VMCS field.
> */
> if (!flexpriority_enabled)
> - vmx_x86_ops.set_apic_access_page_addr = NULL;
> + vt_x86_ops.set_apic_access_page_addr = NULL;
>
> if (!cpu_has_vmx_tpr_shadow())
> - vmx_x86_ops.update_cr8_intercept = NULL;
> + vt_x86_ops.update_cr8_intercept = NULL;
>
> #if IS_ENABLED(CONFIG_HYPERV)
> if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
> && enable_ept) {
> - vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
> - vmx_x86_ops.tlb_remote_flush_with_range =
> + vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
> + vt_x86_ops.tlb_remote_flush_with_range =
> hv_remote_flush_tlb_with_range;
> }
> #endif
> @@ -7969,7 +7827,7 @@ static __init int hardware_setup(void)
> if (!cpu_has_vmx_apicv())
> enable_apicv = 0;
> if (!enable_apicv)
> - vmx_x86_ops.sync_pir_to_irr = NULL;
> + vt_x86_ops.sync_pir_to_irr = NULL;
>
> if (cpu_has_vmx_tsc_scaling()) {
> kvm_has_tsc_control = true;
> @@ -7996,7 +7854,7 @@ static __init int hardware_setup(void)
> enable_pml = 0;
>
> if (!enable_pml)
> - vmx_x86_ops.cpu_dirty_log_size = 0;
> + vt_x86_ops.cpu_dirty_log_size = 0;
>
> if (!cpu_has_vmx_preemption_timer())
> enable_preemption_timer = false;
> @@ -8023,9 +7881,9 @@ static __init int hardware_setup(void)
> }
>
> if (!enable_preemption_timer) {
> - vmx_x86_ops.set_hv_timer = NULL;
> - vmx_x86_ops.cancel_hv_timer = NULL;
> - vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
> + vt_x86_ops.set_hv_timer = NULL;
> + vt_x86_ops.cancel_hv_timer = NULL;
> + vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
> }
>
> kvm_mce_cap_supported |= MCG_LMCE_P;
> @@ -8035,9 +7893,9 @@ static __init int hardware_setup(void)
> if (!enable_ept || !cpu_has_vmx_intel_pt())
> pt_mode = PT_MODE_SYSTEM;
> if (pt_mode == PT_MODE_HOST_GUEST)
> - vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
> + vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
> else
> - vmx_init_ops.handle_intel_pt_intr = NULL;
> + vt_init_ops.handle_intel_pt_intr = NULL;
>
> setup_default_sgx_lepubkeyhash();
>
> @@ -8061,16 +7919,6 @@ static __init int hardware_setup(void)
> return r;
> }
>
> -static struct kvm_x86_init_ops vmx_init_ops __initdata = {
> - .cpu_has_kvm_support = cpu_has_kvm_support,
> - .disabled_by_bios = vmx_disabled_by_bios,
> - .check_processor_compatibility = vmx_check_processor_compat,
> - .hardware_setup = hardware_setup,
> - .handle_intel_pt_intr = NULL,
> -
> - .runtime_ops = &vmx_x86_ops,
> -};
> -
> static void vmx_cleanup_l1d_flush(void)
> {
> if (vmx_l1d_flush_pages) {
> @@ -8149,7 +7997,7 @@ static int __init vmx_init(void)
> }
>
> if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
> - vmx_x86_ops.enable_direct_tlbflush
> + vt_x86_ops.enable_direct_tlbflush
> = hv_enable_direct_tlbflush;
>
> } else {
> @@ -8157,8 +8005,8 @@ static int __init vmx_init(void)
> }
> #endif
>
> - r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
> - __alignof__(struct vcpu_vmx), THIS_MODULE);
> + r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
> + __alignof__(struct vcpu_vmx), THIS_MODULE);
> if (r)
> return r;
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> new file mode 100644
> index 000000000000..40c64fb1f505
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -0,0 +1,126 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVM_X86_VMX_X86_OPS_H
> +#define __KVM_X86_VMX_X86_OPS_H
> +
> +#include <linux/kvm_host.h>
> +
> +#include <asm/virtext.h>
> +
> +#include "x86.h"
> +
> +extern struct kvm_x86_init_ops vt_init_ops __initdata;
> +
> +__init int vmx_cpu_has_kvm_support(void);
> +__init int vmx_disabled_by_bios(void);
> +int __init vmx_check_processor_compat(void);
> +__init int vmx_hardware_setup(void);
> +
> +extern struct kvm_x86_ops vt_x86_ops __initdata;
> +extern struct kvm_x86_init_ops vt_init_ops __initdata;
> +
> +void vmx_hardware_unsetup(void);
> +int vmx_hardware_enable(void);
> +void vmx_hardware_disable(void);
> +bool report_flexpriority(void);
> +int vmx_vm_init(struct kvm *kvm);
> +int vmx_vcpu_create(struct kvm_vcpu *vcpu);
> +int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> +fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
> +void vmx_vcpu_free(struct kvm_vcpu *vcpu);
> +void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
> +void vmx_vcpu_put(struct kvm_vcpu *vcpu);
> +int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
> +void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
> +int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
> +void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
> +int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
> +int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
> +int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
> +void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
> +bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> + void *insn, int insn_len);
> +int vmx_check_intercept(struct kvm_vcpu *vcpu,
> + struct x86_instruction_info *info,
> + enum x86_intercept_stage stage,
> + struct x86_exception *exception);
> +bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
> +void vmx_migrate_timers(struct kvm_vcpu *vcpu);
> +void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
> +void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
> +bool vmx_check_apicv_inhibit_reasons(ulong bit);
> +void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
> +void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
> +bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
> +int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
> +void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> + int trig_mode, int vector);
> +void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
> +bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
> +void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
> +void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> +void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
> +int vmx_get_msr_feature(struct kvm_msr_entry *msr);
> +int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
> +u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
> +void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
> +void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
> +int vmx_get_cpl(struct kvm_vcpu *vcpu);
> +void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
> +void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
> +void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> +void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> +bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> +int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
> +void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
> +void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
> +void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> +unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
> +void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
> +bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
> +void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
> +void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
> +void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
> +void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
> +void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
> +u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
> +void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
> +void vmx_inject_irq(struct kvm_vcpu *vcpu);
> +void vmx_inject_nmi(struct kvm_vcpu *vcpu);
> +void vmx_queue_exception(struct kvm_vcpu *vcpu);
> +void vmx_cancel_injection(struct kvm_vcpu *vcpu);
> +int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
> +void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
> +void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
> +void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
> +void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
> +void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
> +void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
> +void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> +int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
> +int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
> +u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> +void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> + u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
> +u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
> +u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
> +void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
> +void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
> +void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
> +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
> +void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
> +#ifdef CONFIG_X86_64
> +int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> + bool *expired);
> +void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
> +#endif
> +void vmx_setup_mce(struct kvm_vcpu *vcpu);
> +
> +#endif /* __KVM_X86_VMX_X86_OPS_H */
On Tue, 2022-03-15 at 14:03 -0700, Isaku Yamahata wrote:
> On Mon, Mar 14, 2022 at 12:08:59PM +1300,
> Kai Huang <[email protected]> wrote:
>
> > On Fri, 2022-03-04 at 11:48 -0800, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > As the first step of TDX VM support, return to the device model, e.g. qemu,
> > > that the TDX VM type is supported. The callback to create a guest TD is the
> > > vm_init callback for KVM_CREATE_VM. Add a placeholder function and call a
> > > function to initialize the TDX module on demand, because by the time that
> > > callback runs VMX has been enabled by the hardware_enable callback
> > > (vmx_hardware_enable).
> >
> > Should we put this patch at the end of the series until all changes required to
> > run a TD are introduced? This patch essentially tells userspace that KVM is ready
> > to support a TD while actually it isn't ready yet. And this might also cause a
> > bisect issue, I suppose?
>
> The intention is that developers can exercise the new code step-by-step even if
> the TDX KVM support isn't complete.
What is the purpose/value of allowing developers to exercise the new code step by
step? Userspace cannot create a TD successfully anyway until all patches are
ready.
--
Thanks,
-Kai
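For reference, the placeholder-plus-on-demand-init pattern described in the
changelog might look roughly like the sketch below; the helper names used here
(is_td(), tdx_module_setup()) are assumptions for illustration, not the actual
patch:

	static int vt_vm_init(struct kvm *kvm)
	{
		if (is_td(kvm)) {
			/* Hypothetical on-demand TDX module initialization. */
			int ret = tdx_module_setup();

			if (ret)
				return ret;
			/* Placeholder: actual TD creation lands in later patches. */
			return -EOPNOTSUPP;
		}

		return vmx_vm_init(kvm);
	}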
On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> When shutting down the machine, (VMX or TDX) vcpus need to be shut down on
> each pcpu. Do the same for TDX with the TDX SEAMCALL APIs.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 23 +++++++++++--
> arch/x86/kvm/vmx/tdx.c | 70 ++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/vmx/tdx.h | 2 ++
> arch/x86/kvm/vmx/x86_ops.h | 4 +++
> 4 files changed, 95 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 2cd5ba0e8788..882358ac270b 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -13,6 +13,25 @@ static bool vt_is_vm_type_supported(unsigned long type)
> return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
> }
>
> +static int vt_hardware_enable(void)
> +{
> + int ret;
> +
> + ret = vmx_hardware_enable();
> + if (ret)
> + return ret;
> +
> + tdx_hardware_enable();
> + return 0;
> +}
> +
> +static void vt_hardware_disable(void)
> +{
> + /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
> + tdx_hardware_disable();
> + vmx_hardware_disable();
> +}
> +
> static __init int vt_hardware_setup(void)
> {
> int ret;
> @@ -199,8 +218,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .hardware_unsetup = vt_hardware_unsetup,
>
> - .hardware_enable = vmx_hardware_enable,
> - .hardware_disable = vmx_hardware_disable,
> + .hardware_enable = vt_hardware_enable,
> + .hardware_disable = vt_hardware_disable,
> .cpu_has_accelerated_tpr = report_flexpriority,
> .has_emulated_msr = vmx_has_emulated_msr,
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a6b1a8ce888d..690298fb99c7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
> static DEFINE_MUTEX(tdx_lock);
> static struct mutex *tdx_mng_key_config_lock;
>
> +/*
> + * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
> + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
> + * Protected by interrupt mask. This list is manipulated in process context
> + * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
> + */
> +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> +
> static u64 hkid_mask __ro_after_init;
> static u8 hkid_start_pos __ro_after_init;
>
> @@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
>
> static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> {
> + list_del(&to_tdx(vcpu)->cpu_list);
> +
> /*
> * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
> * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> @@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> vcpu->cpu = -1;
> }
>
> +void tdx_hardware_enable(void)
> +{
> + INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
> +}
> +
> +void tdx_hardware_disable(void)
> +{
> + int cpu = raw_smp_processor_id();
> + struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> + struct vcpu_tdx *tdx, *tmp;
> +
> + /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> + list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
> + tdx_disassociate_vp(&tdx->vcpu);
> +}
> +
> static void tdx_clear_page(unsigned long page)
> {
> const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> cpumask_var_t packages;
> bool cpumask_allocated;
> + struct kvm_vcpu *vcpu;
> u64 err;
> int ret;
> int i;
> + unsigned long j;
>
> if (!is_hkid_assigned(kvm_tdx))
> return;
> @@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
> return;
> }
>
> + kvm_for_each_vcpu(j, vcpu, kvm)
> + tdx_flush_vp_on_cpu(vcpu);
> +
> + mutex_lock(&tdx_lock);
> + err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
Hi Isaku,
I am wondering about the impact of failures in these functions. Is
there any other function which recovers from failures here?
When I look at the tdx_flush_vp function, it seems like it can fail
due to task migration, so tdx_flush_vp_on_cpu might also fail, and if it
fails, tdh_mng_vpflushdone returns an error. Since tdx_vm_teardown does not
return any error, how can the VMM free the keyid used by this TD?
Will it be forever in the "used" state?
Also, if tdx_vm_teardown fails, kvm_tdx->hkid is never set to -1,
which will prevent tdx_vcpu_free from freeing and reclaiming the resources
allocated for the vcpu.
-Erdem
> + mutex_unlock(&tdx_lock);
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
> + return;
> + }
> +
> cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> for_each_online_cpu(i) {
> if (cpumask_allocated &&
> @@ -472,8 +511,22 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>
> void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> - if (vcpu->cpu != cpu)
> - tdx_flush_vp_on_cpu(vcpu);
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (vcpu->cpu == cpu)
> + return;
> +
> + tdx_flush_vp_on_cpu(vcpu);
> +
> + local_irq_disable();
> + /*
> + * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
> + * vcpu->cpu is read before tdx->cpu_list.
> + */
> + smp_rmb();
> +
> + list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
> + local_irq_enable();
> }
>
> void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> @@ -522,6 +575,19 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> tdx_reclaim_td_page(&tdx->tdvpx[i]);
> kfree(tdx->tdvpx);
> tdx_reclaim_td_page(&tdx->tdvpr);
> +
> + /*
> + * kvm_free_vcpus()
> + * -> kvm_unload_vcpu_mmu()
> + *
> + * does vcpu_load() for every vcpu after they have already been
> + * disassociated from the per-cpu list by tdx_vm_teardown(). So we need
> + * to disassociate them again, otherwise the freed vcpu data will be
> + * accessed when list_{del,add}() is done on the associated_tdvcpus
> + * list later.
> + */
> + tdx_flush_vp_on_cpu(vcpu);
> + WARN_ON(vcpu->cpu != -1);
> }
>
> void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 8b1cf9c158e3..180360a65545 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -81,6 +81,8 @@ struct vcpu_tdx {
> struct tdx_td_page tdvpr;
> struct tdx_td_page *tdvpx;
>
> + struct list_head cpu_list;
> +
> union tdx_exit_reason exit_reason;
>
> bool initialized;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ceafd6e18f4e..aae0f4449ec5 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -132,6 +132,8 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
> bool tdx_is_vm_type_supported(unsigned long type);
> void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> void tdx_hardware_unsetup(void);
> +void tdx_hardware_enable(void);
> +void tdx_hardware_disable(void);
>
> int tdx_vm_init(struct kvm *kvm);
> void tdx_mmu_prezap(struct kvm *kvm);
> @@ -156,6 +158,8 @@ static inline void tdx_pre_kvm_init(
> static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
> static inline void tdx_hardware_unsetup(void) {}
> +static inline void tdx_hardware_enable(void) {}
> +static inline void tdx_hardware_disable(void) {}
>
> static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> static inline void tdx_mmu_prezap(struct kvm *kvm) {}
> --
> 2.25.1
>
On Wed, Mar 23, 2022 at 10:55 AM Isaku Yamahata
<[email protected]> wrote:
>
> On Tue, Mar 22, 2022 at 10:28:42AM -0700,
> Erdem Aktas <[email protected]> wrote:
>
> > On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
> > > @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > vcpu->kvm->vm_bugged = true;
> > > }
> > >
> > > +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> > > +
> > > +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > > + struct vcpu_tdx *tdx)
> > > +{
> > > + guest_enter_irqoff();
> > > + tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> > > + guest_exit_irqoff();
> > > +}
> > > +
> > > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +
> > > + if (unlikely(vcpu->kvm->vm_bugged)) {
> > > + tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > > + return EXIT_FASTPATH_NONE;
> > > + }
> > > +
> > > + trace_kvm_entry(vcpu);
> > > +
> > > + tdx_vcpu_enter_exit(vcpu, tdx);
> > > +
> > > + vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > > + trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > > +
> > > + if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> > > + return EXIT_FASTPATH_NONE;
> >
> > Looks like the above if statement has no effect. Just checking if this
> > is intentional.
>
> I'm not sure if I get your point. tdx->exit_reason is updated by the above
> tdx_vcpu_enter_exit(). So it makes sense to check .error or .non_recoverable.
> --
> Isaku Yamahata <[email protected]>
What I mean is, if there is an error, it returns EXIT_FASTPATH_NONE
but if there is no error, it still returns EXIT_FASTPATH_NONE.
The code is as below; the if-statement might be there as a
placeholder to check errors, but it has no impact on what is returned
from this function.
if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
return EXIT_FASTPATH_NONE;
return EXIT_FASTPATH_NONE;
On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> This patch implements running a TDX vcpu. Once a vcpu runs on a logical
> processor (LP), the TDX vcpu is associated with it. When the TDX vcpu
> moves to another LP, it needs to flush its state on the previous LP.
> When destroying a TDX vcpu, the flush needs to be completed and the CPU
> memory cache flushed. Track which LP the TDX vcpu ran on and flush it as
> necessary.
>
> Do nothing on the sched_in event, as TDX doesn't support the pause loop exit.
>
> TDX vcpu execution requires restoring the PMU debug store after returning
> to KVM because the TDX module unconditionally resets the value. To reuse
> the existing code, export perf_restore_debug_store.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 10 +++++++++-
> arch/x86/kvm/vmx/tdx.c | 34 ++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 33 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 2 ++
> arch/x86/kvm/x86.c | 1 +
> 5 files changed, 79 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index f571b07c2aae..2e5a7a72d560 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,14 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> return vmx_vcpu_reset(vcpu, init_event);
> }
>
> +static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_vcpu_run(vcpu);
> +
> + return vmx_vcpu_run(vcpu);
> +}
> +
> static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> {
> if (is_td_vcpu(vcpu))
> @@ -200,7 +208,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .tlb_flush_guest = vt_flush_tlb_guest,
>
> .vcpu_pre_run = vmx_vcpu_pre_run,
> - .run = vmx_vcpu_run,
> + .run = vt_vcpu_run,
> .handle_exit = vmx_handle_exit,
> .skip_emulated_instruction = vmx_skip_emulated_instruction,
> .update_emulated_instruction = vmx_update_emulated_instruction,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 85d5f961d97e..ebe4f9bf19e7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -10,6 +10,9 @@
> #include "vmx.h"
> #include "x86.h"
>
> +#include <trace/events/kvm.h>
> +#include "trace.h"
> +
> #undef pr_fmt
> #define pr_fmt(fmt) "tdx: " fmt
>
> @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vcpu->kvm->vm_bugged = true;
> }
>
> +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> +
> +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> + struct vcpu_tdx *tdx)
> +{
> + guest_enter_irqoff();
> + tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> + guest_exit_irqoff();
> +}
> +
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (unlikely(vcpu->kvm->vm_bugged)) {
> + tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> + return EXIT_FASTPATH_NONE;
> + }
> +
> + trace_kvm_entry(vcpu);
> +
> + tdx_vcpu_enter_exit(vcpu, tdx);
> +
> + vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> + trace_kvm_exit(vcpu, KVM_ISA_VMX);
> +
> + if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> + return EXIT_FASTPATH_NONE;
Looks like the above if statement has no effect. Just checking if this
is intentional.
> + return EXIT_FASTPATH_NONE;
> +}
> +
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> {
> td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index bf9865a88991..e950404ce5de 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -44,12 +44,45 @@ struct kvm_tdx {
> spinlock_t seamcall_lock;
> };
>
> +union tdx_exit_reason {
> + struct {
> + /* 31:0 mirror the VMX Exit Reason format */
> + u64 basic : 16;
> + u64 reserved16 : 1;
> + u64 reserved17 : 1;
> + u64 reserved18 : 1;
> + u64 reserved19 : 1;
> + u64 reserved20 : 1;
> + u64 reserved21 : 1;
> + u64 reserved22 : 1;
> + u64 reserved23 : 1;
> + u64 reserved24 : 1;
> + u64 reserved25 : 1;
> + u64 bus_lock_detected : 1;
> + u64 enclave_mode : 1;
> + u64 smi_pending_mtf : 1;
> + u64 smi_from_vmx_root : 1;
> + u64 reserved30 : 1;
> + u64 failed_vmentry : 1;
> +
> + /* 63:32 are TDX specific */
> + u64 details_l1 : 8;
> + u64 class : 8;
> + u64 reserved61_48 : 14;
> + u64 non_recoverable : 1;
> + u64 error : 1;
> + };
> + u64 full;
> +};
> +
> struct vcpu_tdx {
> struct kvm_vcpu vcpu;
>
> struct tdx_td_page tdvpr;
> struct tdx_td_page *tdvpx;
>
> + union tdx_exit_reason exit_reason;
> +
> bool initialized;
> };
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 922a3799336e..44404dd25737 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -140,6 +140,7 @@ void tdx_vm_free(struct kvm *kvm);
> int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
>
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -160,6 +161,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
> static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
>
> static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index da411bcd8cbc..66400810d54f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -300,6 +300,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
> };
>
> u64 __read_mostly host_xcr0;
> +EXPORT_SYMBOL_GPL(host_xcr0);
> u64 __read_mostly supported_xcr0;
> EXPORT_SYMBOL_GPL(supported_xcr0);
>
> --
> 2.25.1
>
On Tue, Mar 22, 2022 at 10:28:42AM -0700,
Erdem Aktas <[email protected]> wrote:
> On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
> > @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > vcpu->kvm->vm_bugged = true;
> > }
> >
> > +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> > +
> > +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > + struct vcpu_tdx *tdx)
> > +{
> > + guest_enter_irqoff();
> > + tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> > + guest_exit_irqoff();
> > +}
> > +
> > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > +{
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> > + if (unlikely(vcpu->kvm->vm_bugged)) {
> > + tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > + return EXIT_FASTPATH_NONE;
> > + }
> > +
> > + trace_kvm_entry(vcpu);
> > +
> > + tdx_vcpu_enter_exit(vcpu, tdx);
> > +
> > + vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > + trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > +
> > + if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> > + return EXIT_FASTPATH_NONE;
>
> Looks like the above if statement has no effect. Just checking if this
> is intentional.
I'm not sure if I get your point. tdx->exit_reason is updated by the above
tdx_vcpu_enter_exit(). So it makes sense to check .error or .non_recoverable.
--
Isaku Yamahata <[email protected]>
So tdh_vp_flush should always succeed while the VM is being torn down.
Thanks, Isaku, for the explanation, and I think it would be great to add
the error message.
-Erdem
On Wed, Mar 23, 2022 at 12:08 PM Isaku Yamahata
<[email protected]> wrote:
>
> On Tue, Mar 22, 2022 at 05:54:45PM -0700,
> Erdem Aktas <[email protected]> wrote:
>
> > On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index a6b1a8ce888d..690298fb99c7 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
> > > static DEFINE_MUTEX(tdx_lock);
> > > static struct mutex *tdx_mng_key_config_lock;
> > >
> > > +/*
> > > + * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
> > > + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
> > > + * Protected by interrupt mask. This list is manipulated in process context
> > > + * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
> > > + */
> > > +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> > > +
> > > static u64 hkid_mask __ro_after_init;
> > > static u8 hkid_start_pos __ro_after_init;
> > >
> > > @@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> > >
> > > static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > > {
> > > + list_del(&to_tdx(vcpu)->cpu_list);
> > > +
> > > /*
> > > * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
> > > * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> > > @@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > > vcpu->cpu = -1;
> > > }
> > >
> > > +void tdx_hardware_enable(void)
> > > +{
> > > + INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
> > > +}
> > > +
> > > +void tdx_hardware_disable(void)
> > > +{
> > > + int cpu = raw_smp_processor_id();
> > > + struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> > > + struct vcpu_tdx *tdx, *tmp;
> > > +
> > > + /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> > > + list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
> > > + tdx_disassociate_vp(&tdx->vcpu);
> > > +}
> > > +
> > > static void tdx_clear_page(unsigned long page)
> > > {
> > > const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > @@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
> > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > cpumask_var_t packages;
> > > bool cpumask_allocated;
> > > + struct kvm_vcpu *vcpu;
> > > u64 err;
> > > int ret;
> > > int i;
> > > + unsigned long j;
> > >
> > > if (!is_hkid_assigned(kvm_tdx))
> > > return;
> > > @@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
> > > return;
> > > }
> > >
> > > + kvm_for_each_vcpu(j, vcpu, kvm)
> > > + tdx_flush_vp_on_cpu(vcpu);
> > > +
> > > + mutex_lock(&tdx_lock);
> > > + err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
> >
> > Hi Isaku,
>
> Hi.
>
>
> > I am wondering about the impact of failures in these functions. Is
> > there any other function which recovers from failures here?
> > When I look at the tdx_flush_vp function, it seems like it can fail
> > due to task migration, so tdx_flush_vp_on_cpu might also fail, and if it
> > fails, tdh_mng_vpflushdone returns an error. Since tdx_vm_teardown does not
> > return any error, how can the VMM free the keyid used by this TD?
> > Will it be forever in the "used" state?
> > Also, if tdx_vm_teardown fails, kvm_tdx->hkid is never set to -1,
> > which will prevent tdx_vcpu_free from freeing and reclaiming the resources
> > allocated for the vcpu.
>
> mmu_prezap() is called via the release callback of the mmu notifier when the
> last mmu reference of this process is dropped. This is after all kvm vcpu fds
> and the kvm vm fd have been closed. The vcpu will never run, but we still hold
> the kvm_vcpu structures. There is no race between
> tdh_vp_flush()/tdh_mng_vpflushdone() here and process migration.
> tdh_vp_flush()/tdh_mng_vpflushdone() should succeed.
>
> The cpu id check in tdx_flush_vp() is for vcpu_load(), which may race with
> process migration.
>
> Anyway, what if one of those TDX seamcalls fails? The HKID is leaked and will
> never be used again because there is no good way to free and reuse an HKID
> safely. Such a failure is due to an unknown issue and is probably a bug.
>
> One mitigation is to add a pr_err() when an HKID leak happens. I'll add such
> a message in the next respin.
>
> thanks,
> --
> Isaku Yamahata <[email protected]>
On Wed, Mar 23, 2022 at 01:05:27PM -0700,
Erdem Aktas <[email protected]> wrote:
> On Wed, Mar 23, 2022 at 10:55 AM Isaku Yamahata
> <[email protected]> wrote:
> >
> > On Tue, Mar 22, 2022 at 10:28:42AM -0700,
> > Erdem Aktas <[email protected]> wrote:
> >
> > > On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
> > > > @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > vcpu->kvm->vm_bugged = true;
> > > > }
> > > >
> > > > +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> > > > +
> > > > +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > > > + struct vcpu_tdx *tdx)
> > > > +{
> > > > + guest_enter_irqoff();
> > > > + tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> > > > + guest_exit_irqoff();
> > > > +}
> > > > +
> > > > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > > > +{
> > > > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > > +
> > > > + if (unlikely(vcpu->kvm->vm_bugged)) {
> > > > + tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > > > + return EXIT_FASTPATH_NONE;
> > > > + }
> > > > +
> > > > + trace_kvm_entry(vcpu);
> > > > +
> > > > + tdx_vcpu_enter_exit(vcpu, tdx);
> > > > +
> > > > + vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > > > + trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > > > +
> > > > + if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> > > > + return EXIT_FASTPATH_NONE;
> > >
> > > Looks like the above if statement has no effect. Just checking if this
> > > is intentional.
> >
> > I'm not sure if I get your point. tdx->exit_reason is updated by the above
> > tdx_vcpu_enter_exit(). So it makes sense to check .error or .non_recoverable.
> > --
> > Isaku Yamahata <[email protected]>
>
> What I mean is, if there is an error, it returns EXIT_FASTPATH_NONE
> but if there is no error, it still returns EXIT_FASTPATH_NONE.
>
> The code is as below; the if-statement might be there as a
> placeholder to check errors, but it has no impact on what is returned
> from this function.
>
> if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> return EXIT_FASTPATH_NONE;
> return EXIT_FASTPATH_NONE;
Got it. It doesn't make sense. I'll fix it with the next respin.
--
Isaku Yamahata <[email protected]>
On Tue, Mar 22, 2022 at 05:54:45PM -0700,
Erdem Aktas <[email protected]> wrote:
> On Fri, Mar 4, 2022 at 11:50 AM <[email protected]> wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index a6b1a8ce888d..690298fb99c7 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
> > static DEFINE_MUTEX(tdx_lock);
> > static struct mutex *tdx_mng_key_config_lock;
> >
> > +/*
> > + * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
> > + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
> > + * Protected by interrupt mask. This list is manipulated in process context
> > + * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
> > + */
> > +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> > +
> > static u64 hkid_mask __ro_after_init;
> > static u8 hkid_start_pos __ro_after_init;
> >
> > @@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> >
> > static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > {
> > + list_del(&to_tdx(vcpu)->cpu_list);
> > +
> > /*
> > * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
> > * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> > @@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > vcpu->cpu = -1;
> > }
> >
> > +void tdx_hardware_enable(void)
> > +{
> > + INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
> > +}
> > +
> > +void tdx_hardware_disable(void)
> > +{
> > + int cpu = raw_smp_processor_id();
> > + struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> > + struct vcpu_tdx *tdx, *tmp;
> > +
> > + /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> > + list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
> > + tdx_disassociate_vp(&tdx->vcpu);
> > +}
> > +
> > static void tdx_clear_page(unsigned long page)
> > {
> > const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > @@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > cpumask_var_t packages;
> > bool cpumask_allocated;
> > + struct kvm_vcpu *vcpu;
> > u64 err;
> > int ret;
> > int i;
> > + unsigned long j;
> >
> > if (!is_hkid_assigned(kvm_tdx))
> > return;
> > @@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
> > return;
> > }
> >
> > + kvm_for_each_vcpu(j, vcpu, kvm)
> > + tdx_flush_vp_on_cpu(vcpu);
> > +
> > + mutex_lock(&tdx_lock);
> > + err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
>
> Hi Isaku,
Hi.
> I am wondering about the impact of failures in these functions. Is
> there any other function which recovers from failures here?
> When I look at the tdx_flush_vp function, it seems like it can fail
> due to task migration, so tdx_flush_vp_on_cpu might also fail, and if it
> fails, tdh_mng_vpflushdone returns an error. Since tdx_vm_teardown does not
> return any error, how can the VMM free the keyid used by this TD?
> Will it be forever in the "used" state?
> Also, if tdx_vm_teardown fails, kvm_tdx->hkid is never set to -1,
> which will prevent tdx_vcpu_free from freeing and reclaiming the resources
> allocated for the vcpu.
mmu_prezap() is called via the release callback of the mmu notifier when the
last mmu reference of this process is dropped. This is after all kvm vcpu fds
and the kvm vm fd have been closed. The vcpu will never run, but we still hold
the kvm_vcpu structures. There is no race between
tdh_vp_flush()/tdh_mng_vpflushdone() here and process migration.
tdh_vp_flush()/tdh_mng_vpflushdone() should succeed.
The cpu id check in tdx_flush_vp() is for vcpu_load(), which may race with
process migration.
Anyway, what if one of those TDX seamcalls fails? The HKID is leaked and will
never be used again because there is no good way to free and reuse an HKID
safely. Such a failure is due to an unknown issue and is probably a bug.
One mitigation is to add a pr_err() when an HKID leak happens. I'll add such
a message in the next respin.
thanks,
--
Isaku Yamahata <[email protected]>
On 4/5/22 10:48, Paolo Bonzini wrote:
> On 3/4/22 20:49, [email protected] wrote:
>> + if (kvm_init_sipi_unsupported(vcpu->kvm))
>> + /*
>> + * TDX doesn't support INIT. Ignore INIT event. In the
>> + * case of SIPI, the callback of
>> + * vcpu_deliver_sipi_vector ignores it.
>> + */
>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> - else
>> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> + else {
>> + kvm_vcpu_reset(vcpu, true);
>> + if (kvm_vcpu_is_bsp(apic->vcpu))
>> + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> + else
>> + vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> + }
>
> Should you check vcpu->arch.guest_state_protected instead of
> special-casing TDX? KVM_APIC_INIT is not valid for SEV-ES either, if I
> remember correctly.
While the INIT doesn't update any actual state that is in the encrypted
VMSA, SEV-ES still calls kvm_vcpu_reset() to allow KVM to set any internal
tracking state, etc. I haven't ever tested SEV-ES where that is bypassed.
Thanks,
Tom
>
> Paolo
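Purely for illustration, the alternative Paolo suggests would key the check off
the existing guest_state_protected flag, mirroring the hunk quoted above; as Tom
notes, SEV-ES still expects kvm_vcpu_reset() to be called, so the placement of
the reset would need care. This is a sketch of the idea, not a patch:

	if (vcpu->arch.guest_state_protected)
		/* INIT cannot touch protected guest state. */
		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
	else {
		kvm_vcpu_reset(vcpu, true);
		if (kvm_vcpu_is_bsp(apic->vcpu))
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		else
			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
	}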
On 3/4/22 20:49, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add an option to skip the IRR check in kvm_wait_lapic_expire(). This
> will be used by TDX to wait if there is an outstanding notification for
> a TD, i.e. a virtual interrupt is being triggered via posted interrupt
> processing. KVM TDX doesn't emulate PI processing, i.e. there will
> never be a bit set in IRR/ISR, so the default behavior for APICv of
> querying the IRR doesn't work as intended.
It would be better to explain "doesn't work as intended" more verbosely.
Otherwise,
Reviewed-by: Paolo Bonzini <[email protected]>
Paolo
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 4 ++--
> arch/x86/kvm/lapic.h | 2 +-
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 2 +-
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9322e6340a74..d49f029ef0e3 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1620,12 +1620,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> }
>
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
> {
> if (lapic_in_kernel(vcpu) &&
> vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
> vcpu->arch.apic->lapic_timer.timer_advance_ns &&
> - lapic_timer_int_injected(vcpu))
> + (force_wait || lapic_timer_int_injected(vcpu)))
> __kvm_wait_lapic_expire(vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 2b44e533fc8d..2a0119ef9e96 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
>
> bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
>
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);
>
> void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
> unsigned long *vcpu_bitmap);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index c7eec23e9ebe..a46415845f48 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3766,7 +3766,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
> clgi();
> kvm_load_guest_xsave_state(vcpu);
>
> - kvm_wait_lapic_expire(vcpu);
> + kvm_wait_lapic_expire(vcpu, false);
>
> /*
> * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 00f88aa25047..9b7bd52d19a9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6838,7 +6838,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> if (enable_preemption_timer)
> vmx_update_hv_timer(vcpu);
>
> - kvm_wait_lapic_expire(vcpu);
> + kvm_wait_lapic_expire(vcpu, false);
>
> /*
> * If this vCPU has touched SPEC_CTRL, restore the guest's value if
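For contrast, the eventual TDX-side caller would presumably pass
force_wait=true, roughly as below; this caller is not part of this patch:

	/*
	 * A TD never has bits set in IRR/ISR, so lapic_timer_int_injected()
	 * cannot be relied on; force the wait instead.
	 */
	kvm_wait_lapic_expire(vcpu, true);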
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special, semi-arbitrary constant
> to mark the intermediate value indicating that one thread is operating on the
> SPTE. For TDX (more precisely, to use #VE), the value should also include the
> "suppress #VE" bit, which is shadow_init_value.
>
> Define SHADOW_REMOVED_SPTE as shadow_init_value | REMOVED_SPTE, and replace
> REMOVED_SPTE with SHADOW_REMOVED_SPTE to use the "suppress #VE" bit properly
> for TDX.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/spte.h | 14 ++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++++++-------
> 2 files changed, 28 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index bde843bce878..e88f796724b4 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -194,7 +194,9 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> * If a thread running without exclusive control of the MMU lock must perform a
> * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
> * non-present intermediate value. Other threads which encounter this value
> - * should not modify the SPTE.
> + * should not modify the SPTE. When TDX is enabled, shadow_init_value, which
> + * has the "suppress #VE" bit set, is also ORed into the removed SPTE, because
> + * the TDX module always enables "EPT-violation #VE".
> *
> * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> @@ -207,9 +209,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
>
> +/*
> + * See above comment around REMOVED_SPTE. SHADOW_REMOVED_SPTE is the actual
> + * intermediate value set in the removed SPTE. When TDX is enabled, it has
> + * the "suppress #VE" bit set; otherwise it's REMOVED_SPTE.
> + */
> +extern u64 __read_mostly shadow_init_value;
> +#define SHADOW_REMOVED_SPTE (shadow_init_value | REMOVED_SPTE)
Please rename the existing REMOVED_SPTE to REMOVED_SPTE_MASK, and call
this simply REMOVED_SPTE. This also makes the patch smaller.
Paolo
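Concretely, the suggested rename might look like the sketch below (illustrative
only; the 0x5a0ULL value is the existing constant mentioned in the changelog):

	/* The semi-arbitrary marker value, formerly REMOVED_SPTE. */
	#define REMOVED_SPTE_MASK	0x5a0ULL
	/* The value actually installed in a removed SPTE. */
	#define REMOVED_SPTE		(shadow_init_value | REMOVED_SPTE_MASK)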
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ebd0a02620e8..b6ec2f112c26 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -338,7 +338,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> * value to the removed SPTE value.
> */
> for (;;) {
> - old_child_spte = xchg(sptep, REMOVED_SPTE);
> + old_child_spte = xchg(sptep, SHADOW_REMOVED_SPTE);
> if (!is_removed_spte(old_child_spte))
> break;
> cpu_relax();
> @@ -365,10 +365,10 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> * the two branches consistent and simplifies
> * the function.
> */
> - WRITE_ONCE(*sptep, REMOVED_SPTE);
> + WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> - old_child_spte, REMOVED_SPTE, level,
> + old_child_spte, SHADOW_REMOVED_SPTE, level,
> shared);
> }
>
> @@ -537,7 +537,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * immediately installing a present entry in its place
> * before the TLBs are flushed.
> */
> - if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
> + if (!tdp_mmu_set_spte_atomic(kvm, iter, SHADOW_REMOVED_SPTE))
> return false;
>
> kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
> @@ -550,8 +550,16 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * special removed SPTE value. No bookkeeping is needed
> * here since the SPTE is going from non-present
> * to non-present.
> + *
> + * Set non-present value to shadow_init_value, rather than 0.
> + * It is because when TDX is enabled, TDX module always
> + * enables "EPT-violation #VE", so KVM needs to set
> + * "suppress #VE" bit in EPT table entries, in order to get
> + * real EPT violation, rather than TDVMCALL. KVM sets
> + * shadow_init_value (which sets "suppress #VE" bit) so it
> + * can be set when EPT table entries are zapped.
> */
> - WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
> + WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
>
> return true;
> }
> @@ -748,7 +756,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
>
> if (!shared) {
> - tdp_mmu_set_spte(kvm, &iter, 0);
> + /* see comments in tdp_mmu_zap_spte_atomic() */
> + tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
> flush = true;
> } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> /*
> @@ -1135,7 +1144,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> * invariant that the PFN of a present * leaf SPTE can never change.
> * See __handle_changed_spte().
> */
> - tdp_mmu_set_spte(kvm, iter, 0);
> + tdp_mmu_set_spte(kvm, iter, shadow_init_value);
>
> if (!pte_write(range->pte)) {
> new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
On 3/4/22 20:48, [email protected] wrote:
> Implement a VM-scoped subcomment to get system-wide parameters. Although
> this is system-wide parameters not per-VM, this subcomand is VM-scoped
> because
> - Device model needs TDX system-wide parameters after creating KVM VM.
> - This subcommands requires to initialize TDX module. For lazy
> initialization of the TDX module, vm-scope ioctl is better.
Since there was agreement to install the TDX module on load, please
place this ioctl on the /dev/kvm file descriptor.
At least for SEV, there were cases where the system-wide parameters are
needed outside KVM, so it's better to avoid requiring a VM file descriptor.
Thanks,
Paolo
On 3/4/22 20:49, [email protected] wrote:
> + if (kvm_init_sipi_unsupported(vcpu->kvm))
> + /*
> + * TDX doesn't support INIT. Ignore INIT event. In the
> + * case of SIPI, the callback of
> + * vcpu_deliver_sipi_vector ignores it.
> + */
> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> - else
> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> + else {
> + kvm_vcpu_reset(vcpu, true);
> + if (kvm_vcpu_is_bsp(apic->vcpu))
> + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> + else
> + vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> + }
Should you check vcpu->arch.guest_state_protected instead of
special-casing TDX? KVM_APIC_INIT is not valid for SEV-ES either, if I
remember correctly.
Paolo
On 3/4/22 20:48, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Explicitly check for an MMIO spte in the fast page fault flow. TDX will
> use a not-present entry for MMIO sptes, which can be mistaken for an
> access-tracked spte since both have SPTE_SPECIAL_MASK set.
>
> The fast page fault handles the case of changing access bits without
> obtaining mmu_lock. For example, clear write protect bit for dirty page
> tracking. MMIO emulation is handled in a slow path. So it doesn't affect
"MMIO sptes are handled in handle_mmio_page_fault for non-TDX VMs, so
this patch does not affect them. TDX will handle MMIO emulation through
a hypercall instead".
For this comment, it is not necessary to talk about the slow path, since
that is just where MMIO sptes are installed. If the slow path is
reached, fast_page_fault must not have seen is_mmio_spte(spte).
> @@ -3167,7 +3167,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> break;
>
> sp = sptep_to_sp(sptep);
> - if (!is_last_spte(spte, sp->role.level))
> + if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
> break;
>
> /*
I would include the check a couple of lines before:
if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))
This matches what is in the commit message: the problem is that MMIO
SPTEs are present in the TDX case, so you need to check them even if
is_shadow_present_pte(spte) returns true.
Paolo
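Concretely, a sketch of the combined check in fast_page_fault(), reusing the helpers already referenced above:

		/*
		 * Sketch: bail out of the fast path for non-present SPTEs and
		 * for MMIO SPTEs, which are shadow-present under TDX yet must
		 * not be mistaken for access-tracked entries.
		 */
		if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))
			break;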
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Now we are able to inject interrupts into TDX vcpu, it's ready to block TDX
> vcpu. Wire up kvm x86 methods for blocking/unblocking vcpu for TDX. To
> unblock on pending events, request immediate exit methods is also needed.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index a0bcc4dca678..404a260796e4 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -280,6 +280,14 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
> vmx_enable_irq_window(vcpu);
> }
>
> +static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return __kvm_request_immediate_exit(vcpu);
> +
> + vmx_request_immediate_exit(vcpu);
> +}
> +
> static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
> {
> if (!is_td(kvm))
> @@ -402,7 +410,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .check_intercept = vmx_check_intercept,
> .handle_exit_irqoff = vmx_handle_exit_irqoff,
>
> - .request_immediate_exit = vmx_request_immediate_exit,
> + .request_immediate_exit = vt_request_immediate_exit,
>
> .sched_in = vt_sched_in,
>
Reviewed-by: Paolo Bonzini <[email protected]>
On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
> On 3/4/22 20:48, [email protected] wrote:
>> Implement a VM-scoped subcomment to get system-wide parameters. Although
>> this is system-wide parameters not per-VM, this subcomand is VM-scoped
>> because
>> - Device model needs TDX system-wide parameters after creating KVM VM.
>> - This subcommands requires to initialize TDX module. For lazy
>> initialization of the TDX module, vm-scope ioctl is better.
>
> Since there was agreement to install the TDX module on load, please
> place this ioctl on the /dev/kvm file descriptor.
>
> At least for SEV, there were cases where the system-wide parameters are
> needed outside KVM, so it's better to avoid requiring a VM file descriptor.
I don't have a strong preference between a KVM-scope ioctl and a VM-scope one.
Initially we made it KVM-scope and changed it to VM-scope in this
version. Yes, it returns info from the TDX module, which doesn't vary
per VM. However, what if we want to return different capabilities
(software-controlled capabilities) per VM? Part of the TDX capabilities
behaves like get_supported_cpuid, and making it KVM-wide loses the
flexibility to return differentiated capabilities for different TDs.
> Thanks,
>
> Paolo
>
On 4/7/2022 9:07 AM, Kai Huang wrote:
> On Wed, 2022-04-06 at 09:54 +0800, Xiaoyao Li wrote:
>> On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
>>> On 3/4/22 20:48, [email protected] wrote:
>>>> Implement a VM-scoped subcomment to get system-wide parameters. Although
>>>> this is system-wide parameters not per-VM, this subcomand is VM-scoped
>>>> because
>>>> - Device model needs TDX system-wide parameters after creating KVM VM.
>>>> - This subcommands requires to initialize TDX module. For lazy
>>>> initialization of the TDX module, vm-scope ioctl is better.
>>>
>>> Since there was agreement to install the TDX module on load, please
>>> place this ioctl on the /dev/kvm file descriptor.
>>>
>>> At least for SEV, there were cases where the system-wide parameters are
>>> needed outside KVM, so it's better to avoid requiring a VM file descriptor.
>>
>> I don't have strong preference on KVM-scope ioctl or VM-scope.
>>
>> Initially, we made it KVM-scope and change it to VM-scope in this
>> version. Yes, it returns the info from TDX module, which doesn't vary
>> per VM. However, what if we want to return different capabilities
>> (software controlled capabilities) per VM?
>>
>
> In this case, you don't return different capabilities, instead, you return the
> same capabilities but control the capabilities on per-VM basis.
Yes, so I'm not arguing against it or insisting on per-VM.
I just want to voice my concern since it's a user ABI.
>> Part of the TDX capabilities
>> serves like get_supported_cpuid, making it KVM wide lacks the
>> flexibility to return differentiated capabilities for different TDs.
>>
>>
>>> Thanks,
>>>
>>> Paolo
>>>
>>
>
On Fri, 2022-03-04 at 11:49 -0800, [email protected] wrote:
> + /*
> + * In case of TDP MMU, fault handler can run concurrently. Note
> + * 'source_pa' is a TD scope variable, meaning if there are multiple
> + * threads reaching here with all needing to access 'source_pa', it
> + * will break. However fortunately this won't happen, because below
> + * TDH_MEM_PAGE_ADD code path is only used when VM is being created
> + * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which
> + * always uses vcpu 0's page table and protected by vcpu->mutex).
> + */
> + WARN_ON(kvm_tdx->source_pa == INVALID_PAGE);
We can just KVM_BUG_ON() and return here.
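For instance, a minimal sketch, reusing the KVM_BUG_ON() pattern used just below for the SEAMCALL error path (adjust the return to the enclosing function's signature):

	/* Sketch: make a missing source_pa fatal for the VM and bail out. */
	if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm))
		return;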
> + source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
> +
> + err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &out);
> + if (KVM_BUG_ON(err, kvm))
> + pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
> + else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
> + tdx_measure_page(kvm_tdx, gpa);
> +
> + kvm_tdx->source_pa = INVALID_PAGE;
On Tue, Apr 05, 2022, Paolo Bonzini wrote:
> On 3/4/22 20:49, [email protected] wrote:
> > @@ -207,9 +209,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> > static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> > +/*
> > + * See above comment around REMOVED_SPTE. SHADOW_REMOVED_SPTE is the actual
> > + * intermediate value set to the removed SPET. When TDX is enabled, it sets
> > + * the "suppress #VE" bit, otherwise it's REMOVED_SPTE.
> > + */
> > +extern u64 __read_mostly shadow_init_value;
> > +#define SHADOW_REMOVED_SPTE (shadow_init_value | REMOVED_SPTE)
>
> Please rename the existing REMOVED_SPTE to REMOVED_SPTE_MASK, and call this
> simply REMOVED_SPTE. This also makes the patch smaller.
Can we name it either __REMOVE_SPTE or REMOVED_SPTE_VAL? It's most definitely
not a mask, it's a full value, e.g. spte |= REMOVED_SPTE_MASK is completely wrong.
Other than that, 100% agree with avoiding churn.
On Fri, 2022-03-04 at 11:49 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> SPTE_PRIVATE_PROHIBIT specifies the share or private GPA is allowed or not.
> It needs to be kept over zapping the EPT entry. Currently the EPT entry is
> initialized shadow_init_value unconditionally to clear
> SPTE_PRIVATE_PROHIBIT bit. To carry SPTE_PRIVATE_PROHIBIT bit, introduce a
> helper function to get initial value for zapped entry with
> SPTE_PRIVATE_PROHIBIT bit. Replace shadow_init_value with it.
Isn't it better to merge patches 53-55, especially 54-55, together?
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1949f81027a0..6d750563824d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -610,6 +610,12 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> return true;
> }
>
> +static u64 shadow_init_spte(u64 old_spte)
> +{
> + return shadow_init_value |
> + (is_private_prohibit_spte(old_spte) ? SPTE_PRIVATE_PROHIBIT : 0);
> +}
> +
> static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter)
> {
> @@ -641,7 +647,8 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * shadow_init_value (which sets "suppress #VE" bit) so it
> * can be set when EPT table entries are zapped.
> */
> - WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
> + WRITE_ONCE(*rcu_dereference(iter->sptep),
> + shadow_init_spte(iter->old_spte));
>
> return true;
> }
In this and the next patch (54-55), in all the code paths you already have
the iter->sptep, from which you can get sp->private_sp and check it using
is_private_sp(). Why do we need this SPTE_PRIVATE_PROHIBIT bit?
Are you suggesting we can have mixed private/shared mappings under a private_sp?
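For reference, a sketch of the alternative being discussed here, i.e. deriving the private/shared nature from the shadow page rather than from a dedicated SPTE bit (assuming the private_sp/is_private_sp() helpers introduced elsewhere in this series):

	/*
	 * Sketch only: the owning shadow page already knows whether the
	 * mapping is private, so a per-SPTE bit isn't needed to preserve
	 * that information across zapping.
	 */
	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
	bool is_private = is_private_sp(sp);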
> @@ -853,7 +860,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
> if (!shared) {
> /* see comments in tdp_mmu_zap_spte_atomic() */
> - tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
> + tdp_mmu_set_spte(kvm, &iter,
> + shadow_init_spte(iter.old_spte));
> flush = true;
> } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> /*
> @@ -1038,11 +1046,14 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> new_spte = make_mmio_spte(vcpu,
> tdp_iter_gfn_unalias(vcpu->kvm, iter),
> pte_access);
> - else
> + else {
> wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
> tdp_iter_gfn_unalias(vcpu->kvm, iter),
> fault->pfn, iter->old_spte, fault->prefetch,
> true, fault->map_writable, &new_spte);
> + if (is_private_prohibit_spte(iter->old_spte))
> + new_spte |= SPTE_PRIVATE_PROHIBIT;
> + }
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
> @@ -1335,7 +1346,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> * invariant that the PFN of a present * leaf SPTE can never change.
> * See __handle_changed_spte().
> */
> - tdp_mmu_set_spte(kvm, iter, shadow_init_value);
> + tdp_mmu_set_spte(kvm, iter, shadow_init_spte(iter->old_spte));
>
> if (!pte_write(range->pte)) {
> new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
On Wed, 2022-04-06 at 09:54 +0800, Xiaoyao Li wrote:
> On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
> > On 3/4/22 20:48, [email protected] wrote:
> > > Implement a VM-scoped subcomment to get system-wide parameters. Although
> > > this is system-wide parameters not per-VM, this subcomand is VM-scoped
> > > because
> > > - Device model needs TDX system-wide parameters after creating KVM VM.
> > > - This subcommands requires to initialize TDX module. For lazy
> > > initialization of the TDX module, vm-scope ioctl is better.
> >
> > Since there was agreement to install the TDX module on load, please
> > place this ioctl on the /dev/kvm file descriptor.
> >
> > At least for SEV, there were cases where the system-wide parameters are
> > needed outside KVM, so it's better to avoid requiring a VM file descriptor.
>
> I don't have strong preference on KVM-scope ioctl or VM-scope.
>
> Initially, we made it KVM-scope and change it to VM-scope in this
> version. Yes, it returns the info from TDX module, which doesn't vary
> per VM. However, what if we want to return different capabilities
> (software controlled capabilities) per VM?
>
In this case, you don't return different capabilities; instead, you return
the same capabilities but control them on a per-VM basis.
> Part of the TDX capabilities
> serves like get_supported_cpuid, making it KVM wide lacks the
> flexibility to return differentiated capabilities for different TDs.
>
>
> > Thanks,
> >
> > Paolo
> >
>
On Fri, 2022-03-04 at 11:49 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> At this point, TDX supports TDP MMU and doesn't support legacy MMU.
> Forcibly use TDP MMU for TDX irrelevant of kernel parameter to disable
> TDP MMU.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b33ace3d4456..9df6aa4da202 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -16,7 +16,12 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
> /* Initializes the TDP MMU for the VM, if enabled. */
> bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> {
> - if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> + /*
> + * Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
> + * of TDX.
> + */
> + if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
> + (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
> return false;
>
> /* This should not be changed for the lifetime of the VM. */
Please move this patch forward, before introducing any private/shared mapping
support; otherwise nothing prevents creating a TD against the legacy MMU,
which is broken (especially since userspace is already allowed to create a TD
in patch 10 "KVM: TDX: Make TDX VM type supported").
--
Thanks,
-Kai
On Fri, 2022-03-04 at 11:49 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> intermediate value to indicate one thread is operating on it and the value
> should be semi-arbitrary value. For TDX (more correctly to use #VE), the
> value should include suppress #VE value which is shadow_init_value.
>
> Define SHADOW_REMOVED_SPTE as shadow_init_value | REMOVED_SPTE, and replace
> REMOVED_SPTE with SHADOW_REMOVED_SPTE to use suppress #VE bit properly for
> TDX.
Like we discussed, this patch should be merged with patch "KVM: x86/mmu: Allow
non-zero init value for shadow PTE".
--
Thanks,
-Kai
On Fri, 2022-03-04 at 11:49 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implement hooks of TDP MMU for TDX backend. TLB flush, TLB shootdown,
> propagating the change private EPT entry to Secure EPT and freeing Secure
> EPT page.
>
> TLB flush handles both shared EPT and private EPT. It flushes shared EPT
> same as VMX. It also waits for the TDX TLB shootdown.
>
> For the hook to free Secure EPT page, unlinks the Secure EPT page from the
> Secure EPT so that the page can be freed to OS.
>
> Propagating the entry change to Secure EPT. The possible entry changes are
> present -> non-present(zapping) and non-present -> present(population). On
> population just link the Secure EPT page or the private guest page to the
> Secure EPT by TDX SEAMCALL.
>
> Because TDP MMU allows concurrent zapping/population, zapping requires
> synchronous TLB shootdown with the frozen EPT entry. It zaps the secure
> entry, increments TLB counter, sends IPI to remote vcpus to trigger TLB
> flush, and then unlinks the private guest page from the Secure EPT.
>
> For simplicity, batched zapping with exclude lock is handled as concurrent
> zapping. Although it's inefficient, it can be optimized in the future.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 40 +++++-
> arch/x86/kvm/vmx/tdx.c | 246 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 14 +++
> arch/x86/kvm/vmx/tdx_ops.h | 3 +
> arch/x86/kvm/vmx/x86_ops.h | 2 +
> 5 files changed, 301 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 6969e3557bd4..f571b07c2aae 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,38 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> return vmx_vcpu_reset(vcpu, init_event);
> }
>
> +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_flush_tlb(vcpu);
> +
> + vmx_flush_tlb_all(vcpu);
> +}
> +
> +static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_flush_tlb(vcpu);
> +
> + vmx_flush_tlb_current(vcpu);
> +}
> +
> +static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> +{
> + if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> + return;
> +
> + vmx_flush_tlb_gva(vcpu, addr);
> +}
> +
> +static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + vmx_flush_tlb_guest(vcpu);
> +}
> +
> static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int pgd_level)
> {
> @@ -162,10 +194,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .set_rflags = vmx_set_rflags,
> .get_if_flag = vmx_get_if_flag,
>
> - .tlb_flush_all = vmx_flush_tlb_all,
> - .tlb_flush_current = vmx_flush_tlb_current,
> - .tlb_flush_gva = vmx_flush_tlb_gva,
> - .tlb_flush_guest = vmx_flush_tlb_guest,
> + .tlb_flush_all = vt_flush_tlb_all,
> + .tlb_flush_current = vt_flush_tlb_current,
> + .tlb_flush_gva = vt_flush_tlb_gva,
> + .tlb_flush_guest = vt_flush_tlb_guest,
>
> .vcpu_pre_run = vmx_vcpu_pre_run,
> .run = vmx_vcpu_run,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 51098e10b6a0..5d74ae001e4f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,7 +5,9 @@
>
> #include "capabilities.h"
> #include "x86_ops.h"
> +#include "mmu.h"
> #include "tdx.h"
> +#include "vmx.h"
> #include "x86.h"
>
> #undef pr_fmt
> @@ -272,6 +274,15 @@ int tdx_vm_init(struct kvm *kvm)
> int ret, i;
> u64 err;
>
> + /*
> + * To generate EPT violation to inject #VE instead of EPT MISCONFIG,
> + * set RWX=0.
> + */
> + kvm_mmu_set_mmio_spte_mask(kvm, 0, VMX_EPT_RWX_MASK, 0);
I literally spent a couple of minutes looking for this chunk while I was
looking at the patch "KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis".
> +
> + /* TODO: Enable 2mb and 1gb large page support. */
> + kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
Why don't you move this chunk before the MMU code changes, where the code
states many times that large pages are not supported?
> +
> /* vCPUs can't be created until after KVM_TDX_INIT_VM. */
> kvm->max_vcpus = 0;
>
> @@ -331,6 +342,8 @@ int tdx_vm_init(struct kvm *kvm)
> tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
> }
>
> + spin_lock_init(&kvm_tdx->seamcall_lock);
> +
> /*
> * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> * ioctl() to define the configure CPUID values for the TD.
> @@ -501,6 +514,220 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> }
>
> +static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + hpa_t hpa = pfn_to_hpa(pfn);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + struct tdx_module_output out;
> + u64 err;
> +
> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
> + return;
> +
> + /* TODO: handle large pages. */
> + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> + return;
> +
> + /* Pin the page, TDX KVM doesn't yet support page migration. */
> + get_page(pfn_to_page(pfn));
I think some of the MMU code changes rely on the logic that private mappings
are not zapped during the VM's runtime. That logic depends on the page being
pinned, which you are doing here.
> +
> + if (likely(is_td_finalized(kvm_tdx))) {
> + err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
> + if (KVM_BUG_ON(err, kvm))
> + pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
> + return;
> + }
> +}
> +
> +static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + spin_lock(&kvm_tdx->seamcall_lock);
> + __tdx_sept_set_private_spte(kvm, gfn, level, pfn);
> + spin_unlock(&kvm_tdx->seamcall_lock);
> +}
> +
> +static void tdx_sept_drop_private_spte(
> + struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + hpa_t hpa = pfn_to_hpa(pfn);
> + hpa_t hpa_with_hkid;
> + struct tdx_module_output out;
> + u64 err = 0;
> +
> + /* TODO: handle large pages. */
> + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> + return;
> +
> + spin_lock(&kvm_tdx->seamcall_lock);
> + if (is_hkid_assigned(kvm_tdx)) {
> + err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
> + goto unlock;
> + }
> +
> + hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> + err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> + goto unlock;
> + }
> + } else
> + err = tdx_reclaim_page((unsigned long)__va(hpa), hpa);
> +
> +unlock:
> + spin_unlock(&kvm_tdx->seamcall_lock);
> +
> + if (!err)
> + put_page(pfn_to_page(pfn));
> +}
> +
> +static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, void *sept_page)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + hpa_t hpa = __pa(sept_page);
> + struct tdx_module_output out;
> + u64 err;
> +
> + spin_lock(&kvm_tdx->seamcall_lock);
> + err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
> + spin_unlock(&kvm_tdx->seamcall_lock);
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + struct tdx_module_output out;
> + u64 err;
> +
> + spin_lock(&kvm_tdx->seamcall_lock);
> + err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
> + spin_unlock(&kvm_tdx->seamcall_lock);
> + if (KVM_BUG_ON(err, kvm))
> + pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> +}
> +
> +static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *sept_page)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + int ret;
> +
> + /*
> + * free_private_sp() is (obviously) called when a shadow page is being
> + * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
> + */
I seem to recall you said that private memory can be zapped when a memory
slot is moved/deleted?
> + if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> + return -EINVAL;
> +
> + spin_lock(&kvm_tdx->seamcall_lock);
> + ret = tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
> + spin_unlock(&kvm_tdx->seamcall_lock);
> +
> + return ret;
> +}
> +
> +static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx;
> + u64 err;
> +
> + if (!is_td(kvm))
> + return -EOPNOTSUPP;
> +
> + kvm_tdx = to_kvm_tdx(kvm);
> + if (!is_hkid_assigned(kvm_tdx))
> + return 0;
> +
> + /* If TD isn't finalized, it's before any vcpu running. */
> + if (unlikely(!is_td_finalized(kvm_tdx)))
> + return 0;
> +
> + kvm_tdx->tdh_mem_track = true;
> +
> + kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> +
> + err = tdh_mem_track(kvm_tdx->tdr.pa);
> + if (KVM_BUG_ON(err, kvm))
> + pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> +
> + WRITE_ONCE(kvm_tdx->tdh_mem_track, false);
> +
> + return 0;
The whole TLB flush mechanism definitely needs more explanation, in either
the commit message or comments. How can people understand this magic with so
little information?
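To illustrate the kind of write-up being asked for, here is a sketch of a comment, based purely on what this function and tdx_flush_tlb() appear to do (the TDX-module details should be confirmed by the authors):

	/*
	 * Sketch:
	 * 1. Set kvm_tdx->tdh_mem_track so that vCPUs entering the guest
	 *    spin in tdx_flush_tlb() until tracking is done.
	 * 2. KVM_REQ_TLB_FLUSH kicks all vCPUs out of the guest; on their
	 *    way back in they flush the shared EPT and hit the spin above.
	 * 3. TDH.MEM.TRACK advances the TD's TLB tracking epoch, so that
	 *    previously blocked private mappings can be flushed/removed.
	 * 4. Clearing tdh_mem_track releases the spinning vCPUs, whose next
	 *    TD entry happens in the new epoch.
	 */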
> +}
> +
> +static void tdx_handle_changed_private_spte(
> + struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
> + kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page)
> +{
> + WARN_ON(!is_td(kvm));
> + lockdep_assert_held(&kvm->mmu_lock);
> +
> + if (is_present) {
> + /* TDP MMU doesn't change present -> present */
> + WARN_ON(was_present);
> +
> + /*
> + * Use different call to either set up middle level
> + * private page table, or leaf.
> + */
> + if (is_leaf)
> + tdx_sept_set_private_spte(kvm, gfn, level, new_pfn);
> + else {
> + WARN_ON(!sept_page);
> + if (tdx_sept_link_private_sp(kvm, gfn, level, sept_page))
> + /* failed to update Secure-EPT. */
> + WARN_ON(1);
> + }
> + } else if (was_leaf) {
> + /* non-present -> non-present doesn't make sense. */
> + WARN_ON(!was_present);
> +
> + /*
> + * Zap private leaf SPTE. Zapping private table is done
> + * below in handle_removed_tdp_mmu_page().
> + */
> + tdx_sept_zap_private_spte(kvm, gfn, level);
> +
> + /*
> + * TDX requires TLB tracking before dropping private page. Do
> + * it here, although it is also done later.
> + * If hkid isn't assigned, the guest is destroying and no vcpu
> + * runs further. TLB shootdown isn't needed.
> + *
> + * TODO: implement with_range version for optimization.
> + * kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
> + * => tdx_sept_tlb_remote_flush_with_range(kvm, gfn,
> + * KVM_PAGES_PER_HPAGE(level));
> + */
> + if (is_hkid_assigned(to_kvm_tdx(kvm)))
> + kvm_flush_remote_tlbs(kvm);
> +
> + tdx_sept_drop_private_spte(kvm, gfn, level, old_pfn);
> + }
> +}
> +
> static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> {
> struct kvm_tdx_capabilities __user *user_caps;
> @@ -736,6 +963,21 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> return ret;
> }
>
> +void tdx_flush_tlb(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> + struct kvm_mmu *mmu = vcpu->arch.mmu;
> + u64 root_hpa = mmu->root_hpa;
> +
> + /* Flush the shared EPTP, if it's valid. */
> + if (VALID_PAGE(root_hpa))
> + ept_sync_context(construct_eptp(vcpu, root_hpa,
> + mmu->shadow_root_level));
> +
> + while (READ_ONCE(kvm_tdx->tdh_mem_track))
> + cpu_relax();
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -901,6 +1143,10 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> hkid_start_pos = boot_cpu_data.x86_phys_bits;
> hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
>
> + x86_ops->tlb_remote_flush = tdx_sept_tlb_remote_flush;
> + x86_ops->free_private_sp = tdx_sept_free_private_sp;
> + x86_ops->handle_changed_private_spte = tdx_handle_changed_private_spte;
> +
> return 0;
> }
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index b32e068c51b4..906666c7c70b 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -29,9 +29,17 @@ struct kvm_tdx {
> struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
>
> bool finalized;
> + bool tdh_mem_track;
>
> u64 tsc_offset;
> unsigned long tsc_khz;
> +
> + /*
> + * Lock to prevent seamcalls from running concurrently
> + * when TDP MMU is enabled, because TDP fault handler
> + * runs concurrently.
> + */
> + spinlock_t seamcall_lock;
Please also explain why relevant SEAMCALLs cannot run concurrently.
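For example, a sketch of a fuller comment (the precise contention rules should come from the authors/TDX spec):

	/*
	 * Sketch: protect SEAMCALLs that operate on the Secure EPT.  With
	 * the TDP MMU, the fault handler and zapping paths can invoke the
	 * TDX hooks concurrently, and the corresponding TDH.MEM.* calls are
	 * assumed not to tolerate running in parallel on the same TD, hence
	 * the lock around each SEAMCALL.
	 */
	spinlock_t seamcall_lock;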
> };
>
> struct vcpu_tdx {
> @@ -166,6 +174,12 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
> return out.r8;
> }
>
> +static __always_inline int pg_level_to_tdx_sept_level(enum pg_level level)
> +{
> + WARN_ON(level == PG_LEVEL_NONE);
> + return level - 1;
> +}
> +
> #else
> #define enable_tdx false
> static inline int tdx_module_setup(void) { return -ENODEV; };
> diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> index dc76b3a5cf96..cb40edc8c245 100644
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -30,12 +30,14 @@ static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
> static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
> struct tdx_module_output *out)
> {
> + tdx_clflush_page(hpa);
I think those flush changes can be done together when the tdh_mem_xx helpers
are introduced, with a single explanation of why the flush is needed. You
really don't need to do each of them in a separate patch.
> return kvm_seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, out);
> }
>
> static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
> struct tdx_module_output *out)
> {
> + tdx_clflush_page(page);
> return kvm_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, out);
> }
>
> @@ -48,6 +50,7 @@ static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
> static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
> struct tdx_module_output *out)
> {
> + tdx_clflush_page(hpa);
> return kvm_seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, out);
> }
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ad9b1c883761..922a3799336e 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -144,6 +144,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>
> +void tdx_flush_tlb(struct kvm_vcpu *vcpu);
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> #else
> static inline void tdx_pre_kvm_init(
> @@ -163,6 +164,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>
> +static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
> static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
> #endif
>
On 4/7/22 13:09, Xiaoyao Li wrote:
> On 4/5/2022 11:48 PM, Paolo Bonzini wrote:
>> On 3/4/22 20:49, [email protected] wrote:
>>> + if (kvm_init_sipi_unsupported(vcpu->kvm))
>>> + /*
>>> + * TDX doesn't support INIT. Ignore INIT event. In the
>>> + * case of SIPI, the callback of
>>> + * vcpu_deliver_sipi_vector ignores it.
>>> + */
>>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>> - else
>>> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>>> + else {
>>> + kvm_vcpu_reset(vcpu, true);
>>> + if (kvm_vcpu_is_bsp(apic->vcpu))
>>> + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>> + else
>>> + vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>>> + }
>>
>> Should you check vcpu->arch.guest_state_protected instead of
>> special-casing TDX?
>
> We cannot use vcpu->arch.guest_state_protected because TDX supports
> debug TD, of which the states are not protected.
>
> At least we need another flag, I think.
Let's add .deliver_init to the kvm_x86_ops then.
Paolo
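A minimal sketch of what that could look like (the hook name comes from the message above; the exact signature and call site are assumptions):

	/* In struct kvm_x86_ops (sketch): */
	void (*deliver_init)(struct kvm_vcpu *vcpu);

	/*
	 * In the lapic INIT handling (sketch): defer the decision to the
	 * backend so TDX can ignore INIT without being special-cased here.
	 */
	static_call_cond(kvm_x86_deliver_init)(vcpu);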
On 4/5/2022 11:48 PM, Paolo Bonzini wrote:
> On 3/4/22 20:49, [email protected] wrote:
>> + if (kvm_init_sipi_unsupported(vcpu->kvm))
>> + /*
>> + * TDX doesn't support INIT. Ignore INIT event. In the
>> + * case of SIPI, the callback of
>> + * vcpu_deliver_sipi_vector ignores it.
>> + */
>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> - else
>> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> + else {
>> + kvm_vcpu_reset(vcpu, true);
>> + if (kvm_vcpu_is_bsp(apic->vcpu))
>> + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> + else
>> + vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> + }
>
> Should you check vcpu->arch.guest_state_protected instead of
> special-casing TDX?
We cannot use vcpu->arch.guest_state_protected because TDX supports
debug TDs, whose state is not protected.
At least we need another flag, I think.
> KVM_APIC_INIT is not valid for SEV-ES either, if I
> remember correctly.
>
> Paolo
On 4/7/22 01:35, Sean Christopherson wrote:
>> Please rename the existing REMOVED_SPTE to REMOVED_SPTE_MASK, and call this
>> simply REMOVED_SPTE. This also makes the patch smaller.
> Can we name it either __REMOVE_SPTE or REMOVED_SPTE_VAL? It's most definitely
> not a mask, it's a full value, e.g. spte |= REMOVED_SPTE_MASK is completely wrong.
REMOVED_SPTE_VAL is fine.
Paolo
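Putting the agreed naming together, a sketch of the resulting definitions (illustrative only; the raw value is the existing 0x5a0ULL constant):

	/* The raw, semi-arbitrary marker value (sketch). */
	#define REMOVED_SPTE_VAL	0x5a0ULL

	/*
	 * Value actually written to removed SPTEs; keeps "suppress #VE" set
	 * when TDX is enabled (sketch).
	 */
	#define REMOVED_SPTE		(shadow_init_value | REMOVED_SPTE_VAL)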
On Thu, Apr 07, 2022 at 09:17:51AM +0800,
Xiaoyao Li <[email protected]> wrote:
> On 4/7/2022 9:07 AM, Kai Huang wrote:
> > On Wed, 2022-04-06 at 09:54 +0800, Xiaoyao Li wrote:
> > > On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
> > > > On 3/4/22 20:48, [email protected] wrote:
> > > > > Implement a VM-scoped subcomment to get system-wide parameters. Although
> > > > > this is system-wide parameters not per-VM, this subcomand is VM-scoped
> > > > > because
> > > > > - Device model needs TDX system-wide parameters after creating KVM VM.
> > > > > - This subcommands requires to initialize TDX module. For lazy
> > > > > initialization of the TDX module, vm-scope ioctl is better.
> > > >
> > > > Since there was agreement to install the TDX module on load, please
> > > > place this ioctl on the /dev/kvm file descriptor.
> > > >
> > > > At least for SEV, there were cases where the system-wide parameters are
> > > > needed outside KVM, so it's better to avoid requiring a VM file descriptor.
> > >
> > > I don't have strong preference on KVM-scope ioctl or VM-scope.
> > >
> > > Initially, we made it KVM-scope and change it to VM-scope in this
> > > version. Yes, it returns the info from TDX module, which doesn't vary
> > > per VM. However, what if we want to return different capabilities
> > > (software controlled capabilities) per VM?
> > >
> >
> > In this case, you don't return different capabilities, instead, you return the
> > same capabilities but control the capabilities on per-VM basis.
>
> yes, so I'm not arguing it or insisting on per-VM.
>
> I just speak out my concern since it's user ABI.
The reason I made this a VM-scope API was to reduce the number of patches,
given the qemu usage. Now that Paolo has requested it, I'll change it to a
KVM-scope API.
--
Isaku Yamahata <[email protected]>
On Thu, Apr 07, 2022 at 02:12:28PM +0200,
Paolo Bonzini <[email protected]> wrote:
> On 4/7/22 13:09, Xiaoyao Li wrote:
> > On 4/5/2022 11:48 PM, Paolo Bonzini wrote:
> > > On 3/4/22 20:49, [email protected] wrote:
> > > > + if (kvm_init_sipi_unsupported(vcpu->kvm))
> > > > + /*
> > > > + * TDX doesn't support INIT. Ignore INIT event. In the
> > > > + * case of SIPI, the callback of
> > > > + * vcpu_deliver_sipi_vector ignores it.
> > > > + */
> > > > vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > > > - else
> > > > - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> > > > + else {
> > > > + kvm_vcpu_reset(vcpu, true);
> > > > + if (kvm_vcpu_is_bsp(apic->vcpu))
> > > > + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > > > + else
> > > > + vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> > > > + }
> > >
> > > Should you check vcpu->arch.guest_state_protected instead of
> > > special-casing TDX?
> >
> > We cannot use vcpu->arch.guest_state_protected because TDX supports
> > debug TD, of which the states are not protected.
> >
> > At least we need another flag, I think.
>
> Let's add .deliver_init to the kvm_x86_ops then.
Will do.
--
Isaku Yamahata <[email protected]>
On Fri, Mar 04, 2022, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
> interrupt to support TDX's usage of APICv. Unlike VMX, TDX doesn't have
> access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
Based on the discussion in the HLT patch, this is no longer true.
> i.e. needs to generate a posted interrupt and more importantly can't
> manually move requested interrupts into the vIRR (which it also doesn't
> have access to).
>
> Because pi_has_pending_interrupt() is heavy operation which uses two atomic
> test bit operations and one atomic 256 bit bitmap check, introduce new
> callback for this check instead of reusing dy_apicv_has_pending_interrupt()
> callback to avoid affecting the exiting code.
...
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 89d04cd64cd0..314ae43e07bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12111,7 +12111,10 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>
> if (kvm_arch_interrupt_allowed(vcpu) &&
> (kvm_cpu_has_interrupt(vcpu) ||
> - kvm_guest_apic_has_interrupt(vcpu)))
> + kvm_guest_apic_has_interrupt(vcpu) ||
> + (vcpu->arch.apicv_active &&
> + kvm_x86_ops.apicv_has_pending_interrupt &&
> + kvm_x86_ops.apicv_has_pending_interrupt(vcpu))))
This is pretty gross (fully realizing that I wrote this patch). It's also arguably
wrong as it really should be called from apic_has_interrupt_for_ppr().
1. The hook implies it is valid for APICv in general, which is misleading.
2. It's wasted effort for VMX.
3. It does a poor job of conveying _why_ TDX is different.
4. KVM is unnecessarily processing its useless "copy" of the PPR/IRR for TDX
vCPUs. It's functionally not an issue unless userspace stuffs garbage into
KVM's vAPIC, but it's unnecessary work.
Rather than hook this path, I would rather we tag kvm_apic as having some of its
state protected. Then kvm_cpu_has_interrupt() can invoke the alternative,
protected-apic-only hook when appropriate, and kvm_apic_has_interrupt() can bail
immediately instead of doing useless processing of stale vAPIC state.
Note, the below moves the !apic check from tdx_vcpu_reset() to tdx_vcpu_create().
That part should be hoisted earlier in the series; there's no reason to wait until
RESET to perform the check, and I suspect the WARN_ON() can be triggered by userspace.
Compile tested only...
From: Sean Christopherson <[email protected]>
Date: Fri, 8 Apr 2022 08:56:27 -0700
Subject: [PATCH] KVM: TDX: Add support for find pending IRQ in a protected
local APIC
Add a flag and a hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.
Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is
pending, as KVM can't do anything based on _which_ IRQ is pending.
Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.
Note, this doesn't handle interrupts that have been delivered to the vCPU
but not yet recognized by the core, i.e. interrupts that are sitting in
vmcs.GUEST_INTR_STATUS. Querying that state requires a SEAMCALL and will
be supported in a future patch.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/irq.c | 3 +++
arch/x86/kvm/lapic.c | 3 +++
arch/x86/kvm/lapic.h | 2 ++
arch/x86/kvm/vmx/main.c | 11 +++++++++++
arch/x86/kvm/vmx/tdx.c | 9 ++++++---
7 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 7e27b73d839f..ce705d0c6241 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -110,6 +110,7 @@ KVM_X86_OP_NULL(update_pi_irte)
KVM_X86_OP_NULL(start_assignment)
KVM_X86_OP_NULL(apicv_post_state_restore)
KVM_X86_OP_NULL(dy_apicv_has_pending_interrupt)
+KVM_X86_OP_NULL(protected_apic_has_interrupt)
KVM_X86_OP_NULL(set_hv_timer)
KVM_X86_OP_NULL(cancel_hv_timer)
KVM_X86_OP(setup_mce)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 489374a57b66..b3dcc0814461 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1491,6 +1491,7 @@ struct kvm_x86_ops {
void (*start_assignment)(struct kvm *kvm);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+ bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);
int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 172b05343cfd..24f180c538b0 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -96,6 +96,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
if (kvm_cpu_has_extint(v))
return 1;
+ if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+ return static_call(kvm_x86_protected_apic_has_interrupt)(v);
+
return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9322e6340a74..50a483abc0fe 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2503,6 +2503,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
if (!kvm_apic_present(vcpu))
return -1;
+ if (apic->guest_apic_protected)
+ return -1;
+
__apic_update_ppr(apic, &ppr);
return apic_has_interrupt_for_ppr(apic, ppr);
}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2b44e533fc8d..7b62f1889a98 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -52,6 +52,8 @@ struct kvm_lapic {
bool sw_enabled;
bool irr_pending;
bool lvt0_in_nmi_mode;
+ /* Select registers in the vAPIC cannot be read/written. */
+ bool guest_apic_protected;
/* Number of bits set in ISR. */
s16 isr_count;
/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 882358ac270b..31aab8add010 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -42,6 +42,9 @@ static __init int vt_hardware_setup(void)
tdx_hardware_setup(&vt_x86_ops);
+ if (!enable_tdx)
+ vt_x86_ops.protected_apic_has_interrupt = NULL;
+
if (enable_ept) {
const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
kvm_mmu_set_ept_masks(enable_ept_ad_bits,
@@ -148,6 +151,13 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
return vmx_vcpu_load(vcpu, cpu);
}
+static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ KVM_BUG_ON(!is_td_vcpu(vcpu), vcpu->kvm);
+
+ return pi_has_pending_interrupt(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -297,6 +307,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sync_pir_to_irr = vmx_sync_pir_to_irr,
.deliver_interrupt = vmx_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+ .protected_apic_has_interrupt = vt_protected_apic_has_interrupt,
.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3a0e826fbe0c..7b9370384ce4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -467,6 +467,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
struct vcpu_tdx *tdx = to_tdx(vcpu);
int ret, i;
+ /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
+ if (!vcpu->arch.apic)
+ return -EINVAL;
+
+ vcpu->arch.apic->guest_apic_protected = true;
+
ret = tdx_alloc_td_page(&tdx->tdvpr);
if (ret)
return ret;
@@ -602,9 +608,6 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
/* TDX doesn't support INIT event. */
if (WARN_ON(init_event))
goto td_bugged;
- /* TDX supports only X2APIC enabled. */
- if (WARN_ON(!vcpu->arch.apic))
- goto td_bugged;
if (WARN_ON(is_td_vcpu_created(tdx)))
goto td_bugged;
base-commit: f88e9fa63cbd87cda9352ee9a86a6f815744be33
--
On Fri, Mar 04, 2022, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add an option to skip the IRR check-in kvm_wait_lapic_expire(). This
> will be used by TDX to wait if there is an outstanding notification for
> a TD, i.e. a virtual interrupt is being triggered via posted interrupt
> processing. KVM TDX doesn't emulate PI processing, i.e. there will
> never be a bit set in IRR/ISR, so the default behavior for APICv of
> querying the IRR doesn't work as intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 4 ++--
> arch/x86/kvm/lapic.h | 2 +-
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 2 +-
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9322e6340a74..d49f029ef0e3 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1620,12 +1620,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> }
>
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
> {
> if (lapic_in_kernel(vcpu) &&
> vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
> vcpu->arch.apic->lapic_timer.timer_advance_ns &&
> - lapic_timer_int_injected(vcpu))
> + (force_wait || lapic_timer_int_injected(vcpu)))
> __kvm_wait_lapic_expire(vcpu);
If the guest_apic_protected idea works, rather than require TDX to tell the local
APIC that it should wait, the common code can instead assume a timer IRQ is pending
if the IRR holds garbage.
Again, compile tested only...
From: Sean Christopherson <[email protected]>
Date: Fri, 8 Apr 2022 09:24:39 -0700
Subject: [PATCH] KVM: x86: Assume timer IRQ was injected if APIC state is
protected
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.
Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 50a483abc0fe..e5555dce8db8 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1531,8 +1531,17 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+ u32 reg;
+ /*
+ * Assume a timer IRQ was "injected" if the APIC is protected. KVM's
+ * copy of the vIRR is bogus, it's the responsibility of the caller to
+ * precisely check whether or not a timer IRQ is pending.
+ */
+ if (apic->guest_apic_protected)
+ return true;
+
+ reg = kvm_lapic_get_reg(apic, APIC_LVTT);
if (kvm_apic_hw_enabled(apic)) {
int vec = reg & APIC_VECTOR_MASK;
void *bitmap = apic->regs + APIC_ISR;
base-commit: 33f2439cd63c84fcbc8b4cdd4eb731e83deead90
--
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For vcpu migration, in the case of VMX, VCMS is flushed on the source pcpu,
> and load it on the target pcpu. There are corresponding TDX SEAMCALL APIs,
> call them on vcpu migration. The logic is mostly same as VMX except the
> TDX SEAMCALLs are used.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 20 +++++++++++++--
> arch/x86/kvm/vmx/tdx.c | 51 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 2 ++
> 3 files changed, 71 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index f9d43f2de145..2cd5ba0e8788 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -121,6 +121,14 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
> return vmx_vcpu_run(vcpu);
> }
>
> +static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_vcpu_load(vcpu, cpu);
> +
> + return vmx_vcpu_load(vcpu, cpu);
> +}
> +
> static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> {
> if (is_td_vcpu(vcpu))
> @@ -162,6 +170,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> }
>
> +static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + vmx_sched_in(vcpu, cpu);
> +}
> +
> static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
> {
> if (!is_td(kvm))
> @@ -199,7 +215,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vcpu_reset = vt_vcpu_reset,
>
> .prepare_guest_switch = vt_prepare_switch_to_guest,
> - .vcpu_load = vmx_vcpu_load,
> + .vcpu_load = vt_vcpu_load,
> .vcpu_put = vt_vcpu_put,
>
> .update_exception_bitmap = vmx_update_exception_bitmap,
> @@ -285,7 +301,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .request_immediate_exit = vmx_request_immediate_exit,
>
> - .sched_in = vmx_sched_in,
> + .sched_in = vt_sched_in,
>
> .cpu_dirty_log_size = PML_ENTITY_NUM,
> .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 37cf7d43435d..a6b1a8ce888d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -85,6 +85,18 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> return kvm_tdx->finalized;
> }
>
> +static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * Ensure tdx->cpu_list is updated is before setting vcpu->cpu to -1,
> + * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> + * to its list before its deleted from this CPUs list.
> + */
> + smp_wmb();
> +
> + vcpu->cpu = -1;
> +}
> +
> static void tdx_clear_page(unsigned long page)
> {
> const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -155,6 +167,39 @@ static void tdx_reclaim_td_page(struct tdx_td_page *page)
> free_page(page->va);
> }
>
> +static void tdx_flush_vp(void *arg)
> +{
> + struct kvm_vcpu *vcpu = arg;
> + u64 err;
> +
> + /* Task migration can race with CPU offlining. */
> + if (vcpu->cpu != raw_smp_processor_id())
> + return;
> +
> + /*
> + * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
> + * list tracking still needs to be updated so that it's correct if/when
> + * the vCPU does get initialized.
> + */
> + if (is_td_vcpu_created(to_tdx(vcpu))) {
> + err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
> + if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
> + if (WARN_ON_ONCE(err))
> + pr_tdx_error(TDH_VP_FLUSH, err, NULL);
> + }
> + }
> +
> + tdx_disassociate_vp(vcpu);
> +}
> +
> +static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
> +{
> + if (unlikely(vcpu->cpu == -1))
> + return;
> +
> + smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
> +}
> +
> static int tdx_do_tdh_phymem_cache_wb(void *param)
> {
> u64 err = 0;
> @@ -425,6 +470,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> + if (vcpu->cpu != cpu)
> + tdx_flush_vp_on_cpu(vcpu);
> +}
> +
> void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> {
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 8b871c5f52cf..ceafd6e18f4e 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -143,6 +143,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
> void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> void tdx_vcpu_put(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -166,6 +167,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
> static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
> static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>
> static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
This patch and the next one might even be squashed together.
Otherwise
Reviewed-by: Paolo Bonzini <[email protected]>
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Several user-return MSRs are clobbered on TD exit. Update the cached values
> on TD exit so that the MSRs are restored to the host values before returning
> to ring 3.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 54be5be1a06c..c1366aac7d96 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -550,6 +550,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vcpu->kvm->vm_bugged = true;
> }
>
> +struct tdx_uret_msr {
> + u32 msr;
> + unsigned int slot;
> + u64 defval;
> +};
> +
> +static struct tdx_uret_msr tdx_uret_msrs[] = {
> + {.msr = MSR_SYSCALL_MASK,},
> + {.msr = MSR_STAR,},
> + {.msr = MSR_LSTAR,},
> + {.msr = MSR_TSC_AUX,},
> +};
> +
> +static void tdx_user_return_update_cache(void)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> + kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
> + tdx_uret_msrs[i].defval);
> +}
> +
> static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -589,6 +611,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_vcpu_enter_exit(vcpu, tdx);
>
> + tdx_user_return_update_cache();
> tdx_restore_host_xsave_state(vcpu);
> tdx->host_state_need_restore = true;
>
> @@ -1371,6 +1394,16 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
> return -EIO;
>
> + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
> + tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
> + if (tdx_uret_msrs[i].slot == -1) {
> + /* If any MSR isn't supported, it is a KVM bug */
> + pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
> + tdx_uret_msrs[i].msr);
> + return -EIO;
> + }
> + }
> +
> max_pkgs = topology_max_packages();
> tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> GFP_KERNEL);
I wonder if you only need to do this if
!this_cpu_ptr(user_return_msrs)->registered, but not a big deal.
Reviewed-by: Paolo Bonzini <[email protected]>
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> On entering/exiting a TDX vcpu, the preserved or clobbered CPU state differs
> from the VMX case. Add TDX hooks to save/restore host/guest CPU state,
> starting with the kernel GS base MSR.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 28 +++++++++++++++++++++++++--
> arch/x86/kvm/vmx/tdx.c | 39 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 4 ++++
> arch/x86/kvm/vmx/x86_ops.h | 4 ++++
> 4 files changed, 73 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 2e5a7a72d560..f9d43f2de145 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,30 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> return vmx_vcpu_reset(vcpu, init_event);
> }
>
> +static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * All host state is saved/restored across SEAMCALL/SEAMRET, and the
> + * guest state of a TD is obviously off limits. Deferring MSRs and DRs
> + * is pointless because the TDX module needs to load *something* so as
> + * not to expose guest state.
> + */
> + if (is_td_vcpu(vcpu)) {
> + tdx_prepare_switch_to_guest(vcpu);
> + return;
> + }
> +
> + vmx_prepare_switch_to_guest(vcpu);
> +}
> +
> +static void vt_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_vcpu_put(vcpu);
> +
> + return vmx_vcpu_put(vcpu);
> +}
> +
> static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
> {
> if (is_td_vcpu(vcpu))
> @@ -174,9 +198,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vcpu_free = vt_vcpu_free,
> .vcpu_reset = vt_vcpu_reset,
>
> - .prepare_guest_switch = vmx_prepare_switch_to_guest,
> + .prepare_guest_switch = vt_prepare_switch_to_guest,
> .vcpu_load = vmx_vcpu_load,
> - .vcpu_put = vmx_vcpu_put,
> + .vcpu_put = vt_vcpu_put,
>
> .update_exception_bitmap = vmx_update_exception_bitmap,
> .get_msr_feature = vmx_get_msr_feature,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ebe4f9bf19e7..7a288aae03ba 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/cpu.h>
> +#include <linux/mmu_context.h>
>
> #include <asm/tdx.h>
>
> @@ -407,6 +408,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.guest_state_protected =
> !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
>
> + tdx->host_state_need_save = true;
> + tdx->host_state_need_restore = false;
> +
> return 0;
>
> free_tdvpx:
> @@ -420,6 +424,39 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (!tdx->host_state_need_save)
> + return;
> +
> + if (likely(is_64bit_mm(current->mm)))
> + tdx->msr_host_kernel_gs_base = current->thread.gsbase;
> + else
> + tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> +
> + tdx->host_state_need_save = false;
> +}
> +
> +static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + tdx->host_state_need_save = true;
> + if (!tdx->host_state_need_restore)
> + return;
> +
> + wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
> + tdx->host_state_need_restore = false;
> +}
> +
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> + vmx_vcpu_pi_put(vcpu);
> + tdx_prepare_switch_to_host(vcpu);
> +}
> +
> void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> {
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -535,6 +572,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_vcpu_enter_exit(vcpu, tdx);
>
> + tdx->host_state_need_restore = true;
> +
> vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index e950404ce5de..8b1cf9c158e3 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -84,6 +84,10 @@ struct vcpu_tdx {
> union tdx_exit_reason exit_reason;
>
> bool initialized;
> +
> + bool host_state_need_save;
> + bool host_state_need_restore;
> + u64 msr_host_kernel_gs_base;
> };
>
> static inline bool is_td(struct kvm *kvm)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 44404dd25737..8b871c5f52cf 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -141,6 +141,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu);
>
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -162,6 +164,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
> +static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
>
> static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
Reviewed-by: Paolo Bonzini <[email protected]>
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> This corresponds to VMX __vmx_complete_interrupts(). Because TDX
> virtualizes the vAPIC, KVM only needs to care about NMI injection.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c1366aac7d96..3cb2fbd1c12c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -550,6 +550,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vcpu->kvm->vm_bugged = true;
> }
>
> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> +{
> +	/* Avoid costly SEAMCALL if no NMI was injected */
> + if (vcpu->arch.nmi_injected)
> + vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> + TD_VCPU_PEND_NMI);
> +}
> +
> struct tdx_uret_msr {
> u32 msr;
> unsigned int slot;
> @@ -618,6 +626,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> + tdx_complete_interrupts(vcpu);
> +
> if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> return EXIT_FASTPATH_NONE;
> return EXIT_FASTPATH_NONE;
Reviewed-by: Paolo Bonzini <[email protected]>
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Because the debug store is clobbered, restore it on TD exit.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/events/intel/ds.c | 1 +
> arch/x86/kvm/vmx/tdx.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 376cc3d66094..cdba4227ad3b 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2256,3 +2256,4 @@ void perf_restore_debug_store(void)
>
> wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
> }
> +EXPORT_SYMBOL_GPL(perf_restore_debug_store);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3cb2fbd1c12c..37cf7d43435d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -620,6 +620,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> tdx_vcpu_enter_exit(vcpu, tdx);
>
> tdx_user_return_update_cache();
> + perf_restore_debug_store();
> tdx_restore_host_xsave_state(vcpu);
> tdx->host_state_need_restore = true;
>
Reviewed-by: Paolo Bonzini <[email protected]>
On Fri, Apr 15, 2022 at 05:18:42PM +0200,
Paolo Bonzini <[email protected]> wrote:
> On 3/4/22 20:48, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Hi. Now TDX host kernel patch series was posted, I've rebased this patch
> > series to it and make it work.
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Changes from v4:
> > - rebased to TDX host kernel patch series.
> > - include all the patches to make this patch series working.
> > - add [MARKER] patches to mark the patch layer clear.
>
> I think I have reviewed everything except the TDP MMU parts (48, 54-57). I
> will do those next week, but in the meanwhile feel free to send v6 if you
> have it ready. A lot of the requests have been cosmetic.
Thank you so much. I'm updating patches now.
> If you would like to use something like Trello to track all the changes, and
> submit before you have done all of them, that's fine by me.
Sure. I've created a public Trello board.
If you want to edit it, please let me know and I'll add you as a project member.
https://trello.com/kvmtdxreview
thanks,
--
Isaku Yamahata <[email protected]>
On 4/15/22 17:18, Paolo Bonzini wrote:
> On 3/4/22 20:48, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> Hi. Now TDX host kernel patch series was posted, I've rebased this patch
>> series to it and make it work.
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Changes from v4:
>> - rebased to TDX host kernel patch series.
>> - include all the patches to make this patch series working.
>> - add [MARKER] patches to mark the patch layer clear.
>
> I think I have reviewed everything except the TDP MMU parts (48, 54-57).
> I will do those next week, but in the meanwhile feel free to send v6
> if you have it ready. A lot of the requests have been cosmetic.
>
> If you would like to use something like Trello to track all the changes,
> and submit before you have done all of them, that's fine by me.
Also, I have now pushed what (I think) should be all that's needed to
run TDX guests at branch kvm-tdx-5.17 of
https://git.kernel.org/pub/scm/virt/kvm/kvm.git. It's only
compile-tested for now, but if I missed something please report it so that
the branch can be used by people doing other work (including QEMU, TDVF and guest).
Thanks,
Paolo
On 3/4/22 20:49, [email protected] wrote:
> +/* Masks that used to track for shared GPA **/
> +#define SPTE_PRIVATE_PROHIBIT BIT_ULL(62)
> +
Please rename this to SPTE_SHARED_MAPPING_MASK, or even just
SPTE_SHARED_MASK.
Paolo
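The rename would be mechanical; a rough sketch of the resulting definition,
keeping the BIT_ULL(62) position from the quoted hunk (the comment wording
here is an assumption):

	/* Software-available SPTE bit used to track that a GPA is mapped shared. */
	#define SPTE_SHARED_MASK	BIT_ULL(62)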
On 3/4/22 20:49, [email protected] wrote:
> From: Chao Gao <[email protected]>
>
> Several MSRs are constant and only used in userspace (ring 3), but VMs may
> have different values. KVM uses kvm_set_user_return_msr() to switch to
> guest's values and leverages user return notifier to restore them when the
> kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
> caches the value it wrote to an MSR last time.
>
> The TDX module unconditionally resets some of these MSRs to their architectural
> INIT state on TD exit, which makes the cached values in kvm_user_return_msrs
> inconsistent with the values in hardware. This inconsistency needs to be
> fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
> some MSRs to the host's values. kvm_set_user_return_msr() can help correct
> this case, but it is not optimal as it always does a wrmsr. So, introduce
> a variation of kvm_set_user_return_msr() to update cached values and skip
> that wrmsr.
>
> Signed-off-by: Chao Gao <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 25 ++++++++++++++++++++-----
> 2 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8406f8b5ab74..b6396d11139e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1894,6 +1894,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
> int kvm_add_user_return_msr(u32 msr);
> int kvm_find_user_return_msr(u32 msr);
> int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> +void kvm_user_return_update_cache(unsigned int index, u64 val);
>
> static inline bool kvm_is_supported_user_return_msr(u32 msr)
> {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 66400810d54f..45e8a02e99bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -427,6 +427,15 @@ static void kvm_user_return_msr_cpu_online(void)
> }
> }
>
> +static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
> +{
> + if (!msrs->registered) {
> + msrs->urn.on_user_return = kvm_on_user_return;
> + user_return_notifier_register(&msrs->urn);
> + msrs->registered = true;
> + }
> +}
> +
> int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
> {
> unsigned int cpu = smp_processor_id();
> @@ -441,15 +450,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
> return 1;
>
> msrs->values[slot].curr = value;
> - if (!msrs->registered) {
> - msrs->urn.on_user_return = kvm_on_user_return;
> - user_return_notifier_register(&msrs->urn);
> - msrs->registered = true;
> - }
> + kvm_user_return_register_notifier(msrs);
> return 0;
> }
> EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
>
> +/* Update the cache, "curr", and register the notifier */
> +void kvm_user_return_update_cache(unsigned int slot, u64 value)
> +{
> + struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
> +
> + msrs->values[slot].curr = value;
> + kvm_user_return_register_notifier(msrs);
> +}
> +EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
> +
> static void drop_user_return_notifiers(void)
> {
> unsigned int cpu = smp_processor_id();
Reviewed-by: Paolo Bonzini <[email protected]>
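For readers less familiar with the user-return MSR machinery, the reason a
stale cache is a problem is the equality check in kvm_on_user_return(): the
host value is only written back when the cached "curr" differs from it. A
rough, paraphrased sketch of that notifier follows (unregistration and IRQ
handling omitted; names are from memory and may not match the tree exactly):

	static void kvm_on_user_return(struct user_return_notifier *urn)
	{
		struct kvm_user_return_msrs *msrs =
			container_of(urn, struct kvm_user_return_msrs, urn);
		unsigned int slot;

		for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
			struct kvm_user_return_msr_values *values = &msrs->values[slot];

			/*
			 * If the TDX module reset the MSR but "curr" was left equal
			 * to "host", this check would wrongly skip the restoring
			 * wrmsr; keeping "curr" accurate is exactly what
			 * kvm_user_return_update_cache() is for.
			 */
			if (values->host != values->curr) {
				wrmsrl(kvm_uret_msrs_list[slot], values->host);
				values->curr = values->host;
			}
		}
	}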
On 3/4/22 20:48, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Hi. Now TDX host kernel patch series was posted, I've rebased this patch
> series to it and make it work.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Changes from v4:
> - rebased to TDX host kernel patch series.
> - include all the patches to make this patch series working.
> - add [MARKER] patches to mark the patch layer clear.
I think I have reviewed everything except the TDP MMU parts (48, 54-57).
I will do those next week, but in the meanwhile feel free to send v6
if you have it ready. A lot of the requests have been cosmetic.
If you would like to use something like Trello to track all the changes,
and submit before you have done all of them, that's fine by me.
Paolo
> Thanks,
>
>
> * What's TDX?
> TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> Domain (TD) for confidential computing.
>
> A TD runs in a CPU mode that is designed to protect the confidentiality of its
> memory contents and its CPU state from any other software, including the hosting
> Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
>
> We have more detailed explanations below (***).
> We have the high-level design of TDX KVM below (****).
>
> In this patch series, we use "TD" or "guest TD" to differentiate it from the
> current "VM" (Virtual Machine), which is supported by KVM today.
>
>
> * The organization of this patch series
> This patch series is on top of the patches series "TDX host kernel support":
> https://lore.kernel.org/lkml/[email protected]/
>
> this patch series is available at
> https://github.com/intel/tdx/releases/tag/kvm-upstream
> The corresponding patches to qemu are available at
> https://github.com/intel/qemu-tdx/commits/tdx-upstream
>
> The relations of the layers are depicted as follows.
> The arrows below show the order of patch reviews we would like to have.
>
> The below layers are chosen so that the device model, for example, qemu can
> exercise each layering step by step. Check if TDX is supported, create TD VM,
> create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> vcpu exits/hypercalls/interrupts to run TD fully.
>
> TDX vcpu
> interrupt/exits/hypercall<------------\
> ^ |
> | |
> TD finalization |
> ^ |
> | |
> TDX EPT violation<------------\ |
> ^ | |
> | | |
> TD vcpu enter/exit | |
> ^ | |
> | | |
> TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
> ^ | ^
> | | |
> TD VM creation/destruction \---------------KVM TDP MMU hooks
> ^ ^
> | |
> TDX architectural definitions KVM TDP refactoring for TDX
> ^ ^
> | |
> TDX, VMX <--------TDX host kernel KVM MMU GPA stolen bits
> coexistence support
>
>
> The followings are explanations of each layer. Each layer has a dummy commit
> that starts with [MARKER] in subject. It is intended to help to identify where
> each layer starts.
>
> TDX host kernel support:
> https://lore.kernel.org/lkml/[email protected]/
> The guts of system-wide initialization of TDX module. There is an
> independent patch series for host x86. TDX KVM patches call functions
> this patch series provides to initialize the TDX module.
>
> TDX, VMX coexistence:
> Infrastructure to allow TDX to coexist with VMX and trigger the
> initialization of the TDX module.
> This layer starts with
> "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> TDX architectural definitions:
> Add TDX architectural definitions and helper functions
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> TD VM creation/destruction:
> Guest TD creation/destroy allocation and releasing of TDX specific vm
> and vcpu structure. Create an initial guest memory image with TDX
> measurement.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> TD vcpu creation/destruction:
> guest TD creation/destroy Allocation and releasing of TDX specific vm
> and vcpu structure. Create an initial guest memory image with TDX
> measurement.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> TDX EPT violation:
> Create an initial guest memory image with TDX measurement. Handle
> secure EPT violations to populate guest pages with TDX SEAMCALLs.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> TD vcpu enter/exit:
> Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
> entering into TD. Restore CPU state after exiting from TD.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> TD vcpu interrupts/exit/hypercall:
> Handle various exits/hypercalls and allow interrupts to be injected so
> that TD vcpu can continue running.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
>
> KVM MMU GPA stolen bits:
> 	Introduce a framework to handle the stolen/repurposed bit of the GPA. TDX
> repurposed a bit of GPA to indicate shared or private. If it's shared,
> it's the same as the conventional VMX EPT case. VMM can access shared
> guest pages. If it's private, it's handled by Secure-EPT and the guest
> page is encrypted.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> KVM TDP refactoring for TDX:
> TDX Secure EPT requires different constants. e.g. initial value EPT
> entry value etc. Various refactoring for those differences.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> KVM TDP MMU hooks:
> 	Introduce a framework in the TDP MMU to add hooks in addition to direct EPT
> 	access. TDX added Secure EPT, which is an enhancement to VMX EPT. Unlike
> conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> use TDX SEAMCALLs to operate on Secure EPT.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> KVM TDP MMU MapGPA:
> Introduce framework to handle switching guest pages from private/shared
> to shared/private. For a given GPA, a guest page can be assigned to a
> private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
> guest TD converts GPA assignments from private (or shared) to shared (or
> private).
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
>
> KVM guest private memory: (not shown in the above diagram)
> [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> memory: https://lkml.org/lkml/2022/1/18/395
> 	Guest private memory requires different memory management in KVM. That
> 	patch series proposes a way to do it; this series integrates it with TDX KVM.
>
> (***)
> * TDX module
> A CPU-attested software module called the "TDX module" is designed to implement
> the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> loaded by the kernel or driver at runtime, but in this patch series we assume
> that the TDX module is already loaded and initialized.
>
> The TDX module provides two main new logical modes of operation built upon the
> new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> architecture. TDX root mode is mostly identical to the VMX root operation mode,
> and the TDX functions (described later) are triggered by the new SEAMCALL
> instruction with the desired interface function selected by an input operand
> (leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
> non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> operation (i.e. guest VM), with changes and restrictions to better assure that
> no other software or hardware has direct visibility of the TD memory and state.
>
> TDX transitions between TDX root operation and TDX non-root operation include TD
> Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> TDX root mode. A TD Exit might be asynchronous, triggered by some external
> event (e.g., external interrupt or SMI) or an exception, or it might be
> synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
>
> TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> Domain Host. Those host-side TDX interface functions are categorized into
> various areas just for better organization, such as SYS (TDX module management),
> MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
>
> TDCS (Trust Domain Control Structure) is the main control structure of a guest
> TD, and is encrypted (using the guest TD's ephemeral private key). At a high
> level, TDCS holds information for controlling TD operation as a whole:
> execution, EPTP, MSR bitmaps, etc. that KVM needs to set up. Note that MSR
> bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> same value for all VCPUs of the same TD.
>
> Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
> the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> DMA access, accessible only by using the TDX module interface functions (such as
> TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> such as virtual APIC page, virtualization exception information, etc.
>
> Several VMX control structures (such as Shared EPT and Posted interrupt
> descriptor) are directly managed and accessed by the host VMM. These control
> structures are pointed to by fields in the TD VMCS.
>
> The above means that 1) KVM needs to allocate different data structures for TDs,
> 2) KVM can reuse the existing code for TDs for some operations, and 3) it needs
> to define TD-specific handling for others and redirect those operations to the
> TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
> vmx_callback();".
>
> *TD Private Memory
> TD private memory is designed to hold TD private content, encrypted by the CPU
> using the TD ephemeral key. An encryption engine holds a table of encryption
> keys, and an encryption key is selected for each memory transaction based on a
> Host Key Identifier (HKID). By design, the host VMM does not have access to the
> encryption keys.
>
> In the first generation of MKTME, HKID is "stolen" from the physical address by
> allocating a configurable number of bits from the top of the physical
> address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> HKID on the host so that MKTME can be opaque or bypassed on the host.
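>
> For illustration, tagging a host physical address with a private HKID boils
> down to setting the stolen key-ID bits (set_hkid_to_hpa() and the bit
> position parameter are assumptions for this sketch; the real position is
> enumerated from the platform):
>
>   #include <linux/types.h>
>
>   /* Encode an HKID into the configurable top bits of a host physical address. */
>   static inline u64 set_hkid_to_hpa(u64 hpa, u16 hkid, int hkid_start_pos)
>   {
>           return hpa | ((u64)hkid << hkid_start_pos);
>   }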
>
> During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> as either shared or private, based on the value of a new SHARED bit in the Guest
> Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
> (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> with the current VMX. Since guest TDs usually require I/O and the data exchange
> needs to be done via shared memory, KVM needs to use the current EPT
> functionality even for TDs.
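>
> For example, a helper along the following lines (a sketch; the per-VM shared
> bit mask and the helper name are assumptions) lets KVM qualify a faulting GPA:
>
>   #include <linux/types.h>
>
>   /* The SHARED bit is bit 51 or bit 47 depending on the configured GPA width;
>    * gpa_shared_mask is assumed to be cached per VM. */
>   static inline bool kvm_is_private_gpa(u64 gpa, u64 gpa_shared_mask)
>   {
>           return !(gpa & gpa_shared_mask);
>   }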
>
> * Secure EPT and Mirroring using the TDP code
> The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
> pages are encrypted and integrity-protected with the TD's ephemeral private
> key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> interface functions, and thus conceptually Secure EPT is a subset of EPT.
> Since execution of such interface functions takes much longer than accessing
> memory directly, in KVM we use the existing TDP code to mirror the Secure EPT
> for the TD.
>
> This way, we can effectively walk Secure EPT without using the TDX interface
> functions.
>
> * VM life cycle and TDX specific operations
> The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
> example, a TD needs to boot in private memory, and the host software cannot
> directly copy the initial image into private memory.
>
> * TSC Virtualization
> The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> (e.g. consistent among the TD VCPUs), and the virtual TSC frequency is
> determined by the TD configuration, i.e. when the TD is created, not per VCPU.
> KVM currently owns TSC virtualization for VMs, but the TDX module does so for
> TDs.
>
> * MCE support for TDs
> The TDX module doesn't allow the VMM to inject MCEs into a TD. Instead, a PV
> interface is needed for the TD to communicate with the VMM. For now, KVM
> silently ignores MCE injection requests from the VMM. MSRs related to MCE
> (e.g., MCE bank registers) can be naturally emulated by paravirtualizing MSR
> access.
>
> For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
> available.
>
> * Restrictions or future work
> Some features are not included to reduce patch size. Those features are
> addressed as future independent patch series.
> - large page (2M, 1G)
> - qemu gdb stub
> - guest PMU
> - and more
>
> * Prerequisites
> It's required to load the TDX module and initialize it. That is out of the
> scope of this patch series. An independent patch series for the common x86
> code is planned; it defines CONFIG_INTEL_TDX_HOST, which this patch series
> uses. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX module is
> initialized and the TDX module APIs for the TDX guest life cycle, such as
> tdh.mng.init, are ready for KVM to use.
>
> Concretely, global initialization, LP (Logical Processor) initialization,
> global configuration, the key configuration, and TDMR and PAMT initialization
> are done. The state of the TDX module is SYS_READY. Please refer to the TDX
> module specification, the chapter "Intel TDX Module Lifecycle State Machine".
>
> ** Detecting the TDX module readiness.
> The TDX host patch series implements the detection of TDX module availability
> and its initialization so that KVM can use it. It also manages the Host KeyIDs
> (HKIDs) assigned to guest TDs.
> The assumed APIs the TDX host patch series provides are (a usage sketch
> follows the list):
> - int seamrr_enabled()
> Check if the required CPU feature (SEAM mode) is available. This only checks
> CPU feature availability; at this point, the TDX module may not be ready for
> KVM to use.
> - int init_tdx(void);
> Initialization of TDX module so that the TDX module is ready for KVM to use.
> - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> Return the system-wide information about the TDX module, or NULL if the TDX
> module isn't initialized.
> - u32 tdx_get_global_keyid(void);
> Return global key id that is used for the TDX module itself.
> - int tdx_keyid_alloc(void);
> Allocate HKID for guest TD.
> - void tdx_keyid_free(int keyid);
> Free HKID for guest TD.
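>
> A minimal sketch of how KVM might consume these APIs at module load time (the
> surrounding function and error codes are hypothetical; only the APIs listed
> above are assumed):
>
>   static int tdx_hardware_setup(void)
>   {
>           const struct tdsysinfo_struct *sysinfo;
>
>           if (!seamrr_enabled())
>                   return -ENODEV;         /* no SEAM mode, no TDX */
>
>           if (init_tdx())
>                   return -EIO;            /* TDX module initialization failed */
>
>           sysinfo = tdx_get_sysinfo();
>           if (!sysinfo)
>                   return -EIO;
>
>           /* sysinfo now describes the TDX module, e.g. supported features */
>           return 0;
>   }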
>
> (****)
> * TDX KVM high-level design
> - Host key ID management
> Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> It is assumed that the TDX host patch series implements the necessary functions:
> u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
> void tdx_keyid_free(int keyid).
>
> - Data structures and VM type
> Because TDX is different from VMX, define TDX-specific VM/VCPU structures,
> struct kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct
> vcpu_vmx. To identify the VM, introduce a VM type that specifies which kind of
> VM, VMX (default) or TDX, is used.
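>
> As a sketch, the TDX-specific structures embed the common KVM ones (the field
> list is illustrative; tdr and finalized show up in the patches below, the
> rest may differ):
>
>   /* Sketch only: the real layout is defined in arch/x86/kvm/vmx/tdx.h. */
>   struct kvm_tdx {
>           struct kvm kvm;
>
>           struct tdx_td_page tdr;     /* TD Root page */
>           int hkid;                   /* private Host Key ID of this TD */
>           bool finalized;             /* set by KVM_TDX_FINALIZE_VM */
>   };
>
>   struct vcpu_tdx {
>           struct kvm_vcpu vcpu;
>
>           struct tdx_td_page tdvpr;   /* root page of this VCPU's TDVPS */
>   };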
>
> - VM life cycle and TDX specific operations
> Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> parameters, set initial guest memory and measurement.
>
> The creation of a TDX VM requires five additional operations on top of the
> conventional VM creation (a userspace sketch follows this list).
> - Get KVM system capability to check if TDX VM type is supported
> - VM creation (KVM_CREATE_VM)
> - New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
> - New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
> - VCPU creation (KVM_CREATE_VCPU)
> - New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
> - New: Initialize guest memory as boot state and extend the measurement with
> the memory. KVM_TDX_INIT_MEM_REGION.
> - New: Finalize the VM. KVM_TDX_FINALIZE_VM. Complete the measurement of the
> initial TDX VM contents.
> - VCPU run (KVM_RUN)
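>
> From the userspace VMM's point of view, the flow is roughly the sketch below
> (error handling omitted; tdx_ioctl(), KVM_X86_TDX_VM and which fd each
> subcommand targets are assumptions for illustration, the real QEMU code
> differs):
>
>   #include <sys/ioctl.h>
>   #include <linux/kvm.h>
>
>   /* Hypothetical helper: fill struct kvm_tdx_cmd with the subcommand id and
>    * data pointer, then issue KVM_MEMORY_ENCRYPT_OP on the given fd. */
>   static void tdx_ioctl(int fd, int cmd_id, void *data);
>
>   static void create_td(int kvm_fd, void *init_vm, void *mem_region)
>   {
>           int vm_fd, vcpu_fd;
>
>           vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
>           tdx_ioctl(vm_fd, KVM_TDX_GET_CAPABILITY, NULL);     /* query params */
>           tdx_ioctl(vm_fd, KVM_TDX_INIT_VM, init_vm);
>           vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
>           tdx_ioctl(vcpu_fd, KVM_TDX_INIT_VCPU, NULL);
>           tdx_ioctl(vm_fd, KVM_TDX_INIT_MEM_REGION, mem_region);
>           tdx_ioctl(vm_fd, KVM_TDX_FINALIZE_VM, NULL);
>           ioctl(vcpu_fd, KVM_RUN, 0);
>   }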
>
> - Protected guest state
> Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> can't operate on it; for example, it can't access CPU registers, inject
> exceptions, or access guest memory. Those operations are silently ignored,
> returning zero or the initial reset value when requested via KVM API ioctls.
>
> VM/VCPU state and callbacks for TDX specific operations:
> Define TDX-specific VM state and VCPU state instead of the VMX ones. Redirect
> operations to TDX-specific callbacks, "if (tdx) tdx_op() else vmx_op()".
>
> Operations on the CPU state:
> Silently ignore operations on the guest state. For example, writes to CPU
> registers are ignored and reads from CPU registers return 0 (a sketch follows
> below).
>
> . Ignore access to CPU registers except for the allowed ones.
> . TSC: add a check whether the TSC is immutable and return an error, because
> the KVM implementation updates the internal TSC state and it's difficult to
> back out those changes. Instead, skip the logic.
> . Dirty logging: add a check whether dirty logging is supported.
> . Exceptions/SMI/MCE/SIPI/INIT: silently ignore
>
> Note: virtual external interrupt and NMI can be injected into TDX guests.
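>
> For example, the register accessors reduce to the following sketch (the
> callback names are placeholders for the accessors added by "KVM: TDX: Add
> methods to ignore accesses to CPU state"):
>
>   /* Reads of protected registers return 0, writes are silently dropped. */
>   static void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>   {
>           if (WARN_ON_ONCE(reg >= NR_VCPU_REGS))
>                   return;
>           vcpu->arch.regs[reg] = 0;
>   }
>
>   static void tdx_set_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg,
>                           unsigned long val)
>   {
>           /* nothing to do: the real register lives inside the TDX module */
>   }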
>
> - KVM MMU integration
> One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> the guest physical address is private (the bit is cleared) or shared (the bit is
> set). The bits are called stolen bits.
>
> - Stolen bits framework
> systematically tracks which guest physical address, shared or private, is
> used.
>
> - Shared EPT and secure EPT
> There are two EPTs: the Shared EPT (the conventional one) and the Secure
> EPT (the new one). The Shared EPT is handled as before and covers GPAs
> with the stolen (shared) bit set. The Secure EPT maps private guest pages.
> To resolve an EPT violation, KVM walks one of the two EPTs based on the
> faulted GPA. Because it's costly to walk the Secure EPT with SEAMCALLs
> for private guest physical addresses, another private EPT is used as a
> mirror of the Secure EPT with the existing logic, at the cost of extra
> memory.
>
> The following depicts the relationship.
>
> KVM | TDX module
> | | |
> -------------+---------- | |
> | | | |
> V V | |
> shared GPA private GPA | |
> CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
> | | | |
> | | | |
> V V | V
> shared EPT private EPT<-------mirror----->Secure EPT
> | | | |
> | \--------------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> |
> non-encrypted memory | encrypted memory
> |
>
> - Operating on Secure EPT
> Use the TDX module APIs to operate on the Secure EPT. To call the TDX APIs
> while resolving an EPT violation, add hooks for the additional operations
> and wire them to the TDX backend.
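>
> For instance, a hook that installs a private page could look like the sketch
> below (the hook name, tdh_mem_page_aug() and TDH_MEM_PAGE_AUG are
> illustrative; the real hooks are added by the TDP MMU patches of this
> series):
>
>   /* Called when a private SPTE becomes present; the mirrored private EPT
>    * has already been updated by the common TDP MMU code.  Large pages are
>    * future work, so level is ignored here. */
>   static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>                                         int level, kvm_pfn_t pfn)
>   {
>           struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>           u64 err;
>
>           err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
>                                  pfn_to_hpa(pfn), NULL);
>           if (WARN_ON_ONCE(err))
>                   pr_tdx_error(TDH_MEM_PAGE_AUG, err, NULL);
>   }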
>
> * References
>
> [1] TDX specification
> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
> [3] Intel CPU Architectural Extensions Specification
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 EAS
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
> [5] Intel TDX Loader Interface Specification
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> [7] Intel TDX Virtual Firmware Design Guide
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
> [8] intel public github
> kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> TDX guest branch: https://github.com/intel/tdx/tree/guest
> qemu TDX: https://github.com/intel/qemu-tdx
> [9] TDVF
> https://github.com/tianocore/edk2-staging/tree/TDVF
>
>
> Chao Gao (1):
> KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> wrmsr
>
> Isaku Yamahata (73):
> x86/virt/tdx: export platform_has_tdx
> KVM: TDX: Detect CPU feature on kernel module initialization
> KVM: x86: Refactor KVM VMX module init/exit functions
> KVM: TDX: Add placeholders for TDX VM/vcpu structure
> x86/virt/tdx: Add a helper function to return system wide info about
> TDX module
> KVM: TDX: Add a function to initialize TDX module
> KVM: TDX: Make TDX VM type supported
> [MARKER] The start of TDX KVM patch series: TDX architectural
> definitions
> KVM: TDX: Define TDX architectural definitions
> KVM: TDX: Add a function for KVM to invoke SEAMCALL
> KVM: TDX: add a helper function for KVM to issue SEAMCALL
> KVM: TDX: Add helper functions to print TDX SEAMCALL error
> [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> KVM: TDX: allocate per-package mutex
> x86/cpu: Add helper functions to allocate/free MKTME keyid
> KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
> [MARKER] The start of TDX KVM patch series: TD vcpu
> creation/destruction
> KVM: TDX: allocate/free TDX vcpu structure
> [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits
> KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> TDX
> KVM: x86/mmu: Disallow fast page fault on private GPA
> [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
> KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
> KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> [MARKER] The start of TDX KVM patch series: TDX EPT violation
> KVM: TDX: TDP MMU TDX support
> [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> KVM: x86/mmu: steal software usable bit for EPT to represent shared
> page
> KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping
> KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT
> KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> KVM: x86/mmu: Forcibly use TDP MMU for TDX
> [MARKER] The start of TDX KVM patch series: TD finalization
> KVM: TDX: Create initial guest memory
> KVM: TDX: Finalize VM initialization
> [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> KVM: TDX: Add helper assembly function to TDX vcpu
> KVM: TDX: Implement TDX vcpu enter/exit path
> KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> KVM: TDX: restore host xsave state when exit from the guest TD
> KVM: TDX: restore user ret MSRs
> [MARKER] The start of TDX KVM patch series: TD vcpu
> exits/interrupts/hypercalls
> KVM: TDX: complete interrupts after tdexit
> KVM: TDX: restore debug store when TD exit
> KVM: TDX: handle vcpu migration over logical processor
> KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the
> guest TD
> KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> behavior
> KVM: TDX: Implement interrupt injection
> KVM: TDX: Implements vcpu request_immediate_exit
> KVM: TDX: Implement methods to inject NMI
> KVM: TDX: Add a place holder to handle TDX VM exit
> KVM: TDX: handle EXIT_REASON_OTHER_SMI
> KVM: TDX: handle ept violation/misconfig exit
> KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
> KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> KVM: TDX: Handle TDX PV CPUID hypercall
> KVM: TDX: Handle TDX PV HLT hypercall
> KVM: TDX: Handle TDX PV port io hypercall
> KVM: TDX: Implement callbacks for MSR operations for TDX
> KVM: TDX: Handle TDX PV rdmsr hypercall
> KVM: TDX: Handle TDX PV wrmsr hypercall
> KVM: TDX: Handle TDX PV report fatal error hypercall
> KVM: TDX: Handle TDX PV map_gpa hypercall
> KVM: TDX: Silently discard SMI request
> KVM: TDX: Silently ignore INIT/SIPI
> Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
>
> Kai Huang (1):
> KVM: x86: Introduce hooks to free VM callback prezap and vm_free
>
> Rick Edgecombe (1):
> KVM: x86: Add infrastructure for stolen GPA bits
>
> Sean Christopherson (26):
> KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> KVM: Enable hardware before doing arch VM initialization
> KVM: x86: Introduce vm_type to differentiate default VMs from
> confidential VMs
> KVM: TDX: Add TDX "architectural" error codes
> KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> KVM: Add max_vcpus field in common 'struct kvm'
> KVM: TDX: create/destroy VM structure
> KVM: TDX: Do TDX specific vcpu initialization
> KVM: x86/mmu: Disallow dirty logging for x86 TDX
> KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> KVM: x86/mmu: Allow non-zero init value for shadow PTE
> KVM: x86/mmu: Allow per-VM override of the TDP max page level
> KVM: VMX: Split out guts of EPT violation to common/exposed function
> KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> KVM: TDX: Add load_mmu_pgd method for TDX
> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
> KVM: x86: Add option to force LAPIC expiration wait
> KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> argument
> KVM: VMX: Move NMI/exception handler to common helper
> KVM: x86: Split core of hypercall emulation to helper function
> KVM: TDX: Add a placeholder for handler of TDX hypercalls
> (TDG.VP.VMCALL)
> KVM: TDX: Handle TDX PV MMIO hypercall
> KVM: TDX: Add methods to ignore accesses to CPU state
>
> Xiaoyao Li (1):
> KVM: TDX: initialize VM with TDX specific parameters
>
> Yuan Yao (1):
> KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c
>
> Documentation/virt/kvm/api.rst | 24 +-
> .../virt/kvm/intel-tdx-layer-status.rst | 33 +
> Documentation/virt/kvm/intel-tdx.rst | 360 +++
> Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 ++++
> arch/arm64/include/asm/kvm_host.h | 3 -
> arch/arm64/kvm/arm.c | 6 +-
> arch/arm64/kvm/vgic/vgic-init.c | 6 +-
> arch/x86/events/intel/ds.c | 1 +
> arch/x86/include/asm/kvm-x86-ops.h | 5 +
> arch/x86/include/asm/kvm_host.h | 38 +-
> arch/x86/include/asm/tdx.h | 61 +
> arch/x86/include/asm/vmx.h | 2 +
> arch/x86/include/uapi/asm/kvm.h | 59 +
> arch/x86/include/uapi/asm/vmx.h | 5 +-
> arch/x86/kvm/Kconfig | 4 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/lapic.c | 25 +-
> arch/x86/kvm/lapic.h | 2 +-
> arch/x86/kvm/mmu.h | 65 +-
> arch/x86/kvm/mmu/mmu.c | 232 +-
> arch/x86/kvm/mmu/mmu_internal.h | 84 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 +-
> arch/x86/kvm/mmu/spte.c | 48 +-
> arch/x86/kvm/mmu/spte.h | 40 +-
> arch/x86/kvm/mmu/tdp_iter.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 642 ++++-
> arch/x86/kvm/mmu/tdp_mmu.h | 16 +-
> arch/x86/kvm/svm/svm.c | 10 +-
> arch/x86/kvm/vmx/common.h | 155 ++
> arch/x86/kvm/vmx/main.c | 1026 ++++++++
> arch/x86/kvm/vmx/posted_intr.c | 8 +-
> arch/x86/kvm/vmx/seamcall.S | 55 +
> arch/x86/kvm/vmx/seamcall.h | 25 +
> arch/x86/kvm/vmx/tdx.c | 2337 +++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 253 ++
> arch/x86/kvm/vmx/tdx_arch.h | 158 ++
> arch/x86/kvm/vmx/tdx_errno.h | 29 +
> arch/x86/kvm/vmx/tdx_error.c | 22 +
> arch/x86/kvm/vmx/tdx_ops.h | 174 ++
> arch/x86/kvm/vmx/vmenter.S | 146 +
> arch/x86/kvm/vmx/vmx.c | 619 ++---
> arch/x86/kvm/vmx/x86_ops.h | 235 ++
> arch/x86/kvm/x86.c | 123 +-
> arch/x86/kvm/x86.h | 8 +
> arch/x86/virt/tdxcall.S | 8 +-
> arch/x86/virt/vmx/tdx.c | 50 +-
> arch/x86/virt/vmx/tdx.h | 52 -
> include/linux/kvm_host.h | 2 +
> include/uapi/linux/kvm.h | 1 +
> tools/arch/x86/include/uapi/asm/kvm.h | 59 +
> tools/include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 35 +-
> 52 files changed, 7142 insertions(+), 706 deletions(-)
> create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
> create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> create mode 100644 arch/x86/kvm/vmx/common.h
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/seamcall.S
> create mode 100644 arch/x86/kvm/vmx/seamcall.h
> create mode 100644 arch/x86/kvm/vmx/tdx.c
> create mode 100644 arch/x86/kvm/vmx/tdx.h
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> create mode 100644 arch/x86/kvm/vmx/x86_ops.h
>
On 4/8/22 18:24, Sean Christopherson wrote:
>> Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
>> interrupt to support TDX's usage of APICv. Unlike VMX, TDX doesn't have
>> access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
> Based on the discussion in the HLT patch, this is no longer true.
>
It's still true; it only has access to RVI > PPR (which is enough to check
if the vCPU is runnable).
> Rather than hook this path, I would rather we tag kvm_apic as having some of its
> state protected. Then kvm_cpu_has_interrupt() can invoke the alternative,
> protected-apic-only hook when appropriate, and kvm_apic_has_interrupt() can bail
> immediately instead of doing useless processing of stale vAPIC state.
Agreed, this is similar to my suggestion on the HLT patch:
https://lkml.kernel.org/r/[email protected]
Paolo
On 3/4/22 20:49, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> To protect the initial contents of the guest TD, the TDX module measures
> the guest TD during the build process as a SHA-384 measurement. The
> measurement of the guest TD contents needs to be completed to make the
> guest TD ready to run.
>
> Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
> KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
> to run.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
> tools/arch/x86/include/uapi/asm/kvm.h | 1 +
> 3 files changed, 23 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 77f46260d868..943219a08fcd 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
> KVM_TDX_INIT_VM,
> KVM_TDX_INIT_VCPU,
> KVM_TDX_INIT_MEM_REGION,
> + KVM_TDX_FINALIZE_VM,
>
> KVM_TDX_CMD_NR_MAX,
> };
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cd726c41d362..85d5f961d97e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1103,6 +1103,24 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> return ret;
> }
>
> +static int tdx_td_finalizemr(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + u64 err;
> +
> + if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
> + return -EINVAL;
> +
> + err = tdh_mr_finalize(kvm_tdx->tdr.pa);
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
> + return -EIO;
> + }
> +
> + kvm_tdx->finalized = true;
> + return 0;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -1123,6 +1141,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_TDX_INIT_MEM_REGION:
> r = tdx_init_mem_region(kvm, &tdx_cmd);
> break;
> + case KVM_TDX_FINALIZE_VM:
> + r = tdx_td_finalizemr(kvm);
> + break;
> default:
> r = -EINVAL;
> goto out;
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index 77f46260d868..943219a08fcd 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
> KVM_TDX_INIT_VM,
> KVM_TDX_INIT_VCPU,
> KVM_TDX_INIT_MEM_REGION,
> + KVM_TDX_FINALIZE_VM,
>
> KVM_TDX_CMD_NR_MAX,
> };
Reviewed-by: Paolo Bonzini <[email protected]>
Note however that errors should be passed back in the struct.
Paolo