From: Isaku Yamahata <[email protected]>
Hello. This is v6 the patch series vof KVM TDX support.
This is based on v5.18-rc3 + kvm/queue branch + TDX HOST patch series.
The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
Major changes from v5:
- initialize TDX module on loading kvm_intel.ko
This requires changes to other arch. I compile-tested only other arch.
Needs review by each KVM arch maintainer.
- introduced protected apic suggested by Sean Christopherson <[email protected]>
- use constants for non-present SPTE value
I tested on VMX, but complie test only for SVM.
- introduced debug mode to enable #VE suppressbit for VMX and warn on #VE exit
TODO:
- 2M large page support. It's work-in-progress.
How to run/test:
It's describe at
https://github.com/intel/tdx/blob/kvm-upstream-workaround/KVM-TDX.README.md
Trello:
I've created to track details. If you want to update items, please let me know.
https://trello.com/b/B1cLGCcA/kvm-tdx
Thanks,
Isaku Yamahata
Changes from v5:
- export __seamcall and use it
- move mutex lock from callee function of smp_call_on_cpu to the caller.
- rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
- updated comment
- drop the use of tdh_mng_key.reclaimid(): as the function is for backward
compatibility to only return success
- struct kvm_tdx_cmd: metadata => flags, added __u64 error.
- make this ioctl systemwide ioctl
- ABI change to struct kvm_init_vm
- guest_tsc_khz: use kvm->arch.default_tsc_khz
- rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
- drop exporting kvm_set_tsc_khz().
- fix kvm_tdp_page_fault() for mtrr emulation
- rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
- drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
- update commit message
- rename shadow_init_value => shadow_nonprsent_value
- added ept_violation_ve_test mode
- shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
- legacy MMU case
=> - mmu_topup_shadow_page_cache(), kvm_mmu_create()
- FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
- #VE warning:
- rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
- merge into Like we discussed, this patch should be merged with patch
"KVM: x86/mmu: Allow non-zero init value for shadow PTE".
- fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
- introduce kvm_gfn_for_root(kvm, root, gfn)
- add only_shared argument to kvm_tdp_mmu_handle_gfn()
- use kvm_arch_dirty_log_supported()
- rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
- rename: is_private_prohibit_spte() => spte_shared_mask()
- fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
- dropped this patch as the change was merged into kvm/queue
- update vt_apicv_post_state_restore()
- use is_64_bit_hypercall()
- comment: expand MSMI -> Machine Check System Management Interrupt
- fixed TDX_SEPT_PFERR
- tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
- rename tdmvcall_exit_readon() => tdvmcall_leaf()
- remove optional zero check of argument.
- do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
in kvm_vcpu_ioctl_smi and __apic_accept_irq.
- WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
- introduce vcpu_deliver_init to x86_ops
- sprinkeled KVM_BUG_ON()
Changes from v4:
- rebased to TDX host kernel patch series.
- include all the patches to make this patch series working.
- add [MARKER] patches to mark the patch layer clear.
---
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.
A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).
In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.
* The organization of this patch series
This patch series is on top of the patches series "TDX host kernel support":
https://lore.kernel.org/lkml/[email protected]/
this patch series is available at
https://github.com/intel/tdx/releases/tag/kvm-upstream
The corresponding patches to qemu are available at
https://github.com/intel/qemu-tdx/commits/tdx-upstream
The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.
The below layers are chosen so that the device model, for example, qemu can
exercise each layering step by step. Check if TDX is supported, create TD VM,
create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
vcpu exits/hypercalls/interrupts to run TD fully.
TDX vcpu
interrupt/exits/hypercall<------------\
^ |
| |
TD finalization |
^ |
| |
TDX EPT violation<------------\ |
^ | |
| | |
TD vcpu enter/exit | |
^ | |
| | |
TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
^ | ^
| | |
TD VM creation/destruction \---------------KVM TDP MMU hooks
^ ^
| |
TDX architectural definitions KVM TDP refactoring for TDX
^ ^
| |
TDX, VMX <--------TDX host kernel KVM MMU GPA share mask
coexistence support
The followings are explanations of each layer. Each layer has a dummy commit
that starts with [MARKER] in subject. It is intended to help to identify where
each layer starts.
TDX host kernel support:
https://lore.kernel.org/lkml/[email protected]/
The guts of system-wide initialization of TDX module. There is an
independent patch series for host x86. TDX KVM patches call functions
this patch series provides to initialize the TDX module.
TDX, VMX coexistence:
Infrastructure to allow TDX to coexist with VMX and trigger the
initialization of the TDX module.
This layer starts with
"KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
Add TDX architectural definitions and helper functions
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
Guest TD creation/destroy allocation and releasing of TDX specific vm
and vcpu structure. Create an initial guest memory image with TDX
measurement.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
guest TD creation/destroy Allocation and releasing of TDX specific vm
and vcpu structure. Create an initial guest memory image with TDX
measurement.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
Create an initial guest memory image with TDX measurement. Handle
secure EPT violations to populate guest pages with TDX SEAMCALLs.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
entering into TD. Restore CPU state after exiting from TD.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
Handle various exits/hypercalls and allow interrupts to be injected so
that TD vcpu can continue running.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
KVM MMU GPA shared bit:
Introduce framework to handle shared bit repurposed bit of GPA TDX
repurposed a bit of GPA to indicate shared or private. If it's shared,
it's the same as the conventional VMX EPT case. VMM can access shared
guest pages. If it's private, it's handled by Secure-EPT and the guest
page is encrypted.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
TDX Secure EPT requires different constants. e.g. initial value EPT
entry value etc. Various refactoring for those differences.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
Introduce framework to TDP MMU to add hooks in addition to direct EPT
access TDX added Secure EPT which is an enhancement to VMX EPT. Unlike
conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
use TDX SEAMCALLs to operate on Secure EPT.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
Introduce framework to handle switching guest pages from private/shared
to shared/private. For a given GPA, a guest page can be assigned to a
private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
guest TD converts GPA assignments from private (or shared) to shared (or
private).
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
Guest private memory requires different memory management in KVM. The
patch proposes a way for it. Integration with TDX KVM.
(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.
The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.
TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode. A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and encrypted (using the guest TD's ephemeral private key). At a high
level, TDCS holds information for controlling TD operation as a whole,
execution, EPTP, MSR bitmaps, etc that KVM needs to set it up. Note that MSR
bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
same value for all VCPUs of the same TD.
Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.
Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM. These control
structures are pointed to by fields in the TD VMCS.
The above means that 1) KVM needs to allocate different data structures for TDs,
2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
define TD-specific handling for others. 3) Redirect operations to . 3)
Redirect operations to the TDX specific callbacks, like "if (is_td_vcpu(vcpu))
tdx_callback() else vmx_callback();".
*TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.
In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.
During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX. Since guest TDs usually require I/O, and the data exchange
needs to be done via shared memory, thus KVM needs to use the current EPT
functionality even for TDs.
* Secure EPT and Minoring using the TDP code
The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT (why
"subset"). Since execution of such interface functions takes much longer time
than accessing memory directly, in KVM we use the existing TDP code to minor the
Secure EPT for the TD.
This way, we can effectively walk Secure EPT without using the TDX interface
functions.
* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.
* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
owns TSC virtualization for VMs, but the TDX module does for TDs.
* MCE support for TDs
The TDX module doesn't allow VMM to inject MCE. Instead PV way is needed for TD
to communicate with VMM. For now, KVM silently ignores MCE request by VMM. MSRs
related to MCE (e.g, MCE bank registers) can be naturally emulated by
paravirtualizing MSR access.
[1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
available.
* Restrictions or future work
Some features are not included to reduce patch size. Those features are
addressed as future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more
* Prerequisites
It's required to load the TDX module and initialize it. It's out of the scope
of this patch series. Another independent patch for the common x86 code is
planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
CONFIG_INTEL_TDX_HOST. It's assumed that With CONFIG_INTEL_TDX_HOST=y, the TDX
module is initialized and ready for KVM to use the TDX module APIs for TDX guest
life cycle like tdh.mng.init are ready to use.
Concretely Global initialization, LP (Logical Processor) initialization, global
configuration, the key configuration, and TDMR and PAMT initialization are done.
The state of the TDX module is SYS_READY. Please refer to the TDX module
specification, the chapter Intel TDX Module Lifecycle State Machine
** Detecting the TDX module readiness.
TDX host patch series implements the detection of the TDX module availability
and its initialization so that KVM can use it. Also it manages Host KeyID
(HKID) assigned to guest TD.
The assumed APIs the TDX host patch series provides are
- int seamrr_enabled()
Check if required cpu feature (SEAM mode) is available. This only check CPU
feature availability. At this point, the TDX module may not be ready for KVM
to use.
- int init_tdx(void);
Initialization of TDX module so that the TDX module is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
Return the system wide information about the TDX module. NULL if the TDX
isn't initialized.
- u32 tdx_get_global_keyid(void);
Return global key id that is used for the TDX module itself.
- int tdx_keyid_alloc(void);
Allocate HKID for guest TD.
- void tdx_keyid_free(int keyid);
Free HKID for guest TD.
(****)
* TDX KVM high-level design
- Host key ID management
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
It is assumed The TDX host patch series implements necessary functions,
u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void) and,
void tdx_keyid_free(int keyid).
- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
identify the VM, introduce VM-type to specify which VM type, VMX (default) or
TDX, is used.
- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, set initial guest memory and measurement.
The creation of TDX VM requires five additional operations in addition to the
conventional VM creation.
- Get KVM system capability to check if TDX VM type is supported
- VM creation (KVM_CREATE_VM)
- New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
- New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
- VCPU creation (KVM_CREATE_VCPU)
- New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
- New: Initialize guest memory as boot state and extend the measurement with
the memory. KVM_TDX_INIT_MEM_REGION.
- New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
TDX VM contents.
- VCPU RUN (KVM_VCPU_RUN)
- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on them. For example, accessing CPU registers, injecting
exceptions, and accessing guest memory. Those operations are handled as
silently ignored, returning zero or initial reset value when it's requested via
KVM API ioctls.
VM/VCPU state and callbacks for TDX specific operations.
Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
Operations on the CPU state
silently ignore operations on the guest state. For example, the write to
CPU registers is ignored and the read from CPU registers returns 0.
. ignore access to CPU registers except for allowed ones.
. TSC: add a check if tsc is immutable and return an error. Because the KVM
implementation updates the internal tsc state and it's difficult to back
out those changes. Instead, skip the logic.
. dirty logging: add check if dirty logging is supported.
. exceptions/SMI/MCE/SIPI/INIT: silently ignore
Note: virtual external interrupt and NMI can be injected into TDX guests.
- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
the guest physical address is private (the bit is cleared) or shared (the bit is
set). The bits are called stolen bits.
- Stolen bits framework
systematically tracks which guest physical address, shared or private, is
used.
- Shared EPT and secure EPT
There are two EPTs. Shared EPT (the conventional one) and Secure
EPT(the new one). Shared EPT is handled the same for the stolen
bit set. Secure EPT points to private guest pages. To resolve
EPT violation, KVM walks one of two EPTs based on faulted GPA.
Because it's costly to access secure EPT during walking EPTs with
SEAMCALLs for the private guest physical address, another private
EPT is used as a shadow of Secure-EPT with the existing logic at
the cost of extra memory.
The following depicts the relationship.
KVM | TDX module
| | |
-------------+---------- | |
| | | |
V V | |
shared GPA private GPA | |
CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
| | | |
| | | |
V V | V
shared EPT private EPT--------mirror----->Secure EPT
| | | |
| \--------------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
|
non-encrypted memory | encrypted memory
|
- Operating on Secure EPT
Use the TDX module APIs to operate on Secure EPT. To call the TDX API
during resolving EPT violation, add hooks to additional operation and wiring
it to TDX backend.
* References
[1] TDX specification
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
https://cdrdv2.intel.com/v1/dl/getContent/726790
[3] Intel CPU Architectural Extensions Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 Specification
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
[5] Intel TDX Loader Interface Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
https://cdrdv2.intel.com/v1/dl/getContent/726790
[7] Intel TDX Virtual Firmware Design Guide
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
[8] intel public github
kvm TDX branch: https://github.com/intel/tdx/tree/kvm
TDX guest branch: https://github.com/intel/tdx/tree/guest
qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
https://github.com/tianocore/edk2-staging/tree/TDVF
This was merged into EDK2 main branch. https://github.com/tianocore/edk2
Chao Gao (3):
KVM: x86: Move check_processor_compatibility from init ops to runtime
ops
Partially revert "KVM: Pass kvm_init()'s opaque param to additional
arch funcs"
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
wrmsr
Isaku Yamahata (74):
KVM: Refactor CPU compatibility check on module initialiization
x86/virt/vmx/tdx: export platform_has_tdx
KVM: TDX: Detect CPU feature on kernel module initialization
KVM: x86: Refactor KVM VMX module init/exit functions
KVM: TDX: Add placeholders for TDX VM/vcpu structure
x86/virt/tdx: Add a helper function to return system wide info about
TDX module
KVM: TDX: Initialize TDX module when loading kvm_intel.ko
KVM: TDX: Make TDX VM type supported
[MARKER] The start of TDX KVM patch series: TDX architectural
definitions
KVM: TDX: Define TDX architectural definitions
KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
KVM: TDX: Add helper functions to print TDX SEAMCALL error
[MARKER] The start of TDX KVM patch series: TD VM creation/destruction
x86/cpu: Add helper functions to allocate/free TDX private host key id
KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
KVM: TDX: Make KVM_CAP_SET_IDENTITY_MAP_ADDR unsupported for TDX
KVM: TDX: Make pmu_intel.c ignore guest TD case
[MARKER] The start of TDX KVM patch series: TD vcpu
creation/destruction
KVM: TDX: allocate/free TDX vcpu structure
KVM: TDX: allocate/free TDX vcpu structure
[MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
KVM: x86/mmu: introduce config for PRIVATE KVM MMU
[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
TDX
KVM: x86/mmu: Disallow fast page fault on private GPA
KVM: VMX: Introduce test mode related to EPT violation VE
[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
KVM: x86/mmu: Focibly use TDP MMU for TDX
KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
[MARKER] The start of TDX KVM patch series: TDX EPT violation
KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
KVM: TDX: TDP MMU TDX support
[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
KVM: x86/mmu: steal software usable git to record if GFN is for shared
or not
KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
[MARKER] The start of TDX KVM patch series: TD finalization
KVM: TDX: Create initial guest memory
KVM: TDX: Finalize VM initialization
[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
KVM: TDX: Add helper assembly function to TDX vcpu
KVM: TDX: Implement TDX vcpu enter/exit path
KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
KVM: TDX: restore host xsave state when exit from the guest TD
KVM: TDX: restore user ret MSRs
[MARKER] The start of TDX KVM patch series: TD vcpu
exits/interrupts/hypercalls
KVM: TDX: complete interrupts after tdexit
KVM: TDX: restore debug store when TD exit
KVM: TDX: handle vcpu migration over logical processor
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
behavior
KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
KVM: TDX: Implement interrupt injection
KVM: TDX: Implements vcpu request_immediate_exit
KVM: TDX: Implement methods to inject NMI
KVM: TDX: Add a place holder to handle TDX VM exit
KVM: TDX: handle EXIT_REASON_OTHER_SMI
KVM: TDX: handle ept violation/misconfig exit
KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
KVM: TDX: Add a place holder for handler of TDX hypercalls
(TDG.VP.VMCALL)
KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
KVM: TDX: Handle TDX PV CPUID hypercall
KVM: TDX: Handle TDX PV HLT hypercall
KVM: TDX: Handle TDX PV port io hypercall
KVM: TDX: Implement callbacks for MSR operations for TDX
KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
KVM: TDX: Handle TDX PV report fatal error hypercall
KVM: TDX: Handle TDX PV map_gpa hypercall
KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
KVM: TDX: Silently discard SMI request
KVM: TDX: Silently ignore INIT/SIPI
Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
[MARKER] the end of (the first phase of) TDX KVM patch series
Rick Edgecombe (1):
KVM: x86/mmu: Add address conversion functions for TDX shared bits
Sean Christopherson (25):
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
KVM: Enable hardware before doing arch VM initialization
KVM: x86: Introduce vm_type to differentiate default VMs from
confidential VMs
KVM: TDX: Add TDX "architectural" error codes
KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
KVM: TDX: create/destroy VM structure
KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
KVM: TDX: Do TDX specific vcpu initialization
KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
KVM: x86/mmu: Allow non-zero value for non-present SPTE
KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
KVM: x86/mmu: Allow per-VM override of the TDP max page level
KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
private mmu
KVM: x86/mmu: Disallow dirty logging for x86 TDX
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: VMX: Move setting of EPT MMU masks to common VT-x code
KVM: TDX: Add load_mmu_pgd method for TDX
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
KVM: TDX: Add support for find pending IRQ in a protected local APIC
KVM: x86: Assume timer IRQ was injected if APIC state is proteced
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
argument
KVM: VMX: Move NMI/exception handler to common helper
KVM: x86: Split core of hypercall emulation to helper function
KVM: TDX: Handle TDX PV MMIO hypercall
KVM: TDX: Add methods to ignore accesses to CPU state
Xiaoyao Li (1):
KVM: TDX: initialize VM with TDX specific parameters
Documentation/virt/kvm/api.rst | 30 +-
Documentation/virt/kvm/intel-tdx.rst | 381 ++++
Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 +++++
arch/arm64/kvm/arm.c | 2 +-
arch/mips/kvm/mips.c | 14 +-
arch/powerpc/kvm/powerpc.c | 2 +-
arch/riscv/kvm/main.c | 2 +-
arch/s390/kvm/kvm-s390.c | 2 +-
arch/x86/events/intel/ds.c | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 10 +
arch/x86/include/asm/kvm_host.h | 56 +-
arch/x86/include/asm/tdx.h | 66 +
arch/x86/include/asm/vmx.h | 14 +
arch/x86/include/uapi/asm/kvm.h | 95 +
arch/x86/include/uapi/asm/vmx.h | 5 +-
arch/x86/kvm/Kconfig | 4 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/irq.c | 3 +
arch/x86/kvm/lapic.c | 37 +-
arch/x86/kvm/lapic.h | 2 +
arch/x86/kvm/mmu.h | 48 +-
arch/x86/kvm/mmu/mmu.c | 371 +++-
arch/x86/kvm/mmu/mmu_internal.h | 120 ++
arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
arch/x86/kvm/mmu/spte.c | 46 +-
arch/x86/kvm/mmu/spte.h | 65 +-
arch/x86/kvm/mmu/tdp_iter.c | 1 +
arch/x86/kvm/mmu/tdp_iter.h | 5 +-
arch/x86/kvm/mmu/tdp_mmu.c | 683 ++++++-
arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
arch/x86/kvm/svm/svm.c | 13 +-
arch/x86/kvm/vmx/common.h | 154 ++
arch/x86/kvm/vmx/evmcs.c | 2 +-
arch/x86/kvm/vmx/evmcs.h | 2 +-
arch/x86/kvm/vmx/main.c | 1073 ++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 33 +
arch/x86/kvm/vmx/pmu_intel.h | 29 +
arch/x86/kvm/vmx/posted_intr.c | 43 +-
arch/x86/kvm/vmx/posted_intr.h | 13 +
arch/x86/kvm/vmx/tdx.c | 2470 ++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 275 +++
arch/x86/kvm/vmx/tdx_arch.h | 157 ++
arch/x86/kvm/vmx/tdx_errno.h | 29 +
arch/x86/kvm/vmx/tdx_error.c | 22 +
arch/x86/kvm/vmx/tdx_ops.h | 188 ++
arch/x86/kvm/vmx/vmenter.S | 146 ++
arch/x86/kvm/vmx/vmx.c | 716 +++----
arch/x86/kvm/vmx/vmx.h | 41 +-
arch/x86/kvm/vmx/x86_ops.h | 235 +++
arch/x86/kvm/x86.c | 155 +-
arch/x86/virt/vmx/tdx/seamcall.S | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 53 +-
arch/x86/virt/vmx/tdx/tdx.h | 52 -
include/linux/kvm_host.h | 4 +-
include/uapi/linux/kvm.h | 2 +
tools/arch/x86/include/uapi/asm/kvm.h | 95 +
tools/include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 67 +-
58 files changed, 7839 insertions(+), 783 deletions(-)
create mode 100644 Documentation/virt/kvm/intel-tdx.rst
create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
create mode 100644 arch/x86/kvm/vmx/common.h
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx.h
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/kvm/vmx/tdx_error.c
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
--
2.25.1
From: Sean Christopherson <[email protected]>
The difference of TDX EPT violation is how to retrieve information, GPA,
and exit qualification. To share the code to handle EPT violation, split
out the guts of EPT violation handler so that VMX/TDX exit handler can call
it after retrieving GPA and exit qualification.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/common.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 32 ++++++--------------------------
2 files changed, 39 insertions(+), 26 deletions(-)
create mode 100644 arch/x86/kvm/vmx/common.h
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..235908f3e044
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long exit_qualification)
+{
+ u64 error_code;
+
+ /* Is it a read fault? */
+ error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+ ? PFERR_USER_MASK : 0;
+ /* Is it a write fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+ ? PFERR_WRITE_MASK : 0;
+ /* Is it a fetch fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+ ? PFERR_FETCH_MASK : 0;
+ /* ept page table entry is present? */
+ error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+ ? PFERR_PRESENT_MASK : 0;
+
+ error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
+ PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+ return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 60dedae31426..1a2e8d195891 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -50,6 +50,7 @@
#include <asm/vmx.h>
#include "capabilities.h"
+#include "common.h"
#include "cpuid.h"
#include "evmcs.h"
#include "hyperv.h"
@@ -5513,11 +5514,10 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
- unsigned long exit_qualification;
- gpa_t gpa;
- u64 error_code;
+ unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
+ gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
- exit_qualification = vmx_get_exit_qual(vcpu);
+ trace_kvm_page_fault(gpa, exit_qualification);
/*
* EPT violation happened while executing iret from NMI,
@@ -5526,29 +5526,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
* AAK134, BY25.
*/
if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
- enable_vnmi &&
- (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+ enable_vnmi && (exit_qualification & INTR_INFO_UNBLOCK_NMI))
vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
- gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
- trace_kvm_page_fault(gpa, exit_qualification);
-
- /* Is it a read fault? */
- error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
- ? PFERR_USER_MASK : 0;
- /* Is it a write fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
- ? PFERR_WRITE_MASK : 0;
- /* Is it a fetch fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
- ? PFERR_FETCH_MASK : 0;
- /* ept page table entry is present? */
- error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
- ? PFERR_PRESENT_MASK : 0;
-
- error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
- PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
vcpu->arch.exit_qualification = exit_qualification;
/*
@@ -5562,7 +5542,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);
- return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
}
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
--
2.25.1
From: Isaku Yamahata <[email protected]>
Define architectural definitions for KVM to issue the TDX SEAMCALLs.
Structures and values that are architecturally defined in the TDX module
specifications the chapter of ABI Reference.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx_arch.h | 157 ++++++++++++++++++++++++++++++++++++
1 file changed, 157 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..94258056d742
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER 0
+#define TDH_MNG_ADDCX 1
+#define TDH_MEM_PAGE_ADD 2
+#define TDH_MEM_SEPT_ADD 3
+#define TDH_VP_ADDCX 4
+#define TDH_MEM_PAGE_RELOCATE 5
+#define TDH_MEM_PAGE_AUG 6
+#define TDH_MEM_RANGE_BLOCK 7
+#define TDH_MNG_KEY_CONFIG 8
+#define TDH_MNG_CREATE 9
+#define TDH_VP_CREATE 10
+#define TDH_MNG_RD 11
+#define TDH_MR_EXTEND 16
+#define TDH_MR_FINALIZE 17
+#define TDH_VP_FLUSH 18
+#define TDH_MNG_VPFLUSHDONE 19
+#define TDH_MNG_KEY_FREEID 20
+#define TDH_MNG_INIT 21
+#define TDH_VP_INIT 22
+#define TDH_VP_RD 26
+#define TDH_MNG_KEY_RECLAIMID 27
+#define TDH_PHYMEM_PAGE_RECLAIM 28
+#define TDH_MEM_PAGE_REMOVE 29
+#define TDH_MEM_SEPT_REMOVE 30
+#define TDH_MEM_TRACK 38
+#define TDH_MEM_RANGE_UNBLOCK 39
+#define TDH_PHYMEM_CACHE_WB 40
+#define TDH_PHYMEM_PAGE_WBINVD 41
+#define TDH_VP_WR 43
+#define TDH_SYS_LP_SHUTDOWN 44
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO 0x10000
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDG_VP_VMCALL_GET_QUOTE 0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH BIT_ULL(63)
+#define TDX_CLASS_SHIFT 56
+#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field) \
+ (((non_arch) ? TDX_NON_ARCH : 0) | \
+ ((u64)(class) << TDX_CLASS_SHIFT) | \
+ ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field) \
+ __BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
+ __BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field) BUILD_TDX_FIELD(0, (field))
+
+enum tdx_guest_other_state {
+ TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+ struct {
+ u64 vmxip : 1;
+ u64 reserved : 63;
+ };
+ u64 full;
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field) BUILD_TDX_FIELD(17, (field))
+#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(17, (field))
+
+/* Management class fields */
+enum tdx_guest_management {
+ TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_guest_management */
+#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(32, (field))
+
+enum tdx_tdcs_execution_control {
+ TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field) BUILD_TDX_FIELD(17, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE 256
+
+struct tdx_cpuid_value {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
+#define TDX_TD_ATTRIBUTE_PKS BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+ u64 attributes;
+ u64 xfam;
+ u32 max_vcpus;
+ u32 reserved0;
+
+ u64 eptp_controls;
+ u64 exec_controls;
+ u16 tsc_frequency;
+ u8 reserved1[38];
+
+ u64 mrconfigid[6];
+ u64 mrowner[6];
+ u64 mrownerconfig[6];
+ u64 reserved2[4];
+
+ union {
+ struct tdx_cpuid_value cpuid_values[0];
+ u8 reserved3[768];
+ };
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz. The
+ * frequency must be between 100mhz and 10ghz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
+
+#endif /* __KVM_X86_TDX_ARCH_H */
--
2.25.1
From: Sean Christopherson <[email protected]>
Stub in kvm_tdx, vcpu_tdx, and their various accessors. TDX defines
SEAMCALL APIs to access TDX control structures corresponding to the VMX
VMCS. Introduce helper accessors to hide its SEAMCALL ABI details.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.h | 103 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 101 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 2f43db5bbefb..f50d37f3fc9c 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -3,16 +3,29 @@
#define __KVM_X86_TDX_H
#ifdef CONFIG_INTEL_TDX_HOST
+
+#include "tdx_ops.h"
+
int tdx_module_setup(void);
+struct tdx_td_page {
+ unsigned long va;
+ hpa_t pa;
+ bool added;
+};
+
struct kvm_tdx {
struct kvm kvm;
- /* TDX specific members follow. */
+
+ struct tdx_td_page tdr;
+ struct tdx_td_page *tdcs;
};
struct vcpu_tdx {
struct kvm_vcpu vcpu;
- /* TDX specific members follow. */
+
+ struct tdx_td_page tdvpr;
+ struct tdx_td_page *tdvpx;
};
static inline bool is_td(struct kvm *kvm)
@@ -34,6 +47,92 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
{
return container_of(vcpu, struct vcpu_tdx, vcpu);
}
+
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+ BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
+ "Read/Write to TD VMCS *_HIGH fields not supported");
+
+ BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+ BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+ (((field) & 0x6000) == 0x2000 ||
+ ((field) & 0x6000) == 0x6000),
+ "Invalid TD VMCS access for 64-bit field");
+ BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+ ((field) & 0x6000) == 0x4000,
+ "Invalid TD VMCS access for 32-bit field");
+ BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+ ((field) & 0x6000) == 0x0000,
+ "Invalid TD VMCS access for 16-bit field");
+}
+
+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
+ u32 field) \
+{ \
+ struct tdx_module_output out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_rd(tdx->tdvpr.pa, TDVPS_##uclass(field), &out); \
+ if (unlikely(err)) { \
+ pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n", \
+ field, err); \
+ return 0; \
+ } \
+ return (u##bits)out.r8; \
+} \
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx, \
+ u32 field, u##bits val) \
+{ \
+ struct tdx_module_output out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), val, \
+ GENMASK_ULL(bits - 1, 0), &out); \
+ if (unlikely(err)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n", \
+ field, (u64)val, err); \
+} \
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ struct tdx_module_output out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), bit, bit, \
+ &out); \
+ if (unlikely(err)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n", \
+ field, bit, err); \
+} \
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ struct tdx_module_output out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), 0, bit, \
+ &out); \
+ if (unlikely(err)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n", \
+ field, bit, err); \
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
#else
static inline int tdx_module_setup(void) { return -ENODEV; };
--
2.25.1
From: Isaku Yamahata <[email protected]>
This empty commit is to mark the start of patch series of TD finalization.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 5797d172176d..53897312699f 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -21,11 +21,11 @@ Patch Layer status
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
-* TD finalization: Not yet
+* TD finalization: Applying
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA shared bits: Applied
* KVM TDP refactoring for TDX: Applied
* KVM TDP MMU hooks: Applied
-* KVM TDP MMU MapGPA: Not yet
+* KVM TDP MMU MapGPA: Applied
--
2.25.1
From: Isaku Yamahata <[email protected]>
Factor out non-leaf SPTE population logic from kvm_tdp_mmu_map(). MapGPA
hypercall needs to populate non-leaf SPTE to record which GPA, private or
shared, is allowed in the leaf EPT entry.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b7e13061e57d..9e015b3e0578 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1149,6 +1149,24 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
return 0;
}
+static int tdp_mmu_populate_nonleaf(
+ struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
+{
+ struct kvm_mmu_page *sp;
+ int ret;
+
+ WARN_ON(is_shadow_present_pte(iter->old_spte));
+ WARN_ON(is_removed_spte(iter->old_spte));
+
+ sp = tdp_mmu_alloc_sp(vcpu);
+ tdp_mmu_init_child_sp(sp, iter);
+
+ ret = tdp_mmu_link_sp(vcpu->kvm, iter, sp, account_nx, true);
+ if (ret)
+ tdp_mmu_free_sp(sp);
+ return ret;
+}
+
/*
* Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
* page tables and SPTEs to translate the faulting guest physical address.
@@ -1157,7 +1175,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
struct tdp_iter iter;
- struct kvm_mmu_page *sp;
int ret;
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1203,13 +1220,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_removed_spte(iter.old_spte))
break;
- sp = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_child_sp(sp, &iter);
-
- if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
- tdp_mmu_free_sp(sp);
+ if (tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
break;
- }
}
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
A VMM interacts with the TDX module using a new instruction (SEAMCALL). A
TDX VMM uses SEAMCALLs where a VMX VMM would have directly interacted with
VMX instructions. For instance, a TDX VMM does not have full access to the
VM control structure corresponding to VMX VMCS. Instead, a VMM induces the
TDX module to act on behalf via SEAMCALLs.
Export __seamcall and define C wrapper functions for SEAMCALLs for
readability. Some SEAMCALL APIs donates pages to TDX module or guest TD.
The pages are encrypted with TDX private host key id set in high bits of
physical address. If any modified cache lines may exit for these pages,
flush them to memory by clflush_cache_range().
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kvm/vmx/tdx_ops.h | 185 +++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/seamcall.S | 1 +
3 files changed, 188 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0670c86de015..3a4fb2844f66 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -149,6 +149,8 @@ int tdx_detect(void);
int tdx_init(void);
bool platform_has_tdx(void);
const struct tdsysinfo_struct *tdx_get_sysinfo(void);
+u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
#else
static inline bool __seamrr_enabled(void) { return false; }
static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..85adbf49c277
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,185 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/cacheflush.h>
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return __seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return __seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(page), PAGE_SIZE);
+ return __seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_SEPT_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return __seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return __seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return __seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+ clflush_cache_range(__va(tdr), PAGE_SIZE);
+ return __seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+ clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+ return __seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MNG_RD, tdr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+ return __seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+ return __seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, out);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+ return __seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_VP_RD, tdvpr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+ return __seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+ return __seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+ return __seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+ return __seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_VP_WR, tdvpr, field, val, mask, out);
+}
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_OPS_H */
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index 8df7a16f7685..b4fc8182e1cf 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -50,3 +50,4 @@ SYM_FUNC_START(__seamcall)
FRAME_END
ret
SYM_FUNC_END(__seamcall)
+EXPORT_SYMBOL_GPL(__seamcall)
--
2.25.1
From: Sean Christopherson <[email protected]>
Export kvm_io_bus_read and kvm_mmio tracepoint and wire up TDX PV MMIO
hypercall to the KVM backend functions.
kvm_io_bus_read/write() searches KVM device emulated in kernel of the given
MMIO address and emulates the MMIO. As TDX PV MMIO also needs it, export
kvm_io_bus_read(). kvm_io_bus_write() is already exported. TDX PV MMIO
emulates some of MMIO itself. To add trace point consistently with x86
kvm, export kvm_mmio tracepoint.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
virt/kvm/kvm_main.c | 2 +
3 files changed, 117 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ee0cf5336ade..6ab4a52fc9e9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1057,6 +1057,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
return ret;
}
+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+ unsigned long val = 0;
+ gpa_t gpa;
+ int size;
+
+ WARN_ON(vcpu->mmio_needed != 1);
+ vcpu->mmio_needed = 0;
+
+ if (!vcpu->mmio_is_write) {
+ gpa = vcpu->mmio_fragments[0].gpa;
+ size = vcpu->mmio_fragments[0].len;
+
+ memcpy(&val, vcpu->run->mmio.data, size);
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ }
+ return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+ unsigned long val)
+{
+ if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+ return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+ unsigned long val;
+
+ if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+ struct kvm_memory_slot *slot;
+ int size, write, r;
+ unsigned long val;
+ gpa_t gpa;
+
+ WARN_ON(vcpu->mmio_needed);
+
+ size = tdvmcall_a0_read(vcpu);
+ write = tdvmcall_a1_read(vcpu);
+ gpa = tdvmcall_a2_read(vcpu);
+ val = write ? tdvmcall_a3_read(vcpu) : 0;
+
+ if (size != 1 && size != 2 && size != 4 && size != 8)
+ goto error;
+ if (write != 0 && write != 1)
+ goto error;
+
+ /* Strip the shared bit, allow MMIO with and without it set. */
+ gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
+
+ if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
+ goto error;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+ if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
+ goto error;
+
+ if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+ trace_kvm_fast_mmio(gpa);
+ return 1;
+ }
+
+ if (write)
+ r = tdx_mmio_write(vcpu, gpa, size, val);
+ else
+ r = tdx_mmio_read(vcpu, gpa, size);
+ if (!r) {
+ /* Kernel completed device emulation. */
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+ }
+
+ /* Request the device emulation to userspace device model. */
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_is_write = write;
+ vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+ vcpu->run->mmio.phys_addr = gpa;
+ vcpu->run->mmio.len = size;
+ vcpu->run->mmio.is_write = write;
+ vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+ if (write) {
+ memcpy(vcpu->run->mmio.data, &val, size);
+ } else {
+ vcpu->mmio_fragments[0].gpa = gpa;
+ vcpu->mmio_fragments[0].len = size;
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+ }
+ return 0;
+
+error:
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1069,6 +1181,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_hlt(vcpu);
case EXIT_REASON_IO_INSTRUCTION:
return tdx_emulate_io(vcpu);
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_emulate_mmio(vcpu);
default:
break;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5f291470a6f6..f367d0dcef97 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13166,6 +13166,7 @@ bool kvm_arch_dirty_log_supported(struct kvm *kvm)
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4bf7178e42bd..7f01131666de 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2294,6 +2294,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
return NULL;
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
{
@@ -5169,6 +5170,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
r = __kvm_io_bus_read(vcpu, bus, &range, val);
return r < 0 ? r : 0;
}
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
/* Caller must hold slots_lock. */
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
--
2.25.1
From: Isaku Yamahata <[email protected]>
TDX defines an API to run TDX vcpu with its own ABI. Define an assembly
helper function to run TDX vcpu to hide the special ABI so that C code can
call it with function call ABI.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/vmenter.S | 146 +++++++++++++++++++++++++++++++++++++
1 file changed, 146 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 435c187927c4..f655bcca0e93 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,6 +2,7 @@
#include <linux/linkage.h>
#include <asm/asm.h>
#include <asm/bitsperlong.h>
+#include <asm/errno.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
#include <asm/segment.h>
@@ -28,6 +29,13 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif
+#ifdef CONFIG_INTEL_TDX_HOST
+#define TDENTER 0
+#define EXIT_REASON_TDCALL 77
+#define TDENTER_ERROR_BIT 63
+#define seamcall .byte 0x66,0x0f,0x01,0xcf
+#endif
+
.section .noinstr.text, "ax"
/**
@@ -328,3 +336,141 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
pop %_ASM_BP
RET
SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+.pushsection .noinstr.text, "ax"
+
+/**
+ * __tdx_vcpu_run - Call SEAMCALL(TDENTER) to run a TD vcpu
+ * @tdvpr: physical address of TDVPR
+ * @regs: void * (to registers of TDVCPU)
+ * @gpr_mask: non-zero if guest registers need to be loaded prior to TDENTER
+ *
+ * Returns:
+ * TD-Exit Reason
+ *
+ * Note: KVM doesn't support using XMM in its hypercalls, it's the HyperV
+ * code's responsibility to save/restore XMM registers on TDVMCALL.
+ */
+SYM_FUNC_START(__tdx_vcpu_run)
+ push %rbp
+ mov %rsp, %rbp
+
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+ push %rbx
+
+ /* Save @regs, which is needed after TDENTER to capture output. */
+ push %rsi
+
+ /* Load @tdvpr to RCX */
+ mov %rdi, %rcx
+
+ /* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
+ test %dx, %dx
+ je 1f
+
+ /* Load @regs to RAX, which will be clobbered with $TDENTER anyways. */
+ mov %rsi, %rax
+
+ mov VCPU_RBX(%rax), %rbx
+ mov VCPU_RDX(%rax), %rdx
+ mov VCPU_RBP(%rax), %rbp
+ mov VCPU_RSI(%rax), %rsi
+ mov VCPU_RDI(%rax), %rdi
+
+ mov VCPU_R8 (%rax), %r8
+ mov VCPU_R9 (%rax), %r9
+ mov VCPU_R10(%rax), %r10
+ mov VCPU_R11(%rax), %r11
+ mov VCPU_R12(%rax), %r12
+ mov VCPU_R13(%rax), %r13
+ mov VCPU_R14(%rax), %r14
+ mov VCPU_R15(%rax), %r15
+
+ /* Load TDENTER to RAX. This kills the @regs pointer! */
+1: mov $TDENTER, %rax
+
+2: seamcall
+
+ /* Skip to the exit path if TDENTER failed. */
+ bt $TDENTER_ERROR_BIT, %rax
+ jc 4f
+
+ /* Temporarily save the TD-Exit reason. */
+ push %rax
+
+ /* check if TD-exit due to TDVMCALL */
+ cmp $EXIT_REASON_TDCALL, %ax
+
+ /* Reload @regs to RAX. */
+ mov 8(%rsp), %rax
+
+ /* Jump on non-TDVMCALL */
+ jne 3f
+
+ /* Save all output from SEAMCALL(TDENTER) */
+ mov %rbx, VCPU_RBX(%rax)
+ mov %rbp, VCPU_RBP(%rax)
+ mov %rsi, VCPU_RSI(%rax)
+ mov %rdi, VCPU_RDI(%rax)
+ mov %r10, VCPU_R10(%rax)
+ mov %r11, VCPU_R11(%rax)
+ mov %r12, VCPU_R12(%rax)
+ mov %r13, VCPU_R13(%rax)
+ mov %r14, VCPU_R14(%rax)
+ mov %r15, VCPU_R15(%rax)
+
+3: mov %rcx, VCPU_RCX(%rax)
+ mov %rdx, VCPU_RDX(%rax)
+ mov %r8, VCPU_R8 (%rax)
+ mov %r9, VCPU_R9 (%rax)
+
+ /*
+ * Clear all general purpose registers except RSP and RAX to prevent
+ * speculative use of the guest's values.
+ */
+ xor %rbx, %rbx
+ xor %rcx, %rcx
+ xor %rdx, %rdx
+ xor %rsi, %rsi
+ xor %rdi, %rdi
+ xor %rbp, %rbp
+ xor %r8, %r8
+ xor %r9, %r9
+ xor %r10, %r10
+ xor %r11, %r11
+ xor %r12, %r12
+ xor %r13, %r13
+ xor %r14, %r14
+ xor %r15, %r15
+
+ /* Restore the TD-Exit reason to RAX for return. */
+ pop %rax
+
+ /* "POP" @regs. */
+4: add $8, %rsp
+ pop %rbx
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+
+ pop %rbp
+ ret
+
+5: cmpb $0, kvm_rebooting
+ je 6f
+ mov $-EFAULT, %rax
+ jmp 4b
+6: ud2
+ _ASM_EXTABLE(2b, 5b)
+
+SYM_FUNC_END(__tdx_vcpu_run)
+
+.popsection
+
+#endif
--
2.25.1
From: Isaku Yamahata <[email protected]>
Wire up TDX PV rdmsr/wrmsr hypercall to the KVM backend function.
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f46825843a8b..1518a8c310d6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1169,6 +1169,39 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data;
+
+ if (kvm_get_msr(vcpu, index, &data)) {
+ trace_kvm_msr_read_ex(index);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+ trace_kvm_msr_read(index, data);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, data);
+ return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data = tdvmcall_a1_read(vcpu);
+
+ if (kvm_set_msr(vcpu, index, data)) {
+ trace_kvm_msr_write_ex(index, data);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ trace_kvm_msr_write(index, data);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1183,6 +1216,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_io(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_emulate_mmio(vcpu);
+ case EXIT_REASON_MSR_READ:
+ return tdx_emulate_rdmsr(vcpu);
+ case EXIT_REASON_MSR_WRITE:
+ return tdx_emulate_wrmsr(vcpu);
default:
break;
}
--
2.25.1
From: Sean Christopherson <[email protected]>
Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
some operations (e.g., memory read/write, register state access, etc).
Introduce vm_type to track the type of the VM to x86 KVM. Other arch KVMs
already use vm_type, KVM_INIT_VM accepts vm_type, and x86 KVM callback
vm_init accepts vm_type. So follow them. Further, a different policy can
be made based on vm_type. Define KVM_X86_DEFAULT_VM for default VM as
default and define KVM_X86_TDX_VM for Intel TDX VM. The wrapper function
will be defined as "bool is_td(kvm) { return vm_type == VM_TYPE_TDX; }"
Add a capability KVM_CAP_VM_TYPES to effectively allow device model,
e.g. qemu, to query what VM types are supported by KVM. This (introduce a
new capability and add vm_type) is chosen to align with other arch KVMs
that have VM types already. Other arch KVMs uses different name to query
supported vm types and there is no common name for it, so new name was
chosen.
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++++
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/uapi/asm/kvm.h | 3 +++
arch/x86/kvm/svm/svm.c | 6 ++++++
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/tdx.h | 6 +-----
arch/x86/kvm/vmx/vmx.c | 5 +++++
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 9 ++++++++-
include/uapi/linux/kvm.h | 1 +
tools/arch/x86/include/uapi/asm/kvm.h | 3 +++
tools/include/uapi/linux/kvm.h | 1 +
13 files changed, 54 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ae19a1da71f4..7fa6850f1e81 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,31 @@ described as 'basic' will be available.
The new VM has no virtual cpus and no memory.
You probably want to use 0 as machine type.
+X86:
+^^^^
+
+Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
+bitmap of supported vm types. The 1-setting of bit @n means vm type with
+value @n is supported.
+
+S390:
+^^^^^
+
In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 75bc44aa8d51..a97cdb203a16 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -19,6 +19,7 @@ KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24c15cfe6c32..d1c6c529d52a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1058,6 +1058,7 @@ enum kvm_apicv_inhibit {
};
struct kvm_arch {
+ unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -1338,6 +1339,7 @@ struct kvm_x86_ops {
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
+ bool (*is_vm_type_supported)(unsigned long vm_type);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 21614807a2cb..1c3b97f403f4 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -526,4 +526,7 @@ struct kvm_pmu_event_filter {
#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
#define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_TDX_VM 1
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d977e4ad133d..833eb557dee7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4585,6 +4585,11 @@ static void svm_vm_destroy(struct kvm *kvm)
sev_vm_destroy(kvm);
}
+static bool svm_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_DEFAULT_VM;
+}
+
static int svm_vm_init(struct kvm *kvm)
{
if (!pause_filter_count || !pause_filter_thresh)
@@ -4612,6 +4617,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_free = svm_vcpu_free,
.vcpu_reset = svm_vcpu_reset,
+ .is_vm_type_supported = svm_is_vm_type_supported,
.vm_size = sizeof(struct kvm_svm),
.vm_init = svm_vm_init,
.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ac788af17d92..7be4941e4c4d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -43,6 +43,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,
+ .is_vm_type_supported = vmx_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
.vm_destroy = vmx_vm_destroy,
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 54d7a26ed9ee..2f43db5bbefb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,11 +17,7 @@ struct vcpu_tdx {
static inline bool is_td(struct kvm *kvm)
{
- /*
- * TDX VM type isn't defined yet.
- * return kvm->arch.vm_type == KVM_X86_TDX_VM;
- */
- return false;
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b5846e0fc78f..5312aa6339b3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7173,6 +7173,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}
+bool vmx_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_DEFAULT_VM;
+}
+
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index e28f4e0653b8..544ae23998f7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -25,6 +25,7 @@ void vmx_hardware_unsetup(void);
int vmx_check_processor_compatibility(void);
int vmx_hardware_enable(void);
void vmx_hardware_disable(void);
+bool vmx_is_vm_type_supported(unsigned long type);
int vmx_vm_init(struct kvm *kvm);
void vmx_vm_destroy(struct kvm *kvm);
int vmx_vcpu_precreate(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bf77a8b64647..7233ce67ae1d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4389,6 +4389,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_DISABLE_QUIRKS2:
r = KVM_X86_VALID_QUIRKS;
break;
+ case KVM_CAP_VM_TYPES:
+ r = BIT(KVM_X86_DEFAULT_VM);
+ if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
+ r |= BIT(KVM_X86_TDX_VM);
+ break;
default:
break;
}
@@ -11791,9 +11796,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
int ret;
unsigned long flags;
- if (type)
+ if (!static_call(kvm_x86_is_vm_type_supported)(type))
return -EINVAL;
+ kvm->arch.vm_type = type;
+
ret = kvm_page_track_init(kvm);
if (ret)
goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dc5837e7bb40..9a3fd7b41fc5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1154,6 +1154,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_DISABLE_QUIRKS2 213
#define KVM_CAP_VM_TSC_CONTROL 214
#define KVM_CAP_SYSTEM_EVENT_DATA 215
+#define KVM_CAP_VM_TYPES 216
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index bf6e96011dfe..71a5851475e7 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
#define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_TDX_VM 1
+
#endif /* _ASM_X86_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 91a6fe4e02c0..110d5822f8b2 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1144,6 +1144,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_MEM_OP_EXTENSION 211
#define KVM_CAP_PMU_CAPABILITY 212
#define KVM_CAP_DISABLE_QUIRKS2 213
+#define KVM_CAP_VM_TYPES 216
#ifdef KVM_CAP_IRQ_ROUTING
--
2.25.1
From: Isaku Yamahata <[email protected]>
On entering/exiting TDX vcpu, Preserved or clobbered CPU state is different
from VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 28 +++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 39 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 4 ++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 73 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 099842a8a397..f101f358d90c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -110,6 +110,30 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
return vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ /*
+ * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+ * guest state of a TD is obviously off limits. Deferring MSRs and DRs
+ * is pointless because the TDX module needs to load *something* so as
+ * not to expose guest state.
+ */
+ if (is_td_vcpu(vcpu)) {
+ tdx_prepare_switch_to_guest(vcpu);
+ return;
+ }
+
+ vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_put(vcpu);
+
+ return vmx_vcpu_put(vcpu);
+}
+
static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -206,9 +230,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_free = vt_vcpu_free,
.vcpu_reset = vt_vcpu_reset,
- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .prepare_switch_to_guest = vt_prepare_switch_to_guest,
.vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
+ .vcpu_put = vt_vcpu_put,
.update_exception_bitmap = vmx_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4c69fe7c2ee5..1aa52a093764 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/cpu.h>
+#include <linux/mmu_context.h>
#include <asm/tdx.h>
@@ -461,6 +462,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.guest_state_protected =
!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ tdx->host_state_need_save = true;
+ tdx->host_state_need_restore = false;
+
return 0;
free_tdvpx:
@@ -474,6 +478,39 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return ret;
}
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (!tdx->host_state_need_save)
+ return;
+
+ if (likely(is_64bit_mm(current->mm)))
+ tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+ else
+ tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+ tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ tdx->host_state_need_save = true;
+ if (!tdx->host_state_need_restore)
+ return;
+
+ wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+ tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ vmx_vcpu_pi_put(vcpu);
+ tdx_prepare_switch_to_host(vcpu);
+}
+
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -576,6 +613,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(vcpu, tdx);
+ tdx->host_state_need_restore = true;
+
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
trace_kvm_exit(vcpu, KVM_ISA_VMX);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index f90f83b22d25..414c15235ed0 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -89,6 +89,10 @@ struct vcpu_tdx {
bool initialized;
+ bool host_state_need_save;
+ bool host_state_need_restore;
+ u64 msr_host_kernel_gs_base;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 96e7d853f93f..a2deb42794c0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -142,6 +142,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -163,6 +165,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1
On 5/5/22 20:14, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX defines an API to run TDX vcpu with its own ABI. Define an assembly
> helper function to run TDX vcpu to hide the special ABI so that C code can
> call it with function call ABI.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
"ret" needs to be "RET" to support SLS mitigation.
Paolo
> ---
> arch/x86/kvm/vmx/vmenter.S | 146 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 146 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index 435c187927c4..f655bcca0e93 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -2,6 +2,7 @@
> #include <linux/linkage.h>
> #include <asm/asm.h>
> #include <asm/bitsperlong.h>
> +#include <asm/errno.h>
> #include <asm/kvm_vcpu_regs.h>
> #include <asm/nospec-branch.h>
> #include <asm/segment.h>
> @@ -28,6 +29,13 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> +#ifdef CONFIG_INTEL_TDX_HOST
> +#define TDENTER 0
> +#define EXIT_REASON_TDCALL 77
> +#define TDENTER_ERROR_BIT 63
> +#define seamcall .byte 0x66,0x0f,0x01,0xcf
> +#endif
> +
> .section .noinstr.text, "ax"
>
> /**
> @@ -328,3 +336,141 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
> pop %_ASM_BP
> RET
> SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +
> +.pushsection .noinstr.text, "ax"
> +
> +/**
> + * __tdx_vcpu_run - Call SEAMCALL(TDENTER) to run a TD vcpu
> + * @tdvpr: physical address of TDVPR
> + * @regs: void * (to registers of TDVCPU)
> + * @gpr_mask: non-zero if guest registers need to be loaded prior to TDENTER
> + *
> + * Returns:
> + * TD-Exit Reason
> + *
> + * Note: KVM doesn't support using XMM in its hypercalls, it's the HyperV
> + * code's responsibility to save/restore XMM registers on TDVMCALL.
> + */
> +SYM_FUNC_START(__tdx_vcpu_run)
> + push %rbp
> + mov %rsp, %rbp
> +
> + push %r15
> + push %r14
> + push %r13
> + push %r12
> + push %rbx
> +
> + /* Save @regs, which is needed after TDENTER to capture output. */
> + push %rsi
> +
> + /* Load @tdvpr to RCX */
> + mov %rdi, %rcx
> +
> + /* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
> + test %dx, %dx
> + je 1f
> +
> + /* Load @regs to RAX, which will be clobbered with $TDENTER anyways. */
> + mov %rsi, %rax
> +
> + mov VCPU_RBX(%rax), %rbx
> + mov VCPU_RDX(%rax), %rdx
> + mov VCPU_RBP(%rax), %rbp
> + mov VCPU_RSI(%rax), %rsi
> + mov VCPU_RDI(%rax), %rdi
> +
> + mov VCPU_R8 (%rax), %r8
> + mov VCPU_R9 (%rax), %r9
> + mov VCPU_R10(%rax), %r10
> + mov VCPU_R11(%rax), %r11
> + mov VCPU_R12(%rax), %r12
> + mov VCPU_R13(%rax), %r13
> + mov VCPU_R14(%rax), %r14
> + mov VCPU_R15(%rax), %r15
> +
> + /* Load TDENTER to RAX. This kills the @regs pointer! */
> +1: mov $TDENTER, %rax
> +
> +2: seamcall
> +
> + /* Skip to the exit path if TDENTER failed. */
> + bt $TDENTER_ERROR_BIT, %rax
> + jc 4f
> +
> + /* Temporarily save the TD-Exit reason. */
> + push %rax
> +
> + /* check if TD-exit due to TDVMCALL */
> + cmp $EXIT_REASON_TDCALL, %ax
> +
> + /* Reload @regs to RAX. */
> + mov 8(%rsp), %rax
> +
> + /* Jump on non-TDVMCALL */
> + jne 3f
> +
> + /* Save all output from SEAMCALL(TDENTER) */
> + mov %rbx, VCPU_RBX(%rax)
> + mov %rbp, VCPU_RBP(%rax)
> + mov %rsi, VCPU_RSI(%rax)
> + mov %rdi, VCPU_RDI(%rax)
> + mov %r10, VCPU_R10(%rax)
> + mov %r11, VCPU_R11(%rax)
> + mov %r12, VCPU_R12(%rax)
> + mov %r13, VCPU_R13(%rax)
> + mov %r14, VCPU_R14(%rax)
> + mov %r15, VCPU_R15(%rax)
> +
> +3: mov %rcx, VCPU_RCX(%rax)
> + mov %rdx, VCPU_RDX(%rax)
> + mov %r8, VCPU_R8 (%rax)
> + mov %r9, VCPU_R9 (%rax)
> +
> + /*
> + * Clear all general purpose registers except RSP and RAX to prevent
> + * speculative use of the guest's values.
> + */
> + xor %rbx, %rbx
> + xor %rcx, %rcx
> + xor %rdx, %rdx
> + xor %rsi, %rsi
> + xor %rdi, %rdi
> + xor %rbp, %rbp
> + xor %r8, %r8
> + xor %r9, %r9
> + xor %r10, %r10
> + xor %r11, %r11
> + xor %r12, %r12
> + xor %r13, %r13
> + xor %r14, %r14
> + xor %r15, %r15
> +
> + /* Restore the TD-Exit reason to RAX for return. */
> + pop %rax
> +
> + /* "POP" @regs. */
> +4: add $8, %rsp
> + pop %rbx
> + pop %r12
> + pop %r13
> + pop %r14
> + pop %r15
> +
> + pop %rbp
> + ret
> +
> +5: cmpb $0, kvm_rebooting
> + je 6f
> + mov $-EFAULT, %rax
> + jmp 4b
> +6: ud2
> + _ASM_EXTABLE(2b, 5b)
> +
> +SYM_FUNC_END(__tdx_vcpu_run)
> +
> +.popsection
> +
> +#endif
On 5/5/22 20:13, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Hello. This is v6 the patch series vof KVM TDX support.
> This is based on v5.18-rc3 + kvm/queue branch + TDX HOST patch series.
> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
>
> Major changes from v5:
> - initialize TDX module on loading kvm_intel.ko
> This requires changes to other arch. I compile-tested only other arch.
> Needs review by each KVM arch maintainer.
> - introduced protected apic suggested by Sean Christopherson <[email protected]>
> - use constants for non-present SPTE value
> I tested on VMX, but complie test only for SVM.
> - introduced debug mode to enable #VE suppressbit for VMX and warn on #VE exit
>
> TODO:
> - 2M large page support. It's work-in-progress.
So the only important conflicts are with the PRIVATE mapping series
(see reply to patch 47) and with commit ba3a6120a4e7:
Author: Sean Christopherson <[email protected]>
Date: Sat Apr 23 03:47:43 2022 +0000
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
which are a bit boring but not hard. If you can post a v7 relatively
soon I'd be grateful.
Paolo
> How to run/test:
> It's describe at
> https://github.com/intel/tdx/blob/kvm-upstream-workaround/KVM-TDX.README.md
>
> Trello:
> I've created to track details. If you want to update items, please let me know.
> https://trello.com/b/B1cLGCcA/kvm-tdx
>
> Thanks,
> Isaku Yamahata
>
> Changes from v5:
> - export __seamcall and use it
> - move mutex lock from callee function of smp_call_on_cpu to the caller.
> - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> - updated comment
> - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> compatibility to only return success
> - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> - make this ioctl systemwide ioctl
> - ABI change to struct kvm_init_vm
> - guest_tsc_khz: use kvm->arch.default_tsc_khz
> - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> - drop exporting kvm_set_tsc_khz().
> - fix kvm_tdp_page_fault() for mtrr emulation
> - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> - update commit message
> - rename shadow_init_value => shadow_nonprsent_value
> - added ept_violation_ve_test mode
> - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> - legacy MMU case
> => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> - #VE warning:
> - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> - merge into Like we discussed, this patch should be merged with patch
> "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> - introduce kvm_gfn_for_root(kvm, root, gfn)
> - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> - use kvm_arch_dirty_log_supported()
> - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> - rename: is_private_prohibit_spte() => spte_shared_mask()
> - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> - dropped this patch as the change was merged into kvm/queue
> - update vt_apicv_post_state_restore()
> - use is_64_bit_hypercall()
> - comment: expand MSMI -> Machine Check System Management Interrupt
> - fixed TDX_SEPT_PFERR
> - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> - rename tdmvcall_exit_readon() => tdvmcall_leaf()
> - remove optional zero check of argument.
> - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> - introduce vcpu_deliver_init to x86_ops
> - sprinkeled KVM_BUG_ON()
>
> Changes from v4:
> - rebased to TDX host kernel patch series.
> - include all the patches to make this patch series working.
> - add [MARKER] patches to mark the patch layer clear.
>
> ---
> * What's TDX?
> TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> Domain (TD) for confidential computing.
>
> A TD runs in a CPU mode that is designed to protect the confidentiality of its
> memory contents and its CPU state from any other software, including the hosting
> Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
>
> We have more detailed explanations below (***).
> We have the high-level design of TDX KVM below (****).
>
> In this patch series, we use "TD" or "guest TD" to differentiate it from the
> current "VM" (Virtual Machine), which is supported by KVM today.
>
>
> * The organization of this patch series
> This patch series is on top of the patches series "TDX host kernel support":
> https://lore.kernel.org/lkml/[email protected]/
>
> this patch series is available at
> https://github.com/intel/tdx/releases/tag/kvm-upstream
> The corresponding patches to qemu are available at
> https://github.com/intel/qemu-tdx/commits/tdx-upstream
>
> The relations of the layers are depicted as follows.
> The arrows below show the order of patch reviews we would like to have.
>
> The below layers are chosen so that the device model, for example, qemu can
> exercise each layering step by step. Check if TDX is supported, create TD VM,
> create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> vcpu exits/hypercalls/interrupts to run TD fully.
>
> TDX vcpu
> interrupt/exits/hypercall<------------\
> ^ |
> | |
> TD finalization |
> ^ |
> | |
> TDX EPT violation<------------\ |
> ^ | |
> | | |
> TD vcpu enter/exit | |
> ^ | |
> | | |
> TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
> ^ | ^
> | | |
> TD VM creation/destruction \---------------KVM TDP MMU hooks
> ^ ^
> | |
> TDX architectural definitions KVM TDP refactoring for TDX
> ^ ^
> | |
> TDX, VMX <--------TDX host kernel KVM MMU GPA share mask
> coexistence support
>
>
> The followings are explanations of each layer. Each layer has a dummy commit
> that starts with [MARKER] in subject. It is intended to help to identify where
> each layer starts.
>
> TDX host kernel support:
> https://lore.kernel.org/lkml/[email protected]/
> The guts of system-wide initialization of TDX module. There is an
> independent patch series for host x86. TDX KVM patches call functions
> this patch series provides to initialize the TDX module.
>
> TDX, VMX coexistence:
> Infrastructure to allow TDX to coexist with VMX and trigger the
> initialization of the TDX module.
> This layer starts with
> "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> TDX architectural definitions:
> Add TDX architectural definitions and helper functions
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> TD VM creation/destruction:
> Guest TD creation/destroy allocation and releasing of TDX specific vm
> and vcpu structure. Create an initial guest memory image with TDX
> measurement.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> TD vcpu creation/destruction:
> guest TD creation/destroy Allocation and releasing of TDX specific vm
> and vcpu structure. Create an initial guest memory image with TDX
> measurement.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> TDX EPT violation:
> Create an initial guest memory image with TDX measurement. Handle
> secure EPT violations to populate guest pages with TDX SEAMCALLs.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> TD vcpu enter/exit:
> Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
> entering into TD. Restore CPU state after exiting from TD.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> TD vcpu interrupts/exit/hypercall:
> Handle various exits/hypercalls and allow interrupts to be injected so
> that TD vcpu can continue running.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
>
> KVM MMU GPA shared bit:
> Introduce framework to handle shared bit repurposed bit of GPA TDX
> repurposed a bit of GPA to indicate shared or private. If it's shared,
> it's the same as the conventional VMX EPT case. VMM can access shared
> guest pages. If it's private, it's handled by Secure-EPT and the guest
> page is encrypted.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> KVM TDP refactoring for TDX:
> TDX Secure EPT requires different constants. e.g. initial value EPT
> entry value etc. Various refactoring for those differences.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> KVM TDP MMU hooks:
> Introduce framework to TDP MMU to add hooks in addition to direct EPT
> access TDX added Secure EPT which is an enhancement to VMX EPT. Unlike
> conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> use TDX SEAMCALLs to operate on Secure EPT.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> KVM TDP MMU MapGPA:
> Introduce framework to handle switching guest pages from private/shared
> to shared/private. For a given GPA, a guest page can be assigned to a
> private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
> guest TD converts GPA assignments from private (or shared) to shared (or
> private).
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
>
> KVM guest private memory: (not shown in the above diagram)
> [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> memory: https://lkml.org/lkml/2022/1/18/395
> Guest private memory requires different memory management in KVM. The
> patch proposes a way for it. Integration with TDX KVM.
>
> (***)
> * TDX module
> A CPU-attested software module called the "TDX module" is designed to implement
> the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> loaded by the kernel or driver at runtime, but in this patch series we assume
> that the TDX module is already loaded and initialized.
>
> The TDX module provides two main new logical modes of operation built upon the
> new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> architecture. TDX root mode is mostly identical to the VMX root operation mode,
> and the TDX functions (described later) are triggered by the new SEAMCALL
> instruction with the desired interface function selected by an input operand
> (leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
> non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> operation (i.e. guest VM), with changes and restrictions to better assure that
> no other software or hardware has direct visibility of the TD memory and state.
>
> TDX transitions between TDX root operation and TDX non-root operation include TD
> Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> TDX root mode. A TD Exit might be asynchronous, triggered by some external
> event (e.g., external interrupt or SMI) or an exception, or it might be
> synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
>
> TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> Domain Host. Those host-side TDX interface functions are categorized into
> various areas just for better organization, such as SYS (TDX module management),
> MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
>
> TDCS (Trust Domain Control Structure) is the main control structure of a guest
> TD, and encrypted (using the guest TD's ephemeral private key). At a high
> level, TDCS holds information for controlling TD operation as a whole,
> execution, EPTP, MSR bitmaps, etc that KVM needs to set it up. Note that MSR
> bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> same value for all VCPUs of the same TD.
>
> Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
> the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> DMA access, accessible only by using the TDX module interface functions (such as
> TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> such as virtual APIC page, virtualization exception information, etc.
>
> Several VMX control structures (such as Shared EPT and Posted interrupt
> descriptor) are directly managed and accessed by the host VMM. These control
> structures are pointed to by fields in the TD VMCS.
>
> The above means that 1) KVM needs to allocate different data structures for TDs,
> 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> define TD-specific handling for others. 3) Redirect operations to . 3)
> Redirect operations to the TDX specific callbacks, like "if (is_td_vcpu(vcpu))
> tdx_callback() else vmx_callback();".
>
> *TD Private Memory
> TD private memory is designed to hold TD private content, encrypted by the CPU
> using the TD ephemeral key. An encryption engine holds a table of encryption
> keys, and an encryption key is selected for each memory transaction based on a
> Host Key Identifier (HKID). By design, the host VMM does not have access to the
> encryption keys.
>
> In the first generation of MKTME, HKID is "stolen" from the physical address by
> allocating a configurable number of bits from the top of the physical
> address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> HKID on the host so that MKTME can be opaque or bypassed on the host.
>
> During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> as either shared or private, based on the value of a new SHARED bit in the Guest
> Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
> (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> with the current VMX. Since guest TDs usually require I/O, and the data exchange
> needs to be done via shared memory, thus KVM needs to use the current EPT
> functionality even for TDs.
>
> * Secure EPT and Minoring using the TDP code
> The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
> pages are encrypted and integrity-protected with the TD's ephemeral private
> key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> interface functions, and thus conceptually Secure EPT is a subset of EPT (why
> "subset"). Since execution of such interface functions takes much longer time
> than accessing memory directly, in KVM we use the existing TDP code to minor the
> Secure EPT for the TD.
>
> This way, we can effectively walk Secure EPT without using the TDX interface
> functions.
>
> * VM life cycle and TDX specific operations
> The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
> example, a TD needs to boot in private memory, and the host software cannot copy
> the initial image to private memory.
>
> * TSC Virtualization
> The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
> owns TSC virtualization for VMs, but the TDX module does for TDs.
>
> * MCE support for TDs
> The TDX module doesn't allow VMM to inject MCE. Instead PV way is needed for TD
> to communicate with VMM. For now, KVM silently ignores MCE request by VMM. MSRs
> related to MCE (e.g, MCE bank registers) can be naturally emulated by
> paravirtualizing MSR access.
>
> [1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
> available.
>
> * Restrictions or future work
> Some features are not included to reduce patch size. Those features are
> addressed as future independent patch series.
> - large page (2M, 1G)
> - qemu gdb stub
> - guest PMU
> - and more
>
> * Prerequisites
> It's required to load the TDX module and initialize it. It's out of the scope
> of this patch series. Another independent patch for the common x86 code is
> planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> CONFIG_INTEL_TDX_HOST. It's assumed that With CONFIG_INTEL_TDX_HOST=y, the TDX
> module is initialized and ready for KVM to use the TDX module APIs for TDX guest
> life cycle like tdh.mng.init are ready to use.
>
> Concretely Global initialization, LP (Logical Processor) initialization, global
> configuration, the key configuration, and TDMR and PAMT initialization are done.
> The state of the TDX module is SYS_READY. Please refer to the TDX module
> specification, the chapter Intel TDX Module Lifecycle State Machine
>
> ** Detecting the TDX module readiness.
> TDX host patch series implements the detection of the TDX module availability
> and its initialization so that KVM can use it. Also it manages Host KeyID
> (HKID) assigned to guest TD.
> The assumed APIs the TDX host patch series provides are
> - int seamrr_enabled()
> Check if required cpu feature (SEAM mode) is available. This only check CPU
> feature availability. At this point, the TDX module may not be ready for KVM
> to use.
> - int init_tdx(void);
> Initialization of TDX module so that the TDX module is ready for KVM to use.
> - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> Return the system wide information about the TDX module. NULL if the TDX
> isn't initialized.
> - u32 tdx_get_global_keyid(void);
> Return global key id that is used for the TDX module itself.
> - int tdx_keyid_alloc(void);
> Allocate HKID for guest TD.
> - void tdx_keyid_free(int keyid);
> Free HKID for guest TD.
>
> (****)
> * TDX KVM high-level design
> - Host key ID management
> Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> It is assumed The TDX host patch series implements necessary functions,
> u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void) and,
> void tdx_keyid_free(int keyid).
>
> - Data structures and VM type
> Because TDX is different from VMX, define its own VM/VCPU structures, struct
> kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
> identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> TDX, is used.
>
> - VM life cycle and TDX specific operations
> Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> parameters, set initial guest memory and measurement.
>
> The creation of TDX VM requires five additional operations in addition to the
> conventional VM creation.
> - Get KVM system capability to check if TDX VM type is supported
> - VM creation (KVM_CREATE_VM)
> - New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
> - New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
> - VCPU creation (KVM_CREATE_VCPU)
> - New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
> - New: Initialize guest memory as boot state and extend the measurement with
> the memory. KVM_TDX_INIT_MEM_REGION.
> - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> TDX VM contents.
> - VCPU RUN (KVM_VCPU_RUN)
>
> - Protected guest state
> Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> can't operate on them. For example, accessing CPU registers, injecting
> exceptions, and accessing guest memory. Those operations are handled as
> silently ignored, returning zero or initial reset value when it's requested via
> KVM API ioctls.
>
> VM/VCPU state and callbacks for TDX specific operations.
> Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
> operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
>
> Operations on the CPU state
> silently ignore operations on the guest state. For example, the write to
> CPU registers is ignored and the read from CPU registers returns 0.
>
> . ignore access to CPU registers except for allowed ones.
> . TSC: add a check if tsc is immutable and return an error. Because the KVM
> implementation updates the internal tsc state and it's difficult to back
> out those changes. Instead, skip the logic.
> . dirty logging: add check if dirty logging is supported.
> . exceptions/SMI/MCE/SIPI/INIT: silently ignore
>
> Note: virtual external interrupt and NMI can be injected into TDX guests.
>
> - KVM MMU integration
> One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> the guest physical address is private (the bit is cleared) or shared (the bit is
> set). The bits are called stolen bits.
>
> - Stolen bits framework
> systematically tracks which guest physical address, shared or private, is
> used.
>
> - Shared EPT and secure EPT
> There are two EPTs. Shared EPT (the conventional one) and Secure
> EPT(the new one). Shared EPT is handled the same for the stolen
> bit set. Secure EPT points to private guest pages. To resolve
> EPT violation, KVM walks one of two EPTs based on faulted GPA.
> Because it's costly to access secure EPT during walking EPTs with
> SEAMCALLs for the private guest physical address, another private
> EPT is used as a shadow of Secure-EPT with the existing logic at
> the cost of extra memory.
>
> The following depicts the relationship.
>
> KVM | TDX module
> | | |
> -------------+---------- | |
> | | | |
> V V | |
> shared GPA private GPA | |
> CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
> | | | |
> | | | |
> V V | V
> shared EPT private EPT--------mirror----->Secure EPT
> | | | |
> | \--------------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> |
> non-encrypted memory | encrypted memory
> |
>
> - Operating on Secure EPT
> Use the TDX module APIs to operate on Secure EPT. To call the TDX API
> during resolving EPT violation, add hooks to additional operation and wiring
> it to TDX backend.
>
> * References
>
> [1] TDX specification
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
> https://cdrdv2.intel.com/v1/dl/getContent/726790
> [3] Intel CPU Architectural Extensions Specification
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 Specification
> https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> [5] Intel TDX Loader Interface Specification
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
> https://cdrdv2.intel.com/v1/dl/getContent/726790
> [7] Intel TDX Virtual Firmware Design Guide
> https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> [8] intel public github
> kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> TDX guest branch: https://github.com/intel/tdx/tree/guest
> qemu TDX https://github.com/intel/qemu-tdx
> [9] TDVF
> https://github.com/tianocore/edk2-staging/tree/TDVF
> This was merged into EDK2 main branch. https://github.com/tianocore/edk2
>
> Chao Gao (3):
> KVM: x86: Move check_processor_compatibility from init ops to runtime
> ops
> Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> arch funcs"
> KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> wrmsr
>
> Isaku Yamahata (74):
> KVM: Refactor CPU compatibility check on module initialiization
> x86/virt/vmx/tdx: export platform_has_tdx
> KVM: TDX: Detect CPU feature on kernel module initialization
> KVM: x86: Refactor KVM VMX module init/exit functions
> KVM: TDX: Add placeholders for TDX VM/vcpu structure
> x86/virt/tdx: Add a helper function to return system wide info about
> TDX module
> KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> KVM: TDX: Make TDX VM type supported
> [MARKER] The start of TDX KVM patch series: TDX architectural
> definitions
> KVM: TDX: Define TDX architectural definitions
> KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> KVM: TDX: Add helper functions to print TDX SEAMCALL error
> [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> x86/cpu: Add helper functions to allocate/free TDX private host key id
> KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> KVM: TDX: Make KVM_CAP_SET_IDENTITY_MAP_ADDR unsupported for TDX
> KVM: TDX: Make pmu_intel.c ignore guest TD case
> [MARKER] The start of TDX KVM patch series: TD vcpu
> creation/destruction
> KVM: TDX: allocate/free TDX vcpu structure
> KVM: TDX: allocate/free TDX vcpu structure
> [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> TDX
> KVM: x86/mmu: Disallow fast page fault on private GPA
> KVM: VMX: Introduce test mode related to EPT violation VE
> [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> KVM: x86/mmu: Focibly use TDP MMU for TDX
> KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> [MARKER] The start of TDX KVM patch series: TDX EPT violation
> KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> KVM: TDX: TDP MMU TDX support
> [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> KVM: x86/mmu: steal software usable git to record if GFN is for shared
> or not
> KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> [MARKER] The start of TDX KVM patch series: TD finalization
> KVM: TDX: Create initial guest memory
> KVM: TDX: Finalize VM initialization
> [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> KVM: TDX: Add helper assembly function to TDX vcpu
> KVM: TDX: Implement TDX vcpu enter/exit path
> KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> KVM: TDX: restore host xsave state when exit from the guest TD
> KVM: TDX: restore user ret MSRs
> [MARKER] The start of TDX KVM patch series: TD vcpu
> exits/interrupts/hypercalls
> KVM: TDX: complete interrupts after tdexit
> KVM: TDX: restore debug store when TD exit
> KVM: TDX: handle vcpu migration over logical processor
> KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> behavior
> KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> KVM: TDX: Implement interrupt injection
> KVM: TDX: Implements vcpu request_immediate_exit
> KVM: TDX: Implement methods to inject NMI
> KVM: TDX: Add a place holder to handle TDX VM exit
> KVM: TDX: handle EXIT_REASON_OTHER_SMI
> KVM: TDX: handle ept violation/misconfig exit
> KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> KVM: TDX: Add a place holder for handler of TDX hypercalls
> (TDG.VP.VMCALL)
> KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> KVM: TDX: Handle TDX PV CPUID hypercall
> KVM: TDX: Handle TDX PV HLT hypercall
> KVM: TDX: Handle TDX PV port io hypercall
> KVM: TDX: Implement callbacks for MSR operations for TDX
> KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> KVM: TDX: Handle TDX PV report fatal error hypercall
> KVM: TDX: Handle TDX PV map_gpa hypercall
> KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> KVM: TDX: Silently discard SMI request
> KVM: TDX: Silently ignore INIT/SIPI
> Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> [MARKER] the end of (the first phase of) TDX KVM patch series
>
> Rick Edgecombe (1):
> KVM: x86/mmu: Add address conversion functions for TDX shared bits
>
> Sean Christopherson (25):
> KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> KVM: Enable hardware before doing arch VM initialization
> KVM: x86: Introduce vm_type to differentiate default VMs from
> confidential VMs
> KVM: TDX: Add TDX "architectural" error codes
> KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> KVM: TDX: create/destroy VM structure
> KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> KVM: TDX: Do TDX specific vcpu initialization
> KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> KVM: x86/mmu: Allow non-zero value for non-present SPTE
> KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> KVM: x86/mmu: Allow per-VM override of the TDP max page level
> KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> private mmu
> KVM: x86/mmu: Disallow dirty logging for x86 TDX
> KVM: VMX: Split out guts of EPT violation to common/exposed function
> KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> KVM: TDX: Add load_mmu_pgd method for TDX
> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> KVM: TDX: Add support for find pending IRQ in a protected local APIC
> KVM: x86: Assume timer IRQ was injected if APIC state is proteced
> KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> argument
> KVM: VMX: Move NMI/exception handler to common helper
> KVM: x86: Split core of hypercall emulation to helper function
> KVM: TDX: Handle TDX PV MMIO hypercall
> KVM: TDX: Add methods to ignore accesses to CPU state
>
> Xiaoyao Li (1):
> KVM: TDX: initialize VM with TDX specific parameters
>
> Documentation/virt/kvm/api.rst | 30 +-
> Documentation/virt/kvm/intel-tdx.rst | 381 ++++
> Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 +++++
> arch/arm64/kvm/arm.c | 2 +-
> arch/mips/kvm/mips.c | 14 +-
> arch/powerpc/kvm/powerpc.c | 2 +-
> arch/riscv/kvm/main.c | 2 +-
> arch/s390/kvm/kvm-s390.c | 2 +-
> arch/x86/events/intel/ds.c | 1 +
> arch/x86/include/asm/kvm-x86-ops.h | 10 +
> arch/x86/include/asm/kvm_host.h | 56 +-
> arch/x86/include/asm/tdx.h | 66 +
> arch/x86/include/asm/vmx.h | 14 +
> arch/x86/include/uapi/asm/kvm.h | 95 +
> arch/x86/include/uapi/asm/vmx.h | 5 +-
> arch/x86/kvm/Kconfig | 4 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/irq.c | 3 +
> arch/x86/kvm/lapic.c | 37 +-
> arch/x86/kvm/lapic.h | 2 +
> arch/x86/kvm/mmu.h | 48 +-
> arch/x86/kvm/mmu/mmu.c | 371 +++-
> arch/x86/kvm/mmu/mmu_internal.h | 120 ++
> arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
> arch/x86/kvm/mmu/spte.c | 46 +-
> arch/x86/kvm/mmu/spte.h | 65 +-
> arch/x86/kvm/mmu/tdp_iter.c | 1 +
> arch/x86/kvm/mmu/tdp_iter.h | 5 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 683 ++++++-
> arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
> arch/x86/kvm/svm/svm.c | 13 +-
> arch/x86/kvm/vmx/common.h | 154 ++
> arch/x86/kvm/vmx/evmcs.c | 2 +-
> arch/x86/kvm/vmx/evmcs.h | 2 +-
> arch/x86/kvm/vmx/main.c | 1073 ++++++++++
> arch/x86/kvm/vmx/pmu_intel.c | 33 +
> arch/x86/kvm/vmx/pmu_intel.h | 29 +
> arch/x86/kvm/vmx/posted_intr.c | 43 +-
> arch/x86/kvm/vmx/posted_intr.h | 13 +
> arch/x86/kvm/vmx/tdx.c | 2470 ++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 275 +++
> arch/x86/kvm/vmx/tdx_arch.h | 157 ++
> arch/x86/kvm/vmx/tdx_errno.h | 29 +
> arch/x86/kvm/vmx/tdx_error.c | 22 +
> arch/x86/kvm/vmx/tdx_ops.h | 188 ++
> arch/x86/kvm/vmx/vmenter.S | 146 ++
> arch/x86/kvm/vmx/vmx.c | 716 +++----
> arch/x86/kvm/vmx/vmx.h | 41 +-
> arch/x86/kvm/vmx/x86_ops.h | 235 +++
> arch/x86/kvm/x86.c | 155 +-
> arch/x86/virt/vmx/tdx/seamcall.S | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 53 +-
> arch/x86/virt/vmx/tdx/tdx.h | 52 -
> include/linux/kvm_host.h | 4 +-
> include/uapi/linux/kvm.h | 2 +
> tools/arch/x86/include/uapi/asm/kvm.h | 95 +
> tools/include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 67 +-
> 58 files changed, 7839 insertions(+), 783 deletions(-)
> create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> create mode 100644 arch/x86/kvm/vmx/common.h
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> create mode 100644 arch/x86/kvm/vmx/tdx.c
> create mode 100644 arch/x86/kvm/vmx/tdx.h
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> create mode 100644 arch/x86/kvm/vmx/x86_ops.h
>
On Tue, May 31, 2022 at 05:58:39PM +0200,
Paolo Bonzini <[email protected]> wrote:
> On 5/5/22 20:14, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX defines an API to run TDX vcpu with its own ABI. Define an assembly
> > helper function to run TDX vcpu to hide the special ABI so that C code can
> > call it with function call ABI.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
>
> "ret" needs to be "RET" to support SLS mitigation.
Thanks for pointing it out. As it's fixed in github repo. will be fixed on the
next series.
--
Isaku Yamahata <[email protected]>
On Tue, May 31, 2022 at 04:46:02PM +0200,
Paolo Bonzini <[email protected]> wrote:
> On 5/5/22 20:13, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Hello. This is v6 the patch series vof KVM TDX support.
> > This is based on v5.18-rc3 + kvm/queue branch + TDX HOST patch series.
> > The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> >
> > Major changes from v5:
> > - initialize TDX module on loading kvm_intel.ko
> > This requires changes to other arch. I compile-tested only other arch.
> > Needs review by each KVM arch maintainer.
> > - introduced protected apic suggested by Sean Christopherson <[email protected]>
> > - use constants for non-present SPTE value
> > I tested on VMX, but complie test only for SVM.
> > - introduced debug mode to enable #VE suppressbit for VMX and warn on #VE exit
> >
> > TODO:
> > - 2M large page support. It's work-in-progress.
>
> So the only important conflicts are with the PRIVATE mapping series
> (see reply to patch 47) and with commit ba3a6120a4e7:
>
> Author: Sean Christopherson <[email protected]>
> Date: Sat Apr 23 03:47:43 2022 +0000
>
> KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
>
> which are a bit boring but not hard. If you can post a v7 relatively
> soon I'd be grateful.
Ok, I'll do. Do you want v7 to be on top of memfd notifier patch series?
or without the patches? I'm fine with either way because it's already integrated
in github repo. https://github.com/intel/tdx/tree/kvm-upstream-workaround
It's a matter of tedious patch reordering.
Thanks,
>
> Paolo
>
> > How to run/test:
> > It's describe at
> > https://github.com/intel/tdx/blob/kvm-upstream-workaround/KVM-TDX.README.md
> >
> > Trello:
> > I've created to track details. If you want to update items, please let me know.
> > https://trello.com/b/B1cLGCcA/kvm-tdx
> >
> > Thanks,
> > Isaku Yamahata
> >
> > Changes from v5:
> > - export __seamcall and use it
> > - move mutex lock from callee function of smp_call_on_cpu to the caller.
> > - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> > - updated comment
> > - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> > compatibility to only return success
> > - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> > - make this ioctl systemwide ioctl
> > - ABI change to struct kvm_init_vm
> > - guest_tsc_khz: use kvm->arch.default_tsc_khz
> > - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> > - drop exporting kvm_set_tsc_khz().
> > - fix kvm_tdp_page_fault() for mtrr emulation
> > - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> > - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> > keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> > - update commit message
> > - rename shadow_init_value => shadow_nonprsent_value
> > - added ept_violation_ve_test mode
> > - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> > - legacy MMU case
> > => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> > - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> > - #VE warning:
> > - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> > - merge into Like we discussed, this patch should be merged with patch
> > "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> > - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> > - introduce kvm_gfn_for_root(kvm, root, gfn)
> > - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> > - use kvm_arch_dirty_log_supported()
> > - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> > - rename: is_private_prohibit_spte() => spte_shared_mask()
> > - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> > - dropped this patch as the change was merged into kvm/queue
> > - update vt_apicv_post_state_restore()
> > - use is_64_bit_hypercall()
> > - comment: expand MSMI -> Machine Check System Management Interrupt
> > - fixed TDX_SEPT_PFERR
> > - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> > - rename tdmvcall_exit_readon() => tdvmcall_leaf()
> > - remove optional zero check of argument.
> > - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> > in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> > - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> > - introduce vcpu_deliver_init to x86_ops
> > - sprinkeled KVM_BUG_ON()
> >
> > Changes from v4:
> > - rebased to TDX host kernel patch series.
> > - include all the patches to make this patch series working.
> > - add [MARKER] patches to mark the patch layer clear.
> >
> > ---
> > * What's TDX?
> > TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> > Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> > Domain (TD) for confidential computing.
> >
> > A TD runs in a CPU mode that is designed to protect the confidentiality of its
> > memory contents and its CPU state from any other software, including the hosting
> > Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> >
> > We have more detailed explanations below (***).
> > We have the high-level design of TDX KVM below (****).
> >
> > In this patch series, we use "TD" or "guest TD" to differentiate it from the
> > current "VM" (Virtual Machine), which is supported by KVM today.
> >
> >
> > * The organization of this patch series
> > This patch series is on top of the patches series "TDX host kernel support":
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > this patch series is available at
> > https://github.com/intel/tdx/releases/tag/kvm-upstream
> > The corresponding patches to qemu are available at
> > https://github.com/intel/qemu-tdx/commits/tdx-upstream
> >
> > The relations of the layers are depicted as follows.
> > The arrows below show the order of patch reviews we would like to have.
> >
> > The below layers are chosen so that the device model, for example, qemu can
> > exercise each layering step by step. Check if TDX is supported, create TD VM,
> > create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> > vcpu exits/hypercalls/interrupts to run TD fully.
> >
> > TDX vcpu
> > interrupt/exits/hypercall<------------\
> > ^ |
> > | |
> > TD finalization |
> > ^ |
> > | |
> > TDX EPT violation<------------\ |
> > ^ | |
> > | | |
> > TD vcpu enter/exit | |
> > ^ | |
> > | | |
> > TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
> > ^ | ^
> > | | |
> > TD VM creation/destruction \---------------KVM TDP MMU hooks
> > ^ ^
> > | |
> > TDX architectural definitions KVM TDP refactoring for TDX
> > ^ ^
> > | |
> > TDX, VMX <--------TDX host kernel KVM MMU GPA share mask
> > coexistence support
> >
> >
> > The followings are explanations of each layer. Each layer has a dummy commit
> > that starts with [MARKER] in subject. It is intended to help to identify where
> > each layer starts.
> >
> > TDX host kernel support:
> > https://lore.kernel.org/lkml/[email protected]/
> > The guts of system-wide initialization of TDX module. There is an
> > independent patch series for host x86. TDX KVM patches call functions
> > this patch series provides to initialize the TDX module.
> >
> > TDX, VMX coexistence:
> > Infrastructure to allow TDX to coexist with VMX and trigger the
> > initialization of the TDX module.
> > This layer starts with
> > "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> > TDX architectural definitions:
> > Add TDX architectural definitions and helper functions
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> > TD VM creation/destruction:
> > Guest TD creation/destroy allocation and releasing of TDX specific vm
> > and vcpu structure. Create an initial guest memory image with TDX
> > measurement.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> > TD vcpu creation/destruction:
> > guest TD creation/destroy Allocation and releasing of TDX specific vm
> > and vcpu structure. Create an initial guest memory image with TDX
> > measurement.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> > TDX EPT violation:
> > Create an initial guest memory image with TDX measurement. Handle
> > secure EPT violations to populate guest pages with TDX SEAMCALLs.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> > TD vcpu enter/exit:
> > Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
> > entering into TD. Restore CPU state after exiting from TD.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> > TD vcpu interrupts/exit/hypercall:
> > Handle various exits/hypercalls and allow interrupts to be injected so
> > that TD vcpu can continue running.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> >
> > KVM MMU GPA shared bit:
> > Introduce framework to handle shared bit repurposed bit of GPA TDX
> > repurposed a bit of GPA to indicate shared or private. If it's shared,
> > it's the same as the conventional VMX EPT case. VMM can access shared
> > guest pages. If it's private, it's handled by Secure-EPT and the guest
> > page is encrypted.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> > KVM TDP refactoring for TDX:
> > TDX Secure EPT requires different constants. e.g. initial value EPT
> > entry value etc. Various refactoring for those differences.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> > KVM TDP MMU hooks:
> > Introduce framework to TDP MMU to add hooks in addition to direct EPT
> > access TDX added Secure EPT which is an enhancement to VMX EPT. Unlike
> > conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> > use TDX SEAMCALLs to operate on Secure EPT.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> > KVM TDP MMU MapGPA:
> > Introduce framework to handle switching guest pages from private/shared
> > to shared/private. For a given GPA, a guest page can be assigned to a
> > private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
> > guest TD converts GPA assignments from private (or shared) to shared (or
> > private).
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
> >
> > KVM guest private memory: (not shown in the above diagram)
> > [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> > memory: https://lkml.org/lkml/2022/1/18/395
> > Guest private memory requires different memory management in KVM. The
> > patch proposes a way for it. Integration with TDX KVM.
> >
> > (***)
> > * TDX module
> > A CPU-attested software module called the "TDX module" is designed to implement
> > the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> > loaded by the kernel or driver at runtime, but in this patch series we assume
> > that the TDX module is already loaded and initialized.
> >
> > The TDX module provides two main new logical modes of operation built upon the
> > new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> > architecture. TDX root mode is mostly identical to the VMX root operation mode,
> > and the TDX functions (described later) are triggered by the new SEAMCALL
> > instruction with the desired interface function selected by an input operand
> > (leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
> > non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> > operation (i.e. guest VM), with changes and restrictions to better assure that
> > no other software or hardware has direct visibility of the TD memory and state.
> >
> > TDX transitions between TDX root operation and TDX non-root operation include TD
> > Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> > TDX root mode. A TD Exit might be asynchronous, triggered by some external
> > event (e.g., external interrupt or SMI) or an exception, or it might be
> > synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> >
> > TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> > of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> > Domain Host. Those host-side TDX interface functions are categorized into
> > various areas just for better organization, such as SYS (TDX module management),
> > MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> > etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
> >
> > TDCS (Trust Domain Control Structure) is the main control structure of a guest
> > TD, and encrypted (using the guest TD's ephemeral private key). At a high
> > level, TDCS holds information for controlling TD operation as a whole,
> > execution, EPTP, MSR bitmaps, etc that KVM needs to set it up. Note that MSR
> > bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> > same value for all VCPUs of the same TD.
> >
> > Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> > TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
> > the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> > DMA access, accessible only by using the TDX module interface functions (such as
> > TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> > such as virtual APIC page, virtualization exception information, etc.
> >
> > Several VMX control structures (such as Shared EPT and Posted interrupt
> > descriptor) are directly managed and accessed by the host VMM. These control
> > structures are pointed to by fields in the TD VMCS.
> >
> > The above means that 1) KVM needs to allocate different data structures for TDs,
> > 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> > define TD-specific handling for others. 3) Redirect operations to . 3)
> > Redirect operations to the TDX specific callbacks, like "if (is_td_vcpu(vcpu))
> > tdx_callback() else vmx_callback();".
> >
> > *TD Private Memory
> > TD private memory is designed to hold TD private content, encrypted by the CPU
> > using the TD ephemeral key. An encryption engine holds a table of encryption
> > keys, and an encryption key is selected for each memory transaction based on a
> > Host Key Identifier (HKID). By design, the host VMM does not have access to the
> > encryption keys.
> >
> > In the first generation of MKTME, HKID is "stolen" from the physical address by
> > allocating a configurable number of bits from the top of the physical
> > address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> > accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> > HKID on the host so that MKTME can be opaque or bypassed on the host.
> >
> > During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> > as either shared or private, based on the value of a new SHARED bit in the Guest
> > Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
> > (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> > VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> > with the current VMX. Since guest TDs usually require I/O, and the data exchange
> > needs to be done via shared memory, thus KVM needs to use the current EPT
> > functionality even for TDs.
> >
> > * Secure EPT and Minoring using the TDP code
> > The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
> > pages are encrypted and integrity-protected with the TD's ephemeral private
> > key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> > interface functions, and thus conceptually Secure EPT is a subset of EPT (why
> > "subset"). Since execution of such interface functions takes much longer time
> > than accessing memory directly, in KVM we use the existing TDP code to minor the
> > Secure EPT for the TD.
> >
> > This way, we can effectively walk Secure EPT without using the TDX interface
> > functions.
> >
> > * VM life cycle and TDX specific operations
> > The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
> > example, a TD needs to boot in private memory, and the host software cannot copy
> > the initial image to private memory.
> >
> > * TSC Virtualization
> > The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> > (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> > by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
> > owns TSC virtualization for VMs, but the TDX module does for TDs.
> >
> > * MCE support for TDs
> > The TDX module doesn't allow VMM to inject MCE. Instead PV way is needed for TD
> > to communicate with VMM. For now, KVM silently ignores MCE request by VMM. MSRs
> > related to MCE (e.g, MCE bank registers) can be naturally emulated by
> > paravirtualizing MSR access.
> >
> > [1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
> > available.
> >
> > * Restrictions or future work
> > Some features are not included to reduce patch size. Those features are
> > addressed as future independent patch series.
> > - large page (2M, 1G)
> > - qemu gdb stub
> > - guest PMU
> > - and more
> >
> > * Prerequisites
> > It's required to load the TDX module and initialize it. It's out of the scope
> > of this patch series. Another independent patch for the common x86 code is
> > planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> > CONFIG_INTEL_TDX_HOST. It's assumed that With CONFIG_INTEL_TDX_HOST=y, the TDX
> > module is initialized and ready for KVM to use the TDX module APIs for TDX guest
> > life cycle like tdh.mng.init are ready to use.
> >
> > Concretely Global initialization, LP (Logical Processor) initialization, global
> > configuration, the key configuration, and TDMR and PAMT initialization are done.
> > The state of the TDX module is SYS_READY. Please refer to the TDX module
> > specification, the chapter Intel TDX Module Lifecycle State Machine
> >
> > ** Detecting the TDX module readiness.
> > TDX host patch series implements the detection of the TDX module availability
> > and its initialization so that KVM can use it. Also it manages Host KeyID
> > (HKID) assigned to guest TD.
> > The assumed APIs the TDX host patch series provides are
> > - int seamrr_enabled()
> > Check if required cpu feature (SEAM mode) is available. This only check CPU
> > feature availability. At this point, the TDX module may not be ready for KVM
> > to use.
> > - int init_tdx(void);
> > Initialization of TDX module so that the TDX module is ready for KVM to use.
> > - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> > Return the system wide information about the TDX module. NULL if the TDX
> > isn't initialized.
> > - u32 tdx_get_global_keyid(void);
> > Return global key id that is used for the TDX module itself.
> > - int tdx_keyid_alloc(void);
> > Allocate HKID for guest TD.
> > - void tdx_keyid_free(int keyid);
> > Free HKID for guest TD.
> >
> > (****)
> > * TDX KVM high-level design
> > - Host key ID management
> > Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> > It is assumed The TDX host patch series implements necessary functions,
> > u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void) and,
> > void tdx_keyid_free(int keyid).
> >
> > - Data structures and VM type
> > Because TDX is different from VMX, define its own VM/VCPU structures, struct
> > kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
> > identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> > TDX, is used.
> >
> > - VM life cycle and TDX specific operations
> > Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> > New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> > parameters, set initial guest memory and measurement.
> >
> > The creation of TDX VM requires five additional operations in addition to the
> > conventional VM creation.
> > - Get KVM system capability to check if TDX VM type is supported
> > - VM creation (KVM_CREATE_VM)
> > - New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
> > - New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
> > - VCPU creation (KVM_CREATE_VCPU)
> > - New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
> > - New: Initialize guest memory as boot state and extend the measurement with
> > the memory. KVM_TDX_INIT_MEM_REGION.
> > - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> > TDX VM contents.
> > - VCPU RUN (KVM_VCPU_RUN)
> >
> > - Protected guest state
> > Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> > can't operate on them. For example, accessing CPU registers, injecting
> > exceptions, and accessing guest memory. Those operations are handled as
> > silently ignored, returning zero or initial reset value when it's requested via
> > KVM API ioctls.
> >
> > VM/VCPU state and callbacks for TDX specific operations.
> > Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
> > operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
> >
> > Operations on the CPU state
> > silently ignore operations on the guest state. For example, the write to
> > CPU registers is ignored and the read from CPU registers returns 0.
> >
> > . ignore access to CPU registers except for allowed ones.
> > . TSC: add a check if tsc is immutable and return an error. Because the KVM
> > implementation updates the internal tsc state and it's difficult to back
> > out those changes. Instead, skip the logic.
> > . dirty logging: add check if dirty logging is supported.
> > . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> >
> > Note: virtual external interrupt and NMI can be injected into TDX guests.
> >
> > - KVM MMU integration
> > One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> > the guest physical address is private (the bit is cleared) or shared (the bit is
> > set). The bits are called stolen bits.
> >
> > - Stolen bits framework
> > systematically tracks which guest physical address, shared or private, is
> > used.
> >
> > - Shared EPT and secure EPT
> > There are two EPTs. Shared EPT (the conventional one) and Secure
> > EPT(the new one). Shared EPT is handled the same for the stolen
> > bit set. Secure EPT points to private guest pages. To resolve
> > EPT violation, KVM walks one of two EPTs based on faulted GPA.
> > Because it's costly to access secure EPT during walking EPTs with
> > SEAMCALLs for the private guest physical address, another private
> > EPT is used as a shadow of Secure-EPT with the existing logic at
> > the cost of extra memory.
> >
> > The following depicts the relationship.
> >
> > KVM | TDX module
> > | | |
> > -------------+---------- | |
> > | | | |
> > V V | |
> > shared GPA private GPA | |
> > CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
> > | | | |
> > | | | |
> > V V | V
> > shared EPT private EPT--------mirror----->Secure EPT
> > | | | |
> > | \--------------------+------\ |
> > | | | |
> > V | V V
> > shared guest page | private guest page
> > |
> > |
> > non-encrypted memory | encrypted memory
> > |
> >
> > - Operating on Secure EPT
> > Use the TDX module APIs to operate on Secure EPT. To call the TDX API
> > during resolving EPT violation, add hooks to additional operation and wiring
> > it to TDX backend.
> >
> > * References
> >
> > [1] TDX specification
> > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > [2] Intel Trust Domain Extensions (Intel TDX)
> > https://cdrdv2.intel.com/v1/dl/getContent/726790
> > [3] Intel CPU Architectural Extensions Specification
> > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> > [4] Intel TDX Module 1.0 Specification
> > https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> > [5] Intel TDX Loader Interface Specification
> > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > [6] Intel TDX Guest-Hypervisor Communication Interface
> > https://cdrdv2.intel.com/v1/dl/getContent/726790
> > [7] Intel TDX Virtual Firmware Design Guide
> > https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> > [8] intel public github
> > kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> > TDX guest branch: https://github.com/intel/tdx/tree/guest
> > qemu TDX https://github.com/intel/qemu-tdx
> > [9] TDVF
> > https://github.com/tianocore/edk2-staging/tree/TDVF
> > This was merged into EDK2 main branch. https://github.com/tianocore/edk2
> >
> > Chao Gao (3):
> > KVM: x86: Move check_processor_compatibility from init ops to runtime
> > ops
> > Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> > arch funcs"
> > KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> > wrmsr
> >
> > Isaku Yamahata (74):
> > KVM: Refactor CPU compatibility check on module initialiization
> > x86/virt/vmx/tdx: export platform_has_tdx
> > KVM: TDX: Detect CPU feature on kernel module initialization
> > KVM: x86: Refactor KVM VMX module init/exit functions
> > KVM: TDX: Add placeholders for TDX VM/vcpu structure
> > x86/virt/tdx: Add a helper function to return system wide info about
> > TDX module
> > KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> > KVM: TDX: Make TDX VM type supported
> > [MARKER] The start of TDX KVM patch series: TDX architectural
> > definitions
> > KVM: TDX: Define TDX architectural definitions
> > KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> > KVM: TDX: Add helper functions to print TDX SEAMCALL error
> > [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> > x86/cpu: Add helper functions to allocate/free TDX private host key id
> > KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> > KVM: TDX: Make KVM_CAP_SET_IDENTITY_MAP_ADDR unsupported for TDX
> > KVM: TDX: Make pmu_intel.c ignore guest TD case
> > [MARKER] The start of TDX KVM patch series: TD vcpu
> > creation/destruction
> > KVM: TDX: allocate/free TDX vcpu structure
> > KVM: TDX: allocate/free TDX vcpu structure
> > [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> > KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> > [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> > TDX
> > KVM: x86/mmu: Disallow fast page fault on private GPA
> > KVM: VMX: Introduce test mode related to EPT violation VE
> > [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> > KVM: x86/mmu: Focibly use TDP MMU for TDX
> > KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> > KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> > KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> > [MARKER] The start of TDX KVM patch series: TDX EPT violation
> > KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> > KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> > KVM: TDX: TDP MMU TDX support
> > [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> > KVM: x86/mmu: steal software usable git to record if GFN is for shared
> > or not
> > KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> > [MARKER] The start of TDX KVM patch series: TD finalization
> > KVM: TDX: Create initial guest memory
> > KVM: TDX: Finalize VM initialization
> > [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> > KVM: TDX: Add helper assembly function to TDX vcpu
> > KVM: TDX: Implement TDX vcpu enter/exit path
> > KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> > KVM: TDX: restore host xsave state when exit from the guest TD
> > KVM: TDX: restore user ret MSRs
> > [MARKER] The start of TDX KVM patch series: TD vcpu
> > exits/interrupts/hypercalls
> > KVM: TDX: complete interrupts after tdexit
> > KVM: TDX: restore debug store when TD exit
> > KVM: TDX: handle vcpu migration over logical processor
> > KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> > behavior
> > KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> > KVM: TDX: Implement interrupt injection
> > KVM: TDX: Implements vcpu request_immediate_exit
> > KVM: TDX: Implement methods to inject NMI
> > KVM: TDX: Add a place holder to handle TDX VM exit
> > KVM: TDX: handle EXIT_REASON_OTHER_SMI
> > KVM: TDX: handle ept violation/misconfig exit
> > KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> > KVM: TDX: Add a place holder for handler of TDX hypercalls
> > (TDG.VP.VMCALL)
> > KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> > KVM: TDX: Handle TDX PV CPUID hypercall
> > KVM: TDX: Handle TDX PV HLT hypercall
> > KVM: TDX: Handle TDX PV port io hypercall
> > KVM: TDX: Implement callbacks for MSR operations for TDX
> > KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> > KVM: TDX: Handle TDX PV report fatal error hypercall
> > KVM: TDX: Handle TDX PV map_gpa hypercall
> > KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> > KVM: TDX: Silently discard SMI request
> > KVM: TDX: Silently ignore INIT/SIPI
> > Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> > KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> > [MARKER] the end of (the first phase of) TDX KVM patch series
> >
> > Rick Edgecombe (1):
> > KVM: x86/mmu: Add address conversion functions for TDX shared bits
> >
> > Sean Christopherson (25):
> > KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> > KVM: Enable hardware before doing arch VM initialization
> > KVM: x86: Introduce vm_type to differentiate default VMs from
> > confidential VMs
> > KVM: TDX: Add TDX "architectural" error codes
> > KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> > KVM: TDX: create/destroy VM structure
> > KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> > KVM: TDX: Do TDX specific vcpu initialization
> > KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> > KVM: x86/mmu: Allow non-zero value for non-present SPTE
> > KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> > KVM: x86/mmu: Allow per-VM override of the TDP max page level
> > KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> > private mmu
> > KVM: x86/mmu: Disallow dirty logging for x86 TDX
> > KVM: VMX: Split out guts of EPT violation to common/exposed function
> > KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> > KVM: TDX: Add load_mmu_pgd method for TDX
> > KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> > KVM: TDX: Add support for find pending IRQ in a protected local APIC
> > KVM: x86: Assume timer IRQ was injected if APIC state is proteced
> > KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> > argument
> > KVM: VMX: Move NMI/exception handler to common helper
> > KVM: x86: Split core of hypercall emulation to helper function
> > KVM: TDX: Handle TDX PV MMIO hypercall
> > KVM: TDX: Add methods to ignore accesses to CPU state
> >
> > Xiaoyao Li (1):
> > KVM: TDX: initialize VM with TDX specific parameters
> >
> > Documentation/virt/kvm/api.rst | 30 +-
> > Documentation/virt/kvm/intel-tdx.rst | 381 ++++
> > Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 +++++
> > arch/arm64/kvm/arm.c | 2 +-
> > arch/mips/kvm/mips.c | 14 +-
> > arch/powerpc/kvm/powerpc.c | 2 +-
> > arch/riscv/kvm/main.c | 2 +-
> > arch/s390/kvm/kvm-s390.c | 2 +-
> > arch/x86/events/intel/ds.c | 1 +
> > arch/x86/include/asm/kvm-x86-ops.h | 10 +
> > arch/x86/include/asm/kvm_host.h | 56 +-
> > arch/x86/include/asm/tdx.h | 66 +
> > arch/x86/include/asm/vmx.h | 14 +
> > arch/x86/include/uapi/asm/kvm.h | 95 +
> > arch/x86/include/uapi/asm/vmx.h | 5 +-
> > arch/x86/kvm/Kconfig | 4 +
> > arch/x86/kvm/Makefile | 3 +-
> > arch/x86/kvm/irq.c | 3 +
> > arch/x86/kvm/lapic.c | 37 +-
> > arch/x86/kvm/lapic.h | 2 +
> > arch/x86/kvm/mmu.h | 48 +-
> > arch/x86/kvm/mmu/mmu.c | 371 +++-
> > arch/x86/kvm/mmu/mmu_internal.h | 120 ++
> > arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
> > arch/x86/kvm/mmu/spte.c | 46 +-
> > arch/x86/kvm/mmu/spte.h | 65 +-
> > arch/x86/kvm/mmu/tdp_iter.c | 1 +
> > arch/x86/kvm/mmu/tdp_iter.h | 5 +-
> > arch/x86/kvm/mmu/tdp_mmu.c | 683 ++++++-
> > arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
> > arch/x86/kvm/svm/svm.c | 13 +-
> > arch/x86/kvm/vmx/common.h | 154 ++
> > arch/x86/kvm/vmx/evmcs.c | 2 +-
> > arch/x86/kvm/vmx/evmcs.h | 2 +-
> > arch/x86/kvm/vmx/main.c | 1073 ++++++++++
> > arch/x86/kvm/vmx/pmu_intel.c | 33 +
> > arch/x86/kvm/vmx/pmu_intel.h | 29 +
> > arch/x86/kvm/vmx/posted_intr.c | 43 +-
> > arch/x86/kvm/vmx/posted_intr.h | 13 +
> > arch/x86/kvm/vmx/tdx.c | 2470 ++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx.h | 275 +++
> > arch/x86/kvm/vmx/tdx_arch.h | 157 ++
> > arch/x86/kvm/vmx/tdx_errno.h | 29 +
> > arch/x86/kvm/vmx/tdx_error.c | 22 +
> > arch/x86/kvm/vmx/tdx_ops.h | 188 ++
> > arch/x86/kvm/vmx/vmenter.S | 146 ++
> > arch/x86/kvm/vmx/vmx.c | 716 +++----
> > arch/x86/kvm/vmx/vmx.h | 41 +-
> > arch/x86/kvm/vmx/x86_ops.h | 235 +++
> > arch/x86/kvm/x86.c | 155 +-
> > arch/x86/virt/vmx/tdx/seamcall.S | 1 +
> > arch/x86/virt/vmx/tdx/tdx.c | 53 +-
> > arch/x86/virt/vmx/tdx/tdx.h | 52 -
> > include/linux/kvm_host.h | 4 +-
> > include/uapi/linux/kvm.h | 2 +
> > tools/arch/x86/include/uapi/asm/kvm.h | 95 +
> > tools/include/uapi/linux/kvm.h | 1 +
> > virt/kvm/kvm_main.c | 67 +-
> > 58 files changed, 7839 insertions(+), 783 deletions(-)
> > create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> > create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> > create mode 100644 arch/x86/kvm/vmx/common.h
> > create mode 100644 arch/x86/kvm/vmx/main.c
> > create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> > create mode 100644 arch/x86/kvm/vmx/tdx.c
> > create mode 100644 arch/x86/kvm/vmx/tdx.h
> > create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> > create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> > create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> > create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> > create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> >
>
--
Isaku Yamahata <[email protected]>
On Thu, May 5, 2022 at 11:16 AM <[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> Wire up TDX PV rdmsr/wrmsr hypercall to the KVM backend function.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 37 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f46825843a8b..1518a8c310d6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1169,6 +1169,39 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
> +{
> + u32 index = tdvmcall_a0_read(vcpu);
> + u64 data;
> +
> + if (kvm_get_msr(vcpu, index, &data)) {
kvm_get_msr and kvm_set_msr used to check the MSR permissions using
kvm_msr_allowed but that behaviour changed in "KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr".
Now kvm_get_msr and kvm_set_msr skip these checks and will allow
access regardless of the permissions in the msr_filter.
These should be changed to kvm_get_msr_with_filter and
kvm_set_msr_with_filter or something similar that checks permissions
for MSR access.
> + trace_kvm_msr_read_ex(index);
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> + return 1;
> + }
> + trace_kvm_msr_read(index, data);
> +
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> + tdvmcall_set_return_val(vcpu, data);
> + return 1;
> +}
> +
> +static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
> +{
> + u32 index = tdvmcall_a0_read(vcpu);
> + u64 data = tdvmcall_a1_read(vcpu);
> +
> + if (kvm_set_msr(vcpu, index, data)) {
> + trace_kvm_msr_write_ex(index, data);
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> + return 1;
> + }
> +
> + trace_kvm_msr_write(index, data);
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> + return 1;
> +}
> +
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
> if (tdvmcall_exit_type(vcpu))
> @@ -1183,6 +1216,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> return tdx_emulate_io(vcpu);
> case EXIT_REASON_EPT_VIOLATION:
> return tdx_emulate_mmio(vcpu);
> + case EXIT_REASON_MSR_READ:
> + return tdx_emulate_rdmsr(vcpu);
> + case EXIT_REASON_MSR_WRITE:
> + return tdx_emulate_wrmsr(vcpu);
> default:
> break;
> }
> --
> 2.25.1
>
Sagi
On Thu, May 5, 2022 at 11:16 AM <[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Export kvm_io_bus_read and kvm_mmio tracepoint and wire up TDX PV MMIO
> hypercall to the KVM backend functions.
>
> kvm_io_bus_read/write() searches KVM device emulated in kernel of the given
> MMIO address and emulates the MMIO. As TDX PV MMIO also needs it, export
> kvm_io_bus_read(). kvm_io_bus_write() is already exported. TDX PV MMIO
> emulates some of MMIO itself. To add trace point consistently with x86
> kvm, export kvm_mmio tracepoint.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> virt/kvm/kvm_main.c | 2 +
> 3 files changed, 117 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ee0cf5336ade..6ab4a52fc9e9 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1057,6 +1057,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> +static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
> +{
> + unsigned long val = 0;
> + gpa_t gpa;
> + int size;
> +
> + WARN_ON(vcpu->mmio_needed != 1);
> + vcpu->mmio_needed = 0;
> +
> + if (!vcpu->mmio_is_write) {
> + gpa = vcpu->mmio_fragments[0].gpa;
> + size = vcpu->mmio_fragments[0].len;
> +
> + memcpy(&val, vcpu->run->mmio.data, size);
> + tdvmcall_set_return_val(vcpu, val);
> + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> + }
> + return 1;
> +}
> +
> +static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
> + unsigned long val)
> +{
> + if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> + kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> + return -EOPNOTSUPP;
> +
> + trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
> + return 0;
> +}
> +
> +static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
> +{
> + unsigned long val;
> +
> + if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> + kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> + return -EOPNOTSUPP;
> +
> + tdvmcall_set_return_val(vcpu, val);
> + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> + return 0;
> +}
> +
> +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_memory_slot *slot;
> + int size, write, r;
> + unsigned long val;
> + gpa_t gpa;
> +
> + WARN_ON(vcpu->mmio_needed);
> +
> + size = tdvmcall_a0_read(vcpu);
> + write = tdvmcall_a1_read(vcpu);
> + gpa = tdvmcall_a2_read(vcpu);
> + val = write ? tdvmcall_a3_read(vcpu) : 0;
> +
> + if (size != 1 && size != 2 && size != 4 && size != 8)
> + goto error;
> + if (write != 0 && write != 1)
> + goto error;
> +
> + /* Strip the shared bit, allow MMIO with and without it set. */
> + gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
> +
> + if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
> + goto error;
> +
> + slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> + if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
> + goto error;
> +
> + if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> + trace_kvm_fast_mmio(gpa);
> + return 1;
> + }
> +
> + if (write)
> + r = tdx_mmio_write(vcpu, gpa, size, val);
> + else
> + r = tdx_mmio_read(vcpu, gpa, size);
> + if (!r) {
> + /* Kernel completed device emulation. */
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> + return 1;
> + }
> +
> + /* Request the device emulation to userspace device model. */
> + vcpu->mmio_needed = 1;
> + vcpu->mmio_is_write = write;
> + vcpu->arch.complete_userspace_io = tdx_complete_mmio;
> +
> + vcpu->run->mmio.phys_addr = gpa;
> + vcpu->run->mmio.len = size;
> + vcpu->run->mmio.is_write = write;
> + vcpu->run->exit_reason = KVM_EXIT_MMIO;
> +
> + if (write) {
> + memcpy(vcpu->run->mmio.data, &val, size);
> + } else {
> + vcpu->mmio_fragments[0].gpa = gpa;
> + vcpu->mmio_fragments[0].len = size;
> + trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
> + }
> + return 0;
> +
> +error:
> + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
We should return an error code here.
> + return 1;
> +}
> +
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
> if (tdvmcall_exit_type(vcpu))
> @@ -1069,6 +1181,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> return tdx_emulate_hlt(vcpu);
> case EXIT_REASON_IO_INSTRUCTION:
> return tdx_emulate_io(vcpu);
> + case EXIT_REASON_EPT_VIOLATION:
> + return tdx_emulate_mmio(vcpu);
> default:
> break;
> }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5f291470a6f6..f367d0dcef97 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13166,6 +13166,7 @@ bool kvm_arch_dirty_log_supported(struct kvm *kvm)
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4bf7178e42bd..7f01131666de 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2294,6 +2294,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>
> return NULL;
> }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>
> bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> {
> @@ -5169,6 +5170,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> r = __kvm_io_bus_read(vcpu, bus, &range, val);
> return r < 0 ? r : 0;
> }
> +EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>
> /* Caller must hold slots_lock. */
> int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> --
> 2.25.1
>
Sagi
On Fri, Jun 10, 2022 at 02:04:49PM -0700,
Sagi Shahar <[email protected]> wrote:
> On Thu, May 5, 2022 at 11:16 AM <[email protected]> wrote:
> >
> > From: Isaku Yamahata <[email protected]>
> >
> > Wire up TDX PV rdmsr/wrmsr hypercall to the KVM backend function.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/vmx/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
> > 1 file changed, 37 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index f46825843a8b..1518a8c310d6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1169,6 +1169,39 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> > return 1;
> > }
> >
> > +static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
> > +{
> > + u32 index = tdvmcall_a0_read(vcpu);
> > + u64 data;
> > +
> > + if (kvm_get_msr(vcpu, index, &data)) {
>
> kvm_get_msr and kvm_set_msr used to check the MSR permissions using
> kvm_msr_allowed but that behaviour changed in "KVM: x86: Only do MSR
> filtering when access MSR by rdmsr/wrmsr".
>
> Now kvm_get_msr and kvm_set_msr skip these checks and will allow
> access regardless of the permissions in the msr_filter.
>
> These should be changed to kvm_get_msr_with_filter and
> kvm_set_msr_with_filter or something similar that checks permissions
> for MSR access.
Thanks for pointing it out. I fixed it as adding kvm_msr_allowed()
--
Isaku Yamahata <[email protected]>
On Tue, Jun 14, 2022 at 11:08:16AM -0700,
Sagi Shahar <[email protected]> wrote:
> On Thu, May 5, 2022 at 11:16 AM <[email protected]> wrote:
> >
> > From: Sean Christopherson <[email protected]>
> >
> > Export kvm_io_bus_read and kvm_mmio tracepoint and wire up TDX PV MMIO
> > hypercall to the KVM backend functions.
> >
> > kvm_io_bus_read/write() searches KVM device emulated in kernel of the given
> > MMIO address and emulates the MMIO. As TDX PV MMIO also needs it, export
> > kvm_io_bus_read(). kvm_io_bus_write() is already exported. TDX PV MMIO
> > emulates some of MMIO itself. To add trace point consistently with x86
> > kvm, export kvm_mmio tracepoint.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/x86.c | 1 +
> > virt/kvm/kvm_main.c | 2 +
> > 3 files changed, 117 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index ee0cf5336ade..6ab4a52fc9e9 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1057,6 +1057,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
> > return ret;
> > }
> >
> > +static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
> > +{
> > + unsigned long val = 0;
> > + gpa_t gpa;
> > + int size;
> > +
> > + WARN_ON(vcpu->mmio_needed != 1);
> > + vcpu->mmio_needed = 0;
> > +
> > + if (!vcpu->mmio_is_write) {
> > + gpa = vcpu->mmio_fragments[0].gpa;
> > + size = vcpu->mmio_fragments[0].len;
> > +
> > + memcpy(&val, vcpu->run->mmio.data, size);
> > + tdvmcall_set_return_val(vcpu, val);
> > + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> > + }
> > + return 1;
> > +}
> > +
> > +static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
> > + unsigned long val)
> > +{
> > + if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> > + kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> > + return -EOPNOTSUPP;
> > +
> > + trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
> > + return 0;
> > +}
> > +
> > +static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
> > +{
> > + unsigned long val;
> > +
> > + if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> > + kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> > + return -EOPNOTSUPP;
> > +
> > + tdvmcall_set_return_val(vcpu, val);
> > + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> > + return 0;
> > +}
> > +
> > +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_memory_slot *slot;
> > + int size, write, r;
> > + unsigned long val;
> > + gpa_t gpa;
> > +
> > + WARN_ON(vcpu->mmio_needed);
> > +
> > + size = tdvmcall_a0_read(vcpu);
> > + write = tdvmcall_a1_read(vcpu);
> > + gpa = tdvmcall_a2_read(vcpu);
> > + val = write ? tdvmcall_a3_read(vcpu) : 0;
> > +
> > + if (size != 1 && size != 2 && size != 4 && size != 8)
> > + goto error;
> > + if (write != 0 && write != 1)
> > + goto error;
> > +
> > + /* Strip the shared bit, allow MMIO with and without it set. */
> > + gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
> > +
> > + if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
> > + goto error;
> > +
> > + slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> > + if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
> > + goto error;
> > +
> > + if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> > + trace_kvm_fast_mmio(gpa);
> > + return 1;
> > + }
> > +
> > + if (write)
> > + r = tdx_mmio_write(vcpu, gpa, size, val);
> > + else
> > + r = tdx_mmio_read(vcpu, gpa, size);
> > + if (!r) {
> > + /* Kernel completed device emulation. */
> > + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> > + return 1;
> > + }
> > +
> > + /* Request the device emulation to userspace device model. */
> > + vcpu->mmio_needed = 1;
> > + vcpu->mmio_is_write = write;
> > + vcpu->arch.complete_userspace_io = tdx_complete_mmio;
> > +
> > + vcpu->run->mmio.phys_addr = gpa;
> > + vcpu->run->mmio.len = size;
> > + vcpu->run->mmio.is_write = write;
> > + vcpu->run->exit_reason = KVM_EXIT_MMIO;
> > +
> > + if (write) {
> > + memcpy(vcpu->run->mmio.data, &val, size);
> > + } else {
> > + vcpu->mmio_fragments[0].gpa = gpa;
> > + vcpu->mmio_fragments[0].len = size;
> > + trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
> > + }
> > + return 0;
> > +
> > +error:
> > + tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
>
> We should return an error code here.
Yes, I'll fix it as follows. Thanks for catching it.
error:
- tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
return 1;
}
--
Isaku Yamahata <[email protected]>