From: Isaku Yamahata <[email protected]>
KVM TDX basic feature support
Hello. This is v7 of the patch series for KVM TDX support.
This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM
Major changes from v6:
- rebased to v5.19 base
TODO:
- integrate fd-based guest memory. As the discussion is still on-going, I
intentionally dropped fd-based guest memory support for now. The integration can
be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
- 2M large page support. It's work-in-progress.
For large page support, there are several design choices. Here are the design options.
Any thoughts/feedback?
KVM MMU Large page support for TDX
* What needs to be done
- Track whether each page is private or shared at each page size (4KB, 2MB,
1GB) based on TDG.VP.VMCALL<MapGPA>. A large page (2MB, 1GB) can be mixed (some
lower-size pages are private and some are shared). In that case, the page can't
be mapped as a large page.
- If necessary, split large pages on TDG.VP.VMCALL<MapGPA>
(splitting on dirty page tracking is future work)
- resolving KVM page fault
When resolving a private page and the page is large in the host, the GPA can be
resolved as a large page in Secure-EPT. Even if the page is large on the host
side, it is sometimes resolved as a 4KB page because it's up to the guest TD to
accept the page at 4KB, 2MB, or 1GB.
- collapsing pages into a large page.
At this point, it's okay to not implement this. When dirty page tracking is
supported, this needs to be supported.
- On MapGPA, the page can be collapsed into a large page
- handle zapping SPTE and try to collapse the pages on the next KVM page fault
Unlike the EPT case, some trick is needed.
- For performance, optimize KVM page fault path at the cost of complicating
MapGPA path.
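The level selection described above could be sketched as follows. This is a
minimal illustration and not code from the series: private_map_level(), the
LVL_* names, and the mixed_2m/mixed_1g flags are hypothetical, and the caller is
assumed to already know the host level, the level the guest TD accepts at, and
whether the enclosing ranges are mixed.

```c
#include <assert.h>
#include <stdbool.h>

/* Page levels, smallest first. The names are illustrative only. */
enum pg_level_sketch { LVL_4K = 1, LVL_2M = 2, LVL_1G = 3 };

/*
 * Hypothetical helper: pick the level at which a private GPA is mapped.
 * host_level:   largest level the host backing page allows
 * accept_level: level at which the guest TD accepts the page
 * mixed_2m/1g:  true if the enclosing 2MB/1GB range mixes private and shared
 */
static int private_map_level(int host_level, int accept_level,
                             bool mixed_2m, bool mixed_1g)
{
    int level = host_level < accept_level ? host_level : accept_level;

    assert(level >= LVL_4K && level <= LVL_1G);
    if (level == LVL_1G && mixed_1g)
        level = LVL_2M;     /* a mixed 1GB range can't be one large page */
    if (level == LVL_2M && mixed_2m)
        level = LVL_4K;     /* likewise for a mixed 2MB range */
    return level;
}
```

The fault path only compares levels and flags here; any scan of the lower-size
pages is assumed to have happened earlier, on MapGPA.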
* options to track private or shared
At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
1GB only). For each 4KB page, 1 bit is needed (private or shared). For each
large page (2MB and 1GB), 2 bits are needed (private, shared, or mixed). For
performance, when resolving a KVM page fault we don't want to check the
lower-size pages to decide whether the given GPA can be mapped as a large page.
Do that check on MapGPA instead.
Option A). enhance kvm_arch_memory_slot
enum kvm_page_type {
KVM_PAGE_TYPE_INVALID,
KVM_PAGE_TYPE_SHARED,
KVM_PAGE_TYPE_PRIVATE,
KVM_PAGE_TYPE_MIXED,
};
struct kvm_page_attr {
enum kvm_page_type type;
};
struct kvm_arch_memory_slot {
+ struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
If !SPTE_MIXED_MASK, it can be large page.
Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
* comparison
A).
+ straightforward to implement
+ SPTE_SHARED_MASK isn't needed
- memory overhead compared to B). or C).
- more memory reference on KVM page fault
B).
+ simpler than C) (more complex than A)?)
+ efficient on KVM page fault. (only SPTE reference)
+ low memory overhead
- Waste precious SPTE bits.
C).
+ efficient on KVM page fault. (only SPTE reference)
+ low memory overhead
- complicates MapGPA
- scattered data structure
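To make the trade-off concrete, here is a toy model in the spirit of option A).
It is only a sketch under assumptions: struct range_2m, map_gpa() and
can_map_2m() are hypothetical names, the model covers a single 2MB-aligned
range, and it spends a whole enum per page rather than the 1 or 2 bits
discussed above. The MapGPA path pays for the scan so that the page fault path
is a single lookup.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_2M 512

enum page_type { PT_SHARED, PT_PRIVATE, PT_MIXED };

/* Toy model of one 2MB-aligned range; not the kvm_arch_memory_slot layout. */
struct range_2m {
    enum page_type small[PAGES_PER_2M]; /* per-4KB state: shared or private */
    enum page_type large;               /* cached 2MB state, may be mixed */
};

/* MapGPA path: convert [start, start + npages) and refresh the cached state. */
static void map_gpa(struct range_2m *r, size_t start, size_t npages,
                    enum page_type to)
{
    size_t i, nr_private = 0;

    assert(to != PT_MIXED && start + npages <= PAGES_PER_2M);
    for (i = start; i < start + npages; i++)
        r->small[i] = to;
    for (i = 0; i < PAGES_PER_2M; i++)
        nr_private += (r->small[i] == PT_PRIVATE);

    if (nr_private == 0)
        r->large = PT_SHARED;
    else if (nr_private == PAGES_PER_2M)
        r->large = PT_PRIVATE;
    else
        r->large = PT_MIXED;
}

/* Page fault path: a single lookup, no scan of the 4KB entries. */
static int can_map_2m(const struct range_2m *r)
{
    return r->large != PT_MIXED;
}
```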
Thanks,
Isaku Yamahata
Changes from v6:
- rebased to v5.19
Changes from v5:
- export __seamcall and use it
- move mutex lock from callee function of smp_call_on_cpu to the caller.
- rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
- updated comment
- drop the use of tdh_mng_key.reclaimid(): as the function is for backward
compatibility to only return success
- struct kvm_tdx_cmd: metadata => flags, added __u64 error.
- make this ioctl systemwide ioctl
- ABI change to struct kvm_init_vm
- guest_tsc_khz: use kvm->arch.default_tsc_khz
- rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
- drop exporting kvm_set_tsc_khz().
- fix kvm_tdp_page_fault() for mtrr emulation
- rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
- drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
- update commit message
- rename shadow_init_value => shadow_nonpresent_value
- added ept_violation_ve_test mode
- shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
- legacy MMU case
=> - mmu_topup_shadow_page_cache(), kvm_mmu_create()
- FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
- #VE warning:
- rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
- merged, as discussed, with the patch
"KVM: x86/mmu: Allow non-zero init value for shadow PTE".
- fix pointed out by Sagi: !is_private check => (kvm_gfn_shared_mask && !is_private)
- introduce kvm_gfn_for_root(kvm, root, gfn)
- add only_shared argument to kvm_tdp_mmu_handle_gfn()
- use kvm_arch_dirty_log_supported()
- rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
- rename: is_private_prohibit_spte() => spte_shared_mask()
- fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
- dropped this patch as the change was merged into kvm/queue
- update vt_apicv_post_state_restore()
- use is_64_bit_hypercall()
- comment: expand MSMI -> Machine Check System Management Interrupt
- fixed TDX_SEPT_PFERR
- tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
- rename tdvmcall_exit_reason() => tdvmcall_leaf()
- remove optional zero check of argument.
- do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
in kvm_vcpu_ioctl_smi and __apic_accept_irq.
- WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
- introduce vcpu_deliver_init to x86_ops
- sprinkled KVM_BUG_ON()
Changes from v4:
- rebased to TDX host kernel patch series.
- include all the patches to make this patch series working.
- add [MARKER] patches to mark the patch layer clear.
---
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.
A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).
In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.
* The organization of this patch series
This patch series is on top of the patch series "TDX host kernel support":
https://lore.kernel.org/lkml/[email protected]/
This patch series is available at
https://github.com/intel/tdx/releases/tag/kvm-upstream
The corresponding patches to qemu are available at
https://github.com/intel/qemu-tdx/commits/tdx-upstream
The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.
The layers below are chosen so that the device model (for example, qemu) can
exercise each layer step by step: check if TDX is supported, create TD VM,
create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
vcpu exits/hypercalls/interrupts to run TD fully.
TDX vcpu
interrupt/exits/hypercall<------------\
^ |
| |
TD finalization |
^ |
| |
TDX EPT violation<------------\ |
^ | |
| | |
TD vcpu enter/exit | |
^ | |
| | |
TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
^ | ^
| | |
TD VM creation/destruction \---------------KVM TDP MMU hooks
^ ^
| |
TDX architectural definitions KVM TDP refactoring for TDX
^ ^
| |
TDX, VMX <--------TDX host kernel KVM MMU GPA stolen bits
coexistence support
The following are explanations of each layer. Each layer has a dummy commit
that starts with [MARKER] in its subject. It is intended to help identify where
each layer starts.
TDX host kernel support:
https://lore.kernel.org/lkml/[email protected]/
The guts of system-wide initialization of TDX module. There is an
independent patch series for host x86. TDX KVM patches call functions
this patch series provides to initialize the TDX module.
TDX, VMX coexistence:
Infrastructure to allow TDX to coexist with VMX and trigger the
initialization of the TDX module.
This layer starts with
"KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
Add TDX architectural definitions and helper functions
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
Guest TD creation/destruction: allocation and release of TDX specific vm
and vcpu structures. Create an initial guest memory image with TDX
measurement.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
Guest TD creation/destruction: allocation and release of TDX specific vm
and vcpu structures. Create an initial guest memory image with TDX
measurement.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
Create an initial guest memory image with TDX measurement. Handle
secure EPT violations to populate guest pages with TDX SEAMCALLs.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
entering into TD. Restore CPU state after exiting from TD.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
Handle various exits/hypercalls and allow interrupts to be injected so
that TD vcpu can continue running.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
KVM MMU GPA shared bit:
Introduce a framework to handle the shared bit of GPA. TDX
repurposed a bit of GPA to indicate shared or private. If it's shared,
it's the same as the conventional VMX EPT case. VMM can access shared
guest pages. If it's private, it's handled by Secure-EPT and the guest
page is encrypted.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
TDX Secure EPT requires different constants, e.g. the initial value of an
EPT entry. Various refactoring for those differences.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
Introduce a framework to TDP MMU to add hooks in addition to direct EPT
access. TDX added Secure EPT, which is an enhancement to VMX EPT. Unlike
conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
use TDX SEAMCALLs to operate on Secure EPT.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
Introduce framework to handle switching guest pages from private/shared
to shared/private. For a given GPA, a guest page can be assigned to a
private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
guest TD converts GPA assignments from private (or shared) to shared (or
private).
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
Guest private memory requires different memory management in KVM. The
patch proposes a way for it. Integration with TDX KVM.
(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.
The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.
TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode. A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and is encrypted (using the guest TD's ephemeral private key). At a high
level, TDCS holds information for controlling TD operation as a whole
(execution, EPTP, MSR bitmaps, etc.) that KVM needs to set up. Note that MSR
bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
same value for all VCPUs of the same TD.
Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.
Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM. These control
structures are pointed to by fields in the TD VMCS.
The above means that 1) KVM needs to allocate different data structures for TDs,
2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
define TD-specific handling for others. For 3), redirect operations to the TDX
specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
vmx_callback();".
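The redirection pattern quoted above can be illustrated with a small,
self-contained sketch. Only is_td_vcpu() is a name taken from this letter;
struct vcpu_sketch, the backend enum and the vt_/vmx_/tdx_ function names are
hypothetical stand-ins, not the definitions used by the patches.

```c
#include <stdbool.h>

enum backend { BACKEND_VMX, BACKEND_TDX };

/* Minimal stand-in for a vcpu; the real structures are far larger. */
struct vcpu_sketch { bool is_td; };

static bool is_td_vcpu(const struct vcpu_sketch *vcpu)
{
    return vcpu->is_td;
}

/* Hypothetical per-backend implementations of one operation. */
static enum backend vmx_vcpu_run(struct vcpu_sketch *vcpu)
{
    (void)vcpu;
    return BACKEND_VMX;
}

static enum backend tdx_vcpu_run(struct vcpu_sketch *vcpu)
{
    (void)vcpu;
    return BACKEND_TDX;
}

/* Common entry point that redirects to the TDX or VMX callback. */
static enum backend vt_vcpu_run(struct vcpu_sketch *vcpu)
{
    if (is_td_vcpu(vcpu))
        return tdx_vcpu_run(vcpu);
    return vmx_vcpu_run(vcpu);
}
```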
* TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.
In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.
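As an illustration of how an HKID can occupy the top bits of a physical
address, consider the sketch below. The helper names and the phys_bits and
keyid_bits parameters are assumptions made for this example; the real widths
are platform configuration and are not stated in this letter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed example layout: phys_bits physical address bits in total, of which
 * the top keyid_bits carry the HKID. Real values come from the platform.
 */
static uint64_t pa_with_hkid(uint64_t pa, unsigned int hkid,
                             unsigned int phys_bits, unsigned int keyid_bits)
{
    unsigned int shift = phys_bits - keyid_bits;

    assert(keyid_bits > 0 && keyid_bits < phys_bits && phys_bits <= 52);
    assert(hkid < (1ULL << keyid_bits));    /* HKID must fit in the stolen bits */
    assert(pa < (1ULL << shift));           /* address must not overlap them */
    return pa | ((uint64_t)hkid << shift);
}

/* Extract the HKID back out of a physical address with the same layout. */
static unsigned int hkid_of(uint64_t pa, unsigned int phys_bits,
                            unsigned int keyid_bits)
{
    return (unsigned int)((pa >> (phys_bits - keyid_bits)) &
                          ((1ULL << keyid_bits) - 1));
}
```

With HKID 0 the address is unchanged, which matches the point above that using
0 for the shared HKID lets the host ignore MKTME.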
During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX. Since guest TDs usually require I/O, and the data exchange
needs to be done via shared memory, KVM needs to use the current EPT
functionality even for TDs.
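A minimal sketch of classifying a GPA by its SHARED bit follows. The helper
names are hypothetical and the bit position is passed in as a parameter; this
letter mentions bit 51 or 47 as the candidates.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* shared_bit is the position of the SHARED bit in the GPA (e.g. 47 or 51). */
static gpa_t gpa_shared_mask(unsigned int shared_bit)
{
    return (gpa_t)1 << shared_bit;
}

/* A GPA is shared when the SHARED bit is set, private when it is clear. */
static bool gpa_is_shared(gpa_t gpa, unsigned int shared_bit)
{
    return (gpa & gpa_shared_mask(shared_bit)) != 0;
}

static gpa_t gpa_to_shared(gpa_t gpa, unsigned int shared_bit)
{
    return gpa | gpa_shared_mask(shared_bit);
}

static gpa_t gpa_to_private(gpa_t gpa, unsigned int shared_bit)
{
    return gpa & ~gpa_shared_mask(shared_bit);
}
```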
* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT.
Since execution of such interface functions takes much longer than
accessing memory directly, in KVM we use the existing TDP code to mirror the
Secure EPT for the TD.
This way, we can effectively walk Secure EPT without using the TDX interface
functions.
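The mirroring idea can be modelled with a toy example. Everything here is an
assumption made for illustration: the flat arrays stand in for multi-level page
tables, and module_set_entry() stands in for a slow TDX interface function; none
of these names come from the patches.

```c
#include <assert.h>
#include <stdint.h>

#define NR_GFNS 16

/* Stand-in for the Secure EPT, reachable only through the "module" call. */
static uint64_t secure_ept[NR_GFNS];
static unsigned int module_calls;

/* Hypothetical stand-in for a slow TDX interface function. */
static void module_set_entry(unsigned int gfn, uint64_t pfn)
{
    module_calls++;
    secure_ept[gfn] = pfn;
}

/* KVM-side mirror in ordinary memory, kept in sync on every update. */
static uint64_t mirror_ept[NR_GFNS];

static void private_map(unsigned int gfn, uint64_t pfn)
{
    assert(gfn < NR_GFNS);
    mirror_ept[gfn] = pfn;          /* update the mirror first */
    module_set_entry(gfn, pfn);     /* then propagate through the hook */
}

/* Walks read only the mirror, so they cost no module calls. */
static uint64_t private_lookup(unsigned int gfn)
{
    assert(gfn < NR_GFNS);
    return mirror_ept[gfn];
}
```

Only updates go through the slow path; lookups are served from the mirror at
the cost of the extra memory it occupies.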
* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.
* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
owns TSC virtualization for VMs, but the TDX module does for TDs.
* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCE. Instead, a PV way is needed
for the TD to communicate with the VMM. For now, KVM silently ignores MCE
requests from the VMM. MSRs related to MCE (e.g., MCE bank registers) can be
naturally emulated by paravirtualizing MSR access.
[1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
available.
* Restrictions or future work
Some features are not included to reduce patch size. Those features are
addressed as future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more
* Prerequisites
It's required to load the TDX module and initialize it. It's out of the scope
of this patch series. Another independent patch for the common x86 code is
planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
CONFIG_INTEL_TDX_HOST. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
module is initialized and the TDX module APIs for the TDX guest life cycle, like
tdh.mng.init, are ready for KVM to use.
Concretely, global initialization, LP (Logical Processor) initialization, global
configuration, the key configuration, and TDMR and PAMT initialization are done.
The state of the TDX module is SYS_READY. Please refer to the TDX module
specification, the chapter "Intel TDX Module Lifecycle State Machine".
** Detecting the TDX module readiness.
TDX host patch series implements the detection of the TDX module availability
and its initialization so that KVM can use it. Also it manages Host KeyID
(HKID) assigned to guest TD.
The assumed APIs the TDX host patch series provides are
- int seamrr_enabled()
Check if required cpu feature (SEAM mode) is available. This only checks CPU
feature availability. At this point, the TDX module may not be ready for KVM
to use.
- int init_tdx(void);
Initialization of TDX module so that the TDX module is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
Return the system wide information about the TDX module. NULL if the TDX
module isn't initialized.
- u32 tdx_get_global_keyid(void);
Return global key id that is used for the TDX module itself.
- int tdx_keyid_alloc(void);
Allocate HKID for guest TD.
- void tdx_keyid_free(int keyid);
Free HKID for guest TD.
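For readers unfamiliar with this kind of interface, the allocate/free pair
could behave like the toy allocator below. This is not the implementation from
the TDX host series: the toy_ names, the pool size, the -1 error value, and the
choice to start at key id 1 (because 0 is used for the shared HKID on the host)
are all assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FIRST_KEYID 1   /* assumption: 0 is kept for the host's shared HKID */
#define TOY_NR_KEYIDS   4   /* assumption: tiny pool, for illustration only */

static bool toy_keyid_used[TOY_NR_KEYIDS];

/* Returns a free key id, or -1 when all are in use. */
static int toy_keyid_alloc(void)
{
    for (int i = 0; i < TOY_NR_KEYIDS; i++) {
        if (!toy_keyid_used[i]) {
            toy_keyid_used[i] = true;
            return TOY_FIRST_KEYID + i;
        }
    }
    return -1;
}

/* Returns a previously allocated key id to the pool. */
static void toy_keyid_free(int keyid)
{
    int i = keyid - TOY_FIRST_KEYID;

    assert(i >= 0 && i < TOY_NR_KEYIDS && toy_keyid_used[i]);
    toy_keyid_used[i] = false;
}
```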
(****)
* TDX KVM high-level design
- Host key ID management
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
It is assumed that the TDX host patch series implements the necessary functions,
u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void) and,
void tdx_keyid_free(int keyid).
- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
identify the VM, introduce VM-type to specify which VM type, VMX (default) or
TDX, is used.
- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, set initial guest memory and measurement.
The creation of TDX VM requires five additional operations in addition to the
conventional VM creation.
- Get KVM system capability to check if TDX VM type is supported
- VM creation (KVM_CREATE_VM)
- New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
- New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
- VCPU creation (KVM_CREATE_VCPU)
- New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
- New: Initialize guest memory as boot state and extend the measurement with
the memory. KVM_TDX_INIT_MEM_REGION.
- New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
TDX VM contents.
- VCPU run (KVM_RUN)
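The ordering of the steps above can be modelled as a small state machine. This
is only a model of the listed sequence, not the ioctl interface itself: the
td_step enum, struct td_model and both helpers are hypothetical, and the model
simply enforces the order as listed (allowing repeated vcpu and memory region
steps), which may be stricter than what KVM actually requires.

```c
#include <stdbool.h>

enum td_step {
    TD_CREATED,         /* after KVM_CREATE_VM */
    TD_VM_INITED,       /* after KVM_TDX_INIT_VM */
    TD_VCPU_INITED,     /* after KVM_CREATE_VCPU and KVM_TDX_INIT_VCPU */
    TD_MEM_POPULATED,   /* after KVM_TDX_INIT_MEM_REGION */
    TD_FINALIZED,       /* after KVM_TDX_FINALIZE */
};

struct td_model { enum td_step step; };

/* Accept a step only if it is the next one, or a repeat of a repeatable one. */
static bool td_advance(struct td_model *td, enum td_step next)
{
    bool repeatable = next == TD_VCPU_INITED || next == TD_MEM_POPULATED;

    if (next == td->step + 1 || (repeatable && next == td->step)) {
        td->step = next;
        return true;
    }
    return false;
}

/* In this model a vcpu may run only once the TD has been finalized. */
static bool td_can_run(const struct td_model *td)
{
    return td->step == TD_FINALIZED;
}
```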
- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it, for example by accessing CPU registers, injecting
exceptions, or accessing guest memory. Such operations are silently ignored,
or return zero or the initial reset value, when requested via KVM API ioctls.
VM/VCPU state and callbacks for TDX specific operations.
Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
Operations on the CPU state
silently ignore operations on the guest state. For example, the write to
CPU registers is ignored and the read from CPU registers returns 0.
. ignore access to CPU registers except for allowed ones.
. TSC: add a check if tsc is immutable and return an error, because the KVM
implementation updates the internal tsc state and it's difficult to back
out those changes. Instead, skip the logic.
. dirty logging: add check if dirty logging is supported.
. exceptions/SMI/MCE/SIPI/INIT: silently ignore
Note: virtual external interrupt and NMI can be injected into TDX guests.
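The "silently ignore" behaviour for CPU registers can be sketched as below. The
structure and accessor names are hypothetical and only one register is
modelled; the point is that a write to protected state is dropped and a read
returns 0, while a conventional VM behaves as usual.

```c
#include <stdbool.h>
#include <stdint.h>

struct vcpu_regs_sketch {
    bool guest_state_protected;     /* true for a TD vcpu in this model */
    uint64_t rax;
};

/* Reads of protected state return 0 instead of the real value. */
static uint64_t get_rax(const struct vcpu_regs_sketch *vcpu)
{
    return vcpu->guest_state_protected ? 0 : vcpu->rax;
}

/* Writes to protected state are dropped without reporting an error. */
static void set_rax(struct vcpu_regs_sketch *vcpu, uint64_t val)
{
    if (vcpu->guest_state_protected)
        return;
    vcpu->rax = val;
}
```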
- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
the guest physical address is private (the bit is cleared) or shared (the bit is
set). The bits are called stolen bits.
- Stolen bits framework
systematically tracks which guest physical address, shared or private, is
used.
- Shared EPT and secure EPT
There are two EPTs: Shared EPT (the conventional one) and Secure
EPT (the new one). Shared EPT is handled the same way as before, for GPAs
with the stolen bit set. Secure EPT points to private guest pages. To resolve
an EPT violation, KVM walks one of the two EPTs based on the faulting GPA.
Because it's costly to access secure EPT during walking EPTs with
SEAMCALLs for the private guest physical address, another private
EPT is used as a shadow of Secure-EPT with the existing logic at
the cost of extra memory.
The following depicts the relationship.
KVM | TDX module
| | |
-------------+---------- | |
| | | |
V V | |
shared GPA private GPA | |
CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
| | | |
| | | |
V V | V
shared EPT private EPT--------mirror----->Secure EPT
| | | |
| \--------------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
|
non-encrypted memory | encrypted memory
|
- Operating on Secure EPT
Use the TDX module APIs to operate on Secure EPT. To call the TDX API
while resolving an EPT violation, add hooks for the additional operations and
wire them to the TDX backend.
* References
[1] TDX specification
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
https://cdrdv2.intel.com/v1/dl/getContent/726790
[3] Intel CPU Architectural Extensions Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 Specification
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
[5] Intel TDX Loader Interface Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
https://cdrdv2.intel.com/v1/dl/getContent/726790
[7] Intel TDX Virtual Firmware Design Guide
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
[8] intel public github
kvm TDX branch: https://github.com/intel/tdx/tree/kvm
TDX guest branch: https://github.com/intel/tdx/tree/guest
qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
https://github.com/tianocore/edk2-staging/tree/TDVF
This was merged into EDK2 main branch. https://github.com/tianocore/edk2
Chao Gao (3):
KVM: x86: Move check_processor_compatibility from init ops to runtime
ops
Partially revert "KVM: Pass kvm_init()'s opaque param to additional
arch funcs"
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
wrmsr
Isaku Yamahata (72):
KVM: Refactor CPU compatibility check on module initialiization
x86/virt/vmx/tdx: export platform_tdx_enabled()
KVM: TDX: Detect CPU feature on kernel module initialization
KVM: x86: Refactor KVM VMX module init/exit functions
KVM: TDX: Add placeholders for TDX VM/vcpu structure
x86/virt/tdx: Add a helper function to return system wide info about
TDX module
KVM: TDX: Initialize TDX module when loading kvm_intel.ko
KVM: TDX: Make TDX VM type supported
[MARKER] The start of TDX KVM patch series: TDX architectural
definitions
KVM: TDX: Define TDX architectural definitions
KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
KVM: TDX: Add helper functions to print TDX SEAMCALL error
[MARKER] The start of TDX KVM patch series: TD VM creation/destruction
x86/cpu: Add helper functions to allocate/free TDX private host key id
KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
KVM: TDX: Make pmu_intel.c ignore guest TD case
[MARKER] The start of TDX KVM patch series: TD vcpu
creation/destruction
KVM: TDX: allocate/free TDX vcpu structure
KVM: TDX: allocate/free TDX vcpu structure
[MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
KVM: x86/mmu: introduce config for PRIVATE KVM MMU
[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
TDX
KVM: x86/mmu: Disallow fast page fault on private GPA
KVM: VMX: Introduce test mode related to EPT violation VE
[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
KVM: x86/mmu: Focibly use TDP MMU for TDX
KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
[MARKER] The start of TDX KVM patch series: TDX EPT violation
KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
KVM: TDX: TDP MMU TDX support
[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
KVM: x86/mmu: steal software usable git to record if GFN is for shared
or not
KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
[MARKER] The start of TDX KVM patch series: TD finalization
KVM: TDX: Create initial guest memory
KVM: TDX: Finalize VM initialization
[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
KVM: TDX: Add helper assembly function to TDX vcpu
KVM: TDX: Implement TDX vcpu enter/exit path
KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
KVM: TDX: restore host xsave state when exit from the guest TD
KVM: TDX: restore user ret MSRs
[MARKER] The start of TDX KVM patch series: TD vcpu
exits/interrupts/hypercalls
KVM: TDX: complete interrupts after tdexit
KVM: TDX: restore debug store when TD exit
KVM: TDX: handle vcpu migration over logical processor
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
behavior
KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
KVM: TDX: Implement interrupt injection
KVM: TDX: Implements vcpu request_immediate_exit
KVM: TDX: Implement methods to inject NMI
KVM: TDX: Add a place holder to handle TDX VM exit
KVM: TDX: handle EXIT_REASON_OTHER_SMI
KVM: TDX: handle ept violation/misconfig exit
KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
KVM: TDX: Add a place holder for handler of TDX hypercalls
(TDG.VP.VMCALL)
KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
KVM: TDX: Handle TDX PV CPUID hypercall
KVM: TDX: Handle TDX PV HLT hypercall
KVM: TDX: Handle TDX PV port io hypercall
KVM: TDX: Implement callbacks for MSR operations for TDX
KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
KVM: TDX: Handle TDX PV report fatal error hypercall
KVM: TDX: Handle TDX PV map_gpa hypercall
KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
KVM: TDX: Silently discard SMI request
KVM: TDX: Silently ignore INIT/SIPI
Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
Rick Edgecombe (1):
KVM: x86/mmu: Add address conversion functions for TDX shared bits
Sean Christopherson (25):
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
KVM: Enable hardware before doing arch VM initialization
KVM: x86: Introduce vm_type to differentiate default VMs from
confidential VMs
KVM: TDX: Add TDX "architectural" error codes
KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
KVM: TDX: create/destroy VM structure
KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
KVM: TDX: Do TDX specific vcpu initialization
KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
KVM: x86/mmu: Allow non-zero value for non-present SPTE
KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
KVM: x86/mmu: Allow per-VM override of the TDP max page level
KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
private mmu
KVM: x86/mmu: Disallow dirty logging for x86 TDX
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: VMX: Move setting of EPT MMU masks to common VT-x code
KVM: TDX: Add load_mmu_pgd method for TDX
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
KVM: TDX: Add support for find pending IRQ in a protected local APIC
KVM: x86: Assume timer IRQ was injected if APIC state is proteced
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
argument
KVM: VMX: Move NMI/exception handler to common helper
KVM: x86: Split core of hypercall emulation to helper function
KVM: TDX: Handle TDX PV MMIO hypercall
KVM: TDX: Add methods to ignore accesses to CPU state
Xiaoyao Li (1):
KVM: TDX: initialize VM with TDX specific parameters
Documentation/virt/kvm/api.rst | 30 +-
.../virt/kvm/intel-tdx-layer-status.rst | 33 +
Documentation/virt/kvm/intel-tdx.rst | 381 +++
Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 ++++
arch/arm64/kvm/arm.c | 2 +-
arch/mips/kvm/mips.c | 14 +-
arch/powerpc/kvm/powerpc.c | 2 +-
arch/riscv/kvm/main.c | 2 +-
arch/s390/kvm/kvm-s390.c | 2 +-
arch/x86/events/intel/ds.c | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 10 +
arch/x86/include/asm/kvm_host.h | 56 +-
arch/x86/include/asm/tdx.h | 67 +
arch/x86/include/asm/vmx.h | 14 +
arch/x86/include/uapi/asm/kvm.h | 95 +
arch/x86/include/uapi/asm/vmx.h | 5 +-
arch/x86/kvm/Kconfig | 4 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/irq.c | 3 +
arch/x86/kvm/lapic.c | 37 +-
arch/x86/kvm/lapic.h | 2 +
arch/x86/kvm/mmu.h | 42 +-
arch/x86/kvm/mmu/mmu.c | 360 ++-
arch/x86/kvm/mmu/mmu_internal.h | 123 +-
arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
arch/x86/kvm/mmu/spte.c | 46 +-
arch/x86/kvm/mmu/spte.h | 65 +-
arch/x86/kvm/mmu/tdp_iter.c | 1 +
arch/x86/kvm/mmu/tdp_iter.h | 5 +-
arch/x86/kvm/mmu/tdp_mmu.c | 690 ++++-
arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
arch/x86/kvm/svm/svm.c | 13 +-
arch/x86/kvm/vmx/common.h | 174 ++
arch/x86/kvm/vmx/evmcs.c | 2 +-
arch/x86/kvm/vmx/evmcs.h | 2 +-
arch/x86/kvm/vmx/main.c | 1071 +++++++
arch/x86/kvm/vmx/pmu_intel.c | 39 +-
arch/x86/kvm/vmx/pmu_intel.h | 28 +
arch/x86/kvm/vmx/posted_intr.c | 43 +-
arch/x86/kvm/vmx/posted_intr.h | 13 +
arch/x86/kvm/vmx/tdx.c | 2465 +++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 275 ++
arch/x86/kvm/vmx/tdx_arch.h | 157 ++
arch/x86/kvm/vmx/tdx_errno.h | 29 +
arch/x86/kvm/vmx/tdx_error.c | 22 +
arch/x86/kvm/vmx/tdx_ops.h | 188 ++
arch/x86/kvm/vmx/vmenter.S | 146 +
arch/x86/kvm/vmx/vmx.c | 737 ++---
arch/x86/kvm/vmx/vmx.h | 39 +-
arch/x86/kvm/vmx/x86_ops.h | 235 ++
arch/x86/kvm/x86.c | 148 +-
arch/x86/virt/vmx/tdx/seamcall.S | 2 +
arch/x86/virt/vmx/tdx/tdx.c | 54 +-
arch/x86/virt/vmx/tdx/tdx.h | 52 -
include/linux/kvm_host.h | 4 +-
include/uapi/linux/kvm.h | 2 +
tools/arch/x86/include/uapi/asm/kvm.h | 95 +
tools/include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 67 +-
59 files changed, 7877 insertions(+), 804 deletions(-)
create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
create mode 100644 Documentation/virt/kvm/intel-tdx.rst
create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
create mode 100644 arch/x86/kvm/vmx/common.h
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx.h
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/kvm/vmx/tdx_error.c
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
--
2.25.1
From: Isaku Yamahata <[email protected]>
Wire up the TDX PV report fatal error hypercall to the KVM_EXIT_SYSTEM_EVENT
exit, using a new system event type, KVM_SYSTEM_EVENT_TDX.
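For illustration only, here is a minimal sketch of how a userspace VMM might
recognise this exit, going by the values this patch adds. The struct is a
hypothetical stand-in for the system_event member of struct kvm_run, and the
two constants are copied from this series because they are not in released
uapi headers.

```c
#include <stdint.h>
#include <stdio.h>

/* Values taken from this series; they are not in released uapi headers. */
#define KVM_SYSTEM_EVENT_TDX			7
#define TDG_VP_VMCALL_REPORT_FATAL_ERROR	0x10003

/* Hypothetical stand-in for the system_event member of struct kvm_run. */
struct system_event_sketch {
	uint32_t type;
	uint32_t ndata;
	uint64_t data[16];
};

/*
 * Return 1 if the event is a TD fatal-error report that the VMM should
 * treat as a guest crash, 0 otherwise.
 */
static int is_td_fatal_error(const struct system_event_sketch *ev)
{
	if (ev->type != KVM_SYSTEM_EVENT_TDX || ev->ndata < 3)
		return 0;
	if (ev->data[0] != TDG_VP_VMCALL_REPORT_FATAL_ERROR)
		return 0;

	/* data[1] and data[2] carry the guest's a0/a1 hypercall arguments. */
	fprintf(stderr, "TD fatal error: a0=0x%llx a1=0x%llx\n",
		(unsigned long long)ev->data[1],
		(unsigned long long)ev->data[2]);
	return 1;
}
```

A VMM would presumably call such a helper when KVM_RUN returns with
exit_reason set to KVM_EXIT_SYSTEM_EVENT and then tear the TD down.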
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 20 ++++++++++++++++++++
include/uapi/linux/kvm.h | 1 +
2 files changed, 21 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dc66c799cae8..00baecbb62ff 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1203,6 +1203,24 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Exit to userspace device model for teardown.
+ * Because guest TD is already panicking, returning an error to guest TD
+ * doesn't make sense. No argument check is done.
+ */
+
+ vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+ vcpu->run->system_event.type = KVM_SYSTEM_EVENT_TDX;
+ vcpu->run->system_event.ndata = 3;
+ vcpu->run->system_event.data[0] = TDG_VP_VMCALL_REPORT_FATAL_ERROR;
+ vcpu->run->system_event.data[1] = tdvmcall_a0_read(vcpu);
+ vcpu->run->system_event.data[2] = tdvmcall_a1_read(vcpu);
+
+ return 0;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1221,6 +1239,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_rdmsr(vcpu);
case EXIT_REASON_MSR_WRITE:
return tdx_emulate_wrmsr(vcpu);
+ case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+ return tdx_report_fatal_error(vcpu);
default:
break;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d6785d2685f..014337760dfa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -448,6 +448,7 @@ struct kvm_run {
#define KVM_SYSTEM_EVENT_WAKEUP 4
#define KVM_SYSTEM_EVENT_SUSPEND 5
#define KVM_SYSTEM_EVENT_SEV_TERM 6
+#define KVM_SYSTEM_EVENT_TDX 7
__u32 type;
__u32 ndata;
union {
--
2.25.1
From: Isaku Yamahata <[email protected]>
Because the debug store is clobbered, restore it on TD exit.
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/events/intel/ds.c | 1 +
arch/x86/kvm/vmx/tdx.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 376cc3d66094..cdba4227ad3b 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2256,3 +2256,4 @@ void perf_restore_debug_store(void)
wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
}
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c9cb9670f7cf..0de113a643e4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -663,6 +663,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(vcpu, tdx);
tdx_user_return_update_cache();
+ perf_restore_debug_store();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;
--
2.25.1
From: Isaku Yamahata <[email protected]>
To protect the initial contents of the guest TD, the TDX module measures
the guest TD during the build process and records the result as a SHA-384
measurement. This measurement must be finalized before the guest TD is
ready to run.
Add a new subcommand, KVM_TDX_FINALIZE_VM, to the VM-scoped
KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM as
ready to run.
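As a rough model of the ordering this subcommand enforces, the sketch below
mirrors the checks in tdx_td_finalizemr(). The struct and function names are
hypothetical, the SEAMCALL itself is omitted, and treating "initialized" as a
single flag is an assumption about what is_td_initialized() tests.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the per-TD state consulted by KVM_TDX_FINALIZE_VM. */
struct td_state_sketch {
	bool initialized;	/* assumed: set once the TD has been initialized */
	bool finalized;		/* set once the measurement is finalized */
};

/* Mirrors the checks in tdx_td_finalizemr(); the SEAMCALL is omitted. */
static int td_finalize_sketch(struct td_state_sketch *td)
{
	/* Finalizing an uninitialized TD, or finalizing twice, is rejected. */
	if (!td->initialized || td->finalized)
		return -EINVAL;

	td->finalized = true;
	return 0;
}
```

Under this model, a second KVM_TDX_FINALIZE_VM on the same VM fails with
-EINVAL rather than re-running the finalization.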
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
tools/arch/x86/include/uapi/asm/kvm.h | 1 +
3 files changed, 23 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index cb2b0701f0d9..2fe4cc497bc2 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -540,6 +540,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 69550a1ea1d0..d2688bb8e5fa 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1230,6 +1230,24 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return ret;
}
+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ err = tdh_mr_finalize(kvm_tdx->tdr.pa);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
+ return -EIO;
+ }
+
+ kvm_tdx->finalized = true;
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1249,6 +1267,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_INIT_MEM_REGION:
r = tdx_init_mem_region(kvm, &tdx_cmd);
break;
+ case KVM_TDX_FINALIZE_VM:
+ r = tdx_td_finalizemr(kvm);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index af39f3adc179..7f5eb5536ec5 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
KVM_TDX_CMD_NR_MAX,
};
--
2.25.1
From: Isaku Yamahata <[email protected]>
Define architectural definitions for KVM to issue the TDX SEAMCALLs. These
are the structures and values that are architecturally defined in the ABI
Reference chapter of the TDX module specification.
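As a worked example of two of these definitions, the sketch below reproduces
the field-code layout of __BUILD_TDX_FIELD() and the arithmetic of
TDX_TSC_KHZ_TO_25MHZ() as plain functions. The function names are
hypothetical, and BIT_ULL()/GENMASK_ULL() are local copies so that the sketch
builds outside the kernel.

```c
#include <stdint.h>

/* Local copies of the kernel helpers so the sketch builds in userspace. */
#define BIT_ULL(n)		(1ULL << (n))
#define GENMASK_ULL(h, l)	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

#define TDX_NON_ARCH		BIT_ULL(63)
#define TDX_CLASS_SHIFT		56
#define TDX_FIELD_MASK		GENMASK_ULL(31, 0)

/* Same layout as __BUILD_TDX_FIELD(): bit 63, class at bit 56, 32-bit field. */
static uint64_t build_tdx_field(int non_arch, uint64_t class, uint64_t field)
{
	return (non_arch ? TDX_NON_ARCH : 0) |
	       (class << TDX_CLASS_SHIFT) |
	       (field & TDX_FIELD_MASK);
}

/* Same arithmetic as TDX_TSC_KHZ_TO_25MHZ(): integer division truncates. */
static uint64_t tsc_khz_to_25mhz(uint64_t tsc_khz)
{
	return tsc_khz / (25 * 1000);
}
```

With these, TDVPS_MANAGEMENT(TD_VCPU_PEND_NMI) corresponds to
build_tdx_field(0, 32, 11), and a 2.5GHz TSC (2500000 kHz) converts to 100
units of 25MHz. A frequency that is not a multiple of 25MHz is rounded down
by the division.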
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx_arch.h | 157 ++++++++++++++++++++++++++++++++++++
1 file changed, 157 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..94258056d742
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER 0
+#define TDH_MNG_ADDCX 1
+#define TDH_MEM_PAGE_ADD 2
+#define TDH_MEM_SEPT_ADD 3
+#define TDH_VP_ADDCX 4
+#define TDH_MEM_PAGE_RELOCATE 5
+#define TDH_MEM_PAGE_AUG 6
+#define TDH_MEM_RANGE_BLOCK 7
+#define TDH_MNG_KEY_CONFIG 8
+#define TDH_MNG_CREATE 9
+#define TDH_VP_CREATE 10
+#define TDH_MNG_RD 11
+#define TDH_MR_EXTEND 16
+#define TDH_MR_FINALIZE 17
+#define TDH_VP_FLUSH 18
+#define TDH_MNG_VPFLUSHDONE 19
+#define TDH_MNG_KEY_FREEID 20
+#define TDH_MNG_INIT 21
+#define TDH_VP_INIT 22
+#define TDH_VP_RD 26
+#define TDH_MNG_KEY_RECLAIMID 27
+#define TDH_PHYMEM_PAGE_RECLAIM 28
+#define TDH_MEM_PAGE_REMOVE 29
+#define TDH_MEM_SEPT_REMOVE 30
+#define TDH_MEM_TRACK 38
+#define TDH_MEM_RANGE_UNBLOCK 39
+#define TDH_PHYMEM_CACHE_WB 40
+#define TDH_PHYMEM_PAGE_WBINVD 41
+#define TDH_VP_WR 43
+#define TDH_SYS_LP_SHUTDOWN 44
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO 0x10000
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDG_VP_VMCALL_GET_QUOTE 0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH BIT_ULL(63)
+#define TDX_CLASS_SHIFT 56
+#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field) \
+ (((non_arch) ? TDX_NON_ARCH : 0) | \
+ ((u64)(class) << TDX_CLASS_SHIFT) | \
+ ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field) \
+ __BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
+ __BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field) BUILD_TDX_FIELD(0, (field))
+
+enum tdx_guest_other_state {
+ TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+ struct {
+ u64 vmxip : 1;
+ u64 reserved : 63;
+ };
+ u64 full;
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field) BUILD_TDX_FIELD(17, (field))
+#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(17, (field))
+
+/* Management class fields */
+enum tdx_guest_management {
+ TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_guest_management */
+#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(32, (field))
+
+enum tdx_tdcs_execution_control {
+ TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field) BUILD_TDX_FIELD(17, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE 256
+
+struct tdx_cpuid_value {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
+#define TDX_TD_ATTRIBUTE_PKS BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+ u64 attributes;
+ u64 xfam;
+ u32 max_vcpus;
+ u32 reserved0;
+
+ u64 eptp_controls;
+ u64 exec_controls;
+ u16 tsc_frequency;
+ u8 reserved1[38];
+
+ u64 mrconfigid[6];
+ u64 mrowner[6];
+ u64 mrownerconfig[6];
+ u64 reserved2[4];
+
+ union {
+ struct tdx_cpuid_value cpuid_values[0];
+ u8 reserved3[768];
+ };
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz. The
+ * frequency must be between 100MHz and 10GHz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
+
+#endif /* __KVM_X86_TDX_ARCH_H */
--
2.25.1
From: Sean Christopherson <[email protected]>
Implement a system-scoped ioctl to get system-wide parameters for TDX.
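The sketch below shows one way userspace might size the buffer for
KVM_TDX_CAPABILITIES. The struct layouts are copied from the uapi additions
in this patch; alloc_tdx_caps() is a hypothetical helper. The actual
KVM_MEMORY_ENCRYPT_OP call on the /dev/kvm file descriptor is left out
because it needs headers from this series.

```c
#include <stdint.h>
#include <stdlib.h>

/* Layouts copied from the uapi additions in this patch. */
struct kvm_tdx_cpuid_config {
	uint32_t leaf, sub_leaf, eax, ebx, ecx, edx;
};

struct kvm_tdx_capabilities {
	uint64_t attrs_fixed0, attrs_fixed1, xfam_fixed0, xfam_fixed1;
	uint32_t nr_cpuid_configs;
	uint32_t padding;
	struct kvm_tdx_cpuid_config cpuid_configs[];
};

/*
 * Allocate a zeroed capabilities buffer with room for nr_configs entries and
 * record that capacity in nr_cpuid_configs, which the kernel compares with
 * its own count before copying.  Returns NULL on allocation failure.
 */
static struct kvm_tdx_capabilities *alloc_tdx_caps(uint32_t nr_configs)
{
	struct kvm_tdx_capabilities *caps;

	caps = calloc(1, sizeof(*caps) +
			 nr_configs * sizeof(caps->cpuid_configs[0]));
	if (caps)
		caps->nr_cpuid_configs = nr_configs;
	return caps;
}
```

The caller would then point kvm_tdx_cmd.data at this buffer. Going by
tdx_dev_ioctl() in this patch, the call fails with -E2BIG when the supplied
capacity is smaller than the number of entries the kernel holds, so a caller
could retry with a larger buffer.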
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 48 +++++++++++++++++++++++++++
arch/x86/kvm/vmx/main.c | 2 ++
arch/x86/kvm/vmx/tdx.c | 46 +++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
arch/x86/kvm/x86.c | 6 ++++
tools/arch/x86/include/uapi/asm/kvm.h | 48 +++++++++++++++++++++++++++
8 files changed, 154 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index fbb2c6746066..3677a5015a4f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -117,6 +117,7 @@ KVM_X86_OP(smi_allowed)
KVM_X86_OP(enter_smm)
KVM_X86_OP(leave_smm)
KVM_X86_OP(enable_smi_window)
+KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_register_region)
KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 80df346af117..342decc69649 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1591,6 +1591,7 @@ struct kvm_x86_ops {
int (*leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
void (*enable_smi_window)(struct kvm_vcpu *vcpu);
+ int (*dev_mem_enc_ioctl)(void __user *argp);
int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9792ec1cc317..273c8d82b9c8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -534,4 +534,52 @@ struct kvm_pmu_event_filter {
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_TDX_VM 1
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+ /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+ __u64 unused;
+};
+
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ __u32 padding;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6a93b19a8b06..7b497ed1f21c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -212,6 +212,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .dev_mem_enc_ioctl = tdx_dev_ioctl,
};
struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 63f3c7a02cc8..ec4ebba4152a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -392,6 +392,52 @@ int tdx_vm_init(struct kvm *kvm)
return ret;
}
+int tdx_dev_ioctl(void __user *argp)
+{
+ struct kvm_tdx_capabilities __user *user_caps;
+ struct kvm_tdx_capabilities caps;
+ struct kvm_tdx_cmd cmd;
+
+ BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+ sizeof(struct tdx_cpuid_config));
+
+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
+ return -EFAULT;
+ if (cmd.flags || cmd.error || cmd.unused)
+ return -EINVAL;
+ /*
+ * Currently only KVM_TDX_CAPABILITIES is defined for system-scoped
+ * mem_enc_ioctl().
+ */
+ if (cmd.id != KVM_TDX_CAPABILITIES)
+ return -EINVAL;
+
+ user_caps = (void __user *)cmd.data;
+ if (copy_from_user(&caps, user_caps, sizeof(caps)))
+ return -EFAULT;
+
+ if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
+ return -E2BIG;
+
+ caps = (struct kvm_tdx_capabilities) {
+ .attrs_fixed0 = tdx_caps.attrs_fixed0,
+ .attrs_fixed1 = tdx_caps.attrs_fixed1,
+ .xfam_fixed0 = tdx_caps.xfam_fixed0,
+ .xfam_fixed1 = tdx_caps.xfam_fixed1,
+ .nr_cpuid_configs = tdx_caps.nr_cpuid_configs,
+ .padding = 0,
+ };
+
+ if (copy_to_user(user_caps, &caps, sizeof(caps)))
+ return -EFAULT;
+ if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
+ tdx_caps.nr_cpuid_configs *
+ sizeof(struct tdx_cpuid_config)))
+ return -EFAULT;
+
+ return 0;
+}
+
int __init tdx_module_setup(void)
{
const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 663fd8d4063f..3027d9821fe1 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -132,6 +132,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
bool tdx_is_vm_type_supported(unsigned long type);
void tdx_hardware_unsetup(void);
+int tdx_dev_ioctl(void __user *argp);
int tdx_vm_init(struct kvm *kvm);
void tdx_mmu_release_hkid(struct kvm *kvm);
@@ -140,6 +141,7 @@ void tdx_vm_free(struct kvm *kvm);
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
static inline void tdx_hardware_unsetup(void) {}
+static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 320f902eaf9e..6037ce93bcb7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4565,6 +4565,12 @@ long kvm_arch_dev_ioctl(struct file *filp,
break;
r = kvm_x86_dev_has_attr(&attr);
break;
+ case KVM_MEMORY_ENCRYPT_OP:
+ r = -EINVAL;
+ if (!kvm_x86_ops.dev_mem_enc_ioctl)
+ goto out;
+ r = static_call(kvm_x86_dev_mem_enc_ioctl)(argp);
+ break;
}
default:
r = -EINVAL;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 71a5851475e7..a9ea3573be1b 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -528,4 +528,52 @@ struct kvm_pmu_event_filter {
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_TDX_VM 1
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+ /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+ __u64 unused;
+};
+
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ __u32 padding;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1
From: Isaku Yamahata <[email protected]>
With TDX, all GFNs are private at guest boot time. At run time, the guest
TD can explicitly convert a GFN from private to shared, or vice versa, with
the MapGPA hypercall. Once a GFN has been designated one way, it can't be
used the other way. That is, if the guest tells KVM that the GFN is shared,
it can't be mapped as private, and vice versa.
Steal a software-usable bit, SPTE_SHARED_MASK, from the MMIO generation
counter to record this state. Use SPTE_SHARED_MASK in the shared or private
EPT to determine which mapping, shared or private, is allowed. If the
requested mapping isn't allowed, return RET_PF_RETRY to wait for another
vcpu to change it. The bit is recorded in both the shared and the private
shadow page to avoid traversing one more shadow page when resolving a KVM
page fault.
The bit needs to be preserved across zapping of the EPT entry. Currently
the EPT entry is unconditionally initialized to SHADOW_NONPRESENT_VALUE,
which clears SPTE_SHARED_MASK. To carry the bit over, introduce a helper
function that returns the initial value for a zapped entry with
SPTE_SHARED_MASK preserved, and replace SHADOW_NONPRESENT_VALUE with it.
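A small userspace sketch of the bit handling described above follows. The
three helpers have the same bodies as the ones in this patch.
SPTE_SHARED_MASK is taken from the patch, while the values used here for
SHADOW_NONPRESENT_VALUE and REMOVED_SPTE are assumed placeholders chosen only
to make the example self-contained.

```c
#include <stdint.h>

#define BIT_ULL(n)	(1ULL << (n))

#define SPTE_SHARED_MASK	BIT_ULL(62)
/* Assumed placeholder values; the real ones are defined elsewhere. */
#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)
#define REMOVED_SPTE		(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)

static uint64_t spte_shared_mask(uint64_t spte)
{
	return spte & SPTE_SHARED_MASK;
}

/* Value written when zapping: non-present, but the shared bit survives. */
static uint64_t shadow_nonpresent_spte(uint64_t old_spte)
{
	return SHADOW_NONPRESENT_VALUE | spte_shared_mask(old_spte);
}

/* A removed SPTE is recognised regardless of the shared bit. */
static int is_removed_spte(uint64_t spte)
{
	return (spte & ~SPTE_SHARED_MASK) == REMOVED_SPTE;
}
```

Zapping an entry that had the shared bit set therefore yields
SHADOW_NONPRESENT_VALUE | SPTE_SHARED_MASK, so a later fault on the same GFN
can still tell which kind of mapping the guest asked for.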
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 17 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++++++------
2 files changed, 68 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 96312ab4fffb..7c1aaf0e963e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -14,6 +14,9 @@
*/
#define SPTE_MMU_PRESENT_MASK BIT_ULL(11)
+/* Mask used to track shared GPAs */
+#define SPTE_SHARED_MASK BIT_ULL(62)
+
/*
* TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
* be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
@@ -104,7 +107,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
* the memslots generation and is derived as follows:
*
* Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
- * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 8-17 of the MMIO generation are propagated to spte bits 52-61
*
* The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
* the MMIO generation number, as doing so would require stealing a bit from
@@ -118,7 +121,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
#define MMIO_SPTE_GEN_LOW_END 10
#define MMIO_SPTE_GEN_HIGH_START 52
-#define MMIO_SPTE_GEN_HIGH_END 62
+#define MMIO_SPTE_GEN_HIGH_END 61
#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
MMIO_SPTE_GEN_LOW_START)
@@ -131,7 +134,7 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
/* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 10);
#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
@@ -208,6 +211,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
+static_assert(!(__REMOVED_SPTE & SPTE_SHARED_MASK));
/*
* See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
@@ -217,7 +221,12 @@ static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
static inline bool is_removed_spte(u64 spte)
{
- return spte == REMOVED_SPTE;
+ return (spte & ~SPTE_SHARED_MASK) == REMOVED_SPTE;
+}
+
+static inline u64 spte_shared_mask(u64 spte)
+{
+ return spte & SPTE_SHARED_MASK;
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fef6246086a8..4f279700b3cc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -758,6 +758,11 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
return 0;
}
+static u64 shadow_nonpresent_spte(u64 old_spte)
+{
+ return SHADOW_NONPRESENT_VALUE | spte_shared_mask(old_spte);
+}
+
static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
@@ -791,7 +796,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* SHADOW_NONPRESENT_VALUE (which sets "suppress #VE" bit) so it
* can be set when EPT table entries are zapped.
*/
- __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
+ __kvm_tdp_mmu_write_spte(iter->sptep,
+ shadow_nonpresent_spte(iter->old_spte));
return 0;
}
@@ -975,8 +981,11 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
if (!shared)
- tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
- else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
+ tdp_mmu_set_spte(kvm, &iter,
+ shadow_nonpresent_spte(iter.old_spte));
+ else if (tdp_mmu_set_spte_atomic(
+ kvm, &iter,
+ shadow_nonpresent_spte(iter.old_spte)))
goto retry;
}
}
@@ -1033,7 +1042,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
return false;
__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
- SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
+ shadow_nonpresent_spte(old_spte),
+ sp->gfn, sp->role.level + 1,
true, true, is_private_sp(sp));
return true;
@@ -1075,11 +1085,20 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
}
+ /*
+ * SPTE_SHARED_MASK is tracked at 4K granularity. The
+ * information is lost if an upper-level SPTE page is deleted.
+ * TODO: support large pages.
+ */
+ if (kvm_gfn_shared_mask(kvm) && iter.level > PG_LEVEL_4K)
+ continue;
+
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
- tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ tdp_mmu_set_spte(kvm, &iter,
+ shadow_nonpresent_spte(iter.old_spte));
flush = true;
}
@@ -1195,18 +1214,44 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
gfn_t gfn_unalias = iter->gfn & ~kvm_gfn_shared_mask(vcpu->kvm);
WARN_ON(sp->role.level != fault->goal_level);
+ WARN_ON(is_private_sptep(iter->sptep) != fault->is_private);
- /* TDX shared GPAs are no executable, enforce this for the SDV. */
- if (kvm_gfn_shared_mask(vcpu->kvm) && !fault->is_private)
- pte_access &= ~ACC_EXEC_MASK;
+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ if (fault->is_private) {
+ /*
+ * Private SPTEs allow only RWX mappings; the PFN can't
+ * be mapped as READONLY in the GPA.
+ */
+ if (fault->slot && !fault->map_writable)
+ return RET_PF_RETRY;
+ /*
+ * This GPA is not allowed to be mapped as private. Let
+ * the vcpu loop in page fault until another vcpu changes
+ * it with the MapGPA hypercall.
+ */
+ if (fault->slot &&
+ spte_shared_mask(iter->old_spte))
+ return RET_PF_RETRY;
+ } else {
+ /* This GPA is not allowed to be mapped as shared. */
+ if (fault->slot &&
+ !spte_shared_mask(iter->old_spte))
+ return RET_PF_RETRY;
+ /* TDX shared GPAs are not executable, enforce this. */
+ pte_access &= ~ACC_EXEC_MASK;
+ }
+ }
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, gfn_unalias, pte_access);
- else
+ else {
wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
gfn_unalias, fault->pfn, iter->old_spte,
fault->prefetch, true, fault->map_writable,
&new_spte);
+ if (spte_shared_mask(iter->old_spte))
+ new_spte |= SPTE_SHARED_MASK;
+ }
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1509,7 +1554,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See __handle_changed_spte().
*/
- tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
+ tdp_mmu_set_spte(kvm, iter, shadow_nonpresent_spte(iter->old_spte));
if (!pte_write(range->pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1
From: Isaku Yamahata <[email protected]>
Currently, the KVM VMX module initialization and exit functions are each a
single function. Refactor the KVM VMX module initialization function into
a KVM common part and a VMX part so that the TDX-specific part can be added
cleanly. Opportunistically refactor the module exit function as well.
The current module initialization flow is: 1) calculate the sizes of the
VMX kvm structure and the VMX vcpu structure, 2) Hyper-V specific
initialization, 3) report those sizes to the KVM common layer and run the
KVM common initialization, and 4) VMX-specific system-wide initialization.
Refactor the KVM VMX module initialization function into separate functions
called from a wrapper, so that the VMX logic in vmx.c is kept apart from
main.c, a file common to VMX and TDX. The wrapper function, "vt_init() {
vmx kvm/vcpu size calculation; hv_vp_assist_page_init(); vmx_init_early();
kvm_init(); vmx_init(); }", lives in main.c, while hv_vp_assist_page_init(),
vmx_init_early() and vmx_init() live in vmx.c. hv_vp_assist_page_init()
initializes the Hyper-V specific assist pages, kvm_init() does system-wide
initialization of the KVM common layer, and vmx_init() does system-wide VMX
initialization.
The KVM architecture common layer allocates struct kvm with the reported
size for architecture-specific code. The KVM VMX module defines its
structure as struct kvm_vmx { struct kvm; VMX specific members; } and uses
it as its struct kvm. The vcpu structure is handled similarly. TDX KVM
patches will define TDX-specific kvm and vcpu structures and add
tdx_pre_kvm_init() to report their sizes to the KVM common layer.
The current module exit function is also a single function, a combination
of VMX-specific logic and common KVM logic. Refactor it into VMX-specific
logic and KVM common logic. This is just refactoring to keep the
VMX-specific logic in vmx.c, separate from main.c.
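The ordering and error unwinding of the wrapper can be modelled with stubs
as follows. Everything in this sketch is hypothetical: the stub names, the
fail_at switch and the trace buffer exist only to show that a failure in a
later step undoes the earlier steps in reverse order.

```c
#include <string.h>

static char trace[128];	/* records the order in which the steps ran */
static int fail_at;	/* 0: no failure, 1: kvm_init fails, 2: vmx_init fails */

static void step(const char *name)
{
	strcat(trace, name);
	strcat(trace, " ");
}

/* Hypothetical stubs standing in for the real initialization steps. */
static void hv_init(void) { step("hv_init"); }
static void hv_exit(void) { step("hv_exit"); }
static int kvm_init_stub(void) { step("kvm_init"); return fail_at == 1 ? -1 : 0; }
static void kvm_exit_stub(void) { step("kvm_exit"); }
static int vmx_init_stub(void) { step("vmx_init"); return fail_at == 2 ? -1 : 0; }

/* Same shape as vt_init(): each failure jumps to the matching unwind label. */
static int vt_init_sketch(void)
{
	int r;

	hv_init();
	r = kvm_init_stub();
	if (r)
		goto err_hv_exit;
	r = vmx_init_stub();
	if (r)
		goto err_kvm_exit;
	return 0;

err_kvm_exit:
	kvm_exit_stub();
err_hv_exit:
	hv_exit();
	return r;
}
```

Keeping the unwind labels in the wrapper means the VMX-specific functions do
not need to know about the common-layer teardown, which is the separation
this refactoring is after.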
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 38 +++++++++++++
arch/x86/kvm/vmx/vmx.c | 106 ++++++++++++++++++-------------------
arch/x86/kvm/vmx/x86_ops.h | 6 +++
3 files changed, 95 insertions(+), 55 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index fabf5f22c94f..371dad728166 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -169,3 +169,41 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
.runtime_ops = &vt_x86_ops,
.pmu_ops = &intel_pmu_ops,
};
+
+static int __init vt_init(void)
+{
+ unsigned int vcpu_size, vcpu_align;
+ int r;
+
+ vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
+ vcpu_size = sizeof(struct vcpu_vmx);
+ vcpu_align = __alignof__(struct vcpu_vmx);
+
+ hv_vp_assist_page_init();
+ vmx_init_early();
+
+ r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
+ if (r)
+ goto err_vmx_post_exit;
+
+ r = vmx_init();
+ if (r)
+ goto err_kvm_exit;
+
+ return 0;
+
+err_kvm_exit:
+ kvm_exit();
+err_vmx_post_exit:
+ hv_vp_assist_page_exit();
+ return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+ vmx_exit();
+ kvm_exit();
+ hv_vp_assist_page_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 286947c00638..b30d73d28e75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8181,15 +8181,45 @@ static void vmx_cleanup_l1d_flush(void)
l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
}
-static void vmx_exit(void)
+void __init hv_vp_assist_page_init(void)
{
-#ifdef CONFIG_KEXEC_CORE
- RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
- synchronize_rcu();
-#endif
+#if IS_ENABLED(CONFIG_HYPERV)
+ /*
+ * Enlightened VMCS usage should be recommended and the host needs
+ * to support eVMCS v1 or above. We can also disable eVMCS support
+ * with module parameter.
+ */
+ if (enlightened_vmcs &&
+ ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
+ (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
+ KVM_EVMCS_VERSION) {
+ int cpu;
+
+ /* Check that we have assist pages on all online CPUs */
+ for_each_online_cpu(cpu) {
+ if (!hv_get_vp_assist_page(cpu)) {
+ enlightened_vmcs = false;
+ break;
+ }
+ }
- kvm_exit();
+ if (enlightened_vmcs) {
+ pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
+ static_branch_enable(&enable_evmcs);
+ }
+
+ if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
+ vt_x86_ops.enable_direct_tlbflush
+ = hv_enable_direct_tlbflush;
+ } else {
+ enlightened_vmcs = false;
+ }
+#endif
+}
+
+void hv_vp_assist_page_exit(void)
+{
#if IS_ENABLED(CONFIG_HYPERV)
if (static_branch_unlikely(&enable_evmcs)) {
int cpu;
@@ -8213,14 +8243,10 @@ static void vmx_exit(void)
static_branch_disable(&enable_evmcs);
}
#endif
- vmx_cleanup_l1d_flush();
-
- allow_smaller_maxphyaddr = false;
}
-module_exit(vmx_exit);
/* initialize before kvm_init() so that hardware_enable/disable() can work. */
-static void __init vmx_init_early(void)
+void __init vmx_init_early(void)
{
int cpu;
@@ -8228,49 +8254,10 @@ static void __init vmx_init_early(void)
INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
}
-static int __init vmx_init(void)
+int __init vmx_init(void)
{
int r, cpu;
-#if IS_ENABLED(CONFIG_HYPERV)
- /*
- * Enlightened VMCS usage should be recommended and the host needs
- * to support eVMCS v1 or above. We can also disable eVMCS support
- * with module parameter.
- */
- if (enlightened_vmcs &&
- ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
- (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
- KVM_EVMCS_VERSION) {
-
- /* Check that we have assist pages on all online CPUs */
- for_each_online_cpu(cpu) {
- if (!hv_get_vp_assist_page(cpu)) {
- enlightened_vmcs = false;
- break;
- }
- }
-
- if (enlightened_vmcs) {
- pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
- static_branch_enable(&enable_evmcs);
- }
-
- if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vt_x86_ops.enable_direct_tlbflush
- = hv_enable_direct_tlbflush;
-
- } else {
- enlightened_vmcs = false;
- }
-#endif
-
- vmx_init_early();
- r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
- __alignof__(struct vcpu_vmx), THIS_MODULE);
- if (r)
- return r;
-
/*
* Must be called after kvm_init() so enable_ept is properly set
* up. Hand the parameter mitigation value in which was stored in
@@ -8279,10 +8266,8 @@ static int __init vmx_init(void)
* mitigation mode.
*/
r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
- if (r) {
- vmx_exit();
+ if (r)
return r;
- }
for_each_possible_cpu(cpu)
pi_init_cpu(cpu);
@@ -8303,4 +8288,15 @@ static int __init vmx_init(void)
return 0;
}
-module_init(vmx_init);
+
+void vmx_exit(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+ RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
+ synchronize_rcu();
+#endif
+
+ vmx_cleanup_l1d_flush();
+
+ allow_smaller_maxphyaddr = false;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0a5967a91e26..2abead2f60f7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -8,6 +8,12 @@
#include "x86.h"
+void __init hv_vp_assist_page_init(void);
+void hv_vp_assist_page_exit(void);
+void __init vmx_init_early(void);
+int __init vmx_init(void);
+void vmx_exit(void);
+
__init int vmx_cpu_has_kvm_support(void);
__init int vmx_disabled_by_bios(void);
__init int vmx_hardware_setup(void);
--
2.25.1
From: Isaku Yamahata <[email protected]>
Wire up TDX PV HLT hypercall to the KVM backend function.
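For illustration only, the wake-up decision made by
tdx_protected_apic_has_interrupt() in this patch can be modelled in user-space
C roughly as below. This is a sketch, not the kernel code: struct vcpu_model
and the model_*() helpers are hypothetical names, and C11 atomic_exchange()
stands in for the kernel's xchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of the per-vcpu state touched by the patch. */
struct vcpu_model {
	bool pi_pending;             /* models pi_has_pending_interrupt() */
	bool halted;                 /* models mp_state == KVM_MP_STATE_HALTED */
	bool interrupt_disabled_hlt; /* a0 of TDG.VP.VMCALL<HLT> */
	atomic_uint buggy_hlt_workaround;
};

/* Models tdx_deliver_interrupt(): remember that an interrupt was posted. */
static void model_deliver_interrupt(struct vcpu_model *v)
{
	atomic_store(&v->buggy_hlt_workaround, 1);
	v->pi_pending = true;
}

/* Models tdx_protected_apic_has_interrupt(). */
static bool model_has_interrupt(struct vcpu_model *v)
{
	if (v->pi_pending || !v->halted)
		return true;

	/* The guest halted with interrupts disabled: nothing can wake it. */
	if (v->interrupt_disabled_hlt)
		return false;

	/* Consume the flag exactly once, like xchg(&flag, 0) in the patch. */
	return atomic_exchange(&v->buggy_hlt_workaround, 0) != 0;
}
```

The exchange makes the flag one-shot: after an interrupt has moved from the PID
into RVI, the first query still reports it and later queries do not.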
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 3 +++
2 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 96e41602125b..15dc0ae61e0f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -654,7 +654,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
- return pi_has_pending_interrupt(vcpu);
+ bool ret = pi_has_pending_interrupt(vcpu);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+ return true;
+
+ if (tdx->interrupt_disabled_hlt)
+ return false;
+
+ /*
+ * This is for the case where the virtual interrupt is recognized,
+ * i.e. set in vmcs.RVI, between the STI and "HLT". KVM doesn't have
+ * access to RVI and the interrupt is no longer in the PID (because it
+ * was "recognized"). It doesn't get delivered in the guest because the
+ * TDCALL completes before interrupts are enabled.
+ *
+ * The TDX module sets RVI while in an STI interrupt shadow.
+ * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
+ * The interrupt shadow at this point is gone.
+ * - It knows that there is an interrupt that can be delivered
+ * (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
+ * matter)
+ * - It forwards the TDExit nevertheless, to a clueless hypervisor that
+ * has no way to glean either RVI or PPR.
+ */
+ return !!xchg(&tdx->buggy_hlt_workaround, 0);
}
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -967,6 +992,17 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ /* See tdx_protected_apic_has_interrupt() to avoid heavy seamcall */
+ tdx->interrupt_disabled_hlt = tdvmcall_a0_read(vcpu);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return kvm_emulate_halt_noskip(vcpu);
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -975,6 +1011,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
switch (tdvmcall_leaf(vcpu)) {
case EXIT_REASON_CPUID:
return tdx_emulate_cpuid(vcpu);
+ case EXIT_REASON_HLT:
+ return tdx_emulate_hlt(vcpu);
default:
break;
}
@@ -1311,6 +1349,8 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
struct kvm_vcpu *vcpu = apic->vcpu;
struct vcpu_tdx *tdx = to_tdx(vcpu);
+ /* See comment in tdx_protected_apic_has_interrupt(). */
+ tdx->buggy_hlt_workaround = 1;
/* TDX supports only posted interrupt. No lapic emulation. */
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index b0bb239b51bf..a456ca6ec187 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -116,6 +116,9 @@ struct vcpu_tdx {
bool host_state_need_restore;
u64 msr_host_kernel_gs_base;
+ bool interrupt_disabled_hlt;
+ unsigned int buggy_hlt_workaround;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
--
2.25.1
From: Isaku Yamahata <[email protected]>
Add a high level design document on TDX changes to TDP MMU.
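As an illustration of the frozen-entry protocol that the document describes
under "TDX concurrent populating", here is a minimal user-space sketch using
C11 atomics. It is not the KVM implementation: model_set_private_spte(),
sept_hook_t, model_hook_ok() and the REMOVED_SPTE_MODEL value are hypothetical
names chosen for the example, and rolling back the freeze when the hook fails
is an assumption rather than something the document specifies.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_NONPRESENT    0ULL
/* Placeholder value; the real REMOVED_SPTE encoding is defined by KVM. */
#define REMOVED_SPTE_MODEL 0x5a0ULL

/* Hypothetical hook standing in for TDH.MEM.SEPT.ADD / TDH.MEM.PAGE.AUG. */
typedef int (*sept_hook_t)(uint64_t new_spte);

static int model_hook_calls;

/* Example hook that always succeeds and counts its invocations. */
static int model_hook_ok(uint64_t new_spte)
{
	(void)new_spte;
	model_hook_calls++;
	return 0;
}

/*
 * Returns true on success. Returns false when another vcpu raced with us or
 * the hook failed; the caller is expected to restart the page fault.
 */
static bool model_set_private_spte(_Atomic uint64_t *sptep, uint64_t new_spte,
				   sept_hook_t hook)
{
	uint64_t old = SPTE_NONPRESENT;

	/* Step 2: freeze. Only one vcpu can win this compare-and-exchange. */
	if (!atomic_compare_exchange_strong(sptep, &old, REMOVED_SPTE_MODEL))
		return false;

	/* Step 3: the slow backend call runs while the entry is frozen. */
	if (hook(new_spte)) {
		atomic_store(sptep, SPTE_NONPRESENT); /* assumed rollback */
		return false;
	}

	/* Step 4: publish the final value. */
	atomic_store(sptep, new_spte);
	return true;
}
```

Because the freeze happens before the hook, a second vcpu faulting on the same
entry fails the compare-and-exchange and never reaches the backend call, which
is the ordering problem the two-vcpu race example in the document is about.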
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 +++++++++++++++++++++++++
1 file changed, 466 insertions(+)
create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
diff --git a/Documentation/virt/kvm/tdx-tdp-mmu.rst b/Documentation/virt/kvm/tdx-tdp-mmu.rst
new file mode 100644
index 000000000000..6d63bb75f785
--- /dev/null
+++ b/Documentation/virt/kvm/tdx-tdp-mmu.rst
@@ -0,0 +1,466 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Design of TDP MMU for TDX support
+=================================
+This document describes a (high level) design for TDX support of KVM TDP MMU of
+x86 KVM.
+
+In this document, we use "TD" or "guest TD" to differentiate it from the current
+"VM" (Virtual Machine), which is supported by KVM today.
+
+
+Background of TDX
+=================
+TD private memory is designed to hold TD private content, encrypted by the CPU
+using the TD ephemeral key. An encryption engine holds a table of encryption
+keys, and an encryption key is selected for each memory transaction based on a
+Host Key Identifier (HKID). By design, the host VMM does not have access to the
+encryption keys.
+
+In the first generation of MKTME, HKID is "stolen" from the physical address by
+allocating a configurable number of bits from the top of the physical address.
+The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and
+private HKIDs for SEAM-mode-only accesses. We use 0 for the shared HKID on the
+host so that MKTME can be opaque or bypassed on the host.
+
+During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
+as either shared or private, based on the value of a new SHARED bit in the Guest
+Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
+(Extended Page Table) or "Shared EPT" (in this document), which resides in the
+host VMM memory. The Shared EPT is directly managed by the host VMM - the same
+as with the current VMX. Because guest TDs usually require I/O, and the data
+exchange needs to be done via shared memory, KVM needs to use the current
+EPT functionality even for TDs.
+
+The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
+pages are encrypted and integrity-protected with the TD's ephemeral private key.
+Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface
+functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT
+because not all functionalities are available.
+
+Since executing such interface functions takes much longer than accessing
+memory directly, in KVM we use the existing TDP code to mirror the Secure EPT
+for the TD. We think there are at least two options today in terms of the
+timing for executing such SEAMCALLs:
+
+1. synchronous, i.e. while walking the TDP page tables, or
+2. post-walk, i.e. record what needs to be done to the real Secure EPT during
+ the walk, and execute SEAMCALLs later.
+
+Option 1 seems more intuitive and simpler, but the Secure EPT concurrency
+rules are different from those of the TDP or EPT. For example,
+TDH.MEM.SEPT.RD acquires shared access to the whole Secure EPT tree of the target
+
+Secure EPT(SEPT) operations
+---------------------------
+Secure EPT is an Extended Page Table for GPA-to-HPA translation of TD private
+HPA. A Secure EPT is designed to be encrypted with the TD's ephemeral private
+key. SEPT pages are allocated by the host VMM via Intel TDX functions, but their
+content is intended to be hidden and is not architectural.
+
+Unlike the conventional EPT, the CPU can't directly read/write its entries.
+Instead, the TDX SEAMCALL API is used. Several SEAMCALLs correspond to
+operations on the EPT entry.
+
+* TDH.MEM.SEPT.ADD():
+ Add a secure EPT page to the secure EPT tree. This corresponds to updating
+ the non-leaf EPT entry with the present bit set.
+
+* TDH.MEM.SEPT.REMOVE():
+ Remove the secure EPT page from the secure EPT tree. There is no
+ corresponding EPT operation.
+
+* TDH.MEM.SEPT.RD():
+ Read the secure EPT entry. This corresponds to reading the EPT entry as
+ memory. Please note that this is much slower than direct memory reading.
+
+* TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
+ Add a private page to the secure EPT tree. This corresponds to updating the
+ leaf EPT entry with present bit set.
+
+* TDH.MEM.PAGE.REMOVE():
+ Remove a private page from the secure EPT tree. There is no corresponding
+ EPT operation.
+
+* TDH.MEM.RANGE.BLOCK():
+ This (mostly) corresponds to clearing the present bit of the leaf EPT entry.
+ Note that the private page is still linked in the secure EPT. To remove it
+ from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() need to
+ be called.
+
+* TDH.MEM.TRACK():
+ Increment the TLB epoch counter. This (mostly) corresponds to EPT TLB flush.
+ Note that the private page is still linked in the secure EPT. To remove it
+ from the secure EPT, tdh_mem_page_remove() needs to be called.
+
+
+Adding private page
+-------------------
+The procedure of populating the private page looks as follows.
+
+1. TDH.MEM.SEPT.ADD(512G level)
+2. TDH.MEM.SEPT.ADD(1G level)
+3. TDH.MEM.SEPT.ADD(2M level)
+4. TDH.MEM.PAGE.AUG(4K level)
+
+Those operations correspond to updating the EPT entries.
+
+Dropping private page and TLB shootdown
+---------------------------------------
+The procedure of dropping the private page looks as follows.
+
+1. TDH.MEM.RANGE.BLOCK(4K level)
+ This mostly corresponds to clearing the present bit in the EPT entry. This
+ prevents (or blocks) new TLB entries from being created in the future. Note
+ that the private page is still linked in the secure EPT tree and the existing
+ cached entry in the TLB isn't flushed.
+2. TDH.MEM.TRACK(range) and TLB shootdown
+ This mostly corresponds to the EPT TLB shootdown. Because all vcpus share
+ the same Secure EPT, all vcpus need to flush TLB.
+ * TDH.MEM.TRACK(range) by one vcpu. It increments the global internal TLB
+ epoch counter.
+ * send IPI to remote vcpus
+ * Other vcpus exit to the VMM from the guest TD and then re-enter with
+ TDH.VP.ENTER().
+ * TDH.VP.ENTER() checks the TLB epoch counter and, if its TLB is stale,
+ flushes the TLB.
+ Note that only a single vcpu issues TDH.MEM.TRACK().
+ Note that the private page is still linked in the secure EPT tree, unlike the
+ conventional EPT.
+3. TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or
+ TDH.MEM.PAGE.REMOVE()
+ There is no corresponding operation to the conventional EPT.
+ * When changing page size (e.g. 4K <-> 2M), TDH.MEM.PAGE.PROMOTE() or
+ TDH.MEM.PAGE.DEMOTE() is used. During those operations, the guest page is
+ kept referenced in the Secure EPT.
+ * When migrating a page, TDH.MEM.PAGE.RELOCATE() is used. This requires both
+ the source page and the destination page.
+ * When destroying a TD, TDH.MEM.PAGE.REMOVE() removes the private page from
+ the secure EPT tree. In this case TLB shootdown is not needed because vcpus
+ don't run any more.
+
+The basic idea for TDX support
+==============================
+Because shared EPT is the same as the existing EPT, use the existing logic for
+shared EPT. On the other hand, secure EPT requires additional operations
+instead of directly reading/writing of the EPT entry.
+
+On EPT violation, the KVM MMU walks down the EPT tree from the root, determines
+the EPT entry to operate, and updates the entry. If necessary, a TLB shootdown
+is done. Because it's very slow to directly walk secure EPT by TDX SEAMCALL,
+TDH.MEM.SEPT.RD(), the mirror of secure EPT is created and maintained. Add
+hooks to KVM MMU to reuse the existing code.
+
+EPT violation on shared GPA
+---------------------------
+(1) EPT violation on shared GPA or zapping shared GPA
+ walk down shared EPT tree (the existing code)
+ |
+ |
+ V
+shared EPT tree (CPU refers.)
+(2) update the EPT entry. (the existing code)
+ TLB shootdown in the case of zapping.
+
+
+EPT violation on private GPA
+----------------------------
+(1) EPT violation on private GPA or zapping private GPA
+ walk down the mirror of secure EPT tree (mostly same as the existing code)
+ |
+ |
+ V
+mirror of secure EPT tree (KVM MMU software only. reuse of the existing code)
+(2) update the (mirrored) EPT entry. (mostly same as the existing code)
+(3) call the hooks with what EPT entry is changed
+ |
+ NEW: hooks in KVM MMU
+ |
+ V
+secure EPT root(CPU refers)
+(4) the TDX backend calls necessary TDX SEAMCALLs to update real secure EPT.
+
+The major modification is to add hooks for the TDX backend for additional
+operations and to pass down which EPT, shared EPT, or private EPT is used, and
+twist the behavior if we're operating on private EPT.
+
+The following depicts the relationship.
+::
+
+ KVM | TDX module
+ | | |
+ -------------+---------- | |
+ | | | |
+ V V | |
+ shared GPA private GPA | |
+ CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
+ | | | |
+ | | | |
+ V V | V
+ shared EPT private EPT<-------mirror----->Secure EPT
+ | | | |
+ | \--------------------+------\ |
+ | | | |
+ V | V V
+ shared guest page | private guest page
+ |
+ |
+ non-encrypted memory | encrypted memory
+ |
+
+shared EPT: CPU and KVM walk with shared GPA
+ Maintained by the existing code
+private EPT: KVM walks with private GPA
+ Maintained by the twisted existing code
+secure EPT: CPU walks with private GPA.
+ Maintained by TDX module with TDX SEAMCALLs via hooks
+
+
+Tracking private EPT page
+=========================
+Shared EPT pages are managed by struct kvm_mmu_page. They are linked in a list
+structure. When necessary, the list is traversed to operate on. Private EPT
+pages have different characteristics. For example, private pages can't be
+swapped out. When shrinking memory, we'd like to traverse only shared EPT pages
+and skip private EPT pages. Likewise, page migration isn't supported for
+private pages (yet). Introduce an additional list to track shared EPT pages and
+track private EPT pages independently.
+
+At the beginning of EPT violation, the fault handler knows fault GPA, thus it
+knows which EPT to operate on, private or shared. If it's private EPT,
+an additional task is done. Something like "if (private) { callback a hook }".
+Since the fault handler has deep function calls, it's cumbersome to hold the
+information of which EPT is operating. Options to mitigate it are
+
+1. Pass the information as an argument for the function call.
+2. Record the information in struct kvm_mmu_page somehow.
+3. Record the information in vcpu structure.
+
+Option 2 was chosen. Option 1 requires modifying all the functions, which
+would badly affect the normal case. Option 3 doesn't work well because in
+some cases, we need to walk both private and shared EPT.
+
+The role of the EPT page can be utilized and one bit can be carved out from
+unused bits in struct kvm_mmu_page_role. When allocating the EPT page,
+initialize the information. Mostly struct kvm_mmu_page is available because
+we're operating on EPT pages.
+
+
+The conversion of private GPA and shared GPA
+============================================
+A page of a given GPA can be assigned to only private GPA xor shared GPA at one
+time. The GPA can't be accessed simultaneously via both private GPA and shared
+GPA. On guest startup, all the GPAs are assigned as private. Guest converts
+the range of GPA to shared (or private) from private (or shared) by MapGPA
+hypercall. MapGPA hypercall takes the start GPA and the size of the region. If
+the given start GPA is shared, VMM converts the region into shared (if it's
+already shared, nop). If the start GPA is private, VMM converts the region into
+private. It implies the guest won't access the unmapped region, i.e. the
+private (or shared) region after converting it to shared (or private).
+
+If the guest TD triggers an EPT violation on the already converted region, the
+access won't be allowed (loop in EPT violation) until other vcpu converts back
+the region.
+
+KVM MMU records which GPA is allowed to access, private or shared. It steals a
+software-usable bit, SPTE_SHARED_MASK, from the MMU present mask. The bit is
+recorded in both shared EPT and the mirror of secure EPT.
+
+* SPTE_SHARED_MASK cleared in the shared EPT and the mirror of secure EPT:
+ Private GPA is allowed. Shared GPA is not allowed.
+
+* SPTE_SHARED_MASK set in the shared EPT and the mirror of secure EPT:
+ Private GPA is not allowed. Shared GPA is allowed.
+
+The default is that SPTE_SHARED_MASK is cleared so that the existing KVM
+MMU code (mostly) works.
+
+The reason why the bit is recorded in both shared and private EPT is to optimize
+for EPT violation path by penalizing MapGPA hypercall.
+
+The state machine of EPT entry
+------------------------------
+(private EPT entry, shared EPT entry) =
+ (non-present, non-present): private mapping is allowed
+ (present, non-present): private mapping is mapped
+ (non-present | SPTE_SHARED_MASK, non-present | SPTE_SHARED_MASK):
+ shared mapping is allowed
+ (non-present | SPTE_SHARED_MASK, present | SPTE_SHARED_MASK):
+ shared mapping is mapped
+ (present | SPTE_SHARED_MASK, any) invalid combination
+
+* map_gpa(private GPA): Mark the region that private GPA is allowed(NEW)
+ private EPT entry: clear SPTE_SHARED_MASK
+ present: nop
+ non-present: nop
+ non-present | SPTE_SHARED_MASK -> non-present (clear SPTE_SHARED_MASK)
+
+ shared EPT entry: zap the entry, clear SPTE_SHARED_MASK
+ present: invalid
+ non-present -> non-present: nop
+ present | SPTE_SHARED_MASK -> non-present
+ non-present | SPTE_SHARED_MASK -> non-present
+
+* map_gpa(shared GPA): Mark the region that shared GPA is allowed(NEW)
+ private EPT entry: zap and set SPTE_SHARED_MASK
+ present -> non-present | SPTE_SHARED_MASK
+ non-present -> non-present | SPTE_SHARED_MASK
+ non-present | SPTE_SHARED_MASK: nop
+
+ shared EPT entry: set SPTE_SHARED_MASK
+ present: invalid
+ non-present -> non-present | SPTE_SHARED_MASK
+ present | SPTE_SHARED_MASK -> present | SPTE_SHARED_MASK: nop
+ non-present | SPTE_SHARED_MASK -> non-present | SPTE_SHARED_MASK: nop
+
+* map(private GPA)
+ private EPT entry
+ present: nop
+ non-present -> present
+ non-present | SPTE_SHARED_MASK: nop. looping on EPT violation(NEW)
+
+ shared EPT entry: nop
+
+* map(shared GPA)
+ private EPT entry: nop
+
+ shared EPT entry
+ present: invalid
+ present | SPTE_SHARED_MASK: nop
+ non-present | SPTE_SHARED_MASK -> present | SPTE_SHARED_MASK
+ non-present: nop. looping on EPT violation(NEW)
+
+* zap(private GPA)
+ private EPT entry: zap the entry with keeping SPTE_SHARED_MASK
+ present -> non-present
+ present | SPTE_SHARED_MASK: invalid
+ non-present: nop as is_shadow_present_pte() is checked
+ non-present | SPTE_SHARED_MASK: nop as is_shadow_present_pte() is
+ checked
+
+ shared EPT entry: nop
+
+* zap(shared GPA)
+ private EPT entry: nop
+
+ shared EPT entry: zap
+ any -> non-present
+ present: invalid
+ present | SPTE_SHARED_MASK -> non-present | SPTE_SHARED_MASK
+ non-present: nop as is_shadow_present_pte() is checked
+ non-present | SPTE_SHARED_MASK: nop as is_shadow_present_pte() is
+ checked
+
+
+The original TDP MMU and race condition
+=======================================
+Because vcpus share the EPT, once the EPT entry is zapped, we need to shootdown
+TLB. Send IPI to remote vcpus. Remote vcpus flush their own TLBs. Until TLB
+shootdown is done, vcpus may reference the zapped guest page.
+
+TDP MMU uses read lock of mmu_lock to mitigate vcpu contention. When read lock
+is obtained, it depends on the atomic update of the EPT entry. (On the other
+hand legacy MMU uses write lock.) When vcpu is populating/zapping the EPT entry
+with a read lock held, other vcpu may be populating or zapping the same EPT
+entry at the same time.
+
+To avoid the race condition, the entry is frozen: the EPT entry is set to the
+special value REMOVED_SPTE, which clears the present bit. After the TLB
+shootdown, the EPT entry is updated to the final value.
+
+Concurrent zapping
+------------------
+1. read lock
+2. freeze the EPT entry (atomically set the value to REMOVED_SPTE)
+ If other vcpu froze the entry, restart page fault.
+3. TLB shootdown
+ * send IPI to remote vcpus
+ * TLB flush (local and remote)
+ For each entry update, TLB shootdown is needed because of the
+ concurrency.
+4. atomically set the EPT entry to the final value
+5. read unlock
+
+Concurrent populating
+---------------------
+In the case of populating the non-present EPT entry, atomically update the EPT
+entry.
+1. read lock
+2. atomically update the EPT entry
+ If other vcpu froze the entry or updated the entry, restart page fault.
+3. read unlock
+
+In the case of updating the present EPT entry (e.g. page migration), the
+operation is split into two. Zapping the entry and populating the entry.
+1. read lock
+2. zap the EPT entry. follow the concurrent zapping case.
+3. populate the non-present EPT entry.
+4. read unlock
+
+Non-concurrent batched zapping
+------------------------------
+In some cases, zapping the ranges is done exclusively with a write lock held.
+In this case, the TLB shootdown is batched into one.
+
+1. write lock
+2. zap the EPT entries by traversing them
+3. TLB shootdown
+4. write unlock
+
+
+For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored
+EPT entry.
+
+TDX concurrent zapping
+----------------------
+Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
+
+1. read lock
+2. freeze the EPT entry(set the value to REMOVED_SPTE)
+3. TLB shootdown via a hook
+ * TDH.MEM.RANGE.BLOCK()
+ * TDH.MEM.TRACK()
+ * send IPI to remote vcpus
+4. set the EPT entry to the final value
+5. read unlock
+
+TDX concurrent populating
+-------------------------
+TDX SEAMCALLs are required in addition to operating the mirrored EPT entry. The
+frozen entry is utilized by following the zapping case to avoid the race
+condition. A hook can be added.
+
+1. read lock
+2. freeze the EPT entry
+3. hook
+ * TDH.MEM.SEPT.ADD() for non-leaf or TDH.MEM.PAGE.AUG() for leaf.
+4. set the EPT entry to the final value
+5. read unlock
+
+Without freezing the entry, the following race can happen. Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
+
+
+TDX non-concurrent batched zapping
+----------------------------------
+For simplicity, the procedure of concurrent populating is utilized. The
+procedure can be optimized later.
+
+
+Co-existing with unmapping guest private memory
+===============================================
+TODO. This needs to be addressed.
+
+
+Restrictions or future work
+===========================
+The following features aren't supported yet.
+
+* optimizing non-concurrent zap
+* Large page
+* Page migration
--
2.25.1
From: Isaku Yamahata <[email protected]>
This empty commit is to mark the start of patch series of TD vcpu
enter/exit.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 53897312699f..b51e8e6b1541 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -12,6 +12,7 @@ What qemu can do
- Qemu can create/destroy guest of TDX vm type.
- Qemu can create/destroy vcpu of TDX vm type.
- Qemu can populate initial guest memory image.
+- Qemu can finalize guest TD.
Patch Layer status
------------------
@@ -21,8 +22,8 @@ Patch Layer status
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
-* TD finalization: Applying
-* TD vcpu enter/exit: Not yet
+* TD finalization: Applied
+* TD vcpu enter/exit: Applying
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA shared bits: Applied
--
2.25.1
From: Sean Christopherson <[email protected]>
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM. TDX defines its data structure and TDX SEAMCALL APIs for
VMM to operate on Trust Domain (TD) instead.
Trust Domain Virtual Processor State (TDVPS) is the root control structure
of a TD VCPU. It helps the TDX module control the operation of the VCPU,
and holds the VCPU state while the VCPU is not running. TDVPS is opaque to
software and DMA access, accessible only by using the TDX module interface
functions (such as TDH.VP.RD, TDH.VP.WR, ...). TDVPS includes TD VMCS, and
TD VMCS auxiliary structures, such as virtual APIC page, virtualization
exception information, etc. TDVPS is composed of Trust Domain Virtual
Processor Root (TDVPR) which is the root page of TDVPS and Trust Domain
Virtual Processor eXtension (TDVPX) pages which extend TDVPR to help
provide enough physical space for the logical TDVPS structure.
Also, there is a new structure: Trust Domain Control Structure (TDCS) is the
main control structure of a guest TD, and is encrypted (using the guest TD's
ephemeral private key). At a high level, TDCS holds information for
controlling TD operation as a whole, execution, EPTP, MSR bitmaps, etc. KVM
needs to set it up. Note that MSR bitmaps are held as part of TDCS (unlike
VMX) because they are meant to have the same value for all VCPUs of the
same TD. TDCS is a multi-page logical structure composed of multiple Trust
Domain Control Extension (TDCX) physical pages. Trust Domain Root (TDR) is
the root control structure of a guest TD and is encrypted using the TDX
global private key. It holds a minimal set of state variables that enable
guest TD control even during times when the TD's private key is not known,
or when the TD's key management state does not permit access to memory
encrypted using the TD's private key.
The following shows the relationship between those structures.
TDR--> TDCS per-TD
| \--> TDCX
\
\--> TDVPS per-TD VCPU
\--> TDVPR and TDVPX
The existing global struct kvm_x86_ops already defines an interface which
fits with TDX. But kvm_x86_ops is system-wide, not per-VM structure. To
allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
"if (tdx) tdx_op() else vmx_op()" to switch VMX or TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
both VMs and TDs. Use 'vt' for the naming scheme as a nod to VT-x and as a
concatenation of VmxTdx.
The current code looks as follows.
In vmx.c
static vmx_op() { ... }
static struct kvm_x86_ops vmx_x86_ops = {
.op = vmx_op,
initialization code
The eventually converted code will look like
In vmx.c, keep the VMX operations.
vmx_op() { ... }
VMX initialization
In tdx.c, define the TDX operations.
tdx_op() { ... }
TDX initialization
In x86_ops.h, declare the VMX and TDX operations.
vmx_op();
tdx_op();
In main.c, define common wrappers for VMX and TDX.
static vt_op() { if (tdx) tdx_op() else vmx_op() }
static struct kvm_x86_ops vt_x86_ops = {
.op = vt_op,
initialization to call VMX and TDX initialization
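To make the wrapper idea above concrete, here is a small self-contained sketch
of the run-time switch. It is illustrative only: struct vm, struct x86_ops,
the is_td flag and the BACKEND_* return values are hypothetical stand-ins, not
the real struct kvm or struct kvm_x86_ops, and only one callback is shown.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm; the real structure is far larger. */
struct vm {
	bool is_td; /* true for a TDX guest (TD), false for a normal VM */
};

/* Hypothetical stand-in for struct kvm_x86_ops with a single callback. */
struct x86_ops {
	const char *name;
	int (*vm_init)(struct vm *vm);
};

enum { BACKEND_VMX = 1, BACKEND_TDX = 2 };

/* Would live in vmx.c. */
static int vmx_vm_init(struct vm *vm)
{
	(void)vm;
	return BACKEND_VMX;
}

/* Would live in tdx.c. */
static int tdx_vm_init(struct vm *vm)
{
	(void)vm;
	return BACKEND_TDX;
}

/* Would live in main.c: choose the backend per VM at run time. */
static int vt_vm_init(struct vm *vm)
{
	if (vm->is_td)
		return tdx_vm_init(vm);
	return vmx_vm_init(vm);
}

/* The single system-wide ops table points only at the vt_ wrappers. */
static const struct x86_ops vt_x86_ops = {
	.name = "kvm_intel",
	.vm_init = vt_vm_init,
};
```

Because the ops table is system-wide, the per-VM decision has to be made inside
each wrapper, which is why the wrappers branch on a per-VM property.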
Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/main.c | 155 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 363 +++++++++++--------------------------
arch/x86/kvm/vmx/x86_ops.h | 125 +++++++++++++
4 files changed, 386 insertions(+), 259 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 30f244b64523..ee4d0999f20f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -22,7 +22,7 @@ kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
kvm-$(CONFIG_KVM_XEN) += xen.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
- vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
+ vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..636768f5b985
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+ .name = "kvm_intel",
+
+ .hardware_unsetup = vmx_hardware_unsetup,
+ .check_processor_compatibility = vmx_check_processor_compatibility,
+
+ .hardware_enable = vmx_hardware_enable,
+ .hardware_disable = vmx_hardware_disable,
+ .has_emulated_msr = vmx_has_emulated_msr,
+
+ .vm_size = sizeof(struct kvm_vmx),
+ .vm_init = vmx_vm_init,
+ .vm_destroy = vmx_vm_destroy,
+
+ .vcpu_precreate = vmx_vcpu_precreate,
+ .vcpu_create = vmx_vcpu_create,
+ .vcpu_free = vmx_vcpu_free,
+ .vcpu_reset = vmx_vcpu_reset,
+
+ .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .vcpu_load = vmx_vcpu_load,
+ .vcpu_put = vmx_vcpu_put,
+
+ .update_exception_bitmap = vmx_update_exception_bitmap,
+ .get_msr_feature = vmx_get_msr_feature,
+ .get_msr = vmx_get_msr,
+ .set_msr = vmx_set_msr,
+ .get_segment_base = vmx_get_segment_base,
+ .get_segment = vmx_get_segment,
+ .set_segment = vmx_set_segment,
+ .get_cpl = vmx_get_cpl,
+ .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+ .set_cr0 = vmx_set_cr0,
+ .is_valid_cr4 = vmx_is_valid_cr4,
+ .set_cr4 = vmx_set_cr4,
+ .set_efer = vmx_set_efer,
+ .get_idt = vmx_get_idt,
+ .set_idt = vmx_set_idt,
+ .get_gdt = vmx_get_gdt,
+ .set_gdt = vmx_set_gdt,
+ .set_dr7 = vmx_set_dr7,
+ .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+ .cache_reg = vmx_cache_reg,
+ .get_rflags = vmx_get_rflags,
+ .set_rflags = vmx_set_rflags,
+ .get_if_flag = vmx_get_if_flag,
+
+ .flush_tlb_all = vmx_flush_tlb_all,
+ .flush_tlb_current = vmx_flush_tlb_current,
+ .flush_tlb_gva = vmx_flush_tlb_gva,
+ .flush_tlb_guest = vmx_flush_tlb_guest,
+
+ .vcpu_pre_run = vmx_vcpu_pre_run,
+ .vcpu_run = vmx_vcpu_run,
+ .handle_exit = vmx_handle_exit,
+ .skip_emulated_instruction = vmx_skip_emulated_instruction,
+ .update_emulated_instruction = vmx_update_emulated_instruction,
+ .set_interrupt_shadow = vmx_set_interrupt_shadow,
+ .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .patch_hypercall = vmx_patch_hypercall,
+ .inject_irq = vmx_inject_irq,
+ .inject_nmi = vmx_inject_nmi,
+ .queue_exception = vmx_queue_exception,
+ .cancel_injection = vmx_cancel_injection,
+ .interrupt_allowed = vmx_interrupt_allowed,
+ .nmi_allowed = vmx_nmi_allowed,
+ .get_nmi_mask = vmx_get_nmi_mask,
+ .set_nmi_mask = vmx_set_nmi_mask,
+ .enable_nmi_window = vmx_enable_nmi_window,
+ .enable_irq_window = vmx_enable_irq_window,
+ .update_cr8_intercept = vmx_update_cr8_intercept,
+ .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .apicv_post_state_restore = vmx_apicv_post_state_restore,
+ .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
+ .hwapic_irr_update = vmx_hwapic_irr_update,
+ .hwapic_isr_update = vmx_hwapic_isr_update,
+ .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .sync_pir_to_irr = vmx_sync_pir_to_irr,
+ .deliver_interrupt = vmx_deliver_interrupt,
+ .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+ .set_tss_addr = vmx_set_tss_addr,
+ .set_identity_map_addr = vmx_set_identity_map_addr,
+ .get_mt_mask = vmx_get_mt_mask,
+
+ .get_exit_info = vmx_get_exit_info,
+
+ .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+ .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+ .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+ .write_tsc_offset = vmx_write_tsc_offset,
+ .write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+ .load_mmu_pgd = vmx_load_mmu_pgd,
+
+ .check_intercept = vmx_check_intercept,
+ .handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+ .request_immediate_exit = vmx_request_immediate_exit,
+
+ .sched_in = vmx_sched_in,
+
+ .cpu_dirty_log_size = PML_ENTITY_NUM,
+ .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+ .nested_ops = &vmx_nested_ops,
+
+ .pi_update_irte = vmx_pi_update_irte,
+ .pi_start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+ .set_hv_timer = vmx_set_hv_timer,
+ .cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+ .setup_mce = vmx_setup_mce,
+
+ .smi_allowed = vmx_smi_allowed,
+ .enter_smm = vmx_enter_smm,
+ .leave_smm = vmx_leave_smm,
+ .enable_smi_window = vmx_enable_smi_window,
+
+ .can_emulate_instruction = vmx_can_emulate_instruction,
+ .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .migrate_timers = vmx_migrate_timers,
+
+ .msr_filter_changed = vmx_msr_filter_changed,
+ .complete_emulated_msr = kvm_complete_insn_gp,
+
+ .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+ .cpu_has_kvm_support = vmx_cpu_has_kvm_support,
+ .disabled_by_bios = vmx_disabled_by_bios,
+ .hardware_setup = vmx_hardware_setup,
+ .handle_intel_pt_intr = NULL,
+
+ .runtime_ops = &vt_x86_ops,
+ .pmu_ops = &intel_pmu_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d3b68a6dec48..286947c00638 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -66,6 +66,7 @@
#include "vmcs12.h"
#include "vmx.h"
#include "x86.h"
+#include "x86_ops.h"
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
@@ -1312,7 +1313,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
* Switches to specified vcpu, until a matching vcpu_put(), but assumes
* vcpu mutex is already taken.
*/
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1323,7 +1324,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx->host_debugctlmsr = get_debugctlmsr();
}
-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
vmx_vcpu_pi_put(vcpu);
@@ -1377,7 +1378,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
vmx->emulation_required = vmx_emulation_required(vcpu);
}
-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
{
return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
}
@@ -1483,8 +1484,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
return 0;
}
-static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
- void *insn, int insn_len)
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
{
/*
* Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1568,7 +1569,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
* Recognizes a pending MTF VM-exit and records the nested state for later
* delivery.
*/
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1591,7 +1592,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
vmx->nested.mtf_pending = false;
}
-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
vmx_update_emulated_instruction(vcpu);
return skip_emulated_instruction(vcpu);
@@ -1610,7 +1611,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
}
-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+void vmx_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned nr = vcpu->arch.exception.nr;
@@ -1723,12 +1724,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
return kvm_caps.default_tsc_scaling_ratio;
}
-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
{
vmcs_write64(TSC_OFFSET, offset);
}
-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
{
vmcs_write64(TSC_MULTIPLIER, multiplier);
}
@@ -1752,7 +1753,7 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
return !(val & ~valid_bits);
}
-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
{
switch (msr->index) {
case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
@@ -1772,7 +1773,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -1950,7 +1951,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2274,7 +2275,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return ret;
}
-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
unsigned long guest_owned_bits;
@@ -2317,12 +2318,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
}
}
-static __init int cpu_has_kvm_support(void)
+__init int vmx_cpu_has_kvm_support(void)
{
return cpu_has_vmx();
}
-static __init int vmx_disabled_by_bios(void)
+__init int vmx_disabled_by_bios(void)
{
return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
!boot_cpu_has(X86_FEATURE_VMX);
@@ -2348,7 +2349,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
return -EFAULT;
}
-static int vmx_hardware_enable(void)
+int vmx_hardware_enable(void)
{
int cpu = raw_smp_processor_id();
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2389,7 +2390,7 @@ static void vmclear_local_loaded_vmcss(void)
__loaded_vmcs_clear(v);
}
-static void vmx_hardware_disable(void)
+void vmx_hardware_disable(void)
{
vmclear_local_loaded_vmcss();
@@ -2988,7 +2989,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
#endif
-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3018,7 +3019,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}
-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 root_hpa = mmu->root.hpa;
@@ -3034,7 +3035,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
/*
* vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -3043,7 +3044,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
}
-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
/*
* vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3198,8 +3199,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
return eptp;
}
-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
- int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
struct kvm *kvm = vcpu->kvm;
bool update_guest_cr3 = true;
@@ -3227,8 +3227,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmcs_writel(GUEST_CR3, guest_cr3);
}
-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
/*
* We operate under the default treatment of SMM, so VMX cannot be
@@ -3344,7 +3343,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->g = (ar >> 15) & 1;
}
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
struct kvm_segment s;
@@ -3424,14 +3423,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}
-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);
to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
}
-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
@@ -3439,25 +3438,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
*l = (ar >> 13) & 1;
}
-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
dt->address = vmcs_readl(GUEST_IDTR_BASE);
}
-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
vmcs_writel(GUEST_IDTR_BASE, dt->address);
}
-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
dt->address = vmcs_readl(GUEST_GDTR_BASE);
}
-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -3955,7 +3954,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
}
}
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
void *vapic_page;
@@ -3975,7 +3974,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
return ((rvi & 0xf0) > (vppr & 0xf0));
}
-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 i;
@@ -4109,8 +4108,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
return 0;
}
-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
- int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
{
struct kvm_vcpu *vcpu = apic->vcpu;
@@ -4253,7 +4252,7 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}
-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4493,7 +4492,7 @@ static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
return 0;
}
-static int vmx_vcpu_precreate(struct kvm *kvm)
+int vmx_vcpu_precreate(struct kvm *kvm)
{
return vmx_alloc_ipiv_pid_table(kvm);
}
@@ -4645,7 +4644,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmx->pi_desc.sn = 1;
}
-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4702,12 +4701,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vpid_sync_context(vmx->vpid);
}
-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
{
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
}
-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
{
if (!enable_vnmi ||
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4718,7 +4717,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
}
-static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
@@ -4746,7 +4745,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_clear_hlt(vcpu);
}
-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4824,7 +4823,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
GUEST_INTR_STATE_NMI));
}
-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -4846,7 +4845,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}
-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -4861,7 +4860,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !vmx_interrupt_blocked(vcpu);
}
-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
void __user *ret;
@@ -4881,7 +4880,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
return init_rmode_tss(kvm, ret);
}
-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
{
to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
return 0;
@@ -5160,8 +5159,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
return kvm_fast_pio(vcpu, size, port, in);
}
-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
{
/*
* Patch in the VMCALL instruction:
@@ -5371,7 +5369,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
return kvm_complete_insn_gp(vcpu, err);
}
-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
{
get_debugreg(vcpu->arch.db[0], 0);
get_debugreg(vcpu->arch.db[1], 1);
@@ -5390,7 +5388,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
set_debugreg(DR6_RESERVED, 6);
}
-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
vmcs_writel(GUEST_DR7, val);
}
@@ -5661,7 +5659,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 1;
}
-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (vmx_emulation_required_with_pending_exception(vcpu)) {
kvm_prepare_emulation_failure_exit(vcpu);
@@ -5925,9 +5923,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
- u64 *info1, u64 *info2,
- u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6370,7 +6367,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 0;
}
-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);
@@ -6458,7 +6455,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
: "eax", "ebx", "ecx", "edx");
}
-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
int tpr_threshold;
@@ -6528,7 +6525,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
}
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
struct page *page;
@@ -6556,7 +6553,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
put_page(page);
}
-static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
{
u16 status;
u8 old;
@@ -6590,7 +6587,7 @@ static void vmx_set_rvi(int vector)
}
}
-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
{
/*
* When running L2, updating RVI is only relevant when
@@ -6604,7 +6601,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
vmx_set_rvi(max_irr);
}
-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -6650,7 +6647,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
return max_irr;
}
-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
return;
@@ -6661,7 +6658,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6734,7 +6731,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
vcpu->arch.at_instruction_boundary = true;
}
-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6751,7 +6748,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
* The kvm parameter can be NULL (module initialization, or invocation before
* VM creation). Be sure to check the kvm parameter before using it.
*/
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
{
switch (index) {
case MSR_IA32_SMBASE:
@@ -6872,7 +6869,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
IDT_VECTORING_ERROR_CODE);
}
-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
{
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -6973,7 +6970,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
guest_state_exit_irqoff();
}
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr3, cr4;
@@ -7167,7 +7164,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_exit_handlers_fastpath(vcpu);
}
-static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7178,7 +7175,7 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_loaded_vmcs(vmx->loaded_vmcs);
}
-static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct vmx_uret_msr *tsx_ctrl;
struct vcpu_vmx *vmx;
@@ -7287,7 +7284,7 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
{
if (!ple_gap)
kvm->arch.pause_in_guest = true;
@@ -7318,7 +7315,7 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}
-static int vmx_check_processor_compatibility(void)
+int vmx_check_processor_compatibility(void)
{
struct vmcs_config vmcs_conf;
struct vmx_capability vmx_cap;
@@ -7341,7 +7338,7 @@ static int vmx_check_processor_compatibility(void)
return 0;
}
-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
@@ -7530,7 +7527,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7642,7 +7639,7 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->req_immediate_exit = true;
}
@@ -7681,10 +7678,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
}
-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
- struct x86_instruction_info *info,
- enum x86_intercept_stage stage,
- struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
@@ -7749,8 +7746,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
return 0;
}
-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
- bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
{
struct vcpu_vmx *vmx;
u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -7789,13 +7786,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
return 0;
}
-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->hv_deadline_tsc = -1;
}
#endif
-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
if (!kvm_pause_in_guest(vcpu->kvm))
shrink_ple_window(vcpu);
@@ -7821,7 +7818,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
}
-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.mcg_cap & MCG_LMCE_P)
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -7831,7 +7828,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
~FEAT_CTL_LMCE_ENABLED;
}
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
/* we need a nested vmexit to enter SMM, postpone if run is pending */
if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -7839,7 +7836,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !is_smm(vcpu);
}
-static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7853,7 +7850,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
return 0;
}
-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int ret;
@@ -7874,17 +7871,17 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
return 0;
}
-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
{
/* RSM will cause a vmexit anyway. */
}
-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
}
-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
{
if (is_guest_mode(vcpu)) {
struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -7894,7 +7891,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
}
}
-static void vmx_hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
{
kvm_set_posted_intr_wakeup_handler(NULL);
@@ -7904,7 +7901,7 @@ static void vmx_hardware_unsetup(void)
free_kvm_area();
}
-static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
{
ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
BIT(APICV_INHIBIT_REASON_ABSENT) |
@@ -7916,151 +7913,13 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
return supported & BIT(reason);
}
-static void vmx_vm_destroy(struct kvm *kvm)
+void vmx_vm_destroy(struct kvm *kvm)
{
struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
}
-static struct kvm_x86_ops vmx_x86_ops __initdata = {
- .name = "kvm_intel",
-
- .hardware_unsetup = vmx_hardware_unsetup,
-
- .check_processor_compatibility = vmx_check_processor_compatibility,
- .hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
- .has_emulated_msr = vmx_has_emulated_msr,
-
- .vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
- .vm_destroy = vmx_vm_destroy,
-
- .vcpu_precreate = vmx_vcpu_precreate,
- .vcpu_create = vmx_vcpu_create,
- .vcpu_free = vmx_vcpu_free,
- .vcpu_reset = vmx_vcpu_reset,
-
- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
-
- .update_exception_bitmap = vmx_update_exception_bitmap,
- .get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
-
- .flush_tlb_all = vmx_flush_tlb_all,
- .flush_tlb_current = vmx_flush_tlb_current,
- .flush_tlb_gva = vmx_flush_tlb_gva,
- .flush_tlb_guest = vmx_flush_tlb_guest,
-
- .vcpu_pre_run = vmx_vcpu_pre_run,
- .vcpu_run = vmx_vcpu_run,
- .handle_exit = vmx_handle_exit,
- .skip_emulated_instruction = vmx_skip_emulated_instruction,
- .update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
- .inject_irq = vmx_inject_irq,
- .inject_nmi = vmx_inject_nmi,
- .queue_exception = vmx_queue_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_post_state_restore = vmx_apicv_post_state_restore,
- .check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
- .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
-
- .get_exit_info = vmx_get_exit_info,
-
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
- .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
-
- .load_mmu_pgd = vmx_load_mmu_pgd,
-
- .check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
-
- .request_immediate_exit = vmx_request_immediate_exit,
-
- .sched_in = vmx_sched_in,
-
- .cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
- .nested_ops = &vmx_nested_ops,
-
- .pi_update_irte = vmx_pi_update_irte,
- .pi_start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
- .setup_mce = vmx_setup_mce,
-
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
-
- .can_emulate_instruction = vmx_can_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
- .migrate_timers = vmx_migrate_timers,
-
- .msr_filter_changed = vmx_msr_filter_changed,
- .complete_emulated_msr = kvm_complete_insn_gp,
-
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
static unsigned int vmx_handle_intel_pt_intr(void)
{
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8126,9 +7985,7 @@ static void __init vmx_setup_me_spte_mask(void)
kvm_mmu_set_me_spte_mask(0, me_mask);
}
-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
{
unsigned long host_bndcfgs;
struct desc_ptr dt;
@@ -8188,16 +8045,16 @@ static __init int hardware_setup(void)
* using the APIC_ACCESS_ADDR VMCS field.
*/
if (!flexpriority_enabled)
- vmx_x86_ops.set_apic_access_page_addr = NULL;
+ vt_x86_ops.set_apic_access_page_addr = NULL;
if (!cpu_has_vmx_tpr_shadow())
- vmx_x86_ops.update_cr8_intercept = NULL;
+ vt_x86_ops.update_cr8_intercept = NULL;
#if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
&& enable_ept) {
- vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
- vmx_x86_ops.tlb_remote_flush_with_range =
+ vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
+ vt_x86_ops.tlb_remote_flush_with_range =
hv_remote_flush_tlb_with_range;
}
#endif
@@ -8213,7 +8070,7 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_apicv())
enable_apicv = 0;
if (!enable_apicv)
- vmx_x86_ops.sync_pir_to_irr = NULL;
+ vt_x86_ops.sync_pir_to_irr = NULL;
if (!enable_apicv || !cpu_has_vmx_ipiv())
enable_ipiv = false;
@@ -8249,7 +8106,7 @@ static __init int hardware_setup(void)
enable_pml = 0;
if (!enable_pml)
- vmx_x86_ops.cpu_dirty_log_size = 0;
+ vt_x86_ops.cpu_dirty_log_size = 0;
if (!cpu_has_vmx_preemption_timer())
enable_preemption_timer = false;
@@ -8276,9 +8133,9 @@ static __init int hardware_setup(void)
}
if (!enable_preemption_timer) {
- vmx_x86_ops.set_hv_timer = NULL;
- vmx_x86_ops.cancel_hv_timer = NULL;
- vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+ vt_x86_ops.set_hv_timer = NULL;
+ vt_x86_ops.cancel_hv_timer = NULL;
+ vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
}
kvm_caps.supported_mce_cap |= MCG_LMCE_P;
@@ -8288,9 +8145,9 @@ static __init int hardware_setup(void)
if (!enable_ept || !enable_pmu || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
if (pt_mode == PT_MODE_HOST_GUEST)
- vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+ vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
else
- vmx_init_ops.handle_intel_pt_intr = NULL;
+ vt_init_ops.handle_intel_pt_intr = NULL;
setup_default_sgx_lepubkeyhash();
@@ -8314,16 +8171,6 @@ static __init int hardware_setup(void)
return r;
}
-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
- .cpu_has_kvm_support = cpu_has_kvm_support,
- .disabled_by_bios = vmx_disabled_by_bios,
- .hardware_setup = hardware_setup,
- .handle_intel_pt_intr = NULL,
-
- .runtime_ops = &vmx_x86_ops,
- .pmu_ops = &intel_pmu_ops,
-};
-
static void vmx_cleanup_l1d_flush(void)
{
if (vmx_l1d_flush_pages) {
@@ -8410,7 +8257,7 @@ static int __init vmx_init(void)
}
if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vmx_x86_ops.enable_direct_tlbflush
+ vt_x86_ops.enable_direct_tlbflush
= hv_enable_direct_tlbflush;
} else {
@@ -8419,8 +8266,8 @@ static int __init vmx_init(void)
#endif
vmx_init_early();
- r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
- __alignof__(struct vcpu_vmx), THIS_MODULE);
+ r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
+ __alignof__(struct vcpu_vmx), THIS_MODULE);
if (r)
return r;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..0f8a8547958f
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/virtext.h>
+
+#include "x86.h"
+
+__init int vmx_cpu_has_kvm_support(void);
+__init int vmx_disabled_by_bios(void);
+__init int vmx_hardware_setup(void);
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+void vmx_hardware_unsetup(void);
+int vmx_check_processor_compatibility(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+int vmx_vm_init(struct kvm *kvm);
+void vmx_vm_destroy(struct kvm *kvm);
+int vmx_vcpu_precreate(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_queue_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1
From: Sean Christopherson <[email protected]>
TDX doesn't support dirty logging. Report that dirty logging isn't supported
so that the device model, for example QEMU, can handle it properly.
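The flag check described above can be sketched as a small userspace model.
This is only an illustration, not the kernel code: the model_* names, the
struct, and the VM type enum are hypothetical stand-ins, and the model
assumes read-only memslots are available (__KVM_HAVE_READONLY_MEM).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY. */
#define MODEL_MEM_LOG_DIRTY_PAGES (1u << 0)
#define MODEL_MEM_READONLY        (1u << 1)

/* Hypothetical VM types; only the TDX/non-TDX distinction matters here. */
enum model_vm_type { MODEL_DEFAULT_VM, MODEL_TDX_VM };

struct model_kvm {
	enum model_vm_type vm_type;
};

/* Mirrors the arch hook: every VM type except TDX supports dirty logging. */
static bool model_dirty_log_supported(const struct model_kvm *kvm)
{
	return kvm->vm_type != MODEL_TDX_VM;
}

/* Build the set of valid flags per VM and reject anything outside it. */
static int model_check_memory_region_flags(const struct model_kvm *kvm,
					   uint32_t flags)
{
	uint32_t valid_flags = MODEL_MEM_READONLY;

	if (model_dirty_log_supported(kvm))
		valid_flags |= MODEL_MEM_LOG_DIRTY_PAGES;

	if (flags & ~valid_flags)
		return -EINVAL;

	return 0;
}
```

In this model, a memslot that requests dirty logging is rejected with
-EINVAL for a TDX VM and accepted for a default VM, which is the signal the
device model can act on.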
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 5 +++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 15 ++++++++++++---
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4309ef0ade21..dcd1f5e2ba05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13164,6 +13164,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
}
EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
+bool kvm_arch_dirty_log_supported(struct kvm *kvm)
+{
+ return kvm->arch.vm_type != KVM_X86_TDX_VM;
+}
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 79a4988fd51f..6fd8ec297236 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1452,6 +1452,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_dirty_log_supported(struct kvm *kvm);
#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7a5261eb7eb8..703c1d0c98da 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1467,9 +1467,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
{
- u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+ return true;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem)
+{
+ u32 valid_flags = 0;
+
+ if (kvm_arch_dirty_log_supported(kvm))
+ valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
#ifdef __KVM_HAVE_READONLY_MEM
valid_flags |= KVM_MEM_READONLY;
@@ -1871,7 +1880,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;
- r = check_memory_region_flags(mem);
+ r = check_memory_region_flags(kvm, mem);
if (r)
return r;
--
2.25.1
From: Isaku Yamahata <[email protected]>
Wire up the handle_exit and handle_exit_irqoff methods and add a placeholder
to handle VM exits. Add helper functions to get the exit info, exit
qualification, etc.
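As a rough illustration of the wiring, the dispatch and the placeholder
behaviour can be modelled in userspace as below. All model_* names and
fields are hypothetical; the booleans stand in for is_td_vcpu() and for the
error/non_recoverable and triple-fault checks on the TDX exit reason, and
the return value 1 for the VMX path is only a marker for "handled by VMX".

```c
#include <stdbool.h>

/* Hypothetical outcomes recorded by the model. */
enum model_exit { MODEL_EXIT_UNKNOWN, MODEL_EXIT_SHUTDOWN, MODEL_EXIT_VMX };

struct model_vcpu {
	bool is_td;          /* stands in for is_td_vcpu() */
	bool error;          /* stands in for exit_reason.error/non_recoverable */
	bool triple_fault;   /* stands in for basic == EXIT_REASON_TRIPLE_FAULT */
	enum model_exit reported;
};

/* Placeholder TDX handler: only a failed exit with triple fault is handled. */
static int model_tdx_handle_exit(struct model_vcpu *v)
{
	if (v->error && v->triple_fault) {
		v->reported = MODEL_EXIT_SHUTDOWN;
		return 0;
	}

	/* No basic exit reason is handled yet, so report an unknown exit. */
	v->reported = MODEL_EXIT_UNKNOWN;
	return 0;
}

/* Stand-in for the existing VMX handler. */
static int model_vmx_handle_exit(struct model_vcpu *v)
{
	v->reported = MODEL_EXIT_VMX;
	return 1;
}

/* Common entry point: pick the backend based on the vCPU type. */
static int model_vt_handle_exit(struct model_vcpu *v)
{
	if (v->is_td)
		return model_tdx_handle_exit(v);

	return model_vmx_handle_exit(v);
}
```

Later patches would be expected to fill in cases for individual exit
reasons; until then every TD exit other than a triple fault is reported to
userspace as unknown.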
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 33 ++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 81 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 11 ++++++
3 files changed, 122 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index eddfd07506df..227739c2490e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -188,6 +188,23 @@ static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
return tdx_protected_apic_has_interrupt(vcpu);
}
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit(vcpu, fastpath);
+
+ return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit_irqoff(vcpu);
+
+ vmx_handle_exit_irqoff(vcpu);
+}
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -371,6 +388,16 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
vmx_request_immediate_exit(vcpu);
}
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+ error_code);
+
+ return vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -444,7 +471,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_pre_run = vt_vcpu_pre_run,
.vcpu_run = vt_vcpu_run,
- .handle_exit = vmx_handle_exit,
+ .handle_exit = vt_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
@@ -479,7 +506,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vmx_get_mt_mask,
- .get_exit_info = vmx_get_exit_info,
+ .get_exit_info = vt_get_exit_info,
.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
@@ -493,7 +520,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.load_mmu_pgd = vt_load_mmu_pgd,
.check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
+ .handle_exit_irqoff = vt_handle_exit_irqoff,
.request_immediate_exit = vt_request_immediate_exit,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index de696d82ddbf..c29501a69167 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -78,6 +78,26 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
return pa;
}
+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rcx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rdx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+ return kvm_r8_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+ return kvm_r9_read(vcpu);
+}
+
static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
{
return tdx->tdvpr.added;
@@ -820,6 +840,25 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
}
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u16 exit_reason = tdx->exit_reason.basic;
+
+ if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+ vmx_handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));
+ else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->mmio_needed = 0;
+ return 0;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1152,6 +1191,48 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}
+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
+{
+ union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+ if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+ if (exit_reason.basic == EXIT_REASON_TRIPLE_FAULT)
+ return tdx_handle_triple_fault(vcpu);
+
+ kvm_pr_unimpl("TD exit 0x%llx, %d hkid 0x%x hkid pa 0x%llx\n",
+ exit_reason.full, exit_reason.basic,
+ to_kvm_tdx(vcpu->kvm)->hkid,
+ set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
+ goto unhandled_exit;
+ }
+
+ WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+ switch (exit_reason.basic) {
+ default:
+ break;
+ }
+
+unhandled_exit:
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = exit_reason.full;
+ return 0;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ *reason = tdx->exit_reason.full;
+
+ *info1 = tdexit_exit_qual(vcpu);
+ *info2 = tdexit_ext_exit_qual(vcpu);
+
+ *intr_info = tdexit_intr_info(vcpu);
+ *error_code = 0;
+}
+
int tdx_dev_ioctl(void __user *argp)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 174e90eb7e2d..78f2d624b58e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -147,10 +147,15 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath);
void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -178,10 +183,16 @@ static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
+static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath) { return 0; }
static inline void tdx_deliver_interrupt(
struct kvm_lapic *apic, int delivery_mode, int trig_mode, int vector) {}
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+static inline void tdx_get_exit_info(
+ struct kvm_vcpu *vcpu, u32 *reason, u64 *info1, u64 *info2,
+ u32 *intr_info, u32 *error_code) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1
From: Isaku Yamahata <[email protected]>
A VMM interacts with the TDX module using a new instruction (SEAMCALL). A
TDX VMM uses SEAMCALLs where a VMX VMM would have directly interacted with
VMX instructions. For instance, a TDX VMM does not have full access to the
VM control structure corresponding to the VMX VMCS. Instead, a VMM induces
the TDX module to act on its behalf via SEAMCALLs.
Export __seamcall and define C wrapper functions for SEAMCALLs for
readability. Some SEAMCALL APIs donate pages to the TDX module or a guest TD.
The pages are encrypted with the TDX private host key ID set in the high bits
of the physical address. If any modified cache lines may exist for these
pages, flush them to memory with clflush_cache_range().
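Several of the wrappers pass "gpa | level" as a single operand. A minimal
userspace sketch of that packing follows, assuming a 4KB-aligned GPA so
that the low 12 bits are free to carry the level. The model_* helpers are
hypothetical, and the exact bit width the TDX module reserves for the level
is not stated in this patch.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_MASK  (~((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1))

/* Combine a page-aligned GPA with a page-table level in the low bits. */
static uint64_t model_pack_gpa_level(uint64_t gpa, int level)
{
	return (gpa & MODEL_PAGE_MASK) | (uint64_t)level;
}

/* Recover the page-aligned GPA from the packed operand. */
static uint64_t model_unpack_gpa(uint64_t packed)
{
	return packed & MODEL_PAGE_MASK;
}

/* Recover the level from the low bits of the packed operand. */
static int model_unpack_level(uint64_t packed)
{
	return (int)(packed & ~MODEL_PAGE_MASK);
}
```

The packing is lossless only while the GPA is page aligned and the level
fits below the page offset, which is why the model masks the GPA first.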
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kvm/vmx/tdx_ops.h | 185 +++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/seamcall.S | 2 +
3 files changed, 189 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index dfea0dd71bc1..c887618e3cec 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -144,6 +144,8 @@ struct tdsysinfo_struct {
bool platform_tdx_enabled(void);
int tdx_init(void);
const struct tdsysinfo_struct *tdx_get_sysinfo(void);
+u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
#else /* !CONFIG_INTEL_TDX_HOST */
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_init(void) { return -ENODEV; }
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..85adbf49c277
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,185 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/cacheflush.h>
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return __seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return __seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(page), PAGE_SIZE);
+ return __seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_SEPT_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return __seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return __seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_output *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return __seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+ clflush_cache_range(__va(tdr), PAGE_SIZE);
+ return __seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+ clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+ return __seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MNG_RD, tdr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+ return __seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+ return __seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, out);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+ return __seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_VP_RD, tdvpr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+ return __seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+ return __seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+ return __seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+ return __seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+ return __seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+ struct tdx_module_output *out)
+{
+ return __seamcall(TDH_VP_WR, tdvpr, field, val, mask, out);
+}
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_OPS_H */
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index f322427e48c3..aced0ed9b76a 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/linkage.h>
+#include <asm/export.h>
#include <asm/frame.h>
#include "tdxcall.S"
@@ -50,3 +51,4 @@ SYM_FUNC_START(__seamcall)
FRAME_END
RET
SYM_FUNC_END(__seamcall)
+EXPORT_SYMBOL_GPL(__seamcall)
--
2.25.1
From: Isaku Yamahata <[email protected]>
Add a placeholder function for the TDX-specific VM-scoped ioctl as
mem_enc_ioctl. TDX-specific subcommands will be added to retrieve/pass
TDX-specific parameters.
KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest-state-protected VMs. It defines subcommands for technology-specific
operations under KVM_MEMORY_ENCRYPT_OP. Despite its name, the subcommands
are not limited to memory encryption; various technology-specific
operations are defined. It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
for TDX-specific operations and define subcommands.
TDX requires VM-scoped and vCPU-scoped TDX-specific operations for the device
model, for example QEMU: getting system-wide parameters, TDX-specific VM
initialization, and TDX-specific vCPU initialization. This requires KVM
vCPU-scoped operations in addition to the existing VM-scoped operations.
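The validation order of the placeholder can be sketched in userspace as
below. This is a simplified model, not the kernel code: struct
model_tdx_cmd is hypothetical and carries only the fields this patch
checks, and the user copy and kvm->lock handling are omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced command: only the fields the placeholder inspects. */
struct model_tdx_cmd {
	uint32_t id;
	uint64_t error;
	uint64_t unused;
};

/* VM-scoped handler: input fields must be zero, and no subcommand exists yet. */
static int model_tdx_vm_ioctl(const struct model_tdx_cmd *cmd)
{
	if (cmd->error || cmd->unused)
		return -EINVAL;

	switch (cmd->id) {
	default:
		return -EINVAL;
	}
}

/* Common entry point: the ioctl is only meaningful for a TD. */
static int model_vt_mem_enc_ioctl(bool is_td, const struct model_tdx_cmd *cmd)
{
	if (!is_td)
		return -ENOTTY;

	return model_tdx_vm_ioctl(cmd);
}
```

With no subcommands defined, a TD gets -EINVAL for every command ID, while
a non-TD VM gets -ENOTTY before the command is even examined.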
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 9 +++++++++
arch/x86/kvm/vmx/tdx.c | 26 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
3 files changed, 39 insertions(+)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7b497ed1f21c..067f5de56c53 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -73,6 +73,14 @@ static void vt_vm_free(struct kvm *kvm)
return tdx_vm_free(kvm);
}
+static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+{
+ if (!is_td(kvm))
+ return -ENOTTY;
+
+ return tdx_vm_ioctl(kvm, argp);
+}
+
struct kvm_x86_ops vt_x86_ops __initdata = {
.name = "kvm_intel",
@@ -214,6 +222,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
.dev_mem_enc_ioctl = tdx_dev_ioctl,
+ .mem_enc_ioctl = vt_mem_enc_ioctl,
};
struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ec4ebba4152a..2a9dfd54189f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -438,6 +438,32 @@ int tdx_dev_ioctl(void __user *argp)
return 0;
}
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_tdx_cmd tdx_cmd;
+ int r;
+
+ if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+ return -EFAULT;
+ if (tdx_cmd.error || tdx_cmd.unused)
+ return -EINVAL;
+
+ mutex_lock(&kvm->lock);
+
+ switch (tdx_cmd.id) {
+ default:
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+ r = -EFAULT;
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
int __init tdx_module_setup(void)
{
const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 3027d9821fe1..ef6115ae0e88 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -137,6 +137,8 @@ int tdx_dev_ioctl(void __user *argp);
int tdx_vm_init(struct kvm *kvm);
void tdx_mmu_release_hkid(struct kvm *kvm);
void tdx_vm_free(struct kvm *kvm);
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
@@ -147,6 +149,8 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
static inline void tdx_vm_free(struct kvm *kvm) {}
+
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1
From: Isaku Yamahata <[email protected]>
Because TDX protects guest memory, the initial guest memory cannot be
created by copying the contents directly into guest memory, as is done for
the default VM type. It requires a dedicated TDX module API,
tdh_mem_page_add, instead. The KVM MMU page fault handler callback,
private_page_add, handles it.
Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of the VM-scoped
KVM_MEMORY_ENCRYPT_OP. It assigns the guest page, copies the initial
memory contents into the guest memory, and encrypts the guest memory. It
can optionally extend the memory measurement of the TDX guest at the same
time. It calls the KVM MMU page fault (EPT-violation) handler to trigger
the callbacks for it.
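As a reading aid for the description above, here is a minimal user-space
sketch of two details of the subcommand: the sanity checks on the region
and the way the measure flag rides in bit 0 of the page-aligned source
address. Every sketch_* and SKETCH_* name is hypothetical and is not part of
the patch; a 4KB page size is assumed, and the private-GPA check is left out
because it needs kernel state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4KB pages, as with PAGE_SHIFT == 12 on x86. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)
/* Stand-in for KVM_TDX_MEASURE_MEMORY_REGION (bit 0). */
#define SKETCH_MEASURE    (1ULL << 0)

/* Stand-in for struct kvm_tdx_init_mem_region. */
struct sketch_region {
	uint64_t source_addr;
	uint64_t gpa;
	uint64_t nr_pages;
};

/* Alignment, non-empty and wrap-around checks only. */
static bool sketch_region_valid(const struct sketch_region *r)
{
	uint64_t bytes = r->nr_pages << SKETCH_PAGE_SHIFT;

	if (r->source_addr & (SKETCH_PAGE_SIZE - 1))
		return false;
	if (r->gpa & (SKETCH_PAGE_SIZE - 1))
		return false;
	if (!r->nr_pages)
		return false;
	/* Reject a range whose end wraps around to or below its start. */
	return r->gpa + bytes > r->gpa;
}

/* A page-aligned address has zero low bits, so bit 0 can carry the flag. */
static uint64_t sketch_pack_source(uint64_t hpa, uint64_t flags)
{
	assert((hpa & (SKETCH_PAGE_SIZE - 1)) == 0);
	return hpa | (flags & SKETCH_MEASURE);
}

/* Recover the address that would be handed to the page-add call. */
static uint64_t sketch_source_pa(uint64_t packed)
{
	return packed & ~SKETCH_MEASURE;
}

/* True if the page should also be measured after it is added. */
static bool sketch_wants_measure(uint64_t packed)
{
	return (packed & SKETCH_MEASURE) != 0;
}
```

Packing the flag into the address keeps a single per-TD field, at the cost
of requiring the consumer to mask it off before use.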
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 9 ++
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/vmx/tdx.c | 135 +++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 2 +
tools/arch/x86/include/uapi/asm/kvm.h | 9 ++
5 files changed, 155 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 399c28b2f4f5..cb2b0701f0d9 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -539,6 +539,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
KVM_TDX_CMD_NR_MAX,
};
@@ -616,4 +617,12 @@ struct kvm_tdx_init_vm {
};
};
+#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 599c81504bea..da634fa4b75f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5285,6 +5285,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
out:
return r;
}
+EXPORT_SYMBOL(kvm_mmu_load);
void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3d578197d567..69550a1ea1d0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -557,6 +557,21 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
}
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+{
+ struct tdx_module_output out;
+ u64 err;
+ int i;
+
+ for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+ err = tdh_mr_extend(kvm_tdx->tdr.pa, gpa + i, &out);
+ if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
+ pr_tdx_error(TDH_MR_EXTEND, err, &out);
+ break;
+ }
+ }
+}
+
static void tdx_unpin_pfn(struct kvm *kvm, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
@@ -572,6 +587,7 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
hpa_t hpa = pfn_to_hpa(pfn);
gpa_t gpa = gfn_to_gpa(gfn);
struct tdx_module_output out;
+ hpa_t source_pa;
u64 err;
if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
@@ -584,14 +600,40 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
/* To prevent page migration, do nothing on mmu notifier. */
get_page(pfn_to_page(pfn));
+ /* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
if (likely(is_td_finalized(kvm_tdx))) {
err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
- put_page(pfn_to_page(pfn));
+ tdx_unpin_pfn(kvm, pfn);
}
return;
}
+
+ /*
+ * With the TDP MMU, the fault handler can run concurrently. Note that
+ * 'source_pa' is a TD-scoped variable, so it would break if multiple
+ * threads reached here and all needed to access 'source_pa'. Fortunately
+ * this can't happen, because the TDH_MEM_PAGE_ADD code path below is
+ * only used while the VM is being created, before it runs, via the
+ * KVM_TDX_INIT_MEM_REGION ioctl (which always uses vcpu 0's page table
+ * and is protected by vcpu->mutex).
+ */
+ if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
+ tdx_unpin_pfn(kvm, pfn);
+ return;
+ }
+
+ source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
+
+ err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &out);
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
+ tdx_unpin_pfn(kvm, pfn);
+ } else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
+ tdx_measure_page(kvm_tdx, gpa);
+
+ kvm_tdx->source_pa = INVALID_PAGE;
}
static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -1100,6 +1142,94 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu)
cpu_relax();
}
+#define TDX_SEPT_PFERR PFERR_WRITE_MASK
+
+static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_tdx_init_mem_region region;
+ struct kvm_vcpu *vcpu;
+ struct page *page;
+ kvm_pfn_t pfn;
+ int idx, ret = 0;
+
+ /* The BSP vCPU must be created before initializing memory regions. */
+ if (!atomic_read(&kvm->online_vcpus))
+ return -EINVAL;
+
+ if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
+ return -EINVAL;
+
+ if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
+ return -EFAULT;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) ||
+ !IS_ALIGNED(region.gpa, PAGE_SIZE) ||
+ !region.nr_pages ||
+ region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+ !kvm_is_private_gpa(kvm, region.gpa) ||
+ !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
+ return -EINVAL;
+
+ vcpu = kvm_get_vcpu(kvm, 0);
+ if (mutex_lock_killable(&vcpu->mutex))
+ return -EINTR;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&kvm->srcu);
+
+ kvm_mmu_reload(vcpu);
+
+ while (region.nr_pages) {
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+
+ /* Pin the source page. */
+ ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
+ if (ret < 0)
+ break;
+ if (ret != 1) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
+ (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION);
+
+ pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+ PG_LEVEL_4K);
+ if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
+ ret = -EFAULT;
+ else
+ ret = 0;
+
+ put_page(page);
+ if (ret)
+ break;
+
+ region.source_addr += PAGE_SIZE;
+ region.gpa += PAGE_SIZE;
+ region.nr_pages--;
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ mutex_unlock(&vcpu->mutex);
+
+ if (copy_to_user((void __user *)cmd->data, &region, sizeof(region)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1116,6 +1246,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_INIT_VM:
r = tdx_td_init(kvm, &tdx_cmd);
break;
+ case KVM_TDX_INIT_MEM_REGION:
+ r = tdx_init_mem_region(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index d8dcbedd690b..29e7accee733 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -25,6 +25,8 @@ struct kvm_tdx {
u64 xfam;
int hkid;
+ hpa_t source_pa;
+
bool finalized;
atomic_t tdh_mem_track;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 60a79f9ef174..af39f3adc179 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
KVM_TDX_CMD_NR_MAX,
};
@@ -610,4 +611,12 @@ struct kvm_tdx_init_vm {
};
};
+#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1
From: Sean Christopherson <[email protected]>
TDX protects TDX guest state from the VMM. Implement the access methods for
TDX guest state so that they ignore the request or return zero.
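The wrapper pattern described above can be illustrated with a small
stand-alone sketch. The sketch_* types and functions are hypothetical
stand-ins, not the kernel's; they only mirror the shape of the vt_*
wrappers: for a TD vcpu a getter returns zero and a setter does nothing,
otherwise the call is forwarded to the VMX implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_vcpu. */
struct sketch_vcpu {
	bool is_td;           /* stand-in for is_td_vcpu(vcpu) */
	unsigned long rflags; /* state the VMM may touch only for non-TD */
};

/* Stand-ins for the existing VMX accessors. */
static unsigned long sketch_vmx_get_rflags(struct sketch_vcpu *vcpu)
{
	return vcpu->rflags;
}

static void sketch_vmx_set_rflags(struct sketch_vcpu *vcpu, unsigned long val)
{
	vcpu->rflags = val;
}

/* Getter wrapper: protected guest state reads as zero for a TD. */
static unsigned long sketch_vt_get_rflags(struct sketch_vcpu *vcpu)
{
	if (vcpu->is_td)
		return 0;

	return sketch_vmx_get_rflags(vcpu);
}

/* Setter wrapper: the request is silently ignored for a TD. */
static void sketch_vt_set_rflags(struct sketch_vcpu *vcpu, unsigned long val)
{
	if (vcpu->is_td)
		return;

	sketch_vmx_set_rflags(vcpu, val);
}
```

In the patch itself some wrappers additionally use KVM_BUG_ON() or
WARN_ON_ONCE() for paths that are not expected to be reached for a TD; the
sketch leaves that distinction out.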
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 463 +++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 55 ++++-
arch/x86/kvm/vmx/x86_ops.h | 17 ++
3 files changed, 490 insertions(+), 45 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 552f2576d3ae..b9ad41ace499 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -268,6 +268,46 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
vmx_enable_smi_window(vcpu);
}
+static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_can_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
+{
+ /*
+ * This callback is triggered by the x86 instruction emulator. TDX
+ * doesn't allow guest memory inspection.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return X86EMUL_UNHANDLEABLE;
+
+ return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_apic_init_signal_blocked(vcpu);
+}
+
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_set_virtual_apic_mode(vcpu);
+
+ return vmx_set_virtual_apic_mode(vcpu);
+}
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -275,6 +315,31 @@ static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
memset(pi->pir, 0, sizeof(pi->pir));
}
+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_hwapic_isr_update(vcpu, max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 at the moment. */
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return false;
+
+ return vmx_guest_apic_has_interrupt(vcpu);
+}
+
static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -314,6 +379,177 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
kvm_vcpu_deliver_init(vcpu);
}
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return tdx_get_segment_base(vcpu, seg);
+
+ return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return tdx_get_segment(vcpu, var, seg);
+
+ vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return tdx_get_cpl(vcpu);
+
+ return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+ /*
+ * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+ * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+ * reach here for TD vcpu.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_cache_reg(vcpu, reg);
+
+ return vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_rflags(vcpu);
+
+ return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_if_flag(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -430,6 +666,15 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
return vmx_get_interrupt_shadow(vcpu);
}
+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+ unsigned char *hypercall)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_patch_hypercall(vcpu, hypercall);
+}
+
static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
if (is_td_vcpu(vcpu))
@@ -438,6 +683,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_inject_irq(vcpu, reinjected);
}
+static void vt_queue_exception(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_queue_exception(vcpu);
+}
+
static void vt_cancel_injection(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -470,6 +723,130 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
vmx_request_immediate_exit(vcpu);
}
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
+ vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
+ vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
+static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+ if (is_td_vcpu(vcpu)) {
+ if (is_mmio)
+ return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+ }
+
+ return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guest at the moment. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guest at the moment. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+ /* In TDX, tsc offset can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_offset(vcpu, offset);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+{
+ /* In TDX, tsc multiplier can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_multiplier(vcpu, multiplier);
+}
+
+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_update_cpu_dirty_logging(vcpu);
+}
+
+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
+{
+ /* VMX-preemption timer isn't available for TDX. */
+ if (is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+ /* VMX-preemption timer can't be set. See vt_set_hv_timer(). */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
@@ -522,29 +899,29 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,
- .update_exception_bitmap = vmx_update_exception_bitmap,
+ .update_exception_bitmap = vt_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vt_get_msr,
.set_msr = vt_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .set_cr0 = vmx_set_cr0,
+ .get_segment_base = vt_get_segment_base,
+ .get_segment = vt_get_segment,
+ .set_segment = vt_set_segment,
+ .get_cpl = vt_get_cpl,
+ .get_cs_db_l_bits = vt_get_cs_db_l_bits,
+ .set_cr0 = vt_set_cr0,
.is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
+ .set_cr4 = vt_set_cr4,
+ .set_efer = vt_set_efer,
+ .get_idt = vt_get_idt,
+ .set_idt = vt_set_idt,
+ .get_gdt = vt_get_gdt,
+ .set_gdt = vt_set_gdt,
+ .set_dr7 = vt_set_dr7,
+ .sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+ .cache_reg = vt_cache_reg,
+ .get_rflags = vt_get_rflags,
+ .set_rflags = vt_set_rflags,
+ .get_if_flag = vt_get_if_flag,
.flush_tlb_all = vt_flush_tlb_all,
.flush_tlb_current = vt_flush_tlb_current,
@@ -558,10 +935,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
.get_interrupt_shadow = vt_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
+ .patch_hypercall = vt_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
- .queue_exception = vmx_queue_exception,
+ .queue_exception = vt_queue_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
.nmi_allowed = vt_nmi_allowed,
@@ -569,39 +946,39 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_nmi_mask = vt_set_nmi_mask,
.enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .update_cr8_intercept = vt_update_cr8_intercept,
+ .set_virtual_apic_mode = vt_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vt_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vt_load_eoi_exitmap,
.apicv_post_state_restore = vt_apicv_post_state_restore,
.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .hwapic_irr_update = vt_hwapic_irr_update,
+ .hwapic_isr_update = vt_hwapic_isr_update,
+ .guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
.sync_pir_to_irr = vt_sync_pir_to_irr,
.deliver_interrupt = vt_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
+ .set_tss_addr = vt_set_tss_addr,
+ .set_identity_map_addr = vt_set_identity_map_addr,
+ .get_mt_mask = vt_get_mt_mask,
.get_exit_info = vt_get_exit_info,
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+ .vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
+ .get_l2_tsc_offset = vt_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+ .write_tsc_offset = vt_write_tsc_offset,
+ .write_tsc_multiplier = vt_write_tsc_multiplier,
.load_mmu_pgd = vt_load_mmu_pgd,
- .check_intercept = vmx_check_intercept,
+ .check_intercept = vt_check_intercept,
.handle_exit_irqoff = vt_handle_exit_irqoff,
.request_immediate_exit = vt_request_immediate_exit,
@@ -609,7 +986,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sched_in = vt_sched_in,
.cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+ .update_cpu_dirty_logging = vt_update_cpu_dirty_logging,
.nested_ops = &vmx_nested_ops,
@@ -617,8 +994,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.pi_start_assignment = vmx_pi_start_assignment,
#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
+ .set_hv_timer = vt_set_hv_timer,
+ .cancel_hv_timer = vt_cancel_hv_timer,
#endif
.setup_mce = vmx_setup_mce,
@@ -628,8 +1005,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.leave_smm = vt_leave_smm,
.enable_smi_window = vt_enable_smi_window,
- .can_emulate_instruction = vmx_can_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .can_emulate_instruction = vt_can_emulate_instruction,
+ .apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,
.msr_filter_changed = vmx_msr_filter_changed,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d81a0a832ce2..10207afddec8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3,6 +3,7 @@
#include <linux/mmu_context.h>
#include <asm/fpu/xcr.h>
+#include <asm/virtext.h>
#include <asm/tdx.h>
#include "capabilities.h"
@@ -609,8 +610,15 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
- vcpu->arch.guest_state_protected =
- !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ /*
+ * TODO: support off-TD debug. If TD DEBUG is enabled, guest state can
+ * be accessed, so guest_state_protected should be false and the KVM ioctls
+ * that access CPU state should be usable by user space VMM (e.g. qemu).
+ *
+ * vcpu->arch.guest_state_protected =
+ * !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ */
+ vcpu->arch.guest_state_protected = true;
tdx->pi_desc.nv = POSTED_INTR_VECTOR;
tdx->pi_desc.sn = 1;
@@ -1855,6 +1863,49 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
vcpu->arch.smi_pending = false;
}
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ /* Only x2APIC mode is supported for TD. */
+ WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
+int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ kvm_register_mark_available(vcpu, reg);
+ switch (reg) {
+ case VCPU_REGS_RSP:
+ case VCPU_REGS_RIP:
+ case VCPU_EXREG_PDPTR:
+ case VCPU_EXREG_CR0:
+ case VCPU_EXREG_CR3:
+ case VCPU_EXREG_CR4:
+ break;
+ default:
+ KVM_BUG_ON(1, vcpu->kvm);
+ break;
+ }
+}
+
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ return 0;
+}
+
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+ memset(var, 0, sizeof(*var));
+}
+
int tdx_dev_ioctl(void __user *argp)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 1c4672037a2e..2e204002efb1 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -163,6 +163,14 @@ int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+
+int tdx_get_cpl(struct kvm_vcpu *vcpu);
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
+bool tdx_is_emulated_msr(u32 index, bool write);
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -203,10 +211,19 @@ static inline void tdx_get_exit_info(
static inline bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+
static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate) { return 0; }
static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate) { return 0; }
static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
+static inline void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+
+static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
+static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0;}
+static inline void tdx_get_segment(
+ struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1
From: Sean Christopherson <[email protected]>
TDX handles NMI/exception exits mostly the same way as VMX. The
difference is how the exit qualification is retrieved. To share the code
with TDX, move the NMI/exception handling to a common header, common.h.
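For readers unfamiliar with the shared handler, the following stand-alone
sketch shows the kind of classification it performs on the VM-exit
interruption information. All SK_* and sk_* names are hypothetical. The bit
layout (vector in bits 7:0, type in bits 10:8, valid in bit 31) and the
vector numbers (#NM 7, #PF 14, #MC 18) are written from memory of the Intel
SDM and should be treated as assumptions to verify, not as the kernel's
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the VM-exit interruption information field. */
#define SK_VECTOR_MASK   0xffu
#define SK_TYPE_MASK     0x700u
#define SK_VALID         0x80000000u
#define SK_TYPE_NMI      (2u << 8)
#define SK_TYPE_HARD_EXC (3u << 8)

/* What the irqoff handler would do for a given exit. */
enum sk_action { SK_NONE, SK_ASYNC_PF, SK_NM, SK_MCE, SK_NMI };

/* Valid hardware exception with the given vector? */
static int sk_is_exception_n(uint32_t info, uint8_t vector)
{
	return (info & (SK_VECTOR_MASK | SK_TYPE_MASK | SK_VALID)) ==
	       (vector | SK_TYPE_HARD_EXC | SK_VALID);
}

/* Valid event of type NMI? */
static int sk_is_nmi(uint32_t info)
{
	return (info & (SK_TYPE_MASK | SK_VALID)) == (SK_TYPE_NMI | SK_VALID);
}

/* Same order of checks as the if/else chain in the shared handler. */
static enum sk_action sk_classify(uint32_t info)
{
	if (sk_is_exception_n(info, 14))	/* #PF: check for async PF */
		return SK_ASYNC_PF;
	else if (sk_is_exception_n(info, 7))	/* #NM: save xfd_err */
		return SK_NM;
	else if (sk_is_exception_n(info, 18))	/* #MC: machine check */
		return SK_MCE;
	else if (sk_is_nmi(info))		/* NMI: call the host handler */
		return SK_NMI;

	return SK_NONE;
}
```

Because the classification depends only on intr_info, the same code can
serve both VMX and TDX once each side supplies the value from its own
source.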
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 70 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 79 ++++-----------------------------------
2 files changed, 78 insertions(+), 71 deletions(-)
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 1522e9e6851b..fd5ed3c0f894 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,8 +4,78 @@
#include <linux/kvm_host.h>
+#include <asm/traps.h>
+
#include "posted_intr.h"
#include "mmu.h"
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
+
+static inline void vmx_handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
+ unsigned long entry)
+{
+ bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
+
+ kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
+ vmx_do_interrupt_nmi_irqoff(entry);
+ kvm_after_interrupt(vcpu);
+}
+
+static inline void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Save xfd_err to guest_fpu before interrupt is enabled, so the
+ * MSR value is not clobbered by the host activity before the guest
+ * has chance to consume it.
+ *
+ * Do not blindly read xfd_err here, since this exception might
+ * be caused by L1 interception on a platform which doesn't
+ * support xfd at all.
+ *
+ * Do it conditionally upon guest_fpu::xfd. xfd_err matters
+ * only when xfd contains a non-zero value.
+ *
+ * Queuing exception is done in vmx_handle_exit. See comment there.
+ */
+ if (vcpu->arch.guest_fpu.fpstate->xfd)
+ rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+}
+
+static inline void vmx_handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
+
+ /* if exit due to PF check for async PF */
+ if (is_page_fault(intr_info))
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* if exit due to NM, handle before interrupts are enabled */
+ else if (is_nm_fault(intr_info))
+ vmx_handle_nm_fault_irqoff(vcpu);
+ /* Handle machine checks before interrupts are enabled */
+ else if (is_machine_check(intr_info))
+ kvm_machine_check();
+ /* We need to handle NMIs before interrupts are enabled */
+ else if (is_nmi(intr_info))
+ vmx_handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+ gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+ if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+ "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
+ return;
+
+ vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+ vcpu->arch.at_instruction_boundary = true;
+}
static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
unsigned long exit_qualification)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ccc245fbe0a1..5c5580ab98d3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -468,7 +468,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
vmx->segment_cache.bitmask = 0;
}
-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;
#if IS_ENABLED(CONFIG_HYPERV)
static bool __read_mostly enlightened_vmcs = true;
@@ -4125,7 +4125,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */
- vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */
+ vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base); /* 22.2.4 */
vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */
@@ -4970,10 +4970,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
intr_info = vmx_get_intr_info(vcpu);
if (is_machine_check(intr_info) || is_nmi(intr_info))
- return 1; /* handled by handle_exception_nmi_irqoff() */
+ return 1; /* handled by vmx_handle_exception_nmi_irqoff() */
/*
- * Queue the exception here instead of in handle_nm_fault_irqoff().
+ * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
* This ensures the nested_vmx check is not skipped so vmexit can
* be reflected to L1 (when it intercepts #NM) before reaching this
* point.
@@ -6645,70 +6645,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
- unsigned long entry)
-{
- bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
-
- kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
- vmx_do_interrupt_nmi_irqoff(entry);
- kvm_after_interrupt(vcpu);
-}
-
-static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
-{
- /*
- * Save xfd_err to guest_fpu before interrupt is enabled, so the
- * MSR value is not clobbered by the host activity before the guest
- * has chance to consume it.
- *
- * Do not blindly read xfd_err here, since this exception might
- * be caused by L1 interception on a platform which doesn't
- * support xfd at all.
- *
- * Do it conditionally upon guest_fpu::xfd. xfd_err matters
- * only when xfd contains a non-zero value.
- *
- * Queuing exception is done in vmx_handle_exit. See comment there.
- */
- if (vcpu->arch.guest_fpu.fpstate->xfd)
- rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
-}
-
-static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
- const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-
- /* if exit due to PF check for async PF */
- if (is_page_fault(intr_info))
- vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
- /* if exit due to NM, handle before interrupts are enabled */
- else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(vcpu);
- /* Handle machine checks before interrupts are enabled */
- else if (is_machine_check(intr_info))
- kvm_machine_check();
- /* We need to handle NMIs before interrupts are enabled */
- else if (is_nmi(intr_info))
- handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
- u32 intr_info)
-{
- unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
- gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
- if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
- "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
- return;
-
- handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
- vcpu->arch.at_instruction_boundary = true;
-}
-
void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6717,9 +6653,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;
if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
}
/*
@@ -7980,7 +7917,7 @@ __init int vmx_hardware_setup(void)
int r;
store_idt(&dt);
- host_idt_base = dt.address;
+ vmx_host_idt_base = dt.address;
vmx_setup_user_return_msrs();
--
2.25.1
From: Sean Christopherson <[email protected]>
For virtual IO, the guest TD shares guest pages with the VMM without
encryption. Shared EPT is used to map those guest pages in an unprotected way.
Add the VMCS field encoding for the shared EPTP, which will be used by
TDX to have separate EPT walks for private GPAs (existing EPTP) versus
shared GPAs (new shared EPTP).
Set the shared EPT pointer value for the TDX guest to initialize the TDX MMU.
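
As a rough illustration of the split between the two EPT walks, here is a
user-space sketch. It is not KVM code. All names (ept_root_for_gpa(),
gpa_shared_mask(), eptp_from_root()) are hypothetical, and the position of the
shared bit is passed in as an assumption rather than read from the TD
configuration.

```c
#include <stdint.h>

/* Hypothetical names throughout; this is a user-space model, not KVM code. */
enum ept_root { EPT_ROOT_PRIVATE, EPT_ROOT_SHARED };

/* Assumption: a single GPA bit marks a page as shared; the caller picks it. */
static inline uint64_t gpa_shared_mask(unsigned int shared_bit)
{
	return UINT64_C(1) << shared_bit;
}

/* Pick the EPT root that a walk for @gpa would start from. */
static inline enum ept_root ept_root_for_gpa(uint64_t gpa, uint64_t shared_mask)
{
	return (gpa & shared_mask) ? EPT_ROOT_SHARED : EPT_ROOT_PRIVATE;
}

/* Mirrors "root_hpa & PAGE_MASK": keep only the 4KB-aligned address. */
static inline uint64_t eptp_from_root(uint64_t root_hpa)
{
	const uint64_t page_mask = ~(UINT64_C(4096) - 1);

	return root_hpa & page_mask;
}
```

In the patch itself only the shared root is written by KVM (via
td_vmcs_write64()); the private root is not handled by tdx_load_mmu_pgd().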
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/vmx/main.c | 11 ++++++++++-
arch/x86/kvm/vmx/tdx.c | 5 +++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f0f8eecf55ac..e169ace97e83 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -234,6 +234,7 @@ enum vmcs_field {
TSC_MULTIPLIER_HIGH = 0x00002033,
TERTIARY_VM_EXEC_CONTROL = 0x00002034,
TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
+ SHARED_EPT_POINTER = 0x0000203C,
PID_POINTER_TABLE = 0x00002042,
PID_POINTER_TABLE_HIGH = 0x00002043,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9f4c3a0bcc12..252b7298b230 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -110,6 +110,15 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
return vmx_vcpu_reset(vcpu, init_event);
}
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+ int pgd_level)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+
+ vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -228,7 +237,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.write_tsc_offset = vmx_write_tsc_offset,
.write_tsc_multiplier = vmx_write_tsc_multiplier,
- .load_mmu_pgd = vmx_load_mmu_pgd,
+ .load_mmu_pgd = vt_load_mmu_pgd,
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2772775457b0..24b428b7491d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -532,6 +532,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->kvm->vm_bugged = true;
}
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+ td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
+}
+
int tdx_dev_ioctl(void __user *argp)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7e38c7b756d4..e70f84d29d21 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -144,6 +144,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
@@ -161,6 +163,8 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
#endif
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1
From: Isaku Yamahata <[email protected]>
On vcpu migration, VMX flushes the VMCS on the source pcpu and loads it on
the target pcpu. TDX has corresponding SEAMCALL APIs; call them on vcpu
migration. The logic is mostly the same as VMX except that the TDX SEAMCALLs
are used.
When shutting down the machine, (VMX or TDX) vcpus need to be shut down on
each pcpu. Do the same for TDX with the TDX SEAMCALL APIs.
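
The association tracking can be pictured with the following user-space sketch.
It is a simplified model under stated assumptions, not the kernel code: the
model_* names and NR_CPUS are hypothetical, the list is a hand-rolled doubly
linked list instead of struct list_head, and IPIs, interrupt masking, memory
barriers and the actual TDH.VP.FLUSH SEAMCALL are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4	/* arbitrary size for the model */

/* Hypothetical stand-in for a TD vCPU; only the fields the model needs. */
struct model_vcpu {
	int cpu;			/* -1 when not associated with any CPU */
	struct model_vcpu *prev, *next;	/* links in the per-CPU list */
};

/* One list head per CPU, playing the role of associated_tdvcpus. */
static struct model_vcpu *cpu_list[NR_CPUS];

/* Models the flush plus disassociation that runs on the old CPU. */
static void model_flush_vp(struct model_vcpu *v)
{
	if (v->cpu == -1)
		return;
	if (v->prev)
		v->prev->next = v->next;
	else
		cpu_list[v->cpu] = v->next;
	if (v->next)
		v->next->prev = v->prev;
	v->prev = v->next = NULL;
	v->cpu = -1;
}

/* Models vcpu load: leave the old CPU's list, then join the new one. */
static void model_vcpu_load(struct model_vcpu *v, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (v->cpu == cpu)
		return;
	model_flush_vp(v);
	v->next = cpu_list[cpu];
	if (v->next)
		v->next->prev = v;
	cpu_list[cpu] = v;
	v->cpu = cpu;	/* in KVM, common code updates vcpu->cpu */
}

/* Models hardware disable: drop every vCPU still tied to @cpu. */
static void model_cpu_down(int cpu)
{
	while (cpu_list[cpu])
		model_flush_vp(cpu_list[cpu]);
}
```

The model does not distinguish between flushing and disassociating; in the
patch, tdx_hardware_disable() only disassociates, while tdx_flush_vp() issues
the SEAMCALL first.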
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 43 +++++++++++--
arch/x86/kvm/vmx/tdx.c | 121 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 2 +
arch/x86/kvm/vmx/x86_ops.h | 6 ++
4 files changed, 168 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f101f358d90c..ad09988c4faa 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -17,6 +17,25 @@ static bool vt_is_vm_type_supported(unsigned long type)
(enable_tdx && tdx_is_vm_type_supported(type));
}
+static int vt_hardware_enable(void)
+{
+ int ret;
+
+ ret = vmx_hardware_enable();
+ if (ret)
+ return ret;
+
+ tdx_hardware_enable();
+ return 0;
+}
+
+static void vt_hardware_disable(void)
+{
+ /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+ tdx_hardware_disable();
+ vmx_hardware_disable();
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -151,6 +170,14 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_vcpu_run(vcpu);
}
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_load(vcpu, cpu);
+
+ return vmx_vcpu_load(vcpu, cpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -192,6 +219,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}
+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_sched_in(vcpu, cpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -214,8 +249,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_unsetup = vt_hardware_unsetup,
.check_processor_compatibility = vmx_check_processor_compatibility,
- .hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
+ .hardware_enable = vt_hardware_enable,
+ .hardware_disable = vt_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,
.is_vm_type_supported = vt_is_vm_type_supported,
@@ -231,7 +266,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_reset = vt_vcpu_reset,
.prepare_switch_to_guest = vt_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
+ .vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,
.update_exception_bitmap = vmx_update_exception_bitmap,
@@ -317,7 +352,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.request_immediate_exit = vmx_request_immediate_exit,
- .sched_in = vmx_sched_in,
+ .sched_in = vt_sched_in,
.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0de113a643e4..4db9bfe2c534 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -61,6 +61,14 @@ static struct tdx_capabilities tdx_caps;
static DEFINE_MUTEX(tdx_lock);
static struct mutex *tdx_mng_key_config_lock;
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
+ * Protected by interrupt mask. This list is manipulated in process context
+ * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
{
pa &= ~hkid_mask;
@@ -95,6 +103,36 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
return kvm_tdx->finalized;
}
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+ list_del(&to_tdx(vcpu)->cpu_list);
+
+ /*
+ * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
+ * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+ * to its list before it's deleted from this CPU's list.
+ */
+ smp_wmb();
+
+ vcpu->cpu = -1;
+}
+
+void tdx_hardware_enable(void)
+{
+ INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
+}
+
+void tdx_hardware_disable(void)
+{
+ int cpu = raw_smp_processor_id();
+ struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+ struct vcpu_tdx *tdx, *tmp;
+
+ /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+ list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
+ tdx_disassociate_vp(&tdx->vcpu);
+}
+
static void tdx_clear_page(unsigned long page)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -171,6 +209,41 @@ static void tdx_reclaim_td_page(struct tdx_td_page *page)
free_page(page->va);
}
+static void tdx_flush_vp(void *arg)
+{
+ struct kvm_vcpu *vcpu = arg;
+ u64 err;
+
+ lockdep_assert_irqs_disabled();
+
+ /* Task migration can race with CPU offlining. */
+ if (vcpu->cpu != raw_smp_processor_id())
+ return;
+
+ /*
+ * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
+ * list tracking still needs to be updated so that it's correct if/when
+ * the vCPU does get initialized.
+ */
+ if (is_td_vcpu_created(to_tdx(vcpu))) {
+ err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
+ if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+ if (WARN_ON_ONCE(err))
+ pr_tdx_error(TDH_VP_FLUSH, err, NULL);
+ }
+ }
+
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ if (unlikely(vcpu->cpu == -1))
+ return;
+
+ smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
+}
+
static int tdx_do_tdh_phymem_cache_wb(void *param)
{
u64 err = 0;
@@ -195,9 +268,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
cpumask_var_t packages;
bool cpumask_allocated;
+ struct kvm_vcpu *vcpu;
u64 err;
int ret;
int i;
+ unsigned long j;
if (!is_hkid_assigned(kvm_tdx))
return;
@@ -205,6 +280,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
if (!is_td_created(kvm_tdx))
goto free_hkid;
+ kvm_for_each_vcpu(j, vcpu, kvm)
+ tdx_flush_vp_on_cpu(vcpu);
+
+ mutex_lock(&tdx_lock);
+ err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
+ mutex_unlock(&tdx_lock);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+ pr_err("tdh_mng_vpflushdone failed. HKID %d is leaked.\n",
+ kvm_tdx->hkid);
+ return;
+ }
+
cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
cpus_read_lock();
for_each_online_cpu(i) {
@@ -481,6 +569,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return ret;
}
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->cpu == cpu)
+ return;
+
+ tdx_flush_vp_on_cpu(vcpu);
+
+ local_irq_disable();
+ /*
+ * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+ * vcpu->cpu is read before tdx->cpu_list.
+ */
+ smp_rmb();
+
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ local_irq_enable();
+}
+
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -527,6 +635,19 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
tdx_reclaim_td_page(&tdx->tdvpx[i]);
kfree(tdx->tdvpx);
tdx_reclaim_td_page(&tdx->tdvpr);
+
+ /*
+ * kvm_free_vcpus()
+ * -> kvm_unload_vcpu_mmu()
+ *
+ * does vcpu_load() for every vcpu after they were already disassociated
+ * from the per-cpu list by tdx_vm_teardown(). So we need to
+ * disassociate them again, otherwise the freed vcpu data will be
+ * accessed when doing list_{del,add}() on the associated_tdvcpus list
+ * later.
+ */
+ tdx_flush_vp_on_cpu(vcpu);
+ WARN_ON(vcpu->cpu != -1);
}
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 414c15235ed0..32e05efa70f9 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -85,6 +85,8 @@ struct vcpu_tdx {
struct tdx_td_page tdvpr;
struct tdx_td_page *tdvpx;
+ struct list_head cpu_list;
+
union tdx_exit_reason exit_reason;
bool initialized;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 2213739c2303..55273a0fe273 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -132,6 +132,8 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
bool tdx_is_vm_type_supported(unsigned long type);
void tdx_hardware_unsetup(void);
+void tdx_hardware_enable(void);
+void tdx_hardware_disable(void);
int tdx_dev_ioctl(void __user *argp);
int tdx_vm_init(struct kvm *kvm);
@@ -144,6 +146,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -154,6 +157,8 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
static inline void tdx_hardware_unsetup(void) {}
+static inline void tdx_hardware_enable(void) {}
+static inline void tdx_hardware_disable(void) {}
static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -167,6 +172,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1
From: Isaku Yamahata <[email protected]>
For a private GPA, the CPU refers to a private page table whose contents are
encrypted. Dedicated APIs must be used to operate on it (e.g. updating/reading
its PTE entry), and they are expensive.
When KVM resolves a KVM page fault, it walks the page tables. To reuse the
existing KVM MMU code and mitigate the heavy cost of directly walking the
encrypted private page table, allocate one more page to mirror the existing
KVM page table. Resolve the KVM page fault with the existing code, and do the
additional operations necessary for the mirrored private page table. To
distinguish such cases, the existing KVM page table is called a shared page
table (i.e. no mirrored private page table), and the KVM page table with a
mirrored private page table is called a private page table. The
relationship is depicted below.
Add a private pointer to struct kvm_mmu_page for the mirrored private page
table and add helper functions to allocate/initialize/free a mirrored private
page table page. Also, add helper functions to check if a given
kvm_mmu_page is private. A later patch introduces hooks to operate on
the mirrored private page table.
              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
    shared GPA            private GPA            |
        |                      |                 |
        V                      V                 |
CPU/KVM shared PT root KVM private PT root       |    CPU private PT root
        |                      |                 |             |
        V                      V                 |             V
    shared PT             private PT <----mirror----> mirrored private PT
        |                      |                 |             |
        |                      \-----------------+------\      |
        |                                        |      |      |
        V                                        |      V      V
shared guest page                                | private guest page
                                                 |
                           non-encrypted memory  |  encrypted memory
                                                 |
PT: page table
Both the CPU and KVM refer to the CPU/KVM shared page table. The private page
table is used only by KVM. The CPU refers to the mirrored private page table.
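
The helper semantics can be sketched in user space as follows. This is a
simplified model, not the kernel code: struct model_mmu_page and the model_*
functions are hypothetical stand-ins, and calloc()/free() replace the
pre-filled mmu_private_sp_cache and free_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sentinel for the root: the root Secure EPT page is not allocated here. */
#define MODEL_PRIVATE_SP_ROOT ((void *)1)

/* Hypothetical, trimmed-down stand-in for struct kvm_mmu_page. */
struct model_mmu_page {
	void *spt;		/* page table page that KVM walks */
	void *private_sp;	/* mirrored private page, NULL if shared */
};

/* A page is private exactly when it carries a mirrored page (or sentinel). */
static bool model_is_private_sp(const struct model_mmu_page *sp)
{
	return sp->private_sp != NULL;
}

static void model_alloc_private_sp(struct model_mmu_page *sp, bool is_root)
{
	/* calloc() stands in for taking a page from the topped-up cache. */
	sp->private_sp = is_root ? MODEL_PRIVATE_SP_ROOT : calloc(1, 4096);
	assert(sp->private_sp);
}

static void model_free_private_sp(struct model_mmu_page *sp)
{
	/* The sentinel was never allocated, so it must not be freed. */
	if (sp->private_sp != MODEL_PRIVATE_SP_ROOT)
		free(sp->private_sp);
	sp->private_sp = NULL;
}
```

Using a non-NULL dummy value for the root keeps the "is private" check a plain
pointer test, at the cost of a special case in the free path.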
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 9 ++++
arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 3 ++
4 files changed, 97 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f4d4ed41641b..bfc934dc9a33 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -716,6 +716,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_gfn_array_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ struct kvm_mmu_memory_cache mmu_private_sp_cache;
/*
* QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c517c7bca105..a5bf3e40e209 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -691,6 +691,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
int start, end, i, r;
bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
+
if (is_tdp_mmu && shadow_nonpresent_value)
start = kvm_mmu_memory_cache_nr_free_objects(mc);
@@ -732,6 +739,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
@@ -1736,6 +1744,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
if (!direct)
sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+ kvm_mmu_init_private_sp(sp, NULL);
/*
* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 44a04fad4bed..9f3a6bea60a3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -55,6 +55,10 @@ struct kvm_mmu_page {
u64 *spt;
/* hold the gfn of each spte inside spt */
gfn_t *gfns;
+#ifdef CONFIG_KVM_MMU_PRIVATE
+ /* associated private shadow page, e.g. SEPT page. */
+ void *private_sp;
+#endif
/* Currently serving as active root */
union {
int root_count;
@@ -115,6 +119,86 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}
+/*
+ * TDX vcpu allocates a page for the root Secure EPT page and assigns it to the
+ * CPU secure EPT pointer. KVM doesn't need to allocate and link to the secure
+ * EPT. Dummy value to make is_private_sp() return true.
+ */
+#define KVM_MMU_PRIVATE_SP_ROOT ((void *)1)
+
+#ifdef CONFIG_KVM_MMU_PRIVATE
+static inline bool is_private_sp(struct kvm_mmu_page *sp)
+{
+ return !!sp->private_sp;
+}
+
+static inline bool is_private_sptep(u64 *sptep)
+{
+ WARN_ON(!sptep);
+ return is_private_sp(sptep_to_sp(sptep));
+}
+
+static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
+{
+ return sp->private_sp;
+}
+
+static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
+{
+ sp->private_sp = private_sp;
+}
+
+/* Valid sp->role.level is required. */
+static inline void kvm_mmu_alloc_private_sp(
+ struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
+{
+ if (is_root)
+ sp->private_sp = KVM_MMU_PRIVATE_SP_ROOT;
+ else
+ sp->private_sp = kvm_mmu_memory_cache_alloc(
+ &vcpu->arch.mmu_private_sp_cache);
+ /*
+ * Because mmu_private_sp_cache is topped up before starting kvm page
+ * fault resolving, the allocation above shouldn't fail.
+ */
+ WARN_ON_ONCE(!sp->private_sp);
+}
+
+static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
+{
+ if (sp->private_sp != KVM_MMU_PRIVATE_SP_ROOT)
+ free_page((unsigned long)sp->private_sp);
+}
+#else
+static inline bool is_private_sp(struct kvm_mmu_page *sp)
+{
+ return false;
+}
+
+static inline bool is_private_sptep(u64 *sptep)
+{
+ return false;
+}
+
+static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
+{
+ return NULL;
+}
+
+static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
+{
+}
+
+static inline void kvm_mmu_alloc_private_sp(
+ struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
+{
+}
+
+static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
+{
+}
+#endif
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7eb41b176d1e..b2568b062faa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -72,6 +72,8 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
+ if (is_private_sp(sp))
+ kvm_mmu_free_private_sp(sp);
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
@@ -295,6 +297,7 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
sp->gfn = gfn;
sp->ptep = sptep;
sp->tdp_mmu_page = true;
+ kvm_mmu_init_private_sp(sp, NULL);
trace_kvm_mmu_get_page(sp, true);
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
This empty commit marks the start of the patch series for TD finalization.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 5797d172176d..53897312699f 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -21,11 +21,11 @@ Patch Layer status
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
-* TD finalization: Not yet
+* TD finalization: Applying
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA shared bits: Applied
* KVM TDP refactoring for TDX: Applied
* KVM TDP MMU hooks: Applied
-* KVM TDP MMU MapGPA: Not yet
+* KVM TDP MMU MapGPA: Applied
--
2.25.1
From: Isaku Yamahata <[email protected]>
In this patch series, TDX supports only TDP MMU and doesn't support legacy
MMU. Forcibly use TDP MMU for TDX irrespective of the kernel parameter that
disables TDP MMU.
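
The resulting decision can be written as a small predicate. The sketch below
is a user-space illustration only; enum model_vm_type and model_use_tdp_mmu()
are hypothetical names, not part of this patch.

```c
#include <stdbool.h>

/* Hypothetical VM type values for the model; the real ones live in the uapi. */
enum model_vm_type { MODEL_DEFAULT_VM, MODEL_TDX_VM };

/*
 * Returns true when the TDP MMU should be initialized: always for a TD,
 * otherwise only if both TDP and the TDP MMU parameter are enabled.
 */
static bool model_use_tdp_mmu(enum model_vm_type type, bool tdp_enabled,
			      bool tdp_mmu_enabled)
{
	if (type == MODEL_TDX_VM)
		return true;
	return tdp_enabled && tdp_mmu_enabled;
}
```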
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 82f1bfac7ee6..7eb41b176d1e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -18,8 +18,13 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
struct workqueue_struct *wq;
- if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
- return 0;
+ /*
+ * Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
+ * of TDX.
+ */
+ if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
+ (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
+ return 0;
wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
if (!wq)
--
2.25.1
From: Isaku Yamahata <[email protected]>
To keep the non-TDX case intact, introduce a new config option for
private KVM MMU support. At the moment, this is a synonym for
CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL. The new flag makes it clear
that the config is only for the x86 KVM MMU.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..5a59abc83179 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -129,4 +129,8 @@ config KVM_XEN
config KVM_EXTERNAL_WRITE_TRACKING
bool
+config KVM_MMU_PRIVATE
+ def_bool y
+ depends on INTEL_TDX_HOST && KVM_INTEL
+
endif # VIRTUALIZATION
--
2.25.1
From: Isaku Yamahata <[email protected]>
Add documentation for Intel Trust Domain Extensions (TDX) support.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/api.rst | 9 +-
Documentation/virt/kvm/intel-tdx.rst | 381 +++++++++++++++++++++++++++
2 files changed, 389 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virt/kvm/intel-tdx.rst
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index b9ab598883b2..653ba93452f3 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1402,6 +1402,9 @@ It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
allocation and is deprecated.
+For a TDX guest, deleting/moving a memory region loses guest memory contents.
+Read-only regions aren't supported. Only as-id 0 is supported.
+
4.36 KVM_SET_TSS_ADDR
---------------------
@@ -4688,7 +4691,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
:Capability: basic
:Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error
@@ -4700,6 +4703,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virt/kvm/amd-memory-encryption.rst.
+This ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/intel-tdx.rst.
+
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------
diff --git a/Documentation/virt/kvm/intel-tdx.rst b/Documentation/virt/kvm/intel-tdx.rst
new file mode 100644
index 000000000000..3fae2cf9e534
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx.rst
@@ -0,0 +1,381 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions, which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. [1]
+For details, refer to the specifications listed in [2], [3], [4], [5],
+[6], and [7].
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+ /* Trust Domain eXtension sub-ioctl() commands. */
+ enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+
+ KVM_TDX_CMD_NR_MAX,
+ };
+
+ struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+ /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+ __u64 unused;
+ };
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: vm ioctl
+
+A subset of TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAM call is
+returned, which describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- flags: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+ };
+
+ struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[0];
+ };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX which corresponds to
+TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- flags: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_init_vm {
+ __u32 max_vcpus;
+ __u32 reserved;
+ __u64 attributes;
+ __u64 cpuid; /* pointer to struct kvm_cpuid2 */
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ __u64 reserved[43]; /* must be zero for future extensibility */
+ };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX which corresponds to
+TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- flags: must be 0
+- data: initial value of the guest TD VCPU RCX
+- error: must be 0
+- unused: must be 0
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a contiguous memory region, which corresponds to the TDH.MEM.PAGE.ADD
+TDX SEAM call.
+If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement
+which corresponds to TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- flags: flags
+ currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+- error: must be 0
+- unused: must be 0
+
+::
+
+ #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+ struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+ };
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Completes measurement of the initial TD contents and marks the TD ready to run,
+which corresponds to the TDH.MR.FINALIZE TDX SEAM call.
+
+- id: KVM_TDX_FINALIZE_VM
+- flags: must be 0
+- data: must be 0
+- error: must be 0
+- unused: must be 0
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, the new TDX ioctls need to be called. The
+control flow is as follows.
+
+#. system wide capability check
+ * KVM_CAP_VM_TYPES: check if VM types are supported and if KVM_X86_TDX_VM is
+ supported.
+
+#. creating VM
+ * KVM_CREATE_VM
+ * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+ * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+ * KVM_CREATE_VCPU
+ * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+
+#. initializing guest memory
+ * allocate guest memory and initialize pages as in the normal KVM case.
+ In the TDX case, additionally parse and load TDVF into guest memory.
+ * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+ If pages were given contents above, those pages need to be added.
+ Otherwise the contents will be lost and the guest sees zero pages.
+ * KVM_TDX_FINALIZE_VM: Finalize VM and measurement
+ This must be after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy(normal VMX) VMs and new TD VMs to
+coexist. Otherwise the benefits of VM flexibility would be eliminated.
+The main issue is that the logic of the kvm_x86_ops callbacks for
+TDX is different from VMX, while the variable kvm_x86_ops is a single
+global variable. It is neither per-VM nor per-vcpu.
+
+Several points need to be considered.
+ . No or minimal overhead when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
+ . Avoid overhead of indirect call via function pointers.
+ . Contain the changes under arch/x86/kvm/vmx directory and share logic
+ with VMX for maintenance.
+ Even though the ways to operate on a VM (VMX instruction vs TDX
+ SEAM call) are different, the basic idea remains the same. So, much
+ of the logic can be shared.
+ . Future maintenance
+ A huge change of kvm_x86_ops in the (near) future isn't expected, so
+ a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+ Introduce a dedicated file, arch/x86/kvm/vmx/main.c (the name,
+ main.c, is just chosen to show main entry points for callbacks.) and
+ wrapper functions around all the callbacks with
+ "if (is-tdx) tdx-callback() else vmx-callback()".
+
+ Pros:
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+ - When TDX is disabled(CONFIG_INTEL_TDX_HOST=n), the overhead is
+ optimized out.
+ - Micro optimization by avoiding function pointer.
+ Cons:
+ - Much boilerplate in arch/x86/kvm/vmx/main.c.
+
+Alternative:
+- Introduce another callback layer under arch/x86/kvm/vmx.
+ Pros:
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+ - clear separation on callbacks.
+ Cons:
+ - overhead in VMX even when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
+
+- Allow per-VM kvm_x86_ops callbacks instead of global kvm_x86_ops
+ Pros:
+ - clear separation on callbacks.
+ Cons:
+ - Big change in common x86 code.
+ - overhead in common code even when TDX is
+ disabled(CONFIG_INTEL_TDX_HOST=n).
+
+- Introduce new directory arch/x86/kvm/tdx
+ Pros:
+ - It clarifies that TDX is different from VMX.
+ Cons:
+ - Given the level of code sharing, it complicates code sharing.
+
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle Secure/Shared-EPT. The
+high-level execution flow is mostly the same as the normal EPT case.
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution. (or emulate MMIO)
+The difference is that S-EPT is operated on (read/write) via TDX SEAM
+calls, which are expensive, instead of direct reads/writes of EPT entries.
+One bit of the GPA (bit 51 or bit 47) is repurposed so that it means shared
+with host(if set to 1) or private to TD(if cleared to 0).
+
+- The current implementation
+ . Reuse the existing MMU code with minimal update, because the
+ execution flow is mostly the same. But an additional operation, the TDX
+ call for S-EPT, is needed. So add hooks for it to kvm_x86_ops.
+ . For performance, minimize TDX SEAM calls to operate on S-EPT. When
+ getting the corresponding S-EPT pages/entry from the faulting GPA, don't
+ use a TDX SEAM call to read the S-EPT entry. Instead create a shadow
+ copy in host memory.
+ Repurpose the existing kvm_mmu_page as shadow copy of S-EPT and
+ associate S-EPT to it.
+ . Treat the shared bit as an attribute. Mask/unmask the bit where
+ necessary to keep the existing traversing code working.
+ Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+ for special case.
+ = 0 : for non-TDX case
+ = bit 51 or bit 47 set for TDX case.
+
+ Pros:
+ - Large code reuse with minimal new hooks.
+ - Execution path is same.
+ Cons:
+ - Complicates the existing code.
+ - Repurposing kvm_mmu_page as shadow of Secure-EPT can be confusing.
+
+Alternative:
+- Replace direct read/write on EPT entry with TDX-SEAM call by
+ introducing callbacks on EPT entry.
+ Pros:
+ - Straightforward.
+ Cons:
+ - Too many touch points.
+ - Too slow due to TDX-SEAM call.
+ - Overhead even when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
+
+- Sprinkle "if (is-tdx)" for TDX special case
+ Pros:
+ - Straightforward.
+ Cons:
+ - The result is non-generic and ugly.
+ - Put TDX specific logic into common KVM MMU code.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+ Although not all the operations are memory encryption, repurpose it to
+ carry TDX specific ioctls.
+ Pros:
+ - No major change in common x86 KVM code.
+ Cons:
+ - The operations aren't actually memory encryption, but operations
+ on TD VMs.
+
+Alternative:
+- Introduce new ioctl for guest protection like
+ KVM_GUEST_PROTECTION_OP and introduce subcommand for TDX.
+ Pros:
+ - Clean name.
+ Cons:
+ - One more new ioctl for guest protection.
+ - Confusion between KVM_MEMORY_ENCRYPT_OP and KVM_GUEST_PROTECTION_OP.
+
+- Rename KVM_MEMORY_ENCRYPT_OP to KVM_GUEST_PROTECTION_OP and keep
+ KVM_MEMORY_ENCRYPT_OP with the same value in the user API, i.e.
+ "#define KVM_MEMORY_ENCRYPT_OP KVM_GUEST_PROTECTION_OP" for uapi
+ compatibility.
+ Pros:
+ - No new ioctl, and a more suitable name.
+ Cons:
+ - May confuse existing user programs.
+
+
+References
+==========
+
+.. [1] TDX specification
+ https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+ kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+ TDX guest branch: https://github.com/intel/tdx/tree/guest
+.. [9] tdvf
+ https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+ Enable Hardware Isolated VMs
+ https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+ Architectural Extensions for Hardware Virtual Machine Isolation
+ to Advance Confidential Computing in Public Clouds - Ravi Sahita
+ & Jun Nakajima, Intel Corporation
+ https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+ https://lkml.org/lkml/2020/10/20/66
--
2.25.1
From: Isaku Yamahata <[email protected]>
Wire up TDX PV port IO hypercall to the KVM backend function.
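The argument decoding done by tdx_emulate_io() can be sketched in isolation
as below. This is only an illustrative model, not the kernel code: the
struct pio_request type, the decode_pio() helper and the enum pio_status
values are hypothetical names, and the four inputs stand for the TDVMCALL
arguments a0..a3 (size, direction, port, data) as read in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the TDVMCALL status codes used by the patch. */
enum pio_status { PIO_OK, PIO_INVALID_OPERAND };

/* Decoded form of the four TDVMCALL argument registers (a0..a3). */
struct pio_request {
	int size;          /* a0: access width in bytes */
	bool write;        /* a1: nonzero for OUT, zero for IN */
	unsigned int port; /* a2: I/O port number */
	uint64_t val;      /* a3: data to write, only meaningful for OUT */
};

/*
 * Decode and validate a port I/O request. Only 1-, 2- and 4-byte accesses
 * are accepted, mirroring the size check in tdx_emulate_io().
 */
static enum pio_status decode_pio(uint64_t a0, uint64_t a1, uint64_t a2,
				  uint64_t a3, struct pio_request *req)
{
	assert(req != NULL);

	if (a0 != 1 && a0 != 2 && a0 != 4)
		return PIO_INVALID_OPERAND;

	req->size = (int)a0;
	req->write = a1 != 0;
	req->port = (unsigned int)a2;
	/* The data register is ignored for IN requests. */
	req->val = req->write ? a3 : 0;
	return PIO_OK;
}
```

In the patch itself, an IN request that cannot be completed in the kernel
additionally installs tdx_complete_pio_in() as the complete_userspace_io
callback; that part is not modeled here.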
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 57 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 15dc0ae61e0f..a62586a83b80 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1003,6 +1003,61 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
return kvm_emulate_halt_noskip(vcpu);
}
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ int ret;
+
+ WARN_ON(vcpu->arch.pio.count != 1);
+
+ ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+ vcpu->arch.pio.port, &val, 1);
+ WARN_ON(!ret);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, val);
+
+ return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ unsigned int port;
+ int size, ret;
+ bool write;
+
+ ++vcpu->stat.io_exits;
+
+ size = tdvmcall_a0_read(vcpu);
+ write = tdvmcall_a1_read(vcpu);
+ port = tdvmcall_a2_read(vcpu);
+
+ if (size != 1 && size != 2 && size != 4) {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ if (write) {
+ val = tdvmcall_a3_read(vcpu);
+ ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+ /* No need for a complete_userspace_io callback. */
+ vcpu->arch.pio.count = 0;
+ } else {
+ ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+ if (!ret)
+ vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+ else
+ tdvmcall_set_return_val(vcpu, val);
+ }
+ if (ret)
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return ret;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1013,6 +1068,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_cpuid(vcpu);
case EXIT_REASON_HLT:
return tdx_emulate_hlt(vcpu);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return tdx_emulate_io(vcpu);
default:
break;
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
On EPT violation, call a common function, __vmx_handle_ept_violation(), to
trigger x86 MMU code. On EPT misconfiguration, exit to ring 3 with
KVM_EXIT_UNKNOWN, because EPT misconfiguration can't happen as MMIO is
triggered by TDG.VP.VMCALL. There is no point in setting a misconfiguration
value for the fast path.
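The choice of exit qualification made by tdx_handle_ept_violation() can be
sketched as a small standalone model. The names classify_fault(),
gpa_is_private() and enum fault_action are hypothetical, the shared bit is
assumed to be passed as a non-zero GPA mask, and ACC_WRITE/ACC_INSTR stand
for the write and instruction-fetch bits of the EPT violation exit
qualification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Access-type bits of the EPT violation exit qualification. */
#define ACC_WRITE (1ULL << 1)
#define ACC_INSTR (1ULL << 2)

/* Hypothetical result of classifying a fault. */
enum fault_action { FAULT_TO_MMU, FAULT_TO_USERSPACE };

/* A GPA is treated as private when the shared bit is clear. */
static bool gpa_is_private(uint64_t gpa, uint64_t shared_mask)
{
	assert(shared_mask != 0); /* assumption: only TD guests are modeled */
	return (gpa & shared_mask) == 0;
}

/*
 * Private faults are always handled as write faults and the reported
 * qualification is ignored. Shared faults keep the reported qualification,
 * except that an instruction fetch from shared memory is not resolved by
 * the MMU and is reported to userspace instead.
 */
static enum fault_action classify_fault(uint64_t gpa, uint64_t shared_mask,
					uint64_t reported_qual,
					uint64_t *qual)
{
	if (gpa_is_private(gpa, shared_mask)) {
		*qual = ACC_WRITE;
		return FAULT_TO_MMU;
	}

	*qual = reported_qual;
	if (reported_qual & ACC_INSTR)
		return FAULT_TO_USERSPACE;
	return FAULT_TO_MMU;
}
```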
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 46 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e5268bfa8d27..14f65d7b3824 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1191,6 +1191,48 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}
+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+ unsigned long exit_qual;
+
+ if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+ /*
+ * Always treat SEPT violations as write faults. Ignore the
+ * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
+ * TD private pages are always RWX in the SEPT tables,
+ * i.e. they're always mapped writable. Just as importantly,
+ * treating SEPT violations as write faults is necessary to
+ * avoid COW allocations, which will cause TDAUGPAGE failures
+ * due to aliasing a single HPA to multiple GPAs.
+ */
+#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
+ exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
+ } else {
+ exit_qual = tdexit_exit_qual(vcpu);
+ if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+ pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
+ tdexit_gpa(vcpu), kvm_rip_read(vcpu));
+ vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
+ vcpu->run->ex.exception = PF_VECTOR;
+ vcpu->run->ex.error_code = exit_qual;
+ return 0;
+ }
+ }
+
+ trace_kvm_page_fault(tdexit_gpa(vcpu), exit_qual);
+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+ WARN_ON(1);
+
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+
+ return 0;
+}
+
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
{
union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
@@ -1209,6 +1251,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
switch (exit_reason.basic) {
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_handle_ept_violation(vcpu);
+ case EXIT_REASON_EPT_MISCONFIG:
+ return tdx_handle_ept_misconfig(vcpu);
case EXIT_REASON_OTHER_SMI:
/*
* If reach here, it's not a Machine Check System Management
--
2.25.1
From: Isaku Yamahata <[email protected]>
The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM. When the guest TD
issues TDG.VP.VMCALL, the guest TD exits to the VMM with a new exit reason
of TDVMCALL. The arguments from the guest TD and the values returned from
the VMM are passed in the guest registers. The guest RCX register indicates
which registers are used. Define helper functions to access those registers
as ABI.
Define the TDVMCALL exit reason, which is carved out from the VMX exit
reason namespace as the TDVMCALL exit from TDX guest to TDX-SEAM is really
just a VM-Exit. Add a placeholder to handle the TDVMCALL exit.
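The register usage introduced by the accessors in this patch (r10 for exit
type and return code, r11 for leaf and return value, r12..r15 for a0..a3)
can be summarized with a small standalone model. The struct td_gprs type
and the vmcall_*() helpers are hypothetical names used only for
illustration, and the numeric value of VMCALL_INVALID_OPERAND is an
assumption, since the patch references the status only by name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed status value; the patch only uses the symbolic name. */
#define VMCALL_INVALID_OPERAND 0x8000000000000000ULL

/* Hypothetical model of the guest GPRs that carry TDVMCALL state. */
struct td_gprs {
	uint64_t r10; /* in: exit type, out: return code */
	uint64_t r11; /* in: leaf,      out: return value */
	uint64_t r12; /* a0 */
	uint64_t r13; /* a1 */
	uint64_t r14; /* a2 */
	uint64_t r15; /* a3 */
};

static uint64_t vmcall_leaf(const struct td_gprs *g)
{
	return g->r11;
}

static void vmcall_set_return_code(struct td_gprs *g, uint64_t code)
{
	g->r10 = code;
}

/*
 * Placeholder dispatcher: no leaf is handled yet, so every request is
 * answered with INVALID_OPERAND. Returns 1, meaning "resume the guest".
 */
static int handle_vmcall(struct td_gprs *g)
{
	assert(g != NULL);

	switch (vmcall_leaf(g)) {
	default:
		break;
	}

	vmcall_set_return_code(g, VMCALL_INVALID_OPERAND);
	return 1;
}
```

Later patches in the series add cases to the switch (CPUID, HLT, port I/O);
each handler then sets its own return code instead of falling through.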
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/vmx.h | 4 ++-
arch/x86/kvm/vmx/tdx.c | 56 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 13 ++++++++
3 files changed, 71 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index b3a30ef3efdd..f0f4a4cf84a7 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -93,6 +93,7 @@
#define EXIT_REASON_TPAUSE 68
#define EXIT_REASON_BUS_LOCK 74
#define EXIT_REASON_NOTIFY 75
+#define EXIT_REASON_TDCALL 77
#define VMX_EXIT_REASONS \
{ EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
@@ -156,7 +157,8 @@
{ EXIT_REASON_UMWAIT, "UMWAIT" }, \
{ EXIT_REASON_TPAUSE, "TPAUSE" }, \
{ EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \
- { EXIT_REASON_NOTIFY, "NOTIFY" }
+ { EXIT_REASON_NOTIFY, "NOTIFY" }, \
+ { EXIT_REASON_TDCALL, "TDCALL" }
#define VMX_EXIT_REASON_FLAGS \
{ VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6e8a7e4b4da2..c9663df83292 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -98,6 +98,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
return kvm_r9_read(vcpu);
}
+#define BUILD_TDVMCALL_ACCESSORS(param, gpr) \
+static __always_inline \
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu) \
+{ \
+ return kvm_##gpr##_read(vcpu); \
+} \
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
+ unsigned long val) \
+{ \
+ kvm_##gpr##_write(vcpu, val); \
+}
+BUILD_TDVMCALL_ACCESSORS(a0, r12);
+BUILD_TDVMCALL_ACCESSORS(a1, r13);
+BUILD_TDVMCALL_ACCESSORS(a2, r14);
+BUILD_TDVMCALL_ACCESSORS(a3, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+ return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
+{
+ return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+ long val)
+{
+ kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+ unsigned long val)
+{
+ kvm_r11_write(vcpu, val);
+}
+
static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
{
return tdx->tdvpr.added;
@@ -799,7 +834,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
struct vcpu_tdx *tdx)
{
guest_enter_irqoff();
- tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
+ tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs,
+ tdx->tdvmcall.regs_mask);
guest_exit_irqoff();
}
@@ -832,6 +868,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_complete_interrupts(vcpu);
+ if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+ tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+ else
+ tdx->tdvmcall.rcx = 0;
+
return EXIT_FASTPATH_NONE;
}
@@ -878,6 +919,17 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
return 0;
}
+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+ switch (tdvmcall_leaf(vcpu)) {
+ default:
+ break;
+ }
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1274,6 +1326,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_exception(vcpu);
case EXIT_REASON_EXTERNAL_INTERRUPT:
return tdx_handle_external_interrupt(vcpu);
+ case EXIT_REASON_TDCALL:
+ return handle_tdvmcall(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_handle_ept_violation(vcpu);
case EXIT_REASON_EPT_MISCONFIG:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 1268a49fdf18..b0bb239b51bf 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -95,6 +95,19 @@ struct vcpu_tdx {
struct list_head cpu_list;
+ union {
+ struct {
+ union {
+ struct {
+ u16 gpr_mask;
+ u16 xmm_mask;
+ };
+ u32 regs_mask;
+ };
+ u32 reserved;
+ };
+ u64 rcx;
+ } tdvmcall;
union tdx_exit_reason exit_reason;
bool initialized;
--
2.25.1
From: Sean Christopherson <[email protected]>
Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
some operations (e.g., memory read/write, register state access, etc).
Introduce vm_type to track the type of the VM in x86 KVM. Other arch KVMs
already use vm_type, KVM_CREATE_VM accepts vm_type, and the x86 KVM callback
vm_init accepts vm_type. So follow them. Further, a different policy can
be made based on vm_type. Define KVM_X86_DEFAULT_VM for the default VM type
and define KVM_X86_TDX_VM for Intel TDX VMs. The wrapper function
will be defined as
"bool is_td(kvm) { return vm_type == KVM_X86_TDX_VM; }"
Add a capability KVM_CAP_VM_TYPES to effectively allow the device model,
e.g. qemu, to query what VM types are supported by KVM. This (introducing a
new capability and adding vm_type) is chosen to align with other arch KVMs
that have VM types already. Other arch KVMs use different names to query
supported VM types and there is no common name for it, so a new name was
chosen.
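How a device model might interpret the KVM_CAP_VM_TYPES bitmap can be
sketched as follows. The two VM type values are the ones defined by this
patch; the helper names vm_type_bit() and vm_type_supported() are
hypothetical, and the bitmap is passed in as a plain integer rather than
obtained from KVM_CHECK_EXTENSION so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VM type values as defined by this patch. */
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_TDX_VM     1

/* Bit @n of the bitmap corresponds to the VM type with value @n. */
static uint64_t vm_type_bit(unsigned int type)
{
	assert(type < 64); /* the model only covers a 64-bit bitmap */
	return 1ULL << type;
}

/* Returns true if @type is advertised in the capability bitmap. */
static bool vm_type_supported(uint64_t bitmap, unsigned int type)
{
	if (type >= 64)
		return false;
	return (bitmap & vm_type_bit(type)) != 0;
}
```

With this patch alone, both VMX and SVM report only KVM_X86_DEFAULT_VM, so
the returned bitmap would have only bit 0 set.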
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++++
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/uapi/asm/kvm.h | 3 +++
arch/x86/kvm/svm/svm.c | 6 ++++++
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/tdx.h | 6 +-----
arch/x86/kvm/vmx/vmx.c | 5 +++++
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 9 ++++++++-
include/uapi/linux/kvm.h | 1 +
tools/arch/x86/include/uapi/asm/kvm.h | 3 +++
tools/include/uapi/linux/kvm.h | 1 +
13 files changed, 54 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9cbbfdb663b6..b9ab598883b2 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,31 @@ described as 'basic' will be available.
The new VM has no virtual cpus and no memory.
You probably want to use 0 as machine type.
+X86:
+^^^^
+
+Supported VM types can be queried with KVM_CAP_VM_TYPES, which returns a
+bitmap of supported VM types. If bit @n is set, the VM type with
+value @n is supported.
+
+S390:
+^^^^^
+
In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 75bc44aa8d51..a97cdb203a16 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -19,6 +19,7 @@ KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aa11525500d3..089e0a4de926 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1141,6 +1141,7 @@ enum kvm_apicv_inhibit {
};
struct kvm_arch {
+ unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -1434,6 +1435,7 @@ struct kvm_x86_ops {
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
+ bool (*is_vm_type_supported)(unsigned long vm_type);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 50a4e787d5e6..9792ec1cc317 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -531,4 +531,7 @@ struct kvm_pmu_event_filter {
#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
#define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_TDX_VM 1
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 247c0ad458a0..815a07c594f1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4685,6 +4685,11 @@ static void svm_vm_destroy(struct kvm *kvm)
sev_vm_destroy(kvm);
}
+static bool svm_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_DEFAULT_VM;
+}
+
static int svm_vm_init(struct kvm *kvm)
{
if (!pause_filter_count || !pause_filter_thresh)
@@ -4712,6 +4717,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_free = svm_vcpu_free,
.vcpu_reset = svm_vcpu_reset,
+ .is_vm_type_supported = svm_is_vm_type_supported,
.vm_size = sizeof(struct kvm_svm),
.vm_init = svm_vm_init,
.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ac788af17d92..7be4941e4c4d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -43,6 +43,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,
+ .is_vm_type_supported = vmx_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
.vm_destroy = vmx_vm_destroy,
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 54d7a26ed9ee..2f43db5bbefb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,11 +17,7 @@ struct vcpu_tdx {
static inline bool is_td(struct kvm *kvm)
{
- /*
- * TDX VM type isn't defined yet.
- * return kvm->arch.vm_type == KVM_X86_TDX_VM;
- */
- return false;
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b30d73d28e75..5ba62f8b42ce 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7281,6 +7281,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}
+bool vmx_is_vm_type_supported(unsigned long type)
+{
+ return type == KVM_X86_DEFAULT_VM;
+}
+
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 2abead2f60f7..a5e85eb4e183 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -25,6 +25,7 @@ void vmx_hardware_unsetup(void);
int vmx_check_processor_compatibility(void);
int vmx_hardware_enable(void);
void vmx_hardware_disable(void);
+bool vmx_is_vm_type_supported(unsigned long type);
int vmx_vm_init(struct kvm *kvm);
void vmx_vm_destroy(struct kvm *kvm);
int vmx_vcpu_precreate(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb7a33fbc136..96dc8f52a137 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4408,6 +4408,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_X86_NOTIFY_VMEXIT:
r = kvm_caps.has_notify_vmexit;
break;
+ case KVM_CAP_VM_TYPES:
+ r = BIT(KVM_X86_DEFAULT_VM);
+ if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
+ r |= BIT(KVM_X86_TDX_VM);
+ break;
default:
break;
}
@@ -11858,9 +11863,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
int ret;
unsigned long flags;
- if (type)
+ if (!static_call(kvm_x86_is_vm_type_supported)(type))
return -EINVAL;
+ kvm->arch.vm_type = type;
+
ret = kvm_page_track_init(kvm);
if (ret)
goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7569b4ec199c..6d6785d2685f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1166,6 +1166,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_PROTECTED_DUMP 217
#define KVM_CAP_X86_TRIPLE_FAULT_EVENT 218
#define KVM_CAP_X86_NOTIFY_VMEXIT 219
+#define KVM_CAP_VM_TYPES 220
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index bf6e96011dfe..71a5851475e7 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
#define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_TDX_VM 1
+
#endif /* _ASM_X86_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 6a184d260c7f..1e89b967e050 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1152,6 +1152,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_DISABLE_QUIRKS2 213
/* #define KVM_CAP_VM_TSC_CONTROL 214 */
#define KVM_CAP_SYSTEM_EVENT_DATA 215
+#define KVM_CAP_VM_TYPES 220
#ifdef KVM_CAP_IRQ_ROUTING
--
2.25.1
From: Sean Christopherson <[email protected]>
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.
Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c85ed9f6a8c9..707f1ff90f8a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1578,8 +1578,17 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+ u32 reg;
+ /*
+ * Assume a timer IRQ was "injected" if the APIC is protected. KVM's
+ * copy of the vIRR is bogus, it's the responsibility of the caller to
+ * precisely check whether or not a timer IRQ is pending.
+ */
+ if (apic->guest_apic_protected)
+ return true;
+
+ reg = kvm_lapic_get_reg(apic, APIC_LVTT);
if (kvm_apic_hw_enabled(apic)) {
int vec = reg & APIC_VECTOR_MASK;
void *bitmap = apic->regs + APIC_ISR;
--
2.25.1
From: Isaku Yamahata <[email protected]>
The TDX module API doesn't provide an API for the VMM to inject INIT IPI and
SIPI. Instead it defines a different protocol to boot application processors.
Ignore INIT and SIPI events for the TDX guest.
There are two options: 1) (silently) ignore the INIT/SIPI request or 2) return
an error to the guest TD somehow. Given that the TDX guest is paravirtualized
to boot APs, option 1 is chosen for simplicity.
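Option 1 amounts to a wrapper that returns early for TD vCPUs and otherwise
calls the common handler. A minimal standalone model is shown below; struct
model_vcpu, common_deliver_init() and wrapped_deliver_init() are
hypothetical names, and the common path only mimics the state change done by
kvm_vcpu_deliver_init() in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mp_state { MP_RUNNABLE, MP_INIT_RECEIVED };

/* Hypothetical, heavily reduced model of a vCPU. */
struct model_vcpu {
	bool is_td;             /* true for a TDX guest vCPU */
	bool is_bsp;            /* true for the bootstrap processor */
	enum mp_state mp_state;
	int resets;             /* counts simulated vCPU resets */
};

/* Common INIT handling: reset, then pick the state based on BSP/AP. */
static void common_deliver_init(struct model_vcpu *v)
{
	v->resets++;
	v->mp_state = v->is_bsp ? MP_RUNNABLE : MP_INIT_RECEIVED;
}

/* Wrapper: silently ignore INIT for TD vCPUs, forward otherwise. */
static void wrapped_deliver_init(struct model_vcpu *v)
{
	assert(v != NULL);

	if (v->is_td)
		return;

	common_deliver_init(v);
}
```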
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/lapic.c | 16 +++++++++++-----
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/vmx/main.c | 22 +++++++++++++++++++++-
5 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index ec98b3f734a2..ff658969cfff 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -136,6 +136,7 @@ KVM_X86_OP_OPTIONAL(migrate_timers)
KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP(vcpu_deliver_init)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP(check_processor_compatibility)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42d209fe0a4f..2b79d1c9cabb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1649,6 +1649,7 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+ void (*vcpu_deliver_init)(struct kvm_vcpu *vcpu);
/*
* Returns vCPU specific APICv inhibit reasons
@@ -1858,6 +1859,7 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu);
int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
int reason, bool has_error_code, u32 error_code);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 67dbc26aa1bd..596955070721 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2996,6 +2996,16 @@ int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
return 0;
}
+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+ kvm_vcpu_reset(vcpu, true);
+ if (kvm_vcpu_is_bsp(vcpu))
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ else
+ vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_deliver_init);
+
int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
@@ -3043,11 +3053,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
if (test_bit(KVM_APIC_INIT, &pe)) {
clear_bit(KVM_APIC_INIT, &apic->pending_events);
- kvm_vcpu_reset(vcpu, true);
- if (kvm_vcpu_is_bsp(apic->vcpu))
- vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
- else
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+ static_call(kvm_x86_vcpu_deliver_init)(vcpu);
}
if (test_bit(KVM_APIC_SIPI, &pe)) {
clear_bit(KVM_APIC_SIPI, &apic->pending_events);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0abc43d6a115..0f4ce62b30c0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4829,6 +4829,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.complete_emulated_msr = svm_complete_emulated_msr,
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_init = kvm_vcpu_deliver_init,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
};
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 294919913dfd..552f2576d3ae 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -295,6 +295,25 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
}
+static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ kvm_vcpu_deliver_sipi_vector(vcpu, vector);
+}
+
+static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ /* TDX doesn't support INIT. Ignore INIT event */
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ return;
+ }
+
+ kvm_vcpu_deliver_init(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -616,7 +635,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.msr_filter_changed = vmx_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_sipi_vector = vt_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_init = vt_vcpu_deliver_init,
.dev_mem_enc_ioctl = tdx_dev_ioctl,
.mem_enc_ioctl = vt_mem_enc_ioctl,
--
2.25.1
From: Sean Christopherson <[email protected]>
For a kvm mmu that has a shared bit mask, zap only leaf SPTEs when
deleting/moving a memslot. The existing kvm_mmu_zap_memslot() depends on
role.invalid with the read lock of mmu_lock so that other vcpus can operate on
the kvm mmu concurrently. Mark the root page table invalid, unlink it from the
page table pointer of the CPU, and process the page table. Unlinking the root
page table doesn't work for a private page table because it requires all SPTE
entries to be non-present. Instead, take the write lock of mmu_lock and zap
only leaf SPTEs for a kvm mmu with a shared bit mask.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 80d7c7709af3..c517c7bca105 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5854,11 +5854,44 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
}
+static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ bool flush = false;
+
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+ * case scenario we'll have unused shadow pages lying around until they
+ * are recycled due to age or when the VM is destroyed.
+ */
+ if (is_tdp_mmu_enabled(kvm)) {
+ struct kvm_gfn_range range = {
+ .slot = slot,
+ .start = slot->base_gfn,
+ .end = slot->base_gfn + slot->npages,
+ .may_block = false,
+ };
+
+ flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
+ } else {
+ flush = slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
+ KVM_MAX_HUGEPAGE_LEVEL, true);
+ }
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+}
+
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
{
- kvm_mmu_zap_all_fast(kvm);
+ if (kvm_gfn_shared_mask(kvm))
+ kvm_mmu_zap_memslot(kvm, slot);
+ else
+ kvm_mmu_zap_all_fast(kvm);
}
int kvm_mmu_init_vm(struct kvm *kvm)
--
2.25.1
From: Isaku Yamahata <[email protected]>
This empty commit marks the start of the TDX EPT violation patch series.
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index d5cace00c433..c3e675bea802 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -19,12 +19,12 @@ Patch Layer status
* TDX architectural definitions: Applied
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
-* TDX EPT violation: Not yet
+* TDX EPT violation: Applying
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet
* KVM MMU GPA shared bits: Applied
* KVM TDP refactoring for TDX: Applied
-* KVM TDP MMU hooks: Applying
+* KVM TDP MMU hooks: Applied
* KVM TDP MMU MapGPA: Not yet
--
2.25.1
From: Isaku Yamahata <[email protected]>
Wire up TDX PV map_gpa hypercall to the kvm/mmu backend.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 60 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
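To make the chunking in tdx_map_gpa() easier to follow, here is a small
standalone sketch of how each loop iteration picks its end address. It works
in bytes rather than gfns, the helper names are invented for illustration,
and the -EAGAIN path (which resumes from the gfn reported by the backend) is
not modelled.

```c
#include <stdint.h>

/* Same value as TDX_MAP_GPA_SIZE_MAX in the patch: 16 MiB per iteration. */
#define MAP_GPA_CHUNK (16ULL * 1024 * 1024)

/* Round x up to the next multiple of step (step must be non-zero). */
static uint64_t round_up_to(uint64_t x, uint64_t step)
{
	return ((x + step - 1) / step) * step;
}

/*
 * End (exclusive, in bytes) of the chunk that starts at gpa: the next
 * 16 MiB boundary strictly above gpa, clamped to the end of the request.
 */
static uint64_t map_gpa_chunk_end(uint64_t gpa, uint64_t end)
{
	uint64_t boundary = round_up_to(gpa + 1, MAP_GPA_CHUNK);

	return boundary < end ? boundary : end;
}

/* Number of loop iterations when no chunk has to be retried. */
static unsigned int map_gpa_chunk_count(uint64_t gpa, uint64_t end)
{
	unsigned int n = 0;

	while (gpa < end) {
		gpa = map_gpa_chunk_end(gpa, end);
		n++;
	}
	return n;
}
```

Rounding up gpa + 1 rather than gpa is what guarantees forward progress when
gpa already sits on a 16 MiB boundary; a 64 MiB request starting at 0 is
handled in four iterations under this model.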
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 00baecbb62ff..d4ac573d9db3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1221,6 +1221,64 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
return 0;
}
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ gpa_t gpa = tdvmcall_a0_read(vcpu);
+ gpa_t size = tdvmcall_a1_read(vcpu);
+ gpa_t end = gpa + size;
+ bool allow_private = kvm_is_private_gpa(kvm, gpa);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
+ end < gpa ||
+ end > kvm_gfn_shared_mask(kvm) << (PAGE_SHIFT + 1) ||
+ kvm_is_private_gpa(kvm, gpa) != kvm_is_private_gpa(kvm, end))
+ return 1;
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+#define TDX_MAP_GPA_SIZE_MAX (16 * 1024 * 1024)
+ while (gpa < end) {
+ gfn_t s = gpa_to_gfn(gpa);
+ gfn_t e = gpa_to_gfn(
+ min(roundup(gpa + 1, TDX_MAP_GPA_SIZE_MAX), end));
+ int ret = kvm_mmu_map_gpa(vcpu, &s, e, allow_private);
+
+ if (ret == -EAGAIN)
+ e = s;
+ else if (ret) {
+ tdvmcall_set_return_code(vcpu,
+ TDG_VP_VMCALL_INVALID_OPERAND);
+ break;
+ }
+
+ gpa = gfn_to_gpa(e);
+
+ /*
+ * TODO:
+ * Interrupt this hypercall invocation to return remaining
+ * region to the guest and let the guest to resume the
+ * hypercall.
+ *
+ * The TDX Guest-Hypervisor Communication Interface(GHCI)
+ * specification and guest implementation need to be updated.
+ *
+ * if (gpa < end && need_resched()) {
+ * size = end - gpa;
+ * tdvmcall_a0_write(vcpu, gpa);
+ * tdvmcall_a1_write(vcpu, size);
+ * tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INTERRUPTED_RESUME);
+ * break;
+ * }
+ */
+ if (gpa < end && need_resched())
+ cond_resched();
+ }
+
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1241,6 +1299,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_wrmsr(vcpu);
case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
return tdx_report_fatal_error(vcpu);
+ case TDG_VP_VMCALL_MAP_GPA:
+ return tdx_map_gpa(vcpu);
default:
break;
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
Factor out the non-leaf SPTE population logic from kvm_tdp_mmu_map(). The
MapGPA hypercall needs to populate non-leaf SPTEs to record which GPA, private
or shared, is allowed in the leaf EPT entry.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b2568b062faa..d874c79ab96c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1167,6 +1167,24 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
return 0;
}
+static int tdp_mmu_populate_nonleaf(
+ struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
+{
+ struct kvm_mmu_page *sp;
+ int ret;
+
+ WARN_ON(is_shadow_present_pte(iter->old_spte));
+ WARN_ON(is_removed_spte(iter->old_spte));
+
+ sp = tdp_mmu_alloc_sp(vcpu);
+ tdp_mmu_init_child_sp(sp, iter);
+
+ ret = tdp_mmu_link_sp(vcpu->kvm, iter, sp, account_nx, true);
+ if (ret)
+ tdp_mmu_free_sp(sp);
+ return ret;
+}
+
/*
* Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
* page tables and SPTEs to translate the faulting guest physical address.
@@ -1175,7 +1193,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
struct tdp_iter iter;
- struct kvm_mmu_page *sp;
int ret;
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1221,13 +1238,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_removed_spte(iter.old_spte))
break;
- sp = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_child_sp(sp, &iter);
-
- if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
- tdp_mmu_free_sp(sp);
+ if (tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
break;
- }
}
}
--
2.25.1
From: Sean Christopherson <[email protected]>
By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/kvm/x86.c | 54 ++++++++++++++++++++-------------
2 files changed, 37 insertions(+), 21 deletions(-)
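The intent of the split can be sketched outside the kernel as one
ABI-independent core with two thin front ends. Everything below is a model:
the MODEL_ constants, the struct layouts and the single "add" hypercall are
invented for illustration, and passing op_64_bit as 1 for the TD front end
is an assumption of this sketch rather than something this patch states.

```c
#include <stdint.h>

#define MODEL_ENOSYS 1000   /* stand-in for KVM_ENOSYS */
#define MODEL_EPERM  1      /* stand-in for KVM_EPERM */
#define MODEL_HC_ADD 42     /* hypothetical hypercall: returns a0 + a1 */

/* ABI-independent core: arguments arrive already extracted. */
static uint64_t model_core_hypercall(uint64_t nr, uint64_t a0, uint64_t a1,
				     int op_64_bit)
{
	if (!op_64_bit) {
		/* 32-bit callers only see the low halves of the registers. */
		nr &= 0xFFFFFFFF;
		a0 &= 0xFFFFFFFF;
		a1 &= 0xFFFFFFFF;
	}

	switch (nr) {
	case MODEL_HC_ADD:
		return a0 + a1;
	default:
		return (uint64_t)-MODEL_ENOSYS;
	}
}

/* Front end for the conventional ABI: nr in rax, args in rbx/rcx. */
struct model_regs { uint64_t rax, rbx, rcx; int cpl; int long_mode; };

static void model_emulate_hypercall(struct model_regs *r)
{
	uint64_t ret;

	if (r->cpl != 0)	/* the privilege check stays in the front end */
		ret = (uint64_t)-MODEL_EPERM;
	else
		ret = model_core_hypercall(r->rax, r->rbx, r->rcx,
					   r->long_mode);

	if (!r->long_mode)
		ret = (uint32_t)ret;
	r->rax = ret;
}

/* Hypothetical second front end: same core, different argument source. */
struct model_tdvmcall { uint64_t nr, args[2], ret; };

static void model_emulate_tdvmcall(struct model_tdvmcall *c)
{
	/* Assumption of this sketch: the caller is always 64-bit. */
	c->ret = model_core_hypercall(c->nr, c->args[0], c->args[1], 1);
}
```

The point of the shape is that register extraction, the CPL check and the
write-back of the result belong to each front end, while the dispatch on the
hypercall number is shared.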
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6a940700eb9a..42d209fe0a4f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1948,6 +1948,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
}
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit);
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39473b561e27..a68a917ebdff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9316,26 +9316,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}
-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit)
{
- unsigned long nr, a0, a1, a2, a3, ret;
- int op_64_bit;
-
- if (kvm_xen_hypercall_enabled(vcpu->kvm))
- return kvm_xen_hypercall(vcpu);
-
- if (kvm_hv_hypercall_enabled(vcpu))
- return kvm_hv_hypercall(vcpu);
-
- nr = kvm_rax_read(vcpu);
- a0 = kvm_rbx_read(vcpu);
- a1 = kvm_rcx_read(vcpu);
- a2 = kvm_rdx_read(vcpu);
- a3 = kvm_rsi_read(vcpu);
+ unsigned long ret;
trace_kvm_hypercall(nr, a0, a1, a2, a3);
- op_64_bit = is_64_bit_hypercall(vcpu);
if (!op_64_bit) {
nr &= 0xFFFFFFFF;
a0 &= 0xFFFFFFFF;
@@ -9344,11 +9333,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
a3 &= 0xFFFFFFFF;
}
- if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
- ret = -KVM_EPERM;
- goto out;
- }
-
ret = -KVM_ENOSYS;
switch (nr) {
@@ -9407,6 +9391,34 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
ret = -KVM_ENOSYS;
break;
}
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+ int op_64_bit;
+
+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
+ return kvm_xen_hypercall(vcpu);
+
+ if (kvm_hv_hypercall_enabled(vcpu))
+ return kvm_hv_hypercall(vcpu);
+
+ nr = kvm_rax_read(vcpu);
+ a0 = kvm_rbx_read(vcpu);
+ a1 = kvm_rcx_read(vcpu);
+ a2 = kvm_rdx_read(vcpu);
+ a3 = kvm_rsi_read(vcpu);
+ op_64_bit = is_64_bit_hypercall(vcpu);
+
+ if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+ ret = -KVM_EPERM;
+ goto out;
+ }
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
out:
if (!op_64_bit)
ret = (u32)ret;
--
2.25.1
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
> some operations (e.g., memory read/write, register state access, etc).
>
> Introduce vm_type to track the type of the VM to x86 KVM. Other arch KVMs
> already use vm_type, KVM_INIT_VM accepts vm_type, and x86 KVM callback
> vm_init accepts vm_type. So follow them. Further, a different policy can
> be made based on vm_type. Define KVM_X86_DEFAULT_VM for default VM as
> default and define KVM_X86_TDX_VM for Intel TDX VM. The wrapper function
> will be defined as "bool is_td(kvm) { return vm_type == VM_TYPE_TDX; }"
>
> Add a capability KVM_CAP_VM_TYPES to effectively allow device model,
> e.g. qemu, to query what VM types are supported by KVM. This (introduce a
> new capability and add vm_type) is chosen to align with other arch KVMs
> that have VM types already. Other arch KVMs uses different name to query
> supported vm types and there is no common name for it, so new name was
> chosen.
>
> Co-developed-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++++
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/include/uapi/asm/kvm.h | 3 +++
> arch/x86/kvm/svm/svm.c | 6 ++++++
> arch/x86/kvm/vmx/main.c | 1 +
> arch/x86/kvm/vmx/tdx.h | 6 +-----
> arch/x86/kvm/vmx/vmx.c | 5 +++++
> arch/x86/kvm/vmx/x86_ops.h | 1 +
> arch/x86/kvm/x86.c | 9 ++++++++-
> include/uapi/linux/kvm.h | 1 +
> tools/arch/x86/include/uapi/asm/kvm.h | 3 +++
> tools/include/uapi/linux/kvm.h | 1 +
> 13 files changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9cbbfdb663b6..b9ab598883b2 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -147,10 +147,31 @@ described as 'basic' will be available.
> The new VM has no virtual cpus and no memory.
> You probably want to use 0 as machine type.
>
> +X86:
> +^^^^
> +
> +Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
> +bitmap of supported vm types. The 1-setting of bit @n means vm type with
> +value @n is supported.
Perhaps I am missing something, but I don't understand how the below changes
(except the x86 part above) in Documentation are related to this patch.
> +
> +S390:
> +^^^^^
> +
> In order to create user controlled virtual machines on S390, check
> KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
> privileged user (CAP_SYS_ADMIN).
>
> +MIPS:
> +^^^^^
> +
> +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
> +the default trap & emulate implementation (which changes the virtual
> +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
> +flag KVM_VM_MIPS_VZ.
> +
> +ARM64:
> +^^^^^^
> +
> On arm64, the physical address size for a VM (IPA Size limit) is limited
> to 40bits by default. The limit can be configured if the host supports the
> extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 75bc44aa8d51..a97cdb203a16 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -19,6 +19,7 @@ KVM_X86_OP(hardware_disable)
> KVM_X86_OP(hardware_unsetup)
> KVM_X86_OP(has_emulated_msr)
> KVM_X86_OP(vcpu_after_set_cpuid)
> +KVM_X86_OP(is_vm_type_supported)
> KVM_X86_OP(vm_init)
> KVM_X86_OP_OPTIONAL(vm_destroy)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa11525500d3..089e0a4de926 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1141,6 +1141,7 @@ enum kvm_apicv_inhibit {
> };
>
> struct kvm_arch {
> + unsigned long vm_type;
> unsigned long n_used_mmu_pages;
> unsigned long n_requested_mmu_pages;
> unsigned long n_max_mmu_pages;
> @@ -1434,6 +1435,7 @@ struct kvm_x86_ops {
> bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
> void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
>
> + bool (*is_vm_type_supported)(unsigned long vm_type);
> unsigned int vm_size;
> int (*vm_init)(struct kvm *kvm);
> void (*vm_destroy)(struct kvm *kvm);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 50a4e787d5e6..9792ec1cc317 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -531,4 +531,7 @@ struct kvm_pmu_event_filter {
> #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
> #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
>
> +#define KVM_X86_DEFAULT_VM 0
> +#define KVM_X86_TDX_VM 1
> +
> #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 247c0ad458a0..815a07c594f1 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4685,6 +4685,11 @@ static void svm_vm_destroy(struct kvm *kvm)
> sev_vm_destroy(kvm);
> }
>
> +static bool svm_is_vm_type_supported(unsigned long type)
> +{
> + return type == KVM_X86_DEFAULT_VM;
> +}
> +
> static int svm_vm_init(struct kvm *kvm)
> {
> if (!pause_filter_count || !pause_filter_thresh)
> @@ -4712,6 +4717,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .vcpu_free = svm_vcpu_free,
> .vcpu_reset = svm_vcpu_reset,
>
> + .is_vm_type_supported = svm_is_vm_type_supported,
> .vm_size = sizeof(struct kvm_svm),
> .vm_init = svm_vm_init,
> .vm_destroy = svm_vm_destroy,
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index ac788af17d92..7be4941e4c4d 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -43,6 +43,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .hardware_disable = vmx_hardware_disable,
> .has_emulated_msr = vmx_has_emulated_msr,
>
> + .is_vm_type_supported = vmx_is_vm_type_supported,
> .vm_size = sizeof(struct kvm_vmx),
> .vm_init = vmx_vm_init,
> .vm_destroy = vmx_vm_destroy,
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 54d7a26ed9ee..2f43db5bbefb 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -17,11 +17,7 @@ struct vcpu_tdx {
>
> static inline bool is_td(struct kvm *kvm)
> {
> - /*
> - * TDX VM type isn't defined yet.
> - * return kvm->arch.vm_type == KVM_X86_TDX_VM;
> - */
> - return false;
> + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> }
If you put this patch before the patch:
[PATCH v7 009/102] KVM: TDX: Add placeholders for TDX VM/vcpu structure
then you don't need to introduce this chunk in that patch and remove it
here, which is unnecessary and ugly.
You could even introduce only KVM_X86_DEFAULT_VM, and not KVM_X86_TDX_VM, in
this patch, which would make it a pure infrastructure patch for reporting the
VM type. KVM_X86_TDX_VM can then come with the patch where is_td() is
introduced (patch 9 above).
To me, that is a cleaner way to structure the patches. For instance, this
infrastructure patch could in theory be used by other series that need to
support something similar, without carrying the is_td() and KVM_X86_TDX_VM
burden.
>
> static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b30d73d28e75..5ba62f8b42ce 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7281,6 +7281,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> return err;
> }
>
> +bool vmx_is_vm_type_supported(unsigned long type)
> +{
> + return type == KVM_X86_DEFAULT_VM;
> +}
> +
> #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
> #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 2abead2f60f7..a5e85eb4e183 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -25,6 +25,7 @@ void vmx_hardware_unsetup(void);
> int vmx_check_processor_compatibility(void);
> int vmx_hardware_enable(void);
> void vmx_hardware_disable(void);
> +bool vmx_is_vm_type_supported(unsigned long type);
> int vmx_vm_init(struct kvm *kvm);
> void vmx_vm_destroy(struct kvm *kvm);
> int vmx_vcpu_precreate(struct kvm *kvm);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fb7a33fbc136..96dc8f52a137 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4408,6 +4408,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_X86_NOTIFY_VMEXIT:
> r = kvm_caps.has_notify_vmexit;
> break;
> + case KVM_CAP_VM_TYPES:
> + r = BIT(KVM_X86_DEFAULT_VM);
> + if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
> + r |= BIT(KVM_X86_TDX_VM);
> + break;
> default:
> break;
> }
> @@ -11858,9 +11863,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> int ret;
> unsigned long flags;
>
> - if (type)
> + if (!static_call(kvm_x86_is_vm_type_supported)(type))
> return -EINVAL;
>
> + kvm->arch.vm_type = type;
> +
> ret = kvm_page_track_init(kvm);
> if (ret)
> goto out;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 7569b4ec199c..6d6785d2685f 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1166,6 +1166,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_PROTECTED_DUMP 217
> #define KVM_CAP_X86_TRIPLE_FAULT_EVENT 218
> #define KVM_CAP_X86_NOTIFY_VMEXIT 219
> +#define KVM_CAP_VM_TYPES 220
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index bf6e96011dfe..71a5851475e7 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
> #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
> #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
>
> +#define KVM_X86_DEFAULT_VM 0
> +#define KVM_X86_TDX_VM 1
> +
> #endif /* _ASM_X86_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 6a184d260c7f..1e89b967e050 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -1152,6 +1152,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_DISABLE_QUIRKS2 213
> /* #define KVM_CAP_VM_TSC_CONTROL 214 */
> #define KVM_CAP_SYSTEM_EVENT_DATA 215
> +#define KVM_CAP_VM_TYPES 220
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
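For reference, the bitmap semantics described in the api.rst hunk above
("bit @n set means vm type @n is supported") could be consumed roughly as
follows. This is only a sketch: vm_type_supported() is a hypothetical helper,
the two type values are copied from the uapi hunk of this patch, and treating
a non-positive return as "only the default type" is an assumption about how
userspace might fall back on older kernels.

```c
#include <stdbool.h>

/* Values taken from the uapi hunk in this patch. */
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_TDX_VM     1

/*
 * Interpret the value returned for KVM_CAP_VM_TYPES: bit n set means
 * VM type n is supported.  A non-positive return is treated here as
 * "capability absent", leaving only the default type.
 */
static bool vm_type_supported(long cap_ret, unsigned int vm_type)
{
	if (cap_ret <= 0)
		return vm_type == KVM_X86_DEFAULT_VM;
	if (vm_type >= 8 * sizeof(unsigned long))
		return false;	/* outside the bitmap, cannot be reported */
	return ((unsigned long)cap_ret >> vm_type) & 1UL;
}
```

A device model would presumably pass in the result of KVM_CHECK_EXTENSION for
KVM_CAP_VM_TYPES and only request KVM_X86_TDX_VM when the helper returns true.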
--
Thanks,
-Kai
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Currently, KVM VMX module initialization/exit functions are a single
> function each. Refactor KVM VMX module initialization functions into KVM
> common part and VMX part so that TDX specific part can be added cleanly.
> Opportunistically refactor module exit function as well.
>
> The current module initialization flow is, 1.) calculate the sizes of VMX
> kvm structure and VMX vcpu structure, 2.) hyper-v specific initialization
> 3.) report those sizes to the KVM common layer and KVM common
> initialization, and 4.) VMX specific system-wide initialization.
>
> Refactor the KVM VMX module initialization function into functions with a
> wrapper function to separate VMX logic in vmx.c from a file, main.c, common
> among VMX and TDX. We have a wrapper function, "vt_init() {vmx kvm/vcpu
> size calculation; hv_vp_assist_page_init(); kvm_init(); vmx_init(); }" in
> main.c, and hv_vp_assist_page_init() and vmx_init() in vmx.c.
> hv_vp_assist_page_init() initializes hyper-v specific assist pages,
> kvm_init() does system-wide initialization of the KVM common layer, and
> vmx_init() does system-wide VMX initialization.
>
> The KVM architecture common layer allocates struct kvm with reported size
> for architecture-specific code. The KVM VMX module defines its structure
> as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define
> TDX specific kvm and vcpu structures, add tdx_pre_kvm_init() to report the
> sizes of them to the KVM common layer.
>
> The current module exit function is also a single function, a combination
> of VMX specific logic and common KVM logic. Refactor it into VMX specific
> logic and KVM common logic. This is just refactoring to keep the VMX
> specific logic in vmx.c from main.c.
This patch, coupled with the patch:
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
basically provides the infrastructure to support both VMX and TDX. Why can't we
merge them into one patch? What's the benefit of splitting them?
At the very least, why can't the two patches be placed next to each other in
the series?
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 38 +++++++++++++
> arch/x86/kvm/vmx/vmx.c | 106 ++++++++++++++++++-------------------
> arch/x86/kvm/vmx/x86_ops.h | 6 +++
> 3 files changed, 95 insertions(+), 55 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index fabf5f22c94f..371dad728166 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -169,3 +169,41 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
> .runtime_ops = &vt_x86_ops,
> .pmu_ops = &intel_pmu_ops,
> };
> +
> +static int __init vt_init(void)
> +{
> + unsigned int vcpu_size, vcpu_align;
> + int r;
> +
> + vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
> + vcpu_size = sizeof(struct vcpu_vmx);
> + vcpu_align = __alignof__(struct vcpu_vmx);
> +
> + hv_vp_assist_page_init();
> + vmx_init_early();
> +
> + r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
> + if (r)
> + goto err_vmx_post_exit;
> +
> + r = vmx_init();
> + if (r)
> + goto err_kvm_exit;
> +
> + return 0;
> +
> +err_kvm_exit:
> + kvm_exit();
> +err_vmx_post_exit:
> + hv_vp_assist_page_exit();
> + return r;
> +}
> +module_init(vt_init);
> +
> +static void vt_exit(void)
> +{
> + vmx_exit();
> + kvm_exit();
> + hv_vp_assist_page_exit();
> +}
> +module_exit(vt_exit);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 286947c00638..b30d73d28e75 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8181,15 +8181,45 @@ static void vmx_cleanup_l1d_flush(void)
> l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
> }
>
> -static void vmx_exit(void)
> +void __init hv_vp_assist_page_init(void)
> {
> -#ifdef CONFIG_KEXEC_CORE
> - RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
> - synchronize_rcu();
> -#endif
> +#if IS_ENABLED(CONFIG_HYPERV)
> + /*
> + * Enlightened VMCS usage should be recommended and the host needs
> + * to support eVMCS v1 or above. We can also disable eVMCS support
> + * with module parameter.
> + */
> + if (enlightened_vmcs &&
> + ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
> + (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
> + KVM_EVMCS_VERSION) {
> + int cpu;
> +
> + /* Check that we have assist pages on all online CPUs */
> + for_each_online_cpu(cpu) {
> + if (!hv_get_vp_assist_page(cpu)) {
> + enlightened_vmcs = false;
> + break;
> + }
> + }
>
> - kvm_exit();
> + if (enlightened_vmcs) {
> + pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
> + static_branch_enable(&enable_evmcs);
> + }
> +
> + if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
> + vt_x86_ops.enable_direct_tlbflush
> + = hv_enable_direct_tlbflush;
>
> + } else {
> + enlightened_vmcs = false;
> + }
> +#endif
> +}
> +
> +void hv_vp_assist_page_exit(void)
> +{
> #if IS_ENABLED(CONFIG_HYPERV)
> if (static_branch_unlikely(&enable_evmcs)) {
> int cpu;
> @@ -8213,14 +8243,10 @@ static void vmx_exit(void)
> static_branch_disable(&enable_evmcs);
> }
> #endif
> - vmx_cleanup_l1d_flush();
> -
> - allow_smaller_maxphyaddr = false;
> }
> -module_exit(vmx_exit);
>
> /* initialize before kvm_init() so that hardware_enable/disable() can work. */
> -static void __init vmx_init_early(void)
> +void __init vmx_init_early(void)
> {
> int cpu;
>
> @@ -8228,49 +8254,10 @@ static void __init vmx_init_early(void)
> INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> }
>
> -static int __init vmx_init(void)
> +int __init vmx_init(void)
> {
> int r, cpu;
>
> -#if IS_ENABLED(CONFIG_HYPERV)
> - /*
> - * Enlightened VMCS usage should be recommended and the host needs
> - * to support eVMCS v1 or above. We can also disable eVMCS support
> - * with module parameter.
> - */
> - if (enlightened_vmcs &&
> - ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED &&
> - (ms_hyperv.nested_features & HV_X64_ENLIGHTENED_VMCS_VERSION) >=
> - KVM_EVMCS_VERSION) {
> -
> - /* Check that we have assist pages on all online CPUs */
> - for_each_online_cpu(cpu) {
> - if (!hv_get_vp_assist_page(cpu)) {
> - enlightened_vmcs = false;
> - break;
> - }
> - }
> -
> - if (enlightened_vmcs) {
> - pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
> - static_branch_enable(&enable_evmcs);
> - }
> -
> - if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
> - vt_x86_ops.enable_direct_tlbflush
> - = hv_enable_direct_tlbflush;
> -
> - } else {
> - enlightened_vmcs = false;
> - }
> -#endif
> -
> - vmx_init_early();
> - r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
> - __alignof__(struct vcpu_vmx), THIS_MODULE);
> - if (r)
> - return r;
> -
> /*
> * Must be called after kvm_init() so enable_ept is properly set
> * up. Hand the parameter mitigation value in which was stored in
> @@ -8279,10 +8266,8 @@ static int __init vmx_init(void)
> * mitigation mode.
> */
> r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
> - if (r) {
> - vmx_exit();
> + if (r)
> return r;
> - }
>
> for_each_possible_cpu(cpu)
> pi_init_cpu(cpu);
> @@ -8303,4 +8288,15 @@ static int __init vmx_init(void)
>
> return 0;
> }
> -module_init(vmx_init);
> +
> +void vmx_exit(void)
> +{
> +#ifdef CONFIG_KEXEC_CORE
> + RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
> + synchronize_rcu();
> +#endif
> +
> + vmx_cleanup_l1d_flush();
> +
> + allow_smaller_maxphyaddr = false;
> +}
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 0a5967a91e26..2abead2f60f7 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -8,6 +8,12 @@
>
> #include "x86.h"
>
> +void __init hv_vp_assist_page_init(void);
> +void hv_vp_assist_page_exit(void);
> +void __init vmx_init_early(void);
> +int __init vmx_init(void);
> +void vmx_exit(void);
> +
> __init int vmx_cpu_has_kvm_support(void);
> __init int vmx_disabled_by_bios(void);
> __init int vmx_hardware_setup(void);
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> For kvm mmu that has shared bit mask, zap only leaf SPTEs when
> deleting/moving a memslot. The existing kvm_mmu_zap_memslot() depends on
Unless I am mistaken, there is no 'existing' kvm_mmu_zap_memslot().
> role.invalid with read lock of mmu_lock so that other vcpu can operate on
> kvm mmu concurrently.
>
> Mark the root page table invalid, unlink it from page
> table pointer of CPU, process the page table.
>
Are you talking about the behaviour of existing code, or the change you are
going to make? Looks like you mean the latter but I believe it's the former.
> It doesn't work for private
> page table to unlink the root page table because it requires all SPTE entry
> to be non-present.
>
I don't think we can truly *unlink* the private root page table from secure
EPTP, right? The EPTP (root table) is fixed (and hidden) during TD's runtime.
I guess you are trying to say: removing/unlinking one secure-EPT page requires
removing/unlinking all its children first?
So the reason to only zap leaf is we cannot truly unlink the private root page
table, correct? Sorry your changelog is not obvious to me.
> Instead, with write-lock of mmu_lock and zap only leaf
> SPTEs for kvm mmu with shared bit mask.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 80d7c7709af3..c517c7bca105 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5854,11 +5854,44 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> }
>
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + bool flush = false;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + if (is_tdp_mmu_enabled(kvm)) {
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = false,
> + };
> +
> + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
It appears you only unmap private GFNs (because base_gfn doesn't have the shared
bit set)? I think the shared mappings in this slot must be zapped too.
How is this done? Or does kvm_tdp_mmu_unmap_gfn_range() also zap shared
mappings?
It's hard to review when one patch's behaviour/logic depends on later patches.
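To make the question above concrete, here is a minimal stand-alone sketch of the aliasing concern. It is not code from the series: the example_* names are hypothetical, and the shared-bit position (GPA bit 51, so GFN bit 39 with 4 KiB pages) is only an assumption for illustration. It shows that a range built from slot->base_gfn and slot->npages, as in the quoted hunk, contains the private GFN but not its shared alias. Whether kvm_tdp_mmu_unmap_gfn_range() handles the alias internally is exactly the open question, and the sketch does not answer it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption for illustration only: shared bit is GPA bit 51, i.e. GFN bit 39. */
#define EXAMPLE_SHARED_GFN_MASK ((gfn_t)1 << (51 - 12))

/* Hypothetical stand-in for a GFN range; end is exclusive. */
struct example_gfn_range {
	gfn_t start;
	gfn_t end;
};

static inline gfn_t example_gfn_to_shared(gfn_t gfn)
{
	return gfn | EXAMPLE_SHARED_GFN_MASK;
}

static inline gfn_t example_gfn_to_private(gfn_t gfn)
{
	return gfn & ~EXAMPLE_SHARED_GFN_MASK;
}

static inline bool example_range_contains(const struct example_gfn_range *r,
					  gfn_t gfn)
{
	return gfn >= r->start && gfn < r->end;
}

/* Build the range the same way the quoted hunk does: base_gfn + npages. */
static inline struct example_gfn_range example_slot_range(gfn_t base_gfn,
							   uint64_t npages)
{
	/* Memslot GFNs are assumed to carry no shared bit. */
	assert(!(base_gfn & EXAMPLE_SHARED_GFN_MASK));
	return (struct example_gfn_range){
		.start = base_gfn,
		.end = base_gfn + npages,
	};
}
```

With a slot at base_gfn 0x100000 and 0x200 pages, example_range_contains() is true for GFN 0x100005 and false for example_gfn_to_shared(0x100005), so a zap driven only by this range would need some other mechanism to reach the shared alias.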
> + } else {
> + flush = slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
> + KVM_MAX_HUGEPAGE_LEVEL, true);
> + }
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (kvm_gfn_shared_mask(kvm))
> + kvm_mmu_zap_memslot(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> int kvm_mmu_init_vm(struct kvm *kvm)
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For private GPA, CPU refers a private page table whose contents are
> encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
> PTE entry) are used and their cost is expensive.
>
> When KVM resolves KVM page fault, it walks the page tables. To reuse the
> existing KVM MMU code and mitigate the heavy cost to directly walk
> encrypted private page table, allocate a more page to mirror the existing
> KVM page table. Resolve KVM page fault with the existing code, and do
> additional operations necessary for the mirrored private page table. To
> distinguish such cases, the existing KVM page table is called a shared page
> table (i.e. no mirrored private page table), and the KVM page table with
> mirrored private page table is called a private page table. The
> relationship is depicted below.
>
> Add private pointer to struct kvm_mmu_page for mirrored private page table
> and add helper functions to allocate/initialize/free a mirrored private
> page table page. Also, add helper functions to check if a given
> kvm_mmu_page is private. The later patch introduces hooks to operate on
> the mirrored private page table.
>
> KVM page fault |
> | |
> V |
> -------------+---------- |
> | | |
> V V |
> shared GPA private GPA |
> | | |
> V V |
> CPU/KVM shared PT root KVM private PT root | CPU private PT root
> | | | |
> V V | V
> shared PT private PT <----mirror----> mirrored private PT
> | | | |
> | \-----------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> non-encrypted memory | encrypted memory
> |
> PT: page table
>
> Both CPU and KVM refer to CPU/KVM shared page table. Private page table
> is used only by KVM. CPU refers to mirrored private page table.
Shouldn't the private page table maintained by KVM be "mirrored private PT"?
To me "mirrored" normally implies it is fake, or backup which isn't actually
used. But here "mirrored private PT" is actually used by hardware.
And to me, "CPU and KVM" above is confusing. For instance, "Both CPU and KVM
refer to CPU/KVM shared page table" took me at least one minute to understand,
with the help of the diagram -- otherwise I would not have understood it.
I guess you can just say somewhere:
1) Shared PT is visible to KVM and it is used by CPU;
2) Private PT is used by CPU but it is invisible to KVM;
3) Mirrored private PT is visible to KVM but not used by CPU. It is used to
mirror the actual private PT which is used by CPU.
[...]
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> + sp->private_sp = private_sp;
> +}
>
[...]
> @@ -295,6 +297,7 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
> sp->gfn = gfn;
> sp->ptep = sptep;
> sp->tdp_mmu_page = true;
> + kvm_mmu_init_private_sp(sp);
Can this even compile? Unless I am mistaken, kvm_mmu_init_private_sp()
(see above) takes two arguments.
Please make sure each patch at least compiles and doesn't cause warnings.
--
Thanks,
-Kai
On Tue, 2022-06-28 at 14:52 +1200, Kai Huang wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
> > some operations (e.g., memory read/write, register state access, etc).
> >
> > Introduce vm_type to track the type of the VM to x86 KVM. Other arch KVMs
> > already use vm_type, KVM_INIT_VM accepts vm_type, and x86 KVM callback
> > vm_init accepts vm_type. So follow them. Further, a different policy can
> > be made based on vm_type. Define KVM_X86_DEFAULT_VM for default VM as
> > default and define KVM_X86_TDX_VM for Intel TDX VM. The wrapper function
> > will be defined as "bool is_td(kvm) { return vm_type == VM_TYPE_TDX; }"
> >
> > Add a capability KVM_CAP_VM_TYPES to effectively allow device model,
> > e.g. qemu, to query what VM types are supported by KVM. This (introduce a
> > new capability and add vm_type) is chosen to align with other arch KVMs
> > that have VM types already. Other arch KVMs uses different name to query
> > supported vm types and there is no common name for it, so new name was
> > chosen.
> >
> > Co-developed-by: Xiaoyao Li <[email protected]>
> > Signed-off-by: Xiaoyao Li <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++++
> > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/include/uapi/asm/kvm.h | 3 +++
> > arch/x86/kvm/svm/svm.c | 6 ++++++
> > arch/x86/kvm/vmx/main.c | 1 +
> > arch/x86/kvm/vmx/tdx.h | 6 +-----
> > arch/x86/kvm/vmx/vmx.c | 5 +++++
> > arch/x86/kvm/vmx/x86_ops.h | 1 +
> > arch/x86/kvm/x86.c | 9 ++++++++-
> > include/uapi/linux/kvm.h | 1 +
> > tools/arch/x86/include/uapi/asm/kvm.h | 3 +++
> > tools/include/uapi/linux/kvm.h | 1 +
> > 13 files changed, 54 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 9cbbfdb663b6..b9ab598883b2 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -147,10 +147,31 @@ described as 'basic' will be available.
> > The new VM has no virtual cpus and no memory.
> > You probably want to use 0 as machine type.
> >
> > +X86:
> > +^^^^
> > +
> > +Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
> > +bitmap of supported vm types. The 1-setting of bit @n means vm type with
> > +value @n is supported.
>
>
> Perhaps I am missing something, but I don't understand how the below changes
> (except the x86 part above) in Documentation are related to this patch.
>
> > +
> > +S390:
> > +^^^^^
> > +
> > In order to create user controlled virtual machines on S390, check
> > KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
> > privileged user (CAP_SYS_ADMIN).
> >
> > +MIPS:
> > +^^^^^
> > +
> > +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
> > +the default trap & emulate implementation (which changes the virtual
> > +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
> > +flag KVM_VM_MIPS_VZ.
> > +
> > +ARM64:
> > +^^^^^^
> > +
> > On arm64, the physical address size for a VM (IPA Size limit) is limited
> > to 40bits by default. The limit can be configured if the host supports the
> > extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 75bc44aa8d51..a97cdb203a16 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -19,6 +19,7 @@ KVM_X86_OP(hardware_disable)
> > KVM_X86_OP(hardware_unsetup)
> > KVM_X86_OP(has_emulated_msr)
> > KVM_X86_OP(vcpu_after_set_cpuid)
> > +KVM_X86_OP(is_vm_type_supported)
> > KVM_X86_OP(vm_init)
> > KVM_X86_OP_OPTIONAL(vm_destroy)
> > KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aa11525500d3..089e0a4de926 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1141,6 +1141,7 @@ enum kvm_apicv_inhibit {
> > };
> >
> > struct kvm_arch {
> > + unsigned long vm_type;
> > unsigned long n_used_mmu_pages;
> > unsigned long n_requested_mmu_pages;
> > unsigned long n_max_mmu_pages;
> > @@ -1434,6 +1435,7 @@ struct kvm_x86_ops {
> > bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
> > void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
> >
> > + bool (*is_vm_type_supported)(unsigned long vm_type);
> > unsigned int vm_size;
> > int (*vm_init)(struct kvm *kvm);
> > void (*vm_destroy)(struct kvm *kvm);
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 50a4e787d5e6..9792ec1cc317 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -531,4 +531,7 @@ struct kvm_pmu_event_filter {
> > #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
> > #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
> >
> > +#define KVM_X86_DEFAULT_VM 0
> > +#define KVM_X86_TDX_VM 1
> > +
> > #endif /* _ASM_X86_KVM_H */
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 247c0ad458a0..815a07c594f1 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4685,6 +4685,11 @@ static void svm_vm_destroy(struct kvm *kvm)
> > sev_vm_destroy(kvm);
> > }
> >
> > +static bool svm_is_vm_type_supported(unsigned long type)
> > +{
> > + return type == KVM_X86_DEFAULT_VM;
> > +}
> > +
> > static int svm_vm_init(struct kvm *kvm)
> > {
> > if (!pause_filter_count || !pause_filter_thresh)
> > @@ -4712,6 +4717,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> > .vcpu_free = svm_vcpu_free,
> > .vcpu_reset = svm_vcpu_reset,
> >
> > + .is_vm_type_supported = svm_is_vm_type_supported,
> > .vm_size = sizeof(struct kvm_svm),
> > .vm_init = svm_vm_init,
> > .vm_destroy = svm_vm_destroy,
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index ac788af17d92..7be4941e4c4d 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -43,6 +43,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > .hardware_disable = vmx_hardware_disable,
> > .has_emulated_msr = vmx_has_emulated_msr,
> >
> > + .is_vm_type_supported = vmx_is_vm_type_supported,
> > .vm_size = sizeof(struct kvm_vmx),
> > .vm_init = vmx_vm_init,
> > .vm_destroy = vmx_vm_destroy,
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 54d7a26ed9ee..2f43db5bbefb 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -17,11 +17,7 @@ struct vcpu_tdx {
> >
> > static inline bool is_td(struct kvm *kvm)
> > {
> > - /*
> > - * TDX VM type isn't defined yet.
> > - * return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > - */
> > - return false;
> > + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > }
>
> If you put this patch before patch:
>
> [PATCH v7 009/102] KVM: TDX: Add placeholders for TDX VM/vcpu structure
>
> Then you don't need to introduce this chunk in above patch and then remove it
> here, which is unnecessary and ugly.
>
> And you can even introduce only KVM_X86_DEFAULT_VM, not KVM_X86_TDX_VM, in
> this patch, so that this patch becomes an infrastructural patch to report VM
> type. KVM_X86_TDX_VM can come with the patch where is_td() is introduced
> (your patch 9 above).
>
> To me, that is a cleaner way to write the patch. For instance, this infrastructural
> patch could in theory be used by other series that need to support something
> similar, without carrying the is_td() and KVM_X86_TDX_VM burden that you
> added.
Sorry, I missed that this patch already has Paolo's Reviewed-by. Please feel free to
ignore my comments.
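For reference, the bitmap semantics from the quoted api.rst hunk ("the 1-setting of bit @n means vm type with value @n is supported") can be sketched as below. This is only an illustration: the example_* name is hypothetical, how the bitmap is obtained from KVM is outside the sketch, and the type values 0 (KVM_X86_DEFAULT_VM) and 1 (KVM_X86_TDX_VM) are taken from the quoted uapi hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the quoted uapi hunk; the names here are illustrative. */
#define EXAMPLE_X86_DEFAULT_VM	0
#define EXAMPLE_X86_TDX_VM	1

/*
 * Return true if bit @vm_type is set in @bitmap, i.e. the VM type with
 * value @vm_type is reported as supported. Types that do not fit in the
 * 64-bit bitmap are treated as unsupported.
 */
static inline bool example_vm_type_supported(uint64_t bitmap,
					      unsigned int vm_type)
{
	if (vm_type >= 64)
		return false;
	return (bitmap >> vm_type) & 1;
}
```

A bitmap of 0x1 would report only the default VM type as supported, while 0x3 would report both the default and the TDX VM types.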
--
Thanks,
-Kai
On Mon, Jun 27, 2022 at 02:53:15PM -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Implement a system-scoped ioctl to get system-wide parameters for TDX.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/include/uapi/asm/kvm.h | 48 +++++++++++++++++++++++++++
> arch/x86/kvm/vmx/main.c | 2 ++
> arch/x86/kvm/vmx/tdx.c | 46 +++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 2 ++
> arch/x86/kvm/x86.c | 6 ++++
> tools/arch/x86/include/uapi/asm/kvm.h | 48 +++++++++++++++++++++++++++
> 8 files changed, 154 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index fbb2c6746066..3677a5015a4f 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -117,6 +117,7 @@ KVM_X86_OP(smi_allowed)
> KVM_X86_OP(enter_smm)
> KVM_X86_OP(leave_smm)
> KVM_X86_OP(enable_smi_window)
> +KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
> KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> KVM_X86_OP_OPTIONAL(mem_enc_register_region)
> KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 80df346af117..342decc69649 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1591,6 +1591,7 @@ struct kvm_x86_ops {
> int (*leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
> void (*enable_smi_window)(struct kvm_vcpu *vcpu);
>
> + int (*dev_mem_enc_ioctl)(void __user *argp);
> int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
> int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 9792ec1cc317..273c8d82b9c8 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -534,4 +534,52 @@ struct kvm_pmu_event_filter {
> #define KVM_X86_DEFAULT_VM 0
> #define KVM_X86_TDX_VM 1
>
> +/* Trust Domain eXtension sub-ioctl() commands. */
> +enum kvm_tdx_cmd_id {
> + KVM_TDX_CAPABILITIES = 0,
> +
> + KVM_TDX_CMD_NR_MAX,
> +};
> +
> +struct kvm_tdx_cmd {
> + /* enum kvm_tdx_cmd_id */
> + __u32 id;
> + /* flags for sub-commend. If sub-command doesn't use this, set zero. */
> + __u32 flags;
> + /*
> + * data for each sub-command. An immediate or a pointer to the actual
> + * data in process virtual address. If sub-command doesn't use it,
> + * set zero.
> + */
> + __u64 data;
> + /*
> + * Auxiliary error code. The sub-command may return TDX SEAMCALL
> + * status code in addition to -Exxx.
> + * Defined for consistency with struct kvm_sev_cmd.
> + */
> + __u64 error;
> + /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
> + __u64 unused;
> +};
> +
> +struct kvm_tdx_cpuid_config {
> + __u32 leaf;
> + __u32 sub_leaf;
> + __u32 eax;
> + __u32 ebx;
> + __u32 ecx;
> + __u32 edx;
> +};
> +
> +struct kvm_tdx_capabilities {
> + __u64 attrs_fixed0;
> + __u64 attrs_fixed1;
> + __u64 xfam_fixed0;
> + __u64 xfam_fixed1;
> +
> + __u32 nr_cpuid_configs;
> + __u32 padding;
> + struct kvm_tdx_cpuid_config cpuid_configs[0];
> +};
> +
> #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 6a93b19a8b06..7b497ed1f21c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -212,6 +212,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .complete_emulated_msr = kvm_complete_insn_gp,
>
> .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> +
> + .dev_mem_enc_ioctl = tdx_dev_ioctl,
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 63f3c7a02cc8..ec4ebba4152a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -392,6 +392,52 @@ int tdx_vm_init(struct kvm *kvm)
> return ret;
> }
>
> +int tdx_dev_ioctl(void __user *argp)
> +{
> + struct kvm_tdx_capabilities __user *user_caps;
> + struct kvm_tdx_capabilities caps;
> + struct kvm_tdx_cmd cmd;
> +
> + BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
> + sizeof(struct tdx_cpuid_config));
> +
> + if (copy_from_user(&cmd, argp, sizeof(cmd)))
> + return -EFAULT;
> + if (cmd.flags || cmd.error || cmd.unused)
> + return -EINVAL;
> + /*
> + * Currently only KVM_TDX_CAPABILITIES is defined for system-scoped
> + * mem_enc_ioctl().
> + */
> + if (cmd.id != KVM_TDX_CAPABILITIES)
> + return -EINVAL;
> +
> + user_caps = (void __user *)cmd.data;
> + if (copy_from_user(&caps, user_caps, sizeof(caps)))
> + return -EFAULT;
> +
> + if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
> + return -E2BIG;
> +
> + caps = (struct kvm_tdx_capabilities) {
> + .attrs_fixed0 = tdx_caps.attrs_fixed0,
> + .attrs_fixed1 = tdx_caps.attrs_fixed1,
> + .xfam_fixed0 = tdx_caps.xfam_fixed0,
> + .xfam_fixed1 = tdx_caps.xfam_fixed1,
> + .nr_cpuid_configs = tdx_caps.nr_cpuid_configs,
> + .padding = 0,
> + };
> +
> + if (copy_to_user(user_caps, &caps, sizeof(caps)))
> + return -EFAULT;
> + if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
> + tdx_caps.nr_cpuid_configs *
> + sizeof(struct tdx_cpuid_config)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> int __init tdx_module_setup(void)
> {
> const struct tdsysinfo_struct *tdsysinfo;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 663fd8d4063f..3027d9821fe1 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -132,6 +132,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
> int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> bool tdx_is_vm_type_supported(unsigned long type);
> void tdx_hardware_unsetup(void);
> +int tdx_dev_ioctl(void __user *argp);
>
> int tdx_vm_init(struct kvm *kvm);
> void tdx_mmu_release_hkid(struct kvm *kvm);
> @@ -140,6 +141,7 @@ void tdx_vm_free(struct kvm *kvm);
> static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
> static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> static inline void tdx_hardware_unsetup(void) {}
> +static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
>
> static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 320f902eaf9e..6037ce93bcb7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4565,6 +4565,12 @@ long kvm_arch_dev_ioctl(struct file *filp,
> break;
> r = kvm_x86_dev_has_attr(&attr);
> break;
> + case KVM_MEMORY_ENCRYPT_OP:
> + r = -EINVAL;
> + if (!kvm_x86_ops.dev_mem_enc_ioctl)
> + goto out;
> + r = static_call(kvm_x86_dev_mem_enc_ioctl)(argp);
> + break;
Incorrect indentation, and please move it out of
case KVM_HAS_DEVICE_ATTR: {
}
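A stand-alone sketch of the layout being asked for, as I read the comment: the new case sits at the same level as the other cases, after the closing brace of the block-scoped case, not inside it. The example_* names and the simplified bodies are hypothetical stand-ins, not the real kvm_arch_dev_ioctl().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum example_ioctl {
	EXAMPLE_HAS_DEVICE_ATTR,
	EXAMPLE_MEMORY_ENCRYPT_OP,
	EXAMPLE_UNKNOWN,
};

typedef int (*example_enc_op_t)(void *argp);

/* Mirrors the quoted !TDX stub, which returns -EOPNOTSUPP. */
static inline int example_stub_enc_op(void *argp)
{
	(void)argp;
	return -EOPNOTSUPP;
}

static inline int example_dev_ioctl(enum example_ioctl ioctl, void *argp,
				    example_enc_op_t enc_op)
{
	int r;

	switch (ioctl) {
	case EXAMPLE_HAS_DEVICE_ATTR: {
		/* Block scope exists only for this case's local variable. */
		int have_arg = (argp != NULL);

		r = have_arg ? 0 : -EFAULT;
		break;
	}
	case EXAMPLE_MEMORY_ENCRYPT_OP:
		/* New case placed after the closing brace above. */
		r = -EINVAL;
		if (!enc_op)
			goto out;
		r = enc_op(argp);
		break;
	default:
		r = -EINVAL;
		break;
	}
out:
	return r;
}
```

Keeping the new case outside the braces means it cannot accidentally depend on locals declared for KVM_HAS_DEVICE_ATTR, and the indentation then matches the other case labels.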
> }
> default:
> r = -EINVAL;
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index 71a5851475e7..a9ea3573be1b 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -528,4 +528,52 @@ struct kvm_pmu_event_filter {
> #define KVM_X86_DEFAULT_VM 0
> #define KVM_X86_TDX_VM 1
>
> +/* Trust Domain eXtension sub-ioctl() commands. */
> +enum kvm_tdx_cmd_id {
> + KVM_TDX_CAPABILITIES = 0,
> +
> + KVM_TDX_CMD_NR_MAX,
> +};
> +
> +struct kvm_tdx_cmd {
> + /* enum kvm_tdx_cmd_id */
> + __u32 id;
> + /* flags for sub-commend. If sub-command doesn't use this, set zero. */
> + __u32 flags;
> + /*
> + * data for each sub-command. An immediate or a pointer to the actual
> + * data in process virtual address. If sub-command doesn't use it,
> + * set zero.
> + */
> + __u64 data;
> + /*
> + * Auxiliary error code. The sub-command may return TDX SEAMCALL
> + * status code in addition to -Exxx.
> + * Defined for consistency with struct kvm_sev_cmd.
> + */
> + __u64 error;
> + /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
> + __u64 unused;
> +};
> +
> +struct kvm_tdx_cpuid_config {
> + __u32 leaf;
> + __u32 sub_leaf;
> + __u32 eax;
> + __u32 ebx;
> + __u32 ecx;
> + __u32 edx;
> +};
> +
> +struct kvm_tdx_capabilities {
> + __u64 attrs_fixed0;
> + __u64 attrs_fixed1;
> + __u64 xfam_fixed0;
> + __u64 xfam_fixed1;
> +
> + __u32 nr_cpuid_configs;
> + __u32 padding;
> + struct kvm_tdx_cpuid_config cpuid_configs[0];
> +};
> +
> #endif /* _ASM_X86_KVM_H */
> --
> 2.25.1
>
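As a side note on the KVM_TDX_CAPABILITIES layout quoted above: the struct ends in a variable-length cpuid_configs[] array, and the quoted handler returns -E2BIG when the caller's nr_cpuid_configs is smaller than what KVM has. A caller therefore has to size its buffer for the header plus the entries. The sketch below shows only that sizing arithmetic; the example_* types are local mirrors of the quoted uapi structs with hypothetical names, and how a caller learns the required count is an assumption left out of the sketch (the quoted code does not appear to write it back on -E2BIG).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirror of the quoted struct kvm_tdx_cpuid_config (illustrative). */
struct example_cpuid_config {
	uint32_t leaf;
	uint32_t sub_leaf;
	uint32_t eax;
	uint32_t ebx;
	uint32_t ecx;
	uint32_t edx;
};

/* Local mirror of the quoted struct kvm_tdx_capabilities (illustrative). */
struct example_capabilities {
	uint64_t attrs_fixed0;
	uint64_t attrs_fixed1;
	uint64_t xfam_fixed0;
	uint64_t xfam_fixed1;
	uint32_t nr_cpuid_configs;
	uint32_t padding;
	struct example_cpuid_config cpuid_configs[];
};

/* Bytes needed for the header plus @nr trailing cpuid entries. */
static inline size_t example_caps_size(uint32_t nr)
{
	return sizeof(struct example_capabilities) +
	       (size_t)nr * sizeof(struct example_cpuid_config);
}

/*
 * Allocate a zeroed buffer with room for @nr entries and record @nr in the
 * header, which is the field the quoted handler compares against its own
 * count. Returns NULL on allocation failure; the caller frees the buffer.
 */
static inline struct example_capabilities *example_caps_alloc(uint32_t nr)
{
	struct example_capabilities *caps = calloc(1, example_caps_size(nr));

	if (caps)
		caps->nr_cpuid_configs = nr;
	return caps;
}
```

The pointer returned by example_caps_alloc() is what a caller would place in the data field of the command struct before issuing the ioctl.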
On Mon, Jun 27, 2022 at 02:53:16PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a place holder function for TDX specific VM-scoped ioctl as mem_enc_op.
> TDX specific sub-commands will be added to retrieve/pass TDX specific
> parameters.
>
> KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for
> guest state-protected VM. It defined subcommands for technology-specific
> operations under KVM_MEMORY_ENCRYPT_OP. Despite its name, the subcommands
> are not limited to memory encryption, but various technology-specific
> operations are defined. It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
> for TDX specific operations and define subcommands.
>
> TDX requires VM-scoped, and VCPU-scoped TDX-specific operations for device
> model, for example, qemu. Getting system-wide parameters, TDX-specific VM
> initialization, and TDX-specific vCPU initialization. Which requires KVM
> vCPU-scoped operations in addition to the existing VM-scoped operations.
I suggest not discussing vCPU-scoped operations here, because they are
not available in this patch; we can discuss them in the patch that
introduces them.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 9 +++++++++
> arch/x86/kvm/vmx/tdx.c | 26 ++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 4 ++++
> 3 files changed, 39 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 7b497ed1f21c..067f5de56c53 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -73,6 +73,14 @@ static void vt_vm_free(struct kvm *kvm)
> return tdx_vm_free(kvm);
> }
>
> +static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> +{
> + if (!is_td(kvm))
> + return -ENOTTY;
> +
> + return tdx_vm_ioctl(kvm, argp);
> +}
> +
> struct kvm_x86_ops vt_x86_ops __initdata = {
> .name = "kvm_intel",
>
> @@ -214,6 +222,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
>
> .dev_mem_enc_ioctl = tdx_dev_ioctl,
> + .mem_enc_ioctl = vt_mem_enc_ioctl,
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ec4ebba4152a..2a9dfd54189f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -438,6 +438,32 @@ int tdx_dev_ioctl(void __user *argp)
> return 0;
> }
>
> +int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> +{
> + struct kvm_tdx_cmd tdx_cmd;
> + int r;
> +
> + if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
> + return -EFAULT;
> + if (tdx_cmd.error || tdx_cmd.unused)
> + return -EINVAL;
> +
> + mutex_lock(&kvm->lock);
> +
> + switch (tdx_cmd.id) {
> + default:
> + r = -EINVAL;
> + goto out;
> + }
> +
> + if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
> + r = -EFAULT;
> +
> +out:
> + mutex_unlock(&kvm->lock);
> + return r;
> +}
> +
> int __init tdx_module_setup(void)
> {
> const struct tdsysinfo_struct *tdsysinfo;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 3027d9821fe1..ef6115ae0e88 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -137,6 +137,8 @@ int tdx_dev_ioctl(void __user *argp);
> int tdx_vm_init(struct kvm *kvm);
> void tdx_mmu_release_hkid(struct kvm *kvm);
> void tdx_vm_free(struct kvm *kvm);
> +
> +int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> #else
> static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
> static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> @@ -147,6 +149,8 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
> static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
> static inline void tdx_vm_free(struct kvm *kvm) {}
> +
> +static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> #endif
>
> #endif /* __KVM_X86_VMX_X86_OPS_H */
> --
> 2.25.1
>
> +
> +- Wrapping kvm x86_ops: The current choice
> + Introduce dedicated file for arch/x86/kvm/vmx/main.c (the name,
> + main.c, is just chosen to show main entry points for callbacks.) and
> + wrapper functions around all the callbacks with
> + "if (is-tdx) tdx-callback() else vmx-callback()".
> +
> + Pros:
> + - No major change in common x86 KVM code. The change is (mostly)
> + contained under arch/x86/kvm/vmx/.
> + - When TDX is disabled(CONFIG_INTEL_TDX_HOST=n), the overhead is
> + optimized out.
> + - Micro optimization by avoiding function pointer.
> + Cons:
> + - Many boiler plates in arch/x86/kvm/vmx/main.c.
> +
> +Alternative:
> +- Introduce another callback layer under arch/x86/kvm/vmx.
> + Pros:
> + - No major change in common x86 KVM code. The change is (mostly)
> + contained under arch/x86/kvm/vmx/.
> + - clear separation on callbacks.
> + Cons:
> + - overhead in VMX even when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
> +
Why put "Alternative" in the documentation? You may put it in the cover
letter so people can judge whether the design is reasonable, but it should not
be in the documentation.
--
Thanks,
-Kai
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> To Keep the case of non TDX intact, introduce a new config option for
> private KVM MMU support. At the moment, this is synonym for
> CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL. The new flag make it clear
> that the config is only for x86 KVM MMU.
What is the "new flag"?
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/Kconfig | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index e3cbd7706136..5a59abc83179 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -129,4 +129,8 @@ config KVM_XEN
> config KVM_EXTERNAL_WRITE_TRACKING
> bool
>
> +config KVM_MMU_PRIVATE
> + def_bool y
> + depends on INTEL_TDX_HOST && KVM_INTEL
> +
> endif # VIRTUALIZATION
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX doesn't support dirty logging. Report dirty logging isn't supported so
> that device model, for example qemu, can properly handle it.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Xiaoyao Li <[email protected]>
Xiaoyao's SoB looks weird.
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/x86.c | 5 +++++
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 15 ++++++++++++---
> 3 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4309ef0ade21..dcd1f5e2ba05 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13164,6 +13164,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> }
> EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>
> +bool kvm_arch_dirty_log_supported(struct kvm *kvm)
> +{
> + return kvm->arch.vm_type != KVM_X86_TDX_VM;
> +}
> +
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 79a4988fd51f..6fd8ec297236 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1452,6 +1452,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_dirty_log_supported(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7a5261eb7eb8..703c1d0c98da 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1467,9 +1467,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
> {
> - u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> + return true;
> +}
> +
> +static int check_memory_region_flags(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem)
> +{
> + u32 valid_flags = 0;
> +
> + if (kvm_arch_dirty_log_supported(kvm))
> + valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
>
> #ifdef __KVM_HAVE_READONLY_MEM
> valid_flags |= KVM_MEM_READONLY;
> @@ -1871,7 +1880,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> int as_id, id;
> int r;
>
> - r = check_memory_region_flags(mem);
> + r = check_memory_region_flags(kvm, mem);
> if (r)
> return r;
>
On Mon, Jun 27, 2022 at 02:53:35PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> In this patch series, TDX supports only TDP MMU and doesn't support legacy
> MMU. Forcibly use TDP MMU for TDX regardless of the kernel parameter to
> disable TDP MMU.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 82f1bfac7ee6..7eb41b176d1e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -18,8 +18,13 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> {
> struct workqueue_struct *wq;
>
> - if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> - return 0;
> + /*
> + * Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
> + * of TDX.
> + */
> + if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
> + (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
> + return false;
Please return 0 here, since the return type is int.
>
> wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> if (!wq)
> --
> 2.25.1
>
On Mon, Jun 27, 2022 at 02:53:36PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For private GPA, CPU refers to a private page table whose contents are
> encrypted. Dedicated APIs are used to operate on it (e.g. updating/reading
> its PTE entry), and they are expensive.
>
> When KVM resolves a KVM page fault, it walks the page tables. To reuse the
> existing KVM MMU code and mitigate the heavy cost of directly walking the
> encrypted private page table, allocate one more page to mirror the existing
> KVM page table. Resolve KVM page fault with the existing code, and do
> additional operations necessary for the mirrored private page table. To
> distinguish such cases, the existing KVM page table is called a shared page
> table (i.e. no mirrored private page table), and the KVM page table with
> mirrored private page table is called a private page table. The
> relationship is depicted below.
>
> Add private pointer to struct kvm_mmu_page for mirrored private page table
> and add helper functions to allocate/initialize/free a mirrored private
> page table page. Also, add helper functions to check if a given
> kvm_mmu_page is private. The later patch introduces hooks to operate on
> the mirrored private page table.
>
> KVM page fault |
> | |
> V |
> -------------+---------- |
> | | |
> V V |
> shared GPA private GPA |
> | | |
> V V |
> CPU/KVM shared PT root KVM private PT root | CPU private PT root
> | | | |
> V V | V
> shared PT private PT <----mirror----> mirrored private PT
> | | | |
> | \-----------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> non-encrypted memory | encrypted memory
> |
> PT: page table
>
> Both CPU and KVM refer to CPU/KVM shared page table. Private page table
> is used only by KVM. CPU refers to mirrored private page table.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 9 ++++
> arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 3 ++
> 4 files changed, 97 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f4d4ed41641b..bfc934dc9a33 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -716,6 +716,7 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
> + struct kvm_mmu_memory_cache mmu_private_sp_cache;
>
> /*
> * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c517c7bca105..a5bf3e40e209 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -691,6 +691,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> int start, end, i, r;
> bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
>
> + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> + PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> + }
> +
> if (is_tdp_mmu && shadow_nonpresent_value)
> start = kvm_mmu_memory_cache_nr_free_objects(mc);
>
> @@ -732,6 +739,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1736,6 +1744,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
> if (!direct)
> sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> + kvm_mmu_init_private_sp(sp, NULL);
>
> /*
> * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 44a04fad4bed..9f3a6bea60a3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -55,6 +55,10 @@ struct kvm_mmu_page {
> u64 *spt;
> /* hold the gfn of each spte inside spt */
> gfn_t *gfns;
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> + /* associated private shadow page, e.g. SEPT page. */
> + void *private_sp;
> +#endif
> /* Currently serving as active root */
> union {
> int root_count;
> @@ -115,6 +119,86 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> return kvm_mmu_role_as_id(sp->role);
> }
>
> +/*
> + * TDX vcpu allocates page for root Secure EPT page and assigns to CPU secure
"TDX vcpu" is a little confusing. How about "TDX module allocates (or manages) page
for ..."?
> + * EPT pointer. KVM doesn't need to allocate and link to the secure EPT.
> + * Dummy value to make is_private_sp() return true.
> + */
> +#define KVM_MMU_PRIVATE_SP_ROOT ((void *)1)
> +
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> + return !!sp->private_sp;
> +}
> +
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> + WARN_ON(!sptep);
> + return is_private_sp(sptep_to_sp(sptep));
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> + return sp->private_sp;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> + sp->private_sp = private_sp;
> +}
> +
> +/* Valid sp->role.level is required. */
I don't see such a requirement in kvm_mmu_alloc_private_sp(). Please
consider moving the comment into the patch whose code introduces the
requirement.
> +static inline void kvm_mmu_alloc_private_sp(
> + struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
> +{
> + if (is_root)
> + sp->private_sp = KVM_MMU_PRIVATE_SP_ROOT;
> + else
> + sp->private_sp = kvm_mmu_memory_cache_alloc(
> + &vcpu->arch.mmu_private_sp_cache);
> + /*
> + * Because mmu_private_sp_cache is topped up before starting kvm page
> + * fault resolving, the allocation above shouldn't fail.
> + */
> + WARN_ON_ONCE(!sp->private_sp);
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> + if (sp->private_sp != KVM_MMU_PRIVATE_SP_ROOT)
> + free_page((unsigned long)sp->private_sp);
> +}
> +#else
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> + return false;
> +}
> +
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> + return false;
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> + return NULL;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> +}
> +
> +static inline void kvm_mmu_alloc_private_sp(
> + struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
> +{
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> +}
> +#endif
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7eb41b176d1e..b2568b062faa 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -72,6 +72,8 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> + if (is_private_sp(sp))
> + kvm_mmu_free_private_sp(sp);
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
> @@ -295,6 +297,7 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
> sp->gfn = gfn;
> sp->ptep = sptep;
> sp->tdp_mmu_page = true;
> + kvm_mmu_init_private_sp(sp, NULL);
>
> trace_kvm_mmu_get_page(sp, true);
> }
> --
> 2.25.1
>
Hi. Because my description of large page support was terse, I wrote up a more
detailed one. Any feedback/thoughts on large page support?
TDP MMU large page support design
Two main discussion points
* how to track page status. private vs shared, no-largepage vs can-be-largepage
* how to trigger merging mapping from 4KB/2MB to 2MB/1GB
Expected private-vs-shared page usage
-------------------------------------
On TD boot all pages are private and TD converts pages into shared if necessary.
* Most of the guest pages remain private.
* Only limited pages are converted at kernel boot
** bounce buffer for IO (virtio). It's allocated as swiotlb. Its size is
64MB or 6% of total guest memory.
** KVM PV shared page. (the current guest TD doesn't use KVM PV shared page.)
* Only a small number of pages are dynamically converted from private to shared
and vice versa. This usage is very limited. e.g. GetQuote, the lack of
swiotlb buffer
Theory of Secure-EPT operations related to large page
-----------------------------------------------------
TDX Secure-EPT has differences from VMX EPT.
To add a page to Secure-EPT
* Here is the operation to resolve the EPT violation.
1. TD: Accepts GPA. TD needs to accept GPA before accessing it because TD
needs to detect the case where VMM unmaps GPA and maps it again.
2. EPT violation is triggered. TD exits to VMM.
3. VMM: allocates a page for GPA and adds it at GPA with TDH.MEM.PAGE.AUG.
Resumes TD vcpu.
(3a. TD: #VE<EPT violation> is injected. #VE handler accepts the page)
4. TD: returns from #VE and continues TD vcpu execution
TD may choose to skip step 1. In that case, after step 3, #VE is injected into
TD and the TD #VE handler needs to accept the page.
When adding a page to Secure-EPT again, the page contents are cleared and the
page is encrypted. If a page is disassociated from Secure-EPT and added again,
the page content is lost.
* TDG.VP.VMCALL<MapGPA> hypercall
The page associated with GPA can be private or shared. TD converts the GPA by
TDG.VP.VMCALL<MapGPA> hypercall from private to shared or vice versa. VMM
tracks whether the given GPA is private or shared.
* mapping merge(promote)/split(demote)
The page can be mapped as large page (2MB or 1GB) in addition to 4KB. The
mapping can be merged(4KB/2MB -> 2MB/1GB) or split(2MB/1GB -> 4KB/2MB) by TDX
SEAMCALL TDH.MEM.PAGE.PROMOTE and TDH.MEM.PAGE.DEMOTE.
The merge of mapping requires that all the pages be mapped, unlike VMX EPT,
because of encryption. This implies the current KVM implementation doesn't work
for TDX when merging mappings as follows
- EPT violation and host page is 2MB mappable.
some of the 4KB pages of the given 2MB page are already mapped, some not.
i.e. 2MB EPT -> 4KB EPT -> 4K pages
- KVM page fault handler zaps the 2MB EPT entry and populates the 2MB EPT entry
zap: 2MB EPT: non present
populate 2MB: -> 2MB page
If VMM zaps a 2MB Secure-EPT entry, the page contents will be lost for TDX.
Mapping merge requires that all pages are already mapped.
Instead, the following steps are needed.
- EPT violation and host page is 2MB mappable.
some of the 4KB pages of the given 2MB page are already mapped. Some not.
i.e. 2MB EPT -> 4KB EPT -> 4K pages
- VMM checks all 4KB GPAs are private. If not, it can't be mapped as a large page.
(****)
- VMM checks all 4KB GPAs are already mapped. If not, give up mapping merge.
(or map missing 4KB pages.)
- mapping merge by TDH.MEM.PAGE.PROMOTE
The mapping split for TDX Secure-EPT works similarly to the VMX EPT case.
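The merge precondition above can be sketched in C as follows. This is only an
illustration, not code from the patch series: struct page_state,
enum merge_decision, and check_2mb_merge() are hypothetical names, and the
actual TDH.MEM.PAGE.PROMOTE SEAMCALL is not invoked here.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2MB 512

/* Hypothetical per-4KB state, for illustration only. */
struct page_state {
	bool is_private;	/* GPA is private (not converted by MapGPA) */
	bool is_mapped;		/* 4KB Secure-EPT entry is present */
};

enum merge_decision {
	MERGE_NOT_POSSIBLE,	/* private/shared is mixed: no large page */
	MERGE_GIVE_UP,		/* all private, but some 4KB pages are unmapped */
	MERGE_PROMOTE,		/* safe to call TDH.MEM.PAGE.PROMOTE */
};

/* Decide whether the 512 4KB pages of one 2MB range can be merged. */
static enum merge_decision check_2mb_merge(const struct page_state *pages)
{
	bool all_mapped = true;
	size_t i;

	for (i = 0; i < PAGES_PER_2MB; i++) {
		/* A single shared 4KB page rules out a private large page. */
		if (!pages[i].is_private)
			return MERGE_NOT_POSSIBLE;
		if (!pages[i].is_mapped)
			all_mapped = false;
	}

	/* Zapping to re-populate would lose contents, so never zap here. */
	return all_mapped ? MERGE_PROMOTE : MERGE_GIVE_UP;
}
```

A caller that gets MERGE_GIVE_UP could alternatively map the missing 4KB pages
first and retry, which is the "(or map missing 4KB pages.)" variant.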
EPT violation and MapGPA
------------------------
- EPT violation is a fast path
- MapGPA is not a fast path.
=> Keep the EPT violation path optimized at the cost of complicating the MapGPA
path. For the (****) check, we don't want to scan the 4KB mappings on EPT
violation. Instead, the MapGPA path scans them and records whether the page can
be mapped as 2MB with regard to private/shared.
Tracking private/shared and large page mappable
-----------------------------------------------
VMM needs to track whether a page is mapped as private or shared at 4KB
granularity. For efficiency of the EPT violation path (****), at the 2MB and 1GB
levels, VMM should track whether the page can be mapped as a large page
(regarding private/shared). VMM updates it on MapGPA and references it on the
EPT violation path. (****)
For 4KB pages, 1 bit is needed: private or shared. Let's call it the
shared-mask bit.
For 2MB/1GB pages, 2 bits are needed: large page mappable or not, and private or
shared if mappable. Let's call the additional bit the no-largepage bit.
Option A.)
Allocate array for pages in struct kvm_arch_memory_slot on TD creation.
struct kvm_arch_memory_slot {
+struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
}
pros:
+straightforward implementation
+SPTE_SHARED_MASK is not needed
cons:
-memory overhead is high
-not optimized for expected usage
-one more look-up on EPT violation
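To put a rough number on the memory overhead of Option A, the sketch below
counts one struct kvm_page_attr entry per page at each of the three page sizes.
It assumes the struct kvm_page_attr layout from the cover letter (a single
enum) and a memslot whose size is a multiple of 1GB;
page_attr_overhead_bytes() is a hypothetical helper name. With a 4-byte enum,
which is typical on x86-64, a 16GB guest would need about 16MB.

```c
#include <stddef.h>
#include <stdint.h>

enum kvm_page_type {
	KVM_PAGE_TYPE_INVALID,
	KVM_PAGE_TYPE_SHARED,
	KVM_PAGE_TYPE_PRIVATE,
	KVM_PAGE_TYPE_MIXED,
};

struct kvm_page_attr {
	enum kvm_page_type type;
};

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/*
 * Bytes needed for the three page_attr[] arrays of a guest with
 * guest_bytes of memory: one entry per 4KB, per 2MB, and per 1GB page.
 */
static uint64_t page_attr_overhead_bytes(uint64_t guest_bytes)
{
	uint64_t entries = guest_bytes / SZ_4K +
			   guest_bytes / SZ_2M +
			   guest_bytes / SZ_1G;

	return entries * sizeof(struct kvm_page_attr);
}
```

Almost all of the cost comes from the 4KB level, which is why the bit-stealing
options keep the per-4KB state in the SPTE instead.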
Option B.) Steal two software usable bits from SPTE and record them in SPTE.
SPTE_SHARED_MASK, SPTE_NOLARGE_PAGE_MASK
pros:
+optimized for EPT violation
cons:
-2bits used in SPTE entry
-complicates the MapGPA path.
Option C.) Steal one software usable bit from SPTE and record it in SPTE.
SPTE_SHARED_MASK
For 2MB/1GB, allocate bitmap in kvm_mmu_page.
struct kvm_mmu_page {
bitmap nolarge
}
pros:
+optimized for EPT violation
cons:
-complicates the MapGPA path.
-information is scattered in SPTE and struct kvm_mmu_page
How to update those bits
------------------------
- MapGPA
- at 4KB level, set or clear shared-mask bit.
- Scan the 512 4KB bits, then at the 2MB level
- set or clear shared-mask bit, clear no-largepage bit or
- clear shared-mask bit, set no-largepage bit
- increment/decrement lpageinfo to prevent/allow large page
- similar for 1GB level
Note: This logic might be a bit tricky.
- EPT violation
- If 2MB large page is allowed, check the no-largepage bit
- If no-largepage bit is set, => go down to 4KB page
- If no-largepage bit is cleared => try to map 2MB page
- If 4KB level is not mapped, map 2MB page
- If some 4KB level is already mapped, go down to 4KB.
Don't try to merge mapping. Or it's possible to try to merge mapping.
Note: 512 4KB entry scanning is not done at EPT violation because it's fast
path.
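The update and lookup rules above can be sketched in C as below. This is a
simplified model under stated assumptions, not the proposed implementation:
the 512 shared-mask bits of one 2MB range are modeled as a plain bitmap rather
than SPTE bits, lpageinfo accounting is left out, and struct lpage_attr,
scan_4k_shared_bits(), and ept_violation_level() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_TABLE	512
#define SHARED_WORDS	(PTES_PER_TABLE / 64)

/* The two bits tracked at the 2MB level. */
struct lpage_attr {
	bool shared;		/* shared-mask bit */
	bool no_largepage;	/* set when private/shared is mixed */
};

/* MapGPA path: scan the 512 4KB shared-mask bits of one 2MB range. */
static struct lpage_attr scan_4k_shared_bits(const uint64_t shared[SHARED_WORDS])
{
	bool any_shared = false, any_private = false;
	struct lpage_attr attr;
	int i;

	for (i = 0; i < SHARED_WORDS; i++) {
		if (shared[i] != 0)
			any_shared = true;
		if (shared[i] != UINT64_MAX)
			any_private = true;
	}

	/* Mixed: clear shared-mask, set no-largepage. */
	attr.no_largepage = any_shared && any_private;
	/* Uniform: shared-mask follows the 4KB pages, no-largepage clear. */
	attr.shared = any_shared && !any_private;
	return attr;
}

enum map_level { MAP_4KB, MAP_2MB };

/* EPT violation path: only the recorded bits are consulted, no scan. */
static enum map_level ept_violation_level(struct lpage_attr attr,
					  bool host_allows_2mb,
					  bool any_4k_mapped)
{
	if (!host_allows_2mb || attr.no_largepage)
		return MAP_4KB;
	/* Some 4KB pages already mapped: go down to 4KB, don't merge here. */
	if (any_4k_mapped)
		return MAP_4KB;
	return MAP_2MB;
}
```

The scan cost is paid only in scan_4k_shared_bits(), i.e. on MapGPA, which
keeps the EPT violation path down to a couple of bit tests.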
Map merging
-----------
Map merging is necessary for TD migration. (Map split is the easy part.) The
current KVM implementation zaps the range (mmu notification or lpage recovery
worker) and expects large page mapping on the next EPT violation.
Option A.) Keep the code similar to the existing map merging logic.
Zap the 2MB EPT entry in some sense and trigger the map merging logic on the
next EPT violation. To keep the encrypted page contents, zapped EPT entries need
to keep the page. Steal one more bit from SPTE: SPTE_PRIVATE_BLOCKED_MASK.
It means that the page is zapped from SPTE, but it is still alive and the SPTE
still references the page.
Option B.) In the callback, directly merge mapping somehow. In this case, mmu
notifier usage doesn't make sense.
NOTE:
- Implement map merging in MapGPA. This doesn't work for dirty page logging.
- We can utilize kvm_nx_lpage_recovery_worker
- We can utilize THP. Probably doesn't work well for fd-based private memory.
Thanks,
Isaku Yamahata
On Mon, Jun 27, 2022 at 02:52:52PM -0700,
[email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> KVM TDX basic feature support
>
> Hello. This is v7 of the patch series of KVM TDX support.
> This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM
>
> Major changes from v6:
> - rebased to v5.19 base
>
> TODO:
> - integrate fd-based guest memory. As the discussion is still on-going, I
> intentionally dropped fd-based guest memory support for now. The integration
> can be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> - 2M large page support. It's work-in-progress.
> For large page support, there are several design choices. Here are the design options.
> Any thoughts/feedback?
>
> KVM MMU Large page support for TDX
>
> * What needs to be done
> - Track private or shared of each page size (4KB, 2MB, 1GB) based on
> TDG.VP.VMCALL<MapGPA>. For large pages(2MB, 1GB), it can be mixed (some
> lower-size pages are private and some shared.) In this case, the page can't
> be large.
> - if necessary, split large page on TDG.VP.VMCALL<MapGPA>
> (split on dirty page tracking is future work)
> - resolving KVM page fault
> When resolving a private page and the page is large in the host, GPA can be
> resolved as a large page in Secure-EPT. Even if the page is large on the host
> side, sometimes a 4KB page can be resolved because it's up to guest TD to
> accept at 4KB, 2MB, or 1GB.
> - collapsing pages into a large page.
> At this point, it's okay to not implement this. When dirty page tracking is
> supported, this needs to be supported.
> - On MapGPA, the page can be collapsed into a large page
> - handle zapping SPTE and try to collapse the pages on the next KVM page fault
> Unlike the EPT case, some trick is needed.
> - For performance, optimize KVM page fault path at the cost of complicating
> MapGPA path.
>
> * options to track private or shared
> At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> 1GB case). For each 4KB page, 1 bit per page is needed: private or shared. For
> large pages (2MB and 1GB), 2 bits per large page are needed (private, shared, or
> mixed). When resolving a KVM page fault, for performance we don't want to check
> the lower-size pages to see if the given GPA can be a large page. Check it on
> MapGPA instead.
>
> Option A). enhance kvm_arch_memory_slot
> enum kvm_page_type {
> KVM_PAGE_TYPE_INVALID,
> KVM_PAGE_TYPE_SHARED,
> KVM_PAGE_TYPE_PRIVATE,
> KVM_PAGE_TYPE_MIXED,
> };
>
> struct kvm_page_attr {
> enum kvm_page_type type;
> };
>
> struct kvm_arch_memory_slot {
> + struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
>
> Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> If !SPTE_MIXED_MASK, it can be large page.
>
> Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
>
>
> * comparison
> A).
> + straightforward to implement
> + SPTE_SHARED_MASK isn't needed
> - memory overhead compared to B). or C).
> - more memory reference on KVM page fault
>
> B).
> + simpler than C) (more complex than A)?)
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - Waste precious SPTE bits.
>
> C).
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - complicates MapGPA
> - scattered data structure
>
> Thanks,
> Isaku Yamahata
>
> Changes from v6:
> - rebased to v5.19
>
> Changes from v5:
> - export __seamcall and use it
> - move mutex lock from callee function of smp_call_on_cpu to the caller.
> - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> - updated comment
> - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> compatibility to only return success
> - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> - make this ioctl systemwide ioctl
> - ABI change to struct kvm_init_vm
> - guest_tsc_khz: use kvm->arch.default_tsc_khz
> - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> - drop exporting kvm_set_tsc_khz().
> - fix kvm_tdp_page_fault() for mtrr emulation
> - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> - update commit message
> - rename shadow_init_value => shadow_nonpresent_value
> - added ept_violation_ve_test mode
> - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> - legacy MMU case
> => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> - #VE warning:
> - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> - merge into Like we discussed, this patch should be merged with patch
> "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> - introduce kvm_gfn_for_root(kvm, root, gfn)
> - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> - use kvm_arch_dirty_log_supported()
> - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> - rename: is_private_prohibit_spte() => spte_shared_mask()
> - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> - dropped this patch as the change was merged into kvm/queue
> - update vt_apicv_post_state_restore()
> - use is_64_bit_hypercall()
> - comment: expand MSMI -> Machine Check System Management Interrupt
> - fixed TDX_SEPT_PFERR
> - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> - rename tdvmcall_exit_reason() => tdvmcall_leaf()
> - remove optional zero check of argument.
> - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> - introduce vcpu_deliver_init to x86_ops
> - sprinkled KVM_BUG_ON()
>
> Changes from v4:
> - rebased to TDX host kernel patch series.
> - include all the patches to make this patch series working.
> - add [MARKER] patches to mark the patch layer clear.
>
> ---
> * What's TDX?
> TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
> Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> Domain (TD) for confidential computing.
>
> A TD runs in a CPU mode that is designed to protect the confidentiality of its
> memory contents and its CPU state from any other software, including the hosting
> Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
>
> We have more detailed explanations below (***).
> We have the high-level design of TDX KVM below (****).
>
> In this patch series, we use "TD" or "guest TD" to differentiate it from the
> current "VM" (Virtual Machine), which is supported by KVM today.
>
>
> * The organization of this patch series
> This patch series is on top of the patch series "TDX host kernel support":
> https://lore.kernel.org/lkml/[email protected]/
>
> This patch series is available at
> https://github.com/intel/tdx/releases/tag/kvm-upstream
> The corresponding patches to qemu are available at
> https://github.com/intel/qemu-tdx/commits/tdx-upstream
>
> The relations of the layers are depicted as follows.
> The arrows below show the order of patch reviews we would like to have.
>
> The below layers are chosen so that the device model, for example, qemu can
> exercise each layering step by step. Check if TDX is supported, create TD VM,
> create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> vcpu exits/hypercalls/interrupts to run TD fully.
>
> TDX vcpu
> interrupt/exits/hypercall<------------\
> ^ |
> | |
> TD finalization |
> ^ |
> | |
> TDX EPT violation<------------\ |
> ^ | |
> | | |
> TD vcpu enter/exit | |
> ^ | |
> | | |
> TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
> ^ | ^
> | | |
> TD VM creation/destruction \---------------KVM TDP MMU hooks
> ^ ^
> | |
> TDX architectural definitions KVM TDP refactoring for TDX
> ^ ^
> | |
> TDX, VMX <--------TDX host kernel KVM MMU GPA stolen bits
> coexistence support
>
>
> The following are explanations of each layer. Each layer has a dummy commit
> whose subject starts with [MARKER]. It is intended to help identify where
> each layer starts.
>
> TDX host kernel support:
> https://lore.kernel.org/lkml/[email protected]/
> The guts of system-wide initialization of TDX module. There is an
> independent patch series for host x86. TDX KVM patches call functions
> this patch series provides to initialize the TDX module.
>
> TDX, VMX coexistence:
> Infrastructure to allow TDX to coexist with VMX and trigger the
> initialization of the TDX module.
> This layer starts with
> "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> TDX architectural definitions:
> Add TDX architectural definitions and helper functions
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> TD VM creation/destruction:
> Guest TD creation/destruction: allocation and release of TDX specific vm
> and vcpu structures. Create an initial guest memory image with TDX
> measurement.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> TD vcpu creation/destruction:
> Guest TD creation/destruction: allocation and release of TDX specific vm
> and vcpu structures. Create an initial guest memory image with TDX
> measurement.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> TDX EPT violation:
> Create an initial guest memory image with TDX measurement. Handle
> secure EPT violations to populate guest pages with TDX SEAMCALLs.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> TD vcpu enter/exit:
> Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
> entering into TD. Restore CPU state after exiting from TD.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> TD vcpu interrupts/exit/hypercall:
> Handle various exits/hypercalls and allow interrupts to be injected so
> that TD vcpu can continue running.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
>
> KVM MMU GPA shared bit:
> Introduce a framework to handle the shared bit, a repurposed bit of GPA. TDX
> repurposed a bit of GPA to indicate shared or private. If it's shared,
> it's the same as the conventional VMX EPT case. VMM can access shared
> guest pages. If it's private, it's handled by Secure-EPT and the guest
> page is encrypted.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> KVM TDP refactoring for TDX:
> TDX Secure EPT requires different constants, e.g. the initial EPT entry
> value. Various refactoring for those differences.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> KVM TDP MMU hooks:
> Introduce a framework to TDP MMU to add hooks in addition to direct EPT
> access. TDX added Secure EPT, which is an enhancement to VMX EPT. Unlike
> conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> use TDX SEAMCALLs to operate on Secure EPT.
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> KVM TDP MMU MapGPA:
> Introduce framework to handle switching guest pages from private/shared
> to shared/private. For a given GPA, a guest page can be assigned to a
> private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
> guest TD converts GPA assignments from private (or shared) to shared (or
> private).
> This layer starts with
> "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
>
> KVM guest private memory: (not shown in the above diagram)
> [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> memory: https://lkml.org/lkml/2022/1/18/395
> Guest private memory requires different memory management in KVM. The
> patch proposes a way for it. Integration with TDX KVM.
>
> (***)
> * TDX module
> A CPU-attested software module called the "TDX module" is designed to implement
> the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> loaded by the kernel or driver at runtime, but in this patch series we assume
> that the TDX module is already loaded and initialized.
>
> The TDX module provides two main new logical modes of operation built upon the
> new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> architecture. TDX root mode is mostly identical to the VMX root operation mode,
> and the TDX functions (described later) are triggered by the new SEAMCALL
> instruction with the desired interface function selected by an input operand
> (leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
> non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> operation (i.e. guest VM), with changes and restrictions to better assure that
> no other software or hardware has direct visibility of the TD memory and state.
>
> TDX transitions between TDX root operation and TDX non-root operation include TD
> Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> TDX root mode. A TD Exit might be asynchronous, triggered by some external
> event (e.g., external interrupt or SMI) or an exception, or it might be
> synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
>
> TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> Domain Host. Those host-side TDX interface functions are categorized into
> various areas just for better organization, such as SYS (TDX module management),
> MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
>
> TDCS (Trust Domain Control Structure) is the main control structure of a guest
> TD, and is encrypted (using the guest TD's ephemeral private key). At a high
> level, TDCS holds information for controlling TD operation as a whole,
> execution, EPTP, MSR bitmaps, etc. that KVM needs to set up. Note that MSR
> bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> same value for all VCPUs of the same TD.
>
> Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
> the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> DMA access, accessible only by using the TDX module interface functions (such as
> TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> such as virtual APIC page, virtualization exception information, etc.
>
> Several VMX control structures (such as Shared EPT and Posted interrupt
> descriptor) are directly managed and accessed by the host VMM. These control
> structures are pointed to by fields in the TD VMCS.
>
> The above means that 1) KVM needs to allocate different data structures for TDs,
> 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> define TD-specific handling for others, and 4) it redirects operations to the
> TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
> vmx_callback();".
>
> * TD Private Memory
> TD private memory is designed to hold TD private content, encrypted by the CPU
> using the TD ephemeral key. An encryption engine holds a table of encryption
> keys, and an encryption key is selected for each memory transaction based on a
> Host Key Identifier (HKID). By design, the host VMM does not have access to the
> encryption keys.
>
> In the first generation of MKTME, HKID is "stolen" from the physical address by
> allocating a configurable number of bits from the top of the physical
> address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> HKID on the host so that MKTME can be opaque or bypassed on the host.
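As a rough sketch of how an HKID occupies the top bits of a physical address:
the 52-bit address width and the 6 KeyID bits below are assumptions chosen for
illustration (the number of bits is configurable), and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout, for illustration only: a 52-bit physical address whose
 * top KEYID_BITS bits carry the HKID.
 */
#define PA_BITS     52
#define KEYID_BITS  6
#define KEYID_SHIFT (PA_BITS - KEYID_BITS)
#define KEYID_MASK  (((UINT64_C(1) << KEYID_BITS) - 1) << KEYID_SHIFT)

/* Replace the HKID field of a physical address. */
static uint64_t pa_set_hkid(uint64_t pa, uint32_t hkid)
{
	return (pa & ~KEYID_MASK) |
	       (((uint64_t)hkid << KEYID_SHIFT) & KEYID_MASK);
}

/* Extract the HKID field of a physical address. */
static uint32_t pa_get_hkid(uint64_t pa)
{
	return (uint32_t)((pa & KEYID_MASK) >> KEYID_SHIFT);
}
```

With HKID 0 the address is unchanged, which matches the choice of 0 as the
shared HKID on the host.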
>
> During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> as either shared or private, based on the value of a new SHARED bit in the Guest
> Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
> (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> with the current VMX. Since guest TDs usually require I/O, and the data exchange
> needs to be done via shared memory, KVM needs to use the current EPT
> functionality even for TDs.
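A minimal sketch of the shared/private qualification, assuming the SHARED bit
is GPA bit 51 or 47 as described later in this letter (set means shared,
cleared means private). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The SHARED bit position is passed in; only 47 and 51 are accepted here. */
static uint64_t gpa_shared_mask(unsigned int shared_bit)
{
	assert(shared_bit == 47 || shared_bit == 51);
	return UINT64_C(1) << shared_bit;
}

/* A GPA is shared when the SHARED bit is set, private when it is cleared. */
static bool gpa_is_shared(uint64_t gpa, unsigned int shared_bit)
{
	return (gpa & gpa_shared_mask(shared_bit)) != 0;
}

static uint64_t gpa_to_shared(uint64_t gpa, unsigned int shared_bit)
{
	return gpa | gpa_shared_mask(shared_bit);
}

static uint64_t gpa_to_private(uint64_t gpa, unsigned int shared_bit)
{
	return gpa & ~gpa_shared_mask(shared_bit);
}
```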
>
> * Secure EPT and Mirroring using the TDP code
> The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
> pages are encrypted and integrity-protected with the TD's ephemeral private
> key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> interface functions, and thus conceptually Secure EPT is a subset of EPT.
> Since execution of such interface functions takes much longer than accessing
> memory directly, in KVM we use the existing TDP code to mirror the Secure EPT
> for the TD.
>
> This way, we can effectively walk Secure EPT without using the TDX interface
> functions.
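The mirroring idea can be shown with a toy model: updates go to a KVM-side copy
and, through a hook, to the real Secure EPT, while lookups read only the copy.
Everything below (the single-level table, the counter, the function names) is
a hypothetical simplification, not the TDP MMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ENTRIES 8

/* Toy single-level table standing in for the KVM-side mirror. */
struct mirror_ept {
	uint64_t entry[NR_ENTRIES];
	unsigned int slow_calls;	/* counts stand-in interface calls */
};

/* Stand-in for a TDX interface function that updates the Secure EPT. */
static void fake_seamcall_update(struct mirror_ept *m, size_t idx, uint64_t val)
{
	(void)idx;
	(void)val;
	m->slow_calls++;
}

/* Updates go to the mirror and, through the hook, to the Secure EPT. */
static int mirror_set(struct mirror_ept *m, size_t idx, uint64_t val)
{
	if (idx >= NR_ENTRIES)
		return -1;
	m->entry[idx] = val;
	fake_seamcall_update(m, idx, val);
	return 0;
}

/* Walks read only the mirror, so they need no interface-function call. */
static uint64_t mirror_lookup(const struct mirror_ept *m, size_t idx)
{
	return idx < NR_ENTRIES ? m->entry[idx] : 0;
}
```

The cost is the extra memory for the mirror; the benefit is that repeated
walks never pay for an interface-function call.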
>
> * VM life cycle and TDX specific operations
> The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
> example, a TD needs to boot in private memory, and the host software cannot copy
> the initial image to private memory.
>
> * TSC Virtualization
> The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> by TD configuration, i.e. when the TD is created, not per VCPU. KVM currently
> owns TSC virtualization for VMs, but the TDX module owns it for TDs.
>
> * MCE support for TDs
> The TDX module doesn't allow the VMM to inject MCE. Instead, a PV mechanism is
> needed for the TD to communicate with the VMM. For now, KVM silently ignores MCE
> requests from the VMM. MSRs related to MCE (e.g., MCE bank registers) can be
> naturally emulated by paravirtualizing MSR access.
>
> For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
> available.
>
> * Restrictions or future work
> Some features are not included to reduce patch size. Those features will be
> addressed in future independent patch series.
> - large page (2M, 1G)
> - qemu gdb stub
> - guest PMU
> - and more
>
> * Prerequisites
> The TDX module must be loaded and initialized, which is out of the scope of
> this patch series. Another independent patch for the common x86 code is
> planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> CONFIG_INTEL_TDX_HOST. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
> module is initialized and the TDX module APIs for the TDX guest life cycle,
> like tdh.mng.init, are ready for KVM to use.
>
> Concretely, global initialization, LP (Logical Processor) initialization, global
> configuration, the key configuration, and TDMR and PAMT initialization are done.
> The state of the TDX module is SYS_READY. Please refer to the TDX module
> specification, the chapter "Intel TDX Module Lifecycle State Machine".
>
> ** Detecting the TDX module readiness
> The TDX host patch series implements the detection of the TDX module
> availability and its initialization so that KVM can use it. It also manages
> the Host KeyID (HKID) assigned to each guest TD.
> The APIs that the TDX host patch series is assumed to provide are
> - int seamrr_enabled()
> Check if the required CPU feature (SEAM mode) is available. This only checks CPU
> feature availability. At this point, the TDX module may not be ready for KVM
> to use.
> - int init_tdx(void);
> Initialize the TDX module so that it is ready for KVM to use.
> - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> Return the system wide information about the TDX module. NULL if the TDX
> module isn't initialized.
> - u32 tdx_get_global_keyid(void);
> Return global key id that is used for the TDX module itself.
> - int tdx_keyid_alloc(void);
> Allocate HKID for guest TD.
> - void tdx_keyid_free(int keyid);
> Free HKID for guest TD.
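To make the alloc/free pair concrete, here is a user-space mock with the same
shape as tdx_keyid_alloc()/tdx_keyid_free(). It is only a sketch: the private
HKID range, the -ENOSPC return on exhaustion and the mock_ names are
assumptions, not what the TDX host patch series implements.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed private HKID range for this mock. */
#define FIRST_PRIVATE_KEYID 4
#define NR_PRIVATE_KEYIDS   4

static bool keyid_used[NR_PRIVATE_KEYIDS];

/* Mock allocator: returns a free HKID, or -ENOSPC when none is left. */
static int mock_tdx_keyid_alloc(void)
{
	for (int i = 0; i < NR_PRIVATE_KEYIDS; i++) {
		if (!keyid_used[i]) {
			keyid_used[i] = true;
			return FIRST_PRIVATE_KEYID + i;
		}
	}
	return -ENOSPC;
}

/* Mock release: out-of-range HKIDs are ignored. */
static void mock_tdx_keyid_free(int keyid)
{
	if (keyid < FIRST_PRIVATE_KEYID ||
	    keyid >= FIRST_PRIVATE_KEYID + NR_PRIVATE_KEYIDS)
		return;
	keyid_used[keyid - FIRST_PRIVATE_KEYID] = false;
}
```

A TD would hold its HKID from creation until teardown, so the number of
private HKIDs bounds the number of concurrently running TDs in this model.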
>
> (****)
> * TDX KVM high-level design
> - Host key ID management
> Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> It is assumed that the TDX host patch series implements the necessary functions,
> u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
> void tdx_keyid_free(int keyid).
>
> - Data structures and VM type
> Because TDX is different from VMX, define its own VM/VCPU structures, struct
> kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
> identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> TDX, is used.
>
> - VM life cycle and TDX specific operations
> Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> parameters, set initial guest memory and measurement.
>
> Creating a TDX VM requires five operations in addition to the conventional VM
> creation.
> - Get KVM system capability to check if TDX VM type is supported
> - VM creation (KVM_CREATE_VM)
> - New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
> - New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
> - VCPU creation (KVM_CREATE_VCPU)
> - New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
> - New: Initialize guest memory as boot state and extend the measurement with
> the memory. KVM_TDX_INIT_MEM_REGION.
> - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> TDX VM contents.
> - VCPU RUN (KVM_RUN)
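The ordering above can be sketched as a tiny state machine. This assumes the
list is a strict sequence for a single-VCPU TD and leaves out the initial
capability check; the enum and function names are hypothetical and only
mirror the command names. No ioctl is issued.

```c
#include <assert.h>

/* Steps in the order listed above (hypothetical names). */
enum td_step {
	TD_STEP_CREATE_VM,
	TD_STEP_GET_CAPABILITY,
	TD_STEP_INIT_VM,
	TD_STEP_CREATE_VCPU,
	TD_STEP_INIT_VCPU,
	TD_STEP_INIT_MEM_REGION,
	TD_STEP_FINALIZE,
	TD_STEP_RUN
};

struct td_lifecycle {
	enum td_step next;	/* the only step accepted next */
};

/* Returns 0 if the step is allowed now, -1 if it is out of order. */
static int td_do_step(struct td_lifecycle *lc, enum td_step step)
{
	if (step != lc->next)
		return -1;
	/* RUN may repeat; every other step advances the sequence. */
	if (step != TD_STEP_RUN)
		lc->next = (enum td_step)(step + 1);
	return 0;
}
```

In this model, running a VCPU before the finalize step is rejected, which is
the property the measurement flow depends on.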
>
> - Protected guest state
> Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> can't operate on it, for example by accessing CPU registers, injecting
> exceptions, or accessing guest memory. When such operations are requested via
> KVM API ioctls, they are silently ignored, returning zero or the initial reset
> value.
>
> VM/VCPU state and callbacks for TDX specific operations.
> Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
> operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
>
> Operations on the CPU state
> Silently ignore operations on the guest state. For example, the write to
> CPU registers is ignored and the read from CPU registers returns 0.
>
> . ignore access to CPU registers except for allowed ones.
> . TSC: add a check for whether the TSC is immutable and, if so, return an
> error. The KVM implementation updates the internal TSC state and it's
> difficult to back out those changes, so skip that logic instead.
> . dirty logging: add a check for whether dirty logging is supported.
> . exceptions/SMI/MCE/SIPI/INIT: silently ignore
>
> Note: virtual external interrupt and NMI can be injected into TDX guests.
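The "write is ignored, read returns 0" behaviour for CPU registers can be
sketched as below. The toy_vcpu structure and accessors are hypothetical, and
for brevity this sketch treats every register as protected rather than
modelling the allowed exceptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_REGS 16

struct toy_vcpu {
	bool guest_state_protected;	/* true for a TD VCPU */
	uint64_t regs[TOY_NR_REGS];
};

/* Reads of protected state return 0 instead of failing. */
static uint64_t toy_get_reg(const struct toy_vcpu *v, unsigned int idx)
{
	if (idx >= TOY_NR_REGS || v->guest_state_protected)
		return 0;
	return v->regs[idx];
}

/* Writes to protected state are silently dropped. */
static void toy_set_reg(struct toy_vcpu *v, unsigned int idx, uint64_t val)
{
	if (idx >= TOY_NR_REGS || v->guest_state_protected)
		return;
	v->regs[idx] = val;
}
```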
>
> - KVM MMU integration
> One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> the guest physical address is private (the bit is cleared) or shared (the bit is
> set). The bits are called stolen bits.
>
> - Stolen bits framework
> Systematically tracks whether the guest physical address in use is shared or
> private.
>
> - Shared EPT and secure EPT
> There are two EPTs: Shared EPT (the conventional one) and Secure
> EPT (the new one). Shared EPT is handled the same way as the
> conventional EPT, for GPAs with the stolen bit set. Secure EPT points
> to private guest pages. To resolve an EPT violation, KVM walks one of
> the two EPTs based on the faulting GPA. Because it's costly to access
> Secure EPT with SEAMCALLs while walking EPTs for a private guest
> physical address, another private EPT is used as a shadow of Secure
> EPT with the existing logic at the cost of extra memory.
>
> The following depicts the relationship.
>
> KVM | TDX module
> | | |
> -------------+---------- | |
> | | | |
> V V | |
> shared GPA private GPA | |
> CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
> | | | |
> | | | |
> V V | V
> shared EPT private EPT--------mirror----->Secure EPT
> | | | |
> | \--------------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> |
> non-encrypted memory | encrypted memory
> |
>
> - Operating on Secure EPT
> Use the TDX module APIs to operate on Secure EPT. To call the TDX API
> while resolving an EPT violation, add hooks for the additional operations and
> wire them to the TDX backend.
>
> * References
>
> [1] TDX specification
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
> https://cdrdv2.intel.com/v1/dl/getContent/726790
> [3] Intel CPU Architectural Extensions Specification
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 Specification
> https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> [5] Intel TDX Loader Interface Specification
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
> https://cdrdv2.intel.com/v1/dl/getContent/726790
> [7] Intel TDX Virtual Firmware Design Guide
> https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> [8] intel public github
> kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> TDX guest branch: https://github.com/intel/tdx/tree/guest
> qemu TDX https://github.com/intel/qemu-tdx
> [9] TDVF
> https://github.com/tianocore/edk2-staging/tree/TDVF
> This was merged into EDK2 main branch. https://github.com/tianocore/edk2
>
> Chao Gao (3):
> KVM: x86: Move check_processor_compatibility from init ops to runtime
> ops
> Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> arch funcs"
> KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> wrmsr
>
> Isaku Yamahata (72):
> KVM: Refactor CPU compatibility check on module initialization
> x86/virt/vmx/tdx: export platform_tdx_enabled()
> KVM: TDX: Detect CPU feature on kernel module initialization
> KVM: x86: Refactor KVM VMX module init/exit functions
> KVM: TDX: Add placeholders for TDX VM/vcpu structure
> x86/virt/tdx: Add a helper function to return system wide info about
> TDX module
> KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> KVM: TDX: Make TDX VM type supported
> [MARKER] The start of TDX KVM patch series: TDX architectural
> definitions
> KVM: TDX: Define TDX architectural definitions
> KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> KVM: TDX: Add helper functions to print TDX SEAMCALL error
> [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> x86/cpu: Add helper functions to allocate/free TDX private host key id
> KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> KVM: TDX: Make pmu_intel.c ignore guest TD case
> [MARKER] The start of TDX KVM patch series: TD vcpu
> creation/destruction
> KVM: TDX: allocate/free TDX vcpu structure
> KVM: TDX: allocate/free TDX vcpu structure
> [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> TDX
> KVM: x86/mmu: Disallow fast page fault on private GPA
> KVM: VMX: Introduce test mode related to EPT violation VE
> [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> KVM: x86/mmu: Focibly use TDP MMU for TDX
> KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> [MARKER] The start of TDX KVM patch series: TDX EPT violation
> KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> KVM: TDX: TDP MMU TDX support
> [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> KVM: x86/mmu: steal software usable bit to record if GFN is for shared
> or not
> KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> [MARKER] The start of TDX KVM patch series: TD finalization
> KVM: TDX: Create initial guest memory
> KVM: TDX: Finalize VM initialization
> [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> KVM: TDX: Add helper assembly function to TDX vcpu
> KVM: TDX: Implement TDX vcpu enter/exit path
> KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> KVM: TDX: restore host xsave state when exit from the guest TD
> KVM: TDX: restore user ret MSRs
> [MARKER] The start of TDX KVM patch series: TD vcpu
> exits/interrupts/hypercalls
> KVM: TDX: complete interrupts after tdexit
> KVM: TDX: restore debug store when TD exit
> KVM: TDX: handle vcpu migration over logical processor
> KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> behavior
> KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> KVM: TDX: Implement interrupt injection
> KVM: TDX: Implements vcpu request_immediate_exit
> KVM: TDX: Implement methods to inject NMI
> KVM: TDX: Add a place holder to handle TDX VM exit
> KVM: TDX: handle EXIT_REASON_OTHER_SMI
> KVM: TDX: handle ept violation/misconfig exit
> KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> KVM: TDX: Add a place holder for handler of TDX hypercalls
> (TDG.VP.VMCALL)
> KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> KVM: TDX: Handle TDX PV CPUID hypercall
> KVM: TDX: Handle TDX PV HLT hypercall
> KVM: TDX: Handle TDX PV port io hypercall
> KVM: TDX: Implement callbacks for MSR operations for TDX
> KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> KVM: TDX: Handle TDX PV report fatal error hypercall
> KVM: TDX: Handle TDX PV map_gpa hypercall
> KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> KVM: TDX: Silently discard SMI request
> KVM: TDX: Silently ignore INIT/SIPI
> Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
>
> Rick Edgecombe (1):
> KVM: x86/mmu: Add address conversion functions for TDX shared bits
>
> Sean Christopherson (25):
> KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> KVM: Enable hardware before doing arch VM initialization
> KVM: x86: Introduce vm_type to differentiate default VMs from
> confidential VMs
> KVM: TDX: Add TDX "architectural" error codes
> KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> KVM: TDX: create/destroy VM structure
> KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> KVM: TDX: Do TDX specific vcpu initialization
> KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> KVM: x86/mmu: Allow non-zero value for non-present SPTE
> KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> KVM: x86/mmu: Allow per-VM override of the TDP max page level
> KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> private mmu
> KVM: x86/mmu: Disallow dirty logging for x86 TDX
> KVM: VMX: Split out guts of EPT violation to common/exposed function
> KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> KVM: TDX: Add load_mmu_pgd method for TDX
> KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> KVM: TDX: Add support for find pending IRQ in a protected local APIC
> KVM: x86: Assume timer IRQ was injected if APIC state is protected
> KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> argument
> KVM: VMX: Move NMI/exception handler to common helper
> KVM: x86: Split core of hypercall emulation to helper function
> KVM: TDX: Handle TDX PV MMIO hypercall
> KVM: TDX: Add methods to ignore accesses to CPU state
>
> Xiaoyao Li (1):
> KVM: TDX: initialize VM with TDX specific parameters
>
> Documentation/virt/kvm/api.rst | 30 +-
> .../virt/kvm/intel-tdx-layer-status.rst | 33 +
> Documentation/virt/kvm/intel-tdx.rst | 381 +++
> Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 ++++
> arch/arm64/kvm/arm.c | 2 +-
> arch/mips/kvm/mips.c | 14 +-
> arch/powerpc/kvm/powerpc.c | 2 +-
> arch/riscv/kvm/main.c | 2 +-
> arch/s390/kvm/kvm-s390.c | 2 +-
> arch/x86/events/intel/ds.c | 1 +
> arch/x86/include/asm/kvm-x86-ops.h | 10 +
> arch/x86/include/asm/kvm_host.h | 56 +-
> arch/x86/include/asm/tdx.h | 67 +
> arch/x86/include/asm/vmx.h | 14 +
> arch/x86/include/uapi/asm/kvm.h | 95 +
> arch/x86/include/uapi/asm/vmx.h | 5 +-
> arch/x86/kvm/Kconfig | 4 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/irq.c | 3 +
> arch/x86/kvm/lapic.c | 37 +-
> arch/x86/kvm/lapic.h | 2 +
> arch/x86/kvm/mmu.h | 42 +-
> arch/x86/kvm/mmu/mmu.c | 360 ++-
> arch/x86/kvm/mmu/mmu_internal.h | 123 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
> arch/x86/kvm/mmu/spte.c | 46 +-
> arch/x86/kvm/mmu/spte.h | 65 +-
> arch/x86/kvm/mmu/tdp_iter.c | 1 +
> arch/x86/kvm/mmu/tdp_iter.h | 5 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 690 ++++-
> arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
> arch/x86/kvm/svm/svm.c | 13 +-
> arch/x86/kvm/vmx/common.h | 174 ++
> arch/x86/kvm/vmx/evmcs.c | 2 +-
> arch/x86/kvm/vmx/evmcs.h | 2 +-
> arch/x86/kvm/vmx/main.c | 1071 +++++++
> arch/x86/kvm/vmx/pmu_intel.c | 39 +-
> arch/x86/kvm/vmx/pmu_intel.h | 28 +
> arch/x86/kvm/vmx/posted_intr.c | 43 +-
> arch/x86/kvm/vmx/posted_intr.h | 13 +
> arch/x86/kvm/vmx/tdx.c | 2465 +++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 275 ++
> arch/x86/kvm/vmx/tdx_arch.h | 157 ++
> arch/x86/kvm/vmx/tdx_errno.h | 29 +
> arch/x86/kvm/vmx/tdx_error.c | 22 +
> arch/x86/kvm/vmx/tdx_ops.h | 188 ++
> arch/x86/kvm/vmx/vmenter.S | 146 +
> arch/x86/kvm/vmx/vmx.c | 737 ++---
> arch/x86/kvm/vmx/vmx.h | 39 +-
> arch/x86/kvm/vmx/x86_ops.h | 235 ++
> arch/x86/kvm/x86.c | 148 +-
> arch/x86/virt/vmx/tdx/seamcall.S | 2 +
> arch/x86/virt/vmx/tdx/tdx.c | 54 +-
> arch/x86/virt/vmx/tdx/tdx.h | 52 -
> include/linux/kvm_host.h | 4 +-
> include/uapi/linux/kvm.h | 2 +
> tools/arch/x86/include/uapi/asm/kvm.h | 95 +
> tools/include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 67 +-
> 59 files changed, 7877 insertions(+), 804 deletions(-)
> create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
> create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> create mode 100644 arch/x86/kvm/vmx/common.h
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> create mode 100644 arch/x86/kvm/vmx/tdx.c
> create mode 100644 arch/x86/kvm/vmx/tdx.h
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> create mode 100644 arch/x86/kvm/vmx/x86_ops.h
>
> --
> 2.25.1
>
--
Isaku Yamahata <[email protected]>
s/Focibly/Forcibly, but that's a moot point because KVM shouldn't override the
module param. KVM should instead _require_ the TDP MMU to be enabled. E.g.
if userspace disables the TDP MMU to work around a fatal bug, then forcing the TDP
MMU may silently expose KVM to said bug.
And overriding tdp_enabled is just mind-bogglingly broken, all of the SPTE masks
will be wrong.
On Mon, Jun 27, 2022, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> In this patch series, TDX supports only TDP MMU and doesn't support legacy
> MMU. Forcibly use TDP MMU for TDX irrelevant of kernel parameter to
> disable TDP MMU.
Do not refer to the "patch series"; instead phrase the statement with respect to
what KVM supports.
Require the TDP MMU for TDX guests, the so called "shadow" MMU does not
support mapping guest private memory, i.e. does not support Secure-EPT.
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 82f1bfac7ee6..7eb41b176d1e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -18,8 +18,13 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> {
> struct workqueue_struct *wq;
>
> - if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> - return 0;
> + /*
> + * Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
> + * of TDX.
> + */
> + if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
> + (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
> + return false;
Yeah, no.
if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
return kvm->arch.vm_type == KVM_X86_TDX_VM ? -EINVAL : 0;
>
> wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> if (!wq)
> --
> 2.25.1
>
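The behaviour suggested above (fail VM creation for a TD when the TDP MMU is
disabled, instead of overriding the module param) can be exercised in isolation
with a mock. The TOY_ constants, the function name and the mmu_on out-parameter
are hypothetical; only the return-value logic follows the snippet above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VM type constants. */
enum { TOY_DEFAULT_VM = 0, TOY_TDX_VM = 1 };

/*
 * Mock of the init path: a TD requires the TDP MMU, so a disabled TDP MMU
 * is an error for a TD and a plain "not used" for any other VM type.
 */
static int toy_init_tdp_mmu(int vm_type, bool tdp_enabled,
			    bool tdp_mmu_enabled, bool *mmu_on)
{
	*mmu_on = false;
	if (!tdp_enabled || !tdp_mmu_enabled)
		return vm_type == TOY_TDX_VM ? -EINVAL : 0;
	*mmu_on = true;
	return 0;
}
```

The module param is never written, so a user who disabled the TDP MMU on
purpose keeps that setting and simply cannot create TDs.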
On Tue, Jun 28, 2022 at 03:53:31PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Currently, KVM VMX module initialization/exit functions are a single
> > function each. Refactor KVM VMX module initialization functions into KVM
> > common part and VMX part so that TDX specific part can be added cleanly.
> > Opportunistically refactor module exit function as well.
> >
> > The current module initialization flow is, 1.) calculate the sizes of VMX
> > kvm structure and VMX vcpu structure, 2.) hyper-v specific initialization
> > 3.) report those sizes to the KVM common layer and KVM common
> > initialization, and 4.) VMX specific system-wide initialization.
> >
> > Refactor the KVM VMX module initialization function into functions with a
> > wrapper function to separate VMX logic in vmx.c from a file, main.c, common
> > among VMX and TDX. We have a wrapper function, "vt_init() {vmx kvm/vcpu
> > size calculation; hv_vp_assist_page_init(); kvm_init(); vmx_init(); }" in
> > main.c, and hv_vp_assist_page_init() and vmx_init() in vmx.c.
> > hv_vp_assist_page_init() initializes hyper-v specific assist pages,
> > kvm_init() does system-wide initialization of the KVM common layer, and
> > vmx_init() does system-wide VMX initialization.
> >
> > The KVM architecture common layer allocates struct kvm with reported size
> > for architecture-specific code. The KVM VMX module defines its structure
> > as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> > struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define
> > TDX specific kvm and vcpu structures, add tdx_pre_kvm_init() to report the
> > sizes of them to the KVM common layer.
> >
> > The current module exit function is also a single function, a combination
> > of VMX specific logic and common KVM logic. Refactor it into VMX specific
> > logic and KVM common logic. This is just refactoring to keep the VMX
> > specific logic in vmx.c from main.c.
>
> This patch, coupled with the patch:
>
> KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
>
> Basically provides an infrastructure to support both VMX and TDX. Why we cannot
> merge them into one patch? What's the benefit of splitting them?
>
> At least, why the two patches cannot be put together closely?
It is trivial for the change of "KVM: VMX: Move out vmx_x86_ops to 'main.c' to
wrap VMX and TDX" to introduce no functional change. But it's not trivial
for this patch to introduce no functional change.
So I moved this patch right after the main.c patch.
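For reference, the wrapper flow from the commit message can be sketched as
follows. The toy_ functions are stubs that only record the order in which they
run; they are hypothetical and do not reflect the real signatures of
kvm_init(), vmx_init() or hv_vp_assist_page_init().

```c
#include <assert.h>

enum init_stage { STAGE_SIZES, STAGE_HV, STAGE_KVM, STAGE_VMX, NR_STAGES };

static enum init_stage stage_order[NR_STAGES];
static int nr_stages_done;

/* Record which stage ran, in call order. */
static void record_stage(enum init_stage s)
{
	if (nr_stages_done < NR_STAGES)
		stage_order[nr_stages_done++] = s;
}

/* Stubs for the pieces that would live in vmx.c and the KVM common layer. */
static int toy_hv_vp_assist_page_init(void) { record_stage(STAGE_HV); return 0; }
static int toy_kvm_init(void) { record_stage(STAGE_KVM); return 0; }
static int toy_vmx_init(void) { record_stage(STAGE_VMX); return 0; }

/* Wrapper that would live in main.c: it only sequences the pieces. */
static int toy_vt_init(void)
{
	int r;

	record_stage(STAGE_SIZES);	/* VMX kvm/vcpu size calculation */
	r = toy_hv_vp_assist_page_init();
	if (r)
		return r;
	r = toy_kvm_init();
	if (r)
		return r;
	return toy_vmx_init();
}
```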
--
Isaku Yamahata <[email protected]>
On Tue, Jun 28, 2022 at 02:52:28PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
> > some operations (e.g., memory read/write, register state access, etc).
> >
> > Introduce vm_type to track the type of the VM to x86 KVM. Other arch KVMs
> > already use vm_type, KVM_INIT_VM accepts vm_type, and x86 KVM callback
> > vm_init accepts vm_type. So follow them. Further, a different policy can
> > be made based on vm_type. Define KVM_X86_DEFAULT_VM for default VM as
> > default and define KVM_X86_TDX_VM for Intel TDX VM. The wrapper function
> > will be defined as "bool is_td(kvm) { return vm_type == VM_TYPE_TDX; }"
> >
> > Add a capability KVM_CAP_VM_TYPES to effectively allow device model,
> > e.g. qemu, to query what VM types are supported by KVM. This (introduce a
> > new capability and add vm_type) is chosen to align with other arch KVMs
> > that have VM types already. Other arch KVMs uses different name to query
> > supported vm types and there is no common name for it, so new name was
> > chosen.
> >
> > Co-developed-by: Xiaoyao Li <[email protected]>
> > Signed-off-by: Xiaoyao Li <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++++
> > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/include/uapi/asm/kvm.h | 3 +++
> > arch/x86/kvm/svm/svm.c | 6 ++++++
> > arch/x86/kvm/vmx/main.c | 1 +
> > arch/x86/kvm/vmx/tdx.h | 6 +-----
> > arch/x86/kvm/vmx/vmx.c | 5 +++++
> > arch/x86/kvm/vmx/x86_ops.h | 1 +
> > arch/x86/kvm/x86.c | 9 ++++++++-
> > include/uapi/linux/kvm.h | 1 +
> > tools/arch/x86/include/uapi/asm/kvm.h | 3 +++
> > tools/include/uapi/linux/kvm.h | 1 +
> > 13 files changed, 54 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 9cbbfdb663b6..b9ab598883b2 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -147,10 +147,31 @@ described as 'basic' will be available.
> > The new VM has no virtual cpus and no memory.
> > You probably want to use 0 as machine type.
> >
> > +X86:
> > +^^^^
> > +
> > +Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
> > +bitmap of supported vm types. The 1-setting of bit @n means vm type with
> > +value @n is supported.
>
>
> Perhaps I am missing something, but I don't understand how the below changes
> (except the x86 part above) in Documentation are related to this patch.
This is to summarize the divergence among archs. Those archs (s390, mips, and
arm64) introduce essentially the same KVM capabilities, but under different
names. This patch makes things worse, so I thought it a good idea to summarize
it. Probably this documentation part can be split out into its own patch.
Thoughts?
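The bitmap semantics quoted above (bit n set means the VM type with value n is
supported) can be sketched as below. The TOY_ values 0 and 1 and the helper
name are assumptions for illustration, and the bitmap is modelled as a 32-bit
value.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed VM type values, for illustration only. */
#define TOY_X86_DEFAULT_VM 0
#define TOY_X86_TDX_VM     1

/* Bit n of the capability bitmap set means VM type n is supported. */
static bool toy_vm_type_supported(unsigned int cap_bitmap, unsigned int type)
{
	return type < 32 && ((cap_bitmap >> type) & 1u);
}
```

A device model would query the capability once and then test the bit for the
VM type it wants before creating the VM.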
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 54d7a26ed9ee..2f43db5bbefb 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -17,11 +17,7 @@ struct vcpu_tdx {
> >
> > static inline bool is_td(struct kvm *kvm)
> > {
> > - /*
> > - * TDX VM type isn't defined yet.
> > - * return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > - */
> > - return false;
> > + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > }
>
> If you put this patch before patch:
>
> [PATCH v7 009/102] KVM: TDX: Add placeholders for TDX VM/vcpu structure
>
> Then you don't need to introduce this chunk in above patch and then remove it
> here, which is unnecessary and ugly.
>
> And you can even only introduce KVM_X86_DEFAULT_VM but not KVM_X86_TDX_VM in
> this patch, so you can make this patch as a infrastructural patch to report VM
> type. The KVM_X86_TDX_VM can come with the patch where is_td() is introduced
> (in your above patch 9).
>
> To me, it's more clean way to write patch. For instance, this infrastructural
> patch can be theoretically used by other series if they have similar thing to
> support, but doesn't need to carry is_td() and KVM_X86_TDX_VM burden that you
> made.
There are two choices. One is to put this patch before 9 as you suggested; the
other is to put it here right before patch 13, which uses vm_type_supported().
Thanks,
--
Isaku Yamahata <[email protected]>
On Mon, 2022-07-11 at 18:01 -0700, Isaku Yamahata wrote:
> On Tue, Jun 28, 2022 at 02:52:28PM +1200,
> Kai Huang <[email protected]> wrote:
>
> > On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > > From: Sean Christopherson <[email protected]>
> > >
> > > Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
> > > some operations (e.g., memory read/write, register state access, etc).
> > >
> > > Introduce vm_type to track the type of the VM to x86 KVM. Other arch KVMs
> > > already use vm_type, KVM_INIT_VM accepts vm_type, and x86 KVM callback
> > > vm_init accepts vm_type. So follow them. Further, a different policy can
> > > be made based on vm_type. Define KVM_X86_DEFAULT_VM for default VM as
> > > default and define KVM_X86_TDX_VM for Intel TDX VM. The wrapper function
> > > will be defined as "bool is_td(kvm) { return vm_type == VM_TYPE_TDX; }"
> > >
> > > Add a capability KVM_CAP_VM_TYPES to effectively allow device model,
> > > e.g. qemu, to query what VM types are supported by KVM. This (introduce a
> > > new capability and add vm_type) is chosen to align with other arch KVMs
> > > that have VM types already. Other arch KVMs uses different name to query
> > > supported vm types and there is no common name for it, so new name was
> > > chosen.
> > >
> > > Co-developed-by: Xiaoyao Li <[email protected]>
> > > Signed-off-by: Xiaoyao Li <[email protected]>
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > Reviewed-by: Paolo Bonzini <[email protected]>
> > > ---
> > > Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++++
> > > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > arch/x86/include/uapi/asm/kvm.h | 3 +++
> > > arch/x86/kvm/svm/svm.c | 6 ++++++
> > > arch/x86/kvm/vmx/main.c | 1 +
> > > arch/x86/kvm/vmx/tdx.h | 6 +-----
> > > arch/x86/kvm/vmx/vmx.c | 5 +++++
> > > arch/x86/kvm/vmx/x86_ops.h | 1 +
> > > arch/x86/kvm/x86.c | 9 ++++++++-
> > > include/uapi/linux/kvm.h | 1 +
> > > tools/arch/x86/include/uapi/asm/kvm.h | 3 +++
> > > tools/include/uapi/linux/kvm.h | 1 +
> > > 13 files changed, 54 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 9cbbfdb663b6..b9ab598883b2 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -147,10 +147,31 @@ described as 'basic' will be available.
> > > The new VM has no virtual cpus and no memory.
> > > You probably want to use 0 as machine type.
> > >
> > > +X86:
> > > +^^^^
> > > +
> > > +Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
> > > +bitmap of supported vm types. The 1-setting of bit @n means vm type with
> > > +value @n is supported.
> >
> >
> > Perhaps I am missing something, but I don't understand how the below changes
> > (except the x86 part above) in Documentation are related to this patch.
>
> This is to summarize the divergence among archs. Those archs (s390, mips, and
> arm64) introduce essentially the same KVM capabilities, but under different
> names. This patch makes things worse, so I thought it a good idea to summarize
> it. Probably this documentation part can be split out into its own patch.
> Thoughts?
I will leave this to the maintainers. Though personally I would split different
things into different patches.
>
>
> > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > > index 54d7a26ed9ee..2f43db5bbefb 100644
> > > --- a/arch/x86/kvm/vmx/tdx.h
> > > +++ b/arch/x86/kvm/vmx/tdx.h
> > > @@ -17,11 +17,7 @@ struct vcpu_tdx {
> > >
> > > static inline bool is_td(struct kvm *kvm)
> > > {
> > > - /*
> > > - * TDX VM type isn't defined yet.
> > > - * return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > > - */
> > > - return false;
> > > + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > > }
> >
> > If you put this patch before patch:
> >
> > [PATCH v7 009/102] KVM: TDX: Add placeholders for TDX VM/vcpu structure
> >
> > Then you don't need to introduce this chunk in above patch and then remove it
> > here, which is unnecessary and ugly.
> >
> > And you can even only introduce KVM_X86_DEFAULT_VM but not KVM_X86_TDX_VM in
> > this patch, so you can make this patch an infrastructural patch to report VM
> > type. KVM_X86_TDX_VM can come with the patch where is_td() is introduced
> > (in your above patch 9).
> >
> > To me, it's a cleaner way to write the patch. For instance, this infrastructural
> > patch can theoretically be used by other series if they have a similar thing to
> > support, without carrying the is_td() and KVM_X86_TDX_VM burden that you
> > added.
>
> There are two choices. One is to put this patch before 9 as you suggested, other
> is to put it here right before the patch 13 that uses vm_type_supported().
>
> Thanks,
To me this belongs to the category of "infrastructural patch", which does "Add new
ABI to support reporting VM types". It can initially support the default VM only.
TDX VM can come later. But I will leave it to the maintainers.
On Mon, 2022-07-11 at 17:38 -0700, Isaku Yamahata wrote:
> On Tue, Jun 28, 2022 at 03:53:31PM +1200,
> Kai Huang <[email protected]> wrote:
>
> > On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > Currently, KVM VMX module initialization/exit functions are a single
> > > function each. Refactor KVM VMX module initialization functions into KVM
> > > common part and VMX part so that TDX specific part can be added cleanly.
> > > Opportunistically refactor module exit function as well.
> > >
> > > The current module initialization flow is, 1.) calculate the sizes of VMX
> > > kvm structure and VMX vcpu structure, 2.) hyper-v specific initialization
> > > 3.) report those sizes to the KVM common layer and KVM common
> > > initialization, and 4.) VMX specific system-wide initialization.
> > >
> > > Refactor the KVM VMX module initialization function into functions with a
> > > wrapper function to separate VMX logic in vmx.c from a file, main.c, common
> > > among VMX and TDX. We have a wrapper function, "vt_init() {vmx kvm/vcpu
> > > size calculation; hv_vp_assist_page_init(); kvm_init(); vmx_init(); }" in
> > > main.c, and hv_vp_assist_page_init() and vmx_init() in vmx.c.
> > > hv_vp_assist_page_init() initializes hyper-v specific assist pages,
> > > kvm_init() does system-wide initialization of the KVM common layer, and
> > > vmx_init() does system-wide VMX initialization.
> > >
> > > The KVM architecture common layer allocates struct kvm with reported size
> > > for architecture-specific code. The KVM VMX module defines its structure
> > > as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> > > struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define
> > > TDX specific kvm and vcpu structures, add tdx_pre_kvm_init() to report the
> > > sizes of them to the KVM common layer.
> > >
> > > The current module exit function is also a single function, a combination
> > > of VMX specific logic and common KVM logic. Refactor it into VMX specific
> > > logic and KVM common logic. This is just refactoring to keep the VMX
> > > specific logic in vmx.c from main.c.
> >
> > This patch, coupled with the patch:
> >
> > KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> >
> > Basically provides an infrastructure to support both VMX and TDX. Why we cannot
> > merge them into one patch? What's the benefit of splitting them?
> >
> > At least, why the two patches cannot be put together closely?
>
> It is trivial for the change of "KVM: VMX: Move out vmx_x86_ops to 'main.c' to
> wrap VMX and TDX" to introduce no functional change. But it's not trivial
> for this patch to introduce no functional change.
This doesn't sound right. If I understand correctly, this patch supposedly
shouldn't bring any functional change, right? Could you explain what functional
change this patch brings?
On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
>Hi. Because my description on large page support was terse, I wrote up more
>detailed one. Any feedback/thoughts on large page support?
>
>TDP MMU large page support design
>
>Two main discussion points
>* how to track page status. private vs shared, no-largepage vs can-be-largepage
...
>
>Tracking private/shared and large page mappable
>-----------------------------------------------
>VMM needs to track that page is mapped as private or shared at 4KB granularity.
>For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
>track the page can be mapped as a large page (regarding private/shared). VMM
>updates it on MapGPA and references it on the EPT violation path. (****)
Isaku,
+ Peng Chao
Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
private or all shared?
KVM always retrieves the mapping level in CR3 and enforces that EPT's
page level is not greater than that in CR3. My point is if UPM already enforces
no mixed pages in a large page, then KVM needn't do that again (UPM can
be trusted).
Maybe I am misunderstanding something?
On Tue, Jul 12, 2022 at 01:07:20PM +0800, Chao Gao wrote:
> On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> >Hi. Because my description on large page support was terse, I wrote up more
> >detailed one. Any feedback/thoughts on large page support?
> >
> >TDP MMU large page support design
> >
> >Two main discussion points
> >* how to track page status. private vs shared, no-largepage vs can-be-largepage
>
> ...
>
> >
> >Tracking private/shared and large page mappable
> >-----------------------------------------------
> >VMM needs to track that page is mapped as private or shared at 4KB granularity.
> >For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> >track the page can be mapped as a large page (regarding private/shared). VMM
> >updates it on MapGPA and references it on the EPT violation path. (****)
>
> Isaku,
>
> + Peng Chao
>
> Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
> private or all shared?
>
> KVM always retrieves the mapping level in CR3 and enforces that EPT's
> page level is not greater than that in CR3. My point is if UPM already enforces
> no mixed pages in a large page, then KVM needn't do that again (UPM can
> be trusted).
The backing store in UPM can tell KVM which page level it can
support for a given private gpa, similar to host_pfn_mapping_level() for
shared addresses.
However, this solely represents the backing store's capability. KVM
still needs additional info to decide whether the range can be safely mapped
as 2M/1G, e.g. all the pages in the 2M/1G range must be
private. Currently this is not something the backing store can tell.
Actually, in UPM v7 we let KVM record this info, so one possible solution
is to make use of it.
https://lkml.org/lkml/2022/7/6/259
Then to map a page as 2M, KVM needs to check:
- The memory backing store supports that level
- All pages in the 2M range are private, as recorded through
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
- There are no existing partial 4K map(s) in the 2M range
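As a rough illustration of the three checks above, here is a minimal standalone C sketch. It is not KVM code: struct page_state, can_map_2m() and the backing_supports_2m flag are hypothetical names made up for this example, and the per-page flags only stand in for whatever KVM would actually record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512	/* number of 4KB pages in one 2MB range */

/* Hypothetical per-4KB-page state; not a real KVM structure. */
struct page_state {
	bool is_private;	/* as recorded via KVM_MEMORY_ENCRYPT_{UN,}REG_REGION */
	bool mapped_4k;		/* a 4K mapping already exists for this page */
};

/*
 * Hypothetical helper: return true only if all three conditions hold
 * for one 2MB-aligned range described by pages[0..511].
 */
static bool can_map_2m(bool backing_supports_2m,
		       const struct page_state *pages)
{
	assert(pages != NULL);

	/* Check 1: the backing store must support the 2M level. */
	if (!backing_supports_2m)
		return false;

	for (size_t i = 0; i < PAGES_PER_2M; i++) {
		/* Check 2: every 4KB page in the range must be private. */
		if (!pages[i].is_private)
			return false;
		/* Check 3: no partial 4K mapping may already exist. */
		if (pages[i].mapped_4k)
			return false;
	}
	return true;
}
```

The loop is linear in the 512 entries, so in a real implementation the result would presumably be cached rather than recomputed on every fault.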
Chao
>
> Maybe I am misunderstanding something?
On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> Hi. Because my description on large page support was terse, I wrote up more
> detailed one. Any feedback/thoughts on large page support?
>
> TDP MMU large page support design
>
> Two main discussion points
> * how to track page status. private vs shared, no-largepage vs can-be-largepage
> * how to trigger merging mapping from 4KB/2MB to 2MB/1GB
>
> Expected private-vs-shared page usage
> -------------------------------------
> On TD boot all pages are private and TD converts pages into shared if necessary.
> * Most of the guest pages remain private.
> * Only limited pages are converted at kernel boot
> ** bounce buffer for IO (virt-io). It's allocated as swiotlb. Its size is
> 64MB or 6% of total guest memory.
> ** KVM PV shared page. (the current guest TD doesn't use KVM PV shared page.)
> * Only a small number of pages are dynamically converted from private to shared
> and vice versa. This usage is very limited. e.g. GetQuote, the lack of
> swiotlb buffer
>
>
> Theory of Secure-EPT operations related to large page
> -----------------------------------------------------
> TDX Secure-EPT has differences from VMX EPT.
> To add a page to Secure-EPT
>
> * Here is the operation to resolve the EPT violation.
> 1. TD: Accepts GPA. TD needs to accept GPA before accessing GPA because TD
> needs to detect that VMM unmaps GPA and maps GPA again.
> 2. EPT violation is triggered. TD exit to VMM.
> 3. VMM: allocate a page for GPA and TDH.MEM.PAGE.AUG it to GPA. Resume TD vcpu.
> (3a. TD: #VE<EPT violation> is injected. #VE handler accepts the page)
> 4. TD: resume #VE and continue TD vcpu execution
>
> TD may choose to skip step 1. In that case, after step 3, #VE is injected into
> the TD and the TD #VE handler needs to accept the page.
>
> When adding a page to Secure-EPT again, the page contents are cleared and the
> page is encrypted. If a page is disassociated from Secure-EPT and added again,
> the page content is lost.
>
> * TDG.VP.VMCALL<MapGPA> hypercall
> The page associated with GPA can be private or shared. TD converts the GPA by
> TDG.VP.VMCALL<MapGPA> hypercall from private to shared or vice versa. VMM
> tracks whether the given GPA is private or shared.
>
> * mapping merge(promote)/split(demote)
> The page can be mapped as large page (2MB or 1GB) in addition to 4KB. The
> mapping can be merged(4KB/2MB -> 2MB/1GB) or split(2MB/1GB -> 4KB/2MB) by TDX
> SEAMCALL TDH.MEM.PAGE.PROMOTE and TDH.MEM.PAGE.DEMOTE.
> Unlike VMX EPT, merging a mapping requires all the pages to be mapped already,
> because of encryption. This implies that the current KVM implementation, which
> merges mappings as follows, doesn't work for TDX:
>
> - EPT violation and host page is 2MB mappable.
> some of the 4KB pages of the given 2MB page are already mapped, some not.
> i.e. 2MB EPT -> 4KB EPT -> 4K pages
> - KVM page fault handler zap 2MB EPT entry and populate 2MB EPT entry
> zap: 2MB EPT: non present
> populate 2MB: -> 2MB page
>
> If VMM zaps 2MB Secure-EPT entry, the page contents will be lost for TDX.
> Mapping merge requires all pages are already mapped.
>
> Instead, the following steps are needed.
> - EPT violation and host page is 2MB mappable.
> some of the 4KB pages of the given 2MB page are already mapped. Some not.
> i.e. 2MB EPT -> 4KB EPT -> 4K pages
> - VMM checks all 4KB GPAs are private. If not, it can't be mapped as a large page.
> (****)
> - VMM checks all 4KB GPAs are already mapped. If not, give up mapping merge.
> (or map missing 4KB pages.)
> - mapping merge by TDH.MEM.PAGE.PROMOTE
>
> The mapping split for TDX Secure-EPT works similarly to the VMX EPT case.
>
>
> EPT violation and MapGPA
> ------------------------
> - EPT violation is a fast path
> - MapGPA is not a fast path.
> => Keep the EPT violation path optimized at the cost of complicating the MapGPA
> path. For the (****) check, we don't want to scan the 4KB mappings on EPT
> violation. Instead, the MapGPA path scans them and records whether the page can
> be mapped as 2MB with respect to private/shared.
This sounds reasonable. Instead of tracking that in MapGPA, maybe
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION introduced in UPM v7 is a better
place to put the scan code.
https://lkml.org/lkml/2022/7/6/259
Both the MapGPA (explicit conversion) and the EPT violation (implicit
conversion) can invoke these two ioctls and need to update
this info.
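To make the scan concrete, here is a minimal standalone C sketch of classifying a range of 4KB pages as all-private, all-shared, or mixed. It is an illustration only: enum range_type, classify_range() and the plain bool array are hypothetical and do not correspond to existing KVM or UPM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of scanning one large-page-sized range. */
enum range_type {
	RANGE_SHARED,	/* every 4KB page is shared */
	RANGE_PRIVATE,	/* every 4KB page is private */
	RANGE_MIXED,	/* private and shared pages are mixed */
};

/*
 * Hypothetical helper: is_private[i] says whether the i-th 4KB page of
 * the range is private. The scan stops at the first page that differs
 * from page 0.
 */
static enum range_type classify_range(const bool *is_private, size_t npages)
{
	assert(is_private != NULL && npages > 0);

	bool first = is_private[0];

	for (size_t i = 1; i < npages; i++) {
		if (is_private[i] != first)
			return RANGE_MIXED;
	}
	return first ? RANGE_PRIVATE : RANGE_SHARED;
}
```

Only a RANGE_MIXED result would need to forbid a large mapping at that level. The conversion path could run this once per affected 2MB (512 pages) or 1GB range and store the outcome for the fault path to read.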
>
>
> Tracking private/shared and large page mappable
> -----------------------------------------------
> VMM needs to track that page is mapped as private or shared at 4KB granularity.
> For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> track the page can be mapped as a large page (regarding private/shared). VMM
> updates it on MapGPA and references it on the EPT violation path. (****)
>
> For 4KB pages, 1 bit is needed: private or shared. Let's call it the shared-mask bit.
> For 2MB/1GB pages, 2 bits are needed: large page mappable or not, and private or
> shared if mappable. Let's call the former the no-largepage bit.
I'm just thinking maybe we don't need to introduce new bits. Instead we can
reuse lpage_info, which we already use to track whether a page can be
mapped at a specified page level in kvm_mmu_max_mapping_level(). Then in
the above two ioctls we do a scan for each level and update lpage_info.
For example, we should set disallow_lpage if private/shared pages are mixed
at that page level.
It's however a bit tricky to manage lpage_info.disallow_lpage in these
two ioctls with the current code. We can't simply do disallow_lpage++ and
disallow_lpage--. One possible solution is to treat disallow_lpage as a
mask instead of a count. Then we define bits like the ones below:
- USER_GFN_UNALIGNED set when memslot user_address/private_offset/gfn
is not aligned on the page level
- PAGE_TRACKING set during page tracking
- PRIVATE_SHARED_MIXED set when private/shared pages are mixed
In the page fault handler the page can be mapped at that level only when all
bits are zero, and in the above two ioctls we just switch bit
PRIVATE_SHARED_MIXED on or off.
Currently UPM doesn't have this code yet, but it can be added if feasible.
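A minimal standalone C sketch of the mask idea follows. It only illustrates the bookkeeping: struct lpage_info_sketch, the LPAGE_* bit values, and both helpers are hypothetical, and the real lpage_info.disallow_lpage is a count today, not a mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit assignments for a mask-style disallow_lpage. */
#define LPAGE_USER_GFN_UNALIGNED	(1u << 0)
#define LPAGE_PAGE_TRACKING		(1u << 1)
#define LPAGE_PRIVATE_SHARED_MIXED	(1u << 2)

/* Hypothetical stand-in for one lpage_info entry. */
struct lpage_info_sketch {
	unsigned int disallow_lpage;	/* treated as a mask, not a count */
};

/* Called from the conversion path after scanning the range. */
static void lpage_set_mixed(struct lpage_info_sketch *info, bool mixed)
{
	assert(info != NULL);
	if (mixed)
		info->disallow_lpage |= LPAGE_PRIVATE_SHARED_MIXED;
	else
		info->disallow_lpage &= ~LPAGE_PRIVATE_SHARED_MIXED;
}

/* Called from the fault path: a large mapping needs every bit clear. */
static bool lpage_allowed(const struct lpage_info_sketch *info)
{
	assert(info != NULL);
	return info->disallow_lpage == 0;
}
```

Setting or clearing the mixed bit is idempotent, which is the property that disallow_lpage++ and disallow_lpage-- lack when the same range is converted repeatedly. One open question for this sketch is PAGE_TRACKING: if several users can track the same range, a single bit may not be enough.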
Chao
>
> Option A.)
> Allocate array for pages in struct kvm_arch_memory_slot on TD creation.
> struct kvm_arch_memory_slot {
> +struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> }
>
> pros:
> +straight forward implementation
> +SPTE_SHARED_MASK is not needed
> cons:
> -memory overhead is high
> -not optimized for expected usage
> -one more look-up on EPT violation
>
> Option B.) Steal two software usable bits from SPTE and record them in SPTE.
> SPTE_SHARED_MASK, SPTE_NOLARGE_PAGE_MASK
> pros:
> +optimized for EPT violation
> cons:
> -2bits used in SPTE entry
> -complicates the MapGPA path.
>
> Option C.) Steal one software usable bit from SPTE and record it in SPTE.
> SPTE_SHARED_MASK
> For 2MB/1GB, allocate bitmap in kvm_mmu_page.
> struct kvm_mmu_page {
> bitmap nolarge
> }
> pros:
> +optimized for EPT violation
> cons:
> -complicates the MapGPA path.
> -information is scattered in SPTE and struct kvm_mmu_page
>
>
> How to update those bits
> ------------------------
> - MapGPA
> - at 4KB level, set or clear shared-mask bit.
> - Scan 512 4KB bit, at 2MB level
> - set or clear shared-mask bit, clear no-largepage bit or
> - clear shared-mask bit, set no-largepage bit
> - increment/decrement lpageinfo to prevent/allow large page
> - similar for 1GB level
> Note: This logic might be a bit tricky.
>
> - EPT violation
> - If 2MB large page is allowed, check if no-largepage bit
> - If no-largepage bit is set, => go down to 4KB page
> - If no-largepage bit is cleared => try to map 2MB page
> - If 4KB level is not mapped, map 2MB page
> - If some 4KB level is already mapped, go down to 4KB.
> Don't try to merge mapping. Or it's possible to try to merge mapping.
> Note: 512 4KB entry scanning is not done at EPT violation because it's fast
> path.
>
>
> Map merging
> -----------
> Map merging is necessary for TD migration. (Map split is the easy part.) The
> current KVM implementation zaps the range (mmu notification or lpage recovery
> worker) and expects large page mapping on the next EPT violation.
>
> Option A.) Keep the code similar to map merging logic.
> Zap 2MB EPT entry in some sense and trigger map merging logic on the next EPT
> violation. To keep the encrypted page contents, zapped EPT entries need to keep
> the page. Steal one more bit from SPTE: SPTE_PRIVATE_BLOCKED_MASK.
> It means that the page is zapped from SPTE, but it is still alive and references
> the page.
>
> Option B.) In the callback, directly merge mapping somehow. In this case, mmu
> notifier usage doesn't make sense.
>
> NOTE:
> - Implement map merging in MapGPA. This doesn't work for dirty page logging.
> - We can utilize kvm_nx_lpage_recovery_worker
> - We can utilize THP. Probably doesn't work well for fd-based private memory.
>
> Thanks,
> Isaku Yamahata
>
> On Mon, Jun 27, 2022 at 02:52:52PM -0700,
> [email protected] wrote:
>
> > From: Isaku Yamahata <[email protected]>
> >
> > KVM TDX basic feature support
> >
> > Hello. This is v7 of the patch series for KVM TDX support.
> > This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> > The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> > How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM
> >
> > Major changes from v6:
> > - rebased to v5.19 base
> >
> > TODO:
> > - integrate fd-based guest memory. As the discussion is still on-going, I
> > intentionally dropped fd-based guest memory support yet. The integration can
> > be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> > - 2M large page support. It's work-in-progress.
> > For large page support, there are several design choices. Here are the design options.
> > Any thoughts/feedback?
> >
> > KVM MMU Large page support for TDX
> >
> > * What needs to be done
> > - Track private or shared of each page size (4KB, 2MB, 1GB) based on
> > TDG.VP.VMCALL<MapGPA>. For large pages(2MB, 1GB), it can be mixed (some
> > lower-size pages are private and some shared.) In this case, the page can't
> > be large.
> > - if necessary, split large page on TDG.VP.VMCALL<MapGPA>
> > (split on dirty page tracking is future work)
> > - resolving KVM page fault
> > When resolving a private page and the page is large in the host, GPA can be
> > resolved as a large page in Secure-EPT. Even if the page is large on the host
> > side, sometimes a 4KB page can be resolved because it's up to guest TD to
> > accept at 4KB, 2MB, or 1GB.
> > - collapsing pages into a large page.
> > At this point, it's okay to not implement this. When dirty page tracking is
> > supported, this needs to be supported.
> > - On MapGPA, the page can be collapsed into a large page
> > - handle zapping SPTE and try to collapse the pages on the next KVM page fault
> > Unlike the EPT case, some trick is needed.
> > - For performance, optimize KVM page fault path at the cost of complicating
> > MapGPA path.
> >
> > * options to track private or shared
> > At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> > 1GB case). For 4KB each page, 1 bit per page is needed. private or shared. For
> > large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
> > mixed). When resolving KVM page fault, we don't want to check the lower-size
> > pages to check if the given GPA can be a large for performance. On MapGPA check
> > it instead.
> >
> > Option A). enhance kvm_arch_memory_slot
> > enum kvm_page_type {
> > KVM_PAGE_TYPE_INVALID,
> > KVM_PAGE_TYPE_SHARED,
> > KVM_PAGE_TYPE_PRIVATE,
> > KVM_PAGE_TYPE_MIXED,
> > };
> >
> > struct kvm_page_attr {
> > enum kvm_page_type type;
> > };
> >
> > struct kvm_arch_memory_slot {
> > + struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> >
> > Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> > If !SPTE_MIXED_MASK, it can be large page.
> >
> > Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> > kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
> >
> >
> > * comparison
> > A).
> > + straightforward to implement
> > + SPTE_SHARED_MASK isn't needed
> > - memory overhead compared to B). or C).
> > - more memory reference on KVM page fault
> >
> > B).
> > + simpler than C) (more complex than A)?)
> > + efficient on KVM page fault. (only SPTE reference)
> > + low memory overhead
> > - Waste precious SPTE bits.
> >
> > C).
> > + efficient on KVM page fault. (only SPTE reference)
> > + low memory overhead
> > - complicates MapGPA
> > - scattered data structure
> >
> > Thanks,
> > Isaku Yamahata
> >
> > Changes from v6:
> > - rebased to v5.19
> >
> > Changes from v5:
> > - export __seamcall and use it
> > - move mutex lock from callee function of smp_call_on_cpu to the caller.
> > - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> > - updated comment
> > - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> > compatibility to only return success
> > - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> > - make this ioctl systemwide ioctl
> > - ABI change to struct kvm_init_vm
> > - guest_tsc_khz: use kvm->arch.default_tsc_khz
> > - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> > - drop exporting kvm_set_tsc_khz().
> > - fix kvm_tdp_page_fault() for mtrr emulation
> > - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> > - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> > keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> > - update commit message
> > - rename shadow_init_value => shadow_nonpresent_value
> > - added ept_violation_ve_test mode
> > - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> > - legacy MMU case
> > => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> > - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> > - #VE warning:
> > - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> > - merge into Like we discussed, this patch should be merged with patch
> > "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> > - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> > - introduce kvm_gfn_for_root(kvm, root, gfn)
> > - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> > - use kvm_arch_dirty_log_supported()
> > - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> > - rename: is_private_prohibit_spte() => spte_shared_mask()
> > - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> > - dropped this patch as the change was merged into kvm/queue
> > - update vt_apicv_post_state_restore()
> > - use is_64_bit_hypercall()
> > - comment: expand MSMI -> Machine Check System Management Interrupt
> > - fixed TDX_SEPT_PFERR
> > - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> > - rename tdmvcall_exit_readon() => tdvmcall_leaf()
> > - remove optional zero check of argument.
> > - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> > in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> > - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> > - introduce vcpu_deliver_init to x86_ops
> > - sprinkled KVM_BUG_ON()
> >
> > Changes from v4:
> > - rebased to TDX host kernel patch series.
> > - include all the patches to make this patch series working.
> > - add [MARKER] patches to mark the patch layer clear.
> >
> > ---
> > * What's TDX?
> > TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
> > Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> > Domain (TD) for confidential computing.
> >
> > A TD runs in a CPU mode that is designed to protect the confidentiality of its
> > memory contents and its CPU state from any other software, including the hosting
> > Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> >
> > We have more detailed explanations below (***).
> > We have the high-level design of TDX KVM below (****).
> >
> > In this patch series, we use "TD" or "guest TD" to differentiate it from the
> > current "VM" (Virtual Machine), which is supported by KVM today.
> >
> >
> > * The organization of this patch series
> > This patch series is on top of the patch series "TDX host kernel support":
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > this patch series is available at
> > https://github.com/intel/tdx/releases/tag/kvm-upstream
> > The corresponding patches to qemu are available at
> > https://github.com/intel/qemu-tdx/commits/tdx-upstream
> >
> > The relations of the layers are depicted as follows.
> > The arrows below show the order of patch reviews we would like to have.
> >
> > The below layers are chosen so that the device model, for example, qemu can
> > exercise each layering step by step. Check if TDX is supported, create TD VM,
> > create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> > vcpu exits/hypercalls/interrupts to run TD fully.
> >
> > TDX vcpu
> > interrupt/exits/hypercall<------------\
> > ^ |
> > | |
> > TD finalization |
> > ^ |
> > | |
> > TDX EPT violation<------------\ |
> > ^ | |
> > | | |
> > TD vcpu enter/exit | |
> > ^ | |
> > | | |
> > TD vcpu creation/destruction | \-------KVM TDP MMU MapGPA
> > ^ | ^
> > | | |
> > TD VM creation/destruction \---------------KVM TDP MMU hooks
> > ^ ^
> > | |
> > TDX architectural definitions KVM TDP refactoring for TDX
> > ^ ^
> > | |
> > TDX, VMX <--------TDX host kernel KVM MMU GPA stolen bits
> > coexistence support
> >
> >
> > The following are explanations of each layer. Each layer has a dummy commit
> > that starts with [MARKER] in subject. It is intended to help to identify where
> > each layer starts.
> >
> > TDX host kernel support:
> > https://lore.kernel.org/lkml/[email protected]/
> > The guts of system-wide initialization of TDX module. There is an
> > independent patch series for host x86. TDX KVM patches call functions
> > this patch series provides to initialize the TDX module.
> >
> > TDX, VMX coexistence:
> > Infrastructure to allow TDX to coexist with VMX and trigger the
> > initialization of the TDX module.
> > This layer starts with
> > "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> > TDX architectural definitions:
> > Add TDX architectural definitions and helper functions
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> > TD VM creation/destruction:
> > Guest TD creation/destruction: allocation and release of TDX specific vm
> > and vcpu structures. Create an initial guest memory image with TDX
> > measurement.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> > TD vcpu creation/destruction:
> > Guest TD creation/destruction: allocation and release of TDX specific vm
> > and vcpu structures. Create an initial guest memory image with TDX
> > measurement.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> > TDX EPT violation:
> > Create an initial guest memory image with TDX measurement. Handle
> > secure EPT violations to populate guest pages with TDX SEAMCALLs.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> > TD vcpu enter/exit:
> > Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
> > entering into TD. Restore CPU state after exiting from TD.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> > TD vcpu interrupts/exit/hypercall:
> > Handle various exits/hypercalls and allow interrupts to be injected so
> > that TD vcpu can continue running.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> >
> > KVM MMU GPA shared bit:
> > Introduce a framework to handle the shared bit of GPA. TDX
> > repurposed a bit of GPA to indicate shared or private. If it's shared,
> > it's the same as the conventional VMX EPT case. VMM can access shared
> > guest pages. If it's private, it's handled by Secure-EPT and the guest
> > page is encrypted.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> > KVM TDP refactoring for TDX:
> > TDX Secure EPT requires different constants, e.g. the initial EPT
> > entry value. Various refactoring for those differences.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> > KVM TDP MMU hooks:
> > Introduce a framework to TDP MMU to add hooks in addition to direct EPT
> > access. TDX added Secure EPT, which is an enhancement to VMX EPT. Unlike
> > conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> > use TDX SEAMCALLs to operate on Secure EPT.
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> > KVM TDP MMU MapGPA:
> > Introduce framework to handle switching guest pages from private/shared
> > to shared/private. For a given GPA, a guest page can be assigned to a
> > private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
> > guest TD converts GPA assignments from private (or shared) to shared (or
> > private).
> > This layer starts with
> > "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
> >
> > KVM guest private memory: (not shown in the above diagram)
> > [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> > memory: https://lkml.org/lkml/2022/1/18/395
> > Guest private memory requires different memory management in KVM. The
> > patch proposes a way for it. Integration with TDX KVM.
> >
> > (***)
> > * TDX module
> > A CPU-attested software module called the "TDX module" is designed to implement
> > the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> > loaded by the kernel or driver at runtime, but in this patch series we assume
> > that the TDX module is already loaded and initialized.
> >
> > The TDX module provides two main new logical modes of operation built upon the
> > new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> > architecture. TDX root mode is mostly identical to the VMX root operation mode,
> > and the TDX functions (described later) are triggered by the new SEAMCALL
> > instruction with the desired interface function selected by an input operand
> > (leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
> > non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> > operation (i.e. guest VM), with changes and restrictions to better assure that
> > no other software or hardware has direct visibility of the TD memory and state.
> >
> > TDX transitions between TDX root operation and TDX non-root operation include TD
> > Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> > TDX root mode. A TD Exit might be asynchronous, triggered by some external
> > event (e.g., external interrupt or SMI) or an exception, or it might be
> > synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> >
> > TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> > of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> > Domain Host. Those host-side TDX interface functions are categorized into
> > various areas just for better organization, such as SYS (TDX module management),
> > MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> > etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
> >
> > TDCS (Trust Domain Control Structure) is the main control structure of a guest
> > TD, and is encrypted (using the guest TD's ephemeral private key). At a high
> > level, TDCS holds information for controlling TD operation as a whole:
> > execution, EPTP, MSR bitmaps, etc. that KVM needs to set up. Note that MSR
> > bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> > same value for all VCPUs of the same TD.
> >
> > Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> > TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
> > the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> > DMA access, accessible only by using the TDX module interface functions (such as
> > TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> > such as virtual APIC page, virtualization exception information, etc.
> >
> > Several VMX control structures (such as Shared EPT and Posted interrupt
> > descriptor) are directly managed and accessed by the host VMM. These control
> > structures are pointed to by fields in the TD VMCS.
> >
> > The above means that 1) KVM needs to allocate different data structures for TDs,
> > 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> > define TD-specific handling for others, and 4) it needs to redirect operations
> > to the TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
> > vmx_callback();".
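A minimal sketch of the redirect pattern quoted above, for illustration only.
struct vcpu, vt_callback() and the two placeholder backends are hypothetical
stand-ins, not the structures or functions used by the patch series.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the real vCPU structure. */
struct vcpu {
	bool is_td;	/* true when the vCPU belongs to a TD */
};

static bool is_td_vcpu(const struct vcpu *vcpu)
{
	return vcpu->is_td;
}

/* Placeholder backends; the return value only identifies the path taken. */
static int tdx_callback(struct vcpu *vcpu)
{
	(void)vcpu;
	return 1;
}

static int vmx_callback(struct vcpu *vcpu)
{
	(void)vcpu;
	return 0;
}

/* Common entry point that redirects to the TDX or the VMX backend. */
static int vt_callback(struct vcpu *vcpu)
{
	if (is_td_vcpu(vcpu))
		return tdx_callback(vcpu);
	return vmx_callback(vcpu);
}
```

With this shape, common KVM code only ever calls the wrapper, and the
TD-or-not decision stays in one place per operation.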
> >
> > *TD Private Memory
> > TD private memory is designed to hold TD private content, encrypted by the CPU
> > using the TD ephemeral key. An encryption engine holds a table of encryption
> > keys, and an encryption key is selected for each memory transaction based on a
> > Host Key Identifier (HKID). By design, the host VMM does not have access to the
> > encryption keys.
> >
> > In the first generation of MKTME, HKID is "stolen" from the physical address by
> > allocating a configurable number of bits from the top of the physical
> > address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> > accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> > HKID on the host so that MKTME can be opaque or bypassed on the host.
> >
> > During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> > as either shared or private, based on the value of a new SHARED bit in the Guest
> > Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
> > (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> > VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> > with the current VMX. Since guest TDs usually require I/O, and the data exchange
> > needs to be done via shared memory, KVM needs to use the current EPT
> > functionality even for TDs.
> >
> > * Secure EPT and Mirroring using the TDP code
> > The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
> > pages are encrypted and integrity-protected with the TD's ephemeral private
> > key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> > interface functions, and thus conceptually Secure EPT is a subset of EPT (why
> > "subset"). Since execution of such interface functions takes much longer
> > than accessing memory directly, in KVM we use the existing TDP code to mirror the
> > Secure EPT for the TD.
> >
> > This way, we can effectively walk Secure EPT without using the TDX interface
> > functions.
> >
> > * VM life cycle and TDX specific operations
> > The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
> > example, a TD needs to boot in private memory, and the host software cannot copy
> > the initial image to private memory.
> >
> > * TSC Virtualization
> > The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> > (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> > by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
> > owns TSC virtualization for VMs, but the TDX module does for TDs.
> >
> > * MCE support for TDs
> > The TDX module doesn't allow the VMM to inject MCEs. Instead, a PV mechanism is
> > needed for the TD to communicate with the VMM. For now, KVM silently ignores MCE
> > requests from the VMM. MSRs related to MCE (e.g., MCE bank registers) can be
> > naturally emulated by paravirtualizing MSR access.
> >
> > [1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
> > available.
> >
> > * Restrictions or future work
> > Some features are not included to reduce patch size. Those features will be
> > addressed in future independent patch series.
> > - large page (2M, 1G)
> > - qemu gdb stub
> > - guest PMU
> > - and more
> >
> > * Prerequisites
> > It's required to load the TDX module and initialize it. It's out of the scope
> > of this patch series. Another independent patch for the common x86 code is
> > planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> > CONFIG_INTEL_TDX_HOST. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
> > module is initialized and the TDX module APIs for the TDX guest life cycle,
> > like tdh.mng.init, are ready for KVM to use.
> >
> > Concretely, global initialization, LP (Logical Processor) initialization, global
> > configuration, the key configuration, and TDMR and PAMT initialization are done.
> > The state of the TDX module is SYS_READY. Please refer to the chapter "Intel
> > TDX Module Lifecycle State Machine" of the TDX module specification.
> > ** Detecting the TDX module readiness.
> > The TDX host patch series implements the detection of the TDX module availability
> > and its initialization so that KVM can use it. It also manages the Host KeyID
> > (HKID) assigned to guest TDs.
> > The APIs that the TDX host patch series is assumed to provide are
> > - int seamrr_enabled()
> > Check if the required CPU feature (SEAM mode) is available. This only checks CPU
> > feature availability. At this point, the TDX module may not be ready for KVM
> > to use.
> > - int init_tdx(void);
> > Initialization of TDX module so that the TDX module is ready for KVM to use.
> > - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> > Return the system wide information about the TDX module. Returns NULL if the
> > TDX module isn't initialized.
> > - u32 tdx_get_global_keyid(void);
> > Return global key id that is used for the TDX module itself.
> > - int tdx_keyid_alloc(void);
> > Allocate HKID for guest TD.
> > - void tdx_keyid_free(int keyid);
> > Free HKID for guest TD.
> >
> > (****)
> > * TDX KVM high-level design
> > - Host key ID management
> > Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> > It is assumed that the TDX host patch series implements the necessary functions,
> > u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
> > void tdx_keyid_free(int keyid).
> >
> > - Data structures and VM type
> > Because TDX is different from VMX, define its own VM/VCPU structures, struct
> > kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
> > identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> > TDX, is used.
> >
> > - VM life cycle and TDX specific operations
> > Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> > New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> > parameters, set initial guest memory and measurement.
> >
> > The creation of TDX VM requires five additional operations in addition to the
> > conventional VM creation.
> > - Get KVM system capability to check if TDX VM type is supported
> > - VM creation (KVM_CREATE_VM)
> > - New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
> > - New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
> > - VCPU creation (KVM_CREATE_VCPU)
> > - New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
> > - New: Initialize guest memory as boot state and extend the measurement with
> > the memory. KVM_TDX_INIT_MEM_REGION.
> > - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> > TDX VM contents.
> > - VCPU RUN (KVM_RUN)
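The ordering of the creation steps listed above can be sketched as a small
state tracker. This is only an illustration: enum td_step, struct td_setup and
td_step_ok() are hypothetical names, and the rule that nothing but running is
allowed after finalization is an assumption of this sketch (based on the
measurement being complete at that point), not something the series states.

```c
#include <stdbool.h>

/* Hypothetical step IDs mirroring the list above, in the listed order. */
enum td_step {
	TD_CHECK_CAP,		/* check that the TDX VM type is supported */
	TD_CREATE_VM,		/* KVM_CREATE_VM */
	TD_GET_CAPABILITY,	/* KVM_TDX_GET_CAPABILITY */
	TD_INIT_VM,		/* KVM_TDX_INIT_VM */
	TD_CREATE_VCPU,		/* KVM_CREATE_VCPU */
	TD_INIT_VCPU,		/* KVM_TDX_INIT_VCPU */
	TD_INIT_MEM_REGION,	/* KVM_TDX_INIT_MEM_REGION */
	TD_FINALIZE,		/* KVM_TDX_FINALIZE */
	TD_RUN,			/* run a vCPU */
};

struct td_setup {
	int next;	/* first step that has not been performed yet */
};

/* Returns true, and records progress, if 'step' is acceptable now. */
static bool td_step_ok(struct td_setup *s, enum td_step step)
{
	if ((int)step > s->next)
		return false;	/* a predecessor step is still missing */
	if (s->next > TD_FINALIZE && step != TD_RUN)
		return false;	/* assumed: only running after finalize */
	if ((int)step == s->next)
		s->next++;
	return true;
}
```

The check is deliberately coarse: it lets earlier steps repeat before
finalization (e.g. creating and initializing several vCPUs) without modelling
which repetitions are actually meaningful.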
> >
> > - Protected guest state
> > Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> > can't operate on it: for example, accessing CPU registers, injecting
> > exceptions, and accessing guest memory. When such operations are requested via
> > KVM API ioctls, they are silently ignored, returning zero or the initial reset
> > value.
> >
> > VM/VCPU state and callbacks for TDX specific operations.
> > Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
> > operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
> >
> > Operations on the CPU state
> > silently ignore operations on the guest state. For example, the write to
> > CPU registers is ignored and the read from CPU registers returns 0.
> >
> > . ignore access to CPU registers except for allowed ones.
> > . TSC: add a check if tsc is immutable and return an error, because the KVM
> > implementation updates the internal tsc state and it's difficult to back
> > out those changes. Instead, skip the logic.
> > . dirty logging: add check if dirty logging is supported.
> > . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> >
> > Note: virtual external interrupt and NMI can be injected into TDX guests.
> >
> > - KVM MMU integration
> > One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> > the guest physical address is private (the bit is cleared) or shared (the bit is
> > set). The bits are called stolen bits.
> >
> > - Stolen bits framework
> > systematically tracks which guest physical address, shared or private, is
> > used.
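A minimal sketch of helpers for the shared bit described above (bit cleared
means private, bit set means shared). The helper names are hypothetical, and
the bit position is passed in explicitly because the text allows either 51 or
47.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* Mask of the shared bit; 'bit' is 51 or 47 depending on the configuration. */
static gpa_t gpa_shared_mask(unsigned int bit)
{
	return (gpa_t)1 << bit;
}

/* A GPA is shared when the stolen bit is set, private when it is cleared. */
static bool gpa_is_shared(gpa_t gpa, unsigned int bit)
{
	return (gpa & gpa_shared_mask(bit)) != 0;
}

static gpa_t gpa_to_shared(gpa_t gpa, unsigned int bit)
{
	return gpa | gpa_shared_mask(bit);
}

static gpa_t gpa_to_private(gpa_t gpa, unsigned int bit)
{
	return gpa & ~gpa_shared_mask(bit);
}
```

For example, with bit 51 the private GPA 0x1000 and its shared alias differ
only in that one bit, so converting to shared and back returns 0x1000.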
> >
> > - Shared EPT and secure EPT
> > There are two EPTs: the Shared EPT (the conventional one) and the Secure
> > EPT (the new one). The Shared EPT is handled the same as the conventional
> > EPT, for GPAs with the stolen bit set. The Secure EPT points to private
> > guest pages. To resolve an EPT violation, KVM walks one of the two EPTs
> > based on the faulting GPA. Because it's costly to access the Secure EPT
> > with SEAMCALLs while walking EPTs for a private guest physical address,
> > another private EPT is used as a shadow of the Secure EPT with the
> > existing logic, at the cost of extra memory.
> >
> > The following depicts the relationship.
> >
> > KVM | TDX module
> > | | |
> > -------------+---------- | |
> > | | | |
> > V V | |
> > shared GPA private GPA | |
> > CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
> > | | | |
> > | | | |
> > V V | V
> > shared EPT private EPT--------mirror----->Secure EPT
> > | | | |
> > | \--------------------+------\ |
> > | | | |
> > V | V V
> > shared guest page | private guest page
> > |
> > |
> > non-encrypted memory | encrypted memory
> > |
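The mirror relationship in the diagram can be sketched as follows: reads are
served from the KVM-side private EPT, and only updates go through a hook that
stands in for the TDX module call. Everything here is hypothetical (a flat
512-entry table, the set_secure_spte hook, updating the mirror only after the
hook succeeds); it is not the TDP MMU code from the series.

```c
#include <stddef.h>
#include <stdint.h>

#define MIRROR_ENTRIES 512

struct private_ept {
	uint64_t spte[MIRROR_ENTRIES];	/* KVM-side mirror, in host memory */
	/* Hypothetical hook standing in for a SEAMCALL; 0 means success. */
	int (*set_secure_spte)(size_t index, uint64_t spte);
};

/* Example hook that only counts how often the TDX module would be called. */
static unsigned int seamcall_count;

static int count_seamcall(size_t index, uint64_t spte)
{
	(void)index;
	(void)spte;
	seamcall_count++;
	return 0;
}

/* Update path: propagate to the Secure EPT, then record in the mirror. */
static int private_ept_set(struct private_ept *ept, size_t index, uint64_t spte)
{
	int ret;

	if (index >= MIRROR_ENTRIES)
		return -1;
	ret = ept->set_secure_spte(index, spte);
	if (ret)
		return ret;	/* leave the mirror unchanged on failure */
	ept->spte[index] = spte;
	return 0;
}

/* Walk path: served from the mirror, no TDX module call needed. */
static uint64_t private_ept_get(const struct private_ept *ept, size_t index)
{
	return index < MIRROR_ENTRIES ? ept->spte[index] : 0;
}
```

The point of the sketch is the asymmetry: any number of lookups costs no
SEAMCALL, while each successful update costs exactly one hook invocation.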
> >
> > - Operating on Secure EPT
> > Use the TDX module APIs to operate on Secure EPT. To call the TDX API
> > while resolving an EPT violation, add hooks for the additional operations and
> > wire them to the TDX backend.
> >
> > * References
> >
> > [1] TDX specification
> > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > [2] Intel Trust Domain Extensions (Intel TDX)
> > https://cdrdv2.intel.com/v1/dl/getContent/726790
> > [3] Intel CPU Architectural Extensions Specification
> > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> > [4] Intel TDX Module 1.0 Specification
> > https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> > [5] Intel TDX Loader Interface Specification
> > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > [6] Intel TDX Guest-Hypervisor Communication Interface
> > https://cdrdv2.intel.com/v1/dl/getContent/726790
> > [7] Intel TDX Virtual Firmware Design Guide
> > https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> > [8] intel public github
> > kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> > TDX guest branch: https://github.com/intel/tdx/tree/guest
> > qemu TDX https://github.com/intel/qemu-tdx
> > [9] TDVF
> > https://github.com/tianocore/edk2-staging/tree/TDVF
> > This was merged into EDK2 main branch. https://github.com/tianocore/edk2
> >
> > Chao Gao (3):
> > KVM: x86: Move check_processor_compatibility from init ops to runtime
> > ops
> > Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> > arch funcs"
> > KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> > wrmsr
> >
> > Isaku Yamahata (72):
> > KVM: Refactor CPU compatibility check on module initialiization
> > x86/virt/vmx/tdx: export platform_tdx_enabled()
> > KVM: TDX: Detect CPU feature on kernel module initialization
> > KVM: x86: Refactor KVM VMX module init/exit functions
> > KVM: TDX: Add placeholders for TDX VM/vcpu structure
> > x86/virt/tdx: Add a helper function to return system wide info about
> > TDX module
> > KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> > KVM: TDX: Make TDX VM type supported
> > [MARKER] The start of TDX KVM patch series: TDX architectural
> > definitions
> > KVM: TDX: Define TDX architectural definitions
> > KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> > KVM: TDX: Add helper functions to print TDX SEAMCALL error
> > [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> > x86/cpu: Add helper functions to allocate/free TDX private host key id
> > KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> > KVM: TDX: Make pmu_intel.c ignore guest TD case
> > [MARKER] The start of TDX KVM patch series: TD vcpu
> > creation/destruction
> > KVM: TDX: allocate/free TDX vcpu structure
> > KVM: TDX: allocate/free TDX vcpu structure
> > [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> > KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> > [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> > TDX
> > KVM: x86/mmu: Disallow fast page fault on private GPA
> > KVM: VMX: Introduce test mode related to EPT violation VE
> > [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> > KVM: x86/mmu: Focibly use TDP MMU for TDX
> > KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> > KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> > KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> > [MARKER] The start of TDX KVM patch series: TDX EPT violation
> > KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> > KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> > KVM: TDX: TDP MMU TDX support
> > [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> > KVM: x86/mmu: steal software usable git to record if GFN is for shared
> > or not
> > KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> > [MARKER] The start of TDX KVM patch series: TD finalization
> > KVM: TDX: Create initial guest memory
> > KVM: TDX: Finalize VM initialization
> > [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> > KVM: TDX: Add helper assembly function to TDX vcpu
> > KVM: TDX: Implement TDX vcpu enter/exit path
> > KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> > KVM: TDX: restore host xsave state when exit from the guest TD
> > KVM: TDX: restore user ret MSRs
> > [MARKER] The start of TDX KVM patch series: TD vcpu
> > exits/interrupts/hypercalls
> > KVM: TDX: complete interrupts after tdexit
> > KVM: TDX: restore debug store when TD exit
> > KVM: TDX: handle vcpu migration over logical processor
> > KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> > behavior
> > KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> > KVM: TDX: Implement interrupt injection
> > KVM: TDX: Implements vcpu request_immediate_exit
> > KVM: TDX: Implement methods to inject NMI
> > KVM: TDX: Add a place holder to handle TDX VM exit
> > KVM: TDX: handle EXIT_REASON_OTHER_SMI
> > KVM: TDX: handle ept violation/misconfig exit
> > KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> > KVM: TDX: Add a place holder for handler of TDX hypercalls
> > (TDG.VP.VMCALL)
> > KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> > KVM: TDX: Handle TDX PV CPUID hypercall
> > KVM: TDX: Handle TDX PV HLT hypercall
> > KVM: TDX: Handle TDX PV port io hypercall
> > KVM: TDX: Implement callbacks for MSR operations for TDX
> > KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> > KVM: TDX: Handle TDX PV report fatal error hypercall
> > KVM: TDX: Handle TDX PV map_gpa hypercall
> > KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> > KVM: TDX: Silently discard SMI request
> > KVM: TDX: Silently ignore INIT/SIPI
> > Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> > KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> >
> > Rick Edgecombe (1):
> > KVM: x86/mmu: Add address conversion functions for TDX shared bits
> >
> > Sean Christopherson (25):
> > KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> > KVM: Enable hardware before doing arch VM initialization
> > KVM: x86: Introduce vm_type to differentiate default VMs from
> > confidential VMs
> > KVM: TDX: Add TDX "architectural" error codes
> > KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> > KVM: TDX: create/destroy VM structure
> > KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> > KVM: TDX: Do TDX specific vcpu initialization
> > KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> > KVM: x86/mmu: Allow non-zero value for non-present SPTE
> > KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> > KVM: x86/mmu: Allow per-VM override of the TDP max page level
> > KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> > private mmu
> > KVM: x86/mmu: Disallow dirty logging for x86 TDX
> > KVM: VMX: Split out guts of EPT violation to common/exposed function
> > KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> > KVM: TDX: Add load_mmu_pgd method for TDX
> > KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> > KVM: TDX: Add support for find pending IRQ in a protected local APIC
> > KVM: x86: Assume timer IRQ was injected if APIC state is proteced
> > KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> > argument
> > KVM: VMX: Move NMI/exception handler to common helper
> > KVM: x86: Split core of hypercall emulation to helper function
> > KVM: TDX: Handle TDX PV MMIO hypercall
> > KVM: TDX: Add methods to ignore accesses to CPU state
> >
> > Xiaoyao Li (1):
> > KVM: TDX: initialize VM with TDX specific parameters
> >
> > Documentation/virt/kvm/api.rst | 30 +-
> > .../virt/kvm/intel-tdx-layer-status.rst | 33 +
> > Documentation/virt/kvm/intel-tdx.rst | 381 +++
> > Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 ++++
> > arch/arm64/kvm/arm.c | 2 +-
> > arch/mips/kvm/mips.c | 14 +-
> > arch/powerpc/kvm/powerpc.c | 2 +-
> > arch/riscv/kvm/main.c | 2 +-
> > arch/s390/kvm/kvm-s390.c | 2 +-
> > arch/x86/events/intel/ds.c | 1 +
> > arch/x86/include/asm/kvm-x86-ops.h | 10 +
> > arch/x86/include/asm/kvm_host.h | 56 +-
> > arch/x86/include/asm/tdx.h | 67 +
> > arch/x86/include/asm/vmx.h | 14 +
> > arch/x86/include/uapi/asm/kvm.h | 95 +
> > arch/x86/include/uapi/asm/vmx.h | 5 +-
> > arch/x86/kvm/Kconfig | 4 +
> > arch/x86/kvm/Makefile | 3 +-
> > arch/x86/kvm/irq.c | 3 +
> > arch/x86/kvm/lapic.c | 37 +-
> > arch/x86/kvm/lapic.h | 2 +
> > arch/x86/kvm/mmu.h | 42 +-
> > arch/x86/kvm/mmu/mmu.c | 360 ++-
> > arch/x86/kvm/mmu/mmu_internal.h | 123 +-
> > arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
> > arch/x86/kvm/mmu/spte.c | 46 +-
> > arch/x86/kvm/mmu/spte.h | 65 +-
> > arch/x86/kvm/mmu/tdp_iter.c | 1 +
> > arch/x86/kvm/mmu/tdp_iter.h | 5 +-
> > arch/x86/kvm/mmu/tdp_mmu.c | 690 ++++-
> > arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
> > arch/x86/kvm/svm/svm.c | 13 +-
> > arch/x86/kvm/vmx/common.h | 174 ++
> > arch/x86/kvm/vmx/evmcs.c | 2 +-
> > arch/x86/kvm/vmx/evmcs.h | 2 +-
> > arch/x86/kvm/vmx/main.c | 1071 +++++++
> > arch/x86/kvm/vmx/pmu_intel.c | 39 +-
> > arch/x86/kvm/vmx/pmu_intel.h | 28 +
> > arch/x86/kvm/vmx/posted_intr.c | 43 +-
> > arch/x86/kvm/vmx/posted_intr.h | 13 +
> > arch/x86/kvm/vmx/tdx.c | 2465 +++++++++++++++++
> > arch/x86/kvm/vmx/tdx.h | 275 ++
> > arch/x86/kvm/vmx/tdx_arch.h | 157 ++
> > arch/x86/kvm/vmx/tdx_errno.h | 29 +
> > arch/x86/kvm/vmx/tdx_error.c | 22 +
> > arch/x86/kvm/vmx/tdx_ops.h | 188 ++
> > arch/x86/kvm/vmx/vmenter.S | 146 +
> > arch/x86/kvm/vmx/vmx.c | 737 ++---
> > arch/x86/kvm/vmx/vmx.h | 39 +-
> > arch/x86/kvm/vmx/x86_ops.h | 235 ++
> > arch/x86/kvm/x86.c | 148 +-
> > arch/x86/virt/vmx/tdx/seamcall.S | 2 +
> > arch/x86/virt/vmx/tdx/tdx.c | 54 +-
> > arch/x86/virt/vmx/tdx/tdx.h | 52 -
> > include/linux/kvm_host.h | 4 +-
> > include/uapi/linux/kvm.h | 2 +
> > tools/arch/x86/include/uapi/asm/kvm.h | 95 +
> > tools/include/uapi/linux/kvm.h | 1 +
> > virt/kvm/kvm_main.c | 67 +-
> > 59 files changed, 7877 insertions(+), 804 deletions(-)
> > create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
> > create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> > create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> > create mode 100644 arch/x86/kvm/vmx/common.h
> > create mode 100644 arch/x86/kvm/vmx/main.c
> > create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> > create mode 100644 arch/x86/kvm/vmx/tdx.c
> > create mode 100644 arch/x86/kvm/vmx/tdx.h
> > create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> > create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> > create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> > create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> > create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> >
> > --
> > 2.25.1
> >
>
> --
> Isaku Yamahata <[email protected]>
On Tue, Jul 12, 2022 at 06:54:19PM +0800,
Chao Peng <[email protected]> wrote:
> On Tue, Jul 12, 2022 at 01:07:20PM +0800, Chao Gao wrote:
> > On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> > >Hi. Because my description on large page support was terse, I wrote up more
> > >detailed one. Any feedback/thoughts on large page support?
> > >
> > >TDP MMU large page support design
> > >
> > >Two main discussion points
> > >* how to track page status. private vs shared, no-largepage vs can-be-largepage
> >
> > ...
> >
> > >
> > >Tracking private/shared and large page mappable
> > >-----------------------------------------------
> > >VMM needs to track whether each page is mapped as private or shared at 4KB
> > >granularity. For efficiency of the EPT violation path (****), at the 2MB and
> > >1GB levels, VMM should track whether the page can be mapped as a large page
> > >(regarding private/shared). VMM updates it on MapGPA and references it on the
> > >EPT violation path. (****)
> >
> > Isaku,
> >
> > + Peng Chao
> >
> > Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
> > private or all shared?
> >
> > KVM always retrieves the mapping level in CR3 and enforces that EPT's
> > page level is not greater than that in CR3. My point is if UPM already enforces
> > no mixed pages in a large page, then KVM needn't do that again (UPM can
> > be trusted).
>
> The backing store in the UPM can tell KVM which page level it can
> support for a given private gpa, similar to host_pfn_mapping_level() for
> shared address.
>
> However, this solely represents the backing store's capability. KVM
> still needs additional info to decide whether that can be safely mapped
> as 2M/1G, e.g. all the following pages in the 2M/1G range should be
> private; currently this is not something the backing store can tell.
This argument also applies to shared GPAs. With UPM, the shared pages are backed
by a normal file mapping. When KVM maps a shared GPA, the same check is needed.
So I think KVM has to track all-private, all-shared, or no-largepage at the
2MB/1GB level. If UPM tracks shared-or-private at the 4KB level, KVM probably
doesn't need to track it at the 4KB level.
> Actually, in UPM v7 we let KVM record this info so one possible solution
> is making use of it.
>
> https://lkml.org/lkml/2022/7/6/259
>
> Then to map a page as 2M, KVM needs to check:
> - Memory backing store support that level
> - All pages in 2M range are private as we recorded through
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> - No existing partial 4K map(s) in 2M range
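The three checks listed above can be sketched as one predicate. This is an
illustration under assumptions: struct range_2m, its fields, and the level
encoding (1 = 4K, 2 = 2M, 3 = 1G) are hypothetical, and per-page bool arrays
stand in for whatever UPM or KVM actually records.

```c
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_2M 512

/* Hypothetical per-2M-range view of the information named above. */
struct range_2m {
	int backing_max_level;		/* assumed: 1 = 4K, 2 = 2M, 3 = 1G */
	bool is_private[PTES_PER_2M];	/* recorded private/shared per 4K page */
	bool mapped_4k[PTES_PER_2M];	/* an existing 4K mapping is present */
};

/* True if the whole range may be mapped as a single private 2M page. */
static bool can_map_private_2m(const struct range_2m *r)
{
	if (r->backing_max_level < 2)
		return false;	/* backing store can't provide a 2M page */

	for (size_t i = 0; i < PTES_PER_2M; i++) {
		if (!r->is_private[i])
			return false;	/* range mixes private and shared */
		if (r->mapped_4k[i])
			return false;	/* a partial 4K mapping exists */
	}
	return true;
}
```

A linear scan like this is what the thread wants to keep off the EPT violation
fast path, which is why the result would be precomputed and cached per range.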
--
Isaku Yamahata <[email protected]>
On Tue, Jul 12, 2022 at 06:49:25PM +0800,
Chao Peng <[email protected]> wrote:
> On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> > Hi. Because my description on large page support was terse, I wrote up more
> > detailed one. Any feedback/thoughts on large page support?
> >
> > TDP MMU large page support design
> >
> > Two main discussion points
> > * how to track page status. private vs shared, no-largepage vs can-be-largepage
> > * how to trigger merging mapping from 4KB/2MB to 2MB/1GB
> >
> > Expected private-vs-shared page usage
> > -------------------------------------
> > On TD boot all pages are private and TD converts pages into shared if necessary.
> > * Most of the guest pages remain private.
> > * Only limited pages are converted at kernel boot
> > ** bounce buffer for IO (virtio). It's allocated as swiotlb. Its size is
> > 64MB or 6% of total guest memory.
> > ** KVM PV shared page. (the current guest TD doesn't use KVM PV shared page.)
> > * Only a small number of pages are dynamically converted from private to shared
> > and vice versa. This usage is very limited. e.g. GetQuote, the lack of
> > swiotlb buffer
> >
> >
> > Theory of Secure-EPT operations related to large page
> > -----------------------------------------------------
> > TDX Secure-EPT has differences from VMX EPT.
> > To add a page to Secure-EPT
> >
> > * Here is the operation to resolve the EPT violation.
> > 1. TD: Accepts GPA. TD needs to accept GPA before accessing GPA because TD
> > needs to detect that VMM unmaps GPA and maps GPA again.
> > 2. EPT violation is triggered. TD exit to VMM.
> > 3. VMM: allocate a page for GPA and TDH.MEM.PAGE.AUG it to GPA. Resume TD vcpu.
> > (3a. TD: #VE<EPT violation> is injected. #VE handler accepts the page)
> > 4. TD: resume #VE and continue TD vcpu execution
> >
> > The TD may skip step 1. In that case, after step 3, #VE is injected into the
> > TD and the TD's #VE handler needs to accept the page.
> >
> > When adding a page to Secure-EPT again, the page contents are cleared and the
> > page is encrypted. If a page is disassociated from Secure-EPT and added again,
> > the page content is lost.
> >
> > * TDG.VP.VMCALL<MapGPA> hypercall
> > The page associated with GPA can be private or shared. TD converts the GPA by
> > TDG.VP.VMCALL<MapGPA> hypercall from private to shared or vice versa. VMM
> > tracks whether the given GPA is private or shared.
> >
> > * mapping merge(promote)/split(demote)
> > The page can be mapped as large page (2MB or 1GB) in addition to 4KB. The
> > mapping can be merged(4KB/2MB -> 2MB/1GB) or split(2MB/1GB -> 4KB/2MB) by TDX
> > SEAMCALL TDH.MEM.PAGE.PROMOTE and TDH.MEM.PAGE.DEMOTE.
> > Unlike VMX EPT, merging a mapping requires that all the pages be mapped,
> > because of encryption. This implies the current KVM implementation doesn't work
> > for TDX when merging mapping as follows
> >
> > - EPT violation and host page is 2MB mappable.
> > some of the 4KB pages of the given 2MB page are already mapped, some not.
> > i.e. 2MB EPT -> 4KB EPT -> 4K pages
> > - KVM page fault handler zaps the 2MB EPT entry and populates the 2MB EPT entry
> > zap: 2MB EPT: non present
> > populate 2MB: -> 2MB page
> >
> > If VMM zaps 2MB Secure-EPT entry, the page contents will be lost for TDX.
> > Mapping merge requires all pages are already mapped.
> >
> > Instead, the following steps are needed.
> > - EPT violation and host page is 2MB mappable.
> > some of the 4KB pages of the given 2MB page are already mapped. Some not.
> > i.e. 2MB EPT -> 4KB EPT -> 4K pages
> > - VMM checks all 4KB GPAs are private. If not, it can't be mapped as a large page.
> > (****)
> > - VMM checks all 4KB GPAs are already mapped. If not, give up mapping merge.
> > (or map missing 4KB pages.)
> > - mapping merge by TDH.MEM.PAGE.PROMOTE
> >
> > The mapping split for TDX Secure-EPT works similarly to the VMX EPT case.
> >
> >
> > EPT violation and MapGPA
> > ------------------------
> > - EPT violation is a fast path
> > - MapGPA is not a fast path.
> > => Keep the EPT violation path optimized and complicate the MapGPA path. For
> > the (****) check, we don't want to scan the 4KB mapping on EPT violation.
> > Instead, the MapGPA path scans it and records whether the page can be mapped
> > as 2MB with respect to private/shared.
>
> This sounds reasonable. Instead of tracking that in MapGPA, maybe
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION introduced in UPM v7 is a better
> place to put the scan code.
>
> https://lkml.org/lkml/2022/7/6/259
>
> Both the MapGPA (explicit conversion) and the EPT violation (implicit
> conversion) can cause invocation of these two ioctls and need to update
> this info.
>
> >
> >
> > Tracking private/shared and large page mappable
> > -----------------------------------------------
> > VMM needs to track whether each page is mapped as private or shared at 4KB
> > granularity. For efficiency of the EPT violation path (****), at the 2MB and
> > 1GB levels, VMM should track whether the page can be mapped as a large page
> > (regarding private/shared). VMM updates it on MapGPA and references it on the
> > EPT violation path. (****)
> >
> > For 4KB pages, 1 bit is needed: private or shared. Let's call it shared-mask bit.
> > For 2MB/1GB pages, 2 bits are needed: large page mappable or not, and private or
> > shared if mappable. Let's call the first one the no-largepage bit.
>
> I'm just thinking maybe we don't need to introduce new bits. Instead, we can
> reuse lpage_info, which we already use to track whether a page can be
> mapped at a specified page level in kvm_mmu_max_mapping_level(). Then in
> the above two ioctls we do a scan for each level and update lpage_info.
> For example, we should disallow_lpage if private/shared pages are mixed
> at that page level.
>
> It's however a bit tricky to manage lpage_info.disallow_lpage in these
> two ioctls with the current code. We can't simply do disallow_lpage++ and
> disallow_lpage--. One possible solution is to treat disallow_lpage as a
> mask instead of a count. Then we define bits like below for use:
> - USER_GFN_UNALIGNED set when memslot user_address/private_offset/gfn
> is not aligned on the page level
> - PAGE_TRACKING set during page tracking
> - PRIVITE_SHARED_MIXED set when private/shared pages are mixed
>
> In page fault handler the page can be mapped at that level only when all
> bits are zero and in above two ioctls we just switch on/off bit
> PRIVITE_SHARED_MIXED.
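A minimal sketch of the mask idea quoted above, for illustration only. The
LPAGE_* bit names and struct lpage_state are hypothetical stand-ins for the
proposed bits and for kvm_lpage_info; the real field is currently a count.

```c
#include <stdbool.h>

/* Hypothetical bit names following the three reasons listed above. */
#define LPAGE_USER_GFN_UNALIGNED	(1u << 0)
#define LPAGE_PAGE_TRACKING		(1u << 1)
#define LPAGE_PRIVATE_SHARED_MIXED	(1u << 2)

/* Hypothetical stand-in for the per-level large page info. */
struct lpage_state {
	unsigned int disallow_lpage;	/* bit mask of reasons, not a count */
};

/* Called from the conversion path after rescanning the range. */
static void lpage_set_mixed(struct lpage_state *s, bool mixed)
{
	if (mixed)
		s->disallow_lpage |= LPAGE_PRIVATE_SHARED_MIXED;
	else
		s->disallow_lpage &= ~LPAGE_PRIVATE_SHARED_MIXED;
}

/* Page fault path: a large page is allowed only when no reason bit is set. */
static bool lpage_allowed(const struct lpage_state *s)
{
	return s->disallow_lpage == 0;
}
```

Setting a bit is idempotent, which is what makes repeated conversions safe
where ++/-- would drift. The trade-off is that a single bit cannot express
nesting, so any reason that relies on counting today would need separate
handling.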
So steal 1 or 2 bits from kvm_lpage_info.disallow_lpage instead of adding one more
array in struct kvm_arch_memory_slot. Nice idea. Let's call it option A.1).
With option A.), we increment/decrement disallow_lpage. With option A.1), it's
handled automatically.
pros:
+SPTE_SHARED_MASK is not needed
cons:
-one more look-up on EPT violation
> Currently UPM doesn't have this code yet, but it can be added if feasible.
Anyway let me integrate UPM v7.
Thanks,
> Chao
> >
> > Option A.)
> > Allocate array for pages in struct kvm_arch_memory_slot on TD creation.
> > struct kvm_arch_memory_slot {
> > +struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> > }
> >
> > pros:
> > +straight forward implementation
> > +SPTE_SHARED_MASK is not needed
> > cons:
> > -memory overhead is high
> > -not optimized for expected usage
> > -one more look-up on EPT violation
> >
> > Option B.) Steal two software usable bits from SPTE and record them in SPTE.
> > SPTE_SHARED_MASK, SPTE_NOLARGE_PAGE_MASK
> > pros:
> > +optimized for EPT violation
> > cons:
> > -2bits used in SPTE entry
> > -complicates the MapGPA path.
> >
> > Option C.) Steal one software usable bit from SPTE and record it in SPTE.
> > SPTE_SHARED_MASK
> > For 2MB/1GB, allocate bitmap in kvm_mmu_page.
> > struct kvm_mmu_page {
> > bitmap nolarge
> > }
> > pros:
> > +optimized for EPT violation
> > cons:
> > -complicates the MapGPA path.
> > -information is scattered in SPTE and struct kvm_mmu_page
> >
> >
> > How to update those bits
> > ------------------------
> > - MapGPA
> > - at 4KB level, set or clear shared-mask bit.
> > - Scan 512 4KB bits; at 2MB level
> > - set or clear shared-mask bit, clear no-largepage bit or
> > - clear shared-mask bit, set no-largepage bit
> > - increment/decrement lpageinfo to prevent/allow large page
> > - similar for 1GB level
> > Note: This logic might be a bit tricky.
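One possible reading of the 2MB-level step, as a user-space sketch. It assumes
a plain array holding the 512 per-4KB shared bits that one 2MB entry covers;
the struct and function names are hypothetical, and the lpage_info
increment/decrement is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_2MB 512

/* Hypothetical per-2MB state derived from the 512 4KB entries. */
struct level2m_bits {
	bool shared;   /* shared-mask bit */
	bool nolarge;  /* no-largepage bit (children are mixed) */
};

/*
 * Scan the 512 4KB shared bits that one 2MB entry covers.
 * Uniform children: copy their shared bit, clear no-largepage.
 * Mixed children:   clear shared-mask, set no-largepage.
 */
static struct level2m_bits scan_2mb(const bool shared_4k[PTES_PER_2MB])
{
	struct level2m_bits r = { .shared = shared_4k[0], .nolarge = false };

	for (size_t i = 1; i < PTES_PER_2MB; i++) {
		if (shared_4k[i] != shared_4k[0]) {
			r.shared = false;
			r.nolarge = true;
			break;
		}
	}
	return r;
}
```

The 1GB level would repeat the same scan over the 512 2MB results, treating a
child with no-largepage set as mixed.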
> >
> > - EPT violation
> > - If 2MB large page is allowed, check if no-largepage bit
> > - If no-largepage bit is set, => go down to 4KB page
> > - If no-largepage bit is cleared => try to map 2MB page
> > - If 4KB level is not mapped, map 2MB page
> > - If some 4KB level is already mapped, go down to 4KB.
> > Don't try to merge mapping. Or it's possible to try to merge mapping.
> > Note: 512 4KB entry scanning is not done at EPT violation because it's fast
> > path.
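The EPT violation side can be sketched as a pure decision function. This
assumes the fault path already has the three inputs at hand without scanning
the 4KB entries; all names are hypothetical and the "try to merge" variant is
not modeled.

```c
#include <stdbool.h>

enum map_level { MAP_4K, MAP_2M };

/* Hypothetical inputs the fault path would already have at hand. */
struct fault_2m_state {
	bool lpage_allowed;   /* 2MB large page is allowed (lpage_info) */
	bool nolarge;         /* no-largepage bit in the 2MB entry */
	bool any_4k_mapped;   /* some 4KB entry under it is already mapped */
};

static enum map_level pick_level(const struct fault_2m_state *s)
{
	if (!s->lpage_allowed || s->nolarge)
		return MAP_4K;
	/* No merging attempted here: existing 4KB mappings force 4KB. */
	return s->any_4k_mapped ? MAP_4K : MAP_2M;
}
```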
> >
> >
> > Map merging
> > -----------
> > Map merging is necessary for TD migration. (Map split is the easy part.) The
> > current KVM implementation zaps the range (mmu notification or lpage recovery
> > worker) and expects large page mapping on the next EPT violation.
> >
> > Option A.) Keep the code similar to map merging logic.
> > Zap 2MB EPT entry in some sense and trigger map merging logic on the next EPT
> > violation. To keep encrypted page contents, zapped EPT entries need to keep
> > the page. Steal one more bit from SPTE, SPTE_PRIVATE_BLOCKED_MASK.
> > It means that the page is zapped from SPTE, but it's still alive and
> > references the page.
> >
> > Option B.) In the callback, directly merge mapping somehow. In this case, mmu
> > notifier usage doesn't make sense.
> >
> > NOTE:
> > - Implement map merging in MapGPA. This doesn't work for dirty page logging.
> > - We can utilize kvm_nx_lpage_recovery_worker
> > - We can utilize THP. Probably doesn't work well for fd-based private memory.
> >
> > Thanks,
> > Isaku Yamahata
> >
> > On Mon, Jun 27, 2022 at 02:52:52PM -0700,
> > [email protected] wrote:
> >
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > KVM TDX basic feature support
> > >
> > > Hello. This is v7 of the patch series for KVM TDX support.
> > > This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> > > The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> > > How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM
> > >
> > > Major changes from v6:
> > > - rebased to v5.19 base
> > >
> > > TODO:
> > > - integrate fd-based guest memory. As the discussion is still on-going, I
> > > intentionally dropped fd-based guest memory support for now. The integration can
> > > be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> > > - 2M large page support. It's work-in-progress.
> > > For large page support, there are several design choices. Here are the design options.
> > > Any thoughts/feedback?
> > >
> > > KVM MMU Large page support for TDX
> > >
> > > * What needs to be done
> > > - Track private or shared of each page size (4KB, 2MB, 1GB) based on
> > > TDG.VP.VMCALL<MapGPA>. For large pages(2MB, 1GB), it can be mixed (some
> > > lower-size pages are private and some shared.) In this case, the page can't
> > > be large.
> > > - if necessary, split large page on TDG.VP.VMCALL<MapGPA>
> > > (split on dirty page tracking is future work)
> > > - resolving KVM page fault
> > > When resolving a private page and the page is large in the host, GPA can be
> > > resolved as a large page in Secure-EPT. Even if the page is large on the host
> > > side, sometimes a 4KB page can be resolved because it's up to guest TD to
> > > accept at 4KB, 2MB, or 1GB.
> > > - collapsing pages into a large page.
> > > At this point, it's okay to not implement this. When dirty page tracking is
> > > supported, this needs to be supported.
> > > - On MapGPA, the page can be collapsed into a large page
> > > - handle zapping SPTE and try to collapse the pages on the next KVM page fault
> > > Unlike the EPT case, some trick is needed.
> > > - For performance, optimize KVM page fault path at the cost of complicating
> > > MapGPA path.
> > >
> > > * options to track private or shared
> > > At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> > > 1GB case). For each 4KB page, 1 bit is needed (private or shared). For
> > > large pages (2MB and 1GB), 2 bits per large page are needed (private, shared, or
> > > mixed). When resolving a KVM page fault, we don't want to check the lower-size
> > > pages to see if the given GPA can be a large page, for performance. Check it on
> > > MapGPA instead.
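For a feel of the sizes involved, a small arithmetic sketch. It only counts the
raw bits stated above (1 bit per 4KB page, 2 bits per 2MB and per 1GB page) for
a slot assumed to be 1GB-aligned; the helper name is made up and no real
allocation granularity is modeled.

```c
#include <stdint.h>

#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

/* Raw tracking bits for a slot of slot_bytes (assumed 1GB-aligned). */
static uint64_t tracking_bits(uint64_t slot_bytes)
{
	return slot_bytes / SZ_4K * 1 +   /* private or shared */
	       slot_bytes / SZ_2M * 2 +   /* private, shared, or mixed */
	       slot_bytes / SZ_1G * 2;    /* private, shared, or mixed */
}
```

For a 1GB slot this gives 262144 + 1024 + 2 bits, i.e. about 32 KiB, almost all
of it from the 4KB level.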
> > >
> > > Option A). enhance kvm_arch_memory_slot
> > > enum kvm_page_type {
> > > KVM_PAGE_TYPE_INVALID,
> > > KVM_PAGE_TYPE_SHARED,
> > > KVM_PAGE_TYPE_PRIVATE,
> > > KVM_PAGE_TYPE_MIXED,
> > > };
> > >
> > > struct kvm_page_attr {
> > > enum kvm_page_type type;
> > > };
> > >
> > > struct kvm_arch_memory_slot {
> > > + struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> > >
> > > Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> > > If !SPTE_MIXED_MASK, it can be large page.
> > >
> > > Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> > > kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
> > >
> > >
> > > * comparison
> > > A).
> > > + straightforward to implement
> > > + SPTE_SHARED_MASK isn't needed
> > > - memory overhead compared to B). or C).
> > > - more memory reference on KVM page fault
> > >
> > > B).
> > > + simpler than C) (more complex than A)?)
> > > + efficient on KVM page fault. (only SPTE reference)
> > > + low memory overhead
> > > - Waste precious SPTE bits.
> > >
> > > C).
> > > + efficient on KVM page fault. (only SPTE reference)
> > > + low memory overhead
> > > - complicates MapGPA
> > > - scattered data structure
> > >
> > > Thanks,
> > > Isaku Yamahata
> > >
> > > Changes from v6:
> > > - rebased to v5.19
> > >
> > > Changes from v5:
> > > - export __seamcall and use it
> > > - move mutex lock from callee function of smp_call_on_cpu to the caller.
> > > - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> > > - updated comment
> > > - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> > > compatibility to only return success
> > > - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> > > - make this ioctl systemwide ioctl
> > > - ABI change to struct kvm_init_vm
> > > - guest_tsc_khz: use kvm->arch.default_tsc_khz
> > > - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> > > - drop exporting kvm_set_tsc_khz().
> > > - fix kvm_tdp_page_fault() for mtrr emulation
> > > - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> > > - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> > > keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> > > - update commit message
> > > - rename shadow_init_value => shadow_nonpresent_value
> > > - added ept_violation_ve_test mode
> > > - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> > > - legacy MMU case
> > > => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> > > - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> > > - #VE warning:
> > > - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> > > - Like we discussed, this patch should be merged with patch
> > > "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> > > - fix pointed out by Sagi: !is_private check => (kvm_gfn_shared_mask && !is_private)
> > > - introduce kvm_gfn_for_root(kvm, root, gfn)
> > > - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> > > - use kvm_arch_dirty_log_supported()
> > > - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> > > - rename: is_private_prohibit_spte() => spte_shared_mask()
> > > - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> > > - dropped this patch as the change was merged into kvm/queue
> > > - update vt_apicv_post_state_restore()
> > > - use is_64_bit_hypercall()
> > > - comment: expand MSMI -> Machine Check System Management Interrupt
> > > - fixed TDX_SEPT_PFERR
> > > - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> > > - rename tdvmcall_exit_reason() => tdvmcall_leaf()
> > > - remove optional zero check of argument.
> > > - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> > > in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> > > - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> > > - introduce vcpu_deliver_init to x86_ops
> > > - sprinkled KVM_BUG_ON()
> > >
> > > Changes from v4:
> > > - rebased to TDX host kernel patch series.
> > > - include all the patches to make this patch series working.
> > > - add [MARKER] patches to mark the patch layer clear.
> > >
> > > ---
> > > * What's TDX?
> > > TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
> > > Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> > > Domain (TD) for confidential computing.
> > >
> > > A TD runs in a CPU mode that is designed to protect the confidentiality of its
> > > memory contents and its CPU state from any other software, including the hosting
> > > Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> > >
> > > We have more detailed explanations below (***).
> > > We have the high-level design of TDX KVM below (****).
> > >
> > > In this patch series, we use "TD" or "guest TD" to differentiate it from the
> > > current "VM" (Virtual Machine), which is supported by KVM today.
> > >
> > >
> > > * The organization of this patch series
> > > This patch series is on top of the patch series "TDX host kernel support":
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > This patch series is available at
> > > https://github.com/intel/tdx/releases/tag/kvm-upstream
> > > The corresponding patches to qemu are available at
> > > https://github.com/intel/qemu-tdx/commits/tdx-upstream
> > >
> > > The relations of the layers are depicted as follows.
> > > The arrows below show the order of patch reviews we would like to have.
> > >
> > > The below layers are chosen so that the device model, for example, qemu can
> > > exercise each layering step by step. Check if TDX is supported, create TD VM,
> > > create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> > > vcpu exits/hypercalls/interrupts to run TD fully.
> > >
> > >                       TDX vcpu
> > >                       interrupt/exits/hypercall<------------\
> > >                               ^                             |
> > >                               |                             |
> > >                       TD finalization                       |
> > >                               ^                             |
> > >                               |                             |
> > >                       TDX EPT violation<------------\       |
> > >                               ^                     |       |
> > >                               |                     |       |
> > >                       TD vcpu enter/exit            |       |
> > >                               ^                     |       |
> > >                               |                     |       |
> > >                       TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
> > >                               ^                     |                       ^
> > >                               |                     |                       |
> > >                       TD VM creation/destruction    \---------------KVM TDP MMU hooks
> > >                               ^                                             ^
> > >                               |                                             |
> > >                       TDX architectural definitions                 KVM TDP refactoring for TDX
> > >                               ^                                             ^
> > >                               |                                             |
> > > TDX, VMX     <--------TDX host kernel                               KVM MMU GPA stolen bits
> > > coexistence           support
> > >
> > >
> > > The following are explanations of each layer. Each layer has a dummy commit
> > > that starts with [MARKER] in the subject. It is intended to help to identify where
> > > each layer starts.
> > >
> > > TDX host kernel support:
> > > https://lore.kernel.org/lkml/[email protected]/
> > > The guts of system-wide initialization of TDX module. There is an
> > > independent patch series for host x86. TDX KVM patches call functions
> > > this patch series provides to initialize the TDX module.
> > >
> > > TDX, VMX coexistence:
> > > Infrastructure to allow TDX to coexist with VMX and trigger the
> > > initialization of the TDX module.
> > > This layer starts with
> > > "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> > > TDX architectural definitions:
> > > Add TDX architectural definitions and helper functions
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> > > TD VM creation/destruction:
> > > Guest TD creation/destruction: allocation and release of TDX specific VM
> > > and vcpu structures. Create an initial guest memory image with TDX
> > > measurement.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> > > TD vcpu creation/destruction:
> > > Guest TD creation/destruction: allocation and release of TDX specific VM
> > > and vcpu structures. Create an initial guest memory image with TDX
> > > measurement.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> > > TDX EPT violation:
> > > Create an initial guest memory image with TDX measurement. Handle
> > > secure EPT violations to populate guest pages with TDX SEAMCALLs.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> > > TD vcpu enter/exit:
> > > Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
> > > entering into TD. Restore CPU state after exiting from TD.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> > > TD vcpu interrupts/exit/hypercall:
> > > Handle various exits/hypercalls and allow interrupts to be injected so
> > > that TD vcpu can continue running.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> > >
> > > KVM MMU GPA shared bit:
> > > Introduce framework to handle the shared bit of GPA. TDX
> > > repurposed a bit of GPA to indicate shared or private. If it's shared,
> > > it's the same as the conventional VMX EPT case. VMM can access shared
> > > guest pages. If it's private, it's handled by Secure-EPT and the guest
> > > page is encrypted.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> > > KVM TDP refactoring for TDX:
> > > TDX Secure EPT requires different constants, e.g. the initial value of
> > > an EPT entry. Various refactoring for those differences.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> > > KVM TDP MMU hooks:
> > > Introduce framework to TDP MMU to add hooks in addition to direct EPT
> > > access. TDX added Secure EPT, which is an enhancement to VMX EPT. Unlike
> > > conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> > > use TDX SEAMCALLs to operate on Secure EPT.
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> > > KVM TDP MMU MapGPA:
> > > Introduce framework to handle switching guest pages from private/shared
> > > to shared/private. For a given GPA, a guest page can be assigned to a
> > > private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
> > > guest TD converts GPA assignments from private (or shared) to shared (or
> > > private).
> > > This layer starts with
> > > "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA"
> > >
> > > KVM guest private memory: (not shown in the above diagram)
> > > [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> > > memory: https://lkml.org/lkml/2022/1/18/395
> > > Guest private memory requires different memory management in KVM. The
> > > patch proposes a way for it. Integration with TDX KVM.
> > >
> > > (***)
> > > * TDX module
> > > A CPU-attested software module called the "TDX module" is designed to implement
> > > the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> > > loaded by the kernel or driver at runtime, but in this patch series we assume
> > > that the TDX module is already loaded and initialized.
> > >
> > > The TDX module provides two main new logical modes of operation built upon the
> > > new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> > > architecture. TDX root mode is mostly identical to the VMX root operation mode,
> > > and the TDX functions (described later) are triggered by the new SEAMCALL
> > > instruction with the desired interface function selected by an input operand
> > > (leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
> > > non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> > > operation (i.e. guest VM), with changes and restrictions to better assure that
> > > no other software or hardware has direct visibility of the TD memory and state.
> > >
> > > TDX transitions between TDX root operation and TDX non-root operation include TD
> > > Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> > > TDX root mode. A TD Exit might be asynchronous, triggered by some external
> > > event (e.g., external interrupt or SMI) or an exception, or it might be
> > > synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> > >
> > > TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> > > of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> > > Domain Host. Those host-side TDX interface functions are categorized into
> > > various areas just for better organization, such as SYS (TDX module management),
> > > MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> > > etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
> > >
> > > TDCS (Trust Domain Control Structure) is the main control structure of a guest
> > > TD, and is encrypted (using the guest TD's ephemeral private key). At a high
> > > level, TDCS holds information for controlling TD operation as a whole,
> > > execution, EPTP, MSR bitmaps, etc. that KVM needs to set up. Note that MSR
> > > bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> > > same value for all VCPUs of the same TD.
> > >
> > > Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> > > TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
> > > the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> > > DMA access, accessible only by using the TDX module interface functions (such as
> > > TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> > > such as virtual APIC page, virtualization exception information, etc.
> > >
> > > Several VMX control structures (such as Shared EPT and Posted interrupt
> > > descriptor) are directly managed and accessed by the host VMM. These control
> > > structures are pointed to by fields in the TD VMCS.
> > >
> > > The above means that 1) KVM needs to allocate different data structures for TDs,
> > > 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> > > define TD-specific handling for others. For 3), redirect operations to the TDX
> > > specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
> > > vmx_callback();".
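Spelled out as a toy user-space sketch, the redirection looks like the
following. is_td_vcpu(), tdx_callback() and vmx_callback() are the placeholder
names from the text; the struct and the vt_callback() wrapper name are
assumptions for illustration, not the real KVM definitions.

```c
#include <stdbool.h>

/* Toy stand-in; the real struct kvm_vcpu is far richer. */
struct vcpu_sketch {
	bool is_td;
};

static bool is_td_vcpu(const struct vcpu_sketch *vcpu)
{
	return vcpu->is_td;
}

/* Placeholder backends; return values only tag which one ran. */
static int tdx_callback(struct vcpu_sketch *vcpu)
{
	(void)vcpu;
	return 1;
}

static int vmx_callback(struct vcpu_sketch *vcpu)
{
	(void)vcpu;
	return 0;
}

/* Common entry point wired into the ops table; picks the backend per vcpu. */
static int vt_callback(struct vcpu_sketch *vcpu)
{
	if (is_td_vcpu(vcpu))
		return tdx_callback(vcpu);
	return vmx_callback(vcpu);
}
```

With one such wrapper per operation, the common KVM code keeps a single ops
table while VMX guests and TDs coexist in the same module.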
> > >
> > > * TD Private Memory
> > > TD private memory is designed to hold TD private content, encrypted by the CPU
> > > using the TD ephemeral key. An encryption engine holds a table of encryption
> > > keys, and an encryption key is selected for each memory transaction based on a
> > > Host Key Identifier (HKID). By design, the host VMM does not have access to the
> > > encryption keys.
> > >
> > > In the first generation of MKTME, HKID is "stolen" from the physical address by
> > > allocating a configurable number of bits from the top of the physical
> > > address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> > > accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> > > HKID on the host so that MKTME can be opaque or bypassed on the host.
> > >
> > > During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> > > as either shared or private, based on the value of a new SHARED bit in the Guest
> > > Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
> > > (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> > > VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> > > with the current VMX. Since guest TDs usually require I/O, and the data exchange
> > > needs to be done via shared memory, KVM needs to use the current EPT
> > > functionality even for TDs.
> > >
> > > * Secure EPT and Mirroring using the TDP code
> > > The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
> > > pages are encrypted and integrity-protected with the TD's ephemeral private
> > > key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> > > interface functions, and thus conceptually Secure EPT is a subset of EPT (why
> > > "subset"). Since execution of such interface functions takes much longer
> > > than accessing memory directly, in KVM we use the existing TDP code to mirror the
> > > Secure EPT for the TD.
> > >
> > > This way, we can effectively walk Secure EPT without using the TDX interface
> > > functions.
> > >
> > > * VM life cycle and TDX specific operations
> > > The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
> > > example, a TD needs to boot in private memory, and the host software cannot copy
> > > the initial image to private memory.
> > >
> > > * TSC Virtualization
> > > The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> > > (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> > > by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
> > > owns TSC virtualization for VMs, but the TDX module does for TDs.
> > >
> > > * MCE support for TDs
> > > The TDX module doesn't allow the VMM to inject MCE. Instead, a PV way is needed
> > > for the TD to communicate with the VMM. For now, KVM silently ignores MCE
> > > requests by the VMM. MSRs
> > > related to MCE (e.g, MCE bank registers) can be naturally emulated by
> > > paravirtualizing MSR access.
> > >
> > > For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
> > > available.
> > >
> > > * Restrictions or future work
> > > Some features are not included to reduce patch size. Those features are
> > > addressed as future independent patch series.
> > > - large page (2M, 1G)
> > > - qemu gdb stub
> > > - guest PMU
> > > - and more
> > >
> > > * Prerequisites
> > > It's required to load the TDX module and initialize it. It's out of the scope
> > > of this patch series. Another independent patch for the common x86 code is
> > > planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> > > CONFIG_INTEL_TDX_HOST. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
> > > module is initialized and the TDX module APIs for the TDX guest life cycle,
> > > like tdh.mng.init, are ready for KVM to use.
> > >
> > > Concretely, global initialization, LP (Logical Processor) initialization, global
> > > configuration, the key configuration, and TDMR and PAMT initialization are done.
> > > The state of the TDX module is SYS_READY. Please refer to the TDX module
> > > specification, the chapter "Intel TDX Module Lifecycle State Machine".
> > >
> > > ** Detecting the TDX module readiness.
> > > TDX host patch series implements the detection of the TDX module availability
> > > and its initialization so that KVM can use it. Also it manages Host KeyID
> > > (HKID) assigned to guest TD.
> > > The assumed APIs the TDX host patch series provides are
> > > - int seamrr_enabled()
> > > Check if the required CPU feature (SEAM mode) is available. This only checks CPU
> > > feature availability. At this point, the TDX module may not be ready for KVM
> > > to use.
> > > - int init_tdx(void);
> > > Initialization of TDX module so that the TDX module is ready for KVM to use.
> > > - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> > > Return the system wide information about the TDX module. NULL if the TDX
> > > isn't initialized.
> > > - u32 tdx_get_global_keyid(void);
> > > Return global key id that is used for the TDX module itself.
> > > - int tdx_keyid_alloc(void);
> > > Allocate HKID for guest TD.
> > > - void tdx_keyid_free(int keyid);
> > > Free HKID for guest TD.
> > >
> > > (****)
> > > * TDX KVM high-level design
> > > - Host key ID management
> > > Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> > > It is assumed that the TDX host patch series implements the necessary functions,
> > > u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void) and,
> > > void tdx_keyid_free(int keyid).
> > >
> > > - Data structures and VM type
> > > Because TDX is different from VMX, define its own VM/VCPU structures, struct
> > > kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx. To
> > > identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> > > TDX, is used.
> > >
> > > - VM life cycle and TDX specific operations
> > > Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> > > New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> > > parameters, set initial guest memory and measurement.
> > >
> > > The creation of a TDX VM requires five operations in addition to the
> > > conventional VM creation.
> > > - Get KVM system capability to check if TDX VM type is supported
> > > - VM creation (KVM_CREATE_VM)
> > > - New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
> > > - New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
> > > - VCPU creation (KVM_CREATE_VCPU)
> > > - New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
> > > - New: Initialize guest memory as boot state and extend the measurement with
> > > the memory. KVM_TDX_INIT_MEM_REGION.
> > > - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> > > TDX VM contents.
> > > - VCPU RUN (KVM_RUN)
> > >
> > > - Protected guest state
> > > Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> > > can't operate on them. For example, accessing CPU registers, injecting
> > > exceptions, and accessing guest memory. Those operations are silently
> > > ignored, or return zero or the initial reset value, when requested via
> > > KVM API ioctls.
> > >
> > > VM/VCPU state and callbacks for TDX specific operations.
> > > Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
> > > operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".
> > >
> > > Operations on the CPU state
> > > Silently ignore operations on the guest state. For example, the write to
> > > CPU registers is ignored and the read from CPU registers returns 0.
> > >
> > > . ignore access to CPU registers except for allowed ones.
> > > . TSC: add a check if tsc is immutable and return an error. Because the KVM
> > > implementation updates the internal tsc state and it's difficult to back
> > > out those changes. Instead, skip the logic.
> > > . dirty logging: add check if dirty logging is supported.
> > > . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> > >
> > > Note: virtual external interrupt and NMI can be injected into TDX guests.
> > >
> > > - KVM MMU integration
> > > One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> > > the guest physical address is private (the bit is cleared) or shared (the bit is
> > > set). The bits are called stolen bits.
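As a small user-space illustration of the stolen bit, assuming the bit position
(51 or 47) is passed in by the caller. The helper names are hypothetical and
are not the helpers from the patch series.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* shared_bit is 51 or 47 depending on the guest configuration (assumed input). */
static gpa_t gpa_shared_mask(unsigned int shared_bit)
{
	return (gpa_t)1 << shared_bit;
}

/* Private when the stolen bit is cleared, shared when it is set. */
static bool gpa_is_private(gpa_t gpa, unsigned int shared_bit)
{
	return !(gpa & gpa_shared_mask(shared_bit));
}

/* Strip the stolen bit to get the address both aliases refer to. */
static gpa_t gpa_strip_shared(gpa_t gpa, unsigned int shared_bit)
{
	return gpa & ~gpa_shared_mask(shared_bit);
}
```

A private GPA 0x1000 and its shared alias (0x1000 with bit 51 set) strip to the
same address, which is why the two aliases have to be tracked together.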
> > >
> > > - Stolen bits framework
> > > systematically tracks which guest physical address, shared or private, is
> > > used.
> > >
> > > - Shared EPT and secure EPT
> > > There are two EPTs. Shared EPT (the conventional one) and Secure
> > > EPT (the new one). Shared EPT is handled the same as the conventional
> > > EPT, for GPAs with the stolen bit set. Secure EPT points to private guest pages. To resolve
> > > EPT violation, KVM walks one of two EPTs based on faulted GPA.
> > > Because it's costly to access secure EPT during walking EPTs with
> > > SEAMCALLs for the private guest physical address, another private
> > > EPT is used as a shadow of Secure-EPT with the existing logic at
> > > the cost of extra memory.
> > >
> > > The following depicts the relationship.
> > >
> > >                     KVM                             |      TDX module
> > >                      |                              |           |
> > >         -------------+----------                    |           |
> > >         |                      |                    |           |
> > >         V                      V                    |           |
> > >    shared GPA             private GPA               |           |
> > > CPU shared EPT pointer  KVM private EPT pointer     |  CPU secure EPT pointer
> > >         |                      |                    |           |
> > >         |                      |                    |           |
> > >         V                      V                    |           V
> > >    shared EPT             private EPT--------mirror----->Secure EPT
> > >         |                      |                    |           |
> > >         |                      \--------------------+------\    |
> > >         |                                           |      |    |
> > >         V                                           |      V    V
> > > shared guest page                                   |  private guest page
> > >                                                     |
> > >                                                     |
> > >                               non-encrypted memory  |  encrypted memory
> > >                                                     |
> > >
> > > - Operating on Secure EPT
> > > Use the TDX module APIs to operate on Secure EPT. To call the TDX API
> > > while resolving an EPT violation, add hooks for the additional operations and
> > > wire them to the TDX backend.
> > >
> > > * References
> > >
> > > [1] TDX specification
> > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > > [2] Intel Trust Domain Extensions (Intel TDX)
> > > https://cdrdv2.intel.com/v1/dl/getContent/726790
> > > [3] Intel CPU Architectural Extensions Specification
> > > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> > > [4] Intel TDX Module 1.0 Specification
> > > https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> > > [5] Intel TDX Loader Interface Specification
> > > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > > [6] Intel TDX Guest-Hypervisor Communication Interface
> > > https://cdrdv2.intel.com/v1/dl/getContent/726790
> > > [7] Intel TDX Virtual Firmware Design Guide
> > > https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> > > [8] intel public github
> > > kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> > > TDX guest branch: https://github.com/intel/tdx/tree/guest
> > > qemu TDX https://github.com/intel/qemu-tdx
> > > [9] TDVF
> > > https://github.com/tianocore/edk2-staging/tree/TDVF
> > > This was merged into EDK2 main branch. https://github.com/tianocore/edk2
> > >
> > > Chao Gao (3):
> > > KVM: x86: Move check_processor_compatibility from init ops to runtime
> > > ops
> > > Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> > > arch funcs"
> > > KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> > > wrmsr
> > >
> > > Isaku Yamahata (72):
> > > KVM: Refactor CPU compatibility check on module initialiization
> > > x86/virt/vmx/tdx: export platform_tdx_enabled()
> > > KVM: TDX: Detect CPU feature on kernel module initialization
> > > KVM: x86: Refactor KVM VMX module init/exit functions
> > > KVM: TDX: Add placeholders for TDX VM/vcpu structure
> > > x86/virt/tdx: Add a helper function to return system wide info about
> > > TDX module
> > > KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> > > KVM: TDX: Make TDX VM type supported
> > > [MARKER] The start of TDX KVM patch series: TDX architectural
> > > definitions
> > > KVM: TDX: Define TDX architectural definitions
> > > KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> > > KVM: TDX: Add helper functions to print TDX SEAMCALL error
> > > [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> > > x86/cpu: Add helper functions to allocate/free TDX private host key id
> > > KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> > > KVM: TDX: Make pmu_intel.c ignore guest TD case
> > > [MARKER] The start of TDX KVM patch series: TD vcpu
> > > creation/destruction
> > > KVM: TDX: allocate/free TDX vcpu structure
> > > KVM: TDX: allocate/free TDX vcpu structure
> > > [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> > > KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> > > [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> > > TDX
> > > KVM: x86/mmu: Disallow fast page fault on private GPA
> > > KVM: VMX: Introduce test mode related to EPT violation VE
> > > [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> > > KVM: x86/mmu: Focibly use TDP MMU for TDX
> > > KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> > > KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> > > KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> > > [MARKER] The start of TDX KVM patch series: TDX EPT violation
> > > KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> > > KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> > > KVM: TDX: TDP MMU TDX support
> > > [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> > > KVM: x86/mmu: steal software usable git to record if GFN is for shared
> > > or not
> > > KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> > > [MARKER] The start of TDX KVM patch series: TD finalization
> > > KVM: TDX: Create initial guest memory
> > > KVM: TDX: Finalize VM initialization
> > > [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> > > KVM: TDX: Add helper assembly function to TDX vcpu
> > > KVM: TDX: Implement TDX vcpu enter/exit path
> > > KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> > > KVM: TDX: restore host xsave state when exit from the guest TD
> > > KVM: TDX: restore user ret MSRs
> > > [MARKER] The start of TDX KVM patch series: TD vcpu
> > > exits/interrupts/hypercalls
> > > KVM: TDX: complete interrupts after tdexit
> > > KVM: TDX: restore debug store when TD exit
> > > KVM: TDX: handle vcpu migration over logical processor
> > > KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> > > behavior
> > > KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> > > KVM: TDX: Implement interrupt injection
> > > KVM: TDX: Implements vcpu request_immediate_exit
> > > KVM: TDX: Implement methods to inject NMI
> > > KVM: TDX: Add a place holder to handle TDX VM exit
> > > KVM: TDX: handle EXIT_REASON_OTHER_SMI
> > > KVM: TDX: handle ept violation/misconfig exit
> > > KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> > > KVM: TDX: Add a place holder for handler of TDX hypercalls
> > > (TDG.VP.VMCALL)
> > > KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> > > KVM: TDX: Handle TDX PV CPUID hypercall
> > > KVM: TDX: Handle TDX PV HLT hypercall
> > > KVM: TDX: Handle TDX PV port io hypercall
> > > KVM: TDX: Implement callbacks for MSR operations for TDX
> > > KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> > > KVM: TDX: Handle TDX PV report fatal error hypercall
> > > KVM: TDX: Handle TDX PV map_gpa hypercall
> > > KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> > > KVM: TDX: Silently discard SMI request
> > > KVM: TDX: Silently ignore INIT/SIPI
> > > Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> > > KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> > >
> > > Rick Edgecombe (1):
> > > KVM: x86/mmu: Add address conversion functions for TDX shared bits
> > >
> > > Sean Christopherson (25):
> > > KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> > > KVM: Enable hardware before doing arch VM initialization
> > > KVM: x86: Introduce vm_type to differentiate default VMs from
> > > confidential VMs
> > > KVM: TDX: Add TDX "architectural" error codes
> > > KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> > > KVM: TDX: create/destroy VM structure
> > > KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> > > KVM: TDX: Do TDX specific vcpu initialization
> > > KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> > > KVM: x86/mmu: Allow non-zero value for non-present SPTE
> > > KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> > > KVM: x86/mmu: Allow per-VM override of the TDP max page level
> > > KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> > > private mmu
> > > KVM: x86/mmu: Disallow dirty logging for x86 TDX
> > > KVM: VMX: Split out guts of EPT violation to common/exposed function
> > > KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> > > KVM: TDX: Add load_mmu_pgd method for TDX
> > > KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> > > KVM: TDX: Add support for find pending IRQ in a protected local APIC
> > > KVM: x86: Assume timer IRQ was injected if APIC state is protected
> > > KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> > > argument
> > > KVM: VMX: Move NMI/exception handler to common helper
> > > KVM: x86: Split core of hypercall emulation to helper function
> > > KVM: TDX: Handle TDX PV MMIO hypercall
> > > KVM: TDX: Add methods to ignore accesses to CPU state
> > >
> > > Xiaoyao Li (1):
> > > KVM: TDX: initialize VM with TDX specific parameters
> > >
> > > Documentation/virt/kvm/api.rst | 30 +-
> > > .../virt/kvm/intel-tdx-layer-status.rst | 33 +
> > > Documentation/virt/kvm/intel-tdx.rst | 381 +++
> > > Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 ++++
> > > arch/arm64/kvm/arm.c | 2 +-
> > > arch/mips/kvm/mips.c | 14 +-
> > > arch/powerpc/kvm/powerpc.c | 2 +-
> > > arch/riscv/kvm/main.c | 2 +-
> > > arch/s390/kvm/kvm-s390.c | 2 +-
> > > arch/x86/events/intel/ds.c | 1 +
> > > arch/x86/include/asm/kvm-x86-ops.h | 10 +
> > > arch/x86/include/asm/kvm_host.h | 56 +-
> > > arch/x86/include/asm/tdx.h | 67 +
> > > arch/x86/include/asm/vmx.h | 14 +
> > > arch/x86/include/uapi/asm/kvm.h | 95 +
> > > arch/x86/include/uapi/asm/vmx.h | 5 +-
> > > arch/x86/kvm/Kconfig | 4 +
> > > arch/x86/kvm/Makefile | 3 +-
> > > arch/x86/kvm/irq.c | 3 +
> > > arch/x86/kvm/lapic.c | 37 +-
> > > arch/x86/kvm/lapic.h | 2 +
> > > arch/x86/kvm/mmu.h | 42 +-
> > > arch/x86/kvm/mmu/mmu.c | 360 ++-
> > > arch/x86/kvm/mmu/mmu_internal.h | 123 +-
> > > arch/x86/kvm/mmu/paging_tmpl.h | 5 +-
> > > arch/x86/kvm/mmu/spte.c | 46 +-
> > > arch/x86/kvm/mmu/spte.h | 65 +-
> > > arch/x86/kvm/mmu/tdp_iter.c | 1 +
> > > arch/x86/kvm/mmu/tdp_iter.h | 5 +-
> > > arch/x86/kvm/mmu/tdp_mmu.c | 690 ++++-
> > > arch/x86/kvm/mmu/tdp_mmu.h | 12 +-
> > > arch/x86/kvm/svm/svm.c | 13 +-
> > > arch/x86/kvm/vmx/common.h | 174 ++
> > > arch/x86/kvm/vmx/evmcs.c | 2 +-
> > > arch/x86/kvm/vmx/evmcs.h | 2 +-
> > > arch/x86/kvm/vmx/main.c | 1071 +++++++
> > > arch/x86/kvm/vmx/pmu_intel.c | 39 +-
> > > arch/x86/kvm/vmx/pmu_intel.h | 28 +
> > > arch/x86/kvm/vmx/posted_intr.c | 43 +-
> > > arch/x86/kvm/vmx/posted_intr.h | 13 +
> > > arch/x86/kvm/vmx/tdx.c | 2465 +++++++++++++++++
> > > arch/x86/kvm/vmx/tdx.h | 275 ++
> > > arch/x86/kvm/vmx/tdx_arch.h | 157 ++
> > > arch/x86/kvm/vmx/tdx_errno.h | 29 +
> > > arch/x86/kvm/vmx/tdx_error.c | 22 +
> > > arch/x86/kvm/vmx/tdx_ops.h | 188 ++
> > > arch/x86/kvm/vmx/vmenter.S | 146 +
> > > arch/x86/kvm/vmx/vmx.c | 737 ++---
> > > arch/x86/kvm/vmx/vmx.h | 39 +-
> > > arch/x86/kvm/vmx/x86_ops.h | 235 ++
> > > arch/x86/kvm/x86.c | 148 +-
> > > arch/x86/virt/vmx/tdx/seamcall.S | 2 +
> > > arch/x86/virt/vmx/tdx/tdx.c | 54 +-
> > > arch/x86/virt/vmx/tdx/tdx.h | 52 -
> > > include/linux/kvm_host.h | 4 +-
> > > include/uapi/linux/kvm.h | 2 +
> > > tools/arch/x86/include/uapi/asm/kvm.h | 95 +
> > > tools/include/uapi/linux/kvm.h | 1 +
> > > virt/kvm/kvm_main.c | 67 +-
> > > 59 files changed, 7877 insertions(+), 804 deletions(-)
> > > create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
> > > create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> > > create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> > > create mode 100644 arch/x86/kvm/vmx/common.h
> > > create mode 100644 arch/x86/kvm/vmx/main.c
> > > create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> > > create mode 100644 arch/x86/kvm/vmx/tdx.c
> > > create mode 100644 arch/x86/kvm/vmx/tdx.h
> > > create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> > > create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> > > create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> > > create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> > > create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> > >
> > > --
> > > 2.25.1
> > >
> >
> > --
> > Isaku Yamahata <[email protected]>
--
Isaku Yamahata <[email protected]>
On Fri, Jul 08, 2022 at 01:53:48PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > To Keep the case of non TDX intact, introduce a new config option for
> > private KVM MMU support. At the moment, this is synonym for
> > CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL. The new flag make it clear
> > that the config is only for x86 KVM MMU.
>
> What is the "new flag"?
Oops. "flag" should be "config". Will fix it. Thanks for pointing it out.
--
Isaku Yamahata <[email protected]>
On Tue, Jul 12, 2022 at 10:22:50AM -0700, Isaku Yamahata wrote:
> On Tue, Jul 12, 2022 at 06:54:19PM +0800,
> Chao Peng <[email protected]> wrote:
>
> > On Tue, Jul 12, 2022 at 01:07:20PM +0800, Chao Gao wrote:
> > > On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> > > >Hi. Because my description on large page support was terse, I wrote up more
> > > >detailed one. Any feedback/thoughts on large page support?
> > > >
> > > >TDP MMU large page support design
> > > >
> > > >Two main discussion points
> > > >* how to track page status. private vs shared, no-largepage vs can-be-largepage
> > >
> > > ...
> > >
> > > >
> > > >Tracking private/shared and large page mappable
> > > >-----------------------------------------------
> > > >VMM needs to track that page is mapped as private or shared at 4KB granularity.
> > > >For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> > > >track the page can be mapped as a large page (regarding private/shared). VMM
> > > >updates it on MapGPA and references it on the EPT violation path. (****)
> > >
> > > Isaku,
> > >
> > > + Peng Chao
> > >
> > > Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
> > > private or all shared?
> > >
> > > KVM always retrieves the mapping level in CR3 and enforces that EPT's
> > > page level is not greater than that in CR3. My point is if UPM already enforces
> > > no mixed pages in a large page, then KVM needn't do that again (UPM can
> > > be trusted).
> >
> > The backing store in the UPM can tell KVM which page level it can
> > support for a given private gpa, similar to host_pfn_mapping_level() for
> > shared address.
> >
> > However, this solely represents the backing store's capability, KVM
> > still needs additional info to decide whether that can be safely mapped
> > as 2M/1G, e.g. all the following pages in the 2M/1G range should be all
> > private, currently this is not something backing store can tell.
>
> This argument applies to shared GPA. The shared pages are backed by normal file
> mapping with UPM. When KVM is mapping shared GPA, the same check is needed. So
> I think KVM has to track all private or all shared or no-largepage at 2MB/1GB
> level. If UPM tracks shared-or-private at 4KB level, probably KVM may not need to
> track it at 4KB level.
Right, the same also applies to shared memory. All the info we need is
whether the pages of a 2M range are all private or all shared, i.e. not mixed.
UPM v7 has code tracking that in KVM. In previous versions we tracked it in
the backing store, which the discussion concluded was not a good idea.
Chao
>
>
> > Actually, in UPM v7 we let KVM record this info so one possible solution
> > is making use of it.
> >
> > https://lkml.org/lkml/2022/7/6/259
> >
> > Then to map a page as 2M, KVM needs to check:
> > - Memory backing store support that level
> > - All pages in 2M range are private as we recorded through
> > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> > - No existing partial 4K map(s) in 2M range
> --
> Isaku Yamahata <[email protected]>
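The three checks in the list above can be sketched in plain C. This is only an illustration under stated assumptions: struct range_2m_state and can_map_private_2m() are hypothetical names, not KVM or UPM code, and flat per-4KB arrays stand in for whatever tracking structure the final design uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512	/* 2MB / 4KB */

/* Hypothetical per-2MB-range state, for illustration only. */
struct range_2m_state {
	bool backing_supports_2m;	/* backing store can provide a 2MB page */
	bool is_private[PAGES_PER_2M];	/* attribute recorded per 4KB page */
	bool mapped_4k[PAGES_PER_2M];	/* a 4KB mapping already exists */
};

/* Return true only if all three conditions from the list hold. */
static bool can_map_private_2m(const struct range_2m_state *s)
{
	size_t i;

	if (!s->backing_supports_2m)
		return false;

	for (i = 0; i < PAGES_PER_2M; i++) {
		/* Every 4KB page in the range must be private. */
		if (!s->is_private[i])
			return false;
		/* No partial 4KB mapping may exist in the range. */
		if (s->mapped_4k[i])
			return false;
	}
	return true;
}
```

Scanning 512 entries on every fault is exactly the cost the thread wants to avoid, which is why the discussion moves toward recording the result per 2MB/1GB range when the attributes change.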
On Mon, Jun 27, 2022, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> KVM TDX basic feature support
>
> Hello. This is v7 of the patch series for KVM TDX support.
> This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM
>
> Major changes from v6:
> - rebased to v5.19 base
>
> TODO:
> - integrate fd-based guest memory. As the discussion is still on-going, I
> intentionally dropped fd-based guest memory support yet. The integration can
> be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> - 2M large page support. It's work-in-progress.
> For large page support, there are several design choices. Here is the design options.
> Any thoughts/feedback?
Apologies, I didn't read beyond the intro paragraph. In case something like this
comes up again, it's probably best to send a standalone email tagged RFC, I doubt
I'm the only one that missed this embedded RFC.
> KVM MMU Large page support for TDX
...
> * options to track private or shared
> At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> 1GB case). For 4KB each page, 1 bit per page is needed. private or shared. For
> large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
> mixed). When resolving KVM page fault, we don't want to check the lower-size
> pages to check if the given GPA can be a large for performance. On MapGPA check
> it instead.
>
> Option A). enhance kvm_arch_memory_slot
> enum kvm_page_type {
> KVM_PAGE_TYPE_INVALID,
> KVM_PAGE_TYPE_SHARED,
> KVM_PAGE_TYPE_PRIVATE,
> KVM_PAGE_TYPE_MIXED,
> };
>
> struct kvm_page_attr {
> enum kvm_page_type type;
> };
>
> struct kvm_arch_memory_slot {
> + struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
>
> Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> If !SPTE_MIXED_MASK, it can be large page.
>
> Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
>
>
> * comparison
> A).
> + straightforward to implement
> + SPTE_SHARED_MASK isn't needed
> - memory overhead compared to B). or C).
> - more memory reference on KVM page fault
>
> B).
> + simpler than C) (complex than A)?)
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - Waste precious SPTE bits.
>
> C).
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - complicates MapGPA
> - scattered data structure
Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
on insertion/removal to (dis)allow hugepages as needed.
+ efficient on KVM page fault (no new lookups)
+ zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
+ straightforward to implement
+ can (and should) be merged as part of the UPM series
I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
completely covered (fully shared) or not covered at all (fully private), but I'm
not 100% certain that xa_for_each_range() works the way I think it does.
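The "completely covered / not covered at all" test can be illustrated without the kernel XArray API. In the sketch below a plain bool array stands in for the Xarray of shared regions (true means an entry exists for that 4KB page), and enum range_kind and classify_range() are hypothetical names. It says nothing about how xa_for_each_range() actually iterates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum range_kind {
	RANGE_PRIVATE,	/* no page in the range is shared */
	RANGE_SHARED,	/* every page in the range is shared */
	RANGE_MIXED,	/* some pages shared, some private */
};

/*
 * Classify the npages-long range starting at index start.
 * shared[i] == true stands in for "an entry exists for page i".
 */
static enum range_kind classify_range(const bool *shared, size_t start,
				      size_t npages)
{
	size_t i, nr_shared = 0;

	assert(npages > 0);

	for (i = 0; i < npages; i++)
		if (shared[start + i])
			nr_shared++;

	if (nr_shared == 0)
		return RANGE_PRIVATE;
	if (nr_shared == npages)
		return RANGE_SHARED;
	return RANGE_MIXED;
}
```

Only RANGE_MIXED has to disallow a hugepage for the range. The other two results leave the decision to the existing lpage_info checks.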
On 7/14/2022 9:03 AM, Sean Christopherson wrote:
> On Mon, Jun 27, 2022, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> KVM TDX basic feature support
>>
>> Hello. This is v7 of the patch series for KVM TDX support.
>> This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
>> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
>> How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM
>>
>> Major changes from v6:
>> - rebased to v5.19 base
>>
>> TODO:
>> - integrate fd-based guest memory. As the discussion is still on-going, I
>> intentionally dropped fd-based guest memory support yet. The integration can
>> be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
>> - 2M large page support. It's work-in-progress.
>> For large page support, there are several design choices. Here is the design options.
>> Any thoughts/feedback?
>
> Apologies, I didn't read beyond the intro paragraph. In case something like this
> comes up again, it's probably best to send a standalone email tagged RFC, I doubt
> I'm the only one that missed this embedded RFC.
>
>> KVM MMU Large page support for TDX
>
> ...
>
>> * options to track private or shared
>> At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
>> 1GB case). For 4KB each page, 1 bit per page is needed. private or shared. For
>> large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
>> mixed). When resolving KVM page fault, we don't want to check the lower-size
>> pages to check if the given GPA can be a large for performance. On MapGPA check
>> it instead.
>>
>> Option A). enhance kvm_arch_memory_slot
>> enum kvm_page_type {
>> KVM_PAGE_TYPE_INVALID,
>> KVM_PAGE_TYPE_SHARED,
>> KVM_PAGE_TYPE_PRIVATE,
>> KVM_PAGE_TYPE_MIXED,
>> };
>>
>> struct kvm_page_attr {
>> enum kvm_page_type type;
>> };
>>
>> struct kvm_arch_memory_slot {
>> + struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
>>
>> Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
>> If !SPTE_MIXED_MASK, it can be large page.
I don't think this is a good option, since it requires all the mappings
to exist all the time in both the shared SPTE tree and the private SPTE tree.
>> Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
>> kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
>>
>>
>> * comparison
>> A).
>> + straightforward to implement
>> + SPTE_SHARED_MASK isn't needed
>> - memory overhead compared to B). or C).
>> - more memory reference on KVM page fault
>>
>> B).
>> + simpler than C) (complex than A)?)
>> + efficient on KVM page fault. (only SPTE reference)
>> + low memory overhead
>> - Waste precious SPTE bits.
>>
>> C).
>> + efficient on KVM page fault. (only SPTE reference)
>> + low memory overhead
>> - complicates MapGPA
>> - scattered data structure
>
> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> on insertion/removal to (dis)allow hugepages as needed.
UPM v7[1] introduces "struct xarray mem_attr_array" to track the
shared/private attr of a range.
So kvm_vm_ioctl_set_encrypted_region() needs to
- increase the lpage_info counter when a 2m/1g range changes from
uniform (all private or all shared) to mixed, and
- decrease the counter when it changes from mixed back to uniform
[1]:
https://lore.kernel.org/all/[email protected]/
>
> + efficient on KVM page fault (no new lookups)
> + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
> + straightforward to implement
> + can (and should) be merged as part of the UPM series
>
> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> completely covered (fully shared) or not covered at all (fully private), but I'm
> not 100% certain that xa_for_each_range() works the way I think it does.
On Mon, Jun 27, 2022 at 02:53:48PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
Subject: s/git/bit
>
> With TDX, all GFNs are private at guest boot time. At run time guest TD
> can explicitly change it to shared from private or vice-versa by MapGPA
> hypercall. Once specified, the given GFN can't be used the other way.
> That is, if a guest tells KVM that the GFN is shared, it can't be used
> as private, or vice versa.
>
> Steal software usable bit, SPTE_SHARED_MASK, for it from MMIO counter to
> record it. Use the bit SPTE_SHARED_MASK in shared or private EPT to
> determine which mapping, shared or private, is allowed. If requested
> mapping isn't allowed, return RET_PF_RETRY to wait for other vcpu to change
> it. The bit is recorded in both shared and private shadow page to avoid
> traverse one more shadow page when resolving KVM page fault.
>
> The bit needs to be kept over zapping the EPT entry. Currently the EPT
> entry is initialized SHADOW_NONPRESENT_VALUE unconditionally to clear
> SPTE_SHARED_MASK bit. To carry SPTE_SHARED_MASK bit, introduce a helper
> function to get initial value for zapped entry with SPTE_SHARED_MASK bit.
> Replace SHADOW_NONPRESENT_VALUE with it.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/spte.h | 17 +++++++---
> arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++++++------
> 2 files changed, 68 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 96312ab4fffb..7c1aaf0e963e 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -14,6 +14,9 @@
> */
> #define SPTE_MMU_PRESENT_MASK BIT_ULL(11)
>
> +/* Masks that used to track for shared GPA **/
> +#define SPTE_SHARED_MASK BIT_ULL(62)
> +
> /*
> * TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
> * be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
> @@ -104,7 +107,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
> * the memslots generation and is derived as follows:
> *
> * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
> - * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
> + * Bits 8-18 of the MMIO generation are propagated to spte bits 52-61
Should be 8-17.
> *
> * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
> * the MMIO generation number, as doing so would require stealing a bit from
> @@ -118,7 +121,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
> #define MMIO_SPTE_GEN_LOW_END 10
>
> #define MMIO_SPTE_GEN_HIGH_START 52
> -#define MMIO_SPTE_GEN_HIGH_END 62
> +#define MMIO_SPTE_GEN_HIGH_END 61
>
> #define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
> MMIO_SPTE_GEN_LOW_START)
> @@ -131,7 +134,7 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
> #define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
>
> /* remember to adjust the comment above as well if you change these */
> -static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
> +static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 10);
>
> #define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
> #define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
> @@ -208,6 +211,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
> +static_assert(!(__REMOVED_SPTE & SPTE_SHARED_MASK));
>
> /*
> * See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
> @@ -217,7 +221,12 @@ static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
>
> static inline bool is_removed_spte(u64 spte)
> {
> - return spte == REMOVED_SPTE;
> + return (spte & ~SPTE_SHARED_MASK) == REMOVED_SPTE;
> +}
> +
> +static inline u64 spte_shared_mask(u64 spte)
> +{
> + return spte & SPTE_SHARED_MASK;
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index fef6246086a8..4f279700b3cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -758,6 +758,11 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
> return 0;
> }
>
> +static u64 shadow_nonpresent_spte(u64 old_spte)
> +{
> + return SHADOW_NONPRESENT_VALUE | spte_shared_mask(old_spte);
> +}
> +
> static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter)
> {
> @@ -791,7 +796,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * SHADOW_NONPRESENT_VALUE (which sets "suppress #VE" bit) so it
> * can be set when EPT table entries are zapped.
> */
> - __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
> + __kvm_tdp_mmu_write_spte(iter->sptep,
> + shadow_nonpresent_spte(iter->old_spte));
>
> return 0;
> }
> @@ -975,8 +981,11 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
>
> if (!shared)
> - tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> - else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> + tdp_mmu_set_spte(kvm, &iter,
> + shadow_nonpresent_spte(iter.old_spte));
> + else if (tdp_mmu_set_spte_atomic(
> + kvm, &iter,
> + shadow_nonpresent_spte(iter.old_spte)))
> goto retry;
> }
> }
> @@ -1033,7 +1042,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> return false;
>
> __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> - SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
> + shadow_nonpresent_spte(old_spte),
> + sp->gfn, sp->role.level + 1,
> true, true, is_private_sp(sp));
>
> return true;
> @@ -1075,11 +1085,20 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
> }
>
> + /*
> + * SPTE_SHARED_MASK is stored as 4K granularity. The
> + * information is lost if we delete upper level SPTE page.
> + * TODO: support large page.
> + */
> + if (kvm_gfn_shared_mask(kvm) && iter.level > PG_LEVEL_4K)
> + continue;
> +
> if (!is_shadow_present_pte(iter.old_spte) ||
> !is_last_spte(iter.old_spte, iter.level))
> continue;
>
> - tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> + tdp_mmu_set_spte(kvm, &iter,
> + shadow_nonpresent_spte(iter.old_spte));
> flush = true;
> }
>
> @@ -1195,18 +1214,44 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> gfn_t gfn_unalias = iter->gfn & ~kvm_gfn_shared_mask(vcpu->kvm);
>
> WARN_ON(sp->role.level != fault->goal_level);
> + WARN_ON(is_private_sptep(iter->sptep) != fault->is_private);
>
> - /* TDX shared GPAs are no executable, enforce this for the SDV. */
> - if (kvm_gfn_shared_mask(vcpu->kvm) && !fault->is_private)
> - pte_access &= ~ACC_EXEC_MASK;
> + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> + if (fault->is_private) {
> + /*
> + * SPTE allows only RWX mapping. PFN can't be mapped it
> + * as READONLY in GPA.
> + */
> + if (fault->slot && !fault->map_writable)
> + return RET_PF_RETRY;
> + /*
> + * This GPA is not allowed to map as private. Let
> + * vcpu loop in page fault until other vcpu change it
> + * by MapGPA hypercall.
> + */
> + if (fault->slot &&
Please consider merging this if into the above "if (fault->slot) {}"
> + spte_shared_mask(iter->old_spte))
> + return RET_PF_RETRY;
> + } else {
> + /* This GPA is not allowed to map as shared. */
> + if (fault->slot &&
> + !spte_shared_mask(iter->old_spte))
> + return RET_PF_RETRY;
> + /* TDX shared GPAs are no executable, enforce this. */
> + pte_access &= ~ACC_EXEC_MASK;
> + }
> + }
>
> if (unlikely(!fault->slot))
> new_spte = make_mmio_spte(vcpu, gfn_unalias, pte_access);
> - else
> + else {
> wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
> gfn_unalias, fault->pfn, iter->old_spte,
> fault->prefetch, true, fault->map_writable,
> &new_spte);
> + if (spte_shared_mask(iter->old_spte))
> + new_spte |= SPTE_SHARED_MASK;
> + }
The if can be eliminated:
new_spte |= spte_shared_mask(iter->old_spte);
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
> @@ -1509,7 +1554,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> * invariant that the PFN of a present * leaf SPTE can never change.
> * See __handle_changed_spte().
> */
> - tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
> + tdp_mmu_set_spte(kvm, iter, shadow_nonpresent_spte(iter->old_spte));
>
> if (!pte_write(range->pte)) {
> new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
> --
> 2.25.1
>
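Two of the review remarks above can be checked with a small standalone program. The macro values are copied from the quoted diff (MMIO_SPTE_GEN_LOW_START = 3 is taken from the "spte bits 3-10" comment). The helper names mmio_gen_top_bit(), carry_shared_branch() and carry_shared_or() are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BIT_ULL(n)			(1ULL << (n))

#define SPTE_SHARED_MASK		BIT_ULL(62)

#define MMIO_SPTE_GEN_LOW_START		3
#define MMIO_SPTE_GEN_LOW_END		10
#define MMIO_SPTE_GEN_HIGH_START	52
#define MMIO_SPTE_GEN_HIGH_END		61	/* was 62 before bit 62 was stolen */

#define MMIO_SPTE_GEN_LOW_BITS \
	(MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1)
#define MMIO_SPTE_GEN_HIGH_BITS \
	(MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)

/* Highest MMIO generation bit that fits into the two SPTE fields. */
static int mmio_gen_top_bit(void)
{
	return MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1;
}

static uint64_t spte_shared_mask(uint64_t spte)
{
	return spte & SPTE_SHARED_MASK;
}

/* Form used in the patch: test the old SPTE, then set the bit. */
static uint64_t carry_shared_branch(uint64_t new_spte, uint64_t old_spte)
{
	if (spte_shared_mask(old_spte))
		new_spte |= SPTE_SHARED_MASK;
	return new_spte;
}

/* Form suggested in review: OR in the masked old value directly. */
static uint64_t carry_shared_or(uint64_t new_spte, uint64_t old_spte)
{
	return new_spte | spte_shared_mask(old_spte);
}
```

With the high field shrunk to SPTE bits 52-61 it holds 10 bits, so together with the 8 low bits the generation occupies bits 0-17, which is why the comment should say bits 8-17 for the high part. The two carry helpers return the same value for any input, because spte_shared_mask() yields either 0 or exactly SPTE_SHARED_MASK.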
On Fri, Jul 01, 2022 at 10:41:08PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > For kvm mmu that has shared bit mask, zap only leaf SPTEs when
> > deleting/moving a memslot. The existing kvm_mmu_zap_memslot() depends on
>
> Unless I am mistaken, I don't see there's an 'existing' kvm_mmu_zap_memslot().
Oops. It should be kvm_tdp_mmu_invalidate_all_roots().
> > role.invalid with read lock of mmu_lock so that other vcpu can operate on
> > kvm mmu concurrently.
> >
>
> > Mark the root page table invalid, unlink it from page
> > table pointer of CPU, process the page table.
> >
>
> Are you talking about the behaviour of existing code, or the change you are
> going to make? Looks like you mean the latter but I believe it's the former.
The existing code. The sentence should read "It marks ...".
> > It doesn't work for private
> > page table to unlink the root page table because it requires all SPTE entry
> > to be non-present.
> >
>
> I don't think we can truly *unlink* the private root page table from secure
> EPTP, right? The EPTP (root table) is fixed (and hidden) during TD's runtime.
>
> I guess you are trying to say: removing/unlinking one secure-EPT page requires
> removing/unlinking all its children first?
That's right. I'll update it as follows.
> So the reason to only zap leaf is we cannot truly unlink the private root page
> table, correct? Sorry your changelog is not obvious to me.
The reason is, to unlink a page table from the parent's SPTE, all SPTEs of the
page table need to be non-present.
Here is the updated commit message.
KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for private mmu

For kvm mmu that has shared bit mask, zap only leaf SPTEs when
deleting/moving a memslot. The existing kvm_tdp_mmu_invalidate_all_roots()
depends on role.invalid with read lock of mmu_lock so that other vcpu can
operate on kvm mmu concurrently. It marks the root page table invalid,
zaps SPTEs of the root page tables.

It doesn't work to unlink a private page table from the parent's SPTE entry
because it requires all SPTE entry of the page table to be non-present.
Instead, with write-lock of mmu_lock and zap only leaf SPTEs for kvm mmu
with shared bit mask.
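The idea of zapping only leaf entries while leaving the page-table links in place can be shown on a toy table. This is not the TDP MMU walker. struct toy_pt and zap_leafs() are hypothetical, the fan-out is 4 instead of 512, and no locking or TLB flushing is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 4

/* Toy page table: an entry with a child is non-leaf, otherwise leaf. */
struct toy_pt {
	bool present[TOY_ENTRIES];
	struct toy_pt *child[TOY_ENTRIES];	/* NULL for a leaf entry */
};

/*
 * Clear every present leaf entry below t and return how many were
 * cleared. Non-leaf entries stay present, so no page table is ever
 * unlinked from its parent.
 */
static int zap_leafs(struct toy_pt *t)
{
	int zapped = 0;
	size_t i;

	for (i = 0; i < TOY_ENTRIES; i++) {
		if (!t->present[i])
			continue;
		if (t->child[i]) {
			/* Non-leaf: descend, but keep the link itself. */
			zapped += zap_leafs(t->child[i]);
		} else {
			t->present[i] = false;
			zapped++;
		}
	}
	return zapped;
}
```

After the walk the lower-level tables are empty but still linked, which matches the "unused shadow pages lying around" remark in the code comment of the patch.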
> > Instead, with write-lock of mmu_lock and zap only leaf
> > SPTEs for kvm mmu with shared bit mask.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 35 ++++++++++++++++++++++++++++++++++-
> > 1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 80d7c7709af3..c517c7bca105 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5854,11 +5854,44 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > }
> >
> > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > +{
> > + bool flush = false;
> > +
> > + write_lock(&kvm->mmu_lock);
> > +
> > + /*
> > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > + * case scenario we'll have unused shadow pages lying around until they
> > + * are recycled due to age or when the VM is destroyed.
> > + */
> > + if (is_tdp_mmu_enabled(kvm)) {
> > + struct kvm_gfn_range range = {
> > + .slot = slot,
> > + .start = slot->base_gfn,
> > + .end = slot->base_gfn + slot->npages,
> > + .may_block = false,
> > + };
> > +
> > + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
>
>
> It appears you only unmap private GFNs (because the base_gfn doesn't have shared
> bit)? I think shared mapping in this slot must be zapped too?
>
> How is this done? Or the kvm_tdp_mmu_unmap_gfn_range() also zaps shared
> mappings?
kvm_tdp_mmu_unmap_gfn_range() handles both private gfn and shared gfn as
they are aliases.
> It's hard to review if one patch's behaviour/logic depends on further patches.
I'll add a comment on the call.
--
Isaku Yamahata <[email protected]>
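The aliasing mentioned above means a shared GFN and a private GFN differ only in the shared bit, so masking that bit off gives the GFN a memslot is keyed on. A minimal sketch, assuming the shared bit is GPA bit 47 (the actual position depends on the TD's GPA width); gfn_shared_mask(), gfn_unalias() and gfn_in_slot() are hypothetical helpers modelled on the gfn_unalias expression in the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define PAGE_SHIFT 12

/* GFN-level mask for a shared bit given as a GPA bit position. */
static gfn_t gfn_shared_mask(unsigned int gpa_shared_bit)
{
	assert(gpa_shared_bit >= PAGE_SHIFT && gpa_shared_bit < 64);
	return 1ULL << (gpa_shared_bit - PAGE_SHIFT);
}

/* Strip the shared bit so both aliases yield the same GFN. */
static gfn_t gfn_unalias(gfn_t gfn, gfn_t shared_mask)
{
	return gfn & ~shared_mask;
}

/* A slot range without the shared bit covers both aliases. */
static bool gfn_in_slot(gfn_t gfn, gfn_t shared_mask, gfn_t base_gfn,
			gfn_t npages)
{
	gfn_t g = gfn_unalias(gfn, shared_mask);

	return g >= base_gfn && g < base_gfn + npages;
}
```

For example, with the assumed bit 47 the mask is GFN bit 35, and a range starting at slot->base_gfn matches a faulting GFN whether or not bit 35 is set.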
On Mon, Jul 11, 2022 at 02:56:29PM +0000,
Sean Christopherson <[email protected]> wrote:
> s/Focibly/Forcibly, but that's a moot point because KVM shouldn't override the
> module param. KVM should instead _require_ the TDP MMU to be enabled. E.g.
> if userspace disables the TDP MMU to workaround a fatal bug, then forcing the TDP
> MMU may silently expose KVM to said bug.
>
> And overriding tdp_enabled is just mind-bogglingly broken, all of the SPTE masks
> will be wrong.
>
> On Mon, Jun 27, 2022, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > In this patch series, TDX supports only TDP MMU and doesn't support legacy
> > MMU. Forcibly use TDP MMU for TDX irrelevant of kernel parameter to
> > disable TDP MMU.
>
> Do not refer to the "patch series", instead phrase the statement with respect to
> what KVM supports.
>
> Require the TDP MMU for TDX guests, the so called "shadow" MMU does not
> support mapping guest private memory, i.e. does not support Secure-EPT.
Thanks for the rewrite of the commit message. Now that the TDP MMU is the default, I'll change
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 82f1bfac7ee6..7eb41b176d1e 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -18,8 +18,13 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> > {
> > struct workqueue_struct *wq;
> >
> > - if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> > - return 0;
> > + /*
> > + * Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
> > + * of TDX.
> > + */
> > + if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
> > + (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
> > + return false;
>
> Yeah, no.
>
> if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> return kvm->arch.vm_type == KVM_X86_TDX_VM ? -EINVAL : 0;
I'll use -EOPNOTSUPP instead of -EINVAL.
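The agreed behaviour can be sketched as a small standalone model. This is only an illustration under assumptions: tdp_mmu_init_check() and its parameters are hypothetical names, not the actual KVM function, and the real code reads the module parameters rather than taking them as arguments.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the check discussed above (not the KVM code).
 * A TDX VM requires the TDP MMU: if the TDP MMU is unavailable, VM
 * creation fails instead of silently overriding the module parameter.
 * Returns 0 on success and reports the decision in *use_tdp_mmu.
 */
static int tdp_mmu_init_check(bool tdp_enabled, bool tdp_mmu_enabled,
			      bool is_tdx_vm, bool *use_tdp_mmu)
{
	*use_tdp_mmu = tdp_enabled && tdp_mmu_enabled;

	/* TDX guests cannot fall back to the shadow MMU. */
	if (!*use_tdp_mmu && is_tdx_vm)
		return -EOPNOTSUPP;

	return 0;
}
```

With this shape, a non-TDX VM still falls back when the TDP MMU is disabled, while a TDX VM gets an explicit error.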
--
Isaku Yamahata <[email protected]>
On Fri, Jul 01, 2022 at 11:12:44PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > For private GPA, CPU refers a private page table whose contents are
> > encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
> > PTE entry) are used and their cost is expensive.
> >
> > When KVM resolves KVM page fault, it walks the page tables. To reuse the
> > existing KVM MMU code and mitigate the heavy cost to directly walk
> > encrypted private page table, allocate one more page to mirror the existing
> > KVM page table. Resolve KVM page fault with the existing code, and do
> > additional operations necessary for the mirrored private page table. To
> > distinguish such cases, the existing KVM page table is called a shared page
> > table (i.e. no mirrored private page table), and the KVM page table with
> > mirrored private page table is called a private page table. The
> > relationship is depicted below.
> >
> > Add private pointer to struct kvm_mmu_page for mirrored private page table
> > and add helper functions to allocate/initialize/free a mirrored private
> > page table page. Also, add helper functions to check if a given
> > kvm_mmu_page is private. The later patch introduces hooks to operate on
> > the mirrored private page table.
> >
> > KVM page fault |
> > | |
> > V |
> > -------------+---------- |
> > | | |
> > V V |
> > shared GPA private GPA |
> > | | |
> > V V |
> > CPU/KVM shared PT root KVM private PT root | CPU private PT root
> > | | | |
> > V V | V
> > shared PT private PT <----mirror----> mirrored private PT
> > | | | |
> > | \-----------------+------\ |
> > | | | |
> > V | V V
> > shared guest page | private guest page
> > |
> > non-encrypted memory | encrypted memory
> > |
> > PT: page table
> >
> > Both CPU and KVM refer to CPU/KVM shared page table. Private page table
> > is used only by KVM. CPU refers to mirrored private page table.
>
> Shouldn't the private page table maintained by KVM be "mirrored private PT"?
>
> To me "mirrored" normally implies it is fake, or backup which isn't actually
> used. But here "mirrored private PT" is actually used by hardware.
>
> And to me, "CPU and KVM" above are confusing. For instance, "Both CPU and KVM
> refer to CPU/KVM shared page table" took me at least one minute to understand,
> with the help from the diagram -- otherwise I won't be able to understand.
>
> I guess you can just say somewhere:
>
> 1) Shared PT is visible to KVM and it is used by CPU;
> 2) Private PT is used by CPU but it is invisible to KVM;
> 3) Mirrored private PT is visible to KVM but not used by CPU. It is used to
> mirror the actual private PT which is used by CPU.
I removed the word "mirror" and now use "protected" for the encrypted page table.
KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
For private GPA, the CPU refers to a private page table whose contents are
encrypted. Dedicated APIs are used to operate on it (e.g. updating/reading its
PTE entries), and they are expensive.
When KVM resolves KVM page fault, it walks the page tables. To reuse the
existing KVM MMU code and mitigate the heavy cost to directly walk
protected (encrypted) page table, allocate one more page to copy the
protected page table for KVM MMU code to directly walk. Resolve KVM page
fault with the existing code, and do additional operations necessary for
the protected page table. To distinguish such cases, the existing KVM page
table is called a shared page table (i.e. not associated with protected
page table), and the page table with protected page table is called a
private page table. The relationship is depicted below.
Add a private pointer to struct kvm_mmu_page for protected page table and
add helper functions to allocate/initialize/free a protected page table
page. Also, add helper functions to check if a given kvm_mmu_page is
private. A later patch introduces hooks to operate on the protected page
table.
KVM page fault |
| |
V |
-------------+---------- |
| | |
V V |
shared GPA private GPA |
| | |
V V |
shared PT root private PT root | protected PT root
| | | |
V V | V
shared PT private PT ----propagate----> protected PT
| | | |
| \-----------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
non-encrypted memory | encrypted memory
|
PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Protected PT is used by CPU but it is invisible to KVM.
- Private PT is visible to KVM but not used by CPU. It is used to
propagate PT change to the actual protected PT which is used by CPU.
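
The relationship between the private PT and the protected PT can be illustrated with a minimal standalone sketch. Everything here is an assumption made for illustration: struct pt_page, private_sp, propagate_fn and the fake_* helpers are hypothetical names, and the real protected page table is only reachable through the dedicated (expensive) APIs, not through a plain array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PT_ENTRIES 512

/* Hypothetical stand-in for a KVM-visible page table page. */
struct pt_page {
	uint64_t entries[PT_ENTRIES];
	/* Non-NULL only for a private PT: handle to its protected PT page. */
	void *private_sp;
};

/* Hypothetical hook standing in for the dedicated protected-PT API. */
typedef void (*propagate_fn)(void *private_sp, size_t index, uint64_t value);

static bool pt_is_private(const struct pt_page *pt)
{
	return pt->private_sp != NULL;
}

/*
 * Update the KVM-visible table first, so that walks stay cheap, then
 * propagate the change to the protected table only for private PTs.
 */
static void pt_set_entry(struct pt_page *pt, size_t index, uint64_t value,
			 propagate_fn propagate)
{
	pt->entries[index] = value;
	if (pt_is_private(pt) && propagate)
		propagate(pt->private_sp, index, value);
}

/* Demo-only fake of the protected PT, which KVM cannot read directly. */
struct fake_protected_pt {
	uint64_t entries[PT_ENTRIES];
	unsigned long calls;
};

static void fake_propagate(void *private_sp, size_t index, uint64_t value)
{
	struct fake_protected_pt *prot = private_sp;

	prot->entries[index] = value;
	prot->calls++;
}
```

In this model a shared PT never triggers the hook, while every update of a private PT is both recorded in the KVM-visible copy and pushed to the protected one.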
Signed-off-by: Isaku Yamahata <[email protected]>
--
Isaku Yamahata <[email protected]>
On Tue, 2022-07-19 at 04:06 -0700, Isaku Yamahata wrote:
> On Fri, Jul 01, 2022 at 10:41:08PM +1200,
> Kai Huang <[email protected]> wrote:
>
> > On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > > From: Sean Christopherson <[email protected]>
> > >
> > > For kvm mmu that has shared bit mask, zap only leaf SPTEs when
> > > deleting/moving a memslot. The existing kvm_mmu_zap_memslot() depends on
> >
> > Unless I am mistaken, I don't see there's an 'existing' kvm_mmu_zap_memslot().
>
> Oops. it should be kvm_tdp_mmu_invalidate_all_roots().
>
>
> > > role.invalid with read lock of mmu_lock so that other vcpu can operate on
> > > kvm mmu concurrently.
> > >
> >
> > > Mark the root page table invalid, unlink it from page
> > > table pointer of CPU, process the page table.
> > >
> >
> > Are you talking about the behaviour of existing code, or the change you are
> > going to make? Looks like you mean the latter but I believe it's the former.
>
>
> The existing code. It should read "It marks ...".
>
>
> > > It doesn't work for private
> > > page table to unlink the root page table because it requires all SPTE entry
> > > to be non-present.
> > >
> >
> > I don't think we can truly *unlink* the private root page table from secure
> > EPTP, right? The EPTP (root table) is fixed (and hidden) during TD's runtime.
> >
> > I guess you are trying to say: removing/unlinking one secure-EPT page requires
> > removing/unlinking all its children first?
>
> That's right. I'll update it as follows.
>
>
> > So the reason to only zap leaf is we cannot truly unlink the private root page
> > table, correct? Sorry your changelog is not obvious to me.
>
> The reason is, to unlink a page table from the parent's SPTE, all SPTEs of the
> page table need to be non-present.
>
> Here is the updated commit message.
>
> KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for private mmu
> For kvm mmu that has shared bit mask, zap only leaf SPTEs when
> deleting/moving a memslot. The existing kvm_tdp_mmu_invalidate_all_roots()
> depends on role.invalid with read lock of mmu_lock so that other vcpu can
> operate on kvm mmu concurrently. It marks the root page table invalid,
> zaps SPTEs of the root page tables.
>
> It doesn't work to unlink a private page table from the parent's SPTE entry
> because it requires all SPTE entry of the page table to be non-present.
AFAICT this isn't the real reason that you cannot mark private root table as
invalid, and do the same zapping as you mentioned above. There might be some
change to support "zapping all children before zapping the parent for private
table" (currently the actual page table is freed after RCU grace period, but not
at unlink time), but I don't see how this cannot be supported, or at least the
changelog doesn't explain why it cannot be supported.
The true reason is, if I understand correctly, that you cannot truly unlink the
private root page table from the hardware and then, e.g., allocate a new one for
it. So just zap the leaves.
> Instead, with write-lock of mmu_lock and zap only leaf SPTEs for kvm mmu
> with shared bit mask.
>
> > > Instead, with write-lock of mmu_lock and zap only leaf
> > > SPTEs for kvm mmu with shared bit mask.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 35 ++++++++++++++++++++++++++++++++++-
> > > 1 file changed, 34 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 80d7c7709af3..c517c7bca105 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5854,11 +5854,44 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > > return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > > }
> > >
> > > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > +{
> > > + bool flush = false;
> > > +
> > > + write_lock(&kvm->mmu_lock);
> > > +
> > > + /*
> > > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > > + * case scenario we'll have unused shadow pages lying around until they
> > > + * are recycled due to age or when the VM is destroyed.
> > > + */
> > > + if (is_tdp_mmu_enabled(kvm)) {
> > > + struct kvm_gfn_range range = {
> > > + .slot = slot,
> > > + .start = slot->base_gfn,
> > > + .end = slot->base_gfn + slot->npages,
> > > + .may_block = false,
> > > + };
> > > +
> > > + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
> >
> >
> > It appears you only unmap private GFNs (because the base_gfn doesn't have shared
> > bit)? I think shared mapping in this slot must be zapped too?
> >
> > How is this done? Or the kvm_tdp_mmu_unmap_gfn_range() also zaps shared
> > mappings?
>
> kvm_tdp_mmu_unmap_gfn_range() handles both private gfn and shared gfn as
> they are alias.
>
>
> > It's hard to review if one patch's behaviour/logic depends on further patches.
>
> I'll add a comment on the call.
>
I don't think adding a comment is enough. Having the correctness of one patch
depend on a future patch doesn't seem right. Please also consider
reorganizing/reordering the patches.
On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
...
>
> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> on insertion/removal to (dis)allow hugepages as needed.
>
> + efficient on KVM page fault (no new lookups)
> + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
> + straightforward to implement
> + can (and should) be merged as part of the UPM series
>
> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> completely covered (fully shared) or not covered at all (fully private), but I'm
> not 100% certain that xa_for_each_range() works the way I think it does.
Hi Sean,
Below is the implementation to support 2M as you mentioned as option D.
It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
Everything sounds good; the only tricky bit is inc/dec of disallow_lpage. If
we still treat it as a count, it will be a challenge to make the inc/dec
balanced. So in this patch I stole a bit for the purpose, looks ugly.
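
The bit-stealing idea can be sketched in isolation as follows. This is a minimal standalone model, not the patch itself: the LPAGE_* macros and lpage_* helpers are hypothetical names, and it uses an unsigned 32-bit word (an assumption made here to keep the bit arithmetic well defined) where the patch below keeps the existing int field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Top bit flags "private/shared mixed"; the low 31 bits stay a count. */
#define LPAGE_MIXED_FLAG (1U << 31)
#define LPAGE_COUNT_MASK (LPAGE_MIXED_FLAG - 1)

/* Set or clear the mixed flag without disturbing the count. */
static uint32_t lpage_set_mixed(uint32_t v, bool mixed)
{
	return mixed ? (v | LPAGE_MIXED_FLAG) : (v & ~LPAGE_MIXED_FLAG);
}

/*
 * Adjust the count by delta without touching the flag. Returns false
 * and leaves *v unchanged if the count would underflow or overflow.
 */
static bool lpage_adjust_count(uint32_t *v, int delta)
{
	int64_t count = (int64_t)(*v & LPAGE_COUNT_MASK) + delta;

	if (count < 0 || count > (int64_t)LPAGE_COUNT_MASK)
		return false;

	*v = (*v & LPAGE_MIXED_FLAG) | (uint32_t)count;
	return true;
}

/* A large page is disallowed if the flag is set or the count is nonzero. */
static bool lpage_disallowed(uint32_t v)
{
	return v != 0;
}
```

Because the flag is set and cleared idempotently, it does not need balanced inc/dec calls, while the other users keep their reference-count semantics in the low bits.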
Any feedback is welcome.
Thanks,
Chao
-----------------------------------------------------------------------
From: Chao Peng <[email protected]>
Date: Wed, 20 Jul 2022 11:37:18 +0800
Subject: [PATCH] KVM: Add large page support for private memory
Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
Reserve a bit in disallow_lpage to indicate that a large page has
mixed private/shared pages.
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +++
arch/x86/kvm/mmu/mmu.c | 120 +++++++++++++++++++++++++++++++-
include/linux/kvm_host.h | 14 ++++
virt/kvm/kvm_main.c | 12 +++-
4 files changed, 150 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d460b8511041..b6ffe8b1c547 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
#define __KVM_HAVE_ZAP_GFN_RANGE
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
#define KVM_MAX_VCPUS 1024
@@ -935,6 +936,13 @@ struct kvm_vcpu_arch {
#endif
};
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits will be used as a reference count for other users.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
+#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
+
struct kvm_lpage_info {
int disallow_lpage;
};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 771ffd147e77..d040eeaf1f1c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -843,11 +843,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
{
struct kvm_lpage_info *linfo;
int i;
+ int disallow_count;
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+ disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+ WARN_ON(disallow_count + count < 0 ||
+ disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
linfo->disallow_lpage += count;
- WARN_ON(linfo->disallow_lpage < 0);
}
}
@@ -7246,3 +7251,116 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_lpage_recovery_thread)
kthread_stop(kvm->arch.nx_lpage_recovery_thread);
}
+
+static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ XA_STATE(xas, &kvm->mem_attr_array, start);
+ gfn_t gfn = start;
+ void *entry;
+ bool shared, private;
+ bool mixed = false;
+
+ if (attr == KVM_MEM_ATTR_SHARED) {
+ shared = true;
+ private = false;
+ } else {
+ shared = false;
+ private = true;
+ }
+
+ rcu_read_lock();
+ entry = xas_load(&xas);
+ while (gfn < end) {
+ if (xas_retry(&xas, entry))
+ continue;
+
+ KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+ if (entry)
+ private = true;
+ else
+ shared = true;
+
+ if (private && shared) {
+ mixed = true;
+ goto out;
+ }
+
+ entry = xas_next(&xas);
+ gfn++;
+ }
+out:
+ rcu_read_unlock();
+ return mixed;
+}
+
+static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+ if (mixed)
+ linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+ else
+ linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void update_mem_lpage_info(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ unsigned long lpage_start, lpage_end;
+ unsigned long gfn, pages, mask;
+ int level;
+
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ pages = KVM_PAGES_PER_HPAGE(level);
+ mask = ~(pages - 1);
+ lpage_start = start & mask;
+ lpage_end = end & mask;
+
+ /*
+ * We only need to scan the head and tail page, for middle pages
+ * we know they are not mixed.
+ */
+ update_mixed(lpage_info_slot(lpage_start, slot, level),
+ mem_attr_is_mixed(kvm, attr, lpage_start,
+ lpage_start + pages));
+
+ if (lpage_start == lpage_end)
+ return;
+
+ for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
+ update_mixed(lpage_info_slot(gfn, slot, level), false);
+ }
+
+ update_mixed(lpage_info_slot(lpage_end, slot, level),
+ mem_attr_is_mixed(kvm, attr, lpage_end,
+ lpage_end + pages));
+ }
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int i;
+
+ WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
+ "Unsupported mem attribute.\n");
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ slot = iter.slot;
+ start = max(start, slot->base_gfn);
+ end = min(end, slot->base_gfn + slot->npages);
+ if (WARN_ON_ONCE(start >= end))
+ continue;
+
+ update_mem_lpage_info(kvm, slot, attr, start, end);
+ }
+ }
+}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d45f00f5b3ee..7b18fcd71df5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2282,6 +2282,10 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+
+#define KVM_MEM_ATTR_SHARED 0x0001
+#define KVM_MEM_ATTR_PRIVATE 0x0002
+
static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *order)
{
@@ -2307,6 +2311,16 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
return !!xa_load(&kvm->mem_attr_array, gfn);
}
+#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+}
+#endif
+
#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1ba4b9e5449c..1d22c8603f91 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -863,12 +863,12 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
-#define KVM_MEM_ATTR_PRIVATE 0x0001
static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
struct kvm_enc_region *region)
{
unsigned long start, end;
void *entry;
+ int attr;
int r;
if (region->size == 0 || region->addr + region->size < region->addr)
@@ -879,13 +879,19 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl
start = region->addr >> PAGE_SHIFT;
end = (region->addr + region->size - 1) >> PAGE_SHIFT;
- entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
- xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
+ if (ioctl == KVM_MEMORY_ENCRYPT_REG_REGION) {
+ attr = KVM_MEM_ATTR_PRIVATE;
+ entry = xa_mk_value(KVM_MEM_ATTR_PRIVATE);
+ } else {
+ attr = KVM_MEM_ATTR_SHARED;
+ entry = NULL;
+ }
r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
entry, GFP_KERNEL_ACCOUNT));
kvm_zap_gfn_range(kvm, start, end + 1);
+ kvm_arch_update_mem_attr(kvm, attr, start, end + 1);
return r;
}
--
2
On 7/20/2022 8:29 PM, Chao Peng wrote:
> On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
> ...
>>
>> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
>> on insertion/removal to (dis)allow hugepages as needed.
>>
>> + efficient on KVM page fault (no new lookups)
>> + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
>> + straightforward to implement
>> + can (and should) be merged as part of the UPM series
>>
>> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
>> completely covered (fully shared) or not covered at all (fully private), but I'm
>> not 100% certain that xa_for_each_range() works the way I think it does.
>
> Hi Sean,
>
> Below is the implementation to support 2M as you mentioned as option D.
> It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
>
> Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
> we still treat it as a count, it will be a challenge to make the inc/dec
> balanced. So in this patch I stole a bit for the purpose, looks ugly.
>
> Any feedback is welcome.
>
> Thanks,
> Chao
>
> -----------------------------------------------------------------------
> From: Chao Peng <[email protected]>
> Date: Wed, 20 Jul 2022 11:37:18 +0800
> Subject: [PATCH] KVM: Add large page support for private memory
>
> Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>
> Reserve a bit in disallow_lpage to indicate a large page has
> private/share pages mixed.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> +static void update_mem_lpage_info(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned int attr,
> + gfn_t start, gfn_t end)
> +{
> + unsigned long lpage_start, lpage_end;
> + unsigned long gfn, pages, mask;
> + int level;
> +
> + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> + pages = KVM_PAGES_PER_HPAGE(level);
> + mask = ~(pages - 1);
> + lpage_start = start & mask;
> + lpage_end = end & mask;
> +
> + /*
> + * We only need to scan the head and tail page, for middle pages
> + * we know they are not mixed.
> + */
> + update_mixed(lpage_info_slot(lpage_start, slot, level),
> + mem_attr_is_mixed(kvm, attr, lpage_start,
> + lpage_start + pages));
> +
> + if (lpage_start == lpage_end)
> + return;
> +
> + for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
> + update_mixed(lpage_info_slot(gfn, slot, level), false);
> + }
Boundary check missing here for the case when gfn reaches lpage_end.
if (gfn == lpage_end)
return;
> +
> + update_mixed(lpage_info_slot(lpage_end, slot, level),
> + mem_attr_is_mixed(kvm, attr, lpage_end,
> + lpage_end + pages));
> + }
> +}
Regards
Nikunj
On Mon, Jul 25, 2022 at 07:16:24PM +0530, Nikunj A. Dadhania wrote:
> On 7/20/2022 8:29 PM, Chao Peng wrote:
> > On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
> > ...
> >>
> >> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> >> on insertion/removal to (dis)allow hugepages as needed.
> >>
> >> + efficient on KVM page fault (no new lookups)
> >> + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
> >> + straightforward to implement
> >> + can (and should) be merged as part of the UPM series
> >>
> >> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> >> completely covered (fully shared) or not covered at all (fully private), but I'm
> >> not 100% certain that xa_for_each_range() works the way I think it does.
> >
> > Hi Sean,
> >
> > Below is the implementation to support 2M as you mentioned as option D.
> > It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
> >
> > Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
> > we still treat it as a count, it will be a challenge to make the inc/dec
> > balanced. So in this patch I stole a bit for the purpose, looks ugly.
> >
> > Any feedback is welcome.
> >
> > Thanks,
> > Chao
> >
> > -----------------------------------------------------------------------
> > From: Chao Peng <[email protected]>
> > Date: Wed, 20 Jul 2022 11:37:18 +0800
> > Subject: [PATCH] KVM: Add large page support for private memory
> >
> > Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> >
> > Reserve a bit in disallow_lpage to indicate a large page has
> > private/share pages mixed.
> >
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
>
>
> > +static void update_mem_lpage_info(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned int attr,
> > + gfn_t start, gfn_t end)
> > +{
> > + unsigned long lpage_start, lpage_end;
> > + unsigned long gfn, pages, mask;
> > + int level;
> > +
> > + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > + pages = KVM_PAGES_PER_HPAGE(level);
> > + mask = ~(pages - 1);
> > + lpage_start = start & mask;
> > + lpage_end = end & mask;
> > +
> > + /*
> > + * We only need to scan the head and tail page, for middle pages
> > + * we know they are not mixed.
> > + */
> > + update_mixed(lpage_info_slot(lpage_start, slot, level),
> > + mem_attr_is_mixed(kvm, attr, lpage_start,
> > + lpage_start + pages));
> > +
> > + if (lpage_start == lpage_end)
> > + return;
> > +
> > + for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
> > + update_mixed(lpage_info_slot(gfn, slot, level), false);
> > + }
>
> Boundary check missing here for the case when gfn reaches lpage_end.
>
> if (gfn == lpage_end)
> return;
In this case, it's actually the tail page that I want to scan with the
code below.
It's also possible I misunderstand something here.
Chao
>
> > +
> > + update_mixed(lpage_info_slot(lpage_end, slot, level),
> > + mem_attr_is_mixed(kvm, attr, lpage_end,
> > + lpage_end + pages));
> > + }
> > +}
>
> Regards
> Nikunj
On Tue, Jul 12, 2022 at 01:30:34PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-07-11 at 17:38 -0700, Isaku Yamahata wrote:
> > On Tue, Jun 28, 2022 at 03:53:31PM +1200,
> > Kai Huang <[email protected]> wrote:
> >
> > > On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > Currently, KVM VMX module initialization/exit functions are a single
> > > > function each. Refactor KVM VMX module initialization functions into KVM
> > > > common part and VMX part so that TDX specific part can be added cleanly.
> > > > Opportunistically refactor module exit function as well.
> > > >
> > > > The current module initialization flow is, 1.) calculate the sizes of VMX
> > > > kvm structure and VMX vcpu structure, 2.) hyper-v specific initialization
> > > > 3.) report those sizes to the KVM common layer and KVM common
> > > > initialization, and 4.) VMX specific system-wide initialization.
> > > >
> > > > Refactor the KVM VMX module initialization function into functions with a
> > > > wrapper function to separate VMX logic in vmx.c from a file, main.c, common
> > > > among VMX and TDX. We have a wrapper function, "vt_init() {vmx kvm/vcpu
> > > > size calculation; hv_vp_assist_page_init(); kvm_init(); vmx_init(); }" in
> > > > main.c, and hv_vp_assist_page_init() and vmx_init() in vmx.c.
> > > > hv_vp_assist_page_init() initializes hyper-v specific assist pages,
> > > > kvm_init() does system-wide initialization of the KVM common layer, and
> > > > vmx_init() does system-wide VMX initialization.
> > > >
> > > > The KVM architecture common layer allocates struct kvm with reported size
> > > > for architecture-specific code. The KVM VMX module defines its structure
> > > > as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> > > > struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define
> > > > TDX specific kvm and vcpu structures, add tdx_pre_kvm_init() to report the
> > > > sizes of them to the KVM common layer.
> > > >
> > > > The current module exit function is also a single function, a combination
> > > > of VMX specific logic and common KVM logic. Refactor it into VMX specific
> > > > logic and KVM common logic. This is just refactoring to keep the VMX
> > > > specific logic in vmx.c from main.c.
> > >
> > > This patch, coupled with the patch:
> > >
> > > KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> > >
> > > Basically provides an infrastructure to support both VMX and TDX. Why we cannot
> > > merge them into one patch? What's the benefit of splitting them?
> > >
> > > At least, why the two patches cannot be put together closely?
> >
> > It is trivial for the change of "KVM: VMX: Move out vmx_x86_ops to 'main.c' to
> > wrap VMX and TDX" to introduce no functional change. But it's not trivial
> > for this patch to introduce no functional change.
>
> This doesn't sound right. If I understand correctly, this patch supposedly
> shouldn't bring any functional change, right? Could you explain what functional
> change does this patch bring?
This patch doesn't bring any functional change. It changes the order of
some function calls, which doesn't actually matter, but I think the change is non-trivial.
--
Isaku Yamahata <[email protected]>
On 7/26/2022 8:02 PM, Chao Peng wrote:
> On Mon, Jul 25, 2022 at 07:16:24PM +0530, Nikunj A. Dadhania wrote:
>> On 7/20/2022 8:29 PM, Chao Peng wrote:
>>> On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
>>> ...
>>>>
>>>> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
>>>> on insertion/removal to (dis)allow hugepages as needed.
>>>>
>>>> + efficient on KVM page fault (no new lookups)
>>>> + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
>>>> + straightforward to implement
>>>> + can (and should) be merged as part of the UPM series
>>>>
>>>> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
>>>> completely covered (fully shared) or not covered at all (fully private), but I'm
>>>> not 100% certain that xa_for_each_range() works the way I think it does.
>>>
>>> Hi Sean,
>>>
>>> Below is the implementation to support 2M as you mentioned as option D.
>>> It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
>>>
>>> Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
>>> we still treat it as a count, it will be a challenge to make the inc/dec
>>> balanced. So in this patch I stole a bit for the purpose, looks ugly.
>>>
>>> Any feedback is welcome.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -----------------------------------------------------------------------
>>> From: Chao Peng <[email protected]>
>>> Date: Wed, 20 Jul 2022 11:37:18 +0800
>>> Subject: [PATCH] KVM: Add large page support for private memory
>>>
>>> Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>
>>> Reserve a bit in disallow_lpage to indicate a large page has
>>> private/share pages mixed.
>>>
>>> Signed-off-by: Chao Peng <[email protected]>
>>> ---
>>
>>
>>> +static void update_mem_lpage_info(struct kvm *kvm,
>>> + struct kvm_memory_slot *slot,
>>> + unsigned int attr,
>>> + gfn_t start, gfn_t end)
>>> +{
>>> + unsigned long lpage_start, lpage_end;
>>> + unsigned long gfn, pages, mask;
>>> + int level;
>>> +
>>> + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
>>> + pages = KVM_PAGES_PER_HPAGE(level);
>>> + mask = ~(pages - 1);
>>> + lpage_start = start & mask;
>>> + lpage_end = end & mask;
>>> +
>>> + /*
>>> + * We only need to scan the head and tail page, for middle pages
>>> + * we know they are not mixed.
>>> + */
>>> + update_mixed(lpage_info_slot(lpage_start, slot, level),
>>> + mem_attr_is_mixed(kvm, attr, lpage_start,
>>> + lpage_start + pages));
>>> +
>>> + if (lpage_start == lpage_end)
>>> + return;
>>> +
>>> + for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
>>> + update_mixed(lpage_info_slot(gfn, slot, level), false);
>>> + }
>>
>> Boundary check missing here for the case when gfn reaches lpage_end.
>>
>> if (gfn == lpage_end)
>> return;
>
> In this case, it's actually the tail page that I want to scan for with
> below code.
What if you do not have the tail lpage?
For example: memslot base_gfn = 0x1000 and npages is 0x800, so the memslot range
is 0x1000 to 0x17ff.
Assume a case when this function is called with start = 0x1000 and end = 0x1800.
For 2M, pages is 0x200 and mask is ~0x1ff, so start and end are both 2M aligned.
First update_mixed takes care of 0x1000-0x1200.
Loop update_mixed: goes over 0x1200 - 0x1800, so there are no pages left
for the last update_mixed to process.
>
> It's also possible I misunderstand something here.
>
> Chao
>>
>>> +
>>> + update_mixed(lpage_info_slot(lpage_end, slot, level),
>>> + mem_attr_is_mixed(kvm, attr, lpage_end,
>>> + lpage_end + pages));
lpage_info_slot() sometimes causes a crash here, as I noticed that
it returns an out-of-bounds entry.
Regards
Nikunj
On Mon, Jun 27, 2022 at 02:53:36PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For a private GPA, the CPU refers to a private page table whose contents
> are encrypted. Dedicated APIs are used to operate on it (e.g. to update or
> read a PTE entry), and they are expensive.
>
> When KVM resolves a KVM page fault, it walks the page tables. To reuse the
> existing KVM MMU code and avoid the heavy cost of directly walking the
> encrypted private page table, allocate one more page to mirror the existing
> KVM page table. Resolve the KVM page fault with the existing code, and do
> the additional operations necessary for the mirrored private page table. To
> distinguish the two cases, the existing KVM page table is called a shared
> page table (i.e. no mirrored private page table), and a KVM page table with
> a mirrored private page table is called a private page table. The
> relationship is depicted below.
>
> Add a private pointer to struct kvm_mmu_page for the mirrored private page
> table and add helper functions to allocate/initialize/free a mirrored
> private page table page. Also, add helper functions to check if a given
> kvm_mmu_page is private. A later patch introduces hooks to operate on
> the mirrored private page table.
>
> KVM page fault |
> | |
> V |
> -------------+---------- |
> | | |
> V V |
> shared GPA private GPA |
> | | |
> V V |
> CPU/KVM shared PT root KVM private PT root | CPU private PT root
> | | | |
> V V | V
> shared PT private PT <----mirror----> mirrored private PT
> | | | |
> | \-----------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> non-encrypted memory | encrypted memory
> |
> PT: page table
>
> Both CPU and KVM refer to CPU/KVM shared page table. Private page table
> is used only by KVM. CPU refers to mirrored private page table.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 9 ++++
> arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 3 ++
> 4 files changed, 97 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f4d4ed41641b..bfc934dc9a33 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -716,6 +716,7 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
> + struct kvm_mmu_memory_cache mmu_private_sp_cache;
I notice that mmu_private_sp_cache.gfp_zero is left unset so these pages
may contain garbage. Is this by design because the TDX module can't rely
on the contents being zero and has to take care of initializing the page
itself? i.e. GFP_ZERO would be a waste of cycles?
If I'm correct please include a comment here in the next revision to
explain why GFP_ZERO is not necessary.
>
> /*
> * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c517c7bca105..a5bf3e40e209 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -691,6 +691,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> int start, end, i, r;
> bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
>
> + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> + PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> + }
> +
> if (is_tdp_mmu && shadow_nonpresent_value)
> start = kvm_mmu_memory_cache_nr_free_objects(mc);
>
> @@ -732,6 +739,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1736,6 +1744,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
> if (!direct)
> sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> + kvm_mmu_init_private_sp(sp, NULL);
This is unnecessary. kvm_mmu_page structs are zero-initialized so
private_sp will already be NULL.
>
> /*
> * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 44a04fad4bed..9f3a6bea60a3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -55,6 +55,10 @@ struct kvm_mmu_page {
> u64 *spt;
> /* hold the gfn of each spte inside spt */
> gfn_t *gfns;
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> + /* associated private shadow page, e.g. SEPT page. */
Can we use "Secure EPT" instead of SEPT in KVM code and comments? (i.e.
also including variable names like sept_page -> secure_ept_page)
"SEPT" looks like a misspelling of SPTE, which is used all over KVM. It
will be difficult to read code that contains both acronyms.
> + void *private_sp;
Please name this "private_spt" and move it up next to "spt".
"sp" or "shadow page" is used to refer to kvm_mmu_page structs. For
example, look at all the code in KVM that uses `struct kvm_mmu_page *sp`.
"spt" is "shadow page table", i.e. the actual page table memory. See
kvm_mmu_page.spt. Calling this field "private_spt" makes it obvious that
this pointer is pointing to a page table.
Also please update the language in the comment accordingly to "private
shadow page table".
> +#endif
> /* Currently serving as active root */
> union {
> int root_count;
> @@ -115,6 +119,86 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> return kvm_mmu_role_as_id(sp->role);
> }
>
> +/*
> + * TDX vcpu allocates page for root Secure EPT page and assigns to CPU secure
> + * EPT pointer. KVM doesn't need to allocate and link to the secure EPT.
> + * Dummy value to make is_private_sp() return true.
> + */
> +#define KVM_MMU_PRIVATE_SP_ROOT ((void *)1)
> +
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> + return !!sp->private_sp;
> +}
> +
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> + WARN_ON(!sptep);
> + return is_private_sp(sptep_to_sp(sptep));
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> + return sp->private_sp;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> + sp->private_sp = private_sp;
> +}
> +
> +/* Valid sp->role.level is required. */
> +static inline void kvm_mmu_alloc_private_sp(
> + struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
> +{
> + if (is_root)
> + sp->private_sp = KVM_MMU_PRIVATE_SP_ROOT;
> + else
> + sp->private_sp = kvm_mmu_memory_cache_alloc(
> + &vcpu->arch.mmu_private_sp_cache);
> + /*
> + * Because mmu_private_sp_cache is topped up before starting kvm page
> + * fault resolving, the allocation above shouldn't fail.
> + */
> + WARN_ON_ONCE(!sp->private_sp);
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> + if (sp->private_sp != KVM_MMU_PRIVATE_SP_ROOT)
> + free_page((unsigned long)sp->private_sp);
> +}
> +#else
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> + return false;
> +}
> +
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> + return false;
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> + return NULL;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> +}
> +
> +static inline void kvm_mmu_alloc_private_sp(
> + struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
> +{
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> +}
> +#endif
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7eb41b176d1e..b2568b062faa 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -72,6 +72,8 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> + if (is_private_sp(sp))
> + kvm_mmu_free_private_sp(sp);
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
> @@ -295,6 +297,7 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
> sp->gfn = gfn;
> sp->ptep = sptep;
> sp->tdp_mmu_page = true;
> + kvm_mmu_init_private_sp(sp, NULL);
>
> trace_kvm_mmu_get_page(sp, true);
> }
> --
> 2.25.1
>
On Mon, Jun 27, 2022 at 02:53:36PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For a private GPA, the CPU refers to a private page table whose contents
> are encrypted. Dedicated APIs are used to operate on it (e.g. to update or
> read a PTE entry), and they are expensive.
>
> When KVM resolves a KVM page fault, it walks the page tables. To reuse the
> existing KVM MMU code and avoid the heavy cost of directly walking the
> encrypted private page table, allocate one more page to mirror the existing
> KVM page table. Resolve the KVM page fault with the existing code, and do
> the additional operations necessary for the mirrored private page table. To
> distinguish the two cases, the existing KVM page table is called a shared
> page table (i.e. no mirrored private page table), and a KVM page table with
> a mirrored private page table is called a private page table. The
> relationship is depicted below.
>
> Add a private pointer to struct kvm_mmu_page for the mirrored private page
> table and add helper functions to allocate/initialize/free a mirrored
> private page table page. Also, add helper functions to check if a given
> kvm_mmu_page is private. A later patch introduces hooks to operate on
> the mirrored private page table.
>
> KVM page fault |
> | |
> V |
> -------------+---------- |
> | | |
> V V |
> shared GPA private GPA |
> | | |
> V V |
> CPU/KVM shared PT root KVM private PT root | CPU private PT root
> | | | |
> V V | V
> shared PT private PT <----mirror----> mirrored private PT
> | | | |
> | \-----------------+------\ |
> | | | |
> V | V V
> shared guest page | private guest page
> |
> non-encrypted memory | encrypted memory
> |
> PT: page table
>
> Both CPU and KVM refer to CPU/KVM shared page table. Private page table
> is used only by KVM. CPU refers to mirrored private page table.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 9 ++++
> arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 3 ++
> 4 files changed, 97 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f4d4ed41641b..bfc934dc9a33 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -716,6 +716,7 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
> + struct kvm_mmu_memory_cache mmu_private_sp_cache;
>
> /*
> * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c517c7bca105..a5bf3e40e209 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -691,6 +691,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> int start, end, i, r;
> bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
>
> + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> + PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> + }
> +
> if (is_tdp_mmu && shadow_nonpresent_value)
> start = kvm_mmu_memory_cache_nr_free_objects(mc);
>
> @@ -732,6 +739,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1736,6 +1744,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
> if (!direct)
> sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> + kvm_mmu_init_private_sp(sp, NULL);
>
> /*
> * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 44a04fad4bed..9f3a6bea60a3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -55,6 +55,10 @@ struct kvm_mmu_page {
> u64 *spt;
> /* hold the gfn of each spte inside spt */
> gfn_t *gfns;
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> + /* associated private shadow page, e.g. SEPT page. */
> + void *private_sp;
> +#endif
write_flooding_count and unsync_children are only used in shadow MMU SPs
and private_sp is only used in TDP MMU SPs. So it seems like we could
put these together in a union and drop CONFIG_KVM_MMU_PRIVATE without
increasing the size of kvm_mmu_page, i.e.

	union {
		struct {
			unsigned int unsync_children;
			/* Number of writes since the last time traversal visited this page. */
			atomic_t write_flooding_count;
		};
		/*
		 * The associated private shadow page table, e.g. for Secure EPT.
		 * Only valid if tdp_mmu_page is true.
		 */
		void *private_spt;
	};

Then change is_private_sp() to:

	static inline bool is_private_sp(struct kvm_mmu_page *sp)
	{
		return sp->tdp_mmu_page && sp->private_spt;
	}

This will allow us to drop CONFIG_KVM_MMU_PRIVATE, whose only benefit, as
far as I can see, is to avoid increasing the size of kvm_mmu_page. However,
to actually realize that benefit, cloud vendors (for example) would have
to create separate kernel builds for TDX and non-TDX hosts, which seems
like a huge hassle.
> /* Currently serving as active root */
> union {
> int root_count;
> @@ -115,6 +119,86 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> return kvm_mmu_role_as_id(sp->role);
> }
>
> +/*
> + * TDX vcpu allocates page for root Secure EPT page and assigns to CPU secure
> + * EPT pointer. KVM doesn't need to allocate and link to the secure EPT.
> + * Dummy value to make is_private_sp() return true.
> + */
> +#define KVM_MMU_PRIVATE_SP_ROOT ((void *)1)
> +
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> + return !!sp->private_sp;
> +}
> +
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> + WARN_ON(!sptep);
> + return is_private_sp(sptep_to_sp(sptep));
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> + return sp->private_sp;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> + sp->private_sp = private_sp;
> +}
> +
> +/* Valid sp->role.level is required. */
> +static inline void kvm_mmu_alloc_private_sp(
> + struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
> +{
> + if (is_root)
> + sp->private_sp = KVM_MMU_PRIVATE_SP_ROOT;
> + else
> + sp->private_sp = kvm_mmu_memory_cache_alloc(
> + &vcpu->arch.mmu_private_sp_cache);
> + /*
> + * Because mmu_private_sp_cache is topped up before starting kvm page
> + * fault resolving, the allocation above shouldn't fail.
> + */
> + WARN_ON_ONCE(!sp->private_sp);
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> + if (sp->private_sp != KVM_MMU_PRIVATE_SP_ROOT)
> + free_page((unsigned long)sp->private_sp);
> +}
> +#else
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> + return false;
> +}
> +
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> + return false;
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> + return NULL;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp, void *private_sp)
> +{
> +}
> +
> +static inline void kvm_mmu_alloc_private_sp(
> + struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, bool is_root)
> +{
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> +}
> +#endif
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7eb41b176d1e..b2568b062faa 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -72,6 +72,8 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> + if (is_private_sp(sp))
> + kvm_mmu_free_private_sp(sp);
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
> @@ -295,6 +297,7 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
> sp->gfn = gfn;
> sp->ptep = sptep;
> sp->tdp_mmu_page = true;
> + kvm_mmu_init_private_sp(sp, NULL);
>
> trace_kvm_mmu_get_page(sp, true);
> }
> --
> 2.25.1
>
On Wed, Jul 27, 2022 at 02:56:40PM +0530, Nikunj A. Dadhania wrote:
> On 7/26/2022 8:02 PM, Chao Peng wrote:
> > On Mon, Jul 25, 2022 at 07:16:24PM +0530, Nikunj A. Dadhania wrote:
> >> On 7/20/2022 8:29 PM, Chao Peng wrote:
> >>> On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
> >>> ...
> >>>>
> >>>> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> >>>> on insertion/removal to (dis)allow hugepages as needed.
> >>>>
> >>>> + efficient on KVM page fault (no new lookups)
> >>>> + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
> >>>> + straightforward to implement
> >>>> + can (and should) be merged as part of the UPM series
> >>>>
> >>>> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> >>>> completely covered (fully shared) or not covered at all (fully private), but I'm
> >>>> not 100% certain that xa_for_each_range() works the way I think it does.
> >>>
> >>> Hi Sean,
> >>>
> >>> Below is the implementation to support 2M as you mentioned as option D.
> >>> It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
> >>>
> >>> Everything sounds good, the only tricky bit is inc/dec disallow_lpage. If
> >>> we still treat it as a count, it will be a challenge to make the inc/dec
> >>> balanced. So in this patch I stole a bit for the purpose, looks ugly.
> >>>
> >>> Any feedback is welcome.
> >>>
> >>> Thanks,
> >>> Chao
> >>>
> >>> -----------------------------------------------------------------------
> >>> From: Chao Peng <[email protected]>
> >>> Date: Wed, 20 Jul 2022 11:37:18 +0800
> >>> Subject: [PATCH] KVM: Add large page support for private memory
> >>>
> >>> Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> >>>
> >>> Reserve a bit in disallow_lpage to indicate a large page has
> >>> private/shared pages mixed.
> >>>
> >>> Signed-off-by: Chao Peng <[email protected]>
> >>> ---
> >>
> >>
> >>> +static void update_mem_lpage_info(struct kvm *kvm,
> >>> + struct kvm_memory_slot *slot,
> >>> + unsigned int attr,
> >>> + gfn_t start, gfn_t end)
> >>> +{
> >>> + unsigned long lpage_start, lpage_end;
> >>> + unsigned long gfn, pages, mask;
> >>> + int level;
> >>> +
> >>> + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> >>> + pages = KVM_PAGES_PER_HPAGE(level);
> >>> + mask = ~(pages - 1);
> >>> + lpage_start = start & mask;
> >>> + lpage_end = end & mask;
> >>> +
> >>> + /*
> >>> + * We only need to scan the head and tail page, for middle pages
> >>> + * we know they are not mixed.
> >>> + */
> >>> + update_mixed(lpage_info_slot(lpage_start, slot, level),
> >>> + mem_attr_is_mixed(kvm, attr, lpage_start,
> >>> + lpage_start + pages));
> >>> +
> >>> + if (lpage_start == lpage_end)
> >>> + return;
> >>> +
> >>> + for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
> >>> + update_mixed(lpage_info_slot(gfn, slot, level), false);
> >>> + }
> >>
> >> Boundary check missing here for the case when gfn reaches lpage_end.
> >>
> >> if (gfn == lpage_end)
> >> return;
> >
> > In this case, it's actually the tail page that I want to scan for with
> > below code.
>
> What if there is no tail lpage?
>
> For example: memslot base_gfn = 0x1000 and npages = 0x800, so the memslot
> covers gfns 0x1000 to 0x17ff.
>
> Assume this function is called with start = 0x1000 and end = 0x1800.
> For 2M, pages is 0x200 and mask is ~0x1ff, so start and end are both 2M
> aligned.
>
> The first update_mixed() takes care of 0x1000-0x11ff.
> The update_mixed() loop covers 0x1200-0x17ff, so there are no pages left
> for the last update_mixed() to process.
Oops, good catch. I would fix it differently, by adjusting lpage_end:

	lpage_end = (end - 1) & mask;
Thanks,
Chao
>
> >
> > It's also possible I misunderstand something here.
> >
> > Chao
> >>
> >>> +
> >>> + update_mixed(lpage_info_slot(lpage_end, slot, level),
> >>> + mem_attr_is_mixed(kvm, attr, lpage_end,
> >>> + lpage_end + pages));
>
> lpage_info_slot() sometimes causes a crash here, as I noticed that
> it returns an out-of-bounds entry.
>
> Regards
> Nikunj
>
On Thu, Jul 28, 2022 at 12:41:51PM -0700,
David Matlack <[email protected]> wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f4d4ed41641b..bfc934dc9a33 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -716,6 +716,7 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> > + struct kvm_mmu_memory_cache mmu_private_sp_cache;
>
> I notice that mmu_private_sp_cache.gfp_zero is left unset so these pages
> may contain garbage. Is this by design because the TDX module can't rely
> on the contents being zero and has to take care of initializing the page
> itself? i.e. GFP_ZERO would be a waste of cycles?
>
> If I'm correct please include a comment here in the next revision to
> explain why GFP_ZERO is not necessary.
Yes, exactly. Here is the comment I added:

	/*
	 * This cache is used to allocate pages for the Secure-EPT used by the
	 * TDX module. Because the TDX module doesn't trust the VMM and
	 * initializes the pages itself, KVM doesn't initialize them. Allocate
	 * pages with garbage and give them to the TDX module.
	 */
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c517c7bca105..a5bf3e40e209 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -691,6 +691,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> > int start, end, i, r;
> > bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> >
> > + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> > + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> > + PT64_ROOT_MAX_LEVEL);
> > + if (r)
> > + return r;
> > + }
> > +
> > if (is_tdp_mmu && shadow_nonpresent_value)
> > start = kvm_mmu_memory_cache_nr_free_objects(mc);
> >
> > @@ -732,6 +739,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -1736,6 +1744,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
> > if (!direct)
> > sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > + kvm_mmu_init_private_sp(sp, NULL);
>
> This is unnecessary. kvm_mmu_page structs are zero-initialized so
> private_sp will already be NULL.
Ok.
> >
> > /*
> > * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 44a04fad4bed..9f3a6bea60a3 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -55,6 +55,10 @@ struct kvm_mmu_page {
> > u64 *spt;
> > /* hold the gfn of each spte inside spt */
> > gfn_t *gfns;
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> > + /* associated private shadow page, e.g. SEPT page. */
>
> Can we use "Secure EPT" instead of SEPT in KVM code and comments? (i.e.
> also including variable names like sept_page -> secure_ept_page)
>
> "SEPT" looks like a misspelling of SPTE, which is used all over KVM. It
> will be difficult to read code that contains both acronyms.
Makes sense. Will update it.
> > + void *private_sp;
>
> Please name this "private_spt" and move it up next to "spt".
>
> "sp" or "shadow page" is used to refer to kvm_mmu_page structs. For
> example, look at all the code in KVM that uses `struct kvm_mmu_page *sp`.
>
> "spt" is "shadow page table", i.e. the actual page table memory. See
> kvm_mmu_page.spt. Calling this field "private_spt" makes it obvious that
> this pointer is pointing to a page table.
>
> Also please update the language in the comment accordingly to "private
> shadow page table".
I'll rename them as follows:

  private_sp => private_spt
  sept_page => private_spt
  mmu_private_sp_cache => mmu_private_spt_cache
  kvm_mmu_init_private_sp => kvm_mmu_init_private_spt
  kvm_mmu_alloc_private_sp => kvm_mmu_alloc_private_spt
  kvm_mmu_free_private_sp => kvm_mmu_free_private_spt
  kvm_alloc_private_sp_for_split => kvm_alloc_private_spt_for_split

Thanks,
--
Isaku Yamahata <[email protected]>
On Thu, Jul 28, 2022 at 01:13:35PM -0700,
David Matlack <[email protected]> wrote:
> On Mon, Jun 27, 2022 at 02:53:36PM -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > For a private GPA, the CPU refers to a private page table whose contents
> > are encrypted. Dedicated APIs are used to operate on it (e.g. to update or
> > read a PTE entry), and they are expensive.
> >
> > When KVM resolves a KVM page fault, it walks the page tables. To reuse the
> > existing KVM MMU code and avoid the heavy cost of directly walking the
> > encrypted private page table, allocate one more page to mirror the existing
> > KVM page table. Resolve the KVM page fault with the existing code, and do
> > the additional operations necessary for the mirrored private page table. To
> > distinguish the two cases, the existing KVM page table is called a shared
> > page table (i.e. no mirrored private page table), and a KVM page table with
> > a mirrored private page table is called a private page table. The
> > relationship is depicted below.
> >
> > Add a private pointer to struct kvm_mmu_page for the mirrored private page
> > table and add helper functions to allocate/initialize/free a mirrored
> > private page table page. Also, add helper functions to check if a given
> > kvm_mmu_page is private. A later patch introduces hooks to operate on
> > the mirrored private page table.
> >
> > KVM page fault |
> > | |
> > V |
> > -------------+---------- |
> > | | |
> > V V |
> > shared GPA private GPA |
> > | | |
> > V V |
> > CPU/KVM shared PT root KVM private PT root | CPU private PT root
> > | | | |
> > V V | V
> > shared PT private PT <----mirror----> mirrored private PT
> > | | | |
> > | \-----------------+------\ |
> > | | | |
> > V | V V
> > shared guest page | private guest page
> > |
> > non-encrypted memory | encrypted memory
> > |
> > PT: page table
> >
> > Both CPU and KVM refer to CPU/KVM shared page table. Private page table
> > is used only by KVM. CPU refers to mirrored private page table.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 9 ++++
> > arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
> > arch/x86/kvm/mmu/tdp_mmu.c | 3 ++
> > 4 files changed, 97 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f4d4ed41641b..bfc934dc9a33 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -716,6 +716,7 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> > + struct kvm_mmu_memory_cache mmu_private_sp_cache;
> >
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c517c7bca105..a5bf3e40e209 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -691,6 +691,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> > int start, end, i, r;
> > bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> >
> > + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> > + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> > + PT64_ROOT_MAX_LEVEL);
> > + if (r)
> > + return r;
> > + }
> > +
> > if (is_tdp_mmu && shadow_nonpresent_value)
> > start = kvm_mmu_memory_cache_nr_free_objects(mc);
> >
> > @@ -732,6 +739,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -1736,6 +1744,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
> > if (!direct)
> > sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > + kvm_mmu_init_private_sp(sp, NULL);
> >
> > /*
> > * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 44a04fad4bed..9f3a6bea60a3 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -55,6 +55,10 @@ struct kvm_mmu_page {
> > u64 *spt;
> > /* hold the gfn of each spte inside spt */
> > gfn_t *gfns;
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> > + /* associated private shadow page, e.g. SEPT page. */
> > + void *private_sp;
> > +#endif
>
> write_flooding_count and unsync_children are only used in shadow MMU SPs
> and private_sp is only used in TDP MMU SPs. So it seems like we could
> put these together in a union and drop CONFIG_KVM_MMU_PRIVATE without
> increasing the size of kvm_mmu_page. i.e.
I introduced KVM_MMU_PRIVATE as an alias for INTEL_TDX_HOST because I didn't
want to use INTEL_TDX_HOST directly in kvm/mmu, and I'd like KVM_MMU_PRIVATE
to be (somewhat) independent of INTEL_TDX_HOST. Anyway, once the patch series
is merged, we can drop KVM_MMU_PRIVATE.

> 	union {
> 		struct {
> 			unsigned int unsync_children;
> 			/* Number of writes since the last time traversal visited this page. */
> 			atomic_t write_flooding_count;
> 		};
> 		/*
> 		 * The associated private shadow page table, e.g. for Secure EPT.
> 		 * Only valid if tdp_mmu_page is true.
> 		 */
> 		void *private_spt;
> 	};
>
> Then change is_private_sp() to:
>
> 	static inline bool is_private_sp(struct kvm_mmu_page *sp)
> 	{
> 		return sp->tdp_mmu_page && sp->private_spt;
> 	}
>
> This will allow us to drop CONFIG_KVM_MMU_PRIVATE, whose only benefit, as
> far as I can see, is to avoid increasing the size of kvm_mmu_page. However,
> to actually realize that benefit, cloud vendors (for example) would have
> to create separate kernel builds for TDX and non-TDX hosts, which seems
> like a huge hassle.
Good idea. I'll use a union.
--
Isaku Yamahata <[email protected]>