2023-10-16 16:21:25

by Isaku Yamahata

Subject: [PATCH v16 000/116] KVM TDX basic feature support

From: Isaku Yamahata <[email protected]>

KVM TDX basic feature support

Hello. This is v16 of the patch series for KVM TDX support. This is based on
v6.6-rc2 + the following patch series + minor fixes.

Related patch series that this patch series is based on:
- v12 KVM: guest_memfd() and per-page attributes
https://lore.kernel.org/all/[email protected]/

- KVM: guest_memfd fixes
https://lore.kernel.org/all/[email protected]/

- TDX host kernel support v13
https://lore.kernel.org/all/[email protected]/

The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
The corresponding qemu branch is found at
https://github.com/yamahata/qemu/tree/tdx/qemu-upm
How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM

A tree with more features is available at
https://github.com/intel/tdx/tree/kvm-upstream-workaround

Isaku Yamahata

Changes from v15:
- Added KVM_TDX_RELEASE_VM to reduce the destruction time
- Catch up the TDX module interface to use struct tdx_module_args
instead of struct tdx_module_output
- Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1 and handle Secure-EPT violation
with SEPT_VE_DISABLE case.
- Simplified tdx_reclaim_page()
- Reorganize the locking of tdx_release_hkid(), and use smp_call_mask()
instead of smp_call_on_cpu() so that a spinlock can be held to avoid racing
with invalidation when releasing guest memfd
- Removed AMX check as the KVM upstream supports AMX.
- Added CET flag to guest supported xss
- add a check that nr_pages isn't too large with
(nr_pages << PAGE_SHIFT) >> PAGE_SHIFT
- use __seamcall_saved_ret()
- As struct tdx_module_args doesn't match with vcpu.arch.regs, copy regs
before/after calling __seamcall_saved_ret().

Changes from v14:
https://lore.kernel.org/all/[email protected]/
- rebased to v6.5-rc2, v11 KVM guest_memfd(), v11 TDX host kernel support
- ABI change to add reserved member for future compatibility, dropped unused
member.
- handle EXIT_REASON_OTHER_SMI
- handle FEAT_CTL MSR access

Changes from v13:
- rebased to v6.4-rc3
- Make use of KVM gmem.
- Added check_cpuid callback for KVM_SET_CPUID2 as RFC patch.
- ABI change of KVM_TDX_VM_INIT as VM scoped KVM ioctl.
- Make TDX initialization not depend on kvm hardware_enable;
use vmx_hardware_enable directly.
- Drop the patch prohibiting dirty logging, following the new KVM gmem code base
- Drop parameter-only checking for some TDG.VP.VMCALL leaves; keep just the default handling

Changes from v12:
- ABI change of KVM_TDX_VM_INIT
- Rename kvm_gfn_{private, shared} to kvm_gfn_to_{private, shared}
- Move APIC BASE MSR initialization to KVM_TDX_VCPU_INIT
- Fix MTRR patch
- Make MapGpa hypercall always pass it to user space VMM
- Split hooks to TDP MMU into two part. populating and zapping.

Changes from v11:
- ABI change of KVM_TDX_VM_INIT
- Split the hook of TDP MMU to not modify handle_changed_spte()
- Enhanced commit message on mtrr patch
- Made KVM_CAP_MAX_VCPUS x86 specific

Changes from v10:
- rebased to v6.2-rc3
- support mtrr with its own patches
- Integrated fd-based private page v10
- Integrated TDX host kernel support v8
- Integrated kvm_init rework v2
- removed struct tdx_td_page and its initialization logic
- cleaned up mmio spte and require enable_mmio_caching=true for TDX
- removed dubious WARN_ON_ONCE()
- split a patch adding methods as nop into several patches

Changes from v9:
- rebased to v6.1-rc2
- Integrated fd-based private page v9 as prerequisite.
- Integrated TDX host kernel support v6
- TDP MMU: Make handle_change_spte() return a value.
- TDX: removed seamcall_lock and return -EAGAIN so that TDP MMU can retry

Changes from v8:
- rebased to v6.0-rc7
- Integrated with kvm hardware initialization. Check that all packages have at
least one online CPU when creating a guest TD and refuse CPU offline while
guest TDs are running.
- Integrated fd-based private page v8 as prerequisite.
- TDP MMU: Introduced more callbacks instead of single callback.

Changes from v7:
- Use xarray to track whether GFN is private or shared. Drop SPTE_SHARED_MASK.
The complex state machine with SPTE_SHARED_MASK was ditched.
- Large page support is implemented but will be posted as an independent RFC patch.
- fd-based private page v7 is integrated. This is mostly the same as Chao's patches.
It's on github.

Changes from v6:
- rebased to v5.19

Changes from v5:
- export __seamcall and use it
- move mutex lock from callee function of smp_call_on_cpu to the caller.
- rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
- updated comment
- drop the use of tdh_mng_key.reclaimid(), as the function exists only for
backward compatibility and always returns success
- struct kvm_tdx_cmd: metadata => flags, added __u64 error.
- make this ioctl systemwide ioctl
- ABI change to struct kvm_init_vm
- guest_tsc_khz: use kvm->arch.default_tsc_khz
- rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
- drop exporting kvm_set_tsc_khz().
- fix kvm_tdp_page_fault() for mtrr emulation
- rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
- drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
- update commit message
- rename shadow_init_value => shadow_nonpresent_value
- added ept_violation_ve_test mode
- shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
- legacy MMU case:
  - mmu_topup_shadow_page_cache(), kvm_mmu_create()
  - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
- #VE warning:
- rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
- merge into the patch "KVM: x86/mmu: Allow non-zero init value for shadow PTE",
as discussed
- fix issue pointed out by Sagi: !is_private check => (kvm_gfn_shared_mask && !is_private)
- introduce kvm_gfn_for_root(kvm, root, gfn)
- add only_shared argument to kvm_tdp_mmu_handle_gfn()
- use kvm_arch_dirty_log_supported()
- rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
- rename: is_private_prohibit_spte() => spte_shared_mask()
- fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
- dropped this patch as the change was merged into kvm/queue
- update vt_apicv_post_state_restore()
- use is_64_bit_hypercall()
- comment: expand MSMI -> Machine Check System Management Interrupt
- fixed TDX_SEPT_PFERR
- tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
- rename tdvmcall_exit_reason() => tdvmcall_leaf()
- remove optional zero check of argument.
- do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
in kvm_vcpu_ioctl_smi and __apic_accept_irq.
- WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
- introduce vcpu_deliver_init to x86_ops
- sprinkled KVM_BUG_ON()

Changes from v4:
- rebased to TDX host kernel patch series.
- include all the patches to make this patch series work.
- add [MARKER] patches to make the patch layers clear.

---
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.

A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.

We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).

In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.

* The organization of this patch series
This patch series is on top of the patches series "TDX host kernel support":
https://lore.kernel.org/lkml/[email protected]/

This patch series is available at
https://github.com/intel/tdx/tree/kvm-upstream

The related repositories (TDX qemu, TDX OVMF(tdvf) etc) are described at
https://github.com/intel/tdx/wiki/TDX-KVM

The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.

The below layers are chosen so that the device model, for example, qemu can
exercise each layering step by step. Check if TDX is supported, create TD VM,
create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
vcpu exits/hypercalls/interrupts to run TD fully.

  TDX vcpu
  interrupt/exits/hypercall<------------\
        ^                               |
        |                               |
  TD finalization                       |
        ^                               |
        |                               |
  TDX EPT violation<------------\       |
        ^                       |       |
        |                       |       |
  TD vcpu enter/exit            |       |
        ^                       |       |
        |                       |       |
  TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
        ^                       |                       ^
        |                       |                       |
  TD VM creation/destruction    \---------------KVM TDP MMU hooks
        ^                                               ^
        |                                               |
  TDX architectural definitions                 KVM TDP refactoring for TDX
        ^                                               ^
        |                                               |
  TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
  coexistence           support


The following are explanations of each layer. Each layer has a dummy commit
that starts with [MARKER] in the subject. It is intended to help identify where
each layer starts.

TDX host kernel support:
https://lore.kernel.org/lkml/[email protected]/
The guts of the system-wide initialization of the TDX module. There is an
independent patch series for host x86. TDX KVM patches call functions
this patch series provides to initialize the TDX module.

TDX, VMX coexistence:
Infrastructure to allow TDX to coexist with VMX and trigger the
initialization of the TDX module.
This layer starts with
"KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
Add TDX architectural definitions and helper functions
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
Guest TD creation/destruction: allocation and release of the TDX specific
VM structure.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
Guest TD vcpu creation/destruction: allocation and release of the TDX
specific vcpu structure.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
Create an initial guest memory image with TDX measurement. Handle
secure EPT violations to populate guest pages with TDX SEAMCALLs.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
entering into TD. Restore CPU state after exiting from TD.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
Handle various exits/hypercalls and allow interrupts to be injected so
that TD vcpu can continue running.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"

KVM MMU GPA shared bit:
Introduce a framework to handle the shared bit of the GPA. TDX
repurposes one GPA bit to indicate shared or private. If it's shared,
it's the same as the conventional VMX EPT case and the VMM can access shared
guest pages. If it's private, it's handled by the Secure EPT and the guest
page is encrypted.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
TDX Secure EPT requires different constants, e.g. the initial EPT entry
value. Various refactoring for those differences.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
Introduce a framework for the TDP MMU to add hooks in addition to direct EPT
access. TDX adds Secure EPT, which is an enhancement to VMX EPT. Unlike
conventional VMX EPT, the CPU can't directly read/write Secure EPT. Instead,
TDX SEAMCALLs are used to operate on Secure EPT.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
Introduce framework to handle switching guest pages from private/shared
to shared/private. For a given GPA, a guest page can be assigned to a
private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
guest TD converts GPA assignments from private (or shared) to shared (or
private).
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "

KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
Guest private memory requires different memory management in KVM. The
patch series proposes a way for it and integrates with TDX KVM.

(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.

The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.

TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode. A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.

TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.

TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and is encrypted (using the guest TD's ephemeral private key). At a high
level, TDCS holds information for controlling TD operation as a whole:
execution controls, EPTP, MSR bitmaps, etc. that KVM needs to set up. Note that
MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant to
have the same value for all VCPUs of the same TD.

Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.

Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM. These control
structures are pointed to by fields in the TD VMCS.

The above means that 1) KVM needs to allocate different data structures for TDs,
2) KVM can reuse the existing code for TDs for some operations, and 3) it needs
to define TD-specific handling for others, redirecting operations to the TDX
specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback(); else
vmx_callback();".
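
As a rough illustration of that redirection (a sketch only; the actual
wrappers are added in arch/x86/kvm/vmx/main.c by this series and may differ
in detail):

  /*
   * Sketch of the callback redirection described above: dispatch to the
   * TDX or VMX implementation based on the vCPU type.
   */
  static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
  {
          if (is_td_vcpu(vcpu)) {
                  tdx_vcpu_reset(vcpu, init_event);
                  return;
          }

          vmx_vcpu_reset(vcpu, init_event);
  }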

* TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.

In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.

During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX. Since guest TDs usually require I/O and the data exchange
needs to be done via shared memory, KVM needs to use the current EPT
functionality even for TDs.

* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT.
Since execution of such interface functions takes much longer than accessing
memory directly, in KVM we use the existing TDP code to mirror the
Secure EPT for the TD.

This way, we can effectively walk Secure EPT without using the TDX interface
functions.

* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.

* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU. KVM currently
owns TSC virtualization for VMs, but the TDX module does so for TDs.

* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCE. Instead, a PV way is needed
for the TD to communicate with the VMM. For now, KVM silently ignores MCE
requests by the VMM. MSRs related to MCE (e.g. MCE bank registers) can be
naturally emulated by paravirtualizing MSR access.

For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
available.

* Restrictions or future work
Some features are not included to reduce patch size. Those features are
addressed as future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more

* Prerequisites
It's required to load the TDX module and initialize it, which is out of the
scope of this patch series. Another independent patch series for the common x86
code is planned. It defines CONFIG_INTEL_TDX_HOST and this patch series uses
CONFIG_INTEL_TDX_HOST. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
module is initialized and the TDX module APIs for the TDX guest life cycle,
like tdh.mng.init, are ready for KVM to use.

Concretely, global initialization, LP (Logical Processor) initialization, global
configuration, the key configuration, and TDMR and PAMT initialization are done.
The state of the TDX module is SYS_READY. Please refer to the TDX module
specification, the chapter Intel TDX Module Lifecycle State Machine.

** Detecting the TDX module readiness.
The TDX host patch series implements the detection of TDX module availability
and its initialization so that KVM can use it. It also manages the Host KeyIDs
(HKID) assigned to guest TDs.
The assumed APIs that the TDX host patch series provides are:
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
Return the system-wide information about the TDX module. NULL if the TDX
module isn't initialized.
- int tdx_enable(void);
Initialize the TDX module so that it is ready for KVM to use.
- extern u32 tdx_global_keyid __read_mostly;
Global host key ID used for the TDX module itself.
- u32 tdx_get_num_keyid(void);
Return the number of available TDX private host key IDs.
- int tdx_keyid_alloc(void);
Allocate an HKID for a guest TD.
- void tdx_keyid_free(int keyid);
Free the HKID of a guest TD.
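
For illustration only, KVM-side code could consume these assumed APIs roughly
as below (a hypothetical sketch, not code from this series; the declarations
are assumed to come from <asm/tdx.h> provided by the TDX host patch series):

  /* Hypothetical sketch of consuming the assumed TDX host APIs above. */
  #include <asm/tdx.h>

  static int __init sketch_tdx_module_setup(void)
  {
          const struct tdsysinfo_struct *sysinfo;
          int hkid, ret;

          ret = tdx_enable();             /* initialize the TDX module */
          if (ret)
                  return ret;

          sysinfo = tdx_get_sysinfo();    /* system-wide TDX module info */
          if (!sysinfo)
                  return -ENODEV;

          hkid = tdx_keyid_alloc();       /* private HKID for a guest TD */
          if (hkid < 0)
                  return hkid;

          /* ... configure the guest TD with the allocated HKID ... */

          tdx_keyid_free(hkid);
          return 0;
  }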

(****)
* TDX KVM high-level design
- Host key ID management
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
It is assumed that the TDX host patch series implements the necessary functions:
u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void) and
void tdx_keyid_free(int keyid).

- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct vcpu_vmx. To
identify the VM, introduce a VM type to specify which VM type, VMX (default) or
TDX, is used.
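
For reference, the TDX specific structures introduced later in the series wrap
the common KVM structures roughly as below (abridged sketch; the full
definitions live in arch/x86/kvm/vmx/tdx.h):

  /* Abridged sketch of the TDX specific VM/VCPU structures (see tdx.h). */
  struct kvm_tdx {
          struct kvm kvm;         /* must be first: common KVM VM state */
          /* TDX specific VM state: HKID, TDR/TDCS pages, attributes, ... */
  };

  struct vcpu_tdx {
          struct kvm_vcpu vcpu;   /* must be first: common KVM vCPU state */
          /* TDX specific vCPU state: TDVPR/TDCX pages, exit info, ... */
  };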

- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, set initial guest memory and measurement.

The creation of a TDX VM requires five additional operations in addition to the
conventional VM creation, as sketched after this list.
- Get KVM system capability to check if TDX VM type is supported
- VM creation (KVM_CREATE_VM)
- New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
- New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
- VCPU creation (KVM_CREATE_VCPU)
- New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
- New: Initialize guest memory as boot state and extend the measurement with
the memory. KVM_TDX_INIT_MEM_REGION.
- New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
TDX VM contents.
- VCPU RUN (KVM_VCPU_RUN)
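
A hypothetical userspace sketch of this flow is below. The KVM_TDX_* command
names, struct kvm_tdx_cmd, and the KVM_X86_TDX_VM type value follow this
series' uAPI as described above and are not final; error handling and command
arguments are omitted.

  /* Hypothetical userspace sketch of TDX VM creation (no error handling). */
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int create_td(int kvm_fd)
  {
          struct kvm_tdx_cmd cmd = { 0 };         /* from this series' uAPI */
          int vm_fd, vcpu_fd;

          vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);

          cmd.id = KVM_TDX_GET_CAPABILITY;        /* TDX system parameters */
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

          cmd.id = KVM_TDX_INIT_VM;               /* TDX VM parameters */
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

          vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);

          cmd.id = KVM_TDX_INIT_VCPU;             /* TDX vCPU parameters */
          ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

          cmd.id = KVM_TDX_INIT_MEM_REGION;       /* measured initial memory */
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

          cmd.id = KVM_TDX_FINALIZE;              /* finalize the measurement */
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

          return vcpu_fd;                         /* then KVM_RUN as usual */
  }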

- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it, for example by accessing CPU registers, injecting
exceptions, or accessing guest memory. Those operations are silently ignored,
returning zero or the initial reset value when requested via KVM API ioctls.

VM/VCPU state and callbacks for TDX specific operations.
Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".

Operations on the CPU state
silently ignore operations on the guest state. For example, the write to
CPU registers is ignored and the read from CPU registers returns 0.

. ignore access to CPU registers except for allowed ones.
. TSC: add a check if the TSC is immutable and return an error, because the KVM
implementation updates the internal TSC state and it's difficult to back
out those changes. Instead, skip the logic.
. dirty logging: add check if dirty logging is supported.
. exceptions/SMI/MCE/SIPI/INIT: silently ignore

Note: virtual external interrupt and NMI can be injected into TDX guests.

- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
the guest physical address is private (the bit is cleared) or shared (the bit is
set). The bit is called a stolen bit.

- Stolen bits framework
Systematically tracks which guest physical addresses are used as shared or
private (see the sketch below).
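
A minimal sketch of such helpers is below (the real ones live in
arch/x86/kvm/mmu.h in this series; kvm_gfn_shared_mask() is assumed to return
the per-VM shared bit at GFN granularity, or 0 for non-TDX VMs):

  /* Sketch of the GPA shared-bit ("stolen bit") helpers described above. */
  static inline gfn_t kvm_gfn_to_shared(struct kvm *kvm, gfn_t gfn)
  {
          return gfn | kvm_gfn_shared_mask(kvm);
  }

  static inline gfn_t kvm_gfn_to_private(struct kvm *kvm, gfn_t gfn)
  {
          return gfn & ~kvm_gfn_shared_mask(kvm);
  }

  static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
  {
          gfn_t mask = kvm_gfn_shared_mask(kvm);

          return mask && !(gpa_to_gfn(gpa) & mask);
  }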

- Shared EPT and secure EPT
There are two EPTs: Shared EPT (the conventional one) and Secure
EPT (the new one). Shared EPT is handled the same as the conventional
case, with the stolen bit set in the GPA. Secure EPT points to private
guest pages. To resolve an EPT violation, KVM walks one of the two
EPTs based on the faulted GPA. Because it's costly to access Secure
EPT with SEAMCALLs while walking EPTs for a private guest physical
address, another private EPT is used as a mirror of the Secure EPT
with the existing logic, at the cost of extra memory.

The following depicts the relationship.

                  KVM                                 |         TDX module
                   |                                  |             |
      -------------+-----------                       |             |
      |                       |                       |             |
      V                       V                       |             |
  shared GPA              private GPA                 |             |
CPU shared EPT pointer  KVM private EPT pointer       |     CPU secure EPT pointer
      |                       |                       |             |
      |                       |                       |             |
      V                       V                       |             V
  shared EPT              private EPT--------mirror----------->Secure EPT
      |                       |                       |             |
      |                       \-----------------------+------\      |
      |                                               |      |      |
      V                                               |      V      V
  shared guest page                                   |   private guest page
                                                      |
                                                      |
    non-encrypted memory                              |    encrypted memory
                                                      |

- Operating on Secure EPT
Use the TDX module APIs to operate on the Secure EPT. To call the TDX APIs
while resolving EPT violations, add hooks for the additional operations and
wire them to the TDX backend.
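
For example, a hook that installs a private SPTE could be wired to the TDX
backend roughly as below. This is a simplified sketch only: the hook name, the
tdh_mem_page_aug() wrapper, TDX_ERROR_SEPT_BUSY, and the tdr_pa field follow
this series but may differ in detail from the actual patches.

  /* Simplified sketch of a Secure-EPT hook backed by a SEAMCALL wrapper. */
  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
                                       enum pg_level level, kvm_pfn_t pfn)
  {
          struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
          struct tdx_module_args out;
          u64 err;

          /* Map the private page into the Secure EPT via SEAMCALL. */
          err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gfn_to_gpa(gfn),
                                 pfn_to_hpa(pfn), &out);
          if (unlikely(err == TDX_ERROR_SEPT_BUSY))
                  return -EAGAIN;     /* contended entry: let the TDP MMU retry */
          if (KVM_BUG_ON(err, kvm))
                  return -EIO;

          return 0;
  }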

* References

[1] TDX specification
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
https://cdrdv2.intel.com/v1/dl/getContent/726790
[3] Intel CPU Architectural Extensions Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 Specification
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
[5] Intel TDX Loader Interface Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
https://cdrdv2.intel.com/v1/dl/getContent/726790
[7] Intel TDX Virtual Firmware Design Guide
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
[8] intel public github
kvm TDX branch: https://github.com/intel/tdx/tree/kvm
TDX guest branch: https://github.com/intel/tdx/tree/guest
qemu TDX: https://github.com/intel/qemu-tdx
[9] TDVF
https://github.com/tianocore/edk2-staging/tree/TDVF
This was merged into EDK2 main branch. https://github.com/tianocore/edk2

Chao Gao (2):
KVM: x86/mmu: Assume guest MMIOs are shared
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
wrmsr

Isaku Yamahata (93):
KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_hardware_setup()
KVM: x86/vmx: Refactor KVM VMX module init/exit functions
KVM: VMX: Reorder vmx initialization with kvm vendor initialization
KVM: TDX: Initialize the TDX module when loading the KVM intel kernel
module
KVM: TDX: Add placeholders for TDX VM/vcpu structure
KVM: TDX: Make TDX VM type supported
[MARKER] The start of TDX KVM patch series: TDX architectural
definitions
KVM: TDX: Define TDX architectural definitions
KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
KVM: TDX: Retry SEAMCALL on the lack of entropy error
KVM: TDX: Add helper functions to print TDX SEAMCALL error
[MARKER] The start of TDX KVM patch series: TD VM creation/destruction
x86/cpu: Add helper functions to allocate/free TDX private host key id
x86/virt/tdx: Add a helper function to return system wide info about
TDX module
KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
KVM: x86, tdx: Make KVM_CAP_MAX_VCPUS backend specific
KVM: TDX: create/destroy VM structure
KVM: TDX: initialize VM with TDX specific parameters
KVM: TDX: Make pmu_intel.c ignore guest TD case
KVM: TDX: Refuse to unplug the last cpu on the package
[MARKER] The start of TDX KVM patch series: TD vcpu
creation/destruction
KVM: TDX: allocate/free TDX vcpu structure
KVM: TDX: Do TDX specific vcpu initialization
[MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
KVM: x86/mmu: introduce config for PRIVATE KVM MMU
KVM: x86/mmu: Add address conversion functions for TDX shared bit of
GPA
[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
TDX
KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
KVM: x86/mmu: Add Suppress VE bit to
shadow_mmio_mask/shadow_present_mask
KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
KVM: x86/mmu: Disallow fast page fault on private GPA
KVM: VMX: Introduce test mode related to EPT violation VE
[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at
allocation
KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role
KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
KVM: x86/tdp_mmu: Sprinkle __must_check
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
[MARKER] The start of TDX KVM patch series: TDX EPT violation
KVM: TDX: Add accessors VMX VMCS helpers
KVM: TDX: Require TDP MMU and mmio caching for TDX
KVM: TDX: TDP MMU TDX support
KVM: TDX: MTRR: implement get_mt_mask() for TDX
[MARKER] The start of TDX KVM patch series: TD finalization
KVM: TDX: Create initial guest memory
KVM: TDX: Finalize VM initialization
[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
KVM: TDX: Implement TDX vcpu enter/exit path
KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
KVM: TDX: restore host xsave state when exit from the guest TD
KVM: TDX: restore user ret MSRs
[MARKER] The start of TDX KVM patch series: TD vcpu
exits/interrupts/hypercalls
KVM: TDX: complete interrupts after tdexit
KVM: TDX: restore debug store when TD exit
KVM: TDX: handle vcpu migration over logical processor
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
behavior
KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
KVM: TDX: Implement interrupt injection
KVM: TDX: Implements vcpu request_immediate_exit
KVM: TDX: Implement methods to inject NMI
KVM: TDX: Add a place holder to handle TDX VM exit
KVM: TDX: handle EXIT_REASON_OTHER_SMI
KVM: TDX: handle ept violation/misconfig exit
KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI
KVM: TDX: Add a place holder for handler of TDX hypercalls
(TDG.VP.VMCALL)
KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL
KVM: TDX: Handle TDX PV CPUID hypercall
KVM: TDX: Handle TDX PV HLT hypercall
KVM: TDX: Handle TDX PV port io hypercall
KVM: TDX: Implement callbacks for MSR operations for TDX
KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
KVM: TDX: Handle MSR MTRRCap and MTRRDefType access
KVM: TDX: Handle MSR IA32_FEAT_CTL MSR and IA32_MCG_EXT_CTL
KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
KVM: TDX: Silently discard SMI request
KVM: TDX: Silently ignore INIT/SIPI
KVM: TDX: Add methods to ignore accesses to CPU state
KVM: TDX: Add methods to ignore guest instruction emulation
KVM: TDX: Add a method to ignore dirty logging
KVM: TDX: Add methods to ignore VMX preemption timer
KVM: TDX: Add methods to ignore accesses to TSC
KVM: TDX: Ignore setting up mce
KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch
KVM: TDX: Add methods to ignore virtual apic related operation
KVM: TDX: Inhibit APICv for TDX guest
Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
KVM: TDX: Add hint TDX ioctl to release Secure-EPT
RFC: KVM: x86: Add x86 callback to check cpuid
RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2
[MARKER] the end of (the first phase of) TDX KVM patch series

Sean Christopherson (17):
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
KVM: TDX: Add TDX "architectural" error codes
KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
KVM: Allow page-sized MMU caches to be initialized with custom 64-bit
values
KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
SPTE
KVM: x86/mmu: Allow per-VM override of the TDP max page level
KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: VMX: Move setting of EPT MMU masks to common VT-x code
KVM: TDX: Add load_mmu_pgd method for TDX
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
KVM: TDX: Add support for find pending IRQ in a protected local APIC
KVM: x86: Assume timer IRQ was injected if APIC state is protected
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
argument
KVM: VMX: Move NMI/exception handler to common helper
KVM: x86: Split core of hypercall emulation to helper function
KVM: TDX: Handle TDX PV MMIO hypercall

Yan Zhao (1):
KVM: x86/mmu: TDX: Do not enable page track for TD guest

Yang Weijiang (1):
KVM: TDX: Add TSX_CTRL msr into uret_msrs list

Yao Yuan (1):
KVM: TDX: Handle vmentry failure for INTEL TD guest

Yuan Yao (1):
KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

Documentation/virt/kvm/api.rst | 9 +-
Documentation/virt/kvm/index.rst | 1 +
Documentation/virt/kvm/x86/index.rst | 2 +
Documentation/virt/kvm/x86/intel-tdx.rst | 362 +++
Documentation/virt/kvm/x86/tdx-tdp-mmu.rst | 443 +++
arch/x86/events/intel/ds.c | 1 +
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 17 +-
arch/x86/include/asm/kvm_host.h | 84 +-
arch/x86/include/asm/tdx.h | 85 +
arch/x86/include/asm/vmx.h | 14 +
arch/x86/include/uapi/asm/kvm.h | 89 +
arch/x86/include/uapi/asm/vmx.h | 5 +-
arch/x86/kvm/Kconfig | 6 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/cpuid.c | 27 +-
arch/x86/kvm/cpuid.h | 2 +
arch/x86/kvm/irq.c | 3 +
arch/x86/kvm/lapic.c | 33 +-
arch/x86/kvm/lapic.h | 2 +
arch/x86/kvm/mmu.h | 31 +
arch/x86/kvm/mmu/mmu.c | 196 +-
arch/x86/kvm/mmu/mmu_internal.h | 109 +-
arch/x86/kvm/mmu/page_track.c | 3 +
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.c | 17 +-
arch/x86/kvm/mmu/spte.h | 27 +-
arch/x86/kvm/mmu/tdp_iter.h | 14 +-
arch/x86/kvm/mmu/tdp_mmu.c | 421 ++-
arch/x86/kvm/mmu/tdp_mmu.h | 7 +-
arch/x86/kvm/smm.h | 7 +-
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/vmx/common.h | 166 +
arch/x86/kvm/vmx/main.c | 1239 ++++++++
arch/x86/kvm/vmx/pmu_intel.c | 46 +-
arch/x86/kvm/vmx/pmu_intel.h | 28 +
arch/x86/kvm/vmx/posted_intr.c | 43 +-
arch/x86/kvm/vmx/posted_intr.h | 13 +
arch/x86/kvm/vmx/tdx.c | 3229 ++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 272 ++
arch/x86/kvm/vmx/tdx_arch.h | 221 ++
arch/x86/kvm/vmx/tdx_errno.h | 43 +
arch/x86/kvm/vmx/tdx_error.c | 20 +
arch/x86/kvm/vmx/tdx_ops.h | 267 ++
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 676 ++--
arch/x86/kvm/vmx/vmx.h | 52 +-
arch/x86/kvm/vmx/x86_ops.h | 257 ++
arch/x86/kvm/x86.c | 117 +-
arch/x86/kvm/x86.h | 2 +
arch/x86/virt/vmx/tdx/seamcall.S | 4 +
arch/x86/virt/vmx/tdx/tdx.c | 48 +-
arch/x86/virt/vmx/tdx/tdx.h | 51 -
include/linux/kvm_host.h | 1 +
include/linux/kvm_types.h | 1 +
include/uapi/linux/kvm.h | 87 +
tools/arch/x86/include/uapi/asm/kvm.h | 96 +
virt/kvm/kvm_main.c | 31 +-
58 files changed, 8280 insertions(+), 759 deletions(-)
create mode 100644 Documentation/virt/kvm/x86/intel-tdx.rst
create mode 100644 Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
create mode 100644 arch/x86/kvm/vmx/common.h
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx.h
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/kvm/vmx/tdx_error.c
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
create mode 100644 arch/x86/kvm/vmx/x86_ops.h

--
2.25.1


2023-10-16 16:21:30

by Isaku Yamahata

Subject: [PATCH v16 065/116] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr

From: Chao Gao <[email protected]>

Several MSRs are constant and only used in userspace (ring 3). But VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when the
kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.

The TDX module unconditionally resets some of these MSRs to the architectural
INIT state on TD exit. This makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware. This inconsistency needs to be
fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values. kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr. So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.

Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 25 ++++++++++++++++++++-----
2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e6bcafd947b..777981e97f1a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2217,6 +2217,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
int kvm_add_user_return_msr(u32 msr);
int kvm_find_user_return_msr(u32 msr);
int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_update_cache(unsigned int index, u64 val);

static inline bool kvm_is_supported_user_return_msr(u32 msr)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f524a7d3fda8..916c462f7d8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -440,6 +440,15 @@ static void kvm_user_return_msr_cpu_online(void)
}
}

+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+ if (!msrs->registered) {
+ msrs->urn.on_user_return = kvm_on_user_return;
+ user_return_notifier_register(&msrs->urn);
+ msrs->registered = true;
+ }
+}
+
int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
unsigned int cpu = smp_processor_id();
@@ -454,15 +463,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
return 1;

msrs->values[slot].curr = value;
- if (!msrs->registered) {
- msrs->urn.on_user_return = kvm_on_user_return;
- user_return_notifier_register(&msrs->urn);
- msrs->registered = true;
- }
+ kvm_user_return_register_notifier(msrs);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);

+/* Update the cache, "curr", and register the notifier */
+void kvm_user_return_update_cache(unsigned int slot, u64 value)
+{
+ struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+ msrs->values[slot].curr = value;
+ kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
+
static void drop_user_return_notifiers(void)
{
unsigned int cpu = smp_processor_id();
--
2.25.1

2023-10-16 16:21:33

by Isaku Yamahata

Subject: [PATCH v16 066/116] KVM: TDX: restore user ret MSRs

From: Isaku Yamahata <[email protected]>

Several user ret MSRs are clobbered on TD exit. Restore those values on
TD exit and before returning to ring 3. Because TSX_CTRL requires special
treatment, this patch doesn't address it.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 98827160c6f9..87ed1a255c3b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -509,6 +509,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
}

+struct tdx_uret_msr {
+ u32 msr;
+ unsigned int slot;
+ u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+ {.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
+ {.msr = MSR_STAR,},
+ {.msr = MSR_LSTAR,},
+ {.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_update_cache(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+ kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
+ tdx_uret_msrs[i].defval);
+}
+
static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -601,6 +623,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

+ tdx_user_return_update_cache();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

@@ -1833,6 +1856,26 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
return -EINVAL;
}

+ for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+ /*
+ * Check whether the MSRs (tdx_uret_msrs) can be saved/restored
+ * before returning to user space.
+ *
+ * this_cpu_ptr(user_return_msrs)->registered isn't checked
+ * because the registration is done at vcpu runtime by
+ * kvm_set_user_return_msr().
+ * This sets up CPU features before running any vcpu, so
+ * 'registered' is still false here.
+ */
+ tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+ if (tdx_uret_msrs[i].slot == -1) {
+ /* If any MSR isn't supported, it is a KVM bug */
+ pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+ tdx_uret_msrs[i].msr);
+ return -EIO;
+ }
+ }
+
max_pkgs = topology_max_packages();
tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
GFP_KERNEL);
--
2.25.1

2023-10-16 16:21:33

by Isaku Yamahata

Subject: [PATCH v16 004/116] KVM: VMX: Reorder vmx initialization with kvm vendor initialization

From: Isaku Yamahata <[email protected]>

Reorder the initialization so that the unwind order on error matches the
vmx_exit() cleanup order.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4059364fc5e0..2e8a0de2c96a 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -180,11 +180,11 @@ static int __init vt_init(void)
*/
hv_init_evmcs();

- r = kvm_x86_vendor_init(&vt_init_ops);
+ r = vmx_init();
if (r)
- return r;
+ goto err_vmx_init;

- r = vmx_init();
+ r = kvm_x86_vendor_init(&vt_init_ops);
if (r)
goto err_vmx_init;

@@ -201,9 +201,9 @@ static int __init vt_init(void)
return 0;

err_kvm_init:
- vmx_exit();
-err_vmx_init:
kvm_x86_vendor_exit();
+err_vmx_init:
+ vmx_exit();
return r;
}
module_init(vt_init);
--
2.25.1

2023-10-16 16:21:45

by Isaku Yamahata

Subject: [PATCH v16 077/116] KVM: TDX: Implements vcpu request_immediate_exit

From: Isaku Yamahata <[email protected]>

Now that we are able to inject interrupts into a TDX vcpu, it's ready to block
the TDX vcpu. Wire up kvm x86 methods for blocking/unblocking the vcpu for TDX.
To unblock on pending events, the request_immediate_exit method is also needed.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 37d28ee08c79..0ce9578a1453 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -381,6 +381,16 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
vmx_enable_irq_window(vcpu);
}

+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ __kvm_request_immediate_exit(vcpu);
+ return;
+ }
+
+ vmx_request_immediate_exit(vcpu);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -527,7 +537,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,

- .request_immediate_exit = vmx_request_immediate_exit,
+ .request_immediate_exit = vt_request_immediate_exit,

.sched_in = vt_sched_in,

--
2.25.1

2023-10-16 16:22:39

by Isaku Yamahata

Subject: [PATCH v16 073/116] KVM: TDX: Add support for find pending IRQ in a protected local APIC

From: Sean Christopherson <[email protected]>

Add a flag and a hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if a TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.

Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.

Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.

Note, this doesn't handle interrupts that have been delivered to the vCPU
but not yet recognized by the core, i.e. interrupts that are sitting in
vmcs.GUEST_INTR_STATUS. Querying that state requires a SEAMCALL and will
be supported in a future patch.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/irq.c | 3 +++
arch/x86/kvm/lapic.c | 3 +++
arch/x86/kvm/lapic.h | 2 ++
arch/x86/kvm/vmx/main.c | 10 ++++++++++
arch/x86/kvm/vmx/tdx.c | 6 ++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
8 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 6ec58c078523..19e1f22b92b1 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -121,6 +121,7 @@ KVM_X86_OP_OPTIONAL(pi_update_irte)
KVM_X86_OP_OPTIONAL(pi_start_assignment)
KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
+KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt)
KVM_X86_OP_OPTIONAL(set_hv_timer)
KVM_X86_OP_OPTIONAL(cancel_hv_timer)
KVM_X86_OP(setup_mce)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b19e8eac4c80..fcd1e2d0dc92 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1778,6 +1778,7 @@ struct kvm_x86_ops {
void (*pi_start_assignment)(struct kvm *kvm);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+ bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);

int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index b2c397dd2bc6..fd6af5530c32 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -100,6 +100,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
if (kvm_cpu_has_extint(v))
return 1;

+ if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+ return static_call(kvm_x86_protected_apic_has_interrupt)(v);
+
return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index dcd60b39e794..cce4a5a29bd8 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2855,6 +2855,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
if (!kvm_apic_present(vcpu))
return -1;

+ if (apic->guest_apic_protected)
+ return -1;
+
__apic_update_ppr(apic, &ppr);
return apic_has_interrupt_for_ppr(apic, ppr);
}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 0a0ea4b5dd8c..749b7b629c47 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -66,6 +66,8 @@ struct kvm_lapic {
bool sw_enabled;
bool irr_pending;
bool lvt0_in_nmi_mode;
+ /* Select registers in the vAPIC cannot be read/written. */
+ bool guest_apic_protected;
/* Number of bits set in ISR. */
s16 isr_count;
/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 994ed4487605..302db8f4afc2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -94,6 +94,8 @@ static __init int vt_hardware_setup(void)

if (enable_tdx)
vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
+ else
+ vt_x86_ops.protected_apic_has_interrupt = NULL;

return 0;
}
@@ -230,6 +232,13 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx_vcpu_load(vcpu, cpu);
}

+static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ KVM_BUG_ON(!is_td_vcpu(vcpu), vcpu->kvm);
+
+ return tdx_protected_apic_has_interrupt(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -421,6 +430,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sync_pir_to_irr = vmx_sync_pir_to_irr,
.deliver_interrupt = vmx_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+ .protected_apic_has_interrupt = vt_protected_apic_has_interrupt,

.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b936c98f601c..173a29b36432 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -557,6 +557,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return -EINVAL;

fpstate_set_confidential(&vcpu->arch.guest_fpu);
+ vcpu->arch.apic->guest_apic_protected = true;

vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;

@@ -598,6 +599,11 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}

+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ return pi_has_pending_interrupt(vcpu);
+}
+
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0b7437368e91..3ab0b270b1f9 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -155,6 +155,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -187,6 +188,7 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTP
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1

2023-10-16 16:26:25

by Isaku Yamahata

Subject: [PATCH v16 072/116] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

From: Isaku Yamahata <[email protected]>

Add a flag, KVM_DEBUGREG_AUTO_SWITCH, to skip saving/restoring DRs
irrespective of any other flags. TDX-SEAM unconditionally saves and
restores guest DRs and resets them to the architectural INIT state on TD exit.
So, KVM needs to save host DRs before TD enter without restoring guest DRs
and restore host DRs after TD exit.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Chao Gao <[email protected]>
Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 ++++++++--
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/x86.c | 11 ++++++++---
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 777981e97f1a..b19e8eac4c80 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -604,8 +604,14 @@ struct kvm_pmu {
struct kvm_pmu_ops;

enum {
- KVM_DEBUGREG_BP_ENABLED = 1,
- KVM_DEBUGREG_WONT_EXIT = 2,
+ KVM_DEBUGREG_BP_ENABLED = BIT(0),
+ KVM_DEBUGREG_WONT_EXIT = BIT(1),
+ /*
+ * Guest debug registers (DR0-3 and DR6) are saved/restored by hardware
+ * on exit from or enter to guest. KVM needn't switch them. Because DR7
+ * is cleared on exit from the guest, DR7 needs to be saved/restored.
+ */
+ KVM_DEBUGREG_AUTO_SWITCH = BIT(2),
};

struct kvm_mtrr_range {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 511ba2496372..b936c98f601c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -560,6 +560,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;

+ vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
vcpu->arch.cr0_guest_owned_bits = -1ul;
vcpu->arch.cr4_guest_owned_bits = -1ul;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916c462f7d8b..a75b0b9ce9bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10792,7 +10792,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

- if (unlikely(vcpu->arch.switch_db_regs)) {
+ if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -10838,6 +10838,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+ WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
kvm_update_dr0123(vcpu);
kvm_update_dr7(vcpu);
@@ -10850,8 +10851,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* care about the messed up debug address registers. But if
* we have some of them active, restore the old state.
*/
- if (hw_breakpoint_active())
- hw_breakpoint_restore();
+ if (hw_breakpoint_active()) {
+ if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
+ hw_breakpoint_restore();
+ else
+ set_debugreg(__this_cpu_read(cpu_dr7), 7);
+ }

vcpu->arch.last_vmentry_cpu = vcpu->cpu;
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
--
2.25.1

2023-10-16 16:27:16

by Isaku Yamahata

Subject: [PATCH v16 070/116] KVM: TDX: restore debug store when TD exit

From: Isaku Yamahata <[email protected]>

Because debug store is clobbered, restore it on TD exit.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/events/intel/ds.c | 1 +
arch/x86/kvm/vmx/tdx.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index eb8dd8b8a1e8..58e86b726436 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2428,3 +2428,4 @@ void perf_restore_debug_store(void)

wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
}
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e29df39630e8..a4401ae851d4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -639,6 +639,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(tdx);

tdx_user_return_update_cache(vcpu);
+ perf_restore_debug_store();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

--
2.25.1

2023-10-16 16:27:16

by Isaku Yamahata

Subject: [PATCH v16 071/116] KVM: TDX: handle vcpu migration over logical processor

From: Isaku Yamahata <[email protected]>

For vcpu migration, in the case of VMX, the VMCS is flushed on the source pcpu
and loaded on the target pcpu. There are corresponding TDX SEAMCALL APIs;
call them on vcpu migration. The logic is mostly the same as VMX except that
the TDX SEAMCALLs are used.

When shutting down the machine, (VMX or TDX) vcpus need to be shut down on
each pcpu. Do the same for TDX with the TDX SEAMCALL APIs.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 32 ++++++-
arch/x86/kvm/vmx/tdx.c | 190 ++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 2 +
arch/x86/kvm/vmx/x86_ops.h | 4 +
4 files changed, 221 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index bc75ed8ff371..994ed4487605 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -44,6 +44,14 @@ static int vt_hardware_enable(void)
return ret;
}

+static void vt_hardware_disable(void)
+{
+ /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+ if (enable_tdx)
+ tdx_hardware_disable();
+ vmx_hardware_disable();
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -212,6 +220,16 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_vcpu_run(vcpu);
}

+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_load(vcpu, cpu);
+ return;
+ }
+
+ vmx_vcpu_load(vcpu, cpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -271,6 +289,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}

+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_sched_in(vcpu, cpu);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -313,7 +339,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.offline_cpu = tdx_offline_cpu,

.hardware_enable = vt_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
+ .hardware_disable = vt_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

.is_vm_type_supported = vt_is_vm_type_supported,
@@ -331,7 +357,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_reset = vt_vcpu_reset,

.prepare_switch_to_guest = vt_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
+ .vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,

.update_exception_bitmap = vmx_update_exception_bitmap,
@@ -418,7 +444,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.request_immediate_exit = vmx_request_immediate_exit,

- .sched_in = vmx_sched_in,
+ .sched_in = vt_sched_in,

.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a4401ae851d4..511ba2496372 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -73,6 +73,14 @@ static DEFINE_MUTEX(tdx_lock);
static struct mutex *tdx_mng_key_config_lock;
static atomic_t nr_configured_hkid;

+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
+ * Protected by interrupt mask. This list is manipulated in process context
+ * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
{
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
@@ -105,6 +113,37 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
return kvm_tdx->finalized;
}

+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ list_del(&to_tdx(vcpu)->cpu_list);
+
+ /*
+ * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
+ * otherwise, a different CPU can see vcpu->cpu == -1 and add the vCPU
+ * to its list before it is deleted from this CPU's list.
+ */
+ smp_wmb();
+
+ vcpu->cpu = -1;
+}
+
+static void tdx_disassociate_vp_arg(void *vcpu)
+{
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_disassociate_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ int cpu = vcpu->cpu;
+
+ if (unlikely(cpu == -1))
+ return;
+
+ smp_call_function_single(cpu, tdx_disassociate_vp_arg, vcpu, 1);
+}
+
static void tdx_clear_page(unsigned long page_pa)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -184,6 +223,87 @@ static void tdx_reclaim_td_page(unsigned long td_page_pa)
free_page((unsigned long)__va(td_page_pa));
}

+struct tdx_flush_vp_arg {
+ struct kvm_vcpu *vcpu;
+ u64 err;
+};
+
+static void tdx_flush_vp(void *arg_)
+{
+ struct tdx_flush_vp_arg *arg = arg_;
+ struct kvm_vcpu *vcpu = arg->vcpu;
+ u64 err;
+
+ arg->err = 0;
+ lockdep_assert_irqs_disabled();
+
+ /* Task migration can race with CPU offlining. */
+ if (unlikely(vcpu->cpu != raw_smp_processor_id()))
+ return;
+
+ /*
+ * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
+ * list tracking still needs to be updated so that it's correct if/when
+ * the vCPU does get initialized.
+ */
+ if (is_td_vcpu_created(to_tdx(vcpu))) {
+ /*
+ * No need to retry. The TDX resources needed for TDH.VP.FLUSH are:
+ * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
+ * vp flush function is called when destroying a vcpu/TD or on vcpu
+ * migration. No other thread uses TDVPR in those cases.
+ */
+ err = tdh_vp_flush(to_tdx(vcpu)->tdvpr_pa);
+ if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+ /*
+ * This function is called in IPI context. Do not use
+ * printk to avoid console semaphore.
+ * The caller prints out the error message, instead.
+ */
+ if (err)
+ arg->err = err;
+ }
+ }
+
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ struct tdx_flush_vp_arg arg = {
+ .vcpu = vcpu,
+ };
+ int cpu = vcpu->cpu;
+
+ if (unlikely(cpu == -1))
+ return;
+
+ smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
+ if (WARN_ON_ONCE(arg.err)) {
+ pr_err("cpu: %d ", cpu);
+ pr_tdx_error(TDH_VP_FLUSH, arg.err, NULL);
+ }
+}
+
+void tdx_hardware_disable(void)
+{
+ int cpu = raw_smp_processor_id();
+ struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+ struct tdx_flush_vp_arg arg;
+ struct vcpu_tdx *tdx, *tmp;
+ unsigned long flags;
+
+ lockdep_assert_preemption_disabled();
+
+ local_irq_save(flags);
+ /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+ list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
+ arg.vcpu = &tdx->vcpu;
+ tdx_flush_vp(&arg);
+ }
+ local_irq_restore(flags);
+}
+
static void tdx_do_tdh_phymem_cache_wb(void *unused)
{
u64 err = 0;
@@ -199,26 +319,31 @@ static void tdx_do_tdh_phymem_cache_wb(void *unused)
pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
}

-void tdx_mmu_release_hkid(struct kvm *kvm)
+static int __tdx_mmu_release_hkid(struct kvm *kvm)
{
bool packages_allocated, targets_allocated;
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
cpumask_var_t packages, targets;
+ struct kvm_vcpu *vcpu;
+ unsigned long j;
+ int i, ret = 0;
u64 err;
- int i;

if (!is_hkid_assigned(kvm_tdx))
- return;
+ return 0;

if (!is_td_created(kvm_tdx)) {
tdx_hkid_free(kvm_tdx);
- return;
+ return 0;
}

packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
cpus_read_lock();

+ kvm_for_each_vcpu(j, vcpu, kvm)
+ tdx_flush_vp_on_cpu(vcpu);
+
/*
* We can destroy multiple the guest TDs simultaneously. Prevent
* tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
@@ -236,6 +361,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
*/
write_lock(&kvm->mmu_lock);

+ err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
+ if (err == TDX_FLUSHVP_NOT_DONE) {
+ ret = -EBUSY;
+ goto out;
+ }
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+ pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
+ kvm_tdx->hkid);
+ ret = -EIO;
+ goto out;
+ }
+
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -258,14 +396,24 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
kvm_tdx->hkid);
+ ret = -EIO;
} else
tdx_hkid_free(kvm_tdx);

+out:
write_unlock(&kvm->mmu_lock);
mutex_unlock(&tdx_lock);
cpus_read_unlock();
free_cpumask_var(targets);
free_cpumask_var(packages);
+
+ return ret;
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
+ ;
}

void tdx_vm_free(struct kvm *kvm)
@@ -429,6 +577,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return 0;
}

+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->cpu == cpu)
+ return;
+
+ tdx_flush_vp_on_cpu(vcpu);
+
+ local_irq_disable();
+ /*
+ * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+ * vcpu->cpu is read before tdx->cpu_list.
+ */
+ smp_rmb();
+
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ local_irq_enable();
+}
+
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -469,6 +637,16 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
struct vcpu_tdx *tdx = to_tdx(vcpu);
int i;

+ /*
+ * When destroying the VM, kvm_unload_vcpu_mmu() calls vcpu_load() for every
+ * vcpu after they have already been disassociated from the per-CPU list by
+ * tdx_mmu_release_hkid(). So disassociate them again here; otherwise the
+ * freed vcpu data would be accessed by the later list_{del,add}() on the
+ * associated_tdvcpus list.
+ */
+ tdx_disassociate_vp_on_cpu(vcpu);
+ WARN_ON_ONCE(vcpu->cpu != -1);
+
/*
* This methods can be called when vcpu allocation/initialization
* failed. So it's possible that hkid, tdvpx and tdvpr are not assigned
@@ -1891,6 +2069,10 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
return -EINVAL;
}

+ /* tdx_hardware_disable() uses associated_tdvcpus. */
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
+
for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
/*
* Here it checks if MSRs (tdx_uret_msrs) can be saved/restored
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index c700792c08e2..4f803814126a 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -70,6 +70,8 @@ struct vcpu_tdx {
unsigned long tdvpr_pa;
unsigned long *tdvpx_pa;

+ struct list_head cpu_list;
+
union tdx_exit_reason exit_reason;

bool initialized;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 02da72dfb2cf..0b7437368e91 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -137,6 +137,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
+void tdx_hardware_disable(void);
bool tdx_is_vm_type_supported(unsigned long type);
int tdx_offline_cpu(void);

@@ -153,6 +154,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -164,6 +166,7 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
+static inline void tdx_hardware_disable(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
static inline int tdx_offline_cpu(void) { return 0; }

@@ -183,6 +186,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1

2023-10-16 16:27:19

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 068/116] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of the patch series for TD vcpu
exits, interrupts, and hypercalls.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 33e107bcb5cf..7a16fa284b6f 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -13,6 +13,7 @@ What qemu can do
- Qemu can create/destroy vcpu of TDX vm type.
- Qemu can populate initial guest memory image.
- Qemu can finalize guest TD.
+- Qemu can start to run vcpu. But vcpu can not make progress yet.

Patch Layer status
------------------
@@ -24,7 +25,7 @@ Patch Layer status
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
* TD finalization: Applied
-* TD vcpu enter/exit: Applying
+* TD vcpu enter/exit: Applied
* TD vcpu interrupts/exit/hypercall: Not yet

* KVM MMU GPA shared bits: Applied
--
2.25.1

2023-10-16 16:31:37

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 012/116] KVM: TDX: Retry SEAMCALL on the lack of entropy error

From: Isaku Yamahata <[email protected]>

Some SEAMCALLs may return the TDX_RND_NO_ENTROPY error when entropy is
lacking. Retry the SEAMCALL on that error, following rdrand_long(), up to
RDRAND_RETRY_LOOPS times.

Signed-off-by: Isaku Yamahata <[email protected]>
---
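Not part of the patch: the retry shape being adopted, as a standalone
sketch. do_seamcall() is a hypothetical stand-in for
__seamcall()/__seamcall_ret(); RDRAND_RETRY_LOOPS comes from
<asm/archrandom.h> (10 on x86).

static u64 example_seamcall_no_entropy_retry(u64 op, struct tdx_module_args *args)
{
	int retry = RDRAND_RETRY_LOOPS;
	u64 ret;

	do {
		ret = do_seamcall(op, args);	/* hypothetical helper */
	} while (unlikely(ret == TDX_RND_NO_ENTROPY) && --retry);

	/* Still TDX_RND_NO_ENTROPY here means entropy never became available. */
	return ret;
}
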
arch/x86/kvm/vmx/tdx_errno.h | 1 +
arch/x86/kvm/vmx/tdx_ops.h | 40 +++++++++++++++++++++---------------
2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index ec76740dc6a1..dbee050b2356 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -14,6 +14,7 @@
#define TDX_OPERAND_INVALID 0xC000010000000000ULL
#define TDX_OPERAND_BUSY 0x8000020000000000ULL
#define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL
+#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
#define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL
#define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL
#define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 12fd6b8d49e0..a55977626ae3 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -6,6 +6,7 @@

#include <linux/compiler.h>

+#include <asm/archrandom.h>
#include <asm/cacheflush.h>
#include <asm/asm.h>
#include <asm/kvm_host.h>
@@ -17,25 +18,30 @@
static inline u64 tdx_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_args *out)
{
+ int retry;
u64 ret;

- if (out) {
- *out = (struct tdx_module_args) {
- .rcx = rcx,
- .rdx = rdx,
- .r8 = r8,
- .r9 = r9,
- };
- ret = __seamcall_ret(op, out);
- } else {
- struct tdx_module_args args = {
- .rcx = rcx,
- .rdx = rdx,
- .r8 = r8,
- .r9 = r9,
- };
- ret = __seamcall(op, &args);
- }
+ /* Mimic the existing rdrand_long() to retry RDRAND_RETRY_LOOPS times. */
+ retry = RDRAND_RETRY_LOOPS;
+ do {
+ if (out) {
+ *out = (struct tdx_module_args) {
+ .rcx = rcx,
+ .rdx = rdx,
+ .r8 = r8,
+ .r9 = r9,
+ };
+ ret = __seamcall_ret(op, out);
+ } else {
+ struct tdx_module_args args = {
+ .rcx = rcx,
+ .rdx = rdx,
+ .r8 = r8,
+ .r9 = r9,
+ };
+ ret = __seamcall(op, &args);
+ }
+ } while (unlikely(ret == TDX_RND_NO_ENTROPY) && --retry);
if (unlikely(ret == TDX_SEAMCALL_UD)) {
/*
* SEAMCALLs fail with TDX_SEAMCALL_UD returned when VMX is off.
--
2.25.1

2023-10-16 16:31:40

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 016/116] x86/virt/tdx: Add a helper function to return system wide info about TDX module

From: Isaku Yamahata <[email protected]>

TDX KVM needs system-wide information about the TDX module, struct
tdsysinfo_struct. Add a helper function, tdx_get_sysinfo(), to return it
instead of having KVM fetch it with various error checks. Make KVM call
the function and stash the info. Move the struct definition to the common
header arch/x86/include/asm/tdx.h.

Signed-off-by: Isaku Yamahata <[email protected]>
---
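Not part of the patch: an illustrative KVM-side caller of the new helper.
example_cache_sysinfo() and tdx_caps are hypothetical; the only contract
assumed is the one added here, i.e. tdx_get_sysinfo() returns NULL until
the TDX module is initialized.

#include <linux/kernel.h>
#include <asm/tdx.h>

static const struct tdsysinfo_struct *tdx_caps;

static int example_cache_sysinfo(void)
{
	const struct tdsysinfo_struct *info = tdx_get_sysinfo();

	/* NULL: TDX is disabled or the TDX module isn't initialized yet. */
	if (!info)
		return -EOPNOTSUPP;

	pr_info("TDX module %u.%u with %u CPUID configs\n",
		info->major_version, info->minor_version,
		info->num_cpuid_config);

	tdx_caps = info;
	return 0;
}
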
arch/x86/include/asm/tdx.h | 59 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.c | 15 +++++++++-
arch/x86/virt/vmx/tdx/tdx.c | 20 +++++++++++--
arch/x86/virt/vmx/tdx/tdx.h | 51 --------------------------------
4 files changed, 91 insertions(+), 54 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e50129a1f13e..800ed1783b94 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -110,6 +110,62 @@ u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
#define seamcall_saved_ret(__fn, __args) \
SEAMCALL_NO_ENTROPY_RETRY(__seamcall_saved_ret, (__fn), (__args))

+struct tdx_cpuid_config {
+ __struct_group(tdx_cpuid_config_leaf, leaf_sub_leaf, __packed,
+ u32 leaf;
+ u32 sub_leaf;
+ );
+ __struct_group(tdx_cpuid_config_value, value, __packed,
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+ );
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE 1024
+#define TDSYSINFO_STRUCT_ALIGNMENT 1024
+
+/*
+ * The size of this structure itself is flexible. The actual structure
+ * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE bytes
+ * and TDSYSINFO_STRUCT_ALIGNMENT bytes aligned.
+ */
+struct tdsysinfo_struct {
+ /* TDX-SEAM Module Info */
+ u32 attributes;
+ u32 vendor_id;
+ u32 build_date;
+ u16 build_num;
+ u16 minor_version;
+ u16 major_version;
+ u8 reserved0[14];
+ /* Memory Info */
+ u16 max_tdmrs;
+ u16 max_reserved_per_tdmr;
+ u16 pamt_entry_size;
+ u8 reserved1[10];
+ /* Control Struct Info */
+ u16 tdcs_base_size;
+ u8 reserved2[2];
+ u16 tdvps_base_size;
+ u8 tdvps_xfam_dependent_size;
+ u8 reserved3[9];
+ /* TD Capabilities */
+ u64 attributes_fixed0;
+ u64 attributes_fixed1;
+ u64 xfam_fixed0;
+ u64 xfam_fixed1;
+ u8 reserved4[32];
+ u32 num_cpuid_config;
+ /*
+ * The actual number of CPUID_CONFIG depends on above
+ * 'num_cpuid_config'.
+ */
+ DECLARE_FLEX_ARRAY(struct tdx_cpuid_config, cpuid_configs);
+} __packed;
+
+const struct tdsysinfo_struct *tdx_get_sysinfo(void);
bool platform_tdx_enabled(void);
int tdx_cpu_enable(void);
int tdx_enable(void);
@@ -138,6 +194,9 @@ static inline u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args)
{
return TDX_SEAMCALL_UD;
}
+
+struct tdsysinfo_struct;
+static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9d3f593eacb8..b0e3409da5a8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -11,9 +11,18 @@
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#define TDX_MAX_NR_CPUID_CONFIGS \
+ ((TDSYSINFO_STRUCT_SIZE - \
+ offsetof(struct tdsysinfo_struct, cpuid_configs)) \
+ / sizeof(struct tdx_cpuid_config))
+
static int __init tdx_module_setup(void)
{
- int ret;
+ const struct tdsysinfo_struct *tdsysinfo;
+ int ret = 0;
+
+ BUILD_BUG_ON(sizeof(*tdsysinfo) > TDSYSINFO_STRUCT_SIZE);
+ BUILD_BUG_ON(TDX_MAX_NR_CPUID_CONFIGS != 37);

ret = tdx_enable();
if (ret) {
@@ -21,6 +30,10 @@ static int __init tdx_module_setup(void)
return ret;
}

+ /* Sanity check just in case. */
+ tdsysinfo = tdx_get_sysinfo();
+ WARN_ON(tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS);
+
return 0;
}

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5b8f2085d293..ec0eb9edfa21 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -284,6 +284,20 @@ static int get_tdx_sysinfo(struct tdsysinfo_struct *tdsysinfo,
return 0;
}

+static struct tdsysinfo_struct *tdsysinfo;
+
+const struct tdsysinfo_struct *tdx_get_sysinfo(void)
+{
+ const struct tdsysinfo_struct *r = NULL;
+
+ mutex_lock(&tdx_module_lock);
+ if (tdx_module_status == TDX_MODULE_INITIALIZED)
+ r = tdsysinfo;
+ mutex_unlock(&tdx_module_lock);
+ return r;
+}
+EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
+
/*
* Add a memory region as a TDX memory block. The caller must make sure
* all memory regions are added in address ascending order and don't
@@ -1109,7 +1123,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)

static int init_tdx_module(void)
{
- struct tdsysinfo_struct *tdsysinfo;
struct cmr_info *cmr_array;
int tdsysinfo_size;
int cmr_array_size;
@@ -1208,7 +1221,10 @@ static int init_tdx_module(void)
* For now both @sysinfo and @cmr_array are only used during
* module initialization, so always free them.
*/
- kfree(tdsysinfo);
+ if (ret) {
+ kfree(tdsysinfo);
+ tdsysinfo = NULL;
+ }
kfree(cmr_array);
return ret;

diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 5bcbfc2fc466..c37a54cff1fa 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -36,57 +36,6 @@ struct cmr_info {
#define MAX_CMRS 32
#define CMR_INFO_ARRAY_ALIGNMENT 512

-struct cpuid_config {
- u32 leaf;
- u32 sub_leaf;
- u32 eax;
- u32 ebx;
- u32 ecx;
- u32 edx;
-} __packed;
-
-#define TDSYSINFO_STRUCT_SIZE 1024
-#define TDSYSINFO_STRUCT_ALIGNMENT 1024
-
-/*
- * The size of this structure itself is flexible. The actual structure
- * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE bytes
- * and TDSYSINFO_STRUCT_ALIGNMENT bytes aligned.
- */
-struct tdsysinfo_struct {
- /* TDX-SEAM Module Info */
- u32 attributes;
- u32 vendor_id;
- u32 build_date;
- u16 build_num;
- u16 minor_version;
- u16 major_version;
- u8 reserved0[14];
- /* Memory Info */
- u16 max_tdmrs;
- u16 max_reserved_per_tdmr;
- u16 pamt_entry_size;
- u8 reserved1[10];
- /* Control Struct Info */
- u16 tdcs_base_size;
- u8 reserved2[2];
- u16 tdvps_base_size;
- u8 tdvps_xfam_dependent_size;
- u8 reserved3[9];
- /* TD Capabilities */
- u64 attributes_fixed0;
- u64 attributes_fixed1;
- u64 xfam_fixed0;
- u64 xfam_fixed1;
- u8 reserved4[32];
- u32 num_cpuid_config;
- /*
- * The actual number of CPUID_CONFIG depends on above
- * 'num_cpuid_config'.
- */
- DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
-} __packed;
-
struct tdmr_reserved_area {
u64 offset;
u64 size;
--
2.25.1

2023-10-16 16:31:47

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 019/116] KVM: x86, tdx: Make KVM_CAP_MAX_VCPUS backend specific

From: Isaku Yamahata <[email protected]>

TDX has its own limitation on the maximum number of vcpus that the guest
can accommodate. Allow the x86 KVM backend to implement its own
KVM_ENABLE_CAP handler and implement a TDX backend for KVM_CAP_MAX_VCPUS.
The user space VMM, e.g. qemu, can specify its value instead of
KVM_MAX_VCPUS.

Signed-off-by: Isaku Yamahata <[email protected]>
---
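Not part of the patch: an illustrative userspace caller. vm_fd is assumed
to be a TDX VM file descriptor; per the handler added below, the cap must
be enabled before the first KVM_CREATE_VCPU, otherwise -EBUSY is returned.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int example_limit_td_vcpus(int vm_fd, unsigned int nr_vcpus)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_MAX_VCPUS;
	cap.args[0] = nr_vcpus;	/* must be > 0 and <= TDX_MAX_VCPUS */

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}
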
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/main.c | 22 ++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.c | 30 ++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 3 +++
arch/x86/kvm/vmx/x86_ops.h | 5 +++++
arch/x86/kvm/x86.c | 4 ++++
7 files changed, 68 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8a5770b12546..214dcaa3b384 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -21,6 +21,8 @@ KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
+KVM_X86_OP_OPTIONAL(max_vcpus)
+KVM_X86_OP_OPTIONAL(vm_enable_cap)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4bbd2ec0e324..9dde92c8ee2f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1566,7 +1566,9 @@ struct kvm_x86_ops {
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

bool (*is_vm_type_supported)(unsigned long vm_type);
+ int (*max_vcpus)(struct kvm *kvm);
unsigned int vm_size;
+ int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index e9661954d250..5579557fa6f6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,7 @@
#include "nested.h"
#include "pmu.h"
#include "tdx.h"
+#include "tdx_arch.h"

static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);
@@ -16,6 +17,17 @@ static bool vt_is_vm_type_supported(unsigned long type)
(enable_tdx && tdx_is_vm_type_supported(type));
}

+static int vt_max_vcpus(struct kvm *kvm)
+{
+ if (!kvm)
+ return KVM_MAX_VCPUS;
+
+ if (is_td(kvm))
+ return min3(kvm->max_vcpus, KVM_MAX_VCPUS, TDX_MAX_VCPUS);
+
+ return kvm->max_vcpus;
+}
+
static int vt_hardware_enable(void)
{
int ret;
@@ -43,6 +55,14 @@ static __init int vt_hardware_setup(void)
return 0;
}

+static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ if (is_td(kvm))
+ return tdx_vm_enable_cap(kvm, cap);
+
+ return -EINVAL;
+}
+
static int vt_vm_init(struct kvm *kvm)
{
if (is_td(kvm))
@@ -80,7 +100,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.has_emulated_msr = vmx_has_emulated_msr,

.is_vm_type_supported = vt_is_vm_type_supported,
+ .max_vcpus = vt_max_vcpus,
.vm_size = sizeof(struct kvm_vmx),
+ .vm_enable_cap = vt_vm_enable_cap,
.vm_init = vt_vm_init,
.vm_destroy = vmx_vm_destroy,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f9e80582865d..331fbaa10d46 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -17,6 +17,36 @@
offsetof(struct tdsysinfo_struct, cpuid_configs)) \
/ sizeof(struct tdx_cpuid_config))

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ int r;
+
+ switch (cap->cap) {
+ case KVM_CAP_MAX_VCPUS: {
+ if (cap->flags || cap->args[0] == 0)
+ return -EINVAL;
+ if (cap->args[0] > KVM_MAX_VCPUS)
+ return -E2BIG;
+ if (cap->args[0] > TDX_MAX_VCPUS)
+ return -E2BIG;
+
+ mutex_lock(&kvm->lock);
+ if (kvm->created_vcpus)
+ r = -EBUSY;
+ else {
+ kvm->max_vcpus = cap->args[0];
+ r = 0;
+ }
+ mutex_unlock(&kvm->lock);
+ break;
+ }
+ default:
+ r = -EINVAL;
+ break;
+ }
+ return r;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 473013265bd8..22c0b57f69ca 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -3,6 +3,9 @@
#define __KVM_X86_TDX_H

#ifdef CONFIG_INTEL_TDX_HOST
+
+#include "tdx_ops.h"
+
struct kvm_tdx {
struct kvm kvm;
/* TDX specific members follow. */
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 601aca7a011e..c3fc3148dcf3 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -138,11 +138,16 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
bool tdx_is_vm_type_supported(unsigned long type);

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }

+static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ return -EINVAL;
+}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1824fd257f41..a1354274ffe4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4586,6 +4586,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
break;
case KVM_CAP_MAX_VCPUS:
r = KVM_MAX_VCPUS;
+ if (kvm_x86_ops.max_vcpus)
+ r = static_call(kvm_x86_max_vcpus)(kvm);
break;
case KVM_CAP_MAX_VCPU_ID:
r = KVM_MAX_VCPU_IDS;
@@ -6518,6 +6520,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
break;
default:
r = -EINVAL;
+ if (kvm_x86_ops.vm_enable_cap)
+ r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
break;
}
return r;
--
2.25.1

2023-10-16 16:31:50

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 018/116] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters

From: Sean Christopherson <[email protected]>

Implement an ioctl to get system-wide parameters for TDX. Although the
function is system-wide, the VM-scoped mem_enc ioctl works for a userspace
VMM like qemu and a device-scoped version is not defined, so re-use the
VM-scoped mem_enc ioctl.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v14 -> v15:
- ABI change: added supported_gpaw and reserved area,
---
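Not part of the patch: an illustrative userspace caller, assuming the
updated uapi header and that the sub-command is reached through the
VM-scoped KVM_MEMORY_ENCRYPT_OP ioctl. NR_CPUID_CONFIGS is a caller-chosen
upper bound; the kernel rewrites nr_cpuid_configs with the real count on
success.

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define NR_CPUID_CONFIGS 64

static struct kvm_tdx_capabilities *example_tdx_query_caps(int vm_fd)
{
	struct kvm_tdx_capabilities *caps;
	struct kvm_tdx_cmd cmd;

	caps = calloc(1, sizeof(*caps) +
			 NR_CPUID_CONFIGS * sizeof(caps->cpuid_configs[0]));
	if (!caps)
		return NULL;
	caps->nr_cpuid_configs = NR_CPUID_CONFIGS;

	memset(&cmd, 0, sizeof(cmd));
	cmd.id = KVM_TDX_CAPABILITIES;
	cmd.data = (__u64)(unsigned long)caps;

	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0) {
		free(caps);
		return NULL;
	}
	return caps;
}
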
arch/x86/include/uapi/asm/kvm.h | 24 ++++++++++
arch/x86/kvm/vmx/tdx.c | 64 +++++++++++++++++++++++++++
tools/arch/x86/include/uapi/asm/kvm.h | 52 ++++++++++++++++++++++
3 files changed, 140 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 615fb60b3717..3fbd43d5177b 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -593,4 +593,28 @@ struct kvm_tdx_cmd {
__u64 error;
};

+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+#define TDX_CAP_GPAW_48 (1 << 0)
+#define TDX_CAP_GPAW_52 (1 << 1)
+ __u32 supported_gpaw;
+ __u32 padding;
+ __u64 reserved[251];
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[];
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ead229e34813..f9e80582865d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
#include "capabilities.h"
#include "x86_ops.h"
#include "x86.h"
+#include "mmu.h"
#include "tdx.h"

#undef pr_fmt
@@ -16,6 +17,66 @@
offsetof(struct tdsysinfo_struct, cpuid_configs)) \
/ sizeof(struct tdx_cpuid_config))

+static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx_capabilities __user *user_caps;
+ const struct tdsysinfo_struct *tdsysinfo;
+ struct kvm_tdx_capabilities *caps = NULL;
+ int ret = 0;
+
+ BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+ sizeof(struct tdx_cpuid_config));
+
+ if (cmd->flags)
+ return -EINVAL;
+
+ tdsysinfo = tdx_get_sysinfo();
+ if (!tdsysinfo)
+ return -EOPNOTSUPP;
+
+ caps = kmalloc(sizeof(*caps), GFP_KERNEL);
+ if (!caps)
+ return -ENOMEM;
+
+ user_caps = (void __user *)cmd->data;
+ if (copy_from_user(caps, user_caps, sizeof(*caps))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ if (caps->nr_cpuid_configs < tdsysinfo->num_cpuid_config) {
+ ret = -E2BIG;
+ goto out;
+ }
+
+ *caps = (struct kvm_tdx_capabilities) {
+ .attrs_fixed0 = tdsysinfo->attributes_fixed0,
+ .attrs_fixed1 = tdsysinfo->attributes_fixed1,
+ .xfam_fixed0 = tdsysinfo->xfam_fixed0,
+ .xfam_fixed1 = tdsysinfo->xfam_fixed1,
+ .supported_gpaw = TDX_CAP_GPAW_48 |
+ ((kvm_get_shadow_phys_bits() >= 52 &&
+ cpu_has_vmx_ept_5levels()) ? TDX_CAP_GPAW_52 : 0),
+ .nr_cpuid_configs = tdsysinfo->num_cpuid_config,
+ .padding = 0,
+ };
+
+ if (copy_to_user(user_caps, caps, sizeof(*caps))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (copy_to_user(user_caps->cpuid_configs, &tdsysinfo->cpuid_configs,
+ tdsysinfo->num_cpuid_config *
+ sizeof(struct tdx_cpuid_config))) {
+ ret = -EFAULT;
+ }
+
+out:
+ /* kfree() accepts NULL. */
+ kfree(caps);
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -29,6 +90,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
mutex_lock(&kvm->lock);

switch (tdx_cmd.id) {
+ case KVM_TDX_CAPABILITIES:
+ r = tdx_get_capabilities(&tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..7a08723e99e2 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,56 @@ struct kvm_pmu_event_filter {
/* x86-specific KVM_EXIT_HYPERCALL flags. */
#define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)

+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+ /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+ __u64 unused;
+};
+
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+#define TDX_CAP_GPAW_48 (1 << 0)
+#define TDX_CAP_GPAW_52 (1 << 1)
+ __u32 supported_gpaw;
+ __u32 padding;
+ __u64 reserved[251];
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[];
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1

2023-10-16 16:31:51

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 080/116] KVM: VMX: Move NMI/exception handler to common helper

From: Sean Christopherson <[email protected]>

TDX handles the NMI/exception exit mostly the same as the VMX case. The
difference is how the exit qualification is retrieved. To share the code
with TDX, move the NMI/exception handlers to a common header, common.h.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 59 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 68 +++++----------------------------------
2 files changed, 67 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 6f21d0d48809..632af7a76d0a 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,8 +4,67 @@

#include <linux/kvm_host.h>

+#include <asm/traps.h>
+
#include "posted_intr.h"
#include "mmu.h"
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_irqoff(unsigned long entry);
+void vmx_do_nmi_irqoff(void);
+
+static inline void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Save xfd_err to guest_fpu before interrupt is enabled, so the
+ * MSR value is not clobbered by the host activity before the guest
+ * has chance to consume it.
+ *
+ * Do not blindly read xfd_err here, since this exception might
+ * be caused by L1 interception on a platform which doesn't
+ * support xfd at all.
+ *
+ * Do it conditionally upon guest_fpu::xfd. xfd_err matters
+ * only when xfd contains a non-zero value.
+ *
+ * Queuing exception is done in vmx_handle_exit. See comment there.
+ */
+ if (vcpu->arch.guest_fpu.fpstate->xfd)
+ rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+}
+
+static inline void vmx_handle_exception_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ /* if exit due to PF check for async PF */
+ if (is_page_fault(intr_info))
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* if exit due to NM, handle before interrupts are enabled */
+ else if (is_nm_fault(intr_info))
+ vmx_handle_nm_fault_irqoff(vcpu);
+ /* Handle machine checks before interrupts are enabled */
+ else if (is_machine_check(intr_info))
+ kvm_machine_check();
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+ gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+ if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+ "unexpected VM-Exit interrupt info: 0x%x", intr_info))
+ return;
+
+ kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
+ vmx_do_interrupt_irqoff(gate_offset(desc));
+ kvm_after_interrupt(vcpu);
+
+ vcpu->arch.at_instruction_boundary = true;
+}

static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
unsigned long exit_qualification)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b1c36525a81a..18b374b3f940 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -517,7 +517,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
vmx->segment_cache.bitmask = 0;
}

-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;

#if IS_ENABLED(CONFIG_HYPERV)
static bool __read_mostly enlightened_vmcs = true;
@@ -4277,7 +4277,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */

- vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */
+ vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base); /* 22.2.4 */

vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */

@@ -5167,7 +5167,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
intr_info = vmx_get_intr_info(vcpu);

/*
- * Machine checks are handled by handle_exception_irqoff(), or by
+ * Machine checks are handled by vmx_handle_exception_irqoff(), or by
* vmx_vcpu_run() if a #MC occurs on VM-Entry. NMIs are handled by
* vmx_vcpu_enter_exit().
*/
@@ -5175,7 +5175,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
return 1;

/*
- * Queue the exception here instead of in handle_nm_fault_irqoff().
+ * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
* This ensures the nested_vmx check is not skipped so vmexit can
* be reflected to L1 (when it intercepts #NM) before reaching this
* point.
@@ -6890,59 +6890,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}

-void vmx_do_interrupt_irqoff(unsigned long entry);
-void vmx_do_nmi_irqoff(void);
-
-static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
-{
- /*
- * Save xfd_err to guest_fpu before interrupt is enabled, so the
- * MSR value is not clobbered by the host activity before the guest
- * has chance to consume it.
- *
- * Do not blindly read xfd_err here, since this exception might
- * be caused by L1 interception on a platform which doesn't
- * support xfd at all.
- *
- * Do it conditionally upon guest_fpu::xfd. xfd_err matters
- * only when xfd contains a non-zero value.
- *
- * Queuing exception is done in vmx_handle_exit. See comment there.
- */
- if (vcpu->arch.guest_fpu.fpstate->xfd)
- rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
-}
-
-static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
- /* if exit due to PF check for async PF */
- if (is_page_fault(intr_info))
- vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
- /* if exit due to NM, handle before interrupts are enabled */
- else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(vcpu);
- /* Handle machine checks before interrupts are enabled */
- else if (is_machine_check(intr_info))
- kvm_machine_check();
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
- u32 intr_info)
-{
- unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
- gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
- if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
- "unexpected VM-Exit interrupt info: 0x%x", intr_info))
- return;
-
- kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- vmx_do_interrupt_irqoff(gate_offset(desc));
- kvm_after_interrupt(vcpu);
-
- vcpu->arch.at_instruction_boundary = true;
-}
-
void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6951,9 +6898,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;

if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
}

/*
@@ -8244,7 +8192,7 @@ __init int vmx_hardware_setup(void)
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
store_idt(&dt);
- host_idt_base = dt.address;
+ vmx_host_idt_base = dt.address;

vmx_setup_user_return_msrs();

--
2.25.1

2023-10-16 16:31:52

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 115/116] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

From: Isaku Yamahata <[email protected]>

Implement a hook for KVM_SET_CPUID2 to do an additional consistency check.

Intel TDX and AMD SEV place restrictions on CPUID values. For example, some
values must be the same across all vcpus. Check that the new values are
consistent with the old values. The check is light because CPUID consistency
is very model specific and complicated. The user space VMM should set CPUID
and MSRs consistently.

Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
---
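Not part of the patch: the core of the per-register check, spelled out.
config_mask is the directly-configurable bit mask the TDX module reports
for a given CPUID leaf/sub-leaf; bits outside the mask were masked off
before TDH.MNG.INIT, so only changes inside the mask are rejected.
example_cpuid_reg_is_consistent() is hypothetical.

#include <linux/types.h>

static bool example_cpuid_reg_is_consistent(u32 old_val, u32 new_val, u32 config_mask)
{
	/* XOR yields the changed bits; any changed configurable bit fails. */
	return !((old_val ^ new_val) & config_mask);
}
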
arch/x86/kvm/vmx/main.c | 10 ++++++
arch/x86/kvm/vmx/tdx.c | 72 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.h | 7 ++++
arch/x86/kvm/vmx/x86_ops.h | 4 +++
4 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5d61f5306ae3..c5b01053e402 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -443,6 +443,15 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
kvm_vcpu_deliver_init(vcpu);
}

+static int vt_vcpu_check_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid_entry2 *e2, int nent)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_check_cpuid(vcpu, e2, nent);
+
+ return 0;
+}
+
static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -1103,6 +1112,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.get_exit_info = vt_get_exit_info,

+ .vcpu_check_cpuid = vt_vcpu_check_cpuid,
.vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,

.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3287701199dd..70eba1b984e1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -514,6 +514,9 @@ void tdx_vm_free(struct kvm *kvm)

free_page((unsigned long)__va(kvm_tdx->tdr_pa));
kvm_tdx->tdr_pa = 0;
+
+ kfree(kvm_tdx->cpuid);
+ kvm_tdx->cpuid = NULL;
}

static int tdx_do_tdh_mng_key_config(void *param)
@@ -635,6 +638,44 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return 0;
}

+int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2, int nent)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ const struct tdsysinfo_struct *tdsysinfo;
+ int i;
+
+ tdsysinfo = tdx_get_sysinfo();
+ if (!tdsysinfo)
+ return -EOPNOTSUPP;
+
+ /*
+ * Simple check that new cpuid is consistent with created one.
+ * For simplicity, only trivial check. Don't try comprehensive checks
+ * with the cpuid virtualization table in the TDX module spec.
+ */
+ for (i = 0; i < tdsysinfo->num_cpuid_config; i++) {
+ const struct tdx_cpuid_config *config = &tdsysinfo->cpuid_configs[i];
+ u32 index = config->sub_leaf == TDX_CPUID_NO_SUBLEAF ? 0 : config->sub_leaf;
+ const struct kvm_cpuid_entry2 *old =
+ kvm_find_cpuid_entry2(kvm_tdx->cpuid, kvm_tdx->cpuid_nent,
+ config->leaf, index);
+ const struct kvm_cpuid_entry2 *new = kvm_find_cpuid_entry2(e2, nent,
+ config->leaf, index);
+
+ if (!!old != !!new)
+ return -EINVAL;
+ if (!old && !new)
+ continue;
+
+ if ((old->eax ^ new->eax) & config->eax ||
+ (old->ebx ^ new->ebx) & config->ebx ||
+ (old->ecx ^ new->ecx) & config->ecx ||
+ (old->edx ^ new->edx) & config->edx)
+ return -EINVAL;
+ }
+ return 0;
+}
+
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -2211,10 +2252,12 @@ static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
return 0;
}

-static void setup_tdparams_cpuids(const struct tdsysinfo_struct *tdsysinfo,
+static void setup_tdparams_cpuids(struct kvm *kvm,
+ const struct tdsysinfo_struct *tdsysinfo,
struct kvm_cpuid2 *cpuid,
struct td_params *td_params)
{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
int i;

/*
@@ -2222,6 +2265,7 @@ static void setup_tdparams_cpuids(const struct tdsysinfo_struct *tdsysinfo,
* be same to the one of struct tdsysinfo.{num_cpuid_config, cpuid_configs}
* It's assumed that td_params was zeroed.
*/
+ kvm_tdx->cpuid_nent = 0;
for (i = 0; i < tdsysinfo->num_cpuid_config; i++) {
const struct tdx_cpuid_config *config = &tdsysinfo->cpuid_configs[i];
/* TDX_CPUID_NO_SUBLEAF in TDX CPUID_CONFIG means index = 0. */
@@ -2244,6 +2288,10 @@ static void setup_tdparams_cpuids(const struct tdsysinfo_struct *tdsysinfo,
value->ebx = entry->ebx & config->ebx;
value->ecx = entry->ecx & config->ecx;
value->edx = entry->edx & config->edx;
+
+ /* Remember the setting to check for KVM_SET_CPUID2. */
+ kvm_tdx->cpuid[kvm_tdx->cpuid_nent] = *entry;
+ kvm_tdx->cpuid_nent++;
}
}

@@ -2331,7 +2379,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
ret = setup_tdparams_eptp_controls(cpuid, td_params);
if (ret)
return ret;
- setup_tdparams_cpuids(tdsysinfo, cpuid, td_params);
+ setup_tdparams_cpuids(kvm, tdsysinfo, cpuid, td_params);
ret = setup_tdparams_xfam(cpuid, td_params);
if (ret)
return ret;
@@ -2546,11 +2594,18 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
if (cmd->flags)
return -EINVAL;

- init_vm = kzalloc(sizeof(*init_vm) +
- sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
- GFP_KERNEL);
- if (!init_vm)
+ WARN_ON_ONCE(kvm_tdx->cpuid);
+ kvm_tdx->cpuid = kzalloc(flex_array_size(init_vm, cpuid.entries, KVM_MAX_CPUID_ENTRIES),
+ GFP_KERNEL);
+ if (!kvm_tdx->cpuid)
return -ENOMEM;
+
+ init_vm = kzalloc(struct_size(init_vm, cpuid.entries, KVM_MAX_CPUID_ENTRIES),
+ GFP_KERNEL);
+ if (!init_vm) {
+ ret = -ENOMEM;
+ goto out;
+ }
if (copy_from_user(init_vm, (void __user *)cmd->data, sizeof(*init_vm))) {
ret = -EFAULT;
goto out;
@@ -2600,6 +2655,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)

out:
/* kfree() accepts NULL. */
+ if (ret) {
+ kfree(kvm_tdx->cpuid);
+ kvm_tdx->cpuid = NULL;
+ kvm_tdx->cpuid_nent = 0;
+ }
kfree(init_vm);
kfree(td_params);
return ret;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 4ddcf804c0a4..54c3f6b83571 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -32,6 +32,13 @@ struct kvm_tdx {
atomic_t tdh_mem_track;

u64 tsc_offset;
+
+ /*
+ * For KVM_SET_CPUID to check consistency. Remember the one passed to
+ * TDH.MNG.INIT.
+ */
+ int cpuid_nent;
+ struct kvm_cpuid_entry2 *cpuid;
};

union tdx_exit_reason {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7c377b7d76b9..ae2e04a3ab27 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -162,6 +162,8 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
+int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
+ int nent);
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
@@ -214,6 +216,8 @@ static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)

static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
+static inline int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
+ int nent) { return -EOPNOTSUPP; }
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
--
2.25.1

2023-10-16 16:31:55

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 082/116] KVM: TDX: Add a placeholder to handle TDX VM exit

From: Isaku Yamahata <[email protected]>

Wire up the handle_exit and handle_exit_irqoff methods and add a
placeholder to handle VM exits. Add helper functions to get the exit info,
exit qualification, etc.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 37 ++++++++++++-
arch/x86/kvm/vmx/tdx.c | 110 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 10 ++++
3 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 91e2ad643a47..4356b10d68b6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -239,6 +239,25 @@ static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
return tdx_protected_apic_has_interrupt(vcpu);
}

+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit(vcpu, fastpath);
+
+ return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_handle_exit_irqoff(vcpu);
+ return;
+ }
+
+ vmx_handle_exit_irqoff(vcpu);
+}
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -445,6 +464,18 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
vmx_request_immediate_exit(vcpu);
}

+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+ error_code);
+ return;
+ }
+
+ vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -540,7 +571,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.vcpu_pre_run = vt_vcpu_pre_run,
.vcpu_run = vt_vcpu_run,
- .handle_exit = vmx_handle_exit,
+ .handle_exit = vt_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
@@ -575,7 +606,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vt_get_mt_mask,

- .get_exit_info = vmx_get_exit_info,
+ .get_exit_info = vt_get_exit_info,

.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,

@@ -589,7 +620,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.load_mmu_pgd = vt_load_mmu_pgd,

.check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
+ .handle_exit_irqoff = vt_handle_exit_irqoff,

.request_immediate_exit = vt_request_immediate_exit,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f5e29f8f1b6a..be73b8322a66 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -87,6 +87,26 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
}

+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rcx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rdx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+ return kvm_r8_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+ return kvm_r9_read(vcpu);
+}
+
static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
{
return tdx->tdvpr_pa;
@@ -811,6 +831,12 @@ static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
WARN_ON_ONCE(!kvm_rebooting &&
(tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);

+ if ((u16)tdx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
+ is_nmi(tdexit_intr_info(vcpu))) {
+ kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
+ vmx_do_nmi_irqoff();
+ kvm_after_interrupt(vcpu);
+ }
guest_state_exit_irqoff();
}

@@ -854,6 +880,25 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
}

+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u16 exit_reason = tdx->exit_reason.basic;
+
+ if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ tdexit_intr_info(vcpu));
+ else if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+ vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->mmio_needed = 0;
+ return 0;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1234,6 +1279,71 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}

+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
+{
+ union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+ /* See the comment of tdh_sept_seamcall(). */
+ if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)))
+ return 1;
+
+ /*
+ * TDH.VP.ENTER checks the TD EPOCH, which contends with TDH.MEM.TRACK and
+ * other vcpus' TDH.VP.ENTER.
+ */
+ if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_TD_EPOCH)))
+ return 1;
+
+ if (unlikely(exit_reason.full == TDX_SEAMCALL_UD)) {
+ kvm_spurious_fault();
+ /*
+ * In the case of reboot or kexec, loop with TDH.VP.ENTER and
+ * TDX_SEAMCALL_UD to avoid unnecessary activity.
+ */
+ return 1;
+ }
+
+ if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+ if (unlikely(exit_reason.basic == EXIT_REASON_TRIPLE_FAULT))
+ return tdx_handle_triple_fault(vcpu);
+
+ kvm_pr_unimpl("TD exit 0x%llx, %d hkid 0x%x hkid pa 0x%llx\n",
+ exit_reason.full, exit_reason.basic,
+ to_kvm_tdx(vcpu->kvm)->hkid,
+ set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
+ goto unhandled_exit;
+ }
+
+ WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+ switch (exit_reason.basic) {
+ default:
+ break;
+ }
+
+unhandled_exit:
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+ vcpu->run->internal.ndata = 2;
+ vcpu->run->internal.data[0] = exit_reason.full;
+ vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
+ return 0;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ *reason = tdx->exit_reason.full;
+
+ *info1 = tdexit_exit_qual(vcpu);
+ *info2 = tdexit_ext_exit_qual(vcpu);
+
+ *intr_info = tdexit_intr_info(vcpu);
+ *error_code = 0;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 9a949273bf47..cf12b94eb35c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -155,11 +155,16 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -192,11 +197,16 @@ static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
+static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath) { return 0; }
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
+ u64 *info2, u32 *intr_info, u32 *error_code) {}

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

--
2.25.1

2023-10-16 16:31:59

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 116/116] [MARKER] the end of (the first phase of) TDX KVM patch series

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the end of (the first phase of) the patch
series of TDX KVM support.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/index.rst | 1 -
.../virt/kvm/intel-tdx-layer-status.rst | 33 -------------------
2 files changed, 34 deletions(-)
delete mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst

diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index ccff56dca2b1..5e78a8fc2fbd 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -20,4 +20,3 @@ KVM
halt-polling
review-checklist

- intel-tdx-layer-status
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
deleted file mode 100644
index 7a16fa284b6f..000000000000
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================================
-Intel Trust Dodmain Extensions(TDX)
-===================================
-
-Layer status
-============
-What qemu can do
-----------------
-- TDX VM TYPE is exposed to Qemu.
-- Qemu can create/destroy guest of TDX vm type.
-- Qemu can create/destroy vcpu of TDX vm type.
-- Qemu can populate initial guest memory image.
-- Qemu can finalize guest TD.
-- Qemu can start to run vcpu. But vcpu can not make progress yet.
-
-Patch Layer status
-------------------
- Patch layer Status
-
-* TDX, VMX coexistence: Applied
-* TDX architectural definitions: Applied
-* TD VM creation/destruction: Applied
-* TD vcpu creation/destruction: Applied
-* TDX EPT violation: Applied
-* TD finalization: Applied
-* TD vcpu enter/exit: Applied
-* TD vcpu interrupts/exit/hypercall: Not yet
-
-* KVM MMU GPA shared bits: Applied
-* KVM TDP refactoring for TDX: Applied
-* KVM TDP MMU hooks: Applied
--
2.25.1

2023-10-16 16:32:03

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 097/116] KVM: TDX: Handle MSR MTRRCap and MTRRDefType access

From: Isaku Yamahata <[email protected]>

Handle the read-only MTRRCap MSR by reporting that all features are
unsupported, and handle the MTRRDefType MSR by accepting only E=1, FE=0,
type=writeback: MTRR enabled, fixed range MTRRs disabled, default memory
type writeback.

TDX virtualizes CPUID to report MTRR support to the guest TD, but it also
enforces guest CR0.CD=0; if the guest tries to set CR0.CD=1, it gets a #GP.
Updating MTRRs requires setting CR0.CD=1 (plus other cache flushing
operations), so a guest TD can't actually update MTRRs. Therefore
virtualize MTRRs with all features disabled and the default memory type
set to writeback.
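
For illustration, a minimal sketch of the guest-visible ABI this patch
enforces (values taken from the handlers below; the macro name is made up
for this example and is not part of the patch):

    /* RDMSR(MSR_MTRRcap) returns 0: SMRR = 0, WC = 0, FIX = 0, VCNT = 0. */
    /*
     * The only accepted WRMSR(MSR_MTRRdefType) value is E=1, FE=0,
     * default type = write-back:
     */
    #define TDX_MTRRDEFTYPE_ONLY_VALUE ((1 << 11) | MTRR_TYPE_WRBACK) /* 0x806 */

Any other MTRRDefType value is rejected by tdx_set_msr() below.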

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 99 ++++++++++++++++++++++++++++++++++--------
1 file changed, 82 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 41f7ad98cab5..726e28f30354 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -580,18 +580,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

- /*
- * TDX enforces CR0.CD = 0 and KVM MTRR emulation enforces writeback.
- * TODO: implement MTRR MSR emulation so that
- * MTRRCap: SMRR=0: SMRR interface unsupported
- * WC=0: write combining unsupported
- * FIX=0: Fixed range registers unsupported
- * VCNT=0: number of variable range regitsers = 0
- * MTRRDefType: E=1, FE=0, type=writeback only. Don't allow other value.
- * E=1: enable MTRR
- * FE=0: disable fixed range MTRRs
- * type: default memory type=writeback
- */
+ /* TDX enforces CR0.CD = 0 and KVM MTRR emulation enforces writeback. */
return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
}

@@ -1930,7 +1919,9 @@ bool tdx_has_emulated_msr(u32 index, bool write)
case MSR_IA32_UCODE_REV:
case MSR_IA32_ARCH_CAPABILITIES:
case MSR_IA32_POWER_CTL:
+ case MSR_MTRRcap:
case MSR_IA32_CR_PAT:
+ case MSR_MTRRdefType:
case MSR_IA32_TSC_DEADLINE:
case MSR_IA32_MISC_ENABLE:
case MSR_PLATFORM_INFO:
@@ -1972,16 +1963,47 @@ bool tdx_has_emulated_msr(u32 index, bool write)

int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
- if (tdx_has_emulated_msr(msr->index, false))
- return kvm_get_msr_common(vcpu, msr);
- return 1;
+ switch (msr->index) {
+ case MSR_MTRRcap:
+ /*
+ * Override kvm_mtrr_get_msr() which hardcodes the value.
+ * Report SMRR = 0, WC = 0, FIX = 0, VCNT = 0 to disable MTRR
+ * effectively.
+ */
+ msr->data = 0;
+ return 0;
+ default:
+ if (tdx_has_emulated_msr(msr->index, false))
+ return kvm_get_msr_common(vcpu, msr);
+ return 1;
+ }
}

int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
- if (tdx_has_emulated_msr(msr->index, true))
+ switch (msr->index) {
+ case MSR_MTRRdefType:
+ /*
+ * Allow writeback only for all memory.
+ * Because it's reported that fixed range MTRR isn't supported
+ * and VCNT=0, enforce MTRRDefType.FE = 0 and don't care
+ * variable range MTRRs. Only default memory type matters.
+ *
+ * bit 11 E: MTRR enable/disable
+ * bit 10 FE: Fixed-range MTRRs enable/disable
+ * (E, FE) = (1, 1): enable MTRR and Fixed range MTRR
+ * (E, FE) = (1, 0): enable MTRR, disable Fixed range MTRR
+ * (E, FE) = (0, *): disable all MTRRs. all physical memory
+ * is UC
+ */
+ if (msr->data != ((1 << 11) | MTRR_TYPE_WRBACK))
+ return 1;
return kvm_set_msr_common(vcpu, msr);
- return 1;
+ default:
+ if (tdx_has_emulated_msr(msr->index, true))
+ return kvm_set_msr_common(vcpu, msr);
+ return 1;
+ }
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -2730,6 +2752,45 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
return ret;
}

+static int tdx_vcpu_init_mtrr(struct kvm_vcpu *vcpu)
+{
+ struct msr_data msr;
+ int ret;
+ int i;
+
+ /*
+ * To avoid confusion with reporting VCNT = 0, explicitly disable
+ * variable-range registers.
+ */
+ for (i = 0; i < KVM_NR_VAR_MTRR; i++) {
+ /* phymask */
+ msr = (struct msr_data) {
+ .host_initiated = true,
+ .index = 0x200 + 2 * i + 1,
+ .data = 0, /* valid = 0 to disable. */
+ };
+ ret = kvm_set_msr_common(vcpu, &msr);
+ if (ret)
+ return -EINVAL;
+ }
+
+ /* Set MTRR to use writeback on reset. */
+ msr = (struct msr_data) {
+ .host_initiated = true,
+ .index = MSR_MTRRdefType,
+ /*
+ * Set E(enable MTRR)=1, FE(enable fixed range MTRR)=0, default
+ * type=writeback on reset to avoid UC. Note E=0 means all
+ * memory is UC.
+ */
+ .data = (1 << 11) | MTRR_TYPE_WRBACK,
+ };
+ ret = kvm_set_msr_common(vcpu, &msr);
+ if (ret)
+ return -EINVAL;
+ return 0;
+}
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
{
struct msr_data apic_base_msr;
@@ -2767,6 +2828,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
if (kvm_set_apic_base(vcpu, &apic_base_msr))
return -EINVAL;

+ ret = tdx_vcpu_init_mtrr(vcpu);
+ if (ret)
+ return ret;
+
ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
if (ret)
return ret;
--
2.25.1

2023-10-16 16:32:08

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 103/116] KVM: TDX: Add methods to ignore guest instruction emulation

From: Isaku Yamahata <[email protected]>

Because TDX protects TDX guest state from the VMM, instructions in guest
memory cannot be fetched and emulated. Implement methods so that the guest
instruction emulator is never invoked for a TD.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 582fea1b60b6..5e4091dbf1e8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -331,6 +331,30 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_can_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
+{
+ /*
+ * This call back is triggered by the x86 instruction emulator. TDX
+ * doesn't allow guest memory inspection.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return X86EMUL_UNHANDLEABLE;
+
+ return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -954,7 +978,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.load_mmu_pgd = vt_load_mmu_pgd,

- .check_intercept = vmx_check_intercept,
+ .check_intercept = vt_check_intercept,
.handle_exit_irqoff = vt_handle_exit_irqoff,

.request_immediate_exit = vt_request_immediate_exit,
@@ -983,7 +1007,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.enable_smi_window = vt_enable_smi_window,
#endif

- .can_emulate_instruction = vmx_can_emulate_instruction,
+ .can_emulate_instruction = vt_can_emulate_instruction,
.apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,

--
2.25.1

2023-10-16 16:32:30

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 094/116] KVM: TDX: Handle TDX PV MMIO hypercall

From: Sean Christopherson <[email protected]>

Export kvm_io_bus_read() and the kvm_mmio tracepoint, and wire up the TDX
PV MMIO hypercall to the KVM backend functions.

kvm_io_bus_read/write() looks up the in-kernel KVM device emulated at the
given MMIO address and emulates the MMIO access. TDX PV MMIO needs the
same lookup, so export kvm_io_bus_read(); kvm_io_bus_write() is already
exported. Because TDX PV MMIO emulates some MMIO in the kernel itself,
also export the kvm_mmio tracepoint so that tracing stays consistent with
the rest of x86 KVM.
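
For reference, a sketch of the argument layout consumed by
tdx_emulate_mmio() below (register names per the tdvmcall a0-a3 accessors
used in this series; the GHCI calls this leaf <#VE.RequestMMIO>):

    /*
     * R12 (a0) = access size: 1, 2, 4 or 8 bytes
     * R13 (a1) = direction: 0 = read, 1 = write
     * R14 (a2) = MMIO GPA (KVM strips the shared bit)
     * R15 (a3) = data to write (writes only)
     * On completion: R10 = return code, R11 = value read (reads only)
     */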

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
virt/kvm/kvm_main.c | 2 +
3 files changed, 117 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index fd6907d5ba9d..24800d1e597d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1195,6 +1195,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
return ret;
}

+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+ unsigned long val = 0;
+ gpa_t gpa;
+ int size;
+
+ KVM_BUG_ON(vcpu->mmio_needed != 1, vcpu->kvm);
+ vcpu->mmio_needed = 0;
+
+ if (!vcpu->mmio_is_write) {
+ gpa = vcpu->mmio_fragments[0].gpa;
+ size = vcpu->mmio_fragments[0].len;
+
+ memcpy(&val, vcpu->run->mmio.data, size);
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ }
+ return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+ unsigned long val)
+{
+ if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+ return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+ unsigned long val;
+
+ if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+ struct kvm_memory_slot *slot;
+ int size, write, r;
+ unsigned long val;
+ gpa_t gpa;
+
+ KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
+
+ size = tdvmcall_a0_read(vcpu);
+ write = tdvmcall_a1_read(vcpu);
+ gpa = tdvmcall_a2_read(vcpu);
+ val = write ? tdvmcall_a3_read(vcpu) : 0;
+
+ if (size != 1 && size != 2 && size != 4 && size != 8)
+ goto error;
+ if (write != 0 && write != 1)
+ goto error;
+
+ /* Strip the shared bit, allow MMIO with and without it set. */
+ gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
+
+ if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
+ goto error;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+ if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
+ goto error;
+
+ if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+ trace_kvm_fast_mmio(gpa);
+ return 1;
+ }
+
+ if (write)
+ r = tdx_mmio_write(vcpu, gpa, size, val);
+ else
+ r = tdx_mmio_read(vcpu, gpa, size);
+ if (!r) {
+ /* Kernel completed device emulation. */
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+ }
+
+ /* Request the device emulation to userspace device model. */
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_is_write = write;
+ vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+ vcpu->run->mmio.phys_addr = gpa;
+ vcpu->run->mmio.len = size;
+ vcpu->run->mmio.is_write = write;
+ vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+ if (write) {
+ memcpy(vcpu->run->mmio.data, &val, size);
+ } else {
+ vcpu->mmio_fragments[0].gpa = gpa;
+ vcpu->mmio_fragments[0].len = size;
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+ }
+ return 0;
+
+error:
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1207,6 +1319,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_hlt(vcpu);
case EXIT_REASON_IO_INSTRUCTION:
return tdx_emulate_io(vcpu);
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_emulate_mmio(vcpu);
default:
break;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d68e99f4f5ea..6145143c32ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13734,6 +13734,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);

EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0ae59d680145..f5768c394efb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2686,6 +2686,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn

return NULL;
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
{
@@ -5920,6 +5921,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
r = __kvm_io_bus_read(vcpu, bus, &range, val);
return r < 0 ? r : 0;
}
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);

/* Caller must hold slots_lock. */
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
--
2.25.1

2023-10-16 16:32:43

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 106/116] KVM: TDX: Add methods to ignore accesses to TSC

From: Isaku Yamahata <[email protected]>

TDX protects the TDX guest's TSC state from the VMM. Implement access
methods that ignore guest TSC accesses.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++++++++++++++++++++----
1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index eb02a73f3bb0..d5ee4069f684 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -843,6 +843,42 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return vmx_get_mt_mask(vcpu, gfn, is_mmio);
}

+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guest at the moment. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guest at the moment. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* In TDX, tsc offset can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_offset(vcpu);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* In TDX, tsc multiplier can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_multiplier(vcpu);
+}
+
static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
{
if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
@@ -1000,10 +1036,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,

- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
+ .get_l2_tsc_offset = vt_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+ .write_tsc_offset = vt_write_tsc_offset,
+ .write_tsc_multiplier = vt_write_tsc_multiplier,

.load_mmu_pgd = vt_load_mmu_pgd,

--
2.25.1

2023-10-16 16:32:52

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 043/116] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

From: Isaku Yamahata <[email protected]>

For a private GPA, the CPU refers to a private page table whose contents
are encrypted. Dedicated APIs must be used to operate on it (e.g. to
update or read a PTE entry), and those operations are expensive.

When KVM resolves a KVM page fault, it walks the page tables. To reuse the
existing KVM MMU code and avoid the heavy cost of directly walking the
private page table, allocate one more page as a dummy page table that the
KVM MMU code can walk directly. Resolve the KVM page fault with the
existing code, then perform the additional operations needed for the
private page table. To distinguish the two cases, the existing KVM page
table is called a shared page table (i.e. not associated with a private
page table), and a page table associated with a private page table is
called a private page table. The relationship is depicted below.

Add a private pointer to struct kvm_mmu_page for private page table and
add helper functions to allocate/initialize/free a private page table
page.

KVM page fault |
| |
V |
-------------+---------- |
| | |
V V |
shared GPA private GPA |
| | |
V V |
shared PT root dummy PT root | private PT root
| | | |
V V | V
shared PT dummy PT ----propagate----> private PT
| | | |
| \-----------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
non-encrypted memory | encrypted memory
|
PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Private PT is used by CPU but it is invisible to KVM.
- Dummy PT is visible to KVM but not used by CPU. It is used to
propagate PT change to the actual private PT which is used by CPU.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 ++
arch/x86/kvm/mmu/mmu.c | 7 +++
arch/x86/kvm/mmu/mmu_internal.h | 83 +++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 1 +
4 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8664becb1e4..432b94a61ab1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -817,6 +817,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * This cache is to allocate private page table. E.g. Secure-EPT used
+ * by the TDX module.
+ */
+ struct kvm_mmu_memory_cache mmu_private_spt_cache;

/*
* QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4ac34a0cc33e..6636be590583 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -689,6 +689,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
@@ -708,6 +714,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a510f0a16853..398f30fc89a9 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -95,7 +95,23 @@ struct kvm_mmu_page {
int root_count;
refcount_t tdp_mmu_root_count;
};
- unsigned int unsync_children;
+ union {
+ struct {
+ unsigned int unsync_children;
+ /*
+ * Number of writes since the last time traversal
+ * visited this page.
+ */
+ atomic_t write_flooding_count;
+ };
+#ifdef CONFIG_KVM_MMU_PRIVATE
+ /*
+ * Associated private shadow page table, e.g. Secure-EPT page
+ * passed to the TDX module.
+ */
+ void *private_spt;
+#endif
+ };
union {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
@@ -124,9 +140,6 @@ struct kvm_mmu_page {
int clear_spte_count;
#endif

- /* Number of writes since the last time traversal visited this page. */
- atomic_t write_flooding_count;
-
#ifdef CONFIG_X86_64
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
@@ -150,6 +163,68 @@ static inline bool is_private_sp(const struct kvm_mmu_page *sp)
return kvm_mmu_page_role_is_private(sp->role);
}

+#ifdef CONFIG_KVM_MMU_PRIVATE
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+ return sp->private_spt;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+ sp->private_spt = private_spt;
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
+
+ KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
+ if (is_root)
+ /*
+ * Because the TDX module assigns the root Secure-EPT page and
+ * sets it as the Secure-EPTP when the TD vcpu is created, a secure
+ * page table for the root isn't needed.
+ */
+ sp->private_spt = NULL;
+ else {
+ /*
+ * Because the TDX module doesn't trust VMM and initializes
+ * the pages itself, KVM doesn't initialize them. Allocate
+ * pages with garbage and give them to the TDX module.
+ */
+ sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
+ /*
+ * Because mmu_private_spt_cache is topped up before starting KVM
+ * page fault resolving, the allocation above shouldn't fail.
+ */
+ WARN_ON_ONCE(!sp->private_spt);
+ }
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+ if (sp->private_spt)
+ free_page((unsigned long)sp->private_spt);
+}
+#else
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+ return NULL;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+}
+#endif
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0c5c7eadb1ba..583888bbb87d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -66,6 +66,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)

static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
+ kvm_mmu_free_private_spt(sp);
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
--
2.25.1

2023-10-16 16:33:00

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 107/116] KVM: TDX: Ignore setting up mce

From: Isaku Yamahata <[email protected]>

The vmx_setup_mce() function is VMX specific and cannot be used for TDX.
Add a vt_setup_mce() stub that ignores MCE setup for TDX.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d5ee4069f684..df84a12aa232 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -908,6 +908,14 @@ static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
}
#endif

+static void vt_setup_mce(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_setup_mce(vcpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -1063,7 +1071,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.cancel_hv_timer = vt_cancel_hv_timer,
#endif

- .setup_mce = vmx_setup_mce,
+ .setup_mce = vt_setup_mce,

#ifdef CONFIG_KVM_SMM
.smi_allowed = vt_smi_allowed,
--
2.25.1

2023-10-16 16:33:25

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 039/116] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch series for KVM TDP MMU
hooks.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index e893a3d714c7..7903473abad1 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -26,5 +26,5 @@ Patch Layer status
* TD vcpu interrupts/exit/hypercall: Not yet

* KVM MMU GPA shared bits: Applied
-* KVM TDP refactoring for TDX: Applying
-* KVM TDP MMU hooks: Not yet
+* KVM TDP refactoring for TDX: Applied
+* KVM TDP MMU hooks: Applying
--
2.25.1

2023-10-16 16:33:29

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 023/116] KVM: TDX: Refuse to unplug the last cpu on the package

From: Isaku Yamahata <[email protected]>

In order to reclaim a TDX HKID (i.e. when deleting a guest TD), KVM needs
to call TDH.PHYMEM.PAGE.WBINVD on all packages. If any TDX HKID is active,
refuse to offline the last online CPU of a package so that at least one
CPU stays online per package. Add an arch callback for CPU offlining.
Because the TDX 1.0 spec doesn't support suspend, this also refuses
suspend while TDs are running; if no TD is running, suspend is allowed.
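
A rough summary of the resulting policy, as implemented by
tdx_offline_cpu() below:

    /*
     * - No TD exists (nr_configured_hkid == 0): allow offlining any CPU.
     * - Otherwise: collect the package IDs of all other online CPUs and
     *   return -EBUSY if the CPU being offlined is the last online CPU
     *   of its package, so TDH.PHYMEM.PAGE.WBINVD can still be issued on
     *   every package when a TD is destroyed.
     */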

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/tdx.c | 41 ++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
arch/x86/kvm/x86.c | 5 ++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 12 +++++++--
8 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index d5a900842535..e1adeb7d4e26 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -18,6 +18,7 @@ KVM_X86_OP(check_processor_compatibility)
KVM_X86_OP(hardware_enable)
KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
+KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f23557850c27..24d9a9ab338d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1562,6 +1562,7 @@ struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
void (*hardware_unsetup)(void);
+ int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 463ba7214392..0bb087e3bbdf 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -121,6 +121,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_processor_compatibility = vmx_check_processor_compat,

.hardware_unsetup = vt_hardware_unsetup,
+ .offline_cpu = tdx_offline_cpu,

.hardware_enable = vt_hardware_enable,
.hardware_disable = vmx_hardware_disable,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6017e0feac1e..51aa114feb86 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -64,6 +64,7 @@ static struct tdx_info tdx_info __ro_after_init;
*/
static DEFINE_MUTEX(tdx_lock);
static struct mutex *tdx_mng_key_config_lock;
+static atomic_t nr_configured_hkid;

static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
{
@@ -79,6 +80,7 @@ static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
{
tdx_guest_keyid_free(kvm_tdx->hkid);
kvm_tdx->hkid = -1;
+ atomic_dec(&nr_configured_hkid);
}

static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
@@ -562,6 +564,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
if (ret < 0)
return ret;
kvm_tdx->hkid = ret;
+ atomic_inc(&nr_configured_hkid);

va = __get_free_page(GFP_KERNEL_ACCOUNT);
if (!va)
@@ -932,3 +935,41 @@ void tdx_hardware_unsetup(void)
/* kfree accepts NULL. */
kfree(tdx_mng_key_config_lock);
}
+
+int tdx_offline_cpu(void)
+{
+ int curr_cpu = smp_processor_id();
+ cpumask_var_t packages;
+ int ret = 0;
+ int i;
+
+ /* No TD is running. Allow any cpu to be offline. */
+ if (!atomic_read(&nr_configured_hkid))
+ return 0;
+
+ /*
+ * In order to reclaim TDX HKID, (i.e. when deleting guest TD), need to
+ * call TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory
+ * controller with pconfig. If we have active TDX HKID, refuse to
+ * offline the last online cpu.
+ */
+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+ return -ENOMEM;
+ for_each_online_cpu(i) {
+ if (i != curr_cpu)
+ cpumask_set_cpu(topology_physical_package_id(i), packages);
+ }
+ /* Check if this cpu is the last online cpu of this package. */
+ if (!cpumask_test_cpu(topology_physical_package_id(curr_cpu), packages))
+ ret = -EBUSY;
+ free_cpumask_var(packages);
+ if (ret)
+ /*
+ * Because it's hard for a human operator to understand the
+ * reason, print a warning.
+ */
+#define MSG_ALLPKG_ONLINE \
+ "TDX requires all packages to have an online CPU. Delete all TDs in order to offline all CPUs of a package.\n"
+ pr_warn_ratelimited(MSG_ALLPKG_ONLINE);
+ return ret;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 01f3b02a5b46..ef14c3873fe0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -138,6 +138,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);
+int tdx_offline_cpu(void);

int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int tdx_vm_init(struct kvm *kvm);
@@ -148,6 +149,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
+static inline int tdx_offline_cpu(void) { return 0; }

static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fbd80f2e403e..714685a31baf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12301,6 +12301,11 @@ void kvm_arch_hardware_disable(void)
drop_user_return_notifiers();
}

+int kvm_arch_offline_cpu(unsigned int cpu)
+{
+ return static_call(kvm_x86_offline_cpu)();
+}
+
bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
{
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c5c017ab4e9..9f90cda7149f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1502,6 +1502,7 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
int kvm_arch_hardware_enable(void);
void kvm_arch_hardware_disable(void);
#endif
+int kvm_arch_offline_cpu(unsigned int cpu);
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e0563477830f..aa9426472bb4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5581,13 +5581,21 @@ static void hardware_disable_nolock(void *junk)
__this_cpu_write(hardware_enabled, false);
}

+__weak int kvm_arch_offline_cpu(unsigned int cpu)
+{
+ return 0;
+}
+
static int kvm_offline_cpu(unsigned int cpu)
{
+ int r = 0;
+
mutex_lock(&kvm_lock);
- if (kvm_usage_count)
+ r = kvm_arch_offline_cpu(cpu);
+ if (!r && kvm_usage_count)
hardware_disable_nolock(NULL);
mutex_unlock(&kvm_lock);
- return 0;
+ return r;
}

static void hardware_disable_all_nolock(void)
--
2.25.1

2023-10-16 16:33:48

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 083/116] KVM: TDX: Handle vmentry failure for INTEL TD guest

From: Yao Yuan <[email protected]>

The TDX module passes control back to the VMM if VM entry fails for a TD.
Use the same exit reason to notify user space, aligning with VMX.
Because the TDX module maintains the TD VMCS, VM-entry failure should
normally not happen, but a machine check during entry can still occur, in
which case the VM-exit reason is EXIT_REASON_MCE_DURING_VMENTRY. If the
VMM corrupts the VMCS of a debug TD via TDH.VP.WR, the exit reason can be
EXIT_REASON_INVALID_STATE or EXIT_REASON_MSR_LOAD_FAIL.

Signed-off-by: Yao Yuan <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index be73b8322a66..306d555dda35 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1314,6 +1314,28 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
goto unhandled_exit;
}

+ /*
+ * When TDX module saw VMEXIT_REASON_FAILED_VMENTER_MC etc, TDH.VP.ENTER
+ * returns with TDX_SUCCESS | exit_reason with failed_vmentry = 1.
+ * Because TDX module maintains TD VMCS correctness, usually vmentry
+ * failure shouldn't happen. In some corner cases it can happen. For
+ * example
+ * - machine check during entry: EXIT_REASON_MCE_DURING_VMENTRY
+ * - TDH.VP.WR with debug TD. VMM can corrupt TD VMCS
+ * - EXIT_REASON_INVALID_STATE
+ * - EXIT_REASON_MSR_LOAD_FAIL
+ */
+ if (unlikely(exit_reason.failed_vmentry)) {
+ pr_err("TDExit: exit_reason 0x%016llx qualification=%016lx ext_qualification=%016lx\n",
+ exit_reason.full, tdexit_exit_qual(vcpu), tdexit_ext_exit_qual(vcpu));
+ vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+ vcpu->run->fail_entry.hardware_entry_failure_reason
+ = exit_reason.full;
+ vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
+
+ return 0;
+ }
+
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
--
2.25.1

2023-10-16 16:34:02

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 013/116] KVM: TDX: Add helper functions to print TDX SEAMCALL error

From: Isaku Yamahata <[email protected]>

Add helper functions to print out errors from the TDX module in a uniform
manner.
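
An illustrative (not from this patch) use of the helper; tdh_mng_addcx()
is the SEAMCALL wrapper shown below, tdr_pa/tdcx_pa are placeholder
physical addresses, and TDH_MNG_ADDCX is assumed to be the matching leaf
constant from this series' tdx_arch.h:

    u64 err;

    err = tdh_mng_addcx(tdr_pa, tdcx_pa);
    if (WARN_ON_ONCE(err)) {
        pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
        return -EIO;
    }

SEAMCALLs that produce output registers would pass a struct
tdx_module_args instead of NULL so the registers are printed as well.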

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/tdx_error.c | 20 ++++++++++++++++++++
arch/x86/kvm/vmx/tdx_ops.h | 5 +++++
3 files changed, 26 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/vmx/tdx_error.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 4b01ab842ab7..e3354b784e10 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -25,7 +25,7 @@ kvm-$(CONFIG_KVM_SMM) += smm.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
-kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o vmx/tdx_error.o

kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
svm/sev.o svm/hyperv.o
diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
new file mode 100644
index 000000000000..d083c79a2331
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_error.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+/* functions to record TDX SEAMCALL error */
+
+#include <linux/kernel.h>
+#include <linux/bug.h>
+
+#include "tdx_ops.h"
+
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out)
+{
+ if (!out) {
+ pr_err_ratelimited("SEAMCALL[%lld] failed: 0x%llx\n",
+ op, error_code);
+ return;
+ }
+
+#define MSG "SEAMCALL[%lld] failed: 0x%llx RCX 0x%llx RDX 0x%llx R8 0x%llx R9 0x%llx R10 0x%llx R11 0x%llx\n"
+ pr_err_ratelimited(MSG, op, error_code, out->rcx, out->rdx, out->r8,
+ out->r9, out->r10, out->r11);
+}
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index a55977626ae3..c9f74b2400a7 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -10,6 +10,7 @@
#include <asm/cacheflush.h>
#include <asm/asm.h>
#include <asm/kvm_host.h>
+#include <asm/tdx.h>

#include "tdx_errno.h"
#include "tdx_arch.h"
@@ -57,6 +58,10 @@ static inline u64 tdx_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
return ret;
}

+#ifdef CONFIG_INTEL_TDX_HOST
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
+#endif
+
static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
{
clflush_cache_range(__va(addr), PAGE_SIZE);
--
2.25.1

2023-10-16 16:34:16

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 095/116] KVM: TDX: Implement callbacks for MSR operations for TDX

From: Isaku Yamahata <[email protected]>

Implement the set_msr/get_msr/has_emulated_msr methods for TDX to handle
hypercalls from the guest TD for paravirtualized RDMSR and WRMSR. The TDX
module virtualizes MSRs. For some MSRs, it injects a #VE into the guest TD
upon RDMSR or WRMSR. The exact list of such MSRs is defined in the spec.

Upon #VE, the guest TD may execute the hypercalls
TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
which are defined in the GHCI (Guest-Host Communication Interface), so
that the host VMM (e.g. KVM) can virtualize the MSRs.

There are three classes of MSR virtualization:
- non-configurable: the TDX module directly virtualizes the MSR. The VMM
can't configure it; the value set by KVM_SET_MSR_INDEX_LIST is ignored.
- configurable: the TDX module directly virtualizes the MSR. The VMM can
configure it at VM creation time; the value set by KVM_SET_MSR_INDEX_LIST
is used.
- #VE case: the guest TD issues TDG.VP.VMCALL<INSTRUCTION.{RDMSR,WRMSR}>
and the VMM handles the MSR hypercall. The value set by
KVM_SET_MSR_INDEX_LIST is used (see the flow sketch below).
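
For the #VE class, a sketch of the flow using function names from this
series (the TDVMCALL MSR handlers themselves are added in a separate
patch of the series):

    /*
     * guest RDMSR
     *   -> #VE injected by the TDX module
     *   -> guest #VE handler issues TDG.VP.VMCALL<INSTRUCTION.RDMSR>
     *   -> TD exit -> handle_tdvmcall() -> tdx_emulate_rdmsr()
     *   -> kvm_get_msr() -> vt_get_msr() -> tdx_get_msr()
     *   -> kvm_get_msr_common()
     */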

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.c | 70 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 6 ++++
arch/x86/kvm/x86.c | 1 -
arch/x86/kvm/x86.h | 2 ++
5 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4356b10d68b6..982022993700 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -258,6 +258,42 @@ static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vmx_handle_exit_irqoff(vcpu);
}

+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_set_msr(vcpu, msr_info);
+
+ return vmx_set_msr(vcpu, msr_info);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+ if (kvm && is_td(kvm))
+ return tdx_has_emulated_msr(index, true);
+
+ return vmx_has_emulated_msr(kvm, index);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_get_msr(vcpu, msr_info);
+
+ return vmx_get_msr(vcpu, msr_info);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_msr_filter_changed(vcpu);
+}
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -519,7 +555,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.hardware_enable = vt_hardware_enable,
.hardware_disable = vt_hardware_disable,
- .has_emulated_msr = vmx_has_emulated_msr,
+ .has_emulated_msr = vt_has_emulated_msr,

.is_vm_type_supported = vt_is_vm_type_supported,
.max_vcpus = vt_max_vcpus,
@@ -541,8 +577,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.update_exception_bitmap = vmx_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
+ .get_msr = vt_get_msr,
+ .set_msr = vt_set_msr,
.get_segment_base = vmx_get_segment_base,
.get_segment = vmx_get_segment,
.set_segment = vmx_set_segment,
@@ -652,7 +688,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,

- .msr_filter_changed = vmx_msr_filter_changed,
+ .msr_filter_changed = vt_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 24800d1e597d..7e24683e430b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1875,6 +1875,76 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
*error_code = 0;
}

+static bool tdx_is_emulated_kvm_msr(u32 index, bool write)
+{
+ switch (index) {
+ case MSR_KVM_POLL_CONTROL:
+ return true;
+ default:
+ return false;
+ }
+}
+
+bool tdx_has_emulated_msr(u32 index, bool write)
+{
+ switch (index) {
+ case MSR_IA32_UCODE_REV:
+ case MSR_IA32_ARCH_CAPABILITIES:
+ case MSR_IA32_POWER_CTL:
+ case MSR_IA32_CR_PAT:
+ case MSR_IA32_TSC_DEADLINE:
+ case MSR_IA32_MISC_ENABLE:
+ case MSR_PLATFORM_INFO:
+ case MSR_MISC_FEATURES_ENABLES:
+ case MSR_IA32_MCG_CAP:
+ case MSR_IA32_MCG_STATUS:
+ case MSR_IA32_MCG_CTL:
+ case MSR_IA32_MCG_EXT_CTL:
+ case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
+ case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
+ /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
+ return true;
+ case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+ /*
+ * x2APIC registers that are virtualized by the CPU can't be
+ * emulated, KVM doesn't have access to the virtual APIC page.
+ */
+ switch (index) {
+ case X2APIC_MSR(APIC_TASKPRI):
+ case X2APIC_MSR(APIC_PROCPRI):
+ case X2APIC_MSR(APIC_EOI):
+ case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+ return false;
+ default:
+ return true;
+ }
+ case MSR_IA32_APICBASE:
+ case MSR_EFER:
+ return !write;
+ case 0x4b564d00 ... 0x4b564dff:
+ /* KVM custom MSRs */
+ return tdx_is_emulated_kvm_msr(index, write);
+ default:
+ return false;
+ }
+}
+
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_has_emulated_msr(msr->index, false))
+ return kvm_get_msr_common(vcpu, msr);
+ return 1;
+}
+
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_has_emulated_msr(msr->index, true))
+ return kvm_set_msr_common(vcpu, msr);
+ return 1;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index cf12b94eb35c..fd52c7df0cd8 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -165,6 +165,9 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+bool tdx_has_emulated_msr(u32 index, bool write);
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -207,6 +210,9 @@ static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mo
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
+static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
+static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6145143c32ee..e96b1f250cc8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -90,7 +90,6 @@
#include "trace.h"

#define MAX_IO_MSRS 256
-#define KVM_MAX_MCE_BANKS 32

struct kvm_caps kvm_caps __read_mostly = {
.supported_mce_cap = MCG_CTL_P | MCG_SER_P,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a3b2c3d5e18f..0ca67e6fcdb2 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -9,6 +9,8 @@
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"

+#define KVM_MAX_MCE_BANKS 32
+
bool __kvm_is_vm_type_supported(unsigned long type);

struct kvm_caps {
--
2.25.1

2023-10-16 16:34:17

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 104/116] KVM: TDX: Add a method to ignore dirty logging

From: Isaku Yamahata <[email protected]>

TDX KVM doesn't support tracking dirty pages yet. Implement a method that
ignores it. Because the memslot flag that enables dirty logging isn't
accepted for TDX, warn if the method is ever called for a TDX vCPU.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5e4091dbf1e8..06eb0896a87a 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -843,6 +843,14 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return vmx_get_mt_mask(vcpu, gfn, is_mmio);
}

+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_update_cpu_dirty_logging(vcpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -986,7 +994,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sched_in = vt_sched_in,

.cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+ .update_cpu_dirty_logging = vt_update_cpu_dirty_logging,

.nested_ops = &vmx_nested_ops,

--
2.25.1

2023-10-16 16:34:18

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 096/116] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall

From: Isaku Yamahata <[email protected]>

Wire up TDX PV rdmsr/wrmsr hypercall to the KVM backend function.
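
For reference, the register layout consumed by the handlers below
(register names per the tdvmcall a0/a1 accessors used in this series):

    /*
     * R11 (leaf) = EXIT_REASON_MSR_READ or EXIT_REASON_MSR_WRITE
     * R12 (a0)   = MSR index
     * R13 (a1)   = data to write (WRMSR only)
     * On completion: R10 = return code,
     *                R11 = value read (successful RDMSR only)
     */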

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7e24683e430b..41f7ad98cab5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1307,6 +1307,41 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data;
+
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
+ kvm_get_msr(vcpu, index, &data)) {
+ trace_kvm_msr_read_ex(index);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+ trace_kvm_msr_read(index, data);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, data);
+ return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data = tdvmcall_a1_read(vcpu);
+
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
+ kvm_set_msr(vcpu, index, data)) {
+ trace_kvm_msr_write_ex(index, data);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ trace_kvm_msr_write(index, data);
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1321,6 +1356,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_io(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_emulate_mmio(vcpu);
+ case EXIT_REASON_MSR_READ:
+ return tdx_emulate_rdmsr(vcpu);
+ case EXIT_REASON_MSR_WRITE:
+ return tdx_emulate_wrmsr(vcpu);
default:
break;
}
--
2.25.1

2023-10-16 16:34:22

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 063/116] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)

From: Isaku Yamahata <[email protected]>

On entering/exiting a TDX vcpu, the CPU state that is preserved or
clobbered differs from the VMX case. Add TDX hooks to save/restore
host/guest CPU state, and save/restore the kernel GS base MSR.
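
A sketch of the lifecycle implemented below (host kernel GS base only):

    /*
     * tdx_prepare_switch_to_guest(): save the host MSR_KERNEL_GS_BASE
     *                                once (host_state_need_save)
     * tdx_vcpu_run():                mark host_state_need_restore after
     *                                entering/exiting the TD
     * tdx_vcpu_put():                restore MSR_KERNEL_GS_BASE if it was
     *                                clobbered and re-arm the save flag
     */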

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 30 +++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 4 ++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c45b8fc3e8d3..bc75ed8ff371 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -169,6 +169,32 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}

+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ /*
+ * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+ * guest state of a TD is obviously off limits. Deferring MSRs and DRs
+ * is pointless because the TDX module needs to load *something* so as
+ * not to expose guest state.
+ */
+ if (is_td_vcpu(vcpu)) {
+ tdx_prepare_switch_to_guest(vcpu);
+ return;
+ }
+
+ vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_put(vcpu);
+ return;
+ }
+
+ vmx_vcpu_put(vcpu);
+}
+
static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -304,9 +330,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_free = vt_vcpu_free,
.vcpu_reset = vt_vcpu_reset,

- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .prepare_switch_to_guest = vt_prepare_switch_to_guest,
.vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
+ .vcpu_put = vt_vcpu_put,

.update_exception_bitmap = vmx_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7b5d0556ad95..add9a0083adb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/cpu.h>
+#include <linux/mmu_context.h>

#include <asm/tdx.h>

@@ -392,6 +393,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
int tdx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);

/*
* On cpu creation, cpuid entry is blank. Forcibly enable
@@ -420,9 +422,47 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
vcpu->arch.xfd_no_write_intercept = true;

+ tdx->host_state_need_save = true;
+ tdx->host_state_need_restore = false;
+
return 0;
}

+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (!tdx->host_state_need_save)
+ return;
+
+ if (likely(is_64bit_mm(current->mm)))
+ tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+ else
+ tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+ tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ tdx->host_state_need_save = true;
+ if (!tdx->host_state_need_restore)
+ return;
+
+ ++vcpu->stat.host_state_reload;
+
+ wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+ tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ vmx_vcpu_pi_put(vcpu);
+ tdx_prepare_switch_to_host(vcpu);
+}
+
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -543,6 +583,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

+ tdx->host_state_need_restore = true;
+
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
trace_kvm_exit(vcpu, KVM_ISA_VMX);

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 04942488e6d1..610bd3f4e952 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -66,6 +66,10 @@ struct vcpu_tdx {

bool initialized;

+ bool host_state_need_save;
+ bool host_state_need_restore;
+ u64 msr_host_kernel_gs_base;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 87cfa1c5afa3..02da72dfb2cf 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -151,6 +151,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -179,6 +181,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1

2023-10-16 16:34:30

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 088/116] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)

From: Isaku Yamahata <[email protected]>

The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM. When the guest TD
issues a TDG.VP.VMCALL, it exits to the VMM with a new exit reason,
TDVMCALL. The arguments from the guest TD and the values returned by the
VMM are passed in guest registers; the guest RCX register indicates which
registers are used. Define helper functions to access those registers per
that ABI.

Define the TDVMCALL exit reason, which is carved out of the VMX exit
reason namespace because the TDVMCALL exit from the TDX guest to TDX-SEAM
is really just a VM-Exit. Add a placeholder to handle the TDVMCALL exit.
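
For reference, the TDVMCALL register convention as seen by KVM, matching
the accessors added below (a summary, not part of the patch itself):

    /*
     * RCX     = bitmap of which guest GPRs are exposed to the VMM
     * R10     = exit type (0 for a standard TDVMCALL);
     *           return code on return to the guest
     * R11     = leaf number; return value on return to the guest
     * R12-R15 = a0-a3 arguments
     */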

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/vmx.h | 4 ++-
arch/x86/kvm/vmx/tdx.c | 53 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 13 ++++++++
3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index b3a30ef3efdd..f0f4a4cf84a7 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -93,6 +93,7 @@
#define EXIT_REASON_TPAUSE 68
#define EXIT_REASON_BUS_LOCK 74
#define EXIT_REASON_NOTIFY 75
+#define EXIT_REASON_TDCALL 77

#define VMX_EXIT_REASONS \
{ EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
@@ -156,7 +157,8 @@
{ EXIT_REASON_UMWAIT, "UMWAIT" }, \
{ EXIT_REASON_TPAUSE, "TPAUSE" }, \
{ EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \
- { EXIT_REASON_NOTIFY, "NOTIFY" }
+ { EXIT_REASON_NOTIFY, "NOTIFY" }, \
+ { EXIT_REASON_TDCALL, "TDCALL" }

#define VMX_EXIT_REASON_FLAGS \
{ VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 44558f1c13e4..fa1cf0a8b5f9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -107,6 +107,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
return kvm_r9_read(vcpu);
}

+#define BUILD_TDVMCALL_ACCESSORS(param, gpr) \
+static __always_inline \
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu) \
+{ \
+ return kvm_##gpr##_read(vcpu); \
+} \
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
+ unsigned long val) \
+{ \
+ kvm_##gpr##_write(vcpu, val); \
+}
+BUILD_TDVMCALL_ACCESSORS(a0, r12);
+BUILD_TDVMCALL_ACCESSORS(a1, r13);
+BUILD_TDVMCALL_ACCESSORS(a2, r14);
+BUILD_TDVMCALL_ACCESSORS(a3, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+ return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
+{
+ return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+ long val)
+{
+ kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+ unsigned long val)
+{
+ kvm_r11_write(vcpu, val);
+}
+
static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
{
return tdx->tdvpr_pa;
@@ -871,6 +906,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_complete_interrupts(vcpu);

+ if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+ tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+ else
+ tdx->tdvmcall.rcx = 0;
+
return EXIT_FASTPATH_NONE;
}

@@ -942,6 +982,17 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
return 0;
}

+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+ switch (tdvmcall_leaf(vcpu)) {
+ default:
+ break;
+ }
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1433,6 +1484,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_exception(vcpu);
case EXIT_REASON_EXTERNAL_INTERRUPT:
return tdx_handle_external_interrupt(vcpu);
+ case EXIT_REASON_TDCALL:
+ return handle_tdvmcall(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_handle_ept_violation(vcpu);
case EXIT_REASON_EPT_MISCONFIG:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index b29fcda2892b..870a0d15e073 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -80,6 +80,19 @@ struct vcpu_tdx {

struct list_head cpu_list;

+ union {
+ struct {
+ union {
+ struct {
+ u16 gpr_mask;
+ u16 xmm_mask;
+ };
+ u32 regs_mask;
+ };
+ u32 reserved;
+ };
+ u64 rcx;
+ } tdvmcall;
union tdx_exit_reason exit_reason;

bool initialized;
--
2.25.1

2023-10-16 16:34:34

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 021/116] KVM: TDX: initialize VM with TDX specific parameters

From: Isaku Yamahata <[email protected]>

TDX requires additional parameters for a TDX VM for confidential execution
to protect the confidentiality of its memory contents and CPU state from
any other software, including the VMM. When creating a guest TD VM, and
before creating any vcpu, the VMM must provide the number of vcpus, the
TSC frequency (the value is the same for all vcpus and cannot change),
and the CPUIDs which the TDX module emulates. Guest TDs can trust those
CPUIDs, and SHA-384 values are provided for measurement.

Add a new subcommand, KVM_TDX_INIT_VM, to pass parameters for the TDX
guest. It assigns an encryption key to the TDX guest for memory
encryption; TDX encrypts memory on a per-guest basis. The device model,
e.g. qemu, passes the per-VM parameters for the TDX guest: the maximum
number of vcpus, the TSC frequency (a TDX guest has a fixed VM-wide TSC
frequency, not a per-vcpu one, and cannot change it), attributes
(production or debug), available extended features (which configure the
guest XCR0 and IA32_XSS MSR), CPUIDs, sha384 measurements, and so on.

Call this subcommand before creating any vcpu and before KVM_SET_CPUID2,
i.e. the CPUID configurations aren't available yet at that point, so the
CPUID configuration values need to be passed in struct kvm_tdx_init_vm.
It is the device model's responsibility to keep the CPUID configuration
consistent between KVM_TDX_INIT_VM and KVM_SET_CPUID2.
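
For illustration only (not part of this patch), a user space VMM would
typically allocate struct kvm_tdx_init_vm together with its trailing CPUID
entries, fill them in, and issue the sub-command on the VM file
descriptor. The KVM_MEMORY_ENCRYPT_OP routing, the vm_fd variable and the
fill_tdx_cpuid_entries() helper below are assumptions for the sketch:

  struct kvm_tdx_init_vm *init_vm;
  struct kvm_tdx_cmd cmd = {};
  size_t size = sizeof(*init_vm) +
                sizeof(struct kvm_cpuid_entry2) * 256;  /* KVM_MAX_CPUID_ENTRIES */

  init_vm = calloc(1, size);
  init_vm->attributes = 0;                        /* production TD */
  fill_tdx_cpuid_entries(&init_vm->cpuid);        /* hypothetical helper */

  cmd.id = KVM_TDX_INIT_VM;
  cmd.data = (__u64)(unsigned long)init_vm;

  /* Assumption: the sub-command is routed via KVM_MEMORY_ENCRYPT_OP. */
  if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
          err(1, "KVM_TDX_INIT_VM, SEAMCALL error 0x%llx",
              (unsigned long long)cmd.error);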

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
v15 -> v16:
- Removed AMX check as the KVM upstream supports AMX.
- Added CET flag to guest supported xss

v14 -> v15:
- add check if the reserved area of init_vm is zero
---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 27 +++
arch/x86/kvm/cpuid.c | 7 +
arch/x86/kvm/cpuid.h | 2 +
arch/x86/kvm/vmx/tdx.c | 263 +++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 18 ++
arch/x86/kvm/vmx/tdx_arch.h | 6 +
tools/arch/x86/include/uapi/asm/kvm.h | 33 ++++
8 files changed, 348 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 800ed1783b94..3fc63ffed22e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -110,6 +110,8 @@ u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
#define seamcall_saved_ret(__fn, __args) \
SEAMCALL_NO_ENTROPY_RETRY(__seamcall_saved_ret, (__fn), (__args))

+/* -1 indicates CPUID leaf with no sub-leaves. */
+#define TDX_CPUID_NO_SUBLEAF ((u32)-1)
struct tdx_cpuid_config {
__struct_group(tdx_cpuid_config_leaf, leaf_sub_leaf, __packed,
u32 leaf;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 3fbd43d5177b..7112546bd1d0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -570,6 +570,7 @@ struct kvm_pmu_event_filter {
/* Trust Domain eXtension sub-ioctl() commands. */
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,

KVM_TDX_CMD_NR_MAX,
};
@@ -617,4 +618,30 @@ struct kvm_tdx_capabilities {
struct kvm_tdx_cpuid_config cpuid_configs[];
};

+struct kvm_tdx_init_vm {
+ __u64 attributes;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ /*
+ * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
+ * This should be enough given sizeof(TD_PARAMS) = 1024.
+ * 8KB was chosen because
+ * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
+ */
+ __u64 reserved[1004];
+
+ /*
+ * Call KVM_TDX_INIT_VM before vcpu creation, thus before
+ * KVM_SET_CPUID2.
+ * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
+ * TDX module directly virtualizes those CPUIDs without VMM. The user
+ * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
+ * those values. If it doesn't, KVM may have wrong idea of vCPUIDs of
+ * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX
+ * module doesn't virtualize.
+ */
+ struct kvm_cpuid2 cpuid;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..da46f284ec65 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1426,6 +1426,13 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
return r;
}

+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(
+ struct kvm_cpuid_entry2 *entries, int nent, u32 function, u64 index)
+{
+ return cpuid_entry2_find(entries, nent, function, index);
+}
+EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry2);
+
struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
u32 function, u32 index)
{
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 284fa4704553..a6fb9562eade 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -13,6 +13,8 @@ void kvm_set_cpu_caps(void);

void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
void kvm_update_pv_runtime(struct kvm_vcpu *vcpu);
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
+ int nent, u32 function, u64 index);
struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
u32 function, u32 index);
struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 692619411da2..6017e0feac1e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,7 +7,6 @@
#include "x86_ops.h"
#include "mmu.h"
#include "tdx.h"
-#include "tdx_ops.h"
#include "x86.h"

#undef pr_fmt
@@ -317,18 +316,21 @@ static int tdx_do_tdh_mng_key_config(void *param)
return 0;
}

-static int __tdx_td_init(struct kvm *kvm);
-
int tdx_vm_init(struct kvm *kvm)
{
+ /*
+ * This function initializes only the KVM software construct. It doesn't
+ * initialize the TDX resources, e.g. TDCS, TDR, TDCX, HKID etc.
+ * That is handled by KVM_TDX_INIT_VM, __tdx_td_init().
+ */
+
/*
* TDX has its own limit of the number of vcpus in addition to
* KVM_MAX_VCPUS.
*/
kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);

- /* Place holder for TDX specific logic. */
- return __tdx_td_init(kvm);
+ return 0;
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -391,9 +393,163 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
return ret;
}

-static int __tdx_td_init(struct kvm *kvm)
+static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
+ struct td_params *td_params)
+{
+ const struct kvm_cpuid_entry2 *entry;
+ int max_pa = 36;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
+ if (entry)
+ max_pa = entry->eax & 0xff;
+
+ td_params->eptp_controls = VMX_EPTP_MT_WB;
+ /*
+ * No CPU supports 4-level && max_pa > 48.
+ * "5-level paging and 5-level EPT" section 4.1 4-level EPT
+ * "4-level EPT is limited to translating 48-bit guest-physical
+ * addresses."
+ * cpu_has_vmx_ept_5levels() check is just in case.
+ */
+ if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
+ return -EINVAL;
+ if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+ td_params->eptp_controls |= VMX_EPTP_PWL_5;
+ td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
+ } else {
+ td_params->eptp_controls |= VMX_EPTP_PWL_4;
+ }
+
+ return 0;
+}
+
+static void setup_tdparams_cpuids(const struct tdsysinfo_struct *tdsysinfo,
+ struct kvm_cpuid2 *cpuid,
+ struct td_params *td_params)
+{
+ int i;
+
+ /*
+ * td_params.cpuid_values: The number and the order of cpuid_value must
+ * be the same as those of struct tdsysinfo.{num_cpuid_config, cpuid_configs}.
+ * It's assumed that td_params was zeroed.
+ */
+ for (i = 0; i < tdsysinfo->num_cpuid_config; i++) {
+ const struct tdx_cpuid_config *config = &tdsysinfo->cpuid_configs[i];
+ /* TDX_CPUID_NO_SUBLEAF in TDX CPUID_CONFIG means index = 0. */
+ u32 index = config->sub_leaf == TDX_CPUID_NO_SUBLEAF ? 0 : config->sub_leaf;
+ const struct kvm_cpuid_entry2 *entry =
+ kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
+ config->leaf, index);
+ struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
+
+ if (!entry)
+ continue;
+
+ /*
+ * tdsysinfo.cpuid_configs[].{eax, ebx, ecx, edx}
+ * bit 1 means it can be configured to zero or one.
+ * bit 0 means it must be zero.
+ * Mask out non-configurable bits.
+ */
+ value->eax = entry->eax & config->eax;
+ value->ebx = entry->ebx & config->ebx;
+ value->ecx = entry->ecx & config->ecx;
+ value->edx = entry->edx & config->edx;
+ }
+}
+
+static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
+{
+ const struct kvm_cpuid_entry2 *entry;
+ u64 guest_supported_xcr0;
+ u64 guest_supported_xss;
+
+ /* Setup td_params.xfam */
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
+ if (entry)
+ guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
+ else
+ guest_supported_xcr0 = 0;
+ guest_supported_xcr0 &= kvm_caps.supported_xcr0;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
+ if (entry)
+ guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
+ else
+ guest_supported_xss = 0;
+
+ /* PT can be exposed to TD guest regardless of KVM's XSS support */
+ guest_supported_xss &=
+ (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
+
+ td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
+ if (td_params->xfam & XFEATURE_MASK_LBR) {
+ /*
+ * TODO: once KVM supports LBR(save/restore LBR related
+ * registers around TDENTER), remove this guard.
+ */
+#define MSG_LBR "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH properly.\n"
+ pr_warn(MSG_LBR);
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+ struct kvm_tdx_init_vm *init_vm)
+{
+ struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
+ const struct tdsysinfo_struct *tdsysinfo;
+ int ret;
+
+ tdsysinfo = tdx_get_sysinfo();
+ if (!tdsysinfo)
+ return -EOPNOTSUPP;
+ if (kvm->created_vcpus)
+ return -EBUSY;
+
+ if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
+ /*
+ * TODO: save/restore PMU related registers around TDENTER.
+ * Once it's done, remove this guard.
+ */
+#define MSG_PERFMON "TD doesn't support perfmon yet. KVM needs to save/restore host perf registers properly.\n"
+ pr_warn(MSG_PERFMON);
+ return -EOPNOTSUPP;
+ }
+
+ td_params->max_vcpus = kvm->max_vcpus;
+ td_params->attributes = init_vm->attributes;
+ td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);
+
+ ret = setup_tdparams_eptp_controls(cpuid, td_params);
+ if (ret)
+ return ret;
+ setup_tdparams_cpuids(tdsysinfo, cpuid, td_params);
+ ret = setup_tdparams_xfam(cpuid, td_params);
+ if (ret)
+ return ret;
+
+#define MEMCPY_SAME_SIZE(dst, src) \
+ do { \
+ BUILD_BUG_ON(sizeof(dst) != sizeof(src)); \
+ memcpy((dst), (src), sizeof(dst)); \
+ } while (0)
+
+ MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
+ MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
+ MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
+
+ return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
+ u64 *seamcall_err)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_module_args out;
cpumask_var_t packages;
unsigned long *tdcs_pa = NULL;
unsigned long tdr_pa = 0;
@@ -401,6 +557,7 @@ static int __tdx_td_init(struct kvm *kvm)
int ret, i;
u64 err;

+ *seamcall_err = 0;
ret = tdx_guest_keyid_alloc();
if (ret < 0)
return ret;
@@ -506,10 +663,23 @@ static int __tdx_td_init(struct kvm *kvm)
}
}

- /*
- * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
- * ioctl() to define the configure CPUID values for the TD.
- */
+ err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
+ if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_INVALID) {
+ /*
+ * Because a user gives operands, don't warn.
+ * Return a hint to the user because it's sometimes hard for the
+ * user to figure out which operand is invalid. SEAMCALL status
+ * code includes which operand caused invalid operand error.
+ */
+ *seamcall_err = err;
+ ret = -EINVAL;
+ goto teardown;
+ } else if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_INIT, err, &out);
+ ret = -EIO;
+ goto teardown;
+ }
+
return 0;

/*
@@ -552,6 +722,76 @@ static int __tdx_td_init(struct kvm *kvm)
return ret;
}

+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_tdx_init_vm *init_vm = NULL;
+ struct td_params *td_params = NULL;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(*init_vm) != 8 * 1024);
+ BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+ if (is_hkid_assigned(kvm_tdx))
+ return -EINVAL;
+
+ if (cmd->flags)
+ return -EINVAL;
+
+ init_vm = kzalloc(sizeof(*init_vm) +
+ sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
+ GFP_KERNEL);
+ if (!init_vm)
+ return -ENOMEM;
+ if (copy_from_user(init_vm, (void __user *)cmd->data, sizeof(*init_vm))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (init_vm->cpuid.nent > KVM_MAX_CPUID_ENTRIES) {
+ ret = -E2BIG;
+ goto out;
+ }
+ if (copy_from_user(init_vm->cpuid.entries,
+ (void __user *)cmd->data + sizeof(*init_vm),
+ flex_array_size(init_vm, cpuid.entries, init_vm->cpuid.nent))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ if (memchr_inv(init_vm->reserved, 0, sizeof(init_vm->reserved))) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (init_vm->cpuid.padding) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL);
+ if (!td_params) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = setup_tdparams(kvm, td_params, init_vm);
+ if (ret)
+ goto out;
+
+ ret = __tdx_td_init(kvm, td_params, &cmd->error);
+ if (ret)
+ goto out;
+
+ kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+ kvm_tdx->attributes = td_params->attributes;
+ kvm_tdx->xfam = td_params->xfam;
+
+out:
+ /* kfree() accepts NULL. */
+ kfree(init_vm);
+ kfree(td_params);
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -568,6 +808,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_CAPABILITIES:
r = tdx_get_capabilities(&tdx_cmd);
break;
+ case KVM_TDX_INIT_VM:
+ r = tdx_td_init(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ae117f864cfb..184fe394da86 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -12,7 +12,11 @@ struct kvm_tdx {
unsigned long tdr_pa;
unsigned long *tdcs_pa;

+ u64 attributes;
+ u64 xfam;
int hkid;
+
+ u64 tsc_offset;
};

struct vcpu_tdx {
@@ -39,6 +43,20 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
{
return container_of(vcpu, struct vcpu_tdx, vcpu);
}
+
+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+ struct tdx_module_args out;
+ u64 err;
+
+ err = tdh_mng_rd(kvm_tdx->tdr_pa, TDCS_EXEC(field), &out);
+ if (unlikely(err)) {
+ pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err);
+ return 0;
+ }
+ return out.r8;
+}
+
#else
struct kvm_tdx {
struct kvm kvm;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 845b6ef9f787..fc9a8898765c 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -122,6 +122,12 @@ struct tdx_cpuid_value {
#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)

+/*
+ * TODO: Once XFEATURE_CET_{U, S} in arch/x86/include/asm/fpu/types.h is
+ * defined, Replace these with define ones.
+ */
+#define TDX_TD_XFAM_CET (BIT(11) | BIT(12))
+
/*
* TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
*/
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 7a08723e99e2..61ce7d174fcf 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -565,6 +565,7 @@ struct kvm_pmu_event_filter {
/* Trust Domain eXtension sub-ioctl() commands. */
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,

KVM_TDX_CMD_NR_MAX,
};
@@ -614,4 +615,36 @@ struct kvm_tdx_capabilities {
struct kvm_tdx_cpuid_config cpuid_configs[];
};

+struct kvm_tdx_init_vm {
+ __u64 attributes;
+ __u32 max_vcpus;
+ __u32 padding;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ union {
+ /*
+ * KVM_TDX_INIT_VM is called before vcpu creation, thus before
+ * KVM_SET_CPUID2. The CPUID configurations need to be passed.
+ *
+ * This configuration supersedes KVM_SET_CPUID{,2}.
+ * The user space VMM, e.g. qemu, should make them consistent
+ * with these values.
+ * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(256)
+ * = 8KB.
+ */
+ struct {
+ struct kvm_cpuid2 cpuid;
+ /* 8KB with KVM_MAX_CPUID_ENTRIES. */
+ struct kvm_cpuid_entry2 entries[];
+ };
+ /*
+ * For future extensibility.
+ * The size(struct kvm_tdx_init_vm) = 16KB.
+ * This should be enough given sizeof(TD_PARAMS) = 1024
+ */
+ __u64 reserved[2028];
+ };
+};
+
#endif /* _ASM_X86_KVM_H */
--
2.25.1

2023-10-16 16:34:35

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 010/116] KVM: TDX: Add TDX "architectural" error codes

From: Sean Christopherson <[email protected]>

Add error codes for the TDX SEAMCALLs, both for the TDX VMM side (the TDH
SEAMCALLs) and for the TDX guest side (TDG.VP.VMCALL). KVM issues the TDX
SEAMCALLs and checks their error codes. KVM also handles hypercalls from
the TDX guest and may return an error, so error codes for the TDX guest
are needed as well.

TDX SEAMCALLs use bits 31:0 to return more information, so these error
codes will only exactly match RAX[63:32]. The error codes for
TDG.VP.VMCALL are defined by the TDX Guest-Host-Communication Interface
specification.
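
As a minimal sketch of how these definitions are consumed elsewhere in the
series, callers compare only the status class in bits 63:32 and treat bits
31:0 as detail, e.g. the operand ID:

  u64 err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);

  if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_INVALID) {
          /* Bits 31:0 identify the operand, e.g. TDX_OPERAND_ID_RCX. */
          pr_warn("TDH.MNG.INIT: invalid operand, detail 0x%llx\n",
                  err & ~TDX_SEAMCALL_STATUS_MASK);
  }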

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx_errno.h | 42 ++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..ec76740dc6a1
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural status code for SEAMCALL */
+
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+#define TDX_SEAMCALL_STATUS_MASK 0xFFFFFFFF00000000ULL
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_NON_RECOVERABLE_VCPU 0x4000000100000000ULL
+#define TDX_INTERRUPTED_RESUMABLE 0x8000000300000000ULL
+#define TDX_OPERAND_INVALID 0xC000010000000000ULL
+#define TDX_OPERAND_BUSY 0x8000020000000000ULL
+#define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL
+#define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL
+#define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL
+#define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL
+#define TDX_KEY_CONFIGURED 0x0000081500000000ULL
+#define TDX_NO_HKID_READY_TO_WBCACHE 0x0000082100000000ULL
+#define TDX_FLUSHVP_NOT_DONE 0x8000082400000000ULL
+#define TDX_EPT_WALK_FAILED 0xC0000B0000000000ULL
+#define TDX_EPT_ENTRY_NOT_FREE 0xC0000B0200000000ULL
+
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDG_VP_VMCALL_SUCCESS 0x0000000000000000ULL
+#define TDG_VP_VMCALL_RETRY 0x0000000000000001ULL
+#define TDG_VP_VMCALL_INVALID_OPERAND 0x8000000000000000ULL
+#define TDG_VP_VMCALL_TDREPORT_FAILED 0x8000000000000001ULL
+
+/*
+ * TDX module operand ID, appears in 31:0 part of error code as
+ * detail information
+ */
+#define TDX_OPERAND_ID_RCX 0x01
+#define TDX_OPERAND_ID_SEPT 0x92
+#define TDX_OPERAND_ID_TD_EPOCH 0xa9
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
--
2.25.1

2023-10-16 16:34:42

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 092/116] KVM: TDX: Handle TDX PV HLT hypercall

From: Isaku Yamahata <[email protected]>

Wire up the TDX PV HLT hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 3 +++
2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9430091fec80..2f8bb0e9394e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
- return pi_has_pending_interrupt(vcpu);
+ bool ret = pi_has_pending_interrupt(vcpu);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+ return true;
+
+ if (tdx->interrupt_disabled_hlt)
+ return false;
+
+ /*
+ * This is for the case where the virtual interrupt is recognized,
+ * i.e. set in vmcs.RVI, between the STI and "HLT". KVM doesn't have
+ * access to RVI and the interrupt is no longer in the PID (because it
+ * was "recognized". It doesn't get delivered in the guest because the
+ * TDCALL completes before interrupts are enabled.
+ *
+ * The TDX module sets RVI while in an STI interrupt shadow.
+ * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
+ * The interrupt shadow at this point is gone.
+ * - It knows that there is an interrupt that can be delivered
+ * (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
+ * matter)
+ * - It forwards the TDExit nevertheless, to a clueless hypervisor that
+ * has no way to glean either RVI or PPR.
+ */
+ return !!xchg(&tdx->buggy_hlt_workaround, 0);
}

void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -1104,6 +1129,17 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ /* See tdx_protected_apic_has_interrupt() to avoid heavy seamcall */
+ tdx->interrupt_disabled_hlt = tdvmcall_a0_read(vcpu);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return kvm_emulate_halt_noskip(vcpu);
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1112,6 +1148,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
switch (tdvmcall_leaf(vcpu)) {
case EXIT_REASON_CPUID:
return tdx_emulate_cpuid(vcpu);
+ case EXIT_REASON_HLT:
+ return tdx_emulate_hlt(vcpu);
default:
break;
}
@@ -1504,6 +1542,8 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
struct kvm_vcpu *vcpu = apic->vcpu;
struct vcpu_tdx *tdx = to_tdx(vcpu);

+ /* See comment in tdx_protected_apic_has_interrupt(). */
+ tdx->buggy_hlt_workaround = 1;
/* TDX supports only posted interrupt. No lapic emulation. */
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 870a0d15e073..4ddcf804c0a4 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -101,6 +101,9 @@ struct vcpu_tdx {
bool host_state_need_restore;
u64 msr_host_kernel_gs_base;

+ bool interrupt_disabled_hlt;
+ unsigned int buggy_hlt_workaround;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
--
2.25.1

2023-10-16 16:34:43

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 081/116] KVM: x86: Split core of hypercall emulation to helper function

From: Sean Christopherson <[email protected]>

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
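
As a hedged sketch (the concrete TDX caller appears later in the series),
a user of the new helper with a different register ABI gathers the
hypercall number and arguments itself and passes them in directly,
together with the 64-bit flag and the CPL it already knows:

  /* Sketch: arguments come from a non-VMX register convention. */
  unsigned long nr = kvm_r10_read(vcpu);
  unsigned long a0 = kvm_r11_read(vcpu);
  unsigned long a1 = kvm_r12_read(vcpu);
  unsigned long a2 = kvm_r13_read(vcpu);
  unsigned long a3 = kvm_r14_read(vcpu);
  unsigned long ret;

  /* The guest runs in 64-bit mode at CPL 0 under this ABI. */
  ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true, 0);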

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++-----------
2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fcd1e2d0dc92..641f769b30d1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2135,6 +2135,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
}

+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit, int cpl);
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);

int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a75b0b9ce9bf..d68e99f4f5ea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9848,26 +9848,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}

-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit, int cpl)
{
- unsigned long nr, a0, a1, a2, a3, ret;
- int op_64_bit;
-
- if (kvm_xen_hypercall_enabled(vcpu->kvm))
- return kvm_xen_hypercall(vcpu);
-
- if (kvm_hv_hypercall_enabled(vcpu))
- return kvm_hv_hypercall(vcpu);
-
- nr = kvm_rax_read(vcpu);
- a0 = kvm_rbx_read(vcpu);
- a1 = kvm_rcx_read(vcpu);
- a2 = kvm_rdx_read(vcpu);
- a3 = kvm_rsi_read(vcpu);
+ unsigned long ret;

trace_kvm_hypercall(nr, a0, a1, a2, a3);

- op_64_bit = is_64_bit_hypercall(vcpu);
if (!op_64_bit) {
nr &= 0xFFFFFFFF;
a0 &= 0xFFFFFFFF;
@@ -9876,7 +9865,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
a3 &= 0xFFFFFFFF;
}

- if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+ if (cpl) {
ret = -KVM_EPERM;
goto out;
}
@@ -9937,18 +9926,49 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)

WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+ /* stat is incremented on completion. */
return 0;
}
default:
ret = -KVM_ENOSYS;
break;
}
+
out:
+ ++vcpu->stat.hypercalls;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+ int op_64_bit;
+ int cpl;
+
+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
+ return kvm_xen_hypercall(vcpu);
+
+ if (kvm_hv_hypercall_enabled(vcpu))
+ return kvm_hv_hypercall(vcpu);
+
+ nr = kvm_rax_read(vcpu);
+ a0 = kvm_rbx_read(vcpu);
+ a1 = kvm_rcx_read(vcpu);
+ a2 = kvm_rdx_read(vcpu);
+ a3 = kvm_rsi_read(vcpu);
+ op_64_bit = is_64_bit_hypercall(vcpu);
+ cpl = static_call(kvm_x86_get_cpl)(vcpu);
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
+ /* MAP_GPA tosses the request to the user space. */
+ return 0;
+
if (!op_64_bit)
ret = (u32)ret;
kvm_rax_write(vcpu, ret);

- ++vcpu->stat.hypercalls;
return kvm_skip_emulated_instruction(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
--
2.25.1

2023-10-16 16:35:05

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 089/116] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL

From: Isaku Yamahata <[email protected]>

The TDX Guest-Host Communication Interface (GHCI) specification defines
the ABI for the guest TD to issue hypercalls. It reserves vendor-specific
arguments for VMM-specific use. Use them as the KVM hypercall and handle
it.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index fa1cf0a8b5f9..89e92c696760 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -982,8 +982,41 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
return 0;
}

+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+
+ /*
+ * ABI for KVM tdvmcall argument:
+ * In the Guest-Hypervisor Communication Interface (GHCI) specification,
+ * Non-zero leaf number (R10 != 0) is defined to indicate
+ * vendor-specific. KVM uses this for KVM hypercall. NOTE: KVM
+ * hypercall number starts from one. Zero isn't used for KVM hypercall
+ * number.
+ *
+ * R10: KVM hypercall number
+ * arguments: R11, R12, R13, R14.
+ */
+ nr = kvm_r10_read(vcpu);
+ a0 = kvm_r11_read(vcpu);
+ a1 = kvm_r12_read(vcpu);
+ a2 = kvm_r13_read(vcpu);
+ a3 = kvm_r14_read(vcpu);
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true, 0);
+
+ tdvmcall_set_return_code(vcpu, ret);
+
+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
+ return 0;
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
+ if (tdvmcall_exit_type(vcpu))
+ return tdx_emulate_vmcall(vcpu);
+
switch (tdvmcall_leaf(vcpu)) {
default:
break;
--
2.25.1

2023-10-16 16:35:09

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 034/116] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask/shadow_present_mask

From: Isaku Yamahata <[email protected]>

To use the same values of shadow_mmio_mask and shadow_present_mask for
TDX and VMX, add the Suppress-VE bit to shadow_mmio_mask and
shadow_present_mask so that they can be common to both VMX and TDX.

TDX will require shadow_mmio_mask and shadow_present_mask to include
VMX_SUPPRESS_VE for shared GPAs so that an EPT violation is triggered for
a shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because
the SPTE value is required to cause an EPT misconfig; the additional bit
doesn't affect the VMX logic that adds the bit to shadow_mmio_{value, mask}.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/mmu/spte.c | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..76ed39541a52 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -513,6 +513,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..02a466de2991 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -429,7 +429,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
- shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
/*
* EPT overrides the host MTRRs, and so KVM must program the desired
* memtype directly into the SPTEs. Note, this mask is just the mask
@@ -446,7 +448,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
* of an EPT paging-structure entry is 110b (write/execute).
*/
kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
- VMX_EPT_RWX_MASK, 0);
+ VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);

--
2.25.1

2023-10-16 16:35:14

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 024/116] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch series for TD vcpu
creation/destruction.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 098150da6ea2..25082e9c0b20 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -9,7 +9,7 @@ Layer status
What qemu can do
----------------
- TDX VM TYPE is exposed to Qemu.
-- Qemu can try to create VM of TDX VM type and then fails.
+- Qemu can create/destroy a guest of the TDX VM type.

Patch Layer status
------------------
@@ -17,8 +17,8 @@ Patch Layer status

* TDX, VMX coexistence: Applied
* TDX architectural definitions: Applied
-* TD VM creation/destruction: Applying
-* TD vcpu creation/destruction: Not yet
+* TD VM creation/destruction: Applied
+* TD vcpu creation/destruction: Applying
* TDX EPT violation: Not yet
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
--
2.25.1

2023-10-16 16:35:18

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 035/116] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis

From: Isaku Yamahata <[email protected]>

TDX will use a shadow PTE value for MMIO that is different from VMX's.
Add a member to kvm_arch and track the MMIO value per VM instead of in a
global variable. By using a per-VM EPT entry value for MMIO, the existing
VMX logic keeps working. Introduce a separate setter function so that a
guest TD can override the value later.

Also require MMIO SPTE caching for TDX. This is in fact always the case
because TDX requires EPT, and with EPT KVM allows MMIO SPTE caching.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 7 ++++---
arch/x86/kvm/mmu/spte.c | 10 ++++++++--
arch/x86/kvm/mmu/spte.h | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
6 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 742e97f23573..a2fd25fb2f9c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1280,6 +1280,8 @@ struct kvm_arch {
*/
spinlock_t mmu_unsync_pages_lock;

+ u64 shadow_mmio_value;
+
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
bool iommu_noncoherent;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index f5ba6cf589aa..c30fefa39bb4 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -101,6 +101,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)
}

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8cf3ac95bb30..469e73283824 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2519,7 +2519,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
return kvm_mmu_prepare_zap_page(kvm, child,
invalid_list);
}
- } else if (is_mmio_spte(pte)) {
+ } else if (is_mmio_spte(kvm, pte)) {
mmu_spte_clear_no_track(spte);
}
return 0;
@@ -4180,7 +4180,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
if (WARN_ON_ONCE(reserved))
return -EINVAL;

- if (is_mmio_spte(spte)) {
+ if (is_mmio_spte(vcpu->kvm, spte)) {
gfn_t gfn = get_mmio_spte_gfn(spte);
unsigned int access = get_mmio_spte_access(spte);

@@ -4751,7 +4751,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
- if (unlikely(is_mmio_spte(*sptep))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
if (gfn != get_mmio_spte_gfn(*sptep)) {
mmu_spte_clear_no_track(sptep);
return true;
@@ -6273,6 +6273,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
{
int r;

+ kvm->arch.shadow_mmio_value = shadow_mmio_value;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 02a466de2991..318135daf685 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,10 +74,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

- WARN_ON_ONCE(!shadow_mmio_value);
+ WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);

access &= shadow_mmio_access_mask;
- spte |= shadow_mmio_value | access;
+ spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
@@ -411,6 +411,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+ kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
+
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
{
/* shadow_me_value must be a subset of shadow_me_mask */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 26bc95bbc962..1a163aee9ec6 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -264,9 +264,9 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}

-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
- return (spte & shadow_mmio_mask) == shadow_mmio_value &&
+ return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
likely(enable_mmio_caching);
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4f6f782d9e02..7712b1b86bd2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -522,8 +522,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
- if (WARN_ON_ONCE(!is_mmio_spte(old_spte) &&
- !is_mmio_spte(new_spte) &&
+ if (WARN_ON_ONCE(!is_mmio_spte(kvm, old_spte) &&
+ !is_mmio_spte(kvm, new_spte) &&
!is_removed_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
@@ -1010,7 +1010,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
}

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
- if (unlikely(is_mmio_spte(new_spte))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
--
2.25.1

2023-10-16 16:35:18

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 091/116] KVM: TDX: Handle TDX PV CPUID hypercall

From: Isaku Yamahata <[email protected]>

Wire up the TDX PV CPUID hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index eb6ba2eee16c..9430091fec80 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1084,12 +1084,34 @@ static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
return 0;
}

+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+ u32 eax, ebx, ecx, edx;
+
+ /* EAX and ECX for cpuid are stored in R12 and R13. */
+ eax = tdvmcall_a0_read(vcpu);
+ ecx = tdvmcall_a1_read(vcpu);
+
+ kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
+
+ tdvmcall_a0_write(vcpu, eax);
+ tdvmcall_a1_write(vcpu, ebx);
+ tdvmcall_a2_write(vcpu, ecx);
+ tdvmcall_a3_write(vcpu, edx);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
return tdx_emulate_vmcall(vcpu);

switch (tdvmcall_leaf(vcpu)) {
+ case EXIT_REASON_CPUID:
+ return tdx_emulate_cpuid(vcpu);
default:
break;
}
--
2.25.1

2023-10-16 16:35:24

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 102/116] KVM: TDX: Add methods to ignore accesses to CPU state

From: Isaku Yamahata <[email protected]>

TDX protects TDX guest state from the VMM. Implement the access methods
for TDX guest state so that they ignore accesses or return zero. Because
those methods can be called by KVM ioctls to set/get CPU registers, they
don't use KVM_BUG_ON(), except for one method.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 289 +++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 48 +++++-
arch/x86/kvm/vmx/x86_ops.h | 13 ++
3 files changed, 321 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index dd050a6196ae..582fea1b60b6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -386,6 +386,200 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
kvm_vcpu_deliver_init(vcpu);
}

+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_segment_base(vcpu, seg);
+
+ return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_get_segment(vcpu, var, seg);
+ return;
+ }
+
+ vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_cpl(vcpu);
+
+ return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+ if (is_td_vcpu(vcpu)) {
+ *db = 0;
+ *l = 0;
+ return;
+ }
+
+ vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static bool vt_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_is_valid_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr0(vcpu, cr0);
+}
+
+static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_is_valid_cr4(vcpu, cr4);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+ /*
+ * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+ * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+ * reach here for TD vcpu.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_cache_reg(vcpu, reg);
+ return;
+ }
+
+ vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_rflags(vcpu);
+
+ return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_if_flag(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -530,6 +724,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_inject_irq(vcpu, reinjected);
}

+static void vt_inject_exception(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_inject_exception(vcpu);
+}
+
static void vt_cancel_injection(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -576,6 +778,39 @@ static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
}

+
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -639,30 +874,30 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,

- .update_exception_bitmap = vmx_update_exception_bitmap,
+ .update_exception_bitmap = vt_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vt_get_msr,
.set_msr = vt_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .is_valid_cr0 = vmx_is_valid_cr0,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
+ .get_segment_base = vt_get_segment_base,
+ .get_segment = vt_get_segment,
+ .set_segment = vt_set_segment,
+ .get_cpl = vt_get_cpl,
+ .get_cs_db_l_bits = vt_get_cs_db_l_bits,
+ .is_valid_cr0 = vt_is_valid_cr0,
+ .set_cr0 = vt_set_cr0,
+ .is_valid_cr4 = vt_is_valid_cr4,
+ .set_cr4 = vt_set_cr4,
+ .set_efer = vt_set_efer,
+ .get_idt = vt_get_idt,
+ .set_idt = vt_set_idt,
+ .get_gdt = vt_get_gdt,
+ .set_gdt = vt_set_gdt,
+ .set_dr7 = vt_set_dr7,
+ .sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+ .cache_reg = vt_cache_reg,
+ .get_rflags = vt_get_rflags,
+ .set_rflags = vt_set_rflags,
+ .get_if_flag = vt_get_if_flag,

.flush_tlb_all = vt_flush_tlb_all,
.flush_tlb_current = vt_flush_tlb_current,
@@ -679,7 +914,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
- .inject_exception = vmx_inject_exception,
+ .inject_exception = vt_inject_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
.nmi_allowed = vt_nmi_allowed,
@@ -687,11 +922,11 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_nmi_mask = vt_set_nmi_mask,
.enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
+ .update_cr8_intercept = vt_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .load_eoi_exitmap = vt_load_eoi_exitmap,
.apicv_post_state_restore = vt_apicv_post_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
.hwapic_irr_update = vmx_hwapic_irr_update,
@@ -702,13 +937,13 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,

- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
+ .set_tss_addr = vt_set_tss_addr,
+ .set_identity_map_addr = vt_set_identity_map_addr,
.get_mt_mask = vt_get_mt_mask,

.get_exit_info = vt_get_exit_info,

- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+ .vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,

.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 14ab1450dda4..1628e54d816b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -613,8 +613,15 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
- vcpu->arch.guest_state_protected =
- !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ /*
+ * TODO: support off-TD debug. If TD DEBUG is enabled, guest state
+ * can be accessed, i.e. guest_state_protected = false, and the kvm ioctls
+ * to access CPU state should be usable for the user space VMM (e.g. qemu).
+ *
+ * vcpu->arch.guest_state_protected =
+ * !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ */
+ vcpu->arch.guest_state_protected = true;

if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
vcpu->arch.xfd_no_write_intercept = true;
@@ -2071,6 +2078,43 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ kvm_register_mark_available(vcpu, reg);
+ switch (reg) {
+ case VCPU_REGS_RSP:
+ case VCPU_REGS_RIP:
+ case VCPU_EXREG_PDPTR:
+ case VCPU_EXREG_CR0:
+ case VCPU_EXREG_CR3:
+ case VCPU_EXREG_CR4:
+ break;
+ default:
+ KVM_BUG_ON(1, vcpu->kvm);
+ break;
+ }
+}
+
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ return 0;
+}
+
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+ memset(var, 0, sizeof(*var));
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 092676b50dc2..8dacab1bc3d7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -169,6 +169,12 @@ bool tdx_has_emulated_msr(u32 index, bool write);
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);

+int tdx_get_cpl(struct kvm_vcpu *vcpu);
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

void tdx_flush_tlb(struct kvm_vcpu *vcpu);
@@ -214,6 +220,13 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }

+static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
+static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0; }
+static inline void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg) {}
+
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
--
2.25.1

2023-10-16 16:35:29

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 079/116] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument

From: Sean Christopherson <[email protected]>

TDX uses a different ABI to get information about VM exits. Pass
intr_info to the NMI and INTR handlers instead of pulling it from
vcpu_vmx, in preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to the VMM, RAX holds the status and exit reason,
RCX holds the exit qualification, etc., rather than the VMCS fields,
because the VMM doesn't have access to the VMCS. The eventual code will
be:

VMX:
- get exit reason, intr_info, exit_qualification, and etc from VMCS
- call NMI/INTR handlers (common code)

TDX:
- get exit reason, intr_info, exit_qualification, and etc from guest
registers
- call NMI/INTR handlers (common code)

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 484c3c7b46c8..b1c36525a81a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6913,24 +6913,22 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}

-static void handle_exception_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
-
/* if exit due to PF check for async PF */
if (is_page_fault(intr_info))
- vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
/* if exit due to NM, handle before interrupts are enabled */
else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(&vmx->vcpu);
+ handle_nm_fault_irqoff(vcpu);
/* Handle machine checks before interrupts are enabled */
else if (is_machine_check(intr_info))
kvm_machine_check();
}

-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
gate_desc *desc = (gate_desc *)host_idt_base + vector;

@@ -6953,9 +6951,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;

if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu);
+ handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_irqoff(vmx);
+ handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
}

/*
--
2.25.1

2023-10-16 16:35:38

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 086/116] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT

From: Isaku Yamahata <[email protected]>

Because guest TD state is protected, exceptions in guest TDs can't be
intercepted. The TDX VMM doesn't need to handle exceptions.
tdx_handle_exit_irqoff() handles NMI and machine check; ignore NMI and
machine check and continue guest TD execution.

For external interrupts, increment the stats, the same as in the VMX case.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f98b27cb9003..6bed75c4069c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -892,6 +892,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
}

+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
+{
+ u32 intr_info = tdexit_intr_info(vcpu);
+
+ if (is_nmi(intr_info) || is_machine_check(intr_info))
+ return 1;
+
+ kvm_pr_unimpl("unexpected exception 0x%x(exit_reason 0x%llx qual 0x%lx)\n",
+ intr_info,
+ to_tdx(vcpu)->exit_reason.full, tdexit_exit_qual(vcpu));
+ return -EFAULT;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.irq_exits;
+ return 1;
+}
+
static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
{
vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -1381,6 +1400,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
+ case EXIT_REASON_EXCEPTION_NMI:
+ return tdx_handle_exception(vcpu);
+ case EXIT_REASON_EXTERNAL_INTERRUPT:
+ return tdx_handle_external_interrupt(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_handle_ept_violation(vcpu);
case EXIT_REASON_EPT_MISCONFIG:
--
2.25.1

2023-10-16 16:35:43

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 111/116] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)

From: Isaku Yamahata <[email protected]>

Add documentation for Intel Trust Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/api.rst | 9 +-
Documentation/virt/kvm/x86/index.rst | 1 +
Documentation/virt/kvm/x86/intel-tdx.rst | 362 +++++++++++++++++++++++
3 files changed, 371 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virt/kvm/x86/intel-tdx.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5e08f2a157ef..73d26dd62ddd 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1373,6 +1373,9 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).

+For a TDX guest, deleting/moving a memory region loses the guest memory contents.
+Read-only regions aren't supported. Only as-id 0 is supported.
+
Note: On arm64, a write generated by the page-table walker (to update
the Access and Dirty flags, for example) never results in a
KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
@@ -4692,7 +4695,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.

:Capability: basic
:Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error

@@ -4704,6 +4707,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virt/kvm/x86/amd-memory-encryption.rst.

+Currently, this ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/x86/intel-tdx.rst.
+
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..851e99174762 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -11,6 +11,7 @@ KVM for x86 systems
cpuid
errata
hypercalls
+ intel-tdx
mmu
msr
nested-vmx
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
new file mode 100644
index 000000000000..a1b10e99c1ff
--- /dev/null
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -0,0 +1,362 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. For details, see the specifications [1]_, whitepaper [2]_,
+architectural extensions specification [3]_, module documentation [4]_,
+loader interface specification [5]_, guest-hypervisor communication
+interface [6]_, virtual firmware design guide [7]_, and other resources
+([8]_, [9]_, [10]_, [11]_, and [12]_).
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+ /* Trust Domain eXtension sub-ioctl() commands. */
+ enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+
+ KVM_TDX_CMD_NR_MAX,
+ };
+
+ struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+ /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+ __u64 unused;
+ };
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: vm ioctl
+
+A subset of TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAM call will be
+returned, which describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- flags: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+ };
+
+ struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+ #define TDX_CAP_GPAW_48 (1 << 0)
+ #define TDX_CAP_GPAW_52 (1 << 1)
+ __u32 supported_gpaw;
+ __u32 padding;
+ __u64 reserved[251];
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[];
+ };
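+
+As an illustrative userspace sketch (not part of the ABI description), the
+capabilities can be queried roughly as follows; vm_fd is the VM file
+descriptor and the number of cpuid_configs slots is an arbitrary guess:
+
+::
+
+    /* needs <sys/ioctl.h>, <linux/kvm.h>, <stdlib.h>, <stdio.h>, <err.h> */
+    struct kvm_tdx_capabilities *caps;
+    struct kvm_tdx_cmd cmd = { .id = KVM_TDX_CAPABILITIES };
+    int nr = 32;        /* assumed upper bound for the flexible array */
+
+    caps = calloc(1, sizeof(*caps) + nr * sizeof(caps->cpuid_configs[0]));
+    caps->nr_cpuid_configs = nr;
+    cmd.data = (__u64)(unsigned long)caps;
+
+    if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
+        err(1, "KVM_TDX_CAPABILITIES");
+    printf("supported_gpaw: 0x%x\n", caps->supported_gpaw);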
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX which corresponds to
+TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- flags: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_init_vm {
+ __u64 attributes;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ __u64 reserved[1004]; /* must be zero for future extensibility */
+
+ struct kvm_cpuid2 cpuid;
+ };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX which corresponds to
+TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- flags: must be 0
+- data: initial value of the guest TD VCPU RCX
+- error: must be 0
+- unused: must be 0
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a contiguous memory region, which corresponds to the TDH.MEM.PAGE.ADD
+TDX SEAM call.
+If the KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends the
+measurement, which corresponds to the TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- flags: currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+- error: must be 0
+- unused: must be 0
+
+::
+
+ #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+ struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+ };
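+
+As an illustration (not part of the ABI description), a VMM might add and
+measure an initial region roughly as follows, where vm_fd, hva, gpa and
+npages describe already-populated source memory and the target guest range:
+
+::
+
+    struct kvm_tdx_init_mem_region region = {
+        .source_addr = (__u64)(unsigned long)hva,
+        .gpa = gpa,
+        .nr_pages = npages,
+    };
+    struct kvm_tdx_cmd cmd = {
+        .id = KVM_TDX_INIT_MEM_REGION,
+        .flags = KVM_TDX_MEASURE_MEMORY_REGION,
+        .data = (__u64)(unsigned long)&region,
+    };
+
+    if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
+        err(1, "KVM_TDX_INIT_MEM_REGION");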
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete measurement of the initial TD contents and mark it ready to run,
+which corresponds to TDH.MR.FINALIZE.
+
+- id: KVM_TDX_FINALIZE_VM
+- flags: must be 0
+- data: must be 0
+- error: must be 0
+- unused: must be 0
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, new TDX ioctls need to be called. The
+control flow is as follows (a condensed sketch in C follows the list).
+
+#. system wide capability check
+
+ * KVM_CAP_VM_TYPES: check if VM type is supported and if KVM_X86_TDX_VM
+ is supported.
+
+#. creating VM
+
+ * KVM_CREATE_VM
+ * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+ * KVM_ENABLE_CAP_VM(KVM_CAP_MAX_VCPUS): set max_vcpus. KVM_MAX_VCPUS by
+ default. KVM_MAX_VCPUS is not part of the ABI, but a kernel-internal constant
+ that is subject to change. Because the maximum number of vcpus is part of
+ attestation, it should be set explicitly.
+ * KVM_SET_TSC_KHZ for the VM (optional).
+ * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+
+ * KVM_CREATE_VCPU
+ * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+ * KVM_SET_CPUID2: Enable CPUID[0x1].ECX.X2APIC(bit 21)=1 so that the following
+ setting of MSR_IA32_APIC_BASE succeeds. Without this,
+ KVM_SET_MSRS(MSR_IA32_APIC_BASE) fails.
+ * KVM_SET_MSRS: Set the initial reset value of MSR_IA32_APIC_BASE to
+ APIC_DEFAULT_ADDRESS(0xfee00000) | XAPIC_ENABLE(bit 10) |
+ X2APIC_ENABLE(bit 11) [| MSR_IA32_APICBASE_BSP(bit 8) optional]
+
+#. initializing guest memory
+
+ * allocate guest memory and initialize pages the same as in the normal KVM case.
+ In the TDX case, in addition, parse and load TDVF into guest memory.
+ * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+ If the pages have contents as above, those pages need to be added.
+ Otherwise the contents will be lost and the guest sees zero pages.
+ * KVM_TDX_FINALIZE_VM: Finalize the VM and its measurement.
+ This must be after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
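+The following is a heavily condensed, illustrative C sketch of the flow above.
+Error handling and TDVF loading are omitted, and tdx_ioctl() is a hypothetical
+helper that fills struct kvm_tdx_cmd and issues KVM_MEMORY_ENCRYPT_OP on the
+given fd:
+
+::
+
+    vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
+    tdx_ioctl(vm_fd, KVM_TDX_CAPABILITIES, 0, &caps);
+    ioctl(vm_fd, KVM_ENABLE_CAP, &cap_max_vcpus);
+    tdx_ioctl(vm_fd, KVM_TDX_INIT_VM, 0, &init_vm);
+
+    vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
+    tdx_ioctl(vcpu_fd, KVM_TDX_INIT_VCPU, 0, 0);
+    /* KVM_SET_CPUID2 and KVM_SET_MSRS(MSR_IA32_APIC_BASE) as described above */
+
+    /* load TDVF/initial image into guest memory, then: */
+    tdx_ioctl(vm_fd, KVM_TDX_INIT_MEM_REGION,
+              KVM_TDX_MEASURE_MEMORY_REGION, &region);
+    tdx_ioctl(vm_fd, KVM_TDX_FINALIZE_VM, 0, 0);
+
+    ioctl(vcpu_fd, KVM_RUN, 0);
+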
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy (normal VMX) VMs and new TD VMs to
+coexist. Otherwise the benefits of VM flexibility would be eliminated.
+The main issue is that the logic of the kvm_x86_ops callbacks for TDX is
+different from VMX, while kvm_x86_ops is a single global variable, not
+per-VM or per-vcpu.
+
+Several points to be considered:
+
+ * No or minimal overhead when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+ * Avoid the overhead of indirect calls via function pointers.
+ * Contain the changes under the arch/x86/kvm/vmx directory and share logic
+ with VMX for maintenance.
+ Even though the ways to operate on a VM (VMX instructions vs TDX
+ SEAM calls) are different, the basic idea remains the same, so much of
+ the logic can be shared.
+ * Future maintenance
+ A huge change to kvm_x86_ops isn't expected in the (near) future;
+ a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+
+ Introduce a dedicated file, arch/x86/kvm/vmx/main.c (the name, main.c,
+ is just chosen to show the main entry points for the callbacks), and
+ wrapper functions around all the callbacks with
+ "if (is-tdx) tdx-callback() else vmx-callback()", as sketched below.
+
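+ For example, the vm_init callback is wrapped as follows (taken from the VM
+ creation/destruction patch later in this series)::
+
+     static int vt_vm_init(struct kvm *kvm)
+     {
+         if (is_td(kvm))
+             return tdx_vm_init(kvm);
+
+         return vmx_vm_init(kvm);
+     }
+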
+ Pros:
+
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+ - When TDX is disabled (CONFIG_INTEL_TDX_HOST=n), the overhead is
+ optimized out.
+ - Micro optimization by avoiding function pointer.
+
+ Cons:
+
+ - A lot of boilerplate in arch/x86/kvm/vmx/main.c.
+
+KVM MMU Changes
+---------------
+The KVM MMU needs to be enhanced to handle Secure/Shared-EPT. The
+high-level execution flow is mostly the same as the normal EPT case:
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution (or emulate MMIO).
+The difference is that the S-EPT is operated on (read/write) via TDX SEAM
+calls, which are expensive, instead of directly reading/writing the EPT entry.
+One bit of the GPA (bit 51 or 47) is repurposed so that it means shared
+with the host (if set to 1) or private to the TD (if cleared to 0).
+
+- The current implementation
+
+ * Reuse the existing MMU code with minimal updates, because the
+ execution flow is mostly the same. But an additional operation, a TDX call
+ for the S-EPT, is needed, so add hooks for it to kvm_x86_ops.
+ * For performance, minimize the TDX SEAM calls to operate on the S-EPT. When
+ getting the corresponding S-EPT pages/entry from a faulting GPA, don't
+ use a TDX SEAM call to read the S-EPT entry. Instead, create a shadow copy
+ in host memory.
+ Repurpose the existing kvm_mmu_page as the shadow copy of the S-EPT and
+ associate the S-EPT with it.
+ * Treat the shared bit as an attribute. Mask/unmask the bit where
+ necessary to keep the existing traversal code working.
+ Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+ for the special case.
+
+ * 0 for the non-TDX case
+ * bit 51 or 47 set for the TDX case.
+
+ Pros:
+
+ - Large code reuse with minimal new hooks.
+ - The execution path is the same.
+
+ Cons:
+
+ - Complicates the existing code.
+ - Repurposing kvm_mmu_page as a shadow of the Secure-EPT can be confusing.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+
+ Although operations for TD VMs aren't necessarily related to memory
+ encryption, define sub operations of KVM_MEMORY_ENCRYPT_OP for TDX specific
+ ioctls.
+
+ Pros:
+
+ - No major change in common x86 KVM code.
+ - Follows the SEV case.
+
+ Cons:
+
+ - The sub operations of KVM_MEMORY_ENCRYPT_OP aren't necessarily memory
+ encryption, but operations on TD VMs.
+
+References
+==========
+
+.. [1] TDX specification
+ https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+
+ * kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+ * TDX guest branch: https://github.com/intel/tdx/tree/guest
+
+.. [9] tdvf
+ https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+ Enable Hardware Isolated VMs
+ https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+ Architectural Extensions for Hardware Virtual Machine Isolation
+ to Advance Confidential Computing in Public Clouds - Ravi Sahita
+ & Jun Nakajima, Intel Corporation
+ https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+ https://lore.kernel.org/all/[email protected]/
--
2.25.1

2023-10-16 16:36:58

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 014/116] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of patch series of TD VM
creation/destruction.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index f11ea701dc19..098150da6ea2 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -16,8 +16,8 @@ Patch Layer status
Patch layer Status

* TDX, VMX coexistence: Applied
-* TDX architectural definitions: Applying
-* TD VM creation/destruction: Not yet
+* TDX architectural definitions: Applied
+* TD VM creation/destruction: Applying
* TD vcpu creation/destruction: Not yet
* TDX EPT violation: Not yet
* TD finalization: Not yet
--
2.25.1

2023-10-16 16:37:03

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 112/116] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU

From: Isaku Yamahata <[email protected]>

Add a high level design document on TDX changes to TDP MMU.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/x86/index.rst | 1 +
Documentation/virt/kvm/x86/tdx-tdp-mmu.rst | 443 +++++++++++++++++++++
2 files changed, 444 insertions(+)
create mode 100644 Documentation/virt/kvm/x86/tdx-tdp-mmu.rst

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 851e99174762..63a78bd41b16 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -16,4 +16,5 @@ KVM for x86 systems
msr
nested-vmx
running-nested-guests
+ tdx-tdp-mmu
timekeeping
diff --git a/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst b/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
new file mode 100644
index 000000000000..49d103720272
--- /dev/null
+++ b/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
@@ -0,0 +1,443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Design of TDP MMU for TDX support
+=================================
+This document describes a (high level) design for TDX support of KVM TDP MMU of
+x86 KVM.
+
+In this document, we use "TD" or "guest TD" to differentiate it from the current
+"VM" (Virtual Machine), which is supported by KVM today.
+
+
+Background of TDX
+=================
+TD private memory is designed to hold TD private content, encrypted by the CPU
+using the TD ephemeral key. An encryption engine holds a table of encryption
+keys, and an encryption key is selected for each memory transaction based on a
+Host Key Identifier (HKID). By design, the host VMM does not have access to the
+encryption keys.
+
+In the first generation of MKTME, HKID is "stolen" from the physical address by
+allocating a configurable number of bits from the top of the physical address.
+The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and
+private HKIDs for SEAM-mode-only accesses. We use 0 for the shared HKID on the
+host so that MKTME can be opaque or bypassed on the host.
+
+During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
+as either shared or private, based on the value of a new SHARED bit in the Guest
+Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
+(Extended Page Table) or "Shared EPT" (in this document), which resides in the
+host VMM memory. The Shared EPT is directly managed by the host VMM - the same
+as with the current VMX. Since guest TDs usually require I/O and the data
+exchange needs to be done via shared memory, KVM needs to use the current
+EPT functionality even for TDs.
+
+The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
+pages are encrypted and integrity-protected with the TD's ephemeral private key.
+Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface
+functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT
+because not all functionalities are available.
+
+Since the execution of such interface functions takes a much longer time than
+accessing memory directly, in KVM we use the existing TDP code to mirror the
+Secure EPT for the TD. There are at least two options today in
+terms of the timing for executing such SEAMCALLs:
+
+1. synchronous, i.e. while walking the TDP page tables, or
+2. post-walk, i.e. record what needs to be done to the real Secure EPT during
+ the walk, and execute SEAMCALLs later.
+
+Option 1 seems more intuitive and simpler, but the Secure EPT
+concurrency rules are different from those of the TDP or EPT. For example,
+MEM.SEPT.RD acquires shared access to the whole Secure EPT tree of the target
+
+Secure EPT(SEPT) operations
+---------------------------
+Secure EPT is an Extended Page Table for GPA-to-HPA translation of TD private
+GPAs. A Secure EPT is designed to be encrypted with the TD's ephemeral private
+key. SEPT pages are allocated by the host VMM via Intel TDX functions, but their
+content is intended to be hidden and is not architectural.
+
+Unlike the conventional EPT, the CPU can't directly read/write its entry.
+Instead, the TDX SEAMCALL API is used. Several SEAMCALLs correspond to
+operations on the EPT entry.
+
+* TDH.MEM.SEPT.ADD():
+
+ Add a secure EPT page to the secure EPT tree. This corresponds to updating
+ a non-leaf EPT entry with the present bit set.
+
+* TDH.MEM.SEPT.REMOVE():
+
+ Remove the secure page from the secure EPT tree. There is no corresponding
+ EPT operation.
+
+* TDH.MEM.SEPT.RD():
+
+ Read the secure EPT entry. This corresponds to reading the EPT entry as
+ memory. Please note that this is much slower than direct memory reading.
+
+* TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
+
+ Add a private page to the secure EPT tree. This corresponds to updating the
+ leaf EPT entry with present bit set.
+
+* TDH.MEM.PAGE.REMOVE():
+
+ Remove a private page from the secure EPT tree. There is no corresponding
+ EPT operation.
+
+* TDH.MEM.RANGE.BLOCK():
+
+ This (mostly) corresponds to clearing the present bit of the leaf EPT entry.
+ Note that the private page is still linked in the secure EPT. To remove it
+ from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() need to
+ be called.
+
+* TDH.MEM.TRACK():
+
+ Increment the TLB epoch counter. This (mostly) corresponds to EPT TLB flush.
+ Note that the private page is still linked in the secure EPT. To remove it
+ from the secure EPT, tdh_mem_page_remove() needs to be called.
+
+
+Adding private page
+-------------------
+The procedure of populating the private page looks as follows.
+
+1. TDH.MEM.SEPT.ADD(512G level)
+2. TDH.MEM.SEPT.ADD(1G level)
+3. TDH.MEM.SEPT.ADD(2M level)
+4. TDH.MEM.PAGE.AUG(4K level)
+
+Those operations correspond to updating the EPT entries.
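+
+Rendered as illustrative pseudo-code (the function names mirror the SEAMCALL
+names; the argument lists are simplified and are not the actual KVM helper
+signatures), where each SEPT.ADD is needed only when that non-leaf entry is
+not yet present:
+
+::
+
+    tdh_mem_sept_add(tdr, gpa, PG_LEVEL_512G, sept_page);  /* fresh page per level */
+    tdh_mem_sept_add(tdr, gpa, PG_LEVEL_1G, sept_page);
+    tdh_mem_sept_add(tdr, gpa, PG_LEVEL_2M, sept_page);
+    tdh_mem_page_aug(tdr, gpa, PG_LEVEL_4K, private_page);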
+
+Dropping private page and TLB shootdown
+---------------------------------------
+The procedure of dropping the private page looks as follows (a condensed
+sketch follows the steps).
+
+1. TDH.MEM.RANGE.BLOCK(4K level)
+
+ This mostly corresponds to clearing the present bit in the EPT entry. This
+ prevents (or blocks) new TLB entries from being created in the future. Note
+ that the private page is still linked in the secure EPT tree and existing
+ TLB entries aren't flushed.
+
+2. TDH.MEM.TRACK(range) and TLB shootdown
+
+ This mostly corresponds to the EPT TLB shootdown. Because all vcpus share
+ the same Secure EPT, all vcpus need to flush their TLBs.
+
+ * TDH.MEM.TRACK(range) by one vcpu. It increments the global internal TLB
+ epoch counter.
+
+ * send IPI to remote vcpus
+ * Other vcpus exit to the VMM from the guest TD and then re-enter via
+ TDH.VP.ENTER().
+ * TDH.VP.ENTER() checks the TLB epoch counter and, if its TLB is old, flushes
+ the TLB.
+
+ Note that only a single vcpu issues tdh_mem_track().
+
+ Note that the private page is still linked in the secure EPT tree, unlike the
+ conventional EPT.
+
+3. TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or
+ TDH.MEM.PAGE.REMOVE()
+
+ There is no corresponding operation in the conventional EPT.
+
+ * When changing the page size (e.g. 4K <-> 2M), TDH.MEM.PAGE.PROMOTE() or
+ TDH.MEM.PAGE.DEMOTE() is used. During those operations, the guest page is
+ kept referenced in the Secure EPT.
+
+ * When migrating a page, TDH.MEM.PAGE.RELOCATE() is used. This requires both
+ a source page and a destination page.
+ * When destroying a TD, TDH.MEM.PAGE.REMOVE() removes the private page from the
+ secure EPT tree. In this case a TLB shootdown is not needed because vcpus
+ don't run any more.
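+
+Rendered as illustrative pseudo-code for the removal case (the function names
+mirror the SEAMCALL names; the argument lists are simplified and are not the
+actual KVM helper signatures):
+
+::
+
+    tdh_mem_range_block(tdr, gpa, PG_LEVEL_4K);
+    tdh_mem_track(tdr);
+    kvm_flush_remote_tlbs(kvm);    /* IPIs; vcpus flush on re-entry via TDH.VP.ENTER() */
+    tdh_mem_page_remove(tdr, gpa, PG_LEVEL_4K);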
+
+The basic idea for TDX support
+==============================
+Because shared EPT is the same as the existing EPT, use the existing logic for
+shared EPT. On the other hand, secure EPT requires additional operations
+instead of directly reading/writing the EPT entry.
+
+On EPT violation, the KVM MMU walks down the EPT tree from the root, determines
+the EPT entry to operate on, and updates the entry. If necessary, a TLB shootdown
+is done. Because it's very slow to directly walk the secure EPT via the TDX
+SEAMCALL TDH.MEM.SEPT.RD(), a mirror of the secure EPT is created and maintained.
+Add hooks to the KVM MMU to reuse the existing code.
+
+EPT violation on shared GPA
+---------------------------
+(1) EPT violation on shared GPA or zapping shared GPA
+ ::
+
+ walk down shared EPT tree (the existing code)
+ |
+ |
+ V
+ shared EPT tree (CPU refers.)
+
+(2) update the EPT entry. (the existing code)
+
+ TLB shootdown in the case of zapping.
+
+
+EPT violation on private GPA
+----------------------------
+(1) EPT violation on private GPA or zapping private GPA
+ ::
+
+ walk down the mirror of secure EPT tree (mostly same as the existing code)
+ |
+ |
+ V
+ mirror of secure EPT tree (KVM MMU software only. reuse of the existing code)
+
+(2) update the (mirrored) EPT entry. (mostly same as the existing code)
+
+(3) call the hooks with what EPT entry is changed
+ ::
+
+ |
+ NEW: hooks in KVM MMU
+ |
+ V
+ secure EPT root(CPU refers)
+
+(4) the TDX backend calls necessary TDX SEAMCALLs to update real secure EPT.
+
+The major modification is to add hooks for the TDX backend for additional
+operations and to pass down which EPT, shared EPT, or private EPT is used, and
+twist the behavior if we're operating on private EPT.
+
+The following depicts the relationship.
+::
+
+ KVM | TDX module
+ | | |
+ -------------+---------- | |
+ | | | |
+ V V | |
+ shared GPA private GPA | V
+ CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
+ | | | |
+ | | | |
+ V V | V
+ shared EPT private EPT<-------mirror----->Secure EPT
+ | | | |
+ | \--------------------+------\ |
+ | | | |
+ V | V V
+ shared guest page | private guest page
+ |
+ |
+ non-encrypted memory | encrypted memory
+ |
+
+shared EPT: CPU and KVM walk with shared GPA
+ Maintained by the existing code
+private EPT: KVM walks with private GPA
+ Maintained by the twisted existing code
+secure EPT: CPU walks with private GPA.
+ Maintained by TDX module with TDX SEAMCALLs via hooks
+
+
+Tracking private EPT page
+=========================
+Shared EPT pages are managed by struct kvm_mmu_page. They are linked in a list
+structure. When necessary, the list is traversed to operate on. Private EPT
+pages have different characteristics. For example, private pages can't be
+swapped out. When shrinking memory, we'd like to traverse only shared EPT pages
+and skip private EPT pages. Likewise, page migration isn't supported for
+private pages (yet). Introduce an additional list to track shared EPT pages and
+track private EPT pages independently.
+
+At the beginning of an EPT violation, the fault handler knows the faulting GPA,
+thus it knows which EPT to operate on, private or shared. If it's the private
+EPT, an additional task is done. Something like "if (private) { callback a hook }".
+Since the fault handler has deep function calls, it's cumbersome to carry the
+information of which EPT is being operated on. Options to mitigate it are:
+
+1. Pass the information as an argument for the function call.
+2. Record the information in struct kvm_mmu_page somehow.
+3. Record the information in vcpu structure.
+
+Option 2 was chosen, because option 1 requires modifying all the functions and
+would badly affect the normal case. Option 3 doesn't work well because in
+some cases we need to walk both private and shared EPT.
+
+The role of the EPT page can be utilized, and one bit can be carved out from
+unused bits in struct kvm_mmu_page_role. When allocating the EPT page,
+initialize the information. Mostly, struct kvm_mmu_page is available because
+we're operating on EPT pages.
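+
+A minimal sketch of this idea follows; the exact bit name and helper are
+illustrative here, not necessarily the final KVM code:
+
+::
+
+    /*
+     * Carve one unused bit out of the 32-bit kvm_mmu_page_role to record
+     * whether the page maps private GPAs.
+     */
+    union kvm_mmu_page_role {
+        u32 word;
+        struct {
+            /* ... existing role bits ... */
+            unsigned is_private:1;
+        };
+    };
+
+    static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+    {
+        return sp->role.is_private;
+    }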
+
+
+The conversion of private GPA and shared GPA
+============================================
+A page of a given GPA can be assigned to only a private GPA xor a shared GPA at
+one time. (This is a restriction of the KVM implementation to avoid doubling
+guest memory usage, not of the TDX architecture.) The GPA can't be accessed
+simultaneously via both the private GPA and the shared GPA. On guest startup,
+all the GPAs are assigned as private. The guest converts a range of GPAs from
+private (or shared) to shared (or private) by the MapGPA hypercall. The MapGPA
+hypercall takes the start GPA and the size of the region. If the given start
+GPA is shared (shared bit set), the VMM converts the region into shared (if
+it's already shared, nop).
+
+If the guest TD triggers an EPT violation on the already converted region,
+i.e. an EPT violation on a private (or shared) GPA when the page is shared (or
+private), the access won't be allowed. KVM_EXIT_MEMORY_FAULT is triggered. The
+user space VMM will decide how to handle it.
+
+If the guest accesses a private (or shared) GPA after the conversion to shared
+(or private), the following sequence will be observed:
+
+1. MapGPA(shared GPA: shared bit set) hypercall
+2. KVM causes KVM_TDX_EXIT with the hypercall to the user space VMM.
+3. The user space VMM converts the GPA with KVM_SET_MEMORY_ATTRIBUTES(shared).
+4. The user space VMM resumes vcpu execution with KVM_VCPU_RUN
+5. Guest TD accesses private GPA (shared bit cleared)
+6. KVM gets EPT violation on private GPA (shared bit cleared)
+7. KVM finds the GPA was set to be shared in the xarray while the faulting GPA
+ is private (shared bit cleared)
+8. KVM_EXIT_MEMORY_FAULT. The user space VMM, e.g. qemu, decides what to do.
+ Typically it requests KVM conversion of the GPA without a MapGPA hypercall.
+9. KVM converts the GPA from shared to private with
+ KVM_SET_MEMORY_ATTRIBUTES(private).
+10. Resume vcpu execution
+
+At step 9, instead of converting, the user space VMM may think such a memory
+access is due to a race and let the vcpu resume without conversion, with the
+expectation that another vcpu issues MapGPA. Or the user space VMM may think
+such a memory access is suspicious and that the guest is trying to attack the
+VMM; it may throttle vcpu execution as mitigation, or finally kill such a
+guest. Or the user space VMM may think it's a bug of the guest TD and kill the
+guest TD.
+
+This sequence is not efficient. The guest TD shouldn't access a private (or
+shared) GPA after converting the GPA to shared (or private). Although KVM can
+handle it, it's sub-optimal and won't be optimized.
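+
+To make step 8 above concrete, a hedged sketch of the user space handling
+follows; the kvm_run field names and the attribute flags are those of the
+guest_memfd/memory-attributes series and are illustrative here:
+
+::
+
+    case KVM_EXIT_MEMORY_FAULT: {
+        struct kvm_memory_attributes attr = {
+            .address = run->memory_fault.gpa,
+            .size = run->memory_fault.size,
+            .attributes = (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
+                          KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
+        };
+
+        /* Convert the range, then re-enter the vcpu with KVM_RUN. */
+        ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+        break;
+    }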
+
+The original TDP MMU and race condition
+=======================================
+Because vcpus share the EPT, once an EPT entry is zapped, we need to shoot down
+the TLB: send IPIs to remote vcpus, and the remote vcpus flush their own TLBs.
+Until the TLB shootdown is done, vcpus may reference the zapped guest page.
+
+The TDP MMU uses the read lock of mmu_lock to mitigate vcpu contention. When the
+read lock is held, it relies on atomic updates of the EPT entry. (The legacy
+MMU, on the other hand, uses the write lock.) When a vcpu is populating/zapping
+an EPT entry with the read lock held, another vcpu may be populating or zapping
+the same EPT entry at the same time.
+
+To avoid the race condition, the entry is frozen: the EPT entry is set
+to the special value REMOVED_SPTE, which clears the present bit. Then, after the
+TLB shootdown, the EPT entry is updated to the final value.
+
+Concurrent zapping
+------------------
+1. read lock
+2. freeze the EPT entry (atomically set the value to REMOVED_SPTE)
+ If another vcpu froze the entry, restart the page fault.
+3. TLB shootdown
+
+ * send IPI to remote vcpus
+ * TLB flush (local and remote)
+
+ For each entry update, TLB shootdown is needed because of the
+ concurrency.
+4. atomically set the EPT entry to the final value
+5. read unlock
+
+Concurrent populating
+---------------------
+In the case of populating the non-present EPT entry, atomically update the EPT
+entry.
+
+1. read lock
+
+2. atomically update the EPT entry
+ If another vcpu froze or updated the entry, restart the page fault.
+
+3. read unlock
+
+In the case of updating the present EPT entry (e.g. page migration), the
+operation is split into two. Zapping the entry and populating the entry.
+
+1. read lock
+2. zap the EPT entry. follow the concurrent zapping case.
+3. populate the non-present EPT entry.
+4. read unlock
+
+Non-concurrent batched zapping
+------------------------------
+In some cases, zapping the ranges is done exclusively with a write lock held.
+In this case, the TLB shootdown is batched into one.
+
+1. write lock
+2. zap the EPT entries by traversing them
+3. TLB shootdown
+4. write unlock
+
+For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored
+EPT entry.
+
+TDX concurrent zapping
+----------------------
+Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
+
+1. read lock
+2. freeze the EPT entry (set the value to REMOVED_SPTE)
+3. TLB shootdown via a hook
+
+ * TDH.MEM.RANGE.BLOCK()
+ * TDH.MEM.TRACK()
+ * send IPI to remote vcpus
+
+4. set the EPT entry to the final value
+5. read unlock
+
+TDX concurrent populating
+-------------------------
+TDX SEAMCALLs are required in addition to operating on the mirrored EPT entry.
+The frozen entry is utilized, following the zapping case, to avoid the race
+condition. A hook can be added.
+
+1. read lock
+2. freeze the EPT entry
+3. hook
+
+ * TDH_MEM_SEPT_ADD() for non-leaf or TDH_MEM_PAGE_AUG() for leaf.
+
+4. set the EPT entry to the final value
+5. read unlock
+
+Without freezing the entry, the following race can happen. Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
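+
+With the entry frozen as in the TDX concurrent populating steps above, the
+update would instead look roughly like the following sketch (the hook name is
+illustrative, not the actual KVM function):
+
+::
+
+    u64 old = READ_ONCE(*sptep);
+
+    if (!try_cmpxchg64(sptep, &old, REMOVED_SPTE))
+        return RET_PF_RETRY;            /* another vcpu owns this entry */
+
+    /* SEAMCALL hook: TDH.MEM.SEPT.ADD (non-leaf) or TDH.MEM.PAGE.AUG (leaf) */
+    set_private_spte_hook(kvm, gfn, level, pfn);
+
+    WRITE_ONCE(*sptep, new_spte);       /* publish the final value */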
+
+
+TDX non-concurrent batched zapping
+----------------------------------
+For simplicity, the procedure of concurrent populating is utilized. The
+procedure can be optimized later.
+
+
+Co-existing with unmapping guest private memory
+===============================================
+TODO. This needs to be addressed.
+
+
+Restrictions or future work
+===========================
+The following features aren't supported yet.
+
+* optimizing non-concurrent zap
+* Large page
+* Page migration
--
2.25.1

2023-10-16 16:37:05

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 020/116] KVM: TDX: create/destroy VM structure

From: Isaku Yamahata <[email protected]>

As the first step to create a TDX guest, create/destroy the VM struct. Assign
a TDX private Host Key ID (HKID) to the TDX guest for memory encryption and
allocate extra pages for the TDX guest. On destruction, free the allocated
pages and the HKID.

Before tearing down private page tables, TDX requires some resources of the
guest TD to be destroyed (i.e. the HKID must have been reclaimed, etc.). Add
an mmu notifier release callback before tearing down private page tables for
this.

Add a vm_free() kvm_x86_ops hook at the end of kvm_arch_destroy_vm()
because some per-VM TDX resources, e.g. the TDR, need to be freed after other
TDX resources, e.g. the HKID, are freed.

Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
v16:
- Simplified tdx_reclaim_page()
- Reorganize the locking of tdx_release_hkid(), and use smp_call_mask()
instead of smp_call_on_cpu() to hold spinlock to race with invalidation
on releasing guest memfd
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/Kconfig | 2 +
arch/x86/kvm/mmu/mmu.c | 7 +
arch/x86/kvm/vmx/main.c | 35 ++-
arch/x86/kvm/vmx/tdx.c | 471 ++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 6 +-
arch/x86/kvm/vmx/x86_ops.h | 8 +
arch/x86/kvm/x86.c | 1 +
9 files changed, 528 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 214dcaa3b384..d5a900842535 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -24,7 +24,9 @@ KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP_OPTIONAL(max_vcpus);
KVM_X86_OP_OPTIONAL(vm_enable_cap)
KVM_X86_OP(vm_init)
+KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
KVM_X86_OP_OPTIONAL(vm_destroy)
+KVM_X86_OP_OPTIONAL(vm_free)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
KVM_X86_OP(vcpu_create)
KVM_X86_OP(vcpu_free)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9dde92c8ee2f..f23557850c27 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1570,7 +1570,9 @@ struct kvm_x86_ops {
unsigned int vm_size;
int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
int (*vm_init)(struct kvm *kvm);
+ void (*flush_shadow_all_private)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
+ void (*vm_free)(struct kvm *kvm);

/* Create, but do not attach this VCPU */
int (*vcpu_precreate)(struct kvm *kvm);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8452ed0228cb..4a864dbaa982 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -92,6 +92,8 @@ config KVM_SW_PROTECTED_VM
config KVM_INTEL
tristate "KVM for Intel (and compatible) processors support"
depends on KVM && IA32_FEAT_CTL
+ select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
+ select KVM_PRIVATE_MEM if INTEL_TDX_HOST
help
Provides support for KVM on processors equipped with Intel's VT
extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bffa50927739..c3d0f8461e1c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6791,6 +6791,13 @@ static void kvm_mmu_zap_all(struct kvm *kvm)

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
+ /*
+ * kvm_mmu_zap_all() zaps both private and shared page tables. Before
+ * tearing down private page tables, TDX requires some TD resources to
+ * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
+ * kvm_x86_flush_shadow_all_private() for this.
+ */
+ static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
kvm_mmu_zap_all(kvm);
}

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5579557fa6f6..463ba7214392 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -63,14 +63,41 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
return -EINVAL;
}

+static void vt_hardware_unsetup(void)
+{
+ if (enable_tdx)
+ tdx_hardware_unsetup();
+ vmx_hardware_unsetup();
+}
+
static int vt_vm_init(struct kvm *kvm)
{
if (is_td(kvm))
- return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
+ return tdx_vm_init(kvm);

return vmx_vm_init(kvm);
}

+static void vt_flush_shadow_all_private(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ tdx_mmu_release_hkid(kvm);
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return;
+
+ vmx_vm_destroy(kvm);
+}
+
+static void vt_vm_free(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ tdx_vm_free(kvm);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -93,7 +120,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.check_processor_compatibility = vmx_check_processor_compat,

- .hardware_unsetup = vmx_hardware_unsetup,
+ .hardware_unsetup = vt_hardware_unsetup,

.hardware_enable = vt_hardware_enable,
.hardware_disable = vmx_hardware_disable,
@@ -104,7 +131,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vm_size = sizeof(struct kvm_vmx),
.vm_enable_cap = vt_vm_enable_cap,
.vm_init = vt_vm_init,
- .vm_destroy = vmx_vm_destroy,
+ .flush_shadow_all_private = vt_flush_shadow_all_private,
+ .vm_destroy = vt_vm_destroy,
+ .vm_free = vt_vm_free,

.vcpu_precreate = vmx_vcpu_precreate,
.vcpu_create = vmx_vcpu_create,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 331fbaa10d46..692619411da2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,9 +5,10 @@

#include "capabilities.h"
#include "x86_ops.h"
-#include "x86.h"
#include "mmu.h"
#include "tdx.h"
+#include "tdx_ops.h"
+#include "x86.h"

#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -47,6 +48,289 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
return r;
}

+struct tdx_info {
+ u8 nr_tdcs_pages;
+};
+
+/* Info about the TDX module. */
+static struct tdx_info tdx_info __ro_after_init;
+
+/*
+ * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
+ * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) try to acquire a global lock
+ * internally in the TDX module. If that fails, TDX_OPERAND_BUSY is returned without
+ * spinning or waiting due to a constraint on execution time. It's the caller's
+ * responsibility to avoid races (or retry on TDX_OPERAND_BUSY). Use this mutex
+ * to avoid races in the TDX module because the kernel knows better about scheduling.
+ */
+static DEFINE_MUTEX(tdx_lock);
+static struct mutex *tdx_mng_key_config_lock;
+
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+ return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
+}
+
+static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->tdr_pa;
+}
+
+static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
+{
+ tdx_guest_keyid_free(kvm_tdx->hkid);
+ kvm_tdx->hkid = -1;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->hkid > 0;
+}
+
+static void tdx_clear_page(unsigned long page_pa)
+{
+ const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+ void *page = __va(page_pa);
+ unsigned long i;
+
+ /*
+ * When re-assign one page from old keyid to a new keyid, MOVDIR64B is
+ * required to clear/write the page with new keyid to prevent integrity
+ * error when read on the page with new keyid.
+ *
+ * clflush doesn't flush cache with HKID set. The cache line could be
+ * poisoned (even without MKTME-i), clear the poison bit.
+ */
+ for (i = 0; i < PAGE_SIZE; i += 64)
+ movdir64b(page + i, zero_page);
+ /*
+ * MOVDIR64B store uses WC buffer. Prevent following memory reads
+ * from seeing potentially poisoned cache.
+ */
+ __mb();
+}
+
+static int __tdx_reclaim_page(hpa_t pa)
+{
+ struct tdx_module_args out;
+ u64 err;
+
+ do {
+ err = tdh_phymem_page_reclaim(pa, &out);
+ /*
+ * TDH.PHYMEM.PAGE.RECLAIM is allowed only when the TD is in the shutdown
+ * state, i.e. the TD is being destructed.
+ * TDH.PHYMEM.PAGE.RECLAIM requires the TDR and the target page.
+ * Because we're destructing the TD, it's rare to contend on the TDR.
+ */
+ } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int tdx_reclaim_page(hpa_t pa)
+{
+ int r;
+
+ r = __tdx_reclaim_page(pa);
+ if (!r)
+ tdx_clear_page(pa);
+ return r;
+}
+
+static void tdx_reclaim_td_page(unsigned long td_page_pa)
+{
+ WARN_ON_ONCE(!td_page_pa);
+
+ /*
+ * TDCX pages are being reclaimed. The TDX module maps TDCX with the HKID
+ * assigned to the TD. The cache associated with the TD was already
+ * flushed by TDH.PHYMEM.CACHE.WB before this point, so the
+ * cache doesn't need to be flushed again.
+ */
+ if (tdx_reclaim_page(td_page_pa))
+ /*
+ * Leak the page on failure:
+ * tdx_reclaim_page() returns an error if and only if there's an
+ * unexpected, fatal error, e.g. a SEAMCALL with bad params,
+ * incorrect concurrency in KVM, a TDX Module bug, etc.
+ * Retrying at a later point is highly unlikely to be
+ * successful.
+ * No log here as tdx_reclaim_page() already did.
+ */
+ return;
+ free_page((unsigned long)__va(td_page_pa));
+}
+
+static void tdx_do_tdh_phymem_cache_wb(void *unused)
+{
+ u64 err = 0;
+
+ do {
+ err = tdh_phymem_cache_wb(!!err);
+ } while (err == TDX_INTERRUPTED_RESUMABLE);
+
+ /* Other thread may have done for us. */
+ if (err == TDX_NO_HKID_READY_TO_WBCACHE)
+ err = TDX_SUCCESS;
+ if (WARN_ON_ONCE(err))
+ pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ bool packages_allocated, targets_allocated;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ cpumask_var_t packages, targets;
+ u64 err;
+ int i;
+
+ if (!is_hkid_assigned(kvm_tdx))
+ return;
+
+ if (!is_td_created(kvm_tdx)) {
+ tdx_hkid_free(kvm_tdx);
+ return;
+ }
+
+ packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
+ targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
+ cpus_read_lock();
+
+ /*
+ * We can destroy multiple guest TDs simultaneously. Prevent
+ * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
+ */
+ mutex_lock(&tdx_lock);
+
+ /*
+ * Go through multiple TDX HKID state transitions with three SEAMCALLs
+ * to make TDH.PHYMEM.PAGE.RECLAIM() usable. Make the transition atomic with
+ * respect to other functions that operate on private pages and Secure-EPT pages.
+ *
+ * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
+ * This function is called via mmu notifier, mmu_release().
+ * kvm_gmem_release() is called via fput() on process exit.
+ */
+ write_lock(&kvm->mmu_lock);
+
+ for_each_online_cpu(i) {
+ if (packages_allocated &&
+ cpumask_test_and_set_cpu(topology_physical_package_id(i),
+ packages))
+ continue;
+ if (targets_allocated)
+ cpumask_set_cpu(i, targets);
+ }
+ if (targets_allocated)
+ on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, 1);
+ else
+ on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, 1);
+ /*
+ * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
+ * tdh_mng_key_freeid() will fail.
+ */
+
+ err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
+ pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
+ kvm_tdx->hkid);
+ } else
+ tdx_hkid_free(kvm_tdx);
+
+ write_unlock(&kvm->mmu_lock);
+ mutex_unlock(&tdx_lock);
+ cpus_read_unlock();
+ free_cpumask_var(targets);
+ free_cpumask_var(packages);
+}
+
+void tdx_vm_free(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+ int i;
+
+ /*
+ * tdx_mmu_release_hkid() failed to reclaim the HKID. Something went
+ * seriously wrong with the TDX module. Give up freeing the TD pages. As the
+ * function already warned, don't warn again.
+ */
+ if (is_hkid_assigned(kvm_tdx))
+ return;
+
+ if (kvm_tdx->tdcs_pa) {
+ for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
+ if (kvm_tdx->tdcs_pa[i])
+ tdx_reclaim_td_page(kvm_tdx->tdcs_pa[i]);
+ }
+ kfree(kvm_tdx->tdcs_pa);
+ kvm_tdx->tdcs_pa = NULL;
+ }
+
+ if (!kvm_tdx->tdr_pa)
+ return;
+ if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
+ return;
+ /*
+ * TDX module maps TDR with TDX global HKID. TDX module may access TDR
+ * while operating on TD (Especially reclaiming TDCS). Cache flush with
+ * TDX global HKID is needed.
+ */
+ err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
+ tdx_global_keyid));
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ return;
+ }
+ tdx_clear_page(kvm_tdx->tdr_pa);
+
+ free_page((unsigned long)__va(kvm_tdx->tdr_pa));
+ kvm_tdx->tdr_pa = 0;
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+ hpa_t *tdr_p = param;
+ u64 err;
+
+ do {
+ err = tdh_mng_key_config(*tdr_p);
+
+ /*
+ * If it failed to generate a random key, retry it because this
+ * is typically caused by an entropy error of the CPU's random
+ * number generator.
+ */
+ } while (err == TDX_KEY_GENERATION_FAILED);
+
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm);
+
+int tdx_vm_init(struct kvm *kvm)
+{
+ /*
+ * TDX has its own limit of the number of vcpus in addition to
+ * KVM_MAX_VCPUS.
+ */
+ kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
+
+ /* Place holder for TDX specific logic. */
+ return __tdx_td_init(kvm);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
@@ -107,6 +391,167 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
return ret;
}

+static int __tdx_td_init(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ cpumask_var_t packages;
+ unsigned long *tdcs_pa = NULL;
+ unsigned long tdr_pa = 0;
+ unsigned long va;
+ int ret, i;
+ u64 err;
+
+ ret = tdx_guest_keyid_alloc();
+ if (ret < 0)
+ return ret;
+ kvm_tdx->hkid = ret;
+
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va)
+ goto free_hkid;
+ tdr_pa = __pa(va);
+
+ tdcs_pa = kcalloc(tdx_info.nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!tdcs_pa)
+ goto free_tdr;
+ for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va)
+ goto free_tdcs;
+ tdcs_pa[i] = __pa(va);
+ }
+
+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto free_tdcs;
+ }
+ cpus_read_lock();
+ /*
+ * Need at least one CPU of the package to be online in order to
+ * program all packages for host key id. Check it.
+ */
+ for_each_present_cpu(i)
+ cpumask_set_cpu(topology_physical_package_id(i), packages);
+ for_each_online_cpu(i)
+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
+ if (!cpumask_empty(packages)) {
+ ret = -EIO;
+ /*
+ * Because it's hard for human operator to figure out the
+ * reason, warn it.
+ */
+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
+ pr_warn_ratelimited(MSG_ALLPKG);
+ goto free_packages;
+ }
+
+ /*
+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
+ * to acquire the lock, returns TDX_OPERAND_BUSY instead, and lets the
+ * caller handle the contention. This is because of the limited time
+ * usable inside the TDX module, and the OS/VMM knows better about process
+ * scheduling.
+ *
+ * APIs to acquire the lock of KOT:
+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
+ * TDH.PHYMEM.CACHE.WB.
+ */
+ mutex_lock(&tdx_lock);
+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
+ mutex_unlock(&tdx_lock);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
+ ret = -EIO;
+ goto free_packages;
+ }
+ kvm_tdx->tdr_pa = tdr_pa;
+
+ for_each_online_cpu(i) {
+ int pkg = topology_physical_package_id(i);
+
+ if (cpumask_test_and_set_cpu(pkg, packages))
+ continue;
+
+ /*
+ * Program the memory controller in the package with an
+ * encryption key associated to a TDX private host key id
+ * assigned to this TDR. Concurrent operations on same memory
+ * controller results in TDX_OPERAND_BUSY. Avoid this race by
+ * mutex.
+ */
+ mutex_lock(&tdx_mng_key_config_lock[pkg]);
+ ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
+ &kvm_tdx->tdr_pa, true);
+ mutex_unlock(&tdx_mng_key_config_lock[pkg]);
+ if (ret)
+ break;
+ }
+ cpus_read_unlock();
+ free_cpumask_var(packages);
+ if (ret) {
+ i = 0;
+ goto teardown;
+ }
+
+ kvm_tdx->tdcs_pa = tdcs_pa;
+ for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
+ err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
+ ret = -EIO;
+ goto teardown;
+ }
+ }
+
+ /*
+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
+ * ioctl() to configure the CPUID values for the TD.
+ */
+ return 0;
+
+ /*
+ * The sequence for freeing resources from a partially initialized TD
+ * varies based on where in the initialization flow failure occurred.
+ * Simply use the full teardown and destroy, which naturally play nice
+ * with partial initialization.
+ */
+teardown:
+ for (; i < tdx_info.nr_tdcs_pages; i++) {
+ if (tdcs_pa[i]) {
+ free_page((unsigned long)__va(tdcs_pa[i]));
+ tdcs_pa[i] = 0;
+ }
+ }
+ if (!kvm_tdx->tdcs_pa)
+ kfree(tdcs_pa);
+ tdx_mmu_release_hkid(kvm);
+ tdx_vm_free(kvm);
+ return ret;
+
+free_packages:
+ cpus_read_unlock();
+ free_cpumask_var(packages);
+free_tdcs:
+ for (i = 0; i < tdx_info.nr_tdcs_pages; i++) {
+ if (tdcs_pa[i])
+ free_page((unsigned long)__va(tdcs_pa[i]));
+ }
+ kfree(tdcs_pa);
+ kvm_tdx->tdcs_pa = NULL;
+
+free_tdr:
+ if (tdr_pa)
+ free_page((unsigned long)__va(tdr_pa));
+ kvm_tdx->tdr_pa = 0;
+free_hkid:
+ if (is_hkid_assigned(kvm_tdx))
+ tdx_hkid_free(kvm_tdx);
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -150,9 +595,11 @@ static int __init tdx_module_setup(void)
return ret;
}

- /* Sanitary check just in case. */
tdsysinfo = tdx_get_sysinfo();
WARN_ON(tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS);
+ tdx_info = (struct tdx_info) {
+ .nr_tdcs_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
+ };

return 0;
}
@@ -195,13 +642,27 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
struct vmx_tdx_enabled vmx_tdx = {
.err = ATOMIC_INIT(0),
};
+ int max_pkgs;
int r = 0;
+ int i;

+ if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
+ pr_warn("MOVDIR64B is reqiured for TDX\n");
+ return -EOPNOTSUPP;
+ }
if (!enable_ept) {
pr_warn("Cannot enable TDX with EPT disabled\n");
return -EINVAL;
}

+ max_pkgs = topology_max_packages();
+ tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
+ GFP_KERNEL);
+ if (!tdx_mng_key_config_lock)
+ return -ENOMEM;
+ for (i = 0; i < max_pkgs; i++)
+ mutex_init(&tdx_mng_key_config_lock[i]);
+
if (!zalloc_cpumask_var(&vmx_tdx.vmx_enabled, GFP_KERNEL)) {
r = -ENOMEM;
goto out;
@@ -222,3 +683,9 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
out:
return r;
}
+
+void tdx_hardware_unsetup(void)
+{
+ /* kfree accepts NULL. */
+ kfree(tdx_mng_key_config_lock);
+}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 22c0b57f69ca..ae117f864cfb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -8,7 +8,11 @@

struct kvm_tdx {
struct kvm kvm;
- /* TDX specific members follow. */
+
+ unsigned long tdr_pa;
+ unsigned long *tdcs_pa;
+
+ int hkid;
};

struct vcpu_tdx {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index c3fc3148dcf3..01f3b02a5b46 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -136,18 +136,26 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);

#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);

int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
+int tdx_vm_init(struct kvm *kvm);
+void tdx_mmu_release_hkid(struct kvm *kvm);
+void tdx_vm_free(struct kvm *kvm);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
+static inline void tdx_hardware_unsetup(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }

static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
{
return -EINVAL;
};
+static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
+static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
+static inline void tdx_vm_free(struct kvm *kvm) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1354274ffe4..fbd80f2e403e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12538,6 +12538,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_page_track_cleanup(kvm);
kvm_xen_destroy_vm(kvm);
kvm_hv_destroy_vm(kvm);
+ static_call_cond(kvm_x86_vm_free)(kvm);
}

static void memslot_rmap_free(struct kvm_memory_slot *slot)
--
2.25.1

2023-10-16 16:37:06

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 033/116] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE

From: Sean Christopherson <[email protected]>

For a TD guest, the current way to emulate MMIO doesn't work any more, as KVM
is not able to access the private memory of the TD guest to do the emulation.
Instead, the TD guest expects to receive a #VE when it accesses MMIO, and then
it can explicitly make a hypercall to KVM to get the expected information.

To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control. Accordingly, for an MMIO SPTE for a shared GPA:
1. KVM needs to set the "suppress #VE" bit in the non-present SPTE so that an
   EPT violation is triggered when the TD accesses the MMIO range.
2. On the EPT violation, KVM installs the MMIO SPTE with the "suppress #VE"
   bit cleared, so the TD guest receives a #VE instead of an EPT
   misconfiguration, unlike the VMX case.
For a shared GPA that is not populated yet, an EPT violation needs to be
triggered when the TD guest accesses it, so the non-present SPTE value for
shared GPAs should also set the "suppress #VE" bit.

Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63)
for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
present bit is off; 2) for normal VMX guest, KVM never enables the
"EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
hardware.
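
As a rough illustration (not part of this patch; it assumes the
SHADOW_NONPRESENT_VALUE definition added below plus the MMIO SPTE handling
added later in this series), the intended flow for a shared MMIO GPA is:

    /* Minimal sketch, illustrative only. */
    u64 spte = SHADOW_NONPRESENT_VALUE;   /* non-present, "suppress #VE" (bit 63) set */

    /*
     * 1. The TD touches the MMIO GPA: the SPTE is non-present with bit 63
     *    set, so the access exits to KVM as an EPT violation.
     * 2. KVM installs an MMIO SPTE with bit 63 clear; the next access
     *    delivers a #VE to the TD instead of an EPT misconfiguration.
     * 3. The TD's #VE handler issues a TDG.VP.VMCALL so that KVM or
     *    userspace can emulate the access.
     */
    spte &= ~BIT_ULL(63);                 /* "suppress #VE" cleared for the MMIO SPTE */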

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 4d1799ba2bf8..26bc95bbc962 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -149,7 +149,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+/*
+ * Non-present SPTE value for both VMX and SVM for the TDP MMU.
+ * For SVM NPT, when an SPTE is non-present (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled (EPT_VIOLATION_VE=0) and
+ * is the "suppress #VE" bit if #VE is enabled (EPT_VIOLATION_VE=1).
+ * For TDX, the TDX module sets EPT_VIOLATION_VE for both the Secure EPT and
+ * the conventional EPT.
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif

extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
@@ -196,7 +209,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE 0x5a0ULL
+#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)

/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
--
2.25.1

2023-10-16 16:37:06

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 090/116] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL

From: Isaku Yamahata <[email protected]>

Some TDG.VP.VMCALLs require the device model, for example qemu, to handle
them on behalf of the kvm kernel module. TDG_VP_VMCALL_REPORT_FATAL_ERROR,
TDG_VP_VMCALL_MAP_GPA, TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and
TDG_VP_VMCALL_GET_QUOTE require user space VMM handling.

Introduce a new kvm exit, KVM_EXIT_TDX, and functions to set it up.
TDG_VP_VMCALL_INVALID_OPERAND is set as the default return value to avoid
returning a random value. The device model should update R10 if necessary.
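
The following is a hedged sketch of the device-model side, not part of this
patch: "run" is the mmap'ed struct kvm_run of the vcpu, and
TDG_VP_VMCALL_MAP_GPA / TDG_VP_VMCALL_SUCCESS stand for the sub-function
number and status code from the GHCI spec, which userspace would define
itself since they are not exported in the uapi header:

    if (run->exit_reason == KVM_EXIT_TDX &&
        run->tdx.type == KVM_EXIT_TDX_VMCALL) {
            struct kvm_tdx_vmcall *vmcall = &run->tdx.u.vmcall;

            switch (vmcall->type) {                 /* alias of in_r10 */
            case TDG_VP_VMCALL_MAP_GPA:
                    /* ...convert the GPA range between shared and private... */
                    vmcall->status_code = TDG_VP_VMCALL_SUCCESS;  /* out_r10 */
                    break;
            default:
                    /* keep the default TDG_VP_VMCALL_INVALID_OPERAND in out_r10 */
                    break;
            }
    }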

Signed-off-by: Isaku Yamahata <[email protected]>
---
v14 -> v15:
- updated struct kvm_tdx_exit with union
- export constants for reg bitmask
---
arch/x86/kvm/vmx/tdx.c | 84 +++++++++++++++++++++++++++++++++++++-
include/uapi/linux/kvm.h | 87 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 89e92c696760..eb6ba2eee16c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1012,6 +1012,78 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_complete_vp_vmcall(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+ __u64 reg_mask = kvm_rcx_read(vcpu);
+
+#define COPY_REG(MASK, REG) \
+ do { \
+ if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) \
+ kvm_## REG ## _write(vcpu, tdx_vmcall->out_ ## REG); \
+ } while (0)
+
+
+ COPY_REG(R10, r10);
+ COPY_REG(R11, r11);
+ COPY_REG(R12, r12);
+ COPY_REG(R13, r13);
+ COPY_REG(R14, r14);
+ COPY_REG(R15, r15);
+ COPY_REG(RBX, rbx);
+ COPY_REG(RDI, rdi);
+ COPY_REG(RSI, rsi);
+ COPY_REG(R8, r8);
+ COPY_REG(R9, r9);
+ COPY_REG(RDX, rdx);
+
+#undef COPY_REG
+
+ return 1;
+}
+
+static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+ __u64 reg_mask;
+
+ vcpu->arch.complete_userspace_io = tdx_complete_vp_vmcall;
+ memset(tdx_vmcall, 0, sizeof(*tdx_vmcall));
+
+ vcpu->run->exit_reason = KVM_EXIT_TDX;
+ vcpu->run->tdx.type = KVM_EXIT_TDX_VMCALL;
+
+ reg_mask = kvm_rcx_read(vcpu);
+ tdx_vmcall->reg_mask = reg_mask;
+
+#define COPY_REG(MASK, REG) \
+ do { \
+ if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) { \
+ tdx_vmcall->in_ ## REG = kvm_ ## REG ## _read(vcpu); \
+ tdx_vmcall->out_ ## REG = tdx_vmcall->in_ ## REG; \
+ } \
+ } while (0)
+
+
+ COPY_REG(R10, r10);
+ COPY_REG(R11, r11);
+ COPY_REG(R12, r12);
+ COPY_REG(R13, r13);
+ COPY_REG(R14, r14);
+ COPY_REG(R15, r15);
+ COPY_REG(RBX, rbx);
+ COPY_REG(RDI, rdi);
+ COPY_REG(RSI, rsi);
+ COPY_REG(R8, r8);
+ COPY_REG(R9, r9);
+ COPY_REG(RDX, rdx);
+
+#undef COPY_REG
+
+ /* notify userspace to handle the request */
+ return 0;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1022,8 +1094,16 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
break;
}

- tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
- return 1;
+ /*
+ * Unknown VMCALL. Toss the request to the user space VMM, e.g. qemu,
+ * as it may know how to handle it.
+ *
+ * These VMCALLs require the user space VMM:
+ * TDG_VP_VMCALL_REPORT_FATAL_ERROR, TDG_VP_VMCALL_MAP_GPA,
+ * TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and
+ * TDG_VP_VMCALL_GET_QUOTE.
+ */
+ return tdx_vp_vmcall_to_user(vcpu);
}

void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 65fc983af840..891dcfec171d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -237,6 +237,90 @@ struct kvm_xen_exit {
} u;
};

+struct kvm_tdx_exit {
+#define KVM_EXIT_TDX_VMCALL 1
+ __u32 type;
+ __u32 pad;
+
+ union {
+ struct kvm_tdx_vmcall {
+ /*
+ * RAX(bit 0), RCX(bit 1) and RSP(bit 4) are reserved.
+ * RAX(bit 0): TDG.VP.VMCALL status code.
+ * RCX(bit 1): bitmap for used registers.
+ * RSP(bit 4): the caller stack.
+ */
+#define TDX_VMCALL_REG_MASK_RBX BIT_ULL(2)
+#define TDX_VMCALL_REG_MASK_RDX BIT_ULL(3)
+#define TDX_VMCALL_REG_MASK_RSI BIT_ULL(6)
+#define TDX_VMCALL_REG_MASK_RDI BIT_ULL(7)
+#define TDX_VMCALL_REG_MASK_R8 BIT_ULL(8)
+#define TDX_VMCALL_REG_MASK_R9 BIT_ULL(9)
+#define TDX_VMCALL_REG_MASK_R10 BIT_ULL(10)
+#define TDX_VMCALL_REG_MASK_R11 BIT_ULL(11)
+#define TDX_VMCALL_REG_MASK_R12 BIT_ULL(12)
+#define TDX_VMCALL_REG_MASK_R13 BIT_ULL(13)
+#define TDX_VMCALL_REG_MASK_R14 BIT_ULL(14)
+#define TDX_VMCALL_REG_MASK_R15 BIT_ULL(15)
+ union {
+ __u64 in_rcx;
+ __u64 reg_mask;
+ };
+
+ /*
+ * Guest-Host-Communication Interface for TDX spec
+ * defines the ABI for TDG.VP.VMCALL.
+ */
+ /* Input parameters: guest -> VMM */
+ union {
+ __u64 in_r10;
+ __u64 type;
+ };
+ union {
+ __u64 in_r11;
+ __u64 subfunction;
+ };
+ /*
+ * Subfunction specific.
+ * Registers are used in this order to pass input
+ * arguments. r12=arg0, r13=arg1, etc.
+ */
+ __u64 in_r12;
+ __u64 in_r13;
+ __u64 in_r14;
+ __u64 in_r15;
+ __u64 in_rbx;
+ __u64 in_rdi;
+ __u64 in_rsi;
+ __u64 in_r8;
+ __u64 in_r9;
+ __u64 in_rdx;
+
+ /* Output parameters: VMM -> guest */
+ union {
+ __u64 out_r10;
+ __u64 status_code;
+ };
+ /*
+ * Subfunction specific.
+ * Registers are used in this order to output return
+ * values. r11=ret0, r12=ret1, etc.
+ */
+ __u64 out_r11;
+ __u64 out_r12;
+ __u64 out_r13;
+ __u64 out_r14;
+ __u64 out_r15;
+ __u64 out_rbx;
+ __u64 out_rdi;
+ __u64 out_rsi;
+ __u64 out_r8;
+ __u64 out_r9;
+ __u64 out_rdx;
+ } vmcall;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -279,6 +363,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_RISCV_CSR 36
#define KVM_EXIT_NOTIFY 37
#define KVM_EXIT_MEMORY_FAULT 38
+#define KVM_EXIT_TDX 39

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -525,6 +610,8 @@ struct kvm_run {
#define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+ /* KVM_EXIT_TDX_VMCALL */
+ struct kvm_tdx_exit tdx;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1

2023-10-16 16:37:18

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 093/116] KVM: TDX: Handle TDX PV port io hypercall

From: Isaku Yamahata <[email protected]>

Wire up the TDX PV port I/O hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 57 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2f8bb0e9394e..fd6907d5ba9d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1140,6 +1140,61 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
return kvm_emulate_halt_noskip(vcpu);
}

+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ int ret;
+
+ WARN_ON_ONCE(vcpu->arch.pio.count != 1);
+
+ ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+ vcpu->arch.pio.port, &val, 1);
+ WARN_ON_ONCE(!ret);
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, val);
+
+ return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ unsigned int port;
+ int size, ret;
+ bool write;
+
+ ++vcpu->stat.io_exits;
+
+ size = tdvmcall_a0_read(vcpu);
+ write = tdvmcall_a1_read(vcpu);
+ port = tdvmcall_a2_read(vcpu);
+
+ if (size != 1 && size != 2 && size != 4) {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ if (write) {
+ val = tdvmcall_a3_read(vcpu);
+ ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+ /* No need for a complete_userspace_io callback. */
+ vcpu->arch.pio.count = 0;
+ } else {
+ ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+ if (!ret)
+ vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+ else
+ tdvmcall_set_return_val(vcpu, val);
+ }
+ if (ret)
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ return ret;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1150,6 +1205,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_cpuid(vcpu);
case EXIT_REASON_HLT:
return tdx_emulate_hlt(vcpu);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return tdx_emulate_io(vcpu);
default:
break;
}
--
2.25.1

2023-10-16 16:37:27

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 007/116] KVM: TDX: Make TDX VM type supported

From: Isaku Yamahata <[email protected]>

NOTE: This patch is placed at this position in the series so that developers
can test the code partway through the series, even though the series doesn't
provide functional features until all of its patches are applied. When
merging this patch series, this patch can be moved to the end.

As a first step of TDX VM support, report to the device model, e.g. qemu,
that the TDX VM type is supported. The callback to create a guest TD is the
vm_init callback for KVM_CREATE_VM.
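
For reference, a hedged sketch of how a device model would ask KVM for a TDX
VM once this series is applied (KVM_X86_TDX_VM is introduced earlier in the
series; headers and error handling are elided):

    int kvm_fd = open("/dev/kvm", O_RDWR);
    /* Request the TDX VM type instead of the default VM type. */
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
    /* With only this patch applied, vm_init still fails with -EOPNOTSUPP. */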

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 18 ++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 6 ------
arch/x86/kvm/vmx/x86_ops.h | 3 ++-
4 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 1d423abd124b..8ca23adfcfb8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -10,6 +10,12 @@
static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);

+static bool vt_is_vm_type_supported(unsigned long type)
+{
+ return __kvm_is_vm_type_supported(type) ||
+ (enable_tdx && tdx_is_vm_type_supported(type));
+}
+
static int vt_hardware_enable(void)
{
int ret;
@@ -37,6 +43,14 @@ static __init int vt_hardware_setup(void)
return 0;
}

+static int vt_vm_init(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
+
+ return vmx_vm_init(kvm);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -57,9 +71,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

- .is_vm_type_supported = vmx_is_vm_type_supported,
+ .is_vm_type_supported = vt_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
+ .vm_init = vt_vm_init,
.vm_destroy = vmx_vm_destroy,

.vcpu_precreate = vmx_vcpu_precreate,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1c9884164566..9d3f593eacb8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -24,6 +24,12 @@ static int __init tdx_module_setup(void)
return 0;
}

+bool tdx_is_vm_type_supported(unsigned long type)
+{
+ /* enable_tdx check is done by the caller. */
+ return type == KVM_X86_TDX_VM;
+}
+
struct vmx_tdx_enabled {
cpumask_var_t vmx_enabled;
atomic_t err;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d9f2f00af427..0b8bf04e283d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7532,12 +7532,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

-bool vmx_is_vm_type_supported(unsigned long type)
-{
- /* TODO: Check if TDX is supported. */
- return __kvm_is_vm_type_supported(type);
-}
-
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 2c4fe3aff285..99fb6c9c0282 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -28,7 +28,6 @@ void vmx_hardware_unsetup(void);
int vmx_check_processor_compat(void);
int vmx_hardware_enable(void);
void vmx_hardware_disable(void);
-bool vmx_is_vm_type_supported(unsigned long type);
int vmx_vm_init(struct kvm *kvm);
void vmx_vm_destroy(struct kvm *kvm);
int vmx_vcpu_precreate(struct kvm *kvm);
@@ -137,8 +136,10 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);

#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+bool tdx_is_vm_type_supported(unsigned long type);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
+static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1

2023-10-16 16:37:32

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 027/116] [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the KVM MMU GPA shared bits part of
the patch series.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 25082e9c0b20..8b8186e7bfeb 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -10,6 +10,7 @@ What qemu can do
----------------
- TDX VM TYPE is exposed to Qemu.
- Qemu can create/destroy guest of TDX vm type.
+- Qemu can create/destroy vcpu of TDX vm type.

Patch Layer status
------------------
@@ -18,12 +19,12 @@ Patch Layer status
* TDX, VMX coexistence: Applied
* TDX architectural definitions: Applied
* TD VM creation/destruction: Applied
-* TD vcpu creation/destruction: Applying
+* TD vcpu creation/destruction: Applied
* TDX EPT violation: Not yet
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet

-* KVM MMU GPA shared bits: Not yet
+* KVM MMU GPA shared bits: Applying
* KVM TDP refactoring for TDX: Not yet
* KVM TDP MMU hooks: Not yet
--
2.25.1

2023-10-16 16:37:32

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 030/116] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the KVM TDP refactoring for TDX part
of the patch series.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 8b8186e7bfeb..e893a3d714c7 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -25,6 +25,6 @@ Patch Layer status
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet

-* KVM MMU GPA shared bits: Applying
-* KVM TDP refactoring for TDX: Not yet
+* KVM MMU GPA shared bits: Applied
+* KVM TDP refactoring for TDX: Applying
* KVM TDP MMU hooks: Not yet
--
2.25.1

2023-10-16 16:37:38

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 001/116] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX

From: Sean Christopherson <[email protected]>

KVM accesses the Virtual Machine Control Structure (VMCS) with VMX
instructions to operate on a VM. TDX doesn't allow the VMM to operate on the
VMCS directly. Instead, TDX has its own data structures and TDX SEAMCALL APIs
for the VMM to operate on those data structures indirectly. This means we
must have a TDX version of kvm_x86_ops.

The existing global struct kvm_x86_ops already defines an interface which
fits with TDX. But kvm_x86_ops is a system-wide structure, not a per-VM one.
To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have
wrappers, "if (tdx) tdx_op() else vmx_op()", to switch between VMX and TDX at
run time.

To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
both VMs and TDs. Use 'vt' for the naming scheme as a nod to VT-x and as a
concatenation of VmxTdx.

The current code looks as follows.
In vmx.c:
  static vmx_op() { ... }
  static struct kvm_x86_ops vmx_x86_ops = {
          .op = vmx_op,
  initialization code

The eventually converted code will look like:
In vmx.c, keep the VMX operations.
  vmx_op() { ... }
  VMX initialization
In tdx.c, define the TDX operations.
  tdx_op() { ... }
  TDX initialization
In x86_ops.h, declare the VMX and TDX operations.
  vmx_op();
  tdx_op();
In main.c, define common wrappers for VMX and TDX.
  static vt_op() { if (tdx) tdx_op() else vmx_op() }
  static struct kvm_x86_ops vt_x86_ops = {
          .op = vt_op,
  initialization to call VMX and TDX initialization

Opportunistically, fix the naming inconsistency by renaming vmx_create_vcpu()
and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/main.c | 167 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 376 ++++++++++---------------------------
arch/x86/kvm/vmx/x86_ops.h | 125 ++++++++++++
4 files changed, 396 insertions(+), 274 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/x86_ops.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 80e3fe184d17..0e894ae23cbc 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -23,7 +23,7 @@ kvm-$(CONFIG_KVM_XEN) += xen.o
kvm-$(CONFIG_KVM_SMM) += smm.o

kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
- vmx/hyperv.o vmx/nested.o vmx/posted_intr.o
+ vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o

kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..3b89cb721fe8
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+#define VMX_REQUIRED_APICV_INHIBITS \
+ (BIT(APICV_INHIBIT_REASON_DISABLE)| \
+ BIT(APICV_INHIBIT_REASON_ABSENT) | \
+ BIT(APICV_INHIBIT_REASON_HYPERV) | \
+ BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
+ BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
+ BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+ .name = KBUILD_MODNAME,
+
+ .check_processor_compatibility = vmx_check_processor_compat,
+
+ .hardware_unsetup = vmx_hardware_unsetup,
+
+ .hardware_enable = vmx_hardware_enable,
+ .hardware_disable = vmx_hardware_disable,
+ .has_emulated_msr = vmx_has_emulated_msr,
+
+ .is_vm_type_supported = vmx_is_vm_type_supported,
+ .vm_size = sizeof(struct kvm_vmx),
+ .vm_init = vmx_vm_init,
+ .vm_destroy = vmx_vm_destroy,
+
+ .vcpu_precreate = vmx_vcpu_precreate,
+ .vcpu_create = vmx_vcpu_create,
+ .vcpu_free = vmx_vcpu_free,
+ .vcpu_reset = vmx_vcpu_reset,
+
+ .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .vcpu_load = vmx_vcpu_load,
+ .vcpu_put = vmx_vcpu_put,
+
+ .update_exception_bitmap = vmx_update_exception_bitmap,
+ .get_msr_feature = vmx_get_msr_feature,
+ .get_msr = vmx_get_msr,
+ .set_msr = vmx_set_msr,
+ .get_segment_base = vmx_get_segment_base,
+ .get_segment = vmx_get_segment,
+ .set_segment = vmx_set_segment,
+ .get_cpl = vmx_get_cpl,
+ .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+ .is_valid_cr0 = vmx_is_valid_cr0,
+ .set_cr0 = vmx_set_cr0,
+ .is_valid_cr4 = vmx_is_valid_cr4,
+ .set_cr4 = vmx_set_cr4,
+ .set_efer = vmx_set_efer,
+ .get_idt = vmx_get_idt,
+ .set_idt = vmx_set_idt,
+ .get_gdt = vmx_get_gdt,
+ .set_gdt = vmx_set_gdt,
+ .set_dr7 = vmx_set_dr7,
+ .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+ .cache_reg = vmx_cache_reg,
+ .get_rflags = vmx_get_rflags,
+ .set_rflags = vmx_set_rflags,
+ .get_if_flag = vmx_get_if_flag,
+
+ .flush_tlb_all = vmx_flush_tlb_all,
+ .flush_tlb_current = vmx_flush_tlb_current,
+ .flush_tlb_gva = vmx_flush_tlb_gva,
+ .flush_tlb_guest = vmx_flush_tlb_guest,
+
+ .vcpu_pre_run = vmx_vcpu_pre_run,
+ .vcpu_run = vmx_vcpu_run,
+ .handle_exit = vmx_handle_exit,
+ .skip_emulated_instruction = vmx_skip_emulated_instruction,
+ .update_emulated_instruction = vmx_update_emulated_instruction,
+ .set_interrupt_shadow = vmx_set_interrupt_shadow,
+ .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .patch_hypercall = vmx_patch_hypercall,
+ .inject_irq = vmx_inject_irq,
+ .inject_nmi = vmx_inject_nmi,
+ .inject_exception = vmx_inject_exception,
+ .cancel_injection = vmx_cancel_injection,
+ .interrupt_allowed = vmx_interrupt_allowed,
+ .nmi_allowed = vmx_nmi_allowed,
+ .get_nmi_mask = vmx_get_nmi_mask,
+ .set_nmi_mask = vmx_set_nmi_mask,
+ .enable_nmi_window = vmx_enable_nmi_window,
+ .enable_irq_window = vmx_enable_irq_window,
+ .update_cr8_intercept = vmx_update_cr8_intercept,
+ .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .apicv_post_state_restore = vmx_apicv_post_state_restore,
+ .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
+ .hwapic_irr_update = vmx_hwapic_irr_update,
+ .hwapic_isr_update = vmx_hwapic_isr_update,
+ .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .sync_pir_to_irr = vmx_sync_pir_to_irr,
+ .deliver_interrupt = vmx_deliver_interrupt,
+ .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+ .set_tss_addr = vmx_set_tss_addr,
+ .set_identity_map_addr = vmx_set_identity_map_addr,
+ .get_mt_mask = vmx_get_mt_mask,
+
+ .get_exit_info = vmx_get_exit_info,
+
+ .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+ .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+ .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+ .write_tsc_offset = vmx_write_tsc_offset,
+ .write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+ .load_mmu_pgd = vmx_load_mmu_pgd,
+
+ .check_intercept = vmx_check_intercept,
+ .handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+ .request_immediate_exit = vmx_request_immediate_exit,
+
+ .sched_in = vmx_sched_in,
+
+ .cpu_dirty_log_size = PML_ENTITY_NUM,
+ .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+ .nested_ops = &vmx_nested_ops,
+
+ .pi_update_irte = vmx_pi_update_irte,
+ .pi_start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+ .set_hv_timer = vmx_set_hv_timer,
+ .cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+ .setup_mce = vmx_setup_mce,
+
+#ifdef CONFIG_KVM_SMM
+ .smi_allowed = vmx_smi_allowed,
+ .enter_smm = vmx_enter_smm,
+ .leave_smm = vmx_leave_smm,
+ .enable_smi_window = vmx_enable_smi_window,
+#endif
+
+ .can_emulate_instruction = vmx_can_emulate_instruction,
+ .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .migrate_timers = vmx_migrate_timers,
+
+ .msr_filter_changed = vmx_msr_filter_changed,
+ .complete_emulated_msr = kvm_complete_insn_gp,
+
+ .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+ .hardware_setup = vmx_hardware_setup,
+ .handle_intel_pt_intr = NULL,
+
+ .runtime_ops = &vt_x86_ops,
+ .pmu_ops = &intel_pmu_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d1d8b853d9b2..f733629c4706 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -65,6 +65,7 @@
#include "vmcs12.h"
#include "vmx.h"
#include "x86.h"
+#include "x86_ops.h"
#include "smm.h"

MODULE_AUTHOR("Qumranet");
@@ -515,8 +516,6 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
static unsigned long host_idt_base;

#if IS_ENABLED(CONFIG_HYPERV)
-static struct kvm_x86_ops vmx_x86_ops __initdata;
-
static bool __read_mostly enlightened_vmcs = true;
module_param(enlightened_vmcs, bool, 0444);

@@ -574,9 +573,8 @@ static __init void hv_init_evmcs(void)
}

if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vmx_x86_ops.enable_l2_tlb_flush
+ vt_x86_ops.enable_l2_tlb_flush
= hv_enable_l2_tlb_flush;
-
} else {
enlightened_vmcs = false;
}
@@ -1481,7 +1479,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
* Switches to specified vcpu, until a matching vcpu_put(), but assumes
* vcpu mutex is already taken.
*/
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -1492,7 +1490,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx->host_debugctlmsr = get_debugctlmsr();
}

-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
vmx_vcpu_pi_put(vcpu);

@@ -1551,7 +1549,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
vmx->emulation_required = vmx_emulation_required(vcpu);
}

-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
{
return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
}
@@ -1657,8 +1655,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
return 0;
}

-static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
- void *insn, int insn_len)
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
{
/*
* Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1742,7 +1740,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
* Recognizes a pending MTF VM-exit and records the nested state for later
* delivery.
*/
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1773,7 +1771,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
}
}

-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
vmx_update_emulated_instruction(vcpu);
return skip_emulated_instruction(vcpu);
@@ -1792,7 +1790,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
}

-static void vmx_inject_exception(struct kvm_vcpu *vcpu)
+void vmx_inject_exception(struct kvm_vcpu *vcpu)
{
struct kvm_queued_exception *ex = &vcpu->arch.exception;
u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
@@ -1913,12 +1911,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
return kvm_caps.default_tsc_scaling_ratio;
}

-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu)
{
vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
}

-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu)
{
vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
}
@@ -1961,7 +1959,7 @@ static inline bool is_vmx_feature_control_msr_valid(struct vcpu_vmx *vmx,
return !(msr->data & ~valid_bits);
}

-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
{
switch (msr->index) {
case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
@@ -1978,7 +1976,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2157,7 +2155,7 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2460,7 +2458,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return ret;
}

-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
unsigned long guest_owned_bits;

@@ -2761,7 +2759,7 @@ static bool kvm_is_vmx_supported(void)
return supported;
}

-static int vmx_check_processor_compat(void)
+int vmx_check_processor_compat(void)
{
int cpu = raw_smp_processor_id();
struct vmcs_config vmcs_conf;
@@ -2803,7 +2801,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
return -EFAULT;
}

-static int vmx_hardware_enable(void)
+int vmx_hardware_enable(void)
{
int cpu = raw_smp_processor_id();
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2843,7 +2841,7 @@ static void vmclear_local_loaded_vmcss(void)
__loaded_vmcs_clear(v);
}

-static void vmx_hardware_disable(void)
+void vmx_hardware_disable(void)
{
vmclear_local_loaded_vmcss();

@@ -3157,7 +3155,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)

#endif

-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -3187,7 +3185,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}

-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 root_hpa = mmu->root.hpa;
@@ -3203,7 +3201,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}

-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
/*
* vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -3212,7 +3210,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
}

-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
/*
* vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3257,7 +3255,7 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
#define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \
CPU_BASED_CR3_STORE_EXITING)

-static bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
if (is_guest_mode(vcpu))
return nested_guest_cr0_valid(vcpu, cr0);
@@ -3378,8 +3376,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
return eptp;
}

-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
- int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
struct kvm *kvm = vcpu->kvm;
bool update_guest_cr3 = true;
@@ -3407,8 +3404,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmcs_writel(GUEST_CR3, guest_cr3);
}

-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
/*
* We operate under the default treatment of SMM, so VMX cannot be
@@ -3524,7 +3520,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->g = (ar >> 15) & 1;
}

-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
struct kvm_segment s;

@@ -3601,14 +3597,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}

-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);

to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
}

-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);

@@ -3616,25 +3612,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
*l = (ar >> 13) & 1;
}

-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
dt->address = vmcs_readl(GUEST_IDTR_BASE);
}

-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
vmcs_writel(GUEST_IDTR_BASE, dt->address);
}

-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
dt->address = vmcs_readl(GUEST_GDTR_BASE);
}

-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -4106,7 +4102,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
}
}

-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
void *vapic_page;
@@ -4126,7 +4122,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
return ((rvi & 0xf0) > (vppr & 0xf0));
}

-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 i;
@@ -4267,8 +4263,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
return 0;
}

-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
- int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
{
struct kvm_vcpu *vcpu = apic->vcpu;

@@ -4430,7 +4426,7 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}

-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -4694,7 +4690,7 @@ static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
return 0;
}

-static int vmx_vcpu_precreate(struct kvm *kvm)
+int vmx_vcpu_precreate(struct kvm *kvm)
{
return vmx_alloc_ipiv_pid_table(kvm);
}
@@ -4846,7 +4842,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmx->pi_desc.sn = 1;
}

-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -4905,12 +4901,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_update_fb_clear_dis(vcpu, vmx);
}

-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
{
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
}

-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
{
if (!enable_vnmi ||
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4921,7 +4917,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
}

-static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
@@ -4949,7 +4945,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_clear_hlt(vcpu);
}

-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -5027,7 +5023,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
GUEST_INTR_STATE_NMI));
}

-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -5049,7 +5045,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}

-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -5064,7 +5060,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !vmx_interrupt_blocked(vcpu);
}

-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
void __user *ret;

@@ -5084,7 +5080,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
return init_rmode_tss(kvm, ret);
}

-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
{
to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
return 0;
@@ -5370,8 +5366,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
return kvm_fast_pio(vcpu, size, port, in);
}

-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
{
/*
* Patch in the VMCALL instruction:
@@ -5580,7 +5575,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
return kvm_complete_insn_gp(vcpu, err);
}

-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
{
get_debugreg(vcpu->arch.db[0], 0);
get_debugreg(vcpu->arch.db[1], 1);
@@ -5599,7 +5594,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
set_debugreg(DR6_RESERVED, 6);
}

-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
vmcs_writel(GUEST_DR7, val);
}
@@ -5870,7 +5865,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 1;
}

-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (vmx_emulation_required_with_pending_exception(vcpu)) {
kvm_prepare_emulation_failure_exit(vcpu);
@@ -6134,9 +6129,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);

-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
- u64 *info1, u64 *info2,
- u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6579,7 +6573,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 0;
}

-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);

@@ -6667,7 +6661,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
: "eax", "ebx", "ecx", "edx");
}

-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
int tpr_threshold;
@@ -6737,7 +6731,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
}

-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
struct kvm *kvm = vcpu->kvm;
@@ -6806,7 +6800,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
kvm_release_pfn_clean(pfn);
}

-static void vmx_hwapic_isr_update(int max_isr)
+void vmx_hwapic_isr_update(int max_isr)
{
u16 status;
u8 old;
@@ -6840,7 +6834,7 @@ static void vmx_set_rvi(int vector)
}
}

-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
{
/*
* When running L2, updating RVI is only relevant when
@@ -6854,7 +6848,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
vmx_set_rvi(max_irr);
}

-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -6900,7 +6894,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
return max_irr;
}

-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
return;
@@ -6911,7 +6905,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}

-static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6974,7 +6968,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
vcpu->arch.at_instruction_boundary = true;
}

-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6991,7 +6985,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
* The kvm parameter can be NULL (module initialization, or invocation before
* VM creation). Be sure to check the kvm parameter before using it.
*/
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
{
switch (index) {
case MSR_IA32_SMBASE:
@@ -7114,7 +7108,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
IDT_VECTORING_ERROR_CODE);
}

-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
{
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -7269,7 +7263,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
guest_state_exit_irqoff();
}

-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr3, cr4;
@@ -7425,7 +7419,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_exit_handlers_fastpath(vcpu);
}

-static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -7436,7 +7430,7 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_loaded_vmcs(vmx->loaded_vmcs);
}

-static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct vmx_uret_msr *tsx_ctrl;
struct vcpu_vmx *vmx;
@@ -7542,7 +7536,7 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

-static bool vmx_is_vm_type_supported(unsigned long type)
+bool vmx_is_vm_type_supported(unsigned long type)
{
/* TODO: Check if TDX is supported. */
return __kvm_is_vm_type_supported(type);
@@ -7551,7 +7545,7 @@ static bool vmx_is_vm_type_supported(unsigned long type)
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
{
if (!ple_gap)
kvm->arch.pause_in_guest = true;
@@ -7582,7 +7576,7 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}

-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;

@@ -7754,7 +7748,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}

-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -7907,7 +7901,7 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}

-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->req_immediate_exit = true;
}
@@ -7946,10 +7940,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
}

-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
- struct x86_instruction_info *info,
- enum x86_intercept_stage stage,
- struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

@@ -8029,8 +8023,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
return 0;
}

-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
- bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
{
struct vcpu_vmx *vmx;
u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -8069,13 +8063,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
return 0;
}

-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->hv_deadline_tsc = -1;
}
#endif

-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
if (!kvm_pause_in_guest(vcpu->kvm))
shrink_ple_window(vcpu);
@@ -8104,7 +8098,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
}

-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.mcg_cap & MCG_LMCE_P)
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -8115,7 +8109,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
}

#ifdef CONFIG_KVM_SMM
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
/* we need a nested vmexit to enter SMM, postpone if run is pending */
if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -8123,7 +8117,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !is_smm(vcpu);
}

-static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -8144,7 +8138,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
return 0;
}

-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int ret;
@@ -8165,18 +8159,18 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
return 0;
}

-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
{
/* RSM will cause a vmexit anyway. */
}
#endif

-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
}

-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
{
if (is_guest_mode(vcpu)) {
struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -8186,7 +8180,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
}
}

-static void vmx_hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
{
kvm_set_posted_intr_wakeup_handler(NULL);

@@ -8196,167 +8190,13 @@ static void vmx_hardware_unsetup(void)
free_kvm_area();
}

-#define VMX_REQUIRED_APICV_INHIBITS \
-( \
- BIT(APICV_INHIBIT_REASON_DISABLE)| \
- BIT(APICV_INHIBIT_REASON_ABSENT) | \
- BIT(APICV_INHIBIT_REASON_HYPERV) | \
- BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
- BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
- BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) \
-)
-
-static void vmx_vm_destroy(struct kvm *kvm)
+void vmx_vm_destroy(struct kvm *kvm)
{
struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);

free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
}

-static struct kvm_x86_ops vmx_x86_ops __initdata = {
- .name = KBUILD_MODNAME,
-
- .check_processor_compatibility = vmx_check_processor_compat,
-
- .hardware_unsetup = vmx_hardware_unsetup,
-
- .hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
- .has_emulated_msr = vmx_has_emulated_msr,
-
- .is_vm_type_supported = vmx_is_vm_type_supported,
- .vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
- .vm_destroy = vmx_vm_destroy,
-
- .vcpu_precreate = vmx_vcpu_precreate,
- .vcpu_create = vmx_vcpu_create,
- .vcpu_free = vmx_vcpu_free,
- .vcpu_reset = vmx_vcpu_reset,
-
- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
-
- .update_exception_bitmap = vmx_update_exception_bitmap,
- .get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .is_valid_cr0 = vmx_is_valid_cr0,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
-
- .flush_tlb_all = vmx_flush_tlb_all,
- .flush_tlb_current = vmx_flush_tlb_current,
- .flush_tlb_gva = vmx_flush_tlb_gva,
- .flush_tlb_guest = vmx_flush_tlb_guest,
-
- .vcpu_pre_run = vmx_vcpu_pre_run,
- .vcpu_run = vmx_vcpu_run,
- .handle_exit = vmx_handle_exit,
- .skip_emulated_instruction = vmx_skip_emulated_instruction,
- .update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
- .inject_irq = vmx_inject_irq,
- .inject_nmi = vmx_inject_nmi,
- .inject_exception = vmx_inject_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_post_state_restore = vmx_apicv_post_state_restore,
- .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
- .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
-
- .get_exit_info = vmx_get_exit_info,
-
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
- .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
-
- .load_mmu_pgd = vmx_load_mmu_pgd,
-
- .check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
-
- .request_immediate_exit = vmx_request_immediate_exit,
-
- .sched_in = vmx_sched_in,
-
- .cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
- .nested_ops = &vmx_nested_ops,
-
- .pi_update_irte = vmx_pi_update_irte,
- .pi_start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
- .setup_mce = vmx_setup_mce,
-
-#ifdef CONFIG_KVM_SMM
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
-#endif
-
- .can_emulate_instruction = vmx_can_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
- .migrate_timers = vmx_migrate_timers,
-
- .msr_filter_changed = vmx_msr_filter_changed,
- .complete_emulated_msr = kvm_complete_insn_gp,
-
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
static unsigned int vmx_handle_intel_pt_intr(void)
{
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8422,9 +8262,7 @@ static void __init vmx_setup_me_spte_mask(void)
kvm_mmu_set_me_spte_mask(0, me_mask);
}

-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
{
unsigned long host_bndcfgs;
struct desc_ptr dt;
@@ -8493,16 +8331,16 @@ static __init int hardware_setup(void)
* using the APIC_ACCESS_ADDR VMCS field.
*/
if (!flexpriority_enabled)
- vmx_x86_ops.set_apic_access_page_addr = NULL;
+ vt_x86_ops.set_apic_access_page_addr = NULL;

if (!cpu_has_vmx_tpr_shadow())
- vmx_x86_ops.update_cr8_intercept = NULL;
+ vt_x86_ops.update_cr8_intercept = NULL;

#if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
&& enable_ept) {
- vmx_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
- vmx_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
+ vt_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
+ vt_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
}
#endif

@@ -8517,7 +8355,7 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_apicv())
enable_apicv = 0;
if (!enable_apicv)
- vmx_x86_ops.sync_pir_to_irr = NULL;
+ vt_x86_ops.sync_pir_to_irr = NULL;

if (!enable_apicv || !cpu_has_vmx_ipiv())
enable_ipiv = false;
@@ -8553,7 +8391,7 @@ static __init int hardware_setup(void)
enable_pml = 0;

if (!enable_pml)
- vmx_x86_ops.cpu_dirty_log_size = 0;
+ vt_x86_ops.cpu_dirty_log_size = 0;

if (!cpu_has_vmx_preemption_timer())
enable_preemption_timer = false;
@@ -8578,9 +8416,9 @@ static __init int hardware_setup(void)
}

if (!enable_preemption_timer) {
- vmx_x86_ops.set_hv_timer = NULL;
- vmx_x86_ops.cancel_hv_timer = NULL;
- vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+ vt_x86_ops.set_hv_timer = NULL;
+ vt_x86_ops.cancel_hv_timer = NULL;
+ vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
}

kvm_caps.supported_mce_cap |= MCG_LMCE_P;
@@ -8591,9 +8429,9 @@ static __init int hardware_setup(void)
if (!enable_ept || !enable_pmu || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
if (pt_mode == PT_MODE_HOST_GUEST)
- vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+ vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
else
- vmx_init_ops.handle_intel_pt_intr = NULL;
+ vt_init_ops.handle_intel_pt_intr = NULL;

setup_default_sgx_lepubkeyhash();

@@ -8616,14 +8454,6 @@ static __init int hardware_setup(void)
return r;
}

-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
- .hardware_setup = hardware_setup,
- .handle_intel_pt_intr = NULL,
-
- .runtime_ops = &vmx_x86_ops,
- .pmu_ops = &intel_pmu_ops,
-};
-
static void vmx_cleanup_l1d_flush(void)
{
if (vmx_l1d_flush_pages) {
@@ -8665,7 +8495,7 @@ static int __init vmx_init(void)
*/
hv_init_evmcs();

- r = kvm_x86_vendor_init(&vmx_init_ops);
+ r = kvm_x86_vendor_init(&vt_init_ops);
if (r)
return r;

diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..0c8fbd29b969
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include "x86.h"
+
+__init int vmx_hardware_setup(void);
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+void vmx_hardware_unsetup(void);
+int vmx_check_processor_compat(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+bool vmx_is_vm_type_supported(unsigned long type);
+int vmx_vm_init(struct kvm *kvm);
+void vmx_vm_destroy(struct kvm *kvm);
+int vmx_vcpu_precreate(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+#ifdef CONFIG_KVM_SMM
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+#endif
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_inject_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1

2023-10-16 16:37:41

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 008/116] [MARKER] The start of TDX KVM patch series: TDX architectural definitions

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch series for TDX
architectural definitions.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/index.rst | 2 ++
.../virt/kvm/intel-tdx-layer-status.rst | 29 +++++++++++++++++++
2 files changed, 31 insertions(+)
create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst

diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index ad13ec55ddfe..ccff56dca2b1 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -19,3 +19,5 @@ KVM
vcpu-requests
halt-polling
review-checklist
+
+ intel-tdx-layer-status
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
new file mode 100644
index 000000000000..f11ea701dc19
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Layer status
+============
+What qemu can do
+----------------
+- TDX VM TYPE is exposed to Qemu.
+- Qemu can try to create a VM of the TDX VM type, which then fails.
+
+Patch Layer status
+------------------
+ Patch layer Status
+
+* TDX, VMX coexistence: Applied
+* TDX architectural definitions: Applying
+* TD VM creation/destruction: Not yet
+* TD vcpu creation/destruction: Not yet
+* TDX EPT violation: Not yet
+* TD finalization: Not yet
+* TD vcpu enter/exit: Not yet
+* TD vcpu interrupts/exit/hypercall: Not yet
+
+* KVM MMU GPA shared bits: Not yet
+* KVM TDP refactoring for TDX: Not yet
+* KVM TDP MMU hooks: Not yet
--
2.25.1

2023-10-16 16:37:46

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 098/116] KVM: TDX: Handle MSRs IA32_FEAT_CTL and IA32_MCG_EXT_CTL

From: Isaku Yamahata <[email protected]>

MCE and MCA are advertised via CPUID based on the TDX module spec. The
guest kernel can access IA32_FEAT_CTL to check whether LMCE is enabled by
the platform, and IA32_MCG_EXT_CTL to enable LMCE. Make TDX KVM handle
these MSRs. Otherwise, guest MSR accesses to them via TDG.VP.VMCALL<MSR>
on #VE result in a #GP in the guest.

Because LMCE is disabled by qemu by default, "-cpu lmce=on" must be added
to the qemu command line to reproduce the issue.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 726e28f30354..7f8c89fd556a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1950,6 +1950,7 @@ bool tdx_has_emulated_msr(u32 index, bool write)
default:
return true;
}
+ case MSR_IA32_FEAT_CTL:
case MSR_IA32_APICBASE:
case MSR_EFER:
return !write;
@@ -1964,6 +1965,20 @@ bool tdx_has_emulated_msr(u32 index, bool write)
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
switch (msr->index) {
+ case MSR_IA32_FEAT_CTL:
+ /*
+ * MCE and MCA are advertised via CPUID. The guest kernel can
+ * check whether LMCE is enabled.
+ */
+ msr->data = FEAT_CTL_LOCKED;
+ if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+ msr->data |= FEAT_CTL_LMCE_ENABLED;
+ return 0;
+ case MSR_IA32_MCG_EXT_CTL:
+ if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
+ return 1;
+ msr->data = vcpu->arch.mcg_ext_ctl;
+ return 0;
case MSR_MTRRcap:
/*
* Override kvm_mtrr_get_msr() which hardcodes the value.
@@ -1982,6 +1997,11 @@ int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
switch (msr->index) {
+ case MSR_IA32_MCG_EXT_CTL:
+ if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
+ return 1;
+ vcpu->arch.mcg_ext_ctl = msr->data;
+ return 0;
case MSR_MTRRdefType:
/*
* Allow writeback only for all memory.
--
2.25.1

2023-10-16 16:37:49

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 028/116] KVM: x86/mmu: introduce config for PRIVATE KVM MMU

From: Isaku Yamahata <[email protected]>

To keep the non-TDX case intact, introduce a new config option for
private KVM MMU support. At the moment, it is a synonym for
CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL. A dedicated option makes it
clear that the support is specific to the x86 KVM MMU.
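
As an illustration (not part of this patch), later patches could key
MMU-private state off the new option roughly as sketched below;
gfn_shared_mask is a hypothetical field name used only to show the
intended #ifdef usage.

	#include <linux/kvm_types.h>

	struct example_mmu_arch {
	#ifdef CONFIG_KVM_MMU_PRIVATE
		/* hypothetical per-VM mask separating private/shared GPAs */
		gfn_t gfn_shared_mask;
	#endif
	};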

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4a864dbaa982..e08cc4ad5749 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -168,4 +168,8 @@ config KVM_PROVE_MMU
config KVM_EXTERNAL_WRITE_TRACKING
bool

+config KVM_MMU_PRIVATE
+ def_bool y
+ depends on INTEL_TDX_HOST && KVM_INTEL
+
endif # VIRTUALIZATION
--
2.25.1

2023-10-16 16:38:00

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 099/116] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall

From: Isaku Yamahata <[email protected]>

Implement the TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input
value is zero, return the success code and zero in the output registers.

TDG.VP.VMCALL<GetTdVmCallInfo> is a subleaf of TDG.VP.VMCALL that
enumerates which TDG.VP.VMCALL subleaves are supported. The hypercall is
intended for future enhancement of the Guest-Host-Communication Interface
(GHCI) specification. GHCI version 344426-001US defines it to require
input R12 to be zero and to return zero in the output registers R11, R12,
R13, and R14, so that the guest TD enumerates no enhancements.
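
For illustration only, the register convention described above could be
exercised from a guest TD roughly as sketched below. The structure and
helper names are placeholders for the guest's TDG.VP.VMCALL plumbing,
not the actual guest API.

	struct example_tdvmcall_regs {
		u64 r10, r11, r12, r13, r14;
	};

	/*
	 * example_tdvmcall() is a hypothetical stand-in for the guest's
	 * TDG.VP.VMCALL helper; it returns the status code from R10.
	 */
	static bool example_td_has_vmcall_enhancements(void)
	{
		struct example_tdvmcall_regs regs = {
			.r11 = TDG_VP_VMCALL_GET_TD_VM_CALL_INFO,
			.r12 = 0,	/* GHCI: input leaf must be zero */
		};

		if (example_tdvmcall(&regs) != TDG_VP_VMCALL_SUCCESS)
			return false;

		/* GHCI: R11-R14 come back as zero, i.e. no enhancements. */
		return regs.r11 | regs.r12 | regs.r13 | regs.r14;
	}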

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7f8c89fd556a..b4a0fa51ee62 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1331,6 +1331,20 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
+{
+ if (tdvmcall_a0_read(vcpu))
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+ else {
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+ kvm_r11_write(vcpu, 0);
+ tdvmcall_a0_write(vcpu, 0);
+ tdvmcall_a1_write(vcpu, 0);
+ tdvmcall_a2_write(vcpu, 0);
+ }
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1349,6 +1363,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_rdmsr(vcpu);
case EXIT_REASON_MSR_WRITE:
return tdx_emulate_wrmsr(vcpu);
+ case TDG_VP_VMCALL_GET_TD_VM_CALL_INFO:
+ return tdx_get_td_vm_call_info(vcpu);
default:
break;
}
--
2.25.1

2023-10-16 16:37:59

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 011/116] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

From: Isaku Yamahata <[email protected]>

A VMM interacts with the TDX module using a new instruction (SEAMCALL).
For instance, a TDX VMM does not have full access to the VM control
structure corresponding to the VMX VMCS. Instead, the VMM induces the TDX
module to act on its behalf via SEAMCALLs.

Export __seamcall and define C wrapper functions for the SEAMCALLs for
readability.

Some SEAMCALL APIs donate host pages to the TDX module or to a guest TD,
and the donated pages are encrypted. Those APIs require the VMM to flush
the cache lines to avoid cache line aliasing.
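
For illustration only (not part of this patch), a caller of these
wrappers might look like the sketch below. The field ID is a
placeholder, and reading the result from r8 is an assumption based on
the TDH.MNG.RD convention implied by the wrapper's output argument.

	static u64 example_read_td_field(hpa_t tdr_pa, u64 field_id)
	{
		struct tdx_module_args out;
		u64 err;

		err = tdh_mng_rd(tdr_pa, field_id, &out);
		if (err)
			return 0;	/* real callers would log/propagate the error */

		return out.r8;	/* assumption: TDH.MNG.RD returns the value in R8 */
	}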

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
v15 -> v16:
- use struct tdx_module_args instead of struct tdx_module_output
- Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.
---
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/tdx.h | 12 ++
arch/x86/kvm/vmx/tdx_ops.h | 226 ++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/seamcall.S | 4 +
4 files changed, 243 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h

diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index b1a98fa38828..1a8feb440252 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -6,6 +6,7 @@
#include <asm/page.h>
#include <asm/checksum.h>
#include <asm/mce.h>
+#include <asm/tdx.h>

#include <asm-generic/asm-prototypes.h>

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9aaf40107819..ec2b8e67fe07 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -116,6 +116,18 @@ int tdx_enable(void);
void tdx_reset_memory(void);
bool tdx_is_private_mem(unsigned long phys);
#else
+static inline u64 __seamcall(u64 fn, struct tdx_module_args *args)
+{
+ return TDX_SEAMCALL_UD;
+}
+static inline u64 __seamcall_ret(u64 fn, struct tdx_module_args *args)
+{
+ return TDX_SEAMCALL_UD;
+}
+static inline u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args)
+{
+ return TDX_SEAMCALL_UD;
+}
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..12fd6b8d49e0
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,226 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/cacheflush.h>
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+#include "x86.h"
+
+static inline u64 tdx_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_args *out)
+{
+ u64 ret;
+
+ if (out) {
+ *out = (struct tdx_module_args) {
+ .rcx = rcx,
+ .rdx = rdx,
+ .r8 = r8,
+ .r9 = r9,
+ };
+ ret = __seamcall_ret(op, out);
+ } else {
+ struct tdx_module_args args = {
+ .rcx = rcx,
+ .rdx = rdx,
+ .r8 = r8,
+ .r9 = r9,
+ };
+ ret = __seamcall(op, &args);
+ }
+ if (unlikely(ret == TDX_SEAMCALL_UD)) {
+ /*
+ * SEAMCALLs fail with TDX_SEAMCALL_UD returned when VMX is off.
+ * This can happen when the host gets rebooted or live
+ * updated. In this case, the instruction execution is ignored
+ * as KVM is shut down, so the error code is suppressed. Other
+ * than this, the error is unexpected and the execution can't
+ * continue as the TDX features rely on VMX being on.
+ */
+ kvm_spurious_fault();
+ return 0;
+ }
+ return ret;
+}
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return tdx_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+ struct tdx_module_args *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_module_args *out)
+{
+ clflush_cache_range(__va(page), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
+}
+
+static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MEM_SEPT_RD, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MEM_SEPT_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return tdx_seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_args *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_args *out)
+{
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+ return tdx_seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+ clflush_cache_range(__va(tdr), PAGE_SIZE);
+ return tdx_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+ clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+ return tdx_seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MNG_RD, tdr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+ return tdx_seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+ return tdx_seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+ return tdx_seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+ return tdx_seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, out);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+ return tdx_seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_VP_RD, tdvpr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+ return tdx_seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+ return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+ return tdx_seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+ return tdx_seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+ return tdx_seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+ struct tdx_module_args *out)
+{
+ return tdx_seamcall(TDH_VP_WR, tdvpr, field, val, mask, out);
+}
+
+#endif /* __KVM_X86_TDX_OPS_H */
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index 5b1f2286aea9..73042f7f23be 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/linkage.h>
+#include <asm/export.h>
#include <asm/frame.h>

#include "tdxcall.S"
@@ -21,6 +22,7 @@
SYM_FUNC_START(__seamcall)
TDX_MODULE_CALL host=1
SYM_FUNC_END(__seamcall)
+EXPORT_SYMBOL_GPL(__seamcall)

/*
* __seamcall_ret() - Host-side interface functions to SEAM software
@@ -40,6 +42,7 @@ SYM_FUNC_END(__seamcall)
SYM_FUNC_START(__seamcall_ret)
TDX_MODULE_CALL host=1 ret=1
SYM_FUNC_END(__seamcall_ret)
+EXPORT_SYMBOL_GPL(__seamcall_ret)

/*
* __seamcall_saved_ret() - Host-side interface functions to SEAM software
@@ -59,3 +62,4 @@ SYM_FUNC_END(__seamcall_ret)
SYM_FUNC_START(__seamcall_saved_ret)
TDX_MODULE_CALL host=1 ret=1 saved=1
SYM_FUNC_END(__seamcall_saved_ret)
+EXPORT_SYMBOL_GPL(__seamcall_saved_ret)
--
2.25.1

2023-10-16 16:38:05

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 078/116] KVM: TDX: Implement methods to inject NMI

From: Isaku Yamahata <[email protected]>

The TDX vcpu control structure defines a pending-NMI bit that lets the
VMM inject an NMI by setting the bit, without knowing the TDX vcpu's NMI
state. Because the vcpu state is protected, the VMM can't observe the
NMI state of a TDX vcpu. The TDX module handles the actual injection and
the NMI state transitions.

Add the NMI methods and treat NMIs as always injectable.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 64 +++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.c | 6 ++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
3 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0ce9578a1453..91e2ad643a47 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -315,6 +315,60 @@ static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
vmx_flush_tlb_guest(vcpu);
}

+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_inject_nmi(vcpu);
+ return;
+ }
+
+ vmx_inject_nmi(vcpu);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ /*
+ * The TDX module manages NMI windows and NMI reinjection, and hides NMI
+ * blocking; all KVM can do is throw an NMI over the wall.
+ */
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Assume NMIs are always unmasked. KVM could query PEND_NMI and treat
+ * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
+ * expensive and the end result is unchanged as the only relevant usage
+ * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
+ * only changes whether KVM or the TDX module drops an NMI.
+ */
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+ /* Refer the comment in vt_get_nmi_mask(). */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_enable_nmi_window(vcpu);
+}
+
static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
@@ -493,14 +547,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.get_interrupt_shadow = vt_get_interrupt_shadow,
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vt_inject_irq,
- .inject_nmi = vmx_inject_nmi,
+ .inject_nmi = vt_inject_nmi,
.inject_exception = vmx_inject_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
+ .nmi_allowed = vt_nmi_allowed,
+ .get_nmi_mask = vt_get_nmi_mask,
+ .set_nmi_mask = vt_set_nmi_mask,
+ .enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index beb3fa2d3e41..f5e29f8f1b6a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -848,6 +848,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
return EXIT_FASTPATH_NONE;
}

+void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.nmi_injections;
+ td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 454f28099868..9a949273bf47 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -159,6 +159,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
+void tdx_inject_nmi(struct kvm_vcpu *vcpu);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -195,6 +196,7 @@ static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)

static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
+static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

--
2.25.1

2023-10-16 16:38:28

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 037/116] KVM: x86/mmu: Allow per-VM override of the TDP max page level

From: Sean Christopherson <[email protected]>

TDX requires special handling to support large private pages. For
simplicity, only support 4K pages for TD guests for now. Add per-VM
maximum page level support to allow different maximum page sizes for TD
guests and conventional VMX guests.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Acked-by: Kai Huang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a2fd25fb2f9c..3970473d1807 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1253,6 +1253,7 @@ struct kvm_arch {
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ int tdp_max_page_level;
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e3ddf2e7dbf..eee5a46f5ce3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6301,6 +6301,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;

+ kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
return 0;
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 83dbe8245ebf..8de1192b1cca 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -296,7 +296,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.nx_huge_page_workaround_enabled =
is_nx_huge_page_enabled(vcpu->kvm),

- .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+ .max_level = vcpu->kvm->arch.tdp_max_page_level,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
};
--
2.25.1

2023-10-16 16:38:41

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 075/116] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c

From: Isaku Yamahata <[email protected]>

As TDX will use posted_intr.c, its use of struct vcpu_vmx is a blocker.
Because the members struct pi_desc pi_desc and struct list_head
pi_wakeup_list are only used in posted_intr.c, introduce a common
structure, struct vcpu_pi, and make vcpu_vmx and vcpu_tdx share the same
layout at the top of the structure.

To minimize the diff size, avoid a code conversion such as
vmx->pi_desc => vmx->common->pi_desc. Instead, add compile-time checks
that the layout matches expectations.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 41 ++++++++++++++++++++++++++--------
arch/x86/kvm/vmx/posted_intr.h | 11 +++++++++
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/vmx/tdx.h | 8 +++++++
arch/x86/kvm/vmx/vmx.h | 14 +++++++-----
5 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index af662312fd07..b66add9da0f3 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -11,6 +11,7 @@
#include "posted_intr.h"
#include "trace.h"
#include "vmx.h"
+#include "tdx.h"

/*
* Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
@@ -31,9 +32,29 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
*/
static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);

+/*
+ * The layout of the head of struct vcpu_vmx and struct vcpu_tdx must match with
+ * struct vcpu_pi.
+ */
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+ offsetof(struct vcpu_vmx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+ offsetof(struct vcpu_vmx, pi_wakeup_list));
+#ifdef CONFIG_INTEL_TDX_HOST
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+ offsetof(struct vcpu_tdx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+ offsetof(struct vcpu_tdx, pi_wakeup_list));
+#endif
+
+static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
+{
+ return (struct vcpu_pi *)vcpu;
+}
+
static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
- return &(to_vmx(vcpu)->pi_desc);
+ return &vcpu_to_pi(vcpu)->pi_desc;
}

static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
@@ -52,8 +73,8 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)

void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+ struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
struct pi_desc old, new;
unsigned long flags;
unsigned int dest;
@@ -90,7 +111,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
*/
if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- list_del(&vmx->pi_wakeup_list);
+ list_del(&vcpu_pi->pi_wakeup_list);
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
}

@@ -145,15 +166,15 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
*/
static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+ struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
struct pi_desc old, new;
unsigned long flags;

local_irq_save(flags);

raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- list_add_tail(&vmx->pi_wakeup_list,
+ list_add_tail(&vcpu_pi->pi_wakeup_list,
&per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

@@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
* notification vector is switched to the one that calls
* back to the pi_wakeup_handler() function.
*/
- return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
+ return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
+ vmx_can_use_vtd_pi(vcpu->kvm);
}

void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
@@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
if (!vmx_needs_pi_wakeup(vcpu))
return;

- if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+ if (kvm_vcpu_is_blocking(vcpu) &&
+ (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
pi_enable_wakeup_handler(vcpu);

/*
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..2fe8222308b2 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -94,6 +94,17 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}

+struct vcpu_pi {
+ struct kvm_vcpu vcpu;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here, common layout between vcpu_vmx and vcpu_tdx. */
+};
+
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 173a29b36432..81f659e23b78 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -558,6 +558,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

fpstate_set_confidential(&vcpu->arch.guest_fpu);
vcpu->arch.apic->guest_apic_protected = true;
+ INIT_LIST_HEAD(&tdx->pi_wakeup_list);

vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 4f803814126a..b29fcda2892b 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -4,6 +4,7 @@

#ifdef CONFIG_INTEL_TDX_HOST

+#include "posted_intr.h"
#include "pmu_intel.h"
#include "tdx_ops.h"

@@ -67,6 +68,13 @@ union tdx_exit_reason {
struct vcpu_tdx {
struct kvm_vcpu vcpu;

+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here same layout to struct vcpu_pi. */
+
unsigned long tdvpr_pa;
unsigned long *tdvpx_pa;

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2847e368c8e6..c89cc546f8df 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -233,6 +233,14 @@ struct nested_vmx {

struct vcpu_vmx {
struct kvm_vcpu vcpu;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here same layout to struct vcpu_pi. */
+
u8 fail;
u8 x2apic_msr_bitmap_mode;

@@ -302,12 +310,6 @@ struct vcpu_vmx {

union vmx_exit_reason exit_reason;

- /* Posted interrupt descriptor */
- struct pi_desc pi_desc;
-
- /* Used if this vCPU is waiting for PI notification wakeup. */
- struct list_head pi_wakeup_list;
-
/* Support for a guest hypervisor (nested VMX) */
struct nested_vmx nested;

--
2.25.1

2023-10-16 16:39:26

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 036/116] KVM: x86/mmu: Disallow fast page fault on private GPA

From: Isaku Yamahata <[email protected]>

TDX requires SEAMCALLs to operate on the Secure EPT instead of direct
memory access, and a SEAMCALL is a heavyweight operation. Fast page fault
on a private GPA therefore doesn't make sense. Disallow fast page fault
on private GPAs.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 469e73283824..1e3ddf2e7dbf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3343,8 +3343,16 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
return RET_PF_CONTINUE;
}

-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm *kvm, struct kvm_page_fault *fault)
{
+ /*
+ * TDX private mapping doesn't support fast page fault because the EPT
+ * entry is read/written with TDX SEAMCALLs instead of direct memory
+ * access.
+ */
+ if (kvm_is_private_gpa(kvm, fault->addr))
+ return false;
+
/*
* Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
* reach the common page fault handler if the SPTE has an invalid MMIO
@@ -3454,7 +3462,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 *sptep = NULL;
uint retry_count = 0;

- if (!page_fault_can_be_fast(fault))
+ if (!page_fault_can_be_fast(vcpu->kvm, fault))
return ret;

walk_shadow_page_lockless_begin(vcpu);
--
2.25.1

2023-10-16 16:39:27

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 006/116] KVM: TDX: Add placeholders for TDX VM/vcpu structure

From: Isaku Yamahata <[email protected]>

Add placeholder TDX VM/vcpu structures that overlay the VMX VM/vcpu
structures. Initialize the VM structure size and the vcpu size/alignment
so that the x86 KVM common code knows those sizes irrespective of VMX or
TDX. These structures will be populated as the guest creation logic
develops.

Add helper functions to check whether a VM is a guest TD, and add
conversion functions between KVM VM/vCPU and TDX VM/vCPU.
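
For illustration (not part of this patch), a vt_* dispatch helper in
main.c could use the new predicates and conversion helpers roughly as
below once TDX callbacks exist; vt_example_op(), tdx_example_op() and
vmx_example_op() are hypothetical names.

	static void vt_example_op(struct kvm_vcpu *vcpu)
	{
		if (is_td_vcpu(vcpu)) {
			/* TDX-specific handling, using the overlay type. */
			struct vcpu_tdx *tdx = to_tdx(vcpu);

			tdx_example_op(tdx);
			return;
		}

		vmx_example_op(vcpu);
	}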

Signed-off-by: Isaku Yamahata <[email protected]>

---
v14 -> v15:
- use KVM_X86_TDX_VM
---
arch/x86/kvm/vmx/main.c | 18 +++++++++++++--
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/vmx/tdx.h | 50 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 67 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kvm/vmx/tdx.h

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 14ab7d5cdded..1d423abd124b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -5,6 +5,7 @@
#include "vmx.h"
#include "nested.h"
#include "pmu.h"
+#include "tdx.h"

static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);
@@ -210,6 +211,21 @@ static int __init vt_init(void)
*/
hv_init_evmcs();

+ /*
+ * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
+ * be set before kvm_x86_vendor_init().
+ */
+ vcpu_size = sizeof(struct vcpu_vmx);
+ vcpu_align = __alignof__(struct vcpu_vmx);
+ if (enable_tdx) {
+ vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
+ sizeof(struct kvm_tdx));
+ vcpu_size = max_t(unsigned int, vcpu_size,
+ sizeof(struct vcpu_tdx));
+ vcpu_align = max_t(unsigned int, vcpu_align,
+ __alignof__(struct vcpu_tdx));
+ }
+
r = vmx_init();
if (r)
goto err_vmx_init;
@@ -222,8 +238,6 @@ static int __init vt_init(void)
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
*/
- vcpu_size = sizeof(struct vcpu_vmx);
- vcpu_align = __alignof__(struct vcpu_vmx);
r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
if (r)
goto err_kvm_init;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8a378fb6f1d4..1c9884164566 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
#include "capabilities.h"
#include "x86_ops.h"
#include "x86.h"
+#include "tdx.h"

#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
new file mode 100644
index 000000000000..473013265bd8
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_H
+#define __KVM_X86_TDX_H
+
+#ifdef CONFIG_INTEL_TDX_HOST
+struct kvm_tdx {
+ struct kvm kvm;
+ /* TDX specific members follow. */
+};
+
+struct vcpu_tdx {
+ struct kvm_vcpu vcpu;
+ /* TDX specific members follow. */
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+ return is_td(vcpu->kvm);
+}
+
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+ return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+ return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+#else
+struct kvm_tdx {
+ struct kvm kvm;
+};
+
+struct vcpu_tdx {
+ struct kvm_vcpu vcpu;
+};
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_H */
--
2.25.1

2023-10-16 16:39:30

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 005/116] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

From: Isaku Yamahata <[email protected]>

TDX requires several initialization steps for KVM to create guest TDs:
detect the CPU feature, enable VMX (TDX is based on VMX) on all online
CPUs, detect the TDX module's availability, initialize it, and disable
VMX again.

To enable/disable VMX on all online CPUs, use
vmx_hardware_enable/disable(). The method also initializes each CPU for
TDX. TDX requires calling a TDX initialization function per logical
processor (LP) before the LP uses TDX. When a CPU is coming online, call
the TDX LP initialization API. If TDX initialization fails, refuse to
bring the CPU online for simplicity, rather than having TDX avoid the
failed LP.

There are several options for when to initialize the TDX module: A.)
kernel module loading time, or B.) the first guest TD creation time. A.)
was chosen. With B.), a user may hit a TDX initialization error when
trying to create the first guest TD, and a machine that fails to
initialize the TDX module can't boot any guest TD at all. Such a failure
is undesirable and surprising because the user expects the machine to be
able to host guest TDs. So A.) is better than B.).

Introduce a module parameter, kvm_intel.tdx, to explicitly enable TDX KVM
support. It's off by default to keep behavior unchanged for those who
don't use TDX. Implement the hardware_setup method to detect the TDX CPU
feature and initialize the TDX module.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 1 +
arch/x86/kvm/vmx/main.c | 34 ++++++++++++++-
arch/x86/kvm/vmx/tdx.c | 84 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 8 ++++
4 files changed, 125 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kvm/vmx/tdx.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 0e894ae23cbc..4b01ab842ab7 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -25,6 +25,7 @@ kvm-$(CONFIG_KVM_SMM) += smm.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o

kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
svm/sev.o svm/hyperv.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2e8a0de2c96a..14ab7d5cdded 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,36 @@
#include "nested.h"
#include "pmu.h"

+static bool enable_tdx __ro_after_init;
+module_param_named(tdx, enable_tdx, bool, 0444);
+
+static int vt_hardware_enable(void)
+{
+ int ret;
+
+ ret = vmx_hardware_enable();
+ if (ret || !enable_tdx)
+ return ret;
+
+ ret = tdx_cpu_enable();
+ if (ret)
+ vmx_hardware_disable();
+ return ret;
+}
+
+static __init int vt_hardware_setup(void)
+{
+ int ret;
+
+ ret = vmx_hardware_setup();
+ if (ret)
+ return ret;
+
+ enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
+
+ return 0;
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -22,7 +52,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.hardware_unsetup = vmx_hardware_unsetup,

- .hardware_enable = vmx_hardware_enable,
+ .hardware_enable = vt_hardware_enable,
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

@@ -159,7 +189,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
- .hardware_setup = vmx_hardware_setup,
+ .hardware_setup = vt_hardware_setup,
.handle_intel_pt_intr = NULL,

.runtime_ops = &vt_x86_ops,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
new file mode 100644
index 000000000000..8a378fb6f1d4
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/tdx.h>
+
+#include "capabilities.h"
+#include "x86_ops.h"
+#include "x86.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+static int __init tdx_module_setup(void)
+{
+ int ret;
+
+ ret = tdx_enable();
+ if (ret) {
+ pr_info("Failed to initialize TDX module.\n");
+ return ret;
+ }
+
+ return 0;
+}
+
+struct vmx_tdx_enabled {
+ cpumask_var_t vmx_enabled;
+ atomic_t err;
+};
+
+static void __init vmx_tdx_on(void *_vmx_tdx)
+{
+ struct vmx_tdx_enabled *vmx_tdx = _vmx_tdx;
+ int r;
+
+ r = vmx_hardware_enable();
+ if (!r) {
+ cpumask_set_cpu(smp_processor_id(), vmx_tdx->vmx_enabled);
+ r = tdx_cpu_enable();
+ }
+ if (r)
+ atomic_set(&vmx_tdx->err, r);
+}
+
+static void __init vmx_off(void *_vmx_enabled)
+{
+ cpumask_var_t *vmx_enabled = (cpumask_var_t *)_vmx_enabled;
+
+ if (cpumask_test_cpu(smp_processor_id(), *vmx_enabled))
+ vmx_hardware_disable();
+}
+
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+ struct vmx_tdx_enabled vmx_tdx = {
+ .err = ATOMIC_INIT(0),
+ };
+ int r = 0;
+
+ if (!enable_ept) {
+ pr_warn("Cannot enable TDX with EPT disabled\n");
+ return -EINVAL;
+ }
+
+ if (!zalloc_cpumask_var(&vmx_tdx.vmx_enabled, GFP_KERNEL)) {
+ r = -ENOMEM;
+ goto out;
+ }
+
+ /* tdx_enable() in tdx_module_setup() requires cpus lock. */
+ cpus_read_lock();
+ on_each_cpu(vmx_tdx_on, &vmx_tdx, true); /* TDX requires vmxon. */
+ r = atomic_read(&vmx_tdx.err);
+ if (!r)
+ r = tdx_module_setup();
+ else
+ r = -EIO;
+ on_each_cpu(vmx_off, &vmx_tdx.vmx_enabled, true);
+ cpus_read_unlock();
+ free_cpumask_var(vmx_tdx.vmx_enabled);
+
+out:
+ return r;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index c6bb2fe9ddb9..2c4fe3aff285 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -18,6 +18,8 @@ bool kvm_is_vmx_supported(void);
int __init vmx_init(void);
void vmx_exit(void);

+__init int vmx_hardware_setup(void);
+
extern struct kvm_x86_ops vt_x86_ops __initdata;
extern struct kvm_x86_init_ops vt_init_ops __initdata;

@@ -133,4 +135,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
#endif
void vmx_setup_mce(struct kvm_vcpu *vcpu);

+#ifdef CONFIG_INTEL_TDX_HOST
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+#else
+static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
+#endif
+
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1

2023-10-16 16:39:42

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 038/116] KVM: VMX: Introduce test mode related to EPT violation VE

From: Isaku Yamahata <[email protected]>

To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM is
programmed to inject #VE conditionally and to set the #VE suppress bit in
the EPT entry. For the VMX case, #VE isn't used; if a #VE happens for
VMX, it's a bug. To be defensive (and to test that the VMX case isn't
broken), introduce the option ept_violation_ve_test and, when it is set,
report an error on an unexpected #VE.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 12 +++++++
arch/x86/kvm/vmx/vmcs.h | 5 +++
arch/x86/kvm/vmx/vmx.c | 69 +++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +++-
4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 76ed39541a52..f703bae0c4ac 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -70,6 +70,7 @@
#define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
#define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
#define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
#define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
#define SECONDARY_EXEC_ENABLE_XSAVES VMCS_CONTROL_BIT(XSAVES)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -225,6 +226,8 @@ enum vmcs_field {
VMREAD_BITMAP_HIGH = 0x00002027,
VMWRITE_BITMAP = 0x00002028,
VMWRITE_BITMAP_HIGH = 0x00002029,
+ VE_INFORMATION_ADDRESS = 0x0000202A,
+ VE_INFORMATION_ADDRESS_HIGH = 0x0000202B,
XSS_EXIT_BITMAP = 0x0000202C,
XSS_EXIT_BITMAP_HIGH = 0x0000202D,
ENCLS_EXITING_BITMAP = 0x0000202E,
@@ -630,4 +633,13 @@ enum vmx_l1d_flush_state {

extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;

+struct vmx_ve_information {
+ u32 exit_reason;
+ u32 delivery;
+ u64 exit_qualification;
+ u64 guest_linear_address;
+ u64 guest_physical_address;
+ u16 eptp_index;
+};
+
#endif
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 7c1996b433e2..b25625314658 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -140,6 +140,11 @@ static inline bool is_nm_fault(u32 intr_info)
return is_exception_n(intr_info, NM_VECTOR);
}

+static inline bool is_ve_fault(u32 intr_info)
+{
+ return is_exception_n(intr_info, VE_VECTOR);
+}
+
/* Undocumented: icebp/int1 */
static inline bool is_icebp(u32 intr_info)
{
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 724bce9387e0..0dc56051589e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -126,6 +126,9 @@ module_param(error_on_inconsistent_vmcs_config, bool, 0444);
static bool __read_mostly dump_invalid_vmcs = 0;
module_param(dump_invalid_vmcs, bool, 0644);

+static bool __read_mostly ept_violation_ve_test;
+module_param(ept_violation_ve_test, bool, 0444);
+
#define MSR_BITMAP_MODE_X2APIC 1
#define MSR_BITMAP_MODE_X2APIC_APICV 2

@@ -869,6 +872,13 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)

eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
(1u << DB_VECTOR) | (1u << AC_VECTOR);
+ /*
+ * #VE isn't used by VMX, only by TDX. To test against unexpected
+ * changes related to #VE for VMX, intercept unexpected #VE and warn
+ * on it.
+ */
+ if (ept_violation_ve_test)
+ eb |= 1u << VE_VECTOR;
/*
* Guest access to VMware backdoor ports could legitimately
* trigger #GP because of TSS I/O permission bitmap.
@@ -2602,6 +2612,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_cpu_based_2nd_exec_control))
return -EIO;
}
+ if (!ept_violation_ve_test)
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
+
#ifndef CONFIG_X86_64
if (!(_cpu_based_2nd_exec_control &
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
@@ -2626,6 +2639,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
return -EIO;

vmx_cap->ept = 0;
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
}
if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
vmx_cap->vpid) {
@@ -4588,6 +4602,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
if (!enable_ept) {
exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+ exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
enable_unrestricted_guest = 0;
}
if (!enable_unrestricted_guest)
@@ -4711,8 +4726,40 @@ static void init_vmcs(struct vcpu_vmx *vmx)

exec_controls_set(vmx, vmx_exec_control(vmx));

- if (cpu_has_secondary_exec_ctrls())
+ if (cpu_has_secondary_exec_ctrls()) {
secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
+ if (secondary_exec_controls_get(vmx) &
+ SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ if (!vmx->ve_info) {
+ /* ve_info must be page aligned. */
+ struct page *page;
+
+ BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (page)
+ vmx->ve_info = page_to_virt(page);
+ }
+ if (vmx->ve_info) {
+ /*
+ * Allow #VE delivery. CPU sets this field to
+ * 0xFFFFFFFF on #VE delivery. Another #VE can
+ * occur only if software clears the field.
+ */
+ vmx->ve_info->delivery = 0;
+ vmcs_write64(VE_INFORMATION_ADDRESS,
+ __pa(vmx->ve_info));
+ } else {
+ /*
+ * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
+ * used only when ept_violation_ve_test is true,
+ * it's okay to go with the bit disabled.
+ */
+ pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
+ secondary_exec_controls_clearbit(vmx,
+ SECONDARY_EXEC_EPT_VIOLATION_VE);
+ }
+ }
+ }

if (cpu_has_tertiary_exec_ctrls())
tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
@@ -5197,6 +5244,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
if (is_invalid_opcode(intr_info))
return handle_ud(vcpu);

+ /*
+ * #VE isn't supposed to happen. Bug the VM if one is observed.
+ */
+ if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
+ return -EIO;
+
error_code = 0;
if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
@@ -6384,6 +6437,18 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
pr_err("Virtual processor ID = 0x%04x\n",
vmcs_read16(VIRTUAL_PROCESSOR_ID));
+ if (secondary_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ struct vmx_ve_information *ve_info;
+
+ pr_err("VE info address = 0x%016llx\n",
+ vmcs_read64(VE_INFORMATION_ADDRESS));
+ ve_info = __va(vmcs_read64(VE_INFORMATION_ADDRESS));
+ pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
+ ve_info->exit_reason, ve_info->delivery,
+ ve_info->exit_qualification,
+ ve_info->guest_linear_address,
+ ve_info->guest_physical_address, ve_info->eptp_index);
+ }
}

/*
@@ -7424,6 +7489,8 @@ void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_vpid(vmx->vpid);
nested_vmx_free_vcpu(vcpu);
free_loaded_vmcs(vmx->loaded_vmcs);
+ if (vmx->ve_info)
+ free_page((unsigned long)vmx->ve_info);
}

int vmx_vcpu_create(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 41ad0e7a25fd..2847e368c8e6 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -347,6 +347,9 @@ struct vcpu_vmx {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
} shadow_msr_intercept;
+
+ /* ve_info must be page aligned. */
+ struct vmx_ve_information *ve_info;
};

struct kvm_vmx {
@@ -557,7 +560,8 @@ static inline u8 vmx_get_rvi(void)
SECONDARY_EXEC_ENABLE_VMFUNC | \
SECONDARY_EXEC_BUS_LOCK_DETECTION | \
SECONDARY_EXEC_NOTIFY_VM_EXITING | \
- SECONDARY_EXEC_ENCLS_EXITING)
+ SECONDARY_EXEC_ENCLS_EXITING | \
+ SECONDARY_EXEC_EPT_VIOLATION_VE)

#define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
#define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL \
--
2.25.1

2023-10-16 16:39:46

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 108/116] KVM: TDX: Add a method to ignore hypercall patching

From: Isaku Yamahata <[email protected]>

Because guest TD memory is protected, the VMM can't patch the guest binary
to insert the hypercall instruction. Add a method to ignore hypercall
patching with a warning. Note: the guest TD kernel needs to be modified to
use TDG.VP.VMCALL for hypercalls.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index df84a12aa232..0d0af3fadc75 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -740,6 +740,19 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
return vmx_get_interrupt_shadow(vcpu);
}

+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+ unsigned char *hypercall)
+{
+ /*
+ * Because guest memory is protected, guest can't be patched. TD kernel
+ * is modified to use TDG.VP.VMCALL for hypercall.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_patch_hypercall(vcpu, hypercall);
+}
+
static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
if (is_td_vcpu(vcpu))
@@ -1008,7 +1021,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
.get_interrupt_shadow = vt_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
+ .patch_hypercall = vt_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
.inject_exception = vt_inject_exception,
--
2.25.1

2023-10-16 16:39:52

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 085/116] KVM: TDX: handle ept violation/misconfig exit

From: Isaku Yamahata <[email protected]>

On EPT violation, call a common function, __vmx_handle_ept_violation(), to
trigger the x86 MMU code. On EPT misconfiguration, exit to ring 3 with
KVM_EXIT_UNKNOWN because EPT misconfiguration can't happen: MMIO is
triggered by TDG.VP.VMCALL instead. There is no point in setting a
misconfiguration value for the fast path.

Signed-off-by: Isaku Yamahata <[email protected]>

---
v14 -> v15:
- use PFERR_GUEST_ENC_MASK to tell the fault is private
---
arch/x86/kvm/vmx/common.h | 3 +++
arch/x86/kvm/vmx/tdx.c | 46 +++++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 632af7a76d0a..027aa4175d2c 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -87,6 +87,9 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;

+ if (kvm_is_private_gpa(vcpu->kvm, gpa))
+ error_code |= PFERR_GUEST_ENC_MASK;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 618c2ecec26a..f98b27cb9003 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1279,6 +1279,48 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}

+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+ unsigned long exit_qual;
+
+ if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+ /*
+ * Always treat SEPT violations as write faults. Ignore the
+ * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
+ * TD private pages are always RWX in the SEPT tables,
+ * i.e. they're always mapped writable. Just as importantly,
+ * treating SEPT violations as write faults is necessary to
+ * avoid COW allocations, which will cause TDAUGPAGE failures
+ * due to aliasing a single HPA to multiple GPAs.
+ */
+#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
+ exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
+ } else {
+ exit_qual = tdexit_exit_qual(vcpu);
+ if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+ pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
+ tdexit_gpa(vcpu), kvm_rip_read(vcpu));
+ vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
+ vcpu->run->ex.exception = PF_VECTOR;
+ vcpu->run->ex.error_code = exit_qual;
+ return 0;
+ }
+ }
+
+ trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+ WARN_ON_ONCE(1);
+
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+
+ return 0;
+}
+
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
{
union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
@@ -1339,6 +1381,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_handle_ept_violation(vcpu);
+ case EXIT_REASON_EPT_MISCONFIG:
+ return tdx_handle_ept_misconfig(vcpu);
case EXIT_REASON_OTHER_SMI:
/*
* If reach here, it's not a Machine Check System Management
--
2.25.1

2023-10-16 16:39:55

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 025/116] KVM: TDX: allocate/free TDX vcpu structure

From: Isaku Yamahata <[email protected]>

The next step of TDX guest creation is to create vcpus. Allocate the TDX
vcpu structures and initialize the parts that don't require a TDX SEAMCALL.
TDX-specific vcpu initialization will be implemented as an independent
KVM_TDX_INIT_VCPU ioctl so that when an error occurs it's easy to determine
which component, KVM or TDX, has the issue.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v15 -> v16:
- Add AMX support as the KVM upstream supports it.
---
arch/x86/kvm/vmx/main.c | 44 ++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 10 ++++++++
arch/x86/kvm/x86.c | 2 ++
4 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0bb087e3bbdf..7a9ee3ec0785 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -98,6 +98,42 @@ static void vt_vm_free(struct kvm *kvm)
tdx_vm_free(kvm);
}

+static int vt_vcpu_precreate(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_vcpu_precreate(kvm);
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_create(vcpu);
+
+ return vmx_vcpu_create(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_free(vcpu);
+ return;
+ }
+
+ vmx_vcpu_free(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_reset(vcpu, init_event);
+ return;
+ }
+
+ vmx_vcpu_reset(vcpu, init_event);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -136,10 +172,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vm_destroy = vt_vm_destroy,
.vm_free = vt_vm_free,

- .vcpu_precreate = vmx_vcpu_precreate,
- .vcpu_create = vmx_vcpu_create,
- .vcpu_free = vmx_vcpu_free,
- .vcpu_reset = vmx_vcpu_reset,
+ .vcpu_precreate = vt_vcpu_precreate,
+ .vcpu_create = vt_vcpu_create,
+ .vcpu_free = vt_vcpu_free,
+ .vcpu_reset = vt_vcpu_reset,

.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
.vcpu_load = vmx_vcpu_load,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 51aa114feb86..b7f8ac4b9f95 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -335,6 +335,55 @@ int tdx_vm_init(struct kvm *kvm)
return 0;
}

+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+ /*
+ * On cpu creation, cpuid entry is blank. Forcibly enable
+ * X2APIC feature to allow X2APIC.
+ * Because vcpu_reset() can't return error, allocation is done here.
+ */
+ WARN_ON_ONCE(vcpu->arch.cpuid_entries);
+ WARN_ON_ONCE(vcpu->arch.cpuid_nent);
+
+ /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
+ if (!vcpu->arch.apic)
+ return -EINVAL;
+
+ fpstate_set_confidential(&vcpu->arch.guest_fpu);
+
+ vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+ vcpu->arch.cr0_guest_owned_bits = -1ul;
+ vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+ vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
+ vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+ vcpu->arch.guest_state_protected =
+ !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+
+ if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
+ vcpu->arch.xfd_no_write_intercept = true;
+
+ return 0;
+}
+
+void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ /* This is stub for now. More logic will come. */
+}
+
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+
+ /* Ignore INIT silently because TDX doesn't support INIT event. */
+ if (init_event)
+ return;
+
+ /* This is stub for now. More logic will come here. */
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ef14c3873fe0..9b01104946b8 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -144,7 +144,12 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int tdx_vm_init(struct kvm *kvm);
void tdx_mmu_release_hkid(struct kvm *kvm);
void tdx_vm_free(struct kvm *kvm);
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+
+int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -158,7 +163,12 @@ static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
static inline void tdx_vm_free(struct kvm *kvm) {}
+
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 714685a31baf..7aff6f88f575 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -502,6 +502,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
kvm_recalculate_apic_map(vcpu->kvm);
return 0;
}
+EXPORT_SYMBOL_GPL(kvm_set_apic_base);

/*
* Handle a fault on a hardware virtualization (VMX or SVM) instruction.
@@ -12310,6 +12311,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
{
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);

bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
{
--
2.25.1

2023-10-16 16:39:57

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 114/116] RFC: KVM: x86: Add x86 callback to check cpuid

From: Isaku Yamahata <[email protected]>

The x86 backend should check the consistency of the KVM_SET_CPUID2 input
because it has its own constraints. Add a callback for it. The backend
implementation will come in a separate patch.
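
For illustration, a hedged sketch of what a backend implementation of the
new callback could look like; the specific check here (rejecting duplicate
CPUID.0x7 entries) is made up for the example and is not the TDX backend
that comes later in this series:

/*
 * Illustrative only: a hypothetical backend callback that rejects a
 * KVM_SET_CPUID2 request containing more than one CPUID.0x7 entry.
 * The real constraint checked by a given backend will differ.
 */
static int example_vcpu_check_cpuid(struct kvm_vcpu *vcpu,
				    struct kvm_cpuid_entry2 *e2, int nent)
{
	int i, seen = 0;

	for (i = 0; i < nent; i++) {
		if (e2[i].function == 0x7 && ++seen > 1)
			return -EINVAL;
	}
	return 0;
}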

Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 20 ++++++++++++--------
3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8b3c5f2179cf..ba74cb7199b3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,8 @@ KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
KVM_X86_OP(has_emulated_msr)
+/* TODO: Once all backend implemented this op, remove _OPTIONAL_RET0. */
+KVM_X86_OP_OPTIONAL_RET0(vcpu_check_cpuid)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP_OPTIONAL(max_vcpus);
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f0674ef1c17e..8ac6f0555ac3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1619,6 +1619,7 @@ struct kvm_x86_ops {
void (*hardware_unsetup)(void);
int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
+ int (*vcpu_check_cpuid)(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2, int nent);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

bool (*is_vm_type_supported)(unsigned long vm_type);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index da46f284ec65..7ce2fdba8fff 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -136,6 +136,7 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
{
struct kvm_cpuid_entry2 *best;
u64 xfeatures;
+ int r;

/*
* The existing code assumes virtual address is 48-bit or 57-bit in the
@@ -155,15 +156,18 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
* enabling in the FPU, e.g. to expand the guest XSAVE state size.
*/
best = cpuid_entry2_find(entries, nent, 0xd, 0);
- if (!best)
- return 0;
-
- xfeatures = best->eax | ((u64)best->edx << 32);
- xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
- if (!xfeatures)
- return 0;
+ if (best) {
+ xfeatures = best->eax | ((u64)best->edx << 32);
+ xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
+ if (xfeatures) {
+ r = fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu,
+ xfeatures);
+ if (r)
+ return r;
+ }
+ }

- return fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu, xfeatures);
+ return static_call(kvm_x86_vcpu_check_cpuid)(vcpu, entries, nent);
}

/* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
--
2.25.1

2023-10-16 16:40:20

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 002/116] KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_hardware_setup()

From: Isaku Yamahata <[email protected]>

vmx_hardware_disable() accesses loaded_vmcss_on_cpu via
hardware_disable_all(). To allow hardware_enable/disable_all() before
kvm_init(), initialize it in vmx_hardware_setup() so that TDX module
initialization (the hardware_setup method) can reference the variable.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f733629c4706..3627b704ff37 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8266,8 +8266,12 @@ __init int vmx_hardware_setup(void)
{
unsigned long host_bndcfgs;
struct desc_ptr dt;
+ int cpu;
int r;

+ /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
store_idt(&dt);
host_idt_base = dt.address;

@@ -8510,11 +8514,8 @@ static int __init vmx_init(void)
if (r)
goto err_l1d_flush;

- for_each_possible_cpu(cpu) {
- INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
-
+ for_each_possible_cpu(cpu)
pi_init_cpu(cpu);
- }

cpu_emergency_register_virt_callback(vmx_emergency_disable);

--
2.25.1

2023-10-16 16:40:21

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 003/116] KVM: x86/vmx: Refactor KVM VMX module init/exit functions

From: Isaku Yamahata <[email protected]>

Currently, KVM VMX module initialization/exit functions are a single
function each. Refactor KVM VMX module initialization functions into KVM
common part and VMX part so that TDX specific part can be added cleanly.
Opportunistically refactor module exit function as well.

The current module initialization flow is,
0.) Check if VMX is supported,
1.) hyper-v specific initialization,
2.) system-wide x86 specific and vendor specific initialization,
3.) Final VMX specific system-wide initialization,
4.) calculate the sizes of VMX kvm structure and VMX vcpu structure,
5.) report those sizes to the KVM common layer and KVM common
initialization

Refactor the KVM VMX module initialization function into smaller functions
plus a wrapper, so that the VMX-only logic stays in vmx.c while the logic
common to VMX and TDX lives in main.c. Introduce a wrapper function for
vmx_init().

The KVM architecture-common layer allocates struct kvm with the size
reported by the architecture-specific code. The KVM VMX module defines its
structure as struct vmx_kvm { struct kvm; VMX specific members; } and uses
it as the VMX kvm structure. Similarly for the vcpu structure. The TDX KVM
patches will define TDX-specific kvm and vcpu structures.

The current module exit function is also a single function, a combination
of VMX-specific logic and common KVM logic. Refactor it into VMX-specific
logic and KVM-common logic. This is just refactoring to keep the
VMX-specific logic in vmx.c, separate from main.c.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 50 +++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 54 +++++---------------------------------
arch/x86/kvm/vmx/x86_ops.h | 13 ++++++++-
3 files changed, 68 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 3b89cb721fe8..4059364fc5e0 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -165,3 +165,53 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
.runtime_ops = &vt_x86_ops,
.pmu_ops = &intel_pmu_ops,
};
+
+static int __init vt_init(void)
+{
+ unsigned int vcpu_size, vcpu_align;
+ int r;
+
+ if (!kvm_is_vmx_supported())
+ return -EOPNOTSUPP;
+
+ /*
+ * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
+ * to unwind if a later step fails.
+ */
+ hv_init_evmcs();
+
+ r = kvm_x86_vendor_init(&vt_init_ops);
+ if (r)
+ return r;
+
+ r = vmx_init();
+ if (r)
+ goto err_vmx_init;
+
+ /*
+ * Common KVM initialization _must_ come last, after this, /dev/kvm is
+ * exposed to userspace!
+ */
+ vcpu_size = sizeof(struct vcpu_vmx);
+ vcpu_align = __alignof__(struct vcpu_vmx);
+ r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
+ if (r)
+ goto err_kvm_init;
+
+ return 0;
+
+err_kvm_init:
+ vmx_exit();
+err_vmx_init:
+ kvm_x86_vendor_exit();
+ return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+ kvm_exit();
+ kvm_x86_vendor_exit();
+ vmx_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3627b704ff37..d9f2f00af427 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -544,7 +544,7 @@ static int hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu)
return 0;
}

-static __init void hv_init_evmcs(void)
+__init void hv_init_evmcs(void)
{
int cpu;

@@ -580,7 +580,7 @@ static __init void hv_init_evmcs(void)
}
}

-static void hv_reset_evmcs(void)
+void hv_reset_evmcs(void)
{
struct hv_vp_assist_page *vp_ap;

@@ -604,10 +604,6 @@ static void hv_reset_evmcs(void)
vp_ap->current_nested_vmcs = 0;
vp_ap->enlighten_vmentry = 0;
}
-
-#else /* IS_ENABLED(CONFIG_HYPERV) */
-static void hv_init_evmcs(void) {}
-static void hv_reset_evmcs(void) {}
#endif /* IS_ENABLED(CONFIG_HYPERV) */

/*
@@ -2748,7 +2744,7 @@ static bool __kvm_is_vmx_supported(void)
return true;
}

-static bool kvm_is_vmx_supported(void)
+bool kvm_is_vmx_supported(void)
{
bool supported;

@@ -8468,7 +8464,7 @@ static void vmx_cleanup_l1d_flush(void)
l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
}

-static void __vmx_exit(void)
+void vmx_exit(void)
{
allow_smaller_maxphyaddr = false;

@@ -8477,32 +8473,10 @@ static void __vmx_exit(void)
vmx_cleanup_l1d_flush();
}

-static void vmx_exit(void)
-{
- kvm_exit();
- kvm_x86_vendor_exit();
-
- __vmx_exit();
-}
-module_exit(vmx_exit);
-
-static int __init vmx_init(void)
+int __init vmx_init(void)
{
int r, cpu;

- if (!kvm_is_vmx_supported())
- return -EOPNOTSUPP;
-
- /*
- * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
- * to unwind if a later step fails.
- */
- hv_init_evmcs();
-
- r = kvm_x86_vendor_init(&vt_init_ops);
- if (r)
- return r;
-
/*
* Must be called after common x86 init so enable_ept is properly set
* up. Hand the parameter mitigation value in which was stored in
@@ -8512,7 +8486,7 @@ static int __init vmx_init(void)
*/
r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
if (r)
- goto err_l1d_flush;
+ return r;

for_each_possible_cpu(cpu)
pi_init_cpu(cpu);
@@ -8529,21 +8503,5 @@ static int __init vmx_init(void)
if (!enable_ept)
allow_smaller_maxphyaddr = true;

- /*
- * Common KVM initialization _must_ come last, after this, /dev/kvm is
- * exposed to userspace!
- */
- r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
- THIS_MODULE);
- if (r)
- goto err_kvm_init;
-
return 0;
-
-err_kvm_init:
- __vmx_exit();
-err_l1d_flush:
- kvm_x86_vendor_exit();
- return r;
}
-module_init(vmx_init);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0c8fbd29b969..c6bb2fe9ddb9 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -6,11 +6,22 @@

#include "x86.h"

-__init int vmx_hardware_setup(void);
+#if IS_ENABLED(CONFIG_HYPERV)
+__init void hv_init_evmcs(void);
+void hv_reset_evmcs(void);
+#else /* IS_ENABLED(CONFIG_HYPERV) */
+static inline void hv_init_evmcs(void) {}
+static inline void hv_reset_evmcs(void) {}
+#endif /* IS_ENABLED(CONFIG_HYPERV) */
+
+bool kvm_is_vmx_supported(void);
+int __init vmx_init(void);
+void vmx_exit(void);

extern struct kvm_x86_ops vt_x86_ops __initdata;
extern struct kvm_x86_init_ops vt_init_ops __initdata;

+__init int vmx_hardware_setup(void);
void vmx_hardware_unsetup(void);
int vmx_check_processor_compat(void);
int vmx_hardware_enable(void);
--
2.25.1

2023-10-16 16:40:26

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 100/116] KVM: TDX: Silently discard SMI request

From: Isaku Yamahata <[email protected]>

TDX doesn't support system-management mode (SMM) or system-management
interrupts (SMI) in guest TDs. Because guest state (vcpu state, memory
state) is protected, changing it, i.e. injecting an SMI and switching the
vcpu mode into SMM, must go through the TDX module APIs. The TDX module
doesn't provide a way for the VMM to inject an SMI into a guest TD or to
switch the guest vcpu mode into SMM.

When the guest TD or the device model (e.g. QEMU) raises SMM or SMI, KVM
has two options: 1) silently ignore the request or 2) return a meaningful
error.

For simplicity, implement option 1) and silently discard the request.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/smm.h | 7 +++++-
arch/x86/kvm/vmx/main.c | 45 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 29 ++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 12 ++++++++++
4 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..bc77902f5c18 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -142,7 +142,12 @@ union kvm_smram {

static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
{
- kvm_make_request(KVM_REQ_SMI, vcpu);
+ /*
+ * If SMM isn't supported (e.g. TDX), silently discard SMI request.
+ * Assume that SMM supported = MSR_IA32_SMBASE supported.
+ */
+ if (static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
+ kvm_make_request(KVM_REQ_SMI, vcpu);
return 0;
}

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 982022993700..ad91efcc2413 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -294,6 +294,43 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
vmx_msr_filter_changed(vcpu);
}

+#ifdef CONFIG_KVM_SMM
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_smi_allowed(vcpu, for_injection);
+
+ return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_enter_smm(vcpu, smram);
+
+ return vmx_enter_smm(vcpu, smram);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_leave_smm(vcpu, smram);
+
+ return vmx_leave_smm(vcpu, smram);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_enable_smi_window(vcpu);
+ return;
+ }
+
+ /* RSM will cause a vmexit anyway. */
+ vmx_enable_smi_window(vcpu);
+}
+#endif
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -678,10 +715,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.setup_mce = vmx_setup_mce,

#ifdef CONFIG_KVM_SMM
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
+ .smi_allowed = vt_smi_allowed,
+ .enter_smm = vt_enter_smm,
+ .leave_smm = vt_leave_smm,
+ .enable_smi_window = vt_enable_smi_window,
#endif

.can_emulate_instruction = vmx_can_emulate_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b4a0fa51ee62..f368f9c950ad 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2042,6 +2042,35 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
}
}

+#ifdef CONFIG_KVM_SMM
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ /* SMI isn't supported for TDX. */
+ WARN_ON_ONCE(1);
+ return false;
+}
+
+int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+ /* smi_allowed() is always false for TDX as above. */
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+ /* SMI isn't supported for TDX. Silently discard SMI request. */
+ WARN_ON_ONCE(1);
+ vcpu->arch.smi_pending = false;
+}
+#endif
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index fd52c7df0cd8..092676b50dc2 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -222,4 +222,16 @@ static inline int tdx_sept_flush_remote_tlbs(struct kvm *kvm) { return 0; }
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
#endif

+#if defined(CONFIG_INTEL_TDX_HOST) && defined(CONFIG_KVM_SMM)
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
+#else
+static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
+static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) { return 0; }
+static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) { return 0; }
+static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
+#endif
+
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1

2023-10-16 16:41:07

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 105/116] KVM: TDX: Add methods to ignore VMX preemption timer

From: Isaku Yamahata <[email protected]>

TDX doesn't support the VMX-preemption timer. Implement the access methods
so that the VMM ignores the VMX-preemption timer for TDX vcpus.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 06eb0896a87a..eb02a73f3bb0 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -851,6 +851,27 @@ static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
vmx_update_cpu_dirty_logging(vcpu);
}

+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
+{
+ /* VMX-preemption timer isn't available for TDX. */
+ if (is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+ /* VMX-preemption timer can't be set. See vt_set_hv_timer(). */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -1002,8 +1023,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.pi_start_assignment = vmx_pi_start_assignment,

#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
+ .set_hv_timer = vt_set_hv_timer,
+ .cancel_hv_timer = vt_cancel_hv_timer,
#endif

.setup_mce = vmx_setup_mce,
--
2.25.1

2023-10-16 16:41:19

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 009/116] KVM: TDX: Define TDX architectural definitions

From: Isaku Yamahata <[email protected]>

Add architectural definitions for KVM to issue the TDX SEAMCALLs.

The structures and values are architecturally defined in the ABI Reference
chapter of the TDX module specification.
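
As a standalone illustration (not part of the patch) of the TSC frequency
encoding used by TD_PARAMS below; the 2.5 GHz value is an assumption for
the example:

#include <stdio.h>

/* Mirrors the conversion macros added by this patch (units of 25MHz). */
#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz)	((tsc_in_khz) / (25 * 1000))
#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz)	((tsc_in_25mhz) * (25 * 1000))

int main(void)
{
	/* A hypothetical 2.5 GHz guest TSC, expressed in kHz. */
	unsigned long tsc_khz = 2500000;
	unsigned short tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(tsc_khz);

	/* 2500000 / 25000 = 100, i.e. 100 * 25MHz = 2.5 GHz. */
	printf("td_params.tsc_frequency = %u\n", tsc_frequency);
	printf("round trip = %lu kHz\n",
	       (unsigned long)TDX_TSC_25MHZ_TO_KHZ(tsc_frequency));
	return 0;
}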

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx_arch.h | 213 ++++++++++++++++++++++++++++++++++++
1 file changed, 213 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..845b6ef9f787
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,213 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER 0
+#define TDH_MNG_ADDCX 1
+#define TDH_MEM_PAGE_ADD 2
+#define TDH_MEM_SEPT_ADD 3
+#define TDH_VP_ADDCX 4
+#define TDH_MEM_PAGE_RELOCATE 5
+#define TDH_MEM_PAGE_AUG 6
+#define TDH_MEM_RANGE_BLOCK 7
+#define TDH_MNG_KEY_CONFIG 8
+#define TDH_MNG_CREATE 9
+#define TDH_VP_CREATE 10
+#define TDH_MNG_RD 11
+#define TDH_MR_EXTEND 16
+#define TDH_MR_FINALIZE 17
+#define TDH_VP_FLUSH 18
+#define TDH_MNG_VPFLUSHDONE 19
+#define TDH_MNG_KEY_FREEID 20
+#define TDH_MNG_INIT 21
+#define TDH_VP_INIT 22
+#define TDH_MEM_SEPT_RD 25
+#define TDH_VP_RD 26
+#define TDH_MNG_KEY_RECLAIMID 27
+#define TDH_PHYMEM_PAGE_RECLAIM 28
+#define TDH_MEM_PAGE_REMOVE 29
+#define TDH_MEM_SEPT_REMOVE 30
+#define TDH_MEM_TRACK 38
+#define TDH_MEM_RANGE_UNBLOCK 39
+#define TDH_PHYMEM_CACHE_WB 40
+#define TDH_PHYMEM_PAGE_WBINVD 41
+#define TDH_VP_WR 43
+#define TDH_SYS_LP_SHUTDOWN 44
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO 0x10000
+#define TDG_VP_VMCALL_MAP_GPA 0x10001
+#define TDG_VP_VMCALL_GET_QUOTE 0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH BIT_ULL(63)
+#define TDX_CLASS_SHIFT 56
+#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field) \
+ (((non_arch) ? TDX_NON_ARCH : 0) | \
+ ((u64)(class) << TDX_CLASS_SHIFT) | \
+ ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field) \
+ __BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
+ __BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* Class code for TD */
+#define TD_CLASS_EXECUTION_CONTROLS 17ULL
+
+/* Class code for TDVPS */
+#define TDVPS_CLASS_VMCS 0ULL
+#define TDVPS_CLASS_GUEST_GPR 16ULL
+#define TDVPS_CLASS_OTHER_GUEST 17ULL
+#define TDVPS_CLASS_MANAGEMENT 32ULL
+
+enum tdx_tdcs_execution_control {
+ TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field) BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
+
+enum tdx_vcpu_guest_other_state {
+ TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+ struct {
+ u64 vmxip : 1;
+ u64 reserved : 63;
+ };
+ u64 full;
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field) BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
+#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
+
+/* Management class fields */
+enum tdx_vcpu_guest_management {
+ TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_vcpu_guest_management */
+#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE 256
+
+struct tdx_cpuid_value {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
+#define TDX_TD_ATTR_SEPT_VE_DISABLE BIT_ULL(28)
+#define TDX_TD_ATTRIBUTE_PKS BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+#define TDX_MAX_VCPUS (~(u16)0)
+
+struct td_params {
+ u64 attributes;
+ u64 xfam;
+ u16 max_vcpus;
+ u8 reserved0[6];
+
+ u64 eptp_controls;
+ u64 exec_controls;
+ u16 tsc_frequency;
+ u8 reserved1[38];
+
+ u64 mrconfigid[6];
+ u64 mrowner[6];
+ u64 mrownerconfig[6];
+ u64 reserved2[4];
+
+ union {
+ DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
+ u8 reserved3[768];
+ };
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz. The
+ * frequency must be between 100mhz and 10ghz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
+
+union tdx_sept_entry {
+ struct {
+ u64 r : 1;
+ u64 w : 1;
+ u64 x : 1;
+ u64 mt : 3;
+ u64 ipat : 1;
+ u64 leaf : 1;
+ u64 a : 1;
+ u64 d : 1;
+ u64 xu : 1;
+ u64 ignored0 : 1;
+ u64 pfn : 40;
+ u64 reserved : 5;
+ u64 vgp : 1;
+ u64 pwa : 1;
+ u64 ignored1 : 1;
+ u64 sss : 1;
+ u64 spp : 1;
+ u64 ignored2 : 1;
+ u64 sve : 1;
+ };
+ u64 raw;
+};
+
+enum tdx_sept_entry_state {
+ TDX_SEPT_FREE = 0,
+ TDX_SEPT_BLOCKED = 1,
+ TDX_SEPT_PENDING = 2,
+ TDX_SEPT_PENDING_BLOCKED = 3,
+ TDX_SEPT_PRESENT = 4,
+};
+
+union tdx_sept_level_state {
+ struct {
+ u64 level : 3;
+ u64 reserved0 : 5;
+ u64 state : 8;
+ u64 reserved1 : 48;
+ };
+ u64 raw;
+};
+
+#endif /* __KVM_X86_TDX_ARCH_H */
--
2.25.1

2023-10-16 16:41:39

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 067/116] KVM: TDX: Add TSX_CTRL msr into uret_msrs list

From: Yang Weijiang <[email protected]>

The TDX module resets the TSX_CTRL MSR to 0 at TD exit if TSX is enabled
for the TD, and preserves the TSX_CTRL MSR if TSX is disabled for the TD.
The VMM can rely on the uret_msrs mechanism to defer reloading the host
value until exiting to user space.

Signed-off-by: Yang Weijiang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.h | 8 ++++++++
2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 87ed1a255c3b..b5493d6c7cdd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -521,14 +521,21 @@ static struct tdx_uret_msr tdx_uret_msrs[] = {
{.msr = MSR_LSTAR,},
{.msr = MSR_TSC_AUX,},
};
+static unsigned int tdx_uret_tsx_ctrl_slot;

-static void tdx_user_return_update_cache(void)
+static void tdx_user_return_update_cache(struct kvm_vcpu *vcpu)
{
int i;

for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
tdx_uret_msrs[i].defval);
+ /*
+ * TSX_CTRL is reset to 0 if guest TSX is supported. Otherwise
+ * preserved.
+ */
+ if (to_kvm_tdx(vcpu->kvm)->tsx_supported && tdx_uret_tsx_ctrl_slot != -1)
+ kvm_user_return_update_cache(tdx_uret_tsx_ctrl_slot, 0);
}

static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
@@ -623,7 +630,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

- tdx_user_return_update_cache();
+ tdx_user_return_update_cache(vcpu);
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

@@ -1167,6 +1174,22 @@ static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_pa
return 0;
}

+static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
+{
+ const struct kvm_cpuid_entry2 *entry;
+ u64 mask;
+ u32 ebx;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
+ if (entry)
+ ebx = entry->ebx;
+ else
+ ebx = 0;
+
+ mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
+ return ebx & mask;
+}
+
static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
struct kvm_tdx_init_vm *init_vm)
{
@@ -1212,6 +1235,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);

+ to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
return 0;
}

@@ -1875,6 +1899,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
return -EIO;
}
}
+ tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
+ if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
+ pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
+ return -EIO;
+ }

max_pkgs = topology_max_packages();
tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 610bd3f4e952..45f5c2744d78 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,6 +17,14 @@ struct kvm_tdx {
u64 xfam;
int hkid;

+ /*
+ * Used on each TD-exit, see tdx_user_return_update_cache().
+ * TSX_CTRL value on TD exit
+ * - set 0 if guest TSX enabled
+ * - preserved if guest TSX disabled
+ */
+ bool tsx_supported;
+
hpa_t source_pa;

bool finalized;
--
2.25.1

2023-10-16 16:41:50

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 084/116] KVM: TDX: handle EXIT_REASON_OTHER_SMI

From: Isaku Yamahata <[email protected]>

If control reaches EXIT_REASON_OTHER_SMI, the #SMI was delivered and
handled right after returning from the TDX module to KVM; nothing needs to
be done in KVM. Continue TDX vcpu execution.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/uapi/asm/vmx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 7 +++++++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index a5faf6d88f1b..b3a30ef3efdd 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -34,6 +34,7 @@
#define EXIT_REASON_TRIPLE_FAULT 2
#define EXIT_REASON_INIT_SIGNAL 3
#define EXIT_REASON_SIPI_SIGNAL 4
+#define EXIT_REASON_OTHER_SMI 6

#define EXIT_REASON_INTERRUPT_WINDOW 7
#define EXIT_REASON_NMI_WINDOW 8
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 306d555dda35..618c2ecec26a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1339,6 +1339,13 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
+ case EXIT_REASON_OTHER_SMI:
+ /*
+ * If reach here, it's not a Machine Check System Management
+ * Interrupt(MSMI). #SMI is delivered and handled right after
+ * SEAMRET, nothing needs to be done in KVM.
+ */
+ return 1;
default:
break;
}
--
2.25.1

2023-10-16 16:41:52

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 031/116] KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values

From: Sean Christopherson <[email protected]>

Add support to MMU caches for initializing a page with a custom 64-bit
value, e.g. to pre-fill an entire page table with non-zero PTE values.
The functionality will be used by x86 to support Intel's TDX, which needs
to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
faults from getting reflected into the guest (Intel's EPT Violation #VE
architecture made the less than brilliant decision of having the per-PTE
behavior be opt-out instead of opt-in).
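
For illustration, a sketch of how a consumer of the cache might use the
new field; the SHADOW_NONPRESENT_VALUE name and the bit-63 constant are
assumptions here, anticipating the TDX use case mentioned above, and this
is not code from this patch:

/*
 * Hypothetical setup of an MMU memory cache whose pages are pre-filled
 * with a non-zero "suppress #VE" pattern instead of zeroes.  Note that
 * init_value is mutually exclusive with kmem_cache and gfp_zero, which
 * the WARN_ON_ONCE() added below enforces.
 */
#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)	/* illustrative constant */

static void mmu_init_private_spt_cache(struct kvm_mmu_memory_cache *mc)
{
	mc->kmem_cache = NULL;		/* page-sized allocations only */
	mc->gfp_zero = 0;		/* must not be combined with init_value */
	mc->init_value = SHADOW_NONPRESENT_VALUE;
}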

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_types.h | 1 +
virt/kvm/kvm_main.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 9d1f7835d8c1..60c8d5c9eab9 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -94,6 +94,7 @@ struct gfn_to_pfn_cache {
struct kvm_mmu_memory_cache {
gfp_t gfp_zero;
gfp_t gfp_custom;
+ u64 init_value;
struct kmem_cache *kmem_cache;
int capacity;
int nobjs;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index aa9426472bb4..eb45bd2ca706 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -403,12 +403,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
{
+ void *page;
+
gfp_flags |= mc->gfp_zero;

if (mc->kmem_cache)
return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
- else
- return (void *)__get_free_page(gfp_flags);
+
+ page = (void *)__get_free_page(gfp_flags);
+ if (page && mc->init_value)
+ memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
+ return page;
}

int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
@@ -423,6 +428,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
if (WARN_ON_ONCE(!capacity))
return -EIO;

+ /*
+ * Custom init values can be used only for page allocations,
+ * and obviously conflict with __GFP_ZERO.
+ */
+ if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
+ return -EIO;
+
mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
if (!mc->objects)
return -ENOMEM;
--
2.25.1

2023-10-16 16:42:29

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 029/116] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA

From: Isaku Yamahata <[email protected]>

TDX repurposes one GPA bit (bit 51 or bit 47, depending on configuration)
to indicate whether the GPA is private (if cleared) or shared (if set) with
the VMM. If the GPA.SHARED bit is set, the GPA is covered by the existing
conventional EPT pointed to by the EPTP. If the GPA.SHARED bit is cleared,
the GPA is covered by the TDX module and the VMM has to issue SEAMCALLs to
operate on it.

Add a member to remember the GPA shared bit for each guest TD, add address
conversion functions between private GPA and shared GPA, and add a test for
whether a GPA is private.

Because struct kvm_arch (or struct kvm, which includes struct kvm_arch; see
kvm_arch_alloc_vm(), which passes __GFP_ZERO) is zero-cleared when
allocated, the new member that remembers the GPA shared bit is guaranteed
to be zero with this patch unless it's initialized explicitly.
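
A standalone sketch (not part of the patch) of the conversion arithmetic
the new helpers implement, using the MAX_GPAW case (GPA.SHARED is bit 51,
i.e. GFN bit 39); the GFN value is illustrative only:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define BIT_ULL(n)	(1ULL << (n))

int main(void)
{
	/* MAX_GPAW case: GPA.SHARED is bit 51, so the GFN mask is bit 39. */
	uint64_t gfn_shared_mask = BIT_ULL(51) >> PAGE_SHIFT;
	uint64_t gfn = 0x12345;			/* arbitrary private GFN */

	uint64_t shared_gfn = gfn | gfn_shared_mask;	    /* kvm_gfn_to_shared()  */
	uint64_t priv_gfn = shared_gfn & ~gfn_shared_mask;  /* kvm_gfn_to_private() */

	/* kvm_is_private_gpa(): mask configured and shared bit clear. */
	int is_private = gfn_shared_mask && !(gfn & gfn_shared_mask);

	printf("shared=0x%llx private=0x%llx is_private=%d\n",
	       (unsigned long long)shared_gfn, (unsigned long long)priv_gfn,
	       is_private);
	return 0;
}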

Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/mmu.h | 27 +++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.c | 5 +++++
3 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a21c060ffc68..742e97f23573 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1467,6 +1467,10 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+#ifdef CONFIG_KVM_MMU_PRIVATE
+ gfn_t gfn_shared_mask;
+#endif
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 253fb2093d5d..f5ba6cf589aa 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -304,4 +304,31 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return gpa;
return translate_nested_gpa(vcpu, gpa, access, exception);
}
+
+static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_MMU_PRIVATE
+ return kvm->arch.gfn_shared_mask;
+#else
+ return 0;
+#endif
+}
+
+static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
+{
+ return gfn | kvm_gfn_shared_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
+{
+ return gfn & ~kvm_gfn_shared_mask(kvm);
+}
+
+static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
+{
+ gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+ return mask && !(gpa_to_gfn(gpa) & mask);
+}
+
#endif
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c1a8560981a3..fe793425d393 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -878,6 +878,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
kvm_tdx->attributes = td_params->attributes;
kvm_tdx->xfam = td_params->xfam;

+ if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
+ kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
+ else
+ kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+
out:
/* kfree() accepts NULL. */
kfree(init_vm);
--
2.25.1

2023-10-16 16:42:48

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 017/116] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl

From: Isaku Yamahata <[email protected]>

KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest state-protected VMs, with technology-specific sub-commands defined
under it. Despite its name, the sub-commands are not limited to memory
encryption; various technology-specific operations are defined. It's
natural to repurpose KVM_MEMORY_ENCRYPT_OP for TDX-specific operations and
define sub-commands.

TDX requires VM-scoped TDX-specific operations for the device model (for
example, QEMU), such as getting system-wide parameters and TDX-specific VM
initialization.

Add a placeholder function for the TDX-specific VM-scoped ioctl as
mem_enc_op. TDX-specific sub-commands will be added to retrieve/pass
TDX-specific parameters. Make mem_enc_ioctl non-optional as it's always
filled.
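
For illustration only, a minimal user-space sketch (not part of the patch)
of how a VMM could issue a TDX sub-command once real sub-commands exist;
vm_fd is assumed to be a VM file descriptor created with the TDX VM type,
and with only this placeholder applied every id returns -EINVAL:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int tdx_vm_cmd(int vm_fd, __u32 id, __u32 flags, __u64 data)
{
	struct kvm_tdx_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.id = id;
	cmd.flags = flags;
	cmd.data = data;

	/* On failure, KVM may fill cmd.error with a TDX SEAMCALL status. */
	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}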

Signed-off-by: Isaku Yamahata <[email protected]>
---
v15:
- change struct kvm_tdx_cmd to drop unused member.
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +-
arch/x86/include/uapi/asm/kvm.h | 26 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/main.c | 10 ++++++++++
arch/x86/kvm/vmx/tdx.c | 26 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
arch/x86/kvm/x86.c | 4 ----
6 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index daaa1bef1b2d..8a5770b12546 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -120,7 +120,7 @@ KVM_X86_OP(enter_smm)
KVM_X86_OP(leave_smm)
KVM_X86_OP(enable_smi_window)
#endif
-KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
+KVM_X86_OP(mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_register_region)
KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index aa7a56a47564..615fb60b3717 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -567,4 +567,30 @@ struct kvm_pmu_event_filter {
#define KVM_X86_TDX_VM 2
#define KVM_X86_SNP_VM 3

+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-commend. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8ca23adfcfb8..e9661954d250 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -51,6 +51,14 @@ static int vt_vm_init(struct kvm *kvm)
return vmx_vm_init(kvm);
}

+static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+{
+ if (!is_td(kvm))
+ return -ENOTTY;
+
+ return tdx_vm_ioctl(kvm, argp);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -201,6 +209,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .mem_enc_ioctl = vt_mem_enc_ioctl,
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b0e3409da5a8..ead229e34813 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -16,6 +16,32 @@
offsetof(struct tdsysinfo_struct, cpuid_configs)) \
/ sizeof(struct tdx_cpuid_config))

+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_tdx_cmd tdx_cmd;
+ int r;
+
+ if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+ return -EFAULT;
+ if (tdx_cmd.error)
+ return -EINVAL;
+
+ mutex_lock(&kvm->lock);
+
+ switch (tdx_cmd.id) {
+ default:
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+ r = -EFAULT;
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
static int __init tdx_module_setup(void)
{
const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 99fb6c9c0282..601aca7a011e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -137,9 +137,13 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
bool tdx_is_vm_type_supported(unsigned long type);
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
+
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 344413dededc..1824fd257f41 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7053,10 +7053,6 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
goto out;
}
case KVM_MEMORY_ENCRYPT_OP: {
- r = -ENOTTY;
- if (!kvm_x86_ops.mem_enc_ioctl)
- goto out;
-
r = static_call(kvm_x86_mem_enc_ioctl)(kvm, argp);
break;
}
--
2.25.1

2023-10-16 16:43:28

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 113/116] KVM: TDX: Add hint TDX ioctl to release Secure-EPT

From: Isaku Yamahata <[email protected]>

Add a new hint KVM TDX ioctl to release the Secure-EPT as an optimization
to reduce the time it takes to destroy the guest.

It takes tens of minutes to destroy a guest with tens or hundreds of GB of
guest memory. There are two cases for releasing the pages used for the
Secure-EPT and guest private memory: one at runtime while the guest is
still running, and one when the TD won't run anymore.

At runtime: this path is used when a KVM memory slot is deleted or KVM file
descriptors are closed while the user process is live. Because the guest
can still run, a TLB shoot-down is needed. The sequence is TLB shoot-down,
cache flush of each page, releasing the page from the Secure-EPT tree, and
zero-clearing it. It requires four SEAMCALLs per page:
TDH.MEM.RANGE.BLOCK() and TDH.MEM.TRACK() for the TLB shoot-down,
TDH.PHYMEM.PAGE.WBINVD() for the cache flush, and TDH.MEM.PAGE.REMOVE() to
release the page.

At process exit: when we know the vcpus won't run any further, KVM can free
the host key ID (HKID) used for memory encryption with a cache flush; the
vcpus can't run after that. This simplifies the sequence to release private
pages by reclaiming and zeroing them, reducing the number of SEAMCALLs to
one per private page, TDH.PHYMEM.PAGE.RECLAIM(). However, this is
applicable only when the user process exits, via the MMU notifier release
callback.

Add a way for user space to give KVM a hint when it starts to destruct the
guest so the efficient path can be used in addition to the MMU notifier.
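
A minimal user-space sketch (illustrative only, not part of the patch) of
how a VMM might pass this hint right before tearing the TD down; vm_fd is
an assumed open TDX VM file descriptor and error handling is elided:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void tdx_release_vm(int vm_fd)
{
	struct kvm_tdx_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.id = KVM_TDX_RELEASE_VM;

	/* Best effort: destruction still works without the hint. */
	(void)ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

	close(vm_fd);
}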

Signed-off-by: Isaku Yamahata <[email protected]>

---
v16
- Newly added
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/vmx/tdx.c | 9 +++++++++
3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1b4134247837..afd10fb55cfb 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -574,6 +574,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VCPU,
KVM_TDX_INIT_MEM_REGION,
KVM_TDX_FINALIZE_VM,
+ KVM_TDX_RELEASE_VM,

KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 63a4efd1e40a..26bad5c646fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6907,6 +6907,7 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
kvm_mmu_zap_all(kvm);
}
+EXPORT_SYMBOL_GPL(kvm_arch_flush_shadow_all);

static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 924cbf97404a..3287701199dd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2771,6 +2771,15 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_FINALIZE_VM:
r = tdx_td_finalizemr(kvm);
break;
+ case KVM_TDX_RELEASE_VM: {
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_arch_flush_shadow_all(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+ r = 0;
+ break;
+ }
default:
r = -EINVAL;
goto out;
--
2.25.1

2023-10-16 16:43:32

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 109/116] KVM: TDX: Add methods to ignore virtual apic related operation

From: Isaku Yamahata <[email protected]>

TDX protects the TDX guest's APIC state from the VMM. Implement the access
methods for the TDX guest vAPIC state so that they either ignore the
operation or return zero.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 61 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 6 ++++
arch/x86/kvm/vmx/x86_ops.h | 3 ++
3 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0d0af3fadc75..eb44f5784f18 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -363,6 +363,14 @@ static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
return vmx_apic_init_signal_blocked(vcpu);
}

+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_set_virtual_apic_mode(vcpu);
+
+ return vmx_set_virtual_apic_mode(vcpu);
+}
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -371,6 +379,31 @@ static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
memset(pi->pir, 0, sizeof(pi->pir));
}

+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(int max_isr)
+{
+ if (is_td_vcpu(kvm_get_running_vcpu()))
+ return;
+
+ return vmx_hwapic_isr_update(max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 at the moment. */
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return false;
+
+ return vmx_guest_apic_has_interrupt(vcpu);
+}
+
static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -824,6 +857,22 @@ static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
vmx_update_cr8_intercept(vcpu, tpr, irr);
}

+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
+ vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (is_td_vcpu(vcpu))
@@ -1033,15 +1082,15 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vt_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .set_virtual_apic_mode = vt_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vt_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vt_load_eoi_exitmap,
.apicv_post_state_restore = vt_apicv_post_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .hwapic_irr_update = vt_hwapic_irr_update,
+ .hwapic_isr_update = vt_hwapic_isr_update,
+ .guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
.sync_pir_to_irr = vt_sync_pir_to_irr,
.deliver_interrupt = vt_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1628e54d816b..527fbcd7e2c5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2078,6 +2078,12 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ /* Only x2APIC mode is supported for TD. */
+ WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
int tdx_get_cpl(struct kvm_vcpu *vcpu)
{
return 0;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 8dacab1bc3d7..7c377b7d76b9 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -168,6 +168,7 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
bool tdx_has_emulated_msr(u32 index, bool write);
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);

int tdx_get_cpl(struct kvm_vcpu *vcpu);
void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
@@ -220,6 +221,8 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }

+static inline void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+
static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
--
2.25.1

2023-10-16 16:51:03

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 087/116] KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI

From: Isaku Yamahata <[email protected]>

When BIOS eMCA MCE-SMI morphing is enabled, the #MC is morphed into an MSMI
(Machine Check System Management Interrupt). The SMI then causes a TD exit to
KVM with the exit reason EXIT_REASON_OTHER_SMI and the MSMI bit set in the
exit qualification, instead of EXIT_REASON_EXCEPTION_NMI with the #MC
exception.

Handle EXIT_REASON_OTHER_SMI with the MSMI bit set in the exit qualification
as an MCE (Machine Check Exception) that happened while the TD guest was
running.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx_arch.h | 2 ++
2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6bed75c4069c..44558f1c13e4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -890,6 +890,30 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
tdexit_intr_info(vcpu));
else if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
+ else if (unlikely(tdx->exit_reason.non_recoverable ||
+ tdx->exit_reason.error)) {
+ /*
+ * The only reason to get EXIT_REASON_OTHER_SMI is that an
+ * #MSMI (Machine Check System Management Interrupt) with
+ * exit_qualification bit 0 set occurred in the TD guest.
+ * The #MSMI is delivered right after SEAMCALL returns,
+ * and an #MC is delivered to the host kernel after the SMI
+ * handler returns.
+ *
+ * The #MC right after SEAMCALL is fixed up and skipped in the
+ * #MC handler because it's an #MC that happened in the TD guest
+ * and cannot be handled in the host's context.
+ *
+ * Call KVM's machine check handler explicitly here.
+ */
+ if (tdx->exit_reason.basic == EXIT_REASON_OTHER_SMI) {
+ unsigned long exit_qual;
+
+ exit_qual = tdexit_exit_qual(vcpu);
+ if (exit_qual & TD_EXIT_OTHER_SMI_IS_MSMI)
+ kvm_machine_check();
+ }
+ }
}

static int tdx_handle_exception(struct kvm_vcpu *vcpu)
@@ -1372,6 +1396,11 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
exit_reason.full, exit_reason.basic,
to_kvm_tdx(vcpu->kvm)->hkid,
set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
+
+ /*
+ * tdx_handle_exit_irqoff() handled EXIT_REASON_OTHER_SMI. It
+ * must be handled before enabling preemption because it's #MC.
+ */
goto unhandled_exit;
}

@@ -1410,9 +1439,14 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_ept_misconfig(vcpu);
case EXIT_REASON_OTHER_SMI:
/*
- * If reach here, it's not a Machine Check System Management
- * Interrupt(MSMI). #SMI is delivered and handled right after
- * SEAMRET, nothing needs to be done in KVM.
+ * Unlike VMX, any SMI in SEAM non-root mode (i.e. while a
+ * TD guest vcpu is running) causes a TD exit to the TDX module,
+ * then SEAMRET to KVM. Once it exits to KVM, the SMI is delivered
+ * and handled right away.
+ *
+ * - If it's a Machine Check System Management Interrupt
+ * (MSMI), it's handled above because the non_recoverable bit is set.
+ * - If it's not an MSMI, nothing needs to be done here.
*/
return 1;
default:
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index fc9a8898765c..9f93250d22b9 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -47,6 +47,8 @@
#define TDG_VP_VMCALL_REPORT_FATAL_ERROR 0x10003
#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004

+#define TD_EXIT_OTHER_SMI_IS_MSMI BIT(1)
+
/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
#define TDX_NON_ARCH BIT_ULL(63)
#define TDX_CLASS_SHIFT 56
--
2.25.1

2023-10-16 16:51:09

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 101/116] KVM: TDX: Silently ignore INIT/SIPI

From: Isaku Yamahata <[email protected]>

The TDX module API doesn't provide an API for the VMM to inject an INIT IPI
or SIPI. Instead, it defines a different protocol for booting application
processors. Ignore INIT and SIPI events for the TDX guest.

There are two options: 1) (silently) ignore the INIT/SIPI request, or
2) somehow return an error to the guest TD. Given that the TDX guest is
paravirtualized to boot its APs, option 1 is chosen for simplicity.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/lapic.c | 19 +++++++++++-------
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/vmx/main.c | 32 ++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 4 ++--
6 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 19e1f22b92b1..8b3c5f2179cf 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -146,6 +146,7 @@ KVM_X86_OP_OPTIONAL(migrate_timers)
KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP(vcpu_deliver_init)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);

#undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 641f769b30d1..529c7e610d47 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1814,6 +1814,7 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);

void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+ void (*vcpu_deliver_init)(struct kvm_vcpu *vcpu);

/*
* Returns vCPU specific APICv inhibit reasons
@@ -2050,6 +2051,7 @@ void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
void kvm_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu);

int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
int reason, bool has_error_code, u32 error_code);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e037923edb5e..13d5f5c45c79 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3261,6 +3261,16 @@ int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
return 0;
}

+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+ kvm_vcpu_reset(vcpu, true);
+ if (kvm_vcpu_is_bsp(vcpu))
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ else
+ vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_deliver_init);
+
int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
@@ -3292,13 +3302,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
return 0;
}

- if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
- kvm_vcpu_reset(vcpu, true);
- if (kvm_vcpu_is_bsp(apic->vcpu))
- vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
- else
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
- }
+ if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events))
+ static_call(kvm_x86_vcpu_deliver_init)(vcpu);
if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
/* evaluate pending_events before reading the vector */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 70c1f7999399..30d9cc11a4b5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5017,6 +5017,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.complete_emulated_msr = svm_complete_emulated_msr,

.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_init = kvm_vcpu_deliver_init,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
};

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ad91efcc2413..dd050a6196ae 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -331,6 +331,14 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_apic_init_signal_blocked(vcpu);
+}
+
static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -359,6 +367,25 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
}

+static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ kvm_vcpu_deliver_sipi_vector(vcpu, vector);
+}
+
+static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ /* TDX doesn't support INIT. Ignore INIT event */
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ return;
+ }
+
+ kvm_vcpu_deliver_init(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -722,13 +749,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
#endif

.can_emulate_instruction = vmx_can_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,

.msr_filter_changed = vt_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,

- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_sipi_vector = vt_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_init = vt_vcpu_deliver_init,

.mem_enc_ioctl = vt_mem_enc_ioctl,
.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f368f9c950ad..14ab1450dda4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -757,8 +757,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{

- /* Ignore INIT silently because TDX doesn't support INIT event. */
- if (init_event)
+ /* vcpu_deliver_init method silently discards INIT event. */
+ if (KVM_BUG_ON(init_event, vcpu->kvm))
return;
if (KVM_BUG_ON(is_td_vcpu_created(to_tdx(vcpu)), vcpu->kvm))
return;
--
2.25.1

2023-10-16 16:51:25

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 032/116] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE

From: Isaku Yamahata <[email protected]>

The TDX support will need the "suppress #VE" bit (bit 63) set as the initial
value for SPTEs. To reduce the size of the code change, introduce a new
macro, SHADOW_NONPRESENT_VALUE, for the initial value of a shadow page table
entry (SPTE) and replace the hard-coded value 0 with it. Initialize shadow
page tables with this value.

The plan is to unconditionally set the "suppress #VE" bit for both AMD and
Intel because: 1) AMD hardware uses bit 63 as NX for present SPTEs and
ignores it for non-present SPTEs; 2) for conventional VMX guests, KVM never
enables "EPT-violation #VE" in the VMCS controls, so the "suppress #VE" bit
is ignored by hardware.
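
To make the plan concrete, a later change would be expected to redefine the
macro on 64-bit builds roughly along the following lines. This is only a
sketch of the stated plan, not part of this patch:

#ifdef CONFIG_X86_64
/*
 * Keep bit 63 ("suppress #VE") set in non-present SPTEs: AMD treats it as NX
 * only for present SPTEs, and VMX ignores it unless EPT-violation #VE is
 * enabled.
 */
#define SHADOW_NONPRESENT_VALUE        BIT_ULL(63)
#else
#define SHADOW_NONPRESENT_VALUE        0ULL
#endif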

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 20 +++++++++++++++-----
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.h | 2 ++
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++-------
4 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c3d0f8461e1c..8cf3ac95bb30 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -571,9 +571,9 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)

if (!is_shadow_present_pte(old_spte) ||
!spte_has_volatile_bits(old_spte))
- __update_clear_spte_fast(sptep, 0ull);
+ __update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
else
- old_spte = __update_clear_spte_slow(sptep, 0ull);
+ old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);

if (!is_shadow_present_pte(old_spte))
return old_spte;
@@ -607,7 +607,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
*/
static void mmu_spte_clear_no_track(u64 *sptep)
{
- __update_clear_spte_fast(sptep, 0ull);
+ __update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
}

static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -1954,7 +1954,8 @@ static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
{
- if (!sp->spt[i])
+ /* sp->spt[i] has initial value of shadow page table allocation */
+ if (sp->spt[i] == SHADOW_NONPRESENT_VALUE)
return 0;

return vcpu->arch.mmu->sync_spte(vcpu, sp, i);
@@ -6118,7 +6119,16 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ /*
+ * When X86_64, initial SEPT entries are initialized with
+ * SHADOW_NONPRESENT_VALUE. Otherwise zeroed. See
+ * mmu_memory_cache_alloc_obj().
+ */
+ if (IS_ENABLED(CONFIG_X86_64))
+ vcpu->arch.mmu_shadow_page_cache.init_value =
+ SHADOW_NONPRESENT_VALUE;
+ if (!vcpu->arch.mmu_shadow_page_cache.init_value)
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..bc73290415d2 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -911,7 +911,7 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
gpa_t pte_gpa;
gfn_t gfn;

- if (WARN_ON_ONCE(!sp->spt[i]))
+ if (WARN_ON_ONCE(sp->spt[i] == SHADOW_NONPRESENT_VALUE))
return 0;

first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..4d1799ba2bf8 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -149,6 +149,8 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+#define SHADOW_NONPRESENT_VALUE 0ULL
+
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ca7ec39f17d3..4f6f782d9e02 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -630,7 +630,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* here since the SPTE is going from non-present to non-present. Use
* the raw write helper to avoid an unnecessary check on volatile bits.
*/
- __kvm_tdp_mmu_write_spte(iter->sptep, 0);
+ __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);

return 0;
}
@@ -767,8 +767,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
continue;

if (!shared)
- tdp_mmu_iter_set_spte(kvm, &iter, 0);
- else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+ tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
goto retry;
}
}
@@ -824,8 +824,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;

- tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
- sp->gfn, sp->role.level + 1);
+ tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
+ SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);

return true;
}
@@ -859,7 +859,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

- tdp_mmu_iter_set_spte(kvm, &iter, 0);
+ tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
flush = true;
}

@@ -1253,7 +1253,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See handle_changed_spte().
*/
- tdp_mmu_iter_set_spte(kvm, iter, 0);
+ tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);

if (!pte_write(range->arg.pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1

2023-10-16 17:00:57

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 015/116] x86/cpu: Add helper functions to allocate/free TDX private host key id

From: Isaku Yamahata <[email protected]>

Add helper functions to allocate/free a TDX private host key id (HKID), and
export the global TDX HKID.

The memory controller encrypts TDX memory with the assigned TDX HKIDs. The
global TDX HKID is used to encrypt the TDX module, its memory, and some
dynamic data (TDR). A private TDX HKID is assigned to a guest TD to encrypt
the guest memory and related data. When the VMM releases an encrypted page
for reuse, the page needs a cache flush with the HKID it was encrypted with.
The VMM therefore needs both the global TDX HKID and the private TDX HKIDs to
flush encrypted pages.
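
As an illustration of how the helpers pair up, a caller in KVM (for example,
TD creation and teardown code later in this series) would be expected to use
them roughly like this; the function below is hypothetical:

/* Hypothetical caller: reserve a private HKID for a new TD. */
static int example_td_assign_hkid(struct kvm_tdx *kvm_tdx)
{
        int keyid = tdx_guest_keyid_alloc();

        if (keyid < 0)
                return keyid;           /* no free private HKID */

        kvm_tdx->hkid = keyid;
        return 0;
}

/*
 * On teardown, the pages encrypted with this HKID are cache-flushed first,
 * then the HKID is returned to the pool:
 *
 *      tdx_guest_keyid_free(kvm_tdx->hkid);
 */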

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/tdx.h | 12 ++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 28 +++++++++++++++++++++++++++-
2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ec2b8e67fe07..e50129a1f13e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -115,6 +115,16 @@ int tdx_cpu_enable(void);
int tdx_enable(void);
void tdx_reset_memory(void);
bool tdx_is_private_mem(unsigned long phys);
+
+/*
+ * Key id globally used by TDX module: TDX module maps TDR with this TDX global
+ * key id. TDR includes key id assigned to the TD. Then TDX module maps other
+ * TD-related pages with the assigned key id. TDR requires this TDX global key
+ * id for cache flush unlike other TD-related pages.
+ */
+extern u32 tdx_global_keyid;
+int tdx_guest_keyid_alloc(void);
+void tdx_guest_keyid_free(int keyid);
#else
static inline u64 __seamcall(u64 fn, struct tdx_module_args *args)
{
@@ -133,6 +143,8 @@ static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
static inline void tdx_reset_memory(void) { }
static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
+static inline int tdx_guest_keyid_alloc(void) { return -EOPNOTSUPP; }
+static inline void tdx_guest_keyid_free(int keyid) { }
#endif /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 04b3c81b35e5..5b8f2085d293 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -118,7 +118,8 @@ static __always_inline bool seamcall_err_is_kernel_defined(u64 err)
SEAMCALL_PRERR(seamcall_saved_ret, (__fn), (__args), \
seamcall_err_saved_ret)

-static u32 tdx_global_keyid __ro_after_init;
+u32 tdx_global_keyid __ro_after_init;
+EXPORT_SYMBOL_GPL(tdx_global_keyid);
static u32 tdx_guest_keyid_start __ro_after_init;
static u32 tdx_nr_guest_keyids __ro_after_init;

@@ -136,6 +137,31 @@ static struct tdmr_info_list tdx_tdmr_list;

static atomic_t tdx_may_have_private_mem;

+/* TDX KeyID pool */
+static DEFINE_IDA(tdx_guest_keyid_pool);
+
+int tdx_guest_keyid_alloc(void)
+{
+ if (WARN_ON_ONCE(!tdx_guest_keyid_start || !tdx_nr_guest_keyids))
+ return -EINVAL;
+
+ /* The first keyID is reserved for the global key. */
+ return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start + 1,
+ tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
+ GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(tdx_guest_keyid_alloc);
+
+void tdx_guest_keyid_free(int keyid)
+{
+ /* keyid = 0 is reserved. */
+ if (WARN_ON_ONCE(keyid <= 0))
+ return;
+
+ ida_free(&tdx_guest_keyid_pool, keyid);
+}
+EXPORT_SYMBOL_GPL(tdx_guest_keyid_free);
+
/*
* Do the module global initialization if not done yet. It can be
* done on any cpu. It's always called with interrupts disabled.
--
2.25.1

2023-10-16 17:00:57

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 026/116] KVM: TDX: Do TDX specific vcpu initialization

From: Isaku Yamahata <[email protected]>

A TD guest vcpu needs TDX-specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP as a vcpu-scoped ioctl, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it. A usage sketch follows
below.
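
For illustration, user space would invoke the new sub-command on each vcpu
file descriptor roughly as follows, after setting CPUID via KVM_SET_CPUID2
with X2APIC enabled as required by the callback below. The helper name is
made up; the id/data fields of struct kvm_tdx_cmd are as used in this series,
and data is forwarded to the guest in RCX:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: TDX-specific vcpu initialization. */
static int tdx_init_vcpu(int vcpu_fd, unsigned long long initial_rcx)
{
        struct kvm_tdx_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.id = KVM_TDX_INIT_VCPU;
        cmd.data = initial_rcx;         /* passed to the guest via RCX */

        return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}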

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/main.c | 9 ++
arch/x86/kvm/vmx/tdx.c | 180 +++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 7 +
arch/x86/kvm/vmx/x86_ops.h | 4 +
arch/x86/kvm/x86.c | 6 +
tools/arch/x86/include/uapi/asm/kvm.h | 1 +
9 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e1adeb7d4e26..d740a3f19d96 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -126,6 +126,7 @@ KVM_X86_OP(leave_smm)
KVM_X86_OP(enable_smi_window)
#endif
KVM_X86_OP(mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_register_region)
KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24d9a9ab338d..a21c060ffc68 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1737,6 +1737,7 @@ struct kvm_x86_ops {
#endif

int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
+ int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 7112546bd1d0..311a7894b712 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -571,6 +571,7 @@ struct kvm_pmu_event_filter {
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,

KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7a9ee3ec0785..5d88d2196822 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -142,6 +142,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
return tdx_vm_ioctl(kvm, argp);
}

+static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+ if (!is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return tdx_vcpu_ioctl(vcpu, argp);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -299,6 +307,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,

.mem_enc_ioctl = vt_mem_enc_ioctl,
+ .vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7f8ac4b9f95..c1a8560981a3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -49,6 +49,7 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)

struct tdx_info {
u8 nr_tdcs_pages;
+ u8 nr_tdvpx_pages;
};

/* Info about the TDX module. */
@@ -71,6 +72,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
}

+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+ return tdx->tdvpr_pa;
+}
+
static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
{
return kvm_tdx->tdr_pa;
@@ -88,6 +94,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
return kvm_tdx->hkid > 0;
}

+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->finalized;
+}
+
static void tdx_clear_page(unsigned long page_pa)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -371,7 +382,32 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
- /* This is stub for now. More logic will come. */
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int i;
+
+ /*
+ * This method can be called when vcpu allocation/initialization
+ * failed. So it's possible that hkid, tdvpx and tdvpr are not assigned
+ * yet.
+ */
+ if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm))) {
+ WARN_ON_ONCE(tdx->tdvpx_pa);
+ WARN_ON_ONCE(tdx->tdvpr_pa);
+ return;
+ }
+
+ if (tdx->tdvpx_pa) {
+ for (i = 0; i < tdx_info.nr_tdvpx_pages; i++) {
+ if (tdx->tdvpx_pa[i])
+ tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
+ }
+ kfree(tdx->tdvpx_pa);
+ tdx->tdvpx_pa = NULL;
+ }
+ if (tdx->tdvpr_pa) {
+ tdx_reclaim_td_page(tdx->tdvpr_pa);
+ tdx->tdvpr_pa = 0;
+ }
}

void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -380,8 +416,13 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
/* Ignore INIT silently because TDX doesn't support INIT event. */
if (init_event)
return;
+ if (KVM_BUG_ON(is_td_vcpu_created(to_tdx(vcpu)), vcpu->kvm))
+ return;

- /* This is stub for now. More logic will come here. */
+ /*
+ * Don't update mp_state to runnable because more initialization
+ * is needed by TDX_VCPU_INIT.
+ */
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -876,6 +917,136 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
return r;
}

+/* The VMM can pass one 64-bit auxiliary data value to the vcpu via RCX for the guest BIOS. */
+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ unsigned long *tdvpx_pa = NULL;
+ unsigned long tdvpr_pa;
+ unsigned long va;
+ int ret, i;
+ u64 err;
+
+ if (is_td_vcpu_created(tdx))
+ return -EINVAL;
+
+ /*
+ * vcpu_free method frees allocated pages. Avoid partial setup so
+ * that the method can't handle it.
+ */
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va)
+ return -ENOMEM;
+ tdvpr_pa = __pa(va);
+
+ tdvpx_pa = kcalloc(tdx_info.nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
+ GFP_KERNEL_ACCOUNT);
+ if (!tdvpx_pa) {
+ ret = -ENOMEM;
+ goto free_tdvpr;
+ }
+ for (i = 0; i < tdx_info.nr_tdvpx_pages; i++) {
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va) {
+ ret = -ENOMEM;
+ goto free_tdvpx;
+ }
+ tdvpx_pa[i] = __pa(va);
+ }
+
+ err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
+ if (KVM_BUG_ON(err, vcpu->kvm)) {
+ ret = -EIO;
+ pr_tdx_error(TDH_VP_CREATE, err, NULL);
+ goto free_tdvpx;
+ }
+ tdx->tdvpr_pa = tdvpr_pa;
+
+ tdx->tdvpx_pa = tdvpx_pa;
+ for (i = 0; i < tdx_info.nr_tdvpx_pages; i++) {
+ err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
+ if (KVM_BUG_ON(err, vcpu->kvm)) {
+ pr_tdx_error(TDH_VP_ADDCX, err, NULL);
+ for (; i < tdx_info.nr_tdvpx_pages; i++) {
+ free_page((unsigned long)__va(tdvpx_pa[i]));
+ tdvpx_pa[i] = 0;
+ }
+ /* vcpu_free method frees TDVPX and TDR donated to TDX */
+ return -EIO;
+ }
+ }
+
+ err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
+ if (KVM_BUG_ON(err, vcpu->kvm)) {
+ pr_tdx_error(TDH_VP_INIT, err, NULL);
+ return -EIO;
+ }
+
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ return 0;
+
+free_tdvpx:
+ for (i = 0; i < tdx_info.nr_tdvpx_pages; i++) {
+ if (tdvpx_pa[i])
+ free_page((unsigned long)__va(tdvpx_pa[i]));
+ tdvpx_pa[i] = 0;
+ }
+ kfree(tdvpx_pa);
+ tdx->tdvpx_pa = NULL;
+free_tdvpr:
+ if (tdvpr_pa)
+ free_page((unsigned long)__va(tdvpr_pa));
+ tdx->tdvpr_pa = 0;
+
+ return ret;
+}
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+ struct msr_data apic_base_msr;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct kvm_tdx_cmd cmd;
+ int ret;
+
+ if (tdx->initialized)
+ return -EINVAL;
+
+ if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
+ return -EFAULT;
+
+ if (cmd.error)
+ return -EINVAL;
+
+ /* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
+ if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
+ return -EINVAL;
+
+ /*
+ * As TDX requires X2APIC, set local apic mode to X2APIC. User space
+ * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
+ * KVM_SET_CPUID2. Otherwise kvm_set_apic_base() will fail.
+ */
+ apic_base_msr = (struct msr_data) {
+ .host_initiated = true,
+ .data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
+ (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
+ };
+ if (kvm_set_apic_base(vcpu, &apic_base_msr))
+ return -EINVAL;
+
+ ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
+ if (ret)
+ return ret;
+
+ tdx->initialized = true;
+ return 0;
+}
+
static int __init tdx_module_setup(void)
{
const struct tdsysinfo_struct *tdsysinfo;
@@ -894,6 +1065,11 @@ static int __init tdx_module_setup(void)
WARN_ON(tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS);
tdx_info = (struct tdx_info) {
.nr_tdcs_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
+ /*
+ * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
+ * -1 for TDVPR.
+ */
+ .nr_tdvpx_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
};

return 0;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 173ed19207fb..961a7da2452a 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,12 +17,19 @@ struct kvm_tdx {
u64 xfam;
int hkid;

+ bool finalized;
+
u64 tsc_offset;
};

struct vcpu_tdx {
struct kvm_vcpu vcpu;

+ unsigned long tdvpr_pa;
+ unsigned long *tdvpx_pa;
+
+ bool initialized;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 9b01104946b8..88e117db748c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -150,6 +150,8 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -169,6 +171,8 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+
+static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7aff6f88f575..69d9d15ef70d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6086,6 +6086,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
case KVM_SET_DEVICE_ATTR:
r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
break;
+ case KVM_MEMORY_ENCRYPT_OP:
+ r = -ENOTTY;
+ if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
+ goto out;
+ r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
+ break;
default:
r = -EINVAL;
}
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 61ce7d174fcf..83bd9e3118d1 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -566,6 +566,7 @@ struct kvm_pmu_event_filter {
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,

KVM_TDX_CMD_NR_MAX,
};
--
2.25.1

2023-10-16 17:01:29

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 022/116] KVM: TDX: Make pmu_intel.c ignore guest TD case

From: Isaku Yamahata <[email protected]>

TDX KVM doesn't support the PMU yet (that is future work for TDX KVM support,
to be done as another patch series), and pmu_intel.c touches VMX-specific
structures during vcpu initialization. As a workaround, add a dummy structure
to struct vcpu_tdx so that pmu_intel.c can ignore the TDX case.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 46 +++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/pmu_intel.h | 28 ++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 8 ++++++-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 32 +------------------------
5 files changed, 82 insertions(+), 34 deletions(-)
create mode 100644 arch/x86/kvm/vmx/pmu_intel.h

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f2efa0bf7ae8..d7cdb9fc4e2b 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -19,6 +19,7 @@
#include "lapic.h"
#include "nested.h"
#include "pmu.h"
+#include "tdx.h"

#define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)

@@ -68,6 +69,26 @@ static int fixed_pmc_events[] = {
[2] = PSEUDO_ARCH_REFERENCE_CYCLES,
};

+struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_TDX_HOST
+ if (is_td_vcpu(vcpu))
+ return &to_tdx(vcpu)->lbr_desc;
+#endif
+
+ return &to_vmx(vcpu)->lbr_desc;
+}
+
+struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_TDX_HOST
+ if (is_td_vcpu(vcpu))
+ return &to_tdx(vcpu)->lbr_desc.records;
+#endif
+
+ return &to_vmx(vcpu)->lbr_desc.records;
+}
+
static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
{
struct kvm_pmc *pmc;
@@ -179,6 +200,23 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
}

+bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+ return cpuid_model_is_consistent(vcpu);
+}
+
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
+{
+ struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
+
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return lbr->nr && (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_LBR_FMT);
+}
+
static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
{
struct x86_pmu_lbr *records = vcpu_to_lbr_records(vcpu);
@@ -285,6 +323,9 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu)
PERF_SAMPLE_BRANCH_USER,
};

+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return 0;
+
if (unlikely(lbr_desc->event)) {
__set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use);
return 0;
@@ -580,7 +621,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);

perf_capabilities = vcpu_get_perf_capabilities(vcpu);
- if (cpuid_model_is_consistent(vcpu) &&
+ if (intel_pmu_lbr_is_compatible(vcpu) &&
(perf_capabilities & PMU_CAP_LBR_FMT))
x86_perf_get_lbr(&lbr_desc->records);
else
@@ -636,6 +677,9 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
struct kvm_pmc *pmc = NULL;
int i;

+ if (is_td_vcpu(vcpu))
+ return;
+
for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) {
pmc = &pmu->gp_counters[i];

diff --git a/arch/x86/kvm/vmx/pmu_intel.h b/arch/x86/kvm/vmx/pmu_intel.h
new file mode 100644
index 000000000000..66bba47c1269
--- /dev/null
+++ b/arch/x86/kvm/vmx/pmu_intel.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_PMU_INTEL_H
+#define __KVM_X86_VMX_PMU_INTEL_H
+
+struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu);
+struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu);
+
+bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
+int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
+
+struct lbr_desc {
+ /* Basic info about guest LBR records. */
+ struct x86_pmu_lbr records;
+
+ /*
+ * Emulate LBR feature via passthrough LBR registers when the
+ * per-vcpu guest LBR event is scheduled on the current pcpu.
+ *
+ * The records may be inaccurate if the host reclaims the LBR.
+ */
+ struct perf_event *event;
+
+ /* True if LBRs are marked as not intercepted in the MSR bitmap */
+ bool msr_passthrough;
+};
+
+#endif /* __KVM_X86_VMX_PMU_INTEL_H */
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 184fe394da86..173ed19207fb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -4,6 +4,7 @@

#ifdef CONFIG_INTEL_TDX_HOST

+#include "pmu_intel.h"
#include "tdx_ops.h"

struct kvm_tdx {
@@ -21,7 +22,12 @@ struct kvm_tdx {

struct vcpu_tdx {
struct kvm_vcpu vcpu;
- /* TDX specific members follow. */
+
+ /*
+ * Dummy to make pmu_intel not corrupt memory.
+ * TODO: Support PMU for TDX. Future work.
+ */
+ struct lbr_desc lbr_desc;
};

static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0b8bf04e283d..724bce9387e0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2421,7 +2421,7 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if ((data & PMU_CAP_LBR_FMT) !=
(kvm_caps.supported_perf_cap & PMU_CAP_LBR_FMT))
return 1;
- if (!cpuid_model_is_consistent(vcpu))
+ if (!intel_pmu_lbr_is_compatible(vcpu))
return 1;
}
if (data & PERF_CAP_PEBS_FORMAT) {
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..41ad0e7a25fd 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
#include "capabilities.h"
#include "../kvm_cache_regs.h"
#include "posted_intr.h"
+#include "pmu_intel.h"
#include "vmcs.h"
#include "vmx_ops.h"
#include "../cpuid.h"
@@ -93,22 +94,6 @@ union vmx_exit_reason {
u32 full;
};

-struct lbr_desc {
- /* Basic info about guest LBR records. */
- struct x86_pmu_lbr records;
-
- /*
- * Emulate LBR feature via passthrough LBR registers when the
- * per-vcpu guest LBR event is scheduled on the current pcpu.
- *
- * The records may be inaccurate if the host reclaims the LBR.
- */
- struct perf_event *event;
-
- /* True if LBRs are marked as not intercepted in the MSR bitmap */
- bool msr_passthrough;
-};
-
/*
* The nested_vmx structure is part of vcpu_vmx, and holds information we need
* for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -655,21 +640,6 @@ static __always_inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
}

-static inline struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
-{
- return &to_vmx(vcpu)->lbr_desc;
-}
-
-static inline struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
-{
- return &vcpu_to_lbr_desc(vcpu)->records;
-}
-
-static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
-{
- return !!vcpu_to_lbr_records(vcpu)->nr;
-}
-
void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu);
--
2.25.1

2023-10-16 17:01:40

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v16 110/116] KVM: TDX: Inhibit APICv for TDX guest

From: Isaku Yamahata <[email protected]>

TDX doesn't support APICv, so inhibit APICv for TDX guests. Follow how SEV
does it: define a new inhibit reason for TDX, set it at TD initialization,
and add the flag to kvm_x86_ops.required_apicv_inhibits.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 9 +++++++++
arch/x86/kvm/vmx/main.c | 3 ++-
arch/x86/kvm/vmx/tdx.c | 6 ++++++
3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 529c7e610d47..f0674ef1c17e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1284,6 +1284,15 @@ enum kvm_apicv_inhibit {
* mapping between logical ID and vCPU.
*/
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
+
+ /*********************************************************/
+ /* INHIBITs that are relevant only to the Intel's APICv. */
+ /*********************************************************/
+
+ /*
+ * APICv is disabled because TDX doesn't support it.
+ */
+ APICV_INHIBIT_REASON_TDX,
};

struct kvm_arch {
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index eb44f5784f18..5d61f5306ae3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -1001,7 +1001,8 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
+ BIT(APICV_INHIBIT_REASON_TDX))

struct kvm_x86_ops vt_x86_ops __initdata = {
.name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 527fbcd7e2c5..924cbf97404a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2486,6 +2486,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
goto teardown;
}

+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
+
return 0;

/*
@@ -2849,6 +2851,10 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
}

vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+ WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm));
+ vcpu->arch.apic->apicv_active = false;
+
return 0;

free_tdvpx:
--
2.25.1