2024-02-26 08:29:22

by Isaku Yamahata

Subject: [PATCH v19 000/130] KVM TDX basic feature support

From: Isaku Yamahata <[email protected]>

KVM TDX basic feature support

Hello. This is v19 of the patch series of KVM TDX support. It is based on
v6.8-rc5.

Major changes and uAPI
----------------------
The major change is the uAPI: use KVM_MEMORY_MAPPING as a KVM common
implementation and drop CONFIG_KVM_MMU_PRIVATE. The uAPI change requires
patches to qemu and the tdx kselftest as follows.

For the unified KVM uAPI, I'll post independent patches, and TDX support will
be on top of this patch series.

Trees
-----
The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
The corresponding qemu branch is found at
https://github.com/yamahata/qemu/tree/tdx/qemu-upm
How to run/test: it's described at https://github.com/intel/tdx/wiki/TDX-KVM

Existing VM type and SNP
------------------------
The following matrix shows what the helper functions return and how the code
distinguishes the existing VM types (default, sw protected, and SNP) from TDX
so that this patch series doesn't break the existing code.

                       default or SEV-SNP   TDX: S = (47 or 51) - 12
gfn_shared_mask        0                    S bit
kvm_is_private_gpa()   always false         true if GFN has S bit set
kvm_gfn_to_shared()    nop                  set S bit
kvm_gfn_to_private()   nop                  clear S bit

fault.is_private means that the host page should be obtained from guest_memfd.
is_private_gpa() means that the KVM MMU should invoke the private MMU hooks.

Future work
-----------
This patch series supports only the basic functionality: TDX guest creation,
execution, and destruction. The following features are planned as follow-ups.
- vPMU
- Off-TD debug. qemu gdb stub support
- live migration
- etc

Required changes to qemu and tdx kselftest
------------------------------------------
Required changes to QEMU:
Please don't forget to sync linux-headers.

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 0041be589c21..60c70b6bed79 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -714,11 +714,7 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
tdx_post_init_vcpus();

for_each_tdx_fw_entry(tdvf, entry) {
- struct kvm_tdx_init_mem_region mem_region = {
- .source_addr = (__u64)entry->mem_ptr,
- .gpa = entry->address,
- .nr_pages = entry->size / 4096,
- };
+ struct kvm_memory_mapping mapping;

r = kvm_set_memory_attributes_private(entry->address, entry->size);
if (r < 0) {
@@ -726,14 +722,31 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
exit(1);
}

- __u32 flags = entry->attributes & TDVF_SECTION_ATTRIBUTES_MR_EXTEND ?
- KVM_TDX_MEASURE_MEMORY_REGION : 0;
-
- trace_kvm_tdx_init_mem_region(entry->type, entry->attributes, mem_region.source_addr, mem_region.gpa, mem_region.nr_pages);
- r = tdx_vm_ioctl(KVM_TDX_INIT_MEM_REGION, flags, &mem_region);
+ mapping = (struct kvm_memory_mapping) {
+ .base_gfn = entry->address / 4096,
+ .nr_pages = entry->size / 4096,
+ .source = (__u64)entry->mem_ptr,
+ };
+ do {
+ r = kvm_vcpu_ioctl(first_cpu, KVM_MEMORY_MAPPING, &mapping);
+ } while (r == -EAGAIN);
if (r < 0) {
- error_report("KVM_TDX_INIT_MEM_REGION failed %s", strerror(-r));
- exit(1);
+ error_report("KVM_MEMORY_MAPPING failed %s", strerror(-r));
+ exit(1);
+ }
+
+ if (entry->attributes & TDVF_SECTION_ATTRIBUTES_MR_EXTEND) {
+ mapping = (struct kvm_memory_mapping) {
+ .base_gfn = entry->address / 4096,
+ .nr_pages = entry->size / 4096,
+ };
+ do {
+ r = tdx_vm_ioctl(KVM_TDX_EXTEND_MEMORY, 0, &mapping);
+ } while (r == -EAGAIN);
+ if (r < 0) {
+ error_report("KVM_TDX_EXTEND_MEMORY failed %s", strerror(-r));
+ exit(1);
+ }
}

if (entry->type == TDVF_SECTION_TYPE_TD_HOB ||
--

Required changes to tdx kselftest.

diff --git a/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c b/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
index edc1b227a014..a2f4921c416f 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/tdx/tdx_util.c
@@ -23,7 +23,7 @@ static char *tdx_cmd_str[] = {
"KVM_TDX_CAPABILITIES",
"KVM_TDX_INIT_VM",
"KVM_TDX_INIT_VCPU",
- "KVM_TDX_INIT_MEM_REGION",
+ "KVM_TDX_EXTEND_MEMORY",
"KVM_TDX_FINALIZE_VM",
"KVM_TDX_RELEASE_VM"
};
@@ -179,19 +179,29 @@ static void tdx_td_vcpu_init(struct kvm_vcpu *vcpu)
static void tdx_init_mem_region(struct kvm_vm *vm, void *source_pages,
uint64_t gpa, uint64_t size)
{
- struct kvm_tdx_init_mem_region mem_region = {
- .source_addr = (uint64_t)source_pages,
- .gpa = gpa,
+ struct kvm_vcpu * vcpu = list_first_entry(&vm->vcpus,
+ struct kvm_vcpu, list);
+ struct kvm_memory_mapping mapping = {
+ .base_gfn = gpa / PAGE_SIZE,
.nr_pages = size / PAGE_SIZE,
+ .source = (uint64_t)source_pages,
};
- uint32_t metadata = KVM_TDX_MEASURE_MEMORY_REGION;
+ int r;

- TEST_ASSERT((mem_region.nr_pages > 0) &&
- ((mem_region.nr_pages * PAGE_SIZE) == size),
+ TEST_ASSERT((mapping.nr_pages > 0) &&
+ ((mapping.nr_pages * PAGE_SIZE) == size),
"Cannot add partial pages to the guest memory.\n");
TEST_ASSERT(((uint64_t)source_pages & (PAGE_SIZE - 1)) == 0,
"Source memory buffer is not page aligned\n");
- tdx_ioctl(vm->fd, KVM_TDX_INIT_MEM_REGION, metadata, &mem_region);
+ r = ioctl(vcpu->fd, KVM_MEMORY_MAPPING, &mapping);
+ TEST_ASSERT(r == 0, "KVM_MEMORY_MAPPING failed: %d %d",
+ r, errno);
+
+ mapping = (struct kvm_memory_mapping) {
+ .base_gfn = gpa / PAGE_SIZE,
+ .nr_pages = size / PAGE_SIZE,
+ };
+ tdx_ioctl(vm->fd, KVM_TDX_EXTEND_MEMORY, 0, &mapping);
}

static void tdx_td_finalizemr(struct kvm_vm *vm)
@@ -231,7 +241,7 @@ static void tdx_enable_capabilities(struct kvm_vm *vm)
KVM_X2APIC_API_USE_32BIT_IDS |
KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK);
vm_enable_cap(vm, KVM_CAP_SPLIT_IRQCHIP, 24);
- vm_enable_cap(vm, KVM_CAP_MAX_VCPUS, 1024);
+ vm_enable_cap(vm, KVM_CAP_MAX_VCPUS, 512);
}

static void tdx_configure_memory_encryption(struct kvm_vm *vm)
--
Isaku Yamahata

Changes from v18:
- rebased to v6.8-rc5:
- uAPI breakage: KVM_TDX_INIT_MEM_REGION => KVM_MEMORY_MAPPING +
KVM_TDX_EXTEND_MEMORY
- Drop CONFIG_KVM_MMU_PRIVATE
- Require NO_RBP_MOD extension of TDX module
- Use TDH.VP.RD(TD_VCPU_STATE_DETAILS) to drop buggy_hlt_workaround

Changes from v17:
- Changed tdx_seamcall() to use struct tdx_module_args
- use TDH.SYS.RD() instead of TDH.SYS.INFO()
- Drop workaround of pending interrupt
- move the initialization of loaded_vmcss_on_cpu to vmx_init()
- More error handling on tdx 1.5 specific error code
- drop changes of tools/arch/x86/include/uapi/asm/kvm.h
- fixes typo, indent and tabs

Changes from v16:
- rebased v6.7-rc
- Switched to TDX module 1.5; dropped support for TDX module 1.0

Changes from v15:
- Added KVM_TDX_RELEASE_VM to reduce the destruction time
- Catch up the TDX module interface to use struct tdx_module_args
instead of struct tdx_module_output
- Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1 and handle Secure-EPT violation
with SEPT_VE_DISABLE case.
- Simplified tdx_reclaim_page()
- Reorganize the locking of tdx_release_hkid(), and use smp_call_mask()
instead of smp_call_on_cpu() to hold spinlock to race with invalidation
on releasing guest memfd
- Removed AMX check as the KVM upstream supports AMX.
- Added CET flag to guest supported xss
- add check if nr_pages isn't large with
(nr_page << PAGE_SHIFT) >> PAGE_SHIFT
- use __seamcall_saved_ret()
- As struct tdx_module_args doesn't match with vcpu.arch.regs, copy regs
before/after calling __seamcall_saved_ret().

Changes from v14:
https://lore.kernel.org/all/[email protected]/
- rebased to v6.5-rc2, v11 KVM guest_memfd(), v11 TDX host kernel support
- ABI change to add reserved member for future compatibility, dropped unused
member.
- handle EXIT_REASON_OTHER_SMI
- handle FEAT_CTL MSR access

Changes from v13:
- rebased to v6.4-rc3
- Make use of KVM gmem.
- Added check_cpuid callback for KVM_SET_CPUID2 as RFC patch.
- ABI change of KVM_TDX_VM_INIT as VM scoped KVM ioctl.
- Make TDX initialization non-depend on kvm hardware_enable.
Use vmx_hardware_enable directly.
- Drop a patch to prohibit dirty logging as new KVM gmem code base
- Drop parameter only checking for some TDG.VP.VMCALL. Just default part

Changes from v12:
- ABI change of KVM_TDX_VM_INIT
- Rename kvm_gfn_{private, shared} to kvm_gfn_to_{private, shared}
- Move APIC BASE MSI initialization to KVM_TDX_VCPU_INIT
- Fix MTRR patch
- Make MapGpa hypercall always pass it to user space VMM
- Split hooks to TDP MMU into two part. populating and zapping.

Changes from v11:
- ABI change of KVM_TDX_VM_INIT
- Split the hook of TDP MMU to not modify handle_changed_spte()
- Enhanced commit message on mtrr patch
- Made KVM_CAP_MAX_VCPUS to x86 specific

Changes from v10:
- rebased to v6.2-rc3
- support mtrr with its own patches
- Integrated fd-based private page v10
- Integrated TDX host kernel support v8
- Integrated kvm_init rework v2
- removed struct tdx_td_page and its initialization logic
- cleaned up mmio spte and require enable_mmio_caching=true for TDX
- removed dubious WARN_ON_ONCE()
- split a patch adding methods as nop into several patches

Changes from v9:
- rebased to v6.1-rc2
- Integrated fd-based private page v9 as prerequisite.
- Integrated TDX host kernel support v6
- TDP MMU: Make handle_change_spte() return value.
- TDX: removed seamcall_lock and return -EAGAIN so that TDP MMU can retry

Changes from v8:
- rebased to v6.0-rc7
- Integrated with kvm hardware initialization. Check that all packages have at
least one online CPU when creating a guest TD, and refuse cpu offline while
guest TDs are running.
- Integrated fd-based private page v8 as prerequisite.
- TDP MMU: Introduced more callbacks instead of single callback.

Changes from v7:
- Use xarray to track whether GFN is private or shared. Drop SPTE_SHARED_MASK.
The complex state machine with SPTE_SHARED_MASK was ditched.
- Large page support is implemented. But will be posted as independent RFC patch.
- fd-based private page v7 is integrated. This is mostly same to Chao's patches.
It's in github.

Changes from v6:
- rebased to v5.19

Changes from v5:
- export __seamcall and use it
- move mutex lock from callee function of smp_call_on_cpu to the caller.
- rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
- updated comment
- drop the use of tdh_mng_key.reclaimid(): as the function is for backward
compatibility to only return success
- struct kvm_tdx_cmd: metadata => flags, added __u64 error.
- make this ioctl systemwide ioctl
- ABI change to struct kvm_init_vm
- guest_tsc_khz: use kvm->arch.default_tsc_khz
- rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
- drop exporting kvm_set_tsc_khz().
- fix kvm_tdp_page_fault() for mtrr emulation
- rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
- drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
- update commit message
- rename shadow_init_value => shadow_nonpresent_value
- added ept_violation_ve_test mode
- shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
- legacy MMU case
=> - mmu_topup_shadow_page_cache(), kvm_mmu_create()
- FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
- #VE warning:
- rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
- merged this patch into "KVM: x86/mmu: Allow non-zero init value for shadow
PTE" as discussed.
- fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
- introduce kvm_gfn_for_root(kvm, root, gfn)
- add only_shared argument to kvm_tdp_mmu_handle_gfn()
- use kvm_arch_dirty_log_supported()
- rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
- rename: is_private_prohibit_spte() => spte_shared_mask()
- fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
- dropped this patch as the change was merged into kvm/queue
- update vt_apicv_post_state_restore()
- use is_64_bit_hypercall()
- comment: expand MSMI -> Machine Check System Management Interrupt
- fixed TDX_SEPT_PFERR
- tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
- rename tdmvcall_exit_readon() => tdvmcall_leaf()
- remove optional zero check of argument.
- do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
in kvm_vcpu_ioctl_smi and __apic_accept_irq.
- WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
- introduce vcpu_deliver_init to x86_ops
- sprinkled KVM_BUG_ON()

Changes from v4:
- rebased to TDX host kernel patch series.
- include all the patches to make this patch series working.
- add [MARKER] patches to mark the patch layer clear.

---
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.

A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.

We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).

In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.

* The organization of this patch series
This patch series is on top of the patches series "TDX host kernel support":
https://lore.kernel.org/lkml/[email protected]/

This patch series is available at
https://github.com/intel/tdx/tree/kvm-upstream

The related repositories (TDX qemu, TDX OVMF(tdvf) etc) are described at
https://github.com/intel/tdx/wiki/TDX-KVM

The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.

The below layers are chosen so that the device model, for example, qemu can
exercise each layering step by step. Check if TDX is supported, create TD VM,
create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
vcpu exits/hypercalls/interrupts to run TD fully.

TDX vcpu
interrupt/exits/hypercall<------------\
      ^                               |
      |                               |
TD finalization                       |
      ^                               |
      |                               |
TDX EPT violation<------------\       |
      ^                       |       |
      |                       |       |
TD vcpu enter/exit            |       |
      ^                       |       |
      |                       |       |
TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
      ^                       |                       ^
      |                       |                       |
TD VM creation/destruction    \---------------KVM TDP MMU hooks
      ^                                               ^
      |                                               |
TDX architectural definitions             KVM TDP refactoring for TDX
      ^                                               ^
      |                                               |
TDX, VMX <--------TDX host kernel          KVM MMU GPA stolen bits
coexistence       support

The following are explanations of each layer. Each layer has a dummy commit
that starts with [MARKER] in the subject. It is intended to help identify
where each layer starts.

TDX host kernel support:
https://lore.kernel.org/lkml/[email protected]/
The guts of the system-wide initialization of the TDX module. There is an
independent patch series for host x86. TDX KVM patches call functions
this patch series provides to initialize the TDX module.

TDX, VMX coexistence:
Infrastructure to allow TDX to coexist with VMX and trigger the
initialization of the TDX module.
This layer starts with
"KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
Add TDX architectural definitions and helper functions
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
Allocation and release of the TDX specific vm structure on guest TD
creation/destruction. Create an initial guest memory image with TDX
measurement.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
Allocation and release of the TDX specific vcpu structure on guest TD
vcpu creation/destruction, and TDX specific vcpu initialization.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
Create an initial guest memory image with TDX measurement. Handle
secure EPT violations to populate guest pages with TDX SEAMCALLs.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
Allow TDX vcpu to enter into TD and exit from TD. Save CPU state before
entering into TD. Restore CPU state after exiting from TD.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
Handle various exits/hypercalls and allow interrupts to be injected so
that TD vcpu can continue running.
This layer starts with
"[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"

KVM MMU GPA shared bit:
Introduce a framework to handle the shared bit of the GPA. TDX
repurposes one bit of the GPA to indicate shared or private. If it's
shared, it's the same as the conventional VMX EPT case; the VMM can
access shared guest pages. If it's private, it's handled by the
Secure-EPT and the guest page is encrypted.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
TDX Secure EPT requires different constants, e.g. the initial EPT
entry value. Various refactoring for those differences.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
Introduce a framework for the TDP MMU to add hooks in addition to direct
EPT access. TDX added Secure EPT, which is an enhancement to VMX EPT.
Unlike conventional VMX EPT, the CPU can't directly read/write Secure
EPT; instead, TDX SEAMCALLs are used to operate on it.
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
Introduce framework to handle switching guest pages from private/shared
to shared/private. For a given GPA, a guest page can be assigned to a
private GPA or a shared GPA exclusively. With TDX MapGPA hypercall,
guest TD converts GPA assignments from private (or shared) to shared (or
private).
This layer starts with
"[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA"

KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
Guest private memory requires different memory management in KVM. The
patch series proposes a way to do it and integrates it with TDX KVM.

(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.

The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation. TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.

TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode. A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.

TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.

TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and encrypted (using the guest TD's ephemeral private key). At a high
level, TDCS holds information for controlling TD operation as a whole,
execution, EPTP, MSR bitmaps, etc that KVM needs to set it up. Note that MSR
bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
same value for all VCPUs of the same TD.

Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU. It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.

Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM. These control
structures are pointed to by fields in the TD VMCS.

The above means that 1) KVM needs to allocate different data structures for
TDs, 2) KVM can reuse the existing code for TDs for some operations, and 3)
it needs to define TD-specific handling for others, redirecting operations to
the TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
vmx_callback();".

*TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.

In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.

During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM, the same as
with the current VMX. Since guest TDs usually require I/O and the data
exchange needs to be done via shared memory, KVM needs to use the current
EPT functionality even for TDs.

* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT.
Since execution of such interface functions takes much longer than accessing
memory directly, in KVM we use the existing TDP code to mirror the Secure EPT
for the TD.

This way, we can effectively walk Secure EPT without using the TDX interface
functions.

* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently. For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.

* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU. The current KVM
owns TSC virtualization for VMs, but the TDX module does for TDs.

* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCE. Instead, a PV way is
needed for the TD to communicate with the VMM. For now, KVM silently ignores
MCE requests by the VMM. MSRs related to MCE (e.g. MCE bank registers) can be
naturally emulated by paravirtualizing MSR access.

For details, the specifications [1]-[7] are available.

* Restrictions or future work
Some features are not included to reduce the patch size. Those features will
be addressed in future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more

* Prerequisites
It's required to load the TDX module and initialize it, which is out of the
scope of this patch series. Another independent patch series for the common
x86 code is planned; it defines CONFIG_INTEL_TDX_HOST, which this patch
series uses. It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX module
is initialized and the TDX module APIs for the TDX guest life cycle, such as
tdh.mng.init, are ready for KVM to use.

Concretely, global initialization, LP (Logical Processor) initialization,
global configuration, key configuration, and TDMR and PAMT initialization are
done. The state of the TDX module is SYS_READY. Please refer to the TDX
module specification, the chapter "Intel TDX Module Lifecycle State Machine".

** Detecting the TDX module readiness.
The TDX host patch series implements detection of the TDX module availability
and its initialization so that KVM can use it. It also manages the Host KeyID
(HKID) assigned to guest TDs.
The assumed APIs the TDX host patch series provides are:
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
Return the system-wide information about the TDX module. NULL if the TDX
module isn't initialized.
- int tdx_enable(void);
Initialization of TDX module so that the TDX module is ready for KVM to use.
- extern u32 tdx_global_keyid __read_mostly;
global host key id that is used for the TDX module itself.
- u32 tdx_get_num_keyid(void);
Return the number of available TDX private host key ids.
- int tdx_keyid_alloc(void);
Allocate HKID for guest TD.
- void tdx_keyid_free(int keyid);
Free HKID for guest TD.

(****)
* TDX KVM high-level design
- Host key ID management
A Host Key ID (HKID) needs to be assigned to each TDX guest for memory
encryption. It is assumed that the TDX host patch series implements the
necessary functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void),
and void tdx_keyid_free(int keyid).

- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct vcpu_vmx.
To identify the VM, introduce a VM type to specify which type, VMX (default)
or TDX, is used.

- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, set initial guest memory and measurement.

The creation of a TDX VM requires five new operations on top of the
conventional VM creation.
- Get KVM system capability to check if TDX VM type is supported
- VM creation (KVM_CREATE_VM)
- New: Get the TDX specific system parameters. KVM_TDX_GET_CAPABILITY.
- New: Set TDX specific VM parameters. KVM_TDX_INIT_VM.
- VCPU creation (KVM_CREATE_VCPU)
- New: Set TDX specific VCPU parameters. KVM_TDX_INIT_VCPU.
- New: Initialize guest memory as boot state and extend the measurement with
the memory. KVM_TDX_INIT_MEM_REGION.
- New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
TDX VM contents.
- VCPU RUN (KVM_VCPU_RUN)

- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it; for example, it can't access CPU registers, inject
exceptions, or access guest memory. Those operations are silently ignored,
returning zero or the initial reset value when requested via KVM API ioctls.

VM/VCPU state and callbacks for TDX specific operations.
Define tdx specific VM state and VCPU state instead of VMX ones. Redirect
operations to TDX specific callbacks. "if (tdx) tdx_op() else vmx_op()".

Operations on the CPU state
Silently ignore operations on the guest state. For example, writes to
CPU registers are ignored and reads from CPU registers return 0.

. ignore access to CPU registers except for allowed ones.
. TSC: add a check if tsc is immutable and return an error. Because the KVM
implementation updates the internal tsc state and it's difficult to back
out those changes. Instead, skip the logic.
. dirty logging: add check if dirty logging is supported.
. exceptions/SMI/MCE/SIPI/INIT: silently ignore

Note: virtual external interrupt and NMI can be injected into TDX guests.

- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate
whether the guest physical address is private (the bit is cleared) or shared
(the bit is set). This bit is called a stolen bit.

- Stolen bits framework
systematically tracks which guest physical addresses, shared or private,
are used.

- Shared EPT and secure EPT
There are two EPTs: the Shared EPT (the conventional one) and the
Secure EPT (the new one). The Shared EPT handles GPAs with the stolen
bit set, the same as before. The Secure EPT points to private guest
pages. To resolve an EPT violation, KVM walks one of the two EPTs
based on the faulted GPA. Because it's costly to access the Secure EPT
with SEAMCALLs while walking EPTs for a private guest physical
address, another private EPT is used as a mirror of the Secure-EPT,
reusing the existing logic at the cost of extra memory.

The following depicts the relationship.

              KVM                                |      TDX module
               |                                 |           |
       --------+------------                     |           |
       |            |                            |           |
       V            V                            |           |
    shared GPA   private GPA                     |           |
CPU shared EPT pointer  KVM private EPT pointer  |  CPU secure EPT pointer
       |            |                            |           |
       |            |                            |           |
       V            V                            |           V
    shared EPT   private EPT--------mirror----------->Secure EPT
       |            |                            |           |
       |            \----------------------------+-----\     |
       |                                         |     |     |
       V                                         |     V     V
 shared guest page                               |   private guest page
                                                 |
                                                 |
 non-encrypted memory                            |   encrypted memory
                                                 |

- Operating on Secure EPT
Use the TDX module APIs to operate on the Secure EPT. To call the TDX APIs
while resolving an EPT violation, add hooks for the additional operations
and wire them to the TDX backend.

* References

[1] TDX specification
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
https://cdrdv2.intel.com/v1/dl/getContent/726790
[3] Intel CPU Architectural Extensions Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 Specification
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
[5] Intel TDX Loader Interface Specification
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
https://cdrdv2.intel.com/v1/dl/getContent/726790
[7] Intel TDX Virtual Firmware Design Guide
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
[8] intel public github
kvm TDX branch: https://github.com/intel/tdx/tree/kvm
TDX guest branch: https://github.com/intel/tdx/tree/guest
qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
https://github.com/tianocore/edk2-staging/tree/TDVF
This was merged into EDK2 main branch. https://github.com/tianocore/edk2

Chao Gao (2):
KVM: x86/mmu: Assume guest MMIOs are shared
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
wrmsr

Isaku Yamahata (101):
x86/tdx: Warning with 32bit build shift-count-overflow
KVM: x86: Pass is_private to gmem hook of gmem_max_level
KVM: Add new members to struct kvm_gfn_range to operate on
KVM: x86/mmu: Pass around full 64-bit error code for the KVM page
fault
KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is private
KVM: Add KVM vcpu ioctl to pre-populate guest memory
KVM: Document KVM_MEMORY_MAPPING ioctl
KVM: x86: Implement kvm_arch_{, pre_}vcpu_memory_mapping()
KVM: x86: Add is_vm_type_supported callback
KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_init()
KVM: x86/vmx: Refactor KVM VMX module init/exit functions
KVM: TDX: Initialize the TDX module when loading the KVM intel kernel
module
KVM: TDX: Add placeholders for TDX VM/vcpu structure
KVM: TDX: Make TDX VM type supported
[MARKER] The start of TDX KVM patch series: TDX architectural
definitions
KVM: TDX: Define TDX architectural definitions
KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
KVM: TDX: Add helper functions to print TDX SEAMCALL error
[MARKER] The start of TDX KVM patch series: TD VM creation/destruction
KVM: TDX: Add helper functions to allocate/free TDX private host key
id
KVM: TDX: Add helper function to read TDX metadata in array
KVM: TDX: Get system-wide info about TDX module on initialization
KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific
KVM: TDX: create/destroy VM structure
KVM: TDX: initialize VM with TDX specific parameters
KVM: TDX: Make pmu_intel.c ignore guest TD case
KVM: TDX: Refuse to unplug the last cpu on the package
[MARKER] The start of TDX KVM patch series: TD vcpu
creation/destruction
KVM: TDX: create/free TDX vcpu structure
KVM: TDX: Do TDX specific vcpu initialization
[MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
KVM: x86/mmu: Add address conversion functions for TDX shared bit of
GPA
[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
TDX
KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
KVM: x86/mmu: Add Suppress VE bit to
shadow_mmio_mask/shadow_present_mask
KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
KVM: x86/mmu: Disallow fast page fault on private GPA
KVM: VMX: Introduce test mode related to EPT violation VE
[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at
allocation
KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role
KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
KVM: x86/tdp_mmu: Apply mmu notifier callback to only shared GPA
KVM: x86/tdp_mmu: Sprinkle __must_check
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
[MARKER] The start of TDX KVM patch series: TDX EPT violation
KVM: TDX: Add accessors VMX VMCS helpers
KVM: TDX: Require TDP MMU and mmio caching for TDX
KVM: TDX: TDP MMU TDX support
KVM: TDX: MTRR: implement get_mt_mask() for TDX
[MARKER] The start of TDX KVM patch series: TD finalization
KVM: x86: Add hooks in kvm_arch_vcpu_memory_mapping()
KVM: TDX: Create initial guest memory
KVM: TDX: Extend memory measurement with initial guest memory
KVM: TDX: Finalize VM initialization
[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
KVM: TDX: Implement TDX vcpu enter/exit path
KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
KVM: TDX: restore host xsave state when exit from the guest TD
KVM: TDX: restore user ret MSRs
[MARKER] The start of TDX KVM patch series: TD vcpu
exits/interrupts/hypercalls
KVM: TDX: Complete interrupts after tdexit
KVM: TDX: restore debug store when TD exit
KVM: TDX: handle vcpu migration over logical processor
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
behavior
KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
KVM: TDX: Implement interrupt injection
KVM: TDX: Implement vcpu request_immediate_exit
KVM: TDX: Implement methods to inject NMI
KVM: TDX: Add a place holder to handle TDX VM exit
KVM: TDX: handle EXIT_REASON_OTHER_SMI
KVM: TDX: handle ept violation/misconfig exit
KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI
KVM: TDX: Add a place holder for handler of TDX hypercalls
(TDG.VP.VMCALL)
KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL
KVM: TDX: Handle TDX PV CPUID hypercall
KVM: TDX: Handle TDX PV HLT hypercall
KVM: TDX: Handle TDX PV port io hypercall
KVM: TDX: Implement callbacks for MSR operations for TDX
KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
KVM: TDX: Handle MSR MTRRCap and MTRRDefType access
KVM: TDX: Handle MSR IA32_FEAT_CTL MSR and IA32_MCG_EXT_CTL
KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
KVM: TDX: Silently discard SMI request
KVM: TDX: Silently ignore INIT/SIPI
KVM: TDX: Add methods to ignore accesses to CPU state
KVM: TDX: Add methods to ignore guest instruction emulation
KVM: TDX: Add a method to ignore dirty logging
KVM: TDX: Add methods to ignore VMX preemption timer
KVM: TDX: Add methods to ignore accesses to TSC
KVM: TDX: Ignore setting up mce
KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch
KVM: TDX: Add methods to ignore virtual apic related operation
KVM: TDX: Inhibit APICv for TDX guest
Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
RFC: KVM: x86: Add x86 callback to check cpuid
RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

Kai Huang (7):
x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro
x86/virt/tdx: Move TDMR metadata fields map table to local variable
x86/virt/tdx: Unbind global metadata read with 'struct
tdx_tdmr_sysinfo'
x86/virt/tdx: Support global metadata read for all element sizes
x86/virt/tdx: Export global metadata read infrastructure
x86/virt/tdx: Export TDX KeyID information
x86/virt/tdx: Export SEAMCALL functions

Michael Roth (1):
KVM: x86: Add gmem hook for determining max NPT mapping level

Sean Christopherson (15):
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
KVM: TDX: Add TDX "architectural" error codes
KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
KVM: Allow page-sized MMU caches to be initialized with custom 64-bit
values
KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
SPTE
KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases
KVM: VMX: Split out guts of EPT violation to common/exposed function
KVM: TDX: Add load_mmu_pgd method for TDX
KVM: TDX: Add support for find pending IRQ in a protected local APIC
KVM: x86: Assume timer IRQ was injected if APIC state is protected
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
argument
KVM: VMX: Move NMI/exception handler to common helper
KVM: x86: Split core of hypercall emulation to helper function
KVM: TDX: Handle TDX PV MMIO hypercall

Yan Zhao (1):
KVM: x86/mmu: Do not enable page track for TD guest

Yang Weijiang (1):
KVM: TDX: Add TSX_CTRL msr into uret_msrs list

Yao Yuan (1):
KVM: TDX: Handle vmentry failure for INTEL TD guest

Yuan Yao (1):
KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

Documentation/virt/kvm/api.rst | 37 +-
Documentation/virt/kvm/index.rst | 2 +
.../virt/kvm/intel-tdx-layer-status.rst | 33 +
Documentation/virt/kvm/x86/index.rst | 2 +
Documentation/virt/kvm/x86/intel-tdx.rst | 362 ++
Documentation/virt/kvm/x86/tdx-tdp-mmu.rst | 443 +++
arch/x86/events/intel/ds.c | 1 +
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/kvm-x86-ops.h | 21 +-
arch/x86/include/asm/kvm_host.h | 79 +-
arch/x86/include/asm/shared/tdx.h | 9 +-
arch/x86/include/asm/tdx.h | 31 +-
arch/x86/include/asm/vmx.h | 14 +
arch/x86/include/uapi/asm/kvm.h | 86 +
arch/x86/include/uapi/asm/vmx.h | 5 +-
arch/x86/kvm/Kconfig | 3 +-
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/cpuid.c | 27 +-
arch/x86/kvm/cpuid.h | 2 +
arch/x86/kvm/irq.c | 3 +
arch/x86/kvm/lapic.c | 33 +-
arch/x86/kvm/lapic.h | 2 +
arch/x86/kvm/mmu.h | 37 +
arch/x86/kvm/mmu/mmu.c | 247 +-
arch/x86/kvm/mmu/mmu_internal.h | 87 +-
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/page_track.c | 3 +
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.c | 17 +-
arch/x86/kvm/mmu/spte.h | 28 +-
arch/x86/kvm/mmu/tdp_iter.h | 14 +-
arch/x86/kvm/mmu/tdp_mmu.c | 442 ++-
arch/x86/kvm/mmu/tdp_mmu.h | 7 +-
arch/x86/kvm/smm.h | 7 +-
arch/x86/kvm/svm/svm.c | 8 +
arch/x86/kvm/vmx/common.h | 166 +
arch/x86/kvm/vmx/main.c | 1268 +++++++
arch/x86/kvm/vmx/pmu_intel.c | 46 +-
arch/x86/kvm/vmx/pmu_intel.h | 28 +
arch/x86/kvm/vmx/posted_intr.c | 43 +-
arch/x86/kvm/vmx/posted_intr.h | 13 +
arch/x86/kvm/vmx/tdx.c | 3347 +++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 268 ++
arch/x86/kvm/vmx/tdx_arch.h | 273 ++
arch/x86/kvm/vmx/tdx_errno.h | 34 +
arch/x86/kvm/vmx/tdx_error.c | 21 +
arch/x86/kvm/vmx/tdx_ops.h | 400 ++
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 663 +---
arch/x86/kvm/vmx/vmx.h | 52 +-
arch/x86/kvm/vmx/x86_ops.h | 276 ++
arch/x86/kvm/x86.c | 166 +-
arch/x86/kvm/x86.h | 4 +
arch/x86/virt/vmx/tdx/seamcall.S | 4 +
arch/x86/virt/vmx/tdx/tdx.c | 95 +-
arch/x86/virt/vmx/tdx/tdx.h | 2 -
include/linux/kvm_host.h | 7 +
include/linux/kvm_types.h | 1 +
include/uapi/linux/kvm.h | 99 +
virt/kvm/guest_memfd.c | 3 +
virt/kvm/kvm_main.c | 115 +-
61 files changed, 8741 insertions(+), 758 deletions(-)
create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
create mode 100644 Documentation/virt/kvm/x86/intel-tdx.rst
create mode 100644 Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
create mode 100644 arch/x86/kvm/vmx/common.h
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
create mode 100644 arch/x86/kvm/vmx/tdx.c
create mode 100644 arch/x86/kvm/vmx/tdx.h
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
create mode 100644 arch/x86/kvm/vmx/tdx_error.c
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
create mode 100644 arch/x86/kvm/vmx/x86_ops.h


base-commit: ffd2cb6b718e189e7e2d5d0c19c25611f92e061a
--
2.25.1



2024-02-26 08:29:50

by Isaku Yamahata

Subject: [PATCH v19 003/130] x86/virt/tdx: Unbind global metadata read with 'struct tdx_tdmr_sysinfo'

From: Kai Huang <[email protected]>

For now the kernel only reads TDMR-related global metadata fields for
module initialization, and the metadata read code only works with
'struct tdx_tdmr_sysinfo'.

KVM will need to read a number of non-TDMR-related metadata fields to
create and run TDX guests. It's essential to provide a generic metadata
read infrastructure that is not bound to any specific structure.

To start building such infrastructure, unbind the metadata read code
from 'struct tdx_tdmr_sysinfo'.

Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/virt/vmx/tdx/tdx.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cdcb3332bc5d..eb208da4ff63 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -273,9 +273,9 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)

static int read_sys_metadata_field16(u64 field_id,
int offset,
- struct tdx_tdmr_sysinfo *ts)
+ void *stbuf)
{
- u16 *ts_member = ((void *)ts) + offset;
+ u16 *st_member = stbuf + offset;
u64 tmp;
int ret;

@@ -287,7 +287,7 @@ static int read_sys_metadata_field16(u64 field_id,
if (ret)
return ret;

- *ts_member = tmp;
+ *st_member = tmp;

return 0;
}
@@ -297,19 +297,22 @@ struct field_mapping {
int offset;
};

-#define TD_SYSINFO_MAP(_field_id, _member) \
- { .field_id = MD_FIELD_ID_##_field_id, \
- .offset = offsetof(struct tdx_tdmr_sysinfo, _member) }
+#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
+ { .field_id = MD_FIELD_ID_##_field_id, \
+ .offset = offsetof(_struct, _member) }
+
+#define TD_SYSINFO_MAP_TDMR_INFO(_field_id, _member) \
+ TD_SYSINFO_MAP(_field_id, struct tdx_tdmr_sysinfo, _member)

static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
{
/* Map TD_SYSINFO fields into 'struct tdx_tdmr_sysinfo': */
const struct field_mapping fields[] = {
- TD_SYSINFO_MAP(MAX_TDMRS, max_tdmrs),
- TD_SYSINFO_MAP(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
- TD_SYSINFO_MAP(PAMT_4K_ENTRY_SIZE, pamt_entry_size[TDX_PS_4K]),
- TD_SYSINFO_MAP(PAMT_2M_ENTRY_SIZE, pamt_entry_size[TDX_PS_2M]),
- TD_SYSINFO_MAP(PAMT_1G_ENTRY_SIZE, pamt_entry_size[TDX_PS_1G]),
+ TD_SYSINFO_MAP_TDMR_INFO(MAX_TDMRS, max_tdmrs),
+ TD_SYSINFO_MAP_TDMR_INFO(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
+ TD_SYSINFO_MAP_TDMR_INFO(PAMT_4K_ENTRY_SIZE, pamt_entry_size[TDX_PS_4K]),
+ TD_SYSINFO_MAP_TDMR_INFO(PAMT_2M_ENTRY_SIZE, pamt_entry_size[TDX_PS_2M]),
+ TD_SYSINFO_MAP_TDMR_INFO(PAMT_1G_ENTRY_SIZE, pamt_entry_size[TDX_PS_1G]),
};
int ret;
int i;
--
2.25.1
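The refactor above revolves around an offsetof()-based "field ID -> structure
member" mapping table. A userspace sketch of that technique follows; the field
IDs and the metadata source are made up (the real reader is backed by a
SEAMCALL), only the offset-based dispatch mirrors the kernel code.

```c
#include <stddef.h>
#include <stdint.h>

struct tdx_tdmr_sysinfo {
	uint16_t max_tdmrs;
	uint16_t max_reserved_per_tdmr;
};

struct field_mapping {
	uint64_t field_id;
	int offset;
};

#define TD_SYSINFO_MAP(_field_id, _struct, _member)	\
	{ .field_id = (_field_id),			\
	  .offset = offsetof(_struct, _member) }

/* Stand-in for the SEAMCALL-backed read: deterministic fake values. */
static int read_sys_metadata_field(uint64_t field_id, uint64_t *data)
{
	*data = field_id * 2;
	return 0;
}

/* Write the value into whatever structure 'stbuf' points at, by offset;
 * the buffer is no longer tied to 'struct tdx_tdmr_sysinfo'. */
static int read_sys_metadata_field16(uint64_t field_id, int offset, void *stbuf)
{
	uint16_t *st_member = (uint16_t *)((char *)stbuf + offset);
	uint64_t tmp;
	int ret = read_sys_metadata_field(field_id, &tmp);

	if (ret)
		return ret;
	*st_member = (uint16_t)tmp;
	return 0;
}

static int fill_tdmr_sysinfo(struct tdx_tdmr_sysinfo *si)
{
	const struct field_mapping fields[] = {
		TD_SYSINFO_MAP(1, struct tdx_tdmr_sysinfo, max_tdmrs),
		TD_SYSINFO_MAP(2, struct tdx_tdmr_sysinfo, max_reserved_per_tdmr),
	};
	size_t i;

	for (i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
		int ret = read_sys_metadata_field16(fields[i].field_id,
						    fields[i].offset, si);
		if (ret)
			return ret;
	}
	return 0;
}
```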


2024-02-26 08:29:56

by Isaku Yamahata

Subject: [PATCH v19 005/130] x86/virt/tdx: Export global metadata read infrastructure

From: Kai Huang <[email protected]>

KVM will need to read a number of non-TDMR-related metadata fields to
create and run TDX guests. Export the metadata read infrastructure for
KVM to use.

Specifically, export two helpers:

1) The helper which reads multiple metadata fields to a buffer of a
structure based on the "field ID -> structure member" mapping table.

2) The low level helper which just reads a given field ID.

The two helpers cover cases when the user wants to cache a bunch of
metadata fields to a certain structure and when the user just wants to
query a specific metadata field on demand. They are enough for KVM to
use (and also should be enough for other potential users).

Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/tdx.h | 22 ++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 25 ++++++++-----------------
2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eba178996d84..709b9483f9e4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -116,6 +116,28 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
int tdx_cpu_enable(void);
int tdx_enable(void);
const char *tdx_dump_mce_info(struct mce *m);
+
+struct tdx_metadata_field_mapping {
+ u64 field_id;
+ int offset;
+ int size;
+};
+
+#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
+ { .field_id = MD_FIELD_ID_##_field_id, \
+ .offset = offsetof(_struct, _member), \
+ .size = sizeof(typeof(((_struct *)0)->_member)) }
+
+/*
+ * Read multiple global metadata fields to a buffer of a structure
+ * based on the "field ID -> structure member" mapping table.
+ */
+int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
+ int nr_fields, void *stbuf);
+
+/* Read a single global metadata field */
+int tdx_sys_metadata_field_read(u64 field_id, u64 *data);
+
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a19adc898df6..dc21310776ab 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -251,7 +251,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
return ret;
}

-static int read_sys_metadata_field(u64 field_id, u64 *data)
+int tdx_sys_metadata_field_read(u64 field_id, u64 *data)
{
struct tdx_module_args args = {};
int ret;
@@ -270,6 +270,7 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)

return 0;
}
+EXPORT_SYMBOL_GPL(tdx_sys_metadata_field_read);

/* Return the metadata field element size in bytes */
static int get_metadata_field_bytes(u64 field_id)
@@ -295,7 +296,7 @@ static int stbuf_read_sys_metadata_field(u64 field_id,
if (WARN_ON_ONCE(get_metadata_field_bytes(field_id) != bytes))
return -EINVAL;

- ret = read_sys_metadata_field(field_id, &tmp);
+ ret = tdx_sys_metadata_field_read(field_id, &tmp);
if (ret)
return ret;

@@ -304,19 +305,8 @@ static int stbuf_read_sys_metadata_field(u64 field_id,
return 0;
}

-struct field_mapping {
- u64 field_id;
- int offset;
- int size;
-};
-
-#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
- { .field_id = MD_FIELD_ID_##_field_id, \
- .offset = offsetof(_struct, _member), \
- .size = sizeof(typeof(((_struct *)0)->_member)) }
-
-static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
- void *stbuf)
+int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
+ int nr_fields, void *stbuf)
{
int i, ret;

@@ -331,6 +321,7 @@ static int read_sys_metadata(struct field_mapping *fields, int nr_fields,

return 0;
}
+EXPORT_SYMBOL_GPL(tdx_sys_metadata_read);

#define TD_SYSINFO_MAP_TDMR_INFO(_field_id, _member) \
TD_SYSINFO_MAP(_field_id, struct tdx_tdmr_sysinfo, _member)
@@ -338,7 +329,7 @@ static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
{
/* Map TD_SYSINFO fields into 'struct tdx_tdmr_sysinfo': */
- const struct field_mapping fields[] = {
+ const struct tdx_metadata_field_mapping fields[] = {
TD_SYSINFO_MAP_TDMR_INFO(MAX_TDMRS, max_tdmrs),
TD_SYSINFO_MAP_TDMR_INFO(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
TD_SYSINFO_MAP_TDMR_INFO(PAMT_4K_ENTRY_SIZE, pamt_entry_size[TDX_PS_4K]),
@@ -347,7 +338,7 @@ static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
};

/* Populate 'tdmr_sysinfo' fields using the mapping structure above: */
- return read_sys_metadata(fields, ARRAY_SIZE(fields), tdmr_sysinfo);
+ return tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdmr_sysinfo);
}

/* Calculate the actual TDMR size */
--
2.25.1
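The exported generic reader adds a per-field size so one loop can fill members
of any width. A userspace sketch of that size-aware copy follows, assuming a
little-endian host (as on x86, which the kernel code also relies on); the field
IDs and the fake metadata source are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct tdx_metadata_field_mapping {
	uint64_t field_id;
	int offset;
	int size;
};

#define TD_SYSINFO_MAP(_field_id, _struct, _member)		\
	{ .field_id = (_field_id),				\
	  .offset = offsetof(_struct, _member),			\
	  .size = (int)sizeof(((_struct *)0)->_member) }

struct demo_sysinfo {
	uint8_t  small;
	uint16_t medium;
	uint64_t large;
};

/* Stand-in for the SEAMCALL-backed single-field read. */
static int tdx_sys_metadata_field_read(uint64_t field_id, uint64_t *data)
{
	*data = 0x1122334455667788ULL + field_id;
	return 0;
}

/* Copy the low 'size' bytes of each 64-bit value into the structure
 * member named by the mapping table (valid on little-endian). */
static int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
				 int nr_fields, void *stbuf)
{
	int i;

	for (i = 0; i < nr_fields; i++) {
		uint64_t tmp;
		int ret = tdx_sys_metadata_field_read(fields[i].field_id, &tmp);

		if (ret)
			return ret;
		memcpy((char *)stbuf + fields[i].offset, &tmp, fields[i].size);
	}
	return 0;
}

static int fill_demo_sysinfo(struct demo_sysinfo *si)
{
	const struct tdx_metadata_field_mapping fields[] = {
		TD_SYSINFO_MAP(0, struct demo_sysinfo, small),
		TD_SYSINFO_MAP(1, struct demo_sysinfo, medium),
		TD_SYSINFO_MAP(2, struct demo_sysinfo, large),
	};

	return tdx_sys_metadata_read(fields, 3, si);
}
```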


2024-02-26 08:30:10

by Isaku Yamahata

Subject: [PATCH v19 006/130] x86/virt/tdx: Export TDX KeyID information

From: Kai Huang <[email protected]>

Each TDX guest must be protected by its own unique TDX KeyID. KVM will
need to tell the TDX module the unique KeyID for a TDX guest when KVM
creates it.

Export the TDX KeyID range that can be used by TDX guests for KVM to
use. KVM can then manage these KeyIDs and assign one for each TDX guest
when it is created.

Each TDX guest has a root control structure called "Trust Domain Root"
(TDR). Unlike the rest of the TDX guest, the TDR is protected by the
TDX global KeyID. When tearing down the TDR, KVM will need to pass the
TDX global KeyID explicitly to the TDX module to flush the cache
associated with the TDR.

Also export the TDX global KeyID for KVM to tear down the TDR.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/tdx.h | 5 +++++
arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++---
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 709b9483f9e4..16be3a1e4916 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -88,6 +88,11 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */

#ifdef CONFIG_INTEL_TDX_HOST
+
+extern u32 tdx_global_keyid;
+extern u32 tdx_guest_keyid_start;
+extern u32 tdx_nr_guest_keyids;
+
u64 __seamcall(u64 fn, struct tdx_module_args *args);
u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index dc21310776ab..d2b8f079a637 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -39,9 +39,14 @@
#include <asm/mce.h>
#include "tdx.h"

-static u32 tdx_global_keyid __ro_after_init;
-static u32 tdx_guest_keyid_start __ro_after_init;
-static u32 tdx_nr_guest_keyids __ro_after_init;
+u32 tdx_global_keyid __ro_after_init;
+EXPORT_SYMBOL_GPL(tdx_global_keyid);
+
+u32 tdx_guest_keyid_start __ro_after_init;
+EXPORT_SYMBOL_GPL(tdx_guest_keyid_start);
+
+u32 tdx_nr_guest_keyids __ro_after_init;
+EXPORT_SYMBOL_GPL(tdx_nr_guest_keyids);

static DEFINE_PER_CPU(bool, tdx_lp_initialized);

--
2.25.1
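A hypothetical sketch of how KVM might consume the exported KeyID range: hand
out one unique guest KeyID per TD from [tdx_guest_keyid_start,
tdx_guest_keyid_start + tdx_nr_guest_keyids). The range values are made up, and
the real KVM code would use a proper allocator (e.g. an IDA) under a lock; the
linear-scan bitmap below is only an illustration of the bookkeeping.

```c
#include <stdint.h>

static uint32_t tdx_guest_keyid_start = 64;	/* illustrative values */
static uint32_t tdx_nr_guest_keyids = 4;

static uint8_t keyid_used[4];

/* Assign a unique KeyID when a TDX guest is created. */
static int tdx_guest_keyid_alloc(void)
{
	uint32_t i;

	for (i = 0; i < tdx_nr_guest_keyids; i++) {
		if (!keyid_used[i]) {
			keyid_used[i] = 1;
			return (int)(tdx_guest_keyid_start + i);
		}
	}
	return -1;	/* no free KeyID: TD creation fails */
}

/* Return the KeyID when the guest is destroyed. */
static void tdx_guest_keyid_free(int keyid)
{
	keyid_used[keyid - (int)tdx_guest_keyid_start] = 0;
}
```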


2024-02-26 08:30:49

by Isaku Yamahata

Subject: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow

From: Isaku Yamahata <[email protected]>

Fix the following shift-count-overflow warning seen with 32-bit builds.

In file included from arch/x86/kernel/asm-offsets.c:22:
arch/x86/include/asm/tdx.h:92:87: warning: shift count >= width of type [-Wshift-count-overflow]
arch/x86/include/asm/tdx.h:20:21: note: expanded from macro 'TDX_ERROR'
#define TDX_ERROR _BITUL(63)
                  ^~~~~~~~~~

Also consistently use ULL for TDX_SEAMCALL_VMFAILINVALID.

Fixes: 527a534c7326 ("x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers")
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/tdx.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 16be3a1e4916..1e9dcdf9912b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -17,9 +17,9 @@
* Bits 47:40 == 0xFF indicate Reserved status code class that never used by
* TDX module.
*/
-#define TDX_ERROR _BITUL(63)
+#define TDX_ERROR _BITULL(63)
#define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
-#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
+#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _ULL(0xFFFF0000))

#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
--
2.25.1
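Why the fix matters: _BITUL() shifts an unsigned long, which is only 32 bits on
a 32-bit build, so _BITUL(63) overflows the shift count there, while _BITULL()
shifts an unsigned long long (always at least 64 bits). The sketch below uses
userspace stand-ins for the uapi helpers to show the resulting constants.

```c
#include <stdint.h>

/* Userspace stand-ins for include/uapi/linux/const.h helpers. */
#define _BITULL(x)	(1ULL << (x))
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* The fixed definitions, using 64-bit-safe macros throughout. */
#define TDX_ERROR			_BITULL(63)
#define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | 0xFFFF0000ULL)
```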


2024-02-26 08:31:13

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

From: Kai Huang <[email protected]>

KVM will need to make SEAMCALLs to create and run TDX guests. Export
SEAMCALL functions for KVM to use.

Also add declaration of SEAMCALL functions to <asm/asm-prototypes.h> to
support CONFIG_MODVERSIONS=y.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/virt/vmx/tdx/seamcall.S | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index b1a98fa38828..0ec572ad75f1 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -13,6 +13,7 @@
#include <asm/preempt.h>
#include <asm/asm.h>
#include <asm/gsseg.h>
+#include <asm/tdx.h>

#ifndef CONFIG_X86_CMPXCHG64
extern void cmpxchg8b_emu(void);
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index 5b1f2286aea9..e32cf82ed47e 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/linkage.h>
+#include <linux/export.h>
#include <asm/frame.h>

#include "tdxcall.S"
@@ -21,6 +22,7 @@
SYM_FUNC_START(__seamcall)
TDX_MODULE_CALL host=1
SYM_FUNC_END(__seamcall)
+EXPORT_SYMBOL_GPL(__seamcall);

/*
* __seamcall_ret() - Host-side interface functions to SEAM software
@@ -40,6 +42,7 @@ SYM_FUNC_END(__seamcall)
SYM_FUNC_START(__seamcall_ret)
TDX_MODULE_CALL host=1 ret=1
SYM_FUNC_END(__seamcall_ret)
+EXPORT_SYMBOL_GPL(__seamcall_ret);

/*
* __seamcall_saved_ret() - Host-side interface functions to SEAM software
@@ -59,3 +62,4 @@ SYM_FUNC_END(__seamcall_ret)
SYM_FUNC_START(__seamcall_saved_ret)
TDX_MODULE_CALL host=1 ret=1 saved=1
SYM_FUNC_END(__seamcall_saved_ret)
+EXPORT_SYMBOL_GPL(__seamcall_saved_ret);
--
2.25.1


2024-02-26 08:31:27

by Isaku Yamahata

Subject: [PATCH v19 010/130] KVM: x86: Pass is_private to gmem hook of gmem_max_level

From: Isaku Yamahata <[email protected]>

TDX needs to know whether the faulting address is shared or private so
that it can decide whether the max level is limited by the Secure-EPT.
Because fault->gfn doesn't include the shared bit, the gfn alone doesn't
tell whether the faulting address is shared. Pass is_private for the TDX
case.

The TDX logic will be: if (!is_private) return 0; otherwise limit the
level to PG_LEVEL_4K.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/mmu.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d15f5b4b1656..57ce89fc2740 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1797,7 +1797,8 @@ struct kvm_x86_ops {

gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);

- int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
+ int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
+ bool is_private, u8 *max_level);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e5e12d2707d..22db1a9f528a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4324,7 +4324,8 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,

max_level = kvm_max_level_for_order(max_order);
r = static_call(kvm_x86_gmem_max_level)(vcpu->kvm, fault->pfn,
- fault->gfn, &max_level);
+ fault->gfn, fault->is_private,
+ &max_level);
if (r) {
kvm_release_pfn_clean(fault->pfn);
return r;
--
2.25.1
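The TDX behavior the commit message describes can be sketched as below: a
shared fault leaves the caller's level untouched, while a private fault is
clamped to 4K because Secure-EPT pages are mapped at 4K granularity. This is a
hedged illustration of the future TDX hook, not the eventual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/* Sketch of a TDX gmem_max_level hook: returns 0 (success) either way,
 * and only writes *max_level for private faults. */
static int tdx_gmem_max_level(uint64_t pfn, uint64_t gfn, bool is_private,
			      uint8_t *max_level)
{
	(void)pfn; (void)gfn;

	if (!is_private)
		return 0;	/* shared: no Secure-EPT limit applies */
	*max_level = PG_LEVEL_4K;
	return 0;
}
```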


2024-02-26 08:31:47

by Isaku Yamahata

Subject: [PATCH v19 009/130] KVM: x86: Add gmem hook for determining max NPT mapping level

From: Michael Roth <[email protected]>

In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K

At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.

Add a hook to determine the max NPT mapping size in situations like
this.

Signed-off-by: Michael Roth <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 378ed944b849..156832f01ebe 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -138,6 +138,7 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
+KVM_X86_OP_OPTIONAL_RET0(gmem_max_level)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d271ba20a0b2..d15f5b4b1656 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1796,6 +1796,8 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);

gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
+
+ int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..1e5e12d2707d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4308,6 +4308,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
int max_order, r;
+ u8 max_level;

if (!kvm_slot_can_be_private(fault->slot)) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
@@ -4321,8 +4322,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
return r;
}

- fault->max_level = min(kvm_max_level_for_order(max_order),
- fault->max_level);
+ max_level = kvm_max_level_for_order(max_order);
+ r = static_call(kvm_x86_gmem_max_level)(vcpu->kvm, fault->pfn,
+ fault->gfn, &max_level);
+ if (r) {
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
+ }
+
+ fault->max_level = min(max_level, fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);

return RET_PF_CONTINUE;
--
2.25.1
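The hunk above composes three limits: the order of the gmem backing page, an
optional vendor-hook clamp, and the fault's pre-existing max_level. A compact
sketch of that composition follows; the order thresholds (9 = 2M, 18 = 1G worth
of 4K pages) match x86 huge-page geometry, and the helper names mirror, but are
not, the real KVM functions.

```c
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/* Largest mapping level the backing page's order allows. */
static int kvm_max_level_for_order(int max_order)
{
	if (max_order >= 18)
		return PG_LEVEL_1G;
	if (max_order >= 9)
		return PG_LEVEL_2M;
	return PG_LEVEL_4K;
}

/* hook_level: clamp produced by the gmem_max_level hook (0 = no opinion).
 * The fault's final level is the minimum of all constraints. */
static int clamp_private_fault_level(int fault_max_level, int max_order,
				     int hook_level)
{
	int max_level = kvm_max_level_for_order(max_order);

	if (hook_level && hook_level < max_level)
		max_level = hook_level;
	return max_level < fault_max_level ? max_level : fault_max_level;
}
```

This mirrors the SEV-SNP scenario in the commit message: even with no mixed
attributes, the hook can keep reporting 4K while the RMP entries remain split.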


2024-02-26 08:31:51

by Isaku Yamahata

Subject: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

From: Isaku Yamahata <[email protected]>

Add new members, only_private and only_shared, to struct kvm_gfn_range
to indicate which mapping (private vs. shared) to operate on. Update the
mmu notifier, the set memory attributes ioctl, and the KVM gmem callback
to initialize them.

It was premature for the set_memory_attributes ioctl to call
kvm_unmap_gfn_range(). Instead, let kvm_arch_set_memory_attributes()
handle it and add a new x86 vendor callback to react to memory attribute
changes. [1]

- If it's from the mmu notifier, zap shared pages only
- If it's from the KVM gmem, zap private pages only
- If setting memory attributes, the vendor callback checks the new
attributes and makes a decision:
SNP would do nothing and handle it later with the gmem callback.
The TDX callback would behave as follows:
When converting pages to shared, zap private pages only.
When converting pages to private, zap shared pages only.

TDX needs to know which mapping to operate on: Shared-EPT vs. Secure-EPT.
The following sequence to convert the GPA to private doesn't work for TDX
because the page can already be private.

1) Update memory attributes to private in memory attributes xarray
2) Zap the GPA range irrespective of private-or-shared.
Even if the page is already private, zap the entry.
3) EPT violation on the GPA
4) Populate the GPA as private
The page is zeroed, and the guest has to accept the page again.

In step 2, TDX wants to zap only shared pages and skip private ones.

[1] https://lore.kernel.org/all/[email protected]/

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v18:
- rebased to kvm-next

Changes v2 -> v3:
- Drop the KVM_GFN_RANGE flags
- Updated struct kvm_gfn_range
- Change kvm_arch_set_memory_attributes() to return bool for flush
- Added set_memory_attributes x86 op for vendor backends
- Refined commit message to describe TDX care concretely

Changes v1 -> v2:
- consolidate KVM_GFN_RANGE_FLAGS_GMEM_{PUNCH_HOLE, RELEASE} into
KVM_GFN_RANGE_FLAGS_GMEM.
- Update the commit message to describe TDX more. Drop SEV_SNP.

Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/guest_memfd.c | 3 +++
virt/kvm/kvm_main.c | 17 +++++++++++++++++
3 files changed, 22 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..0520cd8d03cc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -264,6 +264,8 @@ struct kvm_gfn_range {
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+ bool only_private;
+ bool only_shared;
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..3830d50b9b67 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -64,6 +64,9 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
+ /* guest memfd is relevant to only private mappings. */
+ .only_private = true,
+ .only_shared = false,
};

if (!found_memslot) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 10bfc88a69f7..0349e1f241d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -634,6 +634,12 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
*/
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
+ /*
+ * HVA-based notifications aren't relevant to private
+ * mappings as they don't have a userspace mapping.
+ */
+ gfn_range.only_private = false;
+ gfn_range.only_shared = true;

/*
* {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -2486,6 +2492,16 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;

+ /*
+ * If/when KVM supports more attributes beyond private vs. shared, this
+ * _could_ set only_{private,shared} appropriately if the entire target
+ * range already has the desired private vs. shared state (it's unclear
+ * if that is a net win). For now, KVM reaches this point if and only
+ * if the private flag is being toggled, i.e. all mappings are in play.
+ */
+ gfn_range.only_private = false;
+ gfn_range.only_shared = false;
+
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);

@@ -2542,6 +2558,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
.end = end,
+ .arg.attributes = attributes,
.handler = kvm_pre_set_memory_attributes,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
--
2.25.1


2024-02-26 08:32:22

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 012/130] KVM: x86/mmu: Pass around full 64-bit error code for the KVM page fault

From: Isaku Yamahata <[email protected]>

Because the full 64-bit error code, i.e. more information about the fault,
will be needed for protected VMs (TDX and SEV-SNP), update
kvm_mmu_do_page_fault() to accept the 64-bit value so it can pass it to the
callbacks.

The upper 32 bits of the error code are currently discarded at
kvm_mmu_page_fault() by lower_32_bits(). Now the full 64 bits are passed
down. Above bit 31, two hardware-defined bits, PFERR_GUEST_FINAL_MASK and
PFERR_GUEST_PAGE_MASK, and one software-defined bit, PFERR_IMPLICIT_ACCESS,
are currently defined.

PFERR_IMPLICIT_ACCESS:
commit 4f4aa80e3b88 ("KVM: X86: Handle implicit supervisor access with SMAP")
introduced the software-defined bit PFERR_IMPLICIT_ACCESS at bit 48 to
indicate implicit access for SMAP with the instruction emulator.
Concretely, emulator_read_std() and emulator_write_std() set the bit, and
permission_fault() checks the bit as SMAP implicit access. The vendor page
fault handler shouldn't pass the bit to kvm_mmu_page_fault().

PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK:
commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
introduced them to optimize nested page fault handling. No other code
paths use these bits, so the two bits can be safely passed down without
functional change.

fault->error_code is accessed as follows:
- FNAME(page_fault): PFERR_IMPLICIT_ACCESS shouldn't be passed down.
  PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK aren't used.
- kvm_mmu_page_fault(): explicitly masks with PFERR_RSVD_MASK, and
  PFERR_NESTED_GUEST_PAGE is used outside of the masked upper 32 bits.
- mmutrace: change u32 -> u64

No functional change is intended. This is preparation for passing more
information in the page fault error code.

Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v3 -> v4:
- Dropped debug print part as it was deleted in the kvm-x86-next

Changes v2 -> v3:
- Make depends on a patch to clear PFERR_IMPLICIT_ACCESS
- drop clearing the upper 32 bit, instead just pass whole 64 bits
- update commit message to mention about PFERR_IMPLICIT_ACCESS and
PFERR_NESTED_GUEST_PAGE

Changes v1 -> v2:
- no change

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 +--
arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22db1a9f528a..ccdbff3d85ec 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5822,8 +5822,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}

if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false,
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
&emulation_type);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..21f55e8b4dc6 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
struct kvm_page_fault {
/* arguments to kvm_mmu_do_page_fault. */
const gpa_t addr;
- const u32 error_code;
+ const u64 error_code;
const bool prefetch;

/* Derived from error_code. */
@@ -280,7 +280,7 @@ enum {
};

static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..195d98bc8de8 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -260,7 +260,7 @@ TRACE_EVENT(
TP_STRUCT__entry(
__field(int, vcpu_id)
__field(gpa_t, cr2_or_gpa)
- __field(u32, error_code)
+ __field(u64, error_code)
__field(u64 *, sptep)
__field(u64, old_spte)
__field(u64, new_spte)
--
2.25.1


2024-02-26 08:32:31

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 013/130] KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is private

From: Isaku Yamahata <[email protected]>

SEV-SNP defines PFERR_GUEST_ENC_MASK (bit 34) in the page-fault error bits
to indicate that the guest accessed the page as encrypted. Use the bit to
designate that the page fault is private and that it requires looking up
memory attributes. The vendor KVM page fault handler should set the
PFERR_GUEST_ENC_MASK bit based on its fault information; it may use the
hardware value directly or derive the bit from it.

For KVM_X86_SW_PROTECTED_VM, consult the memory attributes to determine
whether the fault is private. For async page faults, carry the bit along
and feed it back to the KVM page fault handler.

Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v4 -> v5:
- Eliminate kvm_is_fault_private() by open code the function
- Make async page fault handler to carry is_private bit

Changes v3 -> v4:
- rename back struct kvm_page_fault::private => is_private
- catch up rename: KVM_X86_PROTECTED_VM => KVM_X86_SW_PROTECTED_VM

Changes v2 -> v3:
- Revive PFERR_GUEST_ENC_MASK
- rename struct kvm_page_fault::is_private => private
- Add check KVM_X86_PROTECTED_VM

Changes v1 -> v2:
- Introduced fault type and replaced is_private with fault_type.
- Add kvm_get_fault_type() to encapsulate the difference.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 24 +++++++++++++++++-------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 57ce89fc2740..28314e7d546c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -264,6 +264,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_BIT 15
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
#define PFERR_IMPLICIT_ACCESS_BIT 48

#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
@@ -275,6 +276,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)

#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
@@ -1836,6 +1838,7 @@ struct kvm_arch_async_pf {
gfn_t gfn;
unsigned long cr3;
bool direct_map;
+ u64 error_code;
};

extern u32 __read_mostly kvm_nr_uret_msrs;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ccdbff3d85ec..61674d6b17aa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4246,18 +4246,19 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
}

-static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- gfn_t gfn)
+static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
struct kvm_arch_async_pf arch;

arch.token = alloc_apf_token(vcpu);
- arch.gfn = gfn;
+ arch.gfn = fault->gfn;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
+ arch.error_code = fault->error_code & PFERR_GUEST_ENC_MASK;

- return kvm_setup_async_pf(vcpu, cr2_or_gpa,
- kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
+ return kvm_setup_async_pf(vcpu, fault->addr,
+ kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
}

void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
@@ -4276,7 +4277,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;

- kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
+ kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
+ true, NULL);
}

static inline u8 kvm_max_level_for_order(int order)
@@ -4390,7 +4392,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
- } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+ } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
return RET_PF_RETRY;
}
}
@@ -5814,6 +5816,14 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;

+ /*
+ * This is racy with updating memory attributes with mmu_seq. If we
+ * hit a race, it would result in retrying page fault.
+ */
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+ error_code |= PFERR_GUEST_ENC_MASK;
+
r = RET_PF_INVALID;
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 21f55e8b4dc6..0443bfcf5d9c 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -292,13 +292,13 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.user = err & PFERR_USER_MASK,
.prefetch = prefetch,
.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .is_private = err & PFERR_GUEST_ENC_MASK,
.nx_huge_page_workaround_enabled =
is_nx_huge_page_enabled(vcpu->kvm),

.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
};
int r;

--
2.25.1


2024-02-26 08:32:41

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 014/130] KVM: Add KVM vcpu ioctl to pre-populate guest memory

From: Isaku Yamahata <[email protected]>

Add a new ioctl, KVM_MEMORY_MAPPING, to the KVM common code. It iterates
over the memory range and calls an arch-specific function. Add stub
functions as weak symbols.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- newly added
---
include/linux/kvm_host.h | 4 +++
include/uapi/linux/kvm.h | 10 ++++++
virt/kvm/kvm_main.c | 67 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 81 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0520cd8d03cc..eeaf4e73317c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2389,4 +2389,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */

+void kvm_arch_vcpu_pre_memory_mapping(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c3308536482b..5e2b28934aa9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1155,6 +1155,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_MEMORY_MAPPING 236

#ifdef KVM_CAP_IRQ_ROUTING

@@ -2227,4 +2228,13 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MEMORY_MAPPING _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0349e1f241d1..2f0a8e28795e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4409,6 +4409,62 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}

+__weak void kvm_arch_vcpu_pre_memory_mapping(struct kvm_vcpu *vcpu)
+{
+}
+
+__weak int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ return -EOPNOTSUPP;
+}
+
+static int kvm_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ bool added = false;
+ int idx, r = 0;
+
+ /* flags isn't used yet. */
+ if (mapping->flags)
+ return -EINVAL;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
+ !mapping->nr_pages ||
+ mapping->nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) ||
+ mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ kvm_arch_vcpu_pre_memory_mapping(vcpu);
+
+ while (mapping->nr_pages) {
+ if (signal_pending(current)) {
+ r = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ r = kvm_arch_vcpu_memory_mapping(vcpu, mapping);
+ if (r)
+ break;
+
+ added = true;
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ if (added && mapping->nr_pages > 0)
+ r = -EAGAIN;
+
+ return r;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4610,6 +4666,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
break;
}
+ case KVM_MEMORY_MAPPING: {
+ struct kvm_memory_mapping mapping;
+
+ r = -EFAULT;
+ if (copy_from_user(&mapping, argp, sizeof(mapping)))
+ break;
+ r = kvm_vcpu_memory_mapping(vcpu, &mapping);
+ if (copy_to_user(argp, &mapping, sizeof(mapping)))
+ r = -EFAULT;
+ break;
+ }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}
--
2.25.1


2024-02-26 08:33:07

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 015/130] KVM: Document KVM_MEMORY_MAPPING ioctl

From: Isaku Yamahata <[email protected]>

Add documentation for the KVM_MEMORY_MAPPING ioctl.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- newly added
---
Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3ec0b7a455a0..667dc58f7d2f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6323,6 +6323,34 @@ a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.

+4.143 KVM_MEMORY_MAPPING
+------------------------
+
+:Capability: KVM_CAP_MEMORY_MAPPING
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_memory_mapping(in/out)
+:Returns: 0 on success, <0 on error
+
+KVM_MEMORY_MAPPING populates guest memory without running the vcpu.
+
+::
+
+ struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+ };
+
+KVM_MEMORY_MAPPING populates guest memory in the underlying mapping. If source
+is not zero and the underlying technology supports it, the guest memory content
+is populated from the source. If nr_pages is large, the ioctl may return
+-EAGAIN, in which case base_gfn and nr_pages (and source, if not zero) are
+updated to point to the remaining range.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
5. The kvm_run structure
========================

--
2.25.1


2024-02-26 08:33:26

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 016/130] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX

From: Sean Christopherson <[email protected]>

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path. This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Move up for KVM_MEMORY_MAPPING.
- Add goal_level for the caller to know how many pages are mapped.

v14 -> v15:
- Remove loop in kvm_mmu_map_tdp_page() and return error code based on
RET_FP_xxx value to avoid potential infinite loop. The caller should
loop on -EAGAIN instead now.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 58 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..d96c93a25b3b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
}

+int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 max_level, u8 *goal_level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61674d6b17aa..ca0c91f14063 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4615,6 +4615,64 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}

+int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 max_level, u8 *goal_level)
+{
+ int r;
+ struct kvm_page_fault fault = (struct kvm_page_fault) {
+ .addr = gpa,
+ .error_code = error_code,
+ .exec = error_code & PFERR_FETCH_MASK,
+ .write = error_code & PFERR_WRITE_MASK,
+ .present = error_code & PFERR_PRESENT_MASK,
+ .rsvd = error_code & PFERR_RSVD_MASK,
+ .user = error_code & PFERR_USER_MASK,
+ .prefetch = false,
+ .is_tdp = true,
+ .is_private = error_code & PFERR_GUEST_ENC_MASK,
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+ };
+
+ WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
+ fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
+
+ r = mmu_topup_memory_caches(vcpu, false);
+ if (r)
+ return r;
+
+ fault.max_level = max_level;
+ fault.req_level = PG_LEVEL_4K;
+ fault.goal_level = PG_LEVEL_4K;
+
+#ifdef CONFIG_X86_64
+ if (tdp_mmu_enabled)
+ r = kvm_tdp_mmu_page_fault(vcpu, &fault);
+ else
+#endif
+ r = direct_page_fault(vcpu, &fault);
+
+ if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
+ return -EFAULT;
+
+ switch (r) {
+ case RET_PF_RETRY:
+ return -EAGAIN;
+
+ case RET_PF_FIXED:
+ case RET_PF_SPURIOUS:
+ if (goal_level)
+ *goal_level = fault.goal_level;
+ return 0;
+
+ case RET_PF_CONTINUE:
+ case RET_PF_EMULATE:
+ case RET_PF_INVALID:
+ default:
+ return -EIO;
+ }
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.25.1


2024-02-26 08:33:39

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 017/130] KVM: x86: Implement kvm_arch_{, pre_}vcpu_memory_mapping()

From: Isaku Yamahata <[email protected]>

Wire KVM_MEMORY_MAPPING ioctl to kvm_mmu_map_tdp_page() to populate
guest memory.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- newly added
---
arch/x86/kvm/x86.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48a61d283406..03dab4266172 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4663,6 +4663,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_MEMORY_FAULT_INFO:
+ case KVM_CAP_MEMORY_MAPPING:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
@@ -5801,6 +5802,31 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}

+void kvm_arch_vcpu_pre_memory_mapping(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_reload(vcpu);
+}
+
+int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ u8 max_level = KVM_MAX_HUGEPAGE_LEVEL;
+ u64 error_code = PFERR_WRITE_MASK;
+ u8 goal_level = PG_LEVEL_4K;
+ int r;
+
+ r = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
+ max_level, &goal_level);
+ if (r)
+ return r;
+
+ if (mapping->source)
+ mapping->source += KVM_HPAGE_SIZE(goal_level);
+ mapping->base_gfn += KVM_PAGES_PER_HPAGE(goal_level);
+ mapping->nr_pages -= KVM_PAGES_PER_HPAGE(goal_level);
+ return r;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
--
2.25.1


2024-02-26 08:34:13

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 018/130] KVM: x86/mmu: Assume guest MMIOs are shared

From: Chao Gao <[email protected]>

TODO: Drop this patch once the common patch is merged.

When no memory slot is found for a KVM page fault, handle it as MMIO.

Guests of TDX_VM, SNP_VM, or SW_PROTECTED_VM don't necessarily convert a
virtual MMIO range to shared before accessing it. When the guest tries to
access a virtual device's MMIO without any private/shared conversion, an
NPT fault or EPT violation is raised first due to the private-shared
mismatch. Don't raise KVM_EXIT_MEMORY_FAULT; fall back to KVM_PFN_NOSLOT.

Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Added slot check before kvm_faultin_private_pfn()
- updated comment
- rewrite the commit message
---
arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ca0c91f14063..c45252ed2ffd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4342,6 +4342,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
+ bool force_mmio;
bool async;

/*
@@ -4371,12 +4372,21 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}

- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ /*
+ * !fault->slot means MMIO for SNP and TDX. Don't require explicit GPA
+ * conversion for MMIO because MMIO is assigned at the boot time. Fall
+ * to !is_private case to get pfn = KVM_PFN_NOSLOT.
+ */
+ force_mmio = !slot &&
+ vcpu->kvm->arch.vm_type != KVM_X86_DEFAULT_VM &&
+ vcpu->kvm->arch.vm_type != KVM_X86_SW_PROTECTED_VM;
+ if (!force_mmio &&
+ fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return -EFAULT;
}

- if (fault->is_private)
+ if (!force_mmio && fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);

async = false;
--
2.25.1


2024-02-26 08:34:49

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 021/130] KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_init()

From: Isaku Yamahata <[email protected]>

vmx_hardware_disable() accesses loaded_vmcss_on_cpu via
hardware_disable_all(). To allow hardware_enable/disable_all() before
kvm_init(), initialize loaded_vmcss_on_cpu before kvm_x86_vendor_init() in
vmx_init() so that the TDX module initialization (the hardware_setup
method) can reference the variable.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>

---
v19:
- Fix the subject to match the patch by Yuan

v18:
- Move the vmcss_on_cpu initialization from vmx_hardware_setup() to
early point of vmx_init() by Binbin

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 434f5aaef030..8af0668e4dca 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8539,6 +8539,10 @@ static int __init vmx_init(void)
*/
hv_init_evmcs();

+ /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
r = kvm_x86_vendor_init(&vt_init_ops);
if (r)
return r;
@@ -8554,11 +8558,8 @@ static int __init vmx_init(void)
if (r)
goto err_l1d_flush;

- for_each_possible_cpu(cpu) {
- INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
-
+ for_each_possible_cpu(cpu)
pi_init_cpu(cpu);
- }

cpu_emergency_register_virt_callback(vmx_emergency_disable);

--
2.25.1


2024-02-26 08:36:05

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

From: Isaku Yamahata <[email protected]>

TDX requires several initialization steps for KVM to create guest TDs.
Detect the CPU feature, enable VMX (TDX is based on VMX) on all online
CPUs, detect TDX module availability, initialize the module, and disable
VMX.

To enable/disable VMX on all online CPUs, use
vmx_hardware_enable/disable(). That method also initializes each CPU for
TDX: TDX requires calling a TDX initialization function per logical
processor (LP) before the LP uses TDX. When a CPU is brought online, call
the TDX LP initialization API. If it fails to initialize TDX, refuse to
online the CPU for simplicity, instead of having TDX avoid the failed LP.

There are two options for when to initialize the TDX module: A) at kernel
module load time, or B) at first guest TD creation. Option A was chosen.
With B, a user may hit a TDX initialization error when trying to create
the first guest TD, and a machine that fails to initialize the TDX module
can't boot any guest TD at all. Such a failure is an undesirable surprise,
because the user expects the machine to be able to accommodate guest TDs.
So A is better than B.

Introduce a module parameter, kvm_intel.tdx, to explicitly enable TDX KVM
support. It's off by default to keep the same behavior for those who don't
use TDX. Implement the hardware_setup method to detect the TDX CPU feature
and initialize the TDX module.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- fixed vt_hardware_enable() to use vmx_hardware_enable()
- renamed vmx_tdx_enabled => tdx_enabled
- renamed vmx_tdx_on() => tdx_on()

v18:
- Added comment in vt_hardware_enable() by Binbin.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 1 +
arch/x86/kvm/vmx/main.c | 19 ++++++++-
arch/x86/kvm/vmx/tdx.c | 84 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 6 +++
4 files changed, 109 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/vmx/tdx.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 274df24b647f..5b85ef84b2e9 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -24,6 +24,7 @@ kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \

kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o

kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
svm/sev.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 18cecf12c7c8..18aef6e23aab 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,22 @@
#include "nested.h"
#include "pmu.h"

+static bool enable_tdx __ro_after_init;
+module_param_named(tdx, enable_tdx, bool, 0444);
+
+static __init int vt_hardware_setup(void)
+{
+ int ret;
+
+ ret = vmx_hardware_setup();
+ if (ret)
+ return ret;
+
+ enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
+
+ return 0;
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.hardware_unsetup = vmx_hardware_unsetup,

+ /* TDX cpu enablement is done by tdx_hardware_setup(). */
.hardware_enable = vmx_hardware_enable,
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,
@@ -161,7 +178,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
- .hardware_setup = vmx_hardware_setup,
+ .hardware_setup = vt_hardware_setup,
.handle_intel_pt_intr = NULL,

.runtime_ops = &vt_x86_ops,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
new file mode 100644
index 000000000000..43c504fb4fed
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/tdx.h>
+
+#include "capabilities.h"
+#include "x86_ops.h"
+#include "x86.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+static int __init tdx_module_setup(void)
+{
+ int ret;
+
+ ret = tdx_enable();
+ if (ret) {
+ pr_info("Failed to initialize TDX module.\n");
+ return ret;
+ }
+
+ return 0;
+}
+
+struct tdx_enabled {
+ cpumask_var_t enabled;
+ atomic_t err;
+};
+
+static void __init tdx_on(void *_enable)
+{
+ struct tdx_enabled *enable = _enable;
+ int r;
+
+ r = vmx_hardware_enable();
+ if (!r) {
+ cpumask_set_cpu(smp_processor_id(), enable->enabled);
+ r = tdx_cpu_enable();
+ }
+ if (r)
+ atomic_set(&enable->err, r);
+}
+
+static void __init vmx_off(void *_enabled)
+{
+ cpumask_var_t *enabled = (cpumask_var_t *)_enabled;
+
+ if (cpumask_test_cpu(smp_processor_id(), *enabled))
+ vmx_hardware_disable();
+}
+
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+ struct tdx_enabled enable = {
+ .err = ATOMIC_INIT(0),
+ };
+ int r = 0;
+
+ if (!enable_ept) {
+ pr_warn("Cannot enable TDX with EPT disabled\n");
+ return -EINVAL;
+ }
+
+ if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
+ r = -ENOMEM;
+ goto out;
+ }
+
+ /* tdx_enable() in tdx_module_setup() requires cpus lock. */
+ cpus_read_lock();
+ on_each_cpu(tdx_on, &enable, true); /* TDX requires vmxon. */
+ r = atomic_read(&enable.err);
+ if (!r)
+ r = tdx_module_setup();
+ else
+ r = -EIO;
+ on_each_cpu(vmx_off, &enable.enabled, true);
+ cpus_read_unlock();
+ free_cpumask_var(enable.enabled);
+
+out:
+ return r;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b936388853ab..346289a2a01c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -135,4 +135,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
#endif
void vmx_setup_mce(struct kvm_vcpu *vcpu);

+#ifdef CONFIG_INTEL_TDX_HOST
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+#else
+static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
+#endif
+
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 08:36:16

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 025/130] KVM: TDX: Make TDX VM type supported

From: Isaku Yamahata <[email protected]>

NOTE: This patch is placed at this point in the series so that developers
can test the code partway through, even though the series doesn't provide
functional features until all of its patches are applied. When merging
this patch series, this patch can be moved to the end.

As the first step of TDX VM support, report the TDX VM type as supported
to the device model, e.g. qemu. The callback to create a guest TD is the
vm_init callback for KVM_CREATE_VM.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 18 ++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 6 ------
arch/x86/kvm/vmx/x86_ops.h | 3 ++-
4 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index e11edbd19e7c..fa19682b366c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -10,6 +10,12 @@
static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);

+static bool vt_is_vm_type_supported(unsigned long type)
+{
+ return __kvm_is_vm_type_supported(type) ||
+ (enable_tdx && tdx_is_vm_type_supported(type));
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -26,6 +32,14 @@ static __init int vt_hardware_setup(void)
return 0;
}

+static int vt_vm_init(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
+
+ return vmx_vm_init(kvm);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -47,9 +61,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

- .is_vm_type_supported = vmx_is_vm_type_supported,
+ .is_vm_type_supported = vt_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
+ .vm_init = vt_vm_init,
.vm_destroy = vmx_vm_destroy,

.vcpu_precreate = vmx_vcpu_precreate,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 14ef0ccd8f1a..a7e096fd8361 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -24,6 +24,12 @@ static int __init tdx_module_setup(void)
return 0;
}

+bool tdx_is_vm_type_supported(unsigned long type)
+{
+ /* enable_tdx check is done by the caller. */
+ return type == KVM_X86_TDX_VM;
+}
+
struct tdx_enabled {
cpumask_var_t enabled;
atomic_t err;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2fb1cd2e28a2..d928acc15d0f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7531,12 +7531,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

-bool vmx_is_vm_type_supported(unsigned long type)
-{
- /* TODO: Check if TDX is supported. */
- return __kvm_is_vm_type_supported(type);
-}
-
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 346289a2a01c..f4da88a228d0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -28,7 +28,6 @@ void vmx_hardware_unsetup(void);
int vmx_check_processor_compat(void);
int vmx_hardware_enable(void);
void vmx_hardware_disable(void);
-bool vmx_is_vm_type_supported(unsigned long type);
int vmx_vm_init(struct kvm *kvm);
void vmx_vm_destroy(struct kvm *kvm);
int vmx_vcpu_precreate(struct kvm *kvm);
@@ -137,8 +136,10 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);

#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+bool tdx_is_vm_type_supported(unsigned long type);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
+static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 08:36:27

by Isaku Yamahata

Subject: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure

From: Isaku Yamahata <[email protected]>

Add placeholder TDX VM/vcpu structures that overlay the VMX VM/vcpu
structures. Initialize the VM structure size and the vcpu size/alignment so
that the x86 KVM common code knows those sizes irrespective of VMX or TDX.
The structures will be populated as the guest creation logic develops.

Add helper functions to check whether a VM is a guest TD, and add conversion
functions between KVM VM/vcpu and TDX VM/vcpu.

Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- correctly update ops.vm_size, vcpu_size and, vcpu_align by Xiaoyao

v14 -> v15:
- use KVM_X86_TDX_VM

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 14 ++++++++++++
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/vmx/tdx.h | 50 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 65 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx.h

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 18aef6e23aab..e11edbd19e7c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -5,6 +5,7 @@
#include "vmx.h"
#include "nested.h"
#include "pmu.h"
+#include "tdx.h"

static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);
@@ -18,6 +19,9 @@ static __init int vt_hardware_setup(void)
return ret;

enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
+ if (enable_tdx)
+ vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
+ sizeof(struct kvm_tdx));

return 0;
}
@@ -215,8 +219,18 @@ static int __init vt_init(void)
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
*/
+ /*
+ * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
+ * be set before kvm_x86_vendor_init().
+ */
vcpu_size = sizeof(struct vcpu_vmx);
vcpu_align = __alignof__(struct vcpu_vmx);
+ if (enable_tdx) {
+ vcpu_size = max_t(unsigned int, vcpu_size,
+ sizeof(struct vcpu_tdx));
+ vcpu_align = max_t(unsigned int, vcpu_align,
+ __alignof__(struct vcpu_tdx));
+ }
r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
if (r)
goto err_kvm_init;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 43c504fb4fed..14ef0ccd8f1a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
#include "capabilities.h"
#include "x86_ops.h"
#include "x86.h"
+#include "tdx.h"

#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
new file mode 100644
index 000000000000..473013265bd8
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_H
+#define __KVM_X86_TDX_H
+
+#ifdef CONFIG_INTEL_TDX_HOST
+struct kvm_tdx {
+ struct kvm kvm;
+ /* TDX specific members follow. */
+};
+
+struct vcpu_tdx {
+ struct kvm_vcpu vcpu;
+ /* TDX specific members follow. */
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+ return is_td(vcpu->kvm);
+}
+
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+ return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+ return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+#else
+struct kvm_tdx {
+ struct kvm kvm;
+};
+
+struct vcpu_tdx {
+ struct kvm_vcpu vcpu;
+};
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_H */
--
2.25.1


2024-02-26 08:37:05

by Isaku Yamahata

Subject: [PATCH v19 026/130] [MARKER] The start of TDX KVM patch series: TDX architectural definitions

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch sub-series for TDX
architectural definitions.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/index.rst | 2 ++
.../virt/kvm/intel-tdx-layer-status.rst | 29 +++++++++++++++++++
2 files changed, 31 insertions(+)
create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst

diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index ad13ec55ddfe..ccff56dca2b1 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -19,3 +19,5 @@ KVM
vcpu-requests
halt-polling
review-checklist
+
+ intel-tdx-layer-status
diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
new file mode 100644
index 000000000000..f11ea701dc19
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Layer status
+============
+What qemu can do
+----------------
+- TDX VM TYPE is exposed to Qemu.
+- Qemu can try to create VM of TDX VM type and then fails.
+
+Patch Layer status
+------------------
+ Patch layer Status
+
+* TDX, VMX coexistence: Applied
+* TDX architectural definitions: Applying
+* TD VM creation/destruction: Not yet
+* TD vcpu creation/destruction: Not yet
+* TDX EPT violation: Not yet
+* TD finalization: Not yet
+* TD vcpu enter/exit: Not yet
+* TD vcpu interrupts/exit/hypercall: Not yet
+
+* KVM MMU GPA shared bits: Not yet
+* KVM TDP refactoring for TDX: Not yet
+* KVM TDP MMU hooks: Not yet
--
2.25.1


2024-02-26 08:37:07

by Isaku Yamahata

Subject: [PATCH v19 028/130] KVM: TDX: Add TDX "architectural" error codes

From: Sean Christopherson <[email protected]>

Add error codes for the TDX SEAMCALLs, both on the TDX VMM side for TDH
SEAMCALLs and on the TDX guest side for TDG.VP.VMCALL. KVM issues the TDX
SEAMCALLs and checks their error codes. KVM also handles hypercalls from the
TDX guest and may return an error, so error codes for the TDX guest are
needed as well.

TDX SEAMCALLs use bits 31:0 to return additional information, so these error
codes will only exactly match RAX[63:32]. Error codes for TDG.VP.VMCALL are
defined by the TDX Guest-Host-Communication Interface spec.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
---
v19:
- Drop TDX_EPT_WALK_FAILED, TDX_EPT_ENTRY_NOT_FREE
- Rename TDG_VP_VMCALL_ => TDVMCALL_ to match the existing code
- Move TDVMCALL error codes to shared/tdx.h
- Added TDX_OPERAND_ID_TDR
---
arch/x86/include/asm/shared/tdx.h | 8 +++++++-
arch/x86/kvm/vmx/tdx_errno.h | 34 +++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index fdfd41511b02..28c4a62b7dba 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -26,7 +26,13 @@
#define TDVMCALL_GET_QUOTE 0x10002
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003

-#define TDVMCALL_STATUS_RETRY 1
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDVMCALL_SUCCESS 0x0000000000000000ULL
+#define TDVMCALL_RETRY 0x0000000000000001ULL
+#define TDVMCALL_INVALID_OPERAND 0x8000000000000000ULL
+#define TDVMCALL_TDREPORT_FAILED 0x8000000000000001ULL

/*
* Bitmasks of exposed registers (with VMM).
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..5366bf476d2c
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural status code for SEAMCALL */
+
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+#define TDX_SEAMCALL_STATUS_MASK 0xFFFFFFFF00000000ULL
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_NON_RECOVERABLE_VCPU 0x4000000100000000ULL
+#define TDX_INTERRUPTED_RESUMABLE 0x8000000300000000ULL
+#define TDX_OPERAND_INVALID 0xC000010000000000ULL
+#define TDX_OPERAND_BUSY 0x8000020000000000ULL
+#define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL
+#define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL
+#define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL
+#define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL
+#define TDX_KEY_CONFIGURED 0x0000081500000000ULL
+#define TDX_NO_HKID_READY_TO_WBCACHE 0x0000082100000000ULL
+#define TDX_FLUSHVP_NOT_DONE 0x8000082400000000ULL
+#define TDX_EPT_ENTRY_STATE_INCORRECT 0xC0000B0D00000000ULL
+
+/*
+ * TDX module operand ID, appears in 31:0 part of error code as
+ * detail information
+ */
+#define TDX_OPERAND_ID_RCX 0x01
+#define TDX_OPERAND_ID_TDR 0x80
+#define TDX_OPERAND_ID_SEPT 0x92
+#define TDX_OPERAND_ID_TD_EPOCH 0xa9
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
--
2.25.1


2024-02-26 08:37:08

by Isaku Yamahata

Subject: [PATCH v19 020/130] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX

From: Sean Christopherson <[email protected]>

KVM accesses the Virtual Machine Control Structure (VMCS) with VMX
instructions to operate on a VM. TDX doesn't allow the VMM to operate on the
VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL
APIs for the VMM to operate on those data structures indirectly. This means
we must have a TDX version of kvm_x86_ops.

The existing global struct kvm_x86_ops already defines an interface that
fits TDX. But kvm_x86_ops is a system-wide structure, not a per-VM one. To
allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
"if (tdx) tdx_op() else vmx_op()" to switch between VMX and TDX at run time.

To separate the runtime switch, the VMX implementation, and the TDX
implementation, add main.c and move out the vmx_x86_ops hooks in
preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
both VMs and TDs. Use 'vt' for the naming scheme as a nod to VT-x and as a
concatenation of VmxTdx.

The current code looks as follows.
In vmx.c
static vmx_op() { ... }
static struct kvm_x86_ops vmx_x86_ops = {
.op = vmx_op,
initialization code

The eventually converted code will look like
In vmx.c, keep the VMX operations.
vmx_op() { ... }
VMX initialization
In tdx.c, define the TDX operations.
tdx_op() { ... }
TDX initialization
In x86_ops.h, declare the VMX and TDX operations.
vmx_op();
tdx_op();
In main.c, define common wrappers for VMX and TDX.
static vt_ops() { if (tdx) tdx_ops() else vmx_ops() }
static struct kvm_x86_ops vt_x86_ops = {
.op = vt_op,
initialization to call VMX and TDX initialization

Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>

---
v19:
- Move down the declaration of vmx_hardware_setup() in x86_ops.h. Xiaoyao

v18:
- Add Reviewed-by: Binbin Wu
- fix indent alignments pointed by Binbin Wu

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/main.c | 169 +++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 378 ++++++++++---------------------------
arch/x86/kvm/vmx/x86_ops.h | 124 ++++++++++++
4 files changed, 397 insertions(+), 276 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/x86_ops.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 475b5fa917a6..274df24b647f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -20,7 +20,7 @@ kvm-$(CONFIG_KVM_XEN) += xen.o
kvm-$(CONFIG_KVM_SMM) += smm.o

kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
- vmx/nested.o vmx/posted_intr.o
+ vmx/nested.o vmx/posted_intr.o vmx/main.o

kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..eeb7a43b271d
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+#define VMX_REQUIRED_APICV_INHIBITS \
+ (BIT(APICV_INHIBIT_REASON_DISABLE)| \
+ BIT(APICV_INHIBIT_REASON_ABSENT) | \
+ BIT(APICV_INHIBIT_REASON_HYPERV) | \
+ BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
+ BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
+ BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+ .name = KBUILD_MODNAME,
+
+ .check_processor_compatibility = vmx_check_processor_compat,
+
+ .hardware_unsetup = vmx_hardware_unsetup,
+
+ .hardware_enable = vmx_hardware_enable,
+ .hardware_disable = vmx_hardware_disable,
+ .has_emulated_msr = vmx_has_emulated_msr,
+
+ .is_vm_type_supported = vmx_is_vm_type_supported,
+ .vm_size = sizeof(struct kvm_vmx),
+ .vm_init = vmx_vm_init,
+ .vm_destroy = vmx_vm_destroy,
+
+ .vcpu_precreate = vmx_vcpu_precreate,
+ .vcpu_create = vmx_vcpu_create,
+ .vcpu_free = vmx_vcpu_free,
+ .vcpu_reset = vmx_vcpu_reset,
+
+ .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .vcpu_load = vmx_vcpu_load,
+ .vcpu_put = vmx_vcpu_put,
+
+ .update_exception_bitmap = vmx_update_exception_bitmap,
+ .get_msr_feature = vmx_get_msr_feature,
+ .get_msr = vmx_get_msr,
+ .set_msr = vmx_set_msr,
+ .get_segment_base = vmx_get_segment_base,
+ .get_segment = vmx_get_segment,
+ .set_segment = vmx_set_segment,
+ .get_cpl = vmx_get_cpl,
+ .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+ .is_valid_cr0 = vmx_is_valid_cr0,
+ .set_cr0 = vmx_set_cr0,
+ .is_valid_cr4 = vmx_is_valid_cr4,
+ .set_cr4 = vmx_set_cr4,
+ .set_efer = vmx_set_efer,
+ .get_idt = vmx_get_idt,
+ .set_idt = vmx_set_idt,
+ .get_gdt = vmx_get_gdt,
+ .set_gdt = vmx_set_gdt,
+ .set_dr7 = vmx_set_dr7,
+ .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+ .cache_reg = vmx_cache_reg,
+ .get_rflags = vmx_get_rflags,
+ .set_rflags = vmx_set_rflags,
+ .get_if_flag = vmx_get_if_flag,
+
+ .flush_tlb_all = vmx_flush_tlb_all,
+ .flush_tlb_current = vmx_flush_tlb_current,
+ .flush_tlb_gva = vmx_flush_tlb_gva,
+ .flush_tlb_guest = vmx_flush_tlb_guest,
+
+ .vcpu_pre_run = vmx_vcpu_pre_run,
+ .vcpu_run = vmx_vcpu_run,
+ .handle_exit = vmx_handle_exit,
+ .skip_emulated_instruction = vmx_skip_emulated_instruction,
+ .update_emulated_instruction = vmx_update_emulated_instruction,
+ .set_interrupt_shadow = vmx_set_interrupt_shadow,
+ .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .patch_hypercall = vmx_patch_hypercall,
+ .inject_irq = vmx_inject_irq,
+ .inject_nmi = vmx_inject_nmi,
+ .inject_exception = vmx_inject_exception,
+ .cancel_injection = vmx_cancel_injection,
+ .interrupt_allowed = vmx_interrupt_allowed,
+ .nmi_allowed = vmx_nmi_allowed,
+ .get_nmi_mask = vmx_get_nmi_mask,
+ .set_nmi_mask = vmx_set_nmi_mask,
+ .enable_nmi_window = vmx_enable_nmi_window,
+ .enable_irq_window = vmx_enable_irq_window,
+ .update_cr8_intercept = vmx_update_cr8_intercept,
+ .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .apicv_pre_state_restore = vmx_apicv_pre_state_restore,
+ .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
+ .hwapic_irr_update = vmx_hwapic_irr_update,
+ .hwapic_isr_update = vmx_hwapic_isr_update,
+ .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .sync_pir_to_irr = vmx_sync_pir_to_irr,
+ .deliver_interrupt = vmx_deliver_interrupt,
+ .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+ .set_tss_addr = vmx_set_tss_addr,
+ .set_identity_map_addr = vmx_set_identity_map_addr,
+ .get_mt_mask = vmx_get_mt_mask,
+
+ .get_exit_info = vmx_get_exit_info,
+
+ .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+ .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+ .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+ .write_tsc_offset = vmx_write_tsc_offset,
+ .write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+ .load_mmu_pgd = vmx_load_mmu_pgd,
+
+ .check_intercept = vmx_check_intercept,
+ .handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+ .request_immediate_exit = vmx_request_immediate_exit,
+
+ .sched_in = vmx_sched_in,
+
+ .cpu_dirty_log_size = PML_ENTITY_NUM,
+ .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+ .nested_ops = &vmx_nested_ops,
+
+ .pi_update_irte = vmx_pi_update_irte,
+ .pi_start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+ .set_hv_timer = vmx_set_hv_timer,
+ .cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+ .setup_mce = vmx_setup_mce,
+
+#ifdef CONFIG_KVM_SMM
+ .smi_allowed = vmx_smi_allowed,
+ .enter_smm = vmx_enter_smm,
+ .leave_smm = vmx_leave_smm,
+ .enable_smi_window = vmx_enable_smi_window,
+#endif
+
+ .check_emulate_instruction = vmx_check_emulate_instruction,
+ .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .migrate_timers = vmx_migrate_timers,
+
+ .msr_filter_changed = vmx_msr_filter_changed,
+ .complete_emulated_msr = kvm_complete_insn_gp,
+
+ .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .get_untagged_addr = vmx_get_untagged_addr,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+ .hardware_setup = vmx_hardware_setup,
+ .handle_intel_pt_intr = NULL,
+
+ .runtime_ops = &vt_x86_ops,
+ .pmu_ops = &intel_pmu_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fca3457dd050..434f5aaef030 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -65,6 +65,7 @@
#include "vmcs12.h"
#include "vmx.h"
#include "x86.h"
+#include "x86_ops.h"
#include "smm.h"
#include "vmx_onhyperv.h"

@@ -516,8 +517,6 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
static unsigned long host_idt_base;

#if IS_ENABLED(CONFIG_HYPERV)
-static struct kvm_x86_ops vmx_x86_ops __initdata;
-
static bool __read_mostly enlightened_vmcs = true;
module_param(enlightened_vmcs, bool, 0444);

@@ -567,9 +566,8 @@ static __init void hv_init_evmcs(void)
}

if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vmx_x86_ops.enable_l2_tlb_flush
+ vt_x86_ops.enable_l2_tlb_flush
= hv_enable_l2_tlb_flush;
-
} else {
enlightened_vmcs = false;
}
@@ -1474,7 +1472,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
* Switches to specified vcpu, until a matching vcpu_put(), but assumes
* vcpu mutex is already taken.
*/
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -1485,7 +1483,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx->host_debugctlmsr = get_debugctlmsr();
}

-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
vmx_vcpu_pi_put(vcpu);

@@ -1544,7 +1542,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
vmx->emulation_required = vmx_emulation_required(vcpu);
}

-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
{
return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
}
@@ -1650,8 +1648,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
return 0;
}

-static int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
- void *insn, int insn_len)
+int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
{
/*
* Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1735,7 +1733,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
* Recognizes a pending MTF VM-exit and records the nested state for later
* delivery.
*/
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1766,7 +1764,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
}
}

-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
vmx_update_emulated_instruction(vcpu);
return skip_emulated_instruction(vcpu);
@@ -1785,7 +1783,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
}

-static void vmx_inject_exception(struct kvm_vcpu *vcpu)
+void vmx_inject_exception(struct kvm_vcpu *vcpu)
{
struct kvm_queued_exception *ex = &vcpu->arch.exception;
u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
@@ -1906,12 +1904,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
return kvm_caps.default_tsc_scaling_ratio;
}

-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu)
{
vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
}

-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu)
{
vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
}
@@ -1954,7 +1952,7 @@ static inline bool is_vmx_feature_control_msr_valid(struct vcpu_vmx *vmx,
return !(msr->data & ~valid_bits);
}

-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
{
switch (msr->index) {
case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
@@ -1971,7 +1969,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2152,7 +2150,7 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2455,7 +2453,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return ret;
}

-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
unsigned long guest_owned_bits;

@@ -2756,7 +2754,7 @@ static bool kvm_is_vmx_supported(void)
return supported;
}

-static int vmx_check_processor_compat(void)
+int vmx_check_processor_compat(void)
{
int cpu = raw_smp_processor_id();
struct vmcs_config vmcs_conf;
@@ -2798,7 +2796,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
return -EFAULT;
}

-static int vmx_hardware_enable(void)
+int vmx_hardware_enable(void)
{
int cpu = raw_smp_processor_id();
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2838,7 +2836,7 @@ static void vmclear_local_loaded_vmcss(void)
__loaded_vmcs_clear(v);
}

-static void vmx_hardware_disable(void)
+void vmx_hardware_disable(void)
{
vmclear_local_loaded_vmcss();

@@ -3152,7 +3150,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)

#endif

-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -3182,7 +3180,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}

-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 root_hpa = mmu->root.hpa;
@@ -3198,7 +3196,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}

-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
/*
* vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -3207,7 +3205,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
}

-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
/*
* vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3252,7 +3250,7 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
#define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \
CPU_BASED_CR3_STORE_EXITING)

-static bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
if (is_guest_mode(vcpu))
return nested_guest_cr0_valid(vcpu, cr0);
@@ -3373,8 +3371,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
return eptp;
}

-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
- int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
struct kvm *kvm = vcpu->kvm;
bool update_guest_cr3 = true;
@@ -3403,8 +3400,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmcs_writel(GUEST_CR3, guest_cr3);
}

-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
/*
* We operate under the default treatment of SMM, so VMX cannot be
@@ -3520,7 +3516,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->g = (ar >> 15) & 1;
}

-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
struct kvm_segment s;

@@ -3597,14 +3593,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}

-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);

to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
}

-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);

@@ -3612,25 +3608,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
*l = (ar >> 13) & 1;
}

-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
dt->address = vmcs_readl(GUEST_IDTR_BASE);
}

-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
vmcs_writel(GUEST_IDTR_BASE, dt->address);
}

-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
dt->address = vmcs_readl(GUEST_GDTR_BASE);
}

-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -4102,7 +4098,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
}
}

-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
void *vapic_page;
@@ -4122,7 +4118,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
return ((rvi & 0xf0) > (vppr & 0xf0));
}

-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 i;
@@ -4263,8 +4259,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
return 0;
}

-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
- int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
{
struct kvm_vcpu *vcpu = apic->vcpu;

@@ -4426,7 +4422,7 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}

-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -4690,7 +4686,7 @@ static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
return 0;
}

-static int vmx_vcpu_precreate(struct kvm *kvm)
+int vmx_vcpu_precreate(struct kvm *kvm)
{
return vmx_alloc_ipiv_pid_table(kvm);
}
@@ -4845,7 +4841,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmx->pi_desc.sn = 1;
}

-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -4904,12 +4900,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_update_fb_clear_dis(vcpu, vmx);
}

-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
{
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
}

-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
{
if (!enable_vnmi ||
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4920,7 +4916,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
}

-static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
@@ -4948,7 +4944,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_clear_hlt(vcpu);
}

-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -5026,7 +5022,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
GUEST_INTR_STATE_NMI));
}

-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -5048,7 +5044,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}

-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -5063,7 +5059,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !vmx_interrupt_blocked(vcpu);
}

-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
void __user *ret;

@@ -5083,7 +5079,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
return init_rmode_tss(kvm, ret);
}

-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
{
to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
return 0;
@@ -5369,8 +5365,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
return kvm_fast_pio(vcpu, size, port, in);
}

-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
{
/*
* Patch in the VMCALL instruction:
@@ -5579,7 +5574,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
return kvm_complete_insn_gp(vcpu, err);
}

-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
{
get_debugreg(vcpu->arch.db[0], 0);
get_debugreg(vcpu->arch.db[1], 1);
@@ -5598,7 +5593,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
set_debugreg(DR6_RESERVED, 6);
}

-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
vmcs_writel(GUEST_DR7, val);
}
@@ -5869,7 +5864,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 1;
}

-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (vmx_emulation_required_with_pending_exception(vcpu)) {
kvm_prepare_emulation_failure_exit(vcpu);
@@ -6133,9 +6128,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);

-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
- u64 *info1, u64 *info2,
- u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6578,7 +6572,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 0;
}

-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);

@@ -6666,7 +6660,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
: "eax", "ebx", "ecx", "edx");
}

-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
int tpr_threshold;
@@ -6736,7 +6730,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
}

-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
struct kvm *kvm = vcpu->kvm;
@@ -6805,7 +6799,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
kvm_release_pfn_clean(pfn);
}

-static void vmx_hwapic_isr_update(int max_isr)
+void vmx_hwapic_isr_update(int max_isr)
{
u16 status;
u8 old;
@@ -6839,7 +6833,7 @@ static void vmx_set_rvi(int vector)
}
}

-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
{
/*
* When running L2, updating RVI is only relevant when
@@ -6853,7 +6847,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
vmx_set_rvi(max_irr);
}

-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -6899,7 +6893,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
return max_irr;
}

-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
return;
@@ -6910,7 +6904,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}

-static void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6973,7 +6967,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
vcpu->arch.at_instruction_boundary = true;
}

-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6990,7 +6984,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
* The kvm parameter can be NULL (module initialization, or invocation before
* VM creation). Be sure to check the kvm parameter before using it.
*/
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
{
switch (index) {
case MSR_IA32_SMBASE:
@@ -7113,7 +7107,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
IDT_VECTORING_ERROR_CODE);
}

-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
{
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -7268,7 +7262,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
guest_state_exit_irqoff();
}

-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr3, cr4;
@@ -7424,7 +7418,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_exit_handlers_fastpath(vcpu);
}

-static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -7435,7 +7429,7 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_loaded_vmcs(vmx->loaded_vmcs);
}

-static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct vmx_uret_msr *tsx_ctrl;
struct vcpu_vmx *vmx;
@@ -7541,7 +7535,7 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

-static bool vmx_is_vm_type_supported(unsigned long type)
+bool vmx_is_vm_type_supported(unsigned long type)
{
/* TODO: Check if TDX is supported. */
return __kvm_is_vm_type_supported(type);
@@ -7550,7 +7544,7 @@ static bool vmx_is_vm_type_supported(unsigned long type)
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
{
if (!ple_gap)
kvm->arch.pause_in_guest = true;
@@ -7581,7 +7575,7 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}

-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
@@ -7753,7 +7747,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}

-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -7907,7 +7901,7 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}

-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->req_immediate_exit = true;
}
@@ -7946,10 +7940,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
}

-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
- struct x86_instruction_info *info,
- enum x86_intercept_stage stage,
- struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

@@ -8029,8 +8023,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
return 0;
}

-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
- bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
{
struct vcpu_vmx *vmx;
u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -8069,13 +8063,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
return 0;
}

-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->hv_deadline_tsc = -1;
}
#endif

-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
if (!kvm_pause_in_guest(vcpu->kvm))
shrink_ple_window(vcpu);
@@ -8104,7 +8098,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
}

-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.mcg_cap & MCG_LMCE_P)
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -8115,7 +8109,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
}

#ifdef CONFIG_KVM_SMM
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
/* we need a nested vmexit to enter SMM, postpone if run is pending */
if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -8123,7 +8117,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !is_smm(vcpu);
}

-static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -8144,7 +8138,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
return 0;
}

-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int ret;
@@ -8165,18 +8159,18 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
return 0;
}

-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
{
/* RSM will cause a vmexit anyway. */
}
#endif

-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
}

-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
{
if (is_guest_mode(vcpu)) {
struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -8186,7 +8180,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
}
}

-static void vmx_hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
{
kvm_set_posted_intr_wakeup_handler(NULL);

@@ -8196,18 +8190,7 @@ static void vmx_hardware_unsetup(void)
free_kvm_area();
}

-#define VMX_REQUIRED_APICV_INHIBITS \
-( \
- BIT(APICV_INHIBIT_REASON_DISABLE)| \
- BIT(APICV_INHIBIT_REASON_ABSENT) | \
- BIT(APICV_INHIBIT_REASON_HYPERV) | \
- BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
- BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
- BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) \
-)
-
-static void vmx_vm_destroy(struct kvm *kvm)
+void vmx_vm_destroy(struct kvm *kvm)
{
struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);

@@ -8258,151 +8241,6 @@ gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags
return (sign_extend64(gva, lam_bit) & ~BIT_ULL(63)) | (gva & BIT_ULL(63));
}

-static struct kvm_x86_ops vmx_x86_ops __initdata = {
- .name = KBUILD_MODNAME,
-
- .check_processor_compatibility = vmx_check_processor_compat,
-
- .hardware_unsetup = vmx_hardware_unsetup,
-
- .hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
- .has_emulated_msr = vmx_has_emulated_msr,
-
- .is_vm_type_supported = vmx_is_vm_type_supported,
- .vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
- .vm_destroy = vmx_vm_destroy,
-
- .vcpu_precreate = vmx_vcpu_precreate,
- .vcpu_create = vmx_vcpu_create,
- .vcpu_free = vmx_vcpu_free,
- .vcpu_reset = vmx_vcpu_reset,
-
- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
-
- .update_exception_bitmap = vmx_update_exception_bitmap,
- .get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .is_valid_cr0 = vmx_is_valid_cr0,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
-
- .flush_tlb_all = vmx_flush_tlb_all,
- .flush_tlb_current = vmx_flush_tlb_current,
- .flush_tlb_gva = vmx_flush_tlb_gva,
- .flush_tlb_guest = vmx_flush_tlb_guest,
-
- .vcpu_pre_run = vmx_vcpu_pre_run,
- .vcpu_run = vmx_vcpu_run,
- .handle_exit = vmx_handle_exit,
- .skip_emulated_instruction = vmx_skip_emulated_instruction,
- .update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
- .inject_irq = vmx_inject_irq,
- .inject_nmi = vmx_inject_nmi,
- .inject_exception = vmx_inject_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_pre_state_restore = vmx_apicv_pre_state_restore,
- .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
- .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
-
- .get_exit_info = vmx_get_exit_info,
-
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
- .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
-
- .load_mmu_pgd = vmx_load_mmu_pgd,
-
- .check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
-
- .request_immediate_exit = vmx_request_immediate_exit,
-
- .sched_in = vmx_sched_in,
-
- .cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
- .nested_ops = &vmx_nested_ops,
-
- .pi_update_irte = vmx_pi_update_irte,
- .pi_start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
- .setup_mce = vmx_setup_mce,
-
-#ifdef CONFIG_KVM_SMM
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
-#endif
-
- .check_emulate_instruction = vmx_check_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
- .migrate_timers = vmx_migrate_timers,
-
- .msr_filter_changed = vmx_msr_filter_changed,
- .complete_emulated_msr = kvm_complete_insn_gp,
-
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-
- .get_untagged_addr = vmx_get_untagged_addr,
-};
-
static unsigned int vmx_handle_intel_pt_intr(void)
{
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8468,9 +8306,7 @@ static void __init vmx_setup_me_spte_mask(void)
kvm_mmu_set_me_spte_mask(0, me_mask);
}

-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
{
unsigned long host_bndcfgs;
struct desc_ptr dt;
@@ -8539,16 +8375,16 @@ static __init int hardware_setup(void)
* using the APIC_ACCESS_ADDR VMCS field.
*/
if (!flexpriority_enabled)
- vmx_x86_ops.set_apic_access_page_addr = NULL;
+ vt_x86_ops.set_apic_access_page_addr = NULL;

if (!cpu_has_vmx_tpr_shadow())
- vmx_x86_ops.update_cr8_intercept = NULL;
+ vt_x86_ops.update_cr8_intercept = NULL;

#if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
&& enable_ept) {
- vmx_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
- vmx_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
+ vt_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
+ vt_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
}
#endif

@@ -8563,7 +8399,7 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_apicv())
enable_apicv = 0;
if (!enable_apicv)
- vmx_x86_ops.sync_pir_to_irr = NULL;
+ vt_x86_ops.sync_pir_to_irr = NULL;

if (!enable_apicv || !cpu_has_vmx_ipiv())
enable_ipiv = false;
@@ -8599,7 +8435,7 @@ static __init int hardware_setup(void)
enable_pml = 0;

if (!enable_pml)
- vmx_x86_ops.cpu_dirty_log_size = 0;
+ vt_x86_ops.cpu_dirty_log_size = 0;

if (!cpu_has_vmx_preemption_timer())
enable_preemption_timer = false;
@@ -8624,9 +8460,9 @@ static __init int hardware_setup(void)
}

if (!enable_preemption_timer) {
- vmx_x86_ops.set_hv_timer = NULL;
- vmx_x86_ops.cancel_hv_timer = NULL;
- vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+ vt_x86_ops.set_hv_timer = NULL;
+ vt_x86_ops.cancel_hv_timer = NULL;
+ vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
}

kvm_caps.supported_mce_cap |= MCG_LMCE_P;
@@ -8637,9 +8473,9 @@ static __init int hardware_setup(void)
if (!enable_ept || !enable_pmu || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
if (pt_mode == PT_MODE_HOST_GUEST)
- vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+ vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
else
- vmx_init_ops.handle_intel_pt_intr = NULL;
+ vt_init_ops.handle_intel_pt_intr = NULL;

setup_default_sgx_lepubkeyhash();

@@ -8662,14 +8498,6 @@ static __init int hardware_setup(void)
return r;
}

-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
- .hardware_setup = hardware_setup,
- .handle_intel_pt_intr = NULL,
-
- .runtime_ops = &vmx_x86_ops,
- .pmu_ops = &intel_pmu_ops,
-};
-
static void vmx_cleanup_l1d_flush(void)
{
if (vmx_l1d_flush_pages) {
@@ -8711,7 +8539,7 @@ static int __init vmx_init(void)
*/
hv_init_evmcs();

- r = kvm_x86_vendor_init(&vmx_init_ops);
+ r = kvm_x86_vendor_init(&vt_init_ops);
if (r)
return r;

diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..2f8b6c43fe0f
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include "x86.h"
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+__init int vmx_hardware_setup(void);
+void vmx_hardware_unsetup(void);
+int vmx_check_processor_compat(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+bool vmx_is_vm_type_supported(unsigned long type);
+int vmx_vm_init(struct kvm *kvm);
+void vmx_vm_destroy(struct kvm *kvm);
+int vmx_vcpu_precreate(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+#ifdef CONFIG_KVM_SMM
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+#endif
+int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_inject_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 08:37:17

by Isaku Yamahata

Subject: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

From: Isaku Yamahata <[email protected]>

Add the architectural definitions KVM needs to issue TDX SEAMCALLs.

The structures and values are architecturally defined in the ABI
Reference chapter of the TDX module specification.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
---
v19:
- drop tdvmcall constants by Xiaoyao

v18:
- Add metadata field id

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx_arch.h | 265 ++++++++++++++++++++++++++++++++++++
1 file changed, 265 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..e2c1a6f429d7
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,265 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER 0
+#define TDH_MNG_ADDCX 1
+#define TDH_MEM_PAGE_ADD 2
+#define TDH_MEM_SEPT_ADD 3
+#define TDH_VP_ADDCX 4
+#define TDH_MEM_PAGE_RELOCATE 5
+#define TDH_MEM_PAGE_AUG 6
+#define TDH_MEM_RANGE_BLOCK 7
+#define TDH_MNG_KEY_CONFIG 8
+#define TDH_MNG_CREATE 9
+#define TDH_VP_CREATE 10
+#define TDH_MNG_RD 11
+#define TDH_MR_EXTEND 16
+#define TDH_MR_FINALIZE 17
+#define TDH_VP_FLUSH 18
+#define TDH_MNG_VPFLUSHDONE 19
+#define TDH_MNG_KEY_FREEID 20
+#define TDH_MNG_INIT 21
+#define TDH_VP_INIT 22
+#define TDH_MEM_SEPT_RD 25
+#define TDH_VP_RD 26
+#define TDH_MNG_KEY_RECLAIMID 27
+#define TDH_PHYMEM_PAGE_RECLAIM 28
+#define TDH_MEM_PAGE_REMOVE 29
+#define TDH_MEM_SEPT_REMOVE 30
+#define TDH_SYS_RD 34
+#define TDH_MEM_TRACK 38
+#define TDH_MEM_RANGE_UNBLOCK 39
+#define TDH_PHYMEM_CACHE_WB 40
+#define TDH_PHYMEM_PAGE_WBINVD 41
+#define TDH_VP_WR 43
+#define TDH_SYS_LP_SHUTDOWN 44
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH BIT_ULL(63)
+#define TDX_CLASS_SHIFT 56
+#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field) \
+ (((non_arch) ? TDX_NON_ARCH : 0) | \
+ ((u64)(class) << TDX_CLASS_SHIFT) | \
+ ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field) \
+ __BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
+ __BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* Class code for TD */
+#define TD_CLASS_EXECUTION_CONTROLS 17ULL
+
+/* Class code for TDVPS */
+#define TDVPS_CLASS_VMCS 0ULL
+#define TDVPS_CLASS_GUEST_GPR 16ULL
+#define TDVPS_CLASS_OTHER_GUEST 17ULL
+#define TDVPS_CLASS_MANAGEMENT 32ULL
+
+enum tdx_tdcs_execution_control {
+ TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field) BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
+
+enum tdx_vcpu_guest_other_state {
+ TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+ struct {
+ u64 vmxip : 1;
+ u64 reserved : 63;
+ };
+ u64 full;
+};
+
+/* @field is any of enum tdx_vcpu_guest_other_state */
+#define TDVPS_STATE(field) BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
+#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
+
+/* Management class fields */
+enum tdx_vcpu_guest_management {
+ TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_vcpu_guest_management */
+#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE 256
+
+struct tdx_cpuid_value {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+} __packed;
+
+#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
+#define TDX_TD_ATTR_SEPT_VE_DISABLE BIT_ULL(28)
+#define TDX_TD_ATTRIBUTE_PKS BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+#define TDX_MAX_VCPUS (~(u16)0)
+
+struct td_params {
+ u64 attributes;
+ u64 xfam;
+ u16 max_vcpus;
+ u8 reserved0[6];
+
+ u64 eptp_controls;
+ u64 exec_controls;
+ u16 tsc_frequency;
+ u8 reserved1[38];
+
+ u64 mrconfigid[6];
+ u64 mrowner[6];
+ u64 mrownerconfig[6];
+ u64 reserved2[4];
+
+ union {
+ DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
+ u8 reserved3[768];
+ };
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
+
+/*
+ * TDH.VP.ENTER, TDG.VP.VMCALL preserves RBP
+ * 0: RBP can be used for TDG.VP.VMCALL input. RBP is clobbered.
+ * 1: RBP can't be used for TDG.VP.VMCALL input. RBP is preserved.
+ */
+#define TDX_CONTROL_FLAG_NO_RBP_MOD BIT_ULL(2)
+
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz. The
+ * frequency must be between 100MHz and 10GHz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
+
+union tdx_sept_entry {
+ struct {
+ u64 r : 1;
+ u64 w : 1;
+ u64 x : 1;
+ u64 mt : 3;
+ u64 ipat : 1;
+ u64 leaf : 1;
+ u64 a : 1;
+ u64 d : 1;
+ u64 xu : 1;
+ u64 ignored0 : 1;
+ u64 pfn : 40;
+ u64 reserved : 5;
+ u64 vgp : 1;
+ u64 pwa : 1;
+ u64 ignored1 : 1;
+ u64 sss : 1;
+ u64 spp : 1;
+ u64 ignored2 : 1;
+ u64 sve : 1;
+ };
+ u64 raw;
+};
+
+enum tdx_sept_entry_state {
+ TDX_SEPT_FREE = 0,
+ TDX_SEPT_BLOCKED = 1,
+ TDX_SEPT_PENDING = 2,
+ TDX_SEPT_PENDING_BLOCKED = 3,
+ TDX_SEPT_PRESENT = 4,
+};
+
+union tdx_sept_level_state {
+ struct {
+ u64 level : 3;
+ u64 reserved0 : 5;
+ u64 state : 8;
+ u64 reserved1 : 48;
+ };
+ u64 raw;
+};
+
+/*
+ * Global scope metadata field ID.
+ * See Table "Global Scope Metadata", TDX module 1.5 ABI spec.
+ */
+#define MD_FIELD_ID_SYS_ATTRIBUTES 0x0A00000200000000ULL
+#define MD_FIELD_ID_FEATURES0 0x0A00000300000008ULL
+#define MD_FIELD_ID_ATTRS_FIXED0 0x1900000300000000ULL
+#define MD_FIELD_ID_ATTRS_FIXED1 0x1900000300000001ULL
+#define MD_FIELD_ID_XFAM_FIXED0 0x1900000300000002ULL
+#define MD_FIELD_ID_XFAM_FIXED1 0x1900000300000003ULL
+
+#define MD_FIELD_ID_TDCS_BASE_SIZE 0x9800000100000100ULL
+#define MD_FIELD_ID_TDVPS_BASE_SIZE 0x9800000100000200ULL
+
+#define MD_FIELD_ID_NUM_CPUID_CONFIG 0x9900000100000004ULL
+#define MD_FIELD_ID_CPUID_CONFIG_LEAVES 0x9900000300000400ULL
+#define MD_FIELD_ID_CPUID_CONFIG_VALUES 0x9900000300000500ULL
+
+#define MD_FIELD_ID_FEATURES0_NO_RBP_MOD BIT_ULL(18)
+
+#define TDX_MAX_NR_CPUID_CONFIGS 37
+
+#define TDX_MD_ELEMENT_SIZE_8BITS 0
+#define TDX_MD_ELEMENT_SIZE_16BITS 1
+#define TDX_MD_ELEMENT_SIZE_32BITS 2
+#define TDX_MD_ELEMENT_SIZE_64BITS 3
+
+union tdx_md_field_id {
+ struct {
+ u64 field : 24;
+ u64 reserved0 : 8;
+ u64 element_size_code : 2;
+ u64 last_element_in_field : 4;
+ u64 reserved1 : 3;
+ u64 inc_size : 1;
+ u64 write_mask_valid : 1;
+ u64 context : 3;
+ u64 reserved2 : 1;
+ u64 class : 6;
+ u64 reserved3 : 1;
+ u64 non_arch : 1;
+ };
+ u64 raw;
+};
+
+#define TDX_MD_ELEMENT_SIZE_CODE(_field_id) \
+ ({ union tdx_md_field_id _fid = { .raw = (_field_id)}; \
+ _fid.element_size_code; })
+
+#endif /* __KVM_X86_TDX_ARCH_H */
--
2.25.1


2024-02-26 08:37:39

by Isaku Yamahata

Subject: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

From: Isaku Yamahata <[email protected]>

Add helper functions to print out errors from the TDX module in a uniform
manner.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
---
v19:
- dropped unnecessary include <asm/tdx.h>

v18:
- Added Reviewed-by Binbin.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/tdx_error.c | 21 +++++++++++++++++++++
arch/x86/kvm/vmx/tdx_ops.h | 4 ++++
3 files changed, 26 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/vmx/tdx_error.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 5b85ef84b2e9..44b0594da877 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -24,7 +24,7 @@ kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \

kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
-kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o vmx/tdx_error.o

kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
svm/sev.o
diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
new file mode 100644
index 000000000000..42fcabe1f6c7
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_error.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* functions to record TDX SEAMCALL error */
+
+#include <linux/kernel.h>
+#include <linux/bug.h>
+
+#include "tdx_ops.h"
+
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out)
+{
+ if (!out) {
+ pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
+ op, error_code);
+ return;
+ }
+
+#define MSG \
+ "SEAMCALL (0x%016llx) failed: 0x%016llx RCX 0x%016llx RDX 0x%016llx R8 0x%016llx R9 0x%016llx R10 0x%016llx R11 0x%016llx\n"
+ pr_err_ratelimited(MSG, op, error_code, out->rcx, out->rdx, out->r8,
+ out->r9, out->r10, out->r11);
+}
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index c5bb165b260e..d80212b1daf3 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -40,6 +40,10 @@ static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
return ret;
}

+#ifdef CONFIG_INTEL_TDX_HOST
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
+#endif
+
static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
{
struct tdx_module_args in = {
--
2.25.1


2024-02-26 08:37:54

by Isaku Yamahata

Subject: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

From: Isaku Yamahata <[email protected]>

A VMM interacts with the TDX module using a new instruction (SEAMCALL).
For instance, a TDX VMM does not have full access to the VM control
structure corresponding to the VMX VMCS. Instead, the VMM induces the TDX
module to act on its behalf via SEAMCALLs.

Define C wrapper functions for SEAMCALLs for readability.

Some SEAMCALL APIs donate host pages to the TDX module or a guest TD, and
the donated pages are encrypted. Those calls require the VMM to flush the
cache lines to avoid cache line aliasing.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
---
Changes
v19:
- Update the commit message to match the patch by Yuan
- Use seamcall() and seamcall_ret() by paolo

v18:
- removed stub functions for __seamcall{,_ret}()
- Added Reviewed-by Binbin
- Make tdx_seamcall() use struct tdx_module_args instead of taking
each inputs.

v15 -> v16:
- use struct tdx_module_args instead of struct tdx_module_output
- Add tdh_mem_sept_rd() for SEPT_VE_DISABLE=1.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx_ops.h | 360 +++++++++++++++++++++++++++++++++++++
1 file changed, 360 insertions(+)
create mode 100644 arch/x86/kvm/vmx/tdx_ops.h

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..c5bb165b260e
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,360 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/cacheflush.h>
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+#include "x86.h"
+
+static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
+ struct tdx_module_args *out)
+{
+ u64 ret;
+
+ if (out) {
+ *out = *in;
+ ret = seamcall_ret(op, out);
+ } else
+ ret = seamcall(op, in);
+
+ if (unlikely(ret == TDX_SEAMCALL_UD)) {
+ /*
+ * SEAMCALLs fail with TDX_SEAMCALL_UD returned when VMX is off.
+ * This can happen when the host gets rebooted or live
+ * updated. In this case, the instruction execution is ignored
+ * as KVM is shut down, so the error code is suppressed. Other
+ * than this, the error is unexpected and the execution can't
+ * continue as the TDX features rely on VMX to be on.
+ */
+ kvm_spurious_fault();
+ return 0;
+ }
+ return ret;
+}
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+ struct tdx_module_args in = {
+ .rcx = addr,
+ .rdx = tdr,
+ };
+
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return tdx_seamcall(TDH_MNG_ADDCX, &in, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa,
+ .rdx = tdr,
+ .r8 = hpa,
+ .r9 = source,
+ };
+
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_PAGE_ADD, &in, out);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ .r8 = page,
+ };
+
+ clflush_cache_range(__va(page), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_SEPT_ADD, &in, out);
+}
+
+static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MEM_SEPT_RD, &in, out);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MEM_SEPT_REMOVE, &in, out);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+ struct tdx_module_args in = {
+ .rcx = addr,
+ .rdx = tdvpr,
+ };
+
+ clflush_cache_range(__va(addr), PAGE_SIZE);
+ return tdx_seamcall(TDH_VP_ADDCX, &in, NULL);
+}
+
+static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa,
+ .rdx = tdr,
+ .r8 = hpa,
+ };
+
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_PAGE_RELOCATE, &in, out);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa,
+ .rdx = tdr,
+ .r8 = hpa,
+ };
+
+ clflush_cache_range(__va(hpa), PAGE_SIZE);
+ return tdx_seamcall(TDH_MEM_PAGE_AUG, &in, out);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MEM_RANGE_BLOCK, &in, out);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MNG_KEY_CONFIG, &in, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ .rdx = hkid,
+ };
+
+ clflush_cache_range(__va(tdr), PAGE_SIZE);
+ return tdx_seamcall(TDH_MNG_CREATE, &in, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdvpr,
+ .rdx = tdr,
+ };
+
+ clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+ return tdx_seamcall(TDH_VP_CREATE, &in, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ .rdx = field,
+ };
+
+ return tdx_seamcall(TDH_MNG_RD, &in, out);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MR_EXTEND, &in, out);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MR_FINALIZE, &in, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdvpr,
+ };
+
+ return tdx_seamcall(TDH_VP_FLUSH, &in, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MNG_VPFLUSHDONE, &in, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MNG_KEY_FREEID, &in, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ .rdx = td_params,
+ };
+
+ return tdx_seamcall(TDH_MNG_INIT, &in, out);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+ struct tdx_module_args in = {
+ .rcx = tdvpr,
+ .rdx = rcx,
+ };
+
+ return tdx_seamcall(TDH_VP_INIT, &in, NULL);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = tdvpr,
+ .rdx = field,
+ };
+
+ return tdx_seamcall(TDH_VP_RD, &in, out);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MNG_KEY_RECLAIMID, &in, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = page,
+ };
+
+ return tdx_seamcall(TDH_PHYMEM_PAGE_RECLAIM, &in, out);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MEM_PAGE_REMOVE, &in, out);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+ struct tdx_module_args in = {
+ };
+
+ return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+ struct tdx_module_args in = {
+ .rcx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MEM_TRACK, &in, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall(TDH_MEM_RANGE_UNBLOCK, &in, out);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+ struct tdx_module_args in = {
+ .rcx = resume ? 1 : 0,
+ };
+
+ return tdx_seamcall(TDH_PHYMEM_CACHE_WB, &in, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+ struct tdx_module_args in = {
+ .rcx = page,
+ };
+
+ return tdx_seamcall(TDH_PHYMEM_PAGE_WBINVD, &in, NULL);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = tdvpr,
+ .rdx = field,
+ .r8 = val,
+ .r9 = mask,
+ };
+
+ return tdx_seamcall(TDH_VP_WR, &in, out);
+}
+
+#endif /* __KVM_X86_TDX_OPS_H */
--
2.25.1


2024-02-26 08:37:55

by Isaku Yamahata

Subject: [PATCH v19 031/130] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of patch series of TD VM
creation/destruction.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index f11ea701dc19..098150da6ea2 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -16,8 +16,8 @@ Patch Layer status
Patch layer Status

* TDX, VMX coexistence: Applied
-* TDX architectural definitions: Applying
-* TD VM creation/destruction: Not yet
+* TDX architectural definitions: Applied
+* TD VM creation/destruction: Applying
* TD vcpu creation/destruction: Not yet
* TDX EPT violation: Not yet
* TD finalization: Not yet
--
2.25.1


2024-02-26 08:38:30

by Isaku Yamahata

Subject: [PATCH v19 032/130] KVM: TDX: Add helper functions to allocate/free TDX private host key id

From: Isaku Yamahata <[email protected]>

Add helper functions to allocate/free TDX private host key id (HKID).

The memory controller encrypts TDX memory with the assigned TDX HKIDs. The
global TDX HKID encrypts the TDX module, its memory, and some dynamic data
(TDR). A private TDX HKID is assigned to a guest TD to encrypt guest memory
and the related data. When the VMM releases an encrypted page for reuse,
the page needs a cache flush with the HKID that was used. The VMM therefore
needs the global TDX HKID and the private TDX HKIDs to flush encrypted
pages.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Removed stale comment in tdx_guest_keyid_alloc() by Binbin
- Update sanity check in tdx_guest_keyid_free() by Binbin

v18:
- Moved the functions to kvm tdx from arch/x86/virt/vmx/tdx/
- Drop exporting symbols as the host tdx does.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a7e096fd8361..cde971122c1e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -11,6 +11,34 @@
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+/*
+ * Key id globally used by TDX module: TDX module maps TDR with this TDX global
+ * key id. TDR includes key id assigned to the TD. Then TDX module maps other
+ * TD-related pages with the assigned key id. TDR requires this TDX global key
+ * id for cache flush unlike other TD-related pages.
+ */
+/* TDX KeyID pool */
+static DEFINE_IDA(tdx_guest_keyid_pool);
+
+static int __used tdx_guest_keyid_alloc(void)
+{
+ if (WARN_ON_ONCE(!tdx_guest_keyid_start || !tdx_nr_guest_keyids))
+ return -EINVAL;
+
+ return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start,
+ tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
+ GFP_KERNEL);
+}
+
+static void __used tdx_guest_keyid_free(int keyid)
+{
+ if (WARN_ON_ONCE(keyid < tdx_guest_keyid_start ||
+ keyid > tdx_guest_keyid_start + tdx_nr_guest_keyids - 1))
+ return;
+
+ ida_free(&tdx_guest_keyid_pool, keyid);
+}
+
static int __init tdx_module_setup(void)
{
int ret;
--
2.25.1


2024-02-26 08:38:39

by Isaku Yamahata

Subject: [PATCH v19 033/130] KVM: TDX: Add helper function to read TDX metadata in array

From: Isaku Yamahata <[email protected]>

To read several metadata fields at once, use a table. Instead of a series
of calls, metadata_read(fid0, &data0); metadata_read(fid1, &data1); ...,
build a table, table = { {fid0, &data0}, ... }, and read it with a single
metadata_read(table).
TODO: Once the TDX host code introduces its own framework to read TDX
metadata, drop this patch and convert the code that uses this.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v18:
- newly added
---
arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cde971122c1e..dce21f675155 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
#include "capabilities.h"
#include "x86_ops.h"
#include "x86.h"
+#include "tdx_arch.h"
#include "tdx.h"

#undef pr_fmt
@@ -39,6 +40,50 @@ static void __used tdx_guest_keyid_free(int keyid)
ida_free(&tdx_guest_keyid_pool, keyid);
}

+#define TDX_MD_MAP(_fid, _ptr) \
+ { .fid = MD_FIELD_ID_##_fid, \
+ .ptr = (_ptr), }
+
+struct tdx_md_map {
+ u64 fid;
+ void *ptr;
+};
+
+static size_t tdx_md_element_size(u64 fid)
+{
+ switch (TDX_MD_ELEMENT_SIZE_CODE(fid)) {
+ case TDX_MD_ELEMENT_SIZE_8BITS:
+ return 1;
+ case TDX_MD_ELEMENT_SIZE_16BITS:
+ return 2;
+ case TDX_MD_ELEMENT_SIZE_32BITS:
+ return 4;
+ case TDX_MD_ELEMENT_SIZE_64BITS:
+ return 8;
+ default:
+ WARN_ON_ONCE(1);
+ return 0;
+ }
+}
+
+static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
+{
+ struct tdx_md_map *m;
+ int ret, i;
+ u64 tmp;
+
+ for (i = 0; i < nr_maps; i++) {
+ m = &maps[i];
+ ret = tdx_sys_metadata_field_read(m->fid, &tmp);
+ if (ret)
+ return ret;
+
+ memcpy(m->ptr, &tmp, tdx_md_element_size(m->fid));
+ }
+
+ return 0;
+}
+
static int __init tdx_module_setup(void)
{
int ret;
--
2.25.1


2024-02-26 08:38:58

by Isaku Yamahata

Subject: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

From: Isaku Yamahata <[email protected]>

TDX KVM needs system-wide information about the TDX module. Store it in
struct tdx_info.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Added features0
- Use tdx_sys_metadata_read()
- Fix error recovery path by Yuan

Change v18:
- Newly Added

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 11 +++++
arch/x86/kvm/vmx/main.c | 9 +++-
arch/x86/kvm/vmx/tdx.c | 80 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/x86_ops.h | 2 +
4 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index aa7a56a47564..45b2c2304491 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -567,4 +567,15 @@ struct kvm_pmu_event_filter {
#define KVM_X86_TDX_VM 2
#define KVM_X86_SNP_VM 3

+#define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1)
+
+struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index fa19682b366c..a948a6959ac7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -32,6 +32,13 @@ static __init int vt_hardware_setup(void)
return 0;
}

+static void vt_hardware_unsetup(void)
+{
+ if (enable_tdx)
+ tdx_hardware_unsetup();
+ vmx_hardware_unsetup();
+}
+
static int vt_vm_init(struct kvm *kvm)
{
if (is_td(kvm))
@@ -54,7 +61,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.check_processor_compatibility = vmx_check_processor_compat,

- .hardware_unsetup = vmx_hardware_unsetup,
+ .hardware_unsetup = vt_hardware_unsetup,

/* TDX cpu enablement is done by tdx_hardware_setup(). */
.hardware_enable = vmx_hardware_enable,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dce21f675155..5edfb99abb89 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -40,6 +40,21 @@ static void __used tdx_guest_keyid_free(int keyid)
ida_free(&tdx_guest_keyid_pool, keyid);
}

+struct tdx_info {
+ u64 features0;
+ u64 attributes_fixed0;
+ u64 attributes_fixed1;
+ u64 xfam_fixed0;
+ u64 xfam_fixed1;
+
+ u16 num_cpuid_config;
+ /* This must be the last member. */
+ DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
+};
+
+/* Info about the TDX module. */
+static struct tdx_info *tdx_info;
+
#define TDX_MD_MAP(_fid, _ptr) \
{ .fid = MD_FIELD_ID_##_fid, \
.ptr = (_ptr), }
@@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
}
}

-static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
+static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
{
struct tdx_md_map *m;
int ret, i;
@@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
return 0;
}

+#define TDX_INFO_MAP(_field_id, _member) \
+ TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
+
static int __init tdx_module_setup(void)
{
+ u16 num_cpuid_config;
int ret;
+ u32 i;
+
+ struct tdx_md_map mds[] = {
+ TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
+ };
+
+ struct tdx_metadata_field_mapping fields[] = {
+ TDX_INFO_MAP(FEATURES0, features0),
+ TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
+ TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
+ TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
+ TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
+ };

ret = tdx_enable();
if (ret) {
@@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
return ret;
}

+ ret = tdx_md_read(mds, ARRAY_SIZE(mds));
+ if (ret)
+ return ret;
+
+ tdx_info = kzalloc(sizeof(*tdx_info) +
+ sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
+ GFP_KERNEL);
+ if (!tdx_info)
+ return -ENOMEM;
+ tdx_info->num_cpuid_config = num_cpuid_config;
+
+ ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
+ if (ret)
+ goto error_out;
+
+ for (i = 0; i < num_cpuid_config; i++) {
+ struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
+ u64 leaf, eax_ebx, ecx_edx;
+ struct tdx_md_map cpuids[] = {
+ TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
+ TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
+ TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
+ };
+
+ ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
+ if (ret)
+ goto error_out;
+
+ c->leaf = (u32)leaf;
+ c->sub_leaf = leaf >> 32;
+ c->eax = (u32)eax_ebx;
+ c->ebx = eax_ebx >> 32;
+ c->ecx = (u32)ecx_edx;
+ c->edx = ecx_edx >> 32;
+ }
+
return 0;
+
+error_out:
+ /* kfree() accepts NULL. */
+ kfree(tdx_info);
+ return ret;
}

bool tdx_is_vm_type_supported(unsigned long type)
@@ -162,3 +235,8 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
out:
return r;
}
+
+void tdx_hardware_unsetup(void)
+{
+ kfree(tdx_info);
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f4da88a228d0..e8cb4ae81cf1 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -136,9 +136,11 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);

#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
+static inline void tdx_hardware_unsetup(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
#endif

--
2.25.1


2024-02-26 08:39:30

by Isaku Yamahata

Subject: [PATCH v19 035/130] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl

From: Isaku Yamahata <[email protected]>

KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest state-protected VMs. It defines subcommands for technology-specific
operations. Despite its name, the subcommands are not limited to memory
encryption; various technology-specific operations are defined under it.
It's natural to repurpose KVM_MEMORY_ENCRYPT_OP for TDX-specific
operations and define subcommands.

TDX requires VM-scoped TDX-specific operations for the device model, for
example qemu: getting system-wide parameters and TDX-specific VM
initialization.

Add a placeholder function for the TDX-specific VM-scoped ioctl as
mem_enc_op. TDX-specific sub-commands will be added to retrieve/pass
TDX-specific parameters. Make mem_enc_ioctl non-optional as it's always
filled.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v15:
- change struct kvm_tdx_cmd to drop unused member.
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +-
arch/x86/include/uapi/asm/kvm.h | 26 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/main.c | 10 ++++++++++
arch/x86/kvm/vmx/tdx.c | 26 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
arch/x86/kvm/x86.c | 4 ----
6 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8be71a5c5c87..00b371d9a1ca 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -123,7 +123,7 @@ KVM_X86_OP(enter_smm)
KVM_X86_OP(leave_smm)
KVM_X86_OP(enable_smi_window)
#endif
-KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
+KVM_X86_OP(mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_register_region)
KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 45b2c2304491..9ea46d143bef 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -567,6 +567,32 @@ struct kvm_pmu_event_filter {
#define KVM_X86_TDX_VM 2
#define KVM_X86_SNP_VM 3

+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+
+ KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command: an immediate value or a pointer to the
+ * actual data in process virtual address space. If the sub-command
+ * doesn't use it, set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+};
+
#define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1)

struct kvm_tdx_cpuid_config {
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a948a6959ac7..082e82ce6580 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -47,6 +47,14 @@ static int vt_vm_init(struct kvm *kvm)
return vmx_vm_init(kvm);
}

+static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+{
+ if (!is_td(kvm))
+ return -ENOTTY;
+
+ return tdx_vm_ioctl(kvm, argp);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -200,6 +208,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,

.get_untagged_addr = vmx_get_untagged_addr,
+
+ .mem_enc_ioctl = vt_mem_enc_ioctl,
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5edfb99abb89..07a3f0f75f87 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -55,6 +55,32 @@ struct tdx_info {
/* Info about the TDX module. */
static struct tdx_info *tdx_info;

+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_tdx_cmd tdx_cmd;
+ int r;
+
+ if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+ return -EFAULT;
+ if (tdx_cmd.error)
+ return -EINVAL;
+
+ mutex_lock(&kvm->lock);
+
+ switch (tdx_cmd.id) {
+ default:
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+ r = -EFAULT;
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
#define TDX_MD_MAP(_fid, _ptr) \
{ .fid = MD_FIELD_ID_##_fid, \
.ptr = (_ptr), }
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index e8cb4ae81cf1..f6c57ad44f80 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -138,10 +138,14 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
+
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 442b356e4939..c459a5e9e520 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7247,10 +7247,6 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
goto out;
}
case KVM_MEMORY_ENCRYPT_OP: {
- r = -ENOTTY;
- if (!kvm_x86_ops.mem_enc_ioctl)
- goto out;
-
r = static_call(kvm_x86_mem_enc_ioctl)(kvm, argp);
break;
}
--
2.25.1


2024-02-26 08:39:37

by Isaku Yamahata

Subject: [PATCH v19 036/130] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters

From: Sean Christopherson <[email protected]>

Implement an ioctl to get system-wide parameters for TDX. Although the
parameters are system-wide, the VM-scoped mem_enc ioctl works for a
userspace VMM like qemu, and no device-scoped version is defined, so
re-use the VM-scoped mem_enc ioctl.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v18:
- drop the use of tdhsysinfo_struct and TDH.SYS.INFO, use TDH.SYS.RD().
For that, dynamically allocate/free tdx_info.
- drop the change of tools/arch/x86/include/uapi/asm/kvm.h.

v14 -> v15:
- ABI change: added supported_gpaw and reserved area.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 17 ++++++++++
arch/x86/kvm/vmx/tdx.c | 56 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 3 ++
3 files changed, 76 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9ea46d143bef..e28189c81691 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -604,4 +604,21 @@ struct kvm_tdx_cpuid_config {
__u32 edx;
};

+/* supported_gpaw */
+#define TDX_CAP_GPAW_48 (1 << 0)
+#define TDX_CAP_GPAW_52 (1 << 1)
+
+struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+ __u32 supported_gpaw;
+ __u32 padding;
+ __u64 reserved[251];
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[];
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 07a3f0f75f87..816ccdb4bc41 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
#include "capabilities.h"
#include "x86_ops.h"
#include "x86.h"
+#include "mmu.h"
#include "tdx_arch.h"
#include "tdx.h"

@@ -55,6 +56,58 @@ struct tdx_info {
/* Info about the TDX module. */
static struct tdx_info *tdx_info;

+static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx_capabilities __user *user_caps;
+ struct kvm_tdx_capabilities *caps = NULL;
+ int ret = 0;
+
+ if (cmd->flags)
+ return -EINVAL;
+
+ caps = kmalloc(sizeof(*caps), GFP_KERNEL);
+ if (!caps)
+ return -ENOMEM;
+
+ user_caps = (void __user *)cmd->data;
+ if (copy_from_user(caps, user_caps, sizeof(*caps))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ if (caps->nr_cpuid_configs < tdx_info->num_cpuid_config) {
+ ret = -E2BIG;
+ goto out;
+ }
+
+ *caps = (struct kvm_tdx_capabilities) {
+ .attrs_fixed0 = tdx_info->attributes_fixed0,
+ .attrs_fixed1 = tdx_info->attributes_fixed1,
+ .xfam_fixed0 = tdx_info->xfam_fixed0,
+ .xfam_fixed1 = tdx_info->xfam_fixed1,
+ .supported_gpaw = TDX_CAP_GPAW_48 |
+ ((kvm_get_shadow_phys_bits() >= 52 &&
+ cpu_has_vmx_ept_5levels()) ? TDX_CAP_GPAW_52 : 0),
+ .nr_cpuid_configs = tdx_info->num_cpuid_config,
+ .padding = 0,
+ };
+
+ if (copy_to_user(user_caps, caps, sizeof(*caps))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (copy_to_user(user_caps->cpuid_configs, &tdx_info->cpuid_configs,
+ tdx_info->num_cpuid_config *
+ sizeof(tdx_info->cpuid_configs[0]))) {
+ ret = -EFAULT;
+ }
+
+out:
+ /* kfree() accepts NULL. */
+ kfree(caps);
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -68,6 +121,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
mutex_lock(&kvm->lock);

switch (tdx_cmd.id) {
+ case KVM_TDX_CAPABILITIES:
+ r = tdx_get_capabilities(&tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 473013265bd8..22c0b57f69ca 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -3,6 +3,9 @@
#define __KVM_X86_TDX_H

#ifdef CONFIG_INTEL_TDX_HOST
+
+#include "tdx_ops.h"
+
struct kvm_tdx {
struct kvm kvm;
/* TDX specific members follow. */
--
2.25.1


2024-02-26 08:39:56

by Isaku Yamahata

Subject: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

From: Isaku Yamahata <[email protected]>

TDX has its own limit on the maximum number of vcpus that the guest
can accommodate. Allow the x86 KVM backend to implement its own
KVM_ENABLE_CAP handler and implement a TDX backend for
KVM_CAP_MAX_VCPUS so that a userspace VMM, e.g. qemu, can specify its
own value instead of KVM_MAX_VCPUS.

When creating a TD (TDH.MNG.INIT), the maximum number of vcpus needs to
be specified in struct td_params_struct, and the value is part of the
measurement. Userspace has to specify the value somehow. There are
two options for it.
option 1. An API (set KVM_CAP_MAX_VCPUS) to specify the value (this patch)
option 2. Add max_vcpus as a parameter to guest initialization
(TDH.MNG.INIT)

The overall flow is: create the VM (KVM_CREATE_VM), create vcpus
(KVM_CREATE_VCPU), initialize the TDX part of the VM, and initialize
the TDX part of each vcpu. Because vcpu creation is independent of TDX
VM initialization, choose option 1.
The workflow then becomes: KVM_CREATE_VM, set KVM_CAP_MAX_VCPUS, and
KVM_CREATE_VCPU.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v18:
- use TDX instead of "x86, tdx" in subject
- use min(max_vcpu, TDX_MAX_VCPU) instead of
min3(max_vcpu, KVM_MAX_VCPU, TDX_MAX_VCPU)
- make "if (KVM_MAX_VCPU) and if (TDX_MAX_VCPU)" into one if statement
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/main.c | 22 ++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.c | 29 +++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 5 +++++
arch/x86/kvm/x86.c | 4 ++++
6 files changed, 64 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 00b371d9a1ca..1b0dacc6b6c0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -21,6 +21,8 @@ KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
+KVM_X86_OP_OPTIONAL(max_vcpus)
+KVM_X86_OP_OPTIONAL(vm_enable_cap)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37cda8aa07b6..cf8eb46b3a20 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1604,7 +1604,9 @@ struct kvm_x86_ops {
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

bool (*is_vm_type_supported)(unsigned long vm_type);
+ int (*max_vcpus)(struct kvm *kvm);
unsigned int vm_size;
+ int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 082e82ce6580..e8a1a7533eea 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,7 @@
#include "nested.h"
#include "pmu.h"
#include "tdx.h"
+#include "tdx_arch.h"

static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);
@@ -16,6 +17,17 @@ static bool vt_is_vm_type_supported(unsigned long type)
(enable_tdx && tdx_is_vm_type_supported(type));
}

+static int vt_max_vcpus(struct kvm *kvm)
+{
+ if (!kvm)
+ return KVM_MAX_VCPUS;
+
+ if (is_td(kvm))
+ return min(kvm->max_vcpus, TDX_MAX_VCPUS);
+
+ return kvm->max_vcpus;
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -39,6 +51,14 @@ static void vt_hardware_unsetup(void)
vmx_hardware_unsetup();
}

+static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ if (is_td(kvm))
+ return tdx_vm_enable_cap(kvm, cap);
+
+ return -EINVAL;
+}
+
static int vt_vm_init(struct kvm *kvm)
{
if (is_td(kvm))
@@ -77,7 +97,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.has_emulated_msr = vmx_has_emulated_msr,

.is_vm_type_supported = vt_is_vm_type_supported,
+ .max_vcpus = vt_max_vcpus,
.vm_size = sizeof(struct kvm_vmx),
+ .vm_enable_cap = vt_vm_enable_cap,
.vm_init = vt_vm_init,
.vm_destroy = vmx_vm_destroy,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 816ccdb4bc41..ee015f3ce2c9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -56,6 +56,35 @@ struct tdx_info {
/* Info about the TDX module. */
static struct tdx_info *tdx_info;

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ int r;
+
+ switch (cap->cap) {
+ case KVM_CAP_MAX_VCPUS: {
+ if (cap->flags || cap->args[0] == 0)
+ return -EINVAL;
+ if (cap->args[0] > KVM_MAX_VCPUS ||
+ cap->args[0] > TDX_MAX_VCPUS)
+ return -E2BIG;
+
+ mutex_lock(&kvm->lock);
+ if (kvm->created_vcpus)
+ r = -EBUSY;
+ else {
+ kvm->max_vcpus = cap->args[0];
+ r = 0;
+ }
+ mutex_unlock(&kvm->lock);
+ break;
+ }
+ default:
+ r = -EINVAL;
+ break;
+ }
+ return r;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f6c57ad44f80..0031a8d61589 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -139,12 +139,17 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }

+static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ return -EINVAL;
+}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c459a5e9e520..3f16e9450d2f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4726,6 +4726,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
break;
case KVM_CAP_MAX_VCPUS:
r = KVM_MAX_VCPUS;
+ if (kvm_x86_ops.max_vcpus)
+ r = static_call(kvm_x86_max_vcpus)(kvm);
break;
case KVM_CAP_MAX_VCPU_ID:
r = KVM_MAX_VCPU_IDS;
@@ -6709,6 +6711,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
break;
default:
r = -EINVAL;
+ if (kvm_x86_ops.vm_enable_cap)
+ r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
break;
}
return r;
--
2.25.1


2024-02-26 08:40:44

by Isaku Yamahata

Subject: [PATCH v19 040/130] KVM: TDX: Make pmu_intel.c ignore guest TD case

From: Isaku Yamahata <[email protected]>

Because TDX KVM doesn't support PMU yet (it is planned as a follow-up
patch series) and pmu_intel.c touches VMX-specific structures during
vcpu initialization, as a workaround add a dummy structure to struct
vcpu_tdx so that pmu_intel.c can ignore the TDX case.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v18:
- Removed unnecessary change to vmx.c which caused kernel warning.
---
arch/x86/kvm/vmx/pmu_intel.c | 46 +++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/pmu_intel.h | 28 ++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 8 ++++++-
arch/x86/kvm/vmx/vmx.h | 32 +------------------------
4 files changed, 81 insertions(+), 33 deletions(-)
create mode 100644 arch/x86/kvm/vmx/pmu_intel.h

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 315c7c2ba89b..7e6e51efd879 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -19,6 +19,7 @@
#include "lapic.h"
#include "nested.h"
#include "pmu.h"
+#include "tdx.h"

#define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)

@@ -68,6 +69,26 @@ static int fixed_pmc_events[] = {
[2] = PSEUDO_ARCH_REFERENCE_CYCLES,
};

+struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_TDX_HOST
+ if (is_td_vcpu(vcpu))
+ return &to_tdx(vcpu)->lbr_desc;
+#endif
+
+ return &to_vmx(vcpu)->lbr_desc;
+}
+
+struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_TDX_HOST
+ if (is_td_vcpu(vcpu))
+ return &to_tdx(vcpu)->lbr_desc.records;
+#endif
+
+ return &to_vmx(vcpu)->lbr_desc.records;
+}
+
static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
{
struct kvm_pmc *pmc;
@@ -179,6 +200,23 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
}

+bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+ return cpuid_model_is_consistent(vcpu);
+}
+
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
+{
+ struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
+
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return lbr->nr && (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_LBR_FMT);
+}
+
static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
{
struct x86_pmu_lbr *records = vcpu_to_lbr_records(vcpu);
@@ -244,6 +282,9 @@ static inline void intel_pmu_release_guest_lbr_event(struct kvm_vcpu *vcpu)
{
struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);

+ if (is_td_vcpu(vcpu))
+ return;
+
if (lbr_desc->event) {
perf_event_release_kernel(lbr_desc->event);
lbr_desc->event = NULL;
@@ -285,6 +326,9 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu)
PERF_SAMPLE_BRANCH_USER,
};

+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return 0;
+
if (unlikely(lbr_desc->event)) {
__set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use);
return 0;
@@ -578,7 +622,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);

perf_capabilities = vcpu_get_perf_capabilities(vcpu);
- if (cpuid_model_is_consistent(vcpu) &&
+ if (intel_pmu_lbr_is_compatible(vcpu) &&
(perf_capabilities & PMU_CAP_LBR_FMT))
x86_perf_get_lbr(&lbr_desc->records);
else
diff --git a/arch/x86/kvm/vmx/pmu_intel.h b/arch/x86/kvm/vmx/pmu_intel.h
new file mode 100644
index 000000000000..66bba47c1269
--- /dev/null
+++ b/arch/x86/kvm/vmx/pmu_intel.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_PMU_INTEL_H
+#define __KVM_X86_VMX_PMU_INTEL_H
+
+struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu);
+struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu);
+
+bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
+int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
+
+struct lbr_desc {
+ /* Basic info about guest LBR records. */
+ struct x86_pmu_lbr records;
+
+ /*
+ * Emulate LBR feature via passthrough LBR registers when the
+ * per-vcpu guest LBR event is scheduled on the current pcpu.
+ *
+ * The records may be inaccurate if the host reclaims the LBR.
+ */
+ struct perf_event *event;
+
+ /* True if LBRs are marked as not intercepted in the MSR bitmap */
+ bool msr_passthrough;
+};
+
+#endif /* __KVM_X86_VMX_PMU_INTEL_H */
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 184fe394da86..173ed19207fb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -4,6 +4,7 @@

#ifdef CONFIG_INTEL_TDX_HOST

+#include "pmu_intel.h"
#include "tdx_ops.h"

struct kvm_tdx {
@@ -21,7 +22,12 @@ struct kvm_tdx {

struct vcpu_tdx {
struct kvm_vcpu vcpu;
- /* TDX specific members follow. */
+
+ /*
+ * Dummy to make pmu_intel not corrupt memory.
+ * TODO: Support PMU for TDX. Future work.
+ */
+ struct lbr_desc lbr_desc;
};

static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e3b0985bb74a..04ed2a9eada1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
#include "capabilities.h"
#include "../kvm_cache_regs.h"
#include "posted_intr.h"
+#include "pmu_intel.h"
#include "vmcs.h"
#include "vmx_ops.h"
#include "../cpuid.h"
@@ -93,22 +94,6 @@ union vmx_exit_reason {
u32 full;
};

-struct lbr_desc {
- /* Basic info about guest LBR records. */
- struct x86_pmu_lbr records;
-
- /*
- * Emulate LBR feature via passthrough LBR registers when the
- * per-vcpu guest LBR event is scheduled on the current pcpu.
- *
- * The records may be inaccurate if the host reclaims the LBR.
- */
- struct perf_event *event;
-
- /* True if LBRs are marked as not intercepted in the MSR bitmap */
- bool msr_passthrough;
-};
-
/*
* The nested_vmx structure is part of vcpu_vmx, and holds information we need
* for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -659,21 +644,6 @@ static __always_inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
}

-static inline struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
-{
- return &to_vmx(vcpu)->lbr_desc;
-}
-
-static inline struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
-{
- return &vcpu_to_lbr_desc(vcpu)->records;
-}
-
-static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
-{
- return !!vcpu_to_lbr_records(vcpu)->nr;
-}
-
void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu);
--
2.25.1


2024-02-26 08:40:46

by Isaku Yamahata

Subject: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

From: Isaku Yamahata <[email protected]>

TDX requires additional parameters for a TDX VM for confidential
execution, to protect the confidentiality of its memory contents and
CPU state from any other software, including the VMM. When creating a
guest TD VM, before creating any vcpu, the VMM must specify the number
of vcpus, the TSC frequency (the value is the same for all vcpus and
can't change), and the CPUIDs which the TDX module emulates. Guest TDs
can trust those CPUIDs, and the sha384 values go into the measurement.

Add a new subcommand, KVM_TDX_INIT_VM, to pass parameters for the TDX
guest. It assigns an encryption key to the TDX guest for memory
encryption; TDX encrypts memory on a per-guest basis. The device model,
say qemu, passes per-VM parameters for the TDX guest: the maximum
number of vcpus, the TSC frequency (a TDX guest has a fixed VM-wide TSC
frequency, not per-vcpu, and the guest can not change it), attributes
(production or debug), available extended features (which configure
guest XCR0 and the IA32_XSS MSR), CPUIDs, sha384 measurements, etc.

Call this subcommand before creating vcpus and before KVM_SET_CPUID2,
i.e. while CPUID configurations aren't available yet, so the CPUID
configuration values need to be passed in struct kvm_tdx_init_vm. It is
the device model's responsibility to keep the CPUID configs consistent
between KVM_TDX_INIT_VM and KVM_SET_CPUID2.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- Check NO_RBP_MOD of feature0 and set it
- Update the comment for PT and CET

v18:
- remove the change of tools/arch/x86/include/uapi/asm/kvm.h
- typo in comment. sha348 => sha384
- updated comment in setup_tdparams_xfam()
- fix setup_tdparams_xfam() to use init_vm instead of td_params

v15 -> v16:
- Removed AMX check as the KVM upstream supports AMX.
- Added CET flag to guest supported xss

v14 -> v15:
- add check if the reserved area of init_vm is zero

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 27 ++++
arch/x86/kvm/cpuid.c | 7 +
arch/x86/kvm/cpuid.h | 2 +
arch/x86/kvm/vmx/tdx.c | 273 ++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.h | 18 +++
arch/x86/kvm/vmx/tdx_arch.h | 6 +
6 files changed, 323 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e28189c81691..9ac0246bd974 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -570,6 +570,7 @@ struct kvm_pmu_event_filter {
/* Trust Domain eXtension sub-ioctl() commands. */
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,

KVM_TDX_CMD_NR_MAX,
};
@@ -621,4 +622,30 @@ struct kvm_tdx_capabilities {
struct kvm_tdx_cpuid_config cpuid_configs[];
};

+struct kvm_tdx_init_vm {
+ __u64 attributes;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ /*
+ * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
+ * This should be enough given sizeof(TD_PARAMS) = 1024.
+ * 8KB was chosen because
+ * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
+ */
+ __u64 reserved[1004];
+
+ /*
+ * Call KVM_TDX_INIT_VM before vcpu creation, thus before
+ * KVM_SET_CPUID2.
+ * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
+ * TDX module directly virtualizes those CPUIDs without VMM. The user
+ * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
+ * those values. If it doesn't, KVM may have a wrong idea of the
+ * guest's CPUIDs, and may wrongly emulate CPUIDs or MSRs that the TDX
+ * module doesn't virtualize.
+ */
+ struct kvm_cpuid2 cpuid;
+};
+
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..8cdcd6f406aa 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1443,6 +1443,13 @@ int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
return r;
}

+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(
+ struct kvm_cpuid_entry2 *entries, int nent, u32 function, u64 index)
+{
+ return cpuid_entry2_find(entries, nent, function, index);
+}
+EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry2);
+
struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
u32 function, u32 index)
{
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 856e3037e74f..215d1c68c6d1 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -13,6 +13,8 @@ void kvm_set_cpu_caps(void);

void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
void kvm_update_pv_runtime(struct kvm_vcpu *vcpu);
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
+ int nent, u32 function, u64 index);
struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
u32 function, u32 index);
struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1cf2b15da257..b11f105db3cd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,7 +8,6 @@
#include "mmu.h"
#include "tdx_arch.h"
#include "tdx.h"
-#include "tdx_ops.h"
#include "x86.h"

#undef pr_fmt
@@ -350,18 +349,21 @@ static int tdx_do_tdh_mng_key_config(void *param)
return 0;
}

-static int __tdx_td_init(struct kvm *kvm);
-
int tdx_vm_init(struct kvm *kvm)
{
+ /*
+ * This function initializes only the KVM software constructs. It
+ * doesn't initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID, etc.
+ * That is handled by KVM_TDX_INIT_VM, __tdx_td_init().
+ */
+
/*
* TDX has its own limit of the number of vcpus in addition to
* KVM_MAX_VCPUS.
*/
kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);

- /* Place holder for TDX specific logic. */
- return __tdx_td_init(kvm);
+ return 0;
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -416,9 +418,162 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
return ret;
}

-static int __tdx_td_init(struct kvm *kvm)
+static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
+ struct td_params *td_params)
+{
+ const struct kvm_cpuid_entry2 *entry;
+ int max_pa = 36;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
+ if (entry)
+ max_pa = entry->eax & 0xff;
+
+ td_params->eptp_controls = VMX_EPTP_MT_WB;
+ /*
+ * No CPU supports 4-level && max_pa > 48.
+ * "5-level paging and 5-level EPT" section 4.1 4-level EPT
+ * "4-level EPT is limited to translating 48-bit guest-physical
+ * addresses."
+ * cpu_has_vmx_ept_5levels() check is just in case.
+ */
+ if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
+ return -EINVAL;
+ if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+ td_params->eptp_controls |= VMX_EPTP_PWL_5;
+ td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
+ } else {
+ td_params->eptp_controls |= VMX_EPTP_PWL_4;
+ }
+
+ return 0;
+}
+
+static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
+ struct td_params *td_params)
+{
+ int i;
+
+ /*
+ * td_params.cpuid_values: The number and the order of cpuid_values must
+ * be the same as those of struct tdsysinfo.{num_cpuid_config, cpuid_configs}.
+ * It's assumed that td_params was zeroed.
+ */
+ for (i = 0; i < tdx_info->num_cpuid_config; i++) {
+ const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
+ /* KVM_TDX_CPUID_NO_SUBLEAF means index = 0. */
+ u32 index = c->sub_leaf == KVM_TDX_CPUID_NO_SUBLEAF ? 0 : c->sub_leaf;
+ const struct kvm_cpuid_entry2 *entry =
+ kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
+ c->leaf, index);
+ struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
+
+ if (!entry)
+ continue;
+
+ /*
+ * tdsysinfo.cpuid_configs[].{eax, ebx, ecx, edx}
+ * bit 1 means it can be configured to zero or one.
+ * bit 0 means it must be zero.
+ * Mask out non-configurable bits.
+ */
+ value->eax = entry->eax & c->eax;
+ value->ebx = entry->ebx & c->ebx;
+ value->ecx = entry->ecx & c->ecx;
+ value->edx = entry->edx & c->edx;
+ }
+}
+
+static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
+{
+ const struct kvm_cpuid_entry2 *entry;
+ u64 guest_supported_xcr0;
+ u64 guest_supported_xss;
+
+ /* Setup td_params.xfam */
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
+ if (entry)
+ guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
+ else
+ guest_supported_xcr0 = 0;
+ guest_supported_xcr0 &= kvm_caps.supported_xcr0;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
+ if (entry)
+ guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
+ else
+ guest_supported_xss = 0;
+
+ /*
+ * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
+ * and, CET support.
+ */
+ guest_supported_xss &=
+ (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
+
+ td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
+ if (td_params->xfam & XFEATURE_MASK_LBR) {
+ /*
+ * TODO: once KVM supports LBR(save/restore LBR related
+ * registers around TDENTER), remove this guard.
+ */
+#define MSG_LBR "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH properly.\n"
+ pr_warn(MSG_LBR);
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+ struct kvm_tdx_init_vm *init_vm)
+{
+ struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
+ int ret;
+
+ if (kvm->created_vcpus)
+ return -EBUSY;
+
+ if (init_vm->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
+ /*
+ * TODO: save/restore PMU related registers around TDENTER.
+ * Once it's done, remove this guard.
+ */
+#define MSG_PERFMON "TD doesn't support perfmon yet. KVM needs to save/restore host perf registers properly.\n"
+ pr_warn(MSG_PERFMON);
+ return -EOPNOTSUPP;
+ }
+
+ td_params->max_vcpus = kvm->max_vcpus;
+ td_params->attributes = init_vm->attributes;
+ td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD;
+ td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);
+
+ ret = setup_tdparams_eptp_controls(cpuid, td_params);
+ if (ret)
+ return ret;
+ setup_tdparams_cpuids(cpuid, td_params);
+ ret = setup_tdparams_xfam(cpuid, td_params);
+ if (ret)
+ return ret;
+
+#define MEMCPY_SAME_SIZE(dst, src) \
+ do { \
+ BUILD_BUG_ON(sizeof(dst) != sizeof(src)); \
+ memcpy((dst), (src), sizeof(dst)); \
+ } while (0)
+
+ MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
+ MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
+ MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
+
+ return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
+ u64 *seamcall_err)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_module_args out;
cpumask_var_t packages;
unsigned long *tdcs_pa = NULL;
unsigned long tdr_pa = 0;
@@ -426,6 +581,7 @@ static int __tdx_td_init(struct kvm *kvm)
int ret, i;
u64 err;

+ *seamcall_err = 0;
ret = tdx_guest_keyid_alloc();
if (ret < 0)
return ret;
@@ -540,10 +696,23 @@ static int __tdx_td_init(struct kvm *kvm)
}
}

- /*
- * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
- * ioctl() to define the configure CPUID values for the TD.
- */
+ err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
+ if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_INVALID) {
+ /*
+ * Because a user gives operands, don't warn.
+ * Return a hint to the user because it's sometimes hard for the
+ * user to figure out which operand is invalid. SEAMCALL status
+ * code includes which operand caused invalid operand error.
+ */
+ *seamcall_err = err;
+ ret = -EINVAL;
+ goto teardown;
+ } else if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_INIT, err, &out);
+ ret = -EIO;
+ goto teardown;
+ }
+
return 0;

/*
@@ -586,6 +755,76 @@ static int __tdx_td_init(struct kvm *kvm)
return ret;
}

+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_tdx_init_vm *init_vm = NULL;
+ struct td_params *td_params = NULL;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(*init_vm) != 8 * 1024);
+ BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+ if (is_hkid_assigned(kvm_tdx))
+ return -EINVAL;
+
+ if (cmd->flags)
+ return -EINVAL;
+
+ init_vm = kzalloc(sizeof(*init_vm) +
+ sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
+ GFP_KERNEL);
+ if (!init_vm)
+ return -ENOMEM;
+ if (copy_from_user(init_vm, (void __user *)cmd->data, sizeof(*init_vm))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (init_vm->cpuid.nent > KVM_MAX_CPUID_ENTRIES) {
+ ret = -E2BIG;
+ goto out;
+ }
+ if (copy_from_user(init_vm->cpuid.entries,
+ (void __user *)cmd->data + sizeof(*init_vm),
+ flex_array_size(init_vm, cpuid.entries, init_vm->cpuid.nent))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ if (memchr_inv(init_vm->reserved, 0, sizeof(init_vm->reserved))) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (init_vm->cpuid.padding) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL);
+ if (!td_params) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = setup_tdparams(kvm, td_params, init_vm);
+ if (ret)
+ goto out;
+
+ ret = __tdx_td_init(kvm, td_params, &cmd->error);
+ if (ret)
+ goto out;
+
+ kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+ kvm_tdx->attributes = td_params->attributes;
+ kvm_tdx->xfam = td_params->xfam;
+
+out:
+ /* kfree() accepts NULL. */
+ kfree(init_vm);
+ kfree(td_params);
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -602,6 +841,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_CAPABILITIES:
r = tdx_get_capabilities(&tdx_cmd);
break;
+ case KVM_TDX_INIT_VM:
+ r = tdx_td_init(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -725,6 +967,17 @@ static int __init tdx_module_setup(void)

tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;

+ /*
+ * Make TDH.VP.ENTER preserve RBP so that the stack unwinder
+ * always works. Query whether the TDX module supports this feature.
+ */
+ if (!(tdx_info->features0 & MD_FIELD_ID_FEATURES0_NO_RBP_MOD) &&
+ !IS_ENABLED(CONFIG_FRAME_POINTER)) {
+ pr_err("Too old version of TDX module. Consider upgrade.\n");
+ ret = -EOPNOTSUPP;
+ goto error_out;
+ }
+
return 0;

error_out:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ae117f864cfb..184fe394da86 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -12,7 +12,11 @@ struct kvm_tdx {
unsigned long tdr_pa;
unsigned long *tdcs_pa;

+ u64 attributes;
+ u64 xfam;
int hkid;
+
+ u64 tsc_offset;
};

struct vcpu_tdx {
@@ -39,6 +43,20 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
{
return container_of(vcpu, struct vcpu_tdx, vcpu);
}
+
+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+ struct tdx_module_args out;
+ u64 err;
+
+ err = tdh_mng_rd(kvm_tdx->tdr_pa, TDCS_EXEC(field), &out);
+ if (unlikely(err)) {
+ pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err);
+ return 0;
+ }
+ return out.r8;
+}
+
#else
struct kvm_tdx {
struct kvm kvm;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index e2c1a6f429d7..efc3c61c14ab 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -117,6 +117,12 @@ struct tdx_cpuid_value {
#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)

+/*
+ * TODO: Once XFEATURE_CET_{U, S} in arch/x86/include/asm/fpu/types.h are
+ * defined, replace these with the defined values.
+ */
+#define TDX_TD_XFAM_CET (BIT(11) | BIT(12))
+
/*
* TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
*/
--
2.25.1


2024-02-26 08:41:26

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 041/130] KVM: TDX: Refuse to unplug the last cpu on the package

From: Isaku Yamahata <[email protected]>

In order to reclaim a TDX HKID (i.e. when deleting a guest TD), KVM needs
to call TDH.PHYMEM.PAGE.WBINVD on all packages. If there is an active TDX
HKID, refuse to offline the last online cpu of a package, to guarantee at
least one online CPU per package. Add an arch callback for cpu offline.
Because TDX doesn't support suspend, this also refuses suspend if TDs are
running. If no TD is running, suspend is allowed.
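The per-package check described above can be sketched in plain C. This is a userspace model, not kernel code: `struct cpu`, `can_offline()` and the flat array are illustrative stand-ins for the kernel's cpumask and topology helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel's CPU topology data. */
struct cpu {
	int package;	/* topology_physical_package_id() */
	bool online;
};

/*
 * Model of tdx_offline_cpu(): with an active HKID, a CPU may go
 * offline only if another online CPU shares its package, so that
 * TDH.PHYMEM.PAGE.WBINVD can still reach every package.
 */
static bool can_offline(const struct cpu *cpus, size_t nr_cpus,
			size_t target, int nr_configured_hkid)
{
	if (!nr_configured_hkid)
		return true;	/* no TD running: any CPU may go offline */

	for (size_t i = 0; i < nr_cpus; i++) {
		if (i == target || !cpus[i].online)
			continue;
		if (cpus[i].package == cpus[target].package)
			return true;	/* package keeps an online CPU */
	}
	return false;	/* target is the last online CPU of its package */
}
```

The kernel code achieves the same result without an O(n) inner loop by collecting the package IDs of all other online CPUs into a cpumask and testing the target's package against it.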

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v18:
- Added reviewed-by BinBin
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/tdx.c | 41 ++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
arch/x86/kvm/x86.c | 5 ++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 12 +++++++--
8 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 97f0a681e02c..f78200492a3d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -18,6 +18,7 @@ KVM_X86_OP(check_processor_compatibility)
KVM_X86_OP(hardware_enable)
KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
+KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6dedb5cb71ef..0e2408a4707e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1600,6 +1600,7 @@ struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
void (*hardware_unsetup)(void);
+ int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 437c6d5e802e..d69dd474775b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -110,6 +110,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_processor_compatibility = vmx_check_processor_compat,

.hardware_unsetup = vt_hardware_unsetup,
+ .offline_cpu = tdx_offline_cpu,

/* TDX cpu enablement is done by tdx_hardware_setup(). */
.hardware_enable = vmx_hardware_enable,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b11f105db3cd..f2ee5abac14e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -97,6 +97,7 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
*/
static DEFINE_MUTEX(tdx_lock);
static struct mutex *tdx_mng_key_config_lock;
+static atomic_t nr_configured_hkid;

static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
{
@@ -112,6 +113,7 @@ static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
{
tdx_guest_keyid_free(kvm_tdx->hkid);
kvm_tdx->hkid = -1;
+ atomic_dec(&nr_configured_hkid);
}

static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
@@ -586,6 +588,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
if (ret < 0)
return ret;
kvm_tdx->hkid = ret;
+ atomic_inc(&nr_configured_hkid);

va = __get_free_page(GFP_KERNEL_ACCOUNT);
if (!va)
@@ -1071,3 +1074,41 @@ void tdx_hardware_unsetup(void)
kfree(tdx_info);
kfree(tdx_mng_key_config_lock);
}
+
+int tdx_offline_cpu(void)
+{
+ int curr_cpu = smp_processor_id();
+ cpumask_var_t packages;
+ int ret = 0;
+ int i;
+
+ /* No TD is running. Allow any cpu to be offline. */
+ if (!atomic_read(&nr_configured_hkid))
+ return 0;
+
+ /*
+ * In order to reclaim a TDX HKID (i.e. when deleting a guest TD), KVM
+ * needs to call TDH.PHYMEM.PAGE.WBINVD on all packages to program all
+ * memory controllers with PCONFIG. If there is an active TDX HKID,
+ * refuse to offline the last online cpu of a package.
+ */
+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+ return -ENOMEM;
+ for_each_online_cpu(i) {
+ if (i != curr_cpu)
+ cpumask_set_cpu(topology_physical_package_id(i), packages);
+ }
+ /* Check if this cpu is the last online cpu of this package. */
+ if (!cpumask_test_cpu(topology_physical_package_id(curr_cpu), packages))
+ ret = -EBUSY;
+ free_cpumask_var(packages);
+ if (ret)
+ /*
+ * Because it's hard for a human operator to understand the
+ * reason, print a ratelimited warning.
+ */
+#define MSG_ALLPKG_ONLINE \
+ "TDX requires all packages to have an online CPU. Delete all TDs in order to offline all CPUs of a package.\n"
+ pr_warn_ratelimited(MSG_ALLPKG_ONLINE);
+ return ret;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7f123ffe4d42..33ab6800eab8 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -138,6 +138,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);
+int tdx_offline_cpu(void);

int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int tdx_vm_init(struct kvm *kvm);
@@ -148,6 +149,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
+static inline int tdx_offline_cpu(void) { return 0; }

static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f01c7f91652..e27ea5ed2968 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12508,6 +12508,11 @@ void kvm_arch_hardware_disable(void)
drop_user_return_notifiers();
}

+int kvm_arch_offline_cpu(unsigned int cpu)
+{
+ return static_call(kvm_x86_offline_cpu)();
+}
+
bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
{
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eeaf4e73317c..813952eca1bf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1502,6 +1502,7 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
int kvm_arch_hardware_enable(void);
void kvm_arch_hardware_disable(void);
#endif
+int kvm_arch_offline_cpu(unsigned int cpu);
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2f0a8e28795e..de38f308738e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5653,13 +5653,21 @@ static void hardware_disable_nolock(void *junk)
__this_cpu_write(hardware_enabled, false);
}

+__weak int kvm_arch_offline_cpu(unsigned int cpu)
+{
+ return 0;
+}
+
static int kvm_offline_cpu(unsigned int cpu)
{
+ int r = 0;
+
mutex_lock(&kvm_lock);
- if (kvm_usage_count)
+ r = kvm_arch_offline_cpu(cpu);
+ if (!r && kvm_usage_count)
hardware_disable_nolock(NULL);
mutex_unlock(&kvm_lock);
- return 0;
+ return r;
}

static void hardware_disable_all_nolock(void)
--
2.25.1


2024-02-26 08:41:37

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 043/130] KVM: TDX: create/free TDX vcpu structure

From: Isaku Yamahata <[email protected]>

The next step of TDX guest creation is to create vcpus. Create the TDX
vcpu structures and initialize the parts that don't require TDX SEAMCALLs.
TDX-specific vcpu initialization will be implemented as an independent
KVM_TDX_INIT_VCPU sub-command, so that when an error occurs it's easy to
determine which component, KVM or the TDX module, has the issue.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- removed stale comment in tdx_vcpu_create().

v18:
- update commit log to use create instead of allocate because the patch
doesn't newly allocate memory for TDX vcpu.

v15 -> v16:
- Add AMX support as the KVM upstream supports it.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 44 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 44 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 10 +++++++++
arch/x86/kvm/x86.c | 2 ++
4 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d69dd474775b..5796fb45433f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -87,6 +87,42 @@ static void vt_vm_free(struct kvm *kvm)
tdx_vm_free(kvm);
}

+static int vt_vcpu_precreate(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_vcpu_precreate(kvm);
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_create(vcpu);
+
+ return vmx_vcpu_create(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_free(vcpu);
+ return;
+ }
+
+ vmx_vcpu_free(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_reset(vcpu, init_event);
+ return;
+ }
+
+ vmx_vcpu_reset(vcpu, init_event);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -126,10 +162,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vm_destroy = vt_vm_destroy,
.vm_free = vt_vm_free,

- .vcpu_precreate = vmx_vcpu_precreate,
- .vcpu_create = vmx_vcpu_create,
- .vcpu_free = vmx_vcpu_free,
- .vcpu_reset = vmx_vcpu_reset,
+ .vcpu_precreate = vt_vcpu_precreate,
+ .vcpu_create = vt_vcpu_create,
+ .vcpu_free = vt_vcpu_free,
+ .vcpu_reset = vt_vcpu_reset,

.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
.vcpu_load = vmx_vcpu_load,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f2ee5abac14e..51283d2cd011 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -368,6 +368,50 @@ int tdx_vm_init(struct kvm *kvm)
return 0;
}

+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+ WARN_ON_ONCE(vcpu->arch.cpuid_entries);
+ WARN_ON_ONCE(vcpu->arch.cpuid_nent);
+
+ /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
+ if (!vcpu->arch.apic)
+ return -EINVAL;
+
+ fpstate_set_confidential(&vcpu->arch.guest_fpu);
+
+ vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+ vcpu->arch.cr0_guest_owned_bits = -1ul;
+ vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+ vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
+ vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+ vcpu->arch.guest_state_protected =
+ !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+
+ if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
+ vcpu->arch.xfd_no_write_intercept = true;
+
+ return 0;
+}
+
+void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ /* This is a stub for now. More logic will come. */
+}
+
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ /* Ignore INIT silently because TDX doesn't support INIT event. */
+ if (init_event)
+ return;
+
+ /* This is a stub for now. More logic will come here. */
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 33ab6800eab8..bb73a9b5b354 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -144,7 +144,12 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int tdx_vm_init(struct kvm *kvm);
void tdx_mmu_release_hkid(struct kvm *kvm);
void tdx_vm_free(struct kvm *kvm);
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+
+int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -158,7 +163,12 @@ static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
static inline void tdx_vm_free(struct kvm *kvm) {}
+
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e27ea5ed2968..c002761bb662 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -502,6 +502,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
kvm_recalculate_apic_map(vcpu->kvm);
return 0;
}
+EXPORT_SYMBOL_GPL(kvm_set_apic_base);

/*
* Handle a fault on a hardware virtualization (VMX or SVM) instruction.
@@ -12517,6 +12518,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
{
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);

bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
{
--
2.25.1


2024-02-26 08:41:52

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 042/130] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch series for TD vcpu
creation/destruction.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 098150da6ea2..25082e9c0b20 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -9,7 +9,7 @@ Layer status
What qemu can do
----------------
- TDX VM TYPE is exposed to Qemu.
-- Qemu can try to create VM of TDX VM type and then fails.
+- Qemu can create/destroy guest of TDX vm type.

Patch Layer status
------------------
@@ -17,8 +17,8 @@ Patch Layer status

* TDX, VMX coexistence: Applied
* TDX architectural definitions: Applied
-* TD VM creation/destruction: Applying
-* TD vcpu creation/destruction: Not yet
+* TD VM creation/destruction: Applied
+* TD vcpu creation/destruction: Applying
* TDX EPT violation: Not yet
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
--
2.25.1


2024-02-26 08:42:24

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 045/130] [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch series for KVM MMU GPA
shared bits.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 25082e9c0b20..8b8186e7bfeb 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -10,6 +10,7 @@ What qemu can do
----------------
- TDX VM TYPE is exposed to Qemu.
- Qemu can create/destroy guest of TDX vm type.
+- Qemu can create/destroy vcpu of TDX vm type.

Patch Layer status
------------------
@@ -18,12 +19,12 @@ Patch Layer status
* TDX, VMX coexistence: Applied
* TDX architectural definitions: Applied
* TD VM creation/destruction: Applied
-* TD vcpu creation/destruction: Applying
+* TD vcpu creation/destruction: Applied
* TDX EPT violation: Not yet
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet

-* KVM MMU GPA shared bits: Not yet
+* KVM MMU GPA shared bits: Applying
* KVM TDP refactoring for TDX: Not yet
* KVM TDP MMU hooks: Not yet
--
2.25.1


2024-02-26 08:42:38

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 046/130] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA

From: Isaku Yamahata <[email protected]>

TDX repurposes one GPA bit (bit 51 or bit 47, depending on configuration)
to indicate whether the GPA is private (if cleared) or shared (if set)
with the VMM. If the GPA.shared bit is set, the GPA is covered by the
existing conventional EPT pointed to by the EPTP. If the GPA.shared bit is
cleared, the GPA is covered by the TDX module, and the VMM has to issue
SEAMCALLs to operate on it.

Add a member to remember the GPA shared bit for each guest TD, add address
conversion functions between private GPA and shared GPA, and add a test
for whether a GPA is private.

Because struct kvm_arch (or struct kvm, which includes struct kvm_arch;
see kvm_arch_alloc_vm(), which passes __GFP_ZERO) is zero-cleared when
allocated, the new member that remembers the GPA shared bit is guaranteed
to be zero with this patch unless it's initialized explicitly.

                        default or SEV-SNP   TDX: S = (47 or 51) - 12
  gfn_shared_mask       0                    S bit
  kvm_is_private_gpa()  always false         true if GFN has S bit cleared
  kvm_gfn_to_shared()   nop                  set S bit
  kvm_gfn_to_private()  nop                  clear S bit

fault.is_private means that the host page should be obtained from
guest_memfd; is_private_gpa() means that KVM MMU should invoke private MMU
hooks.
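As a sanity check of the helper semantics, the conversions can be modeled in a few lines of userspace C. This is a sketch, not the kernel code: S = 47 (a 4-level EPT TD) is assumed, and `PAGE_SHIFT`, `gpa_to_gfn()` and the mask simply mirror the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;
typedef uint64_t gpa_t;

#define PAGE_SHIFT 12

static gfn_t gpa_to_gfn(gpa_t gpa) { return gpa >> PAGE_SHIFT; }

/* Assume a 4-level EPT TD: the shared bit is GPA bit 47, i.e. GFN bit 35. */
static const gfn_t gfn_shared_mask = 1ULL << (47 - PAGE_SHIFT);

static gfn_t gfn_to_shared(gfn_t gfn)  { return gfn | gfn_shared_mask; }
static gfn_t gfn_to_private(gfn_t gfn) { return gfn & ~gfn_shared_mask; }

/* A GPA is private when a shared mask exists and the S bit is clear. */
static bool is_private_gpa(gpa_t gpa)
{
	return gfn_shared_mask && !(gpa_to_gfn(gpa) & gfn_shared_mask);
}
```

For the default/SEV-SNP case, `gfn_shared_mask` is 0, so `is_private_gpa()` is always false and the conversions degenerate to no-ops, matching the table above.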

Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- Add comment on default vm case.
- Added behavior table in the commit message
- drop CONFIG_KVM_MMU_PRIVATE

v18:
- Added Reviewed-by Binbin

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.c | 5 +++++
3 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5da3c211955d..de6dd42d226f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1505,6 +1505,8 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+ gfn_t gfn_shared_mask;
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d96c93a25b3b..395b55684cb9 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -322,4 +322,37 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return gpa;
return translate_nested_gpa(vcpu, gpa, access, exception);
}
+
+/*
+ * default or SEV-SNP TDX: where S = (47 or 51) - 12
+ * gfn_shared_mask 0 S bit
+ * is_private_gpa() always false true if GPA has S bit cleared
+ * gfn_to_shared() nop set S bit
+ * gfn_to_private() nop clear S bit
+ *
+ * fault.is_private means that host page should be gotten from guest_memfd
+ * is_private_gpa() means that KVM MMU should invoke private MMU hooks.
+ */
+static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
+{
+ return kvm->arch.gfn_shared_mask;
+}
+
+static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
+{
+ return gfn | kvm_gfn_shared_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
+{
+ return gfn & ~kvm_gfn_shared_mask(kvm);
+}
+
+static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
+{
+ gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+ return mask && !(gpa_to_gfn(gpa) & mask);
+}
+
#endif
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index aa1da51b8af7..54e0d4efa2bd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -906,6 +906,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
kvm_tdx->attributes = td_params->attributes;
kvm_tdx->xfam = td_params->xfam;

+ if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
+ kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
+ else
+ kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+
out:
/* kfree() accepts NULL. */
kfree(init_vm);
--
2.25.1


2024-02-26 08:42:40

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 044/130] KVM: TDX: Do TDX specific vcpu initialization

From: Isaku Yamahata <[email protected]>

A TD guest vcpu needs TDX-specific initialization before running.
Repurpose KVM_MEMORY_ENCRYPT_OP as a vcpu-scoped ioctl, add a new
sub-command KVM_TDX_INIT_VCPU, and implement the callback for it.
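Part of that initialization (in the diff below) forces the vcpu's APIC into x2APIC mode. The IA32_APIC_BASE value it constructs can be reproduced with a small userspace sketch; the constant values are mirrored from the kernel headers (apicdef.h / msr-index.h) and are an assumption of this sketch, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural constants, mirrored from the kernel headers. */
#define APIC_DEFAULT_PHYS_BASE	0xfee00000ULL
#define MSR_IA32_APICBASE_BSP	(1ULL << 8)	/* bootstrap processor */
#define XAPIC_ENABLE		(1ULL << 11)
#define X2APIC_ENABLE		(1ULL << 10)
#define LAPIC_MODE_X2APIC	(XAPIC_ENABLE | X2APIC_ENABLE)

/* The IA32_APIC_BASE value KVM_TDX_INIT_VCPU programs for a TDX vcpu. */
static uint64_t tdx_reset_apic_base(int is_bsp)
{
	return APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
	       (is_bsp ? MSR_IA32_APICBASE_BSP : 0);
}
```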

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v18:
- Use tdh_sys_rd() instead of struct tdsysinfo_struct.
- Rename tdx_reclaim_td_page() => tdx_reclaim_control_page()
- Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/main.c | 9 ++
arch/x86/kvm/vmx/tdx.c | 184 ++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 8 ++
arch/x86/kvm/vmx/x86_ops.h | 4 +
arch/x86/kvm/x86.c | 6 +
8 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index f78200492a3d..a8e96804a252 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -129,6 +129,7 @@ KVM_X86_OP(leave_smm)
KVM_X86_OP(enable_smi_window)
#endif
KVM_X86_OP(mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_register_region)
KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0e2408a4707e..5da3c211955d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1778,6 +1778,7 @@ struct kvm_x86_ops {
#endif

int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
+ int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9ac0246bd974..4000a2e087a8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -571,6 +571,7 @@ struct kvm_pmu_event_filter {
enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,

KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5796fb45433f..d0f75020579f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -131,6 +131,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
return tdx_vm_ioctl(kvm, argp);
}

+static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+ if (!is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return tdx_vcpu_ioctl(vcpu, argp);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -291,6 +299,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.get_untagged_addr = vmx_get_untagged_addr,

.mem_enc_ioctl = vt_mem_enc_ioctl,
+ .vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 51283d2cd011..aa1da51b8af7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -49,6 +49,7 @@ struct tdx_info {
u64 xfam_fixed1;

u8 nr_tdcs_pages;
+ u8 nr_tdvpx_pages;

u16 num_cpuid_config;
/* This must the last member. */
@@ -104,6 +105,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
}

+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+ return tdx->td_vcpu_created;
+}
+
static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
{
return kvm_tdx->tdr_pa;
@@ -121,6 +127,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
return kvm_tdx->hkid > 0;
}

+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->finalized;
+}
+
static void tdx_clear_page(unsigned long page_pa)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -399,7 +410,32 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
- /* This is stub for now. More logic will come. */
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ int i;
+
+ /*
+ * This method can be called when vcpu allocation/initialization
+ * has failed. So it's possible that hkid, tdvpx and tdvpr are not assigned
+ * yet.
+ */
+ if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm))) {
+ WARN_ON_ONCE(tdx->tdvpx_pa);
+ WARN_ON_ONCE(tdx->tdvpr_pa);
+ return;
+ }
+
+ if (tdx->tdvpx_pa) {
+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+ if (tdx->tdvpx_pa[i])
+ tdx_reclaim_control_page(tdx->tdvpx_pa[i]);
+ }
+ kfree(tdx->tdvpx_pa);
+ tdx->tdvpx_pa = NULL;
+ }
+ if (tdx->tdvpr_pa) {
+ tdx_reclaim_control_page(tdx->tdvpr_pa);
+ tdx->tdvpr_pa = 0;
+ }
}

void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -408,8 +444,13 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
/* Ignore INIT silently because TDX doesn't support INIT event. */
if (init_event)
return;
+ if (KVM_BUG_ON(is_td_vcpu_created(to_tdx(vcpu)), vcpu->kvm))
+ return;

- /* This is stub for now. More logic will come here. */
+ /*
+ * Don't update mp_state to runnable because more initialization
+ * is needed by KVM_TDX_INIT_VCPU.
+ */
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -904,6 +945,137 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
return r;
}

+/* The VMM can pass one 64-bit auxiliary value to the vcpu via RCX for the guest BIOS. */
+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ unsigned long *tdvpx_pa = NULL;
+ unsigned long tdvpr_pa;
+ unsigned long va;
+ int ret, i;
+ u64 err;
+
+ if (is_td_vcpu_created(tdx))
+ return -EINVAL;
+
+ /*
+ * The vcpu_free method frees allocated pages. Avoid leaving a
+ * partially set up state that the method can't handle.
+ */
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va)
+ return -ENOMEM;
+ tdvpr_pa = __pa(va);
+
+ tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
+ GFP_KERNEL_ACCOUNT);
+ if (!tdvpx_pa) {
+ ret = -ENOMEM;
+ goto free_tdvpr;
+ }
+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va) {
+ ret = -ENOMEM;
+ goto free_tdvpx;
+ }
+ tdvpx_pa[i] = __pa(va);
+ }
+
+ err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
+ if (KVM_BUG_ON(err, vcpu->kvm)) {
+ ret = -EIO;
+ pr_tdx_error(TDH_VP_CREATE, err, NULL);
+ goto free_tdvpx;
+ }
+ tdx->tdvpr_pa = tdvpr_pa;
+
+ tdx->tdvpx_pa = tdvpx_pa;
+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+ err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
+ if (KVM_BUG_ON(err, vcpu->kvm)) {
+ pr_tdx_error(TDH_VP_ADDCX, err, NULL);
+ for (; i < tdx_info->nr_tdvpx_pages; i++) {
+ free_page((unsigned long)__va(tdvpx_pa[i]));
+ tdvpx_pa[i] = 0;
+ }
+ /* vcpu_free method frees TDVPX and TDR donated to TDX */
+ return -EIO;
+ }
+ }
+
+ err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
+ if (KVM_BUG_ON(err, vcpu->kvm)) {
+ pr_tdx_error(TDH_VP_INIT, err, NULL);
+ return -EIO;
+ }
+
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ tdx->td_vcpu_created = true;
+ return 0;
+
+free_tdvpx:
+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+ if (tdvpx_pa[i])
+ free_page((unsigned long)__va(tdvpx_pa[i]));
+ tdvpx_pa[i] = 0;
+ }
+ kfree(tdvpx_pa);
+ tdx->tdvpx_pa = NULL;
+free_tdvpr:
+ if (tdvpr_pa)
+ free_page((unsigned long)__va(tdvpr_pa));
+ tdx->tdvpr_pa = 0;
+
+ return ret;
+}
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+ struct msr_data apic_base_msr;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct kvm_tdx_cmd cmd;
+ int ret;
+
+ if (tdx->initialized)
+ return -EINVAL;
+
+ if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
+ return -EFAULT;
+
+ if (cmd.error)
+ return -EINVAL;
+
+ /* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
+ if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
+ return -EINVAL;
+
+ /*
+ * As TDX requires X2APIC, set the local APIC mode to X2APIC. The user
+ * space VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 via
+ * KVM_SET_CPUID2. Otherwise kvm_set_apic_base() will fail.
+ */
+ apic_base_msr = (struct msr_data) {
+ .host_initiated = true,
+ .data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
+ (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
+ };
+ if (kvm_set_apic_base(vcpu, &apic_base_msr))
+ return -EINVAL;
+
+ ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
+ if (ret)
+ return ret;
+
+ tdx->initialized = true;
+ return 0;
+}
+
#define TDX_MD_MAP(_fid, _ptr) \
{ .fid = MD_FIELD_ID_##_fid, \
.ptr = (_ptr), }
@@ -953,13 +1125,14 @@ static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)

static int __init tdx_module_setup(void)
{
- u16 num_cpuid_config, tdcs_base_size;
+ u16 num_cpuid_config, tdcs_base_size, tdvps_base_size;
int ret;
u32 i;

struct tdx_md_map mds[] = {
TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
+ TDX_MD_MAP(TDVPS_BASE_SIZE, &tdvps_base_size),
};

struct tdx_metadata_field_mapping fields[] = {
@@ -1013,6 +1186,11 @@ static int __init tdx_module_setup(void)
}

tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
+ /*
+ * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
+ * -1 for TDVPR.
+ */
+ tdx_info->nr_tdvpx_pages = tdvps_base_size / PAGE_SIZE - 1;

/*
* Make TDH.VP.ENTER preserve RBP so that the stack unwinder
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 173ed19207fb..d3077151252c 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,12 +17,20 @@ struct kvm_tdx {
u64 xfam;
int hkid;

+ bool finalized;
+
u64 tsc_offset;
};

struct vcpu_tdx {
struct kvm_vcpu vcpu;

+ unsigned long tdvpr_pa;
+ unsigned long *tdvpx_pa;
+ bool td_vcpu_created;
+
+ bool initialized;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index bb73a9b5b354..f5820f617b2e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -150,6 +150,8 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -169,6 +171,8 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+
+static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c002761bb662..2bd4b7c8fa51 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6274,6 +6274,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
case KVM_SET_DEVICE_ATTR:
r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
break;
+ case KVM_MEMORY_ENCRYPT_OP:
+ r = -ENOTTY;
+ if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
+ goto out;
+ r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
+ break;
default:
r = -EINVAL;
}
--
2.25.1


2024-02-26 08:43:16

by Isaku Yamahata

Subject: [PATCH v19 049/130] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE

From: Isaku Yamahata <[email protected]>

The TDX support will need the "suppress #VE" bit (bit 63) set as the
initial value for an SPTE. To reduce the code change size, introduce a new
macro SHADOW_NONPRESENT_VALUE for the initial value of a shadow page table
entry (SPTE) and replace the hard-coded value 0 with it. Initialize shadow
page tables with this value.

The plan is to unconditionally set the "suppress #VE" bit for both AMD and
Intel as: 1) AMD hardware uses bit 63 as NX for present SPTEs and ignores
it for non-present SPTEs; 2) for conventional VMX guests, KVM never
enables "EPT-violation #VE" in the VMCS controls and the "suppress #VE"
bit is ignored by hardware.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 20 +++++++++++++++-----
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.h | 2 ++
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++-------
4 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2becc86c71b2..211c0e72f45d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -567,9 +567,9 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)

if (!is_shadow_present_pte(old_spte) ||
!spte_has_volatile_bits(old_spte))
- __update_clear_spte_fast(sptep, 0ull);
+ __update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
else
- old_spte = __update_clear_spte_slow(sptep, 0ull);
+ old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);

if (!is_shadow_present_pte(old_spte))
return old_spte;
@@ -603,7 +603,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
*/
static void mmu_spte_clear_no_track(u64 *sptep)
{
- __update_clear_spte_fast(sptep, 0ull);
+ __update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
}

static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -1950,7 +1950,8 @@ static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
{
- if (!sp->spt[i])
+ /* sp->spt[i] still holds the initial value from shadow page table allocation */
+ if (sp->spt[i] == SHADOW_NONPRESENT_VALUE)
return 0;

return vcpu->arch.mmu->sync_spte(vcpu, sp, i);
@@ -6204,7 +6205,16 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ /*
+ * When X86_64, initial SEPT entries are initialized with
+ * SHADOW_NONPRESENT_VALUE. Otherwise zeroed. See
+ * mmu_memory_cache_alloc_obj().
+ */
+ if (IS_ENABLED(CONFIG_X86_64))
+ vcpu->arch.mmu_shadow_page_cache.init_value =
+ SHADOW_NONPRESENT_VALUE;
+ if (!vcpu->arch.mmu_shadow_page_cache.init_value)
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4d4e98fe4f35..bebd73cd61bb 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -911,7 +911,7 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
gpa_t pte_gpa;
gfn_t gfn;

- if (WARN_ON_ONCE(!sp->spt[i]))
+ if (WARN_ON_ONCE(sp->spt[i] == SHADOW_NONPRESENT_VALUE))
return 0;

first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..4d1799ba2bf8 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -149,6 +149,8 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+#define SHADOW_NONPRESENT_VALUE 0ULL
+
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6ae19b4ee5b1..bdeb23ff9e71 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -570,7 +570,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* here since the SPTE is going from non-present to non-present. Use
* the raw write helper to avoid an unnecessary check on volatile bits.
*/
- __kvm_tdp_mmu_write_spte(iter->sptep, 0);
+ __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);

return 0;
}
@@ -707,8 +707,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
continue;

if (!shared)
- tdp_mmu_iter_set_spte(kvm, &iter, 0);
- else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+ tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
goto retry;
}
}
@@ -764,8 +764,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;

- tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
- sp->gfn, sp->role.level + 1);
+ tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
+ SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);

return true;
}
@@ -799,7 +799,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

- tdp_mmu_iter_set_spte(kvm, &iter, 0);
+ tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
flush = true;
}

@@ -1226,7 +1226,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present leaf SPTE can never change.
* See handle_changed_spte().
*/
- tdp_mmu_iter_set_spte(kvm, iter, 0);
+ tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);

if (!pte_write(range->arg.pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1


2024-02-26 08:43:30

by Isaku Yamahata

Subject: [PATCH v19 048/130] KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values

From: Sean Christopherson <[email protected]>

Add support to MMU caches for initializing a page with a custom 64-bit
value, e.g. to pre-fill an entire page table with non-zero PTE values.
The functionality will be used by x86 to support Intel's TDX, which needs
to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
faults from getting reflected into the guest (Intel's EPT Violation #VE
architecture made the less than brilliant decision of having the per-PTE
behavior be opt-out instead of opt-in).

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_types.h | 1 +
virt/kvm/kvm_main.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 9d1f7835d8c1..60c8d5c9eab9 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -94,6 +94,7 @@ struct gfn_to_pfn_cache {
struct kvm_mmu_memory_cache {
gfp_t gfp_zero;
gfp_t gfp_custom;
+ u64 init_value;
struct kmem_cache *kmem_cache;
int capacity;
int nobjs;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de38f308738e..d399009ef1d7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -401,12 +401,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
{
+ void *page;
+
gfp_flags |= mc->gfp_zero;

if (mc->kmem_cache)
return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
- else
- return (void *)__get_free_page(gfp_flags);
+
+ page = (void *)__get_free_page(gfp_flags);
+ if (page && mc->init_value)
+ memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
+ return page;
}

int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
@@ -421,6 +426,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
if (WARN_ON_ONCE(!capacity))
return -EIO;

+ /*
+ * Custom init values can be used only for page allocations,
+ * and obviously conflict with __GFP_ZERO.
+ */
+ if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
+ return -EIO;
+
mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
if (!mc->objects)
return -ENOMEM;
--
2.25.1


2024-02-26 08:44:03

by Isaku Yamahata

Subject: [PATCH v19 050/130] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE

From: Sean Christopherson <[email protected]>

For a TD guest, the current way to emulate MMIO doesn't work any more, as
KVM is not able to access the private memory of a TD guest to do the
emulation. Instead, the TD guest expects to receive a #VE when it accesses
MMIO, at which point it can explicitly make a hypercall to KVM to get the
expected information.

To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control. Accordingly, for the MMIO SPTE for a shared GPA:
1. KVM needs to set the "suppress #VE" bit for the non-present SPTE so
   that an EPT violation happens when the TD accesses the MMIO range.
2. On EPT violation, KVM sets the MMIO SPTE to clear the "suppress #VE"
   bit so the TD guest receives a #VE instead of an EPT misconfiguration,
   unlike the VMX case.
For a shared GPA that is not yet populated, an EPT violation needs to be
triggered when the TD guest accesses it. The non-present SPTE value for a
shared GPA should therefore have the "suppress #VE" bit set.

Add the "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE. Unconditionally set the "suppress #VE" bit for both AMD and
Intel as: 1) AMD hardware doesn't use this bit when the present bit is
off; 2) for a normal VMX guest, KVM never enables "EPT-violation #VE" in
the VMCS control and the "suppress #VE" bit is ignored by hardware.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>

---
v19:
- fix typo in the commit message

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 4d1799ba2bf8..26bc95bbc962 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -149,7 +149,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+/*
+ * Non-present SPTE value for both VMX and SVM for TDP MMU.
+ * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
+ * bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
+ * For TDX:
+ * TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif

extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
@@ -196,7 +209,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE 0x5a0ULL
+#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)

/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
--
2.25.1


2024-02-26 08:44:04

by Isaku Yamahata

Subject: [PATCH v19 051/130] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask/shadow_present_mask

From: Isaku Yamahata <[email protected]>

To use the same values of shadow_mmio_mask and shadow_present_mask for
both TDX and VMX, add the Suppress-VE bit to shadow_mmio_mask and
shadow_present_mask so that they can be common to both.

TDX will require shadow_mmio_mask and shadow_present_mask to include
VMX_SUPPRESS_VE for shared GPAs so that an EPT violation is triggered for
a shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the
SPTE value is required to cause an EPT misconfiguration. The additional
bit doesn't affect the VMX logic that adds the bit to
shadow_mmio_{value, mask}.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/mmu/spte.c | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..76ed39541a52 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -513,6 +513,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..02a466de2991 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -429,7 +429,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
- shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
/*
* EPT overrides the host MTRRs, and so KVM must program the desired
* memtype directly into the SPTEs. Note, this mask is just the mask
@@ -446,7 +448,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
* of an EPT paging-structure entry is 110b (write/execute).
*/
kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
- VMX_EPT_RWX_MASK, 0);
+ VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);

--
2.25.1


2024-02-26 08:44:29

by Isaku Yamahata

Subject: [PATCH v19 019/130] KVM: x86: Add is_vm_type_supported callback

From: Isaku Yamahata <[email protected]>

For SEV-SNP and TDX, allow the backend to override the supported VM types.
Add KVM_X86_SNP_VM and KVM_X86_TDX_VM to reserve the bits.

Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- Mention KVM_X86_SNP_VM to the commit message

v18:
- include into TDX KVM patch series v18

Changes v3 -> v4:
- Added KVM_X86_SNP_VM

Changes v2 -> v3:
- no change
- didn't bother to rename KVM_X86_PROTECTED_VM to KVM_X86_SW_PROTECTED_VM

Changes v1 -> v2
- no change

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 2 ++
arch/x86/kvm/svm/svm.c | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 7 +++++++
arch/x86/kvm/x86.c | 12 +++++++++++-
arch/x86/kvm/x86.h | 2 ++
7 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 156832f01ebe..8be71a5c5c87 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28314e7d546c..37cda8aa07b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1603,6 +1603,7 @@ struct kvm_x86_ops {
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

+ bool (*is_vm_type_supported)(unsigned long vm_type);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a448d0964fc0..aa7a56a47564 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -564,5 +564,7 @@ struct kvm_pmu_event_filter {

#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_SW_PROTECTED_VM 1
+#define KVM_X86_TDX_VM 2
+#define KVM_X86_SNP_VM 3

#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e90b429c84f1..f76dd52d29ba 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4886,6 +4886,12 @@ static void svm_vm_destroy(struct kvm *kvm)
sev_vm_destroy(kvm);
}

+static bool svm_is_vm_type_supported(unsigned long type)
+{
+ /* FIXME: Check if CPU is capable of SEV-SNP. */
+ return __kvm_is_vm_type_supported(type);
+}
+
static int svm_vm_init(struct kvm *kvm)
{
if (!pause_filter_count || !pause_filter_thresh)
@@ -4914,6 +4920,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_free = svm_vcpu_free,
.vcpu_reset = svm_vcpu_reset,

+ .is_vm_type_supported = svm_is_vm_type_supported,
.vm_size = sizeof(struct kvm_svm),
.vm_init = svm_vm_init,
.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1111d9d08903..fca3457dd050 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7541,6 +7541,12 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

+static bool vmx_is_vm_type_supported(unsigned long type)
+{
+ /* TODO: Check if TDX is supported. */
+ return __kvm_is_vm_type_supported(type);
+}
+
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

@@ -8263,6 +8269,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

+ .is_vm_type_supported = vmx_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
.vm_destroy = vmx_vm_destroy,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03dab4266172..442b356e4939 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4576,12 +4576,18 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
}
#endif

-static bool kvm_is_vm_type_supported(unsigned long type)
+bool __kvm_is_vm_type_supported(unsigned long type)
{
return type == KVM_X86_DEFAULT_VM ||
(type == KVM_X86_SW_PROTECTED_VM &&
IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
}
+EXPORT_SYMBOL_GPL(__kvm_is_vm_type_supported);
+
+static bool kvm_is_vm_type_supported(unsigned long type)
+{
+ return static_call(kvm_x86_is_vm_type_supported)(type);
+}

int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
{
@@ -4784,6 +4790,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = BIT(KVM_X86_DEFAULT_VM);
if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
r |= BIT(KVM_X86_SW_PROTECTED_VM);
+ if (kvm_is_vm_type_supported(KVM_X86_TDX_VM))
+ r |= BIT(KVM_X86_TDX_VM);
+ if (kvm_is_vm_type_supported(KVM_X86_SNP_VM))
+ r |= BIT(KVM_X86_SNP_VM);
break;
default:
break;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..4e40c23d66ed 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -9,6 +9,8 @@
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"

+bool __kvm_is_vm_type_supported(unsigned long type);
+
struct kvm_caps {
/* control of guest tsc rate supported? */
bool has_tsc_control;
--
2.25.1


2024-02-26 08:44:54

by Isaku Yamahata

Subject: [PATCH v19 053/130] KVM: x86/mmu: Disallow fast page fault on private GPA

From: Isaku Yamahata <[email protected]>

TDX requires a TDX SEAMCALL to operate on the Secure EPT instead of direct
memory access, and a TDX SEAMCALL is a heavyweight operation. A fast page
fault on a private GPA therefore doesn't make sense. Disallow fast page
fault on private GPAs.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- updated comment to mention VM type other than TDX.
---
arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 84e7a289ad07..eeebbc67e42b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3339,8 +3339,18 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
return RET_PF_CONTINUE;
}

-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm *kvm, struct kvm_page_fault *fault)
{
+ /*
+ * TDX private mapping doesn't support fast page fault because the EPT
+ * entry is read/written with TDX SEAMCALLs instead of direct memory
+ * access.
+ * For other VM types, kvm_is_private_gpa() is always false because
+ * gfn_shared_mask is zero.
+ */
+ if (kvm_is_private_gpa(kvm, fault->addr))
+ return false;
+
/*
* Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
* reach the common page fault handler if the SPTE has an invalid MMIO
@@ -3450,7 +3460,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 *sptep;
uint retry_count = 0;

- if (!page_fault_can_be_fast(fault))
+ if (!page_fault_can_be_fast(vcpu->kvm, fault))
return ret;

walk_shadow_page_lockless_begin(vcpu);
--
2.25.1


2024-02-26 08:45:11

by Isaku Yamahata

Subject: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions

From: Isaku Yamahata <[email protected]>

Currently, the KVM VMX module initialization/exit functions are a single
function each. Refactor the KVM VMX module initialization functions into a
KVM common part and a VMX part so that the TDX-specific part can be added
cleanly. Opportunistically refactor the module exit function as well.

The current module initialization flow is,
0.) Check if VMX is supported,
1.) hyper-v specific initialization,
2.) system-wide x86 specific and vendor specific initialization,
3.) Final VMX specific system-wide initialization,
4.) calculate the sizes of VMX kvm structure and VMX vcpu structure,
5.) report those sizes to the KVM common layer and KVM common
initialization

Refactor the KVM VMX module initialization function into smaller functions
with a wrapper, to separate the VMX logic in vmx.c from main.c, a file
common to VMX and TDX. Introduce a wrapper function for vmx_init().

The KVM architecture common layer allocates struct kvm with the reported
size for architecture-specific code. The KVM VMX module defines its
structure as struct vmx_kvm { struct kvm; VMX specific members; } and uses
it as struct vmx_kvm. Similar for the vcpu structure. TDX KVM patches will
define TDX-specific kvm and vcpu structures.

The current module exit function is also a single function, a combination
of VMX-specific logic and common KVM logic. Refactor it into VMX-specific
logic and KVM common logic. This is just refactoring to keep the
VMX-specific logic in vmx.c, separate from main.c.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Eliminate the unnecessary churn with vmx_hardware_setup() by Xiaoyao

v18:
- Move loaded_vmcss_on_cpu initialization to vt_init() before
kvm_x86_vendor_init().
- added __init to an empty stub function, hv_init_evmcs().

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 54 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 60 +++++---------------------------------
arch/x86/kvm/vmx/x86_ops.h | 14 +++++++++
3 files changed, 75 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index eeb7a43b271d..18cecf12c7c8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -167,3 +167,57 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
.runtime_ops = &vt_x86_ops,
.pmu_ops = &intel_pmu_ops,
};
+
+static int __init vt_init(void)
+{
+ unsigned int vcpu_size, vcpu_align;
+ int cpu, r;
+
+ if (!kvm_is_vmx_supported())
+ return -EOPNOTSUPP;
+
+ /*
+ * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
+ * to unwind if a later step fails.
+ */
+ hv_init_evmcs();
+
+ /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
+ r = kvm_x86_vendor_init(&vt_init_ops);
+ if (r)
+ return r;
+
+ r = vmx_init();
+ if (r)
+ goto err_vmx_init;
+
+ /*
+ * Common KVM initialization _must_ come last, after this, /dev/kvm is
+ * exposed to userspace!
+ */
+ vcpu_size = sizeof(struct vcpu_vmx);
+ vcpu_align = __alignof__(struct vcpu_vmx);
+ r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
+ if (r)
+ goto err_kvm_init;
+
+ return 0;
+
+err_kvm_init:
+ vmx_exit();
+err_vmx_init:
+ kvm_x86_vendor_exit();
+ return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+ kvm_exit();
+ kvm_x86_vendor_exit();
+ vmx_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8af0668e4dca..2fb1cd2e28a2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -477,7 +477,7 @@ DEFINE_PER_CPU(struct vmcs *, current_vmcs);
* We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
* when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
*/
-static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
+DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);

static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);
static DEFINE_SPINLOCK(vmx_vpid_lock);
@@ -537,7 +537,7 @@ static int hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu)
return 0;
}

-static __init void hv_init_evmcs(void)
+__init void hv_init_evmcs(void)
{
int cpu;

@@ -573,7 +573,7 @@ static __init void hv_init_evmcs(void)
}
}

-static void hv_reset_evmcs(void)
+void hv_reset_evmcs(void)
{
struct hv_vp_assist_page *vp_ap;

@@ -597,10 +597,6 @@ static void hv_reset_evmcs(void)
vp_ap->current_nested_vmcs = 0;
vp_ap->enlighten_vmentry = 0;
}
-
-#else /* IS_ENABLED(CONFIG_HYPERV) */
-static void hv_init_evmcs(void) {}
-static void hv_reset_evmcs(void) {}
#endif /* IS_ENABLED(CONFIG_HYPERV) */

/*
@@ -2743,7 +2739,7 @@ static bool __kvm_is_vmx_supported(void)
return true;
}

-static bool kvm_is_vmx_supported(void)
+bool kvm_is_vmx_supported(void)
{
bool supported;

@@ -8508,7 +8504,7 @@ static void vmx_cleanup_l1d_flush(void)
l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
}

-static void __vmx_exit(void)
+void vmx_exit(void)
{
allow_smaller_maxphyaddr = false;

@@ -8517,36 +8513,10 @@ static void __vmx_exit(void)
vmx_cleanup_l1d_flush();
}

-static void vmx_exit(void)
-{
- kvm_exit();
- kvm_x86_vendor_exit();
-
- __vmx_exit();
-}
-module_exit(vmx_exit);
-
-static int __init vmx_init(void)
+int __init vmx_init(void)
{
int r, cpu;

- if (!kvm_is_vmx_supported())
- return -EOPNOTSUPP;
-
- /*
- * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
- * to unwind if a later step fails.
- */
- hv_init_evmcs();
-
- /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
-
- r = kvm_x86_vendor_init(&vt_init_ops);
- if (r)
- return r;
-
/*
* Must be called after common x86 init so enable_ept is properly set
* up. Hand the parameter mitigation value in which was stored in
@@ -8556,7 +8526,7 @@ static int __init vmx_init(void)
*/
r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
if (r)
- goto err_l1d_flush;
+ return r;

for_each_possible_cpu(cpu)
pi_init_cpu(cpu);
@@ -8573,21 +8543,5 @@ static int __init vmx_init(void)
if (!enable_ept)
allow_smaller_maxphyaddr = true;

- /*
- * Common KVM initialization _must_ come last, after this, /dev/kvm is
- * exposed to userspace!
- */
- r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
- THIS_MODULE);
- if (r)
- goto err_kvm_init;
-
return 0;
-
-err_kvm_init:
- __vmx_exit();
-err_l1d_flush:
- kvm_x86_vendor_exit();
- return r;
}
-module_init(vmx_init);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 2f8b6c43fe0f..b936388853ab 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -6,6 +6,20 @@

#include "x86.h"

+#if IS_ENABLED(CONFIG_HYPERV)
+__init void hv_init_evmcs(void);
+void hv_reset_evmcs(void);
+#else /* IS_ENABLED(CONFIG_HYPERV) */
+static inline __init void hv_init_evmcs(void) {}
+static inline void hv_reset_evmcs(void) {}
+#endif /* IS_ENABLED(CONFIG_HYPERV) */
+
+DECLARE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
+
+bool kvm_is_vmx_supported(void);
+int __init vmx_init(void);
+void vmx_exit(void);
+
extern struct kvm_x86_ops vt_x86_ops __initdata;
extern struct kvm_x86_init_ops vt_init_ops __initdata;

--
2.25.1


2024-02-26 08:45:36

by Isaku Yamahata

Subject: [PATCH v19 055/130] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of patch series of KVM TDP MMU
hooks.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index e893a3d714c7..7903473abad1 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -26,5 +26,5 @@ Patch Layer status
* TD vcpu interrupts/exit/hypercall: Not yet

* KVM MMU GPA shared bits: Applied
-* KVM TDP refactoring for TDX: Applying
-* KVM TDP MMU hooks: Not yet
+* KVM TDP refactoring for TDX: Applied
+* KVM TDP MMU hooks: Applying
--
2.25.1


2024-02-26 08:46:05

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 057/130] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role

From: Isaku Yamahata <[email protected]>

Because TDX support introduces private mappings, add a new is_private member
to union kvm_mmu_page_role, along with accessor functions to check and set it.
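How a one-bit role member keeps private and shared page-table roots distinct can be illustrated with a trimmed-down, hypothetical analogue of the union; the names below are illustrative stand-ins, not the actual kernel symbols:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Trimmed-down analogue of union kvm_mmu_page_role: the bit-fields share
 * one 32-bit word, so two roles compare equal iff their words are equal.
 */
union demo_page_role {
	uint32_t word;
	struct {
		uint32_t level:4;
		uint32_t guest_mode:1;
		uint32_t is_private:1;
		uint32_t :26;
	};
};

static bool demo_role_is_private(union demo_page_role role)
{
	return !!role.is_private;
}

static void demo_role_set_private(union demo_page_role *role)
{
	role->is_private = 1;
}
```

Because the bit participates in role.word, a private root and a shared root with otherwise identical roles never match in a role-word comparison, which is what lets a root lookup keep the two trees apart.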

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Fix is_private_sptep() when NULL case.
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 13 ++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 5 +++++
arch/x86/kvm/mmu/spte.h | 7 +++++++
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6c10d8d1017f..dcc6f7c38a83 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -349,7 +349,8 @@ union kvm_mmu_page_role {
unsigned ad_disabled:1;
unsigned guest_mode:1;
unsigned passthrough:1;
- unsigned :5;
+ unsigned is_private:1;
+ unsigned :4;

/*
* This is left at the top of the word so that
@@ -361,6 +362,16 @@ union kvm_mmu_page_role {
};
};

+static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role)
+{
+ return !!role.is_private;
+}
+
+static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role)
+{
+ role->is_private = 1;
+}
+
/*
* kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
* relevant to the current MMU configuration. When loading CR0, CR4, or EFER,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0443bfcf5d9c..e3f54701f98d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -145,6 +145,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}

+static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+{
+ return kvm_mmu_page_role_is_private(sp->role);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1a163aee9ec6..3ef8ea18321b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -264,6 +264,13 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}

+static inline bool is_private_sptep(u64 *sptep)
+{
+ if (WARN_ON_ONCE(!sptep))
+ return false;
+ return is_private_sp(sptep_to_sp(sptep));
+}
+
static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
--
2.25.1


2024-02-26 08:47:09

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 065/130] KVM: VMX: Split out guts of EPT violation to common/exposed function

From: Sean Christopherson <[email protected]>

The only difference for a TDX EPT violation is how the information, the GPA
and the exit qualification, is retrieved. To share the EPT violation handling
code, split out the guts of the EPT violation handler so that the VMX and TDX
exit handlers can call it after retrieving the GPA and exit qualification.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
arch/x86/kvm/vmx/common.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 25 +++----------------------
2 files changed, 36 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kvm/vmx/common.h

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..235908f3e044
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long exit_qualification)
+{
+ u64 error_code;
+
+ /* Is it a read fault? */
+ error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+ ? PFERR_USER_MASK : 0;
+ /* Is it a write fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+ ? PFERR_WRITE_MASK : 0;
+ /* Is it a fetch fault? */
+ error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+ ? PFERR_FETCH_MASK : 0;
+ /* ept page table entry is present? */
+ error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+ ? PFERR_PRESENT_MASK : 0;
+
+ error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
+ PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+ return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6fe895bd7807..162bb134aae6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -50,6 +50,7 @@
#include <asm/vmx.h>

#include "capabilities.h"
+#include "common.h"
#include "cpuid.h"
#include "hyperv.h"
#include "kvm_onhyperv.h"
@@ -5779,11 +5780,8 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)

static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
- unsigned long exit_qualification;
+ unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
gpa_t gpa;
- u64 error_code;
-
- exit_qualification = vmx_get_exit_qual(vcpu);

/*
* EPT violation happened while executing iret from NMI,
@@ -5798,23 +5796,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)

gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
trace_kvm_page_fault(vcpu, gpa, exit_qualification);
-
- /* Is it a read fault? */
- error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
- ? PFERR_USER_MASK : 0;
- /* Is it a write fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
- ? PFERR_WRITE_MASK : 0;
- /* Is it a fetch fault? */
- error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
- ? PFERR_FETCH_MASK : 0;
- /* ept page table entry is present? */
- error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
- ? PFERR_PRESENT_MASK : 0;
-
- error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
- PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
vcpu->arch.exit_qualification = exit_qualification;

/*
@@ -5828,7 +5809,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);

- return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
}

static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
--
2.25.1


2024-02-26 08:47:42

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 066/130] KVM: TDX: Add accessors VMX VMCS helpers

From: Isaku Yamahata <[email protected]>

TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS. Introduce helper accessors to hide the SEAMCALL ABI details.
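The width-stamped accessor pattern used by TDX_BUILD_TDVPS_ACCESSORS can be modeled in a small userspace sketch; the names and the array backing store below are hypothetical stand-ins for the real TDH.VP.RD/TDH.VP.WR SEAMCALLs:

```c
#include <stdint.h>

/*
 * Hypothetical backing store standing in for the TD-VMCS reached via
 * TDH.VP.RD/TDH.VP.WR; here just an array indexed by field.
 */
static uint64_t demo_fields[16];

/* Mirror GENMASK_ULL(bits - 1, 0) without UB for bits in 1..64. */
#define DEMO_GENMASK(bits)	(~0ULL >> (64 - (bits)))

/*
 * Token pasting stamps out one read/write pair per width, mirroring how
 * TDX_BUILD_TDVPS_ACCESSORS generates td_vmcs_{read,write}{16,32,64}().
 * The write side masks the value to the field width, like the kernel
 * macro passes GENMASK_ULL(bits - 1, 0) as the write mask to tdh_vp_wr().
 */
#define DEMO_BUILD_ACCESSORS(bits)					\
static uint##bits##_t demo_read##bits(uint32_t field)			\
{									\
	return (uint##bits##_t)demo_fields[field];			\
}									\
static void demo_write##bits(uint32_t field, uint64_t val)		\
{									\
	demo_fields[field] = val & DEMO_GENMASK(bits);			\
}

DEMO_BUILD_ACCESSORS(16)
DEMO_BUILD_ACCESSORS(32)
DEMO_BUILD_ACCESSORS(64)
```

The payoff of the macro is that callers get type-checked, width-correct accessors (e.g. demo_read32()) from a single definition, the same way the TD VMCS helpers do.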

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- deleted unnecessary stub functions,
tdvps_state_non_arch_check() and tdvps_management_check().
---
arch/x86/kvm/vmx/tdx.h | 92 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index d3077151252c..8a0d1bfe34a0 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -58,6 +58,98 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_tdx, vcpu);
}

+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+#define VMCS_ENC_ACCESS_TYPE_MASK 0x1UL
+#define VMCS_ENC_ACCESS_TYPE_FULL 0x0UL
+#define VMCS_ENC_ACCESS_TYPE_HIGH 0x1UL
+#define VMCS_ENC_ACCESS_TYPE(field) ((field) & VMCS_ENC_ACCESS_TYPE_MASK)
+
+ /* TDX is 64bit only. HIGH field isn't supported. */
+ BUILD_BUG_ON_MSG(__builtin_constant_p(field) &&
+ VMCS_ENC_ACCESS_TYPE(field) == VMCS_ENC_ACCESS_TYPE_HIGH,
+ "Read/Write to TD VMCS *_HIGH fields not supported");
+
+ BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+#define VMCS_ENC_WIDTH_MASK GENMASK(14, 13)
+#define VMCS_ENC_WIDTH_16BIT (0UL << 13)
+#define VMCS_ENC_WIDTH_64BIT (1UL << 13)
+#define VMCS_ENC_WIDTH_32BIT (2UL << 13)
+#define VMCS_ENC_WIDTH_NATURAL (3UL << 13)
+#define VMCS_ENC_WIDTH(field) ((field) & VMCS_ENC_WIDTH_MASK)
+
+ /* TDX is 64bit only. i.e. natural width = 64bit. */
+ BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+ (VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_64BIT ||
+ VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_NATURAL),
+ "Invalid TD VMCS access for 64-bit field");
+ BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+ VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_32BIT,
+ "Invalid TD VMCS access for 32-bit field");
+ BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+ VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_16BIT,
+ "Invalid TD VMCS access for 16-bit field");
+}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
+ u32 field) \
+{ \
+ struct tdx_module_args out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_rd(tdx->tdvpr_pa, TDVPS_##uclass(field), &out); \
+ if (KVM_BUG_ON(err, tdx->vcpu.kvm)) { \
+ pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n", \
+ field, err); \
+ return 0; \
+ } \
+ return (u##bits)out.r8; \
+} \
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx, \
+ u32 field, u##bits val) \
+{ \
+ struct tdx_module_args out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), val, \
+ GENMASK_ULL(bits - 1, 0), &out); \
+ if (KVM_BUG_ON(err, tdx->vcpu.kvm)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n", \
+ field, (u64)val, err); \
+} \
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ struct tdx_module_args out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), bit, bit, &out); \
+ if (KVM_BUG_ON(err, tdx->vcpu.kvm)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n", \
+ field, bit, err); \
+} \
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
+ u32 field, u64 bit) \
+{ \
+ struct tdx_module_args out; \
+ u64 err; \
+ \
+ tdvps_##lclass##_check(field, bits); \
+ err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), 0, bit, &out); \
+ if (KVM_BUG_ON(err, tdx->vcpu.kvm)) \
+ pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n", \
+ field, bit, err); \
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
{
struct tdx_module_args out;
--
2.25.1


2024-02-26 08:48:25

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 063/130] [MARKER] The start of TDX KVM patch series: TDX EPT violation

From: Isaku Yamahata <[email protected]>

This empty commit marks the start of the patch series for TDX EPT
violation.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 7903473abad1..c4d67dd9ddf8 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -20,11 +20,11 @@ Patch Layer status
* TDX architectural definitions: Applied
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
-* TDX EPT violation: Not yet
+* TDX EPT violation: Applying
* TD finalization: Not yet
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet

* KVM MMU GPA shared bits: Applied
* KVM TDP refactoring for TDX: Applied
-* KVM TDP MMU hooks: Applying
+* KVM TDP MMU hooks: Applied
--
2.25.1


2024-02-26 08:48:46

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 064/130] KVM: x86/mmu: Do not enable page track for TD guest

From: Yan Zhao <[email protected]>

TDX does not support write protection and hence page tracking.
Though !tdp_enabled and kvm_shadow_root_allocated(kvm) are always false
for a TD guest, kvm_page_track_write_tracking_enabled() should also return
false when external write tracking is enabled.

Cc: Yuan Yao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- drop TDX: from the short log
- Added reviewed-by: BinBin
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/page_track.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index c87da11f3a04..ce698ab213c1 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -22,6 +22,9 @@

bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
{
+ if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+ return false;
+
return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
!tdp_enabled || kvm_shadow_root_allocated(kvm);
}
--
2.25.1


2024-02-26 08:49:47

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 068/130] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Yuan Yao <[email protected]>

The TDX module internally uses locks to protect its internal resources. It
tries to acquire the locks; if it fails to obtain a lock, it returns a
TDX_OPERAND_BUSY error without spinning because of its execution time
limitation.

The TDX SEAMCALL API reference describes which resources each call uses, so
it is known which TDX SEAMCALLs can contend on which resources. The VMM can
avoid contention inside the TDX module by serializing contentious TDX
SEAMCALLs with, for example, a spinlock. Because the OS knows its process
scheduling and scalability better, a lock at the OS/VMM layer would work
better than simply retrying TDX SEAMCALLs.

The TDH.MEM.* APIs, except for TDH.MEM.TRACK, operate on the Secure EPT tree,
and the TDX module internally tries to acquire the lock of the Secure EPT
tree. They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT on failure to get
the lock. TDX KVM allows the SEPT callbacks to return an error so that the
TDP MMU layer can retry.

Retry the TDX TDH.MEM.* APIs on this error because it is a rare event caused
by the zero-step attack mitigation.
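The bounded-retry pattern that tdx_seamcall_sept() introduces can be modeled in plain C; the callback type, the function, and the status value below are illustrative stand-ins rather than the kernel's actual symbols:

```c
#include <stdint.h>

/* Stand-in for TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT (value illustrative). */
#define DEMO_ERROR_SEPT_BUSY	0x8000020000000092ULL
#define DEMO_RETRY_MAX		16

/* Stand-in for tdx_seamcall(): any callback returning a status code. */
typedef uint64_t (*seamcall_fn)(void *ctx);

/*
 * Retry a bounded number of times while the callee reports SEPT
 * contention, mirroring tdx_seamcall_sept(): contention is assumed rare,
 * so a short retry loop usually succeeds, but never loop forever; give
 * up after DEMO_RETRY_MAX extra attempts and let the caller retry.
 */
static uint64_t demo_seamcall_sept(seamcall_fn call, void *ctx)
{
	int retry = DEMO_RETRY_MAX;
	uint64_t ret;

	do {
		ret = call(ctx);
	} while (ret == DEMO_ERROR_SEPT_BUSY && retry-- > 0);
	return ret;
}
```

The bound matters: an unbounded spin here would defeat the point of the TDX module returning busy instead of spinning itself, while the small retry budget absorbs the rare transient contention.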

Signed-off-by: Yuan Yao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- fix typo TDG.VP.ENTER => TDH.VP.ENTER,
TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY
- drop the description on TDH.VP.ENTER as this patch doesn't touch
TDH.VP.ENTER
---
arch/x86/kvm/vmx/tdx_ops.h | 48 +++++++++++++++++++++++++++++++-------
1 file changed, 39 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index d80212b1daf3..e5c069b96126 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -44,6 +44,36 @@ static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
#endif

+/*
+ * TDX module acquires its internal lock for resources. It doesn't spin to get
+ * locks because of its restrictions of allowed execution time. Instead, it
+ * returns TDX_OPERAND_BUSY with an operand id.
+ *
+ * Multiple VCPUs can operate on SEPT. Also with zero-step attack mitigation,
+ * TDH.VP.ENTER may rarely acquire SEPT lock and release it when zero-step
+ * attack is suspected. It results in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT
+ * with TDH.MEM.* operation. Note: TDH.MEM.TRACK is an exception.
+ *
+ * Because TDP MMU uses read lock for scalability, spin lock around SEAMCALL
+ * spoils TDP MMU effort. Retry several times with the assumption that SEPT
+ * lock contention is rare. But don't loop forever to avoid lockup. Let TDP
+ * MMU retry.
+ */
+#define TDX_ERROR_SEPT_BUSY (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in,
+ struct tdx_module_args *out)
+{
+#define SEAMCALL_RETRY_MAX 16
+ int retry = SEAMCALL_RETRY_MAX;
+ u64 ret;
+
+ do {
+ ret = tdx_seamcall(op, in, out);
+ } while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
+ return ret;
+}
+
static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
{
struct tdx_module_args in = {
@@ -66,7 +96,7 @@ static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source
};

clflush_cache_range(__va(hpa), PAGE_SIZE);
- return tdx_seamcall(TDH_MEM_PAGE_ADD, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in, out);
}

static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
@@ -79,7 +109,7 @@ static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
};

clflush_cache_range(__va(page), PAGE_SIZE);
- return tdx_seamcall(TDH_MEM_SEPT_ADD, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in, out);
}

static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
@@ -90,7 +120,7 @@ static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
.rdx = tdr,
};

- return tdx_seamcall(TDH_MEM_SEPT_RD, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_SEPT_RD, &in, out);
}

static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
@@ -101,7 +131,7 @@ static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
.rdx = tdr,
};

- return tdx_seamcall(TDH_MEM_SEPT_REMOVE, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in, out);
}

static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
@@ -125,7 +155,7 @@ static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
};

clflush_cache_range(__va(hpa), PAGE_SIZE);
- return tdx_seamcall(TDH_MEM_PAGE_RELOCATE, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_PAGE_RELOCATE, &in, out);
}

static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
@@ -138,7 +168,7 @@ static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
};

clflush_cache_range(__va(hpa), PAGE_SIZE);
- return tdx_seamcall(TDH_MEM_PAGE_AUG, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in, out);
}

static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
@@ -149,7 +179,7 @@ static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
.rdx = tdr,
};

- return tdx_seamcall(TDH_MEM_RANGE_BLOCK, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in, out);
}

static inline u64 tdh_mng_key_config(hpa_t tdr)
@@ -299,7 +329,7 @@ static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
.rdx = tdr,
};

- return tdx_seamcall(TDH_MEM_PAGE_REMOVE, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
}

static inline u64 tdh_sys_lp_shutdown(void)
@@ -327,7 +357,7 @@ static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
.rdx = tdr,
};

- return tdx_seamcall(TDH_MEM_RANGE_UNBLOCK, &in, out);
+ return tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in, out);
}

static inline u64 tdh_phymem_cache_wb(bool resume)
--
2.25.1


2024-02-26 08:50:09

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

From: Isaku Yamahata <[email protected]>

Allocate a protected page table for each private page table, and add hooks to
operate on the protected page tables. When calling the hooks to update an
SPTE entry, freeze the entry, call the hooks, and unfreeze the entry to allow
concurrent updates on the page tables, which is the advantage of the TDP MMU.
Because kvm_gfn_shared_mask() always returns false for now, those hooks
aren't called yet with this patch.

When the faulting GPA is private, the KVM page fault is considered private.
When resolving a private page fault, allocate a protected page table and call
the hooks to operate on it. On a change to a private PTE entry, invoke the
kvm_x86_ops hook in __handle_changed_spte() to propagate the change to the
protected page table. The following depicts the relationship.

        private KVM page fault         |
                   |                   |
                   V                   |
              private GPA              |     CPU protected EPTP
                   |                   |             |
                   V                   |             V
            private PT root            |    protected PT root
                   |                   |             |
                   V                   |             V
             private PT --hook to propagate-->protected PT
                   |                   |             |
                   \-------------------+------\      |
                                       |      |      |
                                       |      V      V
                                       |    private guest page
                                       |
                                       |
         non-encrypted memory          |    encrypted memory
                                       |
PT: page table

The existing KVM TDP MMU code updates SPTEs atomically. On populating an EPT
entry, the entry is set atomically. Zapping an SPTE, however, requires a TLB
shootdown. To address that, the entry is frozen with a special SPTE value
that clears the present bit; after the TLB shootdown, the entry is set to the
eventual value (unfrozen).

For a protected page table, hooks are called to update the protected page
table in addition to the direct access to the private SPTE. For the zapping
case, freezing the SPTE works as-is: the hooks can be called in addition to
the TLB shootdown. For populating a private SPTE entry, however, there can be
a race condition without further protection:

vcpu 1: populating 2M private SPTE
vcpu 2: populating 4K private SPTE
vcpu 2: TDX SEAMCALL to update 4K protected SPTE => error
vcpu 1: TDX SEAMCALL to update 2M protected SPTE

To avoid the race, the frozen SPTE is utilized. Instead of atomically
updating the private entry, freeze the entry, call the hook that updates the
protected SPTE, and then set the entry to the final value.

Support 4K pages only at this stage. 2M page support can be done in future
patches.
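The freeze-then-propagate sequence described above can be sketched in userspace C with a compare-and-swap; DEMO_FROZEN_SPTE and the hook type are hypothetical stand-ins for REMOVED_SPTE and the kvm_x86_ops callback, not the kernel's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in "frozen" value, analogous to a frozen/REMOVED SPTE: the present
 * bit is clear and the value is recognizable, so concurrent walkers back
 * off instead of operating on a half-updated entry.
 */
#define DEMO_FROZEN_SPTE	0x5a0ULL

/* Hypothetical hook standing in for the set_private_spte() callback. */
typedef int (*propagate_fn)(uint64_t new_spte);

/*
 * Freeze the entry, propagate the change to the protected page table,
 * then unfreeze by writing the final value.  The compare-and-swap makes
 * concurrent populators of the same entry lose the race and bail out,
 * instead of issuing conflicting SEAMCALLs against the protected PT.
 */
static bool demo_set_private_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
				  uint64_t new_spte, propagate_fn hook)
{
	uint64_t expected = old_spte;

	if (!atomic_compare_exchange_strong(sptep, &expected, DEMO_FROZEN_SPTE))
		return false;		/* another vCPU changed the entry first */

	if (hook(new_spte)) {
		atomic_store(sptep, old_spte);	/* hook failed: restore */
		return false;
	}

	atomic_store(sptep, new_spte);		/* unfreeze with final value */
	return true;
}
```

In the 2M-vs-4K race from the commit message, whichever vCPU freezes the entry first gets to call the hook; the loser's compare-and-swap fails and it retries against the updated entry rather than racing on the SEAMCALL.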

Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- drop CONFIG_KVM_MMU_PRIVATE

v18:
- Rename freezed => frozen

v14 -> v15:
- Refined is_private condition check in kvm_tdp_mmu_map().
Add kvm_gfn_shared_mask() check.
- catch up for struct kvm_range change

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 5 +
arch/x86/include/asm/kvm_host.h | 11 ++
arch/x86/kvm/mmu/mmu.c | 17 +-
arch/x86/kvm/mmu/mmu_internal.h | 13 +-
arch/x86/kvm/mmu/tdp_iter.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 308 +++++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
virt/kvm/kvm_main.c | 1 +
8 files changed, 320 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index a8e96804a252..e1c75f8c1b25 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -101,6 +101,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_private_spt)
+KVM_X86_OP_OPTIONAL(free_private_spt)
+KVM_X86_OP_OPTIONAL(set_private_spte)
+KVM_X86_OP_OPTIONAL(remove_private_spte)
+KVM_X86_OP_OPTIONAL(zap_private_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index efd3fda1c177..bc0767c884f7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -468,6 +468,7 @@ struct kvm_mmu {
int (*sync_spte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, int i);
struct kvm_mmu_root_info root;
+ hpa_t private_root_hpa;
union kvm_cpu_role cpu_role;
union kvm_mmu_page_role root_role;

@@ -1740,6 +1741,16 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);

+ int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *private_spt);
+ int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *private_spt);
+ int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
+ int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
+ int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+
bool (*has_wbinvd_exit)(void);

u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 30c86e858ae4..0e0321ad9ca2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3717,7 +3717,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
goto out_unlock;

if (tdp_mmu_enabled) {
- root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+ if (kvm_gfn_shared_mask(vcpu->kvm) &&
+ !VALID_PAGE(mmu->private_root_hpa)) {
+ root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
+ mmu->private_root_hpa = root;
+ }
+ root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
mmu->root.hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
@@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
- gfn_t base = gfn_round_for_level(fault->gfn,
+ gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
fault->max_level);

if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
@@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
};

WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);

r = mmu_topup_memory_caches(vcpu, false);
@@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)

mmu->root.hpa = INVALID_PAGE;
mmu->root.pgd = 0;
+ mmu->private_root_hpa = INVALID_PAGE;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;

@@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
+ if (tdp_mmu_enabled) {
+ write_lock(&vcpu->kvm->mmu_lock);
+ mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
+ NULL);
+ write_unlock(&vcpu->kvm->mmu_lock);
+ }
free_mmu_pages(&vcpu->arch.root_mmu);
free_mmu_pages(&vcpu->arch.guest_mmu);
mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 002f3f80bf3b..9e2c7c6d85bf 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@
#include <linux/kvm_host.h>
#include <asm/kvm_host.h>

+#include "mmu.h"
+
#ifdef CONFIG_KVM_PROVE_MMU
#define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
#else
@@ -205,6 +207,15 @@ static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
free_page((unsigned long)sp->private_spt);
}

+static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t gfn)
+{
+ if (is_private_sp(root))
+ return kvm_gfn_to_private(kvm, gfn);
+ else
+ return kvm_gfn_to_shared(kvm, gfn);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
@@ -363,7 +374,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int r;

if (vcpu->arch.mmu->root_role.direct) {
- fault.gfn = fault.addr >> PAGE_SHIFT;
+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e1e40e3f5eb7..a9c9cd0db20a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,7 +91,7 @@ struct tdp_iter {
tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
/* A pointer to the current SPTE */
tdp_ptep_t sptep;
- /* The lowest GFN mapped by the current SPTE */
+ /* The lowest GFN (shared bits included) mapped by the current SPTE */
gfn_t gfn;
/* The level of the root page given to the iterator */
int root_level;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a90907b31c54..1a0e4baa8311 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -187,6 +187,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu,
sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
sp->role = role;

+ if (kvm_mmu_page_role_is_private(role))
+ kvm_mmu_alloc_private_spt(vcpu, sp);
+
return sp;
}

@@ -209,7 +212,8 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
trace_kvm_mmu_get_page(sp, true);
}

-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
+ bool private)
{
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
struct kvm *kvm = vcpu->kvm;
@@ -221,6 +225,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
* Check for an existing root before allocating a new one. Note, the
* role check prevents consuming an invalid root.
*/
+ if (private)
+ kvm_mmu_page_role_set_private(&role);
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
if (root->role.word == role.word &&
kvm_tdp_mmu_get_root(root))
@@ -244,12 +250,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

out:
- return __pa(root->spt);
+ return root;
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
+{
+ return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
}

static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared);
+ u64 old_spte, u64 new_spte,
+ union kvm_mmu_page_role role, bool shared);

static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -376,12 +387,78 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
REMOVED_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_spte, REMOVED_SPTE, level, shared);
+ old_spte, REMOVED_SPTE, sp->role,
+ shared);
+ }
+
+ if (is_private_sp(sp) &&
+ WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
+ kvm_mmu_private_spt(sp)))) {
+ /*
+ * Failed to unlink Secure EPT page and there is nothing to do
+ * further. Intentionally leak the page to prevent the kernel
+ * from accessing the encrypted page.
+ */
+ kvm_mmu_init_private_spt(sp, NULL);
}

call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

+static void *get_private_spt(gfn_t gfn, u64 new_spte, int level)
+{
+ if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+ struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
+ void *private_spt = kvm_mmu_private_spt(sp);
+
+ WARN_ON_ONCE(!private_spt);
+ WARN_ON_ONCE(sp->role.level + 1 != level);
+ WARN_ON_ONCE(sp->gfn != gfn);
+ return private_spt;
+ }
+
+ return NULL;
+}
+
+static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
+ u64 old_spte, u64 new_spte,
+ int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ int ret;
+
+ /* Ignore change of software only bits. e.g. host_writable */
+ if (was_leaf == is_leaf && was_present == is_present)
+ return;
+
+ /*
+ * Allow only leaf page to be zapped. Reclaim Non-leaf page tables at
+ * destroying VM.
+ */
+ WARN_ON_ONCE(is_present);
+ if (!was_leaf)
+ return;
+
+ /* non-present -> non-present doesn't make sense. */
+ KVM_BUG_ON(!was_present, kvm);
+ KVM_BUG_ON(new_pfn, kvm);
+
+ /* Zapping leaf spte is allowed only when write lock is held. */
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+ /* Because the write lock is held, the operation should succeed. */
+ if (KVM_BUG_ON(ret, kvm))
+ return;
+
+ ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
+ KVM_BUG_ON(ret, kvm);
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -389,7 +466,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* @gfn: the base GFN that was mapped by the SPTE
* @old_spte: The value of the SPTE before the change
* @new_spte: The value of the SPTE after the change
- * @level: the level of the PT the SPTE is part of in the paging structure
+ * @role: the role of the PT the SPTE is part of in the paging structure
* @shared: This operation may not be running under the exclusive use of
* the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
@@ -399,14 +476,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* and fast_pf_fix_direct_spte()).
*/
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared)
+ u64 old_spte, u64 new_spte,
+ union kvm_mmu_page_role role, bool shared)
{
+ bool is_private = kvm_mmu_page_role_is_private(role);
+ int level = role.level;
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
- bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ bool pfn_changed = old_pfn != new_pfn;

WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -473,7 +554,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,

if (was_leaf && is_dirty_spte(old_spte) &&
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ kvm_set_pfn_dirty(old_pfn);

/*
* Recursively handle child PTs if the change removed a subtree from
@@ -482,14 +563,82 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* pages are kernel allocations and should never be migrated.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
+ kvm);
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+ }
+
+ /*
+ * Secure-EPT requires that Secure-EPT tables be removed after their
+ * children. Invoke the hook after the lower page table has been
+ * handled by handle_removed_pt() above.
+ */
+ if (is_private && !is_present)
+ handle_removed_private_spte(kvm, gfn, old_spte, new_spte, role.level);

if (was_leaf && is_accessed_spte(old_spte) &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}

+static int __must_check __set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ int ret = 0;
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ /* TDP MMU doesn't change present -> present */
+ KVM_BUG_ON(was_present, kvm);
+
+ /*
+ * Use a different hook to set up either a middle-level
+ * private page table or a leaf entry.
+ */
+ if (is_leaf)
+ ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);
+ else {
+ void *private_spt = get_private_spt(gfn, new_spte, level);
+
+ KVM_BUG_ON(!private_spt, kvm);
+ ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt);
+ }
+
+ return ret;
+}
+
+static int __must_check set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ int ret;
+
+ /*
+ * For a private page table, callbacks are needed to propagate the
+ * SPTE change into the protected page table. To atomically update
+ * both the SPTE and the protected page table via the callbacks,
+ * freeze the SPTE:
+ * - Freeze the SPTE. Set entry to REMOVED_SPTE.
+ * - Trigger callbacks for protected page tables.
+ * - Unfreeze the SPTE. Set the entry to new_spte.
+ */
+ lockdep_assert_held(&kvm->mmu_lock);
+ if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
+ return -EBUSY;
+
+ ret = __set_private_spte_present(kvm, sptep, gfn, old_spte, new_spte, level);
+ if (ret)
+ __kvm_tdp_mmu_write_spte(sptep, old_spte);
+ else
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return ret;
+}
+
/*
* tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
* and handle the associated bookkeeping. Do not mark the page dirty
@@ -512,6 +661,7 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);
+ bool frozen = false;

/*
* The caller is responsible for ensuring the old SPTE is not a REMOVED
@@ -523,19 +673,45 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,

lockdep_assert_held_read(&kvm->mmu_lock);

- /*
- * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
- * does not hold the mmu_lock. On failure, i.e. if a different logical
- * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
- * the current value, so the caller operates on fresh data, e.g. if it
- * retries tdp_mmu_set_spte_atomic()
- */
- if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
- return -EBUSY;
+ if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
+ int ret;

- handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, true);
+ if (is_shadow_present_pte(new_spte)) {
+ /*
+ * Populating case. handle_changed_spte() can
+ * process without freezing because it only updates
+ * stats.
+ */
+ ret = set_private_spte_present(kvm, iter->sptep, iter->gfn,
+ iter->old_spte, new_spte, iter->level);
+ if (ret)
+ return ret;
+ } else {
+ /*
+ * Zapping case. handle_changed_spte() triggers Secure-EPT
+ * blocking or removal. Freeze the entry.
+ */
+ if (!try_cmpxchg64(sptep, &iter->old_spte, REMOVED_SPTE))
+ return -EBUSY;
+ frozen = true;
+ }
+ } else {
+ /*
+ * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+ * and does not hold the mmu_lock. On failure, i.e. if a
+ * different logical CPU modified the SPTE, try_cmpxchg64()
+ * updates iter->old_spte with the current value, so the caller
+ * operates on fresh data, e.g. if it retries
+ * tdp_mmu_set_spte_atomic()
+ */
+ if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+ return -EBUSY;
+ }

+ handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
+ new_spte, sptep_to_sp(sptep)->role, true);
+ if (frozen)
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
return 0;
}

@@ -585,6 +761,8 @@ static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
u64 old_spte, u64 new_spte, gfn_t gfn, int level)
{
+ union kvm_mmu_page_role role;
+
lockdep_assert_held_write(&kvm->mmu_lock);

/*
@@ -597,8 +775,17 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));

old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
+ if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
+ is_shadow_present_pte(new_spte)) {
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ /* Because the write spin lock is held, there is no race. It should succeed. */
+ KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte,
+ new_spte, level), kvm);
+ }

- handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+ role = sptep_to_sp(sptep)->role;
+ role.level = level;
+ handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
return old_spte;
}

@@ -621,8 +808,11 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
continue; \
else

-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \
+ for_each_tdp_pte(_iter, \
+ root_to_sp((_private) ? _mmu->private_root_hpa : \
+ _mmu->root.hpa), \
+ _start, _end)

/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -784,6 +974,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
if (!zap_private && is_private_sp(root))
return false;

+ /*
+ * start and end don't have the GFN shared bit. This function zaps
+ * a region including its alias. Adjust the shared bit of [start, end)
+ * if the root is shared.
+ */
+ start = kvm_gfn_for_root(kvm, root, start);
+ end = kvm_gfn_for_root(kvm, root, end);
+
rcu_read_lock();

for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
@@ -960,10 +1158,26 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,

if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
- else
- wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ else {
+ unsigned long pte_access = ACC_ALL;
+ gfn_t gfn = iter->gfn;
+
+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ if (fault->is_private)
+ gfn |= kvm_gfn_shared_mask(vcpu->kvm);
+ else
+ /*
+ * TDX shared GPAs are not executable; enforce
+ * this for the SDV.
+ */
+ pte_access &= ~ACC_EXEC_MASK;
+ }
+
+ wrprot = make_spte(vcpu, sp, fault->slot, pte_access, gfn,
+ fault->pfn, iter->old_spte,
+ fault->prefetch, true, fault->map_writable,
+ &new_spte);
+ }

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1041,6 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
+ gfn_t raw_gfn;
+ bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
int ret = RET_PF_RETRY;

kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1049,7 +1265,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)

rcu_read_lock();

- tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+ raw_gfn = gpa_to_gfn(fault->addr);
+
+ if (is_error_noslot_pfn(fault->pfn) ||
+ !kvm_pfn_to_refcounted_page(fault->pfn)) {
+ if (is_private) {
+ rcu_read_unlock();
+ return -EFAULT;
+ }
+ }
+
+ tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
int r;

if (fault->nx_huge_page_workaround_enabled)
@@ -1079,9 +1305,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)

sp->nx_huge_page_disallowed = fault->huge_page_disallowed;

- if (is_shadow_present_pte(iter.old_spte))
+ if (is_shadow_present_pte(iter.old_spte)) {
+ /*
+ * TODO: large page support.
+ * Large pages aren't supported for TDX yet.
+ */
+ KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
- else
+ } else
r = tdp_mmu_link_sp(kvm, &iter, sp, true);

/*
@@ -1362,6 +1593,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm

sp->role = role;
sp->spt = (void *)__get_free_page(gfp);
+ /* TODO: large page support for private GPA. */
+ WARN_ON_ONCE(kvm_mmu_page_role_is_private(role));
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
@@ -1378,6 +1611,10 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
struct kvm_mmu_page *sp;

kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ KVM_BUG_ON(kvm_mmu_page_role_is_private(role) !=
+ is_private_sptep(iter->sptep), kvm);
+ /* TODO: Large page isn't supported for private SPTE yet. */
+ KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm);

/*
* Since we are allocating while under the MMU lock we have to be
@@ -1802,7 +2039,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,

*root_level = vcpu->arch.mmu->root_role.level;

- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
@@ -1829,7 +2066,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;

- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ /* Fast page fault for private GPAs isn't supported. */
+ WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
+
+ tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index b3cf58a50357..bc9124737142 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);

-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);

__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d399009ef1d7..e27c22449d85 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -201,6 +201,7 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)

return NULL;
}
+EXPORT_SYMBOL_GPL(kvm_pfn_to_refcounted_page);

/*
* Switches to specified vcpu, until a matching vcpu_put()
--
2.25.1


2024-02-26 08:50:23

by Isaku Yamahata

Subject: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

From: Isaku Yamahata <[email protected]>

As the first step toward creating a TDX guest, add creation/destruction of
the VM structure. Assign a TDX private Host Key ID (HKID) to the TDX guest
for memory encryption and allocate extra pages for the TDX guest. On
destruction, free the allocated pages and the HKID.

Before tearing down private page tables, TDX requires some resources of the
guest TD to be destroyed (i.e. the HKID must have been reclaimed, etc.).
Add an MMU notifier release callback to handle this before tearing down the
private page tables.

Add a vm_free() hook to kvm_x86_ops, called at the end of
kvm_arch_destroy_vm(), because some per-VM TDX resources, e.g. the TDR,
need to be freed after other TDX resources, e.g. the HKID, have been freed.

Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- fix checking of the TDH.PHYMEM.PAGE.RECLAIM error code for RCX and TDR.

v18:
- Use TDH.SYS.RD() instead of struct tdsysinfo_struct.
- Rename tdx_reclaim_td_page() to tdx_reclaim_control_page()
- return -EAGAIN on TDX_RND_NO_ENTROPY of TDH.MNG.CREATE(), TDH.MNG.ADDCX()
- fix comment to remove extra the.
- use true instead of 1 for boolean.
- remove an extra white line.

v16:
- Simplified tdx_reclaim_page()
- Reorganize the locking of tdx_release_hkid(), and use smp_call_mask()
instead of smp_call_on_cpu() to hold the spinlock, avoiding a race with
invalidation when releasing guest memfd

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/Kconfig | 3 +-
arch/x86/kvm/mmu/mmu.c | 7 +
arch/x86/kvm/vmx/main.c | 26 +-
arch/x86/kvm/vmx/tdx.c | 475 ++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 6 +-
arch/x86/kvm/vmx/x86_ops.h | 6 +
arch/x86/kvm/x86.c | 1 +
9 files changed, 520 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 1b0dacc6b6c0..97f0a681e02c 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -24,7 +24,9 @@ KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP_OPTIONAL(max_vcpus);
KVM_X86_OP_OPTIONAL(vm_enable_cap)
KVM_X86_OP(vm_init)
+KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
KVM_X86_OP_OPTIONAL(vm_destroy)
+KVM_X86_OP_OPTIONAL(vm_free)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
KVM_X86_OP(vcpu_create)
KVM_X86_OP(vcpu_free)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cf8eb46b3a20..6dedb5cb71ef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1608,7 +1608,9 @@ struct kvm_x86_ops {
unsigned int vm_size;
int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
int (*vm_init)(struct kvm *kvm);
+ void (*flush_shadow_all_private)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
+ void (*vm_free)(struct kvm *kvm);

/* Create, but do not attach this VCPU */
int (*vcpu_precreate)(struct kvm *kvm);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 87e3da7b0439..bc077d6f4b43 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -76,7 +76,6 @@ config KVM_WERROR

config KVM_SW_PROTECTED_VM
bool "Enable support for KVM software-protected VMs"
- depends on EXPERT
depends on KVM && X86_64
select KVM_GENERIC_PRIVATE_MEM
help
@@ -89,6 +88,8 @@ config KVM_SW_PROTECTED_VM
config KVM_INTEL
tristate "KVM for Intel (and compatible) processors support"
depends on KVM && IA32_FEAT_CTL
+ select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
+ select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
help
Provides support for KVM on processors equipped with Intel's VT
extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c45252ed2ffd..2becc86c71b2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6866,6 +6866,13 @@ static void kvm_mmu_zap_all(struct kvm *kvm)

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
+ /*
+ * kvm_mmu_zap_all() zaps both private and shared page tables. Before
+ * tearing down private page tables, TDX requires some TD resources to
+ * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
+ * kvm_x86_flush_shadow_all_private() for this.
+ */
+ static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
kvm_mmu_zap_all(kvm);
}

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index e8a1a7533eea..437c6d5e802e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -62,11 +62,31 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
static int vt_vm_init(struct kvm *kvm)
{
if (is_td(kvm))
- return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
+ return tdx_vm_init(kvm);

return vmx_vm_init(kvm);
}

+static void vt_flush_shadow_all_private(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ tdx_mmu_release_hkid(kvm);
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return;
+
+ vmx_vm_destroy(kvm);
+}
+
+static void vt_vm_free(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ tdx_vm_free(kvm);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -101,7 +121,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vm_size = sizeof(struct kvm_vmx),
.vm_enable_cap = vt_vm_enable_cap,
.vm_init = vt_vm_init,
- .vm_destroy = vmx_vm_destroy,
+ .flush_shadow_all_private = vt_flush_shadow_all_private,
+ .vm_destroy = vt_vm_destroy,
+ .vm_free = vt_vm_free,

.vcpu_precreate = vmx_vcpu_precreate,
.vcpu_create = vmx_vcpu_create,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ee015f3ce2c9..1cf2b15da257 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,10 +5,11 @@

#include "capabilities.h"
#include "x86_ops.h"
-#include "x86.h"
#include "mmu.h"
#include "tdx_arch.h"
#include "tdx.h"
+#include "tdx_ops.h"
+#include "x86.h"

#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -22,7 +23,7 @@
/* TDX KeyID pool */
static DEFINE_IDA(tdx_guest_keyid_pool);

-static int __used tdx_guest_keyid_alloc(void)
+static int tdx_guest_keyid_alloc(void)
{
if (WARN_ON_ONCE(!tdx_guest_keyid_start || !tdx_nr_guest_keyids))
return -EINVAL;
@@ -32,7 +33,7 @@ static int __used tdx_guest_keyid_alloc(void)
GFP_KERNEL);
}

-static void __used tdx_guest_keyid_free(int keyid)
+static void tdx_guest_keyid_free(int keyid)
{
if (WARN_ON_ONCE(keyid < tdx_guest_keyid_start ||
keyid > tdx_guest_keyid_start + tdx_nr_guest_keyids - 1))
@@ -48,6 +49,8 @@ struct tdx_info {
u64 xfam_fixed0;
u64 xfam_fixed1;

+ u8 nr_tdcs_pages;
+
u16 num_cpuid_config;
/* This must the last member. */
DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
@@ -85,6 +88,282 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
return r;
}

+/*
+ * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
+ * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID, etc.) try to acquire a global lock
+ * internally in the TDX module. On failure, TDX_OPERAND_BUSY is returned without
+ * spinning or waiting due to a constraint on execution time. It's the caller's
+ * responsibility to avoid the race (or retry on TDX_OPERAND_BUSY). Use this mutex
+ * to avoid races in the TDX module because the kernel knows better about scheduling.
+ */
+static DEFINE_MUTEX(tdx_lock);
+static struct mutex *tdx_mng_key_config_lock;
+
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+ return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
+}
+
+static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->tdr_pa;
+}
+
+static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
+{
+ tdx_guest_keyid_free(kvm_tdx->hkid);
+ kvm_tdx->hkid = -1;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+ return kvm_tdx->hkid > 0;
+}
+
+static void tdx_clear_page(unsigned long page_pa)
+{
+ const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+ void *page = __va(page_pa);
+ unsigned long i;
+
+ /*
+ * When re-assigning a page from an old keyid to a new keyid, MOVDIR64B is
+ * required to clear/write the page with the new keyid to prevent an
+ * integrity error when the page is later read with the new keyid.
+ *
+ * clflush doesn't flush the cache with the HKID set. The cache line could
+ * be poisoned (even without MKTME-i); clearing the page clears the poison bit.
+ */
+ for (i = 0; i < PAGE_SIZE; i += 64)
+ movdir64b(page + i, zero_page);
+ /*
+ * MOVDIR64B store uses WC buffer. Prevent following memory reads
+ * from seeing potentially poisoned cache.
+ */
+ __mb();
+}
+
+static int __tdx_reclaim_page(hpa_t pa)
+{
+ struct tdx_module_args out;
+ u64 err;
+
+ do {
+ err = tdh_phymem_page_reclaim(pa, &out);
+ /*
+ * TDH.PHYMEM.PAGE.RECLAIM is allowed only when the TD is in
+ * shutdown state, i.e. the TD is being destroyed.
+ * TDH.PHYMEM.PAGE.RECLAIM requires the TDR and the target page.
+ * Because we're destroying the TD, contention on the TDR is rare.
+ */
+ } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX) ||
+ err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_TDR)));
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int tdx_reclaim_page(hpa_t pa)
+{
+ int r;
+
+ r = __tdx_reclaim_page(pa);
+ if (!r)
+ tdx_clear_page(pa);
+ return r;
+}
+
+static void tdx_reclaim_control_page(unsigned long td_page_pa)
+{
+ WARN_ON_ONCE(!td_page_pa);
+
+ /*
+ * TDCX pages are being reclaimed. The TDX module maps TDCX pages
+ * with the HKID assigned to the TD. The cache associated with the
+ * TD was already flushed by TDH.PHYMEM.CACHE.WB, so the cache
+ * doesn't need to be flushed again.
+ */
+ if (tdx_reclaim_page(td_page_pa))
+ /*
+ * Leak the page on failure:
+ * tdx_reclaim_page() returns an error if and only if there's an
+ * unexpected, fatal error, e.g. a SEAMCALL with bad params,
+ * incorrect concurrency in KVM, a TDX Module bug, etc.
+ * Retrying at a later point is highly unlikely to be
+ * successful.
+ * No log here as tdx_reclaim_page() already did.
+ */
+ return;
+ free_page((unsigned long)__va(td_page_pa));
+}
+
+static void tdx_do_tdh_phymem_cache_wb(void *unused)
+{
+ u64 err = 0;
+
+ do {
+ err = tdh_phymem_cache_wb(!!err);
+ } while (err == TDX_INTERRUPTED_RESUMABLE);
+
+ /* Another thread may have done it for us. */
+ if (err == TDX_NO_HKID_READY_TO_WBCACHE)
+ err = TDX_SUCCESS;
+ if (WARN_ON_ONCE(err))
+ pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ bool packages_allocated, targets_allocated;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ cpumask_var_t packages, targets;
+ u64 err;
+ int i;
+
+ if (!is_hkid_assigned(kvm_tdx))
+ return;
+
+ if (!is_td_created(kvm_tdx)) {
+ tdx_hkid_free(kvm_tdx);
+ return;
+ }
+
+ packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
+ targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
+ cpus_read_lock();
+
+ /*
+ * Multiple guest TDs can be destroyed simultaneously. Serialize to
+ * prevent tdh_phymem_cache_wb from returning TDX_BUSY.
+ */
+ mutex_lock(&tdx_lock);
+
+ /*
+ * Go through multiple TDX HKID state transitions with three SEAMCALLs
+ * to make TDH.PHYMEM.PAGE.RECLAIM() usable. Make the transitions atomic
+ * with respect to other functions that operate on private pages and
+ * Secure-EPT pages.
+ *
+ * Avoid racing with kvm_gmem_release() calling kvm_mmu_unmap_gfn_range():
+ * this runs via the mmu_release() notifier; kvm_gmem_release() runs via fput() on process exit.
+ */
+ write_lock(&kvm->mmu_lock);
+
+ for_each_online_cpu(i) {
+ if (packages_allocated &&
+ cpumask_test_and_set_cpu(topology_physical_package_id(i),
+ packages))
+ continue;
+ if (targets_allocated)
+ cpumask_set_cpu(i, targets);
+ }
+ if (targets_allocated)
+ on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
+ else
+ on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);
+ /*
+ * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
+ * tdh_mng_key_freeid() will fail.
+ */
+ err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
+ pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
+ kvm_tdx->hkid);
+ } else
+ tdx_hkid_free(kvm_tdx);
+
+ write_unlock(&kvm->mmu_lock);
+ mutex_unlock(&tdx_lock);
+ cpus_read_unlock();
+ free_cpumask_var(targets);
+ free_cpumask_var(packages);
+}
+
+void tdx_vm_free(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+ int i;
+
+ /*
+ * tdx_mmu_release_hkid() failed to reclaim the HKID. Something went
+ * badly wrong with the TDX module. Give up on freeing the TD pages.
+ * The function already warned, so don't warn again.
+ */
+ if (is_hkid_assigned(kvm_tdx))
+ return;
+
+ if (kvm_tdx->tdcs_pa) {
+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
+ if (kvm_tdx->tdcs_pa[i])
+ tdx_reclaim_control_page(kvm_tdx->tdcs_pa[i]);
+ }
+ kfree(kvm_tdx->tdcs_pa);
+ kvm_tdx->tdcs_pa = NULL;
+ }
+
+ if (!kvm_tdx->tdr_pa)
+ return;
+ if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
+ return;
+ /*
+ * The TDX module maps the TDR with the TDX global HKID and may access
+ * it while operating on the TD (especially when reclaiming the TDCS).
+ * A cache flush with the TDX global HKID is needed.
+ */
+ err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
+ tdx_global_keyid));
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ return;
+ }
+ tdx_clear_page(kvm_tdx->tdr_pa);
+
+ free_page((unsigned long)__va(kvm_tdx->tdr_pa));
+ kvm_tdx->tdr_pa = 0;
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+ hpa_t *tdr_p = param;
+ u64 err;
+
+ do {
+ err = tdh_mng_key_config(*tdr_p);
+
+ /*
+ * If it failed to generate a random key, retry it because this
+ * is typically caused by an entropy error of the CPU's random
+ * number generator.
+ */
+ } while (err == TDX_KEY_GENERATION_FAILED);
+
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm);
+
+int tdx_vm_init(struct kvm *kvm)
+{
+ /*
+ * TDX has its own limit on the number of vcpus in addition to
+ * KVM_MAX_VCPUS.
+ */
+ kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
+
+ /* Placeholder for TDX-specific logic. */
+ return __tdx_td_init(kvm);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
@@ -137,6 +416,176 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
return ret;
}

+static int __tdx_td_init(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ cpumask_var_t packages;
+ unsigned long *tdcs_pa = NULL;
+ unsigned long tdr_pa = 0;
+ unsigned long va;
+ int ret, i;
+ u64 err;
+
+ ret = tdx_guest_keyid_alloc();
+ if (ret < 0)
+ return ret;
+ kvm_tdx->hkid = ret;
+
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va)
+ goto free_hkid;
+ tdr_pa = __pa(va);
+
+ tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!tdcs_pa)
+ goto free_tdr;
+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
+ if (!va)
+ goto free_tdcs;
+ tdcs_pa[i] = __pa(va);
+ }
+
+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto free_tdcs;
+ }
+ cpus_read_lock();
+ /*
+ * At least one CPU of each package needs to be online in order to
+ * program the host key id on all packages. Check this.
+ */
+ for_each_present_cpu(i)
+ cpumask_set_cpu(topology_physical_package_id(i), packages);
+ for_each_online_cpu(i)
+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
+ if (!cpumask_empty(packages)) {
+ ret = -EIO;
+ /*
+ * Because it's hard for a human operator to figure out the
+ * reason, warn about it.
+ */
+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
+ pr_warn_ratelimited(MSG_ALLPKG);
+ goto free_packages;
+ }
+
+ /*
+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
+ * to acquire the lock; it returns TDX_OPERAND_BUSY instead and lets the
+ * caller handle the contention. This is because of the limited time
+ * usable inside the TDX module and because the OS/VMM knows better about
+ * process scheduling.
+ *
+ * APIs to acquire the lock of KOT:
+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
+ * TDH.PHYMEM.CACHE.WB.
+ */
+ mutex_lock(&tdx_lock);
+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
+ mutex_unlock(&tdx_lock);
+ if (err == TDX_RND_NO_ENTROPY) {
+ ret = -EAGAIN;
+ goto free_packages;
+ }
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
+ ret = -EIO;
+ goto free_packages;
+ }
+ kvm_tdx->tdr_pa = tdr_pa;
+
+ for_each_online_cpu(i) {
+ int pkg = topology_physical_package_id(i);
+
+ if (cpumask_test_and_set_cpu(pkg, packages))
+ continue;
+
+ /*
+ * Program the memory controller in the package with an
+ * encryption key associated with the TDX private host key id
+ * assigned to this TDR. Concurrent operations on the same memory
+ * controller result in TDX_OPERAND_BUSY. Avoid this race with a
+ * mutex.
+ */
+ mutex_lock(&tdx_mng_key_config_lock[pkg]);
+ ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
+ &kvm_tdx->tdr_pa, true);
+ mutex_unlock(&tdx_mng_key_config_lock[pkg]);
+ if (ret)
+ break;
+ }
+ cpus_read_unlock();
+ free_cpumask_var(packages);
+ if (ret) {
+ i = 0;
+ goto teardown;
+ }
+
+ kvm_tdx->tdcs_pa = tdcs_pa;
+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
+ err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
+ if (err == TDX_RND_NO_ENTROPY) {
+ /* Here it's hard to allow userspace to retry. */
+ ret = -EBUSY;
+ goto teardown;
+ }
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
+ ret = -EIO;
+ goto teardown;
+ }
+ }
+
+ /*
+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
+ * ioctl() to configure the CPUID values for the TD.
+ */
+ return 0;
+
+ /*
+ * The sequence for freeing resources from a partially initialized TD
+ * varies based on where in the initialization flow the failure occurred.
+ * Simply use the full teardown and destroy path, which naturally plays
+ * nice with partial initialization.
+ */
+teardown:
+ for (; i < tdx_info->nr_tdcs_pages; i++) {
+ if (tdcs_pa[i]) {
+ free_page((unsigned long)__va(tdcs_pa[i]));
+ tdcs_pa[i] = 0;
+ }
+ }
+ if (!kvm_tdx->tdcs_pa)
+ kfree(tdcs_pa);
+ tdx_mmu_release_hkid(kvm);
+ tdx_vm_free(kvm);
+ return ret;
+
+free_packages:
+ cpus_read_unlock();
+ free_cpumask_var(packages);
+free_tdcs:
+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
+ if (tdcs_pa[i])
+ free_page((unsigned long)__va(tdcs_pa[i]));
+ }
+ kfree(tdcs_pa);
+ kvm_tdx->tdcs_pa = NULL;
+
+free_tdr:
+ if (tdr_pa)
+ free_page((unsigned long)__va(tdr_pa));
+ kvm_tdx->tdr_pa = 0;
+free_hkid:
+ if (is_hkid_assigned(kvm_tdx))
+ tdx_hkid_free(kvm_tdx);
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -215,12 +664,13 @@ static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)

static int __init tdx_module_setup(void)
{
- u16 num_cpuid_config;
+ u16 num_cpuid_config, tdcs_base_size;
int ret;
u32 i;

struct tdx_md_map mds[] = {
TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
+ TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
};

struct tdx_metadata_field_mapping fields[] = {
@@ -273,6 +723,8 @@ static int __init tdx_module_setup(void)
c->edx = ecx_edx >> 32;
}

+ tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
+
return 0;

error_out:
@@ -319,13 +771,27 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
struct tdx_enabled enable = {
.err = ATOMIC_INIT(0),
};
+ int max_pkgs;
int r = 0;
+ int i;

+ if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
+ pr_warn("MOVDIR64B is required for TDX\n");
+ return -EOPNOTSUPP;
+ }
if (!enable_ept) {
pr_warn("Cannot enable TDX with EPT disabled\n");
return -EINVAL;
}

+ max_pkgs = topology_max_packages();
+ tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
+ GFP_KERNEL);
+ if (!tdx_mng_key_config_lock)
+ return -ENOMEM;
+ for (i = 0; i < max_pkgs; i++)
+ mutex_init(&tdx_mng_key_config_lock[i]);
+
if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
r = -ENOMEM;
goto out;
@@ -350,4 +816,5 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
void tdx_hardware_unsetup(void)
{
kfree(tdx_info);
+ kfree(tdx_mng_key_config_lock);
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 22c0b57f69ca..ae117f864cfb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -8,7 +8,11 @@

struct kvm_tdx {
struct kvm kvm;
- /* TDX specific members follow. */
+
+ unsigned long tdr_pa;
+ unsigned long *tdcs_pa;
+
+ int hkid;
};

struct vcpu_tdx {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0031a8d61589..7f123ffe4d42 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -140,6 +140,9 @@ void tdx_hardware_unsetup(void);
bool tdx_is_vm_type_supported(unsigned long type);

int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
+int tdx_vm_init(struct kvm *kvm);
+void tdx_mmu_release_hkid(struct kvm *kvm);
+void tdx_vm_free(struct kvm *kvm);
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
@@ -150,6 +153,9 @@ static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
{
return -EINVAL;
};
+static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
+static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
+static inline void tdx_vm_free(struct kvm *kvm) {}
static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
#endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3f16e9450d2f..1f01c7f91652 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12743,6 +12743,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_page_track_cleanup(kvm);
kvm_xen_destroy_vm(kvm);
kvm_hv_destroy_vm(kvm);
+ static_call_cond(kvm_x86_vm_free)(kvm);
}

static void memslot_rmap_free(struct kvm_memory_slot *slot)
--
2.25.1


2024-02-26 08:51:12

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support

From: Isaku Yamahata <[email protected]>

Implement hooks of the TDP MMU for the TDX backend: TLB flush, TLB
shootdown, propagating private EPT entry changes to the Secure EPT, and
freeing Secure EPT pages. The TLB flush handles both shared EPT and private
EPT: it flushes the shared EPT the same way as VMX, and it also waits for
the TDX TLB shootdown. The hook to free a Secure EPT page unlinks the page
from the Secure EPT so that it can be freed to the OS.

Propagate entry changes to the Secure EPT. The possible entry changes are
present -> non-present (zapping) and non-present -> present (population).
On population, just link the Secure EPT page or the private guest page to
the Secure EPT with a TDX SEAMCALL. Because the TDP MMU allows concurrent
zapping/population, zapping requires a synchronous TLB shootdown with the
frozen EPT entry: zap the secure entry, increment the TLB counter, send an
IPI to remote vCPUs to trigger a TLB flush, and then unlink the private
guest page from the Secure EPT. For simplicity, batched zapping under the
exclusive lock is handled as concurrent zapping. Although it's inefficient,
it can be optimized in the future.

For MMIO SPTE, the spte value changes as follows.
initial value (suppress VE bit is set)
-> Guest issues MMIO and triggers EPT violation
-> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
-> Guest MMIO resumes. It triggers VE exception in guest TD
-> Guest VE handler issues TDG.VP.VMCALL<MMIO>
-> KVM handles MMIO
-> Guest VE handler resumes its execution after MMIO instruction

Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- Compile fix when CONFIG_HYPERV != y.
It's needed due to the following patch; catch up with it.
https://lore.kernel.org/all/[email protected]/
- Add comments on TLB shootdown to explain the sequence.
- Use gmem_max_level callback, delete tdp_max_page_level.

v18:
- rename tdx_sept_page_aug() -> tdx_mem_page_aug()
- checkpatch: space => tab

v15 -> v16:
- Add the handling of TD_ATTR_SEPT_VE_DISABLE case.

v14 -> v15:
- Implemented tdx_flush_tlb_current()
- Removed unnecessary invept in tdx_flush_tlb(). It was carried over
from a very old code base.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 3 +-
arch/x86/kvm/vmx/main.c | 91 ++++++++-
arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 2 +-
arch/x86/kvm/vmx/tdx_ops.h | 6 +
arch/x86/kvm/vmx/x86_ops.h | 13 ++
6 files changed, 481 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 318135daf685..83926a35ea47 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,7 +74,8 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

- WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
+ WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
+ !kvm_gfn_shared_mask(vcpu->kvm));

access &= shadow_mmio_access_mask;
spte |= vcpu->kvm->arch.shadow_mmio_value | access;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 54df6653193e..8c5bac3defdf 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -29,6 +29,10 @@ static int vt_max_vcpus(struct kvm *kvm)
return kvm->max_vcpus;
}

+#if IS_ENABLED(CONFIG_HYPERV)
+static int vt_flush_remote_tlbs(struct kvm *kvm);
+#endif
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -49,11 +53,29 @@ static __init int vt_hardware_setup(void)
pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");
}

+#if IS_ENABLED(CONFIG_HYPERV)
+ /*
+ * TDX KVM overrides the flush_remote_tlbs method and assumes
+ * flush_remote_tlbs_range = NULL, which falls back to
+ * flush_remote_tlbs. Disable TDX if there is a conflict.
+ */
+ if (vt_x86_ops.flush_remote_tlbs ||
+ vt_x86_ops.flush_remote_tlbs_range) {
+ enable_tdx = false;
+ pr_warn_ratelimited("TDX requires bare metal. Not supported in a VMM guest.\n");
+ }
+#endif
+
enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
if (enable_tdx)
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
sizeof(struct kvm_tdx));

+#if IS_ENABLED(CONFIG_HYPERV)
+ if (enable_tdx)
+ vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
+#endif
+
return 0;
}

@@ -136,6 +158,56 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}

+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_flush_tlb(vcpu);
+ return;
+ }
+
+ vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_flush_tlb_current(vcpu);
+ return;
+ }
+
+ vmx_flush_tlb_current(vcpu);
+}
+
+#if IS_ENABLED(CONFIG_HYPERV)
+static int vt_flush_remote_tlbs(struct kvm *kvm)
+{
+ if (is_td(kvm))
+ return tdx_sept_flush_remote_tlbs(kvm);
+
+ /*
+ * Fall back to KVM_REQ_TLB_FLUSH.
+ * See kvm_arch_flush_remote_tlbs() and kvm_flush_remote_tlbs().
+ */
+ return -EOPNOTSUPP;
+}
+#endif
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_flush_tlb_guest(vcpu);
+}
+
static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
@@ -163,6 +235,15 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
return tdx_vcpu_ioctl(vcpu, argp);
}

+static int vt_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
+ bool is_private, u8 *max_level)
+{
+ if (is_td(kvm))
+ return tdx_gmem_max_level(kvm, pfn, gfn, is_private, max_level);
+
+ return 0;
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -228,10 +309,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_rflags = vmx_set_rflags,
.get_if_flag = vmx_get_if_flag,

- .flush_tlb_all = vmx_flush_tlb_all,
- .flush_tlb_current = vmx_flush_tlb_current,
- .flush_tlb_gva = vmx_flush_tlb_gva,
- .flush_tlb_guest = vmx_flush_tlb_guest,
+ .flush_tlb_all = vt_flush_tlb_all,
+ .flush_tlb_current = vt_flush_tlb_current,
+ .flush_tlb_gva = vt_flush_tlb_gva,
+ .flush_tlb_guest = vt_flush_tlb_guest,

.vcpu_pre_run = vmx_vcpu_pre_run,
.vcpu_run = vmx_vcpu_run,
@@ -324,6 +405,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.mem_enc_ioctl = vt_mem_enc_ioctl,
.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
+
+ .gmem_max_level = vt_gmem_max_level,
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 143a3c2a16bc..39ef80857b6a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,6 +8,7 @@
#include "mmu.h"
#include "tdx_arch.h"
#include "tdx.h"
+#include "vmx.h"
#include "x86.h"

#undef pr_fmt
@@ -364,6 +365,19 @@ static int tdx_do_tdh_mng_key_config(void *param)

int tdx_vm_init(struct kvm *kvm)
{
+ /*
+ * Because the guest TD is protected, the VMM can't parse instructions in
+ * the TD. Instead, the guest uses the MMIO hypercall. For an unmodified
+ * device driver, #VE needs to be injected for MMIO, and the #VE handler
+ * in the TD converts the MMIO instruction into the MMIO hypercall.
+ *
+ * The SPTE value for MMIO needs to be set up so that #VE is injected
+ * into the TD instead of triggering an EPT misconfig.
+ * - RWX=0 so that an EPT violation is triggered.
+ * - The suppress-#VE bit is cleared to inject #VE.
+ */
+ kvm_mmu_set_mmio_spte_value(kvm, 0);
+
/*
* This function initializes only KVM software construct. It doesn't
* initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
@@ -459,6 +473,307 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
}

+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ put_page(page);
+}
+
+static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ union tdx_sept_level_state level_state;
+ hpa_t hpa = pfn_to_hpa(pfn);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ struct tdx_module_args out;
+ union tdx_sept_entry entry;
+ u64 err;
+
+ err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
+ tdx_unpin(kvm, pfn);
+ return -EAGAIN;
+ }
+ if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
+ entry.raw = out.rcx;
+ level_state.raw = out.rdx;
+ if (level_state.level == tdx_level &&
+ level_state.state == TDX_SEPT_PENDING &&
+ entry.leaf && entry.pfn == pfn && entry.sve) {
+ tdx_unpin(kvm, pfn);
+ WARN_ON_ONCE(!(to_kvm_tdx(kvm)->attributes &
+ TDX_TD_ATTR_SEPT_VE_DISABLE));
+ return -EAGAIN;
+ }
+ }
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
+ tdx_unpin(kvm, pfn);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return -EINVAL;
+
+ /*
+ * Because restricted mem doesn't support page migration with
+ * a_ops->migrate_page (yet), no callback is triggered for KVM on
+ * page migration. Until restricted mem supports page migration,
+ * prevent page migration.
+ * TODO: Once restricted mem introduces a callback on page migration,
+ * implement it and remove the get_page()/put_page().
+ */
+ get_page(pfn_to_page(pfn));
+
+ if (likely(is_td_finalized(kvm_tdx)))
+ return tdx_mem_page_aug(kvm, gfn, level, pfn);
+
+ /* TODO: tdh_mem_page_add() comes here for the initial memory. */
+
+ return 0;
+}
+
+static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_module_args out;
+ gpa_t gpa = gfn_to_gpa(gfn);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ hpa_t hpa_with_hkid;
+ u64 err;
+
+ /* TODO: handle large pages. */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return -EINVAL;
+
+ if (unlikely(!is_hkid_assigned(kvm_tdx))) {
+ /*
+ * The HKID assigned to this TD was already freed and cache
+ * was already flushed. We don't have to flush again.
+ */
+ err = tdx_reclaim_page(hpa);
+ if (KVM_BUG_ON(err, kvm))
+ return -EIO;
+ tdx_unpin(kvm, pfn);
+ return 0;
+ }
+
+ do {
+ /*
+ * When zapping a private page, the write lock is held, so there is
+ * no race with other vCPUs' SEPT operations; the only race is with
+ * TDH.VP.ENTER.
+ */
+ err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+ } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
+ return -EIO;
+ }
+
+ hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+ do {
+ /*
+ * TDX_OPERAND_BUSY can happen when locking the PAMT entry. Because
+ * this page was removed above, other threads shouldn't be repeatedly
+ * operating on this page, so just retry in a loop.
+ */
+ err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+ } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ return -EIO;
+ }
+ tdx_clear_page(hpa);
+ tdx_unpin(kvm, pfn);
+ return 0;
+}
+
+static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ hpa_t hpa = __pa(private_spt);
+ struct tdx_module_args out;
+ u64 err;
+
+ err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+ struct tdx_module_args out;
+ u64 err;
+
+ /* This can be called when destroying the guest TD after freeing the HKID. */
+ if (unlikely(!is_hkid_assigned(kvm_tdx)))
+ return 0;
+
+ /* For now large page isn't supported yet. */
+ WARN_ON_ONCE(level != PG_LEVEL_4K);
+ err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
+ return -EIO;
+ }
+ return 0;
+}
+
+/*
+ * TLB shootdown procedure:
+ * There is a global epoch counter and each vCPU has a local epoch counter.
+ * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vCPU
+ * This blocks the subsequent creation of TLB translations for that range.
+ * It corresponds to clearing the present bits (all of RWX) in the EPT entry.
+ * - TDH.MEM.TRACK(TDR): advances the global epoch counter.
+ * - IPI to remote vCPUs
+ * - TD exit and re-entry with TDH.VP.ENTER on the remote vCPUs
+ * - On re-entry, the TDX module compares the local epoch counter with the
+ * global epoch counter. If the local epoch counter is older than the global
+ * one, it updates the local epoch counter and flushes the TLB.
+ */
+static void tdx_track(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
+ /* If the TD isn't finalized, no vCPU has run yet. */
+ if (unlikely(!is_td_finalized(kvm_tdx)))
+ return;
+
+ /*
+ * tdx_flush_tlb() uses the counter to wait for this function to issue
+ * TDH.MEM.TRACK(). A counter is used instead of a bool because multiple
+ * TDH.MEM.TRACK() calls can be issued concurrently by multiple vCPUs.
+ *
+ * Optimization: the TLB shootdown procedure described in the TDX
+ * specification is: TDH.MEM.TRACK(), send IPIs to remote vCPUs, confirm
+ * all remote vCPUs have exited to the VMM, and resume the vCPUs, both
+ * local and remote. Twist the sequence to reduce IPI overhead as follows.
+ *
+ * local remote
+ * ----- ------
+ * increment tdh_mem_track
+ *
+ * request KVM_REQ_TLB_FLUSH
+ * send IPI
+ *
+ * TDEXIT to KVM due to IPI
+ *
+ * IPI handler calls tdx_flush_tlb()
+ * to process KVM_REQ_TLB_FLUSH.
+ * spin wait for tdh_mem_track == 0
+ *
+ * TDH.MEM.TRACK()
+ *
+ * decrement tdh_mem_track
+ *
+ * complete KVM_REQ_TLB_FLUSH
+ *
+ * TDH.VP.ENTER to flush tlbs TDH.VP.ENTER to flush tlbs
+ */
+ atomic_inc(&kvm_tdx->tdh_mem_track);
+ /*
+ * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
+ * KVM_REQUEST_WAIT.
+ */
+ kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+ do {
+ err = tdh_mem_track(kvm_tdx->tdr_pa);
+ } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+
+ /* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
+ atomic_dec(&kvm_tdx->tdh_mem_track);
+
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+
+}
+
+static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ /*
+ * The HKID assigned to this TD was already freed and cache was
+ * already flushed. We don't have to flush again.
+ */
+ if (!is_hkid_assigned(kvm_tdx))
+ return tdx_reclaim_page(__pa(private_spt));
+
+ /*
+ * free_private_spt() is (obviously) called when a shadow page is being
+ * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
+ * Note: this function is for private shadow pages, not for private
+ * guest pages. Private guest pages can be zapped while the TD is
+ * active, e.g. on shared <-> private conversion and slot move/deletion.
+ */
+ KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
+ return -EINVAL;
+}
+
+int tdx_sept_flush_remote_tlbs(struct kvm *kvm)
+{
+ if (unlikely(!is_td(kvm)))
+ return -EOPNOTSUPP;
+
+ if (is_hkid_assigned(to_kvm_tdx(kvm)))
+ tdx_track(kvm);
+
+ return 0;
+}
+
+static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ /*
+ * TDX requires TLB tracking before dropping a private page. Do
+ * it here, although it is also done later.
+ * If the HKID isn't assigned, the guest is being destroyed and no
+ * vCPU runs further, so a TLB shootdown isn't needed.
+ *
+ * TODO: Call TDH.MEM.TRACK() only when we have called
+ * TDH.MEM.RANGE.BLOCK(), but not call TDH.MEM.TRACK() yet.
+ */
+ if (is_hkid_assigned(to_kvm_tdx(kvm)))
+ tdx_track(kvm);
+
+ return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
@@ -924,6 +1239,39 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return ret;
}

+void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ /*
+ * No need to flush the shared EPTP:
+ * Per "TD VCPU TLB Address Space Identifier" in the TDX module spec,
+ * the TLB entries for a TD are tagged with:
+ * SEAM (1 bit)
+ * VPID
+ * Secure EPT root (bits 51:12) with HKID = 0
+ * PCID
+ * for *both* Secure-EPT and Shared-EPT.
+ * A TLB flush with the Secure-EPT root by tdx_track() thus flushes
+ * the TLB entries of both Secure-EPT and Shared-EPT.
+ */
+
+ /*
+ * See tdx_track(). Wait for the TLB shootdown initiator to finish
+ * TDH_MEM_TRACK() so that the shared-EPT/secure-EPT TLB is flushed
+ * on the next TD enter.
+ */
+ while (atomic_read(&to_kvm_tdx(vcpu->kvm)->tdh_mem_track))
+ cpu_relax();
+}
+
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+ /*
+ * flush_tlb_current() is used only the first time a vCPU runs.
+ * As it isn't performance critical, keep this function simple.
+ */
+ tdx_track(vcpu->kvm);
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1087,6 +1435,17 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
return 0;
}

+int tdx_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
+ bool is_private, u8 *max_level)
+{
+ if (!is_private)
+ return 0;
+
+ /* TODO: Enable 2mb and 1gb large page support. */
+ *max_level = min(*max_level, PG_LEVEL_4K);
+ return 0;
+}
+
#define TDX_MD_MAP(_fid, _ptr) \
{ .fid = MD_FIELD_ID_##_fid, \
.ptr = (_ptr), }
@@ -1297,8 +1656,21 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
on_each_cpu(vmx_off, &enable.enabled, true);
cpus_read_unlock();
free_cpumask_var(enable.enabled);
+ if (r)
+ goto out;
+
+ x86_ops->link_private_spt = tdx_sept_link_private_spt;
+ x86_ops->free_private_spt = tdx_sept_free_private_spt;
+ x86_ops->set_private_spte = tdx_sept_set_private_spte;
+ x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
+ x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+
+ return 0;

out:
+ /* kfree() accepts NULL. */
+ kfree(tdx_mng_key_config_lock);
+ tdx_mng_key_config_lock = NULL;
return r;
}

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 8a0d1bfe34a0..75596b9dcf3f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -18,6 +18,7 @@ struct kvm_tdx {
int hkid;

bool finalized;
+ atomic_t tdh_mem_track;

u64 tsc_offset;
};
@@ -162,7 +163,6 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
}
return out.r8;
}
-
#else
struct kvm_tdx {
struct kvm kvm;
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index e5c069b96126..d27f281152cb 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -44,6 +44,12 @@ static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
#endif

+static inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+ WARN_ON_ONCE(level == PG_LEVEL_NONE);
+ return level - 1;
+}
+
/*
* TDX module acquires its internal lock for resources. It doesn't spin to get
* locks because of its restrictions of allowed execution time. Instead, it
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 24161fa404aa..d5f75efd87e6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -153,7 +153,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

+void tdx_flush_tlb(struct kvm_vcpu *vcpu);
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
+int tdx_sept_flush_remote_tlbs(struct kvm *kvm);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+int tdx_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
+ bool is_private, u8 *max_level);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -176,7 +181,15 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

+static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
+static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
+static inline int tdx_sept_flush_remote_tlbs(struct kvm *kvm) { return 0; }
static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
+static inline int tdx_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
+ bool is_private, u8 *max_level)
+{
+ return -EOPNOTSUPP;
+}
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 08:51:16

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 072/130] [MARKER] The start of TDX KVM patch series: TD finalization

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of patch series of TD finalization.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index c4d67dd9ddf8..46ae049b6b85 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -11,6 +11,7 @@ What qemu can do
- TDX VM TYPE is exposed to Qemu.
- Qemu can create/destroy guest of TDX vm type.
- Qemu can create/destroy vcpu of TDX vm type.
+- Qemu can populate initial guest memory image.

Patch Layer status
------------------
@@ -20,8 +21,8 @@ Patch Layer status
* TDX architectural definitions: Applied
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
-* TDX EPT violation: Applying
-* TD finalization: Not yet
+* TDX EPT violation: Applied
+* TD finalization: Applying
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet

--
2.25.1


2024-02-26 08:51:46

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 074/130] KVM: TDX: Create initial guest memory

From: Isaku Yamahata <[email protected]>

Because guest memory is protected in TDX, creating the initial guest memory
requires a dedicated TDX module API, tdh_mem_page_add, instead of directly
copying the memory contents into guest memory as is done for the default VM
type. The KVM MMU page fault handler callback, set_private_spte, handles
it.

Implement the hooks for KVM_MEMORY_MAPPING, pre_memory_mapping() and
post_memory_mapping() to handle it.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Switched to use KVM_MEMORY_MAPPING
- Dropped measurement extension
- updated commit message. private_page_add() => set_private_spte()
---
arch/x86/kvm/vmx/main.c | 23 ++++++++++
arch/x86/kvm/vmx/tdx.c | 93 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.h | 4 ++
arch/x86/kvm/vmx/x86_ops.h | 12 +++++
4 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c5672909fdae..7258a6304b4b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -252,6 +252,27 @@ static int vt_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
return 0;
}

+static int vt_pre_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code, u8 *max_level)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_pre_memory_mapping(vcpu, mapping, error_code, max_level);
+
+ if (mapping->source)
+ return -EINVAL;
+ return 0;
+}
+
+static void vt_post_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ if (!is_td_vcpu(vcpu))
+ return;
+
+ tdx_post_memory_mapping(vcpu, mapping);
+}
+
#define VMX_REQUIRED_APICV_INHIBITS \
(BIT(APICV_INHIBIT_REASON_DISABLE)| \
BIT(APICV_INHIBIT_REASON_ABSENT) | \
@@ -415,6 +436,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,

.gmem_max_level = vt_gmem_max_level,
+ .pre_memory_mapping = vt_pre_memory_mapping,
+ .post_memory_mapping = vt_post_memory_mapping,
};

struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e65fff43cb1b..8cf6e5dab3e9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -390,6 +390,7 @@ int tdx_vm_init(struct kvm *kvm)
*/
kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);

+ mutex_init(&to_kvm_tdx(kvm)->source_lock);
return 0;
}

@@ -541,6 +542,51 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
return 0;
}

+static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ struct tdx_module_args out;
+ hpa_t source_pa;
+ u64 err;
+
+ lockdep_assert_held(&kvm_tdx->source_lock);
+
+ /*
+ * KVM_MEMORY_MAPPING for a TD supports only 4K pages because
+ * tdh_mem_page_add() supports only 4K pages.
+ */
+ if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ return -EINVAL;
+
+ if (KVM_BUG_ON(!kvm_tdx->source_page, kvm)) {
+ tdx_unpin(kvm, pfn);
+ return -EINVAL;
+ }
+
+ source_pa = pfn_to_hpa(page_to_pfn(kvm_tdx->source_page));
+ do {
+ err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa,
+ &out);
+ /*
+ * This path is executed while populating the initial guest memory
+ * image, i.e. before running any vcpu. Races are rare.
+ */
+ } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+ /*
+ * Don't warn: this is for KVM_MEMORY_MAPPING, so tdh_mem_page_add()
+ * can fail with user-provided parameters.
+ */
+ if (err) {
+ tdx_unpin(kvm, pfn);
+ return -EIO;
+ }
+
+ return 0;
+}
+
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -563,9 +609,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (likely(is_td_finalized(kvm_tdx)))
return tdx_mem_page_aug(kvm, gfn, level, pfn);

- /* TODO: tdh_mem_page_add() comes here for the initial memory. */
-
- return 0;
+ return tdx_mem_page_add(kvm, gfn, level, pfn);
}

static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -1469,6 +1513,49 @@ int tdx_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
return 0;
}

+#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_GUEST_ENC_MASK)
+
+int tdx_pre_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code, u8 *max_level)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct page *page;
+ int r = 0;
+
+ /* The memory contents are needed for encryption. */
+ if (!mapping->source)
+ return -EINVAL;
+
+ /* Once TD is finalized, the initial guest memory is fixed. */
+ if (is_td_finalized(to_kvm_tdx(vcpu->kvm)))
+ return -EINVAL;
+
+ /* TDX supports only 4K pages for pre-population. */
+ *max_level = PG_LEVEL_4K;
+ *error_code = TDX_SEPT_PFERR;
+
+ r = get_user_pages_fast(mapping->source, 1, 0, &page);
+ if (r < 0)
+ return r;
+ if (r != 1)
+ return -ENOMEM;
+
+ mutex_lock(&kvm_tdx->source_lock);
+ kvm_tdx->source_page = page;
+ return 0;
+}
+
+void tdx_post_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+ put_page(kvm_tdx->source_page);
+ kvm_tdx->source_page = NULL;
+ mutex_unlock(&kvm_tdx->source_lock);
+}
+
#define TDX_MD_MAP(_fid, _ptr) \
{ .fid = MD_FIELD_ID_##_fid, \
.ptr = (_ptr), }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 75596b9dcf3f..d822e790e3e5 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -21,6 +21,10 @@ struct kvm_tdx {
atomic_t tdh_mem_track;

u64 tsc_offset;
+
+ /* For KVM_MEMORY_MAPPING */
+ struct mutex source_lock;
+ struct page *source_page;
};

struct vcpu_tdx {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 5335d35bc655..191f2964ec8e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -160,6 +160,11 @@ int tdx_sept_flush_remote_tlbs(struct kvm *kvm);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
int tdx_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
bool is_private, u8 *max_level);
+int tdx_pre_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code, u8 *max_level);
+void tdx_post_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -192,6 +197,13 @@ static inline int tdx_gmem_max_level(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
{
return -EOPNOTSUPP;
}
+int tdx_pre_memory_mapping(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code, u8 *max_level)
+{
+ return -EOPNOTSUPP;
+}
+void tdx_post_memory_mapping(struct kvm_vcpu *vcpu, struct kvm_memory_mapping *mapping) {}
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 08:52:03

by Isaku Yamahata

Subject: [PATCH v19 047/130] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of patch series of KVM TDP
refactoring for TDX.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 8b8186e7bfeb..e893a3d714c7 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -25,6 +25,6 @@ Patch Layer status
* TD vcpu enter/exit: Not yet
* TD vcpu interrupts/exit/hypercall: Not yet

-* KVM MMU GPA shared bits: Applying
-* KVM TDP refactoring for TDX: Not yet
+* KVM MMU GPA shared bits: Applied
+* KVM TDP refactoring for TDX: Applying
* KVM TDP MMU hooks: Not yet
--
2.25.1


2024-02-26 08:52:03

by Isaku Yamahata

Subject: [PATCH v19 052/130] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis

From: Isaku Yamahata <[email protected]>

TDX will use a different shadow PTE entry value for MMIO than VMX does. Add
a member to kvm_arch and track the MMIO value per-VM instead of in a global
variable. By using the per-VM EPT entry value for MMIO, the existing VMX
logic keeps working. Introduce a separate setter function so that a guest
TD can override the value later.

Also require MMIO SPTE caching for TDX. This always holds in practice,
because TDX requires EPT and KVM's EPT support allows MMIO SPTE caching.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- fix typo in the commit message.
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 8 +++++---
arch/x86/kvm/mmu/spte.c | 10 ++++++++--
arch/x86/kvm/mmu/spte.h | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
6 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index de6dd42d226f..6c10d8d1017f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1312,6 +1312,8 @@ struct kvm_arch {
*/
spinlock_t mmu_unsync_pages_lock;

+ u64 shadow_mmio_value;
+
struct iommu_domain *iommu_domain;
bool iommu_noncoherent;
#define __KVM_HAVE_ARCH_NONCOHERENT_DMA
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 395b55684cb9..ab2854f337ab 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -101,6 +101,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)
}

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 211c0e72f45d..84e7a289ad07 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2515,7 +2515,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
return kvm_mmu_prepare_zap_page(kvm, child,
invalid_list);
}
- } else if (is_mmio_spte(pte)) {
+ } else if (is_mmio_spte(kvm, pte)) {
mmu_spte_clear_no_track(spte);
}
return 0;
@@ -4184,7 +4184,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
if (WARN_ON_ONCE(reserved))
return -EINVAL;

- if (is_mmio_spte(spte)) {
+ if (is_mmio_spte(vcpu->kvm, spte)) {
gfn_t gfn = get_mmio_spte_gfn(spte);
unsigned int access = get_mmio_spte_access(spte);

@@ -4837,7 +4837,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
- if (unlikely(is_mmio_spte(*sptep))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
if (gfn != get_mmio_spte_gfn(*sptep)) {
mmu_spte_clear_no_track(sptep);
return true;
@@ -6357,6 +6357,8 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)

void kvm_mmu_init_vm(struct kvm *kvm)
{
+
+ kvm->arch.shadow_mmio_value = shadow_mmio_value;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 02a466de2991..318135daf685 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,10 +74,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

- WARN_ON_ONCE(!shadow_mmio_value);
+ WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);

access &= shadow_mmio_access_mask;
- spte |= shadow_mmio_value | access;
+ spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
@@ -411,6 +411,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+ kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
+
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
{
/* shadow_me_value must be a subset of shadow_me_mask */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 26bc95bbc962..1a163aee9ec6 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -264,9 +264,9 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}

-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
- return (spte & shadow_mmio_mask) == shadow_mmio_value &&
+ return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
likely(enable_mmio_caching);
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bdeb23ff9e71..04c6af49c3e8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -462,8 +462,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
- if (WARN_ON_ONCE(!is_mmio_spte(old_spte) &&
- !is_mmio_spte(new_spte) &&
+ if (WARN_ON_ONCE(!is_mmio_spte(kvm, old_spte) &&
+ !is_mmio_spte(kvm, new_spte) &&
!is_removed_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
@@ -978,7 +978,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
}

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
- if (unlikely(is_mmio_spte(new_spte))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
--
2.25.1


2024-02-26 08:52:15

by Isaku Yamahata

Subject: [PATCH v19 075/130] KVM: TDX: Extend memory measurement with initial guest memory

From: Isaku Yamahata <[email protected]>

TDX allows extending the memory measurement with the initial guest memory.
Define a new subcommand, KVM_TDX_EXTEND_MEMORY, of the VM-scoped
KVM_MEMORY_ENCRYPT_OP. It extends the memory measurement of the TDX guest.
The memory region must first be populated with the KVM_MEMORY_MAPPING
command.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- newly added
- Split KVM_TDX_INIT_MEM_REGION into only extension function
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 64 +++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 4000a2e087a8..34167404020c 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -572,6 +572,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CAPABILITIES = 0,
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
+ KVM_TDX_EXTEND_MEMORY,

KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8cf6e5dab3e9..3cfba63a7762 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1339,6 +1339,67 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
tdx_track(vcpu->kvm);
}

+static int tdx_extend_memory(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct kvm_memory_mapping mapping;
+ struct tdx_module_args out;
+ bool extended = false;
+ int idx, ret = 0;
+ gpa_t gpa;
+ u64 err;
+ int i;
+
+ /* Once TD is finalized, the initial guest memory is fixed. */
+ if (is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ if (cmd->flags)
+ return -EINVAL;
+
+ if (copy_from_user(&mapping, (void __user *)cmd->data, sizeof(mapping)))
+ return -EFAULT;
+
+ /* Sanity check */
+ if (mapping.source || !mapping.nr_pages ||
+ mapping.nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) ||
+ mapping.base_gfn + (mapping.nr_pages << PAGE_SHIFT) <= mapping.base_gfn ||
+ !kvm_is_private_gpa(kvm, mapping.base_gfn) ||
+ !kvm_is_private_gpa(kvm, mapping.base_gfn + (mapping.nr_pages << PAGE_SHIFT)))
+ return -EINVAL;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ while (mapping.nr_pages) {
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ gpa = gfn_to_gpa(mapping.base_gfn);
+ for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+ err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out);
+ if (err) {
+ ret = -EIO;
+ break;
+ }
+ }
+ mapping.base_gfn++;
+ mapping.nr_pages--;
+ extended = true;
+ }
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ if (extended && mapping.nr_pages > 0)
+ ret = -EAGAIN;
+ if (copy_to_user((void __user *)cmd->data, &mapping, sizeof(mapping)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1358,6 +1419,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_INIT_VM:
r = tdx_td_init(kvm, &tdx_cmd);
break;
+ case KVM_TDX_EXTEND_MEMORY:
+ r = tdx_extend_memory(kvm, &tdx_cmd);
+ break;
default:
r = -EINVAL;
goto out;
--
2.25.1


2024-02-26 08:52:21

by Isaku Yamahata

Subject: [PATCH v19 076/130] KVM: TDX: Finalize VM initialization

From: Isaku Yamahata <[email protected]>

To protect the initial contents of the guest TD, the TDX module measures
the guest TD during the build process with a SHA-384 measurement. The
measurement of the guest TD contents must be completed to make the guest
TD ready to run.

Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
to run.

Signed-off-by: Isaku Yamahata <[email protected]>

---
v18:
- Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.

v14 -> v15:
- removed unconditional tdx_track() by tdx_flush_tlb_current() that
does tdx_track().

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 34167404020c..c160f60189d1 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -573,6 +573,7 @@ enum kvm_tdx_cmd_id {
KVM_TDX_INIT_VM,
KVM_TDX_INIT_VCPU,
KVM_TDX_EXTEND_MEMORY,
+ KVM_TDX_FINALIZE_VM,

KVM_TDX_CMD_NR_MAX,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3cfba63a7762..6aff3f7e2488 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1400,6 +1400,24 @@ static int tdx_extend_memory(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return ret;
}

+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+ return -EINVAL;
+
+ err = tdh_mr_finalize(kvm_tdx->tdr_pa);
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
+ return -EIO;
+ }
+
+ kvm_tdx->finalized = true;
+ return 0;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -1422,6 +1440,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
case KVM_TDX_EXTEND_MEMORY:
r = tdx_extend_memory(kvm, &tdx_cmd);
break;
+ case KVM_TDX_FINALIZE_VM:
+ r = tdx_td_finalizemr(kvm);
+ break;
default:
r = -EINVAL;
goto out;
--
2.25.1


2024-02-26 08:52:37

by Isaku Yamahata

Subject: [PATCH v19 054/130] KVM: VMX: Introduce test mode related to EPT violation VE

From: Isaku Yamahata <[email protected]>

To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM programs
the TDX module to inject #VE conditionally and sets the #VE suppress bit in
EPT entries. In the VMX case, #VE isn't used; if #VE happens for VMX, it's
a bug. To be defensive (i.e. to test that the VMX case isn't broken),
introduce the module option ept_violation_ve_test; when it's set, intercept
unexpected #VE and report an error.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/vmx.h | 12 +++++++
arch/x86/kvm/vmx/vmcs.h | 5 +++
arch/x86/kvm/vmx/vmx.c | 69 +++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +++-
4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 76ed39541a52..f703bae0c4ac 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -70,6 +70,7 @@
#define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
#define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
#define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
#define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
#define SECONDARY_EXEC_ENABLE_XSAVES VMCS_CONTROL_BIT(XSAVES)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -225,6 +226,8 @@ enum vmcs_field {
VMREAD_BITMAP_HIGH = 0x00002027,
VMWRITE_BITMAP = 0x00002028,
VMWRITE_BITMAP_HIGH = 0x00002029,
+ VE_INFORMATION_ADDRESS = 0x0000202A,
+ VE_INFORMATION_ADDRESS_HIGH = 0x0000202B,
XSS_EXIT_BITMAP = 0x0000202C,
XSS_EXIT_BITMAP_HIGH = 0x0000202D,
ENCLS_EXITING_BITMAP = 0x0000202E,
@@ -630,4 +633,13 @@ enum vmx_l1d_flush_state {

extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;

+struct vmx_ve_information {
+ u32 exit_reason;
+ u32 delivery;
+ u64 exit_qualification;
+ u64 guest_linear_address;
+ u64 guest_physical_address;
+ u16 eptp_index;
+};
+
#endif
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 7c1996b433e2..b25625314658 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -140,6 +140,11 @@ static inline bool is_nm_fault(u32 intr_info)
return is_exception_n(intr_info, NM_VECTOR);
}

+static inline bool is_ve_fault(u32 intr_info)
+{
+ return is_exception_n(intr_info, VE_VECTOR);
+}
+
/* Undocumented: icebp/int1 */
static inline bool is_icebp(u32 intr_info)
{
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d928acc15d0f..6fe895bd7807 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -127,6 +127,9 @@ module_param(error_on_inconsistent_vmcs_config, bool, 0444);
static bool __read_mostly dump_invalid_vmcs = 0;
module_param(dump_invalid_vmcs, bool, 0644);

+static bool __read_mostly ept_violation_ve_test;
+module_param(ept_violation_ve_test, bool, 0444);
+
#define MSR_BITMAP_MODE_X2APIC 1
#define MSR_BITMAP_MODE_X2APIC_APICV 2

@@ -862,6 +865,13 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)

eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
(1u << DB_VECTOR) | (1u << AC_VECTOR);
+ /*
+ * #VE isn't used for VMX, but for TDX. To test against unexpected
+ * change related to #VE for VMX, intercept unexpected #VE and warn on
+ * it.
+ */
+ if (ept_violation_ve_test)
+ eb |= 1u << VE_VECTOR;
/*
* Guest access to VMware backdoor ports could legitimately
* trigger #GP because of TSS I/O permission bitmap.
@@ -2597,6 +2607,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_cpu_based_2nd_exec_control))
return -EIO;
}
+ if (!ept_violation_ve_test)
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
+
#ifndef CONFIG_X86_64
if (!(_cpu_based_2nd_exec_control &
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
@@ -2621,6 +2634,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
return -EIO;

vmx_cap->ept = 0;
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
}
if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
vmx_cap->vpid) {
@@ -4584,6 +4598,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
if (!enable_ept) {
exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+ exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
enable_unrestricted_guest = 0;
}
if (!enable_unrestricted_guest)
@@ -4707,8 +4722,40 @@ static void init_vmcs(struct vcpu_vmx *vmx)

exec_controls_set(vmx, vmx_exec_control(vmx));

- if (cpu_has_secondary_exec_ctrls())
+ if (cpu_has_secondary_exec_ctrls()) {
secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
+ if (secondary_exec_controls_get(vmx) &
+ SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ if (!vmx->ve_info) {
+ /* ve_info must be page aligned. */
+ struct page *page;
+
+ BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (page)
+ vmx->ve_info = page_to_virt(page);
+ }
+ if (vmx->ve_info) {
+ /*
+ * Allow #VE delivery. CPU sets this field to
+ * 0xFFFFFFFF on #VE delivery. Another #VE can
+ * occur only if software clears the field.
+ */
+ vmx->ve_info->delivery = 0;
+ vmcs_write64(VE_INFORMATION_ADDRESS,
+ __pa(vmx->ve_info));
+ } else {
+ /*
+ * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
+ * used only when ept_violation_ve_test is true,
+ * it's okay to go with the bit disabled.
+ */
+ pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
+ secondary_exec_controls_clearbit(vmx,
+ SECONDARY_EXEC_EPT_VIOLATION_VE);
+ }
+ }
+ }

if (cpu_has_tertiary_exec_ctrls())
tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
@@ -5196,6 +5243,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
if (is_invalid_opcode(intr_info))
return handle_ud(vcpu);

+ /*
+ * #VE isn't supposed to happen. Although vcpu can send
+ */
+ if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
+ return -EIO;
+
error_code = 0;
if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
@@ -6383,6 +6436,18 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
pr_err("Virtual processor ID = 0x%04x\n",
vmcs_read16(VIRTUAL_PROCESSOR_ID));
+ if (secondary_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ struct vmx_ve_information *ve_info;
+
+ pr_err("VE info address = 0x%016llx\n",
+ vmcs_read64(VE_INFORMATION_ADDRESS));
+ ve_info = __va(vmcs_read64(VE_INFORMATION_ADDRESS));
+ pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
+ ve_info->exit_reason, ve_info->delivery,
+ ve_info->exit_qualification,
+ ve_info->guest_linear_address,
+ ve_info->guest_physical_address, ve_info->eptp_index);
+ }
}

/*
@@ -7423,6 +7488,8 @@ void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_vpid(vmx->vpid);
nested_vmx_free_vcpu(vcpu);
free_loaded_vmcs(vmx->loaded_vmcs);
+ if (vmx->ve_info)
+ free_page((unsigned long)vmx->ve_info);
}

int vmx_vcpu_create(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 04ed2a9eada1..79ff54f08fee 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -349,6 +349,9 @@ struct vcpu_vmx {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
} shadow_msr_intercept;
+
+ /* ve_info must be page aligned. */
+ struct vmx_ve_information *ve_info;
};

struct kvm_vmx {
@@ -561,7 +564,8 @@ static inline u8 vmx_get_rvi(void)
SECONDARY_EXEC_ENABLE_VMFUNC | \
SECONDARY_EXEC_BUS_LOCK_DETECTION | \
SECONDARY_EXEC_NOTIFY_VM_EXITING | \
- SECONDARY_EXEC_ENCLS_EXITING)
+ SECONDARY_EXEC_ENCLS_EXITING | \
+ SECONDARY_EXEC_EPT_VIOLATION_VE)

#define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
#define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL \
--
2.25.1


2024-02-26 08:52:46

by Isaku Yamahata

Subject: [PATCH v19 077/130] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of patch series of TD vcpu
enter/exit.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 46ae049b6b85..33e107bcb5cf 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -12,6 +12,7 @@ What qemu can do
- Qemu can create/destroy guest of TDX vm type.
- Qemu can create/destroy vcpu of TDX vm type.
- Qemu can populate initial guest memory image.
+- Qemu can finalize guest TD.

Patch Layer status
------------------
@@ -22,8 +23,8 @@ Patch Layer status
* TD VM creation/destruction: Applied
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
-* TD finalization: Applying
-* TD vcpu enter/exit: Not yet
+* TD finalization: Applied
+* TD vcpu enter/exit: Applying
* TD vcpu interrupts/exit/hypercall: Not yet

* KVM MMU GPA shared bits: Applied
--
2.25.1


2024-02-26 08:53:43

by Isaku Yamahata

Subject: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

From: Isaku Yamahata <[email protected]>

Implement running a TDX vcpu. Once a vcpu runs on a logical processor
(LP), the TDX vcpu is associated with that LP. When the TDX vcpu moves to
another LP, its state on the old LP needs to be flushed. When destroying a
TDX vcpu, the flush must be completed and the CPU memory cache flushed as
well. Track which LP the TDX vcpu last ran on and flush it as necessary.

Do nothing on the sched_in event, as TDX doesn't support pause-loop
exiting.

TDX vcpu execution requires restoring the PMU debug store after returning
to KVM because the TDX module unconditionally resets the value. To reuse
the existing code, export perf_restore_debug_store.

Signed-off-by: Isaku Yamahata <[email protected]>

---
v19:
- Removed export_symbol_gpl(host_xcr0) to the patch that uses it

Changes v15 -> v16:
- use __seamcall_saved_ret()
- As struct tdx_module_args doesn't match with vcpu.arch.regs, copy regs
before/after calling __seamcall_saved_ret().

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 21 +++++++++-
arch/x86/kvm/vmx/tdx.c | 84 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 33 +++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 +
4 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7258a6304b4b..d72651ce99ac 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -158,6 +158,23 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}

+static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ /* Unconditionally continue to vcpu_run(). */
+ return 1;
+
+ return vmx_vcpu_pre_run(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_run(vcpu);
+
+ return vmx_vcpu_run(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -343,8 +360,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.flush_tlb_gva = vt_flush_tlb_gva,
.flush_tlb_guest = vt_flush_tlb_guest,

- .vcpu_pre_run = vmx_vcpu_pre_run,
- .vcpu_run = vmx_vcpu_run,
+ .vcpu_pre_run = vt_vcpu_pre_run,
+ .vcpu_run = vt_vcpu_run,
.handle_exit = vmx_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6aff3f7e2488..fdf9196cb592 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -11,6 +11,9 @@
#include "vmx.h"
#include "x86.h"

+#include <trace/events/kvm.h>
+#include "trace.h"
+
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

@@ -491,6 +494,87 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
}

+static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
+{
+ struct tdx_module_args args;
+
+ /*
+ * Avoid section mismatch with to_tdx() with KVM_VM_BUG(). The caller
+ * should call to_tdx().
+ */
+ struct kvm_vcpu *vcpu = &tdx->vcpu;
+
+ guest_state_enter_irqoff();
+
+ /*
+ * TODO: optimization:
+ * - Eliminate copy between args and vcpu->arch.regs.
+ * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
+ * which means TDG.VP.VMCALL.
+ */
+ args = (struct tdx_module_args) {
+ .rcx = tdx->tdvpr_pa,
+#define REG(reg, REG) .reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
+ REG(rdx, RDX),
+ REG(r8, R8),
+ REG(r9, R9),
+ REG(r10, R10),
+ REG(r11, R11),
+ REG(r12, R12),
+ REG(r13, R13),
+ REG(r14, R14),
+ REG(r15, R15),
+ REG(rbx, RBX),
+ REG(rdi, RDI),
+ REG(rsi, RSI),
+#undef REG
+ };
+
+ tdx->exit_reason.full = __seamcall_saved_ret(TDH_VP_ENTER, &args);
+
+#define REG(reg, REG) vcpu->arch.regs[VCPU_REGS_ ## REG] = args.reg
+ REG(rcx, RCX);
+ REG(rdx, RDX);
+ REG(r8, R8);
+ REG(r9, R9);
+ REG(r10, R10);
+ REG(r11, R11);
+ REG(r12, R12);
+ REG(r13, R13);
+ REG(r14, R14);
+ REG(r15, R15);
+ REG(rbx, RBX);
+ REG(rdi, RDI);
+ REG(rsi, RSI);
+#undef REG
+
+ WARN_ON_ONCE(!kvm_rebooting &&
+ (tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);
+
+ guest_state_exit_irqoff();
+}
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (unlikely(!tdx->initialized))
+ return -EINVAL;
+ if (unlikely(vcpu->kvm->vm_bugged)) {
+ tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
+ return EXIT_FASTPATH_NONE;
+ }
+
+ trace_kvm_entry(vcpu);
+
+ tdx_vcpu_enter_exit(tdx);
+
+ vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+ trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+ return EXIT_FASTPATH_NONE;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index d822e790e3e5..81d301fbe638 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -27,6 +27,37 @@ struct kvm_tdx {
struct page *source_page;
};

+union tdx_exit_reason {
+ struct {
+ /* 31:0 mirror the VMX Exit Reason format */
+ u64 basic : 16;
+ u64 reserved16 : 1;
+ u64 reserved17 : 1;
+ u64 reserved18 : 1;
+ u64 reserved19 : 1;
+ u64 reserved20 : 1;
+ u64 reserved21 : 1;
+ u64 reserved22 : 1;
+ u64 reserved23 : 1;
+ u64 reserved24 : 1;
+ u64 reserved25 : 1;
+ u64 bus_lock_detected : 1;
+ u64 enclave_mode : 1;
+ u64 smi_pending_mtf : 1;
+ u64 smi_from_vmx_root : 1;
+ u64 reserved30 : 1;
+ u64 failed_vmentry : 1;
+
+ /* 63:32 are TDX specific */
+ u64 details_l1 : 8;
+ u64 class : 8;
+ u64 reserved61_48 : 14;
+ u64 non_recoverable : 1;
+ u64 error : 1;
+ };
+ u64 full;
+};
+
struct vcpu_tdx {
struct kvm_vcpu vcpu;

@@ -34,6 +65,8 @@ struct vcpu_tdx {
unsigned long *tdvpx_pa;
bool td_vcpu_created;

+ union tdx_exit_reason exit_reason;
+
bool initialized;

/*
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 191f2964ec8e..3e29a6fe28ef 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -150,6 +150,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -184,6 +185,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1


2024-02-26 08:54:15

by Isaku Yamahata

Subject: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD

From: Isaku Yamahata <[email protected]>

On exiting from the guest TD, the XSAVE state is clobbered. Restore the
host XSAVE state on TD exit.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- Add EXPORT_SYMBOL_GPL(host_xcr0)

v15 -> v16:
- Added CET flag mask

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
2 files changed, 20 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9616b1aab6ce..199226c6cf55 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
#include <linux/cpu.h>
#include <linux/mmu_context.h>

+#include <asm/fpu/xcr.h>
#include <asm/tdx.h>

#include "capabilities.h"
@@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
}

+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+ if (static_cpu_has(X86_FEATURE_XSAVE) &&
+ host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+ if (static_cpu_has(X86_FEATURE_XSAVES) &&
+ /* PT can be exposed to TD guest regardless of KVM's XSS support */
+ host_xss != (kvm_tdx->xfam &
+ (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))
+ wrmsrl(MSR_IA32_XSS, host_xss);
+ if (static_cpu_has(X86_FEATURE_PKU) &&
+ (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+ write_pkru(vcpu->arch.host_pkru);
+}
+
static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
{
struct tdx_module_args args;
@@ -609,6 +627,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

+ tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23ece956c816..b361d948140f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -315,6 +315,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
};

u64 __read_mostly host_xcr0;
+EXPORT_SYMBOL_GPL(host_xcr0);

static struct kmem_cache *x86_emulator_cache;

--
2.25.1


2024-02-26 08:54:35

by Isaku Yamahata

Subject: [PATCH v19 081/130] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr

From: Chao Gao <[email protected]>

Several MSRs are constant and only used in userspace (ring 3), but VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when
the kernel returns to userspace. To eliminate unnecessary wrmsrs, KVM also
caches the value it last wrote to each MSR.

The TDX module unconditionally resets some of these MSRs to the
architectural INIT state on TD exit, which makes the cached values in
kvm_user_return_msrs inconsistent with the values in hardware. This
inconsistency needs to be fixed; otherwise, it may mislead
kvm_on_user_return() into skipping the restoration of some MSRs to the
host's values. kvm_set_user_return_msr() can correct this case, but it is
not optimal as it always does a wrmsr. So, introduce a variation of
kvm_set_user_return_msr() that only updates the cached value and skips the
wrmsr.
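
The split between the two paths can be sketched as a toy model (struct
and function names here are illustrative stand-ins for the KVM internals,
with a counter in place of real wrmsr instructions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of kvm_user_return_msrs. */
#define NR_URET_MSRS 4

struct urn_msrs {
	bool registered;		/* user-return notifier armed? */
	int wrmsr_count;		/* counts simulated wrmsr instructions */
	uint64_t host[NR_URET_MSRS];	/* values to restore on return to ring 3 */
	uint64_t curr[NR_URET_MSRS];	/* last value known to be in hardware */
};

/* Existing path: write the MSR (unless cached) and arm the notifier. */
static void set_user_return_msr(struct urn_msrs *m, int slot, uint64_t val)
{
	if (m->curr[slot] == val)
		return;			/* cache hit: skip the wrmsr */
	m->wrmsr_count++;		/* simulated wrmsr */
	m->curr[slot] = val;
	m->registered = true;
}

/*
 * New variant: hardware was already changed behind KVM's back (by the TDX
 * module), so only refresh the cache and arm the notifier; no wrmsr.
 */
static void user_return_update_cache(struct urn_msrs *m, int slot, uint64_t val)
{
	m->curr[slot] = val;
	m->registered = true;
}

/* Notifier: restore host values for every slot whose cache differs. */
static void on_user_return(struct urn_msrs *m)
{
	for (int i = 0; i < NR_URET_MSRS; i++) {
		if (m->curr[i] != m->host[i]) {
			m->wrmsr_count++;	/* simulated wrmsr */
			m->curr[i] = m->host[i];
		}
	}
	m->registered = false;
}
```

With a stale cache, on_user_return() would wrongly skip a slot; updating
only the cache after TD exit keeps the notifier's comparison honest
without paying for a wrmsr that hardware has already absorbed.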

Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 25 ++++++++++++++++++++-----
2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 36694e784c27..3ab85c3d86ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2259,6 +2259,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
int kvm_add_user_return_msr(u32 msr);
int kvm_find_user_return_msr(u32 msr);
int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_update_cache(unsigned int index, u64 val);

static inline bool kvm_is_supported_user_return_msr(u32 msr)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b361d948140f..1b189e86a1f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -440,6 +440,15 @@ static void kvm_user_return_msr_cpu_online(void)
}
}

+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+ if (!msrs->registered) {
+ msrs->urn.on_user_return = kvm_on_user_return;
+ user_return_notifier_register(&msrs->urn);
+ msrs->registered = true;
+ }
+}
+
int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
unsigned int cpu = smp_processor_id();
@@ -454,15 +463,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
return 1;

msrs->values[slot].curr = value;
- if (!msrs->registered) {
- msrs->urn.on_user_return = kvm_on_user_return;
- user_return_notifier_register(&msrs->urn);
- msrs->registered = true;
- }
+ kvm_user_return_register_notifier(msrs);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);

+/* Update the cache, "curr", and register the notifier */
+void kvm_user_return_update_cache(unsigned int slot, u64 value)
+{
+ struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+ msrs->values[slot].curr = value;
+ kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
+
static void drop_user_return_notifiers(void)
{
unsigned int cpu = smp_processor_id();
--
2.25.1


2024-02-26 08:55:04

by Isaku Yamahata

Subject: [PATCH v19 082/130] KVM: TDX: restore user ret MSRs

From: Isaku Yamahata <[email protected]>

Several user-return MSRs are clobbered on TD exit. Restore their values
after TD exit and before returning to ring 3. Because TSX_CTRL requires
special treatment, this patch doesn't address it.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 199226c6cf55..7e2b1e554246 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -535,6 +535,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
}

+struct tdx_uret_msr {
+ u32 msr;
+ unsigned int slot;
+ u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+ {.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
+ {.msr = MSR_STAR,},
+ {.msr = MSR_LSTAR,},
+ {.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_update_cache(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+ kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
+ tdx_uret_msrs[i].defval);
+}
+
static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -627,6 +649,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

+ tdx_user_return_update_cache();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

@@ -1972,6 +1995,26 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
return -EINVAL;
}

+ for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+ /*
+ * Check that the MSRs (tdx_uret_msrs) can be saved/restored
+ * before returning to user space.
+ *
+ * this_cpu_ptr(user_return_msrs)->registered isn't checked
+ * because the registration is done at vcpu runtime by
+ * kvm_set_user_return_msr().
+ * This is CPU feature setup before any vcpu has run, so
+ * registered is still false here.
+ */
+ tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+ if (tdx_uret_msrs[i].slot == -1) {
+ /* If any MSR isn't supported, it is a KVM bug */
+ pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+ tdx_uret_msrs[i].msr);
+ return -EIO;
+ }
+ }
+
max_pkgs = topology_max_packages();
tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
GFP_KERNEL);
--
2.25.1


2024-02-26 08:55:33

by Isaku Yamahata

Subject: [PATCH v19 083/130] KVM: TDX: Add TSX_CTRL msr into uret_msrs list

From: Yang Weijiang <[email protected]>

The TDX module resets the TSX_CTRL MSR to 0 on TD exit if TSX is enabled
for the TD; otherwise, it preserves the TSX_CTRL MSR. The VMM can rely on
the uret_msrs mechanism to defer the reload of the host value until
exiting to user space.
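
The guest-TSX check boils down to testing the HLE and RTM bits in the
CPUID leaf 0x7 EBX value configured for the TD; a minimal standalone
sketch (the function name is an illustrative stand-in; bit positions per
the SDM):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of HLE and RTM in CPUID.(EAX=7,ECX=0):EBX, per the SDM. */
#define FEATURE_HLE_BIT 4
#define FEATURE_RTM_BIT 11

/*
 * Mirrors the idea of tdparams_tsx_supported(): the TD has TSX enabled
 * iff HLE or RTM is set in the CPUID leaf 0x7 EBX value configured for it.
 */
static bool td_tsx_supported(uint32_t leaf7_ebx)
{
	uint32_t mask = (1u << FEATURE_HLE_BIT) | (1u << FEATURE_RTM_BIT);

	return (leaf7_ebx & mask) != 0;
}
```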

Signed-off-by: Yang Weijiang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- fix the type of tdx_uret_tsx_ctrl_slot. unsigned int => int.
---
arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.h | 8 ++++++++
2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7e2b1e554246..83dcaf5b6fbd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -547,14 +547,21 @@ static struct tdx_uret_msr tdx_uret_msrs[] = {
{.msr = MSR_LSTAR,},
{.msr = MSR_TSC_AUX,},
};
+static int tdx_uret_tsx_ctrl_slot;

-static void tdx_user_return_update_cache(void)
+static void tdx_user_return_update_cache(struct kvm_vcpu *vcpu)
{
int i;

for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
tdx_uret_msrs[i].defval);
+ /*
+ * TSX_CTRL is reset to 0 if guest TSX is supported. Otherwise
+ * preserved.
+ */
+ if (to_kvm_tdx(vcpu->kvm)->tsx_supported && tdx_uret_tsx_ctrl_slot != -1)
+ kvm_user_return_update_cache(tdx_uret_tsx_ctrl_slot, 0);
}

static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
@@ -649,7 +656,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

- tdx_user_return_update_cache();
+ tdx_user_return_update_cache(vcpu);
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

@@ -1167,6 +1174,22 @@ static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_pa
return 0;
}

+static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
+{
+ const struct kvm_cpuid_entry2 *entry;
+ u64 mask;
+ u32 ebx;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
+ if (entry)
+ ebx = entry->ebx;
+ else
+ ebx = 0;
+
+ mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
+ return ebx & mask;
+}
+
static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
struct kvm_tdx_init_vm *init_vm)
{
@@ -1209,6 +1232,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);

+ to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
return 0;
}

@@ -2014,6 +2038,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
return -EIO;
}
}
+ tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
+ if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
+ pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
+ return -EIO;
+ }

max_pkgs = topology_max_packages();
tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e96c416e73bf..44eab734e702 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,6 +17,14 @@ struct kvm_tdx {
u64 xfam;
int hkid;

+ /*
+ * Used on each TD-exit, see tdx_user_return_update_cache().
+ * TSX_CTRL value on TD exit
+ * - set 0 if guest TSX enabled
+ * - preserved if guest TSX disabled
+ */
+ bool tsx_supported;
+
bool finalized;
atomic_t tdh_mem_track;

--
2.25.1


2024-02-26 08:55:39

by Isaku Yamahata

Subject: [PATCH v19 084/130] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls

From: Isaku Yamahata <[email protected]>

This empty commit is to mark the start of the patch series for TD vcpu
exits, interrupts, and hypercalls.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 33e107bcb5cf..7a16fa284b6f 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -13,6 +13,7 @@ What qemu can do
- Qemu can create/destroy vcpu of TDX vm type.
- Qemu can populate initial guest memory image.
- Qemu can finalize guest TD.
+- Qemu can start to run vcpu. But vcpu can not make progress yet.

Patch Layer status
------------------
@@ -24,7 +25,7 @@ Patch Layer status
* TD vcpu creation/destruction: Applied
* TDX EPT violation: Applied
* TD finalization: Applied
-* TD vcpu enter/exit: Applying
+* TD vcpu enter/exit: Applied
* TD vcpu interrupts/exit/hypercall: Not yet

* KVM MMU GPA shared bits: Applied
--
2.25.1


2024-02-26 08:56:03

by Isaku Yamahata

Subject: [PATCH v19 085/130] KVM: TDX: Complete interrupts after tdexit

From: Isaku Yamahata <[email protected]>

This corresponds to VMX __vmx_complete_interrupts(). Because TDX
virtualizes the vAPIC, KVM only needs to care about NMI injection.
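
The "re-read only when needed" shape of the helper can be sketched as a
toy model (names are illustrative; a counter stands in for the SEAMCALL
cost of td_management_read8()):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of tdx_complete_interrupts(): an injected NMI may still be
 * pending in the TDX-module-owned PEND_NMI field after TD exit. Re-read
 * it only when KVM actually injected an NMI, since the read is costly.
 */
struct toy_vcpu {
	bool nmi_injected;	/* KVM's view: an NMI injection is in flight */
	bool td_pend_nmi;	/* stands in for TD_VCPU_PEND_NMI */
	int seamcalls;		/* counts the (expensive) reads */
};

static bool read_pend_nmi(struct toy_vcpu *v)
{
	v->seamcalls++;		/* td_management_read8() would be a SEAMCALL */
	return v->td_pend_nmi;
}

static void complete_interrupts(struct toy_vcpu *v)
{
	if (v->nmi_injected)
		v->nmi_injected = read_pend_nmi(v);
}
```

If the NMI was delivered to the guest, PEND_NMI reads back clear and the
injection is marked complete; if it is still pending, the flag survives
for re-injection on the next entry.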

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- move tdvps_management_check() to this patch
- typo: complete -> Complete in short log
---
arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
arch/x86/kvm/vmx/tdx.h | 4 ++++
2 files changed, 14 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 83dcaf5b6fbd..b8b168f74dfe 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
}

+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+ /* Avoid costly SEAMCALL if no nmi was injected */
+ if (vcpu->arch.nmi_injected)
+ vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+ TD_VCPU_PEND_NMI);
+}
+
struct tdx_uret_msr {
u32 msr;
unsigned int slot;
@@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
trace_kvm_exit(vcpu, KVM_ISA_VMX);

+ tdx_complete_interrupts(vcpu);
+
return EXIT_FASTPATH_NONE;
}

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 44eab734e702..0d8a98feb58e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
"Invalid TD VMCS access for 16-bit field");
}

+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
u32 field) \
@@ -200,6 +202,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);

+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
{
struct tdx_module_args out;
--
2.25.1


2024-02-26 08:56:20

by Isaku Yamahata

Subject: [PATCH v19 086/130] KVM: TDX: restore debug store when TD exit

From: Isaku Yamahata <[email protected]>

Because the debug store is clobbered on TD exit, restore it.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/events/intel/ds.c | 1 +
arch/x86/kvm/vmx/tdx.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index d49d661ec0a7..25670d8a485b 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2428,3 +2428,4 @@ void perf_restore_debug_store(void)

wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
}
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b8b168f74dfe..ad4d3d4eaf6c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -665,6 +665,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
tdx_vcpu_enter_exit(tdx);

tdx_user_return_update_cache(vcpu);
+ perf_restore_debug_store();
tdx_restore_host_xsave_state(vcpu);
tdx->host_state_need_restore = true;

--
2.25.1


2024-02-26 08:56:37

by Isaku Yamahata

Subject: [PATCH v19 060/130] KVM: x86/tdp_mmu: Apply mmu notifier callback to only shared GPA

From: Isaku Yamahata <[email protected]>

Private GPAs, which are typically backed by guest memfd, aren't subject to
the MMU notifier because they aren't mapped into the virtual address space
of a user process. kvm_tdp_mmu_handle_gfn() handles the MMU notifier
callbacks clear_flush_young(), clear_young(), test_young() and
change_pte(). Make kvm_tdp_mmu_handle_gfn() aware of private mappings and
skip them.

Even with AS_UNMOVABLE set, those MMU notifiers are still called. For
example, ksmd triggers change_pte().
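
The shared-bit arithmetic used to restrict the handled range to shared
GPAs can be sketched standalone (a toy model: assume S = 47 - 12 = 35 for
the 4-level case; the helper names mirror the KVM ones but are not the
kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/*
 * Toy model of the TDX shared bit: S = (47 or 51) - 12; assume S = 35
 * here. A GFN with the bit set addresses the shared half of the GPA
 * space; private GPAs have it clear.
 */
static const gfn_t gfn_shared_mask = 1ULL << 35;

static gfn_t gfn_to_shared(gfn_t gfn)	{ return gfn | gfn_shared_mask; }
static gfn_t gfn_to_private(gfn_t gfn)	{ return gfn & ~gfn_shared_mask; }
static bool gfn_is_private(gfn_t gfn)	{ return !(gfn & gfn_shared_mask); }
```

Setting the shared bit once on [start, end) before walking the root is
what lets the individual handler()s stay oblivious to the bit.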

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- type: test_gfn() => test_young()

v18:
- newly added

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e7514a807134..10507920f36b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1157,9 +1157,29 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
* into this helper allow blocking; it'd be dead, wasteful code.
*/
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ gfn_t start, end;
+
+ /*
+ * This function is called on behalf of mmu_notifier of
+ * clear_flush_young(), clear_young(), test_young(), and
+ * change_pte(). They apply to only shared GPAs.
+ */
+ WARN_ON_ONCE(range->only_private);
+ WARN_ON_ONCE(!range->only_shared);
+ if (is_private_sp(root))
+ continue;
+
+ /*
+ * For TDX shared mapping, set GFN shared bit to the range,
+ * so the handler() doesn't need to set it, to avoid duplicated
+ * code in multiple handler()s.
+ */
+ start = kvm_gfn_to_shared(kvm, range->start);
+ end = kvm_gfn_to_shared(kvm, range->end);
+
rcu_read_lock();

- tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ tdp_root_for_each_leaf_pte(iter, root, start, end)
ret |= handler(kvm, &iter, range);

rcu_read_unlock();
--
2.25.1


2024-02-26 08:56:42

by Isaku Yamahata

Subject: [PATCH v19 056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation

From: Isaku Yamahata <[email protected]>

Refactor tdp_mmu_alloc_sp() and tdp_mmu_init_sp() and eliminate
tdp_mmu_init_child_sp(). Currently tdp_mmu_init_sp() (or
tdp_mmu_init_child_sp()) sets kvm_mmu_page.role after tdp_mmu_alloc_sp()
has allocated the struct kvm_mmu_page and its page table page. This patch
makes tdp_mmu_alloc_sp() initialize kvm_mmu_page.role instead of
tdp_mmu_init_sp().

To handle private page tables, an is_private argument needs to be passed
down. Given that the page level is already passed down, it would be
cumbersome to add one more parameter about the sp. Instead, replace the
level argument with union kvm_mmu_page_role. Thus the number of arguments
doesn't increase and more info about the sp can be passed down.

For a private sp, a secure page table will also be allocated in addition
to the struct kvm_mmu_page and the page table (spt member). The allocation
functions (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to
know whether the allocation is for a conventional page table or a private
page table. Pass union kvm_mmu_page_role to those functions and initialize
the role member of struct kvm_mmu_page.
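
The role-propagation trick, deriving the child's role from the parent's
and passing one union instead of growing argument lists, can be sketched
with a pared-down role union (a toy model; the real union
kvm_mmu_page_role packs many more fields into the same word):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pared-down model of union kvm_mmu_page_role: level and an is_private
 * flag packed into one word, so allocation helpers can take a single
 * role argument.
 */
union toy_page_role {
	uint32_t word;
	struct {
		uint32_t level : 4;
		uint32_t is_private : 1;
	};
};

/*
 * Equivalent of tdp_iter_child_role(): inherit everything from the
 * parent, one level down.
 */
static union toy_page_role child_role(union toy_page_role parent)
{
	union toy_page_role child = parent;

	child.level--;
	return child;
}
```

Since level is never 0 for an initialized sp, role.word being nonzero is
what the WARN_ON_ONCE(!sp->role.word) check in tdp_mmu_init_sp() relies
on.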

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/tdp_iter.h | 12 ++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 44 ++++++++++++++++---------------------
2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..e1e40e3f5eb7 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -135,4 +135,16 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);

+static inline union kvm_mmu_page_role tdp_iter_child_role(struct tdp_iter *iter)
+{
+ union kvm_mmu_page_role child_role;
+ struct kvm_mmu_page *parent_sp;
+
+ parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+ child_role = parent_sp->role;
+ child_role.level--;
+ return child_role;
+}
+
#endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 04c6af49c3e8..87233b3ceaef 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -177,24 +177,30 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
kvm_mmu_page_as_id(_root) != _as_id) { \
} else

-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu,
+ union kvm_mmu_page_role role)
{
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->role = role;

return sp;
}

static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
- gfn_t gfn, union kvm_mmu_page_role role)
+ gfn_t gfn)
{
INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);

set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

- sp->role = role;
+ /*
+ * role must be set before calling this function. At least role.level
+ * is not 0 (PG_LEVEL_NONE).
+ */
+ WARN_ON_ONCE(!sp->role.word);
sp->gfn = gfn;
sp->ptep = sptep;
sp->tdp_mmu_page = true;
@@ -202,20 +208,6 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
trace_kvm_mmu_get_page(sp, true);
}

-static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
- struct tdp_iter *iter)
-{
- struct kvm_mmu_page *parent_sp;
- union kvm_mmu_page_role role;
-
- parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
-
- role = parent_sp->role;
- role.level--;
-
- tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
-}
-
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
{
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
@@ -234,8 +226,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
goto out;
}

- root = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_sp(root, NULL, 0, role);
+ root = tdp_mmu_alloc_sp(vcpu, role);
+ tdp_mmu_init_sp(root, NULL, 0);

/*
* TDP MMU roots are kept until they are explicitly invalidated, either
@@ -1068,8 +1060,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* The SPTE is either non-present or points to a huge page that
* needs to be split.
*/
- sp = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_child_sp(sp, &iter);
+ sp = tdp_mmu_alloc_sp(vcpu, tdp_iter_child_role(&iter));
+ tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);

sp->nx_huge_page_disallowed = fault->huge_page_disallowed;

@@ -1312,7 +1304,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}

-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mmu_page_role role)
{
struct kvm_mmu_page *sp;

@@ -1322,6 +1314,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
if (!sp)
return NULL;

+ sp->role = role;
sp->spt = (void *)__get_free_page(gfp);
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
@@ -1335,6 +1328,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
struct tdp_iter *iter,
bool shared)
{
+ union kvm_mmu_page_role role = tdp_iter_child_role(iter);
struct kvm_mmu_page *sp;

kvm_lockdep_assert_mmu_lock_held(kvm, shared);
@@ -1348,7 +1342,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, role);
if (sp)
return sp;

@@ -1360,7 +1354,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
write_unlock(&kvm->mmu_lock);

iter->yielded = true;
- sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, role);

if (shared)
read_lock(&kvm->mmu_lock);
@@ -1455,7 +1449,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
continue;
}

- tdp_mmu_init_child_sp(sp, &iter);
+ tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);

if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
goto retry;
--
2.25.1


2024-02-26 08:57:12

by Isaku Yamahata

Subject: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

From: Isaku Yamahata <[email protected]>

For vcpu migration, in the case of VMX, the VMCS is flushed on the source
pcpu and loaded on the target pcpu. There are corresponding TDX SEAMCALL
APIs; call them on vcpu migration. The logic is mostly the same as VMX
except that the TDX SEAMCALLs are used.

When shutting down the machine, (VMX or TDX) vcpus need to be shut down
on each pcpu. Do similarly for TDX with the TDX SEAMCALL APIs.
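
The per-CPU association bookkeeping can be sketched as a toy,
single-threaded model (all names are illustrative; no SEAMCALLs, and in
the kernel the unlink runs via IPI with interrupts disabled plus
smp_wmb()/smp_rmb() ordering between cpu_list and vcpu->cpu):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the per-CPU "associated TD vCPUs" list: loading a vCPU on
 * a pCPU links it in; flushing (on migration or CPU offline) unlinks it
 * and marks it associated with no CPU (cpu == -1).
 */
struct toy_tdvcpu {
	int cpu;
	struct toy_tdvcpu *next;
};

#define TOY_NR_CPUS 2
static struct toy_tdvcpu *associated[TOY_NR_CPUS];

/* Equivalent of tdx_flush_vp()/tdx_disassociate_vp() for one vCPU. */
static void toy_disassociate_vp(struct toy_tdvcpu *v)
{
	struct toy_tdvcpu **p = &associated[v->cpu];

	while (*p && *p != v)
		p = &(*p)->next;
	if (*p)
		*p = v->next;
	v->cpu = -1;
}

/* Equivalent of tdx_vcpu_load(): flush from the old pCPU, then link. */
static void toy_vcpu_load(struct toy_tdvcpu *v, int cpu)
{
	if (v->cpu == cpu)
		return;
	if (v->cpu != -1)
		toy_disassociate_vp(v);
	v->next = associated[cpu];
	associated[cpu] = v;
	v->cpu = cpu;
}

/* Equivalent of tdx_hardware_disable(): flush every vCPU still here. */
static void toy_hardware_disable(int cpu)
{
	while (associated[cpu])
		toy_disassociate_vp(associated[cpu]);
}
```

The invariant the kernel defends with barriers is visible even here:
a vCPU is on at most one CPU's list, and cpu == -1 exactly when it is on
none.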

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 32 ++++++-
arch/x86/kvm/vmx/tdx.c | 190 ++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 2 +
arch/x86/kvm/vmx/x86_ops.h | 4 +
4 files changed, 221 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8275a242ce07..9b336c1a6508 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -33,6 +33,14 @@ static int vt_max_vcpus(struct kvm *kvm)
static int vt_flush_remote_tlbs(struct kvm *kvm);
#endif

+static void vt_hardware_disable(void)
+{
+ /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+ if (enable_tdx)
+ tdx_hardware_disable();
+ vmx_hardware_disable();
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -201,6 +209,16 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_vcpu_run(vcpu);
}

+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_load(vcpu, cpu);
+ return;
+ }
+
+ vmx_vcpu_load(vcpu, cpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -262,6 +280,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}

+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_sched_in(vcpu, cpu);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -335,7 +361,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

/* TDX cpu enablement is done by tdx_hardware_setup(). */
.hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
+ .hardware_disable = vt_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

.is_vm_type_supported = vt_is_vm_type_supported,
@@ -353,7 +379,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_reset = vt_vcpu_reset,

.prepare_switch_to_guest = vt_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
+ .vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,

.update_exception_bitmap = vmx_update_exception_bitmap,
@@ -440,7 +466,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.request_immediate_exit = vmx_request_immediate_exit,

- .sched_in = vmx_sched_in,
+ .sched_in = vt_sched_in,

.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ad4d3d4eaf6c..7aa9188f384d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -106,6 +106,14 @@ static DEFINE_MUTEX(tdx_lock);
static struct mutex *tdx_mng_key_config_lock;
static atomic_t nr_configured_hkid;

+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
+ * Protected by interrupt mask. This list is manipulated in process context
+ * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
{
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
@@ -138,6 +146,37 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
return kvm_tdx->finalized;
}

+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ list_del(&to_tdx(vcpu)->cpu_list);
+
+ /*
+ * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
+ * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+ * to its list before it's deleted from this CPU's list.
+ */
+ smp_wmb();
+
+ vcpu->cpu = -1;
+}
+
+static void tdx_disassociate_vp_arg(void *vcpu)
+{
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_disassociate_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ int cpu = vcpu->cpu;
+
+ if (unlikely(cpu == -1))
+ return;
+
+ smp_call_function_single(cpu, tdx_disassociate_vp_arg, vcpu, 1);
+}
+
static void tdx_clear_page(unsigned long page_pa)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -218,6 +257,87 @@ static void tdx_reclaim_control_page(unsigned long td_page_pa)
free_page((unsigned long)__va(td_page_pa));
}

+struct tdx_flush_vp_arg {
+ struct kvm_vcpu *vcpu;
+ u64 err;
+};
+
+static void tdx_flush_vp(void *arg_)
+{
+ struct tdx_flush_vp_arg *arg = arg_;
+ struct kvm_vcpu *vcpu = arg->vcpu;
+ u64 err;
+
+ arg->err = 0;
+ lockdep_assert_irqs_disabled();
+
+ /* Task migration can race with CPU offlining. */
+ if (unlikely(vcpu->cpu != raw_smp_processor_id()))
+ return;
+
+ /*
+ * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
+ * list tracking still needs to be updated so that it's correct if/when
+ * the vCPU does get initialized.
+ */
+ if (is_td_vcpu_created(to_tdx(vcpu))) {
+ /*
+ * No need to retry. TDX Resources needed for TDH.VP.FLUSH are,
+ * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
+ * vp flush function is called when destructing vcpu/TD or vcpu
+ * migration. No other thread uses TDVPR in those cases.
+ */
+ err = tdh_vp_flush(to_tdx(vcpu)->tdvpr_pa);
+ if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+ /*
+ * This function is called in IPI context. Do not use
+ * printk to avoid console semaphore.
+ * The caller prints out the error message, instead.
+ */
+ if (err)
+ arg->err = err;
+ }
+ }
+
+ tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+ struct tdx_flush_vp_arg arg = {
+ .vcpu = vcpu,
+ };
+ int cpu = vcpu->cpu;
+
+ if (unlikely(cpu == -1))
+ return;
+
+ smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
+ if (WARN_ON_ONCE(arg.err)) {
+ pr_err("cpu: %d ", cpu);
+ pr_tdx_error(TDH_VP_FLUSH, arg.err, NULL);
+ }
+}
+
+void tdx_hardware_disable(void)
+{
+ int cpu = raw_smp_processor_id();
+ struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+ struct tdx_flush_vp_arg arg;
+ struct vcpu_tdx *tdx, *tmp;
+ unsigned long flags;
+
+ lockdep_assert_preemption_disabled();
+
+ local_irq_save(flags);
+ /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+ list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
+ arg.vcpu = &tdx->vcpu;
+ tdx_flush_vp(&arg);
+ }
+ local_irq_restore(flags);
+}
+
static void tdx_do_tdh_phymem_cache_wb(void *unused)
{
u64 err = 0;
@@ -233,26 +353,31 @@ static void tdx_do_tdh_phymem_cache_wb(void *unused)
pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
}

-void tdx_mmu_release_hkid(struct kvm *kvm)
+static int __tdx_mmu_release_hkid(struct kvm *kvm)
{
bool packages_allocated, targets_allocated;
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
cpumask_var_t packages, targets;
+ struct kvm_vcpu *vcpu;
+ unsigned long j;
+ int i, ret = 0;
u64 err;
- int i;

if (!is_hkid_assigned(kvm_tdx))
- return;
+ return 0;

if (!is_td_created(kvm_tdx)) {
tdx_hkid_free(kvm_tdx);
- return;
+ return 0;
}

packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
cpus_read_lock();

+ kvm_for_each_vcpu(j, vcpu, kvm)
+ tdx_flush_vp_on_cpu(vcpu);
+
/*
* We can destroy multiple guest TDs simultaneously. Prevent
* tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
@@ -270,6 +395,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
*/
write_lock(&kvm->mmu_lock);

+ err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
+ if (err == TDX_FLUSHVP_NOT_DONE) {
+ ret = -EBUSY;
+ goto out;
+ }
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+ pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
+ kvm_tdx->hkid);
+ ret = -EIO;
+ goto out;
+ }
+
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -291,14 +429,24 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
kvm_tdx->hkid);
+ ret = -EIO;
} else
tdx_hkid_free(kvm_tdx);

+out:
write_unlock(&kvm->mmu_lock);
mutex_unlock(&tdx_lock);
cpus_read_unlock();
free_cpumask_var(targets);
free_cpumask_var(packages);
+
+ return ret;
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
+ ;
}

void tdx_vm_free(struct kvm *kvm)
@@ -455,6 +603,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return 0;
}

+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (vcpu->cpu == cpu)
+ return;
+
+ tdx_flush_vp_on_cpu(vcpu);
+
+ local_irq_disable();
+ /*
+ * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+ * vcpu->cpu is read before tdx->cpu_list.
+ */
+ smp_rmb();
+
+ list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+ local_irq_enable();
+}
+
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -495,6 +663,16 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
struct vcpu_tdx *tdx = to_tdx(vcpu);
int i;

+ /*
+ * When destroying the VM, kvm_unload_vcpu_mmu() calls vcpu_load() for every
+ * vcpu after they have already been disassociated from the per-CPU list by
+ * tdx_mmu_release_hkid(). So we need to disassociate them again;
+ * otherwise the freed vcpu data will be accessed when doing
+ * list_{del,add}() on the associated_tdvcpus list later.
+ */
+ tdx_disassociate_vp_on_cpu(vcpu);
+ WARN_ON_ONCE(vcpu->cpu != -1);
+
/*
* This method can be called when vcpu allocation/initialization
* failed. So it's possible that hkid, tdvpx and tdvpr are not assigned
@@ -2030,6 +2208,10 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
return -EINVAL;
}

+ /* tdx_hardware_disable() uses associated_tdvcpus. */
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
+
for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
/*
* Here it checks if MSRs (tdx_uret_msrs) can be saved/restored
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 0d8a98feb58e..7f8c78f06508 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -73,6 +73,8 @@ struct vcpu_tdx {
unsigned long *tdvpx_pa;
bool td_vcpu_created;

+ struct list_head cpu_list;
+
union tdx_exit_reason exit_reason;

bool initialized;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 9fd997c79c33..5853f29f0af3 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -137,6 +137,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
#ifdef CONFIG_INTEL_TDX_HOST
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
void tdx_hardware_unsetup(void);
+void tdx_hardware_disable(void);
bool tdx_is_vm_type_supported(unsigned long type);
int tdx_offline_cpu(void);

@@ -153,6 +154,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -171,6 +173,7 @@ void tdx_post_memory_mapping(struct kvm_vcpu *vcpu,
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
+static inline void tdx_hardware_disable(void) {}
static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
static inline int tdx_offline_cpu(void) { return 0; }

@@ -190,6 +193,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1


2024-02-26 08:57:19

by Isaku Yamahata

Subject: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

From: Isaku Yamahata <[email protected]>

For a private GPA, the CPU refers to a private page table whose contents are
encrypted. Dedicated APIs must be used to operate on it (e.g. updating/reading
its PTE entries), and they are expensive.

When KVM resolves a KVM page fault, it walks the page tables. To reuse the
existing KVM MMU code and mitigate the heavy cost of directly walking the
private page table, allocate one more page and keep a dummy copy of the page
table there for the KVM MMU code to walk directly. Resolve the KVM page fault
with the existing code, and perform the additional operations necessary for
the private page table. To distinguish the cases, the existing KVM page table
is called a shared page table (i.e. not associated with a private page table),
and a page table associated with a private page table is called a private page
table. The relationship is depicted below.

Add a private pointer to struct kvm_mmu_page for private page table and
add helper functions to allocate/initialize/free a private page table
page.

KVM page fault |
| |
V |
-------------+---------- |
| | |
V V |
shared GPA private GPA |
| | |
V V |
shared PT root dummy PT root | private PT root
| | | |
V V | V
shared PT dummy PT ----propagate----> private PT
| | | |
| \-----------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
non-encrypted memory | encrypted memory
|
PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Private PT is used by CPU but it is invisible to KVM.
- Dummy PT is visible to KVM but not used by CPU. It is used to
propagate PT change to the actual private PT which is used by CPU.
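
The propagation scheme in the diagram can be sketched as follows. This is an illustrative stand-in, not KVM code: tdx_sept_set_pte(), the tiny table size, and the call counter are hypothetical, and the real propagation goes through expensive TDX-module (SEAMCALL) interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define PTES_PER_TABLE 4	/* tiny table, for illustration only */

/* Hypothetical stand-in for the costly TDX-module call that updates the
 * real (encrypted, KVM-invisible) private page table. */
static uint64_t private_pt[PTES_PER_TABLE];
static int seamcall_count;

static void tdx_sept_set_pte(int idx, uint64_t spte)
{
	seamcall_count++;		/* each propagation is expensive */
	private_pt[idx] = spte;
}

/* The dummy PT is ordinary memory that the KVM MMU code walks directly;
 * every change to it is propagated to the private PT used by the CPU. */
static uint64_t dummy_pt[PTES_PER_TABLE];

static void set_private_spte(int idx, uint64_t spte)
{
	dummy_pt[idx] = spte;		/* cheap: visible to KVM */
	tdx_sept_set_pte(idx, spte);	/* propagate to the CPU-used PT */
}
```

Subsequent walks by the MMU code only touch the dummy PT, which is the whole point of keeping the copy.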

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- typo in the comment in kvm_mmu_alloc_private_spt()
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 5 +++
arch/x86/kvm/mmu/mmu.c | 7 ++++
arch/x86/kvm/mmu/mmu_internal.h | 63 ++++++++++++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 1 +
4 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dcc6f7c38a83..efd3fda1c177 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -825,6 +825,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * This cache is used to allocate private page tables, e.g. Secure-EPT
+ * used by the TDX module.
+ */
+ struct kvm_mmu_memory_cache mmu_private_spt_cache;

/*
* QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eeebbc67e42b..0d6d4506ec97 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
@@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e3f54701f98d..002f3f80bf3b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -101,7 +101,21 @@ struct kvm_mmu_page {
int root_count;
refcount_t tdp_mmu_root_count;
};
- unsigned int unsync_children;
+ union {
+ struct {
+ unsigned int unsync_children;
+ /*
+ * Number of writes since the last time traversal
+ * visited this page.
+ */
+ atomic_t write_flooding_count;
+ };
+ /*
+ * Associated private shadow page table, e.g. Secure-EPT page
+ * passed to the TDX module.
+ */
+ void *private_spt;
+ };
union {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
@@ -124,9 +138,6 @@ struct kvm_mmu_page {
int clear_spte_count;
#endif

- /* Number of writes since the last time traversal visited this page. */
- atomic_t write_flooding_count;
-
#ifdef CONFIG_X86_64
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
@@ -150,6 +161,50 @@ static inline bool is_private_sp(const struct kvm_mmu_page *sp)
return kvm_mmu_page_role_is_private(sp->role);
}

+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+ return sp->private_spt;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+ sp->private_spt = private_spt;
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
+
+ KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
+ if (is_root)
+ /*
+ * Because the TDX module assigns the root Secure-EPT page and sets
+ * it to the Secure-EPTP when the TD vcpu is created, a secure page
+ * table for the root isn't needed.
+ */
+ sp->private_spt = NULL;
+ else {
+ /*
+ * Because the TDX module doesn't trust VMM and initializes
+ * the pages itself, KVM doesn't initialize them. Allocate
+ * pages with garbage and give them to the TDX module.
+ */
+ sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
+ /*
+ * Because mmu_private_spt_cache is topped up before starting
+ * kvm page fault resolving, the allocation above shouldn't
+ * fail.
+ */
+ WARN_ON_ONCE(!sp->private_spt);
+ }
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+ if (sp->private_spt)
+ free_page((unsigned long)sp->private_spt);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 87233b3ceaef..d47f0daf1b03 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)

static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
+ kvm_mmu_free_private_spt(sp);
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
--
2.25.1


2024-02-26 08:57:34

by Isaku Yamahata

Subject: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

From: Sean Christopherson <[email protected]>

Architecturally, TDX supports only the write-back (WB) memory type for private
memory, so a (virtualized) memory type change doesn't make sense for private
memory. Also, page migration isn't currently supported for TDX. (TDX
architecturally supports page migration; it's a KVM and kernel implementation
issue.)

Regarding memory type change (mtrr virtualization and lapic page mapping
change), pages are zapped by kvm_zap_gfn_range(). On the next KVM page
fault, the SPTE entry with a new memory type for the page is populated.
Regarding page migration, pages are zapped by the mmu notifier. On the next
KVM page fault, the new migrated page is populated. Don't zap private
pages on unmapping for those two cases.

When deleting/moving a KVM memory slot (typically when tearing down the VM),
zap private pages. Don't invalidate private page tables, i.e. zap only leaf
SPTEs for a KVM MMU that has a shared bit mask. The existing
kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with the read-lock
of mmu_lock so that other vcpus can operate on the KVM MMU concurrently. It
marks the root page table invalid and zaps SPTEs of the root page tables. The
TDX module doesn't allow unlinking a protected root page table from the
hardware and then allocating a new one for it, i.e. replacing a protected
root page table. Instead, zap only leaf SPTEs for a KVM MMU with the shared
bit mask set.
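
The zap policy described above boils down to a per-root filter, mirroring the guard this patch adds at the top of tdp_mmu_zap_leafs(). The sketch below is illustrative: root_stub and should_walk_root() are made-up names, not the real struct kvm_mmu_page or KVM helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the private/shared property carried by sp->role. */
struct root_stub {
	bool is_private;
};

/* Shared roots are always walked; private roots are skipped unless the
 * caller explicitly asked to zap private mappings. Memory type changes
 * and page migration must not touch private mappings, so those callers
 * pass zap_private = false. */
static bool should_walk_root(const struct root_stub *root, bool zap_private)
{
	if (!zap_private && root->is_private)
		return false;
	return true;
}
```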

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 5 ++--
3 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d6d4506ec97..30c86e858ae4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
*/
if (tdp_mmu_enabled)
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_invalidate_all_roots(kvm, true);

/*
* Notify all vcpus to reload its shadow page table and flush TLB.
@@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);

if (tdp_mmu_enabled)
- flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
+ /*
+ * zap_private = false. Zap only shared pages.
+ *
+ * kvm_zap_gfn_range() is used when MTRR or PAT memory
+ * type was changed. Later on the next kvm page fault,
+ * populate it with updated spte entry.
+ * Because only WB is supported for private pages, don't
+ * care about private pages.
+ */
+ flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush, false);

if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
@@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
kvm_mmu_zap_all(kvm);
}

+static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ bool flush = false;
+
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+ * case scenario we'll have unused shadow pages lying around until they
+ * are recycled due to age or when the VM is destroyed.
+ */
+ if (tdp_mmu_enabled) {
+ struct kvm_gfn_range range = {
+ .slot = slot,
+ .start = slot->base_gfn,
+ .end = slot->base_gfn + slot->npages,
+ .may_block = true,
+
+ /*
+ * This handles both private gfn and shared gfn.
+ * All private pages should be zapped on memslot deletion.
+ */
+ .only_private = true,
+ .only_shared = true,
+ };
+
+ flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
+ } else {
+ /* TDX supports only TDP-MMU case. */
+ WARN_ON_ONCE(1);
+ flush = true;
+ }
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+}
+
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
- kvm_mmu_zap_all_fast(kvm);
+ if (kvm_gfn_shared_mask(kvm))
+ /*
+ * Secure-EPT requires releasing PTs from the leaf. The
+ * optimization of zapping the root PT first with its child
+ * PTs doesn't work.
+ */
+ kvm_mmu_zap_memslot(kvm, slot);
+ else
+ kvm_mmu_zap_all_fast(kvm);
}

void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d47f0daf1b03..e7514a807134 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* for zapping and thus puts the TDP MMU's reference to each root, i.e.
* ultimately frees all roots.
*/
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_invalidate_all_roots(kvm, false);
kvm_tdp_mmu_zap_invalidated_roots(kvm);

WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* operation can cause a soft lockup.
*/
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush)
+ gfn_t start, gfn_t end, bool can_yield, bool flush,
+ bool zap_private)
{
struct tdp_iter iter;

@@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,

lockdep_assert_held_write(&kvm->mmu_lock);

+ WARN_ON_ONCE(zap_private && !is_private_sp(root));
+ if (!zap_private && is_private_sp(root))
+ return false;
+
rcu_read_lock();

for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
@@ -810,13 +815,15 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
* true if a TLB flush is needed before releasing the MMU lock, i.e. if one or
* more SPTEs were zapped since the MMU lock was last acquired.
*/
-bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush,
+ bool zap_private)
{
struct kvm_mmu_page *root;

lockdep_assert_held_write(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush,
+ zap_private && is_private_sp(root));

return flush;
}
@@ -891,7 +898,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_get_vcpu_root_hpa().
*/
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool skip_private)
{
struct kvm_mmu_page *root;

@@ -916,6 +923,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
* or get/put references to roots.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ /*
+ * Skip private root since private page table
+ * is only torn down when VM is destroyed.
+ */
+ if (skip_private && is_private_sp(root))
+ continue;
/*
* Note, invalid roots can outlive a memslot update! Invalid
* roots must be *zapped* before the memslot update completes,
@@ -1104,14 +1117,26 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
}

+/* Used by mmu notifier via kvm_unmap_gfn_range() */
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
struct kvm_mmu_page *root;
+ bool zap_private = false;
+
+ if (kvm_gfn_shared_mask(kvm)) {
+ if (!range->only_private && !range->only_shared)
+ /* attributes change */
+ zap_private = !(range->arg.attributes &
+ KVM_MEMORY_ATTRIBUTE_PRIVATE);
+ else
+ zap_private = range->only_private;
+ }

__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
- range->may_block, flush);
+ range->may_block, flush,
+ zap_private && is_private_sp(root));

return flush;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 20d97aa46c49..b3cf58a50357 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -19,10 +19,11 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)

void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);

-bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush,
+ bool zap_private);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool skip_private);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
--
2.25.1


2024-02-26 08:57:49

by Isaku Yamahata

Subject: [PATCH v19 090/130] KVM: x86: Assume timer IRQ was injected if APIC state is protected

From: Sean Christopherson <[email protected]>

If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.

Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/lapic.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e8034f2f2dd1..8025c7f614e0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1774,8 +1774,17 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+ u32 reg;

+ /*
+ * Assume a timer IRQ was "injected" if the APIC is protected. KVM's
+ * copy of the vIRR is bogus, it's the responsibility of the caller to
+ * precisely check whether or not a timer IRQ is pending.
+ */
+ if (apic->guest_apic_protected)
+ return true;
+
+ reg = kvm_lapic_get_reg(apic, APIC_LVTT);
if (kvm_apic_hw_enabled(apic)) {
int vec = reg & APIC_VECTOR_MASK;
void *bitmap = apic->regs + APIC_ISR;
--
2.25.1


2024-02-26 08:58:20

by Isaku Yamahata

Subject: [PATCH v19 091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c

From: Isaku Yamahata <[email protected]>

As TDX will use posted_interrupt.c, the use of struct vcpu_vmx is a
blocker. Because the members struct pi_desc pi_desc and struct
list_head pi_wakeup_list are only used in posted_interrupt.c, introduce a
common structure, struct vcpu_pi, and make vcpu_vmx and vcpu_tdx share the
same layout at the top of the structure.

To minimize the diff size, avoid code conversion like
vmx->pi_desc => vmx->common->pi_desc. Instead, add a compile-time check
that the layout is as expected.
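
The compile-time layout check is the standard common-prefix trick, which can be illustrated with simplified stand-in types (kvm_vcpu_stub and pi_desc_stub below are not the real KVM definitions):

```c
#include <assert.h>
#include <stddef.h>

struct kvm_vcpu_stub { int id; };
struct pi_desc_stub { unsigned long control; };

struct vcpu_pi {
	struct kvm_vcpu_stub vcpu;
	struct pi_desc_stub pi_desc;
};

struct vcpu_vmx {
	struct kvm_vcpu_stub vcpu;
	struct pi_desc_stub pi_desc;
	/* Until here the layout matches vcpu_pi. */
	int vmx_only_state;
};

/* Fails to compile if the shared prefix ever drifts apart. */
_Static_assert(offsetof(struct vcpu_pi, pi_desc) ==
	       offsetof(struct vcpu_vmx, pi_desc),
	       "pi_desc must sit at the same offset");

/* Valid only because the prefix layouts are asserted to be equal. */
static struct pi_desc_stub *vcpu_to_pi_desc(struct kvm_vcpu_stub *vcpu)
{
	return &((struct vcpu_pi *)vcpu)->pi_desc;
}
```

The same pattern extends to vcpu_tdx: any structure that reproduces the prefix and passes the offset checks can be handled by the shared posted-interrupt code.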

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/posted_intr.c | 41 ++++++++++++++++++++++++++--------
arch/x86/kvm/vmx/posted_intr.h | 11 +++++++++
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/vmx/tdx.h | 8 +++++++
arch/x86/kvm/vmx/vmx.h | 14 +++++++-----
5 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index af662312fd07..b66add9da0f3 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -11,6 +11,7 @@
#include "posted_intr.h"
#include "trace.h"
#include "vmx.h"
+#include "tdx.h"

/*
* Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
@@ -31,9 +32,29 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
*/
static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);

+/*
+ * The layout of the head of struct vcpu_vmx and struct vcpu_tdx must match with
+ * struct vcpu_pi.
+ */
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+ offsetof(struct vcpu_vmx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+ offsetof(struct vcpu_vmx, pi_wakeup_list));
+#ifdef CONFIG_INTEL_TDX_HOST
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+ offsetof(struct vcpu_tdx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+ offsetof(struct vcpu_tdx, pi_wakeup_list));
+#endif
+
+static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
+{
+ return (struct vcpu_pi *)vcpu;
+}
+
static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
- return &(to_vmx(vcpu)->pi_desc);
+ return &vcpu_to_pi(vcpu)->pi_desc;
}

static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
@@ -52,8 +73,8 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)

void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+ struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
struct pi_desc old, new;
unsigned long flags;
unsigned int dest;
@@ -90,7 +111,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
*/
if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- list_del(&vmx->pi_wakeup_list);
+ list_del(&vcpu_pi->pi_wakeup_list);
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
}

@@ -145,15 +166,15 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
*/
static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+ struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
struct pi_desc old, new;
unsigned long flags;

local_irq_save(flags);

raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- list_add_tail(&vmx->pi_wakeup_list,
+ list_add_tail(&vcpu_pi->pi_wakeup_list,
&per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

@@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
* notification vector is switched to the one that calls
* back to the pi_wakeup_handler() function.
*/
- return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
+ return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
+ vmx_can_use_vtd_pi(vcpu->kvm);
}

void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
@@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
if (!vmx_needs_pi_wakeup(vcpu))
return;

- if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+ if (kvm_vcpu_is_blocking(vcpu) &&
+ (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
pi_enable_wakeup_handler(vcpu);

/*
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..2fe8222308b2 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -94,6 +94,17 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}

+struct vcpu_pi {
+ struct kvm_vcpu vcpu;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here, common layout between vcpu_vmx and vcpu_tdx. */
+};
+
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a5b52aa6d153..1da58c36217c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -584,6 +584,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

fpstate_set_confidential(&vcpu->arch.guest_fpu);
vcpu->arch.apic->guest_apic_protected = true;
+ INIT_LIST_HEAD(&tdx->pi_wakeup_list);

vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 7f8c78f06508..eaffa7384725 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -4,6 +4,7 @@

#ifdef CONFIG_INTEL_TDX_HOST

+#include "posted_intr.h"
#include "pmu_intel.h"
#include "tdx_ops.h"

@@ -69,6 +70,13 @@ union tdx_exit_reason {
struct vcpu_tdx {
struct kvm_vcpu vcpu;

+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here same layout to struct vcpu_pi. */
+
unsigned long tdvpr_pa;
unsigned long *tdvpx_pa;
bool td_vcpu_created;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 79ff54f08fee..634a9a250b95 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -235,6 +235,14 @@ struct nested_vmx {

struct vcpu_vmx {
struct kvm_vcpu vcpu;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here same layout to struct vcpu_pi. */
+
u8 fail;
u8 x2apic_msr_bitmap_mode;

@@ -304,12 +312,6 @@ struct vcpu_vmx {

union vmx_exit_reason exit_reason;

- /* Posted interrupt descriptor */
- struct pi_desc pi_desc;
-
- /* Used if this vCPU is waiting for PI notification wakeup. */
- struct list_head pi_wakeup_list;
-
/* Support for a guest hypervisor (nested VMX) */
struct nested_vmx nested;

--
2.25.1


2024-02-26 08:58:30

by Isaku Yamahata

Subject: [PATCH v19 061/130] KVM: x86/tdp_mmu: Sprinkle __must_check

From: Isaku Yamahata <[email protected]>

The TDP MMU allows tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() to
return an -EBUSY or -EAGAIN error. The caller must check the return value and
retry. Sprinkle __must_check to guarantee this.
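
The effect of the annotation can be sketched outside the kernel: __must_check maps to the compiler's warn_unused_result attribute, and the stub below imitates a caller retrying on -EBUSY. The function names and the fixed retry count are hypothetical, not the real TDP MMU code paths.

```c
#include <assert.h>

/* In the kernel this comes from <linux/compiler_attributes.h>. */
#define __must_check __attribute__((warn_unused_result))
#ifndef EBUSY
#define EBUSY 16
#endif

static int attempts;

/* Pretend another walker owns the SPTE for the first two tries. */
static int __must_check set_spte_atomic_stub(void)
{
	return ++attempts < 3 ? -EBUSY : 0;
}

static int set_spte_with_retry(void)
{
	int ret;

	do {
		/* Dropping ret here would trigger a compiler warning. */
		ret = set_spte_atomic_stub();
	} while (ret == -EBUSY);
	return ret;
}
```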

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
v19:
- Added Reviewed-by: Binbin
---
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 10507920f36b..a90907b31c54 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -507,9 +507,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* no side-effects other than setting iter->old_spte to the last
* known value of the spte.
*/
-static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
- struct tdp_iter *iter,
- u64 new_spte)
+static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);

@@ -539,8 +539,8 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
return 0;
}

-static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
- struct tdp_iter *iter)
+static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter)
{
int ret;

--
2.25.1


2024-02-26 08:58:35

by Isaku Yamahata

Subject: [PATCH v19 093/130] KVM: TDX: Implements vcpu request_immediate_exit

From: Isaku Yamahata <[email protected]>

Now that we are able to inject interrupts into a TDX vcpu, it's ready to
block. Wire up the KVM x86 methods for blocking/unblocking a vcpu for TDX.
To unblock on pending events, the request-immediate-exit method is also
needed.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f2c9d6358f9e..ee6c04959d4c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -372,6 +372,16 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
vmx_enable_irq_window(vcpu);
}

+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ __kvm_request_immediate_exit(vcpu);
+ return;
+ }
+
+ vmx_request_immediate_exit(vcpu);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -549,7 +559,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,

- .request_immediate_exit = vmx_request_immediate_exit,
+ .request_immediate_exit = vt_request_immediate_exit,

.sched_in = vt_sched_in,

--
2.25.1


2024-02-26 08:59:33

by Isaku Yamahata

Subject: [PATCH v19 095/130] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument

From: Sean Christopherson <[email protected]>

TDX uses a different ABI to get information about a VM exit. Pass intr_info
to the NMI and INTR handlers instead of pulling it from vcpu_vmx, in
preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to the VMM, RAX holds the status and exit reason, RCX
holds the exit qualification, etc., rather than the VMCS fields, because the
VMM doesn't have access to the VMCS. The eventual code will be:

VMX:
- get exit reason, intr_info, exit_qualification, and etc from VMCS
- call NMI/INTR handlers (common code)

TDX:
- get exit reason, intr_info, exit_qualification, etc. from guest registers
- call NMI/INTR handlers (common code)
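The split above can be modeled in a few lines: the common handler takes intr_info as an explicit parameter, and each caller supplies it from its own source. This is a hedged sketch; the constant and the two accessor functions are simplified stand-ins for the kernel's definitions:

```c
#include <assert.h>
#include <stdint.h>

#define INTR_INFO_VECTOR_MASK 0xff /* low 8 bits carry the vector */

/* Common handler: intr_info is now a parameter, not read from vcpu state. */
static unsigned int handle_intr_vector(uint32_t intr_info)
{
	return intr_info & INTR_INFO_VECTOR_MASK;
}

/* VMX path: in the kernel this would be a VMCS field read. */
static uint32_t vmx_get_intr_info(void) { return 0x80000000u | 32; }

/* TDX path: in the kernel this is read from a guest register after TD exit. */
static uint32_t tdexit_intr_info(void) { return 0x80000000u | 32; }
```

Both callers feed the same common handler, which is the point of the refactor.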

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1349ec438837..29d891e0795e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6912,24 +6912,22 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}

-static void handle_exception_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
-
/* if exit due to PF check for async PF */
if (is_page_fault(intr_info))
- vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
/* if exit due to NM, handle before interrupts are enabled */
else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(&vmx->vcpu);
+ handle_nm_fault_irqoff(vcpu);
/* Handle machine checks before interrupts are enabled */
else if (is_machine_check(intr_info))
kvm_machine_check();
}

-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
gate_desc *desc = (gate_desc *)host_idt_base + vector;

@@ -6952,9 +6950,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;

if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu);
+ handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_irqoff(vmx);
+ handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
}

/*
--
2.25.1


2024-02-26 09:00:00

by Isaku Yamahata

Subject: [PATCH v19 096/130] KVM: VMX: Move NMI/exception handler to common helper

From: Sean Christopherson <[email protected]>

TDX handles NMI/exception exits mostly the same as the VMX case. The
difference is how the exit qualification is retrieved. To share the code with
TDX, move the NMI/exception handlers to a common header, common.h.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 59 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 68 +++++----------------------------------
2 files changed, 67 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 6f21d0d48809..632af7a76d0a 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,8 +4,67 @@

#include <linux/kvm_host.h>

+#include <asm/traps.h>
+
#include "posted_intr.h"
#include "mmu.h"
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_irqoff(unsigned long entry);
+void vmx_do_nmi_irqoff(void);
+
+static inline void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Save xfd_err to guest_fpu before interrupt is enabled, so the
+ * MSR value is not clobbered by the host activity before the guest
+ * has chance to consume it.
+ *
+ * Do not blindly read xfd_err here, since this exception might
+ * be caused by L1 interception on a platform which doesn't
+ * support xfd at all.
+ *
+ * Do it conditionally upon guest_fpu::xfd. xfd_err matters
+ * only when xfd contains a non-zero value.
+ *
+ * Queuing exception is done in vmx_handle_exit. See comment there.
+ */
+ if (vcpu->arch.guest_fpu.fpstate->xfd)
+ rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+}
+
+static inline void vmx_handle_exception_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ /* if exit due to PF check for async PF */
+ if (is_page_fault(intr_info))
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* if exit due to NM, handle before interrupts are enabled */
+ else if (is_nm_fault(intr_info))
+ vmx_handle_nm_fault_irqoff(vcpu);
+ /* Handle machine checks before interrupts are enabled */
+ else if (is_machine_check(intr_info))
+ kvm_machine_check();
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+ gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+ if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+ "unexpected VM-Exit interrupt info: 0x%x", intr_info))
+ return;
+
+ kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
+ vmx_do_interrupt_irqoff(gate_offset(desc));
+ kvm_after_interrupt(vcpu);
+
+ vcpu->arch.at_instruction_boundary = true;
+}

static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
unsigned long exit_qualification)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 29d891e0795e..f8a00a766c40 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -518,7 +518,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
vmx->segment_cache.bitmask = 0;
}

-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;

#if IS_ENABLED(CONFIG_HYPERV)
static bool __read_mostly enlightened_vmcs = true;
@@ -4273,7 +4273,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */

- vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */
+ vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base); /* 22.2.4 */

vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */

@@ -5166,7 +5166,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
intr_info = vmx_get_intr_info(vcpu);

/*
- * Machine checks are handled by handle_exception_irqoff(), or by
+ * Machine checks are handled by vmx_handle_exception_irqoff(), or by
* vmx_vcpu_run() if a #MC occurs on VM-Entry. NMIs are handled by
* vmx_vcpu_enter_exit().
*/
@@ -5174,7 +5174,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
return 1;

/*
- * Queue the exception here instead of in handle_nm_fault_irqoff().
+ * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
* This ensures the nested_vmx check is not skipped so vmexit can
* be reflected to L1 (when it intercepts #NM) before reaching this
* point.
@@ -6889,59 +6889,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}

-void vmx_do_interrupt_irqoff(unsigned long entry);
-void vmx_do_nmi_irqoff(void);
-
-static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
-{
- /*
- * Save xfd_err to guest_fpu before interrupt is enabled, so the
- * MSR value is not clobbered by the host activity before the guest
- * has chance to consume it.
- *
- * Do not blindly read xfd_err here, since this exception might
- * be caused by L1 interception on a platform which doesn't
- * support xfd at all.
- *
- * Do it conditionally upon guest_fpu::xfd. xfd_err matters
- * only when xfd contains a non-zero value.
- *
- * Queuing exception is done in vmx_handle_exit. See comment there.
- */
- if (vcpu->arch.guest_fpu.fpstate->xfd)
- rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
-}
-
-static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
- /* if exit due to PF check for async PF */
- if (is_page_fault(intr_info))
- vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
- /* if exit due to NM, handle before interrupts are enabled */
- else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(vcpu);
- /* Handle machine checks before interrupts are enabled */
- else if (is_machine_check(intr_info))
- kvm_machine_check();
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
- u32 intr_info)
-{
- unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
- gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
- if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
- "unexpected VM-Exit interrupt info: 0x%x", intr_info))
- return;
-
- kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- vmx_do_interrupt_irqoff(gate_offset(desc));
- kvm_after_interrupt(vcpu);
-
- vcpu->arch.at_instruction_boundary = true;
-}
-
void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6950,9 +6897,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;

if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
}

/*
@@ -8284,7 +8232,7 @@ __init int vmx_hardware_setup(void)
int r;

store_idt(&dt);
- host_idt_base = dt.address;
+ vmx_host_idt_base = dt.address;

vmx_setup_user_return_msrs();

--
2.25.1


2024-02-26 09:00:40

by Isaku Yamahata

Subject: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX

From: Sean Christopherson <[email protected]>

For virtual I/O, the guest TD shares guest pages with the VMM without
encryption. The shared EPT is used to map those guest pages in an unprotected
way.

Add the VMCS field encoding for the shared EPTP, which will be used by
TDX to have separate EPT walks for private GPAs (existing EPTP) versus
shared GPAs (new shared EPTP).

Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
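As a rough model of the private/shared split, assume a shared ("S") bit in the GPA selects which EPT walk applies. The mask position and helper names below are illustrative assumptions, not the kernel's exact definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in: S bit at GPA bit 47 (MAXPHYADDR-dependent in reality). */
#define GPA_SHARED_BIT (1ULL << 47)

static int gpa_is_shared(uint64_t gpa) { return !!(gpa & GPA_SHARED_BIT); }

static uint64_t gpa_to_shared(uint64_t gpa)  { return gpa | GPA_SHARED_BIT; }
static uint64_t gpa_to_private(uint64_t gpa) { return gpa & ~GPA_SHARED_BIT; }

/* Pick which EPT root walks this GPA: 0 = private EPTP, 1 = shared EPTP. */
static int ept_walk_root(uint64_t gpa) { return gpa_is_shared(gpa); }
```

Private GPAs (S bit clear) stay on the existing EPTP; shared GPAs (S bit set) are walked via the new SHARED_EPT_POINTER.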

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
v19:
- Add WARN_ON_ONCE() to tdx_load_mmu_pgd() and drop unconditional mask
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/vmx/main.c | 13 ++++++++++++-
arch/x86/kvm/vmx/tdx.c | 6 ++++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f703bae0c4ac..9deb663a42e3 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -236,6 +236,7 @@ enum vmcs_field {
TSC_MULTIPLIER_HIGH = 0x00002033,
TERTIARY_VM_EXEC_CONTROL = 0x00002034,
TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
+ SHARED_EPT_POINTER = 0x0000203C,
PID_POINTER_TABLE = 0x00002042,
PID_POINTER_TABLE_HIGH = 0x00002043,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d0f75020579f..076a471d9aea 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -123,6 +123,17 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}

+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+ int pgd_level)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+ return;
+ }
+
+ vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -256,7 +267,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.write_tsc_offset = vmx_write_tsc_offset,
.write_tsc_multiplier = vmx_write_tsc_multiplier,

- .load_mmu_pgd = vmx_load_mmu_pgd,
+ .load_mmu_pgd = vt_load_mmu_pgd,

.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 54e0d4efa2bd..143a3c2a16bc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -453,6 +453,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
}

+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+ WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
+ td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f5820f617b2e..24161fa404aa 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -152,6 +152,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
#else
static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
static inline void tdx_hardware_unsetup(void) {}
@@ -173,6 +175,8 @@ static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
#endif

#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 09:00:42

by Isaku Yamahata

Subject: [PATCH v19 097/130] KVM: x86: Split core of hypercall emulation to helper function

From: Sean Christopherson <[email protected]>

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
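The refactor can be sketched as a core function that takes explicit arguments plus a thin wrapper that reads the legacy register ABI. The names and the dispatch body here are placeholders, not the kernel code; only the 32-bit truncation mirrors what the kernel does:

```c
#include <assert.h>
#include <stdint.h>

/* Core logic takes explicit arguments so a caller with a different register
 * ABI (TDX) can reuse it. */
static unsigned long core_hypercall(unsigned long nr, unsigned long a0,
				    int op_64_bit)
{
	if (!op_64_bit) {
		/* 32-bit guests: only the low 32 bits of each value are valid. */
		nr &= 0xFFFFFFFFul;
		a0 &= 0xFFFFFFFFul;
	}
	return nr + a0; /* placeholder for the real hypercall dispatch */
}

/* Legacy wrapper: arguments come from RAX/RBX; TDX would pass its own. */
static unsigned long emulate_hypercall(unsigned long rax, unsigned long rbx,
				       int op_64_bit)
{
	return core_hypercall(rax, rbx, op_64_bit);
}
```

(Assumes a 64-bit `unsigned long`, as on x86-64.)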

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++-----------
2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0ffef1d377d..bb8be091f996 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2177,6 +2177,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
}

+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit, int cpl);
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);

int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb7597c22f31..03950368d8db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10073,26 +10073,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}

-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+ unsigned long a0, unsigned long a1,
+ unsigned long a2, unsigned long a3,
+ int op_64_bit, int cpl)
{
- unsigned long nr, a0, a1, a2, a3, ret;
- int op_64_bit;
-
- if (kvm_xen_hypercall_enabled(vcpu->kvm))
- return kvm_xen_hypercall(vcpu);
-
- if (kvm_hv_hypercall_enabled(vcpu))
- return kvm_hv_hypercall(vcpu);
-
- nr = kvm_rax_read(vcpu);
- a0 = kvm_rbx_read(vcpu);
- a1 = kvm_rcx_read(vcpu);
- a2 = kvm_rdx_read(vcpu);
- a3 = kvm_rsi_read(vcpu);
+ unsigned long ret;

trace_kvm_hypercall(nr, a0, a1, a2, a3);

- op_64_bit = is_64_bit_hypercall(vcpu);
if (!op_64_bit) {
nr &= 0xFFFFFFFF;
a0 &= 0xFFFFFFFF;
@@ -10101,7 +10090,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
a3 &= 0xFFFFFFFF;
}

- if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+ if (cpl) {
ret = -KVM_EPERM;
goto out;
}
@@ -10162,18 +10151,49 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)

WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+ /* stat is incremented on completion. */
return 0;
}
default:
ret = -KVM_ENOSYS;
break;
}
+
out:
+ ++vcpu->stat.hypercalls;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+ int op_64_bit;
+ int cpl;
+
+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
+ return kvm_xen_hypercall(vcpu);
+
+ if (kvm_hv_hypercall_enabled(vcpu))
+ return kvm_hv_hypercall(vcpu);
+
+ nr = kvm_rax_read(vcpu);
+ a0 = kvm_rbx_read(vcpu);
+ a1 = kvm_rcx_read(vcpu);
+ a2 = kvm_rdx_read(vcpu);
+ a3 = kvm_rsi_read(vcpu);
+ op_64_bit = is_64_bit_hypercall(vcpu);
+ cpl = static_call(kvm_x86_get_cpl)(vcpu);
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
+ /* MAP_GPA tosses the request to the user space. */
+ return 0;
+
if (!op_64_bit)
ret = (u32)ret;
kvm_rax_write(vcpu, ret);

- ++vcpu->stat.hypercalls;
return kvm_skip_emulated_instruction(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
--
2.25.1


2024-02-26 09:00:57

by Isaku Yamahata

Subject: [PATCH v19 098/130] KVM: TDX: Add a place holder to handle TDX VM exit

From: Isaku Yamahata <[email protected]>

Wire up the handle_exit and handle_exit_irqoff methods and add a placeholder
to handle VM exits. Add helper functions to get the exit info, exit
qualification, etc.
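The register-based exit-info ABI can be sketched as simple accessors over a saved guest register file. The struct below is an illustrative stand-in for the kernel's vCPU state; the register assignments follow the helpers added by this patch (RCX = exit qualification, RDX = extended exit qualification, R8 = GPA, R9 = interrupt info):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative guest-register file saved at TD exit (not the kernel type). */
struct td_regs {
	uint64_t rcx, rdx, r8, r9;
};

static uint64_t tdexit_exit_qual(const struct td_regs *r)     { return r->rcx; }
static uint64_t tdexit_ext_exit_qual(const struct td_regs *r) { return r->rdx; }
static uint64_t tdexit_gpa(const struct td_regs *r)           { return r->r8; }
static uint64_t tdexit_intr_info(const struct td_regs *r)     { return r->r9; }
```

tdx_get_exit_info() then fills the common kvm_x86_ops query from these accessors instead of VMCS reads.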

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 37 ++++++++++++-
arch/x86/kvm/vmx/tdx.c | 110 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 10 ++++
3 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6d6d443a2bbd..c9a40456d965 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -228,6 +228,25 @@ static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
return tdx_protected_apic_has_interrupt(vcpu);
}

+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_handle_exit(vcpu, fastpath);
+
+ return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_handle_exit_irqoff(vcpu);
+ return;
+ }
+
+ vmx_handle_exit_irqoff(vcpu);
+}
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -436,6 +455,18 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
vmx_request_immediate_exit(vcpu);
}

+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+ error_code);
+ return;
+ }
+
+ vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -562,7 +593,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.vcpu_pre_run = vt_vcpu_pre_run,
.vcpu_run = vt_vcpu_run,
- .handle_exit = vmx_handle_exit,
+ .handle_exit = vt_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
@@ -597,7 +628,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vt_get_mt_mask,

- .get_exit_info = vmx_get_exit_info,
+ .get_exit_info = vt_get_exit_info,

.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,

@@ -611,7 +642,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.load_mmu_pgd = vt_load_mmu_pgd,

.check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
+ .handle_exit_irqoff = vt_handle_exit_irqoff,

.request_immediate_exit = vt_request_immediate_exit,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index be21dca47992..71ab48cf72ba 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -120,6 +120,26 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
}

+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rcx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+ return kvm_rdx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+ return kvm_r8_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+ return kvm_r9_read(vcpu);
+}
+
static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
{
return tdx->td_vcpu_created;
@@ -837,6 +857,12 @@ static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
WARN_ON_ONCE(!kvm_rebooting &&
(tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);

+ if ((u16)tdx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
+ is_nmi(tdexit_intr_info(vcpu))) {
+ kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
+ vmx_do_nmi_irqoff();
+ kvm_after_interrupt(vcpu);
+ }
guest_state_exit_irqoff();
}

@@ -880,6 +906,25 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
}

+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u16 exit_reason = tdx->exit_reason.basic;
+
+ if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ tdexit_intr_info(vcpu));
+ else if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+ vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+ vcpu->mmio_needed = 0;
+ return 0;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
@@ -1240,6 +1285,71 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}

+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
+{
+ union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+ /* See the comment of tdh_sept_seamcall(). */
+ if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)))
+ return 1;
+
+ /*
+ * TDH.VP.ENTER checks the TD epoch, which contends with TDH.MEM.TRACK and
+ * other vcpus' TDH.VP.ENTER.
+ */
+ if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_TD_EPOCH)))
+ return 1;
+
+ if (unlikely(exit_reason.full == TDX_SEAMCALL_UD)) {
+ kvm_spurious_fault();
+ /*
+ * In the case of reboot or kexec, loop with TDH.VP.ENTER and
+ * TDX_SEAMCALL_UD to avoid unnecessarily activity.
+ */
+ return 1;
+ }
+
+ if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+ if (unlikely(exit_reason.basic == EXIT_REASON_TRIPLE_FAULT))
+ return tdx_handle_triple_fault(vcpu);
+
+ kvm_pr_unimpl("TD exit 0x%llx, %d hkid 0x%x hkid pa 0x%llx\n",
+ exit_reason.full, exit_reason.basic,
+ to_kvm_tdx(vcpu->kvm)->hkid,
+ set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
+ goto unhandled_exit;
+ }
+
+ WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+ switch (exit_reason.basic) {
+ default:
+ break;
+ }
+
+unhandled_exit:
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+ vcpu->run->internal.ndata = 2;
+ vcpu->run->internal.data[0] = exit_reason.full;
+ vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
+ return 0;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ *reason = tdx->exit_reason.full;
+
+ *info1 = tdexit_exit_qual(vcpu);
+ *info2 = tdexit_ext_exit_qual(vcpu);
+
+ *intr_info = tdexit_intr_info(vcpu);
+ *error_code = 0;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 539f3f9686fe..a12e3bfc96dd 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -155,11 +155,16 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -199,11 +204,16 @@ static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
+static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
+ enum exit_fastpath_completion fastpath) { return 0; }
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
+ u64 *info2, u32 *intr_info, u32 *error_code) {}

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

--
2.25.1


2024-02-26 09:01:09

by Isaku Yamahata

Subject: [PATCH v19 099/130] KVM: TDX: Handle vmentry failure for INTEL TD guest

From: Yao Yuan <[email protected]>

The TDX module passes control back to the VMM if it fails to VM-enter a TD;
use the same exit reason to notify user space, aligning with VMX.
If the VMM corrupted the TD VMCS, a machine check during entry can happen and
the VM exit reason will be EXIT_REASON_MCE_DURING_VMENTRY. If the VMM
corrupted the TD VMCS of a debug TD via TDH.VP.WR, the exit reason would be
EXIT_REASON_INVALID_STATE or EXIT_REASON_MSR_LOAD_FAIL.
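Decoding such an exit reason can be sketched as below, assuming the VMX convention that the low 16 bits carry the basic reason and bit 31 flags a failed VM entry. The real union tdx_exit_reason carries additional fields (error, non_recoverable, etc.), so this is only a model:

```c
#include <assert.h>
#include <stdint.h>

#define EXIT_REASON_MCE_DURING_VMENTRY 41 /* VMX basic exit reason */

struct exit_reason {
	uint64_t full;
};

/* Low 16 bits: basic exit reason. */
static uint16_t exit_basic(struct exit_reason r) { return (uint16_t)r.full; }

/* Bit 31: failed VM entry, per the VMX exit-reason convention. */
static int exit_failed_vmentry(struct exit_reason r)
{
	return !!(r.full & (1ULL << 31));
}
```

On failed_vmentry, the handler reports KVM_EXIT_FAIL_ENTRY to user space with the full exit reason as the hardware entry failure reason.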

Signed-off-by: Yao Yuan <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 71ab48cf72ba..cba0fd5029be 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1320,6 +1320,28 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
goto unhandled_exit;
}

+ /*
+ * When TDX module saw VMEXIT_REASON_FAILED_VMENTER_MC etc, TDH.VP.ENTER
+ * returns with TDX_SUCCESS | exit_reason with failed_vmentry = 1.
+ * Because TDX module maintains TD VMCS correctness, usually vmentry
+ * failure shouldn't happen. In some corner cases it can happen. For
+ * example
+ * - machine check during entry: EXIT_REASON_MCE_DURING_VMENTRY
+ * - TDH.VP.WR with debug TD. VMM can corrupt TD VMCS
+ * - EXIT_REASON_INVALID_STATE
+ * - EXIT_REASON_MSR_LOAD_FAIL
+ */
+ if (unlikely(exit_reason.failed_vmentry)) {
+ pr_err("TDExit: exit_reason 0x%016llx qualification=%016lx ext_qualification=%016lx\n",
+ exit_reason.full, tdexit_exit_qual(vcpu), tdexit_ext_exit_qual(vcpu));
+ vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+ vcpu->run->fail_entry.hardware_entry_failure_reason
+ = exit_reason.full;
+ vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
+
+ return 0;
+ }
+
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
--
2.25.1


2024-02-26 09:01:21

by Isaku Yamahata

Subject: [PATCH v19 100/130] KVM: TDX: handle EXIT_REASON_OTHER_SMI

From: Isaku Yamahata <[email protected]>

If control reaches EXIT_REASON_OTHER_SMI, the #SMI was delivered and handled
right after returning from the TDX module to KVM; nothing needs to be done in
KVM. Continue TDX vCPU execution.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/uapi/asm/vmx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 7 +++++++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index a5faf6d88f1b..b3a30ef3efdd 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -34,6 +34,7 @@
#define EXIT_REASON_TRIPLE_FAULT 2
#define EXIT_REASON_INIT_SIGNAL 3
#define EXIT_REASON_SIPI_SIGNAL 4
+#define EXIT_REASON_OTHER_SMI 6

#define EXIT_REASON_INTERRUPT_WINDOW 7
#define EXIT_REASON_NMI_WINDOW 8
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cba0fd5029be..2f68e6f2b53a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1345,6 +1345,13 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
+ case EXIT_REASON_OTHER_SMI:
+ /*
+ * If we reach here, it's not a Machine Check System Management
+ * Interrupt (MSMI). The #SMI was delivered and handled right after
+ * SEAMRET, nothing needs to be done in KVM.
+ */
+ return 1;
default:
break;
}
--
2.25.1


2024-02-26 09:01:49

by Isaku Yamahata

Subject: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

From: Isaku Yamahata <[email protected]>

On an EPT violation, call the common function __vmx_handle_ept_violation() to
invoke the x86 MMU code. On an EPT misconfiguration, exit to ring 3 with
KVM_EXIT_UNKNOWN, because an EPT misconfiguration can't happen: MMIO is
triggered by TDG.VP.VMCALL. There is no point in setting a misconfiguration
value for the fast path.
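The error-code construction used for SEPT violations can be sketched as follows. The bit values are illustrative stand-ins for the kernel's EPT_VIOLATION_* and PFERR_* constants; the forced-write behavior for private GPAs follows the comment in tdx_handle_ept_violation():

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's constants. */
#define EPT_ACC_WRITE   (1u << 1)
#define PFERR_WRITE     (1u << 1)
#define PFERR_GUEST_ENC (1u << 3) /* fault on an encrypted (private) GPA */

/* For a private GPA, always treat the SEPT violation as a write fault
 * (ignoring the reported qualification) and tag the fault as private. */
static uint32_t build_error_code(uint64_t exit_qual, int private_gpa)
{
	uint32_t ec = 0;

	if (private_gpa)
		exit_qual = EPT_ACC_WRITE;
	if (exit_qual & EPT_ACC_WRITE)
		ec |= PFERR_WRITE;
	if (private_gpa)
		ec |= PFERR_GUEST_ENC;
	return ec;
}
```

Treating private faults as writes avoids COW allocations that would alias one HPA to multiple GPAs and fail TDH.MEM.PAGE.AUG.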

Signed-off-by: Isaku Yamahata <[email protected]>

---
v14 -> v15:
- use PFERR_GUEST_ENC_MASK to tell the fault is private

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/common.h | 3 +++
arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 632af7a76d0a..027aa4175d2c 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -87,6 +87,9 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;

+ if (kvm_is_private_gpa(vcpu->kvm, gpa))
+ error_code |= PFERR_GUEST_ENC_MASK;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2f68e6f2b53a..0db80fa020d2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1285,6 +1285,51 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
}

+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+ unsigned long exit_qual;
+
+ if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+ /*
+ * Always treat SEPT violations as write faults. Ignore the
+ * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
+ * TD private pages are always RWX in the SEPT tables,
+ * i.e. they're always mapped writable. Just as importantly,
+ * treating SEPT violations as write faults is necessary to
+ * avoid COW allocations, which will cause TDAUGPAGE failures
+ * due to aliasing a single HPA to multiple GPAs.
+ */
+#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
+ exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
+ } else {
+ exit_qual = tdexit_exit_qual(vcpu);
+ if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+ pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
+ tdexit_gpa(vcpu), kvm_rip_read(vcpu));
+ vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
+ vcpu->run->ex.exception = PF_VECTOR;
+ vcpu->run->ex.error_code = exit_qual;
+ return 0;
+ }
+ }
+
+ trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+ WARN_ON_ONCE(1);
+
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+ vcpu->run->internal.ndata = 2;
+ vcpu->run->internal.data[0] = EXIT_REASON_EPT_MISCONFIG;
+ vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
+
+ return 0;
+}
+
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
{
union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
@@ -1345,6 +1390,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_handle_ept_violation(vcpu);
+ case EXIT_REASON_EPT_MISCONFIG:
+ return tdx_handle_ept_misconfig(vcpu);
case EXIT_REASON_OTHER_SMI:
/*
* If reach here, it's not a Machine Check System Management
--
2.25.1


2024-02-26 09:02:16

by Isaku Yamahata

Subject: [PATCH v19 102/130] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT

From: Isaku Yamahata <[email protected]>

Because guest TD state is protected, exceptions in guest TDs can't be
intercepted, so the TDX VMM doesn't need to handle exceptions.
tdx_handle_exit_irqoff() handles NMI and machine check; ignore both and
continue guest TD execution.

For external interrupts, increment the stats, the same as in the VMX case.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0db80fa020d2..bdd74682b474 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -918,6 +918,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
}

+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
+{
+ u32 intr_info = tdexit_intr_info(vcpu);
+
+ if (is_nmi(intr_info) || is_machine_check(intr_info))
+ return 1;
+
+ kvm_pr_unimpl("unexpected exception 0x%x(exit_reason 0x%llx qual 0x%lx)\n",
+ intr_info,
+ to_tdx(vcpu)->exit_reason.full, tdexit_exit_qual(vcpu));
+ return -EFAULT;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.irq_exits;
+ return 1;
+}
+
static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
{
vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -1390,6 +1409,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);

switch (exit_reason.basic) {
+ case EXIT_REASON_EXCEPTION_NMI:
+ return tdx_handle_exception(vcpu);
+ case EXIT_REASON_EXTERNAL_INTERRUPT:
+ return tdx_handle_external_interrupt(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_handle_ept_violation(vcpu);
case EXIT_REASON_EPT_MISCONFIG:
--
2.25.1


2024-02-26 09:03:11

by Isaku Yamahata

Subject: [PATCH v19 104/130] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)

From: Isaku Yamahata <[email protected]>

The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM. When the guest TD
issues TDG.VP.VMCALL, the guest TD exits to the VMM with a new exit
reason, TDVMCALL. The arguments from the guest TD and the values returned
by the VMM are passed in the guest registers. The guest RCX register
indicates which registers are used. Define helper functions to access
those registers per the ABI.

Define the TDVMCALL exit reason, which is carved out from the VMX exit
reason namespace, as the TDVMCALL exit from the TDX guest to TDX-SEAM is
really just a VM-Exit. Add a placeholder to handle the TDVMCALL exit.
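As an illustration of the register convention described above, a freestanding
sketch follows. The struct and helper names are hypothetical (the kernel uses
the kvm_rNN_read/write() accessors built by BUILD_TDVMCALL_ACCESSORS below);
this only shows the R10/R11/R12-R15/RCX layout, not a real implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative view of the TDVMCALL register convention:
 *   R10 - exit type (0 = standard GHCI leaf, non-zero = vendor specific)
 *   R11 - leaf number
 *   R12..R15 - leaf-specific arguments a0..a3
 *   RCX - bitmap telling the TDX module which GPRs are exposed to the VMM
 */
struct tdvmcall_regs {
	uint64_t rcx;			/* exposed-register bitmap */
	uint64_t r10;			/* exit type / status on return */
	uint64_t r11;			/* leaf / return value */
	uint64_t r12, r13, r14, r15;	/* a0..a3 */
};

/* Build the RCX bitmap exposing R10 through R15 (bit n = GPR n). */
static uint64_t tdvmcall_regmask_r10_r15(void)
{
	uint64_t mask = 0;

	for (int gpr = 10; gpr <= 15; gpr++)
		mask |= 1ULL << gpr;
	return mask;
}
```

With bits 10 through 15 set, the resulting bitmap is 0xfc00, which matches the
TDX_VMCALL_REG_MASK_R10..R15 constants exported to userspace in a later patch.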

Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/uapi/asm/vmx.h | 4 ++-
arch/x86/kvm/vmx/tdx.c | 53 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 13 ++++++++
3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index b3a30ef3efdd..f0f4a4cf84a7 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -93,6 +93,7 @@
#define EXIT_REASON_TPAUSE 68
#define EXIT_REASON_BUS_LOCK 74
#define EXIT_REASON_NOTIFY 75
+#define EXIT_REASON_TDCALL 77

#define VMX_EXIT_REASONS \
{ EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
@@ -156,7 +157,8 @@
{ EXIT_REASON_UMWAIT, "UMWAIT" }, \
{ EXIT_REASON_TPAUSE, "TPAUSE" }, \
{ EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \
- { EXIT_REASON_NOTIFY, "NOTIFY" }
+ { EXIT_REASON_NOTIFY, "NOTIFY" }, \
+ { EXIT_REASON_TDCALL, "TDCALL" }

#define VMX_EXIT_REASON_FLAGS \
{ VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 117c2315f087..0be58cd428b3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -140,6 +140,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
return kvm_r9_read(vcpu);
}

+#define BUILD_TDVMCALL_ACCESSORS(param, gpr) \
+static __always_inline \
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu) \
+{ \
+ return kvm_##gpr##_read(vcpu); \
+} \
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
+ unsigned long val) \
+{ \
+ kvm_##gpr##_write(vcpu, val); \
+}
+BUILD_TDVMCALL_ACCESSORS(a0, r12);
+BUILD_TDVMCALL_ACCESSORS(a1, r13);
+BUILD_TDVMCALL_ACCESSORS(a2, r14);
+BUILD_TDVMCALL_ACCESSORS(a3, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+ return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
+{
+ return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+ long val)
+{
+ kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+ unsigned long val)
+{
+ kvm_r11_write(vcpu, val);
+}
+
static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
{
return tdx->td_vcpu_created;
@@ -897,6 +932,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_complete_interrupts(vcpu);

+ if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+ tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+ else
+ tdx->tdvmcall.rcx = 0;
+
return EXIT_FASTPATH_NONE;
}

@@ -968,6 +1008,17 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
return 0;
}

+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+ switch (tdvmcall_leaf(vcpu)) {
+ default:
+ break;
+ }
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ return 1;
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
@@ -1442,6 +1493,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_exception(vcpu);
case EXIT_REASON_EXTERNAL_INTERRUPT:
return tdx_handle_external_interrupt(vcpu);
+ case EXIT_REASON_TDCALL:
+ return handle_tdvmcall(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_handle_ept_violation(vcpu);
case EXIT_REASON_EPT_MISCONFIG:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index eaffa7384725..4399d474764f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -83,6 +83,19 @@ struct vcpu_tdx {

struct list_head cpu_list;

+ union {
+ struct {
+ union {
+ struct {
+ u16 gpr_mask;
+ u16 xmm_mask;
+ };
+ u32 regs_mask;
+ };
+ u32 reserved;
+ };
+ u64 rcx;
+ } tdvmcall;
union tdx_exit_reason exit_reason;

bool initialized;
--
2.25.1


2024-02-26 09:03:23

by Isaku Yamahata

Subject: [PATCH v19 105/130] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL

From: Isaku Yamahata <[email protected]>

The TDX Guest-Host Communication Interface (GHCI) specification defines
the ABI for the guest TD to issue hypercalls. It reserves vendor-specific
arguments for VMM-specific use. Use them for the KVM hypercall and handle
it.
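A minimal sketch of the marshaling implied here, with the KVM hypercall number
in R10 (non-zero, per the vendor-specific carve-out) and arguments in R11-R14,
might look like the following. The names are illustrative only, not kernel
code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest-side view of the KVM-hypercall-over-TDVMCALL ABI. */
struct kvm_tdvmcall {
	uint64_t r10;			/* KVM hypercall number, non-zero */
	uint64_t r11, r12, r13, r14;	/* hypercall arguments a0..a3 */
};

/*
 * Pack a KVM hypercall. Returns -1 if nr is zero, since the GHCI reserves
 * R10 == 0 for standard (non vendor-specific) leaves.
 */
static int pack_kvm_hypercall(struct kvm_tdvmcall *call, uint64_t nr,
			      uint64_t a0, uint64_t a1, uint64_t a2,
			      uint64_t a3)
{
	if (!nr)
		return -1;
	call->r10 = nr;
	call->r11 = a0;
	call->r12 = a1;
	call->r13 = a2;
	call->r14 = a3;
	return 0;
}
```

This mirrors the read side in tdx_emulate_vmcall(), which pulls nr from R10
and a0-a3 from R11-R14 before calling __kvm_emulate_hypercall().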

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0be58cd428b3..c8eb47591105 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1008,8 +1008,41 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
return 0;
}

+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+ unsigned long nr, a0, a1, a2, a3, ret;
+
+ /*
+ * ABI for KVM tdvmcall argument:
+ * In Guest-Hypervisor Communication Interface(GHCI) specification,
+ * Non-zero leaf number (R10 != 0) is defined to indicate
+ * vendor-specific. KVM uses this for KVM hypercall. NOTE: KVM
+ * hypercall number starts from one. Zero isn't used for KVM hypercall
+ * number.
+ *
+ * R10: KVM hypercall number
+ * arguments: R11, R12, R13, R14.
+ */
+ nr = kvm_r10_read(vcpu);
+ a0 = kvm_r11_read(vcpu);
+ a1 = kvm_r12_read(vcpu);
+ a2 = kvm_r13_read(vcpu);
+ a3 = kvm_r14_read(vcpu);
+
+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true, 0);
+
+ tdvmcall_set_return_code(vcpu, ret);
+
+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
+ return 0;
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
+ if (tdvmcall_exit_type(vcpu))
+ return tdx_emulate_vmcall(vcpu);
+
switch (tdvmcall_leaf(vcpu)) {
default:
break;
--
2.25.1


2024-02-26 09:04:03

by Isaku Yamahata

Subject: [PATCH v19 106/130] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL

From: Isaku Yamahata <[email protected]>

Some TDG.VP.VMCALL leaves require the device model, for example qemu, to
handle them on behalf of the kvm kernel module: TDVMCALL_REPORT_FATAL_ERROR,
TDVMCALL_MAP_GPA, TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and
TDVMCALL_GET_QUOTE require user space VMM handling.

Introduce a new kvm exit, KVM_EXIT_TDX, and functions to set it up. The
device model should update R10 with the return value if necessary.
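The userspace side of this exit might be handled roughly as follows. This is
an illustrative sketch, not QEMU code: the struct only loosely mirrors the
in_/out_ pairs of struct kvm_tdx_vmcall, and the status-code values are
assumed from the GHCI convention rather than taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* GHCI-style status codes (values assumed here for illustration). */
#define TDVMCALL_SUCCESS		0x0000000000000000ULL
#define TDVMCALL_INVALID_OPERAND	0x8000000000000000ULL

/* Loose stand-in for the in_/out_ register pairs of struct kvm_tdx_vmcall. */
struct tdx_vmcall {
	uint64_t in_r10;	/* type: 0 == standard GHCI leaf */
	uint64_t in_r11;	/* subfunction */
	uint64_t out_r10;	/* status code written back by userspace */
};

/* Device-model handler: fill out_r10 so KVM copies it back on re-entry. */
static void handle_kvm_exit_tdx(struct tdx_vmcall *vmcall)
{
	switch (vmcall->in_r11) {
	/* e.g. TDVMCALL_GET_QUOTE, TDVMCALL_REPORT_FATAL_ERROR, ... */
	default:
		vmcall->out_r10 = TDVMCALL_INVALID_OPERAND;
		break;
	}
}
```

On the next KVM_RUN, tdx_complete_vp_vmcall() copies out_r10 (and the other
out_ registers selected by reg_mask) back into the guest GPRs.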

Signed-off-by: Isaku Yamahata <[email protected]>
---
v14 -> v15:
- updated struct kvm_tdx_exit with union
- export constants for reg bitmask

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 83 ++++++++++++++++++++++++++++++++++++-
include/uapi/linux/kvm.h | 89 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 170 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c8eb47591105..72dbe2ff9062 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1038,6 +1038,78 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_complete_vp_vmcall(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+ __u64 reg_mask = kvm_rcx_read(vcpu);
+
+#define COPY_REG(MASK, REG) \
+ do { \
+ if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) \
+ kvm_## REG ## _write(vcpu, tdx_vmcall->out_ ## REG); \
+ } while (0)
+
+
+ COPY_REG(R10, r10);
+ COPY_REG(R11, r11);
+ COPY_REG(R12, r12);
+ COPY_REG(R13, r13);
+ COPY_REG(R14, r14);
+ COPY_REG(R15, r15);
+ COPY_REG(RBX, rbx);
+ COPY_REG(RDI, rdi);
+ COPY_REG(RSI, rsi);
+ COPY_REG(R8, r8);
+ COPY_REG(R9, r9);
+ COPY_REG(RDX, rdx);
+
+#undef COPY_REG
+
+ return 1;
+}
+
+static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
+{
+ struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+ __u64 reg_mask;
+
+ vcpu->arch.complete_userspace_io = tdx_complete_vp_vmcall;
+ memset(tdx_vmcall, 0, sizeof(*tdx_vmcall));
+
+ vcpu->run->exit_reason = KVM_EXIT_TDX;
+ vcpu->run->tdx.type = KVM_EXIT_TDX_VMCALL;
+
+ reg_mask = kvm_rcx_read(vcpu);
+ tdx_vmcall->reg_mask = reg_mask;
+
+#define COPY_REG(MASK, REG) \
+ do { \
+ if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) { \
+ tdx_vmcall->in_ ## REG = kvm_ ## REG ## _read(vcpu); \
+ tdx_vmcall->out_ ## REG = tdx_vmcall->in_ ## REG; \
+ } \
+ } while (0)
+
+
+ COPY_REG(R10, r10);
+ COPY_REG(R11, r11);
+ COPY_REG(R12, r12);
+ COPY_REG(R13, r13);
+ COPY_REG(R14, r14);
+ COPY_REG(R15, r15);
+ COPY_REG(RBX, rbx);
+ COPY_REG(RDI, rdi);
+ COPY_REG(RSI, rsi);
+ COPY_REG(R8, r8);
+ COPY_REG(R9, r9);
+ COPY_REG(RDX, rdx);
+
+#undef COPY_REG
+
+ /* notify userspace to handle the request */
+ return 0;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1048,8 +1120,15 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
break;
}

- tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
- return 1;
+ /*
+ * Unknown VMCALL. Toss the request to the user space VMM, e.g. qemu,
+ * as it may know how to handle it.
+ *
+ * Those VMCALLs require user space VMM:
+ * TDVMCALL_REPORT_FATAL_ERROR, TDVMCALL_MAP_GPA,
+ * TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and TDVMCALL_GET_QUOTE.
+ */
+ return tdx_vp_vmcall_to_user(vcpu);
}

void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5e2b28934aa9..a7aa804ef021 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -167,6 +167,92 @@ struct kvm_xen_exit {
} u;
};

+/* masks for reg_mask to indicate which registers are passed. */
+#define TDX_VMCALL_REG_MASK_RBX BIT_ULL(2)
+#define TDX_VMCALL_REG_MASK_RDX BIT_ULL(3)
+#define TDX_VMCALL_REG_MASK_RSI BIT_ULL(6)
+#define TDX_VMCALL_REG_MASK_RDI BIT_ULL(7)
+#define TDX_VMCALL_REG_MASK_R8 BIT_ULL(8)
+#define TDX_VMCALL_REG_MASK_R9 BIT_ULL(9)
+#define TDX_VMCALL_REG_MASK_R10 BIT_ULL(10)
+#define TDX_VMCALL_REG_MASK_R11 BIT_ULL(11)
+#define TDX_VMCALL_REG_MASK_R12 BIT_ULL(12)
+#define TDX_VMCALL_REG_MASK_R13 BIT_ULL(13)
+#define TDX_VMCALL_REG_MASK_R14 BIT_ULL(14)
+#define TDX_VMCALL_REG_MASK_R15 BIT_ULL(15)
+
+struct kvm_tdx_exit {
+#define KVM_EXIT_TDX_VMCALL 1
+ __u32 type;
+ __u32 pad;
+
+ union {
+ struct kvm_tdx_vmcall {
+ /*
+ * RAX(bit 0), RCX(bit 1) and RSP(bit 4) are reserved.
+ * RAX(bit 0): TDG.VP.VMCALL status code.
+ * RCX(bit 1): bitmap for used registers.
+ * RSP(bit 4): the caller stack.
+ */
+ union {
+ __u64 in_rcx;
+ __u64 reg_mask;
+ };
+
+ /*
+ * Guest-Host-Communication Interface for TDX spec
+ * defines the ABI for TDG.VP.VMCALL.
+ */
+ /* Input parameters: guest -> VMM */
+ union {
+ __u64 in_r10;
+ __u64 type;
+ };
+ union {
+ __u64 in_r11;
+ __u64 subfunction;
+ };
+ /*
+ * Subfunction specific.
+ * Registers are used in this order to pass input
+ * arguments. r12=arg0, r13=arg1, etc.
+ */
+ __u64 in_r12;
+ __u64 in_r13;
+ __u64 in_r14;
+ __u64 in_r15;
+ __u64 in_rbx;
+ __u64 in_rdi;
+ __u64 in_rsi;
+ __u64 in_r8;
+ __u64 in_r9;
+ __u64 in_rdx;
+
+ /* Output parameters: VMM -> guest */
+ union {
+ __u64 out_r10;
+ __u64 status_code;
+ };
+ /*
+ * Subfunction specific.
+ * Registers are used in this order to output return
+ * values. r11=ret0, r12=ret1, etc.
+ */
+ __u64 out_r11;
+ __u64 out_r12;
+ __u64 out_r13;
+ __u64 out_r14;
+ __u64 out_r15;
+ __u64 out_rbx;
+ __u64 out_rdi;
+ __u64 out_rsi;
+ __u64 out_r8;
+ __u64 out_r9;
+ __u64 out_rdx;
+ } vmcall;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -210,6 +296,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_NOTIFY 37
#define KVM_EXIT_LOONGARCH_IOCSR 38
#define KVM_EXIT_MEMORY_FAULT 39
+#define KVM_EXIT_TDX 40

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -470,6 +557,8 @@ struct kvm_run {
__u64 gpa;
__u64 size;
} memory_fault;
+ /* KVM_EXIT_TDX_VMCALL */
+ struct kvm_tdx_exit tdx;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1


2024-02-26 09:04:04

by Isaku Yamahata

Subject: [PATCH v19 107/130] KVM: TDX: Handle TDX PV CPUID hypercall

From: Isaku Yamahata <[email protected]>

Wire up TDX PV CPUID hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 72dbe2ff9062..eb68d6c148b6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1110,12 +1110,34 @@ static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
return 0;
}

+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+ u32 eax, ebx, ecx, edx;
+
+ /* EAX and ECX for cpuid is stored in R12 and R13. */
+ eax = tdvmcall_a0_read(vcpu);
+ ecx = tdvmcall_a1_read(vcpu);
+
+ kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
+
+ tdvmcall_a0_write(vcpu, eax);
+ tdvmcall_a1_write(vcpu, ebx);
+ tdvmcall_a2_write(vcpu, ecx);
+ tdvmcall_a3_write(vcpu, edx);
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
return tdx_emulate_vmcall(vcpu);

switch (tdvmcall_leaf(vcpu)) {
+ case EXIT_REASON_CPUID:
+ return tdx_emulate_cpuid(vcpu);
default:
break;
}
--
2.25.1


2024-02-26 09:04:22

by Isaku Yamahata

Subject: [PATCH v19 069/130] KVM: TDX: Require TDP MMU and mmio caching for TDX

From: Isaku Yamahata <[email protected]>

As the TDP MMU is becoming mainstream, replacing the legacy MMU, legacy
MMU support for TDX isn't implemented. TDX also requires KVM MMIO
caching. Disable TDX support when the TDP MMU or MMIO caching isn't
enabled.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/vmx/main.c | 13 +++++++++++++
2 files changed, 14 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0e0321ad9ca2..b8d6ce02e66d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -104,6 +104,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
* If the hardware supports that we don't need to do shadow paging.
*/
bool tdp_enabled = false;
+EXPORT_SYMBOL_GPL(tdp_enabled);

static bool __ro_after_init tdp_mmu_allowed;

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 076a471d9aea..54df6653193e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -3,6 +3,7 @@

#include "x86_ops.h"
#include "vmx.h"
+#include "mmu.h"
#include "nested.h"
#include "pmu.h"
#include "tdx.h"
@@ -36,6 +37,18 @@ static __init int vt_hardware_setup(void)
if (ret)
return ret;

+ /* TDX requires KVM TDP MMU. */
+ if (enable_tdx && !tdp_enabled) {
+ enable_tdx = false;
+ pr_warn_ratelimited("TDX requires TDP MMU. Please enable TDP MMU for TDX.\n");
+ }
+
+ /* TDX requires MMIO caching. */
+ if (enable_tdx && !enable_mmio_caching) {
+ enable_tdx = false;
+ pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");
+ }
+
enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
if (enable_tdx)
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
--
2.25.1


2024-02-26 09:05:05

by Isaku Yamahata

Subject: [PATCH v19 110/130] KVM: TDX: Handle TDX PV MMIO hypercall

From: Sean Christopherson <[email protected]>

Export kvm_io_bus_read and kvm_mmio tracepoint and wire up TDX PV MMIO
hypercall to the KVM backend functions.

kvm_io_bus_read/write() searches for the in-kernel emulated KVM device at
the given MMIO address and emulates the MMIO access. As TDX PV MMIO also
needs it, export kvm_io_bus_read(); kvm_io_bus_write() is already
exported. TDX PV MMIO emulates some MMIO itself. To emit the trace point
consistently with x86 kvm, export the kvm_mmio tracepoint.
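The argument checks done by the MMIO handler in this patch (size in
{1, 2, 4, 8}, write flag 0 or 1, no page-boundary crossing) can be sketched
standalone. A 4 KiB page size is assumed, and the function name is
illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define PV_MMIO_PAGE_MASK (~0xfffULL)	/* 4 KiB pages assumed */

/*
 * Validation mirroring tdx_emulate_mmio(): size must be 1/2/4/8 bytes,
 * write must be 0 or 1, and the access must not cross a page boundary.
 */
static int pv_mmio_args_valid(int size, int write, uint64_t gpa)
{
	if (size != 1 && size != 2 && size != 4 && size != 8)
		return 0;
	if (write != 0 && write != 1)
		return 0;
	/* XOR of first and last byte address reveals a page crossing. */
	if (((gpa + size - 1) ^ gpa) & PV_MMIO_PAGE_MASK)
		return 0;
	return 1;
}
```

Invalid arguments make the handler return TDVMCALL_INVALID_OPERAND to the
guest rather than exiting to userspace.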

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
virt/kvm/kvm_main.c | 2 +
3 files changed, 117 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 55fc6cc6c816..389bb95d2af0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1217,6 +1217,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
return ret;
}

+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+ unsigned long val = 0;
+ gpa_t gpa;
+ int size;
+
+ KVM_BUG_ON(vcpu->mmio_needed != 1, vcpu->kvm);
+ vcpu->mmio_needed = 0;
+
+ if (!vcpu->mmio_is_write) {
+ gpa = vcpu->mmio_fragments[0].gpa;
+ size = vcpu->mmio_fragments[0].len;
+
+ memcpy(&val, vcpu->run->mmio.data, size);
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ }
+ return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+ unsigned long val)
+{
+ if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+ return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+ unsigned long val;
+
+ if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+ kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+ return -EOPNOTSUPP;
+
+ tdvmcall_set_return_val(vcpu, val);
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+ return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+ struct kvm_memory_slot *slot;
+ int size, write, r;
+ unsigned long val;
+ gpa_t gpa;
+
+ KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
+
+ size = tdvmcall_a0_read(vcpu);
+ write = tdvmcall_a1_read(vcpu);
+ gpa = tdvmcall_a2_read(vcpu);
+ val = write ? tdvmcall_a3_read(vcpu) : 0;
+
+ if (size != 1 && size != 2 && size != 4 && size != 8)
+ goto error;
+ if (write != 0 && write != 1)
+ goto error;
+
+ /* Strip the shared bit, allow MMIO with and without it set. */
+ gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
+
+ if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
+ goto error;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+ if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
+ goto error;
+
+ if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+ trace_kvm_fast_mmio(gpa);
+ return 1;
+ }
+
+ if (write)
+ r = tdx_mmio_write(vcpu, gpa, size, val);
+ else
+ r = tdx_mmio_read(vcpu, gpa, size);
+ if (!r) {
+ /* Kernel completed device emulation. */
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ return 1;
+ }
+
+ /* Request the device emulation to userspace device model. */
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_is_write = write;
+ vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+ vcpu->run->mmio.phys_addr = gpa;
+ vcpu->run->mmio.len = size;
+ vcpu->run->mmio.is_write = write;
+ vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+ if (write) {
+ memcpy(vcpu->run->mmio.data, &val, size);
+ } else {
+ vcpu->mmio_fragments[0].gpa = gpa;
+ vcpu->mmio_fragments[0].len = size;
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+ }
+ return 0;
+
+error:
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1229,6 +1341,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_hlt(vcpu);
case EXIT_REASON_IO_INSTRUCTION:
return tdx_emulate_io(vcpu);
+ case EXIT_REASON_EPT_VIOLATION:
+ return tdx_emulate_mmio(vcpu);
default:
break;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03950368d8db..d5b18cad9dcd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13975,6 +13975,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);

EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e27c22449d85..bc14e1f2610c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2689,6 +2689,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn

return NULL;
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
{
@@ -5992,6 +5993,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
r = __kvm_io_bus_read(vcpu, bus, &range, val);
return r < 0 ? r : 0;
}
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);

int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, struct kvm_io_device *dev)
--
2.25.1


2024-02-26 09:05:59

by Isaku Yamahata

Subject: [PATCH v19 113/130] KVM: TDX: Handle MSR MTRRCap and MTRRDefType access

From: Isaku Yamahata <[email protected]>

Handle the read-only MTRRCap MSR to report all features as unsupported,
and handle the MTRRDefType MSR to accept only E=1, FE=0, type=writeback:
enable MTRR, disable fixed range MTRRs, default memory type=writeback.

TDX virtualizes CPUID to report MTRR support to the guest TD, and TDX
enforces guest CR0.CD=0. If the guest tries to set CR0.CD=1, it results
in #GP, while updating MTRRs requires setting CR0.CD=1 (among other cache
flushing operations). It means the guest TD can't update MTRRs.
Virtualize MTRRs with all features disabled and the default memory type
as writeback.
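The single MTRRDefType value accepted by this patch, E=1 (bit 11), FE=0
(bit 12), and default type writeback (encoding 6), can be sketched as a
standalone check. Macro names here are illustrative; the patch itself
open-codes (1 << 11) | MTRR_TYPE_WRBACK:

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_DEFTYPE_E		(1ULL << 11)	/* MTRR enable */
#define MTRR_DEFTYPE_FE		(1ULL << 12)	/* fixed-range MTRR enable */
#define MTRR_TYPE_WRBACK	6ULL		/* default memory type: WB */

/* Only E=1, FE=0, default type = writeback is accepted by tdx_set_msr(). */
static int mtrr_deftype_allowed(uint64_t val)
{
	return val == (MTRR_DEFTYPE_E | MTRR_TYPE_WRBACK);
}
```

Any other write, including E=0 (which would make all memory UC) or FE=1,
returns 1 from tdx_set_msr() and is rejected.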

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 99 ++++++++++++++++++++++++++++++++++--------
1 file changed, 82 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4c635bfcaf7a..2bddaef495d1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -611,18 +611,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

- /*
- * TDX enforces CR0.CD = 0 and KVM MTRR emulation enforces writeback.
- * TODO: implement MTRR MSR emulation so that
- * MTRRCap: SMRR=0: SMRR interface unsupported
- * WC=0: write combining unsupported
- * FIX=0: Fixed range registers unsupported
- * VCNT=0: number of variable range regitsers = 0
- * MTRRDefType: E=1, FE=0, type=writeback only. Don't allow other value.
- * E=1: enable MTRR
- * FE=0: disable fixed range MTRRs
- * type: default memory type=writeback
- */
+ /* TDX enforces CR0.CD = 0 and KVM MTRR emulation enforces writeback. */
return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
}

@@ -1932,7 +1921,9 @@ bool tdx_has_emulated_msr(u32 index, bool write)
case MSR_IA32_UCODE_REV:
case MSR_IA32_ARCH_CAPABILITIES:
case MSR_IA32_POWER_CTL:
+ case MSR_MTRRcap:
case MSR_IA32_CR_PAT:
+ case MSR_MTRRdefType:
case MSR_IA32_TSC_DEADLINE:
case MSR_IA32_MISC_ENABLE:
case MSR_PLATFORM_INFO:
@@ -1974,16 +1965,47 @@ bool tdx_has_emulated_msr(u32 index, bool write)

int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
- if (tdx_has_emulated_msr(msr->index, false))
- return kvm_get_msr_common(vcpu, msr);
- return 1;
+ switch (msr->index) {
+ case MSR_MTRRcap:
+ /*
+ * Override kvm_mtrr_get_msr() which hardcodes the value.
+ * Report SMRR = 0, WC = 0, FIX = 0 VCNT = 0 to disable MTRR
+ * effectively.
+ */
+ msr->data = 0;
+ return 0;
+ default:
+ if (tdx_has_emulated_msr(msr->index, false))
+ return kvm_get_msr_common(vcpu, msr);
+ return 1;
+ }
}

int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
- if (tdx_has_emulated_msr(msr->index, true))
+ switch (msr->index) {
+ case MSR_MTRRdefType:
+ /*
+ * Allow writeback only for all memory.
+ * Because it's reported that fixed range MTRR isn't supported
+ * and VCNT=0, enforce MTRRDefType.FE = 0 and don't care
+ * variable range MTRRs. Only default memory type matters.
+ *
+ * bit 11 E: MTRR enable/disable
+ * bit 12 FE: Fixed-range MTRRs enable/disable
+ * (E, FE) = (1, 1): enable MTRR and Fixed range MTRR
+ * (E, FE) = (1, 0): enable MTRR, disable Fixed range MTRR
+ * (E, FE) = (0, *): disable all MTRRs. all physical memory
+ * is UC
+ */
+ if (msr->data != ((1 << 11) | MTRR_TYPE_WRBACK))
+ return 1;
return kvm_set_msr_common(vcpu, msr);
- return 1;
+ default:
+ if (tdx_has_emulated_msr(msr->index, true))
+ return kvm_set_msr_common(vcpu, msr);
+ return 1;
+ }
}

static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -2704,6 +2726,45 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
return ret;
}

+static int tdx_vcpu_init_mtrr(struct kvm_vcpu *vcpu)
+{
+ struct msr_data msr;
+ int ret;
+ int i;
+
+ /*
+ * To avoid confusion with reporting VCNT = 0, explicitly disable
+ * variable-range registers.
+ */
+ for (i = 0; i < KVM_NR_VAR_MTRR; i++) {
+ /* physmask */
+ msr = (struct msr_data) {
+ .host_initiated = true,
+ .index = 0x200 + 2 * i + 1,
+ .data = 0, /* valid = 0 to disable. */
+ };
+ ret = kvm_set_msr_common(vcpu, &msr);
+ if (ret)
+ return -EINVAL;
+ }
+
+ /* Set MTRR to use writeback on reset. */
+ msr = (struct msr_data) {
+ .host_initiated = true,
+ .index = MSR_MTRRdefType,
+ /*
+ * Set E(enable MTRR)=1, FE(enable fixed range MTRR)=0, default
+ * type=writeback on reset to avoid UC. Note E=0 means all
+ * memory is UC.
+ */
+ .data = (1 << 11) | MTRR_TYPE_WRBACK,
+ };
+ ret = kvm_set_msr_common(vcpu, &msr);
+ if (ret)
+ return -EINVAL;
+ return 0;
+}
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
{
struct msr_data apic_base_msr;
@@ -2741,6 +2802,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
if (kvm_set_apic_base(vcpu, &apic_base_msr))
return -EINVAL;

+ ret = tdx_vcpu_init_mtrr(vcpu);
+ if (ret)
+ return ret;
+
ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
if (ret)
return ret;
--
2.25.1


2024-02-26 09:06:11

by Isaku Yamahata

Subject: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX

From: Isaku Yamahata <[email protected]>

Implement the set_msr/get_msr/has_emulated_msr methods for TDX to handle
hypercalls from the guest TD for paravirtualized rdmsr and wrmsr. The TDX
module virtualizes MSRs. For some MSRs, it injects #VE into the guest TD
upon RDMSR or WRMSR. The exact list of such MSRs is defined in the spec.

Upon #VE, the guest TD may execute the hypercalls
TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
which are defined in the GHCI (Guest-Host Communication Interface) so
that the host VMM (e.g. KVM) can virtualize the MSRs.

There are three classes of MSR virtualization:
- non-configurable: The TDX module directly virtualizes the MSR. The VMM
  can't configure it; the value set by KVM_SET_MSR_INDEX_LIST is ignored.
- configurable: The TDX module directly virtualizes the MSR. The VMM can
  configure it at VM creation time; the value set by
  KVM_SET_MSR_INDEX_LIST is used.
- #VE case: The guest TD issues TDG.VP.VMCALL<INSTRUCTION.{RDMSR,WRMSR}>
  and the VMM handles the MSR hypercall. The value set by
  KVM_SET_MSR_INDEX_LIST is used.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.c | 70 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 6 ++++
arch/x86/kvm/x86.c | 1 -
arch/x86/kvm/x86.h | 2 ++
5 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c9a40456d965..ed46e7e57c18 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -247,6 +247,42 @@ static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vmx_handle_exit_irqoff(vcpu);
}

+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_set_msr(vcpu, msr_info);
+
+ return vmx_set_msr(vcpu, msr_info);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+ if (kvm && is_td(kvm))
+ return tdx_has_emulated_msr(index, true);
+
+ return vmx_has_emulated_msr(kvm, index);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_get_msr(vcpu, msr_info);
+
+ return vmx_get_msr(vcpu, msr_info);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_msr_filter_changed(vcpu);
+}
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -541,7 +577,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
/* TDX cpu enablement is done by tdx_hardware_setup(). */
.hardware_enable = vmx_hardware_enable,
.hardware_disable = vt_hardware_disable,
- .has_emulated_msr = vmx_has_emulated_msr,
+ .has_emulated_msr = vt_has_emulated_msr,

.is_vm_type_supported = vt_is_vm_type_supported,
.max_vcpus = vt_max_vcpus,
@@ -563,8 +599,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.update_exception_bitmap = vmx_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
+ .get_msr = vt_get_msr,
+ .set_msr = vt_set_msr,
.get_segment_base = vmx_get_segment_base,
.get_segment = vmx_get_segment,
.set_segment = vmx_set_segment,
@@ -674,7 +710,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,

- .msr_filter_changed = vmx_msr_filter_changed,
+ .msr_filter_changed = vt_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 389bb95d2af0..c8f991b69720 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1877,6 +1877,76 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
*error_code = 0;
}

+static bool tdx_is_emulated_kvm_msr(u32 index, bool write)
+{
+ switch (index) {
+ case MSR_KVM_POLL_CONTROL:
+ return true;
+ default:
+ return false;
+ }
+}
+
+bool tdx_has_emulated_msr(u32 index, bool write)
+{
+ switch (index) {
+ case MSR_IA32_UCODE_REV:
+ case MSR_IA32_ARCH_CAPABILITIES:
+ case MSR_IA32_POWER_CTL:
+ case MSR_IA32_CR_PAT:
+ case MSR_IA32_TSC_DEADLINE:
+ case MSR_IA32_MISC_ENABLE:
+ case MSR_PLATFORM_INFO:
+ case MSR_MISC_FEATURES_ENABLES:
+ case MSR_IA32_MCG_CAP:
+ case MSR_IA32_MCG_STATUS:
+ case MSR_IA32_MCG_CTL:
+ case MSR_IA32_MCG_EXT_CTL:
+ case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
+ case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
+ /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
+ return true;
+ case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+ /*
+ * x2APIC registers that are virtualized by the CPU can't be
+ * emulated, KVM doesn't have access to the virtual APIC page.
+ */
+ switch (index) {
+ case X2APIC_MSR(APIC_TASKPRI):
+ case X2APIC_MSR(APIC_PROCPRI):
+ case X2APIC_MSR(APIC_EOI):
+ case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+ case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+ return false;
+ default:
+ return true;
+ }
+ case MSR_IA32_APICBASE:
+ case MSR_EFER:
+ return !write;
+ case 0x4b564d00 ... 0x4b564dff:
+ /* KVM custom MSRs */
+ return tdx_is_emulated_kvm_msr(index, write);
+ default:
+ return false;
+ }
+}
+
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_has_emulated_msr(msr->index, false))
+ return kvm_get_msr_common(vcpu, msr);
+ return 1;
+}
+
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ if (tdx_has_emulated_msr(msr->index, true))
+ return kvm_set_msr_common(vcpu, msr);
+ return 1;
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index a12e3bfc96dd..017a73ab34bb 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -165,6 +165,9 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+bool tdx_has_emulated_msr(u32 index, bool write);
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -214,6 +217,9 @@ static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mo
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
+static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
+static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d5b18cad9dcd..0e1d3853eeb4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -90,7 +90,6 @@
#include "trace.h"

#define MAX_IO_MSRS 256
-#define KVM_MAX_MCE_BANKS 32

struct kvm_caps kvm_caps __read_mostly = {
.supported_mce_cap = MCG_CTL_P | MCG_SER_P,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4e40c23d66ed..c87b7a777b67 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -9,6 +9,8 @@
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"

+#define KVM_MAX_MCE_BANKS 32
+
bool __kvm_is_vm_type_supported(unsigned long type);

struct kvm_caps {
--
2.25.1


2024-02-26 09:06:22

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 114/130] KVM: TDX: Handle IA32_FEAT_CTL and IA32_MCG_EXT_CTL MSRs

From: Isaku Yamahata <[email protected]>

MCE and MCA are advertised via CPUID based on the TDX module spec. The
guest kernel can access IA32_FEAT_CTL to check whether LMCE is enabled by
the platform, and IA32_MCG_EXT_CTL to enable LMCE. Make TDX KVM handle
them. Otherwise, guest MSR access to them with TDG.VP.VMCALL<MSR> on #VE
results in #GP in the guest.

Because LMCE is disabled by default in QEMU, "-cpu lmce=on" must be added
to the QEMU command line to reproduce this.
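
The read side of the IA32_FEAT_CTL handling added below can be sketched
in isolation. The bit definitions are copied from the kernel's
msr-index.h; the helper signature itself is a simplified stand-in for
the tdx_get_msr() case.

```c
#include <assert.h>
#include <stdint.h>

/* Bit definitions as in arch/x86/include/asm/msr-index.h */
#define FEAT_CTL_LOCKED        (1ULL << 0)
#define FEAT_CTL_LMCE_ENABLED  (1ULL << 20)
#define MCG_LMCE_P             (1ULL << 27)

/*
 * Sketch of the IA32_FEAT_CTL read: the MSR always reads back as locked,
 * and the LMCE-enabled bit is reflected from the vCPU's mcg_cap.
 */
static uint64_t tdx_read_feat_ctl(uint64_t mcg_cap)
{
	uint64_t data = FEAT_CTL_LOCKED;

	if (mcg_cap & MCG_LMCE_P)
		data |= FEAT_CTL_LMCE_ENABLED;
	return data;
}
```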

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2bddaef495d1..3481c0b6ef2c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1952,6 +1952,7 @@ bool tdx_has_emulated_msr(u32 index, bool write)
default:
return true;
}
+ case MSR_IA32_FEAT_CTL:
case MSR_IA32_APICBASE:
case MSR_EFER:
return !write;
@@ -1966,6 +1967,20 @@ bool tdx_has_emulated_msr(u32 index, bool write)
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
switch (msr->index) {
+ case MSR_IA32_FEAT_CTL:
+ /*
+ * MCE and MCA are advertised via cpuid. guest kernel could
+ * check if LMCE is enabled or not.
+ */
+ msr->data = FEAT_CTL_LOCKED;
+ if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+ msr->data |= FEAT_CTL_LMCE_ENABLED;
+ return 0;
+ case MSR_IA32_MCG_EXT_CTL:
+ if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
+ return 1;
+ msr->data = vcpu->arch.mcg_ext_ctl;
+ return 0;
case MSR_MTRRcap:
/*
* Override kvm_mtrr_get_msr() which hardcodes the value.
@@ -1984,6 +1999,11 @@ int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
switch (msr->index) {
+ case MSR_IA32_MCG_EXT_CTL:
+ if (!msr->host_initiated && !(vcpu->arch.mcg_cap & MCG_LMCE_P))
+ return 1;
+ vcpu->arch.mcg_ext_ctl = msr->data;
+ return 0;
case MSR_MTRRdefType:
/*
* Allow writeback only for all memory.
--
2.25.1


2024-02-26 09:06:42

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 071/130] KVM: TDX: MTRR: implement get_mt_mask() for TDX

From: Isaku Yamahata <[email protected]>

Because TDX virtualizes cpuid[0x1].EDX[MTRR: bit 12] as a fixed 1, the
guest TD thinks MTRR is supported. Although TDX supports only WB for
private GPA, it's desirable to support MTRR for shared GPA. As guest
access to MTRR MSRs causes #VE and KVM/x86 tracks the values of MTRR MSRs,
the remaining part is to implement the get_mt_mask method for TDX for
shared GPA.
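
The EPT memory-type decision implemented below can be condensed to a
small pure function. The constants match the kernel's definitions
(MTRR_TYPE_* from mtrr_types, VMX_EPT_* from vmx.h); the two-boolean
signature is a simplification of the real method, which takes a vCPU and
GFN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MTRR_TYPE_UNCACHABLE   0
#define MTRR_TYPE_WRBACK       6
#define VMX_EPT_MT_EPTE_SHIFT  3
#define VMX_EPT_IPAT_BIT       (1ULL << 6)

/*
 * Sketch of tdx_get_mt_mask(): MMIO is uncachable; without noncoherent
 * DMA, force WB via the ignore-PAT bit; otherwise return WB and let the
 * guest PAT take effect for shared GPA.
 */
static uint64_t mt_mask_sketch(bool is_mmio, bool noncoherent_dma)
{
	if (is_mmio)
		return (uint64_t)MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
	if (!noncoherent_dma)
		return ((uint64_t)MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
		       VMX_EPT_IPAT_BIT;
	return (uint64_t)MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
}
```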

Suggested-by: Kai Huang <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- typo in the commit message
- Deleted stale paragraph in the commit message
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8c5bac3defdf..c5672909fdae 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -219,6 +219,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}

+static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_mt_mask(vcpu, gfn, is_mmio);
+
+ return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -348,7 +356,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
+ .get_mt_mask = vt_get_mt_mask,

.get_exit_info = vmx_get_exit_info,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 39ef80857b6a..e65fff43cb1b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -393,6 +393,29 @@ int tdx_vm_init(struct kvm *kvm)
return 0;
}

+u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+ if (is_mmio)
+ return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+
+ if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+ return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+
+ /*
+ * TDX enforces CR0.CD = 0 and KVM MTRR emulation enforces writeback.
+ * TODO: implement MTRR MSR emulation so that
+ * MTRRCap: SMRR=0: SMRR interface unsupported
+ * WC=0: write combining unsupported
+ * FIX=0: Fixed range registers unsupported
+ * VCNT=0: number of variable range registers = 0
+ * MTRRDefType: E=1, FE=0, type=writeback only. Don't allow other value.
+ * E=1: enable MTRR
+ * FE=0: disable fixed range MTRRs
+ * type: default memory type=writeback
+ */
+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+}
+
int tdx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index d5f75efd87e6..5335d35bc655 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -150,6 +150,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -178,6 +179,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

--
2.25.1


2024-02-26 09:07:22

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 117/130] KVM: TDX: Silently ignore INIT/SIPI

From: Isaku Yamahata <[email protected]>

The TDX module API doesn't provide an API for the VMM to inject an INIT
IPI or SIPI. Instead, it defines a different protocol to boot application
processors. Ignore INIT and SIPI events for the TDX guest.

There are two options: 1) (silently) ignore the INIT/SIPI request, or
2) somehow return an error to the guest TD. Given that the TDX guest is
paravirtualized to boot APs, option 1 is chosen for simplicity.
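
Option 1 can be modeled with a toy vCPU structure. This is a behavioral
sketch only: the struct and enum are illustrative, standing in for
kvm_vcpu and KVM_MP_STATE_*, and the non-TD branch mirrors
kvm_vcpu_deliver_init() from the patch below.

```c
#include <assert.h>
#include <stdbool.h>

enum mp_state { MP_STATE_RUNNABLE, MP_STATE_INIT_RECEIVED };

struct vcpu {
	bool is_td;
	bool is_bsp;
	enum mp_state mp_state;
	bool was_reset;
};

/*
 * Sketch of vt_vcpu_deliver_init(): for a TD vCPU the INIT event is
 * silently dropped (the vCPU is just marked runnable); for a VMX vCPU
 * the normal reset-and-wait-for-SIPI flow runs.
 */
static void deliver_init(struct vcpu *v)
{
	if (v->is_td) {
		v->mp_state = MP_STATE_RUNNABLE;	/* ignore INIT */
		return;
	}
	v->was_reset = true;				/* kvm_vcpu_reset() */
	v->mp_state = v->is_bsp ? MP_STATE_RUNNABLE : MP_STATE_INIT_RECEIVED;
}
```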

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/lapic.c | 19 +++++++++++-------
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/vmx/main.c | 32 ++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 4 ++--
6 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 22d93d4124c8..85c04aad6ab3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -149,6 +149,7 @@ KVM_X86_OP_OPTIONAL(migrate_timers)
KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP(vcpu_deliver_init)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL_RET0(gmem_max_level)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bb8be091f996..2686c080820b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1836,6 +1836,7 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);

void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+ void (*vcpu_deliver_init)(struct kvm_vcpu *vcpu);

/*
* Returns vCPU specific APICv inhibit reasons
@@ -2092,6 +2093,7 @@ void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
void kvm_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu);

int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
int reason, bool has_error_code, u32 error_code);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 8025c7f614e0..431074679e83 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3268,6 +3268,16 @@ int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
return 0;
}

+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+ kvm_vcpu_reset(vcpu, true);
+ if (kvm_vcpu_is_bsp(vcpu))
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ else
+ vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_deliver_init);
+
int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
@@ -3299,13 +3309,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
return 0;
}

- if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
- kvm_vcpu_reset(vcpu, true);
- if (kvm_vcpu_is_bsp(apic->vcpu))
- vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
- else
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
- }
+ if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events))
+ static_call(kvm_x86_vcpu_deliver_init)(vcpu);
if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
/* evaluate pending_events before reading the vector */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f76dd52d29ba..27546d993809 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5037,6 +5037,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.complete_emulated_msr = svm_complete_emulated_msr,

.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_init = kvm_vcpu_deliver_init,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
};

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4f3b872cd401..84d2dc818cf7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -320,6 +320,14 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_apic_init_signal_blocked(vcpu);
+}
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -348,6 +356,25 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
}

+static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ kvm_vcpu_deliver_sipi_vector(vcpu, vector);
+}
+
+static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ /* TDX doesn't support INIT. Ignore INIT event */
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ return;
+ }
+
+ kvm_vcpu_deliver_init(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -744,13 +771,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
#endif

.check_emulate_instruction = vmx_check_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,

.msr_filter_changed = vt_msr_filter_changed,
.complete_emulated_msr = kvm_complete_insn_gp,

- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_sipi_vector = vt_vcpu_deliver_sipi_vector,
+ .vcpu_deliver_init = vt_vcpu_deliver_init,

.get_untagged_addr = vmx_get_untagged_addr,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d9b36373e7d0..4c7c83105342 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -769,8 +769,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{

- /* Ignore INIT silently because TDX doesn't support INIT event. */
- if (init_event)
+ /* vcpu_deliver_init method silently discards INIT event. */
+ if (KVM_BUG_ON(init_event, vcpu->kvm))
return;
if (KVM_BUG_ON(is_td_vcpu_created(to_tdx(vcpu)), vcpu->kvm))
return;
--
2.25.1


2024-02-26 09:07:52

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

From: Isaku Yamahata <[email protected]>

TDX KVM doesn't support tracking dirty pages yet. Implement a method to
ignore the request. Because the flag to enable dirty logging on a KVM
memory slot isn't accepted for TDX, warn if the method is called for a
TDX guest.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9aaa9bd7abf3..991bfdecaed2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -834,6 +834,14 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return vmx_get_mt_mask(vcpu, gfn, is_mmio);
}

+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_update_cpu_dirty_logging(vcpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -1008,7 +1016,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sched_in = vt_sched_in,

.cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+ .update_cpu_dirty_logging = vt_update_cpu_dirty_logging,

.nested_ops = &vmx_nested_ops,

--
2.25.1


2024-02-26 09:08:00

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 073/130] KVM: x86: Add hooks in kvm_arch_vcpu_memory_mapping()

From: Isaku Yamahata <[email protected]>

In the case of TDX, the memory contents need to be provided for
encryption when populating guest memory before running the guest. Add
hooks in kvm_arch_vcpu_memory_mapping() before/after calling
kvm_mmu_map_tdp_page() to handle KVM_MEMORY_MAPPING. TDX KVM will
implement the hooks.
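
The control flow around the new pre-hook can be sketched standalone. The
function and hook names here are illustrative stand-ins for
kvm_arch_vcpu_memory_mapping() and the pre_memory_mapping op; the key
point is that a populated `source` is rejected unless a backend hook
exists to consume it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*pre_hook_t)(const void *source, uint64_t *error_code,
			  uint8_t *max_level);

/* Sample backend hook: e.g. TDX would encrypt the source page here. */
static int accept_source_pre(const void *source, uint64_t *error_code,
			     uint8_t *max_level)
{
	(void)source; (void)error_code; (void)max_level;
	return 0;
}

/*
 * Sketch of the hook ordering: the pre-hook may adjust error_code and
 * max_level before the TDP map; without a hook, a source page is -EINVAL.
 */
static int memory_mapping(const void *source, pre_hook_t pre)
{
	uint64_t error_code = 1 << 1;	/* PFERR_WRITE_MASK */
	uint8_t max_level = 3;		/* illustrative max hugepage level */
	int r = 0;

	if (pre)
		r = pre(source, &error_code, &max_level);
	else if (source)
		r = -22;		/* -EINVAL */
	if (r)
		return r;
	/* kvm_mmu_map_tdp_page(...) and the post-hook would run here */
	return 0;
}
```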

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- newly added
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/x86.c | 13 ++++++++++++-
3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e1c75f8c1b25..fb3ae97c724e 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -151,6 +151,8 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL_RET0(gmem_max_level)
+KVM_X86_OP_OPTIONAL(pre_memory_mapping);
+KVM_X86_OP_OPTIONAL(post_memory_mapping);

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bc0767c884f7..36694e784c27 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1839,6 +1839,11 @@ struct kvm_x86_ops {

int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
bool is_private, u8 *max_level);
+ int (*pre_memory_mapping)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code, u8 *max_level);
+ void (*post_memory_mapping)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2bd4b7c8fa51..23ece956c816 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5826,10 +5826,21 @@ int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
u8 max_level = KVM_MAX_HUGEPAGE_LEVEL;
u64 error_code = PFERR_WRITE_MASK;
u8 goal_level = PG_LEVEL_4K;
- int r;
+ int r = 0;
+
+ if (kvm_x86_ops.pre_memory_mapping)
+ r = static_call(kvm_x86_pre_memory_mapping)(vcpu, mapping, &error_code, &max_level);
+ else {
+ if (mapping->source)
+ r = -EINVAL;
+ }
+ if (r)
+ return r;

r = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
max_level, &goal_level);
+ if (kvm_x86_ops.post_memory_mapping)
+ static_call(kvm_x86_post_memory_mapping)(vcpu, mapping);
if (r)
return r;

--
2.25.1


2024-02-26 09:08:02

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 121/130] KVM: TDX: Add methods to ignore VMX preemption timer

From: Isaku Yamahata <[email protected]>

TDX doesn't support the VMX-preemption timer. Implement access methods
for the VMM to ignore the VMX-preemption timer.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 991bfdecaed2..ec5c0fda92e9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -842,6 +842,27 @@ static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
vmx_update_cpu_dirty_logging(vcpu);
}

+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
+{
+ /* VMX-preemption timer isn't available for TDX. */
+ if (is_td_vcpu(vcpu))
+ return -EINVAL;
+
+ return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+ /* VMX-preemption timer can't be set. See vt_set_hv_timer(). */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -1024,8 +1045,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.pi_start_assignment = vmx_pi_start_assignment,

#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
+ .set_hv_timer = vt_set_hv_timer,
+ .cancel_hv_timer = vt_cancel_hv_timer,
#endif

.setup_mce = vmx_setup_mce,
--
2.25.1


2024-02-26 09:08:39

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 119/130] KVM: TDX: Add methods to ignore guest instruction emulation

From: Isaku Yamahata <[email protected]>

Because TDX protects TDX guest state from the VMM, instructions in guest
memory cannot be emulated. Implement methods to ignore the guest
instruction emulator.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9fb3f28d8259..9aaa9bd7abf3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -320,6 +320,30 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+static int vt_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
+{
+ if (is_td_vcpu(vcpu))
+ return X86EMUL_RETRY_INSTR;
+
+ return vmx_check_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
+{
+ /*
+ * This call back is triggered by the x86 instruction emulator. TDX
+ * doesn't allow guest memory inspection.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return X86EMUL_UNHANDLEABLE;
+
+ return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -976,7 +1000,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.load_mmu_pgd = vt_load_mmu_pgd,

- .check_intercept = vmx_check_intercept,
+ .check_intercept = vt_check_intercept,
.handle_exit_irqoff = vt_handle_exit_irqoff,

.request_immediate_exit = vt_request_immediate_exit,
@@ -1005,7 +1029,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.enable_smi_window = vt_enable_smi_window,
#endif

- .check_emulate_instruction = vmx_check_emulate_instruction,
+ .check_emulate_instruction = vt_check_emulate_instruction,
.apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,

--
2.25.1


2024-02-26 09:08:58

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 122/130] KVM: TDX: Add methods to ignore accesses to TSC

From: Isaku Yamahata <[email protected]>

TDX protects the TDX guest's TSC state from the VMM. Implement access
methods to ignore guest TSC accesses.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++++++++++++++++++++----
1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ec5c0fda92e9..9fcd71999bba 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -834,6 +834,42 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return vmx_get_mt_mask(vcpu, gfn, is_mmio);
}

+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guest at the moment. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 guest at the moment. */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ /* In TDX, tsc offset can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_offset(vcpu);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ /* In TDX, tsc multiplier can't be changed. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_write_tsc_multiplier(vcpu);
+}
+
static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
{
if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
@@ -1022,10 +1058,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,

- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
+ .get_l2_tsc_offset = vt_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+ .write_tsc_offset = vt_write_tsc_offset,
+ .write_tsc_multiplier = vt_write_tsc_multiplier,

.load_mmu_pgd = vt_load_mmu_pgd,

--
2.25.1


2024-02-26 09:09:28

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 124/130] KVM: TDX: Add a method for TDX to ignore hypercall patching

From: Isaku Yamahata <[email protected]>

Because guest TD memory is protected, the VMM can't patch the guest
binary to insert hypercall instructions. Add a method to ignore hypercall
patching with a warning. Note: the guest TD kernel needs to be modified
to use TDG.VP.VMCALL for hypercalls.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7c47b02d88d8..fae5a3668361 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -731,6 +731,19 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
return vmx_get_interrupt_shadow(vcpu);
}

+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+ unsigned char *hypercall)
+{
+ /*
+ * Because guest memory is protected, the guest can't be patched. The TD
+ * kernel is modified to use TDG.VP.VMCALL for hypercalls.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_patch_hypercall(vcpu, hypercall);
+}
+
static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
if (is_td_vcpu(vcpu))
@@ -1030,7 +1043,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.update_emulated_instruction = vmx_update_emulated_instruction,
.set_interrupt_shadow = vt_set_interrupt_shadow,
.get_interrupt_shadow = vt_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
+ .patch_hypercall = vt_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
.inject_exception = vt_inject_exception,
--
2.25.1


2024-02-26 09:09:50

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 126/130] KVM: TDX: Inhibit APICv for TDX guest

From: Isaku Yamahata <[email protected]>

TDX doesn't support APICv; inhibit APICv for TDX guests. Follow how SEV
does it: define a new inhibit reason for TDX, set it on TD
initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
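
The inhibit-reason mechanics can be sketched with a plain bitmask. This
is a simplified model of KVM's APICv inhibit tracking: the reason value
and helper names are illustrative, but the invariant shown — APICv is
active only while the mask is zero, so setting the TDX reason at TD init
deactivates it for the VM's lifetime — is the one the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>

#define APICV_INHIBIT_REASON_TDX 12	/* illustrative enum value */

static unsigned long apicv_inhibits;	/* per-VM in real KVM */

static void set_apicv_inhibit(int reason)
{
	apicv_inhibits |= 1UL << reason;
}

static bool apicv_activated(void)
{
	return apicv_inhibits == 0;
}
```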

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 9 +++++++++
arch/x86/kvm/vmx/main.c | 3 ++-
arch/x86/kvm/vmx/tdx.c | 4 ++++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2686c080820b..920fb771246b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1300,6 +1300,15 @@ enum kvm_apicv_inhibit {
* mapping between logical ID and vCPU.
*/
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
+
+ /*********************************************************/
+ /* INHIBITs that are relevant only to Intel's APICv.     */
+ /*********************************************************/
+
+ /*
+ * APICv is disabled because TDX doesn't support it.
+ */
+ APICV_INHIBIT_REASON_TDX,
};

struct kvm_arch {
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c46c860be0f2..2cd404fd7176 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -1022,7 +1022,8 @@ static void vt_post_memory_mapping(struct kvm_vcpu *vcpu,
BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
+ BIT(APICV_INHIBIT_REASON_TDX))

struct kvm_x86_ops vt_x86_ops __initdata = {
.name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f706c346eea4..7be1be161dc2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2488,6 +2488,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
goto teardown;
}

+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
+
return 0;

/*
@@ -2821,6 +2823,8 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
return -EIO;
}

+ WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm));
+ vcpu->arch.apic->apicv_active = false;
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
tdx->td_vcpu_created = true;
return 0;
--
2.25.1


2024-02-26 09:10:32

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 129/130] RFC: KVM: x86: Add x86 callback to check cpuid

From: Isaku Yamahata <[email protected]>

The x86 backend should check the consistency of KVM_SET_CPUID2 because it
has its own constraints. Add a callback for it. The backend code will come
as a separate patch.

Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 20 ++++++++++++--------
3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 85c04aad6ab3..3a7140129855 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,8 @@ KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
KVM_X86_OP(has_emulated_msr)
+/* TODO: Once all backend implemented this op, remove _OPTIONAL_RET0. */
+KVM_X86_OP_OPTIONAL_RET0(vcpu_check_cpuid)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP_OPTIONAL(max_vcpus);
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 920fb771246b..e4d40e31fc31 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1638,6 +1638,7 @@ struct kvm_x86_ops {
void (*hardware_unsetup)(void);
int (*offline_cpu)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
+ int (*vcpu_check_cpuid)(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2, int nent);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

bool (*is_vm_type_supported)(unsigned long vm_type);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8cdcd6f406aa..b57006943247 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -136,6 +136,7 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
{
struct kvm_cpuid_entry2 *best;
u64 xfeatures;
+ int r;

/*
* The existing code assumes virtual address is 48-bit or 57-bit in the
@@ -155,15 +156,18 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
* enabling in the FPU, e.g. to expand the guest XSAVE state size.
*/
best = cpuid_entry2_find(entries, nent, 0xd, 0);
- if (!best)
- return 0;
-
- xfeatures = best->eax | ((u64)best->edx << 32);
- xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
- if (!xfeatures)
- return 0;
+ if (best) {
+ xfeatures = best->eax | ((u64)best->edx << 32);
+ xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
+ if (xfeatures) {
+ r = fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu,
+ xfeatures);
+ if (r)
+ return r;
+ }
+ }

- return fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu, xfeatures);
+ return static_call(kvm_x86_vcpu_check_cpuid)(vcpu, entries, nent);
}

/* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
--
2.25.1


2024-02-26 09:10:49

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 127/130] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)

From: Isaku Yamahata <[email protected]>

Add documentation for Intel Trust Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/api.rst | 9 +-
Documentation/virt/kvm/x86/index.rst | 1 +
Documentation/virt/kvm/x86/intel-tdx.rst | 362 +++++++++++++++++++++++
3 files changed, 371 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virt/kvm/x86/intel-tdx.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 667dc58f7d2f..4b70d2b43532 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1394,6 +1394,9 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).

+For TDX guests, deleting/moving a memory region loses the guest memory
+contents. Read-only regions aren't supported. Only as-id 0 is supported.
+
Note: On arm64, a write generated by the page-table walker (to update
the Access and Dirty flags, for example) never results in a
KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
@@ -4734,7 +4737,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.

:Capability: basic
:Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error

@@ -4746,6 +4749,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virt/kvm/x86/amd-memory-encryption.rst.

+This ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/x86/intel-tdx.rst.
+
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..851e99174762 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -11,6 +11,7 @@ KVM for x86 systems
cpuid
errata
hypercalls
+ intel-tdx
mmu
msr
nested-vmx
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
new file mode 100644
index 000000000000..a1b10e99c1ff
--- /dev/null
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -0,0 +1,362 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. For details, see the specifications [1]_, whitepaper [2]_,
+architectural extensions specification [3]_, module documentation [4]_,
+loader interface specification [5]_, guest-hypervisor communication
+interface [6]_, virtual firmware design guide [7]_, and other resources
+([8]_, [9]_, [10]_, [11]_, and [12]_).
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed as a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+ /* Trust Domain eXtension sub-ioctl() commands. */
+ enum kvm_tdx_cmd_id {
+ KVM_TDX_CAPABILITIES = 0,
+ KVM_TDX_INIT_VM,
+ KVM_TDX_INIT_VCPU,
+ KVM_TDX_INIT_MEM_REGION,
+ KVM_TDX_FINALIZE_VM,
+
+ KVM_TDX_CMD_NR_MAX,
+ };
+
+ struct kvm_tdx_cmd {
+ /* enum kvm_tdx_cmd_id */
+ __u32 id;
+ /* flags for sub-command. If sub-command doesn't use this, set zero. */
+ __u32 flags;
+ /*
+ * data for each sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Auxiliary error code. The sub-command may return TDX SEAMCALL
+ * status code in addition to -Exxx.
+ * Defined for consistency with struct kvm_sev_cmd.
+ */
+ __u64 error;
+ /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+ __u64 unused;
+ };
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: vm ioctl
+
+Returns a subset of TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAM
+call, which describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- flags: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_cpuid_config {
+ __u32 leaf;
+ __u32 sub_leaf;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+ };
+
+ struct kvm_tdx_capabilities {
+ __u64 attrs_fixed0;
+ __u64 attrs_fixed1;
+ __u64 xfam_fixed0;
+ __u64 xfam_fixed1;
+ #define TDX_CAP_GPAW_48 (1 << 0)
+ #define TDX_CAP_GPAW_52 (1 << 1)
+ __u32 supported_gpaw;
+ __u32 padding;
+ __u64 reserved[251];
+
+ __u32 nr_cpuid_configs;
+ struct kvm_tdx_cpuid_config cpuid_configs[];
+ };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX, corresponding to the
+TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- flags: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- error: must be 0
+- unused: must be 0
+
+::
+
+ struct kvm_tdx_init_vm {
+ __u64 attributes;
+ __u64 mrconfigid[6]; /* sha384 digest */
+ __u64 mrowner[6]; /* sha384 digest */
+ __u64 mrownerconfig[6]; /* sha384 digest */
+ __u64 reserved[1004]; /* must be zero for future extensibility */
+
+ struct kvm_cpuid2 cpuid;
+ };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX, corresponding to the
+TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- flags: must be 0
+- data: initial value of the guest TD VCPU RCX
+- error: must be 0
+- unused: must be 0
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a contiguous memory region, corresponding to the TDH.MEM.PAGE.ADD
+TDX SEAM call.
+If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement
+which corresponds to TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- flags: flags
+ currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+- error: must be 0
+- unused: must be 0
+
+::
+
+ #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0)
+
+ struct kvm_tdx_init_mem_region {
+ __u64 source_addr;
+ __u64 gpa;
+ __u64 nr_pages;
+ };
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete the measurement of the initial TD contents and mark the TD ready to
+run, which corresponds to TDH.MR.FINALIZE.
+
+- id: KVM_TDX_FINALIZE_VM
+- flags: must be 0
+- data: must be 0
+- error: must be 0
+- unused: must be 0
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, new TDX ioctls need to be called. The
+control flow looks as follows.
+
+#. system wide capability check
+
+ * KVM_CAP_VM_TYPES: check if VM type is supported and if KVM_X86_TDX_VM
+ is supported.
+
+#. creating VM
+
+ * KVM_CREATE_VM
+ * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+ * KVM_ENABLE_CAP_VM(KVM_CAP_MAX_VCPUS): set max_vcpus. KVM_MAX_VCPUS by
+ default. KVM_MAX_VCPUS is not part of the ABI but a kernel-internal
+ constant that is subject to change. Because the maximum vcpu count is
+ part of attestation, it should be set explicitly.
+ * KVM_SET_TSC_KHZ for the vm. Optional.
+ * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+
+ * KVM_CREATE_VCPU
+ * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+ * KVM_SET_CPUID2: Enable CPUID[0x1].ECX.X2APIC(bit 21)=1 so that the
+ following setting of MSR_IA32_APIC_BASE succeeds. Without this,
+ KVM_SET_MSRS(MSR_IA32_APIC_BASE) fails.
+ * KVM_SET_MSRS: Set the initial reset value of MSR_IA32_APIC_BASE to
+ APIC_DEFAULT_ADDRESS(0xfee00000) | XAPIC_ENABLE(bit 10) |
+ X2APIC_ENABLE(bit 11) [| MSR_IA32_APICBASE_BSP(bit 8) optional]
+
+#. initializing guest memory
+
+ * allocate guest memory and initialize pages the same as in the normal KVM
+ case. In the TDX case, additionally parse and load TDVF into guest memory.
+ * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+ Pages that have initial contents need to be added this way;
+ otherwise their contents are lost and the guest sees zero pages.
+ * KVM_TDX_FINALIZE_VM: Finalize the VM and its measurement.
+ This must come after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy (normal VMX) VMs and new TD VMs to
+coexist; otherwise the benefits of VM flexibility would be eliminated.
+The main issue is that the logic of the kvm_x86_ops callbacks for
+TDX differs from that for VMX, while kvm_x86_ops is a single global
+variable, not per-VM or per-vcpu.
+
+Several points to be considered:
+
+ * No or minimal overhead when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+ * Avoid the overhead of indirect calls via function pointers.
+ * Contain the changes under the arch/x86/kvm/vmx directory and share logic
+ with VMX for maintenance.
+ Even though the ways to operate on a VM (VMX instructions vs. TDX
+ SEAMCALLs) are different, the basic idea remains the same, so much of
+ the logic can be shared.
+ * Future maintenance
+ No huge change to kvm_x86_ops is expected in the (near) future, so a
+ centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+
+ Introduce dedicated file for arch/x86/kvm/vmx/main.c (the name,
+ main.c, is just chosen to show main entry points for callbacks.) and
+ wrapper functions around all the callbacks with
+ "if (is-tdx) tdx-callback() else vmx-callback()".
+
+ Pros:
+
+ - No major change in common x86 KVM code. The change is (mostly)
+ contained under arch/x86/kvm/vmx/.
+ - When TDX is disabled(CONFIG_INTEL_TDX_HOST=n), the overhead is
+ optimized out.
+ - Micro optimization by avoiding function pointer.
+
+ Cons:
+
+ - A lot of boilerplate in arch/x86/kvm/vmx/main.c.
+
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle Secure/Shared-EPT. The
+high-level execution flow is mostly the same as in the normal EPT case:
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution (or emulate MMIO).
+The difference is that the S-EPT is operated on (read/written) via
+expensive TDX SEAMCALLs instead of by directly reading/writing EPT entries.
+One bit of the GPA (bit 51 or 47) is repurposed to mean shared with the
+host (if set to 1) or private to the TD (if cleared to 0).
+
+- The current implementation
+
+ * Reuse the existing MMU code with minimal updates, because the
+ execution flow is mostly the same. An additional operation, a TDX
+ call for the S-EPT, is needed, so add hooks for it to kvm_x86_ops.
+ * For performance, minimize the TDX SEAMCALLs operating on the S-EPT. When
+ getting the corresponding S-EPT pages/entry for a faulting GPA, don't
+ use a TDX SEAMCALL to read the S-EPT entry; instead, create a shadow
+ copy in host memory.
+ Repurpose the existing kvm_mmu_page as the shadow copy of the S-EPT and
+ associate the S-EPT with it.
+ * Treat the shared bit as an attribute. Mask/unmask the bit where
+ necessary to keep the existing traversing code working.
+ Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+ for the special case:
+
+ * 0: for the non-TDX case
+ * bit 51 or 47 set: for the TDX case.
+
+ Pros:
+
+ - Large code reuse with minimal new hooks.
+ - Execution path is same.
+
+ Cons:
+
+ - Complicates the existing code.
+ - Repurposing kvm_mmu_page as a shadow of the Secure-EPT can be confusing.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+
+ Although operations for TD VMs aren't necessarily related to memory
+ encryption, define sub operations of KVM_MEMORY_ENCRYPT_OP for TDX specific
+ ioctls.
+
+ Pros:
+
+ - No major change in common x86 KVM code.
+ - Follows the SEV case.
+
+ Cons:
+
+ - The sub operations of KVM_MEMORY_ENCRYPT_OP aren't necessarily memory
+ encryption, but operations on TD VMs.
+
+References
+==========
+
+.. [1] TDX specification
+ https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+ https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+ https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+
+ * kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+ * TDX guest branch: https://github.com/intel/tdx/tree/guest
+
+.. [9] tdvf
+ https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+ Enable Hardware Isolated VMs
+ https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+ Architectural Extensions for Hardware Virtual Machine Isolation
+ to Advance Confidential Computing in Public Clouds - Ravi Sahita
+ & Jun Nakajima, Intel Corporation
+ https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+ https://lore.kernel.org/all/[email protected]/
--
2.25.1


2024-02-26 09:11:08

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 128/130] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU

From: Isaku Yamahata <[email protected]>

Add a high-level design document on the TDX changes to the TDP MMU.

Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/x86/index.rst | 1 +
Documentation/virt/kvm/x86/tdx-tdp-mmu.rst | 443 +++++++++++++++++++++
2 files changed, 444 insertions(+)
create mode 100644 Documentation/virt/kvm/x86/tdx-tdp-mmu.rst

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 851e99174762..63a78bd41b16 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -16,4 +16,5 @@ KVM for x86 systems
msr
nested-vmx
running-nested-guests
+ tdx-tdp-mmu
timekeeping
diff --git a/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst b/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
new file mode 100644
index 000000000000..49d103720272
--- /dev/null
+++ b/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
@@ -0,0 +1,443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Design of TDP MMU for TDX support
+=================================
+This document describes a (high level) design for TDX support of KVM TDP MMU of
+x86 KVM.
+
+In this document, we use "TD" or "guest TD" to differentiate it from the current
+"VM" (Virtual Machine), which is supported by KVM today.
+
+
+Background of TDX
+=================
+TD private memory is designed to hold TD private content, encrypted by the CPU
+using the TD ephemeral key. An encryption engine holds a table of encryption
+keys, and an encryption key is selected for each memory transaction based on a
+Host Key Identifier (HKID). By design, the host VMM does not have access to the
+encryption keys.
+
+In the first generation of MKTME, HKID is "stolen" from the physical address by
+allocating a configurable number of bits from the top of the physical address.
+The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and
+private HKIDs for SEAM-mode-only accesses. We use 0 for the shared HKID on the
+host so that MKTME can be opaque or bypassed on the host.
+
+During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
+as either shared or private, based on the value of a new SHARED bit in the Guest
+Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
+(Extended Page Table) or "Shared EPT" (in this document), which resides in the
+host VMM memory. The Shared EPT is directly managed by the host VMM - the same
+as with the current VMX. Since guest TDs usually require I/O and the data
+exchange needs to be done via shared memory, KVM needs to use the current
+EPT functionality even for TDs.
+
+The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
+pages are encrypted and integrity-protected with the TD's ephemeral private key.
+Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface
+functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT
+because not all functionalities are available.
+
+Since the execution of such interface functions takes much longer than
+accessing memory directly, in KVM we use the existing TDP code to mirror the
+Secure EPT for the TD. And we think there are at least two options today in
+terms of the timing for executing such SEAMCALLs:
+
+1. synchronous, i.e. while walking the TDP page tables, or
+2. post-walk, i.e. record what needs to be done to the real Secure EPT during
+ the walk, and execute SEAMCALLs later.
+
+Option 1 seems more intuitive and simpler, but the Secure EPT
+concurrency rules are different from those of the TDP or EPT. For example,
+TDH.MEM.SEPT.RD() acquires shared access to the whole Secure EPT tree of the target
+
+Secure EPT(SEPT) operations
+---------------------------
+Secure EPT is an Extended Page Table for GPA-to-HPA translation of TD private
+HPA. A Secure EPT is designed to be encrypted with the TD's ephemeral private
+key. SEPT pages are allocated by the host VMM via Intel TDX functions, but their
+content is intended to be hidden and is not architectural.
+
+Unlike the conventional EPT, the CPU can't directly read/write its entries.
+Instead, the TDX SEAMCALL API is used. Several SEAMCALLs correspond to
+operations on EPT entries.
+
+* TDH.MEM.SEPT.ADD():
+
+ Add a secure EPT page to the secure EPT tree. This corresponds to updating
+ a non-leaf EPT entry with the present bit set.
+
+* TDH.MEM.SEPT.REMOVE():
+
+ Remove a secure EPT page from the secure EPT tree. There is no
+ corresponding EPT operation.
+
+* TDH.MEM.SEPT.RD():
+
+ Read a secure EPT entry. This corresponds to reading an EPT entry from
+ memory. Note that this is much slower than a direct memory read.
+
+* TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
+
+ Add a private page to the secure EPT tree. This corresponds to updating the
+ leaf EPT entry with present bit set.
+
+* TDH.MEM.PAGE.REMOVE():
+
+ Remove a private page from the secure EPT tree. There is no corresponding
+ EPT operation.
+
+* TDH.MEM.RANGE.BLOCK():
+
+ This (mostly) corresponds to clearing the present bit of the leaf EPT entry.
+ Note that the private page is still linked in the secure EPT. To remove it
+ from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() needs to
+ be called.
+
+* TDH.MEM.TRACK():
+
+ Increment the TLB epoch counter. This (mostly) corresponds to EPT TLB flush.
+ Note that the private page is still linked in the secure EPT. To remove it
+ from the secure EPT, TDH.MEM.PAGE.REMOVE() needs to be called.
+
+
+Adding private page
+-------------------
+The procedure of populating the private page looks as follows.
+
+1. TDH.MEM.SEPT.ADD(512G level)
+2. TDH.MEM.SEPT.ADD(1G level)
+3. TDH.MEM.SEPT.ADD(2M level)
+4. TDH.MEM.PAGE.AUG(4K level)
+
+Those operations correspond to updating the EPT entries.
+
+Dropping private page and TLB shootdown
+---------------------------------------
+The procedure of dropping the private page looks as follows.
+
+1. TDH.MEM.RANGE.BLOCK(4K level)
+
+ This mostly corresponds to clearing the present bit in the EPT entry. It
+ prevents (or blocks) new TLB entries from being created in the future. Note
+ that the private page is still linked in the secure EPT tree and existing
+ TLB entries aren't flushed.
+
+2. TDH.MEM.TRACK(range) and TLB shootdown
+
+ This mostly corresponds to the EPT TLB shootdown. Because all vcpus share
+ the same Secure EPT, all vcpus need to flush their TLBs.
+
+ * TDH.MEM.TRACK(range) by one vcpu. It increments the global internal TLB
+ epoch counter.
+
+ * send IPI to remote vcpus
+ * Other vcpus exit from the guest TD to the VMM and then re-enter via
+ TDH.VP.ENTER().
+ * TDH.VP.ENTER() checks the TLB epoch counter and, if its TLB is stale,
+ flushes the TLB.
+
+ Note that only a single vcpu issues TDH.MEM.TRACK().
+
+ Note that the private page is still linked in the secure EPT tree, unlike the
+ conventional EPT.
+
+3. TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or
+ TDH.MEM.PAGE.REMOVE()
+
+ There is no corresponding operation to the conventional EPT.
+
+ * When changing the page size (e.g. 4K <-> 2M), TDH.MEM.PAGE.PROMOTE() or
+ TDH.MEM.PAGE.DEMOTE() is used. During those operations, the guest page is
+ kept referenced in the Secure EPT.
+
+ * When migrating a page, TDH.MEM.PAGE.RELOCATE() is used. This requires both
+ a source page and a destination page.
+ * When destroying a TD, TDH.MEM.PAGE.REMOVE() removes the private page from
+ the secure EPT tree. In this case a TLB shootdown is not needed because the
+ vcpus don't run any more.
+
+The basic idea for TDX support
+==============================
+Because the shared EPT is the same as the existing EPT, use the existing
+logic for the shared EPT. The secure EPT, on the other hand, requires
+additional operations instead of direct reads/writes of its EPT entries.
+
+On an EPT violation, the KVM MMU walks down the EPT tree from the root,
+determines the EPT entry to operate on, and updates the entry. If necessary,
+a TLB shootdown is done. Because directly walking the secure EPT via the TDX
+SEAMCALL TDH.MEM.SEPT.RD() is very slow, a mirror of the secure EPT is
+created and maintained. Add hooks to the KVM MMU to reuse the existing code.
+
+EPT violation on shared GPA
+---------------------------
+(1) EPT violation on shared GPA or zapping shared GPA
+ ::
+
+ walk down shared EPT tree (the existing code)
+ |
+ |
+ V
+ shared EPT tree (CPU refers.)
+
+(2) update the EPT entry. (the existing code)
+
+ TLB shootdown in the case of zapping.
+
+
+EPT violation on private GPA
+----------------------------
+(1) EPT violation on private GPA or zapping private GPA
+ ::
+
+ walk down the mirror of secure EPT tree (mostly same as the existing code)
+ |
+ |
+ V
+ mirror of secure EPT tree (KVM MMU software only. reuse of the existing code)
+
+(2) update the (mirrored) EPT entry. (mostly same as the existing code)
+
+(3) call the hooks with information about which EPT entry was changed
+ ::
+
+ |
+ NEW: hooks in KVM MMU
+ |
+ V
+ secure EPT root(CPU refers)
+
+(4) the TDX backend calls necessary TDX SEAMCALLs to update real secure EPT.
+
+The major modification is to add hooks for the TDX backend for additional
+operations, to pass down which EPT (shared or private) is being used, and to
+twist the behavior if we're operating on the private EPT.
+
+The following depicts the relationship.
+::
+
+ KVM | TDX module
+ | | |
+ -------------+---------- | |
+ | | | |
+ V V | |
+ shared GPA private GPA | V
+ CPU shared EPT pointer KVM private EPT pointer | CPU secure EPT pointer
+ | | | |
+ | | | |
+ V V | V
+ shared EPT private EPT<-------mirror----->Secure EPT
+ | | | |
+ | \--------------------+------\ |
+ | | | |
+ V | V V
+ shared guest page | private guest page
+ |
+ |
+ non-encrypted memory | encrypted memory
+ |
+
+shared EPT: CPU and KVM walk with shared GPA
+ Maintained by the existing code
+private EPT: KVM walks with private GPA
+ Maintained by the twisted existing code
+secure EPT: CPU walks with private GPA.
+ Maintained by TDX module with TDX SEAMCALLs via hooks
+
+
+Tracking private EPT page
+=========================
+Shared EPT pages are managed by struct kvm_mmu_page. They are linked in a list
+structure. When necessary, the list is traversed to operate on. Private EPT
+pages have different characteristics. For example, private pages can't be
+swapped out. When shrinking memory, we'd like to traverse only shared EPT pages
+and skip private EPT pages. Likewise, page migration isn't supported for
+private pages (yet). Introduce an additional list to track shared EPT pages and
+track private EPT pages independently.
+
+At the beginning of an EPT violation, the fault handler knows the fault GPA
+and thus which EPT to operate on, private or shared. If it's the private EPT,
+an additional task is done, something like "if (private) { callback a hook }".
+Since the fault handler has deep call chains, it's cumbersome to pass down
+which EPT is being operated on. Options to mitigate this are:
+
+1. Pass the information as an argument for the function call.
+2. Record the information in struct kvm_mmu_page somehow.
+3. Record the information in vcpu structure.
+
+Option 2 was chosen because option 1 requires modifying all the functions,
+which would badly affect the normal case. Option 3 doesn't work well because
+in some cases we need to walk both the private and the shared EPT.
+
+The role of the EPT page can be utilized, and one bit can be carved out from
+unused bits in struct kvm_mmu_page_role. When allocating an EPT page,
+initialize this information. struct kvm_mmu_page is available in most places
+because we're operating on EPT pages.
+
+
+The conversion of private GPA and shared GPA
+============================================
+A page at a given GPA can be assigned as either private or shared at any one
+time, but not both. (This is a restriction of the KVM implementation to avoid
+doubling guest memory usage, not of the TDX architecture.) The GPA can't be
+accessed simultaneously via both the private GPA and the shared GPA. On guest
+startup, all GPAs are assigned as private. The guest converts a range of GPAs
+from private (or shared) to shared (or private) with the MapGPA hypercall,
+which takes the start GPA and the size of the region. If the given start GPA
+is shared (shared bit set), the VMM converts the region into shared (a nop if
+it's already shared).
+
+If the guest TD triggers an EPT violation on an already converted region,
+i.e. an EPT violation on a private (or shared) GPA when the page is shared (or
+private), the access won't be allowed. KVM_EXIT_MEMORY_FAULT is triggered and
+the user space VMM decides how to handle it.
+
+If the guest accesses a private (or shared) GPA after the conversion to shared
+(or private), the following sequence will be observed:
+
+1. MapGPA(shared GPA: shared bit set) hypercall
+2. KVM causes KVM_TDX_EXIT with the hypercall to the user space VMM.
+3. The user space VMM converts the GPA with KVM_SET_MEMORY_ATTRIBUTES(shared).
+4. The user space VMM resumes vcpu execution with KVM_VCPU_RUN
+5. Guest TD accesses private GPA (shared bit cleared)
+6. KVM gets EPT violation on private GPA (shared bit cleared)
+7. KVM finds the GPA was set to shared in the xarray while the faulting GPA
+ is private (shared bit cleared)
+8. KVM_EXIT_MEMORY_FAULT. The user space VMM, e.g. qemu, decides what to do.
+ Typically it requests KVM conversion of the GPA without a MapGPA hypercall.
+9. KVM converts GPA from shared to private with
+ KVM_SET_MEMORY_ATTRIBUTES(private)
+10. Resume vcpu execution
+
+At step 9, the user space VMM may consider such a memory access to be due to
+a race and let the vcpu resume without conversion, expecting another vcpu to
+issue MapGPA. Or it may consider the access suspicious, i.e. the guest trying
+to attack the VMM, and throttle the vcpu's execution as mitigation or finally
+kill the guest. Or it may consider it a bug of the guest TD and kill the
+guest TD.
+
+This sequence is not efficient. The guest TD shouldn't access a private (or
+shared) GPA after converting it to shared (or private). Although KVM can
+handle it, it's sub-optimal and won't be optimized.
+
+The original TDP MMU and race condition
+=======================================
+Because vcpus share the EPT, once an EPT entry is zapped we need a TLB
+shootdown: send an IPI to remote vcpus and have them flush their own TLBs.
+Until the TLB shootdown is done, vcpus may still reference the zapped guest
+page.
+
+The TDP MMU uses the read lock of mmu_lock to mitigate vcpu contention; while
+the read lock is held, it relies on atomic updates of the EPT entry. (The
+legacy MMU, on the other hand, uses the write lock.) When a vcpu is
+populating/zapping an EPT entry with the read lock held, another vcpu may be
+populating or zapping the same entry at the same time.
+
+To avoid this race condition, the entry is frozen: the EPT entry is set to the
+special value REMOVED_SPTE, which clears the present bit. Then, after the TLB
+shootdown, the EPT entry is updated to its final value.
+
+Concurrent zapping
+------------------
+1. read lock
+2. freeze the EPT entry (atomically set the value to REMOVED_SPTE)
+ If another vcpu has already frozen the entry, restart the page fault.
+3. TLB shootdown
+
+ * send IPI to remote vcpus
+ * TLB flush (local and remote)
+
+ For each entry update, TLB shootdown is needed because of the
+ concurrency.
+4. atomically set the EPT entry to the final value
+5. read unlock
+
+Concurrent populating
+---------------------
+In the case of populating the non-present EPT entry, atomically update the EPT
+entry.
+
+1. read lock
+
+2. atomically update the EPT entry
+ If another vcpu has frozen or updated the entry, restart the page fault.
+
+3. read unlock
+
+In the case of updating a present EPT entry (e.g. page migration), the
+operation is split into two steps: zapping the entry and populating it.
+
+1. read lock
+2. zap the EPT entry, following the concurrent zapping case.
+3. populate the non-present EPT entry.
+4. read unlock
+
+Non-concurrent batched zapping
+------------------------------
+In some cases, zapping the ranges is done exclusively with a write lock held.
+In this case, the TLB shootdown is batched into one.
+
+1. write lock
+2. zap the EPT entries by traversing them
+3. TLB shootdown
+4. write unlock
+
+For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored
+EPT entry.
+
+TDX concurrent zapping
+----------------------
+Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
+
+1. read lock
+2. freeze the EPT entry (set the value to REMOVED_SPTE)
+3. TLB shootdown via a hook
+
+ * TLB.MEM.RANGE.BLOCK()
+ * TLB.MEM.TRACK()
+ * send IPI to remote vcpus
+
+4. set the EPT entry to the final value
+5. read unlock
+
+TDX concurrent populating
+-------------------------
+TDX SEAMCALLs are required in addition to operating on the mirrored EPT entry.
+Following the zapping case, the frozen entry is used to avoid the race
+condition. A hook can be added.
+
+1. read lock
+2. freeze the EPT entry
+3. hook
+
+ * TDH_MEM_SEPT_ADD() for non-leaf or TDH_MEM_PAGE_AUG() for leaf.
+
+4. set the EPT entry to the final value
+5. read unlock
+
+Without freezing the entry, the following race can happen. Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
+
+
+TDX non-concurrent batched zapping
+----------------------------------
+For simplicity, the procedure of concurrent populating is utilized. The
+procedure can be optimized later.
+
+
+Co-existing with unmapping guest private memory
+===============================================
+TODO. This needs to be addressed.
+
+
+Restrictions or future work
+===========================
+The following features aren't supported yet.
+
+* optimizing non-concurrent zap
+* Large page
+* Page migration
--
2.25.1


2024-02-26 09:11:12

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

From: Isaku Yamahata <[email protected]>

Implement a hook for KVM_SET_CPUID2 to perform an additional consistency check.

Intel TDX and AMD SEV place restrictions on CPUID values. For example,
some values must be the same across all vcpus. Check that the new values
are consistent with the old values. The check is lightweight because CPUID
consistency is very model specific and complicated. The user space VMM
should set CPUID and MSRs consistently.

Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
---
v18:
- Use TDH.SYS.RD() instead of struct tdsysinfo_struct
---
arch/x86/kvm/vmx/main.c | 10 ++++++
arch/x86/kvm/vmx/tdx.c | 66 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.h | 7 ++++
arch/x86/kvm/vmx/x86_ops.h | 4 +++
4 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2cd404fd7176..1a979ec644d0 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -432,6 +432,15 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
kvm_vcpu_deliver_init(vcpu);
}

+static int vt_vcpu_check_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid_entry2 *e2, int nent)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_vcpu_check_cpuid(vcpu, e2, nent);
+
+ return 0;
+}
+
static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -1125,6 +1134,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {

.get_exit_info = vt_get_exit_info,

+ .vcpu_check_cpuid = vt_vcpu_check_cpuid,
.vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,

.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7be1be161dc2..a71093f7c3e3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -547,6 +547,9 @@ void tdx_vm_free(struct kvm *kvm)

free_page((unsigned long)__va(kvm_tdx->tdr_pa));
kvm_tdx->tdr_pa = 0;
+
+ kfree(kvm_tdx->cpuid);
+ kvm_tdx->cpuid = NULL;
}

static int tdx_do_tdh_mng_key_config(void *param)
@@ -661,6 +664,39 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return 0;
}

+int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2, int nent)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ int i;
+
+ /*
+ * Simple check that new cpuid is consistent with created one.
+ * For simplicity, only trivial check. Don't try comprehensive checks
+ * with the cpuid virtualization table in the TDX module spec.
+ */
+ for (i = 0; i < tdx_info->num_cpuid_config; i++) {
+ const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
+ u32 index = c->sub_leaf == KVM_TDX_CPUID_NO_SUBLEAF ? 0 : c->sub_leaf;
+ const struct kvm_cpuid_entry2 *old =
+ kvm_find_cpuid_entry2(kvm_tdx->cpuid, kvm_tdx->cpuid_nent,
+ c->leaf, index);
+ const struct kvm_cpuid_entry2 *new = kvm_find_cpuid_entry2(e2, nent,
+ c->leaf, index);
+
+ if (!!old != !!new)
+ return -EINVAL;
+ if (!old && !new)
+ continue;
+
+ if ((old->eax ^ new->eax) & c->eax ||
+ (old->ebx ^ new->ebx) & c->ebx ||
+ (old->ecx ^ new->ecx) & c->ecx ||
+ (old->edx ^ new->edx) & c->edx)
+ return -EINVAL;
+ }
+ return 0;
+}
+
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -2205,9 +2241,10 @@ static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
return 0;
}

-static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
+static void setup_tdparams_cpuids(struct kvm *kvm, struct kvm_cpuid2 *cpuid,
struct td_params *td_params)
{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
int i;

/*
@@ -2215,6 +2252,7 @@ static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
* be same to the one of struct tdsysinfo.{num_cpuid_config, cpuid_configs}
* It's assumed that td_params was zeroed.
*/
+ kvm_tdx->cpuid_nent = 0;
for (i = 0; i < tdx_info->num_cpuid_config; i++) {
const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
/* KVM_TDX_CPUID_NO_SUBLEAF means index = 0. */
@@ -2237,6 +2275,10 @@ static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
value->ebx = entry->ebx & c->ebx;
value->ecx = entry->ecx & c->ecx;
value->edx = entry->edx & c->edx;
+
+ /* Remember the setting to check for KVM_SET_CPUID2. */
+ kvm_tdx->cpuid[kvm_tdx->cpuid_nent] = *entry;
+ kvm_tdx->cpuid_nent++;
}
}

@@ -2324,7 +2366,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
ret = setup_tdparams_eptp_controls(cpuid, td_params);
if (ret)
return ret;
- setup_tdparams_cpuids(cpuid, td_params);
+ setup_tdparams_cpuids(kvm, cpuid, td_params);
ret = setup_tdparams_xfam(cpuid, td_params);
if (ret)
return ret;
@@ -2548,11 +2590,18 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
if (cmd->flags)
return -EINVAL;

- init_vm = kzalloc(sizeof(*init_vm) +
- sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
- GFP_KERNEL);
- if (!init_vm)
+ WARN_ON_ONCE(kvm_tdx->cpuid);
+ kvm_tdx->cpuid = kzalloc(flex_array_size(init_vm, cpuid.entries, KVM_MAX_CPUID_ENTRIES),
+ GFP_KERNEL);
+ if (!kvm_tdx->cpuid)
return -ENOMEM;
+
+ init_vm = kzalloc(struct_size(init_vm, cpuid.entries, KVM_MAX_CPUID_ENTRIES),
+ GFP_KERNEL);
+ if (!init_vm) {
+ ret = -ENOMEM;
+ goto out;
+ }
if (copy_from_user(init_vm, (void __user *)cmd->data, sizeof(*init_vm))) {
ret = -EFAULT;
goto out;
@@ -2602,6 +2651,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)

out:
/* kfree() accepts NULL. */
+ if (ret) {
+ kfree(kvm_tdx->cpuid);
+ kvm_tdx->cpuid = NULL;
+ kvm_tdx->cpuid_nent = 0;
+ }
kfree(init_vm);
kfree(td_params);
return ret;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 11c74c34555f..af3a2b8afee8 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -31,6 +31,13 @@ struct kvm_tdx {

u64 tsc_offset;

+ /*
+ * For KVM_SET_CPUID to check consistency. Remember the one passed to
+ * TDH.MNG_INIT
+ */
+ int cpuid_nent;
+ struct kvm_cpuid_entry2 *cpuid;
+
/* For KVM_MEMORY_MAPPING */
struct mutex source_lock;
struct page *source_page;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index c507f1513dac..6b067842a67f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -162,6 +162,8 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
+int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
+ int nent);
void tdx_inject_nmi(struct kvm_vcpu *vcpu);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
@@ -221,6 +223,8 @@ static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)

static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
+static inline int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
+ int nent) { return -EOPNOTSUPP; }
static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
--
2.25.1


2024-02-26 09:12:15

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 079/130] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)

From: Isaku Yamahata <[email protected]>

On entering/exiting a TDX vcpu, the preserved or clobbered CPU state differs
from the VMX case. Add TDX hooks to save/restore host/guest CPU state;
specifically, save/restore the kernel GS base MSR.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 30 +++++++++++++++++++++++++--
arch/x86/kvm/vmx/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 4 ++++
arch/x86/kvm/vmx/x86_ops.h | 4 ++++
4 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d72651ce99ac..8275a242ce07 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -158,6 +158,32 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_vcpu_reset(vcpu, init_event);
}

+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ /*
+ * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+ * guest state of a TD is obviously off limits. Deferring MSRs and DRs
+ * is pointless because the TDX module needs to load *something* so as
+ * not to expose guest state.
+ */
+ if (is_td_vcpu(vcpu)) {
+ tdx_prepare_switch_to_guest(vcpu);
+ return;
+ }
+
+ vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_put(vcpu);
+ return;
+ }
+
+ vmx_vcpu_put(vcpu);
+}
+
static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -326,9 +352,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_free = vt_vcpu_free,
.vcpu_reset = vt_vcpu_reset,

- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .prepare_switch_to_guest = vt_prepare_switch_to_guest,
.vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
+ .vcpu_put = vt_vcpu_put,

.update_exception_bitmap = vmx_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index fdf9196cb592..9616b1aab6ce 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/cpu.h>
+#include <linux/mmu_context.h>

#include <asm/tdx.h>

@@ -423,6 +424,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
int tdx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);

WARN_ON_ONCE(vcpu->arch.cpuid_entries);
WARN_ON_ONCE(vcpu->arch.cpuid_nent);
@@ -446,9 +448,47 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
vcpu->arch.xfd_no_write_intercept = true;

+ tdx->host_state_need_save = true;
+ tdx->host_state_need_restore = false;
+
return 0;
}

+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (!tdx->host_state_need_save)
+ return;
+
+ if (likely(is_64bit_mm(current->mm)))
+ tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+ else
+ tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+ tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ tdx->host_state_need_save = true;
+ if (!tdx->host_state_need_restore)
+ return;
+
+ ++vcpu->stat.host_state_reload;
+
+ wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+ tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ vmx_vcpu_pi_put(vcpu);
+ tdx_prepare_switch_to_host(vcpu);
+}
+
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -569,6 +609,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

tdx_vcpu_enter_exit(tdx);

+ tdx->host_state_need_restore = true;
+
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
trace_kvm_exit(vcpu, KVM_ISA_VMX);

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 81d301fbe638..e96c416e73bf 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -69,6 +69,10 @@ struct vcpu_tdx {

bool initialized;

+ bool host_state_need_save;
+ bool host_state_need_restore;
+ u64 msr_host_kernel_gs_base;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 3e29a6fe28ef..9fd997c79c33 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -151,6 +151,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
void tdx_vcpu_free(struct kvm_vcpu *vcpu);
void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -186,6 +188,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1


2024-02-26 09:19:05

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 088/130] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

From: Isaku Yamahata <[email protected]>

Add a flag, KVM_DEBUGREG_AUTO_SWITCH, to skip saving/restoring guest DRs
irrespective of any other flags. TDX-SEAM unconditionally saves and
restores guest DRs and resets them to the architectural INIT state on TD
exit. So KVM needs to save host DRs before TD entry without restoring
guest DRs, and restore host DRs after TD exit.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Chao Gao <[email protected]>
Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 ++++++++--
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/x86.c | 11 ++++++++---
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3ab85c3d86ee..a9df898c6fbd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -610,8 +610,14 @@ struct kvm_pmu {
struct kvm_pmu_ops;

enum {
- KVM_DEBUGREG_BP_ENABLED = 1,
- KVM_DEBUGREG_WONT_EXIT = 2,
+ KVM_DEBUGREG_BP_ENABLED = BIT(0),
+ KVM_DEBUGREG_WONT_EXIT = BIT(1),
+ /*
+ * Guest debug registers (DR0-3 and DR6) are saved/restored by hardware
+ * on exit from or entry to the guest, so KVM needn't switch them. Because
+ * DR7 is cleared on exit from the guest, DR7 needs to be saved/restored.
+ */
+ KVM_DEBUGREG_AUTO_SWITCH = BIT(2),
};

struct kvm_mtrr_range {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7aa9188f384d..ab7403a19c5d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -586,6 +586,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;

+ vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
vcpu->arch.cr0_guest_owned_bits = -1ul;
vcpu->arch.cr4_guest_owned_bits = -1ul;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b189e86a1f1..fb7597c22f31 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11013,7 +11013,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

- if (unlikely(vcpu->arch.switch_db_regs)) {
+ if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -11059,6 +11059,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+ WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
kvm_update_dr0123(vcpu);
kvm_update_dr7(vcpu);
@@ -11071,8 +11072,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* care about the messed up debug address registers. But if
* we have some of them active, restore the old state.
*/
- if (hw_breakpoint_active())
- hw_breakpoint_restore();
+ if (hw_breakpoint_active()) {
+ if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
+ hw_breakpoint_restore();
+ else
+ set_debugreg(__this_cpu_read(cpu_dr7), 7);
+ }

vcpu->arch.last_vmentry_cpu = vcpu->cpu;
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
--
2.25.1


2024-02-26 09:19:15

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 089/130] KVM: TDX: Add support for find pending IRQ in a protected local APIC

From: Sean Christopherson <[email protected]>

Add a flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts to TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector, and
to determine if a TDX vCPU has a pending interrupt, KVM must check if
there is an outstanding notification.

Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected, to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is
pending, as KVM can't do anything based on _which_ IRQ is pending.

Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.

Note, this doesn't handle interrupts that have been delivered to the vCPU
but not yet recognized by the core, i.e. interrupts that are sitting in
vmcs.GUEST_INTR_STATUS. Querying that state requires a SEAMCALL and will
be supported in a future patch.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/irq.c | 3 +++
arch/x86/kvm/lapic.c | 3 +++
arch/x86/kvm/lapic.h | 2 ++
arch/x86/kvm/vmx/main.c | 10 ++++++++++
arch/x86/kvm/vmx/tdx.c | 6 ++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
8 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index fb3ae97c724e..22d93d4124c8 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -124,6 +124,7 @@ KVM_X86_OP_OPTIONAL(pi_start_assignment)
KVM_X86_OP_OPTIONAL(apicv_pre_state_restore)
KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
+KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt)
KVM_X86_OP_OPTIONAL(set_hv_timer)
KVM_X86_OP_OPTIONAL(cancel_hv_timer)
KVM_X86_OP(setup_mce)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a9df898c6fbd..e0ffef1d377d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1800,6 +1800,7 @@ struct kvm_x86_ops {
void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+ bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);

int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index ad9ca8a60144..f253f4c6bf04 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -100,6 +100,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
if (kvm_cpu_has_extint(v))
return 1;

+ if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+ return static_call(kvm_x86_protected_apic_has_interrupt)(v);
+
return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3242f3da2457..e8034f2f2dd1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2860,6 +2860,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
if (!kvm_apic_present(vcpu))
return -1;

+ if (apic->guest_apic_protected)
+ return -1;
+
__apic_update_ppr(apic, &ppr);
return apic_has_interrupt_for_ppr(apic, ppr);
}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 0a0ea4b5dd8c..749b7b629c47 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -66,6 +66,8 @@ struct kvm_lapic {
bool sw_enabled;
bool irr_pending;
bool lvt0_in_nmi_mode;
+ /* Select registers in the vAPIC cannot be read/written. */
+ bool guest_apic_protected;
/* Number of bits set in ISR. */
s16 isr_count;
/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9b336c1a6508..5fd99e844b86 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -78,6 +78,8 @@ static __init int vt_hardware_setup(void)
if (enable_tdx)
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
sizeof(struct kvm_tdx));
+ else
+ vt_x86_ops.protected_apic_has_interrupt = NULL;

#if IS_ENABLED(CONFIG_HYPERV)
if (enable_tdx)
@@ -219,6 +221,13 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx_vcpu_load(vcpu, cpu);
}

+static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ KVM_BUG_ON(!is_td_vcpu(vcpu), vcpu->kvm);
+
+ return tdx_protected_apic_has_interrupt(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -443,6 +452,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sync_pir_to_irr = vmx_sync_pir_to_irr,
.deliver_interrupt = vmx_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+ .protected_apic_has_interrupt = vt_protected_apic_has_interrupt,

.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ab7403a19c5d..a5b52aa6d153 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -583,6 +583,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return -EINVAL;

fpstate_set_confidential(&vcpu->arch.guest_fpu);
+ vcpu->arch.apic->guest_apic_protected = true;

vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;

@@ -624,6 +625,11 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}

+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ return pi_has_pending_interrupt(vcpu);
+}
+
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 5853f29f0af3..f3f90186926c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -155,6 +155,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -194,6 +195,7 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTP
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
--
2.25.1


2024-02-26 09:21:40

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 094/130] KVM: TDX: Implement methods to inject NMI

From: Isaku Yamahata <[email protected]>

The TDX vcpu control structure defines one bit for a pending NMI; the VMM
injects an NMI by setting the bit, without knowing the TDX vcpu's NMI state.
Because the vcpu state is protected, the VMM can't know the NMI state of a
TDX vcpu. The TDX module handles the actual injection and NMI state
transitions.

Add methods for NMI and treat NMIs as always injectable.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/main.c | 64 +++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx.c | 6 ++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
3 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ee6c04959d4c..6d6d443a2bbd 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -306,6 +306,60 @@ static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
vmx_flush_tlb_guest(vcpu);
}

+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_inject_nmi(vcpu);
+ return;
+ }
+
+ vmx_inject_nmi(vcpu);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ /*
+ * The TDX module manages NMI windows and NMI reinjection, and hides NMI
+ * blocking, all KVM can do is throw an NMI over the wall.
+ */
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Assume NMIs are always unmasked. KVM could query PEND_NMI and treat
+ * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
+ * expensive and the end result is unchanged as the only relevant usage
+ * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
+ * only changes whether KVM or the TDX module drops an NMI.
+ */
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+ /* Refer the comment in vt_get_nmi_mask(). */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_enable_nmi_window(vcpu);
+}
+
static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
@@ -515,14 +569,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.get_interrupt_shadow = vt_get_interrupt_shadow,
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vt_inject_irq,
- .inject_nmi = vmx_inject_nmi,
+ .inject_nmi = vt_inject_nmi,
.inject_exception = vmx_inject_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
+ .nmi_allowed = vt_nmi_allowed,
+ .get_nmi_mask = vt_get_nmi_mask,
+ .set_nmi_mask = vt_set_nmi_mask,
+ .enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1dfa9b503e0d..be21dca47992 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -874,6 +874,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
return EXIT_FASTPATH_NONE;
}

+void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.nmi_injections;
+ td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 12a212e71827..539f3f9686fe 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -159,6 +159,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
+void tdx_inject_nmi(struct kvm_vcpu *vcpu);

int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

@@ -202,6 +203,7 @@ static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)

static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
+static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}

static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

--
2.25.1


2024-02-26 09:27:53

by Isaku Yamahata

Subject: [PATCH v19 108/130] KVM: TDX: Handle TDX PV HLT hypercall

From: Isaku Yamahata <[email protected]>

Wire up TDX PV HLT hypercall to the KVM backend function.
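
[Editor's sketch, not part of the patch: a userspace-only toy model of the
two cooperating pieces, tdx_emulate_hlt() caching whether the guest halted
with interrupts disabled, and the wakeup check using that cache to skip the
expensive TDH.VP.RD seamcall. All names below are illustrative.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct vcpu_tdx; only the fields the sketch needs. */
struct toy_tdx_vcpu {
	bool interrupt_disabled_hlt;	/* cached from TDVMCALL arg a0 */
	bool halted;
};

/* Mirrors the idea of tdx_emulate_hlt(): cache the guest's interrupt
 * state at HLT time for the later wakeup check. */
static int toy_emulate_hlt(struct toy_tdx_vcpu *tdx, uint64_t a0)
{
	/* a0 != 0 means the guest executed HLT with RFLAGS.IF == 0. */
	tdx->interrupt_disabled_hlt = !!a0;
	tdx->halted = true;
	return 1;
}

/* Mirrors the idea of tdx_protected_apic_has_interrupt(): a vcpu halted
 * with interrupts disabled cannot be woken by an interrupt, so the
 * (modeled) seamcall to read VMXIP can be skipped entirely. */
static bool toy_has_wakeup_event(const struct toy_tdx_vcpu *tdx,
				 bool pi_pending, bool vmxip)
{
	if (pi_pending || !tdx->halted)
		return true;
	if (tdx->interrupt_disabled_hlt)
		return false;
	return vmxip;			/* would require TDH.VP.RD */
}
```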

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- move tdvps_state_non_arch_check() to this patch

v18:
- drop buggy_hlt_workaround and use TDH.VP.RD(TD_VCPU_STATE_DETAILS)

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 26 +++++++++++++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 4 ++++
2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index eb68d6c148b6..a2caf2ae838c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -688,7 +688,18 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
- return pi_has_pending_interrupt(vcpu);
+ bool ret = pi_has_pending_interrupt(vcpu);
+ union tdx_vcpu_state_details details;
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+ return true;
+
+ if (tdx->interrupt_disabled_hlt)
+ return false;
+
+ details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
+ return !!details.vmxip;
}

void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -1130,6 +1141,17 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ /* See tdx_protected_apic_has_interrupt() to avoid heavy seamcall */
+ tdx->interrupt_disabled_hlt = tdvmcall_a0_read(vcpu);
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ return kvm_emulate_halt_noskip(vcpu);
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1138,6 +1160,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
switch (tdvmcall_leaf(vcpu)) {
case EXIT_REASON_CPUID:
return tdx_emulate_cpuid(vcpu);
+ case EXIT_REASON_HLT:
+ return tdx_emulate_hlt(vcpu);
default:
break;
}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 4399d474764f..11c74c34555f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -104,6 +104,8 @@ struct vcpu_tdx {
bool host_state_need_restore;
u64 msr_host_kernel_gs_base;

+ bool interrupt_disabled_hlt;
+
/*
* Dummy to make pmu_intel not corrupt memory.
* TODO: Support PMU for TDX. Future work.
@@ -166,6 +168,7 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
}

static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}

#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
@@ -226,6 +229,7 @@ TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);

TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);

static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
{
--
2.25.1


2024-02-26 09:28:30

by Isaku Yamahata

Subject: [PATCH v19 109/130] KVM: TDX: Handle TDX PV port io hypercall

From: Isaku Yamahata <[email protected]>

Wire up TDX PV port IO hypercall to the KVM backend function.
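
[Editor's sketch, not part of the patch: a hypothetical userspace model of
the argument checking and data flow in tdx_emulate_io(). The fake port
latch and all names are illustrative; the real code routes through the
x86 emulator's pio_in_emulated()/pio_out_emulated() callbacks.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SUCCESS	0x0ULL
#define TOY_INVALID	0x8000000000000000ULL	/* stand-in status code */

static uint64_t toy_port_latch[16];	/* tiny fake I/O port space */

/* The size argument must be 1, 2, or 4 bytes; writes take the value from
 * the a3 hypercall argument, reads hand the value back through r11. */
static uint64_t toy_emulate_io(unsigned int port, int size, bool write,
			       uint64_t a3, uint64_t *r11_out)
{
	uint64_t mask;

	if (size != 1 && size != 2 && size != 4)
		return TOY_INVALID;
	mask = (1ULL << (size * 8)) - 1;

	if (write)
		toy_port_latch[port % 16] = a3 & mask;
	else
		*r11_out = toy_port_latch[port % 16] & mask;
	return TOY_SUCCESS;
}
```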

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
v18:
- Fix the out case to set R10 and R11 correctly when user space handled
port out.
---
arch/x86/kvm/vmx/tdx.c | 67 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a2caf2ae838c..55fc6cc6c816 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1152,6 +1152,71 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
return kvm_emulate_halt_noskip(vcpu);
}

+static int tdx_complete_pio_out(struct kvm_vcpu *vcpu)
+{
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, 0);
+ return 1;
+}
+
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ int ret;
+
+ WARN_ON_ONCE(vcpu->arch.pio.count != 1);
+
+ ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+ vcpu->arch.pio.port, &val, 1);
+ WARN_ON_ONCE(!ret);
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, val);
+
+ return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+ unsigned long val = 0;
+ unsigned int port;
+ int size, ret;
+ bool write;
+
+ ++vcpu->stat.io_exits;
+
+ size = tdvmcall_a0_read(vcpu);
+ write = tdvmcall_a1_read(vcpu);
+ port = tdvmcall_a2_read(vcpu);
+
+ if (size != 1 && size != 2 && size != 4) {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ if (write) {
+ val = tdvmcall_a3_read(vcpu);
+ ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+ /* No need for a complete_userspace_io callback. */
+ vcpu->arch.pio.count = 0;
+ } else
+ ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+
+ if (ret)
+ tdvmcall_set_return_val(vcpu, val);
+ else {
+ if (write)
+ vcpu->arch.complete_userspace_io = tdx_complete_pio_out;
+ else
+ vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+ }
+
+ return ret;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1162,6 +1227,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_cpuid(vcpu);
case EXIT_REASON_HLT:
return tdx_emulate_hlt(vcpu);
+ case EXIT_REASON_IO_INSTRUCTION:
+ return tdx_emulate_io(vcpu);
default:
break;
}
--
2.25.1


2024-02-26 09:28:36

by Isaku Yamahata

Subject: [PATCH v19 112/130] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall

From: Isaku Yamahata <[email protected]>

Wire up TDX PV rdmsr/wrmsr hypercall to the KVM backend function.
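
[Editor's sketch, not part of the patch: a hypothetical model of the
control flow in tdx_emulate_rdmsr()/tdx_emulate_wrmsr(). Note how both
gates, the userspace MSR filter and the actual MSR access, collapse into
a single INVALID_OPERAND status toward the guest, matching the "||" in
the patch. All names are illustrative.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SUCCESS	0x0ULL
#define TOY_INVALID	0x8000000000000000ULL	/* stand-in status code */

/* filter_allows models kvm_msr_allowed(); msr_ok models whether
 * kvm_get_msr()/kvm_set_msr() succeeded. */
static uint64_t toy_emulate_rdmsr(bool filter_allows, bool msr_ok,
				  uint64_t msr_value, uint64_t *data_out)
{
	if (!filter_allows || !msr_ok)
		return TOY_INVALID;	/* guest sees one error code */
	*data_out = msr_value;
	return TOY_SUCCESS;
}

static uint64_t toy_emulate_wrmsr(bool filter_allows, bool msr_ok)
{
	if (!filter_allows || !msr_ok)
		return TOY_INVALID;
	return TOY_SUCCESS;
}
```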

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c8f991b69720..4c635bfcaf7a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1329,6 +1329,41 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data;
+
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
+ kvm_get_msr(vcpu, index, &data)) {
+ trace_kvm_msr_read_ex(index);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ return 1;
+ }
+ trace_kvm_msr_read(index, data);
+
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, data);
+ return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+ u32 index = tdvmcall_a0_read(vcpu);
+ u64 data = tdvmcall_a1_read(vcpu);
+
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
+ kvm_set_msr(vcpu, index, data)) {
+ trace_kvm_msr_write_ex(index, data);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ return 1;
+ }
+
+ trace_kvm_msr_write(index, data);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1343,6 +1378,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_io(vcpu);
case EXIT_REASON_EPT_VIOLATION:
return tdx_emulate_mmio(vcpu);
+ case EXIT_REASON_MSR_READ:
+ return tdx_emulate_rdmsr(vcpu);
+ case EXIT_REASON_MSR_WRITE:
+ return tdx_emulate_wrmsr(vcpu);
default:
break;
}
--
2.25.1


2024-02-26 09:29:08

by Isaku Yamahata

Subject: [PATCH v19 115/130] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall

From: Isaku Yamahata <[email protected]>

Implement the TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value
is zero, return a success code and zero in the output registers.

TDG.VP.VMCALL<GetTdVmCallInfo> is a sub-leaf of TDG.VP.VMCALL that
enumerates which TDG.VP.VMCALL sub-leaves are supported. This hypercall
exists for future enhancement of the Guest-Host-Communication Interface
(GHCI) specification. GHCI revision 344426-001US defines it to require
input R12 to be zero and to return zero in output registers R11, R12, R13,
and R14, so that the guest TD enumerates no enhancements.
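
[Editor's sketch, not part of the patch: a toy model of the register ABI
described above. R10 carries the return code and R11-R14 are the output
registers the guest reads back; the status-code values are illustrative.]

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SUCCESS	0x0ULL
#define TOY_INVALID	0x8000000000000000ULL	/* stand-in status code */

struct toy_regs {
	uint64_t r10, r11, r12, r13, r14;
};

/* Mirrors the idea of tdx_get_td_vm_call_info(): only leaf-set 0 (R12 == 0)
 * is defined; for it, report "no enhancements" by zeroing the outputs. */
static void toy_get_td_vm_call_info(struct toy_regs *r)
{
	if (r->r12) {
		r->r10 = TOY_INVALID;
		return;
	}
	r->r10 = TOY_SUCCESS;
	r->r11 = r->r12 = r->r13 = r->r14 = 0;
}
```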

Signed-off-by: Isaku Yamahata <[email protected]>
---
v19:
- rename TDG_VP_VMCALL_GET_TD_VM_CALL_INFO => TDVMCALL_GET_TD_VM_CALL_INFO
---
arch/x86/include/asm/shared/tdx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 28c4a62b7dba..3e8ce567fde0 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -22,6 +22,7 @@
#define TDCS_NOTIFY_ENABLES 0x9100000000000010

/* TDX hypercall Leaf IDs */
+#define TDVMCALL_GET_TD_VM_CALL_INFO 0x10000
#define TDVMCALL_MAP_GPA 0x10001
#define TDVMCALL_GET_QUOTE 0x10002
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3481c0b6ef2c..725cb40d0814 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1353,6 +1353,20 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
+{
+ if (tdvmcall_a0_read(vcpu))
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ else {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ kvm_r11_write(vcpu, 0);
+ tdvmcall_a0_write(vcpu, 0);
+ tdvmcall_a1_write(vcpu, 0);
+ tdvmcall_a2_write(vcpu, 0);
+ }
+ return 1;
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
if (tdvmcall_exit_type(vcpu))
@@ -1371,6 +1385,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return tdx_emulate_rdmsr(vcpu);
case EXIT_REASON_MSR_WRITE:
return tdx_emulate_wrmsr(vcpu);
+ case TDVMCALL_GET_TD_VM_CALL_INFO:
+ return tdx_get_td_vm_call_info(vcpu);
default:
break;
}
--
2.25.1


2024-02-26 09:29:22

by Isaku Yamahata

Subject: [PATCH v19 116/130] KVM: TDX: Silently discard SMI request

From: Isaku Yamahata <[email protected]>

TDX doesn't support system-management mode (SMM) or system-management
interrupts (SMI) in guest TDs. Because guest state (vCPU state, memory
state) is protected, the VMM must go through the TDX module APIs to change
guest state, which would be required to inject an SMI and switch the vCPU
into SMM. However, the TDX module provides no way for the VMM to inject an
SMI into a guest TD, nor to switch the guest vCPU mode into SMM.

KVM has two options when the guest TD or the device model (e.g. QEMU)
raises SMM or SMI: 1) silently ignore the request, or 2) return a
meaningful error.

For simplicity, implement option 1).
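
[Editor's sketch, not part of the patch: a hypothetical model of the
kvm_inject_smi() change. An SMI request is queued only when the backend
claims to emulate MSR_IA32_SMBASE, which the TDX backend does not; the
request is silently dropped otherwise, yet the call still reports success.]

```c
#include <assert.h>
#include <stdbool.h>

struct toy_smi_vcpu {
	bool smi_requested;	/* models KVM_REQ_SMI being pending */
};

/* smbase_emulated models static_call(kvm_x86_has_emulated_msr)(...). */
static int toy_inject_smi(struct toy_smi_vcpu *vcpu, bool smbase_emulated)
{
	if (smbase_emulated)
		vcpu->smi_requested = true;
	return 0;		/* success either way: silent discard */
}
```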

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/smm.h | 7 +++++-
arch/x86/kvm/vmx/main.c | 45 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 29 ++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 12 ++++++++++
4 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..bc77902f5c18 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -142,7 +142,12 @@ union kvm_smram {

static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
{
- kvm_make_request(KVM_REQ_SMI, vcpu);
+ /*
+ * If SMM isn't supported (e.g. TDX), silently discard SMI request.
+ * Assume that SMM supported = MSR_IA32_SMBASE supported.
+ */
+ if (static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
+ kvm_make_request(KVM_REQ_SMI, vcpu);
return 0;
}

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ed46e7e57c18..4f3b872cd401 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -283,6 +283,43 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
vmx_msr_filter_changed(vcpu);
}

+#ifdef CONFIG_KVM_SMM
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_smi_allowed(vcpu, for_injection);
+
+ return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_enter_smm(vcpu, smram);
+
+ return vmx_enter_smm(vcpu, smram);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+ if (unlikely(is_td_vcpu(vcpu)))
+ return tdx_leave_smm(vcpu, smram);
+
+ return vmx_leave_smm(vcpu, smram);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_enable_smi_window(vcpu);
+ return;
+ }
+
+ /* RSM will cause a vmexit anyway. */
+ vmx_enable_smi_window(vcpu);
+}
+#endif
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -700,10 +737,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.setup_mce = vmx_setup_mce,

#ifdef CONFIG_KVM_SMM
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
+ .smi_allowed = vt_smi_allowed,
+ .enter_smm = vt_enter_smm,
+ .leave_smm = vt_leave_smm,
+ .enable_smi_window = vt_enable_smi_window,
#endif

.check_emulate_instruction = vmx_check_emulate_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 725cb40d0814..d9b36373e7d0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2044,6 +2044,35 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
}
}

+#ifdef CONFIG_KVM_SMM
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ /* SMI isn't supported for TDX. */
+ WARN_ON_ONCE(1);
+ return false;
+}
+
+int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+ /* smi_allowed() is always false for TDX as above. */
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+ /* SMI isn't supported for TDX. Silently discard SMI request. */
+ WARN_ON_ONCE(1);
+ vcpu->arch.smi_pending = false;
+}
+#endif
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 017a73ab34bb..7c63b2b48125 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -241,4 +241,16 @@ int tdx_pre_memory_mapping(struct kvm_vcpu *vcpu,
void tdx_post_memory_mapping(struct kvm_vcpu *vcpu, struct kvm_memory_mapping *mapping) {}
#endif

+#if defined(CONFIG_INTEL_TDX_HOST) && defined(CONFIG_KVM_SMM)
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
+#else
+static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
+static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) { return 0; }
+static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) { return 0; }
+static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
+#endif
+
#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.25.1


2024-02-26 09:30:07

by Isaku Yamahata

Subject: [PATCH v19 118/130] KVM: TDX: Add methods to ignore accesses to CPU state

From: Isaku Yamahata <[email protected]>

TDX protects TDX guest state from the VMM. Implement access methods for
TDX guest state that ignore writes and return zero on reads. Because those
methods can be called by KVM ioctls to set/get CPU registers, they don't
use KVM_BUG_ON(), except for one method.
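
[Editor's sketch, not part of the patch: the dispatch pattern repeated
throughout main.c in this patch, reduced to one hypothetical example.
Each vt_* wrapper short-circuits for a TD vCPU, returning a fixed benign
value instead of touching protected state; names are illustrative.]

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the real VMX accessor reading the VMCS. */
static unsigned long toy_vmx_get_rflags(void)
{
	return 0x202;		/* illustrative RFLAGS value */
}

/* Mirrors the shape of vt_get_rflags(): protected TD state is never
 * read, so a TD vCPU simply reports zero. */
static unsigned long toy_vt_get_rflags(bool is_td_vcpu)
{
	if (is_td_vcpu)
		return 0;
	return toy_vmx_get_rflags();
}
```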

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 289 +++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 48 +++++-
arch/x86/kvm/vmx/x86_ops.h | 13 ++
3 files changed, 321 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 84d2dc818cf7..9fb3f28d8259 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -375,6 +375,200 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
kvm_vcpu_deliver_init(vcpu);
}

+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_segment_base(vcpu, seg);
+
+ return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_get_segment(vcpu, var, seg);
+ return;
+ }
+
+ vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_cpl(vcpu);
+
+ return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+ if (is_td_vcpu(vcpu)) {
+ *db = 0;
+ *l = 0;
+ return;
+ }
+
+ vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static bool vt_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_is_valid_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr0(vcpu, cr0);
+}
+
+static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_is_valid_cr4(vcpu, cr4);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu)) {
+ memset(dt, 0, sizeof(*dt));
+ return;
+ }
+
+ vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+ /*
+ * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+ * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+ * reach here for TD vcpu.
+ */
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_cache_reg(vcpu, reg);
+ return;
+ }
+
+ vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_get_rflags(vcpu);
+
+ return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_if_flag(vcpu);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -521,6 +715,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_inject_irq(vcpu, reinjected);
}

+static void vt_inject_exception(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_inject_exception(vcpu);
+}
+
static void vt_cancel_injection(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -567,6 +769,39 @@ static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
}

+
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+ if (is_td(kvm))
+ return 0;
+
+ return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -661,30 +896,30 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.vcpu_load = vt_vcpu_load,
.vcpu_put = vt_vcpu_put,

- .update_exception_bitmap = vmx_update_exception_bitmap,
+ .update_exception_bitmap = vt_update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vt_get_msr,
.set_msr = vt_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .is_valid_cr0 = vmx_is_valid_cr0,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
+ .get_segment_base = vt_get_segment_base,
+ .get_segment = vt_get_segment,
+ .set_segment = vt_set_segment,
+ .get_cpl = vt_get_cpl,
+ .get_cs_db_l_bits = vt_get_cs_db_l_bits,
+ .is_valid_cr0 = vt_is_valid_cr0,
+ .set_cr0 = vt_set_cr0,
+ .is_valid_cr4 = vt_is_valid_cr4,
+ .set_cr4 = vt_set_cr4,
+ .set_efer = vt_set_efer,
+ .get_idt = vt_get_idt,
+ .set_idt = vt_set_idt,
+ .get_gdt = vt_get_gdt,
+ .set_gdt = vt_set_gdt,
+ .set_dr7 = vt_set_dr7,
+ .sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+ .cache_reg = vt_cache_reg,
+ .get_rflags = vt_get_rflags,
+ .set_rflags = vt_set_rflags,
+ .get_if_flag = vt_get_if_flag,

.flush_tlb_all = vt_flush_tlb_all,
.flush_tlb_current = vt_flush_tlb_current,
@@ -701,7 +936,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vt_inject_irq,
.inject_nmi = vt_inject_nmi,
- .inject_exception = vmx_inject_exception,
+ .inject_exception = vt_inject_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
.nmi_allowed = vt_nmi_allowed,
@@ -709,11 +944,11 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_nmi_mask = vt_set_nmi_mask,
.enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
+ .update_cr8_intercept = vt_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .load_eoi_exitmap = vt_load_eoi_exitmap,
.apicv_pre_state_restore = vt_apicv_pre_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
.hwapic_irr_update = vmx_hwapic_irr_update,
@@ -724,13 +959,13 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,

- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
+ .set_tss_addr = vt_set_tss_addr,
+ .set_identity_map_addr = vt_set_identity_map_addr,
.get_mt_mask = vt_get_mt_mask,

.get_exit_info = vt_get_exit_info,

- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+ .vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,

.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4c7c83105342..fc78e9ca61c1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -639,8 +639,15 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)

vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
- vcpu->arch.guest_state_protected =
- !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ /*
+ * TODO: support off-TD debug. If TD DEBUG is enabled, guest state
+ * can be accessed. guest_state_protected = false. and kvm ioctl to
+ * access CPU states should be usable for user space VMM (e.g. qemu).
+ *
+ * vcpu->arch.guest_state_protected =
+ * !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+ */
+ vcpu->arch.guest_state_protected = true;

if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
vcpu->arch.xfd_no_write_intercept = true;
@@ -2073,6 +2080,43 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ kvm_register_mark_available(vcpu, reg);
+ switch (reg) {
+ case VCPU_REGS_RSP:
+ case VCPU_REGS_RIP:
+ case VCPU_EXREG_PDPTR:
+ case VCPU_EXREG_CR0:
+ case VCPU_EXREG_CR3:
+ case VCPU_EXREG_CR4:
+ break;
+ default:
+ KVM_BUG_ON(1, vcpu->kvm);
+ break;
+ }
+}
+
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ return 0;
+}
+
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+ memset(var, 0, sizeof(*var));
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7c63b2b48125..727c4d418601 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -169,6 +169,12 @@ bool tdx_has_emulated_msr(u32 index, bool write);
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);

+int tdx_get_cpl(struct kvm_vcpu *vcpu);
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

void tdx_flush_tlb(struct kvm_vcpu *vcpu);
@@ -221,6 +227,13 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }

+static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
+static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0; }
+static inline void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+ int seg) {}
+
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
--
2.25.1


2024-02-26 09:32:03

by Isaku Yamahata

Subject: [PATCH v19 125/130] KVM: TDX: Add methods to ignore virtual apic related operation

From: Isaku Yamahata <[email protected]>

TDX protects the TDX guest's APIC state from the VMM. Implement access
methods for TDX guest vAPIC state that ignore writes or return zero.
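
[Editor's sketch, not part of the patch: a toy model of the constraint
behind tdx_set_virtual_apic_mode(), which only sanity-checks that the
vCPU is in x2APIC mode since that is the only mode a TD supports. The
enum and function are hypothetical.]

```c
#include <assert.h>
#include <stdbool.h>

enum toy_apic_mode {
	TOY_APIC_DISABLED,
	TOY_APIC_XAPIC,
	TOY_APIC_X2APIC,
};

/* A TD's vAPIC is managed by the TDX module and is always x2APIC; any
 * other mode would indicate a KVM bug (the real code WARNs on it). */
static bool toy_td_apic_mode_valid(enum toy_apic_mode mode)
{
	return mode == TOY_APIC_X2APIC;
}
```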

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 61 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 6 ++++
arch/x86/kvm/vmx/x86_ops.h | 3 ++
3 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index fae5a3668361..c46c860be0f2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -352,6 +352,14 @@ static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
return vmx_apic_init_signal_blocked(vcpu);
}

+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return tdx_set_virtual_apic_mode(vcpu);
+
+ return vmx_set_virtual_apic_mode(vcpu);
+}
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -360,6 +368,31 @@ static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
memset(pi->pir, 0, sizeof(pi->pir));
}

+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(int max_isr)
+{
+ if (is_td_vcpu(kvm_get_running_vcpu()))
+ return;
+
+ return vmx_hwapic_isr_update(max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ /* TDX doesn't support L2 at the moment. */
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return false;
+
+ return vmx_guest_apic_has_interrupt(vcpu);
+}
+
static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -815,6 +848,22 @@ static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
vmx_update_cr8_intercept(vcpu, tpr, irr);
}

+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
+ vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (is_td_vcpu(vcpu))
@@ -1055,15 +1104,15 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vt_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .set_virtual_apic_mode = vt_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vt_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vt_load_eoi_exitmap,
.apicv_pre_state_restore = vt_apicv_pre_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .hwapic_irr_update = vt_hwapic_irr_update,
+ .hwapic_isr_update = vt_hwapic_isr_update,
+ .guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
.sync_pir_to_irr = vt_sync_pir_to_irr,
.deliver_interrupt = vt_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index fc78e9ca61c1..f706c346eea4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2080,6 +2080,12 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif

+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ /* Only x2APIC mode is supported for TD. */
+ WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
int tdx_get_cpl(struct kvm_vcpu *vcpu)
{
return 0;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 727c4d418601..c507f1513dac 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -168,6 +168,7 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
bool tdx_has_emulated_msr(u32 index, bool write);
int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);

int tdx_get_cpl(struct kvm_vcpu *vcpu);
void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
@@ -227,6 +228,8 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }

+static inline void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+
static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
--
2.25.1


2024-02-26 12:34:57

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 092/130] KVM: TDX: Implement interrupt injection

From: Isaku Yamahata <[email protected]>

TDX supports interrupt injection into a vCPU via posted interrupts. Wire up
the corresponding KVM x86 operations to the posted-interrupt path. Move
kvm_vcpu_trigger_posted_interrupt() from vmx.c to common.h so the code can
be shared.

VMX can inject an interrupt by setting the interrupt information field,
VM_ENTRY_INTR_INFO_FIELD, of the VMCS. TDX supports interrupt injection
only via posted interrupts, so ignore the execution paths that access
VM_ENTRY_INTR_INFO_FIELD.

Because CPU state is protected and APICv is enabled for the TDX guest, the
VMM can inject an interrupt by updating the posted-interrupt descriptor.
Treat interrupts as always injectable.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/common.h | 71 ++++++++++++++++++++++++++
arch/x86/kvm/vmx/main.c | 93 ++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/posted_intr.c | 2 +-
arch/x86/kvm/vmx/posted_intr.h | 2 +
arch/x86/kvm/vmx/tdx.c | 25 +++++++++
arch/x86/kvm/vmx/vmx.c | 67 +-----------------------
arch/x86/kvm/vmx/x86_ops.h | 7 ++-
7 files changed, 190 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 235908f3e044..6f21d0d48809 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,6 +4,7 @@

#include <linux/kvm_host.h>

+#include "posted_intr.h"
#include "mmu.h"

static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -30,4 +31,74 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+ int pi_vec)
+{
+#ifdef CONFIG_SMP
+ if (vcpu->mode == IN_GUEST_MODE) {
+ /*
+ * The vector of the virtual interrupt has already been set in the PIR.
+ * Send a notification event to deliver the virtual interrupt
+ * unless the vCPU is the currently running vCPU, i.e. the
+ * event is being sent from a fastpath VM-Exit handler, in
+ * which case the PIR will be synced to the vIRR before
+ * re-entering the guest.
+ *
+ * When the target is not the running vCPU, the following
+ * possibilities emerge:
+ *
+ * Case 1: vCPU stays in non-root mode. Sending a notification
+ * event posts the interrupt to the vCPU.
+ *
+ * Case 2: vCPU exits to root mode and is still runnable. The
+ * PIR will be synced to the vIRR before re-entering the guest.
+ * Sending a notification event is ok as the host IRQ handler
+ * will ignore the spurious event.
+ *
+ * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
+ * has already synced PIR to vIRR and never blocks the vCPU if
+ * the vIRR is not empty. Therefore, a blocked vCPU here does
+ * not wait for any requested interrupts in PIR, and sending a
+ * notification event also results in a benign, spurious event.
+ */
+
+ if (vcpu != kvm_get_running_vcpu())
+ __apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
+ return;
+ }
+#endif
+ /*
+ * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+ * otherwise do nothing as KVM will grab the highest priority pending
+ * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+ */
+ kvm_vcpu_wake_up(vcpu);
+}
+
+/*
+ * Send an interrupt to a vcpu via posted interrupt.
+ * 1. If the target vcpu is running (non-root mode), send a posted interrupt
+ * notification and the hardware will sync PIR to vIRR atomically.
+ * 2. If the target vcpu isn't running (root mode), kick it to pick up the
+ * interrupt from PIR on the next VM entry.
+ */
+static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
+ struct pi_desc *pi_desc, int vector)
+{
+ if (pi_test_and_set_pir(vector, pi_desc))
+ return;
+
+ /* If a previous notification has sent the IPI, nothing to do. */
+ if (pi_test_and_set_on(pi_desc))
+ return;
+
+ /*
+ * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+ * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+ * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+ * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+ */
+ kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+}
+
#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5fd99e844b86..f2c9d6358f9e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -228,6 +228,34 @@ static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
return tdx_protected_apic_has_interrupt(vcpu);
}

+static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
+{
+ struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
+
+ pi_clear_on(pi);
+ memset(pi->pir, 0, sizeof(pi->pir));
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return -1;
+
+ return vmx_sync_pir_to_irr(vcpu);
+}
+
+static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
+{
+ if (is_td_vcpu(apic->vcpu)) {
+ tdx_deliver_interrupt(apic, delivery_mode, trig_mode,
+ vector);
+ return;
+ }
+
+ vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -297,6 +325,53 @@ static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
vmx_sched_in(vcpu, cpu);
}

+static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+ vmx_set_interrupt_shadow(vcpu, mask);
+}
+
+static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_interrupt_shadow(vcpu);
+}
+
+static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_inject_irq(vcpu, reinjected);
+}
+
+static void vt_cancel_injection(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_cancel_injection(vcpu);
+}
+
+static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_interrupt_allowed(vcpu, for_injection);
+}
+
+static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_enable_irq_window(vcpu);
+}
+
static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
if (is_td_vcpu(vcpu))
@@ -426,31 +501,31 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.handle_exit = vmx_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .set_interrupt_shadow = vt_set_interrupt_shadow,
+ .get_interrupt_shadow = vt_get_interrupt_shadow,
.patch_hypercall = vmx_patch_hypercall,
- .inject_irq = vmx_inject_irq,
+ .inject_irq = vt_inject_irq,
.inject_nmi = vmx_inject_nmi,
.inject_exception = vmx_inject_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
+ .cancel_injection = vt_cancel_injection,
+ .interrupt_allowed = vt_interrupt_allowed,
.nmi_allowed = vmx_nmi_allowed,
.get_nmi_mask = vmx_get_nmi_mask,
.set_nmi_mask = vmx_set_nmi_mask,
.enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
+ .enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_pre_state_restore = vmx_apicv_pre_state_restore,
+ .apicv_pre_state_restore = vt_apicv_pre_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
.hwapic_irr_update = vmx_hwapic_irr_update,
.hwapic_isr_update = vmx_hwapic_isr_update,
.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
+ .sync_pir_to_irr = vt_sync_pir_to_irr,
+ .deliver_interrupt = vt_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index b66add9da0f3..c86768b83f0b 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -52,7 +52,7 @@ static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
return (struct vcpu_pi *)vcpu;
}

-static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
return &vcpu_to_pi(vcpu)->pi_desc;
}
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 2fe8222308b2..0f9983b6910b 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -105,6 +105,8 @@ struct vcpu_pi {
/* Until here common layout betwwn vcpu_vmx and vcpu_tdx. */
};

+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu);
+
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1da58c36217c..1dfa9b503e0d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,6 +7,7 @@

#include "capabilities.h"
#include "x86_ops.h"
+#include "common.h"
#include "mmu.h"
#include "tdx_arch.h"
#include "tdx.h"
@@ -603,6 +604,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
tdx->host_state_need_save = true;
tdx->host_state_need_restore = false;

+ tdx->pi_desc.nv = POSTED_INTR_VECTOR;
+ tdx->pi_desc.sn = 1;
+
return 0;
}

@@ -610,6 +614,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);

+ vmx_vcpu_pi_load(vcpu, cpu);
if (vcpu->cpu == cpu)
return;

@@ -848,6 +853,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)

trace_kvm_entry(vcpu);

+ if (pi_test_on(&tdx->pi_desc)) {
+ apic->send_IPI_self(POSTED_INTR_VECTOR);
+
+ kvm_wait_lapic_expire(vcpu);
+ }
+
tdx_vcpu_enter_exit(tdx);

tdx_user_return_update_cache(vcpu);
@@ -1213,6 +1224,16 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
}

+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
+{
+ struct kvm_vcpu *vcpu = apic->vcpu;
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ /* TDX supports only posted interrupt. No lapic emulation. */
+ __vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
+}
+
static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx_capabilities __user *user_caps;
@@ -1972,6 +1993,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
if (ret)
return ret;

+ td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
+ td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
+ td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
+
tdx->initialized = true;
return 0;
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 162bb134aae6..1349ec438837 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4157,50 +4157,6 @@ void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
pt_update_intercept_for_msr(vcpu);
}

-static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
- int pi_vec)
-{
-#ifdef CONFIG_SMP
- if (vcpu->mode == IN_GUEST_MODE) {
- /*
- * The vector of the virtual has already been set in the PIR.
- * Send a notification event to deliver the virtual interrupt
- * unless the vCPU is the currently running vCPU, i.e. the
- * event is being sent from a fastpath VM-Exit handler, in
- * which case the PIR will be synced to the vIRR before
- * re-entering the guest.
- *
- * When the target is not the running vCPU, the following
- * possibilities emerge:
- *
- * Case 1: vCPU stays in non-root mode. Sending a notification
- * event posts the interrupt to the vCPU.
- *
- * Case 2: vCPU exits to root mode and is still runnable. The
- * PIR will be synced to the vIRR before re-entering the guest.
- * Sending a notification event is ok as the host IRQ handler
- * will ignore the spurious event.
- *
- * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
- * has already synced PIR to vIRR and never blocks the vCPU if
- * the vIRR is not empty. Therefore, a blocked vCPU here does
- * not wait for any requested interrupts in PIR, and sending a
- * notification event also results in a benign, spurious event.
- */
-
- if (vcpu != kvm_get_running_vcpu())
- __apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
- return;
- }
-#endif
- /*
- * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
- * otherwise do nothing as KVM will grab the highest priority pending
- * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
- */
- kvm_vcpu_wake_up(vcpu);
-}
-
static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
int vector)
{
@@ -4253,20 +4209,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
if (!vcpu->arch.apic->apicv_active)
return -1;

- if (pi_test_and_set_pir(vector, &vmx->pi_desc))
- return 0;
-
- /* If a previous notification has sent the IPI, nothing to do. */
- if (pi_test_and_set_on(&vmx->pi_desc))
- return 0;
-
- /*
- * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
- * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
- * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
- * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
- */
- kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+ __vmx_deliver_posted_interrupt(vcpu, &vmx->pi_desc, vector);
return 0;
}

@@ -6946,14 +6889,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}

-void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
-
- pi_clear_on(&vmx->pi_desc);
- memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
-}
-
void vmx_do_interrupt_irqoff(unsigned long entry);
void vmx_do_nmi_irqoff(void);

diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f3f90186926c..12a212e71827 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -58,7 +58,6 @@ int vmx_check_intercept(struct kvm_vcpu *vcpu,
bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
void vmx_migrate_timers(struct kvm_vcpu *vcpu);
void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
-void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu);
bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
void vmx_hwapic_isr_update(int max_isr);
@@ -158,6 +157,9 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
+
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

void tdx_flush_tlb(struct kvm_vcpu *vcpu);
@@ -198,6 +200,9 @@ static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }

+static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector) {}
+
static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
--
2.25.1


2024-02-26 12:45:18

by Isaku Yamahata

[permalink] [raw]
Subject: [PATCH v19 103/130] KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI

From: Isaku Yamahata <[email protected]>

When BIOS eMCA MCE-SMI morphing is enabled, the #MC is morphed into an MSMI
(Machine Check System Management Interrupt). The SMI then causes a TD exit
to KVM with the exit reason EXIT_REASON_OTHER_SMI and the MSMI bit set in
the exit qualification, instead of EXIT_REASON_EXCEPTION_NMI with an #MC
exception.

Handle EXIT_REASON_OTHER_SMI with the MSMI bit set in the exit
qualification as an MCE (Machine Check Exception) that occurred while the
TD guest was running.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/tdx_arch.h | 2 ++
2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bdd74682b474..117c2315f087 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -916,6 +916,30 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
tdexit_intr_info(vcpu));
else if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
+ else if (unlikely(tdx->exit_reason.non_recoverable ||
+ tdx->exit_reason.error)) {
+ /*
+ * The only reason it gets EXIT_REASON_OTHER_SMI is that an
+ * MSMI (Machine Check System Management Interrupt) occurred with
+ * exit_qualification bit 0 set in the TD guest.
+ * The MSMI is delivered right after SEAMCALL returns,
+ * and an #MC is delivered to the host kernel after the SMI handler
+ * returns.
+ *
+ * The #MC right after SEAMCALL is fixed up and skipped in the #MC
+ * handler because it's an #MC that happened in the TD guest and
+ * cannot be handled in the host's context.
+ *
+ * Call KVM's machine check handler explicitly here.
+ */
+ if (tdx->exit_reason.basic == EXIT_REASON_OTHER_SMI) {
+ unsigned long exit_qual;
+
+ exit_qual = tdexit_exit_qual(vcpu);
+ if (exit_qual & TD_EXIT_OTHER_SMI_IS_MSMI)
+ kvm_machine_check();
+ }
+ }
}

static int tdx_handle_exception(struct kvm_vcpu *vcpu)
@@ -1381,6 +1405,11 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
exit_reason.full, exit_reason.basic,
to_kvm_tdx(vcpu->kvm)->hkid,
set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
+
+ /*
+ * tdx_handle_exit_irqoff() handled EXIT_REASON_OTHER_SMI. It
+ * must be handled before enabling preemption because it's #MC.
+ */
goto unhandled_exit;
}

@@ -1419,9 +1448,14 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_ept_misconfig(vcpu);
case EXIT_REASON_OTHER_SMI:
/*
- * If reach here, it's not a Machine Check System Management
- * Interrupt(MSMI). #SMI is delivered and handled right after
- * SEAMRET, nothing needs to be done in KVM.
+ * Unlike VMX, all SMIs in SEAM non-root mode (i.e. when the
+ * TD guest vcpu is running) cause a TD exit to the TDX module,
+ * then SEAMRET to KVM. Once it exits to KVM, the SMI is delivered
+ * and handled right away.
+ *
+ * - If it's a Machine Check System Management Interrupt
+ * (MSMI), it's handled above due to the non_recoverable bit set.
+ * - If it's not an MSMI, nothing needs to be done here.
*/
return 1;
default:
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index efc3c61c14ab..87ef22e9cd49 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -42,6 +42,8 @@
#define TDH_VP_WR 43
#define TDH_SYS_LP_SHUTDOWN 44

+#define TD_EXIT_OTHER_SMI_IS_MSMI BIT(1)
+
/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
#define TDX_NON_ARCH BIT_ULL(63)
#define TDX_CLASS_SHIFT 56
--
2.25.1


2024-02-26 19:37:21

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 028/130] KVM: TDX: Add TDX "architectural" error codes

On Mon, Feb 26, 2024 at 12:25:30AM -0800,
[email protected] wrote:

> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index fdfd41511b02..28c4a62b7dba 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -26,7 +26,13 @@
> #define TDVMCALL_GET_QUOTE 0x10002
> #define TDVMCALL_REPORT_FATAL_ERROR 0x10003
>
> -#define TDVMCALL_STATUS_RETRY 1

Oops, I accidentally removed this constant, breaking the TDX guest build.

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index ef1c8e5a2944..1367a5941499 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -28,6 +28,8 @@
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003
#define TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004

+#define TDVMCALL_STATUS_RETRY 1
+
/*
* TDG.VP.VMCALL Status Codes (returned in R10)
*/
--
2.25.1
--
Isaku Yamahata <[email protected]>

2024-02-27 08:52:37

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> As TDX will use posted_interrupt.c, the use of struct vcpu_vmx is a
> blocker. Because the members of

Extra "of"

> struct pi_desc pi_desc and struct
> list_head pi_wakeup_list are only used in posted_interrupt.c, introduce
> common structure, struct vcpu_pi, make vcpu_vmx and vcpu_tdx has same
> layout in the top of structure.
>
> To minimize the diff size, avoid code conversion like,
> vmx->pi_desc => vmx->common->pi_desc. Instead add compile time check
> if the layout is expected.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 41 ++++++++++++++++++++++++++--------
> arch/x86/kvm/vmx/posted_intr.h | 11 +++++++++
> arch/x86/kvm/vmx/tdx.c | 1 +
> arch/x86/kvm/vmx/tdx.h | 8 +++++++
> arch/x86/kvm/vmx/vmx.h | 14 +++++++-----
> 5 files changed, 60 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index af662312fd07..b66add9da0f3 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -11,6 +11,7 @@
> #include "posted_intr.h"
> #include "trace.h"
> #include "vmx.h"
> +#include "tdx.h"
>
> /*
> * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
> @@ -31,9 +32,29 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
> */
> static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>
> +/*
> + * The layout of the head of struct vcpu_vmx and struct vcpu_tdx must match with
> + * struct vcpu_pi.
> + */
> +static_assert(offsetof(struct vcpu_pi, pi_desc) ==
> + offsetof(struct vcpu_vmx, pi_desc));
> +static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
> + offsetof(struct vcpu_vmx, pi_wakeup_list));
> +#ifdef CONFIG_INTEL_TDX_HOST
> +static_assert(offsetof(struct vcpu_pi, pi_desc) ==
> + offsetof(struct vcpu_tdx, pi_desc));
> +static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
> + offsetof(struct vcpu_tdx, pi_wakeup_list));
> +#endif
> +
> +static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
> +{
> + return (struct vcpu_pi *)vcpu;
> +}
> +
> static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> {
> - return &(to_vmx(vcpu)->pi_desc);
> + return &vcpu_to_pi(vcpu)->pi_desc;
> }
>
> static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
> @@ -52,8 +73,8 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
>
> void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> {
> - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> - struct vcpu_vmx *vmx = to_vmx(vcpu);
> + struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
> + struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
> struct pi_desc old, new;
> unsigned long flags;
> unsigned int dest;
> @@ -90,7 +111,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> */
> if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> - list_del(&vmx->pi_wakeup_list);
> + list_del(&vcpu_pi->pi_wakeup_list);
> raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> }
>
> @@ -145,15 +166,15 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
> */
> static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> {
> - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> - struct vcpu_vmx *vmx = to_vmx(vcpu);
> + struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
> + struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
> struct pi_desc old, new;
> unsigned long flags;
>
> local_irq_save(flags);
>
> raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> - list_add_tail(&vmx->pi_wakeup_list,
> + list_add_tail(&vcpu_pi->pi_wakeup_list,
> &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
>
> @@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
> * notification vector is switched to the one that calls
> * back to the pi_wakeup_handler() function.
> */
> - return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
> + return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
> + vmx_can_use_vtd_pi(vcpu->kvm);
> }
>
> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> @@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> if (!vmx_needs_pi_wakeup(vcpu))
> return;
>
> - if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> + if (kvm_vcpu_is_blocking(vcpu) &&
> + (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
> pi_enable_wakeup_handler(vcpu);
>
> /*
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index 26992076552e..2fe8222308b2 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -94,6 +94,17 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
> (unsigned long *)&pi_desc->control);
> }
>
> +struct vcpu_pi {
> + struct kvm_vcpu vcpu;
> +
> + /* Posted interrupt descriptor */
> + struct pi_desc pi_desc;
> +
> + /* Used if this vCPU is waiting for PI notification wakeup. */
> + struct list_head pi_wakeup_list;
> + /* Until here common layout betwwn vcpu_vmx and vcpu_tdx. */

s/betwwn/between

Also, pi_wakeup_handler() is still using struct vcpu_vmx, but the vCPU
could be a vcpu_tdx. Functionally it is OK; however, since you have added
vcpu_pi, should pi_wakeup_handler() use vcpu_pi instead of vcpu_vmx?

> +};
> +
> void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
> void pi_wakeup_handler(void);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a5b52aa6d153..1da58c36217c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -584,6 +584,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>
> fpstate_set_confidential(&vcpu->arch.guest_fpu);
> vcpu->arch.apic->guest_apic_protected = true;
> + INIT_LIST_HEAD(&tdx->pi_wakeup_list);
>
> vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 7f8c78f06508..eaffa7384725 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -4,6 +4,7 @@
>
> #ifdef CONFIG_INTEL_TDX_HOST
>
> +#include "posted_intr.h"
> #include "pmu_intel.h"
> #include "tdx_ops.h"
>
> @@ -69,6 +70,13 @@ union tdx_exit_reason {
> struct vcpu_tdx {
> struct kvm_vcpu vcpu;
>
> + /* Posted interrupt descriptor */
> + struct pi_desc pi_desc;
> +
> + /* Used if this vCPU is waiting for PI notification wakeup. */
> + struct list_head pi_wakeup_list;
> + /* Until here same layout to struct vcpu_pi. */
> +
> unsigned long tdvpr_pa;
> unsigned long *tdvpx_pa;
> bool td_vcpu_created;
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 79ff54f08fee..634a9a250b95 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -235,6 +235,14 @@ struct nested_vmx {
>
> struct vcpu_vmx {
> struct kvm_vcpu vcpu;
> +
> + /* Posted interrupt descriptor */
> + struct pi_desc pi_desc;
> +
> + /* Used if this vCPU is waiting for PI notification wakeup. */
> + struct list_head pi_wakeup_list;
> + /* Until here same layout to struct vcpu_pi. */
> +
> u8 fail;
> u8 x2apic_msr_bitmap_mode;
>
> @@ -304,12 +312,6 @@ struct vcpu_vmx {
>
> union vmx_exit_reason exit_reason;
>
> - /* Posted interrupt descriptor */
> - struct pi_desc pi_desc;
> -
> - /* Used if this vCPU is waiting for PI notification wakeup. */
> - struct list_head pi_wakeup_list;
> -
> /* Support for a guest hypervisor (nested VMX) */
> struct nested_vmx nested;
>


2024-02-28 23:00:55

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> This patch fixes the following warnings.
>
> In file included from arch/x86/kernel/asm-offsets.c:22:
> arch/x86/include/asm/tdx.h:92:87: warning: shift count >= width of type [-Wshift-count-overflow]
> arch/x86/include/asm/tdx.h:20:21: note: expanded from macro 'TDX_ERROR'
> #define TDX_ERROR _BITUL(63)
>
> ^~~~~~~~~~
>
> Also consistently use ULL for TDX_SEAMCALL_VMFAILINVALID.
>
> Fixes: 527a534c7326 ("x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers")

+Kirill.

This kinda fix should be sent out as a separate patch.

> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/tdx.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 16be3a1e4916..1e9dcdf9912b 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -17,9 +17,9 @@
> * Bits 47:40 == 0xFF indicate Reserved status code class that never used by
> * TDX module.
> */
> -#define TDX_ERROR _BITUL(63)
> +#define TDX_ERROR _BITULL(63)
> #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
> -#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
> +#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _ULL(0xFFFF0000))
>
> #define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
> #define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)

Both TDX guest and TDX host code depend on X86_64 in the Kconfig. This
issue seems to be due to asm-offsets.c including <asm/tdx.h>
unconditionally.

It doesn't make sense to generate any TDX related code in asm-offsets.h,
so I am wondering whether it is better to make the inclusion of
<asm/tdx.h> conditional, or to move it to asm-offsets_64.c?


Kirill what's your opinion?

Btw after quick try seems I cannot reproduce this (w/o this KVM TDX
patchset). Isaku, could you share your .config?

2024-02-26 09:31:16

by Isaku Yamahata

Subject: [PATCH v19 123/130] KVM: TDX: Ignore setting up mce

From: Isaku Yamahata <[email protected]>

The vmx_setup_mce() function is VMX specific and cannot be used for TDX.
Add a vt stub to ignore setting up MCE for TDX.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/main.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9fcd71999bba..7c47b02d88d8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -899,6 +899,14 @@ static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
}
#endif

+static void vt_setup_mce(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_setup_mce(vcpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -1085,7 +1093,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.cancel_hv_timer = vt_cancel_hv_timer,
#endif

- .setup_mce = vmx_setup_mce,
+ .setup_mce = vt_setup_mce,

#ifdef CONFIG_KVM_SMM
.smi_allowed = vt_smi_allowed,
--
2.25.1


2024-03-01 07:55:38

by Yan Zhao

Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> + */
> +#define TDX_MAX_VCPUS (~(u16)0)
This value will be treated as -1 in tdx_vm_init(),
"kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);"

This will lead to kvm->max_vcpus being -1 by default.
Is this by design or just an error?
If it's by design, why not set kvm->max_vcpus = -1 in tdx_vm_init() directly?
If it's an unexpected error, maybe the below is better?

#define TDX_MAX_VCPUS (int)((u16)(~0UL))
or
#define TDX_MAX_VCPUS 65536


2024-03-01 16:26:17

by Kirill A. Shutemov

Subject: Re: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow

On Thu, Feb 29, 2024 at 11:49:13AM +1300, Huang, Kai wrote:
>
>
> On 26/02/2024 9:25 pm, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > This patch fixes the following warnings.
> >
> > In file included from arch/x86/kernel/asm-offsets.c:22:
> > arch/x86/include/asm/tdx.h:92:87: warning: shift count >= width of type [-Wshift-count-overflow]
> > arch/x86/include/asm/tdx.h:20:21: note: expanded from macro 'TDX_ERROR'
> > #define TDX_ERROR _BITUL(63)
> >
> > ^~~~~~~~~~
> >

I think you trimmed the warning message. I don't see the actual user of the
define. The define itself will not generate the warning; you need to actually
use it outside of the preprocessor. I don't understand who would use it in
32-bit code. Maybe fixing it this way is masking another issue.

That said, I don't object to the change itself. We just need to understand
the context more.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-04 07:41:08

by Chenyi Qiang

Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> On EPT violation, call a common function, __vmx_handle_ept_violation() to
> trigger x86 MMU code. On EPT misconfiguration, exit to ring 3 with
> KVM_EXIT_UNKNOWN. because EPT misconfiguration can't happen as MMIO is
> trigged by TDG.VP.VMCALL. No point to set a misconfiguration value for the

s/trigged/triggered

> fast path.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v14 -> v15:
> - use PFERR_GUEST_ENC_MASK to tell the fault is private
>
> Signed-off-by: Isaku Yamahata <[email protected]>

duplicated SOB

> ---
> arch/x86/kvm/vmx/common.h | 3 +++
> arch/x86/kvm/vmx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 632af7a76d0a..027aa4175d2c 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -87,6 +87,9 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
> PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>
> + if (kvm_is_private_gpa(vcpu->kvm, gpa))
> + error_code |= PFERR_GUEST_ENC_MASK;
> +
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 2f68e6f2b53a..0db80fa020d2 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1285,6 +1285,51 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> __vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
> }
>
> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> +{
> + unsigned long exit_qual;
> +
> + if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
> + /*
> + * Always treat SEPT violations as write faults. Ignore the
> + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
> + * TD private pages are always RWX in the SEPT tables,
> + * i.e. they're always mapped writable. Just as importantly,
> + * treating SEPT violations as write faults is necessary to
> + * avoid COW allocations, which will cause TDAUGPAGE failures
> + * due to aliasing a single HPA to multiple GPAs.
> + */
> +#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
> + exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
> + } else {
> + exit_qual = tdexit_exit_qual(vcpu);
> + if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
> + pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
> + tdexit_gpa(vcpu), kvm_rip_read(vcpu));
> + vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
> + vcpu->run->ex.exception = PF_VECTOR;
> + vcpu->run->ex.error_code = exit_qual;
> + return 0;
> + }
> + }
> +
> + trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> + return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> +}
> +
> +static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
> +{
> + WARN_ON_ONCE(1);
> +
> + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> + vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
> + vcpu->run->internal.ndata = 2;
> + vcpu->run->internal.data[0] = EXIT_REASON_EPT_MISCONFIG;
> + vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
> +
> + return 0;
> +}
> +
> int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> {
> union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> @@ -1345,6 +1390,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>
> switch (exit_reason.basic) {
> + case EXIT_REASON_EPT_VIOLATION:
> + return tdx_handle_ept_violation(vcpu);
> + case EXIT_REASON_EPT_MISCONFIG:
> + return tdx_handle_ept_misconfig(vcpu);
> case EXIT_REASON_OTHER_SMI:
> /*
> * If reach here, it's not a Machine Check System Management

2024-03-05 08:13:26

by Isaku Yamahata

Subject: Re: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow

On Fri, Mar 01, 2024 at 01:36:43PM +0200,
"Kirill A. Shutemov" <[email protected]> wrote:

> On Thu, Feb 29, 2024 at 11:49:13AM +1300, Huang, Kai wrote:
> >
> >
> > On 26/02/2024 9:25 pm, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > This patch fixes the following warnings.
> > >
> > > In file included from arch/x86/kernel/asm-offsets.c:22:
> > > arch/x86/include/asm/tdx.h:92:87: warning: shift count >= width of type [-Wshift-count-overflow]
> > > arch/x86/include/asm/tdx.h:20:21: note: expanded from macro 'TDX_ERROR'
> > > #define TDX_ERROR _BITUL(63)
> > >
> > > ^~~~~~~~~~
> > >
>
> I think you trim the warning message. I don't see the actual user of the
> define. Define itself will not generate the warning. You need to actually
> use it outside of preprocessor. I don't understand who would use it in
> 32-bit code. Maybe fixing it this way masking other issue.
>
> That said, I don't object the change itself. We just need to understand
> the context more.

v18 used it in a stub function. v19 dropped the stub as it was not needed.
--
Isaku Yamahata <[email protected]>

2024-03-05 08:22:10

by Isaku Yamahata

Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Fri, Mar 01, 2024 at 03:25:31PM +0800,
Yan Zhao <[email protected]> wrote:

> > + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> > + */
> > +#define TDX_MAX_VCPUS (~(u16)0)
> This value will be treated as -1 in tdx_vm_init(),
> "kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);"
>
> This will lead to kvm->max_vcpus being -1 by default.
> Is this by design or just an error?
> If it's by design, why not set kvm->max_vcpus = -1 in tdx_vm_init() directly.
> If an unexpected error, may below is better?
>
> #define TDX_MAX_VCPUS (int)((u16)(~0UL))
> or
> #define TDX_MAX_VCPUS 65536

You're right. I'll use ((int)U16_MAX).
As TDX 1.5 introduced the metadata field MAX_VCPUS_PER_TD, I'll update the code
to get the value and trim it further, something like the following:

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index f964d99f8701..31205f84d594 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
/* TODO: Once all backend implemented this op, remove _OPTIONAL_RET0. */
KVM_X86_OP_OPTIONAL_RET0(vcpu_check_cpuid)
KVM_X86_OP(vcpu_after_set_cpuid)
-KVM_X86_OP_OPTIONAL(max_vcpus);
KVM_X86_OP_OPTIONAL(vm_enable_cap)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6dd78230c9d4..deb59e94990f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -13,17 +13,6 @@
static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);

-static int vt_max_vcpus(struct kvm *kvm)
-{
- if (!kvm)
- return KVM_MAX_VCPUS;
-
- if (is_td(kvm))
- return min(kvm->max_vcpus, TDX_MAX_VCPUS);
-
- return kvm->max_vcpus;
-}
-
#if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_INTEL_TDX_HOST)
static int vt_flush_remote_tlbs(struct kvm *kvm);
static int vt_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages);
@@ -1130,7 +1119,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_disable = vt_hardware_disable,
.has_emulated_msr = vt_has_emulated_msr,

- .max_vcpus = vt_max_vcpus,
.vm_size = sizeof(struct kvm_vmx),
.vm_enable_cap = vt_vm_enable_cap,
.vm_init = vt_vm_init,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7cca6a33ad97..a8cfb4f214a6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -63,6 +63,8 @@ struct tdx_info {
u8 nr_tdcs_pages;
u8 nr_tdvpx_pages;

+ u16 max_vcpus_per_td;
+
/*
* The number of WBINVD domains. 0 means that wbinvd domain is cpu
* package.
@@ -100,7 +102,8 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
if (cap->flags || cap->args[0] == 0)
return -EINVAL;
if (cap->args[0] > KVM_MAX_VCPUS ||
- cap->args[0] > TDX_MAX_VCPUS)
+ cap->args[0] > TDX_MAX_VCPUS ||
+ cap->args[0] > tdx_info->max_vcpus_per_td)
return -E2BIG;

mutex_lock(&kvm->lock);
@@ -729,7 +732,8 @@ int tdx_vm_init(struct kvm *kvm)
* TDX has its own limit of the number of vcpus in addition to
* KVM_MAX_VCPUS.
*/
- kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
+ kvm->max_vcpus = min3(kvm->max_vcpus, tdx_info->max_vcpus_per_td,
+ TDX_MAX_VCPUS);

mutex_init(&to_kvm_tdx(kvm)->source_lock);
return 0;
@@ -4667,6 +4671,7 @@ static int __init tdx_module_setup(void)
TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
+ TDX_INFO_MAP(MAX_VCPUS_PER_TD, max_vcpus_per_td),
};

ret = tdx_enable();
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index b10dad8f46bb..711855be6c83 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -144,7 +144,7 @@ struct tdx_cpuid_value {
/*
* TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
*/
-#define TDX_MAX_VCPUS (~(u16)0)
+#define TDX_MAX_VCPUS ((int)U16_MAX)

struct td_params {
u64 attributes;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d097024fd974..6822a50e1d5d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4733,8 +4733,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
break;
case KVM_CAP_MAX_VCPUS:
r = KVM_MAX_VCPUS;
- if (kvm_x86_ops.max_vcpus)
- r = static_call(kvm_x86_max_vcpus)(kvm);
+ if (kvm)
+ r = kvm->max_vcpus;
break;
case KVM_CAP_MAX_VCPU_ID:
r = KVM_MAX_VCPU_IDS;

--
Isaku Yamahata <[email protected]>

2024-03-05 08:35:52

by Isaku Yamahata

Subject: Re: [PATCH v19 091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c

On Tue, Feb 27, 2024 at 04:52:01PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > As TDX will use posted_interrupt.c, the use of struct vcpu_vmx is a
> > blocker. Because the members of
>
> Extra "of"
>
> > struct pi_desc pi_desc and struct
> > list_head pi_wakeup_list are only used in posted_interrupt.c, introduce
> > common structure, struct vcpu_pi, make vcpu_vmx and vcpu_tdx has same
> > layout in the top of structure.
> >
> > To minimize the diff size, avoid code conversion like,
> > vmx->pi_desc => vmx->common->pi_desc. Instead add compile time check
> > if the layout is expected.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/posted_intr.c | 41 ++++++++++++++++++++++++++--------
> > arch/x86/kvm/vmx/posted_intr.h | 11 +++++++++
> > arch/x86/kvm/vmx/tdx.c | 1 +
> > arch/x86/kvm/vmx/tdx.h | 8 +++++++
> > arch/x86/kvm/vmx/vmx.h | 14 +++++++-----
> > 5 files changed, 60 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > index af662312fd07..b66add9da0f3 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.c
> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > @@ -11,6 +11,7 @@
> > #include "posted_intr.h"
> > #include "trace.h"
> > #include "vmx.h"
> > +#include "tdx.h"
> > /*
> > * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
> > @@ -31,9 +32,29 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
> > */
> > static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
> > +/*
> > + * The layout of the head of struct vcpu_vmx and struct vcpu_tdx must match with
> > + * struct vcpu_pi.
> > + */
> > +static_assert(offsetof(struct vcpu_pi, pi_desc) ==
> > + offsetof(struct vcpu_vmx, pi_desc));
> > +static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
> > + offsetof(struct vcpu_vmx, pi_wakeup_list));
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +static_assert(offsetof(struct vcpu_pi, pi_desc) ==
> > + offsetof(struct vcpu_tdx, pi_desc));
> > +static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
> > + offsetof(struct vcpu_tdx, pi_wakeup_list));
> > +#endif
> > +
> > +static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
> > +{
> > + return (struct vcpu_pi *)vcpu;
> > +}
> > +
> > static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> > {
> > - return &(to_vmx(vcpu)->pi_desc);
> > + return &vcpu_to_pi(vcpu)->pi_desc;
> > }
> > static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
> > @@ -52,8 +73,8 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
> > void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> > {
> > - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > - struct vcpu_vmx *vmx = to_vmx(vcpu);
> > + struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
> > + struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
> > struct pi_desc old, new;
> > unsigned long flags;
> > unsigned int dest;
> > @@ -90,7 +111,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> > */
> > if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> > raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> > - list_del(&vmx->pi_wakeup_list);
> > + list_del(&vcpu_pi->pi_wakeup_list);
> > raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> > }
> > @@ -145,15 +166,15 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
> > */
> > static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> > {
> > - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > - struct vcpu_vmx *vmx = to_vmx(vcpu);
> > + struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
> > + struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
> > struct pi_desc old, new;
> > unsigned long flags;
> > local_irq_save(flags);
> > raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> > - list_add_tail(&vmx->pi_wakeup_list,
> > + list_add_tail(&vcpu_pi->pi_wakeup_list,
> > &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> > raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> > @@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
> > * notification vector is switched to the one that calls
> > * back to the pi_wakeup_handler() function.
> > */
> > - return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
> > + return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
> > + vmx_can_use_vtd_pi(vcpu->kvm);
> > }
> > void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> > @@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> > if (!vmx_needs_pi_wakeup(vcpu))
> > return;
> > - if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> > + if (kvm_vcpu_is_blocking(vcpu) &&
> > + (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
> > pi_enable_wakeup_handler(vcpu);
> > /*
> > diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> > index 26992076552e..2fe8222308b2 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.h
> > +++ b/arch/x86/kvm/vmx/posted_intr.h
> > @@ -94,6 +94,17 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
> > (unsigned long *)&pi_desc->control);
> > }
> > +struct vcpu_pi {
> > + struct kvm_vcpu vcpu;
> > +
> > + /* Posted interrupt descriptor */
> > + struct pi_desc pi_desc;
> > +
> > + /* Used if this vCPU is waiting for PI notification wakeup. */
> > + struct list_head pi_wakeup_list;
> > + /* Until here common layout betwwn vcpu_vmx and vcpu_tdx. */
>
> s/betwwn/between
>
> Also, in pi_wakeup_handler(), it is still using struct vcpu_vmx, but it
> could be vcpu_tdx.
> Functionally it is OK, however, since you have added vcpu_pi, should it use
> vcpu_pi instead of vcpu_vmx in pi_wakeup_handler()?

Makes sense.

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index b66add9da0f3..5b71aef931dc 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -243,13 +243,13 @@ void pi_wakeup_handler(void)
int cpu = smp_processor_id();
struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
- struct vcpu_vmx *vmx;
+ struct vcpu_pi *pi;

raw_spin_lock(spinlock);
- list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
+ list_for_each_entry(pi, wakeup_list, pi_wakeup_list) {

- if (pi_test_on(&vmx->pi_desc))
- kvm_vcpu_wake_up(&vmx->vcpu);
+ if (pi_test_on(&pi->pi_desc))
+ kvm_vcpu_wake_up(&pi->vcpu);
}

--
Isaku Yamahata <[email protected]>

2024-03-05 21:36:38

by Huang, Kai

Subject: Re: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow



On 5/03/2024 9:12 pm, Isaku Yamahata wrote:
> On Fri, Mar 01, 2024 at 01:36:43PM +0200,
> "Kirill A. Shutemov" <[email protected]> wrote:
>
>> On Thu, Feb 29, 2024 at 11:49:13AM +1300, Huang, Kai wrote:
>>>
>>>
>>> On 26/02/2024 9:25 pm, [email protected] wrote:
>>>> From: Isaku Yamahata <[email protected]>
>>>>
>>>> This patch fixes the following warnings.
>>>>
>>>> In file included from arch/x86/kernel/asm-offsets.c:22:
>>>> arch/x86/include/asm/tdx.h:92:87: warning: shift count >= width of type [-Wshift-count-overflow]
>>>> arch/x86/include/asm/tdx.h:20:21: note: expanded from macro 'TDX_ERROR'
>>>> #define TDX_ERROR _BITUL(63)
>>>>
>>>> ^~~~~~~~~~
>>>>
>>
>> I think you trim the warning message. I don't see the actual user of the
>> define. Define itself will not generate the warning. You need to actually
>> use it outside of preprocessor. I don't understand who would use it in
>> 32-bit code. Maybe fixing it this way masking other issue.
>>
>> That said, I don't object the change itself. We just need to understand
>> the context more.
>
> v18 used it as stub function. v19 dropped it as the stub was not needed.

Sorry I literally don't understand what you are talking about here.

Please just clarify (at least):

- Does this problem exist in upstream code?
- If it does, what is the root cause, and how to reproduce?

2024-03-06 11:41:46

by Yi Sun

Subject: Re: [PATCH v19 005/130] x86/virt/tdx: Export global metadata read infrastructure

On 26.02.2024 00:25, [email protected] wrote:
>From: Kai Huang <[email protected]>
>
>KVM will need to read a bunch of non-TDMR related metadata to create and
>run TDX guests. Export the metadata read infrastructure for KVM to use.
>
>Specifically, export two helpers:
>
>1) The helper which reads multiple metadata fields to a buffer of a
> structure based on the "field ID -> structure member" mapping table.
>
>2) The low level helper which just reads a given field ID.
>
>The two helpers cover cases when the user wants to cache a bunch of
>metadata fields to a certain structure and when the user just wants to
>query a specific metadata field on demand. They are enough for KVM to
>use (and also should be enough for other potential users).
>
>Signed-off-by: Kai Huang <[email protected]>
>Reviewed-by: Kirill A. Shutemov <[email protected]>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
> arch/x86/include/asm/tdx.h | 22 ++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.c | 25 ++++++++-----------------
> 2 files changed, 30 insertions(+), 17 deletions(-)
>
>diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>index eba178996d84..709b9483f9e4 100644
>--- a/arch/x86/include/asm/tdx.h
>+++ b/arch/x86/include/asm/tdx.h
>@@ -116,6 +116,28 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
> int tdx_cpu_enable(void);
> int tdx_enable(void);
> const char *tdx_dump_mce_info(struct mce *m);
>+
>+struct tdx_metadata_field_mapping {
>+ u64 field_id;
>+ int offset;
>+ int size;
>+};
>+
>+#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
>+ { .field_id = MD_FIELD_ID_##_field_id, \
>+ .offset = offsetof(_struct, _member), \
>+ .size = sizeof(typeof(((_struct *)0)->_member)) }
>+
>+/*
>+ * Read multiple global metadata fields to a buffer of a structure
>+ * based on the "field ID -> structure member" mapping table.
>+ */
>+int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
>+ int nr_fields, void *stbuf);
>+
>+/* Read a single global metadata field */
>+int tdx_sys_metadata_field_read(u64 field_id, u64 *data);
>+
> #else
> static inline void tdx_init(void) { }
> static inline int tdx_cpu_enable(void) { return -ENODEV; }
>diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>index a19adc898df6..dc21310776ab 100644
>--- a/arch/x86/virt/vmx/tdx/tdx.c
>+++ b/arch/x86/virt/vmx/tdx/tdx.c
>@@ -251,7 +251,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
> return ret;
> }
>
>-static int read_sys_metadata_field(u64 field_id, u64 *data)
>+int tdx_sys_metadata_field_read(u64 field_id, u64 *data)
> {
> struct tdx_module_args args = {};
> int ret;
>@@ -270,6 +270,7 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)
>
> return 0;
> }
>+EXPORT_SYMBOL_GPL(tdx_sys_metadata_field_read);
>
> /* Return the metadata field element size in bytes */
> static int get_metadata_field_bytes(u64 field_id)
>@@ -295,7 +296,7 @@ static int stbuf_read_sys_metadata_field(u64 field_id,
> if (WARN_ON_ONCE(get_metadata_field_bytes(field_id) != bytes))
> return -EINVAL;
>
>- ret = read_sys_metadata_field(field_id, &tmp);
>+ ret = tdx_sys_metadata_field_read(field_id, &tmp);
> if (ret)
> return ret;
>
>@@ -304,19 +305,8 @@ static int stbuf_read_sys_metadata_field(u64 field_id,
> return 0;
> }
>
>-struct field_mapping {
>- u64 field_id;
>- int offset;
>- int size;
>-};
>-
>-#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
>- { .field_id = MD_FIELD_ID_##_field_id, \
>- .offset = offsetof(_struct, _member), \
>- .size = sizeof(typeof(((_struct *)0)->_member)) }
>-
>-static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
>- void *stbuf)
>+int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
>+ int nr_fields, void *stbuf)
> {
> int i, ret;
>
>@@ -331,6 +321,7 @@ static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
>
> return 0;
> }
>+EXPORT_SYMBOL_GPL(tdx_sys_metadata_read);
Hi Kai,

The two helpers can potentially be used by TD guests, as you mentioned.
It's a good idea to declare them in the header asm/tdx.h.

However, the functions cannot be compiled if their definitions remain in the
vmx/tdx/tdx.c file while CONFIG_TDX_HOST is disabled.

It would be better to move the definitions to a shared location,
allowing the host and guest to share the same code.

Thanks
--Sun, Yi


2024-03-06 13:17:58

by Binbin Wu

Subject: Re: [PATCH v19 016/130] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Introduce a helper to directly (pun intended) fault-in a TDP page
> without having to go through the full page fault path. This allows
> TDX to get the resulting pfn and also allows the RET_PF_* enums to
> stay in mmu.c where they belong.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - Move up for KVM_MEMORY_MAPPING.
> - Add goal_level for the caller to know how many pages are mapped.
>
> v14 -> v15:
> - Remove loop in kvm_mmu_map_tdp_page() and return error code based on
> RET_FP_xxx value to avoid potential infinite loop. The caller should
> loop on -EAGAIN instead now.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 3 +++
> arch/x86/kvm/mmu/mmu.c | 58 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 61 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..d96c93a25b3b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> }
>
> +int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 max_level, u8 *goal_level);
> +
> /*
> * Check if a given access (described through the I/D, W/R and U/S bits of a
> * page fault error code pfec) causes a permission fault with the given PTE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61674d6b17aa..ca0c91f14063 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4615,6 +4615,64 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> +int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> + u8 max_level, u8 *goal_level)
> +{
> + int r;
> + struct kvm_page_fault fault = (struct kvm_page_fault) {
> + .addr = gpa,
> + .error_code = error_code,
> + .exec = error_code & PFERR_FETCH_MASK,
> + .write = error_code & PFERR_WRITE_MASK,
> + .present = error_code & PFERR_PRESENT_MASK,
> + .rsvd = error_code & PFERR_RSVD_MASK,
> + .user = error_code & PFERR_USER_MASK,
> + .prefetch = false,
> + .is_tdp = true,
> + .is_private = error_code & PFERR_GUEST_ENC_MASK,
> + .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> + };
> +
> + WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> + fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> +
> + r = mmu_topup_memory_caches(vcpu, false);

Does it need a cache topup here?
Both kvm_tdp_mmu_page_fault() and direct_page_fault() will call
mmu_topup_memory_caches() when needed.

> + if (r)
> + return r;
> +
> + fault.max_level = max_level;
> + fault.req_level = PG_LEVEL_4K;
> + fault.goal_level = PG_LEVEL_4K;
> +
> +#ifdef CONFIG_X86_64
> + if (tdp_mmu_enabled)
> + r = kvm_tdp_mmu_page_fault(vcpu, &fault);
> + else
> +#endif
> + r = direct_page_fault(vcpu, &fault);
> +
> + if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
> + return -EFAULT;
> +
> + switch (r) {
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> + case RET_PF_FIXED:
> + case RET_PF_SPURIOUS:
> + if (goal_level)
> + *goal_level = fault.goal_level;
> + return 0;
> +
> + case RET_PF_CONTINUE:
> + case RET_PF_EMULATE:
> + case RET_PF_INVALID:
> + default:
> + return -EIO;
> + }
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
> +
> static void nonpaging_init_context(struct kvm_mmu *context)
> {
> context->page_fault = nonpaging_page_fault;


2024-03-06 21:24:10

by Huang, Kai

Subject: Re: [PATCH v19 005/130] x86/virt/tdx: Export global metadata read infrastructure


>
> However, the function cannot be compiled if its definition remains in the
> vmx/tdx/tdx.c file while disabling the CONFIG_TDX_HOST.
>
> It would be better to move the definition to a shared location,
> allowing the host and guest to share the same code.
>

No, not in this series. Such a change needs to be in your series.

Thanks,
-Kai

2024-03-06 22:17:40

by Isaku Yamahata

Subject: Re: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow

On Wed, Mar 06, 2024 at 10:35:43AM +1300,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 5/03/2024 9:12 pm, Isaku Yamahata wrote:
> > On Fri, Mar 01, 2024 at 01:36:43PM +0200,
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > On Thu, Feb 29, 2024 at 11:49:13AM +1300, Huang, Kai wrote:
> > > >
> > > >
> > > > On 26/02/2024 9:25 pm, [email protected] wrote:
> > > > > From: Isaku Yamahata <[email protected]>
> > > > >
> > > > > This patch fixes the following warnings.
> > > > >
> > > > > In file included from arch/x86/kernel/asm-offsets.c:22:
> > > > > arch/x86/include/asm/tdx.h:92:87: warning: shift count >= width of type [-Wshift-count-overflow]
> > > > > arch/x86/include/asm/tdx.h:20:21: note: expanded from macro 'TDX_ERROR'
> > > > > #define TDX_ERROR _BITUL(63)
> > > > >
> > > > > ^~~~~~~~~~
> > > > >
> > >
> > > I think you trim the warning message. I don't see the actual user of the
> > > define. Define itself will not generate the warning. You need to actually
> > > use it outside of preprocessor. I don't understand who would use it in
> > > 32-bit code. Maybe fixing it this way masking other issue.
> > >
> > > That said, I don't object the change itself. We just need to understand
> > > the context more.
> >
> > v18 used it as stub function. v19 dropped it as the stub was not needed.
>
> Sorry I literally don't understand what you are talking about here.
>
> Please just clarify (at least):
>
> - Does this problem exist in upstream code?

No.

> - If it does, what is the root cause, and how to reproduce?

v18 had the problem because it had a stub function. v19 doesn't have the
problem because the stub function was deleted.
--
Isaku Yamahata <[email protected]>

2024-03-06 22:22:50

by Isaku Yamahata

Subject: Re: [PATCH v19 016/130] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX

On Wed, Mar 06, 2024 at 03:13:22PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Introduce a helper to directly (pun intended) fault-in a TDP page
> > without having to go through the full page fault path. This allows
> > TDX to get the resulting pfn and also allows the RET_PF_* enums to
> > stay in mmu.c where they belong.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - Move up for KVM_MEMORY_MAPPING.
> > - Add goal_level for the caller to know how many pages are mapped.
> >
> > v14 -> v15:
> > - Remove loop in kvm_mmu_map_tdp_page() and return error code based on
> > RET_FP_xxx value to avoid potential infinite loop. The caller should
> > loop on -EAGAIN instead now.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 3 +++
> > arch/x86/kvm/mmu/mmu.c | 58 ++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 61 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 60f21bb4c27b..d96c93a25b3b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> > }
> > +int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > + u8 max_level, u8 *goal_level);
> > +
> > /*
> > * Check if a given access (described through the I/D, W/R and U/S bits of a
> > * page fault error code pfec) causes a permission fault with the given PTE
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 61674d6b17aa..ca0c91f14063 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4615,6 +4615,64 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> > +int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > + u8 max_level, u8 *goal_level)
> > +{
> > + int r;
> > + struct kvm_page_fault fault = (struct kvm_page_fault) {
> > + .addr = gpa,
> > + .error_code = error_code,
> > + .exec = error_code & PFERR_FETCH_MASK,
> > + .write = error_code & PFERR_WRITE_MASK,
> > + .present = error_code & PFERR_PRESENT_MASK,
> > + .rsvd = error_code & PFERR_RSVD_MASK,
> > + .user = error_code & PFERR_USER_MASK,
> > + .prefetch = false,
> > + .is_tdp = true,
> > + .is_private = error_code & PFERR_GUEST_ENC_MASK,
> > + .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > + };
> > +
> > + WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> > + fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > +
> > + r = mmu_topup_memory_caches(vcpu, false);
>
> Does it need a cache topup here?
> Both kvm_tdp_mmu_page_fault() and direct_page_fault() will call
> mmu_topup_memory_caches() when needed.

You're right. As the called function has changed, I missed removing it.
--
Isaku Yamahata <[email protected]>

2024-03-06 22:27:27

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 008/130] x86/tdx: Warning with 32bit build shift-count-overflow


>> Please just clarify (at least):
>>
>> - Does this problem exist in upstream code?
>
> No.
>
>> - If it does, what is the root cause, and how to reproduce?
>
> v18 had a problem because it had a stub function. v19 doesn't have the problem
> because the stub function was deleted.

What is the "stub function"??

If "v19 doesn't have problem", why do you even _need_ this patch??

I am tired of guessing, but I don't care anymore given it's not a
problem in upstream code.

2024-03-07 07:01:42

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v19 014/130] KVM: Add KVM vcpu ioctl to pre-populate guest memory



On 2/26/24 16:25, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a new ioctl, KVM_MEMORY_MAPPING, to the KVM common code. It iterates over
> the memory range and calls an arch-specific function. Add stub functions as
> weak symbols.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - newly added
> ---
> include/linux/kvm_host.h | 4 +++
> include/uapi/linux/kvm.h | 10 ++++++
> virt/kvm/kvm_main.c | 67 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 81 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0520cd8d03cc..eeaf4e73317c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2389,4 +2389,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +void kvm_arch_vcpu_pre_memory_mapping(struct kvm_vcpu *vcpu);
> +int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping);
> +
> #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index c3308536482b..5e2b28934aa9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1155,6 +1155,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_MEMORY_ATTRIBUTES 233
> #define KVM_CAP_GUEST_MEMFD 234
> #define KVM_CAP_VM_TYPES 235
> +#define KVM_CAP_MEMORY_MAPPING 236
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2227,4 +2228,13 @@ struct kvm_create_guest_memfd {
> __u64 reserved[6];
> };
>
> +#define KVM_MEMORY_MAPPING _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
> +
> +struct kvm_memory_mapping {
> + __u64 base_gfn;
> + __u64 nr_pages;
> + __u64 flags;
> + __u64 source;
> +};
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0349e1f241d1..2f0a8e28795e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4409,6 +4409,62 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> return fd;
> }
>
> +__weak void kvm_arch_vcpu_pre_memory_mapping(struct kvm_vcpu *vcpu)
> +{
> +}
> +
> +__weak int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int kvm_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;
> +
> + /* flags isn't used yet. */
> + if (mapping->flags)
> + return -EINVAL;
> +
> + /* Sanity check */
> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||
> + mapping->nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) ||
> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
I suppose !mapping->nr_pages can be deleted, as this line covers it.
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_arch_vcpu_pre_memory_mapping(vcpu);
> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())
> + cond_resched();
> +
> + r = kvm_arch_vcpu_memory_mapping(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;
> +
> + return r;
> +}
> +
> static long kvm_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4610,6 +4666,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
> r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
> break;
> }
> + case KVM_MEMORY_MAPPING: {
> + struct kvm_memory_mapping mapping;
> +
> + r = -EFAULT;
> + if (copy_from_user(&mapping, argp, sizeof(mapping)))
> + break;
> + r = kvm_vcpu_memory_mapping(vcpu, &mapping);
The return value r should be checked before copy_to_user().


Regards
Yin, Fengwei

> + if (copy_to_user(argp, &mapping, sizeof(mapping)))
> + r = -EFAULT;
> + break;
> + }
> default:
> r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> }

2024-03-07 07:05:00

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v19 013/130] KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is private



On 2/26/24 16:25, [email protected] wrote:
> + /*
> + * This is racy with updating memory attributes with mmu_seq. If we
> + * hit a race, it would result in retrying page fault.
> + */
> + if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
There are more than two instances of this similar check in the patchset.
Maybe it's better to add a helper function to tell whether the VM type
is KVM_X86_SW_PROTECTED_VM?


Regards
Yin, Fengwei

> + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> + error_code |= PFERR_GUEST_ENC_MASK;
> +

2024-03-07 08:32:47

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD

On 2024-02-26 at 00:26:22 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> On exiting from the guest TD, xsave state is clobbered. Restore xsave
> state on TD exit.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - Add EXPORT_SYMBOL_GPL(host_xcr0)
>
> v15 -> v16:
> - Added CET flag mask
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> 2 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9616b1aab6ce..199226c6cf55 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2,6 +2,7 @@
> #include <linux/cpu.h>
> #include <linux/mmu_context.h>
>
> +#include <asm/fpu/xcr.h>
> #include <asm/tdx.h>
>
> #include "capabilities.h"
> @@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> */
> }
>
> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +
> + if (static_cpu_has(X86_FEATURE_XSAVE) &&
> + host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> + if (static_cpu_has(X86_FEATURE_XSAVES) &&
> + /* PT can be exposed to TD guest regardless of KVM's XSS support */
> + host_xss != (kvm_tdx->xfam &
> + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))
> + wrmsrl(MSR_IA32_XSS, host_xss);
> + if (static_cpu_has(X86_FEATURE_PKU) &&
> + (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> + write_pkru(vcpu->arch.host_pkru);
> +}

Maybe one minor question regarding the PKRU restore. In the non-TDX version,
kvm_load_host_xsave_state() first reads the current setting with
vcpu->arch.pkru = rdpkru(); if this value does not equal host_pkru, it
triggers write_pkru() on the host. Could we also leverage that mechanism in
TDX to avoid one PKRU write (I guess a PKRU write is more costly than a read)?

thanks,
Chenyu

2024-03-08 10:11:49

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 003/130] x86/virt/tdx: Unbind global metadata read with 'struct tdx_tdmr_sysinfo'



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Kai Huang <[email protected]>
>
> For now the kernel only reads TDMR related global metadata fields for
> module initialization, and the metadata read code only works with the
> 'struct tdx_tdmr_sysinfo'.
>
> KVM will need to read a bunch of non-TDMR related metadata to create and
> run TDX guests. It's essential to provide a generic metadata read
> infrastructure which is not bound to any specific structure.
>
> To start providing such infrastructure, unbound the metadata read with
> the 'struct tdx_tdmr_sysinfo'.
>
> Signed-off-by: Kai Huang <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>

Reviewed-by: Binbin Wu <[email protected]>

> ---
> arch/x86/virt/vmx/tdx/tdx.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index cdcb3332bc5d..eb208da4ff63 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -273,9 +273,9 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)
>
> static int read_sys_metadata_field16(u64 field_id,
> int offset,
> - struct tdx_tdmr_sysinfo *ts)
> + void *stbuf)
> {
> - u16 *ts_member = ((void *)ts) + offset;
> + u16 *st_member = stbuf + offset;
> u64 tmp;
> int ret;
>
> @@ -287,7 +287,7 @@ static int read_sys_metadata_field16(u64 field_id,
> if (ret)
> return ret;
>
> - *ts_member = tmp;
> + *st_member = tmp;
>
> return 0;
> }
> @@ -297,19 +297,22 @@ struct field_mapping {
> int offset;
> };
>
> -#define TD_SYSINFO_MAP(_field_id, _member) \
> - { .field_id = MD_FIELD_ID_##_field_id, \
> - .offset = offsetof(struct tdx_tdmr_sysinfo, _member) }
> +#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
> + { .field_id = MD_FIELD_ID_##_field_id, \
> + .offset = offsetof(_struct, _member) }
> +
> +#define TD_SYSINFO_MAP_TDMR_INFO(_field_id, _member) \
> + TD_SYSINFO_MAP(_field_id, struct tdx_tdmr_sysinfo, _member)
>
> static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
> {
> /* Map TD_SYSINFO fields into 'struct tdx_tdmr_sysinfo': */
> const struct field_mapping fields[] = {
> - TD_SYSINFO_MAP(MAX_TDMRS, max_tdmrs),
> - TD_SYSINFO_MAP(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
> - TD_SYSINFO_MAP(PAMT_4K_ENTRY_SIZE, pamt_entry_size[TDX_PS_4K]),
> - TD_SYSINFO_MAP(PAMT_2M_ENTRY_SIZE, pamt_entry_size[TDX_PS_2M]),
> - TD_SYSINFO_MAP(PAMT_1G_ENTRY_SIZE, pamt_entry_size[TDX_PS_1G]),
> + TD_SYSINFO_MAP_TDMR_INFO(MAX_TDMRS, max_tdmrs),
> + TD_SYSINFO_MAP_TDMR_INFO(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
> + TD_SYSINFO_MAP_TDMR_INFO(PAMT_4K_ENTRY_SIZE, pamt_entry_size[TDX_PS_4K]),
> + TD_SYSINFO_MAP_TDMR_INFO(PAMT_2M_ENTRY_SIZE, pamt_entry_size[TDX_PS_2M]),
> + TD_SYSINFO_MAP_TDMR_INFO(PAMT_1G_ENTRY_SIZE, pamt_entry_size[TDX_PS_1G]),
> };
> int ret;
> int i;


2024-03-08 12:41:50

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 005/130] x86/virt/tdx: Export global metadata read infrastructure



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Kai Huang <[email protected]>
>
> KVM will need to read a bunch of non-TDMR related metadata to create and
> run TDX guests. Export the metadata read infrastructure for KVM to use.
>
> Specifically, export two helpers:
>
> 1) The helper which reads multiple metadata fields to a buffer of a
> structure based on the "field ID -> structure member" mapping table.
>
> 2) The low level helper which just reads a given field ID.
>
> The two helpers cover cases when the user wants to cache a bunch of
> metadata fields to a certain structure and when the user just wants to
> query a specific metadata field on demand. They are enough for KVM to
> use (and also should be enough for other potential users).
>
> Signed-off-by: Kai Huang <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>

Reviewed-by: Binbin Wu <[email protected]>

> ---
> arch/x86/include/asm/tdx.h | 22 ++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.c | 25 ++++++++-----------------
> 2 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index eba178996d84..709b9483f9e4 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -116,6 +116,28 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
> int tdx_cpu_enable(void);
> int tdx_enable(void);
> const char *tdx_dump_mce_info(struct mce *m);
> +
> +struct tdx_metadata_field_mapping {
> + u64 field_id;
> + int offset;
> + int size;
> +};
> +
> +#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
> + { .field_id = MD_FIELD_ID_##_field_id, \
> + .offset = offsetof(_struct, _member), \
> + .size = sizeof(typeof(((_struct *)0)->_member)) }
> +
> +/*
> + * Read multiple global metadata fields to a buffer of a structure
> + * based on the "field ID -> structure member" mapping table.
> + */
> +int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
> + int nr_fields, void *stbuf);
> +
> +/* Read a single global metadata field */
> +int tdx_sys_metadata_field_read(u64 field_id, u64 *data);
> +
> #else
> static inline void tdx_init(void) { }
> static inline int tdx_cpu_enable(void) { return -ENODEV; }
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index a19adc898df6..dc21310776ab 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -251,7 +251,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
> return ret;
> }
>
> -static int read_sys_metadata_field(u64 field_id, u64 *data)
> +int tdx_sys_metadata_field_read(u64 field_id, u64 *data)
> {
> struct tdx_module_args args = {};
> int ret;
> @@ -270,6 +270,7 @@ static int read_sys_metadata_field(u64 field_id, u64 *data)
>
> return 0;
> }
> +EXPORT_SYMBOL_GPL(tdx_sys_metadata_field_read);
>
> /* Return the metadata field element size in bytes */
> static int get_metadata_field_bytes(u64 field_id)
> @@ -295,7 +296,7 @@ static int stbuf_read_sys_metadata_field(u64 field_id,
> if (WARN_ON_ONCE(get_metadata_field_bytes(field_id) != bytes))
> return -EINVAL;
>
> - ret = read_sys_metadata_field(field_id, &tmp);
> + ret = tdx_sys_metadata_field_read(field_id, &tmp);
> if (ret)
> return ret;
>
> @@ -304,19 +305,8 @@ static int stbuf_read_sys_metadata_field(u64 field_id,
> return 0;
> }
>
> -struct field_mapping {
> - u64 field_id;
> - int offset;
> - int size;
> -};
> -
> -#define TD_SYSINFO_MAP(_field_id, _struct, _member) \
> - { .field_id = MD_FIELD_ID_##_field_id, \
> - .offset = offsetof(_struct, _member), \
> - .size = sizeof(typeof(((_struct *)0)->_member)) }
> -
> -static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
> - void *stbuf)
> +int tdx_sys_metadata_read(const struct tdx_metadata_field_mapping *fields,
> + int nr_fields, void *stbuf)
> {
> int i, ret;
>
> @@ -331,6 +321,7 @@ static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
>
> return 0;
> }
> +EXPORT_SYMBOL_GPL(tdx_sys_metadata_read);
>
> #define TD_SYSINFO_MAP_TDMR_INFO(_field_id, _member) \
> TD_SYSINFO_MAP(_field_id, struct tdx_tdmr_sysinfo, _member)
> @@ -338,7 +329,7 @@ static int read_sys_metadata(struct field_mapping *fields, int nr_fields,
> static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
> {
> /* Map TD_SYSINFO fields into 'struct tdx_tdmr_sysinfo': */
> - const struct field_mapping fields[] = {
> + const struct tdx_metadata_field_mapping fields[] = {
> TD_SYSINFO_MAP_TDMR_INFO(MAX_TDMRS, max_tdmrs),
> TD_SYSINFO_MAP_TDMR_INFO(MAX_RESERVED_PER_TDMR, max_reserved_per_tdmr),
> TD_SYSINFO_MAP_TDMR_INFO(PAMT_4K_ENTRY_SIZE, pamt_entry_size[TDX_PS_4K]),
> @@ -347,7 +338,7 @@ static int get_tdx_tdmr_sysinfo(struct tdx_tdmr_sysinfo *tdmr_sysinfo)
> };
>
> /* Populate 'tdmr_sysinfo' fields using the mapping structure above: */
> - return read_sys_metadata(fields, ARRAY_SIZE(fields), tdmr_sysinfo);
> + return tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdmr_sysinfo);
> }
>
> /* Calculate the actual TDMR size */


2024-03-08 20:58:50

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD

On Thu, Mar 07, 2024 at 04:32:16PM +0800,
Chen Yu <[email protected]> wrote:

> On 2024-02-26 at 00:26:22 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > On exiting from the guest TD, xsave state is clobbered. Restore xsave
> > state on TD exit.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - Add EXPORT_SYMBOL_GPL(host_xcr0)
> >
> > v15 -> v16:
> > - Added CET flag mask
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> > arch/x86/kvm/x86.c | 1 +
> > 2 files changed, 20 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 9616b1aab6ce..199226c6cf55 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2,6 +2,7 @@
> > #include <linux/cpu.h>
> > #include <linux/mmu_context.h>
> >
> > +#include <asm/fpu/xcr.h>
> > #include <asm/tdx.h>
> >
> > #include "capabilities.h"
> > @@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > */
> > }
> >
> > +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +
> > + if (static_cpu_has(X86_FEATURE_XSAVE) &&
> > + host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> > + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> > + if (static_cpu_has(X86_FEATURE_XSAVES) &&
> > + /* PT can be exposed to TD guest regardless of KVM's XSS support */
> > + host_xss != (kvm_tdx->xfam &
> > + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))
> > + wrmsrl(MSR_IA32_XSS, host_xss);
> > + if (static_cpu_has(X86_FEATURE_PKU) &&
> > + (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> > + write_pkru(vcpu->arch.host_pkru);
> > +}
>
> Maybe one minor question regarding the PKRU restore. In the non-TDX version,
> kvm_load_host_xsave_state() first reads the current setting with
> vcpu->arch.pkru = rdpkru(); if this value does not equal host_pkru, it
> triggers write_pkru() on the host. Could we also leverage that mechanism in
> TDX to avoid one PKRU write (I guess a PKRU write is more costly than a read)?

Yes, that's the intention. When we set the PKRU feature for the guest, the TDX
module unconditionally initializes PKRU. Do you have a use case where wrpkru()
(without rdpkru()) is better?
--
Isaku Yamahata <[email protected]>

2024-03-08 21:02:03

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 014/130] KVM: Add KVM vcpu ioctl to pre-populate guest memory

On Thu, Mar 07, 2024 at 03:01:11PM +0800,
Yin Fengwei <[email protected]> wrote:

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 0349e1f241d1..2f0a8e28795e 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4409,6 +4409,62 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> > return fd;
> > }
> >
> > +__weak void kvm_arch_vcpu_pre_memory_mapping(struct kvm_vcpu *vcpu)
> > +{
> > +}
> > +
> > +__weak int kvm_arch_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static int kvm_vcpu_memory_mapping(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + bool added = false;
> > + int idx, r = 0;
> > +
> > + /* flags isn't used yet. */
> > + if (mapping->flags)
> > + return -EINVAL;
> > +
> > + /* Sanity check */
> > + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> > + !mapping->nr_pages ||
> > + mapping->nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) ||
> > + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> I suppose !mapping->nr_pages can be deleted, as this line covers it.
> > + return -EINVAL;
> > +
> > + vcpu_load(vcpu);
> > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > + kvm_arch_vcpu_pre_memory_mapping(vcpu);
> > +
> > + while (mapping->nr_pages) {
> > + if (signal_pending(current)) {
> > + r = -ERESTARTSYS;
> > + break;
> > + }
> > +
> > + if (need_resched())
> > + cond_resched();
> > +
> > + r = kvm_arch_vcpu_memory_mapping(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
> > +
> > + return r;
> > +}
> > +
> > static long kvm_vcpu_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg)
> > {
> > @@ -4610,6 +4666,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
> > r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
> > break;
> > }
> > + case KVM_MEMORY_MAPPING: {
> > + struct kvm_memory_mapping mapping;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&mapping, argp, sizeof(mapping)))
> > + break;
> > + r = kvm_vcpu_memory_mapping(vcpu, &mapping);
> The return value r should be checked before copy_to_user().

That's intentional, to tell whether the mapping was partially or fully
processed, regardless of whether an error happened.

>
>
> Regards
> Yin, Fengwei
>
> > + if (copy_to_user(argp, &mapping, sizeof(mapping)))
> > + r = -EFAULT;
> > + break;
> > + }
> > default:
> > r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> > }
>

--
Isaku Yamahata <[email protected]>

2024-03-09 16:29:37

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD

On 2024-03-08 at 12:58:38 -0800, Isaku Yamahata wrote:
> On Thu, Mar 07, 2024 at 04:32:16PM +0800,
> Chen Yu <[email protected]> wrote:
>
> > On 2024-02-26 at 00:26:22 -0800, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > On exiting from the guest TD, xsave state is clobbered. Restore xsave
> > > state on TD exit.
> > >
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > ---
> > > v19:
> > > - Add EXPORT_SYMBOL_GPL(host_xcr0)
> > >
> > > v15 -> v16:
> > > - Added CET flag mask
> > >
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> > > arch/x86/kvm/x86.c | 1 +
> > > 2 files changed, 20 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 9616b1aab6ce..199226c6cf55 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -2,6 +2,7 @@
> > > #include <linux/cpu.h>
> > > #include <linux/mmu_context.h>
> > >
> > > +#include <asm/fpu/xcr.h>
> > > #include <asm/tdx.h>
> > >
> > > #include "capabilities.h"
> > > @@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > */
> > > }
> > >
> > > +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > > +
> > > + if (static_cpu_has(X86_FEATURE_XSAVE) &&
> > > + host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> > > + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> > > + if (static_cpu_has(X86_FEATURE_XSAVES) &&
> > > + /* PT can be exposed to TD guest regardless of KVM's XSS support */
> > > + host_xss != (kvm_tdx->xfam &
> > > + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))
> > > + wrmsrl(MSR_IA32_XSS, host_xss);
> > > + if (static_cpu_has(X86_FEATURE_PKU) &&
> > > + (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> > > + write_pkru(vcpu->arch.host_pkru);
> > > +}
> >
> > Maybe one minor question regarding the PKRU restore. In the non-TDX version,
> > kvm_load_host_xsave_state() first reads the current setting with
> > vcpu->arch.pkru = rdpkru(); if this value does not equal host_pkru, it
> > triggers write_pkru() on the host. Could we also leverage that mechanism in
> > TDX to avoid one PKRU write (I guess a PKRU write is more costly than a read)?
>
> Yes, that's the intention. When we set the PKRU feature for the guest, the TDX
> module unconditionally initializes PKRU.

I see, thanks for the information. Please correct me if I'm wrong; I'm not sure
whether the wrpkru instruction would trigger a TD exit. The TDX module spec[1]
mentions PKS (protection keys for supervisor pages) but does not mention PKU for
user pages. PKS is controlled by the MSR IA32_PKRS. The TDX module passes
through MSR IA32_PKRS writes in a TD, because the TDX module clears the PKS
bitmap in the VMCS:
https://github.com/intel/tdx-module/blob/tdx_1.5/src/common/helpers/helpers.c#L1723
so neither a write to MSR IA32_PKRS nor wrpkru triggers a TD exit.

However, on second thought, I found that after commit 72a6c08c44e4 the current
code should not be a problem, because write_pkru() first reads the current PKRU
setting and decides whether to update the PKRU register.

> Do you have a use case where wrpkru()
> (without rdpkru()) is better?

I don't have a use case yet. But with or without rdpkru() in
tdx_restore_host_xsave_state(), there is not much difference, because
write_pkru() already takes care of it, if I understand correctly.

thanks,
Chenyu

2024-03-11 05:32:38

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Currently, KVM VMX module initialization/exit functions are a single
> function each. Refactor KVM VMX module initialization functions into KVM
> common part and VMX part so that TDX specific part can be added cleanly.
> Opportunistically refactor module exit function as well.
>
> The current module initialization flow is,
> 0.) Check if VMX is supported,
> 1.) hyper-v specific initialization,
> 2.) system-wide x86 specific and vendor specific initialization,
> 3.) Final VMX specific system-wide initialization,
> 4.) calculate the sizes of VMX kvm structure and VMX vcpu structure,
> 5.) report those sizes to the KVM common layer and KVM common
> initialization
>
> Refactor the KVM VMX module initialization function into functions with a
> wrapper function to separate the VMX logic in vmx.c from main.c, the file
> common to VMX and TDX. Introduce a wrapper function for vmx_init().
>
> The KVM architecture common layer allocates struct kvm with reported size
> for architecture-specific code. The KVM VMX module defines its structure
> as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define
> TDX specific kvm and vcpu structures.
>
> The current module exit function is also a single function, a combination
> of VMX specific logic and common KVM logic. Refactor it into VMX specific
> logic and KVM common logic. This is just refactoring to keep the VMX-specific
> logic in vmx.c, out of main.c.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - Eliminate the unnecessary churn with vmx_hardware_setup() by Xiaoyao
>
> v18:
> - Move loaded_vmcss_on_cpu initialization to vt_init() before
> kvm_x86_vendor_init().
> - added __init to an empty stub function, hv_init_evmcs().
>
> Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>

With one minor comment. See below.

> ---
> arch/x86/kvm/vmx/main.c | 54 ++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 60 +++++---------------------------------
> arch/x86/kvm/vmx/x86_ops.h | 14 +++++++++
> 3 files changed, 75 insertions(+), 53 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index eeb7a43b271d..18cecf12c7c8 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -167,3 +167,57 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
> .runtime_ops = &vt_x86_ops,
> .pmu_ops = &intel_pmu_ops,
> };
> +
> +static int __init vt_init(void)
> +{
> + unsigned int vcpu_size, vcpu_align;
> + int cpu, r;
> +
> + if (!kvm_is_vmx_supported())
> + return -EOPNOTSUPP;
> +
> + /*
> + * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
> + * to unwind if a later step fails.
> + */
> + hv_init_evmcs();
> +
> + /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
> + for_each_possible_cpu(cpu)
> + INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> +
> + r = kvm_x86_vendor_init(&vt_init_ops);
> + if (r)
> + return r;
> +
> + r = vmx_init();
> + if (r)
> + goto err_vmx_init;
> +
> + /*
> + * Common KVM initialization _must_ come last, after this, /dev/kvm is
> + * exposed to userspace!
> + */
> + vcpu_size = sizeof(struct vcpu_vmx);
> + vcpu_align = __alignof__(struct vcpu_vmx);
> + r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
> + if (r)
> + goto err_kvm_init;
> +
> + return 0;
> +
> +err_kvm_init:
> + vmx_exit();
> +err_vmx_init:
> + kvm_x86_vendor_exit();
> + return r;
> +}
> +module_init(vt_init);
> +
> +static void vt_exit(void)
> +{
> + kvm_exit();
> + kvm_x86_vendor_exit();
> + vmx_exit();
> +}
> +module_exit(vt_exit);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8af0668e4dca..2fb1cd2e28a2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -477,7 +477,7 @@ DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
> * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
> */
> -static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> +DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>
> static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);
> static DEFINE_SPINLOCK(vmx_vpid_lock);
> @@ -537,7 +537,7 @@ static int hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> -static __init void hv_init_evmcs(void)
> +__init void hv_init_evmcs(void)
> {
> int cpu;
>
> @@ -573,7 +573,7 @@ static __init void hv_init_evmcs(void)
> }
> }
>
> -static void hv_reset_evmcs(void)
> +void hv_reset_evmcs(void)
> {
> struct hv_vp_assist_page *vp_ap;
>
> @@ -597,10 +597,6 @@ static void hv_reset_evmcs(void)
> vp_ap->current_nested_vmcs = 0;
> vp_ap->enlighten_vmentry = 0;
> }
> -
> -#else /* IS_ENABLED(CONFIG_HYPERV) */
> -static void hv_init_evmcs(void) {}
> -static void hv_reset_evmcs(void) {}
> #endif /* IS_ENABLED(CONFIG_HYPERV) */
>
> /*
> @@ -2743,7 +2739,7 @@ static bool __kvm_is_vmx_supported(void)
> return true;
> }
>
> -static bool kvm_is_vmx_supported(void)
> +bool kvm_is_vmx_supported(void)
> {
> bool supported;
>
> @@ -8508,7 +8504,7 @@ static void vmx_cleanup_l1d_flush(void)
> l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
> }
>
> -static void __vmx_exit(void)
> +void vmx_exit(void)
> {
> allow_smaller_maxphyaddr = false;
>
> @@ -8517,36 +8513,10 @@ static void __vmx_exit(void)
> vmx_cleanup_l1d_flush();
> }
>
> -static void vmx_exit(void)
> -{
> - kvm_exit();
> - kvm_x86_vendor_exit();
> -
> - __vmx_exit();
> -}
> -module_exit(vmx_exit);
> -
> -static int __init vmx_init(void)
> +int __init vmx_init(void)
> {
> int r, cpu;
>
> - if (!kvm_is_vmx_supported())
> - return -EOPNOTSUPP;
> -
> - /*
> - * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
> - * to unwind if a later step fails.
> - */
> - hv_init_evmcs();
> -
> - /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
> - for_each_possible_cpu(cpu)
> - INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> -
> - r = kvm_x86_vendor_init(&vt_init_ops);
> - if (r)
> - return r;
> -
> /*
> * Must be called after common x86 init so enable_ept is properly set
> * up. Hand the parameter mitigation value in which was stored in
I am wondering whether the first sentence of the above comment should be
moved to vt_init(), so that vt_init() gives the whole picture of the init
sequence.


Regards
Yin, Fengwei

> @@ -8556,7 +8526,7 @@ static int __init vmx_init(void)
> */
> r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
> if (r)
> - goto err_l1d_flush;
> + return r;
>
> for_each_possible_cpu(cpu)
> pi_init_cpu(cpu);
> @@ -8573,21 +8543,5 @@ static int __init vmx_init(void)
> if (!enable_ept)
> allow_smaller_maxphyaddr = true;
>
> - /*
> - * Common KVM initialization _must_ come last, after this, /dev/kvm is
> - * exposed to userspace!
> - */
> - r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
> - THIS_MODULE);
> - if (r)
> - goto err_kvm_init;
> -
> return 0;
> -
> -err_kvm_init:
> - __vmx_exit();
> -err_l1d_flush:
> - kvm_x86_vendor_exit();
> - return r;
> }
> -module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 2f8b6c43fe0f..b936388853ab 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -6,6 +6,20 @@
>
> #include "x86.h"
>
> +#if IS_ENABLED(CONFIG_HYPERV)
> +__init void hv_init_evmcs(void);
> +void hv_reset_evmcs(void);
> +#else /* IS_ENABLED(CONFIG_HYPERV) */
> +static inline __init void hv_init_evmcs(void) {}
> +static inline void hv_reset_evmcs(void) {}
> +#endif /* IS_ENABLED(CONFIG_HYPERV) */
> +
> +DECLARE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> +
> +bool kvm_is_vmx_supported(void);
> +int __init vmx_init(void);
> +void vmx_exit(void);
> +
> extern struct kvm_x86_ops vt_x86_ops __initdata;
> extern struct kvm_x86_init_ops vt_init_ops __initdata;
>

2024-03-12 02:03:32

by Isaku Yamahata

Subject: Re: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD

On Sun, Mar 10, 2024 at 12:28:55AM +0800,
Chen Yu <[email protected]> wrote:

> On 2024-03-08 at 12:58:38 -0800, Isaku Yamahata wrote:
> > On Thu, Mar 07, 2024 at 04:32:16PM +0800,
> > Chen Yu <[email protected]> wrote:
> >
> > > On 2024-02-26 at 00:26:22 -0800, [email protected] wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > On exiting from the guest TD, xsave state is clobbered. Restore xsave
> > > > state on TD exit.
> > > >
> > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > ---
> > > > v19:
> > > > - Add EXPORT_SYMBOL_GPL(host_xcr0)
> > > >
> > > > v15 -> v16:
> > > > - Added CET flag mask
> > > >
> > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > ---
> > > > arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> > > > arch/x86/kvm/x86.c | 1 +
> > > > 2 files changed, 20 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index 9616b1aab6ce..199226c6cf55 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -2,6 +2,7 @@
> > > > #include <linux/cpu.h>
> > > > #include <linux/mmu_context.h>
> > > >
> > > > +#include <asm/fpu/xcr.h>
> > > > #include <asm/tdx.h>
> > > >
> > > > #include "capabilities.h"
> > > > @@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > */
> > > > }
> > > >
> > > > +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> > > > +{
> > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > > > +
> > > > + if (static_cpu_has(X86_FEATURE_XSAVE) &&
> > > > + host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> > > > + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> > > > + if (static_cpu_has(X86_FEATURE_XSAVES) &&
> > > > + /* PT can be exposed to TD guest regardless of KVM's XSS support */
> > > > + host_xss != (kvm_tdx->xfam &
> > > > + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))
> > > > + wrmsrl(MSR_IA32_XSS, host_xss);
> > > > + if (static_cpu_has(X86_FEATURE_PKU) &&
> > > > + (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> > > > + write_pkru(vcpu->arch.host_pkru);
> > > > +}
> > >
> > > Maybe one minor question regarding the pkru restore. In the non-TDX version,
> > > kvm_load_host_xsave_state() first reads the current setting,
> > > vcpu->arch.pkru = rdpkru(); if this setting does not equal host_pkru,
> > > it triggers write_pkru() on the host. Does it mean we can also leverage that
> > > mechanism in TDX to avoid one pkru write (I guess a pkru write is more costly
> > > than a pkru read)?
> >
> > Yes, that's the intention. When we set the PKRU feature for the guest, the TDX
> > module unconditionally initializes pkru.
>
> I see, thanks for the information. Please correct me if I'm wrong, as I'm not sure
> whether the wrpkru instruction would trigger a TD exit. The TDX module spec[1] mentions PKS
> (protection keys for supervisor pages), but does not mention PKU for user pages. PKS
> is controlled by MSR IA32_PKRS. The TDX module will pass through MSR IA32_PKRS
> writes in the TD, because the TDX module clears the PKS bitmap in the VMCS:
> https://github.com/intel/tdx-module/blob/tdx_1.5/src/common/helpers/helpers.c#L1723
> so neither a write to MSR IA32_PKRS nor wrpkru triggers a TD exit.

The wrpkru instruction in a TDX guest doesn't cause an exit to the TDX module.
The TDX module runs with CR4.PKE=0, so the value of pkru doesn't matter to it.
When exiting from the TDX module to the host VMM, PKRU is initialized to zero
with xrstor. So it doesn't matter.

We need to refer to NP-SEAMLDR for the register values of the TDX module on
SEAMCALL, since NP-SEAMLDR is what sets them up.


> However, on second thought, I found that after commit 72a6c08c44e4, the current
> code should not be a problem, because write_pkru() first reads the current pkru
> setting and decides whether to update the pkru register.
>
> > Do you have use case that wrpkru()
> > (without rdpkru()) is better?
>
> I don't have a use case yet. But with or without rdpkru() in tdx_restore_host_xsave_state(),
> there is not much difference, because write_pkru() has taken care of it, if I understand
> correctly.

The code in this hunk is the TDX version of kvm_load_host_xsave_state(). We can
follow the VMX case at the moment.
--
Isaku Yamahata <[email protected]>

2024-03-12 02:15:40

by Isaku Yamahata

Subject: Re: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions

On Mon, Mar 11, 2024 at 01:32:08PM +0800,
"Yin, Fengwei" <[email protected]> wrote:

>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Currently, KVM VMX module initialization/exit functions are a single
> > function each. Refactor KVM VMX module initialization functions into KVM
> > common part and VMX part so that TDX specific part can be added cleanly.
> > Opportunistically refactor module exit function as well.
> >
> > The current module initialization flow is,
> > 0.) Check if VMX is supported,
> > 1.) hyper-v specific initialization,
> > 2.) system-wide x86 specific and vendor specific initialization,
> > 3.) Final VMX specific system-wide initialization,
> > 4.) calculate the sizes of VMX kvm structure and VMX vcpu structure,
> > 5.) report those sizes to the KVM common layer and KVM common
> > initialization
> >
> > Refactor the KVM VMX module initialization function into functions with a
> > wrapper function, to separate the VMX logic in vmx.c from main.c, a file
> > common to VMX and TDX. Introduce a wrapper function for vmx_init().
> >
> > The KVM architecture common layer allocates struct kvm with reported size
> > for architecture-specific code. The KVM VMX module defines its structure
> > as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> > struct vmx_kvm. Similar for the vcpu structure. TDX KVM patches will define
> > TDX specific kvm and vcpu structures.
> >
> > The current module exit function is also a single function, a combination
> > of VMX specific logic and common KVM logic. Refactor it into VMX specific
> > logic and KVM common logic. This is just refactoring to keep the VMX
> > specific logic in vmx.c from main.c.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - Eliminate the unnecessary churn with vmx_hardware_setup() by Xiaoyao
> >
> > v18:
> > - Move loaded_vmcss_on_cpu initialization to vt_init() before
> > kvm_x86_vendor_init().
> > - added __init to an empty stub function, hv_init_evmcs().
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Yin Fengwei <[email protected]>
>
> With one minor comment. See below.
>
> > ---
> > arch/x86/kvm/vmx/main.c | 54 ++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/vmx.c | 60 +++++---------------------------------
> > arch/x86/kvm/vmx/x86_ops.h | 14 +++++++++
> > 3 files changed, 75 insertions(+), 53 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index eeb7a43b271d..18cecf12c7c8 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -167,3 +167,57 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
> > .runtime_ops = &vt_x86_ops,
> > .pmu_ops = &intel_pmu_ops,
> > };
> > +
> > +static int __init vt_init(void)
> > +{
> > + unsigned int vcpu_size, vcpu_align;
> > + int cpu, r;
> > +
> > + if (!kvm_is_vmx_supported())
> > + return -EOPNOTSUPP;
> > +
> > + /*
> > + * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
> > + * to unwind if a later step fails.
> > + */
> > + hv_init_evmcs();
> > +
> > + /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
> > + for_each_possible_cpu(cpu)
> > + INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > +
> > + r = kvm_x86_vendor_init(&vt_init_ops);
> > + if (r)
> > + return r;
> > +
> > + r = vmx_init();
> > + if (r)
> > + goto err_vmx_init;
> > +
> > + /*
> > + * Common KVM initialization _must_ come last, after this, /dev/kvm is
> > + * exposed to userspace!
> > + */
> > + vcpu_size = sizeof(struct vcpu_vmx);
> > + vcpu_align = __alignof__(struct vcpu_vmx);
> > + r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
> > + if (r)
> > + goto err_kvm_init;
> > +
> > + return 0;
> > +
> > +err_kvm_init:
> > + vmx_exit();
> > +err_vmx_init:
> > + kvm_x86_vendor_exit();
> > + return r;
> > +}
> > +module_init(vt_init);
> > +
> > +static void vt_exit(void)
> > +{
> > + kvm_exit();
> > + kvm_x86_vendor_exit();
> > + vmx_exit();
> > +}
> > +module_exit(vt_exit);
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 8af0668e4dca..2fb1cd2e28a2 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -477,7 +477,7 @@ DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> > * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
> > * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
> > */
> > -static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> > +DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> > static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);
> > static DEFINE_SPINLOCK(vmx_vpid_lock);
> > @@ -537,7 +537,7 @@ static int hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu)
> > return 0;
> > }
> > -static __init void hv_init_evmcs(void)
> > +__init void hv_init_evmcs(void)
> > {
> > int cpu;
> > @@ -573,7 +573,7 @@ static __init void hv_init_evmcs(void)
> > }
> > }
> > -static void hv_reset_evmcs(void)
> > +void hv_reset_evmcs(void)
> > {
> > struct hv_vp_assist_page *vp_ap;
> > @@ -597,10 +597,6 @@ static void hv_reset_evmcs(void)
> > vp_ap->current_nested_vmcs = 0;
> > vp_ap->enlighten_vmentry = 0;
> > }
> > -
> > -#else /* IS_ENABLED(CONFIG_HYPERV) */
> > -static void hv_init_evmcs(void) {}
> > -static void hv_reset_evmcs(void) {}
> > #endif /* IS_ENABLED(CONFIG_HYPERV) */
> > /*
> > @@ -2743,7 +2739,7 @@ static bool __kvm_is_vmx_supported(void)
> > return true;
> > }
> > -static bool kvm_is_vmx_supported(void)
> > +bool kvm_is_vmx_supported(void)
> > {
> > bool supported;
> > @@ -8508,7 +8504,7 @@ static void vmx_cleanup_l1d_flush(void)
> > l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
> > }
> > -static void __vmx_exit(void)
> > +void vmx_exit(void)
> > {
> > allow_smaller_maxphyaddr = false;
> > @@ -8517,36 +8513,10 @@ static void __vmx_exit(void)
> > vmx_cleanup_l1d_flush();
> > }
> > -static void vmx_exit(void)
> > -{
> > - kvm_exit();
> > - kvm_x86_vendor_exit();
> > -
> > - __vmx_exit();
> > -}
> > -module_exit(vmx_exit);
> > -
> > -static int __init vmx_init(void)
> > +int __init vmx_init(void)
> > {
> > int r, cpu;
> > - if (!kvm_is_vmx_supported())
> > - return -EOPNOTSUPP;
> > -
> > - /*
> > - * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
> > - * to unwind if a later step fails.
> > - */
> > - hv_init_evmcs();
> > -
> > - /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
> > - for_each_possible_cpu(cpu)
> > - INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > -
> > - r = kvm_x86_vendor_init(&vt_init_ops);
> > - if (r)
> > - return r;
> > -
> > /*
> > * Must be called after common x86 init so enable_ept is properly set
> > * up. Hand the parameter mitigation value in which was stored in
> I am wondering whether the first sentence of the above comment should be
> moved to vt_init(), so that vt_init() gives the whole picture of the init
> sequence.

If we do so, we should move the call of vmx_setup_l1d_flush() to vt_init().
I hesitated to remove the static from vmx_setup_l1d_flush().
--
Isaku Yamahata <[email protected]>

2024-03-12 02:21:44

by Yin, Fengwei

Subject: Re: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions



On 3/12/24 10:15, Isaku Yamahata wrote:
>>> -
>>> - __vmx_exit();
>>> -}
>>> -module_exit(vmx_exit);
>>> -
>>> -static int __init vmx_init(void)
>>> +int __init vmx_init(void)
>>> {
>>> int r, cpu;
>>> - if (!kvm_is_vmx_supported())
>>> - return -EOPNOTSUPP;
>>> -
>>> - /*
>>> - * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
>>> - * to unwind if a later step fails.
>>> - */
>>> - hv_init_evmcs();
>>> -
>>> - /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
>>> - for_each_possible_cpu(cpu)
>>> - INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
>>> -
>>> - r = kvm_x86_vendor_init(&vt_init_ops);
>>> - if (r)
>>> - return r;
>>> -
>>> /*
>>> * Must be called after common x86 init so enable_ept is properly set
>>> * up. Hand the parameter mitigation value in which was stored in
>> I am wondering whether the first sentence of the above comment should be
>> moved to vt_init(), so that vt_init() gives the whole picture of the init
>> sequence.
> If we do so, we should move the call of vmx_setup_l1d_flush() to vt_init().
> I hesitated to remove the static from vmx_setup_l1d_flush().
I meant this one:
"Must be called after common x86 init so enable_ept is properly set up"

Not necessary to move vmx_setup_l1d_flush().

Regards
Yin, Fengwei


> --

2024-03-12 04:43:09

by Isaku Yamahata

Subject: Re: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions

On Tue, Mar 12, 2024 at 10:21:28AM +0800,
Yin Fengwei <[email protected]> wrote:

>
>
> On 3/12/24 10:15, Isaku Yamahata wrote:
> >>> -
> >>> - __vmx_exit();
> >>> -}
> >>> -module_exit(vmx_exit);
> >>> -
> >>> -static int __init vmx_init(void)
> >>> +int __init vmx_init(void)
> >>> {
> >>> int r, cpu;
> >>> - if (!kvm_is_vmx_supported())
> >>> - return -EOPNOTSUPP;
> >>> -
> >>> - /*
> >>> - * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
> >>> - * to unwind if a later step fails.
> >>> - */
> >>> - hv_init_evmcs();
> >>> -
> >>> - /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
> >>> - for_each_possible_cpu(cpu)
> >>> - INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> >>> -
> >>> - r = kvm_x86_vendor_init(&vt_init_ops);
> >>> - if (r)
> >>> - return r;
> >>> -
> >>> /*
> >>> * Must be called after common x86 init so enable_ept is properly set
> >>> * up. Hand the parameter mitigation value in which was stored in
> >> I am wondering whether the first sentence of the above comment should be
> >> moved to vt_init(), so that vt_init() gives the whole picture of the init
> >> sequence.
> > If we do so, we should move the call of vmx_setup_l1d_flush() to vt_init().
> > I hesitated to remove the static from vmx_setup_l1d_flush().
> I meant this one:
> "Must be called after common x86 init so enable_ept is properly set up"
>
> Not necessary to move vmx_setup_l1d_flush().

Ah, you mean only the first sentence. OK, I'll move it.
--
Isaku Yamahata <[email protected]>

2024-03-12 14:51:22

by Binbin Wu

Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add new members to struct kvm_gfn_range to indicate which mapping
> (private-vs-shared) to operate on: only_private and only_shared. Update
> mmu notifier, set memory attributes ioctl or KVM gmem callback to
> initialize them.
>
> It was premature for set_memory_attributes ioctl to call
> kvm_unmap_gfn_range(). Instead, let kvm_arch_ste_memory_attributes()
"kvm_arch_ste_memory_attributes()" -> "kvm_vm_set_mem_attributes()" ?


> handle it and add a new x86 vendor callback to react to memory attribute
> change. [1]
Which new x86 vendor callback?


>
> - If it's from the mmu notifier, zap shared pages only
> - If it's from the KVM gmem, zap private pages only
> - If setting memory attributes, vendor callback checks new attributes
> and make decisions.
> SNP would do nothing and handle it later with gmem callback
> TDX callback would do as follows.
> When it converts pages to shared, zap private pages only.
> When it converts pages to private, zap shared pages only.
>
> TDX needs to know which mapping to operate on. Shared-EPT vs. Secure-EPT.
> The following sequence to convert the GPA to private doesn't work for TDX
> because the page can already be private.
>
> 1) Update memory attributes to private in memory attributes xarray
> 2) Zap the GPA range irrespective of private-or-shared.
> Even if the page is already private, zap the entry.
> 3) EPT violation on the GPA
> 4) Populate the GPA as private
> The page is zeroed, and the guest has to accept the page again.
>
> In step 2, TDX wants to zap only shared pages and skip private ones.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> Changes v18:
> - rebased to kvm-next
>
> Changes v2 -> v3:
> - Drop the KVM_GFN_RANGE flags
> - Updated struct kvm_gfn_range
> - Change kvm_arch_set_memory_attributes() to return bool for flush
> - Added set_memory_attributes x86 op for vendor backends
> - Refined commit message to describe TDX care concretely
>
> Changes v1 -> v2:
> - consolidate KVM_GFN_RANGE_FLAGS_GMEM_{PUNCH_HOLE, RELEASE} into
> KVM_GFN_RANGE_FLAGS_GMEM.
> - Update the commit message to describe TDX more. Drop SEV_SNP.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> include/linux/kvm_host.h | 2 ++
> virt/kvm/guest_memfd.c | 3 +++
> virt/kvm/kvm_main.c | 17 +++++++++++++++++
> 3 files changed, 22 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..0520cd8d03cc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -264,6 +264,8 @@ struct kvm_gfn_range {
> gfn_t start;
> gfn_t end;
> union kvm_mmu_notifier_arg arg;
> + bool only_private;
> + bool only_shared;

IMO, an enum will be clearer than the two flags.

    enum {
        PROCESS_PRIVATE_AND_SHARED,
        PROCESS_ONLY_PRIVATE,
        PROCESS_ONLY_SHARED,
    };


> bool may_block;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 0f4e0cf4f158..3830d50b9b67 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -64,6 +64,9 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> .slot = slot,
> .may_block = true,
> + /* guest memfd is relevant to only private mappings. */
> + .only_private = true,
> + .only_shared = false,
> };
>
> if (!found_memslot) {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 10bfc88a69f7..0349e1f241d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -634,6 +634,12 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> */
> gfn_range.arg = range->arg;
> gfn_range.may_block = range->may_block;
> + /*
> + * HVA-based notifications aren't relevant to private
> + * mappings as they don't have a userspace mapping.
> + */
> + gfn_range.only_private = false;
> + gfn_range.only_shared = true;
>
> /*
> * {gfn(page) | page intersects with [hva_start, hva_end)} =
> @@ -2486,6 +2492,16 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
> gfn_range.arg = range->arg;
> gfn_range.may_block = range->may_block;
>
> + /*
> + * If/when KVM supports more attributes beyond private .vs shared, this
> + * _could_ set only_{private,shared} appropriately if the entire target
> + * range already has the desired private vs. shared state (it's unclear
> + * if that is a net win). For now, KVM reaches this point if and only
> + * if the private flag is being toggled, i.e. all mappings are in play.
> + */
> + gfn_range.only_private = false;
> + gfn_range.only_shared = false;
> +
> for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
>
> @@ -2542,6 +2558,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> struct kvm_mmu_notifier_range pre_set_range = {
> .start = start,
> .end = end,
> + .arg.attributes = attributes,
> .handler = kvm_pre_set_memory_attributes,
> .on_lock = kvm_mmu_invalidate_begin,
> .flush_on_ret = true,


2024-03-13 00:44:38

by Edgecombe, Rick P

Subject: Re: [PATCH v19 032/130] KVM: TDX: Add helper functions to allocate/free TDX private host key id

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add helper functions to allocate/free TDX private host key id (HKID).
>
> The memory controller encrypts TDX memory with the assigned TDX
> HKIDs.  The
> global TDX HKID is to encrypt the TDX module, its memory, and some
> dynamic
> data (TDR). 

I don't see any code about the global key id.

> The private TDX HKID is assigned to guest TD to encrypt guest
> memory and the related data.  When VMM releases an encrypted page for
> reuse, the page needs a cache flush with the used HKID.

Not sure the cache part is pertinent to this patch. Sounds good for
some other patch.

>   VMM needs the
> global TDX HKID and the private TDX HKIDs to flush encrypted pages.

I think the commit log could have a bit more about what code is added.
What about adding something like this (some verbiage from Kai's setup
patch):

The memory controller encrypts TDX memory with the assigned TDX
HKIDs. Each TDX guest must be protected by its own unique TDX HKID.

The HW has a fixed set of these HKID keys. Out of those, some are set
aside for use by other TDX components, but most are saved for guest
use. The code that does this partitioning records the range chosen to
be available for guest use in the tdx_guest_keyid_start and
tdx_nr_guest_keyids variables.

Use this range of HKIDs reserved for guest use with the kernel's IDA
allocator library helper to create a mini TDX HKID allocator that can
be called when setting up a TD. This way it can have an exclusive HKID,
as is required. This allocator will be used in future changes.


>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - Removed stale comment in tdx_guest_keyid_alloc() by Binbin
> - Update sanity check in tdx_guest_keyid_free() by Binbin
>
> v18:
> - Moved the functions to kvm tdx from arch/x86/virt/vmx/tdx/
> - Drop exporting symbols as the host tdx does.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
>  arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a7e096fd8361..cde971122c1e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -11,6 +11,34 @@
>  #undef pr_fmt
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
> +/*
> + * Key id globally used by TDX module: TDX module maps TDR with this
> TDX global
> + * key id.  TDR includes key id assigned to the TD.  Then TDX module
> maps other
> + * TD-related pages with the assigned key id.  TDR requires this TDX
> global key
> + * id for cache flush unlike other TD-related pages.
> + */

The above comment is about tdx_global_keyid, which is unrelated to the
patch and code.

> +/* TDX KeyID pool */
> +static DEFINE_IDA(tdx_guest_keyid_pool);
> +
> +static int __used tdx_guest_keyid_alloc(void)
> +{
> +       if (WARN_ON_ONCE(!tdx_guest_keyid_start ||
> !tdx_nr_guest_keyids))
> +               return -EINVAL;

I think the idea of this warning is to check whether TDX failed to init? It
could check X86_FEATURE_TDX_HOST_PLATFORM or enable_tdx, but that seems
to be a weird thing to check in a low-level function that is called in
the middle of in-progress setup.

Don't know, I'd probably drop this warning.

> +
> +       return ida_alloc_range(&tdx_guest_keyid_pool,
> tdx_guest_keyid_start,
> +                              tdx_guest_keyid_start +
> tdx_nr_guest_keyids - 1,
> +                              GFP_KERNEL);
> +}
> +
> +static void __used tdx_guest_keyid_free(int keyid)
> +{
> +       if (WARN_ON_ONCE(keyid < tdx_guest_keyid_start ||
> +                        keyid > tdx_guest_keyid_start +
> tdx_nr_guest_keyids - 1))
> +               return;

This seems like a more useful warning, but I'm still not sure it's that
risky. I guess the point is to check for returning garbage. A
double free would not be caught here, but it would be possible to catch one
using idr_find(). I would think that if we are worried we should do the full
check, but I'm not sure we can't just drop this. There are very limited
callers or things that change the checked configuration (one of each).

> +
> +       ida_free(&tdx_guest_keyid_pool, keyid);
> +}
> +
>  static int __init tdx_module_setup(void)
>  {
>         int ret;

2024-03-13 18:34:09

by Isaku Yamahata

Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Tue, Mar 12, 2024 at 09:33:31PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Add new members to struct kvm_gfn_range to indicate which mapping
> > (private-vs-shared) to operate on: only_private and only_shared. Update
> > mmu notifier, set memory attributes ioctl or KVM gmem callback to
> > initialize them.
> >
> > It was premature for set_memory_attributes ioctl to call
> > kvm_unmap_gfn_range(). Instead, let kvm_arch_ste_memory_attributes()
> "kvm_arch_ste_memory_attributes()" -> "kvm_vm_set_mem_attributes()" ?

Yes, will fix it.

> > handle it and add a new x86 vendor callback to react to memory attribute
> > change. [1]
> Which new x86 vendor callback?

Now we don't have it. Will drop this sentence.

> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 7e7fd25b09b3..0520cd8d03cc 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -264,6 +264,8 @@ struct kvm_gfn_range {
> > gfn_t start;
> > gfn_t end;
> > union kvm_mmu_notifier_arg arg;
> > + bool only_private;
> > + bool only_shared;
>
> IMO, an enum will be clearer than the two flags.
>
>     enum {
>         PROCESS_PRIVATE_AND_SHARED,
>         PROCESS_ONLY_PRIVATE,
>         PROCESS_ONLY_SHARED,
>     };

The code will be ugly like
"if (== PRIVATE || == PRIVATE_AND_SHARED)" or
"if (== SHARED || == PRIVATE_AND_SHARED)"

Two booleans (or two flags) are less error-prone.

Thanks,
--
Isaku Yamahata <[email protected]>

2024-03-13 18:46:10

by Isaku Yamahata

Subject: Re: [PATCH v19 021/130] KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_init()

On Wed, Mar 13, 2024 at 11:30:11PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > vmx_hardware_disable() accesses loaded_vmcss_on_cpu via
> > hardware_disable_all(). To allow hardware_enable/disable_all() before
> > kvm_init(), initialize it before kvm_x86_vendor_init() in vmx_init()
> > so that TDX module initialization, the hardware_setup method, can reference
> > the variable.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Yuan Yao <[email protected]>
>
> The shortlog should be this?
> KVM: VMX: Initialize loaded_vmcss_on_cpu in vmx_init()

Yes. I also will fix the shortlog in the next patch.
--
Isaku Yamahata <[email protected]>

2024-03-13 18:49:52

by Isaku Yamahata

Subject: Re: [PATCH v19 032/130] KVM: TDX: Add helper functions to allocate/free TDX private host key id

On Wed, Mar 13, 2024 at 12:44:14AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Add helper functions to allocate/free TDX private host key id (HKID).
> >
> > The memory controller encrypts TDX memory with the assigned TDX
> > HKIDs.  The
> > global TDX HKID is to encrypt the TDX module, its memory, and some
> > dynamic
> > data (TDR). 
>
> I don't see any code about the global key id.
>
> > The private TDX HKID is assigned to guest TD to encrypt guest
> > memory and the related data.  When VMM releases an encrypted page for
> > reuse, the page needs a cache flush with the used HKID.
>
> Not sure the cache part is pertinent to this patch. Sounds good for
> some other patch.
>
> >   VMM needs the
> > global TDX HKID and the private TDX HKIDs to flush encrypted pages.
>
> I think the commit log could have a bit more about what code is added.
> What about adding something like this (some verbiage from Kai's setup
> patch):
>
> The memory controller encrypts TDX memory with the assigned TDX
> HKIDs. Each TDX guest must be protected by its own unique TDX HKID.
>
> The HW has a fixed set of these HKID keys. Out of those, some are set
> aside for use by other TDX components, but most are saved for guest
> use. The code that does this partitioning records the range chosen to
> be available for guest use in the tdx_guest_keyid_start and
> tdx_nr_guest_keyids variables.
>
> Use this range of HKIDs reserved for guest use with the kernel's IDA
> allocator library helper to create a mini TDX HKID allocator that can
> be called when setting up a TD. This way it can have an exclusive HKID,
> as is required. This allocator will be used in future changes.
>
>
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - Removed stale comment in tdx_guest_keyid_alloc() by Binbin
> > - Update sanity check in tdx_guest_keyid_free() by Binbin
> >
> > v18:
> > - Moved the functions to kvm tdx from arch/x86/virt/vmx/tdx/
> > - Drop exporting symbols as the host tdx does.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++++++++++++++++
> >  1 file changed, 28 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index a7e096fd8361..cde971122c1e 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -11,6 +11,34 @@
> >  #undef pr_fmt
> >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> >  
> > +/*
> > + * Key id globally used by TDX module: TDX module maps TDR with this
> > + * TDX global key id.  TDR includes key id assigned to the TD.  Then
> > + * TDX module maps other TD-related pages with the assigned key id.
> > + * TDR requires this TDX global key id for cache flush unlike other
> > + * TD-related pages.
> > + */
>
> The above comment is about tdx_global_keyid, which is unrelated to the
> patch and code.

Will delete this comment as it was moved into the host tdx patch series.

>
> > +/* TDX KeyID pool */
> > +static DEFINE_IDA(tdx_guest_keyid_pool);
> > +
> > +static int __used tdx_guest_keyid_alloc(void)
> > +{
> > +       if (WARN_ON_ONCE(!tdx_guest_keyid_start || !tdx_nr_guest_keyids))
> > +               return -EINVAL;
>
> I think the idea of these warnings is to check if TDX failed to init? It
> could check X86_FEATURE_TDX_HOST_PLATFORM or enable_tdx, but that seems
> to be a weird thing to check in a low level function that is called in
> the middle of in-progress setup.
>
> Don't know, I'd probably drop this warning.
>
> > +
> > +       return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start,
> > +                              tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
> > +                              GFP_KERNEL);
> > +}
> > +
> > +static void __used tdx_guest_keyid_free(int keyid)
> > +{
> > +       if (WARN_ON_ONCE(keyid < tdx_guest_keyid_start ||
> > +                        keyid > tdx_guest_keyid_start + tdx_nr_guest_keyids - 1))
> > +               return;
>
> This seems like a more useful warning, but still not sure it's that
> risky. I guess the point is to check for returning garbage. A double
> free would not be caught here, but could be caught using idr_find(). I
> would think if we are worried we should do the full check, but I'm not
> sure we can't just drop this. There are very limited callers or things
> that change the checked configuration (1 of each).

The related code is stable now and I haven't hit them recently. I'll drop
both of them.
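The helpers under discussion hand out exclusive HKIDs from the range [tdx_guest_keyid_start, tdx_guest_keyid_start + tdx_nr_guest_keyids - 1] via the kernel's IDA library. As a rough user-space sketch of that contract, with the warnings dropped as agreed above (a bitmap stands in for ida_alloc_range(); all names and range bounds here are illustrative, not the kernel code):

```c
#include <stdbool.h>

/*
 * User-space stand-in for the guest HKID allocator: every caller gets
 * an exclusive key id from [KEYID_START, KEYID_START + NR_KEYIDS - 1],
 * and freed ids become available again.
 */
#define KEYID_START 10	/* stand-in for tdx_guest_keyid_start */
#define NR_KEYIDS   4	/* stand-in for tdx_nr_guest_keyids */

static bool keyid_used[NR_KEYIDS];

static int guest_keyid_alloc(void)
{
	for (int i = 0; i < NR_KEYIDS; i++) {
		if (!keyid_used[i]) {
			keyid_used[i] = true;
			return KEYID_START + i;
		}
	}
	return -1;	/* pool exhausted, akin to ida_alloc_range() failing */
}

static void guest_keyid_free(int keyid)
{
	/* Out-of-range ids are silently ignored in this sketch. */
	if (keyid >= KEYID_START && keyid < KEYID_START + NR_KEYIDS)
		keyid_used[keyid - KEYID_START] = false;
}
```

The real code gets the lowest-free-id and range semantics from the IDA library rather than a hand-rolled bitmap; the sketch only illustrates the exclusivity guarantee each TD relies on.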
--
Isaku Yamahata <[email protected]>

2024-03-13 20:17:06

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 021/130] KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_init()



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> vmx_hardware_disable() accesses loaded_vmcss_on_cpu via
> hardware_disable_all(). To allow hardware_enable/disable_all() before
> kvm_init(), initialize it before kvm_x86_vendor_init() in vmx_init()
> so that TDX module initialization, the hardware_setup method, can
> reference the variable.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Yuan Yao <[email protected]>

The shortlog should be this?
KVM: VMX: Initialize loaded_vmcss_on_cpu in vmx_init()

Others,
Reviewed-by: Binbin Wu <[email protected]>

>
> ---
> v19:
> - Fix the subject to match the patch by Yuan
>
> v18:
> - Move the vmcss_on_cpu initialization from vmx_hardware_setup() to
> early point of vmx_init() by Binbin
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 434f5aaef030..8af0668e4dca 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8539,6 +8539,10 @@ static int __init vmx_init(void)
> */
> hv_init_evmcs();
>
> + /* vmx_hardware_disable() accesses loaded_vmcss_on_cpu. */
> + for_each_possible_cpu(cpu)
> + INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> +
> r = kvm_x86_vendor_init(&vt_init_ops);
> if (r)
> return r;
> @@ -8554,11 +8558,8 @@ static int __init vmx_init(void)
> if (r)
> goto err_l1d_flush;
>
> - for_each_possible_cpu(cpu) {
> - INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> -
> + for_each_possible_cpu(cpu)
> pi_init_cpu(cpu);
> - }
>
> cpu_emergency_register_virt_callback(vmx_emergency_disable);
>


2024-03-13 20:53:04

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For private GPA, CPU refers to a private page table whose contents are
> encrypted.  The dedicated APIs to operate on it (e.g. updating/reading
> its PTE entry) are used and their cost is expensive.
>
> When KVM resolves KVM page fault, it walks the page tables.  To reuse
> the existing KVM MMU code and mitigate the heavy cost to directly walk
> the private page table, allocate one more page to copy the dummy page
> table for KVM MMU code to directly walk.  Resolve KVM page fault with
> the existing code, and do additional operations necessary for the
> private page table.

> To distinguish such cases, the existing KVM page table is called a
> shared page table (i.e. not associated with private page table), and
> the page table with private page table is called a private page table.

This makes it sound like the dummy page table for the private alias is
also called a shared page table, but in the drawing below it looks like
only the shared alias is called "shared PT".

>   The relationship is depicted below.
>
> Add a private pointer to struct kvm_mmu_page for private page table and
> add helper functions to allocate/initialize/free a private page table
> page.
>
>               KVM page fault                     |
>                      |                           |
>                      V                           |
>         -------------+----------                 |
>         |                      |                 |
>         V                      V                 |
>      shared GPA           private GPA            |
>         |                      |                 |
>         V                      V                 |
>     shared PT root      dummy PT root            |    private PT root
>         |                      |                 |           |
>         V                      V                 |           V
>      shared PT            dummy PT ----propagate---->   private PT
>         |                      |                 |           |
>         |                      \-----------------+------\    |
>         |                                        |      |    |
>         V                                        |      V    V
>   shared guest page                              |    private guest page
>                                                  |
>                            non-encrypted memory  |    encrypted memory
>                                                  |
> PT: page table
> - Shared PT is visible to KVM and it is used by CPU.
> - Private PT is used by CPU but it is invisible to KVM.
> - Dummy PT is visible to KVM but not used by CPU.  It is used to
>   propagate PT change to the actual private PT which is used by CPU.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> ---
> v19:
> - typo in the comment in kvm_mmu_alloc_private_spt()
> - drop CONFIG_KVM_MMU_PRIVATE
> ---
>  arch/x86/include/asm/kvm_host.h |  5 +++
>  arch/x86/kvm/mmu/mmu.c          |  7 ++++
>  arch/x86/kvm/mmu/mmu_internal.h | 63 ++++++++++++++++++++++++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  4 files changed, 72 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dcc6f7c38a83..efd3fda1c177 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -825,6 +825,11 @@ struct kvm_vcpu_arch {
>         struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>         struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
>         struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       /*
> +        * This cache is to allocate private page table. E.g. Secure-EPT
> +        * used by the TDX module.
> +        */
> +       struct kvm_mmu_memory_cache mmu_private_spt_cache;
>  
>         /*
>          * QEMU userspace and the guest each have their own FPU
> state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index eeebbc67e42b..0d6d4506ec97 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>         if (r)
>                 return r;
> +       if (kvm_gfn_shared_mask(vcpu->kvm)) {
> +               r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
> +                                              PT64_ROOT_MAX_LEVEL);
> +               if (r)
> +                       return r;
> +       }
>         r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
>                                        PT64_ROOT_MAX_LEVEL);
>         if (r)
> @@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> +       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index e3f54701f98d..002f3f80bf3b 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -101,7 +101,21 @@ struct kvm_mmu_page {
>                 int root_count;
>                 refcount_t tdp_mmu_root_count;
>         };
> -       unsigned int unsync_children;
> +       union {
> +               struct {
> +                       unsigned int unsync_children;
> +                       /*
> +                        * Number of writes since the last time
> +                        * traversal visited this page.
> +                        */
> +                       atomic_t write_flooding_count;
> +               };

I think the point of putting these in a union is that they only apply
to shadow paging and so can't be used with TDX. I think you are putting
more than the sizeof(void *) in there as there are multiple in the same
category. But there seems to be a new one added, *shadowed_translation.
Should it go in there too? Is the union because there wasn't room
before, or just to be tidy?

I think the commit log should have more discussion of this union and
maybe a comment in the struct to explain the purpose of the
organization. Can you explain the reasoning now for the sake of
discussion?

> +               /*
> +                * Associated private shadow page table, e.g. Secure-EPT
> +                * page passed to the TDX module.
> +                */
> +               void *private_spt;
> +       };
>         union {
>                 struct kvm_rmap_head parent_ptes; /* rmap pointers to
> parent sptes */
>                 tdp_ptep_t ptep;
> @@ -124,9 +138,6 @@ struct kvm_mmu_page {
>         int clear_spte_count;
>  #endif
>  
> -       /* Number of writes since the last time traversal visited this page.  */
> -       atomic_t write_flooding_count;
> -
>  #ifdef CONFIG_X86_64
>         /* Used for freeing the page asynchronously if it is a TDP MMU page. */
>         struct rcu_head rcu_head;
> @@ -150,6 +161,50 @@ static inline bool is_private_sp(const struct kvm_mmu_page *sp)
>         return kvm_mmu_page_role_is_private(sp->role);
>  }
>  
> +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
> +{
> +       return sp->private_spt;
> +}
> +
> +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
> +{
> +       sp->private_spt = private_spt;
> +}
> +
> +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> +       bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
> +
> +       KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
> +       if (is_root)
> +               /*
> +                * Because TDX module assigns root Secure-EPT page and set it to
> +                * Secure-EPTP when TD vcpu is created, secure page table for
> +                * root isn't needed.
> +                */
> +               sp->private_spt = NULL;
> +       else {
> +               /*
> +                * Because the TDX module doesn't trust VMM and initializes
> +                * the pages itself, KVM doesn't initialize them.  Allocate
> +                * pages with garbage and give them to the TDX module.
> +                */
> +               sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> +               /*
> +                * Because mmu_private_spt_cache is topped up before starting
> +                * kvm page fault resolving, the allocation above shouldn't
> +                * fail.
> +                */
> +               WARN_ON_ONCE(!sp->private_spt);

There is already a BUG_ON() for the allocation failure in
kvm_mmu_memory_cache_alloc().

> +       }
> +}
> +
> +static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
> +{
> +       if (sp->private_spt)

free_page() can accept NULL, so the above check is unneeded.

> +               free_page((unsigned long)sp->private_spt);
> +}
> +
>  static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
>  {
>         /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 87233b3ceaef..d47f0daf1b03 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  
>  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
> +       kvm_mmu_free_private_spt(sp);

This particular memcache zeros the allocations, so it is safe to free
this regardless of whether sp->private_spt has been set, even though the
allocation caller is not in place yet. It would be nice to add this
detail in the log.

>         free_page((unsigned long)sp->spt);
>         kmem_cache_free(mmu_page_header_cache, sp);
>  }
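The union question raised above can be illustrated in isolation: the shadow-MMU-only bookkeeping fields are never used for a TDX page, so they can share storage with the private_spt pointer. A reduced user-space sketch (field names mirror the patch, but the rest of struct kvm_mmu_page is omitted and atomic_int stands in for the kernel's atomic_t):

```c
#include <stdatomic.h>
#include <stddef.h>

/*
 * Reduced model of the union added to struct kvm_mmu_page: shadow-MMU
 * bookkeeping overlays the Secure-EPT pointer used for TDX pages, so
 * TDX support costs no extra space in the struct.
 */
struct mmu_page_sketch {
	union {
		struct {
			unsigned int unsync_children;
			/* Writes since traversal last visited this page. */
			atomic_int write_flooding_count;
		};
		/* Secure-EPT page passed to the TDX module. */
		void *private_spt;
	};
};
```

On LP64 the two-field struct and the pointer are both 8 bytes, so the union is free; whether *all* shadow-only fields (e.g. shadowed_translation) should also live in the union is exactly the open question in the review above.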

2024-03-14 04:31:53

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX requires several initialization steps for KVM to create guest TDs.
> Detect CPU feature, enable VMX (TDX is based on VMX) on all online CPUs,
> detect the TDX module availability, initialize it and disable VMX.
>
> To enable/disable VMX on all online CPUs, utilize
> vmx_hardware_enable/disable(). The method also initializes each CPU for
> TDX. TDX requires calling a TDX initialization function per logical
> processor (LP) before the LP uses TDX. When a CPU is brought online,
> call the TDX LP initialization API. If it fails to initialize TDX, refuse
> to online the CPU for simplicity instead of having TDX avoid the failed LP.
>
> There are several options on when to initialize the TDX module. A.) kernel
> module loading time, B.) the first guest TD creation time. A.) was chosen.
> With B.), a user may hit an error of the TDX initialization when trying to
> create the first guest TD. The machine that fails to initialize the TDX
> module can't boot any guest TD further. Such failure is undesirable and a
> surprise because the user expects that the machine can accommodate guest
> TD, but not. So A.) is better than B.).
>
> Introduce a module parameter, kvm_intel.tdx, to explicitly enable TDX KVM
> support. It's off by default to keep the same behavior for those who don't
> use TDX. Implement hardware_setup method to detect TDX feature of CPU and
> initialize TDX module.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - fixed vt_hardware_enable() to use vmx_hardware_enable()
> - renamed vmx_tdx_enabled => tdx_enabled
> - renamed vmx_tdx_on() => tdx_on()
>
> v18:
> - Added comment in vt_hardware_enable() by Binbin.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/Makefile | 1 +
> arch/x86/kvm/vmx/main.c | 19 ++++++++-
> arch/x86/kvm/vmx/tdx.c | 84 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 6 +++
> 4 files changed, 109 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/kvm/vmx/tdx.c
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 274df24b647f..5b85ef84b2e9 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,6 +24,7 @@ kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>
> kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
> kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o
>
> kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
> svm/sev.o
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 18cecf12c7c8..18aef6e23aab 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -6,6 +6,22 @@
> #include "nested.h"
> #include "pmu.h"
>
> +static bool enable_tdx __ro_after_init;
> +module_param_named(tdx, enable_tdx, bool, 0444);
> +
> +static __init int vt_hardware_setup(void)
> +{
> + int ret;
> +
> + ret = vmx_hardware_setup();
> + if (ret)
> + return ret;
> +
> + enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> +
> + return 0;
> +}
> +
> #define VMX_REQUIRED_APICV_INHIBITS \
> (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .hardware_unsetup = vmx_hardware_unsetup,
>
> + /* TDX cpu enablement is done by tdx_hardware_setup(). */

What about LPs that are offline at this point?
In tdx_hardware_setup(), only online LPs are initialized for TDX, right?
Then when an offline LP becomes online, it doesn't get a chance to call
tdx_cpu_enable()?

> .hardware_enable = vmx_hardware_enable,
> .hardware_disable = vmx_hardware_disable,
> .has_emulated_msr = vmx_has_emulated_msr,
> @@ -161,7 +178,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> - .hardware_setup = vmx_hardware_setup,
> + .hardware_setup = vt_hardware_setup,
> .handle_intel_pt_intr = NULL,
>
> .runtime_ops = &vt_x86_ops,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> new file mode 100644
> index 000000000000..43c504fb4fed
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpu.h>
> +
> +#include <asm/tdx.h>
> +
> +#include "capabilities.h"
> +#include "x86_ops.h"
> +#include "x86.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +static int __init tdx_module_setup(void)
> +{
> + int ret;
> +
> + ret = tdx_enable();
> + if (ret) {
> + pr_info("Failed to initialize TDX module.\n");
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +struct tdx_enabled {
> + cpumask_var_t enabled;
> + atomic_t err;
> +};
> +
> +static void __init tdx_on(void *_enable)
> +{
> + struct tdx_enabled *enable = _enable;
> + int r;
> +
> + r = vmx_hardware_enable();
> + if (!r) {
> + cpumask_set_cpu(smp_processor_id(), enable->enabled);
> + r = tdx_cpu_enable();
> + }
> + if (r)
> + atomic_set(&enable->err, r);
> +}
> +
> +static void __init vmx_off(void *_enabled)
> +{
> + cpumask_var_t *enabled = (cpumask_var_t *)_enabled;
> +
> + if (cpumask_test_cpu(smp_processor_id(), *enabled))
> + vmx_hardware_disable();
> +}
> +
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> + struct tdx_enabled enable = {
> + .err = ATOMIC_INIT(0),
> + };
> + int r = 0;
> +
> + if (!enable_ept) {
> + pr_warn("Cannot enable TDX with EPT disabled\n");
> + return -EINVAL;
> + }
> +
> + if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
> + r = -ENOMEM;
> + goto out;
> + }
> +
> + /* tdx_enable() in tdx_module_setup() requires cpus lock. */
> + cpus_read_lock();
> + on_each_cpu(tdx_on, &enable, true); /* TDX requires vmxon. */
> + r = atomic_read(&enable.err);
> + if (!r)
> + r = tdx_module_setup();
> + else
> + r = -EIO;
> + on_each_cpu(vmx_off, &enable.enabled, true);
> + cpus_read_unlock();
> + free_cpumask_var(enable.enabled);
> +
> +out:
> + return r;
> +}
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index b936388853ab..346289a2a01c 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -135,4 +135,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
> #endif
> void vmx_setup_mce(struct kvm_vcpu *vcpu);
>
> +#ifdef CONFIG_INTEL_TDX_HOST
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +#else
> +static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
> +#endif
> +
> #endif /* __KVM_X86_VMX_X86_OPS_H */
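The tdx_hardware_setup() flow quoted above enables VMX on every online CPU and folds any per-CPU failure into one error field that the caller checks once afterwards. A simplified user-space sketch of that aggregation pattern (sequential here for clarity; the kernel runs tdx_on() concurrently via on_each_cpu(), which is why it records the error in an atomic; the failing CPU and the error value below are made up):

```c
/*
 * Sketch of the enable-on-each-CPU-then-check-once pattern from
 * tdx_hardware_setup().  fake_enable() stands in for the per-CPU
 * vmx_hardware_enable() + tdx_cpu_enable() step.
 */
#define NR_FAKE_CPUS 4
#define FAKE_EIO (-5)	/* illustrative error value */

static int fake_enable(int cpu)
{
	/* Pretend CPU 2 fails its enable step. */
	return cpu == 2 ? FAKE_EIO : 0;
}

static int run_on_each_cpu(void)
{
	int err = 0;

	for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
		int r = fake_enable(cpu);

		if (r)
			err = r;	/* remember a failure, like enable->err */
	}
	return err;	/* caller checks once, as after on_each_cpu() */
}
```

Binbin's question stands regardless of this pattern: it only covers CPUs online at setup time, so a later-onlined CPU needs its own tdx_cpu_enable() path (e.g. from the hardware_enable hook).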


2024-03-14 06:29:28

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 025/130] KVM: TDX: Make TDX VM type supported



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> NOTE: This patch is placed at this position in the patch series so that
> developers can test the code during the middle of the series, although
> the series doesn't provide functional features until all of its patches
> are applied. When merging this patch series, this patch can be moved to
> the end.

Maybe at this point, you can consider moving this patch to the end?

>
> As the first step of TDX VM support, report the TDX VM type as supported
> to the device model, e.g. qemu. The callback to create a guest TD is the
> vm_init callback for KVM_CREATE_VM.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 18 ++++++++++++++++--
> arch/x86/kvm/vmx/tdx.c | 6 ++++++
> arch/x86/kvm/vmx/vmx.c | 6 ------
> arch/x86/kvm/vmx/x86_ops.h | 3 ++-
> 4 files changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index e11edbd19e7c..fa19682b366c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -10,6 +10,12 @@
> static bool enable_tdx __ro_after_init;
> module_param_named(tdx, enable_tdx, bool, 0444);
>
> +static bool vt_is_vm_type_supported(unsigned long type)
> +{
> + return __kvm_is_vm_type_supported(type) ||
> + (enable_tdx && tdx_is_vm_type_supported(type));
> +}
> +
> static __init int vt_hardware_setup(void)
> {
> int ret;
> @@ -26,6 +32,14 @@ static __init int vt_hardware_setup(void)
> return 0;
> }
>
> +static int vt_vm_init(struct kvm *kvm)
> +{
> + if (is_td(kvm))
> + return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
> +
> + return vmx_vm_init(kvm);
> +}
> +
> #define VMX_REQUIRED_APICV_INHIBITS \
> (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -47,9 +61,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .hardware_disable = vmx_hardware_disable,
> .has_emulated_msr = vmx_has_emulated_msr,
>
> - .is_vm_type_supported = vmx_is_vm_type_supported,
> + .is_vm_type_supported = vt_is_vm_type_supported,
> .vm_size = sizeof(struct kvm_vmx),
> - .vm_init = vmx_vm_init,
> + .vm_init = vt_vm_init,
> .vm_destroy = vmx_vm_destroy,
>
> .vcpu_precreate = vmx_vcpu_precreate,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 14ef0ccd8f1a..a7e096fd8361 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -24,6 +24,12 @@ static int __init tdx_module_setup(void)
> return 0;
> }
>
> +bool tdx_is_vm_type_supported(unsigned long type)
> +{
> + /* enable_tdx check is done by the caller. */
> + return type == KVM_X86_TDX_VM;
> +}
> +
> struct tdx_enabled {
> cpumask_var_t enabled;
> atomic_t err;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2fb1cd2e28a2..d928acc15d0f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7531,12 +7531,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> return err;
> }
>
> -bool vmx_is_vm_type_supported(unsigned long type)
> -{
> - /* TODO: Check if TDX is supported. */
> - return __kvm_is_vm_type_supported(type);
> -}
> -
> #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
> #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 346289a2a01c..f4da88a228d0 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -28,7 +28,6 @@ void vmx_hardware_unsetup(void);
> int vmx_check_processor_compat(void);
> int vmx_hardware_enable(void);
> void vmx_hardware_disable(void);
> -bool vmx_is_vm_type_supported(unsigned long type);
> int vmx_vm_init(struct kvm *kvm);
> void vmx_vm_destroy(struct kvm *kvm);
> int vmx_vcpu_precreate(struct kvm *kvm);
> @@ -137,8 +136,10 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
>
> #ifdef CONFIG_INTEL_TDX_HOST
> int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +bool tdx_is_vm_type_supported(unsigned long type);
> #else
> static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
> +static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> #endif
>
> #endif /* __KVM_X86_VMX_X86_OPS_H */


2024-03-14 08:32:53

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 019/130] KVM: x86: Add is_vm_type_supported callback

>-static bool kvm_is_vm_type_supported(unsigned long type)
>+bool __kvm_is_vm_type_supported(unsigned long type)
> {
> return type == KVM_X86_DEFAULT_VM ||
> (type == KVM_X86_SW_PROTECTED_VM &&
> IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);

maybe just do:
switch (type) {
case KVM_X86_DEFAULT_VM:
return true;
case KVM_X86_SW_PROTECTED_VM:
return IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled;
default:
return static_call(kvm_x86_is_vm_type_supported)(type);
}

There are two benefits
1) switch/case improves readability a little.
2) no need to expose __kvm_is_vm_type_supported()


> }
>+EXPORT_SYMBOL_GPL(__kvm_is_vm_type_supported);

>+
>+static bool kvm_is_vm_type_supported(unsigned long type)
>+{
>+ return static_call(kvm_x86_is_vm_type_supported)(type);
>+}
>
> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> {
>@@ -4784,6 +4790,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> r = BIT(KVM_X86_DEFAULT_VM);
> if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
> r |= BIT(KVM_X86_SW_PROTECTED_VM);
>+ if (kvm_is_vm_type_supported(KVM_X86_TDX_VM))
>+ r |= BIT(KVM_X86_TDX_VM);
>+ if (kvm_is_vm_type_supported(KVM_X86_SNP_VM))
>+ r |= BIT(KVM_X86_SNP_VM);

maybe use a for-loop?
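Both suggestions can be sketched together in user-space C. The VM type values and the vendor hook below are stand-ins for the real KVM definitions and the static_call, not the actual uAPI constants:

```c
/*
 * Sketch of the two suggestions: a switch in the type-support check,
 * and a for-loop to build the supported-types bitmask returned for the
 * capability query.
 */
enum {
	FAKE_DEFAULT_VM,
	FAKE_SW_PROTECTED_VM,
	FAKE_TDX_VM,
	FAKE_SNP_VM,
	FAKE_NR_VM_TYPES,
};

static int fake_tdp_enabled = 1;

/* Stand-in for static_call(kvm_x86_is_vm_type_supported)(type). */
static int vendor_is_vm_type_supported(unsigned long type)
{
	return type == FAKE_TDX_VM;	/* pretend only TDX vendor support */
}

static int is_vm_type_supported(unsigned long type)
{
	switch (type) {
	case FAKE_DEFAULT_VM:
		return 1;
	case FAKE_SW_PROTECTED_VM:
		return fake_tdp_enabled;
	default:
		return vendor_is_vm_type_supported(type);
	}
}

/* The for-loop variant of building the supported-VM-types bitmask. */
static unsigned int supported_vm_types(void)
{
	unsigned int r = 0;

	for (unsigned long type = 0; type < FAKE_NR_VM_TYPES; type++)
		if (is_vm_type_supported(type))
			r |= 1u << type;
	return r;
}
```

With the defaults above, default, SW-protected, and TDX types are reported (bits 0-2), and neither suggestion requires exporting a double-underscore helper.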

2024-03-14 08:46:58

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add placeholder TDX VM/vcpu structures that overlay the VMX VM/vcpu
> structures. Initialize VM structure size and vcpu size/align so that x86
> KVM common code knows those sizes irrespective of VMX or TDX. Those
> structures will be populated as guest creation logic develops.
>
> Add helper functions to check if the VM is guest TD and add conversion
> functions between KVM VM/VCPU and TDX VM/VCPU.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v19:
> - correctly update ops.vm_size, vcpu_size and, vcpu_align by Xiaoyao
>
> v14 -> v15:
> - use KVM_X86_TDX_VM
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 14 ++++++++++++
> arch/x86/kvm/vmx/tdx.c | 1 +
> arch/x86/kvm/vmx/tdx.h | 50 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 65 insertions(+)
> create mode 100644 arch/x86/kvm/vmx/tdx.h
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 18aef6e23aab..e11edbd19e7c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -5,6 +5,7 @@
> #include "vmx.h"
> #include "nested.h"
> #include "pmu.h"
> +#include "tdx.h"
>
> static bool enable_tdx __ro_after_init;
> module_param_named(tdx, enable_tdx, bool, 0444);
> @@ -18,6 +19,9 @@ static __init int vt_hardware_setup(void)
> return ret;
>
> enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> + if (enable_tdx)
> + vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> + sizeof(struct kvm_tdx));
>
> return 0;
> }
> @@ -215,8 +219,18 @@ static int __init vt_init(void)
> * Common KVM initialization _must_ come last, after this, /dev/kvm is
> * exposed to userspace!
> */
> + /*
> + * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
> + * be set before kvm_x86_vendor_init().

The comment is not right?
In this patch, vt_x86_ops.vm_size is set in  vt_hardware_setup(),
which is called in kvm_x86_vendor_init().

Since kvm_x86_ops is updated by kvm_ops_update() with the fields of
vt_x86_ops. I guess you wanted to say vt_x86_ops.vm_size must be set
before kvm_ops_update()?

> + */
> vcpu_size = sizeof(struct vcpu_vmx);
> vcpu_align = __alignof__(struct vcpu_vmx);
> + if (enable_tdx) {
> + vcpu_size = max_t(unsigned int, vcpu_size,
> + sizeof(struct vcpu_tdx));
> + vcpu_align = max_t(unsigned int, vcpu_align,
> + __alignof__(struct vcpu_tdx));
> + }
> r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
> if (r)
> goto err_kvm_init;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 43c504fb4fed..14ef0ccd8f1a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,6 +6,7 @@
> #include "capabilities.h"
> #include "x86_ops.h"
> #include "x86.h"
> +#include "tdx.h"
>
> #undef pr_fmt
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> new file mode 100644
> index 000000000000..473013265bd8
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVM_X86_TDX_H
> +#define __KVM_X86_TDX_H
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +struct kvm_tdx {
> + struct kvm kvm;
> + /* TDX specific members follow. */
> +};
> +
> +struct vcpu_tdx {
> + struct kvm_vcpu vcpu;
> + /* TDX specific members follow. */
> +};
> +
> +static inline bool is_td(struct kvm *kvm)
> +{
> + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> +
> +static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
> +{
> + return is_td(vcpu->kvm);
> +}
> +
> +static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
> +{
> + return container_of(kvm, struct kvm_tdx, kvm);
> +}
> +
> +static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
> +{
> + return container_of(vcpu, struct vcpu_tdx, vcpu);
> +}
> +#else
> +struct kvm_tdx {
> + struct kvm kvm;
> +};
> +
> +struct vcpu_tdx {
> + struct kvm_vcpu vcpu;
> +};
> +
> +static inline bool is_td(struct kvm *kvm) { return false; }
> +static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
> +static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
> +static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
> +#endif /* CONFIG_INTEL_TDX_HOST */
> +
> +#endif /* __KVM_X86_TDX_H */


2024-03-14 12:16:56

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Define architectural definitions for KVM to issue the TDX SEAMCALLs.
>
> Structures and values that are architecturally defined in the TDX module
> specifications the chapter of ABI Reference.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Reviewed-by: Xiaoyao Li <[email protected]>
> ---
> v19:
> - drop tdvmcall constants by Xiaoyao
>
> v18:
> - Add metadata field id
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx_arch.h | 265 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 265 insertions(+)
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
>
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> new file mode 100644
> index 000000000000..e2c1a6f429d7
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -0,0 +1,265 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* architectural constants/data definitions for TDX SEAMCALLs */
> +
> +#ifndef __KVM_X86_TDX_ARCH_H
> +#define __KVM_X86_TDX_ARCH_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * TDX SEAMCALL API function leaves
> + */
> +#define TDH_VP_ENTER 0
> +#define TDH_MNG_ADDCX 1
> +#define TDH_MEM_PAGE_ADD 2
> +#define TDH_MEM_SEPT_ADD 3
> +#define TDH_VP_ADDCX 4
> +#define TDH_MEM_PAGE_RELOCATE 5
> +#define TDH_MEM_PAGE_AUG 6
> +#define TDH_MEM_RANGE_BLOCK 7
> +#define TDH_MNG_KEY_CONFIG 8
> +#define TDH_MNG_CREATE 9
> +#define TDH_VP_CREATE 10
> +#define TDH_MNG_RD 11
> +#define TDH_MR_EXTEND 16
> +#define TDH_MR_FINALIZE 17
> +#define TDH_VP_FLUSH 18
> +#define TDH_MNG_VPFLUSHDONE 19
> +#define TDH_MNG_KEY_FREEID 20
> +#define TDH_MNG_INIT 21
> +#define TDH_VP_INIT 22
> +#define TDH_MEM_SEPT_RD 25
> +#define TDH_VP_RD 26
> +#define TDH_MNG_KEY_RECLAIMID 27
> +#define TDH_PHYMEM_PAGE_RECLAIM 28
> +#define TDH_MEM_PAGE_REMOVE 29
> +#define TDH_MEM_SEPT_REMOVE 30
> +#define TDH_SYS_RD 34
> +#define TDH_MEM_TRACK 38
> +#define TDH_MEM_RANGE_UNBLOCK 39
> +#define TDH_PHYMEM_CACHE_WB 40
> +#define TDH_PHYMEM_PAGE_WBINVD 41
> +#define TDH_VP_WR 43
> +#define TDH_SYS_LP_SHUTDOWN 44
> +
> +/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> +#define TDX_NON_ARCH BIT_ULL(63)
> +#define TDX_CLASS_SHIFT 56
> +#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
> +
> +#define __BUILD_TDX_FIELD(non_arch, class, field) \
> + (((non_arch) ? TDX_NON_ARCH : 0) | \
> + ((u64)(class) << TDX_CLASS_SHIFT) | \
> + ((u64)(field) & TDX_FIELD_MASK))
> +
> +#define BUILD_TDX_FIELD(class, field) \
> + __BUILD_TDX_FIELD(false, (class), (field))
> +
> +#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
> + __BUILD_TDX_FIELD(true, (class), (field))
> +
> +
> +/* Class code for TD */
> +#define TD_CLASS_EXECUTION_CONTROLS 17ULL
> +
> +/* Class code for TDVPS */
> +#define TDVPS_CLASS_VMCS 0ULL
> +#define TDVPS_CLASS_GUEST_GPR 16ULL
> +#define TDVPS_CLASS_OTHER_GUEST 17ULL
> +#define TDVPS_CLASS_MANAGEMENT 32ULL
> +
> +enum tdx_tdcs_execution_control {
> + TD_TDCS_EXEC_TSC_OFFSET = 10,
> +};
> +
> +/* @field is any of enum tdx_tdcs_execution_control */
> +#define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
> +
> +/* @field is the VMCS field encoding */
> +#define TDVPS_VMCS(field) BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
> +
> +enum tdx_vcpu_guest_other_state {
> + TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
> +};
> +
> +union tdx_vcpu_state_details {
> + struct {
> + u64 vmxip : 1;
> + u64 reserved : 63;
> + };
> + u64 full;
> +};
> +
> +/* @field is any of enum tdx_guest_other_state */
> +#define TDVPS_STATE(field) BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
> +#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
> +
> +/* Management class fields */
> +enum tdx_vcpu_guest_management {
> + TD_VCPU_PEND_NMI = 11,
> +};
> +
> +/* @field is any of enum tdx_vcpu_guest_management */
> +#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
> +
> +#define TDX_EXTENDMR_CHUNKSIZE 256
> +
> +struct tdx_cpuid_value {
> + u32 eax;
> + u32 ebx;
> + u32 ecx;
> + u32 edx;
> +} __packed;
> +
> +#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
> +#define TDX_TD_ATTR_SEPT_VE_DISABLE BIT_ULL(28)
It's better to align the style of the naming.

Either use TDX_TD_ATTR_* or TDX_TD_ATTRIBUTE_*?

> +#define TDX_TD_ATTRIBUTE_PKS BIT_ULL(30)
> +#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
> +#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
> +
> +/*
> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> + */
> +#define TDX_MAX_VCPUS (~(u16)0)
> +
> +struct td_params {
> + u64 attributes;
> + u64 xfam;
> + u16 max_vcpus;
> + u8 reserved0[6];
> +
> + u64 eptp_controls;
> + u64 exec_controls;
> + u16 tsc_frequency;
> + u8 reserved1[38];
> +
> + u64 mrconfigid[6];
> + u64 mrowner[6];
> + u64 mrownerconfig[6];
> + u64 reserved2[4];
> +
> + union {
> + DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
> + u8 reserved3[768];
> + };
> +} __packed __aligned(1024);
> +
> +/*
> + * Guest uses MAX_PA for GPAW when set.
> + * 0: GPA.SHARED bit is GPA[47]
> + * 1: GPA.SHARED bit is GPA[51]
> + */
> +#define TDX_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
> +
> +/*
> + * TDH.VP.ENTER, TDG.VP.VMCALL preserves RBP
> + * 0: RBP can be used for TDG.VP.VMCALL input. RBP is clobbered.
> + * 1: RBP can't be used for TDG.VP.VMCALL input. RBP is preserved.
> + */
> +#define TDX_CONTROL_FLAG_NO_RBP_MOD BIT_ULL(2)
> +
> +
> +/*
> + * TDX requires the frequency to be defined in units of 25MHz, which is the
> + * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
> + * module can only program frequencies that are multiples of 25MHz. The
> + * frequency must be between 100mhz and 10ghz (inclusive).
> + */
> +#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
> +#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
> +#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
> +#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
> +
> +union tdx_sept_entry {
> + struct {
> + u64 r : 1;
> + u64 w : 1;
> + u64 x : 1;
> + u64 mt : 3;
> + u64 ipat : 1;
> + u64 leaf : 1;
> + u64 a : 1;
> + u64 d : 1;
> + u64 xu : 1;
> + u64 ignored0 : 1;
> + u64 pfn : 40;
> + u64 reserved : 5;
> + u64 vgp : 1;
> + u64 pwa : 1;
> + u64 ignored1 : 1;
> + u64 sss : 1;
> + u64 spp : 1;
> + u64 ignored2 : 1;
> + u64 sve : 1;
> + };
> + u64 raw;
> +};
> +
> +enum tdx_sept_entry_state {
> + TDX_SEPT_FREE = 0,
> + TDX_SEPT_BLOCKED = 1,
> + TDX_SEPT_PENDING = 2,
> + TDX_SEPT_PENDING_BLOCKED = 3,
> + TDX_SEPT_PRESENT = 4,
> +};
> +
> +union tdx_sept_level_state {
> + struct {
> + u64 level : 3;
> + u64 reserved0 : 5;
> + u64 state : 8;
> + u64 reserved1 : 48;
> + };
> + u64 raw;
> +};
> +
> +/*
> + * Global scope metadata field ID.
> + * See Table "Global Scope Metadata", TDX module 1.5 ABI spec.
> + */
> +#define MD_FIELD_ID_SYS_ATTRIBUTES 0x0A00000200000000ULL
> +#define MD_FIELD_ID_FEATURES0 0x0A00000300000008ULL
> +#define MD_FIELD_ID_ATTRS_FIXED0 0x1900000300000000ULL
> +#define MD_FIELD_ID_ATTRS_FIXED1 0x1900000300000001ULL
> +#define MD_FIELD_ID_XFAM_FIXED0 0x1900000300000002ULL
> +#define MD_FIELD_ID_XFAM_FIXED1 0x1900000300000003ULL
> +
> +#define MD_FIELD_ID_TDCS_BASE_SIZE 0x9800000100000100ULL
> +#define MD_FIELD_ID_TDVPS_BASE_SIZE 0x9800000100000200ULL
> +
> +#define MD_FIELD_ID_NUM_CPUID_CONFIG 0x9900000100000004ULL
> +#define MD_FIELD_ID_CPUID_CONFIG_LEAVES 0x9900000300000400ULL
> +#define MD_FIELD_ID_CPUID_CONFIG_VALUES 0x9900000300000500ULL
> +
> +#define MD_FIELD_ID_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> +
> +#define TDX_MAX_NR_CPUID_CONFIGS 37
> +
> +#define TDX_MD_ELEMENT_SIZE_8BITS 0
> +#define TDX_MD_ELEMENT_SIZE_16BITS 1
> +#define TDX_MD_ELEMENT_SIZE_32BITS 2
> +#define TDX_MD_ELEMENT_SIZE_64BITS 3
> +
> +union tdx_md_field_id {
> + struct {
> + u64 field : 24;
> + u64 reserved0 : 8;
> + u64 element_size_code : 2;
> + u64 last_element_in_field : 4;
> + u64 reserved1 : 3;
> + u64 inc_size : 1;
> + u64 write_mask_valid : 1;
> + u64 context : 3;
> + u64 reserved2 : 1;
> + u64 class : 6;
> + u64 reserved3 : 1;
> + u64 non_arch : 1;
> + };
> + u64 raw;
> +};
> +
> +#define TDX_MD_ELEMENT_SIZE_CODE(_field_id) \
> + ({ union tdx_md_field_id _fid = { .raw = (_field_id)}; \
> + _fid.element_size_code; })
> +
> +#endif /* __KVM_X86_TDX_ARCH_H */


2024-03-14 12:56:53

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 033/130] KVM: TDX: Add helper function to read TDX metadata in array



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> To read meta data in series, use table.
> Instead of metadata_read(fid0, &data0); metadata_read(...); ...
> table = { {fid0, &data0}, ...}; metadata-read(tables).
> TODO: Once the TDX host code introduces its framework to read TDX metadata,
> drop this patch and convert the code that uses this.

Do you mean patches 1-5 included in this patch set?
I think patches 1-5 of this patch set are doing this, right?

Since they are already there, I think you can use them directly in this
patch set instead of introducing this temporary code?

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v18:
> - newly added
> ---
> arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 45 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cde971122c1e..dce21f675155 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,6 +6,7 @@
> #include "capabilities.h"
> #include "x86_ops.h"
> #include "x86.h"
> +#include "tdx_arch.h"
> #include "tdx.h"
>
> #undef pr_fmt
> @@ -39,6 +40,50 @@ static void __used tdx_guest_keyid_free(int keyid)
> ida_free(&tdx_guest_keyid_pool, keyid);
> }
>
> +#define TDX_MD_MAP(_fid, _ptr) \
> + { .fid = MD_FIELD_ID_##_fid, \
> + .ptr = (_ptr), }
> +
> +struct tdx_md_map {
> + u64 fid;
> + void *ptr;
> +};
> +
> +static size_t tdx_md_element_size(u64 fid)
> +{
> + switch (TDX_MD_ELEMENT_SIZE_CODE(fid)) {
> + case TDX_MD_ELEMENT_SIZE_8BITS:
> + return 1;
> + case TDX_MD_ELEMENT_SIZE_16BITS:
> + return 2;
> + case TDX_MD_ELEMENT_SIZE_32BITS:
> + return 4;
> + case TDX_MD_ELEMENT_SIZE_64BITS:
> + return 8;
> + default:
> + WARN_ON_ONCE(1);
> + return 0;
> + }
> +}
> +
> +static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> +{
> + struct tdx_md_map *m;
> + int ret, i;
> + u64 tmp;
> +
> + for (i = 0; i < nr_maps; i++) {
> + m = &maps[i];
> + ret = tdx_sys_metadata_field_read(m->fid, &tmp);
> + if (ret)
> + return ret;
> +
> + memcpy(m->ptr, &tmp, tdx_md_element_size(m->fid));
> + }
> +
> + return 0;
> +}
> +
> static int __init tdx_module_setup(void)
> {
> int ret;


2024-03-14 13:42:16

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 028/130] KVM: TDX: Add TDX "architectural" error codes



On 2/27/2024 3:27 AM, Isaku Yamahata wrote:
> On Mon, Feb 26, 2024 at 12:25:30AM -0800,
> [email protected] wrote:
>
>> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
>> index fdfd41511b02..28c4a62b7dba 100644
>> --- a/arch/x86/include/asm/shared/tdx.h
>> +++ b/arch/x86/include/asm/shared/tdx.h
>> @@ -26,7 +26,13 @@
>> #define TDVMCALL_GET_QUOTE 0x10002
>> #define TDVMCALL_REPORT_FATAL_ERROR 0x10003
>>
>> -#define TDVMCALL_STATUS_RETRY 1
> Oops, I accidentally removed this constant to break tdx guest build.

Is this the same as "TDVMCALL_RETRY" added in the patch? Since both tdx
guest code and VMM share the same header file, maybe it needs another
patch to change the guest code, or you could just follow the naming style
of the existing code?
>
> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index ef1c8e5a2944..1367a5941499 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -28,6 +28,8 @@
> #define TDVMCALL_REPORT_FATAL_ERROR 0x10003
> #define TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004
>
> +#define TDVMCALL_STATUS_RETRY 1
> +
> /*
> * TDG.VP.VMCALL Status Codes (returned in R10)
> */


2024-03-14 16:14:15

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 019/130] KVM: x86: Add is_vm_type_supported callback

On Thu, Mar 14, 2024 at 04:32:20PM +0800,
Chao Gao <[email protected]> wrote:

> >-static bool kvm_is_vm_type_supported(unsigned long type)
> >+bool __kvm_is_vm_type_supported(unsigned long type)
> > {
> > return type == KVM_X86_DEFAULT_VM ||
> > (type == KVM_X86_SW_PROTECTED_VM &&
> > IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
>
> maybe just do:
> switch (type) {
> case KVM_X86_DEFAULT_VM:
> return true;
> case KVM_X86_SW_PROTECTED_VM:
> return IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled;
> default:
> return static_call(kvm_x86_is_vm_type_supported)(type);
> }
>
> There are two benefits
> 1) switch/case improves readability a little.
> 2) no need to expose __kvm_is_vm_type_supported()

The following[1] patch will supersede this patch. Will drop this patch.

[1] https://lore.kernel.org/kvm/[email protected]/
--
Isaku Yamahata <[email protected]>

2024-03-14 16:27:27

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Mar 14, 2024 at 10:05:35AM +0800,
Binbin Wu <[email protected]> wrote:

> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 18cecf12c7c8..18aef6e23aab 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -6,6 +6,22 @@
> > #include "nested.h"
> > #include "pmu.h"
> > +static bool enable_tdx __ro_after_init;
> > +module_param_named(tdx, enable_tdx, bool, 0444);
> > +
> > +static __init int vt_hardware_setup(void)
> > +{
> > + int ret;
> > +
> > + ret = vmx_hardware_setup();
> > + if (ret)
> > + return ret;
> > +
> > + enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > +
> > + return 0;
> > +}
> > +
> > #define VMX_REQUIRED_APICV_INHIBITS \
> > (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> > BIT(APICV_INHIBIT_REASON_ABSENT) | \
> > @@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > .hardware_unsetup = vmx_hardware_unsetup,
> > + /* TDX cpu enablement is done by tdx_hardware_setup(). */
>
> How about if there are some LPs that are offline?
> In tdx_hardware_setup(), only online LPs are initialized for TDX, right?

Correct.


> Then when an offline LP becomes online, it doesn't have a chance to call
> tdx_cpu_enable()?

KVM registers kvm_online_cpu()/kvm_offline_cpu() in kvm_main.c as CPU hotplug
callbacks. Eventually the x86 KVM hardware_enable() is called on an
online/offline event.
--
Isaku Yamahata <[email protected]>

2024-03-14 16:38:07

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure

On Thu, Mar 14, 2024 at 02:21:04PM +0800,
Binbin Wu <[email protected]> wrote:

> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 18aef6e23aab..e11edbd19e7c 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -5,6 +5,7 @@
> > #include "vmx.h"
> > #include "nested.h"
> > #include "pmu.h"
> > +#include "tdx.h"
> > static bool enable_tdx __ro_after_init;
> > module_param_named(tdx, enable_tdx, bool, 0444);
> > @@ -18,6 +19,9 @@ static __init int vt_hardware_setup(void)
> > return ret;
> > enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > + if (enable_tdx)
> > + vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> > + sizeof(struct kvm_tdx));
> > return 0;
> > }
> > @@ -215,8 +219,18 @@ static int __init vt_init(void)
> > * Common KVM initialization _must_ come last, after this, /dev/kvm is
> > * exposed to userspace!
> > */
> > + /*
> > + * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
> > + * be set before kvm_x86_vendor_init().
>
> The comment is not right?
> In this patch, vt_x86_ops.vm_size is set in  vt_hardware_setup(),
> which is called in kvm_x86_vendor_init().
>
> Since kvm_x86_ops is updated by kvm_ops_update() with the fields of
> vt_x86_ops. I guess you wanted to say vt_x86_ops.vm_size must be set
> before kvm_ops_update()?

Correct. Here's an updated version.

/*
* vt_hardware_setup() updates vt_x86_ops. Because kvm_ops_update()
* copies vt_x86_ops to kvm_x86_op, vt_x86_ops must be updated before
* kvm_ops_update() called by kvm_x86_vendor_init().
*/
--
Isaku Yamahata <[email protected]>

2024-03-14 16:48:13

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Thu, Mar 14, 2024 at 03:30:43PM +0800,
Binbin Wu <[email protected]> wrote:

> > +#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
> > +#define TDX_TD_ATTR_SEPT_VE_DISABLE BIT_ULL(28)
> It's better to align the style of the naming.
>
> Either use TDX_TD_ATTR_* or TDX_TD_ATTRIBUTE_*?

Good point. I'll adopt TDX_TD_ATTR_* because TDX_TD_ATTR_SEPT_VE_DISABLE is
already long.

--
Isaku Yamahata <[email protected]>

2024-03-14 17:01:11

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 033/130] KVM: TDX: Add helper function to read TDX metadata in array

On Thu, Mar 14, 2024 at 10:35:47PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 3/14/2024 5:17 PM, Binbin Wu wrote:
> >
> >
> > On 2/26/2024 4:25 PM, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > To read meta data in series, use table.
> > > Instead of metadata_read(fid0, &data0); metadata_read(...); ...
> > > table = { {fid0, &data0}, ...}; metadata-read(tables).
> > > TODO: Once the TDX host code introduces its framework to read TDX
> > > metadata,
> > > drop this patch and convert the code that uses this.
> >
> > Do you mean the patch 1-5 included in this patch set.
> > I think the patch 1-5 of this patch set is doing this thing, right?
> >
> > Since they are already there, I think you can use them directly in this
> > patch set instead of introducing these temp code?
> I may have some mis-understanding, but I think the TODO has been done,
> right?

I meant the following patch series.
https://lore.kernel.org/kvm/[email protected]/

If (the future version of) the patch series doesn't provide a way to read
one metadata with size, we need to keep this patch.
--
Isaku Yamahata <[email protected]>

2024-03-14 17:23:27

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 028/130] KVM: TDX: Add TDX "architectural" error codes

On Thu, Mar 14, 2024 at 03:45:49PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/27/2024 3:27 AM, Isaku Yamahata wrote:
> > On Mon, Feb 26, 2024 at 12:25:30AM -0800,
> > [email protected] wrote:
> >
> > > diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> > > index fdfd41511b02..28c4a62b7dba 100644
> > > --- a/arch/x86/include/asm/shared/tdx.h
> > > +++ b/arch/x86/include/asm/shared/tdx.h
> > > @@ -26,7 +26,13 @@
> > > #define TDVMCALL_GET_QUOTE 0x10002
> > > #define TDVMCALL_REPORT_FATAL_ERROR 0x10003
> > > -#define TDVMCALL_STATUS_RETRY 1
> > Oops, I accidentally removed this constant to break tdx guest build.
>
> Is this the same as "TDVMCALL_RETRY" added in the patch? Since both tdx
> guest code and VMM share the same header file, maybe it needs another patch
> to change the code in guest or you just follow the naming style of the exist
> code?

The style elsewhere in TDX is without STATUS. I don't want to play
bikeshedding. For now I'd like to keep TDVMCALL_STATUS_RETRY, not add
TDVMCALL_RETRY, and keep the other TDVMCALL_* as they are.
--
Isaku Yamahata <[email protected]>

2024-03-14 18:10:20

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Wed, Mar 13, 2024 at 08:51:53PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > For private GPA, CPU refers a private page table whose contents are
> > encrypted.  The dedicated APIs to operate on it (e.g.
> > updating/reading its
> > PTE entry) are used and their cost is expensive.
> >
> > When KVM resolves KVM page fault, it walks the page tables.  To reuse
> > the
> > existing KVM MMU code and mitigate the heavy cost to directly walk
> > private
> > page table, allocate one more page to copy the dummy page table for
> > KVM MMU
> > code to directly walk.  Resolve KVM page fault with the existing
> > code, and
> > do additional operations necessary for the private page table. 
>
> > To
> > distinguish such cases, the existing KVM page table is called a
> > shared page
> > table (i.e. not associated with private page table), and the page
> > table
> > with private page table is called a private page table.
>
> This makes it sound like the dummy page table for the private alias is
> also called a shared page table, but in the drawing below it looks like
> only the shared alias is called "shared PT".

How about this:
Call the existing KVM page table associated with a shared GPA the shared page table.
Call the KVM page table associated with a private GPA the private page table.

> >   The relationship
> > is depicted below.
> >
> > Add a private pointer to struct kvm_mmu_page for private page table
> > and
> > add helper functions to allocate/initialize/free a private page table
> > page.
> >
> >               KVM page fault                     |
> >                      |                           |
> >                      V                           |
> >         -------------+----------                 |
> >         |                      |                 |
> >         V                      V                 |
> >      shared GPA           private GPA            |
> >         |                      |                 |
> >         V                      V                 |
> >     shared PT root      dummy PT root            |    private PT root
> >         |                      |                 |           |
> >         V                      V                 |           V
> >      shared PT            dummy PT ----propagate---->   private PT
> >         |                      |                 |           |
> >         |                      \-----------------+------\    |
> >         |                                        |      |    |
> >         V                                        |      V    V
> >   shared guest page                              |    private guest page
> >                                                  |
> >                            non-encrypted memory  |    encrypted memory
> >                                                  |
> > PT: page table
> > - Shared PT is visible to KVM and it is used by CPU.
> > - Private PT is used by CPU but it is invisible to KVM.
> > - Dummy PT is visible to KVM but not used by CPU.  It is used to
> >   propagate PT change to the actual private PT which is used by CPU.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Binbin Wu <[email protected]>
> > ---

..snip...

> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > b/arch/x86/kvm/mmu/mmu_internal.h
> > index e3f54701f98d..002f3f80bf3b 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -101,7 +101,21 @@ struct kvm_mmu_page {
> >                 int root_count;
> >                 refcount_t tdp_mmu_root_count;
> >         };
> > -       unsigned int unsync_children;
> > +       union {
> > +               struct {
> > +                       unsigned int unsync_children;
> > +                       /*
> > +                        * Number of writes since the last time
> > traversal
> > +                        * visited this page.
> > +                        */
> > +                       atomic_t write_flooding_count;
> > +               };
>
> I think the point of putting these in a union is that they only apply
> to shadow paging and so can't be used with TDX. I think you are putting
> more than the sizeof(void *) in there as there are multiple in the same
> category.

I'm not sure I'm following you.
On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4, sizeof(void *) = 8.
I moved write_flooding_count so that the pair occupies 8 bytes.


> But there seems to be a new one added, *shadowed_translation.
> Should it go in there too? Is the union because there wasn't room
> before, or just to be tidy?

Originally TDX MMU support was implemented for the legacy TDP MMU, which used
shadowed_translation, so it was not an option at that time. Later we switched
to the (new) TDP MMU. Now we have a choice of which member to overlay.


> I think the commit log should have more discussion of this union and
> maybe a comment in the struct to explain the purpose of the
> organization. Can you explain the reasoning now for the sake of
> discussion?

Sure. We'd like to add a void * pointer to struct kvm_mmu_page. Given some
members are used only for legacy KVM MMUs and not for the TDP MMU, we can save
memory overhead with a union. We have options:
- u64 *shadowed_translation
  This was not chosen for the old implementation. Now it is an option.
- pack unsync_children and write_flooding_count into 8 bytes
  This patch chose this one for historical reasons. The other two options are
  also possible.
- unsync_child_bitmap
  Historically it was unioned with other members, but now it's not.

I don't have a strong preference for TDX support as long as we can have the void *.
--
Isaku Yamahata <[email protected]>

2024-03-14 18:11:14

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX KVM needs system-wide information about the TDX module, store it in
> struct tdx_info.

Nit: Maybe you can add some description about hardware_unsetup()?

Reviewed-by: Binbin Wu <[email protected]>

> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - Added features0
> - Use tdx_sys_metadata_read()
> - Fix error recovery path by Yuan
>
> Change v18:
> - Newly Added
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm.h | 11 +++++
> arch/x86/kvm/vmx/main.c | 9 +++-
> arch/x86/kvm/vmx/tdx.c | 80 ++++++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/x86_ops.h | 2 +
> 4 files changed, 100 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index aa7a56a47564..45b2c2304491 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -567,4 +567,15 @@ struct kvm_pmu_event_filter {
> #define KVM_X86_TDX_VM 2
> #define KVM_X86_SNP_VM 3
>
> +#define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1)
> +
> +struct kvm_tdx_cpuid_config {
> + __u32 leaf;
> + __u32 sub_leaf;
> + __u32 eax;
> + __u32 ebx;
> + __u32 ecx;
> + __u32 edx;
> +};
> +
> #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index fa19682b366c..a948a6959ac7 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -32,6 +32,13 @@ static __init int vt_hardware_setup(void)
> return 0;
> }
>
> +static void vt_hardware_unsetup(void)
> +{
> + if (enable_tdx)
> + tdx_hardware_unsetup();
> + vmx_hardware_unsetup();
> +}
> +
> static int vt_vm_init(struct kvm *kvm)
> {
> if (is_td(kvm))
> @@ -54,7 +61,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .check_processor_compatibility = vmx_check_processor_compat,
>
> - .hardware_unsetup = vmx_hardware_unsetup,
> + .hardware_unsetup = vt_hardware_unsetup,
>
> /* TDX cpu enablement is done by tdx_hardware_setup(). */
> .hardware_enable = vmx_hardware_enable,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index dce21f675155..5edfb99abb89 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -40,6 +40,21 @@ static void __used tdx_guest_keyid_free(int keyid)
> ida_free(&tdx_guest_keyid_pool, keyid);
> }
>
> +struct tdx_info {
> + u64 features0;
> + u64 attributes_fixed0;
> + u64 attributes_fixed1;
> + u64 xfam_fixed0;
> + u64 xfam_fixed1;
> +
> + u16 num_cpuid_config;
> + /* This must the last member. */
> + DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
> +};
> +
> +/* Info about the TDX module. */
> +static struct tdx_info *tdx_info;
> +
> #define TDX_MD_MAP(_fid, _ptr) \
> { .fid = MD_FIELD_ID_##_fid, \
> .ptr = (_ptr), }
> @@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
> }
> }
>
> -static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> +static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> {
> struct tdx_md_map *m;
> int ret, i;
> @@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> return 0;
> }
>
> +#define TDX_INFO_MAP(_field_id, _member) \
> + TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
> +
> static int __init tdx_module_setup(void)
> {
> + u16 num_cpuid_config;
> int ret;
> + u32 i;
> +
> + struct tdx_md_map mds[] = {
> + TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> + };
> +
> + struct tdx_metadata_field_mapping fields[] = {
> + TDX_INFO_MAP(FEATURES0, features0),
> + TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
> + TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
> + TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
> + TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
> + };
>
> ret = tdx_enable();
> if (ret) {
> @@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
> return ret;
> }
>
> + ret = tdx_md_read(mds, ARRAY_SIZE(mds));
> + if (ret)
> + return ret;
> +
> + tdx_info = kzalloc(sizeof(*tdx_info) +
> + sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
> + GFP_KERNEL);
> + if (!tdx_info)
> + return -ENOMEM;
> + tdx_info->num_cpuid_config = num_cpuid_config;
> +
> + ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
> + if (ret)
> + goto error_out;
> +
> + for (i = 0; i < num_cpuid_config; i++) {
> + struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
> + u64 leaf, eax_ebx, ecx_edx;
> + struct tdx_md_map cpuids[] = {
> + TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
> + TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
> + TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
> + };
> +
> + ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
> + if (ret)
> + goto error_out;
> +
> + c->leaf = (u32)leaf;
> + c->sub_leaf = leaf >> 32;
> + c->eax = (u32)eax_ebx;
> + c->ebx = eax_ebx >> 32;
> + c->ecx = (u32)ecx_edx;
> + c->edx = ecx_edx >> 32;
> + }
> +
> return 0;
> +
> +error_out:
> + /* kfree() accepts NULL. */
> + kfree(tdx_info);
> + return ret;
> }
>
> bool tdx_is_vm_type_supported(unsigned long type)
> @@ -162,3 +235,8 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> out:
> return r;
> }
> +
> +void tdx_hardware_unsetup(void)
> +{
> + kfree(tdx_info);
> +}
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index f4da88a228d0..e8cb4ae81cf1 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -136,9 +136,11 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
>
> #ifdef CONFIG_INTEL_TDX_HOST
> int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +void tdx_hardware_unsetup(void);
> bool tdx_is_vm_type_supported(unsigned long type);
> #else
> static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
> +static inline void tdx_hardware_unsetup(void) {}
> static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> #endif
>


2024-03-14 19:41:29

by Binbin Wu

Subject: Re: [PATCH v19 033/130] KVM: TDX: Add helper function to read TDX metadata in array



On 3/14/2024 5:17 PM, Binbin Wu wrote:
>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> To read metadata fields in series, use a table.
>> Instead of metadata_read(fid0, &data0); metadata_read(...); ...,
>> use table = { {fid0, &data0}, ... }; metadata_read(table).
>> TODO: Once the TDX host code introduces its framework to read TDX
>> metadata,
>> drop this patch and convert the code that uses this.
>
> Do you mean patches 1-5 included in this patch set?
> I think patches 1-5 of this patch set are doing this, right?
>
> Since they are already there, I think you can use them directly in this
> patch set instead of introducing this temporary code?
I may have some misunderstanding, but I think the TODO has been done,
right?

>
>>
>> Signed-off-by: Isaku Yamahata <[email protected]>
>> ---
>> v18:
>> - newly added
>> ---
>>   arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index cde971122c1e..dce21f675155 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -6,6 +6,7 @@
>>   #include "capabilities.h"
>>   #include "x86_ops.h"
>>   #include "x86.h"
>> +#include "tdx_arch.h"
>>   #include "tdx.h"
>>     #undef pr_fmt
>> @@ -39,6 +40,50 @@ static void __used tdx_guest_keyid_free(int keyid)
>>       ida_free(&tdx_guest_keyid_pool, keyid);
>>   }
>>   +#define TDX_MD_MAP(_fid, _ptr)            \
>> +    { .fid = MD_FIELD_ID_##_fid,        \
>> +      .ptr = (_ptr), }
>> +
>> +struct tdx_md_map {
>> +    u64 fid;
>> +    void *ptr;
>> +};
>> +
>> +static size_t tdx_md_element_size(u64 fid)
>> +{
>> +    switch (TDX_MD_ELEMENT_SIZE_CODE(fid)) {
>> +    case TDX_MD_ELEMENT_SIZE_8BITS:
>> +        return 1;
>> +    case TDX_MD_ELEMENT_SIZE_16BITS:
>> +        return 2;
>> +    case TDX_MD_ELEMENT_SIZE_32BITS:
>> +        return 4;
>> +    case TDX_MD_ELEMENT_SIZE_64BITS:
>> +        return 8;
>> +    default:
>> +        WARN_ON_ONCE(1);
>> +        return 0;
>> +    }
>> +}
>> +
>> +static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
>> +{
>> +    struct tdx_md_map *m;
>> +    int ret, i;
>> +    u64 tmp;
>> +
>> +    for (i = 0; i < nr_maps; i++) {
>> +        m = &maps[i];
>> +        ret = tdx_sys_metadata_field_read(m->fid, &tmp);
>> +        if (ret)
>> +            return ret;
>> +
>> +        memcpy(m->ptr, &tmp, tdx_md_element_size(m->fid));
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   static int __init tdx_module_setup(void)
>>   {
>>       int ret;
>


2024-03-14 21:24:11

by Huang, Kai

Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page



On 15/03/2024 7:10 am, Isaku Yamahata wrote:
> On Wed, Mar 13, 2024 at 08:51:53PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
>> On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> For private GPA, CPU refers a private page table whose contents are
>>> encrypted.  The dedicated APIs to operate on it (e.g.
>>> updating/reading its
>>> PTE entry) are used and their cost is expensive.
>>>
>>> When KVM resolves KVM page fault, it walks the page tables.  To reuse
>>> the
>>> existing KVM MMU code and mitigate the heavy cost to directly walk
>>> private
>>> page table, allocate one more page to copy the dummy page table for
>>> KVM MMU
>>> code to directly walk.  Resolve KVM page fault with the existing
>>> code, and
>>> do additional operations necessary for the private page table.
>>
>>> To
>>> distinguish such cases, the existing KVM page table is called a
>>> shared page
>>> table (i.e. not associated with private page table), and the page
>>> table
>>> with private page table is called a private page table.
>>
>> This makes it sound like the dummy page table for the private alias is
>> also called a shared page table, but in the drawing below it looks like
>> only the shared alias is called "shared PT".
>
> How about this:
> Call the existing KVM page table associated with shared GPA the shared page table.
> Call the KVM page table associated with private GPA the private page table.
>

For the second one, are you talking about the *true* secure/private EPT
page table used by hardware, or the one visible to KVM but not used by
hardware?

We have 3 page tables as you mentioned:

PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Private PT is used by CPU but it is invisible to KVM.
- Dummy PT is visible to KVM but not used by CPU. It is used to
propagate PT change to the actual private PT which is used by CPU.

If I recall correctly, we used to call the last one "mirrored (private)
page table".

I lost track of when we changed to "dummy page table", but it seems to
me "mirrored" is better than "dummy" because the latter implies it is
useless, when in fact it is used to propagate changes to the real
private page table used by hardware.

Btw, one nit, perhaps:

"Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is
visible to KVM and it is used by CPU for shared mappings".

To make it clearer that it is used for "shared mappings".

But this may be unnecessary to others, so up to you.


2024-03-14 21:40:18

by Edgecombe, Rick P

Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Fri, 2024-03-15 at 10:23 +1300, Huang, Kai wrote:
> We have 3 page tables as you mentioned:
>
> PT: page table
> - Shared PT is visible to KVM and it is used by CPU.
> - Private PT is used by CPU but it is invisible to KVM.
> - Dummy PT is visible to KVM but not used by CPU.  It is used to
>    propagate PT change to the actual private PT which is used by CPU.
>
> If I recall correctly, we used to call the last one "mirrored
> (private)
> page table".
>
> I lost the tracking when we changed to use "dummy page table", but it
> seems to me "mirrored" is better than "dummy" because the latter
> means
> it is useless but in fact it is used to propagate changes to the real
> private page table used by hardware.

Mirrored makes sense to me. So like:

Private - Table actually mapping private alias, in TDX module
Shared - Shared alias table, visible in KVM
Mirror - Mirroring private, visible in KVM

>
> Btw, one nit, perhaps:
>
> "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is
> visible to KVM and it is used by CPU for shared mappings".
>
> To make it clearer that it is used for "shared mappings".
>
> But this may be unnecessary to others, so up to you.

Yep, this seems clearer.

2024-03-14 21:53:20

by Edgecombe, Rick P

Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Thu, 2024-03-14 at 11:10 -0700, Isaku Yamahata wrote:
> > I think the point of putting these in a union is that they only
> > apply
> > to shadow paging and so can't be used with TDX. I think you are
> > putting
> > more than the sizeof(void *) in there as there are multiple in the
> > same
> > category.
>
> I'm not sure if I'm following you.
> On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4,
> sizeof(void *) = 8.
> I moved write_flooding_count to have 8 bytes.

Ah, I see. Yes, you are right about it summing to 8. OK, what do you
think about putting a comment that these will always be unused with
TDX?

>
>
> > But there seems to be a new one added, *shadowed_translation.
> > Should it go in there too? Is the union because there wasn't room
> > before, or just to be tidy?
>
> Originally TDX MMU support was implemented for legacy tdp mmu.  It
> used
> shadowed_translation.  It was not an option at that time.  Later we
> switched to
> (new) TDP MMU.  Now we have choice to which member to overlay.
>
>
> > I think the commit log should have more discussion of this union
> > and
> > maybe a comment in the struct to explain the purpose of the
> > organization. Can you explain the reasoning now for the sake of
> > discussion?
>
> Sure.  We'd like to add a void * pointer to struct kvm_mmu_page.  Given
> some members are used only for legacy KVM MMUs and not used for the TDP
> MMU, we can save memory overhead with a union.  We have options.
> - u64 *shadowed_translation
>   This was not chosen for the old implementation. Now this is an option.

This seems a little more straightforward, but I'm on the fence about
whether it's worth changing.

> - pack unsync_children and write_flooding_count into 8 bytes
>   This patch chose this for historical reasons. The other two options
>   are possible.
> - unsync_child_bitmap
>   Historically it was unioned with other members. But now it's not.
>
> I don't have a strong preference for TDX support as long as we can have
> the void *.

2024-03-14 22:28:28

by Huang, Kai

Subject: Re: [PATCH v19 033/130] KVM: TDX: Add helper function to read TDX metadata in array



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> To read metadata fields in series, use a table.
> Instead of metadata_read(fid0, &data0); metadata_read(...); ...,
> use table = { {fid0, &data0}, ... }; metadata_read(table).

This explains nothing about why the code introduced in patches 1-5 cannot be used.

> TODO: Once the TDX host code introduces its framework to read TDX metadata,
> drop this patch and convert the code that uses this.

Seriously, what is this?? Please treat your patches as "official" patches.

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v18:
> - newly added
> ---
> arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 45 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cde971122c1e..dce21f675155 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,6 +6,7 @@
> #include "capabilities.h"
> #include "x86_ops.h"
> #include "x86.h"
> +#include "tdx_arch.h"
> #include "tdx.h"
>
> #undef pr_fmt
> @@ -39,6 +40,50 @@ static void __used tdx_guest_keyid_free(int keyid)
> ida_free(&tdx_guest_keyid_pool, keyid);
> }
>
> +#define TDX_MD_MAP(_fid, _ptr) \
> + { .fid = MD_FIELD_ID_##_fid, \
> + .ptr = (_ptr), }
> +
> +struct tdx_md_map {
> + u64 fid;
> + void *ptr;
> +};
> +
> +static size_t tdx_md_element_size(u64 fid)
> +{
> + switch (TDX_MD_ELEMENT_SIZE_CODE(fid)) {
> + case TDX_MD_ELEMENT_SIZE_8BITS:
> + return 1;
> + case TDX_MD_ELEMENT_SIZE_16BITS:
> + return 2;
> + case TDX_MD_ELEMENT_SIZE_32BITS:
> + return 4;
> + case TDX_MD_ELEMENT_SIZE_64BITS:
> + return 8;
> + default:
> + WARN_ON_ONCE(1);
> + return 0;
> + }
> +}
> +
> +static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> +{
> + struct tdx_md_map *m;
> + int ret, i;
> + u64 tmp;
> +
> + for (i = 0; i < nr_maps; i++) {
> + m = &maps[i];
> + ret = tdx_sys_metadata_field_read(m->fid, &tmp);
> + if (ret)
> + return ret;
> +
> + memcpy(m->ptr, &tmp, tdx_md_element_size(m->fid));
> + }
> +
> + return 0;
> +}
> +

It's just insane to have two duplicated mechanisms for metadata reading.

This will only confuse people.

If there's anything missing in patch 1-5, we can enhance them.

2024-03-14 23:10:01

by Huang, Kai

Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization


> +struct tdx_info {
> + u64 features0;
> + u64 attributes_fixed0;
> + u64 attributes_fixed1;
> + u64 xfam_fixed0;
> + u64 xfam_fixed1;
> +
> + u16 num_cpuid_config;
> + /* This must be the last member. */
> + DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
> +};
> +
> +/* Info about the TDX module. */
> +static struct tdx_info *tdx_info;
> +
> #define TDX_MD_MAP(_fid, _ptr) \
> { .fid = MD_FIELD_ID_##_fid, \
> .ptr = (_ptr), }
> @@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
> }
> }
>
> -static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> +static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> {
> struct tdx_md_map *m;
> int ret, i;
> @@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> return 0;
> }
>
> +#define TDX_INFO_MAP(_field_id, _member) \
> + TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
> +
> static int __init tdx_module_setup(void)
> {
> + u16 num_cpuid_config;
> int ret;
> + u32 i;
> +
> + struct tdx_md_map mds[] = {
> + TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> + };
> +
> + struct tdx_metadata_field_mapping fields[] = {
> + TDX_INFO_MAP(FEATURES0, features0),
> + TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
> + TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
> + TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
> + TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
> + };
>
> ret = tdx_enable();
> if (ret) {
> @@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
> return ret;
> }
>
> + ret = tdx_md_read(mds, ARRAY_SIZE(mds));
> + if (ret)
> + return ret;
> +
> + tdx_info = kzalloc(sizeof(*tdx_info) +
> + sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
> + GFP_KERNEL);
> + if (!tdx_info)
> + return -ENOMEM;
> + tdx_info->num_cpuid_config = num_cpuid_config;
> +
> + ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
> + if (ret)
> + goto error_out;
> +
> + for (i = 0; i < num_cpuid_config; i++) {
> + struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
> + u64 leaf, eax_ebx, ecx_edx;
> + struct tdx_md_map cpuids[] = {
> + TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
> + TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
> + TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
> + };
> +
> + ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
> + if (ret)
> + goto error_out;
> +
> + c->leaf = (u32)leaf;
> + c->sub_leaf = leaf >> 32;
> + c->eax = (u32)eax_ebx;
> + c->ebx = eax_ebx >> 32;
> + c->ecx = (u32)ecx_edx;
> + c->edx = ecx_edx >> 32;

OK I can see why you don't want to use ...

struct tdx_metadata_field_mapping fields[] = {
TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
};

.. to read num_cpuid_config first, because the memory to hold @tdx_info
hasn't been allocated, because its size depends on the num_cpuid_config.

And I confess it's because the tdx_sys_metadata_field_read() that got
exposed in patch ("x86/virt/tdx: Export global metadata read
infrastructure") only returns 'u64' for all metadata field, and you
didn't want to use something like this:

u64 num_cpuid_config;

tdx_sys_metadata_field_read(..., &num_cpuid_config);

...

tdx_info->num_cpuid_config = num_cpuid_config;

Or you can explicitly cast:

tdx_info->num_cpuid_config = (u16)num_cpuid_config;

(I know people may not like assigning 'u64' to 'u16', but it seems
nothing wrong to me, because the way done in (1) below effectively has
the same result as a type cast).

But there are other (better) ways to do:

1) you can introduce a helper as suggested by Xiaoyao in [*]:


int tdx_sys_metadata_read_single(u64 field_id,
int bytes, void *buf)
{
return stbuf_read_sys_metadata_field(field_id, 0,
bytes, buf);
}

And do:

tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
sizeof(num_cpuid_config), &num_cpuid_config);

That's _much_ cleaner than the 'struct tdx_md_map', which only confuses
people.

But I don't think we need to do this as mentioned above -- we can just
do a type cast.

2) You can just preallocate enough memory. It cannot be larger than
1024B, right? You can even just allocate one page. It's just 4K, no
one cares.

Then you can do:

struct tdx_metadata_field_mapping tdx_info_fields = {
...
TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
};

tdx_sys_metadata_read(tdx_info_fields,
ARRAY_SIZE(tdx_info_fields), tdx_info);

And then you read the CPUID_CONFIG array one by one using the same
'struct tdx_metadata_field_mapping' and tdx_sys_metadata_read():


for (i = 0; i < tdx_info->num_cpuid_config; i++) {
struct tdx_metadata_field_mapping cpuid_fields = {
TDX_CPUID_CONFIG_MAP(CPUID_CONFIG_LEAVES + i,
leaf),
...
};
struct kvm_tdx_cpuid_config *c =
&tdx_info->cpuid_configs[i];

tdx_sys_metadata_read(cpuid_fields,
ARRAY_SIZE(cpuid_fields), c);

....
}

So stop having the duplicated 'struct tdx_md_map' and related stuff, as
they are absolutely unnecessary and only confuse people.

Btw, I am hesitant to make the change suggested by Xiaoyao in [*], as to
me there's nothing wrong with doing the type cast. I'll respond in that
thread.

[*]
https://lore.kernel.org/lkml/[email protected]/T/#m2512e378c83bc44d3ca653f96f25c3fc85eb0e8a




2024-03-15 00:02:37

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Kai Huang <[email protected]>
>
> KVM will need to make SEAMCALLs to create and run TDX guests. Export
> SEAMCALL functions for KVM to use.
>

Could you also list the reason that we want to expose __seamcall()
directly, rather than wanting to put some higher level wrappers in the
TDX host code, and export them?

For example, we can give a summary of the SEAMCALLs (e.g., how many in
total, and roughly introduce them based on categories) that will be used
by KVM, and clarify the reasons why we want to just export __seamcall().

E.g., we can say something like this:

TL;DR:

KVM roughly will need to use dozens of SEAMCALLs, and all these are
logically related to creating and running TDX guests. It makes more
sense to just export __seamcall() and let KVM maintain these VM-related
wrappers rather than having the TDX host code provide wrappers for
each SEAMCALL or a higher-level abstraction.

Long version:

You give a detailed explanation of the SEAMCALLs that will be used by
KVM, and clarify that it's logically better to maintain this code in KVM.

2024-03-15 00:07:01

by Edgecombe, Rick P

Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Mon, 2024-02-26 at 00:27 -0800, [email protected] wrote:
>  
> +static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
> +{
> +       if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +               return;
> +
> +       vmx_update_cpu_dirty_logging(vcpu);
> +}

Discussed this first part offline, but logging it here. Since
guest_memfd cannot have dirty logging, this is essentially bugging the
VM if somehow they manage anyway. But it should be blocked via the code
in check_memory_region_flags().

On the subject of warnings and KVM_BUG_ON(), my feeling so far is that
this series is quite aggressive about these. Is it due to the complexity
of the series? I think maybe we can remove some of the simple ones, but
not sure if there was already some discussion on what level is
appropriate.

2024-03-15 00:26:54

by Isaku Yamahata

Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Thu, Mar 14, 2024 at 09:52:58PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Thu, 2024-03-14 at 11:10 -0700, Isaku Yamahata wrote:
> > > I think the point of putting these in a union is that they only
> > > apply
> > > to shadow paging and so can't be used with TDX. I think you are
> > > putting
> > > more than the sizeof(void *) in there as there are multiple in the
> > > same
> > > category.
> >
> > I'm not sure if I'm following you.
> > On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4,
> > sizeof(void *) = 8.
> > I moved write_flooding_count to have 8 bytes.
>
> Ah, I see. Yes, you are right about it summing to 8. OK, what do you
> think about putting a comment that these will always be unused with
> TDX?

Ok, will add a comment. Also add some to the commit message.
--
Isaku Yamahata <[email protected]>

2024-03-15 01:09:55

by Isaku Yamahata

Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Thu, Mar 14, 2024 at 09:39:34PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Fri, 2024-03-15 at 10:23 +1300, Huang, Kai wrote:
> > We have 3 page tables as you mentioned:
> >
> > PT: page table
> > - Shared PT is visible to KVM and it is used by CPU.
> > - Private PT is used by CPU but it is invisible to KVM.
> > - Dummy PT is visible to KVM but not used by CPU.  It is used to
> >    propagate PT change to the actual private PT which is used by CPU.
> >
> > If I recall correctly, we used to call the last one "mirrored
> > (private)
> > page table".
> >
> > I lost the tracking when we changed to use "dummy page table", but it
> > seems to me "mirrored" is better than "dummy" because the latter
> > means
> > it is useless but in fact it is used to propagate changes to the real
> > private page table used by hardware.
>
> Mirrored makes sense to me. So like:
>
> Private - Table actually mapping private alias, in TDX module
> Shared - Shared alias table, visible in KVM
> Mirror - Mirroring private, visible in KVM
>
> >
> > Btw, one nit, perhaps:
> >
> > "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is
> > visible to KVM and it is used by CPU for shared mappings".
> >
> > To make it clearer that it is used for "shared mappings".
> >
> > But this may be unnecessary to others, so up to you.
>
> Yep, this seems clearer.

Here is the updated one. Renamed dummy -> mirrored.

When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate one more page to copy the mirrored page
table for the KVM MMU code to directly walk. Resolve the KVM page fault
with the existing code, and do additional operations necessary for the
private page table. To distinguish such cases, the existing KVM page table
is called a shared page table (i.e., not associated with a private page
table), and the page table with a private page table is called a mirrored
page table. The relationship is depicted below.


              KVM page fault                      |
                     |                            |
                     V                            |
        -------------+----------                  |
        |                      |                  |
        V                      V                  |
     shared GPA           private GPA             |
        |                      |                  |
        V                      V                  |
    shared PT root      mirrored PT root          |   private PT root
        |                      |                  |          |
        V                      V                  |          V
     shared PT            mirrored PT ----propagate---->  private PT
        |                      |                  |          |
        |                      \------------------+------\   |
        |                                         |      |   |
        V                                         |      V   V
  shared guest page                               |  private guest page
                                                  |
  non-encrypted memory                            |  encrypted memory
                                                  |

PT: Page table
Shared PT: visible to KVM, and the CPU uses it for shared mappings.
Private PT: the CPU uses it, but it is invisible to KVM.  The TDX module
            updates this table to map private guest pages.
Mirrored PT: visible to KVM, but the CPU doesn't use it.  KVM uses it to
             propagate PT changes to the actual private PT.

--
Isaku Yamahata <[email protected]>

2024-03-15 01:17:25

by Edgecombe, Rick P

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

Dave, care to weigh in here?

Context (whether to export the generic __seamcall for KVM's use):
https://lore.kernel.org/lkml/[email protected]/

On Fri, 2024-03-15 at 13:02 +1300, Huang, Kai wrote:
> KVM roughly will need to use dozens of SEAMCALLs, and all these are
> logically related to creating and running TDX guests.  It makes more
> sense to just export __seamcall() and let KVM maintain these VM-
> related
> wrappers rather than having the TDX host code provide wrappers for
> each SEAMCALL or a higher-level abstraction.

The helpers being discussed are these:
https://lore.kernel.org/lkml/7cfd33d896fce7b49bcf4b7179d0ded22c06b8c2.1708933498.git.isaku.yamahata@intel.com/

I guess there are three options:
1. Export the low level seamcall function
2. Export a bunch of higher level helper functions
3. Duplicate __seamcall asm in KVM

Letting modules make unrestricted seamcalls is not ideal. Preventing
the compiler from inlining the small logic in the static inline helpers
is not ideal. Duplicating code is not ideal. Hmm.

I want to say 2 sounds the least worst of the three. But I'm not sure.
I'm not sure if x86 folks would like to police new seamcalls, or be
bothered by it, either.

2024-03-15 01:33:52

by Dave Hansen

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On 3/14/24 18:17, Edgecombe, Rick P wrote:
> I guess there are three options:
> 1. Export the low level seamcall function
> 2. Export a bunch of higher level helper functions
> 3. Duplicate __seamcall asm in KVM
>
> Letting modules make unrestricted seamcalls is not ideal. Preventing
> the compiler from inlining the small logic in the static inline helpers
> is not ideal. Duplicating code is not ideal. Hmm.
>
> I want to say 2 sounds the least worst of the three. But I'm not sure.
> I'm not sure if x86 folks would like to police new seamcalls, or be
> bothered by it, either.

#3 is the only objectively awful one. :)

In the end, we actually _want_ to have conversations about these things.
There are going to be considerations about what functionality should be
in KVM or the core kernel. We don't want KVM doing any calls that could
affect global TDX module state, for instance.

But I'd also defer to the KVM maintainers on this. They're the ones
that have to play the symbol exporting game a lot more than I ever do.
If they cringe at the idea of adding 20 (or whatever) exports, then
that's a lot more important than the possibility of some other silly
module abusing the generic exported __seamcall.

2024-03-15 01:35:25

by Isaku Yamahata

Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Fri, Mar 15, 2024 at 12:06:31AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:27 -0800, [email protected] wrote:
> >  
> > +static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
> > +{
> > +       if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> > +               return;
> > +
> > +       vmx_update_cpu_dirty_logging(vcpu);
> > +}
>
> Discussed this first part offline, but logging it here. Since
> guest_memfd cannot have dirty logging, this is essentially bugging the
> VM if somehow they manage anyway. But it should be blocked via the code
> in check_memory_region_flags().

Will drop this patch.


> On the subject of warnings and KVM_BUG_ON(), my feeling so far is that
> this series is quite aggressive about these. Is it due to the complexity
> of the series? I think maybe we can remove some of the simple ones, but
> not sure if there was already some discussion on what level is
> appropriate.

KVM_BUG_ON() was helpful at the early stage. Because we haven't hit them
recently, it's okay to remove them. Will remove them.
--
Isaku Yamahata <[email protected]>

2024-03-15 02:19:15

by Xiaoyao Li

Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

On 3/15/2024 7:09 AM, Huang, Kai wrote:
>
>> +struct tdx_info {
>> +    u64 features0;
>> +    u64 attributes_fixed0;
>> +    u64 attributes_fixed1;
>> +    u64 xfam_fixed0;
>> +    u64 xfam_fixed1;
>> +
>> +    u16 num_cpuid_config;
>> +    /* This must be the last member. */
>> +    DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
>> +};
>> +
>> +/* Info about the TDX module. */
>> +static struct tdx_info *tdx_info;
>> +
>>   #define TDX_MD_MAP(_fid, _ptr)            \
>>       { .fid = MD_FIELD_ID_##_fid,        \
>>         .ptr = (_ptr), }
>> @@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
>>       }
>>   }
>> -static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
>> +static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
>>   {
>>       struct tdx_md_map *m;
>>       int ret, i;
>> @@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map
>> *maps, int nr_maps)
>>       return 0;
>>   }
>> +#define TDX_INFO_MAP(_field_id, _member)            \
>> +    TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
>> +
>>   static int __init tdx_module_setup(void)
>>   {
>> +    u16 num_cpuid_config;
>>       int ret;
>> +    u32 i;
>> +
>> +    struct tdx_md_map mds[] = {
>> +        TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
>> +    };
>> +
>> +    struct tdx_metadata_field_mapping fields[] = {
>> +        TDX_INFO_MAP(FEATURES0, features0),
>> +        TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
>> +        TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
>> +        TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
>> +        TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
>> +    };
>>       ret = tdx_enable();
>>       if (ret) {
>> @@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
>>           return ret;
>>       }
>> +    ret = tdx_md_read(mds, ARRAY_SIZE(mds));
>> +    if (ret)
>> +        return ret;
>> +
>> +    tdx_info = kzalloc(sizeof(*tdx_info) +
>> +               sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
>> +               GFP_KERNEL);
>> +    if (!tdx_info)
>> +        return -ENOMEM;
>> +    tdx_info->num_cpuid_config = num_cpuid_config;
>> +
>> +    ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
>> +    if (ret)
>> +        goto error_out;
>> +
>> +    for (i = 0; i < num_cpuid_config; i++) {
>> +        struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
>> +        u64 leaf, eax_ebx, ecx_edx;
>> +        struct tdx_md_map cpuids[] = {
>> +            TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
>> +            TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
>> +            TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
>> +        };
>> +
>> +        ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
>> +        if (ret)
>> +            goto error_out;
>> +
>> +        c->leaf = (u32)leaf;
>> +        c->sub_leaf = leaf >> 32;
>> +        c->eax = (u32)eax_ebx;
>> +        c->ebx = eax_ebx >> 32;
>> +        c->ecx = (u32)ecx_edx;
>> +        c->edx = ecx_edx >> 32;
>
> OK I can see why you don't want to use ...
>
>     struct tdx_metadata_field_mapping fields[] = {
>         TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
>     };
>
> ... to read num_cpuid_config first, because the memory to hold @tdx_info
> hasn't been allocated, because its size depends on the num_cpuid_config.
>
> And I confess it's because the tdx_sys_metadata_field_read() that got
> exposed in patch ("x86/virt/tdx: Export global metadata read
> infrastructure") only returns 'u64' for all metadata field, and you
> didn't want to use something like this:
>
>     u64 num_cpuid_config;
>
>     tdx_sys_metadata_field_read(..., &num_cpuid_config);
>
>     ...
>
>     tdx_info->num_cpuid_config = num_cpuid_config;
>
> Or you can explicitly cast:
>
>     tdx_info->num_cpuid_config = (u16)num_cpuid_config;
>
> (I know people may not like assigning 'u64' to 'u16', but it seems
> nothing wrong to me, because the way done in (1) below effectively has
> the same result as a type cast).
>
> But there are other (better) ways to do:
>
> 1) you can introduce a helper as suggested by Xiaoyao in [*]:
>
>
>     int tdx_sys_metadata_read_single(u64 field_id,
>                     int bytes,  void *buf)
>     {
>         return stbuf_read_sys_metadata_field(field_id, 0,
>                         bytes, buf);
>     }
>
> And do:
>
>     tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
>         sizeof(num_cpuid_config), &num_cpuid_config);
>
> That's _much_ cleaner than the 'struct tdx_md_map', which only confuses
> people.
>
> But I don't think we need to do this as mentioned above -- we just do
> type cast.

A type cast needs another temporary variable to hold the u64 output.

The reason I want to introduce tdx_sys_metadata_read_single() is to
provide a simple, unified interface for other code to read one
metadata field, instead of making every caller use a temporary u64
variable and handle the cast or memcpy itself.

> [*]
> https://lore.kernel.org/lkml/[email protected]/T/#m2512e378c83bc44d3ca653f96f25c3fc85eb0e8a


2024-03-15 04:45:06

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 3/15/2024 12:27 AM, Isaku Yamahata wrote:
> On Thu, Mar 14, 2024 at 10:05:35AM +0800,
> Binbin Wu <[email protected]> wrote:
>
>>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>>> index 18cecf12c7c8..18aef6e23aab 100644
>>> --- a/arch/x86/kvm/vmx/main.c
>>> +++ b/arch/x86/kvm/vmx/main.c
>>> @@ -6,6 +6,22 @@
>>> #include "nested.h"
>>> #include "pmu.h"
>>> +static bool enable_tdx __ro_after_init;
>>> +module_param_named(tdx, enable_tdx, bool, 0444);
>>> +
>>> +static __init int vt_hardware_setup(void)
>>> +{
>>> + int ret;
>>> +
>>> + ret = vmx_hardware_setup();
>>> + if (ret)
>>> + return ret;
>>> +
>>> + enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> #define VMX_REQUIRED_APICV_INHIBITS \
>>> (BIT(APICV_INHIBIT_REASON_DISABLE)| \
>>> BIT(APICV_INHIBIT_REASON_ABSENT) | \
>>> @@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>>> .hardware_unsetup = vmx_hardware_unsetup,
>>> + /* TDX cpu enablement is done by tdx_hardware_setup(). */
>> How about if there are some LPs that are offline.
>> In tdx_hardware_setup(), only online LPs are initialed for TDX, right?
> Correct.
>
>
>> Then when an offline LP becoming online, it doesn't have a chance to call
>> tdx_cpu_enable()?
> KVM registers kvm_online/offline_cpu() @ kvm_main.c as cpu hotplug callbacks.
> Eventually x86 kvm hardware_enable() is called on online/offline event.

Yes, hardware_enable() will be called when a CPU comes online,
but hardware_enable() is now vmx_hardware_enable(), right?
It doesn't call tdx_cpu_enable() during the online path.



2024-03-15 04:58:20

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

On Fri, 2024-03-15 at 10:18 +0800, Li, Xiaoyao wrote:
> On 3/15/2024 7:09 AM, Huang, Kai wrote:
> >
> > > +struct tdx_info {
> > > +    u64 features0;
> > > +    u64 attributes_fixed0;
> > > +    u64 attributes_fixed1;
> > > +    u64 xfam_fixed0;
> > > +    u64 xfam_fixed1;
> > > +
> > > +    u16 num_cpuid_config;
> > > +    /* This must the last member. */
> > > +    DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
> > > +};
> > > +
> > > +/* Info about the TDX module. */
> > > +static struct tdx_info *tdx_info;
> > > +
> > >   #define TDX_MD_MAP(_fid, _ptr)            \
> > >       { .fid = MD_FIELD_ID_##_fid,        \
> > >         .ptr = (_ptr), }
> > > @@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
> > >       }
> > >   }
> > > -static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > > +static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > >   {
> > >       struct tdx_md_map *m;
> > >       int ret, i;
> > > @@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map
> > > *maps, int nr_maps)
> > >       return 0;
> > >   }
> > > +#define TDX_INFO_MAP(_field_id, _member)            \
> > > +    TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
> > > +
> > >   static int __init tdx_module_setup(void)
> > >   {
> > > +    u16 num_cpuid_config;
> > >       int ret;
> > > +    u32 i;
> > > +
> > > +    struct tdx_md_map mds[] = {
> > > +        TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> > > +    };
> > > +
> > > +    struct tdx_metadata_field_mapping fields[] = {
> > > +        TDX_INFO_MAP(FEATURES0, features0),
> > > +        TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
> > > +        TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
> > > +        TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
> > > +        TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
> > > +    };
> > >       ret = tdx_enable();
> > >       if (ret) {
> > > @@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
> > >           return ret;
> > >       }
> > > +    ret = tdx_md_read(mds, ARRAY_SIZE(mds));
> > > +    if (ret)
> > > +        return ret;
> > > +
> > > +    tdx_info = kzalloc(sizeof(*tdx_info) +
> > > +               sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
> > > +               GFP_KERNEL);
> > > +    if (!tdx_info)
> > > +        return -ENOMEM;
> > > +    tdx_info->num_cpuid_config = num_cpuid_config;
> > > +
> > > +    ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
> > > +    if (ret)
> > > +        goto error_out;
> > > +
> > > +    for (i = 0; i < num_cpuid_config; i++) {
> > > +        struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
> > > +        u64 leaf, eax_ebx, ecx_edx;
> > > +        struct tdx_md_map cpuids[] = {
> > > +            TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
> > > +            TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
> > > +            TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
> > > +        };
> > > +
> > > +        ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
> > > +        if (ret)
> > > +            goto error_out;
> > > +
> > > +        c->leaf = (u32)leaf;
> > > +        c->sub_leaf = leaf >> 32;
> > > +        c->eax = (u32)eax_ebx;
> > > +        c->ebx = eax_ebx >> 32;
> > > +        c->ecx = (u32)ecx_edx;
> > > +        c->edx = ecx_edx >> 32;
> >
> > OK I can see why you don't want to use ...
> >
> >     struct tdx_metadata_field_mapping fields[] = {
> >         TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
> >     };
> >
> > ... to read num_cpuid_config first, because the memory to hold @tdx_info
> > hasn't been allocated, because its size depends on the num_cpuid_config.
> >
> > And I confess it's because the tdx_sys_metadata_field_read() that got
> > exposed in patch ("x86/virt/tdx: Export global metadata read
> > infrastructure") only returns 'u64' for all metadata field, and you
> > didn't want to use something like this:
> >
> >     u64 num_cpuid_config;
> >
> >     tdx_sys_metadata_field_read(..., &num_cpuid_config);
> >
> >     ...
> >
> >     tdx_info->num_cpuid_config = num_cpuid_config;
> >
> > Or you can explicitly cast:
> >
> >     tdx_info->num_cpuid_config = (u16)num_cpuid_config;
> >
> > (I know people may don't like the assigning 'u64' to 'u16', but it seems
> > nothing wrong to me, because the way done in (1) below effectively has
> > the same result comparing to type case).
> >
> > But there are other (better) ways to do:
> >
> > 1) you can introduce a helper as suggested by Xiaoyao in [*]:
> >
> >
> >     int tdx_sys_metadata_read_single(u64 field_id,
> >                     int bytes,  void *buf)
> >     {
> >         return stbuf_read_sys_metadata_field(field_id, 0,
> >                         bytes, buf);
> >     }
> >
> > And do:
> >
> >     tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
> >         sizeof(num_cpuid_config), &num_cpuid_config);
> >
> > That's _much_ cleaner than the 'struct tdx_md_map', which only confuses
> > people.
> >
> > But I don't think we need to do this as mentioned above -- we just do
> > type cast.
>
> type cast needs another tmp variable to hold the output of u64.
>
> The reason I want to introduce tdx_sys_metadata_read_single() is to
> provide a simple and unified interface for other codes to read one
> metadata field, instead of letting the caller to use temporary u64
> variable and handle the cast or memcpy itself.
>

You can always use a u64 to hold a u16 metadata field AFAICT, so it doesn't have
to be temporary.

Here is what Isaku can do using the current API:

u64 num_cpuid_config;


...

tdx_sys_metadata_field_read(NUM_CPUID_CONFIG, &num_cpuid_config);

tdx_info = kzalloc(calculate_tdx_info_size(num_cpuid_config), ...);

tdx_info->num_cpuid_config = num_cpuid_config;

...

(you can do explicit (u16)num_cpuid_config type cast above if you want.)

With your suggestion, here is what Isaku can do:

u16 num_cpuid_config;

...

tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
sizeof(num_cpuid_config),
&num_cpuid_config);

tdx_info = kzalloc(calculate_tdx_info_size(num_cpuid_config), ...);

tdx_info->num_cpuid_config = num_cpuid_config;

...

I don't see big difference?

One example where the current tdx_sys_metadata_field_read() doesn't quite fit is
when you have something like this:

struct {
u16 whatever;
...
} st;

tdx_sys_metadata_field_read(FIELD_ID_WHATEVER, &st.whatever);

But for this use case you are not supposed to use tdx_sys_metadata_field_read(),
but use tdx_sys_metadata_read() which has a mapping provided anyway.

So, while I don't object to your proposal, I don't see it as strictly
necessary.

I'll let other people have a say.


2024-03-15 05:12:10

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

On 3/15/2024 12:57 PM, Huang, Kai wrote:
> On Fri, 2024-03-15 at 10:18 +0800, Li, Xiaoyao wrote:
>> On 3/15/2024 7:09 AM, Huang, Kai wrote:
>>>
>>>> +struct tdx_info {
>>>> +    u64 features0;
>>>> +    u64 attributes_fixed0;
>>>> +    u64 attributes_fixed1;
>>>> +    u64 xfam_fixed0;
>>>> +    u64 xfam_fixed1;
>>>> +
>>>> +    u16 num_cpuid_config;
>>>> +    /* This must the last member. */
>>>> +    DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
>>>> +};
>>>> +
>>>> +/* Info about the TDX module. */
>>>> +static struct tdx_info *tdx_info;
>>>> +
>>>>   #define TDX_MD_MAP(_fid, _ptr)            \
>>>>       { .fid = MD_FIELD_ID_##_fid,        \
>>>>         .ptr = (_ptr), }
>>>> @@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
>>>>       }
>>>>   }
>>>> -static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
>>>> +static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
>>>>   {
>>>>       struct tdx_md_map *m;
>>>>       int ret, i;
>>>> @@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map
>>>> *maps, int nr_maps)
>>>>       return 0;
>>>>   }
>>>> +#define TDX_INFO_MAP(_field_id, _member)            \
>>>> +    TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
>>>> +
>>>>   static int __init tdx_module_setup(void)
>>>>   {
>>>> +    u16 num_cpuid_config;
>>>>       int ret;
>>>> +    u32 i;
>>>> +
>>>> +    struct tdx_md_map mds[] = {
>>>> +        TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
>>>> +    };
>>>> +
>>>> +    struct tdx_metadata_field_mapping fields[] = {
>>>> +        TDX_INFO_MAP(FEATURES0, features0),
>>>> +        TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
>>>> +        TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
>>>> +        TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
>>>> +        TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
>>>> +    };
>>>>       ret = tdx_enable();
>>>>       if (ret) {
>>>> @@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
>>>>           return ret;
>>>>       }
>>>> +    ret = tdx_md_read(mds, ARRAY_SIZE(mds));
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    tdx_info = kzalloc(sizeof(*tdx_info) +
>>>> +               sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
>>>> +               GFP_KERNEL);
>>>> +    if (!tdx_info)
>>>> +        return -ENOMEM;
>>>> +    tdx_info->num_cpuid_config = num_cpuid_config;
>>>> +
>>>> +    ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
>>>> +    if (ret)
>>>> +        goto error_out;
>>>> +
>>>> +    for (i = 0; i < num_cpuid_config; i++) {
>>>> +        struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
>>>> +        u64 leaf, eax_ebx, ecx_edx;
>>>> +        struct tdx_md_map cpuids[] = {
>>>> +            TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
>>>> +            TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
>>>> +            TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
>>>> +        };
>>>> +
>>>> +        ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
>>>> +        if (ret)
>>>> +            goto error_out;
>>>> +
>>>> +        c->leaf = (u32)leaf;
>>>> +        c->sub_leaf = leaf >> 32;
>>>> +        c->eax = (u32)eax_ebx;
>>>> +        c->ebx = eax_ebx >> 32;
>>>> +        c->ecx = (u32)ecx_edx;
>>>> +        c->edx = ecx_edx >> 32;
>>>
>>> OK I can see why you don't want to use ...
>>>
>>>     struct tdx_metadata_field_mapping fields[] = {
>>>         TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
>>>     };
>>>
>>> ... to read num_cpuid_config first, because the memory to hold @tdx_info
>>> hasn't been allocated, because its size depends on the num_cpuid_config.
>>>
>>> And I confess it's because the tdx_sys_metadata_field_read() that got
>>> exposed in patch ("x86/virt/tdx: Export global metadata read
>>> infrastructure") only returns 'u64' for all metadata field, and you
>>> didn't want to use something like this:
>>>
>>>     u64 num_cpuid_config;
>>>
>>>     tdx_sys_metadata_field_read(..., &num_cpuid_config);
>>>
>>>     ...
>>>
>>>     tdx_info->num_cpuid_config = num_cpuid_config;
>>>
>>> Or you can explicitly cast:
>>>
>>>     tdx_info->num_cpuid_config = (u16)num_cpuid_config;
>>>
>>> (I know people may don't like the assigning 'u64' to 'u16', but it seems
>>> nothing wrong to me, because the way done in (1) below effectively has
>>> the same result comparing to type case).
>>>
>>> But there are other (better) ways to do:
>>>
>>> 1) you can introduce a helper as suggested by Xiaoyao in [*]:
>>>
>>>
>>>     int tdx_sys_metadata_read_single(u64 field_id,
>>>                     int bytes,  void *buf)
>>>     {
>>>         return stbuf_read_sys_metadata_field(field_id, 0,
>>>                         bytes, buf);
>>>     }
>>>
>>> And do:
>>>
>>>     tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
>>>         sizeof(num_cpuid_config), &num_cpuid_config);
>>>
>>> That's _much_ cleaner than the 'struct tdx_md_map', which only confuses
>>> people.
>>>
>>> But I don't think we need to do this as mentioned above -- we just do
>>> type cast.
>>
>> type cast needs another tmp variable to hold the output of u64.
>>
>> The reason I want to introduce tdx_sys_metadata_read_single() is to
>> provide a simple and unified interface for other codes to read one
>> metadata field, instead of letting the caller to use temporary u64
>> variable and handle the cast or memcpy itself.
>>
>
> You can always use u64 to hold u16 metadata field AFAICT, so it doesn't have to
> be temporary.
>
> Here is what Isaku can do using the current API:
>
> u64 num_cpuid_config;
>
>
> ...
>
> tdx_sys_metadata_field_read(NUM_CPUID_CONFIG, &num_cpuid_config);
>
> tdx_info = kzalloc(calculate_tdx_info_size(num_cpuid_config), ...);
>
> tdx_info->num_cpuid_config = num_cpuid_config;

Doesn't num_cpuid_config serve as a temporary variable in some sense?

For this case, it needs to be used for calculating the size of tdx_info.
So we have to have it. But it's not the common case.

E.g., if we have another non-u64 field (e.g., field_x) in tdx_info, we
cannot read it via

tdx_sys_metadata_field_read(FIELD_X_ID, &tdx_info->field_x);

we have to use a temporary u64 variable.

> ...
>
> (you can do explicit (u16)num_cpuid_config type cast above if you want.)
>
> With your suggestion, here is what Isaku can do:
>
> u16 num_cpuid_config;
>
> ...
>
> tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
> sizeof(num_cpuid_config),
> &num_cpuid_config);
>
> tdx_info = kzalloc(calculate_tdx_info_size(num_cpuid_config), ...);
>
> tdx_info->num_cpuid_config = num_cpuid_config;
>
> ...
>
> I don't see big difference?
>
> One example that the current tdx_sys_metadata_field_read() doesn't quite fit is
> you have something like this:
>
> struct {
> u16 whatever;
> ...
> } st;
>
> tdx_sys_metadata_field_read(FIELD_ID_WHATEVER, &st.whatever);
>
> But for this use case you are not supposed to use tdx_sys_metadata_field_read(),
> but use tdx_sys_metadata_read() which has a mapping provided anyway.
>
> So, while I don't quite object your proposal, I don't see it being quite
> necessary.
>
> I'll let other people to have a say.
>
>


2024-03-15 05:39:54

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

On Fri, 2024-03-15 at 13:11 +0800, Li, Xiaoyao wrote:
> > Here is what Isaku can do using the current API:
> >
> >   u64 num_cpuid_config;
>  >
> >
> >   ...
> >
> >   tdx_sys_metadata_field_read(NUM_CPUID_CONFIG, &num_cpuid_config);
> >
> >   tdx_info = kzalloc(calculate_tdx_info_size(num_cpuid_config), ...);
> >
> >   tdx_info->num_cpuid_config = num_cpuid_config;
>
> Dosen't num_cpuid_config serve as temporary variable in some sense?

You need it regardless of whether it is u64 or u16.

>
> For this case, it needs to be used for calculating the size of tdx_info.
> So we have to have it. But it's not the common case.
>
> E.g., if we have another non-u64 field (e.g., field_x) in tdx_info, we
> cannot to read it via
>
> tdx_sys_metadata_field_read(FIELD_X_ID, &tdx_info->field_x);
>
> we have to use a temporary u64 variable.

Let me repeat what I said in my _previous_ reply:

"
One example that the current tdx_sys_metadata_field_read() doesn't quite fit is
you have something like this:

struct {
u16 whatever;
...
} st;

tdx_sys_metadata_field_read(FIELD_ID_WHATEVER, &st.whatever);

But for this use case you are not supposed to use tdx_sys_metadata_field_read(),
but use tdx_sys_metadata_read() which has a mapping provided anyway.
"

So sorry I am not seeing a real example from you.

2024-03-15 05:50:55

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

On 3/15/2024 1:39 PM, Huang, Kai wrote:
> On Fri, 2024-03-15 at 13:11 +0800, Li, Xiaoyao wrote:
>>> Here is what Isaku can do using the current API:
>>>
>>>   u64 num_cpuid_config;
>>  >
>>>
>>>   ...
>>>
>>>   tdx_sys_metadata_field_read(NUM_CPUID_CONFIG, &num_cpuid_config);
>>>
>>>   tdx_info = kzalloc(calculate_tdx_info_size(num_cpuid_config), ...);
>>>
>>>   tdx_info->num_cpuid_config = num_cpuid_config;
>>
>> Dosen't num_cpuid_config serve as temporary variable in some sense?
>
> You need it, regardless whether it is u64 or u16.
>
>>
>> For this case, it needs to be used for calculating the size of tdx_info.
>> So we have to have it. But it's not the common case.
>>
>> E.g., if we have another non-u64 field (e.g., field_x) in tdx_info, we
>> cannot to read it via
>>
>> tdx_sys_metadata_field_read(FIELD_X_ID, &tdx_info->field_x);
>>
>> we have to use a temporary u64 variable.
>
> Let me repeat below in my _previous_ reply:
>
> "
> One example that the current tdx_sys_metadata_field_read() doesn't quite fit is
> you have something like this:
>
> struct {
> u16 whatever;
> ...
> } st;
>
> tdx_sys_metadata_field_read(FIELD_ID_WHATEVER, &st.whatever);
>
> But for this use case you are not supposed to use tdx_sys_metadata_field_read(),
> but use tdx_sys_metadata_read() which has a mapping provided anyway.
> "

tdx_sys_metadata_read() is too complicated for just reading one field.

The caller needs to prepare a one-element array of "struct
tdx_metadata_field_mapping" and pass the correct offset.

> So sorry I am not seeing a real example from you.
>


2024-03-15 14:02:08

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Thu, 2024-03-14 at 18:35 -0700, Isaku Yamahata wrote:
> > On the subject of warnings and KVM_BUG_ON(), my feeling so far is
> > that
> > this series is quite aggressive about these. Is it due the
> > complexity
> > of the series? I think maybe we can remove some of the simple ones,
> > but
> > not sure if there was already some discussion on what level is
> > appropriate.
>
> KVM_BUG_ON() was helpful at the early stage.  Because we don't hit
> them
> recently, it's okay to remove them.  Will remove them.

Hmm. We probably need to do it case by case.

2024-03-15 16:23:10

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 034/130] KVM: TDX: Get system-wide info about TDX module on initialization

On Fri, Mar 15, 2024 at 12:09:29PM +1300,
"Huang, Kai" <[email protected]> wrote:

>
> > +struct tdx_info {
> > + u64 features0;
> > + u64 attributes_fixed0;
> > + u64 attributes_fixed1;
> > + u64 xfam_fixed0;
> > + u64 xfam_fixed1;
> > +
> > + u16 num_cpuid_config;
> > + /* This must the last member. */
> > + DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
> > +};
> > +
> > +/* Info about the TDX module. */
> > +static struct tdx_info *tdx_info;
> > +
> > #define TDX_MD_MAP(_fid, _ptr) \
> > { .fid = MD_FIELD_ID_##_fid, \
> > .ptr = (_ptr), }
> > @@ -66,7 +81,7 @@ static size_t tdx_md_element_size(u64 fid)
> > }
> > }
> > -static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > +static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > {
> > struct tdx_md_map *m;
> > int ret, i;
> > @@ -84,9 +99,26 @@ static int __used tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > return 0;
> > }
> > +#define TDX_INFO_MAP(_field_id, _member) \
> > + TD_SYSINFO_MAP(_field_id, struct tdx_info, _member)
> > +
> > static int __init tdx_module_setup(void)
> > {
> > + u16 num_cpuid_config;
> > int ret;
> > + u32 i;
> > +
> > + struct tdx_md_map mds[] = {
> > + TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> > + };
> > +
> > + struct tdx_metadata_field_mapping fields[] = {
> > + TDX_INFO_MAP(FEATURES0, features0),
> > + TDX_INFO_MAP(ATTRS_FIXED0, attributes_fixed0),
> > + TDX_INFO_MAP(ATTRS_FIXED1, attributes_fixed1),
> > + TDX_INFO_MAP(XFAM_FIXED0, xfam_fixed0),
> > + TDX_INFO_MAP(XFAM_FIXED1, xfam_fixed1),
> > + };
> > ret = tdx_enable();
> > if (ret) {
> > @@ -94,7 +126,48 @@ static int __init tdx_module_setup(void)
> > return ret;
> > }
> > + ret = tdx_md_read(mds, ARRAY_SIZE(mds));
> > + if (ret)
> > + return ret;
> > +
> > + tdx_info = kzalloc(sizeof(*tdx_info) +
> > + sizeof(*tdx_info->cpuid_configs) * num_cpuid_config,
> > + GFP_KERNEL);
> > + if (!tdx_info)
> > + return -ENOMEM;
> > + tdx_info->num_cpuid_config = num_cpuid_config;
> > +
> > + ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
> > + if (ret)
> > + goto error_out;
> > +
> > + for (i = 0; i < num_cpuid_config; i++) {
> > + struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
> > + u64 leaf, eax_ebx, ecx_edx;
> > + struct tdx_md_map cpuids[] = {
> > + TDX_MD_MAP(CPUID_CONFIG_LEAVES + i, &leaf),
> > + TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2, &eax_ebx),
> > + TDX_MD_MAP(CPUID_CONFIG_VALUES + i * 2 + 1, &ecx_edx),
> > + };
> > +
> > + ret = tdx_md_read(cpuids, ARRAY_SIZE(cpuids));
> > + if (ret)
> > + goto error_out;
> > +
> > + c->leaf = (u32)leaf;
> > + c->sub_leaf = leaf >> 32;
> > + c->eax = (u32)eax_ebx;
> > + c->ebx = eax_ebx >> 32;
> > + c->ecx = (u32)ecx_edx;
> > + c->edx = ecx_edx >> 32;
>
> OK I can see why you don't want to use ...
>
> struct tdx_metadata_field_mapping fields[] = {
> TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
> };
>
> ... to read num_cpuid_config first, because the memory to hold @tdx_info
> hasn't been allocated, because its size depends on the num_cpuid_config.
>
> And I confess it's because the tdx_sys_metadata_field_read() that got
> exposed in patch ("x86/virt/tdx: Export global metadata read
> infrastructure") only returns 'u64' for all metadata field, and you didn't
> want to use something like this:
>
> u64 num_cpuid_config;
>
> tdx_sys_metadata_field_read(..., &num_cpuid_config);
>
> ...
>
> tdx_info->num_cpuid_config = num_cpuid_config;
>
> Or you can explicitly cast:
>
> tdx_info->num_cpuid_config = (u16)num_cpuid_config;
>
> (I know people may don't like the assigning 'u64' to 'u16', but it seems
> nothing wrong to me, because the way done in (1) below effectively has the
> same result comparing to type case).
>
> But there are other (better) ways to do:
>
> 1) you can introduce a helper as suggested by Xiaoyao in [*]:
>
>
> int tdx_sys_metadata_read_single(u64 field_id,
> int bytes, void *buf)
> {
> return stbuf_read_sys_metadata_field(field_id, 0,
> bytes, buf);
> }
>
> And do:
>
> tdx_sys_metadata_read_single(NUM_CPUID_CONFIG,
> sizeof(num_cpuid_config), &num_cpuid_config);
>
> That's _much_ cleaner than the 'struct tdx_md_map', which only confuses
> people.
>
> But I don't think we need to do this as mentioned above -- we just do type
> cast.
>
> 2) You can just preallocate enough memory. It cannot be larger than 1024B,
> right? You can even just allocate one page. It's just 4K, no one cares.
>
> Then you can do:
>
> struct tdx_metadata_field_mapping tdx_info_fields = {
> ...
> TDX_INFO_MAP(NUM_CPUID_CONFIG, num_cpuid_config),
> };
>
> tdx_sys_metadata_read(tdx_info_fields,
> ARRAY_SIZE(tdx_info_fields, tdx_info);
>
> And then you read the CPUID_CONFIG array one by one using the same 'struct
> tdx_metadata_field_mapping' and tdx_sys_metadata_read():
>
>
> for (i = 0; i < tdx_info->num_cpuid_config; i++) {
> struct tdx_metadata_field_mapping cpuid_fields = {
> TDX_CPUID_CONFIG_MAP(CPUID_CONFIG_LEAVES + i,
> leaf),
> ...
> };
> struct kvm_tdx_cpuid_config *c =
> &tdx_info->cpuid_configs[i];
>
> tdx_sys_metadata_read(cpuid_fields,
> ARRAY_SIZE(cpuid_fields), c);
>
> ....
> }
>
> So stopping having the duplicated 'struct tdx_md_map' and related staff, as
> they are absolutely unnecessary and only confuses people.


OK, I'll rewrite the code to use tdx_sys_metadata_read() by introducing
a temporary struct in function scope.
--
Isaku Yamahata <[email protected]>

2024-03-15 16:33:38

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Thu, Mar 14, 2024, Dave Hansen wrote:
> On 3/14/24 18:17, Edgecombe, Rick P wrote:
> > I guess there are three options:
> > 1. Export the low level seamcall function
> > 2. Export a bunch of higher level helper functions
> > 3. Duplicate __seamcall asm in KVM
> >
> > Letting modules make unrestricted seamcalls is not ideal. Preventing
> > the compiler from inlining the small logic in the static inline helpers
> > is not ideal. Duplicating code is not ideal. Hmm.
> >
> > I want to say 2 sounds the least worst of the three. But I'm not sure.
> > I'm not sure if x86 folks would like to police new seamcalls, or be
> > bothered by it, either.
>
> #3 is the only objectively awful one. :)
>
> In the end, we actually _want_ to have conversations about these things.
> There are going to be considerations about what functionality should be
> in KVM or the core kernel. We don't want KVM doing any calls that could
> affect global TDX module state, for instance.

Heh, Like this one?

static inline u64 tdh_sys_lp_shutdown(void)
{
struct tdx_module_args in = {
};

return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
}

Which isn't actually used...

> But I'd also defer to the KVM maintainers on this. They're the ones
> that have to play the symbol exporting game a lot more than I ever do.
> If they cringe at the idea of adding 20 (or whatever) exports, then
> that's a lot more important than the possibility of some other silly
> module abusing the generic exported __seamcall.

I don't care much about exports. What I do care about is sane code, and while
the current code _looks_ pretty, it's actually quite insane.

I get why y'all put SEAMCALL in assembly subroutines; the macro shenanigans I
originally wrote years ago were their own brand of crazy, and dealing with GPRs
that can't be asm() constraints often results in brittle code.

But the tdx_module_args structure approach generates truly atrocious code. Yes,
SEAMCALL is inherently slow, but that doesn't mean that we shouldn't at least try
to generate efficient code. And it's not just efficiency that is lost, the
generated code ends up being much harder to read than it ought to be.

E.g. the seemingly simple

static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
struct tdx_module_args *out)
{
struct tdx_module_args in = {
.rcx = gpa | level,
.rdx = tdr,
};

return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
}

generates the below monstrosity with gcc-13. And that's just one SEAMCALL wrapper,
*every* single one generates the same mess. clang-16 is kinda sorta a little
better, as it at least inlines the helpers that have single callers.

So my feedback is to not worry about the exports, and instead focus on figuring
out a way to make the generated code less bloated and easier to read/debug.

Dump of assembler code for function tdh_mem_page_remove:
0x0000000000032b20 <+0>: push %r15
0x0000000000032b22 <+2>: xor %eax,%eax
0x0000000000032b24 <+4>: movabs $0x8000ff0000000006,%r15
0x0000000000032b2e <+14>: push %r14
0x0000000000032b30 <+16>: mov %rcx,%r14
0x0000000000032b33 <+19>: mov $0xb,%ecx
0x0000000000032b38 <+24>: push %r13
0x0000000000032b3a <+26>: movslq %edx,%r13
0x0000000000032b3d <+29>: push %r12
0x0000000000032b3f <+31>: or %rsi,%r13
0x0000000000032b42 <+34>: mov $0x11,%r12d
0x0000000000032b48 <+40>: push %rbp
0x0000000000032b49 <+41>: movabs $0x8000020300000000,%rbp
0x0000000000032b53 <+51>: push %rbx
0x0000000000032b54 <+52>: sub $0x70,%rsp
0x0000000000032b58 <+56>: mov %rdi,(%rsp)
0x0000000000032b5c <+60>: lea 0x18(%rsp),%rdi
0x0000000000032b61 <+65>: rep stos %rax,%es:(%rdi)
0x0000000000032b64 <+68>: mov (%rsp),%rax
0x0000000000032b68 <+72>: mov %r13,(%r14)
0x0000000000032b6b <+75>: mov $0xa,%ebx
0x0000000000032b70 <+80>: mov %rax,0x8(%r14)
0x0000000000032b74 <+84>: mov 0x18(%rsp),%rax
0x0000000000032b79 <+89>: mov %rax,0x10(%r14)
0x0000000000032b7d <+93>: mov 0x20(%rsp),%rax
0x0000000000032b82 <+98>: mov %rax,0x18(%r14)
0x0000000000032b86 <+102>: mov 0x28(%rsp),%rax
0x0000000000032b8b <+107>: mov %rax,0x20(%r14)
0x0000000000032b8f <+111>: mov 0x30(%rsp),%rax
0x0000000000032b94 <+116>: mov %rax,0x28(%r14)
0x0000000000032b98 <+120>: mov 0x38(%rsp),%rax
0x0000000000032b9d <+125>: mov %rax,0x30(%r14)
0x0000000000032ba1 <+129>: mov 0x40(%rsp),%rax
0x0000000000032ba6 <+134>: mov %rax,0x38(%r14)
0x0000000000032baa <+138>: mov 0x48(%rsp),%rax
0x0000000000032baf <+143>: mov %rax,0x40(%r14)
0x0000000000032bb3 <+147>: mov 0x50(%rsp),%rax
0x0000000000032bb8 <+152>: mov %rax,0x48(%r14)
0x0000000000032bbc <+156>: mov 0x58(%rsp),%rax
0x0000000000032bc1 <+161>: mov %rax,0x50(%r14)
0x0000000000032bc5 <+165>: mov 0x60(%rsp),%rax
0x0000000000032bca <+170>: mov %rax,0x58(%r14)
0x0000000000032bce <+174>: mov 0x68(%rsp),%rax
0x0000000000032bd3 <+179>: mov %rax,0x60(%r14)
0x0000000000032bd7 <+183>: mov %r14,%rsi
0x0000000000032bda <+186>: mov $0x1d,%edi
0x0000000000032bdf <+191>: call 0x32be4 <tdh_mem_page_remove+196>
0x0000000000032be4 <+196>: cmp %rbp,%rax
0x0000000000032be7 <+199>: jne 0x32bfd <tdh_mem_page_remove+221>
0x0000000000032be9 <+201>: sub $0x1,%ebx
0x0000000000032bec <+204>: jne 0x32bd7 <tdh_mem_page_remove+183>
0x0000000000032bee <+206>: add $0x70,%rsp
0x0000000000032bf2 <+210>: pop %rbx
0x0000000000032bf3 <+211>: pop %rbp
0x0000000000032bf4 <+212>: pop %r12
0x0000000000032bf6 <+214>: pop %r13
0x0000000000032bf8 <+216>: pop %r14
0x0000000000032bfa <+218>: pop %r15
0x0000000000032bfc <+220>: ret
0x0000000000032bfd <+221>: cmp %r15,%rax
0x0000000000032c00 <+224>: je 0x32c2a <tdh_mem_page_remove+266>
0x0000000000032c02 <+226>: movabs $0x8000020000000092,%rdx
0x0000000000032c0c <+236>: cmp %rdx,%rax
0x0000000000032c0f <+239>: jne 0x32bee <tdh_mem_page_remove+206>
0x0000000000032c11 <+241>: sub $0x1,%r12d
0x0000000000032c15 <+245>: jne 0x32b64 <tdh_mem_page_remove+68>
0x0000000000032c1b <+251>: add $0x70,%rsp
0x0000000000032c1f <+255>: pop %rbx
0x0000000000032c20 <+256>: pop %rbp
0x0000000000032c21 <+257>: pop %r12
0x0000000000032c23 <+259>: pop %r13
0x0000000000032c25 <+261>: pop %r14
0x0000000000032c27 <+263>: pop %r15
0x0000000000032c29 <+265>: ret
0x0000000000032c2a <+266>: call 0x32c2f <tdh_mem_page_remove+271>
0x0000000000032c2f <+271>: xor %eax,%eax
0x0000000000032c31 <+273>: add $0x70,%rsp
0x0000000000032c35 <+277>: pop %rbx
0x0000000000032c36 <+278>: pop %rbp
0x0000000000032c37 <+279>: pop %r12
0x0000000000032c39 <+281>: pop %r13
0x0000000000032c3b <+283>: pop %r14
0x0000000000032c3d <+285>: pop %r15
0x0000000000032c3f <+287>: ret
End of assembler dump.

2024-03-15 17:31:36

by Sean Christopherson

Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Mon, Feb 26, 2024, [email protected] wrote:
> +static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
> +{
> + struct tdx_module_args args;
> +
> + /*
> + * Avoid section mismatch with to_tdx() with KVM_VM_BUG(). The caller
> + * should call to_tdx().

C'mon. I don't think it's unreasonable to expect that at least one of the many
people working on TDX would figure out why to_vmx() is __always_inline.

> + */
> + struct kvm_vcpu *vcpu = &tdx->vcpu;
> +
> + guest_state_enter_irqoff();
> +
> + /*
> + * TODO: optimization:
> + * - Eliminate copy between args and vcpu->arch.regs.
> + * - copyin/copyout registers only if (tdx->tdvmcall.regs_mask != 0)
> + * which means TDG.VP.VMCALL.
> + */
> + args = (struct tdx_module_args) {
> + .rcx = tdx->tdvpr_pa,
> +#define REG(reg, REG) .reg = vcpu->arch.regs[VCPU_REGS_ ## REG]

Organizing tdx_module_args's registers by volatile vs. non-volatile is asinine.
This code should not need to exist.

> + WARN_ON_ONCE(!kvm_rebooting &&
> + (tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);
> +
> + guest_state_exit_irqoff();
> +}
> +
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (unlikely(!tdx->initialized))
> + return -EINVAL;
> + if (unlikely(vcpu->kvm->vm_bugged)) {
> + tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> + return EXIT_FASTPATH_NONE;
> + }
> +
> + trace_kvm_entry(vcpu);
> +
> + tdx_vcpu_enter_exit(tdx);
> +
> + vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> + trace_kvm_exit(vcpu, KVM_ISA_VMX);
> +
> + return EXIT_FASTPATH_NONE;
> +}
> +
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> {
> WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index d822e790e3e5..81d301fbe638 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -27,6 +27,37 @@ struct kvm_tdx {
> struct page *source_page;
> };
>
> +union tdx_exit_reason {
> + struct {
> + /* 31:0 mirror the VMX Exit Reason format */

Then use "union vmx_exit_reason", having to maintain duplicate copies of the same
union is not something I want to do.

I'm honestly not even convinced that "union tdx_exit_reason" needs to exist. I
added vmx_exit_reason because we kept having bugs where KVM would fail to strip
bits 31:16, and because nested VMX needs to stuff failed_vmentry, but I don't
see a similar need for TDX.

I would even go so far as to say the vcpu_tdx field shouldn't be exit_reason,
and instead should be "return_code" or something. E.g. if the TDX module refuses
to run the vCPU, there's no VM-Enter and thus no VM-Exit (unless you count the
SEAMCALL itself, har har). Ditto for #GP or #UD on the SEAMCALL (or any other
reason that generates TDX_SW_ERROR).

Ugh, I'm doubling down on that suggestion. This:

WARN_ON_ONCE(!kvm_rebooting &&
(tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR);

if ((u16)tdx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
is_nmi(tdexit_intr_info(vcpu))) {
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
vmx_do_nmi_irqoff();
kvm_after_interrupt(vcpu);
}

is heinous. If there's an error that leaves bits 15:0 zero, KVM will synthesize
a spurious NMI. I don't know whether or not that can happen, but it's not
something that should even be possible in KVM, i.e. the exit reason should be
processed if and only if KVM *knows* there was a sane VM-Exit from non-root mode.

tdx_vcpu_run() has a similar issue, though it's probably benign. If there's an
error in bits 15:0 that happens to collide with EXIT_REASON_TDCALL, weird things
will happen.

if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
else
tdx->tdvmcall.rcx = 0;

I vote for something like the below, with much more robust checking of vp_enter_ret
before it's converted to a VMX exit reason.

static __always_inline union vmx_exit_reason tdexit_exit_reason(struct kvm_vcpu *vcpu)
{
return (u32)vcpu->vp_enter_ret;
}

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index af3a2b8afee8..b9b40b2eaccb 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -43,37 +43,6 @@ struct kvm_tdx {
struct page *source_page;
};

-union tdx_exit_reason {
- struct {
- /* 31:0 mirror the VMX Exit Reason format */
- u64 basic : 16;
- u64 reserved16 : 1;
- u64 reserved17 : 1;
- u64 reserved18 : 1;
- u64 reserved19 : 1;
- u64 reserved20 : 1;
- u64 reserved21 : 1;
- u64 reserved22 : 1;
- u64 reserved23 : 1;
- u64 reserved24 : 1;
- u64 reserved25 : 1;
- u64 bus_lock_detected : 1;
- u64 enclave_mode : 1;
- u64 smi_pending_mtf : 1;
- u64 smi_from_vmx_root : 1;
- u64 reserved30 : 1;
- u64 failed_vmentry : 1;
-
- /* 63:32 are TDX specific */
- u64 details_l1 : 8;
- u64 class : 8;
- u64 reserved61_48 : 14;
- u64 non_recoverable : 1;
- u64 error : 1;
- };
- u64 full;
-};
-
struct vcpu_tdx {
struct kvm_vcpu vcpu;

@@ -103,7 +72,8 @@ struct vcpu_tdx {
};
u64 rcx;
} tdvmcall;
- union tdx_exit_reason exit_reason;
+
+ u64 vp_enter_ret;

bool initialized;



2024-03-15 17:41:48

by Sean Christopherson

Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Mon, Feb 26, 2024, [email protected] wrote:
> +static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
> + struct tdx_module_args *out)
> +{
> + u64 ret;
> +
> + if (out) {
> + *out = *in;
> + ret = seamcall_ret(op, out);
> + } else
> + ret = seamcall(op, in);
> +
> + if (unlikely(ret == TDX_SEAMCALL_UD)) {
> + /*
> + * SEAMCALLs fail with TDX_SEAMCALL_UD returned when VMX is off.
> + * This can happen when the host gets rebooted or live
> + * updated. In this case, the instruction execution is ignored
> + * as KVM is shut down, so the error code is suppressed. Other
> + * than this, the error is unexpected and the execution can't
> + * continue as the TDX features rely on VMX to be on.
> + */
> + kvm_spurious_fault();
> + return 0;

This is nonsensical. The reason KVM liberally uses BUG_ON(!kvm_rebooting) is
because it *greatly* simplifies the overall code by obviating the need for KVM to
check for errors that should never happen in practice. Oh, and

But KVM quite obviously needs to check the return code for all SEAMCALLs, and
the SEAMCALLs are (a) wrapped in functions and (b) preserve host state, i.e. we
don't need to worry about KVM consuming garbage or running with unknown hardware
state because something like INVVPID or INVEPT faulted.

Oh, and the other critical aspect of all of this is that unlike VMREAD, VMWRITE,
etc., SEAMCALLs almost always require a TDR or TDVPR, i.e. need a VM or vCPU.
Now that we've abandoned the macro shenanigans that allowed things like
tdh_mem_page_add() to be pure translators to their respective SEAMCALL, I don't
see any reason to take the physical addresses of the TDR/TDVPR in the helpers.

I.e. if we do:

u64 tdh_mng_addcx(struct kvm *kvm, hpa_t addr)

then the intermediate wrapper to the SEAMCALL assembly has the vCPU or VM and
thus can precisely terminate the one problematic VM.

So unless I'm missing something, I think that kvm_spurious_fault() should be
persona non grata for TDX, and that KVM should instead use KVM_BUG_ON().

2024-03-15 17:45:18

by Sean Christopherson

Subject: Re: [PATCH v19 098/130] KVM: TDX: Add a place holder to handle TDX VM exit

On Mon, Feb 26, 2024, [email protected] wrote:
> +int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> +{
> + union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> +
> + /* See the comment of tdh_sept_seamcall(). */
> + if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)))
> + return 1;
> +
> + /*
> + * TDH.VP.ENTRY checks TD EPOCH which contend with TDH.MEM.TRACK and
> + * vcpu TDH.VP.ENTER.
> + */
> + if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_TD_EPOCH)))
> + return 1;
> +
> + if (unlikely(exit_reason.full == TDX_SEAMCALL_UD)) {
> + kvm_spurious_fault();
> + /*
> + * In the case of reboot or kexec, loop with TDH.VP.ENTER and
> + * TDX_SEAMCALL_UD to avoid unnecessary activity.
> + */
> + return 1;

No. This is unnecessarily risky. KVM_BUG_ON() and exit to userspace. The
response to "SEAMCALL faulted" should never be, "well, let's try again!".

Also, what about #GP on SEAMCALL? In general, the error handling here seems
lacking.

2024-03-15 17:47:12

by Sean Christopherson

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Mar 15, 2024, Sean Christopherson wrote:
> So my feedback is to not worry about the exports, and instead focus on figuring
> out a way to make the generated code less bloated and easier to read/debug.

Oh, and please make it a collaborative, public effort. I don't want to hear
crickets and then see v20 dropped with a completely new SEAMCALL scheme.

2024-03-15 17:49:18

by Edgecombe, Rick P

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, 2024-03-15 at 09:33 -0700, Sean Christopherson wrote:
> Heh, Like this one?
>
>         static inline u64 tdh_sys_lp_shutdown(void)
>         {
>                 struct tdx_module_args in = {
>                 };
>        
>                 return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
>         }
>
> Which isn't actually used...

Looks like it was turned into a NOP in TDX 1.5, so it will forever be
dead code. I see one other that is unused. Thanks for pointing it out.

>
> > But I'd also defer to the KVM maintainers on this.  They're the
> > ones
> > that have to play the symbol exporting game a lot more than I ever
> > do.
> > If they cringe at the idea of adding 20 (or whatever) exports, then
> > that's a lot more important than the possibility of some other
> > silly
> > module abusing the generic exported __seamcall.
>
> I don't care much about exports.  What I do care about is sane code,
> and while
> the current code _looks_ pretty, it's actually quite insane.
>
> I get why y'all put SEAMCALL in assembly subroutines; the macro
> shenanigans I
> originally wrote years ago were their own brand of crazy, and dealing
> with GPRs
> that can't be asm() constraints often results in brittle code.

I guess it must be this, for the initiated:
https://lore.kernel.org/lkml/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/

>
> But the tdx_module_args structure approach generates truly atrocious
> code.  Yes,
> SEAMCALL is inherently slow, but that doesn't mean that we shouldn't
> at least try
> to generate efficient code.  And it's not just efficiency that is
> lost, the
> generated code ends up being much harder to read than it ought to be.
>
>
[snip]
>
> So my feedback is to not worry about the exports, and instead focus
> on figuring
> out a way to make the generated code less bloated and easier to
> read/debug.
>

Thanks for the feedback both! It sounds like everyone is flexible on
the exports. As for the generated code, oof.

Kai, I see the solution has gone through some iterations already. First
the macro one linked above, then that was dropped pretty quick to
something that loses the asm constraints:
https://lore.kernel.org/lkml/e777bbbe10b1ec2c37d85dcca2e175fe3bc565ec.1625186503.git.isaku.yamahata@intel.com/

Then next the struct grew here, and here:
https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/[email protected]/

Not sure I understand all of the constraints yet. Do you have any
ideas?

2024-03-15 17:54:03

by Edgecombe, Rick P

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, 2024-03-15 at 10:46 -0700, Sean Christopherson wrote:
> On Fri, Mar 15, 2024, Sean Christopherson wrote:
> > So my feedback is to not worry about the exports, and instead focus
> > on figuring
> > out a way to make the generated code less bloated and easier to
> > read/debug.
>
> Oh, and please make it a collaborative, public effort.  I don't want
> to hear
> crickets and then see v20 dropped with a completely new SEAMCALL
> scheme.

And here we were worrying that people might eventually grow tired of
us adding mails to v19 if we debate every detail in public. Will do.

2024-03-15 18:29:13

by Dave Hansen

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On 3/15/24 09:33, Sean Christopherson wrote:
> static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
> struct tdx_module_args *out)
> {
> struct tdx_module_args in = {
> .rcx = gpa | level,
> .rdx = tdr,
> };
>
> return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
> }
>
> generates the below monstrosity with gcc-13. And that's just one SEAMCALL wrapper,
> *every* single one generates the same mess. clang-16 is kinda sorta a little
> better, as it at least inlines the helpers that have single callers.

Yeah, that's really awful.

Is all the inlining making the compiler too ambitious? Why is this all
inlined in the first place?

tdh_mem_page_remove() _should_ just be logically:

* initialize tdx_module_args. Move a few things into place on
the stack and zero the rest.
* Put a pointer to tdx_module_args in a register
* Put TDH_MEM_PAGE_REMOVE immediate in a register
* Some register preservation, maybe
* call
* maybe some cleanup
* return

Those logical things are *NOT* easy to spot in the disassembly.

2024-03-15 19:20:48

by Sean Christopherson

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Mar 15, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-03-15 at 10:46 -0700, Sean Christopherson wrote:
> > On Fri, Mar 15, 2024, Sean Christopherson wrote:
> > > So my feedback is to not worry about the exports, and instead focus on
> > > figuring out a way to make the generated code less bloated and easier to
> > > read/debug.
> >
> > Oh, and please make it a collaborative, public effort.  I don't want to
> > hear crickets and then see v20 dropped with a completely new SEAMCALL
> > scheme.
>
> And here we were worrying that people might eventually grow tired of
> us adding mails to v19 if we debate every detail in public. Will do.

As a general rule, I _strongly_ prefer all review to be done on-list, in public.
Copy+pasting myself from another Intel series[*]

: Correct, what I object to is Intel _requiring_ a Reviewed-by before posting.
:
: And while I'm certainly not going to refuse patches that have been reviewed
: internally, I _strongly_ prefer reviews be on-list so that they are public and
: recorded. Being able to go back and look at the history and evolution of patches
: is valuable, and the discussion itself is often beneficial to non-participants,
: e.g. people that are new-ish to KVM and/or aren't familiar with the feature being
: enabled can often learn new things and avoid similar pitfalls of their own.

There are definitely situations where exceptions are warranted, e.g. if someone
is a first-time poster and/or wants a sanity check to make sure their idea isn't
completely crazy. But even then, the internal review should only be very cursory.

In addition to the history being valuable, doing reviews in public minimizes the
probability of a developer being led astray, e.g. due to someone internally saying
do XYZ, and then upstream reviewers telling them to do something entirely different.

As far as noise goes, look at it this way. Every time a new TDX series is posted,
I get 130+ emails. Y'all can do a _lot_ of public review and discussion before
you'll get anywhere near the point where it'd be noiser than spinning a new version
of the series.

[*] https://lore.kernel.org/all/[email protected]

2024-03-15 19:24:03

by Isaku Yamahata

Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Fri, Mar 15, 2024 at 10:41:28AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > +static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
> > + struct tdx_module_args *out)
> > +{
> > + u64 ret;
> > +
> > + if (out) {
> > + *out = *in;
> > + ret = seamcall_ret(op, out);
> > + } else
> > + ret = seamcall(op, in);
> > +
> > + if (unlikely(ret == TDX_SEAMCALL_UD)) {
> > + /*
> > + * SEAMCALLs fail with TDX_SEAMCALL_UD returned when VMX is off.
> > + * This can happen when the host gets rebooted or live
> > + * updated. In this case, the instruction execution is ignored
> > + * as KVM is shut down, so the error code is suppressed. Other
> > + * than this, the error is unexpected and the execution can't
> > + * continue as the TDX features rely on VMX to be on.
> > + */
> > + kvm_spurious_fault();
> > + return 0;
>
> This is nonsensical. The reason KVM liberally uses BUG_ON(!kvm_rebooting) is
> because it *greatly* simplifies the overall code by obviating the need for KVM to
> check for errors that should never happen in practice. Oh, and
>
> But KVM quite obviously needs to check the return code for all SEAMCALLs, and
> the SEAMCALLs are (a) wrapped in functions and (b) preserve host state, i.e. we
> don't need to worry about KVM consuming garbage or running with unknown hardware
> state because something like INVVPID or INVEPT faulted.
>
> Oh, and the other critical aspect of all of this is that unlike VMREAD, VMWRITE,
> etc., SEAMCALLs almost always require a TDR or TDVPR, i.e. need a VM or vCPU.
> Now that we've abandoned the macro shenanigans that allowed things like
> tdh_mem_page_add() to be pure translators to their respective SEAMCALL, I don't
> see any reason to take the physical addresses of the TDR/TDVPR in the helpers.
>
> I.e. if we do:
>
> u64 tdh_mng_addcx(struct kvm *kvm, hpa_t addr)
>
> then the intermediate wrapper to the SEAMCALL assembly has the vCPU or VM and
> thus can precisely terminate the one problematic VM.
>
> So unless I'm missing something, I think that kvm_spurious_fault() should be
> persona non grata for TDX, and that KVM should instead use KVM_BUG_ON().

Thank you for the feedback. As I don't see any issue with doing so, I'll convert
those wrappers to take struct kvm_tdx or struct vcpu_tdx, and eliminate
kvm_spurious_fault() in favor of KVM_BUG_ON().
--
Isaku Yamahata <[email protected]>

2024-03-15 19:38:59

by Sean Christopherson

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Mar 15, 2024, Dave Hansen wrote:
> On 3/15/24 09:33, Sean Christopherson wrote:
> > static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
> > struct tdx_module_args *out)
> > {
> > struct tdx_module_args in = {
> > .rcx = gpa | level,
> > .rdx = tdr,
> > };
> >
> > return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
> > }
> >
> > generates the below monstrosity with gcc-13. And that's just one SEAMCALL wrapper,
> > *every* single one generates the same mess. clang-16 is kinda sorta a little
> > better, as it at least inlines the helpers that have single callers.
>
> Yeah, that's really awful.
>
> Is all the inlining making the compiler too ambitious?

No, whether or not the wrappers are inlined doesn't change anything. gcc actually
doesn't inline any of these helpers. More below.

> Why is this all inlined in the first place?

Likely because no one looked at the generated code. The C code is super simple
and looks like it should be inlined.

And the very original code was macro heavy, i.e. relied on inlining to allow the
compiler to precisely set only the actual registers needed for the SEAMCALL.

> tdh_mem_page_remove() _should_ just be logically:
>
> * initialize tdx_module_args. Move a few things into place on
> the stack and zero the rest.

The "zero the rest" is what generates the fugly code. The underlying problem is
that the SEAMCALL assembly functions unpack _all_ registers from tdx_module_args.
As a result, tdx_module_args needs to be zeroed to avoid loading registers with
uninitialized stack data.

E.g. before I looked at the assembly code, my initial thought to clean things
up by doing:

struct tdx_module_args in;

in.rcx = gpa | level;
in.rdx = tdr;

but that would make one or more sanitizers (and maybe even the compiler itself)
more than a bit unhappy.

The struct is 72 bytes, which adds up to a lot of wasted effort since the majority
of SEAMCALLs only use a few of the 13 registers.

FWIW, the guest side of TDX is equally gross. E.g. to do kvm_hypercall1(), which
without TDX is simply

0xffffffff810eb4a2 <+82>: 48 c7 c0 8c 9a 01 00 mov $0x19a8c,%rax
0xffffffff810eb4a9 <+89>: 8b 1c 02 mov (%rdx,%rax,1),%ebx
0xffffffff810eb4ac <+92>: b8 0b 00 00 00 mov $0xb,%eax
0xffffffff810eb4b1 <+97>: 0f 01 c1 vmcall
0xffffffff810eb4b4 <+100>: 5b pop %rbx
0xffffffff810eb4b5 <+101>: 5d pop %rbp

the kernel blasts the unused params

0xffffffff810ffdc1 <+97>: bf 0b 00 00 00 mov $0xb,%edi
0xffffffff810ffdc6 <+102>: 31 d2 xor %edx,%edx
0xffffffff810ffdc8 <+104>: 31 c9 xor %ecx,%ecx
0xffffffff810ffdca <+106>: 45 31 c0 xor %r8d,%r8d
0xffffffff810ffdcd <+109>: 5b pop %rbx
0xffffffff810ffdce <+110>: 41 5e pop %r14
0xffffffff810ffdd0 <+112>: e9 bb 1b f0 ff jmp 0xffffffff81001990 <tdx_kvm_hypercall>

then loads and zeros a ton of memory (tdx_kvm_hypercall()):

0xffffffff81001990 <+0>: nopl 0x0(%rax,%rax,1)
0xffffffff81001995 <+5>: sub $0x70,%rsp
0xffffffff81001999 <+9>: mov %gs:0x28,%rax
0xffffffff810019a2 <+18>: mov %rax,0x68(%rsp)
0xffffffff810019a7 <+23>: mov %edi,%eax
0xffffffff810019a9 <+25>: movq $0x0,0x18(%rsp)
0xffffffff810019b2 <+34>: movq $0x0,0x10(%rsp)
0xffffffff810019bb <+43>: movq $0x0,0x8(%rsp)
0xffffffff810019c4 <+52>: movq $0x0,(%rsp)
0xffffffff810019cc <+60>: mov %rax,0x20(%rsp)
0xffffffff810019d1 <+65>: mov %rsi,0x28(%rsp)
0xffffffff810019d6 <+70>: mov %rdx,0x30(%rsp)
0xffffffff810019db <+75>: mov %rcx,0x38(%rsp)
0xffffffff810019e0 <+80>: mov %r8,0x40(%rsp)
0xffffffff810019e5 <+85>: movq $0x0,0x48(%rsp)
0xffffffff810019ee <+94>: movq $0x0,0x50(%rsp)
0xffffffff810019f7 <+103>: movq $0x0,0x58(%rsp)
0xffffffff81001a00 <+112>: movq $0x0,0x60(%rsp)
0xffffffff81001a09 <+121>: mov %rsp,%rdi
0xffffffff81001a0c <+124>: call 0xffffffff819f0a80 <__tdx_hypercall>
0xffffffff81001a11 <+129>: mov %gs:0x28,%rcx
0xffffffff81001a1a <+138>: cmp 0x68(%rsp),%rcx
0xffffffff81001a1f <+143>: jne 0xffffffff81001a26 <tdx_kvm_hypercall+150>
0xffffffff81001a21 <+145>: add $0x70,%rsp
0xffffffff81001a25 <+149>: ret

and then unpacks all of that memory back into registers, and reverses that last
part on the way back, (__tdcall_saved_ret()):

0xffffffff819f0b10 <+0>: mov %rdi,%rax
0xffffffff819f0b13 <+3>: mov (%rsi),%rcx
0xffffffff819f0b16 <+6>: mov 0x8(%rsi),%rdx
0xffffffff819f0b1a <+10>: mov 0x10(%rsi),%r8
0xffffffff819f0b1e <+14>: mov 0x18(%rsi),%r9
0xffffffff819f0b22 <+18>: mov 0x20(%rsi),%r10
0xffffffff819f0b26 <+22>: mov 0x28(%rsi),%r11
0xffffffff819f0b2a <+26>: push %rbx
0xffffffff819f0b2b <+27>: push %r12
0xffffffff819f0b2d <+29>: push %r13
0xffffffff819f0b2f <+31>: push %r14
0xffffffff819f0b31 <+33>: push %r15
0xffffffff819f0b33 <+35>: mov 0x30(%rsi),%r12
0xffffffff819f0b37 <+39>: mov 0x38(%rsi),%r13
0xffffffff819f0b3b <+43>: mov 0x40(%rsi),%r14
0xffffffff819f0b3f <+47>: mov 0x48(%rsi),%r15
0xffffffff819f0b43 <+51>: mov 0x50(%rsi),%rbx
0xffffffff819f0b47 <+55>: push %rsi
0xffffffff819f0b48 <+56>: mov 0x58(%rsi),%rdi
0xffffffff819f0b4c <+60>: mov 0x60(%rsi),%rsi
0xffffffff819f0b50 <+64>: tdcall
0xffffffff819f0b54 <+68>: push %rax
0xffffffff819f0b55 <+69>: mov 0x8(%rsp),%rax
0xffffffff819f0b5a <+74>: mov %rsi,0x60(%rax)
0xffffffff819f0b5e <+78>: pop %rax
0xffffffff819f0b5f <+79>: pop %rsi
0xffffffff819f0b60 <+80>: mov %r12,0x30(%rsi)
0xffffffff819f0b64 <+84>: mov %r13,0x38(%rsi)
0xffffffff819f0b68 <+88>: mov %r14,0x40(%rsi)
0xffffffff819f0b6c <+92>: mov %r15,0x48(%rsi)
0xffffffff819f0b70 <+96>: mov %rbx,0x50(%rsi)
0xffffffff819f0b74 <+100>: mov %rdi,0x58(%rsi)
0xffffffff819f0b78 <+104>: mov %rcx,(%rsi)
0xffffffff819f0b7b <+107>: mov %rdx,0x8(%rsi)
0xffffffff819f0b7f <+111>: mov %r8,0x10(%rsi)
0xffffffff819f0b83 <+115>: mov %r9,0x18(%rsi)
0xffffffff819f0b87 <+119>: mov %r10,0x20(%rsi)
0xffffffff819f0b8b <+123>: mov %r11,0x28(%rsi)
0xffffffff819f0b8f <+127>: xor %ecx,%ecx
0xffffffff819f0b91 <+129>: xor %edx,%edx
0xffffffff819f0b93 <+131>: xor %r8d,%r8d
0xffffffff819f0b96 <+134>: xor %r9d,%r9d
0xffffffff819f0b99 <+137>: xor %r10d,%r10d
0xffffffff819f0b9c <+140>: xor %r11d,%r11d
0xffffffff819f0b9f <+143>: xor %r12d,%r12d
0xffffffff819f0ba2 <+146>: xor %r13d,%r13d
0xffffffff819f0ba5 <+149>: xor %r14d,%r14d
0xffffffff819f0ba8 <+152>: xor %r15d,%r15d
0xffffffff819f0bab <+155>: xor %ebx,%ebx
0xffffffff819f0bad <+157>: xor %edi,%edi
0xffffffff819f0baf <+159>: pop %r15
0xffffffff819f0bb1 <+161>: pop %r14
0xffffffff819f0bb3 <+163>: pop %r13
0xffffffff819f0bb5 <+165>: pop %r12
0xffffffff819f0bb7 <+167>: pop %rbx
0xffffffff819f0bb8 <+168>: ret

It's honestly quite amusing, because y'all took what I see as one of the big
advantages of TDX over SEV (using registers instead of shared memory), and managed
to effectively turn it into a disadvantage.

Again, I completely understand the maintenance and robustness benefits, but IMO
the pendulum swung a bit too far in that direction.

2024-03-15 20:43:08

by Isaku Yamahata

Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Fri, Mar 15, 2024 at 10:26:30AM -0700,
Sean Christopherson <[email protected]> wrote:

> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index d822e790e3e5..81d301fbe638 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -27,6 +27,37 @@ struct kvm_tdx {
> > struct page *source_page;
> > };
> >
> > +union tdx_exit_reason {
> > + struct {
> > + /* 31:0 mirror the VMX Exit Reason format */
>
> Then use "union vmx_exit_reason", having to maintain duplicate copies of the same
> union is not something I want to do.
>
> I'm honestly not even convinced that "union tdx_exit_reason" needs to exist. I
> added vmx_exit_reason because we kept having bugs where KVM would fail to strip
> bits 31:16, and because nested VMX needs to stuff failed_vmentry, but I don't
> see a similar need for TDX.
>
> I would even go so far as to say the vcpu_tdx field shouldn't be exit_reason,
> and instead should be "return_code" or something. E.g. if the TDX module refuses
> to run the vCPU, there's no VM-Enter and thus no VM-Exit (unless you count the
> SEAMCALL itself, har har). Ditto for #GP or #UD on the SEAMCALL (or any other
> reason that generates TDX_SW_ERROR).
>
> Ugh, I'm doubling down on that suggestion. This:
>
> WARN_ON_ONCE(!kvm_rebooting &&
> (tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR);
>
> if ((u16)tdx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
> is_nmi(tdexit_intr_info(vcpu))) {
> kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
> vmx_do_nmi_irqoff();
> kvm_after_interrupt(vcpu);
> }
>
> is heinous. If there's an error that leaves bits 15:0 zero, KVM will synthesize
> a spurious NMI. I don't know whether or not that can happen, but it's not
> something that should even be possible in KVM, i.e. the exit reason should be
> processed if and only if KVM *knows* there was a sane VM-Exit from non-root mode.
>
> tdx_vcpu_run() has a similar issue, though it's probably benign. If there's an
> error in bits 15:0 that happens to collide with EXIT_REASON_TDCALL, weird things
> will happen.
>
> if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
> tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
> else
> tdx->tdvmcall.rcx = 0;
>
> I vote for something like the below, with much more robust checking of vp_enter_ret
> before it's converted to a VMX exit reason.
>
> static __always_inline union vmx_exit_reason tdexit_exit_reason(struct kvm_vcpu *vcpu)
> {
> return (u32)vcpu->vp_enter_ret;
> }

Thank you for the concrete suggestion. Let me explore what safeguard checks
can be done to make the exit path robust.
--
Isaku Yamahata <[email protected]>

2024-03-15 21:36:52

by Isaku Yamahata

Subject: Re: [PATCH v19 025/130] KVM: TDX: Make TDX VM type supported

On Thu, Mar 14, 2024 at 02:29:07PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > NOTE: This patch is in position of the patch series for developers to be
> > able to test codes during the middle of the patch series although this
> > patch series doesn't provide functional features until the all the patches
> > of this patch series. When merging this patch series, this patch can be
> > moved to the end.
>
> Maybe at this point of time, you can consider to move this patch to the end?

Given I haven't needed step-by-step debugging recently, I think it's safe to
move it.
--
Isaku Yamahata <[email protected]>

2024-03-15 23:26:10

by Isaku Yamahata

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, Mar 15, 2024 at 12:44:46PM +0800,
Binbin Wu <[email protected]> wrote:

> On 3/15/2024 12:27 AM, Isaku Yamahata wrote:
> > On Thu, Mar 14, 2024 at 10:05:35AM +0800,
> > Binbin Wu <[email protected]> wrote:
> >
> > > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > > > index 18cecf12c7c8..18aef6e23aab 100644
> > > > --- a/arch/x86/kvm/vmx/main.c
> > > > +++ b/arch/x86/kvm/vmx/main.c
> > > > @@ -6,6 +6,22 @@
> > > > #include "nested.h"
> > > > #include "pmu.h"
> > > > +static bool enable_tdx __ro_after_init;
> > > > +module_param_named(tdx, enable_tdx, bool, 0444);
> > > > +
> > > > +static __init int vt_hardware_setup(void)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + ret = vmx_hardware_setup();
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > #define VMX_REQUIRED_APICV_INHIBITS \
> > > > (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> > > > BIT(APICV_INHIBIT_REASON_ABSENT) | \
> > > > @@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > > > .hardware_unsetup = vmx_hardware_unsetup,
> > > > + /* TDX cpu enablement is done by tdx_hardware_setup(). */
> > > How about if there are some LPs that are offline.
> > > In tdx_hardware_setup(), only online LPs are initialed for TDX, right?
> > Correct.
> >
> >
> > > Then when an offline LP becoming online, it doesn't have a chance to call
> > > tdx_cpu_enable()?
> > KVM registers kvm_online/offline_cpu() @ kvm_main.c as cpu hotplug callbacks.
> > Eventually x86 kvm hardware_enable() is called on online/offline event.
>
> Yes, hardware_enable() will be called when online,
> but hardware_enable() now is vmx_hardware_enable(), right?
> It doesn't call tdx_cpu_enable() during the online path.

The TDX module requires TDH.SYS.LP.INIT() on all logical processors (LPs). Once
the TDX module has been successfully initialized, no further action is needed
for TDX on cpu online/offline events.

If some LPs are not online when kvm_intel.ko is loaded, KVM fails to initialize
the TDX module and TDX support is disabled. We don't bother to retry; it's left
to the admin of the machine.
--
Isaku Yamahata <[email protected]>

2024-03-15 23:53:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On 3/15/24 12:38, Sean Christopherson wrote:
>> tdh_mem_page_remove() _should_ just be logically:
>>
>> * initialize tdx_module_args. Move a few things into place on
>> the stack and zero the rest.
> The "zero the rest" is what generates the fugly code. The underlying problem is
> that the SEAMCALL assembly functions unpack _all_ registers from tdx_module_args.
> As a result, tdx_module_args needs to be zeroed to avoid loading registers with
> uninitialized stack data.

It's the "zero the rest" and also the copy:

> + if (out) {
> + *out = *in;
> + ret = seamcall_ret(op, out);
> + } else
> + ret = seamcall(op, in);

Things get a wee bit nicer if you do an out-of-line memcpy() instead of
the structure copy. But the really fun part is that 'out' is NULL and
the compiler *SHOULD* know it. I'm not actually sure what trips it up.

In any case, I think it ends up generating code for both sides of the
if/else including the entirely superfluous copy.

The two nested while loops (one for TDX_RND_NO_ENTROPY and the other for
TDX_ERROR_SEPT_BUSY) also don't make for great code generation.

So, sure, the generated code here could be better. But there's a lot
more going on here than just shuffling gunk in and out of the 'struct
tdx_module_args', and there's a _lot_ more work to do for one of these
than for a plain old kvm_hypercall*().

It might make sense to separate out the "out" functionality, and maybe
to uninline _some_ of the helper levels. But after that, there's
not a lot of low hanging fruit.

2024-03-18 17:12:34

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Fri, Mar 15, 2024 at 02:01:43PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Thu, 2024-03-14 at 18:35 -0700, Isaku Yamahata wrote:
> > > On the subject of warnings and KVM_BUG_ON(), my feeling so far is
> > > that
> > > this series is quite aggressive about these. Is it due the
> > > complexity
> > > of the series? I think maybe we can remove some of the simple ones,
> > > but
> > > not sure if there was already some discussion on what level is
> > > appropriate.
> >
> > KVM_BUG_ON() was helpful at the early stage.  Because we don't hit
> > them
> > recently, it's okay to remove them.  Will remove them.
>
> Hmm. We probably need to do it case by case.

I categorize them as follows. Unless noted otherwise, I'll update this series.

- dirty log check
As we will drop this patch, we'll have no call site.

- KVM_BUG_ON() in main.c
We should drop them because their logic isn't complex.

- KVM_BUG_ON() in tdx.c
- The error check of the return value from SEAMCALL
We should keep it, as it's an unexpected error from the TDX module. When we
hit this, we should mark the guest bugged and prevent further operation.
It's hard to deduce the reason; the TDX module might be broken.

- Other check
We should drop them.

--
Isaku Yamahata <[email protected]>

2024-03-18 17:44:53

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Mon, 2024-03-18 at 10:12 -0700, Isaku Yamahata wrote:
> I categorize them as follows. Unless noted otherwise, I'll update this series.
>
> - dirty log check
>   As we will drop this patch, we'll have no call site.
>
> - KVM_BUG_ON() in main.c
>   We should drop them because their logic isn't complex.
What about "KVM: TDX: Add methods to ignore guest instruction
emulation"? Is it cleanly blocked somehow?

>  
> - KVM_BUG_ON() in tdx.c
>   - The error check of the return value from SEAMCALL
>     We should keep it as it's unexpected error from TDX module. When
> we hit
>     this, we should mark the guest bugged and prevent further
> operation.  It's
>     hard to deduce the reason.  The TDX module might be broken.
Yes. Makes sense.

>
>   - Other check
>     We should drop them.

Offhand, I'm not sure what is in this category.

2024-03-18 21:01:24

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> +{
> +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +       if (unlikely(!tdx->initialized))
> +               return -EINVAL;
> +       if (unlikely(vcpu->kvm->vm_bugged)) {
> +               tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> +               return EXIT_FASTPATH_NONE;
> +       }
> +

Isaku, can you elaborate on why this needs special handling? There is a
check in vcpu_enter_guest() like:
if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
r = -EIO;
goto out;
}

Instead it returns a SEAM error code for something actuated by KVM. But
can it even be reached because of the other check? Not sure if there is
a problem, just sticks out to me and I'm wondering what's going on.

2024-03-18 23:17:32

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Mon, Mar 18, 2024 at 05:43:33PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-03-18 at 10:12 -0700, Isaku Yamahata wrote:
> > I categorize them as follows. Unless noted otherwise, I'll update this series.
> >
> > - dirty log check
> >   As we will drop this patch, we'll have no call site.
> >
> > - KVM_BUG_ON() in main.c
> >   We should drop them because their logic isn't complex.
> What about "KVM: TDX: Add methods to ignore guest instruction
> emulation"? Is it cleanly blocked somehow?

The KVM fault handler, kvm_mmu_page_fault(), is the caller into the emulation;
it should skip the emulation.

As a second guard, x86_emulate_instruction() calls the
check_emulate_instruction() callback to check whether the emulation can/should
be done. The TDX callback can return X86EMUL_UNHANDLEABLE, and the flow then
goes to user space as an error. I'll update vt_check_emulate_instruction().



> > - KVM_BUG_ON() in tdx.c
> >   - The error check of the return value from SEAMCALL
> >     We should keep it as it's unexpected error from TDX module. When
> > we hit
> >     this, we should mark the guest bugged and prevent further
> > operation.  It's
> >     hard to deduce the reason.  The TDX module might be broken.
> Yes. Makes sense.
>
> >
> >   - Other check
> >     We should drop them.
>
> Offhand, I'm not sure what is in this category.

- Checking the error code on TD enter/exit
I'll revise how errors from TD enter/exit are checked; we'll have new code. I
will update the wrapper functions to take struct kvm_tdx or struct tdx_vcpu
and remove the random checks, along with cleanups related to kvm_rebooting,
TDX_SW_ERROR, and kvm_spurious_fault().

- Remaining random checks for debug
They were added for debugging. Examples follow.


@@ -797,18 +788,14 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
* list_{del,add}() on associated_tdvcpus list later.
*/
tdx_disassociate_vp_on_cpu(vcpu);
- WARN_ON_ONCE(vcpu->cpu != -1);

/*
* This methods can be called when vcpu allocation/initialization
* failed. So it's possible that hkid, tdvpx and tdvpr are not assigned
* yet.
*/
- if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm))) {
- WARN_ON_ONCE(tdx->tdvpx_pa);
- WARN_ON_ONCE(tdx->tdvpr_pa);
+ if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
return;
- }

if (tdx->tdvpx_pa) {
for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
@@ -831,9 +818,9 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{

/* vcpu_deliver_init method silently discards INIT event. */
- if (KVM_BUG_ON(init_event, vcpu->kvm))
+ if (init_event)
return;
- if (KVM_BUG_ON(is_td_vcpu_created(to_tdx(vcpu)), vcpu->kvm))
+ if (is_td_vcpu_created(to_tdx(vcpu)))
return;

/*


--
Isaku Yamahata <[email protected]>

2024-03-18 23:40:24

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Mon, Mar 18, 2024 at 09:01:05PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > +{
> > +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> > +       if (unlikely(!tdx->initialized))
> > +               return -EINVAL;
> > +       if (unlikely(vcpu->kvm->vm_bugged)) {
> > +               tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > +               return EXIT_FASTPATH_NONE;
> > +       }
> > +
>
> Isaku, can you elaborate on why this needs special handling? There is a
> check in vcpu_enter_guest() like:
> if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
> r = -EIO;
> goto out;
> }
>
> Instead it returns a SEAM error code for something actuated by KVM. But
> can it even be reached because of the other check? Not sure if there is
> a problem, just sticks out to me and I'm wondering what's going on.

The original intention is to get out of the inner loop. As Sean pointed out,
the current code does a poor job of checking errors from
__seamcall_saved_ret(TDH_VP_ENTER), so it fails to call KVM_BUG_ON() when an
unexpected error is returned.

The right fix is to properly check an error from TDH_VP_ENTER and call
KVM_BUG_ON(). Then the check you pointed out should go away.

for (;;) {              /* outer loop */
        if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
                break;
        for (;;) {      /* inner loop */
                vcpu_run();
                kvm_vcpu_exit_request(vcpu);
        }
}

--
Isaku Yamahata <[email protected]>

2024-03-18 23:48:43

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX supports only write-back(WB) memory type for private memory
> architecturally so that (virtualized) memory type change doesn't make
> sense
> for private memory.  Also currently, page migration isn't supported
> for TDX
> yet. (TDX architecturally supports page migration. it's KVM and
> kernel
> implementation issue.)
>
> Regarding memory type change (mtrr virtualization and lapic page
> mapping
> change), pages are zapped by kvm_zap_gfn_range().  On the next KVM
> page
> fault, the SPTE entry with a new memory type for the page is
> populated.
> Regarding page migration, pages are zapped by the mmu notifier. On
> the next
> KVM page fault, the new migrated page is populated.  Don't zap
> private
> pages on unmapping for those two cases.

Is the migration case relevant to TDX?

>
> When deleting/moving a KVM memory slot, zap private pages. Typically
> tearing down VM.  Don't invalidate private page tables. i.e. zap only
> leaf
> SPTEs for KVM mmu that has a shared bit mask. The existing
> kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with read-
> lock
> of mmu_lock so that other vcpu can operate on KVM mmu concurrently. 
> It
> marks the root page table invalid and zaps SPTEs of the root page
> tables. The TDX module doesn't allow to unlink a protected root page
> table
> from the hardware and then allocate a new one for it. i.e. replacing
> a
> protected root page table.  Instead, zap only leaf SPTEs for KVM mmu
> with a
> shared bit mask set.

I get the part about only zapping leafs and not the root and mid-level
PTEs. But why the MTRR, lapic page and migration part? Why should those
not be zapped? Why is migration a consideration when it is not
supported?

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 61
> ++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++----
>  arch/x86/kvm/mmu/tdp_mmu.h |  5 ++--
>  3 files changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0d6d4506ec97..30c86e858ae4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm
> *kvm)
>          * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock
> and yield.
>          */
>         if (tdp_mmu_enabled)
> -               kvm_tdp_mmu_invalidate_all_roots(kvm);
> +               kvm_tdp_mmu_invalidate_all_roots(kvm, true);
>  
>         /*
>          * Notify all vcpus to reload its shadow page table and flush
> TLB.
> @@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> gfn_start, gfn_t gfn_end)
>         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
>         if (tdp_mmu_enabled)
> -               flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start,
> gfn_end, flush);
> +               /*
> +                * zap_private = false. Zap only shared pages.
> +                *
> +                * kvm_zap_gfn_range() is used when MTRR or PAT
> memory
> +                * type was changed.  Later on the next kvm page
> fault,
> +                * populate it with updated spte entry.
> +                * Because only WB is supported for private pages,
> don't
> +                * care of private pages.
> +                */
> +               flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start,
> gfn_end, flush, false);
>  
>         if (flush)
>                 kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end -
> gfn_start);
> @@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm
> *kvm)
>         kvm_mmu_zap_all(kvm);
>  }
>  
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct
> kvm_memory_slot *slot)

What about kvm_mmu_zap_memslot_leafs() instead?

> +{
> +       bool flush = false;

It doesn't need to be initialized if it passes false directly into
kvm_tdp_mmu_unmap_gfn_range(). It would make the code easier to
understand.

> +
> +       write_lock(&kvm->mmu_lock);
> +
> +       /*
> +        * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't
> required, worst
> +        * case scenario we'll have unused shadow pages lying around
> until they
> +        * are recycled due to age or when the VM is destroyed.
> +        */
> +       if (tdp_mmu_enabled) {
> +               struct kvm_gfn_range range = {
> +                     .slot = slot,
> +                     .start = slot->base_gfn,
> +                     .end = slot->base_gfn + slot->npages,
> +                     .may_block = true,
> +
> +                     /*
> +                      * This handles both private gfn and shared
> gfn.
> +                      * All private page should be zapped on memslot
> deletion.
> +                      */
> +                     .only_private = true,
> +                     .only_shared = true,

only_private and only_shared are both true? Shouldn't they both be
false? (or just unset)

> +               };
> +
> +               flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range,
> flush);
> +       } else {
> +               /* TDX supports only TDP-MMU case. */
> +               WARN_ON_ONCE(1);

How about a KVM_BUG_ON() instead? If somehow this is reached, we don't
want the caller thinking the pages are zapped, then entering the guest
with pages mapped that have gone elsewhere.

> +               flush = true;

Why flush?

> +       }
> +       if (flush)
> +               kvm_flush_remote_tlbs(kvm);
> +
> +       write_unlock(&kvm->mmu_lock);
> +}
> +
>  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>                                    struct kvm_memory_slot *slot)
>  {
> -       kvm_mmu_zap_all_fast(kvm);
> +       if (kvm_gfn_shared_mask(kvm))

There seems to be an attempt to abstract away the existence of Secure-
EPT in mmu.c, that is not fully successful. In this case the code
checks kvm_gfn_shared_mask() to see if it needs to handle the zapping
in a way specifically needed by S-EPT. It ends up being a little
confusing because the actual check is about whether there is a shared
bit. It only works because S-EPT is the only thing that has a
kvm_gfn_shared_mask().

Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks wrong,
but is more honest about what we are getting up to here. I'm not sure
though, what do you think?

> +               /*
> +                * Secure-EPT requires to release PTs from the leaf. 
> The
> +                * optimization to zap root PT first with child PT
> doesn't
> +                * work.
> +                */
> +               kvm_mmu_zap_memslot(kvm, slot);
> +       else
> +               kvm_mmu_zap_all_fast(kvm);
>  }
>  
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index d47f0daf1b03..e7514a807134 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>          * for zapping and thus puts the TDP MMU's reference to each
> root, i.e.
>          * ultimately frees all roots.
>          */
> -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> +       kvm_tdp_mmu_invalidate_all_roots(kvm, false);
>         kvm_tdp_mmu_zap_invalidated_roots(kvm);
>  
>         WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> @@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct
> kvm_mmu_page *sp)
>   * operation can cause a soft lockup.
>   */
>  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page
> *root,
> -                             gfn_t start, gfn_t end, bool can_yield,
> bool flush)
> +                             gfn_t start, gfn_t end, bool can_yield,
> bool flush,
> +                             bool zap_private)
>  {
>         struct tdp_iter iter;
>  
> @@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm,
> struct kvm_mmu_page *root,
>  
>         lockdep_assert_held_write(&kvm->mmu_lock);
>  
> +       WARN_ON_ONCE(zap_private && !is_private_sp(root));

All the callers have zap_private as zap_private && is_private_sp(root).
What badness is it trying to uncover?

> +       if (!zap_private && is_private_sp(root))
> +               return false;
> +



2024-03-19 02:55:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Wed, 2024-03-13 at 10:14 -0700, Isaku Yamahata wrote:
> > IMO, an enum will be clearer than the two flags.
> >
> >     enum {
> >         PROCESS_PRIVATE_AND_SHARED,
> >         PROCESS_ONLY_PRIVATE,
> >         PROCESS_ONLY_SHARED,
> >     };
>
> The code will be ugly like
> "if (== PRIVATE || == PRIVATE_AND_SHARED)" or
> "if (== SHARED || == PRIVATE_AND_SHARED)"
>
> two boolean (or two flags) is less error-prone.

Yes the enum would be awkward to handle. But I also thought the way
this is specified in struct kvm_gfn_range is a little strange.

It is ambiguous what it should mean if you set:
.only_private=true;
.only_shared=true;
...as happens later in the series (although it may be a mistake).

Reading the original conversation, it seems Sean suggested this
specifically. But it wasn't clear to me from the discussion what the
intention of the "only" semantics was. Like why not?
bool private;
bool shared;

2024-03-19 14:48:11

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Mon, 2024-03-18 at 19:50 -0700, Rick Edgecombe wrote:
> On Wed, 2024-03-13 at 10:14 -0700, Isaku Yamahata wrote:
> > > IMO, an enum will be clearer than the two flags.
> > >
> > >     enum {
> > >         PROCESS_PRIVATE_AND_SHARED,
> > >         PROCESS_ONLY_PRIVATE,
> > >         PROCESS_ONLY_SHARED,
> > >     };
> >
> > The code will be ugly like
> > "if (== PRIVATE || == PRIVATE_AND_SHARED)" or
> > "if (== SHARED || == PRIVATE_AND_SHARED)"
> >
> > two boolean (or two flags) is less error-prone.
>
> Yes the enum would be awkward to handle. But I also thought the way
> this is specified in struct kvm_gfn_range is a little strange.
>
> It is ambiguous what it should mean if you set:
>  .only_private=true;
>  .only_shared=true;
> ...as happens later in the series (although it may be a mistake).
>
> Reading the original conversation, it seems Sean suggested this
> specifically. But it wasn't clear to me from the discussion what the
> intention of the "only" semantics was. Like why not?
>  bool private;
>  bool shared;

I see Binbin brought up this point on v18 as well:
https://lore.kernel.org/kvm/[email protected]/#t

and helpfully dug up some other discussion with Sean where he agreed
the "_only" is confusing and proposed the enum:
https://lore.kernel.org/kvm/[email protected]/

He wanted the default value (in the case the caller forgets to set
them) to include both private and shared. I think the enum has
the issues that Isaku mentioned. What about?

bool exclude_private;
bool exclude_shared;

It will become onerous if more types of aliases grow, but it's clearer
semantically and has the safe default behavior.

2024-03-19 21:50:31

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Tue, Mar 19, 2024 at 02:47:47PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-03-18 at 19:50 -0700, Rick Edgecombe wrote:
> > On Wed, 2024-03-13 at 10:14 -0700, Isaku Yamahata wrote:
> > > > IMO, an enum will be clearer than the two flags.
> > > >
> > > >     enum {
> > > >         PROCESS_PRIVATE_AND_SHARED,
> > > >         PROCESS_ONLY_PRIVATE,
> > > >         PROCESS_ONLY_SHARED,
> > > >     };
> > >
> > > The code will be ugly like
> > > "if (== PRIVATE || == PRIVATE_AND_SHARED)" or
> > > "if (== SHARED || == PRIVATE_AND_SHARED)"
> > >
> > > two boolean (or two flags) is less error-prone.
> >
> > Yes the enum would be awkward to handle. But I also thought the way
> > this is specified in struct kvm_gfn_range is a little strange.
> >
> > It is ambiguous what it should mean if you set:
> >  .only_private=true;
> >  .only_shared=true;
> > ...as happens later in the series (although it may be a mistake).
> >
> > Reading the original conversation, it seems Sean suggested this
> > specifically. But it wasn't clear to me from the discussion what the
> > intention of the "only" semantics was. Like why not?
> >  bool private;
> >  bool shared;
>
> I see Binbin brought up this point on v18 as well:
> https://lore.kernel.org/kvm/[email protected]/#t
>
> and helpfully dug up some other discussion with Sean where he agreed
> the "_only" is confusing and proposed the enum:
> https://lore.kernel.org/kvm/[email protected]/
>
> He wanted the default value (in the case the caller forgets to set
> them) to include both private and shared. I think the enum has
> the issues that Isaku mentioned. What about?
>
> bool exclude_private;
> bool exclude_shared;
>
> It will become onerous if more types of aliases grow, but it's clearer
> semantically and has the safe default behavior.

I'm fine with those names. Anyway, I'm fine with either way, two bools or an enum.
--
Isaku Yamahata <[email protected]>

2024-03-19 21:57:18

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 098/130] KVM: TDX: Add a place holder to handle TDX VM exit

On Fri, Mar 15, 2024 at 10:45:04AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > +int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > +{
> > + union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> > +
> > + /* See the comment of tdh_sept_seamcall(). */
> > + if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)))
> > + return 1;
> > +
> > + /*
> > + * TDH.VP.ENTRY checks TD EPOCH which contend with TDH.MEM.TRACK and
> > + * vcpu TDH.VP.ENTER.
> > + */
> > + if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_TD_EPOCH)))
> > + return 1;
> > +
> > + if (unlikely(exit_reason.full == TDX_SEAMCALL_UD)) {
> > + kvm_spurious_fault();
> > + /*
> > + * In the case of reboot or kexec, loop with TDH.VP.ENTER and
> > + * TDX_SEAMCALL_UD to avoid unnecessary activity.
> > + */
> > + return 1;
>
> No. This is unnecessarily risky. KVM_BUG_ON() and exit to userspace. The
> response to "SEAMCALL faulted" should never be, "well, let's try again!".
>
> Also, what about #GP on SEAMCALL? In general, the error handling here seems
> lacking.

As I replied at [1], let me revise the error handling in the TDX KVM code in general.
[1] https://lore.kernel.org/kvm/[email protected]/T/#macc431c87676995d65ddcd8de632261a2dedc525
--
Isaku Yamahata <[email protected]>

2024-03-19 23:24:57

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> +
> +static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level,
> hpa_t page,
> +                                  struct tdx_module_args *out)
> +{
> +       struct tdx_module_args in = {
> +               .rcx = gpa | level,
> +               .rdx = tdr,
> +               .r8 = page,
> +       };
> +
> +       clflush_cache_range(__va(page), PAGE_SIZE);
> +       return tdx_seamcall(TDH_MEM_SEPT_ADD, &in, out);
> +}

The caller of this later in the series looks like this:

err = tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &out);
if (unlikely(err == TDX_ERROR_SEPT_BUSY))
return -EAGAIN;
if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT |
TDX_OPERAND_ID_RCX))) {
union tdx_sept_entry entry = {
.raw = out.rcx,
};
union tdx_sept_level_state level_state = {
.raw = out.rdx,
};

/* someone updated the entry with same value. */
if (level_state.level == tdx_level &&
level_state.state == TDX_SEPT_PRESENT &&
!entry.leaf && entry.pfn == (hpa >> PAGE_SHIFT))
return -EAGAIN;
}

The helper abstracts setting the arguments into the proper registers
fields passed in, but doesn't abstract pulling the result out from the
register fields. Then the caller has to manually extract them in this
verbose way. Why not have the helper do both?

2024-03-19 23:57:12

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, Mar 18, 2024 at 11:46:11PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > TDX supports only write-back(WB) memory type for private memory
> > architecturally so that (virtualized) memory type change doesn't make
> > sense
> > for private memory.  Also currently, page migration isn't supported
> > for TDX
> > yet. (TDX architecturally supports page migration. it's KVM and
> > kernel
> > implementation issue.)
> >
> > Regarding memory type change (mtrr virtualization and lapic page
> > mapping
> > change), pages are zapped by kvm_zap_gfn_range().  On the next KVM
> > page
> > fault, the SPTE entry with a new memory type for the page is
> > populated.
> > Regarding page migration, pages are zapped by the mmu notifier. On
> > the next
> > KVM page fault, the new migrated page is populated.  Don't zap
> > private
> > pages on unmapping for those two cases.
>
> Is the migration case relevant to TDX?

We can forget about it because the page migration isn't supported yet.


> > When deleting/moving a KVM memory slot, zap private pages. Typically
> > tearing down VM.  Don't invalidate private page tables. i.e. zap only
> > leaf
> > SPTEs for KVM mmu that has a shared bit mask. The existing
> > kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with read-
> > lock
> > of mmu_lock so that other vcpu can operate on KVM mmu concurrently. 
> > It
> > marks the root page table invalid and zaps SPTEs of the root page
> > tables. The TDX module doesn't allow to unlink a protected root page
> > table
> > from the hardware and then allocate a new one for it. i.e. replacing
> > a
> > protected root page table.  Instead, zap only leaf SPTEs for KVM mmu
> > with a
> > shared bit mask set.
>
> I get the part about only zapping leafs and not the root and mid-level
> PTEs. But why the MTRR, lapic page and migration part? Why should those
> not be zapped? Why is migration a consideration when it is not
> supported?


When we zap a page from the guest and add it again on TDX, even with the same
GPA, the page is zeroed. We'd like to keep the memory contents for those cases.

Ok, let me add those whys and drop the migration part. Here is the updated one.

TDX architecturally supports only the write-back (WB) memory type for
private memory, so a (virtualized) memory type change doesn't make
sense for private memory. When we remove a private page from the guest
and re-add it with the same GPA, the page is zeroed.

Regarding memory type changes (MTRR virtualization and lapic page
mapping changes), the current implementation zaps pages and populates
the page with the new memory type on the next KVM page fault. Having
pages zeroed doesn't work for TDX. Because TDX supports only WB, ignore
the requests for MTRR and lapic page changes and don't zap private
pages on unmapping for those two cases.

TDX Secure-EPT requires removing the guest pages first and then the
leaf Secure-EPT pages, in order. It doesn't allow zapping a Secure-EPT
entry that has child pages. That doesn't work with the current TDP MMU
zapping logic, which zaps the root page table without touching child
pages. Instead, zap only leaf SPTEs for a KVM MMU that has a shared
bit mask.

>
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 61
> > ++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++----
> >  arch/x86/kvm/mmu/tdp_mmu.h |  5 ++--
> >  3 files changed, 92 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0d6d4506ec97..30c86e858ae4 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm
> > *kvm)
> >          * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock
> > and yield.
> >          */
> >         if (tdp_mmu_enabled)
> > -               kvm_tdp_mmu_invalidate_all_roots(kvm);
> > +               kvm_tdp_mmu_invalidate_all_roots(kvm, true);
> >  
> >         /*
> >          * Notify all vcpus to reload its shadow page table and flush
> > TLB.
> > @@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> >         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >  
> >         if (tdp_mmu_enabled)
> > -               flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start,
> > gfn_end, flush);
> > +               /*
> > +                * zap_private = false. Zap only shared pages.
> > +                *
> > +                * kvm_zap_gfn_range() is used when MTRR or PAT
> > memory
> > +                * type was changed.  Later on the next kvm page
> > fault,
> > +                * populate it with updated spte entry.
> > +                * Because only WB is supported for private pages,
> > don't
> > +                * care of private pages.
> > +                */
> > +               flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start,
> > gfn_end, flush, false);
> >  
> >         if (flush)
> >                 kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end -
> > gfn_start);
> > @@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm
> > *kvm)
> >         kvm_mmu_zap_all(kvm);
> >  }
> >  
> > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct
> > kvm_memory_slot *slot)
>
> What about kvm_mmu_zap_memslot_leafs() instead?

Ok.


> > +{
> > +       bool flush = false;
>
> It doesn't need to be initialized if it passes false directly into
> kvm_tdp_mmu_unmap_gfn_range(). It would make the code easier to
> understand.
>
> > +
> > +       write_lock(&kvm->mmu_lock);
> > +
> > +       /*
> > +        * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't
> > required, worst
> > +        * case scenario we'll have unused shadow pages lying around
> > until they
> > +        * are recycled due to age or when the VM is destroyed.
> > +        */
> > +       if (tdp_mmu_enabled) {
> > +               struct kvm_gfn_range range = {
> > +                     .slot = slot,
> > +                     .start = slot->base_gfn,
> > +                     .end = slot->base_gfn + slot->npages,
> > +                     .may_block = true,
> > +
> > +                     /*
> > +                      * This handles both private gfn and shared
> > gfn.
> > +                      * All private page should be zapped on memslot
> > deletion.
> > +                      */
> > +                     .only_private = true,
> > +                     .only_shared = true,
>
> only_private and only_shared are both true? Shouldn't they both be
> false? (or just unset)

I replied at:
https://lore.kernel.org/kvm/[email protected]/

>
> > +               };
> > +
> > +               flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range,
> > flush);
> > +       } else {
> > +               /* TDX supports only TDP-MMU case. */
> > +               WARN_ON_ONCE(1);
>
> How about a KVM_BUG_ON() instead? If somehow this is reached, we don't
> want the caller thinking the pages are zapped, then enter the guest
> with pages mapped that have gone elsewhere.
>
> > +               flush = true;
>
> Why flush?

Those are only safeguards. TDX supports only the TDP MMU. Let me drop them.


> > +       }
> > +       if (flush)
> > +               kvm_flush_remote_tlbs(kvm);
> > +
> > +       write_unlock(&kvm->mmu_lock);
> > +}
> > +
> >  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> >                                    struct kvm_memory_slot *slot)
> >  {
> > -       kvm_mmu_zap_all_fast(kvm);
> > +       if (kvm_gfn_shared_mask(kvm))
>
> There seems to be an attempt to abstract away the existence of Secure-
> EPT in mmu.c, that is not fully successful. In this case the code
> checks kvm_gfn_shared_mask() to see if it needs to handle the zapping
> in a specific way needed by S-EPT. It ends up being a little confusing
> because the actual check is about whether there is a shared bit. It
> only works because S-EPT is the only thing that has a
> kvm_gfn_shared_mask().
>
> Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks wrong,
> but is more honest about what we are getting up to here. I'm not sure
> though, what do you think?

Right, I attempted it and failed in the zapping case. This is due to the
restriction that the Secure-EPT pages must be removed from the leaves. The VMX
case (also NPT, even SNP) heavily depends on zapping the root entry as an
optimization.

I can think of:
- Add a TDX check. Looks wrong.
- Use kvm_gfn_shared_mask(kvm). Confusing.
- Give this check another name like zap_from_leafs (or a better name?).
  The implementation is the same as kvm_gfn_shared_mask(), with a comment.
- Or we can add a boolean variable to struct kvm.


> >  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index d47f0daf1b03..e7514a807134 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> >          * for zapping and thus puts the TDP MMU's reference to each
> > root, i.e.
> >          * ultimately frees all roots.
> >          */
> > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > +       kvm_tdp_mmu_invalidate_all_roots(kvm, false);
> >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> >  
> >         WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> > @@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct
> > kvm_mmu_page *sp)
> >   * operation can cause a soft lockup.
> >   */
> >  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page
> > *root,
> > -                             gfn_t start, gfn_t end, bool can_yield,
> > bool flush)
> > +                             gfn_t start, gfn_t end, bool can_yield,
> > bool flush,
> > +                             bool zap_private)
> >  {
> >         struct tdp_iter iter;
> >  
> > @@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> >  
> >         lockdep_assert_held_write(&kvm->mmu_lock);
> >  
> > +       WARN_ON_ONCE(zap_private && !is_private_sp(root));
>
> All the callers have zap_private as zap_private && is_private_sp(root).
> What badness is it trying to uncover?

I added this during debug. Let me drop it.
--
Isaku Yamahata <[email protected]>

2024-03-20 00:03:56

by Huang, Kai

Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> A VMM interacts with the TDX module using a new instruction (SEAMCALL).
> For instance, a TDX VMM does not have full access to the VM control
> structure corresponding to the VMX VMCS. Instead, a VMM induces the TDX module
> to act on its behalf via SEAMCALLs.
>
> Define C wrapper functions for SEAMCALLs for readability.
>
> Some SEAMCALL APIs donate host pages to TDX module or guest TD, and the
> donated pages are encrypted. Those require the VMM to flush the cache
> lines to avoid cache line alias.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Not valid anymore.

[...]

> +
> +static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
> + struct tdx_module_args *out)
> +{
> + u64 ret;
> +
> + if (out) {
> + *out = *in;
> + ret = seamcall_ret(op, out);
> + } else
> + ret = seamcall(op, in);

I think it's silly to have the @out argument in this way.

What is the main reason to still have it?

Yeah we used to have the @out in __seamcall() assembly function. The
assembly code checks the @out and skips copying registers to @out when
it is NULL.

But it got removed when we tried to unify the assembly for
TDCALL/TDVMCALL and SEAMCALL to have a *SINGLE* assembly macro.

https://lore.kernel.org/lkml/[email protected]/

To me that means we should just accept the fact we will always have a
valid @out.

But there might be some case where you _obviously_ need the @out that I
missed?


> +
> + if (unlikely(ret == TDX_SEAMCALL_UD)) {
> + /*
> + * SEAMCALLs fail with TDX_SEAMCALL_UD returned when VMX is off.
> + * This can happen when the host gets rebooted or live
> + * updated. In this case, the instruction execution is ignored
> + * as KVM is shut down, so the error code is suppressed. Other
> + * than this, the error is unexpected and the execution can't
> + * continue as the TDX features rely on VMX to be on.
> + */
> + kvm_spurious_fault();
> + return 0;
> + }
> + return ret;
> +}
> +
> +static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
> +{
> + struct tdx_module_args in = {
> + .rcx = addr,
> + .rdx = tdr,
> + };
> +
> + clflush_cache_range(__va(addr), PAGE_SIZE);
> + return tdx_seamcall(TDH_MNG_ADDCX, &in, NULL);
> +}
> +
> +static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
> + struct tdx_module_args *out)
> +{
> + struct tdx_module_args in = {
> + .rcx = gpa,
> + .rdx = tdr,
> + .r8 = hpa,
> + .r9 = source,
> + };
> +
> + clflush_cache_range(__va(hpa), PAGE_SIZE);
> + return tdx_seamcall(TDH_MEM_PAGE_ADD, &in, out);
> +}
> +
> +static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
> + struct tdx_module_args *out)
> +{
> + struct tdx_module_args in = {
> + .rcx = gpa | level,
> + .rdx = tdr,
> + .r8 = page,
> + };
> +
> + clflush_cache_range(__va(page), PAGE_SIZE);
> + return tdx_seamcall(TDH_MEM_SEPT_ADD, &in, out);
> +}
> +
> +static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
> + struct tdx_module_args *out)
> +{
> + struct tdx_module_args in = {
> + .rcx = gpa | level,
> + .rdx = tdr,
> + };
> +
> + return tdx_seamcall(TDH_MEM_SEPT_RD, &in, out);
> +}

Not checked the whole series yet, but is this ever used in this series?

[...]

> +
> +static inline u64 tdh_sys_lp_shutdown(void)
> +{
> + struct tdx_module_args in = {
> + };
> +
> + return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
> +}

As Sean already pointed out, I am sure it's/should not used in this series.

That being said, I found it's not easy to determine whether one wrapper
will be used by this series or not. The other option is to introduce
the wrapper(s) when they get actually used, but I can see (especially at
this stage) it's also an apples vs oranges question that people may have
different preferences on.

Perhaps we can say something like below in changelog ...

"
Note, not all VM-management related SEAMCALLs have a wrapper here; only
the wrappers that are essential to run the TDX guest with a basic
feature set are provided.
"

.. so that people will at least pay attention to this during the review?

2024-03-20 00:09:35

by Isaku Yamahata

Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Tue, Mar 19, 2024 at 11:24:37PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > +
> > +static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level,
> > hpa_t page,
> > +                                  struct tdx_module_args *out)
> > +{
> > +       struct tdx_module_args in = {
> > +               .rcx = gpa | level,
> > +               .rdx = tdr,
> > +               .r8 = page,
> > +       };
> > +
> > +       clflush_cache_range(__va(page), PAGE_SIZE);
> > +       return tdx_seamcall(TDH_MEM_SEPT_ADD, &in, out);
> > +}
>
> The caller of this later in the series looks like this:
>
> err = tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &out);
> if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> return -EAGAIN;
> if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT |
> TDX_OPERAND_ID_RCX))) {
> union tdx_sept_entry entry = {
> .raw = out.rcx,
> };
> union tdx_sept_level_state level_state = {
> .raw = out.rdx,
> };
>
> /* someone updated the entry with same value. */
> if (level_state.level == tdx_level &&
> level_state.state == TDX_SEPT_PRESENT &&
> !entry.leaf && entry.pfn == (hpa >> PAGE_SHIFT))
> return -EAGAIN;
> }
>
> The helper abstracts setting the arguments into the proper registers
> fields passed in, but doesn't abstract pulling the result out from the
> register fields. Then the caller has to manually extract them in this
> verbose way. Why not have the helper do both?

Yes. Let me update those arguments.
--
Isaku Yamahata <[email protected]>

2024-03-20 00:11:40

by Edgecombe, Rick P

Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Tue, 2024-03-19 at 17:09 -0700, Isaku Yamahata wrote:
> > The helper abstracts setting the arguments into the proper
> > registers
> > fields passed in, but doesn't abstract pulling the result out from
> > the
> > register fields. Then the caller has to manually extract them in
> > this
> > verbose way. Why not have the helper do both?
>
> Yes. Let me update those arguments.

What were you thinking exactly, like?

tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);

And for the other helpers?

2024-03-20 00:29:40

by Huang, Kai

Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add helper functions to print out errors from the TDX module in a uniform
> manner.

Likely we need more information here. See below.

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> Reviewed-by: Yuan Yao <[email protected]>
> ---
> v19:
> - dropped unnecessary include <asm/tdx.h>
>
> v18:
> - Added Reviewed-by Binbin.

The tag doesn't show in the SoB chain.

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---

[...]

> +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out)
> +{
> + if (!out) {
> + pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
> + op, error_code);
> + return;
> + }

I think this is the reason you still want the @out in tdx_seamcall()?

But I am not sure either -- even if you want to have @out *here* -- why
can't you pass a NULL explicitly when you *know* the concerned SEAMCALL
doesn't have a valid output?

> +
> +#define MSG \
> + "SEAMCALL (0x%016llx) failed: 0x%016llx RCX 0x%016llx RDX 0x%016llx R8 0x%016llx R9 0x%016llx R10 0x%016llx R11 0x%016llx\n"
> + pr_err_ratelimited(MSG, op, error_code, out->rcx, out->rdx, out->r8,
> + out->r9, out->r10, out->r11);
> +}

Besides the regs that you are printing, there are more regs (R12-R15,
RDI, RSI) in the structure.

It's not clear why you only print some, but not all.

AFAICT the VP.ENTER SEAMCALL can have all regs as valid output?

Anyway, that being said, you might need to put more text in
changelog/comment to make this patch (at least more) reviewable.

2024-03-20 00:57:00

by Edgecombe, Rick P

Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Tue, 2024-03-19 at 16:56 -0700, Isaku Yamahata wrote:
> When we zap a page from the guest, and add it again on TDX even with
> the same
> GPA, the page is zeroed.  We'd like to keep memory contents for those
> cases.
>
> Ok, let me add those whys and drop migration part. Here is the
> updated one.
>
> TDX supports only write-back(WB) memory type for private memory
> architecturally so that (virtualized) memory type change doesn't make
> sense for private memory.  When we remove the private page from the
> guest
> and re-add it with the same GPA, the page is zeroed.
>
> Regarding memory type change (mtrr virtualization and lapic page
> mapping change), the current implementation zaps pages, and populate
s^
> the page with new memory type on the next KVM page fault.  
^s

> It doesn't work for TDX to have zeroed pages.
What does this mean? Above you mention how all the pages are zeroed. Do
you mean it doesn't work for TDX to zero a running guest's pages? That
would happen for the operations that expect the pages could get
faulted in again just fine.


> Because TDX supports only WB, we
> ignore the request for MTRR and lapic page change to not zap private
> pages on unmapping for those two cases

Hmm. I need to go back and look at this again. It's not clear from the
description why it is safe for the host to not zap pages if requested
to. I see why the guest wouldn't want them to be zapped.

>
> TDX Secure-EPT requires removing the guest pages first and leaf
> Secure-EPT pages in order. It doesn't allow zap a Secure-EPT entry
> that has child pages.  It doesn't work with the current TDP MMU
> zapping logic that zaps the root page table without touching child
> pages.  Instead, zap only leaf SPTEs for KVM mmu that has a shared
> bit
> mask.

Could this be better as two patches that each address a separate thing?
1. Leaf only zapping
2. Don't zap for MTRR, etc.

> >
> > There seems to be an attempt to abstract away the existence of
> > Secure-
> > EPT in mmu.c, that is not fully successful. In this case the code
> > checks kvm_gfn_shared_mask() to see if it needs to handle the
> > zapping
> > in a way specific needed by S-EPT. It ends up being a little
> > confusing
> > because the actual check is about whether there is a shared bit. It
> > only works because only S-EPT is the only thing that has a
> > kvm_gfn_shared_mask().
> >
> > Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks
> > wrong,
> > but is more honest about what we are getting up to here. I'm not
> > sure
> > though, what do you think?
>
> Right, I attempted and failed in zapping case.  This is due to the
> restriction
> that the Secure-EPT pages must be removed from the leaves.  the VMX
> case (also
> NPT, even SNP) heavily depends on zapping root entry as optimization.
>
> I can think of
> - add TDX check. Looks wrong
> - Use kvm_gfn_shared_mask(kvm). confusing
> - Give other name for this check like zap_from_leafs (or better
> name?)
>   The implementation is same to kvm_gfn_shared_mask() with comment.
>   - Or we can add a boolean variable to struct kvm

Hmm, maybe wrap it in a function like:
static inline bool kvm_can_only_zap_leafs(const struct kvm *kvm)
{
/* A comment explaining what is going on */
return kvm->arch.vm_type == KVM_X86_TDX_VM;
}

But KVM seems to be a bit more on the open coded side when it comes to
things like this, so not sure what maintainers would prefer. My opinion
is the kvm_gfn_shared_mask() check is too strange and it's worth a new
helper. If that is bad, then just open coding kvm->arch.vm_type ==
KVM_X86_TDX_VM is the second best, I think.

I feel both strongly that it should be changed, and unsure what
maintainers would prefer. Hopefully one will chime in.


2024-03-20 05:12:37

by Chao Gao

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

> config KVM_SW_PROTECTED_VM
> bool "Enable support for KVM software-protected VMs"
>- depends on EXPERT
> depends on KVM && X86_64
> select KVM_GENERIC_PRIVATE_MEM
> help
>@@ -89,6 +88,8 @@ config KVM_SW_PROTECTED_VM
> config KVM_INTEL
> tristate "KVM for Intel (and compatible) processors support"
> depends on KVM && IA32_FEAT_CTL
>+ select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST

why does INTEL_TDX_HOST select KVM_SW_PROTECTED_VM?

>+ select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
> help
> .vcpu_precreate = vmx_vcpu_precreate,
> .vcpu_create = vmx_vcpu_create,

>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -5,10 +5,11 @@
>
> #include "capabilities.h"
> #include "x86_ops.h"
>-#include "x86.h"
> #include "mmu.h"
> #include "tdx_arch.h"
> #include "tdx.h"
>+#include "tdx_ops.h"
>+#include "x86.h"

any reason to reorder x86.h?

>+static void tdx_do_tdh_phymem_cache_wb(void *unused)
>+{
>+ u64 err = 0;
>+
>+ do {
>+ err = tdh_phymem_cache_wb(!!err);
>+ } while (err == TDX_INTERRUPTED_RESUMABLE);
>+
>+ /* Other thread may have done for us. */
>+ if (err == TDX_NO_HKID_READY_TO_WBCACHE)
>+ err = TDX_SUCCESS;
>+ if (WARN_ON_ONCE(err))
>+ pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
>+}
>+
>+void tdx_mmu_release_hkid(struct kvm *kvm)
>+{
>+ bool packages_allocated, targets_allocated;
>+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>+ cpumask_var_t packages, targets;
>+ u64 err;
>+ int i;
>+
>+ if (!is_hkid_assigned(kvm_tdx))
>+ return;
>+
>+ if (!is_td_created(kvm_tdx)) {
>+ tdx_hkid_free(kvm_tdx);
>+ return;
>+ }
>+
>+ packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
>+ targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
>+ cpus_read_lock();
>+
>+ /*
>+ * We can destroy multiple guest TDs simultaneously. Prevent
>+ * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
>+ */
>+ mutex_lock(&tdx_lock);
>+
>+ /*
>+ * Go through multiple TDX HKID state transitions with three SEAMCALLs
>+ * to make TDH.PHYMEM.PAGE.RECLAIM() usable. Make the transition atomic
>+ * to other functions to operate private pages and Secure-EPT pages.
>+ *
>+ * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
>+ * This function is called via mmu notifier, mmu_release().
>+ * kvm_gmem_release() is called via fput() on process exit.
>+ */
>+ write_lock(&kvm->mmu_lock);
>+
>+ for_each_online_cpu(i) {
>+ if (packages_allocated &&
>+ cpumask_test_and_set_cpu(topology_physical_package_id(i),
>+ packages))
>+ continue;
>+ if (targets_allocated)
>+ cpumask_set_cpu(i, targets);
>+ }
>+ if (targets_allocated)
>+ on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
>+ else
>+ on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);

This tries to flush the cache on all CPUs when we run out of memory. I am not
sure it is the best solution. A simple solution is to just use two global
bitmaps.

And the current logic isn't optimal: e.g., if packages_allocated is true while
targets_allocated is false, we will fill in the packages bitmap but don't
use it at all.

That said, I prefer to optimize the rare case in a separate patch. We can just use
two global bitmaps or let the flush fail here just as you are doing below on
seamcall failure.

>+ /*
>+ * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
>+ * tdh_mng_key_freeid() will fail.
>+ */
>+ err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
>+ if (WARN_ON_ONCE(err)) {
>+ pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
>+ pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
>+ kvm_tdx->hkid);
>+ } else
>+ tdx_hkid_free(kvm_tdx);

curly brackets are missing.

>+
>+ write_unlock(&kvm->mmu_lock);
>+ mutex_unlock(&tdx_lock);
>+ cpus_read_unlock();
>+ free_cpumask_var(targets);
>+ free_cpumask_var(packages);
>+}
>+

>+static int __tdx_td_init(struct kvm *kvm)
>+{
>+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>+ cpumask_var_t packages;
>+ unsigned long *tdcs_pa = NULL;
>+ unsigned long tdr_pa = 0;
>+ unsigned long va;
>+ int ret, i;
>+ u64 err;
>+
>+ ret = tdx_guest_keyid_alloc();
>+ if (ret < 0)
>+ return ret;
>+ kvm_tdx->hkid = ret;
>+
>+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+ if (!va)
>+ goto free_hkid;
>+ tdr_pa = __pa(va);
>+
>+ tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
>+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>+ if (!tdcs_pa)
>+ goto free_tdr;
>+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
>+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+ if (!va)
>+ goto free_tdcs;
>+ tdcs_pa[i] = __pa(va);
>+ }
>+
>+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
>+ ret = -ENOMEM;
>+ goto free_tdcs;
>+ }
>+ cpus_read_lock();
>+ /*
>+ * Need at least one CPU of the package to be online in order to
>+ * program all packages for host key id. Check it.
>+ */
>+ for_each_present_cpu(i)
>+ cpumask_set_cpu(topology_physical_package_id(i), packages);
>+ for_each_online_cpu(i)
>+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
>+ if (!cpumask_empty(packages)) {
>+ ret = -EIO;
>+ /*
>+ * Because it's hard for human operator to figure out the
>+ * reason, warn it.
>+ */
>+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
>+ pr_warn_ratelimited(MSG_ALLPKG);
>+ goto free_packages;
>+ }
>+
>+ /*
>+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
>+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
>+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
>+ * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
>+ * caller to handle the contention. This is because of time limitation
>+ * usable inside the TDX module and OS/VMM knows better about process
>+ * scheduling.
>+ *
>+ * APIs to acquire the lock of KOT:
>+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
>+ * TDH.PHYMEM.CACHE.WB.
>+ */
>+ mutex_lock(&tdx_lock);
>+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
>+ mutex_unlock(&tdx_lock);
>+ if (err == TDX_RND_NO_ENTROPY) {
>+ ret = -EAGAIN;
>+ goto free_packages;
>+ }
>+ if (WARN_ON_ONCE(err)) {
>+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
>+ ret = -EIO;
>+ goto free_packages;
>+ }
>+ kvm_tdx->tdr_pa = tdr_pa;
>+
>+ for_each_online_cpu(i) {
>+ int pkg = topology_physical_package_id(i);
>+
>+ if (cpumask_test_and_set_cpu(pkg, packages))
>+ continue;
>+
>+ /*
>+ * Program the memory controller in the package with an
>+ * encryption key associated to a TDX private host key id
>+ * assigned to this TDR. Concurrent operations on same memory
>+ * controller results in TDX_OPERAND_BUSY. Avoid this race by
>+ * mutex.
>+ */
>+ mutex_lock(&tdx_mng_key_config_lock[pkg]);

The lock looks superfluous to me. With the CPU hotplug lock held, even if
multiple CPUs try to create TDs, the same set of CPUs (the first online CPU of
each package) will be selected to configure the key because of the
cpumask_test_and_set_cpu() above. It means we never have two CPUs in the same
socket trying to program the key, i.e., no concurrent calls.

>+ ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
>+ &kvm_tdx->tdr_pa, true);
>+ mutex_unlock(&tdx_mng_key_config_lock[pkg]);
>+ if (ret)
>+ break;
>+ }
>+ cpus_read_unlock();
>+ free_cpumask_var(packages);
>+ if (ret) {
>+ i = 0;
>+ goto teardown;
>+ }
>+
>+ kvm_tdx->tdcs_pa = tdcs_pa;
>+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
>+ err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
>+ if (err == TDX_RND_NO_ENTROPY) {
>+ /* Here it's hard to allow userspace to retry. */
>+ ret = -EBUSY;
>+ goto teardown;
>+ }
>+ if (WARN_ON_ONCE(err)) {
>+ pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
>+ ret = -EIO;
>+ goto teardown;
>+ }
>+ }
>+
>+ /*
>+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
>+ * ioctl() to define the configure CPUID values for the TD.
>+ */
>+ return 0;
>+
>+ /*
>+ * The sequence for freeing resources from a partially initialized TD
>+ * varies based on where in the initialization flow failure occurred.
>+ * Simply use the full teardown and destroy, which naturally play nice
>+ * with partial initialization.
>+ */
>+teardown:
>+ for (; i < tdx_info->nr_tdcs_pages; i++) {
>+ if (tdcs_pa[i]) {
>+ free_page((unsigned long)__va(tdcs_pa[i]));
>+ tdcs_pa[i] = 0;
>+ }
>+ }
>+ if (!kvm_tdx->tdcs_pa)
>+ kfree(tdcs_pa);
>+ tdx_mmu_release_hkid(kvm);
>+ tdx_vm_free(kvm);
>+ return ret;
>+
>+free_packages:
>+ cpus_read_unlock();
>+ free_cpumask_var(packages);
>+free_tdcs:
>+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
>+ if (tdcs_pa[i])
>+ free_page((unsigned long)__va(tdcs_pa[i]));
>+ }
>+ kfree(tdcs_pa);
>+ kvm_tdx->tdcs_pa = NULL;
>+
>+free_tdr:
>+ if (tdr_pa)
>+ free_page((unsigned long)__va(tdr_pa));
>+ kvm_tdx->tdr_pa = 0;
>+free_hkid:
>+ if (is_hkid_assigned(kvm_tdx))

IIUC, this is always true because you just return if keyid
allocation fails.

>+ ret = tdx_guest_keyid_alloc();
>+ if (ret < 0)
>+ return ret;
>+ kvm_tdx->hkid = ret;
>+
>+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+ if (!va)
>+ goto free_hkid;

>+ tdx_hkid_free(kvm_tdx);
>+ return ret;
>+}

2024-03-20 05:41:25

by Isaku Yamahata

Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Wed, Mar 20, 2024 at 12:11:17AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Tue, 2024-03-19 at 17:09 -0700, Isaku Yamahata wrote:
> > > The helper abstracts setting the arguments into the proper
> > > registers
> > > fields passed in, but doesn't abstract pulling the result out from
> > > the
> > > register fields. Then the caller has to manually extract them in
> > > this
> > > verbose way. Why not have the helper do both?
> >
> > Yes. Let me update those arguments.
>
> What were you thinking exactly, like?
>
> tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
>
> And for the other helpers?

I have the following four helpers. Other helpers will have no out argument.

tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
tdh_mem_page_remove(kvm_tdx, gpa, tdx_level, &entry, &level_state);
tdh_mem_range_block(kvm_tdx, gpa, tdx_level, &entry, &level_state);
--
Isaku Yamahata <[email protected]>

2024-03-20 06:13:27

by Chao Gao

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Mon, Feb 26, 2024 at 12:25:41AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>

..

>
>TDX requires additional parameters for a TDX VM for confidential execution to
>protect the confidentiality of its memory contents and CPU state from any
>other software, including the VMM. When creating a guest TD VM, before
>creating any vcpu, userspace supplies the number of vcpus, the TSC frequency
>(the values are the same among vcpus and can't change), and the CPUIDs which
>the TDX module emulates. Guest TDs can trust those CPUIDs and the sha384
>values for measurement.
>
>Add a new subcommand, KVM_TDX_INIT_VM, to pass parameters for the TDX
>guest. It assigns an encryption key to the TDX guest for memory
>encryption. TDX encrypts memory per guest basis. The device model, say
>qemu, passes per-VM parameters for the TDX guest. The maximum number of
>vcpus, TSC frequency (TDX guest has fixed VM-wide TSC frequency, not per
>vcpu. The TDX guest can not change it.), attributes (production or debug),
>available extended features (which configure guest XCR0, IA32_XSS MSR),
>CPUIDs, sha384 measurements, etc.
>
>Call this subcommand before creating any vcpu and KVM_SET_CPUID2, i.e. when
>CPUID configurations aren't available yet. So the CPUID configuration values
>need to be passed in struct kvm_tdx_init_vm. It is the device model's
>responsibility to make this CPUID config for KVM_TDX_INIT_VM and
>KVM_SET_CPUID2.
>
>Signed-off-by: Xiaoyao Li <[email protected]>
>Signed-off-by: Isaku Yamahata <[email protected]>

the SOB chain makes no sense.

>+static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
>+ struct td_params *td_params)
>+{
>+ int i;
>+
>+ /*
>+ * td_params.cpuid_values: The number and the order of cpuid_value must
>+ * be same to the one of struct tdsysinfo.{num_cpuid_config, cpuid_configs}
>+ * It's assumed that td_params was zeroed.
>+ */
>+ for (i = 0; i < tdx_info->num_cpuid_config; i++) {
>+ const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
>+ /* KVM_TDX_CPUID_NO_SUBLEAF means index = 0. */
>+ u32 index = c->sub_leaf == KVM_TDX_CPUID_NO_SUBLEAF ? 0 : c->sub_leaf;
>+ const struct kvm_cpuid_entry2 *entry =
>+ kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
>+ c->leaf, index);
>+ struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
>+
>+ if (!entry)
>+ continue;
>+
>+ /*
>+ * tdsysinfo.cpuid_configs[].{eax, ebx, ecx, edx}
>+ * bit 1 means it can be configured to zero or one.
>+ * bit 0 means it must be zero.
>+ * Mask out non-configurable bits.
>+ */
>+ value->eax = entry->eax & c->eax;
>+ value->ebx = entry->ebx & c->ebx;
>+ value->ecx = entry->ecx & c->ecx;
>+ value->edx = entry->edx & c->edx;

Any reason to mask off non-configurable bits rather than return an error? This
is misleading to userspace because the guest sees the values emulated by the
TDX module instead of the values passed from userspace (i.e., the request from
userspace isn't honored, but there is no indication of that to userspace).

>+ }
>+}
>+
>+static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
>+{
>+ const struct kvm_cpuid_entry2 *entry;
>+ u64 guest_supported_xcr0;
>+ u64 guest_supported_xss;
>+
>+ /* Setup td_params.xfam */
>+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
>+ if (entry)
>+ guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
>+ else
>+ guest_supported_xcr0 = 0;
>+ guest_supported_xcr0 &= kvm_caps.supported_xcr0;
>+
>+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
>+ if (entry)
>+ guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
>+ else
>+ guest_supported_xss = 0;
>+
>+ /*
>+ * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
>+ * and, CET support.
>+ */
>+ guest_supported_xss &=
>+ (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
>+
>+ td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
>+ if (td_params->xfam & XFEATURE_MASK_LBR) {
>+ /*
>+ * TODO: once KVM supports LBR(save/restore LBR related
>+ * registers around TDENTER), remove this guard.
>+ */
>+#define MSG_LBR "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH properly.\n"
>+ pr_warn(MSG_LBR);

Drop the pr_warn() because userspace can trigger it at will.

I don't think KVM needs to relay TDX module capabilities to userspace as-is.
KVM should advertise a feature only if both the TDX module's and KVM's support
are in place. If KVM masked out LBR and PERFMON, that is a problem for
userspace, and we don't need to warn here.

>+ return -EOPNOTSUPP;
>+ }
>+
>+ return 0;
>+}
>+
>+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
>+ struct kvm_tdx_init_vm *init_vm)
>+{
>+ struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
>+ int ret;
>+
>+ if (kvm->created_vcpus)
>+ return -EBUSY;

-EINVAL

>+
>+ if (init_vm->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
>+ /*
>+ * TODO: save/restore PMU related registers around TDENTER.
>+ * Once it's done, remove this guard.
>+ */
>+#define MSG_PERFMON "TD doesn't support perfmon yet. KVM needs to save/restore host perf registers properly.\n"
>+ pr_warn(MSG_PERFMON);

drop the pr_warn().

>+ return -EOPNOTSUPP;
>+ }
>+
>+ td_params->max_vcpus = kvm->max_vcpus;
>+ td_params->attributes = init_vm->attributes;
>+ td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD;
>+ td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);
>+
>+ ret = setup_tdparams_eptp_controls(cpuid, td_params);
>+ if (ret)
>+ return ret;
>+ setup_tdparams_cpuids(cpuid, td_params);
>+ ret = setup_tdparams_xfam(cpuid, td_params);
>+ if (ret)
>+ return ret;
>+
>+#define MEMCPY_SAME_SIZE(dst, src) \
>+ do { \
>+ BUILD_BUG_ON(sizeof(dst) != sizeof(src)); \
>+ memcpy((dst), (src), sizeof(dst)); \
>+ } while (0)
>+
>+ MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
>+ MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
>+ MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
>+
>+ return 0;
>+}
>+
>+static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
>+ u64 *seamcall_err)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>+ struct tdx_module_args out;
> cpumask_var_t packages;
> unsigned long *tdcs_pa = NULL;
> unsigned long tdr_pa = 0;
>@@ -426,6 +581,7 @@ static int __tdx_td_init(struct kvm *kvm)
> int ret, i;
> u64 err;
>
>+ *seamcall_err = 0;
> ret = tdx_guest_keyid_alloc();
> if (ret < 0)
> return ret;
>@@ -540,10 +696,23 @@ static int __tdx_td_init(struct kvm *kvm)
> }
> }
>
>- /*
>- * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
>- * ioctl() to define the configure CPUID values for the TD.
>- */
>+ err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
>+ if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_INVALID) {
>+ /*
>+ * Because a user gives operands, don't warn.
>+ * Return a hint to the user because it's sometimes hard for the
>+ * user to figure out which operand is invalid. SEAMCALL status
>+ * code includes which operand caused invalid operand error.
>+ */
>+ *seamcall_err = err;
>+ ret = -EINVAL;
>+ goto teardown;
>+ } else if (WARN_ON_ONCE(err)) {
>+ pr_tdx_error(TDH_MNG_INIT, err, &out);
>+ ret = -EIO;
>+ goto teardown;
>+ }
>+
> return 0;
>
> /*
>@@ -586,6 +755,76 @@ static int __tdx_td_init(struct kvm *kvm)
> return ret;
> }
>
>+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>+{
>+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>+ struct kvm_tdx_init_vm *init_vm = NULL;

no need to initialize it to NULL.

>+ struct td_params *td_params = NULL;
>+ int ret;
>+
>+ BUILD_BUG_ON(sizeof(*init_vm) != 8 * 1024);
>+ BUILD_BUG_ON(sizeof(struct td_params) != 1024);
>+
>+ if (is_hkid_assigned(kvm_tdx))
>+ return -EINVAL;
>+
>+ if (cmd->flags)
>+ return -EINVAL;
>+
>+ init_vm = kzalloc(sizeof(*init_vm) +
>+ sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
>+ GFP_KERNEL);

no need to zero the memory given ...

>+ if (!init_vm)
>+ return -ENOMEM;
>+ if (copy_from_user(init_vm, (void __user *)cmd->data, sizeof(*init_vm))) {

.. this.

>+ ret = -EFAULT;
>+ goto out;
>+ }
>+ if (init_vm->cpuid.nent > KVM_MAX_CPUID_ENTRIES) {
>+ ret = -E2BIG;
>+ goto out;
>+ }
>+ if (copy_from_user(init_vm->cpuid.entries,
>+ (void __user *)cmd->data + sizeof(*init_vm),
>+ flex_array_size(init_vm, cpuid.entries, init_vm->cpuid.nent))) {
>+ ret = -EFAULT;
>+ goto out;
>+ }
>+
>+ if (memchr_inv(init_vm->reserved, 0, sizeof(init_vm->reserved))) {
>+ ret = -EINVAL;
>+ goto out;
>+ }

2024-03-20 07:02:16

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 040/130] KVM: TDX: Make pmu_intel.c ignore guest TD case

On Mon, Feb 26, 2024 at 12:25:42AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Because TDX KVM doesn't support PMU yet (it's future work for TDX KVM
>support, as another patch series) and pmu_intel.c touches vmx-specific
>structures in vcpu initialization, as a workaround add a dummy structure to
>struct vcpu_tdx so that pmu_intel.c can ignore the TDX case.

Can we instead refactor pmu_intel.c to avoid corrupting memory? How hard would
it be?

>+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
>+{
>+ struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
>+
>+ if (is_td_vcpu(vcpu))
>+ return false;
>+
>+ return lbr->nr && (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_LBR_FMT);

The check of the vCPU's perf capabilities is new. Is it necessary?

>-static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
>-{
>- return !!vcpu_to_lbr_records(vcpu)->nr;
>-}
>-

2024-03-20 08:15:43

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On 2/26/2024 4:25 PM, [email protected] wrote:

..

> +static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
> +{
> + const struct kvm_cpuid_entry2 *entry;
> + u64 guest_supported_xcr0;
> + u64 guest_supported_xss;
> +
> + /* Setup td_params.xfam */
> + entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
> + if (entry)
> + guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
> + else
> + guest_supported_xcr0 = 0;
> + guest_supported_xcr0 &= kvm_caps.supported_xcr0;
> +
> + entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
> + if (entry)
> + guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
> + else
> + guest_supported_xss = 0;
> +
> + /*
> + * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
> + * and, CET support.
> + */
> + guest_supported_xss &=
> + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
> +
> + td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> + if (td_params->xfam & XFEATURE_MASK_LBR) {
> + /*
> + * TODO: once KVM supports LBR(save/restore LBR related
> + * registers around TDENTER), remove this guard.
> + */
> +#define MSG_LBR "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH properly.\n"
> + pr_warn(MSG_LBR);
> + return -EOPNOTSUPP;

This unsupported behavior is totally decided by KVM even if TDX module
supports it. I think we need to reflect it in tdx_info->xfam_fixed0,
which gets reported to userspace via KVM_TDX_CAPABILITIES. So userspace
will aware that LBR is not supported for TDs.

> + }
> +
> + return 0;
> +}
> +
> +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> + struct kvm_tdx_init_vm *init_vm)
> +{
> + struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
> + int ret;
> +
> + if (kvm->created_vcpus)
> + return -EBUSY;
> +
> + if (init_vm->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> + /*
> + * TODO: save/restore PMU related registers around TDENTER.
> + * Once it's done, remove this guard.
> + */
> +#define MSG_PERFMON "TD doesn't support perfmon yet. KVM needs to save/restore host perf registers properly.\n"
> + pr_warn(MSG_PERFMON);
> + return -EOPNOTSUPP;

similar as above, we need reflect it in tdx_info->attributes_fixed0

> + }
> +


2024-03-20 11:27:39

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, 2024-03-15 at 17:48 +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-03-15 at 09:33 -0700, Sean Christopherson wrote:
> > Heh, Like this one?
> >
> >         static inline u64 tdh_sys_lp_shutdown(void)
> >         {
> >                 struct tdx_module_args in = {
> >                 };
> >        
> >                 return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
> >         }
> >
> > Which isn't actually used...
>
> Looks like is was turned into a NOP in TDX 1.5. So will even forever be
> dead code. I see one other that is unused. Thanks for pointing it out.
>
> >
> > > But I'd also defer to the KVM maintainers on this.  They're the
> > > ones
> > > that have to play the symbol exporting game a lot more than I ever
> > > do.
> > > If they cringe at the idea of adding 20 (or whatever) exports, then
> > > that's a lot more important than the possibility of some other
> > > silly
> > > module abusing the generic exported __seamcall.
> >
> > I don't care much about exports.  What I do care about is sane code,
> > and while
> > the current code _looks_ pretty, it's actually quite insane.
> >
> > I get why y'all put SEAMCALL in assembly subroutines; the macro
> > shenanigans I
> > originally wrote years ago were their own brand of crazy, and dealing
> > with GPRs
> > that can't be asm() constraints often results in brittle code.
>
> I guess it must be this, for the initiated:
> https://lore.kernel.org/lkml/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/

Hmm.. I lost memory of this. :-(

>
> >
> > But the tdx_module_args structure approach generates truly atrocious
> > code.  Yes,
> > SEAMCALL is inherently slow, but that doesn't mean that we shouldn't
> > at least try
> > to generate efficient code.  And it's not just efficiency that is
> > lost, the
> > generated code ends up being much harder to read than it ought to be.
> >
> >
> [snip]
> >
> > So my feedback is to not worry about the exports, and instead focus
> > on figuring
> > out a way to make the generated code less bloated and easier to
> > read/debug.
> >
>
> Thanks for the feedback both! It sounds like everyone is flexible on
> the exports. As for the generated code, oof.
>
> Kai, I see the solution has gone through some iterations already. First
> the macro one linked above, then that was dropped pretty quick to
> something that loses the asm constraints:
> https://lore.kernel.org/lkml/e777bbbe10b1ec2c37d85dcca2e175fe3bc565ec.1625186503.git.isaku.yamahata@intel.com/

Sorry, I forget the reason we changed from the (bunch of) macros to this.

>
> Then next the struct grew here, and here:
> https://lore.kernel.org/linux-mm/[email protected]/
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Not sure I understand all of the constraints yet. Do you have any
> ideas?

This was due to Peter's request to unify the TDCALL/SEAMCALL and TDVMCALL
assembly, which was done by this series:

https://lore.kernel.org/lkml/[email protected]/

So when Peter requested this, the __seamcall() and __tdcall() (or
__tdx_module_call() before the above series) were already sharing one assembly
macro. The TDVMCALL were using a different macro, though. However the two
assembly macros used similar structure and similar code, so we tried to unify
them.

And another reason that we changed from ...

u64 __seamcall(u64 fn, u64 rcx, u64 rdx, ..., struct tdx_module_args *out);

to ...

u64 __seamcall(u64, struct tdx_module_args *args);

... was that the former doesn't extend.

E.g., live migration related new SEAMCALLs use more registers as input. It is
just insane to have so many individual regs as function arguments.

And Peter wanted to use __seamcall() to cover TDH.VP.ENTER too, which
historically had been implemented in KVM using its own assembly, because
VP.ENTER basically uses all GPRs as input/output.

2024-03-20 12:10:09

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions


>
> then loads and zeros a ton of memory (tdx_kvm_hypercall()):
>
>
[...]

>
> and then unpacks all of that memory back into registers, and reverses that last
> part on the way back, (__tdcall_saved_ret()):
>
>
[...]

>
> It's honestly quite amusing, because y'all took one what I see as one of the big
> advantages of TDX over SEV (using registers instead of shared memory), and managed
> to effectively turn it into a disadvantage.
>
> Again, I completely understand the maintenance and robustness benefits, but IMO
> the pendulum swung a bit too far in that direction.

Having to zero-and-init the structure and store output regs to the structure is
an unfortunate burden if we want to have a single API __seamcall() for all
SEAMCALL leaves.

To (precisely) avoid writing to the unnecessary structure members before and
after the SEAMCALL instruction, we would need to go back to your old way of
having a bunch of macros. But we may end up with *a lot* of macros due to
needing to cover new (e.g., live migration) SEAMCALLs.

Because essentially it's a game of implementing wrappers for a bunch of
combinations of individual input/output regs.

I don't want to judge which way is better, but to be honest I think completely
switching to the old way (using bunch of macros) isn't a realistic option at
this stage.

However, I think we might be able to change ...

u64 __seamcall(u64 fn, struct tdx_module_args *args);

... to

u64 __seamcall(u64 fn, struct tdx_module_args *in,
struct tdx_module_args *out);

.. so that the assembly can actually skip "storing output regs to the structure"
if the SEAMCALL doesn't really have any output regs except RAX.

I can try to do this if you guys believe it should be done, and done sooner
rather than later, but I am not sure _ANY_ optimization around SEAMCALL will
have a meaningful performance improvement.

2024-03-20 15:07:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On 3/20/24 05:09, Huang, Kai wrote:
> I can try to do if you guys believe this should be done, and should be done
> earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
> have meaningful performance improvement.

I don't think Sean had performance concerns.

I think he was having a justifiably violent reaction to how much more
complicated the generated code is to do a SEAMCALL versus a good ol' KVM
hypercall.

"Complicated" in this case means lots more instructions and control
flow. That's slower, sure, but the main impact is that when you go to
debug it, it's *MUCH* harder to debug the SEAMCALL entry assembly than a
KVM hypercall.

My takeaway from this, though, is that we are relying on the compiler
for a *LOT*. There are also so many levels in the helpers that it's
hard to avoid silly things like two _separate_ retry loops.

We should probably be looking at the generated code a _bit_ more often
than never, and it's OK to tinker a _bit_ to make things out-of-line or
make sure that the compiler optimizes everything that we think that it
should.

Also remember that there are very fun tools out there that can make this
much easier than recompiling the kernel a billion times:

https://godbolt.org/z/8ooE4d465

2024-03-20 20:21:05

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Tue, Mar 19, 2024 at 10:41:09PM -0700,
Isaku Yamahata <[email protected]> wrote:

> On Wed, Mar 20, 2024 at 12:11:17AM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Tue, 2024-03-19 at 17:09 -0700, Isaku Yamahata wrote:
> > > > The helper abstracts setting the arguments into the proper
> > > > registers
> > > > fields passed in, but doesn't abstract pulling the result out from
> > > > the
> > > > register fields. Then the caller has to manually extract them in
> > > > this
> > > > verbose way. Why not have the helper do both?
> > >
> > > Yes. Let me update those arguments.
> >
> > What were you thinking exactly, like?
> >
> > tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
> >
> > And for the other helpers?
>
> I have the following four helpers. Other helpers will have no out argument.
>
> tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
> tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
> tdh_mem_page_remove(kvm_tdx, gpa, tdx_level, &entry, &level_state);
> tdh_mem_range_block(kvm_tdx, gpa, tdx_level, &entry, &level_state);

While updating the code, I found that tdh_mem_range_block() doesn't need the
out variables, and that tdh_vp_rd() needs output.

u64 tdh_vp_rd(struct vcpu_tdx *tdx, u64 field, u64 *value)
--
Isaku Yamahata <[email protected]>

2024-03-20 21:00:59

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions



On 21/03/2024 4:07 am, Dave Hansen wrote:
> On 3/20/24 05:09, Huang, Kai wrote:
>> I can try to do if you guys believe this should be done, and should be done
>> earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
>> have meaningful performance improvement.
>
> I don't think Sean had performance concerns.
>
> I think he was having a justifiably violent reaction to how much more
> complicated the generated code is to do a SEAMCALL versus a good ol' KVM
> hypercall.

Ah, I automatically linked "generating better code" to "in order to have
better performance". My bad.

>
> "Complicated" in this case means lots more instructions and control
> flow. That's slower, sure, but the main impact is that when you go to
> debug it, it's *MUCH* harder to debug the SEAMCALL entry assembly than a
> KVM hypercall.

[...]

>
> My takeaway from this, though, is that we are relying on the compiler
> for a *LOT*. There are also so many levels in the helpers that it's
> hard to avoid silly things like two _separate_ retry loops.
>
> We should probably be looking at the generated code a _bit_ more often
> than never, and it's OK to tinker a _bit_ to make things out-of-line or
> make sure that the compiler optimizes everything that we think that it
> should.

Yeah, agreed.

>
> Also remember that there are very fun tools out there that can make this
> much easier than recompiling the kernel a billion times:
>
> https://godbolt.org/z/8ooE4d465

Ah, this is useful. Thanks for sharing :-)

2024-03-20 21:36:18

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Wed, Mar 20, 2024 at 01:03:21PM +1300,
"Huang, Kai" <[email protected]> wrote:

> > +static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
> > + struct tdx_module_args *out)
> > +{
> > + u64 ret;
> > +
> > + if (out) {
> > + *out = *in;
> > + ret = seamcall_ret(op, out);
> > + } else
> > + ret = seamcall(op, in);
>
> I think it's silly to have the @out argument in this way.
>
> What is the main reason to still have it?
>
> Yeah we used to have the @out in __seamcall() assembly function. The
> assembly code checks the @out and skips copying registers to @out when it is
> NULL.
>
> But it got removed when we tried to unify the assembly for TDCALL/TDVMCALL
> and SEAMCALL to have a *SINGLE* assembly macro.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> To me that means we should just accept the fact we will always have a valid
> @out.
>
> But there might be some case that you _obviously_ need the @out and I
> missed?

As I replied at [1], those four wrappers need to return values.
The first three on error, the last one on success.

[1] https://lore.kernel.org/kvm/[email protected]/

tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
tdh_mem_page_remove(kvm_tdx, gpa, tdx_level, &entry, &level_state);
u64 tdh_vp_rd(struct vcpu_tdx *tdx, u64 field, u64 *value)

We can delete out from other wrappers.
Because only TDH.MNG.CREATE() and TDH.MNG.ADDCX() can return TDX_RND_NO_ENTROPY,
we can use __seamcall(). The TDX spec doesn't guarantee such an error code
convention; it's very unlikely, though.


> > +static inline u64 tdh_sys_lp_shutdown(void)
> > +{
> > + struct tdx_module_args in = {
> > + };
> > +
> > + return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
> > +}
>
> As Sean already pointed out, I am sure it's/should not used in this series.
>
> That being said, I found it's not easy to determine whether one wrapper will
> be used by this series or not. The other option is to introduce the
> wrapper(s) when they get actually used, but I can see (especially at this
> stage) it's also an apples vs. oranges question where people may have different
> preferences.
>
> Perhaps we can say something like below in changelog ...
>
> "
> Note, not all VM-management related SEAMCALLs have a wrapper here; only the
> wrappers that are essential to run the TDX guest with a basic feature set
> are provided.
> "
>
> ... so that people will at least pay attention to this during the review?

Makes sense. We can split this patch into other patches that first use the
wrappers.
--
Isaku Yamahata <[email protected]>

2024-03-20 21:52:47

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

On Wed, Mar 20, 2024 at 01:29:07PM +1300,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 26/02/2024 9:25 pm, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Add helper functions to print out errors from the TDX module in a uniform
> > manner.
>
> Likely we need more information here. See below.
>
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Binbin Wu <[email protected]>
> > Reviewed-by: Yuan Yao <[email protected]>
> > ---
> > v19:
> > - dropped unnecessary include <asm/tdx.h>
> >
> > v18:
> > - Added Reviewed-by Binbin.
>
> The tag doesn't show in the SoB chain.
>
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
>
> [...]
>
> > +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out)
> > +{
> > + if (!out) {
> > + pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
> > + op, error_code);
> > + return;
> > + }
>
> I think this is the reason you still want the @out in tdx_seamcall()?
>
> But I am not sure either -- even if you want to have @out *here* -- why
> cannot you pass a NULL explicitly when you *know* the concerned SEAMCALL
> doesn't have a valid output?
>
> > +
> > +#define MSG \
> > + "SEAMCALL (0x%016llx) failed: 0x%016llx RCX 0x%016llx RDX 0x%016llx R8 0x%016llx R9 0x%016llx R10 0x%016llx R11 0x%016llx\n"
> > + pr_err_ratelimited(MSG, op, error_code, out->rcx, out->rdx, out->r8,
> > + out->r9, out->r10, out->r11);
> > +}
>
> Besides the regs that you are printing, there are more regs (R12-R15, RDI,
> RSI) in the structure.
>
> It's not clear why you only print some, but not all.
>
> AFAICT the VP.ENTER SEAMCALL can have all regs as valid output?

Only those registers are used by SEAMCALLs other than TDH.VP.ENTER;
TDH.VP.ENTER is the exception.

As discussed at [1], @out can be eliminated; we will have only limited output.
If we go that route, we'll have the following two functions.
Does it make sense?

void pr_tdx_error(u64 op, u64 error_code)
{
pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
op, error_code);
}

void pr_tdx_sept_error(u64 op, u64 error_code, const union tdx_sept_entry *entry,
const union tdx_sept_level_state *level_state)
{
#define MSG \
"SEAMCALL (0x%016llx) failed: 0x%016llx entry 0x%016llx level_state 0x%016llx\n"
pr_err_ratelimited(MSG, op, error_code, entry->raw, level_state->raw);
}


[1] https://lore.kernel.org/kvm/[email protected]/

>
> Anyway, that being said, you might need to put more text in
> changelog/comment to make this patch (at least more) reviewable.
>

--
Isaku Yamahata <[email protected]>

2024-03-20 22:38:28

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module



On 21/03/2024 10:36 am, Isaku Yamahata wrote:
> On Wed, Mar 20, 2024 at 01:03:21PM +1300,
> "Huang, Kai" <[email protected]> wrote:
>
>>> +static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
>>> + struct tdx_module_args *out)
>>> +{
>>> + u64 ret;
>>> +
>>> + if (out) {
>>> + *out = *in;
>>> + ret = seamcall_ret(op, out);
>>> + } else
>>> + ret = seamcall(op, in);
>>
>> I think it's silly to have the @out argument in this way.
>>
>> What is the main reason to still have it?
>>
>> Yeah we used to have the @out in __seamcall() assembly function. The
>> assembly code checks the @out and skips copying registers to @out when it is
>> NULL.
>>
>> But it got removed when we tried to unify the assembly for TDCALL/TDVMCALL
>> and SEAMCALL to have a *SINGLE* assembly macro.
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> To me that means we should just accept the fact we will always have a valid
>> @out.
>>
>> But there might be some case that you _obviously_ need the @out and I
>> missed?
>
> As I replied at [1], those four wrappers need to return values.
> The first three on error, the last one on success.
>
> [1] https://lore.kernel.org/kvm/[email protected]/
>
> tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
> tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
> tdh_mem_page_remove(kvm_tdx, gpa, tdx_level, &entry, &level_state);
> u64 tdh_vp_rd(struct vcpu_tdx *tdx, u64 field, u64 *value)
>
> We can delete out from other wrappers.

Ah, OK. I get that you don't want to invent separate wrappers for each
seamcall() variant, like:

- tdx_seamcall(u64 fn, struct tdx_module_args *args);
- tdx_seamcall_ret(u64 fn, struct tdx_module_args *args);
- tdx_seamcall_saved_ret(u64 fn, struct tdx_module_args *args);

To be honest I found they were kinda annoying myself during the "unify
TDCALL/SEAMCALL and TDVMCALL assembly" patchset.

But life is hard...

And given (it seems) we are going to remove kvm_spurious_fault(), I
think the tdx_seamcall() variants are just very simple wrapper of plain
seamcall() variants.

So how about we have some macros:

static inline bool is_seamcall_err_kernel_defined(u64 err)
{
return err & TDX_SW_ERROR;
}

#define TDX_KVM_SEAMCALL(_kvm, _seamcall_func, _fn, _args)		\
({									\
	u64 _ret = _seamcall_func(_fn, _args);				\
	KVM_BUG_ON(_kvm, is_seamcall_err_kernel_defined(_ret));		\
	_ret;								\
})

#define tdx_kvm_seamcall(_kvm, _fn, _args) \
TDX_KVM_SEAMCALL(_kvm, seamcall, _fn, _args)

#define tdx_kvm_seamcall_ret(_kvm, _fn, _args) \
TDX_KVM_SEAMCALL(_kvm, seamcall_ret, _fn, _args)

#define tdx_kvm_seamcall_saved_ret(_kvm, _fn, _args) \
TDX_KVM_SEAMCALL(_kvm, seamcall_saved_ret, _fn, _args)

This is consistent with what we have in TDX host code, and this handles
NO_ENTROPY error internally.

Or, maybe we can just use the seamcall_ret() for ALL SEAMCALLs, except
using seamcall_saved_ret() for TDH.VP.ENTER.

u64 tdx_kvm_seamcall(struct kvm *kvm, u64 fn,
		     struct tdx_module_args *args)
{
	u64 ret = seamcall_ret(fn, args);

	KVM_BUG_ON(kvm, is_seamcall_err_kernel_defined(ret));

	return ret;
}

IIUC this at least should give us a single tdx_kvm_seamcall() API for the
majority (99%) of call sites?

And obviously I'd like other people to weigh in too.

> Because only TDH.MNG.CREATE() and TDH.MNG.ADDCX() can return TDX_RND_NO_ENTROPY,
> we can use __seamcall(). The TDX spec doesn't guarantee such an error code
> convention; it's very unlikely, though.

I don't quite follow the "convention" part. Can you elaborate?

NO_ENTROPY is already handled in seamcall() variants. Can we just use
them directly?

>
>
>>> +static inline u64 tdh_sys_lp_shutdown(void)
>>> +{
>>> + struct tdx_module_args in = {
>>> + };
>>> +
>>> + return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
>>> +}
>>
>> As Sean already pointed out, I am sure it's/should not used in this series.
>>
>> That being said, I found it's not easy to determine whether one wrapper will
>> be used by this series or not. The other option is to introduce the
>> wrapper(s) when they get actually used, but I can see (especially at this
>> stage) it's also an apples vs. oranges question where people may have different
>> preferences.
>>
>> Perhaps we can say something like below in changelog ...
>>
>> "
>> Note, not all VM-management related SEAMCALLs have a wrapper here; only the
>> wrappers that are essential to run the TDX guest with a basic feature set
>> are provided.
>> "
>>
>> ... so that people will at least pay attention to this during the review?
>
> Makes sense. We can split this patch into other patches that first use the
> wrappers.

Obviously I didn't want to make you do dramatic patchset reorganization,
so it's up to you.

2024-03-20 23:10:27

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error


> Does it make sense?
>
> void pr_tdx_error(u64 op, u64 error_code)
> {
> pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
> op, error_code);
> }

Should we also have a _ret version?

void pr_seamcall_err(u64 op, u64 err)
{
/* A comment to explain why using the _ratelimited() version? */
pr_err_ratelimited(...);
}

void pr_seamcall_err_ret(u64 op, u64 err, struct tdx_module_args *arg)
{
pr_seamcall_err(op, err);

pr_err_ratelimited(...);
}

(Hmm... if you look at the tdx.c in TDX host, there's similar code
there, and again, it was a little bit annoying when I did that..)

Again, if we just use seamcall_ret() for ALL SEAMCALLs except VP.ENTER,
we can simply have one..

>
> void pr_tdx_sept_error(u64 op, u64 error_code, const union tdx_sept_entry *entry,
> const union tdx_sept_level_state *level_state)
> {
> #define MSG \
> "SEAMCALL (0x%016llx) failed: 0x%016llx entry 0x%016llx level_state 0x%016llx\n"
> pr_err_ratelimited(MSG, op, error_code, entry->raw, level_state->raw);
> }

A higher-level wrapper to print SEPT error is fine to me, but do it in a
separate patch.

2024-03-21 00:11:39

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> To handle private page tables, the argument of is_private needs to be passed
> down.  Given that the page level is already passed down, it would be
> cumbersome to add one more parameter about sp.  Instead replace the level
> argument with union kvm_mmu_page_role.  Thus the number of arguments won't be
> increased and more info about sp can be passed down.
>
> For private sp, a secure page table will also be allocated in addition to
> struct kvm_mmu_page and the page table (spt member).  The allocation
> functions (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to
> know if the allocation is for the conventional page table or the private
> page table.  Pass union kvm_mmu_role to those functions and initialize the
> role member of struct kvm_mmu_page.

tdp_mmu_alloc_sp() is only called in two places. One for the root, and
one for the mid-level tables.

In later patches when the kvm_mmu_alloc_private_spt() part is added,
the root case doesn't need anything done. So the code has to take
special care in tdp_mmu_alloc_sp() to avoid doing anything for the
root.

It only needs to do the special private spt allocation in non-root
case. If we open code that case, I think maybe we could drop this
patch, like the below.

The benefits are to drop this patch (which looks to already be part of
Paolo's series), and simplify "KVM: x86/mmu: Add a private pointer to
struct kvm_mmu_page". I'm not sure though, what do you think? Only
build tested.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index f1533a753974..d6c2ee8bb636 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -176,30 +176,12 @@ static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *priva
 
 static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu_page *sp)
 {
-	bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
-
-	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
-	if (is_root)
-		/*
-		 * Because TDX module assigns root Secure-EPT page and set it to
-		 * Secure-EPTP when TD vcpu is created, secure page table for
-		 * root isn't needed.
-		 */
-		sp->private_spt = NULL;
-	else {
-		/*
-		 * Because the TDX module doesn't trust VMM and initializes
-		 * the pages itself, KVM doesn't initialize them.  Allocate
-		 * pages with garbage and give them to the TDX module.
-		 */
-		sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
-		/*
-		 * Because mmu_private_spt_cache is topped up before starting
-		 * kvm page fault resolving, the allocation above shouldn't
-		 * fail.
-		 */
-		WARN_ON_ONCE(!sp->private_spt);
-	}
+	/*
+	 * Because the TDX module doesn't trust VMM and initializes
+	 * the pages itself, KVM doesn't initialize them.  Allocate
+	 * pages with garbage and give them to the TDX module.
+	 */
+	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
 }
 
 static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ac7bf37b353f..f423a38019fb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -195,9 +195,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 
-	if (kvm_mmu_page_role_is_private(role))
-		kvm_mmu_alloc_private_spt(vcpu, sp);
-
 	return sp;
 }
 
@@ -1378,6 +1375,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * needs to be split.
 		 */
 		sp = tdp_mmu_alloc_sp(vcpu);
+		if (!(raw_gfn & kvm_gfn_shared_mask(kvm)))
+			kvm_mmu_alloc_private_spt(vcpu, sp);
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
@@ -1670,7 +1669,6 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(struct kvm *kvm, gfp_t
 
 	sp->spt = (void *)__get_free_page(gfp);
 	/* TODO: large page support for private GPA. */
-	WARN_ON_ONCE(kvm_mmu_page_role_is_private(role));
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
@@ -1686,10 +1684,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	struct kvm_mmu_page *sp;
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
-	KVM_BUG_ON(kvm_mmu_page_role_is_private(role) !=
-		   is_private_sptep(iter->sptep), kvm);
-	/* TODO: Large page isn't supported for private SPTE yet. */
-	KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm);
 
 	/*
 	 * Since we are allocating while under the MMU lock we have to be

2024-03-21 00:19:09

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 057/130] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Because TDX support introduces private mapping, add a new member in
> union
> kvm_mmu_page_role with access functions to check the member.

I guess we should have a role bit for private like in this patch, but
just barely. AFAICT we have a gfn and struct kvm in every place where
it is checked (assuming my proposal in patch 56 holds water). So we
could have
bool is_private = !(gfn & kvm_gfn_shared_mask(kvm));

But there are extra bits available in the role, so we can skip the
extra step. Can you think of any more reasons? I want to try to write a
log for this one. It's very short.

>
> +static inline bool is_private_sptep(u64 *sptep)
> +{
> +       if (WARN_ON_ONCE(!sptep))
> +               return false;

This is not supposed to be NULL, from the existence of the warning. It
looks like some previous comments were to not let the NULL pointer
dereference happen and bail if it's NULL. I think maybe we should just
drop the check and warning completely. The NULL pointer dereference will
be plenty loud if it happens.

> +       return is_private_sp(sptep_to_sp(sptep));
> +}
> +

2024-03-21 01:07:22

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 041/130] KVM: TDX: Refuse to unplug the last cpu on the package

>diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>index 437c6d5e802e..d69dd474775b 100644
>--- a/arch/x86/kvm/vmx/main.c
>+++ b/arch/x86/kvm/vmx/main.c
>@@ -110,6 +110,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .check_processor_compatibility = vmx_check_processor_compat,
>
> .hardware_unsetup = vt_hardware_unsetup,
>+ .offline_cpu = tdx_offline_cpu,
>
> /* TDX cpu enablement is done by tdx_hardware_setup(). */
> .hardware_enable = vmx_hardware_enable,
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index b11f105db3cd..f2ee5abac14e 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -97,6 +97,7 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> */
> static DEFINE_MUTEX(tdx_lock);
> static struct mutex *tdx_mng_key_config_lock;
>+static atomic_t nr_configured_hkid;
>
> static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> {
>@@ -112,6 +113,7 @@ static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> {
> tdx_guest_keyid_free(kvm_tdx->hkid);
> kvm_tdx->hkid = -1;
>+ atomic_dec(&nr_configured_hkid);

I think it may be better to extend the IDA infrastructure, e.g., add an API to
check whether any ID is allocated in a given range. No strong opinion on this.

> }
>
> static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
>@@ -586,6 +588,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> if (ret < 0)
> return ret;
> kvm_tdx->hkid = ret;
>+ atomic_inc(&nr_configured_hkid);
>
> va = __get_free_page(GFP_KERNEL_ACCOUNT);
> if (!va)
>@@ -1071,3 +1074,41 @@ void tdx_hardware_unsetup(void)
> kfree(tdx_info);
> kfree(tdx_mng_key_config_lock);
> }
>+
>+int tdx_offline_cpu(void)
>+{
>+ int curr_cpu = smp_processor_id();
>+ cpumask_var_t packages;
>+ int ret = 0;
>+ int i;
>+
>+ /* No TD is running. Allow any cpu to be offline. */
>+ if (!atomic_read(&nr_configured_hkid))
>+ return 0;
>+
>+ /*
>+ * In order to reclaim TDX HKID, (i.e. when deleting guest TD), need to
>+ * call TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory
>+ * controller with pconfig. If we have active TDX HKID, refuse to
>+ * offline the last online cpu.
>+ */
>+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
>+ return -ENOMEM;
>+ for_each_online_cpu(i) {
>+ if (i != curr_cpu)
>+ cpumask_set_cpu(topology_physical_package_id(i), packages);
>+ }

Just check if any other CPU is in the same package of the one about to go
offline. This would obviate the need for the cpumask and allow us to break once
one cpu in the same package is found.

>+ /* Check if this cpu is the last online cpu of this package. */
>+ if (!cpumask_test_cpu(topology_physical_package_id(curr_cpu), packages))
>+ ret = -EBUSY;
>+ free_cpumask_var(packages);
>+ if (ret)
>+ /*
>+ * Because it's hard for human operator to understand the
>+ * reason, warn it.
>+ */
>+#define MSG_ALLPKG_ONLINE \
>+ "TDX requires all packages to have an online CPU. Delete all TDs in order to offline all CPUs of a package.\n"
>+ pr_warn_ratelimited(MSG_ALLPKG_ONLINE);
>+ return ret;
>+}

2024-03-21 01:17:56

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Tue, 2024-03-19 at 17:56 -0700, Rick Edgecombe wrote:
> > Because TDX supports only WB, we
> > ignore the request for MTRR and lapic page change to not zap
> > private
> > pages on unmapping for those two cases
>
> Hmm. I need to go back and look at this again. It's not clear from
> the
> description why it is safe for the host to not zap pages if requested
> to. I see why the guest wouldn't want them to be zapped.

Ok, I see now how this works. MTRRs and APIC zapping happen to use the
same function: kvm_zap_gfn_range(). So restricting that function from
zapping private pages has the desired effect. I think it's not ideal
that kvm_zap_gfn_range() silently skips zapping some ranges. I wonder
if we could pass something in, so it's more clear to the caller.

But can these code paths even get reached in TDX? It sounded like MTRRs
basically weren't supported.

2024-03-21 01:30:43

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 043/130] KVM: TDX: create/free TDX vcpu structure

>+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>+{
>+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+
>+ WARN_ON_ONCE(vcpu->arch.cpuid_entries);
>+ WARN_ON_ONCE(vcpu->arch.cpuid_nent);
>+
>+ /* TDX only supports x2APIC, which requires an in-kernel local APIC. */

Can't QEMU emulate x2APIC? In my understanding, the reason is that the TDX
module always enables APICv for TDs, so KVM cannot intercept every access to
the APIC and forward it to QEMU for emulation.

>+ if (!vcpu->arch.apic)

Will "if (!irqchip_in_kernel(vcpu->kvm))" work? It looks like this is the
customary way to do such a check.

>+ return -EINVAL;
>+
>+ fpstate_set_confidential(&vcpu->arch.guest_fpu);
>+
>+ vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
>+
>+ vcpu->arch.cr0_guest_owned_bits = -1ul;
>+ vcpu->arch.cr4_guest_owned_bits = -1ul;
>+
>+ vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;

kvm_tdx->tsc_offset;

>+ vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
>+ vcpu->arch.guest_state_protected =
>+ !(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);

!(kvm_tdx->attributes & TDX_TD_ATTRIBUTE_DEBUG);

>+
>+ if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
>+ vcpu->arch.xfd_no_write_intercept = true;
>+
>+ return 0;
>+}

2024-03-21 05:43:51

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 044/130] KVM: TDX: Do TDX specific vcpu initialization

>+/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
>+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>+{
>+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+ struct vcpu_tdx *tdx = to_tdx(vcpu);
>+ unsigned long *tdvpx_pa = NULL;
>+ unsigned long tdvpr_pa;
>+ unsigned long va;
>+ int ret, i;
>+ u64 err;
>+
>+ if (is_td_vcpu_created(tdx))
>+ return -EINVAL;
>+
>+ /*
>+ * vcpu_free method frees allocated pages. Avoid partial setup so
>+ * that the method can't handle it.
>+ */
>+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+ if (!va)
>+ return -ENOMEM;
>+ tdvpr_pa = __pa(va);
>+
>+ tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
>+ GFP_KERNEL_ACCOUNT);
>+ if (!tdvpx_pa) {
>+ ret = -ENOMEM;
>+ goto free_tdvpr;
>+ }
>+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
>+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+ if (!va) {
>+ ret = -ENOMEM;
>+ goto free_tdvpx;
>+ }
>+ tdvpx_pa[i] = __pa(va);
>+ }
>+
>+ err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
>+ if (KVM_BUG_ON(err, vcpu->kvm)) {
>+ ret = -EIO;
>+ pr_tdx_error(TDH_VP_CREATE, err, NULL);
>+ goto free_tdvpx;
>+ }
>+ tdx->tdvpr_pa = tdvpr_pa;
>+
>+ tdx->tdvpx_pa = tdvpx_pa;
>+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {

Can you merge the for-loop above into this one? then ...

>+ err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
>+ if (KVM_BUG_ON(err, vcpu->kvm)) {
>+ pr_tdx_error(TDH_VP_ADDCX, err, NULL);

>+ for (; i < tdx_info->nr_tdvpx_pages; i++) {
>+ free_page((unsigned long)__va(tdvpx_pa[i]));
>+ tdvpx_pa[i] = 0;
>+ }

.. no need to free remaining pages.

>+ /* vcpu_free method frees TDVPX and TDR donated to TDX */
>+ return -EIO;
>+ }
>+ }
>+
>+ err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
>+ if (KVM_BUG_ON(err, vcpu->kvm)) {
>+ pr_tdx_error(TDH_VP_INIT, err, NULL);
>+ return -EIO;
>+ }
>+
>+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>+ tdx->td_vcpu_created = true;
>+ return 0;
>+
>+free_tdvpx:
>+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
>+ if (tdvpx_pa[i])
>+ free_page((unsigned long)__va(tdvpx_pa[i]));
>+ tdvpx_pa[i] = 0;
>+ }
>+ kfree(tdvpx_pa);
>+ tdx->tdvpx_pa = NULL;
>+free_tdvpr:
>+ if (tdvpr_pa)
>+ free_page((unsigned long)__va(tdvpr_pa));
>+ tdx->tdvpr_pa = 0;
>+
>+ return ret;
>+}
>+
>+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>+{
>+ struct msr_data apic_base_msr;
>+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+ struct vcpu_tdx *tdx = to_tdx(vcpu);
>+ struct kvm_tdx_cmd cmd;
>+ int ret;
>+
>+ if (tdx->initialized)
>+ return -EINVAL;
>+
>+ if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))

These checks look random, e.g., I am not sure why is_td_created() isn't checked here.

A few helper functions and boolean variables are added to track which stage the
TD or TD vCPU is in. e.g.,

is_hkid_assigned()
is_td_finalized()
is_td_created()
tdx->initialized
td_vcpu_created

Instead of doing this, I am wondering if adding two state machines, for the
TD and the TD vCPU, would make the implementation clearer and easier to extend.

>+ return -EINVAL;
>+
>+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
>+ return -EFAULT;
>+
>+ if (cmd.error)
>+ return -EINVAL;
>+
>+ /* Currently only KVM_TDX_INTI_VCPU is defined for vcpu operation. */
>+ if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
>+ return -EINVAL;

Even though KVM_TDX_INIT_VCPU is the only supported command, it is worthwhile to
use a switch-case statement. New commands can be added easily without the need
to refactor this function first.

>+
>+ /*
>+ * As TDX requires X2APIC, set local apic mode to X2APIC. User space
>+ * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
>+ * KVM_SET_CPUID2. Otherwise kvm_set_apic_base() will fail.
>+ */
>+ apic_base_msr = (struct msr_data) {
>+ .host_initiated = true,
>+ .data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
>+ (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
>+ };
>+ if (kvm_set_apic_base(vcpu, &apic_base_msr))
>+ return -EINVAL;

Exporting kvm_vcpu_is_reset_bsp() and kvm_set_apic_base() should be done
here (rather than in a previous patch).

>+
>+ ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
>+ if (ret)
>+ return ret;
>+
>+ tdx->initialized = true;
>+ return 0;
>+}
>+

>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index c002761bb662..2bd4b7c8fa51 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -6274,6 +6274,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> case KVM_SET_DEVICE_ATTR:
> r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
> break;
>+ case KVM_MEMORY_ENCRYPT_OP:
>+ r = -ENOTTY;

Maybe -EINVAL is better. Because previously trying to call this on vCPU fd
failed with -EINVAL given ...

>+ if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
>+ goto out;
>+ r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
>+ break;
> default:
> r = -EINVAL;

.. this.

> }
>--
>2.25.1
>
>

2024-03-21 11:28:04

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Currently, KVM VMX module initialization/exit functions are a single
> function each. Refactor KVM VMX module initialization functions into KVM
> common part and VMX part so that TDX specific part can be added cleanly.
> Opportunistically refactor module exit function as well.
>
> The current module initialization flow is,

^ ',' -> ':'

And please add an empty line to make text more breathable.

> 0.) Check if VMX is supported,
> 1.) hyper-v specific initialization,
> 2.) system-wide x86 specific and vendor specific initialization,
> 3.) Final VMX specific system-wide initialization,
> 4.) calculate the sizes of VMX kvm structure and VMX vcpu structure,
> 5.) report those sizes to the KVM common layer and KVM common
> initialization

Is there any difference between "KVM common layer" and "KVM common
initialization"? I think you can remove the former.

>
> Refactor the KVM VMX module initialization function into functions with a
> wrapper function to separate VMX logic in vmx.c from a file, main.c, common
> among VMX and TDX. Introduce a wrapper function for vmx_init().

Sorry, I don't quite follow what you are trying to say in the above paragraph.

You have adequately put what is the _current_ flow, and I am expecting to see
the flow _after_ the refactor here.

>
> The KVM architecture common layer allocates struct kvm with reported size
> for architecture-specific code. The KVM VMX module defines its structure
> as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define

^vmx_kvm.

Please be more consistent on the words.

> TDX specific kvm and vcpu structures.

Is this paragraph related to the changes in this patch?

For instance, why do you need to point out we will have TDX-specific 'kvm and
vcpu' structures?

>
> The current module exit function is also a single function, a combination
> of VMX specific logic and common KVM logic. Refactor it into VMX specific
> logic and KVM common logic.  
>

[...]

> This is just refactoring to keep the VMX
> specific logic in vmx.c from main.c.

It's better to make this as a separate paragraph, because it is a summary to
this patch.

And in other words: No functional change intended?

2024-03-21 12:41:06

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, 2024-03-15 at 16:25 -0700, Isaku Yamahata wrote:
> > > > How about if there are some LPs that are offline.
> > > > In tdx_hardware_setup(), only online LPs are initialed for TDX, right?
> > > Correct.
> > >
> > >
> > > > Then when an offline LP becoming online, it doesn't have a chance to call
> > > > tdx_cpu_enable()?
> > > KVM registers kvm_online/offline_cpu() @ kvm_main.c as cpu hotplug callbacks.
> > > Eventually x86 kvm hardware_enable() is called on online/offline event.
> >
> > Yes, hardware_enable() will be called when online,
> > but  hardware_enable() now is vmx_hardware_enable() right?
> > It doesn't call tdx_cpu_enable() during the online path.
>
> TDX module requires TDH.SYS.LP.INIT() on all logical processors(LPs).  If we
> successfully initialized TDX module, we don't need further action for TDX on cpu
> online/offline.
>
> If some of LPs are not online when loading kvm_intel.ko, KVM fails to initialize
> TDX module. TDX support is disabled.  We don't bother to attempt it.  Leave it
> to the admin of the machine.

No. We have relaxed this. Now the TDX module can be initialized on a subset of
all logical cpus, with arbitrary number of cpus being offline.

Those cpus can become online after module initialization, and TDH.SYS.LP.INIT on
them won't fail.

2024-03-21 13:08:01

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX requires several initialization steps for KVM to create guest TDs.
> Detect CPU feature, enable VMX (TDX is based on VMX) on all online CPUs,
> detect the TDX module availability, initialize it and disable VMX.

Before KVM can use TDX to create and run TDX guests, the kernel needs to
initialize TDX from two perspectives:

1) Initialize the TDX module.
2) Do the "per-cpu initialization" on any logical cpu before running any TDX
code on that cpu.

The host kernel provides two functions to do them respectively: tdx_cpu_enable()
and tdx_enable().

Currently, tdx_enable() requires all online cpus being in VMX operation with CPU
hotplug disabled, and tdx_cpu_enable() needs to be called on local cpu with that
cpu being in VMX operation and IRQ disabled.

>
> To enable/disable VMX on all online CPUs, utilize
> vmx_hardware_enable/disable(). The method also initializes each CPU for
> TDX.  
>

I don't understand what you are saying here.

Did you mean you put tdx_cpu_enable() inside vmx_hardware_enable()?

> TDX requires calling a TDX initialization function per logical
> processor (LP) before the LP uses TDX.  
>

[...]

> When the CPU is becoming online,
> call the TDX LP initialization API. If it fails to initialize TDX, refuse
> CPU online for simplicity instead of TDX avoiding the failed LP.

Unless I am missing something, I don't see this has been done in the code.

>
> There are several options on when to initialize the TDX module. A.) kernel
> module loading time, B.) the first guest TD creation time. A.) was chosen.

A.) was chosen -> Choose A).

Describe your change in "imperative mood".

> With B.), a user may hit an error of the TDX initialization when trying to
> create the first guest TD. The machine that fails to initialize the TDX
> module can't boot any guest TD further. Such failure is undesirable and a
> surprise because the user expects that the machine can accommodate guest
> TD, but not. So A.) is better than B.).
>
> Introduce a module parameter, kvm_intel.tdx, to explicitly enable TDX KVM

You don't have to say the name of the new parameter. It's shown in the code.

> support. It's off by default to keep the same behavior for those who don't
> use TDX.  
>

[...]


> Implement hardware_setup method to detect TDX feature of CPU and
> initialize TDX module.

You are not detecting TDX feature anymore.

And put this in a separate paragraph (at a better place), as I don't see how
this is connected to "introduce a module parameter".

>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - fixed vt_hardware_enable() to use vmx_hardware_enable()
> - renamed vmx_tdx_enabled => tdx_enabled
> - renamed vmx_tdx_on() => tdx_on()
>
> v18:
> - Added comment in vt_hardware_enable() by Binbin.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/Makefile | 1 +
> arch/x86/kvm/vmx/main.c | 19 ++++++++-
> arch/x86/kvm/vmx/tdx.c | 84 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 6 +++
> 4 files changed, 109 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/kvm/vmx/tdx.c
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 274df24b647f..5b85ef84b2e9 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,6 +24,7 @@ kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>
> kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
> kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o
>
> kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
> svm/sev.o
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 18cecf12c7c8..18aef6e23aab 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -6,6 +6,22 @@
> #include "nested.h"
> #include "pmu.h"
>
> +static bool enable_tdx __ro_after_init;
> +module_param_named(tdx, enable_tdx, bool, 0444);
> +
> +static __init int vt_hardware_setup(void)
> +{
> + int ret;
> +
> + ret = vmx_hardware_setup();
> + if (ret)
> + return ret;
> +
> + enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> +
> + return 0;
> +}
> +
> #define VMX_REQUIRED_APICV_INHIBITS \
> (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .hardware_unsetup = vmx_hardware_unsetup,
>
> + /* TDX cpu enablement is done by tdx_hardware_setup(). */

What's the point of this comment? I don't understand it either.

> .hardware_enable = vmx_hardware_enable,
> .hardware_disable = vmx_hardware_disable,

Shouldn't you also implement vt_hardware_enable(), which also does
tdx_cpu_enable()?

Because I don't see vmx_hardware_enable() is changed to call tdx_cpu_enable() to
make CPU hotplug work with TDX.

> .has_emulated_msr = vmx_has_emulated_msr,
> @@ -161,7 +178,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> - .hardware_setup = vmx_hardware_setup,
> + .hardware_setup = vt_hardware_setup,
> .handle_intel_pt_intr = NULL,
>
> .runtime_ops = &vt_x86_ops,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> new file mode 100644
> index 000000000000..43c504fb4fed
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpu.h>
> +
> +#include <asm/tdx.h>
> +
> +#include "capabilities.h"
> +#include "x86_ops.h"
> +#include "x86.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +static int __init tdx_module_setup(void)
> +{
> + int ret;
> +
> + ret = tdx_enable();
> + if (ret) {
> + pr_info("Failed to initialize TDX module.\n");

As I commented before, tdx_enable() itself will print similar message when it
fails, so no need to print again.

> + return ret;
> + }
> +
> + return 0;
> +}

That being said, I don't think tdx_module_setup() is necessary. Just call
tdx_enable() directly.

> +
> +struct tdx_enabled {
> + cpumask_var_t enabled;
> + atomic_t err;
> +};

struct cpu_tdx_init_ctx {
cpumask_var_t vmx_enabled_cpumask;
atomic_t err;
};

?

> +
> +static void __init tdx_on(void *_enable)

tdx_on() -> cpu_tdx_init(), or cpu_tdx_on()?

> +{
> + struct tdx_enabled *enable = _enable;
> + int r;
> +
> + r = vmx_hardware_enable();
> + if (!r) {
> + cpumask_set_cpu(smp_processor_id(), enable->enabled);
> + r = tdx_cpu_enable();
> + }
> + if (r)
> + atomic_set(&enable->err, r);
> +}
> +
> +static void __init vmx_off(void *_enabled)

cpu_vmx_off() ?

> +{
> + cpumask_var_t *enabled = (cpumask_var_t *)_enabled;
> +
> + if (cpumask_test_cpu(smp_processor_id(), *enabled))
> + vmx_hardware_disable();
> +}
> +
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)

Why do you need the 'x86_ops' function argument? I don't see it being used.

> +{
> + struct tdx_enabled enable = {
> + .err = ATOMIC_INIT(0),
> + };
> + int r = 0;
> +
> + if (!enable_ept) {
> + pr_warn("Cannot enable TDX with EPT disabled\n");
> + return -EINVAL;
> + }
> +
> + if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
> + r = -ENOMEM;
> + goto out;
> + }
> +
> + /* tdx_enable() in tdx_module_setup() requires cpus lock. */

/* tdx_enable() must be called with CPU hotplug disabled */

> + cpus_read_lock();
> + on_each_cpu(tdx_on, &enable, true); /* TDX requires vmxon. */

I don't think you need this comment _here_.

If you want keep it, move to the tdx_on() where the code does what this comment
say.

> + r = atomic_read(&enable.err);
> + if (!r)
> + r = tdx_module_setup();
> + else
> + r = -EIO;
> + on_each_cpu(vmx_off, &enable.enabled, true);
> + cpus_read_unlock();
> + free_cpumask_var(enable.enabled);
> +
> +out:
> + return r;
> +}

At last, I think there's one problem here:

KVM actually only registers CPU hotplug callback in kvm_init(), which happens
way after tdx_hardware_setup().

What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
kvm_init()?

Looks we have two options:

1) move registering CPU hotplug callback before tdx_hardware_setup(), or
2) we need to disable CPU hotplug until callbacks have been registered.

Perhaps the second one is easier, because for the first one we need to make sure
the kvm_cpu_online() is ready to be called right after tdx_hardware_setup().

And no one cares if CPU hotplug is disabled during KVM module loading.

That being said, we can even just disable CPU hotplug during the entire
vt_init(), if in this way the code change is simple?

But anyway, to make this patch complete, I think you need to replace
vmx_hardware_enable() with vt_hardware_enable() and do tdx_cpu_enable() to
handle TDX vs CPU hotplug in _this_ patch.

2024-03-21 14:17:23

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Wed, Mar 20, 2024 at 01:12:01PM +0800,
Chao Gao <[email protected]> wrote:

> > config KVM_SW_PROTECTED_VM
> > bool "Enable support for KVM software-protected VMs"
> >- depends on EXPERT
> > depends on KVM && X86_64
> > select KVM_GENERIC_PRIVATE_MEM
> > help
> >@@ -89,6 +88,8 @@ config KVM_SW_PROTECTED_VM
> > config KVM_INTEL
> > tristate "KVM for Intel (and compatible) processors support"
> > depends on KVM && IA32_FEAT_CTL
> >+ select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
>
> why does INTEL_TDX_HOST select KVM_SW_PROTECTED_VM?

I wanted KVM_GENERIC_PRIVATE_MEM. Ah, we should do

select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST


> >+ select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
> > help
> > .vcpu_precreate = vmx_vcpu_precreate,
> > .vcpu_create = vmx_vcpu_create,
>
> >--- a/arch/x86/kvm/vmx/tdx.c
> >+++ b/arch/x86/kvm/vmx/tdx.c
> >@@ -5,10 +5,11 @@
> >
> > #include "capabilities.h"
> > #include "x86_ops.h"
> >-#include "x86.h"
> > #include "mmu.h"
> > #include "tdx_arch.h"
> > #include "tdx.h"
> >+#include "tdx_ops.h"
> >+#include "x86.h"
>
> any reason to reorder x86.h?

No, I think it's accidental during rebase.
Will fix.



> >+static void tdx_do_tdh_phymem_cache_wb(void *unused)
> >+{
> >+ u64 err = 0;
> >+
> >+ do {
> >+ err = tdh_phymem_cache_wb(!!err);
> >+ } while (err == TDX_INTERRUPTED_RESUMABLE);
> >+
> >+ /* Other thread may have done for us. */
> >+ if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> >+ err = TDX_SUCCESS;
> >+ if (WARN_ON_ONCE(err))
> >+ pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> >+}
> >+
> >+void tdx_mmu_release_hkid(struct kvm *kvm)
> >+{
> >+ bool packages_allocated, targets_allocated;
> >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >+ cpumask_var_t packages, targets;
> >+ u64 err;
> >+ int i;
> >+
> >+ if (!is_hkid_assigned(kvm_tdx))
> >+ return;
> >+
> >+ if (!is_td_created(kvm_tdx)) {
> >+ tdx_hkid_free(kvm_tdx);
> >+ return;
> >+ }
> >+
> >+ packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> >+ targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> >+ cpus_read_lock();
> >+
> >+ /*
> >+ * We can destroy multiple guest TDs simultaneously. Prevent
> >+ * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> >+ */
> >+ mutex_lock(&tdx_lock);
> >+
> >+ /*
> >+ * Go through multiple TDX HKID state transitions with three SEAMCALLs
> >+ * to make TDH.PHYMEM.PAGE.RECLAIM() usable. Make the transition atomic
> >+ * to other functions to operate private pages and Secure-EPT pages.
> >+ *
> >+ * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
> >+ * This function is called via mmu notifier, mmu_release().
> >+ * kvm_gmem_release() is called via fput() on process exit.
> >+ */
> >+ write_lock(&kvm->mmu_lock);
> >+
> >+ for_each_online_cpu(i) {
> >+ if (packages_allocated &&
> >+ cpumask_test_and_set_cpu(topology_physical_package_id(i),
> >+ packages))
> >+ continue;
> >+ if (targets_allocated)
> >+ cpumask_set_cpu(i, targets);
> >+ }
> >+ if (targets_allocated)
> >+ on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
> >+ else
> >+ on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);
>
> This tries flush cache on all CPUs when we run out of memory. I am not sure if
> it is the best solution. A simple solution is just use two global bitmaps.
>
> And current logic isn't optimal. e.g., if packages_allocated is true while
> targets_allocated is false, then we will fill in the packages bitmap but don't
> use it at all.
>
> That said, I prefer to optimize the rare case in a separate patch. We can just use
> two global bitmaps or let the flush fail here just as you are doing below on
> seamcall failure.

Makes sense. We can allocate cpumasks on hardware_setup/unsetup() and update them
on hardware_enable/disable().

..

> >+static int __tdx_td_init(struct kvm *kvm)
> >+{
> >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >+ cpumask_var_t packages;
> >+ unsigned long *tdcs_pa = NULL;
> >+ unsigned long tdr_pa = 0;
> >+ unsigned long va;
> >+ int ret, i;
> >+ u64 err;
> >+
> >+ ret = tdx_guest_keyid_alloc();
> >+ if (ret < 0)
> >+ return ret;
> >+ kvm_tdx->hkid = ret;
> >+
> >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+ if (!va)
> >+ goto free_hkid;
> >+ tdr_pa = __pa(va);
> >+
> >+ tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
> >+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> >+ if (!tdcs_pa)
> >+ goto free_tdr;
> >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+ if (!va)
> >+ goto free_tdcs;
> >+ tdcs_pa[i] = __pa(va);
> >+ }
> >+
> >+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> >+ ret = -ENOMEM;
> >+ goto free_tdcs;
> >+ }
> >+ cpus_read_lock();
> >+ /*
> >+ * Need at least one CPU of the package to be online in order to
> >+ * program all packages for host key id. Check it.
> >+ */
> >+ for_each_present_cpu(i)
> >+ cpumask_set_cpu(topology_physical_package_id(i), packages);
> >+ for_each_online_cpu(i)
> >+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
> >+ if (!cpumask_empty(packages)) {
> >+ ret = -EIO;
> >+ /*
> >+ * Because it's hard for human operator to figure out the
> >+ * reason, warn it.
> >+ */
> >+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
> >+ pr_warn_ratelimited(MSG_ALLPKG);
> >+ goto free_packages;
> >+ }
> >+
> >+ /*
> >+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
> >+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
> >+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
> >+ * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
> >+ * caller to handle the contention. This is because of time limitation
> >+ * usable inside the TDX module and OS/VMM knows better about process
> >+ * scheduling.
> >+ *
> >+ * APIs to acquire the lock of KOT:
> >+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
> >+ * TDH.PHYMEM.CACHE.WB.
> >+ */
> >+ mutex_lock(&tdx_lock);
> >+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
> >+ mutex_unlock(&tdx_lock);
> >+ if (err == TDX_RND_NO_ENTROPY) {
> >+ ret = -EAGAIN;
> >+ goto free_packages;
> >+ }
> >+ if (WARN_ON_ONCE(err)) {
> >+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> >+ ret = -EIO;
> >+ goto free_packages;
> >+ }
> >+ kvm_tdx->tdr_pa = tdr_pa;
> >+
> >+ for_each_online_cpu(i) {
> >+ int pkg = topology_physical_package_id(i);
> >+
> >+ if (cpumask_test_and_set_cpu(pkg, packages))
> >+ continue;
> >+
> >+ /*
> >+ * Program the memory controller in the package with an
> >+ * encryption key associated to a TDX private host key id
> >+ * assigned to this TDR. Concurrent operations on same memory
> >+ * controller results in TDX_OPERAND_BUSY. Avoid this race by
> >+ * mutex.
> >+ */
> >+ mutex_lock(&tdx_mng_key_config_lock[pkg]);
>
> the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
> create TDs, the same set of CPUs (the first online CPU of each package) will be
> selected to configure the key because of the cpumask_test_and_set_cpu() above.
> it means, we never have two CPUs in the same socket trying to program the key,
> i.e., no concurrent calls.

Makes sense. Will drop the lock.


> >+ ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> >+ &kvm_tdx->tdr_pa, true);
> >+ mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> >+ if (ret)
> >+ break;
> >+ }
> >+ cpus_read_unlock();
> >+ free_cpumask_var(packages);
> >+ if (ret) {
> >+ i = 0;
> >+ goto teardown;
> >+ }
> >+
> >+ kvm_tdx->tdcs_pa = tdcs_pa;
> >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> >+ err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> >+ if (err == TDX_RND_NO_ENTROPY) {
> >+ /* Here it's hard to allow userspace to retry. */
> >+ ret = -EBUSY;
> >+ goto teardown;
> >+ }
> >+ if (WARN_ON_ONCE(err)) {
> >+ pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> >+ ret = -EIO;
> >+ goto teardown;
> >+ }
> >+ }
> >+
> >+ /*
> >+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> >+ * ioctl() to define the configure CPUID values for the TD.
> >+ */
> >+ return 0;
> >+
> >+ /*
> >+ * The sequence for freeing resources from a partially initialized TD
> >+ * varies based on where in the initialization flow failure occurred.
> >+ * Simply use the full teardown and destroy, which naturally play nice
> >+ * with partial initialization.
> >+ */
> >+teardown:
> >+ for (; i < tdx_info->nr_tdcs_pages; i++) {
> >+ if (tdcs_pa[i]) {
> >+ free_page((unsigned long)__va(tdcs_pa[i]));
> >+ tdcs_pa[i] = 0;
> >+ }
> >+ }
> >+ if (!kvm_tdx->tdcs_pa)
> >+ kfree(tdcs_pa);
> >+ tdx_mmu_release_hkid(kvm);
> >+ tdx_vm_free(kvm);
> >+ return ret;
> >+
> >+free_packages:
> >+ cpus_read_unlock();
> >+ free_cpumask_var(packages);
> >+free_tdcs:
> >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> >+ if (tdcs_pa[i])
> >+ free_page((unsigned long)__va(tdcs_pa[i]));
> >+ }
> >+ kfree(tdcs_pa);
> >+ kvm_tdx->tdcs_pa = NULL;
> >+
> >+free_tdr:
> >+ if (tdr_pa)
> >+ free_page((unsigned long)__va(tdr_pa));
> >+ kvm_tdx->tdr_pa = 0;
> >+free_hkid:
> >+ if (is_hkid_assigned(kvm_tdx))
>
> IIUC, this is always true because you just return if keyid
> allocation fails.

You're right. Will fix
--
Isaku Yamahata <[email protected]>

2024-03-21 15:56:26

by Isaku Yamahata

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Wed, Mar 20, 2024 at 02:12:49PM +0800,
Chao Gao <[email protected]> wrote:

> >+static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
> >+ struct td_params *td_params)
> >+{
> >+ int i;
> >+
> >+ /*
> >+ * td_params.cpuid_values: The number and the order of cpuid_value must
> >+ * be same to the one of struct tdsysinfo.{num_cpuid_config, cpuid_configs}
> >+ * It's assumed that td_params was zeroed.
> >+ */
> >+ for (i = 0; i < tdx_info->num_cpuid_config; i++) {
> >+ const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
> >+ /* KVM_TDX_CPUID_NO_SUBLEAF means index = 0. */
> >+ u32 index = c->sub_leaf == KVM_TDX_CPUID_NO_SUBLEAF ? 0 : c->sub_leaf;
> >+ const struct kvm_cpuid_entry2 *entry =
> >+ kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
> >+ c->leaf, index);
> >+ struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
> >+
> >+ if (!entry)
> >+ continue;
> >+
> >+ /*
> >+ * tdsysinfo.cpuid_configs[].{eax, ebx, ecx, edx}
> >+ * bit 1 means it can be configured to zero or one.
> >+ * bit 0 means it must be zero.
> >+ * Mask out non-configurable bits.
> >+ */
> >+ value->eax = entry->eax & c->eax;
> >+ value->ebx = entry->ebx & c->ebx;
> >+ value->ecx = entry->ecx & c->ecx;
> >+ value->edx = entry->edx & c->edx;
>
> Any reason to mask off non-configurable bits rather than return an error? this
> is misleading to userspace because guest sees the values emulated by TDX module
> instead of the values passed from userspace (i.e., the request from userspace
> isn't done but there is no indication of that to userspace).

OK, I'll eliminate them. If user space passes wrong CPUID values, the TDX module
will return an error. I'll leave the error check to the TDX module.


> >+ }
> >+}
> >+
> >+static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
> >+{
> >+ const struct kvm_cpuid_entry2 *entry;
> >+ u64 guest_supported_xcr0;
> >+ u64 guest_supported_xss;
> >+
> >+ /* Setup td_params.xfam */
> >+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
> >+ if (entry)
> >+ guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
> >+ else
> >+ guest_supported_xcr0 = 0;
> >+ guest_supported_xcr0 &= kvm_caps.supported_xcr0;
> >+
> >+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
> >+ if (entry)
> >+ guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
> >+ else
> >+ guest_supported_xss = 0;
> >+
> >+ /*
> >+ * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
> >+ * and, CET support.
> >+ */
> >+ guest_supported_xss &=
> >+ (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
> >+
> >+ td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> >+ if (td_params->xfam & XFEATURE_MASK_LBR) {
> >+ /*
> >+ * TODO: once KVM supports LBR(save/restore LBR related
> >+ * registers around TDENTER), remove this guard.
> >+ */
> >+#define MSG_LBR "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH properly.\n"
> >+ pr_warn(MSG_LBR);
>
> Drop the pr_warn() because userspace can trigger it at will.
>
> I don't think KVM needs to relay TDX module capabilities to userspace as-is.
> KVM should advertise a feature only if both TDX module's and KVM's support
> are in place. if KVM masked out LBR and PERFMON, it should be a problem of
> userspace and we don't need to warn here.

Makes sense. I'll drop those messages and not advertise those features to user
space.
--
Isaku Yamahata <[email protected]>

2024-03-21 17:31:02

by Isaku Yamahata

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Wed, Mar 20, 2024 at 04:15:23PM +0800,
Xiaoyao Li <[email protected]> wrote:

> On 2/26/2024 4:25 PM, [email protected] wrote:
>
> ...
>
> > +static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
> > +{
> > + const struct kvm_cpuid_entry2 *entry;
> > + u64 guest_supported_xcr0;
> > + u64 guest_supported_xss;
> > +
> > + /* Setup td_params.xfam */
> > + entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
> > + if (entry)
> > + guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
> > + else
> > + guest_supported_xcr0 = 0;
> > + guest_supported_xcr0 &= kvm_caps.supported_xcr0;
> > +
> > + entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
> > + if (entry)
> > + guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
> > + else
> > + guest_supported_xss = 0;
> > +
> > + /*
> > + * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
> > + * and, CET support.
> > + */
> > + guest_supported_xss &=
> > + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
> > +
> > + td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> > + if (td_params->xfam & XFEATURE_MASK_LBR) {
> > + /*
> > + * TODO: once KVM supports LBR(save/restore LBR related
> > + * registers around TDENTER), remove this guard.
> > + */
> > +#define MSG_LBR "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH properly.\n"
> > + pr_warn(MSG_LBR);
> > + return -EOPNOTSUPP;
>
> This unsupported behavior is totally decided by KVM even if TDX module
> supports it. I think we need to reflect it in tdx_info->xfam_fixed0, which
> gets reported to userspace via KVM_TDX_CAPABILITIES. So userspace will aware
> that LBR is not supported for TDs.

Yes, we can suppress KVM-unsupported features. I replied at
https://lore.kernel.org/kvm/[email protected]/

So far we have used KVM_TDX_CAPABILITIES for feature enumeration. I'm wondering about
KVM_GET_DEVICE_ATTR [1]. It's extensible in the future and is also consistent with SEV.

[1] https://lore.kernel.org/r/[email protected]
--
Isaku Yamahata <[email protected]>

2024-03-21 17:57:57

by Isaku Yamahata

Subject: Re: [PATCH v19 040/130] KVM: TDX: Make pmu_intel.c ignore guest TD case

On Wed, Mar 20, 2024 at 03:01:48PM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:25:42AM -0800, [email protected] wrote:
> >From: Isaku Yamahata <[email protected]>
> >
> >Because TDX KVM doesn't support PMU yet (it's future work of TDX KVM
> >support as another patch series) and pmu_intel.c touches vmx specific
> >structure in vcpu initialization, as workaround add dummy structure to
> >struct vcpu_tdx and pmu_intel.c can ignore TDX case.
>
> Can we instead factor pmu_intel.c to avoid corrupting memory? how hard would it
> be?

Do you mean sprinkling "if (tdx) return"? It's easy. Just add it to all hooks
in kvm_pmu_ops.

I chose this approach because we'll soon add vPMU support. For simplicity,
I will switch to sprinkling "if (tdx) return".

> >+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
> >+{
> >+ struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
> >+
> >+ if (is_td_vcpu(vcpu))
> >+ return false;
> >+
> >+ return lbr->nr && (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_LBR_FMT);
>
> The check about vcpu's perf capabilities is new. is it necessary?

No. Will delete it. It crept in during rebase.
--
Isaku Yamahata <[email protected]>

2024-03-21 18:07:21

by Isaku Yamahata

Subject: Re: [PATCH v19 041/130] KVM: TDX: Refuse to unplug the last cpu on the package

On Thu, Mar 21, 2024 at 09:06:46AM +0800,
Chao Gao <[email protected]> wrote:

> >diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> >index 437c6d5e802e..d69dd474775b 100644
> >--- a/arch/x86/kvm/vmx/main.c
> >+++ b/arch/x86/kvm/vmx/main.c
> >@@ -110,6 +110,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > .check_processor_compatibility = vmx_check_processor_compat,
> >
> > .hardware_unsetup = vt_hardware_unsetup,
> >+ .offline_cpu = tdx_offline_cpu,
> >
> > /* TDX cpu enablement is done by tdx_hardware_setup(). */
> > .hardware_enable = vmx_hardware_enable,
> >diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >index b11f105db3cd..f2ee5abac14e 100644
> >--- a/arch/x86/kvm/vmx/tdx.c
> >+++ b/arch/x86/kvm/vmx/tdx.c
> >@@ -97,6 +97,7 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> > */
> > static DEFINE_MUTEX(tdx_lock);
> > static struct mutex *tdx_mng_key_config_lock;
> >+static atomic_t nr_configured_hkid;
> >
> > static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > {
> >@@ -112,6 +113,7 @@ static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> > {
> > tdx_guest_keyid_free(kvm_tdx->hkid);
> > kvm_tdx->hkid = -1;
> >+ atomic_dec(&nr_configured_hkid);
>
> I may think it is better to extend IDA infrastructure e.g., add an API to check if
> any ID is allocated for a given range. No strong opinion on this.

Will use ida_is_empty().



> > }
> >
> > static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> >@@ -586,6 +588,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> > if (ret < 0)
> > return ret;
> > kvm_tdx->hkid = ret;
> >+ atomic_inc(&nr_configured_hkid);
> >
> > va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > if (!va)
> >@@ -1071,3 +1074,41 @@ void tdx_hardware_unsetup(void)
> > kfree(tdx_info);
> > kfree(tdx_mng_key_config_lock);
> > }
> >+
> >+int tdx_offline_cpu(void)
> >+{
> >+ int curr_cpu = smp_processor_id();
> >+ cpumask_var_t packages;
> >+ int ret = 0;
> >+ int i;
> >+
> >+ /* No TD is running. Allow any cpu to be offline. */
> >+ if (!atomic_read(&nr_configured_hkid))
> >+ return 0;
> >+
> >+ /*
> >+ * In order to reclaim TDX HKID, (i.e. when deleting guest TD), need to
> >+ * call TDH.PHYMEM.PAGE.WBINVD on all packages to program all memory
> >+ * controller with pconfig. If we have active TDX HKID, refuse to
> >+ * offline the last online cpu.
> >+ */
> >+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> >+ return -ENOMEM;
> >+ for_each_online_cpu(i) {
> >+ if (i != curr_cpu)
> >+ cpumask_set_cpu(topology_physical_package_id(i), packages);
> >+ }
>
> Just check if any other CPU is in the same package of the one about to go
> offline. This would obviate the need for the cpumask and allow us to break once
> one cpu in the same package is found.

Good idea. Will rewrite it that way.
--
Isaku Yamahata <[email protected]>

2024-03-21 20:22:02

by Isaku Yamahata

Subject: Re: [PATCH v19 043/130] KVM: TDX: create/free TDX vcpu structure

On Thu, Mar 21, 2024 at 09:30:12AM +0800,
Chao Gao <[email protected]> wrote:

> >+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> >+{
> >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >+
> >+ WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> >+ WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> >+
> >+ /* TDX only supports x2APIC, which requires an in-kernel local APIC. */
>
> Cannot QEMU emulate x2APIC? In my understanding, the reason is TDX module always
> enables APICv for TDs. So, KVM cannot intercept every access to APIC and forward
> them to QEMU for emulation.

You're right. Let me update it as follows.

/*
* TDX module always enables APICv for TDs. So, KVM cannot intercept every
* access to APIC and forward them to user space VMM.
*/



> >+ if (!vcpu->arch.apic)
>
> will "if (!irqchip_in_kernel(vcpu->kvm))" work? looks this is the custome for such
> a check.


It should work because of kvm_arch_vcpu_create(). Will update it.
--
Isaku Yamahata <[email protected]>

2024-03-21 20:44:05

by Isaku Yamahata

Subject: Re: [PATCH v19 044/130] KVM: TDX: Do TDX specific vcpu initialization

On Thu, Mar 21, 2024 at 01:43:14PM +0800,
Chao Gao <[email protected]> wrote:

> >+/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
> >+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> >+{
> >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >+ struct vcpu_tdx *tdx = to_tdx(vcpu);
> >+ unsigned long *tdvpx_pa = NULL;
> >+ unsigned long tdvpr_pa;
> >+ unsigned long va;
> >+ int ret, i;
> >+ u64 err;
> >+
> >+ if (is_td_vcpu_created(tdx))
> >+ return -EINVAL;
> >+
> >+ /*
> >+ * vcpu_free method frees allocated pages. Avoid partial setup so
> >+ * that the method can't handle it.
> >+ */
> >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+ if (!va)
> >+ return -ENOMEM;
> >+ tdvpr_pa = __pa(va);
> >+
> >+ tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
> >+ GFP_KERNEL_ACCOUNT);
> >+ if (!tdvpx_pa) {
> >+ ret = -ENOMEM;
> >+ goto free_tdvpr;
> >+ }
> >+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+ if (!va) {
> >+ ret = -ENOMEM;
> >+ goto free_tdvpx;
> >+ }
> >+ tdvpx_pa[i] = __pa(va);
> >+ }
> >+
> >+ err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> >+ if (KVM_BUG_ON(err, vcpu->kvm)) {
> >+ ret = -EIO;
> >+ pr_tdx_error(TDH_VP_CREATE, err, NULL);
> >+ goto free_tdvpx;
> >+ }
> >+ tdx->tdvpr_pa = tdvpr_pa;
> >+
> >+ tdx->tdvpx_pa = tdvpx_pa;
> >+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
>
> Can you merge the for-loop above into this one? then ...
>
> >+ err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> >+ if (KVM_BUG_ON(err, vcpu->kvm)) {
> >+ pr_tdx_error(TDH_VP_ADDCX, err, NULL);
>
> >+ for (; i < tdx_info->nr_tdvpx_pages; i++) {
> >+ free_page((unsigned long)__va(tdvpx_pa[i]));
> >+ tdvpx_pa[i] = 0;
> >+ }
>
> ... no need to free remaining pages.

Makes sense. Let me clean this up.


> >+ /* vcpu_free method frees TDVPX and TDR donated to TDX */
> >+ return -EIO;
> >+ }
> >+ }
> >+
> >+ err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> >+ if (KVM_BUG_ON(err, vcpu->kvm)) {
> >+ pr_tdx_error(TDH_VP_INIT, err, NULL);
> >+ return -EIO;
> >+ }
> >+
> >+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >+ tdx->td_vcpu_created = true;
> >+ return 0;
> >+
> >+free_tdvpx:
> >+ for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> >+ if (tdvpx_pa[i])
> >+ free_page((unsigned long)__va(tdvpx_pa[i]));
> >+ tdvpx_pa[i] = 0;
> >+ }
> >+ kfree(tdvpx_pa);
> >+ tdx->tdvpx_pa = NULL;
> >+free_tdvpr:
> >+ if (tdvpr_pa)
> >+ free_page((unsigned long)__va(tdvpr_pa));
> >+ tdx->tdvpr_pa = 0;
> >+
> >+ return ret;
> >+}
> >+
> >+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> >+{
> >+ struct msr_data apic_base_msr;
> >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >+ struct vcpu_tdx *tdx = to_tdx(vcpu);
> >+ struct kvm_tdx_cmd cmd;
> >+ int ret;
> >+
> >+ if (tdx->initialized)
> >+ return -EINVAL;
> >+
> >+ if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
>
> These checks look random e.g., I am not sure why is_td_created() isn't check here.
>
> A few helper functions and boolean variables are added to track which stage the
> TD or TD vCPU is in. e.g.,
>
> is_hkid_assigned()
> is_td_finalized()
> is_td_created()
> tdx->initialized
> td_vcpu_created
>
> Insteading of doing this, I am wondering if adding two state machines for
> TD and TD vCPU would make the implementation clear and easy to extend.

Let me look into the state machine. Originally I hoped we wouldn't need it, but
it seems to deserve one.


> >+ return -EINVAL;
> >+
> >+ if (copy_from_user(&cmd, argp, sizeof(cmd)))
> >+ return -EFAULT;
> >+
> >+ if (cmd.error)
> >+ return -EINVAL;
> >+
> >+ /* Currently only KVM_TDX_INTI_VCPU is defined for vcpu operation. */
> >+ if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
> >+ return -EINVAL;
>
> Even though KVM_TD_INIT_VCPU is the only supported command, it is worthwhile to
> use a switch-case statement. New commands can be added easily without the need
> to refactor this function first.

Yes. For KVM_MAP_MEMORY, I will make KVM_TDX_INIT_MEM_REGION a vcpu ioctl instead
of a vm ioctl because it is consistent and scalable. We'll have a switch statement
in the next respin.

> >+
> >+ /*
> >+ * As TDX requires X2APIC, set local apic mode to X2APIC. User space
> >+ * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
> >+ * KVM_SET_CPUID2. Otherwise kvm_set_apic_base() will fail.
> >+ */
> >+ apic_base_msr = (struct msr_data) {
> >+ .host_initiated = true,
> >+ .data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
> >+ (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
> >+ };
> >+ if (kvm_set_apic_base(vcpu, &apic_base_msr))
> >+ return -EINVAL;
>
> Exporting kvm_vcpu_is_reset_bsp() and kvm_set_apic_base() should be done
> here (rather than in a previous patch).

Sure.


> >+
> >+ ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
> >+ if (ret)
> >+ return ret;
> >+
> >+ tdx->initialized = true;
> >+ return 0;
> >+}
> >+
>
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index c002761bb662..2bd4b7c8fa51 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -6274,6 +6274,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > case KVM_SET_DEVICE_ATTR:
> > r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
> > break;
> >+ case KVM_MEMORY_ENCRYPT_OP:
> >+ r = -ENOTTY;
>
> Maybe -EINVAL is better. Because previously trying to call this on vCPU fd
> failed with -EINVAL given ...

Oh, OK. Will change it. I followed the VM ioctl case for the default value, but
the vcpu ioctl seems to use -EINVAL as the default.
--
Isaku Yamahata <[email protected]>

2024-03-21 21:28:35

by Isaku Yamahata

Subject: Re: [PATCH v19 056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation

On Thu, Mar 21, 2024 at 12:11:11AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > To handle private page tables, argument of is_private needs to be
> > passed
> > down.  Given that already page level is passed down, it would be
> > cumbersome
> > to add one more parameter about sp. Instead replace the level
> > argument with
> > union kvm_mmu_page_role.  Thus the number of argument won't be
> > increased
> > and more info about sp can be passed down.
> >
> > For private sp, secure page table will be also allocated in addition
> > to
> > struct kvm_mmu_page and page table (spt member).  The allocation
> > functions
> > (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to know
> > if the
> > allocation is for the conventional page table or private page table. 
> > Pass
> > union kvm_mmu_role to those functions and initialize role member of
> > struct
> > kvm_mmu_page.
>
> tdp_mmu_alloc_sp() is only called in two places. One for the root, and
> one for the mid-level tables.
>
> In later patches when the kvm_mmu_alloc_private_spt() part is added,
> the root case doesn't need anything done. So the code has to take
> special care in tdp_mmu_alloc_sp() to avoid doing anything for the
> root.
>
> It only needs to do the special private spt allocation in non-root
> case. If we open code that case, I think maybe we could drop this
> patch, like the below.
>
> The benefits are to drop this patch (which looks to already be part of
> Paolo's series), and simplify "KVM: x86/mmu: Add a private pointer to
> struct kvm_mmu_page". I'm not sure though, what do you think? Only
> build tested.

Makes sense. Until v18, there was a config option to disable the private MMU
part at compile time; those functions had #ifdef guards in mmu_internal.h. v19
dropped the config based on the feedback.
https://lore.kernel.org/kvm/[email protected]/

After looking at mmu_internal.h, I think the following four functions could be
open coded:
kvm_mmu_private_spt(), kvm_mmu_init_private_spt(), kvm_mmu_alloc_private_spt(),
and kvm_mmu_free_private_spt().
--
Isaku Yamahata <[email protected]>

2024-03-21 21:38:32

by Huang, Kai

Subject: Re: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure



On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add placeholders TDX VM/vcpu structure that overlays with VMX VM/vcpu
> structures. Initialize VM structure size and vcpu size/align so that x86
> KVM common code knows those size irrespective of VMX or TDX. Those
> structures will be populated as guest creation logic develops.
>
> Add helper functions to check if the VM is guest TD and add conversion
> functions between KVM VM/VCPU and TDX VM/VCPU.

The changelog is essentially only saying "doing what" w/o "why".

Please at least explain why you invented the 'struct kvm_tdx' and
'struct vcpu_tdx', and why they are invented in this way.

E.g., can we extend 'struct kvm_vmx' for TDX?

struct kvm_tdx {
struct kvm_vmx vmx;
...
};

>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v19:
> - correctly update ops.vm_size, vcpu_size and, vcpu_align by Xiaoyao
>
> v14 -> v15:
> - use KVM_X86_TDX_VM
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 14 ++++++++++++
> arch/x86/kvm/vmx/tdx.c | 1 +
> arch/x86/kvm/vmx/tdx.h | 50 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 65 insertions(+)
> create mode 100644 arch/x86/kvm/vmx/tdx.h
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 18aef6e23aab..e11edbd19e7c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -5,6 +5,7 @@
> #include "vmx.h"
> #include "nested.h"
> #include "pmu.h"
> +#include "tdx.h"
>
> static bool enable_tdx __ro_after_init;
> module_param_named(tdx, enable_tdx, bool, 0444);
> @@ -18,6 +19,9 @@ static __init int vt_hardware_setup(void)
> return ret;
>
> enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> + if (enable_tdx)
> + vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> + sizeof(struct kvm_tdx));
>

Now I see why you included 'struct kvm_x86_ops' as function parameter.

Please move it to this patch.

> return 0;
> }
> @@ -215,8 +219,18 @@ static int __init vt_init(void)
> * Common KVM initialization _must_ come last, after this, /dev/kvm is
> * exposed to userspace!
> */
> + /*
> + * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
> + * be set before kvm_x86_vendor_init().
> + */
> vcpu_size = sizeof(struct vcpu_vmx);
> vcpu_align = __alignof__(struct vcpu_vmx);
> + if (enable_tdx) {
> + vcpu_size = max_t(unsigned int, vcpu_size,
> + sizeof(struct vcpu_tdx));
> + vcpu_align = max_t(unsigned int, vcpu_align,
> + __alignof__(struct vcpu_tdx));
> + }

Since you are updating vm_size in vt_hardware_setup(), I am wondering
whether we can do a similar thing for vcpu_size and vcpu_align.

That is, we put them both in 'struct kvm_x86_ops', and you update them
in vt_hardware_setup().

kvm_init() can then just access them directly; this way, both the
'vcpu_size' and 'vcpu_align' function parameters can be removed.


> r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
> if (r)
> goto err_kvm_init;

2024-03-21 21:40:48

by Huang, Kai

Subject: Re: [PATCH v19 025/130] KVM: TDX: Make TDX VM type supported



On 16/03/2024 10:36 am, Yamahata, Isaku wrote:
> On Thu, Mar 14, 2024 at 02:29:07PM +0800,
> Binbin Wu <[email protected]> wrote:
>
>>
>>
>> On 2/26/2024 4:25 PM, [email protected] wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> NOTE: This patch is in position of the patch series for developers to be
>>> able to test codes during the middle of the patch series although this
>>> patch series doesn't provide functional features until the all the patches
>>> of this patch series. When merging this patch series, this patch can be
>>> moved to the end.
>>
>> Maybe at this point of time, you can consider to move this patch to the end?
>
> Given I don't have to do step-by-step debug recently, I think it's safe to move
> it.

Even if you have to, I don't think it's a valid reason for "official"
patches.

I agree we should move this to the end after all build blocks of running
TDX guest is ready.

2024-03-21 21:58:25

by Huang, Kai

Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions


> +/*
> + * TDX SEAMCALL API function leaves
> + */
> +#define TDH_VP_ENTER 0
> +#define TDH_MNG_ADDCX 1
> +#define TDH_MEM_PAGE_ADD 2
> +#define TDH_MEM_SEPT_ADD 3
> +#define TDH_VP_ADDCX 4
> +#define TDH_MEM_PAGE_RELOCATE 5

I don't think the "RELOCATE" is needed in this patchset?

> +#define TDH_MEM_PAGE_AUG 6
> +#define TDH_MEM_RANGE_BLOCK 7
> +#define TDH_MNG_KEY_CONFIG 8
> +#define TDH_MNG_CREATE 9
> +#define TDH_VP_CREATE 10
> +#define TDH_MNG_RD 11
> +#define TDH_MR_EXTEND 16
> +#define TDH_MR_FINALIZE 17
> +#define TDH_VP_FLUSH 18
> +#define TDH_MNG_VPFLUSHDONE 19
> +#define TDH_MNG_KEY_FREEID 20
> +#define TDH_MNG_INIT 21
> +#define TDH_VP_INIT 22
> +#define TDH_MEM_SEPT_RD 25
> +#define TDH_VP_RD 26
> +#define TDH_MNG_KEY_RECLAIMID 27
> +#define TDH_PHYMEM_PAGE_RECLAIM 28
> +#define TDH_MEM_PAGE_REMOVE 29
> +#define TDH_MEM_SEPT_REMOVE 30
> +#define TDH_SYS_RD 34
> +#define TDH_MEM_TRACK 38
> +#define TDH_MEM_RANGE_UNBLOCK 39
> +#define TDH_PHYMEM_CACHE_WB 40
> +#define TDH_PHYMEM_PAGE_WBINVD 41
> +#define TDH_VP_WR 43
> +#define TDH_SYS_LP_SHUTDOWN 44

And LP_SHUTDOWN is certainly not needed.

Could you check whether there are others that are not needed?

Perhaps we should just include macros that got used, but anyway.

[...]

> +
> +/*
> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> + */

Why is this comment applied to TDX_MAX_VCPUS?

> +#define TDX_MAX_VCPUS (~(u16)0)

And is (~(u16)0) an architectural value defined by the TDX spec, or just a SW
value that you put here for convenience?

I mean, is it possible that different versions of the TDX module have
different implementations of MAX_VCPUS, e.g., module 1.0 only supports X
but module 1.5 increases it to Y where Y > X?

Anyway, it looks like you can safely move this to the patch that enables CAP_MAX_CPU?

> +
> +struct td_params {
> + u64 attributes;
> + u64 xfam;
> + u16 max_vcpus;
> + u8 reserved0[6];
> +
> + u64 eptp_controls;
> + u64 exec_controls;
> + u16 tsc_frequency;
> + u8 reserved1[38];
> +
> + u64 mrconfigid[6];
> + u64 mrowner[6];
> + u64 mrownerconfig[6];
> + u64 reserved2[4];
> +
> + union {
> + DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
> + u8 reserved3[768];

I am not sure you need the 'reserved3[768]', unless you need to make
sizeof(struct td_params) return 1024?

> + };
> +} __packed __aligned(1024); > +

[...]

> +
> +#define TDX_MD_ELEMENT_SIZE_8BITS 0
> +#define TDX_MD_ELEMENT_SIZE_16BITS 1
> +#define TDX_MD_ELEMENT_SIZE_32BITS 2
> +#define TDX_MD_ELEMENT_SIZE_64BITS 3
> +
> +union tdx_md_field_id {
> + struct {
> + u64 field : 24;
> + u64 reserved0 : 8;
> + u64 element_size_code : 2;
> + u64 last_element_in_field : 4;
> + u64 reserved1 : 3;
> + u64 inc_size : 1;
> + u64 write_mask_valid : 1;
> + u64 context : 3;
> + u64 reserved2 : 1;
> + u64 class : 6;
> + u64 reserved3 : 1;
> + u64 non_arch : 1;
> + };
> + u64 raw;
> +};

Could you clarify why we need such a detailed definition? For the metadata
element size you can use a simple '&' and '>>' to get the result.

> +
> +#define TDX_MD_ELEMENT_SIZE_CODE(_field_id) \
> + ({ union tdx_md_field_id _fid = { .raw = (_field_id)}; \
> + _fid.element_size_code; })
> +
> +#endif /* __KVM_X86_TDX_ARCH_H */

2024-03-21 22:00:08

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 057/130] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role

On Thu, Mar 21, 2024 at 12:18:47AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Because TDX support introduces private mapping, add a new member in
> > union
> > kvm_mmu_page_role with access functions to check the member.
>
> I guess we should have a role bit for private like in this patch, but
> just barely. AFAICT we have a gfn and struct kvm in every place where
> it is checked (assuming my proposal in patch 56 holds water). So we
> could have
> bool is_private = !(gfn & kvm_gfn_shared_mask(kvm));

Yes, we can use such a combination, or !!sp->private_spt or something.
Originally we didn't use role.is_private and passed around a private parameter.


> But there are extra bits available in the role, so we can skip the
> extra step. Can you think of any more reasons? I want to try to write a
> log for this one. It's very short.

There are several places that compare the role against shared<->private, for example
kvm_tdp_mmu_alloc_root(). role.is_private simplifies such comparisons.
--
Isaku Yamahata <[email protected]>

2024-03-21 22:12:33

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 035/130] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl



On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> From: Isaku Yamahata <[email protected]>
>
> KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for
> guest state-protected VM. It defined subcommands for technology-specific
> operations under KVM_MEMORY_ENCRYPT_OP. Despite its name, the subcommands
> are not limited to memory encryption, but various technology-specific
> operations are defined. It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
> for TDX specific operations and define subcommands.
>
> TDX requires VM-scoped TDX-specific operations for device model, for
> example, qemu. Getting system-wide parameters, TDX-specific VM
> initialization.

-EPARSE for the second sentence (or it is not a valid sentence at all).

>
> Add a place holder function for TDX specific VM-scoped ioctl as mem_enc_op.
> TDX specific sub-commands will be added to retrieve/pass TDX specific
> parameters. Make mem_enc_ioctl non-optional as it's always filled.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v15:
> - change struct kvm_tdx_cmd to drop unused member.
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 +-
> arch/x86/include/uapi/asm/kvm.h | 26 ++++++++++++++++++++++++++
> arch/x86/kvm/vmx/main.c | 10 ++++++++++
> arch/x86/kvm/vmx/tdx.c | 26 ++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 4 ++++
> arch/x86/kvm/x86.c | 4 ----
> 6 files changed, 67 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8be71a5c5c87..00b371d9a1ca 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -123,7 +123,7 @@ KVM_X86_OP(enter_smm)
> KVM_X86_OP(leave_smm)
> KVM_X86_OP(enable_smi_window)
> #endif
> -KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> +KVM_X86_OP(mem_enc_ioctl)
> KVM_X86_OP_OPTIONAL(mem_enc_register_region)
> KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
> KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 45b2c2304491..9ea46d143bef 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -567,6 +567,32 @@ struct kvm_pmu_event_filter {
> #define KVM_X86_TDX_VM 2
> #define KVM_X86_SNP_VM 3
>
> +/* Trust Domain eXtension sub-ioctl() commands. */
> +enum kvm_tdx_cmd_id {
> + KVM_TDX_CAPABILITIES = 0,
> +
> + KVM_TDX_CMD_NR_MAX,
> +};
> +
> +struct kvm_tdx_cmd {
> + /* enum kvm_tdx_cmd_id */
> + __u32 id;
> + /* flags for sub-commend. If sub-command doesn't use this, set zero. */
> + __u32 flags;
> + /*
> + * data for each sub-command. An immediate or a pointer to the actual
> + * data in process virtual address. If sub-command doesn't use it,
> + * set zero.
> + */
> + __u64 data;
> + /*
> + * Auxiliary error code. The sub-command may return TDX SEAMCALL
> + * status code in addition to -Exxx.
> + * Defined for consistency with struct kvm_sev_cmd.
> + */
> + __u64 error;

If the 'error' is for SEAMCALL error, should we rename it to 'hw_error'
or 'fw_error' or something similar? I think 'error' is too generic.

> +};
> +
> #define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1)
>
> struct kvm_tdx_cpuid_config {
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index a948a6959ac7..082e82ce6580 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -47,6 +47,14 @@ static int vt_vm_init(struct kvm *kvm)
> return vmx_vm_init(kvm);
> }
>
> +static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> +{
> + if (!is_td(kvm))
> + return -ENOTTY;
> +
> + return tdx_vm_ioctl(kvm, argp);
> +}
> +
> #define VMX_REQUIRED_APICV_INHIBITS \
> (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -200,6 +208,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
>
> .get_untagged_addr = vmx_get_untagged_addr,
> +
> + .mem_enc_ioctl = vt_mem_enc_ioctl,
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 5edfb99abb89..07a3f0f75f87 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -55,6 +55,32 @@ struct tdx_info {
> /* Info about the TDX module. */
> static struct tdx_info *tdx_info;
>
> +int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> +{
> + struct kvm_tdx_cmd tdx_cmd;
> + int r;
> +
> + if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
> + return -EFAULT;

Add an empty line.

> + if (tdx_cmd.error)
> + return -EINVAL;

Add a comment?

/*
* Userspace should never set @error, which is used to fill
* hardware-defined error by the kernel.
*/

> +
> + mutex_lock(&kvm->lock);
> +
> + switch (tdx_cmd.id) {
> + default:
> + r = -EINVAL;

I am not sure whether you should return -ENOTTY to be consistent with
the previous vt_mem_enc_ioctl(), where a TDX-specific IOCTL is issued for a
non-TDX guest.

Here I think the invalid @id means the sub-command isn't valid.

> + goto out;
> + }
> +
> + if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
> + r = -EFAULT;
> +
> +out:
> + mutex_unlock(&kvm->lock);
> + return r;
> +}
> +

2024-03-21 22:27:15

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 036/130] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters



On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> From: Sean Christopherson <[email protected]>
>
> Implement an ioctl to get system-wide parameters for TDX. Although the
> function is systemwide, vm scoped mem_enc ioctl works for userspace VMM
> like qemu and device scoped version is not define, re-use vm scoped
> mem_enc.

-EPARSE for the part starting from "and device scoped ...".

Grammar check please.

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v18:
> - drop the use of tdhsysinfo_struct and TDH.SYS.INFO, use TDH.SYS.RD().
> For that, dynamically allocate/free tdx_info.
> - drop the change of tools/arch/x86/include/uapi/asm/kvm.h.
>
> v14 -> v15:
> - ABI change: added supported_gpaw and reserved area.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm.h | 17 ++++++++++
> arch/x86/kvm/vmx/tdx.c | 56 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 3 ++
> 3 files changed, 76 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 9ea46d143bef..e28189c81691 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -604,4 +604,21 @@ struct kvm_tdx_cpuid_config {
> __u32 edx;
> };
>
> +/* supported_gpaw */
> +#define TDX_CAP_GPAW_48 (1 << 0)
> +#define TDX_CAP_GPAW_52 (1 << 1)
> +
> +struct kvm_tdx_capabilities {
> + __u64 attrs_fixed0;
> + __u64 attrs_fixed1;
> + __u64 xfam_fixed0;
> + __u64 xfam_fixed1;
> + __u32 supported_gpaw;
> + __u32 padding;
> + __u64 reserved[251];
> +
> + __u32 nr_cpuid_configs;
> + struct kvm_tdx_cpuid_config cpuid_configs[];
> +};
> +

I think you should use __DECLARE_FLEX_ARRAY().

It's already used in existing KVM UAPI header:

struct kvm_nested_state {
...
union {
__DECLARE_FLEX_ARRAY(struct kvm_vmx_nested_state_data,
vmx);
__DECLARE_FLEX_ARRAY(struct kvm_svm_nested_state_data,
svm);
} data;
}

> #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 07a3f0f75f87..816ccdb4bc41 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,6 +6,7 @@
> #include "capabilities.h"
> #include "x86_ops.h"
> #include "x86.h"
> +#include "mmu.h"
> #include "tdx_arch.h"
> #include "tdx.h"
>
> @@ -55,6 +56,58 @@ struct tdx_info {
> /* Info about the TDX module. */
> static struct tdx_info *tdx_info;
>
> +static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
> +{
> + struct kvm_tdx_capabilities __user *user_caps;
> + struct kvm_tdx_capabilities *caps = NULL;
> + int ret = 0;
> +
> + if (cmd->flags)
> + return -EINVAL;

Add a comment?

/* flags is reserved for future use */

> +
> + caps = kmalloc(sizeof(*caps), GFP_KERNEL);
> + if (!caps)
> + return -ENOMEM;
> +
> + user_caps = (void __user *)cmd->data;
> + if (copy_from_user(caps, user_caps, sizeof(*caps))) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + if (caps->nr_cpuid_configs < tdx_info->num_cpuid_config) {
> + ret = -E2BIG;
> + goto out;
> + }
> +
> + *caps = (struct kvm_tdx_capabilities) {
> + .attrs_fixed0 = tdx_info->attributes_fixed0,
> + .attrs_fixed1 = tdx_info->attributes_fixed1,
> + .xfam_fixed0 = tdx_info->xfam_fixed0,
> + .xfam_fixed1 = tdx_info->xfam_fixed1,
> + .supported_gpaw = TDX_CAP_GPAW_48 |
> + ((kvm_get_shadow_phys_bits() >= 52 &&
> + cpu_has_vmx_ept_5levels()) ? TDX_CAP_GPAW_52 : 0),
> + .nr_cpuid_configs = tdx_info->num_cpuid_config,
> + .padding = 0,
> + };
> +
> + if (copy_to_user(user_caps, caps, sizeof(*caps))) {
> + ret = -EFAULT;
> + goto out;
> + }

Add an empty line.

> + if (copy_to_user(user_caps->cpuid_configs, &tdx_info->cpuid_configs,
> + tdx_info->num_cpuid_config *
> + sizeof(tdx_info->cpuid_configs[0]))) {
> + ret = -EFAULT;
> + }

I think the '{ }' is needed here.

> +
> +out:
> + /* kfree() accepts NULL. */
> + kfree(caps);
> + return ret;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -68,6 +121,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> mutex_lock(&kvm->lock);
>
> switch (tdx_cmd.id) {
> + case KVM_TDX_CAPABILITIES:
> + r = tdx_get_capabilities(&tdx_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 473013265bd8..22c0b57f69ca 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -3,6 +3,9 @@
> #define __KVM_X86_TDX_H
>
> #ifdef CONFIG_INTEL_TDX_HOST
> +
> +#include "tdx_ops.h"
> +

It appears "tdx_ops.h" is used for making SEAMCALLs.

I don't see this patch uses any SEAMCALL so I am wondering whether this
chunk is needed here?

> struct kvm_tdx {
> struct kvm kvm;
> /* TDX specific members follow. */

2024-03-21 22:39:51

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Wed, Mar 20, 2024 at 12:56:38AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Tue, 2024-03-19 at 16:56 -0700, Isaku Yamahata wrote:
> > When we zap a page from the guest, and add it again on TDX even with
> > the same
> > GPA, the page is zeroed.  We'd like to keep memory contents for those
> > cases.
> >
> > Ok, let me add those whys and drop migration part. Here is the
> > updated one.
> >
> > TDX supports only write-back(WB) memory type for private memory
> > architecturally so that (virtualized) memory type change doesn't make
> > sense for private memory.  When we remove the private page from the
> > guest
> > and re-add it with the same GPA, the page is zeroed.
> >
> > Regarding memory type change (mtrr virtualization and lapic page
> > mapping change), the current implementation zaps pages, and populate
> s^
> > the page with new memory type on the next KVM page fault.  
> ^s
>
> > It doesn't work for TDX to have zeroed pages.
> What does this mean? Above you mention how all the pages are zeroed. Do
> you mean it doesn't work for TDX to zero a running guest's pages? Which
> would happen for the operations that would expect the pages could get
> faulted in again just fine.

(The non-TDX part of) KVM assumes that page contents are preserved across zapping and
re-populating. This isn't true for TDX. The guest would suddenly see zero pages
instead of the old memory contents and would be upset.


> > Because TDX supports only WB, we
> > ignore the request for MTRR and lapic page change to not zap private
> > pages on unmapping for those two cases
>
> Hmm. I need to go back and look at this again. It's not clear from the
> description why it is safe for the host to not zap pages if requested
> to. I see why the guest wouldn't want them to be zapped.

KVM silently ignores the request to change memory types.


> > TDX Secure-EPT requires removing the guest pages first and leaf
> > Secure-EPT pages in order. It doesn't allow zap a Secure-EPT entry
> > that has child pages.  It doesn't work with the current TDP MMU
> > zapping logic that zaps the root page table without touching child
> > pages.  Instead, zap only leaf SPTEs for KVM mmu that has a shared
> > bit
> > mask.
>
> Could this be better as two patches that each address a separate thing?
> 1. Leaf only zapping
> 2. Don't zap for MTRR, etc.

Makes sense. Let's split it.


> > > There seems to be an attempt to abstract away the existence of
> > > Secure-
> > > EPT in mmu.c, that is not fully successful. In this case the code
> > > checks kvm_gfn_shared_mask() to see if it needs to handle the
> > > zapping
> > > in a way specific needed by S-EPT. It ends up being a little
> > > confusing
> > > because the actual check is about whether there is a shared bit. It
> > > only works because only S-EPT is the only thing that has a
> > > kvm_gfn_shared_mask().
> > >
> > > Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks
> > > wrong,
> > > but is more honest about what we are getting up to here. I'm not
> > > sure
> > > though, what do you think?
> >
> > Right, I attempted and failed in zapping case.  This is due to the
> > restriction
> > that the Secure-EPT pages must be removed from the leaves.  the VMX
> > case (also
> > NPT, even SNP) heavily depends on zapping root entry as optimization.
> >
> > I can think of
> > - add TDX check. Looks wrong
> > - Use kvm_gfn_shared_mask(kvm). confusing
> > - Give other name for this check like zap_from_leafs (or better
> > name?)
> >   The implementation is same to kvm_gfn_shared_mask() with comment.
> >   - Or we can add a boolean variable to struct kvm
>
> Hmm, maybe wrap it in a function like:
> static inline bool kvm_can_only_zap_leafs(const struct kvm *kvm)
> {
> /* A comment explaining what is going on */
> return kvm->arch.vm_type == KVM_X86_TDX_VM;
> }
>
> But KVM seems to be a bit more on the open coded side when it comes to
> things like this, so not sure what maintainers would prefer. My opinion
> is the kvm_gfn_shared_mask() check is too strange and it's worth a new
> helper. If that is bad, then just open coded kvm->arch.vm_type ==
> KVM_X86_TDX_VM is the second best I think.
>
> I feel both strongly that it should be changed, and unsure what
> maintainers would prefer. Hopefully one will chime in.

Now that the compile-time config is dropped, open-coding is an option.
--
Isaku Yamahata <[email protected]>

2024-03-21 22:59:22

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, Mar 21, 2024 at 01:17:35AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Tue, 2024-03-19 at 17:56 -0700, Rick Edgecombe wrote:
> > > Because TDX supports only WB, we
> > > ignore the request for MTRR and lapic page change to not zap
> > > private
> > > pages on unmapping for those two cases
> >
> > Hmm. I need to go back and look at this again. It's not clear from
> > the
> > description why it is safe for the host to not zap pages if requested
> > to. I see why the guest wouldn't want them to be zapped.
>
> Ok, I see now how this works. MTRRs and APIC zapping happen to use the
> same function: kvm_zap_gfn_range(). So restricting that function from
> zapping private pages has the desired effect. I think it's not ideal
> that kvm_zap_gfn_range() silently skips zapping some ranges. I wonder
> if we could pass something in, so it's more clear to the caller.
>
> But can these code paths even get reached in TDX? It sounded like MTRRs
> basically weren't supported.

We can make those code paths unreachable with the (new) assumption that guest MTRR can
be disabled cleanly.
--
Isaku Yamahata <[email protected]>

2024-03-21 23:12:56

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Mon, 2024-02-26 at 00:27 -0800, [email protected] wrote:
> Implement a hook of KVM_SET_CPUID2 for additional consistency check.
>
> Intel TDX or AMD SEV has a restriction on the value of cpuid.  For
> example,
> some values must be the same between all vcpus.  Check if the new
> values
> are consistent with the old values.  The check is light because the
> cpuid
> consistency is very model specific and complicated.  The user space
> VMM
> should set cpuid and MSRs consistently.

I see that this was suggested by Sean, but can you explain the problem
that this is working around? From the linked thread, it seems like the
problem is what to do when userspace also calls SET_CPUID after already
configuring CPUID to the TDX module in the special way. The choices
discussed included:
1. Reject the call
2. Check the consistency between the first CPUID configuration and the
second one.

1 is a lot simpler, but the reasoning for 2 is because "some KVM code
paths rely on guest CPUID configuration" it seems. Is this a
hypothetical or real issue? Which code paths are problematic for
TDX/SNP?

Just trying to assess what we should do with these two patches.

2024-03-21 23:37:22

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific



On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX has its own limitation on the maximum number of vcpus that the guest
> can accommodate.

"limitation" -> "control".

"the guest" -> "a guest".

Allow x86 kvm backend to implement its own KVM_ENABLE_CAP
> handler and implement TDX backend for KVM_CAP_MAX_VCPUS.

I am not sure we normally say "x86 KVM backend". Just say "Allow KVM
x86 ...".

user space VMM,
> e.g. qemu, can specify its value instead of KVM_MAX_VCPUS.

Grammar check.

>
> When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be
> specified as struct td_params_struct.

'struct td_params_struct'??

Anyway, I don't think you need to mention such details.

and the value is a part of
> measurement. The user space has to specify the value somehow.

"and" -> "And" (grammar check please).

And add an empty line to start below as a new paragraph.

There are
> two options for it.
> option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch)
> option 2. Add max_vcpu as a parameter to initialize the guest.
> (TDG.MNG.INIT)

First of all, it seems to me that the two are not conflicting.

Based on the uapi/kvm.h:

#define KVM_CAP_MAX_VCPUS 66 /* returns max vcpus per vm */

Currently KVM x86 doesn't allow to configure MAX_VCPU on VM-basis, but
always reports KVM_MAX_VCPUS for _ALL_ VMs. I.e., it doesn't allow
userspace to explicitly enable KVM_CAP_MAX_VCPUS for a given VM.

Now, if we allow userspace to configure the MAX_VCPUS for a TDX guest
(this could be a separate discussion in fact) due to attestation or
whatever, we need to support allowing userspace to configure MAX_VCPUS
on a VM-basis.

Therefore, option 1 isn't really an option to me, but is the thing that
we _SHOULD_ do to support TDX.

So this patch should really just add "per-VM max vcpus" support for TDX,
starting from:

struct kvm_tdx { /* or 'struct kvm_arch' ?? */
...
int max_vcpus;
}

And in TDH.MNG.INIT, we need to manually check the MAX_VCPU specified in
TD_PARAMS structure to make sure it matches to the record that we
specified via KVM_CAP_MAX_VCPUS.

So how about:

"
TDX has its own mechanism to control the maximum number of VCPUs that
the TDX guest can use. When creating a TDX guest, the maximum number of
vcpus needs to be passed to the TDX module as part of the measurement of
the guest.

Because the value is part of the measurement, thus part of attestation,
it is better to allow userspace to configure it. E.g. the
users may want to precisely control the maximum number of vcpus their
precious VMs can use.

The actual control itself must be done via the TDH.MNG.INIT SEAMCALL
itself, where the maximum number of vcpus is an input to the TDX module,
but KVM needs to support the "per-VM number of maximum vcpus" and
reflect that in the KVM_CAP_MAX_VCPUS.

Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but
doesn't allow to enable KVM_CAP_MAX_VCPUS to configure the number of
maximum vcpus on VM-basis.

Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.

The userspace-configured value then can be verified when KVM is actually
creating the TDX guest.
"

2024-03-21 23:52:25

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

On Thu, Mar 21, 2024 at 12:09:57PM +1300,
"Huang, Kai" <[email protected]> wrote:

> > Does it make sense?
> >
> > void pr_tdx_error(u64 op, u64 error_code)
> > {
> > pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
> > op, error_code);
> > }
>
> Should we also have a _ret version?
>
> void pr_seamcall_err(u64 op, u64 err)
> {
> /* A comment to explain why using the _ratelimited() version? */

Because KVM can hit successive SEAMCALL errors, e.g. while destructing a TD
(sometimes unintentionally), the ratelimited version is preferred as a safeguard.
For example, SEAMCALL on all or some LPs (TDH_MNG_KEY_FREEID) can fail at the
same time. And the number of LPs can be hundreds.


> pr_err_ratelimited(...);
> }
>
> void pr_seamcall_err_ret(u64 op, u64 err, struct tdx_module_args *arg)
> {
> pr_err_seamcall(op, err);
>
> pr_err_ratelimited(...);
> }
>
> (Hmm... if you look at the tdx.c in TDX host, there's similar code there,
> and again, it was a little bit annoying when I did that..)
>
> Again, if we just use seamcall_ret() for ALL SEAMCALLs except VP.ENTER, we
> can simply have one..

What about this?

void pr_seamcall_err_ret(u64 op, u64 err, struct tdx_module_args *arg)
{
pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
op, err);
if (arg)
pr_err_ratelimited(...);
}



> > void pr_tdx_sept_error(u64 op, u64 error_code, const union tdx_sept_entry *entry,
> > const union tdx_sept_level_state *level_state)
> > {
> > #define MSG \
> > "SEAMCALL (0x%016llx) failed: 0x%016llx entry 0x%016llx level_state 0x%016llx\n"
> > pr_err_ratelimited(MSG, op, error_code, entry->raw, level_state->raw);
> > }
>
> A higher-level wrapper to print SEPT error is fine to me, but do it in a
> separate patch.

Ok, let's postpone the custom version.
--
Isaku Yamahata <[email protected]>

2024-03-22 00:17:08

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Thu, Mar 21, 2024 at 11:37:58AM +1300,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 21/03/2024 10:36 am, Isaku Yamahata wrote:
> > On Wed, Mar 20, 2024 at 01:03:21PM +1300,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > > +static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
> > > > + struct tdx_module_args *out)
> > > > +{
> > > > + u64 ret;
> > > > +
> > > > + if (out) {
> > > > + *out = *in;
> > > > + ret = seamcall_ret(op, out);
> > > > + } else
> > > > + ret = seamcall(op, in);
> > >
> > > I think it's silly to have the @out argument in this way.
> > >
> > > What is the main reason to still have it?
> > >
> > > Yeah we used to have the @out in __seamcall() assembly function. The
> > > assembly code checks the @out and skips copying registers to @out when it is
> > > NULL.
> > >
> > > But it got removed when we tried to unify the assembly for TDCALL/TDVMCALL
> > > and SEAMCALL to have a *SINGLE* assembly macro.
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > To me that means we should just accept the fact we will always have a valid
> > > @out.
> > >
> > > But there might be some case that you _obviously_ need the @out and I
> > > missed?
> >
> > As I replied at [1], those four wrappers need to return values.
> > The first three on error, the last one on success.
> >
> > [1] https://lore.kernel.org/kvm/[email protected]/
> >
> > tdh_mem_sept_add(kvm_tdx, gpa, tdx_level, hpa, &entry, &level_state);
> > tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
> > tdh_mem_page_remove(kvm_tdx, gpa, tdx_level, &entry, &level_state);
> > u64 tdh_vp_rd(struct vcpu_tdx *tdx, u64 field, u64 *value)
> >
> > We can delete out from other wrappers.
>
> Ah, OK. I got you don't want to invent separate wrappers for each
> seamcall() variants like:
>
> - tdx_seamcall(u64 fn, struct tdx_module_args *args);
> - tdx_seamcall_ret(u64 fn, struct tdx_module_args *args);
> - tdx_seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
>
> To be honest I found they were kinda annoying myself during the "unify
> TDCALL/SEAMCALL and TDVMCALL assembly" patchset.
>
> But life is hard...
>
> And given (it seems) we are going to remove kvm_spurious_fault(), I think
> the tdx_seamcall() variants are just very simple wrapper of plain seamcall()
> variants.
>
> So how about we have some macros:
>
> static inline bool is_seamcall_err_kernel_defined(u64 err)
> {
> return err & TDX_SW_ERROR;
> }
>
> #define TDX_KVM_SEAMCALL(_kvm, _seamcall_func, _fn, _args) \
> ({ \
> u64 _ret = _seamcall_func(_fn, _args);
> KVM_BUG_ON(_kvm, is_seamcall_err_kernel_defined(_ret));
> _ret;
> })

As we can move KVM_BUG_ON() out to the call site, we can simply have
seamcall() or seamcall_ret().
The call site has to check the error, whether it is TDX_SW_ERROR or not.
And if it hits an unexpected error, it will mark the guest as bugged.


> #define tdx_kvm_seamcall(_kvm, _fn, _args) \
> TDX_KVM_SEAMCALL(_kvm, seamcall, _fn, _args)
>
> #define tdx_kvm_seamcall_ret(_kvm, _fn, _args) \
> TDX_KVM_SEAMCALL(_kvm, seamcall_ret, _fn, _args)
>
> #define tdx_kvm_seamcall_saved_ret(_kvm, _fn, _args) \
> TDX_KVM_SEAMCALL(_kvm, seamcall_saved_ret, _fn, _args)
>
> This is consistent with what we have in TDX host code, and this handles
> NO_ENTROPY error internally.
>
> Or, maybe we can just use the seamcall_ret() for ALL SEAMCALLs, except using
> seamcall_saved_ret() for TDH.VP.ENTER.
>
> u64 tdx_kvm_seamcall(sruct kvm*kvm, u64 fn,
> struct tdx_module_args *args)
> {
> u64 ret = seamcall_ret(fn, args);
>
> KVM_BUG_ON(kvm, is_seamcall_err_kernel_defined(ret);
>
> return ret;
> }
>
> IIUC this at least should give us a single tdx_kvm_seamcall() API for
> majority (99%) code sites?

We can eliminate tdx_kvm_seamcall() and use seamcall() or seamcall_ret()
directly.


> And obviously I'd like other people to weigh in too.
>
> > Because only TDH.MNG.CREATE() and TDH.MNG.ADDCX() can return TDX_RND_NO_ENTROPY,
> > we can use __seamcall(). The TDX spec doesn't guarantee such error code
> > convention. It's very unlikely, though.
>
> I don't quite follow the "convention" part. Can you elaborate?
>
> NO_ENTROPY is already handled in seamcall() variants. Can we just use them
> directly?

I intended to avoid bad code generation. If the loop on the NO_ENTROPY error harms the
code generation, we might be able to use __seamcall() or __seamcall_ret()
instead of seamcall() or seamcall_ret().
--
Isaku Yamahata <[email protected]>

2024-03-22 00:40:34

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, 2024-03-21 at 15:59 -0700, Isaku Yamahata wrote:
> >
> > Ok, I see now how this works. MTRRs and APIC zapping happen to use
> > the
> > same function: kvm_zap_gfn_range(). So restricting that function
> > from
> > zapping private pages has the desired effect. I think it's not
> > ideal
> > that kvm_zap_gfn_range() silently skips zapping some ranges. I
> > wonder
> > if we could pass something in, so it's more clear to the caller.
> >
> > But can these code paths even get reached in TDX? It sounded like
> > MTRRs
> > basically weren't supported.
>
> We can make the code paths so with the (new) assumption that guest
> MTRR can
> be disabled cleanly.

So the situation is (please correct):
KVM has a no "making up architectural behavior" rule, which is an
important one. But TDX module doesn't support MTRRs. So TD guests can't
have architectural behavior for MTRRs. So this patch is trying as best
as possible to match what MTRR behavior it can (not crash the guest if
someone tries).

First of all, if the guest unmaps the private memory, doesn't it have
to accept it again when it gets re-added? So won't the guest crash
anyway?

But, I guess we should punt to userspace if the guest tries to use
MTRRs, not that userspace can handle it happening in a TD... But it
seems cleaner and safer than skipping the zapping of some pages inside the
zapping code.

I'm still not sure if I understand the intention and constraints fully.
So please correct. This (skipping the zapping for some operations)
is a theoretical correctness issue, right? It doesn't resolve a TD
crash?

2024-03-22 01:06:58

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> As the first step to create TDX guest, create/destroy VM struct. Assign
> TDX private Host Key ID (HKID) to the TDX guest for memory encryption and

Here we are at patch 38, and I don't think you need to fully spell out
"Host KeyID (HKID)" anymore.

Just say: Assign one TDX private KeyID ...

> allocate extra pages for the TDX guest. On destruction, free allocated
> pages, and HKID.

Could we put more information here?

For instance, here should be a wonderful place to explain what are TDR,
TDCS, etc??

And briefly describe the sequence of creating the TD?

Roughly checking the code, you have implemented many things, including
the MNG.KEY.CONFIG stuff. It's worth adding some text here to give reviewers
a rough idea of what's going on here.

>
> Before tearing down private page tables, TDX requires some resources of the
> guest TD to be destroyed (i.e. HKID must have been reclaimed, etc). Add
> mmu notifier release callback before tearing down private page tables for
> it.
>
> Add vm_free() of kvm_x86_ops hook at the end of kvm_arch_destroy_vm()
> because some per-VM TDX resources, e.g. TDR, need to be freed after other
> TDX resources, e.g. HKID, were freed.

I think we should split the "adding callbacks" part out, given you have ...

9 files changed, 520 insertions(+), 8 deletions(-)

.. in this patch.

IMHO, a >500 LOC change normally means there are too many things in this
patch, making it hard to review, so we should split it.

I think perhaps we can split this big patch to smaller pieces based on
the steps, like we did for the init_tdx_module() function in the TDX
host patchset??

(But I would like to hear from others too.)

>
> Co-developed-by: Kai Huang <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Sean's tag doesn't make sense anymore.

> Signed-off-by: Isaku Yamahata <[email protected]>
>
[...]

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c45252ed2ffd..2becc86c71b2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6866,6 +6866,13 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> + /*
> + * kvm_mmu_zap_all() zaps both private and shared page tables. Before
> + * tearing down private page tables, TDX requires some TD resources to
> + * be destroyed (i.e. keyID must have been reclaimed, etc). Invoke
> + * kvm_x86_flush_shadow_all_private() for this.
> + */
> + static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> kvm_mmu_zap_all(kvm);
> }
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index e8a1a7533eea..437c6d5e802e 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -62,11 +62,31 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> static int vt_vm_init(struct kvm *kvm)
> {
> if (is_td(kvm))
> - return -EOPNOTSUPP; /* Not ready to create guest TD yet. */
> + return tdx_vm_init(kvm);
>
> return vmx_vm_init(kvm);
> }
>
> +static void vt_flush_shadow_all_private(struct kvm *kvm)
> +{
> + if (is_td(kvm))
> + tdx_mmu_release_hkid(kvm);

Add a comment to explain please.

> +}
> +
> +static void vt_vm_destroy(struct kvm *kvm)
> +{
> + if (is_td(kvm))
> + return;

Please add a comment to explain why we don't do anything here, but have
to delay to vt_vm_free().

> +
> + vmx_vm_destroy(kvm);
> +}
> +
> +static void vt_vm_free(struct kvm *kvm)
> +{
> + if (is_td(kvm))
> + tdx_vm_free(kvm);
> +}
> +
> static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> if (!is_td(kvm))
> @@ -101,7 +121,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vm_size = sizeof(struct kvm_vmx),
> .vm_enable_cap = vt_vm_enable_cap,
> .vm_init = vt_vm_init,
> - .vm_destroy = vmx_vm_destroy,
> + .flush_shadow_all_private = vt_flush_shadow_all_private,
> + .vm_destroy = vt_vm_destroy,
> + .vm_free = vt_vm_free,
>
> .vcpu_precreate = vmx_vcpu_precreate,
> .vcpu_create = vmx_vcpu_create,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ee015f3ce2c9..1cf2b15da257 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,10 +5,11 @@
>
> #include "capabilities.h"
> #include "x86_ops.h"
> -#include "x86.h"
> #include "mmu.h"
> #include "tdx_arch.h"
> #include "tdx.h"
> +#include "tdx_ops.h"
> +#include "x86.h"
>
> #undef pr_fmt
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> @@ -22,7 +23,7 @@
> /* TDX KeyID pool */
> static DEFINE_IDA(tdx_guest_keyid_pool);
>
> -static int __used tdx_guest_keyid_alloc(void)
> +static int tdx_guest_keyid_alloc(void)
> {
> if (WARN_ON_ONCE(!tdx_guest_keyid_start || !tdx_nr_guest_keyids))
> return -EINVAL;
> @@ -32,7 +33,7 @@ static int __used tdx_guest_keyid_alloc(void)
> GFP_KERNEL);
> }
>
> -static void __used tdx_guest_keyid_free(int keyid)
> +static void tdx_guest_keyid_free(int keyid)
> {
> if (WARN_ON_ONCE(keyid < tdx_guest_keyid_start ||
> keyid > tdx_guest_keyid_start + tdx_nr_guest_keyids - 1))
> @@ -48,6 +49,8 @@ struct tdx_info {
> u64 xfam_fixed0;
> u64 xfam_fixed1;
>
> + u8 nr_tdcs_pages;
> +
> u16 num_cpuid_config;
> /* This must the last member. */
> DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
> @@ -85,6 +88,282 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> return r;
> }
>
> +/*
> + * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
> + * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) tries to acquire a global lock
> + * internally in TDX module. If failed, TDX_OPERAND_BUSY is returned without
> + * spinning or waiting due to a constraint on execution time. It's caller's
> + * responsibility to avoid race (or retry on TDX_OPERAND_BUSY). Use this mutex
> + * to avoid race in TDX module because the kernel knows better about scheduling.
> + */

/*
 * Some SEAMCALLs acquire the TDX module globally. They fail with
 * TDX_OPERAND_BUSY if they fail to acquire it. Use a global mutex
 * to serialize these SEAMCALLs.
 */
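The serialization pattern under discussion can be sketched in user space. This is only an illustration: pthread mutexes stand in for the kernel mutex, and the status codes are made-up values, not the real TDX ones. The inner try-lock models the TDX module failing fast with TDX_OPERAND_BUSY instead of spinning; the outer mutex is the host-side serialization that prevents callers from ever seeing BUSY.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical status codes standing in for TDX_SUCCESS / TDX_OPERAND_BUSY. */
#define FAKE_SUCCESS      0ULL
#define FAKE_OPERAND_BUSY 0x8000020000000000ULL

/* The "module-internal" lock: try-lock only, mirroring how the TDX module
 * returns BUSY instead of spinning due to its execution time constraint. */
static pthread_mutex_t module_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long long fake_global_seamcall(void)
{
    if (pthread_mutex_trylock(&module_lock) != 0)
        return FAKE_OPERAND_BUSY;   /* contended: fail fast, no spinning */
    /* ... work that needs the module-global resource ... */
    pthread_mutex_unlock(&module_lock);
    return FAKE_SUCCESS;
}

/* Host-side serialization: a global mutex taken around the call, so
 * concurrent callers queue here instead of racing inside the module. */
static pthread_mutex_t tdx_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long long serialized_global_seamcall(void)
{
    unsigned long long err;

    pthread_mutex_lock(&tdx_lock);
    err = fake_global_seamcall();
    pthread_mutex_unlock(&tdx_lock);
    return err;
}
```

The point of the host-side mutex is that the kernel scheduler can block waiting callers cheaply, whereas the module can only report contention back.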
> +static DEFINE_MUTEX(tdx_lock);
> +static struct mutex *tdx_mng_key_config_lock;
> +
> +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> +{
> + return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> +}
> +
> +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> +{
> + return kvm_tdx->tdr_pa;
> +}
> +
> +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> +{
> + tdx_guest_keyid_free(kvm_tdx->hkid);
> + kvm_tdx->hkid = -1;
> +}
> +
> +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> +{
> + return kvm_tdx->hkid > 0;
> +}

Hmm...

"Allocating a TDX private KeyID" seems to be one step of creating the
TDX guest. Perhaps we should split this patch based on the steps of
creating TD.

> +
> +static void tdx_clear_page(unsigned long page_pa)
> +{
> + const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> + void *page = __va(page_pa);
> + unsigned long i;
> +
> + /*
> + * When re-assign one page from old keyid to a new keyid, MOVDIR64B is
> + * required to clear/write the page with new keyid to prevent integrity
> + * error when read on the page with new keyid.
> + *
> + * clflush doesn't flush cache with HKID set.

I don't understand this "clflush".

(Firstly, I think it's better to use TDX private KeyID or TDX KeyID
instead of HKID, which can be MKTME KeyID really.)

How is "clflush doesn't flush cache with HKID set" relevant here??

What you really want is "all caches associated with the given page must
have been flushed before tdx_clear_page()".

You can add as a function comment for tdx_clear_page(), but certainly
not here.

> + * The cache line could be
> + * poisoned (even without MKTME-i), clear the poison bit.
> + */

Is below better?

/*
* The page could have been poisoned. MOVDIR64B also clears
* the poison bit so the kernel can safely use the page again.
*/


> + for (i = 0; i < PAGE_SIZE; i += 64)
> + movdir64b(page + i, zero_page);
> + /*
> + * MOVDIR64B store uses WC buffer. Prevent following memory reads
> + * from seeing potentially poisoned cache.
> + */
> + __mb();
> +}
> +
> +static int __tdx_reclaim_page(hpa_t pa)
> +{
> + struct tdx_module_args out;
> + u64 err;
> +
> + do {
> + err = tdh_phymem_page_reclaim(pa, &out);
> + /*
> + * TDH.PHYMEM.PAGE.RECLAIM is allowed only when TD is shutdown.
> + * state. i.e. destructing TD.

Does this mean __tdx_reclaim_page() can only be used when destroying the TD?

Please add this to the function comment of __tdx_reclaim_page().

> + * TDH.PHYMEM.PAGE.RECLAIM requires TDR and target page.
> + * Because we're destructing TD, it's rare to contend with TDR.
> + */

It's rare to contend, so what?

If you want to justify the loop, and the "unlikely()" used in the loop,
then put this part right before the 'do { } while ()' loop, where your
intention applies, and explicitly call it out.

And in general I think it's better to add a 'cond_resched()' for such a
loop because a SEAMCALL is time-costly. If your comment is intended to
justify not adding 'cond_resched()', please also call that out.
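The retry-on-BUSY shape being reviewed can be sketched in user space, purely as an illustration: the error values are invented stand-ins, the fake reclaim call simulates transient TDR-lock contention, and sched_yield() stands in for cond_resched().

```c
#include <sched.h>

/* Stand-in error codes (values are illustrative, not the real TDX ones). */
#define ERR_NONE     0ULL
#define ERR_BUSY_RCX 0x8000020000000001ULL
#define ERR_BUSY_TDR 0x8000020000000080ULL

/* A fake reclaim call that reports BUSY a few times before succeeding,
 * mimicking transient contention on the TDR lock during TD teardown. */
static int busy_left = 3;
static unsigned long long fake_page_reclaim(void)
{
    if (busy_left > 0) {
        busy_left--;
        return ERR_BUSY_TDR;
    }
    return ERR_NONE;
}

/* The retry pattern under discussion: loop on the two BUSY variants and
 * yield between attempts (sched_yield() in place of cond_resched())
 * because each attempt is expensive. */
static int reclaim_page_sketch(void)
{
    unsigned long long err;

    do {
        err = fake_page_reclaim();
        if (err == ERR_BUSY_RCX || err == ERR_BUSY_TDR)
            sched_yield();
    } while (err == ERR_BUSY_RCX || err == ERR_BUSY_TDR);

    return err ? -1 : 0;
}
```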

> + } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX) ||
> + err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_TDR)));
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +static int tdx_reclaim_page(hpa_t pa)
> +{
> + int r;
> +
> + r = __tdx_reclaim_page(pa);
> + if (!r)
> + tdx_clear_page(pa);
> + return r;
> +}
> +
> +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> +{
> + WARN_ON_ONCE(!td_page_pa);
> +
> + /*
> + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> + * assigned to the TD. Here the cache associated to the TD
> + * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> + * cache doesn't need to be flushed again.
> + */
> + if (tdx_reclaim_page(td_page_pa))
> + /*
> + * Leak the page on failure:
> + * tdx_reclaim_page() returns an error if and only if there's an
> + * unexpected, fatal error, e.g. a SEAMCALL with bad params,
> + * incorrect concurrency in KVM, a TDX Module bug, etc.
> + * Retrying at a later point is highly unlikely to be
> + * successful.
> + * No log here as tdx_reclaim_page() already did.
> + */
> + return;

Use empty lines to make text more breathable between paragraphs.

> + free_page((unsigned long)__va(td_page_pa));
> +}
> +
> +static void tdx_do_tdh_phymem_cache_wb(void *unused)
> +{
> + u64 err = 0;
> +
> + do {
> + err = tdh_phymem_cache_wb(!!err);
> + } while (err == TDX_INTERRUPTED_RESUMABLE);
> +
> + /* Other thread may have done for us. */
> + if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> + err = TDX_SUCCESS;

Empty line.

> + if (WARN_ON_ONCE(err))
> + pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> +}

[snip]

I am stopping here, because I need to take a break.

Again I think we should split this patch, there are just too many things
to review here.

2024-03-22 04:33:42

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

> >
> > So how about we have some macros:
> >
> > static inline bool is_seamcall_err_kernel_defined(u64 err)
> > {
> > return err & TDX_SW_ERROR;
> > }
> >
> > #define TDX_KVM_SEAMCALL(_kvm, _seamcall_func, _fn, _args) \
> > ({ \
> > u64 _ret = _seamcall_func(_fn, _args);
> > KVM_BUG_ON(_kvm, is_seamcall_err_kernel_defined(_ret));
> > _ret;
> > })
>
> As we can move out KVM_BUG_ON() to the call site, we can simply have
> seamcall() or seamcall_ret().
> The call site has to check error. whether it is TDX_SW_ERROR or not.
> And if it hit the unexpected error, it will mark the guest bugged.

How many call sites are we talking about?

I think handling KVM_BUG_ON() in the macro should be able to eliminate a
bunch of individual KVM_BUG_ON()s in these call sites?

>
>
> > #define tdx_kvm_seamcall(_kvm, _fn, _args) \
> > TDX_KVM_SEAMCALL(_kvm, seamcall, _fn, _args)
> >
> > #define tdx_kvm_seamcall_ret(_kvm, _fn, _args) \
> > TDX_KVM_SEAMCALL(_kvm, seamcall_ret, _fn, _args)
> >
> > #define tdx_kvm_seamcall_saved_ret(_kvm, _fn, _args) \
> > TDX_KVM_SEAMCALL(_kvm, seamcall_saved_ret, _fn, _args)
> >
> > This is consistent with what we have in TDX host code, and this handles
> > NO_ENTROPY error internally.
> >
> >

[...]

> >
> > > Because only TDH.MNG.CREATE() and TDH.MNG.ADDCX() can return TDX_RND_NO_ENTROPY,
> > > we can use __seamcall(). The TDX spec doesn't guarantee such error code
> > > convention. It's very unlikely, though.
> >
> > I don't quite follow the "convention" part. Can you elaborate?
> >
> > NO_ENTROPY is already handled in seamcall() variants. Can we just use them
> > directly?
>
> I was concerned about bad code generation. If the loop on the NO_ENTROPY
> error harms code generation, we might be able to use __seamcall() or
> __seamcall_ret() instead of seamcall() and seamcall_ret().

This doesn't make sense to me.

Firstly, you have to *prove* the loop generates worse code.

Secondly, if it does generate worse code, and we care about it, we should fix it
in the host seamcall() code. No?

2024-03-22 04:38:06

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

On Thu, 2024-03-21 at 16:52 -0700, Isaku Yamahata wrote:
> On Thu, Mar 21, 2024 at 12:09:57PM +1300,
> "Huang, Kai" <[email protected]> wrote:
>
> > > Does it make sense?
> > >
> > > void pr_tdx_error(u64 op, u64 error_code)
> > > {
> > > pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
> > > op, error_code);
> > > }
> >
> > Should we also have a _ret version?
> >
> > void pr_seamcall_err(u64 op, u64 err)
> > {
> > /* A comment to explain why using the _ratelimited() version? */
>
> Because KVM can hit successive seamcall errors, e.g. during destructing a TD
> (it's unintentional sometimes), the ratelimited version is preferred as a
> safeguard. For example, SEAMCALL on all or some LPs (TDH_MNG_KEY_FREEID) can
> fail at the same time. And the number of LPs can be hundreds.

I mean you certainly have a reason to use _ratelimited() version. My point is
you at least explain it in a comment.

>
>
> > pr_err_ratelimited(...);
> > }
> >
> > void pr_seamcall_err_ret(u64 op, u64 err, struct tdx_module_args *arg)
> > {
> > pr_err_seamcall(op, err);
> >
> > pr_err_ratelimited(...);
> > }
> >
> > (Hmm... if you look at the tdx.c in TDX host, there's similar code there,
> > and again, it was a little bit annoying when I did that..)
> >
> > Again, if we just use seamcall_ret() for ALL SEAMCALLs except VP.ENTER, we
> > can simply have one..
>
> What about this?
>
> void pr_seamcall_err_ret(u64 op, u64 err, struct tdx_module_args *arg)
> {
> pr_err_ratelimited("SEAMCALL (0x%016llx) failed: 0x%016llx\n",
> op, error_code);
> if (arg)
> pr_err_ratelimited(...);
> }
>

Fine to me.

Or call pr_seamcall_err() instead. I don't care too much.
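The ratelimiting being justified above can be sketched in user space. This is only an illustration: the burst counter is a crude token budget standing in for the kernel's pr_err_ratelimited() machinery, and all names here are invented.

```c
#include <stdio.h>

/* Crude stand-in for pr_err_ratelimited(): allow a burst of messages,
 * then suppress the rest. Counts emitted messages so the effect is
 * observable. */
#define RATELIMIT_BURST 3

static int rl_budget = RATELIMIT_BURST;
static int rl_emitted;

static void fake_pr_err_ratelimited(const char *op, unsigned long long err)
{
    if (rl_budget <= 0)
        return;                 /* suppressed, as during mass TD teardown */
    rl_budget--;
    rl_emitted++;
    fprintf(stderr, "SEAMCALL (%s) failed: 0x%016llx\n", op, err);
}

/* The shape proposed in the thread: one helper, with optional extra
 * output when the caller has register state to dump. */
static void pr_seamcall_err_sketch(const char *op, unsigned long long err,
                                   const char *args_or_null)
{
    fake_pr_err_ratelimited(op, err);
    if (args_or_null)
        fake_pr_err_ratelimited(args_or_null, err);
}
```

With hundreds of LPs failing TDH_MNG_KEY_FREEID at once, only the first burst of messages would reach the log.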

2024-03-22 05:33:40

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Fri, Mar 22, 2024 at 11:46:41AM +0800, Yuan Yao wrote:
> On Thu, Mar 21, 2024 at 07:17:09AM -0700, Isaku Yamahata wrote:
> > On Wed, Mar 20, 2024 at 01:12:01PM +0800,
> > Chao Gao <[email protected]> wrote:
..
> > > >+static int __tdx_td_init(struct kvm *kvm)
> > > >+{
> > > >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > >+ cpumask_var_t packages;
> > > >+ unsigned long *tdcs_pa = NULL;
> > > >+ unsigned long tdr_pa = 0;
> > > >+ unsigned long va;
> > > >+ int ret, i;
> > > >+ u64 err;
> > > >+
> > > >+ ret = tdx_guest_keyid_alloc();
> > > >+ if (ret < 0)
> > > >+ return ret;
> > > >+ kvm_tdx->hkid = ret;
> > > >+
> > > >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > > >+ if (!va)
> > > >+ goto free_hkid;
> > > >+ tdr_pa = __pa(va);
> > > >+
> > > >+ tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
> > > >+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > > >+ if (!tdcs_pa)
> > > >+ goto free_tdr;
> > > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > > >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > > >+ if (!va)
> > > >+ goto free_tdcs;
> > > >+ tdcs_pa[i] = __pa(va);
> > > >+ }
> > > >+
> > > >+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> > > >+ ret = -ENOMEM;
> > > >+ goto free_tdcs;
> > > >+ }
> > > >+ cpus_read_lock();
> > > >+ /*
> > > >+ * Need at least one CPU of the package to be online in order to
> > > >+ * program all packages for host key id. Check it.
> > > >+ */
> > > >+ for_each_present_cpu(i)
> > > >+ cpumask_set_cpu(topology_physical_package_id(i), packages);
> > > >+ for_each_online_cpu(i)
> > > >+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
> > > >+ if (!cpumask_empty(packages)) {
> > > >+ ret = -EIO;
> > > >+ /*
> > > >+ * Because it's hard for human operator to figure out the
> > > >+ * reason, warn it.
> > > >+ */
> > > >+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
> > > >+ pr_warn_ratelimited(MSG_ALLPKG);
> > > >+ goto free_packages;
> > > >+ }
> > > >+
> > > >+ /*
> > > >+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
> > > >+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
> > > >+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
> > > >+ * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
> > > >+ * caller to handle the contention. This is because of time limitation
> > > >+ * usable inside the TDX module and OS/VMM knows better about process
> > > >+ * scheduling.
> > > >+ *
> > > >+ * APIs to acquire the lock of KOT:
> > > >+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
> > > >+ * TDH.PHYMEM.CACHE.WB.
> > > >+ */
> > > >+ mutex_lock(&tdx_lock);
> > > >+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
> > > >+ mutex_unlock(&tdx_lock);
> > > >+ if (err == TDX_RND_NO_ENTROPY) {
> > > >+ ret = -EAGAIN;
> > > >+ goto free_packages;
> > > >+ }
> > > >+ if (WARN_ON_ONCE(err)) {
> > > >+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> > > >+ ret = -EIO;
> > > >+ goto free_packages;
> > > >+ }
> > > >+ kvm_tdx->tdr_pa = tdr_pa;
> > > >+
> > > >+ for_each_online_cpu(i) {
> > > >+ int pkg = topology_physical_package_id(i);
> > > >+
> > > >+ if (cpumask_test_and_set_cpu(pkg, packages))
> > > >+ continue;
> > > >+
> > > >+ /*
> > > >+ * Program the memory controller in the package with an
> > > >+ * encryption key associated to a TDX private host key id
> > > >+ * assigned to this TDR. Concurrent operations on same memory
> > > >+ * controller results in TDX_OPERAND_BUSY. Avoid this race by
> > > >+ * mutex.
> > > >+ */
> > > >+ mutex_lock(&tdx_mng_key_config_lock[pkg]);
> > >
> > > the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
> > > create TDs, the same set of CPUs (the first online CPU of each package) will be
> > > selected to configure the key because of the cpumask_test_and_set_cpu() above.
> > > it means, we never have two CPUs in the same socket trying to program the key,
> > > i.e., no concurrent calls.
> >
> > Makes sense. Will drop the lock.
>
> Not get the point, the variable "packages" on stack, and it's
> possible that "i" is same for 2 threads which are trying to create td.
> Anything I missed ?

Got the point after syncing with Chao.
In case of using for_each_online_cpu(), it's safe to remove the
mutex_lock(&tdx_mng_key_config_lock[pkg]), since every thread will select only
one CPU per socket in the same order, and requests submitted to the same CPU by
smp_call_on_cpu() are ordered on the target CPU. That means removing the lock
works when using for_each_online_cpu() but does NOT work when randomly picking
a CPU per socket.

Maybe it's just my issue that I didn't realize what's going on here, but
I think it's still worth adding a comment here for why it works/does not work.
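The deterministic-selection argument can be illustrated with a small user-space sketch of the for_each_online_cpu() + cpumask_test_and_set_cpu() pattern (the toy topology and helper names here are invented): walking CPUs in ascending order always picks the same first CPU of each package, so two threads creating TDs can never target different CPUs of the same socket.

```c
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 8
#define NR_PKGS 2

/* Toy topology: CPUs 0-3 in package 0, CPUs 4-7 in package 1. */
static int package_of(int cpu) { return cpu / (NR_CPUS / NR_PKGS); }

/* Walk CPUs in ascending order and record the CPU chosen for each
 * package, like for_each_online_cpu() + cpumask_test_and_set_cpu()
 * in the quoted __tdx_td_init(). */
static void pick_one_cpu_per_package(int chosen[NR_PKGS])
{
    bool seen[NR_PKGS];
    int cpu;

    memset(seen, 0, sizeof(seen));
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        int pkg = package_of(cpu);

        if (seen[pkg])
            continue;
        seen[pkg] = true;
        chosen[pkg] = cpu;   /* first online CPU of this package */
    }
}
```

Any two callers produce identical selections, which is exactly why their smp_call_on_cpu() requests serialize on the target CPU; a random per-socket pick would break that property.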

>
> >
> >
> > > >+ ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> > > >+ &kvm_tdx->tdr_pa, true);
> > > >+ mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> > > >+ if (ret)
> > > >+ break;
> > > >+ }
> > > >+ cpus_read_unlock();
> > > >+ free_cpumask_var(packages);
> > > >+ if (ret) {
> > > >+ i = 0;
> > > >+ goto teardown;
> > > >+ }
> > > >+
> > > >+ kvm_tdx->tdcs_pa = tdcs_pa;
> > > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > > >+ err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> > > >+ if (err == TDX_RND_NO_ENTROPY) {
> > > >+ /* Here it's hard to allow userspace to retry. */
> > > >+ ret = -EBUSY;
> > > >+ goto teardown;
> > > >+ }
> > > >+ if (WARN_ON_ONCE(err)) {
> > > >+ pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> > > >+ ret = -EIO;
> > > >+ goto teardown;
> > > >+ }
> > > >+ }
> > > >+
> > > >+ /*
> > > >+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> > > >+ * ioctl() to define the configure CPUID values for the TD.
> > > >+ */
> > > >+ return 0;
> > > >+
> > > >+ /*
> > > >+ * The sequence for freeing resources from a partially initialized TD
> > > >+ * varies based on where in the initialization flow failure occurred.
> > > >+ * Simply use the full teardown and destroy, which naturally play nice
> > > >+ * with partial initialization.
> > > >+ */
> > > >+teardown:
> > > >+ for (; i < tdx_info->nr_tdcs_pages; i++) {
> > > >+ if (tdcs_pa[i]) {
> > > >+ free_page((unsigned long)__va(tdcs_pa[i]));
> > > >+ tdcs_pa[i] = 0;
> > > >+ }
> > > >+ }
> > > >+ if (!kvm_tdx->tdcs_pa)
> > > >+ kfree(tdcs_pa);
> > > >+ tdx_mmu_release_hkid(kvm);
> > > >+ tdx_vm_free(kvm);
> > > >+ return ret;
> > > >+
> > > >+free_packages:
> > > >+ cpus_read_unlock();
> > > >+ free_cpumask_var(packages);
> > > >+free_tdcs:
> > > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > > >+ if (tdcs_pa[i])
> > > >+ free_page((unsigned long)__va(tdcs_pa[i]));
> > > >+ }
> > > >+ kfree(tdcs_pa);
> > > >+ kvm_tdx->tdcs_pa = NULL;
> > > >+
> > > >+free_tdr:
> > > >+ if (tdr_pa)
> > > >+ free_page((unsigned long)__va(tdr_pa));
> > > >+ kvm_tdx->tdr_pa = 0;
> > > >+free_hkid:
> > > >+ if (is_hkid_assigned(kvm_tdx))
> > >
> > > IIUC, this is always true because you just return if keyid
> > > allocation fails.
> >
> > You're right. Will fix
> > --
> > Isaku Yamahata <[email protected]>
> >
>

2024-03-22 07:11:22

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Thu, 2024-03-21 at 23:12 +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-02-26 at 00:27 -0800, [email protected] wrote:
> > Implement a hook of KVM_SET_CPUID2 for additional consistency check.
> >
> > Intel TDX or AMD SEV has a restriction on the value of cpuid.  For
> > example,
> > some values must be the same between all vcpus.  Check if the new
> > values
> > are consistent with the old values.  The check is light because the
> > cpuid
> > consistency is very model specific and complicated.  The user space
> > VMM
> > should set cpuid and MSRs consistently.
>
> I see that this was suggested by Sean, but can you explain the problem
> that this is working around? From the linked thread, it seems like the
> problem is what to do when userspace also calls SET_CPUID after already
> configuring CPUID to the TDX module in the special way. The choices
> discussed included:
> 1. Reject the call
> 2. Check the consistency between the first CPUID configuration and the
> second one.
>
> 1 is a lot simpler, but the reasoning for 2 is because "some KVM code
> paths rely on guest CPUID configuration" it seems. Is this a
> hypothetical or real issue? Which code paths are problematic for
> TDX/SNP?

There might be use cases where the TDX guest wants to use some CPUID leaves
which aren't handled by the TDX module but purely by KVM. These (PV) CPUIDs
need to be provided via KVM_SET_CPUID2.
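A minimal sketch of the "light" consistency check idea (all types and names here are invented stand-ins, not the actual KVM structures): leaves that were part of the initial, TDX-module-visible configuration must reappear unchanged in the later KVM_SET_CPUID2 call, while extra KVM-only (PV) leaves are allowed through.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for struct kvm_cpuid_entry2. */
struct fake_cpuid_entry {
    uint32_t function;
    uint32_t eax, ebx, ecx, edx;
};

/* Every initially-configured leaf must survive unmodified; new leaves
 * handled purely by KVM are accepted. */
static bool cpuid_is_consistent(const struct fake_cpuid_entry *init, int n_init,
                                const struct fake_cpuid_entry *upd, int n_upd)
{
    for (int i = 0; i < n_init; i++) {
        bool found = false;

        for (int j = 0; j < n_upd; j++) {
            if (upd[j].function != init[i].function)
                continue;
            found = true;
            if (upd[j].eax != init[i].eax || upd[j].ebx != init[i].ebx ||
                upd[j].ecx != init[i].ecx || upd[j].edx != init[i].edx)
                return false;   /* configured leaf changed: reject */
        }
        if (!found)
            return false;       /* configured leaf dropped: reject */
    }
    return true;                /* extra KVM-only leaves are fine */
}
```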


Btw, Isaku, I don't understand why you tag the last two patches as RFC and put
them last. I think I've expressed this before. Per the discussion with
Sean, my understanding is this isn't something optional but the right thing we
should do?

https://lore.kernel.org/lkml/[email protected]/

2024-03-22 07:19:25

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation

On Thu, Mar 21, 2024 at 02:24:12PM -0700, Isaku Yamahata wrote:
>On Thu, Mar 21, 2024 at 12:11:11AM +0000,
>"Edgecombe, Rick P" <[email protected]> wrote:
>
>> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
>> > To handle private page tables, argument of is_private needs to be
>> > passed
>> > down.  Given that already page level is passed down, it would be
>> > cumbersome
>> > to add one more parameter about sp. Instead replace the level
>> > argument with
>> > union kvm_mmu_page_role.  Thus the number of argument won't be
>> > increased
>> > and more info about sp can be passed down.
>> >
>> > For private sp, secure page table will be also allocated in addition
>> > to
>> > struct kvm_mmu_page and page table (spt member).  The allocation
>> > functions
>> > (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to know
>> > if the
>> > allocation is for the conventional page table or private page table. 
>> > Pass
>> > union kvm_mmu_role to those functions and initialize role member of
>> > struct
>> > kvm_mmu_page.
>>
>> tdp_mmu_alloc_sp() is only called in two places. One for the root, and
>> one for the mid-level tables.
>>
>> In later patches when the kvm_mmu_alloc_private_spt() part is added,
>> the root case doesn't need anything done. So the code has to take
>> special care in tdp_mmu_alloc_sp() to avoid doing anything for the
>> root.
>>
>> It only needs to do the special private spt allocation in non-root
>> case. If we open code that case, I think maybe we could drop this
>> patch, like the below.
>>
>> The benefits are to drop this patch (which looks to already be part of
>> Paolo's series), and simplify "KVM: x86/mmu: Add a private pointer to
>> struct kvm_mmu_page". I'm not sure though, what do you think? Only
>> build tested.
>
>Makes sense. Until v18, it had config to disable private mmu part at
>compile time. Those functions have #ifdef in mmu_internal.h. v19
>dropped the config for the feedback.
> https://lore.kernel.org/kvm/[email protected]/
>
>After looking at mmu_internal.h, I think the following functions could be
>open coded: kvm_mmu_private_spt(), kvm_mmu_init_private_spt(),
>kvm_mmu_alloc_private_spt(), and kvm_mmu_free_private_spt().

It took me a few minutes to figure out why the mirror root page doesn't need
a private_spt.

Per TDX module spec:

Secure EPT’s root page (EPML4 or EPML5, depending on whether the host VMM uses
4-level or 5-level EPT) does not need to be explicitly added. It is created
during TD initialization (TDH.MNG.INIT) and is stored as part of TDCS.

I suggest adding the above as a comment somewhere even if we decide to open-code
kvm_mmu_alloc_private_spt().

IMO, some TDX details bleed into the KVM MMU regardless of whether we open-code
kvm_mmu_alloc_private_spt() or not. This isn't good, though I cannot think of
a better solution.

2024-03-22 08:56:36

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Thu, Mar 21, 2024 at 07:17:09AM -0700, Isaku Yamahata wrote:
> On Wed, Mar 20, 2024 at 01:12:01PM +0800,
> Chao Gao <[email protected]> wrote:
>
> > > config KVM_SW_PROTECTED_VM
> > > bool "Enable support for KVM software-protected VMs"
> > >- depends on EXPERT
> > > depends on KVM && X86_64
> > > select KVM_GENERIC_PRIVATE_MEM
> > > help
> > >@@ -89,6 +88,8 @@ config KVM_SW_PROTECTED_VM
> > > config KVM_INTEL
> > > tristate "KVM for Intel (and compatible) processors support"
> > > depends on KVM && IA32_FEAT_CTL
> > >+ select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
> >
> > why does INTEL_TDX_HOST select KVM_SW_PROTECTED_VM?
>
> I wanted KVM_GENERIC_PRIVATE_MEM. Ah, we should do
>
> select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
>
>
> > >+ select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
> > > help
> > > .vcpu_precreate = vmx_vcpu_precreate,
> > > .vcpu_create = vmx_vcpu_create,
> >
> > >--- a/arch/x86/kvm/vmx/tdx.c
> > >+++ b/arch/x86/kvm/vmx/tdx.c
> > >@@ -5,10 +5,11 @@
> > >
> > > #include "capabilities.h"
> > > #include "x86_ops.h"
> > >-#include "x86.h"
> > > #include "mmu.h"
> > > #include "tdx_arch.h"
> > > #include "tdx.h"
> > >+#include "tdx_ops.h"
> > >+#include "x86.h"
> >
> > any reason to reorder x86.h?
>
> No, I think it's accidental during rebase.
> Will fix.
>
>
>
> > >+static void tdx_do_tdh_phymem_cache_wb(void *unused)
> > >+{
> > >+ u64 err = 0;
> > >+
> > >+ do {
> > >+ err = tdh_phymem_cache_wb(!!err);
> > >+ } while (err == TDX_INTERRUPTED_RESUMABLE);
> > >+
> > >+ /* Other thread may have done for us. */
> > >+ if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> > >+ err = TDX_SUCCESS;
> > >+ if (WARN_ON_ONCE(err))
> > >+ pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> > >+}
> > >+
> > >+void tdx_mmu_release_hkid(struct kvm *kvm)
> > >+{
> > >+ bool packages_allocated, targets_allocated;
> > >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > >+ cpumask_var_t packages, targets;
> > >+ u64 err;
> > >+ int i;
> > >+
> > >+ if (!is_hkid_assigned(kvm_tdx))
> > >+ return;
> > >+
> > >+ if (!is_td_created(kvm_tdx)) {
> > >+ tdx_hkid_free(kvm_tdx);
> > >+ return;
> > >+ }
> > >+
> > >+ packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > >+ targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> > >+ cpus_read_lock();
> > >+
> > >+ /*
> > >+ * We can destroy multiple guest TDs simultaneously. Prevent
> > >+ * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> > >+ */
> > >+ mutex_lock(&tdx_lock);
> > >+
> > >+ /*
> > >+ * Go through multiple TDX HKID state transitions with three SEAMCALLs
> > >+ * to make TDH.PHYMEM.PAGE.RECLAIM() usable. Make the transition atomic
> > >+ * to other functions to operate private pages and Secure-EPT pages.
> > >+ *
> > >+ * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
> > >+ * This function is called via mmu notifier, mmu_release().
> > >+ * kvm_gmem_release() is called via fput() on process exit.
> > >+ */
> > >+ write_lock(&kvm->mmu_lock);
> > >+
> > >+ for_each_online_cpu(i) {
> > >+ if (packages_allocated &&
> > >+ cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > >+ packages))
> > >+ continue;
> > >+ if (targets_allocated)
> > >+ cpumask_set_cpu(i, targets);
> > >+ }
> > >+ if (targets_allocated)
> > >+ on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
> > >+ else
> > >+ on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);
> >
> > This tries flush cache on all CPUs when we run out of memory. I am not sure if
> > it is the best solution. A simple solution is just use two global bitmaps.
> >
> > And current logic isn't optimal. e.g., if packages_allocated is true while
> > targets_allocated is false, then we will fill in the packages bitmap but don't
> > use it at all.
> >
> > That said, I prefer to optimize the rare case in a separate patch. We can just use
> > two global bitmaps or let the flush fail here just as you are doing below on
> > seamcall failure.
>
> Makes sense. We can allocate cpumasks on hardware_setup/unsetup() and update them
> on hardware_enable/disable().
>
> ...
>
> > >+static int __tdx_td_init(struct kvm *kvm)
> > >+{
> > >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > >+ cpumask_var_t packages;
> > >+ unsigned long *tdcs_pa = NULL;
> > >+ unsigned long tdr_pa = 0;
> > >+ unsigned long va;
> > >+ int ret, i;
> > >+ u64 err;
> > >+
> > >+ ret = tdx_guest_keyid_alloc();
> > >+ if (ret < 0)
> > >+ return ret;
> > >+ kvm_tdx->hkid = ret;
> > >+
> > >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > >+ if (!va)
> > >+ goto free_hkid;
> > >+ tdr_pa = __pa(va);
> > >+
> > >+ tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
> > >+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > >+ if (!tdcs_pa)
> > >+ goto free_tdr;
> > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > >+ if (!va)
> > >+ goto free_tdcs;
> > >+ tdcs_pa[i] = __pa(va);
> > >+ }
> > >+
> > >+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> > >+ ret = -ENOMEM;
> > >+ goto free_tdcs;
> > >+ }
> > >+ cpus_read_lock();
> > >+ /*
> > >+ * Need at least one CPU of the package to be online in order to
> > >+ * program all packages for host key id. Check it.
> > >+ */
> > >+ for_each_present_cpu(i)
> > >+ cpumask_set_cpu(topology_physical_package_id(i), packages);
> > >+ for_each_online_cpu(i)
> > >+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
> > >+ if (!cpumask_empty(packages)) {
> > >+ ret = -EIO;
> > >+ /*
> > >+ * Because it's hard for human operator to figure out the
> > >+ * reason, warn it.
> > >+ */
> > >+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
> > >+ pr_warn_ratelimited(MSG_ALLPKG);
> > >+ goto free_packages;
> > >+ }
> > >+
> > >+ /*
> > >+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
> > >+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
> > >+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
> > >+ * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
> > >+ * caller to handle the contention. This is because of time limitation
> > >+ * usable inside the TDX module and OS/VMM knows better about process
> > >+ * scheduling.
> > >+ *
> > >+ * APIs to acquire the lock of KOT:
> > >+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
> > >+ * TDH.PHYMEM.CACHE.WB.
> > >+ */
> > >+ mutex_lock(&tdx_lock);
> > >+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
> > >+ mutex_unlock(&tdx_lock);
> > >+ if (err == TDX_RND_NO_ENTROPY) {
> > >+ ret = -EAGAIN;
> > >+ goto free_packages;
> > >+ }
> > >+ if (WARN_ON_ONCE(err)) {
> > >+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> > >+ ret = -EIO;
> > >+ goto free_packages;
> > >+ }
> > >+ kvm_tdx->tdr_pa = tdr_pa;
> > >+
> > >+ for_each_online_cpu(i) {
> > >+ int pkg = topology_physical_package_id(i);
> > >+
> > >+ if (cpumask_test_and_set_cpu(pkg, packages))
> > >+ continue;
> > >+
> > >+ /*
> > >+ * Program the memory controller in the package with an
> > >+ * encryption key associated to a TDX private host key id
> > >+ * assigned to this TDR. Concurrent operations on same memory
> > >+ * controller results in TDX_OPERAND_BUSY. Avoid this race by
> > >+ * mutex.
> > >+ */
> > >+ mutex_lock(&tdx_mng_key_config_lock[pkg]);
> >
> > the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
> > create TDs, the same set of CPUs (the first online CPU of each package) will be
> > selected to configure the key because of the cpumask_test_and_set_cpu() above.
> > it means, we never have two CPUs in the same socket trying to program the key,
> > i.e., no concurrent calls.
>
> Makes sense. Will drop the lock.

I don't get the point. The variable "packages" is on the stack, so it's
possible that "i" is the same for two threads that are both trying to create
a TD. Am I missing anything?

>
>
> > >+ ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> > >+ &kvm_tdx->tdr_pa, true);
> > >+ mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> > >+ if (ret)
> > >+ break;
> > >+ }
> > >+ cpus_read_unlock();
> > >+ free_cpumask_var(packages);
> > >+ if (ret) {
> > >+ i = 0;
> > >+ goto teardown;
> > >+ }
> > >+
> > >+ kvm_tdx->tdcs_pa = tdcs_pa;
> > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > >+ err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> > >+ if (err == TDX_RND_NO_ENTROPY) {
> > >+ /* Here it's hard to allow userspace to retry. */
> > >+ ret = -EBUSY;
> > >+ goto teardown;
> > >+ }
> > >+ if (WARN_ON_ONCE(err)) {
> > >+ pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> > >+ ret = -EIO;
> > >+ goto teardown;
> > >+ }
> > >+ }
> > >+
> > >+ /*
> > >+ * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> > >+ * ioctl() to define the configure CPUID values for the TD.
> > >+ */
> > >+ return 0;
> > >+
> > >+ /*
> > >+ * The sequence for freeing resources from a partially initialized TD
> > >+ * varies based on where in the initialization flow failure occurred.
> > >+ * Simply use the full teardown and destroy, which naturally play nice
> > >+ * with partial initialization.
> > >+ */
> > >+teardown:
> > >+ for (; i < tdx_info->nr_tdcs_pages; i++) {
> > >+ if (tdcs_pa[i]) {
> > >+ free_page((unsigned long)__va(tdcs_pa[i]));
> > >+ tdcs_pa[i] = 0;
> > >+ }
> > >+ }
> > >+ if (!kvm_tdx->tdcs_pa)
> > >+ kfree(tdcs_pa);
> > >+ tdx_mmu_release_hkid(kvm);
> > >+ tdx_vm_free(kvm);
> > >+ return ret;
> > >+
> > >+free_packages:
> > >+ cpus_read_unlock();
> > >+ free_cpumask_var(packages);
> > >+free_tdcs:
> > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > >+ if (tdcs_pa[i])
> > >+ free_page((unsigned long)__va(tdcs_pa[i]));
> > >+ }
> > >+ kfree(tdcs_pa);
> > >+ kvm_tdx->tdcs_pa = NULL;
> > >+
> > >+free_tdr:
> > >+ if (tdr_pa)
> > >+ free_page((unsigned long)__va(tdr_pa));
> > >+ kvm_tdx->tdr_pa = 0;
> > >+free_hkid:
> > >+ if (is_hkid_assigned(kvm_tdx))
> >
> > IIUC, this is always true because you just return if keyid
> > allocation fails.
>
> You're right. Will fix
> --
> Isaku Yamahata <[email protected]>
>

2024-03-22 11:20:19

by Huang, Kai

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> +struct kvm_tdx_init_vm {
> + __u64 attributes;
> + __u64 mrconfigid[6]; /* sha384 digest */
> + __u64 mrowner[6]; /* sha384 digest */
> + __u64 mrownerconfig[6]; /* sha384 digest */
> + /*
> + * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
> + * This should be enough given sizeof(TD_PARAMS) = 1024.
> + * 8KB was chosen given because
> + * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
> + */
> + __u64 reserved[1004];

This is insane.

You said you want to reserve 8K for CPUID entries, but how can these 1004 * 8
bytes be used for CPUID entries since ...

> +
> + /*
> + * Call KVM_TDX_INIT_VM before vcpu creation, thus before
> + * KVM_SET_CPUID2.
> + * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
> + * TDX module directly virtualizes those CPUIDs without VMM.  The user
> + * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
> + * those values.  If it doesn't, KVM may have wrong idea of vCPUIDs of
> + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX
> + * module doesn't virtualize.
> + */
> + struct kvm_cpuid2 cpuid;

... they are actually placed right after here?

> +};
> +


2024-03-22 13:03:41

by Yuan Yao

Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Mon, Feb 26, 2024 at 12:25:29AM -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Define architectural definitions for KVM to issue the TDX SEAMCALLs.
>
> Structures and values that are architecturally defined in the TDX module
> specifications the chapter of ABI Reference.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Reviewed-by: Xiaoyao Li <[email protected]>
> ---
> v19:
> - drop tdvmcall constants by Xiaoyao
>
> v18:
> - Add metadata field id
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx_arch.h | 265 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 265 insertions(+)
> create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
>
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> new file mode 100644
> index 000000000000..e2c1a6f429d7
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -0,0 +1,265 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* architectural constants/data definitions for TDX SEAMCALLs */
> +
> +#ifndef __KVM_X86_TDX_ARCH_H
> +#define __KVM_X86_TDX_ARCH_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * TDX SEAMCALL API function leaves
> + */
> +#define TDH_VP_ENTER 0
> +#define TDH_MNG_ADDCX 1
> +#define TDH_MEM_PAGE_ADD 2
> +#define TDH_MEM_SEPT_ADD 3
> +#define TDH_VP_ADDCX 4
> +#define TDH_MEM_PAGE_RELOCATE 5
> +#define TDH_MEM_PAGE_AUG 6
> +#define TDH_MEM_RANGE_BLOCK 7
> +#define TDH_MNG_KEY_CONFIG 8
> +#define TDH_MNG_CREATE 9
> +#define TDH_VP_CREATE 10
> +#define TDH_MNG_RD 11
> +#define TDH_MR_EXTEND 16
> +#define TDH_MR_FINALIZE 17
> +#define TDH_VP_FLUSH 18
> +#define TDH_MNG_VPFLUSHDONE 19
> +#define TDH_MNG_KEY_FREEID 20
> +#define TDH_MNG_INIT 21
> +#define TDH_VP_INIT 22
> +#define TDH_MEM_SEPT_RD 25
> +#define TDH_VP_RD 26
> +#define TDH_MNG_KEY_RECLAIMID 27
> +#define TDH_PHYMEM_PAGE_RECLAIM 28
> +#define TDH_MEM_PAGE_REMOVE 29
> +#define TDH_MEM_SEPT_REMOVE 30
> +#define TDH_SYS_RD 34
> +#define TDH_MEM_TRACK 38
> +#define TDH_MEM_RANGE_UNBLOCK 39
> +#define TDH_PHYMEM_CACHE_WB 40
> +#define TDH_PHYMEM_PAGE_WBINVD 41
> +#define TDH_VP_WR 43
> +#define TDH_SYS_LP_SHUTDOWN 44
> +
> +/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> +#define TDX_NON_ARCH BIT_ULL(63)
> +#define TDX_CLASS_SHIFT 56
> +#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
> +
> +#define __BUILD_TDX_FIELD(non_arch, class, field) \
> + (((non_arch) ? TDX_NON_ARCH : 0) | \
> + ((u64)(class) << TDX_CLASS_SHIFT) | \
> + ((u64)(field) & TDX_FIELD_MASK))
> +
> +#define BUILD_TDX_FIELD(class, field) \
> + __BUILD_TDX_FIELD(false, (class), (field))
> +
> +#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
> + __BUILD_TDX_FIELD(true, (class), (field))
> +
> +
> +/* Class code for TD */
> +#define TD_CLASS_EXECUTION_CONTROLS 17ULL
> +
> +/* Class code for TDVPS */
> +#define TDVPS_CLASS_VMCS 0ULL
> +#define TDVPS_CLASS_GUEST_GPR 16ULL
> +#define TDVPS_CLASS_OTHER_GUEST 17ULL
> +#define TDVPS_CLASS_MANAGEMENT 32ULL
> +
> +enum tdx_tdcs_execution_control {
> + TD_TDCS_EXEC_TSC_OFFSET = 10,
> +};
> +
> +/* @field is any of enum tdx_tdcs_execution_control */
> +#define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
> +
> +/* @field is the VMCS field encoding */
> +#define TDVPS_VMCS(field) BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
> +
> +enum tdx_vcpu_guest_other_state {
> + TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
> +};
> +
> +union tdx_vcpu_state_details {
> + struct {
> + u64 vmxip : 1;
> + u64 reserved : 63;
> + };
> + u64 full;
> +};
> +
> +/* @field is any of enum tdx_guest_other_state */
> +#define TDVPS_STATE(field) BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
> +#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
> +
> +/* Management class fields */
> +enum tdx_vcpu_guest_management {
> + TD_VCPU_PEND_NMI = 11,
> +};
> +
> +/* @field is any of enum tdx_vcpu_guest_management */
> +#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
> +
> +#define TDX_EXTENDMR_CHUNKSIZE 256
> +
> +struct tdx_cpuid_value {
> + u32 eax;
> + u32 ebx;
> + u32 ecx;
> + u32 edx;
> +} __packed;
> +
> +#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)

This series doesn't really touch off-TD things, so you can remove this.

> +#define TDX_TD_ATTR_SEPT_VE_DISABLE BIT_ULL(28)
> +#define TDX_TD_ATTRIBUTE_PKS BIT_ULL(30)
> +#define TDX_TD_ATTRIBUTE_KL BIT_ULL(31)
> +#define TDX_TD_ATTRIBUTE_PERFMON BIT_ULL(63)
> +
> +/*
> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> + */
> +#define TDX_MAX_VCPUS (~(u16)0)
> +
> +struct td_params {
> + u64 attributes;
> + u64 xfam;
> + u16 max_vcpus;
> + u8 reserved0[6];
> +
> + u64 eptp_controls;
> + u64 exec_controls;
> + u16 tsc_frequency;
> + u8 reserved1[38];
> +
> + u64 mrconfigid[6];
> + u64 mrowner[6];
> + u64 mrownerconfig[6];
> + u64 reserved2[4];
> +
> + union {
> + DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
> + u8 reserved3[768];
> + };
> +} __packed __aligned(1024);
> +
> +/*
> + * Guest uses MAX_PA for GPAW when set.
> + * 0: GPA.SHARED bit is GPA[47]
> + * 1: GPA.SHARED bit is GPA[51]
> + */
> +#define TDX_EXEC_CONTROL_MAX_GPAW BIT_ULL(0)
> +
> +/*
> + * TDH.VP.ENTER, TDG.VP.VMCALL preserves RBP
> + * 0: RBP can be used for TDG.VP.VMCALL input. RBP is clobbered.
> + * 1: RBP can't be used for TDG.VP.VMCALL input. RBP is preserved.
> + */
> +#define TDX_CONTROL_FLAG_NO_RBP_MOD BIT_ULL(2)
> +
> +
> +/*
> + * TDX requires the frequency to be defined in units of 25MHz, which is the
> + * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
> + * module can only program frequencies that are multiples of 25MHz. The
> + * frequency must be between 100mhz and 10ghz (inclusive).
> + */
> +#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz) ((tsc_in_khz) / (25 * 1000))
> +#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz) ((tsc_in_25mhz) * (25 * 1000))
> +#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
> +#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)
> +
> +union tdx_sept_entry {
> + struct {
> + u64 r : 1;
> + u64 w : 1;
> + u64 x : 1;
> + u64 mt : 3;
> + u64 ipat : 1;
> + u64 leaf : 1;
> + u64 a : 1;
> + u64 d : 1;
> + u64 xu : 1;
> + u64 ignored0 : 1;
> + u64 pfn : 40;
> + u64 reserved : 5;
> + u64 vgp : 1;
> + u64 pwa : 1;
> + u64 ignored1 : 1;
> + u64 sss : 1;
> + u64 spp : 1;
> + u64 ignored2 : 1;
> + u64 sve : 1;
> + };
> + u64 raw;
> +};
> +
> +enum tdx_sept_entry_state {
> + TDX_SEPT_FREE = 0,
> + TDX_SEPT_BLOCKED = 1,
> + TDX_SEPT_PENDING = 2,
> + TDX_SEPT_PENDING_BLOCKED = 3,
> + TDX_SEPT_PRESENT = 4,
> +};
> +
> +union tdx_sept_level_state {
> + struct {
> + u64 level : 3;
> + u64 reserved0 : 5;
> + u64 state : 8;
> + u64 reserved1 : 48;
> + };
> + u64 raw;
> +};
> +
> +/*
> + * Global scope metadata field ID.
> + * See Table "Global Scope Metadata", TDX module 1.5 ABI spec.
> + */
> +#define MD_FIELD_ID_SYS_ATTRIBUTES 0x0A00000200000000ULL
> +#define MD_FIELD_ID_FEATURES0 0x0A00000300000008ULL
> +#define MD_FIELD_ID_ATTRS_FIXED0 0x1900000300000000ULL
> +#define MD_FIELD_ID_ATTRS_FIXED1 0x1900000300000001ULL
> +#define MD_FIELD_ID_XFAM_FIXED0 0x1900000300000002ULL
> +#define MD_FIELD_ID_XFAM_FIXED1 0x1900000300000003ULL
> +
> +#define MD_FIELD_ID_TDCS_BASE_SIZE 0x9800000100000100ULL
> +#define MD_FIELD_ID_TDVPS_BASE_SIZE 0x9800000100000200ULL
> +
> +#define MD_FIELD_ID_NUM_CPUID_CONFIG 0x9900000100000004ULL
> +#define MD_FIELD_ID_CPUID_CONFIG_LEAVES 0x9900000300000400ULL
> +#define MD_FIELD_ID_CPUID_CONFIG_VALUES 0x9900000300000500ULL
> +
> +#define MD_FIELD_ID_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> +
> +#define TDX_MAX_NR_CPUID_CONFIGS 37
> +
> +#define TDX_MD_ELEMENT_SIZE_8BITS 0
> +#define TDX_MD_ELEMENT_SIZE_16BITS 1
> +#define TDX_MD_ELEMENT_SIZE_32BITS 2
> +#define TDX_MD_ELEMENT_SIZE_64BITS 3
> +
> +union tdx_md_field_id {
> + struct {
> + u64 field : 24;
> + u64 reserved0 : 8;
> + u64 element_size_code : 2;
> + u64 last_element_in_field : 4;
> + u64 reserved1 : 3;
> + u64 inc_size : 1;
> + u64 write_mask_valid : 1;
> + u64 context : 3;
> + u64 reserved2 : 1;
> + u64 class : 6;
> + u64 reserved3 : 1;
> + u64 non_arch : 1;
> + };
> + u64 raw;
> +};
> +
> +#define TDX_MD_ELEMENT_SIZE_CODE(_field_id) \
> + ({ union tdx_md_field_id _fid = { .raw = (_field_id)}; \
> + _fid.element_size_code; })
> +
> +#endif /* __KVM_X86_TDX_ARCH_H */
> --
> 2.25.1
>
>

2024-03-22 15:19:36

by Isaku Yamahata

Subject: Re: [PATCH v19 056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation

On Fri, Mar 22, 2024 at 03:18:39PM +0800,
Chao Gao <[email protected]> wrote:

> On Thu, Mar 21, 2024 at 02:24:12PM -0700, Isaku Yamahata wrote:
> >On Thu, Mar 21, 2024 at 12:11:11AM +0000,
> >"Edgecombe, Rick P" <[email protected]> wrote:
> >
> >> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> >> > To handle private page tables, argument of is_private needs to be
> >> > passed
> >> > down.  Given that already page level is passed down, it would be
> >> > cumbersome
> >> > to add one more parameter about sp. Instead replace the level
> >> > argument with
> >> > union kvm_mmu_page_role.  Thus the number of argument won't be
> >> > increased
> >> > and more info about sp can be passed down.
> >> >
> >> > For private sp, secure page table will be also allocated in addition
> >> > to
> >> > struct kvm_mmu_page and page table (spt member).  The allocation
> >> > functions
> >> > (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to know
> >> > if the
> >> > allocation is for the conventional page table or private page table. 
> >> > Pass
> >> > union kvm_mmu_role to those functions and initialize role member of
> >> > struct
> >> > kvm_mmu_page.
> >>
> >> tdp_mmu_alloc_sp() is only called in two places. One for the root, and
> >> one for the mid-level tables.
> >>
> >> In later patches when the kvm_mmu_alloc_private_spt() part is added,
> >> the root case doesn't need anything done. So the code has to take
> >> special care in tdp_mmu_alloc_sp() to avoid doing anything for the
> >> root.
> >>
> >> It only needs to do the special private spt allocation in non-root
> >> case. If we open code that case, I think maybe we could drop this
> >> patch, like the below.
> >>
> >> The benefits are to drop this patch (which looks to already be part of
> >> Paolo's series), and simplify "KVM: x86/mmu: Add a private pointer to
> >> struct kvm_mmu_page". I'm not sure though, what do you think? Only
> >> build tested.
> >
> >Makes sense. Until v18, it had config to disable private mmu part at
> >compile time. Those functions have #ifdef in mmu_internal.h. v19
> >dropped the config for the feedback.
> > https://lore.kernel.org/kvm/[email protected]/
> >
> >After looking at mmu_internal.h, I think the following four functions could be
> >open-coded:
> >kvm_mmu_private_spt(), kvm_mmu_init_private_spt(), kvm_mmu_alloc_private_spt(),
> >and kvm_mmu_free_private_spt().
>
> It took me a few minutes to figure out why the mirror root page doesn't need
> a private_spt.
>
> Per TDX module spec:
>
> Secure EPT’s root page (EPML4 or EPML5, depending on whether the host VMM uses
> 4-level or 5-level EPT) does not need to be explicitly added. It is created
> during TD initialization (TDH.MNG.INIT) and is stored as part of TDCS.
>
> I suggest adding the above as a comment somewhere even if we decide to open-code
> kvm_mmu_alloc_private_spt().


Patch 058/130 has such a comment. Adding the citation from the spec would be better.



> IMO, some TDX details bleed into KVM MMU regardless of whether we open-code
> kvm_mmu_alloc_private_spt() or not. This isn't good though I cannot think of
> a better solution.
>

--
Isaku Yamahata <[email protected]>

2024-03-22 16:07:42

by Edgecombe, Rick P

Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Fri, 2024-03-22 at 07:10 +0000, Huang, Kai wrote:
> > I see that this was suggested by Sean, but can you explain the
> > problem
> > that this is working around? From the linked thread, it seems like
> > the
> > problem is what to do when userspace also calls SET_CPUID after
> > already
> > configuring CPUID to the TDX module in the special way. The choices
> > discussed included:
> > 1. Reject the call
> > 2. Check the consistency between the first CPUID configuration and
> > the
> > second one.
> >
> > 1 is a lot simpler, but the reasoning for 2 is because "some KVM
> > code
> > paths rely on guest CPUID configuration" it seems. Is this a
> > hypothetical or real issue? Which code paths are problematic for
> > TDX/SNP?
>
> There might be use cases where a TDX guest wants to use some CPUID leaves
> which aren't handled by the TDX module but purely by KVM.  These (PV)
> CPUIDs need to be provided via KVM_SET_CPUID2.

Right, but are there any needed today? I read that Sean's point was
that KVM_SET_CPUID2 can't accept anything today that we would want to
block later, otherwise it would introduce a regression. This was the
major constraint IIUC, and means the base series requires *something*
here.

If we want to support only the most basic support first, we don't need
to support PV CPUIDs on day 1, right?

So I'm wondering, if we could shrink the base series by going with
option 1 to start, and then expanding it with this solution later to
enable more features. Do you see a problem or conflict with Sean's
comments?


2024-03-22 17:39:19

by Isaku Yamahata

Subject: Re: [PATCH v19 022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions

On Thu, Mar 21, 2024 at 11:27:46AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Currently, KVM VMX module initialization/exit functions are a single
> > function each. Refactor KVM VMX module initialization functions into KVM
> > common part and VMX part so that TDX specific part can be added cleanly.
> > Opportunistically refactor module exit function as well.
> >
> > The current module initialization flow is,
>
> ^ ',' -> ':'
>
> And please add an empty line to make text more breathable.
>
> > 0.) Check if VMX is supported,
> > 1.) hyper-v specific initialization,
> > 2.) system-wide x86 specific and vendor specific initialization,
> > 3.) Final VMX specific system-wide initialization,
> > 4.) calculate the sizes of VMX kvm structure and VMX vcpu structure,
> > 5.) report those sizes to the KVM common layer and KVM common
> > initialization
>
> Is there any difference between "KVM common layer" and "KVM common
> initialization"? I think you can remove the former.

Ok.

> > Refactor the KVM VMX module initialization function into functions with a
> > wrapper function to separate VMX logic in vmx.c from a file, main.c, common
> > among VMX and TDX. Introduce a wrapper function for vmx_init().
>
> Sorry I don't quite follow what your are trying to say in the above paragraph.
>
> You have adequately put what is the _current_ flow, and I am expecting to see
> the flow _after_ the refactor here.

Will add it.


> > The KVM architecture common layer allocates struct kvm with reported size
> > for architecture-specific code. The KVM VMX module defines its structure
> > as struct vmx_kvm { struct kvm; VMX specific members;} and uses it as
> > struct vmx kvm. Similar for vcpu structure. TDX KVM patches will define
>
> ^vmx_kvm.
>
> Please be more consistent on the words.
>
> > TDX specific kvm and vcpu structures.
>
> Is this paragraph related to the changes in this patch?
>
> For instance, why do you need to point out we will have TDX-specific 'kvm and
> vcpu' structures?

The point of this refactoring is to make room for TDX-specific code. The
considerations are the data size/alignment differences and the dependency
on VMX. Let me re-order the sentences.


> > The current module exit function is also a single function, a combination
> > of VMX specific logic and common KVM logic. Refactor it into VMX specific
> > logic and KVM common logic.  
> >
>
> [...]
>
> > This is just refactoring to keep the VMX
> > specific logic in vmx.c from main.c.
>
> It's better to make this as a separate paragraph, because it is a summary to
> this patch.
>
> And in other words: No functional change intended?

Thanks for the feedback. Here is the revised version.

KVM: x86/vmx: Refactor KVM VMX module init/exit functions

Split the KVM VMX kernel module initialization into a part specific to
VMX in vmx.c and a part common to VMX and TDX in main.c, to make room
for TDX-specific logic. Opportunistically, refactor the module exit
function as well.

The key points are the data structure differences and TDX's dependency
on VMX. The data structures for TDX differ from those for VMX, as do
their sizes and alignments. Because TDX depends on VMX, TDX
initialization must come after VMX initialization, and TDX cleanup must
come before VMX cleanup.

The current module initialization flow is:

0.) Check if VMX is supported,
1.) Hyper-v specific initialization,
2.) System-wide x86 specific and vendor-specific initialization,
3.) Final VMX-specific system-wide initialization,
4.) Calculate the sizes of the kvm and vcpu structure for VMX,
5.) Report those sizes to the KVM-common initialization

After refactoring and TDX, the flow will be:

0.) Check if VMX is supported (main.c),
1.) Hyper-v specific initialization (main.c),
2.) System-wide x86 specific and vendor-specific initialization,
2.1) VMX-specific initialization (vmx.c)
2.2) TDX-specific initialization (tdx.c)
3.) Final VMX-specific system-wide initialization (vmx.c),
TDX doesn't need this step.
4.) Calculate the sizes of the kvm and vcpu structure for both VMX and
TDX, (main.c)
5.) Report those sizes to the KVM-common initialization (main.c)

No functional change intended.
--
Isaku Yamahata <[email protected]>

2024-03-22 18:10:22

by Isaku Yamahata

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Mar 21, 2024 at 12:39:34PM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Fri, 2024-03-15 at 16:25 -0700, Isaku Yamahata wrote:
> > > > > How about if there are some LPs that are offline.
> > > > > In tdx_hardware_setup(), only online LPs are initialed for TDX, right?
> > > > Correct.
> > > >
> > > >
> > > > > Then when an offline LP becoming online, it doesn't have a chance to call
> > > > > tdx_cpu_enable()?
> > > > KVM registers kvm_online/offline_cpu() @ kvm_main.c as cpu hotplug callbacks.
> > > > Eventually x86 kvm hardware_enable() is called on online/offline event.
> > >
> > > Yes, hardware_enable() will be called when online,
> > > but  hardware_enable() now is vmx_hardware_enable() right?
> > > It doens't call tdx_cpu_enable() during the online path.
> >
> > TDX module requires TDH.SYS.LP.INIT() on all logical processors(LPs).  If we
> > successfully initialized TDX module, we don't need further action for TDX on cpu
> > online/offline.
> >
> > If some of LPs are not online when loading kvm_intel.ko, KVM fails to initialize
> > TDX module. TDX support is disabled.  We don't bother to attempt it.  Leave it
> > to the admin of the machine.
>
> No. We have relaxed this. Now the TDX module can be initialized on a subset of
> all logical cpus, with arbitrary number of cpus being offline.
>
> Those cpus can become online after module initialization, and TDH.SYS.LP.INIT on
> them won't fail.

Ah, you're right. So we need to call tdx_cpu_enable() when a CPU comes
online. For offline, KVM has to do nothing. Shutting down the TDX module
is another story.
--
Isaku Yamahata <[email protected]>

2024-03-22 21:23:39

by Isaku Yamahata

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Mar 21, 2024 at 01:07:27PM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX requires several initialization steps for KVM to create guest TDs.
> > Detect CPU feature, enable VMX (TDX is based on VMX) on all online CPUs,
> > detect the TDX module availability, initialize it and disable VMX.
>
> Before KVM can use TDX to create and run TDX guests, the kernel needs to
> initialize TDX from two perspectives:
>
> 1) Initialize the TDX module.
> 1) Do the "per-cpu initialization" on any logical cpu before running any TDX
> code on that cpu.
>
> The host kernel provides two functions to do them respectively: tdx_cpu_enable()
> and tdx_enable().
>
> Currently, tdx_enable() requires all online cpus being in VMX operation with CPU
> hotplug disabled, and tdx_cpu_enable() needs to be called on local cpu with that
> cpu being in VMX operation and IRQ disabled.
>
> >
> > To enable/disable VMX on all online CPUs, utilize
> > vmx_hardware_enable/disable(). The method also initializes each CPU for
> > TDX.  
> >
>
> I don't understand what you are saying here.
>
> Did you mean you put tdx_cpu_enable() inside vmx_hardware_enable()?

Now the section doesn't make sense. Will remove it.


> > TDX requires calling a TDX initialization function per logical
> > processor (LP) before the LP uses TDX.  
> >
>
> [...]
>
> > When the CPU is becoming online,
> > call the TDX LP initialization API. If it fails to initialize TDX, refuse
> > CPU online for simplicity instead of TDX avoiding the failed LP.
>
> Unless I am missing something, I don't see this has been done in the code.

You're right. Somehow the code was lost. Let me revive it with the next
version.


> > There are several options on when to initialize the TDX module. A.) kernel
> > module loading time, B.) the first guest TD creation time. A.) was chosen.
>
> A.) was chosen -> Choose A).
>
> Describe your change in "imperative mood".
>
> > With B.), a user may hit an error of the TDX initialization when trying to
> > create the first guest TD. The machine that fails to initialize the TDX
> > module can't boot any guest TD further. Such failure is undesirable and a
> > surprise because the user expects that the machine can accommodate guest
> > TD, but not. So A.) is better than B.).
> >
> > Introduce a module parameter, kvm_intel.tdx, to explicitly enable TDX KVM
>
> You don't have to say the name of the new parameter. It's shown in the code.
>
> > support. It's off by default to keep the same behavior for those who don't
> > use TDX.  
> >
>
> [...]
>
>
> > Implement hardware_setup method to detect TDX feature of CPU and
> > initialize TDX module.
>
> You are not detecting TDX feature anymore.
>
> And put this in a separate paragraph (at a better place), as I don't see how
> this is connected to "introduce a module parameter".

Let me update those sentences.


> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - fixed vt_hardware_enable() to use vmx_hardware_enable()
> > - renamed vmx_tdx_enabled => tdx_enabled
> > - renamed vmx_tdx_on() => tdx_on()
> >
> > v18:
> > - Added comment in vt_hardware_enable() by Binbin.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/Makefile | 1 +
> > arch/x86/kvm/vmx/main.c | 19 ++++++++-
> > arch/x86/kvm/vmx/tdx.c | 84 ++++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/x86_ops.h | 6 +++
> > 4 files changed, 109 insertions(+), 1 deletion(-)
> > create mode 100644 arch/x86/kvm/vmx/tdx.c
> >
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index 274df24b647f..5b85ef84b2e9 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -24,6 +24,7 @@ kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
> >
> > kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
> > kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
> > +kvm-intel-$(CONFIG_INTEL_TDX_HOST) += vmx/tdx.o
> >
> > kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
> > svm/sev.o
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 18cecf12c7c8..18aef6e23aab 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -6,6 +6,22 @@
> > #include "nested.h"
> > #include "pmu.h"
> >
> > +static bool enable_tdx __ro_after_init;
> > +module_param_named(tdx, enable_tdx, bool, 0444);
> > +
> > +static __init int vt_hardware_setup(void)
> > +{
> > + int ret;
> > +
> > + ret = vmx_hardware_setup();
> > + if (ret)
> > + return ret;
> > +
> > + enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > +
> > + return 0;
> > +}
> > +
> > #define VMX_REQUIRED_APICV_INHIBITS \
> > (BIT(APICV_INHIBIT_REASON_DISABLE)| \
> > BIT(APICV_INHIBIT_REASON_ABSENT) | \
> > @@ -22,6 +38,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> >
> > .hardware_unsetup = vmx_hardware_unsetup,
> >
> > + /* TDX cpu enablement is done by tdx_hardware_setup(). */
>
> What's the point of this comment? I don't understand it either.

Will delete the comment.


> > .hardware_enable = vmx_hardware_enable,
> > .hardware_disable = vmx_hardware_disable,
>
> Shouldn't you also implement vt_hardware_enable(), which also does
> tdx_cpu_enable()?
>
> Because I don't see vmx_hardware_enable() is changed to call tdx_cpu_enable() to
> make CPU hotplug work with TDX.

hardware_enable() doesn't help with CPU hotplug support. See below.


> > .has_emulated_msr = vmx_has_emulated_msr,
> > @@ -161,7 +178,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > };
> >
> > struct kvm_x86_init_ops vt_init_ops __initdata = {
> > - .hardware_setup = vmx_hardware_setup,
> > + .hardware_setup = vt_hardware_setup,
> > .handle_intel_pt_intr = NULL,
> >
> > .runtime_ops = &vt_x86_ops,
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > new file mode 100644
> > index 000000000000..43c504fb4fed
> > --- /dev/null
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -0,0 +1,84 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/cpu.h>
> > +
> > +#include <asm/tdx.h>
> > +
> > +#include "capabilities.h"
> > +#include "x86_ops.h"
> > +#include "x86.h"
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +static int __init tdx_module_setup(void)
> > +{
> > + int ret;
> > +
> > + ret = tdx_enable();
> > + if (ret) {
> > + pr_info("Failed to initialize TDX module.\n");
>
> As I commented before, tdx_enable() itself will print similar message when it
> fails, so no need to print again.
>
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
>
> That being said, I don't think tdx_module_setup() is necessary. Just call
> tdx_enable() directly.

Ok, will move this function to the patch that uses it first.


> > +
> > +struct tdx_enabled {
> > + cpumask_var_t enabled;
> > + atomic_t err;
> > +};
>
> struct cpu_tdx_init_ctx {
> cpumask_var_t vmx_enabled_cpumask;
> atomic_t err;
> };
>
> ?
>
> > +
> > +static void __init tdx_on(void *_enable)
>
> tdx_on() -> cpu_tdx_init(), or cpu_tdx_on()?
>
> > +{
> > + struct tdx_enabled *enable = _enable;
> > + int r;
> > +
> > + r = vmx_hardware_enable();
> > + if (!r) {
> > + cpumask_set_cpu(smp_processor_id(), enable->enabled);
> > + r = tdx_cpu_enable();
> > + }
> > + if (r)
> > + atomic_set(&enable->err, r);
> > +}
> > +
> > +static void __init vmx_off(void *_enabled)
>
> cpu_vmx_off() ?

Ok, let's add cpu_ prefix.


> > +{
> > + cpumask_var_t *enabled = (cpumask_var_t *)_enabled;
> > +
> > + if (cpumask_test_cpu(smp_processor_id(), *enabled))
> > + vmx_hardware_disable();
> > +}
> > +
> > +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>
> Why do you need the 'x86_ops' function argument? I don't see it is used?

Will move it to the patch that uses it first.


> > +{
> > + struct tdx_enabled enable = {
> > + .err = ATOMIC_INIT(0),
> > + };
> > + int r = 0;
> > +
> > + if (!enable_ept) {
> > + pr_warn("Cannot enable TDX with EPT disabled\n");
> > + return -EINVAL;
> > + }
> > +
> > + if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
> > + r = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + /* tdx_enable() in tdx_module_setup() requires cpus lock. */
>
> /* tdx_enable() must be called with CPU hotplug disabled */
>
> > + cpus_read_lock();
> > + on_each_cpu(tdx_on, &enable, true); /* TDX requires vmxon. */
>
> I don't think you need this comment _here_.
>
> If you want keep it, move to the tdx_on() where the code does what this comment
> say.

Will move the comment into cpu_tdx_on().


> > + r = atomic_read(&enable.err);
> > + if (!r)
> > + r = tdx_module_setup();
> > + else
> > + r = -EIO;
> > + on_each_cpu(vmx_off, &enable.enabled, true);
> > + cpus_read_unlock();
> > + free_cpumask_var(enable.enabled);
> > +
> > +out:
> > + return r;
> > +}
>
> At last, I think there's one problem here:
>
> KVM actually only registers CPU hotplug callback in kvm_init(), which happens
> way after tdx_hardware_setup().
>
> What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
> kvm_init()?
>
> Looks we have two options:
>
> 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
> 2) we need to disable CPU hotplug until callbacks have been registered.
>
> Perhaps the second one is easier, because for the first one we need to make sure
> the kvm_cpu_online() is ready to be called right after tdx_hardware_setup().
>
> And no one cares if CPU hotplug is disabled during KVM module loading.
>
> That being said, we can even just disable CPU hotplug during the entire
> vt_init(), if in this way the code change is simple?
>
> But anyway, to make this patch complete, I think you need to replace
> vmx_hardware_enable() to vt_hardware_enable() and do tdx_cpu_enable() to handle
> TDX vs CPU hotplug in _this_ patch.

Option 2 sounds easier. But hardware_enable() doesn't help because it's
called when the first guest is created. It's risky to change its semantics
because it's an arch-independent callback.

- Disable CPU hotplug during TDX module initialization.
- During hardware_setup(), enable VMX, call tdx_cpu_enable(), and disable VMX
  on each online CPU. Don't rely on KVM hooks.
- Add a new arch-independent hook, int kvm_arch_online_cpu(). It is always
  called on CPU onlining and eventually calls tdx_cpu_enable(). If that fails,
  refuse onlining.
--
Isaku Yamahata <[email protected]>

2024-03-22 22:46:59

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure

On Fri, Mar 22, 2024 at 10:37:20AM +1300,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Add placeholders TDX VM/vcpu structure that overlays with VMX VM/vcpu
> > structures. Initialize VM structure size and vcpu size/align so that x86
> > KVM common code knows those size irrespective of VMX or TDX. Those
> > structures will be populated as guest creation logic develops.
> >
> > Add helper functions to check if the VM is guest TD and add conversion
> > functions between KVM VM/VCPU and TDX VM/VCPU.
>
> The changelog is essentially only saying "doing what" w/o "why".
>
> Please at least explain why you invented the 'struct kvm_tdx' and 'struct
> vcpu_tdx', and why they are invented in this way.
>
> E.g., can we extend 'struct kvm_vmx' for TDX?
>
> struct kvm_tdx {
> struct kvm_vmx vmx;
> ...
> };

Here is the updated version.

KVM: TDX: Add placeholders for TDX VM/vcpu structure

Add placeholder TDX VM/vCPU structures, overlaying the existing
VMX VM/vCPU structures. Initialize the VM structure size and the vCPU
size/align so that x86 KVM-common code knows those sizes irrespective
of VMX or TDX. Those structures will be populated as guest creation
logic develops.

TDX requires its own data structures for the guest and the vCPU. For
VMX, struct kvm_vmx and struct vcpu_vmx already exist. There are two
options to add the TDX-specific members:

1. Append TDX-specific members to kvm_vmx and vcpu_vmx and use the
   same structs for both VMX and TDX.
2. Define TDX-specific structs that overlay the VMX ones.

Choose option 2 because it has less memory overhead and makes it
clearer which members are needed.

Add helper functions to check whether the VM is a guest TD, and add
conversion functions between KVM VM/vCPU and TDX VM/vCPU.
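A minimal user-space sketch of option 2's overlay layout (the struct and helper names follow the patch; the members of struct kvm/kvm_vmx here are stand-ins, not the real kernel definitions):

```c
#include <stddef.h>

struct kvm { int vm_type; };                    /* arch-common part, stand-in */
struct kvm_vmx { struct kvm kvm; /* ... */ };   /* existing VMX overlay       */

#define KVM_X86_TDX_VM 2

/* TDX overlay: 'kvm' must stay the first member so the cast is valid. */
struct kvm_tdx {
	struct kvm kvm;
	int hkid;	/* TDX-specific members accrete here */
};

static int is_td(struct kvm *kvm)
{
	return kvm->vm_type == KVM_X86_TDX_VM;
}

static struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
{
	return (struct kvm_tdx *)kvm;	/* valid: kvm is at offset 0 */
}
```

Because the common struct is the first member, KVM-common code can pass struct kvm pointers around and the TDX code converts with a cast, at no extra memory cost for non-TDX guests.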


> > Signed-off-by: Isaku Yamahata <[email protected]>
> >
> > ---
> > v19:
> > - correctly update ops.vm_size, vcpu_size and, vcpu_align by Xiaoyao
> >
> > v14 -> v15:
> > - use KVM_X86_TDX_VM
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/main.c | 14 ++++++++++++
> > arch/x86/kvm/vmx/tdx.c | 1 +
> > arch/x86/kvm/vmx/tdx.h | 50 +++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 65 insertions(+)
> > create mode 100644 arch/x86/kvm/vmx/tdx.h
> >
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 18aef6e23aab..e11edbd19e7c 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -5,6 +5,7 @@
> > #include "vmx.h"
> > #include "nested.h"
> > #include "pmu.h"
> > +#include "tdx.h"
> > static bool enable_tdx __ro_after_init;
> > module_param_named(tdx, enable_tdx, bool, 0444);
> > @@ -18,6 +19,9 @@ static __init int vt_hardware_setup(void)
> > return ret;
> > enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > + if (enable_tdx)
> > + vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> > + sizeof(struct kvm_tdx));
>
> Now I see why you included 'struct kvm_x86_ops' as function parameter.
>
> Please move it to this patch.

Sure.

> > return 0;
> > }
> > @@ -215,8 +219,18 @@ static int __init vt_init(void)
> > * Common KVM initialization _must_ come last, after this, /dev/kvm is
> > * exposed to userspace!
> > */
> > + /*
> > + * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
> > + * be set before kvm_x86_vendor_init().
> > + */
> > vcpu_size = sizeof(struct vcpu_vmx);
> > vcpu_align = __alignof__(struct vcpu_vmx);
> > + if (enable_tdx) {
> > + vcpu_size = max_t(unsigned int, vcpu_size,
> > + sizeof(struct vcpu_tdx));
> > + vcpu_align = max_t(unsigned int, vcpu_align,
> > + __alignof__(struct vcpu_tdx));
> > + }
>
> Since you are updating vm_size in vt_hardware_setup(), I am wondering
> whether we can do similar thing for vcpu_size and vcpu_align.
>
> That is, we put them both to 'struct kvm_x86_ops', and you update them in
> vt_hardware_setup().
>
> kvm_init() can then just access them directly in this way both 'vcpu_size'
> and 'vcpu_align' function parameters can be removed.

Hmm, now I noticed that the vm_size update can be moved here as well. We'd have:

	vcpu_size = sizeof(struct vcpu_vmx);
	vcpu_align = __alignof__(struct vcpu_vmx);
	if (enable_tdx) {
		vcpu_size = max_t(unsigned int, vcpu_size,
				  sizeof(struct vcpu_tdx));
		vcpu_align = max_t(unsigned int, vcpu_align,
				   __alignof__(struct vcpu_tdx));
		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
					   sizeof(struct kvm_tdx));
	}


We could add vcpu_size and vcpu_align to struct kvm_x86_ops, but if we do so,
we'd have to touch the SVM code unnecessarily.
--
Isaku Yamahata <[email protected]>

2024-03-22 22:59:29

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Mon, Mar 18, 2024 at 04:16:56PM -0700,
Isaku Yamahata <[email protected]> wrote:

> On Mon, Mar 18, 2024 at 05:43:33PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Mon, 2024-03-18 at 10:12 -0700, Isaku Yamahata wrote:
> > > I categorize as follows. Unless otherwise, I'll update this series.
> > >
> > > - dirty log check
> > >   As we will drop this ptach, we'll have no call site.
> > >
> > > - KVM_BUG_ON() in main.c
> > >   We should drop them because their logic isn't complex.
> > What about "KVM: TDX: Add methods to ignore guest instruction
> > emulation"? Is it cleanly blocked somehow?
>
> KVM fault handler, kvm_mmu_page_fault(), is the caller into the emulation,
> It should skip the emulation.
>
> As the second guard, x86_emulate_instruction(), calls
> check_emulate_instruction() callback to check if the emulation can/should be
> done. TDX callback can return it as X86EMUL_UNHANDLEABLE. Then, the flow goes
> to user space as error. I'll update the vt_check_emulate_instruction().

Oops, that was wrong. It should be X86EMUL_RETRY_INSTR. RETRY_INSTR means
let the vCPU execute the instruction again; UNHANDLEABLE means the emulator
can't emulate, so inject an exception or give up with KVM_EXIT_INTERNAL_ERROR.

For TDX, we'd like to inject #VE into the guest so that the guest #VE handler
can issue TDG.VP.VMCALL<MMIO>. The default non-present SEPT value has the
#VE suppress bit set. As the first step, an EPT violation occurs. Then KVM
sets up the mmio_spte with the #VE suppress bit cleared. X86EMUL_RETRY_INSTR
then tells KVM to resume the vCPU, so re-executing the access delivers #VE.
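A sketch of that flow (the return-code names match KVM's emulator codes, but the values, the stubbed state, and vt_check_emulate_instruction()'s signature here are illustrative only):

```c
#define X86EMUL_RETRY_INSTR	1	/* re-execute the instruction           */
#define X86EMUL_UNHANDLEABLE	2	/* emulation impossible; internal error */

static int ve_suppress_cleared;		/* state of the MMIO SPTE */

/* First fault: KVM clears the #VE suppress bit in the MMIO SPTE. */
static void tdx_handle_mmio_ept_violation(void)
{
	ve_suppress_cleared = 1;
}

/*
 * check_emulate_instruction() callback: for a TD, never emulate.
 * Returning RETRY_INSTR resumes the vCPU; re-executing the access now
 * delivers #VE to the guest, whose handler issues TDG.VP.VMCALL<MMIO>.
 */
static int vt_check_emulate_instruction(int is_td_vcpu)
{
	if (is_td_vcpu)
		return X86EMUL_RETRY_INSTR;
	return 0;	/* normal VMX path continues to the emulator */
}
```

The key design point is that the fault is already "resolved" by the SPTE update, so retrying is correct for a TD even though no emulation happened.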
--
Isaku Yamahata <[email protected]>

2024-03-22 23:06:00

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 120/130] KVM: TDX: Add a method to ignore dirty logging

On Fri, 2024-03-22 at 15:57 -0700, Isaku Yamahata wrote:
> > KVM fault handler, kvm_mmu_page_fault(), is the caller into the
> > emulation,
> > It should skip the emulation.
> >
> > As the second guard, x86_emulate_instruction(), calls
> > check_emulate_instruction() callback to check if the emulation
> > can/should be
> > done.  TDX callback can return it as X86EMUL_UNHANDLEABLE.  Then,
> > the flow goes
> > to user space as error.  I'll update the
> > vt_check_emulate_instruction().
>
> Oops. It was wrong. It should be X86EMUL_RETRY_INSTR.  RETRY_INSTR
> means, let
> vcpu execute the intrusion again, UNHANDLEABLE means, emulator can't
> emulate,
> inject exception or give up with KVM_EXIT_INTERNAL_ERROR.
>
> For TDX, we'd like to inject #VE to the guest so that the guest #VE
> handler
> can issue TDG.VP.VMCALL<MMIO>.  The default non-present sept value
> has
> #VE suppress bit set.  As first step, EPT violation occurs. then KVM
> sets
> up mmio_spte with #VE suppress bit cleared. Then X86EMUL_RETRY_INSTR
> tells
> kvm to resume vcpu to inject #VE.

Ah, so in a normal VM it would:
- get ept violation for no memslot
- setup MMIO PTE for later
- go ahead and head to the emulator anyway instead of waiting to
refault

In TDX, it could skip the last step and head right back to the guest, but
instead it heads to the emulator and gets told to retry there. It's a
good solution in that there is already a way to cleanly insert logic
at check_emulate_instruction(). Otherwise we would need to add another
check somewhere for TDX to consider the fault resolved.

I was initially thinking if it gets anywhere near the emulator things
have gone wrong and we are missing something. The downside is that we
can't find those issues. If something tries to use the emulator it will
just fault in a loop instead of exiting to userspace or bugging the VM.
Hmm.

2024-03-22 23:15:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Fri, Mar 22, 2024 at 10:57:53AM +1300,
"Huang, Kai" <[email protected]> wrote:

>
> > +/*
> > + * TDX SEAMCALL API function leaves
> > + */
> > +#define TDH_VP_ENTER 0
> > +#define TDH_MNG_ADDCX 1
> > +#define TDH_MEM_PAGE_ADD 2
> > +#define TDH_MEM_SEPT_ADD 3
> > +#define TDH_VP_ADDCX 4
> > +#define TDH_MEM_PAGE_RELOCATE 5
>
> I don't think the "RELOCATE" is needed in this patchset?
>
> > +#define TDH_MEM_PAGE_AUG 6
> > +#define TDH_MEM_RANGE_BLOCK 7
> > +#define TDH_MNG_KEY_CONFIG 8
> > +#define TDH_MNG_CREATE 9
> > +#define TDH_VP_CREATE 10
> > +#define TDH_MNG_RD 11
> > +#define TDH_MR_EXTEND 16
> > +#define TDH_MR_FINALIZE 17
> > +#define TDH_VP_FLUSH 18
> > +#define TDH_MNG_VPFLUSHDONE 19
> > +#define TDH_MNG_KEY_FREEID 20
> > +#define TDH_MNG_INIT 21
> > +#define TDH_VP_INIT 22
> > +#define TDH_MEM_SEPT_RD 25
> > +#define TDH_VP_RD 26
> > +#define TDH_MNG_KEY_RECLAIMID 27
> > +#define TDH_PHYMEM_PAGE_RECLAIM 28
> > +#define TDH_MEM_PAGE_REMOVE 29
> > +#define TDH_MEM_SEPT_REMOVE 30
> > +#define TDH_SYS_RD 34
> > +#define TDH_MEM_TRACK 38
> > +#define TDH_MEM_RANGE_UNBLOCK 39
> > +#define TDH_PHYMEM_CACHE_WB 40
> > +#define TDH_PHYMEM_PAGE_WBINVD 41
> > +#define TDH_VP_WR 43
> > +#define TDH_SYS_LP_SHUTDOWN 44
>
> And LP_SHUTDOWN is certainly not needed.
>
> Could you check whether there are others that are not needed?
>
> Perhaps we should just include macros that got used, but anyway.

Ok, let's break this patch up and move the constants into the patches that first use them.


> > +/*
> > + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> > + */
>
> Why is this comment applied to TDX_MAX_VCPUS?
>
> > +#define TDX_MAX_VCPUS (~(u16)0)
>
> And is (~(16)0) an architectural value defined by TDX spec, or just SW value
> that you just put here for convenience?
>
> I mean, is it possible that different version of TDX module have different
> implementation of MAX_CPU, e.g., module 1.0 only supports X but module 1.5
> increases to Y where Y > X?

This is architectural because the field width is 16 bits. Each version of the
TDX module may have its own limit via the metadata field MAX_VCPUS_PER_TD.


> Anyway, looks you can safely move this to the patch to enable CAP_MAX_CPU?

Yes.


> > +
> > +struct td_params {
> > + u64 attributes;
> > + u64 xfam;
> > + u16 max_vcpus;
> > + u8 reserved0[6];
> > +
> > + u64 eptp_controls;
> > + u64 exec_controls;
> > + u16 tsc_frequency;
> > + u8 reserved1[38];
> > +
> > + u64 mrconfigid[6];
> > + u64 mrowner[6];
> > + u64 mrownerconfig[6];
> > + u64 reserved2[4];
> > +
> > + union {
> > + DECLARE_FLEX_ARRAY(struct tdx_cpuid_value, cpuid_values);
> > + u8 reserved3[768];
>
> I am not sure you need the 'reseved3[768]', unless you need to make
> sieof(struct td_params) return 1024?

I'm trying to make it 1024 because the spec defines the struct size as 1024 bytes.
Maybe I can add BUILD_BUG_ON(sizeof(struct td_params) != 1024);
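The 1024-byte layout can be verified at compile time. A user-space sketch mirroring the quoted layout with fixed-width types (in the kernel the check would be BUILD_BUG_ON() rather than static_assert):

```c
#include <assert.h>
#include <stdint.h>

struct tdx_cpuid_value {
	uint32_t eax, ebx, ecx, edx;
};

/* Mirrors the td_params layout quoted above, with fixed-width types. */
struct td_params {
	uint64_t attributes;
	uint64_t xfam;
	uint16_t max_vcpus;
	uint8_t  reserved0[6];

	uint64_t eptp_controls;
	uint64_t exec_controls;
	uint16_t tsc_frequency;
	uint8_t  reserved1[38];

	uint64_t mrconfigid[6];
	uint64_t mrowner[6];
	uint64_t mrownerconfig[6];
	uint64_t reserved2[4];

	union {
		struct tdx_cpuid_value cpuid_values[768 / sizeof(struct tdx_cpuid_value)];
		uint8_t reserved3[768];
	};
};

/* 24 + 56 + 176 + 768 == 1024; the reserved3 padding makes the union 768B. */
static_assert(sizeof(struct td_params) == 1024, "TD_PARAMS must be 1024 bytes");
```

This shows why the reserved3[768] member is kept: it pads the trailing union so the whole struct matches the spec's 1024-byte size regardless of how many CPUID entries are used.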


> > +#define TDX_MD_ELEMENT_SIZE_8BITS 0
> > +#define TDX_MD_ELEMENT_SIZE_16BITS 1
> > +#define TDX_MD_ELEMENT_SIZE_32BITS 2
> > +#define TDX_MD_ELEMENT_SIZE_64BITS 3
> > +
> > +union tdx_md_field_id {
> > + struct {
> > + u64 field : 24;
> > + u64 reserved0 : 8;
> > + u64 element_size_code : 2;
> > + u64 last_element_in_field : 4;
> > + u64 reserved1 : 3;
> > + u64 inc_size : 1;
> > + u64 write_mask_valid : 1;
> > + u64 context : 3;
> > + u64 reserved2 : 1;
> > + u64 class : 6;
> > + u64 reserved3 : 1;
> > + u64 non_arch : 1;
> > + };
> > + u64 raw;
> > +};
>
> Could you clarify why we need such detailed definition? For metadata
> element size you can use simple '&' and '<<' to get the result.

Now that your TDX host patch has the definition in arch/x86/include/asm/tdx.h,
I'll eliminate this one and use your definition.
--
Isaku Yamahata <[email protected]>

2024-03-22 23:17:32

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Fri, Mar 22, 2024 at 03:06:35PM +0800,
Yuan Yao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:25:29AM -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Define architectural definitions for KVM to issue the TDX SEAMCALLs.
> >
> > Structures and values that are architecturally defined in the TDX module
> > specifications the chapter of ABI Reference.
> >
> > Co-developed-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > Reviewed-by: Xiaoyao Li <[email protected]>
> > ---
> > v19:
> > - drop tdvmcall constants by Xiaoyao
> >
> > v18:
> > - Add metadata field id
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/tdx_arch.h | 265 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 265 insertions(+)
> > create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> >
> > diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> > new file mode 100644
> > index 000000000000..e2c1a6f429d7
> > --- /dev/null
> > +++ b/arch/x86/kvm/vmx/tdx_arch.h
> > @@ -0,0 +1,265 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* architectural constants/data definitions for TDX SEAMCALLs */
> > +
> > +#ifndef __KVM_X86_TDX_ARCH_H
> > +#define __KVM_X86_TDX_ARCH_H
> > +
> > +#include <linux/types.h>
> > +
> > +/*
> > + * TDX SEAMCALL API function leaves
> > + */
> > +#define TDH_VP_ENTER 0
> > +#define TDH_MNG_ADDCX 1
> > +#define TDH_MEM_PAGE_ADD 2
> > +#define TDH_MEM_SEPT_ADD 3
> > +#define TDH_VP_ADDCX 4
> > +#define TDH_MEM_PAGE_RELOCATE 5
> > +#define TDH_MEM_PAGE_AUG 6
> > +#define TDH_MEM_RANGE_BLOCK 7
> > +#define TDH_MNG_KEY_CONFIG 8
> > +#define TDH_MNG_CREATE 9
> > +#define TDH_VP_CREATE 10
> > +#define TDH_MNG_RD 11
> > +#define TDH_MR_EXTEND 16
> > +#define TDH_MR_FINALIZE 17
> > +#define TDH_VP_FLUSH 18
> > +#define TDH_MNG_VPFLUSHDONE 19
> > +#define TDH_MNG_KEY_FREEID 20
> > +#define TDH_MNG_INIT 21
> > +#define TDH_VP_INIT 22
> > +#define TDH_MEM_SEPT_RD 25
> > +#define TDH_VP_RD 26
> > +#define TDH_MNG_KEY_RECLAIMID 27
> > +#define TDH_PHYMEM_PAGE_RECLAIM 28
> > +#define TDH_MEM_PAGE_REMOVE 29
> > +#define TDH_MEM_SEPT_REMOVE 30
> > +#define TDH_SYS_RD 34
> > +#define TDH_MEM_TRACK 38
> > +#define TDH_MEM_RANGE_UNBLOCK 39
> > +#define TDH_PHYMEM_CACHE_WB 40
> > +#define TDH_PHYMEM_PAGE_WBINVD 41
> > +#define TDH_VP_WR 43
> > +#define TDH_SYS_LP_SHUTDOWN 44
> > +
> > +/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> > +#define TDX_NON_ARCH BIT_ULL(63)
> > +#define TDX_CLASS_SHIFT 56
> > +#define TDX_FIELD_MASK GENMASK_ULL(31, 0)
> > +
> > +#define __BUILD_TDX_FIELD(non_arch, class, field) \
> > + (((non_arch) ? TDX_NON_ARCH : 0) | \
> > + ((u64)(class) << TDX_CLASS_SHIFT) | \
> > + ((u64)(field) & TDX_FIELD_MASK))
> > +
> > +#define BUILD_TDX_FIELD(class, field) \
> > + __BUILD_TDX_FIELD(false, (class), (field))
> > +
> > +#define BUILD_TDX_FIELD_NON_ARCH(class, field) \
> > + __BUILD_TDX_FIELD(true, (class), (field))
> > +
> > +
> > +/* Class code for TD */
> > +#define TD_CLASS_EXECUTION_CONTROLS 17ULL
> > +
> > +/* Class code for TDVPS */
> > +#define TDVPS_CLASS_VMCS 0ULL
> > +#define TDVPS_CLASS_GUEST_GPR 16ULL
> > +#define TDVPS_CLASS_OTHER_GUEST 17ULL
> > +#define TDVPS_CLASS_MANAGEMENT 32ULL
> > +
> > +enum tdx_tdcs_execution_control {
> > + TD_TDCS_EXEC_TSC_OFFSET = 10,
> > +};
> > +
> > +/* @field is any of enum tdx_tdcs_execution_control */
> > +#define TDCS_EXEC(field) BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
> > +
> > +/* @field is the VMCS field encoding */
> > +#define TDVPS_VMCS(field) BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
> > +
> > +enum tdx_vcpu_guest_other_state {
> > + TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
> > +};
> > +
> > +union tdx_vcpu_state_details {
> > + struct {
> > + u64 vmxip : 1;
> > + u64 reserved : 63;
> > + };
> > + u64 full;
> > +};
> > +
> > +/* @field is any of enum tdx_guest_other_state */
> > +#define TDVPS_STATE(field) BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
> > +#define TDVPS_STATE_NON_ARCH(field) BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
> > +
> > +/* Management class fields */
> > +enum tdx_vcpu_guest_management {
> > + TD_VCPU_PEND_NMI = 11,
> > +};
> > +
> > +/* @field is any of enum tdx_vcpu_guest_management */
> > +#define TDVPS_MANAGEMENT(field) BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
> > +
> > +#define TDX_EXTENDMR_CHUNKSIZE 256
> > +
> > +struct tdx_cpuid_value {
> > + u32 eax;
> > + u32 ebx;
> > + u32 ecx;
> > + u32 edx;
> > +} __packed;
> > +
> > +#define TDX_TD_ATTRIBUTE_DEBUG BIT_ULL(0)
>
> This series doesn't really touch off-TD things, so you can remove this.

Yes. I'll clean up and delete the unused ones, including this one.
--
Isaku Yamahata <[email protected]>

2024-03-22 23:26:52

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Fri, Mar 22, 2024 at 04:33:21AM +0000,
"Huang, Kai" <[email protected]> wrote:

> > >
> > > So how about we have some macros:
> > >
> > > static inline bool is_seamcall_err_kernel_defined(u64 err)
> > > {
> > > return err & TDX_SW_ERROR;
> > > }
> > >
> > > #define TDX_KVM_SEAMCALL(_kvm, _seamcall_func, _fn, _args) \
> > > ({ \
> > > u64 _ret = _seamcall_func(_fn, _args);
> > > KVM_BUG_ON(_kvm, is_seamcall_err_kernel_defined(_ret));
> > > _ret;
> > > })
> >
> > As we can move out KVM_BUG_ON() to the call site, we can simply have
> > seamcall() or seamcall_ret().
> > The call site has to check error. whether it is TDX_SW_ERROR or not.
> > And if it hit the unexpected error, it will mark the guest bugged.
>
> How many call sites are we talking about?
>
> I think handling KVM_BUG_ON() in macro should be able to eliminate bunch of
> individual KVM_BUG_ON()s in these call sites?

16: a custom error check is needed
6: always an error

So I'd like to consistently have the error check in KVM code, not in a macro
or wrapper.
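A sketch of the call-site-checks pattern being argued for (the error-bit values, the stubbed seamcall(), and tdh_do_something() are illustrative stand-ins, not the real TDX definitions):

```c
#include <stdint.h>

/* Illustrative error values; the real ones live in the TDX headers. */
#define TDX_SW_ERROR		(1ULL << 63)
#define TDX_OPERAND_BUSY	0x8000020000000000ULL

static uint64_t fake_seamcall_ret;	/* stubbed SEAMCALL result */
static int vm_bugged;

static uint64_t seamcall(uint64_t fn) { (void)fn; return fake_seamcall_ret; }
static void kvm_vm_bugged(void) { vm_bugged = 1; }

/*
 * No KVM_BUG_ON() hidden in a macro: each call site decides which
 * errors are expected (e.g. BUSY -> retry) and which are fatal enough
 * to mark the guest as bugged.
 */
static int tdh_do_something(void)
{
	uint64_t err = seamcall(/* fn = */ 0);

	if (err == TDX_OPERAND_BUSY)
		return -11;		/* -EAGAIN: caller retries  */
	if (err) {
		kvm_vm_bugged();	/* unexpected: bug the guest */
		return -5;		/* -EIO                      */
	}
	return 0;
}
```

With 16 of 22 call sites needing custom handling like the BUSY case above, folding the check into a generic macro would save little and obscure which errors each site actually expects.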
--
Isaku Yamahata <[email protected]>

2024-03-22 23:37:16

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 035/130] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl

On Fri, Mar 22, 2024 at 11:10:48AM +1300,
"Huang, Kai" <[email protected]> wrote:

> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 45b2c2304491..9ea46d143bef 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -567,6 +567,32 @@ struct kvm_pmu_event_filter {
> > #define KVM_X86_TDX_VM 2
> > #define KVM_X86_SNP_VM 3
> > +/* Trust Domain eXtension sub-ioctl() commands. */
> > +enum kvm_tdx_cmd_id {
> > + KVM_TDX_CAPABILITIES = 0,
> > +
> > + KVM_TDX_CMD_NR_MAX,
> > +};
> > +
> > +struct kvm_tdx_cmd {
> > + /* enum kvm_tdx_cmd_id */
> > + __u32 id;
> > + /* flags for sub-commend. If sub-command doesn't use this, set zero. */
> > + __u32 flags;
> > + /*
> > + * data for each sub-command. An immediate or a pointer to the actual
> > + * data in process virtual address. If sub-command doesn't use it,
> > + * set zero.
> > + */
> > + __u64 data;
> > + /*
> > + * Auxiliary error code. The sub-command may return TDX SEAMCALL
> > + * status code in addition to -Exxx.
> > + * Defined for consistency with struct kvm_sev_cmd.
> > + */
> > + __u64 error;
>
> If the 'error' is for SEAMCALL error, should we rename it to 'hw_error' or
> 'fw_error' or something similar? I think 'error' is too generic.

Ok, will rename it to hw_error.


> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 5edfb99abb89..07a3f0f75f87 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -55,6 +55,32 @@ struct tdx_info {
> > /* Info about the TDX module. */
> > static struct tdx_info *tdx_info;
> > +int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> > +{
> > + struct kvm_tdx_cmd tdx_cmd;
> > + int r;
> > +
> > + if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
> > + return -EFAULT;
>
> Add an empty line.
>
> > + if (tdx_cmd.error)
> > + return -EINVAL;
>
> Add a comment?
>
> /*
> * Userspace should never set @error, which is used to fill
> * hardware-defined error by the kernel.
> */

Sure.


> > +
> > + mutex_lock(&kvm->lock);
> > +
> > + switch (tdx_cmd.id) {
> > + default:
> > + r = -EINVAL;
>
> I am not sure whether you should return -ENOTTY to be consistent with the
> previous vt_mem_enc_ioctl() where a TDX-specific IOCTL is issued for non-TDX
> guest.
>
> Here I think the invalid @id means the sub-command isn't valid.

vt_vcpu_mem_enc_ioctl() checks the non-TDX case and returns -ENOTTY, so here
we already know the guest is a TD.
--
Isaku Yamahata <[email protected]>

2024-03-22 23:44:14

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Fri, Mar 22, 2024 at 01:32:15PM +0800,
Yuan Yao <[email protected]> wrote:

> On Fri, Mar 22, 2024 at 11:46:41AM +0800, Yuan Yao wrote:
> > On Thu, Mar 21, 2024 at 07:17:09AM -0700, Isaku Yamahata wrote:
> > > On Wed, Mar 20, 2024 at 01:12:01PM +0800,
> > > Chao Gao <[email protected]> wrote:
> ...
> > > > >+static int __tdx_td_init(struct kvm *kvm)
> > > > >+{
> > > > >+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > >+ cpumask_var_t packages;
> > > > >+ unsigned long *tdcs_pa = NULL;
> > > > >+ unsigned long tdr_pa = 0;
> > > > >+ unsigned long va;
> > > > >+ int ret, i;
> > > > >+ u64 err;
> > > > >+
> > > > >+ ret = tdx_guest_keyid_alloc();
> > > > >+ if (ret < 0)
> > > > >+ return ret;
> > > > >+ kvm_tdx->hkid = ret;
> > > > >+
> > > > >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > > > >+ if (!va)
> > > > >+ goto free_hkid;
> > > > >+ tdr_pa = __pa(va);
> > > > >+
> > > > >+ tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
> > > > >+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > > > >+ if (!tdcs_pa)
> > > > >+ goto free_tdr;
> > > > >+ for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > > > >+ va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > > > >+ if (!va)
> > > > >+ goto free_tdcs;
> > > > >+ tdcs_pa[i] = __pa(va);
> > > > >+ }
> > > > >+
> > > > >+ if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> > > > >+ ret = -ENOMEM;
> > > > >+ goto free_tdcs;
> > > > >+ }
> > > > >+ cpus_read_lock();
> > > > >+ /*
> > > > >+ * Need at least one CPU of the package to be online in order to
> > > > >+ * program all packages for host key id. Check it.
> > > > >+ */
> > > > >+ for_each_present_cpu(i)
> > > > >+ cpumask_set_cpu(topology_physical_package_id(i), packages);
> > > > >+ for_each_online_cpu(i)
> > > > >+ cpumask_clear_cpu(topology_physical_package_id(i), packages);
> > > > >+ if (!cpumask_empty(packages)) {
> > > > >+ ret = -EIO;
> > > > >+ /*
> > > > >+ * Because it's hard for human operator to figure out the
> > > > >+ * reason, warn it.
> > > > >+ */
> > > > >+#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
> > > > >+ pr_warn_ratelimited(MSG_ALLPKG);
> > > > >+ goto free_packages;
> > > > >+ }
> > > > >+
> > > > >+ /*
> > > > >+ * Acquire global lock to avoid TDX_OPERAND_BUSY:
> > > > >+ * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
> > > > >+ * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
> > > > >+ * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
> > > > >+ * caller to handle the contention. This is because of time limitation
> > > > >+ * usable inside the TDX module and OS/VMM knows better about process
> > > > >+ * scheduling.
> > > > >+ *
> > > > >+ * APIs to acquire the lock of KOT:
> > > > >+ * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
> > > > >+ * TDH.PHYMEM.CACHE.WB.
> > > > >+ */
> > > > >+ mutex_lock(&tdx_lock);
> > > > >+ err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
> > > > >+ mutex_unlock(&tdx_lock);
> > > > >+ if (err == TDX_RND_NO_ENTROPY) {
> > > > >+ ret = -EAGAIN;
> > > > >+ goto free_packages;
> > > > >+ }
> > > > >+ if (WARN_ON_ONCE(err)) {
> > > > >+ pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> > > > >+ ret = -EIO;
> > > > >+ goto free_packages;
> > > > >+ }
> > > > >+ kvm_tdx->tdr_pa = tdr_pa;
> > > > >+
> > > > >+ for_each_online_cpu(i) {
> > > > >+ int pkg = topology_physical_package_id(i);
> > > > >+
> > > > >+ if (cpumask_test_and_set_cpu(pkg, packages))
> > > > >+ continue;
> > > > >+
> > > > >+ /*
> > > > >+ * Program the memory controller in the package with an
> > > > >+ * encryption key associated to a TDX private host key id
> > > > >+ * assigned to this TDR. Concurrent operations on same memory
> > > > >+ * controller results in TDX_OPERAND_BUSY. Avoid this race by
> > > > >+ * mutex.
> > > > >+ */
> > > > >+ mutex_lock(&tdx_mng_key_config_lock[pkg]);
> > > >
> > > > the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
> > > > create TDs, the same set of CPUs (the first online CPU of each package) will be
> > > > selected to configure the key because of the cpumask_test_and_set_cpu() above.
> > > > it means, we never have two CPUs in the same socket trying to program the key,
> > > > i.e., no concurrent calls.
> > >
> > > Makes sense. Will drop the lock.
> >
> > Not get the point, the variable "packages" on stack, and it's
> > possible that "i" is same for 2 threads which are trying to create td.
> > Anything I missed ?
>
> Got the point after synced with chao.
> in case of using for_each_online_cpu() it's safe to remove the mutex_lock(&tdx_mng_key_config_lock[pkg]),
> since every thread will select only 1 cpu for each sockets in same order, and requests submited
> to same cpu by smp_call_on_cpu() are ordered on the target cpu. That means removing the lock works for
> using for_each_online_cpu() but does NOT work for randomly pick up a cpu per socket.
>
> Maybe it's just my issue that doesn't realize what's going on here, but
> I think it still worth to give comment here for why it works/does not work.

It deserves a comment. Will add it.
--
Isaku Yamahata <[email protected]>

2024-03-23 00:29:00

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 036/130] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters

On Fri, Mar 22, 2024 at 11:26:17AM +1300,
"Huang, Kai" <[email protected]> wrote:

> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 9ea46d143bef..e28189c81691 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -604,4 +604,21 @@ struct kvm_tdx_cpuid_config {
> > __u32 edx;
> > };
> > +/* supported_gpaw */
> > +#define TDX_CAP_GPAW_48 (1 << 0)
> > +#define TDX_CAP_GPAW_52 (1 << 1)
> > +
> > +struct kvm_tdx_capabilities {
> > + __u64 attrs_fixed0;
> > + __u64 attrs_fixed1;
> > + __u64 xfam_fixed0;
> > + __u64 xfam_fixed1;
> > + __u32 supported_gpaw;
> > + __u32 padding;
> > + __u64 reserved[251];
> > +
> > + __u32 nr_cpuid_configs;
> > + struct kvm_tdx_cpuid_config cpuid_configs[];
> > +};
> > +
>
> I think you should use __DECLARE_FLEX_ARRAY().
>
> It's already used in existing KVM UAPI header:
>
> struct kvm_nested_state {
> ...
> union {
> __DECLARE_FLEX_ARRAY(struct kvm_vmx_nested_state_data,
> vmx);
> __DECLARE_FLEX_ARRAY(struct kvm_svm_nested_state_data,
> svm);
> } data;
> }

Yes, will use it.


> > + if (copy_to_user(user_caps->cpuid_configs, &tdx_info->cpuid_configs,
> > + tdx_info->num_cpuid_config *
> > + sizeof(tdx_info->cpuid_configs[0]))) {
> > + ret = -EFAULT;
> > + }
>
> I think the '{ }' is needed here.

Unnecessary, you mean? Will remove the braces.


> > +
> > +out:
> > + /* kfree() accepts NULL. */
> > + kfree(caps);
> > + return ret;
> > +}
> > +
> > int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> > {
> > struct kvm_tdx_cmd tdx_cmd;
> > @@ -68,6 +121,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> > mutex_lock(&kvm->lock);
> > switch (tdx_cmd.id) {
> > + case KVM_TDX_CAPABILITIES:
> > + r = tdx_get_capabilities(&tdx_cmd);
> > + break;
> > default:
> > r = -EINVAL;
> > goto out;
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 473013265bd8..22c0b57f69ca 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -3,6 +3,9 @@
> > #define __KVM_X86_TDX_H
> > #ifdef CONFIG_INTEL_TDX_HOST
> > +
> > +#include "tdx_ops.h"
> > +
>
> It appears "tdx_ops.h" is used for making SEAMCALLs.
>
> I don't see this patch uses any SEAMCALL so I am wondering whether this
> chunk is needed here?

Will remove it and move it to an appropriate patch.
--
Isaku Yamahata <[email protected]>

2024-03-23 01:13:53

by Isaku Yamahata

Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

On Fri, Mar 22, 2024 at 12:36:40PM +1300,
"Huang, Kai" <[email protected]> wrote:

> So how about:

Thanks for it. I'll update the commit message with some minor fixes.

> "
> TDX has its own mechanism to control the maximum number of VCPUs that the
> TDX guest can use. When creating a TDX guest, the maximum number of vcpus
> needs to be passed to the TDX module as part of the measurement of the
> guest.
>
> Because the value is part of the measurement, thus part of attestation, it
^'s
> better to allow the userspace to be able to configure it. E.g. the users
the userspace to configure it ^,
> may want to precisely control the maximum number of vcpus their precious VMs
> can use.
>
> The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself,
> where the number of maximum cpus is an input to the TDX module, but KVM
> needs to support the "per-VM number of maximum vcpus" and reflect that in
per-VM maximum number of vcpus
> the KVM_CAP_MAX_VCPUS.
>
> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
maximum number of vcpus
> on VM-basis.
>
> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
>
> The userspace-configured value then can be verified when KVM is actually
used
> creating the TDX guest.
> "


--
Isaku Yamahata <[email protected]>

2024-03-23 01:22:36

by Isaku Yamahata

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Fri, Mar 22, 2024 at 11:20:01AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > +struct kvm_tdx_init_vm {
> > + __u64 attributes;
> > + __u64 mrconfigid[6]; /* sha384 digest */
> > + __u64 mrowner[6]; /* sha384 digest */
> > + __u64 mrownerconfig[6]; /* sha384 digest */
> > + /*
> > + * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
> > + * This should be enough given sizeof(TD_PARAMS) = 1024.
> > + * 8KB was chosen given because
> > + * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
> > + */
> > + __u64 reserved[1004];
>
> This is insane.
>
> You said you want to reserve 8K for CPUID entries, but how can these 1004 * 8
> bytes be used for CPUID entries since ...

I tried to overestimate it. It's too much; how about making it
1024 bytes, i.e. reserved[109]?
--
Isaku Yamahata <[email protected]>

2024-03-23 01:37:01

by Isaku Yamahata

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Fri, Mar 22, 2024 at 02:06:19PM +1300,
"Huang, Kai" <[email protected]> wrote:

> Roughly checking the code, you have implemented many things including
> MNG.KEY.CONFIG stuff. It's worth adding some text here to give the reviewer a
> rough idea of what's going on here.
>
> >
> > Before tearing down private page tables, TDX requires some resources of the
> > guest TD to be destroyed (i.e. HKID must have been reclaimed, etc). Add
> > mmu notifier release callback before tearing down private page tables for
> > it. >
> > Add vm_free() of kvm_x86_ops hook at the end of kvm_arch_destroy_vm()
> > because some per-VM TDX resources, e.g. TDR, need to be freed after other
> > TDX resources, e.g. HKID, were freed.
>
> I think we should split the "adding callbacks' part out, given you have ...
>
> 9 files changed, 520 insertions(+), 8 deletions(-)
>
> ... in this patch.
>
> IMHO, >500 LOC change normally means there are too many things in this
> patch, thus hard to review, and we should split.
>
> I think perhaps we can split this big patch to smaller pieces based on the
> steps, like we did for the init_tdx_module() function in the TDX host
> patchset??
>
> (But I would like to hear from others too.)

Ok, how about these steps:
- TDR allocation/free
- allocate+configure/release HKID
- phymem cache wb (TDH.PHYMEM.CACHE.WB)
- TDCS allocation/free
- clearing pages

520/5 = 104 lines per step. Want more steps?


> > + if (WARN_ON_ONCE(err))
> > + pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> > +}
>
> [snip]
>
> I am stopping here, because I need to take a break.
>
> Again I think we should split this patch, there are just too many things to
> review here.

Thank you so much for the review. Let me try to break this patch up.
--
Isaku Yamahata <[email protected]>

2024-03-23 01:54:34

by Isaku Yamahata

Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Fri, Mar 22, 2024 at 07:10:42AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Thu, 2024-03-21 at 23:12 +0000, Edgecombe, Rick P wrote:
> > On Mon, 2024-02-26 at 00:27 -0800, [email protected] wrote:
> > > Implement a hook of KVM_SET_CPUID2 for additional consistency check.
> > >
> > > Intel TDX or AMD SEV has a restriction on the value of cpuid.  For
> > > example,
> > > some values must be the same between all vcpus.  Check if the new
> > > values
> > > are consistent with the old values.  The check is light because the
> > > cpuid
> > > consistency is very model specific and complicated.  The user space
> > > VMM
> > > should set cpuid and MSRs consistently.
> >
> > I see that this was suggested by Sean, but can you explain the problem
> > that this is working around? From the linked thread, it seems like the
> > problem is what to do when userspace also calls SET_CPUID after already
> > configuring CPUID to the TDX module in the special way. The choices
> > discussed included:
> > 1. Reject the call
> > 2. Check the consistency between the first CPUID configuration and the
> > second one.
> >
> > 1 is a lot simpler, but the reasoning for 2 is because "some KVM code
> > paths rely on guest CPUID configuration" it seems. Is this a
> > hypothetical or real issue? Which code paths are problematic for
> > TDX/SNP?
>
> There might be use case that TDX guest wants to use some CPUID which
> isn't handled by the TDX module but purely by KVM. These (PV) CPUIDs need to be
> provided via KVM_SET_CPUID2.
>
>
> Btw, Isaku, I don't understand why you tag the last two patches as RFC and put
> them at last. I think I've expressed this before. Per the discussion with
> Sean, my understanding is this isn't something optional but the right thing we
> should do?
>
> https://lore.kernel.org/lkml/[email protected]/

Ok, let's remove the RFC tag and reorder these patches. Do you see any issue
with the cpuid check logic itself?
--
Isaku Yamahata <[email protected]>

2024-03-23 04:28:09

by Huang, Kai

Subject: Re: [PATCH v19 035/130] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl

>
> > > +
> > > + mutex_lock(&kvm->lock);
> > > +
> > > + switch (tdx_cmd.id) {
> > > + default:
> > > + r = -EINVAL;
> >
> > I am not sure whether you should return -ENOTTY to be consistent with the
> > previous vt_mem_enc_ioctl() where a TDX-specific IOCTL is issued for non-TDX
> > guest.
> >
> > Here I think the invalid @id means the sub-command isn't valid.
>
> vt_vcpu_mem_enc_ioctl() checks non-TDX case and returns -ENOTTY. We know that
> the guest is TD.

But the command is not supported, right?

I vaguely recall seeing somewhere that in such a case we should return -ENOTTY,
but I cannot find the link now.

But I found this old link uses -ENOTTY:

https://lwn.net/Articles/58719/

So, just fyi.

2024-03-23 23:39:27

by Edgecombe, Rick P

Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index efd3fda1c177..bc0767c884f7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -468,6 +468,7 @@ struct kvm_mmu {
>         int (*sync_spte)(struct kvm_vcpu *vcpu,
>                          struct kvm_mmu_page *sp, int i);
>         struct kvm_mmu_root_info root;
> +       hpa_t private_root_hpa;

Per the conversation about consistent naming between private, shared and mirror: I wonder if this
should be named with mirror instead of private. Like:
hpa_t mirror_root_hpa;

Since the actual private root is not tracked by KVM.

>         union kvm_cpu_role cpu_role;
>         union kvm_mmu_page_role root_role;
>  
> @@ -1740,6 +1741,16 @@ struct kvm_x86_ops {
>         void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>                              int root_level);
>  
> +       int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +                               void *private_spt);
> +       int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +                               void *private_spt);
> +       int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +                                kvm_pfn_t pfn);
> +       int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +                                   kvm_pfn_t pfn);
> +       int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> +
>         bool (*has_wbinvd_exit)(void);
>  
>         u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 30c86e858ae4..0e0321ad9ca2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3717,7 +3717,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>                 goto out_unlock;
>  
>         if (tdp_mmu_enabled) {
> -               root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> +               if (kvm_gfn_shared_mask(vcpu->kvm) &&
> +                   !VALID_PAGE(mmu->private_root_hpa)) {
> +                       root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> +                       mmu->private_root_hpa = root;
> +               }
> +               root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
>                 mmu->root.hpa = root;

This has changed now, due to rebase on,
https://lore.kernel.org/lkml/[email protected]/

...to this:
- if (tdp_mmu_enabled)
- return kvm_tdp_mmu_alloc_root(vcpu);
+ if (tdp_mmu_enabled) {
+ if (kvm_gfn_shared_mask(vcpu->kvm) &&
+ !VALID_PAGE(mmu->private_root_hpa)) {
+ r = kvm_tdp_mmu_alloc_root(vcpu, true);
+ if (r)
+ return r;
+ }
+ return kvm_tdp_mmu_alloc_root(vcpu, false);
+ }

I don't see why the !VALID_PAGE(mmu->private_root_hpa) check is needed.
kvm_tdp_mmu_get_vcpu_root_hpa() already has logic to prevent allocating multiple roots with the same
role.

Also, kvm_tdp_mmu_alloc_root() never returns non-zero, even though mmu_alloc_direct_roots() does.
Probably today, with one caller, it makes mmu_alloc_direct_roots() cleaner to just return the
always-zero value from kvm_tdp_mmu_alloc_root(). Now that there are two calls, I think we should
refactor kvm_tdp_mmu_alloc_root() to return void, and have mmu_alloc_direct_roots() return 0
manually in this case.

Or maybe instead change it back to returning an hpa_t and then kvm_tdp_mmu_alloc_root() can lose the
"if (private)" logic at the end too.


>         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
>                 root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
>                 for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
>                         int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> -                       gfn_t base = gfn_round_for_level(fault->gfn,
> +                       gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
>                                                          fault->max_level);
>  
>                         if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>         };
>  
>         WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> +       fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
>         fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>  
>         r = mmu_topup_memory_caches(vcpu, false);
> @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>  
>         mmu->root.hpa = INVALID_PAGE;
>         mmu->root.pgd = 0;
> +       mmu->private_root_hpa = INVALID_PAGE;
>         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>                 mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>  
> @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>  {
>         kvm_mmu_unload(vcpu);
> +       if (tdp_mmu_enabled) {
> +               write_lock(&vcpu->kvm->mmu_lock);
> +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> +                               NULL);
> +               write_unlock(&vcpu->kvm->mmu_lock);

What is the reason for the special treatment of private_root_hpa here? The rest of the roots are
freed in kvm_mmu_unload(). I think it is because we don't want the mirror to get freed during
kvm_mmu_reset_context()?

Oof. For the sake of trying to justify the code, I'm trying to keep track of the pros and cons of
treating the mirror/private root like a normal one with just a different role bit.

The whole “list of roots” thing seems to date from shadow paging, where it is critical to
keep multiple cached shared roots for different CPU modes of the same shadowed page tables. Today
with non-nested TDP, AFAICT, the only different root is for SMM. I guess since the machinery for
managing multiple roots in a list already exists it makes sense to use it for both.

For TDX there are also only two, but the difference is, things need to be done in special ways for
the two roots. You end up with a bunch of loops (for_each_*tdp_mmu_root(), etc) that essentially
process a list of two different roots, but with inner logic tortured to work for the peculiarities
of both private and shared. An easier to read alternative could be to open code both cases.

I guess the major benefit is to keep one set of logic for shadow paging, normal TDP and TDX, but it
makes the logic a bit difficult to follow for TDX compared to looking at it from the normal guest
perspective. So I wonder if making special versions of the TDX root traversing operations might make
the code a little easier to follow. I’m not advocating for it at this point, just still working on
an opinion. Is there any history around this design point?

> +       }
>         free_mmu_pages(&vcpu->arch.root_mmu);
>         free_mmu_pages(&vcpu->arch.guest_mmu);
>         mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 002f3f80bf3b..9e2c7c6d85bf 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
>  #include <linux/kvm_host.h>
>  #include <asm/kvm_host.h>
>  
> +#include "mmu.h"
> +
>  #ifdef CONFIG_KVM_PROVE_MMU
>  #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>  #else
> @@ -205,6 +207,15 @@ static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
>                 free_page((unsigned long)sp->private_spt);
>  }
>  
> +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +                                    gfn_t gfn)
> +{
> +       if (is_private_sp(root))
> +               return kvm_gfn_to_private(kvm, gfn);
> +       else
> +               return kvm_gfn_to_shared(kvm, gfn);
> +}
> +

It could be branchless, but might not be worth the readability cost:

static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t gfn)
{
gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);

gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
return gfn_for_root;
}

2024-03-25 10:20:35

by Huang, Kai

Subject: Re: [PATCH v19 036/130] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters



On 23/03/2024 1:28 pm, Yamahata, Isaku wrote:
>>> + if (copy_to_user(user_caps->cpuid_configs, &tdx_info->cpuid_configs,
>>> + tdx_info->num_cpuid_config *
>>> + sizeof(tdx_info->cpuid_configs[0]))) {
>>> + ret = -EFAULT;
>>> + }
>> I think the '{ }' is needed here.
> Unnecessary? Will remove braces.

Right. Sorry I didn't finish the 'isn't'.

2024-03-25 13:45:46

by Huang, Kai

Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific


>> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
>> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
> maximum number of vcpus
>> on VM-basis.
>>
>> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
>>
>> The userspace-configured value then can be verified when KVM is actually
> used
>> creating the TDX guest.
>> "

I think we still have two options regarding how 'max_vcpus' is
handled in the ioctl() that does TDH.MNG.INIT:

1) Just use the 'max_vcpus' done in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS),
2) Still pass the 'max_vcpus' as input, but KVM verifies it against the
value that is saved in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS).

2) seems unnecessary, so I have no objection to using 1). But it seems
we could still mention it in the changelog of that patch?

2024-03-25 14:32:25

by Huang, Kai

Subject: Re: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure

>
> Here is the updated version.
>
> KVM: TDX: Add placeholders for TDX VM/vcpu structure
>
> Add placeholders TDX VM/vCPU structure, overlaying with the existing

^ structures

"TDX VM/vCPU structure" -> "TDX VM/vCPU structures".

And I don't quite understand what "overlaying" means here.

> VMX VM/vCPU structures. Initialize VM structure size and vCPU
> size/align so that x86 KVM-common code knows those sizes irrespective
> of VMX or TDX. Those structures will be populated as guest creation
> logic develops.
>
> TDX requires its data structure for guest and vcpu. For VMX, we

I don't think TDX "requires" anything here. Introducing separate structures is a
software implementation choice, not a requirement of TDX.

> already have struct kvm_vmx and struct vcpu_vmx. Two options to add
> TDX-specific members.
>
> 1. Append TDX-specific members to kvm_vmx and vcpu_vmx. Use the same
> struct for both VMX and TDX.
> 2. Define TDX-specific data struct and overlay.
>
> Choose option two because it has less memory overhead and what member
> is needed is clearer
>
> Add helper functions to check if the VM is guest TD and add the conversion
> functions between KVM VM/vCPU and TDX VM/vCPU.

FYI:

Add TDX's own VM and vCPU structures as placeholder to manage and run TDX
guests.

TDX protects guest VMs from malicious host. Unlike VMX guests, TDX guests are
crypto-protected. KVM cannot access TDX guests' memory and vCPU states
directly. Instead, TDX requires KVM to use a set of architecture-defined
firmware APIs (a.k.a TDX module SEAMCALLs) to manage and run TDX guests.

In fact, the ways to manage and run TDX guests and normal VMX guests are quite
different. Because of that, the current structures ('struct kvm_vmx' and
'struct vcpu_vmx') to manage VMX guests are not quite suitable for TDX guests.
E.g., the majority of the members of 'struct vcpu_vmx' don't apply to TDX
guests.

Introduce TDX's own VM and vCPU structures ('struct kvm_tdx' and 'struct
vcpu_tdx' respectively) for KVM to manage and run TDX guests. And instead of
building TDX's VM and vCPU structures based on VMX's, build them directly based
on 'struct kvm'.

As a result, TDX and VMX will have different VM size and vCPU size/alignment.
Adjust the 'vt_x86_ops.vm_size' and the 'vcpu_size' and 'vcpu_align' to the
maximum value of TDX guest and VMX guest during module initialization time so
that KVM can always allocate enough memory for both TDX guests and VMX guests.

[...]

> >
> > > @@ -215,8 +219,18 @@ static int __init vt_init(void)
> > > * Common KVM initialization _must_ come last, after this, /dev/kvm is
> > > * exposed to userspace!
> > > */
> > > + /*
> > > + * kvm_x86_ops is updated with vt_x86_ops. vt_x86_ops.vm_size must
> > > + * be set before kvm_x86_vendor_init().
> > > + */
> > > vcpu_size = sizeof(struct vcpu_vmx);
> > > vcpu_align = __alignof__(struct vcpu_vmx);
> > > + if (enable_tdx) {
> > > + vcpu_size = max_t(unsigned int, vcpu_size,
> > > + sizeof(struct vcpu_tdx));
> > > + vcpu_align = max_t(unsigned int, vcpu_align,
> > > + __alignof__(struct vcpu_tdx));
> > > + }
> >
> > Since you are updating vm_size in vt_hardware_setup(), I am wondering
> > whether we can do similar thing for vcpu_size and vcpu_align.
> >
> > That is, we put them both to 'struct kvm_x86_ops', and you update them in
> > vt_hardware_setup().
> >
> > kvm_init() can then just access them directly in this way both 'vcpu_size'
> > and 'vcpu_align' function parameters can be removed.
>
> Hmm, now I noticed the vm_size can be moved here. We have
>
> vcpu_size = sizeof(struct vcpu_vmx);
> vcpu_align = __alignof__(struct vcpu_vmx);
> if (enable_tdx) {
> vcpu_size = max_t(unsigned int, vcpu_size,
> sizeof(struct vcpu_tdx));
> vcpu_align = max_t(unsigned int, vcpu_align,
> __alignof__(struct vcpu_tdx));
> vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> sizeof(struct kvm_tdx));
> }
>
>
> We can add vcpu_size, vcpu_align to struct kvm_x86_ops. If we do so, we have
> to touch svm code unnecessarily.

Not only SVM, but also other architectures, because you are going to remove two
function parameters from kvm_init().

That reminds me that other ARCHs may not have a 'kvm_x86_ops'-like thing, so to
make things simple I am fine with your above approach.

2024-03-25 14:39:18

by Huang, Kai

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Fri, 2024-03-22 at 18:22 -0700, Yamahata, Isaku wrote:
> On Fri, Mar 22, 2024 at 11:20:01AM +0000,
> "Huang, Kai" <[email protected]> wrote:
>
> > On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > > +struct kvm_tdx_init_vm {
> > > + __u64 attributes;
> > > + __u64 mrconfigid[6]; /* sha384 digest */
> > > + __u64 mrowner[6]; /* sha384 digest */
> > > + __u64 mrownerconfig[6]; /* sha384 digest */
> > > + /*
> > > + * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
> > > + * This should be enough given sizeof(TD_PARAMS) = 1024.
> > > + * 8KB was chosen given because
> > > + * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
> > > + */
> > > + __u64 reserved[1004];
> >
> > This is insane.
> >
> > You said you want to reserve 8K for CPUID entries, but how can these 1004 * 8
> > bytes be used for CPUID entries since ...
>
> I tried to overestimate it. It's too much, how about to make it
> 1024, reserved[109]?
>

I am not sure why we need 1024B either.

IIUC, the inputs here in 'kvm_tdx_init_vm' should be a subset of the members in
TD_PARAMS. This IOCTL() isn't intended to carry any additional input besides
those defined in TD_PARAMS, right?

If so, then it seems to me you "at most" only need to reserve the space for the
members excluding the CPUID entries, because for the CPUID entries we will
always pass them as a flexible array at the end of the structure.

Based on the spec, the "non-CPUID-entry" part only occupies 256 bytes. To me it
seems we have no reason to reserve more space than 256 bytes.

2024-03-25 14:49:21

by Huang, Kai

Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Fri, 2024-03-22 at 16:06 +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-03-22 at 07:10 +0000, Huang, Kai wrote:
> > > I see that this was suggested by Sean, but can you explain the
> > > problem
> > > that this is working around? From the linked thread, it seems like
> > > the
> > > problem is what to do when userspace also calls SET_CPUID after
> > > already
> > > configuring CPUID to the TDX module in the special way. The choices
> > > discussed included:
> > > 1. Reject the call
> > > 2. Check the consistency between the first CPUID configuration and
> > > the
> > > second one.
> > >
> > > 1 is a lot simpler, but the reasoning for 2 is because "some KVM
> > > code
> > > paths rely on guest CPUID configuration" it seems. Is this a
> > > hypothetical or real issue? Which code paths are problematic for
> > > TDX/SNP?
> >
> > There might be use case that TDX guest wants to use some CPUID which
> > isn't handled by the TDX module but purely by KVM.  These (PV) CPUIDs
> > need to be
> > provided via KVM_SET_CPUID2.
>
> Right, but are there any needed today? 
>

I am not sure. Isaku may know better?

> I read that Sean's point was
> that KVM_SET_CPUID2 can't accept anything today what we would want to
> block later, otherwise it would introduce a regression. This was the
> major constraint IIUC, and means the base series requires *something*
> here.
>
> If we want to support only the most basic support first, we don't need
> to support PV CPUIDs on day 1, right?
>
> So I'm wondering, if we could shrink the base series by going with
> option 1 to start, and then expanding it with this solution later to
> enable more features. Do you see a problem or conflict with Sean's
> comments?
>
>

To confirm: you mean to simply make KVM_SET_CPUID2 return an error for TDX
guests?

It is acceptable to me, and I don't see any conflict with Sean's comments.

But I don't know Sean's preference. As he said, I think the consistency
checking is quite straightforward:

"
It's not complicated at all. Walk through the leafs defined during
TDH.MNG.INIT, reject KVM_SET_CPUID if a leaf isn't present or doesn't match
exactly.
"

So to me it's not a big deal.

Either way, we need a patch to handle SET_CPUID2:

1) if we go option 1) -- that is reject SET_CPUID2 completely -- we need to make
vcpu's CPUID point to KVM's saved CPUID during TDH.MNG.INIT.

2) if we do the consistency check, we do a for loop and reject when an
inconsistency is found.

I'll leave to you to judge :-)

2024-03-25 17:05:03

by Edgecombe, Rick P

Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Mon, 2024-03-25 at 11:14 +0000, Huang, Kai wrote:
> To confirm, I mean you want to simply make KVM_SET_CPUID2 return error for TDX
> guest?
>
> It is acceptable to me, and I don't see any conflict with Sean's comments.
>
> But I don't know Sean's perference.  As he said, I think  the consistency
> checking is quite straight-forward:
>
> "
> It's not complicated at all.  Walk through the leafs defined during
> TDH.MNG.INIT, reject KVM_SET_CPUID if a leaf isn't present or doesn't match
> exactly.
> "
>
Yea, I'm just thinking if we could take two patches down to one small one it might be a way to
essentially break off this work to another series without affecting the ability to boot a TD. It
*seems* to be the way things are going.

> So to me it's not a big deal.
>
> Either way, we need a patch to handle SET_CPUID2:
>
> 1) if we go option 1) -- that is reject SET_CPUID2 completely -- we need to make
> vcpu's CPUID point to KVM's saved CPUID during TDH.MNG.INIT.

Ah, I missed this part. Can you elaborate? Dropping these two patches doesn't prevent a TD from
booting. If we then reject SET_CPUID, will this break things unless we make other changes? And are
those changes not small?



2024-03-25 17:24:15

by Binbin Wu

Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific



On 3/23/2024 9:13 AM, Isaku Yamahata wrote:
> On Fri, Mar 22, 2024 at 12:36:40PM +1300,
> "Huang, Kai" <[email protected]> wrote:
>
>> So how about:
> Thanks for it. I'll update the commit message with some minor fixes.
>
>> "
>> TDX has its own mechanism to control the maximum number of VCPUs that the
>> TDX guest can use. When creating a TDX guest, the maximum number of vcpus
>> needs to be passed to the TDX module as part of the measurement of the
>> guest.
>>
>> Because the value is part of the measurement, thus part of attestation, it
> ^'s
>> better to allow the userspace to be able to configure it. E.g. the users
> the userspace to configure it ^,
>> may want to precisely control the maximum number of vcpus their precious VMs
>> can use.
>>
>> The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself,
>> where the number of maximum cpus is an input to the TDX module, but KVM
>> needs to support the "per-VM number of maximum vcpus" and reflect that in
> per-VM maximum number of vcpus
>> the KVM_CAP_MAX_VCPUS.
>>
>> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
>> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
> maximum number of vcpus
>> on VM-basis.
>>
>> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
>>
>> The userspace-configured value then can be verified when KVM is actually
> used

Here, regarding "verified", I think Kai wanted to emphasize that the value of
max_vcpus passed in via KVM_TDX_INIT_VM should be checked against the value
configured via KVM_CAP_MAX_VCPUS?

Maybe "verified and used"?

>> creating the TDX guest.
>> "
>


2024-03-25 18:32:27

by Binbin Wu

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> As the first step to create TDX guest, create/destroy VM struct. Assign
> TDX private Host Key ID (HKID) to the TDX guest for memory encryption and
> allocate extra pages for the TDX guest. On destruction, free allocated
> pages, and HKID.
>
> Before tearing down private page tables, TDX requires some resources of the
> guest TD to be destroyed (i.e. HKID must have been reclaimed, etc). Add
> mmu notifier release callback

It seems inaccurate to say "Add mmu notifier release callback", since the
interface is already there. This patch extends the cache flush function,
i.e., kvm_flush_shadow_all(), to do the TDX-specific work.

> before tearing down private page tables for
> it.
>
> Add vm_free() of kvm_x86_ops hook at the end of kvm_arch_destroy_vm()
> because some per-VM TDX resources, e.g. TDR, need to be freed after other
> TDX resources, e.g. HKID, were freed.
>
> Co-developed-by: Kai Huang <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v19:
> - fix check error code of TDH.PHYMEM.PAGE.RECLAIM. RCX and TDR.
>
> v18:
> - Use TDH.SYS.RD() instead of struct tdsysinfo_struct.
> - Rename tdx_reclaim_td_page() to tdx_reclaim_control_page()
> - return -EAGAIN on TDX_RND_NO_ENTROPY of TDH.MNG.CREATE(), TDH.MNG.ADDCX()
> - fix comment to remove extra the.
> - use true instead of 1 for boolean.
> - remove an extra white line.
>
> v16:
> - Simplified tdx_reclaim_page()
> - Reorganize the locking of tdx_release_hkid(), and use smp_call_mask()
> instead of smp_call_on_cpu() to hold spinlock to race with invalidation
> on releasing guest memfd
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 +
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/Kconfig | 3 +-
> arch/x86/kvm/mmu/mmu.c | 7 +
> arch/x86/kvm/vmx/main.c | 26 +-
> arch/x86/kvm/vmx/tdx.c | 475 ++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/tdx.h | 6 +-
> arch/x86/kvm/vmx/x86_ops.h | 6 +
> arch/x86/kvm/x86.c | 1 +
> 9 files changed, 520 insertions(+), 8 deletions(-)
>
[...]
> +
> +static void tdx_clear_page(unsigned long page_pa)
> +{
> + const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> + void *page = __va(page_pa);
> + unsigned long i;
> +
> + /*
> + * When re-assign one page from old keyid to a new keyid, MOVDIR64B is
> + * required to clear/write the page with new keyid to prevent integrity
> + * error when read on the page with new keyid.
> + *
> + * clflush doesn't flush cache with HKID set. The cache line could be
> + * poisoned (even without MKTME-i), clear the poison bit.
> + */
> + for (i = 0; i < PAGE_SIZE; i += 64)
> + movdir64b(page + i, zero_page);
> + /*
> + * MOVDIR64B store uses WC buffer. Prevent following memory reads
> + * from seeing potentially poisoned cache.
> + */
> + __mb();

Is __wmb() sufficient for this case?

> +}
> +
[...]

> +
> +static int tdx_do_tdh_mng_key_config(void *param)
> +{
> + hpa_t *tdr_p = param;
> + u64 err;
> +
> + do {
> + err = tdh_mng_key_config(*tdr_p);
> +
> + /*
> + * If it failed to generate a random key, retry it because this
> + * is typically caused by an entropy error of the CPU's random

Here you say "typically", is there other cause and is it safe to loop on
retry?

> + * number generator.
> + */
> + } while (err == TDX_KEY_GENERATION_FAILED);
> +
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
[...]


2024-03-25 20:01:42

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Sat, Mar 23, 2024 at 11:39:07PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index efd3fda1c177..bc0767c884f7 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -468,6 +468,7 @@ struct kvm_mmu {
> >         int (*sync_spte)(struct kvm_vcpu *vcpu,
> >                          struct kvm_mmu_page *sp, int i);
> >         struct kvm_mmu_root_info root;
> > +       hpa_t private_root_hpa;
>
> Per the conversation about consistent naming between private, shared and mirror: I wonder if this
> should be named these with mirror instead of private. Like:
> hpa_t mirror_root_hpa;
>
> Since the actual private root is not tracked by KVM.

It's only mirrored, without a directly associated Secure-EPT page. The
association is implicit, as part of the TDCS pages.


> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 30c86e858ae4..0e0321ad9ca2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3717,7 +3717,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >                 goto out_unlock;
> >  
> >         if (tdp_mmu_enabled) {
> > -               root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > +               if (kvm_gfn_shared_mask(vcpu->kvm) &&
> > +                   !VALID_PAGE(mmu->private_root_hpa)) {
> > +                       root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> > +                       mmu->private_root_hpa = root;
> > +               }
> > +               root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> >                 mmu->root.hpa = root;
>
> This has changed now, due to rebase on,
> https://lore.kernel.org/lkml/[email protected]/
>
> ...to this:
> - if (tdp_mmu_enabled)
> - return kvm_tdp_mmu_alloc_root(vcpu);
> + if (tdp_mmu_enabled) {
> + if (kvm_gfn_shared_mask(vcpu->kvm) &&
> + !VALID_PAGE(mmu->private_root_hpa)) {
> + r = kvm_tdp_mmu_alloc_root(vcpu, true);
> + if (r)
> + return r;
> + }
> + return kvm_tdp_mmu_alloc_root(vcpu, false);
> + }
>
> I don't see why the !VALID_PAGE(mmu->private_root_hpa) check is needed.
> kvm_tdp_mmu_get_vcpu_root_hpa() already has logic to prevent allocating multiple roots with the same
> role.

Historically we needed it. We don't need it now. We can drop it.


> Also, kvm_tdp_mmu_alloc_root() never returns non-zero, even though mmu_alloc_direct_roots() does.
> Probably today when there is one caller it makes mmu_alloc_direct_roots() cleaner to just have it
> return the always zero value from kvm_tdp_mmu_alloc_root(). Now that there are two calls, I think we
> should refactor kvm_tdp_mmu_alloc_root() to return void, and have kvm_tdp_mmu_alloc_root() return 0
> manually in this case.
>
> Or maybe instead change it back to returning an hpa_t and then kvm_tdp_mmu_alloc_root() can lose the
> "if (private)" logic at the end too.

Probably we can make kvm_tdp_mmu_alloc_root() return void instead of always
returning zero, as a cleanup.


> >         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> >                 root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >         if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> >                 for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> >                         int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > -                       gfn_t base = gfn_round_for_level(fault->gfn,
> > +                       gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> >                                                          fault->max_level);
> >  
> >                         if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> >         };
> >  
> >         WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> > +       fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> >         fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> >  
> >         r = mmu_topup_memory_caches(vcpu, false);
> > @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> >  
> >         mmu->root.hpa = INVALID_PAGE;
> >         mmu->root.pgd = 0;
> > +       mmu->private_root_hpa = INVALID_PAGE;
> >         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> >                 mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> >  
> > @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
> >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> >  {
> >         kvm_mmu_unload(vcpu);
> > +       if (tdp_mmu_enabled) {
> > +               write_lock(&vcpu->kvm->mmu_lock);
> > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > +                               NULL);
> > +               write_unlock(&vcpu->kvm->mmu_lock);
>
> What is the reason for the special treatment of private_root_hpa here? The rest of the roots are
> freed in kvm_mmu_unload(). I think it is because we don't want the mirror to get freed during
> kvm_mmu_reset_context()?

It reflects that we don't free Secure-EPT pages during runtime, and free them
when destroying the guest.


>
> Oof. For the sake of trying to justify the code, I'm trying to keep track of the pros and cons of
> treating the mirror/private root like a normal one with just a different role bit.
>
> The whole “list of roots” thing seems to date from the shadow paging, where there is is critical to
> keep multiple cached shared roots of different CPU modes of the same shadowed page tables. Today
> with non-nested TDP, AFAICT, the only different root is for SMM. I guess since the machinery for
> managing multiple roots in a list already exists it makes sense to use it for both.
>
> For TDX there are also only two, but the difference is, things need to be done in special ways for
> the two roots. You end up with a bunch of loops (for_each_*tdp_mmu_root(), etc) that essentially
> process a list of two different roots, but with inner logic tortured to work for the peculiarities
> of both private and shared. An easier to read alternative could be to open code both cases.
>
> I guess the major benefit is to keep one set of logic for shadow paging, normal TDP and TDX, but it
> makes the logic a bit difficult to follow for TDX compared to looking at it from the normal guest
> perspective. So I wonder if making special versions of the TDX root traversing operations might make
> the code a little easier to follow. I’m not advocating for it at this point, just still working on
> an opinion. Is there any history around this design point?

The original desire was to keep the modification contained, and not to
introduce separate functions for population and zap. With the open coding, do
you want something like the following? We can try it and compare the outcome.

For zapping
  if (private) {
     __for_each_tdp_mmu_root_yield_safe_private()
       private case
  } else {
     __for_each_tdp_mmu_root_yield_safe()
       shared case
  }

For fault,
kvm_tdp_mmu_map()
  if (private) {
    tdp_mmu_for_each_pte_private(iter, mmu, raw_gfn, raw_gfn + 1)
      private case
  } else {
    tdp_mmu_for_each_pte(iter, mmu, raw_gfn, raw_gfn + 1)
      shared case
  }

--
Isaku Yamahata <[email protected]>

2024-03-25 21:11:39

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Mon, Mar 25, 2024 at 11:14:21AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Fri, 2024-03-22 at 16:06 +0000, Edgecombe, Rick P wrote:
> > On Fri, 2024-03-22 at 07:10 +0000, Huang, Kai wrote:
> > > > I see that this was suggested by Sean, but can you explain the
> > > > problem
> > > > that this is working around? From the linked thread, it seems like
> > > > the
> > > > problem is what to do when userspace also calls SET_CPUID after
> > > > already
> > > > configuring CPUID to the TDX module in the special way. The choices
> > > > discussed included:
> > > > 1. Reject the call
> > > > 2. Check the consistency between the first CPUID configuration and
> > > > the
> > > > second one.
> > > >
> > > > 1 is a lot simpler, but the reasoning for 2 is because "some KVM
> > > > code
> > > > paths rely on guest CPUID configuration" it seems. Is this a
> > > > hypothetical or real issue? Which code paths are problematic for
> > > > TDX/SNP?
> > >
> > > There might be use case that TDX guest wants to use some CPUID which
> > > isn't handled by the TDX module but purely by KVM.  These (PV) CPUIDs
> > > need to be
> > > provided via KVM_SET_CPUID2.
> >
> > Right, but are there any needed today? 
> >
>
> I am not sure. Isaku may know better?

It's not needed to boot a TD. The check is a safeguard. The multiple sources of
cpuids can be inconsistent.
--
Isaku Yamahata <[email protected]>

2024-03-25 21:17:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Mon, Mar 25, 2024 at 03:32:59PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-03-25 at 11:14 +0000, Huang, Kai wrote:
> > To confirm, I mean you want to simply make KVM_SET_CPUID2 return error for TDX
> > guest?
> >
> > It is acceptable to me, and I don't see any conflict with Sean's comments.
> >
> > But I don't know Sean's perference.  As he said, I think  the consistency
> > checking is quite straight-forward:
> >
> > "
> > It's not complicated at all.  Walk through the leafs defined during
> > TDH.MNG.INIT, reject KVM_SET_CPUID if a leaf isn't present or doesn't match
> > exactly.
> > "
> >
> Yea, I'm just thinking if we could take two patches down to one small one it might be a way to
> essentially break off this work to another series without affecting the ability to boot a TD. It
> *seems* to be the way things are going.
>
> > So to me it's not a big deal.
> >
> > Either way, we need a patch to handle SET_CPUID2:
> >
> > 1) if we go option 1) -- that is reject SET_CPUID2 completely -- we need to make
> > vcpu's CPUID point to KVM's saved CPUID during TDH.MNG.INIT.
>
> Ah, I missed this part. Can you elaborate? By dropping these two patches it doesn't prevent a TD
> boot. If we then reject SET_CPUID, this will break things unless we make other changes? And they are
> not small?

If we go for this, the extended topology enumeration (cpuid[0xb or 0x1f]) would
need special handling because it's per-vcpu, not TD-wide.
--
Isaku Yamahata <[email protected]>

2024-03-25 21:30:04

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

On Mon, Mar 25, 2024 at 09:43:31PM +1300,
"Huang, Kai" <[email protected]> wrote:

>
> > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
> > > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
> > maximum number of vcpus
> > > on VM-basis.
> > >
> > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
> > >
> > > The userspace-configured value then can be verified when KVM is actually
> > used
> > > creating the TDX guest.
> > > "
>
> I think we still have two options regarding to how 'max_vcpus' is handled in
> ioctl() to do TDH.MNG.INIT:
>
> 1) Just use the 'max_vcpus' done in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS),
> 2) Still pass the 'max_vcpus' as input, but KVM verifies it against the
> value that is saved in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS).
>
> 2) seems unnecessary, so I don't have objection to use 1). But it seems we
> could still mention it in the changelog in that patch?

Sure, let me update the commit log.
--
Isaku Yamahata <[email protected]>

2024-03-25 21:37:25

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, 2024-03-25 at 12:05 -0700, Isaku Yamahata wrote:
> Right, the guest has to accept it on VE.  If the unmap was intentional by guest,
> that's fine.  The unmap is unintentional (with vMTRR), the guest doesn't expect
> VE with the GPA.
>
>
> > But, I guess we should punt to userspace is the guest tries to use
> > MTRRs, not that userspace can handle it happening in a TD...  But it
> > seems cleaner and safer then skipping zapping some pages inside the
> > zapping code.
> >
> > I'm still not sure if I understand the intention and constraints fully.
> > So please correct. This (the skipping the zapping for some operations)
> > is a theoretical correctness issue right? It doesn't resolve a TD
> > crash?
>
> For lapic, it's safe guard. Because TDX KVM disables APICv with
> APICV_INHIBIT_REASON_TDX, apicv won't call kvm_zap_gfn_range().
Ah, I see it:
https://lore.kernel.org/lkml/38e2f8a77e89301534d82325946eb74db3e47815.1708933498.git.isaku.yamahata@intel.com/

Then it seems a warning would be more appropriate if we are worried there might be a way to still
call it. If we are confident it can't, then we can just ignore this case.

>
> For MTRR, the purpose is to make the guest boot (without the guest kernel
> command line like clearcpuid=mtrr) .
> If we can assume the guest won't touch MTRR registers somehow, KVM can return an
> error to TDG.VP.VMCALL<RDMSR, WRMSR>(MTRR registers). So it doesn't call
> kvm_zap_gfn_range(). Or we can use KVM_EXIT_X86_{RDMSR, WRMSR} as you suggested.

My understanding is that Sean prefers to exit to userspace when KVM can't handle something, versus
making up behavior that keeps known guests alive. So I would think we should change this patch to
only be about not using the zapping roots optimization. Then a separate patch should exit to
userspace on attempt to use MTRRs. And we ignore the APIC one.

This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.

2024-03-25 21:49:03

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Mon, Mar 25, 2024 at 05:58:47PM +0800,
Binbin Wu <[email protected]> wrote:

> > +static void tdx_clear_page(unsigned long page_pa)
> > +{
> > + const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > + void *page = __va(page_pa);
> > + unsigned long i;
> > +
> > + /*
> > + * When re-assign one page from old keyid to a new keyid, MOVDIR64B is
> > + * required to clear/write the page with new keyid to prevent integrity
> > + * error when read on the page with new keyid.
> > + *
> > + * clflush doesn't flush cache with HKID set. The cache line could be
> > + * poisoned (even without MKTME-i), clear the poison bit.
> > + */
> > + for (i = 0; i < PAGE_SIZE; i += 64)
> > + movdir64b(page + i, zero_page);
> > + /*
> > + * MOVDIR64B store uses WC buffer. Prevent following memory reads
> > + * from seeing potentially poisoned cache.
> > + */
> > + __mb();
>
> Is __wmb() sufficient for this case?

I don't think so, because sfence orders only against other stores. Here we care about later loads.

> > +
> > +static int tdx_do_tdh_mng_key_config(void *param)
> > +{
> > + hpa_t *tdr_p = param;
> > + u64 err;
> > +
> > + do {
> > + err = tdh_mng_key_config(*tdr_p);
> > +
> > + /*
> > + * If it failed to generate a random key, retry it because this
> > + * is typically caused by an entropy error of the CPU's random
>
> Here you say "typically", is there other cause and is it safe to loop on
> retry?


Not as far as I know. The TDX module returns KEY_GENERATION_FAILED only when
rdrand (or equivalent) failed. But I don't know about the future.

Let's delete "typically" because it seems confusing.
--
Isaku Yamahata <[email protected]>

2024-03-25 22:05:12

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Mon, Mar 25, 2024 at 10:39:10AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Fri, 2024-03-22 at 18:22 -0700, Yamahata, Isaku wrote:
> > On Fri, Mar 22, 2024 at 11:20:01AM +0000,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > > > +struct kvm_tdx_init_vm {
> > > > + __u64 attributes;
> > > > + __u64 mrconfigid[6]; /* sha384 digest */
> > > > + __u64 mrowner[6]; /* sha384 digest */
> > > > + __u64 mrownerconfig[6]; /* sha384 digest */
> > > > + /*
> > > > + * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
> > > > + * This should be enough given sizeof(TD_PARAMS) = 1024.
> > > > + * 8KB was chosen given because
> > > > + * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
> > > > + */
> > > > + __u64 reserved[1004];
> > >
> > > This is insane.
> > >
> > > You said you want to reserve 8K for CPUID entries, but how can these 1004 * 8
> > > bytes be used for CPUID entries since ...
> >
> > I tried to overestimate it. It's too much, how about to make it
> > 1024, reserved[109]?
> >
>
> I am not sure why we need 1024B either.
>
> IIUC, the inputs here in 'kvm_tdx_init_vm' should be a subset of the members in
> TD_PARAMS. This IOCTL() isn't intended to carry any additional input besides
> these defined in TD_PARAMS, right?
>
> If so, then it seems to me you "at most" only need to reserve the space for the
> members excluding the CPUID entries, because for the CPUID entries we will
> always pass them as a flexible array at the end of the structure.
>
> Based on the spec, the "non-CPUID-entry" part only occupies 256 bytes. To me it
> seems we have no reason to reserve more space than 256 bytes.

Ok, I'll make it 256 bytes.

The alternative is to use key-value pairs. The userspace loops to set all
necessary parameters. Something like the following:

KVM_TDX_SET_VM_PARAM

struct kvm_tdx_vm_param {
/* TDCS metadata field. */
__u64 field_id;
/*
* value for attributes or data less than or equal to __u64 in size.
* pointer for sha384, cpuid, or data larger than __u64.
*/
__u64 value_or_ptr;
};
--
Isaku Yamahata <[email protected]>

2024-03-25 22:17:13

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

On Mon, Mar 25, 2024 at 04:42:36PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 3/23/2024 9:13 AM, Isaku Yamahata wrote:
> > On Fri, Mar 22, 2024 at 12:36:40PM +1300,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > So how about:
> > Thanks for it. I'll update the commit message with some minor fixes.
> >
> > > "
> > > TDX has its own mechanism to control the maximum number of VCPUs that the
> > > TDX guest can use. When creating a TDX guest, the maximum number of vcpus
> > > needs to be passed to the TDX module as part of the measurement of the
> > > guest.
> > >
> > > Because the value is part of the measurement, thus part of attestation, it
> > ^'s
> > > better to allow the userspace to be able to configure it. E.g. the users
> > the userspace to configure it ^,
> > > may want to precisely control the maximum number of vcpus their precious VMs
> > > can use.
> > >
> > > The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself,
> > > where the number of maximum cpus is an input to the TDX module, but KVM
> > > needs to support the "per-VM number of maximum vcpus" and reflect that in
> > per-VM maximum number of vcpus
> > > the KVM_CAP_MAX_VCPUS.
> > >
> > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
> > > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
> > maximum number of vcpus
> > > on VM-basis.
> > >
> > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
> > >
> > > The userspace-configured value then can be verified when KVM is actually
> > used
>
> Here, "verified", I think Kai wanted to emphasize that the value of
> max_vcpus passed in via
> KVM_TDX_INIT_VM should be checked against the value configured via
> KVM_CAP_MAX_VCPUS?
>
> Maybe "verified and used" ?

Ok. I don't have a strong opinion here.
--
Isaku Yamahata <[email protected]>

2024-03-25 22:19:11

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, Mar 25, 2024 at 07:55:04PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-03-25 at 12:05 -0700, Isaku Yamahata wrote:
> > Right, the guest has to accept it on VE.  If the unmap was intentional by guest,
> > that's fine.  The unmap is unintentional (with vMTRR), the guest doesn't expect
> > VE with the GPA.
> >
> >
> > > But, I guess we should punt to userspace is the guest tries to use
> > > MTRRs, not that userspace can handle it happening in a TD...  But it
> > > seems cleaner and safer then skipping zapping some pages inside the
> > > zapping code.
> > >
> > > I'm still not sure if I understand the intention and constraints fully.
> > > So please correct. This (the skipping the zapping for some operations)
> > > is a theoretical correctness issue right? It doesn't resolve a TD
> > > crash?
> >
> > For lapic, it's safe guard. Because TDX KVM disables APICv with
> > APICV_INHIBIT_REASON_TDX, apicv won't call kvm_zap_gfn_range().
> Ah, I see it:
> https://lore.kernel.org/lkml/38e2f8a77e89301534d82325946eb74db3e47815.1708933498.git.isaku.yamahata@intel.com/
>
> Then it seems a warning would be more appropriate if we are worried there might be a way to still
> call it. If we are confident it can't, then we can just ignore this case.
>
> >
> > For MTRR, the purpose is to make the guest boot (without the guest kernel
> > command line like clearcpuid=mtrr) .
> > If we can assume the guest won't touch MTRR registers somehow, KVM can return an
> > error to TDG.VP.VMCALL<RDMSR, WRMSR>(MTRR registers). So it doesn't call
> > kvm_zap_gfn_range(). Or we can use KVM_EXIT_X86_{RDMSR, WRMSR} as you suggested.
>
> My understanding is that Sean prefers to exit to userspace when KVM can't handle something, versus
> making up behavior that keeps known guests alive. So I would think we should change this patch to
> only be about not using the zapping roots optimization. Then a separate patch should exit to
> userspace on attempt to use MTRRs. And we ignore the APIC one.
>
> This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.

When we hit KVM_MSR_FILTER, the current implementation ignores it and makes it
an error to the guest. Surely we should make it KVM_EXIT_X86_{RDMSR, WRMSR}
instead. That aligns with the existing implementation (default VM and
SW-protected) and is more flexible.
--
Isaku Yamahata <[email protected]>

2024-03-25 22:33:39

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Mon, 2024-03-25 at 13:01 -0700, Isaku Yamahata wrote:
>  Also, kvm_tdp_mmu_alloc_root() never returns non-zero, even though mmu_alloc_direct_roots() does.
> > Probably today when there is one caller it makes mmu_alloc_direct_roots() cleaner to just have
> > it
> > return the always zero value from kvm_tdp_mmu_alloc_root(). Now that there are two calls, I
> > think we
> > should refactor kvm_tdp_mmu_alloc_root() to return void, and have kvm_tdp_mmu_alloc_root()
> > return 0
> > manually in this case.
> >
> > Or maybe instead change it back to returning an hpa_t and then kvm_tdp_mmu_alloc_root() can lose
> > the
> > "if (private)" logic at the end too.
>
> Probably we can make void kvm_tdp_mmu_alloc_root() instead of returning always
> zero as clean up.

Why is it better than returning an hpa_t, once we are calling it twice for mirror and shared roots?

>
>
> > >         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > >                 root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > > @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault
> > > *fault)
> > >         if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> > >                 for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> > >                         int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > > -                       gfn_t base = gfn_round_for_level(fault->gfn,
> > > +                       gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> > >                                                          fault->max_level);
> > >  
> > >                         if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > > @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64
> > > error_code,
> > >         };
> > >  
> > >         WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> > > +       fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > >         fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > >  
> > >         r = mmu_topup_memory_caches(vcpu, false);
> > > @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > >  
> > >         mmu->root.hpa = INVALID_PAGE;
> > >         mmu->root.pgd = 0;
> > > +       mmu->private_root_hpa = INVALID_PAGE;
> > >         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > >                 mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> > >  
> > > @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
> > >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > >  {
> > >         kvm_mmu_unload(vcpu);
> > > +       if (tdp_mmu_enabled) {
> > > +               write_lock(&vcpu->kvm->mmu_lock);
> > > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > > +                               NULL);
> > > +               write_unlock(&vcpu->kvm->mmu_lock);
> >
> > What is the reason for the special treatment of private_root_hpa here? The rest of the roots are
> > freed in kvm_mmu_unload(). I think it is because we don't want the mirror to get freed during
> > kvm_mmu_reset_context()?
>
> It reflects that we don't free Secure-EPT pages during runtime, and free them
> when destroying the guest.

Right. It would be great if we could do something like warn on freeing role.private = 1 sp's during
runtime. It could cover several cases that were worried about in other patches.

While looking at how we could do this, I noticed that kvm_arch_vcpu_create() calls kvm_mmu_destroy()
in an error path. So this could end up zapping/freeing a private root. It should be bad userspace
behavior too I guess. But the number of edge cases makes me think the case of zapping private sp
while a guest is running is something that deserves a VM_BUG_ON().

>
>
> >
> > Oof. For the sake of trying to justify the code, I'm trying to keep track of the pros and cons
> > of
> > treating the mirror/private root like a normal one with just a different role bit.
> >
> > The whole “list of roots” thing seems to date from the shadow paging, where there is is critical
> > to
> > keep multiple cached shared roots of different CPU modes of the same shadowed page tables. Today
> > with non-nested TDP, AFAICT, the only different root is for SMM. I guess since the machinery for
> > managing multiple roots in a list already exists it makes sense to use it for both.
> >
> > For TDX there are also only two, but the difference is, things need to be done in special ways
> > for
> > the two roots. You end up with a bunch of loops (for_each_*tdp_mmu_root(), etc) that essentially
> > process a list of two different roots, but with inner logic tortured to work for the
> > peculiarities
> > of both private and shared. An easier to read alternative could be to open code both cases.
> >
> > I guess the major benefit is to keep one set of logic for shadow paging, normal TDP and TDX, but
> > it
> > makes the logic a bit difficult to follow for TDX compared to looking at it from the normal
> > guest
> > perspective. So I wonder if making special versions of the TDX root traversing operations might
> > make
> > the code a little easier to follow. I’m not advocating for it at this point, just still working
> > on
> > an opinion. Is there any history around this design point?
>
> The original desire to keep the modification contained, and not introduce a
> function for population and zap.  With the open coding, do you want something
> like the followings?  We can try it and compare the outcome.
>
> For zapping
>   if (private) {
>      __for_each_tdp_mmu_root_yield_safe_private()
>        private case
>   } else {
>      __for_each_tdp_mmu_root_yield_safe()
>         shared case
>   }
>
> For fault,
> kvm_tdp_mmu_map()
>   if (private) {
>     tdp_mmu_for_each_pte_private(iter, mmu, raw_gfn, raw_gfn + 1)
>       private case
>   } else {
>     tdp_mmu_for_each_pte(iter, mmu, raw_gfn, raw_gfn + 1)
>       shared case
>   }

I was wondering about something limited to the operations that iterate over the roots. So not
keeping private_root_hpa in the list of roots where it has to be carefully protected from getting
zapped or get its gfn adjusted, and instead open coding the private case in the higher level zapping
operations. For normal VM's the private case would be a NOP.

Since kvm_tdp_mmu_map() already grabs private_root_hpa manually, it wouldn't change in this idea. I
don't know how much better it would be though. I think you are right we would have to create them
and compare.

2024-03-25 22:34:16

by Huang, Kai

[permalink] [raw]
Subject: RE: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

> On Mon, 2024-03-25 at 11:14 +0000, Huang, Kai wrote:
> > To confirm, I mean you want to simply make KVM_SET_CPUID2 return error
> > for TDX guest?
> >
> > It is acceptable to me, and I don't see any conflict with Sean's comments.
> >
> > But I don't know Sean's preference.  As he said, I think the
> > consistency checking is quite straightforward:
> >
> > "
> > It's not complicated at all.  Walk through the leafs defined during
> > TDH.MNG.INIT, reject KVM_SET_CPUID if a leaf isn't present or doesn't
> > match exactly.
> > "
> >
> Yeah, I'm just thinking that if we could reduce the two patches to one small
> one, it might be a way to break this work off into another series without
> affecting the ability to boot a TD. It
> *seems* to be the way things are going.
>
> > So to me it's not a big deal.
> >
> > Either way, we need a patch to handle SET_CPUID2:
> >
> > 1) if we go option 1) -- that is reject SET_CPUID2 completely -- we
> > need to make vcpu's CPUID point to KVM's saved CPUID during
> TDH.MNG.INIT.
>
> Ah, I missed this part. Can you elaborate? Dropping these two patches doesn't
> prevent a TD from booting. If we then reject SET_CPUID, this will break things
> unless we make other changes? And those are not small?
>

(sorry, replying from Outlook due to an issue with my Linux box environment)

It booted because QEMU does the sane thing, i.e., it always passes the correct CPUIDs in KVM_SET_CPUID2.

Per Sean's comments, KVM should guarantee consistency between the CPUIDs done in TDH.MNG.INIT and KVM_SET_CPUID2; otherwise, if QEMU passes inconsistent CPUIDs, KVM can easily fail to work with the TD.

To guarantee the consistency, KVM has two options, as we discussed:

1) Reject KVM_SET_CPUID2 completely.
2) Still allow KVM_SET_CPUID2 but manually check the CPUID consistency between the one done in TDH.MNG.INIT and the one passed in KVM_SET_CPUID2.

1) can obviously guarantee consistency. But KVM maintains CPUIDs in 'vcpu', so to make the existing KVM code continue to work, we need to manually set 'vcpu->cpuid' to the one that was done in TDH.MNG.INIT.

2) requires checking the consistency and rejecting KVM_SET_CPUID2 if an inconsistency is found. But other than that, KVM doesn't need to do anything more, because if we allow KVM_SET_CPUID2, the 'vcpu' will have its own CPUIDs populated anyway.

2024-03-25 22:38:10

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

On Mon, 2024-03-25 at 22:31 +0000, Huang, Kai wrote:
> (sorry replying from outlook due to some issue to my linux box environment)
>
> It booted because QEMU does the sane thing, i.e., it always passes the correct CPUIDs in
> KVM_SET_CPUID2.
>
> Per Sean's comments, KVM should guarantee consistency between the CPUIDs done in TDH.MNG.INIT and
> KVM_SET_CPUID2; otherwise, if QEMU passes inconsistent CPUIDs, KVM can easily fail to work with the
> TD.
>
> To guarantee the consistency, KVM has two options, as we discussed:
>
> 1) Reject KVM_SET_CPUID2 completely.
> 2) Still allow KVM_SET_CPUID2 but manually check the CPUID consistency between the one done in
> TDH.MNG.INIT and the one passed in KVM_SET_CPUID2.
>
> 1) can obviously guarantee consistency.  But KVM maintains CPUIDs in 'vcpu', so to make the
> existing KVM code continue to work, we need to manually set 'vcpu->cpuid' to the one that was done
> in TDH.MNG.INIT.
>
> 2) requires checking the consistency and rejecting KVM_SET_CPUID2 if an inconsistency is found.  But
> other than that, KVM doesn't need to do anything more, because if we allow KVM_SET_CPUID2, the
> 'vcpu' will have its own CPUIDs populated anyway.

Ah, thanks for explaining. So option 1 is not that simple; it is maybe a slightly smaller, separate
solution. Now I see why the discussion was to just do the consistency checking up front.

2024-03-25 22:48:35

by Huang, Kai

[permalink] [raw]
Subject: RE: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

> > Here, "verified", I think Kai wanted to emphasize that the value of
> > max_vcpus passed in via KVM_TDX_INIT_VM should be checked against the
> > value configured via KVM_CAP_MAX_VCPUS?
> >
> > Maybe "verified and used" ?
>
> Ok. I don't have strong opinion here.

It depends on how you implement that patch.

If we don't pass 'max_vcpus' in that patch, there's nothing to verify really.

2024-03-25 23:03:31

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Fri, Mar 22, 2024 at 12:40:12AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Thu, 2024-03-21 at 15:59 -0700, Isaku Yamahata wrote:
> > >
> > > Ok, I see now how this works. MTRRs and APIC zapping happen to use
> > > the
> > > same function: kvm_zap_gfn_range(). So restricting that function
> > > from
> > > zapping private pages has the desired affect. I think it's not
> > > ideal
> > > that kvm_zap_gfn_range() silently skips zapping some ranges. I
> > > wonder
> > > if we could pass something in, so it's more clear to the caller.
> > >
> > > But can these code paths even get reaches in TDX? It sounded like
> > > MTRRs
> > > basically weren't supported.
> >
> > We can make the code paths so with the (new) assumption that guest
> > MTRR can
> > be disabled cleanly.
>
> So the situation is (please correct):
> KVM has a no "making up architectural behavior" rule, which is an
> important one. But TDX module doesn't support MTRRs. So TD guests can't
> have architectural behavior for MTRRs. So this patch is trying as best
> as possible to match what MTRR behavior it can (not crash the guest if
> someone tries).
>
> First of all, if the guest unmaps the private memory, doesn't it have
> to accept it again when gets re-added? So will the guest not crash
> anyway?

Right, the guest has to accept it on #VE. If the unmap was intentional by the guest,
that's fine. If the unmap is unintentional (with vMTRR), the guest doesn't expect
a #VE for the GPA.


> But, I guess we should punt to userspace if the guest tries to use
> MTRRs, not that userspace can handle it happening in a TD... But it
> seems cleaner and safer than skipping the zapping of some pages inside the
> zapping code.
>
> I'm still not sure if I understand the intention and constraints fully.
> So please correct. This (the skipping the zapping for some operations)
> is a theoretical correctness issue right? It doesn't resolve a TD
> crash?

For the LAPIC, it's a safeguard. Because TDX KVM disables APICv with
APICV_INHIBIT_REASON_TDX, APICv won't call kvm_zap_gfn_range().

For MTRR, the purpose is to make the guest boot (without a guest kernel
command line like clearcpuid=mtrr).
If we can assume the guest won't touch the MTRR registers somehow, KVM can return an
error to TDG.VP.VMCALL<RDMSR, WRMSR> for the MTRR registers, so it doesn't call
kvm_zap_gfn_range(). Or we can use KVM_EXIT_X86_{RDMSR, WRMSR} as you suggested.
--
isaku Yamahata <[email protected]>

2024-03-25 23:11:13

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, Mar 25, 2024 at 03:18:36PM -0700,
Isaku Yamahata <[email protected]> wrote:

> On Mon, Mar 25, 2024 at 07:55:04PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Mon, 2024-03-25 at 12:05 -0700, Isaku Yamahata wrote:
> > > Right, the guest has to accept it on #VE.  If the unmap was intentional by the guest,
> > > that's fine.  If the unmap is unintentional (with vMTRR), the guest doesn't expect
> > > a #VE for the GPA.
> > >
> > >
> > > > But, I guess we should punt to userspace if the guest tries to use
> > > > MTRRs, not that userspace can handle it happening in a TD...  But it
> > > > seems cleaner and safer than skipping the zapping of some pages inside the
> > > > zapping code.
> > > >
> > > > I'm still not sure if I understand the intention and constraints fully.
> > > > So please correct. This (the skipping the zapping for some operations)
> > > > is a theoretical correctness issue right? It doesn't resolve a TD
> > > > crash?
> > >
> > > For the LAPIC, it's a safeguard. Because TDX KVM disables APICv with
> > > APICV_INHIBIT_REASON_TDX, APICv won't call kvm_zap_gfn_range().
> > Ah, I see it:
> > https://lore.kernel.org/lkml/38e2f8a77e89301534d82325946eb74db3e47815.1708933498.git.isaku.yamahata@intel.com/
> >
> > Then it seems a warning would be more appropriate if we are worried there might be a way to still
> > call it. If we are confident it can't, then we can just ignore this case.
> >
> > >
> > > For MTRR, the purpose is to make the guest boot (without a guest kernel
> > > command line like clearcpuid=mtrr).
> > > If we can assume the guest won't touch the MTRR registers somehow, KVM can return an
> > > error to TDG.VP.VMCALL<RDMSR, WRMSR> for the MTRR registers, so it doesn't call
> > > kvm_zap_gfn_range(). Or we can use KVM_EXIT_X86_{RDMSR, WRMSR} as you suggested.
> >
> > My understanding is that Sean prefers to exit to userspace when KVM can't handle something, versus
> > making up behavior that keeps known guests alive. So I would think we should change this patch to
> > only be about not using the zapping roots optimization. Then a separate patch should exit to
> > userspace on attempt to use MTRRs. And we ignore the APIC one.
> >
> > This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.
>
> When we hit KVM_MSR_FILTER, the current implementation ignores it and makes it
> an error to the guest. Surely we should make it KVM_EXIT_X86_{RDMSR, WRMSR} instead.
> That aligns with the existing implementation (default VM and SW-protected) and is
> more flexible.

Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
Compile only tested at this point.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f891de30a2dd..4d9ae5743e24 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1388,31 +1388,67 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
return 1;
}

+static int tdx_complete_rdmsr(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->run->msr.error)
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ else {
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ tdvmcall_set_return_val(vcpu, vcpu->run->msr.data);
+ }
+ return 1;
+}
+
static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
{
u32 index = tdvmcall_a0_read(vcpu);
u64 data;

- if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
- kvm_get_msr(vcpu, index, &data)) {
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ)) {
+ trace_kvm_msr_read_ex(index);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ return kvm_msr_user_space(vcpu, index, KVM_EXIT_X86_RDMSR, 0,
+ tdx_complete_rdmsr,
+ KVM_MSR_RET_FILTERED);
+ }
+
+ if (kvm_get_msr(vcpu, index, &data)) {
trace_kvm_msr_read_ex(index);
tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
return 1;
}
- trace_kvm_msr_read(index, data);

+ trace_kvm_msr_read(index, data);
tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
tdvmcall_set_return_val(vcpu, data);
return 1;
}

+static int tdx_complete_wrmsr(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->run->msr.error)
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ else
+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
+ return 1;
+}
+
static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
{
u32 index = tdvmcall_a0_read(vcpu);
u64 data = tdvmcall_a1_read(vcpu);

- if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
- kvm_set_msr(vcpu, index, data)) {
+ if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE)) {
+ trace_kvm_msr_write_ex(index, data);
+ tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
+ if (kvm_msr_user_space(vcpu, index, KVM_EXIT_X86_WRMSR, data,
+ tdx_complete_wrmsr,
+ KVM_MSR_RET_FILTERED))
+ return 1;
+ return 0;
+ }
+
+ if (kvm_set_msr(vcpu, index, data)) {
trace_kvm_msr_write_ex(index, data);
tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
return 1;


--
Isaku Yamahata <[email protected]>

2024-03-25 23:21:38

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, 2024-03-25 at 16:10 -0700, Isaku Yamahata wrote:
> > > My understanding is that Sean prefers to exit to userspace when KVM can't handle something,
> > > versus
> > > making up behavior that keeps known guests alive. So I would think we should change this patch
> > > to
> > > only be about not using the zapping roots optimization. Then a separate patch should exit to
> > > userspace on attempt to use MTRRs. And we ignore the APIC one.
> > >
> > > This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.
> >
> > When we hit KVM_MSR_FILTER, the current implementation ignores it and makes it
> > an error to the guest.  Surely we should make it KVM_EXIT_X86_{RDMSR, WRMSR} instead.
> > That aligns with the existing implementation (default VM and SW-protected) and is
> > more flexible.
>
> Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
> Compile only tested at this point.

Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?

2024-03-25 23:35:41

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, Mar 25, 2024 at 11:21:17PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-03-25 at 16:10 -0700, Isaku Yamahata wrote:
> > > > My understanding is that Sean prefers to exit to userspace when KVM can't handle something,
> > > > versus
> > > > making up behavior that keeps known guests alive. So I would think we should change this patch
> > > > to
> > > > only be about not using the zapping roots optimization. Then a separate patch should exit to
> > > > userspace on attempt to use MTRRs. And we ignore the APIC one.
> > > >
> > > > This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.
> > >
> > > When we hit KVM_MSR_FILTER, the current implementation ignores it and makes it
> > > an error to the guest.  Surely we should make it KVM_EXIT_X86_{RDMSR, WRMSR} instead.
> > > That aligns with the existing implementation (default VM and SW-protected) and is
> > > more flexible.
> >
> > Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
> > Compile only tested at this point.
>
> Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?

No for TDX at the moment. We need to add such logic.
--
Isaku Yamahata <[email protected]>

2024-03-25 23:42:29

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 018/130] KVM: x86/mmu: Assume guest MMIOs are shared

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> From: Chao Gao <[email protected]>
>
> TODO: Drop this patch once the common patch is merged.

What is this TODO talking about?

>
> When memory slot isn't found for kvm page fault, handle it as MMIO.
>
> The guest of TDX_VM, SNP_VM, or SW_PROTECTED_VM doesn't necessarily convert
> the virtual MMIO range to shared before accessing it.  When the guest tries
> to access the virtual device's MMIO without any private/shared conversion,
> an NPT fault or EPT violation is raised first to find the private-shared
> mismatch.  Don't raise KVM_EXIT_MEMORY_FAULT; fall back to KVM_PFN_NOSLOT.

If this is general KVM_X86_SW_PROTECTED_VM behavior, can we pull it out of the TDX series?

>
> Signed-off-by: Chao Gao <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---

2024-03-26 02:08:16

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

.. continue the previous review ...

> +
> +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> +{
> + WARN_ON_ONCE(!td_page_pa);

From the name 'td_page_pa' we cannot tell whether it is a control page,
but this function is only intended for control page AFAICT, so perhaps a
more specific name.

> +
> + /*
> + * TDCX are being reclaimed. TDX module maps TDCX with HKID

"are" -> "is".

Are you sure it is TDCX, but not TDCS?

AFAICT TDCX is the control structure for 'vcpu', but here you are
handling the control structure for the VM.

> + * assigned to the TD. Here the cache associated to the TD
> + * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> + * cache doesn't need to be flushed again.
> + */

How about put this part as the comment for this function?

/*
* Reclaim <name of control page> page(s) which are crypto-protected
* by TDX guest's private KeyID. Assume the cache associated with the
* TDX private KeyID has been flushed.
*/
> + if (tdx_reclaim_page(td_page_pa))
> + /*
> + * Leak the page on failure:
> + * tdx_reclaim_page() returns an error if and only if there's an
> + * unexpected, fatal error, e.g. a SEAMCALL with bad params,
> + * incorrect concurrency in KVM, a TDX Module bug, etc.
> + * Retrying at a later point is highly unlikely to be
> + * successful.
> + * No log here as tdx_reclaim_page() already did.

IMHO can be simplified to below, and nothing else matters.

/*
* Leak the page if the kernel failed to reclaim the page.
* The kernel cannot use it safely anymore.
*/

And you can put this comment above the 'if (tdx_reclaim_page())' statement.

> + */
> + return;

Empty line.

> + free_page((unsigned long)__va(td_page_pa));
> +}
> +
> +static void tdx_do_tdh_phymem_cache_wb(void *unused)

Better to make the name explicit that it is a smp_func, and you don't
need the "tdx_" prefix for all the 'static' functions here:

static void smp_func_do_phymem_cache_wb(void *unused)

> +{
> + u64 err = 0;
> +
> + do {
> + err = tdh_phymem_cache_wb(!!err);

bool resume = !!err;

err = tdh_phymem_cache_wb(resume);

So that we don't need to jump to tdh_phymem_cache_wb() to see what
!!err means.

> + } while (err == TDX_INTERRUPTED_RESUMABLE);

Add a comment before the do {} while():

/*
* TDH.PHYMEM.CACHE.WB flushes caches associated with _ANY_
* TDX private KeyID on the package (or logical cpu?) where
* it is called on. The TDX module may not finish the cache
* flush but return TDX_INTERRUPTED_RESUMABLE instead. The
* kernel should retry it until it returns success w/o
* rescheduling.
*/
> +
> + /* Other thread may have done for us. */
> + if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> + err = TDX_SUCCESS;

Empty line.

> + if (WARN_ON_ONCE(err))
> + pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> +}
> +
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> + bool packages_allocated, targets_allocated;
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + cpumask_var_t packages, targets;
> + u64 err;
> + int i;
> +
> + if (!is_hkid_assigned(kvm_tdx))
> + return;
> +
> + if (!is_td_created(kvm_tdx)) {
> + tdx_hkid_free(kvm_tdx);
> + return;
> + }

I lost track of what "td_created()" means.

I guess it means: the KeyID has been allocated to the TDX guest, but not yet
programmed/configured.

Perhaps add a comment to remind the reviewer?

> +
> + packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> + targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> + cpus_read_lock();
> +
> + /*
> + * We can destroy multiple guest TDs simultaneously. Prevent
> + * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> + */

IMHO it's better to remind people that TDH.PHYMEM.CACHE.WB tries to grab
the global TDX module lock:

/*
* TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global
* lock and can fail with TDX_OPERAND_BUSY when it fails to
* grab. Multiple TDX guests can be destroyed simultaneously.
* Take the mutex to prevent it from getting this error.
*/
> + mutex_lock(&tdx_lock);
> +
> + /*
> + * Go through multiple TDX HKID state transitions with three SEAMCALLs
> + * to make TDH.PHYMEM.PAGE.RECLAIM() usable.


What is "TDX HKID state transitions"? Not mentioned before, so needs
explanation _if_ you want to say this.

And what are the three "SEAMCALLs"? Where are they? The only _two_
SEAMCALLs that I can see here are: TDH.PHYMEM.CACHE.WB and
TDH.MNG.KEY.FREEID.

> + * Make the transition atomic
> + * to other functions to operate private pages and Secure-EPT pages.

What's the consequence to "other functions" if we don't make it atomic here?

> + *
> + * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
> + * This function is called via mmu notifier, mmu_release().
> + * kvm_gmem_release() is called via fput() on process exit.
> + */
> + write_lock(&kvm->mmu_lock);

I don't fully get the race here, but it seems strange that this function
is called via mmu notifier.

IIUC, this function is "supposedly" only be called when we tear down the
VM, so I don't know why there's such race.

> +
> + for_each_online_cpu(i) {
> + if (packages_allocated &&
> + cpumask_test_and_set_cpu(topology_physical_package_id(i),
> + packages))
> + continue;
> + if (targets_allocated)
> + cpumask_set_cpu(i, targets);
> + }
> + if (targets_allocated)
> + on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
> + else
> + on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);

I don't understand the logic here -- no comments whatever.

But I am 99% sure the logic here could be simplified.

> + /*
> + * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
> + * tdh_mng_key_freeid() will fail.
> + */
> + err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> + if (WARN_ON_ONCE(err)) {

I see KVM_BUG_ON() is normally used for SEAMCALL error. Why this uses
WARN_ON_ONCE() here?

> + pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> + pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
> + kvm_tdx->hkid);
> + } else
> + tdx_hkid_free(kvm_tdx);
> +
> + write_unlock(&kvm->mmu_lock);
> + mutex_unlock(&tdx_lock);
> + cpus_read_unlock();
> + free_cpumask_var(targets);
> + free_cpumask_var(packages);
> +}
> +
> +void tdx_vm_free(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + u64 err;
> + int i;
> +
> + /*
> + * tdx_mmu_release_hkid() failed to reclaim HKID. Something went wrong
> + * heavily with TDX module. Give up freeing TD pages. As the function
> + * already warned, don't warn it again.
> + */
> + if (is_hkid_assigned(kvm_tdx))
> + return;
> +
> + if (kvm_tdx->tdcs_pa) {
> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> + if (kvm_tdx->tdcs_pa[i])
> + tdx_reclaim_control_page(kvm_tdx->tdcs_pa[i]);

AFAICT, here tdcs_pa[i] cannot be NULL, right? How about:

if (WARN_ON_ONCE(!kvm_tdx->tdcs_pa[i]))
continue;

tdx_reclaim_control_page(...);

which at least saves you some indent.

Btw, does it make sense to stop if any tdx_reclaim_control_page() fails?

It's OK to continue, but perhaps worth adding a comment to point out:

/*
* Continue to reclaim other control pages and
* the TDR page, even if reclaiming one control
* page failed. Do the best to reclaim these TDX
* private pages.
*/
tdx_reclaim_control_page();
> + }
> + kfree(kvm_tdx->tdcs_pa);
> + kvm_tdx->tdcs_pa = NULL;
> + }
> +
> + if (!kvm_tdx->tdr_pa)
> + return;
> + if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
> + return;
> + /*
> + * TDX module maps TDR with TDX global HKID. TDX module may access TDR
> + * while operating on TD (Especially reclaiming TDCS). Cache flush with
> + * TDX global HKID is needed.
> + */

"Especially reclaiming TDCS" -> "especially when it is reclaiming TDCS".

Use imperative mode to describe your change:

Use the SEAMCALL to ask the TDX module to flush the cache of it using
the global KeyID.

> + err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
> + tdx_global_keyid));
> + if (WARN_ON_ONCE(err)) {

Again, KVM_BUG_ON()?

Shouldn't matter, though.

> + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> + return;
> + }
> + tdx_clear_page(kvm_tdx->tdr_pa);
> +
> + free_page((unsigned long)__va(kvm_tdx->tdr_pa));
> + kvm_tdx->tdr_pa = 0;
> +}
> +
> +static int tdx_do_tdh_mng_key_config(void *param)
> +{
> + hpa_t *tdr_p = param;
> + u64 err;
> +
> + do {
> + err = tdh_mng_key_config(*tdr_p);
> +
> + /*
> + * If it failed to generate a random key, retry it because this
> + * is typically caused by an entropy error of the CPU's random
> + * number generator.
> + */
> + } while (err == TDX_KEY_GENERATION_FAILED);

If you want to handle TDX_KEY_GENERATION_FAILED, it's better to have a
retry limit, similar to what the TDX host code does.

> +
> + if (WARN_ON_ONCE(err)) {

KVM_BUG_ON()?

> + pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +static int __tdx_td_init(struct kvm *kvm);
> +
> +int tdx_vm_init(struct kvm *kvm)
> +{
> + /*
> + * TDX has its own limit of the number of vcpus in addition to
> + * KVM_MAX_VCPUS.
> + */
> + kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);

I believe this should be part of the patch that handles KVM_CAP_MAX_VCPUS.

> +
> + /* Place holder for TDX specific logic. */
> + return __tdx_td_init(kvm);
> +}
> +

.. to be continued ...

2024-03-26 02:33:46

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, Mar 25, 2024 at 04:35:28PM -0700, Isaku Yamahata wrote:
>On Mon, Mar 25, 2024 at 11:21:17PM +0000,
>"Edgecombe, Rick P" <[email protected]> wrote:
>
>> On Mon, 2024-03-25 at 16:10 -0700, Isaku Yamahata wrote:
>> > > > My understanding is that Sean prefers to exit to userspace when KVM can't handle something,
>> > > > versus
>> > > > making up behavior that keeps known guests alive. So I would think we should change this patch
>> > > > to
>> > > > only be about not using the zapping roots optimization. Then a separate patch should exit to
>> > > > userspace on attempt to use MTRRs. And we ignore the APIC one.
>> > > >
>> > > > This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.
>> > >
>> > > When we hit KVM_MSR_FILTER, the current implementation ignores it and makes it
>> > > an error to the guest.  Surely we should make it KVM_EXIT_X86_{RDMSR, WRMSR} instead.
>> > > That aligns with the existing implementation (default VM and SW-protected) and is
>> > > more flexible.
>> >
>> > Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
>> > Compile only tested at this point.
>>
>> Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?
>
>No for TDX at the moment. We need to add such logic.

What if QEMU doesn't configure the set of MSRs to filter? In this case, KVM
still needs to handle the MSR accesses.

2024-03-26 02:42:56

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Tue, 2024-03-26 at 10:32 +0800, Chao Gao wrote:
> > > > Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
> > > > Compile only tested at this point.
> > >
> > > Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?
> >
> > No for TDX at the moment.  We need to add such logic.
>
> What if QEMU doesn't configure the set of MSRs to filter? In this case, KVM
> still needs to handle the MSR accesses.

Do you see a problem for the kernel? I think if any issues are limited to only the guest, then we
should count on userspace to configure the MSR list.

Today, if the MSR access is not allowed by the filter, or the MSR access otherwise fails, an error is
returned to the guest. I think Isaku's proposal is to return to userspace if the filter check fails,
and return an error to the guest if the access otherwise fails. So the accessible MSRs are the same;
it's just a change in how the error is reported.

2024-03-26 03:10:36

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 035/130] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl

On Sat, 2024-03-23 at 04:27 +0000, Huang, Kai wrote:
> > vt_vcpu_mem_enc_ioctl() checks non-TDX case and returns -ENOTTY.  We know that
> > the guest is TD.
>
> But the command is not supported, right?
>
> I roughly recall I saw somewhere that in such case we should return -ENOTTY, but
> I cannot find the link now.
>
> But I found this old link uses -ENOTTY:
>
> https://lwn.net/Articles/58719/
>
> So, just fyi.

The AMD version of this returns -EINVAL when the subcommand is not implemented. I don't think the
TDX side necessarily needs to match that. Is the case of concern a future where there
are more subcommands that are only supported when some other mode is enabled?

The man page says:
ENOTTY The specified request does not apply to the kind of object
that the file descriptor fd references.

If a future command does not apply for the TDX mode, then an upgraded kernel could start returning
ENOTTY instead of EINVAL. Hmm. We could always have the option of making KVM_MEMORY_ENCRYPT_OP_FOO
for some future mode foo if there were compatibility issues, so I don't think we would be stuck
either way. 

After thinking about it, I'd make a weak vote to leave it. No strong opinion though.

2024-03-26 03:31:45

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 052/130] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX will use a different shadow PTE entry value for MMIO from VMX.  Add
> members to kvm_arch and track value for MMIO per-VM instead of global
> variables.  By using the per-VM EPT entry value for MMIO, the existing VMX
> logic is kept working.  Introduce a separate setter function so that guest
> TD can override later.
>
> Also require mmio spte caching for TDX.


> Actually this is true case
> because TDX requires EPT and KVM EPT allows mmio spte caching.
>

I can't understand what this is trying to say.

>  
>  void kvm_mmu_init_vm(struct kvm *kvm)
>  {
> +
> +       kvm->arch.shadow_mmio_value = shadow_mmio_value;

It could use kvm_mmu_set_mmio_spte_value()?

>         INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

2024-03-26 11:14:16

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Tue, Mar 26, 2024 at 10:42:36AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2024-03-26 at 10:32 +0800, Chao Gao wrote:
>> > > > Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
>> > > > Compile only tested at this point.
>> > >
>> > > Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?
>> >
>> > No for TDX at the moment.  We need to add such logic.
>>
>> What if QEMU doesn't configure the set of MSRs to filter? In this case, KVM
>> still needs to handle the MSR accesses.
>
>Do you see a problem for the kernel? I think if any issues are limited to only the guest, then we
>should count on userspace to configure the msr list.

How can QEMU handle MTRR MSR accesses if KVM exits to QEMU? I am not sure whether
QEMU would need to do a lot of work to virtualize MTRRs.

If QEMU doesn't configure the MSR filter list correctly, KVM has to handle the
guest's MTRR MSR accesses. In my understanding, the suggestion is that KVM zaps
private memory mappings. But guests won't accept the memory again, because nothing
currently requests guests to do so after writes to MTRR MSRs. In that case,
guests may access unaccepted memory, causing an infinite EPT violation loop
(assuming SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
the host, but I think it would be better if we can avoid wasting CPU resources
on the useless EPT violation loop.

>
>Today if the MSR access is not allowed by the filter, or the MSR access otherwise fails, an error is
>returned to the guest. I think Isaku's proposal is to return to userspace if the filter list fails,
>and return an error to the guest if the access otherwise fails. So the accessible MSRs are the same.
>It's just change in how error is reported.

2024-03-26 17:34:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 048/130] KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values

On Tue, Mar 26, 2024 at 11:53:02PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > Add support to MMU caches for initializing a page with a custom 64-bit
> > value, e.g. to pre-fill an entire page table with non-zero PTE values.
> > The functionality will be used by x86 to support Intel's TDX, which needs
> > to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
> > faults from getting reflected into the guest (Intel's EPT Violation #VE
> > architecture made the less than brilliant decision of having the per-PTE
> > behavior be opt-out instead of opt-in).
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > include/linux/kvm_types.h | 1 +
> > virt/kvm/kvm_main.c | 16 ++++++++++++++--
> > 2 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 9d1f7835d8c1..60c8d5c9eab9 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -94,6 +94,7 @@ struct gfn_to_pfn_cache {
> > struct kvm_mmu_memory_cache {
> > gfp_t gfp_zero;
> > gfp_t gfp_custom;
> > + u64 init_value;
> > struct kmem_cache *kmem_cache;
> > int capacity;
> > int nobjs;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index de38f308738e..d399009ef1d7 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -401,12 +401,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> > static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > gfp_t gfp_flags)
> > {
> > + void *page;
> > +
> > gfp_flags |= mc->gfp_zero;
> > if (mc->kmem_cache)
> > return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> > - else
> > - return (void *)__get_free_page(gfp_flags);
> > +
> > + page = (void *)__get_free_page(gfp_flags);
> > + if (page && mc->init_value)
> > + memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
>
> Do we need a static_assert() to make sure mc->init_value is 64bit?

I don't see much value.  Is your concern the sizeof() part?
If so, we can replace it with 8.

memset64(page, mc->init_value, PAGE_SIZE / 8);
--
Isaku Yamahata <[email protected]>

2024-03-26 17:59:14

by Isaku Yamahata

Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Tue, Mar 26, 2024 at 07:13:46PM +0800,
Chao Gao <[email protected]> wrote:

> On Tue, Mar 26, 2024 at 10:42:36AM +0800, Edgecombe, Rick P wrote:
> >On Tue, 2024-03-26 at 10:32 +0800, Chao Gao wrote:
> >> > > > Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
> >> > > > Compile only tested at this point.
> >> > >
> >> > > Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?
> >> >
> >> > No for TDX at the moment.  We need to add such logic.
> >>
> >> What if QEMU doesn't configure the set of MSRs to filter? In this case, KVM
> >> still needs to handle the MSR accesses.
> >
> >Do you see a problem for the kernel? I think if any issues are limited to only the guest, then we
> >should count on userspace to configure the msr list.
>
> How can QEMU handle MTRR MSR accesses if KVM exits to QEMU? I am not sure if
> QEMU needs to do a lot of work to virtualize MTRR.

The default kernel logic is to return an error for
TDG.VP.VMCALL<RDMSR or WRMSR MTRR registers>.
Qemu can keep mostly the same logic as the current kernel.

rdmsr:
MTRRCAP: 0
MTRRDEFTYPE: MTRR_TYPE_WRBACK

wrmsr:
MTRRDEFTYPE: If write back, nop. Otherwise error.


> If QEMU doesn't configure the msr filter list correctly, KVM has to handle
> guest's MTRR MSR accesses. In my understanding, the suggestion is KVM zap
> private memory mappings. But guests won't accept memory again because no one
> currently requests guests to do this after writes to MTRR MSRs. In this case,
> guests may access unaccepted memory, causing infinite EPT violation loop
> (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
> the host. But I think it would be better if we can avoid wasting CPU resource
> on the useless EPT violation loop.

Qemu is expected to do it correctly.  There are many ways for userspace to go
wrong.  This isn't specific to MTRR MSRs.
--
Isaku Yamahata <[email protected]>

2024-03-26 18:06:52

by Isaku Yamahata

Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Mon, Mar 25, 2024 at 10:31:49PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-03-25 at 13:01 -0700, Isaku Yamahata wrote:
> >  Also, kvm_tdp_mmu_alloc_root() never returns non-zero, even though mmu_alloc_direct_roots() does.
> > > Probably today when there is one caller it makes mmu_alloc_direct_roots() cleaner to just have
> > > it
> > > return the always zero value from kvm_tdp_mmu_alloc_root(). Now that there are two calls, I
> > > think we
> > > should refactor kvm_tdp_mmu_alloc_root() to return void, and have kvm_tdp_mmu_alloc_root()
> > > return 0
> > > manually in this case.
> > >
> > > Or maybe instead change it back to returning an hpa_t and then kvm_tdp_mmu_alloc_root() can lose
> > > the
> > > "if (private)" logic at the end too.
> >
> > Probably we can make kvm_tdp_mmu_alloc_root() return void instead of always
> > returning zero, as a cleanup.
>
> Why is it better than returning an hpa_t once we are calling it twice for mirror and shared roots.

You mean split out "if (private)" from the core part? Makes sense.


> > > >         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > > >                 root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > > > @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault
> > > > *fault)
> > > >         if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> > > >                 for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> > > >                         int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > > > -                       gfn_t base = gfn_round_for_level(fault->gfn,
> > > > +                       gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> > > >                                                          fault->max_level);
> > > >  
> > > >                         if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > > > @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64
> > > > error_code,
> > > >         };
> > > >  
> > > >         WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> > > > +       fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > > >         fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > > >  
> > > >         r = mmu_topup_memory_caches(vcpu, false);
> > > > @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > > >  
> > > >         mmu->root.hpa = INVALID_PAGE;
> > > >         mmu->root.pgd = 0;
> > > > +       mmu->private_root_hpa = INVALID_PAGE;
> > > >         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > > >                 mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> > > >  
> > > > @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
> > > >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > > >  {
> > > >         kvm_mmu_unload(vcpu);
> > > > +       if (tdp_mmu_enabled) {
> > > > +               write_lock(&vcpu->kvm->mmu_lock);
> > > > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > > > +                               NULL);
> > > > +               write_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > What is the reason for the special treatment of private_root_hpa here? The rest of the roots are
> > > freed in kvm_mmu_unload(). I think it is because we don't want the mirror to get freed during
> > > kvm_mmu_reset_context()?
> >
> > It reflects that we don't free Secure-EPT pages during runtime, and free them
> > when destroying the guest.
>
> Right. It would be great if we could do something like warn on freeing role.private = 1 sp's during
> runtime. It could cover several cases that get worried about in other patches.

Ok, let me move it to kvm_mmu_unload() and try to sprinkle warn-on.


> While looking at how we could do this, I noticed that kvm_arch_vcpu_create() calls kvm_mmu_destroy()
> in an error path. So this could end up zapping/freeing a private root. It should be bad userspace
> behavior too I guess. But the number of edge cases makes me think the case of zapping private sp
> while a guest is running is something that deserves a VM_BUG_ON().

Let me clean the code. I think we can clean them up.


> > > Oof. For the sake of trying to justify the code, I'm trying to keep track of the pros and cons
> > > of
> > > treating the mirror/private root like a normal one with just a different role bit.
> > >
> > > The whole “list of roots” thing seems to date from the shadow paging, where there is is critical
> > > to
> > > keep multiple cached shared roots of different CPU modes of the same shadowed page tables. Today
> > > with non-nested TDP, AFAICT, the only different root is for SMM. I guess since the machinery for
> > > managing multiple roots in a list already exists it makes sense to use it for both.
> > >
> > > For TDX there are also only two, but the difference is, things need to be done in special ways
> > > for
> > > the two roots. You end up with a bunch of loops (for_each_*tdp_mmu_root(), etc) that essentially
> > > process a list of two different roots, but with inner logic tortured to work for the
> > > peculiarities
> > > of both private and shared. An easier to read alternative could be to open code both cases.
> > >
> > > I guess the major benefit is to keep one set of logic for shadow paging, normal TDP and TDX, but
> > > it
> > > makes the logic a bit difficult to follow for TDX compared to looking at it from the normal
> > > guest
> > > perspective. So I wonder if making special versions of the TDX root traversing operations might
> > > make
> > > the code a little easier to follow. I’m not advocating for it at this point, just still working
> > > on
> > > an opinion. Is there any history around this design point?
> >
> > The original desire was to keep the modification contained, and to not
> > introduce a function for population and zap.  With the open coding, do you
> > want something like the following?  We can try it and compare the outcome.
> >
> > For zapping
> >   if (private) {
> >      __for_each_tdp_mmu_root_yield_safe_private()
> >        private case
> >   } else {
> >      __for_each_tdp_mmu_root_yield_safe()
> >         shared case
> >   }
> >
> > For fault,
> > kvm_tdp_mmu_map()
> >   if (private) {
> >     tdp_mmu_for_each_pte_private(iter, mmu, raw_gfn, raw_gfn + 1)
> >       private case
> >   } else {
> >     tdp_mmu_for_each_pte(iter, mmu, raw_gfn, raw_gfn + 1)
> >       shared case
> >   }
>
> I was wondering about something limited to the operations that iterate over the roots. So not
> keeping private_root_hpa in the list of roots where it has to be carefully protected from getting
> zapped or get its gfn adjusted, and instead open coding the private case in the higher level zapping
> operations. For normal VM's the private case would be a NOP.
>
> Since kvm_tdp_mmu_map() already grabs private_root_hpa manually, it wouldn't change in this idea. I
> don't know how much better it would be though. I think you are right we would have to create them
> and compare.

Given that large page support gets complicated, it would be worthwhile to try,
I think.
--
Isaku Yamahata <[email protected]>

2024-03-26 21:52:25

by Binbin Wu

Subject: Re: [PATCH v19 048/130] KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Add support to MMU caches for initializing a page with a custom 64-bit
> value, e.g. to pre-fill an entire page table with non-zero PTE values.
> The functionality will be used by x86 to support Intel's TDX, which needs
> to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
> faults from getting reflected into the guest (Intel's EPT Violation #VE
> architecture made the less than brilliant decision of having the per-PTE
> behavior be opt-out instead of opt-in).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> include/linux/kvm_types.h | 1 +
> virt/kvm/kvm_main.c | 16 ++++++++++++++--
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 9d1f7835d8c1..60c8d5c9eab9 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -94,6 +94,7 @@ struct gfn_to_pfn_cache {
> struct kvm_mmu_memory_cache {
> gfp_t gfp_zero;
> gfp_t gfp_custom;
> + u64 init_value;
> struct kmem_cache *kmem_cache;
> int capacity;
> int nobjs;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index de38f308738e..d399009ef1d7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -401,12 +401,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> gfp_t gfp_flags)
> {
> + void *page;
> +
> gfp_flags |= mc->gfp_zero;
>
> if (mc->kmem_cache)
> return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> - else
> - return (void *)__get_free_page(gfp_flags);
> +
> + page = (void *)__get_free_page(gfp_flags);
> + if (page && mc->init_value)
> + memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));

Do we need a static_assert() to make sure mc->init_value is 64bit?

> + return page;
> }
>
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> @@ -421,6 +426,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
> if (WARN_ON_ONCE(!capacity))
> return -EIO;
>
> + /*
> + * Custom init values can be used only for page allocations,
> + * and obviously conflict with __GFP_ZERO.
> + */
> + if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
> + return -EIO;
> +
> mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> if (!mc->objects)
> return -ENOMEM;


2024-03-27 00:29:03

by Edgecombe, Rick P

Subject: Re: [PATCH v19 044/130] KVM: TDX: Do TDX specific vcpu initialization

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> +/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
> +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> +{
> +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> +       unsigned long *tdvpx_pa = NULL;
> +       unsigned long tdvpr_pa;


I think we could drop these local variables and just use tdx->tdvpr_pa and tdx->tdvpx_pa.  Then we
don't have to have the assignments later.

> +       unsigned long va;
> +       int ret, i;
> +       u64 err;
> +
> +       if (is_td_vcpu_created(tdx))
> +               return -EINVAL;
> +
> +       /*
> +        * vcpu_free method frees allocated pages.  Avoid partial setup so
> +        * that the method can't handle it.
> +        */
> +       va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +       if (!va)
> +               return -ENOMEM;
> +       tdvpr_pa = __pa(va);
> +
> +       tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
> +                          GFP_KERNEL_ACCOUNT);
> +       if (!tdvpx_pa) {
> +               ret = -ENOMEM;
> +               goto free_tdvpr;
> +       }
> +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> +               va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +               if (!va) {
> +                       ret = -ENOMEM;
> +                       goto free_tdvpx;
> +               }
> +               tdvpx_pa[i] = __pa(va);
> +       }
> +
> +       err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> +       if (KVM_BUG_ON(err, vcpu->kvm)) {
> +               ret = -EIO;
> +               pr_tdx_error(TDH_VP_CREATE, err, NULL);
> +               goto free_tdvpx;
> +       }
> +       tdx->tdvpr_pa = tdvpr_pa;
> +
> +       tdx->tdvpx_pa = tdvpx_pa;

Or alternatively let's move these to right before they are used (in the current branch).

> +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> +               err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> +               if (KVM_BUG_ON(err, vcpu->kvm)) {
> +                       pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> +                       for (; i < tdx_info->nr_tdvpx_pages; i++) {
> +                               free_page((unsigned long)__va(tdvpx_pa[i]));
> +                               tdvpx_pa[i] = 0;
> +                       }
> +                       /* vcpu_free method frees TDVPX and TDR donated to TDX */
> +                       return -EIO;
> +               }
> +       }
>
>
In the current branch tdh_vp_init() takes struct vcpu_tdx, so they would be moved right here.

What do you think?

> +
> +       err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> +       if (KVM_BUG_ON(err, vcpu->kvm)) {
> +               pr_tdx_error(TDH_VP_INIT, err, NULL);
> +               return -EIO;
> +       }
> +
> +       vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +       tdx->td_vcpu_created = true;
> +       return 0;
> +
> +free_tdvpx:
> +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> +               if (tdvpx_pa[i])
> +                       free_page((unsigned long)__va(tdvpx_pa[i]));
> +               tdvpx_pa[i] = 0;
> +       }
> +       kfree(tdvpx_pa);
> +       tdx->tdvpx_pa = NULL;
> +free_tdvpr:
> +       if (tdvpr_pa)
> +               free_page((unsigned long)__va(tdvpr_pa));
> +       tdx->tdvpr_pa = 0;
> +
> +       return ret;
> +}

2024-03-27 00:48:27

by Binbin Wu

Subject: Re: [PATCH v19 048/130] KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values



On 3/27/2024 1:34 AM, Isaku Yamahata wrote:
> On Tue, Mar 26, 2024 at 11:53:02PM +0800,
> Binbin Wu <[email protected]> wrote:
>
>>
>> On 2/26/2024 4:25 PM, [email protected] wrote:
>>> From: Sean Christopherson <[email protected]>
>>>
>>> Add support to MMU caches for initializing a page with a custom 64-bit
>>> value, e.g. to pre-fill an entire page table with non-zero PTE values.
>>> The functionality will be used by x86 to support Intel's TDX, which needs
>>> to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
>>> faults from getting reflected into the guest (Intel's EPT Violation #VE
>>> architecture made the less than brilliant decision of having the per-PTE
>>> behavior be opt-out instead of opt-in).
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>> ---
>>> include/linux/kvm_types.h | 1 +
>>> virt/kvm/kvm_main.c | 16 ++++++++++++++--
>>> 2 files changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>>> index 9d1f7835d8c1..60c8d5c9eab9 100644
>>> --- a/include/linux/kvm_types.h
>>> +++ b/include/linux/kvm_types.h
>>> @@ -94,6 +94,7 @@ struct gfn_to_pfn_cache {
>>> struct kvm_mmu_memory_cache {
>>> gfp_t gfp_zero;
>>> gfp_t gfp_custom;
>>> + u64 init_value;
>>> struct kmem_cache *kmem_cache;
>>> int capacity;
>>> int nobjs;
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index de38f308738e..d399009ef1d7 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -401,12 +401,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
>>> static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>>> gfp_t gfp_flags)
>>> {
>>> + void *page;
>>> +
>>> gfp_flags |= mc->gfp_zero;
>>> if (mc->kmem_cache)
>>> return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
>>> - else
>>> - return (void *)__get_free_page(gfp_flags);
>>> +
>>> + page = (void *)__get_free_page(gfp_flags);
>>> + if (page && mc->init_value)
>>> + memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
>> Do we need a static_assert() to make sure mc->init_value is 64bit?
> I don't see much value. Is your concern sizeof() part?
> If so, we can replace it with 8.
>
> memset64(page, mc->init_value, PAGE_SIZE / 8);

Yes, but it's trivial. So, up to you. :)


2024-03-27 02:55:00

by Xiaoyao Li

Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/27/2024 1:48 AM, Isaku Yamahata wrote:
> On Tue, Mar 26, 2024 at 07:13:46PM +0800,
> Chao Gao <[email protected]> wrote:
>
>> On Tue, Mar 26, 2024 at 10:42:36AM +0800, Edgecombe, Rick P wrote:
>>> On Tue, 2024-03-26 at 10:32 +0800, Chao Gao wrote:
>>>>>>> Something like this for "112/130 KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall"
>>>>>>> Compile only tested at this point.
>>>>>>
>>>>>> Seems reasonable to me. Does QEMU configure a special set of MSRs to filter for TDX currently?
>>>>>
>>>>> No for TDX at the moment.  We need to add such logic.
>>>>
>>>> What if QEMU doesn't configure the set of MSRs to filter? In this case, KVM
>>>> still needs to handle the MSR accesses.
>>>
>>> Do you see a problem for the kernel? I think if any issues are limited to only the guest, then we
>>> should count on userspace to configure the msr list.
>>
>> How can QEMU handle MTRR MSR accesses if KVM exits to QEMU? I am not sure if
>> QEMU needs to do a lot of work to virtualize MTRR.
>
> The default kernel logic will to return error for
> TDG.VP.VMCALL<RDMSR or WRMSR MTRR registers>.
> Qemu can have mostly same in the current kernel logic.
>
> rdmsr:
> MTRRCAP: 0
> MTRRDEFTYPE: MTRR_TYPE_WRBACK
>
> wrmsr:
> MTRRDEFTYPE: If write back, nop. Otherwise error.
>
>
>> If QEMU doesn't configure the msr filter list correctly, KVM has to handle
>> guest's MTRR MSR accesses. In my understanding, the suggestion is KVM zap
>> private memory mappings. But guests won't accept memory again because no one
>> currently requests guests to do this after writes to MTRR MSRs. In this case,
>> guests may access unaccepted memory, causing infinite EPT violation loop
>> (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
>> the host. But I think it would be better if we can avoid wasting CPU resource
>> on the useless EPT violation loop.
>
> Qemu is expected to do it correctly.  There are many ways for userspace to go
> wrong.  This isn't specific to MTRR MSRs.

This seems incorrect.  KVM shouldn't force userspace to filter some specific
MSRs.  The semantics of the MSR filter are that userspace configures it of its
own will, not that KVM requires it to do so.

2024-03-27 03:09:15

by Chenyi Qiang

Subject: Re: [PATCH v19 046/130] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA



On 2/26/2024 4:25 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX repurposes one GPA bit (51 bit or 47 bit based on configuration) to
> indicate the GPA is private(if cleared) or shared (if set) with VMM. If
> GPA.shared is set, GPA is covered by the existing conventional EPT pointed
> by EPTP. If GPA.shared bit is cleared, GPA is covered by TDX module.
> VMM has to issue SEAMCALLs to operate.
>
> Add a member to remember GPA shared bit for each guest TDs, add address
> conversion functions between private GPA and shared GPA and test if GPA
> is private.
>
> Because struct kvm_arch (or struct kvm which includes struct kvm_arch. See
> kvm_arch_alloc_vm() that passes __GPF_ZERO) is zero-cleared when allocated,
> the new member to remember GPA shared bit is guaranteed to be zero with
> this patch unless it's initialized explicitly.
>
>                          default or SEV-SNP    TDX: S = (47 or 51) - 12
>   gfn_shared_mask        0                     S bit
>   kvm_is_private_gpa()   always false          true if GFN has S bit set

TDX: true if GFN has S bit clear?

>   kvm_gfn_to_shared()    nop                   set S bit
>   kvm_gfn_to_private()   nop                   clear S bit
>
> fault.is_private means that host page should be gotten from guest_memfd
> is_private_gpa() means that KVM MMU should invoke private MMU hooks.
>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> ---
> v19:
> - Add comment on default vm case.
> - Added behavior table in the commit message
> - drop CONFIG_KVM_MMU_PRIVATE
>
> v18:
> - Added Reviewed-by Binbin
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.c | 5 +++++
> 3 files changed, 40 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5da3c211955d..de6dd42d226f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1505,6 +1505,8 @@ struct kvm_arch {
> */
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
> +
> + gfn_t gfn_shared_mask;
> };
>
> struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index d96c93a25b3b..395b55684cb9 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -322,4 +322,37 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> return gpa;
> return translate_nested_gpa(vcpu, gpa, access, exception);
> }
> +
> +/*
> + *                      default or SEV-SNP    TDX: where S = (47 or 51) - 12
> + * gfn_shared_mask      0                     S bit
> + * is_private_gpa()     always false          if GPA has S bit set
> + * gfn_to_shared()      nop                   set S bit
> + * gfn_to_private()     nop                   clear S bit
> + *
> + * fault.is_private means that host page should be gotten from guest_memfd
> + * is_private_gpa() means that KVM MMU should invoke private MMU hooks.
> + */
> +static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
> +{
> + return kvm->arch.gfn_shared_mask;
> +}
> +
> +static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
> +{
> + return gfn | kvm_gfn_shared_mask(kvm);
> +}
> +
> +static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
> +{
> + return gfn & ~kvm_gfn_shared_mask(kvm);
> +}
> +
> +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
> +{
> + gfn_t mask = kvm_gfn_shared_mask(kvm);
> +
> + return mask && !(gpa_to_gfn(gpa) & mask);
> +}
> +
> #endif
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index aa1da51b8af7..54e0d4efa2bd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -906,6 +906,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> kvm_tdx->attributes = td_params->attributes;
> kvm_tdx->xfam = td_params->xfam;
>
> + if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
> + kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
> + else
> + kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
> +
> out:
> /* kfree() accepts NULL. */
> kfree(init_vm);

2024-03-27 14:53:36

by Binbin Wu

Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page



On 3/15/2024 9:09 AM, Isaku Yamahata wrote:
> Here is the updated one. Renamed dummy -> mirrored.
>
> When KVM resolves the KVM page fault, it walks the page tables. To reuse
> the existing KVM MMU code and mitigate the heavy cost of directly walking
> the private page table, allocate one more page to copy the mirrored page

Here "copy" is a bit confusing for me.
The mirrored page table is maintained by KVM, not copied from anywhere.

> table for the KVM MMU code to directly walk. Resolve the KVM page fault
> with the existing code, and do additional operations necessary for the
> private page table. To distinguish such cases, the existing KVM page table
> is called a shared page table (i.e., not associated with a private page
> table), and the page table with a private page table is called a mirrored
> page table. The relationship is depicted below.
>
>
>              KVM page fault                    |
>                   |                            |
>                   V                            |
>      -------------+----------                  |
>      |                      |                  |
>      V                      V                  |
>   shared GPA           private GPA             |
>      |                      |                  |
>      V                      V                  |
>  shared PT root      mirrored PT root          |     private PT root
>      |                      |                  |           |
>      V                      V                  |           V
>   shared PT            mirrored PT ----propagate----> private PT
>      |                      |                  |           |
>      |                      \----------------+------\      |
>      |                                       |      |      |
>      V                                       |      V      V
>  shared guest page                           |   private guest page
>                                              |
>  non-encrypted memory                        |   encrypted memory
>                                              |
> PT: Page table
> Shared PT: visible to KVM, and the CPU uses it for shared mappings.
> Private PT: the CPU uses it, but it is invisible to KVM. TDX module
> updates this table to map private guest pages.
> Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> to propagate PT change to the actual private PT.
>


2024-03-27 14:58:30

by Binbin Wu

Subject: Re: [PATCH v19 046/130] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA



On 3/27/2024 11:08 AM, Chenyi Qiang wrote:
>
> On 2/26/2024 4:25 PM, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> TDX repurposes one GPA bit (51 bit or 47 bit based on configuration) to
>> indicate the GPA is private(if cleared) or shared (if set) with VMM. If
>> GPA.shared is set, GPA is covered by the existing conventional EPT pointed
>> by EPTP. If GPA.shared bit is cleared, GPA is covered by TDX module.
>> VMM has to issue SEAMCALLs to operate.
>>
>> Add a member to remember GPA shared bit for each guest TDs, add address
>> conversion functions between private GPA and shared GPA and test if GPA
>> is private.
>>
>> Because struct kvm_arch (or struct kvm which includes struct kvm_arch. See
>> kvm_arch_alloc_vm() that passes __GPF_ZERO) is zero-cleared when allocated,
>> the new member to remember GPA shared bit is guaranteed to be zero with
>> this patch unless it's initialized explicitly.
>>
>>                          default or SEV-SNP    TDX: S = (47 or 51) - 12
>>   gfn_shared_mask        0                     S bit
>>   kvm_is_private_gpa()   always false          true if GFN has S bit set
> TDX: true if GFN has S bit clear?
>
>>   kvm_gfn_to_shared()    nop                   set S bit
>>   kvm_gfn_to_private()   nop                   clear S bit
>>
>> fault.is_private means that host page should be gotten from guest_memfd
>> is_private_gpa() means that KVM MMU should invoke private MMU hooks.
>>
>> Co-developed-by: Rick Edgecombe <[email protected]>
>> Signed-off-by: Rick Edgecombe <[email protected]>
>> Signed-off-by: Isaku Yamahata <[email protected]>
>> Reviewed-by: Binbin Wu <[email protected]>
>> ---
>> v19:
>> - Add comment on default vm case.
>> - Added behavior table in the commit message
>> - drop CONFIG_KVM_MMU_PRIVATE
>>
>> v18:
>> - Added Reviewed-by Binbin
>>
>> Signed-off-by: Isaku Yamahata <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 2 ++
>> arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++
>> arch/x86/kvm/vmx/tdx.c | 5 +++++
>> 3 files changed, 40 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 5da3c211955d..de6dd42d226f 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1505,6 +1505,8 @@ struct kvm_arch {
>> */
>> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
>> struct kvm_mmu_memory_cache split_desc_cache;
>> +
>> + gfn_t gfn_shared_mask;
>> };
>>
>> struct kvm_vm_stat {
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index d96c93a25b3b..395b55684cb9 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -322,4 +322,37 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
>> return gpa;
>> return translate_nested_gpa(vcpu, gpa, access, exception);
>> }
>> +
>> +/*
>> + *                      default or SEV-SNP    TDX: where S = (47 or 51) - 12
>> + * gfn_shared_mask      0                     S bit
>> + * is_private_gpa()     always false          if GPA has S bit set

Also here,
TDX: true if GFN has S bit cleared

>> + * gfn_to_shared()      nop                   set S bit
>> + * gfn_to_private()     nop                   clear S bit
>> + *
>> + * fault.is_private means that host page should be gotten from guest_memfd
>> + * is_private_gpa() means that KVM MMU should invoke private MMU hooks.
>> + */
>> +static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
>> +{
>> + return kvm->arch.gfn_shared_mask;
>> +}
>> +
>> +static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
>> +{
>> + return gfn | kvm_gfn_shared_mask(kvm);
>> +}
>> +
>> +static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
>> +{
>> + return gfn & ~kvm_gfn_shared_mask(kvm);
>> +}
>> +
>> +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
>> +{
>> + gfn_t mask = kvm_gfn_shared_mask(kvm);
>> +
>> + return mask && !(gpa_to_gfn(gpa) & mask);
>> +}
>> +
>> #endif
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index aa1da51b8af7..54e0d4efa2bd 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -906,6 +906,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>> kvm_tdx->attributes = td_params->attributes;
>> kvm_tdx->xfam = td_params->xfam;
>>
>> + if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
>> + kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
>> + else
>> + kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
>> +
>> out:
>> /* kfree() accepts NULL. */
>> kfree(init_vm);


2024-03-27 16:19:56

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -3717,7 +3717,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> goto out_unlock;
>
> if (tdp_mmu_enabled) {
>- root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
>+ if (kvm_gfn_shared_mask(vcpu->kvm) &&
>+ !VALID_PAGE(mmu->private_root_hpa)) {
>+ root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
>+ mmu->private_root_hpa = root;

just
mmu->private_root_hpa =
kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
>+ }
>+ root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> mmu->root.hpa = root;

ditto

> } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
>@@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
>- gfn_t base = gfn_round_for_level(fault->gfn,
>+ gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> fault->max_level);

..

>
> if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
>@@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> };
>
> WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
>+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);

Could you clarify when shared bits need to be masked out or kept? Shared bits
are masked out here but kept in the hunk right above and ..

>+++ b/arch/x86/kvm/mmu/tdp_iter.h
>@@ -91,7 +91,7 @@ struct tdp_iter {
> tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> /* A pointer to the current SPTE */
> tdp_ptep_t sptep;
>- /* The lowest GFN mapped by the current SPTE */
>+ /* The lowest GFN (shared bits included) mapped by the current SPTE */
> gfn_t gfn;

. in @gfn of tdp_iter.

> /* The level of the root page given to the iterator */
> int root_level;


>
>-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>+static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,

Maybe fold it into its sole caller.

>+ bool private)
> {
> union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> struct kvm *kvm = vcpu->kvm;
>@@ -221,6 +225,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> * Check for an existing root before allocating a new one. Note, the
> * role check prevents consuming an invalid root.
> */
>+ if (private)
>+ kvm_mmu_page_role_set_private(&role);
> for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> if (root->role.word == role.word &&
> kvm_tdp_mmu_get_root(root))
>@@ -244,12 +250,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> out:
>- return __pa(root->spt);
>+ return root;
>+}
>+
>+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
>+{
>+ return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
> }
>
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>- u64 old_spte, u64 new_spte, int level,
>- bool shared);
>+ u64 old_spte, u64 new_spte,
>+ union kvm_mmu_page_role role, bool shared);
>
> static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
>@@ -376,12 +387,78 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> REMOVED_SPTE, level);
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
>- old_spte, REMOVED_SPTE, level, shared);
>+ old_spte, REMOVED_SPTE, sp->role,
>+ shared);
>+ }
>+
>+ if (is_private_sp(sp) &&
>+ WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,

WARN_ON_ONCE()?

>+ kvm_mmu_private_spt(sp)))) {
>+ /*
>+ * Failed to unlink Secure EPT page and there is nothing to do
>+ * further. Intentionally leak the page to prevent the kernel
>+ * from accessing the encrypted page.
>+ */
>+ kvm_mmu_init_private_spt(sp, NULL);
> }
>
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>

> rcu_read_lock();
>
> for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
>@@ -960,10 +1158,26 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>
> if (unlikely(!fault->slot))
> new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
>- else
>- wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
>- fault->pfn, iter->old_spte, fault->prefetch, true,
>- fault->map_writable, &new_spte);
>+ else {
>+ unsigned long pte_access = ACC_ALL;
>+ gfn_t gfn = iter->gfn;
>+
>+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
>+ if (fault->is_private)
>+ gfn |= kvm_gfn_shared_mask(vcpu->kvm);

this is an open-coded kvm_gfn_to_shared().

I don't get why a spte is installed for a shared gfn when fault->is_private
is true. Could you elaborate?

>+ else
>+ /*
>+ * TDX shared GPAs are no executable, enforce
>+ * this for the SDV.
>+ */

what do you mean by the SDV?

>+ pte_access &= ~ACC_EXEC_MASK;
>+ }
>+
>+ wrprot = make_spte(vcpu, sp, fault->slot, pte_access, gfn,
>+ fault->pfn, iter->old_spte,
>+ fault->prefetch, true, fault->map_writable,
>+ &new_spte);
>+ }
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
>@@ -1041,6 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
>+ gfn_t raw_gfn;
>+ bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> int ret = RET_PF_RETRY;
>
> kvm_mmu_hugepage_adjust(vcpu, fault);
>@@ -1049,7 +1265,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> rcu_read_lock();
>
>- tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
>+ raw_gfn = gpa_to_gfn(fault->addr);
>+
>+ if (is_error_noslot_pfn(fault->pfn) ||
>+ !kvm_pfn_to_refcounted_page(fault->pfn)) {
>+ if (is_private) {
>+ rcu_read_unlock();
>+ return -EFAULT;

This needs a comment: why is this check necessary? Does it imply some
kernel bug?

2024-03-27 17:22:40

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 018/130] KVM: x86/mmu: Assume guest MMIOs are shared

On Mon, Mar 25, 2024 at 11:41:56PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > From: Chao Gao <[email protected]>
> >
> > TODO: Drop this patch once the common patch is merged.
>
> What is this TODO talking about?

https://lore.kernel.org/all/[email protected]/

This patch was shot down; it needs to be fixed on the guest side, in TDVF.
--
Isaku Yamahata <[email protected]>

2024-03-27 17:28:33

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 018/130] KVM: x86/mmu: Assume guest MMIOs are shared

On Wed, 2024-03-27 at 10:22 -0700, Isaku Yamahata wrote:
> On Mon, Mar 25, 2024 at 11:41:56PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > > From: Chao Gao <[email protected]>
> > >
> > > TODO: Drop this patch once the common patch is merged.
> >
> > What is this TODO talking about?
>
> https://lore.kernel.org/all/[email protected]/
>
> This patch was shot down; it needs to be fixed on the guest side, in TDVF.

It needs a firmware fix? Is there any firmware that works (boot any TD) with this patch missing?

2024-03-27 17:36:27

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Wed, 2024-03-27 at 10:54 +0800, Xiaoyao Li wrote:
> > > If QEMU doesn't configure the msr filter list correctly, KVM has to handle
> > > guest's MTRR MSR accesses. In my understanding, the suggestion is KVM zap
> > > private memory mappings. But guests won't accept memory again because no one
> > > currently requests guests to do this after writes to MTRR MSRs. In this case,
> > > guests may access unaccepted memory, causing infinite EPT violation loop
> > > (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
> > > the host. But I think it would be better if we can avoid wasting CPU resource
> > > on the useless EPT violation loop.
> >
> > Qemu is expected to do it correctly.  There are many ways for userspace to go
> > wrong.  This isn't specific to MTRR MSR.
>
> This seems incorrect. KVM shouldn't force userspace to filter some
> specific MSRs. The semantic of MSR filter is userspace configures it on
> its own will, not KVM requires to do so.

I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on the
MSR list. At least I don't see the problem.

2024-03-27 22:54:07

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Tue, Mar 26, 2024 at 02:43:54PM +1300,
"Huang, Kai" <[email protected]> wrote:

> ... continue the previous review ...
>
> > +
> > +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > +{
> > + WARN_ON_ONCE(!td_page_pa);
>
> From the name 'td_page_pa' we cannot tell whether it is a control page, but
> this function is only intended for control page AFAICT, so perhaps a more
> specific name.
>
> > +
> > + /*
> > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
>
> "are" -> "is".
>
> Are you sure it is TDCX, but not TDCS?
>
> AFAICT TDCX is the control structure for 'vcpu', but here you are handling
> the control structure for the VM.

TDCS, TDVPR, and TDCX. Will update the comment.

>
> > + * assigned to the TD. Here the cache associated to the TD
> > + * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> > + * cache doesn't need to be flushed again.
> > + */
>
> How about put this part as the comment for this function?
>
> /*
> * Reclaim <name of control page> page(s) which are crypto-protected
> * by TDX guest's private KeyID. Assume the cache associated with the
> * TDX private KeyID has been flushed.
> */
> > + if (tdx_reclaim_page(td_page_pa))
> > + /*
> > + * Leak the page on failure:
> > + * tdx_reclaim_page() returns an error if and only if there's an
> > + * unexpected, fatal error, e.g. a SEAMCALL with bad params,
> > + * incorrect concurrency in KVM, a TDX Module bug, etc.
> > + * Retrying at a later point is highly unlikely to be
> > + * successful.
> > + * No log here as tdx_reclaim_page() already did.
>
> IMHO can be simplified to below, and nothing else matters.
>
> /*
> * Leak the page if the kernel failed to reclaim the page.
> * The kernel cannot use it safely anymore.
> */
>
> And you can put this comment above the 'if (tdx_reclaim_page())' statement.

Sure.


> > + */
> > + return;
>
> Empty line.
>
> > + free_page((unsigned long)__va(td_page_pa));
> > +}
> > +
> > +static void tdx_do_tdh_phymem_cache_wb(void *unused)
>
> Better to make the name explicit that it is a smp_func, and you don't need
> the "tdx_" prefix for all the 'static' functions here:
>
> static void smp_func_do_phymem_cache_wb(void *unused)

Ok, will rename it.


> > +{
> > + u64 err = 0;
> > +
> > + do {
> > + err = tdh_phymem_cache_wb(!!err);
>
> bool resume = !!err;
>
> err = tdh_phymem_cache_wb(resume);
>
> So that we don't need to jump to the tdh_phymem_cache_wb() to see what does
> !!err mean.

Ok.


> > + } while (err == TDX_INTERRUPTED_RESUMABLE);
>
> Add a comment before the do {} while():
>
> /*
> * TDH.PHYMEM.CACHE.WB flushes caches associated with _ANY_
> * TDX private KeyID on the package (or logical cpu?) where
> * it is called on. The TDX module may not finish the cache
> * flush but return TDX_INTERRUPTED_RESUMEABLE instead. The
> * kernel should retry it until it returns success w/o
> * rescheduling.
> */

Ok.


> > +
> > + /* Other thread may have done for us. */
> > + if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> > + err = TDX_SUCCESS;
>
> Empty line.
>
> > + if (WARN_ON_ONCE(err))
> > + pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> > +}
> > +
> > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > +{
> > + bool packages_allocated, targets_allocated;
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + cpumask_var_t packages, targets;
> > + u64 err;
> > + int i;
> > +
> > + if (!is_hkid_assigned(kvm_tdx))
> > + return;
> > +
> > + if (!is_td_created(kvm_tdx)) {
> > + tdx_hkid_free(kvm_tdx);
> > + return;
> > + }
>
> I lost tracking what does "td_created()" mean.
>
> I guess it means: KeyID has been allocated to the TDX guest, but not yet
> programmed/configured.
>
> Perhaps add a comment to remind the reviewer?

As Chao suggested, I will introduce a state machine for the VM and vCPU.

https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/

> > +
> > + packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > + targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> > + cpus_read_lock();
> > +
> > + /*
> > + * We can destroy multiple guest TDs simultaneously. Prevent
> > + * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> > + */
>
> IMHO it's better to remind people that TDH.PHYMEM.CACHE.WB tries to grab the
> global TDX module lock:
>
> /*
> * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global
> * lock and can fail with TDX_OPERAND_BUSY when it fails to
> * grab. Multiple TDX guests can be destroyed simultaneously.
> * Take the mutex to prevent it from getting error.
> */
> > + mutex_lock(&tdx_lock);
> > +
> > + /*
> > + * Go through multiple TDX HKID state transitions with three SEAMCALLs
> > + * to make TDH.PHYMEM.PAGE.RECLAIM() usable.
>
>
> What is "TDX HKID state transitions"? Not mentioned before, so needs
> explanation _if_ you want to say this.

Ok.
> And what are the three "SEAMCALLs"? Where are they? The only _two_
> SEAMCALLs that I can see here are: TDH.PHYMEM.CACHE.WB and
> TDH.MNG.KEY.FREEID.

tdh_mng_vpflushdone(). I'll list all three in the comment. It may not map
cleanly onto the HKID state machine, though.


>
> Make the transition atomic
> > + * to other functions to operate private pages and Secure-EPT pages.
>
> What's the consequence to "other functions" if we don't make it atomic here?

Other threads can be removing pages from the TD. If the HKID is freed, a
thread looping to remove pages can hit an error.

TDH.MEM.SEPT.REMOVE(), TDH.MEM.PAGE.REMOVE() can fail with
TDX_OP_STATE_INCORRECT when HKID is not assigned.

When HKID is freed, we need to use TDH.PHYMEM.PAGE.RECLAIM().
TDH.PHYMEM.PAGE.RECLAIM() fails with TDX_LIFECYCLE_STATE_INCORRECT when
HKID isn't freed.

How about this?

/*
 * Three SEAMCALLs are needed to free the HKID: TDH.MNG.VPFLUSHDONE(),
 * TDH.PHYMEM.CACHE.WB(), and TDH.MNG.KEY.FREEID().
 * Other threads can remove pages from the TD. While the HKID is assigned,
 * pages must be removed with TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE();
 * once the HKID is freed, TDH.PHYMEM.PAGE.RECLAIM() must be used instead.
 * Take the lock so that no thread observes a transient HKID state.
 */


> > + *
> > + * Avoid race for kvm_gmem_release() to call kvm_mmu_unmap_gfn_range().
> > + * This function is called via mmu notifier, mmu_release().
> > + * kvm_gmem_release() is called via fput() on process exit.
> > + */
> > + write_lock(&kvm->mmu_lock);
>
> I don't fully get the race here, but it seems strange that this function is
> called via mmu notifier.
>
> IIUC, this function is "supposedly" only be called when we tear down the VM,

That's correct. The hook used when destroying the VM is the mmu notifier's
mmu_release(). It's called on behalf of mmput(). Because other components
like vhost-net can increment the mm reference count, the mmu_release
notifier can be triggered by a thread other than the guest VM's.


> so I don't know why there's such race.

When guest_memfd is released, the private memory is unmapped. The guest
VM's thread can issue an exit that closes guest_memfd while another
thread triggers the guest VM's mmu notifier.

Also, if there are multiple fds for the same guest_memfd, the last file
closure can happen in the context of the guest VM or of another process.


> > +
> > + for_each_online_cpu(i) {
> > + if (packages_allocated &&
> > + cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > + packages))
> > + continue;
> > + if (targets_allocated)
> > + cpumask_set_cpu(i, targets);
> > + }
> > + if (targets_allocated)
> > + on_each_cpu_mask(targets, tdx_do_tdh_phymem_cache_wb, NULL, true);
> > + else
> > + on_each_cpu(tdx_do_tdh_phymem_cache_wb, NULL, true);
>
> I don't understand the logic here -- no comments whatever.
>
> But I am 99% sure the logic here could be simplified.

Yes, as Chao suggested, I'll use global variables for those cpumasks.
https://lore.kernel.org/kvm/ZfpwIespKy8qxWWE@chao-email/


> > + /*
> > + * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
> > + * tdh_mng_key_freeid() will fail.
> > + */
> > + err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> > + if (WARN_ON_ONCE(err)) {
>
> I see KVM_BUG_ON() is normally used for SEAMCALL error. Why this uses
> WARN_ON_ONCE() here?

Because the vm_free() hook is (one of) the final steps in freeing struct
kvm, no one else touches this kvm. Since it does no harm to use
KVM_BUG_ON() here, I'll change it for consistency.


> > + pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> > + pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
> > + kvm_tdx->hkid);
> > + } else
> > + tdx_hkid_free(kvm_tdx);
> > +
> > + write_unlock(&kvm->mmu_lock);
> > + mutex_unlock(&tdx_lock);
> > + cpus_read_unlock();
> > + free_cpumask_var(targets);
> > + free_cpumask_var(packages);
> > +}
> > +
> > +void tdx_vm_free(struct kvm *kvm)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + u64 err;
> > + int i;
> > +
> > + /*
> > + * tdx_mmu_release_hkid() failed to reclaim HKID. Something went wrong
> > + * heavily with TDX module. Give up freeing TD pages. As the function
> > + * already warned, don't warn it again.
> > + */
> > + if (is_hkid_assigned(kvm_tdx))
> > + return;
> > +
> > + if (kvm_tdx->tdcs_pa) {
> > + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > + if (kvm_tdx->tdcs_pa[i])
> > + tdx_reclaim_control_page(kvm_tdx->tdcs_pa[i]);
>
> AFAICT, here tdcs_pa[i] cannot be NULL, right? How about:

If tdh_mng_addcx() fails in the middle of 0 < i < nr_tdcs_pages, tdcs_pa[i] can
be NULL. The current allocation/free code is unnecessarily convoluted. I'll
clean it up.


> if (!WARN_ON_ONCE(!kvm_tdx->tdcs_pa[i]))
> continue;
>
> tdx_reclaim_control_page(...);
>
> which at least saves you some indent.
>
> Btw, does it make sense to stop if any tdx_reclaim_control_page() fails?

It doesn't matter much in practice because it's unlikely that only some of
the TDCS pages hit an error. So I chose to make it return void and skip
the error check in the caller.


> It's OK to continue, but perhaps worth to add a comment to point out:
>
> /*
> * Continue to reclaim other control pages and
> * TDR page, even failed to reclaim one control
> * page. Do the best to reclaim these TDX
> * private pages.
> */
> tdx_reclaim_control_page();

Sure, it will make the intention clear.


> > + }
> > + kfree(kvm_tdx->tdcs_pa);
> > + kvm_tdx->tdcs_pa = NULL;
> > + }
> > +
> > + if (!kvm_tdx->tdr_pa)
> > + return;
> > + if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
> > + return;
> > + /*
> > + * TDX module maps TDR with TDX global HKID. TDX module may access TDR
> > + * while operating on TD (Especially reclaiming TDCS). Cache flush with
> > + * TDX global HKID is needed.
> > + */
>
> "Especially reclaiming TDCS" -> "especially when it is reclaiming TDCS".
>
> Use imperative mode to describe your change:
>
> Use the SEAMCALL to ask the TDX module to flush the cache of it using the
> global KeyID.
>
> > + err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
> > + tdx_global_keyid));
> > + if (WARN_ON_ONCE(err)) {
>
> Again, KVM_BUG_ON()?
>
> Should't matter, though.

Ok, let's use KVM_BUG_ON() consistently.



> > + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> > + return;
> > + }
> > + tdx_clear_page(kvm_tdx->tdr_pa);
> > +
> > + free_page((unsigned long)__va(kvm_tdx->tdr_pa));
> > + kvm_tdx->tdr_pa = 0;
> > +}
> > +
> > +static int tdx_do_tdh_mng_key_config(void *param)
> > +{
> > + hpa_t *tdr_p = param;
> > + u64 err;
> > +
> > + do {
> > + err = tdh_mng_key_config(*tdr_p);
> > +
> > + /*
> > + * If it failed to generate a random key, retry it because this
> > + * is typically caused by an entropy error of the CPU's random
> > + * number generator.
> > + */
> > + } while (err == TDX_KEY_GENERATION_FAILED);
>
> If you want to handle TDX_KEY_GENERTION_FAILED, it's better to have a retry
> limit similar to the TDX host code does.

Ok, although it complicates the error recovery path, let me update it.


> > +
> > + if (WARN_ON_ONCE(err)) {
>
> KVM_BUG_ON()?
>
> > + pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> > + return -EIO;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int __tdx_td_init(struct kvm *kvm);
> > +
> > +int tdx_vm_init(struct kvm *kvm)
> > +{
> > + /*
> > + * TDX has its own limit of the number of vcpus in addition to
> > + * KVM_MAX_VCPUS.
> > + */
> > + kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
>
> I believe this should be part of the patch that handles KVM_CAP_MAX_VCPUS.

Ok.
--
Isaku Yamahata <[email protected]>

2024-03-27 22:57:11

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 044/130] KVM: TDX: Do TDX specific vcpu initialization

On Wed, Mar 27, 2024 at 12:27:03AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > +/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
> > +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > +{
> > +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +       unsigned long *tdvpx_pa = NULL;
> > +       unsigned long tdvpr_pa;
>
>
> I think we could drop these local variables and just use tdx->tdvpr_pa and tdx->tdvpx_pa. Then we
> don't have to have the assignments later.

Yes, let me clean it up. The old version acquired a spin lock in the middle. Now
we don't have it.


> > +       unsigned long va;
> > +       int ret, i;
> > +       u64 err;
> > +
> > +       if (is_td_vcpu_created(tdx))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * vcpu_free method frees allocated pages.  Avoid partial setup so
> > +        * that the method can't handle it.
> > +        */
> > +       va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +       if (!va)
> > +               return -ENOMEM;
> > +       tdvpr_pa = __pa(va);
> > +
> > +       tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
> > +                          GFP_KERNEL_ACCOUNT);
> > +       if (!tdvpx_pa) {
> > +               ret = -ENOMEM;
> > +               goto free_tdvpr;
> > +       }
> > +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> > +               va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +               if (!va) {
> > +                       ret = -ENOMEM;
> > +                       goto free_tdvpx;
> > +               }
> > +               tdvpx_pa[i] = __pa(va);
> > +       }
> > +
> > +       err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> > +       if (KVM_BUG_ON(err, vcpu->kvm)) {
> > +               ret = -EIO;
> > +               pr_tdx_error(TDH_VP_CREATE, err, NULL);
> > +               goto free_tdvpx;
> > +       }
> > +       tdx->tdvpr_pa = tdvpr_pa;
> > +
> > +       tdx->tdvpx_pa = tdvpx_pa;
>
> Or alternatively let's move these to right before they are used. (in the current branch
>
> > +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> > +               err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> > +               if (KVM_BUG_ON(err, vcpu->kvm)) {
> > +                       pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> > +                       for (; i < tdx_info->nr_tdvpx_pages; i++) {
> > +                               free_page((unsigned long)__va(tdvpx_pa[i]));
> > +                               tdvpx_pa[i] = 0;
> > +                       }
> > +                       /* vcpu_free method frees TDVPX and TDR donated to TDX */
> > +                       return -EIO;
> > +               }
> > +       }
> >
> >
> In the current branch tdh_vp_init() takes struct vcpu_tdx, so they would be moved right here.
>
> What do you think?

Yes, I should revise the error recovery path.
--
Isaku Yamahata <[email protected]>

2024-03-27 23:34:06

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

.. to continue the previous review ...

>
> +static int __tdx_td_init(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + cpumask_var_t packages;
> + unsigned long *tdcs_pa = NULL;
> + unsigned long tdr_pa = 0;
> + unsigned long va;
> + int ret, i;
> + u64 err;
> +
> + ret = tdx_guest_keyid_alloc();
> + if (ret < 0)
> + return ret;
> + kvm_tdx->hkid = ret;
> +
> + va = __get_free_page(GFP_KERNEL_ACCOUNT);
> + if (!va)
> + goto free_hkid;
> + tdr_pa = __pa(va);
> +
> + tdcs_pa = kcalloc(tdx_info->nr_tdcs_pages, sizeof(*kvm_tdx->tdcs_pa),
> + GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> + if (!tdcs_pa)
> + goto free_tdr;

Empty line.

> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> + va = __get_free_page(GFP_KERNEL_ACCOUNT);
> + if (!va)
> + goto free_tdcs;
> + tdcs_pa[i] = __pa(va);
> + }
> +
> + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> + ret = -ENOMEM;
> + goto free_tdcs;
> + }

Empty line.

> + cpus_read_lock();
> + /*
> + * Need at least one CPU of the package to be online in order to
> + * program all packages for host key id. Check it.
> + */
> + for_each_present_cpu(i)
> + cpumask_set_cpu(topology_physical_package_id(i), packages);
> + for_each_online_cpu(i)
> + cpumask_clear_cpu(topology_physical_package_id(i), packages);
> + if (!cpumask_empty(packages)) {
> + ret = -EIO;
> + /*
> + * Because it's hard for human operator to figure out the
> + * reason, warn it.
> + */
> +#define MSG_ALLPKG "All packages need to have online CPU to create TD. Online CPU and retry.\n"
> + pr_warn_ratelimited(MSG_ALLPKG);
> + goto free_packages;
> + }
> +
> + /*
> + * Acquire global lock to avoid TDX_OPERAND_BUSY:
> + * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
> + * Table (KOT) to track the assigned TDX private HKID. It doesn't spin
> + * to acquire the lock, returns TDX_OPERAND_BUSY instead, and let the
> + * caller to handle the contention. This is because of time limitation
> + * usable inside the TDX module and OS/VMM knows better about process
> + * scheduling.
> + *
> + * APIs to acquire the lock of KOT:
> + * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
> + * TDH.PHYMEM.CACHE.WB.
> + */

Don't need to mention all SEAMCALLs here, but put a comment where
applicable, i.e., where they are used.

/*
* TDH.MNG.CREATE tries to grab the global TDX module lock and
* fails with TDX_OPERAND_BUSY when it cannot. Take the global
* lock to prevent that failure.
*/
> + mutex_lock(&tdx_lock);
> + err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
> + mutex_unlock(&tdx_lock);

Empty line.

> + if (err == TDX_RND_NO_ENTROPY) {
> + ret = -EAGAIN;
> + goto free_packages;
> + }
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> + ret = -EIO;
> + goto free_packages;
> + }

I would prefer more empty lines.

> + kvm_tdx->tdr_pa = tdr_pa;
> +
> + for_each_online_cpu(i) {
> + int pkg = topology_physical_package_id(i);
> +
> + if (cpumask_test_and_set_cpu(pkg, packages))
> + continue;
> +
> + /*
> + * Program the memory controller in the package with an
> + * encryption key associated to a TDX private host key id
> + * assigned to this TDR. Concurrent operations on same memory
> + * controller results in TDX_OPERAND_BUSY. Avoid this race by
> + * mutex.
> + */

IIUC the race can only happen when you are creating multiple TDX guests
simultaneously? Please clarify this in the comment.

And I even don't think you need all these TDX module details:

/*
* Concurrent run of TDH.MNG.KEY.CONFIG on the same
* package results in TDX_OPERAND_BUSY. When creating
* multiple TDX guests simultaneously this can run
* concurrently. Take the per-package lock to
* serialize.
*/
> + mutex_lock(&tdx_mng_key_config_lock[pkg]);
> + ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> + &kvm_tdx->tdr_pa, true);
> + mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> + if (ret)
> + break;
> + }
> + cpus_read_unlock();
> + free_cpumask_var(packages);
> + if (ret) {
> + i = 0;
> + goto teardown;
> + }
> +
> + kvm_tdx->tdcs_pa = tdcs_pa;
> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> + err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> + if (err == TDX_RND_NO_ENTROPY) {
> + /* Here it's hard to allow userspace to retry. */
> + ret = -EBUSY;
> + goto teardown;
> + }
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> + ret = -EIO;
> + goto teardown;
> + }
> + }
> +
> + /*
> + * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> + * ioctl() to define the configure CPUID values for the TD.
> + */

Then, how about renaming this function to __tdx_td_create()?

> + return 0;
> +
> + /*
> + * The sequence for freeing resources from a partially initialized TD
> + * varies based on where in the initialization flow failure occurred.
> + * Simply use the full teardown and destroy, which naturally play nice
> + * with partial initialization.
> + */
> +teardown:
> + for (; i < tdx_info->nr_tdcs_pages; i++) {
> + if (tdcs_pa[i]) {
> + free_page((unsigned long)__va(tdcs_pa[i]));
> + tdcs_pa[i] = 0;
> + }
> + }
> + if (!kvm_tdx->tdcs_pa)
> + kfree(tdcs_pa);

The code to "free TDCS pages in a loop and free the array" is done below
with duplicated code. I am wondering whether we have way to eliminate one.

But I have lost track here, so perhaps we can review again after we
split the patch to smaller pieces.

> + tdx_mmu_release_hkid(kvm);
> + tdx_vm_free(kvm);
> + return ret;
> +
> +free_packages:
> + cpus_read_unlock();
> + free_cpumask_var(packages);
> +free_tdcs:
> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> + if (tdcs_pa[i])
> + free_page((unsigned long)__va(tdcs_pa[i]));
> + }
> + kfree(tdcs_pa);
> + kvm_tdx->tdcs_pa = NULL;
> +
> +free_tdr:
> + if (tdr_pa)
> + free_page((unsigned long)__va(tdr_pa));
> + kvm_tdx->tdr_pa = 0;
> +free_hkid:
> + if (is_hkid_assigned(kvm_tdx))
> + tdx_hkid_free(kvm_tdx);
> + return ret;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -215,12 +664,13 @@ static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
>
> static int __init tdx_module_setup(void)
> {
> - u16 num_cpuid_config;
> + u16 num_cpuid_config, tdcs_base_size;
> int ret;
> u32 i;
>
> struct tdx_md_map mds[] = {
> TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> + TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
> };
>
> struct tdx_metadata_field_mapping fields[] = {
> @@ -273,6 +723,8 @@ static int __init tdx_module_setup(void)
> c->edx = ecx_edx >> 32;
> }
>
> + tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
> +

Round up 'tdcs_base_size' to make sure you have enough room, or add a WARN()
here if it is not page-aligned?

> return 0;
>
> error_out:
> @@ -319,13 +771,27 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> struct tdx_enabled enable = {
> .err = ATOMIC_INIT(0),
> };
> + int max_pkgs;
> int r = 0;
> + int i;

Nit: you can put these three declarations on one line.

>
> + if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> + pr_warn("MOVDIR64B is required for TDX\n");

It's better to make it more clear:

"Disable TDX: MOVDIR64B is not supported or disabled by the kernel."

Or, to match below:

"Cannot enable TDX w/o MOVDIR64B".

> + return -EOPNOTSUPP;
> + }
> if (!enable_ept) {
> pr_warn("Cannot enable TDX with EPT disabled\n");
> return -EINVAL;
> }
>
> + max_pkgs = topology_max_packages();
> + tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> + GFP_KERNEL);
> + if (!tdx_mng_key_config_lock)
> + return -ENOMEM;
> + for (i = 0; i < max_pkgs; i++)
> + mutex_init(&tdx_mng_key_config_lock[i]);
> +

Using a per-socket lock looks a little bit overkill to me. I don't know
whether we need to do this in the initial version. Will leave it to others.

Please at least add a comment to explain that this is for better performance
when creating multiple TDX guests, IIUC?

> if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
> r = -ENOMEM;
> goto out;
> @@ -350,4 +816,5 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> void tdx_hardware_unsetup(void)
> {
> kfree(tdx_info);
> + kfree(tdx_mng_key_config_lock);

The kernel actually has a mutex_destroy(). It is empty when
CONFIG_DEBUG_LOCK_ALLOC is off, but I think it should be standard
procedure to also call mutex_destroy()?

[...]

2024-03-27 23:53:06

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 046/130] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA

On Wed, Mar 27, 2024 at 10:09:21PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 3/27/2024 11:08 AM, Chenyi Qiang wrote:
> >
> > On 2/26/2024 4:25 PM, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > TDX repurposes one GPA bit (51 bit or 47 bit based on configuration) to
> > > indicate the GPA is private (if cleared) or shared (if set) with VMM. If
> > > GPA.shared is set, GPA is covered by the existing conventional EPT pointed
> > > by EPTP. If GPA.shared bit is cleared, GPA is covered by TDX module.
> > > VMM has to issue SEAMCALLs to operate.
> > >
> > > Add a member to remember GPA shared bit for each guest TDs, add address
> > > conversion functions between private GPA and shared GPA and test if GPA
> > > is private.
> > >
> > > Because struct kvm_arch (or struct kvm which includes struct kvm_arch. See
> > > kvm_arch_alloc_vm() that passes __GFP_ZERO) is zero-cleared when allocated,
> > > the new member to remember GPA shared bit is guaranteed to be zero with
> > > this patch unless it's initialized explicitly.
> > >
> > > default or SEV-SNP TDX: S = (47 or 51) - 12
> > > gfn_shared_mask 0 S bit
> > > kvm_is_private_gpa() always false true if GFN has S bit set
> > TDX: true if GFN has S bit clear?
> >
> > > kvm_gfn_to_shared() nop set S bit
> > > kvm_gfn_to_private() nop clear S bit
> > >
> > > fault.is_private means that host page should be gotten from guest_memfd
> > > is_private_gpa() means that KVM MMU should invoke private MMU hooks.
> > >
> > > Co-developed-by: Rick Edgecombe <[email protected]>
> > > Signed-off-by: Rick Edgecombe <[email protected]>
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > Reviewed-by: Binbin Wu <[email protected]>
> > > ---
> > > v19:
> > > - Add comment on default vm case.
> > > - Added behavior table in the commit message
> > > - drop CONFIG_KVM_MMU_PRIVATE
> > >
> > > v18:
> > > - Added Reviewed-by Binbin
> > >
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++
> > > arch/x86/kvm/vmx/tdx.c | 5 +++++
> > > 3 files changed, 40 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 5da3c211955d..de6dd42d226f 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1505,6 +1505,8 @@ struct kvm_arch {
> > > */
> > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > > struct kvm_mmu_memory_cache split_desc_cache;
> > > +
> > > + gfn_t gfn_shared_mask;
> > > };
> > > struct kvm_vm_stat {
> > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > > index d96c93a25b3b..395b55684cb9 100644
> > > --- a/arch/x86/kvm/mmu.h
> > > +++ b/arch/x86/kvm/mmu.h
> > > @@ -322,4 +322,37 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> > > return gpa;
> > > return translate_nested_gpa(vcpu, gpa, access, exception);
> > > }
> > > +
> > > +/*
> > > + * default or SEV-SNP TDX: where S = (47 or 51) - 12
> > > + * gfn_shared_mask 0 S bit
> > > + * is_private_gpa() always false if GPA has S bit set
>
> Also here,
> TDX: true if GFN has S bit cleared

Oops. Will fix both.
--
Isaku Yamahata <[email protected]>

2024-03-27 23:58:50

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 052/130] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis

On Tue, Mar 26, 2024 at 03:31:05AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX will use a different shadow PTE entry value for MMIO from VMX.  Add
> > members to kvm_arch and track value for MMIO per-VM instead of global
> > variables.  By using the per-VM EPT entry value for MMIO, the existing VMX
> > logic is kept working.  Introduce a separate setter function so that guest
> > TD can override later.
> >
> > Also require mmio spte caching for TDX.
>
>
> > Actually this is true case
> > because TDX requires EPT and KVM EPT allows mmio spte caching.
> >
>
> I can't understand what this is trying to say.

I'll drop this sentence as the logic moved to
"069/130 KVM: TDX: Require TDP MMU and mmio caching for TDX".


> >  {
> > +
> > +       kvm->arch.shadow_mmio_value = shadow_mmio_value;
>
> It could use kvm_mmu_set_mmio_spte_value()?

Yes.
--
Isaku Yamahata <[email protected]>

2024-03-28 00:02:26

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Wed, Mar 27, 2024 at 09:49:14PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 3/15/2024 9:09 AM, Isaku Yamahata wrote:
> > Here is the updated one. Renamed dummy -> mirrored.
> >
> > When KVM resolves the KVM page fault, it walks the page tables. To reuse
> > the existing KVM MMU code and mitigate the heavy cost of directly walking
> > the private page table, allocate one more page to copy the mirrored page
>
> Here "copy" is a bit confusing for me.
> The mirrored page table is maintained by KVM, not copied from anywhere.

How about "maintain" or "keep"?

>
> > table for the KVM MMU code to directly walk. Resolve the KVM page fault
> > with the existing code, and do additional operations necessary for the
> > private page table. To distinguish such cases, the existing KVM page table
> > is called a shared page table (i.e., not associated with a private page
> > table), and the page table with a private page table is called a mirrored
> > page table. The relationship is depicted below.
> >
> >
> > KVM page fault |
> > | |
> > V |
> > -------------+---------- |
> > | | |
> > V V |
> > shared GPA private GPA |
> > | | |
> > V V |
> > shared PT root mirrored PT root | private PT root
> > | | | |
> > V V | V
> > shared PT mirrored PT ----propagate----> private PT
> > | | | |
> > | \-----------------+------\ |
> > | | | |
> > V | V V
> > shared guest page | private guest page
> > |
> > non-encrypted memory | encrypted memory
> > |
> > PT: Page table
> > Shared PT: visible to KVM, and the CPU uses it for shared mappings.
> > Private PT: the CPU uses it, but it is invisible to KVM. TDX module
> > updates this table to map private guest pages.
> > Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> > to propagate PT change to the actual private PT.
> >
>
>

--
Isaku Yamahata <[email protected]>

2024-03-28 00:06:26

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Wed, Mar 27, 2024 at 05:36:07PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Wed, 2024-03-27 at 10:54 +0800, Xiaoyao Li wrote:
> > > > If QEMU doesn't configure the msr filter list correctly, KVM has to handle
> > > > guest's MTRR MSR accesses. In my understanding, the suggestion is KVM zap
> > > > private memory mappings. But guests won't accept memory again because no one
> > > > currently requests guests to do this after writes to MTRR MSRs. In this case,
> > > > guests may access unaccepted memory, causing infinite EPT violation loop
> > > > (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
> > > > the host. But I think it would be better if we can avoid wasting CPU resource
> > > > on the useless EPT violation loop.
> > >
> > > Qemu is expected to do it correctly.  There are many ways for userspace to go
> > > wrong.  This isn't specific to MTRR MSR.
> >
> > This seems incorrect. KVM shouldn't force userspace to filter some
> > specific MSRs. The semantic of MSR filter is userspace configures it on
> > its own will, not KVM requires to do so.
>
> I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on the
> MSR list. At least I don't see the problem.

KVM doesn't force it. KVM allows QEMU to use the MSR filter for TDX.
(v19 doesn't allow it.) If QEMU chooses to use the MSR filter, QEMU has to
handle the MSR access correctly.
--
Isaku Yamahata <[email protected]>

2024-03-28 00:07:12

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 1:36 AM, Edgecombe, Rick P wrote:
> On Wed, 2024-03-27 at 10:54 +0800, Xiaoyao Li wrote:
>>>> If QEMU doesn't configure the msr filter list correctly, KVM has to handle
>>>> guest's MTRR MSR accesses.
>>>> In my understanding, the suggestion is KVM zap private memory mappings.

TDX spec states that

18.2.1.4.1 Memory Type for Private and Opaque Access

The memory type for private and opaque access semantics, which use a
private HKID, is WB.

18.2.1.4.2 Memory Type for Shared Accesses

Intel SDM, Vol. 3, 28.2.7.2 Memory Type Used for Translated Guest-
Physical Addresses

The memory type for shared access semantics, which use a shared HKID,
is determined as described below. Note that this is different from the
way memory type is determined by the hardware during non-root mode
operation. Rather, it is a best-effort approximation that is designed
to still allow the host VMM some control over memory type.
• For shared access during host-side (SEAMCALL) flows, the memory
type is determined by MTRRs.
• For shared access during guest-side flows (VM exit from the guest
TD), the memory type is determined by a combination of the Shared
EPT and MTRRs.
o If the memory type determined during Shared EPT walk is WB, then
the effective memory type for the access is determined by MTRRs.
o Else, the effective memory type for the access is UC.

My understanding is that guest MTRR doesn't affect the memory type for
private memory. So we don't need to zap private memory mappings.

>>>> But guests won't accept memory again because no one
>>>> currently requests guests to do this after writes to MTRR MSRs. In this case,
>>>> guests may access unaccepted memory, causing infinite EPT violation loop
>>>> (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
>>>> the host. But I think it would be better if we can avoid wasting CPU resource
>>>> on the useless EPT violation loop.
>>>
> > > Qemu is expected to do it correctly.  There are many ways for userspace to go
>>> wrong.  This isn't specific to MTRR MSR.
>>
>> This seems incorrect. KVM shouldn't force userspace to filter some
>> specific MSRs. The semantic of MSR filter is userspace configures it on
>> its own will, not KVM requires to do so.
>
> I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on the
> MSR list. At least I don't see the problem.

What is the exit reason in vcpu->run->exit_reason?
KVM_EXIT_X86_RDMSR/WRMSR? If so, it breaks the ABI on
KVM_EXIT_X86_RDMSR/WRMSR.

2024-03-28 00:27:50

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/26/2024 3:55 AM, Edgecombe, Rick P wrote:
> On Mon, 2024-03-25 at 12:05 -0700, Isaku Yamahata wrote:
>> Right, the guest has to accept it on VE.  If the unmap was intentional by guest,
>> that's fine.  The unmap is unintentional (with vMTRR), the guest doesn't expect
>> VE with the GPA.
>>
>>
>>> But, I guess we should punt to userspace is the guest tries to use
>>> MTRRs, not that userspace can handle it happening in a TD...  But it
>>> seems cleaner and safer then skipping zapping some pages inside the
>>> zapping code.
>>>
>>> I'm still not sure if I understand the intention and constraints fully.
>>> So please correct. This (the skipping the zapping for some operations)
>>> is a theoretical correctness issue right? It doesn't resolve a TD
>>> crash?
>>
>> For lapic, it's safe guard. Because TDX KVM disables APICv with
>> APICV_INHIBIT_REASON_TDX, apicv won't call kvm_zap_gfn_range().
> Ah, I see it:
> https://lore.kernel.org/lkml/38e2f8a77e89301534d82325946eb74db3e47815.1708933498.git.isaku.yamahata@intel.com/
>
> Then it seems a warning would be more appropriate if we are worried there might be a way to still
> call it. If we are confident it can't, then we can just ignore this case.
>
>>
>> For MTRR, the purpose is to make the guest boot (without the guest kernel
>> command line like clearcpuid=mtrr) .
>> If we can assume the guest won't touch MTRR registers somehow, KVM can return an
>> error to TDG.VP.VMCALL<RDMSR, WRMSR>(MTRR registers). So it doesn't call
>> kvm_zap_gfn_range(). Or we can use KVM_EXIT_X86_{RDMSR, WRMSR} as you suggested.
>
> My understanding is that Sean prefers to exit to userspace when KVM can't handle something, versus
> making up behavior that keeps known guests alive. So I would think we should change this patch to
> only be about not using the zapping roots optimization. Then a separate patch should exit to
> userspace on attempt to use MTRRs. And we ignore the APIC one.

Certainly not. If we exit to userspace, what is the exit reason, and what is
userspace expected to do? Userspace can do nothing except either
kill the TD or eat the RDMSR/WRMSR.

There is nothing userspace can do here. MTRR is virtualized as fixed1 for a
TD (by the current TDX architecture). Userspace can do nothing about it, and
it's not userspace's fault that the TD guest manipulates MTRR MSRs.

This is a flaw in the current TDX design; what KVM should do is return an
error to the TD on TDVMCALL of WR/RDMSR on MTRR MSRs. This should be a known
flaw of TDX: MTRR is not supported even though the TD guest reads the MTRR
CPUID as 1.

This flaw should be fixed by the TDX architecture making MTRR
configurable. At that time, userspace is responsible for setting an MSR filter
on MTRR MSRs if it wants to configure the MTRR CPUID to 1.

> This is trying to guess what maintainers would want here. I'm less sure what Paolo prefers.


2024-03-28 00:37:11

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Wed, Mar 27, 2024 at 09:07:21PM +0800,
Chao Gao <[email protected]> wrote:

> > if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> >@@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > };
> >
> > WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> >+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
>
> Could you clarify when shared bits need to be masked out or kept? shared bits
> are masked out here but kept in the hunk right above and ..

Sure, it deserves a comment. I'll add one.

When we get the pfn, in kvm_faultin_pfn() or when looping over KVM memslots,
drop the shared bits because KVM memslots don't know about the shared bit.

When walking the EPT tables, keep the shared bit because we need to find the EPT
entry including the shared bit.



> >+++ b/arch/x86/kvm/mmu/tdp_iter.h
> >@@ -91,7 +91,7 @@ struct tdp_iter {
> > tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > /* A pointer to the current SPTE */
> > tdp_ptep_t sptep;
> >- /* The lowest GFN mapped by the current SPTE */
> >+ /* The lowest GFN (shared bits included) mapped by the current SPTE */
> > gfn_t gfn;
>
> .. in @gfn of tdp_iter.
>
> > /* The level of the root page given to the iterator */
> > int root_level;
>
>
> >
> >-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >+static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
>
> Maybe fold it into its sole caller.

Sure.


>
> >+ bool private)
> > {
> > union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> > struct kvm *kvm = vcpu->kvm;
> >@@ -221,6 +225,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > * Check for an existing root before allocating a new one. Note, the
> > * role check prevents consuming an invalid root.
> > */
> >+ if (private)
> >+ kvm_mmu_page_role_set_private(&role);
> > for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> > if (root->role.word == role.word &&
> > kvm_tdp_mmu_get_root(root))
> >@@ -244,12 +250,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> > out:
> >- return __pa(root->spt);
> >+ return root;
> >+}
> >+
> >+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
> >+{
> >+ return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
> > }
> >
> > static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> >- u64 old_spte, u64 new_spte, int level,
> >- bool shared);
> >+ u64 old_spte, u64 new_spte,
> >+ union kvm_mmu_page_role role, bool shared);
> >
> > static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> >@@ -376,12 +387,78 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> > REMOVED_SPTE, level);
> > }
> > handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> >- old_spte, REMOVED_SPTE, level, shared);
> >+ old_spte, REMOVED_SPTE, sp->role,
> >+ shared);
> >+ }
> >+
> >+ if (is_private_sp(sp) &&
> >+ WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
>
> WARN_ON_ONCE()?
>
> >+ kvm_mmu_private_spt(sp)))) {
> >+ /*
> >+ * Failed to unlink Secure EPT page and there is nothing to do
> >+ * further. Intentionally leak the page to prevent the kernel
> >+ * from accessing the encrypted page.
> >+ */
> >+ kvm_mmu_init_private_spt(sp, NULL);
> > }
> >
> > call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > }
> >
>
> > rcu_read_lock();
> >
> > for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> >@@ -960,10 +1158,26 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >
> > if (unlikely(!fault->slot))
> > new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> >- else
> >- wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> >- fault->pfn, iter->old_spte, fault->prefetch, true,
> >- fault->map_writable, &new_spte);
> >+ else {
> >+ unsigned long pte_access = ACC_ALL;
> >+ gfn_t gfn = iter->gfn;
> >+
> >+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
> >+ if (fault->is_private)
> >+ gfn |= kvm_gfn_shared_mask(vcpu->kvm);
>
> this is an open-coded kvm_gfn_to_shared().
>
> I don't get why a spte is installed for a shared gfn when fault->is_private
> is true. could you elaborate?

This is stale code. And you're right. I'll remove this part.


> >+ else
> >+ /*
> >+ * TDX shared GPAs are not executable, enforce
> >+ * this for the SDV.
> >+ */
>
> what do you mean by the SDV?

That's development nonsense. I'll remove the second sentence.


> >+ pte_access &= ~ACC_EXEC_MASK;
> >+ }
> >+
> >+ wrprot = make_spte(vcpu, sp, fault->slot, pte_access, gfn,
> >+ fault->pfn, iter->old_spte,
> >+ fault->prefetch, true, fault->map_writable,
> >+ &new_spte);
> >+ }
> >
> > if (new_spte == iter->old_spte)
> > ret = RET_PF_SPURIOUS;
> >@@ -1041,6 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > struct kvm *kvm = vcpu->kvm;
> > struct tdp_iter iter;
> > struct kvm_mmu_page *sp;
> >+ gfn_t raw_gfn;
> >+ bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> > int ret = RET_PF_RETRY;
> >
> > kvm_mmu_hugepage_adjust(vcpu, fault);
> >@@ -1049,7 +1265,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >
> > rcu_read_lock();
> >
> >- tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> >+ raw_gfn = gpa_to_gfn(fault->addr);
> >+
> >+ if (is_error_noslot_pfn(fault->pfn) ||
> >+ !kvm_pfn_to_refcounted_page(fault->pfn)) {
> >+ if (is_private) {
> >+ rcu_read_unlock();
> >+ return -EFAULT;
>
> This needs a comment. why this check is necessary? does this imply some
> kernel bugs?

Will add a comment. It's due to the current TDX KVM implementation that
increments the page refcount.
--
Isaku Yamahata <[email protected]>

2024-03-28 00:38:00

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, Mar 28, 2024 at 08:06:53AM +0800,
Xiaoyao Li <[email protected]> wrote:

> On 3/28/2024 1:36 AM, Edgecombe, Rick P wrote:
> > On Wed, 2024-03-27 at 10:54 +0800, Xiaoyao Li wrote:
> > > > > If QEMU doesn't configure the msr filter list correctly, KVM has to handle
> > > > > guest's MTRR MSR accesses. In my understanding, the
> > > > > suggestion is KVM zap private memory mappings.
>
> TDX spec states that
>
> 18.2.1.4.1 Memory Type for Private and Opaque Access
>
> The memory type for private and opaque access semantics, which use a
> private HKID, is WB.
>
> 18.2.1.4.2 Memory Type for Shared Accesses
>
> Intel SDM, Vol. 3, 28.2.7.2 Memory Type Used for Translated Guest-
> Physical Addresses
>
> The memory type for shared access semantics, which use a shared HKID,
> is determined as described below. Note that this is different from the
> way memory type is determined by the hardware during non-root mode
> operation. Rather, it is a best-effort approximation that is designed
> to still allow the host VMM some control over memory type.
> • For shared access during host-side (SEAMCALL) flows, the memory
> type is determined by MTRRs.
> • For shared access during guest-side flows (VM exit from the guest
> TD), the memory type is determined by a combination of the Shared
> EPT and MTRRs.
> o If the memory type determined during Shared EPT walk is WB, then
> the effective memory type for the access is determined by MTRRs.
> o Else, the effective memory type for the access is UC.
>
> My understanding is that guest MTRR doesn't affect the memory type for
> private memory. So we don't need to zap private memory mappings.

So there is no point in (trying to) emulate MTRR. The direction is: don't
advertise MTRR to the guest (a new TDX module is needed), or force
the guest not to use MTRR (guest command line clearcpuid=mtrr). KVM will
simply return an error on guest access to MTRR-related registers.

QEMU or user space VMM can use the MSR filter if they want.


> > > > > But guests won't accept memory again because no one
> > > > > currently requests guests to do this after writes to MTRR MSRs. In this case,
> > > > > guests may access unaccepted memory, causing infinite EPT violation loop
> > > > > (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
> > > > > the host. But I think it would be better if we can avoid wasting CPU resource
> > > > > on the useless EPT violation loop.
> > > >
> > > > Qemu is expected to do it correctly.  There are many ways for userspace to go
> > > > wrong.  This isn't specific to MTRR MSR.
> > >
> > > This seems incorrect. KVM shouldn't force userspace to filter some
> > > specific MSRs. The semantic of MSR filter is userspace configures it on
> > > its own will, not KVM requires to do so.
> >
> > I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on the
> > MSR list. At least I don't see the problem.
>
> What is the exit reason in vcpu->run->exit_reason? KVM_EXIT_X86_RDMSR/WRMSR?
> If so, it breaks the ABI on KVM_EXIT_X86_RDMSR/WRMSR.

It's only when the user space requested it with the MSR filter.
--
Isaku Yamahata <[email protected]>

2024-03-28 00:49:12

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, 2024-03-28 at 08:06 +0800, Xiaoyao Li wrote:
>
> TDX spec states that
>
>    18.2.1.4.1 Memory Type for Private and Opaque Access
>
>    The memory type for private and opaque access semantics, which use a
>    private HKID, is WB.
>
>    18.2.1.4.2 Memory Type for Shared Accesses
>
>    Intel SDM, Vol. 3, 28.2.7.2 Memory Type Used for Translated Guest-
>    Physical Addresses
>
>    The memory type for shared access semantics, which use a shared HKID,
>    is determined as described below. Note that this is different from the
>    way memory type is determined by the hardware during non-root mode
>    operation. Rather, it is a best-effort approximation that is designed
>    to still allow the host VMM some control over memory type.
>      • For shared access during host-side (SEAMCALL) flows, the memory
>        type is determined by MTRRs.
>      • For shared access during guest-side flows (VM exit from the guest
>        TD), the memory type is determined by a combination of the Shared
>        EPT and MTRRs.
>        o If the memory type determined during Shared EPT walk is WB, then
>          the effective memory type for the access is determined by MTRRs.
>        o Else, the effective memory type for the access is UC.
>
> My understanding is that guest MTRR doesn't affect the memory type for
> private memory. So we don't need to zap private memory mappings.

Right, KVM can't zap the private side.

But why does KVM have to support a "best effort" MTRR virtualization for TDs? Kai pointed me to this
today and I haven't looked through it in depth yet:
https://lore.kernel.org/kvm/[email protected]/

An alternative could be to mirror that behavior, but normal VMs have to work with existing userspace
setup. KVM doesn't support any TDs yet, so we can take the opportunity to not introduce weird
things.

>
> > > > > But guests won't accept memory again because no one
> > > > > currently requests guests to do this after writes to MTRR MSRs. In this case,
> > > > > guests may access unaccepted memory, causing infinite EPT violation loop
> > > > > (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
> > > > > the host. But I think it would be better if we can avoid wasting CPU resource
> > > > > on the useless EPT violation loop.
> > > >
> > > > Qemu is expected to do it correctly.  There are many ways for userspace to go
> > > > wrong.  This isn't specific to MTRR MSR.
> > >
> > > This seems incorrect. KVM shouldn't force userspace to filter some
> > > specific MSRs. The semantic of MSR filter is userspace configures it on
> > > its own will, not KVM requires to do so.
> >
> > I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on
> > the
> > MSR list. At least I don't see the problem.
>
> What is the exit reason in vcpu->run->exit_reason?
> KVM_EXIT_X86_RDMSR/WRMSR? If so, it breaks the ABI on
> KVM_EXIT_X86_RDMSR/WRMSR.

How so? Userspace needs to learn to create a TD first.

2024-03-28 00:58:43

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 8:45 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-03-28 at 08:06 +0800, Xiaoyao Li wrote:
>>
>> TDX spec states that
>>
>>    18.2.1.4.1 Memory Type for Private and Opaque Access
>>
>>    The memory type for private and opaque access semantics, which use a
>>    private HKID, is WB.
>>
>>    18.2.1.4.2 Memory Type for Shared Accesses
>>
>>    Intel SDM, Vol. 3, 28.2.7.2 Memory Type Used for Translated Guest-
>>    Physical Addresses
>>
>>    The memory type for shared access semantics, which use a shared HKID,
>>    is determined as described below. Note that this is different from the
>>    way memory type is determined by the hardware during non-root mode
>>    operation. Rather, it is a best-effort approximation that is designed
>>    to still allow the host VMM some control over memory type.
>>      • For shared access during host-side (SEAMCALL) flows, the memory
>>        type is determined by MTRRs.
>>      • For shared access during guest-side flows (VM exit from the guest
>>        TD), the memory type is determined by a combination of the Shared
>>        EPT and MTRRs.
>>        o If the memory type determined during Shared EPT walk is WB, then
>>          the effective memory type for the access is determined by MTRRs.
>>        o Else, the effective memory type for the access is UC.
>>
>> My understanding is that guest MTRR doesn't affect the memory type for
>> private memory. So we don't need to zap private memory mappings.
>
> Right, KVM can't zap the private side.
>
> But why does KVM have to support a "best effort" MTRR virtualization for TDs? Kai pointed me to this
> today and I haven't looked through it in depth yet:
> https://lore.kernel.org/kvm/[email protected]/
>
> An alternative could be to mirror that behavior, but normal VMs have to work with existing userspace
> setup. KVM doesn't support any TDs yet, so we can take the opportunity to not introduce weird
> things.

Not to provide any MTRR support for TD is what I prefer.

>>
>>>>>> But guests won't accept memory again because no one
>>>>>> currently requests guests to do this after writes to MTRR MSRs. In this case,
>>>>>> guests may access unaccepted memory, causing infinite EPT violation loop
>>>>>> (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
>>>>>> the host. But I think it would be better if we can avoid wasting CPU resource
>>>>>> on the useless EPT violation loop.
>>>>>
>>>>> Qemu is expected to do it correctly.  There are many ways for userspace to go
>>>>> wrong.  This isn't specific to MTRR MSR.
>>>>
>>>> This seems incorrect. KVM shouldn't force userspace to filter some
>>>> specific MSRs. The semantic of MSR filter is userspace configures it on
>>>> its own will, not KVM requires to do so.
>>>
>>> I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on
>>> the
>>> MSR list. At least I don't see the problem.
>>
>> What is the exit reason in vcpu->run->exit_reason?
>> KVM_EXIT_X86_RDMSR/WRMSR? If so, it breaks the ABI on
>> KVM_EXIT_X86_RDMSR/WRMSR.
>
> How so? Userspace needs to learn to create a TD first.

The current ABI of KVM_EXIT_X86_RDMSR/WRMSR is that userspace itself
sets up the MSR filter first, then it gets such an exit reason when the
guest accesses the MSRs being filtered.

If you want to use this exit reason, then you need to enforce userspace
setting up the MSR filter. How do you enforce that? If you don't enforce
it, but exit with KVM_EXIT_X86_RDMSR/WRMSR whether or not userspace sets
up the MSR filter, then you are introducing divergent behavior in KVM.
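To make the ABI point concrete, here is a minimal sketch of how a
KVM_X86_SET_MSR_FILTER range decides whether a guest MSR access exits to
userspace with KVM_EXIT_X86_RDMSR/WRMSR. It models the documented
kvm_msr_filter_range semantics (a set bit allows the access, a clear bit
makes KVM exit); the struct layout, helper name, and deny list below are
simplified illustrations, not the actual uAPI or kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

struct msr_filter_range {
	uint32_t base;          /* first MSR index covered */
	uint32_t nmsrs;         /* number of MSRs covered */
	const uint8_t *bitmap;  /* one bit per MSR in [base, base + nmsrs) */
};

/* Zero-initialized bitmap: every covered access is denied -> userspace exit. */
static const uint8_t deny_all_bitmap[2];

/* Hypothetical deny list covering the variable-range MTRR MSRs 0x200-0x20f. */
static const struct msr_filter_range mtrr_deny_range = {
	.base = 0x200, .nmsrs = 16, .bitmap = deny_all_bitmap,
};

static bool msr_exits_to_userspace(const struct msr_filter_range *r,
				   uint32_t msr)
{
	uint32_t idx;

	if (msr < r->base || msr - r->base >= r->nmsrs)
		return false;   /* not covered: default policy applies */
	idx = msr - r->base;
	/* A clear bit denies the access, so KVM exits to userspace. */
	return !(r->bitmap[idx / 8] & (1u << (idx % 8)));
}
```

The point of the ABI argument is visible here: the userspace exit only
happens for MSRs the filter covers, so it is opt-in by construction.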

2024-03-28 01:04:58

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 8:36 AM, Isaku Yamahata wrote:
> On Thu, Mar 28, 2024 at 08:06:53AM +0800,
> Xiaoyao Li <[email protected]> wrote:
>
>> On 3/28/2024 1:36 AM, Edgecombe, Rick P wrote:
>>> On Wed, 2024-03-27 at 10:54 +0800, Xiaoyao Li wrote:
>>>>>> If QEMU doesn't configure the msr filter list correctly, KVM has to handle
>>>>>> guest's MTRR MSR accesses. In my understanding, the
>>>>>> suggestion is KVM zap private memory mappings.
>>
>> TDX spec states that
>>
>> 18.2.1.4.1 Memory Type for Private and Opaque Access
>>
>> The memory type for private and opaque access semantics, which use a
>> private HKID, is WB.
>>
>> 18.2.1.4.2 Memory Type for Shared Accesses
>>
>> Intel SDM, Vol. 3, 28.2.7.2 Memory Type Used for Translated Guest-
>> Physical Addresses
>>
>> The memory type for shared access semantics, which use a shared HKID,
>> is determined as described below. Note that this is different from the
>> way memory type is determined by the hardware during non-root mode
>> operation. Rather, it is a best-effort approximation that is designed
>> to still allow the host VMM some control over memory type.
>> • For shared access during host-side (SEAMCALL) flows, the memory
>> type is determined by MTRRs.
>> • For shared access during guest-side flows (VM exit from the guest
>> TD), the memory type is determined by a combination of the Shared
>> EPT and MTRRs.
>> o If the memory type determined during Shared EPT walk is WB, then
>> the effective memory type for the access is determined by MTRRs.
>> o Else, the effective memory type for the access is UC.
>>
>> My understanding is that guest MTRR doesn't affect the memory type for
>> private memory. So we don't need to zap private memory mappings.
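The determination quoted above can be sketched as a tiny pure function:
if the Shared-EPT walk yields WB, the MTRRs decide the effective type,
otherwise it is UC, and private accesses are always WB regardless. The
enum encoding follows the x86 memory type numbers, but the helper name
is illustrative, not TDX module or kernel code.

```c
enum mem_type { MT_UC = 0, MT_WT = 4, MT_WB = 6 };

static enum mem_type effective_shared_mem_type(enum mem_type sept_mt,
					       enum mem_type mtrr_mt)
{
	/* TDX spec 18.2.1.4.2: MTRRs only matter when the Shared-EPT
	 * walk already produced WB; any other EPT type forces UC.
	 * Private accesses (18.2.1.4.1) are WB and ignore guest MTRRs,
	 * which is why zapping private mappings is unnecessary. */
	return sept_mt == MT_WB ? mtrr_mt : MT_UC;
}
```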
>
> So, there is no point to (try to) emulate MTRR. The direction is, don't
> advertise MTRR to the guest (new TDX module is needed.) or enforce
> the guest to not use MTRR (guest command line clearcpuid=mtrr).

Ideally, the TD guest would learn to disable/not use MTRRs itself.

> KVM will
> simply return error to guest access to MTRR related registers.
>
> QEMU or user space VMM can use the MSR filter if they want.
>
>
>>>>>> But guests won't accept memory again because no one
>>>>>> currently requests guests to do this after writes to MTRR MSRs. In this case,
>>>>>> guests may access unaccepted memory, causing infinite EPT violation loop
>>>>>> (assume SEPT_VE_DISABLE is set). This won't impact other guests/workloads on
>>>>>> the host. But I think it would be better if we can avoid wasting CPU resource
>>>>>> on the useless EPT violation loop.
>>>>>
>>>>> Qemu is expected to do it correctly.  There are many ways for userspace to go
>>>>> wrong.  This isn't specific to MTRR MSRs.
>>>>
>>>> This seems incorrect. KVM shouldn't force userspace to filter some
>>>> specific MSRs. The semantic of MSR filter is userspace configures it on
>>>> its own will, not KVM requires to do so.
>>>
>>> I'm ok just always doing the exit to userspace on attempt to use MTRRs in a TD, and not rely on the
>>> MSR list. At least I don't see the problem.
>>
>> What is the exit reason in vcpu->run->exit_reason? KVM_EXIT_X86_RDMSR/WRMSR?
>> If so, it breaks the ABI on KVM_EXIT_X86_RDMSR/WRMSR.
>
> It's only when the user space requested it with the MSR filter.

Right. But userspace has no reason to filter them, because userspace can
do nothing except 1) kill the TD, or 2) eat the instruction.
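The two options named here ("kill the TD" or "eat the instruction")
amount to a pure policy decision in the VMM. The MSR indices below are
the architectural MTRR MSR ranges from the Intel SDM, but the range
helper, the enum, and the policy function are illustrative assumptions,
not QEMU or kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MTRR MSR indices: variable ranges, the fixed-range
 * region (which contains gaps), and IA32_MTRR_DEF_TYPE. */
static bool is_mtrr_msr(uint32_t msr)
{
	return msr == 0x2ff ||                   /* IA32_MTRR_DEF_TYPE */
	       (msr >= 0x200 && msr <= 0x20f) || /* MTRR_PHYSBASEn/PHYSMASKn */
	       (msr >= 0x250 && msr <= 0x26f);   /* fixed-range MTRR region */
}

enum td_msr_action { TD_MSR_EAT, TD_MSR_KILL };

static enum td_msr_action td_wrmsr_action(uint32_t msr)
{
	/* "Eat" the MTRR write: report success back to the guest and do
	 * nothing.  Any other filtered MSR this VMM didn't expect would
	 * kill the TD instead. */
	return is_mtrr_msr(msr) ? TD_MSR_EAT : TD_MSR_KILL;
}
```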

2024-03-28 01:06:30

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, 2024-03-28 at 08:58 +0800, Xiaoyao Li wrote:
> > How so? Userspace needs to learn to create a TD first.
>
> The current ABI of KVM_EXIT_X86_RDMSR/WRMSR is that userspace itself
> sets up the MSR filter first, then it gets such an EXIT_REASON when the
> guest accesses the MSRs being filtered.
>
> If you want to use this exit reason, then you need to enforce userspace
> setting up the MSR filter. How do you enforce that?

I think Isaku's proposal was to let userspace configure it.

For the sake of conversation, what if we don't enforce it? The downside of not enforcing it is that
we then need to worry about the code paths in KVM that MTRR accesses would reach. But what goes wrong
functionally? If userspace doesn't fully set up a TD, things can go wrong for the TD.

A plus side of using the MSR filter stuff is it reuses existing functionality.

> If you don't enforce it, but exit with
> KVM_EXIT_X86_RDMSR/WRMSR whether or not userspace sets up the MSR filter,
> then you are introducing divergent behavior in KVM.

The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this is
any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different exit
reason or behavior in mind?

2024-03-28 01:16:45

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Thu, 2024-03-21 at 08:55 -0700, Isaku Yamahata wrote:
> On Wed, Mar 20, 2024 at 02:12:49PM +0800,
> Chao Gao <[email protected]> wrote:
>
> > > +static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
> > > +                                 struct td_params *td_params)
> > > +{
> > > +       int i;
> > > +
> > > +       /*
> > > +        * td_params.cpuid_values: The number and the order of cpuid_value must
> > > +        * be same to the one of struct tdsysinfo.{num_cpuid_config, cpuid_configs}
> > > +        * It's assumed that td_params was zeroed.
> > > +        */
> > > +       for (i = 0; i < tdx_info->num_cpuid_config; i++) {
> > > +               const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
> > > +               /* KVM_TDX_CPUID_NO_SUBLEAF means index = 0. */
> > > +               u32 index = c->sub_leaf == KVM_TDX_CPUID_NO_SUBLEAF ? 0 : c->sub_leaf;
> > > +               const struct kvm_cpuid_entry2 *entry =
> > > +                       kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
> > > +                                             c->leaf, index);
> > > +               struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
> > > +
> > > +               if (!entry)
> > > +                       continue;
> > > +
> > > +               /*
> > > +                * tdsysinfo.cpuid_configs[].{eax, ebx, ecx, edx}
> > > +                * bit 1 means it can be configured to zero or one.
> > > +                * bit 0 means it must be zero.
> > > +                * Mask out non-configurable bits.
> > > +                */
> > > +               value->eax = entry->eax & c->eax;
> > > +               value->ebx = entry->ebx & c->ebx;
> > > +               value->ecx = entry->ecx & c->ecx;
> > > +               value->edx = entry->edx & c->edx;
> >
> > Any reason to mask off non-configurable bits rather than return an error? this
> > is misleading to userspace because guest sees the values emulated by TDX module
> > instead of the values passed from userspace (i.e., the request from userspace
> > isn't done but there is no indication of that to userspace).
>
> Ok, I'll eliminate them.  If user space passes wrong cpuids, TDX module will
> return error. I'll leave the error check to the TDX module.

I was just looking at this. Agreed. It breaks the selftests though.
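The two behaviors being weighed here (silently mask vs. return an error)
can be sketched as pure helpers. The bitmask semantics follow the
tdsysinfo comment quoted above (a 1 bit means directly configurable, a 0
bit means must be zero); the function names and the -EINVAL-style return
value are illustrative, not from the patch.

```c
#include <stdint.h>

/* Behavior in the patch as posted: silently drop bits the TDX module
 * reports as non-configurable, so the guest may see a CPUID value that
 * differs from what userspace requested. */
static uint32_t tdx_mask_cpuid(uint32_t requested, uint32_t configurable)
{
	return requested & configurable;
}

/* Alternative discussed above: reject the request so userspace gets an
 * explicit error instead of a silently modified CPUID value. */
static int tdx_check_cpuid(uint32_t requested, uint32_t configurable)
{
	return (requested & ~configurable) ? -22 /* -EINVAL */ : 0;
}
```

Dropping the masking in favor of the check is what breaks selftests that
relied on non-configurable bits being quietly cleared.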

2024-03-28 01:30:59

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 9:06 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-03-28 at 08:58 +0800, Xiaoyao Li wrote:
>>> How so? Userspace needs to learn to create a TD first.
>>
>> The current ABI of KVM_EXIT_X86_RDMSR/WRMSR is that userspace itself
>> sets up the MSR filter first, then it gets such an EXIT_REASON when the
>> guest accesses the MSRs being filtered.
>>
>> If you want to use this exit reason, then you need to enforce userspace
>> setting up the MSR filter. How do you enforce that?
>
> I think Isaku's proposal was to let userspace configure it.
>
> For the sake of conversation, what if we don't enforce it? The downside of not enforcing it is that
> we then need to worry about the code paths in KVM that MTRR accesses would reach. But what goes wrong
> functionally? If userspace doesn't fully set up a TD, things can go wrong for the TD.
>
> A plus side of using the MSR filter stuff is it reuses existing functionality.
>
>> If you don't enforce it, but exit with
>> KVM_EXIT_X86_RDMSR/WRMSR whether or not userspace sets up the MSR filter,
>> then you are introducing divergent behavior in KVM.
>
> The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this is
> any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different exit
> reason or behavior in mind?

Just return an error on the TDVMCALL of RDMSR/WRMSR for a TD's access to MTRR MSRs.

2024-03-28 01:37:02

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On 3/28/2024 9:12 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-03-21 at 08:55 -0700, Isaku Yamahata wrote:
>> On Wed, Mar 20, 2024 at 02:12:49PM +0800,
>> Chao Gao <[email protected]> wrote:
>>
>>>> +static void setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
>>>> +                                 struct td_params *td_params)
>>>> +{
>>>> +       int i;
>>>> +
>>>> +       /*
>>>> +        * td_params.cpuid_values: The number and the order of cpuid_value must
>>>> +        * be same to the one of struct tdsysinfo.{num_cpuid_config, cpuid_configs}
>>>> +        * It's assumed that td_params was zeroed.
>>>> +        */
>>>> +       for (i = 0; i < tdx_info->num_cpuid_config; i++) {
>>>> +               const struct kvm_tdx_cpuid_config *c = &tdx_info->cpuid_configs[i];
>>>> +               /* KVM_TDX_CPUID_NO_SUBLEAF means index = 0. */
>>>> +               u32 index = c->sub_leaf == KVM_TDX_CPUID_NO_SUBLEAF ? 0 : c->sub_leaf;
>>>> +               const struct kvm_cpuid_entry2 *entry =
>>>> +                       kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
>>>> +                                             c->leaf, index);
>>>> +               struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
>>>> +
>>>> +               if (!entry)
>>>> +                       continue;
>>>> +
>>>> +               /*
>>>> +                * tdsysinfo.cpuid_configs[].{eax, ebx, ecx, edx}
>>>> +                * bit 1 means it can be configured to zero or one.
>>>> +                * bit 0 means it must be zero.
>>>> +                * Mask out non-configurable bits.
>>>> +                */
>>>> +               value->eax = entry->eax & c->eax;
>>>> +               value->ebx = entry->ebx & c->ebx;
>>>> +               value->ecx = entry->ecx & c->ecx;
>>>> +               value->edx = entry->edx & c->edx;
>>>
>>> Any reason to mask off non-configurable bits rather than return an error? this
>>> is misleading to userspace because guest sees the values emulated by TDX module
>>> instead of the values passed from userspace (i.e., the request from userspace
>>> isn't done but there is no indication of that to userspace).
>>
>> Ok, I'll eliminate them.  If user space passes wrong cpuids, TDX module will
>> return error. I'll leave the error check to the TDX module.
>
> I was just looking at this. Agreed. It breaks the selftests though.

If you all prefer to go in this direction, then please update the error
handling of this specific SEAMCALL.

2024-03-28 01:50:42

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 28/03/2024 11:53 am, Isaku Yamahata wrote:
> On Tue, Mar 26, 2024 at 02:43:54PM +1300,
> "Huang, Kai" <[email protected]> wrote:
>
>> ... continue the previous review ...
>>
>>> +
>>> +static void tdx_reclaim_control_page(unsigned long td_page_pa)
>>> +{
>>> + WARN_ON_ONCE(!td_page_pa);
>>
>> From the name 'td_page_pa' we cannot tell whether it is a control page, but
>> this function is only intended for control page AFAICT, so perhaps a more
>> specific name.
>>
>>> +
>>> + /*
>>> + * TDCX are being reclaimed. TDX module maps TDCX with HKID
>>
>> "are" -> "is".
>>
>> Are you sure it is TDCX, but not TDCS?
>>
>> AFAICT TDCX is the control structure for 'vcpu', but here you are handling
>> the control structure for the VM.
>
> TDCS, TDVPR, and TDCX. Will update the comment.

But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_?

Otherwise you will have to explain them.

[...]

>>> +
>>> +void tdx_mmu_release_hkid(struct kvm *kvm)
>>> +{
>>> + bool packages_allocated, targets_allocated;
>>> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>>> + cpumask_var_t packages, targets;
>>> + u64 err;
>>> + int i;
>>> +
>>> + if (!is_hkid_assigned(kvm_tdx))
>>> + return;
>>> +
>>> + if (!is_td_created(kvm_tdx)) {
>>> + tdx_hkid_free(kvm_tdx);
>>> + return;
>>> + }
>>
>> I lost tracking what does "td_created()" mean.
>>
>> I guess it means: KeyID has been allocated to the TDX guest, but not yet
>> programmed/configured.
>>
>> Perhaps add a comment to remind the reviewer?
>
> As Chao suggested, will introduce state machine for vm and vcpu.
>
> https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/

Could you elaborate on what the state machine will look like?

I need to understand it.


[...]


>
> How about this?
>
> /*
> * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> * TDH.MNG.KEY.FREEID() to free the HKID.
> * Other threads can remove pages from TD. When the HKID is assigned, we need
> * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> * present transient state of HKID.
> */

Could you elaborate on why it is still possible to have other threads
removing pages from the TD?

I am probably missing something, but the thing I don't understand is why
this function is triggered by MMU release? All the things done in this
function don't seem to be related to MMU at all.

IIUC, by reaching here, you must already have done VPFLUSHDONE, which
should be called when you free vcpu? Freeing vcpus is done in
kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in
which this tdx_mmu_release_keyid() is called?

But here we are depending on vcpus being freed before tdx_mmu_release_hkid()?

>>> + /*
>>> + * In the case of error in tdx_do_tdh_phymem_cache_wb(), the following
>>> + * tdh_mng_key_freeid() will fail.
>>> + */
>>> + err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
>>> + if (WARN_ON_ONCE(err)) {
>>
>> I see KVM_BUG_ON() is normally used for SEAMCALL error. Why this uses
>> WARN_ON_ONCE() here?
>
> Because vm_free() hook is (one of) the final steps to free struct kvm. No one
> else touches this kvm. Because it doesn't harm to use KVM_BUG_ON() here,
> I'll change it for consistency.

I am fine with either. You can use KVM_BUG_ON() for SEAMCALLs at
runtime, but use WARN_ON_ONCE() for those involved during VM creation.

[...]

>>
>>> + err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
>>> + tdx_global_keyid));
>>> + if (WARN_ON_ONCE(err)) {
>>
>> Again, KVM_BUG_ON()?
>>
>> Should't matter, though.
>
> Ok, let's use KVM_BUG_ON() consistently.

Ditto.

2024-03-28 03:04:56

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
> > The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this
> > is
> > any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different
> > exit
> > reason or behavior in mind?
>
> Just return error on TDVMCALL of RDMSR/WRMSR on TD's access of MTRR MSRs.

MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
able to use it and be surprised by a #GP.

{
    "MSB": "12",
    "LSB": "12",
    "Field Size": "1",
    "Field Name": "MTRR",
    "Configuration Details": null,
    "Bit or Field Virtualization Type": "Fixed",
    "Virtualization Details": "0x1"
},

If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
support it (do nothing but not return an error). Returning an error to the guest would be making up
arch behavior, and to a lesser degree so would ignoring the WRMSR. So that is why I lean towards
returning to userspace and giving the VMM the option to ignore it, return an error to the guest or
show an error to the user. If KVM can't support the behavior, better to get an actual error in
userspace than a mysterious guest hang, right?

Outside of what kind of exit it is, do you object to the general plan to punt to userspace?

Since this is a TDX specific limitation, I guess there is KVM_EXIT_TDX_VMCALL as a general category
of TDVMCALLs that cannot be handled by KVM.

2024-03-28 03:13:45

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support

>+#if IS_ENABLED(CONFIG_HYPERV)
>+static int vt_flush_remote_tlbs(struct kvm *kvm);
>+#endif
>+
> static __init int vt_hardware_setup(void)
> {
> int ret;
>@@ -49,11 +53,29 @@ static __init int vt_hardware_setup(void)
> pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");
> }
>
>+#if IS_ENABLED(CONFIG_HYPERV)
>+ /*
>+ * TDX KVM overrides flush_remote_tlbs method and assumes
>+ * flush_remote_tlbs_range = NULL that falls back to
>+ * flush_remote_tlbs. Disable TDX if there are conflicts.
>+ */
>+ if (vt_x86_ops.flush_remote_tlbs ||
>+ vt_x86_ops.flush_remote_tlbs_range) {
>+ enable_tdx = false;
>+ pr_warn_ratelimited("TDX requires baremetal. Not Supported on VMM guest.\n");
>+ }
>+#endif
>+
> enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> if (enable_tdx)
> vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> sizeof(struct kvm_tdx));
>
>+#if IS_ENABLED(CONFIG_HYPERV)
>+ if (enable_tdx)
>+ vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;

Is this hook necessary/beneficial to TDX?

if no, we can leave .flush_remote_tlbs as NULL. if yes, we should do:

struct kvm_x86_ops {
..
#if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(TDX...)
int (*flush_remote_tlbs)(struct kvm *kvm);
int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn,
gfn_t nr_pages);
#endif

2024-03-28 03:40:49

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 11:04 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
>>> The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this
>>> is
>>> any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different
>>> exit
>>> reason or behavior in mind?
>>
>> Just return error on TDVMCALL of RDMSR/WRMSR on TD's access of MTRR MSRs.
>
> MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
> able to use it and be surprised by a #GP.
>
> {
>     "MSB": "12",
>     "LSB": "12",
>     "Field Size": "1",
>     "Field Name": "MTRR",
>     "Configuration Details": null,
>     "Bit or Field Virtualization Type": "Fixed",
>     "Virtualization Details": "0x1"
> },
>
> If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
> support it (do nothing but not return an error). Returning an error to the guest would be making up
> arch behavior, and to a lesser degree so would ignoring the WRMSR.

The root cause is that it's a bad design of TDX to make MTRR fixed1.
When the guest reads the MTRR CPUID bit as 1 while getting a #VE on MTRR
MSRs, the architectural behavior is already broken. (MCA faces a similar
issue: MCA is fixed1 as well, while accessing MCA-related MSRs gets a #VE.
This is why TDX is going to fix them by introducing a new feature that
makes them configurable.)

> So that is why I lean towards
> returning to userspace and giving the VMM the option to ignore it, return an error to the guest or
> show an error to the user.

"Show an error to the user" doesn't help at all, because the user cannot
fix it, nor can QEMU.

> If KVM can't support the behavior, better to get an actual error in
> userspace than a mysterious guest hang, right?
What behavior do you mean?

> Outside of what kind of exit it is, do you object to the general plan to punt to userspace?
>
> Since this is a TDX specific limitation, I guess there is KVM_EXIT_TDX_VMCALL as a general category
> of TDVMCALLs that cannot be handled by KVM.

I just don't see any difference between handling it in KVM and handling
it in userspace: either a) return an error to the guest or b) ignore the WRMSR.

2024-03-28 03:55:48

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support

On Thu, Mar 28, 2024 at 11:12:57AM +0800,
Chao Gao <[email protected]> wrote:

> >+#if IS_ENABLED(CONFIG_HYPERV)
> >+static int vt_flush_remote_tlbs(struct kvm *kvm);
> >+#endif
> >+
> > static __init int vt_hardware_setup(void)
> > {
> > int ret;
> >@@ -49,11 +53,29 @@ static __init int vt_hardware_setup(void)
> > pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");
> > }
> >
> >+#if IS_ENABLED(CONFIG_HYPERV)
> >+ /*
> >+ * TDX KVM overrides flush_remote_tlbs method and assumes
> >+ * flush_remote_tlbs_range = NULL that falls back to
> >+ * flush_remote_tlbs. Disable TDX if there are conflicts.
> >+ */
> >+ if (vt_x86_ops.flush_remote_tlbs ||
> >+ vt_x86_ops.flush_remote_tlbs_range) {
> >+ enable_tdx = false;
> >+ pr_warn_ratelimited("TDX requires baremetal. Not Supported on VMM guest.\n");
> >+ }
> >+#endif
> >+
> > enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > if (enable_tdx)
> > vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> > sizeof(struct kvm_tdx));
> >
> >+#if IS_ENABLED(CONFIG_HYPERV)
> >+ if (enable_tdx)
> >+ vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
>
> Is this hook necessary/beneficial to TDX?
>
> if no, we can leave .flush_remote_tlbs as NULL. if yes, we should do:
>
> struct kvm_x86_ops {
> ...
> #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(TDX...)
> int (*flush_remote_tlbs)(struct kvm *kvm);
> int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn,
> gfn_t nr_pages);
> #endif

Will fix it. I made a mistake when I rebased it. Now those hooks are only for
CONFIG_HYPERV.
--
Isaku Yamahata <[email protected]>

2024-03-28 05:34:47

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Thu, Mar 28, 2024 at 02:49:56PM +1300,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 28/03/2024 11:53 am, Isaku Yamahata wrote:
> > On Tue, Mar 26, 2024 at 02:43:54PM +1300,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > ... continue the previous review ...
> > >
> > > > +
> > > > +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > > > +{
> > > > + WARN_ON_ONCE(!td_page_pa);
> > >
> > > From the name 'td_page_pa' we cannot tell whether it is a control page, but
> > > this function is only intended for control page AFAICT, so perhaps a more
> > > specific name.
> > >
> > > > +
> > > > + /*
> > > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> > >
> > > "are" -> "is".
> > >
> > > Are you sure it is TDCX, but not TDCS?
> > >
> > > AFAICT TDCX is the control structure for 'vcpu', but here you are handling
> > > the control structure for the VM.
> >
> > TDCS, TDVPR, and TDCX. Will update the comment.
>
> But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_?

So the patch that frees TDVPR and TDCX will change this comment.


> Otherwise you will have to explain them.
>
> [...]
>
> > > > +
> > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > +{
> > > > + bool packages_allocated, targets_allocated;
> > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > + cpumask_var_t packages, targets;
> > > > + u64 err;
> > > > + int i;
> > > > +
> > > > + if (!is_hkid_assigned(kvm_tdx))
> > > > + return;
> > > > +
> > > > + if (!is_td_created(kvm_tdx)) {
> > > > + tdx_hkid_free(kvm_tdx);
> > > > + return;
> > > > + }
> > >
> > > I lost tracking what does "td_created()" mean.
> > >
> > > I guess it means: KeyID has been allocated to the TDX guest, but not yet
> > > programmed/configured.
> > >
> > > Perhaps add a comment to remind the reviewer?
> >
> > As Chao suggested, will introduce state machine for vm and vcpu.
> >
> > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/
>
> Could you elaborate on what the state machine will look like?
>
> I need to understand it.

Not yet. Chao only proposed introducing a state machine. Right now it's just
an idea.


> > How about this?
> >
> > /*
> > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > * TDH.MNG.KEY.FREEID() to free the HKID.
> > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > * present transient state of HKID.
> > */
>
> Could you elaborate on why it is still possible to have other threads
> removing pages from the TD?
>
> I am probably missing something, but the thing I don't understand is why
> this function is triggered by MMU release? All the things done in this
> function don't seem to be related to MMU at all.

KVM releases EPT pages on MMU notifier release; kvm_mmu_zap_all() does that. If
we follow that path, kvm_mmu_zap_all() zaps all the Secure-EPT pages by
TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
TDH.MEM.{SEPT,PAGE}.REMOVE() is slow, we can free the HKID before
kvm_mmu_zap_all() so as to use TDH.PHYMEM.PAGE.RECLAIM() instead.


> IIUC, by reaching here, you must already have done VPFLUSHDONE, which should
> be called when you free vcpu?

Not necessarily.


> Freeing vcpus is done in
> kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
> this tdx_mmu_release_keyid() is called?

guest memfd complicates things. The race is between guest memfd release and mmu
notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds,
including guest memfd.

Here is an example. Let's say we have fds for vhost, guest_memfd, kvm vcpu,
and kvm vm, and the process is exiting. Note that vhost holds a reference on
the mm to access guest (shared) memory.

exit_mmap():
  Usually the mmu notifier release fires here. But not yet, because of vhost.

exit_files():
  Close the vhost fd. vhost starts a timer to issue mmput().

  Close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range(), which
  eventually calls TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE(). This takes
  time because it processes the whole guest memory. kvm_put_kvm() is called at
  the end.

  During the unmapping on behalf of guest_memfd, the vhost timer fires and
  calls mmput(). That triggers mmu notifier release.

  Close the kvm vcpu/vm fds. They call kvm_put_kvm(). The last one calls
  kvm_destroy_vm().

It would be ideal to free the HKID first for efficiency, but KVM has no control
over the order in which the fds are closed.


> But here we are depending on vcpus being freed before tdx_mmu_release_hkid()?

Not necessarily.
--
Isaku Yamahata <[email protected]>

2024-03-28 08:13:19

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c

On Mon, Feb 26, 2024 at 12:26:33AM -0800, [email protected] wrote:
>@@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
> * notification vector is switched to the one that calls
> * back to the pi_wakeup_handler() function.
> */
>- return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
>+ return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
>+ vmx_can_use_vtd_pi(vcpu->kvm);

It is better to separate this functional change from the code refactoring.

> }
>
> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>@@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> if (!vmx_needs_pi_wakeup(vcpu))
> return;
>
>- if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
>+ if (kvm_vcpu_is_blocking(vcpu) &&
>+ (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))

Ditto.

This looks incorrect to me. Here we assume interrupts are always enabled for a
TD. But on TDVMCALL(HLT), the guest tells KVM whether hlt was called with
interrupts disabled. KVM can just check the interrupt status passed from the
guest.

> pi_enable_wakeup_handler(vcpu);
>

2024-03-28 08:27:33

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 069/130] KVM: TDX: Require TDP MMU and mmio caching for TDX



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> As the TDP MMU is becoming more mainstream than the legacy MMU, the legacy MMU
> support for TDX isn't implemented. TDX requires KVM MMIO caching.

Can you add some description about why TDX requires mmio caching in the
changelog?


> Disable
> TDX support when TDP MMU or mmio caching aren't supported.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 1 +
> arch/x86/kvm/vmx/main.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0e0321ad9ca2..b8d6ce02e66d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -104,6 +104,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> * If the hardware supports that we don't need to do shadow paging.
> */
> bool tdp_enabled = false;
> +EXPORT_SYMBOL_GPL(tdp_enabled);
>
> static bool __ro_after_init tdp_mmu_allowed;
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 076a471d9aea..54df6653193e 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -3,6 +3,7 @@
>
> #include "x86_ops.h"
> #include "vmx.h"
> +#include "mmu.h"
> #include "nested.h"
> #include "pmu.h"
> #include "tdx.h"
> @@ -36,6 +37,18 @@ static __init int vt_hardware_setup(void)
> if (ret)
> return ret;
>
> + /* TDX requires KVM TDP MMU. */
> + if (enable_tdx && !tdp_enabled) {
> + enable_tdx = false;
> + pr_warn_ratelimited("TDX requires TDP MMU. Please enable TDP MMU for TDX.\n");
> + }
> +
> + /* TDX requires MMIO caching. */
> + if (enable_tdx && !enable_mmio_caching) {
> + enable_tdx = false;
> + pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");
> + }
> +
> enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> if (enable_tdx)
> vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,


2024-03-28 09:16:56

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page



On 3/28/2024 8:02 AM, Isaku Yamahata wrote:
> On Wed, Mar 27, 2024 at 09:49:14PM +0800,
> Binbin Wu <[email protected]> wrote:
>
>>
>> On 3/15/2024 9:09 AM, Isaku Yamahata wrote:
>>> Here is the updated one. Renamed dummy -> mirrored.
>>>
>>> When KVM resolves the KVM page fault, it walks the page tables. To reuse
>>> the existing KVM MMU code and mitigate the heavy cost of directly walking
>>> the private page table, allocate one more page to copy the mirrored page
>> Here "copy" is a bit confusing for me.
>> The mirrored page table is maintained by KVM, not copied from anywhere.
> How about, "maintain" or "keep"?

Or just use "for"?

i.e, allocate one more page for the mirrored page table ...



>
>>> table for the KVM MMU code to directly walk. Resolve the KVM page fault
>>> with the existing code, and do additional operations necessary for the
>>> private page table. To distinguish such cases, the existing KVM page table
>>> is called a shared page table (i.e., not associated with a private page
>>> table), and the page table with a private page table is called a mirrored
>>> page table. The relationship is depicted below.
>>>
>>>
>>> KVM page fault |
>>> | |
>>> V |
>>> -------------+---------- |
>>> | | |
>>> V V |
>>> shared GPA private GPA |
>>> | | |
>>> V V |
>>> shared PT root mirrored PT root | private PT root
>>> | | | |
>>> V V | V
>>> shared PT mirrored PT ----propagate----> private PT
>>> | | | |
>>> | \-----------------+------\ |
>>> | | | |
>>> V | V V
>>> shared guest page | private guest page
>>> |
>>> non-encrypted memory | encrypted memory
>>> |
>>> PT: Page table
>>> Shared PT: visible to KVM, and the CPU uses it for shared mappings.
>>> Private PT: the CPU uses it, but it is invisible to KVM. TDX module
>>> updates this table to map private guest pages.
>>> Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
>>> to propagate PT change to the actual private PT.
>>>
>>


2024-03-28 10:12:10

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, Mar 28, 2024 at 08:06:53AM +0800, Xiaoyao Li wrote:
>On 3/28/2024 1:36 AM, Edgecombe, Rick P wrote:
>> On Wed, 2024-03-27 at 10:54 +0800, Xiaoyao Li wrote:
>> > > > If QEMU doesn't configure the msr filter list correctly, KVM has to handle
>> > > > guest's MTRR MSR accesses. In my understanding, the
>> > > > suggestion is KVM zap private memory mappings.
>
>TDX spec states that
>
> 18.2.1.4.1 Memory Type for Private and Opaque Access
>
> The memory type for private and opaque access semantics, which use a
> private HKID, is WB.
>
> 18.2.1.4.2 Memory Type for Shared Accesses
>
> Intel SDM, Vol. 3, 28.2.7.2 Memory Type Used for Translated Guest-
> Physical Addresses
>
> The memory type for shared access semantics, which use a shared HKID,
> is determined as described below. Note that this is different from the
> way memory type is determined by the hardware during non-root mode
> operation. Rather, it is a best-effort approximation that is designed
> to still allow the host VMM some control over memory type.
> • For shared access during host-side (SEAMCALL) flows, the memory
> type is determined by MTRRs.
> • For shared access during guest-side flows (VM exit from the guest
> TD), the memory type is determined by a combination of the Shared
> EPT and MTRRs.
> o If the memory type determined during Shared EPT walk is WB, then
> the effective memory type for the access is determined by MTRRs.
> o Else, the effective memory type for the access is UC.
>
>My understanding is that guest MTRR doesn't affect the memory type for
>private memory. So we don't need to zap private memory mappings.

This isn't related to the discussion. IIUC, this is the memory type used
by TDX module code to access shared/private memory.

I didn't suggest zapping private memory. It is my understanding of what
we will end up with if KVM relies on QEMU to filter MTRR MSRs but somehow
QEMU fails to do that.

2024-03-28 10:38:42

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, Mar 28, 2024 at 11:40:27AM +0800, Xiaoyao Li wrote:
>On 3/28/2024 11:04 AM, Edgecombe, Rick P wrote:
>> On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
>> > > The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this
>> > > is
>> > > any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different
>> > > exit
>> > > reason or behavior in mind?
>> >
>> > Just return error on TDVMCALL of RDMSR/WRMSR on TD's access of MTRR MSRs.
>>
>> MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
>> able to use it and be surprised by a #GP.
>>
>> {
>> "MSB": "12",
>> "LSB": "12",
>> "Field Size": "1",
>> "Field Name": "MTRR",
>> "Configuration Details": null,
>> "Bit or Field Virtualization Type": "Fixed",
>> "Virtualization Details": "0x1"
>> },
>>
>> If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
>> support it (do nothing but not return an error). Returning an error to the guest would be making up
>> arch behavior, and to a lesser degree so would ignoring the WRMSR.
>
>The root cause is that it's a bad design of TDX to make MTRR fixed1. When
>guest reads MTRR CPUID as 1 while getting #VE on MTRR MSRs, it already breaks
>the architectural behavior. (MCA faces a similar issue: MCA is fixed1 as

I won't say #VE on MTRR MSRs breaks anything. Writes to other MSRs (e.g.
TSC_DEADLINE MSR) also lead to #VE. If KVM can emulate the MSR accesses, #VE
should be fine.

The problem is: the MTRR CPUID feature is fixed 1 while KVM/QEMU doesn't know
how to virtualize MTRR, especially given that KVM cannot control the memory
type in secure-EPT entries.

>well while accessing MCA related MSRs gets #VE. This is why TDX is going to
>fix them by introducing new feature and make them configurable)
>
>> So that is why I lean towards
>> returning to userspace and giving the VMM the option to ignore it, return an error to the guest or
>> show an error to the user.
>
>"show an error to the user" doesn't help at all. Because user cannot fix it,
>nor does QEMU.

The key point isn't who can fix/emulate MTRR MSRs. It is just that KVM doesn't
know how to handle this situation and asks userspace for help.

Whether or how userspace can handle the MSR writes isn't KVM's problem. It may be
better if KVM can tell userspace exactly in which cases KVM will exit to
userspace. But there is no such infrastructure.

An example is: in KVM CET series, we find it is complex for KVM instruction
emulator to emulate control flow instructions when CET is enabled. The
suggestion is also to punt to userspace (w/o any indication to userspace that
KVM would do this).

>
>> If KVM can't support the behavior, better to get an actual error in
>> userspace than a mysterious guest hang, right?
>What behavior do you mean?
>
>> Outside of what kind of exit it is, do you object to the general plan to punt to userspace?
>>
>> Since this is a TDX specific limitation, I guess there is KVM_EXIT_TDX_VMCALL as a general category
>> of TDVMCALLs that cannot be handled by KVM.

Using KVM_EXIT_TDX_VMCALL looks fine.

We need to explain why MTRR MSRs are handled in this way unlike other MSRs.

It is better if KVM can tell userspace that MTRR virtualization isn't supported
by KVM for TDs. Then userspace should resolve the conflict between KVM and TDX
module on MTRR. But to report MTRR as unsupported, we need to make
GET_SUPPORTED_CPUID a vm-scope ioctl. I am not sure if it is worth the effort.


>
>I just don't see any difference between handling it in KVM and handling it in
>userspace: either a) return error to guest or b) ignore the WRMSR.

2024-03-28 10:57:06

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 092/130] KVM: TDX: Implement interrupt injection

>@@ -848,6 +853,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> trace_kvm_entry(vcpu);
>
>+ if (pi_test_on(&tdx->pi_desc)) {
>+ apic->send_IPI_self(POSTED_INTR_VECTOR);
>+
>+ kvm_wait_lapic_expire(vcpu);

it seems the APIC timer change was inadvertently included.

2024-03-28 11:15:10

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Wed, 2024-03-27 at 22:34 -0700, Isaku Yamahata wrote:
> On Thu, Mar 28, 2024 at 02:49:56PM +1300,
> "Huang, Kai" <[email protected]> wrote:
>
> >
> >
> > On 28/03/2024 11:53 am, Isaku Yamahata wrote:
> > > On Tue, Mar 26, 2024 at 02:43:54PM +1300,
> > > "Huang, Kai" <[email protected]> wrote:
> > >
> > > > ... continue the previous review ...
> > > >
> > > > > +
> > > > > +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > > > > +{
> > > > > + WARN_ON_ONCE(!td_page_pa);
> > > >
> > > > From the name 'td_page_pa' we cannot tell whether it is a control page, but
> > > > this function is only intended for control page AFAICT, so perhaps a more
> > > > specific name.
> > > >
> > > > > +
> > > > > + /*
> > > > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> > > >
> > > > "are" -> "is".
> > > >
> > > > Are you sure it is TDCX, but not TDCS?
> > > >
> > > > AFAICT TDCX is the control structure for 'vcpu', but here you are handling
> > > > the control structure for the VM.
> > >
> > > TDCS, TDVPR, and TDCX. Will update the comment.
> >
> > But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_?
>
> So I'll make the patch that frees TDVPR, TDCX will change this comment.
>

Hmm.. Looking again, I am not sure why we even need
tdx_reclaim_control_page()?

It basically does tdx_reclaim_page() + free_page():

+static void tdx_reclaim_control_page(unsigned long td_page_pa)
+{
+ WARN_ON_ONCE(!td_page_pa);
+
+ /*
+ * TDCX are being reclaimed. TDX module maps TDCX with HKID
+ * assigned to the TD. Here the cache associated to the TD
+ * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
+ * cache doesn't need to be flushed again.
+ */
+ if (tdx_reclaim_page(td_page_pa))
+ /*
+ * Leak the page on failure:
+ * tdx_reclaim_page() returns an error if and only if there's an
+ * unexpected, fatal error, e.g. a SEAMCALL with bad params,
+ * incorrect concurrency in KVM, a TDX Module bug, etc.
+ * Retrying at a later point is highly unlikely to be
+ * successful.
+ * No log here as tdx_reclaim_page() already did.
+ */
+ return;
+ free_page((unsigned long)__va(td_page_pa));
+}

And why do you need a special function just for control page(s)?

>
> > Otherwise you will have to explain them.
> >
> > [...]
> >
> > > > > +
> > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > > +{
> > > > > + bool packages_allocated, targets_allocated;
> > > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > + cpumask_var_t packages, targets;
> > > > > + u64 err;
> > > > > + int i;
> > > > > +
> > > > > + if (!is_hkid_assigned(kvm_tdx))
> > > > > + return;
> > > > > +
> > > > > + if (!is_td_created(kvm_tdx)) {
> > > > > + tdx_hkid_free(kvm_tdx);
> > > > > + return;
> > > > > + }
> > > >
> > > > I lost tracking what does "td_created()" mean.
> > > >
> > > > I guess it means: KeyID has been allocated to the TDX guest, but not yet
> > > > programmed/configured.
> > > >
> > > > Perhaps add a comment to remind the reviewer?
> > >
> > > As Chao suggested, will introduce state machine for vm and vcpu.
> > >
> > > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/
> >
> > Could you elaborate what will the state machine look like?
> >
> > I need to understand it.
>
> Not yet. Chao only propose to introduce state machine. Right now it's just an
> idea.

Then why is a state machine better? I guess we need some concrete example to
tell which is better?

>
>
> > > How about this?
> > >
> > > /*
> > > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > > * TDH.MNG.KEY.FREEID() to free the HKID.
> > > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > > * present transient state of HKID.
> > > */
> >
> > Could you elaborate why it is still possible to have other thread removing
> > pages from TD?
> >
> > I am probably missing something, but the thing I don't understand is why
> > this function is triggered by MMU release? All the things done in this
> > function don't seem to be related to MMU at all.
>
> The KVM releases EPT pages on MMU notifier release. kvm_mmu_zap_all() does. If
> we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
> TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
> TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all()
> to use TDH.PHYMEM.PAGE.RECLAIM().

Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
TDH.PHYMEM.PAGE.RECLAIM()?

And does the difference matter in practice, i.e. did you see the former cause a
noticeable performance downgrade?

>
>
> > IIUC, by reaching here, you must already have done VPFLUSHDONE, which should
> > be called when you free vcpu?
>
> Not necessarily.

OK. I got confused between TDH.VP.FLUSH and TDH.MNG.VPFLUSHDONE.

>
>
> > Freeing vcpus is done in
> > kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
> > this tdx_mmu_release_keyid() is called?
>
> guest memfd complicates things. The race is between guest memfd release and mmu
> notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds
> including guest memfd.
>
> Here is the example. Let's say, we have fds for vhost, guest_memfd, kvm vcpu,
> and kvm vm. The process is exiting. Please notice vhost increments the
> reference of the mmu to access guest (shared) memory.
>
> exit_mmap():
> Usually mmu notifier release is fired. But not yet because of vhost.
>
> exit_files()
> close vhost fd. vhost starts timer to issue mmput().

Why does it need to start a timer to issue mmput(), but not call mmput()
directly?

>
> close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range().
> kvm_mmu_unmap_gfn_range() eventually this calls TDH.MEM.SEPT.REMOVE()
> and TDH.MEM.PAGE.REMOVE(). This takes time because it processes whole
> guest memory. Call kvm_put_kvm() at last.
>
> During unmapping on behalf of guest memfd, the timer of vhost fires to call
> mmput(). It triggers mmu notifier release.
>
> Close kvm vcpus/vm. they call kvm_put_kvm(). The last one calls
> kvm_destroy_vm().
>
> It's ideal to free HKID first for efficiency. But KVM doesn't have control on
> the order of fds.

Firstly, what kinda performance efficiency gain are we talking about?

We cannot really tell whether it can be justified to use two different methods
to tear down SEPT pages because of this.

Even if it's worth doing, it is an optimization, which can/should be done later
after you have put all building blocks together.

That being said, you are putting too much logic in this patch, i.e., it just
doesn't make sense to release the TDX keyID in the MMU code path in _this_ patch.

>
>
> > But here we are depending vcpus to be freed before tdx_mmu_release_hkid()?
>
> Not necessarily.

I am wondering when is TDH.VP.FLUSH done? Supposedly it should be called when
we free vcpus? But again this means you need to call TDH.MNG.VPFLUSHDONE
_after_ freeing vcpus. And this looks conflicting if you make
tdx_mmu_release_keyid() be called from the MMU notifier.

2024-03-28 13:22:00

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 6:17 PM, Chao Gao wrote:
> On Thu, Mar 28, 2024 at 11:40:27AM +0800, Xiaoyao Li wrote:
>> On 3/28/2024 11:04 AM, Edgecombe, Rick P wrote:
>>> On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
>>>>> The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this
>>>>> is
>>>>> any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different
>>>>> exit
>>>>> reason or behavior in mind?
>>>>
>>>> Just return error on TDVMCALL of RDMSR/WRMSR on TD's access of MTRR MSRs.
>>>
>>> MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
>>> able to use it and be surprised by a #GP.
>>>
>>> {
>>> "MSB": "12",
>>> "LSB": "12",
>>> "Field Size": "1",
>>> "Field Name": "MTRR",
>>> "Configuration Details": null,
>>> "Bit or Field Virtualization Type": "Fixed",
>>> "Virtualization Details": "0x1"
>>> },
>>>
>>> If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
>>> support it (do nothing but not return an error). Returning an error to the guest would be making up
>>> arch behavior, and to a lesser degree so would ignoring the WRMSR.
>>
>> The root cause is that it's a bad design of TDX to make MTRR fixed1. When
>> guest reads MTRR CPUID as 1 while getting #VE on MTRR MSRs, it already breaks
>> the architectural behavior. (MAC faces the similar issue , MCA is fixed1 as
>
> I won't say #VE on MTRR MSRs breaks anything. Writes to other MSRs (e.g.
> TSC_DEADLINE MSR) also lead to #VE. If KVM can emulate the MSR accesses, #VE
> should be fine.
>
> The problem is: MTRR CPUID feature is fixed 1 while KVM/QEMU doesn't know how
> to virtualize MTRR especially given that KVM cannot control the memory type in
> secure-EPT entries.

Yes, I partly agree on "#VE on MTRR MSRs breaks anything": #VE itself is
not a problem; the problem is whether the #VE is opt-in or unconditional.

For the TSC_DEADLINE_MSR, #VE is opt-in actually.
CPUID(1).EXC[24].TSC_DEADLINE is configurable by VMM. Only when VMM
configures the bit to 1, will the TD guest get #VE. If VMM configures it
to 0, TD guest just gets #GP. This is the reasonable design.

>> well while accessing MCA related MSRs gets #VE. This is why TDX is going to
>> fix them by introducing new feature and make them configurable)
>>
>>> So that is why I lean towards
>>> returning to userspace and giving the VMM the option to ignore it, return an error to the guest or
>>> show an error to the user.
>>
>> "show an error to the user" doesn't help at all. Because user cannot fix it,
>> nor does QEMU.
>
> The key point isn't who can fix/emulate MTRR MSRs. It is just KVM doesn't know
> how to handle this situation and ask userspace for help.
>
> Whether or how userspace can handle the MSR writes isn't KVM's problem. It may be
> better if KVM can tell userspace exactly in which cases KVM will exit to
> userspace. But there is no such an infrastructure.
>
> An example is: in KVM CET series, we find it is complex for KVM instruction
> emulator to emulate control flow instructions when CET is enabled. The
> suggestion is also to punt to userspace (w/o any indication to userspace that
> KVM would do this).

Please point me to the decision on CET? I'm interested in how userspace can
help on that.

>>
>>> If KVM can't support the behavior, better to get an actual error in
>>> userspace than a mysterious guest hang, right?
>> What behavior do you mean?
>>
>>> Outside of what kind of exit it is, do you object to the general plan to punt to userspace?
>>>
>>> Since this is a TDX specific limitation, I guess there is KVM_EXIT_TDX_VMCALL as a general category
>>> of TDVMCALLs that cannot be handled by KVM.
>
> Using KVM_EXIT_TDX_VMCALL looks fine.
>
> We need to explain why MTRR MSRs are handled in this way unlike other MSRs.
>
> It is better if KVM can tell userspace that MTRR virtualization isn't supported
> by KVM for TDs. Then userspace should resolve the conflict between KVM and TDX
> module on MTRR. But to report MTRR as unsupported, we need to make
> GET_SUPPORTED_CPUID a vm-scope ioctl. I am not sure if it is worth the effort.

My memory is that Sean disliked the vm-scope GET_SUPPORTED_CPUID for TDX
when he was at Intel.

Anyway, we can provide a TDX-specific interface to report SUPPORTED_CPUID
in KVM_TDX_CAPABILITIES, if we really need it.

>
>>
>> I just don't see any difference between handling it in KVM and handling it in
>> userspace: either a) return error to guest or b) ignore the WRMSR.


2024-03-28 13:39:46

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, Mar 28, 2024 at 09:21:37PM +0800, Xiaoyao Li wrote:
>On 3/28/2024 6:17 PM, Chao Gao wrote:
>> On Thu, Mar 28, 2024 at 11:40:27AM +0800, Xiaoyao Li wrote:
>> > On 3/28/2024 11:04 AM, Edgecombe, Rick P wrote:
>> > > On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
>> > > > > The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this
>> > > > > is
>> > > > > any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different
>> > > > > exit
>> > > > > reason or behavior in mind?
>> > > >
>> > > > Just return error on TDVMCALL of RDMSR/WRMSR on TD's access of MTRR MSRs.
>> > >
>> > > MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
>> > > able to use it and be surprised by a #GP.
>> > >
>> > > {
>> > > "MSB": "12",
>> > > "LSB": "12",
>> > > "Field Size": "1",
>> > > "Field Name": "MTRR",
>> > > "Configuration Details": null,
>> > > "Bit or Field Virtualization Type": "Fixed",
>> > > "Virtualization Details": "0x1"
>> > > },
>> > >
>> > > If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
>> > > support it (do nothing but not return an error). Returning an error to the guest would be making up
>> > > arch behavior, and to a lesser degree so would ignoring the WRMSR.
>> >
>> > The root cause is that it's a bad design of TDX to make MTRR fixed1. When
>> > guest reads MTRR CPUID as 1 while getting #VE on MTRR MSRs, it already breaks
>> > the architectural behavior. (MAC faces the similar issue , MCA is fixed1 as
>>
>> I won't say #VE on MTRR MSRs breaks anything. Writes to other MSRs (e.g.
>> TSC_DEADLINE MSR) also lead to #VE. If KVM can emulate the MSR accesses, #VE
>> should be fine.
>>
>> The problem is: MTRR CPUID feature is fixed 1 while KVM/QEMU doesn't know how
>> to virtualize MTRR especially given that KVM cannot control the memory type in
>> secure-EPT entries.
>
>yes, I partly agree on that "#VE on MTRR MSRs breaks anything". #VE is not a
>problem, the problem is if the #VE is opt-in or unconditional.

From guest's p.o.v, there is no difference: the guest doesn't know whether a feature
is opted in or not.

>
>For the TSC_DEADLINE_MSR, #VE is opt-in actually.
>CPUID(1).EXC[24].TSC_DEADLINE is configurable by VMM. Only when VMM
>configures the bit to 1, will the TD guest get #VE. If VMM configures it to
>0, TD guest just gets #GP. This is the reasonable design.
>
>> > well while accessing MCA related MSRs gets #VE. This is why TDX is going to
>> > fix them by introducing new feature and make them configurable)
>> >
>> > > So that is why I lean towards
>> > > returning to userspace and giving the VMM the option to ignore it, return an error to the guest or
>> > > show an error to the user.
>> >
>> > "show an error to the user" doesn't help at all. Because user cannot fix it,
>> > nor does QEMU.
>>
>> The key point isn't who can fix/emulate MTRR MSRs. It is just KVM doesn't know
>> how to handle this situation and ask userspace for help.
>>
>> Whether or how userspace can handle the MSR writes isn't KVM's problem. It may be
>> better if KVM can tell userspace exactly in which cases KVM will exit to
>> userspace. But there is no such an infrastructure.
>>
>> An example is: in KVM CET series, we find it is complex for KVM instruction
>> emulator to emulate control flow instructions when CET is enabled. The
>> suggestion is also to punt to userspace (w/o any indication to userspace that
>> KVM would do this).
>
>Please point me to decision of CET? I'm interested in how userspace can help
>on that.

https://lore.kernel.org/kvm/[email protected]/

>
>> >
>> > > If KVM can't support the behavior, better to get an actual error in
>> > > userspace than a mysterious guest hang, right?
>> > What behavior do you mean?
>> >
>> > > Outside of what kind of exit it is, do you object to the general plan to punt to userspace?
>> > >
>> > > Since this is a TDX specific limitation, I guess there is KVM_EXIT_TDX_VMCALL as a general category
>> > > of TDVMCALLs that cannot be handled by KVM.
>>
>> Using KVM_EXIT_TDX_VMCALL looks fine.
>>
>> We need to explain why MTRR MSRs are handled in this way unlike other MSRs.
>>
>> It is better if KVM can tell userspace that MTRR virtualization isn't supported
>> by KVM for TDs. Then userspace should resolve the conflict between KVM and TDX
>> module on MTRR. But to report MTRR as unsupported, we need to make
>> GET_SUPPORTED_CPUID a vm-scope ioctl. I am not sure if it is worth the effort.
>
>My memory is that Sean dislike the vm-scope GET_SUPPORTED_CPUID for TDX when
>he was at Intel.

Ok. No strong opinion on this.

2024-03-28 14:12:50

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Thu, Mar 28, 2024 at 11:14:42AM +0000, Huang, Kai wrote:
>> >
>> > [...]
>> >
>> > > > > +
>> > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
>> > > > > +{
>> > > > > + bool packages_allocated, targets_allocated;
>> > > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>> > > > > + cpumask_var_t packages, targets;
>> > > > > + u64 err;
>> > > > > + int i;
>> > > > > +
>> > > > > + if (!is_hkid_assigned(kvm_tdx))
>> > > > > + return;
>> > > > > +
>> > > > > + if (!is_td_created(kvm_tdx)) {
>> > > > > + tdx_hkid_free(kvm_tdx);
>> > > > > + return;
>> > > > > + }
>> > > >
>> > > > I lost tracking what does "td_created()" mean.
>> > > >
>> > > > I guess it means: KeyID has been allocated to the TDX guest, but not yet
>> > > > programmed/configured.
>> > > >
>> > > > Perhaps add a comment to remind the reviewer?
>> > >
>> > > As Chao suggested, will introduce state machine for vm and vcpu.
>> > >
>> > > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/
>> >
>> > Could you elaborate what will the state machine look like?
>> >
>> > I need to understand it.
>>
>> Not yet. Chao only propose to introduce state machine. Right now it's just an
>> idea.
>
>Then why state machine is better? I guess we need some concrete example to tell
>which is better?

Something like the TD Life Cycle State Machine (Section 9.1 of TDX module spec[1])

[1]: https://cdrdv2.intel.com/v1/dl/getContent/733568

I don't have the code. But using a few boolean variables to track the state of
TD and VCPU looks bad and hard to maintain and extend. At least, the state machine
is well-documented.

2024-03-28 14:27:10

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 060/130] KVM: x86/tdp_mmu: Apply mmu notifier callback to only shared GPA



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> The private GPAs that typically guest memfd backs aren't subject to MMU
> notifier because it isn't mapped into virtual address of user process.
> kvm_tdp_mmu_handle_gfn() handles the callback of the MMU notifier,
> clear_flush_young(), clear_young(), test_young()() and change_pte(). Make
                                                   ^
                                                   an extra "()"
> kvm_tdp_mmu_handle_gfn() aware of private mapping and skip private mapping.
>
> Even with AS_UNMOVABLE set, those mmu notifier are called. For example,
> ksmd triggers change_pte().

In the description about "AS_UNMOVABLE", you are referring to shared
memory, right?
Then it seems not related to the change of this patch.

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> ---
> v19:
> - type: test_gfn() => test_young()
>
> v18:
> - newly added
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e7514a807134..10507920f36b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1157,9 +1157,29 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> * into this helper allow blocking; it'd be dead, wasteful code.
> */
> for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> + gfn_t start, end;
> +
> + /*
> + * This function is called on behalf of mmu_notifier of
> + * clear_flush_young(), clear_young(), test_young()(), and

^
                                                               an extra "()"

> + * change_pte(). They apply to only shared GPAs.
> + */
> + WARN_ON_ONCE(range->only_private);
> + WARN_ON_ONCE(!range->only_shared);
> + if (is_private_sp(root))
> + continue;
> +
> + /*
> + * For TDX shared mapping, set GFN shared bit to the range,
> + * so the handler() doesn't need to set it, to avoid duplicated
> + * code in multiple handler()s.
> + */
> + start = kvm_gfn_to_shared(kvm, range->start);
> + end = kvm_gfn_to_shared(kvm, range->end);
> +
> rcu_read_lock();
>
> - tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> + tdp_root_for_each_leaf_pte(iter, root, start, end)
> ret |= handler(kvm, &iter, range);
>
> rcu_read_unlock();


2024-03-28 14:46:18

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On 3/28/2024 9:38 PM, Chao Gao wrote:
> On Thu, Mar 28, 2024 at 09:21:37PM +0800, Xiaoyao Li wrote:
>> On 3/28/2024 6:17 PM, Chao Gao wrote:
>>> On Thu, Mar 28, 2024 at 11:40:27AM +0800, Xiaoyao Li wrote:
>>>> On 3/28/2024 11:04 AM, Edgecombe, Rick P wrote:
>>>>> On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
>>>>>>> The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this
>>>>>>> is
>>>>>>> any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different
>>>>>>> exit
>>>>>>> reason or behavior in mind?
>>>>>>
>>>>>> Just return error on TDVMCALL of RDMSR/WRMSR on TD's access of MTRR MSRs.
>>>>>
>>>>> MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
>>>>> able to use it and be surprised by a #GP.
>>>>>
>>>>> {
>>>>> "MSB": "12",
>>>>> "LSB": "12",
>>>>> "Field Size": "1",
>>>>> "Field Name": "MTRR",
>>>>> "Configuration Details": null,
>>>>> "Bit or Field Virtualization Type": "Fixed",
>>>>> "Virtualization Details": "0x1"
>>>>> },
>>>>>
>>>>> If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
>>>>> support it (do nothing but not return an error). Returning an error to the guest would be making up
>>>>> arch behavior, and to a lesser degree so would ignoring the WRMSR.
>>>>
>>>> The root cause is that it's a bad design of TDX to make MTRR fixed1. When
>>>> guest reads MTRR CPUID as 1 while getting #VE on MTRR MSRs, it already breaks
>>>> the architectural behavior. (MAC faces the similar issue , MCA is fixed1 as
>>>
>>> I won't say #VE on MTRR MSRs breaks anything. Writes to other MSRs (e.g.
>>> TSC_DEADLINE MSR) also lead to #VE. If KVM can emulate the MSR accesses, #VE
>>> should be fine.
>>>
>>> The problem is: MTRR CPUID feature is fixed 1 while KVM/QEMU doesn't know how
>>> to virtualize MTRR especially given that KVM cannot control the memory type in
>>> secure-EPT entries.
>>
>> yes, I partly agree on that "#VE on MTRR MSRs breaks anything". #VE is not a
>> problem, the problem is if the #VE is opt-in or unconditional.
>
> From guest's p.o.v, there is no difference: the guest doesn't know whether a feature
> is opted in or not.

I don't argue that it makes any difference to the guest. I argue that it is a
bad design of TDX to make MTRR fixed1, which leaves the tough problem to the
VMM. The TDX architecture is the one to blame.

Though TDX is going to change it, we have to come up with something to handle
the current existing TDX modules if we want to support them.

I have no objection to leaving it to userspace, via KVM_EXIT_TDX_VMCALL.
If we go this path, I would suggest returning an error to the TD guest on
the QEMU side (when I prepare the QEMU patch for it) because QEMU cannot
emulate it either.


2024-03-28 16:58:21

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Thu, 2024-03-28 at 22:45 +0800, Xiaoyao Li wrote:
> I don't argue it makes any difference to guest. I argue that it is a bad
> design of TDX to make MTRR fixed1, which leaves the tough problem to
> VMM. TDX architecture is one should be blamed.

I didn't see anyone arguing against this point. It does seem strange to force something to be
exposed that can't be fully handled. I wonder what the history is.

>
> Though TDX is going to change it, we have to come up something to handle
> with current existing TDX if we want to support them.

Right. The things being discussed are untenable as long term solutions. The question is, is there
anything acceptable in the meantime. That is the goal of this thread.

We do have the other option of waiting for a new TDX module that could fix it better, but I thought
exiting to userspace for the time being would be a way to move forward.

Would be great to have a maintainer chime in on this point.

>
> I have no objection of leaving it to userspace, via KVM_EXIT_TDX_VMCALL.
> If we go this path, I would suggest return error to TD guest on QEMU
> side (when I prepare the QEMU patch for it) because QEMU cannot emulate
> it neither.

It would be nice to give the user (the user of qemu) some sort of notice of what is going on. For Linux
the workaround is clearcpuid=mtrr. If qemu can print something like "MTRRs not supported", or
whatever message fits, then the user can see what the problem is and add that to the kernel
command line. If they just see a guest crash because the guest can't handle the error, they will
have to debug it.

2024-03-28 18:28:17

by Edgecombe, Rick P

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Thu, 2024-03-28 at 09:36 +0800, Xiaoyao Li wrote:
> > > > Any reason to mask off non-configurable bits rather than return an error? this
> > > > is misleading to userspace because guest sees the values emulated by TDX module
> > > > instead of the values passed from userspace (i.e., the request from userspace
> > > > isn't done but there is no indication of that to userspace).
> > >
> > > Ok, I'll eliminate them.  If user space passes wrong cpuids, TDX module will
> > > return error. I'll leave the error check to the TDX module.
> >
> > I was just looking at this. Agreed. It breaks the selftests though.
>
> If all you prefer to go this direction, then please update the error
> handling of this specific SEAMCALL.

What do you mean by SEAMCALL, TDH_MNG_INIT? Can you be more specific?

2024-03-28 20:39:31

by Isaku Yamahata

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Thu, Mar 28, 2024 at 11:14:42AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Wed, 2024-03-27 at 22:34 -0700, Isaku Yamahata wrote:
> > On Thu, Mar 28, 2024 at 02:49:56PM +1300,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > >
> > >
> > > On 28/03/2024 11:53 am, Isaku Yamahata wrote:
> > > > On Tue, Mar 26, 2024 at 02:43:54PM +1300,
> > > > "Huang, Kai" <[email protected]> wrote:
> > > >
> > > > > ... continue the previous review ...
> > > > >
> > > > > > +
> > > > > > +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > > > > > +{
> > > > > > + WARN_ON_ONCE(!td_page_pa);
> > > > >
> > > > > From the name 'td_page_pa' we cannot tell whether it is a control page, but
> > > > > this function is only intended for control page AFAICT, so perhaps a more
> > > > > specific name.
> > > > >
> > > > > > +
> > > > > > + /*
> > > > > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> > > > >
> > > > > "are" -> "is".
> > > > >
> > > > > Are you sure it is TDCX, but not TDCS?
> > > > >
> > > > > AFAICT TDCX is the control structure for 'vcpu', but here you are handling
> > > > > the control structure for the VM.
> > > >
> > > > TDCS, TDVPR, and TDCX. Will update the comment.
> > >
> > > But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_?
> >
> > So I'll make the patch that frees TDVPR, TDCX will change this comment.
> >
>
> Hmm.. Looking again, I am not sure why we even need
> tdx_reclaim_control_page()?
>
> It basically does tdx_reclaim_page() + free_page():
>
> +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> +{
> + WARN_ON_ONCE(!td_page_pa);
> +
> + /*
> + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> + * assigned to the TD. Here the cache associated to the TD
> + * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> + * cache doesn't need to be flushed again.
> + */
> + if (tdx_reclaim_page(td_page_pa))
> + /*
> + * Leak the page on failure:
> + * tdx_reclaim_page() returns an error if and only if there's
> an
> + * unexpected, fatal error, e.g. a SEAMCALL with bad params,
> + * incorrect concurrency in KVM, a TDX Module bug, etc.
> + * Retrying at a later point is highly unlikely to be
> + * successful.
> + * No log here as tdx_reclaim_page() already did.
> + */
> + return;
> + free_page((unsigned long)__va(td_page_pa));
> +}
>
> And why do you need a special function just for control page(s)?

We can revise the code to have a common function for reclaiming pages.


> > > Otherwise you will have to explain them.
> > >
> > > [...]
> > >
> > > > > > +
> > > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > > > +{
> > > > > > + bool packages_allocated, targets_allocated;
> > > > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > > + cpumask_var_t packages, targets;
> > > > > > + u64 err;
> > > > > > + int i;
> > > > > > +
> > > > > > + if (!is_hkid_assigned(kvm_tdx))
> > > > > > + return;
> > > > > > +
> > > > > > + if (!is_td_created(kvm_tdx)) {
> > > > > > + tdx_hkid_free(kvm_tdx);
> > > > > > + return;
> > > > > > + }
> > > > >
> > > > > I lost tracking what does "td_created()" mean.
> > > > >
> > > > > I guess it means: KeyID has been allocated to the TDX guest, but not yet
> > > > > programmed/configured.
> > > > >
> > > > > Perhaps add a comment to remind the reviewer?
> > > >
> > > > As Chao suggested, will introduce state machine for vm and vcpu.
> > > >
> > > > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/
> > >
> > > Could you elaborate what will the state machine look like?
> > >
> > > I need to understand it.
> >
> > Not yet. Chao only proposed to introduce a state machine. Right now it's just
> > an idea.
>
> Then why state machine is better? I guess we need some concrete example to tell
> which is better?

At this point we don't know which is better. I personally think it's worthwhile
to give it a try. After experimenting, we may discard or adapt the idea.

Because the TDX spec already defines its state machine, we could follow it.


> > > > How about this?
> > > >
> > > > /*
> > > > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > > > * TDH.MNG.KEY.FREEID() to free the HKID.
> > > > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > > > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > > > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > > > * present transient state of HKID.
> > > > */
> > >
> > > Could you elaborate why it is still possible to have other thread removing
> > > pages from TD?
> > >
> > > I am probably missing something, but the thing I don't understand is why
> > > this function is triggered by MMU release? All the things done in this
> > > function don't seem to be related to MMU at all.
> >
> > The KVM releases EPT pages on MMU notifier release. kvm_mmu_zap_all() does. If
> > we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
> > TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
> > TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all()
> > to use TDH.PHYMEM.PAGE.RECLAIM().
>
> Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
> TDH.PHYMEM.PAGE.RECLAIM()?
>
> And does the difference matter in practice, i.e. did you see using the former
> having noticeable performance downgrade?

Yes. With the HKID alive, we have to assume that vcpus can still run, which means
TLB shootdowns. The difference is 2 extra SEAMCALLs + IPI synchronization for each
guest private page. If the guest has hundreds of GB, the difference can be
tens of minutes.

With the HKID alive, we need to assume vcpus are alive.
- TDH.MEM.PAGE.REMOVE()
- TDH.PHYMEM.PAGE_WBINVD()
- TLB shoot down
- TDH.MEM.TRACK()
- IPI to other vcpus
- wait for other vcpu to exit

After freeing HKID
- TDH.PHYMEM.PAGE.RECLAIM()
We already flushed TLBs and memory cache.
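
The cost difference between the two flows can be sketched with a toy model (step
counts taken from the lists above; the function names and counts are illustrative
only, not from the patch series, and the real dominating cost is the IPI/TLB
shootdown synchronization rather than the raw SEAMCALL count):

```c
#include <assert.h>

/*
 * Toy model of the per-page SEAMCALL counts in the two Secure-EPT
 * teardown flows described above.  Illustrative only.
 */

/* HKID still assigned: vcpus may still run, so each page removal needs
 * TDH.MEM.PAGE.REMOVE + TDH.PHYMEM.PAGE_WBINVD + TDH.MEM.TRACK, plus an
 * IPI round to the other vcpus (not counted here).
 */
static unsigned long seamcalls_with_hkid(unsigned long npages)
{
	return 3 * npages;
}

/* HKID already freed (after TDH.MNG.VPFLUSHDONE, TDH.PHYMEM.CACHE.WB and
 * TDH.MNG.KEY.FREEID): TLBs and caches are already flushed, so one
 * TDH.PHYMEM.PAGE.RECLAIM per page suffices.
 */
static unsigned long seamcalls_without_hkid(unsigned long npages)
{
	return npages;
}
```

This is where the "2 extra SEAMCALLs per guest private page" figure above comes from.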


> > > Freeing vcpus is done in
> > > kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
> > > this tdx_mmu_release_keyid() is called?
> >
> > guest memfd complicates things. The race is between guest memfd release and mmu
> > notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds
> > including guest memfd.
> >
> > Here is the example. Let's say, we have fds for vhost, guest_memfd, kvm vcpu,
> > and kvm vm. The process is exiting. Please notice vhost increments the
> > reference of the mmu to access guest (shared) memory.
> >
> > exit_mmap():
> > Usually mmu notifier release is fired. But not yet because of vhost.
> >
> > exit_files()
> > close vhost fd. vhost starts timer to issue mmput().
>
> Why does it need to start a timer to issue mmput(), but not call mmput()
> directly?

That's how vhost implements it. It's out of KVM's control. Other components, or
user space in another thread, can get references to the mmu or FDs and keep/free
them as they like.


> > close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range().
> > kvm_mmu_unmap_gfn_range() eventually this calls TDH.MEM.SEPT.REMOVE()
> > and TDH.MEM.PAGE.REMOVE(). This takes time because it processes whole
> > guest memory. Call kvm_put_kvm() at last.
> >
> > During unmapping on behalf of guest memfd, the timer of vhost fires to call
> > mmput(). It triggers mmu notifier release.
> >
> > Close kvm vcpus/vm. they call kvm_put_kvm(). The last one calls
> > kvm_destroy_vm().
> >
> > It's ideal to free HKID first for efficiency. But KVM doesn't have control on
> > the order of fds.
>
> Firstly, what kinda performance efficiency gain are we talking about?

2 extra SEAMCALLs + an IPI sync for each guest private page. If the guest memory
is hundreds of GB, the difference can be tens of minutes.


> We cannot really tell whether it can be justified to use two different methods
> to tear down SEPT page because of this.
>
> Even if it's worth to do, it is an optimization, which can/should be done later
> after you have put all building blocks together.
>
> That being said, you are putting too many logic in this patch, i.e., it just
> doesn't make sense to release TDX keyID in the MMU code path in _this_ patch.

I agree that this patch is too huge, and that we should break it into smaller
patches.


> > > But here we are depending vcpus to be freed before tdx_mmu_release_hkid()?
> >
> > Not necessarily.
>
> I am wondering when is TDH.VP.FLUSH done? Supposedly it should be called when
> we free vcpus? But again this means you need to call TDH.MNG.VPFLUSHDONE
> _after_ freeing vcpus. And this looks conflicting if you make
> tdx_mmu_release_keyid() being called from MMU notifier.

tdx_mmu_release_keyid() calls it explicitly for all vcpus.
--
Isaku Yamahata <[email protected]>

2024-03-28 21:03:56

by Isaku Yamahata

Subject: Re: [PATCH v19 069/130] KVM: TDX: Require TDP MMU and mmio caching for TDX

On Thu, Mar 28, 2024 at 01:24:27PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > As TDP MMU is becoming main stream than the legacy MMU, the legacy MMU
> > support for TDX isn't implemented. TDX requires KVM mmio caching.
>
> Can you add some description about why TDX requires mmio caching in the
> changelog?

Sure, will update the commit log.

As the TDX guest is protected, the guest has to issue TDG.VP.VMCALL<MMIO> on a
#VE. The VMM has to set up the Shared-EPT entry to inject #VE by setting the
entry value with the suppress-VE bit cleared.

KVM mmio caching is a feature to set the EPT entry for an MMIO GFN to a special
value instead of the default value with the suppress-VE bit set. So TDX KVM
wants to utilize it.
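
The #VE mechanics above can be illustrated with a small sketch, assuming bit 63
of an EPT entry is the suppress-#VE bit (as documented in the SDM for
EPT-violation #VE); the entry values here are illustrative, not KVM's actual
SPTE constants:

```c
#include <assert.h>
#include <stdint.h>

/* Bit 63 of an EPT entry is "suppress #VE" when EPT-violation #VE is
 * enabled (illustrative constant, not KVM's actual definition). */
#define EPT_SUPPRESS_VE		(1ULL << 63)

/* Default not-present entry: #VE is suppressed, so a guest access causes
 * an EPT violation that exits to the VMM. */
static uint64_t nonpresent_entry(void)
{
	return EPT_SUPPRESS_VE;
}

/* MMIO entry for a TD guest: leave the entry not-present with the
 * suppress bit clear, so the guest takes a #VE and issues
 * TDG.VP.VMCALL<MMIO>.  Installing such a per-GFN special value is what
 * KVM's mmio caching provides. */
static uint64_t td_mmio_entry(void)
{
	return 0;
}
```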

Thanks,
--
Isaku Yamahata <[email protected]>

2024-03-28 21:04:21

by Isaku Yamahata

Subject: Re: [PATCH v19 060/130] KVM: x86/tdp_mmu: Apply mmu notifier callback to only shared GPA

On Thu, Mar 28, 2024 at 04:29:50PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > The private GPAs that typically guest memfd backs aren't subject to MMU
> > notifier because it isn't mapped into virtual address of user process.
> > kvm_tdp_mmu_handle_gfn() handles the callback of the MMU notifier,
> > clear_flush_young(), clear_young(), test_young()() and change_pte(). Make
>                                                    ^
>                                                    an extra "()"

Will fix it. Thanks.

> > kvm_tdp_mmu_handle_gfn() aware of private mapping and skip private mapping.
> >
> > Even with AS_UNMOVABLE set, those mmu notifier are called. For example,
> > ksmd triggers change_pte().
>
> In the description about "AS_UNMOVABLE", you are referring to shared memory,
> right?
> Then, it seems not related to the change of this patch.

Ok, will remove this sentence.
--
Isaku Yamahata <[email protected]>

2024-03-28 21:10:49

by Isaku Yamahata

Subject: Re: [PATCH v19 091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c

On Thu, Mar 28, 2024 at 04:12:36PM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:33AM -0800, [email protected] wrote:
> >@@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
> > * notification vector is switched to the one that calls
> > * back to the pi_wakeup_handler() function.
> > */
> >- return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
> >+ return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
> >+ vmx_can_use_vtd_pi(vcpu->kvm);
>
> It is better to separate this functional change from the code refactoring.

Agreed. Let's split this patch.


> > }
> >
> > void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> >@@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> > if (!vmx_needs_pi_wakeup(vcpu))
> > return;
> >
> >- if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> >+ if (kvm_vcpu_is_blocking(vcpu) &&
> >+ (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
>
> Ditto.
>
> This looks incorrect to me. here we assume interrupt is always enabled for TD.
> But on TDVMCALL(HLT), the guest tells KVM if hlt is called with interrupt
> disabled. KVM can just check that interrupt status passed from the guest.

That's true. We could complicate this function and the HLT emulation, but I don't
think it's worthwhile because HLT with interrupts masked is rare; it is only used
for CPU onlining.
--
Isaku Yamahata <[email protected]>

2024-03-28 21:12:15

by Isaku Yamahata

Subject: Re: [PATCH v19 092/130] KVM: TDX: Implement interrupt injection

On Thu, Mar 28, 2024 at 06:56:26PM +0800,
Chao Gao <[email protected]> wrote:

> >@@ -848,6 +853,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> >
> > trace_kvm_entry(vcpu);
> >
> >+ if (pi_test_on(&tdx->pi_desc)) {
> >+ apic->send_IPI_self(POSTED_INTR_VECTOR);
> >+
> >+ kvm_wait_lapic_expire(vcpu);
>
> it seems the APIC timer change was inadvertently included.

Oops. Thanks for catching it.
--
Isaku Yamahata <[email protected]>

2024-03-29 01:58:10

by Chao Gao

Subject: Re: [PATCH v19 093/130] KVM: TDX: Implements vcpu request_immediate_exit

On Mon, Feb 26, 2024 at 12:26:35AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Now we are able to inject interrupts into TDX vcpu, it's ready to block TDX
>vcpu. Wire up kvm x86 methods for blocking/unblocking vcpu for TDX. To
>unblock on pending events, request immediate exit methods is also needed.

TDX doesn't support this immediate exit. It is considered a potential
attack on TDs, so the TDX module deploys 0/1-step mitigations to prevent it.
Even if KVM issues a self-IPI before TD-entry, the TD-exit will happen after
the guest runs a random number of instructions.

KVM shouldn't request immediate exits in the first place. Just emit a
warning if KVM tries to do this.

>
>Signed-off-by: Isaku Yamahata <[email protected]>
>Reviewed-by: Paolo Bonzini <[email protected]>
>---
> arch/x86/kvm/vmx/main.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>index f2c9d6358f9e..ee6c04959d4c 100644
>--- a/arch/x86/kvm/vmx/main.c
>+++ b/arch/x86/kvm/vmx/main.c
>@@ -372,6 +372,16 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
> vmx_enable_irq_window(vcpu);
> }
>
>+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
>+{
>+ if (is_td_vcpu(vcpu)) {
>+ __kvm_request_immediate_exit(vcpu);
>+ return;
>+ }
>+
>+ vmx_request_immediate_exit(vcpu);
>+}
>+
> static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
> if (is_td_vcpu(vcpu))
>@@ -549,7 +559,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .check_intercept = vmx_check_intercept,
> .handle_exit_irqoff = vmx_handle_exit_irqoff,
>
>- .request_immediate_exit = vmx_request_immediate_exit,
>+ .request_immediate_exit = vt_request_immediate_exit,
>
> .sched_in = vt_sched_in,
>
>--
>2.25.1
>
>

2024-03-29 02:14:57

by Chao Gao

Subject: Re: [PATCH v19 094/130] KVM: TDX: Implement methods to inject NMI

>+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
>+{
>+ if (is_td_vcpu(vcpu))
>+ return;
>+
>+ vmx_set_nmi_mask(vcpu, masked);
>+}
>+
>+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
>+{
>+ /* Refer the comment in vt_get_nmi_mask(). */
>+ if (is_td_vcpu(vcpu))
>+ return;
>+
>+ vmx_enable_nmi_window(vcpu);
>+}

The two actually request something to be done for the TD. But we make them nops
as the TDX module doesn't allow the VMM to configure the NMI mask and NMI window.
Do you think they are worth a WARN_ON_ONCE()? Or does adding WARN_ON_ONCE()
require a lot of code refactoring in KVM's NMI injection logic?

2024-03-29 02:55:29

by Chao Gao

Subject: Re: [PATCH v19 095/130] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument

On Mon, Feb 26, 2024 at 12:26:37AM -0800, [email protected] wrote:
>From: Sean Christopherson <[email protected]>
>
>TDX uses different ABI to get information about VM exit. Pass intr_info to
>the NMI and INTR handlers instead of pulling it from vcpu_vmx in
>preparation for sharing the bulk of the handlers with TDX.
>
>When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
>exit qualification etc rather than the VMCS fields because VMM doesn't have
>access to the VMCS. The eventual code will be
>
>VMX:
> - get exit reason, intr_info, exit_qualification, and etc from VMCS
> - call NMI/INTR handlers (common code)
>
>TDX:
> - get exit reason, intr_info, exit_qualification, and etc from guest
> registers
> - call NMI/INTR handlers (common code)
>
>Signed-off-by: Sean Christopherson <[email protected]>
>Signed-off-by: Isaku Yamahata <[email protected]>
>Reviewed-by: Paolo Bonzini <[email protected]>

Reviewed-by: Chao Gao <[email protected]>

2024-03-29 03:25:22

by Chao Gao

Subject: Re: [PATCH v19 097/130] KVM: x86: Split core of hypercall emulation to helper function

On Mon, Feb 26, 2024 at 12:26:39AM -0800, [email protected] wrote:
>@@ -10162,18 +10151,49 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>
> WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>+ /* stat is incremented on completion. */

Perhaps we could use a distinct return value to signal that the request is redirected
to userspace. This way, more cases can be supported, e.g., accesses to MTRR
MSRs, requests to service TDs, etc. And then ...

> return 0;
> }
> default:
> ret = -KVM_ENOSYS;
> break;
> }
>+
> out:
>+ ++vcpu->stat.hypercalls;
>+ return ret;
>+}
>+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
>+
>+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>+{
>+ unsigned long nr, a0, a1, a2, a3, ret;
>+ int op_64_bit;
>+ int cpl;
>+
>+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
>+ return kvm_xen_hypercall(vcpu);
>+
>+ if (kvm_hv_hypercall_enabled(vcpu))
>+ return kvm_hv_hypercall(vcpu);
>+
>+ nr = kvm_rax_read(vcpu);
>+ a0 = kvm_rbx_read(vcpu);
>+ a1 = kvm_rcx_read(vcpu);
>+ a2 = kvm_rdx_read(vcpu);
>+ a3 = kvm_rsi_read(vcpu);
>+ op_64_bit = is_64_bit_hypercall(vcpu);
>+ cpl = static_call(kvm_x86_get_cpl)(vcpu);
>+
>+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
>+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
>+ /* MAP_GPA tosses the request to the user space. */

no need to check what the request is. Just checking the return value will suffice.

>+ return 0;
>+
> if (!op_64_bit)
> ret = (u32)ret;
> kvm_rax_write(vcpu, ret);
>
>- ++vcpu->stat.hypercalls;
> return kvm_skip_emulated_instruction(vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>--
>2.25.1
>
>
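
Chao's distinct-return-value idea above could look like the following sketch
(the enum, values, and function names are hypothetical, not the real KVM API;
the real code would thread such a status through __kvm_emulate_hypercall()):

```c
#include <assert.h>

/* Hypothetical return convention for hypercall emulation: a dedicated
 * value tells the caller the request was redirected to userspace, so the
 * caller needs no per-hypercall special casing such as checking for
 * KVM_HC_MAP_GPA_RANGE. */
enum hc_status {
	HC_EXIT_USERSPACE = 0,	/* completed after userspace returns */
	HC_HANDLED = 1,		/* done in kernel; skip the instruction */
};

static enum hc_status emulate_hypercall(unsigned long nr)
{
	/* 12 stands in for KVM_HC_MAP_GPA_RANGE here; other
	 * userspace-handled requests (MTRR MSR accesses, service-TD
	 * requests, ...) could simply return HC_EXIT_USERSPACE too. */
	if (nr == 12)
		return HC_EXIT_USERSPACE;
	return HC_HANDLED;
}
```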

2024-03-29 12:27:04

by Binbin Wu

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 3/21/2024 10:17 PM, Isaku Yamahata wrote:
> On Wed, Mar 20, 2024 at 01:12:01PM +0800,
> Chao Gao <[email protected]> wrote:
>
>>> config KVM_SW_PROTECTED_VM
>>> bool "Enable support for KVM software-protected VMs"
>>> - depends on EXPERT

This change is not needed, right?
Since you intended to use KVM_GENERIC_PRIVATE_MEM, not KVM_SW_PROTECTED_VM.

>>> depends on KVM && X86_64
>>> select KVM_GENERIC_PRIVATE_MEM
>>> help
>>> @@ -89,6 +88,8 @@ config KVM_SW_PROTECTED_VM
>>> config KVM_INTEL
>>> tristate "KVM for Intel (and compatible) processors support"
>>> depends on KVM && IA32_FEAT_CTL
>>> + select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
>> why does INTEL_TDX_HOST select KVM_SW_PROTECTED_VM?
> I wanted KVM_GENERIC_PRIVATE_MEM. Ah, we should do
>
> select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
>
>
>>> + select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
>>> help
>>> .vcpu_precreate = vmx_vcpu_precreate,
>>> .vcpu_create = vmx_vcpu_create,
>>
[...]

2024-03-29 13:28:38

by Binbin Wu

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 3/29/2024 4:39 AM, Isaku Yamahata wrote:

[...]
>>>>> How about this?
>>>>>
>>>>> /*
>>>>> * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
>>>>> * TDH.MNG.KEY.FREEID() to free the HKID.
>>>>> * Other threads can remove pages from TD. When the HKID is assigned, we need
>>>>> * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
>>>>> * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
>>>>> * present transient state of HKID.
>>>>> */
>>>> Could you elaborate why it is still possible to have other thread removing
>>>> pages from TD?
>>>>
>>>> I am probably missing something, but the thing I don't understand is why
>>>> this function is triggered by MMU release? All the things done in this
>>>> function don't seem to be related to MMU at all.
>>> The KVM releases EPT pages on MMU notifier release. kvm_mmu_zap_all() does. If
>>> we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
>>> TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
>>> TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all()
>>> to use TDH.PHYMEM.PAGE.RECLAIM().
>> Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
>> TDH.PHYMEM.PAGE.RECLAIM()?
>>
>> And does the difference matter in practice, i.e. did you see using the former
>> having noticeable performance downgrade?
> Yes. With HKID alive, we have to assume that vcpu can run still. It means TLB
> shootdown. The difference is 2 extra SEAMCALL + IPI synchronization for each
> guest private page. If the guest has hundreds of GB, the difference can be
> tens of minutes.
>
> With HKID alive, we need to assume vcpu is alive.
> - TDH.MEM.PAGE.REMOVE()
> - TDH.PHYMEM.PAGE_WBINVD()
> - TLB shoot down
> - TDH.MEM.TRACK()
> - IPI to other vcpus
> - wait for other vcpu to exit

Do we have a way to batch the TLB shootdown?
IIUC, in the current implementation, a TLB shootdown needs to be done for
each page removal, right?


>
> After freeing HKID
> - TDH.PHYMEM.PAGE.RECLAIM()
> We already flushed TLBs and memory cache.
>
>
>>>> Freeing vcpus is done in
>>>> kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
>>>> this tdx_mmu_release_keyid() is called?
>>> guest memfd complicates things. The race is between guest memfd release and mmu
>>> notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds
>>> including guest memfd.
>>>
>>> Here is the example. Let's say, we have fds for vhost, guest_memfd, kvm vcpu,
>>> and kvm vm. The process is exiting. Please notice vhost increments the
>>> reference of the mmu to access guest (shared) memory.
>>>
>>> exit_mmap():
>>> Usually mmu notifier release is fired. But not yet because of vhost.
>>>
>>> exit_files()
>>> close vhost fd. vhost starts timer to issue mmput().
>> Why does it need to start a timer to issue mmput(), but not call mmput()
>> directly?
> That's how vhost implements it. It's out of KVM control. Other component or
> user space as other thread can get reference to mmu or FDs. They can keep/free
> them as they like.
>
>
>>> close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range().
>>> kvm_mmu_unmap_gfn_range() eventually this calls TDH.MEM.SEPT.REMOVE()
>>> and TDH.MEM.PAGE.REMOVE(). This takes time because it processes whole
>>> guest memory. Call kvm_put_kvm() at last.
>>>
>>> During unmapping on behalf of guest memfd, the timer of vhost fires to call
>>> mmput(). It triggers mmu notifier release.
>>>
>>> Close kvm vcpus/vm. they call kvm_put_kvm(). The last one calls
>>> kvm_destroy_vm().
>>>
>>> It's ideal to free HKID first for efficiency. But KVM doesn't have control on
>>> the order of fds.
>> Firstly, what kinda performance efficiency gain are we talking about?
> 2 extra SEAMCALL + IPI sync for each guest private page. If the guest memory
> is hundreds of GB, the difference can be tens of minutes.
>
>
>> We cannot really tell whether it can be justified to use two different methods
>> to tear down SEPT page because of this.
>>
>> Even if it's worth to do, it is an optimization, which can/should be done later
>> after you have put all building blocks together.
>>
>> That being said, you are putting too many logic in this patch, i.e., it just
>> doesn't make sense to release TDX keyID in the MMU code path in _this_ patch.
> I agree that this patch is too huge, and that we should break it into smaller
> patches.
>
>
>>>> But here we are depending vcpus to be freed before tdx_mmu_release_hkid()?
>>> Not necessarily.
>> I am wondering when is TDH.VP.FLUSH done? Supposedly it should be called when
>> we free vcpus? But again this means you need to call TDH.MNG.VPFLUSHDONE
>> _after_ freeing vcpus. And this looks conflicting if you make
>> tdx_mmu_release_keyid() being called from MMU notifier.
> tdx_mmu_release_keyid() call it explicitly for all vcpus.


2024-04-01 04:11:26

by Chao Gao

Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

>+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>+{
>+ unsigned long exit_qual;
>+
>+ if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
>+ /*
>+ * Always treat SEPT violations as write faults. Ignore the
>+ * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
>+ * TD private pages are always RWX in the SEPT tables,
>+ * i.e. they're always mapped writable. Just as importantly,
>+ * treating SEPT violations as write faults is necessary to
>+ * avoid COW allocations, which will cause TDAUGPAGE failures
>+ * due to aliasing a single HPA to multiple GPAs.
>+ */
>+#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
>+ exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
>+ } else {
>+ exit_qual = tdexit_exit_qual(vcpu);
>+ if (exit_qual & EPT_VIOLATION_ACC_INSTR) {

Unless the CPU has a bug, an instruction fetch in a TD from shared memory causes
a #PF. I think you can add a comment for this.

Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.

>+ pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
>+ tdexit_gpa(vcpu), kvm_rip_read(vcpu));
>+ vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
>+ vcpu->run->ex.exception = PF_VECTOR;
>+ vcpu->run->ex.error_code = exit_qual;
>+ return 0;
>+ }
>+ }
>+
>+ trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
>+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
>+}
>+
>+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
>+{
>+ WARN_ON_ONCE(1);
>+
>+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
>+ vcpu->run->internal.ndata = 2;
>+ vcpu->run->internal.data[0] = EXIT_REASON_EPT_MISCONFIG;
>+ vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
>+
>+ return 0;
>+}
>+
> int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> {
> union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
>@@ -1345,6 +1390,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>
> switch (exit_reason.basic) {
>+ case EXIT_REASON_EPT_VIOLATION:
>+ return tdx_handle_ept_violation(vcpu);
>+ case EXIT_REASON_EPT_MISCONFIG:
>+ return tdx_handle_ept_misconfig(vcpu);

Handling EPT misconfiguration can be dropped because the "default" case handles
all unexpected exits in the same way.


> case EXIT_REASON_OTHER_SMI:
> /*
> * If reach here, it's not a Machine Check System Management
>--
>2.25.1
>
>

2024-04-01 07:23:32

by Xiaoyao Li

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On 3/29/2024 2:26 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-03-28 at 09:36 +0800, Xiaoyao Li wrote:
>>>>> Any reason to mask off non-configurable bits rather than return an error? this
>>>>> is misleading to userspace because guest sees the values emulated by TDX module
>>>>> instead of the values passed from userspace (i.e., the request from userspace
>>>>> isn't done but there is no indication of that to userspace).
>>>>
>>>> Ok, I'll eliminate them.  If user space passes wrong cpuids, TDX module will
>>>> return error. I'll leave the error check to the TDX module.
>>>
>>> I was just looking at this. Agreed. It breaks the selftests though.
>>
>> If all you prefer to go this direction, then please update the error
>> handling of this specific SEAMCALL.
>
> What do you mean by SEAMCALL, TDH_MNG_INIT? Can you be more specific?

Sorry. I missed the fact that the current patch already has specific
handling of TDX_OPERAND_INVALID for TDH.MNG.INIT.

I need to update QEMU to match the new behavior.

2024-04-01 08:22:30

by Chao Gao

Subject: Re: [PATCH v19 102/130] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT

On Mon, Feb 26, 2024 at 12:26:44AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Because guest TD state is protected, exceptions in guest TDs can't be
>intercepted. TDX VMM doesn't need to handle exceptions.
>tdx_handle_exit_irqoff() handles NMI and machine check. Ignore NMI and

tdx_handle_exit_irqoff() doesn't handle NMIs.

>machine check and continue guest TD execution.
>
>For external interrupt, increment stats same to the VMX case.
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>Reviewed-by: Paolo Bonzini <[email protected]>
>---
> arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index 0db80fa020d2..bdd74682b474 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -918,6 +918,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
> }
>
>+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
>+{
>+ u32 intr_info = tdexit_intr_info(vcpu);
>+
>+ if (is_nmi(intr_info) || is_machine_check(intr_info))
>+ return 1;

Add a comment in code as well.

>+
>+ kvm_pr_unimpl("unexpected exception 0x%x(exit_reason 0x%llx qual 0x%lx)\n",
>+ intr_info,
>+ to_tdx(vcpu)->exit_reason.full, tdexit_exit_qual(vcpu));
>+ return -EFAULT;

-EFAULT looks incorrect.

>+}
>+
>+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
>+{
>+ ++vcpu->stat.irq_exits;
>+ return 1;
>+}
>+
> static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
> {
> vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>@@ -1390,6 +1409,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>
> switch (exit_reason.basic) {
>+ case EXIT_REASON_EXCEPTION_NMI:
>+ return tdx_handle_exception(vcpu);
>+ case EXIT_REASON_EXTERNAL_INTERRUPT:
>+ return tdx_handle_external_interrupt(vcpu);
> case EXIT_REASON_EPT_VIOLATION:
> return tdx_handle_ept_violation(vcpu);
> case EXIT_REASON_EPT_MISCONFIG:
>--
>2.25.1
>
>

2024-04-01 09:14:30

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 103/130] KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI

On Mon, Feb 26, 2024 at 12:26:45AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>When BIOS eMCA MCE-SMI morphing is enabled, the #MC is morphed to an MSMI
>(Machine Check System Management Interrupt). The SMI then causes a TD exit
>to KVM with the exit reason EXIT_REASON_OTHER_SMI and the MSMI bit set in
>the exit qualification, instead of EXIT_REASON_EXCEPTION_NMI with an #MC
>exception.
>
>Handle EXIT_REASON_OTHER_SMI with the MSMI bit set in the exit qualification
>as an MCE (Machine Check Exception) that happened while the TD guest was
>running.
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
> arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++---
> arch/x86/kvm/vmx/tdx_arch.h | 2 ++
> 2 files changed, 39 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index bdd74682b474..117c2315f087 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -916,6 +916,30 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> tdexit_intr_info(vcpu));
> else if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
> vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
>+ else if (unlikely(tdx->exit_reason.non_recoverable ||
>+ tdx->exit_reason.error)) {

why not just:
else if (tdx->exit_reason.basic == EXIT_REASON_OTHER_SMI) {


i.e., does EXIT_REASON_OTHER_SMI imply exit_reason.non_recoverable or
exit_reason.error?

>+ /*
>+ * The only reason it gets EXIT_REASON_OTHER_SMI is there is an
>+ * #MSMI(Machine Check System Management Interrupt) with
>+ * exit_qualification bit 0 set in TD guest.
>+ * The #MSMI is delivered right after SEAMCALL returns,
>+ * and an #MC is delivered to host kernel after SMI handler
>+ * returns.
>+ *
>+ * The #MC right after SEAMCALL is fixed up and skipped in #MC

It looks like fixing up and skipping the #MC on the first instruction after
TD exit is missing in v19?

>+ * handler because it's an #MC that happened in the TD guest, which we
>+ * cannot handle in the host's context.
>+ *
>+ * Call KVM's machine check handler explicitly here.
>+ */
>+ if (tdx->exit_reason.basic == EXIT_REASON_OTHER_SMI) {
>+ unsigned long exit_qual;
>+
>+ exit_qual = tdexit_exit_qual(vcpu);
>+ if (exit_qual & TD_EXIT_OTHER_SMI_IS_MSMI)

>+ kvm_machine_check();
>+ }
>+ }
> }
>
> static int tdx_handle_exception(struct kvm_vcpu *vcpu)
>@@ -1381,6 +1405,11 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> exit_reason.full, exit_reason.basic,
> to_kvm_tdx(vcpu->kvm)->hkid,
> set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
>+
>+ /*
>+ * tdx_handle_exit_irqoff() handled EXIT_REASON_OTHER_SMI. It
>+ * must be handled before enabling preemption because it's #MC.
>+ */

If EXIT_REASON_OTHER_SMI is handled there, why still go to unhandled_exit?

> goto unhandled_exit;
> }
>
>@@ -1419,9 +1448,14 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> return tdx_handle_ept_misconfig(vcpu);
> case EXIT_REASON_OTHER_SMI:
> /*
>- * If reach here, it's not a Machine Check System Management
>- * Interrupt(MSMI). #SMI is delivered and handled right after
>- * SEAMRET, nothing needs to be done in KVM.
>+ * Unlike VMX, all SMIs in SEAM non-root mode (i.e. when a
>+ * TD guest vcpu is running) cause a TD exit to the TDX module,
>+ * then SEAMRET to KVM. Once it exits to KVM, the SMI is
>+ * delivered and handled right away.
>+ *
>+ * - If it's a Machine Check System Management Interrupt
>+ * (MSMI), it's handled above due to the non_recoverable bit.
>+ * - If it's not an MSMI, nothing needs to be done here.

This corrects a comment added in patch 100. Maybe we can just merge patch 100 into
this one?

> */
> return 1;
> default:
>diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
>index efc3c61c14ab..87ef22e9cd49 100644
>--- a/arch/x86/kvm/vmx/tdx_arch.h
>+++ b/arch/x86/kvm/vmx/tdx_arch.h
>@@ -42,6 +42,8 @@
> #define TDH_VP_WR 43
> #define TDH_SYS_LP_SHUTDOWN 44
>
>+#define TD_EXIT_OTHER_SMI_IS_MSMI BIT(1)
>+
> /* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> #define TDX_NON_ARCH BIT_ULL(63)
> #define TDX_CLASS_SHIFT 56
>--
>2.25.1
>
>

2024-04-01 10:00:03

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 104/130] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)

> static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> {
> return tdx->td_vcpu_created;
>@@ -897,6 +932,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_complete_interrupts(vcpu);
>
>+ if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
>+ tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];

kvm_rcx_read()?


>+ else
>+ tdx->tdvmcall.rcx = 0;

RCX on TDVMCALL exit is supposed to be consumed by TDX module. I don't get why
caching it is necessary. Can tdx->tdvmcall be simply dropped?

2024-04-01 10:41:46

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure


[...]

> >
> > And why do you need a special function just for control page(s)?
>
> We can revise the code to have common function for reclaiming page.

I interpret this as you will remove tdx_reclaim_control_page(), and have one
function to reclaim _ALL_ TDX private pages.

>
>
> > > > > How about this?
> > > > >
> > > > > /*
> > > > > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > > > > * TDH.MNG.KEY.FREEID() to free the HKID.
> > > > > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > > > > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > > > > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > > > > * present transient state of HKID.
> > > > > */
> > > >
> > > > Could you elaborate why it is still possible to have other thread removing
> > > > pages from TD?
> > > >
> > > > I am probably missing something, but the thing I don't understand is why
> > > > this function is triggered by MMU release? All the things done in this
> > > > function don't seem to be related to MMU at all.
> > >
> > > KVM releases EPT pages on MMU notifier release; kvm_mmu_zap_all() does this.
> > > If we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
> > > TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
> > > TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free the HKID before
> > > kvm_mmu_zap_all() so as to use TDH.PHYMEM.PAGE.RECLAIM().
> >
> > Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
> > TDH.PHYMEM.PAGE.RECLAIM()?
> >
> > And does the difference matter in practice, i.e. did you see using the former
> > having noticeable performance downgrade?
>
> Yes. With the HKID alive, we have to assume that vcpus can still run, which
> means TLB shootdowns. The difference is 2 extra SEAMCALLs + IPI
> synchronization for each guest private page. If the guest has hundreds of GB,
> the difference can be tens of minutes.
>
> With HKID alive, we need to assume vcpu is alive.
> - TDH.MEM.PAGE.REMOVE()
> - TDH.PHYMEM.PAGE_WBINVD()
> - TLB shoot down
> - TDH.MEM.TRACK()
> - IPI to other vcpus
> - wait for other vcpu to exit
>
> After freeing HKID
> - TDH.PHYMEM.PAGE.RECLAIM()
> We already flushed TLBs and memory cache.

[...]

> >
> > Firstly, what kinda performance efficiency gain are we talking about?
>
> 2 extra SEAMCALL + IPI sync for each guest private page. If the guest memory
> is hundreds of GB, the difference can be tens of minutes.

[...]

>
>
> > We cannot really tell whether it can be justified to use two different methods
> > to tear down SEPT page because of this.
> >
> > Even if it's worth to do, it is an optimization, which can/should be done later
> > after you have put all building blocks together.
> >
> > That being said, you are putting too many logic in this patch, i.e., it just
> > doesn't make sense to release TDX keyID in the MMU code path in _this_ patch.
>
> I agree that this patch is too huge, and that we should break it into smaller
> patches.

IMHO it's not only about breaking it into smaller pieces; you are also mixing
performance optimizations and essential functionality together.

Moving the reclaiming of the TDX private KeyID to the MMU notifier (in order to
get better performance when shutting down a TDX guest) depends on a lot of SEPT
details (how to reclaim a private page, TLB flushes, etc.), which haven't been
mentioned at all yet.

It's hard to review code like this.

I think here in this patch, we should just put reclaiming TDX keyID to the
"normal" place. After you have done all SEPT (and related) patches, you can
have a patch to improve the performance:

KVM: TDX: Improve TDX guest shutdown latency

Then you put your performance data there, i.e., "tens of minutes difference for
TDX guest with hundreds of GB memory", to justify that patch.

2024-04-01 11:42:03

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Allocate a protected page table for the private page table, and add hooks to
> operate on the protected page table. This patch adds allocation/freeing of
> protected page tables and the hooks. When calling hooks to update an SPTE
> entry, freeze the entry, call the hooks, and unfreeze the entry to allow
> concurrent updates on the page tables, which is the advantage of the TDP MMU.
> As kvm_gfn_shared_mask() always returns false, those hooks aren't called yet
> with this patch.
>
> When the faulting GPA is private, the KVM fault is called private. When
> resolving a private KVM fault, allocate a protected page table and call the
> hooks to operate on it. On a change of the private PTE entry, invoke the
> kvm_x86_ops hook in __handle_changed_spte() to propagate the change to the
> protected page table. The following depicts the relationship.
>
>            private KVM page fault |
>                     |             |
>                     V             |
>            private GPA            |       CPU protected EPTP
>                     |             |               |
>                     V             |               V
>            private PT root        |       protected PT root
>                     |             |               |
>                     V             |               V
>            private PT --hook to propagate-->protected PT
>                     |             |               |
>                     \-------------+------\        |
>                                   |      |        |
>                                   |      V        V
>                                   |      private guest page
>                                   |
>                                   |
>         non-encrypted memory      |       encrypted memory
>                                   |
> PT: page table
>
> The existing KVM TDP MMU code uses atomic update of SPTE. On populating
> the EPT entry, atomically set the entry. However, it requires TLB
> shootdown to zap SPTE. To address it, the entry is frozen with the special
> SPTE value that clears the present bit. After the TLB shootdown, the entry
> is set to the eventual value (unfreeze).
>
> For the protected page table, hooks are called to update the protected page
> table in addition to the direct access to the private SPTE. For the zapping
> case, freezing the SPTE works: hooks can be called in addition to the TLB
> shootdown. For populating the private SPTE entry, there can be a race
> condition without further protection:
>
> vcpu 1: populating 2M private SPTE
> vcpu 2: populating 4K private SPTE
> vcpu 2: TDX SEAMCALL to update 4K protected SPTE => error
> vcpu 1: TDX SEAMCALL to update 2M protected SPTE
>
> To avoid the race, the frozen SPTE is utilized. Instead of atomically
> updating the private entry, freeze the entry, call the hook that updates the
> protected SPTE, then set the entry to the final value.
>
> Support 4K page only at this stage. 2M page support can be done in future
> patches.
>
> Co-developed-by: Kai Huang <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v19:
> - drop CONFIG_KVM_MMU_PRIVATE
>
> v18:
> - Rename freezed => frozen
>
> v14 -> v15:
> - Refined is_private condition check in kvm_tdp_mmu_map().
> Add kvm_gfn_shared_mask() check.
> - catch up for struct kvm_range change
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 5 +
> arch/x86/include/asm/kvm_host.h | 11 ++
> arch/x86/kvm/mmu/mmu.c | 17 +-
> arch/x86/kvm/mmu/mmu_internal.h | 13 +-
> arch/x86/kvm/mmu/tdp_iter.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 308 +++++++++++++++++++++++++----
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> virt/kvm/kvm_main.c | 1 +
> 8 files changed, 320 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index a8e96804a252..e1c75f8c1b25 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -101,6 +101,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP_OPTIONAL(link_private_spt)
> +KVM_X86_OP_OPTIONAL(free_private_spt)
> +KVM_X86_OP_OPTIONAL(set_private_spte)
> +KVM_X86_OP_OPTIONAL(remove_private_spte)
> +KVM_X86_OP_OPTIONAL(zap_private_spte)
> KVM_X86_OP(has_wbinvd_exit)
> KVM_X86_OP(get_l2_tsc_offset)
> KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index efd3fda1c177..bc0767c884f7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -468,6 +468,7 @@ struct kvm_mmu {
> int (*sync_spte)(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *sp, int i);
> struct kvm_mmu_root_info root;
> + hpa_t private_root_hpa;
> union kvm_cpu_role cpu_role;
> union kvm_mmu_page_role root_role;
>
> @@ -1740,6 +1741,16 @@ struct kvm_x86_ops {
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
>
> + int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *private_spt);
> + int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *private_spt);
> + int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);
> + int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);
> + int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> +
> bool (*has_wbinvd_exit)(void);
>
> u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 30c86e858ae4..0e0321ad9ca2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3717,7 +3717,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> goto out_unlock;
>
> if (tdp_mmu_enabled) {
> - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> + if (kvm_gfn_shared_mask(vcpu->kvm) &&
> + !VALID_PAGE(mmu->private_root_hpa)) {
> + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> + mmu->private_root_hpa = root;
> + }
> + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> mmu->root.hpa = root;
> } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> - gfn_t base = gfn_round_for_level(fault->gfn,
> + gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> fault->max_level);
>
> if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> };
>
> WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>
> r = mmu_topup_memory_caches(vcpu, false);
> @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>
> mmu->root.hpa = INVALID_PAGE;
> mmu->root.pgd = 0;
> + mmu->private_root_hpa = INVALID_PAGE;
> for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>
> @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
> void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_unload(vcpu);
> + if (tdp_mmu_enabled) {
> + write_lock(&vcpu->kvm->mmu_lock);
> + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> + NULL);
> + write_unlock(&vcpu->kvm->mmu_lock);
> + }
> free_mmu_pages(&vcpu->arch.root_mmu);
> free_mmu_pages(&vcpu->arch.guest_mmu);
> mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 002f3f80bf3b..9e2c7c6d85bf 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
> #include <linux/kvm_host.h>
> #include <asm/kvm_host.h>
>
> +#include "mmu.h"
> +
> #ifdef CONFIG_KVM_PROVE_MMU
> #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> #else
> @@ -205,6 +207,15 @@ static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
> free_page((unsigned long)sp->private_spt);
> }
>
> +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t gfn)
> +{
> + if (is_private_sp(root))
> + return kvm_gfn_to_private(kvm, gfn);

IIUC, the purpose of this function is to add back the shared bit to the gfn
for shared memory.
For a private address, the gfn should not contain the shared bit anyway.
It seems weird to clear the shared bit from the gfn for a private address.


> + else
> + return kvm_gfn_to_shared(kvm, gfn);
> +}
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> @@ -363,7 +374,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int r;
>
> if (vcpu->arch.mmu->root_role.direct) {
> - fault.gfn = fault.addr >> PAGE_SHIFT;
> + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index e1e40e3f5eb7..a9c9cd0db20a 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -91,7 +91,7 @@ struct tdp_iter {
> tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> /* A pointer to the current SPTE */
> tdp_ptep_t sptep;
> - /* The lowest GFN mapped by the current SPTE */
> + /* The lowest GFN (shared bits included) mapped by the current SPTE */
> gfn_t gfn;
> /* The level of the root page given to the iterator */
> int root_level;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a90907b31c54..1a0e4baa8311 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -187,6 +187,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu,
> sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> sp->role = role;
>
> + if (kvm_mmu_page_role_is_private(role))
> + kvm_mmu_alloc_private_spt(vcpu, sp);
> +
> return sp;
> }
>
> @@ -209,7 +212,8 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
> trace_kvm_mmu_get_page(sp, true);
> }
>
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
> + bool private)
> {
> union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> struct kvm *kvm = vcpu->kvm;
> @@ -221,6 +225,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> * Check for an existing root before allocating a new one. Note, the
> * role check prevents consuming an invalid root.
> */
> + if (private)
> + kvm_mmu_page_role_set_private(&role);
> for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> if (root->role.word == role.word &&
> kvm_tdp_mmu_get_root(root))
> @@ -244,12 +250,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> out:
> - return __pa(root->spt);
> + return root;
> +}
> +
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
> +{
> + return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
> }
>
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared);
> + u64 old_spte, u64 new_spte,
> + union kvm_mmu_page_role role, bool shared);
>
> static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> @@ -376,12 +387,78 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> REMOVED_SPTE, level);
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> - old_spte, REMOVED_SPTE, level, shared);
> + old_spte, REMOVED_SPTE, sp->role,
> + shared);
> + }
> +
> + if (is_private_sp(sp) &&
> + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
> + kvm_mmu_private_spt(sp)))) {
> + /*
> + * Failed to unlink Secure EPT page and there is nothing to do
> + * further. Intentionally leak the page to prevent the kernel
> + * from accessing the encrypted page.
> + */
> + kvm_mmu_init_private_spt(sp, NULL);
> }
>
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
> +static void *get_private_spt(gfn_t gfn, u64 new_spte, int level)
> +{
> + if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
> + struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
> + void *private_spt = kvm_mmu_private_spt(sp);
> +
> + WARN_ON_ONCE(!private_spt);
> + WARN_ON_ONCE(sp->role.level + 1 != level);
> + WARN_ON_ONCE(sp->gfn != gfn);
> + return private_spt;
> + }
> +
> + return NULL;
> +}
> +
> +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
> + u64 old_spte, u64 new_spte,
> + int level)
> +{
> + bool was_present = is_shadow_present_pte(old_spte);
> + bool is_present = is_shadow_present_pte(new_spte);
> + bool was_leaf = was_present && is_last_spte(old_spte, level);
> + bool is_leaf = is_present && is_last_spte(new_spte, level);
> + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> + int ret;
> +
> + /* Ignore change of software only bits. e.g. host_writable */
> + if (was_leaf == is_leaf && was_present == is_present)
> + return;
> +
> + /*
> + * Allow only leaf page to be zapped. Reclaim Non-leaf page tables at
> + * destroying VM.
> + */

The comment seems to apply only to !was_leaf;
move it just before "if (!was_leaf)"?

> + WARN_ON_ONCE(is_present);

Is this warning needed?
It can be captured by the later "KVM_BUG_ON(new_pfn, kvm)"

> + if (!was_leaf)
> + return;
> +
> + /* non-present -> non-present doesn't make sense. */
> + KVM_BUG_ON(!was_present, kvm);
> + KVM_BUG_ON(new_pfn, kvm);
> +
> + /* Zapping leaf spte is allowed only when write lock is held. */
> + lockdep_assert_held_write(&kvm->mmu_lock);
> + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
> + /* Because the write lock is held, the operation should succeed. */
> + if (KVM_BUG_ON(ret, kvm))
> + return;
> +
> + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
> + KVM_BUG_ON(ret, kvm);
> +}
> +
> /**
> * handle_changed_spte - handle bookkeeping associated with an SPTE change
> * @kvm: kvm instance
> @@ -389,7 +466,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> * @gfn: the base GFN that was mapped by the SPTE
> * @old_spte: The value of the SPTE before the change
> * @new_spte: The value of the SPTE after the change
> - * @level: the level of the PT the SPTE is part of in the paging structure
> + * @role: the role of the PT the SPTE is part of in the paging structure
> * @shared: This operation may not be running under the exclusive use of
> * the MMU lock and the operation must synchronize with other
> * threads that might be modifying SPTEs.
> @@ -399,14 +476,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> * and fast_pf_fix_direct_spte()).
> */
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared)
> + u64 old_spte, u64 new_spte,
> + union kvm_mmu_page_role role, bool shared)
> {
> + bool is_private = kvm_mmu_page_role_is_private(role);
> + int level = role.level;
> bool was_present = is_shadow_present_pte(old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> bool was_leaf = was_present && is_last_spte(old_spte, level);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> + bool pfn_changed = old_pfn != new_pfn;
>
> WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -473,7 +554,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>
> if (was_leaf && is_dirty_spte(old_spte) &&
> (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> + kvm_set_pfn_dirty(old_pfn);
>
> /*
> * Recursively handle child PTs if the change removed a subtree from
> @@ -482,14 +563,82 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> * pages are kernel allocations and should never be migrated.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> + KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
> + kvm);
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> + }
> +
> + /*
> + * Secure-EPT requires to remove Secure-EPT tables after removing
> + * children. hooks after handling lower page table by above
> + * handle_remove_pt().
> + */
> + if (is_private && !is_present)
> + handle_removed_private_spte(kvm, gfn, old_spte, new_spte, role.level);
>
> if (was_leaf && is_accessed_spte(old_spte) &&
> (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> }
>
> +static int __must_check __set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> + gfn_t gfn, u64 old_spte,
> + u64 new_spte, int level)
> +{
> + bool was_present = is_shadow_present_pte(old_spte);
> + bool is_present = is_shadow_present_pte(new_spte);
> + bool is_leaf = is_present && is_last_spte(new_spte, level);
> + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> + int ret = 0;
> +
> + lockdep_assert_held(&kvm->mmu_lock);
> + /* TDP MMU doesn't change present -> present */
> + KVM_BUG_ON(was_present, kvm);
> +
> + /*
> + * Use different call to either set up middle level
                          ^
                          calls

> + * private page table, or leaf.
> + */
> + if (is_leaf)
> + ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);

Braces are missing.

> + else {
> + void *private_spt = get_private_spt(gfn, new_spte, level);
> +
> + KVM_BUG_ON(!private_spt, kvm);
> + ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt);
> + }
> +
> + return ret;
> +}
> +
> +static int __must_check set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> + gfn_t gfn, u64 old_spte,
> + u64 new_spte, int level)
> +{
> + int ret;
> +
> + /*
> + * For private page table, callbacks are needed to propagate SPTE
> + * change into the protected page table. In order to atomically update
> + * both the SPTE and the protected page tables with callbacks, utilize
> + * freezing SPTE.
> + * - Freeze the SPTE. Set entry to REMOVED_SPTE.
> + * - Trigger callbacks for protected page tables.
> + * - Unfreeze the SPTE. Set the entry to new_spte.
> + */
> + lockdep_assert_held(&kvm->mmu_lock);
> + if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
> + return -EBUSY;
> +
> + ret = __set_private_spte_present(kvm, sptep, gfn, old_spte, new_spte, level);
> + if (ret)
> + __kvm_tdp_mmu_write_spte(sptep, old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(sptep, new_spte);
> + return ret;
> +}
> +
> /*
> * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> * and handle the associated bookkeeping. Do not mark the page dirty
> @@ -512,6 +661,7 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> u64 new_spte)
> {
> u64 *sptep = rcu_dereference(iter->sptep);
> + bool frozen = false;
>
> /*
> * The caller is responsible for ensuring the old SPTE is not a REMOVED
> @@ -523,19 +673,45 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - /*
> - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> - * does not hold the mmu_lock. On failure, i.e. if a different logical
> - * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> - * the current value, so the caller operates on fresh data, e.g. if it
> - * retries tdp_mmu_set_spte_atomic()
> - */
> - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> - return -EBUSY;
> + if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> + int ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + if (is_shadow_present_pte(new_spte)) {
> + /*
> + * Populating case. handle_changed_spte() can
> + * process without freezing because it only updates
> + * stats.
> + */
> + ret = set_private_spte_present(kvm, iter->sptep, iter->gfn,
> + iter->old_spte, new_spte, iter->level);
> + if (ret)
> + return ret;
> + } else {
> + /*
> + * Zapping case. handle_changed_spte() calls Secure-EPT
> + * blocking or removal. Freeze the entry.
> + */
> + if (!try_cmpxchg64(sptep, &iter->old_spte, REMOVED_SPTE))
> + return -EBUSY;
> + frozen = true;
> + }
> + } else {
> + /*
> + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
> + * and does not hold the mmu_lock. On failure, i.e. if a
> + * different logical CPU modified the SPTE, try_cmpxchg64()
> + * updates iter->old_spte with the current value, so the caller
> + * operates on fresh data, e.g. if it retries
> + * tdp_mmu_set_spte_atomic()
> + */
> + if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> + return -EBUSY;
> + }
>
> + handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> + new_spte, sptep_to_sp(sptep)->role, true);
> + if (frozen)
> + __kvm_tdp_mmu_write_spte(sptep, new_spte);
> return 0;
> }
>
> @@ -585,6 +761,8 @@ static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> {
> + union kvm_mmu_page_role role;
> +
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> @@ -597,8 +775,17 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
>
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> + if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> + is_shadow_present_pte(new_spte)) {
> + lockdep_assert_held_write(&kvm->mmu_lock);

tdp_mmu_set_spte() has already called lockdep_assert_held_write() above.

> + /* Because the write spin lock is held, there is no race; it should succeed. */
> + KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte,
> + new_spte, level), kvm);
> + }
>
> - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> + role = sptep_to_sp(sptep)->role;
> + role.level = level;
> + handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
> return old_spte;
> }
>
> @@ -621,8 +808,11 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> continue; \
> else
>
> -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
> - for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
> +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \
> + for_each_tdp_pte(_iter, \
> + root_to_sp((_private) ? _mmu->private_root_hpa : \
> + _mmu->root.hpa), \
> + _start, _end)
>
> /*
> * Yield if the MMU lock is contended or this thread needs to return control
> @@ -784,6 +974,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> if (!zap_private && is_private_sp(root))
> return false;
>
> + /*
> + * start and end don't have the GFN shared bit. This function zaps
> + * a region including the alias. Adjust the shared bit of [start, end) if the
> + * root is shared.
> + */
> + start = kvm_gfn_for_root(kvm, root, start);
> + end = kvm_gfn_for_root(kvm, root, end);
> +
> rcu_read_lock();
>
> for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> @@ -960,10 +1158,26 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>
> if (unlikely(!fault->slot))
> new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> - else
> - wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> - fault->pfn, iter->old_spte, fault->prefetch, true,
> - fault->map_writable, &new_spte);
> + else {
> + unsigned long pte_access = ACC_ALL;
> + gfn_t gfn = iter->gfn;
> +
> + if (kvm_gfn_shared_mask(vcpu->kvm)) {
> + if (fault->is_private)
> + gfn |= kvm_gfn_shared_mask(vcpu->kvm);
> + else
> + /*
> + * TDX shared GPAs are not executable;
> + * enforce this for the SDV.
> + */
> + pte_access &= ~ACC_EXEC_MASK;
> + }
> +
> + wrprot = make_spte(vcpu, sp, fault->slot, pte_access, gfn,
> + fault->pfn, iter->old_spte,
> + fault->prefetch, true, fault->map_writable,
> + &new_spte);
> + }
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
> @@ -1041,6 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> + gfn_t raw_gfn;
> + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> int ret = RET_PF_RETRY;
>
> kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1049,7 +1265,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> rcu_read_lock();
>
> - tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> + raw_gfn = gpa_to_gfn(fault->addr);
> +
> + if (is_error_noslot_pfn(fault->pfn) ||
> + !kvm_pfn_to_refcounted_page(fault->pfn)) {
> + if (is_private) {

 Why is this only checked for private faults?

> + rcu_read_unlock();
> + return -EFAULT;
> + }
> + }
> +
> + tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
> int r;
>
> if (fault->nx_huge_page_workaround_enabled)
> @@ -1079,9 +1305,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
>
> - if (is_shadow_present_pte(iter.old_spte))
> + if (is_shadow_present_pte(iter.old_spte)) {
> + /*
> + * TODO: large page support.
> + * Large page isn't supported for TDX yet.
> + */
> + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
> r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
> - else
> + } else
> r = tdp_mmu_link_sp(kvm, &iter, sp, true);
>
> /*
> @@ -1362,6 +1593,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm
>
> sp->role = role;
> sp->spt = (void *)__get_free_page(gfp);
> + /* TODO: large page support for private GPA. */
> + WARN_ON_ONCE(kvm_mmu_page_role_is_private(role));

This seems unnecessary, since __tdp_mmu_alloc_sp_for_split()
is only called from tdp_mmu_alloc_sp_for_split(), which already has a
KVM_BUG_ON() for the large-page case.

> if (!sp->spt) {
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> @@ -1378,6 +1611,10 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> + KVM_BUG_ON(kvm_mmu_page_role_is_private(role) !=
> + is_private_sptep(iter->sptep), kvm);
> + /* TODO: Large page isn't supported for private SPTE yet. */
> + KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm);
>
> /*
> * Since we are allocating while under the MMU lock we have to be
> @@ -1802,7 +2039,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>
> *root_level = vcpu->arch.mmu->root_role.level;
>
> - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
> leaf = iter.level;
> sptes[leaf] = iter.old_spte;
> }
> @@ -1829,7 +2066,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
> gfn_t gfn = addr >> PAGE_SHIFT;
> tdp_ptep_t sptep = NULL;
>
> - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> + /* fast page fault for private GPA isn't supported. */
> + WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
> +
> + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
> *spte = iter.old_spte;
> sptep = iter.sptep;
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index b3cf58a50357..bc9124737142 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -10,7 +10,7 @@
> void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
>
> __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d399009ef1d7..e27c22449d85 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -201,6 +201,7 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
>
> return NULL;
> }
> +EXPORT_SYMBOL_GPL(kvm_pfn_to_refcounted_page);
>
> /*
> * Switches to specified vcpu, until a matching vcpu_put()


2024-04-01 17:34:27

by Sean Christopherson

Subject: Re: [PATCH v19 069/130] KVM: TDX: Require TDP MMU and mmio caching for TDX

On Mon, Feb 26, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> As the TDP MMU is replacing the legacy MMU as the mainstream implementation,
> legacy MMU support for TDX isn't implemented. TDX requires KVM MMIO caching.
> Disable TDX support when the TDP MMU or MMIO caching isn't supported.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 1 +
> arch/x86/kvm/vmx/main.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0e0321ad9ca2..b8d6ce02e66d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -104,6 +104,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> * If the hardware supports that we don't need to do shadow paging.
> */
> bool tdp_enabled = false;
> +EXPORT_SYMBOL_GPL(tdp_enabled);

I haven't looked at the rest of the series, but this should be unnecessary. Just
use enable_ept. Ah, the code is wrong.

> static bool __ro_after_init tdp_mmu_allowed;
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 076a471d9aea..54df6653193e 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -3,6 +3,7 @@
>
> #include "x86_ops.h"
> #include "vmx.h"
> +#include "mmu.h"
> #include "nested.h"
> #include "pmu.h"
> #include "tdx.h"
> @@ -36,6 +37,18 @@ static __init int vt_hardware_setup(void)
> if (ret)
> return ret;
>
> + /* TDX requires KVM TDP MMU. */

This is a useless comment (assuming the code is correctly written), it's quite
obvious from:

if (!tdp_mmu_enabled)
enable_tdx = false;

that the TDP MMU is required. Explaining *why* could be useful, but I'd probably
just omit a comment. In the not too distant future, it will likely be common
knowledge that the shadow MMU doesn't support newfangled features, and it _should_
be very easy for someone to get the info from the changelog.

> + if (enable_tdx && !tdp_enabled) {

tdp_enabled can be true without the TDP MMU, you want tdp_mmu_enabled.

> + enable_tdx = false;
> + pr_warn_ratelimited("TDX requires TDP MMU. Please enable TDP MMU for TDX.\n");

Drop the pr_warn(), TDX will presumably be on by default at some point, I don't
want to get spam every time I test with TDP disabled.

Also, ratelimiting this code is pointless (as is _once()), it should only ever
be called once per module load, and the limiting/once protections are tied to
the module, i.e. effectively get reset when a module is reloaded.

> + }
> +
> + /* TDX requires MMIO caching. */
> + if (enable_tdx && !enable_mmio_caching) {
> + enable_tdx = false;
> + pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");

Same comments here.

> + }
> +
> enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);

All of the above code belongs in tdx_hardware_setup(), especially since you need
tdp_mmu_enabled, not enable_ept.

> if (enable_tdx)
> vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> --
> 2.25.1
>

2024-04-01 19:27:14

by Binbin Wu

Subject: Re: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> For virtual IO, the guest TD shares guest pages with VMM without
> encryption.

Virtual IO is a use case of shared memory, it's better to use it
as an example instead of putting it at the beginning of the sentence.


> Shared EPT is used to map guest pages in unprotected way.
>
> Add the VMCS field encoding for the shared EPTP, which will be used by
> TDX to have separate EPT walks for private GPAs (existing EPTP) versus
> shared GPAs (new shared EPTP).
>
> Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
It may be worth mentioning that the EPTP for private GPAs is set by the TDX module.

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> v19:
> - Add WARN_ON_ONCE() to tdx_load_mmu_pgd() and drop unconditional mask
> ---
> arch/x86/include/asm/vmx.h | 1 +
> arch/x86/kvm/vmx/main.c | 13 ++++++++++++-
> arch/x86/kvm/vmx/tdx.c | 6 ++++++
> arch/x86/kvm/vmx/x86_ops.h | 4 ++++
> 4 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index f703bae0c4ac..9deb663a42e3 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -236,6 +236,7 @@ enum vmcs_field {
> TSC_MULTIPLIER_HIGH = 0x00002033,
> TERTIARY_VM_EXEC_CONTROL = 0x00002034,
> TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
> + SHARED_EPT_POINTER = 0x0000203C,
> PID_POINTER_TABLE = 0x00002042,
> PID_POINTER_TABLE_HIGH = 0x00002043,
> GUEST_PHYSICAL_ADDRESS = 0x00002400,
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d0f75020579f..076a471d9aea 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -123,6 +123,17 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vmx_vcpu_reset(vcpu, init_event);
> }
>
> +static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> + int pgd_level)
> +{
> + if (is_td_vcpu(vcpu)) {
> + tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> + return;
> + }
> +
> + vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> +}
> +
> static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> if (!is_td(kvm))
> @@ -256,7 +267,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .write_tsc_offset = vmx_write_tsc_offset,
> .write_tsc_multiplier = vmx_write_tsc_multiplier,
>
> - .load_mmu_pgd = vmx_load_mmu_pgd,
> + .load_mmu_pgd = vt_load_mmu_pgd,
>
> .check_intercept = vmx_check_intercept,
> .handle_exit_irqoff = vmx_handle_exit_irqoff,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 54e0d4efa2bd..143a3c2a16bc 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -453,6 +453,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> */
> }
>
> +void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> +{
> + WARN_ON_ONCE(root_hpa & ~PAGE_MASK);
> + td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> +}
> +
> static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
> {
> struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index f5820f617b2e..24161fa404aa 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -152,6 +152,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
>
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> +
> +void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> #else
> static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
> static inline void tdx_hardware_unsetup(void) {}
> @@ -173,6 +175,8 @@ static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
>
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
> +
> +static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
> #endif
>
> #endif /* __KVM_X86_VMX_X86_OPS_H */


2024-04-01 22:55:32

by Isaku Yamahata

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Fri, Mar 29, 2024 at 02:22:12PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 3/21/2024 10:17 PM, Isaku Yamahata wrote:
> > On Wed, Mar 20, 2024 at 01:12:01PM +0800,
> > Chao Gao <[email protected]> wrote:
> >
> > > > config KVM_SW_PROTECTED_VM
> > > > bool "Enable support for KVM software-protected VMs"
> > > > - depends on EXPERT
>
> This change is not needed, right?
> Since you intended to use KVM_GENERIC_PRIVATE_MEM, not KVM_SW_PROTECTED_VM.

Right. The fix will be something as follows.

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f2bc78ceaa9a..e912b128bddb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -77,6 +77,7 @@ config KVM_WERROR

config KVM_SW_PROTECTED_VM
bool "Enable support for KVM software-protected VMs"
+ depends on EXPERT
depends on KVM && X86_64
select KVM_GENERIC_PRIVATE_MEM
help
@@ -90,7 +91,7 @@ config KVM_SW_PROTECTED_VM
config KVM_INTEL
tristate "KVM for Intel (and compatible) processors support"
depends on KVM && IA32_FEAT_CTL
- select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
+ select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
help
Provides support for KVM on processors equipped with Intel's VT
--
2.43.2
--
Isaku Yamahata <[email protected]>

2024-04-02 06:04:14

by Isaku Yamahata

Subject: Re: [PATCH v19 069/130] KVM: TDX: Require TDP MMU and mmio caching for TDX

On Mon, Apr 01, 2024 at 10:34:05AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > As the TDP MMU is replacing the legacy MMU as the mainstream implementation,
> > legacy MMU support for TDX isn't implemented. TDX requires KVM MMIO caching.
> > Disable TDX support when the TDP MMU or MMIO caching isn't supported.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 1 +
> > arch/x86/kvm/vmx/main.c | 13 +++++++++++++
> > 2 files changed, 14 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0e0321ad9ca2..b8d6ce02e66d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -104,6 +104,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> > * If the hardware supports that we don't need to do shadow paging.
> > */
> > bool tdp_enabled = false;
> > +EXPORT_SYMBOL_GPL(tdp_enabled);
>
> I haven't looked at the rest of the series, but this should be unnecessary. Just
> use enable_ept. Ah, the code is wrong.
>
> > static bool __ro_after_init tdp_mmu_allowed;
> >
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 076a471d9aea..54df6653193e 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -3,6 +3,7 @@
> >
> > #include "x86_ops.h"
> > #include "vmx.h"
> > +#include "mmu.h"
> > #include "nested.h"
> > #include "pmu.h"
> > #include "tdx.h"
> > @@ -36,6 +37,18 @@ static __init int vt_hardware_setup(void)
> > if (ret)
> > return ret;
> >
> > + /* TDX requires KVM TDP MMU. */
>
> This is a useless comment (assuming the code is correctly written), it's quite
> obvious from:
>
> if (!tdp_mmu_enabled)
> enable_tdx = false;
>
> that the TDP MMU is required. Explaining *why* could be useful, but I'd probably
> just omit a comment. In the not too distant future, it will likely be common
> knowledge that the shadow MMU doesn't support newfangled features, and it _should_
> be very easy for someone to get the info from the changelog.
>
> > + if (enable_tdx && !tdp_enabled) {
>
> tdp_enabled can be true without the TDP MMU, you want tdp_mmu_enabled.
>
> > + enable_tdx = false;
> > + pr_warn_ratelimited("TDX requires TDP MMU. Please enable TDP MMU for TDX.\n");
>
> Drop the pr_warn(), TDX will presumably be on by default at some point, I don't
> want to get spam every time I test with TDP disabled.
>
> Also, ratelimiting this code is pointless (as is _once()), it should only ever
> be called once per module load, and the limiting/once protections are tied to
> the module, i.e. effectively get reset when a module is reloaded.
>
> > + }
> > +
> > + /* TDX requires MMIO caching. */
> > + if (enable_tdx && !enable_mmio_caching) {
> > + enable_tdx = false;
> > + pr_warn_ratelimited("TDX requires mmio caching. Please enable mmio caching for TDX.\n");
>
> Same comments here.
>
> > + }
> > +
> > enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
>
> All of the above code belongs in tdx_hardware_setup(), especially since you need
> tdp_mmu_enabled, not enable_ept.

Thanks for review. With tdp_mmu_enabled, removing warning and comments,
I come up with the followings.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aa66b0510062..ba2738cc6e98 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -110,6 +110,7 @@ static bool __ro_after_init tdp_mmu_allowed;
#ifdef CONFIG_X86_64
bool __read_mostly tdp_mmu_enabled = true;
module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
+EXPORT_SYMBOL_GPL(tdp_mmu_enabled);
#endif

static int max_huge_page_level __read_mostly;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 976cf5e37f0f..e6f66f7c04bb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1253,14 +1253,9 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
int r = 0;
int i;

- if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
- pr_warn("MOVDIR64B is reqiured for TDX\n");
+ if (!tdp_mmu_enabled || !enable_mmio_caching ||
+ !cpu_feature_enabled(X86_FEATURE_MOVDIR64B))
return -EOPNOTSUPP;
- }
- if (!enable_ept) {
- pr_warn("Cannot enable TDX with EPT disabled\n");
- return -EINVAL;
- }

max_pkgs = topology_max_packages();
tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
--
2.43.2

--
Isaku Yamahata <[email protected]>

2024-04-02 06:18:04

by Isaku Yamahata

Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Fri, Mar 29, 2024 at 03:25:47PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 3/29/2024 4:39 AM, Isaku Yamahata wrote:
>
> [...]
> > > > > > How about this?
> > > > > >
> > > > > > /*
> > > > > > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > > > > > * TDH.MNG.KEY.FREEID() to free the HKID.
> > > > > > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > > > > > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > > > > > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > > > > > * present transient state of HKID.
> > > > > > */
> > > > > Could you elaborate why it is still possible to have another thread removing
> > > > > pages from TD?
> > > > >
> > > > > I am probably missing something, but the thing I don't understand is why
> > > > > this function is triggered by MMU release? All the things done in this
> > > > > function don't seem to be related to MMU at all.
> > > > The KVM releases EPT pages on MMU notifier release. kvm_mmu_zap_all() does. If
> > > > we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
> > > > TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
> > > > TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all()
> > > > to use TDH.PHYMEM.PAGE.RECLAIM().
> > > Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
> > > TDH.PHYMEM.PAGE.RECLAIM()?
> > >
> > > And does the difference matter in practice, i.e. did you see using the former
> > > having noticeable performance downgrade?
> > Yes. With the HKID alive, we have to assume that vcpus can still run. That
> > means TLB shootdowns. The difference is 2 extra SEAMCALLs + IPI synchronization
> > for each guest private page. If the guest has hundreds of GB, the difference
> > can be tens of minutes.
> >
> > With HKID alive, we need to assume vcpu is alive.
> > - TDH.MEM.PAGE.REMOVE()
> > - TDH.PHYMEM.PAGE_WBINVD()
> > - TLB shoot down
> > - TDH.MEM.TRACK()
> > - IPI to other vcpus
> > - wait for other vcpu to exit
>
> Do we have a way to batch the TLB shootdown?
> IIUC, in the current implementation, a TLB shootdown needs to be done for each
> page removal, right?

That's right because the TDP MMU allows multiple vcpus to operate on EPT
concurrently. Batching makes the logic more complex. It's straightforward to
use the mmu notifier to know that we start to destroy the guest.
--
Isaku Yamahata <[email protected]>
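[Editor's note: a rough illustration of why the difference can reach tens of minutes. Only the structure — 2 extra SEAMCALLs plus an IPI round per 4KB private page — comes from the thread; the guest size and the 10us per-page figure below are invented for the sketch.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimated extra teardown time, in minutes, for a guest of `gib` GiB when
 * every 4KB private page pays `us_per_page` extra microseconds (hypothetical
 * cost of 2 extra SEAMCALLs + IPI synchronization with the HKID still alive).
 */
static double extra_teardown_minutes(double gib, double us_per_page)
{
	double pages = gib * (double)(1ULL << 30) / 4096.0;

	return pages * us_per_page / 1e6 / 60.0;
}
```

With these made-up numbers, a 512 GiB guest at 10us/page comes out to roughly 22 extra minutes, consistent with the "tens of minutes" estimate in the thread.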

2024-04-02 06:37:56

by Isaku Yamahata

Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Mon, Apr 01, 2024 at 05:12:38PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Allocate a protected page table for the private page table, and add hooks to
> > operate on the protected page table. This patch adds allocation/free of
> > protected page tables and hooks. When calling hooks to update an SPTE entry,
> > freeze the entry, call the hooks, and unfreeze the entry to allow concurrent
> > updates on page tables, which is the advantage of the TDP MMU. As
> > kvm_gfn_shared_mask() always returns false, those hooks aren't called yet
> > with this patch.
> >
> > When the faulting GPA is private, the KVM fault is called private. When
> > resolving private KVM fault, allocate protected page table and call hooks
> > to operate on protected page table. On the change of the private PTE entry,
> > invoke kvm_x86_ops hook in __handle_changed_spte() to propagate the change
> > to protected page table. The following depicts the relationship.
> >
> > private KVM page fault |
> > | |
> > V |
> > private GPA | CPU protected EPTP
> > | | |
> > V | V
> > private PT root | protected PT root
> > | | |
> > V | V
> > private PT --hook to propagate-->protected PT
> > | | |
> > \--------------------+------\ |
> > | | |
> > | V V
> > | private guest page
> > |
> > |
> > non-encrypted memory | encrypted memory
> > |
> > PT: page table
> >
> > The existing KVM TDP MMU code uses atomic update of SPTE. On populating
> > the EPT entry, atomically set the entry. However, it requires TLB
> > shootdown to zap SPTE. To address it, the entry is frozen with the special
> > SPTE value that clears the present bit. After the TLB shootdown, the entry
> > is set to the eventual value (unfreeze).
> >
> > For protected page table, hooks are called to update protected page table
> > in addition to direct access to the private SPTE. For the zapping case, it
> > works to freeze the SPTE. It can call hooks in addition to TLB shootdown.
> > For populating the private SPTE entry, there can be a race condition
> > without further protection:
> >
> > vcpu 1: populating 2M private SPTE
> > vcpu 2: populating 4K private SPTE
> > vcpu 2: TDX SEAMCALL to update 4K protected SPTE => error
> > vcpu 1: TDX SEAMCALL to update 2M protected SPTE
> >
> > To avoid the race, the frozen SPTE is utilized. Instead of an atomic update
> > of the private entry, freeze the entry, call the hook that updates the
> > protected SPTE, then set the entry to the final value.
> >
> > Support 4K page only at this stage. 2M page support can be done in future
> > patches.
> >
> > Co-developed-by: Kai Huang <[email protected]>
> > Signed-off-by: Kai Huang <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> >
> > ---
> > v19:
> > - drop CONFIG_KVM_MMU_PRIVATE
> >
> > v18:
> > - Rename freezed => frozen
> >
> > v14 -> v15:
> > - Refined is_private condition check in kvm_tdp_mmu_map().
> > Add kvm_gfn_shared_mask() check.
> > - catch up for struct kvm_range change
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/asm/kvm-x86-ops.h | 5 +
> > arch/x86/include/asm/kvm_host.h | 11 ++
> > arch/x86/kvm/mmu/mmu.c | 17 +-
> > arch/x86/kvm/mmu/mmu_internal.h | 13 +-
> > arch/x86/kvm/mmu/tdp_iter.h | 2 +-
> > arch/x86/kvm/mmu/tdp_mmu.c | 308 +++++++++++++++++++++++++----
> > arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> > virt/kvm/kvm_main.c | 1 +
> > 8 files changed, 320 insertions(+), 39 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index a8e96804a252..e1c75f8c1b25 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -101,6 +101,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> > KVM_X86_OP(has_wbinvd_exit)
> > KVM_X86_OP(get_l2_tsc_offset)
> > KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index efd3fda1c177..bc0767c884f7 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -468,6 +468,7 @@ struct kvm_mmu {
> > int (*sync_spte)(struct kvm_vcpu *vcpu,
> > struct kvm_mmu_page *sp, int i);
> > struct kvm_mmu_root_info root;
> > + hpa_t private_root_hpa;
> > union kvm_cpu_role cpu_role;
> > union kvm_mmu_page_role root_role;
> > @@ -1740,6 +1741,16 @@ struct kvm_x86_ops {
> > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > int root_level);
> > + int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > + void *private_spt);
> > + int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > + void *private_spt);
> > + int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > + kvm_pfn_t pfn);
> > + int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > + kvm_pfn_t pfn);
> > + int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> > +
> > bool (*has_wbinvd_exit)(void);
> > u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 30c86e858ae4..0e0321ad9ca2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3717,7 +3717,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > goto out_unlock;
> > if (tdp_mmu_enabled) {
> > - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > + if (kvm_gfn_shared_mask(vcpu->kvm) &&
> > + !VALID_PAGE(mmu->private_root_hpa)) {
> > + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> > + mmu->private_root_hpa = root;
> > + }
> > + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> > mmu->root.hpa = root;
> > } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > @@ -4627,7 +4632,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> > for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> > int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > - gfn_t base = gfn_round_for_level(fault->gfn,
> > + gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> > fault->max_level);
> > if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > @@ -4662,6 +4667,7 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > };
> > WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
> > + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > r = mmu_topup_memory_caches(vcpu, false);
> > @@ -6166,6 +6172,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > mmu->root.hpa = INVALID_PAGE;
> > mmu->root.pgd = 0;
> > + mmu->private_root_hpa = INVALID_PAGE;
> > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> > @@ -7211,6 +7218,12 @@ int kvm_mmu_vendor_module_init(void)
> > void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_unload(vcpu);
> > + if (tdp_mmu_enabled) {
> > + write_lock(&vcpu->kvm->mmu_lock);
> > + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > + NULL);
> > + write_unlock(&vcpu->kvm->mmu_lock);
> > + }
> > free_mmu_pages(&vcpu->arch.root_mmu);
> > free_mmu_pages(&vcpu->arch.guest_mmu);
> > mmu_free_memory_caches(vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 002f3f80bf3b..9e2c7c6d85bf 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -6,6 +6,8 @@
> > #include <linux/kvm_host.h>
> > #include <asm/kvm_host.h>
> > +#include "mmu.h"
> > +
> > #ifdef CONFIG_KVM_PROVE_MMU
> > #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> > #else
> > @@ -205,6 +207,15 @@ static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
> > free_page((unsigned long)sp->private_spt);
> > }
> > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > + gfn_t gfn)
> > +{
> > + if (is_private_sp(root))
> > + return kvm_gfn_to_private(kvm, gfn);
>
> IIUC, the purpose of this function is to add back the shared bit to the gfn
> for shared memory.
> For a private address, the gfn should not contain the shared bit anyway.
> It seems weird to clear the shared bit from the gfn for a private address.

The current caller happens to do so. With such an assumption, we can code it
as something like:

if (is_private_sp(root)) {
WARN_ON_ONCE(gfn & kvm_gfn_shared_mask(kvm));
return gfn;
}
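[Editor's note: the shared-bit helpers being discussed reduce to plain bit operations. This sketch makes assumptions — the names are illustrative, and the bit position uses S = 51 - 12, one of the two possibilities (S = (47 or 51) - 12) given in the cover letter.]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Example: GPA shared bit 51 maps to GFN bit 51 - 12 = 39. */
static const gfn_t gfn_shared_mask = 1ULL << (51 - 12);

static gfn_t gfn_to_shared(gfn_t gfn)  { return gfn | gfn_shared_mask;  }
static gfn_t gfn_to_private(gfn_t gfn) { return gfn & ~gfn_shared_mask; }

/*
 * Sketch of kvm_gfn_for_root(): a private root strips the shared bit,
 * a shared root keeps (or adds) it.
 */
static gfn_t gfn_for_root(int is_private_root, gfn_t gfn)
{
	return is_private_root ? gfn_to_private(gfn) : gfn_to_shared(gfn);
}
```

Under the assumption discussed above, a private-root caller would only ever pass a gfn with the shared bit already clear, which is what the suggested WARN_ON_ONCE() asserts.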


.. snip ...

> > @@ -376,12 +387,78 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> > REMOVED_SPTE, level);
> > }
> > handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> > - old_spte, REMOVED_SPTE, level, shared);
> > + old_spte, REMOVED_SPTE, sp->role,
> > + shared);
> > + }
> > +
> > + if (is_private_sp(sp) &&
> > + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
> > + kvm_mmu_private_spt(sp)))) {
> > + /*
> > + * Failed to unlink Secure EPT page and there is nothing to do
> > + * further. Intentionally leak the page to prevent the kernel
> > + * from accessing the encrypted page.
> > + */
> > + kvm_mmu_init_private_spt(sp, NULL);
> > }
> > call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > }
> > +static void *get_private_spt(gfn_t gfn, u64 new_spte, int level)
> > +{
> > + if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
> > + struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
> > + void *private_spt = kvm_mmu_private_spt(sp);
> > +
> > + WARN_ON_ONCE(!private_spt);
> > + WARN_ON_ONCE(sp->role.level + 1 != level);
> > + WARN_ON_ONCE(sp->gfn != gfn);
> > + return private_spt;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
> > + u64 old_spte, u64 new_spte,
> > + int level)
> > +{
> > + bool was_present = is_shadow_present_pte(old_spte);
> > + bool is_present = is_shadow_present_pte(new_spte);
> > + bool was_leaf = was_present && is_last_spte(old_spte, level);
> > + bool is_leaf = is_present && is_last_spte(new_spte, level);
> > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> > + int ret;
> > +
> > + /* Ignore change of software only bits. e.g. host_writable */
> > + if (was_leaf == is_leaf && was_present == is_present)
> > + return;
> > +
> > + /*
> > + * Allow only leaf page to be zapped. Reclaim Non-leaf page tables at
> > + * destroying VM.
> > + */
>
> The comment seems just for !was_leaf,
> move the comment just before "if (!was_leaf)" ?

Makes sense.


> > + WARN_ON_ONCE(is_present);
>
> Is this warning needed?
> It can be captured by the later "KVM_BUG_ON(new_pfn, kvm)"

Yes, let's remove this warn_on.


.. snip ...

> > @@ -597,8 +775,17 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
> > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > + if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> > + is_shadow_present_pte(new_spte)) {
> > + lockdep_assert_held_write(&kvm->mmu_lock);
>
> tdp_mmu_set_spte() has already called lockdep_assert_held_write() above.

Ok.


> > @@ -1041,6 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > struct kvm *kvm = vcpu->kvm;
> > struct tdp_iter iter;
> > struct kvm_mmu_page *sp;
> > + gfn_t raw_gfn;
> > + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> > int ret = RET_PF_RETRY;
> > kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -1049,7 +1265,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > rcu_read_lock();
> > - tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> > + raw_gfn = gpa_to_gfn(fault->addr);
> > +
> > + if (is_error_noslot_pfn(fault->pfn) ||
> > + !kvm_pfn_to_refcounted_page(fault->pfn)) {
> > + if (is_private) {
>
>  Why is this only checked for private faults?

Because (the current implementation of) the TDX vendor backend takes a page
reference count. In the future, this check should be removed to allow page
migration.



> > + rcu_read_unlock();
> > + return -EFAULT;
> > + }
> > + }
> > +
> > + tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
> > int r;
> > if (fault->nx_huge_page_workaround_enabled)
> > @@ -1079,9 +1305,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> > - if (is_shadow_present_pte(iter.old_spte))
> > + if (is_shadow_present_pte(iter.old_spte)) {
> > + /*
> > + * TODO: large page support.
> > + * Doesn't support large page for TDX now
> > + */
> > + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
> > r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
> > - else
> > + } else
> > r = tdp_mmu_link_sp(kvm, &iter, sp, true);
> > /*
> > @@ -1362,6 +1593,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm
> > sp->role = role;
> > sp->spt = (void *)__get_free_page(gfp);
> > + /* TODO: large page support for private GPA. */
> > + WARN_ON_ONCE(kvm_mmu_page_role_is_private(role));
>
> Seems not needed, since __tdp_mmu_alloc_sp_for_split()
> is only called in  tdp_mmu_alloc_sp_for_split() and it has
> KVM_BUG_ON() for large page case.

Ah, yes. I'll remove one of the two.
--
Isaku Yamahata <[email protected]>

2024-04-02 06:53:08

by Isaku Yamahata

Subject: Re: [PATCH v19 093/130] KVM: TDX: Implements vcpu request_immediate_exit

On Fri, Mar 29, 2024 at 09:54:04AM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:35AM -0800, [email protected] wrote:
> >From: Isaku Yamahata <[email protected]>
> >
>Now that we are able to inject interrupts into a TDX vcpu, it is ready to be
>blocked. Wire up the kvm x86 methods for blocking/unblocking a vcpu for TDX.
>To unblock on pending events, the request-immediate-exit method is also needed.
>
> TDX doesn't support this immediate exit. It is considered a potential
> attack on TDs. The TDX module deploys 0/1-step mitigations to prevent this.
> Even if KVM issues a self-IPI before TD-entry, TD-exit will happen only after
> the guest runs a random number of instructions.
>
> KVM shouldn't request immediate exits in the first place. Just emit a
> warning if KVM tries to do this.

0ec3d6d1f169
("KVM: x86: Fully defer to vendor code to decide how to force immediate exit")
removed the hook. This patch will be dropped and tdx_vcpu_run() will ignore
force_immediate_exit.
--
Isaku Yamahata <[email protected]>

2024-04-02 07:09:39

by Isaku Yamahata

Subject: Re: [PATCH v19 094/130] KVM: TDX: Implement methods to inject NMI

On Fri, Mar 29, 2024 at 10:11:05AM +0800,
Chao Gao <[email protected]> wrote:

> >+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
> >+{
> >+ if (is_td_vcpu(vcpu))
> >+ return;
> >+
> >+ vmx_set_nmi_mask(vcpu, masked);
> >+}
> >+
> >+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
> >+{
> >+ /* Refer the comment in vt_get_nmi_mask(). */
> >+ if (is_td_vcpu(vcpu))
> >+ return;
> >+
> >+ vmx_enable_nmi_window(vcpu);
> >+}
>
> The two actually request something to be done for the TD. But we make them
> nops because the TDX module doesn't allow the VMM to configure the NMI mask
> and NMI window. Do you think they are worth a WARN_ON_ONCE()? Or does adding
> WARN_ON_ONCE() require a lot of refactoring in KVM's NMI injection logic?

Because user space can reach those hooks with KVM_SET_VCPU_EVENTS, we shouldn't
add WARN_ON_ONCE(). There are two choices: ignore the request (the current
choice) or return an error for the unsupported request.

It's troublesome to return an error because we would have to fix up every
caller up to user space. Unfixed user space may abort on such an error.
--
Isaku Yamahata <[email protected]>

2024-04-02 08:53:18

by Chao Gao

Subject: Re: [PATCH v19 105/130] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL

>+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
>+{
>+ unsigned long nr, a0, a1, a2, a3, ret;
>+

do you need to emulate xen/hyper-v hypercalls here?

Nothing tells userspace that xen/hyper-v hypercalls are not supported, so
userspace may expose the related CPUID leaves to TD guests.

>+ /*
>+ * ABI for KVM tdvmcall argument:
>+ * In Guest-Hypervisor Communication Interface(GHCI) specification,
>+ * Non-zero leaf number (R10 != 0) is defined to indicate
>+ * vendor-specific. KVM uses this for KVM hypercall. NOTE: KVM
>+ * hypercall number starts from one. Zero isn't used for KVM hypercall
>+ * number.
>+ *
>+ * R10: KVM hypercall number
>+ * arguments: R11, R12, R13, R14.
>+ */
>+ nr = kvm_r10_read(vcpu);
>+ a0 = kvm_r11_read(vcpu);
>+ a1 = kvm_r12_read(vcpu);
>+ a2 = kvm_r13_read(vcpu);
>+ a3 = kvm_r14_read(vcpu);
>+
>+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true, 0);
>+
>+ tdvmcall_set_return_code(vcpu, ret);
>+
>+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
>+ return 0;

Can you add a comment to call out that KVM_HC_MAP_GPA_RANGE is redirected to
userspace?

>+ return 1;
>+}

2024-04-02 11:37:28

by Binbin Wu

Subject: Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB shootdown,
> propagating private EPT entry changes to the Secure EPT, and freeing Secure
> EPT pages. TLB flush handles both shared EPT and private EPT. It flushes
> the shared EPT the same way as VMX. It also waits for the TDX TLB shootdown.
> The hook to free a Secure EPT page unlinks the page from the Secure
> EPT so that the page can be freed to the OS.
>
> Propagate the entry change to the Secure EPT. The possible entry changes are
> present -> non-present (zapping) and non-present -> present (population). On
> population, just link the Secure EPT page or the private guest page into the
> Secure EPT by TDX SEAMCALL. Because the TDP MMU allows concurrent
> zapping/population, zapping requires a synchronous TLB shootdown with the
> frozen EPT entry.

But for private memory, zapping holds write lock, right?


> It zaps the secure entry, increments TLB counter, sends
> IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> guest page from the Secure EPT. For simplicity, batched zapping with
> exclude lock is handled as concurrent zapping.

exclude lock -> exclusive lock

How should this sentence be understood?
Since it's holding the exclusive lock, how can it be handled as concurrent
zapping?
Or do you mean the current implementation prevents concurrent zapping?


> Although it's inefficient,
> it can be optimized in the future.
>
> For MMIO SPTE, the spte value changes as follows.
> initial value (suppress VE bit is set)
> -> Guest issues MMIO and triggers EPT violation
> -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> -> Guest MMIO resumes. It triggers VE exception in guest TD
> -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> -> KVM handles MMIO
> -> Guest VE handler resumes its execution after MMIO instruction
>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v19:
> - Compile fix when CONFIG_HYPERV != y.
> It's due to the following patch. Catch it up.
> https://lore.kernel.org/all/[email protected]/
> - Add comments on tlb shootdown to explain the sequence.
> - Use gmem_max_level callback, delete tdp_max_page_level.
>
> v18:
> - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> - checkpatch: space => tab
>
> v15 -> v16:
> - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
>
> v14 -> v15:
> - Implemented tdx_flush_tlb_current()
> - Removed unnecessary invept in tdx_flush_tlb(). It was carry over
> from the very old code base.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/spte.c | 3 +-
> arch/x86/kvm/vmx/main.c | 91 ++++++++-
> arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 2 +-
> arch/x86/kvm/vmx/tdx_ops.h | 6 +
> arch/x86/kvm/vmx/x86_ops.h | 13 ++
> 6 files changed, 481 insertions(+), 6 deletions(-)
>
[...]
> +
> +static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + union tdx_sept_level_state level_state;
> + hpa_t hpa = pfn_to_hpa(pfn);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + struct tdx_module_args out;
> + union tdx_sept_entry entry;
> + u64 err;
> +
> + err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
> + if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> + tdx_unpin(kvm, pfn);
> + return -EAGAIN;
> + }
> + if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
> + entry.raw = out.rcx;
> + level_state.raw = out.rdx;
> + if (level_state.level == tdx_level &&
> + level_state.state == TDX_SEPT_PENDING &&
> + entry.leaf && entry.pfn == pfn && entry.sve) {
> + tdx_unpin(kvm, pfn);
> + WARN_ON_ONCE(!(to_kvm_tdx(kvm)->attributes &
> + TDX_TD_ATTR_SEPT_VE_DISABLE));

to_kvm_tdx(kvm) -> kvm_tdx

Since the implementation requires that attributes.TDX_TD_ATTR_SEPT_VE_DISABLE
be set, should it check the value passed from userspace?
And the reason should be described somewhere in the changelog or/and a comment.



> + return -EAGAIN;
> + }
> + }
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
> + tdx_unpin(kvm, pfn);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + /* TODO: handle large pages. */
> + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> + return -EINVAL;
> +
> + /*
> + * Because restricted mem

The term "restricted mem" is not used anymore, right? Should update the
comment.

> doesn't support page migration with
> + * a_ops->migrate_page (yet), no callback isn't triggered for KVM on

no callback isn't -> no callback is

> + * page migration. Until restricted mem supports page migration,

"restricted mem" -> guest_mem


> + * prevent page migration.
> + * TODO: Once restricted mem introduces callback on page migration,

ditto

> + * implement it and remove get_page/put_page().
> + */
> + get_page(pfn_to_page(pfn));
> +
> + if (likely(is_td_finalized(kvm_tdx)))
> + return tdx_mem_page_aug(kvm, gfn, level, pfn);
> +
> + /* TODO: tdh_mem_page_add() comes here for the initial memory. */
> +
> + return 0;
> +}
> +
> +static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + struct tdx_module_args out;
> + gpa_t gpa = gfn_to_gpa(gfn);
> + hpa_t hpa = pfn_to_hpa(pfn);
> + hpa_t hpa_with_hkid;
> + u64 err;
> +
> + /* TODO: handle large pages. */
> + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> + return -EINVAL;
> +
> + if (unlikely(!is_hkid_assigned(kvm_tdx))) {
> + /*
> + * The HKID assigned to this TD was already freed and cache
> + * was already flushed. We don't have to flush again.
> + */
> + err = tdx_reclaim_page(hpa);
> + if (KVM_BUG_ON(err, kvm))
> + return -EIO;
> + tdx_unpin(kvm, pfn);
> + return 0;
> + }
> +
> + do {
> + /*
> + * When zapping private page, write lock is held. So no race
> + * condition with other vcpu sept operation. Race only with
> + * TDH.VP.ENTER.
> + */
> + err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> + } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
> + return -EIO;
> + }
> +
> + hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> + do {
> + /*
> + * TDX_OPERAND_BUSY can happen on locking PAMT entry. Because
> + * this page was removed above, other thread shouldn't be
> + * repeatedly operating on this page. Just retry loop.
> + */
> + err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> + } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> + return -EIO;
> + }
> + tdx_clear_page(hpa);
> + tdx_unpin(kvm, pfn);
> + return 0;
> +}
> +
> +static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, void *private_spt)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + hpa_t hpa = __pa(private_spt);
> + struct tdx_module_args out;
> + u64 err;
> +
> + err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);

kvm_tdx is only used here, can drop the local var.

> + if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> + return -EAGAIN;
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
>

2024-04-02 15:12:18

by Binbin Wu

Subject: Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB shootdown,
> propagating private EPT entry changes to the Secure EPT, and freeing Secure
> EPT pages. TLB flush handles both shared EPT and private EPT. It flushes
> the shared EPT the same way as VMX. It also waits for the TDX TLB shootdown.
> The hook to free a Secure EPT page unlinks the page from the Secure
> EPT so that the page can be freed to the OS.
>
> Propagate the entry change to the Secure EPT. The possible entry changes are
> present -> non-present (zapping) and non-present -> present (population). On
> population, just link the Secure EPT page or the private guest page into the
> Secure EPT by TDX SEAMCALL. Because the TDP MMU allows concurrent
> zapping/population, zapping requires a synchronous TLB shootdown with the
> frozen EPT entry. It zaps the secure entry, increments the TLB counter, sends
> IPIs to remote vcpus to trigger TLB flushes, and then unlinks the private
> guest page from the Secure EPT. For simplicity, batched zapping with the
> exclusive lock is handled as concurrent zapping. Although it's inefficient,
> it can be optimized in the future.
>
> For MMIO SPTE, the spte value changes as follows.
> initial value (suppress VE bit is set)
> -> Guest issues MMIO and triggers EPT violation
> -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> -> Guest MMIO resumes. It triggers VE exception in guest TD
> -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> -> KVM handles MMIO
> -> Guest VE handler resumes its execution after MMIO instruction
>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v19:
> - Compile fix when CONFIG_HYPERV != y.
> It's due to the following patch. Catch it up.
> https://lore.kernel.org/all/[email protected]/
> - Add comments on tlb shootdown to explain the sequence.
> - Use gmem_max_level callback, delete tdp_max_page_level.
>
> v18:
> - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> - checkpatch: space => tab
>
> v15 -> v16:
> - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
>
> v14 -> v15:
> - Implemented tdx_flush_tlb_current()
> - Removed unnecessary invept in tdx_flush_tlb(). It was carry over
> from the very old code base.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/spte.c | 3 +-
> arch/x86/kvm/vmx/main.c | 91 ++++++++-
> arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 2 +-
> arch/x86/kvm/vmx/tdx_ops.h | 6 +
> arch/x86/kvm/vmx/x86_ops.h | 13 ++
> 6 files changed, 481 insertions(+), 6 deletions(-)
>
[...]

> +static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> + struct tdx_module_args out;
> + u64 err;
> +
> + /* This can be called when destructing guest TD after freeing HKID. */
> + if (unlikely(!is_hkid_assigned(kvm_tdx)))
> + return 0;
> +
> + /* For now large page isn't supported yet. */
> + WARN_ON_ONCE(level != PG_LEVEL_4K);
> + err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> + if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> + return -EAGAIN;
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> + return -EIO;
> + }
> + return 0;
> +}
> +
> +/*
> + * TLB shoot down procedure:
> + * There is a global epoch counter and each vcpu has local epoch counter.
> > + * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
> > + * This blocks the subsequent creation of TLB translations on that range.
> > + * This corresponds to clearing the present bit (all RWX) in the EPT entry
> + * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
> + * - IPI to remote vcpus
> + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> + * - On re-entry, TDX module compares the local epoch counter with the global
> + * epoch counter. If the local epoch counter is older than the global epoch
> + * counter, update the local epoch counter and flushes TLB.
> + */
> +static void tdx_track(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + u64 err;
> +
> + KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> + /* If TD isn't finalized, it's before any vcpu running. */
> + if (unlikely(!is_td_finalized(kvm_tdx)))
> + return;
> +
> + /*
> + * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
> + * the counter. The counter is used instead of bool because multiple
> + * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.

Which case can issue TDH_MEM_TRACK() concurrently from multiple vcpus?
For now, zapping holds the write lock.
Promotion/demotion may issue TDH_MEM_TRACK() concurrently, but
they are not supported yet.


> + *
> + * optimization: The TLB shoot down procedure described in The TDX
> + * specification is, TDH.MEM.TRACK(), send IPI to remote vcpus, confirm
> + * all remote vcpus exit to VMM, and execute vcpu, both local and
> + * remote. Twist the sequence to reduce IPI overhead as follows.
> + *
> + * local remote
> + * ----- ------
> + * increment tdh_mem_track
> + *
> + * request KVM_REQ_TLB_FLUSH
> + * send IPI
> + *
> + * TDEXIT to KVM due to IPI
> + *
> + * IPI handler calls tdx_flush_tlb()
> + * to process KVM_REQ_TLB_FLUSH.
> + * spin wait for tdh_mem_track == 0
> + *
> + * TDH.MEM.TRACK()
> + *
> + * decrement tdh_mem_track
> + *
> + * complete KVM_REQ_TLB_FLUSH
> + *
> + * TDH.VP.ENTER to flush tlbs TDH.VP.ENTER to flush tlbs
> + */
> + atomic_inc(&kvm_tdx->tdh_mem_track);
> + /*
> + * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> + * KVM_REQUEST_WAIT.
> + */
> + kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> +
> + do {
> + err = tdh_mem_track(kvm_tdx->tdr_pa);
> + } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> +
> + /* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
> + atomic_dec(&kvm_tdx->tdh_mem_track);
> +
> + if (KVM_BUG_ON(err, kvm))
> + pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> +
> +}
> +
> +static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, void *private_spt)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + /*
> + * The HKID assigned to this TD was already freed and cache was
> + * already flushed. We don't have to flush again.
> + */
> + if (!is_hkid_assigned(kvm_tdx))
> + return tdx_reclaim_page(__pa(private_spt));
> +
> + /*
> + * free_private_spt() is (obviously) called when a shadow page is being
> + * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
> + * Note: This function is for private shadow pages, not for private
> + * guest pages. A private guest page can be zapped while the TD is active:
> + * shared <-> private conversion and slot move/deletion.
> + */
> + KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);

At this point, is_hkid_assigned(kvm_tdx) is always true.


2024-04-03 02:51:05

by Chao Gao

Subject: Re: [PATCH v19 106/130] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL

On Mon, Feb 26, 2024 at 12:26:48AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Some TDG.VP.VMCALL leaves require the device model, for example qemu, to
>handle them on behalf of the kvm kernel module. TDVMCALL_REPORT_FATAL_ERROR,
>TDVMCALL_MAP_GPA, TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and
>TDVMCALL_GET_QUOTE require user space VMM handling.
>
>Introduce a new kvm exit, KVM_EXIT_TDX, and functions to set it up. The device
>model should update R10 with the return value if necessary.
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
>v14 -> v15:
>- updated struct kvm_tdx_exit with union
>- export constants for reg bitmask
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
> arch/x86/kvm/vmx/tdx.c | 83 ++++++++++++++++++++++++++++++++++++-
> include/uapi/linux/kvm.h | 89 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 170 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index c8eb47591105..72dbe2ff9062 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -1038,6 +1038,78 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> return 1;
> }
>
>+static int tdx_complete_vp_vmcall(struct kvm_vcpu *vcpu)
>+{
>+ struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
>+ __u64 reg_mask = kvm_rcx_read(vcpu);
>+
>+#define COPY_REG(MASK, REG) \
>+ do { \
>+ if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) \
>+ kvm_## REG ## _write(vcpu, tdx_vmcall->out_ ## REG); \
>+ } while (0)

Why XMMs are not copied?

It looks like you assume the guest won't use XMMs for TDVMCALL. But I think the
ABI (KVM_EXIT_TDX) should be general, i.e., it should be able to support all
kinds of (future) TDVMCALLs.

>+
>+
>+ COPY_REG(R10, r10);
>+ COPY_REG(R11, r11);
>+ COPY_REG(R12, r12);
>+ COPY_REG(R13, r13);
>+ COPY_REG(R14, r14);
>+ COPY_REG(R15, r15);
>+ COPY_REG(RBX, rbx);
>+ COPY_REG(RDI, rdi);
>+ COPY_REG(RSI, rsi);
>+ COPY_REG(R8, r8);
>+ COPY_REG(R9, r9);
>+ COPY_REG(RDX, rdx);
>+
>+#undef COPY_REG
>+
>+ return 1;
>+}

2024-04-03 03:25:49

by Chao Gao

Subject: Re: [PATCH v19 108/130] KVM: TDX: Handle TDX PV HLT hypercall

On Mon, Feb 26, 2024 at 12:26:50AM -0800, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Wire up TDX PV HLT hypercall to the KVM backend function.
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
>v19:
>- move tdvps_state_non_arch_check() to this patch
>
>v18:
>- drop buggy_hlt_workaround and use TDH.VP.RD(TD_VCPU_STATE_DETAILS)
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
> arch/x86/kvm/vmx/tdx.c | 26 +++++++++++++++++++++++++-
> arch/x86/kvm/vmx/tdx.h | 4 ++++
> 2 files changed, 29 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index eb68d6c148b6..a2caf2ae838c 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -688,7 +688,18 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>
> bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> {
>- return pi_has_pending_interrupt(vcpu);
>+ bool ret = pi_has_pending_interrupt(vcpu);

Maybe
bool has_pending_interrupt = pi_has_pending_interrupt(vcpu);

"ret" isn't a good name. or even call pi_has_pending_interrupt() directly in
the if statement below.

>+ union tdx_vcpu_state_details details;
>+ struct vcpu_tdx *tdx = to_tdx(vcpu);
>+
>+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
>+ return true;

Question: why does mp_state matter here?

>+
>+ if (tdx->interrupt_disabled_hlt)
>+ return false;

Shouldn't we move this into vt_interrupt_allowed()? VMX calls that function to
check whether interrupts are disabled. KVM can clear tdx->interrupt_disabled_hlt
on every TD-enter and set it only on TD-exit due to the guest making a
TDVMCALL(hlt) w/ interrupts disabled.

>+
>+ details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
>+ return !!details.vmxip;
> }
>
> void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>@@ -1130,6 +1141,17 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
> return 1;
> }
>
>+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
>+{
>+ struct vcpu_tdx *tdx = to_tdx(vcpu);
>+
>+ /* See tdx_protected_apic_has_interrupt() to avoid heavy seamcall */
>+ tdx->interrupt_disabled_hlt = tdvmcall_a0_read(vcpu);
>+
>+ tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
>+ return kvm_emulate_halt_noskip(vcpu);
>+}
>+
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
> if (tdvmcall_exit_type(vcpu))
>@@ -1138,6 +1160,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> switch (tdvmcall_leaf(vcpu)) {
> case EXIT_REASON_CPUID:
> return tdx_emulate_cpuid(vcpu);
>+ case EXIT_REASON_HLT:
>+ return tdx_emulate_hlt(vcpu);
> default:
> break;
> }
>diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>index 4399d474764f..11c74c34555f 100644
>--- a/arch/x86/kvm/vmx/tdx.h
>+++ b/arch/x86/kvm/vmx/tdx.h
>@@ -104,6 +104,8 @@ struct vcpu_tdx {
> bool host_state_need_restore;
> u64 msr_host_kernel_gs_base;
>
>+ bool interrupt_disabled_hlt;
>+
> /*
> * Dummy to make pmu_intel not corrupt memory.
> * TODO: Support PMU for TDX. Future work.
>@@ -166,6 +168,7 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
> }
>
> static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
>+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
>
> #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
> static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
>@@ -226,6 +229,7 @@ TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
> TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
>
> TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
>+TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
>
> static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
> {
>--
>2.25.1
>
>

2024-04-03 06:52:39

by Chao Gao

Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX

On Mon, Feb 26, 2024 at 12:26:53AM -0800, [email protected] wrote:
>+bool tdx_has_emulated_msr(u32 index, bool write)
>+{
>+ switch (index) {
>+ case MSR_IA32_UCODE_REV:
>+ case MSR_IA32_ARCH_CAPABILITIES:
>+ case MSR_IA32_POWER_CTL:
>+ case MSR_IA32_CR_PAT:
>+ case MSR_IA32_TSC_DEADLINE:
>+ case MSR_IA32_MISC_ENABLE:
>+ case MSR_PLATFORM_INFO:
>+ case MSR_MISC_FEATURES_ENABLES:
>+ case MSR_IA32_MCG_CAP:
>+ case MSR_IA32_MCG_STATUS:
>+ case MSR_IA32_MCG_CTL:
>+ case MSR_IA32_MCG_EXT_CTL:
>+ case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
>+ case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
>+ /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
>+ return true;
>+ case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
>+ /*
>+ * x2APIC registers that are virtualized by the CPU can't be
>+ * emulated, KVM doesn't have access to the virtual APIC page.
>+ */
>+ switch (index) {
>+ case X2APIC_MSR(APIC_TASKPRI):
>+ case X2APIC_MSR(APIC_PROCPRI):
>+ case X2APIC_MSR(APIC_EOI):
>+ case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
>+ case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
>+ case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
>+ return false;
>+ default:
>+ return true;
>+ }
>+ case MSR_IA32_APICBASE:
>+ case MSR_EFER:
>+ return !write;
>+ case 0x4b564d00 ... 0x4b564dff:
>+ /* KVM custom MSRs */
>+ return tdx_is_emulated_kvm_msr(index, write);
>+ default:
>+ return false;
>+ }

The only call site with a non-NULL KVM parameter is:

r = static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE);

Only MSR_IA32_SMBASE needs to be handled. So, this function is much more
complicated than it should be.

2024-04-03 14:59:16

by Sean Christopherson

Subject: Re: [PATCH v19 108/130] KVM: TDX: Handle TDX PV HLT hypercall

On Wed, Apr 03, 2024, Chao Gao wrote:
> On Mon, Feb 26, 2024 at 12:26:50AM -0800, [email protected] wrote:
> >From: Isaku Yamahata <[email protected]>
> >
> >Wire up TDX PV HLT hypercall to the KVM backend function.
> >
> >Signed-off-by: Isaku Yamahata <[email protected]>
> >---
> >v19:
> >- move tdvps_state_non_arch_check() to this patch
> >
> >v18:
> >- drop buggy_hlt_workaround and use TDH.VP.RD(TD_VCPU_STATE_DETAILS)
> >
> >Signed-off-by: Isaku Yamahata <[email protected]>
> >---
> > arch/x86/kvm/vmx/tdx.c | 26 +++++++++++++++++++++++++-
> > arch/x86/kvm/vmx/tdx.h | 4 ++++
> > 2 files changed, 29 insertions(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >index eb68d6c148b6..a2caf2ae838c 100644
> >--- a/arch/x86/kvm/vmx/tdx.c
> >+++ b/arch/x86/kvm/vmx/tdx.c
> >@@ -688,7 +688,18 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >
> > bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> > {
> >- return pi_has_pending_interrupt(vcpu);
> >+ bool ret = pi_has_pending_interrupt(vcpu);
>
> Maybe
> bool has_pending_interrupt = pi_has_pending_interrupt(vcpu);
>
> "ret" isn't a good name. or even call pi_has_pending_interrupt() directly in
> the if statement below.

Ya, or split the if-statement into multiple chunks, with comments explaining
what each non-intuitive chunk is doing. The pi_has_pending_interrupt(vcpu) check
is self-explanatory; the halted thing, not so much. They are terminal statements,
so there's zero reason to pre-check the PID.

E.g.

/*
* Comment explaining why KVM needs to assume a non-halted vCPU has a
* pending interrupt (KVM can't see RFLAGS.IF).
*/
if (vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
return true;

if (pi_has_pending_interrupt(vcpu))
return true;

> >+ union tdx_vcpu_state_details details;
> >+ struct vcpu_tdx *tdx = to_tdx(vcpu);
> >+
> >+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> >+ return true;
>
> Question: why mp_state matters here?
> >+
> >+ if (tdx->interrupt_disabled_hlt)
> >+ return false;
>
> Shouldn't we move this into vt_interrupt_allowed()? VMX calls the function to
> check if interrupt is disabled. KVM can clear tdx->interrupt_disabled_hlt on
> every TD-enter and set it only on TD-exit due to the guest making a
> TDVMCALL(hlt) w/ interrupt disabled.

I'm pretty sure interrupt_disabled_hlt shouldn't exist. Shouldn't "a0", a.k.a. r12,
be preserved at this point?

/* Another comment explaining magic code. */
if (to_vmx(vcpu)->exit_reason.basic == EXIT_REASON_HLT &&
tdvmcall_a0_read(vcpu))
return false;


Actually, can't this all be:

if (to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_HLT)
return true;

if (!tdvmcall_a0_read(vcpu))
return false;

if (pi_has_pending_interrupt(vcpu))
return true;

return tdx_has_pending_virtual_interrupt(vcpu);

2024-04-03 15:10:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX

On Wed, Apr 03, 2024, Chao Gao wrote:
> On Mon, Feb 26, 2024 at 12:26:53AM -0800, [email protected] wrote:
> >+bool tdx_has_emulated_msr(u32 index, bool write)
> >+{
> >+ switch (index) {
> >+ case MSR_IA32_UCODE_REV:
> >+ case MSR_IA32_ARCH_CAPABILITIES:
> >+ case MSR_IA32_POWER_CTL:
> >+ case MSR_IA32_CR_PAT:
> >+ case MSR_IA32_TSC_DEADLINE:
> >+ case MSR_IA32_MISC_ENABLE:
> >+ case MSR_PLATFORM_INFO:
> >+ case MSR_MISC_FEATURES_ENABLES:
> >+ case MSR_IA32_MCG_CAP:
> >+ case MSR_IA32_MCG_STATUS:
> >+ case MSR_IA32_MCG_CTL:
> >+ case MSR_IA32_MCG_EXT_CTL:
> >+ case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> >+ case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
> >+ /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
> >+ return true;
> >+ case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
> >+ /*
> >+ * x2APIC registers that are virtualized by the CPU can't be
> >+ * emulated, KVM doesn't have access to the virtual APIC page.
> >+ */
> >+ switch (index) {
> >+ case X2APIC_MSR(APIC_TASKPRI):
> >+ case X2APIC_MSR(APIC_PROCPRI):
> >+ case X2APIC_MSR(APIC_EOI):
> >+ case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
> >+ case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
> >+ case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
> >+ return false;
> >+ default:
> >+ return true;
> >+ }
> >+ case MSR_IA32_APICBASE:
> >+ case MSR_EFER:
> >+ return !write;
> >+ case 0x4b564d00 ... 0x4b564dff:
> >+ /* KVM custom MSRs */
> >+ return tdx_is_emulated_kvm_msr(index, write);
> >+ default:
> >+ return false;
> >+ }
>
> The only call site with a non-NULL KVM parameter is:
>
> r = static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE);
>
> Only MSR_IA32_SMBASE needs to be handled. So, this function is much more
> complicated than it should be.

No, because it's also used by tdx_{g,s}et_msr().

2024-04-03 15:35:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Mon, Feb 26, 2024, [email protected] wrote:
> +union tdx_vcpu_state_details {
> + struct {
> + u64 vmxip : 1;
> + u64 reserved : 63;
> + };
> + u64 full;
> +};

No unions please. KVM uses unions in a few places where they are the lesser of
all evils, but in general, unions are frowned upon. Bitfields in particular are
strongly discourage, as they are a nightmare to read/review and tend to generate
bad code.

E.g. for this one, something like (names aren't great)

static inline bool tdx_has_pending_virtual_interrupt(struct kvm_vcpu *vcpu)
{
return <get "non arch field"> & TDX_VCPU_STATE_VMXIP;
}
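For what it's worth, the mask-based read can be sketched in a self-contained form. The bit position, the macro name, and the helper name below are assumptions for illustration only, not the real TDX module ABI:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: VMXIP read with a plain mask instead of a bitfield
 * union. Bit 0 matches the "vmxip : 1" position in the dropped union;
 * everything else about this sketch is invented.
 */
#define TDX_VCPU_STATE_VMXIP	(1ULL << 0)

static inline int tdx_vmxip_set(uint64_t vcpu_state_details)
{
	return !!(vcpu_state_details & TDX_VCPU_STATE_VMXIP);
}
```

The caller would fetch the raw 64-bit "non arch field" via TDH.VP.RD and pass it straight through, with no intermediate struct.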

> +union tdx_sept_entry {
> + struct {
> + u64 r : 1;
> + u64 w : 1;
> + u64 x : 1;
> + u64 mt : 3;
> + u64 ipat : 1;
> + u64 leaf : 1;
> + u64 a : 1;
> + u64 d : 1;
> + u64 xu : 1;
> + u64 ignored0 : 1;
> + u64 pfn : 40;
> + u64 reserved : 5;
> + u64 vgp : 1;
> + u64 pwa : 1;
> + u64 ignored1 : 1;
> + u64 sss : 1;
> + u64 spp : 1;
> + u64 ignored2 : 1;
> + u64 sve : 1;

Yeah, NAK to these unions. They are crappy duplicates of existing definitions,
e.g. it took me a few seconds to realize SVE is SUPPRESS_VE, which is far too
long.

> + };
> + u64 raw;
> +};
> +enum tdx_sept_entry_state {
> + TDX_SEPT_FREE = 0,
> + TDX_SEPT_BLOCKED = 1,
> + TDX_SEPT_PENDING = 2,
> + TDX_SEPT_PENDING_BLOCKED = 3,
> + TDX_SEPT_PRESENT = 4,
> +};
> +
> +union tdx_sept_level_state {
> + struct {
> + u64 level : 3;
> + u64 reserved0 : 5;
> + u64 state : 8;
> + u64 reserved1 : 48;
> + };
> + u64 raw;
> +};

Similar thing here. Depending on what happens with the SEAMCALL argument mess,
the code can look something like:

static u8 tdx_get_sept_level(struct tdx_module_args *out)
{
return out->rdx & TDX_SEPT_LEVEL_MASK;
}

static u8 tdx_get_sept_state(struct tdx_module_args *out)
{
return (out->rdx & TDX_SEPT_STATE_MASK) >> TDX_SEPT_STATE_SHIFT;
}

> +union tdx_md_field_id {
> + struct {
> + u64 field : 24;
> + u64 reserved0 : 8;
> + u64 element_size_code : 2;
> + u64 last_element_in_field : 4;
> + u64 reserved1 : 3;
> + u64 inc_size : 1;
> + u64 write_mask_valid : 1;
> + u64 context : 3;
> + u64 reserved2 : 1;
> + u64 class : 6;
> + u64 reserved3 : 1;
> + u64 non_arch : 1;
> + };
> + u64 raw;
> +};
> +
> +#define TDX_MD_ELEMENT_SIZE_CODE(_field_id) \
> + ({ union tdx_md_field_id _fid = { .raw = (_field_id)}; \
> + _fid.element_size_code; })

Yeah, no thanks. MASK + SHIFT will do just fine.
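As a concrete sketch of the MASK + SHIFT alternative for one field of the dropped union (the macro names here are invented; only the bit positions are taken from the bitfield layout above):

```c
#include <stdint.h>

/* Bits mirror the tdx_md_field_id layout: field is bits 23:0,
 * element_size_code is bits 33:32. Names are illustrative. */
#define TDX_MD_FID_FIELD_MASK		0x0000000000ffffffULL
#define TDX_MD_FID_ELEM_SIZE_MASK	0x0000000300000000ULL
#define TDX_MD_FID_ELEM_SIZE_SHIFT	32

static inline uint8_t tdx_md_element_size_code(uint64_t field_id)
{
	return (field_id & TDX_MD_FID_ELEM_SIZE_MASK) >>
	       TDX_MD_FID_ELEM_SIZE_SHIFT;
}
```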

2024-04-03 15:44:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX

On Mon, Feb 26, 2024, [email protected] wrote:
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 389bb95d2af0..c8f991b69720 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1877,6 +1877,76 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> *error_code = 0;
> }
>
> +static bool tdx_is_emulated_kvm_msr(u32 index, bool write)
> +{
> + switch (index) {
> + case MSR_KVM_POLL_CONTROL:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> +bool tdx_has_emulated_msr(u32 index, bool write)
> +{
> + switch (index) {
> + case MSR_IA32_UCODE_REV:
> + case MSR_IA32_ARCH_CAPABILITIES:
> + case MSR_IA32_POWER_CTL:
> + case MSR_IA32_CR_PAT:
> + case MSR_IA32_TSC_DEADLINE:
> + case MSR_IA32_MISC_ENABLE:
> + case MSR_PLATFORM_INFO:
> + case MSR_MISC_FEATURES_ENABLES:
> + case MSR_IA32_MCG_CAP:
> + case MSR_IA32_MCG_STATUS:
> + case MSR_IA32_MCG_CTL:
> + case MSR_IA32_MCG_EXT_CTL:
> + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> + case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
> + /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
> + return true;
> + case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
> + /*
> + * x2APIC registers that are virtualized by the CPU can't be
> + * emulated, KVM doesn't have access to the virtual APIC page.
> + */
> + switch (index) {
> + case X2APIC_MSR(APIC_TASKPRI):
> + case X2APIC_MSR(APIC_PROCPRI):
> + case X2APIC_MSR(APIC_EOI):
> + case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
> + case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
> + case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
> + return false;
> + default:
> + return true;
> + }
> + case MSR_IA32_APICBASE:
> + case MSR_EFER:
> + return !write;

Meh, for literally two MSRs, just open code them in tdx_set_msr() and drop the
@write param. Or alternatively add:

static bool tdx_is_read_only_msr(u32 msr)
{
return msr == MSR_IA32_APICBASE || msr == MSR_EFER;
}

> + case 0x4b564d00 ... 0x4b564dff:

This is silly, just do

case MSR_KVM_POLL_CONTROL:
return false;

and let everything else go through the default statement, no?

> + /* KVM custom MSRs */
> + return tdx_is_emulated_kvm_msr(index, write);
> + default:
> + return false;
> + }
> +}
> +
> +int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> +{
> + if (tdx_has_emulated_msr(msr->index, false))
> + return kvm_get_msr_common(vcpu, msr);
> + return 1;

Please invert these and make the happy path the not-taken path, i.e.

if (!tdx_has_emulated_msr(msr->index))
return 1;

return kvm_get_msr_common(vcpu, msr);

The standard kernel pattern is

if (error)
return <error thingie>

return <happy thingie>

> +}
> +
> +int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> +{
> + if (tdx_has_emulated_msr(msr->index, true))

As above:

if (tdx_is_read_only_msr(msr->index))
return 1;

if (!tdx_has_emulated_msr(msr->index))
return 1;

return kvm_set_msr_common(vcpu, msr);

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d5b18cad9dcd..0e1d3853eeb4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -90,7 +90,6 @@
> #include "trace.h"
>
> #define MAX_IO_MSRS 256
> -#define KVM_MAX_MCE_BANKS 32
>
> struct kvm_caps kvm_caps __read_mostly = {
> .supported_mce_cap = MCG_CTL_P | MCG_SER_P,
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 4e40c23d66ed..c87b7a777b67 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -9,6 +9,8 @@
> #include "kvm_cache_regs.h"
> #include "kvm_emulate.h"
>
> +#define KVM_MAX_MCE_BANKS 32

Split this into a separate patch. Yes, it's trivial, but that's _exactly_ why it should
be in a separate patch. The more trivial refactoring you split out, the more we
can apply _now_ and take off your hands.

2024-04-03 15:59:07

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 106/130] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL

On Mon, Feb 26, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Some of TDG.VP.VMCALL require device model, for example, qemu, to handle
> them on behalf of kvm kernel module. TDVMCALL_REPORT_FATAL_ERROR,
> TDVMCALL_MAP_GPA, TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and
> TDVMCALL_GET_QUOTE requires user space VMM handling.
>
> Introduce new kvm exit, KVM_EXIT_TDX, and functions to setup it. Device
> model should update R10 if necessary as return value.

Hard NAK.

KVM needs its own ABI, under no circumstance should KVM inherit ABI directly from
the GHCI. Even worse, this doesn't even sanity check the "unknown" VMCALLs, KVM
just blindly punts *everything* to userspace. And even worse than that, KVM
already has at least one user exit that overlaps, TDVMCALL_MAP_GPA => KVM_HC_MAP_GPA_RANGE.

If the userspace VMM wants to run an end-around on KVM and directly communicate
with the guest, e.g. via a synthetic device (a la virtio), that's totally fine,
because *KVM* is not defining any unique ABI, KVM is purely providing the
transport, e.g. emulated MMIO or PIO (and maybe not even that). IIRC, this option
even came up in the context of GET_QUOTE.

But explicit exiting to userspace with KVM_EXIT_TDX is very different. KVM is
creating a contract with userspace that says "for TDX VMCALLs [a-z], KVM will exit
to userspace with values [a-z]". *Every* new VMCALL that's added to the GHCI will
become KVM ABI, e.g. if Intel ships a TDX module that adds a new VMCALL, then KVM
will forward the exit to userspace, and userspace can then start relying on that
behavior.

And punting all register state, decoding, etc. to userspace creates a crap ABI.
KVM effectively did this for SEV and SEV-ES by copying the PSP ABI verbatim into
KVM ioctls(), and it's a gross, ugly mess.

Each VMCALL that KVM wants to forward needs a dedicated KVM_EXIT_<reason> and
associated struct in the exit union. Yes, it's slightly more work now, but it's
one time pain. Whereas copying all registers is endless misery for everyone
involved, e.g. *every* userspace VMM needs to decipher the registers, do sanity
checking, etc. And *every* end user needs to do the same when debugging
inevitable failures.

This also solves Chao's comment about XMM registers. Except for emulating Hyper-V
hypercalls, which have very explicit handling, KVM does NOT support using XMM
registers in hypercalls.
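To make the "dedicated KVM_EXIT_<reason> per VMCALL" point concrete, a fixed-layout exit struct for one forwarded VMCALL could look roughly like the following. This is a hypothetical sketch only: the struct, its name, and its fields are invented here, not proposed uAPI.

```c
#include <stdint.h>

/*
 * Hypothetical example of a dedicated exit for a single VMCALL
 * (GET_QUOTE), as opposed to dumping all guest registers verbatim.
 * Userspace sees exactly the inputs it needs, already decoded.
 */
struct kvm_exit_tdx_get_quote {
	uint64_t gpa;	/* shared buffer GPA the guest passed */
	uint64_t size;	/* buffer size in bytes */
	uint64_t ret;	/* status code written back by userspace */
};
```

Each new VMCALL KVM chooses to forward would then get its own struct and exit reason, so the ABI is opted into one VMCALL at a time instead of inherited wholesale from the GHCI.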

> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v14 -> v15:
> - updated struct kvm_tdx_exit with union
> - export constants for reg bitmask
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 83 ++++++++++++++++++++++++++++++++++++-
> include/uapi/linux/kvm.h | 89 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 170 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c8eb47591105..72dbe2ff9062 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1038,6 +1038,78 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +static int tdx_complete_vp_vmcall(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
> + __u64 reg_mask = kvm_rcx_read(vcpu);
> +
> +#define COPY_REG(MASK, REG) \
> + do { \
> + if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) \
> + kvm_## REG ## _write(vcpu, tdx_vmcall->out_ ## REG); \
> + } while (0)
> +
> +
> + COPY_REG(R10, r10);
> + COPY_REG(R11, r11);
> + COPY_REG(R12, r12);
> + COPY_REG(R13, r13);
> + COPY_REG(R14, r14);
> + COPY_REG(R15, r15);
> + COPY_REG(RBX, rbx);
> + COPY_REG(RDI, rdi);
> + COPY_REG(RSI, rsi);
> + COPY_REG(R8, r8);
> + COPY_REG(R9, r9);
> + COPY_REG(RDX, rdx);
> +
> +#undef COPY_REG
> +
> + return 1;
> +}
> +
> +static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
> + __u64 reg_mask;
> +
> + vcpu->arch.complete_userspace_io = tdx_complete_vp_vmcall;
> + memset(tdx_vmcall, 0, sizeof(*tdx_vmcall));
> +
> + vcpu->run->exit_reason = KVM_EXIT_TDX;
> + vcpu->run->tdx.type = KVM_EXIT_TDX_VMCALL;
> +
> + reg_mask = kvm_rcx_read(vcpu);
> + tdx_vmcall->reg_mask = reg_mask;
> +
> +#define COPY_REG(MASK, REG) \
> + do { \
> + if (reg_mask & TDX_VMCALL_REG_MASK_ ## MASK) { \
> + tdx_vmcall->in_ ## REG = kvm_ ## REG ## _read(vcpu); \
> + tdx_vmcall->out_ ## REG = tdx_vmcall->in_ ## REG; \
> + } \
> + } while (0)
> +
> +
> + COPY_REG(R10, r10);
> + COPY_REG(R11, r11);
> + COPY_REG(R12, r12);
> + COPY_REG(R13, r13);
> + COPY_REG(R14, r14);
> + COPY_REG(R15, r15);
> + COPY_REG(RBX, rbx);
> + COPY_REG(RDI, rdi);
> + COPY_REG(RSI, rsi);
> + COPY_REG(R8, r8);
> + COPY_REG(R9, r9);
> + COPY_REG(RDX, rdx);
> +
> +#undef COPY_REG
> +
> + /* notify userspace to handle the request */
> + return 0;
> +}
> +
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
> if (tdvmcall_exit_type(vcpu))
> @@ -1048,8 +1120,15 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> break;
> }
>
> - tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
> - return 1;
> + /*
> + * Unknown VMCALL. Toss the request to the user space VMM, e.g. qemu,
> + * as it may know how to handle.
> + *
> + * Those VMCALLs require user space VMM:
> + * TDVMCALL_REPORT_FATAL_ERROR, TDVMCALL_MAP_GPA,
> + * TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and TDVMCALL_GET_QUOTE.
> + */
> + return tdx_vp_vmcall_to_user(vcpu);
> }
>
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5e2b28934aa9..a7aa804ef021 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -167,6 +167,92 @@ struct kvm_xen_exit {
> } u;
> };
>
> +/* masks for reg_mask to indicate which registers are passed. */
> +#define TDX_VMCALL_REG_MASK_RBX BIT_ULL(2)
> +#define TDX_VMCALL_REG_MASK_RDX BIT_ULL(3)
> +#define TDX_VMCALL_REG_MASK_RSI BIT_ULL(6)
> +#define TDX_VMCALL_REG_MASK_RDI BIT_ULL(7)
> +#define TDX_VMCALL_REG_MASK_R8 BIT_ULL(8)
> +#define TDX_VMCALL_REG_MASK_R9 BIT_ULL(9)
> +#define TDX_VMCALL_REG_MASK_R10 BIT_ULL(10)
> +#define TDX_VMCALL_REG_MASK_R11 BIT_ULL(11)
> +#define TDX_VMCALL_REG_MASK_R12 BIT_ULL(12)
> +#define TDX_VMCALL_REG_MASK_R13 BIT_ULL(13)
> +#define TDX_VMCALL_REG_MASK_R14 BIT_ULL(14)
> +#define TDX_VMCALL_REG_MASK_R15 BIT_ULL(15)
> +
> +struct kvm_tdx_exit {
> +#define KVM_EXIT_TDX_VMCALL 1
> + __u32 type;
> + __u32 pad;
> +
> + union {
> + struct kvm_tdx_vmcall {
> + /*
> + * RAX(bit 0), RCX(bit 1) and RSP(bit 4) are reserved.
> + * RAX(bit 0): TDG.VP.VMCALL status code.
> + * RCX(bit 1): bitmap for used registers.
> + * RSP(bit 4): the caller stack.
> + */
> + union {
> + __u64 in_rcx;
> + __u64 reg_mask;
> + };
> +
> + /*
> + * Guest-Host-Communication Interface for TDX spec
> + * defines the ABI for TDG.VP.VMCALL.
> + */
> + /* Input parameters: guest -> VMM */
> + union {
> + __u64 in_r10;
> + __u64 type;
> + };
> + union {
> + __u64 in_r11;
> + __u64 subfunction;
> + };
> + /*
> + * Subfunction specific.
> + * Registers are used in this order to pass input
> + * arguments. r12=arg0, r13=arg1, etc.
> + */
> + __u64 in_r12;
> + __u64 in_r13;
> + __u64 in_r14;
> + __u64 in_r15;
> + __u64 in_rbx;
> + __u64 in_rdi;
> + __u64 in_rsi;
> + __u64 in_r8;
> + __u64 in_r9;
> + __u64 in_rdx;
> +
> + /* Output parameters: VMM -> guest */
> + union {
> + __u64 out_r10;
> + __u64 status_code;
> + };
> + /*
> + * Subfunction specific.
> + * Registers are used in this order to output return
> + * values. r11=ret0, r12=ret1, etc.
> + */
> + __u64 out_r11;
> + __u64 out_r12;
> + __u64 out_r13;
> + __u64 out_r14;
> + __u64 out_r15;
> + __u64 out_rbx;
> + __u64 out_rdi;
> + __u64 out_rsi;
> + __u64 out_r8;
> + __u64 out_r9;
> + __u64 out_rdx;
> + } vmcall;
> + } u;
> +};
> +
> #define KVM_S390_GET_SKEYS_NONE 1
> #define KVM_S390_SKEYS_MAX 1048576
>
> @@ -210,6 +296,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_NOTIFY 37
> #define KVM_EXIT_LOONGARCH_IOCSR 38
> #define KVM_EXIT_MEMORY_FAULT 39
> +#define KVM_EXIT_TDX 40
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -470,6 +557,8 @@ struct kvm_run {
> __u64 gpa;
> __u64 size;
> } memory_fault;
> + /* KVM_EXIT_TDX_VMCALL */
> + struct kvm_tdx_exit tdx;
> /* Fix the size of the union. */
> char padding[256];
> };
> --
> 2.25.1
>

2024-04-03 16:30:57

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Wed, Apr 03, 2024 at 08:04:03AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > +union tdx_vcpu_state_details {
> > + struct {
> > + u64 vmxip : 1;
> > + u64 reserved : 63;
> > + };
> > + u64 full;
> > +};
>
> No unions please. KVM uses unions in a few places where they are the lesser of
> all evils, but in general, unions are frowned upon. Bitfields in particular are
> strongly discourage, as they are a nightmare to read/review and tend to generate
> bad code.
>
> E.g. for this one, something like (names aren't great)
>
> static inline bool tdx_has_pending_virtual_interrupt(struct kvm_vcpu *vcpu)
> {
> return <get "non arch field"> & TDX_VCPU_STATE_VMXIP;
> }


Sure, let me replace them with mask and shift.
--
Isaku Yamahata <[email protected]>

2024-04-03 17:52:43

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Thu, Mar 28, 2024 at 12:33:35PM +1300,
"Huang, Kai" <[email protected]> wrote:

> > + kvm_tdx->tdr_pa = tdr_pa;
> > +
> > + for_each_online_cpu(i) {
> > + int pkg = topology_physical_package_id(i);
> > +
> > + if (cpumask_test_and_set_cpu(pkg, packages))
> > + continue;
> > +
> > + /*
> > + * Program the memory controller in the package with an
> > + * encryption key associated to a TDX private host key id
> > + * assigned to this TDR. Concurrent operations on same memory
> > + * controller results in TDX_OPERAND_BUSY. Avoid this race by
> > + * mutex.
> > + */
>
> IIUC the race can only happen when you are creating multiple TDX guests
> simultaneously? Please clarify this in the comment.
>
> And I even don't think you need all these TDX module details:
>
> /*
> * Concurrent run of TDH.MNG.KEY.CONFIG on the same
> * package results in TDX_OPERAND_BUSY. When creating
> * multiple TDX guests simultaneously this can run
> * concurrently. Take the per-package lock to
> * serialize.
> */

As pointed by Chao, those mutex will be dropped.
https://lore.kernel.org/kvm/ZfpwIespKy8qxWWE@chao-email/
We would also simplify the cpumask handling that tracks which packages are
online/offline and which CPU to use for each package.


> > + mutex_lock(&tdx_mng_key_config_lock[pkg]);
> > + ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> > + &kvm_tdx->tdr_pa, true);
> > + mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> > + if (ret)
> > + break;
> > + }
> > + cpus_read_unlock();
> > + free_cpumask_var(packages);
> > + if (ret) {
> > + i = 0;
> > + goto teardown;
> > + }
> > +
> > + kvm_tdx->tdcs_pa = tdcs_pa;
> > + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > + err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> > + if (err == TDX_RND_NO_ENTROPY) {
> > + /* Here it's hard to allow userspace to retry. */
> > + ret = -EBUSY;
> > + goto teardown;
> > + }
> > + if (WARN_ON_ONCE(err)) {
> > + pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> > + ret = -EIO;
> > + goto teardown;
> > + }
> > + }
> > +
> > + /*
> > + * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> > + * ioctl() to define the configure CPUID values for the TD.
> > + */
>
> Then, how about renaming this function to __tdx_td_create()?

So do we also want to rename the ioctl for consistency?
i.e. KVM_TDX_INIT_VM => KVM_TDX_CREATE_VM.

I don't have a strong opinion on those names. Maybe
KVM_TDX_{INIT, CREATE, or CONFIG}_VM?
And we can rename the function to match it.

> > + return 0;
> > +
> > + /*
> > + * The sequence for freeing resources from a partially initialized TD
> > + * varies based on where in the initialization flow failure occurred.
> > + * Simply use the full teardown and destroy, which naturally play nice
> > + * with partial initialization.
> > + */
> > +teardown:
> > + for (; i < tdx_info->nr_tdcs_pages; i++) {
> > + if (tdcs_pa[i]) {
> > + free_page((unsigned long)__va(tdcs_pa[i]));
> > + tdcs_pa[i] = 0;
> > + }
> > + }
> > + if (!kvm_tdx->tdcs_pa)
> > + kfree(tdcs_pa);
>
> The code to "free TDCS pages in a loop and free the array" is done below
> with duplicated code. I am wondering whether we have way to eliminate one.
>
> But I have lost track here, so perhaps we can review again after we split
> the patch to smaller pieces.

Surely we can simplify it. Originally we had a spin lock, so I had to separate
the blocking memory allocation from its use, which resulted in this error
cleanup path. Now that it's a mutex, we can mix page allocation with its use.


> > + tdx_mmu_release_hkid(kvm);
> > + tdx_vm_free(kvm);
> > + return ret;
> > +
> > +free_packages:
> > + cpus_read_unlock();
> > + free_cpumask_var(packages);
> > +free_tdcs:
> > + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > + if (tdcs_pa[i])
> > + free_page((unsigned long)__va(tdcs_pa[i]));
> > + }
> > + kfree(tdcs_pa);
> > + kvm_tdx->tdcs_pa = NULL;
> > +
> > +free_tdr:
> > + if (tdr_pa)
> > + free_page((unsigned long)__va(tdr_pa));
> > + kvm_tdx->tdr_pa = 0;
> > +free_hkid:
> > + if (is_hkid_assigned(kvm_tdx))
> > + tdx_hkid_free(kvm_tdx);
> > + return ret;
> > +}
> > +
> > int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> > {
> > struct kvm_tdx_cmd tdx_cmd;
> > @@ -215,12 +664,13 @@ static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > static int __init tdx_module_setup(void)
> > {
> > - u16 num_cpuid_config;
> > + u16 num_cpuid_config, tdcs_base_size;
> > int ret;
> > u32 i;
> > struct tdx_md_map mds[] = {
> > TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> > + TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
> > };
> > struct tdx_metadata_field_mapping fields[] = {
> > @@ -273,6 +723,8 @@ static int __init tdx_module_setup(void)
> > c->edx = ecx_edx >> 32;
> > }
> > + tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
> > +
>
> Round up the 'tdcs_base_size' to make sure you have enough room, or put a
> WARN() here if not page aligned?

Ok, will add the round-up. Same for tdvps_base_size.
I can't find anything in the TDX spec relating those sizes to the page size.
Although TDH.MNG.ADDCX() and TDH.VP.ADDCX() imply that those sizes are multiples
of PAGE_SIZE, the spec doesn't guarantee it. I think a silent round-up is better
than WARN() because we can do nothing about the values the TDX module provides.
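The round-up itself is a one-liner; a minimal self-contained sketch (PAGE_SIZE and the helper name are stand-ins here, the kernel would use its existing DIV_ROUND_UP()):

```c
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Round tdcs_base_size up so a size that isn't page-aligned still
 * gets enough TDCS pages, instead of silently truncating. */
static unsigned long tdcs_nr_pages(unsigned long tdcs_base_size)
{
	return DIV_ROUND_UP(tdcs_base_size, PAGE_SIZE);
}
```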



> > return 0;
> > error_out:
> > @@ -319,13 +771,27 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> > struct tdx_enabled enable = {
> > .err = ATOMIC_INIT(0),
> > };
> > + int max_pkgs;
> > int r = 0;
> > + int i;
>
> Nit: you can put the 3 into one line.
>
> > + if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > + pr_warn("MOVDIR64B is required for TDX\n");
>
> It's better to make it more clear:
>
> "Disable TDX: MOVDIR64B is not supported or disabled by the kernel."
>
> Or, to match below:
>
> "Cannot enable TDX w/o MOVDIR64B".

Ok.


> > + return -EOPNOTSUPP;
> > + }
> > if (!enable_ept) {
> > pr_warn("Cannot enable TDX with EPT disabled\n");
> > return -EINVAL;
> > }
> > + max_pkgs = topology_max_packages();
> > + tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> > + GFP_KERNEL);
> > + if (!tdx_mng_key_config_lock)
> > + return -ENOMEM;
> > + for (i = 0; i < max_pkgs; i++)
> > + mutex_init(&tdx_mng_key_config_lock[i]);
> > +
>
> Using a per-socket lock looks a little bit overkill to me. I don't know
> whether we need to do in the initial version. Will leave to others.
>
> Please at least add a comment to explain this is for better performance when
> creating multiple TDX guests IIUC?

Will delete the mutex and simplify the related logic.
--
Isaku Yamahata <[email protected]>

2024-04-03 17:59:21

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX

On Mon, Apr 01, 2024 at 11:49:43PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > For virtual IO, the guest TD shares guest pages with VMM without
> > encryption.
>
> Virtual IO is one use case of shared memory; it's better to use it
> as an example instead of putting it at the beginning of the sentence.
>
>
> > Shared EPT is used to map guest pages in unprotected way.
> >
> > Add the VMCS field encoding for the shared EPTP, which will be used by
> > TDX to have separate EPT walks for private GPAs (existing EPTP) versus
> > shared GPAs (new shared EPTP).
> >
> > Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
> Maybe mention that the EPTP for private GPAs is set by the TDX module.

Sure, let me update the commit message.
--
Isaku Yamahata <[email protected]>

2024-04-03 17:59:33

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support

On Tue, Apr 02, 2024 at 02:21:41PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Implement hooks of TDP MMU for TDX backend. TLB flush, TLB shootdown,
> > propagating the change private EPT entry to Secure EPT and freeing Secure
> > EPT page. TLB flush handles both shared EPT and private EPT. It flushes
> > shared EPT same as VMX. It also waits for the TDX TLB shootdown. For the
> > hook to free Secure EPT page, unlinks the Secure EPT page from the Secure
> > EPT so that the page can be freed to OS.
> >
> > Propagate the entry change to Secure EPT. The possible entry changes are
> > present -> non-present(zapping) and non-present -> present(population). On
> > population just link the Secure EPT page or the private guest page to the
> > Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
> > zapping/population, zapping requires synchronous TLB shoot down with the
> > frozen EPT entry.
>
> But for private memory, zapping holds write lock, right?

Right.


> > It zaps the secure entry, increments TLB counter, sends
> > IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> > guest page from the Secure EPT. For simplicity, batched zapping with
> > exclude lock is handled as concurrent zapping.
>
> exclude lock -> exclusive lock
>
> How should this sentence be understood?
> Since it's holding the exclusive lock, how can it be handled as concurrent
> zapping?
> Or do you want to describe that the current implementation prevents concurrent
> zapping?

The sentence is a mixture of the current TDP MMU behavior and the enhancement
this patch provides. Because this patch is the TDX backend, let me drop the
description of the TDP MMU part.

Propagate the entry change to Secure EPT. The possible entry changes
are non-present -> present(population) and present ->
non-present(zapping). On population just link the Secure EPT page or
the private guest page to the Secure EPT by TDX SEAMCALL. On zapping,
it blocks the Secure-EPT entry (clears the present bit), increments the
TLB counter, sends IPI to remote vcpus to trigger TLB flush, and then
unlinks the private guest page from the Secure EPT.


> > Although it's inefficient,
> > it can be optimized in the future.
> >
> > For MMIO SPTE, the spte value changes as follows.
> > initial value (suppress VE bit is set)
> > -> Guest issues MMIO and triggers EPT violation
> > -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> > -> Guest MMIO resumes. It triggers VE exception in guest TD
> > -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> > -> KVM handles MMIO
> > -> Guest VE handler resumes its execution after MMIO instruction
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> >
> > ---
> > v19:
> > - Compile fix when CONFIG_HYPERV != y.
> > It's due to the following patch. Catch it up.
> > https://lore.kernel.org/all/[email protected]/
> > - Add comments on tlb shootdown to explan the sequence.
> > - Use gmem_max_level callback, delete tdp_max_page_level.
> >
> > v18:
> > - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> > - checkpatch: space => tab
> >
> > v15 -> v16:
> > - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
> >
> > v14 -> v15:
> > - Implemented tdx_flush_tlb_current()
> > - Removed unnecessary invept in tdx_flush_tlb(). It was carry over
> > from the very old code base.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/spte.c | 3 +-
> > arch/x86/kvm/vmx/main.c | 91 ++++++++-
> > arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx.h | 2 +-
> > arch/x86/kvm/vmx/tdx_ops.h | 6 +
> > arch/x86/kvm/vmx/x86_ops.h | 13 ++
> > 6 files changed, 481 insertions(+), 6 deletions(-)
> >
> [...]
> > +
> > +static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, kvm_pfn_t pfn)
> > +{
> > + int tdx_level = pg_level_to_tdx_sept_level(level);
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + union tdx_sept_level_state level_state;
> > + hpa_t hpa = pfn_to_hpa(pfn);
> > + gpa_t gpa = gfn_to_gpa(gfn);
> > + struct tdx_module_args out;
> > + union tdx_sept_entry entry;
> > + u64 err;
> > +
> > + err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
> > + if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> > + tdx_unpin(kvm, pfn);
> > + return -EAGAIN;
> > + }
> > + if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
> > + entry.raw = out.rcx;
> > + level_state.raw = out.rdx;
> > + if (level_state.level == tdx_level &&
> > + level_state.state == TDX_SEPT_PENDING &&
> > + entry.leaf && entry.pfn == pfn && entry.sve) {
> > + tdx_unpin(kvm, pfn);
> > + WARN_ON_ONCE(!(to_kvm_tdx(kvm)->attributes &
> > + TDX_TD_ATTR_SEPT_VE_DISABLE));
>
> to_kvm_tdx(kvm) -> kvm_tdx
>
> Since the implementation requires attributes.TDX_TD_ATTR_SEPT_VE_DISABLE is set,

TDX KVM allows either configuration: set or cleared.


> should it check the value passed from userspace?

It's user-space configurable value.


> And the reason should be described somewhere in changelog or/and comment.

This WARN_ON_ONCE() is a guard against a buggy TDX module. It shouldn't return
(TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX) when SEPT_VE_DISABLE is
cleared. Maybe we should remove this WARN_ON_ONCE() because the TDX module
is mature.


> > + return -EAGAIN;
> > + }
> > + }
> > + if (KVM_BUG_ON(err, kvm)) {
> > + pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
> > + tdx_unpin(kvm, pfn);
> > + return -EIO;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, kvm_pfn_t pfn)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > + /* TODO: handle large pages. */
> > + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > + return -EINVAL;
> > +
> > + /*
> > + * Because restricted mem
>
> The term "restricted mem" is not used anymore, right? Should update the
> comment.

Sure, will update it to guest_memfd.
>
> > doesn't support page migration with
> > + * a_ops->migrate_page (yet), no callback isn't triggered for KVM on
>
> no callback isn't -> no callback is
>
> > + * page migration. Until restricted mem supports page migration,
>
> "restricted mem" -> guest_mem
>
>
> > + * prevent page migration.
> > + * TODO: Once restricted mem introduces callback on page migration,
>
> ditto
>
> > + * implement it and remove get_page/put_page().
> > + */
> > + get_page(pfn_to_page(pfn));
> > +
> > + if (likely(is_td_finalized(kvm_tdx)))
> > + return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > +
> > + /* TODO: tdh_mem_page_add() comes here for the initial memory. */
> > +
> > + return 0;
> > +}
> > +
> > +static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, kvm_pfn_t pfn)
> > +{
> > + int tdx_level = pg_level_to_tdx_sept_level(level);
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + struct tdx_module_args out;
> > + gpa_t gpa = gfn_to_gpa(gfn);
> > + hpa_t hpa = pfn_to_hpa(pfn);
> > + hpa_t hpa_with_hkid;
> > + u64 err;
> > +
> > + /* TODO: handle large pages. */
> > + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > + return -EINVAL;
> > +
> > + if (unlikely(!is_hkid_assigned(kvm_tdx))) {
> > + /*
> > + * The HKID assigned to this TD was already freed and cache
> > + * was already flushed. We don't have to flush again.
> > + */
> > + err = tdx_reclaim_page(hpa);
> > + if (KVM_BUG_ON(err, kvm))
> > + return -EIO;
> > + tdx_unpin(kvm, pfn);
> > + return 0;
> > + }
> > +
> > + do {
> > + /*
> > + * When zapping private page, write lock is held. So no race
> > + * condition with other vcpu sept operation. Race only with
> > + * TDH.VP.ENTER.
> > + */
> > + err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> > + } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
> > + if (KVM_BUG_ON(err, kvm)) {
> > + pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
> > + return -EIO;
> > + }
> > +
> > + hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> > + do {
> > + /*
> > + * TDX_OPERAND_BUSY can happen on locking PAMT entry. Because
> > + * this page was removed above, other thread shouldn't be
> > + * repeatedly operating on this page. Just retry loop.
> > + */
> > + err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> > + } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
> > + if (KVM_BUG_ON(err, kvm)) {
> > + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> > + return -EIO;
> > + }
> > + tdx_clear_page(hpa);
> > + tdx_unpin(kvm, pfn);
> > + return 0;
> > +}
> > +
> > +static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, void *private_spt)
> > +{
> > + int tdx_level = pg_level_to_tdx_sept_level(level);
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + gpa_t gpa = gfn_to_gpa(gfn);
> > + hpa_t hpa = __pa(private_spt);
> > + struct tdx_module_args out;
> > + u64 err;
> > +
> > + err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
>
> kvm_tdx is only used here, can drop the local var.

Will drop it.
--
Isaku Yamahata <[email protected]>

2024-04-03 18:03:10

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support

On Tue, Apr 02, 2024 at 05:13:23PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Implement hooks of TDP MMU for TDX backend. TLB flush, TLB shootdown,
> > propagating the change private EPT entry to Secure EPT and freeing Secure
> > EPT page. TLB flush handles both shared EPT and private EPT. It flushes
> > shared EPT the same as VMX. It also waits for the TDX TLB shootdown. The
> > hook to free a Secure EPT page unlinks the Secure EPT page from the Secure
> > EPT so that the page can be freed to the OS.
> >
> > Propagate the entry change to Secure EPT. The possible entry changes are
> > present -> non-present(zapping) and non-present -> present(population). On
> > population just link the Secure EPT page or the private guest page to the
> > Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
> > zapping/population, zapping requires synchronous TLB shoot down with the
> > frozen EPT entry. It zaps the secure entry, increments TLB counter, sends
> > IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> > guest page from the Secure EPT. For simplicity, batched zapping with the
> > exclusive lock is handled as concurrent zapping. Although it's inefficient,
> > it can be optimized in the future.
> >
> > For MMIO SPTE, the spte value changes as follows.
> > initial value (suppress VE bit is set)
> > -> Guest issues MMIO and triggers EPT violation
> > -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> > -> Guest MMIO resumes. It triggers VE exception in guest TD
> > -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> > -> KVM handles MMIO
> > -> Guest VE handler resumes its execution after MMIO instruction
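The suppress-VE transition above can be sketched with plain bit manipulation. The bit positions and helper names below are illustrative only, not KVM's actual SPTE encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout; the real SPTE encoding differs. */
#define SPTE_SUPPRESS_VE (1ULL << 63)
#define SPTE_MMIO_MASK   (1ULL << 62)

/* Initial SPTE: suppress #VE so the guest never sees a spurious #VE. */
static uint64_t spte_initial(void)
{
	return SPTE_SUPPRESS_VE;
}

/*
 * On EPT violation for MMIO, convert to an MMIO SPTE with the suppress-VE
 * bit cleared, so the guest's next access raises #VE inside the TD and the
 * guest #VE handler can issue TDG.VP.VMCALL<MMIO>.
 */
static uint64_t spte_make_mmio(uint64_t spte)
{
	return (spte & ~SPTE_SUPPRESS_VE) | SPTE_MMIO_MASK;
}

/* A #VE is delivered to the guest only when suppress-VE is clear. */
static bool spte_injects_ve(uint64_t spte)
{
	return !(spte & SPTE_SUPPRESS_VE);
}
```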
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> >
> > ---
> > v19:
> > - Compile fix when CONFIG_HYPERV != y.
> > It's due to the following patch. Catch it up.
> > https://lore.kernel.org/all/[email protected]/
> > - Add comments on tlb shootdown to explain the sequence.
> > - Use gmem_max_level callback, delete tdp_max_page_level.
> >
> > v18:
> > - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> > - checkpatch: space => tab
> >
> > v15 -> v16:
> > - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
> >
> > v14 -> v15:
> > - Implemented tdx_flush_tlb_current()
> > - Removed unnecessary invept in tdx_flush_tlb(). It was carried over
> > from the very old code base.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/spte.c | 3 +-
> > arch/x86/kvm/vmx/main.c | 91 ++++++++-
> > arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx.h | 2 +-
> > arch/x86/kvm/vmx/tdx_ops.h | 6 +
> > arch/x86/kvm/vmx/x86_ops.h | 13 ++
> > 6 files changed, 481 insertions(+), 6 deletions(-)
> >
> [...]
>
> > +static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level)
> > +{
> > + int tdx_level = pg_level_to_tdx_sept_level(level);
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> > + struct tdx_module_args out;
> > + u64 err;
> > +
> > + /* This can be called when destructing guest TD after freeing HKID. */
> > + if (unlikely(!is_hkid_assigned(kvm_tdx)))
> > + return 0;
> > +
> > + /* For now large page isn't supported yet. */
> > + WARN_ON_ONCE(level != PG_LEVEL_4K);
> > + err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> > + if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> > + return -EAGAIN;
> > + if (KVM_BUG_ON(err, kvm)) {
> > + pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
> > +
> > +/*
> > + * TLB shoot down procedure:
> > + * There is a global epoch counter and each vcpu has local epoch counter.
> > + * - TDH.MEM.RANGE.BLOCK(TDR. level, range) on one vcpu
> > + * This blocks the subsequent creation of TLB translations on that range.
> > + * This corresponds to clearing the present bit (all RWX) in the EPT entry
> > + * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
> > + * - IPI to remote vcpus
> > + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> > + * - On re-entry, the TDX module compares the local epoch counter with the
> > + * global epoch counter. If the local epoch counter is older than the
> > + * global epoch counter, it updates the local epoch counter and flushes the TLB.
> > + */
> > +static void tdx_track(struct kvm *kvm)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + u64 err;
> > +
> > + KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> > + /* If TD isn't finalized, it's before any vcpu running. */
> > + if (unlikely(!is_td_finalized(kvm_tdx)))
> > + return;
> > +
> > + /*
> > + * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
> > + * the counter. The counter is used instead of bool because multiple
> > + * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
>
> Which case will have concurrent issues of TDH_MEM_TRACK() by multiple vcpus?
> For now, zapping is holding write lock.
> Promotion/demotion may have concurrent issues of TDH_MEM_TRACK(), but it's
> not supported yet.

You're right. Large page support will use it. Under the assumption that only a
single vcpu issues the tlb flush, the alternative is a boolean + memory barrier.
I prefer to keep atomic_t and drop this comment rather than use a boolean +
memory barrier, because we will eventually switch to atomic_t anyway.
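As a userspace illustration of the counter-based handshake (C11 atomics; the names mirror the quoted code, but this is only a sketch, not kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * A counter instead of a bool: once large pages allow concurrent
 * TDH.MEM.TRACK issuers, multiple trackers may be in flight at once.
 */
static atomic_int tdh_mem_track;

/* Local side: announce a TDH.MEM.TRACK is about to be issued. */
static void track_begin(void)
{
	atomic_fetch_add(&tdh_mem_track, 1);
}

/* Local side: release remote vcpus waiting in their TLB-flush handler. */
static void track_end(void)
{
	atomic_fetch_sub(&tdh_mem_track, 1);
}

/* Remote side: spin until no tracker is in flight, then re-enter the TD. */
static int track_in_flight(void)
{
	return atomic_load(&tdh_mem_track) != 0;
}
```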


> > + *
> > + * optimization: The TLB shoot down procedure described in The TDX
> > + * specification is, TDH.MEM.TRACK(), send IPI to remote vcpus, confirm
> > + * all remote vcpus exit to VMM, and execute vcpu, both local and
> > + * remote. Twist the sequence to reduce IPI overhead as follows.
> > + *
> > + * local remote
> > + * ----- ------
> > + * increment tdh_mem_track
> > + *
> > + * request KVM_REQ_TLB_FLUSH
> > + * send IPI
> > + *
> > + * TDEXIT to KVM due to IPI
> > + *
> > + * IPI handler calls tdx_flush_tlb()
> > + * to process KVM_REQ_TLB_FLUSH.
> > + * spin wait for tdh_mem_track == 0
> > + *
> > + * TDH.MEM.TRACK()
> > + *
> > + * decrement tdh_mem_track
> > + *
> > + * complete KVM_REQ_TLB_FLUSH
> > + *
> > + * TDH.VP.ENTER to flush tlbs TDH.VP.ENTER to flush tlbs
> > + */
> > + atomic_inc(&kvm_tdx->tdh_mem_track);
> > + /*
> > + * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> > + * KVM_REQUEST_WAIT.
> > + */
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> > +
> > + do {
> > + err = tdh_mem_track(kvm_tdx->tdr_pa);
> > + } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> > +
> > + /* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
> > + atomic_dec(&kvm_tdx->tdh_mem_track);
> > +
> > + if (KVM_BUG_ON(err, kvm))
> > + pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> > +
> > +}
> > +
> > +static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, void *private_spt)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > + /*
> > + * The HKID assigned to this TD was already freed and cache was
> > + * already flushed. We don't have to flush again.
> > + */
> > + if (!is_hkid_assigned(kvm_tdx))
> > + return tdx_reclaim_page(__pa(private_spt));
> > +
> > + /*
> > + * free_private_spt() is (obviously) called when a shadow page is being
> > + * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
> > + * Note: This function is for private shadow pages, not for private
> > + * guest pages. A private guest page can be zapped while the TD is
> > + * active: shared <-> private conversion and slot move/deletion.
> > + KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
>
> At this point, is_hkid_assigned(kvm_tdx) is always true.

Yes, will drop this KVM_BUG_ON().
--
Isaku Yamahata <[email protected]>

2024-04-03 18:44:25

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

On Mon, Apr 01, 2024 at 12:10:58PM +0800,
Chao Gao <[email protected]> wrote:

> >+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >+{
> >+ unsigned long exit_qual;
> >+
> >+ if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
> >+ /*
> >+ * Always treat SEPT violations as write faults. Ignore the
> >+ * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
> >+ * TD private pages are always RWX in the SEPT tables,
> >+ * i.e. they're always mapped writable. Just as importantly,
> >+ * treating SEPT violations as write faults is necessary to
> >+ * avoid COW allocations, which will cause TDAUGPAGE failures
> >+ * due to aliasing a single HPA to multiple GPAs.
> >+ */
> >+#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
> >+ exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
> >+ } else {
> >+ exit_qual = tdexit_exit_qual(vcpu);
> >+ if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
>
> Unless the CPU has a bug, instruction fetch in TD from shared memory causes a
> #PF. I think you can add a comment for this.

Yes.


> Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.

A bug in which component? The CPU? If so, I think KVM_EXIT_INTERNAL_ERROR +
KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is more appropriate.


> >+ pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
> >+ tdexit_gpa(vcpu), kvm_rip_read(vcpu));
> >+ vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
> >+ vcpu->run->ex.exception = PF_VECTOR;
> >+ vcpu->run->ex.error_code = exit_qual;
> >+ return 0;
> >+ }
> >+ }
> >+
> >+ trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> >+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> >+}
> >+
> >+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
> >+{
> >+ WARN_ON_ONCE(1);
> >+
> >+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> >+ vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
> >+ vcpu->run->internal.ndata = 2;
> >+ vcpu->run->internal.data[0] = EXIT_REASON_EPT_MISCONFIG;
> >+ vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
> >+
> >+ return 0;
> >+}
> >+
> > int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > {
> > union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> >@@ -1345,6 +1390,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
> >
> > switch (exit_reason.basic) {
> >+ case EXIT_REASON_EPT_VIOLATION:
> >+ return tdx_handle_ept_violation(vcpu);
> >+ case EXIT_REASON_EPT_MISCONFIG:
> >+ return tdx_handle_ept_misconfig(vcpu);
>
> Handling EPT misconfiguration can be dropped because the "default" case handles
> all unexpected exits in the same way

Ah, right. Will update it.
--
Isaku Yamahata <[email protected]>

2024-04-03 18:51:20

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 102/130] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT

On Mon, Apr 01, 2024 at 04:22:00PM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:44AM -0800, [email protected] wrote:
> >From: Isaku Yamahata <[email protected]>
> >
> >Because guest TD state is protected, exceptions in guest TDs can't be
> >intercepted. TDX VMM doesn't need to handle exceptions.
> >tdx_handle_exit_irqoff() handles NMI and machine check. Ignore NMI and
>
> tdx_handle_exit_irqoff() doesn't handle NMIs.

Will move it to tdx_handle_exception().


> >machine check and continue guest TD execution.
> >
> >For external interrupt, increment stats same to the VMX case.
> >
> >Signed-off-by: Isaku Yamahata <[email protected]>
> >Reviewed-by: Paolo Bonzini <[email protected]>
> >---
> > arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
> > 1 file changed, 23 insertions(+)
> >
> >diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >index 0db80fa020d2..bdd74682b474 100644
> >--- a/arch/x86/kvm/vmx/tdx.c
> >+++ b/arch/x86/kvm/vmx/tdx.c
> >@@ -918,6 +918,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> > vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
> > }
> >
> >+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
> >+{
> >+ u32 intr_info = tdexit_intr_info(vcpu);
> >+
> >+ if (is_nmi(intr_info) || is_machine_check(intr_info))
> >+ return 1;
>
> Add a comment in code as well.

Sure.


> >+
> >+ kvm_pr_unimpl("unexpected exception 0x%x(exit_reason 0x%llx qual 0x%lx)\n",
> >+ intr_info,
> >+ to_tdx(vcpu)->exit_reason.full, tdexit_exit_qual(vcpu));
> >+ return -EFAULT;
>
> -EFAULT looks incorrect.

As this is an unexpected exception, we should exit to the user space with
KVM_EXIT_EXCEPTION. Then QEMU will abort with a message.
--
Isaku Yamahata <[email protected]>

2024-04-03 19:20:25

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 097/130] KVM: x86: Split core of hypercall emulation to helper function

On Fri, Mar 29, 2024 at 11:24:55AM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:39AM -0800, [email protected] wrote:
> >@@ -10162,18 +10151,49 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >
> > WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> > vcpu->arch.complete_userspace_io = complete_hypercall_exit;
> >+ /* stat is incremented on completion. */
>
> Perhaps we could use a distinct return value to signal that the request is redirected
> to userspace. This way, more cases can be supported, e.g., accesses to MTRR
> MSRs, requests to service TDs, etc. And then ...

The convention here is the one vcpu_enter_guest() already uses for exit
handlers. If we introduce something like KVM_VCPU_CONTINUE=1,
KVM_VCPU_EXIT_TO_USER=0, it will touch many places. So if we do (I'm not sure
it's worthwhile), the cleanup should be done independently.
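For illustration, the hypothetical named constants being discussed would look roughly like this (a sketch only; today KVM uses the bare 1/0 values):

```c
#include <assert.h>

/* Hypothetical names for KVM's exit-handler return convention. */
#define KVM_VCPU_CONTINUE     1	/* keep running the guest */
#define KVM_VCPU_EXIT_TO_USER 0	/* exit to userspace (e.g. KVM_HC_MAP_GPA_RANGE) */

/*
 * A toy exit handler following the convention: requests handled fully in
 * the kernel continue the vcpu loop; requests redirected to userspace
 * return 0 so vcpu_enter_guest() breaks out.
 */
static int handle_exit(int needs_userspace)
{
	return needs_userspace ? KVM_VCPU_EXIT_TO_USER : KVM_VCPU_CONTINUE;
}
```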


> > return 0;
> > }
> > default:
> > ret = -KVM_ENOSYS;
> > break;
> > }
> >+
> > out:
> >+ ++vcpu->stat.hypercalls;
> >+ return ret;
> >+}
> >+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> >+
> >+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >+{
> >+ unsigned long nr, a0, a1, a2, a3, ret;
> >+ int op_64_bit;
> >+ int cpl;
> >+
> >+ if (kvm_xen_hypercall_enabled(vcpu->kvm))
> >+ return kvm_xen_hypercall(vcpu);
> >+
> >+ if (kvm_hv_hypercall_enabled(vcpu))
> >+ return kvm_hv_hypercall(vcpu);
> >+
> >+ nr = kvm_rax_read(vcpu);
> >+ a0 = kvm_rbx_read(vcpu);
> >+ a1 = kvm_rcx_read(vcpu);
> >+ a2 = kvm_rdx_read(vcpu);
> >+ a3 = kvm_rsi_read(vcpu);
> >+ op_64_bit = is_64_bit_hypercall(vcpu);
> >+ cpl = static_call(kvm_x86_get_cpl)(vcpu);
> >+
> >+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> >+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> >+ /* MAP_GPA tosses the request to the user space. */
>
> no need to check what the request is. Just checking the return value will suffice.

This is needed to avoid updating rax etc. KVM_HC_MAP_GPA_RANGE is the only
exception that goes to the user space. This check is a bit weird, but I
couldn't find a better way.

>
> >+ return 0;
> >+
> > if (!op_64_bit)
> > ret = (u32)ret;
> > kvm_rax_write(vcpu, ret);
> >
> >- ++vcpu->stat.hypercalls;
> > return kvm_skip_emulated_instruction(vcpu);
> > }
> > EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
> >--
> >2.25.1
> >
> >
>
--
Isaku Yamahata <[email protected]>

2024-04-03 19:21:52

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 097/130] KVM: x86: Split core of hypercall emulation to helper function

On Wed, Apr 03, 2024, Isaku Yamahata wrote:
> On Fri, Mar 29, 2024 at 11:24:55AM +0800,
> Chao Gao <[email protected]> wrote:
>
> > On Mon, Feb 26, 2024 at 12:26:39AM -0800, [email protected] wrote:
> > >@@ -10162,18 +10151,49 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > >
> > > WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> > > vcpu->arch.complete_userspace_io = complete_hypercall_exit;
> > >+ /* stat is incremented on completion. */
> >
> > Perhaps we could use a distinct return value to signal that the request is redirected
> > to userspace. This way, more cases can be supported, e.g., accesses to MTRR
> > MSRs, requests to service TDs, etc. And then ...
>
> The convention here is the one for exit_handler vcpu_enter_guest() already uses.
> If we introduce something like KVM_VCPU_CONTINUE=1, KVM_VCPU_EXIT_TO_USER=0, it
> will touch many places. So if we will (I'm not sure it's worthwhile), the
> cleanup should be done as independently.

Yeah, this is far from the first time that someone has complained about KVM's
awful 1/0 return magic. And every time we've looked at it, we've come to the
conclusion that it's not worth the churn/risk.

And if we really need to further overload the return value, we can, e.g. KVM
already does this for MSR accesses:

/*
* Internal error codes that are used to indicate that MSR emulation encountered
* an error that should result in #GP in the guest, unless userspace
* handles it.
*/
#define KVM_MSR_RET_INVALID 2 /* in-kernel MSR emulation #GP condition */
#define KVM_MSR_RET_FILTERED 3 /* #GP due to userspace MSR filter */

2024-04-03 21:36:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Wed, Mar 20, 2024, Dave Hansen wrote:
> On 3/20/24 05:09, Huang, Kai wrote:
> > I can try to do if you guys believe this should be done, and should be done
> > earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
> > have meaningful performance improvement.
>
> I don't think Sean had performance concerns.
>
> I think he was having a justifiably violent reaction to how much more
> complicated the generated code is to do a SEAMCALL versus a good ol' KVM
> hypercall.

Yep. The code essentially violates the principle of least surprise. I genuinely
thought I was dumping the wrong function(s) when I first looked at the output.

2024-04-03 22:16:15

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 22/03/2024 3:17 am, Yamahata, Isaku wrote:
>>> +
>>> + for_each_online_cpu(i) {
>>> + int pkg = topology_physical_package_id(i);
>>> +
>>> + if (cpumask_test_and_set_cpu(pkg, packages))
>>> + continue;
>>> +
>>> + /*
>>> + * Program the memory controller in the package with an
>>> + * encryption key associated to a TDX private host key id
>>> + * assigned to this TDR. Concurrent operations on same memory
>>> + * controller results in TDX_OPERAND_BUSY. Avoid this race by
>>> + * mutex.
>>> + */
>>> + mutex_lock(&tdx_mng_key_config_lock[pkg]);
>> the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
>> create TDs, the same set of CPUs (the first online CPU of each package) will be
>> selected to configure the key because of the cpumask_test_and_set_cpu() above.
>> it means, we never have two CPUs in the same socket trying to program the key,
>> i.e., no concurrent calls.
> Makes sense. Will drop the lock.

Hmm.. Skipping in cpumask_test_and_set_cpu() would result in the second
TDH.MNG.KEY.CONFIG not being done for the second VM. No?
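For what it's worth, if `packages` is allocated fresh for each VM creation (the quoted code frees it with free_cpumask_var() after the loop), the dedup only spans one VM's loop, so each VM would still issue TDH.MNG.KEY.CONFIG once per package. A toy sketch of that selection logic, with a plain bitmask standing in for the cpumask:

```c
#include <assert.h>
#include <stdint.h>

/* Toy topology: two cpus per package (illustrative only). */
static int package_of(int cpu)
{
	return cpu / 2;
}

/*
 * Returns how many key-config calls one VM creation would issue: one per
 * package, deduped via a mask that lives only for this call, mirroring
 * the per-VM cpumask in the quoted loop.
 */
static int key_config_calls(int nr_cpus)
{
	uint64_t packages = 0;	/* fresh mask per VM creation */
	int calls = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		int pkg = package_of(cpu);

		if (packages & (1ULL << pkg))
			continue;	/* cpumask_test_and_set_cpu() equivalent */
		packages |= 1ULL << pkg;
		calls++;	/* would do smp_call_on_cpu(..., key_config, ...) */
	}
	return calls;
}
```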

2024-04-03 22:25:00

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 103/130] KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI

On Mon, Apr 01, 2024 at 05:14:02PM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:45AM -0800, [email protected] wrote:
> >From: Isaku Yamahata <[email protected]>
> >
> >When BIOS eMCA MCE-SMI morphing is enabled, the #MC is morphed to MSMI
> >(Machine Check System Management Interrupt). Then the SMI causes TD exit
> >with the exit reason of EXIT_REASON_OTHER_SMI with the MSMI bit set in the exit
> >qualification to KVM instead of EXIT_REASON_EXCEPTION_NMI with MC
> >exception.
> >
> >Handle EXIT_REASON_OTHER_SMI with MSMI bit set in the exit qualification as
> >MCE(Machine Check Exception) happened during TD guest running.
> >
> >Signed-off-by: Isaku Yamahata <[email protected]>
> >---
> > arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++---
> > arch/x86/kvm/vmx/tdx_arch.h | 2 ++
> > 2 files changed, 39 insertions(+), 3 deletions(-)
> >
> >diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >index bdd74682b474..117c2315f087 100644
> >--- a/arch/x86/kvm/vmx/tdx.c
> >+++ b/arch/x86/kvm/vmx/tdx.c
> >@@ -916,6 +916,30 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> > tdexit_intr_info(vcpu));
> > else if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
> > vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
> >+ else if (unlikely(tdx->exit_reason.non_recoverable ||
> >+ tdx->exit_reason.error)) {
>
> why not just:
> else if (tdx->exit_reason.basic == EXIT_REASON_OTHER_SMI) {
>
>
> i.e., does EXIT_REASON_OTHER_SMI imply exit_reason.non_recoverable or
> exit_reason.error?

Yes, this should be refined.


> >+ /*
> >+ * The only reason it gets EXIT_REASON_OTHER_SMI is there is an
> >+ * #MSMI(Machine Check System Management Interrupt) with
> >+ * exit_qualification bit 0 set in TD guest.
> >+ * The #MSMI is delivered right after SEAMCALL returns,
> >+ * and an #MC is delivered to host kernel after SMI handler
> >+ * returns.
> >+ *
> >+ * The #MC right after SEAMCALL is fixed up and skipped in #MC
>
> Looks fixing up and skipping #MC on the first instruction after TD-exit is
> missing in v19?

Right. We removed it because MSMI indicates whether the #MC happened in SEAM or not.


>
> >+ * handler because it's an #MC that happened in the TD guest, which
> >+ * we cannot handle in the host's context.
> >+ *
> >+ * Call KVM's machine check handler explicitly here.
> >+ */
> >+ if (tdx->exit_reason.basic == EXIT_REASON_OTHER_SMI) {
> >+ unsigned long exit_qual;
> >+
> >+ exit_qual = tdexit_exit_qual(vcpu);
> >+ if (exit_qual & TD_EXIT_OTHER_SMI_IS_MSMI)
>
> >+ kvm_machine_check();
> >+ }
> >+ }
> > }
> >
> > static int tdx_handle_exception(struct kvm_vcpu *vcpu)
> >@@ -1381,6 +1405,11 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > exit_reason.full, exit_reason.basic,
> > to_kvm_tdx(vcpu->kvm)->hkid,
> > set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
> >+
> >+ /*
> >+ * tdx_handle_exit_irqoff() handled EXIT_REASON_OTHER_SMI. It
> >+ * must be handled before enabling preemption because it's #MC.
> >+ */
>
> Then EXIT_REASON_OTHER_SMI is handled, why still go to unhandled_exit?

Let me update the comment.
exit_irqoff() doesn't return a value to tell the vcpu_run loop whether to
continue or exit to the user space. As the guest is dead, we'd like to exit to
the user space.


> > goto unhandled_exit;
> > }
> >
> >@@ -1419,9 +1448,14 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > return tdx_handle_ept_misconfig(vcpu);
> > case EXIT_REASON_OTHER_SMI:
> > /*
> >- * If reach here, it's not a Machine Check System Management
> >- * Interrupt(MSMI). #SMI is delivered and handled right after
> >- * SEAMRET, nothing needs to be done in KVM.
> >+ * Unlike VMX, all the SMI in SEAM non-root mode (i.e. when
> >+ * TD guest vcpu is running) will cause TD exit to TDX module,
> >+ * then SEAMRET to KVM. Once it exits to KVM, SMI is delivered
> >+ * and handled right away.
> >+ *
> >+ * - If it's an Machine Check System Management Interrupt
> >+ * (MSMI), it's handled above due to non_recoverable bit set.
> >+ * - If it's not an MSMI, don't need to do anything here.
>
> This corrects a comment added in patch 100. Maybe we can just merge patch 100 into
> this one?

Yes. Will do.

> > */
> > return 1;
> > default:
> >diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> >index efc3c61c14ab..87ef22e9cd49 100644
> >--- a/arch/x86/kvm/vmx/tdx_arch.h
> >+++ b/arch/x86/kvm/vmx/tdx_arch.h
> >@@ -42,6 +42,8 @@
> > #define TDH_VP_WR 43
> > #define TDH_SYS_LP_SHUTDOWN 44
> >
> >+#define TD_EXIT_OTHER_SMI_IS_MSMI BIT(1)
> >+
> > /* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> > #define TDX_NON_ARCH BIT_ULL(63)
> > #define TDX_CLASS_SHIFT 56
> >--
> >2.25.1
> >
> >
>

--
Isaku Yamahata <[email protected]>

2024-04-03 22:41:46

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 4/04/2024 6:24 am, Yamahata, Isaku wrote:
> On Thu, Mar 28, 2024 at 12:33:35PM +1300,
> "Huang, Kai" <[email protected]> wrote:
>
>>> + kvm_tdx->tdr_pa = tdr_pa;
>>> +
>>> + for_each_online_cpu(i) {
>>> + int pkg = topology_physical_package_id(i);
>>> +
>>> + if (cpumask_test_and_set_cpu(pkg, packages))
>>> + continue;
>>> +
>>> + /*
>>> + * Program the memory controller in the package with an
>>> + * encryption key associated to a TDX private host key id
>>> + * assigned to this TDR. Concurrent operations on same memory
>>> + * controller results in TDX_OPERAND_BUSY. Avoid this race by
>>> + * mutex.
>>> + */
>>
>> IIUC the race can only happen when you are creating multiple TDX guests
>> simulatenously? Please clarify this in the comment.
>>
>> And I even don't think you need all these TDX module details:
>>
>> /*
>> * Concurrent run of TDH.MNG.KEY.CONFIG on the same
>> * package results in TDX_OPERAND_BUSY. When creating
>> * multiple TDX guests simultaneously this can run
>> * concurrently. Take the per-package lock to
>> * serialize.
>> */
>
> As pointed by Chao, those mutex will be dropped.
> https://lore.kernel.org/kvm/ZfpwIespKy8qxWWE@chao-email/
> Also we would simplify cpu masks to track which package is online/offline,
> which cpu to use for each package somehow.

Please see my reply there. I might be missing something, though.

>
>
>>> + mutex_lock(&tdx_mng_key_config_lock[pkg]);
>>> + ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
>>> + &kvm_tdx->tdr_pa, true);
>>> + mutex_unlock(&tdx_mng_key_config_lock[pkg]);
>>> + if (ret)
>>> + break;
>>> + }
>>> + cpus_read_unlock();
>>> + free_cpumask_var(packages);
>>> + if (ret) {
>>> + i = 0;
>>> + goto teardown;
>>> + }
>>> +
>>> + kvm_tdx->tdcs_pa = tdcs_pa;
>>> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
>>> + err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
>>> + if (err == TDX_RND_NO_ENTROPY) {
>>> + /* Here it's hard to allow userspace to retry. */
>>> + ret = -EBUSY;
>>> + goto teardown;
>>> + }
>>> + if (WARN_ON_ONCE(err)) {
>>> + pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
>>> + ret = -EIO;
>>> + goto teardown;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
>>> + * ioctl() to define the configure CPUID values for the TD.
>>> + */
>>
>> Then, how about renaming this function to __tdx_td_create()?
>
> So do we want to rename also ioctl name for consistency?
> i.e. KVM_TDX_INIT_VM => KVM_TDX_CREATE_VM.

Hmm.. but this __tdx_td_create() (the __tdx_td_init() in this patch) is
called via kvm_x86_ops->vm_init(), but not IOCTL()?

If I read correctly, only TDH.MNG.INIT is called via IOCTL(), in that
sense it makes more sense to name the IOCTL() as KVM_TDX_INIT_VM.

>
> I don't have strong opinion those names. Maybe
> KVM_TDX_{INIT, CREATE, or CONFIG}_VM?
> And we can rename the function name to match it.
>

2024-04-04 00:00:18

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters



On 26/02/2024 9:25 pm, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX requires additional parameters for TDX VM for confidential execution to
> protect the confidentiality of its memory contents and CPU state from any
> other software, including VMM.

Hmm.. not only "confidentiality" but also "integrity". And the "per-VM"
TDX initialization here actually has nothing to do with
"crypto-protection", because the establishment of the key has already
been done before reaching here.

I would just say:

After the crypto-protection key has been configured, TDX requires a
VM-scope initialization as a step of creating the TDX guest. This
"per-VM" TDX initialization does the global configurations/features that
the TDX guest can support, such as guest's CPUIDs (emulated by the TDX
module), the maximum number of vcpus etc.




When creating a guest TD VM before creating
> vcpu, the number of vcpu, TSC frequency (the values are the same among
> vcpus, and it can't change.) CPUIDs which the TDX module emulates.

I cannot parse this sentence. It doesn't look like a sentence to me.

Guest
> TDs can trust those CPUIDs and sha384 values for measurement.

Trustworthiness is not about what the "guest can trust", but what the
"people using the guest can trust".

Just remove it.

If you want to emphasize the attestation, you can add something like:

"
It also passes the VM's measurement and the hash of the signer etc., and the
hardware only allows initializing the TDX guest when those match.
"

>
> Add a new subcommand, KVM_TDX_INIT_VM, to pass parameters for the TDX
> guest.

[...]

It assigns an encryption key to the TDX guest for memory
> encryption. TDX encrypts memory per guest basis.

No it doesn't. The key has been programmed already in your previous patch.

The device model, say
> qemu, passes per-VM parameters for the TDX guest.

This is implied by your first sentence of this paragraph.

The maximum number of
> vcpus, TSC frequency (TDX guest has fixed VM-wide TSC frequency, not per
> vcpu. The TDX guest can not change it.), attributes (production or debug),
> available extended features (which configure guest XCR0, IA32_XSS MSR),
> CPUIDs, sha384 measurements, etc.

This is not a sentence.

>
> Call this subcommand before creating vcpu and KVM_SET_CPUID2, i.e. CPUID
> configurations aren't available yet.

"
This "per-VM" TDX initialization must be done before any "vcpu-scope"
TDX initialization. To match this better, require the KVM_TDX_INIT_VM
IOCTL() to be done before KVM creates any vcpus.

Note KVM configures the VM's CPUIDs in KVM_SET_CPUID2 via vcpu. The
downside of this approach is KVM will need to do some enforcement later
to make sure the consistency between the CPUIDs passed here and the
CPUIDs done in KVM_SET_CPUID2.
"

So CPUIDs configuration values need
> to be passed in struct kvm_tdx_init_vm. The device model's responsibility
> to make this CPUID config for KVM_TDX_INIT_VM and KVM_SET_CPUID2.

And I would leave how to handle KVM_SET_CPUID2 to the patch that
actually enforces the consistency.

>
> Signed-off-by: Xiaoyao Li <[email protected]>

Missing Co-developed-by tag for Xiaoyao.

> Signed-off-by: Isaku Yamahata <[email protected]>

[...]

>
> +struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(
> + struct kvm_cpuid_entry2 *entries, int nent, u32 function, u64 index)
> +{
> + return cpuid_entry2_find(entries, nent, function, index);
> +}
> +EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry2);

Not sure whether we can export cpuid_entry2_find() directly?

No strong opinion of course.

But if we want to expose the wrapper, looks ...

> +
> struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
> u32 function, u32 index)
> {
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index 856e3037e74f..215d1c68c6d1 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -13,6 +13,8 @@ void kvm_set_cpu_caps(void);
>
> void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
> void kvm_update_pv_runtime(struct kvm_vcpu *vcpu);
> +struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
> + int nent, u32 function, u64 index);
> struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
> u32 function, u32 index);
> struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,

.. __kvm_find_cpuid_entry() would fit better?

> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1cf2b15da257..b11f105db3cd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -8,7 +8,6 @@
> #include "mmu.h"
> #include "tdx_arch.h"
> #include "tdx.h"
> -#include "tdx_ops.h"

??

If it isn't needed, then it shouldn't be included in some previous patch.

> #include "x86.h"
>
> #undef pr_fmt
> @@ -350,18 +349,21 @@ static int tdx_do_tdh_mng_key_config(void *param)
> return 0;
> }
>
> -static int __tdx_td_init(struct kvm *kvm);
> -
> int tdx_vm_init(struct kvm *kvm)
> {
> + /*
> + * This function initializes only KVM software construct. It doesn't
> + * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
> + * It is handled by KVM_TDX_INIT_VM, __tdx_td_init().
> + */
> +
> /*
> * TDX has its own limit of the number of vcpus in addition to
> * KVM_MAX_VCPUS.
> */
> kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
>
> - /* Place holder for TDX specific logic. */
> - return __tdx_td_init(kvm);
> + return 0;

??

I don't quite understand. What's wrong of still calling __tdx_td_init()
in tdx_vm_init()?

If there's anything preventing doing __tdx_td_init() from tdx_vm_init(),
then it's wrong to implement that in your previous patch.

[...]

2024-04-04 01:04:57

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Thu, Apr 04, 2024 at 11:13:49AM +1300, Huang, Kai wrote:
>
>
>On 22/03/2024 3:17 am, Yamahata, Isaku wrote:
>> > > +
>> > > + for_each_online_cpu(i) {
>> > > + int pkg = topology_physical_package_id(i);
>> > > +
>> > > + if (cpumask_test_and_set_cpu(pkg, packages))
>> > > + continue;
>> > > +
>> > > + /*
>> > > + * Program the memory controller in the package with an
>> > > + * encryption key associated to a TDX private host key id
>> > > + * assigned to this TDR. Concurrent operations on same memory
>> > > + * controller results in TDX_OPERAND_BUSY. Avoid this race by
>> > > + * mutex.
>> > > + */
>> > > + mutex_lock(&tdx_mng_key_config_lock[pkg]);
>> > the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
>> > create TDs, the same set of CPUs (the first online CPU of each package) will be
>> > selected to configure the key because of the cpumask_test_and_set_cpu() above.
>> > it means, we never have two CPUs in the same socket trying to program the key,
>> > i.e., no concurrent calls.
>> Makes sense. Will drop the lock.
>
>Hmm.. Skipping in cpumask_test_and_set_cpu() would result in the second
>TDH.MNG.KEY.CONFIG not being done for the second VM. No?

No. Because @packages isn't shared between VMs.

2024-04-04 01:16:00

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 104/130] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)

On Mon, Apr 01, 2024 at 05:59:35PM +0800,
Chao Gao <[email protected]> wrote:

> > static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> > {
> > return tdx->td_vcpu_created;
> >@@ -897,6 +932,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> >
> > tdx_complete_interrupts(vcpu);
> >
> >+ if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
> >+ tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
>
> kvm_rcx_read()?
>
>
> >+ else
> >+ tdx->tdvmcall.rcx = 0;
>
> RCX on TDVMCALL exit is supposed to be consumed by TDX module. I don't get why
> caching it is necessary. Can tdx->tdvmcall be simply dropped?

Now it's not used. Will drop tdvmcall.

It was originally used to remember the original register mask of TDVMCALL, and
tdx_complete_vp_vmcall() used it as a valid value to copy back the output
values. The current tdx_complete_vp_vmcall() uses kvm_rcx_read() because even
if the user space changes rcx, it does no harm to KVM. KVM does what the user
space tells it.
--
Isaku Yamahata <[email protected]>

2024-04-04 01:25:27

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure



On 4/04/2024 2:03 pm, Chao Gao wrote:
> On Thu, Apr 04, 2024 at 11:13:49AM +1300, Huang, Kai wrote:
>>
>>
>> On 22/03/2024 3:17 am, Yamahata, Isaku wrote:
>>>>> +
>>>>> + for_each_online_cpu(i) {
>>>>> + int pkg = topology_physical_package_id(i);
>>>>> +
>>>>> + if (cpumask_test_and_set_cpu(pkg, packages))
>>>>> + continue;
>>>>> +
>>>>> + /*
>>>>> + * Program the memory controller in the package with an
>>>>> + * encryption key associated to a TDX private host key id
>>>>> + * assigned to this TDR. Concurrent operations on same memory
>>>>> + * controller results in TDX_OPERAND_BUSY. Avoid this race by
>>>>> + * mutex.
>>>>> + */
>>>>> + mutex_lock(&tdx_mng_key_config_lock[pkg]);
>>>> the lock is superfluous to me. with cpu lock held, even if multiple CPUs try to
>>>> create TDs, the same set of CPUs (the first online CPU of each package) will be
>>>> selected to configure the key because of the cpumask_test_and_set_cpu() above.
>>>> it means, we never have two CPUs in the same socket trying to program the key,
>>>> i.e., no concurrent calls.
>>> Makes sense. Will drop the lock.
>>
>> Hmm.. Skipping in cpumask_test_and_set_cpu() would result in the second
>> TDH.MNG.KEY.CONFIG not being done for the second VM. No?
>
> No. Because @packages isn't shared between VMs.

I see. Thanks.

2024-04-04 01:27:37

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 105/130] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL

On Tue, Apr 02, 2024 at 04:52:46PM +0800,
Chao Gao <[email protected]> wrote:

> >+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> >+{
> >+ unsigned long nr, a0, a1, a2, a3, ret;
> >+
>
> do you need to emulate xen/hyper-v hypercalls here?


No. kvm_emulate_hypercall() handles xen/hyper-v hypercalls,
__kvm_emulate_hypercall() doesn't.

> Nothing tells userspace that xen/hyper-v hypercalls are not supported and
> so userspace may expose related CPUID leafs to TD guests.
>
> >+ /*
> >+ * ABI for KVM tdvmcall argument:
> >+ * In Guest-Hypervisor Communication Interface(GHCI) specification,
> >+ * Non-zero leaf number (R10 != 0) is defined to indicate
> >+ * vendor-specific. KVM uses this for KVM hypercall. NOTE: KVM
> >+ * hypercall number starts from one. Zero isn't used for KVM hypercall
> >+ * number.
> >+ *
> >+ * R10: KVM hypercall number
> >+ * arguments: R11, R12, R13, R14.
> >+ */
> >+ nr = kvm_r10_read(vcpu);
> >+ a0 = kvm_r11_read(vcpu);
> >+ a1 = kvm_r12_read(vcpu);
> >+ a2 = kvm_r13_read(vcpu);
> >+ a3 = kvm_r14_read(vcpu);
> >+
> >+ ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true, 0);
> >+
> >+ tdvmcall_set_return_code(vcpu, ret);
> >+
> >+ if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> >+ return 0;
>
> Can you add a comment to call out that KVM_HC_MAP_GPA_RANGE is redirected to
> the userspace?

Yes, this is confusing. We should refactor kvm_emulate_hypercall() more so that
the caller shouldn't care about the return value like this. Will refactor it
and update this patch.
--
Isaku Yamahata <[email protected]>

2024-04-04 13:22:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Mon, Feb 26, 2024 at 12:26:20AM -0800, [email protected] wrote:
> @@ -491,6 +494,87 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> */
> }
>
> +static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
> +{

..

> + tdx->exit_reason.full = __seamcall_saved_ret(TDH_VP_ENTER, &args);

Call to __seamcall_saved_ret() leaves noinstr section.

__seamcall_saved_ret() has to be moved:

diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index e32cf82ed47e..6b434ab12db6 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -44,6 +44,8 @@ SYM_FUNC_START(__seamcall_ret)
SYM_FUNC_END(__seamcall_ret)
EXPORT_SYMBOL_GPL(__seamcall_ret);

+.section .noinstr.text, "ax"
+
/*
* __seamcall_saved_ret() - Host-side interface functions to SEAM software
* (the P-SEAMLDR or the TDX module), with saving output registers to the
--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-04 21:51:49

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Thu, 2024-04-04 at 16:22 +0300, Kirill A. Shutemov wrote:
> On Mon, Feb 26, 2024 at 12:26:20AM -0800, [email protected] wrote:
> > @@ -491,6 +494,87 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > */
> > }
> >
> > +static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
> > +{
>
> ...
>
> > + tdx->exit_reason.full = __seamcall_saved_ret(TDH_VP_ENTER, &args);
>
> Call to __seamcall_saved_ret() leaves noinstr section.
>
> __seamcall_saved_ret() has to be moved:
>
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> index e32cf82ed47e..6b434ab12db6 100644
> --- a/arch/x86/virt/vmx/tdx/seamcall.S
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -44,6 +44,8 @@ SYM_FUNC_START(__seamcall_ret)
> SYM_FUNC_END(__seamcall_ret)
> EXPORT_SYMBOL_GPL(__seamcall_ret);
>
> +.section .noinstr.text, "ax"
> +
> /*
> * __seamcall_saved_ret() - Host-side interface functions to SEAM software
> * (the P-SEAMLDR or the TDX module), with saving output registers to the

Alternatively, I think we can explicitly use instrumentation_begin()/end()
around __seamcall_saved_ret() here.

__seamcall_saved_ret() could be used in the future for new SEAMCALLs (e.g.,
TDH.MEM.IMPORT) for TDX guest live migration. And for that I don't think the
caller(s) is/are tagged with noinstr.

2024-04-04 22:31:21

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 018/130] KVM: x86/mmu: Assume guest MMIOs are shared

On Wed, Mar 27, 2024 at 05:27:52PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Wed, 2024-03-27 at 10:22 -0700, Isaku Yamahata wrote:
> > On Mon, Mar 25, 2024 at 11:41:56PM +0000,
> > "Edgecombe, Rick P" <[email protected]> wrote:
> >
> > > On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > > > From: Chao Gao <[email protected]>
> > > >
> > > > TODO: Drop this patch once the common patch is merged.
> > >
> > > What is this TODO talking about?
> >
> > https://lore.kernel.org/all/[email protected]/
> >
This patch was shot down and needs to be fixed on the guest side, i.e. TDVF.
>
> It needs a firmware fix? Is there any firmware that works (boot any TD) with this patch missing?

Yes. Config-B (OvmfPkg/IntelTdx/IntelTdxX64.dsc) works for me without this patch.
I'm looking into config-A(OvmfPkg/OvmfPkgX64.dsc) now.
--
Isaku Yamahata <[email protected]>

2024-04-04 22:57:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Thu, Apr 04, 2024, Kai Huang wrote:
> On Thu, 2024-04-04 at 16:22 +0300, Kirill A. Shutemov wrote:
> > On Mon, Feb 26, 2024 at 12:26:20AM -0800, [email protected] wrote:
> > > @@ -491,6 +494,87 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > */
> > > }
> > >
> > > +static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
> > > +{
> >
> > ...
> >
> > > + tdx->exit_reason.full = __seamcall_saved_ret(TDH_VP_ENTER, &args);
> >
> > Call to __seamcall_saved_ret() leaves noinstr section.
> >
> > __seamcall_saved_ret() has to be moved:
> >
> > diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> > index e32cf82ed47e..6b434ab12db6 100644
> > --- a/arch/x86/virt/vmx/tdx/seamcall.S
> > +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> > @@ -44,6 +44,8 @@ SYM_FUNC_START(__seamcall_ret)
> > SYM_FUNC_END(__seamcall_ret)
> > EXPORT_SYMBOL_GPL(__seamcall_ret);
> >
> > +.section .noinstr.text, "ax"
> > +
> > /*
> > * __seamcall_saved_ret() - Host-side interface functions to SEAM software
> > * (the P-SEAMLDR or the TDX module), with saving output registers to the
>
> Alternatively, I think we can explicitly use instrumentation_begin()/end()
> around __seamcall_saved_ret() here.

No, that will just paper over the complaint. Dang it, I was going to say that
I called out earlier that tdx_vcpu_enter_exit() doesn't need to be noinstr, but
it looks like my brain and fingers didn't connect.

So I'll say it now :-)

I don't think tdx_vcpu_enter_exit() needs to be noinstr, because the SEAMCALL is
functionally a VM-Exit, and so all host state is saved/restored "atomically"
across the SEAMCALL (some by hardware, some by software (TDX-module)).

The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks. None of
that applies to TDX. Either that, or there are some massive bugs lurking due to
missing code.

2024-04-04 23:02:59

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 106/130] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL

On Wed, Apr 03, 2024 at 08:58:50AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Some of the TDG.VP.VMCALLs require the device model, for example, qemu, to
> > handle them on behalf of the kvm kernel module. TDVMCALL_REPORT_FATAL_ERROR,
> > TDVMCALL_MAP_GPA, TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT, and
> > TDVMCALL_GET_QUOTE require user space VMM handling.
> >
> > Introduce new kvm exit, KVM_EXIT_TDX, and functions to setup it. Device
> > model should update R10 if necessary as return value.
>
> Hard NAK.
>
> KVM needs its own ABI, under no circumstance should KVM inherit ABI directly from
> the GHCI. Even worse, this doesn't even sanity check the "unknown" VMCALLs, KVM
> just blindly punts *everything* to userspace. And even worse than that, KVM
> already has at least one user exit that overlaps, TDVMCALL_MAP_GPA => KVM_HC_MAP_GPA_RANGE.
>
> If the userspace VMM wants to run an end-around on KVM and directly communicate
> with the guest, e.g. via a synthetic device (a la virtio), that's totally fine,
> because *KVM* is not defining any unique ABI, KVM is purely providing the
> transport, e.g. emulated MMIO or PIO (and maybe not even that). IIRC, this option
> even came up in the context of GET_QUOTE.
>
> But explicit exiting to userspace with KVM_EXIT_TDX is very different. KVM is
> creating a contract with userspace that says "for TDX VMCALLs [a-z], KVM will exit
> to userspace with values [a-z]". *Every* new VMCALL that's added to the GHCI will
> become KVM ABI, e.g. if Intel ships a TDX module that adds a new VMCALL, then KVM
> will forward the exit to userspace, and userspace can then start relying on that
> behavior.
>
> And punting all register state, decoding, etc. to userspace creates a crap ABI.
> KVM effectively did this for SEV and SEV-ES by copying the PSP ABI verbatim into
> KVM ioctls(), and it's a gross, ugly mess.
>
> Each VMCALL that KVM wants to forward needs a dedicated KVM_EXIT_<reason> and
> associated struct in the exit union. Yes, it's slightly more work now, but it's
> one time pain. Whereas copying all registers is endless misery for everyone
> involved, e.g. *every* userspace VMM needs to decipher the registers, do sanity
> checking, etc. And *every* end user needs to do the same when debugging
> inevitable failures.
>
> This also solves Chao's comment about XMM registers. Except for emulating Hyper-V
> hypercalls, which have very explicit handling, KVM does NOT support using XMM
> registers in hypercalls.

Sure. I will introduce the followings.

KVM_EXIT_TDX_GET_QUOTE
Request a quote.

KVM_EXIT_TDX_SETUP_EVENT_NOTIFY_INTERRUPT
Guest tells which interrupt vector the VMM uses to notify the guest.
The use case is GetQuote. It is an async request. The user-space VMM uses
this interrupt to notify the guest on the completion. Or the guest polls it.

KVM_EXIT_TDX_REPORT_FATAL_ERROR
Guest panicked. This conveys extra 64 bytes in registers. Probably this should
be converted to KVM_EXIT_SYSTEM_EVENT and KVM_SYSTEM_EVENT_CRASH.

MapGPA is converted to KVM_HC_MAP_GPA_RANGE.

--
Isaku Yamahata <[email protected]>

2024-04-04 23:25:51

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 108/130] KVM: TDX: Handle TDX PV HLT hypercall

On Wed, Apr 03, 2024 at 07:49:28AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Wed, Apr 03, 2024, Chao Gao wrote:
> > On Mon, Feb 26, 2024 at 12:26:50AM -0800, [email protected] wrote:
> > >From: Isaku Yamahata <[email protected]>
> > >
> > >Wire up TDX PV HLT hypercall to the KVM backend function.
> > >
> > >Signed-off-by: Isaku Yamahata <[email protected]>
> > >---
> > >v19:
> > >- move tdvps_state_non_arch_check() to this patch
> > >
> > >v18:
> > >- drop buggy_hlt_workaround and use TDH.VP.RD(TD_VCPU_STATE_DETAILS)
> > >
> > >Signed-off-by: Isaku Yamahata <[email protected]>
> > >---
> > > arch/x86/kvm/vmx/tdx.c | 26 +++++++++++++++++++++++++-
> > > arch/x86/kvm/vmx/tdx.h | 4 ++++
> > > 2 files changed, 29 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >index eb68d6c148b6..a2caf2ae838c 100644
> > >--- a/arch/x86/kvm/vmx/tdx.c
> > >+++ b/arch/x86/kvm/vmx/tdx.c
> > >@@ -688,7 +688,18 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >
> > > bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> > > {
> > >- return pi_has_pending_interrupt(vcpu);
> > >+ bool ret = pi_has_pending_interrupt(vcpu);
> >
> > Maybe
> > bool has_pending_interrupt = pi_has_pending_interrupt(vcpu);
> >
> > "ret" isn't a good name. or even call pi_has_pending_interrupt() directly in
> > the if statement below.
>
> Ya, or split the if-statement into multiple chucks, with comments explaining
> what each non-intuitive chunk is doing. The pi_has_pending_interrupt(vcpu) check
> is self-explanatory, the halted thing, not so much. They are terminal statements,
> there's zero reason to pre-check the PID.
>
> E.g.
>
> /*
> * Comment explaining why KVM needs to assume a non-halted vCPU has a
> * pending interrupt (KVM can't see RFLAGS.IF).
> */
> if (vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> return true;
>
> if (pi_has_pending_interrupt(vcpu))
> return true;
>
> > >+ union tdx_vcpu_state_details details;
> > >+ struct vcpu_tdx *tdx = to_tdx(vcpu);
> > >+
> > >+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> > >+ return true;
> >
> > Question: why mp_state matters here?
> > >+
> > >+ if (tdx->interrupt_disabled_hlt)
> > >+ return false;
> >
> > Shouldn't we move this into vt_interrupt_allowed()? VMX calls the function to
> > check if interrupt is disabled.

Chao, are you suggesting to implement tdx_interrupt_allowed() as
"EXIT_REASON_HLT && a0" instead of "return true"?
I don't think it makes sense because it's a rare case and we can't avoid
spurious wakeups for the TDX case.


> >KVM can clear tdx->interrupt_disabled_hlt on
> > every TD-enter and set it only on TD-exit due to the guest making a
> > TDVMCALL(hlt) w/ interrupt disabled.
>
> I'm pretty sure interrupt_disabled_hlt shouldn't exist, should "a0", a.k.a. r12,
> be preserved at this point?
>
> /* Another comment explaining magic code. */
> if (to_vmx(vcpu)->exit_reason.basic == EXIT_REASON_HLT &&
> tdvmcall_a0_read(vcpu))
> return false;
>
>
> Actually, can't this all be:
>
> if (to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_HLT)
> return true;
>
> if (!tdvmcall_a0_read(vcpu))
> return false;
>
> if (pi_has_pending_interrupt(vcpu))
> return true;
>
> return tdx_has_pending_virtual_interrupt(vcpu);
>

Thanks for the suggestion. This is much cleaner. Will update the function.
--
Isaku Yamahata <[email protected]>

2024-04-04 23:28:26

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path

On Thu, 2024-04-04 at 15:45 -0700, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, Kai Huang wrote:
> > On Thu, 2024-04-04 at 16:22 +0300, Kirill A. Shutemov wrote:
> > > On Mon, Feb 26, 2024 at 12:26:20AM -0800, [email protected] wrote:
> > > > @@ -491,6 +494,87 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > */
> > > > }
> > > >
> > > > +static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
> > > > +{
> > >
> > > ...
> > >
> > > > + tdx->exit_reason.full = __seamcall_saved_ret(TDH_VP_ENTER, &args);
> > >
> > > Call to __seamcall_saved_ret() leaves noinstr section.
> > >
> > > __seamcall_saved_ret() has to be moved:
> > >
> > > diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> > > index e32cf82ed47e..6b434ab12db6 100644
> > > --- a/arch/x86/virt/vmx/tdx/seamcall.S
> > > +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> > > @@ -44,6 +44,8 @@ SYM_FUNC_START(__seamcall_ret)
> > > SYM_FUNC_END(__seamcall_ret)
> > > EXPORT_SYMBOL_GPL(__seamcall_ret);
> > >
> > > +.section .noinstr.text, "ax"
> > > +
> > > /*
> > > * __seamcall_saved_ret() - Host-side interface functions to SEAM software
> > > * (the P-SEAMLDR or the TDX module), with saving output registers to the
> >
> > Alternatively, I think we can explicitly use instrumentation_begin()/end()
> > around __seamcall_saved_ret() here.
>
> No, that will just paper over the complaint. Dang it, I was going to say that
> I called out earlier that tdx_vcpu_enter_exit() doesn't need to be noinstr, but
> it looks like my brain and fingers didn't connect.
>
> So I'll say it now :-)
>
> I don't think tdx_vcpu_enter_exit() needs to be noinstr, because the SEAMCALL is
> functionally a VM-Exit, and so all host state is saved/restored "atomically"
> across the SEAMCALL (some by hardware, some by software (TDX-module)).
>
> The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks. None of
> that applies to TDX. Either that, or there are some massive bugs lurking due to
> missing code.

Ah right. That's even better :-)

Thanks for jumping in and pointing out!

2024-04-04 23:42:52

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX

On Wed, Apr 03, 2024 at 08:14:04AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 389bb95d2af0..c8f991b69720 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1877,6 +1877,76 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> > *error_code = 0;
> > }
> >
> > +static bool tdx_is_emulated_kvm_msr(u32 index, bool write)
> > +{
> > + switch (index) {
> > + case MSR_KVM_POLL_CONTROL:
> > + return true;
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > +bool tdx_has_emulated_msr(u32 index, bool write)
> > +{
> > + switch (index) {
> > + case MSR_IA32_UCODE_REV:
> > + case MSR_IA32_ARCH_CAPABILITIES:
> > + case MSR_IA32_POWER_CTL:
> > + case MSR_IA32_CR_PAT:
> > + case MSR_IA32_TSC_DEADLINE:
> > + case MSR_IA32_MISC_ENABLE:
> > + case MSR_PLATFORM_INFO:
> > + case MSR_MISC_FEATURES_ENABLES:
> > + case MSR_IA32_MCG_CAP:
> > + case MSR_IA32_MCG_STATUS:
> > + case MSR_IA32_MCG_CTL:
> > + case MSR_IA32_MCG_EXT_CTL:
> > + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> > + case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
> > + /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
> > + return true;
> > + case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
> > + /*
> > + * x2APIC registers that are virtualized by the CPU can't be
> > + * emulated, KVM doesn't have access to the virtual APIC page.
> > + */
> > + switch (index) {
> > + case X2APIC_MSR(APIC_TASKPRI):
> > + case X2APIC_MSR(APIC_PROCPRI):
> > + case X2APIC_MSR(APIC_EOI):
> > + case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
> > + case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
> > + case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
> > + return false;
> > + default:
> > + return true;
> > + }
> > + case MSR_IA32_APICBASE:
> > + case MSR_EFER:
> > + return !write;
>
> Meh, for literally two MSRs, just open code them in tdx_set_msr() and drop the
> @write param. Or alternatively add:
>
> static bool tdx_is_read_only_msr(u32 msr)
> {
> return msr == MSR_IA32_APICBASE || msr == MSR_EFER;
> }

Sure will add.

>
> > + case 0x4b564d00 ... 0x4b564dff:
>
> This is silly, just do
>
> case MSR_KVM_POLL_CONTROL:
> return false;
>
> and let everything else go through the default statement, no?

Now tdx_is_emulated_kvm_msr() is trivial, will open code it.


> > + /* KVM custom MSRs */
> > + return tdx_is_emulated_kvm_msr(index, write);
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > +int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> > +{
> > + if (tdx_has_emulated_msr(msr->index, false))
> > + return kvm_get_msr_common(vcpu, msr);
> > + return 1;
>
> Please invert these and make the happy path the not-taken path, i.e.
>
> if (!tdx_has_emulated_msr(msr->index))
> return 1;
>
> return kvm_get_msr_common(vcpu, msr);
>
> The standard kernel pattern is
>
> if (error)
> return <error thingie>
>
> return <happy thingie>
>
> > +}
> > +
> > +int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> > +{
> > + if (tdx_has_emulated_msr(msr->index, true))
>
> As above:
>
> if (tdx_is_read_only_msr(msr->index))
> return 1;
>
> if (!tdx_has_emulated_msr(msr->index))
> return 1;
>
> return kvm_set_msr_common(vcpu, msr);

Sure, will update them.


> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d5b18cad9dcd..0e1d3853eeb4 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -90,7 +90,6 @@
> > #include "trace.h"
> >
> > #define MAX_IO_MSRS 256
> > -#define KVM_MAX_MCE_BANKS 32
> >
> > struct kvm_caps kvm_caps __read_mostly = {
> > .supported_mce_cap = MCG_CTL_P | MCG_SER_P,
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 4e40c23d66ed..c87b7a777b67 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -9,6 +9,8 @@
> > #include "kvm_cache_regs.h"
> > #include "kvm_emulate.h"
> >
> > +#define KVM_MAX_MCE_BANKS 32
>
> Split this to a separate patch. Yes, it's trivial, but that's _exactly_ why it should
> be in a separate patch. The more trivial refactoring you split out, the more we
> can apply _now_ and take off your hands.

Will split it.
--
Isaku Yamahata <[email protected]>

2024-04-06 00:09:59

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX

On Wed, 2024-04-03 at 10:33 -0700, Isaku Yamahata wrote:
> On Mon, Apr 01, 2024 at 11:49:43PM +0800,
> Binbin Wu <[email protected]> wrote:
>
> >
> >
> > On 2/26/2024 4:26 PM, [email protected] wrote:
> > > From: Sean Christopherson <[email protected]>
> > >
> > > For virtual IO, the guest TD shares guest pages with VMM without
> > > encryption.
> >
> > Virtual IO is a use case of shared memory, it's better to use it
> > as a example instead of putting it at the beginning of the sentence.
> >
> >
> > >    Shared EPT is used to map guest pages in unprotected way.
> > >
> > > Add the VMCS field encoding for the shared EPTP, which will be used by
> > > TDX to have separate EPT walks for private GPAs (existing EPTP) versus
> > > shared GPAs (new shared EPTP).
> > >
> > > Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
> > Maybe mention that the EPTP for private GPAs is set by the TDX module.
>
> Sure, let me update the commit message.

How about this?

KVM: TDX: Add load_mmu_pgd method for TDX

TDX has uses two EPT pointers, one for the private half of the GPA
space and one for the shared half. The private half used the normal
EPT_POINTER vmcs field and is managed in a special way by the TDX module.
The shared half uses a new SHARED_EPT_POINTER field and will be managed by
the conventional MMU management operations that operate directly on the
EPT tables. This means for TDX the .load_mmu_pgd() operation will need to
know to use the SHARED_EPT_POINTER field instead of the normal one. Add a
new wrapper in x86 ops for load_mmu_pgd() that either directs the write to
the existing vmx implementation or a TDX one.

For the TDX operation, EPT will always be used, so it can simpy write to
the SHARED_EPT_POINTER field.


2024-04-06 01:03:35

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX

On Sat, 2024-04-06 at 00:09 +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-04-03 at 10:33 -0700, Isaku Yamahata wrote:
> > On Mon, Apr 01, 2024 at 11:49:43PM +0800,
> > Binbin Wu <[email protected]> wrote:
> >
> > >
> > >
> > > On 2/26/2024 4:26 PM, [email protected] wrote:
> > > > From: Sean Christopherson <[email protected]>
> > > >
> > > > For virtual IO, the guest TD shares guest pages with VMM without
> > > > encryption.
> > >
> > > Virtual IO is a use case of shared memory; it's better to use it
> > > as an example instead of putting it at the beginning of the sentence.
> > >
> > >
> > > >    Shared EPT is used to map guest pages in unprotected way.
> > > >
> > > > Add the VMCS field encoding for the shared EPTP, which will be used by
> > > > TDX to have separate EPT walks for private GPAs (existing EPTP) versus
> > > > shared GPAs (new shared EPTP).
> > > >
> > > > Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
> > > Maybe mention that the EPTP for private GPAs is set by the TDX module.
> >
> > Sure, let me update the commit message.
>
> How about this?

Looks good. Some nits though:

>
> KVM: TDX: Add load_mmu_pgd method for TDX
>
> TDX has uses two EPT pointers, one for the private half of the GPA

"TDX uses"

> space and one for the shared half. The private half used the normal

"used" -> "uses"

> EPT_POINTER vmcs field and is managed in a special way by the TDX module.

Perhaps add:

KVM is not allowed to operate on the EPT_POINTER directly.

> The shared half uses a new SHARED_EPT_POINTER field and will be managed by
> the conventional MMU management operations that operate directly on the
> EPT tables. 
>

I would like to explicitly call out KVM can update SHARED_EPT_POINTER directly:

The shared half uses a new SHARED_EPT_POINTER field. KVM is allowed to set it
directly by the interface provided by the TDX module, and KVM is expected to
manage the shared half just like it manages the existing EPT page table today.


> This means for TDX the .load_mmu_pgd() operation will need to
> know to use the SHARED_EPT_POINTER field instead of the normal one. Add a
> new wrapper in x86 ops for load_mmu_pgd() that either directs the write to
> the existing vmx implementation or a TDX one.
>
> For the TDX operation, EPT will always be used, so it can simpy write to
> the SHARED_EPT_POINTER field.


2024-04-07 01:42:41

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 078/130] KVM: TDX: Implement TDX vcpu enter/exit path



On 3/16/2024 1:26 AM, Sean Christopherson wrote:
> On Mon, Feb 26, 2024, [email protected] wrote:
>> + */
>> + struct kvm_vcpu *vcpu = &tdx->vcpu;
>> +
>> + guest_state_enter_irqoff();
>> +
>> + /*
>> + * TODO: optimization:
>> + * - Eliminate copy between args and vcpu->arch.regs.
>> + * - copyin/copyout registers only if (tdx->tdvmvall.regs_mask != 0)
>> + * which means TDG.VP.VMCALL.
>> + */
>> + args = (struct tdx_module_args) {
>> + .rcx = tdx->tdvpr_pa,
>> +#define REG(reg, REG) .reg = vcpu->arch.regs[VCPU_REGS_ ## REG]
> Organizing tdx_module_args's registers by volatile vs. non-volatile is asinine.
> This code should not need to exist.

Did you suggest aligning tdx_module_args with enum kvm_reg for the GP
registers, so the copy can be done with a simple memcpy?

>
>> + WARN_ON_ONCE(!kvm_rebooting &&
>> + (tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);
>> +
>> + guest_state_exit_irqoff();
>> +}
>> +
>>


2024-04-07 03:50:31

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 108/130] KVM: TDX: Handle TDX PV HLT hypercall

>> > >+ union tdx_vcpu_state_details details;
>> > >+ struct vcpu_tdx *tdx = to_tdx(vcpu);
>> > >+
>> > >+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
>> > >+ return true;
>> >
>> > Question: why mp_state matters here?
>> > >+
>> > >+ if (tdx->interrupt_disabled_hlt)
>> > >+ return false;
>> >
>> > Shouldn't we move this into vt_interrupt_allowed()? VMX calls the function to
>> > check if interrupt is disabled.
>
>Chao, are you suggesting to implement tdx_interrupt_allowed() as
>"EXIT_REASON_HLT && a0" instead of "return true"?
>I don't think it makes sense because it's a rare case and we can't avoid spurious
>wakeups in the TDX case.

Yes. KVM differentiates "interrupt allowed" from "has interrupt", e.g.,

static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
..

if (kvm_arch_interrupt_allowed(vcpu) &&
(kvm_cpu_has_interrupt(vcpu) ||
kvm_guest_apic_has_interrupt(vcpu)))
return true;


I think tdx_protected_apic_has_interrupt() mixes them together, which isn't
good.

Probably it is a minor thing; if no one else thinks it is better to move the
"interrupt allowed" check to tdx_interrupt_allowed(), I am also fine with not
doing that.

2024-04-07 03:59:02

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX



On 4/6/2024 8:58 AM, Huang, Kai wrote:
> On Sat, 2024-04-06 at 00:09 +0000, Edgecombe, Rick P wrote:
>> On Wed, 2024-04-03 at 10:33 -0700, Isaku Yamahata wrote:
>>> On Mon, Apr 01, 2024 at 11:49:43PM +0800,
>>> Binbin Wu <[email protected]> wrote:
>>>
>>>>
>>>> On 2/26/2024 4:26 PM, [email protected] wrote:
>>>>> From: Sean Christopherson <[email protected]>
>>>>>
>>>>> For virtual IO, the guest TD shares guest pages with VMM without
>>>>> encryption.
>>>> Virtual IO is a use case of shared memory; it's better to use it
>>>> as an example instead of putting it at the beginning of the sentence.
>>>>
>>>>
>>>>>    Shared EPT is used to map guest pages in unprotected way.
>>>>>
>>>>> Add the VMCS field encoding for the shared EPTP, which will be used by
>>>>> TDX to have separate EPT walks for private GPAs (existing EPTP) versus
>>>>> shared GPAs (new shared EPTP).
>>>>>
>>>>> Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
>>>> Maybe mention that the EPTP for private GPAs is set by the TDX module.
>>> Sure, let me update the commit message.
>> How about this?
> Looks good. Some nits though:
>
>> KVM: TDX: Add load_mmu_pgd method for TDX
>>
>> TDX has uses two EPT pointers, one for the private half of the GPA
> "TDX uses"
>
>> space and one for the shared half. The private half used the normal
> "used" -> "uses"
>
>> EPT_POINTER vmcs field and is managed in a special way by the TDX module.
> Perhaps add:
>
> KVM is not allowed to operate on the EPT_POINTER directly.
>
>> The shared half uses a new SHARED_EPT_POINTER field and will be managed by
>> the conventional MMU management operations that operate directly on the
>> EPT tables.
>>
> I would like to explicitly call out KVM can update SHARED_EPT_POINTER directly:
>
> The shared half uses a new SHARED_EPT_POINTER field. KVM is allowed to set it
> directly by the interface provided by the TDX module, and KVM is expected to
> manage the shared half just like it manages the existing EPT page table today.
>
>
>> This means for TDX the .load_mmu_pgd() operation will need to
>> know to use the SHARED_EPT_POINTER field instead of the normal one. Add a
>> new wrapper in x86 ops for load_mmu_pgd() that either directs the write to
>> the existing vmx implementation or a TDX one.
>>
>> For the TDX operation, EPT will always be used, so it can simpy write to


Maybe remove "so"?  IMO, there is no causal relationship between the
first and second half of the sentence.

Typo, "simpy" -> "simply"

>> the SHARED_EPT_POINTER field.
>


2024-04-07 06:42:18

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 079/130] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> On entering/exiting TDX vcpu, Preserved or clobbered CPU state is different
> from VMX case.

Could you add more descriptions about the differences?

> Add TDX hooks to save/restore host/guest CPU state.

KVM doesn't save/restore guest CPU state for TDX.

> Save/restore kernel GS base MSR.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 30 +++++++++++++++++++++++++--
> arch/x86/kvm/vmx/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx.h | 4 ++++
> arch/x86/kvm/vmx/x86_ops.h | 4 ++++
> 4 files changed, 78 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d72651ce99ac..8275a242ce07 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -158,6 +158,32 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vmx_vcpu_reset(vcpu, init_event);
> }
>
> +static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * All host state is saved/restored across SEAMCALL/SEAMRET,

It sounds confusing to me.
If all host state is saved/restored across SEAMCALL/SEAMRET, why does
this patch save/restore MSR_KERNEL_GS_BASE for the host?

> and the
> + * guest state of a TD is obviously off limits. Deferring MSRs and DRs
> + * is pointless because the TDX module needs to load *something* so as
> + * not to expose guest state.
> + */
> + if (is_td_vcpu(vcpu)) {
> + tdx_prepare_switch_to_guest(vcpu);
> + return;
> + }
> +
> + vmx_prepare_switch_to_guest(vcpu);
> +}
> +
> +static void vt_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu)) {
> + tdx_vcpu_put(vcpu);
> + return;
> + }
> +
> + vmx_vcpu_put(vcpu);
> +}
> +
> static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
> {
> if (is_td_vcpu(vcpu))
> @@ -326,9 +352,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vcpu_free = vt_vcpu_free,
> .vcpu_reset = vt_vcpu_reset,
>
> - .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
> + .prepare_switch_to_guest = vt_prepare_switch_to_guest,
> .vcpu_load = vmx_vcpu_load,
> - .vcpu_put = vmx_vcpu_put,
> + .vcpu_put = vt_vcpu_put,
>
> .update_exception_bitmap = vmx_update_exception_bitmap,
> .get_msr_feature = vmx_get_msr_feature,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index fdf9196cb592..9616b1aab6ce 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/cpu.h>
> +#include <linux/mmu_context.h>
>
> #include <asm/tdx.h>
>
> @@ -423,6 +424,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
>
> WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> @@ -446,9 +448,47 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
> vcpu->arch.xfd_no_write_intercept = true;
>
> + tdx->host_state_need_save = true;
> + tdx->host_state_need_restore = false;
> +
> return 0;
> }
>
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)

Just like vmx_prepare_switch_to_host(), the input can be "struct
vcpu_tdx *", since vcpu is not used inside the function.
And the call sites can just pass "to_tdx(vcpu)".

> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
Then, this can be dropped.

> +
> + if (!tdx->host_state_need_save)
> + return;
> +
> + if (likely(is_64bit_mm(current->mm)))
> + tdx->msr_host_kernel_gs_base = current->thread.gsbase;
> + else
> + tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> +
> + tdx->host_state_need_save = false;
> +}
> +
> +static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)

ditto

> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + tdx->host_state_need_save = true;
> + if (!tdx->host_state_need_restore)
> + return;
> +
> + ++vcpu->stat.host_state_reload;
> +
> + wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
> + tdx->host_state_need_restore = false;
> +}
> +
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> + vmx_vcpu_pi_put(vcpu);
> + tdx_prepare_switch_to_host(vcpu);
> +}
> +
> void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> {
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -569,6 +609,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_vcpu_enter_exit(tdx);
>
> + tdx->host_state_need_restore = true;
> +
> vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 81d301fbe638..e96c416e73bf 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -69,6 +69,10 @@ struct vcpu_tdx {
>
> bool initialized;
>
> + bool host_state_need_save;
> + bool host_state_need_restore;
> + u64 msr_host_kernel_gs_base;
> +
> /*
> * Dummy to make pmu_intel not corrupt memory.
> * TODO: Support PMU for TDX. Future work.
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 3e29a6fe28ef..9fd997c79c33 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -151,6 +151,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu);
> u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -186,6 +188,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
> +static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
> static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
>
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }


2024-04-07 07:47:16

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> On exiting from the guest TD, xsave state is clobbered. Restore xsave
> state on TD exit.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - Add EXPORT_SYMBOL_GPL(host_xcr0)
>
> v15 -> v16:
> - Added CET flag mask
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> 2 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9616b1aab6ce..199226c6cf55 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2,6 +2,7 @@
> #include <linux/cpu.h>
> #include <linux/mmu_context.h>
>
> +#include <asm/fpu/xcr.h>
> #include <asm/tdx.h>
>
> #include "capabilities.h"
> @@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> */
> }
>
> +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +
> + if (static_cpu_has(X86_FEATURE_XSAVE) &&
> + host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> + if (static_cpu_has(X86_FEATURE_XSAVES) &&
> + /* PT can be exposed to TD guest regardless of KVM's XSS support */
The comment needs to be updated to reflect the case for CET.

> + host_xss != (kvm_tdx->xfam &
> + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))

For TDX_TD_XFAM_CET, maybe there is no need to make it TDX-specific?

BTW, the definitions for XFEATURE_MASK_CET_USER/XFEATURE_MASK_CET_KERNEL
have been merged.
https://lore.kernel.org/all/20230613001108.3040476-25-rick.p.edgecombe%40intel.com
You can resolve the TODO in
https://lore.kernel.org/kvm/5eca97e6a3978cf4dcf1cff21be6ec8b639a66b9.1708933498.git.isaku.yamahata@intel.com/

> + wrmsrl(MSR_IA32_XSS, host_xss);
> + if (static_cpu_has(X86_FEATURE_PKU) &&
> + (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
> + write_pkru(vcpu->arch.host_pkru);
> +}
> +
> static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
> {
> struct tdx_module_args args;
> @@ -609,6 +627,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_vcpu_enter_exit(tdx);
>
> + tdx_restore_host_xsave_state(vcpu);
> tdx->host_state_need_restore = true;
>
> vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 23ece956c816..b361d948140f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -315,6 +315,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
> };
>
> u64 __read_mostly host_xcr0;
> +EXPORT_SYMBOL_GPL(host_xcr0);
>
> static struct kmem_cache *x86_emulator_cache;
>


2024-04-07 08:04:03

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 081/130] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Chao Gao <[email protected]>
>
> Several MSRs are constant and only used in userspace(ring 3). But VMs may
> have different values. KVM uses kvm_set_user_return_msr() to switch to
> guest's values and leverages user return notifier to restore them when the
> kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
> caches the value it wrote to an MSR last time.
>
> TDX module unconditionally resets some of these MSRs to architectural INIT
> state on TD exit. It makes the cached values in kvm_user_return_msrs are
                                                                       ^
                                                               extra "are"
> inconsistent with values in hardware. This inconsistency needs to be
> fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
> some MSRs to the host's values. kvm_set_user_return_msr() can help correct
> this case, but it is not optimal as it always does a wrmsr. So, introduce
> a variation of kvm_set_user_return_msr() to update cached values and skip
> that wrmsr.
>
> Signed-off-by: Chao Gao <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 25 ++++++++++++++++++++-----
> 2 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 36694e784c27..3ab85c3d86ee 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2259,6 +2259,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
> int kvm_add_user_return_msr(u32 msr);
> int kvm_find_user_return_msr(u32 msr);
> int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> +void kvm_user_return_update_cache(unsigned int index, u64 val);
>
> static inline bool kvm_is_supported_user_return_msr(u32 msr)
> {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b361d948140f..1b189e86a1f1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -440,6 +440,15 @@ static void kvm_user_return_msr_cpu_online(void)
> }
> }
>
> +static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
> +{
> + if (!msrs->registered) {
> + msrs->urn.on_user_return = kvm_on_user_return;
> + user_return_notifier_register(&msrs->urn);
> + msrs->registered = true;
> + }
> +}
> +
> int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
> {
> unsigned int cpu = smp_processor_id();
> @@ -454,15 +463,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
> return 1;
>
> msrs->values[slot].curr = value;
> - if (!msrs->registered) {
> - msrs->urn.on_user_return = kvm_on_user_return;
> - user_return_notifier_register(&msrs->urn);
> - msrs->registered = true;
> - }
> + kvm_user_return_register_notifier(msrs);
> return 0;
> }
> EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
>
> +/* Update the cache, "curr", and register the notifier */
Not sure this comment is necessary, since the code is simple.

> +void kvm_user_return_update_cache(unsigned int slot, u64 value)

As a public API, is it better to use "kvm_user_return_msr_update_cache"
instead of "kvm_user_return_update_cache"?
Although it makes the API name longer...

> +{
> + struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
> +
> + msrs->values[slot].curr = value;
> + kvm_user_return_register_notifier(msrs);
> +}
> +EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
> +
> static void drop_user_return_notifiers(void)
> {
> unsigned int cpu = smp_processor_id();


2024-04-07 11:57:19

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 082/130] KVM: TDX: restore user ret MSRs



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Several user ret MSRs are clobbered on TD exit. Restore those values on
> TD exit

Here "Restore" is not accurate, since the previous patch just updates
the cached value on TD exit.

> and before returning to ring 3. Because TSX_CTRL requires special
> treat, this patch doesn't address it.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 199226c6cf55..7e2b1e554246 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -535,6 +535,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> */
> }
>
> +struct tdx_uret_msr {
> + u32 msr;
> + unsigned int slot;
> + u64 defval;
> +};
> +
> +static struct tdx_uret_msr tdx_uret_msrs[] = {
> + {.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
> + {.msr = MSR_STAR,},
> + {.msr = MSR_LSTAR,},
> + {.msr = MSR_TSC_AUX,},
> +};
> +
> +static void tdx_user_return_update_cache(void)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> + kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
> + tdx_uret_msrs[i].defval);
> +}
> +
> static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -627,6 +649,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_vcpu_enter_exit(tdx);
>
> + tdx_user_return_update_cache();
> tdx_restore_host_xsave_state(vcpu);
> tdx->host_state_need_restore = true;
>
> @@ -1972,6 +1995,26 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> return -EINVAL;
> }
>
> + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
> + /*
> + * Here it checks if MSRs (tdx_uret_msrs) can be saved/restored
> + * before returning to user space.
> + *
> + * this_cpu_ptr(user_return_msrs)->registered isn't checked
> + * because the registration is done at vcpu runtime by
> + * kvm_set_user_return_msr().
Should be tdx_user_return_update_cache(), if it's the final API name.

> + * Here is setting up cpu feature before running vcpu,
> + * registered is already false.
                                  ^
                           remove "already"?

> + */
> + tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
> + if (tdx_uret_msrs[i].slot == -1) {
> + /* If any MSR isn't supported, it is a KVM bug */
> + pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
> + tdx_uret_msrs[i].msr);
> + return -EIO;
> + }
> + }
> +
> max_pkgs = topology_max_packages();
> tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> GFP_KERNEL);


2024-04-07 14:14:03

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 083/130] KVM: TDX: Add TSX_CTRL msr into uret_msrs list



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Yang Weijiang <[email protected]>
>
> TDX module resets the TSX_CTRL MSR to 0 at TD exit if TSX is enabled for
> TD. Or it preserves the TSX_CTRL MSR if TSX is disabled for TD. VMM can
> rely on uret_msrs mechanism to defer the reload of host value until exiting
> to user space.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v19:
> - fix the type of tdx_uret_tsx_ctrl_slot. unguent int => int.
> ---
> arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++--
> arch/x86/kvm/vmx/tdx.h | 8 ++++++++
> 2 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 7e2b1e554246..83dcaf5b6fbd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -547,14 +547,21 @@ static struct tdx_uret_msr tdx_uret_msrs[] = {
> {.msr = MSR_LSTAR,},
> {.msr = MSR_TSC_AUX,},
> };
> +static int tdx_uret_tsx_ctrl_slot;
>
> -static void tdx_user_return_update_cache(void)
> +static void tdx_user_return_update_cache(struct kvm_vcpu *vcpu)
> {
> int i;
>
> for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
> tdx_uret_msrs[i].defval);
> + /*
> + * TSX_CTRL is reset to 0 if guest TSX is supported. Otherwise
> + * preserved.
> + */
> + if (to_kvm_tdx(vcpu->kvm)->tsx_supported && tdx_uret_tsx_ctrl_slot != -1)

If to_kvm_tdx(vcpu->kvm)->tsx_supported is true, tdx_uret_tsx_ctrl_slot
shouldn't be -1 at this point.
Otherwise, it's a KVM bug, right?
Not sure if it needs a warning if tdx_uret_tsx_ctrl_slot is -1, or just
remove the check?

> + kvm_user_return_update_cache(tdx_uret_tsx_ctrl_slot, 0);
> }
>
> static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> @@ -649,7 +656,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> tdx_vcpu_enter_exit(tdx);
>
> - tdx_user_return_update_cache();
> + tdx_user_return_update_cache(vcpu);
> tdx_restore_host_xsave_state(vcpu);
> tdx->host_state_need_restore = true;
>
> @@ -1167,6 +1174,22 @@ static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_pa
> return 0;
> }
>
> +static bool tdparams_tsx_supported(struct kvm_cpuid2 *cpuid)
> +{
> + const struct kvm_cpuid_entry2 *entry;
> + u64 mask;
> + u32 ebx;
> +
> + entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x7, 0);
> + if (entry)
> + ebx = entry->ebx;
> + else
> + ebx = 0;
> +
> + mask = __feature_bit(X86_FEATURE_HLE) | __feature_bit(X86_FEATURE_RTM);
> + return ebx & mask;
> +}
> +
> static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> struct kvm_tdx_init_vm *init_vm)
> {
> @@ -1209,6 +1232,7 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
> MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
>
> + to_kvm_tdx(kvm)->tsx_supported = tdparams_tsx_supported(cpuid);
> return 0;
> }
>
> @@ -2014,6 +2038,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> return -EIO;
> }
> }
> + tdx_uret_tsx_ctrl_slot = kvm_find_user_return_msr(MSR_IA32_TSX_CTRL);
> + if (tdx_uret_tsx_ctrl_slot == -1 && boot_cpu_has(X86_FEATURE_MSR_TSX_CTRL)) {
> + pr_err("MSR_IA32_TSX_CTRL isn't included by kvm_find_user_return_msr\n");
> + return -EIO;
> + }
>
> max_pkgs = topology_max_packages();
> tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index e96c416e73bf..44eab734e702 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -17,6 +17,14 @@ struct kvm_tdx {
> u64 xfam;
> int hkid;
>
> + /*
> + * Used on each TD-exit, see tdx_user_return_update_cache().
> + * TSX_CTRL value on TD exit
> + * - set 0 if guest TSX enabled
> + * - preserved if guest TSX disabled
> + */
> + bool tsx_supported;
> +
> bool finalized;
> atomic_t tdh_mem_track;
>


2024-04-07 16:13:08

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> For vcpu migration, in the case of VMX, VMCS is flushed on the source pcpu,
> and load it on the target pcpu. There are corresponding TDX SEAMCALL APIs,
> call them on vcpu migration. The logic is mostly same as VMX except the
> TDX SEAMCALLs are used.
>
> When shutting down the machine,

"the machine" -> "a VM"

> (VMX or TDX) vcpus needs to be shutdown on
                   ^
                   need
> each pcpu. Do the similar for TDX with TDX SEAMCALL APIs.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 32 ++++++-
> arch/x86/kvm/vmx/tdx.c | 190 ++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/tdx.h | 2 +
> arch/x86/kvm/vmx/x86_ops.h | 4 +
> 4 files changed, 221 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 8275a242ce07..9b336c1a6508 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -33,6 +33,14 @@ static int vt_max_vcpus(struct kvm *kvm)
> static int vt_flush_remote_tlbs(struct kvm *kvm);
> #endif
>
> +static void vt_hardware_disable(void)
> +{
> + /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
> + if (enable_tdx)
> + tdx_hardware_disable();
> + vmx_hardware_disable();
> +}
> +
> static __init int vt_hardware_setup(void)
> {
> int ret;
> @@ -201,6 +209,16 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
> return vmx_vcpu_run(vcpu);
> }
>
> +static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> + if (is_td_vcpu(vcpu)) {
> + tdx_vcpu_load(vcpu, cpu);
> + return;
> + }
> +
> + vmx_vcpu_load(vcpu, cpu);
> +}
> +
> static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> {
> if (is_td_vcpu(vcpu)) {
> @@ -262,6 +280,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> }
>
> +static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + vmx_sched_in(vcpu, cpu);
> +}
> +
> static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
> if (is_td_vcpu(vcpu))
> @@ -335,7 +361,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> /* TDX cpu enablement is done by tdx_hardware_setup(). */
> .hardware_enable = vmx_hardware_enable,
> - .hardware_disable = vmx_hardware_disable,
> + .hardware_disable = vt_hardware_disable,
> .has_emulated_msr = vmx_has_emulated_msr,
>
> .is_vm_type_supported = vt_is_vm_type_supported,
> @@ -353,7 +379,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .vcpu_reset = vt_vcpu_reset,
>
> .prepare_switch_to_guest = vt_prepare_switch_to_guest,
> - .vcpu_load = vmx_vcpu_load,
> + .vcpu_load = vt_vcpu_load,
> .vcpu_put = vt_vcpu_put,
>
> .update_exception_bitmap = vmx_update_exception_bitmap,
> @@ -440,7 +466,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .request_immediate_exit = vmx_request_immediate_exit,
>
> - .sched_in = vmx_sched_in,
> + .sched_in = vt_sched_in,
>
> .cpu_dirty_log_size = PML_ENTITY_NUM,
> .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ad4d3d4eaf6c..7aa9188f384d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -106,6 +106,14 @@ static DEFINE_MUTEX(tdx_lock);
> static struct mutex *tdx_mng_key_config_lock;
> static atomic_t nr_configured_hkid;
>
> +/*
> + * A per-CPU list of TD vCPUs associated with a given CPU. Used when a CPU
> + * is brought down

This is not necessarily triggered by a CPU being brought down; it will
also be triggered by the last VM being destroyed.

On second thought, though, in the last-VM case the list should already be
empty, so I am OK with the description.
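For reference, the associate/disassociate protocol the comment describes can be modeled in plain user-space C. The list helpers and the `toy_vcpu` structure below are simplified stand-ins for the kernel's `list_head` machinery and `struct vcpu_tdx`, not the actual KVM types, and the memory barriers are elided since the model is single-threaded:

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Simplified stand-in for struct kvm_vcpu / struct vcpu_tdx. */
struct toy_vcpu {
	int cpu;                    /* -1 when not associated */
	struct list_head cpu_list;  /* entry on a per-CPU list */
};

/* Associate a vCPU with a CPU's list, as tdx_vcpu_load() does. */
static void toy_associate(struct toy_vcpu *v, int cpu, struct list_head *percpu)
{
	list_add(&v->cpu_list, &percpu[cpu]);
	v->cpu = cpu;
}

/*
 * Disassociate, as tdx_disassociate_vp() does: delete from the list
 * first, then publish cpu = -1 (the kernel orders the two with
 * smp_wmb(), paired with the smp_rmb() in tdx_vcpu_load()).
 */
static void toy_disassociate(struct toy_vcpu *v)
{
	list_del(&v->cpu_list);
	v->cpu = -1;
}

static int toy_list_empty(const struct list_head *h)
{
	return h->next == h;
}
```

tdx_hardware_disable() then reduces to walking one CPU's list and disassociating every entry, which is why an empty list in the last-VM case makes it a no-op.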


> to invoke TDH_VP_FLUSH on the approapriate TD vCPUS.

"approapriate" -> "appropriate"

> + * Protected by interrupt mask. This list is manipulated in process context
> + * of vcpu and IPI callback. See tdx_flush_vp_on_cpu().
> + */
> +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> +
> static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> {
> return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> @@ -138,6 +146,37 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> return kvm_tdx->finalized;
> }
>
> +static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + list_del(&to_tdx(vcpu)->cpu_list);
> +
> + /*
> + * Ensure tdx->cpu_list is updated is before setting vcpu->cpu to -1,
> + * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> + * to its list before its deleted from this CPUs list.
> + */
> + smp_wmb();
> +
> + vcpu->cpu = -1;
> +}
> +
> +static void tdx_disassociate_vp_arg(void *vcpu)
> +{
> + tdx_disassociate_vp(vcpu);
> +}
> +
> +static void tdx_disassociate_vp_on_cpu(struct kvm_vcpu *vcpu)
> +{
> + int cpu = vcpu->cpu;
> +
> + if (unlikely(cpu == -1))
> + return;
> +
> + smp_call_function_single(cpu, tdx_disassociate_vp_arg, vcpu, 1);
> +}
> +
> static void tdx_clear_page(unsigned long page_pa)
> {
> const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -218,6 +257,87 @@ static void tdx_reclaim_control_page(unsigned long td_page_pa)
> free_page((unsigned long)__va(td_page_pa));
> }
>
> +struct tdx_flush_vp_arg {
> + struct kvm_vcpu *vcpu;
> + u64 err;
> +};
> +
> +static void tdx_flush_vp(void *arg_)

It is more common to use "_arg" instead of "arg_".

> +{
> + struct tdx_flush_vp_arg *arg = arg_;
> + struct kvm_vcpu *vcpu = arg->vcpu;
> + u64 err;
> +
> + arg->err = 0;
> + lockdep_assert_irqs_disabled();
> +
> + /* Task migration can race with CPU offlining. */
> + if (unlikely(vcpu->cpu != raw_smp_processor_id()))
> + return;
> +
> + /*
> + * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
> + * list tracking still needs to be updated so that it's correct if/when
> + * the vCPU does get initialized.
> + */
> + if (is_td_vcpu_created(to_tdx(vcpu))) {
> + /*
> + * No need to retry. TDX Resources needed for TDH.VP.FLUSH are,
> + * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
> + * vp flush function is called when destructing vcpu/TD or vcpu
> + * migration. No other thread uses TDVPR in those cases.
> + */
> + err = tdh_vp_flush(to_tdx(vcpu)->tdvpr_pa);
> + if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
> + /*
> + * This function is called in IPI context. Do not use
> + * printk to avoid console semaphore.
> + * The caller prints out the error message, instead.
> + */
> + if (err)
> + arg->err = err;
> + }
> + }
> +
> + tdx_disassociate_vp(vcpu);
> +}
> +
> +static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
> +{
> + struct tdx_flush_vp_arg arg = {
> + .vcpu = vcpu,
> + };
> + int cpu = vcpu->cpu;
> +
> + if (unlikely(cpu == -1))
> + return;
> +
> + smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
> + if (WARN_ON_ONCE(arg.err)) {
> + pr_err("cpu: %d ", cpu);
> + pr_tdx_error(TDH_VP_FLUSH, arg.err, NULL);
> + }
> +}
> +
> +void tdx_hardware_disable(void)
> +{
> + int cpu = raw_smp_processor_id();
> + struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> + struct tdx_flush_vp_arg arg;
> + struct vcpu_tdx *tdx, *tmp;
> + unsigned long flags;
> +
> + lockdep_assert_preemption_disabled();
> +
> + local_irq_save(flags);
> + /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> + list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
> + arg.vcpu = &tdx->vcpu;
> + tdx_flush_vp(&arg);
> + }
> + local_irq_restore(flags);
> +}
> +
> static void tdx_do_tdh_phymem_cache_wb(void *unused)
> {
> u64 err = 0;
> @@ -233,26 +353,31 @@ static void tdx_do_tdh_phymem_cache_wb(void *unused)
> pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> }
>
> -void tdx_mmu_release_hkid(struct kvm *kvm)
> +static int __tdx_mmu_release_hkid(struct kvm *kvm)
> {
> bool packages_allocated, targets_allocated;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> cpumask_var_t packages, targets;
> + struct kvm_vcpu *vcpu;
> + unsigned long j;
> + int i, ret = 0;
> u64 err;
> - int i;
>
> if (!is_hkid_assigned(kvm_tdx))
> - return;
> + return 0;
>
> if (!is_td_created(kvm_tdx)) {
> tdx_hkid_free(kvm_tdx);
> - return;
> + return 0;
> }
>
> packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> cpus_read_lock();
>
> + kvm_for_each_vcpu(j, vcpu, kvm)
> + tdx_flush_vp_on_cpu(vcpu);
> +
> /*
> * We can destroy multiple guest TDs simultaneously. Prevent
> * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> @@ -270,6 +395,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> */
> write_lock(&kvm->mmu_lock);
>
> + err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
> + if (err == TDX_FLUSHVP_NOT_DONE) {
> + ret = -EBUSY;
> + goto out;
> + }
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
> + pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
> + kvm_tdx->hkid);
> + ret = -EIO;
> + goto out;
> + }
> +
> for_each_online_cpu(i) {
> if (packages_allocated &&
> cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -291,14 +429,24 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
> kvm_tdx->hkid);
> + ret = -EIO;
> } else
> tdx_hkid_free(kvm_tdx);
>
> +out:
> write_unlock(&kvm->mmu_lock);
> mutex_unlock(&tdx_lock);
> cpus_read_unlock();
> free_cpumask_var(targets);
> free_cpumask_var(packages);
> +
> + return ret;
> +}
> +
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> + ;
> }
>
> void tdx_vm_free(struct kvm *kvm)
> @@ -455,6 +603,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> + if (vcpu->cpu == cpu)
> + return;
> +
> + tdx_flush_vp_on_cpu(vcpu);
> +
> + local_irq_disable();
> + /*
> + * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
> + * vcpu->cpu is read before tdx->cpu_list.
> + */
> + smp_rmb();
> +
> + list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
> + local_irq_enable();
> +}
> +
> void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> {
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -495,6 +663,16 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> int i;
>
> + /*
> + * When destroying VM, kvm_unload_vcpu_mmu() calls vcpu_load() for every
> + * vcpu after they already disassociated from the per cpu list by
> + * tdx_mmu_release_hkid(). So we need to disassociate them again,
> + * otherwise the freed vcpu data will be accessed when do
> + * list_{del,add}() on associated_tdvcpus list later.
> + */
> + tdx_disassociate_vp_on_cpu(vcpu);
> + WARN_ON_ONCE(vcpu->cpu != -1);
> +
> /*
> * This methods can be called when vcpu allocation/initialization
> * failed. So it's possible that hkid, tdvpx and tdvpr are not assigned
> @@ -2030,6 +2208,10 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> return -EINVAL;
> }
>
> + /* tdx_hardware_disable() uses associated_tdvcpus. */
> + for_each_possible_cpu(i)
> + INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
> +
> for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
> /*
> * Here it checks if MSRs (tdx_uret_msrs) can be saved/restored
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 0d8a98feb58e..7f8c78f06508 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -73,6 +73,8 @@ struct vcpu_tdx {
> unsigned long *tdvpx_pa;
> bool td_vcpu_created;
>
> + struct list_head cpu_list;
> +
> union tdx_exit_reason exit_reason;
>
> bool initialized;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 9fd997c79c33..5853f29f0af3 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -137,6 +137,7 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
> #ifdef CONFIG_INTEL_TDX_HOST
> int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> void tdx_hardware_unsetup(void);
> +void tdx_hardware_disable(void);
> bool tdx_is_vm_type_supported(unsigned long type);
> int tdx_offline_cpu(void);
>
> @@ -153,6 +154,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
> void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> void tdx_vcpu_put(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
> u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -171,6 +173,7 @@ void tdx_post_memory_mapping(struct kvm_vcpu *vcpu,
> #else
> static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
> static inline void tdx_hardware_unsetup(void) {}
> +static inline void tdx_hardware_disable(void) {}
> static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
> static inline int tdx_offline_cpu(void) { return 0; }
>
> @@ -190,6 +193,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
> static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
> static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
> static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
>
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }


2024-04-07 16:53:14

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 088/130] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a flag, KVM_DEBUGREG_AUTO_SWITCHED_GUEST, to skip saving/restoring DRs
> irrespective of any other flags.

Here "irrespective of any other flags" sounds like other flags will be
ignored if KVM_DEBUGREG_AUTO_SWITCHED_GUEST is set, but the code below
doesn't match that reading.

> TDX-SEAM unconditionally saves and
> restores guest DRs and reset to architectural INIT state on TD exit.
> So, KVM needs to save host DRs before TD enter without restoring guest DRs
> and restore host DRs after TD exit.
>
> Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
>
> Reported-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Chao Gao <[email protected]>
> Signed-off-by: Chao Gao <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 10 ++++++++--
> arch/x86/kvm/vmx/tdx.c | 1 +
> arch/x86/kvm/x86.c | 11 ++++++++---
> 3 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3ab85c3d86ee..a9df898c6fbd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -610,8 +610,14 @@ struct kvm_pmu {
> struct kvm_pmu_ops;
>
> enum {
> - KVM_DEBUGREG_BP_ENABLED = 1,
> - KVM_DEBUGREG_WONT_EXIT = 2,
> + KVM_DEBUGREG_BP_ENABLED = BIT(0),
> + KVM_DEBUGREG_WONT_EXIT = BIT(1),
> + /*
> + * Guest debug registers (DR0-3 and DR6) are saved/restored by hardware
> + * on exit from or enter to guest. KVM needn't switch them. Because DR7
> + * is cleared on exit from guest, DR7 need to be saved/restored.
> + */
> + KVM_DEBUGREG_AUTO_SWITCH = BIT(2),
> };
>
> struct kvm_mtrr_range {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 7aa9188f384d..ab7403a19c5d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -586,6 +586,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>
> vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
>
> + vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
> vcpu->arch.cr0_guest_owned_bits = -1ul;
> vcpu->arch.cr4_guest_owned_bits = -1ul;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b189e86a1f1..fb7597c22f31 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11013,7 +11013,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> if (vcpu->arch.guest_fpu.xfd_err)
> wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
> - if (unlikely(vcpu->arch.switch_db_regs)) {
> + if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
> set_debugreg(0, 7);
> set_debugreg(vcpu->arch.eff_db[0], 0);
> set_debugreg(vcpu->arch.eff_db[1], 1);
> @@ -11059,6 +11059,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> */
> if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
> WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
> + WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
> static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
> kvm_update_dr0123(vcpu);
> kvm_update_dr7(vcpu);
> @@ -11071,8 +11072,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> * care about the messed up debug address registers. But if
> * we have some of them active, restore the old state.
> */
> - if (hw_breakpoint_active())
> - hw_breakpoint_restore();
> + if (hw_breakpoint_active()) {
> + if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
> + hw_breakpoint_restore();
> + else
> + set_debugreg(__this_cpu_read(cpu_dr7), 7);

According to the TDX module 1.5 ABI spec, DR0-3, DR6 and DR7 are set to
their architectural INIT values, so why is only DR7 restored?


I found a discussion about it in
https://lore.kernel.org/kvm/[email protected]/

Sean mentioned:
"
The TDX module context switches the guest _and_ host debug registers.
It restores the host DRs because it needs to write _something_ to hide
guest state, so it might as well restore the host values. The above was
an optimization to avoid rewriting all debug registers.
"

I have a question about "so it might as well restore the host values".
Was that a guess/assumption, or is it a fact?
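To make the flag semantics above concrete, the gating in vcpu_enter_guest() can be reduced to a small user-space model. The BIT() values mirror the patch; the predicate names are mine, not KVM's:

```c
#include <assert.h>

#define BIT(n) (1u << (n))

#define KVM_DEBUGREG_BP_ENABLED   BIT(0)
#define KVM_DEBUGREG_WONT_EXIT    BIT(1)
#define KVM_DEBUGREG_AUTO_SWITCH  BIT(2)

/*
 * Guest DR0-3/DR6 are loaded before entry only when some flag other
 * than AUTO_SWITCH is set, mirroring
 * "switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH" in vcpu_enter_guest().
 */
static int loads_guest_drs(unsigned int switch_db_regs)
{
	return (switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH) != 0;
}

/*
 * After exit, host DR0-3/DR6 are restored by KVM only when the
 * hardware (TDX module) did not already do it; in the AUTO_SWITCH
 * case only DR7 is rewritten, because it is cleared on exit.
 */
static int restores_host_drs(unsigned int switch_db_regs,
			     int hw_breakpoint_active)
{
	if (!hw_breakpoint_active)
		return 0;
	return !(switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
}
```

So AUTO_SWITCH alone skips both the guest-DR load and the full host restore, while any other flag still forces the guest-DR load — which is the behavior Binbin is questioning against the "irrespective of any other flags" wording.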



> + }
>
> vcpu->arch.last_vmentry_cpu = vcpu->cpu;
> vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());


2024-04-08 03:16:42

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c



On 3/29/2024 5:10 AM, Isaku Yamahata wrote:
> On Thu, Mar 28, 2024 at 04:12:36PM +0800,
> Chao Gao <[email protected]> wrote:
>
>>> }
>>>
>>> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>>> @@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>>> if (!vmx_needs_pi_wakeup(vcpu))
>>> return;
>>>
>>> - if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
>>> + if (kvm_vcpu_is_blocking(vcpu) &&
>>> + (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
>> Ditto.
>>
>> This looks incorrect to me. here we assume interrupt is always enabled for TD.
>> But on TDVMCALL(HLT), the guest tells KVM if hlt is called with interrupt
>> disabled. KVM can just check that interrupt status passed from the guest.
> That's true. We can complicate this function and HLT emulation. But I don't
> think it's worthwhile because HLT with interrupt masked is rare. Only for
> CPU online.
Then, it's better to add some comments?


2024-04-08 07:02:27

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 092/130] KVM: TDX: Implement interrupt injection



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata<[email protected]>
>
> TDX supports interrupt inject into vcpu with posted interrupt. Wire up the
> corresponding kvm x86 operations to posted interrupt. Move
> kvm_vcpu_trigger_posted_interrupt() from vmx.c to common.h to share the
> code.
>
> VMX can inject interrupt by setting interrupt information field,
> VM_ENTRY_INTR_INFO_FIELD, of VMCS. TDX supports interrupt injection only
> by posted interrupt. Ignore the execution path to access
> VM_ENTRY_INTR_INFO_FIELD.
>
> As cpu state is protected and apicv is enabled for the TDX guest, VMM can
> inject interrupt by updating posted interrupt descriptor. Treat interrupt
> can be injected always.
>
> Signed-off-by: Isaku Yamahata<[email protected]>
> Reviewed-by: Paolo Bonzini<[email protected]>
> ---
> arch/x86/kvm/vmx/common.h | 71 ++++++++++++++++++++++++++
> arch/x86/kvm/vmx/main.c | 93 ++++++++++++++++++++++++++++++----
> arch/x86/kvm/vmx/posted_intr.c | 2 +-
> arch/x86/kvm/vmx/posted_intr.h | 2 +
> arch/x86/kvm/vmx/tdx.c | 25 +++++++++
> arch/x86/kvm/vmx/vmx.c | 67 +-----------------------
> arch/x86/kvm/vmx/x86_ops.h | 7 ++-
> 7 files changed, 190 insertions(+), 77 deletions(-)
>
[...]
>
> +static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> +{
> + if (is_td_vcpu(vcpu))
> + return;

Please add a blank line.

> + vmx_set_interrupt_shadow(vcpu, mask);
> +}
> +
[...]
>
> @@ -848,6 +853,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>
> trace_kvm_entry(vcpu);
>
> + if (pi_test_on(&tdx->pi_desc)) {
> + apic->send_IPI_self(POSTED_INTR_VECTOR);
> +
> + kvm_wait_lapic_expire(vcpu);
As Chao pointed out, the APIC timer change shouldn't be included in this patch.

It may be better to put the split-out patch closer to
"KVM: x86: Assume timer IRQ was injected if APIC state is protected"
because they are related.

> + }
> +
> tdx_vcpu_enter_exit(tdx);
>
> tdx_user_return_update_cache(vcpu);
> @@ -1213,6 +1224,16 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
> }
>
[...]

2024-04-08 15:33:09

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 067/130] KVM: TDX: Add load_mmu_pgd method for TDX

On Sun, 2024-04-07 at 09:32 +0800, Binbin Wu wrote:
> > Looks good.  Some nits though:
> >
> > > KVM: TDX: Add load_mmu_pgd method for TDX
> > >
> > > TDX has uses two EPT pointers, one for the private half of the GPA
> > "TDX uses"
> >
> > > space and one for the shared half. The private half used the normal
> > "used" -> "uses"
> >
> > > EPT_POINTER vmcs field and is managed in a special way by the TDX module.
> > Perhaps add:
> >
> > KVM is not allowed to operate on the EPT_POINTER directly.
> >
> > > The shared half uses a new SHARED_EPT_POINTER field and will be managed by
> > > the conventional MMU management operations that operate directly on the
> > > EPT tables.
> > >
> > I would like to explicitly call out KVM can update SHARED_EPT_POINTER directly:
> >
> > The shared half uses a new SHARED_EPT_POINTER field.  KVM is allowed to set it
> > directly by the interface provided by the TDX module, and KVM is expected to
> > manage the shared half just like it manages the existing EPT page table today.
> >
> >
> > > This means for TDX the .load_mmu_pgd() operation will need to
> > > know to use the SHARED_EPT_POINTER field instead of the normal one. Add a
> > > new wrapper in x86 ops for load_mmu_pgd() that either directs the write to
> > > the existing vmx implementation or a TDX one.
> > >
> > > For the TDX operation, EPT will always be used, so it can simpy write to
>
>
> Maybe remove "so"?  IMO, there is no causal relationship between the
> first and second half of the sentence.
>

I was trying to nod at why tdx_load_mmu_pgd() is so much simpler than vmx_load_mmu_pgd(). Here is a
new version with all the feedback:

KVM: TDX: Add load_mmu_pgd method for TDX

TDX uses two EPT pointers, one for the private half of the GPA space and one for the shared half.
The private half uses the normal EPT_POINTER vmcs field, which is managed in a special way by the
TDX module. For TDX, KVM is not allowed to operate on it directly. The shared half uses a new
SHARED_EPT_POINTER field and will be managed by the conventional MMU management operations that
operate directly on the EPT root. This means for TDX the .load_mmu_pgd() operation will need to know
to use the SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in x86 ops for
load_mmu_pgd() that either directs the write to the existing vmx implementation or a TDX one.

For the TDX mode of operation, EPT will always be used and KVM does not need to be involved in
virtualization of CR3 behavior. So tdx_load_mmu_pgd() can simply write to SHARED_EPT_POINTER.

2024-04-08 18:58:06

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 108/130] KVM: TDX: Handle TDX PV HLT hypercall

On Sun, Apr 07, 2024 at 11:50:04AM +0800,
Chao Gao <[email protected]> wrote:

> >> > >+ union tdx_vcpu_state_details details;
> >> > >+ struct vcpu_tdx *tdx = to_tdx(vcpu);
> >> > >+
> >> > >+ if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> >> > >+ return true;
> >> >
> >> > Question: why mp_state matters here?
> >> > >+
> >> > >+ if (tdx->interrupt_disabled_hlt)
> >> > >+ return false;
> >> >
> >> > Shouldn't we move this into vt_interrupt_allowed()? VMX calls the function to
> >> > check if interrupt is disabled.
> >
> >Chao, are you suggesting to implement tdx_interrupt_allowed() as
> >"EXIT_REASON_HLT && a0" instead of "return true"?
> >I don't think it makes sense because it's rare case and we can't avoid spurious
> >wakeup for TDX case.
>
> Yes. KVM differeniates "interrupt allowed" from "has interrupt", e.g.,
>
> static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
> ...
>
> if (kvm_arch_interrupt_allowed(vcpu) &&
> (kvm_cpu_has_interrupt(vcpu) ||
> kvm_guest_apic_has_interrupt(vcpu)))
> return true;
>
>
> I think tdx_protected_apic_has_interrupt() mixes them together, which isn't
> good.

Your point is code clarity. OK, we can code it that way. I don't expect any
performance difference.
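Chao's suggested split — keep "interrupt allowed" separate from "has interrupt" — could look roughly like the model below. The helper names and struct are invented for illustration; in the series this logic would feed vt_interrupt_allowed() and the protected-APIC event check:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the state KVM tracks for a halted TD vCPU. */
struct toy_td_vcpu {
	bool halted_with_irq_disabled;  /* from the TDVMCALL(HLT) argument */
	bool pending_posted_intr;       /* posted-interrupt ON bit set */
};

/*
 * "Interrupt allowed": for a TD, KVM cannot read guest RFLAGS, so the
 * only interrupt-enable state it has is what the guest reported at the
 * HLT TDVMCALL.
 */
static bool toy_tdx_interrupt_allowed(const struct toy_td_vcpu *v)
{
	return !v->halted_with_irq_disabled;
}

/* "Has interrupt": purely whether something is pending. */
static bool toy_tdx_has_interrupt(const struct toy_td_vcpu *v)
{
	return v->pending_posted_intr;
}

/*
 * kvm_vcpu_has_events()-style combination: wake the vCPU only when an
 * interrupt is both pending and deliverable.
 */
static bool toy_td_vcpu_has_events(const struct toy_td_vcpu *v)
{
	return toy_tdx_interrupt_allowed(v) && toy_tdx_has_interrupt(v);
}
```

Keeping the two predicates separate matches the existing kvm_vcpu_has_events() structure, which is the clarity point being agreed to above.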
--
Isaku Yamahata <[email protected]>

2024-04-08 19:03:22

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> +static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
> +{
> +       const struct kvm_cpuid_entry2 *entry;
> +       u64 guest_supported_xcr0;
> +       u64 guest_supported_xss;
> +
> +       /* Setup td_params.xfam */
> +       entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
> +       if (entry)
> +               guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
> +       else
> +               guest_supported_xcr0 = 0;
> +       guest_supported_xcr0 &= kvm_caps.supported_xcr0;
> +
> +       entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
> +       if (entry)
> +               guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
> +       else
> +               guest_supported_xss = 0;
> +
> +       /*
> +        * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
> +        * and, CET support.
> +        */
> +       guest_supported_xss &=
> +               (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);

So this enables features based on XSS support in the passed CPUID, but these features do not depend
on xsave. You could have CET without xsave support, and in fact kernel IBT doesn't use it. Using
CPUID leafs to configure features while diverging from their HW meaning seems like asking for
trouble.
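The XFAM composition under discussion reduces to mask arithmetic, shown here as a user-space sketch. The mask constants below are illustrative placeholders, not the architectural bit assignments:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative placeholder bit assignments, not the architectural ones. */
#define XFEATURE_MASK_FP   (1ull << 0)
#define XFEATURE_MASK_SSE  (1ull << 1)
#define XFEATURE_MASK_PT   (1ull << 8)
#define TDX_TD_XFAM_CET    (3ull << 11)  /* CET_U | CET_S */

/*
 * Model of setup_tdparams_xfam(): clamp the userspace-provided XCR0
 * and XSS sets against what the host supports, with PT and CET
 * allowed for the TD regardless of host XSS support -- the behavior
 * Rick is questioning above.
 */
static uint64_t toy_compute_xfam(uint64_t guest_xcr0, uint64_t guest_xss,
				 uint64_t host_xcr0, uint64_t host_xss)
{
	guest_xcr0 &= host_xcr0;
	guest_xss &= host_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET;
	return guest_xcr0 | guest_xss;
}
```

The sketch makes the objection visible: CET and PT flow into XFAM purely because the corresponding bits appear in the CPUID 0xD sub-leafs, even on a host whose own XSS mask lacks them.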

> +
> +       td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> +       if (td_params->xfam & XFEATURE_MASK_LBR) {
> +               /*
> +                * TODO: once KVM supports LBR(save/restore LBR related
> +                * registers around TDENTER), remove this guard.
> +                */
> +#define MSG_LBR        "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH
> properly.\n"
> +               pr_warn(MSG_LBR);
> +               return -EOPNOTSUPP;
> +       }
> +
> +       return 0;
> +}
> +
> +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> +                       struct kvm_tdx_init_vm *init_vm)
> +{
> +       struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
> +       int ret;
> +
> +       if (kvm->created_vcpus)
> +               return -EBUSY;
> +
> +       if (init_vm->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> +               /*
> +                * TODO: save/restore PMU related registers around TDENTER.
> +                * Once it's done, remove this guard.
> +                */
> +#define MSG_PERFMON    "TD doesn't support perfmon yet. KVM needs to save/restore host perf
> registers properly.\n"
> +               pr_warn(MSG_PERFMON);

We need to remove the TODOs and a warn doesn't seem appropriate.

> +               return -EOPNOTSUPP;
> +       }
> +
> +       td_params->max_vcpus = kvm->max_vcpus;
> +       td_params->attributes = init_vm->attributes;

Don't we need to sanitize this against a set of features known to KVM? For example, what if
something like TDX_TD_ATTRIBUTE_PERFMON is added to a future TDX module, and suddenly userspace can
configure it?

So XFAM is how to control features that are tied to xsave (CET, etc.), and ATTRIBUTES are tied to
features without xsave support (PKS, etc.).

If we are going to use CPUID for specifying which features should get enabled in the TDX module, we
should match the arch definitions of the leafs. For things like CET, where XFAM controls the value
of multiple CPUID leafs, we should check that they are all set to consistent values and otherwise
reject them. So for CET we would need to check the SHSTK and IBT bits, as well as the two XCR0
bits.

If we are going to do that for XFAM based features, then why not do the same for ATTRIBUTE based
features?

We would need something like GET_SUPPORTED_CPUID for TDX, and since some features can be forced on,
we would also need to expose something like GET_SUPPORTED_CPUID_REQUIRED.
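The sanitization being asked for amounts to an allowlist check before copying the attributes into td_params. The mask name and bit values below are hypothetical, for illustration only:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical attribute bits, for illustration only. */
#define TOY_TD_ATTRIBUTE_DEBUG    (1ull << 0)
#define TOY_TD_ATTRIBUTE_PKS      (1ull << 30)
#define TOY_TD_ATTRIBUTE_PERFMON  (1ull << 63)

/* Hypothetical allowlist of attributes KVM knows how to virtualize. */
#define TOY_KVM_SUPPORTED_TD_ATTRS \
	(TOY_TD_ATTRIBUTE_DEBUG | TOY_TD_ATTRIBUTE_PKS)

/*
 * Reject any attribute KVM doesn't know about, so a future TDX module
 * cannot silently expose new features to userspace.
 */
static int toy_check_td_attributes(uint64_t attrs)
{
	if (attrs & ~TOY_KVM_SUPPORTED_TD_ATTRS)
		return -EINVAL;
	return 0;
}
```

With a check like this, a new bit such as a future PERFMON-style attribute fails closed until KVM explicitly adds it to the allowlist.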

> +       td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD;
> +       td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);

2024-04-09 00:38:40

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> +struct tdx_enabled {
> + cpumask_var_t enabled;
> + atomic_t err;
> +};
> +
> +static void __init tdx_on(void *_enable)
> +{
> + struct tdx_enabled *enable = _enable;
> + int r;
> +
> + r = vmx_hardware_enable();
> + if (!r) {
> + cpumask_set_cpu(smp_processor_id(), enable->enabled);
> + r = tdx_cpu_enable();
> + }
> + if (r)
> + atomic_set(&enable->err, r);
> +}
> +
> +static void __init vmx_off(void *_enabled)
> +{
> + cpumask_var_t *enabled = (cpumask_var_t *)_enabled;
> +
> + if (cpumask_test_cpu(smp_processor_id(), *enabled))
> + vmx_hardware_disable();
> +}
> +
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> + struct tdx_enabled enable = {
> + .err = ATOMIC_INIT(0),
> + };
> + int r = 0;
> +
> + if (!enable_ept) {
> + pr_warn("Cannot enable TDX with EPT disabled\n");
> + return -EINVAL;
> + }
> +
> + if (!zalloc_cpumask_var(&enable.enabled, GFP_KERNEL)) {
> + r = -ENOMEM;
> + goto out;
> + }
> +
> + /* tdx_enable() in tdx_module_setup() requires cpus lock. */
> + cpus_read_lock();
> + on_each_cpu(tdx_on, &enable, true); /* TDX requires vmxon. */
> + r = atomic_read(&enable.err);
> + if (!r)
> + r = tdx_module_setup();
> + else
> + r = -EIO;

I was wondering why we need to convert to -EIO.

Converting the error code to -EIO unconditionally causes the original
error code to be lost. Although @enable.err can be imprecise anyway,
given that tdx_on() is called on all online cpus in parallel, doing the
explicit conversion _here_ doesn't seem quite right.

I think it would be more reasonable to explicitly set the error code to
-EIO in tdx_on(), where we know _exactly_ what went wrong and can still
possibly do something before losing the error code.

E.g., we could dump the error code to the user, but it looks like both
vmx_hardware_enable() and tdx_cpu_enable() already do so, so we can
safely lose the error code there.

We can perhaps add a comment to point this out before losing the error
code, if that's better:

/*
 * Both vmx_hardware_enable() and tdx_cpu_enable() print an error
 * message when they fail. Just convert the error code to -EIO;
 * when multiple cpus fault, @err cannot be used to precisely
 * record the error code for them anyway.
 */
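The suggestion — record -EIO at the point of failure rather than after the fact — can be modeled as below. This is a sketch only: the atomic_t is replaced by a plain int since the model is single-threaded, and the two return codes are passed in instead of calling the real vmx_hardware_enable()/tdx_cpu_enable():

```c
#include <assert.h>
#include <errno.h>

struct toy_enable {
	int err;  /* models the atomic_t err in struct tdx_enabled */
};

/*
 * Models the per-CPU callback: on failure, record -EIO right here,
 * where the precise error has already been printed by
 * vmx_hardware_enable()/tdx_cpu_enable(), so the caller no longer
 * needs to convert the aggregated value.
 */
static void toy_tdx_on(struct toy_enable *enable, int vmxon_ret, int tdx_ret)
{
	int r = vmxon_ret;

	if (!r)
		r = tdx_ret;
	if (r)
		enable->err = -EIO;
}
```

The caller then reads enable->err as-is, with no "r = -EIO" conversion left in tdx_hardware_setup().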

> + on_each_cpu(vmx_off, &enable.enabled, true);
> + cpus_read_unlock();
> + free_cpumask_var(enable.enabled);
> +
> +out:
> + return r;
> +}

2024-04-09 14:53:09

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 096/130] KVM: VMX: Move NMI/exception handler to common helper



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX mostly handles NMI/exception exit mostly the same to VMX case. The
> difference is how to retrieve exit qualification. To share the code with
> TDX, move NMI/exception to a common header, common.h.

Suggest to add "No functional change intended." in the changelog.

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/common.h | 59 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 68 +++++----------------------------------
> 2 files changed, 67 insertions(+), 60 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 6f21d0d48809..632af7a76d0a 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -4,8 +4,67 @@
>
> #include <linux/kvm_host.h>
>
> +#include <asm/traps.h>
> +
> #include "posted_intr.h"
> #include "mmu.h"
> +#include "vmcs.h"
> +#include "x86.h"
> +
> +extern unsigned long vmx_host_idt_base;
> +void vmx_do_interrupt_irqoff(unsigned long entry);
> +void vmx_do_nmi_irqoff(void);
> +
> +static inline void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * Save xfd_err to guest_fpu before interrupt is enabled, so the
> + * MSR value is not clobbered by the host activity before the guest
> + * has chance to consume it.
> + *
> + * Do not blindly read xfd_err here, since this exception might
> + * be caused by L1 interception on a platform which doesn't
> + * support xfd at all.
> + *
> + * Do it conditionally upon guest_fpu::xfd. xfd_err matters
> + * only when xfd contains a non-zero value.
> + *
> + * Queuing exception is done in vmx_handle_exit. See comment there.
> + */
> + if (vcpu->arch.guest_fpu.fpstate->xfd)
> + rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> +}
> +
> +static inline void vmx_handle_exception_irqoff(struct kvm_vcpu *vcpu,
> + u32 intr_info)
> +{
> + /* if exit due to PF check for async PF */
> + if (is_page_fault(intr_info))
> + vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> + /* if exit due to NM, handle before interrupts are enabled */
> + else if (is_nm_fault(intr_info))
> + vmx_handle_nm_fault_irqoff(vcpu);
> + /* Handle machine checks before interrupts are enabled */
> + else if (is_machine_check(intr_info))
> + kvm_machine_check();
> +}
> +
> +static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> + u32 intr_info)
> +{
> + unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> + gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
> +
> + if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
> + "unexpected VM-Exit interrupt info: 0x%x", intr_info))
> + return;
> +
> + kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> + vmx_do_interrupt_irqoff(gate_offset(desc));
> + kvm_after_interrupt(vcpu);
> +
> + vcpu->arch.at_instruction_boundary = true;
> +}
>
> static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> unsigned long exit_qualification)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 29d891e0795e..f8a00a766c40 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -518,7 +518,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
> vmx->segment_cache.bitmask = 0;
> }
>
> -static unsigned long host_idt_base;
> +unsigned long vmx_host_idt_base;
>
> #if IS_ENABLED(CONFIG_HYPERV)
> static bool __read_mostly enlightened_vmcs = true;
> @@ -4273,7 +4273,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
> vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
> vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */
>
> - vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */
> + vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base); /* 22.2.4 */
>
> vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */
>
> @@ -5166,7 +5166,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> intr_info = vmx_get_intr_info(vcpu);
>
> /*
> - * Machine checks are handled by handle_exception_irqoff(), or by
> + * Machine checks are handled by vmx_handle_exception_irqoff(), or by
> * vmx_vcpu_run() if a #MC occurs on VM-Entry. NMIs are handled by
> * vmx_vcpu_enter_exit().
> */
> @@ -5174,7 +5174,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> return 1;
>
> /*
> - * Queue the exception here instead of in handle_nm_fault_irqoff().
> + * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
> * This ensures the nested_vmx check is not skipped so vmexit can
> * be reflected to L1 (when it intercepts #NM) before reaching this
> * point.
> @@ -6889,59 +6889,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
> }
>
> -void vmx_do_interrupt_irqoff(unsigned long entry);
> -void vmx_do_nmi_irqoff(void);
> -
> -static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> -{
> - /*
> - * Save xfd_err to guest_fpu before interrupt is enabled, so the
> - * MSR value is not clobbered by the host activity before the guest
> - * has chance to consume it.
> - *
> - * Do not blindly read xfd_err here, since this exception might
> - * be caused by L1 interception on a platform which doesn't
> - * support xfd at all.
> - *
> - * Do it conditionally upon guest_fpu::xfd. xfd_err matters
> - * only when xfd contains a non-zero value.
> - *
> - * Queuing exception is done in vmx_handle_exit. See comment there.
> - */
> - if (vcpu->arch.guest_fpu.fpstate->xfd)
> - rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> -}
> -
> -static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> -{
> - /* if exit due to PF check for async PF */
> - if (is_page_fault(intr_info))
> - vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> - /* if exit due to NM, handle before interrupts are enabled */
> - else if (is_nm_fault(intr_info))
> - handle_nm_fault_irqoff(vcpu);
> - /* Handle machine checks before interrupts are enabled */
> - else if (is_machine_check(intr_info))
> - kvm_machine_check();
> -}
> -
> -static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> - u32 intr_info)
> -{
> - unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> - gate_desc *desc = (gate_desc *)host_idt_base + vector;
> -
> - if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
> - "unexpected VM-Exit interrupt info: 0x%x", intr_info))
> - return;
> -
> - kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> - vmx_do_interrupt_irqoff(gate_offset(desc));
> - kvm_after_interrupt(vcpu);
> -
> - vcpu->arch.at_instruction_boundary = true;
> -}
> -
> void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -6950,9 +6897,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> return;
>
> if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> - handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
> + vmx_handle_external_interrupt_irqoff(vcpu,
> + vmx_get_intr_info(vcpu));
> else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
> - handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
> + vmx_handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
> }
>
> /*
> @@ -8284,7 +8232,7 @@ __init int vmx_hardware_setup(void)
> int r;
>
> store_idt(&dt);
> - host_idt_base = dt.address;
> + vmx_host_idt_base = dt.address;
>
> vmx_setup_user_return_msrs();
>


2024-04-09 15:38:41

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 098/130] KVM: TDX: Add a place holder to handle TDX VM exit



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire up handle_exit and handle_exit_irqoff methods

This patch also wires up get_exit_info.

> and add a place holder
> to handle VM exit. Add helper functions to get exit info, exit
> qualification, etc.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 37 ++++++++++++-
> arch/x86/kvm/vmx/tdx.c | 110 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 10 ++++
> 3 files changed, 154 insertions(+), 3 deletions(-)
>
[...]
> @@ -562,7 +593,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .vcpu_pre_run = vt_vcpu_pre_run,
> .vcpu_run = vt_vcpu_run,
> - .handle_exit = vmx_handle_exit,
> + .handle_exit = vt_handle_exit,
> .skip_emulated_instruction = vmx_skip_emulated_instruction,
> .update_emulated_instruction = vmx_update_emulated_instruction,
> .set_interrupt_shadow = vt_set_interrupt_shadow,
> @@ -597,7 +628,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .set_identity_map_addr = vmx_set_identity_map_addr,
> .get_mt_mask = vt_get_mt_mask,
>
> - .get_exit_info = vmx_get_exit_info,
> + .get_exit_info = vt_get_exit_info,
>
> .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
>
> @@ -611,7 +642,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .load_mmu_pgd = vt_load_mmu_pgd,
>
> .check_intercept = vmx_check_intercept,
> - .handle_exit_irqoff = vmx_handle_exit_irqoff,
> + .handle_exit_irqoff = vt_handle_exit_irqoff,
>
> .request_immediate_exit = vt_request_immediate_exit,
>
[...]
>
> +int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> +{
> + union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> +
> + /* See the comment of tdh_sept_seamcall(). */

Should be tdx_seamcall_sept().

> + if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)))

Can use "TDX_ERROR_SEPT_BUSY" instead.

> + return 1;
> +
> + /*
> + * TDH.VP.ENTRY

"TDH.VP.ENTRY" -> "TDH.VP.ENTER"

> checks TD EPOCH which contend with TDH.MEM.TRACK and
> + * vcpu TDH.VP.ENTER.
Do you mean TDH.VP.ENTER on one vcpu can contend with TDH.MEM.TRACK and
TDH.VP.ENTER on another vcpu?

> + */
>

2024-04-09 16:43:47

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 097/130] KVM: x86: Split core of hypercall emulation to helper function



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> By necessity, TDX will use a different register ABI for hypercalls.
> Break out the core functionality so that it may be reused for TDX.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +++
> arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++-----------
> 2 files changed, 42 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e0ffef1d377d..bb8be091f996 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2177,6 +2177,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
> kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
> }
>
> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> + unsigned long a0, unsigned long a1,
> + unsigned long a2, unsigned long a3,
> + int op_64_bit, int cpl);
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>
> int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fb7597c22f31..03950368d8db 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10073,26 +10073,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> -int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> + unsigned long a0, unsigned long a1,
> + unsigned long a2, unsigned long a3,
> + int op_64_bit, int cpl)
> {
> - unsigned long nr, a0, a1, a2, a3, ret;
> - int op_64_bit;
> -
> - if (kvm_xen_hypercall_enabled(vcpu->kvm))
> - return kvm_xen_hypercall(vcpu);
> -
> - if (kvm_hv_hypercall_enabled(vcpu))
> - return kvm_hv_hypercall(vcpu);
> -
> - nr = kvm_rax_read(vcpu);
> - a0 = kvm_rbx_read(vcpu);
> - a1 = kvm_rcx_read(vcpu);
> - a2 = kvm_rdx_read(vcpu);
> - a3 = kvm_rsi_read(vcpu);
> + unsigned long ret;
>
> trace_kvm_hypercall(nr, a0, a1, a2, a3);
>
> - op_64_bit = is_64_bit_hypercall(vcpu);
> if (!op_64_bit) {
> nr &= 0xFFFFFFFF;
> a0 &= 0xFFFFFFFF;
> @@ -10101,7 +10090,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> a3 &= 0xFFFFFFFF;
> }
>
> - if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> + if (cpl) {
> ret = -KVM_EPERM;
> goto out;
> }
> @@ -10162,18 +10151,49 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>
> WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> vcpu->arch.complete_userspace_io = complete_hypercall_exit;
> + /* stat is incremented on completion. */
> return 0;
> }
> default:
> ret = -KVM_ENOSYS;
> break;
> }
> +
> out:
> + ++vcpu->stat.hypercalls;
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> +
> +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> +{
> + unsigned long nr, a0, a1, a2, a3, ret;
> + int op_64_bit;

Can it be opportunistically changed to bool type, as well as the
argument type of "op_64_bit" in __kvm_emulate_hypercall()?

> + int cpl;
> +
> + if (kvm_xen_hypercall_enabled(vcpu->kvm))
> + return kvm_xen_hypercall(vcpu);
> +
> + if (kvm_hv_hypercall_enabled(vcpu))
> + return kvm_hv_hypercall(vcpu);
> +
> + nr = kvm_rax_read(vcpu);
> + a0 = kvm_rbx_read(vcpu);
> + a1 = kvm_rcx_read(vcpu);
> + a2 = kvm_rdx_read(vcpu);
> + a3 = kvm_rsi_read(vcpu);
> + op_64_bit = is_64_bit_hypercall(vcpu);
> + cpl = static_call(kvm_x86_get_cpl)(vcpu);
> +
> + ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> + if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> + /* MAP_GPA tosses the request to the user space. */
> + return 0;
> +
> if (!op_64_bit)
> ret = (u32)ret;
> kvm_rax_write(vcpu, ret);
>
> - ++vcpu->stat.hypercalls;
> return kvm_skip_emulated_instruction(vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);


2024-04-09 20:56:35

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 100/130] KVM: TDX: handle EXIT_REASON_OTHER_SMI



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> If the control reaches EXIT_REASON_OTHER_SMI, #SMI is delivered and
> handled right after returning from the TDX module to KVM

need a "," here

> nothing needs to
> be done in KVM. Continue TDX vcpu execution.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/uapi/asm/vmx.h | 1 +
> arch/x86/kvm/vmx/tdx.c | 7 +++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
> index a5faf6d88f1b..b3a30ef3efdd 100644
> --- a/arch/x86/include/uapi/asm/vmx.h
> +++ b/arch/x86/include/uapi/asm/vmx.h
> @@ -34,6 +34,7 @@
> #define EXIT_REASON_TRIPLE_FAULT 2
> #define EXIT_REASON_INIT_SIGNAL 3
> #define EXIT_REASON_SIPI_SIGNAL 4
> +#define EXIT_REASON_OTHER_SMI 6
What does "OTHER" mean in this macro?

>
> #define EXIT_REASON_INTERRUPT_WINDOW 7
> #define EXIT_REASON_NMI_WINDOW 8
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cba0fd5029be..2f68e6f2b53a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1345,6 +1345,13 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>
> switch (exit_reason.basic) {
> + case EXIT_REASON_OTHER_SMI:
> + /*
> + * If reach here, it's not a Machine Check System Management
> + * Interrupt(MSMI).
Since it's the first patch that mentions MSMI, maybe some description
about it in the changelog can make it easier to understand.


> #SMI is delivered and handled right after
> + * SEAMRET, nothing needs to be done in KVM.
> + */
> + return 1;
> default:
> break;
> }


2024-04-10 13:00:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> So my feedback is to not worry about the exports, and instead focus on figuring
> out a way to make the generated code less bloated and easier to read/debug.

I think it was a mistake trying to centralize TDCALL/SEAMCALL calls into a
few megawrappers. I think we can get better results by shifting the leaf
function wrappers into assembly.

We are going to have more assembly, but it should produce a better result.
Adding macros can help to write such wrappers and minimize boilerplate.

Below is an example of what it can look like. It's not complete: I only
converted TDCALLs, not TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
more complex.

Any opinions? Is it something worth investing more time?

.set offset_rcx, TDX_MODULE_rcx
.set offset_rdx, TDX_MODULE_rdx
.set offset_r8, TDX_MODULE_r8
.set offset_r9, TDX_MODULE_r9
.set offset_r10, TDX_MODULE_r10
.set offset_r11, TDX_MODULE_r11

.macro save_output struct_reg regs:vararg
.irp reg,\regs
movq %\reg, offset_\reg(%\struct_reg)
.endr
.endm

.macro tdcall leaf
movq \leaf, %rax
.byte 0x66,0x0f,0x01,0xcc
.endm

.macro tdcall_or_panic leaf
tdcall \leaf
testq %rax, %rax
jnz .Lpanic
.endm

SYM_FUNC_START(tdg_vm_rd)
FRAME_BEGIN

xorl %ecx, %ecx
movq %rdi, %rdx

tdcall_or_panic $TDG_VM_RD

movq %r8, %rax

RET
FRAME_END
SYM_FUNC_END(tdg_vm_rd)

SYM_FUNC_START(tdg_vm_wr)
FRAME_BEGIN

xorl %ecx, %ecx
movq %rsi, %r8
movq %rdx, %r9
movq %rdi, %rdx

tdcall_or_panic $TDG_VM_WR

/* Old value */
movq %r8, %rax

RET
FRAME_END
SYM_FUNC_END(tdg_vm_wr)

SYM_FUNC_START(tdcs_ctls_set)
FRAME_BEGIN

movq $TDCS_TD_CTLS, %rdx
xorl %ecx, %ecx
movq %rdi, %r8
movq %rdi, %r9

tdcall $TDG_VM_WR

testq %rax, %rax
setz %al

RET
FRAME_END
SYM_FUNC_END(tdcs_ctls_set)

SYM_FUNC_START(tdg_sys_rd)
FRAME_BEGIN

xorl %ecx, %ecx
movq %rdi, %rdx

tdcall_or_panic $TDG_SYS_RD

movq %r8, %rax

RET
FRAME_END
SYM_FUNC_END(tdg_sys_rd)

SYM_FUNC_START(tdg_vp_veinfo_get)
FRAME_BEGIN

tdcall_or_panic $TDG_VP_VEINFO_GET

save_output struct_reg=rdi regs=rcx,rdx,r8,r9,r10

FRAME_END
RET
SYM_FUNC_END(tdg_vp_veinfo_get)

SYM_FUNC_START(tdg_vp_info)
FRAME_BEGIN

tdcall_or_panic $TDG_VP_INFO

save_output struct_reg=rdi regs=rcx,rdx,r8,r9,r10,r11

FRAME_END
RET
SYM_FUNC_END(tdg_vp_info)

SYM_FUNC_START(tdg_mem_page_accept)
FRAME_BEGIN

movq %rdi, %rcx

tdcall $TDG_MEM_PAGE_ACCEPT

FRAME_END
RET
SYM_FUNC_END(tdg_mem_page_accept)

SYM_FUNC_START(tdg_mr_report)
FRAME_BEGIN

movq %rdx, %r8
movq %rdi, %rcx
movq %rsi, %rdx

tdcall $TDG_MR_REPORT

FRAME_END
RET
SYM_FUNC_END(tdg_mr_report)

.Lpanic:
ud2
--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-10 13:20:58

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, 2024-03-22 at 14:23 -0700, Isaku Yamahata wrote:
> > > + r = atomic_read(&enable.err);
> > > + if (!r)
> > > + r = tdx_module_setup();
> > > + else
> > > + r = -EIO;
> > > + on_each_cpu(vmx_off, &enable.enabled, true);
> > > + cpus_read_unlock();
> > > + free_cpumask_var(enable.enabled);
> > > +
> > > +out:
> > > + return r;
> > > +}
> >
> > At last, I think there's one problem here:
> >
> > KVM actually only registers CPU hotplug callback in kvm_init(), which happens
> > way after tdx_hardware_setup().
> >
> > What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
> > kvm_init()?
> >
> > Looks we have two options:
> >
> > 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
> > 2) we need to disable CPU hotplug until callbacks have been registered.
> >
> > Perhaps the second one is easier, because for the first one we need to make sure
> > the kvm_cpu_online() is ready to be called right after tdx_hardware_setup().
> >
> > And no one cares if CPU hotplug is disabled during KVM module loading.
> >
> > That being said, we can even just disable CPU hotplug during the entire
> > vt_init(), if in this way the code change is simple?
> >
> > But anyway, to make this patch complete, I think you need to replace
> > vmx_hardware_enable() to vt_hardware_enable() and do tdx_cpu_enable() to handle
> > TDX vs CPU hotplug in _this_ patch.
>
> The option 2 sounds easier. But hardware_enable() doesn't help because it's
> called when the first guest is created. It's risky to change its semantics
> because it's an arch-independent callback.
>
> - Disable CPU hot plug during TDX module initialization.

As we talked, it turns out it is problematic to do so, because cpus_read_lock()
is also called by some other internal functions like static_call_update(). If
we take cpus_read_lock() for the entire vt_init() then we will have nested
cpus_read_lock().

> - During hardware_setup(), enable VMX, tdx_cpu_enable(), disable VMX
>   on online cpu. Don't rely on KVM hooks.
> - Add a new arch-independent hook, int kvm_arch_online_cpu(). It's always
>   called on CPU onlining. It eventually calls tdx_cpu_enable(). If it fails,
>   refuse onlining.

So the purpose of kvm_arch_online_cpu() is to always do "VMXON +
tdx_cpu_enable() + VMXOFF" _regardless_ of the kvm_usage_count, so that we can
make sure that:

When TDX is enabled by KVM, all online cpus are TDX-capable (have done
tdx_cpu_enable() successfully).

And the code will be like:

static int kvm_online_cpu(unsigned int cpu)
{
	mutex_lock(&kvm_lock);
	ret = kvm_arch_online_cpu(cpu);
	if (!ret && kvm_usage_count)
		ret = __hardware_enable_nolock();
	mutex_unlock(&kvm_lock);
}

This will need another kvm_x86_ops->online_cpu() where we can implement the TDX
specific "VMXON + tdx_cpu_enable() + VMXOFF":

int kvm_arch_online_cpu(unsigned int cpu)
{
	return static_call(kvm_x86_online_cpu)(cpu);
}

Somehow I don't quite like this because: 1) it introduces a new
kvm_x86_ops->online_cpu(); 2) it's a little bit silly to do "VMXON +
tdx_cpu_enable() + VMXOFF" just for TDX and then immediately do VMXON when
there's KVM usage.

And IIUC, it will NOT work if kvm_online_cpu() happens when kvm_usage_count > 0:
VMXON has actually already been done on this cpu, so that the "VMXON" before
tdx_cpu_enable() will fail. Probably this can be addressed somehow, but still
doesn't seem nice.

So the above "option 1" doesn't seem right to me.

After thinking again, I think we have been too nervous about "losing
CPU hotplug between tdx_enable() and kvm_init(), and when there's no KVM
usage".

Instead, I think it's acceptable that we don't do tdx_cpu_enable() for a
new CPU when it is hotplugged in the above two cases.

We just need to guarantee all online cpus are TDX capable "when there's
real KVM usage, i.e., there's real VM running".

So "option 2":

I believe we just need to do tdx_cpu_enable() in vt_hardware_enable()
after vmx_hardware_enable():

1) When the first VM is created, KVM will try to do tdx_cpu_enable() for
those CPUs that becomes online after tdx_enable(), and if any
tdx_cpu_enabled() fails, the VM will not be created. Otherwise, all
online cpus are TDX-capable.

2) When there's real VM running, and when a new CPU can successfully
become online, it must be TDX-capable.

Failure of tdx_cpu_enable() in 2) is obviously fine.

The consequence of failing tdx_cpu_enable() in 1) is that, besides the
VM not being created, there might be some online CPUs that are not
TDX-capable when TDX is marked as enabled.

It's fine from KVM's perspective, because literally no VM is running.

From the host kernel's perspective, the only tricky thing is the #MC handler.
It tries to use SEAMCALL to read the faulty page's status to determine
whether it is a TDX private page. If #MC happens on those
non-TDX-capable cpus, then the SEAMCALL will fail. But that is also
fine as we don't need a precise result anyway.

Option 3:

We still want to make sure our goal:

When TDX is enabled by KVM, all online cpus are TDX-capable.

For that, we can register an *additional* TDX specific CPU hotplug callback
right after tdx_enable() to handle any CPU hotplug "between tdx_enable() and
kvm_init(), and when there's no KVM usage".

Specifically, we can use dynamically allocated CPU hotplug state to avoid having
to hard-code another KVM-TDX-specific CPU hotplug callback state:

r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_DYN, "kvm/cpu/tdx:online",
			      tdx_online_cpu, NULL);

In tdx_online_cpu(), we do tdx_cpu_enable() if enable_tdx is true.

The problem is if tdx_online_cpu() happens when there's already KVM usage, the
kvm_online_cpu() has already done VMXON. So tdx_online_cpu() will need to grab
the @kvm_lock mutex and check @kvm_usage_count to determine whether to do VMXON
before tdx_cpu_enable().

However, both the @kvm_lock mutex and @kvm_usage_count are in kvm.ko, and it's
not nice to export such a low-level thing to kvm-intel.ko for TDX.

Option 4:

To avoid exporting @kvm_lock and @kvm_usage_count, we can still register the
TDX-specific CPU hotplug callback, but choose to do "unconditional
tdx_cpu_enable()" w/o doing VMXON in tdx_online_cpu(). If that fails due to
VMXON hasn't been done, let it fail. This basically means:

When TDX is enabled, KVM can only online a CPU when there's a running VM.

To summarize:

I think "option 2" should be the best solution for now. It's easy to implement
and yet has no real issue.

Any comments?

2024-04-10 15:30:19

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Wed, Apr 10, 2024, Kai Huang wrote:
> On Fri, 2024-03-22 at 14:23 -0700, Isaku Yamahata wrote:
> > > > + r = atomic_read(&enable.err);
> > > > + if (!r)
> > > > + r = tdx_module_setup();
> > > > + else
> > > > + r = -EIO;
> > > > + on_each_cpu(vmx_off, &enable.enabled, true);
> > > > + cpus_read_unlock();
> > > > + free_cpumask_var(enable.enabled);
> > > > +
> > > > +out:
> > > > + return r;
> > > > +}
> > >
> > > At last, I think there's one problem here:
> > >
> > > KVM actually only registers CPU hotplug callback in kvm_init(), which happens
> > > way after tdx_hardware_setup().
> > >
> > > What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
> > > kvm_init()?
> > >
> > > Looks we have two options:
> > >
> > > 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
> > > 2) we need to disable CPU hotplug until callbacks have been registered.

This is all so dumb (not TDX, the current state of KVM). All of the hardware
enabling crud is pointlessly complex, inherited from misguided, decade-old
paranoia that led to the decision to enable VMX if and only if VMs are running.
Enabling VMX doesn't make the system less secure, and the insane dances we are
doing to do VMXON on-demand make everything *more* fragile.

And all of this complexity really was driven by VMX; enabling virtualization for
every other vendor, including AMD/SVM, is completely uninteresting. Forcing other
architectures/vendors to take on yet more complexity doesn't make any sense.

Barely tested, and other architectures would need to be converted, but I don't
see any obvious reasons why we can't simply enable virtualization when the module
is loaded.

The diffstat pretty much says it all.

---
Documentation/virt/kvm/locking.rst | 4 -
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/kvm/svm/svm.c | 5 +-
arch/x86/kvm/vmx/vmx.c | 18 ++-
arch/x86/kvm/x86.c | 22 ++--
include/linux/kvm_host.h | 2 +
virt/kvm/kvm_main.c | 181 +++++++----------------------
7 files changed, 67 insertions(+), 168 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 02880d5552d5..0d6eff13fd46 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -227,10 +227,6 @@ time it will be set using the Dirty tracking mechanism described above.
:Type: mutex
:Arch: any
:Protects: - vm_list
- - kvm_usage_count
- - hardware virtualization enable/disable
-:Comment: KVM also disables CPU hotplug via cpus_read_lock() during
- enable/disable.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 73740d698ebe..7422239987d8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
#include <asm/kvm_page_track.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/hyperv-tlfs.h>
+#include <asm/reboot.h>

#define __KVM_HAVE_ARCH_VCPU_DEBUGFS

@@ -1605,6 +1606,8 @@ struct kvm_x86_ops {

int (*hardware_enable)(void);
void (*hardware_disable)(void);
+ cpu_emergency_virt_cb *emergency_disable;
+
void (*hardware_unsetup)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9aaf83c8d57d..7e118284934c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4917,6 +4917,7 @@ static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = KBUILD_MODNAME,

+ .emergency_disable = svm_emergency_disable,
.check_processor_compatibility = svm_check_processor_compat,

.hardware_unsetup = svm_hardware_unsetup,
@@ -5348,8 +5349,6 @@ static struct kvm_x86_init_ops svm_init_ops __initdata = {
static void __svm_exit(void)
{
kvm_x86_vendor_exit();
-
- cpu_emergency_unregister_virt_callback(svm_emergency_disable);
}

static int __init svm_init(void)
@@ -5365,8 +5364,6 @@ static int __init svm_init(void)
if (r)
return r;

- cpu_emergency_register_virt_callback(svm_emergency_disable);
-
/*
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d18dcb1e11a6..0dbe74da7ee3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8320,6 +8320,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {

.hardware_enable = vmx_hardware_enable,
.hardware_disable = vmx_hardware_disable,
+ .emergency_disable = vmx_emergency_disable,
+
.has_emulated_msr = vmx_has_emulated_msr,

.vm_size = sizeof(struct kvm_vmx),
@@ -8733,8 +8735,6 @@ static void __vmx_exit(void)
{
allow_smaller_maxphyaddr = false;

- cpu_emergency_unregister_virt_callback(vmx_emergency_disable);
-
vmx_cleanup_l1d_flush();
}

@@ -8760,6 +8760,12 @@ static int __init vmx_init(void)
*/
hv_init_evmcs();

+ for_each_possible_cpu(cpu) {
+ INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
+ pi_init_cpu(cpu);
+ }
+
r = kvm_x86_vendor_init(&vmx_init_ops);
if (r)
return r;
@@ -8775,14 +8781,6 @@ static int __init vmx_init(void)
if (r)
goto err_l1d_flush;

- for_each_possible_cpu(cpu) {
- INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
-
- pi_init_cpu(cpu);
- }
-
- cpu_emergency_register_virt_callback(vmx_emergency_disable);
-
vmx_check_vmcs12_offsets();

/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 26288ca05364..41d3e4e32e20 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9687,15 +9687,10 @@ static int kvm_x86_check_processor_compatibility(void)
return static_call(kvm_x86_check_processor_compatibility)();
}

-static void kvm_x86_check_cpu_compat(void *ret)
-{
- *(int *)ret = kvm_x86_check_processor_compatibility();
-}
-
int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
{
u64 host_pat;
- int r, cpu;
+ int r;

guard(mutex)(&vendor_module_lock);

@@ -9771,11 +9766,11 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)

kvm_ops_update(ops);

- for_each_online_cpu(cpu) {
- smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
- if (r < 0)
- goto out_unwind_ops;
- }
+ cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_disable);
+
+ r = kvm_enable_virtualization();
+ if (r)
+ goto out_unwind_ops;

/*
* Point of no return! DO NOT add error paths below this point unless
@@ -9818,6 +9813,7 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
return 0;

out_unwind_ops:
+ cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable);
kvm_x86_ops.hardware_enable = NULL;
static_call(kvm_x86_hardware_unsetup)();
out_mmu_exit:
@@ -9858,6 +9854,10 @@ void kvm_x86_vendor_exit(void)
static_key_deferred_flush(&kvm_xen_enabled);
WARN_ON(static_branch_unlikely(&kvm_xen_enabled.key));
#endif
+
+ kvm_disable_virtualization();
+ cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable);
+
mutex_lock(&vendor_module_lock);
kvm_x86_ops.hardware_enable = NULL;
mutex_unlock(&vendor_module_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..92da2eee7448 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1518,6 +1518,8 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
#endif

#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
+int kvm_enable_virtualization(void);
+void kvm_disable_virtualization(void);
int kvm_arch_hardware_enable(void);
void kvm_arch_hardware_disable(void);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f345dc15854f..326e3225c052 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -139,8 +139,6 @@ static int kvm_no_compat_open(struct inode *inode, struct file *file)
#define KVM_COMPAT(c) .compat_ioctl = kvm_no_compat_ioctl, \
.open = kvm_no_compat_open
#endif
-static int hardware_enable_all(void);
-static void hardware_disable_all(void);

static void kvm_io_bus_destroy(struct kvm_io_bus *bus);

@@ -1261,10 +1259,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (r)
goto out_err_no_arch_destroy_vm;

- r = hardware_enable_all();
- if (r)
- goto out_err_no_disable;
-
#ifdef CONFIG_HAVE_KVM_IRQCHIP
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
#endif
@@ -1304,8 +1298,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
out_err_no_mmu_notifier:
- hardware_disable_all();
-out_err_no_disable:
kvm_arch_destroy_vm(kvm);
out_err_no_arch_destroy_vm:
WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
@@ -1393,7 +1385,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
#endif
kvm_arch_free_vm(kvm);
preempt_notifier_dec();
- hardware_disable_all();
mmdrop(mm);
}

@@ -5536,9 +5527,8 @@ __visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);

static DEFINE_PER_CPU(bool, hardware_enabled);
-static int kvm_usage_count;

-static int __hardware_enable_nolock(void)
+static int __kvm_enable_virtualization(void)
{
if (__this_cpu_read(hardware_enabled))
return 0;
@@ -5553,34 +5543,18 @@ static int __hardware_enable_nolock(void)
return 0;
}

-static void hardware_enable_nolock(void *failed)
-{
- if (__hardware_enable_nolock())
- atomic_inc(failed);
-}
-
static int kvm_online_cpu(unsigned int cpu)
{
- int ret = 0;
-
/*
* Abort the CPU online process if hardware virtualization cannot
* be enabled. Otherwise running VMs would encounter unrecoverable
* errors when scheduled to this CPU.
*/
- mutex_lock(&kvm_lock);
- if (kvm_usage_count)
- ret = __hardware_enable_nolock();
- mutex_unlock(&kvm_lock);
- return ret;
+ return __kvm_enable_virtualization();
}

-static void hardware_disable_nolock(void *junk)
+static void __kvm_disable_virtualization(void *ign)
{
- /*
- * Note, hardware_disable_all_nolock() tells all online CPUs to disable
- * hardware, not just CPUs that successfully enabled hardware!
- */
if (!__this_cpu_read(hardware_enabled))
return;

@@ -5591,78 +5565,10 @@ static void hardware_disable_nolock(void *junk)

static int kvm_offline_cpu(unsigned int cpu)
{
- mutex_lock(&kvm_lock);
- if (kvm_usage_count)
- hardware_disable_nolock(NULL);
- mutex_unlock(&kvm_lock);
+ __kvm_disable_virtualization(NULL);
return 0;
}

-static void hardware_disable_all_nolock(void)
-{
- BUG_ON(!kvm_usage_count);
-
- kvm_usage_count--;
- if (!kvm_usage_count)
- on_each_cpu(hardware_disable_nolock, NULL, 1);
-}
-
-static void hardware_disable_all(void)
-{
- cpus_read_lock();
- mutex_lock(&kvm_lock);
- hardware_disable_all_nolock();
- mutex_unlock(&kvm_lock);
- cpus_read_unlock();
-}
-
-static int hardware_enable_all(void)
-{
- atomic_t failed = ATOMIC_INIT(0);
- int r;
-
- /*
- * Do not enable hardware virtualization if the system is going down.
- * If userspace initiated a forced reboot, e.g. reboot -f, then it's
- * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
- * after kvm_reboot() is called. Note, this relies on system_state
- * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
- * hook instead of registering a dedicated reboot notifier (the latter
- * runs before system_state is updated).
- */
- if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
- system_state == SYSTEM_RESTART)
- return -EBUSY;
-
- /*
- * When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
- * is called, and so on_each_cpu() between them includes the CPU that
- * is being onlined. As a result, hardware_enable_nolock() may get
- * invoked before kvm_online_cpu(), which also enables hardware if the
- * usage count is non-zero. Disable CPU hotplug to avoid attempting to
- * enable hardware multiple times.
- */
- cpus_read_lock();
- mutex_lock(&kvm_lock);
-
- r = 0;
-
- kvm_usage_count++;
- if (kvm_usage_count == 1) {
- on_each_cpu(hardware_enable_nolock, &failed, 1);
-
- if (atomic_read(&failed)) {
- hardware_disable_all_nolock();
- r = -EBUSY;
- }
- }
-
- mutex_unlock(&kvm_lock);
- cpus_read_unlock();
-
- return r;
-}
-
static void kvm_shutdown(void)
{
/*
@@ -5678,34 +5584,22 @@ static void kvm_shutdown(void)
*/
pr_info("kvm: exiting hardware virtualization\n");
kvm_rebooting = true;
- on_each_cpu(hardware_disable_nolock, NULL, 1);
+ on_each_cpu(__kvm_disable_virtualization, NULL, 1);
}

static int kvm_suspend(void)
{
- /*
- * Secondary CPUs and CPU hotplug are disabled across the suspend/resume
- * callbacks, i.e. no need to acquire kvm_lock to ensure the usage count
- * is stable. Assert that kvm_lock is not held to ensure the system
- * isn't suspended while KVM is enabling hardware. Hardware enabling
- * can be preempted, but the task cannot be frozen until it has dropped
- * all locks (userspace tasks are frozen via a fake signal).
- */
- lockdep_assert_not_held(&kvm_lock);
lockdep_assert_irqs_disabled();

- if (kvm_usage_count)
- hardware_disable_nolock(NULL);
+ __kvm_disable_virtualization(NULL);
return 0;
}

static void kvm_resume(void)
{
- lockdep_assert_not_held(&kvm_lock);
lockdep_assert_irqs_disabled();

- if (kvm_usage_count)
- WARN_ON_ONCE(__hardware_enable_nolock());
+ WARN_ON_ONCE(__kvm_enable_virtualization());
}

static struct syscore_ops kvm_syscore_ops = {
@@ -5713,16 +5607,45 @@ static struct syscore_ops kvm_syscore_ops = {
.resume = kvm_resume,
.shutdown = kvm_shutdown,
};
-#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
-static int hardware_enable_all(void)
+
+int kvm_enable_virtualization(void)
{
+ int r;
+
+ r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
+ kvm_online_cpu, kvm_offline_cpu);
+ if (r)
+ return r;
+
+ register_syscore_ops(&kvm_syscore_ops);
+
+ /*
+ * Manually undo virtualization enabling if the system is going down.
+ * If userspace initiated a forced reboot, e.g. reboot -f, then it's
+ * possible for an in-flight module load to enable virtualization
+ * after syscore_shutdown() is called, i.e. without kvm_shutdown()
+ * being invoked. Note, this relies on system_state being set _before_
+ * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
+	 * or this CPU observes the impending shutdown, which is why KVM uses
+ * a syscore ops hook instead of registering a dedicated reboot
+ * notifier (the latter runs before system_state is updated).
+ */
+ if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
+ system_state == SYSTEM_RESTART) {
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+ return -EBUSY;
+ }
+
return 0;
}

-static void hardware_disable_all(void)
+void kvm_disable_virtualization(void)
{
-
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
}
+
#endif /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */

static void kvm_iodevice_destructor(struct kvm_io_device *dev)
@@ -6418,15 +6341,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
int r;
int cpu;

-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
- kvm_online_cpu, kvm_offline_cpu);
- if (r)
- return r;
-
- register_syscore_ops(&kvm_syscore_ops);
-#endif
-
/* A kmem cache lets us meet the alignment requirements of fx_save. */
if (!vcpu_align)
vcpu_align = __alignof__(struct kvm_vcpu);
@@ -6437,10 +6351,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
offsetofend(struct kvm_vcpu, stats_id)
- offsetof(struct kvm_vcpu, arch),
NULL);
- if (!kvm_vcpu_cache) {
- r = -ENOMEM;
- goto err_vcpu_cache;
- }
+ if (!kvm_vcpu_cache)
+ return -ENOMEM;

for_each_possible_cpu(cpu) {
if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
@@ -6497,11 +6409,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
for_each_possible_cpu(cpu)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
kmem_cache_destroy(kvm_vcpu_cache);
-err_vcpu_cache:
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
-#endif
return r;
}
EXPORT_SYMBOL_GPL(kvm_init);
@@ -6523,10 +6430,6 @@ void kvm_exit(void)
kmem_cache_destroy(kvm_vcpu_cache);
kvm_vfio_ops_exit();
kvm_async_pf_deinit();
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
-#endif
kvm_irqfd_exit();
}
EXPORT_SYMBOL_GPL(kvm_exit);

base-commit: f10f3621ad80f008c218dbbc13a05c893766a7d2
--


2024-04-10 23:16:15

by Huang, Kai

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 11/04/2024 3:29 am, Sean Christopherson wrote:
> On Wed, Apr 10, 2024, Kai Huang wrote:
>> On Fri, 2024-03-22 at 14:23 -0700, Isaku Yamahata wrote:
>>>>> + r = atomic_read(&enable.err);
>>>>> + if (!r)
>>>>> + r = tdx_module_setup();
>>>>> + else
>>>>> + r = -EIO;
>>>>> + on_each_cpu(vmx_off, &enable.enabled, true);
>>>>> + cpus_read_unlock();
>>>>> + free_cpumask_var(enable.enabled);
>>>>> +
>>>>> +out:
>>>>> + return r;
>>>>> +}
>>>>
>>>> At last, I think there's one problem here:
>>>>
>>>> KVM actually only registers CPU hotplug callback in kvm_init(), which happens
>>>> way after tdx_hardware_setup().
>>>>
>>>> What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
>>>> kvm_init()?
>>>>
>>>> Looks we have two options:
>>>>
>>>> 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
>>>> 2) we need to disable CPU hotplug until callbacks have been registered.
>
> This is all so dumb (not TDX, the current state of KVM). All of the hardware
> enabling crud is pointlessly complex, inherited from misguided, decade-old paranoia
> that led to the decision to enable VMX if and only if VMs are running. Enabling
> VMX doesn't make the system less secure, and the insane dances we are doing to
> do VMXON on-demand make everything *more* fragile.
>
> And all of this complexity really was driven by VMX, enabling virtualization for
> every other vendor, including AMD/SVM, is completely uninteresting. Forcing other
> architectures/vendors to take on yet more complexity doesn't make any sense.

Ah, I actually preferred this solution, but I was trying to follow your
suggestion here:

https://lore.kernel.org/lkml/[email protected]/

from which I interpreted that you didn't like always having VMX enabled when
KVM is present. :-)

>
> Barely tested, and other architectures would need to be converted, but I don't
> see any obvious reasons why we can't simply enable virtualization when the module
> is loaded.
>
> The diffstat pretty much says it all.

Thanks a lot for the code!

I can certainly follow up with this and generate a reviewable patchset
if I can confirm with you that this is what you want?

2024-04-11 14:05:53

by Sean Christopherson

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Apr 11, 2024, Kai Huang wrote:
> On 11/04/2024 3:29 am, Sean Christopherson wrote:
> > On Wed, Apr 10, 2024, Kai Huang wrote:
> > > > > What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
> > > > > kvm_init()?
> > > > >
> > > > > Looks we have two options:
> > > > >
> > > > > 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
> > > > > 2) we need to disable CPU hotplug until callbacks have been registered.
> >
> > This is all so dumb (not TDX, the current state of KVM). All of the hardware
> > enabling crud is pointlessly complex, inherited from misguided, decade-old paranoia
> > that led to the decision to enable VMX if and only if VMs are running. Enabling
> > VMX doesn't make the system less secure, and the insane dances we are doing to
> > do VMXON on-demand make everything *more* fragile.
> >
> > And all of this complexity really was driven by VMX, enabling virtualization for
> > every other vendor, including AMD/SVM, is completely uninteresting. Forcing other
> > architectures/vendors to take on yet more complexity doesn't make any sense.
>
> Ah, I actually preferred this solution, but I was trying to follow your
> suggestion here:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> from which I interpreted that you didn't like always having VMX enabled when KVM
> is present. :-)

I had a feeling I said something along those lines in the past.

> > Barely tested, and other architectures would need to be converted, but I don't
> > see any obvious reasons why we can't simply enable virtualization when the module
> > is loaded.
> >
> > The diffstat pretty much says it all.
>
> Thanks a lot for the code!
>
> I can certainly follow up with this and generate a reviewable patchset if I
> can confirm with you that this is what you want?

Yes, I think it's the right direction. I still have minor concerns about VMX
being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
enabled if KVM is built-in. But after seeing the complexity that is needed to
safely initialize TDX, and after seeing just how much complexity KVM already
has because it enables VMX on-demand (I hadn't actually tried removing that code
before), I think the cost of that complexity far outweighs the risk of "always"
being post-VMXON.

Within reason, I recommend getting feedback from others before you spend _too_
much time on this. It's entirely possible I'm missing/forgetting some other angle.

2024-04-11 18:36:13

by Adrian Hunter

Subject: Re: [PATCH v19 076/130] KVM: TDX: Finalize VM initialization

On 26/02/24 10:26, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> To protect the initial contents of the guest TD, the TDX module measures
> the guest TD during the build process as SHA-384 measurement. The
> measurement of the guest TD contents needs to be completed to make the
> guest TD ready to run.
>
> Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
> KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
> to run.

Perhaps a spruced-up commit message would be:

<BEGIN>
Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.

Documentation for the API is added in another patch:
"Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"

For the purpose of attestation, a measurement must be made of the TDX VM
initial state. This is referred to as TD Measurement Finalization, and
uses SEAMCALL TDH.MR.FINALIZE, after which:
1. The VMM adding TD private pages with arbitrary content is no longer
allowed
2. The TDX VM is runnable
<END>

History:

This code is essentially unchanged from V1, as below.
Except for V5, the code has never had any comments.
Paolo's comment from then still appears unaddressed.

V19: Unchanged
V18: Undoes change of V17
V17: Also change tools/arch/x86/include/uapi/asm/kvm.h
V16: Unchanged
V15: Undoes change of V10
V11-V14: Unchanged
V10: Adds a hack (related to TDH_MEM_TRACK)
that was later removed in V15
V6-V9: Unchanged
V5 Broke out the code into a separate patch and
received its only comments, which were from Paolo:

"Reviewed-by: Paolo Bonzini <[email protected]>
Note however that errors should be passed back in the struct."

This presumably refers to struct kvm_tdx_cmd which has an "error"
member, but that is not updated by tdx_td_finalizemr()

V4 was a cut-down series and the code was not present
V3 introduced WARN_ON_ONCE for the error condition
V2 accommodated renaming the seamcall function and ID

Outstanding:

1. Address Paolo's comment about the error code
2. Is WARN_ON sensible?

Final note:

It might be possible to make TD Measurement Finalization
transparent to the user space VMM and forgo another API, but it seems
doubtful that would really make anything much simpler.

>
> Signed-off-by: Isaku Yamahata <[email protected]>
>
> ---
> v18:
> - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
>
> v14 -> v15:
> - removed unconditional tdx_track() by tdx_flush_tlb_current() that
> does tdx_track().
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 34167404020c..c160f60189d1 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -573,6 +573,7 @@ enum kvm_tdx_cmd_id {
> KVM_TDX_INIT_VM,
> KVM_TDX_INIT_VCPU,
> KVM_TDX_EXTEND_MEMORY,
> + KVM_TDX_FINALIZE_VM,
>
> KVM_TDX_CMD_NR_MAX,
> };
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3cfba63a7762..6aff3f7e2488 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1400,6 +1400,24 @@ static int tdx_extend_memory(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> return ret;
> }
>
> +static int tdx_td_finalizemr(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + u64 err;
> +
> + if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> + return -EINVAL;
> +
> + err = tdh_mr_finalize(kvm_tdx->tdr_pa);
> + if (WARN_ON_ONCE(err)) {

Is a failed SEAMCALL really something to WARN over?

> + pr_tdx_error(TDH_MR_FINALIZE, err, NULL);

As per Paolo, error code is not returned in struct kvm_tdx_cmd

> + return -EIO;
> + }
> +
> + kvm_tdx->finalized = true;
> + return 0;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -1422,6 +1440,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_TDX_EXTEND_MEMORY:
> r = tdx_extend_memory(kvm, &tdx_cmd);
> break;
> + case KVM_TDX_FINALIZE_VM:
> + r = tdx_td_finalizemr(kvm);
> + break;
> default:
> r = -EINVAL;
> goto out;


2024-04-11 19:02:33

by Isaku Yamahata

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Thu, Apr 04, 2024 at 12:59:45PM +1300,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 26/02/2024 9:25 pm, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX requires additional parameters for TDX VM for confidential execution to
> > protect the confidentiality of its memory contents and CPU state from any
> > other software, including VMM.
>
> Hmm.. not only "confidentiality" but also "integrity". And the "per-VM" TDX
> initializaiton here actually has nothing to do with "crypto-protection",
> because the establishment of the key has already been done before reaching
> here.
>
> I would just say:
>
> After the crypto-protection key has been configured, TDX requires a VM-scope
> initialization as a step of creating the TDX guest. This "per-VM" TDX
> initialization does the global configurations/features that the TDX guest
> can support, such as guest's CPUIDs (emulated by the TDX module), the
> maximum number of vcpus etc.
>
>
>
>
> When creating a guest TD VM before creating
> > vcpu, the number of vcpu, TSC frequency (the values are the same among
> > vcpus, and it can't change.) CPUIDs which the TDX module emulates.
>
> I cannot parse this sentence. It doesn't look like a sentence to me.
>
> Guest
> > TDs can trust those CPUIDs and sha384 values for measurement.
>
> Trustworthiness is not about whether the "guest can trust", but whether the
> "people using the guest can trust".
>
> Just remove it.
>
> If you want to emphasize the attestation, you can add something like:
>
> "
> It also passes the VM's measurement and hash of the signer etc and the
> hardware only allows initializing the TDX guest when those match.
> "
>
> >
> > Add a new subcommand, KVM_TDX_INIT_VM, to pass parameters for the TDX
> > guest.
>
> [...]
>
> It assigns an encryption key to the TDX guest for memory
> > encryption. TDX encrypts memory per guest basis.
>
> No it doesn't. The key has been programmed already in your previous patch.
>
> The device model, say
> > qemu, passes per-VM parameters for the TDX guest.
>
> This is implied by your first sentence of this paragraph.
>
> The maximum number of
> > vcpus, TSC frequency (TDX guest has fixed VM-wide TSC frequency, not per
> > vcpu. The TDX guest can not change it.), attributes (production or debug),
> > available extended features (which configure guest XCR0, IA32_XSS MSR),
> > CPUIDs, sha384 measurements, etc.
>
> This is not a sentence.
>
> >
> > Call this subcommand before creating vcpu and KVM_SET_CPUID2, i.e. CPUID
> > configurations aren't available yet.
>
> "
> This "per-VM" TDX initialization must be done before any "vcpu-scope" TDX
> initialization. To match this better, require the KVM_TDX_INIT_VM IOCTL()
> to be done before KVM creates any vcpus.
>
> Note KVM configures the VM's CPUIDs in KVM_SET_CPUID2 via vcpu. The
> downside of this approach is KVM will need to do some enforcement later to
> make sure the consistency between the CPUIDs passed here and the CPUIDs
> done in KVM_SET_CPUID2.
> "

Thanks for the draft. Let me update it.

> So CPUIDs configuration values need
> > to be passed in struct kvm_tdx_init_vm. The device model's responsibility
> > to make this CPUID config for KVM_TDX_INIT_VM and KVM_SET_CPUID2.
>
> And I would leave how to handle KVM_SET_CPUID2 to the patch that actually
> enforces the consistency.

Yes, that's a different discussion.


> > +struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(
> > + struct kvm_cpuid_entry2 *entries, int nent, u32 function, u64 index)
> > +{
> > + return cpuid_entry2_find(entries, nent, function, index);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry2);
>
> Not sure whether we can export cpuid_entry2_find() directly?
>
> No strong opinion of course.
>
> But if we want to expose the wrapper, looks ...


Almost all KVM exported symbols have a kvm_ prefix. I'm afraid that cpuid is too
common. We can rename the function directly without a wrapper.


> > +
> > struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
> > u32 function, u32 index)
> > {
> > diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> > index 856e3037e74f..215d1c68c6d1 100644
> > --- a/arch/x86/kvm/cpuid.h
> > +++ b/arch/x86/kvm/cpuid.h
> > @@ -13,6 +13,8 @@ void kvm_set_cpu_caps(void);
> > void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
> > void kvm_update_pv_runtime(struct kvm_vcpu *vcpu);
> > +struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
> > + int nent, u32 function, u64 index);
> > struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
> > u32 function, u32 index); > struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
>
> ... __kvm_find_cpuid_entry() would fit better?

Ok, let's rename it.


> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 1cf2b15da257..b11f105db3cd 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -8,7 +8,6 @@
> > #include "mmu.h"
> > #include "tdx_arch.h"
> > #include "tdx.h"
> > -#include "tdx_ops.h"
>
> ??
>
> If it isn't needed, then it shouldn't be included in some previous patch.

Will fix.


> > #include "x86.h"
> > #undef pr_fmt
> > @@ -350,18 +349,21 @@ static int tdx_do_tdh_mng_key_config(void *param)
> > return 0;
> > }
> > -static int __tdx_td_init(struct kvm *kvm);
> > -
> > int tdx_vm_init(struct kvm *kvm)
> > {
> > + /*
> > + * This function initializes only KVM software construct. It doesn't
> > + * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
> > + * It is handled by KVM_TDX_INIT_VM, __tdx_td_init().
> > + */
> > +
> > /*
> > * TDX has its own limit of the number of vcpus in addition to
> > * KVM_MAX_VCPUS.
> > */
> > kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
> > - /* Place holder for TDX specific logic. */
> > - return __tdx_td_init(kvm);
> > + return 0;
>
> ??
>
> I don't quite understand. What's wrong of still calling __tdx_td_init() in
> tdx_vm_init()?
>
> If there's anything preventing doing __tdx_td_init() from tdx_vm_init(),
> then it's wrong to implement that in your previous patch.

Yes. As discussed, the previous patch is too big; we need to break up the
previous patch and this one.
--
Isaku Yamahata <[email protected]>

2024-04-11 19:51:06

by Isaku Yamahata

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Mon, Apr 08, 2024 at 06:38:56PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > +static int setup_tdparams_xfam(struct kvm_cpuid2 *cpuid, struct td_params *td_params)
> > +{
> > +       const struct kvm_cpuid_entry2 *entry;
> > +       u64 guest_supported_xcr0;
> > +       u64 guest_supported_xss;
> > +
> > +       /* Setup td_params.xfam */
> > +       entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 0);
> > +       if (entry)
> > +               guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
> > +       else
> > +               guest_supported_xcr0 = 0;
> > +       guest_supported_xcr0 &= kvm_caps.supported_xcr0;
> > +
> > +       entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0xd, 1);
> > +       if (entry)
> > +               guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
> > +       else
> > +               guest_supported_xss = 0;
> > +
> > +       /*
> > +        * PT and CET can be exposed to TD guest regardless of KVM's XSS, PT
> > +        * and, CET support.
> > +        */
> > +       guest_supported_xss &=
> > +               (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET);
>
> So this enables features based on xss support in the passed CPUID, but these features are not
> dependent xsave. You could have CET without xsave support. And in fact Kernel IBT doesn't use it. To
> utilize CPUID leafs to configure features, but diverge from the HW meaning seems like asking for
> trouble.

The TDX module checks the consistency. KVM can rely on it and need not re-implement the checks.
The TDX Base Architecture specification describes what check is done.
Table 11.4: Extended Features Enumeration and Execution Control

> > +
> > +       td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> > +       if (td_params->xfam & XFEATURE_MASK_LBR) {
> > +               /*
> > +                * TODO: once KVM supports LBR(save/restore LBR related
> > +                * registers around TDENTER), remove this guard.
> > +                */
> > +#define MSG_LBR        "TD doesn't support LBR yet. KVM needs to save/restore IA32_LBR_DEPTH
> > properly.\n"
> > +               pr_warn(MSG_LBR);
> > +               return -EOPNOTSUPP;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> > +                       struct kvm_tdx_init_vm *init_vm)
> > +{
> > +       struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
> > +       int ret;
> > +
> > +       if (kvm->created_vcpus)
> > +               return -EBUSY;
> > +
> > +       if (init_vm->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> > +               /*
> > +                * TODO: save/restore PMU related registers around TDENTER.
> > +                * Once it's done, remove this guard.
> > +                */
> > +#define MSG_PERFMON    "TD doesn't support perfmon yet. KVM needs to save/restore host perf
> > registers properly.\n"
> > +               pr_warn(MSG_PERFMON);
>
> We need to remove the TODOs and a warn doesn't seem appropriate.

Sure, let me drop them.


> > +               return -EOPNOTSUPP;
> > +       }
> > +
> > +       td_params->max_vcpus = kvm->max_vcpus;
> > +       td_params->attributes = init_vm->attributes;
>
> Don't we need to sanitize this for a selection of features known to KVM? For example, what if
> something else like TDX_TD_ATTRIBUTE_PERFMON is added to a future TDX module and then suddenly
> userspace can configure it.
>
> So xfam is how to control features that are tied to save (CET, etc). And ATTRIBUTES are tied to
> features without xsave support (PKS, etc).
>
> If we are going to use CPUID for specifying which features should get enabled in the TDX module, we
> should match the arch definitions of the leafs. For things like CET, where xfam controls the value
> of multiple CPUID leafs, we should check that they are all set to consistent values
> and otherwise reject them. So for CET we would need to check the SHSTK and IBT bits, as well as two
> XCR0 bits.
>
> If we are going to do that for XFAM based features, then why not do the same for ATTRIBUTE based
> features?
>
> We would need something like GET_SUPPORTED_CPUID for TDX, but also since some features can be forced
> on we would need to expose something like GET_SUPPORTED_CPUID_REQUIRED as well.

I agree to reject attributes unknown to KVM. Let's add the check.

The TDX module checks consistency between attributes, xfam, and CPUIDs as
described in the spec; KVM can rely on it. When the TDX module finds an
inconsistency (or anything else bad), it returns an error as a SEAMCALL error
status code, which includes which CPUID is bad. KVM returns that to the
userspace VMM in struct kvm_tdx_cmd.error. We don't have to re-implement
similar checks.
--
Isaku Yamahata <[email protected]>

2024-04-11 19:53:54

by Edgecombe, Rick P

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Thu, 2024-04-11 at 12:26 -0700, Isaku Yamahata wrote:
> >
> > So this enables features based on xss support in the passed CPUID, but these
> > features are not
> > dependent on xsave. You could have CET without xsave support. And in fact
> > Kernel IBT doesn't use it. To
> > utilize CPUID leafs to configure features, but diverge from the HW meaning
> > seems like asking for
> > trouble.
>
> TDX module checks the consistency.  KVM can rely on it not to re-implement it.
> The TDX Base Architecture specification describes what check is done.
> Table 11.4: Extended Features Enumeration and Execution Control

The point is that it is a strange interface. Why not take XFAM as a specific
field in struct kvm_tdx_init_vm?

2024-04-11 20:47:26

by Isaku Yamahata

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Thu, Apr 11, 2024 at 07:51:55PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Thu, 2024-04-11 at 12:26 -0700, Isaku Yamahata wrote:
> > >
> > > So this enables features based on xss support in the passed CPUID, but these
> > > features are not
> > > dependent on xsave. You could have CET without xsave support. And in fact
> > > Kernel IBT doesn't use it. To
> > > utilize CPUID leafs to configure features, but diverge from the HW meaning
> > > seems like asking for
> > > trouble.
> >
> > TDX module checks the consistency.  KVM can rely on it not to re-implement it.
> > The TDX Base Architecture specification describes what check is done.
> > Table 11.4: Extended Features Enumeration and Execution Control
>
> The point is that it is a strange interface. Why not take XFAM as a specific
> field in struct kvm_tdx_init_vm?

Now I see your point. Yes, we can add xfam to struct kvm_tdx_init_vm and
move the burden of creating xfam from the kernel to user space.
--
Isaku Yamahata <[email protected]>

2024-04-11 21:03:52

by Edgecombe, Rick P

Subject: Re: [PATCH v19 039/130] KVM: TDX: initialize VM with TDX specific parameters

On Thu, 2024-04-11 at 13:46 -0700, Isaku Yamahata wrote:
> On Thu, Apr 11, 2024 at 07:51:55PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Thu, 2024-04-11 at 12:26 -0700, Isaku Yamahata wrote:
> > > >
> > > > So this enables features based on xss support in the passed CPUID, but
> > > > these
> > > > features are not
> > > > dependent on xsave. You could have CET without xsave support. And in fact
> > > > Kernel IBT doesn't use it. To
> > > > utilize CPUID leafs to configure features, but diverge from the HW
> > > > meaning
> > > > seems like asking for
> > > > trouble.
> > >
> > > TDX module checks the consistency.  KVM can rely on it not to re-implement
> > > it.
> > > The TDX Base Architecture specification describes what check is done.
> > > Table 11.4: Extended Features Enumeration and Execution Control
> >
> > The point is that it is a strange interface. Why not take XFAM as a
> > specific
> > field in struct kvm_tdx_init_vm?
>
> Now I see your point. Yes, we can add xfam to struct kvm_tdx_init_vm and
> move the burden to create xfam from the kernel to the user space.

Oh, right. Qemu would have to figure out how to take its CPUID-based model and
convert it to XFAM. I still think it would be better, but I was only thinking of
it from KVM's perspective. We can see how the API discussion on the PUCK call
resolves.

2024-04-11 22:59:33

by Huang, Kai

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 12/04/2024 2:03 am, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Kai Huang wrote:
>> On 11/04/2024 3:29 am, Sean Christopherson wrote:
>>> On Wed, Apr 10, 2024, Kai Huang wrote:
>>>>>> What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
>>>>>> kvm_init()?
>>>>>>
>>>>>> Looks we have two options:
>>>>>>
>>>>>> 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
>>>>>> 2) we need to disable CPU hotplug until callbacks have been registered.
>>>
>>> This is all so dumb (not TDX, the current state of KVM). All of the hardware
>>> enabling crud is pointlessly complex, inherited from misguided, decade-old paranoia
>>> that led to the decision to enable VMX if and only if VMs are running. Enabling
>>> VMX doesn't make the system less secure, and the insane dances we are doing to
>>> do VMXON on-demand makes everything *more* fragile.
>>>
>>> And all of this complexity really was driven by VMX, enabling virtualization for
>>> every other vendor, including AMD/SVM, is completely uninteresting. Forcing other
>>> architectures/vendors to take on yet more complexity doesn't make any sense.
>>
>> Ah, I actually preferred this solution, but I was trying to follow your
>> suggestion here:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> from which I interpreted that you didn't like always having VMX enabled when KVM
>> is present. :-)
>
> I had a feeling I said something along those lines in the past.
>
>>> Barely tested, and other architectures would need to be converted, but I don't
>>> see any obvious reasons why we can't simply enable virtualization when the module
>>> is loaded.
>>>
>>> The diffstat pretty much says it all.
>>
>> Thanks a lot for the code!
>>
>> I can certainly follow up with this and generate a reviewable patchset if I
>> can confirm with you that this is what you want?
>
> Yes, I think it's the right direction. I still have minor concerns about VMX
> being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
> enabled if KVM is built-in. But after seeing the complexity that is needed to
> safely initialize TDX, and after seeing just how much complexity KVM already
> has because it enables VMX on-demand (I hadn't actually tried removing that code
> before), I think the cost of that complexity far outweighs the risk of "always"
> being post-VMXON.

Does always leaving VMXON enabled cause any actual harm, given that we have
the emergency virtualization shutdown?

>
> Within reason, I recommend getting feedback from others before you spend _too_
> much time on this. It's entirely possible I'm missing/forgetting some other angle.

Sure. Could you suggest whom we should try to get feedback from?

Perhaps you can just help to Cc them?

Thanks for your time.

2024-04-12 01:09:02

by Isaku Yamahata

Subject: Re: [PATCH v19 076/130] KVM: TDX: Finalize VM initialization

On Thu, Apr 11, 2024 at 07:39:11PM +0300,
Adrian Hunter <[email protected]> wrote:

> On 26/02/24 10:26, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > To protect the initial contents of the guest TD, the TDX module measures
> > the guest TD during the build process as SHA-384 measurement. The
> > measurement of the guest TD contents needs to be completed to make the
> > guest TD ready to run.
> >
> > Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
> > KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
> > to run.
>
> Perhaps a spruced up commit message would be:
>
> <BEGIN>
> Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
> KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
>
> Documentation for the API is added in another patch:
> "Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
>
> For the purpose of attestation, a measurement must be made of the TDX VM
> initial state. This is referred to as TD Measurement Finalization, and
> uses SEAMCALL TDH.MR.FINALIZE, after which:
> 1. The VMM adding TD private pages with arbitrary content is no longer
> allowed
> 2. The TDX VM is runnable
> <END>
>
> History:
>
> This code is essentially unchanged from V1, as below.
> Except for V5, the code has never had any comments.
> Paolo's comment from then still appears unaddressed.
>
> V19: Unchanged
> V18: Undoes change of V17
> V17: Also change tools/arch/x86/include/uapi/asm/kvm.h
> V16: Unchanged
> V15: Undoes change of V10
> V11-V14: Unchanged
> V10: Adds a hack (related to TDH_MEM_TRACK)
> that was later removed in V15
> V6-V9: Unchanged
> V5 Broke out the code into a separate patch and
> received its only comments, which were from Paolo:
>
> "Reviewed-by: Paolo Bonzini <[email protected]>
> Note however that errors should be passed back in the struct."
>
> This presumably refers to struct kvm_tdx_cmd which has an "error"
> member, but that is not updated by tdx_td_finalizemr()
>
> V4 was a cut-down series and the code was not present
> V3 introduced WARN_ON_ONCE for the error condition
> V2 accommodated renaming the seamcall function and ID

Thank you for compiling the history. Let me update the commit message.


> Outstanding:
>
> 1. Address Paolo's comment about the error code
> 2. Is WARN_ON sensible?

See below.


> Final note:
>
> It might be possible to make TD Measurement Finalization
> transparent to the user space VMM and forego another API, but it seems
> doubtful that would really make anything much simpler.
>
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> >
> > ---
> > v18:
> > - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
> >
> > v14 -> v15:
> > - removed unconditional tdx_track() by tdx_flush_tlb_current() that
> > does tdx_track().
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/uapi/asm/kvm.h | 1 +
> > arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 34167404020c..c160f60189d1 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -573,6 +573,7 @@ enum kvm_tdx_cmd_id {
> > KVM_TDX_INIT_VM,
> > KVM_TDX_INIT_VCPU,
> > KVM_TDX_EXTEND_MEMORY,
> > + KVM_TDX_FINALIZE_VM,
> >
> > KVM_TDX_CMD_NR_MAX,
> > };
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 3cfba63a7762..6aff3f7e2488 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1400,6 +1400,24 @@ static int tdx_extend_memory(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> > return ret;
> > }
> >
> > +static int tdx_td_finalizemr(struct kvm *kvm)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + u64 err;
> > +
> > + if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> > + return -EINVAL;
> > +
> > + err = tdh_mr_finalize(kvm_tdx->tdr_pa);
> > + if (WARN_ON_ONCE(err)) {
>
> Is a failed SEAMCALL really something to WARN over?

Because the user can trigger an error in some cases, we shouldn't WARN in those
cases. Apart from those, TDH.MR.FINALIZE() shouldn't return an error. If we hit
such an error, it typically implies something serious enough that recovery is
difficult. For example, the TDX module was broken by the host overwriting
private pages. That's the reason why we have KVM_BUG_ON. So the error check
should be something like:


	/* We can hit a busy error while exclusively accessing the TDR. */
	if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX))
		return -EAGAIN;
	/* The user can call KVM_TDX_INIT_VM without any vCPUs created. */
	if (err == TDX_NO_VCPUS)
		return -EIO;
	/* Other errors shouldn't happen. */
	if (KVM_BUG_ON(err, kvm)) {
		pr_tdx_error(TDH_MR_FINALIZE, err);
		return -EIO;
	}


> > + pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
>
> As per Paolo, error code is not returned in struct kvm_tdx_cmd


It will be something like the following. It's not compile-tested yet.


diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0d3b79b5c42a..c7ff819ccaf1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2757,6 +2757,12 @@ static int tdx_td_finalizemr(struct kvm *kvm)
return -EINVAL;

err = tdh_mr_finalize(kvm_tdx);
+ kvm_tdx->hw_error = err;
+
+ if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX))
+ return -EAGAIN;
+ if (err == TDX_NO_VCPUS)
+ return -EIO;
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MR_FINALIZE, err);
return -EIO;
@@ -2768,6 +2774,7 @@ static int tdx_td_finalizemr(struct kvm *kvm)

int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct kvm_tdx_cmd tdx_cmd;
int r;

@@ -2777,6 +2784,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
return -EINVAL;

mutex_lock(&kvm->lock);
+ kvm_tdx->hw_error = 0;

switch (tdx_cmd.id) {
case KVM_TDX_CAPABILITIES:
@@ -2793,6 +2801,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
goto out;
}

+ tdx_cmd.error = kvm_tdx->hw_error;
if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
r = -EFAULT;

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 98f5d7c5891a..dc150b8bdd5f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -18,6 +18,9 @@ struct kvm_tdx {
u64 xfam;
int hkid;

+ /* For KVM_TDX ioctl to return SEAMCALL status code. */
+ u64 hw_error;
+
/*
* Used on each TD-exit, see tdx_user_return_update_cache().
* TSX_CTRL value on TD exit
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index a8aa8b79e9a1..6c701856c9a8 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -41,6 +41,7 @@
#define TDX_TD_FATAL 0xC000060400000000ULL
#define TDX_TD_NON_DEBUG 0xC000060500000000ULL
#define TDX_LIFECYCLE_STATE_INCORRECT 0xC000060700000000ULL
+#define TDX_NO_VCPUS 0xC000060900000000ULL
#define TDX_TDCX_NUM_INCORRECT 0xC000061000000000ULL
#define TDX_VCPU_STATE_INCORRECT 0xC000070000000000ULL
#define TDX_VCPU_ASSOCIATED 0x8000070100000000ULL
--
2.43.2
--
Isaku Yamahata <[email protected]>
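The error classification proposed above can be exercised in isolation. A minimal sketch follows; the TDX_NO_VCPUS value is taken from the tdx_errno.h hunk in the diff, while TDX_OPERAND_BUSY (0x8000020000000000ULL) and TDX_OPERAND_ID_RCX (0x01), and the convention of encoding the operand ID in the low bits, are assumptions based on the kernel's tdx_errno.h:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SEAMCALL status-code values, see lead-in. */
#define TDX_OPERAND_BUSY	0x8000020000000000ULL
#define TDX_OPERAND_ID_RCX	0x01ULL
#define TDX_NO_VCPUS		0xC000060900000000ULL

enum finalize_result { FIN_OK, FIN_AGAIN, FIN_IO_ERR, FIN_BUG };

/* Sketch of the classification proposed for tdx_td_finalizemr(). */
static enum finalize_result classify_finalize_err(uint64_t err)
{
	if (!err)
		return FIN_OK;
	/* Contention on the TDR lock: let user space retry. */
	if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX))
		return FIN_AGAIN;
	/* User error: finalize called before any vCPU was created. */
	if (err == TDX_NO_VCPUS)
		return FIN_IO_ERR;
	/* Anything else is unexpected and maps to KVM_BUG_ON(). */
	return FIN_BUG;
}
```

Note the exact-match check is what Adrian questions in the follow-up: a mask-based check on the status class would tolerate other busy operands.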

2024-04-12 12:22:23

by Adrian Hunter

Subject: Re: [PATCH v19 076/130] KVM: TDX: Finalize VM initialization

On 12/04/24 04:08, Isaku Yamahata wrote:
> On Thu, Apr 11, 2024 at 07:39:11PM +0300,
> Adrian Hunter <[email protected]> wrote:
>
>> On 26/02/24 10:26, [email protected] wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> To protect the initial contents of the guest TD, the TDX module measures
>>> the guest TD during the build process as SHA-384 measurement. The
>>> measurement of the guest TD contents needs to be completed to make the
>>> guest TD ready to run.
>>>
>>> Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
>>> KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
>>> to run.
>>
>> Perhaps a spruced up commit message would be:
>>
>> <BEGIN>
>> Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
>> KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
>>
>> Documentation for the API is added in another patch:
>> "Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
>>
>> For the purpose of attestation, a measurement must be made of the TDX VM
>> initial state. This is referred to as TD Measurement Finalization, and
>> uses SEAMCALL TDH.MR.FINALIZE, after which:
>> 1. The VMM adding TD private pages with arbitrary content is no longer
>> allowed
>> 2. The TDX VM is runnable
>> <END>
>>
>> History:
>>
>> This code is essentially unchanged from V1, as below.
>> Except for V5, the code has never had any comments.
>> Paolo's comment from then still appears unaddressed.
>>
>> V19: Unchanged
>> V18: Undoes change of V17
>> V17: Also change tools/arch/x86/include/uapi/asm/kvm.h
>> V16: Unchanged
>> V15: Undoes change of V10
>> V11-V14: Unchanged
>> V10: Adds a hack (related to TDH_MEM_TRACK)
>> that was later removed in V15
>> V6-V9: Unchanged
>> V5 Broke out the code into a separate patch and
>> received its only comments, which were from Paolo:
>>
>> "Reviewed-by: Paolo Bonzini <[email protected]>
>> Note however that errors should be passed back in the struct."
>>
>> This presumably refers to struct kvm_tdx_cmd which has an "error"
>> member, but that is not updated by tdx_td_finalizemr()
>>
>> V4 was a cut-down series and the code was not present
>> V3 introduced WARN_ON_ONCE for the error condition
>> V2 accommodated renaming the seamcall function and ID
>
> Thank you for compiling the history. Let me update the commit message.
>
>
>> Outstanding:
>>
>> 1. Address Paolo's comment about the error code
>> 2. Is WARN_ON sensible?
>
> See below.
>
>
>> Final note:
>>
>> It might be possible to make TD Measurement Finalization
>> transparent to the user space VMM and forego another API, but it seems
>> doubtful that would really make anything much simpler.
>>
>>>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>>
>>> ---
>>> v18:
>>> - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
>>>
>>> v14 -> v15:
>>> - removed unconditional tdx_track() by tdx_flush_tlb_current() that
>>> does tdx_track().
>>>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>> ---
>>> arch/x86/include/uapi/asm/kvm.h | 1 +
>>> arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
>>> 2 files changed, 22 insertions(+)
>>>
>>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>>> index 34167404020c..c160f60189d1 100644
>>> --- a/arch/x86/include/uapi/asm/kvm.h
>>> +++ b/arch/x86/include/uapi/asm/kvm.h
>>> @@ -573,6 +573,7 @@ enum kvm_tdx_cmd_id {
>>> KVM_TDX_INIT_VM,
>>> KVM_TDX_INIT_VCPU,
>>> KVM_TDX_EXTEND_MEMORY,
>>> + KVM_TDX_FINALIZE_VM,
>>>
>>> KVM_TDX_CMD_NR_MAX,
>>> };
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 3cfba63a7762..6aff3f7e2488 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -1400,6 +1400,24 @@ static int tdx_extend_memory(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>>> return ret;
>>> }
>>>
>>> +static int tdx_td_finalizemr(struct kvm *kvm)
>>> +{
>>> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>>> + u64 err;
>>> +
>>> + if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
>>> + return -EINVAL;
>>> +
>>> + err = tdh_mr_finalize(kvm_tdx->tdr_pa);
>>> + if (WARN_ON_ONCE(err)) {
>>
>> Is a failed SEAMCALL really something to WARN over?
>
> Because the user can trigger an error in some cases, we shouldn't WARN in those
> cases. Apart from those, TDH.MR.FINALIZE() shouldn't return an error. If we hit
> such an error, it typically implies something serious enough that recovery is
> difficult. For example, the TDX module was broken by the host overwriting
> private pages. That's the reason why we have KVM_BUG_ON. So the error check
> should be something like:
>
>
> 	/* We can hit a busy error while exclusively accessing the TDR. */
> 	if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX))
> 		return -EAGAIN;
> 	/* The user can call KVM_TDX_INIT_VM without any vCPUs created. */
> 	if (err == TDX_NO_VCPUS)
> 		return -EIO;
> 	/* Other errors shouldn't happen. */
> 	if (KVM_BUG_ON(err, kvm)) {
> 		pr_tdx_error(TDH_MR_FINALIZE, err);
> 		return -EIO;
> 	}
>
>
>>> + pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
>>
>> As per Paolo, error code is not returned in struct kvm_tdx_cmd
>
>
> It will be something like the following. It's not compile-tested yet.
>
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 0d3b79b5c42a..c7ff819ccaf1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2757,6 +2757,12 @@ static int tdx_td_finalizemr(struct kvm *kvm)
> return -EINVAL;
>
> err = tdh_mr_finalize(kvm_tdx);
> + kvm_tdx->hw_error = err;
> +
> + if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX))

There also seem to be implicit operand codes. How sure are
we that TDX_OPERAND_ID_RCX is the only valid busy operand?

> + return -EAGAIN;
> + if (err == TDX_NO_VCPUS)

TDX_NO_VCPUS is not one of the completion status codes for
TDH.MR.FINALIZE

> + return -EIO;
> if (KVM_BUG_ON(err, kvm)) {
> pr_tdx_error(TDH_MR_FINALIZE, err);
> return -EIO;
> @@ -2768,6 +2774,7 @@ static int tdx_td_finalizemr(struct kvm *kvm)
>
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct kvm_tdx_cmd tdx_cmd;
> int r;
>
> @@ -2777,6 +2784,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> return -EINVAL;
>
> mutex_lock(&kvm->lock);
> + kvm_tdx->hw_error = 0;
>
> switch (tdx_cmd.id) {
> case KVM_TDX_CAPABILITIES:
> @@ -2793,6 +2801,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> goto out;
> }
>
> + tdx_cmd.error = kvm_tdx->hw_error;
> if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
> r = -EFAULT;
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 98f5d7c5891a..dc150b8bdd5f 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -18,6 +18,9 @@ struct kvm_tdx {
> u64 xfam;
> int hkid;
>
> + /* For KVM_TDX ioctl to return SEAMCALL status code. */
> + u64 hw_error;

For this case, it seems weird to have a struct member
to pass back a return status code; why not make it a parameter
of tdx_td_finalizemr(), or pass &tdx_cmd?

> +
> /*
> * Used on each TD-exit, see tdx_user_return_update_cache().
> * TSX_CTRL value on TD exit
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index a8aa8b79e9a1..6c701856c9a8 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -41,6 +41,7 @@
> #define TDX_TD_FATAL 0xC000060400000000ULL
> #define TDX_TD_NON_DEBUG 0xC000060500000000ULL
> #define TDX_LIFECYCLE_STATE_INCORRECT 0xC000060700000000ULL
> +#define TDX_NO_VCPUS 0xC000060900000000ULL
> #define TDX_TDCX_NUM_INCORRECT 0xC000061000000000ULL
> #define TDX_VCPU_STATE_INCORRECT 0xC000070000000000ULL
> #define TDX_VCPU_ASSOCIATED 0x8000070100000000ULL


2024-04-12 16:30:18

by Reinette Chatre

Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

Hi Isaku,

On 2/26/2024 12:26 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>

..

> @@ -218,6 +257,87 @@ static void tdx_reclaim_control_page(unsigned long td_page_pa)
> free_page((unsigned long)__va(td_page_pa));
> }
>
> +struct tdx_flush_vp_arg {
> + struct kvm_vcpu *vcpu;
> + u64 err;
> +};
> +
> +static void tdx_flush_vp(void *arg_)
> +{
> + struct tdx_flush_vp_arg *arg = arg_;
> + struct kvm_vcpu *vcpu = arg->vcpu;
> + u64 err;
> +
> + arg->err = 0;
> + lockdep_assert_irqs_disabled();
> +
> + /* Task migration can race with CPU offlining. */
> + if (unlikely(vcpu->cpu != raw_smp_processor_id()))
> + return;
> +
> + /*
> + * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
> + * list tracking still needs to be updated so that it's correct if/when
> + * the vCPU does get initialized.
> + */
> + if (is_td_vcpu_created(to_tdx(vcpu))) {
> + /*
> + * No need to retry. TDX Resources needed for TDH.VP.FLUSH are,
> + * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
> + * vp flush function is called when destructing vcpu/TD or vcpu
> + * migration. No other thread uses TDVPR in those cases.
> + */

(I have a comment later that refers back to this comment about needing retry.)

..

> @@ -233,26 +353,31 @@ static void tdx_do_tdh_phymem_cache_wb(void *unused)
> pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> }
>
> -void tdx_mmu_release_hkid(struct kvm *kvm)
> +static int __tdx_mmu_release_hkid(struct kvm *kvm)
> {
> bool packages_allocated, targets_allocated;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> cpumask_var_t packages, targets;
> + struct kvm_vcpu *vcpu;
> + unsigned long j;
> + int i, ret = 0;
> u64 err;
> - int i;
>
> if (!is_hkid_assigned(kvm_tdx))
> - return;
> + return 0;
>
> if (!is_td_created(kvm_tdx)) {
> tdx_hkid_free(kvm_tdx);
> - return;
> + return 0;
> }
>
> packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> cpus_read_lock();
>
> + kvm_for_each_vcpu(j, vcpu, kvm)
> + tdx_flush_vp_on_cpu(vcpu);
> +
> /*
> * We can destroy multiple guest TDs simultaneously. Prevent
> * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> @@ -270,6 +395,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> */
> write_lock(&kvm->mmu_lock);
>
> + err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
> + if (err == TDX_FLUSHVP_NOT_DONE) {
> + ret = -EBUSY;
> + goto out;
> + }
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
> + pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
> + kvm_tdx->hkid);
> + ret = -EIO;
> + goto out;
> + }
> +
> for_each_online_cpu(i) {
> if (packages_allocated &&
> cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -291,14 +429,24 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
> kvm_tdx->hkid);
> + ret = -EIO;
> } else
> tdx_hkid_free(kvm_tdx);
>
> +out:
> write_unlock(&kvm->mmu_lock);
> mutex_unlock(&tdx_lock);
> cpus_read_unlock();
> free_cpumask_var(targets);
> free_cpumask_var(packages);
> +
> + return ret;
> +}
> +
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> + ;
> }

As I understand, __tdx_mmu_release_hkid() returns -EBUSY
after TDH.VP.FLUSH has been sent for every vCPU followed by
TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.

Considering earlier comment that a retry of TDH.VP.FLUSH is not
needed, why is this while() loop here that sends the
TDH.VP.FLUSH again to all vCPUs instead of just a loop within
__tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?

Could it be possible for a vCPU to appear during this time, thus
be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
TDH.VP.FLUSH?

I note that TDX_FLUSHVP_NOT_DONE is distinct from TDX_OPERAND_BUSY
that can also be returned from TDH.MNG.VPFLUSHDONE and
wonder if a retry may be needed in that case also/instead? It looks like
TDH.MNG.VPFLUSHDONE needs exclusive access to all operands and I
do not know enough yet if this is the case here.

Reinette
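One way to make the concern above concrete: the open-ended `while (... == -EBUSY);` loop re-sends TDH.VP.FLUSH to every vCPU on each pass and has no termination guarantee. A toy model of a bounded alternative follows; `fake_release_hkid()` is a purely illustrative stand-in for `__tdx_mmu_release_hkid()`, and the retry cap is an assumption, not something the patch proposes:

```c
#include <assert.h>

#define SIM_EBUSY (-16)	/* stand-in for -EBUSY */

/*
 * Illustrative model: the release path keeps failing with -EBUSY while
 * flush cycles are still pending, and succeeds once they drain.
 */
static int fake_release_hkid(int *pending_flushes)
{
	if (*pending_flushes > 0) {
		(*pending_flushes)--;	/* one more flush cycle completed */
		return SIM_EBUSY;
	}
	return 0;
}

/* Bounded retry instead of the open-ended while loop under discussion. */
static int release_hkid_bounded(int *pending_flushes, int max_tries)
{
	int ret = SIM_EBUSY;

	for (int i = 0; i < max_tries && ret == SIM_EBUSY; i++)
		ret = fake_release_hkid(pending_flushes);
	return ret;
}
```

This says nothing about whether retrying only TDH.MNG.VPFLUSHDONE would suffice, which is the real question Reinette raises; it only shows the loop-shape difference.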

2024-04-12 18:10:11

by Isaku Yamahata

Subject: Re: [PATCH v19 076/130] KVM: TDX: Finalize VM initialization

On Fri, Apr 12, 2024 at 03:22:00PM +0300,
Adrian Hunter <[email protected]> wrote:

> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 0d3b79b5c42a..c7ff819ccaf1 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2757,6 +2757,12 @@ static int tdx_td_finalizemr(struct kvm *kvm)
> > return -EINVAL;
> >
> > err = tdh_mr_finalize(kvm_tdx);
> > + kvm_tdx->hw_error = err;
> > +
> > + if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX))
>
> There seem to be also implicit operand codes. How sure are
> we that TDX_OPERAND_ID_RCX is the only valid busy operand?

According to the description of TDH.MR.FINALIZE, it takes exclusive locks on
the TDR (via RCX), on the TDCS as an implicit operand, and on OP_STATE as an
implicit operand. For the basic TDX feature of running a guest TD, the TDX
module takes locks in the order TDR => OP_STATE, so we won't see an OP_STATE
lock failure after gaining the TDR lock.

If you are worried about future changes, we can code it as
(err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY. We should do that
consistently, though.

> > + return -EAGAIN;
> > + if (err == TDX_NO_VCPUS)
>
> TDX_NO_VCPUS is not one of the completion status codes for
> TDH.MR.FINALIZE

It depends on the document version. We need to check TDX_OP_STATE_INCORRECT
to be defensive.


> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 98f5d7c5891a..dc150b8bdd5f 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -18,6 +18,9 @@ struct kvm_tdx {
> > u64 xfam;
> > int hkid;
> >
> > + /* For KVM_TDX ioctl to return SEAMCALL status code. */
> > + u64 hw_error;
>
> > For this case, it seems weird to have a struct member
> > to pass back a return status code; why not make it a parameter
> > of tdx_td_finalizemr(), or pass &tdx_cmd?

I created the patch too quickly. Given KVM_TDX_CAPABILITIES and KVM_TDX_INIT_VM
take tdx_cmd already, it's consistent to make tdx_td_finalize() take it.
--
Isaku Yamahata <[email protected]>

2024-04-12 20:17:35

by Isaku Yamahata

Subject: Re: [PATCH v19 079/130] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)

On Sun, Apr 07, 2024 at 11:02:52AM +0800,
Binbin Wu <[email protected]> wrote:

> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index d72651ce99ac..8275a242ce07 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -158,6 +158,32 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > vmx_vcpu_reset(vcpu, init_event);
> > }
> > +static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > +{
> > + /*
> > + * All host state is saved/restored across SEAMCALL/SEAMRET,
>
> It sounds confusing to me.
> If all host state is saved/restored across SEAMCALL/SEAMRET, why does this
> patch save/restore MSR_KERNEL_GS_BASE for the host?
>

No. We should probably update the comment, something like:
restored => restored or initialized to reset state.

Apart from the conditionally saved/restored MSRs (e.g., perfmon, debugreg),
IA32_STAR, IA32_LSTAR, MSR_SYSCALL_MASK, IA32_TSC_AUX and IA32_KERNEL_GS_BASE
are reset to their initial state. The uret mechanism handles the first four.
MSR_KERNEL_GS_BASE needs to be restored on TD exit.

> > and the
> > + * guest state of a TD is obviously off limits. Deferring MSRs and DRs
> > + * is pointless because the TDX module needs to load *something* so as
> > + * not to expose guest state.
> > + */
> > + if (is_td_vcpu(vcpu)) {
> > + tdx_prepare_switch_to_guest(vcpu);
> > + return;
> > + }
> > +
> > + vmx_prepare_switch_to_guest(vcpu);
> > +}
> > +
> > +static void vt_vcpu_put(struct kvm_vcpu *vcpu)
> > +{
> > + if (is_td_vcpu(vcpu)) {
> > + tdx_vcpu_put(vcpu);
> > + return;
> > + }
> > +
> > + vmx_vcpu_put(vcpu);
> > +}
> > +
> > static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
> > {
> > if (is_td_vcpu(vcpu))
> > @@ -326,9 +352,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > .vcpu_free = vt_vcpu_free,
> > .vcpu_reset = vt_vcpu_reset,
> > - .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
> > + .prepare_switch_to_guest = vt_prepare_switch_to_guest,
> > .vcpu_load = vmx_vcpu_load,
> > - .vcpu_put = vmx_vcpu_put,
> > + .vcpu_put = vt_vcpu_put,
> > .update_exception_bitmap = vmx_update_exception_bitmap,
> > .get_msr_feature = vmx_get_msr_feature,
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index fdf9196cb592..9616b1aab6ce 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1,5 +1,6 @@
> > // SPDX-License-Identifier: GPL-2.0
> > #include <linux/cpu.h>
> > +#include <linux/mmu_context.h>
> > #include <asm/tdx.h>
> > @@ -423,6 +424,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > {
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> > WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> > @@ -446,9 +448,47 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > if ((kvm_tdx->xfam & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE)
> > vcpu->arch.xfd_no_write_intercept = true;
> > + tdx->host_state_need_save = true;
> > + tdx->host_state_need_restore = false;
> > +
> > return 0;
> > }
> > +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>
> Just like vmx_prepare_switch_to_host(), the input can be "struct vcpu_tdx
> *", since vcpu is not used inside the function.
> And the callsites just use "to_tdx(vcpu)"
>
> > +{
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> Then, this can be dropped.

prepare_switch_to_guest() is used for kvm_x86_ops.prepare_switch_to_guest().
kvm_x86_ops consistently takes struct kvm_vcpu.
--
Isaku Yamahata <[email protected]>

2024-04-12 20:19:13

by Isaku Yamahata

Subject: Re: [PATCH v19 080/130] KVM: TDX: restore host xsave state when exit from the guest TD

On Sun, Apr 07, 2024 at 11:47:00AM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > On exiting from the guest TD, xsave state is clobbered. Restore xsave
> > state on TD exit.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - Add EXPORT_SYMBOL_GPL(host_xcr0)
> >
> > v15 -> v16:
> > - Added CET flag mask
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
> > arch/x86/kvm/x86.c | 1 +
> > 2 files changed, 20 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 9616b1aab6ce..199226c6cf55 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2,6 +2,7 @@
> > #include <linux/cpu.h>
> > #include <linux/mmu_context.h>
> > +#include <asm/fpu/xcr.h>
> > #include <asm/tdx.h>
> > #include "capabilities.h"
> > @@ -534,6 +535,23 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > */
> > }
> > +static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +
> > + if (static_cpu_has(X86_FEATURE_XSAVE) &&
> > + host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
> > + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> > + if (static_cpu_has(X86_FEATURE_XSAVES) &&
> > + /* PT can be exposed to TD guest regardless of KVM's XSS support */
> The comment needs to be updated to reflect the case for CET.
>
> > + host_xss != (kvm_tdx->xfam &
> > + (kvm_caps.supported_xss | XFEATURE_MASK_PT | TDX_TD_XFAM_CET)))
>
> For TDX_TD_XFAM_CET, maybe no need to make it TDX specific?
>
> BTW, the definitions for XFEATURE_MASK_CET_USER/XFEATURE_MASK_CET_KERNEL
> have been merged.
> https://lore.kernel.org/all/20230613001108.3040476-25-rick.p.edgecombe%40intel.com
> You can resolve the TODO in https://lore.kernel.org/kvm/5eca97e6a3978cf4dcf1cff21be6ec8b639a66b9.1708933498.git.isaku.yamahata@intel.com/

Yes, will update those constants to use the one in arch/x86/include/asm/fpu/types.h
--
Isaku Yamahata <[email protected]>

2024-04-12 20:28:46

by Isaku Yamahata

Subject: Re: [PATCH v19 081/130] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr

On Sun, Apr 07, 2024 at 01:36:46PM +0800,
Binbin Wu <[email protected]> wrote:

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b361d948140f..1b189e86a1f1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -440,6 +440,15 @@ static void kvm_user_return_msr_cpu_online(void)
> > }
> > }
> > +static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
> > +{
> > + if (!msrs->registered) {
> > + msrs->urn.on_user_return = kvm_on_user_return;
> > + user_return_notifier_register(&msrs->urn);
> > + msrs->registered = true;
> > + }
> > +}
> > +
> > int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
> > {
> > unsigned int cpu = smp_processor_id();
> > @@ -454,15 +463,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
> > return 1;
> > msrs->values[slot].curr = value;
> > - if (!msrs->registered) {
> > - msrs->urn.on_user_return = kvm_on_user_return;
> > - user_return_notifier_register(&msrs->urn);
> > - msrs->registered = true;
> > - }
> > + kvm_user_return_register_notifier(msrs);
> > return 0;
> > }
> > EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
> > +/* Update the cache, "curr", and register the notifier */
> Not sure this comment is necessary, since the code is simple.

Ok, let's remove it.


> > +void kvm_user_return_update_cache(unsigned int slot, u64 value)
>
> As a public API, is it better to use "kvm_user_return_msr_update_cache"
> instead of "kvm_user_return_update_cache"?
> Although it makes the API name longer...

Yes, other functions consistently use user_return_msr. We should do so here too.
--
Isaku Yamahata <[email protected]>
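The point of the refactor above, updating the cached "curr" value without issuing a WRMSR because the TDX module already left that value in hardware, and only writing the host value back on return to user space, can be modeled in a few lines. All names below are illustrative stand-ins for KVM's user-return MSR machinery, not its actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one user-return MSR slot. */
struct urn_msr {
	uint64_t host;	/* value the host expects in the MSR */
	uint64_t curr;	/* value currently in hardware */
};

static int wrmsr_count;	/* counts simulated WRMSRs */

/*
 * Analogue of kvm_user_return_msr_update_cache(): record that hardware
 * already holds "value" (e.g. after a TD exit) without touching the MSR.
 */
static void msr_update_cache(struct urn_msr *m, uint64_t value)
{
	m->curr = value;
}

/*
 * Analogue of the user-return notifier: restore the host value only if
 * the cached value differs, so a WRMSR happens at most once.
 */
static void on_user_return(struct urn_msr *m)
{
	if (m->curr != m->host) {
		wrmsr_count++;	/* simulated WRMSR back to host value */
		m->curr = m->host;
	}
}
```

The design point this sketches: TD exit clobbers the MSR to a known default, so the cache update is free, and the cost of restoring the host value is deferred and deduplicated at the user-return boundary.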

2024-04-12 20:32:58

by Isaku Yamahata

Subject: Re: [PATCH v19 082/130] KVM: TDX: restore user ret MSRs

On Sun, Apr 07, 2024 at 01:59:03PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Several user ret MSRs are clobbered on TD exit. Restore those values on
> > TD exit
>
> Here "Restore" is not accurate, since the previous patch just updates the
> cached value on TD exit.

Sure, let me update it.


> > and before returning to ring 3. Because TSX_CTRL requires special
> > treatment, this patch doesn't address it.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 43 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 199226c6cf55..7e2b1e554246 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -535,6 +535,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > */
> > }
> > +struct tdx_uret_msr {
> > + u32 msr;
> > + unsigned int slot;
> > + u64 defval;
> > +};
> > +
> > +static struct tdx_uret_msr tdx_uret_msrs[] = {
> > + {.msr = MSR_SYSCALL_MASK, .defval = 0x20200 },
> > + {.msr = MSR_STAR,},
> > + {.msr = MSR_LSTAR,},
> > + {.msr = MSR_TSC_AUX,},
> > +};
> > +
> > +static void tdx_user_return_update_cache(void)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> > + kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
> > + tdx_uret_msrs[i].defval);
> > +}
> > +
> > static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
> > {
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > @@ -627,6 +649,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > tdx_vcpu_enter_exit(tdx);
> > + tdx_user_return_update_cache();
> > tdx_restore_host_xsave_state(vcpu);
> > tdx->host_state_need_restore = true;
> > @@ -1972,6 +1995,26 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> > return -EINVAL;
> > }
> > + for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
> > + /*
> > + * Here it checks if MSRs (tdx_uret_msrs) can be saved/restored
> > + * before returning to user space.
> > + *
> > + * this_cpu_ptr(user_return_msrs)->registered isn't checked
> > + * because the registration is done at vcpu runtime by
> > + * kvm_set_user_return_msr().
> Should be tdx_user_return_update_cache(), if it's the final API name.

Yes, it will be tdx_user_return_msr_update_cache().


> > + * Here is setting up cpu feature before running vcpu,
> > + * registered is already false.
>                                   ^
>                            remove "already"?

We can remove this sentence.
--
Isaku Yamahata <[email protected]>

2024-04-12 20:35:15

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 083/130] KVM: TDX: Add TSX_CTRL msr into uret_msrs list

On Sun, Apr 07, 2024 at 03:05:21PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Yang Weijiang <[email protected]>
> >
> > TDX module resets the TSX_CTRL MSR to 0 at TD exit if TSX is enabled for
> > TD. Or it preserves the TSX_CTRL MSR if TSX is disabled for TD. VMM can
> > rely on uret_msrs mechanism to defer the reload of host value until exiting
> > to user space.
> >
> > Signed-off-by: Yang Weijiang <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v19:
> > - fix the type of tdx_uret_tsx_ctrl_slot. unsigned int => int.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++--
> > arch/x86/kvm/vmx/tdx.h | 8 ++++++++
> > 2 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 7e2b1e554246..83dcaf5b6fbd 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -547,14 +547,21 @@ static struct tdx_uret_msr tdx_uret_msrs[] = {
> > {.msr = MSR_LSTAR,},
> > {.msr = MSR_TSC_AUX,},
> > };
> > +static int tdx_uret_tsx_ctrl_slot;
> > -static void tdx_user_return_update_cache(void)
> > +static void tdx_user_return_update_cache(struct kvm_vcpu *vcpu)
> > {
> > int i;
> > for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> > kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
> > tdx_uret_msrs[i].defval);
> > + /*
> > + * TSX_CTRL is reset to 0 if guest TSX is supported. Otherwise
> > + * preserved.
> > + */
> > + if (to_kvm_tdx(vcpu->kvm)->tsx_supported && tdx_uret_tsx_ctrl_slot != -1)
>
> If to_kvm_tdx(vcpu->kvm)->tsx_supported is true, tdx_uret_tsx_ctrl_slot
> shouldn't be -1 at this point.
> Otherwise, it's a KVM bug, right?
> Not sure if it needs a warning if tdx_uret_tsx_ctrl_slot is -1, or just
> remove the check?

You're right. Let me remove the != -1 check.
--
Isaku Yamahata <[email protected]>

2024-04-12 21:43:09

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

On Fri, Apr 12, 2024 at 09:15:29AM -0700,
Reinette Chatre <[email protected]> wrote:

> Hi Isaku,
>
> On 2/26/2024 12:26 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
>
> ...
>
> > @@ -218,6 +257,87 @@ static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > free_page((unsigned long)__va(td_page_pa));
> > }
> >
> > +struct tdx_flush_vp_arg {
> > + struct kvm_vcpu *vcpu;
> > + u64 err;
> > +};
> > +
> > +static void tdx_flush_vp(void *arg_)
> > +{
> > + struct tdx_flush_vp_arg *arg = arg_;
> > + struct kvm_vcpu *vcpu = arg->vcpu;
> > + u64 err;
> > +
> > + arg->err = 0;
> > + lockdep_assert_irqs_disabled();
> > +
> > + /* Task migration can race with CPU offlining. */
> > + if (unlikely(vcpu->cpu != raw_smp_processor_id()))
> > + return;
> > +
> > + /*
> > + * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
> > + * list tracking still needs to be updated so that it's correct if/when
> > + * the vCPU does get initialized.
> > + */
> > + if (is_td_vcpu_created(to_tdx(vcpu))) {
> > + /*
> > + * No need to retry. TDX Resources needed for TDH.VP.FLUSH are,
> > + * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
> > + * vp flush function is called when destructing vcpu/TD or vcpu
> > + * migration. No other thread uses TDVPR in those cases.
> > + */
>
> (I have comment later that refer back to this comment about needing retry.)
>
> ...
>
> > @@ -233,26 +353,31 @@ static void tdx_do_tdh_phymem_cache_wb(void *unused)
> > pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> > }
> >
> > -void tdx_mmu_release_hkid(struct kvm *kvm)
> > +static int __tdx_mmu_release_hkid(struct kvm *kvm)
> > {
> > bool packages_allocated, targets_allocated;
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > cpumask_var_t packages, targets;
> > + struct kvm_vcpu *vcpu;
> > + unsigned long j;
> > + int i, ret = 0;
> > u64 err;
> > - int i;
> >
> > if (!is_hkid_assigned(kvm_tdx))
> > - return;
> > + return 0;
> >
> > if (!is_td_created(kvm_tdx)) {
> > tdx_hkid_free(kvm_tdx);
> > - return;
> > + return 0;
> > }
> >
> > packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> > cpus_read_lock();
> >
> > + kvm_for_each_vcpu(j, vcpu, kvm)
> > + tdx_flush_vp_on_cpu(vcpu);
> > +
> > /*
> > * We can destroy multiple guest TDs simultaneously. Prevent
> > * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> > @@ -270,6 +395,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> > */
> > write_lock(&kvm->mmu_lock);
> >
> > + err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
> > + if (err == TDX_FLUSHVP_NOT_DONE) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > + if (WARN_ON_ONCE(err)) {
> > + pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
> > + pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
> > + kvm_tdx->hkid);
> > + ret = -EIO;
> > + goto out;
> > + }
> > +
> > for_each_online_cpu(i) {
> > if (packages_allocated &&
> > cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > @@ -291,14 +429,24 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> > pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> > pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
> > kvm_tdx->hkid);
> > + ret = -EIO;
> > } else
> > tdx_hkid_free(kvm_tdx);
> >
> > +out:
> > write_unlock(&kvm->mmu_lock);
> > mutex_unlock(&tdx_lock);
> > cpus_read_unlock();
> > free_cpumask_var(targets);
> > free_cpumask_var(packages);
> > +
> > + return ret;
> > +}
> > +
> > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > +{
> > + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> > + ;
> > }
>
> As I understand, __tdx_mmu_release_hkid() returns -EBUSY
> after TDH.VP.FLUSH has been sent for every vCPU followed by
> TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
>
> Considering earlier comment that a retry of TDH.VP.FLUSH is not
> needed, why is this while() loop here that sends the
> TDH.VP.FLUSH again to all vCPUs instead of just a loop within
> __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
>
> Could it be possible for a vCPU to appear during this time, thus
> be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
> TDH.VP.FLUSH?

Yes. There is a race between closing the KVM vCPU fd and the MMU notifier
release hook. When the KVM vCPU fd is closed, the vCPU context can be loaded
again. The MMU notifier release hook eventually calls tdx_mmu_release_hkid().
Another kernel thread (concretely, a vhost kernel thread) can take a reference
count on the mm and put it from a timer, so the MMU notifier release hook can
be triggered while the vCPU fd is being closed.

A possible alternative is to complicate the vCPU closing path so that it does
not load the vCPU context, instead of sending an IPI on every retry.


> I note that TDX_FLUSHVP_NOT_DONE is distinct from TDX_OPERAND_BUSY
> that can also be returned from TDH.MNG.VPFLUSHDONE and
> wonder if a retry may be needed in that case also/instead? It looks like
> TDH.MNG.VPFLUSHDONE needs exclusive access to all operands and I
> do not know enough yet if this is the case here.

Because we're destroying the guest and hold mmu_lock, we shouldn't have other
threads racing. We can probably simply retry on TDX_OPERAND_BUSY without
worrying about a race. It would be more robust and easier to understand.
--
Isaku Yamahata <[email protected]>

2024-04-12 22:46:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
> > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > +{
> > > + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> > > + ;
> > > }
> >
> > As I understand, __tdx_mmu_release_hkid() returns -EBUSY
> > after TDH.VP.FLUSH has been sent for every vCPU followed by
> > TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
> >
> > Considering earlier comment that a retry of TDH.VP.FLUSH is not
> > needed, why is this while() loop here that sends the
> > TDH.VP.FLUSH again to all vCPUs instead of just a loop within
> > __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
> >
> > Could it be possible for a vCPU to appear during this time, thus
> > be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
> > TDH.VP.FLUSH?
>
> Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
> When KVM vCPU fd is closed, vCPU context can be loaded again.

But why is _loading_ a vCPU context problematic? If I'm reading the TDX module
code correctly, TDX_FLUSHVP_NOT_DONE is returned when a vCPU is "associated" with
a pCPU, and association only happens during TDH.VP_ENTER, TDH.MNG.RD, and TDH.MNG.WR,
none of which I see in tdx_vcpu_load().

Assuming there is something problematic lurking under vcpu_load(), I would love,
love, LOVE an excuse to not do vcpu_{load,put}() in kvm_unload_vcpu_mmu(), i.e.
get rid of that thing entirely.

I have definitely looked into kvm_unload_vcpu_mmu() on more than one occasion,
but I can't remember off the top of my head why I have never yanked out the
vcpu_{load,put}(). Maybe I was just scared of breaking something and didn't have
a good reason to risk breakage?

> The MMU notifier release hook eventually calls tdx_mmu_release_hkid(). Other
> kernel thread (concretely, a vhost kernel thread) can get a reference count to
> mmu and put it by timer, the MMU notifier release hook can be triggered
> during closing vCPU fd.
>
> The possible alternative is to make the vCPU closing path complicated not to
> load vCPU context instead of sending an IPI on every retry.

2024-04-13 00:20:43

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 088/130] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior

On Sun, Apr 07, 2024 at 06:52:44PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Add a flag, KVM_DEBUGREG_AUTO_SWITCHED_GUEST, to skip saving/restoring DRs
> > irrespective of any other flags.
>
> Here "irrespective of any other flags" sounds like other flags will be
> ignored if KVM_DEBUGREG_AUTO_SWITCHED_GUEST is set.
> But the code below doesn't align with it.

Sure, let's update the commit message.


> > TDX-SEAM unconditionally saves and
> > restores guest DRs and reset to architectural INIT state on TD exit.
> > So, KVM needs to save host DRs before TD enter without restoring guest DRs
> > and restore host DRs after TD exit.
> >
> > Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
> >
> > Reported-by: Xiaoyao Li <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Co-developed-by: Chao Gao <[email protected]>
> > Signed-off-by: Chao Gao <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 10 ++++++++--
> > arch/x86/kvm/vmx/tdx.c | 1 +
> > arch/x86/kvm/x86.c | 11 ++++++++---
> > 3 files changed, 17 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 3ab85c3d86ee..a9df898c6fbd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -610,8 +610,14 @@ struct kvm_pmu {
> > struct kvm_pmu_ops;
> > enum {
> > - KVM_DEBUGREG_BP_ENABLED = 1,
> > - KVM_DEBUGREG_WONT_EXIT = 2,
> > + KVM_DEBUGREG_BP_ENABLED = BIT(0),
> > + KVM_DEBUGREG_WONT_EXIT = BIT(1),
> > + /*
> > + * Guest debug registers (DR0-3 and DR6) are saved/restored by hardware
> > + * on exit from or enter to guest. KVM needn't switch them. Because DR7
> > + * is cleared on exit from guest, DR7 need to be saved/restored.
> > + */
> > + KVM_DEBUGREG_AUTO_SWITCH = BIT(2),
> > };
> > struct kvm_mtrr_range {
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 7aa9188f384d..ab7403a19c5d 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -586,6 +586,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> > + vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
> > vcpu->arch.cr0_guest_owned_bits = -1ul;
> > vcpu->arch.cr4_guest_owned_bits = -1ul;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b189e86a1f1..fb7597c22f31 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11013,7 +11013,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > if (vcpu->arch.guest_fpu.xfd_err)
> > wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> > - if (unlikely(vcpu->arch.switch_db_regs)) {
> > + if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
> > set_debugreg(0, 7);
> > set_debugreg(vcpu->arch.eff_db[0], 0);
> > set_debugreg(vcpu->arch.eff_db[1], 1);
> > @@ -11059,6 +11059,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > */
> > if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
> > WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
> > + WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
> > static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
> > kvm_update_dr0123(vcpu);
> > kvm_update_dr7(vcpu);
> > @@ -11071,8 +11072,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > * care about the messed up debug address registers. But if
> > * we have some of them active, restore the old state.
> > */
> > - if (hw_breakpoint_active())
> > - hw_breakpoint_restore();
> > + if (hw_breakpoint_active()) {
> > + if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
> > + hw_breakpoint_restore();
> > + else
> > + set_debugreg(__this_cpu_read(cpu_dr7), 7);
>
> According to TDX module 1.5 ABI spec:
> DR0-3, DR6 and DR7 are set to their architectural INIT value, why is only
> DR7 restored?

This hunk should be dropped. Thank you for finding this.

I checked the base SPEC, the ABI spec, and the TDX module code. It seems to be
a documentation bug in the TDX module 1.5 base architecture specification.


The TDX module code:
- restores guest DR<N> on TD Entry to guest.
- saves guest DR<N> on TD Exit from guest TD
- initializes DR<N> on TD Exit to host VMM

TDX module 1.5 base architecture specification:
15.1.2.1 Context Switch
By design, the Intel TDX module context-switches all debug/tracing state that
the guest TD is allowed to use.
DR0-3, DR6 and IA32_DS_AREA MSR are context-switched in TDH.VP.ENTER and
TD exit flows
RFLAGS, IA32_DEBUGCTL MSR and DR7 are saved and cleared on VM exits from
the guest TD and restored on VM entry to the guest TD.

TDX module 1.5 ABI specification:
5.3.65. TDH.VP.ENTER Leaf
CPU State Preservation Following a Successful TD Entry and a TD Exit
Following a successful TD entry and a TD exit, some CPU state is modified:
Registers DR0, DR1, DR2, DR3, DR6 and DR7 are set to their architectural
INIT value.
--
Isaku Yamahata <[email protected]>

2024-04-13 00:40:49

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

On Fri, Apr 12, 2024 at 03:46:05PM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> > On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
> > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > +{
> > > > + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> > > > + ;
> > > > }
> > >
> > > As I understand, __tdx_mmu_release_hkid() returns -EBUSY
> > > after TDH.VP.FLUSH has been sent for every vCPU followed by
> > > TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
> > >
> > > Considering earlier comment that a retry of TDH.VP.FLUSH is not
> > > needed, why is this while() loop here that sends the
> > > TDH.VP.FLUSH again to all vCPUs instead of just a loop within
> > > __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
> > >
> > > Could it be possible for a vCPU to appear during this time, thus
> > > be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
> > > TDH.VP.FLUSH?
> >
> > Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
> > When KVM vCPU fd is closed, vCPU context can be loaded again.
>
> But why is _loading_ a vCPU context problematic?

It's not problematic per se. It just becomes a bit harder to understand why
tdx_mmu_release_hkid() issues an IPI on each loop iteration. I think it's
reasonable to keep the normal path simple and to complicate/penalize the
destruction path. Probably I should've added a comment on the function.
--
Isaku Yamahata <[email protected]>

2024-04-15 08:27:24

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On 2/26/2024 4:25 PM, [email protected] wrote:

..

> +
> + kvm_tdx->tdcs_pa = tdcs_pa;
> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> + err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> + if (err == TDX_RND_NO_ENTROPY) {
> + /* Here it's hard to allow userspace to retry. */
> + ret = -EBUSY;

So userspace is expected to stop creating the TD and quit on this?

If so, it exposes a DoS attack surface: malicious users in another VM
can drain the entropy with a busy-loop on RDSEED.

Can you clarify why it's hard to allow userspace to retry? To me, it's
OK to retry: "teardown" cleans everything up, and userspace can then
issue KVM_TDX_INIT_VM again.

> + goto teardown;
> + }
> + if (WARN_ON_ONCE(err)) {
> + pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> + ret = -EIO;
> + goto teardown;
> + }
> + }
> +
> + /*
> + * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> + * ioctl() to configure the CPUID values for the TD.
> + */
> + return 0;
> +
> + /*
> + * The sequence for freeing resources from a partially initialized TD
> + * varies based on where in the initialization flow failure occurred.
> + * Simply use the full teardown and destroy, which naturally play nice
> + * with partial initialization.
> + */
> +teardown:
> + for (; i < tdx_info->nr_tdcs_pages; i++) {
> + if (tdcs_pa[i]) {
> + free_page((unsigned long)__va(tdcs_pa[i]));
> + tdcs_pa[i] = 0;
> + }
> + }
> + if (!kvm_tdx->tdcs_pa)
> + kfree(tdcs_pa);
> + tdx_mmu_release_hkid(kvm);
> + tdx_vm_free(kvm);
> + return ret;
> +
> +free_packages:
> + cpus_read_unlock();
> + free_cpumask_var(packages);
> +free_tdcs:
> + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> + if (tdcs_pa[i])
> + free_page((unsigned long)__va(tdcs_pa[i]));
> + }
> + kfree(tdcs_pa);
> + kvm_tdx->tdcs_pa = NULL;
> +
> +free_tdr:
> + if (tdr_pa)
> + free_page((unsigned long)__va(tdr_pa));
> + kvm_tdx->tdr_pa = 0;
> +free_hkid:
> + if (is_hkid_assigned(kvm_tdx))
> + tdx_hkid_free(kvm_tdx);
> + return ret;
> +}
> +



2024-04-15 13:49:48

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> On Fri, Apr 12, 2024 at 03:46:05PM -0700,
> Sean Christopherson <[email protected]> wrote:
>
> > On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> > > On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
> > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > > +{
> > > > > + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> > > > > + ;
> > > > > }
> > > >
> > > > As I understand, __tdx_mmu_release_hkid() returns -EBUSY
> > > > after TDH.VP.FLUSH has been sent for every vCPU followed by
> > > > TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
> > > >
> > > > Considering earlier comment that a retry of TDH.VP.FLUSH is not
> > > > needed, why is this while() loop here that sends the
> > > > TDH.VP.FLUSH again to all vCPUs instead of just a loop within
> > > > __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
> > > >
> > > > Could it be possible for a vCPU to appear during this time, thus
> > > > be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
> > > > TDH.VP.FLUSH?
> > >
> > > Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
> > > When KVM vCPU fd is closed, vCPU context can be loaded again.
> >
> > But why is _loading_ a vCPU context problematic?
>
> It's nothing problematic. It becomes a bit harder to understand why
> tdx_mmu_release_hkid() issues IPI on each loop. I think it's reasonable
> to make the normal path easy and to complicate/penalize the destruction path.
> Probably I should've added comment on the function.

By "problematic", I meant, why can that result in a "missed in one TDH.VP.FLUSH
cycle"? AFAICT, loading a vCPU shouldn't cause that vCPU to be associated from
the TDX module's perspective, and thus shouldn't trigger TDX_FLUSHVP_NOT_DONE.

I.e. looping should be unnecessary, no?

2024-04-15 22:48:42

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

On Mon, Apr 15, 2024 at 06:49:35AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> > On Fri, Apr 12, 2024 at 03:46:05PM -0700,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> > > > On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
> > > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > > > +{
> > > > > > + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> > > > > > + ;
> > > > > > }
> > > > >
> > > > > As I understand, __tdx_mmu_release_hkid() returns -EBUSY
> > > > > after TDH.VP.FLUSH has been sent for every vCPU followed by
> > > > > TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
> > > > >
> > > > > Considering earlier comment that a retry of TDH.VP.FLUSH is not
> > > > > needed, why is this while() loop here that sends the
> > > > > TDH.VP.FLUSH again to all vCPUs instead of just a loop within
> > > > > __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
> > > > >
> > > > > Could it be possible for a vCPU to appear during this time, thus
> > > > > be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
> > > > > TDH.VP.FLUSH?
> > > >
> > > > Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
> > > > When KVM vCPU fd is closed, vCPU context can be loaded again.
> > >
> > > But why is _loading_ a vCPU context problematic?
> >
> > It's nothing problematic. It becomes a bit harder to understand why
> > tdx_mmu_release_hkid() issues IPI on each loop. I think it's reasonable
> > to make the normal path easy and to complicate/penalize the destruction path.
> > Probably I should've added comment on the function.
>
> By "problematic", I meant, why can that result in a "missed in one TDH.VP.FLUSH
> cycle"? AFAICT, loading a vCPU shouldn't cause that vCPU to be associated from
> the TDX module's perspective, and thus shouldn't trigger TDX_FLUSHVP_NOT_DONE.
>
> I.e. looping should be unnecessary, no?

The loop is unnecessary with the current code.

A possible future optimization is to somehow reduce the destruction time of
the Secure-EPT. One option is to release the HKID while vCPUs are still alive
and destroy the Secure-EPT using multiple vCPU contexts. Because that's a
future optimization, we can ignore it at this phase.
--
Isaku Yamahata <[email protected]>

2024-04-15 22:51:27

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 097/130] KVM: x86: Split core of hypercall emulation to helper function

On Tue, Apr 09, 2024 at 05:28:05PM +0800,
Binbin Wu <[email protected]> wrote:

> > +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > +{
> > + unsigned long nr, a0, a1, a2, a3, ret;
> > + int op_64_bit;
>
> Can it be opportunistically changed to bool type, as well as the argument
> type of "op_64_bit" in __kvm_emulate_hypercall()?

Yes. We can also fix kvm_pv_send_ipi(op_64_bit).
--
Isaku Yamahata <[email protected]>

2024-04-15 22:58:22

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 098/130] KVM: TDX: Add a place holder to handle TDX VM exit

On Tue, Apr 09, 2024 at 06:36:01PM +0800,
Binbin Wu <[email protected]> wrote:

> > + return 1;
> > +
> > + /*
> > + * TDH.VP.ENTRY
>
> "TDH.VP.ENTRY" -> "TDH.VP.ENTER"
>
> > checks TD EPOCH which contend with TDH.MEM.TRACK and
> > + * vcpu TDH.VP.ENTER.
> Do you mean TDH.VP.ENTER on one vcpu can contend with TDH.MEM.TRACK and
> TDH.VP.ENTER on another vcpu?

Yes. The caller of TDH.MEM.TRACK() must ensure that all other vCPUs go through
an inactive state (i.e. not running) after TDH.MEM.TRACK().
--
Isaku Yamahata <[email protected]>

2024-04-16 00:06:23

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor



On 16/04/2024 10:48 am, Yamahata, Isaku wrote:
> On Mon, Apr 15, 2024 at 06:49:35AM -0700,
> Sean Christopherson <[email protected]> wrote:
>
>> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
>>> On Fri, Apr 12, 2024 at 03:46:05PM -0700,
>>> Sean Christopherson <[email protected]> wrote:
>>>
>>>> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
>>>>> On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
>>>>>>> +void tdx_mmu_release_hkid(struct kvm *kvm)
>>>>>>> +{
>>>>>>> + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
>>>>>>> + ;
>>>>>>> }
>>>>>>
>>>>>> As I understand, __tdx_mmu_release_hkid() returns -EBUSY
>>>>>> after TDH.VP.FLUSH has been sent for every vCPU followed by
>>>>>> TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
>>>>>>
>>>>>> Considering earlier comment that a retry of TDH.VP.FLUSH is not
>>>>>> needed, why is this while() loop here that sends the
>>>>>> TDH.VP.FLUSH again to all vCPUs instead of just a loop within
>>>>>> __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
>>>>>>
>>>>>> Could it be possible for a vCPU to appear during this time, thus
>>>>>> be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
>>>>>> TDH.VP.FLUSH?
>>>>>
>>>>> Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
>>>>> When KVM vCPU fd is closed, vCPU context can be loaded again.
>>>>
>>>> But why is _loading_ a vCPU context problematic?
>>>
>>> It's nothing problematic. It becomes a bit harder to understand why
>>> tdx_mmu_release_hkid() issues IPI on each loop. I think it's reasonable
>>> to make the normal path easy and to complicate/penalize the destruction path.
>>> Probably I should've added comment on the function.
>>
>> By "problematic", I meant, why can that result in a "missed in one TDH.VP.FLUSH
>> cycle"? AFAICT, loading a vCPU shouldn't cause that vCPU to be associated from
>> the TDX module's perspective, and thus shouldn't trigger TDX_FLUSHVP_NOT_DONE.
>>
>> I.e. looping should be unnecessary, no?
>
> The loop is unnecessary with the current code.
>
> The possible future optimization is to reduce destruction time of Secure-EPT
> somehow. One possible option is to release HKID while vCPUs are still alive and
> destruct Secure-EPT with multiple vCPU context. Because that's future
> optimization, we can ignore it at this phase.

I'm kinda lost here.

I thought in the current v19 code, you have already implemented this
optimization?

Or is this optimization totally different from what we discussed in an
earlier patch?

https://lore.kernel.org/lkml/[email protected]/



2024-04-16 00:56:06

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions



On 5/03/2024 9:21 pm, Isaku Yamahata wrote:
> On Fri, Mar 01, 2024 at 03:25:31PM +0800,
> Yan Zhao <[email protected]> wrote:
>
>>> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
>>> + */
>>> +#define TDX_MAX_VCPUS (~(u16)0)
>> This value will be treated as -1 in tdx_vm_init(),
>> "kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);"
>>
>> This will lead to kvm->max_vcpus being -1 by default.
>> Is this by design or just an error?
>> If it's by design, why not set kvm->max_vcpus = -1 in tdx_vm_init() directly.
>> If an unexpected error, may below is better?
>>
>> #define TDX_MAX_VCPUS (int)((u16)(~0UL))
>> or
>> #define TDX_MAX_VCPUS 65536
>
> You're right. I'll use ((int)U16_MAX).
> As TDX 1.5 introduced metadata MAX_VCPUS_PER_TD, I'll update to get the value
> and trim it further. Something following.
>

[...]

>
> + u16 max_vcpus_per_td;
> +

[...]

> - kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
> + kvm->max_vcpus = min3(kvm->max_vcpus, tdx_info->max_vcpus_per_td,
> + TDX_MAX_VCPUS);
>

[...]

> -#define TDX_MAX_VCPUS (~(u16)0)
> +#define TDX_MAX_VCPUS ((int)U16_MAX)

Why do you even need TDX_MAX_VCPUS, given it cannot exceed U16_MAX and
you will have the 'u16 max_vcpus_per_td' anyway?

IIUC, in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS), we can overwrite the
kvm->max_vcpus to the 'max_vcpus' provided by the userspace, and make
sure it doesn't exceed tdx_info->max_vcpus_per_td.

Anything I am missing?


2024-04-16 14:17:49

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 089/130] KVM: TDX: Add support for find pending IRQ in a protected local APIC

On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
>  
> +static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> +{
> +       KVM_BUG_ON(!is_td_vcpu(vcpu), vcpu->kvm);
> +
> +       return tdx_protected_apic_has_interrupt(vcpu);
> +}
> +

There was some internal discussion on whether to drop the KVM_BUG_ON() here, and
then since vt_protected_apic_has_interrupt() would be just a simple call to
tdx_protected_apic_has_interrupt(), just wire in
tdx_protected_apic_has_interrupt() directly.

The reasoning for dropping the KVM_BUG_ON() is that the function has "protected"
in its name so is unlikely to be called in another context. It was apparently
added when the TDX series was still new. But with the more mature series it
doesn't seem likely to find any bugs. The caller checks
vcpu->arch.apic->guest_apic_protected.
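With the KVM_BUG_ON() wrapper dropped, the vt_* trampoline can go away entirely. A minimal C sketch of the direct wiring follows; the struct shapes and the function body are illustrative stand-ins, not KVM's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

struct kvm_vcpu {
	bool pending_nmi;	/* illustrative stand-in for real vCPU state */
};

/* TDX-specific implementation; the body here is a placeholder. */
static bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	return vcpu->pending_nmi;
}

/* With the KVM_BUG_ON() wrapper dropped, the ops table points at the
 * TDX function directly instead of going through a vt_* trampoline. */
struct kvm_x86_ops {
	bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);
};

static const struct kvm_x86_ops vt_x86_ops = {
	.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt,
};
```

The design choice mirrors the argument above: the caller already checks guest_apic_protected, so the indirection buys nothing.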

2024-04-16 16:28:43

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions

On Tue, Apr 16, 2024 at 12:55:33PM +1200,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 5/03/2024 9:21 pm, Isaku Yamahata wrote:
> > On Fri, Mar 01, 2024 at 03:25:31PM +0800,
> > Yan Zhao <[email protected]> wrote:
> >
> > > > + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> > > > + */
> > > > +#define TDX_MAX_VCPUS (~(u16)0)
> > > This value will be treated as -1 in tdx_vm_init(),
> > > "kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);"
> > >
> > > This will lead to kvm->max_vcpus being -1 by default.
> > > Is this by design or just an error?
> > > If it's by design, why not set kvm->max_vcpus = -1 in tdx_vm_init() directly.
> > > If it's an unexpected error, maybe the below is better?
> > >
> > > #define TDX_MAX_VCPUS (int)((u16)(~0UL))
> > > or
> > > #define TDX_MAX_VCPUS 65536
> >
> > You're right. I'll use ((int)U16_MAX).
> > As TDX 1.5 introduced metadata MAX_VCPUS_PER_TD, I'll update to get the value
> > and trim it further. Something following.
> >
>
> [...]
>
> > + u16 max_vcpus_per_td;
> > +
>
> [...]
>
> > - kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
> > + kvm->max_vcpus = min3(kvm->max_vcpus, tdx_info->max_vcpus_per_td,
> > + TDX_MAX_VCPUS);
>
> [...]
>
> > -#define TDX_MAX_VCPUS (~(u16)0)
> > +#define TDX_MAX_VCPUS ((int)U16_MAX)
>
> Why do you even need TDX_MAX_VCPUS, given it cannot exceed U16_MAX and you
> will have the 'u16 max_vcpus_per_td' anyway?
>
> IIUC, in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS), we can overwrite the
> kvm->max_vcpus to the 'max_vcpus' provided by the userspace, and make sure
> it doesn't exceed tdx_info->max_vcpus_per_td.
>
> Anything I am missing?

With the latest TDX 1.5 module, we don't need TDX_MAX_VCPUS.

The metadata field MD_FIELD_ID_MAX_VCPUS_PER_TD was introduced in a mid-release
version of TDX 1.5 (I don't remember the exact version), so the logic was
something like the following. Now, if we fail to read the metadata, we disable TDX instead.

read metadata MD_FIELD_ID_MAX_VCPUS_PER_TD;
if success
	tdx_info->max_vcpus_per_td = the value read from the metadata;
else
	tdx_info->max_vcpus_per_td = TDX_MAX_VCPUS;
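Expressed as C, that earlier fallback logic might look like the following sketch. The tdx_info layout, the simulated metadata read, and the example value are illustrative assumptions, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

#define TDX_MAX_VCPUS ((int)UINT16_MAX)

struct tdx_info {
	uint16_t max_vcpus_per_td;
};

/* Hypothetical stand-in for reading MD_FIELD_ID_MAX_VCPUS_PER_TD:
 * returns 0 on success, negative on error. */
static int read_max_vcpus_metadata(uint16_t *val, int simulate_failure)
{
	if (simulate_failure)
		return -1;
	*val = 576;	/* example value a TDX 1.5 module might report */
	return 0;
}

/* The fallback described above: use the metadata value when the read
 * succeeds, otherwise fall back to the u16 maximum. */
static void tdx_init_max_vcpus(struct tdx_info *info, int simulate_failure)
{
	uint16_t v;

	if (!read_max_vcpus_metadata(&v, simulate_failure))
		info->max_vcpus_per_td = v;
	else
		info->max_vcpus_per_td = TDX_MAX_VCPUS;
}

/* Per-VM clamp, mirroring the min3() in tdx_vm_init(). */
static int tdx_clamp_max_vcpus(int kvm_max_vcpus, const struct tdx_info *info)
{
	int m = kvm_max_vcpus < (int)info->max_vcpus_per_td ?
		kvm_max_vcpus : (int)info->max_vcpus_per_td;

	return m < TDX_MAX_VCPUS ? m : TDX_MAX_VCPUS;
}
```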

--
Isaku Yamahata <[email protected]>

2024-04-16 16:46:02

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor

On Tue, Apr 16, 2024 at 12:05:31PM +1200,
"Huang, Kai" <[email protected]> wrote:

>
>
> On 16/04/2024 10:48 am, Yamahata, Isaku wrote:
> > On Mon, Apr 15, 2024 at 06:49:35AM -0700,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> > > > On Fri, Apr 12, 2024 at 03:46:05PM -0700,
> > > > Sean Christopherson <[email protected]> wrote:
> > > >
> > > > > On Fri, Apr 12, 2024, Isaku Yamahata wrote:
> > > > > > On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
> > > > > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > > > > > +{
> > > > > > > > + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
> > > > > > > > + ;
> > > > > > > > }
> > > > > > >
> > > > > > > As I understand, __tdx_mmu_release_hkid() returns -EBUSY
> > > > > > > after TDH.VP.FLUSH has been sent for every vCPU followed by
> > > > > > > TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
> > > > > > >
> > > > > > > Considering earlier comment that a retry of TDH.VP.FLUSH is not
> > > > > > > needed, why is this while() loop here that sends the
> > > > > > > TDH.VP.FLUSH again to all vCPUs instead of just a loop within
> > > > > > > __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
> > > > > > >
> > > > > > > Could it be possible for a vCPU to appear during this time, thus
> > > > > > > be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
> > > > > > > TDH.VP.FLUSH?
> > > > > >
> > > > > > Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
> > > > > > When KVM vCPU fd is closed, vCPU context can be loaded again.
> > > > >
> > > > > But why is _loading_ a vCPU context problematic?
> > > >
> > > > It's nothing problematic. It becomes a bit harder to understand why
> > > > tdx_mmu_release_hkid() issues IPI on each loop. I think it's reasonable
> > > > to make the normal path easy and to complicate/penalize the destruction path.
> > > > Probably I should've added comment on the function.
> > >
> > > By "problematic", I meant, why can that result in a "missed in one TDH.VP.FLUSH
> > > cycle"? AFAICT, loading a vCPU shouldn't cause that vCPU to be associated from
> > > the TDX module's perspective, and thus shouldn't trigger TDX_FLUSHVP_NOT_DONE.
> > >
> > > I.e. looping should be unnecessary, no?
> >
> > The loop is unnecessary with the current code.
> >
> > The possible future optimization is to reduce destruction time of Secure-EPT
> > somehow. One possible option is to release HKID while vCPUs are still alive and
> > destruct Secure-EPT with multiple vCPU context. Because that's future
> > optimization, we can ignore it at this phase.
>
> I kinda lost here.
>
> I thought in the current v19 code, you have already implemented this
> optimization?
>
> Or is this optimization totally different from what we discussed in an
> earlier patch?
>
> https://lore.kernel.org/lkml/[email protected]/

That's only the first step. We can optimize it further with multiple vCPU
contexts.
--
Isaku Yamahata <[email protected]>

2024-04-16 16:47:04

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

On Mon, Apr 15, 2024 at 04:17:35PM +0800,
Xiaoyao Li <[email protected]> wrote:

> On 2/26/2024 4:25 PM, [email protected] wrote:
>
> ...
>
> > +
> > + kvm_tdx->tdcs_pa = tdcs_pa;
> > + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > + err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> > + if (err == TDX_RND_NO_ENTROPY) {
> > + /* Here it's hard to allow userspace to retry. */
> > + ret = -EBUSY;
>
> So userspace is expected to stop creating TD and quit on this?
>
> If so, it exposes an DOS attack surface that malicious users in another can
> drain the entropy with busy-loop on RDSEED.
>
> Can you clarify why it's hard to allow userspace to retry? To me, it's OK to
> retry that "teardown" cleans everything up, and userspace and issue the
> KVM_TDX_INIT_VM again.

The current patch has a complicated error recovery path. After simplifying
the code, it would be possible to return -EAGAIN in this patch.

For the retry case, we need to avoid TDH.MNG.CREATE() and TDH.MNG.KEY.CONFIG().
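A minimal sketch of how such a retry could stay idempotent, with a hypothetical stage flag recording which SEAMCALLs already succeeded; none of these names are from the actual patch:

```c
#include <assert.h>
#include <errno.h>

enum td_stage { TD_NOTHING, TD_CREATED, TD_KEY_CONFIGURED };

struct td_state {
	enum td_stage stage;
};

static int create_calls;	/* counts simulated TDH.MNG.CREATE() calls */

/* Hypothetical retryable init: completed stages are recorded so that a
 * retry after -EAGAIN skips TDH.MNG.CREATE() and TDH.MNG.KEY.CONFIG(). */
static int td_init(struct td_state *td, int entropy_available)
{
	if (td->stage < TD_CREATED) {
		create_calls++;			/* tdh_mng_create() would run here */
		td->stage = TD_CREATED;
	}

	if (td->stage < TD_KEY_CONFIGURED)
		td->stage = TD_KEY_CONFIGURED;	/* tdh_mng_key_config() would run here */

	/* tdh_mng_addcx() consumes entropy; on TDX_RND_NO_ENTROPY return
	 * -EAGAIN so userspace can simply retry KVM_TDX_INIT_VM. */
	if (!entropy_available)
		return -EAGAIN;

	return 0;
}
```

On retry, the stage checks make the already-completed SEAMCALLs no-ops, which is the property the reply asks for.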
--
Isaku Yamahata <[email protected]>

2024-04-16 18:24:04

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v19 085/130] KVM: TDX: Complete interrupts after tdexit

Hi Isaku,

(In shortlog "tdexit" can be "TD exit" to be consistent with
documentation.)

On 2/26/2024 12:26 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> This corresponds to VMX __vmx_complete_interrupts(). Because TDX
> virtualize vAPIC, KVM only needs to care NMI injection.

This seems to be the first appearance of NMI and the changelog
is very brief. How about expanding it with:

"This corresponds to VMX __vmx_complete_interrupts(). Because TDX
virtualize vAPIC, KVM only needs to care about NMI injection.

KVM can request TDX to inject an NMI into a guest TD vCPU when the
vCPU is not active. TDX will attempt to inject an NMI as soon as
possible on TD entry. NMI injection is managed by writing to (to
inject NMI) and reading from (to get status of NMI injection)
the PEND_NMI field within the TDX vCPU scope metadata (Trust
Domain Virtual Processor State (TDVPS)).

Update KVM's NMI status on TD exit by checking whether a requested
NMI has been injected into the TD. Reading the metadata via SEAMCALL
is expensive so only perform the check if an NMI was injected.

This is the first need to access vCPU scope metadata in the
"management" class. Ensure that needed accessor is available.
"

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> ---
> v19:
> - move tdvps_management_check() to this patch
> - typo: complete -> Complete in short log
> ---
> arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
> arch/x86/kvm/vmx/tdx.h | 4 ++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 83dcaf5b6fbd..b8b168f74dfe 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> */
> }
>
> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> +{
> + /* Avoid costly SEAMCALL if no nmi was injected */

/* Avoid costly SEAMCALL if no NMI was injected. */

> + if (vcpu->arch.nmi_injected)
> + vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> + TD_VCPU_PEND_NMI);
> +}
> +
> struct tdx_uret_msr {
> u32 msr;
> unsigned int slot;
> @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> + tdx_complete_interrupts(vcpu);
> +
> return EXIT_FASTPATH_NONE;
> }
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 44eab734e702..0d8a98feb58e 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
> "Invalid TD VMCS access for 16-bit field");
> }
>
> +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}

Is this intended to be a stub or is it expected to be fleshed out with
some checks?

> +
> #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
> static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
> u32 field) \
> @@ -200,6 +202,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
> TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
> TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
>
> +TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
> +
> static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
> {
> struct tdx_module_args out;

Reinette


2024-04-16 18:24:45

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v19 086/130] KVM: TDX: restore debug store when TD exit

Hi Isaku,

On 2/26/2024 12:26 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Because debug store is clobbered, restore it on TD exit.
>

I am trying to understand why this is needed and finding a few
places that seem to indicate that this is not needed.

For example, in the TDX base spec, section 15.2 "On-TD Performance
Monitoring", subsection "15.2.1 Overview":
* IA32_DS_AREA MSR is context-switched across TD entry and exit transitions.

To confirm I peeked at the TDX module code and found (if I understand this
correctly) that IA32_DS_AREA is saved on TD entry (see [1]) and restored on
TD exit (see [2]).


> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/events/intel/ds.c | 1 +
> arch/x86/kvm/vmx/tdx.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index d49d661ec0a7..25670d8a485b 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2428,3 +2428,4 @@ void perf_restore_debug_store(void)
>
> wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
> }
> +EXPORT_SYMBOL_GPL(perf_restore_debug_store);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b8b168f74dfe..ad4d3d4eaf6c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -665,6 +665,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> tdx_vcpu_enter_exit(tdx);
>
> tdx_user_return_update_cache(vcpu);
> + perf_restore_debug_store();
> tdx_restore_host_xsave_state(vcpu);
> tdx->host_state_need_restore = true;
>

Reinette

[1] https://github.com/intel/tdx-module/blob/tdx_1.5/src/td_transitions/tdh_vp_enter.c#L719
[2] https://github.com/intel/tdx-module/blob/tdx_1.5/src/td_transitions/td_exit.c#L130

2024-04-16 19:45:38

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > So my feedback is to not worry about the exports, and instead focus on
> > figuring
> > out a way to make the generated code less bloated and easier to read/debug.
>
> I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
> few megawrappers. I think we can get better results by shifting leaf
> function wrappers into assembly.
>
> We are going to have more assembly, but it should produce better results.
> Adding macros can help to write such wrappers and minimize boilerplate.
>
> Below is an example of how it can look. It's not complete. I only
> converted TDCALLs, but not TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> more complex.
>
> Any opinions? Is it something worth investing more time?

We discussed offline how implementing these for each TDVMCALL/SEAMCALL increases
the chances of a bug in just one TDVMCALL/SEAMCALL, which could make debugging
problems more challenging. Kirill raised the possibility of a code-generating
solution like cpufeatures.h that could take a spec and generate correct calls.

So far no big wins have presented themselves. Kirill, do we think the path to
move the messy part out-of-line will not work?

2024-04-16 20:58:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, Apr 12, 2024, Kai Huang wrote:
> On 12/04/2024 2:03 am, Sean Christopherson wrote:
> > On Thu, Apr 11, 2024, Kai Huang wrote:
> > > I can certainly follow up with this and generate a reviewable patchset if I
> > > can confirm with you that this is what you want?
> >
> > Yes, I think it's the right direction. I still have minor concerns about VMX
> > being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
> > enabled if KVM is built-in. But after seeing the complexity that is needed to
> > safely initialize TDX, and after seeing just how much complexity KVM already
> > has because it enables VMX on-demand (I hadn't actually tried removing that code
> > before), I think the cost of that complexity far outweighs the risk of "always"
> > being post-VMXON.
>
> Does always leaving VMXON have any actual damage, given we have emergency
> virtualization shutdown?

Being post-VMXON increases the risk of kexec() into the kdump kernel failing.
The tradeoffs that we're trying to balance are: is the risk of kexec() failing
due to the complexity of the emergency VMX code higher than the risk of us breaking
things in general due to taking on a ton of complexity to juggle VMXON for TDX?

After seeing the latest round of TDX code, my opinion is that being post-VMXON
is less risky overall, in no small part because we need that to work anyways for
hosts that are actively running VMs.

> > Within reason, I recommend getting feedback from others before you spend _too_
> > much time on this. It's entirely possible I'm missing/forgetting some other angle.
>
> Sure. Could you suggest who should we try to get feedback from?
>
> Perhaps you can just help to Cc them?

I didn't have anyone in particular in mind, I just really want *someone* to weigh
in as a sanity check.

2024-04-16 22:06:31

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 027/130] KVM: TDX: Define TDX architectural definitions



On 17/04/2024 4:28 am, Yamahata, Isaku wrote:
> On Tue, Apr 16, 2024 at 12:55:33PM +1200,
> "Huang, Kai" <[email protected]> wrote:
>
>>
>>
>> On 5/03/2024 9:21 pm, Isaku Yamahata wrote:
>>> On Fri, Mar 01, 2024 at 03:25:31PM +0800,
>>> Yan Zhao <[email protected]> wrote:
>>>
>>>>> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
>>>>> + */
>>>>> +#define TDX_MAX_VCPUS (~(u16)0)
>>>> This value will be treated as -1 in tdx_vm_init(),
>>>> "kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);"
>>>>
>>>> This will lead to kvm->max_vcpus being -1 by default.
>>>> Is this by design or just an error?
>>>> If it's by design, why not set kvm->max_vcpus = -1 in tdx_vm_init() directly.
>>>> If it's an unexpected error, maybe the below is better?
>>>>
>>>> #define TDX_MAX_VCPUS (int)((u16)(~0UL))
>>>> or
>>>> #define TDX_MAX_VCPUS 65536
>>>
>>> You're right. I'll use ((int)U16_MAX).
>>> As TDX 1.5 introduced metadata MAX_VCPUS_PER_TD, I'll update to get the value
>>> and trim it further. Something following.
>>>
>>
>> [...]
>>
>>> + u16 max_vcpus_per_td;
>>> +
>>
>> [...]
>>
>>> - kvm->max_vcpus = min(kvm->max_vcpus, TDX_MAX_VCPUS);
>>> + kvm->max_vcpus = min3(kvm->max_vcpus, tdx_info->max_vcpus_per_td,
>>> + TDX_MAX_VCPUS);
>>
>> [...]
>>
>>> -#define TDX_MAX_VCPUS (~(u16)0)
>>> +#define TDX_MAX_VCPUS ((int)U16_MAX)
>>
>> Why do you even need TDX_MAX_VCPUS, given it cannot exceed U16_MAX and you
>> will have the 'u16 max_vcpus_per_td' anyway?
>>
>> IIUC, in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS), we can overwrite the
>> kvm->max_vcpus to the 'max_vcpus' provided by the userspace, and make sure
>> it doesn't exceed tdx_info->max_vcpus_per_td.
>>
>> Anything I am missing?
>
> With the latest TDX 1.5 module, we don't need TDX_MAX_VCPUS.
>
> The metadata MD_FIELD_ID_MAX_VCPUS_PER_TD was introduced at the middle version
> of TDX 1.5. (I don't remember the exact version.), the logic was something
> like as follows. Now if we fail to read the metadata, disable TDX.
>
> read metadata MD_FIELD_ID_MAX_VCPUS_PER_TD;
> if success
> 	tdx_info->max_vcpus_per_td = the value read from the metadata;
> else
> 	tdx_info->max_vcpus_per_td = TDX_MAX_VCPUS;
>

OK. But even though the SEAMCALL can fail, we can just use U16_MAX directly
when it fails, given we can clearly see that the type of max_vcpus_per_td is 'u16'.

if success
	tdx_info->max_vcpus_per_td = the value read from the metadata;
else
	tdx_info->max_vcpus_per_td = U16_MAX;

So I don't see why TDX_MAX_VCPUS is needed (especially in tdx_arch.h as
it is not an architectural value but just some value chosen for our
convenience).


2024-04-17 02:21:49

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, Feb 26, 2024 at 12:26:01AM -0800, [email protected] wrote:
>@@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>
> lockdep_assert_held_write(&kvm->mmu_lock);
>
>+ WARN_ON_ONCE(zap_private && !is_private_sp(root));
>+ if (!zap_private && is_private_sp(root))
>+ return false;

Should be "return flush;".

Fengwei and I spent one week chasing a bug where virtio-net in the TD guest may
stop working at some point after bootup if the host enables numad. We finally
found that the bug was introduced by the 'return false' statement, which left
some stale EPT entries unflushed.

I am wondering if we can refactor related functions slightly to make it harder
to make such mistakes and make it easier to identify them. e.g., we could make
"@flush" an in/out parameter of tdp_mmu_zap_leafs(), kvm_tdp_mmu_zap_leafs()
and kvm_tdp_mmu_unmap_gfn_range(). It would be more apparent that the
"*flush = false" below is problematic if the changes were something like:

if (!zap_private && is_private_sp(root)) {
	*flush = false;
	return;
}
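A standalone sketch of this in/out-parameter idea (function and parameter names are illustrative, not the actual KVM code): the early return simply leaves *flush alone, so a pending TLB flush can no longer be silently dropped the way the stray "return false" dropped it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names: an early return leaves *flush untouched, so a
 * pending TLB flush cannot be silently discarded the way the stray
 * "return false" was. */
static void zap_leafs(bool root_is_private, bool zap_private,
		      bool some_sptes_zapped, bool *flush)
{
	if (!zap_private && root_is_private)
		return;		/* *flush keeps whatever the caller accumulated */

	if (some_sptes_zapped)
		*flush = true;
}
```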


>+
> rcu_read_lock();
>
> for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
>@@ -810,13 +815,15 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> * true if a TLB flush is needed before releasing the MMU lock, i.e. if one or
> * more SPTEs were zapped since the MMU lock was last acquired.
> */
>-bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
>+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush,
>+ bool zap_private)
> {
> struct kvm_mmu_page *root;
>
> lockdep_assert_held_write(&kvm->mmu_lock);
> for_each_tdp_mmu_root_yield_safe(kvm, root)
>- flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
>+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush,
>+ zap_private && is_private_sp(root));
>
> return flush;
> }
>@@ -891,7 +898,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
> * See kvm_tdp_mmu_get_vcpu_root_hpa().
> */
>-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
>+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool skip_private)
> {
> struct kvm_mmu_page *root;
>
>@@ -916,6 +923,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> * or get/put references to roots.
> */
> list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
>+ /*
>+ * Skip private root since private page table
>+ * is only torn down when VM is destroyed.
>+ */
>+ if (skip_private && is_private_sp(root))
>+ continue;
> /*
> * Note, invalid roots can outlive a memslot update! Invalid
> * roots must be *zapped* before the memslot update completes,
>@@ -1104,14 +1117,26 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return ret;
> }
>
>+/* Used by mmu notifier via kvm_unmap_gfn_range() */
> bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> bool flush)
> {
> struct kvm_mmu_page *root;
>+ bool zap_private = false;
>+
>+ if (kvm_gfn_shared_mask(kvm)) {
>+ if (!range->only_private && !range->only_shared)
>+ /* attributes change */
>+ zap_private = !(range->arg.attributes &
>+ KVM_MEMORY_ATTRIBUTE_PRIVATE);
>+ else
>+ zap_private = range->only_private;
>+ }
>
> __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
>- range->may_block, flush);
>+ range->may_block, flush,
>+ zap_private && is_private_sp(root));
>
> return flush;
> }
>diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
>index 20d97aa46c49..b3cf58a50357 100644
>--- a/arch/x86/kvm/mmu/tdp_mmu.h
>+++ b/arch/x86/kvm/mmu/tdp_mmu.h
>@@ -19,10 +19,11 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
>
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
>
>-bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
>+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush,
>+ bool zap_private);
> bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
>+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool skip_private);
> void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
>
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>--
>2.25.1
>
>

2024-04-17 04:03:42

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 086/130] KVM: TDX: restore debug store when TD exit

On Tue, Apr 16, 2024 at 11:24:16AM -0700, Reinette Chatre wrote:
>Hi Isaku,
>
>On 2/26/2024 12:26 AM, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> Because debug store is clobbered, restore it on TD exit.
>>
>
>I am trying to understand why this is needed and finding a few
>places that seem to indicate that this is not needed.
>
>For example, in the TDX base spec, section 15.2 "On-TD Performance
>Monitoring", subsection "15.2.1 Overview":
> * IA32_DS_AREA MSR is context-switched across TD entry and exit transitions.
>
>To confirm I peeked at the TDX module code and found (if I understand this
>correctly) that IA32_DS_AREA is saved on TD entry (see [1]) and restored on
>TD exit (see [2]).

Hi Reinette,

You are right. I asked the TDX module team to preserve IA32_DS_AREA across TD
transitions, and they made this change. So this patch can be dropped now.

2024-04-17 04:21:47

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 102/130] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT



On 4/4/2024 2:51 AM, Isaku Yamahata wrote:
> On Mon, Apr 01, 2024 at 04:22:00PM +0800,
> Chao Gao <[email protected]> wrote:
>
>> On Mon, Feb 26, 2024 at 12:26:44AM -0800, [email protected] wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> Because guest TD state is protected, exceptions in guest TDs can't be
>>> intercepted. TDX VMM doesn't need to handle exceptions.
>>> tdx_handle_exit_irqoff() handles NMI and machine check. Ignore NMI and
>> tdx_handle_exit_irqoff() doesn't handle NMIs.
> Will move it to tdx_handle_exception().

I don't get it. Why tdx_handle_exception()?

NMI is handled in tdx_vcpu_enter_exit() prior to leaving the safety of
noinstr, according to patch 098.
https://lore.kernel.org/kvm/88920c598dcb55c15219642f27d0781af6d0c044.1708933498.git.isaku.yamahata@intel.com/

@@ -837,6 +857,12 @@ static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx *tdx)
     WARN_ON_ONCE(!kvm_rebooting &&
              (tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);

+    if ((u16)tdx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
+        is_nmi(tdexit_intr_info(vcpu))) {
+        kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
+        vmx_do_nmi_irqoff();
+        kvm_after_interrupt(vcpu);
+    }
     guest_state_exit_irqoff();
 }

>
>
>>> machine check and continue guest TD execution.
>>>
>>> For external interrupt, increment stats same to the VMX case.
>>>
>>> Signed-off-by: Isaku Yamahata <[email protected]>
>>> Reviewed-by: Paolo Bonzini <[email protected]>
>>> ---
>>> arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
>>> 1 file changed, 23 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 0db80fa020d2..bdd74682b474 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -918,6 +918,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>>> vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
>>> }
>>>
>>> +static int tdx_handle_exception(struct kvm_vcpu *vcpu)

Should this function be named tdx_handle_exception_nmi() since it's
checking NMI as well?

>>> +{
>>> + u32 intr_info = tdexit_intr_info(vcpu);
>>> +
>>> + if (is_nmi(intr_info) || is_machine_check(intr_info))
>>> + return 1;
>>


2024-04-17 06:48:15

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Wed, Apr 17, 2024 at 10:21:16AM +0800,
Chao Gao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:01AM -0800, [email protected] wrote:
> >@@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >
> > lockdep_assert_held_write(&kvm->mmu_lock);
> >
> >+ WARN_ON_ONCE(zap_private && !is_private_sp(root));
> >+ if (!zap_private && is_private_sp(root))
> >+ return false;
>
> Should be "return flush;".
>
> Fengwei and I spent one week chasing a bug where virtio-net in the TD guest may
> stop working at some point after bootup if the host enables numad. We finally
> found that the bug was introduced by the 'return false' statement, which left
> some stale EPT entries unflushed.

Thank you for chasing it down.


> I am wondering if we can refactor related functions slightly to make it harder
> to make such mistakes and make it easier to identify them. e.g., we could make
> "@flush" an in/out parameter of tdp_mmu_zap_leafs(), kvm_tdp_mmu_zap_leafs()
> and kvm_tdp_mmu_unmap_gfn_range(). It looks more apparent that "*flush = false"
> below could be problematic if the changes were something like:
>
> if (!zap_private && is_private_sp(root)) {
> 	*flush = false;
> 	return;
> }

Yes, let me look into it.
--
Isaku Yamahata <[email protected]>

2024-04-17 06:56:13

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 085/130] KVM: TDX: Complete interrupts after tdexit

On Tue, Apr 16, 2024 at 11:23:01AM -0700,
Reinette Chatre <[email protected]> wrote:

> Hi Isaku,
>
> (In shortlog "tdexit" can be "TD exit" to be consistent with
> documentation.)
>
> On 2/26/2024 12:26 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > This corresponds to VMX __vmx_complete_interrupts(). Because TDX
> > virtualize vAPIC, KVM only needs to care NMI injection.
>
> This seems to be the first appearance of NMI and the changelog
> is very brief. How about expanding it with:
>
> "This corresponds to VMX __vmx_complete_interrupts(). Because TDX
> virtualize vAPIC, KVM only needs to care about NMI injection.
>
> KVM can request TDX to inject an NMI into a guest TD vCPU when the
> vCPU is not active. TDX will attempt to inject an NMI as soon as
> possible on TD entry. NMI injection is managed by writing to (to
> inject NMI) and reading from (to get status of NMI injection)
> the PEND_NMI field within the TDX vCPU scope metadata (Trust
> Domain Virtual Processor State (TDVPS)).
>
> Update KVM's NMI status on TD exit by checking whether a requested
> NMI has been injected into the TD. Reading the metadata via SEAMCALL
> is expensive so only perform the check if an NMI was injected.
>
> This is the first need to access vCPU scope metadata in the
> "management" class. Ensure that needed accessor is available.
> "
>
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > Reviewed-by: Binbin Wu <[email protected]>
> > ---
> > v19:
> > - move tdvps_management_check() to this patch
> > - typo: complete -> Complete in short log
> > ---
> > arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
> > arch/x86/kvm/vmx/tdx.h | 4 ++++
> > 2 files changed, 14 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 83dcaf5b6fbd..b8b168f74dfe 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > */
> > }
> >
> > +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> > +{
> > + /* Avoid costly SEAMCALL if no nmi was injected */
>
> /* Avoid costly SEAMCALL if no NMI was injected. */
>
> > + if (vcpu->arch.nmi_injected)
> > + vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> > + TD_VCPU_PEND_NMI);
> > +}
> > +
> > struct tdx_uret_msr {
> > u32 msr;
> > unsigned int slot;
> > @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > trace_kvm_exit(vcpu, KVM_ISA_VMX);
> >
> > + tdx_complete_interrupts(vcpu);
> > +
> > return EXIT_FASTPATH_NONE;
> > }
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 44eab734e702..0d8a98feb58e 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
> > "Invalid TD VMCS access for 16-bit field");
> > }
> >
> > +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
>
> Is this intended to be a stub or is it expected to be fleshed out with
> some checks?

It was meant to check that the field ID matches the access size in bits. We
should make tdvps_vmcs_check() common to vmcs, management, and state_non_arch.
--
Isaku Yamahata <[email protected]>

2024-04-17 07:03:33

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 105/130] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL

On Wed, Apr 17, 2024 at 02:16:57PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 4/4/2024 9:27 AM, Isaku Yamahata wrote:
> > On Tue, Apr 02, 2024 at 04:52:46PM +0800,
> > Chao Gao <[email protected]> wrote:
> >
> > > > +static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> > > > +{
> > > > + unsigned long nr, a0, a1, a2, a3, ret;
> > > > +
> > > do you need to emulate xen/hyper-v hypercalls here?
> >
> > No. kvm_emulate_hypercall() handles xen/hyper-v hypercalls,
> > __kvm_emulate_hypercall() doesn't.
> So for TDX, kvm doesn't support xen/hyper-v, right?
>
> Then, should KVM_CAP_XEN_HVM and KVM_CAP_HYPERV be filtered out for TDX?

That's right. We should update kvm_vm_ioctl_check_extension() and
kvm_vcpu_ioctl_enable_cap(). I didn't pay attention to them.
--
Isaku Yamahata <[email protected]>

2024-04-17 08:46:43

by Binbin Wu

Subject: Re: [PATCH v19 105/130] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL



On 4/4/2024 9:27 AM, Isaku Yamahata wrote:
> On Tue, Apr 02, 2024 at 04:52:46PM +0800,
> Chao Gao <[email protected]> wrote:
>
>>> +static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
>>> +{
>>> + unsigned long nr, a0, a1, a2, a3, ret;
>>> +
>> do you need to emulate xen/hyper-v hypercalls here?
>
> No. kvm_emulate_hypercall() handles xen/hyper-v hypercalls,
> __kvm_emulate_hypercall() doesn't.
So for TDX, kvm doesn't support xen/hyper-v, right?

Then, should KVM_CAP_XEN_HVM and KVM_CAP_HYPERV be filtered out for TDX?

>
>> Nothing tells userspace that xen/hyper-v hypercalls are not supported and
>> so userspace may expose related CPUID leafs to TD guests.
>>


2024-04-17 13:21:27

by Huang, Kai

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, 2024-04-16 at 13:58 -0700, Sean Christopherson wrote:
> On Fri, Apr 12, 2024, Kai Huang wrote:
> > On 12/04/2024 2:03 am, Sean Christopherson wrote:
> > > On Thu, Apr 11, 2024, Kai Huang wrote:
> > > > I can certainly follow up with this and generate a reviewable patchset if I
> > > > can confirm with you that this is what you want?
> > >
> > > Yes, I think it's the right direction. I still have minor concerns about VMX
> > > being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
> > > enabled if KVM is built-in. But after seeing the complexity that is needed to
> > > safely initialize TDX, and after seeing just how much complexity KVM already
> > > has because it enables VMX on-demand (I hadn't actually tried removing that code
> > > before), I think the cost of that complexity far outweighs the risk of "always"
> > > being post-VMXON.
> >
> > Does always leaving VMXON have any actual damage, given we have emergency
> > virtualization shutdown?
>
> Being post-VMXON increases the risk of kexec() into the kdump kernel failing.
> The tradeoffs that we're trying to balance are: is the risk of kexec() failing
> due to the complexity of the emergency VMX code higher than the risk of us breaking
> things in general due to taking on a ton of complexity to juggle VMXON for TDX?
>
> After seeing the latest round of TDX code, my opinion is that being post-VMXON
> is less risky overall, in no small part because we need that to work anyways for
> hosts that are actively running VMs.
>
> >

How about we only keep VMX always on when TDX is enabled?

In short, we can do this way:

- Do VMXON + unconditional tdx_cpu_enable() in vt_hardware_enable()

- And in vt_hardware_setup():

        cpus_read_lock();
        hardware_enable_all_nolock(); (this doesn't exist yet)
        ret = tdx_enable();
        if (ret)
                hardware_disable_all_nolock();
        cpus_read_unlock();

- And in vt_hardware_unsetup():

        if (TDX is enabled) {
                cpus_read_lock();
                hardware_disable_all_nolock();
                cpus_read_unlock();
        }

Note to make this work, we also need to move register/unregister
kvm_online_cpu()/kvm_offline_cpu() from kvm_init() to
hardware_enable_all_nolock() and hardware_disable_all_nolock()
respectively to cover any CPU becoming online after tdx_enable() (well,
more precisely, after hardware_enable_all_nolock()).

This is reasonable anyway even w/o TDX, because we only need to handle
CPU hotplug _after_ hardware has been enabled on all online CPUs.

Calling hardware_enable_all_nolock() w/o holding the kvm_lock mutex is
also fine, because at this stage it's not possible for userspace to
create VMs yet.

Btw, kvm_arch_hardware_enable() does things like the backwards-TSC
workaround, uret_msrs setup, etc., but AFAICT they are safe to call
during module load/unload. We can put a reminder comment there if
needed.

If I am not missing anything, below diff to kvm.ko shows my idea:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 16e07a2eee19..ed8b2f34af01 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2318,4 +2318,7 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
*/
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

+int hardware_enable_all_nolock(void);
+void hardware_disable_all_nolock(void);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..3d2ff7dd0150 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5601,14 +5601,23 @@ static int kvm_offline_cpu(unsigned int cpu)
return 0;
}

-static void hardware_disable_all_nolock(void)
+static void __hardware_disable_all_nolock(void)
+{
+#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
+ cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
+#endif
+ on_each_cpu(hardware_disable_nolock, NULL, 1);
+}
+
+void hardware_disable_all_nolock(void)
{
BUG_ON(!kvm_usage_count);

kvm_usage_count--;
if (!kvm_usage_count)
- on_each_cpu(hardware_disable_nolock, NULL, 1);
+ __hardware_disable_all_nolock();
}
+EXPORT_SYMBOL_GPL(hardware_disable_all_nolock);

static void hardware_disable_all(void)
{
@@ -5619,11 +5628,27 @@ static void hardware_disable_all(void)
cpus_read_unlock();
}

-static int hardware_enable_all(void)
+static int __hardware_enable_all_nolock(void)
{
atomic_t failed = ATOMIC_INIT(0);
int r;

+ on_each_cpu(hardware_enable_nolock, &failed, 1);
+
+ r = atomic_read(&failed);
+ if (r)
+ return -EBUSY;
+
+#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
+ r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
+ kvm_online_cpu, kvm_offline_cpu);
+#endif
+
+ return r;
+}
+
+int hardware_enable_all_nolock(void)
+{
/*
 * Do not enable hardware virtualization if the system is going down.
 * If userspace initiated a forced reboot, e.g. reboot -f, then it's
@@ -5637,6 +5662,24 @@ static int hardware_enable_all(void)
system_state == SYSTEM_RESTART)
return -EBUSY;

+ kvm_usage_count++;
+ if (kvm_usage_count == 1) {
+ int r = __hardware_enable_all_nolock();
+
+ if (r) {
+ hardware_disable_all_nolock();
+ return r;
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(hardware_enable_all_nolock);
+
+static int hardware_enable_all(void)
+{
+ int r;
+
/*
 * When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
 * is called, and so on_each_cpu() between them includes the CPU that
@@ -5648,17 +5691,7 @@ static int hardware_enable_all(void)
cpus_read_lock();
mutex_lock(&kvm_lock);

- r = 0;
-
- kvm_usage_count++;
- if (kvm_usage_count == 1) {
- on_each_cpu(hardware_enable_nolock, &failed, 1);
-
- if (atomic_read(&failed)) {
- hardware_disable_all_nolock();
- r = -EBUSY;
- }
- }
+ r = hardware_enable_all_nolock();

mutex_unlock(&kvm_lock);
cpus_read_unlock();
@@ -6422,11 +6455,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
int cpu;

#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
- kvm_online_cpu, kvm_offline_cpu);
- if (r)
- return r;
-
register_syscore_ops(&kvm_syscore_ops);
#endif

@@ -6528,7 +6556,6 @@ void kvm_exit(void)
kvm_async_pf_deinit();
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
#endif
kvm_irqfd_exit();
}

2024-04-17 14:40:20

by Sean Christopherson

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Wed, Apr 17, 2024, Kai Huang wrote:
> On Tue, 2024-04-16 at 13:58 -0700, Sean Christopherson wrote:
> > On Fri, Apr 12, 2024, Kai Huang wrote:
> > > On 12/04/2024 2:03 am, Sean Christopherson wrote:
> > > > On Thu, Apr 11, 2024, Kai Huang wrote:
> > > > > I can certainly follow up with this and generate a reviewable patchset if I
> > > > > can confirm with you that this is what you want?
> > > >
> > > > Yes, I think it's the right direction. I still have minor concerns about VMX
> > > > being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
> > > > enabled if KVM is built-in. But after seeing the complexity that is needed to
> > > > safely initialize TDX, and after seeing just how much complexity KVM already
> > > > has because it enables VMX on-demand (I hadn't actually tried removing that code
> > > > before), I think the cost of that complexity far outweighs the risk of "always"
> > > > being post-VMXON.
> > >
> > > Does always leaving VMXON have any actual damage, given we have emergency
> > > virtualization shutdown?
> >
> > Being post-VMXON increases the risk of kexec() into the kdump kernel failing.
> > The tradeoffs that we're trying to balance are: is the risk of kexec() failing
> > due to the complexity of the emergency VMX code higher than the risk of us breaking
> > things in general due to taking on a ton of complexity to juggle VMXON for TDX?
> >
> > After seeing the latest round of TDX code, my opinion is that being post-VMXON
> > is less risky overall, in no small part because we need that to work anyways for
> > hosts that are actively running VMs.
>
> How about we only keep VMX always on when TDX is enabled?

Paolo also suggested forcing VMXON only if TDX is enabled, mostly because
kvm-intel.ko and kvm-amd.ko may be auto-loaded based on MODULE_DEVICE_TABLE(),
which in turn causes problems for out-of-tree hypervisors that want control over
VMX and SVM.

I'm not opposed to the idea, it's the complexity and messiness I dislike. E.g.
the TDX code shouldn't have to deal with CPU hotplug locks, core KVM shouldn't
need to expose nolock helpers, etc. And if we're going to make non-trivial
changes to the core KVM hardware enabling code anyways...

What about this? Same basic idea as before, but instead of unconditionally doing
hardware enabling during module initialization, let TDX do hardware enabling in
a late_hardware_setup(), and then have KVM x86 ensure virtualization is enabled
when creating VMs.

This way, architectures that aren't saddled with out-of-tree hypervisors can do
the dead simple thing of enabling hardware during their initialization sequence,
and the TDX code is much more sane, e.g. invoke kvm_x86_enable_virtualization()
during late_hardware_setup(), and kvm_x86_disable_virtualization() during module
exit (presumably).

---
Documentation/virt/kvm/locking.rst | 4 -
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/kvm/svm/svm.c | 5 +-
arch/x86/kvm/vmx/vmx.c | 18 ++-
arch/x86/kvm/x86.c | 59 +++++++---
arch/x86/kvm/x86.h | 2 +
include/linux/kvm_host.h | 2 +
virt/kvm/kvm_main.c | 181 +++++++----------------------
8 files changed, 104 insertions(+), 170 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 02880d5552d5..0d6eff13fd46 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -227,10 +227,6 @@ time it will be set using the Dirty tracking mechanism described above.
:Type: mutex
:Arch: any
:Protects: - vm_list
- - kvm_usage_count
- - hardware virtualization enable/disable
-:Comment: KVM also disables CPU hotplug via cpus_read_lock() during
- enable/disable.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 73740d698ebe..7422239987d8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
#include <asm/kvm_page_track.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/hyperv-tlfs.h>
+#include <asm/reboot.h>

#define __KVM_HAVE_ARCH_VCPU_DEBUGFS

@@ -1605,6 +1606,8 @@ struct kvm_x86_ops {

int (*hardware_enable)(void);
void (*hardware_disable)(void);
+ cpu_emergency_virt_cb *emergency_disable;
+
void (*hardware_unsetup)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9aaf83c8d57d..7e118284934c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4917,6 +4917,7 @@ static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = KBUILD_MODNAME,

+ .emergency_disable = svm_emergency_disable,
.check_processor_compatibility = svm_check_processor_compat,

.hardware_unsetup = svm_hardware_unsetup,
@@ -5348,8 +5349,6 @@ static struct kvm_x86_init_ops svm_init_ops __initdata = {
static void __svm_exit(void)
{
kvm_x86_vendor_exit();
-
- cpu_emergency_unregister_virt_callback(svm_emergency_disable);
}

static int __init svm_init(void)
@@ -5365,8 +5364,6 @@ static int __init svm_init(void)
if (r)
return r;

- cpu_emergency_register_virt_callback(svm_emergency_disable);
-
/*
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d18dcb1e11a6..0dbe74da7ee3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8320,6 +8320,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {

.hardware_enable = vmx_hardware_enable,
.hardware_disable = vmx_hardware_disable,
+ .emergency_disable = vmx_emergency_disable,
+
.has_emulated_msr = vmx_has_emulated_msr,

.vm_size = sizeof(struct kvm_vmx),
@@ -8733,8 +8735,6 @@ static void __vmx_exit(void)
{
allow_smaller_maxphyaddr = false;

- cpu_emergency_unregister_virt_callback(vmx_emergency_disable);
-
vmx_cleanup_l1d_flush();
}

@@ -8760,6 +8760,12 @@ static int __init vmx_init(void)
*/
hv_init_evmcs();

+ for_each_possible_cpu(cpu) {
+ INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
+ pi_init_cpu(cpu);
+ }
+
r = kvm_x86_vendor_init(&vmx_init_ops);
if (r)
return r;
@@ -8775,14 +8781,6 @@ static int __init vmx_init(void)
if (r)
goto err_l1d_flush;

- for_each_possible_cpu(cpu) {
- INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
-
- pi_init_cpu(cpu);
- }
-
- cpu_emergency_register_virt_callback(vmx_emergency_disable);
-
vmx_check_vmcs12_offsets();

/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 26288ca05364..fdf6e05000c1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -134,6 +134,7 @@ static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);

static DEFINE_MUTEX(vendor_module_lock);
struct kvm_x86_ops kvm_x86_ops __read_mostly;
+static int kvm_usage_count;

#define KVM_X86_OP(func) \
DEFINE_STATIC_CALL_NULL(kvm_x86_##func, \
@@ -9687,15 +9688,10 @@ static int kvm_x86_check_processor_compatibility(void)
return static_call(kvm_x86_check_processor_compatibility)();
}

-static void kvm_x86_check_cpu_compat(void *ret)
-{
- *(int *)ret = kvm_x86_check_processor_compatibility();
-}
-
int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
{
u64 host_pat;
- int r, cpu;
+ int r;

guard(mutex)(&vendor_module_lock);

@@ -9771,11 +9767,11 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)

kvm_ops_update(ops);

- for_each_online_cpu(cpu) {
- smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
- if (r < 0)
- goto out_unwind_ops;
- }
+ cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_disable);
+
+ r = ops->late_hardware_setup();
+ if (r)
+ goto out_unwind_ops;

/*
* Point of no return! DO NOT add error paths below this point unless
@@ -9818,6 +9814,7 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
return 0;

out_unwind_ops:
+ cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable);
kvm_x86_ops.hardware_enable = NULL;
static_call(kvm_x86_hardware_unsetup)();
out_mmu_exit:
@@ -9858,6 +9855,10 @@ void kvm_x86_vendor_exit(void)
static_key_deferred_flush(&kvm_xen_enabled);
WARN_ON(static_branch_unlikely(&kvm_xen_enabled.key));
#endif
+
+ kvm_disable_virtualization();
+ cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable);
+
mutex_lock(&vendor_module_lock);
kvm_x86_ops.hardware_enable = NULL;
mutex_unlock(&vendor_module_lock);
@@ -12522,6 +12523,33 @@ void kvm_arch_free_vm(struct kvm *kvm)
__kvm_arch_free_vm(kvm);
}

+int kvm_x86_enable_virtualization(void)
+{
+ int r;
+
+ guard(mutex)(&vendor_module_lock);
+
+ if (kvm_usage_count++)
+ return 0;
+
+ r = kvm_enable_virtualization();
+ if (r)
+ --kvm_usage_count;
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_x86_enable_virtualization);
+
+void kvm_x86_disable_virtualization(void)
+{
+ guard(mutex)(&vendor_module_lock);
+
+ if (--kvm_usage_count)
+ return;
+
+ kvm_disable_virtualization();
+}
+EXPORT_SYMBOL_GPL(kvm_x86_disable_virtualization);

int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
@@ -12533,9 +12561,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)

kvm->arch.vm_type = type;

+ ret = kvm_x86_enable_virtualization();
+ if (ret)
+ return ret;
+
ret = kvm_page_track_init(kvm);
if (ret)
- goto out;
+ goto out_disable_virtualization;

kvm_mmu_init_vm(kvm);

@@ -12582,7 +12614,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
out_uninit_mmu:
kvm_mmu_uninit_vm(kvm);
kvm_page_track_cleanup(kvm);
-out:
+out_disable_virtualization:
+ kvm_x86_disable_virtualization();
return ret;
}

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a8b71803777b..427c5d102525 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -32,6 +32,8 @@ struct kvm_caps {
};

void kvm_spurious_fault(void);
+int kvm_x86_enable_virtualization(void);
+void kvm_x86_disable_virtualization(void);

#define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
({ \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..92da2eee7448 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1518,6 +1518,8 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
#endif

#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
+int kvm_enable_virtualization(void);
+void kvm_disable_virtualization(void);
int kvm_arch_hardware_enable(void);
void kvm_arch_hardware_disable(void);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f345dc15854f..326e3225c052 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -139,8 +139,6 @@ static int kvm_no_compat_open(struct inode *inode, struct file *file)
#define KVM_COMPAT(c) .compat_ioctl = kvm_no_compat_ioctl, \
.open = kvm_no_compat_open
#endif
-static int hardware_enable_all(void);
-static void hardware_disable_all(void);

static void kvm_io_bus_destroy(struct kvm_io_bus *bus);

@@ -1261,10 +1259,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (r)
goto out_err_no_arch_destroy_vm;

- r = hardware_enable_all();
- if (r)
- goto out_err_no_disable;
-
#ifdef CONFIG_HAVE_KVM_IRQCHIP
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
#endif
@@ -1304,8 +1298,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
out_err_no_mmu_notifier:
- hardware_disable_all();
-out_err_no_disable:
kvm_arch_destroy_vm(kvm);
out_err_no_arch_destroy_vm:
WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
@@ -1393,7 +1385,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
#endif
kvm_arch_free_vm(kvm);
preempt_notifier_dec();
- hardware_disable_all();
mmdrop(mm);
}

@@ -5536,9 +5527,8 @@ __visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);

static DEFINE_PER_CPU(bool, hardware_enabled);
-static int kvm_usage_count;

-static int __hardware_enable_nolock(void)
+static int __kvm_enable_virtualization(void)
{
if (__this_cpu_read(hardware_enabled))
return 0;
@@ -5553,34 +5543,18 @@ static int __hardware_enable_nolock(void)
return 0;
}

-static void hardware_enable_nolock(void *failed)
-{
- if (__hardware_enable_nolock())
- atomic_inc(failed);
-}
-
static int kvm_online_cpu(unsigned int cpu)
{
- int ret = 0;
-
/*
* Abort the CPU online process if hardware virtualization cannot
* be enabled. Otherwise running VMs would encounter unrecoverable
* errors when scheduled to this CPU.
*/
- mutex_lock(&kvm_lock);
- if (kvm_usage_count)
- ret = __hardware_enable_nolock();
- mutex_unlock(&kvm_lock);
- return ret;
+ return __kvm_enable_virtualization();
}

-static void hardware_disable_nolock(void *junk)
+static void __kvm_disable_virtualization(void *ign)
{
- /*
- * Note, hardware_disable_all_nolock() tells all online CPUs to disable
- * hardware, not just CPUs that successfully enabled hardware!
- */
if (!__this_cpu_read(hardware_enabled))
return;

@@ -5591,78 +5565,10 @@ static void hardware_disable_nolock(void *junk)

static int kvm_offline_cpu(unsigned int cpu)
{
- mutex_lock(&kvm_lock);
- if (kvm_usage_count)
- hardware_disable_nolock(NULL);
- mutex_unlock(&kvm_lock);
+ __kvm_disable_virtualization(NULL);
return 0;
}

-static void hardware_disable_all_nolock(void)
-{
- BUG_ON(!kvm_usage_count);
-
- kvm_usage_count--;
- if (!kvm_usage_count)
- on_each_cpu(hardware_disable_nolock, NULL, 1);
-}
-
-static void hardware_disable_all(void)
-{
- cpus_read_lock();
- mutex_lock(&kvm_lock);
- hardware_disable_all_nolock();
- mutex_unlock(&kvm_lock);
- cpus_read_unlock();
-}
-
-static int hardware_enable_all(void)
-{
- atomic_t failed = ATOMIC_INIT(0);
- int r;
-
- /*
- * Do not enable hardware virtualization if the system is going down.
- * If userspace initiated a forced reboot, e.g. reboot -f, then it's
- * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
- * after kvm_reboot() is called. Note, this relies on system_state
- * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
- * hook instead of registering a dedicated reboot notifier (the latter
- * runs before system_state is updated).
- */
- if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
- system_state == SYSTEM_RESTART)
- return -EBUSY;
-
- /*
- * When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
- * is called, and so on_each_cpu() between them includes the CPU that
- * is being onlined. As a result, hardware_enable_nolock() may get
- * invoked before kvm_online_cpu(), which also enables hardware if the
- * usage count is non-zero. Disable CPU hotplug to avoid attempting to
- * enable hardware multiple times.
- */
- cpus_read_lock();
- mutex_lock(&kvm_lock);
-
- r = 0;
-
- kvm_usage_count++;
- if (kvm_usage_count == 1) {
- on_each_cpu(hardware_enable_nolock, &failed, 1);
-
- if (atomic_read(&failed)) {
- hardware_disable_all_nolock();
- r = -EBUSY;
- }
- }
-
- mutex_unlock(&kvm_lock);
- cpus_read_unlock();
-
- return r;
-}
-
static void kvm_shutdown(void)
{
/*
@@ -5678,34 +5584,22 @@ static void kvm_shutdown(void)
*/
pr_info("kvm: exiting hardware virtualization\n");
kvm_rebooting = true;
- on_each_cpu(hardware_disable_nolock, NULL, 1);
+ on_each_cpu(__kvm_disable_virtualization, NULL, 1);
}

static int kvm_suspend(void)
{
- /*
- * Secondary CPUs and CPU hotplug are disabled across the suspend/resume
- * callbacks, i.e. no need to acquire kvm_lock to ensure the usage count
- * is stable. Assert that kvm_lock is not held to ensure the system
- * isn't suspended while KVM is enabling hardware. Hardware enabling
- * can be preempted, but the task cannot be frozen until it has dropped
- * all locks (userspace tasks are frozen via a fake signal).
- */
- lockdep_assert_not_held(&kvm_lock);
lockdep_assert_irqs_disabled();

- if (kvm_usage_count)
- hardware_disable_nolock(NULL);
+ __kvm_disable_virtualization(NULL);
return 0;
}

static void kvm_resume(void)
{
- lockdep_assert_not_held(&kvm_lock);
lockdep_assert_irqs_disabled();

- if (kvm_usage_count)
- WARN_ON_ONCE(__hardware_enable_nolock());
+ WARN_ON_ONCE(__kvm_enable_virtualization());
}

static struct syscore_ops kvm_syscore_ops = {
@@ -5713,16 +5607,45 @@ static struct syscore_ops kvm_syscore_ops = {
.resume = kvm_resume,
.shutdown = kvm_shutdown,
};
-#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
-static int hardware_enable_all(void)
+
+int kvm_enable_virtualization(void)
{
+ int r;
+
+ r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
+ kvm_online_cpu, kvm_offline_cpu);
+ if (r)
+ return r;
+
+ register_syscore_ops(&kvm_syscore_ops);
+
+ /*
+ * Manually undo virtualization enabling if the system is going down.
+ * If userspace initiated a forced reboot, e.g. reboot -f, then it's
+ * possible for an in-flight module load to enable virtualization
+ * after syscore_shutdown() is called, i.e. without kvm_shutdown()
+ * being invoked. Note, this relies on system_state being set _before_
+ * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
+ * or this CPU observes the impending shutdown. Which is why KVM uses
+ * a syscore ops hook instead of registering a dedicated reboot
+ * notifier (the latter runs before system_state is updated).
+ */
+ if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
+ system_state == SYSTEM_RESTART) {
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+ return -EBUSY;
+ }
+
return 0;
}

-static void hardware_disable_all(void)
+void kvm_disable_virtualization(void)
{
-
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
}
+
#endif /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */

static void kvm_iodevice_destructor(struct kvm_io_device *dev)
@@ -6418,15 +6341,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
int r;
int cpu;

-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
- kvm_online_cpu, kvm_offline_cpu);
- if (r)
- return r;
-
- register_syscore_ops(&kvm_syscore_ops);
-#endif
-
/* A kmem cache lets us meet the alignment requirements of fx_save. */
if (!vcpu_align)
vcpu_align = __alignof__(struct kvm_vcpu);
@@ -6437,10 +6351,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
offsetofend(struct kvm_vcpu, stats_id)
- offsetof(struct kvm_vcpu, arch),
NULL);
- if (!kvm_vcpu_cache) {
- r = -ENOMEM;
- goto err_vcpu_cache;
- }
+ if (!kvm_vcpu_cache)
+ return -ENOMEM;

for_each_possible_cpu(cpu) {
if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
@@ -6497,11 +6409,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
for_each_possible_cpu(cpu)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
kmem_cache_destroy(kvm_vcpu_cache);
-err_vcpu_cache:
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
-#endif
return r;
}
EXPORT_SYMBOL_GPL(kvm_init);
@@ -6523,10 +6430,6 @@ void kvm_exit(void)
kmem_cache_destroy(kvm_vcpu_cache);
kvm_vfio_ops_exit();
kvm_async_pf_deinit();
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
-#endif
kvm_irqfd_exit();
}
EXPORT_SYMBOL_GPL(kvm_exit);

base-commit: 2d181d84af38146748042a6974c577fc46c3f1c3
--


2024-04-17 15:57:06

by Binbin Wu

Subject: Re: [PATCH v19 109/130] KVM: TDX: Handle TDX PV port io hypercall



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire up TDX PV port IO hypercall to the KVM backend function.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> v18:
> - Fix out case to set R10 and R11 correctly when user space handled port
> out.
> ---
> arch/x86/kvm/vmx/tdx.c | 67 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 67 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a2caf2ae838c..55fc6cc6c816 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1152,6 +1152,71 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
> return kvm_emulate_halt_noskip(vcpu);
> }
>
> +static int tdx_complete_pio_out(struct kvm_vcpu *vcpu)
> +{
> + tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
> + tdvmcall_set_return_val(vcpu, 0);
> + return 1;
> +}
> +
> +static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
> +{
> + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> + unsigned long val = 0;
> + int ret;
> +
> + WARN_ON_ONCE(vcpu->arch.pio.count != 1);
> +
> + ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
> + vcpu->arch.pio.port, &val, 1);
> + WARN_ON_ONCE(!ret);
> +
> + tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
> + tdvmcall_set_return_val(vcpu, val);
> +
> + return 1;
> +}
> +
> +static int tdx_emulate_io(struct kvm_vcpu *vcpu)
> +{
> + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> + unsigned long val = 0;
> + unsigned int port;
> + int size, ret;
> + bool write;
> +
> + ++vcpu->stat.io_exits;
> +
> + size = tdvmcall_a0_read(vcpu);
> + write = tdvmcall_a1_read(vcpu);
> + port = tdvmcall_a2_read(vcpu);
> +
> + if (size != 1 && size != 2 && size != 4) {
> + tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
> + return 1;
> + }
> +
> + if (write) {
> + val = tdvmcall_a3_read(vcpu);
> + ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
> +
> + /* No need for a complete_userspace_io callback. */
I am confused about the comment.

The code below sets the complete_userspace_io callback for write case,
i.e. tdx_complete_pio_out().

> + vcpu->arch.pio.count = 0;
> + } else
> + ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
> +
> + if (ret)
> + tdvmcall_set_return_val(vcpu, val);
> + else {
> + if (write)
> + vcpu->arch.complete_userspace_io = tdx_complete_pio_out;
> + else
> + vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
> + }
> +
> + return ret;
> +}
> +
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
> if (tdvmcall_exit_type(vcpu))
> @@ -1162,6 +1227,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> return tdx_emulate_cpuid(vcpu);
> case EXIT_REASON_HLT:
> return tdx_emulate_hlt(vcpu);
> + case EXIT_REASON_IO_INSTRUCTION:
> + return tdx_emulate_io(vcpu);
> default:
> break;
> }


2024-04-17 20:11:25

by Isaku Yamahata

Subject: Re: [PATCH v19 109/130] KVM: TDX: Handle TDX PV port io hypercall

On Wed, Apr 17, 2024 at 08:51:39PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Wire up TDX PV port IO hypercall to the KVM backend function.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Reviewed-by: Paolo Bonzini <[email protected]>
> > ---
> > v18:
> > - Fix out case to set R10 and R11 correctly when user space handled port
> > out.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 67 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 67 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index a2caf2ae838c..55fc6cc6c816 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1152,6 +1152,71 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
> > return kvm_emulate_halt_noskip(vcpu);
> > }
> > +static int tdx_complete_pio_out(struct kvm_vcpu *vcpu)
> > +{
> > + tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
> > + tdvmcall_set_return_val(vcpu, 0);
> > + return 1;
> > +}
> > +
> > +static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
> > +{
> > + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> > + unsigned long val = 0;
> > + int ret;
> > +
> > + WARN_ON_ONCE(vcpu->arch.pio.count != 1);
> > +
> > + ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
> > + vcpu->arch.pio.port, &val, 1);
> > + WARN_ON_ONCE(!ret);
> > +
> > + tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
> > + tdvmcall_set_return_val(vcpu, val);
> > +
> > + return 1;
> > +}
> > +
> > +static int tdx_emulate_io(struct kvm_vcpu *vcpu)
> > +{
> > + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> > + unsigned long val = 0;
> > + unsigned int port;
> > + int size, ret;
> > + bool write;
> > +
> > + ++vcpu->stat.io_exits;
> > +
> > + size = tdvmcall_a0_read(vcpu);
> > + write = tdvmcall_a1_read(vcpu);
> > + port = tdvmcall_a2_read(vcpu);
> > +
> > + if (size != 1 && size != 2 && size != 4) {
> > + tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
> > + return 1;
> > + }
> > +
> > + if (write) {
> > + val = tdvmcall_a3_read(vcpu);
> > + ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
> > +
> > + /* No need for a complete_userspace_io callback. */
> I am confused about the comment.
>
> The code below sets the complete_userspace_io callback for write case,
> i.e. tdx_complete_pio_out().

You're correct. This comment is stale and should be removed.
--
Isaku Yamahata <[email protected]>
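
[Editorial note: the deferred-completion flow under discussion — exit to
userspace for a port write, then set the TDVMCALL return code via a
complete_userspace_io callback on re-entry — can be sketched with a small
user-space toy model. All names below are hypothetical stand-ins, not KVM's.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: when the in-kernel emulator cannot finish a port write,
 * "KVM" exits to userspace and records a callback that sets the
 * return code only on re-entry, after userspace has handled the I/O. */

enum { TOY_SUCCESS = 0, TOY_PENDING = 1 };

struct toy_vcpu {
	int exited_to_userspace;
	unsigned long r10;	/* models the TDVMCALL return code */
	int (*complete_io)(struct toy_vcpu *vcpu);
};

static int toy_complete_pio_out(struct toy_vcpu *vcpu)
{
	vcpu->r10 = TOY_SUCCESS;	/* set only after userspace ran */
	vcpu->complete_io = NULL;
	return 1;
}

/* The kernel could not emulate the port: defer completion. */
static int toy_emulate_pio_out(struct toy_vcpu *vcpu)
{
	vcpu->exited_to_userspace = 1;
	vcpu->complete_io = toy_complete_pio_out;
	return TOY_PENDING;
}

/* Userspace re-enters the vCPU; the callback finishes the hypercall. */
static void toy_reenter(struct toy_vcpu *vcpu)
{
	if (vcpu->complete_io)
		vcpu->complete_io(vcpu);
}
```

This is why the write path needs a completion callback too: the return
registers must be set after, not before, userspace handles the port out.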

2024-04-17 23:10:11

by Huang, Kai

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 18/04/2024 2:40 am, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Kai Huang wrote:
>> On Tue, 2024-04-16 at 13:58 -0700, Sean Christopherson wrote:
>>> On Fri, Apr 12, 2024, Kai Huang wrote:
>>>> On 12/04/2024 2:03 am, Sean Christopherson wrote:
>>>>> On Thu, Apr 11, 2024, Kai Huang wrote:
>>>>>> I can certainly follow up with this and generate a reviewable patchset if I
>>>>>> can confirm with you that this is what you want?
>>>>>
>>>>> Yes, I think it's the right direction. I still have minor concerns about VMX
>>>>> being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
>>>>> enabled if KVM is built-in. But after seeing the complexity that is needed to
>>>>> safely initialize TDX, and after seeing just how much complexity KVM already
>>>>> has because it enables VMX on-demand (I hadn't actually tried removing that code
>>>>> before), I think the cost of that complexity far outweighs the risk of "always"
>>>>> being post-VMXON.
>>>>
>>>> Does always leaving VMXON have any actual damage, given we have emergency
>>>> virtualization shutdown?
>>>
>>> Being post-VMXON increases the risk of kexec() into the kdump kernel failing.
>>> The tradeoffs that we're trying to balance are: is the risk of kexec() failing
>>> due to the complexity of the emergency VMX code higher than the risk of us breaking
>>> things in general due to taking on a ton of complexity to juggle VMXON for TDX?
>>>
>>> After seeing the latest round of TDX code, my opinion is that being post-VMXON
>>> is less risky overall, in no small part because we need that to work anyways for
>>> hosts that are actively running VMs.
>>
>> How about we only keep VMX always on when TDX is enabled?
>
> Paolo also suggested that forcing VMXON only if TDX is enabled, mostly because
> kvm-intel.ko and kvm-amd.ko may be auto-loaded based on MODULE_DEVICE_TABLE(),
> which in turn causes problems for out-of-tree hypervisors that want control over
> VMX and SVM.
>
> I'm not opposed to the idea, it's the complexity and messiness I dislike. E.g.
> the TDX code shouldn't have to deal with CPU hotplug locks, core KVM shouldn't
> need to expose nolock helpers, etc. And if we're going to make non-trivial
> changes to the core KVM hardware enabling code anyways...
>
> What about this? Same basic idea as before, but instead of unconditionally doing
> hardware enabling during module initialization, let TDX do hardware enabling in
> a late_hardware_setup(), and then have KVM x86 ensure virtualization is enabled
> when creating VMs.
>
> This way, architectures that aren't saddled with out-of-tree hypervisors can do
> the dead simple thing of enabling hardware during their initialization sequence,
> and the TDX code is much more sane, e.g. invoke kvm_x86_enable_virtualization()
> during late_hardware_setup(), and kvm_x86_disable_virtualization() during module
> exit (presumably).

Fine to me, given I am not familiar with other ARCHs, assuming always
enable virtualization when KVM present is fine to them. :-)

Two questions below:

> +int kvm_x86_enable_virtualization(void)
> +{
> + int r;
> +
> + guard(mutex)(&vendor_module_lock);

It's a little bit odd to take the vendor_module_lock mutex.

It is called by kvm_arch_init_vm(), so more reasonably we should still
use kvm_lock?

Also, if we invoke kvm_x86_enable_virtualization() from
kvm_x86_ops->late_hardware_setup(), then IIUC we will deadlock here
because kvm_x86_vendor_init() already takes the vendor_module_lock?

> +
> + if (kvm_usage_count++)
> + return 0;
> +
> + r = kvm_enable_virtualization();
> + if (r)
> + --kvm_usage_count;
> +
> + return r;
> +}
> +EXPORT_SYMBOL_GPL(kvm_x86_enable_virtualization);
> +

[...]

> +int kvm_enable_virtualization(void)
> {
> + int r;
> +
> + r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> + kvm_online_cpu, kvm_offline_cpu);
> + if (r)
> + return r;
> +
> + register_syscore_ops(&kvm_syscore_ops);
> +
> + /*
> + * Manually undo virtualization enabling if the system is going down.
> + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> + * possible for an in-flight module load to enable virtualization
> + * after syscore_shutdown() is called, i.e. without kvm_shutdown()
> + * being invoked. Note, this relies on system_state being set _before_
> + * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
> + * or this CPU observes the impending shutdown. Which is why KVM uses
> + * a syscore ops hook instead of registering a dedicated reboot
> + * notifier (the latter runs before system_state is updated).
> + */
> + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> + system_state == SYSTEM_RESTART) {
> + unregister_syscore_ops(&kvm_syscore_ops);
> + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
> + return -EBUSY;
> + }
> +

Aren't we also supposed to do:

on_each_cpu(__kvm_enable_virtualization, NULL, 1);

here?

> return 0;
> }
>


2024-04-17 23:36:05

by Sean Christopherson

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Apr 18, 2024, Kai Huang wrote:
> On 18/04/2024 2:40 am, Sean Christopherson wrote:
> > This way, architectures that aren't saddled with out-of-tree hypervisors can do
> > the dead simple thing of enabling hardware during their initialization sequence,
> > and the TDX code is much more sane, e.g. invoke kvm_x86_enable_virtualization()
> > during late_hardware_setup(), and kvm_x86_disable_virtualization() during module
> > exit (presumably).
>
> Fine to me, given I am not familiar with other ARCHs, assuming always enable
> virtualization when KVM present is fine to them. :-)
>
> Two questions below:
>
> > +int kvm_x86_enable_virtualization(void)
> > +{
> > + int r;
> > +
> > + guard(mutex)(&vendor_module_lock);
>
> It's a little bit odd to take the vendor_module_lock mutex.
>
> It is called by kvm_arch_init_vm(), so more reasonablly we should still use
> kvm_lock?

I think this should take an x86-specific lock, since it's guarding x86-specific
data. And vendor_module_lock fits the bill perfectly. Well, except for the
name, and I definitely have no objection to renaming it.

> Also, if we invoke kvm_x86_enable_virtualization() from
> kvm_x86_ops->late_hardware_setup(), then IIUC we will deadlock here because
> kvm_x86_vendor_init() already takes the vendor_module_lock?

Ah, yeah. Oh, duh. I think the reason I didn't initially suggest late_hardware_setup()
is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_exit().
E.g. in vt_init() or whatever it gets called:

r = kvm_x86_vendor_exit(...);
if (r)
return r;

if (enable_tdx) {
r = tdx_blah_blah_blah();
if (r)
goto vendor_exit;
}

> > + if (kvm_usage_count++)
> > + return 0;
> > +
> > + r = kvm_enable_virtualization();
> > + if (r)
> > + --kvm_usage_count;
> > +
> > + return r;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_x86_enable_virtualization);
> > +
>
> [...]
>
> > +int kvm_enable_virtualization(void)
> > {
> > + int r;
> > +
> > + r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > + kvm_online_cpu, kvm_offline_cpu);
> > + if (r)
> > + return r;
> > +
> > + register_syscore_ops(&kvm_syscore_ops);
> > +
> > + /*
> > + * Manually undo virtualization enabling if the system is going down.
> > + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> > + * possible for an in-flight module load to enable virtualization
> > + * after syscore_shutdown() is called, i.e. without kvm_shutdown()
> > + * being invoked. Note, this relies on system_state being set _before_
> > + * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
> > + * or this CPU observes the impending shutdown. Which is why KVM uses
> > + * a syscore ops hook instead of registering a dedicated reboot
> > + * notifier (the latter runs before system_state is updated).
> > + */
> > + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> > + system_state == SYSTEM_RESTART) {
> > + unregister_syscore_ops(&kvm_syscore_ops);
> > + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
> > + return -EBUSY;
> > + }
> > +
>
> Aren't we also supposed to do:
>
> on_each_cpu(__kvm_enable_virtualization, NULL, 1);
>
> here?

No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
That's part of the complexity I would like to get rid of.
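
[Editorial note: the kvm_usage_count pattern in the proposed
kvm_x86_enable_virtualization() above can be modeled in a few lines of
plain C — only the 0->1 transition does the expensive, fallible hardware
enabling, and the count is rolled back on failure. The names below are
illustrative, not the kernel's.]

```c
#include <assert.h>

static int usage_count;
static int hw_enabled;
static int fail_next_enable;	/* test knob simulating enable failure */

static int toy_hw_enable(void)
{
	if (fail_next_enable)
		return -1;
	hw_enabled = 1;
	return 0;
}

static int toy_enable_virtualization(void)
{
	if (usage_count++)	/* already enabled for an earlier VM */
		return 0;

	int r = toy_hw_enable();
	if (r)
		--usage_count;	/* roll back on failure */
	return r;
}

static void toy_disable_virtualization(void)
{
	if (!--usage_count)
		hw_enabled = 0;	/* last user gone: disable hardware */
}
```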

2024-04-18 00:09:32

by Isaku Yamahata

Subject: Re: [PATCH v19 102/130] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT

On Wed, Apr 17, 2024 at 11:05:05AM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 4/4/2024 2:51 AM, Isaku Yamahata wrote:
> > On Mon, Apr 01, 2024 at 04:22:00PM +0800,
> > Chao Gao <[email protected]> wrote:
> >
> > > On Mon, Feb 26, 2024 at 12:26:44AM -0800, [email protected] wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > Because guest TD state is protected, exceptions in guest TDs can't be
> > > > intercepted. TDX VMM doesn't need to handle exceptions.
> > > > tdx_handle_exit_irqoff() handles NMI and machine check. Ignore NMI and
> > > tdx_handle_exit_irqoff() doesn't handle NMIs.
> > Will move it to tdx_handle_exception().
>
> I don't get  why tdx_handle_exception()?
>
> NMI is handled in tdx_vcpu_enter_exit() prior to leaving the safety of
> noinstr, according to patch 098.
> https://lore.kernel.org/kvm/88920c598dcb55c15219642f27d0781af6d0c044.1708933498.git.isaku.yamahata@intel.com/
>
> @@ -837,6 +857,12 @@ static noinstr void tdx_vcpu_enter_exit(struct vcpu_tdx
> *tdx)
>      WARN_ON_ONCE(!kvm_rebooting &&
>               (tdx->exit_reason.full & TDX_SW_ERROR) == TDX_SW_ERROR);
>
> +    if ((u16)tdx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
> +        is_nmi(tdexit_intr_info(vcpu))) {
> +        kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
> +        vmx_do_nmi_irqoff();
> +        kvm_after_interrupt(vcpu);
> +    }
>      guest_state_exit_irqoff();
>  }

You're correct. tdx_vcpu_enter_exit() handles EXIT_REASON_EXCEPTION_NMI for NMI,
and tdx_handle_exception() ignores the NMI case.

The commit message should be updated to reference tdx_vcpu_enter_exit().


> > > > machine check and continue guest TD execution.
> > > >
> > > > For external interrupt, increment stats same to the VMX case.
> > > >
> > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > Reviewed-by: Paolo Bonzini <[email protected]>
> > > > ---
> > > > arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
> > > > 1 file changed, 23 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index 0db80fa020d2..bdd74682b474 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -918,6 +918,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> > > > vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
> > > > }
> > > >
> > > > +static int tdx_handle_exception(struct kvm_vcpu *vcpu)
>
> Should this function be named as tdx_handle_exception_nmi() since it's
> checking nmi as well?

Ok, tdx_handle_exception_nmi() is more consistent.
--
Isaku Yamahata <[email protected]>

2024-04-18 00:47:29

by Huang, Kai

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 18/04/2024 11:35 am, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, Kai Huang wrote:
>> On 18/04/2024 2:40 am, Sean Christopherson wrote:
>>> This way, architectures that aren't saddled with out-of-tree hypervisors can do
>>> the dead simple thing of enabling hardware during their initialization sequence,
>>> and the TDX code is much more sane, e.g. invoke kvm_x86_enable_virtualization()
>>> during late_hardware_setup(), and kvm_x86_disable_virtualization() during module
>>> exit (presumably).
>>
>> Fine to me, given I am not familiar with other ARCHs, assuming always enable
>> virtualization when KVM present is fine to them. :-)
>>
>> Two questions below:
>>
>>> +int kvm_x86_enable_virtualization(void)
>>> +{
>>> + int r;
>>> +
>>> + guard(mutex)(&vendor_module_lock);
>>
>> It's a little bit odd to take the vendor_module_lock mutex.
>>
>> It is called by kvm_arch_init_vm(), so more reasonablly we should still use
>> kvm_lock?
>
> I think this should take an x86-specific lock, since it's guarding x86-specific
> data.

OK. This makes sense.

> And vendor_module_lock fits the bill perfectly. Well, except for the
> name, and I definitely have no objection to renaming it.

No opinion on renaming. Personally I wouldn't bother to rename. We can
add a comment in kvm_x86_enable_virtualization() to explain. Perhaps in
the future we just want to change to always enable virtualization for
x86 too.

>
>> Also, if we invoke kvm_x86_enable_virtualization() from
>> kvm_x86_ops->late_hardware_setup(), then IIUC we will deadlock here because
>> kvm_x86_vendor_init() already takes the vendor_module_lock?
>
> Ah, yeah. Oh, duh. I think the reason I didn't initially suggest late_hardware_setup()
> is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_exit().
> E.g. in vt_init() or whatever it gets called:
>
> r = kvm_x86_vendor_exit(...);
> if (r)
> return r;
>
> if (enable_tdx) {
> r = tdx_blah_blah_blah();
> if (r)
> goto vendor_exit;
> }


I assume the reason you introduced the late_hardware_setup() is purely
because you want to do:

cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_enable);

after

kvm_ops_update()?

Anyway, we can also do 'enable_tdx' outside of kvm_x86_vendor_init() as
above, given it cannot be done in hardware_setup() anyway.

If we do 'enable_tdx' in late_hardware_setup(), we will need a
kvm_x86_enable_virtualization_nolock(), but that's also not a problem to me.

So which way do you prefer?

Btw, with kvm_x86_virtualization_enable(), it seems the compatibility
check is lost, which I assume is OK?

Btw2, currently tdx_enable() requires cpus_read_lock() must be called
prior. If we do unconditional tdx_cpu_enable() in vt_hardware_enable(),
then with your proposal IIUC there's no such requirement anymore,
because no task will be scheduled to the new CPU before it reaches
CPUHP_AP_ACTIVE. But now calling cpus_read_lock()/unlock() around
tdx_enable() is also acceptable to me.

[...]

>>
>>> +int kvm_enable_virtualization(void)
>>> {
>>> + int r;
>>> +
>>> + r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
>>> + kvm_online_cpu, kvm_offline_cpu);
>>> + if (r)
>>> + return r;
>>> +
>>> + register_syscore_ops(&kvm_syscore_ops);
>>> +
>>> + /*
>>> + * Manually undo virtualization enabling if the system is going down.
>>> + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
>>> + * possible for an in-flight module load to enable virtualization
>>> + * after syscore_shutdown() is called, i.e. without kvm_shutdown()
>>> + * being invoked. Note, this relies on system_state being set _before_
>>> + * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
>>> + * or this CPU observes the impending shutdown. Which is why KVM uses
>>> + * a syscore ops hook instead of registering a dedicated reboot
>>> + * notifier (the latter runs before system_state is updated).
>>> + */
>>> + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
>>> + system_state == SYSTEM_RESTART) {
>>> + unregister_syscore_ops(&kvm_syscore_ops);
>>> + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
>>> + return -EBUSY;
>>> + }
>>> +
>>
>> Aren't we also supposed to do:
>>
>> on_each_cpu(__kvm_enable_virtualization, NULL, 1);
>>
>> here?
>
> No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
> I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
> That's part of the complexity I would like to get rid of.

Ah, right :-)

Btw, why couldn't we do the 'system_state' check at the very beginning
of this function?


2024-04-18 01:10:14

by Huang, Kai

Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor



On 17/04/2024 4:44 am, Isaku Yamahata wrote:
> On Tue, Apr 16, 2024 at 12:05:31PM +1200,
> "Huang, Kai" <[email protected]> wrote:
>
>>
>>
>> On 16/04/2024 10:48 am, Yamahata, Isaku wrote:
>>> On Mon, Apr 15, 2024 at 06:49:35AM -0700,
>>> Sean Christopherson <[email protected]> wrote:
>>>
>>>> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
>>>>> On Fri, Apr 12, 2024 at 03:46:05PM -0700,
>>>>> Sean Christopherson <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
>>>>>>> On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <[email protected]> wrote:
>>>>>>>>> +void tdx_mmu_release_hkid(struct kvm *kvm)
>>>>>>>>> +{
>>>>>>>>> + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
>>>>>>>>> + ;
>>>>>>>>> }
>>>>>>>>
>>>>>>>> As I understand, __tdx_mmu_release_hkid() returns -EBUSY
>>>>>>>> after TDH.VP.FLUSH has been sent for every vCPU followed by
>>>>>>>> TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
>>>>>>>>
>>>>>>>> Considering earlier comment that a retry of TDH.VP.FLUSH is not
>>>>>>>> needed, why is this while() loop here that sends the
>>>>>>>> TDH.VP.FLUSH again to all vCPUs instead of just a loop within
>>>>>>>> __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
>>>>>>>>
>>>>>>>> Could it be possible for a vCPU to appear during this time, thus
>>>>>>>> be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
>>>>>>>> TDH.VP.FLUSH?
>>>>>>>
>>>>>>> Yes. There is a race between closing KVM vCPU fd and MMU notifier release hook.
>>>>>>> When KVM vCPU fd is closed, vCPU context can be loaded again.
>>>>>>
>>>>>> But why is _loading_ a vCPU context problematic?
>>>>>
>>>>> It's nothing problematic. It becomes a bit harder to understand why
>>>>> tdx_mmu_release_hkid() issues IPI on each loop. I think it's reasonable
>>>>> to make the normal path easy and to complicate/penalize the destruction path.
>>>>> Probably I should've added comment on the function.
>>>>
>>>> By "problematic", I meant, why can that result in a "missed in one TDH.VP.FLUSH
>>>> cycle"? AFAICT, loading a vCPU shouldn't cause that vCPU to be associated from
>>>> the TDX module's perspective, and thus shouldn't trigger TDX_FLUSHVP_NOT_DONE.
>>>>
>>>> I.e. looping should be unnecessary, no?
>>>
>>> The loop is unnecessary with the current code.
>>>
>>> The possible future optimization is to reduce destruction time of Secure-EPT
>>> somehow. One possible option is to release HKID while vCPUs are still alive and
>>> destruct Secure-EPT with multiple vCPU context. Because that's future
>>> optimization, we can ignore it at this phase.
>>
>> I kinda lost here.
>>
>> I thought in the current v19 code, you have already implemented this
>> optimization?
>>
>> Or is this optimization totally different from what we discussed in an
>> earlier patch?
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> That's only the first step. We can optimize it further with multiple vCPUs
> context.

OK. Let's put aside how important these optimizations are and whether
they should be done in the initial TDX support. I think the right way to
organize the patches is to bring functionality first and put the
performance optimizations later.

That can make both writing code and code review easier.

And more importantly, the "performance optimization" can be discussed
_separately_.

For example, as mentioned in the link above, I think the optimization of
"releasing HKID in the MMU notifier release to improve TD teardown
latency" complicates things a lot, e.g., not only to the TD
creation/teardown sequence, but also here -- w/o it we don't even need
to consider the race between vCPU load and MMU notifier release:

https://lore.kernel.org/kvm/[email protected]/

So I think we should start with implementing these sequences in "normal
way" first, and then do the optimization(s) later.

And to me the "normal way" for TD creation/destruction we can just do:

1) Use normal SEPT sequence to teardown private EPT page table;
2) Do VP.FLUSH when vCPU is destroyed
3) Do VPFLUSHDONE after all vCPUs are destroyed
4) release HKID at last stage of destroying VM.

For vCPU migration, you do VP.FLUSH on the old pCPU before you load the
vCPU to the new pCPU, as shown in this patch.

Then you don't need to cover the silly code change around
tdx_mmu_release_hkid() in this patch.

Am I missing anything?
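
[Editorial note: the four-step "normal way" teardown ordering proposed
above can be expressed as a checkable toy — flush each vCPU, then
VPFLUSHDONE once, then release the HKID last. The functions below are
hypothetical stand-ins for the SEAMCALL sequence, not real KVM/TDX code.]

```c
#include <assert.h>
#include <string.h>

#define NR_VCPUS 3

static char event_log[64];	/* records the order of operations */

static void record(char c)
{
	event_log[strlen(event_log)] = c;
}

static void toy_vp_flush(int cpu)  { (void)cpu; record('F'); }
static void toy_vpflushdone(void)  { record('D'); }
static void toy_release_hkid(void) { record('H'); }

static void toy_teardown_td(void)
{
	/* 2) VP.FLUSH as each vCPU is destroyed */
	for (int i = 0; i < NR_VCPUS; i++)
		toy_vp_flush(i);
	/* 3) VPFLUSHDONE after all vCPUs are destroyed */
	toy_vpflushdone();
	/* 4) release the HKID at the last stage of destroying the VM */
	toy_release_hkid();
}
```

With this ordering there is no retry loop: VPFLUSHDONE cannot observe an
associated vCPU, so TDX_FLUSHVP_NOT_DONE cannot occur.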


2024-04-18 10:46:42

by Binbin Wu

Subject: Re: [PATCH v19 110/130] KVM: TDX: Handle TDX PV MMIO hypercall



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> Export kvm_io_bus_read and kvm_mmio tracepoint and wire up TDX PV MMIO
> hypercall to the KVM backend functions.
>
> kvm_io_bus_read/write() searches KVM device emulated in kernel of the given
> MMIO address and emulates the MMIO. As TDX PV MMIO also needs it, export
> kvm_io_bus_read(). kvm_io_bus_write() is already exported. TDX PV MMIO
> emulates some of MMIO itself. To add trace point consistently with x86
> kvm, export kvm_mmio tracepoint.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> virt/kvm/kvm_main.c | 2 +
> 3 files changed, 117 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 55fc6cc6c816..389bb95d2af0 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1217,6 +1217,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> +static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
> +{
> + unsigned long val = 0;
> + gpa_t gpa;
> + int size;
> +
> + KVM_BUG_ON(vcpu->mmio_needed != 1, vcpu->kvm);
> + vcpu->mmio_needed = 0;
> +
> + if (!vcpu->mmio_is_write) {
> + gpa = vcpu->mmio_fragments[0].gpa;
> + size = vcpu->mmio_fragments[0].len;
> +
> + memcpy(&val, vcpu->run->mmio.data, size);
> + tdvmcall_set_return_val(vcpu, val);
> + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> + }
> + return 1;
> +}
> +
> +static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
> + unsigned long val)
> +{
> + if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> + kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> + return -EOPNOTSUPP;
> +
> + trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
> + return 0;
> +}
> +
> +static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
> +{
> + unsigned long val;
> +
> + if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> + kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> + return -EOPNOTSUPP;
> +
> + tdvmcall_set_return_val(vcpu, val);
> + trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> + return 0;
> +}
> +
> +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_memory_slot *slot;
> + int size, write, r;
> + unsigned long val;
> + gpa_t gpa;
> +
> + KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
> +
> + size = tdvmcall_a0_read(vcpu);
> + write = tdvmcall_a1_read(vcpu);
> + gpa = tdvmcall_a2_read(vcpu);
> + val = write ? tdvmcall_a3_read(vcpu) : 0;
> +
> + if (size != 1 && size != 2 && size != 4 && size != 8)
> + goto error;
> + if (write != 0 && write != 1)
> + goto error;
> +
> + /* Strip the shared bit, allow MMIO with and without it set. */
Based on the discussion
https://lore.kernel.org/all/[email protected]/
Do we still allow the MMIO without shared bit?

> + gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
> +
> + if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
"size > 8u" can be removed, since based on the check of size above, it
can't be greater than 8.


> + goto error;
> +
> + slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> + if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
> + goto error;
> +
> + if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
Should this be checked for write first?

I check the handle_ept_misconfig() in VMX, it doesn't check write first
neither.

Functionally, it should be OK since guest will not read the address
range of fast mmio.
So the read case will be filtered out by ioeventfd_write().
But it has to take a long path to get to ioeventfd_write().
Isn't it more efficient to check write first?


> + trace_kvm_fast_mmio(gpa);
> + return 1;
> + }
> +
> + if (write)
> + r = tdx_mmio_write(vcpu, gpa, size, val);
> + else
> + r = tdx_mmio_read(vcpu, gpa, size);
> + if (!r) {
> + /* Kernel completed device emulation. */
> + tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
> + return 1;
> + }
> +
> + /* Request the device emulation to userspace device model. */
> + vcpu->mmio_needed = 1;
> + vcpu->mmio_is_write = write;
> + vcpu->arch.complete_userspace_io = tdx_complete_mmio;
> +
> + vcpu->run->mmio.phys_addr = gpa;
> + vcpu->run->mmio.len = size;
> + vcpu->run->mmio.is_write = write;
> + vcpu->run->exit_reason = KVM_EXIT_MMIO;
> +
> + if (write) {
> + memcpy(vcpu->run->mmio.data, &val, size);
> + } else {
> + vcpu->mmio_fragments[0].gpa = gpa;
> + vcpu->mmio_fragments[0].len = size;
> + trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
> + }
> + return 0;
> +
> +error:
> + tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
> + return 1;
> +}
> +
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
> if (tdvmcall_exit_type(vcpu))
> @@ -1229,6 +1341,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> return tdx_emulate_hlt(vcpu);
> case EXIT_REASON_IO_INSTRUCTION:
> return tdx_emulate_io(vcpu);
> + case EXIT_REASON_EPT_VIOLATION:
> + return tdx_emulate_mmio(vcpu);
> default:
> break;
> }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03950368d8db..d5b18cad9dcd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13975,6 +13975,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e27c22449d85..bc14e1f2610c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2689,6 +2689,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>
> return NULL;
> }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>
> bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> {
> @@ -5992,6 +5993,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> r = __kvm_io_bus_read(vcpu, bus, &range, val);
> return r < 0 ? r : 0;
> }
> +EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>
> int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> int len, struct kvm_io_device *dev)


2024-04-18 11:04:37

by Binbin Wu

Subject: Re: [PATCH v19 110/130] KVM: TDX: Handle TDX PV MMIO hypercall



On 4/18/2024 5:29 PM, Binbin Wu wrote:
>
>> +
>> +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
>> +{
>> +    struct kvm_memory_slot *slot;
>> +    int size, write, r;
>> +    unsigned long val;
>> +    gpa_t gpa;
>> +
>> +    KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
>> +
>> +    size = tdvmcall_a0_read(vcpu);
>> +    write = tdvmcall_a1_read(vcpu);
>> +    gpa = tdvmcall_a2_read(vcpu);
>> +    val = write ? tdvmcall_a3_read(vcpu) : 0;
>> +
>> +    if (size != 1 && size != 2 && size != 4 && size != 8)
>> +        goto error;
>> +    if (write != 0 && write != 1)
>> +        goto error;
>> +
>> +    /* Strip the shared bit, allow MMIO with and without it set. */
> Based on the discussion
> https://lore.kernel.org/all/[email protected]/
> Do we still allow the MMIO without shared bit?
>
>> +    gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
>> +
>> +    if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
> "size > 8u" can be removed, since based on the check of size above, it
> can't be greater than 8.
>
>
>> +        goto error;
>> +
>> +    slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
>> +    if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
>> +        goto error;
>> +
>> +    if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> Should this be checked for write first?
>
> I check the handle_ept_misconfig() in VMX, it doesn't check write
> first neither.
>
> Functionally, it should be OK since guest will not read the address
> range of fast mmio.
> So the read case will be filtered out by ioeventfd_write().
> But it has take a long way to get to ioeventfd_write().
> Isn't it more efficient to check write first?

I got the reason why handle_ept_misconfig() tries to do the fast mmio
write without checking.
It was intended to make fast mmio faster.
And for ept misconfig case, it's not easy to get the info of read/write.

But in this patch, we already have the read/write info, so maybe we can
add the check for write before fast mmio?
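
[Editorial note: the suggested ordering — gate the fast-MMIO (ioeventfd)
probe on the write flag, since the TDVMCALL already distinguishes read
from write — can be sketched as a toy dispatcher. All names are
hypothetical; this is a sketch of the suggestion, not the actual patch.]

```c
#include <assert.h>

enum path { PATH_FAST_MMIO, PATH_SLOW_READ, PATH_SLOW_WRITE };

/* Pretend an ioeventfd is registered at this one GPA. */
static int toy_fast_mmio_write(unsigned long gpa)
{
	return gpa == 0x1000;
}

static enum path toy_dispatch_mmio(int write, unsigned long gpa)
{
	/* Only writes probe the fast-MMIO bus; reads skip it entirely,
	 * instead of taking the long path into ioeventfd_write() just
	 * to be rejected there. */
	if (write && toy_fast_mmio_write(gpa))
		return PATH_FAST_MMIO;
	return write ? PATH_SLOW_WRITE : PATH_SLOW_READ;
}
```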


>
>
>> +        trace_kvm_fast_mmio(gpa);
>> +        return 1;
>> +    }
>> +
>> +    if (write)
>> +        r = tdx_mmio_write(vcpu, gpa, size, val);
>> +    else
>> +        r = tdx_mmio_read(vcpu, gpa, size);
>> +    if (!r) {
>> +        /* Kernel completed device emulation. */
>> +        tdvmcall_set_return_code(vcpu, TDVMCALL_SUCCESS);
>> +        return 1;
>> +    }
>> +
>> +    /* Request the device emulation to userspace device model. */
>> +    vcpu->mmio_needed = 1;
>> +    vcpu->mmio_is_write = write;
>> +    vcpu->arch.complete_userspace_io = tdx_complete_mmio;
>> +
>> +    vcpu->run->mmio.phys_addr = gpa;
>> +    vcpu->run->mmio.len = size;
>> +    vcpu->run->mmio.is_write = write;
>> +    vcpu->run->exit_reason = KVM_EXIT_MMIO;
>> +
>> +    if (write) {
>> +        memcpy(vcpu->run->mmio.data, &val, size);
>> +    } else {
>> +        vcpu->mmio_fragments[0].gpa = gpa;
>> +        vcpu->mmio_fragments[0].len = size;
>> +        trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa,
>> NULL);
>> +    }
>> +    return 0;
>> +
>> +error:
>> +    tdvmcall_set_return_code(vcpu, TDVMCALL_INVALID_OPERAND);
>> +    return 1;
>> +}
>> +
>>   static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>>   {
>>       if (tdvmcall_exit_type(vcpu))
>> @@ -1229,6 +1341,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>>           return tdx_emulate_hlt(vcpu);
>>       case EXIT_REASON_IO_INSTRUCTION:
>>           return tdx_emulate_io(vcpu);
>> +    case EXIT_REASON_EPT_VIOLATION:
>> +        return tdx_emulate_mmio(vcpu);
>>       default:
>>           break;
>>       }
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 03950368d8db..d5b18cad9dcd 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -13975,6 +13975,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>>     EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index e27c22449d85..bc14e1f2610c 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2689,6 +2689,7 @@ struct kvm_memory_slot
>> *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>>         return NULL;
>>   }
>> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>>     bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
>>   {
>> @@ -5992,6 +5993,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum
>> kvm_bus bus_idx, gpa_t addr,
>>       r = __kvm_io_bus_read(vcpu, bus, &range, val);
>>       return r < 0 ? r : 0;
>>   }
>> +EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>>     int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus
>> bus_idx, gpa_t addr,
>>                   int len, struct kvm_io_device *dev)
>
>


2024-04-18 12:03:15

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 079/130] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)



On 4/13/2024 4:17 AM, Isaku Yamahata wrote:
>>> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>> Just like vmx_prepare_switch_to_host(), the input can be "struct vcpu_tdx
>> *", since vcpu is not used inside the function.
>> And the callsites just use "to_tdx(vcpu)"
>>
>>> +{
>>> + struct vcpu_tdx *tdx = to_tdx(vcpu);
>> Then, this can be dropped.
> prepare_switch_to_guest() is used for kvm_x86_ops.prepare_switch_to_guest().
> kvm_x86_ops consistently takes struct kvm_vcpu.

Oh yes, it's not suitable for tdx_prepare_switch_to_guest().
Still, it can be for tdx_prepare_switch_to_host().




2024-04-18 13:55:31

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implements set_msr/get_msr/has_emulated_msr methods for TDX to handle
> hypercall from guest TD for paravirtualized rdmsr and wrmsr. The TDX
> module virtualizes MSRs. For some MSRs, it injects #VE to the guest TD
> upon RDMSR or WRMSR. The exact list of such MSRs are defined in the spec.
>
> Upon #VE, the guest TD may execute hypercalls,
> TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
> which are defined in GHCI (Guest-Host Communication Interface) so that the
> host VMM (e.g. KVM) can virtualize the MSRs.
>
> There are three classes of MSRs virtualization.
> - non-configurable: TDX module directly virtualizes it. VMM can't
> configure. the value set by KVM_SET_MSR_INDEX_LIST is ignored.

There is no KVM_SET_MSR_INDEX_LIST in current kvm code.
Do you mean KVM_SET_MSRS?

> - configurable: TDX module directly virtualizes it. VMM can configure at
> the VM creation time. The value set by KVM_SET_MSR_INDEX_LIST is used.
> - #VE case
> Guest TD would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR> and
> VMM handles the MSR hypercall. The value set by KVM_SET_MSR_INDEX_LIST
> is used.
>

2024-04-18 14:34:48

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Apr 18, 2024, Kai Huang wrote:
> On 18/04/2024 11:35 am, Sean Christopherson wrote:
> > Ah, yeah. Oh, duh. I think the reason I didn't initially suggest late_hardware_setup()
> > is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_exit().
> > E.g. in vt_init() or whatever it gets called:
> >
> > r = kvm_x86_vendor_exit(...);
> > if (r)
> > return r;
> >
> > if (enable_tdx) {
> > r = tdx_blah_blah_blah();
> > if (r)
> > goto vendor_exit;
> > }
>
>
> I assume the reason you introduced the late_hardware_setup() is purely
> because you want to do:
>
> cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_enable);
>
> after
>
> kvm_ops_update()?

No, kvm_ops_update() needs to come before kvm_x86_enable_virtualization(), as the
static_call() to hardware_enable() needs to be patched in.

Oh, and my adjust patch is broken, the code to do the compat checks should NOT
be removed; it could be removed if KVM unconditionally enabled VMX during setup,
but it needs to stay in the !TDX case.

- for_each_online_cpu(cpu) {
- smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
- if (r < 0)
- goto out_unwind_ops;
- }

Which is another reason to defer kvm_x86_enable_virtualization(), though to be
honest not a particularly compelling reason on its own.

> Anyway, we can also do 'enable_tdx' outside of kvm_x86_vendor_init() as
> above, given it cannot be done in hardware_setup() anyway.
>
> If we do 'enable_tdx' in late_hardware_setup(), we will need a
> kvm_x86_enable_virtualization_nolock(), but that's also not a problem to me.
>
> So which way do you prefer?
>
> Btw, with kvm_x86_virtualization_enable(), it seems the compatibility check
> is lost, which I assume is OK?

Heh, and I obviously wasn't reading ahead :-)

> Btw2, currently tdx_enable() requires cpus_read_lock() must be called prior.
> If we do unconditional tdx_cpu_enable() in vt_hardware_enable(), then with
> your proposal IIUC there's no such requirement anymore, because no task will
> be scheduled to the new CPU before it reaches CPUHP_AP_ACTIVE.

Correct.

> But now calling cpus_read_lock()/unlock() around tdx_enable() also acceptable
> to me.

No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().

> > > > +int kvm_enable_virtualization(void)
> > > > {
> > > > + int r;
> > > > +
> > > > + r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > > > + kvm_online_cpu, kvm_offline_cpu);
> > > > + if (r)
> > > > + return r;
> > > > +
> > > > + register_syscore_ops(&kvm_syscore_ops);
> > > > +
> > > > + /*
> > > > + * Manually undo virtualization enabling if the system is going down.
> > > > + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> > > > + * possible for an in-flight module load to enable virtualization
> > > > + * after syscore_shutdown() is called, i.e. without kvm_shutdown()
> > > > + * being invoked. Note, this relies on system_state being set _before_
> > > > + * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
> > > > + * or this CPU observes the impending shutdown. Which is why KVM uses
> > > > + * a syscore ops hook instead of registering a dedicated reboot
> > > > + * notifier (the latter runs before system_state is updated).
> > > > + */
> > > > + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> > > > + system_state == SYSTEM_RESTART) {
> > > > + unregister_syscore_ops(&kvm_syscore_ops);
> > > > + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
> > > > + return -EBUSY;
> > > > + }
> > > > +
> > >
> > > Aren't we also supposed to do:
> > >
> > > on_each_cpu(__kvm_enable_virtualization, NULL, 1);
> > >
> > > here?
> >
> > No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
> > I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
> > That's part of the complexity I would like to get rid of.
>
> Ah, right :-)
>
> Btw, why couldn't we do the 'system_state' check at the very beginning of
> this function?

We could, but we'd still need to check after, and adding a small bit of extra
complexity just to try to catch a very rare situation isn't worth it.

To prevent races, system_state needs to be checked after register_syscore_ops(),
because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
of a shutdown.

And because the kvm_syscore_ops hooks disable virtualization, they should be called
after cpuhp_setup_state(). That's not strictly required, as the per-CPU
hardware_enabled flag will prevent true problems if the system enters shutdown
state before KVM reaches cpuhp_setup_state().

Hmm, but the same edge cases exists in the above flow. If the system enters
shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
disable (which again is benign because of hardware_enabled).

Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
and one that could be fatal. If the system does suspend+resume before the cpuhp
hooks are registered, kvm_resume() would enable virtualization. And then if
cpuhp_setup_state() failed, virtualization would be left enabled.

So cpuhp_setup_state() *must* come before register_syscore_ops(), and
register_syscore_ops() *must* come before the system_state check.

2024-04-18 16:46:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > > So my feedback is to not worry about the exports, and instead focus on
> > > figuring
> > > out a way to make the generated code less bloated and easier to read/debug.
> >
> > I think it was a mistake trying to centralize TDCALL/SEAMCALL calls into
> > a few megawrappers. I think we can get better results by shifting leaf
> > function wrappers into assembly.
> >
> > We are going to have more assembly, but it should produce a better result.
> > Adding macros can help to write such wrappers and minimize boilerplate.
> >
> > Below is an example of how it can look. It's not complete: I only
> > converted TDCALLs, not TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> > more complex.
> >
> > Any opinions? Is it something worth investing more time?
>
> We discussed offline how implementing these for each TDVM/SEAMCALL increases the
> chances of a bug in just one TDVM/SEAMCALL, which could make debugging
> problems more challenging. Kirill raised the possibility of some code-generating
> solution like cpufeatures.h, that could take a spec and generate correct calls.
>
> So far no big wins have presented themselves. Kirill, do we think the path to
> move the messy part out-of-line will not work?

I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers.
Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/

I still need to clean it up and write commit messages and comments for all
wrappers.

Now I think it's worth a shot.

Any feedback?

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-18 19:01:55

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX



On 4/5/2024 7:42 AM, Isaku Yamahata wrote:
> On Wed, Apr 03, 2024 at 08:14:04AM -0700,
> Sean Christopherson <[email protected]> wrote:
>
>> On Mon, Feb 26, 2024, [email protected] wrote:
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 389bb95d2af0..c8f991b69720 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -1877,6 +1877,76 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
>>> *error_code = 0;
>>> }
>>>
>>> +static bool tdx_is_emulated_kvm_msr(u32 index, bool write)
>>> +{
>>> + switch (index) {
>>> + case MSR_KVM_POLL_CONTROL:
>>> + return true;
>>> + default:
>>> + return false;
>>> + }
>>> +}
>>> +
>>> +bool tdx_has_emulated_msr(u32 index, bool write)
>>> +{
>>> + switch (index) {
>>> + case MSR_IA32_UCODE_REV:
>>> + case MSR_IA32_ARCH_CAPABILITIES:
>>> + case MSR_IA32_POWER_CTL:
>>> + case MSR_IA32_CR_PAT:
>>> + case MSR_IA32_TSC_DEADLINE:
>>> + case MSR_IA32_MISC_ENABLE:
>>> + case MSR_PLATFORM_INFO:
>>> + case MSR_MISC_FEATURES_ENABLES:
>>> + case MSR_IA32_MCG_CAP:
>>> + case MSR_IA32_MCG_STATUS:
>>> + case MSR_IA32_MCG_CTL:
>>> + case MSR_IA32_MCG_EXT_CTL:
>>> + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
>>> + case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
>>> + /* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
>>> + return true;
>>> + case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
>>> + /*
>>> + * x2APIC registers that are virtualized by the CPU can't be
>>> + * emulated, KVM doesn't have access to the virtual APIC page.
>>> + */
>>> + switch (index) {
>>> + case X2APIC_MSR(APIC_TASKPRI):
>>> + case X2APIC_MSR(APIC_PROCPRI):
>>> + case X2APIC_MSR(APIC_EOI):
>>> + case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
>>> + case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
>>> + case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
>>> + return false;
>>> + default:
>>> + return true;
>>> + }
>>> + case MSR_IA32_APICBASE:
>>> + case MSR_EFER:
>>> + return !write;
>> Meh, for literally two MSRs, just open code them in tdx_set_msr() and drop the
>> @write param. Or alternatively add:
>>
>> static bool tdx_is_read_only_msr(u32 msr){
>> {
>> return msr == MSR_IA32_APICBASE || msr == MSR_EFER;
>> }
> Sure will add.
>
>>> + case 0x4b564d00 ... 0x4b564dff:
>> This is silly, just do
>>
>> case MSR_KVM_POLL_CONTROL:
>> return false;

Should return true here, right?
>>
>> and let everything else go through the default statement, no?
> Now tdx_is_emulated_kvm_msr() is trivial, will open code it.
>
>
>>> + /* KVM custom MSRs */
>>> + return tdx_is_emulated_kvm_msr(index, write);
>>> + default:
>>> + return false;
>>> + }
>>> +}
>>> +
>>>

2024-04-18 21:23:12

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 110/130] KVM: TDX: Handle TDX PV MMIO hypercall

On Thu, Apr 18, 2024 at 07:04:11PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 4/18/2024 5:29 PM, Binbin Wu wrote:
> >
> > > +
> > > +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> > > +{
> > > +    struct kvm_memory_slot *slot;
> > > +    int size, write, r;
> > > +    unsigned long val;
> > > +    gpa_t gpa;
> > > +
> > > +    KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
> > > +
> > > +    size = tdvmcall_a0_read(vcpu);
> > > +    write = tdvmcall_a1_read(vcpu);
> > > +    gpa = tdvmcall_a2_read(vcpu);
> > > +    val = write ? tdvmcall_a3_read(vcpu) : 0;
> > > +
> > > +    if (size != 1 && size != 2 && size != 4 && size != 8)
> > > +        goto error;
> > > +    if (write != 0 && write != 1)
> > > +        goto error;
> > > +
> > > +    /* Strip the shared bit, allow MMIO with and without it set. */
> > Based on the discussion
> > https://lore.kernel.org/all/[email protected]/
> > Do we still allow the MMIO without shared bit?

That's independent. That discussion is about how to work around the guest
accessing the MMIO region with a private GPA. This part is about the guest
issuing TDG.VP.VMCALL<MMIO>, where KVM masks out the shared bit to make it
friendly to the user space VMM.



> > > +    gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
> > > +
> > > +    if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
> > "size > 8u" can be removed, since based on the check of size above, it
> > can't be greater than 8.

Yes, will remove the check.


> > > +        goto error;
> > > +
> > > +    slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> > > +    if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
> > > +        goto error;
> > > +
> > > +    if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> > Should this be checked for write first?
> >
> > I checked handle_ept_misconfig() in VMX; it doesn't check write first
> > either.
> >
> > Functionally, it should be OK since guest will not read the address
> > range of fast mmio.
> > So the read case will be filtered out by ioeventfd_write().
> > But it takes a long path to get to ioeventfd_write().
> > Isn't it more efficient to check write first?
>
> I see the reason why handle_ept_misconfig() tries to do the fast mmio
> write without checking.
> It was intended to make fast mmio faster.
> And for the ept misconfig case, it's not easy to get the read/write info.
>
> But in this patch, we already have the read/write info, so maybe we can add
> the check for write before fast mmio?

Yes, let's add it.
--
Isaku Yamahata <[email protected]>

2024-04-18 21:27:41

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX

On Thu, Apr 18, 2024 at 09:54:39PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Implements set_msr/get_msr/has_emulated_msr methods for TDX to handle
> > hypercall from guest TD for paravirtualized rdmsr and wrmsr. The TDX
> > module virtualizes MSRs. For some MSRs, it injects #VE to the guest TD
> > upon RDMSR or WRMSR. The exact list of such MSRs are defined in the spec.
> >
> > Upon #VE, the guest TD may execute hypercalls,
> > TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
> > which are defined in GHCI (Guest-Host Communication Interface) so that the
> > host VMM (e.g. KVM) can virtualize the MSRs.
> >
> > There are three classes of MSRs virtualization.
> > - non-configurable: TDX module directly virtualizes it. VMM can't
> > configure. the value set by KVM_SET_MSR_INDEX_LIST is ignored.
>
> There is no KVM_SET_MSR_INDEX_LIST in current kvm code.
> Do you mean KVM_SET_MSRS?

Yes, will fix it. Thank you for catching it.
--
Isaku Yamahata <[email protected]>

2024-04-18 23:10:01

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 19/04/2024 2:30 am, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, Kai Huang wrote:
>> On 18/04/2024 11:35 am, Sean Christopherson wrote:
>>> Ah, yeah. Oh, duh. I think the reason I didn't initially suggest late_hardware_setup()
>>> is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_exit().
>>> E.g. in vt_init() or whatever it gets called:
>>>
>>> r = kvm_x86_vendor_exit(...);
>>> if (r)
>>> return r;
>>>
>>> if (enable_tdx) {
>>> r = tdx_blah_blah_blah();
>>> if (r)
>>> goto vendor_exit;
>>> }
>>
>>
>> I assume the reason you introduced the late_hardware_setup() is purely
>> because you want to do:
>>
>> cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_enable);
>>
>> after
>>
>> kvm_ops_update()?
>
> No, kvm_ops_update() needs to come before kvm_x86_enable_virtualization(), as the
> static_call() to hardware_enable() needs to be patched in.

Right. I meant that the reason you introduced late_hardware_setup() was
that we need to do kvm_x86_virtualization_enabled() and the above
cpu_emergency_register_virt_callback() after kvm_ops_update().

>
> Oh, and my adjust patch is broken, the code to do the compat checks should NOT
> be removed; it could be removed if KVM unconditionally enabled VMX during setup,
> but it needs to stay in the !TDX case.

Right.

>
> - for_each_online_cpu(cpu) {
> - smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
> - if (r < 0)
> - goto out_unwind_ops;
> - }
>
> Which is another reason to defer kvm_x86_enable_virtualization(), though to be
> honest not a particularly compelling reason on its own.
>
>> Anyway, we can also do 'enable_tdx' outside of kvm_x86_vendor_init() as
>> above, given it cannot be done in hardware_setup() anyway.
>>
>> If we do 'enable_tdx' in late_hardware_setup(), we will need a
>> kvm_x86_enable_virtualization_nolock(), but that's also not a problem to me.
>>
>> So which way do you prefer?
>>
>> Btw, with kvm_x86_virtualization_enable(), it seems the compatibility check
>> is lost, which I assume is OK?
>
> Heh, and I obviously wasn't reading ahead :-)
>
>> Btw2, currently tdx_enable() requires cpus_read_lock() must be called prior.
>> If we do unconditional tdx_cpu_enable() in vt_hardware_enable(), then with
>> your proposal IIUC there's no such requirement anymore, because no task will
>> be scheduled to the new CPU before it reaches CPUHP_AP_ACTIVE.
>
> Correct.
>
>> But now calling cpus_read_lock()/unlock() around tdx_enable() also acceptable
>> to me.
>
> No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().

Right, but it takes cpus_read_lock()/unlock() internally. I was talking
about:

if (enable_tdx) {
kvm_x86_virtualization_enable();

/*
* Unfortunately currently tdx_enable() internally has
* lockdep_assert_cpus_held().
*/
cpus_read_lock();
tdx_enable();
cpus_read_unlock();
}

>
>>>>> +int kvm_enable_virtualization(void)
>>>>> {
>>>>> + int r;
>>>>> +
>>>>> + r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
>>>>> + kvm_online_cpu, kvm_offline_cpu);
>>>>> + if (r)
>>>>> + return r;
>>>>> +
>>>>> + register_syscore_ops(&kvm_syscore_ops);
>>>>> +
>>>>> + /*
>>>>> + * Manually undo virtualization enabling if the system is going down.
>>>>> + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
>>>>> + * possible for an in-flight module load to enable virtualization
>>>>> + * after syscore_shutdown() is called, i.e. without kvm_shutdown()
>>>>> + * being invoked. Note, this relies on system_state being set _before_
>>>>> + * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
>>>>> + * or this CPU observes the impending shutdown. Which is why KVM uses
>>>>> + * a syscore ops hook instead of registering a dedicated reboot
>>>>> + * notifier (the latter runs before system_state is updated).
>>>>> + */
>>>>> + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
>>>>> + system_state == SYSTEM_RESTART) {
>>>>> + unregister_syscore_ops(&kvm_syscore_ops);
>>>>> + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
>>>>> + return -EBUSY;
>>>>> + }
>>>>> +
>>>>
>>>> Aren't we also supposed to do:
>>>>
>>>> on_each_cpu(__kvm_enable_virtualization, NULL, 1);
>>>>
>>>> here?
>>>
>>> No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
>>> I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
>>> That's part of the complexity I would like to get rid of.
>>
>> Ah, right :-)
>>
>> Btw, why couldn't we do the 'system_state' check at the very beginning of
>> this function?
>
> We could, but we'd still need to check after, and adding a small bit of extra
> complexity just to try to catch a very rare situation isn't worth it.
>
> To prevent races, system_state needs to be check after register_syscore_ops(),
> because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
> of a shutdown.
> And because the kvm_syscore_ops hooks disable virtualization, they should be called
> after cpuhp_setup_state(). That's not strictly required, as the per-CPU
> hardware_enabled flag will prevent true problems if the system enter shutdown
> state before KVM reaches cpuhp_setup_state().
>
> Hmm, but the same edge cases exists in the above flow. If the system enters
> shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
> and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
> disable (which again is benign because of hardware_enabled).
>
> Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
> and one that could be fatal. If the system does suspend+resume before the cpuhup
> hooks are registered, kvm_resume() would enable virtualization. And then if
> cpuhp_setup_state() failed, virtualization would be left enabled.
>
> So cpuhp_setup_state() *must* come before register_syscore_ops(), and
> register_syscore_ops() *must* come before the system_state check.

OK. I guess I have to double check here to completely understand the
races. :-)

So I think we have consensus to go with the approach shown in your
second diff -- that is, to always enable virtualization during module
loading for all ARCHs other than x86, where we only always enable
virtualization during module loading for TDX.

Then how about "do kvm_x86_virtualization_enable() within
late_hardware_setup() in kvm_x86_vendor_init()" vs "do
kvm_x86_virtualization_enable() in TDX-specific code after
kvm_x86_vendor_init()"?

Which do you prefer?

2024-04-19 01:20:19

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v19 010/130] KVM: x86: Pass is_private to gmem hook of gmem_max_level

On Mon, Feb 26, 2024 at 12:25:12AM -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX wants to know the faulting address is shared or private so that the max
> level is limited by Secure-EPT or not. Because fault->gfn doesn't include
> shared bit, gfn doesn't tell if the faulting address is shared or not.
> Pass is_private for TDX case.
>
> TDX logic will be if (!is_private) return 0; else return PG_LEVEL_4K.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 3 ++-
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d15f5b4b1656..57ce89fc2740 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1797,7 +1797,8 @@ struct kvm_x86_ops {
>
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
>
> - int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
> + int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
> + bool is_private, u8 *max_level);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1e5e12d2707d..22db1a9f528a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4324,7 +4324,8 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
>
> max_level = kvm_max_level_for_order(max_order);
> r = static_call(kvm_x86_gmem_max_level)(vcpu->kvm, fault->pfn,
> - fault->gfn, &max_level);
> + fault->gfn, fault->is_private,
> + &max_level);
fault->is_private is always true in kvm_faultin_pfn_private().
Besides, as shared page allocation will not go to kvm_faultin_pfn_private(),
why do we need to add the "is_private" parameter ?

> if (r) {
> kvm_release_pfn_clean(fault->pfn);
> return r;
> --
> 2.25.1
>
>

2024-04-19 03:01:23

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 110/130] KVM: TDX: Handle TDX PV MMIO hypercall



On 4/19/2024 5:22 AM, Isaku Yamahata wrote:
> On Thu, Apr 18, 2024 at 07:04:11PM +0800,
> Binbin Wu <[email protected]> wrote:
>
>>
>> On 4/18/2024 5:29 PM, Binbin Wu wrote:
>>>> +
>>>> +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +    struct kvm_memory_slot *slot;
>>>> +    int size, write, r;
>>>> +    unsigned long val;
>>>> +    gpa_t gpa;
>>>> +
>>>> +    KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
>>>> +
>>>> +    size = tdvmcall_a0_read(vcpu);
>>>> +    write = tdvmcall_a1_read(vcpu);
>>>> +    gpa = tdvmcall_a2_read(vcpu);
>>>> +    val = write ? tdvmcall_a3_read(vcpu) : 0;
>>>> +
>>>> +    if (size != 1 && size != 2 && size != 4 && size != 8)
>>>> +        goto error;
>>>> +    if (write != 0 && write != 1)
>>>> +        goto error;
>>>> +
>>>> +    /* Strip the shared bit, allow MMIO with and without it set. */
>>> Based on the discussion
>>> https://lore.kernel.org/all/[email protected]/
>>> Do we still allow the MMIO without shared bit?
> That's independent. The part is how to work around guest accesses the
> MMIO region with private GPA. This part is, the guest issues
> TDG.VP.VMCALL<MMMIO> and KVM masks out the shared bit to make it friendly
> to the user space VMM.
It's similar.
The tdvmcall from the guest for mmio can also use a private GPA, which is
not reasonable, right?
According to the comment, KVM doesn't care whether the TD guest issues
the tdvmcall with a private GPA or a shared GPA.



2024-04-19 07:41:18

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 111/130] KVM: TDX: Implement callbacks for MSR operations for TDX



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implements set_msr/get_msr/has_emulated_msr methods for TDX to handle
> hypercall from guest TD for paravirtualized rdmsr and wrmsr. The TDX
> module virtualizes MSRs. For some MSRs, it injects #VE to the guest TD
> upon RDMSR or WRMSR. The exact list of such MSRs are defined in the spec.
>
> Upon #VE, the guest TD may execute hypercalls,
> TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
> which are defined in GHCI (Guest-Host Communication Interface) so that the
> host VMM (e.g. KVM) can virtualize the MSRs.
>
> There are three classes of MSRs virtualization.
> - non-configurable: TDX module directly virtualizes it. VMM can't
> configure. the value set by KVM_SET_MSR_INDEX_LIST is ignored.
> - configurable: TDX module directly virtualizes it. VMM can configure at
> the VM creation time. The value set by KVM_SET_MSR_INDEX_LIST is used.
> - #VE case
> Guest TD would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR> and
> VMM handles the MSR hypercall. The value set by KVM_SET_MSR_INDEX_LIST
> is used.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> ---
[...]
> +
> +bool tdx_has_emulated_msr(u32 index, bool write)
> +{
> + switch (index) {
> + case MSR_IA32_UCODE_REV:
> + case MSR_IA32_ARCH_CAPABILITIES:
> + case MSR_IA32_POWER_CTL:
> + case MSR_IA32_CR_PAT:
> + case MSR_IA32_TSC_DEADLINE:
> + case MSR_IA32_MISC_ENABLE:
> + case MSR_PLATFORM_INFO:
> + case MSR_MISC_FEATURES_ENABLES:
> + case MSR_IA32_MCG_CAP:
> + case MSR_IA32_MCG_STATUS:
It's not about this patch directly.

Intel SDM says:
"An attempt to write to IA32_MCG_STATUS with any value other than 0
would result in #GP".

But in set_msr_mce(), IA32_MCG_STATUS is set without any check.
Should it be checked against 0 if it is not host_initiated?



2024-04-19 08:21:41

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 116/130] KVM: TDX: Silently discard SMI request



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX doesn't support system-management mode (SMM) and system-management
> interrupt (SMI) in guest TDs. Because guest state (vcpu state, memory
> state) is protected, it must go through the TDX module APIs to change guest
> state, injecting SMI and changing vcpu mode into SMM. The TDX module
> doesn't provide a way for VMM to inject SMI into guest TD and a way for VMM
                                                            ^
                                                            or

> to switch guest vcpu mode into SMM.
>
> We have two options in KVM when handling SMM or SMI in the guest TD or the
> device model (e.g. QEMU): 1) silently ignore the request or 2) return a
> meaningful error.
>
> For simplicity, we implemented the option 1).
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
[...]
> +
> +static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu)) {
> + tdx_enable_smi_window(vcpu);
> + return;

This can be "return tdx_enable_smi_window(vcpu);" directly.
> + }
> +
> + /* RSM will cause a vmexit anyway. */
> + vmx_enable_smi_window(vcpu);
> +}
> +#endif
> +
[...]
>
> +#if defined(CONFIG_INTEL_TDX_HOST) && defined(CONFIG_KVM_SMM)
> +int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
> +int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
> +void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
> +#else

#elif defined(CONFIG_KVM_SMM)

These functions are only needed when CONFIG_KVM_SMM is defined.

> +static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
> +static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) { return 0; }
> +static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) { return 0; }
> +static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
> +#endif
> +
> #endif /* __KVM_X86_VMX_X86_OPS_H */


2024-04-19 10:04:45

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 118/130] KVM: TDX: Add methods to ignore accesses to CPU state



On 2/26/2024 4:27 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX protects TDX guest state from the VMM. Implement access methods for TDX
> guest state that ignore the access or return zero. Because those methods can be
> called by KVM ioctls to set/get cpu registers, they don't have KVM_BUG_ON,
> except for one method.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 289 +++++++++++++++++++++++++++++++++----
> arch/x86/kvm/vmx/tdx.c | 48 +++++-
> arch/x86/kvm/vmx/x86_ops.h | 13 ++
> 3 files changed, 321 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 84d2dc818cf7..9fb3f28d8259 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -375,6 +375,200 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
> kvm_vcpu_deliver_init(vcpu);
> }
>
> +static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + vmx_vcpu_after_set_cpuid(vcpu);
> +}
> +
> +static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + vmx_update_exception_bitmap(vcpu);
> +}
> +
> +static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_get_segment_base(vcpu, seg);
Could this just return 0?
No need to add a function, since it's only called here and it's more
straightforward to read the code without jumping to the definition of
these functions.

Similarly, we could open-code tdx_get_cpl(), tdx_get_rflags() and
tdx_get_segment(), which return 0 or memset their argument to 0.


> +
> + return vmx_get_segment_base(vcpu, seg);
> +}
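Binbin's open-coding suggestion would look like this in a user-space model. The types and the vmx_ helper are illustrative stubs, and the 0xfffff000 value is an arbitrary stand-in:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the real KVM vcpu structure. */
struct kvm_vcpu {
	bool is_td;
};

static bool is_td_vcpu(struct kvm_vcpu *vcpu)
{
	return vcpu->is_td;
}

static uint64_t vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
	(void)vcpu; (void)seg;
	return 0xfffff000;	/* arbitrary stand-in value */
}

static uint64_t vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
	/* TDX guest state is inaccessible to the VMM; just return 0 inline. */
	if (is_td_vcpu(vcpu))
		return 0;

	return vmx_get_segment_base(vcpu, seg);
}
```

The reader sees the TDX behavior at the call site without chasing a one-line tdx_get_segment_base() helper.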
> +
[...]
>
> +int tdx_get_cpl(struct kvm_vcpu *vcpu)
> +{
> + return 0;
> +}
> +
[...]
> +
> +unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
> +{
> + return 0;
> +}
> +
> +u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> +{
> + return 0;
> +}
> +
> +void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> +{
> + memset(var, 0, sizeof(*var));
> +}
> +
> static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
> {
> struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 7c63b2b48125..727c4d418601 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -169,6 +169,12 @@ bool tdx_has_emulated_msr(u32 index, bool write);
> int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
> int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>
> +int tdx_get_cpl(struct kvm_vcpu *vcpu);
> +void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> +unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
> +u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
> +void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
> +
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>
> void tdx_flush_tlb(struct kvm_vcpu *vcpu);
> @@ -221,6 +227,13 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
> static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
> static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>
> +static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
> +static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
> +static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
> +static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0; }
> +static inline void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
> + int seg) {}
> +
> static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>
> static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}


2024-04-19 13:37:31

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 117/130] KVM: TDX: Silently ignore INIT/SIPI



On 2/26/2024 4:26 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> The TDX module API doesn't provide an API for the VMM to inject an INIT IPI or
> SIPI. Instead it defines different protocols to boot application processors.
> Ignore INIT and SIPI events for the TDX guest.
>
> There are two options: 1) silently ignore the INIT/SIPI request, or 2) return
> an error to guest TDs somehow. Given that the TDX guest is paravirtualized to
> boot APs, option 1 is chosen for simplicity.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
[...]
> +
> +static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu)) {
> + /* TDX doesn't support INIT. Ignore INIT event */
> + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
Why change the mp_state to KVM_MP_STATE_RUNNABLE here?

And I am not sure whether we can say the INIT event is ignored since
mp_state could be modified.

> + return;
> + }
> +
> + kvm_vcpu_deliver_init(vcpu);
> +}
> +

2024-04-19 13:52:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 116/130] KVM: TDX: Silently discard SMI request

On Mon, Feb 26, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX doesn't support system-management mode (SMM) and system-management
> interrupt (SMI) in guest TDs. Because guest state (vcpu state, memory
> state) is protected, it must go through the TDX module APIs to change guest
> state, injecting SMI and changing vcpu mode into SMM. The TDX module
> doesn't provide a way for VMM to inject SMI into guest TD and a way for VMM
> to switch guest vcpu mode into SMM.
>
> We have two options in KVM when handling SMM or SMI in the guest TD or the
> device model (e.g. QEMU): 1) silently ignore the request or 2) return a
> meaningful error.
>
> For simplicity, we implemented the option 1).
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/smm.h | 7 +++++-
> arch/x86/kvm/vmx/main.c | 45 ++++++++++++++++++++++++++++++++++----
> arch/x86/kvm/vmx/tdx.c | 29 ++++++++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 12 ++++++++++
> 4 files changed, 88 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> index a1cf2ac5bd78..bc77902f5c18 100644
> --- a/arch/x86/kvm/smm.h
> +++ b/arch/x86/kvm/smm.h
> @@ -142,7 +142,12 @@ union kvm_smram {
>
> static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
> {
> - kvm_make_request(KVM_REQ_SMI, vcpu);
> + /*
> + * If SMM isn't supported (e.g. TDX), silently discard SMI request.
> + * Assume that SMM supported = MSR_IA32_SMBASE supported.
> + */
> + if (static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
> + kvm_make_request(KVM_REQ_SMI, vcpu);
> return 0;

No, just do what KVM already does for CONFIG_KVM_SMM=n, and return -ENOTTY. The
*entire* point of having a return code is to handle setups that don't support SMM.

if (!static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
return -ENOTTY;

And with that, I would drop the comment, it's pretty darn clear what "assumption"
is being made. In quotes because it's not an assumption, it's literally KVM's
implementation.

And then the changelog can say "do what KVM does for CONFIG_KVM_SMM=n" without
having to explain why we decided to do something completely arbitrary for TDX.

> }
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index ed46e7e57c18..4f3b872cd401 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -283,6 +283,43 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
> vmx_msr_filter_changed(vcpu);
> }
>
> +#ifdef CONFIG_KVM_SMM
> +static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_smi_allowed(vcpu, for_injection);

Adding stubs for something that TDX will never support is silly. Bug the VM and
return an error.

if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
return -EIO;

And I wouldn't even bother with vt_* wrappers, just put that right in vmx_*().
Same thing for everything below.
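Sean's -ENOTTY suggestion can be modeled in user space. The struct and has_emulated_smbase() below are stand-ins for the real static_call machinery, not KVM code:

```c
#include <assert.h>
#include <stdbool.h>
#include <errno.h>

/* Illustrative stand-in for the real KVM vcpu structure. */
struct kvm_vcpu {
	bool smm_supported;
	bool smi_requested;
};

/* Models kvm_x86_has_emulated_msr(vcpu->kvm, MSR_IA32_SMBASE). */
static bool has_emulated_smbase(struct kvm_vcpu *vcpu)
{
	return vcpu->smm_supported;
}

static int kvm_inject_smi(struct kvm_vcpu *vcpu)
{
	/* Mirror the CONFIG_KVM_SMM=n behavior: reject, don't silently drop. */
	if (!has_emulated_smbase(vcpu))
		return -ENOTTY;

	vcpu->smi_requested = true;	/* models kvm_make_request(KVM_REQ_SMI, vcpu) */
	return 0;
}
```

The caller sees a meaningful error for a TD vcpu instead of a silently discarded SMI.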

2024-04-19 17:33:31

by Sean Christopherson

Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, Apr 19, 2024, Kai Huang wrote:
> On 19/04/2024 2:30 am, Sean Christopherson wrote:
> > No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().
>
> Right, but it takes cpus_read_lock()/unlock() internally. I was talking
> about:
>
> if (enable_tdx) {
> kvm_x86_virtualization_enable();
>
> /*
> * Unfortunately currently tdx_enable() internally has
> * lockdep_assert_cpus_held().
> */
> cpus_read_lock();
> tdx_enable();
> cpus_read_unlock();
> }

Ah. Just have tdx_enable() do cpus_read_lock(); I suspect/assume the current
implementation was purely done in anticipation of KVM "needing" to do tdx_enable()
while holding cpu_hotplug_lock.

And tdx_enable() should also do its best to verify that the caller is post-VMXON:

if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
return -EINVAL;

> > > Btw, why couldn't we do the 'system_state' check at the very beginning of
> > > this function?
> >
> > We could, but we'd still need to check after, and adding a small bit of extra
> > complexity just to try to catch a very rare situation isn't worth it.
> >
> > To prevent races, system_state needs to be check after register_syscore_ops(),
> > because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
> > of a shutdown.
> > And because the kvm_syscore_ops hooks disable virtualization, they should be called
> > after cpuhp_setup_state(). That's not strictly required, as the per-CPU
> > hardware_enabled flag will prevent true problems if the system enter shutdown
> > state before KVM reaches cpuhp_setup_state().
> >
> > Hmm, but the same edge cases exists in the above flow. If the system enters
> > shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
> > and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
> > disable (which again is benign because of hardware_enabled).
> >
> > Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
> > and one that could be fatal. If the system does suspend+resume before the cpuhup
> > hooks are registered, kvm_resume() would enable virtualization. And then if
> > cpuhp_setup_state() failed, virtualization would be left enabled.
> >
> > So cpuhp_setup_state() *must* come before register_syscore_ops(), and
> > register_syscore_ops() *must* come before the system_state check.
>
> OK. I guess I have to double check here to completely understand the races.
> :-)
>
> So I think we have consensus to go with the approach that shows in your
> second diff -- that is to always enable virtualization during module loading
> for all other ARCHs other than x86, for which we only always enables
> virtualization during module loading for TDX.

Assuming the other arch maintainers are ok with that approach. If waiting until
a VM is created is desirable for other architectures, then we'll need to figure
out a plan b. E.g. KVM arm64 doesn't support being built as a module, so enabling
hardware during initialization would mean virtualization is enabled for any kernel
that is built with CONFIG_KVM=y.

Actually, duh. There's absolutely no reason to force other architectures to
choose when to enable virtualization. As evidenced by the massaging to have x86
keep enabling virtualization on-demand for !TDX, the cleanups don't come from
enabling virtualization during module load, they come from registering cpuhp and
syscore ops when virtualization is enabled.

I.e. we can keep kvm_usage_count in common code, and just do exactly what I
proposed for kvm_x86_enable_virtualization().

I have patches to do this, and initial testing suggests they aren't wildly
broken. I'll post them soon-ish, assuming nothing pops up in testing. They are
clean enough that they can land in advance of TDX, e.g. in kvm-coco-queue even
before other architectures verify I didn't break them.

> Then how about "do kvm_x86_virtualization_enable() within
> late_hardware_setup() in kvm_x86_vendor_init()" vs "do
> kvm_x86_virtualization_enable() in TDX-specific code after
> kvm_x86_vendor_init()"?
>
> Which do you prefer?

The latter, assuming it doesn't make the TDX code more complex than it needs to
be. The fewer kvm_x86_ops hooks, the better.

2024-04-19 17:34:46

by Isaku Yamahata

Subject: Re: [PATCH v19 110/130] KVM: TDX: Handle TDX PV MMIO hypercall

On Fri, Apr 19, 2024 at 09:42:48AM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 4/19/2024 5:22 AM, Isaku Yamahata wrote:
> > On Thu, Apr 18, 2024 at 07:04:11PM +0800,
> > Binbin Wu <[email protected]> wrote:
> >
> > >
> > > On 4/18/2024 5:29 PM, Binbin Wu wrote:
> > > > > +
> > > > > +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > +    struct kvm_memory_slot *slot;
> > > > > +    int size, write, r;
> > > > > +    unsigned long val;
> > > > > +    gpa_t gpa;
> > > > > +
> > > > > +    KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
> > > > > +
> > > > > +    size = tdvmcall_a0_read(vcpu);
> > > > > +    write = tdvmcall_a1_read(vcpu);
> > > > > +    gpa = tdvmcall_a2_read(vcpu);
> > > > > +    val = write ? tdvmcall_a3_read(vcpu) : 0;
> > > > > +
> > > > > +    if (size != 1 && size != 2 && size != 4 && size != 8)
> > > > > +        goto error;
> > > > > +    if (write != 0 && write != 1)
> > > > > +        goto error;
> > > > > +
> > > > > +    /* Strip the shared bit, allow MMIO with and without it set. */
> > > > Based on the discussion
> > > > https://lore.kernel.org/all/[email protected]/
> > > > Do we still allow the MMIO without shared bit?
> > That's independent.  That part is about how to handle the guest accessing the
> > MMIO region with a private GPA.  This part is: the guest issues
> > TDG.VP.VMCALL<MMIO> and KVM masks out the shared bit to make it friendly
> > to the user space VMM.
> It's similar.
> The tdvmcall from the guest for MMIO can also use a private GPA, which is not
> reasonable, right?
> According to the comment, KVM doesn't care whether the TD guest issues the
> tdvmcall with a private GPA or a shared GPA.

I checked the GHCI spec. It clearly states this hypercall is for shared GPAs.
We should return an error for a private GPA.

This TDG.VP.VMCALL is used to help request the VMM perform
emulated-MMIO-access operation. The VMM may emulate MMIO space in shared-GPA
space. The VMM can induce a #VE on these shared-GPA accesses by mapping shared
GPAs with the suppress-VE bit cleared in the EPT Entries corresponding to
these mappings

So we'll have something as follows. Compile only tested.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3bf0d6e3cd21..0f696f3fbd86 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1281,24 +1281,34 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
if (write != 0 && write != 1)
goto error;

- /* Strip the shared bit, allow MMIO with and without it set. */
+ /*
+ * MMIO with TDG.VP.VMCALL<MMIO> allows only shared GPA because
+ * private GPA is for device assignment.
+ */
+ if (kvm_is_private_gpa(gpa))
+ goto error;
+
+ /*
+ * Strip the shared bit because device emulator is assigned to GPA
+ * without shared bit. We'd like the existing code untouched.
+ */
gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));

- if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
+ /* Disallow MMIO crossing page boundary for simplicity. */
+ if (((gpa + size - 1) ^ gpa) & PAGE_MASK)
goto error;

slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
goto error;

- if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
- trace_kvm_fast_mmio(gpa);
- return 1;
- }
-
- if (write)
+ if (write) {
+ if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+ trace_kvm_fast_mmio(gpa);
+ return 1;
+ }
r = tdx_mmio_write(vcpu, gpa, size, val);
- else
+ } else
r = tdx_mmio_read(vcpu, gpa, size);
if (!r) {
/* Kernel completed device emulation. */
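The argument validation in the revised tdx_emulate_mmio() above can be modeled in user space. The SHARED_BIT position (bit 47) and the helper name are illustrative assumptions; in this series a private GPA is one with the shared bit clear, and the real code derives the mask from kvm_gfn_shared_mask():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_MASK  (~((uint64_t)4096 - 1))
#define SHARED_BIT (1ULL << 47)	/* illustrative; real position depends on GPAW */

/* Models the checks before dispatching the MMIO access to the device model. */
static bool mmio_args_valid(int size, int write, uint64_t gpa)
{
	if (size != 1 && size != 2 && size != 4 && size != 8)
		return false;
	if (write != 0 && write != 1)
		return false;

	/* Only shared GPAs are allowed; a private GPA is for device assignment. */
	if (!(gpa & SHARED_BIT))
		return false;

	/* Strip the shared bit so the device emulator sees a plain GPA. */
	gpa &= ~SHARED_BIT;

	/* Disallow MMIO crossing a page boundary for simplicity. */
	if (((gpa + size - 1) ^ gpa) & PAGE_MASK)
		return false;

	return true;
}
```
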


--
Isaku Yamahata <[email protected]>

2024-04-19 18:09:34

by Isaku Yamahata

Subject: Re: [PATCH v19 118/130] KVM: TDX: Add methods to ignore accesses to CPU state

On Fri, Apr 19, 2024 at 06:04:10PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:27 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX protects TDX guest state from the VMM. Implement access methods for TDX
> > guest state that ignore the access or return zero. Because those methods can be
> > called by KVM ioctls to set/get cpu registers, they don't have KVM_BUG_ON,
> > except for one method.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/main.c | 289 +++++++++++++++++++++++++++++++++----
> > arch/x86/kvm/vmx/tdx.c | 48 +++++-
> > arch/x86/kvm/vmx/x86_ops.h | 13 ++
> > 3 files changed, 321 insertions(+), 29 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 84d2dc818cf7..9fb3f28d8259 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -375,6 +375,200 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
> > kvm_vcpu_deliver_init(vcpu);
> > }
> > +static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > +{
> > + if (is_td_vcpu(vcpu))
> > + return;
> > +
> > + vmx_vcpu_after_set_cpuid(vcpu);
> > +}
> > +
> > +static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
> > +{
> > + if (is_td_vcpu(vcpu))
> > + return;
> > +
> > + vmx_update_exception_bitmap(vcpu);
> > +}
> > +
> > +static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> > +{
> > + if (is_td_vcpu(vcpu))
> > + return tdx_get_segment_base(vcpu, seg);
> Could this just return 0?
> No need to add a function, since it's only called here and it's more
> straightforward to read the code without jumping to the definition of these
> functions.
>
> Similarly, we could open-code tdx_get_cpl(), tdx_get_rflags() and
> tdx_get_segment(), which return 0 or memset their argument to 0.

Yes, we should drop them. They came from the TDX guest debug support. But right
now we've dropped that support and guard with vcpu->arch.guest_state_protected.

Guest debug is a future topic now.
--
Isaku Yamahata <[email protected]>

2024-04-19 18:10:07

by Isaku Yamahata

Subject: Re: [PATCH v19 117/130] KVM: TDX: Silently ignore INIT/SIPI

On Fri, Apr 19, 2024 at 04:31:54PM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:26 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > The TDX module API doesn't provide an API for the VMM to inject an INIT IPI or
> > SIPI. Instead it defines different protocols to boot application processors.
> > Ignore INIT and SIPI events for the TDX guest.
> >
> > There are two options: 1) silently ignore the INIT/SIPI request, or 2) return
> > an error to guest TDs somehow. Given that the TDX guest is paravirtualized to
> > boot APs, option 1 is chosen for simplicity.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> [...]
> > +
> > +static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
> > +{
> > + if (is_td_vcpu(vcpu)) {
> > + /* TDX doesn't support INIT. Ignore INIT event */
> > + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> Why change the mp_state to KVM_MP_STATE_RUNNABLE here?
>
> And I am not sure whether we can say the INIT event is ignored since
> mp_state could be modified.

We should drop the line.  It's no longer necessary, and KVM_TDX_INIT_VCPU
changes it.
--
Isaku Yamahata <[email protected]>
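With the mp_state assignment dropped as agreed above, the handler reduces to a pure no-op for TDX. A user-space sketch (stub types, not KVM code):

```c
#include <assert.h>
#include <stdbool.h>

enum mp_state { MP_STATE_UNINITIALIZED, MP_STATE_RUNNABLE };

/* Illustrative stand-in for the real KVM vcpu structure. */
struct kvm_vcpu {
	bool is_td;
	enum mp_state mp_state;
	bool init_delivered;
};

static bool is_td_vcpu(struct kvm_vcpu *vcpu)
{
	return vcpu->is_td;
}

static void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu)
{
	vcpu->init_delivered = true;
}

static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
{
	/* TDX doesn't support INIT; the event is silently ignored. */
	if (is_td_vcpu(vcpu))
		return;

	kvm_vcpu_deliver_init(vcpu);
}
```

With no mp_state write, the INIT event is now truly ignored, which also resolves Binbin's earlier question.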

2024-04-19 18:12:20

by Isaku Yamahata

Subject: Re: [PATCH v19 116/130] KVM: TDX: Silently discard SMI request

On Fri, Apr 19, 2024 at 06:52:42AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Mon, Feb 26, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX doesn't support system-management mode (SMM) and system-management
> > interrupt (SMI) in guest TDs. Because guest state (vcpu state, memory
> > state) is protected, it must go through the TDX module APIs to change guest
> > state, injecting SMI and changing vcpu mode into SMM. The TDX module
> > doesn't provide a way for VMM to inject SMI into guest TD and a way for VMM
> > to switch guest vcpu mode into SMM.
> >
> > We have two options in KVM when handling SMM or SMI in the guest TD or the
> > device model (e.g. QEMU): 1) silently ignore the request or 2) return a
> > meaningful error.
> >
> > For simplicity, we implemented the option 1).
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/smm.h | 7 +++++-
> > arch/x86/kvm/vmx/main.c | 45 ++++++++++++++++++++++++++++++++++----
> > arch/x86/kvm/vmx/tdx.c | 29 ++++++++++++++++++++++++
> > arch/x86/kvm/vmx/x86_ops.h | 12 ++++++++++
> > 4 files changed, 88 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> > index a1cf2ac5bd78..bc77902f5c18 100644
> > --- a/arch/x86/kvm/smm.h
> > +++ b/arch/x86/kvm/smm.h
> > @@ -142,7 +142,12 @@ union kvm_smram {
> >
> > static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
> > {
> > - kvm_make_request(KVM_REQ_SMI, vcpu);
> > + /*
> > + * If SMM isn't supported (e.g. TDX), silently discard SMI request.
> > + * Assume that SMM supported = MSR_IA32_SMBASE supported.
> > + */
> > + if (static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
> > + kvm_make_request(KVM_REQ_SMI, vcpu);
> > return 0;
>
> No, just do what KVM already does for CONFIG_KVM_SMM=n, and return -ENOTTY. The
> *entire* point of having a return code is to handle setups that don't support SMM.
>
> if (!static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
> return -ENOTTY;
>
> And with that, I would drop the comment, it's pretty darn clear what "assumption"
> is being made. In quotes because it's not an assumption, it's literally KVM's
> implementation.
>
> And then the changelog can say "do what KVM does for CONFIG_KVM_SMM=n" without
> having to explain why we decided to do something completely arbitrary for TDX.

Ok.

> > }
> >
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index ed46e7e57c18..4f3b872cd401 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -283,6 +283,43 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
> > vmx_msr_filter_changed(vcpu);
> > }
> >
> > +#ifdef CONFIG_KVM_SMM
> > +static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> > +{
> > + if (is_td_vcpu(vcpu))
> > + return tdx_smi_allowed(vcpu, for_injection);
>
> Adding stubs for something that TDX will never support is silly. Bug the VM and
> return an error.
>
> if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> return -EIO;
>
> And I wouldn't even bother with vt_* wrappers, just put that right in vmx_*().
> Same thing for everything below.

Will drop them. Those are traces of guest debug support. It's a future topic
and we have the arch.guest_state_protected check now.
--
Isaku Yamahata <[email protected]>

2024-04-19 18:29:14

by Isaku Yamahata

Subject: Re: [PATCH v19 010/130] KVM: x86: Pass is_private to gmem hook of gmem_max_level

On Fri, Apr 19, 2024 at 09:19:29AM +0800,
Yan Zhao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:25:12AM -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX wants to know whether the faulting address is shared or private so that
> > the max level can be limited by the Secure-EPT or not. Because fault->gfn
> > doesn't include the shared bit, the gfn doesn't tell if the faulting address
> > is shared or not. Pass is_private for the TDX case.
> >
> > TDX logic will be if (!is_private) return 0; else return PG_LEVEL_4K.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 3 ++-
> > arch/x86/kvm/mmu/mmu.c | 3 ++-
> > 2 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d15f5b4b1656..57ce89fc2740 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1797,7 +1797,8 @@ struct kvm_x86_ops {
> >
> > gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> >
> > - int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
> > + int (*gmem_max_level)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
> > + bool is_private, u8 *max_level);
> > };
> >
> > struct kvm_x86_nested_ops {
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1e5e12d2707d..22db1a9f528a 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4324,7 +4324,8 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> >
> > max_level = kvm_max_level_for_order(max_order);
> > r = static_call(kvm_x86_gmem_max_level)(vcpu->kvm, fault->pfn,
> > - fault->gfn, &max_level);
> > + fault->gfn, fault->is_private,
> > + &max_level);
> fault->is_private is always true in kvm_faultin_pfn_private().
> Besides, as shared page allocation will not go to kvm_faultin_pfn_private(),
> why do we need to add the "is_private" parameter?

You're right, we don't need this patch.
As Paolo picked the patch to add a hook, the discussion is happening at
https://lore.kernel.org/all/[email protected]/#t
--
Isaku Yamahata <[email protected]>

2024-04-19 18:56:16

by Edgecombe, Rick P

Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Tue, 2024-03-26 at 11:06 -0700, Isaku Yamahata wrote:
> > I was wondering about something limited to the operations that iterate over
> > the roots. So not
> > keeping private_root_hpa in the list of roots where it has to be carefully
> > protected from getting
> > zapped or get its gfn adjusted, and instead open coding the private case in
> > the higher level zapping
> > operations. For normal VM's the private case would be a NOP.
> >
> > Since kvm_tdp_mmu_map() already grabs private_root_hpa manually, it wouldn't
> > change in this idea. I
> > don't know how much better it would be though. I think you are right we
> > would have to create them
> > and compare.
>
> Given the large page support gets complicated, it would be worthwhile to try,
> I think.

Circling back here, let's keep things as is for the MMU breakout series. We
didn't get any maintainer comments on the proposed refactor, and we might get
some on the smaller MMU breakout series. Then we can just have the smaller less
controversial changes already incorporated for the discussion. We can mention
the idea in the coverletter.

2024-04-19 19:32:28

by Kirill A. Shutemov

Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, [email protected] wrote:
> > On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote:
> > > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > > > > So my feedback is to not worry about the exports, and instead focus on
> > > > > figuring
> > > > > out a way to make the generated code less bloated and easier to read/debug.
> > > >
> > > > I think it was a mistake trying to centralize TDCALL/SEAMCALL calls into a
> > > > few megawrappers. I think we can get better results by shifting leaf
> > > > function wrappers into assembly.
> > > >
> > > > We are going to have more assembly, but it should produce a better result.
> > > > Adding macros can help to write such wrappers and minimize boilerplate.
> > > >
> > > > Below is an example of how it can look. It's not complete. I only
> > > > converted TDCALLs, but not TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> > > > more complex.
> > > >
> > > > Any opinions? Is it something worth investing more time?
> > >
> > > We discussed offline how implementing these for each TDVM/SEAMCALL increases the
> > > chances of a bug in just one TDVM/SEAMCALL, which could make debugging
> > > problems more challenging. Kirill raised the possibility of some code-generating
> > > solution like cpufeatures.h, that could take a spec and generate correct calls.
> > >
> > > So far no big wins have presented themselves. Kirill, do we think the path to
> > > move the messy part out-of-line will not work?
> >
> > I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers.
> > Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/
> >
> > I still need to clean it up and write commit messages and comments for all
> > wrappers.
> >
> > Now I think it worth the shot.
> >
> > Any feedback?
>
> I find it hard to review for correctness, and extremely susceptible to developer
> error. E.g. lots of copy+paste, and manual encoding of RCX to expose registers.

Yes, I agree. The approach requires careful manual work and is error-prone.
I was planning to get around to staring at every wrapper to make sure it is correct.
This approach is not scalable, though.

> It also bleeds TDX ABI into C code, e.g.
>
> /*
> * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
> * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
> * So copy the register contents back to pt_regs.
> */
> regs->ax = args.r12;
> regs->bx = args.r13;
> regs->cx = args.r14;
> regs->dx = args.r15;
>
> Oh, and it requires input/output parameters, which is quite gross for C code *and*
> for assembly code, e.g.
>
> u64 tdvmcall_map_gpa(u64 *gpa, u64 size);
>
> and then the accompanying assembly code:
>
> FRAME_BEGIN
>
> save_regs r12,r13
>
> movq (%rdi), %r12
> movq %rsi, %r13
>
> movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13), %rcx
>
> tdvmcall $TDVMCALL_MAP_GPA
>
> movq %r11, (%rdi)
>
> restore_regs r13,r12
>
> FRAME_END
> RET
>
> I think having one trampoline makes sense, e.g. to minimize the probability of
> leaking register state to the VMM. The part that I don't like, and which generates
> awful code, is shoving register state into a memory structure.

I don't think we can get away with a single trampoline. We have outliers.

See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as
input. And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any
time soon. It uses all possible output registers.

But I guess we can make a *few* wrappers that covers all needed cases.

> The annoying part with the TDX ABI is that it heavily uses r8-r15, and asm()
> constraints don't play nice with r8-15. But that doesn't mean we can't use asm()
> with macros, it just means we have to play games with registers.
>
> Because REG-REG moves are super cheap, and ignoring the fatal error goofiness,
> there are at most four inputs. That means having a single trampoline take *all*
> possible inputs is a non-issue. And we can avoiding polluting the inline code if
> we bury the register shuffling in the trampoline.
>
> And if we use asm() wrappers to call the trampoline, then the trampoline doesn't
> need to precisely follow the C calling convention. I.e. the trampoline can return
> with the outputs still in r12-r15, and let the asm() wrappers extract the outputs
> they want.
>
> As it stands, TDVMCALLs either have 0, 1, or 4 outputs. I.e. we only need three
> asm() wrappers. We could get away with one wrapper, but then users of the wrappers
> would need dummy variables for inputs *and* outputs, and the outputs get gross.
>
> Completely untested, but this is what I'm thinking. Side topic, I think making
> "tdcall" a macro that takes a leaf is a mistake. If/when an assembler learns what
> tdcall is, we're going to have to rewrite all of that code. And what a coincidence,
> my suggestion needs a bare TDCALL! :-)

I guess rename the macro (if it is still needed) to something like
tdcall_leaf or something.

> Side topic #2, I don't think the trampoline needs a stack frame; it's a leaf function.

Hm. I guess.

> Side topic #3, the ud2 to induce panic should be out-of-line.

Yeah. I switched to the inline one while debugging one section mismatch
issue and forgot to switch back.

> Weird? Yeah. But at least we only need to document one weird calling convention,
> and the ugliness is contained to three macros and a small assembly function.

Okay, the approach is worth exploring. I can work on it.

You focus here on TDVMCALL. What is your take on the rest of TDCALL?

> .pushsection .noinstr.text, "ax"
> SYM_FUNC_START(tdvmcall_trampoline)
> movq $TDX_HYPERCALL_STANDARD, %r10
> movq %rax, %r11
> movq %rdi, %r12
> movq %rsi, %r13
> movq %rdx, %r14
> movq %rcx, %r15
>
> movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15), %rcx
>
> tdcall
>
> testq %rax, %rax
> jnz .Lpanic
>
> ret
>
> .Lpanic:
> ud2
> SYM_FUNC_END(tdvmcall_trampoline)
> .popsection
>
>
> #define TDVMCALL(reason, in1, in2, in3, in4) \
> ({ \
> long __ret; \
> \
> asm( \
> "call tdvmcall_trampoline\n\t" \
> "mov %%r10, %0\n\t" \
> : "=r" (__ret) \
> : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \
> : "r12", "r13", "r14", "r15" \
> ); \
> __ret; \
> })
>
> #define TDVMCALL_1(reason, in1, in2, in3, in4, out1) \
> ({ \
> long __ret; \
> \
> asm( \
> "call tdvmcall_trampoline\n\t" \
> "mov %%r10, %0\n\t" \
> "mov %%r12, %1\n\t" \

It is r11, not r12.

> : "=r"(__ret), "=r" (out1) \
> : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \
> : "r12", "r13", "r14", "r15" \
> ); \
> __ret; \
> })
>
> #define TDVMCALL_4(reason, in1, in2, in3, in4, out1, out2, out3, out4) \
> ({ \
> long __ret; \
> \
> asm( \
> "call tdvmcall_trampoline\n\t" \
> "mov %%r10, %0\n\t" \
> "mov %%r12, %1\n\t" \
> "mov %%r13, %2\n\t" \
> "mov %%r14, %3\n\t" \
> "mov %%r15, %4\n\t" \
> : "=r" (__ret), \
> "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4) \
> : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \
> : "r12", "r13", "r14", "r15" \
> ); \
> __ret; \
> })
>
> static int handle_halt(struct ve_info *ve)
> {
> if (TDVMCALL(EXIT_REASON_HALT, irqs_disabled(), 0, 0, 0))
> return -EIO;
>
> return ve_instr_len(ve);
> }
>
> void __cpuidle tdx_safe_halt(void)
> {
> WARN_ONCE(TDVMCALL(EXIT_REASON_HALT, false, 0, 0, 0),
> "HLT instruction emulation failed");
> }
>
> static int read_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> u64 val;
>
> if (TDVMCALL_1(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, val))
> return -EIO;
>
> regs->ax = lower_32_bits(val);
> regs->dx = upper_32_bits(val);
>
> return ve_instr_len(ve);
> }
>
> static int write_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> u64 val = (u64)regs->dx << 32 | regs->ax;
>
> if (TDVMCALL(EXIT_REASON_MSR_WRITE, regs->cx, val, 0, 0))
> return -EIO;
>
> return ve_instr_len(ve);
> }
> static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
> {
> /*
> * Only allow VMM to control range reserved for hypervisor
> * communication.
> *
> * Return all-zeros for any CPUID outside the range. It matches CPU
> * behaviour for non-supported leaf.
> */
> if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
> regs->ax = regs->bx = regs->cx = regs->dx = 0;
> return ve_instr_len(ve);
> }
>
> if (TDVMCALL_4(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0,
> regs->ax, regs->bx, regs->cx, regs->dx))
> return -EIO;
>
> return ve_instr_len(ve);
> }
>
> static bool mmio_read(int size, u64 gpa, u64 *val)
> {
> *val = 0;
> return !TDVMCALL_1(EXIT_REASON_EPT_VIOLATION, size, EPT_READ, gpa, 0, *val);
> }
>
> static bool mmio_write(int size, u64 gpa, u64 val)
> {
> return !TDVMCALL(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE, gpa, val);
> }

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-19 19:53:38

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Apr 19, 2024, [email protected] wrote:
> On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote:
> > On Thu, Apr 18, 2024, [email protected] wrote:
> > I think having one trampoline makes sense, e.g. to minimize the probability of
> > leaking register state to the VMM. The part that I don't like, and which generates
> > awful code, is shoving register state into a memory structure.
>
> I don't think we can get away with a single trampoline. We have outliers.

Yeah, I should have said "one trampoline for all of the not insane APIs" :-)

> See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as
> input. And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any
> time soon. It uses all possible output registers.

XMM: just say no.

> But I guess we can make a *few* wrappers that cover all needed cases.

Yeah. I suspect two will suffice. One for the calls that stay at or below four
inputs, and one for the fat ones like ReportFatalError that use everything under
the sun.

For the latter one, marshalling data into registers via an in-memory structure
makes sense, especially if we move away from tdx_module_args, e.g. this is quite clean
and reasonable:

union {
/* Define register order according to the GHCI */
struct { u64 r14, r15, rbx, rdi, rsi, r8, r9, rdx; };

char str[64];
} message;

/* VMM assumes '\0' in byte 65, if the message took all 64 bytes */
strtomem_pad(message.str, msg, '\0');

/*
* This hypercall should never return and it is not safe
* to keep the guest running. Call it forever if it
* happens to return.
*/
while (1)
tdx_fat_tdvmcall(&message);

> > Weird? Yeah. But at least we only need to document one weird calling convention,
> > and the ugliness is contained to three macros and a small assembly function.
>
> Okay, the approach is worth exploring. I can work on it.
>
> You focus here on TDVMCALL. What is your take on the rest of TDCALL?

Not sure, haven't looked at them recently. At a glance, something similar? The
use of high registers instead of RDI and RSI is damn annoying :-/

Hmm, but it looks like there are enough simple TDCALLs that stay away from high
registers that open coding inline asm() is a viable (best?) approach.

RAX being the leaf and the return value is annoying, so maybe a simple macro to
make it easier to deal with that? It won't allow for perfectly optimal code
generation, but forcing a MOV for a TDCALL isn't going to affect performance, and
it will make reading the code dead simple.

#define tdcall_leaf(leaf) "mov $" leaf ", %%eax\n\t.byte 0x66,0x0f,0x01,0xcc\n\t"

Then PAGE_ACCEPT is simply:

asm(tdcall_leaf(TDG_MEM_PAGE_ACCEPT)
: "=a"(ret)
: "c"(start | page_size));
if (ret)
return 0;

And even the meanies that use R8 are reasonably easy to handle:

asm("xor %%r8d, %%r8d\n\t"
tdcall_leaf(TDG_MR_REPORT)
: "=a"(ret)
: "c"(__pa(report)), "d"(__pa(data)));


and (though using names for the outputs, I just can't remember the syntax as I'm
typing this :-/)

asm(tdcall_leaf(TDG_VM_RD)
"mov %%r8, %0\n\t"
: "=r"(value), "=a"(ret)
: "c"(0), "d"(no_idea_what_this_is));

if (ret)
<cry>

return value;

Or, if you wanted to get fancy, use asm_goto_output() to bail on an error so that
you could optimize for using RAX as the return value, because *that's going to
make all the difference :-D


2024-04-19 20:04:58

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, 2024-04-19 at 17:46 +0300, [email protected] wrote:
>
> > Side topic #3, the ud2 to induce panic should be out-of-line.
>
> Yeah. I switched to the inline one while debugging one section mismatch
> issue and forgot to switch back.

Sorry, why do we need to panic?

2024-04-20 19:05:38

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation

On Wed, 2024-03-20 at 17:11 -0700, Rick Edgecombe wrote:
> @@ -1378,6 +1375,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> kvm_page_fault *fault)
>                  * needs to be split.
>                  */
>                 sp = tdp_mmu_alloc_sp(vcpu);
> +               if (!(raw_gfn & kvm_gfn_shared_mask(kvm)))
> +                       kvm_mmu_alloc_private_spt(vcpu, sp);

This will try to allocate the private SP for normal VMs (which have a zero
shared mask); it should be:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index efed70580922..585c80fb62c5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1350,7 +1350,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
kvm_page_fault *fault)
* needs to be split.
*/
sp = tdp_mmu_alloc_sp(vcpu);
- if (!(raw_gfn & kvm_gfn_shared_mask(kvm)))
+ if (kvm_is_private_gpa(kvm, raw_gfn << PAGE_SHIFT))
kvm_mmu_alloc_private_spt(vcpu, sp);
tdp_mmu_init_child_sp(sp, &iter);

2024-04-21 01:58:35

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

On Mon, 2024-02-26 at 00:26 -0800, [email protected] wrote:
> +/* Used by mmu notifier via kvm_unmap_gfn_range() */
>  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range
> *range,
>                                  bool flush)
>  {
>         struct kvm_mmu_page *root;
> +       bool zap_private = false;
> +
> +       if (kvm_gfn_shared_mask(kvm)) {
> +               if (!range->only_private && !range->only_shared)
> +                       /* attributes change */
> +                       zap_private = !(range->arg.attributes &
> +                                       KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +               else
> +                       zap_private = range->only_private;
> +       }
>  
>         __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id,
> false)
>                 flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
> -                                         range->may_block, flush);
> +                                         range->may_block, flush,
> +                                         zap_private && is_private_sp(root));
>  



I'm trying to apply the feedback:
- Drop MTRR support
- Changing only_private/shared to exclude_private/shared
...and update the log accordingly. These changes all intersect in this
function and I'm having a hard time trying to justify the resulting logic.

It seems the point of passing the attributes is because:
"However, looking at kvm_mem_attrs_changed() again, I think invoking
kvm_unmap_gfn_range() from generic KVM code is a mistake and shortsighted.
Zapping in response to *any* attribute change is very private/shared centric.
E.g. if/when we extend attributes to provide per-page RWX protections, zapping
existing SPTEs in response to granting *more* permissions may not be necessary
or even desirable."
https://lore.kernel.org/all/[email protected]/

But I think shoving the logic for how to handle the attribute changes deep into
the zapping code is the opposite extreme. It results in this confusing logic
with the decision on what to zap spread all around.

Instead we should have kvm_arch_pre_set_memory_attributes() adjust the range so
it can tell kvm_unmap_gfn_range() which ranges to zap (private/shared).

So:
kvm_vm_set_mem_attributes() - passes attributes
kvm_arch_pre_set_memory_attributes() - chooses which private/shared alias to 
zap based on attribute.
kvm_unmap_gfn_range/kvm_tdp_mmu_unmap_gfn_range - zaps the private/shared alias
tdp_mmu_zap_leafs() - doesn't care about the root type, just zaps leafs


This zapping function can then just do the simple thing it's told to do. It ends
up looking like:

bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
struct kvm_mmu_page *root;
bool exclude_private = false;
bool exclude_shared = false;

if (kvm_gfn_shared_mask(kvm)) {
exclude_private = range->exclude_private;
exclude_shared = range->exclude_shared;
}

__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id,
false) {
if (exclude_private && is_private_sp(root))
continue;
if (exclude_shared && !is_private_sp(root))
continue;

flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);
}

return flush;
}

The resulting logic should be the same. Separately, we might be able to simplify
it further if we change the behavior a bit (lose the kvm_gfn_shared_mask() check
or the exclude_shared member), but in the meantime this seems a lot easier to
explain and review for what I think is equivalent behavior.
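For illustration, the attribute-to-exclude mapping described above could be sketched as follows. This is standalone C with stand-in types and a stand-in flag value, not the actual KVM structures or hook names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the real KVM attribute flag; the value is illustrative. */
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)

/* Simplified stand-in for struct kvm_gfn_range with the proposed members. */
struct gfn_range {
	uint64_t attributes;
	bool exclude_private;
	bool exclude_shared;
};

/*
 * Hypothetical kvm_arch_pre_set_memory_attributes(): converting a range
 * to private means only the shared alias needs zapping, and vice versa,
 * so the hook marks the alias that should be left alone.
 */
static void pre_set_memory_attributes(struct gfn_range *range)
{
	if (range->attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
		/* shared -> private: zap the shared alias, keep private */
		range->exclude_private = true;
		range->exclude_shared = false;
	} else {
		/* private -> shared: zap the private alias, keep shared */
		range->exclude_private = false;
		range->exclude_shared = true;
	}
}
```

The zapping function then only has to honor the two flags, as in the kvm_tdp_mmu_unmap_gfn_range() variant above.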

2024-04-22 01:56:22

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 125/130] KVM: TDX: Add methods to ignore virtual apic related operation



On 2/26/2024 4:27 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX protects TDX guest APIC state from VMM. Implement access methods of
> TDX guest vAPIC state to ignore them or return zero.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/vmx/main.c | 61 ++++++++++++++++++++++++++++++++++----
> arch/x86/kvm/vmx/tdx.c | 6 ++++
> arch/x86/kvm/vmx/x86_ops.h | 3 ++
> 3 files changed, 64 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index fae5a3668361..c46c860be0f2 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -352,6 +352,14 @@ static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> return vmx_apic_init_signal_blocked(vcpu);
> }
>
> +static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> +{
> + if (is_td_vcpu(vcpu))
> + return tdx_set_virtual_apic_mode(vcpu);
Can open code this function...

> +
> + return vmx_set_virtual_apic_mode(vcpu);
> +}
> +
[...]
>
> +void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> +{
> + /* Only x2APIC mode is supported for TD. */
> + WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
> +}
> +

2024-04-22 03:35:11

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Mon, Feb 26, 2024 at 12:26:00AM -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
> +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
> +{
> + return sp->private_spt;
> +}
> +
> +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
> +{
> + sp->private_spt = private_spt;
> +}
This function is actually not used for initialization.
Instead, it's only called after failure of free_private_spt() in order to
intentionally leak the page to prevent the kernel from accessing the encrypted page.

So to avoid confusion, how about renaming it to kvm_mmu_leak_private_spt() and
always resetting the pointer to NULL?

static inline void kvm_mmu_leak_private_spt(struct kvm_mmu_page *sp)
{
sp->private_spt = NULL;
}

> +
> +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> + bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
> +
> + KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
> + if (is_root)
> + /*
> + * Because TDX module assigns root Secure-EPT page and set it to
> + * Secure-EPTP when TD vcpu is created, secure page table for
> + * root isn't needed.
> + */
> + sp->private_spt = NULL;
> + else {
> + /*
> + * Because the TDX module doesn't trust VMM and initializes
> + * the pages itself, KVM doesn't initialize them. Allocate
> + * pages with garbage and give them to the TDX module.
> + */
> + sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> + /*
> + * Because mmu_private_spt_cache is topped up before starting
> + * kvm page fault resolving, the allocation above shouldn't
> + * fail.
> + */
> + WARN_ON_ONCE(!sp->private_spt);
> + }
> +}

2024-04-22 09:09:47

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v19 062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

On Mon, Feb 26, 2024 at 12:26:04AM -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> @@ -1041,6 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> + gfn_t raw_gfn;
> + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
Why not do the kvm_gfn_shared_mask() check earlier, before determining
fault->is_private?
e.g. in kvm_mmu_page_fault().

> int ret = RET_PF_RETRY;
>
> kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1049,7 +1265,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> rcu_read_lock();
>
> - tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> + raw_gfn = gpa_to_gfn(fault->addr);
> +
> + if (is_error_noslot_pfn(fault->pfn) ||
The check for a noslot pfn is not required after this cleanup series [1] from
Sean, right?

> + !kvm_pfn_to_refcounted_page(fault->pfn)) {
What's the purpose of rejecting non-refcounted page (which is useful for trusted
IO)?

Besides, looks pages are allocated via kvm_faultin_pfn_private() if
fault->is_private, so where is the non-refcounted page from?


> + if (is_private) {
> + rcu_read_unlock();
> + return -EFAULT;
> + }
> + }

[1] https://lore.kernel.org/all/[email protected]/

> + tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
> int r;
>
> if (fault->nx_huge_page_workaround_enabled)
..

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d399009ef1d7..e27c22449d85 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -201,6 +201,7 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
>
> return NULL;
> }
> +EXPORT_SYMBOL_GPL(kvm_pfn_to_refcounted_page);
>
> /*
> * Switches to specified vcpu, until a matching vcpu_put()
> --
> 2.25.1
>
>

2024-04-22 11:46:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-04-19 at 17:46 +0300, [email protected] wrote:
> >
> > > Side topic #3, the ud2 to induce panic should be out-of-line.
> >
> > Yeah. I switched to the inline one while debugging one section mismatch
> > issue and forgot to switch back.
>
> Sorry, why do we need to panic?

It panics in cases that should never occur if the TDX module is
functioning properly. For example, TDVMCALL itself should never fail,
although the leaf function could.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-22 12:47:40

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, 2024-04-19 at 10:23 -0700, Sean Christopherson wrote:
> On Fri, Apr 19, 2024, Kai Huang wrote:
> > On 19/04/2024 2:30 am, Sean Christopherson wrote:
> > > No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().
> >
> > Right, but it takes cpus_read_lock()/unlock() internally. I was talking
> > about:
> >
> > if (enable_tdx) {
> > kvm_x86_virtualization_enable();
> >
> > /*
> > * Unfortunately currently tdx_enable() internally has
> > * lockdep_assert_cpus_held().
> > */
> > cpus_read_lock();
> > tdx_enable();
> > cpus_read_unlock();
> > }
>
> Ah. Just have tdx_enable() do cpus_read_lock(), I suspect/assume the current
> implemention was purely done in anticipation of KVM "needing" to do tdx_enable()
> while holding cpu_hotplug_lock.

Yeah. It was implemented based on the assumption that KVM would do the below
sequence in hardware_setup():

cpus_read_lock();
on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
tdx_enable();
on_each_cpu(vmxoff, NULL, 1);
cpus_read_unlock();

>
> And tdx_enable() should also do its best to verify that the caller is post-VMXON:
>
> if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> return -EINVAL;

This won't be helpful, or at least isn't sufficient.

tdx_enable() can make SEAMCALLs on all online CPUs, so checking "the caller is
post-VMXON" isn't enough. It needs "checking all online CPUs are in post-
VMXON with tdx_cpu_enable() having been done".

I didn't add such a check because it's not mandatory, i.e., the later
SEAMCALL will catch such a violation.

Btw, I noticed there's another problem: currently tdx_cpu_enable()
actually requires IRQs to be disabled. Again, it was implemented on the
assumption that it would be invoked via both on_each_cpu() and kvm_online_cpu().

It was also implemented with the consideration that it could be called by
multiple in-kernel TDX users in parallel, via both SMP call and normal
context, so it simply requires the caller to make sure it is called with
IRQs disabled so it can be IRQ safe (it uses a percpu variable to track
whether TDH.SYS.LP.INIT has been done for the local cpu, similar to the
hardware_enabled percpu variable).
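As a rough illustration of that IRQ-safe percpu once-init pattern, here is a standalone sketch. The percpu flag is modeled as a plain array, the SEAMCALL is stubbed out, and the names are stand-ins rather than the real kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Models the percpu "TDH.SYS.LP.INIT done" flag described above. */
static bool tdx_lp_initialized[NR_CPUS];

/* Stand-in for the TDH.SYS.LP.INIT SEAMCALL; always succeeds here. */
static int seamcall_lp_init(void)
{
	return 0;
}

/*
 * Hypothetical tdx_cpu_enable(): the real caller must have IRQs disabled
 * so the check-and-set of the local CPU's flag cannot race with an SMP
 * call hitting the same CPU. The function is idempotent per CPU.
 */
static int tdx_cpu_enable(int cpu)
{
	int ret;

	if (tdx_lp_initialized[cpu])
		return 0;	/* already initialized on this CPU */

	ret = seamcall_lp_init();
	if (ret)
		return ret;

	tdx_lp_initialized[cpu] = true;
	return 0;
}
```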

>
> > > > Btw, why couldn't we do the 'system_state' check at the very beginning of
> > > > this function?
> > >
> > > We could, but we'd still need to check after, and adding a small bit of extra
> > > complexity just to try to catch a very rare situation isn't worth it.
> > >
> > > To prevent races, system_state needs to be check after register_syscore_ops(),
> > > because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
> > > of a shutdown. >
> > > And because the kvm_syscore_ops hooks disable virtualization, they should be called
> > > after cpuhp_setup_state(). That's not strictly required, as the per-CPU
> > > hardware_enabled flag will prevent true problems if the system enter shutdown
> > > state before KVM reaches cpuhp_setup_state().
> > >
> > > Hmm, but the same edge cases exists in the above flow. If the system enters
> > > shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
> > > and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
> > > disable (which again is benign because of hardware_enabled).
> > >
> > > Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
> > > and one that could be fatal. If the system does suspend+resume before the cpuhup
> > > hooks are registered, kvm_resume() would enable virtualization. And then if
> > > cpuhp_setup_state() failed, virtualization would be left enabled.
> > >
> > > So cpuhp_setup_state() *must* come before register_syscore_ops(), and
> > > register_syscore_ops() *must* come before the system_state check.
> >
> > OK. I guess I have to double check here to completely understand the races.
> > :-)
> >
> > So I think we have consensus to go with the approach that shows in your
> > second diff -- that is to always enable virtualization during module loading
> > for all other ARCHs other than x86, for which we only always enables
> > virtualization during module loading for TDX.
>
> Assuming the other arch maintainers are ok with that approach. If waiting until
> a VM is created is desirable for other architectures, then we'll need to figure
> out a plan b. E.g. KVM arm64 doesn't support being built as a module, so enabling
> hardware during initialization would mean virtualization is enabled for any kernel
> that is built with CONFIG_KVM=y.
>
> Actually, duh. There's absolutely no reason to force other architectures to
> choose when to enable virtualization. As evidenced by the massaging to have x86
> keep enabling virtualization on-demand for !TDX, the cleanups don't come from
> enabling virtualization during module load, they come from registering cpuup and
> syscore ops when virtualization is enabled.
>
> I.e. we can keep kvm_usage_count in common code, and just do exactly what I
> proposed for kvm_x86_enable_virtualization().

If so, then it looks like this is basically changing "cpuhp_setup_state_nocalls()
+ on_each_cpu()" to "cpuhp_setup_state()", and moving it along with
register_syscore_ops() to hardware_enable_all()"?

>
> I have patches to do this, and initial testing suggests they aren't wildly
> broken. I'll post them soon-ish, assuming nothing pops up in testing. They are
> clean enough that they can land in advance of TDX, e.g. in kvm-coco-queue even
> before other architectures verify I didn't break them.

Good to know. I'll do more TDX test with them after you send them out.

>
> > Then how about "do kvm_x86_virtualization_enable() within
> > late_hardware_setup() in kvm_x86_vendor_init()" vs "do
> > kvm_x86_virtualization_enable() in TDX-specific code after
> > kvm_x86_vendor_init()"?
> >
> > Which do you prefer?
>
> The latter, assuming it doesn't make the TDX code more complex than it needs to
> be. The fewer kvm_x86_ops hooks, the better.
>

Agreed.

2024-04-22 16:09:11

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Mon, 2024-04-22 at 14:46 +0300, [email protected] wrote:
> On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2024-04-19 at 17:46 +0300, [email protected] wrote:
> > >
> > > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > >
> > > Yeah. I switched to the inline one while debugging one section mismatch
> > > issue and forgot to switch back.
> >
> > Sorry, why do we need to panic?
>
> It panics in cases that should never occur if the TDX module is
> functioning properly. For example, TDVMCALL itself should never fail,
> although the leaf function could.

Panic should normally be for desperate situations when horrible things will
likely happen if we continue, right? Why are we adding a panic when we didn't
have one before? Is it a second change, or a side effect of the refactor?

2024-04-22 16:58:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Mon, Apr 22, 2024, Kai Huang wrote:
> On Fri, 2024-04-19 at 10:23 -0700, Sean Christopherson wrote:
> > On Fri, Apr 19, 2024, Kai Huang wrote:
> > And tdx_enable() should also do its best to verify that the caller is post-VMXON:
> >
> > if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> > return -EINVAL;
>
> This won't be helpful, or at least isn't sufficient.
>
> tdx_enable() can make SEAMCALLs on all online CPUs, so checking "the caller is
> post-VMXON" isn't enough. It needs "checking all online CPUs are in post-
> VMXON with tdx_cpu_enable() having been done".

I'm suggesting adding it in the corresponding code that does the actual SEAMCALL.

And the intent isn't to catch every possible problem. As with many sanity checks,
the intent is to detect the most likely failure mode to make triaging and debugging
issues a bit easier.

> I didn't add such a check because it's not mandatory, i.e., the later
> SEAMCALL will catch such a violation.

Yeah, a sanity check definitely isn't mandatory, but I do think it would be
useful and worthwhile. The code in question is relatively unique (enables VMX
at module load) and a rare operation, i.e. the cost of sanity checking CR4.VMXE
is meaningless. If we do end up with a bug where a CPU fails to do VMXON, this
sanity check would give a decent chance of a precise report, whereas #UD on a
SEAMCALL will be less clearcut.

> Btw, I noticed there's another problem: currently tdx_cpu_enable()
> actually requires IRQs to be disabled. Again, it was implemented on the
> assumption that it would be invoked via both on_each_cpu() and kvm_online_cpu().
>
> It was also implemented with the consideration that it could be called by
> multiple in-kernel TDX users in parallel, via both SMP call and normal
> context, so it simply requires the caller to make sure it is called with
> IRQs disabled so it can be IRQ safe (it uses a percpu variable to track
> whether TDH.SYS.LP.INIT has been done for the local cpu, similar to the
> hardware_enabled percpu variable).

Is this is an actual problem, or is it just something that would need to be
updated in the TDX code to handle the change in direction?

> > Actually, duh. There's absolutely no reason to force other architectures to
> > choose when to enable virtualization. As evidenced by the massaging to have x86
> > keep enabling virtualization on-demand for !TDX, the cleanups don't come from
> > enabling virtualization during module load, they come from registering cpuup and
> > syscore ops when virtualization is enabled.
> >
> > I.e. we can keep kvm_usage_count in common code, and just do exactly what I
> > proposed for kvm_x86_enable_virtualization().
>
> If so, then it looks like this is basically changing "cpuhp_setup_state_nocalls()
> + on_each_cpu()" to "cpuhp_setup_state()", and moving it along with
> register_syscore_ops() to hardware_enable_all()"?

Yep.

2024-04-22 17:30:29

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

On Mon, Apr 22, 2024 at 11:34:18AM +0800,
Yan Zhao <[email protected]> wrote:

> On Mon, Feb 26, 2024 at 12:26:00AM -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> > +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
> > +{
> > + return sp->private_spt;
> > +}
> > +
> > +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
> > +{
> > + sp->private_spt = private_spt;
> > +}
> This function is actually not used for initialization.
> Instead, it's only called after failure of free_private_spt() in order to
> intentionally leak the page to prevent the kernel from accessing the encrypted page.
>
> So to avoid confusion, how about renaming it to kvm_mmu_leak_private_spt() and
> always resetting the pointer to NULL?
>
> static inline void kvm_mmu_leak_private_spt(struct kvm_mmu_page *sp)
> {
> sp->private_spt = NULL;
> }

The older version had a config to disable TDX TDP MMU at compile time. Now
that we dropped the config, we no longer need wrapper functions with
#ifdef. Since there is only a single caller, I'll eliminate this wrapper
function (and related wrapper functions) by open coding them.
--
Isaku Yamahata <[email protected]>

2024-04-22 17:34:29

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 125/130] KVM: TDX: Add methods to ignore virtual apic related operation

On Mon, Apr 22, 2024 at 09:56:05AM +0800,
Binbin Wu <[email protected]> wrote:

>
>
> On 2/26/2024 4:27 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX protects TDX guest APIC state from VMM. Implement access methods of
> > TDX guest vAPIC state to ignore them or return zero.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/vmx/main.c | 61 ++++++++++++++++++++++++++++++++++----
> > arch/x86/kvm/vmx/tdx.c | 6 ++++
> > arch/x86/kvm/vmx/x86_ops.h | 3 ++
> > 3 files changed, 64 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index fae5a3668361..c46c860be0f2 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -352,6 +352,14 @@ static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> > return vmx_apic_init_signal_blocked(vcpu);
> > }
> > +static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> > +{
> > + if (is_td_vcpu(vcpu))
> > + return tdx_set_virtual_apic_mode(vcpu);
> Can open code this function...

Yes, the function is empty currently.
--
Isaku Yamahata <[email protected]>

2024-04-22 19:52:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Mon, Apr 22, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-22 at 14:46 +0300, [email protected] wrote:
> > On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> > > On Fri, 2024-04-19 at 17:46 +0300, [email protected] wrote:
> > > >
> > > > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > > >
> > > > Yeah. I switched to the inline one while debugging one section mismatch
> > > > issue and forgot to switch back.
> > >
> > > Sorry, why do we need to panic?
> >
> > It panics in cases that should never occur if the TDX module is
> > functioning properly. For example, TDVMCALL itself should never fail,
> > although the leaf function could.
>
> Panic should normally be for desperate situations when horrible things will
> likely happen if we continue, right? Why are we adding a panic when we didn't
> have one before? Is it a second change, or a side effect of the refactor?

The kernel already does panic() if TDCALL itself fails,

static inline void tdcall(u64 fn, struct tdx_module_args *args)
{
if (__tdcall_ret(fn, args))
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}

/* Called from __tdx_hypercall() for unrecoverable failure */
noinstr void __noreturn __tdx_hypercall_failed(void)
{
instrumentation_begin();
panic("TDVMCALL failed. TDX module bug?");
}

it's just done in C code via panic(), not in asm via a bare ud2.

2024-04-22 22:47:43

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Mon, 2024-04-22 at 09:54 -0700, Sean Christopherson wrote:
> On Mon, Apr 22, 2024, Kai Huang wrote:
> > On Fri, 2024-04-19 at 10:23 -0700, Sean Christopherson wrote:
> > > On Fri, Apr 19, 2024, Kai Huang wrote:
> > > And tdx_enable() should also do its best to verify that the caller is post-VMXON:
> > >
> > > if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> > > return -EINVAL;
> >
> > This won't be helpful, or at least isn't sufficient.
> >
> > tdx_enable() can do SEAMCALLs on all online CPUs, so checking "the caller is
> > post-VMXON" isn't enough. It needs "checking all online CPUs are in post-
> > VMXON with tdx_cpu_enable() having been done".
>
> I'm suggesting adding it in the corresponding code that does the actual SEAMCALL.

The thing is, to check that you would need to do additional things, like making
sure no scheduling happens during "check + make SEAMCALL". It doesn't
seem worth doing to me.

The intent of tdx_enable() was that the caller should make sure no new CPU
comes online (well, this can be relaxed if we move
cpuhp_setup_state() to hardware_enable_all()), and that all existing online
CPUs are in post-VMXON with tdx_cpu_enable() having been done.

I think, if we ever need any check, the latter seems to be more
reasonable.

But if we allow new CPU to become online during tdx_enable() (with your
enhancement to the hardware enabling), then I don't know how to make such
check at the beginning of tdx_enable(), because do 

on_each_cpu(check_seamcall_precondition, NULL, 1);

cannot catch any new CPU during tdx_enable().

>
> And the intent isn't to catch every possible problem. As with many sanity checks,
> the intent is to detect the most likely failure mode to make triaging and debugging
> issues a bit easier.

The SEAMCALL will literally return a unique error code to indicate the CPU
isn't in post-VMXON, or that tdx_cpu_enable() hasn't been done. I think the
error code is already clear enough to pinpoint the problem (due to these
pre-SEAMCALL conditions not being met).

>
> > I didn't add such check because it's not mandatory, i.e., the later
> > SEAMCALL will catch such violation.
>
> Yeah, a sanity check definitely isn't mandatory, but I do think it would be
> useful and worthwhile. The code in question is relatively unique (enables VMX
> at module load) and a rare operation, i.e. the cost of sanity checking CR4.VMXE
> is meaningless. If we do end up with a bug where a CPU fails to do VMXON, this
> sanity check would give a decent chance of a precise report, whereas #UD on a
> SEAMCALL will be less clearcut.

If VMXON fails for any CPU then cpuhp_setup_state() will fail, preventing
KVM from being loaded.

And if it fails during kvm_online_cpu(), the new CPU will fail to online.

>
> > Btw, I noticed there's another problem, that is currently tdx_cpu_enable()
> > actually requires IRQ being disabled. Again it was implemented based on
> > it would be invoked via both on_each_cpu() and kvm_online_cpu().
> >
> > It was also implemented with the consideration that it could be called by
> > multiple in-kernel TDX users in parallel via both SMP call and in normal
> > context, so it was implemented to simply request the caller to make sure
> > it is called with IRQ disabled so it can be IRQ safe (it uses a percpu
> > variable to track whether TDH.SYS.LP.INIT has been done for local cpu
> > similar to the hardware_enabled percpu variable).
>
> Is this is an actual problem, or is it just something that would need to be
> updated in the TDX code to handle the change in direction?

For now this isn't, because KVM is the solo user, and in KVM
hardware_enable_all() and kvm_online_cpu() uses kvm_lock mutex to make
hardware_enable_nolock() IPI safe.

I am not sure how TDX/SEAMCALL will be used in TDX Connect.

However I needed to consider KVM as a user, so I decided to just require it
to be called with IRQs disabled so I could know it is IRQ safe.

Back to the current tdx_enable() and tdx_cpu_enable(), my personal
preference is, of course, to keep the existing way, that is:

During module load:

cpus_read_lock();
tdx_enable();
cpus_read_unlock();

and in kvm_online_cpu():

local_irq_save();
tdx_cpu_enable();
local_irq_restore();

But given KVM is the solo user now, I am also fine to change if you
believe this is not acceptable.

2024-04-23 00:25:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Mon, Apr 22, 2024, Kai Huang wrote:
> On Mon, 2024-04-22 at 09:54 -0700, Sean Christopherson wrote:
> > On Mon, Apr 22, 2024, Kai Huang wrote:
> > > On Fri, 2024-04-19 at 10:23 -0700, Sean Christopherson wrote:
> > > > On Fri, Apr 19, 2024, Kai Huang wrote:
> > > > And tdx_enable() should also do its best to verify that the caller is post-VMXON:
> > > >
> > > > if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> > > > return -EINVAL;
> > >
> > > This won't be helpful, or at least isn't sufficient.
> > >
> > > tdx_enable() can do SEAMCALLs on all online CPUs, so checking "the caller is
> > > post-VMXON" isn't enough. It needs "checking all online CPUs are in post-
> > > VMXON with tdx_cpu_enable() having been done".
> >
> > I'm suggesting adding it in the corresponding code that does the actual SEAMCALL.
>
> The thing is, to check that you would need to do additional things, like making
> sure no scheduling happens during "check + make SEAMCALL". It doesn't
> seem worth doing to me.
>
> The intent of tdx_enable() was that the caller should make sure no new CPU
> comes online (well, this can be relaxed if we move
> cpuhp_setup_state() to hardware_enable_all()), and that all existing online
> CPUs are in post-VMXON with tdx_cpu_enable() having been done.
>
> I think, if we ever need any check, the latter seems to be more
> reasonable.
>
> But if we allow new CPU to become online during tdx_enable() (with your
> enhancement to the hardware enabling), then I don't know how to make such
> check at the beginning of tdx_enable(), because do 
>
> on_each_cpu(check_seamcall_precondition, NULL, 1);
>
> cannot catch any new CPU during tdx_enable().

Doh, we're talking past each other, because my initial suggestion was half-baked.

When I initially said "tdx_enable()", I didn't intend to literally mean just the
CPU that calls tdx_enable(). What I was trying to say was, when doing per-CPU
things when enabling TDX, sanity check that the current CPU has CR4.VMXE=1 before
doing a SEAMCALL.

Ah, and now I see where the disconnect is. I was assuming tdx_enable() also did
TDH_SYS_LP_INIT, but that's in a separate chunk of code that's manually invoked
by KVM. More below.

> > And the intent isn't to catch every possible problem. As with many sanity checks,
> > the intent is to detect the most likely failure mode to make triaging and debugging
> > issues a bit easier.
>
> The SEAMCALL will literally return a unique error code to indicate CPU
> isn't in post-VMXON, or tdx_cpu_enable() hasn't been done. I think the
> error code is already clear to pinpoint the problem (due to these pre-
> SEAMCALL-condition not being met).

No, SEAMCALL #UDs if the CPU isn't post-VMXON. I.e. the CPU doesn't make it to
the TDX Module to provide a unique error code, all KVM will see is a #UD.

> > > Btw, I noticed there's another problem, that is currently tdx_cpu_enable()
> > > actually requires IRQ being disabled. Again it was implemented based on
> > > it would be invoked via both on_each_cpu() and kvm_online_cpu().
> > >
> > > It was also implemented with the consideration that it could be called by
> > > multiple in-kernel TDX users in parallel via both SMP call and in normal
> > > context, so it was implemented to simply request the caller to make sure
> > > it is called with IRQ disabled so it can be IRQ safe (it uses a percpu
> > > variable to track whether TDH.SYS.LP.INIT has been done for local cpu
> > > similar to the hardware_enabled percpu variable).
> >
> > Is this is an actual problem, or is it just something that would need to be
> > updated in the TDX code to handle the change in direction?
>
> For now this isn't, because KVM is the solo user, and in KVM
> hardware_enable_all() and kvm_online_cpu() uses kvm_lock mutex to make
> hardware_enable_nolock() IPI safe.
>
> I am not sure how TDX/SEAMCALL will be used in TDX Connect.
>
> However I needed to consider KVM as a user, so I decided to just make it
> must be called with IRQ disabled so I could know it is IRQ safe.
>
> Back to the current tdx_enable() and tdx_cpu_enable(), my personal
> preference is, of course, to keep the existing way, that is:
>
> During module load:
>
> cpus_read_lock();
> tdx_enable();
> cpus_read_unlock();
>
> and in kvm_online_cpu():
>
> local_irq_save();
> tdx_cpu_enable();
> local_irq_restore();
>
> But given KVM is the solo user now, I am also fine to change if you
> believe this is not acceptable.

Looking more closely at the code, tdx_enable() needs to be called under
cpu_hotplug_lock to prevent *unplug*, i.e. to prevent the last CPU on a package
from being offlined. I.e. that part's not optional.

And the root of the problem/confusion is that the APIs provided by the core kernel
are weird, which is really just a polite way of saying they are awful :-)

There is no reason to rely on the caller to take cpu_hotplug_lock, and definitely
no reason to rely on the caller to invoke tdx_cpu_enable() separately from invoking
tdx_enable(). I suspect they got that way because of KVM's unnecessarily complex
code, e.g. if KVM is already doing on_each_cpu() to do VMXON, then it's easy enough
to also do TDH_SYS_LP_INIT, so why do two IPIs?

But just because KVM "owns" VMXON doesn't mean the core kernel code should punt
TDX to KVM too. If KVM relies on the cpuhp code to ensure all online CPUs are
post-VMXON, then the TDX shapes up nicely and provides a single API to enable
TDX. And then my CR4.VMXE=1 sanity check makes a _lot_ more sense.

Relative to some random version of the TDX patches, this is what I'm thinking:

---
arch/x86/kvm/vmx/tdx.c | 46 +++----------------
arch/x86/virt/vmx/tdx/tdx.c | 89 ++++++++++++++++++-------------------
2 files changed, 49 insertions(+), 86 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a1d3ae09091c..137d08da43c3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3322,38 +3322,8 @@ bool tdx_is_vm_type_supported(unsigned long type)
return type == KVM_X86_TDX_VM;
}

-struct tdx_enabled {
- cpumask_var_t enabled;
- atomic_t err;
-};
-
-static void __init tdx_on(void *_enable)
-{
- struct tdx_enabled *enable = _enable;
- int r;
-
- r = vmx_hardware_enable();
- if (!r) {
- cpumask_set_cpu(smp_processor_id(), enable->enabled);
- r = tdx_cpu_enable();
- }
- if (r)
- atomic_set(&enable->err, r);
-}
-
-static void __init vmx_off(void *_enabled)
-{
- cpumask_var_t *enabled = (cpumask_var_t *)_enabled;
-
- if (cpumask_test_cpu(smp_processor_id(), *enabled))
- vmx_hardware_disable();
-}
-
int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
{
- struct tdx_enabled enable = {
- .err = ATOMIC_INIT(0),
- };
int max_pkgs;
int r = 0;
int i;
@@ -3409,17 +3379,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
goto out;
}

- /* tdx_enable() in tdx_module_setup() requires cpus lock. */
- cpus_read_lock();
- on_each_cpu(tdx_on, &enable, true); /* TDX requires vmxon. */
- r = atomic_read(&enable.err);
- if (!r)
- r = tdx_module_setup();
- else
- r = -EIO;
- on_each_cpu(vmx_off, &enable.enabled, true);
- cpus_read_unlock();
- free_cpumask_var(enable.enabled);
+ r = tdx_enable();
+ if (r)
+ goto out;
+
+ r = tdx_module_setup();
if (r)
goto out;

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d2b8f079a637..19897f736c47 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -139,49 +139,6 @@ static int try_init_module_global(void)
return sysinit_ret;
}

-/**
- * tdx_cpu_enable - Enable TDX on local cpu
- *
- * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
- * global initialization SEAMCALL if not done) on local cpu to make this
- * cpu be ready to run any other SEAMCALLs.
- *
- * Always call this function via IPI function calls.
- *
- * Return 0 on success, otherwise errors.
- */
-int tdx_cpu_enable(void)
-{
- struct tdx_module_args args = {};
- int ret;
-
- if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
- return -ENODEV;
-
- lockdep_assert_irqs_disabled();
-
- if (__this_cpu_read(tdx_lp_initialized))
- return 0;
-
- /*
- * The TDX module global initialization is the very first step
- * to enable TDX. Need to do it first (if hasn't been done)
- * before the per-cpu initialization.
- */
- ret = try_init_module_global();
- if (ret)
- return ret;
-
- ret = seamcall_prerr(TDH_SYS_LP_INIT, &args);
- if (ret)
- return ret;
-
- __this_cpu_write(tdx_lp_initialized, true);
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(tdx_cpu_enable);
-
/*
* Add a memory region as a TDX memory block. The caller must make sure
* all memory regions are added in address ascending order and don't
@@ -1201,6 +1158,43 @@ static int init_tdx_module(void)
goto out_put_tdxmem;
}

+/**
+ * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
+ * global initialization SEAMCALL if not done) on local cpu to make this
+ * cpu be ready to run any other SEAMCALLs.
+ */
+static void tdx_cpu_enable(void *__err)
+{
+ struct tdx_module_args args = {};
+ atomic_t *err = __err;
+ int ret;
+
+ if (__this_cpu_read(tdx_lp_initialized))
+ return;
+
+ if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
+ goto failed;
+
+ /*
+ * The TDX module global initialization is the very first step
+ * to enable TDX. Need to do it first (if hasn't been done)
+ * before the per-cpu initialization.
+ */
+ ret = try_init_module_global();
+ if (ret)
+ goto failed;
+
+ ret = seamcall_prerr(TDH_SYS_LP_INIT, &args);
+ if (ret)
+ goto failed;
+
+ __this_cpu_write(tdx_lp_initialized, true);
+ return;
+
+failed:
+ atomic_inc(err);
+}
+
static int __tdx_enable(void)
{
int ret;
@@ -1234,15 +1228,19 @@ static int __tdx_enable(void)
*/
int tdx_enable(void)
{
+ atomic_t err = ATOMIC_INIT(0);
int ret;

if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
return -ENODEV;

- lockdep_assert_cpus_held();
-
+ cpus_read_lock();
mutex_lock(&tdx_module_lock);

+ on_each_cpu(tdx_cpu_enable, &err, true);
+ if (atomic_read(&err))
+ tdx_module_status = TDX_MODULE_ERROR;
+
switch (tdx_module_status) {
case TDX_MODULE_UNINITIALIZED:
ret = __tdx_enable();
@@ -1258,6 +1256,7 @@ int tdx_enable(void)
}

mutex_unlock(&tdx_module_lock);
+ cpus_read_unlock();

return ret;
}

base-commit: fde917bc1af3e1a440ab0cb0d9364f8da25b9e17
--


2024-04-23 00:28:26

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Mon, 2024-04-22 at 12:50 -0700, Sean Christopherson wrote:
> The kernel already does panic() if TDCALL itself fails,
>
>   static inline void tdcall(u64 fn, struct tdx_module_args *args)
>   {
>         if (__tdcall_ret(fn, args))
>                 panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>   }
>
>   /* Called from __tdx_hypercall() for unrecoverable failure */
>   noinstr void __noreturn __tdx_hypercall_failed(void)
>   {
>         instrumentation_begin();
>         panic("TDVMCALL failed. TDX module bug?");
>   }
>
> it's just done in C code via panic(), not in asm via a bare ud2.

Hmm, I didn't realize. It looks like today some calls do and some don't. I don't
mean to reopen old debates. Just surprised that these are able to bring down the
system. Which funnily enough connects back to the original issue of the patch:
whether they are safe to export for module use.

2024-04-23 01:34:34

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module


>
> > > And the intent isn't to catch every possible problem. As with many sanity checks,
> > > the intent is to detect the most likely failure mode to make triaging and debugging
> > > issues a bit easier.
> >
> > The SEAMCALL will literally return a unique error code to indicate CPU
> > isn't in post-VMXON, or tdx_cpu_enable() hasn't been done. I think the
> > error code is already clear to pinpoint the problem (due to these pre-
> > SEAMCALL-condition not being met).
>
> No, SEAMCALL #UDs if the CPU isn't post-VMXON. I.e. the CPU doesn't make it to
> the TDX Module to provide a unique error code, all KVM will see is a #UD.

#UD is handled by the SEAMCALL assembly code. Please see TDX_MODULE_CALL
assembly macro:

.Lseamcall_trap\@:
/*
* SEAMCALL caused #GP or #UD. By reaching here RAX contains
* the trap number. Convert the trap number to the TDX error
* code by setting TDX_SW_ERROR to the high 32-bits of RAX.
*
* Note cannot OR TDX_SW_ERROR directly to RAX as OR instruction
* only accepts 32-bit immediate at most.
*/
movq $TDX_SW_ERROR, %rdi
orq %rdi, %rax

...
 
_ASM_EXTABLE_FAULT(.Lseamcall\@, .Lseamcall_trap\@)
.endif /* \host */

>
> > > > Btw, I noticed there's another problem, that is currently tdx_cpu_enable()
> > > > actually requires IRQ being disabled. Again it was implemented based on
> > > > it would be invoked via both on_each_cpu() and kvm_online_cpu().
> > > >
> > > > It was also implemented with the consideration that it could be called by
> > > > multiple in-kernel TDX users in parallel via both SMP call and in normal
> > > > context, so it was implemented to simply request the caller to make sure
> > > > it is called with IRQ disabled so it can be IRQ safe (it uses a percpu
> > > > variable to track whether TDH.SYS.LP.INIT has been done for local cpu
> > > > similar to the hardware_enabled percpu variable).
> > >
> > > Is this is an actual problem, or is it just something that would need to be
> > > updated in the TDX code to handle the change in direction?
> >
> > For now this isn't, because KVM is the solo user, and in KVM
> > hardware_enable_all() and kvm_online_cpu() uses kvm_lock mutex to make
> > hardware_enable_nolock() IPI safe.
> >
> > I am not sure how TDX/SEAMCALL will be used in TDX Connect.
> >
> > However I needed to consider KVM as a user, so I decided to just make it
> > must be called with IRQ disabled so I could know it is IRQ safe.
> >
> > Back to the current tdx_enable() and tdx_cpu_enable(), my personal
> > preference is, of course, to keep the existing way, that is:
> >
> > During module load:
> >
> > cpus_read_lock();
> > tdx_enable();
> > cpus_read_unlock();
> >
> > and in kvm_online_cpu():
> >
> > local_irq_save();
> > tdx_cpu_enable();
> > local_irq_restore();
> >
> > But given KVM is the solo user now, I am also fine to change if you
> > believe this is not acceptable.
>
> Looking more closely at the code, tdx_enable() needs to be called under
> cpu_hotplug_lock to prevent *unplug*, i.e. to prevent the last CPU on a package
> from being offlined. I.e. that part's not optional.

Yeah. We can say that. I almost forgot this :-)

>
> And the root of the problem/confusion is that the APIs provided by the core kernel
> are weird, which is really just a polite way of saying they are awful :-)

Well, apologize for it :-)

>
> There is no reason to rely on the caller to take cpu_hotplug_lock, and definitely
> no reason to rely on the caller to invoke tdx_cpu_enable() separately from invoking
> tdx_enable(). I suspect they got that way because of KVM's unnecessarily complex
> code, e.g. if KVM is already doing on_each_cpu() to do VMXON, then it's easy enough
> to also do TDH_SYS_LP_INIT, so why do two IPIs?

The main reason is that we relaxed TDH.SYS.LP.INIT to be callable _after_ TDX
module initialization.

Previously, the TDH.SYS.LP.INIT must be done on *ALL* CPUs that the
platform has (i.e., cpu_present_mask) right after TDH.SYS.INIT and before
any other SEAMCALLs. This didn't quite work with (kernel software) CPU
hotplug, and it had problems dealing with things like the SMT disable
mitigation:

https://lore.kernel.org/lkml/[email protected]/T/#mf42fa2d68d6b98edcc2aae11dba3c2487caf3b8f

So the x86 maintainers requested to change this. The original proposal
was to eliminate the entire TDH.SYS.INIT and TDH.SYS.LP.INIT:

https://lore.kernel.org/lkml/[email protected]/T/#m78c0c48078f231e92ea1b87a69bac38564d46469

But somehow that wasn't feasible, and the result was that we relaxed it to allow
TDH.SYS.LP.INIT to be called after module initialization.

So we need a separate tdx_cpu_enable() for that.

>
> But just because KVM "owns" VMXON doesn't mean the core kernel code should punt
> TDX to KVM too. If KVM relies on the cpuhp code to ensure all online CPUs are
> post-VMXON, then the TDX shapes up nicely and provides a single API to enable
> TDX.  
>

We could ask tdx_enable() to invoke tdx_cpu_enable() internally, but as
mentioned above we still need to have tdx_cpu_enable() as a separate API to
allow CPU hotplug to call it.

> And then my CR4.VMXE=1 sanity check makes a _lot_ more sense.

As replied above the current SEAMCALL assembly returns a unique error code
for that:

#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)


2024-04-23 01:46:03

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, 2024-04-23 at 13:34 +1200, Kai Huang wrote:
> >
> > > > And the intent isn't to catch every possible problem. As with many sanity checks,
> > > > the intent is to detect the most likely failure mode to make triaging and debugging
> > > > issues a bit easier.
> > >
> > > The SEAMCALL will literally return a unique error code to indicate CPU
> > > isn't in post-VMXON, or tdx_cpu_enable() hasn't been done. I think the
> > > error code is already clear to pinpoint the problem (due to these pre-
> > > SEAMCALL-condition not being met).
> >
> > No, SEAMCALL #UDs if the CPU isn't post-VMXON. I.e. the CPU doesn't make it to
> > the TDX Module to provide a unique error code, all KVM will see is a #UD.
>
> #UD is handled by the SEAMCALL assembly code. Please see TDX_MODULE_CALL
> assembly macro:
>
> .Lseamcall_trap\@:
>         /*
>          * SEAMCALL caused #GP or #UD. By reaching here RAX contains
>          * the trap number. Convert the trap number to the TDX error
>          * code by setting TDX_SW_ERROR to the high 32-bits of RAX.
>          *
>          * Note cannot OR TDX_SW_ERROR directly to RAX as OR instruction
>          * only accepts 32-bit immediate at most.
>          */
>         movq $TDX_SW_ERROR, %rdi
>         orq %rdi, %rax
>
> ...
>        
> _ASM_EXTABLE_FAULT(.Lseamcall\@, .Lseamcall_trap\@)
> .endif /* \host */
>
> >
> > > > > Btw, I noticed there's another problem, that is currently tdx_cpu_enable()
> > > > > actually requires IRQ being disabled. Again it was implemented based on
> > > > > it would be invoked via both on_each_cpu() and kvm_online_cpu().
> > > > >
> > > > > It was also implemented with the consideration that it could be called by
> > > > > multiple in-kernel TDX users in parallel via both SMP call and in normal
> > > > > context, so it was implemented to simply request the caller to make sure
> > > > > it is called with IRQ disabled so it can be IRQ safe (it uses a percpu
> > > > > variable to track whether TDH.SYS.LP.INIT has been done for local cpu
> > > > > similar to the hardware_enabled percpu variable).
> > > >
> > > > Is this is an actual problem, or is it just something that would need to be
> > > > updated in the TDX code to handle the change in direction?
> > >
> > > For now this isn't, because KVM is the solo user, and in KVM
> > > hardware_enable_all() and kvm_online_cpu() uses kvm_lock mutex to make
> > > hardware_enable_nolock() IPI safe.
> > >
> > > I am not sure how TDX/SEAMCALL will be used in TDX Connect.
> > >
> > > However I needed to consider KVM as a user, so I decided to just make it
> > > must be called with IRQ disabled so I could know it is IRQ safe.
> > >
> > > Back to the current tdx_enable() and tdx_cpu_enable(), my personal
> > > preference is, of course, to keep the existing way, that is:
> > >
> > > During module load:
> > >
> > > cpus_read_lock();
> > > tdx_enable();
> > > cpus_read_unlock();
> > >
> > > and in kvm_online_cpu():
> > >
> > > local_irq_save();
> > > tdx_cpu_enable();
> > > local_irq_restore();
> > >
> > > But given KVM is the solo user now, I am also fine to change if you
> > > believe this is not acceptable.
> >
> > Looking more closely at the code, tdx_enable() needs to be called under
> > cpu_hotplug_lock to prevent *unplug*, i.e. to prevent the last CPU on a package
> > from being offlined. I.e. that part's not optional.
>
> Yeah. We can say that. I almost forgot this :-)
>
> >
> > And the root of the problem/confusion is that the APIs provided by the core kernel
> > are weird, which is really just a polite way of saying they are awful :-)
>
> Well, apologize for it :-)
>
> >
> > There is no reason to rely on the caller to take cpu_hotplug_lock, and definitely
> > no reason to rely on the caller to invoke tdx_cpu_enable() separately from invoking
> > tdx_enable(). I suspect they got that way because of KVM's unnecessarily complex
> > code, e.g. if KVM is already doing on_each_cpu() to do VMXON, then it's easy enough
> > to also do TDH_SYS_LP_INIT, so why do two IPIs?
>
> The main reason is that we relaxed TDH.SYS.LP.INIT to be callable _after_ TDX
> module initialization.
>
> Previously, the TDH.SYS.LP.INIT must be done on *ALL* CPUs that the
> platform has (i.e., cpu_present_mask) right after TDH.SYS.INIT and before
> any other SEAMCALLs. This didn't quite work with (kernel software) CPU
> hotplug, and it had problems dealing with things like the SMT disable
> mitigation:
>
> https://lore.kernel.org/lkml/[email protected]/T/#mf42fa2d68d6b98edcc2aae11dba3c2487caf3b8f
>
> So the x86 maintainers requested to change this. The original proposal
> was to eliminate the entire TDH.SYS.INIT and TDH.SYS.LP.INIT:
>
> https://lore.kernel.org/lkml/[email protected]/T/#m78c0c48078f231e92ea1b87a69bac38564d46469
>
> But somehow that wasn't feasible, and the result was that we relaxed it to allow
> TDH.SYS.LP.INIT to be called after module initialization.
>
> So we need a separate tdx_cpu_enable() for that.

Btw, the ideal (or probably the final) plan is to handle tdx_cpu_enable()
in TDX's own CPU hotplug callback in the core-kernel and hide it from all
other in-kernel TDX users.  

Specifically:

1) That callback, e.g., tdx_online_cpu(), will be placed _before_ any in-
kernel TDX user's callback, like KVM's.
2) In tdx_online_cpu(), we do VMXON + tdx_cpu_enable() + VMXOFF, and
return an error in case of any failure to prevent that CPU from going online.

That makes sure that, if TDX is supported by the platform, we basically
guarantee all online CPUs are ready to issue SEAMCALLs (of course, the in-
kernel TDX user still needs to do VMXON for that, but that's the TDX user's
responsibility).

But that obviously needs to move VMXON to the core-kernel.

Currently, exporting tdx_cpu_enable() as a separate API and requiring KVM to
call it explicitly is a temporary solution.

That being said, we could do tdx_cpu_enable() inside tdx_enable(), but I
don't see that it's a better idea.

2024-04-23 13:20:29

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 085/130] KVM: TDX: Complete interrupts after tdexit



On 4/17/2024 2:23 AM, Reinette Chatre wrote:
> Hi Isaku,
>
> (In shortlog "tdexit" can be "TD exit" to be consistent with
> documentation.)
>
> On 2/26/2024 12:26 AM, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
>>
>> This corresponds to VMX __vmx_complete_interrupts(). Because TDX
>> virtualize vAPIC, KVM only needs to care NMI injection.
> This seems to be the first appearance of NMI and the changelog
> is very brief. How about expending it with:
>
> "This corresponds to VMX __vmx_complete_interrupts(). Because TDX
> virtualize vAPIC, KVM only needs to care about NMI injection.
  ^
  virtualizes

Also, does it need to mention that non-NMI interrupts are handled by the
posted-interrupt mechanism?

For example:

"This corresponds to VMX __vmx_complete_interrupts().  Because TDX
 virtualizes vAPIC, and non-NMI interrupts are delivered using
posted-interrupt
 mechanism, KVM only needs to care about NMI injection.
..
"

>
> KVM can request TDX to inject an NMI into a guest TD vCPU when the
> vCPU is not active. TDX will attempt to inject an NMI as soon as
> possible on TD entry. NMI injection is managed by writing to (to
> inject NMI) and reading from (to get status of NMI injection)
> the PEND_NMI field within the TDX vCPU scope metadata (Trust
> Domain Virtual Processor State (TDVPS)).
>
> Update KVM's NMI status on TD exit by checking whether a requested
> NMI has been injected into the TD. Reading the metadata via SEAMCALL
> is expensive so only perform the check if an NMI was injected.
>
> This is the first need to access vCPU scope metadata in the
> "management" class. Ensure that needed accessor is available.
> "
>

2024-04-23 14:00:40

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure

On Fri, 2024-03-22 at 15:45 -0700, Isaku Yamahata wrote:
> Hmm, now I noticed the vm_size can be moved here.  We have
>
>   vcpu_size = sizeof(struct vcpu_vmx);
>   vcpu_align = __alignof__(struct vcpu_vmx);
> if (enable_tdx) {
> vcpu_size = max_t(unsigned int, vcpu_size,
>   sizeof(struct vcpu_tdx));
> vcpu_align = max_t(unsigned int, vcpu_align,
>    __alignof__(struct vcpu_tdx));
>                 vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
>                                           sizeof(struct kvm_tdx));
> }

Hmm.. After reading again, I don't think we can?

In your comments that was replied to Binbin:

/*
* vt_hardware_setup() updates vt_x86_ops. Because kvm_ops_update()
* copies vt_x86_ops to kvm_x86_op, vt_x86_ops must be updated before
* kvm_ops_update() called by kvm_x86_vendor_init().
*/

You said the update to vt_x86_ops.vm_size must be done in vt_hardware_setup().

But I think we should move the above comment to the vt_hardware_setup()
where vm_size is truly updated.

/*
* TDX and VMX have different VM structure. If TDX is enabled,
* update vt_x86_ops.vm_size to the maximum value of the two
* before it is copied to kvm_x86_ops in kvm_update_ops() to make
* sure KVM always allocates enough memory for the VM structure.
*/

Here the purpose of calculating vcpu_size/vcpu_align is just to pass them to
kvm_init(). If needed, we can add a comment to describe "what this code
does":

/*
* TDX and VMX have different vCPU structure. Calculate the
* maximum size/align so that kvm_init() can use the larger 
* values to create the vCPU kmem_cache.
*/

2024-04-23 14:55:04

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v19 085/130] KVM: TDX: Complete interrupts after tdexit



On 4/23/2024 6:15 AM, Binbin Wu wrote:
>
>
> On 4/17/2024 2:23 AM, Reinette Chatre wrote:
>> Hi Isaku,
>>
>> (In shortlog "tdexit" can be "TD exit" to be consistent with
>> documentation.)
>>
>> On 2/26/2024 12:26 AM, [email protected] wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>>> virtualize vAPIC, KVM only needs to care NMI injection.
>> This seems to be the first appearance of NMI and the changelog
>> is very brief. How about expanding it with:
>>
>> "This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>>   virtualize vAPIC, KVM only needs to care about NMI injection.
>   ^
>   virtualizes
>
> Also, does it need to mention that non-NMI interrupts are handled by posted-interrupt mechanism?
>
> For example:
>
> "This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>  virtualizes vAPIC, and non-NMI interrupts are delivered using posted-interrupt
>  mechanism, KVM only needs to care about NMI injection.
> ...
> "
>

Thank you Binbin. Looks good to me.

Reinette

2024-04-23 15:15:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, Apr 23, 2024, Kai Huang wrote:
> On Tue, 2024-04-23 at 13:34 +1200, Kai Huang wrote:
> > >
> > > > > And the intent isn't to catch every possible problem. As with many sanity checks,
> > > > > the intent is to detect the most likely failure mode to make triaging and debugging
> > > > > issues a bit easier.
> > > >
> > > > The SEAMCALL will literally return a unique error code to indicate CPU
> > > > isn't in post-VMXON, or tdx_cpu_enable() hasn't been done. I think the
> > > > error code is already clear to pinpoint the problem (due to these pre-
> > > > SEAMCALL-condition not being met).
> > >
> > > No, SEAMCALL #UDs if the CPU isn't post-VMXON. I.e. the CPU doesn't make it to
> > > the TDX Module to provide a unique error code, all KVM will see is a #UD.
> >
> > #UD is handled by the SEAMCALL assembly code. Please see TDX_MODULE_CALL
> > assembly macro:

Right, but that doesn't say why the #UD occurred. The macro dresses it up in
TDX_SW_ERROR so that KVM only needs a single parser, but at the end of the day
KVM is still only going to see that SEAMCALL hit a #UD.

> > > There is no reason to rely on the caller to take cpu_hotplug_lock, and definitely
> > > no reason to rely on the caller to invoke tdx_cpu_enable() separately from invoking
> > > tdx_enable(). I suspect they got that way because of KVM's unnecessarily complex
> > > code, e.g. if KVM is already doing on_each_cpu() to do VMXON, then it's easy enough
> > > to also do TDH_SYS_LP_INIT, so why do two IPIs?
> >
> > The main reason is we relaxed the TDH.SYS.LP.INIT to be called _after_ TDX
> > module initialization.  
> >
> > Previously, the TDH.SYS.LP.INIT must be done on *ALL* CPUs that the
> > platform has (i.e., cpu_present_mask) right after TDH.SYS.INIT and before
> > any other SEAMCALLs. This didn't quite work with (kernel software) CPU
> > hotplug, and it had problems dealing with things like SMT disable
> > mitigation:
> >
> > https://lore.kernel.org/lkml/[email protected]/T/#mf42fa2d68d6b98edcc2aae11dba3c2487caf3b8f
> >
> > So the x86 maintainers requested to change this. The original proposal
> > was to eliminate the entire TDH.SYS.INIT and TDH.SYS.LP.INIT:
> >
> > https://lore.kernel.org/lkml/[email protected]/T/#m78c0c48078f231e92ea1b87a69bac38564d46469
> >
> > But somehow it wasn't feasible, and the result was we relaxed to allow
> > TDH.SYS.LP.INIT to be called after module initialization.
> >
> > So we need a separate tdx_cpu_enable() for that.

No, you don't, at least not given the TDX patches I'm looking at. Allowing
TDH.SYS.LP.INIT after module initialization makes sense because otherwise the
kernel would need to online all possible CPUs before initializing TDX. But that
doesn't mean that the kernel needs to, or should, punt TDH.SYS.LP.INIT to KVM.

AFAICT, KVM is NOT doing TDH.SYS.LP.INIT when a CPU is onlined, only when KVM
is loaded, which means that tdx_enable() can process all online CPUs just as
easily as KVM.

Presumably that approach relies on something blocking onlining CPUs when TDX is
active. And if that's not the case, the proposed patches are buggy.

> Btw, the ideal (or probably the final) plan is to handle tdx_cpu_enable()
> in TDX's own CPU hotplug callback in the core-kernel and hide it from all
> other in-kernel TDX users.  
>
> Specifically:
>
> 1) that callback, e.g., tdx_online_cpu() will be placed _before_ any in-
> kernel TDX users like KVM's callback.
> 2) In tdx_online_cpu(), we do VMXON + tdx_cpu_enable() + VMXOFF, and
> return error in case of any error to prevent that cpu from going online.
>
> That makes sure that, if TDX is supported by the platform, we basically
> guarantee all online CPUs are ready to issue SEAMCALL (of course, the in-
> kernel TDX user still needs to do VMXON for it, but that's TDX user's
> responsibility).
>
> But that obviously needs to move VMXON to the core-kernel.

It doesn't strictly have to be core kernel per se, just in code that sits below
KVM, e.g. in a separate module called VAC[*] ;-)

[*] https://lore.kernel.org/all/[email protected]

> Currently, export tdx_cpu_enable() as a separate API and require KVM to
> call it explicitly is a temporary solution.
>
> That being said, we could do tdx_cpu_enable() inside tdx_enable(), but I
> don't see it's a better idea.

It simplifies the API surface for enabling TDX and eliminates an export.

2024-04-23 17:17:58

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor



On 4/13/2024 12:15 AM, Reinette Chatre wrote:
> Hi Isaku,
>
> On 2/26/2024 12:26 AM, [email protected] wrote:
>> From: Isaku Yamahata <[email protected]>
> ...
>
>> @@ -218,6 +257,87 @@ static void tdx_reclaim_control_page(unsigned long td_page_pa)
>> free_page((unsigned long)__va(td_page_pa));
>> }
>>
>> +struct tdx_flush_vp_arg {
>> + struct kvm_vcpu *vcpu;
>> + u64 err;
>> +};
>> +
>> +static void tdx_flush_vp(void *arg_)
>> +{
>> + struct tdx_flush_vp_arg *arg = arg_;
>> + struct kvm_vcpu *vcpu = arg->vcpu;
>> + u64 err;
>> +
>> + arg->err = 0;
>> + lockdep_assert_irqs_disabled();
>> +
>> + /* Task migration can race with CPU offlining. */
>> + if (unlikely(vcpu->cpu != raw_smp_processor_id()))
>> + return;
>> +
>> + /*
>> + * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized. The
>> + * list tracking still needs to be updated so that it's correct if/when
>> + * the vCPU does get initialized.
>> + */
>> + if (is_td_vcpu_created(to_tdx(vcpu))) {
>> + /*
>> + * No need to retry. TDX Resources needed for TDH.VP.FLUSH are,
>> + * TDVPR as exclusive, TDR as shared, and TDCS as shared. This
>> + * vp flush function is called when destructing vcpu/TD or vcpu
>> + * migration. No other thread uses TDVPR in those cases.
>> + */
Is it possible that another thread uses TDR or TDCS as exclusive?



2024-04-23 23:00:03

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, 2024-04-23 at 08:15 -0700, Sean Christopherson wrote:
> On Tue, Apr 23, 2024, Kai Huang wrote:
> > On Tue, 2024-04-23 at 13:34 +1200, Kai Huang wrote:
> > > >
> > > > > > And the intent isn't to catch every possible problem. As with many sanity checks,
> > > > > > the intent is to detect the most likely failure mode to make triaging and debugging
> > > > > > issues a bit easier.
> > > > >
> > > > > The SEAMCALL will literally return a unique error code to indicate CPU
> > > > > isn't in post-VMXON, or tdx_cpu_enable() hasn't been done. I think the
> > > > > error code is already clear to pinpoint the problem (due to these pre-
> > > > > SEAMCALL-condition not being met).
> > > >
> > > > No, SEAMCALL #UDs if the CPU isn't post-VMXON. I.e. the CPU doesn't make it to
> > > > the TDX Module to provide a unique error code, all KVM will see is a #UD.
> > >
> > > #UD is handled by the SEAMCALL assembly code. Please see TDX_MODULE_CALL
> > > assembly macro:
>
> Right, but that doesn't say why the #UD occurred. The macro dresses it up in
> TDX_SW_ERROR so that KVM only needs a single parser, but at the end of the day
> KVM is still only going to see that SEAMCALL hit a #UD.

Right. But is there any problem here? I thought the point was we can
just use the error code to tell what went wrong.

>
> > > > There is no reason to rely on the caller to take cpu_hotplug_lock, and definitely
> > > > no reason to rely on the caller to invoke tdx_cpu_enable() separately from invoking
> > > > tdx_enable(). I suspect they got that way because of KVM's unnecessarily complex
> > > > code, e.g. if KVM is already doing on_each_cpu() to do VMXON, then it's easy enough
> > > > to also do TDH_SYS_LP_INIT, so why do two IPIs?
> > >
> > > The main reason is we relaxed the TDH.SYS.LP.INIT to be called _after_ TDX
> > > module initialization.  
> > >
> > > Previously, the TDH.SYS.LP.INIT must be done on *ALL* CPUs that the
> > > platform has (i.e., cpu_present_mask) right after TDH.SYS.INIT and before
> > > any other SEAMCALLs. This didn't quite work with (kernel software) CPU
> > > hotplug, and it had problems dealing with things like SMT disable
> > > mitigation:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/T/#mf42fa2d68d6b98edcc2aae11dba3c2487caf3b8f
> > >
> > > So the x86 maintainers requested to change this. The original proposal
> > > was to eliminate the entire TDH.SYS.INIT and TDH.SYS.LP.INIT:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/T/#m78c0c48078f231e92ea1b87a69bac38564d46469
> > >
> > > But somehow it wasn't feasible, and the result was we relaxed to allow
> > > TDH.SYS.LP.INIT to be called after module initialization.
> > >
> > > So we need a separate tdx_cpu_enable() for that.
>
> No, you don't, at least not given the TDX patches I'm looking at. Allowing
> TDH.SYS.LP.INIT after module initialization makes sense because otherwise the
> kernel would need to online all possible CPUs before initializing TDX. But that
> doesn't mean that the kernel needs to, or should, punt TDH.SYS.LP.INIT to KVM.
>
> AFAICT, KVM is NOT doing TDH.SYS.LP.INIT when a CPU is onlined, only when KVM
> is loaded, which means that tdx_enable() can process all online CPUs just as
> easily as KVM.

Hmm.. I assumed kvm_online_cpu() will do VMXON + tdx_cpu_enable().

>
> Presumably that approach relies on something blocking onlining CPUs when TDX is
> active. And if that's not the case, the proposed patches are buggy.

The current patch ([PATCH 023/130] KVM: TDX: Initialize the TDX module
when loading the KVM intel kernel module) indeed is buggy, but I don't
quite follow why we need to block onlining CPU when TDX is active?

There's nothing hard that prevents us from doing so. KVM just needs to do
VMXON + tdx_cpu_enable() inside kvm_online_cpu().

>
> > Btw, the ideal (or probably the final) plan is to handle tdx_cpu_enable()
> > in TDX's own CPU hotplug callback in the core-kernel and hide it from all
> > other in-kernel TDX users.  
> >
> > Specifically:
> >
> > 1) that callback, e.g., tdx_online_cpu() will be placed _before_ any in-
> > kernel TDX users like KVM's callback.
> > 2) In tdx_online_cpu(), we do VMXON + tdx_cpu_enable() + VMXOFF, and
> > return error in case of any error to prevent that cpu from going online.
> >
> > That makes sure that, if TDX is supported by the platform, we basically
> > guarantee all online CPUs are ready to issue SEAMCALL (of course, the in-
> > kernel TDX user still needs to do VMXON for it, but that's TDX user's
> > responsibility).
> >
> > But that obviously needs to move VMXON to the core-kernel.
>
> It doesn't strictly have to be core kernel per se, just in code that sits below
> KVM, e.g. in a separate module called VAC[*] ;-)
>
> [*] https://lore.kernel.org/all/[email protected]

Could you elaborate why vac.ko is necessary?

Being a module naturally we will need to handle module init and exit. But
TDX cannot be disabled and re-enabled after initialization, so in general
the vac.ko doesn't quite fit for TDX.

And I am not sure what's the fundamental difference between managing TDX
module in a module vs in the core-kernel from KVM's perspective.

>
> > Currently, export tdx_cpu_enable() as a separate API and require KVM to
> > call it explicitly is a temporary solution.
> >
> > That being said, we could do tdx_cpu_enable() inside tdx_enable(), but I
> > don't see it's a better idea.
>
> It simplifies the API surface for enabling TDX and eliminates an export.

I was surprised to see we want to prevent onlining CPU when TDX is active.

2024-04-23 23:30:00

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, 2024-04-23 at 22:59 +0000, Huang, Kai wrote:
> > Right, but that doesn't say why the #UD occurred.  The macro dresses it up in
> > TDX_SW_ERROR so that KVM only needs a single parser, but at the end of the day
> > KVM is still only going to see that SEAMCALL hit a #UD.
>
> Right.  But is there any problem here?  I thought the point was we can
> just use the error code to tell what went wrong.

Oh, I guess I was replying too quickly. From the spec, #UD happens when

IF not in VMX operation or inSMM or inSEAM or 
((IA32_EFER.LMA & CS.L) == 0)
THEN #UD;

Are you worried about #UD was caused by other cases rather than "not in
VMX operation"?

But it's quite obvious the other 3 cases are not possible, correct?

2024-04-24 00:11:58

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -40,6 +40,10 @@ static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
>   return ret;
>  }
>  
> +#ifdef CONFIG_INTEL_TDX_HOST
> +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
> +#endif
> +

Why does this need to be inside CONFIG_INTEL_TDX_HOST while the other
tdh_xxx() don't?

I suppose all tdh_xxx(), together with this pr_tdx_error(), should only be
called from tdx.c, which is only built when CONFIG_INTEL_TDX_HOST is true?

In fact, tdx_seamcall() directly calls seamcall() and seamcall_ret(),
which are only present when CONFIG_INTEL_TDX_HOST is on.

So things are really confusing here. I do believe we should just remove
this CONFIG_INTEL_TDX_HOST around pr_tdx_error(), so that all functions in
"tdx_ops.h" are only used in tdx.c.

2024-04-24 10:31:03

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 036/130] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -3,6 +3,9 @@
>  #define __KVM_X86_TDX_H
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
> +
> +#include "tdx_ops.h"
> +
>  struct kvm_tdx {
>   struct kvm kvm;
>   /* TDX specific members follow. */

I am consistently hitting build error for the middle patches in our
internal tree, mostly because of this madness of header file inclusion.

I found the above inclusion of "tdx_ops.h" in "tdx.h" just out of the blue.

We have

- "tdx_arch.h"
- "tdx_errno.h"
- "tdx_ops.h"
- "tdx.h"


The first two can be included by the "tdx.h", so that we can have a rule
for C files to just include "tdx.h", i.e., the C files should never need
to include the first two explicitly.

The "tdx_ops.h" is a little bit confusing. I _think_ the purpose of it is
to only contain SEAMCALL wrappers. But I am not sure whether it can be
included by any C file directly.

Based on the above code change, I _think_ the intention is to also embed it
into "tdx.h", so the C files should just include "tdx.h".

But based on Sean's comments, the SEAMCALL wrappers will be changed to
take 'struct kvm_tdx *' and 'struct vcpu_tdx *', so they need the
declaration of those structures which are in "tdx.h".

I think we can just make a rule that "tdx_ops.h" should never be directly
included by any C file; instead, we include "tdx_ops.h" into "tdx.h"
somewhere after the declarations of 'struct kvm_tdx' and 'struct vcpu_tdx'.

And such inclusion should happen when the "tdx_ops.h" is introduced.

Am I missing anything?


2024-04-24 10:51:19

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> +#include <linux/compiler.h>
> +
> +#include <asm/cacheflush.h>
> +#include <asm/asm.h>
> +#include <asm/kvm_host.h>

None of above are needed to build this file, except the
<asm/cacheflush.h>.

I am removing them.

I am also adding <asm/tdx.h> because this file uses 'struct
tdx_module_args'. And <asm/page.h> for __va().

> +
> +#include "tdx_errno.h"
> +#include "tdx_arch.h"

I am moving them to "tdx.h", and make "tdx.h" to include this "tdx_ops.h"
as well, and declare the C file should never include "tdx_ops.h" directly:

https://lore.kernel.org/lkml/[email protected]/

2024-04-24 11:06:45

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_error.c
> @@ -0,0 +1,21 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* functions to record TDX SEAMCALL error */
> +
> +#include <linux/kernel.h>
> +#include <linux/bug.h>
> +

I don't see why the above two are needed, especially the giant
<linux/kernel.h>.

<linux/printk.h> should be sufficient for the current patch.

2024-04-25 16:31:51

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, Apr 23, 2024, Kai Huang wrote:
> On Tue, 2024-04-23 at 22:59 +0000, Huang, Kai wrote:
> > > Right, but that doesn't say why the #UD occurred.  The macro dresses it up in
> > > TDX_SW_ERROR so that KVM only needs a single parser, but at the end of the day
> > > KVM is still only going to see that SEAMCALL hit a #UD.
> >
> > Right.  But is there any problem here?  I thought the point was we can
> > just use the error code to tell what went wrong.
>
> Oh, I guess I was replying too quickly. From the spec, #UD happens when
>
> IF not in VMX operation or inSMM or inSEAM or 
> ((IA32_EFER.LMA & CS.L) == 0)
> THEN #UD;
>
> Are you worried about #UD was caused by other cases rather than "not in
> VMX operation"?

Yes.

> But it's quite obvious the other 3 cases are not possible, correct?

The spec I'm looking at also has:

If IA32_VMX_PROCBASED_CTLS3[5] is 0.

And anecdotally, I know of at least one crash in our production environment where
a VMX instruction hit a seemingly spurious #UD, i.e. it's not impossible for a
ucode bug or hardware defect to cause problems. That's obviously _extremely_
unlikely, but that's why I emphasized that sanity checking CR4.VMXE is cheap.
Practically speaking it costs nothing, so IMO it's worth adding even if the odds
of it ever being helpful are one-in-a-million.

2024-04-25 16:35:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, Apr 23, 2024, Kai Huang wrote:
> On Tue, 2024-04-23 at 08:15 -0700, Sean Christopherson wrote:
> > Presumably that approach relies on something blocking onlining CPUs when TDX is
> > active. And if that's not the case, the proposed patches are buggy.
>
> The current patch ([PATCH 023/130] KVM: TDX: Initialize the TDX module
> when loading the KVM intel kernel module) indeed is buggy, but I don't
> quite follow why we need to block onlining CPU when TDX is active?

I was saying that based on my reading of the code, either (a) the code is buggy
or (b) something blocks onlining CPUs when TDX is active. Sounds like the answer
is (a).

> There's nothing hard that prevents us from doing so. KVM just needs to do
> VMXON + tdx_cpu_enable() inside kvm_online_cpu().
>
> >
> > > Btw, the ideal (or probably the final) plan is to handle tdx_cpu_enable()
> > > in TDX's own CPU hotplug callback in the core-kernel and hide it from all
> > > other in-kernel TDX users.  
> > >
> > > Specifically:
> > >
> > > 1) that callback, e.g., tdx_online_cpu() will be placed _before_ any in-
> > > kernel TDX users like KVM's callback.
> > > 2) In tdx_online_cpu(), we do VMXON + tdx_cpu_enable() + VMXOFF, and
> > > return error in case of any error to prevent that cpu from going online.
> > >
> > > That makes sure that, if TDX is supported by the platform, we basically
> > > guarantee all online CPUs are ready to issue SEAMCALL (of course, the in-
> > > kernel TDX user still needs to do VMXON for it, but that's TDX user's
> > > responsibility).
> > >
> > > But that obviously needs to move VMXON to the core-kernel.
> >
> > It doesn't strictly have to be core kernel per se, just in code that sits below
> > KVM, e.g. in a separate module called VAC[*] ;-)
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> Could you elaborate why vac.ko is necessary?
>
> > Being a module naturally we will need to handle module init and exit. But
> TDX cannot be disabled and re-enabled after initialization, so in general
> the vac.ko doesn't quite fit for TDX.
>
> And I am not sure what's the fundamental difference between managing TDX
> module in a module vs in the core-kernel from KVM's perspective.

VAC isn't strictly necessary. What I was saying is that it's not strictly
necessary for the core kernel to handle VMXON either. I.e. it could be done in
something like VAC, or it could be done in the core kernel.

The important thing is that they're handled by _one_ entity. What we have today
is probably the worst setup; VMXON is handled by KVM, but TDX.SYS.LP.INIT is
handled by core kernel (sort of).

2024-04-25 16:46:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

On Fri, Apr 19, 2024 at 12:53:19PM -0700, Sean Christopherson wrote:
> > But I guess we can make a *few* wrappers that covers all needed cases.
>
> Yeah. I suspect two will suffice. One for the calls that say at or below four
> inputs, and one for the fat ones like ReportFatalError that use everything under
> the sun.

I ended up with three helpers:

- tdvmcall_trampoline() as you proposed.

- tdvmcall_report_fatal_error() that does ReportFatalError specifically.
Pointer to char array as an input. Never returns: no need to
save/restore registers.

- hv_tdx_hypercall(). Hyper-V annoyingly uses R8 and RDX as parameters :/

> Not sure, haven't looked at them recently. At a glance, something similar? The
> use of high registers instead of RDI and RSI is damn annoying :-/

I defined three helpers: TDCALL_0(), TDCALL_1() and TDCALL_5(). All take 4
input arguments.

I've updated the WIP branch with the new version:

https://github.com/intel/tdx/commits/guest-tdx-asm/

Any feedback is welcome.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-25 22:18:41

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, 2024-04-25 at 09:35 -0700, Sean Christopherson wrote:
> On Tue, Apr 23, 2024, Kai Huang wrote:
> > On Tue, 2024-04-23 at 08:15 -0700, Sean Christopherson wrote:
> > > Presumably that approach relies on something blocking onlining CPUs when TDX is
> > > active. And if that's not the case, the proposed patches are buggy.
> >
> > The current patch ([PATCH 023/130] KVM: TDX: Initialize the TDX module
> > when loading the KVM intel kernel module) indeed is buggy, but I don't
> > quite follow why we need to block onlining CPU when TDX is active?
>
> I was saying that based on my reading of the code, either (a) the code is buggy
> or (b) something blocks onlining CPUs when TDX is active. Sounds like the answer
> is (a).

Yeah it's a).

>
> > There's nothing hard that prevents us from doing so. KVM just needs to do
> > VMXON + tdx_cpu_enable() inside kvm_online_cpu().
> >
> > >
> > > > Btw, the ideal (or probably the final) plan is to handle tdx_cpu_enable()
> > > > in TDX's own CPU hotplug callback in the core-kernel and hide it from all
> > > > other in-kernel TDX users.  
> > > >
> > > > Specifically:
> > > >
> > > > 1) that callback, e.g., tdx_online_cpu() will be placed _before_ any in-
> > > > kernel TDX users like KVM's callback.
> > > > 2) In tdx_online_cpu(), we do VMXON + tdx_cpu_enable() + VMXOFF, and
> > > > return error in case of any error to prevent that cpu from going online.
> > > >
> > > > That makes sure that, if TDX is supported by the platform, we basically
> > > > guarantee all online CPUs are ready to issue SEAMCALL (of course, the in-
> > > > kernel TDX user still needs to do VMXON for it, but that's TDX user's
> > > > responsibility).
> > > >
> > > > But that obviously needs to move VMXON to the core-kernel.
> > >
> > > It doesn't strictly have to be core kernel per se, just in code that sits below
> > > KVM, e.g. in a separate module called VAC[*] ;-)
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > Could you elaborate why vac.ko is necessary?
> >
> > Being a module naturally we will need to handle module init and exit. But
> > TDX cannot be disabled and re-enabled after initialization, so in general
> > the vac.ko doesn't quite fit for TDX.
> >
> > And I am not sure what's the fundamental difference between managing TDX
> > module in a module vs in the core-kernel from KVM's perspective.
>
> VAC isn't strictly necessary. What I was saying is that it's not strictly
> necessary for the core kernel to handle VMXON either. I.e. it could be done in
> something like VAC, or it could be done in the core kernel.

Right, but so far I cannot see any advantage of using a VAC module,
although perhaps I am missing something.

>
> The important thing is that they're handled by _one_ entity. What we have today
> is probably the worst setup; VMXON is handled by KVM, but TDX.SYS.LP.INIT is
> handled by core kernel (sort of).

I cannot argue against this :-)

But from this point of view, I cannot see the difference between tdx_enable()
and tdx_cpu_enable(), because they are both in the core kernel while depending
on KVM to handle VMXON.

Or, do you prefer that we move VMXON to the core kernel at this stage, i.e.,
as preparatory work for KVM TDX?



2024-04-25 22:35:26

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, 2024-04-25 at 09:30 -0700, Sean Christopherson wrote:
> On Tue, Apr 23, 2024, Kai Huang wrote:
> > On Tue, 2024-04-23 at 22:59 +0000, Huang, Kai wrote:
> > > > Right, but that doesn't say why the #UD occurred.  The macro dresses it up in
> > > > TDX_SW_ERROR so that KVM only needs a single parser, but at the end of the day
> > > > KVM is still only going to see that SEAMCALL hit a #UD.
> > >
> > > Right.  But is there any problem here?  I thought the point was we can
> > > just use the error code to tell what went wrong.
> >
> > Oh, I guess I was replying too quickly. From the spec, #UD happens when
> >
> > IF not in VMX operation or inSMM or inSEAM or 
> > ((IA32_EFER.LMA & CS.L) == 0)
> > THEN #UD;
> >
> > Are you worried about #UD was caused by other cases rather than "not in
> > VMX operation"?
>
> Yes.
>
> > But it's quite obvious the other 3 cases are not possible, correct?
>
> The spec I'm looking at also has:
>
> If IA32_VMX_PROCBASED_CTLS3[5] is 0.

Ah, now I see this too.

It's not in the pseudo code of the SEAMCALL instruction, but in the "64-Bit
Mode Exceptions" section, which is after the pseudo code.

And this bit 5 reports the capability to control the "shared bit" in
5-level EPT.

>
> And anecdotally, I know of at least one crash in our production environment where
> a VMX instruction hit a seemingly spurious #UD, i.e. it's not impossible for a
> ucode bug or hardware defect to cause problems. That's obviously _extremely_
> unlikely, but that's why I emphasized that sanity checking CR4.VMXE is cheap.

Yeah I agree it could happen although very unlikely.

But just to be sure:

I believe the #UD itself doesn't crash the kernel/machine; wouldn't the
crash be the kernel being unable to handle the #UD in such a case?

If so, I am not sure whether the CR4.VMXE check can make the kernel any
safer, because we can already handle the #UD for the SEAMCALL instruction.

Yeah we can clearly dump message saying "CPU isn't in VMX operation" and
return failure if we have the check, but if we don't, the worst situation
is we might mistakenly report "CPU isn't in VMX operation" (currently code
just treats #UD as CPU not in VMX operation) when the CPU doesn't support
IA32_VMX_PROCBASED_CTLS3[5].

And for the IA32_VMX_PROCBASED_CTLS3[5] we can easily do some pre-check in
KVM code during module loading to rule out this case.

And in practice, I even believe the BIOS cannot turn on TDX if the
IA32_VMX_PROCBASED_CTLS3[5] is not supported. I can check on this.

> Practically speaking it costs nothing, so IMO it's worth adding even if the odds
> of it ever being helpful are one-in-a-million.

I think we will need to do the below somewhere in the common SEAMCALL
function:

	unsigned long flags;
	int ret = -EINVAL;

	local_irq_save(flags);

	if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
		goto out;

	ret = seamcall();
out:
	local_irq_restore(flags);
	return ret;

to make it IRQ safe.

And the oddity is that currently the common SEAMCALL functions, a.k.a.
__seamcall() and seamcall() (the latter is actually a macro), both return
u64, so if we want to have such CR4.VMXE check code in the common code, we
need to invent a new error code for it.

That being said, although I agree it can make the code a little bit
clearer, I am not sure whether it can make the code any safer -- even w/o
it, the worst case is to incorrectly report "CPU is not in VMX operation",
but shouldn't crash kernel etc.

2024-04-25 23:04:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, Apr 25, 2024, Kai Huang wrote:
> On Thu, 2024-04-25 at 09:30 -0700, Sean Christopherson wrote:
> > On Tue, Apr 23, 2024, Kai Huang wrote:
> > And anecdotally, I know of at least one crash in our production environment where
> > a VMX instruction hit a seemingly spurious #UD, i.e. it's not impossible for a
> > ucode bug or hardware defect to cause problems. That's obviously _extremely_
> > unlikely, but that's why I emphasized that sanity checking CR4.VMXE is cheap.
>
> Yeah I agree it could happen although very unlikely.
>
> But just to be sure:
>
> I believe the #UD itself doesn't crash the kernel/machine; wouldn't the
> crash be the kernel being unable to handle the #UD in such a case?

Correct, the #UD is likely not (immediately) fatal.
>
> If so, I am not sure whether the CR4.VMXE check can make the kernel any
> safer, because we can already handle the #UD for the SEAMCALL instruction.

It's not about making the kernel safer, it's about helping triage/debug issues.

> Yeah we can clearly dump message saying "CPU isn't in VMX operation" and
> return failure if we have the check, but if we don't, the worst situation
> is we might mistakenly report "CPU isn't in VMX operation" (currently code
> just treats #UD as CPU not in VMX operation) when the CPU doesn't support
> IA32_VMX_PROCBASED_CTLS3[5].
>
> And for the IA32_VMX_PROCBASED_CTLS3[5] we can easily do some pre-check in
> KVM code during module loading to rule out this case.
>
> And in practice, I even believe the BIOS cannot turn on TDX if the
> IA32_VMX_PROCBASED_CTLS3[5] is not supported. I can check on this.

Eh, I wouldn't worry about that too much. The only reason I brought up that
check was to call out that we can't *know* with 100% certainty that SEAMCALL
failed due to the CPU not being post-VMXON.

> > Practically speaking it costs nothing, so IMO it's worth adding even if the odds
> > of it ever being helpful are one-in-a-million.
>
> I think we will need to do the below somewhere in the common SEAMCALL
> function:
>
> unsigned long flags;
> int ret = -EINVAL;
>
> local_irq_save(flags);
>
> if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> goto out;
>
> ret = seamcall();
> out:
> local_irq_restore(flags);
> return ret;
>
> to make it IRQ safe.
>
> And the odd thing is that currently the common SEAMCALL functions, a.k.a.
> __seamcall() and seamcall() (the latter is actually a macro), both return
> u64, so if we want to have such a CR4.VMXE check in the common code, we
> need to invent a new error code for it.

Oh, I wasn't thinking that we'd check CR4.VMXE before *every* SEAMCALL, just
before the TDH.SYS.LP.INIT call, i.e. before the one that is most likely to fail
due to a software bug that results in the CPU not doing VMXON before enabling
TDX.

Again, my intent is to add a simple, cheap, and targeted sanity check to help
deal with potential failures in code that historically has been less than rock
solid, and in a function that has a big fat assumption that the caller has done
VMXON on the CPU.
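For illustration, the targeted check could look something like the sketch
below. This is a standalone user-space mock (fake CR4 value, plain error
returns, hypothetical function names), not the actual kernel code; the only
hard fact it relies on is that CR4.VMXE is bit 13.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR4.VMXE is bit 13 on x86. */
#define X86_CR4_VMXE	(1UL << 13)
#define MOCK_EINVAL	22

/* Mocked CR4 read; the kernel would use __read_cr4() here. */
static bool cpu_post_vmxon(uint64_t cr4)
{
	return (cr4 & X86_CR4_VMXE) != 0;
}

/*
 * Sketch of a sanity check in front of TDH.SYS.LP.INIT only: bail out
 * before issuing the SEAMCALL if the CPU is clearly not post-VMXON,
 * instead of decoding a #UD after the fact.
 */
static int tdx_lp_init_sketch(uint64_t cr4)
{
	if (!cpu_post_vmxon(cr4))
		return -MOCK_EINVAL;	/* caller forgot VMXON */
	return 0;			/* would issue the SEAMCALL here */
}
```

The point of the check is purely diagnostic: it converts a hard-to-triage #UD
into an explicit error at the one call site most likely to be reached with
VMX disabled.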

> That being said, although I agree it can make the code a little bit
> clearer, I am not sure whether it can make the code any safer -- even w/o
> it, the worst case is to incorrectly report "CPU is not in VMX operation",
> but it shouldn't crash the kernel etc.

2024-04-26 00:22:12

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module


>
> > > The important thing is that they're handled by _one_ entity. What we have today
> > > is probably the worst setup; VMXON is handled by KVM, but TDX.SYS.LP.INIT is
> > > handled by core kernel (sort of).
> >
> > I cannot argue against this :-)
> >
> > But from this point of view, I cannot see the difference between tdx_enable()
> > and tdx_cpu_enable(), because they are both in the core kernel while
> > depending on KVM to handle VMXON.
>
> My comments were made under the assumption that the code was NOT buggy, i.e. if
> KVM did NOT need to call tdx_cpu_enable() independent of tdx_enable().
>
> That said, I do think it makes sense to have tdx_enable() call a private/inner
> version, e.g. __tdx_cpu_enable(), and then have KVM call a public version. Alternatively,
> the kernel could register yet another cpuhp hook that runs after KVM's, i.e. does
> TDX.SYS.LP.INIT after KVM has done VMXON (if TDX has been enabled).

We will need to handle tdx_cpu_online() in "some cpuhp callback" anyway,
no matter whether tdx_enable() calls __tdx_cpu_enable() internally or not,
because now tdx_enable() can be done on a subset of cpus that the platform
has.

For the latter (after the "Alternatively" above), by "the kernel" do you
mean the core-kernel but not KVM?

E.g., you mean to register a cpuhp hook _inside_ tdx_enable() after TDX is
initialized successfully?

That would have a problem: when KVM is not present (e.g., KVM is
unloaded after it enables TDX), the cpuhp hook won't work at all.

If we ever want a new TDX-specific cpuhp hook "at this stage", IMHO it's
better to have it done by KVM, i.e., it goes away when KVM is unloaded.

Logically, we have two approaches in terms of how to treat
tdx_cpu_enable():

1) We treat the two cases separately: calling tdx_cpu_enable() for all
online cpus, and calling it when a new CPU tries to go online in some
cpuhp hook.  And we only want to call tdx_cpu_enable() in the cpuhp hook
when tdx_enable() has completed successfully.

That is: 

a) we always call tdx_cpu_enable() (or __tdx_cpu_enable()) inside
tdx_enable() as the first step, or,

b) let the caller (KVM) make sure tdx_cpu_enable() has been done for
all online cpus before calling tdx_enable().

Something like this:

	if (enable_tdx) {
		cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, kvm_online_cpu, ...);

		cpus_read_lock();
		on_each_cpu(tdx_cpu_enable, ...);	/* or do it inside tdx_enable() */
		enable_tdx = tdx_enable();
		if (enable_tdx)
			cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
					  tdx_online_cpu, ...);
		cpus_read_unlock();
	}

static int tdx_online_cpu(unsigned int cpu)
{
	unsigned long flags;
	int ret;

	if (!enable_tdx)
		return 0;

	local_irq_save(flags);
	ret = tdx_cpu_enable();
	local_irq_restore(flags);

	return ret;
}

2) We treat tdx_cpu_enable() as a whole by viewing it as the first step to
run any TDX code (SEAMCALL) on any cpu, including the SEAMCALLs involved
in tdx_enable().

That is, we *unconditionally* call tdx_cpu_enable() for all online cpus,
and when a new CPU tries to go online.

This can be handled at once if we do tdx_cpu_enable() inside KVM's cpuhp
hook:

static int vt_hardware_enable(unsigned int cpu)
{
	unsigned long flags;
	int ret;

	vmx_hardware_enable();

	local_irq_save(flags);
	ret = tdx_cpu_enable();
	local_irq_restore(flags);

	/*
	 * -ENODEV means TDX is not supported by the platform
	 * (TDX not enabled by the hardware or module is
	 * not loaded) or the kernel isn't built with TDX.
	 *
	 * Allow the CPU to go online as there's no way the kernel
	 * could use TDX in this case.
	 *
	 * Other error codes mean TDX is available but something
	 * went wrong.  Prevent this CPU from going online so that
	 * TDX may still work on other online CPUs.
	 */
	if (ret && ret != -ENODEV)
		return ret;

	return 0;
}

So with your change to always enable virtualization when TDX is enabled
during module load, we can simply have:

	if (enable_tdx)
		cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, kvm_online_cpu, ...);

	cpus_read_lock();
	enable_tdx = tdx_enable();
	cpus_read_unlock();

So although the cpus_read_lock() around tdx_enable() is a little bit silly,
the logic is actually simpler IMHO.

(local_irq_save()/restore() around tdx_cpu_enable() is also silly, but that
is a problem common to both solutions above and can be changed
independently).

Also, as I mentioned, the final goal is to have a TDX-specific CPUHP
hook in the core-kernel _BEFORE_ any in-kernel TDX user (KVM) to make sure
all online CPUs are TDX-capable.

When that happens, I can just move the code in vt_hardware_enable() to
tdx_online_cpu() and do additional VMXOFF inside it, with the assumption
that the in-kernel TDX users should manage VMXON/VMXOFF on their own.
Then all TDX users can remove the handling of tdx_cpu_enable().

2024-04-26 03:21:56

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, Apr 26, 2024 at 12:21:46AM +0000, Huang, Kai wrote:
>
>>
>> > > The important thing is that they're handled by _one_ entity. What we have today
>> > > is probably the worst setup; VMXON is handled by KVM, but TDX.SYS.LP.INIT is
>> > > handled by core kernel (sort of).
>> >
>> > I cannot argue against this :-)
>> >
>> > But from this point of view, I cannot see the difference between tdx_enable()
>> > and tdx_cpu_enable(), because they are both in the core kernel while
>> > depending on KVM to handle VMXON.
>>
>> My comments were made under the assumption that the code was NOT buggy, i.e. if
>> KVM did NOT need to call tdx_cpu_enable() independent of tdx_enable().
>>
>> That said, I do think it makes sense to have tdx_enable() call a private/inner
>> version, e.g. __tdx_cpu_enable(), and then have KVM call a public version. Alternatively,
>> the kernel could register yet another cpuhp hook that runs after KVM's, i.e. does
>> TDX.SYS.LP.INIT after KVM has done VMXON (if TDX has been enabled).
>
>We will need to handle tdx_cpu_online() in "some cpuhp callback" anyway,
>no matter whether tdx_enable() calls __tdx_cpu_enable() internally or not,
>because now tdx_enable() can be done on a subset of cpus that the platform
>has.

Can you confirm this is allowed again? It seems like this code indicates the
opposite:

https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_sys_config.c#L768C1-L775C6

>
>For the latter (after the "Alternatively" above), by "the kernel" do you
>mean the core-kernel but not KVM?
>
>E.g., you mean to register a cpuhp hook _inside_ tdx_enable() after TDX is
>initialized successfully?
>
>That would have a problem: when KVM is not present (e.g., KVM is
>unloaded after it enables TDX), the cpuhp hook won't work at all.

Is "the cpuhp hook doesn't work if KVM is not loaded" a real problem?

The CPU about to go online won't run any TDX code. So, it should be ok to
skip tdx_cpu_enable().

Don't get me wrong. I don't object to registering the cpuhp hook in KVM.
I just want you to make decisions based on good information.

>
>If we ever want a new TDX-specific cpuhp hook "at this stage", IMHO it's
>better to have it done by KVM, i.e., it goes away when KVM is unloaded.
>
>Logically, we have two approaches in terms of how to treat
>tdx_cpu_enable():
>
>1) We treat the two cases separately: calling tdx_cpu_enable() for all
>online cpus, and calling it when a new CPU tries to go online in some
>cpuhp hook. And we only want to call tdx_cpu_enable() in the cpuhp hook
>when tdx_enable() has completed successfully.
>
>That is:
>
>a) we always call tdx_cpu_enable() (or __tdx_cpu_enable()) inside
>tdx_enable() as the first step, or,
>
>b) let the caller (KVM) make sure tdx_cpu_enable() has been done for
>all online cpus before calling tdx_enable().
>
>Something like this:
>
>	if (enable_tdx) {
>		cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, kvm_online_cpu, ...);
>
>		cpus_read_lock();
>		on_each_cpu(tdx_cpu_enable, ...);	/* or do it inside tdx_enable() */
>		enable_tdx = tdx_enable();
>		if (enable_tdx)
>			cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>					  tdx_online_cpu, ...);
>		cpus_read_unlock();
>	}
>
> static int tdx_online_cpu(unsigned int cpu)
> {
>	unsigned long flags;
>	int ret;
>
>	if (!enable_tdx)
>		return 0;
>
>	local_irq_save(flags);
>	ret = tdx_cpu_enable();
>	local_irq_restore(flags);
>
>	return ret;
> }
>
>2) We treat tdx_cpu_enable() as a whole by viewing it as the first step to
>run any TDX code (SEAMCALL) on any cpu, including the SEAMCALLs involved
>in tdx_enable().
>
>That is, we *unconditionally* call tdx_cpu_enable() for all online cpus,
>and when a new CPU tries to go online.
>
>This can be handled at once if we do tdx_cpu_enable() inside KVM's cpuhp
>hook:
>
> static int vt_hardware_enable(unsigned int cpu)
> {
>	unsigned long flags;
>	int ret;
>
>	vmx_hardware_enable();
>
>	local_irq_save(flags);
>	ret = tdx_cpu_enable();
>	local_irq_restore(flags);
>
>	/*
>	 * -ENODEV means TDX is not supported by the platform
>	 * (TDX not enabled by the hardware or module is
>	 * not loaded) or the kernel isn't built with TDX.
>	 *
>	 * Allow the CPU to go online as there's no way the kernel
>	 * could use TDX in this case.
>	 *
>	 * Other error codes mean TDX is available but something
>	 * went wrong.  Prevent this CPU from going online so that
>	 * TDX may still work on other online CPUs.
>	 */
>	if (ret && ret != -ENODEV)
>		return ret;
>
>	return 0;
> }
>
>So with your change to always enable virtualization when TDX is enabled
>during module load, we can simply have:
>
>	if (enable_tdx)
>		cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, kvm_online_cpu, ...);
>
>	cpus_read_lock();
>	enable_tdx = tdx_enable();
>	cpus_read_unlock();
>
>So although the cpus_read_lock() around tdx_enable() is a little bit silly,
>the logic is actually simpler IMHO.
>
>(local_irq_save()/restore() around tdx_cpu_enable() is also silly, but that
>is a problem common to both solutions above and can be changed
>independently).
>
>Also, as I mentioned, the final goal is to have a TDX-specific CPUHP
>hook in the core-kernel _BEFORE_ any in-kernel TDX user (KVM) to make sure
>all online CPUs are TDX-capable.
>
>When that happens, I can just move the code in vt_hardware_enable() to
>tdx_online_cpu() and do additional VMXOFF inside it, with the assumption
>that the in-kernel TDX users should manage VMXON/VMXOFF on their own.
>Then all TDX users can remove the handling of tdx_cpu_enable().

2024-04-26 07:47:51

by Fuad Tabba

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

Hi,

On Tue, Mar 19, 2024 at 9:50 PM Isaku Yamahata <[email protected]> wrote:
>
> On Tue, Mar 19, 2024 at 02:47:47PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Mon, 2024-03-18 at 19:50 -0700, Rick Edgecombe wrote:
> > > On Wed, 2024-03-13 at 10:14 -0700, Isaku Yamahata wrote:
> > > > > IMO, an enum will be clearer than the two flags.
> > > > >
> > > > > enum {
> > > > > PROCESS_PRIVATE_AND_SHARED,
> > > > > PROCESS_ONLY_PRIVATE,
> > > > > PROCESS_ONLY_SHARED,
> > > > > };
> > > >
> > > > The code will be ugly like
> > > > "if (== PRIVATE || == PRIVATE_AND_SHARED)" or
> > > > "if (== SHARED || == PRIVATE_AND_SHARED)"
> > > >
> > > > two boolean (or two flags) is less error-prone.
> > >
> > > Yes the enum would be awkward to handle. But I also thought the way
> > > this is specified in struct kvm_gfn_range is a little strange.
> > >
> > > It is ambiguous what it should mean if you set:
> > > .only_private=true;
> > > .only_shared=true;
> > > ...as happens later in the series (although it may be a mistake).
> > >
> > > Reading the original conversation, it seems Sean suggested this
> > > specifically. But it wasn't clear to me from the discussion what the
> > > intention of the "only" semantics was. Like why not?
> > > bool private;
> > > bool shared;
> >
> > I see Binbin brought up this point on v18 as well:
> > https://lore.kernel.org/kvm/[email protected]/#t
> >
> > and helpfully dug up some other discussion with Sean where he agreed
> > the "_only" is confusing and proposed the enum:
> > https://lore.kernel.org/kvm/[email protected]/
> >
> > He wanted the default value (in the case the caller forgets to set
> > them), to be to include both private and shared. I think the enum has
> > the issues that Isaku mentioned. What about?
> >
> > bool exclude_private;
> > bool exclude_shared;
> >
> > It will become onerous if more types of aliases grow, but it clearer
> > semantically and has the safe default behavior.
>
> I'm fine with those names. Anyway, I'm fine with either way, two bools or enum.

I don't have a strong opinion, but I'd brought it up in a previous
patch series. I think that having two bools to encode three states is
less intuitive and potentially more bug prone, more so than the naming
itself (i.e., _only):
https://lore.kernel.org/all/[email protected]/

Cheers,
/fuad

> --
> Isaku Yamahata <[email protected]>
>

2024-04-26 09:44:54

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Fri, 2024-04-26 at 11:21 +0800, Gao, Chao wrote:
> On Fri, Apr 26, 2024 at 12:21:46AM +0000, Huang, Kai wrote:
> >
> > >
> > > > > The important thing is that they're handled by _one_ entity. What we have today
> > > > > is probably the worst setup; VMXON is handled by KVM, but TDX.SYS.LP.INIT is
> > > > > handled by core kernel (sort of).
> > > >
> > > > I cannot argue against this :-)
> > > >
> > > > But from this point of view, I cannot see the difference between tdx_enable()
> > > > and tdx_cpu_enable(), because they are both in the core kernel while
> > > > depending on KVM to handle VMXON.
> > >
> > > My comments were made under the assumption that the code was NOT buggy, i.e. if
> > > KVM did NOT need to call tdx_cpu_enable() independent of tdx_enable().
> > >
> > > That said, I do think it makes sense to have tdx_enable() call a private/inner
> > > version, e.g. __tdx_cpu_enable(), and then have KVM call a public version. Alternatively,
> > > the kernel could register yet another cpuhp hook that runs after KVM's, i.e. does
> > > TDX.SYS.LP.INIT after KVM has done VMXON (if TDX has been enabled).
> >
> > We will need to handle tdx_cpu_online() in "some cpuhp callback" anyway,
> > no matter whether tdx_enable() calls __tdx_cpu_enable() internally or not,
> > because now tdx_enable() can be done on a subset of cpus that the platform
> > has.
>
> Can you confirm this is allowed again? it seems like this code indicates the
> opposite:
>
> https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_sys_config.c#L768C1-L775C6

This feature requires ucode/P-SEAMLDR and TDX module changes, and cannot be
supported on some *early* generations. I think they haven't added such
code to the open-source TDX module code yet.

I can ask TDX module people's plan if it is a concern.

In reality, this shouldn't be a problem because the current code kinda
works with both cases:

1) If this feature is not supported (i.e., old platform and/or old
module), and if user tries to enable TDX when there's offline cpu, then
tdx_enable() will fail when it does TDH.SYS.CONFIG, and we can use the
error code to pinpoint the root cause.

2) Otherwise, it just works.

>
> >
> > For the latter (after the "Alternatively" above), by "the kernel" do you
> > mean the core-kernel but not KVM?
> >
> > E.g., you mean to register a cpuhp book _inside_ tdx_enable() after TDX is
> > initialized successfully?
> >
> > That would have problem like when KVM is not present (e.g., KVM is
> > unloaded after it enables TDX), the cpuhp book won't work at all.
>
> Is "the cpuhp hook doesn't work if KVM is not loaded" a real problem?
>
> The CPU about to go online won't run any TDX code. So, it should be ok to
> skip tdx_cpu_enable().

It _can_ work if we only consider KVM, because for KVM we can always
guarantee:

1) VMXON + tdx_cpu_enable() have been done for all online cpus before it
calls tdx_enable().
2) VMXON + tdx_cpu_enable() have been done in cpuhp for any new CPU before
it goes online.

Btw, this reminds me why I didn't want to do tdx_cpu_enable() inside
tdx_enable():

tdx_enable() will need to _always_ call tdx_cpu_enable() for all online
cpus regardless of whether the module has been initialized successfully in
the previous calls.

I believe this is kinda silly, i.e., why not just let the caller do
tdx_cpu_enable() for all online cpus before calling tdx_enable().

However, back to the TDX-specific core-kernel cpuhp hook, in the long
term, I believe the TDX cpuhp hook should be put _BEFORE_ all in-kernel
TDX-users' cpuhp hooks, because logically TDX users should depend on TDX
core-kernel code, but not the opposite.

That is, my long term vision is we can have a simple rule:

The core-kernel TDX code always guarantees online CPUs are TDX-capable.
All TDX users don't need to consider tdx_cpu_enable() ever. They just
need to call tdx_enable() to bring TDX to work.

So for now, given we depend on KVM for VMXON anyway, I don't see any
reason the core-kernel should register any TDX cpuhp hook.  Having to "skip
tdx_cpu_enable() when VMX isn't enabled" is kinda hacky anyway.

2024-04-26 13:53:27

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Fri, 2024-04-26 at 08:39 +0100, Fuad Tabba wrote:
> > I'm fine with those names. Anyway, I'm fine with either way, two bools or
> > enum.
>
> I don't have a strong opinion, but I'd brought it up in a previous
> patch series. I think that having two bools to encode three states is
> less intuitive and potentially more bug prone, more so than the naming
> itself (i.e., _only):
> https://lore.kernel.org/all/[email protected]/

Currently in our internal branch we switched to:
exclude_private
exclude_shared

It came together better in the code that uses it.

But I started to wonder if we actually really need exclude_shared. For TDX
zapping private memory has to be done with more care, because it cannot be re-
populated without guest coordination. But for shared memory if we are zapping a
range that includes both private and shared memory, I don't think it should hurt
to zap the shared memory.

2024-04-26 15:32:51

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Fri, Apr 26, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-04-26 at 08:39 +0100, Fuad Tabba wrote:
> > > I'm fine with those names. Anyway, I'm fine with either way, two bools or
> > > enum.
> >
> > I don't have a strong opinion, but I'd brought it up in a previous
> > patch series. I think that having two bools to encode three states is
> > less intuitive and potentially more bug prone, more so than the naming
> > itself (i.e., _only):

Hmm, yeah, I buy that argument. We could even harden it further by poisoning '0'
to force KVM to be explicit. Aha! And maybe use a bitmap?

	enum {
		BUGGY_KVM_INVALIDATION		= 0,
		PROCESS_SHARED			= BIT(0),
		PROCESS_PRIVATE			= BIT(1),
		PROCESS_PRIVATE_AND_SHARED	= PROCESS_SHARED | PROCESS_PRIVATE,
	};

> > https://lore.kernel.org/all/[email protected]/
>
> Currently in our internal branch we switched to:
> exclude_private
> exclude_shared
>
> It came together better in the code that uses it.

If the choice is between an enum and exclude_*, I would strongly prefer the enum.
Using exclude_* results in inverted polarity for the code that triggers invalidations.

> But I started to wonder if we actually really need exclude_shared. For TDX
> zapping private memory has to be done with more care, because it cannot be re-
> populated without guest coordination. But for shared memory if we are zapping a
> range that includes both private and shared memory, I don't think it should hurt
> to zap the shared memory.

Hell no, I am not risking taking on more baggage in KVM where userspace or some
other subsystem comes to rely on KVM spuriously zapping SPTEs in response to an
unrelated userspace action.

2024-04-26 15:58:24

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Fri, 2024-04-26 at 08:28 -0700, Sean Christopherson wrote:
> Hmm, yeah, I buy that argument.  We could even harden it further by poisoning '0'
> to force KVM to be explicit.  Aha!  And maybe use a bitmap?
>
>         enum {
>                 BUGGY_KVM_INVALIDATION          = 0,
>                 PROCESS_SHARED                  = BIT(0),
>                 PROCESS_PRIVATE                 = BIT(1),
>                 PROCESS_PRIVATE_AND_SHARED      = PROCESS_SHARED | PROCESS_PRIVATE,
>         };

Seems like it would work for all who have been concerned. The previous objection
to the enum (can't find the mail) was for requiring logic like:

	if (zap == PROCESS_PRIVATE_AND_SHARED || zap == PROCESS_PRIVATE)
		do_private_zap_stuff();


We are trying to tie things up internally so we can jointly have something to
stare at again, as the patches are diverging. But will make this adjustment.
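For what it's worth, the bitmask form collapses those double comparisons into
single bit tests. A standalone user-space sketch (hypothetical KVM_PROCESS_*
values mirroring the proposal above, not the actual KVM definitions):

```c
/* Hypothetical flag values mirroring the proposed enum. */
enum {
	BUGGY_KVM_INVALIDATION		= 0,
	KVM_PROCESS_SHARED		= 1 << 0,
	KVM_PROCESS_PRIVATE		= 1 << 1,
	KVM_PROCESS_PRIVATE_AND_SHARED	= KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE,
};

/*
 * With bit flags, "private only" and "private and shared" are both
 * caught by one AND; no == comparisons against every combination.
 */
static int should_zap_private(int process)
{
	return (process & KVM_PROCESS_PRIVATE) != 0;
}

static int should_zap_shared(int process)
{
	return (process & KVM_PROCESS_SHARED) != 0;
}
```

The poisoned '0' value also means a caller that forgets to set the field
matches neither test, which is exactly the "buggy invalidation" behavior
being argued for.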


>
> > > https://lore.kernel.org/all/[email protected]/
> >
> > Currently in our internal branch we switched to:
> > exclude_private
> > exclude_shared
> >
> > It came together better in the code that uses it.
>
> If the choice is between an enum and exclude_*, I would strongly prefer the
> enum.
> Using exclude_* results in inverted polarity for the code that triggers
> invalidations.

Right, the awkwardness lands in that code.

The processing code looks nice though:
https://lore.kernel.org/kvm/[email protected]/

>
> > But I started to wonder if we actually really need exclude_shared. For TDX
> > zapping private memory has to be done with more care, because it cannot be
> > re-populated without guest coordination. But for shared memory if we are
> > zapping a range that includes both private and shared memory, I don't think
> > it should hurt to zap the shared memory.
>
> Hell no, I am not risking taking on more baggage in KVM where userspace or
> some other subsystem comes to rely on KVM spuriously zapping SPTEs in
> response to an unrelated userspace action.

Hmm, I see the point. Thanks. This was just being left for later discussion
anyway.

2024-04-26 16:50:07

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Fri, Apr 26, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-04-26 at 08:28 -0700, Sean Christopherson wrote:
> > If the choice is between an enum and exclude_*, I would strongly prefer the
> > enum. Using exclude_* results in inverted polarity for the code that
> > triggers invalidations.
>
> Right, the awkwardness lands in that code.
>
> The processing code looks nice though:
> https://lore.kernel.org/kvm/[email protected]/

Heh, where's your bitmask abuse spirit? It's a little evil (and by "evil" I mean
awesome), but the need to process different roots is another good argument for an
enum+bitmask.

enum tdp_mmu_root_types {
	KVM_SHARED_ROOTS = KVM_PROCESS_SHARED,
	KVM_PRIVATE_ROOTS = KVM_PROCESS_PRIVATE,
	KVM_VALID_ROOTS = BIT(2),
	KVM_ANY_VALID_ROOT = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS,
	KVM_ANY_ROOT = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS,
};
static_assert(!(KVM_SHARED_ROOTS & KVM_VALID_ROOTS));
static_assert(!(KVM_PRIVATE_ROOTS & KVM_VALID_ROOTS));
static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1));

/*
* Returns the next root after @prev_root (or the first root if @prev_root is
* NULL). A reference to the returned root is acquired, and the reference to
* @prev_root is released (the caller obviously must hold a reference to
* @prev_root if it's non-NULL).
*
* Returns NULL if the end of tdp_mmu_roots was reached.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
					      struct kvm_mmu_page *prev_root,
					      enum tdp_mmu_root_types types)
{
	bool only_valid = types & KVM_VALID_ROOTS;
	struct kvm_mmu_page *next_root;

	/*
	 * While the roots themselves are RCU-protected, fields such as
	 * role.invalid are protected by mmu_lock.
	 */
	lockdep_assert_held(&kvm->mmu_lock);

	rcu_read_lock();

	if (prev_root)
		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
						  &prev_root->link,
						  typeof(*prev_root), link);
	else
		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
						   typeof(*next_root), link);

	while (next_root) {
		if ((!only_valid || !next_root->role.invalid) &&
		    (types & (KVM_SHARED_ROOTS << is_private_sp(next_root))) &&
		    kvm_tdp_mmu_get_root(next_root))
			break;

		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
				&next_root->link, typeof(*next_root), link);
	}

	rcu_read_unlock();

	if (prev_root)
		kvm_tdp_mmu_put_root(kvm, prev_root);

	return next_root;
}

#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types)	\
	for (_root = tdp_mmu_next_root(_kvm, NULL, _types);		\
	     ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root;	\
	     _root = tdp_mmu_next_root(_kvm, _root, _types))		\
		if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
		} else

#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id)	\
	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_ANY_VALID_ROOT)

#define for_each_tdp_mmu_root_yield_safe(_kvm, _root)			\
	for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ANY_ROOT);	\
	     ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root;	\
	     _root = tdp_mmu_next_root(_kvm, _root, KVM_ANY_ROOT))

2024-04-26 17:02:29

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Fri, 2024-04-26 at 09:49 -0700, Sean Christopherson wrote:
>
> Heh, where's your bitmask abuse spirit?  It's a little evil (and by "evil" I
> mean
> awesome), but the need to process different roots is another good argument for
> an
> enum+bitmask.

Haha. There seems to be some special love for bit math in KVM. I was just
relaying my struggle to understand permission_fault() the other day.

>
> enum tdp_mmu_root_types {
>         KVM_SHARED_ROOTS = KVM_PROCESS_SHARED,
>         KVM_PRIVATE_ROOTS = KVM_PROCESS_PRIVATE,
>         KVM_VALID_ROOTS = BIT(2),
>         KVM_ANY_VALID_ROOT = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS,
>         KVM_ANY_ROOT = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS,
> };
> static_assert(!(KVM_SHARED_ROOTS & KVM_VALID_ROOTS));
> static_assert(!(KVM_PRIVATE_ROOTS & KVM_VALID_ROOTS));
> static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1));
>
> /*
>  * Returns the next root after @prev_root (or the first root if @prev_root is
>  * NULL).  A reference to the returned root is acquired, and the reference to
>  * @prev_root is released (the caller obviously must hold a reference to
>  * @prev_root if it's non-NULL).
>  *
>  * Returns NULL if the end of tdp_mmu_roots was reached.
>  */
> static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>                                               struct kvm_mmu_page *prev_root,
>                                               enum tdp_mmu_root_types types)
> {
>         bool only_valid = types & KVM_VALID_ROOTS;
>         struct kvm_mmu_page *next_root;
>
>         /*
>          * While the roots themselves are RCU-protected, fields such as
>          * role.invalid are protected by mmu_lock.
>          */
>         lockdep_assert_held(&kvm->mmu_lock);
>
>         rcu_read_lock();
>
>         if (prev_root)
>                 next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
>                                                   &prev_root->link,
>                                                   typeof(*prev_root), link);
>         else
>                 next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
>                                                    typeof(*next_root), link);
>
>         while (next_root) {
>                 if ((!only_valid || !next_root->role.invalid) &&
>                     (types & (KVM_SHARED_ROOTS << is_private_sp(next_root))) &&
>                     kvm_tdp_mmu_get_root(next_root))
>                         break;
>
>                 next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
>                                 &next_root->link, typeof(*next_root), link);
>         }
>
>         rcu_read_unlock();
>
>         if (prev_root)
>                 kvm_tdp_mmu_put_root(kvm, prev_root);
>
>         return next_root;
> }
>
> #define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types)         \
>         for (_root = tdp_mmu_next_root(_kvm, NULL, _types);                     \
>              ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root;              \
>              _root = tdp_mmu_next_root(_kvm, _root, _types))                    \
>                 if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) {       \
>                 } else
>
> #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id)             \
>         __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_ANY_VALID_ROOT)
>
> #define for_each_tdp_mmu_root_yield_safe(_kvm, _root)                           \
>         for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ANY_ROOT);               \
>              ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root;              \
>              _root = tdp_mmu_next_root(_kvm, _root, KVM_ANY_ROOT))

Ohh, yes move it into the iterators. I like it a lot.

2024-04-26 17:13:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 011/130] KVM: Add new members to struct kvm_gfn_range to operate on

On Fri, Apr 26, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-04-26 at 09:49 -0700, Sean Christopherson wrote:
> >
> > Heh, where's your bitmask abuse spirit?  It's a little evil (and by "evil"
> > I mean awesome), but the need to process different roots is another good
> > argument for an enum+bitmask.
>
> Haha. There seems to be some special love for bit math in KVM. I was just
> relaying my struggle to understand permission_fault() the other day.

LOL, you and everyone else that's ever looked at that code. Just wait until you
run into one of Paolo's decimal-based bitmasks.

2024-04-26 18:38:55

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error

On Wed, Apr 24, 2024 at 12:11:25AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, [email protected] wrote:
> > --- a/arch/x86/kvm/vmx/tdx_ops.h
> > +++ b/arch/x86/kvm/vmx/tdx_ops.h
> > @@ -40,6 +40,10 @@ static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
> >   return ret;
> >  }
> >  
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
> > +#endif
> > +
>
> Why does this need to be inside CONFIG_INTEL_TDX_HOST while other
> tdh_xxx() don't?
>
> I suppose all tdh_xxx() together with this pr_tdx_error() should only be
> called from tdx.c, which is only built when CONFIG_INTEL_TDX_HOST is true?
>
> In fact, tdx_seamcall() directly calls seamcall() and seamcall_ret(),
> which are only present when CONFIG_INTEL_TDX_HOST is on.
>
> So things are really confusing here. I do believe we should just remove
> this CONFIG_INTEL_TDX_HOST around pr_tdx_error() so that all functions in
> "tdx_ops.h" are only used in tdx.c.

You're right, please go clean them up.
--
Isaku Yamahata <[email protected]>

2024-04-29 11:41:51

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Thu, 2024-04-25 at 15:43 -0700, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Kai Huang wrote:
> > On Thu, 2024-04-25 at 09:30 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 23, 2024, Kai Huang wrote:
> > > And anecdotally, I know of at least one crash in our production environment where
> > > a VMX instruction hit a seemingly spurious #UD, i.e. it's not impossible for a
> > > ucode bug or hardware defect to cause problems. That's obviously _extremely_
> > > unlikely, but that's why I emphasized that sanity checking CR4.VMXE is cheap.
> >
> > Yeah I agree it could happen although very unlikely.
> >
> > But just to be sure:
> >
> > I believe the #UD itself doesn't crash the kernel/machine, but rather the
> > kernel being unable to handle the #UD in such a case?
>
> Correct, the #UD is likely not (immediately) fatal.
> >
> > If so, I am not sure whether the CR4.VMXE check can make the kernel any
> > safer, because we can already handle the #UD for the SEAMCALL instruction.
>
> It's not about making the kernel safer, it's about helping triage/debug issues.
>
> > Yeah we can clearly dump a message saying "CPU isn't in VMX operation" and
> > return failure if we have the check, but if we don't, the worst situation
> > is we might mistakenly report "CPU isn't in VMX operation" (currently the
> > code just treats #UD as CPU not in VMX operation) when the CPU doesn't
> > support IA32_VMX_PROCBASED_CTLS3[5].
> >
> > And for the IA32_VMX_PROCBASED_CTLS3[5] we can easily do some pre-check in
> > KVM code during module loading to rule out this case.
> >
> > And in practice, I even believe the BIOS cannot turn on TDX if the
> > IA32_VMX_PROCBASED_CTLS3[5] is not supported. I can check on this.
>
> Eh, I wouldn't worry about that too much. The only reason I brought up that
> check was to call out that we can't *know* with 100% certainty that SEAMCALL
> failed due to the CPU not being post-VMXON.

OK (though I think we can rule out other cases by adding more checks etc).

>
> > > Practically speaking it costs nothing, so IMO it's worth adding even if the odds
> > > of it ever being helpful are one-in-and-million.
> >
> > I think we will need to do something like the below somewhere in the
> > common SEAMCALL function:
> >
> >         unsigned long flags;
> >         int ret = -EINVAL;
> >
> >         local_irq_save(flags);
> >
> >         if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> >                 goto out;
> >
> >         ret = seamcall();
> > out:
> >         local_irq_restore(flags);
> >         return ret;
> >
> > to make it IRQ safe.
> >
> > And the oddity is that currently the common SEAMCALL functions, a.k.a.
> > __seamcall() and seamcall() (the latter is actually a macro), both return
> > u64, so if we want to have such CR4.VMXE check code in the common code, we
> > need to invent a new error code for it.
>
> Oh, I wasn't thinking that we'd check CR4.VMXE before *every* SEAMCALL, just
> before the TDH.SYS.LP.INIT call, i.e. before the one that is most likely to fail
> due to a software bug that results in the CPU not doing VMXON before enabling
> TDX.
>
> Again, my intent is to add a simple, cheap, and targeted sanity check to help
> deal with potential failures in code that historically has been less than rock
> solid, and in function that has a big fat assumption that the caller has done
> VMXON on the CPU.

I see.

(To be fair, personally I don't recall that we ever had any bug due to
"cpu not in post-VMXON before SEAMCALL", but maybe it's just me. :-).)

But if tdx_enable() doesn't call tdx_cpu_enable() internally, then we will
have two functions that need such handling.

For tdx_enable(), given it's still good idea to disable CPU hotplug around
it, we can still do some check for all online cpus at the beginning, like:

on_each_cpu(check_cr4_vmxe, &err, 1);

Btw, please also see my last reply to Chao why I don't like calling
tdx_cpu_enable() inside tdx_enable():

https://lore.kernel.org/lkml/[email protected]/

That being said, I can try to add additional patch(es) to do the CR4.VMXE
check if you want, but personally I find it hard to come up with a strong
justification for doing so.

2024-04-29 20:07:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Mon, Apr 29, 2024, Kai Huang wrote:
> On Thu, 2024-04-25 at 15:43 -0700, Sean Christopherson wrote:
> > > And the oddity is that currently the common SEAMCALL functions, a.k.a.
> > > __seamcall() and seamcall() (the latter is actually a macro), both return
> > > u64, so if we want to have such CR4.VMXE check code in the common code, we
> > > need to invent a new error code for it.
> >
> > Oh, I wasn't thinking that we'd check CR4.VMXE before *every* SEAMCALL, just
> > before the TDH.SYS.LP.INIT call, i.e. before the one that is most likely to fail
> > due to a software bug that results in the CPU not doing VMXON before enabling
> > TDX.
> >
> > Again, my intent is to add a simple, cheap, and targeted sanity check to help
> > deal with potential failures in code that historically has been less than rock
> > solid, and in function that has a big fat assumption that the caller has done
> > VMXON on the CPU.
>
> I see.
>
> (To be fair, personally I don't recall that we ever had any bug due to
> "cpu not in post-VMXON before SEAMCALL", but maybe it's just me. :-).)
>
> But if tdx_enable() doesn't call tdx_cpu_enable() internally, then we will
> have two functions that need such handling.

Why? I assume there will be exactly one caller of TDH.SYS.LP.INIT.

> For tdx_enable(), given it's still good idea to disable CPU hotplug around
> it, we can still do some check for all online cpus at the beginning, like:
>
> on_each_cpu(check_cr4_vmxe, &err, 1);

If it gets to that point, just omit the check. I really think you're making much
ado about nothing. My suggestion is essentially "throw in a CR4.VMXE check before
TDH.SYS.LP.INIT if it's easy". If it's not easy for some reason, then don't do
it.

> Btw, please also see my last reply to Chao why I don't like calling
> tdx_cpu_enable() inside tdx_enable():
>
> https://lore.kernel.org/lkml/[email protected]/
>
> That being said, I can try to add additional patch(es) to do the CR4.VMXE
> check if you want, but personally I find it hard to come up with a strong
> justification for doing so.
>

2024-04-29 23:12:57

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 30/04/2024 8:06 am, Sean Christopherson wrote:
> On Mon, Apr 29, 2024, Kai Huang wrote:
>> On Thu, 2024-04-25 at 15:43 -0700, Sean Christopherson wrote:
>>>> And the oddity is that currently the common SEAMCALL functions, a.k.a.
>>>> __seamcall() and seamcall() (the latter is actually a macro), both return
>>>> u64, so if we want to have such CR4.VMXE check code in the common code, we
>>>> need to invent a new error code for it.
>>>
>>> Oh, I wasn't thinking that we'd check CR4.VMXE before *every* SEAMCALL, just
>>> before the TDH.SYS.LP.INIT call, i.e. before the one that is most likely to fail
>>> due to a software bug that results in the CPU not doing VMXON before enabling
>>> TDX.
>>>
>>> Again, my intent is to add a simple, cheap, and targeted sanity check to help
>>> deal with potential failures in code that historically has been less than rock
>>> solid, and in function that has a big fat assumption that the caller has done
>>> VMXON on the CPU.
>>
>> I see.
>>
>> (To be fair, personally I don't recall that we ever had any bug due to
>> "cpu not in post-VMXON before SEAMCALL", but maybe it's just me. :-).)
>>
>> But if tdx_enable() doesn't call tdx_cpu_enable() internally, then we will
>> have two functions that need such handling.
>
> Why? I assume there will be exactly one caller of TDH.SYS.LP.INIT.

Right, it's only done in tdx_cpu_enable().

I was thinking "the one that is most likely to fail" isn't just
TDH.SYS.LP.INIT in this case, but could also be any SEAMCALL that is
first run on any online cpu inside tdx_enable().

Or perhaps you were thinking that once tdx_cpu_enable() is called on one cpu,
we can safely assume that cpu must be post-VMXON, even though we have two
separate functions: tdx_cpu_enable() and tdx_enable().

>
>> For tdx_enable(), given it's still good idea to disable CPU hotplug around
>> it, we can still do some check for all online cpus at the beginning, like:
>>
>> on_each_cpu(check_cr4_vmxe, &err, 1);
>
> If it gets to that point, just omit the check. I really think you're making much
> ado about nothing.

Yeah.

> My suggestion is essentially "throw in a CR4.VMXE check before
> TDH.SYS.LP.INIT if it's easy". If it's not easy for some reason, then don't do
> it.

I see. The disconnect between us is that I am not super clear why we
should treat TDH.SYS.LP.INIT as a special one that deserves a CR4.VMXE
check but not other SEAMCALLs.

Anyway, I don't think adding such a check or not matters a lot at this
stage, and I don't want to take up any more of your time on this. :-)

2024-04-30 16:13:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

On Tue, Apr 30, 2024, Kai Huang wrote:
> On 30/04/2024 8:06 am, Sean Christopherson wrote:
> > My suggestion is essentially "throw in a CR4.VMXE check before
> > TDH.SYS.LP.INIT if it's easy". If it's not easy for some reason, then don't do
> > it.
>
> I see. The disconnect between us is that I am not super clear why we should
> treat TDH.SYS.LP.INIT as a special one that deserves a CR4.VMXE check but
> not other SEAMCALLs.

Because TDH.SYS.LP.INIT is done on all CPUs via an IPI function call, is a one-
time thing, and is at the intersection of core TDX and KVM module code, e.g.
the core TDX code has an explicit assumption that:

* This function assumes the caller has: 1) held read lock of CPU hotplug
* lock to prevent any new cpu from becoming online; 2) done both VMXON
* and tdx_cpu_enable() on all online cpus.

KVM can obviously screw up and attempt SEAMCALLs without being post-VMXON, but
that's entirely a _KVM_ bug. And the probability of getting all the way to
something like TDH_MEM_SEPT_ADD without being post-VMXON is comically low, e.g.
KVM and/or the kernel would likely crash long before that point.

2024-04-30 20:50:05

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

Hi Isaku,

On 4/3/2024 11:42 AM, Isaku Yamahata wrote:
> On Mon, Apr 01, 2024 at 12:10:58PM +0800,
> Chao Gao <[email protected]> wrote:
>
>>> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>>> +{
>>> + unsigned long exit_qual;
>>> +
>>> + if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
>>> + /*
>>> + * Always treat SEPT violations as write faults. Ignore the
>>> + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
>>> + * TD private pages are always RWX in the SEPT tables,
>>> + * i.e. they're always mapped writable. Just as importantly,
>>> + * treating SEPT violations as write faults is necessary to
>>> + * avoid COW allocations, which will cause TDAUGPAGE failures
>>> + * due to aliasing a single HPA to multiple GPAs.
>>> + */
>>> +#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
>>> + exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
>>> + } else {
>>> + exit_qual = tdexit_exit_qual(vcpu);
>>> + if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
>>
>> Unless the CPU has a bug, instruction fetch in TD from shared memory causes a
>> #PF. I think you can add a comment for this.
>
> Yes.
>
>
>> Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.
>
> Bug of what component? CPU. If so, I think KVM_EXIT_INTERNAL_ERROR +
> KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is more appropriate.
>

Is below what you have in mind?

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 499c6cd9633f..bd30b4c4d710 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1305,11 +1305,18 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	} else {
 		exit_qual = tdexit_exit_qual(vcpu);
 		if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+			/*
+			 * Instruction fetch in TD from shared memory
+			 * causes a #PF.
+			 */
 			pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
 				tdexit_gpa(vcpu), kvm_rip_read(vcpu));
-			vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
-			vcpu->run->ex.exception = PF_VECTOR;
-			vcpu->run->ex.error_code = exit_qual;
+			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+			vcpu->run->internal.suberror =
+				KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+			vcpu->run->internal.ndata = 2;
+			vcpu->run->internal.data[0] = EXIT_REASON_EPT_VIOLATION;
+			vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
 			return 0;
 		}
 	}

Thank you

Reinette



2024-05-01 02:57:00

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module



On 1/05/2024 4:13 am, Sean Christopherson wrote:
> On Tue, Apr 30, 2024, Kai Huang wrote:
>> On 30/04/2024 8:06 am, Sean Christopherson wrote:
>>> My suggestion is essentially "throw in a CR4.VMXE check before
>>> TDH.SYS.LP.INIT if it's easy". If it's not easy for some reason, then don't do
>>> it.
>>
>> I see. The disconnect between us is that I am not super clear why we should
>> treat TDH.SYS.LP.INIT as a special one that deserves a CR4.VMXE check but
>> not other SEAMCALLs.
>
> Because TDH.SYS.LP.INIT is done on all CPUs via an IPI function call, is a one-
> time thing, and is at the intersection of core TDX and KVM module code, e.g.
> the core TDX code has an explicit assumption that:
>
> * This function assumes the caller has: 1) held read lock of CPU hotplug
> * lock to prevent any new cpu from becoming online; 2) done both VMXON
> * and tdx_cpu_enable() on all online cpus.

Yeah but from this perspective, both tdx_cpu_enable() and tdx_enable()
are "a one time thing" and "at the intersection of core TDX and KVM" :-)

But since tdx_cpu_enable() must be called in IRQ-disabled context, and
there's no possibility that other thread/code could mess up VMX enabling
after the CR4.VMXE check, it's fine to add such a check.

And looking again, in fact the comment of tdx_cpu_enable() doesn't
explicitly call out it requires the caller to do VMXON first (although
kinda implied by the comment of tdx_enable() as you quoted above).

I can add a patch to make it more clear by calling out in the comment of
tdx_cpu_enable() that it requires caller to do VMXON and adding a
WARN_ON_ONCE(!CR4.VMXE) check inside. I just don't know whether it is
worth doing at this stage, given it's not something mandatory and it
requires review time from maintainers. I can include such a patch in the
next KVM TDX patchset if you prefer so we can see how it goes.

>
> KVM can obviously screw up and attempt SEAMCALLs without being post-VMXON, but
> that's entirely a _KVM_ bug. And the probability of getting all the way to
> something like TDH_MEM_SEPT_ADD without being post-VMXON is comically low, e.g.
> KVM and/or the kernel would likely crash long before that point.

Yeah fully agree SEAMCALLs managed by KVM shouldn't need to do CR4.VMXE
check. I was talking about those involved in tdx_enable(), i.e.,
TDH.SYS.xxx.

2024-05-01 15:56:33

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

On Tue, Apr 30, 2024 at 01:47:07PM -0700,
Reinette Chatre <[email protected]> wrote:

> Hi Isaku,
>
> On 4/3/2024 11:42 AM, Isaku Yamahata wrote:
> > On Mon, Apr 01, 2024 at 12:10:58PM +0800,
> > Chao Gao <[email protected]> wrote:
> >
> >>> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >>> +{
> >>> + unsigned long exit_qual;
> >>> +
> >>> + if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
> >>> + /*
> >>> + * Always treat SEPT violations as write faults. Ignore the
> >>> + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
> >>> + * TD private pages are always RWX in the SEPT tables,
> >>> + * i.e. they're always mapped writable. Just as importantly,
> >>> + * treating SEPT violations as write faults is necessary to
> >>> + * avoid COW allocations, which will cause TDAUGPAGE failures
> >>> + * due to aliasing a single HPA to multiple GPAs.
> >>> + */
> >>> +#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
> >>> + exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
> >>> + } else {
> >>> + exit_qual = tdexit_exit_qual(vcpu);
> >>> + if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
> >>
> >> Unless the CPU has a bug, instruction fetch in TD from shared memory causes a
> >> #PF. I think you can add a comment for this.
> >
> > Yes.
> >
> >
> >> Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.
> >
> > Bug of what component? CPU. If so, I think KVM_EXIT_INTERNAL_ERROR +
> > KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is more appropriate.
> >
>
> Is below what you have in mind?

Yes. data[0] should be the raw value of the exit reason if possible.
data[2] should be exit_qual. Hmm, I can't find documentation on data[] for
KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON.
Qemu doesn't assume ndata = 2. Just report all data within ndata.


> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 499c6cd9633f..bd30b4c4d710 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1305,11 +1305,18 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> } else {
> exit_qual = tdexit_exit_qual(vcpu);
> if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
> + /*
> + * Instruction fetch in TD from shared memory
> + * causes a #PF.
> + */
> pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
> tdexit_gpa(vcpu), kvm_rip_read(vcpu));
> - vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
> - vcpu->run->ex.exception = PF_VECTOR;
> - vcpu->run->ex.error_code = exit_qual;
> + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> + vcpu->run->internal.suberror =
> + KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
> + vcpu->run->internal.ndata = 2;
> + vcpu->run->internal.data[0] = EXIT_REASON_EPT_VIOLATION;
> + vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
> return 0;
> }
> }
>
> Thank you
>
> Reinette
>
>
>

--
Isaku Yamahata <[email protected]>

2024-05-01 16:54:38

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

Hi Isaku,

On 5/1/2024 8:56 AM, Isaku Yamahata wrote:
> On Tue, Apr 30, 2024 at 01:47:07PM -0700,
> Reinette Chatre <[email protected]> wrote:
>> On 4/3/2024 11:42 AM, Isaku Yamahata wrote:
>>> On Mon, Apr 01, 2024 at 12:10:58PM +0800,
>>> Chao Gao <[email protected]> wrote:
>>>
>>>>> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> + unsigned long exit_qual;
>>>>> +
>>>>> + if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
>>>>> + /*
>>>>> + * Always treat SEPT violations as write faults. Ignore the
>>>>> + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
>>>>> + * TD private pages are always RWX in the SEPT tables,
>>>>> + * i.e. they're always mapped writable. Just as importantly,
>>>>> + * treating SEPT violations as write faults is necessary to
>>>>> + * avoid COW allocations, which will cause TDAUGPAGE failures
>>>>> + * due to aliasing a single HPA to multiple GPAs.
>>>>> + */
>>>>> +#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
>>>>> + exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
>>>>> + } else {
>>>>> + exit_qual = tdexit_exit_qual(vcpu);
>>>>> + if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
>>>>
>>>> Unless the CPU has a bug, instruction fetch in TD from shared memory causes a
>>>> #PF. I think you can add a comment for this.
>>>
>>> Yes.
>>>
>>>
>>>> Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.
>>>
>>> Bug of what component? CPU. If so, I think KVM_EXIT_INTERNAL_ERROR +
>>> KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is more appropriate.
>>>
>>
>> Is below what you have in mind?
>
> Yes. data[0] should be the raw value of the exit reason if possible.
> data[2] should be exit_qual. Hmm, I can't find documentation on data[] for

Did you perhaps intend to write "data[1] should be exit_qual" or would you
like to see ndata = 3? I followed existing usages, for example [1] and [2],
that have ndata = 2 with "data[1] = vcpu->arch.last_vmentry_cpu".

> KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON.
> Qemu doesn't assume ndata = 2. Just report all data within ndata.

I am not sure I interpreted your response correctly so I share one possible
snippet below as I interpret it. Could you please check where I misinterpreted
you? I could also make ndata = 3 to break the existing convention and add
"data[2] = vcpu->arch.last_vmentry_cpu" to match existing pattern. What do you
think?

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 499c6cd9633f..ba81e6f68c97 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1305,11 +1305,20 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	} else {
 		exit_qual = tdexit_exit_qual(vcpu);
 		if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+			union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+			/*
+			 * Instruction fetch in TD from shared memory
+			 * causes a #PF.
+			 */
 			pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
 				tdexit_gpa(vcpu), kvm_rip_read(vcpu));
-			vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
-			vcpu->run->ex.exception = PF_VECTOR;
-			vcpu->run->ex.error_code = exit_qual;
+			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+			vcpu->run->internal.suberror =
+				KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+			vcpu->run->internal.ndata = 2;
+			vcpu->run->internal.data[0] = exit_reason.full;
+			vcpu->run->internal.data[1] = exit_qual;
 			return 0;
 		}
 	}

Reinette

[1] https://github.com/kvm-x86/linux/blob/next/arch/x86/kvm/vmx/vmx.c#L6587
[2] https://github.com/kvm-x86/linux/blob/next/arch/x86/kvm/svm/svm.c#L3436

2024-05-01 18:19:43

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

On Wed, May 01, 2024 at 09:54:07AM -0700,
Reinette Chatre <[email protected]> wrote:

> Hi Isaku,
>
> On 5/1/2024 8:56 AM, Isaku Yamahata wrote:
> > On Tue, Apr 30, 2024 at 01:47:07PM -0700,
> > Reinette Chatre <[email protected]> wrote:
> >> On 4/3/2024 11:42 AM, Isaku Yamahata wrote:
> >>> On Mon, Apr 01, 2024 at 12:10:58PM +0800,
> >>> Chao Gao <[email protected]> wrote:
> >>>
> >>>>> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >>>>> +{
> >>>>> + unsigned long exit_qual;
> >>>>> +
> >>>>> + if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
> >>>>> + /*
> >>>>> + * Always treat SEPT violations as write faults. Ignore the
> >>>>> + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
> >>>>> + * TD private pages are always RWX in the SEPT tables,
> >>>>> + * i.e. they're always mapped writable. Just as importantly,
> >>>>> + * treating SEPT violations as write faults is necessary to
> >>>>> + * avoid COW allocations, which will cause TDAUGPAGE failures
> >>>>> + * due to aliasing a single HPA to multiple GPAs.
> >>>>> + */
> >>>>> +#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
> >>>>> + exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
> >>>>> + } else {
> >>>>> + exit_qual = tdexit_exit_qual(vcpu);
> >>>>> + if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
> >>>>
> >>>> Unless the CPU has a bug, instruction fetch in TD from shared memory causes a
> >>>> #PF. I think you can add a comment for this.
> >>>
> >>> Yes.
> >>>
> >>>
> >>>> Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.
> >>>
> >>> Bug of what component? CPU. If so, I think KVM_EXIT_INTERNAL_ERROR +
> >>> KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is more appropriate.
> >>>
> >>
> >> Is below what you have in mind?
> >
> > Yes. data[0] should be the raw value of exit reason if possible.
> > data[2] should be exit_qual. Hmm, I can't find documentation on data[] for
>
> Did you perhaps intend to write "data[1] should be exit_qual" or would you
> like to see ndata = 3? I followed existing usages, for example [1] and [2],
> that have ndata = 2 with "data[1] = vcpu->arch.last_vmentry_cpu".
>
> > KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON.
> > Qemu doesn't assume ndata = 2. Just report all data within ndata.
>
> I am not sure I interpreted your response correctly so I share one possible
> snippet below as I interpret it. Could you please check where I misinterpreted
> you? I could also make ndata = 3 to break the existing convention and add
> "data[2] = vcpu->arch.last_vmentry_cpu" to match existing pattern. What do you
> think?
>
Sorry, I wasn't clear enough. I meant:

	ndata = 3;
	data[0] = exit_reason.full;
	data[1] = vcpu->arch.last_vmentry_cpu;
	data[2] = exit_qual;

Because I hesitate to change the meaning of data[1] from other usage, I
appended exit_qual as data[2].

--
Isaku Yamahata <[email protected]>

2024-05-01 18:22:52

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v19 101/130] KVM: TDX: handle ept violation/misconfig exit

Hi Isaku,

On 5/1/2024 11:19 AM, Isaku Yamahata wrote:
> On Wed, May 01, 2024 at 09:54:07AM -0700,
> Reinette Chatre <[email protected]> wrote:
>
>> Hi Isaku,
>>
>> On 5/1/2024 8:56 AM, Isaku Yamahata wrote:
>>> On Tue, Apr 30, 2024 at 01:47:07PM -0700,
>>> Reinette Chatre <[email protected]> wrote:
>>>> On 4/3/2024 11:42 AM, Isaku Yamahata wrote:
>>>>> On Mon, Apr 01, 2024 at 12:10:58PM +0800,
>>>>> Chao Gao <[email protected]> wrote:
>>>>>
>>>>>>> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>>>>>>> +{
>>>>>>> + unsigned long exit_qual;
>>>>>>> +
>>>>>>> + if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
>>>>>>> + /*
>>>>>>> + * Always treat SEPT violations as write faults. Ignore the
>>>>>>> + * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
>>>>>>> + * TD private pages are always RWX in the SEPT tables,
>>>>>>> + * i.e. they're always mapped writable. Just as importantly,
>>>>>>> + * treating SEPT violations as write faults is necessary to
>>>>>>> + * avoid COW allocations, which will cause TDAUGPAGE failures
>>>>>>> + * due to aliasing a single HPA to multiple GPAs.
>>>>>>> + */
>>>>>>> +#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
>>>>>>> + exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
>>>>>>> + } else {
>>>>>>> + exit_qual = tdexit_exit_qual(vcpu);
>>>>>>> + if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
>>>>>>
>>>>>> Unless the CPU has a bug, instruction fetch in TD from shared memory causes a
>>>>>> #PF. I think you can add a comment for this.
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>> Maybe KVM_BUG_ON() is more appropriate as it signifies a potential bug.
>>>>>
>>>>> Bug of what component? CPU. If so, I think KVM_EXIT_INTERNAL_ERROR +
>>>>> KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON is more appropriate.
>>>>>
>>>>
>>>> Is below what you have in mind?
>>>
>>> Yes. data[0] should be the raw value of exit reason if possible.
>>> data[2] should be exit_qual. Hmm, I don't find document on data[] for
>>
>> Did you perhaps intend to write "data[1] should be exit_qual" or would you
>> like to see ndata = 3? I followed existing usages, for example [1] and [2],
>> that have ndata = 2 with "data[1] = vcpu->arch.last_vmentry_cpu".
>>
>>> KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON.
>>> Qemu doesn't assumt ndata = 2. Just report all data within ndata.
>>
>> I am not sure I interpreted your response correctly so I share one possible
>> snippet below as I interpret it. Could you please check where I misinterpreted
>> you? I could also make ndata = 3 to break the existing custom and add
>> "data[2] = vcpu->arch.last_vmentry_cpu" to match existing pattern. What do you
>> think?
>>
> Sorry, I wasn't clear enough. I meant
> ndata = 3;
> data[0] = exit_reason.full;
> data[1] = vcpu->arch.last_vmentry_cpu;
> data[2] = exit_qual;
>
> Because I hesitate to change the meaning of data[1] from other usage, I
> appended exit_qual as data[2].

I understand it now. Thank you very much Isaku.

Reinette