2024-05-30 21:07:48

by Edgecombe, Rick P

Subject: [PATCH v2 00/15] TDX MMU prep series part 1

Hi,

This is v2 of the TDX MMU prep series, split out of the giant 130 patch
TDX base enabling series [0]. It focuses on the changes to the x86 MMU
to support TDX’s separation of private/shared EPT into separate roots. A
future breakout series will include the changes that actually interact with
the TDX module to map private memory. The purpose of sending out a smaller
series is to focus review, and hopefully iterate rapidly. We
would like the series to go into kvm-coco-queue when it is ready.

I think the maturity of these patches has significantly improved during
the recent reviews. I expect it still needs a little more work, but
think that the basic structure is in decent shape at this point. Please
consider it from the perspective of what is missing for inclusion in
kvm-coco-queue.

There is a larger team working on TDX KVM base enabling. The patches were
originally authored by Sean Christopherson and Isaku Yamahata, but
otherwise the series primarily represents the work of Isaku, Yan Y Zhao,
and myself.

The series has been tested as part of a development branch for the TDX base
series [1]. The testing of this series consists of TDX kvm-unit-tests [2],
regular KVM selftests, and booting a Linux TD. Rebasing the TDX selftests
is still in progress.

Updates from v1
===============
Most of the changes for v2 are around code clarity. At a high level:

- Previously the mirror changes (described below) had a bunch of "private"
naming in the generic MMU code. These have been renamed to "mirror". The
connection between TDX private memory and mirrored memory is made inside
mmu.h helpers. Now the generic MMU code simply selects the mirror root when
performing private operations (faults, invalidation, etc). The TDX
seamcalls to update the S-EPT are hidden inside the x86_ops, so the
special TDX private parts don't pollute the concept for SEV.

To do this, the roots enum is changed to be about "mirror" and "direct"
instead of "private" and "shared". The process enum retains the
private/shared concepts, and TDX-specific logic maps between the two. So
there is now a little more purpose in having two enums.

- The gfn_shared_mask is now gfn_direct_mask, and it simply applies the
mask when operating on the direct root (i.e. a fault targeting a “gfn”
on a normal “direct” root gets faulted in at gfn|gfn_direct_mask). Like
the private memory connection, the connection between the normal
"direct" root and shared memory is hidden inside mmu.h helpers.

- The shared bit has been removed from all of the gfn_t's. This is
achieved by applying the gfn_direct_mask from within the TDP MMU
iterator. It prevents more of the MMU code from having to take special
care around GFNs.

- There also has been some re-organization and patch splitting to aid in
reviewability.

- Log improvements

The quirk for zapping all PTEs on memslot deletion will come as a separate
series. The change to make kvm_mmu_max_gfn() configurable will also come
separately. The kvm_zap_gfn_range() KVM_BUG_ON() patch is dropped.

Private/shared memory in TDX
============================
Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a bit
in the PTE. Solutions like SEV treat this bit more like a permission bit,
whereas solutions like TDX and ARM CCA treat it more like a GPA bit. In the
latter case, the host maps private memory in one half of the address space
and shared in another. For TDX these two halves are mapped by different
EPT roots. The private half (also called Secure EPT in Intel
documentation) gets managed by the privileged TDX Module. The shared half
is managed by the untrusted part of the VMM (KVM).

In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. TDX wants
to protect against private memory being reset or otherwise scrambled by
the host. In order to prevent this, the guest has to take specific action
to “accept” memory after changes are made by the VMM to the private EPT.
This prevents the VMM from performing many of the usual memory management
operations that involve zapping and refaulting memory. The private memory
is also always RWX and cannot have VMM-specified cache attributes
applied.

TDX KVM MMU Design For Private Memory
=====================================

Private/shared split
--------------------
The operations that actually change the private half of the EPT are
limited and relatively slow compared to reading a PTE. For this reason the
design for KVM is to keep a “mirrored” copy of the private EPT in KVM’s
memory. This will allow KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.

To clarify the definitions of the three EPT trees at this point:
  private EPT - Protected by the TDX module, modified via TDX module calls.
  mirror EPT  - Bookkeeping tree used as an optimization by KVM, not mapped.
  shared EPT  - Normal EPT that maps unencrypted shared memory. Managed like
                the EPT of a normal VM.

It’s worth noting that we are making an effort to remove optimizations
that add complexity to the base enabling. Although keeping a mirrored
copy of the private page tables kind of fits into that category, it has
been so fundamental to the design for so long that dropping it would be
too disruptive.

Mirror EPT
----------
The mirror EPT needs to keep a mirrored version of the private EPT
maintained in the TDX module in order to be able to find out whether a
GPA’s mid-level page tables have already been installed. So this mirrored
copy has the same structure as the private EPT, having a page table
present in the mirror EPT for every GPA range and level where a page table
is present in the private EPT. The private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

Modifications to the mirrored page tables need to also perform the same
operations to the private page tables. The actual TDX module calls to do
this are not covered in this prep series.

SPs for private page tables are tracked with a role bit out of
convenience. (Note to reviewers, please consider whether this is really
needed.)

Zapping Changes
---------------
For normal VMs, guest memory is zapped for several reasons: userspace
memory getting paged out by the host, memslots getting deleted, or
virtualization operations like MTRR changes and attachment of non-coherent
DMA. For TDX (and SNP) there is also zapping associated with the
conversion of memory between shared and private. These operations need to
take care to
do two things:
1. Not zap any private memory that is in use by the guest.
2. Not zap any memory alias unnecessarily (i.e. don't zap anything more
than needed). The purpose of this is to avoid unnecessary behavior that
userspace could grow to rely on.

For 1, this is possible because the zapping that is out of the control of
KVM/userspace (paging out of userspace memory) will only apply to shared
memory. Guest mem fd operations are protected from mmu notifier
operations. During TD runtime, zapping of private memory will only be from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.

For 2, KVM needs to be taught which operations will operate on which
aliases. An enum based scheme is introduced such that operations can
target specific aliases like:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
MTRRs, etc - Zapping will be avoided altogether
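
For reference, the enum added in the first patch looks like the below.
Callers tag each struct kvm_gfn_range with the alias(es) an operation
applies to, and the default value (0) is deliberately left invalid so
that ranges with a forgotten target alias can be caught:

	enum kvm_process {
		BUGGY_KVM_INVALIDATION		= 0,
		KVM_PROCESS_SHARED		= BIT(0),
		KVM_PROCESS_PRIVATE		= BIT(1),
		KVM_PROCESS_PRIVATE_AND_SHARED	= KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE,
	};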

For zapping arising from other virtualization based operations, there are
four scenarios:
1. MTRR update
2. CR0.CD update
3. APICv update
4. Non-coherent DMA status update

KVM TDX will not support 1-3; future changes (after this series) will
prevent these features from being enabled for TDX. For 4, there isn’t an
easy way to not support the feature, as the notification is just passed to
KVM and it has to act accordingly. However, other proposed changes [3]
will avoid the need for zapping on non-coherent DMA notification for
selfsnoop CPUs. So KVM can follow this logic and just always honor guest
PAT for shared memory.

Atomically updating private EPT
-------------------------------
Although this prep series does not interact with the TDX module at all to
actually configure the private EPT, it does lay the groundwork for doing
this. In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module, but there is one
tricky property that is worth elaborating on: how to handle the fact that
the TDP MMU allows modification of PTEs with the mmu_lock held only for
read and uses the PTEs themselves to perform synchronization.

Unfortunately while operating on a single PTE can be done atomically,
operating on both the mirrored and private PTEs at the same time requires
an additional solution. To handle this situation, REMOVED_SPTE is used to
prevent concurrent operations while a call to the TDX module updates the
prevent concurrent operations while a call to the TDX module updates the
private EPT.
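
To make the intended flow concrete, below is a minimal sketch of setting a
mirror PTE (illustrative only; update_private_ept() is a placeholder for
the TDX module call, the real hooks are the reflect_* x86_ops introduced
in patch 10, and the actual plumbing comes in later TDX patches):

	/* Freeze the mirror PTE so concurrent writers back off and retry. */
	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte,
			   REMOVED_SPTE))
		return -EBUSY;

	/* Propagate the change to the real private EPT via the TDX module. */
	ret = update_private_ept(kvm, iter->gfn, iter->level, new_spte);
	if (ret) {
		/* Unfreeze and restore the old value on failure. */
		__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
		return ret;
	}

	/* Only now publish the real value in the mirror page table. */
	__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);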

The series is based on kvm-coco-queue.

[0] https://lore.kernel.org/kvm/[email protected]/
[1] https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-05-30
[2] https://lore.kernel.org/kvm/[email protected]/
[3] https://lore.kernel.org/kvm/[email protected]/

Isaku Yamahata (12):
KVM: Add member to struct kvm_gfn_range for target alias
KVM: x86/mmu: Add a mirrored pointer to struct kvm_mmu_page
KVM: x86/mmu: Add a new mirror_pt member for union kvm_mmu_page_role
KVM: x86/mmu: Support GFN direct mask
KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table
type
KVM: x86/tdp_mmu: Support mirror root for TDP MMU
KVM: x86/tdp_mmu: Reflect building mirror page tables
KVM: x86/tdp_mmu: Reflect tearing down mirror page tables
KVM: x86/tdp_mmu: Take root types for
kvm_tdp_mmu_invalidate_all_roots()
KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process
KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

Rick Edgecombe (2):
KVM: x86: Add a VM type define for TDX
KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void

Sean Christopherson (1):
KVM: x86/tdp_mmu: Invalidate correct roots

arch/x86/include/asm/kvm-x86-ops.h | 4 +
arch/x86/include/asm/kvm_host.h | 45 +++-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu.h | 28 +++
arch/x86/kvm/mmu/mmu.c | 37 ++-
arch/x86/kvm/mmu/mmu_internal.h | 68 +++++-
arch/x86/kvm/mmu/spte.h | 5 +
arch/x86/kvm/mmu/tdp_iter.c | 5 +-
arch/x86/kvm/mmu/tdp_iter.h | 16 +-
arch/x86/kvm/mmu/tdp_mmu.c | 352 +++++++++++++++++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 45 +++-
arch/x86/kvm/x86.c | 10 +
include/linux/kvm_host.h | 8 +
virt/kvm/guest_memfd.c | 2 +
virt/kvm/kvm_main.c | 14 ++
15 files changed, 537 insertions(+), 103 deletions(-)


base-commit: 698ca1e403579ca00e16a5b28ae4d576d9f1b20e
--
2.34.1



2024-05-30 21:08:02

by Edgecombe, Rick P

Subject: [PATCH v2 01/15] KVM: Add member to struct kvm_gfn_range for target alias

From: Isaku Yamahata <[email protected]>

Add a new member to struct kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_process process.
Update the core zapping operations to set it appropriately.

TDX utilizes two GPA aliases for the same memslots, one for private
memory and one for shared memory. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add fields in struct kvm_gfn_range to allow
callers to specify which aliases so they can only target the aliases
appropriate for their specific operation.

There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered. Several variations of separate bools, defined such that the
default behavior was to process both aliases, either allowed nonsensical
configurations or were confusing for the caller. A simple enum was also
explored and was close, but was hard to process in the caller. Instead,
use an enum with the default value (0) reserved as a disallowed value.
Catch ranges that didn't have the target aliases specified by looking for
that specific value.
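
As an illustration, a consumer of the range can then catch a missing
assignment with something like the below (the actual checks are added in
the TDP MMU patches later in the series):

	/* Operating on a range with no target alias set is a KVM bug. */
	WARN_ON_ONCE(range->process == BUGGY_KVM_INVALIDATION);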

Set target alias with enum appropriately for these MMU operations:
- For KVM's mmu notifier callbacks, zap shared pages only because private
pages won't have a userspace mapping
- For setting memory attributes, kvm_arch_pre_set_memory_attributes()
chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.

Link: https://lore.kernel.org/kvm/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep:
- Replaced KVM_PROCESS_BASED_ON_ARG with BUGGY_KVM_INVALIDATION to follow
the original suggestion and not populate kvm_handle_gfn_range(). And add
WARN_ON_ONCE().
- Move attribute specific logic into kvm_vm_set_mem_attributes()
- Drop Sean's suggested-by tag as the solution has changed
- Re-write commit log

v18:
- rebased to kvm-next

v3:
- Drop the KVM_GFN_RANGE flags
- Updated struct kvm_gfn_range
- Change kvm_arch_set_memory_attributes() to return bool for flush
- Added set_memory_attributes x86 op for vendor backends
- Refined commit message to describe TDX care concretely

v2:
- consolidate KVM_GFN_RANGE_FLAGS_GMEM_{PUNCH_HOLE, RELEASE} into
KVM_GFN_RANGE_FLAGS_GMEM.
- Update the commit message to describe TDX more. Drop SEV_SNP.
---
arch/x86/kvm/mmu/mmu.c | 6 ++++++
include/linux/kvm_host.h | 8 ++++++++
virt/kvm/guest_memfd.c | 2 ++
virt/kvm/kvm_main.c | 14 ++++++++++++++
4 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61982da8c8b2..b97241945596 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7451,6 +7451,12 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
return false;

+ /* Unmap the old attribute page. */
+ if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+ range->process = KVM_PROCESS_SHARED;
+ else
+ range->process = KVM_PROCESS_PRIVATE;
+
return kvm_unmap_gfn_range(kvm, range);
}

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c3c922bf077f..f92c8b605b03 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,11 +260,19 @@ union kvm_mmu_notifier_arg {
unsigned long attributes;
};

+enum kvm_process {
+ BUGGY_KVM_INVALIDATION = 0,
+ KVM_PROCESS_SHARED = BIT(0),
+ KVM_PROCESS_PRIVATE = BIT(1),
+ KVM_PROCESS_PRIVATE_AND_SHARED = KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE,
+};
+
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+ enum kvm_process process;
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9714add38852..e5ff6fde2db3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -109,6 +109,8 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
+ /* guest memfd is relevant to only private mappings. */
+ .process = KVM_PROCESS_PRIVATE,
};

if (!found_memslot) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 81b90bf03f2f..4ff137058cec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,11 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
*/
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
+ /*
+ * HVA-based notifications aren't relevant to private
+ * mappings as they don't have a userspace mapping.
+ */
+ gfn_range.process = KVM_PROCESS_SHARED;

/*
* {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -2450,6 +2455,14 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;

+ /*
+ * If/when KVM supports more attributes beyond private .vs shared, this
+ * _could_ set exclude_{private,shared} appropriately if the entire target
+ * range already has the desired private vs. shared state (it's unclear
+ * if that is a net win). For now, KVM reaches this point if and only
+ * if the private flag is being toggled, i.e. all mappings are in play.
+ */
+
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);

@@ -2506,6 +2519,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
.end = end,
+ .arg.attributes = attributes,
.handler = kvm_pre_set_memory_attributes,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
--
2.34.1


2024-05-30 21:08:13

by Edgecombe, Rick P

Subject: [PATCH v2 02/15] KVM: x86: Add a VM type define for TDX

Add a VM type define for TDX.

Future changes will need to lay the ground work for TDX support by
making some behavior conditional on the VM being a TDX guest.

Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep:
- New patch, split from main series
---
arch/x86/include/uapi/asm/kvm.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 988b5204d636..4dea0cfeee51 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -922,5 +922,6 @@ struct kvm_hyperv_eventfd {
#define KVM_X86_SEV_VM 2
#define KVM_X86_SEV_ES_VM 3
#define KVM_X86_SNP_VM 4
+#define KVM_X86_TDX_VM 5

#endif /* _ASM_X86_KVM_H */
--
2.34.1


2024-05-30 21:08:25

by Edgecombe, Rick P

Subject: [PATCH v2 03/15] KVM: x86/mmu: Add a mirrored pointer to struct kvm_mmu_page

From: Isaku Yamahata <[email protected]>

Add a mirrored pointer to struct kvm_mmu_page for the private page table and
add helper functions to allocate/initialize/free a private page table page.
Because KVM TDP MMU doesn't use unsync_children and write_flooding_count,
pack them to have room for a pointer and use a union to avoid memory
overhead.

For private GPAs, the CPU refers to a private page table whose contents
are encrypted. Dedicated APIs have to be used to operate on it (e.g.
updating/reading a PTE), and they are expensive.

When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate one more page for the mirrored page table
for the KVM MMU code to directly walk. Resolve the KVM page fault with
the existing code, and do additional operations necessary for the private
page table. To distinguish such cases, the existing KVM page table is
called a shared page table (i.e., not associated with a private page
table), and the page table with a private page table is called a mirrored
page table. The relationship is depicted below.

              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
   shared GPA             private GPA            |
        |                      |                 |
        V                      V                 |
 shared PT root         mirror PT root           |    private PT root
        |                      |                 |           |
        V                      V                 |           V
    shared PT              mirror PT --propagate--> private/mirrored PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        |                                        |      |    |
        V                                        |      V    V
shared guest page                                |  private guest page
                                                 |
    non-encrypted memory                         |    encrypted memory
                                                 |
PT: Page table
Shared PT: visible to KVM, and the CPU uses it for shared mappings.
Private/mirrored PT: the CPU uses it, but it is invisible to KVM. TDX
module updates this table to map private guest pages.
Mirror PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
to propagate PT change to the actual private PT.

Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.

Co-developed-by: Yan Zhao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
TDX MMU Prep v2:
- Rename private->mirror
- Don't trigger off of shared mask

TDX MMU Prep:
- Rename terminology, dummy PT => mirror PT. and updated the commit message
By Rick and Kai.
- Added a comment on union of private_spt by Rick.
- Don't handle the root case in kvm_mmu_alloc_private_spt(), it will not
be needed in future patches. (Rick)
- Update comments (Yan)
- Remove kvm_mmu_init_private_spt(), open code it in later patches (Yan)

v19:
- typo in the comment in kvm_mmu_alloc_private_spt()
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 5 ++++
arch/x86/kvm/mmu.h | 5 ++++
arch/x86/kvm/mmu/mmu.c | 7 +++++
arch/x86/kvm/mmu/mmu_internal.h | 47 ++++++++++++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 1 +
5 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..250899a0239b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -817,6 +817,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * This cache is to allocate private page table. E.g. private EPT used
+ * by the TDX module.
+ */
+ struct kvm_mmu_memory_cache mmu_mirrored_spt_cache;

/*
* QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index dc80e72e4848..0c3bf89cf7db 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -318,4 +318,9 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return gpa;
return translate_nested_gpa(vcpu, gpa, access, exception);
}
+
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b97241945596..5070ba7c6e89 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
+ if (kvm_has_mirrored_tdp(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_mirrored_spt_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
@@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_mirrored_spt_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 706f0ce8784c..faef40a561f9 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -101,7 +101,22 @@ struct kvm_mmu_page {
int root_count;
refcount_t tdp_mmu_root_count;
};
- unsigned int unsync_children;
+ union {
+ /* Those two members aren't used for TDP MMU */
+ struct {
+ unsigned int unsync_children;
+ /*
+ * Number of writes since the last time traversal
+ * visited this page.
+ */
+ atomic_t write_flooding_count;
+ };
+ /*
+ * Page table page of private PT.
+ * Passed to TDX module, not accessed by KVM.
+ */
+ void *mirrored_spt;
+ };
union {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
@@ -124,9 +139,6 @@ struct kvm_mmu_page {
int clear_spte_count;
#endif

- /* Number of writes since the last time traversal visited this page. */
- atomic_t write_flooding_count;
-
#ifdef CONFIG_X86_64
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
@@ -145,6 +157,33 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}

+static inline void *kvm_mmu_mirrored_spt(struct kvm_mmu_page *sp)
+{
+ return sp->mirrored_spt;
+}
+
+static inline void kvm_mmu_alloc_mirrored_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ /*
+ * mirrored_spt is allocated for TDX module to hold private EPT mappings,
+ * TDX module will initialize the page by itself.
+ * Therefore, KVM does not need to initialize or access mirrored_spt.
+ * KVM only interacts with sp->spt for mirrored EPT operations.
+ */
+ sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_mirrored_spt_cache);
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ /*
+ * private_spt is allocated for TDX module to hold private EPT mappings,
+ * TDX module will initialize the page by itself.
+ * Therefore, KVM does not need to initialize or access private_spt.
+ * KVM only interacts with sp->spt for mirrored EPT operations.
+ */
+ sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_mirrored_spt_cache);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1259dd63defc..e7cd4921afe7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)

static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
+ free_page((unsigned long)sp->mirrored_spt);
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
--
2.34.1


2024-05-30 21:08:37

by Edgecombe, Rick P

Subject: [PATCH v2 04/15] KVM: x86/mmu: Add a new mirror_pt member for union kvm_mmu_page_role

From: Isaku Yamahata <[email protected]>

Introduce a "mirror_pt" member to the kvm_mmu_page_role union to identify
SPTEs associated with the mirrored EPT.

The TDX module maintains the private half of the EPT mapped in the TD in
its protected memory. KVM keeps a copy of the private GPAs in a mirrored
EPT tree within host memory. This "mirror_pt" attribute enables vCPUs to
find and get the root page of mirrored EPT from the MMU root list for a
guest TD. This also allows KVM MMU code to detect changes in mirrored EPT
according to the "mirror_pt" mmu page role and propagate the changes to
the private EPT managed by the TDX module.

Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Rename private -> mirrored

TDX MMU Prep:
- Remove warning and NULL check in is_private_sptep() (Rick)
- Update commit log (Yan)

v19:
- Fix is_private_sptep() when NULL case.
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 13 ++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 5 +++++
arch/x86/kvm/mmu/spte.h | 5 +++++
3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 250899a0239b..084f4708aff1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -351,7 +351,8 @@ union kvm_mmu_page_role {
unsigned ad_disabled:1;
unsigned guest_mode:1;
unsigned passthrough:1;
- unsigned :5;
+ unsigned mirror_pt:1;
+ unsigned :4;

/*
* This is left at the top of the word so that
@@ -363,6 +364,16 @@ union kvm_mmu_page_role {
};
};

+static inline bool kvm_mmu_page_role_is_mirror(union kvm_mmu_page_role role)
+{
+ return !!role.mirror_pt;
+}
+
+static inline void kvm_mmu_page_role_set_mirrored(union kvm_mmu_page_role *role)
+{
+ role->mirror_pt = 1;
+}
+
/*
* kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
* relevant to the current MMU configuration. When loading CR0, CR4, or EFER,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index faef40a561f9..6d82e389cd65 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -157,6 +157,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}

+static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
+{
+ return kvm_mmu_page_role_is_mirror(sp->role);
+}
+
static inline void *kvm_mmu_mirrored_spt(struct kvm_mmu_page *sp)
{
return sp->mirrored_spt;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5dd5405fa07a..b3c065280ba1 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -265,6 +265,11 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}

+static inline bool is_mirror_sptep(u64 *sptep)
+{
+ return is_mirror_sp(sptep_to_sp(sptep));
+}
+
static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
--
2.34.1


2024-05-30 21:09:17

by Edgecombe, Rick P

Subject: [PATCH v2 05/15] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void

The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
allows for the caller, mmu_alloc_direct_roots(), to call
kvm_tdp_mmu_alloc_root() and also return 0 in one line:
return kvm_tdp_mmu_alloc_root(vcpu);

So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
will be called twice in mmu_alloc_direct_roots(). This will force the
first call to either awkwardly handle the return value that will always
be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.
Do it in a separate change so the future change will be cleaner.
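
For illustration, with the mirror root support from later patches the
caller is expected to end up looking roughly like the below (the extra
argument selecting the mirror root is introduced later in the series, so
treat the exact signature as an assumption):

	if (tdp_mmu_enabled) {
		if (kvm_has_mirrored_tdp(vcpu->kvm))
			kvm_tdp_mmu_alloc_root(vcpu, true);	/* mirror root */
		kvm_tdp_mmu_alloc_root(vcpu, false);		/* direct root */
		return 0;
	}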

Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 6 ++++--
arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5070ba7c6e89..12178945922f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3700,8 +3700,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
unsigned i;
int r;

- if (tdp_mmu_enabled)
- return kvm_tdp_mmu_alloc_root(vcpu);
+ if (tdp_mmu_enabled) {
+ kvm_tdp_mmu_alloc_root(vcpu);
+ return 0;
+ }

write_lock(&vcpu->kvm->mmu_lock);
r = make_mmu_pages_available(vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e7cd4921afe7..2770230a5636 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}

-int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
@@ -285,7 +285,6 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
*/
mmu->root.hpa = __pa(root->spt);
mmu->root.pgd = 0;
- return 0;
}

static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 58b55e61bd33..437ddd4937a9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);

-int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);

__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
--
2.34.1


2024-05-30 21:09:44

by Edgecombe, Rick P

Subject: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

From: Isaku Yamahata <[email protected]>

Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid
in implementing TDX shared memory.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.

For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.

So KVM will either need to apply or strip the shared bit before mapping or
zapping the shared EPT. Having GFNs sometimes have the shared bit and
sometimes not would make the code confusing.

So instead arrange the code such that GFNs never have the shared bit set.
Create a concept of a "direct mask" that is stripped from the fault
address when setting fault->gfn, and applied within the TDP MMU iterator.
Calling code will behave as if it is operating on the PTE mapping the GFN
(without shared bits) but within the iterator, the actual mappings will be
shifted using a mask specific to the root. SPs will have their gfn set
without the shared bit. In the end the TDP MMU will behave like it is
mapping things at the GFN without the shared bit but with a strange page
table format where everything is offset by the shared bit.
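
As a concrete illustration (the shared bit position below is made up for
the example; the real value is set by TD-specific initialization code):

	/*
	 * Suppose the TD's shared bit is GPA bit 47, so
	 * kvm->arch.gfn_direct_mask == BIT_ULL(47 - PAGE_SHIFT).
	 *
	 * A shared access to GPA 0x1234000 then faults with:
	 *   fault->addr = 0x1234000 | BIT_ULL(47)
	 *   fault->gfn  = 0x1234	(mask stripped, used for memslot lookup)
	 *
	 * Only inside the TDP MMU iterator is the mask applied again, so the
	 * mapping lands in the shared half of the direct root:
	 *   SPTE_INDEX((iter->gfn | iter->gfn_mask) << PAGE_SHIFT, level)
	 */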

Since TDX only needs to shift the mapping like this for the shared bit,
which is mapped as the normal TDP root, add a "gfn_direct_mask" field to
the kvm_arch structure for each VM with a default value of 0. It will be
set to the position of the GPA shared bit in GFN through TD specific
initialization code. Keep TDX specific concepts out of the MMU code by not
naming it "shared".

Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN
ranges. In the convention established above, these would need to target the
shifted GFN range. It won't matter functionally, since the actual
implementation will always result in a full flush for the only planned
user (TDX). But for code clarity, explicitly do the full flush when a
gfn_direct_mask is present.

This leaves one drawback. Some operations use a concept of max gfn (i.e.
kvm_mmu_max_gfn()), to iterate over the whole TDP range. These would then
exceed the range actually covered by each root. It should only result in a
bit of extra iterating, and not cause functional problems. This will be
addressed in a future change.

Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Rename from "KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA"
- Dropped Binbin's reviewed-by tag because of the extend of the changes
- Rename gfn_shared_mask to gfn_direct_mask.
- Don't include shared bits in GFNs, hide the existence in the TDP MMU
iterator.
- Don't do range flushes if a gfn_direct_mask is present.
---
arch/x86/include/asm/kvm_host.h | 11 +++------
arch/x86/kvm/mmu.h | 5 ++++
arch/x86/kvm/mmu/mmu_internal.h | 16 +++++++++++-
arch/x86/kvm/mmu/tdp_iter.c | 5 ++--
arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++------
arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++-----------------
arch/x86/kvm/x86.c | 10 ++++++++
7 files changed, 66 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 084f4708aff1..c9af963ab897 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1535,6 +1535,8 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+ gfn_t gfn_direct_mask;
};

struct kvm_vm_stat {
@@ -1908,14 +1910,7 @@ static inline int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
}

#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
-static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
- u64 nr_pages)
-{
- if (!kvm_x86_ops.flush_remote_tlbs_range)
- return -EOPNOTSUPP;
-
- return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
-}
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
#endif /* CONFIG_HYPERV */

enum kvm_intr_type {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0c3bf89cf7db..f0713b6e4ee5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -323,4 +323,9 @@ static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
{
return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
+
+static inline gfn_t kvm_gfn_direct_mask(const struct kvm *kvm)
+{
+ return kvm->arch.gfn_direct_mask;
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6d82e389cd65..076871c9e694 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@
#include <linux/kvm_host.h>
#include <asm/kvm_host.h>

+#include "mmu.h"
+
#ifdef CONFIG_KVM_PROVE_MMU
#define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
#else
@@ -189,6 +191,13 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_mirrored_spt_cache);
}

+static inline gfn_t kvm_gfn_root_mask(const struct kvm *kvm, const struct kvm_mmu_page *root)
+{
+ if (is_mirror_sp(root))
+ return 0;
+ return kvm_gfn_direct_mask(kvm);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
@@ -359,7 +368,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
int r;

if (vcpu->arch.mmu->root_role.direct) {
- fault.gfn = fault.addr >> PAGE_SHIFT;
+ /*
+ * Things like memslots don't understand the concept of a shared
+ * bit. Strip it so that the GFN can be used like normal, and the
+ * fault.addr can be used when the shared bit is needed.
+ */
+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_mask(vcpu->kvm);
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..a3bfe7fe473a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
- SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+ SPTE_INDEX((iter->gfn | iter->gfn_mask) << PAGE_SHIFT, iter->level);
iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}

@@ -37,7 +37,7 @@ void tdp_iter_restart(struct tdp_iter *iter)
* rooted at root_pt, starting with the walk to translate next_last_level_gfn.
*/
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn)
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_mask)
{
if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
(root->role.level > PT64_ROOT_MAX_LEVEL))) {
@@ -46,6 +46,7 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
}

iter->next_last_level_gfn = next_last_level_gfn;
+ iter->gfn_mask = gfn_mask;
iter->root_level = root->role.level;
iter->min_level = min_level;
iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..6864d21edb4e 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,8 +91,10 @@ struct tdp_iter {
tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
/* A pointer to the current SPTE */
tdp_ptep_t sptep;
- /* The lowest GFN mapped by the current SPTE */
+ /* The lowest GFN (mask bits excluded) mapped by the current SPTE */
gfn_t gfn;
+ /* Mask applied to convert the GFN to the mapping GPA */
+ gfn_t gfn_mask;
/* The level of the root page given to the iterator */
int root_level;
/* The lowest level the iterator should traverse to */
@@ -120,18 +122,18 @@ struct tdp_iter {
* Iterates over every SPTE mapping the GFN range [start, end) in a
* preorder traversal.
*/
-#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
- for (tdp_iter_start(&iter, root, min_level, start); \
- iter.valid && iter.gfn < end; \
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
+ for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_mask(kvm, root)); \
+ iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))

-#define for_each_tdp_pte(iter, root, start, end) \
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, kvm, root, start, end) \
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)

tdp_ptep_t spte_to_child_pt(u64 pte, int level);

void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn);
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_mask);
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2770230a5636..ed93bba76483 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -674,18 +674,18 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
iter->gfn, iter->level);
}

-#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
- for_each_tdp_pte(_iter, _root, _start, _end)
+#define tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)

-#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
- tdp_root_for_each_pte(_iter, _root, _start, _end) \
+#define tdp_root_for_each_leaf_pte(_iter, _kvm, _root, _start, _end) \
+ tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
if (!is_shadow_present_pte(_iter.old_spte) || \
!is_last_spte(_iter.old_spte, _iter.level)) \
continue; \
else

-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)

/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -751,7 +751,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t end = tdp_mmu_max_gfn_exclusive();
gfn_t start = 0;

- for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -855,7 +855,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,

rcu_read_lock();

- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
@@ -1104,8 +1104,8 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
@@ -1115,8 +1115,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
trace_kvm_mmu_spte_requested(fault);

rcu_read_lock();
-
- tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, fault->gfn, fault->gfn + 1) {
int r;

if (fault->nx_huge_page_workaround_enabled)
@@ -1214,7 +1213,7 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
rcu_read_lock();

- tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
ret |= handler(kvm, &iter, range);

rcu_read_unlock();
@@ -1297,7 +1296,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);

- for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1460,7 +1459,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level above the target level (e.g. splitting a 1GB to 512 2MB pages,
* and then splitting each of those to 512 4KB pages).
*/
- for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1545,7 +1544,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

rcu_read_lock();

- tdp_root_for_each_pte(iter, root, start, end) {
+ tdp_root_for_each_pte(iter, kvm, root, start, end) {
retry:
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
@@ -1600,7 +1599,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,

rcu_read_lock();

- tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
+ tdp_root_for_each_leaf_pte(iter, kvm, root, gfn + __ffs(mask),
gfn + BITS_PER_LONG) {
if (!mask)
break;
@@ -1657,7 +1656,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,

rcu_read_lock();

- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_2M, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1727,7 +1726,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,

rcu_read_lock();

- for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, gfn, gfn + 1) {
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
@@ -1775,14 +1774,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level)
{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;

*root_level = vcpu->arch.mmu->root_role.level;

- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
@@ -1804,12 +1803,12 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
u64 *spte)
{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;

- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7c593a081eba..0e6325b5f5e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13987,6 +13987,16 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
}
EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);

+#ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
+{
+ if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))
+ return -EOPNOTSUPP;
+
+ return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
+}
+#endif
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
--
2.34.1


2024-05-30 21:10:09

by Edgecombe, Rick P

Subject: [PATCH v2 08/15] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type

From: Isaku Yamahata <[email protected]>

Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1]
so that the iterator on the root page table can consistently filter the
root page table type instead of only_valid.

TDX KVM will operate on KVM page tables with specified types: shared page
table, private page table, or both. Introduce an enum instead of bool
only_valid so that we can easily enhance page table types applicable to
shared, private, or both in addition to valid or not. Replace
only_valid=false with KVM_ANY_ROOTS and only_valid=true with
KVM_ANY_VALID_ROOTS. Use KVM_ANY_ROOTS and KVM_ANY_VALID_ROOTS to wrap
KVM_VALID_ROOTS to avoid further code churn when direct vs mirror root
concepts are introduced in future patches.

Link: https://lore.kernel.org/kvm/[email protected]/ [1]
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep:
- Newly introduced.
---
arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++-------------------
arch/x86/kvm/mmu/tdp_mmu.h | 7 +++++++
2 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d49abf1e3f37..5e8f652cd8b1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,9 +92,10 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

-static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
+static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
+ enum kvm_tdp_mmu_root_types types)
{
- if (only_valid && root->role.invalid)
+ if ((types & KVM_VALID_ROOTS) && root->role.invalid)
return false;

return true;
@@ -102,17 +103,17 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)

/*
* Returns the next root after @prev_root (or the first root if @prev_root is
- * NULL). A reference to the returned root is acquired, and the reference to
- * @prev_root is released (the caller obviously must hold a reference to
- * @prev_root if it's non-NULL).
+ * NULL) that matches with @types. A reference to the returned root is
+ * acquired, and the reference to @prev_root is released (the caller obviously
+ * must hold a reference to @prev_root if it's non-NULL).
*
- * If @only_valid is true, invalid roots are skipped.
+ * Roots that doesn't match with @types are skipped.
*
* Returns NULL if the end of tdp_mmu_roots was reached.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
- bool only_valid)
+ enum kvm_tdp_mmu_root_types types)
{
struct kvm_mmu_page *next_root;

@@ -133,7 +134,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);

while (next_root) {
- if (tdp_mmu_root_match(next_root, only_valid) &&
+ if (tdp_mmu_root_match(next_root, types) &&
kvm_tdp_mmu_get_root(next_root))
break;

@@ -158,20 +159,20 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* If shared is set, this function is operating under the MMU lock in read
* mode.
*/
-#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _types); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \
+ _root = tdp_mmu_next_root(_kvm, _root, _types)) \
if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else

#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, true)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_ANY_VALID_ROOTS)

#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, false); \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ANY_ROOTS); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, false))
+ _root = tdp_mmu_next_root(_kvm, _root, KVM_ANY_ROOTS))

/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
@@ -180,18 +181,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
-#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \
+#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _types) \
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
- !tdp_mmu_root_match((_root), (_only_valid)))) { \
+ !tdp_mmu_root_match((_root), (_types)))) { \
} else

#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ANY_ROOTS)

#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, true)
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ANY_VALID_ROOTS)

static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
@@ -1196,7 +1197,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
{
struct kvm_mmu_page *root;

- __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ANY_ROOTS)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 437ddd4937a9..e7055a5333a8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -19,6 +19,13 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)

void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);

+enum kvm_tdp_mmu_root_types {
+ KVM_VALID_ROOTS = BIT(0),
+
+ KVM_ANY_ROOTS = 0,
+ KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
+};
+
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
--
2.34.1


2024-05-30 21:10:36

by Edgecombe, Rick P

Subject: [PATCH v2 10/15] KVM: x86/tdp_mmu: Reflect building mirror page tables

From: Isaku Yamahata <[email protected]>

Integrate hooks for mirroring page table operations for cases where TDX
will set PTEs or link page tables.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from any actual guest
mapping. Now add plumbing to "reflect" changes to the mirror to the page
tables being mirrored. Just create the x86_ops for now, leave plumbing the
operations into the TDX module for future patches.

Add two operations for setting up mirrored page tables, one for linking
new page tables and one for setting leaf PTEs. Don't add any op for
configuring the root PFN, as TDX handles this itself. Don't provide a
way to set permissions on the PTEs also, as TDX doesn't support it.
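
For orientation, the two hooks take roughly the following shape (an
approximation; the authoritative prototypes are in the kvm_host.h hunk of
this patch):

	int (*reflect_link_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
				void *mirrored_spt);
	int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
				kvm_pfn_t pfn);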

This results in MMU "mirroring" support that is very targeted towards TDX.
Since it is likely there will be no other user, the main benefit of making
the support generic is to keep TDX specific *looking* code outside of the
MMU. As a generic feature it will still make enough sense from TDX's
perspective, and for developers unfamiliar with the TDX arch it can express
the general concepts such that they can continue to work in the code.

TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For setting/linking, only
hook tdp_mmu_set_spte_atomic(), which is used for mapping and linking
PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
setting PTEs in operations unsupported by TDX: splitting huge pages and
write protecting. Sprinkle KVM_BUG_ON()s to document in code that these
paths are not supported for mirrored page tables. Leave the zapping
operations for near-future changes.

Many operations in the TDP MMU depend on atomicity of the PTE update.
While the mirror PTE on KVM's side can be updated atomically, the update
that happens inside the reflect operations (S-EPT updates via TDX module
call) can't happen atomically with the mirror update. The following race
could result during two vCPU's populating private memory:

* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: reflect_set_spte() to update 4K secure EPT entry => error
           because 2M secure EPT entry is not populated yet
* vcpu 1: reflect_link_spt() to update 2M secure EPT entry

Prevent this by setting the mirror PTE to REMOVED_SPTE while the reflect
operations are performed. Only write the actual mirror PTE value once the
reflect operations have completed. When trying to set a PTE to present and
encountering a removed SPTE, retry the fault.

By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be REMOVED_SPTE (freeze)
* vcpu 2: read 2M level EPT entry that is REMOVED_SPTE
* vcpu 2: find that the EPT entry is frozen
abandon page table walk to resume guest execution
* vcpu 1: reflect_link_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution
           Depending on vcpu 1's state, vcpu 2 may hit the EPT violation
           again or make progress on guest execution
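
Condensed into a leaf-only sketch (the full implementation, including the
non-leaf link path and sanity checks, is reflect_set_spte_present() in the
diff below; the mirror_spte_populate name exists only for this sketch):

static int __must_check mirror_spte_populate(struct kvm *kvm, tdp_ptep_t sptep,
					     gfn_t gfn, u64 old_spte,
					     u64 new_spte, int level)
{
	int ret;

	/* 1) Freeze: concurrent walkers see REMOVED_SPTE and retry the fault */
	if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
		return -EBUSY;

	/* 2) Reflect the update into the page tables being mirrored */
	ret = static_call(kvm_x86_reflect_set_spte)(kvm, gfn, level,
						    spte_to_pfn(new_spte));

	/* 3) Unfreeze: publish new_spte on success, restore old_spte on error */
	__kvm_tdp_mmu_write_spte(sptep, ret ? old_spte : new_spte);
	return ret;
}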

Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Yan Zhao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
- Rename x86_ops from "private" to "reflect"
- In response to "sp->mirrored_spt" rename helpers to "mirrored"
- Drop unused old_pfn and new_pfn in handle_changed_spte()
- Drop redundant is_shadow_present_pte() check in __tdp_mmu_set_spte_atomic
- Adjust some warnings and KVM_BUG_ONs
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 7 ++
arch/x86/kvm/mmu/tdp_mmu.c | 108 +++++++++++++++++++++++++----
3 files changed, 105 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 566d19b02483..1877d6a77525 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,6 +95,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(reflect_link_spt)
+KVM_X86_OP_OPTIONAL(reflect_set_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d4446dde0ace..20bb10f22ca6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1748,6 +1748,13 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);

+ /* Update mirrored mapping with page table link */
+ int (*reflect_link_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *mirrored_spt);
+ /* Update the mirrored page table from spte getting set */
+ int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
+
bool (*has_wbinvd_exit)(void);

u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0fc168a3fff1..41b1d3f26597 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -446,6 +446,64 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

+static void *get_mirrored_spt(gfn_t gfn, u64 new_spte, int level)
+{
+ if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+ struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
+ void *mirrored_spt = kvm_mmu_mirrored_spt(sp);
+
+ WARN_ON_ONCE(sp->role.level + 1 != level);
+ WARN_ON_ONCE(sp->gfn != gfn);
+ return mirrored_spt;
+ }
+
+ return NULL;
+}
+
+static int __must_check reflect_set_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ int ret = 0;
+
+ KVM_BUG_ON(was_present, kvm);
+
+ /*
+ * For mirrored page table, callbacks are needed to propagate SPTE
+ * change into the mirrored page table. In order to atomically update
+ * both the SPTE and the mirrored page tables with callbacks, utilize
+ * freezing SPTE.
+ * - Freeze the SPTE. Set entry to REMOVED_SPTE.
+ * - Trigger callbacks for mirrored page tables.
+ * - Unfreeze the SPTE. Set the entry to new_spte.
+ */
+ lockdep_assert_held(&kvm->mmu_lock);
+ if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
+ return -EBUSY;
+
+ /*
+ * Use a different callback to either link a middle level
+ * mirrored page table, or set a leaf PTE.
+ */
+ if (is_leaf) {
+ ret = static_call(kvm_x86_reflect_set_spte)(kvm, gfn, level, new_pfn);
+ } else {
+ void *mirrored_spt = get_mirrored_spt(gfn, new_spte, level);
+
+ KVM_BUG_ON(!mirrored_spt, kvm);
+ ret = static_call(kvm_x86_reflect_link_spt)(kvm, gfn, level, mirrored_spt);
+ }
+ if (ret)
+ __kvm_tdp_mmu_write_spte(sptep, old_spte);
+ else
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return ret;
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -559,7 +617,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}

-static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
+static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);

@@ -571,15 +629,36 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
*/
WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));

- /*
- * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
- * does not hold the mmu_lock. On failure, i.e. if a different logical
- * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
- * the current value, so the caller operates on fresh data, e.g. if it
- * retries tdp_mmu_set_spte_atomic()
- */
- if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
- return -EBUSY;
+ if (is_mirror_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
+ int ret;
+
+ /* Don't support atomic zapping for mirrored roots */
+ if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
+ return -EBUSY;
+ /*
+ * Populating case.
+ * - reflect_set_spte_present() implements
+ * 1) Freeze SPTE
+ * 2) call hooks to update mirrored page table,
+ * 3) update SPTE to new_spte
+ * - handle_changed_spte() only updates stats.
+ */
+ ret = reflect_set_spte_present(kvm, iter->sptep, iter->gfn,
+ iter->old_spte, new_spte, iter->level);
+ if (ret)
+ return ret;
+ } else {
+ /*
+ * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+ * and does not hold the mmu_lock. On failure, i.e. if a
+ * different logical CPU modified the SPTE, try_cmpxchg64()
+ * updates iter->old_spte with the current value, so the caller
+ * operates on fresh data, e.g. if it retries
+ * tdp_mmu_set_spte_atomic()
+ */
+ if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+ return -EBUSY;
+ }

return 0;
}
@@ -610,7 +689,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,

lockdep_assert_held_read(&kvm->mmu_lock);

- ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
if (ret)
return ret;

@@ -636,7 +715,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* Delay processing of the zapped SPTE until after TLBs are flushed and
* the REMOVED_SPTE is replaced (see below).
*/
- ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
if (ret)
return ret;

@@ -698,6 +777,11 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
role = sptep_to_sp(sptep)->role;
role.level = level;
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
+
+ /* Don't support setting for the non-atomic case */
+ if (is_mirror_sptep(sptep))
+ KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+
return old_spte;
}

--
2.34.1


2024-05-30 21:10:45

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

From: Isaku Yamahata <[email protected]>

Add the ability for the TDP MMU to maintain a mirror of a separate mapping.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.

In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently,
and it would make the code confusing when read from their perspective.
Special operations added for TDX that set private or zap private memory
will have nothing to do with these other private memory technologies
(SEV, etc.).

To handle these, instead teach the TDP MMU about a new concept, "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid level page
tables need to be installed. When the memory being mirrored needs to
actually be changed, calls can be made via x86_ops.

  private KVM page fault         |
            |                    |
            V                    |
       private GPA               |       CPU protected EPTP
            |                    |               |
            V                    |               V
      mirror PT root             |        private PT root
            |                    |               |
            V                    |               V
        mirror PT --hook to propagate-->     private PT
            |                    |               |
            \--------------------+------\        |
                                 |      |        |
                                 |      V        V
                                 |    private guest page
                                 |
                                 |
  non-encrypted memory           |     encrypted memory
                                 |

Leave calling out to actually update the private page tables that are being
mirrored for later changes. Just implement the handling of MMU operations
on mirrored roots.

In order to direct operations to the correct root, add the root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct roots
to operations targeting private/shared memory by adding kvm_on_mirror() and
kvm_on_direct() helpers in mmu.h.

Clean up the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root that is being mirrored
cannot be rebuilt like a normal root. It needs to persist for the lifetime
of the VM.

The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirror page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_mirrored_spt().

Update handle_changed_spte() to take a role. Use this for a KVM_BUG_ON().

Don't support 2M pages for now. This is avoided by forcing 4K pages in the
fault. Add a KVM_BUG_ON() to verify.
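
For illustration (not code in this patch), an operation that targets
private memory selects its root(s) with the helpers added in the hunks
below; the loop body here is just a placeholder:

	enum kvm_tdp_mmu_root_types types =
		kvm_process_to_root_types(kvm, KVM_PROCESS_PRIVATE);
	struct kvm_mmu_page *root;

	/* On a TDX VM this walks the mirror root, otherwise the direct root */
	__for_each_tdp_mmu_root_yield_safe(kvm, root, -1, types)
		/* ... operate on each matching root ... */;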

Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Yan Zhao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Rename private->mirror
- Split apart from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP
MMU"
- Update log
- Sprinkle a few comments
- Use kvm_on_*() helpers to direct iterator to proper root
- Drop BUGGY_KVM_ROOTS because the translation between the process enum
is no longer automatic, and the warn already happens elsewhere.

TDX MMU Prep:
- Remove unnecessary gfn, access twist in
tdp_mmu_map_handle_target_level(). (Chao Gao)
- Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
tdp_mmu_alloc_sp()
- Update comment in set_private_spte_present() (Yan)
- Open code call to kvm_mmu_init_private_spt() (Yan)
- Add comments on TDX MMU hooks (Yan)
- Fix various whitespace alignment (Yan)
- Remove pointless warnings and conditionals in
handle_removed_private_spte() (Yan)
- Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
- Remove incorrect comment in handle_changed_spte() (Yan)
- Remove unneeded kvm_pfn_to_refcounted_page() and
is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
- Do kvm_gfn_for_root() branchless (Rick)
- Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
- Add comment for stripping shared bit for fault.gfn (Chao)

v19:
- drop CONFIG_KVM_MMU_PRIVATE

v18:
- Rename freezed => frozen

v14 -> v15:
- Refined is_private condition check in kvm_tdp_mmu_map().
Add kvm_gfn_shared_mask() check.
- catch up for struct kvm_range change
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.h | 16 ++++++++
arch/x86/kvm/mmu/mmu.c | 11 +++++-
arch/x86/kvm/mmu/tdp_mmu.c | 70 ++++++++++++++++++++++++---------
arch/x86/kvm/mmu/tdp_mmu.h | 39 ++++++++++++++++--
5 files changed, 115 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c9af963ab897..d4446dde0ace 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -470,6 +470,7 @@ struct kvm_mmu {
int (*sync_spte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, int i);
struct kvm_mmu_root_info root;
+ hpa_t mirror_root_hpa;
union kvm_cpu_role cpu_role;
union kvm_mmu_page_role root_role;

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index f0713b6e4ee5..006f06463a27 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -328,4 +328,20 @@ static inline gfn_t kvm_gfn_direct_mask(const struct kvm *kvm)
{
return kvm->arch.gfn_direct_mask;
}
+
+static inline bool kvm_on_mirror(const struct kvm *kvm, enum kvm_process process)
+{
+ if (!kvm_has_mirrored_tdp(kvm))
+ return false;
+
+ return process & KVM_PROCESS_PRIVATE;
+}
+
+static inline bool kvm_on_direct(const struct kvm *kvm, enum kvm_process process)
+{
+ if (!kvm_has_mirrored_tdp(kvm))
+ return true;
+
+ return process & KVM_PROCESS_SHARED;
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 12178945922f..01fb918612ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;

if (tdp_mmu_enabled) {
- kvm_tdp_mmu_alloc_root(vcpu);
+ if (kvm_has_mirrored_tdp(vcpu->kvm))
+ kvm_tdp_mmu_alloc_root(vcpu, true);
+ kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
}

@@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)

mmu->root.hpa = INVALID_PAGE;
mmu->root.pgd = 0;
+ mmu->mirror_root_hpa = INVALID_PAGE;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;

@@ -7220,6 +7223,12 @@ int kvm_mmu_vendor_module_init(void)
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
+ if (tdp_mmu_enabled) {
+ read_lock(&vcpu->kvm->mmu_lock);
+ mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->mirror_root_hpa,
+ NULL);
+ read_unlock(&vcpu->kvm->mmu_lock);
+ }
free_mmu_pages(&vcpu->arch.root_mmu);
free_mmu_pages(&vcpu->arch.guest_mmu);
mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5e8f652cd8b1..0fc168a3fff1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -95,10 +95,18 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
enum kvm_tdp_mmu_root_types types)
{
+ if (WARN_ON_ONCE(!(types & (KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS))))
+ return false;
+
if ((types & KVM_VALID_ROOTS) && root->role.invalid)
return false;

- return true;
+ if ((types & KVM_DIRECT_ROOTS) && !is_mirror_sp(root))
+ return true;
+ if ((types & KVM_MIRROR_ROOTS) && is_mirror_sp(root))
+ return true;
+
+ return false;
}

/*
@@ -233,7 +241,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}

-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
@@ -241,6 +249,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;

+ if (mirror)
+ kvm_mmu_page_role_set_mirrored(&role);
+
/*
* Check for an existing root before acquiring the pages lock to avoid
* unnecessary serialization if multiple vCPUs are loading a new root.
@@ -292,13 +303,17 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
* and actually consuming the root if it's invalidated after dropping
* mmu_lock, and the root can't be freed as this vCPU holds a reference.
*/
- mmu->root.hpa = __pa(root->spt);
- mmu->root.pgd = 0;
+ if (mirror) {
+ mmu->mirror_root_hpa = __pa(root->spt);
+ } else {
+ mmu->root.hpa = __pa(root->spt);
+ mmu->root.pgd = 0;
+ }
}

static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared);
+ u64 old_spte, u64 new_spte,
+ union kvm_mmu_page_role role, bool shared);

static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -425,7 +440,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
REMOVED_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_spte, REMOVED_SPTE, level, shared);
+ old_spte, REMOVED_SPTE, sp->role, shared);
}

call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -438,7 +453,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* @gfn: the base GFN that was mapped by the SPTE
* @old_spte: The value of the SPTE before the change
* @new_spte: The value of the SPTE after the change
- * @level: the level of the PT the SPTE is part of in the paging structure
+ * @role: the role of the PT the SPTE is part of in the paging structure
* @shared: This operation may not be running under the exclusive use of
* the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
@@ -448,9 +463,11 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* and fast_pf_fix_direct_spte()).
*/
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared)
+ u64 old_spte, u64 new_spte,
+ union kvm_mmu_page_role role, bool shared)
{
+ bool is_mirror = kvm_mmu_page_role_is_mirror(role);
+ int level = role.level;
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
bool was_leaf = was_present && is_last_spte(old_spte, level);
@@ -531,8 +548,11 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* pages are kernel allocations and should never be migrated.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ KVM_BUG_ON(is_mirror !=
+ is_mirror_sptep(spte_to_child_pt(old_spte, level)), kvm);
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+ }

if (was_leaf && is_accessed_spte(old_spte) &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
@@ -585,6 +605,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
+ u64 *sptep = rcu_dereference(iter->sptep);
int ret;

lockdep_assert_held_read(&kvm->mmu_lock);
@@ -594,7 +615,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
return ret;

handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, true);
+ new_spte, sptep_to_sp(sptep)->role, true);

return 0;
}
@@ -602,6 +623,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
+ union kvm_mmu_page_role role;
int ret;

lockdep_assert_held_read(&kvm->mmu_lock);
@@ -628,6 +650,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
*/
__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);

+ role = sptep_to_sp(iter->sptep)->role;
/*
* Process the zapped SPTE after flushing TLBs, and after replacing
* REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
@@ -635,7 +658,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* SPTEs.
*/
handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- 0, iter->level, true);
+ SHADOW_NONPRESENT_VALUE, role, true);

return 0;
}
@@ -657,6 +680,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
u64 old_spte, u64 new_spte, gfn_t gfn, int level)
{
+ union kvm_mmu_page_role role;
+
lockdep_assert_held_write(&kvm->mmu_lock);

/*
@@ -670,7 +695,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,

old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);

- handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+ role = sptep_to_sp(sptep)->role;
+ role.level = level;
+ handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
return old_spte;
}

@@ -1114,7 +1141,8 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm *kvm = vcpu->kvm;
- struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+ enum kvm_tdp_mmu_root_types root_type = tdp_mmu_get_fault_root_type(kvm, fault);
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
@@ -1151,13 +1179,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
*/
sp = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_child_sp(sp, &iter);
+ if (is_mirror_sp(sp))
+ kvm_mmu_alloc_mirrored_spt(vcpu, sp);

sp->nx_huge_page_disallowed = fault->huge_page_disallowed;

- if (is_shadow_present_pte(iter.old_spte))
+ if (is_shadow_present_pte(iter.old_spte)) {
+ /* Don't support large page for mirrored roots (TDX) */
+ KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
- else
+ } else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+ }

/*
* Force the guest to retry if installing an upper level SPTE
@@ -1812,7 +1845,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
u64 *spte)
{
- struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+ /* Fast pf is not supported for mirrored roots */
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_DIRECT_ROOTS);
struct tdp_iter iter;
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index e7055a5333a8..934269b82f20 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);

-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror);

__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
@@ -21,11 +21,44 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);

enum kvm_tdp_mmu_root_types {
KVM_VALID_ROOTS = BIT(0),
+ KVM_DIRECT_ROOTS = BIT(1),
+ KVM_MIRROR_ROOTS = BIT(2),

- KVM_ANY_ROOTS = 0,
- KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
+ KVM_ANY_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
+ KVM_ANY_VALID_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS | KVM_VALID_ROOTS,
};

+static inline enum kvm_tdp_mmu_root_types kvm_process_to_root_types(struct kvm *kvm,
+ enum kvm_process process)
+{
+ enum kvm_tdp_mmu_root_types ret = 0;
+
+ WARN_ON_ONCE(process == BUGGY_KVM_INVALIDATION);
+
+ if (kvm_on_mirror(kvm, process))
+ ret |= KVM_MIRROR_ROOTS;
+
+ if (kvm_on_direct(kvm, process))
+ ret |= KVM_DIRECT_ROOTS;
+
+ return ret;
+}
+
+static inline enum kvm_tdp_mmu_root_types tdp_mmu_get_fault_root_type(struct kvm *kvm,
+ struct kvm_page_fault *fault)
+{
+ if (fault->is_private)
+ return kvm_process_to_root_types(kvm, KVM_PROCESS_PRIVATE);
+ return KVM_DIRECT_ROOTS;
+}
+
+static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu, enum kvm_tdp_mmu_root_types type)
+{
+ if (type == KVM_MIRROR_ROOTS)
+ return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
--
2.34.1


2024-05-30 21:10:45

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 07/15] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()

From: Isaku Yamahata <[email protected]>

Extract tdp_mmu_root_match() to check if a root matches the given types, and
use it in the root page table iterator. For now it only checks only_valid.

TDX KVM operates on a shared page table only (Shared-EPT), a mirrored page
table only (Secure-EPT), or both, based on the operation. KVM MMU notifier
operations operate only on the shared page table, KVM guest_memfd
invalidation operations operate only on the mirrored page table, and so on.
Introduce a centralized matching function instead of open coding the
matching logic in the iterator. The next step is to extend the function to
check whether the root is for shared or private memory.

Link: https://lore.kernel.org/kvm/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ed93bba76483..d49abf1e3f37 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,6 +92,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

+static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
+{
+ if (only_valid && root->role.invalid)
+ return false;
+
+ return true;
+}
+
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
* NULL). A reference to the returned root is acquired, and the reference to
@@ -125,7 +133,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);

while (next_root) {
- if ((!only_valid || !next_root->role.invalid) &&
+ if (tdp_mmu_root_match(next_root, only_valid) &&
kvm_tdp_mmu_get_root(next_root))
break;

@@ -176,7 +184,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
- ((_only_valid) && (_root)->role.invalid))) { \
+ !tdp_mmu_root_match((_root), (_only_valid)))) { \
} else

#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
--
2.34.1


2024-05-30 21:11:08

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 13/15] KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process

From: Isaku Yamahata <[email protected]>

Teach the MMU notifier callbacks how to check kvm_gfn_range.process to
filter which KVM MMU root types to operate on.

The private GPAs are backed by guest memfd. Such memory is not subject
to MMU notifier callbacks because it can't be mapped into the host user
address space. Now kvm_gfn_range conveys info about which root to operate
on. Enhance the callbacks to filter by root page table type.

The KVM MMU notifier comes down to two functions:
kvm_tdp_mmu_unmap_gfn_range() and kvm_tdp_mmu_handle_gfn().

For VMs without a private/shared split in the EPT, all operations
should target the normal (direct) root.
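
For illustration (assuming the kvm_process_to_root_types() helper added
earlier in the series; kvm is a normal VM and tdx_kvm a TDX VM, both
hypothetical variables for this sketch):

	/* Non-TDX VM: every kvm_process value resolves to the direct root */
	WARN_ON(kvm_process_to_root_types(kvm, KVM_PROCESS_PRIVATE) !=
		KVM_DIRECT_ROOTS);

	/* TDX VM: the root(s) follow the private/shared split */
	WARN_ON(kvm_process_to_root_types(tdx_kvm, KVM_PROCESS_SHARED) !=
		KVM_DIRECT_ROOTS);
	WARN_ON(kvm_process_to_root_types(tdx_kvm, KVM_PROCESS_PRIVATE) !=
		KVM_MIRROR_ROOTS);
	WARN_ON(kvm_process_to_root_types(tdx_kvm, KVM_PROCESS_PRIVATE_AND_SHARED) !=
		(KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS));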

invalidate_range_start() comes into kvm_tdp_mmu_unmap_gfn_range().
invalidate_range_end() doesn't come into arch code.

Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Use newly added kvm_process_to_root_types()

TDX MMU Prep:
- Remove warning (Rick)
- Remove confusing mention of mapping flags (Chao)
- Re-write coverletter

v19:
- type: test_gfn() => test_young()

v18:
- newly added
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4f00ae7da072..da6024b8295f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1351,12 +1351,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
}

+/* Used by mmu notifier via kvm_unmap_gfn_range() */
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
+ enum kvm_tdp_mmu_root_types types = kvm_process_to_root_types(kvm, range->process);
struct kvm_mmu_page *root;

- __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ANY_ROOTS)
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);

@@ -1370,6 +1372,7 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct kvm_gfn_range *range,
tdp_handler_t handler)
{
+ enum kvm_tdp_mmu_root_types types = kvm_process_to_root_types(kvm, range->process);
struct kvm_mmu_page *root;
struct tdp_iter iter;
bool ret = false;
@@ -1378,7 +1381,7 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
* Don't support rescheduling, none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
- for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ __for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
rcu_read_lock();

tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
--
2.34.1


2024-05-30 21:11:10

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 12/15] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()

From: Isaku Yamahata <[email protected]>

Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and have it take an enum kvm_process
argument that specifies which roots to invalidate.

Have the callers only invalidate the required roots instead of all
roots.

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Use process enum instead of root

TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 9 +++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 5 +++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 ++-
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 01fb918612ae..0b173aa08728 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6414,8 +6414,13 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* write and in the same critical section as making the reload request,
* e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
*/
- if (tdp_mmu_enabled)
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ if (tdp_mmu_enabled) {
+ /*
+ * Private page tables don't support fast zapping. The
+ * caller should handle it another way.
+ */
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_PROCESS_SHARED);
+ }

/*
* Notify all vcpus to reload its shadow page table and flush TLB.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1245f6a48dbe..4f00ae7da072 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* for zapping and thus puts the TDP MMU's reference to each root, i.e.
* ultimately frees all roots.
*/
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_PROCESS_PRIVATE_AND_SHARED);
kvm_tdp_mmu_zap_invalidated_roots(kvm);

WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -1132,7 +1132,8 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_alloc_root().
*/
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_process process_types)
{
struct kvm_mmu_page *root;

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 934269b82f20..30b1288c1e4d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -62,7 +62,8 @@ static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu, enum
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_process types);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
--
2.34.1


2024-05-30 21:11:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 14/15] KVM: x86/tdp_mmu: Invalidate correct roots

From: Sean Christopherson <[email protected]>

When invalidating roots, respect the root type passed.

kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a VM it needs to invalidate all roots. Check the root type in the root
iterator.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
[evolved quite a bit from original author's patch]
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep:
- Rename from "Don't zap private pages for unsupported cases", and split
many parts out.
- Don't support MTRR, apic zapping (Rick)
- Detangle private/shared alias logic in kvm_tdp_mmu_unmap_gfn_range()
(Rick)
- Fix TLB flushing bug debugged by (Chao Gao)
https://lore.kernel.org/kvm/Zh8yHEiOKyvZO+QR@chao-email/
- Split out MTRR part
- Use enum based root iterators (Sean)
- Reorder logic in kvm_mmu_zap_memslot_leafs().
- Replace skip_private with enum kvm_tdp_mmu_root_type.
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index da6024b8295f..0caa1029b6bd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1135,6 +1135,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
enum kvm_process process_types)
{
+ enum kvm_tdp_mmu_root_types root_types = kvm_process_to_root_types(kvm, process_types);
struct kvm_mmu_page *root;

/*
@@ -1158,6 +1159,9 @@ void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
* or get/put references to roots.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!tdp_mmu_root_match(root, root_types))
+ continue;
+
/*
* Note, invalid roots can outlive a memslot update! Invalid
* roots must be *zapped* before the memslot update completes,
--
2.34.1


2024-05-30 21:12:04

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 15/15] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

From: Isaku Yamahata <[email protected]>

Export a function to walk down the TDP MMU without modifying it.

Future changes will support pre-populating TDX private memory. In order to
implement this, KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT, and verify the populated private memory
PFN matches the current one.[1]

There is already a TDP MMU walker, kvm_tdp_mmu_get_walk(), for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.

Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.

Link: https://lore.kernel.org/kvm/[email protected]/ [1]
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
This helper will be used in the future change that implements
KVM_TDX_INIT_MEM_REGION. Please refer to the following commit for the
usage:
https://github.com/intel/tdx/commit/9594fb9a0ab566e0ad556bb19770665319d800fd
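
A rough sketch of that future usage (hypothetical caller, not part of this
patch): check that a GPA is already mapped in the mirror root and that it
maps the expected PFN:

static int check_premapped_gpa(struct kvm_vcpu *vcpu, gpa_t gpa,
			       kvm_pfn_t expected_pfn)
{
	kvm_pfn_t pfn;
	int level;

	read_lock(&vcpu->kvm->mmu_lock);
	level = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &pfn);
	read_unlock(&vcpu->kvm->mmu_lock);

	if (level < 0)
		return -ENOENT;		/* GPA is not pre-populated */

	return pfn == expected_pfn ? 0 : -EIO;
}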

TDX MMU Prep v2:
- Rename function with "mirror" and use root enum

TDX MMU Prep:
- New patch
---
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++++++++++++++++-----
2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 006f06463a27..625a57515d48 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -275,6 +275,8 @@ extern bool tdp_mmu_enabled;
#define tdp_mmu_enabled false
#endif

+int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
+
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0caa1029b6bd..897d8aa1f544 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1946,16 +1946,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
*
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
*/
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
- int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ enum kvm_tdp_mmu_root_types root_type)
{
- struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
struct tdp_iter iter;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;

- *root_level = vcpu->arch.mmu->root_role.level;
-
tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
@@ -1964,6 +1962,37 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
return leaf;
}

+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
+{
+ *root_level = vcpu->arch.mmu->root_role.level;
+
+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);
+}
+
+int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+ kvm_pfn_t *pfn)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+ int leaf;
+
+ lockdep_assert_held(&vcpu->kvm->mmu_lock);
+
+ rcu_read_lock();
+ leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
+ rcu_read_unlock();
+ if (leaf < 0)
+ return -ENOENT;
+
+ spte = sptes[leaf];
+ if (!(is_shadow_present_pte(spte) && is_last_spte(spte, leaf)))
+ return -ENOENT;
+
+ *pfn = spte_to_pfn(spte);
+ return leaf;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_mirror_pfn);
+
/*
* Returns the last level spte pointer of the shadow page walk for the given
* gpa, and sets *spte to the spte value. This spte may be non-preset. If no
--
2.34.1


2024-05-30 21:12:10

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v2 11/15] KVM: x86/tdp_mmu: Reflect tearing down mirror page tables

From: Isaku Yamahata <[email protected]>

Integrate hooks for mirroring page table operations for cases where TDX
will zap PTEs or free page tables.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from the guest. Now add
plumbing to "reflect" changes made to the mirror into the page tables being
mirrored. Just create the x86_ops for now, and leave plumbing the operations
into the TDX module for future patches.

Add two operations for tearing down page tables, one for freeing page
tables (reflect_free_spt) and one for zapping PTEs (reflect_remove_spte).
Define them such that reflect_remove_spte will perform a TLB flush as well
(in TDX terms, "ensure there are no active translations").

TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For zapping/freeing, only
hook tdp_mmu_iter_set_spte(), which is used for zapping PTEs while holding
the mmu_lock for write. Don't bother hooking tdp_mmu_set_spte_atomic() as
it is only used for zapping PTEs in operations unsupported by TDX: zapping
collapsible PTEs and kvm_mmu_zap_all_fast().

In previous changes to address races around concurrent populating using
tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
REMOVED_SPTE in the mirrored page tables while performing the "reflect"
operations. Such a solution is not needed for the tear down paths in TDX
as these will always be performed with the mmu_lock held for write.
Sprinkle some KVM_BUG_ON()s to reflect this.

Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Yan Zhao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Prep v2:
- Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
- Rename x86_ops from "private" to "reflect"
- In response to "sp->mirrored_spt" rename helpers to "mirrored"
- Remove unused present mirroring support in tdp_mmu_set_spte()
- Merge reflect_zap_spte() into reflect_remove_spte()
- Move mirror zapping logic out of handle_changed_spte()
- Add some KVM_BUG_ONs
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 8 ++++++
arch/x86/kvm/mmu/tdp_mmu.c | 45 ++++++++++++++++++++++++++++--
3 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 1877d6a77525..dae06afc6038 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -97,6 +97,8 @@ KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_OPTIONAL(reflect_link_spt)
KVM_X86_OP_OPTIONAL(reflect_set_spte)
+KVM_X86_OP_OPTIONAL(reflect_free_spt)
+KVM_X86_OP_OPTIONAL(reflect_remove_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 20bb10f22ca6..0df4a31a0df9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1755,6 +1755,14 @@ struct kvm_x86_ops {
int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
kvm_pfn_t pfn);

+ /* Update mirrored page tables for page table about to be freed */
+ int (*reflect_free_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *mirrored_spt);
+
+ /* Update mirrored page table from spte getting removed, and flush TLB */
+ int (*reflect_remove_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
+
bool (*has_wbinvd_exit)(void);

u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 41b1d3f26597..1245f6a48dbe 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -346,6 +346,29 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}

+static void reflect_removed_spte(struct kvm *kvm, gfn_t gfn,
+ u64 old_spte, u64 new_spte,
+ int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
+ kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+ int ret;
+
+ /*
+ * Allow only leaf pages to be zapped. Non-leaf page table pages
+ * are reclaimed when the VM is destroyed.
+ */
+ if (!was_leaf)
+ return;
+
+ /* Zapping a leaf SPTE is allowed only while the write lock is held. */
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ /* Because the write lock is held, the operation should succeed. */
+ ret = static_call(kvm_x86_reflect_remove_spte)(kvm, gfn, level, old_pfn);
+ KVM_BUG_ON(ret, kvm);
+}
+
/**
* handle_removed_pt() - handle a page table removed from the TDP structure
*
@@ -441,6 +464,22 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
old_spte, REMOVED_SPTE, sp->role, shared);
+ if (is_mirror_sp(sp)) {
+ KVM_BUG_ON(shared, kvm);
+ reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
+ }
+ }
+
+ if (is_mirror_sp(sp) &&
+ WARN_ON(static_call(kvm_x86_reflect_free_spt)(kvm, sp->gfn, sp->role.level,
+ kvm_mmu_mirrored_spt(sp)))) {
+ /*
+ * Failed to free the page table page of the mirrored page table,
+ * and there is nothing more that can be done.
+ * Intentionally leak the page to prevent the kernel from
+ * accessing the encrypted page.
+ */
+ sp->mirrored_spt = NULL;
}

call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -778,9 +817,11 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
role.level = level;
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);

- /* Don't support setting for the non-atomic case */
- if (is_mirror_sptep(sptep))
+ if (is_mirror_sptep(sptep)) {
+ /* Only support zapping for the non-atomic case */
KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+ reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
+ }

return old_spte;
}
--
2.34.1


2024-05-30 21:58:05

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Thu, 2024-05-30 at 14:07 -0700, Rick Edgecombe wrote:
>
> Update handle_changed_spte() to take a role. Use this for a KVM_BUG_ON().

I'm wondering if the KVM_BUG_ON() is not worth the cost to the diffstat.

2024-06-06 16:04:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 03/15] KVM: x86/mmu: Add a mirrored pointer to struct kvm_mmu_page

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> Add a mirrored pointer to struct kvm_mmu_page for the private page table and
> add helper functions to allocate/initialize/free a private page table page.
> Because KVM TDP MMU doesn't use unsync_children and write_flooding_count,
> pack them to have room for a pointer and use a union to avoid memory
> overhead.
>
> For private GPA, CPU refers to a private page table whose contents are
> encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
> PTE entry) are used, and their cost is expensive.
>
> When KVM resolves the KVM page fault, it walks the page tables. To reuse
> the existing KVM MMU code and mitigate the heavy cost of directly walking
> the private page table, allocate one more page for the mirrored page table
> for the KVM MMU code to directly walk. Resolve the KVM page fault with
> the existing code, and do additional operations necessary for the private
> page table. To distinguish such cases, the existing KVM page table is
> called a shared page table (i.e., not associated with a private page
> table), and the page table with a private page table is called a mirrored
> page table. The relationship is depicted below.
>
>              KVM page fault                     |
>                     |                           |
>                     V                           |
>        -------------+----------                 |
>        |                      |                 |
>        V                      V                 |
>    shared GPA             private GPA           |
>        |                      |                 |
>        V                      V                 |
>  shared PT root          mirror PT root         |    private PT root
>        |                      |                 |           |
>        V                      V                 |           V
>    shared PT              mirror PT --propagate-->  private/mirrored PT
>        |                      |                 |           |
>        |                      \-----------------+------\    |
>        |                                        |      |    |
>        V                                        |      V    V
>  shared guest page                              |   private guest page
>                                                 |
>  non-encrypted memory                           |     encrypted memory
>                                                 |
> PT: Page table
> Shared PT: visible to KVM, and the CPU uses it for shared mappings.
> Private/mirrored PT: the CPU uses it, but it is invisible to KVM. TDX
> module updates this table to map private guest pages.
> Mirror PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> to propagate PT change to the actual private PT.

Which one is the "Mirror" and which one is the "Mirrored" PT is
uncomfortably confusing.

I hate to bikeshed even more, but while I like "Mirror PT" (a lot), I
would stick with "Private" or perhaps "External" for the pages owned
by the TDX module.

> + /*
> + * This cache is to allocate private page table. E.g. private EPT used
> + * by the TDX module.
> + */
> + struct kvm_mmu_memory_cache mmu_mirrored_spt_cache;

So this would be "mmu_external_spt_cache".

> - unsigned int unsync_children;
> + union {
> + /* Those two members aren't used for TDP MMU */

s/Those/These/

> + struct {
> + unsigned int unsync_children;
> + /*
> + * Number of writes since the last time traversal
> + * visited this page.
> + */
> + atomic_t write_flooding_count;
> + };
> + /*
> + * Page table page of private PT.
> + * Passed to TDX module, not accessed by KVM.
> + */
> + void *mirrored_spt;

external_spt

> +static inline void kvm_mmu_alloc_mirrored_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> + /*
> + * mirrored_spt is allocated for TDX module to hold private EPT mappings,
> + * TDX module will initialize the page by itself.
> + * Therefore, KVM does not need to initialize or access mirrored_spt.
> + * KVM only interacts with sp->spt for mirrored EPT operations.
> + */
> + sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_mirrored_spt_cache);
> +}
> +
> +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> + /*
> + * private_spt is allocated for TDX module to hold private EPT mappings,
> + * TDX module will initialize the page by itself.
> + * Therefore, KVM does not need to initialize or access private_spt.
> + * KVM only interacts with sp->spt for mirrored EPT operations.
> + */
> + sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_mirrored_spt_cache);
> +}

Duplicate function.

Naming aside, looks good.

Paolo


2024-06-06 16:10:56

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 01/15] KVM: Add member to struct kvm_gfn_range for target alias

On Thu, 2024-06-06 at 17:55 +0200, Paolo Bonzini wrote:
> On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
> <[email protected]> wrote:
> > +       /* Unmmap the old attribute page. */
>
> Unmap

Oops, thanks.

>
> > +       if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
> > +               range->process = KVM_PROCESS_SHARED;
> > +       else
> > +               range->process = KVM_PROCESS_PRIVATE;
> > +
> >          return kvm_unmap_gfn_range(kvm, range);
> >   }
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c3c922bf077f..f92c8b605b03 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -260,11 +260,19 @@ union kvm_mmu_notifier_arg {
> >          unsigned long attributes;
> >   };
> >
> > +enum kvm_process {
> > +       BUGGY_KVM_INVALIDATION          = 0,
> > +       KVM_PROCESS_SHARED              = BIT(0),
> > +       KVM_PROCESS_PRIVATE             = BIT(1),
> > +       KVM_PROCESS_PRIVATE_AND_SHARED  = KVM_PROCESS_SHARED |
> > KVM_PROCESS_PRIVATE,
> > +};
>
> Only KVM_PROCESS_SHARED and KVM_PROCESS_PRIVATE are needed.

I guess you mean we can just use (KVM_PROCESS_SHARED |
KVM_PROCESS_PRIVATE). Sure.
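
Something like this, I guess (just a sketch):

enum kvm_process {
	KVM_PROCESS_SHARED	= BIT(0),
	KVM_PROCESS_PRIVATE	= BIT(1),
};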

>
> > +       /*
> > +        * If/when KVM supports more attributes beyond private .vs shared,
> > this
> > +        * _could_ set exclude_{private,shared} appropriately if the entire
> > target
>
> this could mask away KVM_PROCESS_{SHARED,PRIVATE} if the entire target...

Oops, thanks.

2024-06-06 16:14:28

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 05/15] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
>
> The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
> allows for the caller, mmu_alloc_direct_roots(), to call
> kvm_tdp_mmu_alloc_root() and also return 0 in one line:
> return kvm_tdp_mmu_alloc_root(vcpu);
>
> So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
> is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
> will be called twice in mmu_alloc_direct_roots(). This will force the
> first call to either awkwardly handle the return value that will always
> be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.
> Do it in a separate change so the future change will be cleaner.
>
> Signed-off-by: Rick Edgecombe <[email protected]>

Reviewed-by: Paolo Bonzini <[email protected]>

> ---
> TDX MMU Prep:
> - New patch
> ---
> arch/x86/kvm/mmu/mmu.c | 6 ++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> 3 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5070ba7c6e89..12178945922f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3700,8 +3700,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> unsigned i;
> int r;
>
> - if (tdp_mmu_enabled)
> - return kvm_tdp_mmu_alloc_root(vcpu);
> + if (tdp_mmu_enabled) {
> + kvm_tdp_mmu_alloc_root(vcpu);
> + return 0;
> + }
>
> write_lock(&vcpu->kvm->mmu_lock);
> r = make_mmu_pages_available(vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e7cd4921afe7..2770230a5636 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> }
>
> -int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> union kvm_mmu_page_role role = mmu->root_role;
> @@ -285,7 +285,6 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> */
> mmu->root.hpa = __pa(root->spt);
> mmu->root.pgd = 0;
> - return 0;
> }
>
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 58b55e61bd33..437ddd4937a9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -10,7 +10,7 @@
> void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>
> -int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
> +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
>
> __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> {
> --
> 2.34.1
>


2024-06-06 16:19:09

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 03/15] KVM: x86/mmu: Add a mirrored pointer to struct kvm_mmu_page

On Thu, 2024-06-06 at 18:04 +0200, Paolo Bonzini wrote:
> > PT: Page table
> > Shared PT: visible to KVM, and the CPU uses it for shared mappings.
> > Private/mirrored PT: the CPU uses it, but it is invisible to KVM.  TDX
> >                       module updates this table to map private guest pages.
> > Mirror PT: It is visible to KVM, but the CPU doesn't use it.  KVM uses it
> >               to propagate PT change to the actual private PT.
>
> Which one is the "Mirror" and which one is the "Mirrored" PT is
> uncomfortably confusing.
>
> I hate to bikeshed even more, but while I like "Mirror PT" (a lot), I
> would stick with "Private" or perhaps "External" for the pages owned
> by the TDX module.

Thanks. Yes, mirror and mirrored are too close.

Kai raised the point that TDX's special need for separate, dedicated "private"
PTEs is confusing from the SEV and SW protected VM perspective, so we've tried
to remove private in that context. "External" seems like a good name.

[snip]
>
> > +static inline void kvm_mmu_alloc_mirrored_spt(struct kvm_vcpu *vcpu, struct
> > kvm_mmu_page *sp)
> > +{
> > +       /*
> > +        * mirrored_spt is allocated for TDX module to hold private EPT
> > mappings,
> > +        * TDX module will initialize the page by itself.
> > +        * Therefore, KVM does not need to initialize or access
> > mirrored_spt.
> > +        * KVM only interacts with sp->spt for mirrored EPT operations.
> > +        */
> > +       sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> > >arch.mmu_mirrored_spt_cache);
> > +}
> > +
> > +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct
> > kvm_mmu_page *sp)
> > +{
> > +       /*
> > +        * private_spt is allocated for TDX module to hold private EPT
> > mappings,
> > +        * TDX module will initialize the page by itself.
> > +        * Therefore, KVM does not need to initialize or access private_spt.
> > +        * KVM only interacts with sp->spt for mirrored EPT operations.
> > +        */
> > +       sp->mirrored_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> > >arch.mmu_mirrored_spt_cache);
> > +}
>
> Duplicate function.

Argh, I thought I fixed this before sending it out.

>
> Naming aside, looks good.

Great, thanks for the review! Snipped comments all sound good.

2024-06-06 16:23:05

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 04/15] KVM: x86/mmu: Add a new mirror_pt member for union kvm_mmu_page_role

On Thu, 2024-06-06 at 18:06 +0200, Paolo Bonzini wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h
> > index 250899a0239b..084f4708aff1 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -351,7 +351,8 @@ union kvm_mmu_page_role {
> >                  unsigned ad_disabled:1;
> >                  unsigned guest_mode:1;
> >                  unsigned passthrough:1;
> > -               unsigned :5;
> > +               unsigned mirror_pt:1;
>
> "is_mirror".

Ok.

>
> This one is also unnecessary BTW.
>
> Otherwise looks good.


Thanks. Will remove the helpers.

2024-06-06 16:35:09

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 01/15] KVM: Add member to struct kvm_gfn_range for target alias

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
> + /* Unmmap the old attribute page. */

Unmap

> + if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
> + range->process = KVM_PROCESS_SHARED;
> + else
> + range->process = KVM_PROCESS_PRIVATE;
> +
> return kvm_unmap_gfn_range(kvm, range);
> }
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c3c922bf077f..f92c8b605b03 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,11 +260,19 @@ union kvm_mmu_notifier_arg {
> unsigned long attributes;
> };
>
> +enum kvm_process {
> + BUGGY_KVM_INVALIDATION = 0,
> + KVM_PROCESS_SHARED = BIT(0),
> + KVM_PROCESS_PRIVATE = BIT(1),
> + KVM_PROCESS_PRIVATE_AND_SHARED = KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE,
> +};

Only KVM_PROCESS_SHARED and KVM_PROCESS_PRIVATE are needed.

> + /*
> + * If/when KVM supports more attributes beyond private .vs shared, this
> + * _could_ set exclude_{private,shared} appropriately if the entire target

this could mask away KVM_PROCESS_{SHARED,PRIVATE} if the entire target...

Paolo

> + * range already has the desired private vs. shared state (it's unclear
> + * if that is a net win). For now, KVM reaches this point if and only
> + * if the private flag is being toggled, i.e. all mappings are in play.
> + */
> +
> for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
>
> @@ -2506,6 +2519,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> struct kvm_mmu_notifier_range pre_set_range = {
> .start = start,
> .end = end,
> + .arg.attributes = attributes,
> .handler = kvm_pre_set_memory_attributes,
> .on_lock = kvm_mmu_invalidate_begin,
> .flush_on_ret = true,
> --
> 2.34.1
>


2024-06-06 16:38:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 04/15] KVM: x86/mmu: Add a new mirror_pt member for union kvm_mmu_page_role

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> Introduce a "mirror_pt" member to the kvm_mmu_page_role union to identify
> SPTEs associated with the mirrored EPT.
>
> The TDX module maintains the private half of the EPT mapped in the TD in
> its protected memory. KVM keeps a copy of the private GPAs in a mirrored
> EPT tree within host memory. This "mirror_pt" attribute enables vCPUs to
> find and get the root page of mirrored EPT from the MMU root list for a
> guest TD. This also allows KVM MMU code to detect changes in mirrored EPT
> according to the "mirror_pt" mmu page role and propagate the changes to
> the private EPT managed by TDX module.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> TDX MMU Prep v2:
> - Rename private -> mirrored
>
> TDX MMU Prep:
> - Remove warning and NULL check in is_private_sptep() (Rick)
> - Update commit log (Yan)
>
> v19:
> - Fix is_private_sptep() when NULL case.
> - drop CONFIG_KVM_MMU_PRIVATE
> ---
> arch/x86/include/asm/kvm_host.h | 13 ++++++++++++-
> arch/x86/kvm/mmu/mmu_internal.h | 5 +++++
> arch/x86/kvm/mmu/spte.h | 5 +++++
> 3 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 250899a0239b..084f4708aff1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -351,7 +351,8 @@ union kvm_mmu_page_role {
> unsigned ad_disabled:1;
> unsigned guest_mode:1;
> unsigned passthrough:1;
> - unsigned :5;
> + unsigned mirror_pt:1;

"is_mirror".

> + unsigned :4;
>
> /*
> * This is left at the top of the word so that
> @@ -363,6 +364,16 @@ union kvm_mmu_page_role {
> };
> };
>
> +static inline bool kvm_mmu_page_role_is_mirror(union kvm_mmu_page_role role)
> +{
> + return !!role.mirror_pt;
> +}
> +
> +static inline void kvm_mmu_page_role_set_mirrored(union kvm_mmu_page_role *role)
> +{
> + role->mirror_pt = 1;
> +}

Not needed, remove it.

> /*
> * kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
> * relevant to the current MMU configuration. When loading CR0, CR4, or EFER,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index faef40a561f9..6d82e389cd65 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -157,6 +157,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> return kvm_mmu_role_as_id(sp->role);
> }
>
> +static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
> +{
> + return kvm_mmu_page_role_is_mirror(sp->role);
> +}

e.g. "return sp->role.is_mirror".

> static inline void *kvm_mmu_mirrored_spt(struct kvm_mmu_page *sp)
> {
> return sp->mirrored_spt;

This one is also unnecessary BTW.

Otherwise looks good.

Paolo

> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 5dd5405fa07a..b3c065280ba1 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -265,6 +265,11 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
> return spte_to_child_sp(root);
> }
>
> +static inline bool is_mirror_sptep(u64 *sptep)
> +{
> + return is_mirror_sp(sptep_to_sp(sptep));
> +}
> +
> static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
> {
> return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
> --
> 2.34.1
>


2024-06-07 07:59:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

> Keep TDX specific concepts out of the MMU code by not
> naming it "shared".

I think that, more than keeping TDX specific concepts out of MMU code,
it is better to have a different name because it doesn't confuse
memory attributes with MMU concepts.

For example, SNP uses the same page tables for both shared and private
memory, as it handles them at the RMP level.

By the way, in patch 3 it still talks about "shared PT", please change
that to "direct SP" (so we have "direct", "external", "mirror").

Just one non-cosmetic request at the very end of the email.

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
> +static inline gfn_t kvm_gfn_root_mask(const struct kvm *kvm, const struct kvm_mmu_page *root)
> +{
> + if (is_mirror_sp(root))
> + return 0;

Maybe add a comment:

/*
* Since mirror SPs are used only for TDX, which maps private memory
* at its "natural" GFN, no mask needs to be applied to them - and, dually,
* we expect that the mask is only used for the shared PT.
*/

> + return kvm_gfn_direct_mask(kvm);

Ok, please excuse me again for being fussy on the naming. Typically I
think of a "mask" as something that you check against, or something
that you do x &~ mask, not as something that you add. Maybe
kvm_gfn_root_offset and gfn_direct_offset?

I also thought of gfn_direct_fixed_bits, but I'm not sure it
translates as well to kvm_gfn_root_fixed_bits. Anyway, I'll leave it
to you to make a decision, speak up if you think it's not an
improvement or if (especially for fixed_bits) it results in too long
lines.

Fortunately this kind of change is decently easy to do with a
search/replace on the patch files themselves.

> +}
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> @@ -359,7 +368,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> int r;
>
> if (vcpu->arch.mmu->root_role.direct) {
> - fault.gfn = fault.addr >> PAGE_SHIFT;
> + /*
> + * Things like memslots don't understand the concept of a shared
> + * bit. Strip it so that the GFN can be used like normal, and the
> + * fault.addr can be used when the shared bit is needed.
> + */
> + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_mask(vcpu->kvm);
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);

Please add a comment to struct kvm_page_fault's gfn field, about how
it differs from addr.
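
e.g. something along these lines (exact wording up to you):

	/*
	 * The GFN of the fault, with any direct-root bits (e.g. TDX's shared
	 * bit) stripped so it can be used with memslots; addr keeps the full
	 * GPA including those bits.
	 */
	gfn_t gfn;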

> + /* Mask applied to convert the GFN to the mapping GPA */
> + gfn_t gfn_mask;

s/mask/offset/ or s/mask/fixed_bits/ here, if you go for it; won't
repeat myself below.

> /* The level of the root page given to the iterator */
> int root_level;
> /* The lowest level the iterator should traverse to */
> @@ -120,18 +122,18 @@ struct tdp_iter {
> * Iterates over every SPTE mapping the GFN range [start, end) in a
> * preorder traversal.
> */
> -#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
> - for (tdp_iter_start(&iter, root, min_level, start); \
> - iter.valid && iter.gfn < end; \
> +#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
> + for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_mask(kvm, root)); \
> + iter.valid && iter.gfn < end; \
> tdp_iter_next(&iter))
>
> -#define for_each_tdp_pte(iter, root, start, end) \
> - for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
> +#define for_each_tdp_pte(iter, kvm, root, start, end) \
> + for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)

Maybe add the kvm pointer / remove the mmu pointer in a separate patch
to make the mask-related changes easier to identify?

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..0e6325b5f5e7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13987,6 +13987,16 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> }
> EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>
> +#ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
> +{
> + if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))

I think the code need not check kvm_gfn_direct_mask() here? In the old
patches that I have, the check is done in the vmx/main.c callback.

Paolo


2024-06-07 08:00:45

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
> - u64 nr_pages)
> -{
> - if (!kvm_x86_ops.flush_remote_tlbs_range)
> - return -EOPNOTSUPP;
> -
> - return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
> -}
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
> #endif /* CONFIG_HYPERV */

Ah, since you are at it please move the prototype out of the #ifdef
CONFIG_HYPERV.

Paolo


2024-06-07 08:10:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 08/15] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
> +enum kvm_tdp_mmu_root_types {
> + KVM_VALID_ROOTS = BIT(0),
> +
> + KVM_ANY_ROOTS = 0,
> + KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,

I would instead define it as

KVM_INVALID_ROOTS = BIT(0),
KVM_VALID_ROOTS = BIT(1),
KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,

and then

if (root->role.invalid)
return types & KVM_INVALID_ROOTS;
else
return types & KVM_VALID_ROOTS;

Then in the next patch you can do

KVM_INVALID_ROOTS = BIT(0),
- KVM_VALID_ROOTS = BIT(1),
+ KVM_DIRECT_ROOTS = BIT(1),
+ KVM_MIRROR_ROOTS = BIT(2),
+ KVM_VALID_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,

and likewise

if (root->role.invalid)
return types & KVM_INVALID_ROOTS;
else if (likely(!is_mirror_sp(root)))
return types & KVM_DIRECT_ROOTS;
else
return types & KVM_MIRROR_ROOTS;

This removes the need for KVM_ANY_VALID_ROOTS (btw I don't know if
it's me, but ALL sounds more grammatical than ANY in this context). So
the resulting code should be a bit clearer.

Apart from this small tweak, the overall idea is really good.

Paolo

> +};
> +
> bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
> bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> --
> 2.34.1
>


2024-06-07 08:29:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Fri, May 31, 2024 at 12:01 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Thu, 2024-05-30 at 14:07 -0700, Rick Edgecombe wrote:
> >
> > Update handle_changed_spte() to take a role. Use this for a KVM_BUG_ON().
>
> I'm wondering if the KVM_BUG_ON() is not worth the cost to the diffstat.

Agreed, it seems like pointless churn.

Paolo


2024-06-07 08:47:26

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
> +static inline bool kvm_on_mirror(const struct kvm *kvm, enum kvm_process process)
> +{
> + if (!kvm_has_mirrored_tdp(kvm))
> + return false;
> +
> + return process & KVM_PROCESS_PRIVATE;
> +}
> +
> +static inline bool kvm_on_direct(const struct kvm *kvm, enum kvm_process process)
> +{
> + if (!kvm_has_mirrored_tdp(kvm))
> + return true;
> +
> + return process & KVM_PROCESS_SHARED;
> +}

These helpers don't add much, it's easier to just write
kvm_process_to_root_types() as

if (process & KVM_PROCESS_SHARED)
ret |= KVM_DIRECT_ROOTS;
if (process & KVM_PROCESS_PRIVATE)
ret |= kvm_has_mirror_tdp(kvm) ? KVM_MIRROR_ROOTS : KVM_DIRECT_ROOTS;

WARN_ON_ONCE(!ret);
return ret;

(Here I chose to rename kvm_has_mirrored_tdp to kvm_has_mirror_tdp;
but if you prefer to call it kvm_has_external_tdp, it's just as
readable. Whatever floats your boat).

> struct kvm *kvm = vcpu->kvm;
> - struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> + enum kvm_tdp_mmu_root_types root_type = tdp_mmu_get_fault_root_type(kvm, fault);
> + struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);

Please make this

struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);

Most other uses of tdp_mmu_get_root() don't really need a root_type,
I'll get to them in the reviews for the rest of the series.

> enum kvm_tdp_mmu_root_types {
> KVM_VALID_ROOTS = BIT(0),
> + KVM_DIRECT_ROOTS = BIT(1),
> + KVM_MIRROR_ROOTS = BIT(2),
>
> - KVM_ANY_ROOTS = 0,
> - KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
> + KVM_ANY_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
> + KVM_ANY_VALID_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS | KVM_VALID_ROOTS,

See previous review.

> +static inline enum kvm_tdp_mmu_root_types tdp_mmu_get_fault_root_type(struct kvm *kvm,
> + struct kvm_page_fault *fault)
> +{
> + if (fault->is_private)
> + return kvm_process_to_root_types(kvm, KVM_PROCESS_PRIVATE);
> + return KVM_DIRECT_ROOTS;
> +}
> +
> +static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu, enum kvm_tdp_mmu_root_types type)
> +{
> + if (type == KVM_MIRROR_ROOTS)
> + return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> + return root_to_sp(vcpu->arch.mmu->root.hpa);
> +}

static inline struct kvm_mmu_page *
tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	hpa_t root_hpa;

	if (unlikely(fault->is_private && kvm_has_mirror_tdp(vcpu->kvm)))
		root_hpa = vcpu->arch.mmu->mirror_root_hpa;
	else
		root_hpa = vcpu->arch.mmu->root.hpa;

	return root_to_sp(root_hpa);
}

Paolo


2024-06-07 08:57:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 13/15] KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process

Subject: propagate enum kvm_process to MMU notifier callbacks

But again, the naming... I don't like kvm_process - in an OS, "process"
is a word with a clear meaning. Yes, that one is a noun and this is a
verb, but then naming an enum with a verb is also awkward.

Perhaps kvm_gfn_range_filter and range->attr_filter? A bit wordy but very clear:

enum kvm_tdp_mmu_root_types types =
kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter)

I think I like it.
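
Concretely, something like this (sketch; the exact constant names don't
matter much):

enum kvm_gfn_range_filter {
	KVM_FILTER_SHARED	= BIT(0),
	KVM_FILTER_PRIVATE	= BIT(1),
};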

This patch also should be earlier in the series; please move it after
patch 9, i.e. right after kvm_process_to_root_types is introduced.

Paolo


2024-06-07 09:03:53

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 14/15] KVM: x86/tdp_mmu: Invalidate correct roots

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> When invalidating roots, respect the root type passed.
>
> kvm_tdp_mmu_invalidate_roots() is called with different root types. For
> kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
> down a VM it needs to invalidate all roots. Check the root type in root
> iterator.

This patch and patch 12 are small enough that they can be merged.

> @@ -1135,6 +1135,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
> enum kvm_process process_types)
> {
> + enum kvm_tdp_mmu_root_types root_types = kvm_process_to_root_types(kvm, process_types);

Maybe pass directly enum kvm_tdp_mmu_root_types?

Looking at patch 12:

+ /*
+ * The private page tables doesn't support fast zapping. The
+ * caller should handle it by other way.
+ */
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_PROCESS_SHARED);

now that we have separated private-ness and external-ness, it sounds
much better to write:

/*
* External page tables don't support fast zapping, therefore
* their mirrors must be invalidated separately by the caller.
*/
kvm_tdp_mmu_invalidate_roots(kvm, KVM_DIRECT_ROOTS);

while kvm_mmu_uninit_tdp_mmu() can pass KVM_ANY_ROOTS. It may also be
worth adding an

if (WARN_ON_ONCE(root_types & KVM_INVALID_ROOTS))
root_types &= ~KVM_INVALID_ROOTS;

to document the invariants.

Paolo


2024-06-07 09:32:04

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 15/15] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> - int *root_level)
> +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> + enum kvm_tdp_mmu_root_types root_type)
> {
> - struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> + struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);

I think this function should take the struct kvm_mmu_page * directly.

> +{
> + *root_level = vcpu->arch.mmu->root_role.level;
> +
> + return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);

Here you pass root_to_sp(vcpu->arch.mmu->root.hpa);

> +int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> + kvm_pfn_t *pfn)
> +{
> + u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> + int leaf;
> +
> + lockdep_assert_held(&vcpu->kvm->mmu_lock);
> +
> + rcu_read_lock();
> + leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);

and likewise here.

You might also consider using a kvm_mmu_root_info for the mirror root,
even though the pgd field is not used.

Then __kvm_tdp_mmu_get_walk() can take a struct kvm_mmu_root_info * instead.

kvm_tdp_mmu_get_walk_mirror_pfn() doesn't belong in this series, but
introducing __kvm_tdp_mmu_get_walk() can stay here.

Looking at the later patch, which uses
kvm_tdp_mmu_get_walk_mirror_pfn(), I think this function is a bit
overkill. I'll do a pre-review of the init_mem_region function,
especially the usage of kvm_gmem_populate:

+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
+ ret = -EFAULT;
+ goto out_put_page;
+ }

The slots_lock is taken, so checking kvm_slot_can_be_private is unnecessary.

Checking kvm_mem_is_private perhaps should also be done in
kvm_gmem_populate() itself. I'll send a patch.

+ read_lock(&kvm->mmu_lock);
+
+ ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
+ if (ret < 0)
+ goto out;
+ if (ret > PG_LEVEL_4K) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (mmu_pfn != pfn) {
+ ret = -EAGAIN;
+ goto out;
+ }

If you require pre-faulting, you don't need to return mmu_pfn and
things would be seriously wrong if the two didn't match, wouldn't
they? You are executing with the filemap_invalidate_lock() taken, and
therefore cannot race with kvm_gmem_punch_hole(). (Soapbox mode on:
this is the advantage of putting the locking/looping logic in common
code, kvm_gmem_populate() in this case).

Therefore, I think kvm_tdp_mmu_get_walk_mirror_pfn() can be replaced just with

int kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
{
	struct kvm *kvm = vcpu->kvm;
	bool is_direct = !kvm_has_mirrored_tdp(...) || (gpa & kvm->arch.direct_mask);
	hpa_t root = is_direct ? ... : ...;
	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
	int leaf;

	lockdep_assert_held(&vcpu->kvm->mmu_lock);
	rcu_read_lock();
	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
	rcu_read_unlock();
	if (leaf < 0)
		return false;

	spte = sptes[leaf];
	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
}
EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);

+ while (region.nr_pages) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }

Checking signal_pending() should be done in kvm_gmem_populate() -
again, I'll take care of that. The rest of the loop is not necessary -
just call kvm_gmem_populate() with the whole range and enjoy. You get
a nice API that is consistent with the intended KVM_PREFAULT_MEMORY
ioctl, because kvm_gmem_populate() returns the number of pages it has
processed and you can use that to massage and copy back the struct
kvm_tdx_init_mem_region.

+ arg = (struct tdx_gmem_post_populate_arg) {
+ .vcpu = vcpu,
+ .flags = cmd->flags,
+ };
+ gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
+ u64_to_user_ptr(region.source_addr),
+ 1, tdx_gmem_post_populate, &arg);
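
i.e. roughly (just a sketch of the idea):

	gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
				     u64_to_user_ptr(region.source_addr),
				     region.nr_pages, tdx_gmem_post_populate,
				     &arg);
	if (gmem_ret > 0) {
		/* Advance past what was processed and copy region back. */
		region.gpa += gmem_ret * PAGE_SIZE;
		region.source_addr += gmem_ret * PAGE_SIZE;
		region.nr_pages -= gmem_ret;
	}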

Paolo


2024-06-07 10:12:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 10/15] KVM: x86/tdp_mmu: Reflect building mirror page tables

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
> + /* Update mirrored mapping with page table link */
> + int (*reflect_link_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *mirrored_spt);
> + /* Update the mirrored page table from spte getting set */
> + int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);

Possibly link_external_spt and set_external_spte, since you'll have to
s/mirrored/external/ in the comment. But not a hard request.

> +static void *get_mirrored_spt(gfn_t gfn, u64 new_spte, int level)
> +{
> + if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
> + struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));

I think this is spte_to_child_sp(new_spte)?

> + void *mirrored_spt = kvm_mmu_mirrored_spt(sp);
> +
> + WARN_ON_ONCE(sp->role.level + 1 != level);
> + WARN_ON_ONCE(sp->gfn != gfn);
> + return mirrored_spt;

Based on previous reviews this can be just "return sp->external_spt",
removing the not-particularly-interesting kvm_mmu_mirrored_spt()
helper.
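
i.e. roughly, with the earlier "external" naming applied (sketch; assuming
the snipped non-present case keeps returning NULL):

static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
{
	if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
		struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);

		WARN_ON_ONCE(sp->role.level + 1 != level);
		WARN_ON_ONCE(sp->gfn != gfn);
		return sp->external_spt;
	}

	return NULL;
}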

> +static int __must_check reflect_set_spte_present(struct kvm *kvm, tdp_ptep_t sptep,

tdp_mmu_set_mirror_spte_atomic?

> + /*
> + * For mirrored page table, callbacks are needed to propagate SPTE
> + * change into the mirrored page table. In order to atomically update
> + * both the SPTE and the mirrored page tables with callbacks, utilize
> + * freezing SPTE.
> + * - Freeze the SPTE. Set entry to REMOVED_SPTE.
> + * - Trigger callbacks for mirrored page tables.
> + * - Unfreeze the SPTE. Set the entry to new_spte.
> + */

/*
* We need to lock out other updates to the SPTE until the external
* page table has been modified. Use REMOVED_SPTE similar to
* the zapping case.
*/

Easy peasy. :) We may want to rename REMOVED_SPTE to FROZEN_SPTE; feel
free to do it at the head of this series, then it can be picked for
6.11.

> -static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
> +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
> {
> u64 *sptep = rcu_dereference(iter->sptep);
>
> @@ -571,15 +629,36 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
> */
> WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
>
> - /*
> - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> - * does not hold the mmu_lock. On failure, i.e. if a different logical
> - * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> - * the current value, so the caller operates on fresh data, e.g. if it
> - * retries tdp_mmu_set_spte_atomic()
> - */
> - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> - return -EBUSY;
> + if (is_mirror_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> + int ret;
> +
> + /* Don't support atomic zapping for mirrored roots */

The why is hidden in the commit message to patch 11. I wonder if it
isn't clearer to simply squash together patches 10 and 11 (your call),
and instead split out the addition of the new struct kvm parameters.

Anyway, this comment needs a bit more info:

/*
* Users of atomic zapping don't operate on mirror roots,
* so only need to handle present new_spte.
*/

> + if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
> + return -EBUSY;
> + /*
> + * Populating case.
> + * - reflect_set_spte_present() implements
> + * 1) Freeze SPTE
> + * 2) call hooks to update mirrored page table,
> + * 3) update SPTE to new_spte
> + * - handle_changed_spte() only updates stats.
> + */

Comment not needed (weird I know).


> + ret = reflect_set_spte_present(kvm, iter->sptep, iter->gfn,
> + iter->old_spte, new_spte, iter->level);
> + if (ret)
> + return ret;
> + } else {
> + /*
> + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
> + * and does not hold the mmu_lock. On failure, i.e. if a
> + * different logical CPU modified the SPTE, try_cmpxchg64()
> + * updates iter->old_spte with the current value, so the caller
> + * operates on fresh data, e.g. if it retries
> + * tdp_mmu_set_spte_atomic()
> + */
> + if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> + return -EBUSY;
> + }
>
> return 0;
> }
> @@ -610,7 +689,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> if (ret)
> return ret;
>
> @@ -636,7 +715,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * Delay processing of the zapped SPTE until after TLBs are flushed and
> * the REMOVED_SPTE is replaced (see below).
> */
> - ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
> if (ret)
> return ret;
>
> @@ -698,6 +777,11 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> role = sptep_to_sp(sptep)->role;
> role.level = level;
> handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
> +
> + /* Don't support setting for the non-atomic case */
> + if (is_mirror_sptep(sptep))
> + KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> +
> return old_spte;
> }
>
> --
> 2.34.1
>


2024-06-07 11:38:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 11/15] KVM: x86/tdp_mmu: Reflect tearing down mirror page tables

> ---
> arch/x86/include/asm/kvm-x86-ops.h | 2 ++
> arch/x86/include/asm/kvm_host.h | 8 ++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 45 ++++++++++++++++++++++++++++--
> 3 files changed, 53 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 1877d6a77525..dae06afc6038 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -97,6 +97,8 @@ KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> KVM_X86_OP_OPTIONAL(reflect_link_spt)
> KVM_X86_OP_OPTIONAL(reflect_set_spte)
> +KVM_X86_OP_OPTIONAL(reflect_free_spt)
> +KVM_X86_OP_OPTIONAL(reflect_remove_spte)
> KVM_X86_OP(has_wbinvd_exit)
> KVM_X86_OP(get_l2_tsc_offset)
> KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 20bb10f22ca6..0df4a31a0df9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1755,6 +1755,14 @@ struct kvm_x86_ops {
> int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> kvm_pfn_t pfn);
>
> + /* Update mirrored page tables for page table about to be freed */
> + int (*reflect_free_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *mirrored_spt);
> +
> + /* Update mirrored page table from spte getting removed, and flush TLB */
> + int (*reflect_remove_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);

Again, maybe free_external_spt and zap_external_spte?

Also, please rename the last argument to pfn_for_gfn. I'm not proud of
it, but it took me over 10 minutes to understand if the pfn referred
to the gfn itself, or to the external SP that holds the spte...
There's a possibility that it isn't just me. :)

(In general, this patch took me a _lot_ of time to review... there were a
couple of places that left me thoroughly puzzled, more on this
below).

> bool (*has_wbinvd_exit)(void);
>
> u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 41b1d3f26597..1245f6a48dbe 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -346,6 +346,29 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> }
>
> +static void reflect_removed_spte(struct kvm *kvm, gfn_t gfn,
> + u64 old_spte, u64 new_spte,
> + int level)

new_spte is not used and can be dropped. Also, tdp_mmu_zap_external_spte?

> +{
> + bool was_present = is_shadow_present_pte(old_spte);
> + bool was_leaf = was_present && is_last_spte(old_spte, level);

Just put it below:

if (!is_shadow_present_pte(old_spte))
return;

/* Here we only care about zapping the external leaf PTEs. */
if (!is_last_spte(old_spte, level))

> + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> + int ret;
> +
> + /*
> + * Allow only leaf page to be zapped. Reclaim non-leaf page tables page

This comment left me confused, so I'll try to rephrase and see if I
can explain what happens. Correct me if I'm wrong.

The only paths to handle_removed_pt() are:
- kvm_tdp_mmu_zap_leafs()
- kvm_tdp_mmu_zap_invalidated_roots()

but because kvm_mmu_zap_all_fast() does not operate on mirror roots,
the latter can only happen at VM destruction time.

But it's not clear why it's worth mentioning it here, or even why it
is special at all. Isn't that just what handle_removed_pt() does at
the end? Why does it matter that it's only done at VM destruction
time?

In other words, it seems to me that this comment is TMI. And if I am
wrong (which may well be), the extra information should explain the
"why" in more detail, and it should be around the call to
reflect_free_spt, not here.

> + return;
> + /* Zapping leaf spte is allowed only when write lock is held. */
> + lockdep_assert_held_write(&kvm->mmu_lock);
> + /* Because write lock is held, operation should success. */
> + ret = static_call(kvm_x86_reflect_remove_spte)(kvm, gfn, level, old_pfn);
> + KVM_BUG_ON(ret, kvm);
> +}
> +
> /**
> * handle_removed_pt() - handle a page table removed from the TDP structure
> *
> @@ -441,6 +464,22 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> old_spte, REMOVED_SPTE, sp->role, shared);
> + if (is_mirror_sp(sp)) {
> + KVM_BUG_ON(shared, kvm);
> + reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> + }
> + }
> +
> + if (is_mirror_sp(sp) &&
> + WARN_ON(static_call(kvm_x86_reflect_free_spt)(kvm, sp->gfn, sp->role.level,
> + kvm_mmu_mirrored_spt(sp)))) {

Please use base_gfn and level here, instead of fishing them from sp.

> + /*
> + * Failed to free page table page in mirror page table and
> + * there is nothing to do further.
> + * Intentionally leak the page to prevent the kernel from
> + * accessing the encrypted page.
> + */
> + sp->mirrored_spt = NULL;
> }
>
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> @@ -778,9 +817,11 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> role.level = level;
> handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
>
> - /* Don't support setting for the non-atomic case */
> - if (is_mirror_sptep(sptep))
> + if (is_mirror_sptep(sptep)) {
> + /* Only support zapping for the non-atomic case */

Like for patch 10, this comment should point out why we never get here
for mirror SPs.

Paolo

> KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> + reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> + }
>
> return old_spte;
> }
> --
> 2.34.1
>


2024-06-07 11:39:33

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] TDX MMU prep series part 1

On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
<[email protected]> wrote:
>
> Hi,
>
> This is v2 of the TDX MMU prep series, split out of the giant 130 patch
> TDX base enabling series [0]. It is focusing on the changes to the x86 MMU
> to support TDX’s separation of private/shared EPT into separate roots. A
> future breakout series will include the changes to actually interact with
> the TDX module to actually map private memory. The purpose of sending out
> a smaller series is to focus review, and hopefully rapidly iterate. We
> would like the series to go into kvm-coco-queue when it is ready.
>
> I think the maturity of these patches has significantly improved during
> the recent reviews. I expecting it still needs a little more work, but
> think that the basic structure is in decent shape at this point. Please
> consider it from the perspective of what is missing for inclusion in
> kvm-coco-queue.

Yes, now we're talking indeed. Mostly it's cosmetic tweaks or requests
to fix/improve a few comments, but overall it's very pleasing and
algorithmically clear code to read. Kudos to everyone involved.

I don't expect any big issues with v3.

Paolo


2024-06-07 18:39:38

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

On Fri, 2024-06-07 at 09:59 +0200, Paolo Bonzini wrote:
> Just one non-cosmetic request at the very end of the email.
>
> On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
> <[email protected]> wrote:
> > +static inline gfn_t kvm_gfn_root_mask(const struct kvm *kvm, const struct
> > kvm_mmu_page *root)
> > +{
> > +       if (is_mirror_sp(root))
> > +               return 0;
>
> Maybe add a comment:
>
> /*
>  * Since mirror SPs are used only for TDX, which maps private memory
>  * at its "natural" GFN, no mask needs to be applied to them - and, dually,
>  * we expect that the mask is only used for the shared PT.
>  */

Sure, seems like a good idea.

>
> > +       return kvm_gfn_direct_mask(kvm);
>
> Ok, please excuse me again for being fussy on the naming. Typically I
> think of a "mask" as something that you check against, or something
> that you do x &~ mask, not as something that you add. Maybe
> kvm_gfn_root_offset and gfn_direct_offset?
>
> I also thought of gfn_direct_fixed_bits, but I'm not sure it
> translates as well to kvm_gfn_root_fixed_bits. Anyway, I'll leave it
> to you to make a decision, speak up if you think it's not an
> improvement or if (especially for fixed_bits) it results in too long
> lines.
>
> Fortunately this kind of change is decently easy to do with a
> search/replace on the patch files themselves.

Yea, it's no problem to update the code. I'll be happy if this code is more
understandable for non-tdx developers.

As for the name, I guess I'd be less keen on "offset" because it's not clear
that it is a power-of-two value that can be used with bitwise operations.

I'm not sure what the "fixed" adds and it makes it longer. Also, many PTE bits
cannot be moved and they are not referred to as fixed, where the shared bit
actually *can* be moved via GPAW (not that the MMU code cares about that
though).

Just "bits" sounds better to me, so maybe I'll try?
kvm_gfn_direct_bits()
kvm_gfn_root_bits()
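
i.e. basically the same helper as quoted above, just renamed:

static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm,
				      const struct kvm_mmu_page *root)
{
	if (is_mirror_sp(root))
		return 0;

	return kvm_gfn_direct_bits(kvm);
}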

>
> > +}
> > +
> >   static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page
> > *sp)
> >   {
> >          /*
> > @@ -359,7 +368,12 @@ static inline int __kvm_mmu_do_page_fault(struct
> > kvm_vcpu *vcpu, gpa_t cr2_or_gp
> >          int r;
> >
> >          if (vcpu->arch.mmu->root_role.direct) {
> > -               fault.gfn = fault.addr >> PAGE_SHIFT;
> > +               /*
> > +                * Things like memslots don't understand the concept of a
> > shared
> > +                * bit. Strip it so that the GFN can be used like normal,
> > and the
> > +                * fault.addr can be used when the shared bit is needed.
> > +                */
> > +               fault.gfn = gpa_to_gfn(fault.addr) &
> > ~kvm_gfn_direct_mask(vcpu->kvm);
> >                  fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>
> Please add a comment to struct kvm_page_fault's gfn field, about how
> it differs from addr.

Doh, yes totally.

>
> > +       /* Mask applied to convert the GFN to the mapping GPA */
> > +       gfn_t gfn_mask;
>
> s/mask/offset/ or s/mask/fixed_bits/ here, if you go for it; won't
> repeat myself below.
>
> >          /* The level of the root page given to the iterator */
> >          int root_level;
> >          /* The lowest level the iterator should traverse to */
> > @@ -120,18 +122,18 @@ struct tdp_iter {
> >    * Iterates over every SPTE mapping the GFN range [start, end) in a
> >    * preorder traversal.
> >    */
> > -#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
> > -       for (tdp_iter_start(&iter, root, min_level, start); \
> > -            iter.valid && iter.gfn < end;                   \
> > +#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start,
> > end)               \
> > +       for (tdp_iter_start(&iter, root, min_level, start,
> > kvm_gfn_root_mask(kvm, root)); \
> > +            iter.valid && iter.gfn <
> > end;                                                \
> >               tdp_iter_next(&iter))
> >
> > -#define for_each_tdp_pte(iter, root, start, end) \
> > -       for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
> > +#define for_each_tdp_pte(iter, kvm, root, start,
> > end)                          \
> > +       for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
>
> Maybe add the kvm pointer / remove the mmu pointer in a separate patch
> to make the mask-related changes easier to identify?

Hmm, yea. I can split it.

>
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7c593a081eba..0e6325b5f5e7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13987,6 +13987,16 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu,
> > unsigned int size,
> >   }
> >   EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
> >
> > +#ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64
> > nr_pages)
> > +{
> > +       if (!kvm_x86_ops.flush_remote_tlbs_range ||
> > kvm_gfn_direct_mask(kvm))
>
> I think the code need not check kvm_gfn_direct_mask() here? In the old
> patches that I have, the check is done in the vmx/main.c callback.

You mean a VMX/TDX implementation of flush_remote_tlbs_range that just returns
-EOPNOTSUPP? Which version of the patches is this? I couldn't find anything like
that.

But I guess we could add one in the later patches, in which case we could drop
the hunk in this one. The benefit I see is less churn.


The downside would be spreading out the concern of dealing with multiple
aliases for a GFN. Currently, the handling of multiple aliases is
implemented in core MMU code. While it's fine to pollute tdx.c with TDX-specific
knowledge of course, removing the handling of this corner from mmu.c might make
it less understandable for non-TDX readers who are working in MMU code.
Basically, if a concept fits into some non-TDX abstraction like this, having it
in core code seems the better default to me.

For this reason, my preference would be to leave the logic in core code. But I'm
fine changing it. I'll move it into the tdx.c for now, unless you are convinced
by the above.
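
Something like this in vmx/main.c, I guess (just a sketch; is_td() and the
vt_ prefix come from the broader TDX series, and falling back to
hv_flush_remote_tlbs_range() is an assumption on my part):

static int vt_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
				      gfn_t nr_pages)
{
	/*
	 * A TD maps each GFN at both a private and a shared alias, so a
	 * plain GFN range is ambiguous; punt and let KVM do a full flush.
	 */
	if (is_td(kvm))
		return -EOPNOTSUPP;

	return hv_flush_remote_tlbs_range(kvm, gfn, nr_pages);
}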

2024-06-07 20:07:04

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 08/15] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type

On Fri, 2024-06-07 at 10:10 +0200, Paolo Bonzini wrote:
> On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
> <[email protected]> wrote:
> > +enum kvm_tdp_mmu_root_types {
> > +       KVM_VALID_ROOTS = BIT(0),
> > +
> > +       KVM_ANY_ROOTS = 0,
> > +       KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
>
> I would instead define it as
>
>     KVM_INVALID_ROOTS = BIT(0),
>     KVM_VALID_ROOTS = BIT(1),
>     KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
>
> and then
>
>   if (root->role.invalid)
>     return types & KVM_INVALID_ROOTS;
>   else
>     return types & KVM_VALID_ROOTS;
>
> Then in the next patch you can do
>
>      KVM_INVALID_ROOTS = BIT(0),
> -    KVM_VALID_ROOTS = BIT(1),
> +   KVM_DIRECT_ROOTS = BIT(1),
> +   KVM_MIRROR_ROOTS = BIT(2),
> +   KVM_VALID_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
>      KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
>
> and likewise
>
>   if (root->role.invalid)
>     return types & KVM_INVALID_ROOTS;
>   else if (likely(!is_mirror_sp(root)))
>     return types & KVM_DIRECT_ROOTS;
>   else
>     return types & KVM_MIRROR_ROOTS;
>
> This removes the need for KVM_ANY_VALID_ROOTS (btw I don't know if
> it's me, but ALL sounds more grammatical than ANY in this context). So
> the resulting code should be a bit clearer.
>
> Apart from this small tweak, the overall idea is really good.

Yes, this makes more sense. Thanks.

2024-06-07 20:27:31

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Fri, 2024-06-07 at 10:46 +0200, Paolo Bonzini wrote:
> On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
> <[email protected]> wrote:
> > +static inline bool kvm_on_mirror(const struct kvm *kvm, enum kvm_process
> > process)
> > +{
> > +       if (!kvm_has_mirrored_tdp(kvm))
> > +               return false;
> > +
> > +       return process & KVM_PROCESS_PRIVATE;
> > +}
> > +
> > +static inline bool kvm_on_direct(const struct kvm *kvm, enum kvm_process
> > process)
> > +{
> > +       if (!kvm_has_mirrored_tdp(kvm))
> > +               return true;
> > +
> > +       return process & KVM_PROCESS_SHARED;
> > +}
>
> These helpers don't add much, it's easier to just write
> kvm_process_to_root_types() as
>
> if (process & KVM_PROCESS_SHARED)
>   ret |= KVM_DIRECT_ROOTS;
> if (process & KVM_PROCESS_PRIVATE)
>   ret |= kvm_has_mirror_tdp(kvm) ? KVM_MIRROR_ROOTS : KVM_DIRECT_ROOTS;
>
> WARN_ON_ONCE(!ret);
> return ret;

The point of kvm_on_mirror() and kvm_on_direct() was to try to centralize TDX
specific checks in mmu.h. Leaving aside whether that is a good idea, looking at
it now I'm not sure it even does a good job of that.

I'll try it the way you suggest.

>
> (Here I chose to rename kvm_has_mirrored_tdp to kvm_has_mirror_tdp;
> but if you prefer to call it kvm_has_external_tdp, it's just as
> readable. Whatever floats your boat).
>
> >          struct kvm *kvm = vcpu->kvm;
> > -       struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> > +       enum kvm_tdp_mmu_root_types root_type =
> > tdp_mmu_get_fault_root_type(kvm, fault);
> > +       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
>
> Please make this
>
>   struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);

Yes, that seems clearer.


[snip]

>
> > +static inline enum kvm_tdp_mmu_root_types
> > tdp_mmu_get_fault_root_type(struct kvm *kvm,
> > +                                                                     struct
> > kvm_page_fault *fault)
> > +{
> > +       if (fault->is_private)
> > +               return kvm_process_to_root_types(kvm, KVM_PROCESS_PRIVATE);
> > +       return KVM_DIRECT_ROOTS;
> > +}
> > +
> > +static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
> > enum kvm_tdp_mmu_root_types type)
> > +{
> > +       if (type == KVM_MIRROR_ROOTS)
> > +               return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> > +       return root_to_sp(vcpu->arch.mmu->root.hpa);
> > +}
>
> static inline struct kvm_mmu_page *
> tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault
> *fault)
> {
>   hpa_t root_hpa;
>   if (unlikely(fault->is_private && kvm_has_mirror_tdp(kvm)))
>     root_hpa = vcpu->arch.mmu->mirror_root_hpa;
>   else
>     root_hpa = vcpu->arch.mmu->root.hpa;
>   return root_to_sp(root_hpa);
> }

I was not loving the amount of indirection here in the patch, but thought it
centralized the logic a bit better. This way seems good, given that the actual
logic is not that complex.

2024-06-07 20:53:11

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 10/15] KVM: x86/tdp_mmu: Reflect building mirror page tables

On Fri, 2024-06-07 at 12:10 +0200, Paolo Bonzini wrote:
> On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
> <[email protected]> wrote:
> > +       /* Update mirrored mapping with page table link */
> > +       int (*reflect_link_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                               void *mirrored_spt);
> > +       /* Update the mirrored page table from spte getting set */
> > +       int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                               kvm_pfn_t pfn);
>
> Possibly link_external_spt and set_external_spte, since you'll have to
> s/mirrored/external/ in the comment. But not a hard request.

Definitely seems better now that we have the "external" nomenclature.

>
> > +static void *get_mirrored_spt(gfn_t gfn, u64 new_spte, int level)
> > +{
> > +       if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
> > +               struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
>
> I think this is spte_to_child_sp(new_spte)?

Yes, that seems much easier than all this.
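That is, something along these lines (sketch, not the literal hunk;
kvm_mmu_mirrored_spt() is the accessor this series adds for the mirrored page
table page):

static void *get_mirrored_spt(gfn_t gfn, u64 new_spte, int level)
{
        if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
                /* The new SPTE points at a child SP; hand back its mirrored PT page. */
                struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);

                return kvm_mmu_mirrored_spt(sp);
        }

        return NULL;
}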


[snip]

>
> > +static int __must_check reflect_set_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
>
> tdp_mmu_set_mirror_spte_atomic?
>
> > +       /*
> > +        * For mirrored page table, callbacks are needed to propagate SPTE
> > +        * change into the mirrored page table. In order to atomically update
> > +        * both the SPTE and the mirrored page tables with callbacks, utilize
> > +        * freezing SPTE.
> > +        * - Freeze the SPTE. Set entry to REMOVED_SPTE.
> > +        * - Trigger callbacks for mirrored page tables.
> > +        * - Unfreeze the SPTE.  Set the entry to new_spte.
> > +        */
>
> /*
>  * We need to lock out other updates to the SPTE until the external
>  * page table has been modified. Use REMOVED_SPTE similar to
>  * the zapping case.
>  */
>
> Easy peasy. :) We may want to rename REMOVED_SPTE to FROZEN_SPTE; feel
> free to do it at the head of this series, then it can be picked for
> 6.11.

Ok.

>
> > -static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
> > +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
> >  {
> >         u64 *sptep = rcu_dereference(iter->sptep);
> >
> > @@ -571,15 +629,36 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
> >          */
> >         WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
> >
> > -       /*
> > -        * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> > -        * does not hold the mmu_lock.  On failure, i.e. if a different logical
> > -        * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> > -        * the current value, so the caller operates on fresh data, e.g. if it
> > -        * retries tdp_mmu_set_spte_atomic()
> > -        */
> > -       if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> > -               return -EBUSY;
> > +       if (is_mirror_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> > +               int ret;
> > +
> > +               /* Don't support atomic zapping for mirrored roots */
>
> The why is hidden in the commit message to patch 11. I wonder if it
> isn't clearer to simply squash together patches 10 and 11 (your call),
> and instead split out the addition of the new struct kvm parameters.

I actually split them in two for this v2. I thought the combined patch was too
big. Maybe I could just move this whole hunk to the next patch. I'll give it a
try.

>
> Anyway, this comment needs a bit more info:
>
> /*
>  * Users of atomic zapping don't operate on mirror roots,
>  * so only need to handle present new_spte.
>  */

Ok.

>
> > +               if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
> > +                       return -EBUSY;
> > +               /*
> > +                * Populating case.
> > +                * - reflect_set_spte_present() implements
> > +                *   1) Freeze SPTE
> > +                *   2) call hooks to update mirrored page table,
> > +                *   3) update SPTE to new_spte
> > +                * - handle_changed_spte() only updates stats.
> > +                */
>
> Comment not needed (weird I know).

Sure.
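
For reference, the populating flow that the removed comment was describing
boils down to roughly the below (untested sketch in my own words; it omits the
handle_changed_spte() bookkeeping and the exact signature, and uses the
callback names from this version of the series):

static int __must_check reflect_set_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
                                                 u64 new_spte, gfn_t gfn, int level)
{
        kvm_pfn_t pfn = spte_to_pfn(new_spte);
        u64 *raw_sptep = rcu_dereference(sptep);
        u64 old_spte = READ_ONCE(*raw_sptep);
        int ret;

        /* 1) Freeze the SPTE so concurrent updaters see REMOVED_SPTE and back off. */
        if (!try_cmpxchg64(raw_sptep, &old_spte, REMOVED_SPTE))
                return -EBUSY;

        /* 2) Propagate the change to the mirrored (external) page tables. */
        ret = static_call(kvm_x86_reflect_set_spte)(kvm, gfn, level, pfn);
        if (ret) {
                /* Roll back: unfreeze with the old value. */
                WRITE_ONCE(*raw_sptep, old_spte);
                return ret;
        }

        /* 3) Unfreeze: publish the new SPTE. */
        WRITE_ONCE(*raw_sptep, new_spte);
        return 0;
}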



2024-06-07 21:47:17

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 11/15] KVM: x86/tdp_mmu: Reflect tearing down mirror page tables

On Fri, 2024-06-07 at 13:37 +0200, Paolo Bonzini wrote:
> >
> > +       /* Update mirrored page tables for page table about to be freed */
> > +       int (*reflect_free_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                               void *mirrored_spt);
> > +
> > +       /* Update mirrored page table from spte getting removed, and flush TLB */
> > +       int (*reflect_remove_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                                  kvm_pfn_t pfn);
>
> Again, maybe free_external_spt and zap_external_spte?

Yep, I'm on board.

>
> Also, please rename the last argument to pfn_for_gfn. I'm not proud of
> it, but it took me over 10 minutes to understand if the pfn referred
> to the gfn itself, or to the external SP that holds the spte...
> There's a possibility that it isn't just me. :)

Ah, I see how that could be confusing.

>
> (In general, this patch took me a _lot_ to review... there were a
> couple of places that left me incomprehensibly puzzled, more on this
> below).

Sorry for that. Thanks for taking the time to weed through it anyway.

>
> >          bool (*has_wbinvd_exit)(void);
> >
> >          u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 41b1d3f26597..1245f6a48dbe 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -346,6 +346,29 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >          spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >   }
> >
> > +static void reflect_removed_spte(struct kvm *kvm, gfn_t gfn,
> > +                                       u64 old_spte, u64 new_spte,
> > +                                       int level)
>
> new_spte is not used and can be dropped. Also, tdp_mmu_zap_external_spte?

Oh, yep. In v19 there used to be a KVM_BUG_ON(), but we can get rid of it.

>
> > +{
> > +       bool was_present = is_shadow_present_pte(old_spte);
> > +       bool was_leaf = was_present && is_last_spte(old_spte, level);
>
> Just put it below:
>
> if (!is_shadow_present_pte(old_spte))
>   return;

Right, this is another miss from removing the KVM_BUG_ON()s.

>
> /* Here we only care about zapping the external leaf PTEs. */
> if (!is_last_spte(old_spte, level))
>
> > +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > +       int ret;
> > +
> > +       /*
> > +        * Allow only leaf page to be zapped. Reclaim non-leaf page tables page
>
> This comment left me confused, so I'll try to rephrase and see if I
> can explain what happens. Correct me if I'm wrong.
>
> The only paths to handle_removed_pt() are:
> - kvm_tdp_mmu_zap_leafs()
> - kvm_tdp_mmu_zap_invalidated_roots()
>
> but because kvm_mmu_zap_all_fast() does not operate on mirror roots,
> the latter can only happen at VM destruction time.
>
> But it's not clear why it's worth mentioning it here, or even why it
> is special at all. Isn't that just what handle_removed_pt() does at
> the end? Why does it matter that it's only done at VM destruction
> time?
>
> In other words, it seems to me that this comment is TMI. And if I am
> wrong (which may well be), the extra information should explain the
> "why" in more detail, and it should be around the call to
> reflect_free_spt, not here.

TDX of course has the limitation around the ordering of the zapping S-EPT. So I
read the comment to be referring to how the implementation avoids zapping any
non-leaf PTEs during TD runtime.

But I'm going to have to circle back here after investigating a bit more. Isaku,
any comments on this comment and conditional?

>
> > +               return;
> > +       /* Zapping leaf spte is allowed only when write lock is held. */
> > +       lockdep_assert_held_write(&kvm->mmu_lock);
> > +       /* Because write lock is held, operation should success. */
> > +       ret = static_call(kvm_x86_reflect_remove_spte)(kvm, gfn, level, old_pfn);
> > +       KVM_BUG_ON(ret, kvm);
> > +}
> > +
> >   /**
> >    * handle_removed_pt() - handle a page table removed from the TDP
> > structure
> >    *
> > @@ -441,6 +464,22 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> >                  }
> >                  handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> >                                      old_spte, REMOVED_SPTE, sp->role, shared);
> > +               if (is_mirror_sp(sp)) {
> > +                       KVM_BUG_ON(shared, kvm);
> > +                       reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> > +               }
> > +       }
> > +
> > +       if (is_mirror_sp(sp) &&
> > +           WARN_ON(static_call(kvm_x86_reflect_free_spt)(kvm, sp->gfn, sp->role.level,
> > +                                                         kvm_mmu_mirrored_spt(sp)))) {
>
> Please use base_gfn and level here, instead of fishing them from sp.

Oh, yep, thanks.

>
> > +               /*
> > +                * Failed to free page table page in mirror page table and
> > +                * there is nothing to do further.
> > +                * Intentionally leak the page to prevent the kernel from
> > +                * accessing the encrypted page.
> > +                */
> > +               sp->mirrored_spt = NULL;
> >          }
> >
> >          call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > @@ -778,9 +817,11 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >          role.level = level;
> >          handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
> >
> > -       /* Don't support setting for the non-atomic case */
> > -       if (is_mirror_sptep(sptep))
> > +       if (is_mirror_sptep(sptep)) {
> > +               /* Only support zapping for the non-atomic case */
>
> Like for patch 10, this comment should point out why we never get here
> for mirror SPs.

Ok.
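
Something along these lines, then (a guess at the final shape, since the rest
of the hunk is cut off above; untested):

        if (is_mirror_sptep(sptep)) {
                /*
                 * Users of the non-atomic path never set present SPTEs in
                 * mirror roots; those only come through the atomic path. So
                 * only zapping needs to be handled here.
                 */
                KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
                reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
        }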



2024-06-07 22:12:45

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 13/15] KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process

On Fri, 2024-06-07 at 10:56 +0200, Paolo Bonzini wrote:
> Subject: propagate enum kvm_process to MMU notifier callbacks
>
> But again, the naming... I don't like kvm_process - in an OS process
> is a word with a clear meaning. Yes, that is a noun and this is a
> verb, but then naming an enum with a verb is also awkward.
>
> Perhaps kvm_gfn_range_filter and range->attr_filter? A bit wordy but very
> clear:
>
> enum kvm_tdp_mmu_root_types types =
>     kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter)
>
> I think I like it.

Agree 'process' sticks out. Somehow having attr_filter and args.attributes in
the same struct feels a bit wrong. Not that process was a lot better.

I guess attr_filter is more about alias ranges, and args.attribute is more about
conversion to various types of memory (private, shared and ideas of other types
I guess). But since today we only have private and shared, I wonder if there is
some way to combine them within struct kvm_gfn_range? I've not thought this all
the way through.
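
For reference, the rename being floated would look roughly like the below
(sketch; the KVM_FILTER_* value names and the exact kvm_has_mirrored_tdp()
short-circuit are my guesses, not something settled in this thread):

enum kvm_gfn_range_filter {
        KVM_FILTER_SHARED       = BIT(0),
        KVM_FILTER_PRIVATE      = BIT(1),
};

static inline enum kvm_tdp_mmu_root_types
kvm_gfn_range_filter_to_root_types(struct kvm *kvm, enum kvm_gfn_range_filter filter)
{
        enum kvm_tdp_mmu_root_types ret = 0;

        if (!kvm_has_mirrored_tdp(kvm))
                return KVM_DIRECT_ROOTS;

        if (filter & KVM_FILTER_PRIVATE)
                ret |= KVM_MIRROR_ROOTS;
        if (filter & KVM_FILTER_SHARED)
                ret |= KVM_DIRECT_ROOTS;

        return ret;
}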

>
> This patch also should be earlier in the series; please move it after
> patch 9, i.e. right after kvm_process_to_root_types is introduced.

Hmm, I thought I remembered having to move this to be later, but I don't see any
problems moving it earlier. Thanks for pointing it out.

2024-06-07 22:31:59

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 14/15] KVM: x86/tdp_mmu: Invalidate correct roots

On Fri, 2024-06-07 at 11:03 +0200, Paolo Bonzini wrote:
> On Thu, May 30, 2024 at 11:07 PM Rick Edgecombe
> <[email protected]> wrote:
> >
> > From: Sean Christopherson <[email protected]>
> >
> > When invalidating roots, respect the root type passed.
> >
> > kvm_tdp_mmu_invalidate_roots() is called with different root types. For
> > kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
> > down a VM it needs to invalidate all roots. Check the root type in root
> > iterator.
>
> This patch and patch 12 are small enough that they can be merged.

Sure.

>
> > @@ -1135,6 +1135,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> >   void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
> >                                    enum kvm_process process_types)
> >   {
> > +       enum kvm_tdp_mmu_root_types root_types = kvm_process_to_root_types(kvm, process_types);
>
> Maybe pass directly enum kvm_tdp_mmu_root_types?
>
> Looking at patch 12:
>
> + /*
> + * The private page tables doesn't support fast zapping.  The
> + * caller should handle it by other way.
> + */
> + kvm_tdp_mmu_invalidate_roots(kvm, KVM_PROCESS_SHARED);
>
> now that we have separated private-ness and external-ness, it sounds
> much better to write:
>
> /*
>  * External page tables don't support fast zapping, therefore
>  * their mirrors must be invalidated separately by the caller.
>  */
> kvm_tdp_mmu_invalidate_roots(kvm, KVM_DIRECT_ROOTS);
>
> while kvm_mmu_uninit_tdp_mmu() can pass KVM_ANY_ROOTS. It may also be
> worth adding an
>
> if (WARN_ON_ONCE(root_types & KVM_INVALID_ROOTS))
>   root_types &= ~KVM_INVALID_ROOTS;
>
> to document the invariants.

Yes, I agree taking the kvm_tdp_mmu_root_types enum here makes more sense,
especially with that comment you wrote.
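
Something like the below, then (rough, untested sketch; tdp_mmu_root_match()
stands in for whatever root-type matching helper this ends up using, and the
refcounting/locking details of the real function are omitted):

void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
                                  enum kvm_tdp_mmu_root_types root_types)
{
        struct kvm_mmu_page *root;

        /* Invalid roots are already invalid; document the invariant. */
        if (WARN_ON_ONCE(root_types & KVM_INVALID_ROOTS))
                root_types &= ~KVM_INVALID_ROOTS;

        list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
                /* Skip roots the caller didn't ask to invalidate. */
                if (!tdp_mmu_root_match(root, root_types))
                        continue;

                root->role.invalid = true;
        }
}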

Thanks.

2024-06-07 22:55:15

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 00/15] TDX MMU prep series part 1

On Fri, 2024-06-07 at 13:39 +0200, Paolo Bonzini wrote:
>
> Yes, now we're talking indeed. Mostly it's cosmetic tweaks or requests
> to fix/improve a few comments, but overall it's very pleasing and
> algorithmically clear code to read. Kudos to everyone involved.
>
> I don't expect any big issues with v3.

Thanks so much for the review. We'll try to turn this around quickly.

In the meantime, we will likely be posting two generic series that were split
out of this one:
- Making max_gfn configurable per-vm.
- Create a quirk for memslot deletion zapping all PTEs.

2024-06-07 23:39:49

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 15/15] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

On Fri, 2024-06-07 at 11:31 +0200, Paolo Bonzini wrote:
> > -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > -                        int *root_level)
> > +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > +                                 enum kvm_tdp_mmu_root_types root_type)
> >   {
> > -       struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> > +       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
>
> I think this function should take the struct kvm_mmu_page * directly.
>
> > +{
> > +       *root_level = vcpu->arch.mmu->root_role.level;
> > +
> > +       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);
>
> Here you pass root_to_sp(vcpu->arch.mmu->root.hpa);

I see. It is another case of more indirection to try to send the decision making
through the helpers. We can try to open code things more.

>
> > +int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> > +                                    kvm_pfn_t *pfn)
> > +{
> > +       u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> > +       int leaf;
> > +
> > +       lockdep_assert_held(&vcpu->kvm->mmu_lock);
> > +
> > +       rcu_read_lock();
> > +       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
>
> and likewise here.
>
> You might also consider using a kvm_mmu_root_info for the mirror root,
> even though the pgd field is not used.

This came up on the last version actually. The reason against it was that it
used that tiny bit of extra memory for the pgd. It does look more symmetrical
though.

>
> Then __kvm_tdp_mmu_get_walk() can take a struct kvm_mmu_root_info * instead.

Ahh, I see. Yes, that's a good reason.

>
> kvm_tdp_mmu_get_walk_mirror_pfn() doesn't belong in this series, but
> introducing __kvm_tdp_mmu_get_walk() can stay here.

Ok, we can split it.

>
> Looking at the later patch, which uses
> kvm_tdp_mmu_get_walk_mirror_pfn(), I think this function is a bit
> overkill. I'll do a pre-review of the init_mem_region function,
> especially the usage of kvm_gmem_populate:
>
> +    slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +    if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
> +        ret = -EFAULT;
> +        goto out_put_page;
> +    }
>
> The slots_lock is taken, so checking kvm_slot_can_be_private is unnecessary.
>
> Checking kvm_mem_is_private perhaps should also be done in
> kvm_gmem_populate() itself. I'll send a patch.
>
> +    read_lock(&kvm->mmu_lock);
> +
> +    ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> +    if (ret < 0)
> +        goto out;
> +    if (ret > PG_LEVEL_4K) {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +    if (mmu_pfn != pfn) {
> +        ret = -EAGAIN;
> +        goto out;
> +    }
>
> If you require pre-faulting, you don't need to return mmu_pfn and
> things would be seriously wrong if the two didn't match, wouldn't
> they?

Yea, I'm not sure why it would be a normal condition. Maybe Isaku can comment on
the thinking?

> You are executing with the filemap_invalidate_lock() taken, and
> therefore cannot race with kvm_gmem_punch_hole(). (Soapbox mode on:
> this is the advantage of putting the locking/looping logic in common
> code, kvm_gmem_populate() in this case).
>
> Therefore, I think kvm_tdp_mmu_get_walk_mirror_pfn() can be replaced just with
>
> int kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
> {
>   struct kvm *kvm = vcpu->kvm;
>   bool is_direct = !kvm_has_mirrored_tdp(...) || (gpa & kvm->arch.direct_mask);
>   hpa_t root = is_direct ? ... : ...;
>
>   lockdep_assert_held(&vcpu->kvm->mmu_lock);
>   rcu_read_lock();
>   leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
>   rcu_read_unlock();
>   if (leaf < 0)
>     return false;
>
>   spte = sptes[leaf];
>   return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
> }
> EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
>
> +    while (region.nr_pages) {
> +        if (signal_pending(current)) {
> +            ret = -EINTR;
> +            break;
> +        }
>
> Checking signal_pending() should be done in kvm_gmem_populate() -
> again, I'll take care of that. The rest of the loop is not necessary -
> just call kvm_gmem_populate() with the whole range and enjoy. You get
> a nice API that is consistent with the intended KVM_PREFAULT_MEMORY
> ioctl, because kvm_gmem_populate() returns the number of pages it has
> processed and you can use that to massage and copy back the struct
> kvm_tdx_init_mem_region.
>
> +        arg = (struct tdx_gmem_post_populate_arg) {
> +            .vcpu = vcpu,
> +            .flags = cmd->flags,
> +        };
> +        gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> +                         u64_to_user_ptr(region.source_addr),
> +                         1, tdx_gmem_post_populate, &arg);

Ok, thanks for the early comments. We can also drop those pieces as they move
into gmem code.

We were discussing starting to do some early public work on the rest of the MMU
series (that includes this patch) and the user API around VM and vCPU creation.
As in, not have the patches fully ready, but to just work on it in public. This
would probably follow finishing this series up.

It's all tentative, but just to give some idea of where we're at with the rest
of the series.

2024-06-08 08:53:04

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

On Fri, Jun 7, 2024 at 8:39 PM Edgecombe, Rick P
<[email protected]> wrote:
> > > + return kvm_gfn_direct_mask(kvm);
> >
> > Ok, please excuse me again for being fussy on the naming. Typically I
> > think of a "mask" as something that you check against, or something
> > that you do x &~ mask, not as something that you add. Maybe
> > kvm_gfn_root_offset and gfn_direct_offset?
>
> As for the name, I guess I'd be less keen on "offset" because it's not clear
> that it is a power-of-two value that can be used with bitwise operations.
>
> I'm not sure what the "fixed" adds and it makes it longer. Also, many PTE bits
> cannot be moved and they are not referred to as fixed, where the shared bit
> actually *can* be moved via GPAW (not that the MMU code cares about that
> though).
>
> Just "bits" sounds better to me, so maybe I'll try?
> kvm_gfn_direct_bits()
> kvm_gfn_root_bits()

Yep, kvm_gfn_direct_bits and kvm_gfn_root_bits are good.

Paolo


2024-06-08 09:08:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

On Fri, Jun 7, 2024 at 8:39 PM Edgecombe, Rick P
<[email protected]> wrote:
> > I think the code need not check kvm_gfn_direct_mask() here? In the old
> > patches that I have it check kvm_gfn_direct_mask() in the vmx/main.c
> > callback.
>
> You mean a VMX/TDX implementation of flush_remote_tlbs_range that just returns
> -EOPNOTSUPP? Which version of the patches is this? I couldn't find anything like
> that.

Something from Intel's GitHub, roughly June 2023... Looking at the
whole history, it starts with

    if (!kvm_x86_ops.flush_remote_tlbs_range)
        return -EOPNOTSUPP;

    return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);

and it only assigns the callback in vmx.c (not main.c); then it adds
an implementation of the callback for TDX that has:

static int vt_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
                                      gfn_t nr_pages)
{
    if (is_td(kvm))
        return tdx_sept_flush_remote_tlbs_range(kvm, gfn, nr_pages);

    /* fallback to flush_remote_tlbs method */
    return -EOPNOTSUPP;
}

where the callback knows that it should flush both private GFN and
shared GFN. So I didn't remember it correctly, but still there is no
check for the presence of direct-mapping bits.

> The downside would be wider distribution of the concerns for dealing with
> multiple aliases for a GFN. Currently, the behavior to have multiple aliases is
> implemented in core MMU code. While it's fine to pollute tdx.c with TDX specific
> knowledge of course, removing the handling of this corner from mmu.c might make
> it less understandable for non-tdx readers who are working in MMU code.
> Basically, if a concept fits into some non-TDX abstraction like this, having it
> in core code seems the better default to me.

I am not sure why it's an MMU concept that "if you offset the shared
mappings you cannot implement flush_remote_tlbs_range". It seems more
like, you need to know what you're doing?

Right now it makes no difference because you don't set the callback;
but if you ever wanted to implement flush_remote_tlbs_range as an
optimization you'd have to remove the condition from the "if". So it's
better not to have it in the first place.

Perhaps add a comment instead, like:

    if (!kvm_x86_ops.flush_remote_tlbs_range)
        return -EOPNOTSUPP;

+   /*
+    * If applicable, the callback should flush GFNs both with and without
+    * the direct-mapping bits.
+    */
    return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);

Paolo


2024-06-08 09:14:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Fri, Jun 7, 2024 at 10:27 PM Edgecombe, Rick P
<[email protected]> wrote:
> > static inline struct kvm_mmu_page *
> > tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> >   hpa_t root_hpa;
> >   if (unlikely(fault->is_private && kvm_has_mirror_tdp(kvm)))
> >     root_hpa = vcpu->arch.mmu->mirror_root_hpa;
> >   else
> >     root_hpa = vcpu->arch.mmu->root.hpa;
> >   return root_to_sp(root_hpa);
> > }
>
> I was not loving the amount of indirection here in the patch, but thought it
> centralized the logic a bit better. This way seems good, given that the actual
> logic is not that complex.

My proposed implementation is a bit TDX-specific though... Something
like this is more agnostic, and it exploits nicely the difference
between fault->addr and fault->gfn:

    if (!kvm_gfn_direct_mask(kvm) ||
        (gpa_to_gfn(fault->addr) & kvm_gfn_direct_mask(kvm)))
        root_hpa = vcpu->arch.mmu->root.hpa;
    else
        root_hpa = vcpu->arch.mmu->mirror_root_hpa;
    return root_to_sp(root_hpa);

Paolo


2024-06-08 09:17:07

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 13/15] KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process

On Sat, Jun 8, 2024 at 12:12 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Fri, 2024-06-07 at 10:56 +0200, Paolo Bonzini wrote:
> > Subject: propagate enum kvm_process to MMU notifier callbacks
> >
> > But again, the naming... I don't like kvm_process - in an OS process
> > is a word with a clear meaning. Yes, that is a noun and this is a
> > verb, but then naming an enum with a verb is also awkward.
> >
> > Perhaps kvm_gfn_range_filter and range->attr_filter? A bit wordy but very
> > clear:
> >
> > enum kvm_tdp_mmu_root_types types =
> > kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter)
> >
> > I think I like it.
>
> Agree 'process' sticks out. Somehow having attr_filter and args.attributes in
> the same struct feels a bit wrong. Not that process was a lot better.
>
> I guess attr_filter is more about alias ranges, and args.attribute is more about
> conversion to various types of memory (private, shared and ideas of other types
> I guess). But since today we only have private and shared, I wonder if there is
> some way to combine them within struct kvm_gfn_range? I've not thought this all
> the way through.

I think it's better that they stay separate. One is an argument
(args.attribute), the other is not, it should be clear enough.

Paolo


2024-06-08 09:19:08

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 15/15] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

On Sat, Jun 8, 2024 at 1:39 AM Edgecombe, Rick P
<[email protected]> wrote:
> We were discussing starting to do some early public work on the rest of the MMU
> series (that includes this patch) and the user API around VM and vCPU creation.
> As in, not have the patches fully ready, but to just work on it in public. This
> would probably follow finishing this series up.
>
> It's all tentative, but just to give some idea of where we're at with the rest
> of the series.

Yes, I think we're at the point where we can start pipelining.

Paolo


2024-06-08 09:26:27

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 11/15] KVM: x86/tdp_mmu: Reflect tearing down mirror page tables

On Fri, Jun 7, 2024 at 11:46 PM Edgecombe, Rick P
<[email protected]> wrote:
> > Also, please rename the last argument to pfn_for_gfn. I'm not proud of
> > it, but it took me over 10 minutes to understand if the pfn referred
> > to the gfn itself, or to the external SP that holds the spte...
> > There's a possibility that it isn't just me. :)
>
> Ah, I see how that could be confusing.
>
> > (In general, this patch took me a _lot_ to review... there were a
> > couple of places that left me incomprehensibly puzzled, more on this
> > below).
>
> Sorry for that. Thanks for taking the time to weed through it anyway.

Oh, don't worry, I have no idea what made this patch stick out as the
hardest one... Maybe it's really that there is such a thing as too
many comments in some cases, and also the mutual recursion between
handle_removed_pt() and handle_changed_spte(). Nothing that can't be
fixed.

> > > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > > + int ret;
> > > +
> > > + /*
> > > + * Allow only leaf page to be zapped. Reclaim non-leaf page tables page
> >
> > This comment left me confused, so I'll try to rephrase and see if I
> > can explain what happens. Correct me if I'm wrong.
> >
> > The only paths to handle_removed_pt() are:
> > - kvm_tdp_mmu_zap_leafs()
> > - kvm_tdp_mmu_zap_invalidated_roots()
> >
> > but because kvm_mmu_zap_all_fast() does not operate on mirror roots,
> > the latter can only happen at VM destruction time.
> >
> > But it's not clear why it's worth mentioning it here, or even why it
> > is special at all. Isn't that just what handle_removed_pt() does at
> > the end? Why does it matter that it's only done at VM destruction
> > time?
> >
> > In other words, it seems to me that this comment is TMI. And if I am
> > wrong (which may well be), the extra information should explain the
> > "why" in more detail, and it should be around the call to
> > reflect_free_spt, not here.
>
> TDX of course has the limitation around the ordering of the zapping S-EPT. So I
> read the comment to be referring to how the implementation avoids zapping any
> non-leaf PTEs during TD runtime.

Ok, I think then you can just replace it with a comment that explains
why TDX needs it, as it's one of those places where we let TDX
assumptions creep into common code - which is not a big deal per se,
it's just worth mentioning it in a comment. Unlike the
tdp_mmu_get_root_for_fault() remark I have just sent for patch 9/15,
here the assumption is more algorithmic and less about a specific line
of code, and I think that makes it okay.

Paolo

> > > - /* Don't support setting for the non-atomic case */
> > > - if (is_mirror_sptep(sptep))
> > > + if (is_mirror_sptep(sptep)) {
> > > + /* Only support zapping for the non-atomic case */
> >
> > Like for patch 10, this comment should point out why we never get here
> > for mirror SPs.
>
> Ok.


2024-06-09 23:26:16

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 06/15] KVM: x86/mmu: Support GFN direct mask

On Sat, 2024-06-08 at 11:08 +0200, Paolo Bonzini wrote:
> > The downside would be wider distribution of the concerns for dealing with
> > multiple aliases for a GFN. Currently, the behavior to have multiple aliases
> > is
> > implemented in core MMU code. While it's fine to pollute tdx.c with TDX
> > specific
> > knowledge of course, removing the handling of this corner from mmu.c might
> > make
> > it less understandable for non-tdx readers who are working in MMU code.
> > Basically, if a concept fits into some non-TDX abstraction like this, having
> > it
> > in core code seems the better default to me.
>
> I am not sure why it's an MMU concept that "if you offset the shared
> mappings you cannot implement flush_remote_tlbs_range". It seems more
> like, you need to know what you're doing?
>
> Right now it makes no difference because you don't set the callback;
> but if you ever wanted to implement flush_remote_tlbs_range as an
> optimization you'd have to remove the condition from the "if". So it's
> better not to have it in the first place.

Yea that's true.

>
> Perhaps add a comment instead, like:
>
>      if (!kvm_x86_ops.flush_remote_tlbs_range)
>          return -EOPNOTSUPP;
>
> +    /*
> +     * If applicable, the callback should flush GFNs both with and without
> +     * the direct-mapping bits.
> +     */
>      return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);

Ok, works for me.
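
i.e., the arch helper stays as-is, just gaining the comment. A sketch of how it
would read, assuming the function keeps its current shape in mmu.c:

int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
{
        if (!kvm_x86_ops.flush_remote_tlbs_range)
                return -EOPNOTSUPP;

        /*
         * If applicable, the callback should flush GFNs both with and without
         * the direct-mapping bits.
         */
        return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
}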

2024-06-10 00:09:21

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Sat, 2024-06-08 at 11:13 +0200, Paolo Bonzini wrote:
> > I was not loving the amount of indirection here in the patch, but thought it
> > centralized the logic a bit better. This way seems good, given that the
> > actual
> > logic is not that complex.
>
> My proposed implementation is a bit TDX-specific though... Something
> like this is more agnostic, and it exploits nicely the difference
> between fault->addr and fault->gfn:
>
> if (!kvm_gfn_direct_mask(kvm) ||
>     (gpa_to_gfn(fault->addr) & kvm_gfn_direct_mask(kvm)))
>   root_hpa = vcpu->arch.mmu->root.hpa;
> else
>   root_hpa = vcpu->arch.mmu->mirror_root_hpa;
> return root_to_sp(root_hpa);

Agreed that this is less TDX specific and it means that this part of the generic
MMU code doesn't need to know that the mirror/direct matches to private vs
shared. I don't love that it has such a complicated conditional for the normal
VM case, though. Just for readability.

The previous versions checked kvm_gfn_shared_mask() more readily in various open
coded spots. In this v2 we tried to reduce this and instead always rely on
the "private" concept to switch between the roots in the generic code. I think
it's arguably a little easier to understand if we stick to a single way of
deciding which root to use.

But I don't feel like any of these solutions discussed is perfectly clean. So
I'm ok taking the benefits you prefer. I guess doing bitwise operations when
possible is kind of the KVM way, haha. :)


2024-06-10 01:07:12

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 13/15] KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process

On Sat, 2024-06-08 at 11:15 +0200, Paolo Bonzini wrote:
> > Agree 'process' sticks out. Somehow having attr_filter and args.attributes
> > in
> > the same struct feels a bit wrong. Not that process was a lot better.
> >
> > I guess attr_filter is more about alias ranges, and args.attribute is more
> > about
> > conversion to various types of memory (private, shared and ideas of other
> > types
> > I guess). But since today we only have private and shared, I wonder if there
> > is
> > some way to combine them within struct kvm_gfn_range? I've not thought this
> > all
> > the way through.
>
> I think it's better that they stay separate. One is an argument
> (args.attribute), the other is not, it should be clear enough.

Ok, yea. Looking at this more, it makes sense.

2024-06-10 09:24:35

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Mon, Jun 10, 2024 at 2:09 AM Edgecombe, Rick P
<[email protected]> wrote:
> Agreed that this is less TDX specific and it means that this part of the generic
> MMU code doesn't need to know that the mirror/direct matches to private vs
> shared. I don't love that it has such a complicated conditional for the normal
> VM case, though. Just for readability.
>
> The previous versions checked kvm_gfn_shared_mask() more readily in various open
> coded spots. In this v2 we tried to reduce this and instead always rely on
> the "private" concept to switch between the roots in the generic code. I think
> it's arguably a little easier to understand if we stick to a single way of
> deciding which root to use.

But there isn't any other place that relies on is_private, right? So...

> But I don't feel like any of these solutions discussed is perfectly clean. So
> I'm ok taking the benefits you prefer. I guess doing bitwise operations when
> possible is kind of the KVM way, haha. :)

... while I'm definitely guilty of that, :) it does seem the cleanest
option to use fault->addr to go from struct kvm_page_fault to the kind
of root.

If you prefer, you can introduce a bool kvm_is_addr_direct(struct kvm
*kvm, gpa_t gpa), and use it here as kvm_is_addr_direct(kvm,
fault->addr). Maybe that's the best of both worlds.

Paolo


2024-06-10 16:00:59

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 09/15] KVM: x86/tdp_mmu: Support mirror root for TDP MMU

On Mon, 2024-06-10 at 11:23 +0200, Paolo Bonzini wrote:
> > The previous versions checked kvm_gfn_shared_mask() more readily in various
> > open
> > coded spots. In this v2 we tried to reduce this and instead always rely on
> > the "private" concept to switch between the roots in the generic code. I
> > think
> > it's arguably a little easier to understand if we stick to a single way of
> > deciding which root to use.
>
> But there isn't any other place that relies on is_private, right? So...

I meant the "private" concept in general, like triggering off of
KVM_PROCESS_PRIVATE to use the mirror.

>
> > But I don't feel like any of these solutions discussed is perfectly clean.
> > So
> > I'm ok taking the benefits you prefer. I guess doing bitwise operations when
> > possible is kind of the KVM way, haha. :)
>
> ... while I'm definitely guilty of that, :) it does seem the cleanest
> option to use fault->addr to go from struct kvm_page_fault to the kind
> of root.
>
> If you prefer, you can introduce a bool kvm_is_addr_direct(struct kvm
> *kvm, gpa_t gpa), and use it here as kvm_is_addr_direct(kvm,
> fault->addr). Maybe that's the best of both worlds.

Yes, I think a helper will make it easier for non-TDX readers.
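
Something like this, maybe (untested sketch; kvm_is_addr_direct() is the new
helper being proposed here, and it assumes the kvm_gfn_direct_bits() naming
agreed on for patch 6):

static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
{
        gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));

        return !gpa_direct_bits || (gpa & gpa_direct_bits);
}

Then the fault path reduces to:

static inline struct kvm_mmu_page *
tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
        if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
                return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);

        return root_to_sp(vcpu->arch.mmu->root.hpa);
}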

2024-06-12 18:56:19

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 15/15] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

On Fri, Jun 07, 2024 at 11:39:14PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Fri, 2024-06-07 at 11:31 +0200, Paolo Bonzini wrote:
> > > -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > > -                        int *root_level)
> > > +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > > +                                 enum kvm_tdp_mmu_root_types root_type)
> > >   {
> > > -       struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> > > +       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
> >
> > I think this function should take the struct kvm_mmu_page * directly.
> >
> > > +{
> > > +       *root_level = vcpu->arch.mmu->root_role.level;
> > > +
> > > +       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);
> >
> > Here you pass root_to_sp(vcpu->arch.mmu->root.hpa);
>
> I see. It is another case of more indirection to try to send the decision making
> through the helpers. We can try to open code things more.
>
> >
> > > +int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> > > +                                    kvm_pfn_t *pfn)
> > > +{
> > > +       u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> > > +       int leaf;
> > > +
> > > +       lockdep_assert_held(&vcpu->kvm->mmu_lock);
> > > +
> > > +       rcu_read_lock();
> > > +       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
> >
> > and likewise here.
> >
> > You might also consider using a kvm_mmu_root_info for the mirror root,
> > even though the pgd field is not used.
>
> This came up on the last version actually. The reason against it was that it
> used that tiny bit of extra memory for the pgd. It does look more symmetrical
> though.
>
> >
> > Then __kvm_tdp_mmu_get_walk() can take a struct kvm_mmu_root_info * instead.
>
> Ahh, I see. Yes, that's a good reason.
>
> >
> > kvm_tdp_mmu_get_walk_mirror_pfn() doesn't belong in this series, but
> > introducing __kvm_tdp_mmu_get_walk() can stay here.
>
> Ok, we can split it.
>
> >
> > Looking at the later patch, which uses
> > kvm_tdp_mmu_get_walk_mirror_pfn(), I think this function is a bit
> > overkill. I'll do a pre-review of the init_mem_region function,
> > especially the usage of kvm_gmem_populate:
> >
> > +    slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > +    if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
> > +        ret = -EFAULT;
> > +        goto out_put_page;
> > +    }
> >
> > The slots_lock is taken, so checking kvm_slot_can_be_private is unnecessary.
> >
> > Checking kvm_mem_is_private perhaps should also be done in
> > kvm_gmem_populate() itself. I'll send a patch.
> >
> > +    read_lock(&kvm->mmu_lock);
> > +
> > +    ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> > +    if (ret < 0)
> > +        goto out;
> > +    if (ret > PG_LEVEL_4K) {
> > +        ret = -EINVAL;
> > +        goto out;
> > +    }
> > +    if (mmu_pfn != pfn) {
> > +        ret = -EAGAIN;
> > +        goto out;
> > +    }
> >
> > If you require pre-faulting, you don't need to return mmu_pfn and
> > things would be seriously wrong if the two didn't match, wouldn't
> > they?
>
> Yea, I'm not sure why it would be a normal condition. Maybe Isaku can comment on
> the thinking?

Sean suggested that KVM_TDX_INIT_MEM_REGION check whether the two PFNs, from
the TDP MMU and from guest_memfd, match. As pointed out, the two PFNs should
match while the appropriate lock is held (or something is badly broken).
Personally, I'm fine with removing that check and not returning the pfn.

https://lore.kernel.org/kvm/[email protected]/

Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
simply need to verify that the pfn from guest_memfd() is the same as what's in
the TDP MMU.
--
Isaku Yamahata <[email protected]>

2024-06-12 18:58:38

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 11/15] KVM: x86/tdp_mmu: Reflect tearing down mirror page tables

On Fri, Jun 07, 2024 at 09:46:27PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> > /* Here we only care about zapping the external leaf PTEs. */
> > if (!is_last_spte(old_spte, level))
> >
> > > +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > > +       int ret;
> > > +
> > > +       /*
> > > +        * Allow only leaf page to be zapped. Reclaim non-leaf page tables page
> >
> > This comment left me confused, so I'll try to rephrase and see if I
> > can explain what happens. Correct me if I'm wrong.
> >
> > The only paths to handle_removed_pt() are:
> > - kvm_tdp_mmu_zap_leafs()
> > - kvm_tdp_mmu_zap_invalidated_roots()
> >
> > but because kvm_mmu_zap_all_fast() does not operate on mirror roots,
> > the latter can only happen at VM destruction time.
> >
> > But it's not clear why it's worth mentioning it here, or even why it
> > is special at all. Isn't that just what handle_removed_pt() does at
> > the end? Why does it matter that it's only done at VM destruction
> > time?
> >
> > In other words, it seems to me that this comment is TMI. And if I am
> > wrong (which may well be), the extra information should explain the
> > "why" in more detail, and it should be around the call to
> > reflect_free_spt, not here.
>
> TDX of course has the limitation around the ordering of the zapping S-EPT. So I
> read the comment to be referring to how the implementation avoids zapping any
> non-leaf PTEs during TD runtime.
>
> But I'm going to have to circle back here after investigating a bit more. Isaku,
> any comments on this comment and conditional?

It's for large page merge/split. At this point, it seems only confusing.
We need only leaf zapping. Maybe reflect_removed_leaf_spte() or something.
We can come back to it when large page support is added.
--
Isaku Yamahata <[email protected]>

2024-06-14 23:19:03

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 11/15] KVM: x86/tdp_mmu: Reflect tearing down mirror page tables

On Wed, 2024-06-12 at 11:39 -0700, Isaku Yamahata wrote:
> > TDX of course has the limitation around the ordering of the zapping S-EPT.
> > So I
> > read the comment to be referring to how the implementation avoids zapping
> > any
> > non-leaf PTEs during TD runtime.
> >
> > But I'm going to have to circle back here after investigating a bit more.
> > Isaku,
> > any comments on this comment and conditional?
>
> It's for large page merge/split.  At this point, it seems only confusing.
> We need only leaf zapping.  Maybe reflect_removed_leaf_spte() or something.
> We can come back to it when large page support is added.

Yes, I don't see the need for the is_shadow_present_pte() check in this
function. The leaf check is needed, but actually it could be done well enough
in TDX code for now (for as long as we have no huge pages).

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0155b8330e15..e54dd2355005 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1721,6 +1721,10 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 {
        int ret;

+       /* Only 4k pages currently, nothing to do for other levels */
+       if (level != PG_LEVEL_4K)
+               return 0;
+
        ret = tdx_sept_zap_private_spte(kvm, gfn, level);
        if (ret)
                return ret;


The benefit would be less mirror logic in core MMU code that is actually about
TDX specifics unrelated to the mirror concept. The downside is that returning
success there feels kind of wrong. I'm going to move forward with dropping the
is_shadow_present_pte() check and adding a comment (see below), but I thought
the alternative was worth mentioning because I was tempted.

/*
* External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
* PTs are removed in a special order, involving free_external_spt().
* But remove_external_spte() will be called on non-leaf PTEs via
* __tdp_mmu_zap_root(), so avoid the error the former would return
* in this case.
*/