Hi,
This is a TDX prep series, split out of the giant 130 patch TDX base
enabling series [0]. It focuses on the changes to the KVM MMU needed to
support TDX's separation of private/shared EPT. A future breakout series
will include the changes to interact with the TDX module to actually map
private memory. The purpose of sending out a smaller series is to focus
review and hopefully iterate more rapidly.
It is not quite ready for upstream inclusion yet, but it has reached a
point where more public comments could help.
There is a larger team working on TDX KVM base enabling. Most patches were
originally authored by Sean Christopherson and Isaku Yamahata, with recent
work by Yan Y Zhao, Isaku and myself.
The series has been tested as part of a development branch for the TDX
base series [1]. The testing so far consists of TDX kvm-unit-tests [2],
booting a Linux TD, and the regular KVM selftests (not the TDX ones).
Contents of the series
======================
There are some simple preparatory patches mixed into the series, which is
ordered with bisectability in mind. The patches that most likely need
further discussion are:
KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion
Explores expanding TDX's need to zap only the specific PTEs for a memslot
on deletion into a general KVM feature.
KVM: Add member to struct kvm_gfn_range for target alias
Discussion on how to target zapping to the appropriate private/shared
alias.
KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX
A change that includes a discussion on how to handle cache attributes on
shared memory.
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
A big "how to do private/shared split" patch.
Patches 11-15:
Handle the separate aliases in MMU operations, continuing the work started
in "Add member to struct kvm_gfn_range for target alias".
Private/shared memory in TDX
============================
Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a bit
in the PTE. Solutions like SEV treat this bit more like a permission bit,
whereas solutions like TDX and ARM CCA treat it more like a GPA bit. In the
latter case, the host maps private memory in one half of the address space
and shared in another. For TDX these two halves are mapped by different
EPT roots. The private half (also called Secure EPT in Intel
documentation) gets managed by the privileged TDX Module. The shared half
is managed by the untrusted part of the VMM (KVM).
In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. TDX wants
to protect against protected memory being reset or otherwise scrambled by
the host. In order to prevent this, the guest has to take specific action
to “accept” memory after changes are made by the VMM to the private EPT.
This prevents the VMM from performing many of the usual memory management
operations that involve zapping and refaulting memory. Private memory is
also always RWX and cannot have VMM-specified cache attributes applied.
TDX KVM MMU Design For Private Memory
=====================================
Private/shared split
--------------------
The operations that actually change the private half of the EPT are
limited and relatively slow compared to reading a PTE. For this reason the
design for KVM is to keep a “mirrored” copy of the private EPT in KVM’s
memory. This will allow KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
To clarify the definitions of the three EPT trees at this point:
private EPT - Protected by the TDX module, modified via TDX module
calls.
mirrored EPT - Bookkeeping tree used as an optimization by KVM, not
mapped.
shared EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
It's worth noting that we are making an effort to remove optimizations
that add complexity from the base enabling. Although keeping a mirrored
copy of the private page tables kind of fits into that category, it has
been so fundamental to the design for so long that dropping it would be
too disruptive.
Mirrored EPT
------------
KVM needs to keep a mirrored version of the private EPT maintained in the
TDX module in order to find out whether a GPA's mid-level page tables have
already been installed. So this mirrored copy has the same structure as
the private EPT, with a page table present in the mirrored EPT for every
GPA range and level where a page table is present in the private EPT. The
private page tables also cannot be zapped while the range has anything
mapped, so the mirrored/private page tables need to be protected from KVM
operations that zap any non-leaf PTEs, for example
kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
Modifications to the mirrored page tables need to also perform the same
operations to the private page tables. The actual TDX module calls to do
this are not covered in this prep series.
SPs for private page tables are tracked with a role bit for convenience.
(Note to reviewers: please consider whether this is really needed.)
Zapping Changes
---------------
For normal VMs, guest memory is zapped for several reasons: user memory
getting paged out by the host, memslots getting deleted, or virtualization
operations like MTRR updates and attachment of non-coherent DMA. For TDX
(and SNP) there is also zapping associated with the conversion of memory
between shared and private. These operations need to take care to
do two things:
1. Not zap any private memory that is in use by the guest.
2. Not zap any memory alias unnecessarily (i.e. don't zap anything more
than needed). The purpose of this is to avoid any unnecessary behavior
that userspace could grow to rely on.
For 1, this is possible because the zapping that is out of the control of
KVM/userspace (paging out of userspace memory) will only apply to shared
memory. guest_memfd operations are protected from mmu notifier
operations. During TD runtime, zapping of private memory will only be from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
For 2, KVM needs to be taught which operations will operate on which
aliases. An enum-based scheme is introduced such that operations can
target specific aliases (see the sketch after this list):
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
MTRRs, etc - Zapping will be avoided altogether
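As a rough illustration, here is a minimal sketch of how a caller targets
one alias, using the enum kvm_process and struct kvm_gfn_range field
introduced later in this series (the surrounding slot/start/end variables
are assumed):

	struct kvm_gfn_range range = {
		.slot = slot,
		.start = start,
		.end = end,
		/*
		 * HVA-based notifications aren't relevant to private
		 * mappings, so target the shared alias only.
		 */
		.process = KVM_PROCESS_SHARED,
		.may_block = true,
	};

	if (kvm_unmap_gfn_range(kvm, &range))
		kvm_flush_remote_tlbs(kvm);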
For zapping arising from other virtualization based operations, there are
four scenarios:
1. MTRR update
2. CR0.CD update
3. APICv update
4. Non-coherent DMA status update
KVM TDX will not support 1-3; future changes (after this series) will
prevent those features for TDX. For 4, there isn't an easy way to
not support the feature as the notification is just passed to KVM and it
has to act accordingly. However, other proposed changes [3] will avoid the
need for zapping on non-coherent DMA notification for selfsnoop CPUs. So
KVM can follow this logic and just always honor guest PAT for shared
memory. See more details in patch 8.
Prevention of zapping mid-level PTEs
------------------------------------
As mentioned earlier, private PTEs (and so also mirrored PTEs) need to be
zapped at the leafs only. This means that for TDX, the fast zap-all-roots
optimization for memslot deletion is not compatible, and instead only the
leafs should be zapped. Behavior like this for memslot deletion was
tried [4] once before for normal VMs, and unfortunately it exposed a
mysterious bug affecting an NVIDIA GPU in a Windows guest that was never
root caused.
Since the restrictions on not zapping roots are only for private memory,
TDX could minimize the possibility of being exposed to this bug by always
zapping shared roots, and zapping leafs only for the private alias.
However, designing long term ABI around a bug seems wrong. So instead,
this series explores creating a new memslot flag that allows for
specifying that a memslot should be deleted without zapping other GPA
ranges. The expectation would be for userspace to set this on memslots
used for TDX. Controlling this behavior at the VM level was also explored.
See patch 2 for more information.
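As a sketch of the expected userspace usage (hypothetical values; assumes
the KVM_MEM_ZAP_LEAFS_ONLY flag from patch 2 and the existing
KVM_SET_USER_MEMORY_REGION2 ioctl):

	struct kvm_userspace_memory_region2 region = {
		.slot = slot_id,
		/* Request that only this slot's leaf SPTEs be zapped on deletion. */
		.flags = KVM_MEM_GUEST_MEMFD | KVM_MEM_ZAP_LEAFS_ONLY,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (__u64)hva,
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);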
Atomically updating private EPT
-------------------------------
Although this prep series does not interact with the TDX module at all to
actually configure the private EPT, it does lay the groundwork for doing
this. In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module, but there is one
tricky property worth elaborating on: how to handle the fact that the TDP
MMU allows modification of PTEs with the mmu_lock held only for read, and
uses the PTEs themselves to perform synchronization.
Unfortunately, while operating on a single PTE can be done atomically,
operating on both the mirrored and private PTEs at the same time needs an
additional solution. To handle this situation, REMOVED_SPTE is used to
prevent concurrent operations while a call to the TDX module updates the
private EPT. For more information see the documentation in patch 18.
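Roughly, the protocol looks like the sketch below. It is illustrative
only: tdx_module_set_private_spte() is a hypothetical stand-in for the TDX
module plumbing that later patches add, not code from this series.

	static int set_mirrored_spte_atomic(struct kvm *kvm, struct tdp_iter *iter,
					    u64 new_spte)
	{
		u64 *sptep = rcu_dereference(iter->sptep);

		/*
		 * Freeze the mirrored PTE so concurrent walkers holding
		 * mmu_lock for read see REMOVED_SPTE and retry, rather
		 * than racing with the TDX module call below.
		 */
		if (!try_cmpxchg64(sptep, &iter->old_spte, REMOVED_SPTE))
			return -EBUSY;

		/* Update the real private EPT via the (slow) TDX module call. */
		tdx_module_set_private_spte(kvm, iter->gfn, iter->level, new_spte);

		/* Unfreeze: publish the final value in the mirrored PTE. */
		WRITE_ONCE(*sptep, new_spte);
		return 0;
	}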
For a more detailed discussion, see the "The original TDP MMU and race
condition" section of documentation patch [5].
The series is based on kvm-coco-queue.
[0] https://lore.kernel.org/kvm/[email protected]/
[1] https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-05-14-mmu-prep-1
[2] https://lore.kernel.org/kvm/[email protected]/
[3] https://lore.kernel.org/kvm/[email protected]/
[4] https://lore.kernel.org/kvm/[email protected]/
[5] https://github.com/intel/tdx/commit/70cd3c807e547854ea52f56623ce168c7869679e
Isaku Yamahata (11):
KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
KVM: x86/mmu: Add address conversion functions for TDX shared bit of
GPA
KVM: Add member to struct kvm_gfn_range for target alias
KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role
KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table
type
KVM: x86/tdp_mmu: Introduce shared, private KVM MMU root types
KVM: x86/tdp_mmu: Take root types for
kvm_tdp_mmu_invalidate_all_roots()
KVM: x86/tdp_mmu: Make mmu notifier callbacks to check kvm_process
Rick Edgecombe (3):
KVM: x86: Add a VM type define for TDX
KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX
KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void
Sean Christopherson (1):
KVM: x86/tdp_mmu: Invalidate correct roots
Yan Zhao (1):
KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot
deletion
arch/x86/include/asm/kvm-x86-ops.h | 5 +
arch/x86/include/asm/kvm_host.h | 45 +++-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu.h | 36 +++
arch/x86/kvm/mmu/mmu.c | 86 +++++-
arch/x86/kvm/mmu/mmu_internal.h | 60 ++++-
arch/x86/kvm/mmu/spte.h | 5 +
arch/x86/kvm/mmu/tdp_iter.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 407 ++++++++++++++++++++++++-----
arch/x86/kvm/mmu/tdp_mmu.h | 18 +-
arch/x86/kvm/x86.c | 17 ++
include/linux/kvm_host.h | 8 +
include/uapi/linux/kvm.h | 1 +
virt/kvm/guest_memfd.c | 2 +
virt/kvm/kvm_main.c | 19 +-
15 files changed, 632 insertions(+), 80 deletions(-)
base-commit: 698ca1e403579ca00e16a5b28ae4d576d9f1b20e
--
2.34.1
From: Isaku Yamahata <[email protected]>
Export a function to walk down the TDP MMU without modifying it.
Future changes will support pre-populating TDX private memory. In order to
implement this, KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT, and verify that the populated private
memory PFN matches the current one. [1]
There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.
Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.
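A rough sketch of the intended usage (simplified from the linked commit;
expected_pfn and the surrounding error handling are illustrative):

	kvm_pfn_t pfn;
	int level;

	read_lock(&vcpu->kvm->mmu_lock);
	level = kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
	read_unlock(&vcpu->kvm->mmu_lock);

	/* Fail if the GFN isn't pre-populated or maps a different PFN. */
	if (level < 0 || pfn != expected_pfn)
		return -EIO;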
Link: https://lore.kernel.org/kvm/[email protected]/ [1]
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
This helper will be used in the future change that implements
KVM_TDX_INIT_MEM_REGION. Please refer to the following commit for the
usage:
https://github.com/intel/tdx/commit/2832c6d87a4e6a46828b193173550e80b31240d4
TDX MMU Part 1:
- New patch
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++++++++++----
2 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index dc80e72e4848..3c7a88400cbb 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -275,6 +275,9 @@ extern bool tdp_mmu_enabled;
#define tdp_mmu_enabled false
#endif
+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+ kvm_pfn_t *pfn);
+
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1259dd63defc..1086e3b2aa5c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1772,16 +1772,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
*
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
*/
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
- int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ bool is_private)
{
struct tdp_iter iter;
struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
- *root_level = vcpu->arch.mmu->root_role.level;
-
tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
@@ -1790,6 +1788,37 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
return leaf;
}
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
+{
+ *root_level = vcpu->arch.mmu->root_role.level;
+
+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
+}
+
+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+ kvm_pfn_t *pfn)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+ int leaf;
+
+ lockdep_assert_held(&vcpu->kvm->mmu_lock);
+
+ rcu_read_lock();
+ leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
+ rcu_read_unlock();
+ if (leaf < 0)
+ return -ENOENT;
+
+ spte = sptes[leaf];
+ if (!(is_shadow_present_pte(spte) && is_last_spte(spte, leaf)))
+ return -ENOENT;
+
+ *pfn = spte_to_pfn(spte);
+ return leaf;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);
+
/*
* Returns the last level spte pointer of the shadow page walk for the given
* gpa, and sets *spte to the spte value. This spte may be non-preset. If no
--
2.34.1
From: Yan Zhao <[email protected]>
Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zapping only
leaf SPTEs when deleting a memslot.
Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
AMD) and generate fresh new TDPs based on the new memslot layout. This is
because zapping and re-generating TDPs is low overhead for most use cases,
and more importantly, it's due to a bug [1] which caused VM instability
when a VM is with Nvidia Geforce GPU assigned.
There was a previous attempt [2] to introduce a per-VM flag to work around
bug [1] by only allowing "zapping only memslot leaf SPTEs" for specific
VMs. However, [2] was not merged due to the lack of a clear explanation of
exactly what is broken [3], and because it's not wise to "have a bug that
is known to happen when you enable the capability".
However, for some specific scenarios, e.g. TDX, invalidating and
re-generating a new page table is not viable for the following reasons:
- TDX requires that the root page of the private page table remain
unaltered throughout the TD life cycle.
- TDX mandates that leaf entries in the private page table be zapped prior
to non-leaf entries.
So, Sean reconsidered introducing a per-VM flag or a per-memslot flag
again for VMs like TDX. [4]
This patch is an implementation of per-memslot flag.
Compared to per-VM flag approach,
Pros:
(1) By allowing userspace to control the zapping behavior in fine-grained
granularity, optimizations for specific use cases can be developed
without future kernel changes.
(2) Allows developing new zapping behaviors without risking regressions by
changing KVM behavior, as seen previously.
Cons:
(1) Users need to ensure that all necessary memslots have the
KVM_MEM_ZAP_LEAFS_ONLY flag set, e.g. QEMU needs to ensure every
GUEST_MEMFD memslot has the flag set for a TDX VM.
(2) Opens up the possibility that userspace could configure memslots for a
normal VM in such a way that the bug [1] is seen.
However, one thing worth noting for TDX is that TDX may potentially hit
bug [1] with either the per-memslot flag or the per-VM flag approach,
since there's a use case on the radar to assign an untrusted & passthrough
GPU device to a TD. If that happens, it can be treated as a bug (not a
regression) and fixed accordingly.
An alternative approach we could also consider is to always invalidate &
rebuild all shared page tables and zap only memslot leaf SPTEs for
mirrored and private page tables on memslot deletion. This approach could
exempt TDX from bug [1] when "untrusted & passthrough" devices are
involved. But the downside is that it requires creating a new, very
specific KVM zapping ABI that could limit future changes in the same way
that the bug did for normal VMs.
Link: https://patchwork.kernel.org/project/kvm/patch/[email protected] [1]
Link: https://lore.kernel.org/kvm/[email protected]/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
Link: https://lore.kernel.org/kvm/[email protected]/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
Link: https://lore.kernel.org/kvm/[email protected] [4]
Signed-off-by: Yan Zhao <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 17 +++++++++++++++++
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 5 ++++-
4 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61982da8c8b2..4a8e819794db 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
kvm_mmu_zap_all(kvm);
}
+static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+ return;
+
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+ * case scenario we'll have unused shadow pages lying around until they
+ * are recycled due to age or when the VM is destroyed.
+ */
+ struct kvm_gfn_range range = {
+ .slot = slot,
+ .start = slot->base_gfn,
+ .end = slot->base_gfn + slot->npages,
+ .may_block = true,
+ };
+
+ if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
+ kvm_flush_remote_tlbs(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+}
+
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
- kvm_mmu_zap_all_fast(kvm);
+ if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
+ kvm_mmu_zap_memslot_leafs(kvm, slot);
+ else
+ kvm_mmu_zap_all_fast(kvm);
}
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7c593a081eba..4b3ec2ec79e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
return -EINVAL;
+	/*
+	 * Since TDX private pages require re-acceptance after zap,
+	 * and the TDX private root page should not be zapped, TDX
+	 * requires memslots for private memory to have the flag
+	 * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
+	 * the memslot being deleted are zapped and SPTEs in other
+	 * memslots are not affected.
+	 */
+ if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
+ (new->flags & KVM_MEM_GUEST_MEMFD) &&
+ !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
+ return -EINVAL;
+
+ /* zap-leafs-only works only when TDP MMU is enabled for now */
+ if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
+ return -EINVAL;
+
return kvm_alloc_memslot_metadata(kvm, new);
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index aee67912e71c..d53648c19b26 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
+#define KVM_MEM_ZAP_LEAFS_ONLY (1UL << 3)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 81b90bf03f2f..1b1ffb6fc786 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1568,6 +1568,8 @@ static int check_memory_region_flags(struct kvm *kvm,
if (kvm_arch_has_private_mem(kvm))
valid_flags |= KVM_MEM_GUEST_MEMFD;
+ valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY;
+
/* Dirty logging private memory is not currently supported. */
if (mem->flags & KVM_MEM_GUEST_MEMFD)
valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
@@ -2052,7 +2054,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
- ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
+ ((mem->flags ^ old->flags) &
+ (KVM_MEM_READONLY | KVM_MEM_ZAP_LEAFS_ONLY)))
return -EINVAL;
if (base_gfn != old->base_gfn)
--
2.34.1
From: Isaku Yamahata <[email protected]>
Add a new member to struct kvm_gfn_range to indicate which mapping
(private vs. shared) to operate on: enum kvm_process process. Update the
core zapping operations to set it appropriately.
TDX utilizes two GPA aliases for the same memslots, one for private memory
and one for shared memory. For private memory, KVM cannot always perform
the same operations it does on memory for default VMs, such as zapping
pages and having them be faulted back in, as this requires guest
coordination. However, some operations, such as guest driven conversion of
memory between private and shared, should zap private memory.
Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add a field to struct kvm_gfn_range to allow
callers to specify which aliases to operate on, so they can target only
the aliases appropriate for their specific operation.
There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered: variations with separate bools defined such that the default
behavior was to process both aliases, which either allowed nonsensical
configurations or were confusing for the caller; and a simple enum, which
was close but hard to process in the caller. Instead, use an enum with the
default value (0) reserved as a disallowed value, and catch ranges that
didn't have the target aliases specified by looking for that specific
value.
Set the target alias with the enum appropriately for these MMU operations:
- For KVM's mmu notifier callbacks, zap shared pages only, because private
pages won't have a userspace mapping.
- For setting memory attributes, kvm_arch_pre_set_memory_attributes()
chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.
Link: https://lore.kernel.org/kvm/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Replaced KVM_PROCESS_BASED_ON_ARG with BUGGY_KVM_INVALIDATION to follow
the original suggestion and not populate kvm_handle_gfn_range(). Also
added a WARN_ON_ONCE().
- Move attribute specific logic into kvm_vm_set_mem_attributes()
- Drop Sean's suggested-by tag as the solution has changed
- Re-write commit log
v18:
- rebased to kvm-next
v3:
- Drop the KVM_GFN_RANGE flags
- Updated struct kvm_gfn_range
- Change kvm_arch_set_memory_attributes() to return bool for flush
- Added set_memory_attributes x86 op for vendor backends
- Refined commit message to describe TDX care concretely
v2:
- consolidate KVM_GFN_RANGE_FLAGS_GMEM_{PUNCH_HOLE, RELEASE} into
KVM_GFN_RANGE_FLAGS_GMEM.
- Update the commit message to describe TDX more. Drop SEV_SNP.
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++
include/linux/kvm_host.h | 8 ++++++++
virt/kvm/guest_memfd.c | 2 ++
virt/kvm/kvm_main.c | 14 ++++++++++++++
4 files changed, 36 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4a8e819794db..1998267a330e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6979,6 +6979,12 @@ static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *s
.start = slot->base_gfn,
.end = slot->base_gfn + slot->npages,
.may_block = true,
+
+ /*
+ * All private and shared page should be zapped on memslot
+ * deletion.
+ */
+ .process = KVM_PROCESS_PRIVATE_AND_SHARED,
};
if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
@@ -7479,6 +7485,12 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
return false;
+	/* Unmap pages of the old attribute. */
+ if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+ range->process = KVM_PROCESS_SHARED;
+ else
+ range->process = KVM_PROCESS_PRIVATE;
+
return kvm_unmap_gfn_range(kvm, range);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c3c922bf077f..f92c8b605b03 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,11 +260,19 @@ union kvm_mmu_notifier_arg {
unsigned long attributes;
};
+enum kvm_process {
+ BUGGY_KVM_INVALIDATION = 0,
+ KVM_PROCESS_SHARED = BIT(0),
+ KVM_PROCESS_PRIVATE = BIT(1),
+ KVM_PROCESS_PRIVATE_AND_SHARED = KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE,
+};
+
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+ enum kvm_process process;
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9714add38852..e5ff6fde2db3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -109,6 +109,8 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
+ /* guest memfd is relevant to only private mappings. */
+ .process = KVM_PROCESS_PRIVATE,
};
if (!found_memslot) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1b1ffb6fc786..cc434c7509f1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,11 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
*/
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
+ /*
+ * HVA-based notifications aren't relevant to private
+ * mappings as they don't have a userspace mapping.
+ */
+ gfn_range.process = KVM_PROCESS_SHARED;
/*
* {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -2453,6 +2458,14 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
gfn_range.arg = range->arg;
gfn_range.may_block = range->may_block;
+ /*
+	 * If/when KVM supports more attributes beyond private vs. shared, this
+ * _could_ set exclude_{private,shared} appropriately if the entire target
+ * range already has the desired private vs. shared state (it's unclear
+ * if that is a net win). For now, KVM reaches this point if and only
+ * if the private flag is being toggled, i.e. all mappings are in play.
+ */
+
for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
@@ -2509,6 +2522,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
.end = end,
+ .arg.attributes = attributes,
.handler = kvm_pre_set_memory_attributes,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
--
2.34.1
From: Isaku Yamahata <[email protected]>
Introduce a "is_private" member to the kvm_mmu_page_role union to identify
SPTEs associated with the mirrored EPT.
The TDX module maintains the private half of the EPT mapped in the TD in
its protected memory. KVM keeps a copy of the private GPAs in a mirrored
EPT tree within host memory, recording the root page HPA in each vCPU's
mmu->private_root_hpa. This "is_private" attribute enables vCPUs to find
and get the root page of mirrored EPT from the MMU root list for a guest
TD. This also allows KVM MMU code to detect changes in mirrored EPT
according to the "is_private" mmu page role and propagate the changes to
the private EPT managed by TDX module.
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Remove warning and NULL check in is_private_sptep() (Rick)
- Update commit log (Yan)
v19:
- Fix is_private_sptep() when NULL case.
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 13 ++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 5 +++++
arch/x86/kvm/mmu/spte.h | 5 +++++
3 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2f924f1d579..13119d4e44e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -351,7 +351,8 @@ union kvm_mmu_page_role {
unsigned ad_disabled:1;
unsigned guest_mode:1;
unsigned passthrough:1;
- unsigned :5;
+ unsigned is_private:1;
+ unsigned :4;
/*
* This is left at the top of the word so that
@@ -363,6 +364,16 @@ union kvm_mmu_page_role {
};
};
+static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role)
+{
+ return !!role.is_private;
+}
+
+static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role)
+{
+ role->is_private = 1;
+}
+
/*
* kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
* relevant to the current MMU configuration. When loading CR0, CR4, or EFER,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 706f0ce8784c..b114589a595a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -145,6 +145,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}
+static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+{
+ return kvm_mmu_page_role_is_private(sp->role);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5dd5405fa07a..d0df691ced5c 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -265,6 +265,11 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}
+static inline bool is_private_sptep(u64 *sptep)
+{
+ return is_private_sp(sptep_to_sp(sptep));
+}
+
static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
--
2.34.1
When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
guest mappings so they can be faulted in with different PTE properties.
For TDX private memory this technique is fundamentally not possible.
Remapping private memory requires the guest to "accept" it, and also the
needed PTE properties are not currently supported by TDX for private
memory.
These CPU features are:
1) MTRR update
2) CR0.CD update
3) Non-coherent DMA status update
4) APICV update
Since they cannot be supported, they should be blocked from being
exercised by a TD. In the case of CR0.CD, the feature is fundamentally not
supported for TDX as KVM cannot see the guest registers. APICv will be
inhibited in future changes.
Guest MTRR support is more of an interesting case. Supported versions of
the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
it is not possible to fully support the feature. This leaves KVM with a
few options:
- Support a modified version of the architecture where the caching
attributes are ignored for private memory.
- Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
bug.
With the additional consideration that guest MTRR support in KVM will
likely be going away, the latter option is best. Prevent MTRR MSR writes
from calling kvm_zap_gfn_range() in future changes.
Lastly, the most interesting case is non-coherent DMA status updates.
There isn't a way to reject the call. KVM is just notified that there is a
non-coherent DMA device attached, and expected to act accordingly. For
normal VMs today, that means to start respecting guest PAT. However,
recently there has been a proposal to avoid doing this on selfsnoop CPUs
(see link). On such CPUs it should not be problematic to simply always
configure the EPT to honor guest PAT. In future changes TDX can enforce
this behavior for shared memory, resulting in shared memory always
respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
called in this case either.
Unfortunately, this will result in different cache attributes between
private and shared memory, as private memory is always WB and cannot be
changed by the VMM on current TDX modules. But it can't really be helped
while also supporting non-coherent DMA devices.
Since all callers will be prevented from calling kvm_zap_gfn_range() in
future changes, report a bug and terminate the guest if other future
changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
For lack of a better method currently, use kvm_gfn_shared_mask() to
determine if private memory cannot be zapped (as in TDX, the only VM type
that sets it).
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
- Add this KVM_BUG_ON() instead
---
arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d5cf5b15a10e..808805b3478d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
- if (tdp_mmu_enabled)
+ if (tdp_mmu_enabled) {
+ /*
+ * kvm_zap_gfn_range() is used when MTRR or PAT memory
+ * type was changed. TDX can't handle zapping the private
+ * mapping, but it's ok because KVM doesn't support either of
+ * those features for TDX. In case a new caller appears, BUG
+ * the VM if it's called for solutions with private aliases.
+ */
+ KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
+ }
if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
--
2.34.1
From: Isaku Yamahata <[email protected]>
Introduce a "gfn_shared_mask" field in the kvm_arch structure to record GPA
shared bit and provide address conversion helpers for TDX shared bit of
GPA.
TDX designates a specific GPA bit as the shared bit, which can be either
bit 51 or bit 47 based on configuration.
This GPA shared bit indicates whether the corresponding physical page is
shared (if shared bit set) or private (if shared bit cleared).
- GPAs with shared bit set will be mapped by VMM into conventional EPT,
which is pointed by shared EPTP in TDVMCS, resides in host VMM memory
and is managed by VMM.
- GPAs with shared bit cleared will be mapped by VMM first into a
mirrored EPT, which resides in host VMM memory. Changes of the mirrored
EPT are then propagated into a private EPT, which resides outside of host
VMM memory and is managed by TDX module.
Add the "gfn_shared_mask" field to the kvm_arch structure for each VM with
a default value of 0. It will be set to the position of the GPA shared bit
in GFN through TD specific initialization code.
Provide helpers to utilize the gfn_shared_mask to determine whether a GPA
is shared or private, retrieve the GPA shared bit value, and insert/strip
shared bit to/from a GPA.
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
TDX MMU Part 1:
- Update commit log (Yan)
- Fix documentation on kvm_is_private_gpa() (Binbin)
v19:
- Add comment on default vm case.
- Added behavior table in the commit message
- drop CONFIG_KVM_MMU_PRIVATE
v18:
- Added Reviewed-by Binbin
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..d2f924f1d579 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1519,6 +1519,8 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+ gfn_t gfn_shared_mask;
};
struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3c7a88400cbb..dac13a2d944f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -321,4 +321,37 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return gpa;
return translate_nested_gpa(vcpu, gpa, access, exception);
}
+
+/*
+ * default or SEV-SNP TDX: where S = (47 or 51) - 12
+ * gfn_shared_mask 0 S bit
+ * is_private_gpa() always false true if GPA has S bit clear
+ * gfn_to_shared() nop set S bit
+ * gfn_to_private() nop clear S bit
+ *
+ * fault.is_private means that host page should be gotten from guest_memfd
+ * is_private_gpa() means that KVM MMU should invoke private MMU hooks.
+ */
+static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
+{
+ return kvm->arch.gfn_shared_mask;
+}
+
+static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
+{
+ return gfn | kvm_gfn_shared_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
+{
+ return gfn & ~kvm_gfn_shared_mask(kvm);
+}
+
+static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
+{
+ gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+ return mask && !(gpa_to_gfn(gpa) & mask);
+}
+
#endif
--
2.34.1
From: Isaku Yamahata <[email protected]>
Add a private pointer to struct kvm_mmu_page for the private page table and
add helper functions to allocate/initialize/free a private page table page.
Because KVM TDP MMU doesn't use unsync_children and write_flooding_count,
pack them to have room for a pointer and use a union to avoid memory
overhead.
For private GPAs, the CPU refers to a private page table whose contents
are encrypted. Dedicated APIs must be used to operate on it (e.g.
updating/reading a PTE entry), and they are expensive.
When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate one more page for the mirrored page table
for the KVM MMU code to directly walk. Resolve the KVM page fault with
the existing code, and do additional operations necessary for the private
page table. To distinguish such cases, the existing KVM page table is
called a shared page table (i.e., not associated with a private page
table), and the page table with a private page table is called a mirrored
page table. The relationship is depicted below.
KVM page fault |
| |
V |
-------------+---------- |
| | |
V V |
shared GPA private GPA |
| | |
V V |
shared PT root mirrored PT root | private PT root
| | | |
V V | V
shared PT mirrored PT ----propagate----> private PT
| | | |
| \-----------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
non-encrypted memory | encrypted memory
|
PT: Page table
Shared PT: visible to KVM, and the CPU uses it for shared mappings.
Private PT: the CPU uses it, but it is invisible to KVM. TDX module
updates this table to map private guest pages.
Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
to propagate PT changes to the actual private PT.
Co-developed-by: Yan Zhao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
---
TDX MMU Part 1:
- Rename terminology, dummy PT => mirror PT. and updated the commit message
By Rick and Kai.
- Added a comment on union of private_spt by Rick.
- Don't handle the root case in kvm_mmu_alloc_private_spt(), it will not
be needed in future patches. (Rick)
- Update comments (Yan)
- Remove kvm_mmu_init_private_spt(), open code it in later patches (Yan)
v19:
- typo in the comment in kvm_mmu_alloc_private_spt()
- drop CONFIG_KVM_MMU_PRIVATE
---
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/mmu/mmu.c | 7 +++++++
arch/x86/kvm/mmu/mmu_internal.h | 36 +++++++++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.c | 1 +
4 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 13119d4e44e5..d010ca5c7f44 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -828,6 +828,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * This cache is to allocate private page table. E.g. private EPT used
+ * by the TDX module.
+ */
+ struct kvm_mmu_memory_cache mmu_private_spt_cache;
/*
* QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1998267a330e..d5cf5b15a10e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
+ if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
@@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b114589a595a..0f1a9d733d9e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -101,7 +101,22 @@ struct kvm_mmu_page {
int root_count;
refcount_t tdp_mmu_root_count;
};
- unsigned int unsync_children;
+ union {
+ /* Those two members aren't used for TDP MMU */
+ struct {
+ unsigned int unsync_children;
+ /*
+ * Number of writes since the last time traversal
+ * visited this page.
+ */
+ atomic_t write_flooding_count;
+ };
+ /*
+ * Page table page of private PT.
+ * Passed to TDX module, not accessed by KVM.
+ */
+ void *private_spt;
+ };
union {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
@@ -124,9 +139,6 @@ struct kvm_mmu_page {
int clear_spte_count;
#endif
- /* Number of writes since the last time traversal visited this page. */
- atomic_t write_flooding_count;
-
#ifdef CONFIG_X86_64
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
@@ -150,6 +162,22 @@ static inline bool is_private_sp(const struct kvm_mmu_page *sp)
return kvm_mmu_page_role_is_private(sp->role);
}
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+ return sp->private_spt;
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ /*
+ * private_spt is allocated for TDX module to hold private EPT mappings,
+ * TDX module will initialize the page by itself.
+ * Therefore, KVM does not need to initialize or access private_spt.
+ * KVM only interacts with sp->spt for mirrored EPT operations.
+ */
+ sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1086e3b2aa5c..6fa910b017d1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
+ free_page((unsigned long)sp->private_spt);
free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
--
2.34.1
From: Isaku Yamahata <[email protected]>
Extract tdp_mmu_root_match() to check if a root has the given types and use
it for the root page table iterator. For now it only checks only_valid.
TDX KVM operates on a shared page table only (Shared-EPT), a mirrored page
table only (Secure-EPT), or both, based on the operation. KVM MMU notifier
operations act only on the shared page table, KVM guest_memfd invalidation
operations act only on the mirrored page table, and so on. Introduce a
centralized matching function instead of open coding the matching logic in
the iterator.
The next step is to extend the function to check whether the page table is
shared or private.
Link: https://lore.kernel.org/kvm/[email protected]/
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- New patch
---
arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 810d552e9bf6..a0b7c43e843d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,6 +92,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
+static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
+{
+ if (only_valid && root->role.invalid)
+ return false;
+
+ return true;
+}
+
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
* NULL). A reference to the returned root is acquired, and the reference to
@@ -125,7 +133,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);
while (next_root) {
- if ((!only_valid || !next_root->role.invalid) &&
+ if (tdp_mmu_root_match(next_root, only_valid) &&
kvm_tdp_mmu_get_root(next_root))
break;
@@ -176,7 +184,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
- ((_only_valid) && (_root)->role.invalid))) { \
+ !tdp_mmu_root_match((_root), (_only_valid)))) { \
} else
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
--
2.34.1
From: Isaku Yamahata <[email protected]>
Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1]
so that the iterator on the root page table can consistently filter the
root page table type instead of only_valid.
TDX KVM will operate on KVM page tables with specified types: shared page
table, private page table, or both. Introduce an enum instead of the bool
only_valid so that the applicable page table types can easily be extended
to shared, private, or both, in addition to valid or not. Replace
only_valid=false with KVM_ANY_ROOTS and only_valid=true with
KVM_ANY_VALID_ROOTS. Use KVM_ANY_ROOTS and KVM_ANY_VALID_ROOTS to wrap
KVM_VALID_ROOTS to avoid further code churn when shared and private are
introduced.
Link: https://lore.kernel.org/kvm/[email protected]/ [1]
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Newly introduced.
---
arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++-------------------
arch/x86/kvm/mmu/tdp_mmu.h | 7 +++++++
2 files changed, 27 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a0b7c43e843d..7af395073e92 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,9 +92,10 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
-static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
+static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
+ enum kvm_tdp_mmu_root_types types)
{
- if (only_valid && root->role.invalid)
+ if ((types & KVM_VALID_ROOTS) && root->role.invalid)
return false;
return true;
@@ -102,17 +103,17 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid)
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
- * NULL). A reference to the returned root is acquired, and the reference to
- * @prev_root is released (the caller obviously must hold a reference to
- * @prev_root if it's non-NULL).
+ * NULL) that matches with @types. A reference to the returned root is
+ * acquired, and the reference to @prev_root is released (the caller obviously
+ * must hold a reference to @prev_root if it's non-NULL).
*
- * If @only_valid is true, invalid roots are skipped.
+ * Roots that don't match @types are skipped.
*
* Returns NULL if the end of tdp_mmu_roots was reached.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
- bool only_valid)
+ enum kvm_tdp_mmu_root_types types)
{
struct kvm_mmu_page *next_root;
@@ -133,7 +134,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);
while (next_root) {
- if (tdp_mmu_root_match(next_root, only_valid) &&
+ if (tdp_mmu_root_match(next_root, types) &&
kvm_tdp_mmu_get_root(next_root))
break;
@@ -158,20 +159,20 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* If shared is set, this function is operating under the MMU lock in read
* mode.
*/
-#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _types); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \
+ _root = tdp_mmu_next_root(_kvm, _root, _types)) \
if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, true)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_ANY_VALID_ROOTS)
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, false); \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ANY_ROOTS); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, false))
+ _root = tdp_mmu_next_root(_kvm, _root, KVM_ANY_ROOTS))
/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
@@ -180,18 +181,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
-#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \
+#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _types) \
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
- !tdp_mmu_root_match((_root), (_only_valid)))) { \
+ !tdp_mmu_root_match((_root), (_types)))) { \
} else
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ANY_ROOTS)
#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
- __for_each_tdp_mmu_root(_kvm, _root, _as_id, true)
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ANY_VALID_ROOTS)
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
@@ -1389,7 +1390,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
{
struct kvm_mmu_page *root;
- __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ANY_ROOTS)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ac350c51bc18..30f2ab88a642 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -19,6 +19,13 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
+enum kvm_tdp_mmu_root_types {
+ KVM_VALID_ROOTS = BIT(0),
+
+ KVM_ANY_ROOTS = 0,
+ KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
+};
+
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
--
2.34.1
From: Isaku Yamahata <[email protected]>
Add more types, shared and private, to enum kvm_tdp_mmu_root_types to
specify KVM MMU roots [1] so that the iterator on the root page table can
consistently filter the root page table type.
TDX KVM will operate on KVM page tables with specified types: shared page
table, private page table, or both. Introduce an enum to specify those
page table types and make the iterator take the specified root type:
valid or not, and shared, private, or both. Enhance tdp_mmu_root_match()
to understand private vs. shared.
Suggested-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/kvm/[email protected]/ [1]
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- New patch
---
arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++-
arch/x86/kvm/mmu/tdp_mmu.h | 14 ++++++++++----
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7af395073e92..8914c5b0d5ab 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -95,10 +95,20 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
enum kvm_tdp_mmu_root_types types)
{
+ if (WARN_ON_ONCE(types == BUGGY_KVM_ROOTS))
+ return false;
+ if (WARN_ON_ONCE(!(types & (KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS))))
+ return false;
+
if ((types & KVM_VALID_ROOTS) && root->role.invalid)
return false;
- return true;
+ if ((types & KVM_SHARED_ROOTS) && !is_private_sp(root))
+ return true;
+ if ((types & KVM_PRIVATE_ROOTS) && is_private_sp(root))
+ return true;
+
+ return false;
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 30f2ab88a642..6a65498b481c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,12 +20,18 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
enum kvm_tdp_mmu_root_types {
- KVM_VALID_ROOTS = BIT(0),
-
- KVM_ANY_ROOTS = 0,
- KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
+ BUGGY_KVM_ROOTS = BUGGY_KVM_INVALIDATION,
+ KVM_SHARED_ROOTS = KVM_PROCESS_SHARED,
+ KVM_PRIVATE_ROOTS = KVM_PROCESS_PRIVATE,
+ KVM_VALID_ROOTS = BIT(2),
+ KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS,
+ KVM_ANY_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS,
};
+static_assert(!(KVM_SHARED_ROOTS & KVM_VALID_ROOTS));
+static_assert(!(KVM_PRIVATE_ROOTS & KVM_VALID_ROOTS));
+static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1));
+
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
--
2.34.1
The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
allows its caller, mmu_alloc_direct_roots(), to call
kvm_tdp_mmu_alloc_root() and return 0 in one line:

	return kvm_tdp_mmu_alloc_root(vcpu);

So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
will be called twice in mmu_alloc_direct_roots(). This would force the
first call to either awkwardly handle a return value that will always
be zero, or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.

Do it in a separate change so the future change will be cleaner.
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 6 ++++--
arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 808805b3478d..76f92cb37a96 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3700,8 +3700,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
unsigned i;
int r;
- if (tdp_mmu_enabled)
- return kvm_tdp_mmu_alloc_root(vcpu);
+ if (tdp_mmu_enabled) {
+ kvm_tdp_mmu_alloc_root(vcpu);
+ return 0;
+ }
write_lock(&vcpu->kvm->mmu_lock);
r = make_mmu_pages_available(vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6fa910b017d1..0d6d96d86703 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
-int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
@@ -285,7 +285,6 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
*/
mmu->root.hpa = __pa(root->spt);
mmu->root.pgd = 0;
- return 0;
}
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 58b55e61bd33..437ddd4937a9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
-int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
--
2.34.1
From: Isaku Yamahata <[email protected]>
Allocate a mirrored page table for the private page table and implement MMU
hooks to operate on the private page table.

To handle a page fault in a private GPA, KVM walks the mirrored page table in
unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
changes from the mirrored page table to the private page table.
    private KVM page fault   |
            |                |
            V                |
       private GPA           |       CPU protected EPTP
            |                |              |
            V                |              V
    mirrored PT root         |       private PT root
            |                |              |
            V                |              V
      mirrored PT --hook to propagate--> private PT
            |                |              |
            \----------------+------\       |
                             |      |       |
                             |      V       V
                             |   private guest page
                             |
                             |
  non-encrypted memory       |     encrypted memory
                             |
PT: page table
Private PT: The CPU uses it, but it is invisible to KVM. The TDX module
manages this table to map private guest pages.
Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
to propagate PT changes to the actual private PT.
SPTEs in the mirrored page table (refer to them as mirrored SPTEs
hereafter) can be modified atomically with mmu_lock held for read; however,
the MMU hooks into the private page table are not atomic operations.

To address this, a special REMOVED_SPTE is introduced, and the sequence
below is used when mirrored SPTEs are updated atomically:
1. The mirrored SPTE is first atomically written as REMOVED_SPTE.
2. The successful updater of the mirrored SPTE in step 1 proceeds with the
   following steps.
3. Invoke MMU hooks to modify the private page table with the target value.
4. (a) If the hook succeeds, update the mirrored SPTE to the target value.
   (b) If the hook fails, restore the mirrored SPTE to the original value.

The KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.

This sequence also applies when SPTEs are atomically updated from
non-present to present, in order to prevent potential conflicts when
multiple vCPUs attempt to set private SPTEs to different page sizes
simultaneously, though only the 4K page size is currently supported for
the private page table.

2M page support can be done in future patches.
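As a simplified sketch of the sequence above (modeled on the
set_private_spte_present() logic added below, for the leaf-populating case):

	/* 1. Freeze the mirrored SPTE by writing REMOVED_SPTE. */
	if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
		return -EBUSY;	/* Another thread won the race. */

	/* 2+3. Only the freezing thread gets here; invoke the MMU hook. */
	ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);

	/*
	 * 4. Unfreeze: install the target value on success, or restore the
	 * original value on failure.
	 */
	if (ret)
		__kvm_tdp_mmu_write_spte(sptep, old_spte);
	else
		__kvm_tdp_mmu_write_spte(sptep, new_spte);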
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Co-developed-by: Yan Zhao <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Remove unnecessary gfn, access twist in
tdp_mmu_map_handle_target_level(). (Chao Gao)
- Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
tdp_mmu_alloc_sp()
- Update comment in set_private_spte_present() (Yan)
- Open code call to kvm_mmu_init_private_spt() (Yan)
- Add comments on TDX MMU hooks (Yan)
- Fix various whitespace alignment (Yan)
- Remove pointless warnings and conditionals in
handle_removed_private_spte() (Yan)
- Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
- Remove incorrect comment in handle_changed_spte() (Yan)
- Remove unneeded kvm_pfn_to_refcounted_page() and
is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
- Do kvm_gfn_for_root() branchless (Rick)
- Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
- Add comment for stripping shared bit for fault.gfn (Chao)
v19:
- drop CONFIG_KVM_MMU_PRIVATE
v18:
- Rename freezed => frozen
v14 -> v15:
- Refined is_private condition check in kvm_tdp_mmu_map().
Add kvm_gfn_shared_mask() check.
- catch up for struct kvm_range change
---
arch/x86/include/asm/kvm-x86-ops.h | 5 +
arch/x86/include/asm/kvm_host.h | 25 +++
arch/x86/kvm/mmu/mmu.c | 13 +-
arch/x86/kvm/mmu/mmu_internal.h | 19 +-
arch/x86/kvm/mmu/tdp_iter.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 269 +++++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
7 files changed, 293 insertions(+), 42 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 566d19b02483..d13cb4b8fce6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_private_spt)
+KVM_X86_OP_OPTIONAL(free_private_spt)
+KVM_X86_OP_OPTIONAL(set_private_spte)
+KVM_X86_OP_OPTIONAL(remove_private_spte)
+KVM_X86_OP_OPTIONAL(zap_private_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d010ca5c7f44..20fa8fa58692 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -470,6 +470,7 @@ struct kvm_mmu {
int (*sync_spte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, int i);
struct kvm_mmu_root_info root;
+ hpa_t private_root_hpa;
union kvm_cpu_role cpu_role;
union kvm_mmu_page_role root_role;
@@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ /* Add a page as page table page into private page table */
+ int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *private_spt);
+ /*
+ * Free a page table page of private page table.
+ * Only expected to be called when guest is not active, specifically
+ * during VM destruction phase.
+ */
+ int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *private_spt);
+
+ /* Add a guest private page into private page table */
+ int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
+
+ /* Remove a guest private page from private page table */
+ int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ kvm_pfn_t pfn);
+ /*
+ * Keep a guest private page mapped in private page table, but clear its
+ * present bit
+ */
+ int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 76f92cb37a96..2506d6277818 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;
if (tdp_mmu_enabled) {
- kvm_tdp_mmu_alloc_root(vcpu);
+ if (kvm_gfn_shared_mask(vcpu->kvm))
+ kvm_tdp_mmu_alloc_root(vcpu, true);
+ kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
}
@@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
- gfn_t base = gfn_round_for_level(fault->gfn,
+ gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
fault->max_level);
if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
@@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
mmu->root.hpa = INVALID_PAGE;
mmu->root.pgd = 0;
+ mmu->private_root_hpa = INVALID_PAGE;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
@@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
+ if (tdp_mmu_enabled) {
+ read_lock(&vcpu->kvm->mmu_lock);
+ mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
+ NULL);
+ read_unlock(&vcpu->kvm->mmu_lock);
+ }
free_mmu_pages(&vcpu->arch.root_mmu);
free_mmu_pages(&vcpu->arch.guest_mmu);
mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0f1a9d733d9e..3a7fe9261e23 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@
#include <linux/kvm_host.h>
#include <asm/kvm_host.h>
+#include "mmu.h"
+
#ifdef CONFIG_KVM_PROVE_MMU
#define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
#else
@@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
}
+static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t gfn)
+{
+ gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
+
+ /* Set shared bit if not private */
+ gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
+ return gfn_for_root;
+}
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
@@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
int r;
if (vcpu->arch.mmu->root_role.direct) {
- fault.gfn = fault.addr >> PAGE_SHIFT;
+ /*
+ * Things like memslots don't understand the concept of a shared
+ * bit. Strip it so that the GFN can be used like normal, and the
+ * fault.addr can be used when the shared bit is needed.
+ */
+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..8a64bcef9deb 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,7 +91,7 @@ struct tdp_iter {
tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
/* A pointer to the current SPTE */
tdp_ptep_t sptep;
- /* The lowest GFN mapped by the current SPTE */
+ /* The lowest GFN (shared bits included) mapped by the current SPTE */
gfn_t gfn;
/* The level of the root page given to the iterator */
int root_level;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0d6d96d86703..810d552e9bf6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
@@ -232,6 +232,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
+ if (private)
+ kvm_mmu_page_role_set_private(&role);
+
/*
* Check for an existing root before acquiring the pages lock to avoid
* unnecessary serialization if multiple vCPUs are loading a new root.
@@ -283,13 +286,17 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
* and actually consuming the root if it's invalidated after dropping
* mmu_lock, and the root can't be freed as this vCPU holds a reference.
*/
- mmu->root.hpa = __pa(root->spt);
- mmu->root.pgd = 0;
+ if (private) {
+ mmu->private_root_hpa = __pa(root->spt);
+ } else {
+ mmu->root.hpa = __pa(root->spt);
+ mmu->root.pgd = 0;
+ }
}
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared);
+ u64 old_spte, u64 new_spte,
+ union kvm_mmu_page_role role, bool shared);
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -416,12 +423,124 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
REMOVED_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_spte, REMOVED_SPTE, level, shared);
+ old_spte, REMOVED_SPTE, sp->role,
+ shared);
+ }
+
+ if (is_private_sp(sp) &&
+ WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
+ kvm_mmu_private_spt(sp)))) {
+ /*
+ * Failed to free page table page in private page table and
+ * there is nothing to do further.
+ * Intentionally leak the page to prevent the kernel from
+ * accessing the encrypted page.
+ */
+ sp->private_spt = NULL;
}
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
+static void *get_private_spt(gfn_t gfn, u64 new_spte, int level)
+{
+ if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+ struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
+ void *private_spt = kvm_mmu_private_spt(sp);
+
+ WARN_ON_ONCE(!private_spt);
+ WARN_ON_ONCE(sp->role.level + 1 != level);
+ WARN_ON_ONCE(sp->gfn != gfn);
+ return private_spt;
+ }
+
+ return NULL;
+}
+
+static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
+ u64 old_spte, u64 new_spte,
+ int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
+ kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+ int ret;
+
+ /*
+ * Allow only leaf pages to be zapped. Non-leaf page table pages are
+ * reclaimed when the VM is destroyed.
+ */
+ if (!was_leaf)
+ return;
+
+ /* Zapping leaf spte is allowed only when write lock is held. */
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+ /* Because the write lock is held, the operation should succeed. */
+ if (KVM_BUG_ON(ret, kvm))
+ return;
+
+ ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
+ KVM_BUG_ON(ret, kvm);
+}
+
+static int __must_check __set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ int ret = 0;
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ /* TDP MMU doesn't change present -> present */
+ KVM_BUG_ON(was_present, kvm);
+
+ /*
+ * Use a different call to either set up a middle level
+ * private page table, or a leaf.
+ */
+ if (is_leaf) {
+ ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);
+ } else {
+ void *private_spt = get_private_spt(gfn, new_spte, level);
+
+ KVM_BUG_ON(!private_spt, kvm);
+ ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt);
+ }
+
+ return ret;
+}
+
+static int __must_check set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ int ret;
+
+ /*
+ * For private page table, callbacks are needed to propagate SPTE
+ * change into the private page table. In order to atomically update
+ * both the SPTE and the private page tables with callbacks, utilize
+ * freezing SPTE.
+ * - Freeze the SPTE. Set entry to REMOVED_SPTE.
+ * - Trigger callbacks for private page tables.
+ * - Unfreeze the SPTE. Set the entry to new_spte.
+ */
+ lockdep_assert_held(&kvm->mmu_lock);
+ if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
+ return -EBUSY;
+
+ ret = __set_private_spte_present(kvm, sptep, gfn, old_spte, new_spte, level);
+ if (ret)
+ __kvm_tdp_mmu_write_spte(sptep, old_spte);
+ else
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return ret;
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -429,7 +548,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* @gfn: the base GFN that was mapped by the SPTE
* @old_spte: The value of the SPTE before the change
* @new_spte: The value of the SPTE after the change
- * @level: the level of the PT the SPTE is part of in the paging structure
+ * @role: the role of the PT the SPTE is part of in the paging structure
* @shared: This operation may not be running under the exclusive use of
* the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
@@ -439,14 +558,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* and fast_pf_fix_direct_spte()).
*/
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
- u64 old_spte, u64 new_spte, int level,
- bool shared)
+ u64 old_spte, u64 new_spte,
+ union kvm_mmu_page_role role, bool shared)
{
+ bool is_private = kvm_mmu_page_role_is_private(role);
+ int level = role.level;
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
- bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+ kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+ bool pfn_changed = old_pfn != new_pfn;
WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
if (was_leaf && is_dirty_spte(old_spte) &&
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ kvm_set_pfn_dirty(old_pfn);
/*
* Recursively handle child PTs if the change removed a subtree from
@@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* pages are kernel allocations and should never be migrated.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
+ kvm);
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+ }
+
+ if (is_private && !is_present)
+ handle_removed_private_spte(kvm, gfn, old_spte, new_spte, role.level);
if (was_leaf && is_accessed_spte(old_spte) &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}
-static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
+static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);
@@ -542,15 +671,42 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
*/
WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
- /*
- * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
- * does not hold the mmu_lock. On failure, i.e. if a different logical
- * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
- * the current value, so the caller operates on fresh data, e.g. if it
- * retries tdp_mmu_set_spte_atomic()
- */
- if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
- return -EBUSY;
+ if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
+ int ret;
+
+ if (is_shadow_present_pte(new_spte)) {
+ /*
+ * Populating case.
+ * - set_private_spte_present() implements
+ * 1) Freeze SPTE
+ * 2) call hooks to update private page table,
+ * 3) update SPTE to new_spte
+ * - handle_changed_spte() only updates stats.
+ */
+ ret = set_private_spte_present(kvm, iter->sptep, iter->gfn,
+ iter->old_spte, new_spte, iter->level);
+ if (ret)
+ return ret;
+ } else {
+ /*
+ * Zapping case.
+ * Zap is only allowed when write lock is held
+ */
+ if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte)))
+ return -EBUSY;
+ }
+ } else {
+ /*
+ * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+ * and does not hold the mmu_lock. On failure, i.e. if a
+ * different logical CPU modified the SPTE, try_cmpxchg64()
+ * updates iter->old_spte with the current value, so the caller
+ * operates on fresh data, e.g. if it retries
+ * tdp_mmu_set_spte_atomic()
+ */
+ if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+ return -EBUSY;
+ }
return 0;
}
@@ -576,23 +732,24 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
+ u64 *sptep = rcu_dereference(iter->sptep);
int ret;
lockdep_assert_held_read(&kvm->mmu_lock);
- ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
if (ret)
return ret;
handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, true);
-
+ new_spte, sptep_to_sp(sptep)->role, true);
return 0;
}
static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
+ union kvm_mmu_page_role role;
int ret;
lockdep_assert_held_read(&kvm->mmu_lock);
@@ -605,7 +762,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* Delay processing of the zapped SPTE until after TLBs are flushed and
* the REMOVED_SPTE is replaced (see below).
*/
- ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
if (ret)
return ret;
@@ -619,6 +776,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
*/
__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
+
+ role = sptep_to_sp(iter->sptep)->role;
/*
* Process the zapped SPTE after flushing TLBs, and after replacing
* REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
@@ -626,7 +785,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* SPTEs.
*/
handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- 0, iter->level, true);
+ SHADOW_NONPRESENT_VALUE, role, true);
return 0;
}
@@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
u64 old_spte, u64 new_spte, gfn_t gfn, int level)
{
+ union kvm_mmu_page_role role;
+
lockdep_assert_held_write(&kvm->mmu_lock);
/*
@@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
+ if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
+ is_shadow_present_pte(new_spte)) {
+ /* Because the write lock is held, there is no race. It should succeed. */
+ KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte,
+ new_spte, level), kvm);
+ }
- handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+ role = sptep_to_sp(sptep)->role;
+ role.level = level;
+ handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
return old_spte;
}
@@ -684,8 +853,11 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
continue; \
else
-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \
+ for_each_tdp_pte(_iter, \
+ root_to_sp((_private) ? _mmu->private_root_hpa : \
+ _mmu->root.hpa), \
+ _start, _end)
/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -853,6 +1025,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
lockdep_assert_held_write(&kvm->mmu_lock);
+ /*
+ * start and end don't have the GFN shared bit. This function zaps
+ * a region including aliases. Adjust the shared bit of [start, end)
+ * if the root is shared.
+ */
+ start = kvm_gfn_for_root(kvm, root, start);
+ end = kvm_gfn_for_root(kvm, root, end);
+
rcu_read_lock();
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
@@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ fault->pfn, iter->old_spte, fault->prefetch, true,
+ fault->map_writable, &new_spte);
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
+ gfn_t raw_gfn;
+ bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
int ret = RET_PF_RETRY;
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1116,7 +1298,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
rcu_read_lock();
- tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+ raw_gfn = gpa_to_gfn(fault->addr);
+
+ tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
int r;
if (fault->nx_huge_page_workaround_enabled)
@@ -1142,14 +1326,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* needs to be split.
*/
sp = tdp_mmu_alloc_sp(vcpu);
+ if (kvm_is_private_gpa(kvm, raw_gfn << PAGE_SHIFT))
+ kvm_mmu_alloc_private_spt(vcpu, sp);
tdp_mmu_init_child_sp(sp, &iter);
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
- if (is_shadow_present_pte(iter.old_spte))
+ if (is_shadow_present_pte(iter.old_spte)) {
+ /*
+ * TODO: large page support.
+ * Large pages are not supported for TDX yet.
+ */
+ KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
- else
+ } else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+ }
/*
* Force the guest to retry if installing an upper level SPTE
@@ -1780,7 +1972,7 @@ static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
@@ -1838,7 +2030,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;
- tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ /* fast page fault for private GPA isn't supported. */
+ WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
+
+ tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 437ddd4937a9..ac350c51bc18 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
--
2.34.1
From: Isaku Yamahata <[email protected]>
Teach the MMU notifier callbacks to check kvm_gfn_range.process to
filter which KVM MMU root types to operate on.

Private GPAs are backed by guest memfd. Such memory is not subject
to MMU notifier callbacks because it can't be mapped into the host user
address space. Now kvm_gfn_range conveys info about which root to operate
on. Enhance the callbacks to filter on the root page table type.

The KVM MMU notifier comes down to two functions:
kvm_tdp_mmu_unmap_gfn_range() and kvm_tdp_mmu_handle_gfn().
For VMs without a private/shared split in the EPT, all operations
should target the normal (shared) root. Adjust the target roots based
on kvm_gfn_shared_mask().

invalidate_range_start() funnels into kvm_tdp_mmu_unmap_gfn_range();
invalidate_range_end() doesn't reach arch code.
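Concretely, the conversion added below in kvm_process_to_root_types()
behaves roughly like this (illustrative):

	/*
	 * Normal VM: no shared mask, so a private request is folded into
	 * shared and the single normal root is operated on.
	 */
	kvm_process_to_root_types(kvm, KVM_PROCESS_PRIVATE);
						/* == KVM_SHARED_ROOTS */

	/* TDX VM: the shared mask is set, so the request is used as-is. */
	kvm_process_to_root_types(kvm, KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE);
						/* == KVM_ANY_ROOTS */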
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Remove warning (Rick)
- Remove confusing mention of mapping flags (Chao)
- Re-write coverletter
v19:
- type: test_gfn() => test_young()
v18:
- newly added
---
arch/x86/kvm/mmu/tdp_mmu.c | 40 +++++++++++++++++++++++++++++++++++---
1 file changed, 37 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index eb88af48c8f0..af61d131d2dc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1396,12 +1396,32 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
}
+static enum kvm_tdp_mmu_root_types kvm_process_to_root_types(struct kvm *kvm,
+ enum kvm_process process)
+{
+ WARN_ON_ONCE(process == BUGGY_KVM_INVALIDATION);
+
+ /* Always process shared for cases where private is not on a separate root */
+ if (!kvm_gfn_shared_mask(kvm)) {
+ process |= KVM_PROCESS_SHARED;
+ process &= ~KVM_PROCESS_PRIVATE;
+ }
+
+ return (enum kvm_tdp_mmu_root_types)process;
+}
+
+/* Used by mmu notifier via kvm_unmap_gfn_range() */
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
+ enum kvm_tdp_mmu_root_types types = kvm_process_to_root_types(kvm, range->process);
struct kvm_mmu_page *root;
- __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ANY_ROOTS)
+ /* kvm_process_to_root_types() has a WARN_ON_ONCE(). Don't warn again. */
+ if (types == BUGGY_KVM_ROOTS)
+ return flush;
+
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);
@@ -1415,18 +1435,32 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct kvm_gfn_range *range,
tdp_handler_t handler)
{
+ enum kvm_tdp_mmu_root_types types = kvm_process_to_root_types(kvm, range->process);
struct kvm_mmu_page *root;
struct tdp_iter iter;
bool ret = false;
+ if (types == BUGGY_KVM_ROOTS)
+ return ret;
+
/*
* Don't support rescheduling, none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
- for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ __for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
+ gfn_t start, end;
+
+ /*
+ * For TDX shared mapping, set GFN shared bit to the range,
+ * so the handler() doesn't need to set it, to avoid duplicated
+ * code in multiple handler()s.
+ */
+ start = kvm_gfn_for_root(kvm, root, range->start);
+ end = kvm_gfn_for_root(kvm, root, range->end);
+
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ tdp_root_for_each_leaf_pte(iter, root, start, end)
ret |= handler(kvm, &iter, range);
rcu_read_unlock();
--
2.34.1
From: Sean Christopherson <[email protected]>
When invalidating roots, respect the root type passed.
kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a TD it needs to invalidate all roots. Check the root type in the root
iterator.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
[evolved quite a bit from original author's patch]
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- Rename from "Don't zap private pages for unsupported cases", and split
many parts out.
- Don't support MTRR, apic zapping (Rick)
- Detangle private/shared alias logic in kvm_tdp_mmu_unmap_gfn_range()
(Rick)
- Fix TLB flushing bug debugged by (Chao Gao)
https://lore.kernel.org/kvm/Zh8yHEiOKyvZO+QR@chao-email/
- Split out MTRR part
- Use enum based root iterators (Sean)
- Reorder logic in kvm_mmu_zap_memslot_leafs().
- Replace skip_private with enum kvm_tdp_mmu_root_type.
---
arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index af61d131d2dc..42ccafc7deff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1196,6 +1196,9 @@ void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
* or get/put references to roots.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!tdp_mmu_root_match(root, types))
+ continue;
+
/*
* Note, invalid roots can outlive a memslot update! Invalid
* roots must be *zapped* before the memslot update completes,
--
2.34.1
From: Isaku Yamahata <[email protected]>
Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and make it take an enum
kvm_tdp_mmu_root_types argument.
Have the callers only invalidate the required roots instead of all
roots.
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
TDX MMU Part 1:
- New patch
---
arch/x86/kvm/mmu/mmu.c | 9 +++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 5 +++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 ++-
3 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2506d6277818..338628094ad7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6414,8 +6414,13 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* write and in the same critical section as making the reload request,
* e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
*/
- if (tdp_mmu_enabled)
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ if (tdp_mmu_enabled) {
+ /*
+ * The private page tables don't support fast zapping. The
+ * caller should handle it another way.
+ */
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_SHARED_ROOTS);
+ }
/*
* Notify all vcpus to reload its shadow page table and flush TLB.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8914c5b0d5ab..eb88af48c8f0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
* for zapping and thus puts the TDP MMU's reference to each root, i.e.
* ultimately frees all roots.
*/
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_ANY_ROOTS);
kvm_tdp_mmu_zap_invalidated_roots(kvm);
WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -1170,7 +1170,8 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_alloc_root().
*/
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_tdp_mmu_root_types types)
{
struct kvm_mmu_page *root;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 6a65498b481c..b8a967426fac 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -35,7 +35,8 @@ static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1));
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_tdp_mmu_root_types types);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
--
2.34.1
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> From: Yan Zhao <[email protected]>
>
> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf
> SPTEs when deleting a memslot.
>
> Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
> Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
> AMD) and generate fresh new TDPs based on the new memslot layout. This is
> because zapping and re-generating TDPs is low overhead for most use cases,
> and more importantly, it's due to a bug [1] which caused VM instability
> when a VM has an Nvidia GeForce GPU assigned.
>
> There's a previous attempt [2] to introduce a per-VM flag to workaround bug
> [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs.
> However, [2] was not merged due to the lack of a clear explanation of
> exactly what is broken [3] and it's not wise to "have a bug that is known
> to happen when you enable the capability".
>
> However, for some specific scenarios, e.g. TDX, invalidating and
> re-generating a new page table is not viable for the following reasons:
> - TDX requires that the root page of the private page table remain unaltered
> throughout the TD life cycle.
> - TDX mandates that leaf entries in the private page table be zapped prior
> to non-leaf entries.
>
> So, Sean reconsidered introducing a per-VM flag or per-memslot flag
> again for VMs like TDX. [4]
>
> This patch is an implementation of per-memslot flag.
> Compared to per-VM flag approach,
> Pros:
> (1) By allowing userspace to control the zapping behavior in fine-grained
> granularity, optimizations for specific use cases can be developed
> without future kernel changes.
> (2) Allows developing new zapping behaviors without risking regressions by
> changing KVM behavior, as seen previously.
>
> Cons:
> (1) Users need to ensure all necessary memslots have the flag
> KVM_MEM_ZAP_LEAFS_ONLY set, e.g. QEMU needs to ensure all GUEST_MEMFD
> memslots have the ZAP_LEAFS_ONLY flag for a TDX VM.
> (2) Opens up the possibility that userspace could configure memslots for
> normal VM in such a way that the bug [1] is seen.
I don't quite follow the logic of why userspace should be involved.

TDX cannot use "page table fast zap", and needs to use a different way to
zap, a.k.a. zap-leaf-only while holding the MMU write lock, but this doesn't
necessarily mean such a thing should be exposed to userspace?

It's weird that userspace needs to control how KVM zaps page tables for
memslot delete/move.

[2] mentions there is a performance improvement for certain VMs if KVM
does zap-leaf-only, but AFAICT it doesn't provide a concrete argument for why
it needs to be exposed to userspace so it can be done per-VM (that whole
thread basically was talking about the bug in [1]). E.g., see:
https://lore.kernel.org/kvm/[email protected]/T/#m702b273057cc318465cb5a1677d94e923dce9832
"
Ya, a capability is a bad idea. I was coming at it from the angle that, if
there is a fundamental requirement with e.g. GPU passthrough that requires
zapping all SPTEs, then enabling the precise capability on a per-VM basis
would make sense. But adding something to the ABI on pure speculation is
silly.
"
So to me it looks like overkill to expose this "zap-leaf-only" to userspace.
We can just set this flag for a TDX guest when a memslot is created in KVM.
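E.g., something like the below (completely untested sketch) at memslot
creation, instead of the -EINVAL check in kvm_arch_prepare_memory_region():

	/* Have KVM set the flag itself for TDX guests' private memslots. */
	if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
	    (new->flags & KVM_MEM_GUEST_MEMFD))
		new->flags |= KVM_MEM_ZAP_LEAFS_ONLY;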
>
> However, one thing worth noting for TDX is that TDX may potentially
> hit bug [1] with either the per-memslot flag or per-VM flag approach, since
> there's a usage on the radar to assign an untrusted & passthrough GPU device
> to a TDX VM. If that happens, it can be treated as a bug (not a regression)
> and fixed accordingly.
>
> An alternative approach we can also consider is to always invalidate &
> rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
> and private page tables on memslot deletion. This approach could exempt TDX
> from bug [1] when "untrusted & passthrough" devices are involved. But
> downside is that this approach requires creating new very specific KVM
> zapping ABI that could limit future changes in the same way that the bug
> did for normal VMs.
>
> Link: https://patchwork.kernel.org/project/kvm/patch/[email protected] [1]
> Link: https://lore.kernel.org/kvm/[email protected]/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
> Link: https://lore.kernel.org/kvm/[email protected]/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
> Link: https://lore.kernel.org/kvm/[email protected] [4]
> Signed-off-by: Yan Zhao <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> TDX MMU Part 1:
> - New patch
> ---
> arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 17 +++++++++++++++++
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 5 ++++-
> 4 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61982da8c8b2..4a8e819794db 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> kvm_mmu_zap_all(kvm);
> }
>
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> + return;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = true,
> + };
> +
> + if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
> + kvm_mmu_zap_memslot_leafs(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..4b3ec2ec79e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> return -EINVAL;
>
> + /*
> + * Since TDX private pages require re-accepting after zap,
> + * and the TDX private root page should not be zapped, TDX requires
> + * that memslots for private memory have the flag
> + * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
> + * the memslot being deleted will be zapped and SPTEs in other
> + * memslots are not affected.
> + */
> + if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
> + (new->flags & KVM_MEM_GUEST_MEMFD) &&
> + !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
> + return -EINVAL;
> +
> + /* zap-leafs-only works only when TDP MMU is enabled for now */
> + if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
> + return -EINVAL;
If this zap-leaf-only is supposed to be generic, I don't see why we want
to make it only for the TDP MMU?
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> guest mappings so they can be faulted in with different PTE properties.
>
> For TDX private memory this technique is fundamentally not possible.
> Remapping private memory requires the guest to "accept" it, and also the
> needed PTE properties are not currently supported by TDX for private
> memory.
>
> These CPU features are:
> 1) MTRR update
> 2) CR0.CD update
> 3) Non-coherent DMA status update
> 4) APICV update
>
> Since they cannot be supported, they should be blocked from being
> exercised by a TD. In the case of CR0.CD, the feature is fundamentally not
> supported for TDX as KVM cannot see the guest registers. For APICV,
> it will be inhibited in future changes.
>
> Guest MTRR support is more of an interesting case. Supported versions of
> the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
> it is not possible to fully support the feature. This leaves KVM with a
> few options:
> - Support a modified version of the architecture where the caching
> attributes are ignored for private memory.
> - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
> bug.
>
> With the additional consideration that likely guest MTRR support in KVM
> will be going away, the latter option is best. Prevent MTRR MSR writes
> from calling kvm_zap_gfn_range() in future changes.
>
> Lastly, the most interesting case is non-coherent DMA status updates.
> There isn't a way to reject the call. KVM is just notified that there is a
> non-coherent DMA device attached, and expected to act accordingly. For
> normal VMs today, that means to start respecting guest PAT. However,
> recently there has been a proposal to avoid doing this on selfsnoop CPUs
> (see link). On such CPUs it should not be problematic to simply always
> configure the EPT to honor guest PAT. In future changes TDX can enforce
> this behavior for shared memory, resulting in shared memory always
> respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
> called in this case either.
>
> Unfortunately, this will result in different cache attributes between
> private and shared memory, as private memory is always WB and cannot be
> changed by the VMM on current TDX modules. But it can't really be helped
> while also supporting non-coherent DMA devices.
>
> Since all callers will be prevented from calling kvm_zap_gfn_range() in
> future changes, report a bug and terminate the guest if other future
> changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
>
> For lack of a better method currently, use kvm_gfn_shared_mask() to
> determine if private memory cannot be zapped (as in TDX, the only VM type
> that sets it).
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> TDX MMU Part 1:
> - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
> - Add this KVM_BUG_ON() instead
> ---
> arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d5cf5b15a10e..808805b3478d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> - if (tdp_mmu_enabled)
> + if (tdp_mmu_enabled) {
> + /*
> + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> + * type was changed. TDX can't handle zapping the private
> + * mapping, but it's ok because KVM doesn't support either of
> + * those features for TDX. In case a new caller appears, BUG
> + * the VM if it's called for solutions with private aliases.
> + */
> + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
> + }
>
> if (flush)
> kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
kvm_zap_gfn_range() looks like a generic function. I think it makes more sense
to let the callers explicitly check whether the VM is a TDX guest and do the
KVM_BUG_ON()?
On Wed, 2024-05-15 at 13:27 +0000, Huang, Kai wrote:
>
> kvm_zap_gfn_range() looks like a generic function. I think it makes more sense
> to let the callers explicitly check whether the VM is a TDX guest and do the
> KVM_BUG_ON()?
Other TDX changes will prevent this function from getting called, so basically
like you are suggesting. This change is to catch any new cases that pop up,
which we can't do at the callers.
On Tue, May 14, 2024, Rick Edgecombe wrote:
> When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> guest mappings so they can be faulted in with different PTE properties.
>
> For TDX private memory this technique is fundamentally not possible.
> Remapping private memory requires the guest to "accept" it, and also the
> needed PTE properties are not currently supported by TDX for private
> memory.
>
> These CPU features are:
> 1) MTRR update
> 2) CR0.CD update
> 3) Non-coherent DMA status update
Please go review the series that removes these disasters[*], I suspect it would
literally have taken less time than writing this changelog :-)
[*] https://lore.kernel.org/all/[email protected]
> 4) APICV update
>
> Since they cannot be supported, they should be blocked from being
> exercised by a TD. In the case of CR0.CD, the feature is fundamentally not
> supported for TDX as KVM cannot see the guest registers. For APICV,
> it will be inhibited in future changes.
>
> Guest MTRR support is more of an interesting case. Supported versions of
> the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
> it is not possible to fully support the feature. This leaves KVM with a
> few options:
> - Support a modified version of the architecture where the caching
> attributes are ignored for private memory.
> - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
> bug.
>
> With the additional consideration that likely guest MTRR support in KVM
> will be going away, the latter option is best. Prevent MTRR MSR writes
> from calling kvm_zap_gfn_range() in future changes.
>
> Lastly, the most interesting case is non-coherent DMA status updates.
> There isn't a way to reject the call. KVM is just notified that there is a
> non-coherent DMA device attached, and expected to act accordingly. For
> normal VMs today, that means to start respecting guest PAT. However,
> recently there has been a proposal to avoid doing this on selfsnoop CPUs
> (see link). On such CPUs it should not be problematic to simply always
> configure the EPT to honor guest PAT. In future changes TDX can enforce
> this behavior for shared memory, resulting in shared memory always
> respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
> called in this case either.
>
> Unfortunately, this will result in different cache attributes between
> private and shared memory, as private memory is always WB and cannot be
> changed by the VMM on current TDX modules. But it can't really be helped
> while also supporting non-coherent DMA devices.
>
> Since all callers will be prevented from calling kvm_zap_gfn_range() in
> future changes, report a bug and terminate the guest if other future
> changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
>
> For lack of a better method currently, use kvm_gfn_shared_mask() to
> determine if private memory cannot be zapped (as in TDX, the only VM type
> that sets it).
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> TDX MMU Part 1:
> - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
> - Add this KVM_BUG_ON() instead
> ---
> arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d5cf5b15a10e..808805b3478d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> - if (tdp_mmu_enabled)
> + if (tdp_mmu_enabled) {
> + /*
> + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> + * type was changed. TDX can't handle zapping the private
> + * mapping, but it's ok because KVM doesn't support either of
> + * those features for TDX. In case a new caller appears, BUG
> + * the VM if it's called for solutions with private aliases.
> + */
> + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
generic name quite obviously doesn't prevent TDX details from bleeding into common
code, and dancing around things just makes it all unnecessarily confusing.
If we can't avoid bleeding TDX details into common code, my vote is to bite the
bullet and simply check vm_type.
This KVM_BUG_ON() also should not be in the tdp_mmu_enabled path. Yeah, yeah,
TDX is restricted to the TDP MMU, but there's no reason to bleed that detail all
over the place.
> flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
> + }
>
> if (flush)
> kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
> --
> 2.34.1
>
On Wed, 2024-05-15 at 08:34 -0700, Sean Christopherson wrote:
> On Tue, May 14, 2024, Rick Edgecombe wrote:
> > When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> > guest mappings so they can be faulted in with different PTE properties.
> >
> > For TDX private memory this technique is fundamentally not possible.
> > Remapping private memory requires the guest to "accept" it, and also the
> > needed PTE properties are not currently supported by TDX for private
> > memory.
> >
> > These CPU features are:
> > 1) MTRR update
> > 2) CR0.CD update
> > 3) Non-coherent DMA status update
>
> Please go review the series that removes these disasters[*], I suspect it would
> literally have taken less time than writing this changelog :-)
>
> [*] https://lore.kernel.org/all/[email protected]
We have one additional detail for TDX in that KVM will have different cache
attributes between private and shared. Although the implementation is in a later
patch, that detail has an effect on whether we need to support zapping in the
basic MMU support.
>
> > 4) APICV update
> >
> > Since they cannot be supported, they should be blocked from being
> > exercised by a TD. In the case of CR0.CD, the feature is fundamentally not
> > supported for TDX as KVM cannot see the guest registers. For APICV,
> > it will be inhibited in future changes.
> >
> > Guest MTRR support is more of an interesting case. Supported versions of
> > the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
> > it is not possible to fully support the feature. This leaves KVM with a
> > few options:
> > - Support a modified version of the architecture where the caching
> > attributes are ignored for private memory.
> > - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
> > bug.
> >
> > With the additional consideration that likely guest MTRR support in KVM
> > will be going away, the latter option is best. Prevent MTRR MSR writes
> > from calling kvm_zap_gfn_range() in future changes.
> >
> > Lastly, the most interesting case is non-coherent DMA status updates.
> > There isn't a way to reject the call. KVM is just notified that there is a
> > non-coherent DMA device attached, and expected to act accordingly. For
> > normal VMs today, that means to start respecting guest PAT. However,
> > recently there has been a proposal to avoid doing this on selfsnoop CPUs
> > (see link). On such CPUs it should not be problematic to simply always
> > configure the EPT to honor guest PAT. In future changes TDX can enforce
> > this behavior for shared memory, resulting in shared memory always
> > respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
> > called in this case either.
> >
> > Unfortunately, this will result in different cache attributes between
> > private and shared memory, as private memory is always WB and cannot be
> > changed by the VMM on current TDX modules. But it can't really be helped
> > while also supporting non-coherent DMA devices.
> >
> > Since all callers will be prevented from calling kvm_zap_gfn_range() in
> > future changes, report a bug and terminate the guest if other future
> > changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
> >
> > For lack of a better method currently, use kvm_gfn_shared_mask() to
> > determine if private memory cannot be zapped (as in TDX, the only VM type
> > that sets it).
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > ---
> > TDX MMU Part 1:
> > - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
> > - Add this KVM_BUG_ON() instead
> > ---
> > arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d5cf5b15a10e..808805b3478d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> > - if (tdp_mmu_enabled)
> > + if (tdp_mmu_enabled) {
> > + /*
> > + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > + * type was changed. TDX can't handle zapping the private
> > +	 * mapping, but it's ok because KVM doesn't support either of
> > + * those features for TDX. In case a new caller appears, BUG
> > + * the VM if it's called for solutions with private aliases.
> > + */
> > + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>
> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
> generic name quite obviously doesn't prevent TDX details from bleeding into
> common code, and dancing around things just makes it all unnecessarily confusing.
Ok, yep on the general point.
>
> If we can't avoid bleeding TDX details into common code, my vote is to bite the
> bullet and simply check vm_type.
In this case the generic property is the inability to re-fault in private memory
whenever we want. However, the reason we can get away with just not doing that
is that TDX won't support the operations that use this function. Otherwise, we
could zap only the shared half of the memory (depending on the intention of the
caller).
As for the KVM_BUG_ON()s, checking the vm type specifically seems a little less
bad, since it doesn't affect the functional part of the code. So all together,
I'd lean towards vm_type in this case.
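Something minimal like this (untested sketch):

	/*
	 * TDX can't re-fault private memory after a zap, so no caller of
	 * kvm_zap_gfn_range() is supported for it.
	 */
	KVM_BUG_ON(kvm->arch.vm_type == KVM_X86_TDX_VM, kvm);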
>
> This KVM_BUG_ON() also should not be in the tdp_mmu_enabled path. Yeah, yeah,
> TDX is restricted to the TDP MMU, but there's no reason to bleed that detail
> all over the place.
>
>
Right.
On Wed, 2024-05-15 at 08:49 -0700, Rick Edgecombe wrote:
> On Wed, 2024-05-15 at 08:34 -0700, Sean Christopherson wrote:
> > On Tue, May 14, 2024, Rick Edgecombe wrote:
> > > When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> > > guest mappings so they can be faulted in with different PTE properties.
> > >
> > > For TDX private memory this technique is fundamentally not possible.
> > > Remapping private memory requires the guest to "accept" it, and also the
> > > needed PTE properties are not currently supported by TDX for private
> > > memory.
> > >
> > > These CPU features are:
> > > 1) MTRR update
> > > 2) CR0.CD update
> > > 3) Non-coherent DMA status update
> >
> > Please go review the series that removes these disasters [*], I suspect it
> > would literally have taken less time than writing this changelog :-)
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> We have one additional detail for TDX in that KVM will have different cache
> attributes between private and shared. Although the implementation is in a later
> patch, that detail has an effect on whether we need to support zapping in the
> basic MMU support.
Or more specifically, we only need this zapping if we *try* to have consistent
cache attributes between private and shared. In the non-coherent DMA case we
can't have them be consistent because TDX doesn't support changing the private
memory in this way.
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 08:49 -0700, Rick Edgecombe wrote:
> > On Wed, 2024-05-15 at 08:34 -0700, Sean Christopherson wrote:
> > > On Tue, May 14, 2024, Rick Edgecombe wrote:
> > > > When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> > > > guest mappings so they can be faulted in with different PTE properties.
> > > >
> > > > For TDX private memory this technique is fundamentally not possible.
> > > > Remapping private memory requires the guest to "accept" it, and also the
> > > > needed PTE properties are not currently supported by TDX for private
> > > > memory.
> > > >
> > > > These CPU features are:
> > > > 1) MTRR update
> > > > 2) CR0.CD update
> > > > 3) Non-coherent DMA status update
> > >
> > > Please go review the series that removes these disasters [*], I suspect it
> > > would literally have taken less time than writing this changelog :-)
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > We have one additional detail for TDX in that KVM will have different cache
> > attributes between private and shared. Although the implementation is in a later
> > patch, that detail has an effect on whether we need to support zapping in the
> > basic MMU support.
>
> Or more specifically, we only need this zapping if we *try* to have consistent
> cache attributes between private and shared. In the non-coherent DMA case we
> can't have them be consistent because TDX doesn't support changing the private
> memory in this way.
Huh? That makes no sense. A physical page can't be simultaneously mapped SHARED
and PRIVATE, so there can't be meaningful cache attribute aliasing between private
and shared EPT entries.
Trying to provide consistency for the GPA is like worrying about having matching
PAT entries for the virtual address in two different processes.
On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > Or more specifically, we only need this zapping if we *try* to have consistent
> > cache attributes between private and shared. In the non-coherent DMA case we
> > can't have them be consistent because TDX doesn't support changing the private
> > memory in this way.
>
> Huh?  That makes no sense.  A physical page can't be simultaneously mapped SHARED
> and PRIVATE, so there can't be meaningful cache attribute aliasing between
> private and shared EPT entries.
>
> Trying to provide consistency for the GPA is like worrying about having matching
> PAT entries for the virtual address in two different processes.
No, not matching between the private and shared mappings of the same page. The
whole private memory will be WB, and the whole shared half will honor PAT.
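To illustrate, for the shared half "honor guest PAT" just means leaving the
EPT ignore-PAT bit clear. A sketch of the idea (not the actual patch, and the
function name is made up):

	/* WB EPT memtype with IPAT left clear, so the guest's PAT applies. */
	static u8 tdx_shared_mem_type(bool is_mmio)
	{
		if (is_mmio)
			return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

		return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
	}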
On Wed, May 15, 2024 at 08:34:37AM -0700,
Sean Christopherson <[email protected]> wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d5cf5b15a10e..808805b3478d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> > - if (tdp_mmu_enabled)
> > + if (tdp_mmu_enabled) {
> > + /*
> > + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > + * type was changed. TDX can't handle zapping the private
> > + * mapping, but it's ok because KVM doesn't support either of
> > + * those features for TDX. In case a new caller appears, BUG
> > + * the VM if it's called for solutions with private aliases.
> > + */
> > + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>
> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
> generic name quite obviously doesn't prevent TDX details from bleeding into common
> code, and dancing around things just makes it all unnecessarily confusing.
>
> If we can't avoid bleeding TDX details into common code, my vote is to bite the
> bullet and simply check vm_type.
TDX has several aspects related to the TDP MMU.
1) Based on the faulting GPA, determine which KVM page table to walk.
(private-vs-shared)
2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
load/store. TDP MMU needs hooks for it.
3) The tables must be zapped from the leaf, not the root or the middle.
For 1) and 2), what about something like this? TDX backend code will set
kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
for address conversion (shared<->private).
For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
(or whatever preferable name)?
For 3), flag of memslot handles it.
---
arch/x86/include/asm/kvm_host.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..218b575d24bd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1289,6 +1289,7 @@ struct kvm_arch {
u8 vm_type;
bool has_private_mem;
bool has_protected_state;
+ bool has_mirrored_pt;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
@@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
#ifdef CONFIG_KVM_PRIVATE_MEM
#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
+#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
#else
#define kvm_arch_has_private_mem(kvm) false
+#define kvm_arch_has_mirrored_pt(kvm) false
#endif
static inline u16 kvm_read_ldt(void)
--
2.43.2
--
Isaku Yamahata <[email protected]>
On Tue, May 14, 2024 at 05:59:46PM -0700,
Rick Edgecombe <[email protected]> wrote:
..snip...
> @@ -619,6 +776,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> */
> __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
>
> +
> + role = sptep_to_sp(iter->sptep)->role;
> /*
> * Process the zapped SPTE after flushing TLBs, and after replacing
> * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
> @@ -626,7 +785,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * SPTEs.
> */
> handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - 0, iter->level, true);
> + SHADOW_NONPRESENT_VALUE, role, true);
>
> return 0;
> }
This SHADOW_NONPRESENT_VALUE change should go into another patch at [1].
I replied to [1].
[1] https://lore.kernel.org/kvm/[email protected]/
--
Isaku Yamahata <[email protected]>
On Wed, 2024-05-15 at 10:35 -0700, Isaku Yamahata wrote:
>
> ...snip...
>
> > @@ -619,6 +776,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> > */
> > __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
> >
> > +
> > + role = sptep_to_sp(iter->sptep)->role;
> > /*
> > * Process the zapped SPTE after flushing TLBs, and after replacing
> > * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
> > @@ -626,7 +785,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> > * SPTEs.
> > */
> > handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > - 0, iter->level, true);
> > + SHADOW_NONPRESENT_VALUE, role, true);
> >
> > return 0;
> > }
>
> This SHADOW_NONPRESENT_VALUE change should go into another patch at [1].
> I replied to [1].
Thanks. This call site got added in an upstream patch recently, so you didn't
miss it.
On Tue, May 14, 2024 at 05:59:38PM -0700,
Rick Edgecombe <[email protected]> wrote:
> From: Yan Zhao <[email protected]>
>
> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf
> SPTEs when deleting a memslot.
>
> Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
> Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
> AMD) and generate fresh new TDPs based on the new memslot layout. This is
> because zapping and re-generating TDPs is low overhead for most use cases,
> and more importantly, it's due to a bug [1] which caused VM instability
> when a VM has an Nvidia GeForce GPU assigned.
>
> There's a previous attempt [2] to introduce a per-VM flag to workaround bug
> [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs.
> However, [2] was not merged due to the lack of a clear explanation of
> exactly what is broken [3] and it's not wise to "have a bug that is known
> to happen when you enable the capability".
>
> However, for some specific scenarios, e.g. TDX, invalidating and
> re-generating a new page table is not viable for the following reasons:
> - TDX requires that the root page of the private page table remain unaltered
>   throughout the TD life cycle.
> - TDX mandates that leaf entries in the private page table be zapped prior
>   to non-leaf entries.
>
> So, Sean re-considered introducing a per-VM flag or per-memslot flag
> again for VMs like TDX. [4]
>
> This patch is an implementation of per-memslot flag.
> Compared to per-VM flag approach,
> Pros:
> (1) By allowing userspace to control the zapping behavior in fine-grained
> granularity, optimizations for specific use cases can be developed
> without future kernel changes.
> (2) Allows developing new zapping behaviors without risking regressions by
> changing KVM behavior, as seen previously.
>
> Cons:
> (1) Users need to ensure all necessary memslots have the flag
>     KVM_MEM_ZAP_LEAFS_ONLY set. E.g. QEMU needs to ensure all GUEST_MEMFD
>     memslots have the ZAP_LEAFS_ONLY flag for TDX VMs.
> (2) Opens up the possibility that userspace could configure memslots for
> normal VM in such a way that the bug [1] is seen.
>
> However, one thing worth noting for TDX is that TDX may potentially hit bug
> [1] with either the per-memslot flag or the per-VM flag approach, since
> there's a use case on the radar to assign an untrusted & passthrough GPU
> device in TDX. If that happens, it can be treated as a bug (not a regression)
> and fixed accordingly.
>
> An alternative approach we can also consider is to always invalidate &
> rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
> and private page tables on memslot deletion. This approach could exempt TDX
> from bug [1] when "untrusted & passthrough" devices are involved. But
> downside is that this approach requires creating new very specific KVM
> zapping ABI that could limit future changes in the same way that the bug
> did for normal VMs.
>
> Link: https://patchwork.kernel.org/project/kvm/patch/[email protected] [1]
> Link: https://lore.kernel.org/kvm/[email protected]/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
> Link: https://lore.kernel.org/kvm/[email protected]/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
> Link: https://lore.kernel.org/kvm/[email protected] [4]
> Signed-off-by: Yan Zhao <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> TDX MMU Part 1:
> - New patch
> ---
> arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 17 +++++++++++++++++
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 5 ++++-
> 4 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61982da8c8b2..4a8e819794db 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> kvm_mmu_zap_all(kvm);
> }
>
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> + return;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = true,
> + };
nit: move this up to the beginning of the function.
Didn't the compiler complain?
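i.e. something like this (untested, just moving the declaration up):

	static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm,
					      struct kvm_memory_slot *slot)
	{
		struct kvm_gfn_range range = {
			.slot = slot,
			.start = slot->base_gfn,
			.end = slot->base_gfn + slot->npages,
			.may_block = true,
		};

		if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
			return;

		write_lock(&kvm->mmu_lock);
		if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
			kvm_flush_remote_tlbs(kvm);
		write_unlock(&kvm->mmu_lock);
	}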
> +
> + if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
> + kvm_mmu_zap_memslot_leafs(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..4b3ec2ec79e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> return -EINVAL;
>
> + /*
> +	 * Since TDX private pages require re-accepting after zap,
> +	 * and the TDX private root page should not be zapped, TDX
> +	 * requires that memslots for private memory have the flag
> +	 * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
> +	 * the memslot being deleted will be zapped and SPTEs in other
> +	 * memslots are not affected.
> + */
> + if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
> + (new->flags & KVM_MEM_GUEST_MEMFD) &&
> + !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
> + return -EINVAL;
> +
> + /* zap-leafs-only works only when TDP MMU is enabled for now */
> + if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
> + return -EINVAL;
> +
> return kvm_alloc_memslot_metadata(kvm, new);
> }
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index aee67912e71c..d53648c19b26 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> #define KVM_MEM_GUEST_MEMFD (1UL << 2)
> +#define KVM_MEM_ZAP_LEAFS_ONLY (1UL << 3)
If we make this uAPI, please update Documentation/virt/kvm/api.rst too.
>
> /* for KVM_IRQ_LINE */
> struct kvm_irq_level {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 81b90bf03f2f..1b1ffb6fc786 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1568,6 +1568,8 @@ static int check_memory_region_flags(struct kvm *kvm,
> if (kvm_arch_has_private_mem(kvm))
> valid_flags |= KVM_MEM_GUEST_MEMFD;
>
> + valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY;
> +
This is arch-common code. We need a guard for other arches (non-x86).
Also feature enumeration: can KVM_CAP_USER_MEMORY2 be used?
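E.g. something like this (the config symbol is made up, just to show the idea):

	/* Only archs that implement the leaf-only zap accept the flag. */
	#ifdef CONFIG_KVM_GENERIC_ZAP_LEAFS_ONLY
		valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY;
	#endif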
--
Isaku Yamahata <[email protected]>
On Wed, 2024-05-15 at 11:09 -0700, Sean Christopherson wrote:
> On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > > > Or most specifically, we only need this zapping if we *try* to have
> > > > consistent cache attributes between private and shared. In the
> > > > non-coherent DMA case we can't have them be consistent because TDX
> > > > doesn't support changing the private memory in this way.
> > >
> > > Huh? That makes no sense. A physical page can't be simultaneously mapped
> > > SHARED and PRIVATE, so there can't be meaningful cache attribute aliasing
> > > between private and shared EPT entries.
> > >
> > > Trying to provide consistency for the GPA is like worrying about having
> > > matching PAT entires for the virtual address in two different processes.
> >
> > No, not matching between the private and shared mappings of the same page.
> > The whole private memory will be WB, and the whole shared half will honor PAT.
>
> I'm still failing to see why that's at all interesting. The non-coherent DMA
> trainwreck is all about KVM worrying about the guest and host having different
> memtypes for the same physical page.
Ok. The split seemed a little odd and special. I'm not sure it's the most
interesting thing in the world, but there was some debate internally about it.
>
> If the host is accessing TDX private memory, we have far, far bigger problems
> than aliased memtypes.
This wasn't the concern.
> And so the fact that TDX private memory is forced WB is
> interesting only to the guest, not KVM.
It's just another little quirk in an already complicated solution. The third
thing we discussed was somehow rejecting or not supporting non-coherent DMA.
This seemed simpler than that.
On Wed, May 15, 2024, Kai Huang wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > So, Sean re-considered introducing a per-VM flag or per-memslot flag
> > again for VMs like TDX. [4]
Not really. I tried to be as clear as I could that my preference was that _if_
we're exposing _something_ to userspace, then that something should probably be
a memslot flag.
: There's no concrete motivation, it's more that _if_ we're going to expose a knob
: to userspace, then I'd prefer to make it as precise as possible to minimize the
: chances of KVM ending up back in ABI hell again.
As I stressed in that thread, I hadn't thought about this deeply enough to have
an opinion one way or the other.
: You're _really_ reading too much into my suggestion. As above, my suggestion
: was very spur of the moment. I haven't put much thought into the tradeoffs and
: side effects.
> > This patch is an implementation of per-memslot flag.
> > Compared to per-VM flag approach,
> > Pros:
> > (1) By allowing userspace to control the zapping behavior in fine-grained
> > granularity, optimizations for specific use cases can be developed
> > without future kernel changes.
> > (2) Allows developing new zapping behaviors without risking regressions by
> > changing KVM behavior, as seen previously.
> >
> > Cons:
> > (1) Users need to ensure all necessary memslots have the flag
> >     KVM_MEM_ZAP_LEAFS_ONLY set. E.g. QEMU needs to ensure all GUEST_MEMFD
> >     memslots have the ZAP_LEAFS_ONLY flag for TDX VMs.
> > (2) Opens up the possibility that userspace could configure memslots for
> > normal VM in such a way that the bug [1] is seen.
>
> I don't quite follow the logic why userspace should be involved.
>
> TDX cannot use "page table fast zap", and needs to use a different way to
> zap, a.k.a. zap-leaf-only while holding the MMU write lock, but this doesn't
> necessarily mean such a thing should be exposed to userspace?
>
> It's weird that userspace needs to control how KVM zaps page tables for
> memslot delete/move.
Yeah, this isn't quite what I had in mind. Granted, what I had in mind may not
be much better, but I definitely don't want to let userspace dictate exactly
how KVM manages SPTEs.
My thinking for a memslot flag was more of a "deleting this memslot doesn't have
side effects", i.e. a way for userspace to give KVM the green light to deviate
from KVM's historical behavior of rebuilding the entire page tables. Under the
hood, KVM would be allowed to do whatever it wants, e.g. for the initial
implementation, KVM would zap only leafs. But critically, KVM wouldn't be
_required_ to zap only leafs.
> So to me it looks like overkill to expose this "zap-leaf-only" to userspace.
> We can just set this flag for a TDX guest when memslot is created in KVM.
100% agreed from a functionality perspective. My thoughts/concerns are more about
KVM's ABI.
Hmm, actually, we already have new uAPI/ABI in the form of VM types. What if
we squeeze a documentation update into 6.10 (which adds the SEV VM flavors) to
state that KVM's historical behavior of blasting all SPTEs is only _guaranteed_
for KVM_X86_DEFAULT_VM?
Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots during
SEV-* boot? If so, and assuming any such memslots are smallish, we could even
start enforcing the new ABI by doing a precise zap for small (arbitrary limit TBD)
shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
On Wed, 2024-05-15 at 12:09 -0700, Sean Christopherson wrote:
> > It's weird that userspace needs to control how KVM zaps page tables for
> > memslot delete/move.
>
> Yeah, this isn't quite what I had in mind. Granted, what I had in mind may not
> be much better, but I definitely don't want to let userspace dictate exactly
> how KVM manages SPTEs.
To me it doesn't seem completely unprecedented at least. Linux has a ton of
madvise() flags and other knobs to control this kind of PTE management for
userspace memory.
>
> My thinking for a memslot flag was more of a "deleting this memslot doesn't have
> side effects", i.e. a way for userspace to give KVM the green light to deviate
> from KVM's historical behavior of rebuilding the entire page tables. Under the
> hood, KVM would be allowed to do whatever it wants, e.g. for the initial
> implementation, KVM would zap only leafs. But critically, KVM wouldn't be
> _required_ to zap only leafs.
>
> > So to me it looks like overkill to expose this "zap-leaf-only" to userspace.
> > We can just set this flag for a TDX guest when memslot is created in KVM.
>
> 100% agreed from a functionality perspective. My thoughts/concerns are more
> about KVM's ABI.
>
> Hmm, actually, we already have new uAPI/ABI in the form of VM types. What if
> we squeeze a documentation update into 6.10 (which adds the SEV VM flavors) to
> state that KVM's historical behavior of blasting all SPTEs is only
> _guaranteed_ for KVM_X86_DEFAULT_VM?
>
> Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots during
> SEV-* boot? If so, and assuming any such memslots are smallish, we could even
> start enforcing the new ABI by doing a precise zap for small (arbitrary limit
> TBD) shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
Again thinking of the userspace memory analogy... Aren't there some VMs where
the fast zap is faster? Like if you have a guest with a small memslot that gets
deleted all the time, you could want it to be zapped specifically. But for the
giant memslot next to it, you might want to do the fast-zap-all thing.
So rather than try to optimize zapping more someday and hit similar issues, let
userspace decide how it wants it to be done. I'm not sure of the actual
performance tradeoffs here, to be clear.
That said, a per-VM knob is easier for TDX purposes.
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 11:09 -0700, Sean Christopherson wrote:
> > On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > > On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > > > > Or most specifically, we only need this zapping if we *try* to have
> > > > > consistent cache attributes between private and shared. In the
> > > > > non-coherent DMA case we can't have them be consistent because TDX
> > > > > doesn't support changing the private memory in this way.
> > > >
> > > > Huh? That makes no sense. A physical page can't be simultaneously mapped
> > > > SHARED and PRIVATE, so there can't be meaningful cache attribute aliasing
> > > > between private and shared EPT entries.
> > > >
> > > > Trying to provide consistency for the GPA is like worrying about having
> > > > matching PAT entries for the virtual address in two different processes.
> > >
> > > No, not matching between the private and shared mappings of the same page.
> > > The whole private memory will be WB, and the whole shared half will honor PAT.
> >
> > I'm still failing to see why that's at all interesting. The non-coherent DMA
> > trainwreck is all about KVM worrying about the guest and host having different
> > memtypes for the same physical page.
>
> Ok. The split seemed a little odd and special. I'm not sure it's the most
> interesting thing in the world, but there was some debate internally about it.
>
> >
> > If the host is accessing TDX private memory, we have far, far bigger problems
> > than aliased memtypes.
>
> This wasn't the concern.
>
> > And so the fact that TDX private memory is forced WB is
> > interesting only to the guest, not KVM.
>
> It's just another little quirk in an already complicated solution. The third
> thing we discussed was somehow rejecting or not supporting non-coherent DMA.
> This seemed simpler than that.
Again, huh? This has _nothing_ to do with non-coherent DMA. Devices can't DMA
into TDX private memory.
On Wed, 2024-05-15 at 12:48 -0700, Sean Christopherson wrote:
> > It's just another little quirk in an already complicated solution. The third
> > thing we discussed was somehow rejecting or not supporting non-coherent DMA.
> > This seemed simpler than that.
>
> Again, huh? This has _nothing_ to do with non-coherent DMA. Devices can't DMA
> into TDX private memory.
Hmm... I'm confused how you are confused... :)
For normal VMs (after that change you linked), guests will honor guest PAT on
newer HW. On older HW it will only honor guest PAT if non-coherent DMA is
attached.
For TDX we can't honor guest PAT for private memory. So we can either have:
1. Have shared honor PAT and private not.
2. Have private and shared both not honor PAT and be consistent, unless
non-coherent DMA is attached. In that case KVM could zap shared only and switch
to 1.
The only benefit of 2 is that in normal conditions the guest will have
consistent cache behavior between private and shared.
FWIW, there was at one time a use for private uncacheable memory proposed. It
was for keeping non-performance sensitive secret data protected from speculative
access. (not for TDX, a general kernel thing). This isn't a real thing today,
but it's an example of how the private/shared split is quirky, when you ask "do
TDs support PAT?".
1 is a little quirky, but 2 is too complex and also quirky. 1 is the best
option.
If it's obvious we can trim down the log. There was a bit of hand wringing on
this one, so it seemed relevant to the discussion. The other point was to
describe why we don't need to support kvm_zap_gfn_range(). I think that point
is worth review. The KVM_BUG_ON() is not super critical, so we could even drop
the patch if it's all settled.
On Wed, 2024-05-15 at 13:05 -0700, Sean Christopherson wrote:
> On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > On Wed, 2024-05-15 at 12:09 -0700, Sean Christopherson wrote:
> > > > It's weird that userspace needs to control how KVM zaps page tables
> > > > for memslot delete/move.
> > >
> > > Yeah, this isn't quite what I had in mind. Granted, what I had in mind
> > > may not be much better, but I definitely don't want to let userspace
> > > dictate exactly how KVM manages SPTEs.
> >
> > To me it doesn't seem completely unprecedented at least. Linux has a ton of
> > madvise() flags and other knobs to control this kind of PTE management for
> > userspace memory.
>
> Yes, but they all express their requests in terms of what behavior userspace
> wants or to communicate userspace's access patterns. They don't dictate exact
> low level behavior to the kernel.
>
There are a few for madvise that are like "don't do this". Of course, some of
the implementations take direct action anyway and then become ABI. Otherwise
there is mlock(). There are so many mm features. It might actually be more of a
cautionary tale.
[snip]
> > So rather than try to optimize zapping more someday and hit similar issues,
> > let userspace decide how it wants it to be done. I'm not sure of the actual
> > performance tradeoffs here, to be clear.
>
> ...unless someone is able to root cause the VFIO regression, we don't have the
> luxury of letting userspace give KVM a hint as to whether it might be better to
> do a precise zap versus a nuke-and-pave.
Pedantry... I think it's not a regression if something requires a new flag. It
is still a bug though.
The thing I worry about on the bug is whether it might have been due to a guest
having access to a page it shouldn't have. In which case we can't give the user
the opportunity to create it.
I didn't gather there was any proof of this. Did you have any hunch either way?
>
> And more importantly, it would be a _hint_, not the hard requirement that TDX
> needs.
>
> > That said, a per-VM knob is easier for TDX purposes.
If we don't want it to be a mandate from userspace, then we need to do some
per-VM checking in TDX's case anyway. In which case we might as well go with
the per-VM option for TDX.
You had said up the thread, why not opt all non-normal VMs into the new
behavior. It will work great for TDX. But why do SEV and others want this
automatically?
On 16/05/2024 4:22 am, Isaku Yamahata wrote:
> On Wed, May 15, 2024 at 08:34:37AM -0700,
> Sean Christopherson <[email protected]> wrote:
>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index d5cf5b15a10e..808805b3478d 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>>>
>>> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>>>
>>> - if (tdp_mmu_enabled)
>>> + if (tdp_mmu_enabled) {
>>> + /*
>>> + * kvm_zap_gfn_range() is used when MTRR or PAT memory
>>> + * type was changed. TDX can't handle zapping the private
>>> + * mapping, but it's ok because KVM doesn't support either of
>>> + * those features for TDX. In case a new caller appears, BUG
>>> + * the VM if it's called for solutions with private aliases.
>>> + */
>>> + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>>
>> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
>> generic name quite obviously doesn't prevent TDX details from bleeding into common
>> code, and dancing around things just makes it all unnecessarily confusing.
>>
>> If we can't avoid bleeding TDX details into common code, my vote is to bite the
>> bullet and simply check vm_type.
>
> TDX has several aspects related to the TDP MMU.
> 1) Based on the faulting GPA, determine which KVM page table to walk.
> (private-vs-shared)
> 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
> load/store. TDP MMU needs hooks for it.
> 3) The tables must be zapped from the leaf, not the root or the middle.
>
> For 1) and 2), what about something like this? TDX backend code will set
> kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
> for address conversion (shared<->private).
>
> For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> (or whatever preferable name)?
>
> For 3), flag of memslot handles it.
>
> ---
> arch/x86/include/asm/kvm_host.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aabf1648a56a..218b575d24bd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1289,6 +1289,7 @@ struct kvm_arch {
> u8 vm_type;
> bool has_private_mem;
> bool has_protected_state;
> + bool has_mirrored_pt;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> struct list_head active_mmu_pages;
> struct list_head zapped_obsolete_pages;
> @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>
> #ifdef CONFIG_KVM_PRIVATE_MEM
> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> #else
> #define kvm_arch_has_private_mem(kvm) false
> +#define kvm_arch_has_mirrored_pt(kvm) false
> #endif
>
> static inline u16 kvm_read_ldt(void)
I think this 'has_mirrored_pt' (or a better name) is better, because it
clearly conveys it is for the "page table", not the actual page that
any page table entry maps to.
AFAICT we need to split the concept of "private page table itself" and
the "memory type of the actual GFN".
E.g., both SEV-SNP and TDX has concept of "private memory" (obviously),
but I was told only TDX uses a dedicated private page table which isn't
directly accessible for KVM. SEV-SNP on the other hand just uses a normal
page table + an additional HW-managed table to ensure security.
In other words, I think we should decide whether to invoke the TDP MMU
callback for a private mapping (the page table itself may just be a normal
one) depending on fault->is_private, not on whether the page table
is private:
	if (fault->is_private && kvm_x86_ops->set_private_spte)
		kvm_x86_set_private_spte(...);
	else
		tdp_mmu_set_spte_atomic(...);
And the 'has_mirrored_pt' should be only used to select the root of the
page table that we want to operate on.
This also leaves room so that if anything special needs to be
done for pages allocated for the "non-leaf" middle page tables for
SEV-SNP, it can just fit.
Of course, if 'has_mirrored_pt' is true, we can assume there's a special
way to operate it, i.e., kvm_x86_ops->set_private_spte etc must be valid.
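For the root selection part, a rough sketch of what I mean (the
private_root_hpa name is just illustrative):

	/* 'has_mirrored_pt' only selects which root to operate on. */
	static inline hpa_t kvm_mmu_root_for_fault(struct kvm_vcpu *vcpu,
						   bool is_private)
	{
		if (is_private && kvm_arch_has_mirrored_pt(vcpu->kvm))
			return vcpu->arch.mmu->private_root_hpa;

		return vcpu->arch.mmu->root.hpa;
	}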
On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> From: Isaku Yamahata <[email protected]>
>
> Introduce a "gfn_shared_mask" field in the kvm_arch structure to record GPA
> shared bit and provide address conversion helpers for TDX shared bit of
> GPA.
>
> TDX designates a specific GPA bit as the shared bit, which can be either
> bit 51 or bit 47 based on configuration.
>
> This GPA shared bit indicates whether the corresponding physical page is
> shared (if shared bit set) or private (if shared bit cleared).
>
> - GPAs with shared bit set will be mapped by VMM into conventional EPT,
> which is pointed by shared EPTP in TDVMCS, resides in host VMM memory
> and is managed by VMM.
> - GPAs with shared bit cleared will be mapped by VMM firstly into a
> mirrored EPT, which resides in host VMM memory. Changes of the mirrored
> EPT are then propagated into a private EPT, which resides outside of host
> VMM memory and is managed by TDX module.
>
> Add the "gfn_shared_mask" field to the kvm_arch structure for each VM with
> a default value of 0. It will be set to the position of the GPA shared bit
> in GFN through TD specific initialization code.
>
> Provide helpers to utilize the gfn_shared_mask to determine whether a GPA
> is shared or private, retrieve the GPA shared bit value, and insert/strip
> shared bit to/from a GPA.
I am seriously thinking whether we should just abandon this whole
kvm_gfn_shared_mask() thing.
We already have enough mechanisms around private memory and the mapping
of it:
1) Xarray to query whether a given GFN is private or shared;
2) fault->is_private to indicate whether a faulting address is private
or shared;
3) sp->is_private to indicate whether a "page table" is only for private
mapping;
Consider this as 4) -- I also like to have a kvm->arch.has_mirrored_pt
(or a better name) as I replied here:
https://lore.kernel.org/kvm/[email protected]/T/#m49b37658f03e786c6aa43719cbf748215170980d
So I believe we really already have enough mechanisms in the *COMMON*
code for private page/mapping support. I tend to believe the whole
GPA shared bit thing can be hidden in TDX-specific operations. If
there's really a need to apply/strip the GPA shared bit in the common code,
we can do it via a kvm_x86_ops callback (I'll review other patches to see).
And btw, I think ...
[...]
> +
> +/*
> + * default or SEV-SNP TDX: where S = (47 or 51) - 12
> + * gfn_shared_mask 0 S bit
> + * is_private_gpa() always false true if GPA has S bit clear
.. this @is_private_gpa(), and ...
> + * gfn_to_shared() nop set S bit
> + * gfn_to_private() nop clear S bit
> + *
> + * fault.is_private means that host page should be gotten from guest_memfd
> + * is_private_gpa() means that KVM MMU should invoke private MMU hooks.
> + */
.. this invoking of MMU hooks based on @is_private_gpa() makes no sense,
because clearly for SEV-SNP @is_private_gpa() doesn't report the fact
that the GPA is indeed private, and the MMU hooks should be invoked
based on whether the faulting GPA is private or not, not on this
@is_private_gpa().
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 13:05 -0700, Sean Christopherson wrote:
> > On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > > So rather than try to optimize zapping more someday and hit similar
> > > issues, let userspace decide how it wants it to be done. I'm not sure of
> > > the actual performance tradeoffs here, to be clear.
> >
> > ...unless someone is able to root cause the VFIO regression, we don't have
> > the luxury of letting userspace give KVM a hint as to whether it might be
> > better to do a precise zap versus a nuke-and-pave.
>
> Pedantry... I think it's not a regression if something requires a new flag. It
> is still a bug though.
Heh, pedantry denied. I was speaking in the past tense about the VFIO failure,
which was a regression as I changed KVM behavior without adding a flag.
> The thing I worry about on the bug is whether it might have been due to a guest
> having access to a page it shouldn't have. In which case we can't give the user
> the opportunity to create it.
>
> I didn't gather there was any proof of this. Did you have any hunch either way?
I doubt the guest was able to access memory it shouldn't have been able to access.
But that's a moot point, as the bigger problem is that, because we have no idea
what's at fault, KVM can't make any guarantees about the safety of such a flag.
TDX is a special case where we don't have a better option (we do have other options,
they're just horrible). In other words, the choice is essentially to either:
(a) cross our fingers and hope that the problem is limited to shared memory
with QEMU+VFIO, i.e. doesn't affect TDX private memory.
or
(b) don't merge TDX until the original regression is fully resolved.
FWIW, I would love to root cause and fix the failure, but I don't know how feasible
that is at this point.
> > And more importantly, it would be a _hint_, not the hard requirement that TDX
> > needs.
> >
> > > That said, a per-VM knob is easier for TDX purposes.
>
> If we don't want it to be a mandate from userspace, then we need to do some
> per-VM checking in TDX's case anyway. In which case we might as well go with
> the per-VM option for TDX.
>
> You had said up the thread, why not opt all non-normal VMs into the new
> behavior. It will work great for TDX. But why do SEV and others want this
> automatically?
Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
break away from KVM's godawful ABI. It might be a pipe dream, as keying off the
VM type obviously has similar risks to giving userspace a memslot flag. The one
sliver of hope is that the VM types really are quite new (though less so for SEV
and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
>>
>> You had said up the thread, why not opt all non-normal VMs into the new
>> behavior. It will work great for TDX. But why do SEV and others want this
>> automatically?
>
> Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
> break away from KVM's godawful ABI. It might be a pipe dream, as keying off the
> VM type obviously has similar risks to giving userspace a memslot flag. The one
> sliver of hope is that the VM types really are quite new (though less so for SEV
> and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
Btw, does the "zap-leaf-only" approach always have better performance,
assuming we have to hold MMU write lock for that?
Consider a huge memslot being deleted/moved.
If we can always get better performance for "zap-leaf-only", then
instead of letting userspace opt in to this feature, we perhaps can do
the opposite:
We always do the "zap-leaf-only" in KVM, but add a quirk for the VMs
that userspace knows can have such a bug and apply this quirk.
But again, I think it's just too overkill for TDX. We can just set the
ZAP_LEAF_ONLY flag for the slot when it is created in KVM.
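E.g. a sketch, with kvm_arch_prepare_memory_region() as just one possible
place to do it:

	/* Force leaf-only zapping for TDX private memslots inside KVM. */
	if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
	    (new->flags & KVM_MEM_GUEST_MEMFD))
		new->flags |= KVM_MEM_ZAP_LEAFS_ONLY;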
On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
> > TDX has several aspects related to the TDP MMU.
> > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > (private-vs-shared)
> > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
> > memory load/store. TDP MMU needs hooks for it.
> > 3) The tables must be zapped from the leaf, not the root or the middle.
> >
> > For 1) and 2), what about something like this? TDX backend code will set
> > kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask()
> > only for address conversion (shared<->private).
1 and 2 are not the same as "mirrored" though. You could have a design that
mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
might be just a KVM design choice, even for TDX.
What we are really trying to do here is not put "is tdx" logic in the generic
code. We could rely on the fact that TDX is the only one with mirrored TDP, but
that is kind of what we are already doing with kvm_gfn_shared_mask().
How about we do helpers for each of your bullets, and they all just check:
vm_type == KVM_X86_TDX_VM
So like:
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a578ea09dfb3..c0beed5b090a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -355,4 +355,19 @@ static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
return mask && !(gpa_to_gfn(gpa) & mask);
}
+static inline bool kvm_has_mirrored_tdp(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_has_private_root(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_zap_leafs_only(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
#endif
This is similar to what Sean proposed earlier, we just didn't get that far:
https://lore.kernel.org/kvm/[email protected]/
> >
> > For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> > (or whatever preferable name)?
> >
> > For 3), flag of memslot handles it.
> >
> > ---
> > arch/x86/include/asm/kvm_host.h | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h
> > index aabf1648a56a..218b575d24bd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1289,6 +1289,7 @@ struct kvm_arch {
> > u8 vm_type;
> > bool has_private_mem;
> > bool has_protected_state;
> > + bool has_mirrored_pt;
> > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > struct list_head active_mmu_pages;
> > struct list_head zapped_obsolete_pages;
> > @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int
> > tdp_forced_root_level,
> >
> > #ifdef CONFIG_KVM_PRIVATE_MEM
> > #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> > +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> > #else
> > #define kvm_arch_has_private_mem(kvm) false
> > +#define kvm_arch_has_mirrored_pt(kvm) false
> > #endif
> >
> > static inline u16 kvm_read_ldt(void)
>
> I think this 'has_mirrored_pt' (or a better name) is better, because it
> clearly conveys it is for the "page table", not the actual page that
> any page table entry maps to.
>
> AFAICT we need to split the concept of "private page table itself" and
> the "memory type of the actual GFN".
>
> E.g., both SEV-SNP and TDX has concept of "private memory" (obviously),
> but I was told only TDX uses a dedicated private page table which isn't
> directly accessible for KVM. SEV-SNP on the other hand just uses a normal
> page table + an additional HW-managed table to ensure security.
>
> In other words, I think we should decide whether to invoke the TDP MMU
> callback for a private mapping (the page table itself may just be a normal
> one) depending on fault->is_private, not on whether the page table
> is private:
>
> 	if (fault->is_private && kvm_x86_ops->set_private_spte)
> 		kvm_x86_set_private_spte(...);
> 	else
> 		tdp_mmu_set_spte_atomic(...);
>
> And the 'has_mirrored_pt' should be only used to select the root of the
> page table that we want to operate on.
>
> This also leaves room so that if anything special needs to be
> done for pages allocated for the "non-leaf" middle page tables for
> SEV-SNP, it can just fit.
>
> Of course, if 'has_mirrored_pt' is true, we can assume there's a special
> way to operate it, i.e., kvm_x86_ops->set_private_spte etc must be valid.
It's a good point that we are mixing up "private" in the code from SNP's
perspective.
On 16/05/2024 3:22 am, Edgecombe, Rick P wrote:
> On Wed, 2024-05-15 at 13:27 +0000, Huang, Kai wrote:
>>
>> kvm_zap_gfn_range() looks like a generic function. I think it makes more
>> sense to let the callers explicitly check whether the VM is a TDX guest and
>> do the KVM_BUG_ON()?
>
> Other TDX changes will prevent this function getting called. So basically like
> you are suggesting. This change is to catch any new cases that pop up, which we
> can't do at the caller.
But I think we need to see whether calling kvm_zap_gfn_range() is legal
or not for a TDX guest case by case, rather than having a universal rule
that this cannot be called for a TDX guest, right?
On Thu, May 16, 2024, Kai Huang wrote:
> > > You had said up the thread, why not opt all non-normal VMs into the new
> > > behavior. It will work great for TDX. But why do SEV and others want this
> > > automatically?
> >
> > Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
> > break away from KVM's godawful ABI. It might be a pipe dream, as keying off the
> > VM type obviously has similar risks to giving userspace a memslot flag. The one
> > sliver of hope is that the VM types really are quite new (though less so for SEV
> > and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
>
> Btw, does the "zap-leaf-only" approach always have better performance,
> assuming we have to hold MMU write lock for that?
I highly doubt it, especially given how much the TDP MMU can now do with mmu_lock
held for read.
> Consider a huge memslot being deleted/moved.
>
> If we can always get better performance for "zap-leaf-only", then instead
> of letting userspace opt in to this feature, we perhaps can do the opposite:
>
> We always do the "zap-leaf-only" in KVM, but add a quirk for the VMs that
> userspace knows can have such a bug and apply this quirk.
Hmm, a quirk isn't a bad idea. It suffers the same problems as a memslot flag,
i.e. who knows when it's safe to disable the quirk, but I would hope userspace
would be much, much cautious about disabling a quirk that comes with a massive
disclaimer.
Though I suspect Paolo will shoot this down too ;-)
> But again, I think it's just too overkill for TDX. We can just set the
> ZAP_LEAF_ONLY flag for the slot when it is created in KVM.
Ya, I'm convinced that adding uAPI is overkill at this point.
On Thu, 2024-05-16 at 10:34 +1200, Huang, Kai wrote:
>
>
> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Introduce a "gfn_shared_mask" field in the kvm_arch structure to record GPA
> > shared bit and provide address conversion helpers for TDX shared bit of
> > GPA.
> >
> > TDX designates a specific GPA bit as the shared bit, which can be either
> > bit 51 or bit 47 based on configuration.
> >
> > This GPA shared bit indicates whether the corresponding physical page is
> > shared (if shared bit set) or private (if shared bit cleared).
> >
> > - GPAs with shared bit set will be mapped by VMM into conventional EPT,
> > which is pointed by shared EPTP in TDVMCS, resides in host VMM memory
> > and is managed by VMM.
> > - GPAs with shared bit cleared will be mapped by VMM firstly into a
> > mirrored EPT, which resides in host VMM memory. Changes of the mirrored
> > EPT are then propagated into a private EPT, which resides outside of
> > host VMM memory and is managed by TDX module.
> >
> > Add the "gfn_shared_mask" field to the kvm_arch structure for each VM with
> > a default value of 0. It will be set to the position of the GPA shared bit
> > in GFN through TD specific initialization code.
> >
> > Provide helpers to utilize the gfn_shared_mask to determine whether a GPA
> > is shared or private, retrieve the GPA shared bit value, and insert/strip
> > shared bit to/from a GPA.
>
> I am seriously thinking whether we should just abandon this whole
> kvm_gfn_shared_mask() thing.
>
> We already have enough mechanisms around private memory and the mapping
> of it:
>
> 1) Xarray to query whether a given GFN is private or shared;
> 2) fault->is_private to indicate whether a faulting address is private
> or shared;
> 3) sp->is_private to indicate whether a "page table" is only for private
> mapping;
You mean drop the helpers, or the struct kvm member? I think we still need the
shared bit position stored somewhere. memslots, Xarray, etc need to operate on
the GFN without the shared bit.
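For reference, the conversion helpers are roughly just:

	static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
	{
		return gfn | kvm_gfn_shared_mask(kvm);
	}

	static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
	{
		return gfn & ~kvm_gfn_shared_mask(kvm);
	}

so dropping the struct kvm member means the mask has to live somewhere else.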
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 12:48 -0700, Sean Christopherson wrote:
> > > It's just another little quirk in an already complicated solution. The third
> > > thing we discussed was somehow rejecting or not supporting non-coherent DMA.
> > > This seemed simpler than that.
> >
> > Again, huh? This has _nothing_ to do with non-coherent DMA. Devices can't DMA
> > into TDX private memory.
>
> Hmm... I'm confused how you are confused... :)
>
> For normal VMs (after that change you linked), guests will honor guest PAT on
> newer HW. On older HW it will only honor guest PAT if non-coherent DMA is
> attached.
>
> For TDX we can't honor guest PAT for private memory. So we can either have:
> 1. Have shared honor PAT and private not.
> 2. Have private and shared both not honor PAT and be consistent, unless
> non-coherent DMA is attached. In that case KVM could zap shared only and
> switch to 1.
Oh good gravy, hell no :-)
On 16/05/2024 11:21 am, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 10:34 +1200, Huang, Kai wrote:
>>
>>
>> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> Introduce a "gfn_shared_mask" field in the kvm_arch structure to record GPA
>>> shared bit and provide address conversion helpers for TDX shared bit of
>>> GPA.
>>>
>>> TDX designates a specific GPA bit as the shared bit, which can be either
>>> bit 51 or bit 47 based on configuration.
>>>
>>> This GPA shared bit indicates whether the corresponding physical page is
>>> shared (if shared bit set) or private (if shared bit cleared).
>>>
>>> - GPAs with shared bit set will be mapped by VMM into conventional EPT,
>>> which is pointed by shared EPTP in TDVMCS, resides in host VMM memory
>>> and is managed by VMM.
>>> - GPAs with shared bit cleared will be mapped by VMM firstly into a
>>> mirrored EPT, which resides in host VMM memory. Changes of the mirrored
>>> EPT are then propagated into a private EPT, which resides outside of
>>> host VMM memory and is managed by TDX module.
>>>
>>> Add the "gfn_shared_mask" field to the kvm_arch structure for each VM with
>>> a default value of 0. It will be set to the position of the GPA shared bit
>>> in GFN through TD specific initialization code.
>>>
>>> Provide helpers to utilize the gfn_shared_mask to determine whether a GPA
>>> is shared or private, retrieve the GPA shared bit value, and insert/strip
>>> shared bit to/from a GPA.
>>
>> I am seriously thinking whether we should just abandon this whole
>> kvm_gfn_shared_mask() thing.
>>
>> We already have enough mechanisms around private memory and the mapping
>> of it:
>>
>> 1) Xarray to query whether a given GFN is private or shared;
>> 2) fault->is_private to indicate whether a faulting address is private
>> or shared;
>> 3) sp->is_private to indicate whether a "page table" is only for private
>> mapping;
>
> You mean drop the helpers, or the struct kvm member? I think we still need the
> shared bit position stored somewhere. memslots, Xarray, etc need to operate on
> the GFN without the shared bit.
The struct member, and the whole thing. The shared bit is only included
in the faulting address, and we can strip that away upon
handle_ept_violation().
One thing I can think of is we still need to append the shared bit to
the actual GFN when we set up the shared page table mapping. For that I
am wondering whether we can do it in TDX-specific code.
Anyway, I don't think the 'gfn_shared_mask' is necessarily needed at this
stage.
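E.g. the stripping could look something like this in TDX code (the container
helper and field name are illustrative):

	/* TDX-only: strip the shared bit before the GFN reaches common code. */
	static gfn_t tdx_gpa_to_gfn(struct kvm_vcpu *vcpu, gpa_t gpa)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

		return gpa_to_gfn(gpa & ~kvm_tdx->shared_gpa_mask);
	}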
On 16/05/2024 11:20 am, Sean Christopherson wrote:
> On Thu, May 16, 2024, Kai Huang wrote:
>>>> You had said up the thread, why not opt all non-normal VMs into the new
>>>> behavior. It will work great for TDX. But why do SEV and others want this
>>>> automatically?
>>>
>>> Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
>>> break away from KVM's godawful ABI. It might be a pipe dream, as keying off the
>>> VM type obviously has similar risks to giving userspace a memslot flag. The one
>>> sliver of hope is that the VM types really are quite new (though less so for SEV
>>> and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
>>
>> Btw, does the "zap-leaf-only" approach always have better performance,
>> assuming we have to hold MMU write lock for that?
>
> I highly doubt it, especially given how much the TDP MMU can now do with mmu_lock
> held for read.
>
>> Consider a huge memslot being deleted/moved.
>>
>> If we can always get better performance for "zap-leaf-only", then instead
>> of letting userspace opt in to this feature, we perhaps can do the opposite:
>>
>> We always do the "zap-leaf-only" in KVM, but add a quirk for the VMs that
>> userspace knows can have such a bug and apply this quirk.
>
> Hmm, a quirk isn't a bad idea. It suffers the same problems as a memslot flag,
> i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> would be much, much cautious about disabling a quirk that comes with a massive
> disclaimer.
>
> Though I suspect Paolo will shoot this down too ;-)
The quirk only works based on the assumption that userspace _exactly_
knows what kinda VMs will have this bug.
But as mentioned above, the first step is we need to convince ourselves
that doing "zap-leaf-only" by default is the right thing to do.
:-)
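For concreteness, the quirk-based approach being discussed would look roughly
like the sketch below; KVM_X86_QUIRK_SLOT_ZAP_ALL and
kvm_mmu_zap_memslot_leafs() are hypothetical names for illustration, not
actual uAPI:

static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	/*
	 * Default to zapping only the leaf SPTEs that map the deleted slot;
	 * userspace that knows it depends on the old behavior keeps the
	 * (hypothetical) quirk enabled and gets a full zap.
	 */
	if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_SLOT_ZAP_ALL))
		kvm_mmu_zap_all_fast(kvm);
	else
		kvm_mmu_zap_memslot_leafs(kvm, slot);	/* hypothetical */
}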
On Thu, 2024-05-16 at 11:31 +1200, Huang, Kai wrote:
>
>
> On 16/05/2024 11:21 am, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 10:34 +1200, Huang, Kai wrote:
> > >
> > >
> > > On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > Introduce a "gfn_shared_mask" field in the kvm_arch structure to record
> > > > GPA shared bit and provide address conversion helpers for TDX shared
> > > > bit of GPA.
> > > >
> > > > TDX designates a specific GPA bit as the shared bit, which can be either
> > > > bit 51 or bit 47 based on configuration.
> > > >
> > > > This GPA shared bit indicates whether the corresponding physical page is
> > > > shared (if shared bit set) or private (if shared bit cleared).
> > > >
> > > > - GPAs with shared bit set will be mapped by VMM into conventional EPT,
> > > > which is pointed to by the shared EPTP in TDVMCS, resides in host VMM
> > > > memory and is managed by VMM.
> > > > - GPAs with shared bit cleared will be mapped by VMM first into a
> > > > mirrored EPT, which resides in host VMM memory. Changes of the
> > > > mirrored EPT are then propagated into a private EPT, which resides
> > > > outside of host VMM memory and is managed by the TDX module.
> > > >
> > > > Add the "gfn_shared_mask" field to the kvm_arch structure for each VM
> > > > with a default value of 0. It will be set to the position of the GPA
> > > > shared bit in GFN through TD specific initialization code.
> > > >
> > > > Provide helpers to utilize the gfn_shared_mask to determine whether a
> > > > GPA is shared or private, retrieve the GPA shared bit value, and
> > > > insert/strip shared bit to/from a GPA.
> > >
> > > I am seriously thinking whether we should just abandon this whole
> > > kvm_gfn_shared_mask() thing.
> > >
> > > We already have enough mechanisms around private memory and the mapping
> > > of it:
> > >
> > > 1) Xarray to query whether a given GFN is private or shared;
> > > 2) fault->is_private to indicate whether a faulting address is private
> > > or shared;
> > > 3) sp->is_private to indicate whether a "page table" is only for private
> > > mapping;
> >
> > You mean drop the helpers, or the struct kvm member? I think we still need
> > the shared bit position stored somewhere. memslots, Xarray, etc. need to
> > operate on the GFN without the shared bit.
>
> The struct member, and the whole thing. The shared bit is only included
> in the faulting address, and we can strip that away upon
> handle_ept_violation().
>
> One thing I can think of is we still need to append the shared bit to
> the actual GFN when we set up the shared page table mapping. For that I
> am thinking whether we can do it in TDX-specific code.
>
> Anyway, I don't think the 'gfn_shared_mask' is necessarily good at this
> stage.
Sorry, still not clear. We need to strip the bit away, so we need to know what
bit it is. The proposal is to not remember it on struct kvm, so where do we get
it?
Actually, we used to allow it to be selected (via GPAW), but now we could
determine it based on EPT level and MAXPA. So we could possibly recalculate it
in some helper...
But it seems you are suggesting to do away with the concept of knowing what the
shared bit is.
On 16/05/2024 11:14 am, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
>>> TDX has several aspects related to the TDP MMU.
>>> 1) Based on the faulting GPA, determine which KVM page table to walk.
>>> (private-vs-shared)
>>> 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
>>> memory load/store. TDP MMU needs hooks for it.
>>> 3) The tables must be zapped from the leaf. not the root or the middle.
>>>
>>> For 1) and 2), what about something like this? TDX backend code will set
>>> kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask()
>>> only for address conversion (shared<->private).
>
> 1 and 2 are not the same as "mirrored" though. You could have a design that
> mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
> might be just a KVM design choice, even for TDX.
I am not sure whether I understand this correctly. If they are not
tracked with separate roots, it means they use the same page table (root).
So IIUC what you said is to support "mirror PT" at any sub-tree of the
page table?
That will only complicate things. I don't think we should consider
this. In reality, we only have TDX and SEV-SNP. We should have a
simple solution to cover both of them.
>
> What we are really trying to do here is not put "is tdx" logic in the generic
> code. We could rely on the fact that TDX is the only one with mirrored TDP, but
> that is kind of what we are already doing with kvm_gfn_shared_mask().
>
> How about we do helpers for each of your bullets, and they all just check:
> vm_type == KVM_X86_TDX_VM
>
> So like:
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index a578ea09dfb3..c0beed5b090a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -355,4 +355,19 @@ static inline bool kvm_is_private_gpa(const struct kvm
> *kvm, gpa_t gpa)
> return mask && !(gpa_to_gfn(gpa) & mask);
> }
>
> +static inline bool kvm_has_mirrored_tdp(struct kvm *kvm)
> +{
> + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> +
> +static inline bool kvm_has_private_root(struct kvm *kvm)
> +{
> + return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
I don't think we need to distinguish the two.
Even if we do this, if I understand what you're saying correctly,
kvm_has_private_root() isn't enough -- theoretically we could have a
mirror PT at a sub-tree at any level.
On 16/05/2024 11:38 am, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 11:31 +1200, Huang, Kai wrote:
>>
>>
>> On 16/05/2024 11:21 am, Edgecombe, Rick P wrote:
>>> On Thu, 2024-05-16 at 10:34 +1200, Huang, Kai wrote:
>>>>
>>>>
>>>> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
>>>>> From: Isaku Yamahata <[email protected]>
>>>>>
>>>>> Introduce a "gfn_shared_mask" field in the kvm_arch structure to record
>>>>> GPA shared bit and provide address conversion helpers for TDX shared
>>>>> bit of GPA.
>>>>>
>>>>> TDX designates a specific GPA bit as the shared bit, which can be either
>>>>> bit 51 or bit 47 based on configuration.
>>>>>
>>>>> This GPA shared bit indicates whether the corresponding physical page is
>>>>> shared (if shared bit set) or private (if shared bit cleared).
>>>>>
>>>>> - GPAs with shared bit set will be mapped by VMM into conventional EPT,
>>>>> which is pointed to by the shared EPTP in TDVMCS, resides in host VMM
>>>>> memory and is managed by VMM.
>>>>> - GPAs with shared bit cleared will be mapped by VMM first into a
>>>>> mirrored EPT, which resides in host VMM memory. Changes of the
>>>>> mirrored EPT are then propagated into a private EPT, which resides
>>>>> outside of host VMM memory and is managed by the TDX module.
>>>>>
>>>>> Add the "gfn_shared_mask" field to the kvm_arch structure for each VM
>>>>> with a default value of 0. It will be set to the position of the GPA
>>>>> shared bit in GFN through TD specific initialization code.
>>>>>
>>>>> Provide helpers to utilize the gfn_shared_mask to determine whether a
>>>>> GPA is shared or private, retrieve the GPA shared bit value, and
>>>>> insert/strip shared bit to/from a GPA.
>>>>
>>>> I am seriously thinking whether we should just abandon this whole
>>>> kvm_gfn_shared_mask() thing.
>>>>
>>>> We already have enough mechanisms around private memory and the mapping
>>>> of it:
>>>>
>>>> 1) Xarray to query whether a given GFN is private or shared;
>>>> 2) fault->is_private to indicate whether a faulting address is private
>>>> or shared;
>>>> 3) sp->is_private to indicate whether a "page table" is only for private
>>>> mapping;
>>>
>>> You mean drop the helpers, or the struct kvm member? I think we still need
>>> the shared bit position stored somewhere. memslots, Xarray, etc. need to
>>> operate on the GFN without the shared bit.
>>
>> The struct member, and the whole thing. The shared bit is only included
>> in the faulting address, and we can strip that away upon
>> handle_ept_violation().
>>
>> One thing I can think of is we still need to append the shared bit to
>> the actual GFN when we set up the shared page table mapping. For that I
>> am thinking whether we can do it in TDX-specific code.
>>
>> Anyway, I don't think the 'gfn_shared_mask' is necessarily good at this
>> stage.
>
> Sorry, still not clear. We need to strip the bit away, so we need to know what
> bit it is. The proposal is to not remember it on struct kvm, so where do we get
> it?
The TDX-specific code can get it when the TDX guest is created.
>
> Actually, we used to allow it to be selected (via GPAW), but now we could
> determine it based on EPT level and MAXPA. So we could possibly recalculate it
> in some helper...
>
> But it seems you are suggesting to do away with the concept of knowing what the
> shared bit is.
What I am suggesting is essentially to replace this
kvm_gfn_shared_mask() with some kvm_x86_ops callback (which can just
return the shared bit), assuming the common code somehow still needs it
(e.g., setting up the SPTE for shared mapping, which must include the
shared bit in the GPA).
The advantage of this is we can get rid of the concept of 'gfn_shared_mask'
in the MMU common code. All GFNs referenced in the common code are the
actual GFN (w/o the shared bit).
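In code, Kai's alternative would be roughly the sketch below; the hook and
its generated static call are hypothetical:

/* Hypothetical kvm_x86_ops member; TDX would return its shared-bit mask,
 * everyone else would return 0: */
gfn_t (*gfn_shared_mask)(struct kvm *kvm);

/* The few common-code spots that truly need the bit, e.g. building a
 * shared-mapping SPTE, would then query it via the hook: */
static inline gfn_t kvm_gfn_to_shared(struct kvm *kvm, gfn_t gfn)
{
	return gfn | static_call(kvm_x86_gfn_shared_mask)(kvm);
}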
On Wed, 2024-05-15 at 15:47 -0700, Sean Christopherson wrote:
> > I didn't gather there was any proof of this. Did you have any hunch either
> > way?
>
> I doubt the guest was able to access memory it shouldn't have been able to
> access.
> But that's a moot point, as the bigger problem is that, because we have no
> idea
> what's at fault, KVM can't make any guarantees about the safety of such a
> flag.
>
> TDX is a special case where we don't have a better option (we do have other
> options,
> they're just horrible). In other words, the choice is essentially to either:
>
> (a) cross our fingers and hope that the problem is limited to shared memory
> with QEMU+VFIO, i.e. and doesn't affect TDX private memory.
>
> or
>
> (b) don't merge TDX until the original regression is fully resolved.
>
> FWIW, I would love to root cause and fix the failure, but I don't know how
> feasible
> that is at this point.
If we think it is not a security issue, and we don't even know if it can be hit
for TDX, then I'd be inclined to go with (a). Especially since we are just
aiming for the most basic support, and don't have to worry about regressions in
the classical sense.
I'm not sure how easy it will be to root cause it at this point. Hopefully Yan
will be coming online soon. She mentioned some previous Intel effort to
investigate it. Presumably we would have to start with the old kernel that
exhibited the issue. If it can still be found...
On Thu, 2024-05-16 at 11:44 +1200, Huang, Kai wrote:
> >
> > Sorry, still not clear. We need to strip the bit away, so we need to know
> > what bit it is. The proposal is to not remember it on struct kvm, so where
> > do we get it?
>
> The TDX-specific code can get it when the TDX guest is created.
The TDX specific code sets it. It knows GPAW/shared bit location.
>
> >
> > Actually, we used to allow it to be selected (via GPAW), but now we could
> > determine it based on EPT level and MAXPA. So we could possibly
> > recalculate it in some helper...
> >
> > But it seems you are suggesting to do away with the concept of knowing
> > what the shared bit is.
>
> What I am suggesting is essentially to replace this
> kvm_gfn_shared_mask() with some kvm_x86_ops callback (which can just
> return the shared bit), assuming the common code somehow still needs it
> (e.g., setting up the SPTE for shared mapping, which must include the
> shared bit in the GPA).
>
> The advantage of this is we can get rid of the concept of 'gfn_shared_mask'
> in the MMU common code. All GFNs referenced in the common code are the
> actual GFN (w/o the shared bit).
When it is actually being used as the shared bit instead of as a way to check if
a guest is a TD, what is the problem? I think the shared_mask serves a real
(small) purpose, but it is misused for a bunch of other stuff. If we move that
other stuff to new helpers, the shared mask will still be needed for its
original job.
What is the benefit of the x86_ops over a static inline?
On 16/05/2024 11:59 am, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 11:44 +1200, Huang, Kai wrote:
>>>
>>> Sorry, still not clear. We need to strip the bit away, so we need to know
>>> what bit it is. The proposal is to not remember it on struct kvm, so where
>>> do we get it?
>>
>> The TDX-specific code can get it when the TDX guest is created.
>
> The TDX specific code sets it. It knows GPAW/shared bit location.
>
>>
>>>
>>> Actually, we used to allow it to be selected (via GPAW), but now we could
>>> determine it based on EPT level and MAXPA. So we could possibly
>>> recalculate it in some helper...
>>>
>>> But it seems you are suggesting to do away with the concept of knowing
>>> what the shared bit is.
>>
>> What I am suggesting is essentially to replace this
>> kvm_gfn_shared_mask() with some kvm_x86_ops callback (which can just
>> return the shared bit), assuming the common code somehow still needs it
>> (e.g., setting up the SPTE for shared mapping, which must include the
>> shared bit in the GPA).
>>
>> The advantage of this is we can get rid of the concept of 'gfn_shared_mask'
>> in the MMU common code. All GFNs referenced in the common code are the
>> actual GFN (w/o the shared bit).
>
> When it is actually being used as the shared bit instead of as a way to check if
> a guest is a TD, what is the problem? I think the shared_mask serves a real
> (small) purpose, but it is misused for a bunch of other stuff. If we move that
> other stuff to new helpers, the shared mask will still be needed for its
> original job.
>
> What is the benefit of the x86_ops over a static inline?
I don't have strong objection if the use of kvm_gfn_shared_mask() is
contained in smaller areas that truly need it. Let's discuss in
relevant patch(es).
However I do think helpers like the below make no sense (for SEV-SNP):
+static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
+{
+ gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+ return mask && !(gpa_to_gfn(gpa) & mask);
+}
On Thu, 2024-05-16 at 11:38 +1200, Huang, Kai wrote:
>
>
> On 16/05/2024 11:14 am, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
> > > > TDX has several aspects related to the TDP MMU.
> > > > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > > > (private-vs-shared)
> > > > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
> > > > memory load/store. TDP MMU needs hooks for it.
> > > > 3) The tables must be zapped from the leaf. not the root or the middle.
> > > >
> > > > For 1) and 2), what about something like this? TDX backend code will
> > > > set kvm->arch.has_mirrored_pt = true; I think we will use
> > > > kvm_gfn_shared_mask() only for address conversion (shared<->private).
> >
> > 1 and 2 are not the same as "mirrored" though. You could have a design that
> > mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
> > might be just a KVM design choice, even for TDX.
>
> I am not sure whether I understand this correctly. If they are not
> tracked with separate roots, it means they use the same page table (root).
There are three roots, right? Shared, private and mirrored. Shared and mirrored
don't have to be different roots, but it makes some operations arguably easier
to have it that way.
>
> So IIUC what you said is to support "mirror PT" at any sub-tree of the
> page table?
>
> That will only complicate things. I don't think we should consider
> this. In reality, we only have TDX and SEV-SNP. We should have a
> simple solution to cover both of them.
Look at "bool is_private" in kvm_tdp_mmu_map(). Do you see how it switches
between different roots in the iterator? That is one use.
The second use is to decide whether to call out to the x86_ops. It happens via
the role bit in the sp, which is copied from the parent sp role. The root's bit
is set originally via a kvm_gfn_shared_mask() check.
BTW, the role bit is the thing I'm wondering if we really need, because we have
shared_mask. While the shared_mask is used for lots of things today, we
still need it for masking GPAs. Whereas the role bit is only needed to know if
an SP is for private (which we can tell from the GPA).
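The root switching Rick refers to looks roughly like this inside
kvm_tdp_mmu_map(), simplified from the patch (raw_gfn keeps the shared bit so
the shared half gets mapped at the GPA the hardware actually uses):

	struct kvm_mmu *mmu = vcpu->arch.mmu;
	/* Only walk the private root for a private fault on a VM that
	 * actually has a shared bit, i.e. TDX. */
	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
	gfn_t raw_gfn = gpa_to_gfn(fault->addr);

	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
		/* ... iteration starts from mmu->private_root_hpa or
		 * mmu->root.hpa depending on is_private ... */
	}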
On Thu, May 16, 2024 at 10:17:50AM +1200,
"Huang, Kai" <[email protected]> wrote:
> On 16/05/2024 4:22 am, Isaku Yamahata wrote:
> > On Wed, May 15, 2024 at 08:34:37AM -0700,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index d5cf5b15a10e..808805b3478d 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > > > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > > > - if (tdp_mmu_enabled)
> > > > + if (tdp_mmu_enabled) {
> > > > + /*
> > > > + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > > > + * type was changed. TDX can't handle zapping the private
> > > > + * mapping, but it's ok because KVM doesn't support either of
> > > > + * those features for TDX. In case a new caller appears, BUG
> > > > + * the VM if it's called for solutions with private aliases.
> > > > + */
> > > > + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> > >
> > > Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
> > > generic name quite obviously doesn't prevent TDX details from bleeding into common
> > > code, and dancing around things just makes it all unnecessarily confusing.
> > >
> > > If we can't avoid bleeding TDX details into common code, my vote is to bite the
> > > bullet and simply check vm_type.
> >
> > TDX has several aspects related to the TDP MMU.
> > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > (private-vs-shared)
> > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
> > load/store. TDP MMU needs hooks for it.
> > 3) The tables must be zapped from the leaf. not the root or the middle.
> >
> > For 1) and 2), what about something like this? TDX backend code will set
> > kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
> > for address conversion (shared<->private).
> >
> > For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> > (or whatever preferable name)?
> >
> > For 3), flag of memslot handles it.
> >
> > ---
> > arch/x86/include/asm/kvm_host.h | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aabf1648a56a..218b575d24bd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1289,6 +1289,7 @@ struct kvm_arch {
> > u8 vm_type;
> > bool has_private_mem;
> > bool has_protected_state;
> > + bool has_mirrored_pt;
> > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > struct list_head active_mmu_pages;
> > struct list_head zapped_obsolete_pages;
> > @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> > #ifdef CONFIG_KVM_PRIVATE_MEM
> > #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> > +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> > #else
> > #define kvm_arch_has_private_mem(kvm) false
> > +#define kvm_arch_has_mirrored_pt(kvm) false
> > #endif
> > static inline u16 kvm_read_ldt(void)
>
> I think this 'has_mirrored_pt' (or a better name) is better, because it
> clearly conveys it is for the "page table", but not the actual page that any
> page table entry maps to.
>
> AFAICT we need to split the concept of "private page table itself" and the
> "memory type of the actual GFN".
>
> E.g., both SEV-SNP and TDX have a concept of "private memory" (obviously), but
> I was told only TDX uses a dedicated private page table which isn't directly
> accessible for KVM. SEV-SNP on the other hand just uses a normal page table +
> an additional HW-managed table to ensure security.
kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt or
need_callback or whatever makes sense.
> In other words, I think we should decide whether to invoke TDP MMU callback
> for private mapping (the page table itself may just be a normal one) depending
> on the fault->is_private, but not whether the page table is private:
>
> if (fault->is_private && kvm_x86_ops->set_private_spte)
> kvm_x86_set_private_spte(...);
> else
> tdp_mmu_set_spte_atomic(...);
This doesn't work for two reasons.
- We need to pass down struct kvm_page_fault fault deep only for this.
We could change the code in such way.
- We don't have struct kvm_page_fault fault for zapping case.
We could create a dummy one and pass it around.
Essentially the issue is how to pass down is_private or stash the info
somewhere or determine it somehow. Options I think of are
- Pass around fault:
Con: fault isn't passed down
Con: Create fake fault for zapping case
- Stash it in struct tdp_iter and pass around iter:
Pro: work for zapping case
Con: we need to change the code to pass down tdp_iter
- Pass around is_private (or mirrored_pt or whatever):
Pro: Don't need to add member to some structure
Con: We need to pass it around still.
- Stash it in kvm_mmu_page:
The patch series uses kvm_mmu_page.role.
Pro: We don't need to pass around because we know struct kvm_mmu_page
Con: Need to twist root page allocation
- Use gfn. kvm_is_private_gfn(kvm, gfn):
Con: The use of gfn is confusing. It's too TDX specific.
> And the 'has_mirrored_pt' should be only used to select the root of the page
> table that we want to operate on.
We can add one more bool to struct kvm_page_fault.follow_mirrored_pt or
something to represent it. We can initialize it in __kvm_mmu_do_page_fault().
follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa);
> This also gives a chance that if there's anything special that needs to be
> done for pages allocated for the "non-leaf" middle page table for SEV-SNP,
> it can just fit.
Can you please elaborate on this?
--
Isaku Yamahata <[email protected]>
On Thu, 2024-05-16 at 12:12 +1200, Huang, Kai wrote:
>
> I don't have strong objection if the use of kvm_gfn_shared_mask() is
> contained in smaller areas that truly need it. Let's discuss in
> relevant patch(es).
>
> However I do think helpers like the below make no sense (for SEV-SNP):
>
> +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
> +{
> + gfn_t mask = kvm_gfn_shared_mask(kvm);
> +
> + return mask && !(gpa_to_gfn(gpa) & mask);
> +}
You mean the name? SNP doesn't have a concept of "private GPA" IIUC. The C bit
is more like a permission bit. So SNP doesn't have private GPAs, and the
function would always return false for SNP. So I'm not sure it's too horrible.
If it's the name, can you suggest something?
On 16/05/2024 12:19 pm, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 12:12 +1200, Huang, Kai wrote:
>>
>> I don't have strong objection if the use of kvm_gfn_shared_mask() is
>> contained in smaller areas that truly need it. Let's discuss in
>> relevant patch(es).
>>
>> However I do think helpers like the below make no sense (for SEV-SNP):
>>
>> +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
>> +{
>> + gfn_t mask = kvm_gfn_shared_mask(kvm);
>> +
>> + return mask && !(gpa_to_gfn(gpa) & mask);
>> +}
>
> You mean the name? SNP doesn't have a concept of "private GPA" IIUC. The C bit
> is more like a permission bit. So SNP doesn't have private GPAs, and the
> function would always return false for SNP. So I'm not sure it's too horrible.
Hmm.. Why doesn't SNP have private GPAs? They are crypto-protected and
KVM cannot access them directly, correct?
>
> If it's the name, can you suggest something?
The name makes sense, but it has to reflect the fact that a given GPA is
truly private (crypto-protected, inaccessible to KVM).
On Thu, May 16, 2024 at 12:13:44AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Thu, 2024-05-16 at 11:38 +1200, Huang, Kai wrote:
> > On 16/05/2024 11:14 am, Edgecombe, Rick P wrote:
> > > On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
> > > > > TDX has several aspects related to the TDP MMU.
> > > > > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > > > > (private-vs-shared)
> > > > > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of
> > > > > direct memory load/store. TDP MMU needs hooks for it.
> > > > > 3) The tables must be zapped from the leaf. not the root or the middle.
> > > > >
> > > > > For 1) and 2), what about something like this? TDX backend code will
> > > > > set kvm->arch.has_mirrored_pt = true; I think we will use
> > > > > kvm_gfn_shared_mask() only for address conversion (shared<->private).
> > >
> > > 1 and 2 are not the same as "mirrored" though. You could have a design that
> > > mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
> > > might be just a KVM design choice, even for TDX.
> >
> > I am not sure whether I understand this correctly. If they are not
> > tracked with separate roots, it means they use the same page table (root).
>
> There are three roots, right? Shared, private and mirrored. Shared and mirrored
> don't have to be different roots, but it makes some operations arguably easier
> to have it that way.
Do you have something like KVM_X86_SW_PROTECTED_VM with mirrored PT in mind?
or a TDX thing?
> > So IIUC what you said is to support "mirror PT" at any sub-tree of the
> > page table?
> >
> > That will only complicate things. I don't think we should consider
> > this. In reality, we only have TDX and SEV-SNP. We should have a
> > simple solution to cover both of them.
>
> Look at "bool is_private" in kvm_tdp_mmu_map(). Do you see how it switches
> between different roots in the iterator? That is one use.
>
> The second use is to decide whether to call out to the x86_ops. It happens via
> the role bit in the sp, which is copied from the parent sp role. The root's bit
> is set originally via a kvm_gfn_shared_mask() check.
>
> BTW, the role bit is the thing I'm wondering if we really need, because we have
> shared_mask. While the shared_mask is used for lots of things today, we
> still need it for masking GPAs. Whereas the role bit is only needed to know if
> an SP is for private (which we can tell from the GPA).
I started the discussion at [1] for it.
[1] https://lore.kernel.org/kvm/[email protected]/
--
Isaku Yamahata <[email protected]>
On Thu, 2024-05-16 at 12:25 +1200, Huang, Kai wrote:
>
>
> On 16/05/2024 12:19 pm, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 12:12 +1200, Huang, Kai wrote:
> > >
> > > I don't have strong objection if the use of kvm_gfn_shared_mask() is
> > > contained in smaller areas that truly need it. Let's discuss in
> > > relevant patch(es).
> > >
> > > However I do think helpers like the below make no sense (for SEV-SNP):
> > >
> > > +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
> > > +{
> > > + gfn_t mask = kvm_gfn_shared_mask(kvm);
> > > +
> > > + return mask && !(gpa_to_gfn(gpa) & mask);
> > > +}
> >
> > You mean the name? SNP doesn't have a concept of "private GPA" IIUC. The
> > C bit is more like a permission bit. So SNP doesn't have private GPAs, and
> > the function would always return false for SNP. So I'm not sure it's too
> > horrible.
>
> Hmm.. Why doesn't SNP have private GPAs? They are crypto-protected and
> KVM cannot access them directly, correct?
I suppose a GPA could be pointing to memory that is private. But I think in SNP
it is more the memory that is private. Now I see more how it could be confusing.
>
> >
> > If it's the name, can you suggest something?
>
> The name makes sense, but it has to reflect the fact that a given GPA is
> truly private (crypto-protected, inaccessible to KVM).
If this was a function that tested whether memory is private and took a GPA, I
would call it is_private_mem() or something. Because it's testing the memory and
takes a GPA, not testing the GPA. Usually a function name should be about what
the function does, not what arguments it takes.
I can't think of a better name, but point taken that it is not ideal. I'll try
to think of something.
On Wed, 2024-05-15 at 17:15 -0700, Isaku Yamahata wrote:
>
> kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt or
> need_callback or whatever makes sense.
Name seems good to me. It's a better reason to have a role bit, as there could
be a shared_bit without mirroring.
On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> From: Isaku Yamahata <[email protected]>
>
> Allocate mirrored page table for the private page table and implement MMU
> hooks to operate on the private page table.
>
> To handle page fault to a private GPA, KVM walks the mirrored page table in
> unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> changes from the mirrored page table to private page table.
>
> private KVM page fault |
> | |
> V |
> private GPA | CPU protected EPTP
> | | |
> V | V
> mirrored PT root | private PT root
> | | |
> V | V
> mirrored PT --hook to propagate-->private PT
> | | |
> \--------------------+------\ |
> | | |
> | V V
> | private guest page
> |
> |
> non-encrypted memory | encrypted memory
> |
>
> PT: page table
> Private PT: the CPU uses it, but it is invisible to KVM. TDX module manages
> this table to map private guest pages.
> Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> to propagate PT change to the actual private PT.
>
> SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> can be modified atomically with mmu_lock held for read, however, the MMU
> hooks to private page table are not atomic operations.
>
> To address it, a special REMOVED_SPTE is introduced and the below sequence is
> used when mirrored SPTEs are updated atomically.
>
> 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> following steps.
> 3. Invoke MMU hooks to modify private page table with the target value.
> 4. (a) If the hook succeeds, update mirrored SPTE to target value.
> (b) If the hook fails, restore mirrored SPTE to original value.
>
> KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
>
> This sequence also applies when SPTEs are atomically updated from
> non-present to present in order to prevent potential conflicts when
> multiple vCPUs attempt to set private SPTEs to a different page size
> simultaneously, though 4K page size is only supported for private page
> table currently.
>
> 2M page support can be done in future patches.
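As a rough illustration of the sequence above (not the patch's exact code;
names approximate the TDP MMU style):

static int set_mirrored_spte_atomic(struct kvm *kvm, struct tdp_iter *iter,
				    u64 new_spte)
{
	u64 old_spte = iter->old_spte;
	int ret;

	/* 1. Atomically freeze the mirrored SPTE; concurrent updaters see
	 *    REMOVED_SPTE and retry instead of racing with the hook below. */
	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &old_spte, REMOVED_SPTE))
		return -EBUSY;

	/* 2/3. We won the race; propagate the change to the private EPT. */
	ret = static_call(kvm_x86_set_private_spte)(kvm, iter->gfn, iter->level,
						    spte_to_pfn(new_spte));

	/* 4. Unfreeze: commit the target value on success (4a), or roll back
	 *    to the original value on failure (4b). */
	__kvm_tdp_mmu_write_spte(iter->sptep, ret ? old_spte : new_spte);

	return ret;
}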
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Co-developed-by: Kai Huang <[email protected]>
> Signed-off-by: Kai Huang <[email protected]>
> Co-developed-by: Yan Zhao <[email protected]>
> Signed-off-by: Yan Zhao <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> TDX MMU Part 1:
> - Remove unnecessary gfn, access twist in
> tdp_mmu_map_handle_target_level(). (Chao Gao)
> - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> tdp_mmu_alloc_sp()
> - Update comment in set_private_spte_present() (Yan)
> - Open code call to kvm_mmu_init_private_spt() (Yan)
> - Add comments on TDX MMU hooks (Yan)
> - Fix various whitespace alignment (Yan)
> - Remove pointless warnings and conditionals in
> handle_removed_private_spte() (Yan)
> - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> - Remove incorrect comment in handle_changed_spte() (Yan)
> - Remove unneeded kvm_pfn_to_refcounted_page() and
> is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> - Do kvm_gfn_for_root() branchless (Rick)
> - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> - Add comment for stripping shared bit for fault.gfn (Chao)
>
> v19:
> - drop CONFIG_KVM_MMU_PRIVATE
>
> v18:
> - Rename freezed => frozen
>
> v14 -> v15:
> - Refined is_private condition check in kvm_tdp_mmu_map().
> Add kvm_gfn_shared_mask() check.
> - catch up for struct kvm_range change
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 5 +
> arch/x86/include/asm/kvm_host.h | 25 +++
> arch/x86/kvm/mmu/mmu.c | 13 +-
> arch/x86/kvm/mmu/mmu_internal.h | 19 +-
> arch/x86/kvm/mmu/tdp_iter.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 269 +++++++++++++++++++++++++----
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> 7 files changed, 293 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 566d19b02483..d13cb4b8fce6 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP_OPTIONAL(link_private_spt)
> +KVM_X86_OP_OPTIONAL(free_private_spt)
> +KVM_X86_OP_OPTIONAL(set_private_spte)
> +KVM_X86_OP_OPTIONAL(remove_private_spte)
> +KVM_X86_OP_OPTIONAL(zap_private_spte)
> KVM_X86_OP(has_wbinvd_exit)
> KVM_X86_OP(get_l2_tsc_offset)
> KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d010ca5c7f44..20fa8fa58692 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -470,6 +470,7 @@ struct kvm_mmu {
> int (*sync_spte)(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *sp, int i);
> struct kvm_mmu_root_info root;
> + hpa_t private_root_hpa;
Should we have
struct kvm_mmu_root_info private_root;
instead?
> union kvm_cpu_role cpu_role;
> union kvm_mmu_page_role root_role;
>
> @@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
>
> + /* Add a page as page table page into private page table */
> + int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *private_spt);
> + /*
> + * Free a page table page of private page table.
> + * Only expected to be called when guest is not active, specifically
> + * during VM destruction phase.
> + */
> + int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *private_spt);
> +
> + /* Add a guest private page into private page table */
> + int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);
> +
> + /* Remove a guest private page from private page table*/
> + int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + kvm_pfn_t pfn);
> + /*
> + * Keep a guest private page mapped in private page table, but clear its
> + * present bit
> + */
> + int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> +
> bool (*has_wbinvd_exit)(void);
>
> u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 76f92cb37a96..2506d6277818 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> int r;
>
> if (tdp_mmu_enabled) {
> - kvm_tdp_mmu_alloc_root(vcpu);
> + if (kvm_gfn_shared_mask(vcpu->kvm))
> + kvm_tdp_mmu_alloc_root(vcpu, true);
As mentioned in replies to other patches, I kinda prefer
kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)
Or we could have a helper
kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()
> + kvm_tdp_mmu_alloc_root(vcpu, false);
> return 0;
> }
>
> @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> - gfn_t base = gfn_round_for_level(fault->gfn,
> + gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> fault->max_level);
I thought by reaching here the shared bit has already been stripped away
by the caller?
It doesn't make a lot of sense to still have it here, given we have a
universal KVM-defined PFERR_PRIVATE_ACCESS flag:
https://lore.kernel.org/kvm/[email protected]/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
IMHO we should just strip the shared bit in the TDX variant of
handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA
doesn't have the shared bit) to the common fault handler so it can correctly
set fault->is_private to true.
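For illustration, that suggestion amounts to something like the sketch below;
to_kvm_tdx(), shared_bit_mask and tdexit_gpa() are hypothetical stand-ins for
TDX-internal plumbing:

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
	gpa_t shared_bit = to_kvm_tdx(vcpu->kvm)->shared_bit_mask; /* hypothetical */
	gpa_t gpa = tdexit_gpa(vcpu);				   /* hypothetical */
	u64 error_code = 0;	/* plus the usual RWX bits from the exit qual */

	if (!(gpa & shared_bit))
		error_code |= PFERR_PRIVATE_ACCESS;

	/* The common MMU code only ever sees the GPA without the shared bit. */
	return kvm_mmu_page_fault(vcpu, gpa & ~shared_bit, error_code, NULL, 0);
}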
>
> if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>
> mmu->root.hpa = INVALID_PAGE;
> mmu->root.pgd = 0;
> + mmu->private_root_hpa = INVALID_PAGE;
> for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>
> @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_unload(vcpu);
> + if (tdp_mmu_enabled) {
> + read_lock(&vcpu->kvm->mmu_lock);
> + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> + NULL);
> + read_unlock(&vcpu->kvm->mmu_lock);
> + }
Hmm.. I don't quite like this, but sorry I kinda forgot why we need to
do this here.
Could you elaborate?
Anyway, from common code's perspective, we need to have some
clarification of why it is designed to be done here.
> free_mmu_pages(&vcpu->arch.root_mmu);
> free_mmu_pages(&vcpu->arch.guest_mmu);
> mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0f1a9d733d9e..3a7fe9261e23 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
> #include <linux/kvm_host.h>
> #include <asm/kvm_host.h>
>
> +#include "mmu.h"
> +
> #ifdef CONFIG_KVM_PROVE_MMU
> #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> #else
> @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
> sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> }
>
> +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t gfn)
> +{
> + gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> +
> + /* Set shared bit if not private */
> + gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
> + return gfn_for_root;
> +}
> +
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> int r;
>
> if (vcpu->arch.mmu->root_role.direct) {
> - fault.gfn = fault.addr >> PAGE_SHIFT;
> + /*
> + * Things like memslots don't understand the concept of a shared
> + * bit. Strip it so that the GFN can be used like normal, and the
> + * fault.addr can be used when the shared bit is needed.
> + */
> + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
Again, I don't think it's necessary for fault.gfn to still have the shared
bit here?
This kind of usage is pretty much the reason I want to get rid of
kvm_gfn_shared_mask().
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index fae559559a80..8a64bcef9deb 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -91,7 +91,7 @@ struct tdp_iter {
> tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> /* A pointer to the current SPTE */
> tdp_ptep_t sptep;
> - /* The lowest GFN mapped by the current SPTE */
> + /* The lowest GFN (shared bits included) mapped by the current SPTE */
> gfn_t gfn;
IMHO we need more clarification of this design.
We at least need to call out that the TDX hardware uses the 'GPA + shared
bit' when it walks the page table for shared mappings, so we must set up
the mapping at the GPA with the shared bit.
E.g., because TDX hardware uses separate roots for shared/private
mappings, I think it would be a reasonable option for the TDX hardware to
just use the actual GPA w/o the shared bit when it walks the shared page
table, and still report EPT violations with the shared bit set in the GPA.
Such HW implementation details are completely hidden from software, and thus
should be clarified in the changelog/comments.
> /* The level of the root page given to the iterator */
> int root_level;
[...]
> for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> else
> wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> - fault->pfn, iter->old_spte, fault->prefetch, true,
> - fault->map_writable, &new_spte);
> + fault->pfn, iter->old_spte, fault->prefetch, true,
> + fault->map_writable, &new_spte);
>
> if (new_spte == iter->old_spte)
> ret = RET_PF_SPURIOUS;
> @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> + gfn_t raw_gfn;
> + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
Ditto. I wish we could have 'has_mirrored_private_pt'.
On 16/05/2024 12:35 pm, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 12:25 +1200, Huang, Kai wrote:
>>
>>
>> On 16/05/2024 12:19 pm, Edgecombe, Rick P wrote:
>>> On Thu, 2024-05-16 at 12:12 +1200, Huang, Kai wrote:
>>>>
>>>> I don't have strong objection if the use of kvm_gfn_shared_mask() is
>>>> contained in smaller areas that truly need it. Let's discuss in
>>>> relevant patch(es).
>>>>
>>>> However I do think helpers like the below make no sense (for SEV-SNP):
>>>>
>>>> +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
>>>> +{
>>>> + gfn_t mask = kvm_gfn_shared_mask(kvm);
>>>> +
>>>> + return mask && !(gpa_to_gfn(gpa) & mask);
>>>> +}
>>>
>>> You mean the name? SNP doesn't have a concept of "private GPA" IIUC. The
>>> C bit is more like a permission bit. So SNP doesn't have private GPAs, and
>>> the function would always return false for SNP. So I'm not sure it's too
>>> horrible.
>>
>> Hmm.. Why doesn't SNP have private GPAs? They are crypto-protected and
>> KVM cannot access them directly, correct?
>
> I suppose a GPA could be pointing to memory that is private. But I think in SNP
> it is more the memory that is private. Now I see more how it could be confusing.
>
>>
>>>
>>> If it's the name, can you suggest something?
>>
>> The name makes sense, but it has to reflect the fact that a given GPA is
>> truly private (crypto-protected, inaccessible to KVM).
>
> If this was a function that tested whether memory is private and took a GPA, I
> would call it is_private_mem() or something. Because it's testing the memory and
> takes a GPA, not testing the GPA. Usually a function name should be about what
> the function does, not what arguments it takes.
>
> I can't think of a better name, but point taken that it is not ideal. I'll try
> to think of something.
>
I really don't see the difference between ...
is_private_mem(gpa)
... and
is_private_gpa(gpa)
If it confuses me, it can confuse other people.
The point is there's really no need to distinguish the two. The GPA is
only meaningful when it refers to the memory that it points to.
So far I am not convinced we need this helper, because such info we can
already get from:
1) fault->is_private;
2) Xarray which records memtype for given GFN.
So we should just get rid of it.
>
> BTW, the role bit is the thing I'm wondering if we really need, because we have
> shared_mask. While the shared_mask is used for lots of things today, we
> still need it for masking GPAs. Whereas the role bit is only needed to know if
> an SP is for private (which we can tell from the GPA).
Yeah we can have a second thought on whether sp.role.private is
necessary. It is useful in the shadow MMU (which we originally used to
support in the first place), but may not be necessary for the TDP MMU.
On 5/16/2024 7:20 AM, Sean Christopherson wrote:
>> But again, I think it's just too overkill for TDX. We can just set the
>> ZAP_LEAF_ONLY flag for the slot when it is created in KVM.
> Ya, I'm convinced that adding uAPI is overkill at this point.
+1.
Making it configurable by userspace needs a justification that is common
enough. If it's just TDX-specific and mandatory for TDX, just make it a
KVM-internal thing for TDX.
On Thu, 2024-05-16 at 13:04 +1200, Huang, Kai wrote:
>
> I really don't see the difference between ...
>
> is_private_mem(gpa)
>
> ... and
>
> is_private_gpa(gpa)
>
> If it confuses me, it can confuse other people.
Again, point taken. I'll try to think of a better name. Please share if you do.
>
> The point is there's really no need to distinguish the two. The GPA is
> only meaningful when it refers to the memory that it points to.
>
> So far I am not convinced we need this helper, because such info we can
> already get from:
>
> 1) fault->is_private;
> 2) Xarray which records memtype for given GFN.
>
> So we should just get rid of it.
Kai, can you go look through the dev branch a bit more before making the same
point on every patch?
kvm_is_private_gpa() is used to set PFERR_PRIVATE_ACCESS, which in turn sets
fault->is_private. So you are saying we can use these other things that are
dependent on it. Look at the other callers too.
On 16/05/2024 12:15 pm, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 10:17:50AM +1200,
> "Huang, Kai" <[email protected]> wrote:
>
>> On 16/05/2024 4:22 am, Isaku Yamahata wrote:
>>> On Wed, May 15, 2024 at 08:34:37AM -0700,
>>> Sean Christopherson <[email protected]> wrote:
>>>
>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>>> index d5cf5b15a10e..808805b3478d 100644
>>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>>> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>>>>> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>>>>> - if (tdp_mmu_enabled)
>>>>> + if (tdp_mmu_enabled) {
>>>>> + /*
>>>>> + * kvm_zap_gfn_range() is used when MTRR or PAT memory
>>>>> + * type was changed. TDX can't handle zapping the private
>>>>> + * mapping, but it's ok because KVM doesn't support either of
>>>>> + * those features for TDX. In case a new caller appears, BUG
>>>>> + * the VM if it's called for solutions with private aliases.
>>>>> + */
>>>>> + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>>>>
>>>> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
>>>> generic name quite obviously doesn't prevent TDX details from bleeding into common
>>>> code, and dancing around things just makes it all unnecessarily confusing.
>>>>
>>>> If we can't avoid bleeding TDX details into common code, my vote is to bite the
>>>> bullet and simply check vm_type.
>>>
>>> TDX has several aspects related to the TDP MMU.
>>> 1) Based on the faulting GPA, determine which KVM page table to walk.
>>> (private-vs-shared)
>>> 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
>>> load/store. TDP MMU needs hooks for it.
>>> 3) The tables must be zapped from the leaf. not the root or the middle.
>>>
>>> For 1) and 2), what about something like this? TDX backend code will set
>>> kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
>>> for address conversion (shared<->private).
>>>
>>> For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
>>> (or whatever preferable name)?
>>>
>>> For 3), flag of memslot handles it.
>>>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index aabf1648a56a..218b575d24bd 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -1289,6 +1289,7 @@ struct kvm_arch {
>>> u8 vm_type;
>>> bool has_private_mem;
>>> bool has_protected_state;
>>> + bool has_mirrored_pt;
>>> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>>> struct list_head active_mmu_pages;
>>> struct list_head zapped_obsolete_pages;
>>> @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> #ifdef CONFIG_KVM_PRIVATE_MEM
>>> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>>> +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
>>> #else
>>> #define kvm_arch_has_private_mem(kvm) false
>>> +#define kvm_arch_has_mirrored_pt(kvm) false
>>> #endif
>>> static inline u16 kvm_read_ldt(void)
>>
>> I think this 'has_mirrored_pt' (or a better name) is better, because it
>> clearly conveys it is for the "page table", but not the actual page that any
>> page table entry maps to.
>>
>> AFAICT we need to split the concept of "private page table itself" and the
>> "memory type of the actual GFN".
>>
>> E.g., both SEV-SNP and TDX have a concept of "private memory" (obviously), but
>> I was told only TDX uses a dedicated private page table which isn't directly
>> accessible for KVM. SEV-SNP on the other hand just uses a normal page table +
>> an additional HW-managed table to ensure security.
>
> kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt or
> need_callback or whatever makes sense.
>
>
>> In other words, I think we should decide whether to invoke TDP MMU callback
>> for private mapping (the page table itself may just be a normal one) depending
>> on the fault->is_private, but not whether the page table is private:
>>
>> if (fault->is_private && kvm_x86_ops->set_private_spte)
>> kvm_x86_set_private_spte(...);
>> else
>> tdp_mmu_set_spte_atomic(...);
>
> This doesn't work for two reasons.
>
> - We need to pass down struct kvm_page_fault fault deep only for this.
> We could change the code in such way.
>
> - We don't have struct kvm_page_fault fault for zapping case.
> We could create a dummy one and pass it around.
For both of the above, we don't necessarily need the whole 'kvm_page_fault', we
just need:
1) GFN
2) Whether it is private (points to private memory, to be precise)
3) Whether to use a separate private page table.
>
> Essentially the issue is how to pass down is_private or stash the info
> somewhere or determine it somehow. Options I think of are
>
> - Pass around fault:
> Con: fault isn't passed down
> Con: Create fake fault for zapping case
> - Stash it in struct tdp_iter and pass around iter:
> Pro: work for zapping case
> Con: we need to change the code to pass down tdp_iter
> - Pass around is_private (or mirrored_pt or whatever):
> Pro: Don't need to add member to some structure
> Con: We need to pass it around still.
> - Stash it in kvm_mmu_page:
> The patch series uses kvm_mmu_page.role.
> Pro: We don't need to pass around because we know struct kvm_mmu_page
> Con: Need to twist root page allocation
I don't think using kvm_mmu_page.role is correct.
If kvm_mmu_page.role is private, we definitely can assume the faulting
address is private; but otherwise the address can be either private or shared.
>
> - Use gfn. kvm_is_private_gfn(kvm, gfn):
> Con: The use of gfn is confusing. It's too TDX specific.
>
>
>> And the 'has_mirrored_pt' should be only used to select the root of the page
>> table that we want to operate on.
>
> We can add one more bool to struct kvm_page_fault.follow_mirrored_pt or
> something to represent it. We can initialize it in __kvm_mmu_do_page_fault().
>
> .follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa);
>
>
>> This also gives a chance that if there's anything special that needs to be
>> done for pages allocated for the "non-leaf" middle page table for SEV-SNP,
>> it can just fit.
>
> Can you please elaborate on this?
I meant SEV-SNP may have its own version of link_private_spt().
I haven't looked into it, and it may not be needed from the hardware's
perspective, but providing such a chance certainly doesn't hurt and is
more flexible IMHO.
On Thu, 2024-05-16 at 12:52 +1200, Huang, Kai wrote:
>
>
> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Allocate mirrored page table for the private page table and implement MMU
> > hooks to operate on the private page table.
> >
> > To handle page fault to a private GPA, KVM walks the mirrored page table in
> > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > changes from the mirrored page table to private page table.
> >
> > private KVM page fault |
> > | |
> > V |
> > private GPA | CPU protected EPTP
> > | | |
> > V | V
> > mirrored PT root | private PT root
> > | | |
> > V | V
> > mirrored PT --hook to propagate-->private PT
> > | | |
> > \--------------------+------\ |
> > | | |
> > | V V
> > | private guest page
> > |
> > |
> > non-encrypted memory | encrypted memory
> > |
> >
> > PT: page table
> > Private PT: the CPU uses it, but it is invisible to KVM. TDX module manages
> > this table to map private guest pages.
> > Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> > to propagate PT change to the actual private PT.
> >
> > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > can be modified atomically with mmu_lock held for read, however, the MMU
> > hooks to private page table are not atomic operations.
> >
> > To address it, a special REMOVED_SPTE is introduced and the below sequence is
> > used when mirrored SPTEs are updated atomically.
> >
> > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> > following steps.
> > 3. Invoke MMU hooks to modify private page table with the target value.
> > 4. (a) If the hook succeeds, update mirrored SPTE to target value.
> > (b) If the hook fails, restore mirrored SPTE to original value.
> >
> > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> >
> > This sequence also applies when SPTEs are atomically updated from
> > non-present to present in order to prevent potential conflicts when
> > multiple vCPUs attempt to set private SPTEs to a different page size
> > simultaneously, though only 4K page size is currently supported for the
> > private page table.
> >
> > 2M page support can be done in future patches.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > Co-developed-by: Kai Huang <[email protected]>
> > Signed-off-by: Kai Huang <[email protected]>
> > Co-developed-by: Yan Zhao <[email protected]>
> > Signed-off-by: Yan Zhao <[email protected]>
> > Co-developed-by: Rick Edgecombe <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > ---
> > TDX MMU Part 1:
> > - Remove unnecessary gfn, access twist in
> > tdp_mmu_map_handle_target_level(). (Chao Gao)
> > - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> > tdp_mmu_alloc_sp()
> > - Update comment in set_private_spte_present() (Yan)
> > - Open code call to kvm_mmu_init_private_spt() (Yan)
> > - Add comments on TDX MMU hooks (Yan)
> > - Fix various whitespace alignment (Yan)
> > - Remove pointless warnings and conditionals in
> > handle_removed_private_spte() (Yan)
> > - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> > - Remove incorrect comment in handle_changed_spte() (Yan)
> > - Remove unneeded kvm_pfn_to_refcounted_page() and
> > is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> > - Do kvm_gfn_for_root() branchless (Rick)
> > - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> > - Add comment for stripping shared bit for fault.gfn (Chao)
> >
> > v19:
> > - drop CONFIG_KVM_MMU_PRIVATE
> >
> > v18:
> > - Rename freezed => frozen
> >
> > v14 -> v15:
> > - Refined is_private condition check in kvm_tdp_mmu_map().
> > Add kvm_gfn_shared_mask() check.
> > - catch up for struct kvm_range change
> > ---
> > arch/x86/include/asm/kvm-x86-ops.h | 5 +
> > arch/x86/include/asm/kvm_host.h | 25 +++
> > arch/x86/kvm/mmu/mmu.c | 13 +-
> > arch/x86/kvm/mmu/mmu_internal.h | 19 +-
> > arch/x86/kvm/mmu/tdp_iter.h | 2 +-
> > arch/x86/kvm/mmu/tdp_mmu.c | 269 +++++++++++++++++++++++++----
> > arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> > 7 files changed, 293 insertions(+), 42 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 566d19b02483..d13cb4b8fce6 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> > KVM_X86_OP(has_wbinvd_exit)
> > KVM_X86_OP(get_l2_tsc_offset)
> > KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d010ca5c7f44..20fa8fa58692 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > int (*sync_spte)(struct kvm_vcpu *vcpu,
> > struct kvm_mmu_page *sp, int i);
> > struct kvm_mmu_root_info root;
> > + hpa_t private_root_hpa;
>
> Should we have
>
> struct kvm_mmu_root_info private_root;
>
> instead?
This corresponds to:
mmu->root.hpa
We don't need the other fields, so I think it's better to not take the space.
It does look asymmetric though...
>
> > union kvm_cpu_role cpu_role;
> > union kvm_mmu_page_role root_role;
> >
> > @@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
> >  	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >  			     int root_level);
> >  
> > +	/* Add a page as page table page into private page table */
> > +	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				void *private_spt);
> > +	/*
> > +	 * Free a page table page of private page table.
> > +	 * Only expected to be called when guest is not active, specifically
> > +	 * during VM destruction phase.
> > +	 */
> > +	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				void *private_spt);
> > +
> > +	/* Add a guest private page into private page table */
> > +	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				kvm_pfn_t pfn);
> > +
> > +	/* Remove a guest private page from private page table */
> > +	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				   kvm_pfn_t pfn);
> > +	/*
> > +	 * Keep a guest private page mapped in private page table, but clear its
> > +	 * present bit
> > +	 */
> > +	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> > +
> >  	bool (*has_wbinvd_exit)(void);
> >  
> >  	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 76f92cb37a96..2506d6277818 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >  	int r;
> >  
> >  	if (tdp_mmu_enabled) {
> > -		kvm_tdp_mmu_alloc_root(vcpu);
> > +		if (kvm_gfn_shared_mask(vcpu->kvm))
> > +			kvm_tdp_mmu_alloc_root(vcpu, true);
>
> As mentioned in replies to other patches, I kinda prefer
>
> kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)
>
> Or we have a helper
>
> kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()
Yep I think everyone is on board with not doing kvm_gfn_shared_mask() for these
checks at this point.
>
> > + kvm_tdp_mmu_alloc_root(vcpu, false);
> > return 0;
> > }
> >
> > @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> >  		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> >  			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > -			gfn_t base = gfn_round_for_level(fault->gfn,
> > +			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> >  							 fault->max_level);
>
> I thought by reaching here the shared bit has already been stripped away
> by the caller?
We don't support MTRRs, so this code won't be executed for TDX, but it's not
clear what you are asking.
fault->addr has the shared bit (if present); fault->gfn has it stripped.
>
> It doesn't make a lot of sense to still have it here, given we have a
> universal KVM-defined PFERR_PRIVATE_ACCESS flag:
>
> https://lore.kernel.org/kvm/[email protected]/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
>
> IMHO we should just strip the shared bit in the TDX variant of
> handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA
> doesn't have shared bit) to the common fault handler so it can correctly
> set fault->is_private to true.
I'm not sure what you are seeing here, could you elaborate?
>
>
> >
> >  			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu)
> >
> > mmu->root.hpa = INVALID_PAGE;
> > mmu->root.pgd = 0;
> > + mmu->private_root_hpa = INVALID_PAGE;
> > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> >
> > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> >  {
> >  	kvm_mmu_unload(vcpu);
> > +	if (tdp_mmu_enabled) {
> > +		read_lock(&vcpu->kvm->mmu_lock);
> > +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > +				   NULL);
> > +		read_unlock(&vcpu->kvm->mmu_lock);
> > +	}
>
> Hmm.. I don't quite like this, but sorry, I kinda forgot why we need to do
> this here.
>
> Could you elaborate?
I was confused by this too, see the conversation here:
https://lore.kernel.org/kvm/[email protected]/
>
> Anyway, from common code's perspective, we need to have some
> clarification why we design to do it here.
>
> > free_mmu_pages(&vcpu->arch.root_mmu);
> > free_mmu_pages(&vcpu->arch.guest_mmu);
> > mmu_free_memory_caches(vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > b/arch/x86/kvm/mmu/mmu_internal.h
> > index 0f1a9d733d9e..3a7fe9261e23 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -6,6 +6,8 @@
> > #include <linux/kvm_host.h>
> > #include <asm/kvm_host.h>
> >
> > +#include "mmu.h"
> > +
> > #ifdef CONFIG_KVM_PROVE_MMU
> > #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> > #else
> > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
> >  	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> >  }
> >
> > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > +				     gfn_t gfn)
> > +{
> > +	gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > +
> > +	/* Set shared bit if not private */
> > +	gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
> > +	return gfn_for_root;
> > +}
> > +
> >  static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> > {
> > /*
> > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> >  	int r;
> >  
> >  	if (vcpu->arch.mmu->root_role.direct) {
> > -		fault.gfn = fault.addr >> PAGE_SHIFT;
> > +		/*
> > +		 * Things like memslots don't understand the concept of a shared
> > +		 * bit. Strip it so that the GFN can be used like normal, and the
> > +		 * fault.addr can be used when the shared bit is needed.
> > +		 */
> > +		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> >  		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>
> Again, I don't think it's necessary for fault.gfn to still have the shared
> bit here?
It's getting stripped as it's set for the first time... What do you mean it
still has it?
>
> This kinda usage is pretty much the reason I want to get rid of
> kvm_gfn_shared_mask().
I think you want to move it to an x86_op, right? Not get rid of the concept of
a shared bit? I think KVM will have a hard time doing TDX without knowing the
shared bit's location.
Or maybe you are saying you think it should be stripped earlier and live as a
PF error code?
>
> > }
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index fae559559a80..8a64bcef9deb 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -91,7 +91,7 @@ struct tdp_iter {
> > tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > /* A pointer to the current SPTE */
> > tdp_ptep_t sptep;
> > -	/* The lowest GFN mapped by the current SPTE */
> > +	/* The lowest GFN (shared bits included) mapped by the current SPTE */
> >  	gfn_t gfn;
>
> IMHO we need more clarification of this design.
Have you seen the documentation patch? Where do you think it should be? You mean
in the tdp_iter struct?
>
> We at least need to call out that the TDX hardware uses the 'GPA + shared
> bit' when it walks the page table for shared mappings, so we must set up
> the mapping at the GPA with the shared bit.
>
> E.g., because TDX hardware uses separate roots for shared/private
> mappings, I think it's a reasonable option for the TDX hardware to just
> use the actual GPA w/o shared bit when it walks the shared page table,
> and still report EPT violation with GPA with shared bit set.
>
> Such HW implementation is completely hidden from software, thus should
> be clarified in the changelog/comments.
>
>
> > /* The level of the root page given to the iterator */
> > int root_level;
>
> [...]
>
> > for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >  		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> >  	else
> >  		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> > -				   fault->pfn, iter->old_spte, fault->prefetch, true,
> > -				   fault->map_writable, &new_spte);
> > +				   fault->pfn, iter->old_spte, fault->prefetch, true,
> > +				   fault->map_writable, &new_spte);
> >  
> >  	if (new_spte == iter->old_spte)
> >  		ret = RET_PF_SPURIOUS;
> > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	struct kvm *kvm = vcpu->kvm;
> >  	struct tdp_iter iter;
> >  	struct kvm_mmu_page *sp;
> > +	gfn_t raw_gfn;
> > +	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
>
> Ditto. I wish we can have 'has_mirrored_private_pt'.
>
On 16/05/2024 1:20 pm, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 13:04 +1200, Huang, Kai wrote:
>>
>> I really don't see the difference between ...
>>
>> is_private_mem(gpa)
>>
>> ... and
>>
>> is_private_gpa(gpa)
>>
>> If it confuses me, it can confuse other people.
>
> Again, point taken. I'll try to think of a better name. Please share if you do.
>
>>
>> The point is there's really no need to distinguish the two. The GPA is
>> only meaningful when it refers to the memory that it points to.
>>
>> So far I am not convinced we need this helper, because such info we can
>> already get from:
>>
>> 1) fault->is_private;
>> 2) Xarray which records memtype for given GFN.
>>
>> So we should just get rid of it.
>
> Kai, can you go look through the dev branch a bit more before making the same
> point on every patch?
>
> kvm_is_private_gpa() is used to set PFERR_PRIVATE_ACCESS, which in turn sets
> fault->is_private. So you are saying we can use these other things that are
> dependent on it. Look at the other callers too.
Well, I think I didn't make myself clear.
I don't object to have this helper. If it helps, then we can have it.
My objection is the current implementation of it, because it is
*conceptually* wrong for SEV-SNP.
Btw, I just looked at the dev branch.
For the common code, it is used in kvm_tdp_mmu_map() and
kvm_tdp_mmu_fast_pf_get_last_sptep() to get whether a GPA is private.
As said above, I don't see why we need a helper with the "current
implementation" (which consults kvm_gfn_shared_mask()) for them. We can
just use fault->gfn + fault->is_private for that purpose.
It is also used in the TDX code, like the TDX variant of handle_ept_violation()
and tdx_vcpu_init_mem_region(). For them, to be honest, I don't quite care
whether a helper is used. We can have a helper if we have multiple callers,
but this helper should be in TDX code, not common MMU code.
On Thu, May 16, 2024 at 12:52:32PM +1200,
"Huang, Kai" <[email protected]> wrote:
> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > [...]
> > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > int (*sync_spte)(struct kvm_vcpu *vcpu,
> > struct kvm_mmu_page *sp, int i);
> > struct kvm_mmu_root_info root;
> > + hpa_t private_root_hpa;
>
> Should we have
>
> struct kvm_mmu_root_info private_root;
>
> instead?
Yes. And the private root allocation can be pushed down into TDP MMU.
> > [...]
> > @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > int r;
> > if (tdp_mmu_enabled) {
> > - kvm_tdp_mmu_alloc_root(vcpu);
> > + if (kvm_gfn_shared_mask(vcpu->kvm))
> > + kvm_tdp_mmu_alloc_root(vcpu, true);
>
> As mentioned in replies to other patches, I kinda prefer
>
> kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)
>
> Or we have a helper
>
> kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()
>
> > + kvm_tdp_mmu_alloc_root(vcpu, false);
> > return 0;
> > }
> > @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> > for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> > int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > - gfn_t base = gfn_round_for_level(fault->gfn,
> > + gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> > fault->max_level);
>
> I thought by reaching here the shared bit has already been stripped away by
> the caller?
>
> It doesn't make a lot of sense to still have it here, given we have a universal
> KVM-defined PFERR_PRIVATE_ACCESS flag:
>
> https://lore.kernel.org/kvm/[email protected]/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
>
> IMHO we should just strip the shared bit in the TDX variant of
> handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA doesn't
> have shared bit) to the common fault handler so it can correctly set
> fault->is_private to true.
Yes, this part should be dropped. Because we will have vCPU CPUID.MTRR=0 for
TDX in the long term, we can make kvm_mmu_honors_guest_mtrrs() always false.
Maybe kvm->arch.disabled_mtrr, or guest_cpuid_has(vcpu, X86_FEATURE_MTRR) =
false. We will enforce vcpu.CPUID.MTRR=false.
Guest MTRR=0 support can be independently addressed.
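A minimal sketch of that direction (kvm->arch.mtrr_disabled is a hypothetical
flag, not from this series; __kvm_mmu_honors_guest_mtrrs() is the existing
inner helper):

static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
{
	/* Hypothetical flag, e.g. set at VM creation for TDs w/ CPUID.MTRR=0 */
	if (kvm->arch.mtrr_disabled)
		return false;

	return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
}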
> > [...]
> > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> > void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_unload(vcpu);
> > + if (tdp_mmu_enabled) {
> > + read_lock(&vcpu->kvm->mmu_lock);
> > + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > + NULL);
> > + read_unlock(&vcpu->kvm->mmu_lock);
> > + }
>
> Hmm.. I don't quite like this, but sorry, I kinda forgot why we need to do
> this here.
>
> Could you elaborate?
>
> Anyway, from common code's perspective, we need to have some clarification
> why we design to do it here.
This should be cleaned up. It can be pushed down into kvm_tdp_mmu_alloc_root().
void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
    allocate shared root
    if (has_mirrored_pt)
        allocate private root
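In C, a rough sketch of that shape (kvm_has_mirrored_pt() and the bool-taking
inner helper are assumed names, not code from this series):

void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
	/* Allocate the private root first if this VM mirrors the private PT */
	if (kvm_has_mirrored_pt(vcpu->kvm))
		__kvm_tdp_mmu_alloc_root(vcpu, true);	/* private root */

	__kvm_tdp_mmu_alloc_root(vcpu, false);		/* shared root */
}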
> > [...]
> > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> > int r;
> > if (vcpu->arch.mmu->root_role.direct) {
> > - fault.gfn = fault.addr >> PAGE_SHIFT;
> > + /*
> > + * Things like memslots don't understand the concept of a shared
> > + * bit. Strip it so that the GFN can be used like normal, and the
> > + * fault.addr can be used when the shared bit is needed.
> > + */
> > + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>
> Again, I don't think it's necessary for fault.gfn to still have the shared bit
> here?
>
> This kinda usage is pretty much the reason I want to get rid of
> kvm_gfn_shared_mask().
We are going to add flags like has_mirrored_pt, and we have a root page table
iterator with types specified. I'll investigate how we can reduce (or
eliminate) those helper functions.
> > }
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index fae559559a80..8a64bcef9deb 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -91,7 +91,7 @@ struct tdp_iter {
> > tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > /* A pointer to the current SPTE */
> > tdp_ptep_t sptep;
> > - /* The lowest GFN mapped by the current SPTE */
> > + /* The lowest GFN (shared bits included) mapped by the current SPTE */
> > gfn_t gfn;
>
> IMHO we need more clarification of this design.
>
> We at least need to call out that the TDX hardware uses the 'GPA + shared bit'
> when it walks the page table for shared mappings, so we must set up the
> mapping at the GPA with the shared bit.
>
> E.g., because TDX hardware uses separate roots for shared/private mappings, I
> think it's a reasonable option for the TDX hardware to just use the actual GPA
> w/o shared bit when it walks the shared page table, and still report EPT
> violation with GPA with shared bit set.
>
> Such HW implementation is completely hidden from software, thus should be
> clarified in the changelog/comments.
Totally agree that it deserves documentation. I will update the design
document of the TDX TDP MMU to include it. This patch series doesn't include
it, though.
> > /* The level of the root page given to the iterator */
> > int root_level;
>
> [...]
>
> > [...]
> > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > struct kvm *kvm = vcpu->kvm;
> > struct tdp_iter iter;
> > struct kvm_mmu_page *sp;
> > + gfn_t raw_gfn;
> > + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
>
> Ditto. I wish we can have 'has_mirrored_private_pt'.
Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
--
Isaku Yamahata <[email protected]>
On Wed, 2024-05-15 at 18:48 -0700, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 12:52:32PM +1200,
> "Huang, Kai" <[email protected]> wrote:
>
> > On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > [...]
> > > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > > int (*sync_spte)(struct kvm_vcpu *vcpu,
> > > struct kvm_mmu_page *sp, int i);
> > > struct kvm_mmu_root_info root;
> > > + hpa_t private_root_hpa;
> >
> > Should we have
> >
> > struct kvm_mmu_root_info private_root;
> >
> > instead?
>
> Yes. And the private root allocation can be pushed down into TDP MMU.
Why?
>
[snip]
> > > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> > > void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > > {
> > > kvm_mmu_unload(vcpu);
> > > +	if (tdp_mmu_enabled) {
> > > +		read_lock(&vcpu->kvm->mmu_lock);
> > > +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > > +				   NULL);
> > > +		read_unlock(&vcpu->kvm->mmu_lock);
> > > +	}
> >
> > Hmm.. I don't quite like this, but sorry, I kinda forgot why we need to do
> > this here.
> >
> > Could you elaborate?
> >
> > Anyway, from common code's perspective, we need to have some clarification
> > why we design to do it here.
>
> This should be cleaned up. It can be pushed down into
> kvm_tdp_mmu_alloc_root().
>
> void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> allocate shared root
> if (has_mirrored_pt)
> allocate private root
>
Huh? This is kvm_mmu_destroy()...
>
> > > [...]
> > > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> > >  	int r;
> > >  	if (vcpu->arch.mmu->root_role.direct) {
> > > -		fault.gfn = fault.addr >> PAGE_SHIFT;
> > > +		/*
> > > +		 * Things like memslots don't understand the concept of a shared
> > > +		 * bit. Strip it so that the GFN can be used like normal, and the
> > > +		 * fault.addr can be used when the shared bit is needed.
> > > +		 */
> > > +		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > >  		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> >
> > Again, I don't think it's necessary for fault.gfn to still have the shared bit
> > here?
> >
> > This kinda usage is pretty much the reason I want to get rid of
> > kvm_gfn_shared_mask().
>
> We are going to flags like has_mirrored_pt and we have root page table
> iterator
> with types specified. I'll investigate how we can reduce (or eliminate)
> those helper functions.
Let's transition the abusers off and see what's left. I'm still waiting for an
explanation of why they are bad when used properly.
[snip]
>
> > > /* The level of the root page given to the iterator */
> > > int root_level;
> >
> > [...]
> >
> > > [...]
> > > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  	struct kvm *kvm = vcpu->kvm;
> > >  	struct tdp_iter iter;
> > >  	struct kvm_mmu_page *sp;
> > > +	gfn_t raw_gfn;
> > > +	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> >
> > Ditto. I wish we can have 'has_mirrored_private_pt'.
>
> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
Why not helpers that wrap vm_type like:
https://lore.kernel.org/kvm/[email protected]/
>>> @@ -470,6 +470,7 @@ struct kvm_mmu {
>>> int (*sync_spte)(struct kvm_vcpu *vcpu,
>>> struct kvm_mmu_page *sp, int i);
>>> struct kvm_mmu_root_info root;
>>> + hpa_t private_root_hpa;
>>
>> Should we have
>>
>> struct kvm_mmu_root_info private_root;
>>
>> instead?
>
> This corresponds to:
> mmu->root.hpa
>
> We don't need the other fields, so I think it's better to not take the space.
> It does look asymmetric though...
Being symmetric is why I asked. Anyway no strong opinion.
[...]
>>>
>>> @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>>>  	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
>>>  		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
>>>  			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
>>> -			gfn_t base = gfn_round_for_level(fault->gfn,
>>> +			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
>>>  							 fault->max_level);
>>
>> I thought by reaching here the shared bit has already been stripped away
>> by the caller?
>
> We don't support MTRRs, so this code won't be executed for TDX, but it's not
> clear what you are asking.
> fault->addr has the shared bit (if present); fault->gfn has it stripped.
When I was looking at the code, I thought fault->gfn still had the shared
bit, and that gpa_to_gfn() internally strips away the shared bit, but sorry,
that is not true.
My question is why do we even need this change? Shouldn't we pass the
actual GFN (which doesn't have the shared bit) to
kvm_mtrr_check_gfn_range_consistency()?
If so, looks we should use fault->gfn to get the base?
>
>>
>> It doesn't make a lot of sense to still have it here, given we have a
>> universal KVM-defined PFERR_PRIVATE_ACCESS flag:
>>
>> https://lore.kernel.org/kvm/[email protected]/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
>>
>> IMHO we should just strip the shared bit in the TDX variant of
>> handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA
>> doesn't have shared bit) to the common fault handler so it can correctly
>> set fault->is_private to true.
>
> I'm not sure what you are seeing here, could you elaborate?
See reply below.
[...]
>>
>> Anyway, from common code's perspective, we need to have some
>> clarification why we design to do it here.
>>
>>> [...]
>>> @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
>>>  	int r;
>>>  
>>>  	if (vcpu->arch.mmu->root_role.direct) {
>>> -		fault.gfn = fault.addr >> PAGE_SHIFT;
>>> +		/*
>>> +		 * Things like memslots don't understand the concept of a shared
>>> +		 * bit. Strip it so that the GFN can be used like normal, and the
>>> +		 * fault.addr can be used when the shared bit is needed.
>>> +		 */
>>> +		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
>>>  		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>>
>> Again, I don't think it's necessary for fault.gfn to still have the shared
>> bit here?
>
It's getting stripped as it's set for the first time... What do you mean it
still has it?
Sorry, I meant fault->addr.
>
>>
>> This kinda usage is pretty much the reason I want to get rid of
>> kvm_gfn_shared_mask().
>
> I think you want to move it to an x86_op, right? Not get rid of the concept of
> a shared bit? I think KVM will have a hard time doing TDX without knowing the
> shared bit's location.
>
> Or maybe you are saying you think it should be stripped earlier and live as a
> PF error code?
I meant it seems we should just strip the shared bit away from the GPA in
handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
won't have the shared bit.
Do you see any problem with doing so?
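To illustrate, a rough sketch of what I mean (tdexit_gpa() and the error-code
plumbing are assumed here, not code from this series):

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
	gpa_t gpa = tdexit_gpa(vcpu);	/* assumed accessor for the exit GPA */
	u64 error_code = 0;		/* plus bits derived from the exit qualification */

	if (kvm_is_private_gpa(vcpu->kvm, gpa))
		error_code |= PFERR_PRIVATE_ACCESS;
	else
		gpa &= ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));

	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}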
>
>>
>>> }
>>>
>>> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
>>> index fae559559a80..8a64bcef9deb 100644
>>> --- a/arch/x86/kvm/mmu/tdp_iter.h
>>> +++ b/arch/x86/kvm/mmu/tdp_iter.h
>>> @@ -91,7 +91,7 @@ struct tdp_iter {
>>>  	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>>>  	/* A pointer to the current SPTE */
>>>  	tdp_ptep_t sptep;
>>> -	/* The lowest GFN mapped by the current SPTE */
>>> +	/* The lowest GFN (shared bits included) mapped by the current SPTE */
>>>  	gfn_t gfn;
>>
>> IMHO we need more clarification of this design.
>
> Have you seen the documentation patch? Where do you think it should be? You mean
> in the tdp_iter struct?
My thinking:
The changelog should clarify why the shared bit is included in 'gfn' in
tdp_iter.
And here, around 'gfn', we can have a simple sentence to explain why the
shared bit is included.
>>>> + gfn_t raw_gfn;
>>>> + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
>>>
>>> Ditto. I wish we can have 'has_mirrored_private_pt'.
>>
>> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
>
> Why not helpers that wrap vm_type like:
> https://lore.kernel.org/kvm/[email protected]/
I am fine with any of them -- boolean (with either name) or helper.
On Wed, 2024-05-15 at 16:56 -0700, Rick Edgecombe wrote:
>
> If we think it is not a security issue, and we don't even know if it can be
> hit for TDX, then I'd be inclined to go with (a). Especially since we are
> just aiming for the most basic support, and don't have to worry about
> regressions in the classical sense.
>
> I'm not sure how easy it will be to root cause it at this point. Hopefully Yan
> will be coming online soon. She mentioned some previous Intel effort to
> investigate it. Presumably we would have to start with the old kernel that
> exhibited the issue. If it can still be found...
>
Weijiang filled me in. It sounds like a really tough one, as he described: "From
my experience, to reproduce the issue, it requires specific HW config + specific
SW + specific operations, I almost died on it"
If it shows up in TDX, we might get lucky with an easier reproducer...
On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
>
> I meant it seems we should just strip the shared bit away from the GPA in
> handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> won't have the shared bit.
>
> Do you see any problem with doing so?
We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
In the past I did something like the private/shared split, but for execute-only
aliases and a few other wacky things.
It also had a synthetic error code. For a while I had it so the GPA had alias
bits (i.e. shared bit) not stripped, like TDX has today, but there was always some
code that got surprised by the extra bits in the GPA. I want to say it was the
emulation of PAE or something like that (execute-only had to support all the
normal VM stuff).
So in the later revisions I actually had a helper to take a GFN and PF error
code and put the alias bits back in. Then alias bits got stripped immediately
and at the same time the synthetic error code was set. Something similar could
probably work to recreate "raw_gfn" from a fault.
IIRC (and I could easily be wrong), when I discussed this with Sean he said TDX
didn't need to support whatever issue I was working around, and the original
solution was slightly better for TDX.
In any case, I doubt Sean is wedded to a remark he may or may not have made long
ago. But looking at the TDX code today, it doesn't feel that confusing to me.
So I'm not against adding the shared bits back in later, but it doesn't seem
that big of a gain to me. It also has kind of been tried before a long time ago.
>
> >
> > >
> > > > }
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > > > index fae559559a80..8a64bcef9deb 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > > > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > > > @@ -91,7 +91,7 @@ struct tdp_iter {
> > > > tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > > > /* A pointer to the current SPTE */
> > > > tdp_ptep_t sptep;
> > > > -	/* The lowest GFN mapped by the current SPTE */
> > > > +	/* The lowest GFN (shared bits included) mapped by the current SPTE */
> > > >  	gfn_t gfn;
> > >
> > > IMHO we need more clarification of this design.
> >
> > Have you seen the documentation patch? Where do you think it should be? You
> > mean
> > in the tdp_iter struct?
>
> My thinking:
>
> The changelog should clarify why the shared bit is included in 'gfn' in
> tdp_iter.
>
> And here, around 'gfn', we can have a simple sentence to explain why the
> shared bit is included.
Doesn't seem unreasonable.
On Thu, May 16, 2024 at 07:56:18AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-05-15 at 15:47 -0700, Sean Christopherson wrote:
> > > I didn't gather there was any proof of this. Did you have any hunch either
> > > way?
> >
> > I doubt the guest was able to access memory it shouldn't have been able to
> > access.
> > But that's a moot point, as the bigger problem is that, because we have no
> > idea what's at fault, KVM can't make any guarantees about the safety of
> > such a flag.
> >
> > TDX is a special case where we don't have a better option (we do have other
> > options, they're just horrible). In other words, the choice is essentially
> > to either:
> >
> >  (a) cross our fingers and hope that the problem is limited to shared memory
> >      with QEMU+VFIO, i.e. doesn't affect TDX private memory.
> >
> > or
> >
> >  (b) don't merge TDX until the original regression is fully resolved.
> >
> > FWIW, I would love to root cause and fix the failure, but I don't know how
> > feasible that is at this point.
Me too. So curious about what's exactly broken.
>
> If we think it is not a security issue, and we don't even know if it can be hit
> for TDX, then I'd be inclined to go with (a). Especially since we are just
> aiming for the most basic support, and don't have to worry about regressions in
> the classical sense.
>
> I'm not sure how easy it will be to root cause it at this point. Hopefully Yan
> will be coming online soon. She mentioned some previous Intel effort to
> investigate it. Presumably we would have to start with the old kernel that
> exhibited the issue. If it can still be found...
I tried to reproduce it under direction from Weijiang, though my NVIDIA card
was a little different from the one Weijiang used.
However, I failed. I'm not sure whether it was because I did it remotely or
because I didn't spend enough time (since it's not an official task assigned
to me and I just did it out of curiosity).
If you think it's worthwhile, I would like to try again locally to see if I
will be lucky enough to reproduce and root-cause it.
But is it possible to not have TDX pending on this bug/regression?
On Thu, May 16, 2024 at 01:40:41PM +1200, Huang, Kai wrote:
>
>
> On 16/05/2024 1:20 pm, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 13:04 +1200, Huang, Kai wrote:
> > >
> > > I really don't see the difference between ...
> > >
> > >         is_private_mem(gpa)
> > >
> > > ... and
> > >
> > >         is_private_gpa(gpa)
> > >
> > > If it confuses me, it can confuse other people.
> >
> > Again, point taken. I'll try to think of a better name. Please share if you do.
> >
> > >
> > > The point is there's really no need to distinguish the two. The GPA is
> > > only meaningful when it refers to the memory that it points to.
> > >
> > > So far I am not convinced we need this helper, because such info we can
> > > already get from:
> > >
> > >    1) fault->is_private;
> > >    2) Xarray which records memtype for given GFN.
> > >
> > > So we should just get rid of it.
> >
> > Kai, can you go look through the dev branch a bit more before making the same
> > point on every patch?
> >
> > kvm_is_private_gpa() is used to set PFERR_PRIVATE_ACCESS, which in turn sets
> > fault->is_private. So you are saying we can use these other things that are
> > dependent on it. Look at the other callers too.
>
> Well, I think I didn't make myself clear.
>
> I don't object to have this helper. If it helps, then we can have it.
>
> My objection is the current implementation of it, because it is
> *conceptually* wrong for SEV-SNP.
>
> Btw, I just looked at the dev branch.
>
> For the common code, it is used in kvm_tdp_mmu_map() and
> kvm_tdp_mmu_fast_pf_get_last_sptep() to get whether a GPA is private.
>
> As said above, I don't see why we need a helper with the "current
> implementation" (which consults kvm_gfn_shared_mask()) for them. We can
> just use fault->gfn + fault->is_private for that purpose.
What about a name like kvm_is_private_and_mirrored_gpa()?
Only TDX's private memory is mirrored and the common code needs a way to
tell that.
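A sketch of the shape I have in mind (the implementation is assumed, not
proposed code):

static inline bool kvm_is_private_and_mirrored_gpa(struct kvm *kvm, gpa_t gpa)
{
	/* Only VMs with a shared GFN mask (i.e. TDX) mirror the private PT. */
	return kvm_gfn_shared_mask(kvm) &&
	       !(gpa & gfn_to_gpa(kvm_gfn_shared_mask(kvm)));
}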
> It is also used in the TDX code, like the TDX variant of handle_ept_violation()
> and tdx_vcpu_init_mem_region(). For them, to be honest, I don't quite care
> whether a helper is used. We can have a helper if we have multiple callers,
> but this helper should be in TDX code, not common MMU code.
>
On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> >
> > I meant it seems we should just strip the shared bit away from the GPA in
> > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> > won't have the shared bit.
> >
> > Do you see any problem with doing so?
>
> We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
I don't see any big difference?
Now in this patch the raw_gfn is directly from fault->addr:
raw_gfn = gpa_to_gfn(fault->addr);
tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
...
}
But there's nothing wrong with getting the raw_gfn from fault->gfn. In
fact, the zapping code just does this:
/*
* start and end doesn't have GFN shared bit. This function zaps
* a region including alias. Adjust shared bit of [start, end) if
* the root is shared.
*/
start = kvm_gfn_for_root(kvm, root, start);
end = kvm_gfn_for_root(kvm, root, end);
So there's nothing wrong with just doing the same thing in both functions.
The point is fault->gfn has shared bit stripped away at the beginning, and
AFAICT there's no useful reason to keep shared bit in fault->addr. The
entire @fault is a temporary structure on the stack during fault handling
anyway.
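Concretely, the map path could derive the raw GFN from the stripped fault->gfn
the same way (a sketch, mirroring the kvm_gfn_for_root() logic above):

	raw_gfn = fault->gfn;
	if (!is_private)
		raw_gfn |= kvm_gfn_shared_mask(kvm);

	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
		/* handle the fault as before */
	}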
>
> [...]
>
> So I'm not against adding the shared bits back in later, but it doesn't seem
> that big of a gain to me. It also has kind of been tried before a long time ago.
As mentioned above, we are already doing that anyway in the zapping code
path.
> [...]
Btw, another thing, on second thought:
So regardless of how to implement it in KVM, IIUC TDX hardware requires the
below two operations to have the shared bit set in the GPA for shared mappings:
1) Setup/teardown of shared page table mappings
2) GPA range in TLB flush for shared mappings
(I kinda forgot the TLB flush part, so better to double check, but I guess I
am >90% sure about it.)
So in the fault handler path, we actually need to be careful of the GFN
passed to relevant functions, because for other operations like finding
memslot based on GFN, we must pass the GFN w/o shared bit.
Now the tricky thing is, due to 1), the 'tdp_iter->gfn' is set to the
"raw_gfn" with the shared bit in order to find the correct SPTE in the fault
handler path. And as a result, the current implementation sets the sp->gfn
to the "raw_gfn" too.
sp = tdp_mmu_alloc_sp(vcpu);
...
tdp_mmu_init_child_sp(sp, &iter);
The problem is that in the current KVM implementation, iter->gfn and sp->gfn
are used in both cases: 1) page table walk and TLB flush; 2) others like
memslot lookup.
So the result is we need to be very careful about whether we should strip the
shared bit away when using them.
E.g., looking at the current dev branch, if I am reading the code correctly,
it seems we have a bug around here:
static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault,
struct tdp_iter *iter)
{
...
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL,
iter->gfn, fault->pfn, iter->old_spte,
fault->prefetch, true,
fault->map_writable, &new_spte);
...
}
See, @iter->gfn (which is "raw_gfn" AFAICT) is passed to both
make_mmio_spte() and make_spte(). But AFAICT both of the two functions treat
the GFN as the actual GFN. E.g.,
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
               const struct kvm_memory_slot *slot,
               unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
               u64 old_spte, bool prefetch, bool can_unsync,
               bool host_writable, u64 *new_spte)
{
        ...
        if (shadow_memtype_mask)
                spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
                                                         kvm_is_mmio_pfn(pfn));
        ...
        if ((spte & PT_WRITABLE_MASK) &&
            kvm_slot_dirty_track_enabled(slot)) {
                /* Enforced by kvm_mmu_hugepage_adjust. */
                WARN_ON_ONCE(level > PG_LEVEL_4K);
                mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
        }
        ...
}
AFAICT @gfn in both kvm_x86_get_mt_mask() and mark_page_dirty_in_slot() needs
to be the actual GFN. They may not be a concern for TDX now, but I think
it's logically wrong to use the raw GFN.
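(For illustration only: one local fix would be to strip the bit at such call
sites with the series' kvm_gfn_to_private() helper, e.g.:
        wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL,
                           kvm_gfn_to_private(vcpu->kvm, iter->gfn),
                           fault->pfn, iter->old_spte, fault->prefetch,
                           true, fault->map_writable, &new_spte);
though that leaves the burden on every caller, which is the point below.)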
This kind of issue is hard to catch in code writing and review. I am
wondering whether we should have a clearer way to avoid such issues.
The idea is to add a new 'raw_gfn' to @tdp_iter and 'kvm_mmu_page'. When we
walk a GFN range using the iter, we always use the "actual GFN" w/o the
shared bit. Like:
        tdp_mmu_for_each_pte(kvm, iter, mmu, is_private, gfn, gfn + 1) {
                ...
        }
But in the tdp_iter_*() functions, we internally calculate the "raw_gfn"
using the "actual GFN" + the 'kvm', and we use the "raw_gfn" to walk the
page table to find the correct SPTE.
So the end code will be: 1) explicitly use iter->raw_gfn for page table walks
and TLB flushes; 2) for all others like memslot lookup, use iter->gfn.
(sp->gfn and sp->raw_gfn can be used similarly, e.g., sp->raw_gfn is used
for TLB flush, and for others like memslot lookup we use sp->gfn.)
I think the code will be clearer this way?
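A minimal sketch of that idea, with hypothetical member names (only
tdp_iter_refresh_sptep() is the real function; the raw_gfn plumbing is
assumed):
        struct tdp_iter {
                ...
                gfn_t gfn;      /* actual GFN, shared bit stripped */
                gfn_t raw_gfn;  /* GFN as walked in the page tables */
                ...
        };
        static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
        {
                /* Walk the page table by raw_gfn, not gfn. */
                iter->sptep = iter->pt_path[iter->level - 1] +
                              SPTE_INDEX(iter->raw_gfn << PAGE_SHIFT,
                                         iter->level);
        }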
On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
> On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > >
> > > I meant it seems we should just strip shared bit away from the GPA in
> > > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> > > won't have the shared bit.
> > >
> > > Do you see any problem of doing so?
> >
> > We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
>
> I don't see any big difference?
>
> Now in this patch the raw_gfn is directly from fault->addr:
>
> raw_gfn = gpa_to_gfn(fault->addr);
>
> tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
> ...
> }
>
> But there's nothing wrong to get the raw_gfn from the fault->gfn. In
> fact, the zapping code just does this:
>
> /*
> * start and end doesn't have GFN shared bit. This function zaps
> * a region including alias. Adjust shared bit of [start, end) if
> * the root is shared.
> */
> start = kvm_gfn_for_root(kvm, root, start);
> end = kvm_gfn_for_root(kvm, root, end);
>
> So there's nothing wrong to just do the same thing in both functions.
>
> The point is fault->gfn has shared bit stripped away at the beginning, and
> AFAICT there's no useful reason to keep shared bit in fault->addr. The
> entire @fault is a temporary structure on the stack during fault handling
> anyway.
I would like to avoid code churn at this point if there is not a real clear
benefit.
One small benefit of keeping the shared bit in the fault->addr is that it is
sort of consistent with how that field is used in other scenarios in KVM. In
shadow paging it's not even the GPA. So it is simply the "fault address" and has
to be interpreted in different ways in the fault handler. For TDX the fault
address *does* include the shared bit. And the EPT needs to be faulted in at
that address.
If we strip the shared bit when setting fault->addr we have to reconstruct it
when we do the actual shared mapping. There is no way around that. Which
helper does it isn't important, I think. Doing the reconstruction inside
tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
shared bit position.
The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
without the shared bit. It's not stripped and then added back. Those are
operations that target GFNs really.
I think the real problem is that we are gleaning whether the fault is to private
or shared memory from different things. Sometimes from fault->is_private,
sometimes the presence of the shared bits, and sometimes the role bit. I think
this is confusing, doubly so because we are using some of these things to infer
unrelated things (mirrored vs private).
My guess is that you have noticed this and somehow zeroed in on the shared_mask.
I think we should straighten out the mirrored/private semantics and see what the
results look like. How does that sound to you?
On Thu, May 16, 2024 at 02:00:32AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Wed, 2024-05-15 at 18:48 -0700, Isaku Yamahata wrote:
> > On Thu, May 16, 2024 at 12:52:32PM +1200,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > Allocate mirrored page table for the private page table and implement MMU
> > > > hooks to operate on the private page table.
> > > >
> > > > To handle page fault to a private GPA, KVM walks the mirrored page table
> > > > in
> > > > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > > > changes from the mirrored page table to private page table.
> > > >
> > > >          private KVM page fault  |
> > > >                  |               |
> > > >                  V               |
> > > >             private GPA          |     CPU protected EPTP
> > > >                  |               |            |
> > > >                  V               |            V
> > > >          mirrored PT root        |     private PT root
> > > >                  |               |            |
> > > >                  V               |            V
> > > >           mirrored PT --hook to propagate-->private PT
> > > >                  |               |            |
> > > >                  \---------------+------\     |
> > > >                                  |      |     |
> > > >                                  |      V     V
> > > >                                  |  private guest page
> > > >                                  |
> > > >                                  |
> > > >        non-encrypted memory      |    encrypted memory
> > > >                                  |
> > > >
> > > > PT: page table
> > > > Private PT: the CPU uses it, but it is invisible to KVM. TDX module
> > > > manages
> > > > this table to map private guest pages.
> > > > Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> > > > to propagate PT change to the actual private PT.
> > > >
> > > > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > > > can be modified atomically with mmu_lock held for read, however, the MMU
> > > > hooks to the private page table are not atomic operations.
> > > >
> > > > To address it, a special REMOVED_SPTE is introduced and below sequence is
> > > > used when mirrored SPTEs are updated atomically.
> > > >
> > > > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > > > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> > > > following steps.
> > > > 3. Invoke MMU hooks to modify private page table with the target value.
> > > > 4. (a) On hook succeeds, update mirrored SPTE to target value.
> > > > (b) On hook failure, restore mirrored SPTE to original value.
> > > >
> > > > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> > > >
> > > > This sequence also applies when SPTEs are atomically updated from
> > > > non-present to present in order to prevent potential conflicts when
> > > > multiple vCPUs attempt to set private SPTEs to a different page size
> > > > simultaneously, though 4K page size is only supported for private page
> > > > table currently.
> > > >
> > > > 2M page support can be done in future patches.
> > > >
> > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > Co-developed-by: Kai Huang <[email protected]>
> > > > Signed-off-by: Kai Huang <[email protected]>
> > > > Co-developed-by: Yan Zhao <[email protected]>
> > > > Signed-off-by: Yan Zhao <[email protected]>
> > > > Co-developed-by: Rick Edgecombe <[email protected]>
> > > > Signed-off-by: Rick Edgecombe <[email protected]>
> > > > ---
> > > > TDX MMU Part 1:
> > > > - Remove unnecessary gfn, access twist in
> > > > tdp_mmu_map_handle_target_level(). (Chao Gao)
> > > > - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> > > > tdp_mmu_alloc_sp()
> > > > - Update comment in set_private_spte_present() (Yan)
> > > > - Open code call to kvm_mmu_init_private_spt() (Yan)
> > > > - Add comments on TDX MMU hooks (Yan)
> > > > - Fix various whitespace alignment (Yan)
> > > > - Remove pointless warnings and conditionals in
> > > > handle_removed_private_spte() (Yan)
> > > > - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> > > > - Remove incorrect comment in handle_changed_spte() (Yan)
> > > > - Remove unneeded kvm_pfn_to_refcounted_page() and
> > > > is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> > > > - Do kvm_gfn_for_root() branchless (Rick)
> > > > - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> > > > - Add comment for stripping shared bit for fault.gfn (Chao)
> > > >
> > > > v19:
> > > > - drop CONFIG_KVM_MMU_PRIVATE
> > > >
> > > > v18:
> > > > - Rename freezed => frozen
> > > >
> > > > v14 -> v15:
> > > > - Refined is_private condition check in kvm_tdp_mmu_map().
> > > > Add kvm_gfn_shared_mask() check.
> > > > - catch up for struct kvm_range change
> > > > ---
> > > > arch/x86/include/asm/kvm-x86-ops.h | 5 +
> > > > arch/x86/include/asm/kvm_host.h | 25 +++
> > > > arch/x86/kvm/mmu/mmu.c | 13 +-
> > > > arch/x86/kvm/mmu/mmu_internal.h | 19 +-
> > > > arch/x86/kvm/mmu/tdp_iter.h | 2 +-
> > > > arch/x86/kvm/mmu/tdp_mmu.c | 269 +++++++++++++++++++++++++----
> > > > arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> > > > 7 files changed, 293 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h
> > > > b/arch/x86/include/asm/kvm-x86-ops.h
> > > > index 566d19b02483..d13cb4b8fce6 100644
> > > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> > > > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > > > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > > > KVM_X86_OP(load_mmu_pgd)
> > > > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > > > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > > > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > > > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > > > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> > > > KVM_X86_OP(has_wbinvd_exit)
> > > > KVM_X86_OP(get_l2_tsc_offset)
> > > > KVM_X86_OP(get_l2_tsc_multiplier)
> > > > diff --git a/arch/x86/include/asm/kvm_host.h
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index d010ca5c7f44..20fa8fa58692 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > > > int (*sync_spte)(struct kvm_vcpu *vcpu,
> > > > struct kvm_mmu_page *sp, int i);
> > > > struct kvm_mmu_root_info root;
> > > > + hpa_t private_root_hpa;
> > >
> > > Should we have
> > >
> > > struct kvm_mmu_root_info private_root;
> > >
> > > instead?
> >
> > Yes. And the private root allocation can be pushed down into TDP MMU.
>
> Why?
Because only the TDP MMU supports mirrored PTs, the change to the root PT
allocation will be contained in the TDP MMU. Also it will be symmetric to
kvm_mmu_destroy() and kvm_tdp_mmu_destroy().
> [snip]
> > > > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> > > > void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > > > {
> > > > kvm_mmu_unload(vcpu);
> > > > + if (tdp_mmu_enabled) {
> > > > + read_lock(&vcpu->kvm->mmu_lock);
> > > > + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > > > + NULL);
> > > > + read_unlock(&vcpu->kvm->mmu_lock);
> > > > + }
> > >
> > > Hmm.. I don't quite like this, but sorry I kinda forgot why we need to do
> > > this here.
> > >
> > > Could you elaborate?
> > >
> > > Anyway, from common code's perspective, we need to have some clarification
> > > why we design to do it here.
> >
> > This should be cleaned up. It can be pushed down into
> > kvm_tdp_mmu_alloc_root().
> >
> > void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > allocate shared root
> > if (has_mirrored_pt)
> > allocate private root
> >
>
> Huh? This is kvm_mmu_destroy()...
> > > > free_mmu_pages(&vcpu->arch.root_mmu);
> > > > free_mmu_pages(&vcpu->arch.guest_mmu);
> > > > mmu_free_memory_caches(vcpu);
> > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > > > b/arch/x86/kvm/mmu/mmu_internal.h
> > > > index 0f1a9d733d9e..3a7fe9261e23 100644
> > > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > > @@ -6,6 +6,8 @@
> > > > #include <linux/kvm_host.h>
> > > > #include <asm/kvm_host.h>
> > > > +#include "mmu.h"
> > > > +
> > > > #ifdef CONFIG_KVM_PROVE_MMU
> > > > #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> > > > #else
> > > > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
> > > > sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> > > > }
> > > > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page
> > > > *root,
> > > > + gfn_t gfn)
> > > > +{
> > > > + gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > > > +
> > > > + /* Set shared bit if not private */
> > > > > + gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
> > > > + return gfn_for_root;
> > > > +}
> > > > +
> > > > static inline bool kvm_mmu_page_ad_need_write_protect(struct
> > > > kvm_mmu_page *sp)
> > > > {
> > > > /*
> > > > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> > > > int r;
> > > > if (vcpu->arch.mmu->root_role.direct) {
> > > > - fault.gfn = fault.addr >> PAGE_SHIFT;
> > > > + /*
> > > > + * Things like memslots don't understand the concept of a shared
> > > > + * bit. Strip it so that the GFN can be used like normal, and the
> > > > + * fault.addr can be used when the shared bit is needed.
> > > > + */
> > > > + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> > > > fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > >
> > > Again, I don't think it's necessary for fault.gfn to still have the shared bit
> > > here?
> > >
> > > This kinda usage is pretty much the reason I want to get rid of
> > > kvm_gfn_shared_mask().
> >
> > We are going to add flags like has_mirrored_pt, and we have a root page
> > table iterator with types specified. I'll investigate how we can reduce
> > (or eliminate) those helper functions.
>
> Let's transition the abusers off and see what's left. I'm still waiting for an
> explanation of why they are bad when used properly.
Sure. Let's untangle things one by one.
> [snip]
> >
> > > > /* The level of the root page given to the iterator */
> > > > int root_level;
> > >
> > > [...]
> > >
> > > > for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > > > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct
> > > > kvm_vcpu *vcpu,
> > > > new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> > > > else
> > > > wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter-
> > > > >gfn,
> > > > - fault->pfn, iter->old_spte,
> > > > fault->prefetch, true,
> > > > - fault->map_writable, &new_spte);
> > > > + fault->pfn, iter->old_spte, fault-
> > > > >prefetch, true,
> > > > + fault->map_writable, &new_spte);
> > > > if (new_spte == iter->old_spte)
> > > > ret = RET_PF_SPURIOUS;
> > > > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> > > > kvm_page_fault *fault)
> > > > struct kvm *kvm = vcpu->kvm;
> > > > struct tdp_iter iter;
> > > > struct kvm_mmu_page *sp;
> > > > + gfn_t raw_gfn;
> > > > + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> > >
> > > Ditto. I wish we can have 'has_mirrored_private_pt'.
> >
> > Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
>
> Why not helpers that wrap vm_type like:
> https://lore.kernel.org/kvm/[email protected]/
I followed the existing way. Anyway I'm fine with either way.
--
Isaku Yamahata <[email protected]>
On Thu, May 16, 2024 at 01:21:40PM +1200,
"Huang, Kai" <[email protected]> wrote:
> On 16/05/2024 12:15 pm, Isaku Yamahata wrote:
> > On Thu, May 16, 2024 at 10:17:50AM +1200,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > On 16/05/2024 4:22 am, Isaku Yamahata wrote:
> > > > On Wed, May 15, 2024 at 08:34:37AM -0700,
> > > > Sean Christopherson <[email protected]> wrote:
> > > >
> > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > index d5cf5b15a10e..808805b3478d 100644
> > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > > > > > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > > > > > - if (tdp_mmu_enabled)
> > > > > > + if (tdp_mmu_enabled) {
> > > > > > + /*
> > > > > > + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > > > > > + * type was changed. TDX can't handle zapping the private
> > > > > > + * mapping, but it's ok because KVM doesn't support either of
> > > > > > + * those features for TDX. In case a new caller appears, BUG
> > > > > > + * the VM if it's called for solutions with private aliases.
> > > > > > + */
> > > > > > + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> > > > >
> > > > > Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX". Using a
> > > > > generic name quite obviously doesn't prevent TDX details from bleeding into common
> > > > > code, and dancing around things just makes it all unnecessarily confusing.
> > > > >
> > > > > If we can't avoid bleeding TDX details into common code, my vote is to bite the
> > > > > bullet and simply check vm_type.
> > > >
> > > > TDX has several aspects related to the TDP MMU.
> > > > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > > > (private-vs-shared)
> > > > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
> > > > load/store. TDP MMU needs hooks for it.
> > > > 3) The tables must be zapped from the leaf. not the root or the middle.
> > > >
> > > > For 1) and 2), what about something like this? TDX backend code will set
> > > > kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
> > > > for address conversion (shared<->private).
> > > >
> > > > For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> > > > (or whatever preferable name)?
> > > >
> > > > For 3), flag of memslot handles it.
> > > >
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index aabf1648a56a..218b575d24bd 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1289,6 +1289,7 @@ struct kvm_arch {
> > > > u8 vm_type;
> > > > bool has_private_mem;
> > > > bool has_protected_state;
> > > > + bool has_mirrored_pt;
> > > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > > struct list_head active_mmu_pages;
> > > > struct list_head zapped_obsolete_pages;
> > > > @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> > > > #ifdef CONFIG_KVM_PRIVATE_MEM
> > > > #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> > > > +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> > > > #else
> > > > #define kvm_arch_has_private_mem(kvm) false
> > > > +#define kvm_arch_has_mirrored_pt(kvm) false
> > > > #endif
> > > > static inline u16 kvm_read_ldt(void)
> > >
> > > I think this 'has_mirrored_pt' (or a better name) is better, because it
> > > clearly conveys it is for the "page table", but not the actual page that any
> > > page table entry maps to.
> > >
> > > AFAICT we need to split the concept of "private page table itself" and the
> > > "memory type of the actual GFN".
> > >
> > > E.g., both SEV-SNP and TDX has concept of "private memory" (obviously), but
> > > I was told only TDX uses a dedicated private page table which isn't directly
> > > accessible for KVM. SEV-SNP on the other hand just uses a normal page table +
> > > an additional HW-managed table to ensure security.
> >
> > kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt or
> > need_callback or whatever makes sense.
> >
> >
> > > In other words, I think we should decide whether to invoke TDP MMU callback
> > > for private mapping (the page table itself may just be normal one) depending
> > > on the fault->is_private, but not whether the page table is private:
> > >
> > > if (fault->is_private && kvm_x86_ops->set_private_spte)
> > >         kvm_x86_set_private_spte(...);
> > > else
> > >         tdp_mmu_set_spte_atomic(...);
> >
> > This doesn't work for two reasons.
> >
> > - We need to pass down struct kvm_page_fault fault deep only for this.
> > We could change the code in such way.
> >
> > - We don't have struct kvm_page_fault fault for zapping case.
> > We could create a dummy one and pass it around.
>
> For both above, we don't necessarily need the whole 'kvm_page_fault', we
> just need:
>
> 1) GFN
> 2) Whether it is private (points to private memory to be precise)
> 3) whether to use a separate private page table.
Ok, so you suggest passing around necessary info (if missing) somehow.
> > Essentially the issue is how to pass down is_private or stash the info
> > somewhere or determine it somehow. Options I think of are
> >
> > - Pass around fault:
> > Con: fault isn't passed down
> > Con: Create fake fault for zapping case
> > - Stash it in struct tdp_iter and pass around iter:
> > Pro: work for zapping case
> > Con: we need to change the code to pass down tdp_iter
> > - Pass around is_private (or mirrored_pt or whatever):
> > Pro: Don't need to add member to some structure
> > Con: We need to pass it around still.
> > - Stash it in kvm_mmu_page:
> > The patch series uses kvm_mmu_page.role.
> > Pro: We don't need to pass around because we know struct kvm_mmu_page
> > Con: Need to twist root page allocation
>
> I don't think using kvm_mmu_page.role is correct.
>
> If kvm_mmu_page.role is private, we definitely can assume the faulting
> address is private; but otherwise the address can be both private or shared.
What do you mean by the last sentence? For example, do you mean memslot
deletion? In that case, we need the GPA with the shared bit for the shared
PT, and the GPA without the shared bit for the mirrored/private PT. Or do
you mean something else?
> > - Use gfn. kvm_is_private_gfn(kvm, gfn):
> > Con: The use of gfn is confusing. It's too TDX specific.
> >
> >
> > > And the 'has_mirrored_pt' should be only used to select the root of the page
> > > table that we want to operate on.
> >
> > We can add one more bool to struct kvm_page_fault.follow_mirrored_pt or
> > something to represent it. We can initialize it in __kvm_mmu_do_page_fault().
> >
> > .follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa);
> >
> >
> > > This also gives a chance that if there's anything special needs to be done
> > > for page allocated for the "non-leaf" middle page table for SEV-SNP, it can
> > > just fit.
> >
> > Can you please elaborate on this?
>
> I meant SEV-SNP may have its own version of link_private_spt().
>
> I haven't looked into it, and it may not needed from hardware's perspective,
> but providing such chance certainly doesn't hurt and is more flexible IMHO.
It doesn't need TDP MMU hooks.
--
Isaku Yamahata <[email protected]>
On Thu, May 16, 2024 at 04:36:48PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
> > On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > > >
> > > > I meant it seems we should just strip shared bit away from the GPA in
> > > > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> > > > won't have the shared bit.
> > > >
> > > > Do you see any problem of doing so?
> > >
> > > We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
> >
> > I don't see any big difference?
> >
> > Now in this patch the raw_gfn is directly from fault->addr:
> >
> > raw_gfn = gpa_to_gfn(fault->addr);
> >
> > tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
> > ...
> > }
> >
> > But there's nothing wrong to get the raw_gfn from the fault->gfn. In
> > fact, the zapping code just does this:
> >
> > /*
> > * start and end doesn't have GFN shared bit. This function zaps
> > * a region including alias. Adjust shared bit of [start, end) if
> > * the root is shared.
> > */
> > start = kvm_gfn_for_root(kvm, root, start);
> > end = kvm_gfn_for_root(kvm, root, end);
> >
> > So there's nothing wrong to just do the same thing in both functions.
> >
> > The point is fault->gfn has shared bit stripped away at the beginning, and
> > AFAICT there's no useful reason to keep shared bit in fault->addr. The
> > entire @fault is a temporary structure on the stack during fault handling
> > anyway.
>
> I would like to avoid code churn at this point if there is not a real clear
> benefit.
>
> One small benefit of keeping the shared bit in the fault->addr is that it is
> sort of consistent with how that field is used in other scenarios in KVM. In
> shadow paging it's not even the GPA. So it is simply the "fault address" and has
> to be interpreted in different ways in the fault handler. For TDX the fault
> address *does* include the shared bit. And the EPT needs to be faulted in at
> that address.
>
> If we strip the shared bit when setting fault->addr we have to reconstruct it
> when we do the actual shared mapping. There is no way around that. Which
> helper does it isn't important, I think. Doing the reconstruction inside
> tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
> shared bit position.
>
> The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
> without the shared bit. It's not stripped and then added back. Those are
> operations that target GFNs really.
>
> I think the real problem is that we are gleaning whether the fault is to private
> or shared memory from different things. Sometimes from fault->is_private,
> sometimes the presence of the shared bits, and sometimes the role bit. I think
> this is confusing, doubly so because we are using some of these things to infer
> unrelated things (mirrored vs private).
It's confusing that we don't check it in a uniform way.
> My guess is that you have noticed this and somehow zeroed in on the shared_mask.
> I think we should straighten out the mirrored/private semantics and see what the
> results look like. How does that sound to you?
I had a closer look at the related code. I think we can (mostly) uniformly
use gpa/gfn without the shared mask. Here is the proposal. We need a real
patch to see how the outcome looks anyway. I think this is like what Kai is
thinking about.
- rename role.is_private => role.is_mirrored_pt
- sp->gfn: gfn without shared bit.
- fault->address: without gfn_shared_mask
Actually it doesn't matter much. We can use gpa with gfn_shared_mask.
- Update struct tdp_iter
        struct tdp_iter
                gfn: gfn without shared bit
                /* Add new members */
                /* Indicates which PT to walk. */
                bool mirrored_pt;
                // This is used in tdp_iter_refresh_sptep()
                // shared gfn_mask if mirrored_pt
                // 0 if !mirrored_pt
                gfn_shared_mask
- Pass mirrored_pt and gfn_shared_mask to
  tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
  and update tdp_iter_refresh_sptep():
        static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
                ...
                iter->sptep = iter->pt_path[iter->level - 1] +
                        SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask, iter->level);
Change for_each_tdp_pte_min_level() accordingly.
Also update the iterator to call this.
        #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end)     \
                for (tdp_iter_start(&iter, root, min_level, start,                     \
                     mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0);     \
                     iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end);        \
                     tdp_iter_next(&iter))
- trace point: update to include mirrored_pt. Or leave it as is for now.
- pr_err() that logs gfn in handle_changed_spte():
  update to include mirrored_pt. Or leave it as is for now.
- Update spte handler (handle_changed_spte(), handle_removed_pt()...),
use iter->mirror_pt or pass down mirror_pt.
--
Isaku Yamahata <[email protected]>
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>
> For lack of a better method currently, use kvm_gfn_shared_mask() to
> determine if private memory cannot be zapped (as in TDX, the only VM type
> that sets it).
Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw that
SNP actually uses this function:
https://lore.kernel.org/kvm/[email protected]/
So trying to have a helper that says "The VM can't zap and refault in memory at
will" won't cut it. I guess there would have to be something more specific. I'm
thinking of just dropping this patch instead.
On 17/05/2024 9:46 am, Edgecombe, Rick P wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>>
>> For lack of a better method currently, use kvm_gfn_shared_mask() to
>> determine if private memory cannot be zapped (as in TDX, the only VM type
>> that sets it).
>
> Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw that
> SNP actually uses this function:
> https://lore.kernel.org/kvm/[email protected]/
>
> So trying to have a helper that says "The VM can't zap and refault in memory at
> will" won't cut it. I guess there would have to be some more specific. I'm
> thinking to just drop this patch instead.
Or KVM_BUG_ON() in the callers by explicitly checking VM type being TDX
as I mentioned before.
Having such checking in a generic function like this is just dangerous
and not flexible.
Just my 2 cents, though.
On Fri, 2024-05-17 at 10:23 +1200, Huang, Kai wrote:
> On 17/05/2024 9:46 am, Edgecombe, Rick P wrote:
> > On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > >
> > > For lack of a better method currently, use kvm_gfn_shared_mask() to
> > > determine if private memory cannot be zapped (as in TDX, the only VM type
> > > that sets it).
> >
> > Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw
> > that
> > SNP actually uses this function:
> > https://lore.kernel.org/kvm/[email protected]/
> >
> > So trying to have a helper that says "The VM can't zap and refault in memory
> > at
> > will" won't cut it. I guess there would have to be some more specific. I'm
> > thinking to just drop this patch instead.
>
> Or KVM_BUG_ON() in the callers by explicitly checking VM type being TDX
> as I mentioned before.
>
> Having such checking in a generic function like this is just dangerous
> and not flexible.
>
> Just my 2 cents, though.
As I said before, the point is to catch new callers. I see how it's a little
wrong to assume the intentions of the callers, but I don't see how it's
dangerous. Can you explain?
But you just reminded me that, yes, we can probably just check the vm_type here:
https://lore.kernel.org/kvm/[email protected]/
On Wed, 2024-05-15 at 18:20 -0700, Rick Edgecombe wrote:
> On Thu, 2024-05-16 at 13:04 +1200, Huang, Kai wrote:
> >
> > I really don't see difference between ...
> >
> > is_private_mem(gpa)
> >
> > ... and
> >
> > is_private_gpa(gpa)
> >
> > If it confuses me, it can confuse other people.
>
> Again, point taken. I'll try to think of a better name. Please share if you
> do.
What about:
bool kvm_on_private_root(const struct kvm *kvm, gpa_t gpa);
Since SNP doesn't have a private root, it can't get confused for SNP. For TDX
it's a little weirder. We usually want to know if the GPA is to the private
half. Whether it's on a separate root or not is not really important to the
callers. But they could infer that if it's on a private root it must be a
private GPA.
Otherwise:
bool kvm_is_private_gpa_bits(const struct kvm *kvm, gpa_t gpa);
The "bits" suffix indicates it's checking actual bits in the GPA and not the
private/shared state of the GFN.
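For concreteness, a sketch of what either name would boil down to (the body
matches the kvm_is_private_gpa() implementation discussed later in this
thread; only the name is new):
        static inline bool kvm_is_private_gpa_bits(const struct kvm *kvm, gpa_t gpa)
        {
                gfn_t mask = kvm_gfn_shared_mask(kvm);

                /* A zero mask means the VM has no private/shared GPA split. */
                return mask && !(gpa_to_gfn(gpa) & mask);
        }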
On 17/05/2024 10:38 am, Edgecombe, Rick P wrote:
> On Fri, 2024-05-17 at 10:23 +1200, Huang, Kai wrote:
>> On 17/05/2024 9:46 am, Edgecombe, Rick P wrote:
>>> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>>>>
>>>> For lack of a better method currently, use kvm_gfn_shared_mask() to
>>>> determine if private memory cannot be zapped (as in TDX, the only VM type
>>>> that sets it).
>>>
>>> Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw
>>> that
>>> SNP actually uses this function:
>>> https://lore.kernel.org/kvm/[email protected]/
>>>
>>> So trying to have a helper that says "The VM can't zap and refault in memory
>>> at
>>> will" won't cut it. I guess there would have to be some more specific. I'm
>>> thinking to just drop this patch instead.
>>
>> Or KVM_BUG_ON() in the callers by explicitly checking VM type being TDX
>> as I mentioned before.
>>
>> Having such checking in a generic function like this is just dangerous
>> and not flexible.
>>
>> Just my 2 cents, though.
>
> As I said before, the point is to catch new callers. I see how it's a little
> wrong to assume the intentions of the callers, but I don't see how it's
> dangerous. Can you explain?
Dangerous means when "a little wrong to assume the intentions of the
callers" actually goes wrong. In other words, a general intention to
"catch new callers" doesn't make a lot of sense to me.
Anyway as said before, it's just my 2 cents, and it's totally up to you.
On 17/05/2024 11:08 am, Edgecombe, Rick P wrote:
> On Wed, 2024-05-15 at 18:20 -0700, Rick Edgecombe wrote:
>> On Thu, 2024-05-16 at 13:04 +1200, Huang, Kai wrote:
>>>
>>> I really don't see difference between ...
>>>
>>> is_private_mem(gpa)
>>>
>>> ... and
>>>
>>> is_private_gpa(gpa)
>>>
>>> If it confuses me, it can confuse other people.
>>
>> Again, point taken. I'll try to think of a better name. Please share if you
>> do.
>
> What about:
> bool kvm_on_private_root(const struct kvm *kvm, gpa_t gpa);
>
> Since SNP doesn't have a private root, it can't get confused for SNP. For TDX
> it's a little weirder. We usually want to know if the GPA is to the private
> half. Whether it's on a separate root or not is not really important to the
> callers. But they could infer that if it's on a private root it must be a
> private GPA.
>
>
> Otherwise:
> bool kvm_is_private_gpa_bits(const struct kvm *kvm, gpa_t gpa);
>
> The "bits" suffix indicates it's checking actual bits in the GPA and not the
> private/shared state of the GFN.
The kvm_on_private_root() is better to me, assuming this helper wants to
achieve two goals:
1) whether a given GPA is private;
2) and when it is, whether to use private table;
And AFAICT we still want this implementation:
+ gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+ return mask && !(gpa_to_gfn(gpa) & mask);
What I don't quite like is that we use ...
        !(gpa_to_gfn(gpa) & mask);
... to tell whether a GPA is private, because it is TDX-specific logic; it
doesn't tell whether the GPA is private on SNP.
But as you said it certainly makes sense to say "we won't use a private
table for this GPA" when the VM doesn't have a private table at all. So
it's also fine to me.
But my question is: why do we need this helper at all?
As I expressed before, my concern is we already have too many mechanisms
around private/shared memory/mapping, and I am wondering whether we can
get rid of kvm_gfn_shared_mask() completely.
E.g., why can't we do:
        static bool kvm_use_private_root(struct kvm *kvm)
        {
                return kvm->arch.vm_type == VM_TYPE_TDX;
        }
Or,
        static bool kvm_use_private_root(struct kvm *kvm)
        {
                return kvm->arch.use_private_root;
        }
Or, assuming we would love to keep the kvm_gfn_shared_mask():
        static bool kvm_use_private_root(struct kvm *kvm)
        {
                return !!kvm_gfn_shared_mask(kvm);
        }
And then:
In the fault handler:
        if (fault->is_private && kvm_use_private_root(kvm))
                // use private root
        else
                // use shared/normal root
When you zap:
        bool private_gpa = kvm_mem_is_private(kvm, gfn);
        if (private_gpa && kvm_use_private_root(kvm))
                // zap private root
        else
                // zap shared/normal root.
The benefit of this is we can clearly split the logic of:
1) whether a GFN is private, and
2) whether to use private table for private GFN
But it's certainly possible that I am missing something, though.
Do you see any problem of above?
Again, my main concern is whether we should just get rid of the
kvm_gfn_shared_mask() completely (so we won't be able to abuse it) given we
already have so many mechanisms around private/shared memory/mapping here.
But I also understand we will anyway need to add the shared bit back when we
set up or tear down the page table, but for this purpose we can also use:
        kvm_x86_ops->get_shared_gfn_mask(kvm)
So to me kvm_gfn_shared_mask() is at least not mandatory.
Anyway, it's not a very strong opinion from me that we should remove
kvm_gfn_shared_mask(), assuming we won't abuse it just for convenience in
common code.
I hope I have expressed my view clearly.
And consider this as just my 2 cents.
We agreed to remove the abuse of kvm_gfn_shared_mask() and look at it again. I
was just checking back in on the name of the other function as I said I would.
Nevertheless...
On Fri, 2024-05-17 at 12:37 +1200, Huang, Kai wrote:
> The kvm_on_private_root() is better to me, assuming this helper wants to
> achieve two goals:
>
> 1) whether a given GPA is private;
> 2) and when it is, whether to use private table;
>
> And AFAICT we still want this implementation:
>
> + gfn_t mask = kvm_gfn_shared_mask(kvm);
> +
> + return mask && !(gpa_to_gfn(gpa) & mask);
No, like this:
static inline bool kvm_on_private_root(const struct kvm *kvm, gpa_t gpa)
{
        gfn_t mask = kvm_gfn_shared_mask(kvm);

        return kvm_has_private_root(kvm) && !(gpa_to_gfn(gpa) & mask);
}
>
> What I don't quite like is we use ...
>
> !(gpa_to_gfn(gpa) & mask);
>
> ... to tell whether a GPA is private, because it is TDX specific logic
> cause it doesn't tell on SNP whether the GPA is private.
These helpers are where we hide what will functionally be the same as "if tdx".
The other similar ones literally check for KVM_X86_TDX_VM.
>
> But as you said it certainly makes sense to say "we won't use a private
> table for this GPA" when the VM doesn't have a private table at all. So
> it's also fine to me.
>
> But my question is "why we need this helper at all".
>
> As I expressed before, my concern is we already have too many mechanisms
> around private/shared memory/mapping,
Everyone is in agreement here, you don't need to make the point again.
> and I am wondering whether we can
> get rid of kvm_gfn_shared_mask() completely.
You mentioned...
>
> E.g, why we cannot do:
>
> static bool kvm_use_private_root(struct kvm *kvm)
> {
> return kvm->arch.vm_type == VM_TYPE_TDX;
> }
>
> Or,
> static bool kvm_use_private_root(struct kvm *kvm)
> {
> return kvm->arch.use_private_root;
> }
>
> Or, assuming we would love to keep the kvm_gfn_shared_mask():
>
> static bool kvm_use_private_root(struct kvm *kvm)
> {
> return !!kvm_gfn_shared_mask(kvm);
> }
>
> And then:
>
> In fault handler:
>
> if (fault->is_private && kvm_use_private_root(kvm))
> // use private root
> else
> // use shared/normal root
>
> When you zap:
>
> bool private_gpa = kvm_mem_is_private(kvm, gfn);
>
> if (private_gpa && kvm_use_private_root(kvm))
> // zap private root
> else
> // zap shared/normal root.
>
I think you are trying to say not to abuse kvm_gfn_shared_mask() as is currently
done in this logic. But we already agreed on this. So not sure.
> The benefit of this is we can clearly split the logic of:
>
> 1) whether a GPN is private, and
> 2) whether to use private table for private GFN
>
> But it's certainly possible that I am missing something, though.
>
> Do you see any problem of above?
>
> Again, my main concern is whether we should just get rid of the
> kvm_gfn_shared_mask() completely
Sigh...
> (so we won't be able to abuse
> it) given we already have so many mechanisms around private/shared
> memory/mapping here.
>
> But I also understand we will anyway need to add the shared bit back
> when we set up or tear down the page table, but for this purpose we can
> also use:
>
> kvm_x86_ops->get_shared_gfn_mask(kvm)
>
> So to me kvm_gfn_shared_mask() is at least not mandatory.
Up the thread we have:
On Thu, 2024-05-16 at 12:12 +1200, Huang, Kai wrote:
> > What is the benefit of the x86_ops over a static inline?
> I don't have strong objection if the use of kvm_gfn_shared_mask() is
> contained in smaller areas that truly need it. Let's discuss in
> relevant patch(es).
So.. same question.
>
> Anyway, it's not a very strong opinion from me that we should remove
> kvm_gfn_shared_mask()
This is a shock!
> , assuming we won't abuse it just for
> convenience in common code.
>
> I hope I have expressed my view clearly.
>
> And consider this as just my 2 cents.
I don't think we can get rid of the shared mask. Even if we relied on
kvm_mem_is_private() to determine if a GPA is private or shared, at absolute
minimum we need to add the shared bit when we are zapping a GFN or mapping it.
Let's table the discussion until we have some code to look again.
Here is a diff of an attempt to merge all the feedback so far. It's on top of
the dev branch from this series.
On Thu, 2024-05-16 at 12:42 -0700, Isaku Yamahata wrote:
> - rename role.is_private => role.is_mirrored_pt
Agreed.
>
> - sp->gfn: gfn without shared bit.
>
> - fault->address: without gfn_shared_mask
> Actually it doesn't matter much. We can use gpa with gfn_shared_mask.
I left fault->addr with shared bits. It's not used anymore for TDX except in the
tracepoint which I think makes sense.
>
> - Update struct tdp_iter
> struct tdp_iter
> gfn: gfn without shared bit
>
> /* Add new members */
>
> /* Indicates which PT to walk. */
> bool mirrored_pt;
>
> // This is used in tdp_iter_refresh_sptep()
> // shared gfn_mask if mirrored_pt
> // 0 if !mirrored_pt
> gfn_shared_mask
>
> - Pass mirrored_pt and gfn_shared_mask to
> tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
>
> and update tdp_iter_refresh_sptep()
> static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> ...
> iter->sptep = iter->pt_path[iter->level - 1] +
> SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask,
> iter->level);
I tried something else. The iterators still have GFNs with shared bits, but the
addition of the shared bit is wrapped in tdp_mmu_for_each_pte(), so
kvm_tdp_mmu_map() and similar don't have to handle the shared bits. They just
pass in a root, and tdp_mmu_for_each_pte() knows how to adjust the GFN. Like:
        #define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end)  \
                for_each_tdp_pte(_iter, _root,                          \
                                 kvm_gfn_for_root(_kvm, _root, _start), \
                                 kvm_gfn_for_root(_kvm, _root, _end))
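A caller like kvm_tdp_mmu_map() then ends up doing (see the diff at the end
of this mail):
        root = tdp_mmu_get_root(vcpu, root_type);
        tdp_mmu_for_each_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
                ...
        }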
I also changed the callers to use the new enum to specify roots. This way they
can pass something with a nice name instead of true/false for bool private.
Keeping a gfn_shared_mask inside the iterator didn't seem clearer to me, and was
a bit more cumbersome. But please compare it.
>
> Change for_each_tdp_pte_min_level() accordingly.
> Also update the iterator to call this.
>
> #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end) \
>         for (tdp_iter_start(&iter, root, min_level, start, \
>              mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0); \
>              iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end); \
>              tdp_iter_next(&iter))
I liked it a lot because the callers don't need to manually call
kvm_gfn_for_root() anymore. But I tried it and it required a lot of additions of
kvm to the iterator call sites. I ended up removing it, but I'm not sure.
>
> - trace point: update to include mirrored_pt. Or leave it as is for now.
>
> - pr_err() that logs gfn in handle_changed_spte():
> update to include mirrored_pt. Or leave it as is for now.
I left it, as fault->addr still has the shared bit.
>
> - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> use iter->mirror_pt or pass down mirror_pt.
You mean just rename it, or something else?
Anyway below is a first cut based on the discussion.
A few other things:
1. kvm_is_private_gpa() is moved into Intel code. kvm_gfn_shared_mask() remains
for only two operations in common code:
- kvm_gfn_for_root() <- required for zapping/mapping
- Stripping the bit when setting fault.gfn <- possible to remove if we strip
cr2_or_gpa
2. I also played with changing KVM_PRIVATE_ROOTS to KVM_MIRROR_ROOTS.
Unfortunately there is still some confusion between private and mirrored. For
example, you walk a mirror root (which is what is actually happening), but you
have to allocate private page tables as you do so, as well as call out to
x86_ops named "private". So those concepts are effectively linked and used a
bit interchangeably.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e35a446baaad..64af6fd7cf85 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -351,7 +351,7 @@ union kvm_mmu_page_role {
unsigned ad_disabled:1;
unsigned guest_mode:1;
unsigned passthrough:1;
- unsigned is_private:1;
+ unsigned mirrored_pt:1;
unsigned :4;
/*
@@ -364,14 +364,14 @@ union kvm_mmu_page_role {
};
};
-static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role)
+static inline bool kvm_mmu_page_role_is_mirrored(union kvm_mmu_page_role role)
{
- return !!role.is_private;
+ return !!role.mirrored_pt;
}
-static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role)
+static inline void kvm_mmu_page_role_set_mirrored(union kvm_mmu_page_role *role)
{
- role->is_private = 1;
+ role->mirrored_pt = 1;
}
/*
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a578ea09dfb3..0c08b4f9093c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -338,21 +338,26 @@ static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
return kvm->arch.gfn_shared_mask;
}
-static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
-{
- return gfn | kvm_gfn_shared_mask(kvm);
-}
-
static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
{
return gfn & ~kvm_gfn_shared_mask(kvm);
}
-static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
-{
- gfn_t mask = kvm_gfn_shared_mask(kvm);
- return mask && !(gpa_to_gfn(gpa) & mask);
+/* The VM keeps a mirrored copy of the private memory */
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_has_private_root(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_zap_leafs_only(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d291c5d2d50..c6a0af5aefce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -686,7 +686,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- if (kvm_gfn_shared_mask(vcpu->kvm)) {
+ if (kvm_has_mirrored_tdp(vcpu->kvm)) {
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
@@ -3702,7 +3702,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;
if (tdp_mmu_enabled) {
- if (kvm_gfn_shared_mask(vcpu->kvm))
+ if (kvm_has_private_root(vcpu->kvm))
kvm_tdp_mmu_alloc_root(vcpu, true);
kvm_tdp_mmu_alloc_root(vcpu, false);
return 0;
@@ -6539,17 +6539,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
- if (tdp_mmu_enabled) {
- /*
- * kvm_zap_gfn_range() is used when MTRR or PAT memory
- * type was changed. TDX can't handle zapping the private
- * mapping, but it's ok because KVM doesn't support either of
- * those features for TDX. In case a new caller appears, BUG
- * the VM if it's called for solutions with private aliases.
- */
- KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
+ if (tdp_mmu_enabled)
flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
- }
if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
@@ -6996,10 +6987,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
kvm_mmu_zap_all(kvm);
}
+static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+ return;
+
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+ * case scenario we'll have unused shadow pages lying around until they
+ * are recycled due to age or when the VM is destroyed.
+ */
+ struct kvm_gfn_range range = {
+ .slot = slot,
+ .start = slot->base_gfn,
+ .end = slot->base_gfn + slot->npages,
+ .may_block = true,
+ };
+
+ if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
+ kvm_flush_remote_tlbs(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+}
+
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
- kvm_mmu_zap_all_fast(kvm);
+ if (kvm_zap_leafs_only(kvm))
+ kvm_mmu_zap_memslot_leafs(kvm, slot);
+ else
+ kvm_mmu_zap_all_fast(kvm);
}
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 3a7fe9261e23..2b1b2a980b03 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -159,9 +159,9 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}
-static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+static inline bool is_mirrored_sp(const struct kvm_mmu_page *sp)
{
- return kvm_mmu_page_role_is_private(sp->role);
+ return kvm_mmu_page_role_is_mirrored(sp->role);
}
static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
@@ -186,7 +186,7 @@ static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
/* Set shared bit if not private */
- gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
+ gfn_for_root |= -(gfn_t)!is_mirrored_sp(root) & kvm_gfn_shared_mask(kvm);
return gfn_for_root;
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 5eae8eac2da0..d0d13a4317e8 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,9 +74,6 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;
- WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
- !kvm_gfn_shared_mask(vcpu->kvm));
-
access &= shadow_mmio_access_mask;
spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d0df691ced5c..17d3f1593a24 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -265,9 +265,9 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}
-static inline bool is_private_sptep(u64 *sptep)
+static inline bool is_mirrored_sptep(u64 *sptep)
{
- return is_private_sp(sptep_to_sp(sptep));
+ return is_mirrored_sp(sptep_to_sp(sptep));
}
static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 42ccafc7deff..7f13016e210b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -97,15 +97,15 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
{
if (WARN_ON_ONCE(types == BUGGY_KVM_ROOTS))
return false;
- if (WARN_ON_ONCE(!(types & (KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS))))
+ if (WARN_ON_ONCE(!(types & (KVM_SHARED_ROOTS | KVM_MIRROR_ROOTS))))
return false;
if ((types & KVM_VALID_ROOTS) && root->role.invalid)
return false;
- if ((types & KVM_SHARED_ROOTS) && !is_private_sp(root))
+ if ((types & KVM_SHARED_ROOTS) && !is_mirrored_sp(root))
return true;
- if ((types & KVM_PRIVATE_ROOTS) && is_private_sp(root))
+ if ((types & KVM_MIRROR_ROOTS) && is_mirrored_sp(root))
return true;
return false;
@@ -252,7 +252,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private)
struct kvm_mmu_page *root;
if (private)
- kvm_mmu_page_role_set_private(&role);
+ kvm_mmu_page_role_set_mirrored(&role);
/*
* Check for an existing root before acquiring the pages lock to avoid
@@ -446,7 +446,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
shared);
}
- if (is_private_sp(sp) &&
+ if (is_mirrored_sp(sp) &&
WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
kvm_mmu_private_spt(sp)))) {
/*
@@ -580,7 +580,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte,
union kvm_mmu_page_role role, bool shared)
{
- bool is_private = kvm_mmu_page_role_is_private(role);
+ bool is_mirrored = kvm_mmu_page_role_is_mirrored(role);
int level = role.level;
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
@@ -665,12 +665,12 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
*/
if (was_present && !was_leaf &&
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
- KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
+ KVM_BUG_ON(is_mirrored != is_mirrored_sptep(spte_to_child_pt(old_spte, level)),
kvm);
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
shared);
}
- if (is_private && !is_present)
+ if (is_mirrored && !is_present)
handle_removed_private_spte(kvm, gfn, old_spte, new_spte, role.level);
if (was_leaf && is_accessed_spte(old_spte) &&
@@ -690,7 +690,7 @@ static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *it
*/
WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
- if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
+ if (is_mirrored_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
int ret;
if (is_shadow_present_pte(new_spte)) {
@@ -840,7 +840,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
- if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
+ if (is_mirrored_sptep(sptep) && !is_removed_spte(new_spte) &&
is_shadow_present_pte(new_spte)) {
/* Because write spin lock is held, no race. It should success. */
KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte,
@@ -872,11 +872,10 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
continue; \
else
-#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \
- for_each_tdp_pte(_iter, \
- root_to_sp((_private) ? _mmu->private_root_hpa : \
- _mmu->root.hpa), \
- _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _root, \
+ kvm_gfn_for_root(_kvm, _root, _start), \
+ kvm_gfn_for_root(_kvm, _root, _end))
/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -1307,12 +1306,11 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
struct kvm *kvm = vcpu->kvm;
+ enum kvm_tdp_mmu_root_types root_type = tdp_mmu_get_root_type(kvm, fault);
+ struct kvm_mmu_page *root;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
- gfn_t raw_gfn;
- bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
int ret = RET_PF_RETRY;
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1321,9 +1319,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
rcu_read_lock();
- raw_gfn = gpa_to_gfn(fault->addr);
-
- tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
+ root = tdp_mmu_get_root(vcpu, root_type);
+ tdp_mmu_for_each_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
int r;
if (fault->nx_huge_page_workaround_enabled)
@@ -1349,7 +1346,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* needs to be split.
*/
sp = tdp_mmu_alloc_sp(vcpu);
- if (kvm_is_private_gpa(kvm, raw_gfn << PAGE_SHIFT))
+ if (root_type == KVM_MIRROR_ROOTS)
kvm_mmu_alloc_private_spt(vcpu, sp);
tdp_mmu_init_child_sp(sp, &iter);
@@ -1360,7 +1357,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* TODO: large page support.
* Doesn't support large page for TDX now
*/
- KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
+ KVM_BUG_ON(is_mirrored_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
} else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
@@ -1405,7 +1402,7 @@ static enum kvm_tdp_mmu_root_types kvm_process_to_root_types(struct kvm *kvm,
WARN_ON_ONCE(process == BUGGY_KVM_INVALIDATION);
/* Always process shared for cases where private is not on a separate root */
- if (!kvm_gfn_shared_mask(kvm)) {
+ if (!kvm_has_private_root(kvm)) {
process |= KVM_PROCESS_SHARED;
process &= ~KVM_PROCESS_PRIVATE;
}
@@ -2022,14 +2019,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
*/
static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
- bool is_private)
+ enum kvm_tdp_mmu_root_types root_type)
{
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
- tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
@@ -2042,7 +2039,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
{
*root_level = vcpu->arch.mmu->root_role.level;
- return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_SHARED_ROOTS);
}
int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
@@ -2054,7 +2051,7 @@ int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
lockdep_assert_held(&vcpu->kvm->mmu_lock);
rcu_read_lock();
- leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
+ leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
rcu_read_unlock();
if (leaf < 0)
return -ENOENT;
@@ -2082,15 +2079,12 @@ EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
u64 *spte)
{
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_SHARED_ROOTS);
struct tdp_iter iter;
- struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;
- /* fast page fault for private GPA isn't supported. */
- WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
-
- tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
+ tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index b8a967426fac..40f5f9753131 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -22,15 +22,30 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
enum kvm_tdp_mmu_root_types {
BUGGY_KVM_ROOTS = BUGGY_KVM_INVALIDATION,
KVM_SHARED_ROOTS = KVM_PROCESS_SHARED,
- KVM_PRIVATE_ROOTS = KVM_PROCESS_PRIVATE,
+ KVM_MIRROR_ROOTS = KVM_PROCESS_PRIVATE,
KVM_VALID_ROOTS = BIT(2),
- KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS,
- KVM_ANY_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS,
+ KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_MIRROR_ROOTS | KVM_VALID_ROOTS,
+ KVM_ANY_ROOTS = KVM_SHARED_ROOTS | KVM_MIRROR_ROOTS,
};
static_assert(!(KVM_SHARED_ROOTS & KVM_VALID_ROOTS));
-static_assert(!(KVM_PRIVATE_ROOTS & KVM_VALID_ROOTS));
-static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1));
+static_assert(!(KVM_MIRROR_ROOTS & KVM_VALID_ROOTS));
+static_assert(KVM_MIRROR_ROOTS == (KVM_SHARED_ROOTS << 1));
+
+static inline enum kvm_tdp_mmu_root_types tdp_mmu_get_root_type(struct kvm *kvm,
+						struct kvm_page_fault *fault)
+{
+ if (fault->is_private && kvm_has_mirrored_tdp(kvm))
+ return KVM_MIRROR_ROOTS;
+ return KVM_SHARED_ROOTS;
+}
+
+static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
+						enum kvm_tdp_mmu_root_types type)
+{
+ if (type == KVM_MIRROR_ROOTS)
+ return root_to_sp(vcpu->arch.mmu->private_root_hpa);
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 7fdc67835e06..b4e324fe55c5 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -69,6 +69,14 @@ static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
vcpu->arch.at_instruction_boundary = true;
}
+
+static inline bool gpa_on_private_root(const struct kvm *kvm, gpa_t gpa)
+{
+ gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+ return kvm_has_private_root(kvm) && !(gpa_to_gfn(gpa) & mask);
+}
+
static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
unsigned long exit_qualification)
{
@@ -90,7 +98,7 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
- if (kvm_is_private_gpa(vcpu->kvm, gpa))
+ if (gpa_on_private_root(vcpu->kvm, gpa))
error_code |= PFERR_PRIVATE_ACCESS;
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bfb939826276..d7626f80b7f7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1772,7 +1772,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
unsigned long exit_qual;
- if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+ if (gpa_on_private_root(vcpu->kvm, tdexit_gpa(vcpu))) {
/*
* Always treat SEPT violations as write faults. Ignore the
* EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
@@ -2967,8 +2967,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
!region.nr_pages ||
region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
- !kvm_is_private_gpa(kvm, region.gpa) ||
- !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
+ !gpa_on_private_root(kvm, region.gpa) ||
+ !gpa_on_private_root(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
return -EINVAL;
mutex_lock(&kvm->slots_lock);
On 17/05/2024 7:42 am, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 04:36:48PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
>> On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
>>> On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
>>>> On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
>>>>>
>>>>> I meant it seems we should just strip shared bit away from the GPA in
>>>>> handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
>>>>> won't have the shared bit.
>>>>>
>>>>> Do you see any problem of doing so?
>>>>
>>>> We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
>>>
>>> I don't see any big difference?
>>>
>>> Now in this patch the raw_gfn is directly from fault->addr:
>>>
>>> raw_gfn = gpa_to_gfn(fault->addr);
>>>
>>> tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
>>> ...
>>> }
>>>
>>> But there's nothing wrong to get the raw_gfn from the fault->gfn. In
>>> fact, the zapping code just does this:
>>>
>>> /*
>>> * start and end doesn't have GFN shared bit. This function zaps
>>> * a region including alias. Adjust shared bit of [start, end) if
>>> * the root is shared.
>>> */
>>> start = kvm_gfn_for_root(kvm, root, start);
>>> end = kvm_gfn_for_root(kvm, root, end);
>>>
>>> So there's nothing wrong to just do the same thing in both functions.
>>>
>>> The point is fault->gfn has shared bit stripped away at the beginning, and
>>> AFAICT there's no useful reason to keep shared bit in fault->addr. The
>>> entire @fault is a temporary structure on the stack during fault handling
>>> anyway.
>>
>> I would like to avoid code churn at this point if there is not a real clear
>> benefit.
>> One small benefit of keeping the shared bit in the fault->addr is that it is
>> sort of consistent with how that field is used in other scenarios in KVM. In
>> shadow paging it's not even the GPA. So it is simply the "fault address" and has
>> to be interpreted in different ways in the fault handler. For TDX the fault
>> address *does* include the shared bit. And the EPT needs to be faulted in at
>> that address.
It's about how we want to define the semantics of fault->addr (forget
about the shadow MMU, because there fault->addr has a different meaning
than for TDP):
1) It represents the faulting address which points to the actual guest
memory (has no shared bit).
2) It represents the faulting address which is truly used for the
hardware page table walk.
The fault->gfn always represents the location of the actual guest memory
(w/o shared bit) in both cases.
I originally thought 2) wasn't consistent between SNP and TDX, but
thinking more, I now think the two semantics are actually both
consistent for SNP and TDX, which is important in order to avoid confusion.
Anyway it's trivial, because fault->addr is only used in the fault
handling path.
So yes, I'm fine with choosing 2) here. But IMHO we should explicitly
add a comment to 'struct kvm_page_fault' that @addr represents 2).
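To make 2) concrete, here is a minimal sketch of the invariant, assuming
the kvm_gfn_shared_mask() helper from this series (illustrative only, not
a hunk from the series):

	/*
	 * Semantic 2): fault->addr is what the hardware page table walk
	 * uses, so for a TDX shared access it keeps the shared bit, while
	 * fault->gfn always has the shared bit stripped and names the
	 * actual guest memory.
	 */
	fault->addr = cr2_or_gpa;	/* may carry the shared bit */
	fault->gfn = gpa_to_gfn(fault->addr) & ~kvm_gfn_shared_mask(vcpu->kvm);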
And I think the more important thing is how we handle "gfn" and
"raw_gfn" in tdp_iter and 'struct kvm_mmu_page'. See below.
>>
>> If we strip the shared bit when setting fault->addr we have to reconstruct it
>> when we do the actual shared mapping. There is no way around that. Which helper
>> does it, isn't important I think. Doing the reconstruction inside
>> tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
>> shared bit position.
>>
>> The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
>> without the shared bit. It's not stripped and then added back. Those are
>> operations that target GFNs really.
>>
>> I think the real problem is that we are gleaning whether the fault is to private
>> or shared memory from different things. Sometimes from fault->is_private,
>> sometimes the presence of the shared bits, and sometimes the role bit. I think
>> this is confusing, doubly so because we are using some of these things to infer
>> unrelated things (mirrored vs private).
>
> It's confusing we don't check it in uniform way.
>
>
>> My guess is that you have noticed this and somehow zeroed in on the shared_mask.
>> I think we should straighten out the mirrored/private semantics and see what the
>> results look like. How does that sound to you?
>
> I had closer look of the related code. I think we can (mostly) uniformly use
> gpa/gfn without shared mask. Here is the proposal. We need a real patch to see
> how the outcome looks like anyway. I think this is like what Kai is thinking
> about.
>
>
> - rename role.is_private => role.is_mirrored_pt
>
> - sp->gfn: gfn without shared bit.
I think we can treat 'tdp_iter' and 'struct kvm_mmu_page' in the same
way, because conceptually they both reflect the page table.
So I think both of them can have "gfn" or "raw_gfn", and "shared_gfn_mask".
Or have both "raw_gfn" and "gfn" but w/o "shared_gfn_mask". This may be
more straightforward because we can just use them when needed w/o
needing to play with gfn_shared_mask.
>
> - fault->address: without gfn_shared_mask
> Actually it doesn't matter much. We can use gpa with gfn_shared_mask.
See above. I think it makes sense to include the shared bit here. It's
trivial anyway though.
>
> - Update struct tdp_iter
> struct tdp_iter
> gfn: gfn without shared bit
Or "raw_gfn"?
Which may be more straightforward because it can be just from:
raw_gfn = gpa_to_gfn(fault->addr);
gfn = fault->gfn;
tdp_mmu_for_each_pte(..., raw_gfn, raw_gfn + 1, gfn)
Which is the reason to make fault->addr include the shared bit AFAICT.
>
> /* Add new members */
>
> /* Indicates which PT to walk. */
> bool mirrored_pt;
I don't think you need this? It's only used to select the root for the
page table walk. Once that's done, we already have the @sptep to operate on.
And I think you can just get @mirrored_pt from the sptep:
mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
Instead, I think we should keep the @is_private to indicate whether the
GFN is private or not, which should be distinguished from 'mirrored_pt',
which the root page table (and the @sptep) already reflects.
Of course if the @root/@sptep is mirrored_pt, is_private should always
be true, like:
WARN_ON_ONCE(sptep_to_sp(sptep)->role.is_mirrored_pt
&& !is_private);
Am I missing anything?
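A minimal sketch of that lookup, assuming the role bit ends up named
'is_mirrored_pt' as proposed (illustrative, not code from the series):

	/*
	 * Derive the page-table flavor from the SPTE's containing shadow
	 * page instead of carrying a separate flag in the iterator.
	 */
	static inline bool sptep_on_mirrored_pt(u64 *sptep)
	{
		return sptep_to_sp(sptep)->role.is_mirrored_pt;
	}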
>
> // This is used tdp_iter_refresh_sptep()
> // shared gfn_mask if mirrored_pt
> // 0 if !mirrored_pt
> gfn_shared_mask
>
> - Pass mirrored_pt and gfn_shared_mask to
> tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
As mentioned above, I am not sure whether we need @mirrored_pt, because
we already have the @root. Instead we should pass is_private, which
indicates that the GFN is private.
>
> and update tdp_iter_refresh_sptep()
> static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> ...
> iter->sptep = iter->pt_path[iter->level - 1] +
> SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask, iter->level);
>
> Change for_each_tdp_mte_min_level() accordingly.
> Also the iteretor to call this.
>
> #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end) \
> for (tdp_iter_start(&iter, root, min_level, start, \
> mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0); \
> iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end); \
> tdp_iter_next(&iter))
See above.
>
> - trace point: update to include mirrored_pt. Or leave it as is for now.
>
> - pr_err() that log gfn in handle_changed_spte()
> Update to include mirrored_pt. Or Leave it as is for now.
>
> - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> use iter->mirror_pt or pass down mirror_pt.
>
IIUC only sp->role.is_mirrored_pt is needed, tdp_iter->is_mirrored_pt
isn't necessary. But when the @sp is created, we need to initialize
whether it is mirrored_pt.
Am I missing anything?
>> E.g, why we cannot do:
>>
>> static bool kvm_use_private_root(struct kvm *kvm)
>> {
>> return kvm->arch.vm_type == VM_TYPE_TDX;
>> }
>>
>> Or,
>> static bool kvm_use_private_root(struct kvm *kvm)
>> {
>> return kvm->arch.use_private_root;
>> }
>>
>> Or, assuming we would love to keep the kvm_gfn_shared_mask():
>>
>> static bool kvm_use_private_root(struct kvm *kvm)
>> {
>> return !!kvm_gfn_shared_mask(kvm);
>> }
>>
>> And then:
>>
>> In fault handler:
>>
>> if (fault->is_private && kvm_use_private_root(kvm))
>> // use private root
>> else
>> // use shared/normal root
>>
>> When you zap:
>>
>> bool private_gpa = kvm_mem_is_private(kvm, gfn);
>>
>> if (private_gpa && kvm_use_private_root(kvm))
>> // zap private root
>> else
>> // zap shared/normal root.
>>
>
> I think you are trying to say not to abuse kvm_gfn_shared_mask() as is currently
> done in this logic. But we already agreed on this. So not sure.
To be clear: we agreed on this in general, but not on this
kvm_on_private_root().
It's obvious that you still want to "use kvm_gfn_shared_mask() to
determine whether a GPA is private" for this helper, but I don't like it.
In fact I don't see why we even need this helper.
I think I am just too obsessed with avoiding kvm_gfn_shared_mask(),
so I'll stop commenting/replying on this.
[...]
>
> I don't think we can get rid of the shared mask. Even if we relied on
> kvm_mem_is_private() to determine if a GPA is private or shared, at absolute
> minimum we need to add the shared bit when we are zapping a GFN or mapping it.
No, we cannot, but we can avoid using it here.
>
> Let's table the discussion until we have some code to look again.
100% agreed.
On Tue, May 14, 2024 at 05:59:39PM -0700, Rick Edgecombe wrote:
>From: Isaku Yamahata <[email protected]>
>
>Export a function to walk down the TDP without modifying it.
>
>Future changes will support pre-populating TDX private memory. In order to
>implement this KVM will need to check if a given GFN is already
>pre-populated in the mirrored EPT, and verify the populated private memory
>PFN matches the current one.[1]
>
>There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
>the KVM MMU that almost does what is required. However, to make sense of
>the results, MMU internal PTE helpers are needed. Refactor the code to
>provide a helper that can be used outside of the KVM MMU code.
>
>Refactoring the KVM page fault handler to support this lookup usage was
>also considered, but it was an awkward fit.
>
>Link: https://lore.kernel.org/kvm/[email protected]/ [1]
>Signed-off-by: Isaku Yamahata <[email protected]>
>Signed-off-by: Rick Edgecombe <[email protected]>
>---
>This helper will be used in the future change that implements
>KVM_TDX_INIT_MEM_REGION. Please refer to the following commit for the
>usage:
>https://github.com/intel/tdx/commit/2832c6d87a4e6a46828b193173550e80b31240d4
>
>TDX MMU Part 1:
> - New patch
>---
> arch/x86/kvm/mmu.h | 3 +++
> arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++++++++++----
> 2 files changed, 36 insertions(+), 4 deletions(-)
>
>diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>index dc80e72e4848..3c7a88400cbb 100644
>--- a/arch/x86/kvm/mmu.h
>+++ b/arch/x86/kvm/mmu.h
>@@ -275,6 +275,9 @@ extern bool tdp_mmu_enabled;
> #define tdp_mmu_enabled false
> #endif
>
>+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
>+ kvm_pfn_t *pfn);
>+
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
>diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
>index 1259dd63defc..1086e3b2aa5c 100644
>--- a/arch/x86/kvm/mmu/tdp_mmu.c
>+++ b/arch/x86/kvm/mmu/tdp_mmu.c
>@@ -1772,16 +1772,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> *
> * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
> */
>-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>- int *root_level)
>+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>+ bool is_private)
is_private isn't used.
> {
> struct tdp_iter iter;
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> gfn_t gfn = addr >> PAGE_SHIFT;
> int leaf = -1;
>
>- *root_level = vcpu->arch.mmu->root_role.level;
>-
> tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> leaf = iter.level;
> sptes[leaf] = iter.old_spte;
>@@ -1790,6 +1788,37 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> return leaf;
> }
>
>+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>+ int *root_level)
>+{
>+ *root_level = vcpu->arch.mmu->root_role.level;
>+
>+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
>+}
>+
>+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
>+ kvm_pfn_t *pfn)
private_pfn is probably a misnomer. shared/private is an attribute of the
GPA rather than the pfn. Since the function is to get a pfn from a gpa, how about
kvm_tdp_mmu_gpa_to_pfn()?
And the function is limited to handling private gpas only. That is an artificial
limitation we can get rid of easily, e.g., by making the function take an
"is_private" boolean and relaying it to __kvm_tdp_mmu_get_walk(). I know TDX
just calls the function to convert private gpas, but having a generic API
can accommodate future use cases (e.g., getting the hpa from a shared gpa) w/o the
need for refactoring.
>+{
>+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
>+ int leaf;
>+
>+ lockdep_assert_held(&vcpu->kvm->mmu_lock);
>+
>+ rcu_read_lock();
>+ leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
>+ rcu_read_unlock();
>+ if (leaf < 0)
>+ return -ENOENT;
>+
>+ spte = sptes[leaf];
>+ if (!(is_shadow_present_pte(spte) && is_last_spte(spte, leaf)))
>+ return -ENOENT;
>+
>+ *pfn = spte_to_pfn(spte);
>+ return leaf;
>+}
>+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);
>+
> /*
> * Returns the last level spte pointer of the shadow page walk for the given
> * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
>--
>2.34.1
>
>
On Fri, May 17, 2024 at 02:36:43PM +1200,
"Huang, Kai" <[email protected]> wrote:
> On 17/05/2024 7:42 am, Isaku Yamahata wrote:
> > On Thu, May 16, 2024 at 04:36:48PM +0000,
> > "Edgecombe, Rick P" <[email protected]> wrote:
> >
> > > On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
> > > > On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> > > > > On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > > > > >
> > > > > > I meant it seems we should just strip shared bit away from the GPA in
> > > > > > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> > > > > > won't have the shared bit.
> > > > > >
> > > > > > Do you see any problem of doing so?
> > > > >
> > > > > We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
> > > >
> > > > I don't see any big difference?
> > > >
> > > > Now in this patch the raw_gfn is directly from fault->addr:
> > > >
> > > > raw_gfn = gpa_to_gfn(fault->addr);
> > > >
> > > > tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
> > > > ...
> > > > }
> > > >
> > > > But there's nothing wrong to get the raw_gfn from the fault->gfn. In
> > > > fact, the zapping code just does this:
> > > >
> > > > /*
> > > > * start and end doesn't have GFN shared bit. This function zaps
> > > > * a region including alias. Adjust shared bit of [start, end) if
> > > > * the root is shared.
> > > > */
> > > > start = kvm_gfn_for_root(kvm, root, start);
> > > > end = kvm_gfn_for_root(kvm, root, end);
> > > >
> > > > So there's nothing wrong to just do the same thing in both functions.
> > > >
> > > > The point is fault->gfn has shared bit stripped away at the beginning, and
> > > > AFAICT there's no useful reason to keep shared bit in fault->addr. The
> > > > entire @fault is a temporary structure on the stack during fault handling
> > > > anyway.
> > >
> > > I would like to avoid code churn at this point if there is not a real clear
> > > benefit.
> > > One small benefit of keeping the shared bit in the fault->addr is that it is
> > > sort of consistent with how that field is used in other scenarios in KVM. In
> > > shadow paging it's not even the GPA. So it is simply the "fault address" and has
> > > to be interpreted in different ways in the fault handler. For TDX the fault
> > > address *does* include the shared bit. And the EPT needs to be faulted in at
> > > that address.
>
> It's about how we want to define the semantics of fault->addr (forget about
> the shadow MMU, because there fault->addr has a different meaning than for TDP):
>
> 1) It represents the faulting address which points to the actual guest
> memory (has no shared bit).
>
> 2) It represents the faulting address which is truly used as the hardware
> page table walk.
>
> The fault->gfn always represents the location of actual guest memory (w/o
> shared bit) in both cases.
>
> I originally thought 2) wasn't consistent between SNP and TDX, but thinking
> more, I now think the two semantics are actually both consistent for SNP and
> TDX, which is important in order to avoid confusion.
>
> Anyway it's trivial, because fault->addr is only used in the fault handling
> path.
>
> So yes, I'm fine with choosing 2) here. But IMHO we should explicitly
> add a comment to 'struct kvm_page_fault' that @addr represents 2).
Ok. I'm fine with 2).
> And I think the more important thing is how we handle "gfn" and "raw_gfn"
> in tdp_iter and 'struct kvm_mmu_page'. See below.
>
> > >
> > > If we strip the shared bit when setting fault->addr we have to reconstruct it
> > > when we do the actual shared mapping. There is no way around that. Which helper
> > > does it, isn't important I think. Doing the reconstruction inside
> > > tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
> > > shared bit position.
> > >
> > > The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
> > > without the shared bit. It's not stripped and then added back. Those are
> > > operations that target GFNs really.
> > >
> > > I think the real problem is that we are gleaning whether the fault is to private
> > > or shared memory from different things. Sometimes from fault->is_private,
> > > sometimes the presence of the shared bits, and sometimes the role bit. I think
> > > this is confusing, doubly so because we are using some of these things to infer
> > > unrelated things (mirrored vs private).
> >
> > It's confusing we don't check it in uniform way.
> >
> >
> > > My guess is that you have noticed this and somehow zeroed in on the shared_mask.
> > > I think we should straighten out the mirrored/private semantics and see what the
> > > results look like. How does that sound to you?
> >
> > I had closer look of the related code. I think we can (mostly) uniformly use
> > gpa/gfn without shared mask. Here is the proposal. We need a real patch to see
> > how the outcome looks like anyway. I think this is like what Kai is thinking
> > about.
> >
> >
> > - rename role.is_private => role.is_mirrored_pt
> >
> > - sp->gfn: gfn without shared bit.
>
> I think we can treat 'tdp_iter' and 'struct kvm_mmu_page' in the same way,
> because conceptually they both reflect the page table.
Agreed that iter->gfn and sp->gfn should be treated in the same way.
> So I think both of them can have "gfn" or "raw_gfn", and "shared_gfn_mask".
>
> Or have both "raw_gfn" and "gfn" but w/o "shared_gfn_mask". This may be more
> straightforward because we can just use them when needed w/o needing to play
> with gfn_shared_mask.
>
> >
> > - fault->address: without gfn_shared_mask
> > Actually it doesn't matter much. We can use gpa with gfn_shared_mask.
>
> See above. I think it makes sense to include the shared bit here. It's
> trivial anyway though.
Ok, let's make fault->addr include the shared mask.
> > - Update struct tdp_iter
> > struct tdp_iter
> > gfn: gfn without shared bit
>
> Or "raw_gfn"?
>
> Which may be more straightforward because it can be just from:
>
> raw_gfn = gpa_to_gfn(fault->addr);
> gfn = fault->gfn;
>
> tdp_mmu_for_each_pte(..., raw_gfn, raw_gfn + 1, gfn)
>
> Which is the reason to make fault->addr include the shared bit AFAICT.
If we can eliminate raw_gfn and kvm_gfn_for_root(), that's better.
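For context, this is the helper under discussion, copied from the
deletion hunk in the patch later in this mail rather than written fresh:

	static inline gfn_t kvm_gfn_for_root(struct kvm *kvm,
					     struct kvm_mmu_page *root,
					     gfn_t gfn)
	{
		gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);

		/* Set shared bit if not private */
		gfn_for_root |= -(gfn_t)!is_mirrored_sp(root) & kvm_gfn_shared_mask(kvm);
		return gfn_for_root;
	}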
> >
> > /* Add new members */
> >
> > /* Indicates which PT to walk. */
> > bool mirrored_pt;
>
> I don't think you need this? It's only used to select the root for page
> table walk. Once it's done, we already have the @sptep to operate on.
>
> And I think you can just get @mirrored_pt from the sptep:
>
> mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
>
> Instead, I think we should keep the @is_private to indicate whether the GFN
> is private or not, which should be distinguished from 'mirrored_pt', which
> the root page table (and the @sptep) already reflects.
>
> Of course if the @root/@sptep is mirrored_pt, is_private should
> always be true, like:
>
> WARN_ON_ONCE(sptep_to_sp(sptep)->role.is_mirrored_pt
> && !is_private);
>
> Am I missing anything?
You said it's not correct to use role. So I tried to find a way to pass down
is_mirrored and avoid using role.
Did you change your mind? Or are you fine with the new name is_mirrored?
https://lore.kernel.org/kvm/[email protected]/
> I don't think using kvm_mmu_page.role is correct.
> >
> > // This is used tdp_iter_refresh_sptep()
> > // shared gfn_mask if mirrored_pt
> > // 0 if !mirrored_pt
> > gfn_shared_mask
> >
> > - Pass mirrored_pt and gfn_shared_mask to
> > tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
>
> As mentioned above, I am not sure whether we need @mirrored_pt, because we
> already have the @root. Instead we should pass is_private, which indicates
> the GFN is private.
If we can use role, iter.mirrored_pt isn't needed.
> > - trace point: update to include mirrored_pt. Or leave it as is for now.
> >
> > - pr_err() that log gfn in handle_changed_spte()
> > Update to include mirrored_pt. Or Leave it as is for now.
> >
> > - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> > use iter->mirror_pt or pass down mirror_pt.
> >
>
> IIUC only sp->role.is_mirrored_pt is needed, tdp_iter->is_mirrored_pt isn't
> necessary. But when the @sp is created, we need to initialize whether it is
> mirrored_pt.
>
> Am I missing anything?
Because you didn't like using role, I tried to find another way.
--
Isaku Yamahata <[email protected]>
On Fri, May 17, 2024 at 02:35:46AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> Here is a diff of an attempt to merge all the feedback so far. It's on top of
> the the dev branch from this series.
>
> On Thu, 2024-05-16 at 12:42 -0700, Isaku Yamahata wrote:
> > - rename role.is_private => role.is_mirrored_pt
>
> Agreed.
>
> >
> > - sp->gfn: gfn without shared bit.
> >
> > - fault->address: without gfn_shared_mask
> > Actually it doesn't matter much. We can use gpa with gfn_shared_mask.
>
> I left fault->addr with shared bits. It's not used anymore for TDX except in the
> tracepoint which I think makes sense.
As discussed with Kai [1], make fault->addr represent the real fault address.
[1] https://lore.kernel.org/kvm/[email protected]/
>
> >
> > - Update struct tdp_iter
> > struct tdp_iter
> > gfn: gfn without shared bit
> >
> > /* Add new members */
> >
> > /* Indicates which PT to walk. */
> > bool mirrored_pt;
> >
> > // This is used tdp_iter_refresh_sptep()
> > // shared gfn_mask if mirrored_pt
> > // 0 if !mirrored_pt
> > gfn_shared_mask
> >
> > - Pass mirrored_pt and gfn_shared_mask to
> > tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
> >
> > and update tdp_iter_refresh_sptep()
> > static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> > ...
> > iter->sptep = iter->pt_path[iter->level - 1] +
> > SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask,
> > iter->level);
>
> I tried something else. The iterators still have gfn's with shared bits, but the
> addition of the shared bit is wrapped in tdp_mmu_for_each_pte(), so
> kvm_tdp_mmu_map() and similar don't have to handle the shared bits. They just
> pass in a root, and tdp_mmu_for_each_pte() knows how to adjust the GFN. Like:
>
> #define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
> for_each_tdp_pte(_iter, _root, \
> kvm_gfn_for_root(_kvm, _root, _start), \
> kvm_gfn_for_root(_kvm, _root, _end))
I'm wondering whether to remove kvm_gfn_for_root() altogether.
> I also changed the callers to use the new enum to specify roots. This way they
> can pass something with a nice name instead of true/false for bool private.
This is nice.
> Keeping a gfn_shared_mask inside the iterator didn't seem more clear to me, and
> bit more cumbersome. But please compare it.
>
> >
> > Change for_each_tdp_mte_min_level() accordingly.
> > Also the iteretor to call this.
> >
> > #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end) \
> > 	for (tdp_iter_start(&iter, root, min_level, start, \
> > 		mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0); \
> > 	     iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end); \
> > 	     tdp_iter_next(&iter))
>
> I liked it a lot because the callers don't need to manually call
> kvm_gfn_for_root() anymore. But I tried it and it required a lot of additions of
> kvm to the iterators call sites. I ended up removing it, but I'm not sure.
..
> > - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> > use iter->mirror_pt or pass down mirror_pt.
>
> You mean just rename it, or something else?
I retract this. I thought Kai didn't like using role [2].
But now it seems okay [3].
[2] https://lore.kernel.org/kvm/[email protected]/
> I don't think using kvm_mmu_page.role is correct.
[3] https://lore.kernel.org/kvm/[email protected]/
> I think you can just get @mirrored_pt from the sptep:
> mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> Anyway below is a first cut based on the discussion.
>
> A few other things:
> 1. kvm_is_private_gpa() is moved into Intel code. kvm_gfn_shared_mask() remains
> for only two operations in common code:
> - kvm_gfn_for_root() <- required for zapping/mapping
> - Stripping the bit when setting fault.gfn <- possible to remove if we strip
> cr2_or_gpa
> 2. I also played with changing KVM_PRIVATE_ROOTS to KVM_MIRROR_ROOTS.
> Unfortunately there is still some confusion between private and mirrored. For
> example you walk a mirror root (what is actually happening), but you have to
> allocate private page tables as you do, as well as call out to x86_ops named
> private. So those concepts are effectively linked and used a bit
> interchangeably.
On top of your patch, I created the following patch to remove kvm_gfn_for_root().
Although I haven't tested it yet, I think the following shows my idea.
- Add gfn_shared_mask to struct tdp_iter.
- Use iter.gfn_shared_mask to determine the starting sptep in the root.
- Remove kvm_gfn_for_root()
---
arch/x86/kvm/mmu/mmu_internal.h | 10 -------
arch/x86/kvm/mmu/tdp_iter.c | 5 ++--
arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++-----
arch/x86/kvm/mmu/tdp_mmu.c | 48 ++++++++++-----------------------
4 files changed, 26 insertions(+), 53 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2b1b2a980b03..9676af0cb133 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -180,16 +180,6 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
}
-static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t gfn)
-{
- gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
-
- /* Set shared bit if not private */
- gfn_for_root |= -(gfn_t)!is_mirrored_sp(root) & kvm_gfn_shared_mask(kvm);
- return gfn_for_root;
-}
-
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..c5f2ca1ceede 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
- SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+ SPTE_INDEX((iter->gfn | iter->gfn_shared_mask) << PAGE_SHIFT, iter->level);
iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}
@@ -37,7 +37,7 @@ void tdp_iter_restart(struct tdp_iter *iter)
* rooted at root_pt, starting with the walk to translate next_last_level_gfn.
*/
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn)
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_shared_mask)
{
if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
(root->role.level > PT64_ROOT_MAX_LEVEL))) {
@@ -46,6 +46,7 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
}
iter->next_last_level_gfn = next_last_level_gfn;
+ iter->gfn_shared_mask = gfn_shared_mask;
iter->root_level = root->role.level;
iter->min_level = min_level;
iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 8a64bcef9deb..274b42707f0a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,8 +91,9 @@ struct tdp_iter {
tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
/* A pointer to the current SPTE */
tdp_ptep_t sptep;
- /* The lowest GFN (shared bits included) mapped by the current SPTE */
+ /* The lowest GFN (shared bits excluded) mapped by the current SPTE */
gfn_t gfn;
+ gfn_t gfn_shared_mask;
/* The level of the root page given to the iterator */
int root_level;
/* The lowest level the iterator should traverse to */
@@ -120,18 +121,19 @@ struct tdp_iter {
* Iterates over every SPTE mapping the GFN range [start, end) in a
* preorder traversal.
*/
-#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
- for (tdp_iter_start(&iter, root, min_level, start); \
- iter.valid && iter.gfn < end; \
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
+ for (tdp_iter_start(&iter, root, min_level, start, \
+ is_mirrored_sp(root) ? 0: kvm_gfn_shared_mask(kvm)); \
+ iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))
-#define for_each_tdp_pte(iter, root, start, end) \
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, kvm, root, start, end) \
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
tdp_ptep_t spte_to_child_pt(u64 pte, int level);
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn);
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_shared_mask);
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f13016e210b..bf7aa87eb593 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -862,20 +862,18 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
iter->gfn, iter->level);
}
-#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
- for_each_tdp_pte(_iter, _root, _start, _end)
+#define tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
-#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
- tdp_root_for_each_pte(_iter, _root, _start, _end) \
+#define tdp_root_for_each_leaf_pte(_iter, _kvm, _root, _start, _end) \
+ tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
if (!is_shadow_present_pte(_iter.old_spte) || \
!is_last_spte(_iter.old_spte, _iter.level)) \
continue; \
else
#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
- for_each_tdp_pte(_iter, _root, \
- kvm_gfn_for_root(_kvm, _root, _start), \
- kvm_gfn_for_root(_kvm, _root, _end))
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -941,7 +939,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t end = tdp_mmu_max_gfn_exclusive();
gfn_t start = 0;
- for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1043,17 +1041,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
lockdep_assert_held_write(&kvm->mmu_lock);
- /*
- * start and end doesn't have GFN shared bit. This function zaps
- * a region including alias. Adjust shared bit of [start, end) if the
- * root is shared.
- */
- start = kvm_gfn_for_root(kvm, root, start);
- end = kvm_gfn_for_root(kvm, root, end);
-
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
@@ -1448,19 +1438,9 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
* into this helper allow blocking; it'd be dead, wasteful code.
*/
__for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
- gfn_t start, end;
-
- /*
- * For TDX shared mapping, set GFN shared bit to the range,
- * so the handler() doesn't need to set it, to avoid duplicated
- * code in multiple handler()s.
- */
- start = kvm_gfn_for_root(kvm, root, range->start);
- end = kvm_gfn_for_root(kvm, root, range->end);
-
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, start, end)
+ tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
rcu_read_unlock();
@@ -1543,7 +1523,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
- for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1706,7 +1686,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level above the target level (e.g. splitting a 1GB to 512 2MB pages,
* and then splitting each of those to 512 4KB pages).
*/
- for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1791,7 +1771,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_pte(iter, root, start, end) {
+ tdp_root_for_each_pte(iter, kvm, root, start, end) {
retry:
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
@@ -1846,7 +1826,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
+ tdp_root_for_each_leaf_pte(iter, kvm, root, gfn + __ffs(mask),
gfn + BITS_PER_LONG) {
if (!mask)
break;
@@ -1903,7 +1883,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_2M, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1973,7 +1953,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, gfn, gfn + 1) {
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
--
2.43.2
--
Isaku Yamahata <[email protected]>
On Fri, May 17, 2024 at 03:44:27PM +0800,
Chao Gao <[email protected]> wrote:
> On Tue, May 14, 2024 at 05:59:39PM -0700, Rick Edgecombe wrote:
> >From: Isaku Yamahata <[email protected]>
> >
> >Export a function to walk down the TDP without modifying it.
> >
> >Future changes will support pre-populating TDX private memory. In order to
> >implement this KVM will need to check if a given GFN is already
> >pre-populated in the mirrored EPT, and verify the populated private memory
> >PFN matches the current one.[1]
> >
> >There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
> >the KVM MMU that almost does what is required. However, to make sense of
> >the results, MMU internal PTE helpers are needed. Refactor the code to
> >provide a helper that can be used outside of the KVM MMU code.
> >
> >Refactoring the KVM page fault handler to support this lookup usage was
> >also considered, but it was an awkward fit.
> >
> >Link: https://lore.kernel.org/kvm/[email protected]/ [1]
> >Signed-off-by: Isaku Yamahata <[email protected]>
> >Signed-off-by: Rick Edgecombe <[email protected]>
> >---
> >This helper will be used in the future change that implements
> >KVM_TDX_INIT_MEM_REGION. Please refer to the following commit for the
> >usage:
> >https://github.com/intel/tdx/commit/2832c6d87a4e6a46828b193173550e80b31240d4
> >
> >TDX MMU Part 1:
> > - New patch
> >---
> > arch/x86/kvm/mmu.h | 3 +++
> > arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++++++++++----
> > 2 files changed, 36 insertions(+), 4 deletions(-)
> >
> >diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> >index dc80e72e4848..3c7a88400cbb 100644
> >--- a/arch/x86/kvm/mmu.h
> >+++ b/arch/x86/kvm/mmu.h
> >@@ -275,6 +275,9 @@ extern bool tdp_mmu_enabled;
> > #define tdp_mmu_enabled false
> > #endif
> >
> >+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> >+ kvm_pfn_t *pfn);
> >+
> > static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> > {
> > return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
> >diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> >index 1259dd63defc..1086e3b2aa5c 100644
> >--- a/arch/x86/kvm/mmu/tdp_mmu.c
> >+++ b/arch/x86/kvm/mmu/tdp_mmu.c
> >@@ -1772,16 +1772,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> > *
> > * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
> > */
> >-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >- int *root_level)
> >+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >+ bool is_private)
>
> is_private isn't used.
>
> > {
> > struct tdp_iter iter;
> > struct kvm_mmu *mmu = vcpu->arch.mmu;
> > gfn_t gfn = addr >> PAGE_SHIFT;
> > int leaf = -1;
> >
> >- *root_level = vcpu->arch.mmu->root_role.level;
> >-
> > tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > leaf = iter.level;
> > sptes[leaf] = iter.old_spte;
> >@@ -1790,6 +1788,37 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > return leaf;
> > }
> >
> >+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >+ int *root_level)
> >+{
> >+ *root_level = vcpu->arch.mmu->root_role.level;
> >+
> >+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
> >+}
> >+
> >+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> >+ kvm_pfn_t *pfn)
>
> private_pfn probably is a misnomer. shared/private is an attribute of
> GPA rather than pfn. Since the function is to get pfn from gpa, how about
> kvm_tdp_mmu_gpa_to_pfn()?
>
> And the function is limited to handle private gpa only. It is an artificial
> limitation we can get rid of easily. e.g., by making the function take
> "is_private" boolean and relay it to __kvm_tdp_mmu_get_walk(). I know TDX
> just calls the function to convert private gpa but having a generic API
> can accommodate future use cases (e.g., get hpa from shared gpa) w/o the
> need of refactoring.
Agreed. Based on a patch at [1], we can have something like
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 gpa,
enum kvm_tdp_mmu_root_types root_type,
kvm_pfn_t *pfn);
[1] https://lore.kernel.org/kvm/[email protected]/
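A sketch of what that could look like, reusing the body of
kvm_tdp_mmu_get_walk_private_pfn() from the patch above with the root
type parameterized (name and exact signature assumed, untested):

	int kvm_tdp_mmu_gpa_to_pfn(struct kvm_vcpu *vcpu, u64 gpa,
				   enum kvm_tdp_mmu_root_types root_type,
				   kvm_pfn_t *pfn)
	{
		u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
		int leaf;

		lockdep_assert_held(&vcpu->kvm->mmu_lock);

		rcu_read_lock();
		leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_type);
		rcu_read_unlock();
		if (leaf < 0)
			return -ENOENT;

		spte = sptes[leaf];
		if (!(is_shadow_present_pte(spte) && is_last_spte(spte, leaf)))
			return -ENOENT;

		*pfn = spte_to_pfn(spte);
		return leaf;
	}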
--
Isaku Yamahata <[email protected]>
On 5/15/24 21:09, Sean Christopherson wrote:
> Hmm, actually, we already have new uAPI/ABI in the form of VM types. What if
> we squeeze a documentation update into 6.10 (which adds the SEV VM flavors) to
> state that KVM's historical behavior of blasting all SPTEs is only_guaranteed_
> for KVM_X86_DEFAULT_VM?
>
> Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots during
> SEV-* boot?
Yes, the process is mostly the same for normal UEFI boot, SEV and SEV-ES.
However, it does so while the VM is paused (remember the atomic memslot
update attempts? that's now enforced by QEMU). So it's quite possible
that the old bug is not visible anymore, independent of why VFIO caused
it.
Paolo
> If so, and assuming any such memslots are smallish, we could even
> start enforcing the new ABI by doing a precise zap for small (arbitrary limit TBD)
> shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
On 5/15/24 22:05, Sean Christopherson wrote:
>> Again thinking of the userspace memory analogy... Aren't there some VMs where
>> the fast zap is faster? Like if you have guest with a small memslot that gets
>> deleted all the time, you could want it to be zapped specifically. But for the
>> giant memslot next to it, you might want to do the fast zap all thing.
>
> Yes. But...
On the other hand, tearing down a giant memslot isn't really common.
The main occurrences of memslots going away are 1) the BIOS fiddling with low
memory permissions; 2) PCI BARs. The former only happens at boot
and with small memslots; the latter can in principle involve large
memslots but... just don't do it.
Paolo
On 5/16/24 01:20, Sean Christopherson wrote:
> Hmm, a quirk isn't a bad idea. It suffers the same problems as a memslot flag,
> i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> would be much, much cautious about disabling a quirk that comes with a massive
> disclaimer.
>
> Though I suspect Paolo will shoot this down too
Not really, it's probably the least bad option. Not as safe as keying
it off the new machine types, but less ugly.
Paolo
On Fri, 2024-05-17 at 02:03 -0700, Isaku Yamahata wrote:
>
> On top of your patch, I created the following patch to remove
> kvm_gfn_for_root().
> Although I haven't tested it yet, I think the following shows my idea.
>
> - Add gfn_shared_mask to struct tdp_iter.
> - Use iter.gfn_shared_mask to determine the starting sptep in the root.
> - Remove kvm_gfn_for_root()
I investigated it.
After this, gfn_t's never have the shared bit. It's a simple rule. The MMU mostly
thinks it's operating on a shared root that is mapped at the normal GFN. Only
the iterator knows that the shared PTEs are actually in a different location.
There are some negative side effects:
1. The struct kvm_mmu_page's gfn doesn't match its actual mapping anymore.
2. As a result of the above, the code that flushes TLBs for a specific GFN will be
confused. It won't functionally matter for TDX, it will just look buggy to see the
flushing code called with the wrong gfn.
3. A lot of tracepoints no longer have the "real" gfn.
4. The mmio spte doesn't have the shared bit, as before (no effect).
5. Some zapping code (__tdp_mmu_zap_root(), tdp_mmu_zap_leafs()) intends to
actually operate on the raw_gfn. It wants to iterate the whole EPT, so it goes
from 0 to tdp_mmu_max_gfn_exclusive(). So now for mirrored it does, but for
shared it only covers the shared range. Basically kvm_mmu_max_gfn() is wrong if
we pretend shared GFNs are just strangely mapped normal GFNs. Maybe we could
just fix this up to report based on GPAW for TDX? Feels wrong.
On the positive effects side:
1. There is code that passes sp->gfn into things that it shouldn't (if it has
shared bits), like memslot lookups.
2. Also code that passes iter.gfn into things it shouldn't, like
kvm_mmu_max_mapping_level().
These places are not called by TDX, but if you know that gfn's might include
shared bits, then that code looks buggy.
I think the solution in the diff is more elegant than before, because it hides
what is really going on with the shared root. That is both good and bad. Can we
accept the downsides?
On Fri, May 17, 2024 at 06:16:26PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Fri, 2024-05-17 at 02:03 -0700, Isaku Yamahata wrote:
> >
> > On top of your patch, I created the following patch to remove
> > kvm_gfn_for_root().
> > Although I haven't tested it yet, I think the following shows my idea.
> >
> > - Add gfn_shared_mask to struct tdp_iter.
> > - Use iter.gfn_shared_mask to determine the starting sptep in the root.
> > - Remove kvm_gfn_for_root()
>
> I investigated it.
Thanks for looking at it.
> After this, gfn_t's never have shared bit. It's a simple rule. The MMU mostly
> thinks it's operating on a shared root that is mapped at the normal GFN. Only
> the iterator knows that the shared PTEs are actually in a different location.
>
> There are some negative side effects:
> 1. The struct kvm_mmu_page's gfn doesn't match its actual mapping anymore.
> 2. As a result of above, the code that flushes TLBs for a specific GFN will be
> confused. It won't functionally matter for TDX, just look buggy to see flushing
> code called with the wrong gfn.
flush_remote_tlbs_range() is only for a Hyper-V optimization. In other cases,
x86_op.flush_remote_tlbs_range = NULL or the member isn't defined at compile
time, so the remote TLB flush falls back to flushing the whole range. I don't
expect TDX in a Hyper-V guest. I have to admit that the code looks superficially
broken and confusing.
> 3. A lot of tracepoints no longer have the "real" gfn
Anyway we'd like to sort out the trace points and pr_err()s eventually because we
already added new PFERR flags.
> 4. mmio spte doesn't have the shared bit, as previous (no effect)
> 5. Some zapping code (__tdp_mmu_zap_root(), tdp_mmu_zap_leafs()) intends to
> actually operate on the raw_gfn. It wants to iterate the whole EPT, so it goes
> from 0 to tdp_mmu_max_gfn_exclusive(). So now for mirrored it does, but for
> shared it only covers the shared range. Basically kvm_mmu_max_gfn() is wrong if
> we pretend shared GFNs are just strangely mapped normal GFNs. Maybe we could
> just fix this up to report based on GPAW for TDX? Feels wrong.
Yes, it's broken with kvm_mmu_max_gfn().
> On the positive effects side:
> 1. There is code that passes sp->gfn into things that it shouldn't (if it has
> shared bits) like memslot lookups.
> 2. Also code that passes iter.gfn into things it shouldn't like
> kvm_mmu_max_mapping_level().
>
> These places are not called by TDX, but if you know that gfn's might include
> shared bits, then that code looks buggy.
>
> I think the solution in the diff is more elegant than before, because it hides
> what is really going on with the shared root. That is both good and bad. Can we
> accept the downsides?
Kai, do you have any thoughts?
--
Isaku Yamahata <[email protected]>
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c3c922bf077f..f92c8b605b03 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,11 +260,19 @@ union kvm_mmu_notifier_arg {
> unsigned long attributes;
> };
>
> +enum kvm_process {
> + BUGGY_KVM_INVALIDATION = 0,
> + KVM_PROCESS_SHARED = BIT(0),
> + KVM_PROCESS_PRIVATE = BIT(1),
> + KVM_PROCESS_PRIVATE_AND_SHARED = KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE,
> +};
> +
This enum and kvm_tdp_mmu_root_types are very similar. We could teach the
generic KVM code about invalid roots instead of just private/shared. Then we
could have a single enum: kvm_tdp_mmu_root_types. This leaks arch details a
little more. But since kvm_process is only used by TDX anyway, the abstraction
seems a little excessive.
I think we should just justify it better in the log. Basically that the benefit
is to hide the detail that private and shared are actually on different roots.
And I guess to hide the existence of the mirrored EPT optimization since the
root is named KVM_MIRROR_ROOTS now. The latter seems more worthwhile.
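For reference, the overlap is easy to see with the two definitions side
by side (abridged; values copied from the hunks in this thread):

	enum kvm_process {			/* generic KVM code */
		KVM_PROCESS_SHARED = BIT(0),
		KVM_PROCESS_PRIVATE = BIT(1),
	};

	enum kvm_tdp_mmu_root_types {		/* TDP MMU */
		KVM_SHARED_ROOTS = KVM_PROCESS_SHARED,
		KVM_MIRROR_ROOTS = KVM_PROCESS_PRIVATE,
		KVM_VALID_ROOTS = BIT(2),
	};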
On Fri, 2024-05-17 at 16:26 +1200, Huang, Kai wrote:
> > I think I am just too obsessed on avoiding using kvm_gfn_shared_mask()
> > so I'll stop commenting/replying on this.
I think you just need to stick with it and discuss it a little more. The pattern
seems to go:
1. You comment somewhere saying you want to get rid of kvm_gfn_shared_mask()
2. I ask about how it can work
3. We don't get to the bottom of it
4. Go to step 1
I think you are seeing bad code, but the communication is leaving me seriously
confused. The rework Isaku and I were doing in the other thread still includes a
shared mask in the core MMU code, so it's still open at this point.
On Thu, 2024-05-16 at 13:52 +0800, Yan Zhao wrote:
> > As said above, I don't see why we need a helper with the "current
> > implementation" (which consults kvm_shared_gfn_mask()) for them. We can
> > just use fault->gfn + fault->is_private for such purpose.
> What about a name like kvm_is_private_and_mirrored_gpa()?
> Only TDX's private memory is mirrored and the common code needs a way to
> tell that.
In the new changes we are working on in the other thread this helper is moved
into arch/x86/kvm/vmx/common.h for Intel-side use only, and renamed:
gpa_on_private_root(). It should address the SNP confusion concerns at least.
On the private and mirrored point, the mixing of private and mirrored in the
current code is definitely confusing. I think changing the names like that
(private_mirror) could make it easier to understand, even if it creates longer
lines.
I tried to create some abstraction where the MMU understood the concept of
general mirroring EPT roots, then checked a helper to see if the vm_type
mirrored "private" memory before calling out to all the private helpers. I
thought it would let us pretend more of this stuff was generic. But it was
turning out a bit silly. So I think I will just stick with updating the names
for the next revision.
>
> > >
> > > /* Add new members */
> > >
> > > /* Indicates which PT to walk. */
> > > bool mirrored_pt;
> >
> > I don't think you need this? It's only used to select the root for page
> > table walk. Once it's done, we already have the @sptep to operate on.
> >
> > And I think you can just get @mirrored_pt from the sptep:
> >
> > mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> >
> > Instead, I think we should keep the @is_private to indicate whether the GFN
> > is private or not, which should be distinguished with 'mirrored_pt', which
> > the root page table (and the @sptep) already reflects.
> >
> > Of course if the @root/@sptep is mirrored_pt, the is_private should be
> > always true, like:
> >
> > WARN_ON_ONCE(sptep_to_sp(sptep)->role.is_mirrored_pt
> > && !is_private);
> >
> > Am I missing anything?
>
> You said it not correct to use role. So I tried to find a way to pass down
> is_mirrored and avoid to use role.
>
> Did you change your mind? or you're fine with new name is_mirrored?
>
> https://lore.kernel.org/kvm/[email protected]/
> > I don't think using kvm_mmu_page.role is correct.
>
>
No. I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
invoke kvm_x86_ops::xx_private_spt()" is not correct. Instead, we should
use fault->is_private to determine:
	if (fault->is_private && kvm_x86_ops::xx_private_spt())
		kvm_x86_ops::xx_private_spte();
	else
		// normal TDP MMU operation
The reason is this pattern works not just for TDX, but also for SNP (and
SW_PROTECTED_VM) if they ever need specific page table ops.
Whether we are operating on the mirrored page table or not doesn't matter,
because we have already selected the root page table at the beginning of
kvm_tdp_mmu_map() based on whether the VM needs to use mirrored pt for
private mapping:
bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn + 1) {
	...
}

#define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end)	\
	for_each_tdp_pte(_iter,						\
		root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa :	\
			   _mmu->root.hpa),				\
		_start, _end)
If you somehow need the mirrored_pt later when handling the page
fault, you don't need another "mirrored_pt" in tdp_iter, because you can
easily get it from the sptep (or just from the root):
mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
What we really need to pass in is the fault->is_private, because we are
not able to get whether a GFN is private based on kvm_shared_gfn_mask()
for SNP and SW_PROTECTED_VM.
Since the current KVM code mainly passes only the @kvm and the @iter to
many TDP MMU functions like tdp_mmu_set_spte_atomic(), the easiest way to
convey the fault->is_private is to add a new 'is_private' (or even
better, 'is_private_gpa' to be more precise) to tdp_iter.
Otherwise, we need to explicitly pass either the entire @fault (which
might not be a good idea), or @is_private_gpa.
Or perhaps I am missing something?
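For concreteness, a minimal sketch of the dispatch Kai describes above, using
the set_private_spte() hook name from this series, with assumed arguments and
surrounding variables:

	/* Dispatch on the GPA type, not on the page table type. */
	if (fault->is_private && kvm_x86_ops.set_private_spte) {
		/* e.g. TDX: reflect the change into the real private EPT. */
		ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, pfn);
	} else {
		/* Normal TDP MMU operation for everything else. */
		ret = tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
	}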
On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
>
> No. I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
> invoke kvm_x86_ops::xx_private_spt()" is not correct.
I agree this looks wrong.
> Instead, we should
> use fault->is_private to determine:
>
> if (fault->is_private && kvm_x86_ops::xx_private_spt())
> kvm_x86_ops::xx_private_spte();
> else
> // normal TDP MMU operation
>
> The reason is this pattern works not just for TDX, but also for SNP (and
> SW_PROTECTED_VM) if they ever need specific page table ops.
I think the problem is there are a lot of things that are more on the mirrored
concept side:
- Allocating the "real" PTE pages (i.e. sp->private_spt)
- Setting the PTE when the mirror changes
- Zapping the real PTE when the mirror is zapped (and there is no fault)
- etc
And on the private side there is just knowing that private faults should operate
on the mirror root.
The xx_private_spte() operations are actually just updating the real PTE for the
mirror. In some ways it doesn't have to be about "private". It could be a mirror
of something else and still need the updates. For SNP and others they don't need
to do anything like that. (AFAIU)
So based on that, I tried to change the naming of xx_private_spt() to reflect
that. Like:
if (role.mirrored)
update_mirrored_pte()
The TDX code could encapsulate that mirrored updates need to update private EPT.
Then I had a helper that answered the question of whether to handle private
faults on the mirrored root.
The FREEZE stuff actually made a bit more sense too, because it was clear it
wasn't a special TDX private memory thing, but just about the atomicity.
The problem was I couldn't get rid of all special things that are private (can't
remember what now).
I wonder if I should give it a more proper try. What do you think?
At this point, I was just going to change the "mirrored" name to
"private_mirrored". Then code that does either mirrored things or private things
both looks correct. Basically making it clear that the MMU only supports
mirroring private memory.
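For reference, the helper mentioned above might look something like this sketch
(the name is hypothetical; only the vm_type check is implied by the discussion):

	/* Does this VM mirror its private page tables in KVM memory? */
	static inline bool kvm_mirrors_private_memory(const struct kvm *kvm)
	{
		/* Only TDX keeps a mirrored copy of the private EPT in KVM. */
		return kvm->arch.vm_type == KVM_X86_TDX_VM;
	}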
>
> Whether we are operating on the mirrored page table or not doesn't matter,
> because we have already selected the root page table at the beginning of
> kvm_tdp_mmu_map() based on whether the VM needs to use mirrored pt for
> private mapping:
I think it does matter, especially for the other operations (not faults). Did
you look at the other things checking the role?
>
>
> bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
>
> tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn +
> 1)
> {
> ...
> }
>
> #define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end) \
> for_each_tdp_pte(_iter, \
> root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa : \
> _mmu->root.hpa), \
> _start, _end)
>
> If you somehow need the mirrored_pt later when handling the page
> fault, you don't need another "mirrored_pt" in tdp_iter, because you can
> easily get it from the sptep (or just from the root):
>
> mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
>
> What we really need to pass in is the fault->is_private, because we are
> not able to get whether a GFN is private based on kvm_shared_gfn_mask()
> for SNP and SW_PROTECTED_VM.
SNP and SW_PROTECTED_VM (today) don't need to do anything special here, right?
>
> Since the current KVM code mainly passes only the @kvm and the @iter to
> many TDP MMU functions like tdp_mmu_set_spte_atomic(), the easiest way to
> convey the fault->is_private is to add a new 'is_private' (or even
> better, 'is_private_gpa' to be more precise) to tdp_iter.
>
> Otherwise, we need to explicitly pass either the entire @fault (which
> might not be a good idea), or @is_private_gpa.
>
> Or perhaps I am missing something?
I think two things:
- fault->is_private is only for faults, and we have other cases where we call
out to kvm_x86_ops.xx_private() things.
- Calling out to update something else is really more about the "mirrored"
concept than about private.
On Sat, 2024-05-18 at 15:41 +0000, Edgecombe, Rick P wrote:
> On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
> >
> > No. I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
> > invoke kvm_x86_ops::xx_private_spt()" is not correct.
>
> I agree this looks wrong.
>
> > Instead, we should
> > use fault->is_private to determine:
> >
> > if (fault->is_private && kvm_x86_ops::xx_private_spt())
> > kvm_x86_ops::xx_private_spte();
> > else
> > // normal TDP MMU operation
> >
> > The reason is this pattern works not just for TDX, but also for SNP (and
> > SW_PROTECTED_VM) if they ever need specific page table ops.
>
> I think the problem is there are a lot of things that are more on the mirrored
> concept side:
> - Allocating the "real" PTE pages (i.e. sp->private_spt)
> - Setting the PTE when the mirror changes
> - Zapping the real PTE when the mirror is zapped (and there is no fault)
> - etc
>
> And on the private side there is just knowing that private faults should operate
> on the mirror root.
... and issue a SEAMCALL to operate on the real private page table?
>
> The xx_private_spte() operations are actually just updating the real PTE for the
> mirror. In some ways it doesn't have to be about "private". It could be a mirror
> of something else and still need the updates. For SNP and others they don't need
> to do anything like that. (AFAIU)
AFAICT xx_private_spte() should issue a SEAMCALL to operate on the real private
page table?
>
> So based on that, I tried to change the naming of xx_private_spt() to reflect
> that. Like:
> if (role.mirrored)
> update_mirrored_pte()
>
> The TDX code could encapsulate that mirrored updates need to update private EPT.
> Then I had a helper that answered the question of whether to handle private
> faults on the mirrored root.
I am fine with this too, but I am also fine with the existing pattern:
That we update the mirrored_pt using normal TDP MMU operation, and then
invoke the xx_private_spte() for private GPA.
My only true comment is, to me it seems more reasonable to invoke
xx_private_spte() based on fault->is_private, but not on
'use_mirrored_pt'.
See my reply to your question whether SNP needs special handling below.
>
> The FREEZE stuff actually made a bit more sense too, because it was clear it
> wasn't a special TDX private memory thing, but just about the atomicity.
>
> The problem was I couldn't get rid of all special things that are private (can't
> remember what now).
>
> I wonder if I should give it a more proper try. What do you think?
>
> At this point, I was just going to change the "mirrored" name to
> "private_mirrored". Then code that does either mirrored things or private things
> both looks correct. Basically making it clear that the MMU only supports
> mirroring private memory.
I don't have preference on name. "mirrored_private" also works for me.
>
> >
> > Whether we are operating on the mirrored page table or not doesn't matter,
> > because we have already selected the root page table at the beginning of
> > kvm_tdp_mmu_map() based on whether the VM needs to use mirrored pt for
> > private mapping:
>
> I think it does matter, especially for the other operations (not faults). Did
> you look at the other things checking the role?
Yeah, I shouldn't say "doesn't matter". I meant we can get this from the
iter->sptep or the root.
>
> >
> >
> > bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
> >
> > tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn +
> > 1)
> > {
> > ...
> > }
> >
> > #define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end) \
> > for_each_tdp_pte(_iter, \
> > root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa : \
> > _mmu->root.hpa), \
> > _start, _end)
> >
> > If you somehow need the mirrored_pt later when handling the page
> > fault, you don't need another "mirrored_pt" in tdp_iter, because you can
> > easily get it from the sptep (or just from the root):
> >
> > mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> >
> > What we really need to pass in is the fault->is_private, because we are
> > not able to get whether a GFN is private based on kvm_shared_gfn_mask()
> > for SNP and SW_PROTECTED_VM.
>
> SNP and SW_PROTECTED_VM (today) don't need to do anything special here, right?
Conceptually, I think SNP also needs to at least issue some command(s) to
update the RMP table to reflect the GFN<->PFN relationship. From this
point, I do see a fit.
I briefly looked into SNP patchset, and I also raised the discussion there
(with you and Isaku copied):
https://lore.kernel.org/lkml/[email protected]/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051
I could be wrong, though.
On Mon, May 20, 2024 at 10:38:58AM +0000,
"Huang, Kai" <[email protected]> wrote:
> On Sat, 2024-05-18 at 15:41 +0000, Edgecombe, Rick P wrote:
> > On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
> > >
> > > No. I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
> > > invoke kvm_x86_ops::xx_private_spt()" is not correct.
> >
> > I agree this looks wrong.
> >
> > > Instead, we should
> > > use fault->is_private to determine:
> > >
> > > if (fault->is_private && kvm_x86_ops::xx_private_spt())
> > > kvm_x86_ops::xx_private_spte();
> > > else
> > > // normal TDP MMU operation
> > >
> > > The reason is this pattern works not just for TDX, but also for SNP (and
> > > SW_PROTECTED_VM) if they ever need specific page table ops.
Do you want to decouple invoking the hooks from the mirrored PT,
and allow invoking hooks even for the shared PT (probably without a
mirrored PT)? So far I have tied invoking the hooks to the mirrored PT, as
those hooks reflect changes on the mirrored PT to the private PT.
Is there any use case for allowing hooks for the shared PT?
- SEV_SNP
Although I can't speak for SNP folks, I guess they don't need hooks.
I guess they want to stay away from directly modifying the TDP MMU
(to add TDP MMU hooks). Instead, they added hooks to guest_memfd.
RMP (Reverse mapping table) doesn't have to be consistent with NPT.
Anyway, I'll reply to
https://lore.kernel.org/lkml/[email protected]/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051
TDX
I don't see an immediate need to allow hooks for the shared PT.
SW_PROTECTED (today)
It uses only the shared PT and doesn't need hooks.
SW_PROTECTED (with mirrored pt with shared mask in future in theory)
This would be similar to TDX; we wouldn't need hooks for the shared PT.
SW_PROTECTED (shared PT only without mirrored pt in future in theory)
I don't see the necessity of hooks for the shared PT.
(Or I don't see the value of this SW_PROTECTED case.)
> > I think the problem is there are a lot of things that are more on the mirrored
> > concept side:
> > - Allocating the "real" PTE pages (i.e. sp->private_spt)
> > - Setting the PTE when the mirror changes
> > - Zapping the real PTE when the mirror is zapped (and there is no fault)
> > - etc
> >
> > And on the private side there is just knowing that private faults should operate
> > on the mirror root.
>
> ... and issue a SEAMCALL to operate on the real private page table?
For the zapping case:
- SEV-SNP
They use the hook for guest_memfd.
- SW_PROTECTED (with mirrored pt in future in theory)
This would be similar to TDX.
> > The xx_private_spte() operations are actually just updating the real PTE for the
> > mirror. In some ways it doesn't have to be about "private". It could be a mirror
> > of something else and still need the updates. For SNP and others they don't need
> > to do anything like that. (AFAIU)
>
> AFAICT xx_private_spte() should issue a SEAMCALL to operate on the real private
> page table?
>
> >
> > So based on that, I tried to change the naming of xx_private_spt() to reflect
> > that. Like:
> > if (role.mirrored)
> > update_mirrored_pte()
> >
> > The TDX code could encapsulate that mirrored updates need to update private EPT.
> > Then I had a helper that answered the question of whether to handle private
> > faults on the mirrored root.
>
> I am fine with this too, but I am also fine with the existing pattern:
>
> That we update the mirrored_pt using normal TDP MMU operation, and then
> invoke the xx_private_spte() for private GPA.
>
> My only true comment is, to me it seems more reasonable to invoke
> xx_private_spte() based on fault->is_private, but not on
> 'use_mirrored_pt'.
>
> See my reply to your question whether SNP needs special handling below.
>
> >
> > The FREEZE stuff actually made a bit more sense too, because it was clear it
> > wasn't a special TDX private memory thing, but just about the atomicity.
> >
> > The problem was I couldn't get rid of all special things that are private (can't
> > remember what now).
> >
> > I wonder if I should give it a more proper try. What do you think?
> >
> > At this point, I was just going to change the "mirrored" name to
> > "private_mirrored". Then code that does either mirrored things or private things
> > both looks correct. Basically making it clear that the MMU only supports
> > mirroring private memory.
>
> I don't have preference on name. "mirrored_private" also works for me.
For hook names, we can use mirrored_private or reflect or handle?
(or whatever better name)
The current hook names
{link, free}_private_spt(),
{set, remove, zap}_private_spte()
=>
# use mirrored_private
{link, free}_mirrored_private_spt(),
{set, remove, zap}_mirrored_private_spte()
or
# use reflect (update or handle?) mirrored to private
reflect_{linked, freed}_mirrored_spt(),
reflect_{set, removed, zapped}_mirrored_spte()
or
# Don't add anything. I think this would be confusing.
{link, free}_spt(),
{set, remove, zap}_spte()
I think we should also rename the internal functions in TDP MMU.
- handle_removed_private_spte()
- set_private_spte_present()
handle and set are inconsistent. They should have consistent names.
=>
handle_{removed, set}_mirrored_private_spte()
or
reflect_{removed, set}_mirrored_spte()
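For concreteness, under the "reflect" naming the kvm_x86_ops declarations might
look like the sketch below; the signatures are assumptions extrapolated from the
current {link, free}_private_spt() / {set, remove}_private_spte() hooks:

	/* Sketch only: possible "reflect" spellings of the existing hooks. */
	int (*reflect_link_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
				void *mirrored_spt);
	int (*reflect_free_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
				void *mirrored_spt);
	int (*reflect_set_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
				kvm_pfn_t pfn);
	int (*reflect_remove_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
				   kvm_pfn_t pfn);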
> > > bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
> > >
> > > tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn +
> > > 1)
> > > {
> > > ...
> > > }
> > >
> > > #define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end) \
> > > for_each_tdp_pte(_iter, \
> > > root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa : \
> > > _mmu->root.hpa), \
> > > _start, _end)
> > >
> > > If you somehow need the mirrored_pt later when handling the page
> > > fault, you don't need another "mirrored_pt" in tdp_iter, because you can
> > > easily get it from the sptep (or just from the root):
> > >
> > > mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> > >
> > > What we really need to pass in is the fault->is_private, because we are
> > > not able to get whether a GFN is private based on kvm_shared_gfn_mask()
> > > for SNP and SW_PROTECTED_VM.
> >
> > SNP and SW_PROTECTED_VM (today) don't need to do anything special here, right?
>
> Conceptually, I think SNP also needs to at least issue some command(s) to
> update the RMP table to reflect the GFN<->PFN relationship. From this
> point, I do see a fit.
>
> I briefly looked into SNP patchset, and I also raised the discussion there
> (with you and Isaku copied):
>
> https://lore.kernel.org/lkml/[email protected]/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051
>
> I could be wrong, though.
I'll reply to it.
--
Isaku Yamahata <[email protected]>
On Mon, 2024-05-20 at 11:58 -0700, Isaku Yamahata wrote:
> For hook names, we can use mirrored_private or reflect or handle?
> (or whatever better name)
>
> The current hook names
> {link, free}_private_spt(),
> {set, remove, zap}_private_spte()
>
> =>
> # use mirrored_private
> {link, free}_mirrored_private_spt(),
> {set, remove, zap}_mirrored_private_spte()
>
> or
> # use reflect (update or handle?) mirrored to private
> reflect_{linked, freed}_mirrored_spt(),
> reflect_{set, removed, zapped}_mirrored_spte()
reflect is a nice name. I'm trying this path right now. I'll share a branch.
>
> or
> # Don't add anything. I think this would be confusing.
> {link, free}_spt(),
> {set, remove, zap}_spte()
On 21/05/2024 6:58 am, Isaku Yamahata wrote:
> On Mon, May 20, 2024 at 10:38:58AM +0000,
> "Huang, Kai" <[email protected]> wrote:
>
>> On Sat, 2024-05-18 at 15:41 +0000, Edgecombe, Rick P wrote:
>>> On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
>>>>
>>>> No. I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
>>>> invoke kvm_x86_ops::xx_private_spt()" is not correct.
>>>
>>> I agree this looks wrong.
>>>
>>>> Instead, we should
>>>> use fault->is_private to determine:
>>>>
>>>> if (fault->is_private && kvm_x86_ops::xx_private_spt())
>>>> kvm_x86_ops::xx_private_spte();
>>>> else
>>>> // normal TDP MMU operation
>>>>
>>>> The reason is this pattern works not just for TDX, but also for SNP (and
>>>> SW_PROTECTED_VM) if they ever need specific page table ops.
>
> Do you want to decouple invoking the hooks from the mirrored PT,
> and allow invoking hooks even for the shared PT (probably without a
> mirrored PT)? So far I have tied invoking the hooks to the mirrored PT, as
> those hooks reflect changes on the mirrored PT to the private PT.
>
> Is there any use case for allowing hooks for the shared PT?
To be clear, my intention is to allow the hook, if available, for "private
GPA". The point here is "private GPA", not "shared PT".
>
> - SEV_SNP
> Although I can't speak for SNP folks, I guess they don't need hooks.
> I guess they want to stay away from directly modifying the TDP MMU
> (to add TDP MMU hooks). Instead, they added hooks to guest_memfd.
> RMP (Reverse mapping table) doesn't have to be consistent with NPT.
>
> Anyway, I'll reply to
> https://lore.kernel.org/lkml/[email protected]/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051
For SNP _ONLY_ I completely understand. The point is, TDX needs to
modify the TDP MMU anyway. So if SNP can use the hooks added for TDX, and in
that case we can avoid the guest_memfd hooks, then I think it's better?
But I can certainly be, and probably am, wrong, because those
guest_memfd hooks have been there for a long time.
>
> TDX
> I don't see an immediate need to allow hooks for the shared PT.
>
> SW_PROTECTED (today)
> It uses only the shared PT and doesn't need hooks.
>
> SW_PROTECTED (with mirrored pt with shared mask in future in theory)
> This would be similar to TDX; we wouldn't need hooks for the shared PT.
>
> SW_PROTECTED (shared PT only without mirrored pt in future in theory)
> I don't see the necessity of hooks for the shared PT.
> (Or I don't see the value of this SW_PROTECTED case.)
>
I don't think a SW_PROTECTED VM will ever need to have any TDP MMU hook,
because there's no hardware feature backing it.
My intention is for SNP. Even if SNP doesn't need any TDP MMU hook
today, I think invoking the hook depending on "private GPA", not on
"private page table", provides more flexibility. And this also works for
TDX, regardless of whether SNP wants to implement any TDP MMU hook.
So conceptually speaking, I don't see any disadvantage to my proposal,
regardless of whether SNP chooses to use any TDP MMU hook or not. On the
other hand, if we choose to "invoke hooks depending on page table type",
then this code will indeed be only for TDX.
On Fri, May 17, 2024 at 12:16:30PM -0700,
Isaku Yamahata <[email protected]> wrote:
> > 4. mmio spte doesn't have the shared bit, as previous (no effect)
> > 5. Some zapping code (__tdp_mmu_zap_root(), tdp_mmu_zap_leafs()) intends to
> > actually operate on the raw_gfn. It wants to iterate the whole EPT, so it goes
> > from 0 to tdp_mmu_max_gfn_exclusive(). So now for mirrored it does, but for
> > shared it only covers the shared range. Basically kvm_mmu_max_gfn() is wrong if
> > we pretend shared GFNs are just strangely mapped normal GFNs. Maybe we could
> > just fix this up to report based on GPAW for TDX? Feels wrong.
>
> Yes, it's broken with kvm_mmu_max_gfn().
I looked into this one. I think we need to adjust the value even for the VMX case.
I have something at the bottom. What do you think? It is compile-tested only at the
moment. This is to show the idea.
Based on "Intel Trust Domain CPU Architectural Extensions"
There are four cases to consider.
- TDX Shared-EPT with 5-level EPT with host max_pa > 47
mmu_max_gfn should be host max gfn - (TDX key bits)
- TDX Shared-EPT with 4-level EPT with host max_pa > 47
The host allows 5-level. The guest doesn't need it. So use 4-level.
mmu_max_gfn should be 47 = min(47, host max gfn - (TDX key bits)).
- TDX Shared-EPT with 4-level EPT with host max_pa < 48
mmu_max_gfn should be min(47, host max gfn - (TDX key bits))
- The value for Shared-EPT works for TDX Secure-EPT.
- For VMX case (with TDX CPU extension enabled)
mmu_max_gfn should be host max gfn - (TDX key bits)
For VMX only with TDX disabled, TDX key bits == 0.
So kvm_mmu_max_gfn() needs to be a per-VM value. And now gfn_shared_mask() is
outside of guest max PA.
(Maybe we'd like to check if guest cpuid[0x8000:0008] matches those.)
Citation from "Intel Trust Domain CPU Architectural Extensions" for those
interested in the related sentences:
1.4.2 Guest Physical Address Translation
Transition to SEAM VMX non-root operation is formatted to require Extended
Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there should
be two EPTs active: the private EPT specified using the EPTP field of the VMCS
and a shared EPT specified using the Shared-EPTP field of the VMCS.
When translating a GPA using the shared EPT, an EPT misconfiguration can occur
if the entry is present and the physical address bits in the range
(MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
configured with a TDX private KeyID.
If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
physical address width is configured to be 48, accesses with GPA bits 51:48
not all being 0 can cause an EPT-violation, where such EPT-violations are not
mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED bit
is configured to be in bit position 47, GPA bit 47 would be reserved, and GPA
bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
46:MAXPA in any paging structure can cause a reserved bit page fault on
access.
1.5 OPERATION OUTSIDE SEAM
The physical address bits reserved for encoding TDX private KeyID are meant to
be treated as reserved bits when not in SEAM operation.
When translating a linear address outside SEAM, if any paging structure entry
has bits reserved for TDX private KeyID encoding in the physical address set,
then the processor helps generate a reserved bit page fault exception. When
translating a guest physical address outside SEAM, if any EPT structure entry
has bits reserved for TDX private KeyID encoding in the physical address set,
then the processor helps generate an EPT misconfiguration
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e3df14142db0..4ea6ad407a3d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1559,6 +1559,7 @@ struct kvm_arch {
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+ gfn_t mmu_max_gfn;
gfn_t gfn_shared_mask;
};
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bab9b0c4f0a9..fcb7197f7487 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -64,7 +64,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
*/
extern u8 __read_mostly shadow_phys_bits;
-static inline gfn_t kvm_mmu_max_gfn(void)
+static inline gfn_t __kvm_mmu_max_gfn(void)
{
/*
* Note that this uses the host MAXPHYADDR, not the guest's.
@@ -82,6 +82,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
}
+static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
+{
+ return kvm->arch.mmu_max_gfn;
+}
+
static inline u8 kvm_get_shadow_phys_bits(void)
{
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1fb6055b1565..25da520e81d6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3333,7 +3333,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
* only if L1's MAXPHYADDR is inaccurate with respect to the
* hardware's).
*/
- if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+ if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
return RET_PF_EMULATE;
return RET_PF_CONTINUE;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 630acf2b17f7..04b3c83f21a0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -952,7 +952,7 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
return iter->yielded;
}
-static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
+static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
{
/*
* Bound TDP MMU walks at host.MAXPHYADDR. KVM disallows memslots with
@@ -960,7 +960,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
* MMIO SPTEs for "impossible" gfns, instead sending such accesses down
* the slow emulation path every time.
*/
- return kvm_mmu_max_gfn() + 1;
+ return kvm_mmu_max_gfn(kvm) + 1;
}
static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -968,7 +968,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
{
struct tdp_iter iter;
- gfn_t end = tdp_mmu_max_gfn_exclusive();
+ gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
gfn_t start = 0;
for_each_tdp_pte_min_level(kvm, iter, root, zap_level, start, end) {
@@ -1069,7 +1069,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
{
struct tdp_iter iter;
- end = min(end, tdp_mmu_max_gfn_exclusive());
+ end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
lockdep_assert_held_write(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a3c39bd783d6..025d51a55505 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -12,6 +12,8 @@
static bool enable_tdx __ro_after_init;
module_param_named(tdx, enable_tdx, bool, 0444);
+static gfn_t __ro_after_init mmu_max_gfn;
+
#if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_INTEL_TDX_HOST)
static int vt_flush_remote_tlbs(struct kvm *kvm);
#endif
@@ -24,6 +26,27 @@ static void vt_hardware_disable(void)
vmx_hardware_disable();
}
+#define MSR_IA32_TME_ACTIVATE 0x982
+#define MKTME_UNINITIALIZED 2
+#define TME_ACTIVATE_LOCKED BIT_ULL(0)
+#define TME_ACTIVATE_ENABLED BIT_ULL(1)
+#define TDX_RESERVED_KEYID_BITS(tme_activate) \
+ (((tme_activate) & GENMASK_ULL(39, 36)) >> 36)
+
+static void vt_adjust_max_pa(void)
+{
+ u64 tme_activate;
+
+ mmu_max_gfn = __kvm_mmu_max_gfn();
+
+ rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
+ if (!(tme_activate & TME_ACTIVATE_LOCKED) ||
+ !(tme_activate & TME_ACTIVATE_ENABLED))
+ return;
+
+ mmu_max_gfn -= (gfn_t)TDX_RESERVED_KEYID_BITS(tme_activate);
+}
+
static __init int vt_hardware_setup(void)
{
int ret;
@@ -69,6 +92,8 @@ static __init int vt_hardware_setup(void)
vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
#endif
+ vt_adjust_max_pa();
+
return 0;
}
@@ -89,6 +114,8 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
static int vt_vm_init(struct kvm *kvm)
{
+ kvm->arch.mmu_max_gfn = mmu_max_gfn;
+
if (is_td(kvm))
return tdx_vm_init(kvm);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3be4b8ff7cb6..206ad053cbad 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2610,8 +2610,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
- else
+ else {
kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+ kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn,
+ gpa_to_gfn(BIT_ULL(47)));
+ }
out:
/* kfree() accepts NULL. */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f89405c8bc4..c519bb9c9559 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12693,6 +12693,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (ret)
goto out;
+ kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
kvm_mmu_init_vm(kvm);
ret = static_call(kvm_x86_vm_init)(kvm);
@@ -13030,7 +13031,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return -EINVAL;
if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
- if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
+ if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
return -EINVAL;
#if 0
--
Isaku Yamahata <[email protected]>
On Mon, 2024-05-20 at 12:02 -0700, Rick Edgecombe wrote:
>
> reflect is a nice name. I'm trying this path right now. I'll share a branch.
Here is the branch:
https://github.com/rpedgeco/linux/commit/674cd68b6ba626e48fe2446797d067e38dca80e3
TODO:
- kvm_mmu_max_gfn() updates from iterator changes
- kvm_flush_remote_tlbs_gfn() updates from iterator changes
The historically controversial mmu.h helpers:
static inline gfn_t kvm_gfn_direct_mask(const struct kvm *kvm)
{
	/* Only TDX sets this and it's the shared mask */
	return kvm->arch.gfn_shared_mask;
}

/* The VM keeps a mirrored copy of the private memory */
static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
{
	return kvm->arch.vm_type == KVM_X86_TDX_VM;
}

static inline bool kvm_on_mirror(const struct kvm *kvm, enum kvm_process process)
{
	if (!kvm_has_mirrored_tdp(kvm))
		return false;

	return process & KVM_PROCESS_PRIVATE;
}

static inline bool kvm_on_direct(const struct kvm *kvm, enum kvm_process process)
{
	if (!kvm_has_mirrored_tdp(kvm))
		return true;

	return process & KVM_PROCESS_SHARED;
}

static inline bool kvm_zap_leafs_only(const struct kvm *kvm)
{
	return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
In this solution, the tdp_mmu.c doesn't have a concept of private vs shared EPT
or GPA aliases. It just knows KVM_PROCESS_PRIVATE/SHARED, and fault->is_private.
Based on the PROCESS enums or fault->is_private, helpers in mmu.h encapsulate
whether to operate on the normal "direct" roots or the mirrored roots. When
!TDX, it always operates on direct.
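As a usage sketch, a zap/unmap path could then select roots like this (assuming
the gfn_range carries the kvm_process enum from this series; tdp_mmu_handle_range(),
mirror_root() and direct_root() are hypothetical names):

	/* Operate on whichever root(s) the request targets. */
	if (kvm_on_mirror(kvm, range->process))
		flush |= tdp_mmu_handle_range(kvm, range, mirror_root(kvm));
	if (kvm_on_direct(kvm, range->process))
		flush |= tdp_mmu_handle_range(kvm, range, direct_root(kvm));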
The code that does PTE setting/zapping etc. calls out to the mirrored "reflect"
helper and does the extra atomicity stuff when it sees the mirrored role bit.
In Isaku's code to make gfns never have shared bits, there was still the
concept of "shared" in the TDP MMU. But now since the TDP MMU focuses on
mirrored vs direct instead, an abstraction is introduced to just ask for the
mask for the root. For TDX the direct root is for shared memory, so instead the
kvm_gfn_direct_mask() gets applied when operating on the direct root.
I think there are still some things to be polished in the branch, but overall it
does a good job of cleaning up the confusion about the connection between
private and mirrored. And, between this and the previous changes, it reduces the
littering of the generic MMU code with private/shared alias concepts.
At the same time, I think the abstractions have a small cost in clarity if you
are looking at the code from TDX's perspective. It probably won't raise any
eyebrows for people used to tracing nested EPT violations through paging_tmpl.h.
But compared to naming everything mirrored_private, there is more obfuscation of
the bits twiddled.
On Mon, May 20, 2024 at 11:39:06PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Mon, 2024-05-20 at 12:02 -0700, Rick Edgecombe wrote:
> >
> > reflect is a nice name. I'm trying this path right now. I'll share a branch.
>
> Here is the branch:
> https://github.com/rpedgeco/linux/commit/674cd68b6ba626e48fe2446797d067e38dca80e3
Thank you for sharing it. It makes it easy to create further patches on top of
it.
..
> In this solution, the tdp_mmu.c doesn't have a concept of private vs shared EPT
> or GPA aliases. It just knows KVM_PROCESS_PRIVATE/SHARED, and fault->is_private.
>
> Based on the PROCESS enums or fault->is_private, helpers in mmu.h encapsulate
> whether to operate on the normal "direct" roots or the mirrored roots. When
> !TDX, it always operates on direct.
>
> The code that does PTE setting/zapping etc, calls out the mirrored "reflect"
> helper and does the extra atomicity stuff when it sees the mirrored role bit.
>
> In Isaku's code to make gfn's never have shared bits, there was still the
> concept of "shared" in the TDP MMU. But now since the TDP MMU focuses on
> mirrored vs direct instead, an abstraction is introduced to just ask for the
> mask for the root. For TDX the direct root is for shared memory, so instead the
> kvm_gfn_direct_mask() gets applied when operating on the direct root.
"direct" is better than "shared". It might be confusing with the existing
role.direct, but I don't think of better other name.
I resorted to pass around kvm for gfn_direct_mask to the iterator. Alternative
way is to stash it in struct kvm_mmu_page of root somehow. Then, we can strip
kvm from the iterator and the related macros.
> I think there are still some things to be polished in the branch, but overall it
> does a good job of cleaning up the confusion about the connection between
> private and mirrored. And, between this and the previous changes, it reduces the
> littering of the generic MMU code with private/shared alias concepts.
>
> At the same time, I think the abstractions have a small cost in clarity if you
> are looking at the code from TDX's perspective. It probably won't raise any
> eyebrows for people used to tracing nested EPT violations through paging_tmpl.h.
> But compared to naming everything mirrored_private, there is more obfuscation of
> the bits twiddled.
The rename makes the code much less confusing. I noticed that mirror and
mirrored are mixed. I'm not sure whether it's intentional or accidental.
--
Isaku Yamahata <[email protected]>
On Mon, 2024-05-20 at 19:25 -0700, Isaku Yamahata wrote:
> On Mon, May 20, 2024 at 11:39:06PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Mon, 2024-05-20 at 12:02 -0700, Rick Edgecombe wrote:
> > In this solution, the tdp_mmu.c doesn't have a concept of private vs shared
> > EPT
> > or GPA aliases. It just knows KVM_PROCESS_PRIVATE/SHARED, and fault-
> > >is_private.
> >
> > Based on the PROCESS enums or fault->is_private, helpers in mmu.h
> > encapsulate
> > whether to operate on the normal "direct" roots or the mirrored roots. When
> > !TDX, it always operates on direct.
> >
> > The code that does PTE setting/zapping etc, calls out the mirrored "reflect"
> > helper and does the extra atomicity stuff when it sees the mirrored role
> > bit.
> >
> > In Isaku's code to make gfn's never have shared bits, there was still the
> > concept of "shared" in the TDP MMU. But now since the TDP MMU focuses on
> > mirrored vs direct instead, an abstraction is introduced to just ask for the
> > mask for the root. For TDX the direct root is for shared memory, so instead
> > the
> > kvm_gfn_direct_mask() gets applied when operating on the direct root.
>
> "direct" is better than "shared". It might be confusing with the existing
> role.direct, but I don't think of better other name.
Yea, direct is kind of overloaded. But it actually is "direct" in the
role.direct sense at least.
>
> I resorted to passing around kvm to the iterator for gfn_direct_mask. An
> alternative way is to stash it in the struct kvm_mmu_page of the root somehow.
> Then we can strip kvm from the iterator and the related macros.
It seems like it would use too much memory. Looking up the mask once per
iteration doesn't seem too terrible to me.
>
>
> > I think there are still some things to be polished in the branch, but
> > overall it
> > does a good job of cleaning up the confusion about the connection between
> > private and mirrored. And, between this and the previous changes, it reduces
> > the littering of the generic MMU code with private/shared alias concepts.
> >
> > At the same time, I think the abstractions have a small cost in clarity if
> > you
> > are looking at the code from TDX's perspective. It probably won't raise any
> > eyebrows for people used to tracing nested EPT violations through
> > paging_tmpl.h.
> > But compared to naming everything mirrored_private, there is more
> > obfuscation of
> > the bits twiddled.
>
> The rename makes the code much less confusing. I noticed that mirror and
> mirrored are mixed. I'm not sure whether it's intentional or accidental.
We need a better name for sp->mirrored_spt and related functions. It is not the
mirror page table, it's the actual page table that is getting mirrored.
It would be nice to have a good generic name (not private) for what the mirrored
page tables are mirroring. Mirror vs mirrored is too close, but I couldn't think
of anything. Reflect only seems to fit as a verb.
Another nice thing about this separation: I think we can break the big patch
apart a bit. I think maybe I'll start re-arranging things into patches, unless
there is any objection to the whole direction. Kai?
On Mon, 2024-05-20 at 16:32 -0700, Isaku Yamahata wrote:
> I looked into this one. I think we need to adjust the value even for the VMX
> case.
> I have something at the bottom. What do you think? It is compile-tested only at
> the moment. This is to show the idea.
>
>
> Based on "Intel Trust Domain CPU Architectural Extensions"
> There are four cases to consider.
> - TDX Shared-EPT with 5-level EPT with host max_pa > 47
> mmu_max_gfn should be host max gfn - (TDX key bits)
>
> - TDX Shared-EPT with 4-level EPT with host max_pa > 47
> The host allows 5-level. The guest doesn't need it. So use 4-level.
> mmu_max_gfn should be 47 = min(47, host max gfn - (TDX key bits)).
>
> - TDX Shared-EPT with 4-level EPT with host max_pa < 48
> mmu_max_gfn should be min(47, host max gfn - (TDX key bits))
>
> - The value for Shared-EPT works for TDX Secure-EPT.
>
> - For VMX case (with TDX CPU extension enabled)
> mmu_max_gfn should be host max gfn - (TDX key bits)
> For VMX only with TDX disabled, TDX key bits == 0.
>
> So kvm_mmu_max_gfn() needs to be a per-VM value. And now gfn_shared_mask() is
> outside of guest max PA.
> (Maybe we'd like to check if guest cpuid[0x8000:0008] matches those.)
>
> Citation from "Intel Trust Domain CPU Architectural Extensions" for those
> interested in the related sentences:
>
> 1.4.2 Guest Physical Address Translation
> Transition to SEAM VMX non-root operation is formatted to require Extended
> Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there
> should
> be two EPTs active: the private EPT specified using the EPTP field of the
> VMCS
> and a shared EPT specified using the Shared-EPTP field of the VMCS.
> When translating a GPA using the shared EPT, an EPT misconfiguration can
> occur
> if the entry is present and the physical address bits in the range
> (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
> configured with a TDX private KeyID.
> If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
> physical address width is configured to be 48, accesses with GPA bits 51:48
> not all being 0 can cause an EPT-violation, where such EPT-violations are
> not
> mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
> If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED
> bit
> is configured to be in bit position 47, GPA bit 47 would be reserved, and
> GPA
> bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
> 46:MAXPA in any paging structure can cause a reserved bit page fault on
> access.
In "if the entry is present and the physical address bits in the range
(MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set", it's not clear
to me if "physical address bits" is referring to the GPA or the "entry" (meaning
the host pfn). The "entry" would be my guess.
It is also confusing when it talks about "guest physical address". It must mean
4 vs 5 level paging? How else is the shared EPT walker supposed to know the
guest maxpa? In which case it would be consistent with normal EPT behavior. But
the assertions around reserved bit page faults are surprising.
Based on those guesses, I'm not sure the below code is correct. We wouldn't need
to remove keyid bits from the GFN.
Maybe we should clarify the spec? Or are you confident reading it the other way?
>
> 1.5 OPERATION OUTSIDE SEAM
> The physical address bits reserved for encoding TDX private KeyID are meant
> to
> be treated as reserved bits when not in SEAM operation.
> When translating a linear address outside SEAM, if any paging structure
> entry
> has bits reserved for TDX private KeyID encoding in the physical address
> set,
> then the processor helps generate a reserved bit page fault exception. When
> translating a guest physical address outside SEAM, if any EPT structure
> entry
> has bits reserved for TDX private KeyID encoding in the physical address
> set,
> then the processor helps generate an EPT misconfiguration
This is more specific regarding which bits should not have key id bits: "if any
paging structure entry has bits reserved for TDX private KeyID encoding in the
physical address set". It is bits in the PTE, not the GPA.
>
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e3df14142db0..4ea6ad407a3d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1559,6 +1559,7 @@ struct kvm_arch {
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
>
> + gfn_t mmu_max_gfn;
> gfn_t gfn_shared_mask;
> };
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index bab9b0c4f0a9..fcb7197f7487 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -64,7 +64,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
> */
> extern u8 __read_mostly shadow_phys_bits;
>
> -static inline gfn_t kvm_mmu_max_gfn(void)
> +static inline gfn_t __kvm_mmu_max_gfn(void)
> {
> /*
> * Note that this uses the host MAXPHYADDR, not the guest's.
> @@ -82,6 +82,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
> return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
> }
>
> +static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
> +{
> + return kvm->arch.mmu_max_gfn;
> +}
> +
> static inline u8 kvm_get_shadow_phys_bits(void)
> {
> /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1fb6055b1565..25da520e81d6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3333,7 +3333,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu
> *vcpu,
> * only if L1's MAXPHYADDR is inaccurate with respect to the
> * hardware's).
> */
> - if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
> + if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
> return RET_PF_EMULATE;
>
> return RET_PF_CONTINUE;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 630acf2b17f7..04b3c83f21a0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -952,7 +952,7 @@ static inline bool __must_check
> tdp_mmu_iter_cond_resched(struct kvm *kvm,
> return iter->yielded;
> }
>
> -static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
> +static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
> {
> /*
> * Bound TDP MMU walks at host.MAXPHYADDR. KVM disallows memslots
> with
> @@ -960,7 +960,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
> * MMIO SPTEs for "impossible" gfns, instead sending such accesses
> down
> * the slow emulation path every time.
> */
> - return kvm_mmu_max_gfn() + 1;
> + return kvm_mmu_max_gfn(kvm) + 1;
> }
>
> static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> @@ -968,7 +968,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct
> kvm_mmu_page *root,
> {
> struct tdp_iter iter;
>
> - gfn_t end = tdp_mmu_max_gfn_exclusive();
> + gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
> gfn_t start = 0;
>
> for_each_tdp_pte_min_level(kvm, iter, root, zap_level, start, end) {
> @@ -1069,7 +1069,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct
> kvm_mmu_page *root,
> {
> struct tdp_iter iter;
>
> - end = min(end, tdp_mmu_max_gfn_exclusive());
> + end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
>
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index a3c39bd783d6..025d51a55505 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -12,6 +12,8 @@
> static bool enable_tdx __ro_after_init;
> module_param_named(tdx, enable_tdx, bool, 0444);
>
> +static gfn_t __ro_after_init mmu_max_gfn;
> +
> #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_INTEL_TDX_HOST)
> static int vt_flush_remote_tlbs(struct kvm *kvm);
> #endif
> @@ -24,6 +26,27 @@ static void vt_hardware_disable(void)
> vmx_hardware_disable();
> }
>
> +#define MSR_IA32_TME_ACTIVATE 0x982
> +#define MKTME_UNINITIALIZED 2
> +#define TME_ACTIVATE_LOCKED BIT_ULL(0)
> +#define TME_ACTIVATE_ENABLED BIT_ULL(1)
> +#define TDX_RESERVED_KEYID_BITS(tme_activate) \
> + (((tme_activate) & GENMASK_ULL(39, 36)) >> 36)
> +
> +static void vt_adjust_max_pa(void)
> +{
> + u64 tme_activate;
> +
> + mmu_max_gfn = __kvm_mmu_max_gfn();
> +
> + rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
> + if (!(tme_activate & TME_ACTIVATE_LOCKED) ||
> + !(tme_activate & TME_ACTIVATE_ENABLED))
> + return;
> +
> + mmu_max_gfn -= (gfn_t)TDX_RESERVED_KEYID_BITS(tme_activate);
> +}
As above, I'm not sure this is right. I guess you read the above as bits in the
GPA?
> +
> static __init int vt_hardware_setup(void)
> {
> int ret;
> @@ -69,6 +92,8 @@ static __init int vt_hardware_setup(void)
> vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
> #endif
>
> + vt_adjust_max_pa();
> +
> return 0;
> }
>
> @@ -89,6 +114,8 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct
> kvm_enable_cap *cap)
>
> static int vt_vm_init(struct kvm *kvm)
> {
> + kvm->arch.mmu_max_gfn = mmu_max_gfn;
> +
> if (is_td(kvm))
> return tdx_vm_init(kvm);
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3be4b8ff7cb6..206ad053cbad 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2610,8 +2610,11 @@ static int tdx_td_init(struct kvm *kvm, struct
> kvm_tdx_cmd *cmd)
>
> if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
> kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
> - else
> + else {
> kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
> + kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn,
> + gpa_to_gfn(BIT_ULL(47)));
> + }
>
> out:
> /* kfree() accepts NULL. */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f89405c8bc4..c519bb9c9559 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12693,6 +12693,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long
> type)
> if (ret)
> goto out;
>
> + kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
> kvm_mmu_init_vm(kvm);
>
> ret = static_call(kvm_x86_vm_init)(kvm);
> @@ -13030,7 +13031,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> return -EINVAL;
>
> if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
> - if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> + if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
> return -EINVAL;
>
> #if 0
On Tue, May 21, 2024 at 03:07:50PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> > 1.4.2 Guest Physical Address Translation
> > Transition to SEAM VMX non-root operation is formatted to require Extended
> > Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there
> > should
> > be two EPTs active: the private EPT specified using the EPTP field of the
> > VMCS
> > and a shared EPT specified using the Shared-EPTP field of the VMCS.
> > When translating a GPA using the shared EPT, an EPT misconfiguration can
> > occur
> > if the entry is present and the physical address bits in the range
> > (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
> > configured with a TDX private KeyID.
> > If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
> > physical address width is configured to be 48, accesses with GPA bits 51:48
> > not all being 0 can cause an EPT-violation, where such EPT-violations are
> > not
> > mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
> > If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED
> > bit
> > is configured to be in bit position 47, GPA bit 47 would be reserved, and
> > GPA
> > bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
> > 46:MAXPA in any paging structure can cause a reserved bit page fault on
> > access.
>
> In "if the entry is present and the physical address bits in the range
> (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set", it's not clear
> to me if "physical address bits" is referring to the GPA or the "entry" (meaning
> the host pfn). The "entry" would be my guess.
>
> It is also confusing when it talks about "guest physical address". It must mean
> 4 vs 5 level paging? How else is the shared EPT walker supposed to know the
> guest maxpa? In which case it would be consistent with normal EPT behavior. But
> the assertions around reserved bit page faults are surprising.
>
> Based on those guesses, I'm not sure the below code is correct. We wouldn't need
> to remove keyid bits from the GFN.
>
> Maybe we should clarify the spec? Or are you confident reading it the other way?
I'll read them more closely. At least the following patch is broken.
--
Isaku Yamahata <[email protected]>
On Fri, May 17, 2024 at 05:30:50PM +0200, Paolo Bonzini wrote:
> On 5/16/24 01:20, Sean Christopherson wrote:
> > Hmm, a quirk isn't a bad idea. It suffers the same problems as a memslot flag,
> > i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> > would be much, much cautious about disabling a quirk that comes with a massive
> > disclaimer.
> >
> > Though I suspect Paolo will shoot this down too ????
>
> Not really, it's probably the least bad option. Not as safe as keying it
> off the new machine types, but less ugly.
A concern about the quirk is that before identifying the root cause of the
issue, we don't know which one is the quirk: fast zapping all TDPs or slow zapping
within the memslot range.
I have the same feeling that the bug is probably not reproducible with the latest
KVM code. And even when both ways are bug free, some VMs may still prefer
fast zapping given it's fast.
So, I'm wondering if the cap in [1] is better.
[1] https://lore.kernel.org/kvm/[email protected]/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c
On Wed, May 22, 2024, Yan Zhao wrote:
> On Fri, May 17, 2024 at 05:30:50PM +0200, Paolo Bonzini wrote:
> > On 5/16/24 01:20, Sean Christopherson wrote:
> > > Hmm, a quirk isn't a bad idea. It suffers the same problems as a memslot flag,
> > > i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> > > would be much, much cautious about disabling a quirk that comes with a massive
> > > disclaimer.
> > >
> > > Though I suspect Paolo will shoot this down too ????
> >
> > Not really, it's probably the least bad option. Not as safe as keying it
> > off the new machine types, but less ugly.
> A concern about the quirk is that before identifying the root cause of the
> issue, we don't know which one is the quirk: fast zapping all TDPs or slow zapping
> within the memslot range.
The quirk is specifically that KVM zaps SPTEs that aren't related to the memslot
being deleted/moved. E.g. the issue went away if KVM zapped a rather arbitrary
set of SPTEs. IIRC, there was a specific gfn range that was "problematic", but
we never figured out the correlation between the problematic range and the memslot
being deleted.
Disabling the quirk would allow KVM to choose between a slow/precise/partial zap,
and full/fast zap.
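To illustrate, memslot deletion could then pick the zap strategy along these
lines (the quirk name and the precise-zap helper are hypothetical here; only
kvm_check_has_quirk() and kvm_mmu_zap_all_fast() exist today):

	/* Keep the historical full/fast zap unless userspace disabled the quirk. */
	if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_SLOT_ZAP_ALL))
		kvm_mmu_zap_all_fast(kvm);
	else
		kvm_mmu_zap_memslot_leafs(kvm, slot);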
On Tue, May 21, 2024 at 07:31:31PM -0700, Sean Christopherson wrote:
> On Wed, May 22, 2024, Yan Zhao wrote:
> > On Fri, May 17, 2024 at 05:30:50PM +0200, Paolo Bonzini wrote:
> > > On 5/16/24 01:20, Sean Christopherson wrote:
> > > > Hmm, a quirk isn't a bad idea. It suffers the same problems as a memslot flag,
> > > > i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> > > > would be much, much cautious about disabling a quirk that comes with a massive
> > > > disclaimer.
> > > >
> > > > Though I suspect Paolo will shoot this down too ????
> > >
> > > Not really, it's probably the least bad option. Not as safe as keying it
> > > off the new machine types, but less ugly.
> > A concern about the quirk is that before identifying the root cause of the
> > issue, we don't know which one is the quirk: fast zapping all TDPs or slow zapping
> > within the memslot range.
>
> The quirk is specifically that KVM zaps SPTEs that aren't related to the memslot
> being deleted/moved. E.g. the issue went away if KVM zapped a rather arbitrary
> set of SPTEs. IIRC, there was a specific gfn range that was "problematic", but
> we never figured out the correlation between the problematic range and the memslot
> being deleted.
>
So, a quirk like KVM_X86_QUIRK_ZAP_ALL_ON_MEMSLOT_DELETION, and enable it by
default?
> Disabling the quirk would allow KVM to choose between a slow/precise/partial zap,
> and full/fast zap.
TDX needs to disable the quirk for slow/precise/partial zap, right?
Then, when unsafe and passthrough devices are involved in TDX, we need to either
keep the quirk disabled if no bug is reported, or identify the root cause then.
Is that correct?
On Wed, May 22, 2024 at 8:49 AM Yan Zhao <[email protected]> wrote:
> > Disabling the quirk would allow KVM to choose between a slow/precise/partial zap,
> > and full/fast zap.
> TDX needs to disable the quirk for slow/precise/partial zap, right?
Yes - and since TDX is a separate VM type it might even start with the
quirk disabled. For sure, the memslot flag is the worst option and I'd
really prefer to avoid it.
> > I have the same feeling that the bug is probably not reproducible with the latest
> > KVM code
Or with the latest QEMU code, if it was related somehow to non-atomic
changes to the memory map.
Paolo
On Tue, May 21, 2024 at 09:15:20AM -0700,
Isaku Yamahata <[email protected]> wrote:
> On Tue, May 21, 2024 at 03:07:50PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > > 1.4.2 Guest Physical Address Translation
> > > Transition to SEAM VMX non-root operation is formatted to require Extended
> > > Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there
> > > should
> > > be two EPTs active: the private EPT specified using the EPTP field of the
> > > VMCS
> > > and a shared EPT specified using the Shared-EPTP field of the VMCS.
> > > When translating a GPA using the shared EPT, an EPT misconfiguration can
> > > occur
> > > if the entry is present and the physical address bits in the range
> > > (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
> > > configured with a TDX private KeyID.
> > > If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
> > > physical address width is configured to be 48, accesses with GPA bits 51:48
> > > not all being 0 can cause an EPT-violation, where such EPT-violations are
> > > not
> > > mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
> > > If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED
> > > bit
> > > is configured to be in bit position 47, GPA bit 47 would be reserved, and
> > > GPA
> > > bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
> > > 46:MAXPA in any paging structure can cause a reserved bit page fault on
> > > access.
> >
> > In "if the entry is present and the physical address bits in the range
> > (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set", it's not clear
> > to me if "physical address bits" is referring to the GPA or the "entry" (meaning
> > the host pfn). The "entry" would be my guess.
> >
> > It is also confusing when it talks about "guest physical address". It must mean
> > 4 vs 5 level paging? How else is the shared EPT walker supposed to know the
> > guest maxpa? In that case it would be consistent with normal EPT behavior. But
> > the assertions around reserved bit page faults are surprising.
> >
> > Based on those guesses, I'm not sure the below code is correct. We wouldn't need
> > to remove keyid bits from the GFN.
> >
> > Maybe we should clarify the spec? Or are you confident reading it the other way?
>
> I'll read them more closely. At least the following patch is broken.
I was confusing guest (virtual) maxphyaddr with host maxphyaddr. Here is the
outcome. We have 5 potentially problematic points related to the mmu max pfn.
Related operations
==================
- memslot creation or kvm_arch_prepare_memory_region()
We can create a slot beyond virtual maxphyaddr without any change. Although
it's weird, it does no immediate harm. If we prevent it, some of the potentially
problematic cases below won't happen.
- TDP MMU iterator (including memslot deletion)
It works fine without any change because it uses only the necessary bits of the
GPA. It ignores the upper bits of the given start GFN, and it ends the SPTE
traversal if GPA > virtual maxphyaddr.
For Secure EPT
It may go beyond the shared bit if a slot is huge enough to cross the
private-vs-shared boundary. Because the TDP MMU fault handler doesn't (or can
be made not to) populate such entries, it essentially results in a NOP.
- populating on EPT violation
Because the TDX EPT violation handler can filter out EPT violations with GPA >
virtual maxphyaddr, we can assume the GPA passed to the fault handler is <
virtual maxphyaddr.
- zapping (including memslot deletion)
Because zapping a non-populated GFN is a NOP, zapping the specified GFN works fine.
- pre_fault_memory
KVM_PRE_FAULT_MEMORY calls the fault handler without a virtual maxphyaddr check.
An additional check is needed to reject GPA > virtual maxphyaddr
if virtual maxphyaddr < 47 or 52; see the sketch below.
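For illustration, the additional check could look something like this near the
top of the pre-fault path (a sketch, assuming the per-VM kvm_mmu_max_gfn() from
option 1 below):

	/* Reject pre-faulting of GPAs above the guest's maxphyaddr. */
	if (range->gpa + range->size > gfn_to_gpa(kvm_mmu_max_gfn(vcpu->kvm) + 1))
		return -EINVAL;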
I can think of the following options.
options
=======
option 1. Allow per-VM kvm_mmu_max_gfn()
Pro: Conceptually easy to understand, and it's straightforward to disallow
memslot creation > virtual maxphyaddr.
Con: Overkill for the corner case? (The diff is attached. This only matters when
user space creates a memslot > virtual maxphyaddr and the guest accesses a GPA >
virtual maxphyaddr.)
option 2. Keep kvm_mmu_max_gfn() and add an ad hoc address check.
Pro: Minimal change?
Modify kvm_handle_noslot_fault() or kvm_faultin_pfn() to reject GPA >
virtual maxphyaddr.
Con: Conceptually confusing, since operations on GFN > virtual maxphyaddr remain
allowed. The change might be unnatural or ad hoc because it still allows
creating a memslot with GPA > virtual maxphyaddr. (A sketch follows.)
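A rough sketch of option 2's check (vcpu_max_gfn() is a made-up helper standing
in for the guest maxphyaddr-derived limit):

	/*
	 * Option 2: ad hoc bound check in the fault path, keeping the
	 * global kvm_mmu_max_gfn() as-is.
	 */
	if (unlikely(fault->gfn > vcpu_max_gfn(vcpu)))
		return RET_PF_EMULATE;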
The following is an experimental change for option 1.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 406effc613e5..dbc371071cb5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1558,6 +1558,7 @@ struct kvm_arch {
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+ gfn_t mmu_max_gfn;
gfn_t gfn_shared_mask;
};
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9cd83448e39f..7b7ecaf1c607 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -64,7 +64,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
*/
extern u8 __read_mostly shadow_phys_bits;
-static inline gfn_t kvm_mmu_max_gfn(void)
+static inline gfn_t __kvm_mmu_max_gfn(void)
{
/*
* Note that this uses the host MAXPHYADDR, not the guest's.
@@ -82,6 +82,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
}
+static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
+{
+ return kvm->arch.mmu_max_gfn;
+}
+
static inline u8 kvm_get_shadow_phys_bits(void)
{
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 295c27dc593b..515edc6ae867 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3333,7 +3333,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
* only if L1's MAXPHYADDR is inaccurate with respect to the
* hardware's).
*/
- if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+ if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
return RET_PF_EMULATE;
return RET_PF_CONTINUE;
@@ -6509,6 +6509,7 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
void kvm_mmu_init_vm(struct kvm *kvm)
{
+ kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
kvm->arch.shadow_mmio_value = shadow_mmio_value;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 79c9b22ceef6..ee3456b2096d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -945,7 +945,7 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
return iter->yielded;
}
-static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
+static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
{
/*
* Bound TDP MMU walks at host.MAXPHYADDR. KVM disallows memslots with
@@ -953,7 +953,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
* MMIO SPTEs for "impossible" gfns, instead sending such accesses down
* the slow emulation path every time.
*/
- return kvm_mmu_max_gfn() + 1;
+ return kvm_mmu_max_gfn(kvm) + 1;
}
static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -961,7 +961,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
{
struct tdp_iter iter;
- gfn_t end = tdp_mmu_max_gfn_exclusive();
+ gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
gfn_t start = 0;
for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
@@ -1062,7 +1062,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
{
struct tdp_iter iter;
- end = min(end, tdp_mmu_max_gfn_exclusive());
+ end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
lockdep_assert_held_write(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 61715424629b..5c2afca59386 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2549,7 +2549,9 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct kvm_tdx_init_vm *init_vm = NULL;
struct td_params *td_params = NULL;
- int ret;
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ int ret, idx, i, bkt;
BUILD_BUG_ON(sizeof(*init_vm) != 8 * 1024);
BUILD_BUG_ON(sizeof(struct td_params) != 1024);
@@ -2611,6 +2613,25 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
else
kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+ kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn,
+ kvm->arch.gfn_shared_mask - 1);
+ /*
+ * As memslots can be created before KVM_TDX_INIT_VM, check whether any
+ * existing memslot ends at or below mmu_max_gfn.
+ */
+ idx = srcu_read_lock(&kvm->srcu);
+ write_lock(&kvm->mmu_lock);
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(slot, bkt, slots) {
+ if (slot->base_gfn + slot->npages > kvm->arch.mmu_max_gfn) {
+ ret = -ERANGE;
+ break;
+ }
+ }
+ }
+ write_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);
out:
/* kfree() accepts NULL. */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5812cd1a4bc..9461cd4f540b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13029,7 +13029,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return -EINVAL;
if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
- if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
+ if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
return -EINVAL;
return kvm_alloc_memslot_metadata(kvm, new);
--
Isaku Yamahata <[email protected]>
On Wed, 2024-05-22 at 15:34 -0700, Isaku Yamahata wrote:
> option 1. Allow per-VM kvm_mmu_max_gfn()
> Pro: Conceptually easy to understand and it's straightforward to disallow
> memslot creation > virtual maxphyaddr
> Con: overkill for the corner case? The diff is attached. This is only when
> user
> space creates memslot > virtual maxphyaddr and the guest accesses GPA >
> virtual maxphyaddr)
It breaks the promise that gfn_t's don't have the shared bit, which is the pro
of hiding the shared bit in the TDP MMU iterator.
>
> option 2. Keep kvm_mmu_max_gfn() and add an ad hoc address check.
> Pro: Minimal change?
> Modify kvm_handle_noslot_fault() or kvm_faultin_pfn() to reject GPA >
> virtual maxphyaddr.
> Con: Conceptually confusing with allowing operation on GFN > virtual
> maxphyaddr.
> The change might be unnatural or ad hoc because it allows creating a
> memslot
> with GPA > virtual maxphyaddr.
I can't find any actual functional problem with just ignoring it. It's just some
extra work to go over ranges that aren't covered by the root.
How about we leave option 1 as a separate patch and note it is not functionally
required? Then we can shed it if needed. At the least it can serve as a
conversation piece in the meantime.
On Wed, May 22, 2024 at 11:09:54PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Wed, 2024-05-22 at 15:34 -0700, Isaku Yamahata wrote:
> > option 1. Allow per-VM kvm_mmu_max_gfn()
> > Pro: Conceptually easy to understand and it's straightforward to disallow
> > memslot creation > virtual maxphyaddr
> > Con: overkill for the corner case? The diff is attached. This is only when
> > user
> > space creates memslot > virtual maxphyaddr and the guest accesses GPA >
> > virtual maxphyaddr)
>
> It breaks the promise that gfn's don't have the share bit which is the pro for
> hiding the shared bit in the tdp mmu iterator.
>
> >
> > option 2. Keep kvm_mmu_max_gfn() and add an ad hoc address check.
> > Pro: Minimal change?
> > Modify kvm_handle_noslot_fault() or kvm_faultin_pfn() to reject GPA >
> > virtual maxphyaddr.
> > Con: Conceptually confusing with allowing operation on GFN > virtual
> > maxphyaddr.
> > The change might be unnatural or ad hoc because it allows creating a
> > memslot
> > with GPA > virtual maxphyaddr.
>
> I can't find any actual functional problem with just ignoring it. Just some extra
> work to go over ranges that aren't covered by the root.
>
> How about we leave option 1 as a separate patch and note it is not functionally
> required? Then we can shed it if needed. At the least it can serve as a
> conversation piece in the meantime.
Ok, we understand the situation correctly. I think it's okay to do nothing for
now, with some notes somewhere as a record, because it doesn't affect the usual
case much.
--
Isaku Yamahata <[email protected]>
On Wed, 2024-05-22 at 16:47 -0700, Isaku Yamahata wrote:
> > How about we leave option 1 as a separate patch and note it is not
> > functionally
> > required? Then we can shed it if needed. At the least it can serve as a
> > conversation piece in the meantime.
>
> Ok. We understand the situation correctly. I think it's okay to do nothing for
> now with some notes somewhere as record because it doesn't affect much for
> usual
> case.
I meant we include your proposed option 1 as a separate patch in the next
series. I'm currently writing a log for the iterator changes, and I'll note it
as an issue. And then we include this later in the same series. No?
On Wed, May 22, 2024 at 11:50:58PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Wed, 2024-05-22 at 16:47 -0700, Isaku Yamahata wrote:
> > > How about we leave option 1 as a separate patch and note it is not
> > > functionally
> > > required? Then we can shed it if needed. At the least it can serve as a
> > > conversation piece in the meantime.
> >
> > Ok. We understand the situation correctly. I think it's okay to do nothing for
> > now with some notes somewhere as record because it doesn't affect much for
> > usual
> > case.
>
> I meant we include your proposed option 1 as a separate patch in the next
> series. I'm currently writing a log for the iterator changes, and
> I'll note it as an issue. And then we include this later in the same series. No?
Ok, Let's include the patch.
--
Isaku Yamahata <[email protected]>
On Wed, 2024-05-22 at 17:01 -0700, Isaku Yamahata wrote:
> Ok, Let's include the patch.
We were discussing offline that the existing behavior of kvm_mmu_max_gfn() can
actually be improved for normal VMs. It would be more proper to key it off of
the GFN range supported by the EPT level than off the host MAXPA.
Today I was thinking that fixing this would need something like an
x86_ops.max_gfn(), so it could get at VMX stuff (usage of 4/5-level EPT). If
that exists, we might as well just call it directly in kvm_mmu_max_gfn().
Then for TDX we could just provide a TDX implementation, rather than stash the
GFN on the kvm struct? Instead it could use the gpaw stashed on struct kvm_tdx.
The op would still need to take a struct kvm.
What do you think of that alternative?
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
> + u64 old_spte, u64 new_spte,
> + int level)
> +{
> + bool was_present = is_shadow_present_pte(old_spte);
> + bool was_leaf = was_present && is_last_spte(old_spte, level);
> + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> + int ret;
> +
> + /*
> + * Allow only leaf pages to be zapped. Non-leaf page table pages
> + * are reclaimed when destroying the VM.
> + */
> + if (!was_leaf)
> + return;
> +
> + /* Zapping leaf spte is allowed only when write lock is held. */
> + lockdep_assert_held_write(&kvm->mmu_lock);
> + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
> + /* Because the write lock is held, the operation should succeed. */
> + if (KVM_BUG_ON(ret, kvm))
> + return;
> +
> + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level,
> old_pfn);
I don't see why these (zap_private_spte and remove_private_spte) can't be a
single op. Was it to prepare for huge pages support or something? In the base
series they are both only called once.
> + KVM_BUG_ON(ret, kvm);
> +}
> +
On Wed, May 22, 2024 at 05:45:39PM +0200, Paolo Bonzini wrote:
> On Wed, May 22, 2024 at 8:49 AM Yan Zhao <[email protected]> wrote:
> > > Disabling the quirk would allow KVM to choose between a slow/precise/partial zap,
> > > and full/fast zap.
> > TDX needs to disable the quirk for slow/precise/partial zap, right?
>
> Yes - and since TDX is a separate VM type it might even start with the
> quirk disabled. For sure, the memslot flag is the worst option and I'd
> really prefer to avoid it.
Thanks. Will implement a quirk and let the TDX code in QEMU disable the
quirk.
>
> > > I have the same feeling that the bug is probably not reproducible with latest
> > > KVM code
>
> Or with the latest QEMU code, if it was related somehow to non-atomic
> changes to the memory map.
>
Thanks for this input. Will check if it's related.
On Thu, May 23, 2024 at 06:27:49PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Wed, 2024-05-22 at 17:01 -0700, Isaku Yamahata wrote:
> > Ok, Let's include the patch.
>
> We were discussing offline, that actually the existing behavior of
> kvm_mmu_max_gfn() can be improved for normal VMs. It would be more proper to
> trigger it off of the GFN range supported by EPT level, than the host MAXPA.
>
> Today I was thinking, to fix this would need somthing like an x86_ops.max_gfn(),
> so it could get at VMX stuff (usage of 4/5 level EPT). If that exists we might
> as well just call it directly in kvm_mmu_max_gfn().
>
> Then for TDX we could just provide a TDX implementation, rather than stash the
> GFN on the kvm struct? Instead it could use gpaw stashed on struct kvm_tdx. The
> op would still need to be take a struct kvm.
>
> What do you think of that alternative?
I don't see the benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
But I don't have a strong preference. Either way will work.
The max_gfn for the guest is rather static once the guest is created and
initialized. Also, the existing code that uses max_gfn expects that the value
doesn't change. So we can use x86_ops.vm_init() to determine the value for VMX
and TDX. If we introduced x86_ops.max_gfn(), the implementation would simply be
"return kvm_vmx->max_gfn" or "return kvm_tdx->max_gfn". (We would have similar
for SVM and SEV.) So I don't see a benefit of x86_ops.max_gfn() over
kvm->arch.max_gfn.
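For comparison, the op would look roughly like this (a sketch; is_td() and
to_kvm_tdx() are helpers from the TDX patches, and the max_gfn members are the
proposed cached values):

static gfn_t vt_max_gfn(struct kvm *kvm)
{
	/* Return the per-VM limit cached at vm_init() time. */
	if (is_td(kvm))
		return to_kvm_tdx(kvm)->max_gfn;

	return to_kvm_vmx(kvm)->max_gfn;
}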
--
Isaku Yamahata <[email protected]>
On Thu, May 23, 2024 at 11:14:07PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
> > + u64 old_spte, u64 new_spte,
> > + int level)
> > +{
> > + bool was_present = is_shadow_present_pte(old_spte);
> > + bool was_leaf = was_present && is_last_spte(old_spte, level);
> > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > + int ret;
> > +
> > + /*
> > + * Allow only leaf pages to be zapped. Non-leaf page table pages
> > + * are reclaimed when destroying the VM.
> > + */
> > + if (!was_leaf)
> > + return;
> > +
> > + /* Zapping leaf spte is allowed only when write lock is held. */
> > + lockdep_assert_held_write(&kvm->mmu_lock);
> > + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
> > + /* Because the write lock is held, the operation should succeed. */
> > + if (KVM_BUG_ON(ret, kvm))
> > + return;
> > +
> > + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level,
> > old_pfn);
>
> I don't see why these (zap_private_spte and remove_private_spte) can't be a
> single op. Was it to prepare for huge pages support or something? In the base
> series they are both only called once.
That is for large page support. The steps to merge or split a large page are
(sketched below):
1. zap_private_spte()
2. TLB shoot down
3. merge/split_private_spte()
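In other words, roughly (a sketch; kvm_x86_split_private_spte is a placeholder
name for the future huge page hook):

static int tdx_split_private_huge_page(struct kvm *kvm, gfn_t gfn, int level)
{
	int ret;

	/* 1. Zap the huge private SPTE. */
	ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
	if (ret)
		return ret;

	/* 2. TLB shoot down, so the TDX module sees the zap as tracked. */
	kvm_flush_remote_tlbs(kvm);

	/* 3. Split (or merge) the private mapping. */
	return static_call(kvm_x86_split_private_spte)(kvm, gfn, level);
}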
--
Isaku Yamahata <[email protected]>
On Fri, 2024-05-24 at 00:55 -0700, Isaku Yamahata wrote:
> On Thu, May 23, 2024 at 06:27:49PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Wed, 2024-05-22 at 17:01 -0700, Isaku Yamahata wrote:
> > > Ok, Let's include the patch.
> >
> > We were discussing offline, that actually the existing behavior of
> > kvm_mmu_max_gfn() can be improved for normal VMs. It would be more proper to
> > trigger it off of the GFN range supported by EPT level, than the host
> > MAXPA.
> >
> > Today I was thinking, to fix this would need somthing like an
> > x86_ops.max_gfn(),
> > so it could get at VMX stuff (usage of 4/5 level EPT). If that exists we
> > might
> > as well just call it directly in kvm_mmu_max_gfn().
> >
> > Then for TDX we could just provide a TDX implementation, rather than stash
> > the
> > GFN on the kvm struct? Instead it could use gpaw stashed on struct kvm_tdx.
> > The
> > op would still need to be take a struct kvm.
> >
> > What do you think of that alternative?
>
> I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> But I don't have strong preference. Either way will work.
Non-TDX VMs won't need per-VM data, right? So it's just unneeded extra
state per VM.
>
> The max_gfn for the guest is rather static once the guest is created and
> initialized. Also the existing codes that use max_gfn expect that the value
> doesn't change. So we can use x86_ops.vm_init() to determine the value for
> VMX
> and TDX. If we introduced x86_ops.max_gfn(), the implementation will be
> simply
> return kvm_vmx->max_gfn or return kvm_tdx->max_gfn. (We would have similar for
> SVM and SEV.) So I don't see benefit of x86_ops.max_gfn() than
> kvm->arch.max_gfn.
For TDX it will be based on the shared bit, so we actually already have the
per-VM data we need. So we don't even need both gfn_shared_mask and max_gfn for TDX.
On Thu, May 16, 2024 at 4:11 AM Huang, Kai <[email protected]> wrote:
>
>
> >>>> + gfn_t raw_gfn;
> >>>> + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> >>>
> >>> Ditto. I wish we can have 'has_mirrored_private_pt'.
> >>
> >> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
> >
> > Why not helpers that wrap vm_type like:
> > https://lore.kernel.org/kvm/[email protected]/
>
> I am fine with any of them -- boolean (with either name) or helper.
Helpers are fine.
Paolo
On Fri, May 17, 2024 at 9:16 PM Isaku Yamahata <[email protected]> wrote:
>
> On Fri, May 17, 2024 at 06:16:26PM +0000,
> "Edgecombe, Rick P" <[email protected]> wrote:
>
> > On Fri, 2024-05-17 at 02:03 -0700, Isaku Yamahata wrote:
> > >
> > > On top of your patch, I created the following patch to remove
> > > kvm_gfn_for_root().
> > > Although I haven't tested it yet, I think the following shows my idea.
> > >
> > > - Add gfn_shared_mask to struct tdp_iter.
> > > - Use iter.gfn_shared_mask to determine the starting sptep in the root.
> > > - Remove kvm_gfn_for_root()
> >
> > I investigated it.
>
> Thanks for looking at it.
>
> > After this, gfn_t's never have the shared bit. It's a simple rule. The MMU mostly
> > thinks it's operating on a shared root that is mapped at the normal GFN. Only
> > the iterator knows that the shared PTEs are actually in a different location.
> > There are some negative side effects:
> > 1. The struct kvm_mmu_page's gfn doesn't match its actual mapping anymore.
> > 2. As a result of the above, the code that flushes TLBs for a specific GFN will be
> > confused. It won't functionally matter for TDX; it will just look buggy to see
> > flushing code called with the wrong gfn.
>
> flush_remote_tlbs_range() is only a Hyper-V optimization. In other cases,
> x86_ops.flush_remote_tlbs_range = NULL or the member isn't defined at compile
> time, so the remote TLB flush falls back to flushing the whole range. I don't
> expect TDX in a Hyper-V guest. I have to admit that the code looks superficially
> broken and confusing.
You could add an "&& kvm_has_private_root(kvm)" to
kvm_available_flush_remote_tlbs_range(), since
kvm_has_private_root(kvm) is sort of equivalent to "there is no 1:1
correspondence between gfn and PTE to be flushed".
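Concretely, something like (a sketch; kvm_has_private_root() is the
hypothetical helper named above, and the function grows a struct kvm argument):

static bool kvm_available_flush_remote_tlbs_range(struct kvm *kvm)
{
	/* With a private root there is no 1:1 gfn<->PTE correspondence. */
	return kvm_x86_ops.flush_remote_tlbs_range && !kvm_has_private_root(kvm);
}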
I am conflicted myself, but the upsides below are pretty substantial.
Paolo
> > On the positive effects side:
> > 1. There is code that passes sp->gfn into things that it shouldn't (if it has
> > shared bits) like memslot lookups.
> > 2. Also code that passes iter.gfn into things it shouldn't like
> > kvm_mmu_max_mapping_level().
> >
> > These places are not called by TDX, but if you know that gfn's might include
> > shared bits, then that code looks buggy.
> >
> > I think the solution in the diff is more elegant than before, because it hides
> > what is really going on with the shared root. That is both good and bad. Can we
> > accept the downsides?
>
> Kai, do you have any thoughts?
> --
> Isaku Yamahata <[email protected]>
>
On Tue, May 21, 2024 at 1:32 AM Isaku Yamahata <[email protected]> wrote:
> +static void vt_adjust_max_pa(void)
> +{
> + u64 tme_activate;
> +
> + mmu_max_gfn = __kvm_mmu_max_gfn();
> + rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
> + if (!(tme_activate & TME_ACTIVATE_LOCKED) ||
> + !(tme_activate & TME_ACTIVATE_ENABLED))
> + return;
> +
> + mmu_max_gfn -= (gfn_t)TDX_RESERVED_KEYID_BITS(tme_activate);
This would be >>=, not "-=". But I think this should not look at
TME MSRs directly; instead it can use boot_cpu_data.x86_phys_bits. You
can use it instead of shadow_phys_bits in __kvm_mmu_max_gfn() and then
VMX does not need any adjustment.
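Concretely, a sketch (untested; paraphrasing the existing helper):

static inline gfn_t __kvm_mmu_max_gfn(void)
{
	/*
	 * boot_cpu_data.x86_phys_bits already has TME/TDX KeyID bits
	 * subtracted, so no per-vendor adjustment is needed.
	 */
	int max_gpa_bits = likely(tdp_enabled) ? boot_cpu_data.x86_phys_bits : 52;

	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
}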
That said, this is not a bugfix, it's just an optimization.
Paolo
> + }
>
> out:
> /* kfree() accepts NULL. */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f89405c8bc4..c519bb9c9559 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12693,6 +12693,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> if (ret)
> goto out;
>
> + kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
> kvm_mmu_init_vm(kvm);
>
> ret = static_call(kvm_x86_vm_init)(kvm);
> @@ -13030,7 +13031,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> return -EINVAL;
>
> if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
> - if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> + if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
> return -EINVAL;
>
> #if 0
>
> --
> Isaku Yamahata <[email protected]>
>
On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
<[email protected]> wrote:
> > I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > But I don't have strong preference. Either way will work.
>
> The non-TDX VM's won't need per-VM data, right? So it's just unneeded extra
> state per-vm.
It's just a cached value, like many others in the MMU. It's easier
for me to read code without the mental overhead of a function call.
> For TDX it will be based on the shared bit, so we actually already have the per-
> vm data we need. So we don't even need both gfn_shared_mask and max_gfn for TDX.
But they are independent; for example, AMD placed the encryption bit
highest, then the reduced physical address space bits, then finally
the rest of the gfn. I think it's consistent with the kvm_has_*
approach to not assume much and just store separate data.
Paolo
On Tue, 2024-05-28 at 19:16 +0200, Paolo Bonzini wrote:
> > > After this, gfn_t's never have shared bit. It's a simple rule. The MMU
> > > mostly
> > > thinks it's operating on a shared root that is mapped at the normal GFN.
> > > Only
> > > the iterator knows that the shared PTEs are actually in a different
> > > location.
> > >
> > > There are some negative side effects:
> > > 1. The struct kvm_mmu_page's gfn doesn't match its actual mapping
> > > anymore.
> > > 2. As a result of above, the code that flushes TLBs for a specific GFN
> > > will be
> > > confused. It won't functionally matter for TDX, just look buggy to see
> > > flushing
> > > code called with the wrong gfn.
> >
> > flush_remote_tlbs_range() is only for Hyper-V optimization. In other cases,
> > x86_ops.flush_remote_tlbs_range = NULL or the member isn't defined at compile
> > time. So the remote tlb flush falls back to flushing whole range. I don't
> > expect TDX in hyper-V guest. I have to admit that the code looks
> > superficially
> > broken and confusing.
>
> You could add an "&& kvm_has_private_root(kvm)" to
> kvm_available_flush_remote_tlbs_range(), since
> kvm_has_private_root(kvm) is sort of equivalent to "there is no 1:1
> correspondence between gfn and PTE to be flushed".
>
> I am conflicted myself, but the upsides below are pretty substantial.
It looks like kvm_available_flush_remote_tlbs_range() is not checked in many of
the paths that get to x86_ops.flush_remote_tlbs_range().
So maybe something like:
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 65bbda95acbb..e09bb6c50a0b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1959,14 +1959,7 @@ static inline int kvm_arch_flush_remote_tlbs(struct kvm
*kvm)
#if IS_ENABLED(CONFIG_HYPERV)
#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
-static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
- u64 nr_pages)
-{
- if (!kvm_x86_ops.flush_remote_tlbs_range)
- return -EOPNOTSUPP;
-
- return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
-}
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
#endif /* CONFIG_HYPERV */
enum kvm_intr_type {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43d70f4c433d..9dc1b3db286d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14048,6 +14048,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu,
unsigned int size,
}
EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
+{
+ if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))
+ return -EOPNOTSUPP;
+
+ return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
+}
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
Regarding the kvm_gfn_direct_mask() usage: in the current WIP code we have
renamed things around the concepts of "mirrored roots" and "direct masks". The
mirrored root just means "also go off and update something else" (S-EPT). The
direct mask just means that, when on the direct root, the actual page table
mapping is shifted using the mask (shared memory). Kai raised that all the TDX
special stuff in the x86 MMU around handling private memory is confusing from
the SEV perspective, so we were trying to rename those things to something
related, but generic instead of "private".
So the TLB flush confusion is more about the fact that direct GFNs are shifted
by something (i.e. kvm_gfn_direct_mask() returns non-zero).
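As a rough illustration (a sketch; the helper name is made up, while
kvm_gfn_direct_mask() is from the WIP code):

/*
 * On the direct (shared) root of a TDX guest, a gfn is actually mapped at
 * gfn | direct_mask; the mirrored (private) root uses the raw gfn.
 */
static inline gfn_t tdp_mmu_gfn_to_direct(struct kvm *kvm, gfn_t gfn)
{
	return gfn | kvm_gfn_direct_mask(kvm);
}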
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter
> *iter, u64 new_spte)
> {
> u64 *sptep = rcu_dereference(iter->sptep);
>
> @@ -542,15 +671,42 @@ static inline int __tdp_mmu_set_spte_atomic(struct
> tdp_iter *iter, u64 new_spte)
> */
> WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
>
> - /*
> - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> - * does not hold the mmu_lock. On failure, i.e. if a different
> logical
> - * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> - * the current value, so the caller operates on fresh data, e.g. if it
> - * retries tdp_mmu_set_spte_atomic()
> - */
> - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> - return -EBUSY;
> + if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> + int ret;
> +
> + if (is_shadow_present_pte(new_spte)) {
> + /*
> + * Populating case.
> + * - set_private_spte_present() implements
> + * 1) Freeze SPTE
> + * 2) call hooks to update private page table,
> + * 3) update SPTE to new_spte
> + * - handle_changed_spte() only updates stats.
> + */
> + ret = set_private_spte_present(kvm, iter->sptep, iter-
> >gfn,
> + iter->old_spte,
> new_spte, iter->level);
> + if (ret)
> + return ret;
> + } else {
> + /*
> + * Zapping case.
> + * Zap is only allowed when write lock is held
> + */
> + if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte)))
This is inside an else block for (is_shadow_present_pte(new_spte)), so the WARN
condition will always be true if it gets here. But it can't get here, because
TDX doesn't do any atomic zapping.
We can remove the conditional, but regarding the WARN, any recollection of
what might have been going on here originally?
> + return -EBUSY;
> + }
On Fri, 2024-05-24 at 01:20 -0700, Isaku Yamahata wrote:
> >
> > I don't see why these (zap_private_spte and remove_private_spte) can't be a
> > single op. Was it to prepare for huge pages support or something? In the
> > base
> > series they are both only called once.
>
> That is for large page support. The steps to merge or split a large page are:
> 1. zap_private_spte()
> 2. tlb shoot down
> 3. merge/split_private_spte()
I think we can simplify it for now. Otherwise we can't justify it without
getting into the huge page support.
Looking at how to create some more explainable code here, I'm also wondering
about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't realize
it sends IPIs to each vCPU for *each* page getting zapped. Another one for
the "to optimize later" bucket, I guess. And I guess it won't happen very often.
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared)
> + u64 old_spte, u64 new_spte,
> + union kvm_mmu_page_role role, bool shared)
> {
> + bool is_private = kvm_mmu_page_role_is_private(role);
> + int level = role.level;
> bool was_present = is_shadow_present_pte(old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> bool was_leaf = was_present && is_last_spte(old_spte, level);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> + bool pfn_changed = old_pfn != new_pfn;
>
> WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int
> as_id, gfn_t gfn,
>
> if (was_leaf && is_dirty_spte(old_spte) &&
> (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> + kvm_set_pfn_dirty(old_pfn);
>
> /*
> * Recursively handle child PTs if the change removed a subtree from
> @@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int
> as_id, gfn_t gfn,
> * pages are kernel allocations and should never be migrated.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> + KVM_BUG_ON(is_private !=
> is_private_sptep(spte_to_child_pt(old_spte, level)),
> + kvm);
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> shared);
> + }
> +
> + if (is_private && !is_present)
> + handle_removed_private_spte(kvm, gfn, old_spte, new_spte,
> role.level);
I'm a little bothered by the asymmetry of where the mirrored hooks get called
between setting and zapping PTEs. Tracing through the code, the relevant
operations that are needed for TDX are:
1. tdp_mmu_iter_set_spte() from tdp_mmu_zap_leafs() and __tdp_mmu_zap_root()
2. tdp_mmu_set_spte_atomic() is used for mapping, linking
(1) is a simple case because the mmu_lock is held for writes. It updates the
mirror root like normal, then has extra logic to call out to update the S-EPT.
(2) on the other hand just has the read lock, so it has to do the whole
operation in a special way. First set REMOVED_SPTE, then update the private
copy, then write to the mirror page tables. It can't get stuffed into
handle_changed_spte() because it has to write REMOVED_SPTE first.
In some ways it makes sense to update the S-EPT there. Despite the claim that
"handle_changed_spte() only updates stats", it does some updating of other PTEs
based on the current PTE change, which is pretty similar to what the mirrored
PTEs are doing. But we can't really do the setting of present PTEs there because
of the REMOVED_SPTE stuff.
So we could only make it more symmetrical by moving the S-EPT ops out of
handle_changed_spte() and manually call it in the two places relevant for TDX,
like the below.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e966986bb9f2..c9ddb1c2a550 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t
pt, bool shared)
*/
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
REMOVED_SPTE, level);
+
+ if (is_mirror_sp(sp))
+ reflect_removed_spte(kvm, gfn, old_spte,
REMOVED_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
old_spte, REMOVED_SPTE, sp->role, shared);
@@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id,
gfn_t gfn,
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
shared);
}
- if (is_mirror && !is_present)
- reflect_removed_spte(kvm, gfn, old_spte, new_spte, role.level);
-
if (was_leaf && is_accessed_spte(old_spte) &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
@@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
tdp_ptep_t sptep,
new_spte, level), kvm);
}
+ if (is_mirror_sptep(sptep))
+ reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
+
role = sptep_to_sp(sptep)->role;
role.level = level;
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
Otherwise, we could move the "set present" mirroring operations into
handle_changed_spte(), and have some earlier conditional logic do the
REMOVED_SPTE parts. It starts to become more scattered.
Anyway, it's just a code clarity thing arising from having a hard time
explaining the design in the log. Any opinions?
A separate but related comment is below.
>
> if (was_leaf && is_accessed_spte(old_spte) &&
> (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> }
>
> @@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> {
> + union kvm_mmu_page_role role;
> +
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> @@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> tdp_ptep_t sptep,
> WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
>
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> + if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> + is_shadow_present_pte(new_spte)) {
> > + /* Because the write spin lock is held, there is no race. It should
> > succeed. */
> + KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn,
> old_spte,
> + new_spte, level), kvm);
> + }
Based on the above enumeration, I don't see how this hunk gets used.
>
> - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> false);
> + role = sptep_to_sp(sptep)->role;
> + role.level = level;
> + handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
> return old_spte;
> }
>
On Tue, May 28, 2024 at 06:29:59PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Tue, 2024-05-28 at 19:16 +0200, Paolo Bonzini wrote:
> > > > After this, gfn_t's never have shared bit. It's a simple rule. The MMU
> > > > mostly
> > > > thinks it's operating on a shared root that is mapped at the normal GFN.
> > > > Only
> > > > the iterator knows that the shared PTEs are actually in a different
> > > > location.
> > > >
> > > > There are some negative side effects:
> > > > 1. The struct kvm_mmu_page's gfn doesn't match its actual mapping
> > > > anymore.
> > > > 2. As a result of above, the code that flushes TLBs for a specific GFN
> > > > will be
> > > > confused. It won't functionally matter for TDX, just look buggy to see
> > > > flushing
> > > > code called with the wrong gfn.
> > >
> > > flush_remote_tlbs_range() is only for Hyper-V optimization. In other cases,
> > > x86_ops.flush_remote_tlbs_range = NULL or the member isn't defined at compile
> > > time. So the remote tlb flush falls back to flushing whole range. I don't
> > > expect TDX in hyper-V guest. I have to admit that the code looks
> > > superficially
> > > broken and confusing.
> >
> > You could add an "&& kvm_has_private_root(kvm)" to
> > kvm_available_flush_remote_tlbs_range(), since
> > kvm_has_private_root(kvm) is sort of equivalent to "there is no 1:1
> > correspondence between gfn and PTE to be flushed".
> >
> > I am conflicted myself, but the upsides below are pretty substantial.
>
> It looks like kvm_available_flush_remote_tlbs_range() is not checked in many of
> the paths that get to x86_ops.flush_remote_tlbs_range().
>
> So maybe something like:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 65bbda95acbb..e09bb6c50a0b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1959,14 +1959,7 @@ static inline int kvm_arch_flush_remote_tlbs(struct kvm
> *kvm)
>
> #if IS_ENABLED(CONFIG_HYPERV)
> #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> -static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
> - u64 nr_pages)
> -{
> - if (!kvm_x86_ops.flush_remote_tlbs_range)
> - return -EOPNOTSUPP;
> -
> - return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
> -}
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
> #endif /* CONFIG_HYPERV */
>
> enum kvm_intr_type {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 43d70f4c433d..9dc1b3db286d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14048,6 +14048,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu,
> unsigned int size,
> }
> EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
> +{
> + if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))
> + return -EOPNOTSUPP;
> +
> + return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
> +}
> +
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
kvm_x86_ops.flush_remote_tlbs_range() is defined only when CONFIG_HYPERV=y.
We need #ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE ... #endif around the
function.
--
Isaku Yamahata <[email protected]>
On Tue, May 28, 2024 at 09:48:45PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Fri, 2024-05-24 at 01:20 -0700, Isaku Yamahata wrote:
> > >
> > > I don't see why these (zap_private_spte and remove_private_spte) can't be a
> > > single op. Was it to prepare for huge pages support or something? In the
> > > base
> > > series they are both only called once.
> >
> > That is for large page support. The steps to merge or split a large page are:
> > 1. zap_private_spte()
> > 2. tlb shoot down
> > 3. merge/split_private_spte()
>
> I think we can simplify it for now. Otherwise we can't justify it without
> getting into the huge page support.
Ok. Now we don't care large page support, we can combine those hooks into single
hook.
> Looking at how to create some more explainable code here, I'm also wondering
> about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't realize
> it will send IPIs to each vcpu for *each* page getting zapped. Another one in
> the "to optimize later" bucket I guess. And I guess it won't happen very often.
We need it. Without tracking (or a TLB shoot down), we'll hit
TDX_TLB_TRACKING_NOT_DONE. The TDX module has to guarantee that there are no
remaining TLB entries for pages freed by TDH.MEM.PAGE.REMOVE().
--
Isaku Yamahata <[email protected]>
On Tue, May 28, 2024 at 08:54:31PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter
> > *iter, u64 new_spte)
> > {
> > u64 *sptep = rcu_dereference(iter->sptep);
> >
> > @@ -542,15 +671,42 @@ static inline int __tdp_mmu_set_spte_atomic(struct
> > tdp_iter *iter, u64 new_spte)
> > */
> > WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
> >
> > - /*
> > - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> > - * does not hold the mmu_lock. On failure, i.e. if a different
> > logical
> > - * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> > - * the current value, so the caller operates on fresh data, e.g. if it
> > - * retries tdp_mmu_set_spte_atomic()
> > - */
> > - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> > - return -EBUSY;
> > + if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> > + int ret;
> > +
> > + if (is_shadow_present_pte(new_spte)) {
> > + /*
> > + * Populating case.
> > + * - set_private_spte_present() implements
> > + * 1) Freeze SPTE
> > + * 2) call hooks to update private page table,
> > + * 3) update SPTE to new_spte
> > + * - handle_changed_spte() only updates stats.
> > + */
> > + ret = set_private_spte_present(kvm, iter->sptep, iter-
> > >gfn,
> > + iter->old_spte,
> > new_spte, iter->level);
> > + if (ret)
> > + return ret;
> > + } else {
> > + /*
> > + * Zapping case.
> > + * Zap is only allowed when write lock is held
> > + */
> > + if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte)))
>
> This is inside an else block for (is_shadow_present_pte(new_spte)), so the WARN
> condition will always be true if it gets here. But it can't get here, because
> TDX doesn't do any atomic zapping.
>
> We can remove the conditional, but regarding the WARN, any recollection of
> what might have been going on here originally?
We had an optimization that added other states in addition to present and
non-present. When I dropped it, I should've dropped the else clause too.
--
Isaku Yamahata <[email protected]>
On Tue, 2024-05-28 at 18:16 -0700, Isaku Yamahata wrote:
> > Looking at how to create some more explainable code here, I'm also wondering
> > about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't
> > realize
> > it will send IPIs to each vcpu for *each* page getting zapped. Another one
> > in
> > the "to optimize later" bucket I guess. And I guess it won't happen very
> > often.
>
> We need it. Without tracking (or TLB shoot down), we'll hit
> TDX_TLB_TRACKING_NOT_DONE. The TDX module has to guarantee that there are no
> remaining TLB entries for pages freed by TDH.MEM.PAGE.REMOVE().
It can't be removed without other changes, but the TDX module doesn't enforce
that you have to zap and shoot down a page at a time, right? Like it could be
batched.
On Tue, 2024-05-28 at 18:06 -0700, Isaku Yamahata wrote:
>
> kvm_x86_ops.flush_remote_tlbs_range() is defined only when CONFIG_HYPERV=y.
> We need #ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE ... #endif around the
> function.
Oh, right. Thanks.
On Tue, May 28, 2024 at 11:06:45PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > - u64 old_spte, u64 new_spte, int level,
> > - bool shared)
> > + u64 old_spte, u64 new_spte,
> > + union kvm_mmu_page_role role, bool shared)
> > {
> > + bool is_private = kvm_mmu_page_role_is_private(role);
> > + int level = role.level;
> > bool was_present = is_shadow_present_pte(old_spte);
> > bool is_present = is_shadow_present_pte(new_spte);
> > bool was_leaf = was_present && is_last_spte(old_spte, level);
> > bool is_leaf = is_present && is_last_spte(new_spte, level);
> > - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> > + bool pfn_changed = old_pfn != new_pfn;
> >
> > WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> > WARN_ON_ONCE(level < PG_LEVEL_4K);
> > @@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id, gfn_t gfn,
> >
> > if (was_leaf && is_dirty_spte(old_spte) &&
> > (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> > - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > + kvm_set_pfn_dirty(old_pfn);
> >
> > /*
> > * Recursively handle child PTs if the change removed a subtree from
> > @@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id, gfn_t gfn,
> > * pages are kernel allocations and should never be migrated.
> > */
> > if (was_present && !was_leaf &&
> > - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> > + KVM_BUG_ON(is_private !=
> > is_private_sptep(spte_to_child_pt(old_spte, level)),
> > + kvm);
> > handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> > shared);
> > + }
> > +
> > + if (is_private && !is_present)
> > + handle_removed_private_spte(kvm, gfn, old_spte, new_spte,
> > role.level);
>
> I'm a little bothered by the asymmetry of where the mirrored hooks get called
> between setting and zapping PTEs. Tracing through the code, the relevant
> operations that are needed for TDX are:
> 1. tdp_mmu_iter_set_spte() from tdp_mmu_zap_leafs() and __tdp_mmu_zap_root()
> 2. tdp_mmu_set_spte_atomic() is used for mapping, linking
>
> (1) is a simple case because the mmu_lock is held for writes. It updates the
> mirror root like normal, then has extra logic to call out to update the S-EPT.
>
> (2) on the other hand just has the read lock, so it has to do the whole
> operation in a special way. First set REMOVED_SPTE, then update the private
> copy, then write to the mirror page tables. It can't get stuffed into
> handle_changed_spte() because it has to write REMOVED_SPTE first.
>
> In some ways it makes sense to update the S-EPT. Despite claiming
> "handle_changed_spte() only updates stats.", it does some updating of other PTEs
> based on the current PTE change. Which is pretty similar to what the mirrored
> PTEs are doing. But we can't really do the setting of present PTEs because of
> the REMOVED_SPTE stuff.
>
> So we could only make it more symmetrical by moving the S-EPT ops out of
> handle_changed_spte() and manually call it in the two places relevant for TDX,
> like the below.
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e966986bb9f2..c9ddb1c2a550 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t
> pt, bool shared)
> */
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> REMOVED_SPTE, level);
> +
> + if (is_mirror_sp(sp))
> + reflect_removed_spte(kvm, gfn, old_spte,
> REMOVED_SPTE, level);
Calling the callback before handling the lower levels will result in an error.
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> old_spte, REMOVED_SPTE, sp->role, shared);
We should call it here, after processing the lower levels.
> @@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id,
> gfn_t gfn,
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> shared);
> }
>
> - if (is_mirror && !is_present)
> - reflect_removed_spte(kvm, gfn, old_spte, new_spte, role.level);
> -
> if (was_leaf && is_accessed_spte(old_spte) &&
> (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> @@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> tdp_ptep_t sptep,
> new_spte, level), kvm);
> }
>
> + if (is_mirror_sptep(sptep))
> + reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> +
Ditto.
> role = sptep_to_sp(sptep)->role;
> role.level = level;
> handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
The callback should be here. It should come after handling the lower levels.
> Otherwise, we could move the "set present" mirroring operations into
> handle_changed_spte(), and have some earlier conditional logic do the
> REMOVED_SPTE parts. It starts to become more scattered.
> Anyway, it's just a code clarity thing arising from having hard time explaining
> the design in the log. Any opinions?
Originally I tried to consolidate the callbacks by following the TDP MMU's use
of handle_changed_spte(). Anyway, we can pick from the two outcomes based on
which is easier to understand/maintain.
> A separate but related comment is below.
>
> >
> > if (was_leaf && is_accessed_spte(old_spte) &&
> > (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> > kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > }
> >
> > @@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> > {
> > + union kvm_mmu_page_role role;
> > +
> > lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > /*
> > @@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> > WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
> >
> > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > + if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> > + is_shadow_present_pte(new_spte)) {
> > + /* Because the write spin lock is held, there is no race. It should
> > succeed. */
> > + KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn,
> > old_spte,
> > + new_spte, level), kvm);
> > + }
>
> Based on the above enumeration, I don't see how this hunk gets used.
I should've removed it. This is leftover from the old patches.
--
Isaku Yamahata <[email protected]>
On Tue, 2024-05-28 at 18:57 -0700, Isaku Yamahata wrote:
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm,
> > tdp_ptep_t
> > pt, bool shared)
> > */
> > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> > REMOVED_SPTE,
> > level);
> > +
> > + if (is_mirror_sp(sp))
> > + reflect_removed_spte(kvm, gfn, old_spte,
> > REMOVED_SPTE, level);
>
> Calling the callback before handling the lower levels will result in an error.
Hmm, yea the order is changed. It didn't result in an error for some reason
though. Can you elaborate?
>
>
> > }
> > handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> > old_spte, REMOVED_SPTE, sp->role,
> > shared);
>
>
> We should call it here, after processing the lower levels.
>
>
>
> > @@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id,
> > gfn_t gfn,
> > handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> > shared);
> > }
> >
> > - if (is_mirror && !is_present)
> > - reflect_removed_spte(kvm, gfn, old_spte, new_spte,
> > role.level);
> > -
> > if (was_leaf && is_accessed_spte(old_spte) &&
> > (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> > kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > @@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> > new_spte, level),
> > kvm);
> > }
> >
> > + if (is_mirror_sptep(sptep))
> > + reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE,
> > level);
> > +
>
> Ditto.
>
>
> > role = sptep_to_sp(sptep)->role;
> > role.level = level;
> > handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role,
> > false);
>
> The callback should be here. It should come after handling the lower levels.
Ok, let me try.
>
>
>
> > Otherwise, we could move the "set present" mirroring operations into
> > handle_changed_spte(), and have some earlier conditional logic do the
> > REMOVED_SPTE parts. It starts to become more scattered.
> > Anyway, it's just a code clarity thing arising from having hard time
> > explaining
> > the design in the log. Any opinions?
>
> Originally I tried to consolidate the callbacks by following TDP MMU using
> handle_changed_spte().
How did it handle the REMOVED_SPTE part of the set_present() path?
> Anyway we can pick from two outcomes based on which is
> easy to understand/maintain.
I guess I can try to generate a diff of the other one and we can compare. It's a
matter of opinion, but I think splitting it between the two methods is the most
confusing.
On Tue, 2024-05-28 at 19:47 +0200, Paolo Bonzini wrote:
> On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
> <[email protected]> wrote:
> > > I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > > But I don't have strong preference. Either way will work.
> >
> > The non-TDX VM's won't need per-VM data, right? So it's just unneeded extra
> > state per-vm.
>
> It's just a cached value like there are many in the MMU. It's easier
> for me to read code without the mental overhead of a function call.
Ok. Since this has (optimization) utility beyond TDX, maybe it's worth splitting
it off as a separate patch? I think maybe we'll pursue this path unless there is
objection.
>
> > For TDX it will be based on the shared bit, so we actually already have the
> > per-
> > vm data we need. So we don't even need both gfn_shared_mask and max_gfn for
> > TDX.
>
> But they are independent, for example AMD placed the encryption bit
> highest, then the reduced physical address space bits, then finally
> the rest of the gfn. I think it's consistent with the kvm_has_*
> approach, to not assume much and just store separate data.
I meant that for a TDX-specific x86_ops implementation we already have the data
needed to compute it (gfn_shared_mask - 1). I didn't realize SEV would benefit
from this too.
On Wed, May 29, 2024 at 01:50:05AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Tue, 2024-05-28 at 18:16 -0700, Isaku Yamahata wrote:
> > > Looking at how to create some more explainable code here, I'm also wondering
> > > about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't
> > > realize
> > > it will send IPIs to each vcpu for *each* page getting zapped. Another one
> > > in
> > > the "to optimize later" bucket I guess. And I guess it won't happen very
> > > often.
> >
> > We need it. Without tracking (or TLB shoot down), we'll hit
> > TDX_TLB_TRACKING_NOT_DONE. The TDX module has to guarantee that there are no
> > remaining TLB entries for pages freed by TDH.MEM.PAGE.REMOVE().
>
> It can't be removed without other changes, but the TDX module doesn't enforce
> that you have to zap and shoot down a page at a time, right? Like it could be
> batched.
Right, the TDX module doesn't enforce it. If we want to batch zapping, we'd
need to track the SPTE state: zapped, TLB shootdown not yet done, and not yet
removed. It's simpler to issue a TLB shootdown per page for now. It could be a
future optimization.
At runtime, zapping happens on memory conversion (private -> shared) or
memslot deletion. Because that's not frequent, we don't have to care.
For VM destruction, it's simpler to skip the TLB shootdown by deleting the
HKID first than to track SPTE state for batching TLB shootdowns.
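For reference, the current per-page flow at runtime is roughly (simplified;
SEAMCALL wrappers as in the TDX series):

	/* Block new TLB fills for the private GPA. */
	tdh_mem_range_block(...);

	/*
	 * tdx_track(): TDH.MEM.TRACK to bump the TLB epoch, then IPI all
	 * vCPUs so no stale TLB entry can survive.
	 */
	tdx_track(kvm);

	/*
	 * Without the tdx_track() above, the TDX module fails this with
	 * TDX_TLB_TRACKING_NOT_DONE.
	 */
	tdh_mem_page_remove(...);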
--
Isaku Yamahata <[email protected]>
On Tue, 2024-05-28 at 19:20 -0700, Isaku Yamahata wrote:
> Right, the TDX module doesn't enforce it. If we want to batch zapping, we'd
> need to track the SPTE state: zapped, TLB shootdown not yet done, and not yet
> removed. It's simpler to issue a TLB shootdown per page for now. It could be a
> future optimization.
Totally agree we should not change it now. It's just on the list of
not-yet-optimized things.
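For the record, batching would mean tracking per-SPTE teardown state,
something like (purely hypothetical sketch):

	/* Hypothetical states if removal were batched */
	enum private_spte_state {
		SPTE_PRESENT,	/* mapped in the S-EPT */
		SPTE_BLOCKED,	/* zapped, TLB shootdown not done yet */
		SPTE_TRACKED,	/* shootdown done, removal still pending */
	};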
>
> At runtime, zapping happens on memory conversion (private -> shared) or
> memslot deletion. Because that's not frequent, we don't have to care.
Not sure I agree on this part. But in any case we can discuss it when we are in
the happy situation of upstream TDX users existing and complaining about things.
A great thing about the current approach, though: it's obviously correct.
> For VM destruction, it's simpler to skip the TLB shootdown by deleting the
> HKID first than to track SPTE state for batching TLB shootdowns.
On Wed, May 29, 2024 at 4:14 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Tue, 2024-05-28 at 19:47 +0200, Paolo Bonzini wrote:
> > On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
> > <[email protected]> wrote:
> > > > I don't see the benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > > > But I don't have a strong preference. Either way will work.
> > >
> > > The non-TDX VMs won't need per-VM data, right? So it's just unneeded extra
> > > state per-VM.
> >
> > It's just a cached value like there are many in the MMU. It's easier
> > for me to read code without the mental overhead of a function call.
>
> Ok. Since this has (optimization) utility beyond TDX, maybe it's worth splitting
> it off as a separate patch? I think we'll pursue this path unless there is an
> objection.
Yes, absolutely.
Paolo
On Wed, May 29, 2024 at 02:13:24AM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:
> On Tue, 2024-05-28 at 18:57 -0700, Isaku Yamahata wrote:
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> > > 		 */
> > > 		old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, REMOVED_SPTE, level);
> > > +
> > > +		if (is_mirror_sp(sp))
> > > +			reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> >
> > The callback before handling lower level will result in error.
>
Hmm, yeah, the order is changed. It didn't result in an error for some reason
though. Can you elaborate?
TDH.MEM.{PAGE, SEPT}.REMOVE() needs to be issued starting from the leaf. I
guess zapping is only done at the leaf level by tdp_mmu_zap_leafs(), so the
subtree zapping case wasn't exercised.
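Schematically, for each child SPTE of the removed page table the order
needs to be (untested sketch):

	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, REMOVED_SPTE, level);

	/* Recurses and tears down any lower-level S-EPT pages first. */
	handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, old_spte,
			    REMOVED_SPTE, role, shared);

	/* Only after the whole subtree is gone can this level be removed. */
	if (is_mirror_sp(sp))
		reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);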
> > > Otherwise, we could move the "set present" mirroring operations into
> > > handle_changed_spte(), and have some earlier conditional logic do the
> > > REMOVED_SPTE parts. It starts to become more scattered.
> > > Anyway, it's just a code clarity thing arising from having a hard time
> > > explaining the design in the log. Any opinions?
> >
> > Originally I tried to consolidate the callbacks by following the TDP MMU's
> > use of handle_changed_spte().
>
> How did it handle the REMOVED_SPTE part of the set_present() path?
is_removed_pt() was used. It was ugly.
--
Isaku Yamahata <[email protected]>
On Wed, May 29, 2024 at 09:25:46AM +0200,
Paolo Bonzini <[email protected]> wrote:
> On Wed, May 29, 2024 at 4:14 AM Edgecombe, Rick P
> <[email protected]> wrote:
> >
> > On Tue, 2024-05-28 at 19:47 +0200, Paolo Bonzini wrote:
> > > On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
> > > <[email protected]> wrote:
> > > > > I don't see the benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > > > > But I don't have a strong preference. Either way will work.
> > > >
> > > > The non-TDX VMs won't need per-VM data, right? So it's just unneeded extra
> > > > state per-VM.
> > >
> > > It's just a cached value like there are many in the MMU. It's easier
> > > for me to read code without the mental overhead of a function call.
> >
> > Ok. Since this has (optimization) utility beyond TDX, maybe it's worth splitting
> > it off as a separate patch? I think we'll pursue this path unless there is an
> > objection.
>
> Yes, absolutely.
Ok, let me cook an independent patch series for kvm-coco-queue.
--
Isaku Yamahata <[email protected]>