This series is the result of digging into why deleting a memslot, which on
x86 forces all vCPUs to reload a new MMU root, causes noticeably more jitter
in vCPUs and other tasks when running with the TDP MMU than the Shadow MMU
(with TDP enabled).
Patch 1 addresses the most obvious issue by simply zapping at a finer
granularity so that if a different task, e.g. a vCPU, wants to run on the
pCPU doing the zapping, it doesn't have to wait for KVM to zap an entire
1GiB region, which can take hundreds of microseconds (or more). The
shadow MMU checks for need_resched() (and mmu_lock contention, see below)
every 10 zaps, which is why the shadow MMU doesn't induce the same level
of jitter.
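
To make that concrete, here is a rough userspace sketch (not KVM code; the
chunk size and sched_yield() are stand-ins for the zap granularity and the
cond_resched()/mmu_lock contention checks) of doing a large batch of work in
small chunks so that other runnable tasks aren't starved for the whole batch:

/*
 * Illustrative only: "zap" a large number of units, yielding the CPU
 * between small chunks instead of only after the entire batch.
 */
#include <sched.h>
#include <stdio.h>

#define TOTAL_UNITS	(1 << 18)
#define CHUNK_UNITS	512		/* "finer granularity" */

static volatile unsigned long zapped;

static void zap_one_unit(void)
{
	zapped++;	/* stand-in for zapping a single SPTE */
}

int main(void)
{
	for (int i = 0; i < TOTAL_UNITS; i++) {
		zap_one_unit();

		/* Yield periodically rather than only at the very end. */
		if ((i % CHUNK_UNITS) == 0)
			sched_yield();
	}
	printf("zapped %lu units\n", zapped);
	return 0;
}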
On preemptible kernels, zapping at 4KiB granularity will also cause the
zapping task to yield mmu_lock much more aggressively if a writer comes
along. That _sounds_ like a good thing, and most of the time it is, but
sometimes bouncing mmu_lock can be a big net negative:
https://lore.kernel.org/all/[email protected]
While trying to figure out whether or not frequently yielding mmu_lock
would be a net negative or a net positive, I ran into extremely high
latencies for loading TDP MMU roots on VMs with large-ish numbers of vCPUs,
e.g. a vCPU could end up taking more than a second to load a root.
Long story short, the issue is that the TDP MMU acquires mmu_lock for
write when unloading roots, and again when loading a "new" root (in quotes
because most vCPUs end up loading an existing root). With a decent number
of vCPUs, that results in a _lot_ of mmu_lock contention, as every vCPU will
take and release mmu_lock for write to unload its roots, and then again to
load a new root. Due to rwlock's fairness (waiting writers block new
readers), the contention can result in rather nasty worst case scenarios.
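
To illustrate the access pattern (a minimal pthread sketch, not KVM code,
using a pthread rwlock as a stand-in for mmu_lock; it models only the two
back-to-back write-lock acquisitions per vCPU, not rwlock's queued fairness;
compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define NR_VCPUS	8

static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *vcpu_reload_root(void *arg)
{
	long id = (long)arg;

	pthread_rwlock_wrlock(&mmu_lock);	/* "unload" the current root */
	pthread_rwlock_unlock(&mmu_lock);

	pthread_rwlock_wrlock(&mmu_lock);	/* "load" a new root */
	pthread_rwlock_unlock(&mmu_lock);

	printf("vCPU %ld reloaded its root\n", id);
	return NULL;
}

int main(void)
{
	pthread_t vcpus[NR_VCPUS];

	for (long i = 0; i < NR_VCPUS; i++)
		pthread_create(&vcpus[i], NULL, vcpu_reload_root, (void *)i);
	for (int i = 0; i < NR_VCPUS; i++)
		pthread_join(vcpus[i], NULL);
	return 0;
}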
Patches 6-8 fix the issues by taking mmu_lock for read. The free path is
very straightforward and doesn't require any new protection (IIRC, the only
reason we didn't pursue this when reworking the TDP MMU zapping back at the
end of 2021 was because we had bigger issues to solve). Allocating a new
root with mmu_lock held for read is a little harder, but still fairly easy.
KVM only needs to ensure that it doesn't create duplicate roots, because
everything that needs mmu_lock to ensure ordering must take mmu_lock for
write, i.e. is still mutually exclusive with new roots coming along.
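
The resulting flow is essentially check-then-create; below is a simplified
userspace sketch of the locking shape (a single cached "root" and a pthread
rwlock, purely illustrative, not the actual KVM code; compile with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct root { int role; };

static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct root *cached_root;	/* protected by mmu_lock */

static struct root *find_root(int role)
{
	/* Stand-in for walking the list of valid roots. */
	if (cached_root && cached_root->role == role)
		return cached_root;
	return NULL;
}

static struct root *get_or_create_root(int role)
{
	struct root *root;

	/* Fast path: most callers find an existing root under the read lock. */
	pthread_rwlock_rdlock(&mmu_lock);
	root = find_root(role);
	pthread_rwlock_unlock(&mmu_lock);
	if (root)
		return root;

	/* Slow path: recheck under the write lock to avoid duplicate roots. */
	pthread_rwlock_wrlock(&mmu_lock);
	root = find_root(role);
	if (!root) {
		root = malloc(sizeof(*root));
		root->role = role;
		cached_root = root;
	}
	pthread_rwlock_unlock(&mmu_lock);
	return root;
}

int main(void)
{
	printf("root %p\n", (void *)get_or_create_root(0));
	printf("root %p\n", (void *)get_or_create_root(0));
	return 0;
}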
Patches 2-5 are small cleanups to avoid doing work for invalid roots, e.g.
when zapping SPTEs purely to affect guest behavior, there's no need to zap
invalid roots because they are unreachable from the guest.
All told, this significantly reduces mmu_lock contention when doing a fast
zap, i.e. when deleting memslots, and takes the worst case latency for a
vCPU to load a new root from >3ms to <100us for large-ish VMs (100+ vCPUs).
For small and medium-sized VMs (<24 vCPUs), the vast majority of loads take
less than 1us, with the worst case being <10us, versus >200us without
this series.
Note, I did all of the latency testing before the holidays, and then
managed to lose almost all of my notes, which is why I don't have more
precise data on the exact setups and latency bins. /facepalm
Sean Christopherson (8):
KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
KVM: x86/mmu: Don't do TLB flush when zapping SPTEs in invalid roots
KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for
read
KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read
arch/x86/kvm/mmu/mmu.c | 33 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 124 ++++++++++++++++++++++++++-----------
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 111 insertions(+), 48 deletions(-)
base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.43.0.275.g3460e3d667-goog
Zap invalidated TDP MMU roots at maximum granularity, i.e. with more
frequent conditional resched checkpoints, in order to avoid running for an
extended duration (milliseconds, or worse) without honoring a reschedule
request. And for kernels running with full or real-time preempt models,
zapping at 4KiB granularity also provides significantly reduced latency
for other tasks that are contending for mmu_lock (which isn't necessarily
an overall win for KVM, but KVM should do its best to honor the kernel's
preemption model).
To keep KVM's assertion that zapping at 1GiB granularity is functionally
ok, which is the main reason 1GiB was selected in the past, skip straight
to zapping at 1GiB if KVM is configured to prove the MMU. Zapping roots
is far more common than a vCPU replacing a 1GiB page table with a hugepage,
e.g. generally happens multiple times during boot, and so keeping the test
coverage provided by root zaps is desirable, just not for production.
Cc: David Matlack <[email protected]>
Cc: Pattara Teerapong <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6ae19b4ee5b1..372da098d3ce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -734,15 +734,26 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
/*
- * To avoid RCU stalls due to recursively removing huge swaths of SPs,
- * split the zap into two passes. On the first pass, zap at the 1gb
- * level, and then zap top-level SPs on the second pass. "1gb" is not
- * arbitrary, as KVM must be able to zap a 1gb shadow page without
- * inducing a stall to allow in-place replacement with a 1gb hugepage.
+ * Zap roots in multiple passes of decreasing granularity, i.e. zap at
+ * 4KiB=>2MiB=>1GiB=>root, in order to better honor need_resched() (all
+ * preempt models) or mmu_lock contention (full or real-time models).
+ * Zapping at finer granularity marginally increases the total time of
+ * the zap, but in most cases the zap itself isn't latency sensitive.
*
- * Because zapping a SP recurses on its children, stepping down to
- * PG_LEVEL_4K in the iterator itself is unnecessary.
+ * If KVM is configured to prove the MMU, skip the 4KiB and 2MiB zaps
+ * in order to mimic the page fault path, which can replace a 1GiB page
+ * table with an equivalent 1GiB hugepage, i.e. can get saddled with
+ * zapping a 1GiB region that's fully populated with 4KiB SPTEs. This
+ * allows verifying that KVM can safely zap 1GiB regions, e.g. without
+ * inducing RCU stalls, without relying on a relatively rare event
+ * (zapping roots is orders of magnitude more common). Note, because
+ * zapping a SP recurses on its children, stepping down to PG_LEVEL_4K
+ * in the iterator itself is unnecessary.
*/
+ if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
+ __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_4K);
+ __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_2M);
+ }
__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);
__tdp_mmu_zap_root(kvm, root, shared, root->role.level);
--
2.43.0.275.g3460e3d667-goog
Don't force a TLB flush when zapping SPTEs in invalid roots as vCPUs
can't be actively using invalid roots (zapping SPTEs in invalid roots is
necessary only to ensure KVM doesn't mark a page accessed/dirty after it
is freed by the primary MMU).
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 372da098d3ce..68920877370b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -811,7 +811,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
tdp_mmu_iter_set_spte(kvm, &iter, 0);
- flush = true;
+
+ /*
+ * Zappings SPTEs in invalid roots doesn't require a TLB flush,
+ * see kvm_tdp_mmu_zap_invalidated_roots() for details.
+ */
+ if (!root->role.invalid)
+ flush = true;
}
rcu_read_unlock();
--
2.43.0.275.g3460e3d667-goog
Modify for_each_tdp_mmu_root() and __for_each_tdp_mmu_root_yield_safe() to
accept -1 for _as_id to mean "process all memslot address spaces". That
way code that wants to process both SMM and !SMM doesn't need to iterate
over roots twice (and likely copy+paste code in the process).
Deliberately don't cast _as_id to an "int", just in case not casting helps
the compiler elide the "_as_id >= 0" check when being passed an unsigned
value, e.g. from a memslot.
No functional change intended.
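
For illustration only (a standalone sketch, not the actual KVM macro): when
the argument is an unsigned expression, e.g. a memslot's as_id, leaving it
un-cast lets the compiler see "_as_id >= 0" as trivially true and elide the
check, while passing -1 still means "match all address spaces":

#include <stdio.h>

/* Hypothetical, simplified analogue of the iterator's as_id filter. */
#define AS_ID_MATCHES(_as_id, _root_as_id) \
	(!((_as_id) >= 0 && (_root_as_id) != (_as_id)))

int main(void)
{
	unsigned int slot_as_id = 1;	/* e.g. from a memslot */

	printf("%d\n", AS_ID_MATCHES(slot_as_id, 1));	/* 1: same as_id */
	printf("%d\n", AS_ID_MATCHES(slot_as_id, 0));	/* 0: filtered out */
	printf("%d\n", AS_ID_MATCHES(-1, 0));		/* 1: match all */
	return 0;
}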
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 68920877370b..60fff2aad59e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -149,11 +149,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* If shared is set, this function is operating under the MMU lock in read
* mode.
*/
-#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid)\
- for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \
- ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
- _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \
+ ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
+ _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \
+ if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
@@ -171,10 +171,10 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
- if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
- kvm_mmu_page_as_id(_root) != _as_id) { \
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
+ _as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
--
2.43.0.275.g3460e3d667-goog
When zapping a GFN in response to an APICv or MTRR change, don't zap SPTEs
for invalid roots as KVM only needs to ensure the guest can't use stale
mappings for the GFN. Unlike kvm_tdp_mmu_unmap_gfn_range(), which must
zap "unreachable" SPTEs to ensure KVM doesn't mark a page accessed/dirty,
kvm_tdp_mmu_zap_leafs() isn't used (and isn't intended to be used) to
handle freeing of host memory.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 60fff2aad59e..1a9c16e5c287 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -830,16 +830,16 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
}
/*
- * Zap leaf SPTEs for the range of gfns, [start, end), for all roots. Returns
- * true if a TLB flush is needed before releasing the MMU lock, i.e. if one or
- * more SPTEs were zapped since the MMU lock was last acquired.
+ * Zap leaf SPTEs for the range of gfns, [start, end), for all *VALID** roots.
+ * Returns true if a TLB flush is needed before releasing the MMU lock, i.e. if
+ * one or more SPTEs were zapped since the MMU lock was last acquired.
*/
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
{
struct kvm_mmu_page *root;
lockdep_assert_held_write(&kvm->mmu_lock);
- for_each_tdp_mmu_root_yield_safe(kvm, root)
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
return flush;
--
2.43.0.275.g3460e3d667-goog
When write-protecting SPTEs, don't process invalid roots as invalid roots
are unreachable, i.e. can't be used to access guest memory and thus don't
need to be write-protected.
Note, this is *almost* a nop for kvm_tdp_mmu_clear_dirty_pt_masked(),
which is called under slots_lock, i.e. is mutually exclusive with
kvm_mmu_zap_all_fast(). But it's possible for something other than the
"fast zap" thread to grab a reference to an invalid root and thus keep a
root alive (but completely empty) after kvm_mmu_zap_all_fast() completes.
The kvm_tdp_mmu_write_protect_gfn() case is more interesting as KVM write-
protects SPTEs for reasons other than dirty logging, e.g. if KVM creates
a SPTE for a nested VM while a fast zap is in-progress.
Add another TDP MMU iterator to visit only valid roots, and
opportunistically convert kvm_tdp_mmu_get_vcpu_root_hpa() to said iterator.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1a9c16e5c287..e0a8343f66dc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -171,12 +171,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
- _as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
+ ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
+ ((_only_valid) && (_root)->role.invalid))) { \
} else
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
+
+#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, true)
+
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_page *sp;
@@ -224,11 +231,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
lockdep_assert_held_write(&kvm->mmu_lock);
- /*
- * Check for an existing root before allocating a new one. Note, the
- * role check prevents consuming an invalid root.
- */
- for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
+ /* Check for an existing root before allocating a new one. */
+ for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
if (root->role.word == role.word &&
kvm_tdp_mmu_get_root(root))
goto out;
@@ -1639,7 +1643,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
{
struct kvm_mmu_page *root;
- for_each_tdp_mmu_root(kvm, root, slot->as_id)
+ for_each_valid_tdp_mmu_root(kvm, root, slot->as_id)
clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
}
@@ -1757,7 +1761,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
bool spte_set = false;
lockdep_assert_held_write(&kvm->mmu_lock);
- for_each_tdp_mmu_root(kvm, root, slot->as_id)
+ for_each_valid_tdp_mmu_root(kvm, root, slot->as_id)
spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
return spte_set;
--
2.43.0.275.g3460e3d667-goog
When allocating a new TDP MMU root, check for a usable root while holding
mmu_lock for read and only acquire mmu_lock for write if a new root needs
to be created. There is no need to serialize other MMU operations if a
vCPU is simply grabbing a reference to an existing root, holding mmu_lock
for write is "necessary" (spoiler alert, it's not strictly necessary) only
to ensure KVM doesn't end up with duplicate roots.
Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
to setups that frequently delete memslots, i.e. which force all vCPUs to
reload all roots.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++---
arch/x86/kvm/mmu/tdp_mmu.c | 60 +++++++++++++++++++++++++++++++-------
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 55 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..ea18aca23196 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3693,15 +3693,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
unsigned i;
int r;
+ if (tdp_mmu_enabled)
+ return kvm_tdp_mmu_alloc_root(vcpu);
+
write_lock(&vcpu->kvm->mmu_lock);
r = make_mmu_pages_available(vcpu);
if (r < 0)
goto out_unlock;
- if (tdp_mmu_enabled) {
- root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
- mmu->root.hpa = root;
- } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+ if (shadow_root_level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
mmu->root.hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e0a8343f66dc..9a8250a14fc1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -223,21 +223,52 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
{
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
+ int as_id = kvm_mmu_role_as_id(role);
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
- lockdep_assert_held_write(&kvm->mmu_lock);
-
- /* Check for an existing root before allocating a new one. */
- for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
- if (root->role.word == role.word &&
- kvm_tdp_mmu_get_root(root))
- goto out;
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
+ if (root->role.word == role.word)
+ return root;
}
+ return NULL;
+}
+
+int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ union kvm_mmu_page_role role = mmu->root_role;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page *root;
+
+ /*
+ * Check for an existing root while holding mmu_lock for read to avoid
+ * unnecessary serialization if multiple vCPUs are loading a new root.
+ * E.g. when bringing up secondary vCPUs, KVM will already have created
+ * a valid root on behalf of the primary vCPU.
+ */
+ read_lock(&kvm->mmu_lock);
+ root = kvm_tdp_mmu_try_get_root(vcpu);
+ read_unlock(&kvm->mmu_lock);
+
+ if (root)
+ goto out;
+
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Recheck for an existing root after acquiring mmu_lock for write. It
+ * is possible a new usable root was created between dropping mmu_lock
+ * (for read) and acquiring it for write.
+ */
+ root = kvm_tdp_mmu_try_get_root(vcpu);
+ if (root)
+ goto out_unlock;
+
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);
@@ -254,8 +285,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+out_unlock:
+ write_unlock(&kvm->mmu_lock);
out:
- return __pa(root->spt);
+ /*
+ * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
+ * and actually consuming the root if it's invalidated after dropping
+ * mmu_lock, and the root can't be freed as this vCPU holds a reference.
+ */
+ mmu->root.hpa = __pa(root->spt);
+ mmu->root.pgd = 0;
+ return 0;
}
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
@@ -917,7 +957,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* the VM is being destroyed).
*
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
- * See kvm_tdp_mmu_get_vcpu_root_hpa().
+ * See kvm_tdp_mmu_alloc_root().
*/
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 20d97aa46c49..6e1ea04ca885 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
--
2.43.0.275.g3460e3d667-goog
Allocate TDP MMU roots while holding mmu_lock for read, and instead use
tdp_mmu_pages_lock to guard against duplicate roots. This allows KVM to
create new roots without forcing kvm_tdp_mmu_zap_invalidated_roots() to
yield, e.g. allows vCPUs to load new roots after memslot deletion without
forcing the zap thread to detect contention and yield (or complete if the
kernel isn't preemptible).
Note, creating a new TDP MMU root as an mmu_lock reader is safe for two
reasons: (1) paths that must guarantee all roots/SPTEs are *visited* take
mmu_lock for write and so are still mutually exclusive, e.g. mmu_notifier
invalidations, and (2) paths that require all roots/SPTEs to *observe*
some given state without holding mmu_lock for write must ensure freshness
through some other means, e.g. toggling dirty logging must first wait for
SRCU readers to recognize the memslot flags change before processing
existing roots/SPTEs.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 55 +++++++++++++++-----------------------
1 file changed, 22 insertions(+), 33 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9a8250a14fc1..d078157e62aa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -223,51 +223,42 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
-static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
-{
- union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
- int as_id = kvm_mmu_role_as_id(role);
- struct kvm *kvm = vcpu->kvm;
- struct kvm_mmu_page *root;
-
- for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
- if (root->role.word == role.word)
- return root;
- }
-
- return NULL;
-}
-
int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
+ int as_id = kvm_mmu_role_as_id(role);
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
/*
- * Check for an existing root while holding mmu_lock for read to avoid
+ * Check for an existing root before acquiring the pages lock to avoid
* unnecessary serialization if multiple vCPUs are loading a new root.
* E.g. when bringing up secondary vCPUs, KVM will already have created
* a valid root on behalf of the primary vCPU.
*/
read_lock(&kvm->mmu_lock);
- root = kvm_tdp_mmu_try_get_root(vcpu);
- read_unlock(&kvm->mmu_lock);
- if (root)
- goto out;
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
+ if (root->role.word == role.word)
+ goto out_read_unlock;
+ }
- write_lock(&kvm->mmu_lock);
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
/*
- * Recheck for an existing root after acquiring mmu_lock for write. It
- * is possible a new usable root was created between dropping mmu_lock
- * (for read) and acquiring it for write.
+ * Recheck for an existing root after acquiring the pages lock, another
+ * vCPU may have raced ahead and created a new usable root. Manually
+ * walk the list of roots as the standard macros assume that the pages
+ * lock is *not* held. WARN if grabbing a reference to a usable root
+ * fails, as the last reference to a root can only be put *after* the
+ * root has been invalidated, which requires holding mmu_lock for write.
*/
- root = kvm_tdp_mmu_try_get_root(vcpu);
- if (root)
- goto out_unlock;
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (root->role.word == role.word &&
+ !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+ goto out_spin_unlock;
+ }
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);
@@ -280,14 +271,12 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
* is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
*/
refcount_set(&root->tdp_mmu_root_count, 2);
-
- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-out_unlock:
- write_unlock(&kvm->mmu_lock);
-out:
+out_spin_unlock:
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+out_read_unlock:
+ read_unlock(&kvm->mmu_lock);
/*
* Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
* and actually consuming the root if it's invalidated after dropping
--
2.43.0.275.g3460e3d667-goog
Free TDP MMU roots from vCPU context while holding mmu_lock for read; it
is completely legal to invoke kvm_tdp_mmu_put_root() as a reader. This
eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path
after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated
roots, free obsolete roots, and allocate new roots in parallel.
On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap"
operation to run in parallel with freeing and allocating roots reduces the
worst case latency for a vCPU to reload a root from 2-3ms to <100us.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ea18aca23196..90773cdb73bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3575,10 +3575,14 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
if (WARN_ON_ONCE(!sp))
return;
- if (is_tdp_mmu_page(sp))
+ if (is_tdp_mmu_page(sp)) {
+ lockdep_assert_held_read(&kvm->mmu_lock);
kvm_tdp_mmu_put_root(kvm, sp);
- else if (!--sp->root_count && sp->role.invalid)
- kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ } else {
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ if (!--sp->root_count && sp->role.invalid)
+ kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ }
*root_hpa = INVALID_PAGE;
}
@@ -3587,6 +3591,7 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
ulong roots_to_free)
{
+ bool is_tdp_mmu = tdp_mmu_enabled && mmu->root_role.direct;
int i;
LIST_HEAD(invalid_list);
bool free_active_root;
@@ -3609,7 +3614,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
return;
}
- write_lock(&kvm->mmu_lock);
+ if (is_tdp_mmu)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
@@ -3635,8 +3643,13 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
mmu->root.pgd = 0;
}
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- write_unlock(&kvm->mmu_lock);
+ if (is_tdp_mmu) {
+ read_unlock(&kvm->mmu_lock);
+ WARN_ON_ONCE(!list_empty(&invalid_list));
+ } else {
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ write_unlock(&kvm->mmu_lock);
+ }
}
EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);
--
2.43.0.275.g3460e3d667-goog
On Wed, Jan 10, 2024 at 06:00:46PM -0800, Sean Christopherson wrote:
> When allocating a new TDP MMU root, check for a usable root while holding
> mmu_lock for read and only acquire mmu_lock for write if a new root needs
> to be created. There is no need to serialize other MMU operations if a
> vCPU is simply grabbing a reference to an existing root, holding mmu_lock
> for write is "necessary" (spoiler alert, it's not strictly necessary) only
> to ensure KVM doesn't end up with duplicate roots.
>
> Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
> to setups that frequently delete memslots, i.e. which force all vCPUs to
> reload all roots.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 ++---
> arch/x86/kvm/mmu/tdp_mmu.c | 60 +++++++++++++++++++++++++++++++-------
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> 3 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..ea18aca23196 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3693,15 +3693,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> unsigned i;
> int r;
>
> + if (tdp_mmu_enabled)
> + return kvm_tdp_mmu_alloc_root(vcpu);
> +
> write_lock(&vcpu->kvm->mmu_lock);
> r = make_mmu_pages_available(vcpu);
> if (r < 0)
> goto out_unlock;
>
> - if (tdp_mmu_enabled) {
> - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> - mmu->root.hpa = root;
> - } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> + if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> mmu->root.hpa = root;
> } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e0a8343f66dc..9a8250a14fc1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -223,21 +223,52 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> }
>
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> {
> union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> + int as_id = kvm_mmu_role_as_id(role);
> struct kvm *kvm = vcpu->kvm;
> struct kvm_mmu_page *root;
>
> - lockdep_assert_held_write(&kvm->mmu_lock);
> -
> - /* Check for an existing root before allocating a new one. */
> - for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> - if (root->role.word == role.word &&
> - kvm_tdp_mmu_get_root(root))
> - goto out;
> + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
No lock yielding attempt in this loop, why change to _yield_safe version?
Thanks,
Yilun
> + if (root->role.word == role.word)
> + return root;
> }
>
> + return NULL;
> +}
> +
> +int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu *mmu = vcpu->arch.mmu;
> + union kvm_mmu_page_role role = mmu->root_role;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_mmu_page *root;
> +
> + /*
> + * Check for an existing root while holding mmu_lock for read to avoid
> + * unnecessary serialization if multiple vCPUs are loading a new root.
> + * E.g. when bringing up secondary vCPUs, KVM will already have created
> + * a valid root on behalf of the primary vCPU.
> + */
> + read_lock(&kvm->mmu_lock);
> + root = kvm_tdp_mmu_try_get_root(vcpu);
> + read_unlock(&kvm->mmu_lock);
> +
> + if (root)
> + goto out;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Recheck for an existing root after acquiring mmu_lock for write. It
> + * is possible a new usable root was created between dropping mmu_lock
> + * (for read) and acquiring it for write.
> + */
> + root = kvm_tdp_mmu_try_get_root(vcpu);
> + if (root)
> + goto out_unlock;
> +
> root = tdp_mmu_alloc_sp(vcpu);
> tdp_mmu_init_sp(root, NULL, 0, role);
>
> @@ -254,8 +285,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> +out_unlock:
> + write_unlock(&kvm->mmu_lock);
> out:
> - return __pa(root->spt);
> + /*
> + * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
> + * and actually consuming the root if it's invalidated after dropping
> + * mmu_lock, and the root can't be freed as this vCPU holds a reference.
> + */
> + mmu->root.hpa = __pa(root->spt);
> + mmu->root.pgd = 0;
> + return 0;
> }
>
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> @@ -917,7 +957,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> * the VM is being destroyed).
> *
> * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
> - * See kvm_tdp_mmu_get_vcpu_root_hpa().
> + * See kvm_tdp_mmu_alloc_root().
> */
> void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 20d97aa46c49..6e1ea04ca885 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -10,7 +10,7 @@
> void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> +int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
>
> __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> {
> --
> 2.43.0.275.g3460e3d667-goog
>
>
On Tue, Feb 06, 2024 at 06:09:18PM +0800, Xu Yilun wrote:
> On Wed, Jan 10, 2024 at 06:00:46PM -0800, Sean Christopherson wrote:
> > When allocating a new TDP MMU root, check for a usable root while holding
> > mmu_lock for read and only acquire mmu_lock for write if a new root needs
> > to be created. There is no need to serialize other MMU operations if a
> > vCPU is simply grabbing a reference to an existing root, holding mmu_lock
> > for write is "necessary" (spoiler alert, it's not strictly necessary) only
> > to ensure KVM doesn't end up with duplicate roots.
> >
> > Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
> > to setups that frequently delete memslots, i.e. which force all vCPUs to
> > reload all roots.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 8 ++---
> > arch/x86/kvm/mmu/tdp_mmu.c | 60 +++++++++++++++++++++++++++++++-------
> > arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> > 3 files changed, 55 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c844e428684..ea18aca23196 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3693,15 +3693,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > unsigned i;
> > int r;
> >
> > + if (tdp_mmu_enabled)
> > + return kvm_tdp_mmu_alloc_root(vcpu);
> > +
> > write_lock(&vcpu->kvm->mmu_lock);
> > r = make_mmu_pages_available(vcpu);
> > if (r < 0)
> > goto out_unlock;
> >
> > - if (tdp_mmu_enabled) {
> > - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > - mmu->root.hpa = root;
> > - } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > + if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > mmu->root.hpa = root;
> > } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index e0a8343f66dc..9a8250a14fc1 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -223,21 +223,52 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> > tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> > }
> >
> > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > +static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> > {
> > union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> > + int as_id = kvm_mmu_role_as_id(role);
> > struct kvm *kvm = vcpu->kvm;
> > struct kvm_mmu_page *root;
> >
> > - lockdep_assert_held_write(&kvm->mmu_lock);
> > -
> > - /* Check for an existing root before allocating a new one. */
> > - for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> > - if (root->role.word == role.word &&
> > - kvm_tdp_mmu_get_root(root))
> > - goto out;
> > + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
>
> No lock yielding attempt in this loop, why change to _yield_safe version?
Oh, I assume you just want to exit the loop early with a reference to the
root held. But I feel it makes it harder for us to have a clear
understanding of the usage of the _yield_safe and non-_yield_safe helpers.
Maybe change it back?
Thanks,
Yilun
>
> Thanks,
> Yilun
>
> > + if (root->role.word == role.word)
> > + return root;
> > }
> >
> > + return NULL;
> > +}
> > +
> > +int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_mmu *mmu = vcpu->arch.mmu;
> > + union kvm_mmu_page_role role = mmu->root_role;
> > + struct kvm *kvm = vcpu->kvm;
> > + struct kvm_mmu_page *root;
> > +
> > + /*
> > + * Check for an existing root while holding mmu_lock for read to avoid
> > + * unnecessary serialization if multiple vCPUs are loading a new root.
> > + * E.g. when bringing up secondary vCPUs, KVM will already have created
> > + * a valid root on behalf of the primary vCPU.
> > + */
> > + read_lock(&kvm->mmu_lock);
> > + root = kvm_tdp_mmu_try_get_root(vcpu);
> > + read_unlock(&kvm->mmu_lock);
> > +
> > + if (root)
> > + goto out;
> > +
> > + write_lock(&kvm->mmu_lock);
> > +
> > + /*
> > + * Recheck for an existing root after acquiring mmu_lock for write. It
> > + * is possible a new usable root was created between dropping mmu_lock
> > + * (for read) and acquiring it for write.
> > + */
> > + root = kvm_tdp_mmu_try_get_root(vcpu);
> > + if (root)
> > + goto out_unlock;
> > +
> > root = tdp_mmu_alloc_sp(vcpu);
> > tdp_mmu_init_sp(root, NULL, 0, role);
> >
> > @@ -254,8 +285,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> > +out_unlock:
> > + write_unlock(&kvm->mmu_lock);
> > out:
> > - return __pa(root->spt);
> > + /*
> > + * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
> > + * and actually consuming the root if it's invalidated after dropping
> > + * mmu_lock, and the root can't be freed as this vCPU holds a reference.
> > + */
> > + mmu->root.hpa = __pa(root->spt);
> > + mmu->root.pgd = 0;
> > + return 0;
> > }
> >
> > static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > @@ -917,7 +957,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> > * the VM is being destroyed).
> > *
> > * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
> > - * See kvm_tdp_mmu_get_vcpu_root_hpa().
> > + * See kvm_tdp_mmu_alloc_root().
> > */
> > void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> > {
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 20d97aa46c49..6e1ea04ca885 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -10,7 +10,7 @@
> > void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> >
> > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> > +int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
> >
> > __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> > {
> > --
> > 2.43.0.275.g3460e3d667-goog
> >
> >
>
On Tue, Feb 06, 2024, Xu Yilun wrote:
> On Wed, Jan 10, 2024 at 06:00:47PM -0800, Sean Christopherson wrote:
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 55 +++++++++++++++-----------------------
> > 1 file changed, 22 insertions(+), 33 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9a8250a14fc1..d078157e62aa 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -223,51 +223,42 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> > tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> > }
> >
> > -static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> > -{
> > - union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> > - int as_id = kvm_mmu_role_as_id(role);
> > - struct kvm *kvm = vcpu->kvm;
> > - struct kvm_mmu_page *root;
> > -
> > - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> > - if (root->role.word == role.word)
> > - return root;
> > - }
> > -
> > - return NULL;
> > -}
> > -
> > int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > {
> > struct kvm_mmu *mmu = vcpu->arch.mmu;
> > union kvm_mmu_page_role role = mmu->root_role;
> > + int as_id = kvm_mmu_role_as_id(role);
> > struct kvm *kvm = vcpu->kvm;
> > struct kvm_mmu_page *root;
> >
> > /*
> > - * Check for an existing root while holding mmu_lock for read to avoid
> > + * Check for an existing root before acquiring the pages lock to avoid
> > * unnecessary serialization if multiple vCPUs are loading a new root.
> > * E.g. when bringing up secondary vCPUs, KVM will already have created
> > * a valid root on behalf of the primary vCPU.
> > */
> > read_lock(&kvm->mmu_lock);
> > - root = kvm_tdp_mmu_try_get_root(vcpu);
> > - read_unlock(&kvm->mmu_lock);
> >
> > - if (root)
> > - goto out;
> > + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> > + if (root->role.word == role.word)
> > + goto out_read_unlock;
> > + }
> >
> > - write_lock(&kvm->mmu_lock);
>
> It seems really complex to me...
>
> I failed to understand why the following KVM_BUG_ON() could be avoided
> without the mmu_lock for write. I thought a valid root could be added
> during zapping.
>
> void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> {
> struct kvm_mmu_page *root;
>
> read_lock(&kvm->mmu_lock);
>
> for_each_tdp_mmu_root_yield_safe(kvm, root) {
> if (!root->tdp_mmu_scheduled_root_to_zap)
> continue;
>
> root->tdp_mmu_scheduled_root_to_zap = false;
> KVM_BUG_ON(!root->role.invalid, kvm);
tdp_mmu_scheduled_root_to_zap is set only when mmu_lock is held for write, i.e.
it's mutually exclusive with allocating a new root.
And tdp_mmu_scheduled_root_to_zap is cleared if and only if it is already set,
and is only processed by kvm_tdp_mmu_zap_invalidated_roots(),
which runs under slots_lock (a mutex).
So a new, valid root can be added, but it won't have tdp_mmu_scheduled_root_to_zap
set, at least not until the current "fast zap" completes and a new one begins,
which as above requires taking mmu_lock for write.
On Tue, Feb 06, 2024, Xu Yilun wrote:
> On Tue, Feb 06, 2024 at 06:09:18PM +0800, Xu Yilun wrote:
> > On Wed, Jan 10, 2024 at 06:00:46PM -0800, Sean Christopherson wrote:
> > > When allocating a new TDP MMU root, check for a usable root while holding
> > > mmu_lock for read and only acquire mmu_lock for write if a new root needs
> > > to be created. There is no need to serialize other MMU operations if a
> > > vCPU is simply grabbing a reference to an existing root, holding mmu_lock
> > > for write is "necessary" (spoiler alert, it's not strictly necessary) only
> > > to ensure KVM doesn't end up with duplicate roots.
> > >
> > > Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
> > > to setups that frequently delete memslots, i.e. which force all vCPUs to
> > > reload all roots.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 8 ++---
> > > arch/x86/kvm/mmu/tdp_mmu.c | 60 +++++++++++++++++++++++++++++++-------
> > > arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> > > 3 files changed, 55 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 3c844e428684..ea18aca23196 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3693,15 +3693,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > > unsigned i;
> > > int r;
> > >
> > > + if (tdp_mmu_enabled)
> > > + return kvm_tdp_mmu_alloc_root(vcpu);
> > > +
> > > write_lock(&vcpu->kvm->mmu_lock);
> > > r = make_mmu_pages_available(vcpu);
> > > if (r < 0)
> > > goto out_unlock;
> > >
> > > - if (tdp_mmu_enabled) {
> > > - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > > - mmu->root.hpa = root;
> > > - } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > > + if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > > root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > > mmu->root.hpa = root;
> > > } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index e0a8343f66dc..9a8250a14fc1 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -223,21 +223,52 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> > > tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> > > }
> > >
> > > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > > +static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> > > {
> > > union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> > > + int as_id = kvm_mmu_role_as_id(role);
> > > struct kvm *kvm = vcpu->kvm;
> > > struct kvm_mmu_page *root;
> > >
> > > - lockdep_assert_held_write(&kvm->mmu_lock);
> > > -
> > > - /* Check for an existing root before allocating a new one. */
> > > - for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> > > - if (root->role.word == role.word &&
> > > - kvm_tdp_mmu_get_root(root))
> > > - goto out;
> > > + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> >
> > No lock yielding attempt in this loop, why change to _yield_safe version?
Because variants that don't allow yielding, i.e. for_each_valid_tdp_mmu_root()
as of this patch, require mmu_lock be held for write. Holding mmu_lock for write
is necessary because that simpler version uses list_for_each_entry() and doesn't
grab a reference to the root, i.e. entries on the list could be freed, e.g. by
kvm_tdp_mmu_zap_invalidated_roots().
The _yield_safe() versions don't require the user to want to yield. The naming
is _yield_safe() because the yield-safe iterators can run with mmu_lock held for
read *or* write.
> Oh, I assume you just want to exit the loop early with a reference to the
> root held. But I feel it makes it harder for us to have a clear
> understanding of the usage of the _yield_safe and non-_yield_safe helpers.
>
> Maybe change it back?
No. There's even a comment above for_each_tdp_mmu_root() (which is
for_each_valid_tdp_mmu_root() as of this patch) that explains the difference.
The rule is essentially, use the yield-safe variant unless there's a good reason
not to.
/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
* the implication being that any flow that holds mmu_lock for read is
* inherently yield-friendly and should use the yield-safe variant above.
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
On Wed, Jan 10, 2024 at 06:00:47PM -0800, Sean Christopherson wrote:
> Allocate TDP MMU roots while holding mmu_lock for read, and instead use
> tdp_mmu_pages_lock to guard against duplicate roots. This allows KVM to
> create new roots without forcing kvm_tdp_mmu_zap_invalidated_roots() to
> yield, e.g. allows vCPUs to load new roots after memslot deletion without
> forcing the zap thread to detect contention and yield (or complete if the
> kernel isn't preemptible).
>
> Note, creating a new TDP MMU root as an mmu_lock reader is safe for two
> reasons: (1) paths that must guarantee all roots/SPTEs are *visited* take
> mmu_lock for write and so are still mutually exclusive, e.g. mmu_notifier
> invalidations, and (2) paths that require all roots/SPTEs to *observe*
> some given state without holding mmu_lock for write must ensure freshness
> through some other means, e.g. toggling dirty logging must first wait for
> SRCU readers to recognize the memslot flags change before processing
> existing roots/SPTEs.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 55 +++++++++++++++-----------------------
> 1 file changed, 22 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9a8250a14fc1..d078157e62aa 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -223,51 +223,42 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> }
>
> -static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> -{
> - union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> - int as_id = kvm_mmu_role_as_id(role);
> - struct kvm *kvm = vcpu->kvm;
> - struct kvm_mmu_page *root;
> -
> - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> - if (root->role.word == role.word)
> - return root;
> - }
> -
> - return NULL;
> -}
> -
> int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> union kvm_mmu_page_role role = mmu->root_role;
> + int as_id = kvm_mmu_role_as_id(role);
> struct kvm *kvm = vcpu->kvm;
> struct kvm_mmu_page *root;
>
> /*
> - * Check for an existing root while holding mmu_lock for read to avoid
> + * Check for an existing root before acquiring the pages lock to avoid
> * unnecessary serialization if multiple vCPUs are loading a new root.
> * E.g. when bringing up secondary vCPUs, KVM will already have created
> * a valid root on behalf of the primary vCPU.
> */
> read_lock(&kvm->mmu_lock);
> - root = kvm_tdp_mmu_try_get_root(vcpu);
> - read_unlock(&kvm->mmu_lock);
>
> - if (root)
> - goto out;
> + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> + if (root->role.word == role.word)
> + goto out_read_unlock;
> + }
>
> - write_lock(&kvm->mmu_lock);
It seems really complex to me...
I failed to understand why the following KVM_BUG_ON() could be avoided
without the mmu_lock for write. I thought a valid root could be added
during zapping.
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
read_lock(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root) {
if (!root->tdp_mmu_scheduled_root_to_zap)
continue;
root->tdp_mmu_scheduled_root_to_zap = false;
KVM_BUG_ON(!root->role.invalid, kvm);
Thanks,
Yilun
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>
> /*
> - * Recheck for an existing root after acquiring mmu_lock for write. It
> - * is possible a new usable root was created between dropping mmu_lock
> - * (for read) and acquiring it for write.
> + * Recheck for an existing root after acquiring the pages lock, another
> + * vCPU may have raced ahead and created a new usable root. Manually
> + * walk the list of roots as the standard macros assume that the pages
> + * lock is *not* held. WARN if grabbing a reference to a usable root
> + * fails, as the last reference to a root can only be put *after* the
> + * root has been invalidated, which requires holding mmu_lock for write.
> */
> - root = kvm_tdp_mmu_try_get_root(vcpu);
> - if (root)
> - goto out_unlock;
> + list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> + if (root->role.word == role.word &&
> + !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
> + goto out_spin_unlock;
> + }
>
> root = tdp_mmu_alloc_sp(vcpu);
> tdp_mmu_init_sp(root, NULL, 0, role);
> @@ -280,14 +271,12 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
> */
> refcount_set(&root->tdp_mmu_root_count, 2);
> -
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> -out_unlock:
> - write_unlock(&kvm->mmu_lock);
> -out:
> +out_spin_unlock:
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +out_read_unlock:
> + read_unlock(&kvm->mmu_lock);
> /*
> * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
> * and actually consuming the root if it's invalidated after dropping
> --
> 2.43.0.275.g3460e3d667-goog
>
>
On Tue, Feb 06, 2024 at 10:10:44AM -0800, Sean Christopherson wrote:
> On Tue, Feb 06, 2024, Xu Yilun wrote:
> > On Wed, Jan 10, 2024 at 06:00:47PM -0800, Sean Christopherson wrote:
> > > ---
> > > arch/x86/kvm/mmu/tdp_mmu.c | 55 +++++++++++++++-----------------------
> > > 1 file changed, 22 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 9a8250a14fc1..d078157e62aa 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -223,51 +223,42 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> > > tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> > > }
> > >
> > > -static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> > > -{
> > > - union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> > > - int as_id = kvm_mmu_role_as_id(role);
> > > - struct kvm *kvm = vcpu->kvm;
> > > - struct kvm_mmu_page *root;
> > > -
> > > - for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> > > - if (root->role.word == role.word)
> > > - return root;
> > > - }
> > > -
> > > - return NULL;
> > > -}
> > > -
> > > int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> > > {
> > > struct kvm_mmu *mmu = vcpu->arch.mmu;
> > > union kvm_mmu_page_role role = mmu->root_role;
> > > + int as_id = kvm_mmu_role_as_id(role);
> > > struct kvm *kvm = vcpu->kvm;
> > > struct kvm_mmu_page *root;
> > >
> > > /*
> > > - * Check for an existing root while holding mmu_lock for read to avoid
> > > + * Check for an existing root before acquiring the pages lock to avoid
> > > * unnecessary serialization if multiple vCPUs are loading a new root.
> > > * E.g. when bringing up secondary vCPUs, KVM will already have created
> > > * a valid root on behalf of the primary vCPU.
> > > */
> > > read_lock(&kvm->mmu_lock);
> > > - root = kvm_tdp_mmu_try_get_root(vcpu);
> > > - read_unlock(&kvm->mmu_lock);
> > >
> > > - if (root)
> > > - goto out;
> > > + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> > > + if (root->role.word == role.word)
> > > + goto out_read_unlock;
> > > + }
> > >
> > > - write_lock(&kvm->mmu_lock);
> >
> > It seems really complex to me...
> >
> > I failed to understand why the following KVM_BUG_ON() could be avoided
> > without the mmu_lock for write. I thought a valid root could be added
> > during zapping.
> >
> > void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> > {
> > struct kvm_mmu_page *root;
> >
> > read_lock(&kvm->mmu_lock);
> >
> > for_each_tdp_mmu_root_yield_safe(kvm, root) {
> > if (!root->tdp_mmu_scheduled_root_to_zap)
> > continue;
> >
> > root->tdp_mmu_scheduled_root_to_zap = false;
> > KVM_BUG_ON(!root->role.invalid, kvm);
>
> tdp_mmu_scheduled_root_to_zap is set only when mmu_lock is held for write, i.e.
> it's mutually exclusive with allocating a new root.
>
> And tdp_mmu_scheduled_root_to_zap is cleared if and only if it is already set,
> and is only processed by kvm_tdp_mmu_zap_invalidated_roots(),
> which runs under slots_lock (a mutex).
>
> So a new, valid root can be added, but it won't have tdp_mmu_scheduled_root_to_zap
> set, at least not until the current "fast zap" completes and a new one begins,
> which as above requires taking mmu_lock for write.
It's clear to me.
Thanks for the detailed explanation.
>
On Tue, Feb 06, 2024 at 10:21:00AM -0800, Sean Christopherson wrote:
> On Tue, Feb 06, 2024, Xu Yilun wrote:
> > On Tue, Feb 06, 2024 at 06:09:18PM +0800, Xu Yilun wrote:
> > > On Wed, Jan 10, 2024 at 06:00:46PM -0800, Sean Christopherson wrote:
> > > > When allocating a new TDP MMU root, check for a usable root while holding
> > > > mmu_lock for read and only acquire mmu_lock for write if a new root needs
> > > > to be created. There is no need to serialize other MMU operations if a
> > > > vCPU is simply grabbing a reference to an existing root, holding mmu_lock
> > > > for write is "necessary" (spoiler alert, it's not strictly necessary) only
> > > > to ensure KVM doesn't end up with duplicate roots.
> > > >
> > > > Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
> > > > to setups that frequently delete memslots, i.e. which force all vCPUs to
> > > > reload all roots.
> > > >
> > > > Signed-off-by: Sean Christopherson <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 8 ++---
> > > > arch/x86/kvm/mmu/tdp_mmu.c | 60 +++++++++++++++++++++++++++++++-------
> > > > arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> > > > 3 files changed, 55 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 3c844e428684..ea18aca23196 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -3693,15 +3693,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > > > unsigned i;
> > > > int r;
> > > >
> > > > + if (tdp_mmu_enabled)
> > > > + return kvm_tdp_mmu_alloc_root(vcpu);
> > > > +
> > > > write_lock(&vcpu->kvm->mmu_lock);
> > > > r = make_mmu_pages_available(vcpu);
> > > > if (r < 0)
> > > > goto out_unlock;
> > > >
> > > > - if (tdp_mmu_enabled) {
> > > > - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > > > - mmu->root.hpa = root;
> > > > - } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > > > + if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > > > root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
> > > > mmu->root.hpa = root;
> > > > } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index e0a8343f66dc..9a8250a14fc1 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -223,21 +223,52 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
> > > > tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
> > > > }
> > > >
> > > > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > > > +static struct kvm_mmu_page *kvm_tdp_mmu_try_get_root(struct kvm_vcpu *vcpu)
> > > > {
> > > > union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> > > > + int as_id = kvm_mmu_role_as_id(role);
> > > > struct kvm *kvm = vcpu->kvm;
> > > > struct kvm_mmu_page *root;
> > > >
> > > > - lockdep_assert_held_write(&kvm->mmu_lock);
> > > > -
> > > > - /* Check for an existing root before allocating a new one. */
> > > > - for_each_valid_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> > > > - if (root->role.word == role.word &&
> > > > - kvm_tdp_mmu_get_root(root))
> > > > - goto out;
> > > > + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
> > >
> > > No lock yielding attempt in this loop, why change to _yield_safe version?
>
> Because variants that don't allow yielding, i.e. for_each_valid_tdp_mmu_root()
> as of this patch, require mmu_lock be held for write. Holding mmu_lock for write
> is necessary because that simpler version uses list_for_each_entry() and doesn't
> grab a reference to the root, i.e. entries on the list could be freed, e.g. by
> kvm_tdp_mmu_zap_invalidated_roots().
>
> The _yield_safe() versions don't require the user to want to yield. The naming
> is _yield_safe() because the yield-safe iterators can run with mmu_lock held for
> read *or* write.
>
> > Oh, I assume you just want to exit the loop early with a reference to the
> > root held. But I feel it makes it harder for us to have a clear
> > understanding of the usage of the _yield_safe and non-_yield_safe helpers.
> >
> > Maybe change it back?
>
> No. There's even a comment above for_each_tdp_mmu_root() (which is
Oh, I should have read the comments more carefully.
> for_each_valid_tdp_mmu_root() as of this patch) that explains the difference.
> The rule is essentially, use the yield-safe variant unless there's a good reason
> not to.
>
> /*
> * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
> * the implication being that any flow that holds mmu_lock for read is
> * inherently yield-friendly and should use the yield-safe variant above.
I still had some doubt until I saw the changelog:
The nature of a shared walk means that the caller needs to
play nice with other tasks modifying the page tables, which is more or
less the same thing as playing nice with yielding.
Thanks.
> * Holding mmu_lock for write obviates the need for RCU protection as the list
> * is guaranteed to be stable.
> */
On Wed, 10 Jan 2024 18:00:40 -0800, Sean Christopherson wrote:
> This series is the result of digging into why deleting a memslot, which on
> x86 forces all vCPUs to reload a new MMU root, causes noticeably more jitter
> in vCPUs and other tasks when running with the TDP MMU than the Shadow MMU
> (with TDP enabled).
>
> Patch 1 addresses the most obvious issue by simply zapping at a finer
> granularity so that if a different task, e.g. a vCPU, wants to run on the
> pCPU doing the zapping, it doesn't have to wait for KVM to zap an entire
> 1GiB region, which can take hundreds of microseconds (or more). The
> shadow MMU checks for need_resched() (and mmu_lock contention, see below)
> every 10 zaps, which is why the shadow MMU doesn't induce the same level
> of jitter.
>
> [...]
Applied to kvm-x86 mmu, thanks!
[1/8] KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
https://github.com/kvm-x86/linux/commit/8ca983631f3c
[2/8] KVM: x86/mmu: Don't do TLB flush when zapping SPTEs in invalid roots
https://github.com/kvm-x86/linux/commit/fcdffe97f80e
[3/8] KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
https://github.com/kvm-x86/linux/commit/6577f1efdff4
[4/8] KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
https://github.com/kvm-x86/linux/commit/99b85fda91b1
[5/8] KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
https://github.com/kvm-x86/linux/commit/d746182337c2
[6/8] KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
https://github.com/kvm-x86/linux/commit/f5238c2a60f1
[7/8] KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
https://github.com/kvm-x86/linux/commit/dab285e4ec73
[8/8] KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read
https://github.com/kvm-x86/linux/commit/576a15de8d29
--
https://github.com/kvm-x86/linux/tree/next