2021-11-20 04:51:59

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
number of TLB flushes, and to clean up the zapping code. The final patch
realizes the biggest change, which is to use RCU to defer any TLB flush
due to zapping a SP to the caller. The largest cleanup is to separate the
flows for zapping roots (zap _everything_), zapping leaf SPTEs (zap guest
mappings for whatever reason), and zapping a specific SP (NX recovery).
They're currently smushed into a single zap_gfn_range(), which was a good
idea at the time, but became a mess when trying to handle the different
rules, e.g. TLB flushes aren't needed when zapping a root because KVM can
safely zap a root if and only if it's unreachable.

For booting an 8 vCPU, remote_tlb_flush (requests) goes from roughly
180 (600) to 130 (215).

Please don't apply patches 02 and 03, they've been posted elsehwere and by
other people. I included them here because some of the patches have
pseudo-dependencies on their changes. Patch 01 is also posted separately.
I had a brain fart and sent it out realizing that doing so would lead to
oddities.

Hou Wenlong (1):
KVM: x86/mmu: Skip tlb flush if it has been done in zap_gfn_range()

Sean Christopherson (27):
KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier
unmapping
KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible
path
KVM: x86/mmu: Retry page fault if root is invalidated by memslot
update
KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP
MMU
KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush
logic
KVM: x86/mmu: Document that zapping invalidated roots doesn't need to
flush
KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root()
KVM: x86/mmu: Require mmu_lock be held for write in unyielding root
iter
KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU
root
KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP
removal
KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier
change_spte
KVM: x86/mmu: Drop RCU after processing each root in MMU notifier
hooks
KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots
KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
KVM: x86/mmu: Terminate yield-friendly walk if invalid root observed
KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw
vals
KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap
hook
KVM: x86/mmu: Add TDP MMU helper to zap a root
KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
KVM: x86/mmu: Use "zap root" path for "slow" zap of all TDP MMU SPTEs
KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU
resched
KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow
pages

arch/x86/kvm/mmu/mmu.c | 74 +++--
arch/x86/kvm/mmu/mmu_internal.h | 7 +-
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/mmu/tdp_iter.c | 6 +-
arch/x86/kvm/mmu/tdp_iter.h | 15 +-
arch/x86/kvm/mmu/tdp_mmu.c | 526 +++++++++++++++++++-------------
arch/x86/kvm/mmu/tdp_mmu.h | 48 +--
7 files changed, 406 insertions(+), 273 deletions(-)

--
2.34.0.rc2.393.gf8c9666880-goog



2021-11-20 04:52:03

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 05/28] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU

Explicitly check for preset SPTEs when clearing dirty bits in the TDP
MMU. This isn't strictly required for correctness, as setting the dirty
bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
complain if anyone actually turned on KVM's MMU debugging.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1db8496259ad..c575df121b19 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1246,6 +1246,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;

+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
if (spte_ad_need_write_protect(iter.old_spte)) {
if (is_writable_pte(iter.old_spte))
new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:07

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 04/28] KVM: x86/mmu: Retry page fault if root is invalidated by memslot update

Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.

For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.

For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
can be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointess (the root will be unloaded), apply
the check to all MMUs.

Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: [email protected]
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 23 +++++++++++++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d2ad12a4d66e..31ce913efe37 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1936,7 +1936,11 @@ static void mmu_audit_disable(void) { }

static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- return sp->role.invalid ||
+ if (sp->role.invalid)
+ return true;
+
+ /* TDP MMU pages due not use the MMU generation. */
+ return !sp->tdp_mmu_page &&
unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}

@@ -3976,6 +3980,20 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return true;
}

+/*
+ * Returns true if the page fault is stale and needs to be retried, i.e. if the
+ * root was invalidated by a memslot update or a relevant mmu_notifier fired.
+ */
+static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault, int mmu_seq)
+{
+ if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
+ return true;
+
+ return fault->slot &&
+ mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+}
+
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
@@ -4013,8 +4031,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
else
write_lock(&vcpu->kvm->mmu_lock);

- if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
+ if (is_page_fault_stale(vcpu, fault, mmu_seq))
goto out_unlock;
+
r = make_mmu_pages_available(vcpu);
if (r)
goto out_unlock;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f87d36898c44..708a5d297fe1 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -911,7 +911,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);
- if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
+
+ if (is_page_fault_stale(vcpu, fault, mmu_seq))
goto out_unlock;

kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:11

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots

Take TDP MMU roots off the list of roots when they're invalidated instead
of walking later on to find the roots that were just invalidated. In
addition to making the flow more straightforward, this allows warning
if something attempts to elevate the refcount of an invalid root, which
should be unreachable (no longer on the list so can't be reached by MMU
notifier, and vCPUs must reload a new root before installing new SPTE).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 6 +-
arch/x86/kvm/mmu/tdp_mmu.c | 171 ++++++++++++++++++++-----------------
arch/x86/kvm/mmu/tdp_mmu.h | 14 ++-
3 files changed, 108 insertions(+), 83 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e00e46205730..e3cd330c9532 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5664,6 +5664,8 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
*/
static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
+ LIST_HEAD(invalidated_roots);
+
lockdep_assert_held(&kvm->slots_lock);

write_lock(&kvm->mmu_lock);
@@ -5685,7 +5687,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* could drop the MMU lock and yield.
*/
if (is_tdp_mmu_enabled(kvm))
- kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_invalidate_all_roots(kvm, &invalidated_roots);

/*
* Notify all vcpus to reload its shadow page table and flush TLB.
@@ -5703,7 +5705,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)

if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
- kvm_tdp_mmu_zap_invalidated_roots(kvm);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm, &invalidated_roots);
read_unlock(&kvm->mmu_lock);
}
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ca6b30a7130d..085f6b09e5f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -94,9 +94,17 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,

WARN_ON(!root->tdp_mmu_page);

- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- list_del_rcu(&root->link);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ /*
+ * Remove the root from tdp_mmu_roots, unless the root is invalid in
+ * which case the root was pulled off tdp_mmu_roots when it was
+ * invalidated. Note, this must be an RCU-protected deletion to avoid
+ * use-after-free in the yield-safe iterator!
+ */
+ if (!root->role.invalid) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_del_rcu(&root->link);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ }

/*
* A TLB flush is not necessary as KVM performs a local TLB flush when
@@ -105,18 +113,23 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* invalidates any paging-structure-cache entries, i.e. TLB entries for
* intermediate paging structures, that may be zapped, as such entries
* are associated with the ASID on both VMX and SVM.
+ *
+ * WARN if a flush is reported for an invalid root, as its child SPTEs
+ * should have been zapped by kvm_tdp_mmu_zap_invalidated_roots(), and
+ * inserting new SPTEs under an invalid root is a KVM bug.
*/
- (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
+ if (zap_gfn_range(kvm, root, 0, -1ull, true, false, shared))
+ WARN_ON_ONCE(root->role.invalid);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

/*
- * Finds the next valid root after root (or the first valid root if root
- * is NULL), takes a reference on it, and returns that next root. If root
- * is not NULL, this thread should have already taken a reference on it, and
- * that reference will be dropped. If no valid root is found, this
- * function will return NULL.
+ * Finds the next root after @prev_root (or the first root if @prev_root is NULL
+ * or invalid), takes a reference on it, and returns that next root. If root is
+ * not NULL, this thread should have already taken a reference on it, and that
+ * reference will be dropped. If no valid root is found, this function will
+ * return NULL.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
@@ -124,6 +137,27 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
{
struct kvm_mmu_page *next_root;

+ lockdep_assert_held(&kvm->mmu_lock);
+
+ /*
+ * Restart the walk if the previous root was invalidated, which can
+ * happen if the caller drops mmu_lock when yielding. Restarting the
+ * walke is necessary because invalidating a root also removes it from
+ * tdp_mmu_roots. Restarting is safe and correct because invalidating
+ * a root is done if and only if _all_ roots are invalidated, i.e. any
+ * root on tdp_mmu_roots was added _after_ the invalidation event.
+ */
+ if (prev_root && prev_root->role.invalid) {
+ kvm_tdp_mmu_put_root(kvm, prev_root, shared);
+ prev_root = NULL;
+ }
+
+ /*
+ * Finding the next root must be done under RCU read lock. Although
+ * @prev_root itself cannot be removed from tdp_mmu_roots because this
+ * task holds a reference, its next and prev pointers can be modified
+ * when freeing a different root. Ditto for tdp_mmu_roots itself.
+ */
rcu_read_lock();

if (prev_root)
@@ -230,10 +264,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
refcount_set(&root->tdp_mmu_root_count, 1);

- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-
+ /*
+ * Because mmu_lock must be held for write to ensure that KVM doesn't
+ * create multiple roots for a given role, this does not need to use
+ * an RCU-friendly variant as readers of tdp_mmu_roots must also hold
+ * mmu_lock in some capacity.
+ */
+ list_add(&root->link, &kvm->arch.tdp_mmu_roots);
out:
return __pa(root->spt);
}
@@ -814,28 +851,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
kvm_flush_remote_tlbs(kvm);
}

-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
- struct kvm_mmu_page *prev_root)
-{
- struct kvm_mmu_page *next_root;
-
- if (prev_root)
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &prev_root->link,
- typeof(*prev_root), link);
- else
- next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- typeof(*next_root), link);
-
- while (next_root && !(next_root->role.invalid &&
- refcount_read(&next_root->tdp_mmu_root_count)))
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &next_root->link,
- typeof(*next_root), link);
-
- return next_root;
-}
-
/*
* Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
* invalidated root, they will not be freed until this function drops the
@@ -844,22 +859,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
* only has to do a trivial amount of work. Since the roots are invalid,
* no new SPTEs should be created under them.
*/
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
+ struct list_head *invalidated_roots)
{
- struct kvm_mmu_page *next_root;
- struct kvm_mmu_page *root;
+ struct kvm_mmu_page *root, *tmp;

+ lockdep_assert_held(&kvm->slots_lock);
lockdep_assert_held_read(&kvm->mmu_lock);

- rcu_read_lock();
-
- root = next_invalidated_root(kvm, NULL);
-
- while (root) {
- next_root = next_invalidated_root(kvm, root);
-
- rcu_read_unlock();
-
+ /*
+ * Put the ref to each root, acquired by kvm_tdp_mmu_put_root(). The
+ * safe variant is required even though kvm_tdp_mmu_put_root() doesn't
+ * explicitly remove the root from the invalid list, as this task does
+ * not take rcu_read_lock() and so the list object itself can be freed.
+ */
+ list_for_each_entry_safe(root, tmp, invalidated_roots, link) {
/*
* A TLB flush is unnecessary, invalidated roots are guaranteed
* to be unreachable by the guest (see kvm_tdp_mmu_put_root()
@@ -870,49 +884,50 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* blip and not a functional issue.
*/
(void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
-
- /*
- * Put the reference acquired in
- * kvm_tdp_mmu_invalidate_roots
- */
kvm_tdp_mmu_put_root(kvm, root, true);
-
- root = next_root;
-
- rcu_read_lock();
}
-
- rcu_read_unlock();
}

/*
- * Mark each TDP MMU root as invalid so that other threads
- * will drop their references and allow the root count to
- * go to 0.
+ * Mark each TDP MMU root as invalid so that other threads will drop their
+ * references and allow the root count to go to 0.
*
- * Also take a reference on all roots so that this thread
- * can do the bulk of the work required to free the roots
- * once they are invalidated. Without this reference, a
- * vCPU thread might drop the last reference to a root and
- * get stuck with tearing down the entire paging structure.
+ * Take a reference on each root and move it to a local list so that this task
+ * can do the actual work required to free the roots once they are invalidated,
+ * e.g. zap the SPTEs and trigger a remote TLB flush. Without this reference, a
+ * vCPU task might drop the last reference to a root and get stuck with tearing
+ * down the entire paging structure.
*
- * Roots which have a zero refcount should be skipped as
- * they're already being torn down.
- * Already invalid roots should be referenced again so that
- * they aren't freed before kvm_tdp_mmu_zap_all_fast is
- * done with them.
+ * Roots which have a zero refcount are skipped as they're already being torn
+ * down. Encountering a root that is already invalid is a KVM bug, as this is
+ * the only path that is allowed to invalidate roots and (a) it's proteced by
+ * slots_lock and (b) pulls each root off tdp_mmu_roots.
*
- * This has essentially the same effect for the TDP MMU
- * as updating mmu_valid_gen does for the shadow MMU.
+ * This has essentially the same effect for the TDP MMU as updating
+ * mmu_valid_gen does for the shadow MMU.
*/
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
+ struct list_head *invalidated_roots)
{
- struct kvm_mmu_page *root;
+ struct kvm_mmu_page *root, *tmp;

+ /*
+ * mmu_lock must be held for write, moving entries off an RCU-protected
+ * list is not safe, entries can only be deleted. All accesses to
+ * tdp_mmu_roots are required to hold mmu_lock in some capacity, thus
+ * holding it for write ensures there are no concurrent readers.
+ */
lockdep_assert_held_write(&kvm->mmu_lock);
- list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
- if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
- root->role.invalid = true;
+
+ list_for_each_entry_safe(root, tmp, &kvm->arch.tdp_mmu_roots, link) {
+ if (!kvm_tdp_mmu_get_root(root))
+ continue;
+
+ list_move_tail(&root->link, invalidated_roots);
+
+ WARN_ON_ONCE(root->role.invalid);
+ root->role.invalid = true;
+ }
}

/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 599714de67c3..ced6d8e47362 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -9,7 +9,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);

__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
- if (root->role.invalid)
+ /*
+ * Acquiring a reference on an invalid root is a KVM bug. Invalid roots
+ * are supposed to be reachable only by references that were acquired
+ * before the invalidation, and taking an additional reference to an
+ * invalid root is not allowed.
+ */
+ if (WARN_ON_ONCE(root->role.invalid))
return false;

return refcount_inc_not_zero(&root->tdp_mmu_root_count);
@@ -44,8 +50,10 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
}

void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
+ struct list_head *invalidated_roots);
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
+ struct list_head *invalidated_roots);

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);

--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:13

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 14/28] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU

Add helpers to read and write TDP MMU SPTEs instead of open coding
rcu_dereference() all over the place, and to provide a convenient
location to document why KVM doesn't exempt holding mmu_lock for write
from having to hold RCU (and any future changes to the rules).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_iter.c | 6 +++---
arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++-------
3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b3ed302c1a35..1f7741c725f6 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}

static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -86,7 +86,7 @@ static bool try_step_down(struct tdp_iter *iter)
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);

child_pt = spte_to_child_pt(iter->old_spte, iter->level);
if (!child_pt)
@@ -120,7 +120,7 @@ static bool try_step_side(struct tdp_iter *iter)
iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
iter->next_last_level_gfn = iter->gfn;
iter->sptep++;
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);

return true;
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1748b988d3a..9c04d8677cb3 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,6 +9,22 @@

typedef u64 __rcu *tdp_ptep_t;

+/*
+ * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
+ * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
+ * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
+ * the lower* depths of the TDP MMU just to make lockdep happy is a nightmare,
+ * so all* accesses to SPTEs are must be done under RCU protection.
+ */
+static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
+{
+ return READ_ONCE(*rcu_dereference(sptep));
+}
+static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+{
+ WRITE_ONCE(*rcu_dereference(sptep), val);
+}
+
/*
* A TDP iterator performs a pre-order walk over a TDP paging structure.
*/
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3ff7b4cd7d0e..ca6b30a7130d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -572,7 +572,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* here since the SPTE is going from non-present
* to non-present.
*/
- WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+ kvm_tdp_mmu_write_spte(iter->sptep, 0);

return true;
}
@@ -609,7 +609,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
*/
WARN_ON(is_removed_spte(iter->old_spte));

- WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+ kvm_tdp_mmu_write_spte(iter->sptep, new_spte);

__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
new_spte, iter->level, false);
@@ -775,7 +775,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
}
@@ -1012,7 +1012,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* because the new value informs the !present
* path below.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
}

if (!is_shadow_present_pte(iter.old_spte)) {
@@ -1225,7 +1225,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
spte_set = true;
@@ -1296,7 +1296,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
spte_set = true;
@@ -1427,7 +1427,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:15

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 17/28] KVM: x86/mmu: Terminate yield-friendly walk if invalid root observed

Stop walking TDP MMU roots if the previous root was invalidated while the
caller yielded (dropped mmu_lock). Any effect that the caller wishes to
be recognized by a root must be made visible before walking the list of
roots. Because all roots are invalided when any root is invalidated, if
the previous root was invalidated, then all roots on the list when the
caller first started the walk also were invalidated, and any valid roots
on the list must have been added after the invalidation event.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d9524b387221..cc8d021a1ba5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -140,16 +140,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
lockdep_assert_held(&kvm->mmu_lock);

/*
- * Restart the walk if the previous root was invalidated, which can
- * happen if the caller drops mmu_lock when yielding. Restarting the
- * walke is necessary because invalidating a root also removes it from
- * tdp_mmu_roots. Restarting is safe and correct because invalidating
- * a root is done if and only if _all_ roots are invalidated, i.e. any
- * root on tdp_mmu_roots was added _after_ the invalidation event.
+ * Terminate the walk if the previous root was invalidated, which can
+ * happen if the caller yielded and dropped mmu_lock. Because invalid
+ * roots are removed from tdp_mmu_roots with mmu_lock held for write,
+ * if the previous root was invalidated, then the invalidation occurred
+ * after this walk started. And because _all_ roots are invalidated
+ * during an invalidation event, any root on tdp_mmu_roots was created
+ * after the invalidation. Lastly, any state change being made by the
+ * caller must be effected before updating SPTEs, otherwise vCPUs could
+ * simply create new SPTEs with the old state. Thus, if the previous
+ * root was invalidated, all valid roots are guaranteed to have been
+ * created after the desired state change and don't need to be updated.
*/
if (prev_root && prev_root->role.invalid) {
kvm_tdp_mmu_put_root(kvm, prev_root, shared);
- prev_root = NULL;
+ return NULL;
}

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:17

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 23/28] KVM: x86/mmu: Use "zap root" path for "slow" zap of all TDP MMU SPTEs

Now that the "slow" zap of all TDP MMU SPTEs doesn't flush the TLBs, use
the dedicated "zap root" helper, which can be used if and only if a flush
isn't needed.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e5401f0efe8e..99ea19e763da 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -888,6 +888,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,

void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
+ struct kvm_mmu_page *root;
int i;

/*
@@ -895,8 +896,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
*/
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
+ (void)tdp_mmu_zap_root(kvm, root, false);
+ }
}

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:21

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 24/28] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page

Convert tdp_mmu_zap_root() into its own dedicated flow instead of simply
redirecting into zap_gfn_range(). In addition to hardening zapping of
roots, this will allow future simplification of zap_gfn_range() by having
it zap only leaf SPTEs, and by removing its tricky "zap all" heuristic.
By having all paths that truly need to free _all_ SPs flow through the
dedicated root zapper, the generic zapper can be freed of those concerns.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 91 +++++++++++++++++++++++++++-----------
1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 99ea19e763da..0e5a0d40e54a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,10 +53,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
rcu_barrier();
}

-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush,
- bool shared);
-
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
free_page((unsigned long)sp->spt);
@@ -79,11 +75,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}

-static bool tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
- bool shared)
-{
- return zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
-}
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared, bool root_is_unreachable);

/*
* Note, putting a root might sleep, i.e. the caller must have IRQs enabled and
@@ -120,13 +113,8 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* invalidates any paging-structure-cache entries, i.e. TLB entries for
* intermediate paging structures, that may be zapped, as such entries
* are associated with the ASID on both VMX and SVM.
- *
- * WARN if a flush is reported for an invalid root, as its child SPTEs
- * should have been zapped by kvm_tdp_mmu_zap_invalidated_roots(), and
- * inserting new SPTEs under an invalid root is a KVM bug.
*/
- if (tdp_mmu_zap_root(kvm, root, shared))
- WARN_ON_ONCE(root->role.invalid);
+ tdp_mmu_zap_root(kvm, root, shared, true);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -766,6 +754,65 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
return false;
}

+static inline gfn_t tdp_mmu_max_gfn_host(void)
+{
+ /*
+ * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
+ * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
+ * and so KVM will never install a SPTE for such addresses.
+ */
+ return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+}
+
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared, bool root_is_unreachable)
+{
+ struct tdp_iter iter;
+
+ gfn_t end = tdp_mmu_max_gfn_host();
+ gfn_t start = 0;
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+ rcu_read_lock();
+
+ /*
+ * No need to try to step down in the iterator when zapping an entire
+ * root, zapping an upper-level SPTE will recurse on its children.
+ */
+ for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+ root->role.level, start, end) {
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ continue;
+
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ if (!shared) {
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
+ /*
+ * cmpxchg() shouldn't fail if the root is unreachable.
+ * to be unreachable. Re-read the SPTE and retry so as
+ * not to leak the page and its children.
+ */
+ WARN_ONCE(root_is_unreachable,
+ "Contended TDP MMU SPTE in unreachable root.");
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
+ goto retry;
+ }
+ /*
+ * WARN if the root is invalid and is unreachable, all SPTEs
+ * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
+ * and inserting new SPTEs under an invalid root is a KVM bug.
+ */
+ WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
+ }
+
+ rcu_read_unlock();
+}
+
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
u64 old_spte;
@@ -807,8 +854,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush,
bool shared)
{
- gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
- bool zap_all = (start == 0 && end >= max_gfn_host);
+ bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;

/*
@@ -817,12 +863,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
*/
int min_level = zap_all ? root->role.level : PG_LEVEL_4K;

- /*
- * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
- * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
- * and so KVM will never install a SPTE for such addresses.
- */
- end = min(end, max_gfn_host);
+ end = min(end, tdp_mmu_max_gfn_host());

kvm_lockdep_assert_mmu_lock_held(kvm, shared);

@@ -898,7 +939,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
*/
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
- (void)tdp_mmu_zap_root(kvm, root, false);
+ tdp_mmu_zap_root(kvm, root, false, true);
}
}

@@ -934,7 +975,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
* will still flush on yield, but that's a minor performance
* blip and not a functional issue.
*/
- (void)tdp_mmu_zap_root(kvm, root, true);
+ tdp_mmu_zap_root(kvm, root, true, false);
kvm_tdp_mmu_put_root(kvm, root, true);
}
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:23

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 21/28] KVM: x86/mmu: Add TDP MMU helper to zap a root

Add a small wrapper to handle zapping a specific root. For now, it's
little more than syntactic sugar, but in the future it will become a
unique flow with rules specific to zapping an unreachable root.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9449cb5baf0b..31fb622249e5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -79,11 +79,18 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}

+static bool tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared)
+{
+ return zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
+}
+
/*
* Note, putting a root might sleep, i.e. the caller must have IRQs enabled and
* must not explicitly disable preemption (it will be disabled by virtue of
* holding mmu_lock, hence the lack of a might_sleep()).
*/
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -118,7 +125,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* should have been zapped by kvm_tdp_mmu_zap_invalidated_roots(), and
* inserting new SPTEs under an invalid root is a KVM bug.
*/
- if (zap_gfn_range(kvm, root, 0, -1ull, true, false, shared))
+ if (tdp_mmu_zap_root(kvm, root, shared))
WARN_ON_ONCE(root->role.invalid);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -923,7 +930,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
* will still flush on yield, but that's a minor performance
* blip and not a functional issue.
*/
- (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
+ (void)tdp_mmu_zap_root(kvm, root, true);
kvm_tdp_mmu_put_root(kvm, root, true);
}
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:25

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 25/28] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range

Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
support for zapping with mmu_lock held for read. That all callers hold
mmu_lock for write isn't a random coincedence; now that the paths that
need to zap _everything_ have their own path, the only callers left are
those that need to zap for functional correctness. And when zapping is
required for functional correctness, mmu_lock must be held for write,
otherwise the caller has no guarantees about the state of the TDP MMU
page tables after it has run, e.g. the SPTE(s) it zapped can be
immediately replaced by a vCPU faulting in a page.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 29 ++++++-----------------------
1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0e5a0d40e54a..926e92473e92 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -844,15 +844,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* function cannot yield, it will not release the MMU lock or reschedule and
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
- *
- * If shared is true, this thread holds the MMU lock in read mode and must
- * account for the possibility that other threads are modifying the paging
- * structures concurrently. If shared is false, this thread should hold the
- * MMU lock in write mode.
*/
static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush,
- bool shared)
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
{
bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;
@@ -865,15 +859,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

end = min(end, tdp_mmu_max_gfn_host());

- kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ lockdep_assert_held_write(&kvm->mmu_lock);

rcu_read_lock();

for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
min_level, start, end) {
-retry:
if (can_yield &&
- tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
+ tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}
@@ -892,17 +885,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

- if (!shared) {
- tdp_mmu_set_spte(kvm, &iter, 0);
- flush = true;
- } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
- /*
- * The iter must explicitly re-read the SPTE because
- * the atomic cmpxchg failed.
- */
- iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
- goto retry;
- }
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ flush = true;
}

rcu_read_unlock();
@@ -921,8 +905,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
struct kvm_mmu_page *root;

for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
- false);
+ flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);

return flush;
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:27

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 27/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

When yielding in the TDP MMU iterator, service any pending TLB flush
before dropping RCU protections in anticipation of using the callers RCU
"lock" as a proxy for vCPUs in the guest.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 79a52717916c..55c16680b927 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -732,11 +732,11 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
return false;

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
- rcu_read_unlock();
-
if (flush)
kvm_flush_remote_tlbs(kvm);

+ rcu_read_unlock();
+
if (shared)
cond_resched_rwlock_read(&kvm->mmu_lock);
else
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:33

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 28/28] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages

Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
of immediately flushing. Because the shadow pages are freed in an RCU
callback, so long as at least one CPU holds RCU, all CPUs are protected.
For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
needs to ensure the caller services the pending TLB flush before dropping
its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs
running in the guest.

Deferring the flushes allows batching flushes, e.g. when installing a
1gb hugepage and zapping a pile of SPs, and when zapping an entire root,
allows skipping the flush entirely (becaues flushes are not needed in
that case).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++
arch/x86/kvm/mmu/tdp_iter.h | 7 +++----
arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++++------------
3 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef689b8bab12..7aab9737dffa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6237,6 +6237,12 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
rcu_idx = srcu_read_lock(&kvm->srcu);
write_lock(&kvm->mmu_lock);

+ /*
+ * Zapping TDP MMU shadow pages, including the remote TLB flush, must
+ * be done under RCU protection, the pages are freed via RCU callback.
+ */
+ rcu_read_lock();
+
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
for ( ; to_zap; --to_zap) {
@@ -6261,12 +6267,18 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+ rcu_read_unlock();
+
cond_resched_rwlock_write(&kvm->mmu_lock);
flush = false;
+
+ rcu_read_lock();
}
}
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

+ rcu_read_unlock();
+
write_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, rcu_idx);
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 0693f1fdb81e..0299703fc844 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,10 +9,9 @@

/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
- * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
- * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
- * the lower* depths of the TDP MMU just to make lockdep happy is a nightmare,
- * so all* accesses to SPTEs are must be done under RCU protection.
+ * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
+ * batched without having to collect the list of zapped SPs. Flows that can
+ * remove SPs must service pending TLB flushes prior to dropping RCU protection.
*/
static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 55c16680b927..62cb357b1dff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -433,9 +433,6 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
shared);
}

- kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
- KVM_PAGES_PER_HPAGE(level + 1));
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

@@ -815,21 +812,14 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,

bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- u64 old_spte;
+ u64 old_spte = kvm_tdp_mmu_read_spte(sp->ptep);

- rcu_read_lock();
-
- old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
- if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
- rcu_read_unlock();
+ if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;
- }

__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
sp->gfn, sp->role.level + 1, true, true);

- rcu_read_unlock();
-
return true;
}

@@ -871,6 +861,11 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
}

rcu_read_unlock();
+
+ /*
+ * Because this flows zaps _only_ leaf SPTEs, the caller doesn't need
+ * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
+ */
return flush;
}

@@ -1011,6 +1006,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
ret = RET_PF_SPURIOUS;
else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
return RET_PF_RETRY;
+ else if (is_shadow_present_pte(iter->old_spte) &&
+ !is_last_spte(iter->old_spte, iter->level))
+ kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+ KVM_PAGES_PER_HPAGE(iter->level + 1));

/*
* If the page fault was caused by a write but the page is write
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:40

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 26/28] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()

Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
functions according. When removing mappings for functional correctness
(except for the stupid VFIO GPU passthrough memslots bug), zapping the
leaf SPTEs is sufficient as the paging structures themselves do not point
at guest memory and do not directly impact the final translation (in the
TDP MMU).

Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
kvm_unmap_gfn_range().

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
arch/x86/kvm/mmu/tdp_mmu.h | 8 +-------
3 files changed, 14 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e3cd330c9532..ef689b8bab12 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5796,8 +5796,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

if (is_tdp_mmu_enabled(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
- gfn_end, flush);
+ flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
+ gfn_end, true, flush);
}

if (flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 926e92473e92..79a52717916c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
}

/*
- * Tears down the mappings for the range of gfns, [start, end), and frees the
- * non-root pages mapping GFNs strictly within that range. Returns true if
- * SPTEs have been cleared and a TLB flush is needed before releasing the
- * MMU lock.
+ * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
+ * have been cleared and a TLB flush is needed before releasing the MMU lock.
*
* If can_yield is true, will release the MMU lock and reschedule if the
* scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -845,18 +843,11 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
*/
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush)
+static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
{
- bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;

- /*
- * No need to try to step down in the iterator when zapping all SPTEs,
- * zapping the top-level non-leaf SPTEs will recurse on their children.
- */
- int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
-
end = min(end, tdp_mmu_max_gfn_host());

lockdep_assert_held_write(&kvm->mmu_lock);
@@ -864,24 +855,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();

for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
- min_level, start, end) {
+ PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}

- if (!is_shadow_present_pte(iter.old_spte))
- continue;
-
- /*
- * If this is a non-last-level SPTE that covers a larger range
- * than should be zapped, continue, and zap the mappings at a
- * lower level, except when zapping all SPTEs.
- */
- if (!zap_all &&
- (iter.gfn < start ||
- iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+ if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;

@@ -899,13 +880,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* SPTEs have been cleared and a TLB flush is needed before releasing the
* MMU lock.
*/
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
- gfn_t end, bool can_yield, bool flush)
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+ bool can_yield, bool flush)
{
struct kvm_mmu_page *root;

for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);

return flush;
}
@@ -1147,8 +1128,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
- range->end, range->may_block, flush);
+ return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
+ range->end, range->may_block, flush);
}

typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 8ad1717f4a1d..6e7c32170608 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -24,14 +24,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);

-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
gfn_t end, bool can_yield, bool flush);
-static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
- gfn_t start, gfn_t end, bool flush)
-{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
-}
-
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:45

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 22/28] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU

Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
uses the slow version of "zap everything" is when the VM is being
destroyed or the owning mm has exited. In either case, KVM_RUN is
unreachable for the VM, i.e. the guest TLB entries cannot be consumed.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 31fb622249e5..e5401f0efe8e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -888,14 +888,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,

void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
- bool flush = false;
int i;

+ /*
+ * A TLB flush is unnecessary, KVM's zap everything if and only the VM
+ * is being destroyed or the userspace VMM has exited. In both cases,
+ * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
+ */
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
+ (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
}

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:48

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 20/28] KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook

Use the common TDP MMU zap helper when hanlding an MMU notifier unmap
event, the two flows are semantically identical.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ea6651e735c2..9449cb5baf0b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1112,13 +1112,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
- struct kvm_mmu_page *root;
-
- for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
- flush = zap_gfn_range(kvm, root, range->start, range->end,
- range->may_block, flush, false);
-
- return flush;
+ return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
+ range->end, range->may_block, flush);
}

typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:51

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 19/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

When recovering a potential hugepage that was shattered for the iTLB
multihit workaround, precisely zap only the target page instead of
iterating over the TDP MMU to find the SP that was passed in. This will
allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 7 ++++++-
arch/x86/kvm/mmu/tdp_iter.h | 2 --
arch/x86/kvm/mmu/tdp_mmu.c | 28 ++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 18 +-----------------
4 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 52c6527b1a06..8ede43a826af 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -30,6 +30,8 @@ extern bool dbg;
#define INVALID_PAE_ROOT 0
#define IS_VALID_PAE_ROOT(x) (!!(x))

+typedef u64 __rcu *tdp_ptep_t;
+
struct kvm_mmu_page {
/*
* Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -59,7 +61,10 @@ struct kvm_mmu_page {
refcount_t tdp_mmu_root_count;
};
unsigned int unsync_children;
- struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ union {
+ struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ tdp_ptep_t ptep;
+ };
DECLARE_BITMAP(unsync_child_bitmap, 512);

struct list_head lpage_disallowed_link;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 9c04d8677cb3..0693f1fdb81e 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -7,8 +7,6 @@

#include "mmu.h"

-typedef u64 __rcu *tdp_ptep_t;
-
/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
* to be zapped while holding mmu_lock for read. Holding RCU isn't required for
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7d354344924d..ea6651e735c2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -318,12 +318,16 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
*
* @kvm: kvm instance
* @sp: the new page
+ * @sptep: pointer to the new page's SPTE (in its parent)
* @account_nx: This page replaces a NX large page and should be marked for
* eventual reclaim.
*/
static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool account_nx)
+ tdp_ptep_t sptep, bool account_nx)
{
+ WARN_ON_ONCE(sp->ptep);
+ sp->ptep = sptep;
+
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
if (account_nx)
@@ -755,6 +759,26 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
return false;
}

+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ u64 old_spte;
+
+ rcu_read_lock();
+
+ old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
+ if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
+ sp->gfn, sp->role.level + 1, true, true);
+
+ rcu_read_unlock();
+
+ return true;
+}
+
/*
* Tears down the mappings for the range of gfns, [start, end), and frees the
* non-root pages mapping GFNs strictly within that range. Returns true if
@@ -1062,7 +1086,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
!shadow_accessed_mask);

if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
- tdp_mmu_link_page(vcpu->kvm, sp,
+ tdp_mmu_link_page(vcpu->kvm, sp, iter.sptep,
fault->huge_page_disallowed &&
fault->req_level >= iter.level);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ced6d8e47362..8ad1717f4a1d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -31,24 +31,8 @@ static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
{
return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
}
-static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level + 1);
-
- /*
- * Don't allow yielding, as the caller may have a flush pending. Note,
- * if mmu_lock is held for write, zapping will never yield in this case,
- * but explicitly disallow it for safety. The TDP MMU does not yield
- * until it has made forward progress (steps sideways), and when zapping
- * a single shadow page that it's guaranteed to see (thus the mmu_lock
- * requirement), its "step sideways" will always step beyond the bounds
- * of the shadow page's gfn range and stop iterating before yielding.
- */
- lockdep_assert_held_write(&kvm->mmu_lock);
- return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
- sp->gfn, end, false, false);
-}

+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
struct list_head *invalidated_roots);
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:54

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 16/28] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path

WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
which is called out by the comment as being disallowed but not actually
checked. Keep the WARN on the old_spte as well, because overwriting a
REMOVED_SPTE in the non-atomic path is also disallowed (as evidence by
lack of splats with the existing WARN).

Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 085f6b09e5f3..d9524b387221 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -638,13 +638,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
lockdep_assert_held_write(&kvm->mmu_lock);

/*
- * No thread should be using this function to set SPTEs to the
+ * No thread should be using this function to set SPTEs to or from the
* temporary removed SPTE value.
* If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
* should be used. If operating under the MMU lock in write mode, the
* use of the removed SPTE should not be necessary.
*/
- WARN_ON(is_removed_spte(iter->old_spte));
+ WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));

kvm_tdp_mmu_write_spte(iter->sptep, new_spte);

--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:52:55

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 18/28] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw vals

Refactor __tdp_mmu_set_spte() to work with raw values instead of a
tdp_iter objects so that a future patch can modify SPTEs without doing a
walk, and without having to synthesize a tdp_iter.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 47 +++++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc8d021a1ba5..7d354344924d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -622,9 +622,13 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,

/*
* __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
- * @kvm: kvm instance
- * @iter: a tdp_iter instance currently on the SPTE that should be set
- * @new_spte: The value the SPTE should be set to
+ * @kvm: KVM instance
+ * @as_id: Address space ID, i.e. regular vs. SMM
+ * @sptep: Pointer to the SPTE
+ * @old_spte: The current value of the SPTE
+ * @new_spte: The new value that will be set for the SPTE
+ * @gfn: The base GFN that was (or will be) mapped by the SPTE
+ * @level: The level _containing_ the SPTE (its parent PT's level)
* @record_acc_track: Notify the MM subsystem of changes to the accessed state
* of the page. Should be set unless handling an MMU
* notifier for access tracking. Leaving record_acc_track
@@ -636,9 +640,9 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* Leaving record_dirty_log unset in that case prevents page
* writes from being double counted.
*/
-static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
- u64 new_spte, bool record_acc_track,
- bool record_dirty_log)
+static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
+ u64 old_spte, u64 new_spte, gfn_t gfn, int level,
+ bool record_acc_track, bool record_dirty_log)
{
lockdep_assert_held_write(&kvm->mmu_lock);

@@ -649,39 +653,46 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
* should be used. If operating under the MMU lock in write mode, the
* use of the removed SPTE should not be necessary.
*/
- WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
+ WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));

- kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+ kvm_tdp_mmu_write_spte(sptep, new_spte);
+
+ __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);

- __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, false);
if (record_acc_track)
- handle_changed_spte_acc_track(iter->old_spte, new_spte,
- iter->level);
+ handle_changed_spte_acc_track(old_spte, new_spte, level);
if (record_dirty_log)
- handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
- iter->old_spte, new_spte,
- iter->level);
+ handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+ new_spte, level);
+}
+
+static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+ u64 new_spte, bool record_acc_track,
+ bool record_dirty_log)
+{
+ __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte,
+ new_spte, iter->gfn, iter->level,
+ record_acc_track, record_dirty_log);
}

static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
+ _tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
}

static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
+ _tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
}

static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+ _tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
}

#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:02

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 13/28] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks

Drop RCU protection after processing each root when handling MMU notifier
hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily
drop RCU to let RCU do its thing between roots, and to make it clear that
there's no special behavior that relies on holding RCU across all roots.

Currently, the RCU protection is completely superficial, it's necessary
only to make rcu_dereference() of SPTE pointers happy. A future patch
will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
rather, acknowledge the need for a flush), but in that case RCU needs to
be held until the flush is complete if and only if the flush is needed
because a shadow page may have been removed. And except for the "unmap"
path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
and can't change the PFN for a SP).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 06b500fab248..3ff7b4cd7d0e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1079,18 +1079,19 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,

lockdep_assert_held_write(&kvm->mmu_lock);

- rcu_read_lock();
-
/*
* Don't support rescheduling, none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ rcu_read_lock();
+
tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
+
+ rcu_read_unlock();
}

- rcu_read_unlock();

return ret;
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:06

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 12/28] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte

Batch TLB flushes (with other MMUs) when handling ->change_spte()
notifications in the TDP MMU. The MMU notifier path in question doesn't
allow yielding and correcty flushes before dropping mmu_lock.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8e446ef03022..06b500fab248 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1182,13 +1182,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
*/
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
-
- /* FIXME: return 'flush' instead of flushing here. */
- if (flush)
- kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
-
- return false;
+ /*
+ * No need to handle the remote TLB flush under RCU protection, the
+ * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
+ * shadow page. See the WARN on pfn_changed in __handle_changed_spte().
+ */
+ return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
}

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:07

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 11/28] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal

Look for a !leaf=>leaf conversion instead of a PFN change when checking
if a SPTE change removed a TDP MMU shadow page. Convert the PFN check
into a WARN, as KVM should never change the PFN of a shadow page (except
when its being zapped or replaced).

From a purely theoretical perspective, it's not illegal to replace a SP
with a hugepage pointing at the same PFN. In practice, it's impossible
as that would require mapping guest memory overtop a kernel-allocated SP.
Either way, the check is odd.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 138c7dc41d2c..8e446ef03022 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -489,9 +489,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,

/*
* Recursively handle child PTs if the change removed a subtree from
- * the paging structure.
+ * the paging structure. Note the WARN on the PFN changing without the
+ * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
+ * pages are kernel allocations and should never be migrated.
*/
- if (was_present && !was_leaf && (pfn_changed || !is_present))
+ if (was_present && !was_leaf &&
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_tdp_mmu_page(kvm,
spte_to_child_pt(old_spte, level), shared);
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:13

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 09/28] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter

Assert that mmu_lock is held for write by users of the yield-unfriendly
TDP iterator. The nature of a shared walk means that the caller needs to
play nice with other tasks modifying the page tables, which is more or
less the same thing as playing nice with yielding. Theoretically, KVM
could gain a flow where it could legitimately take mmu_lock for read in
a non-preemptible context, but that's highly unlikely and any such case
should be viewed with a fair amount of scrutiny.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 12a28afce73f..3086c6dc74fb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -159,11 +159,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
- lockdep_is_held_type(&kvm->mmu_lock, 0) || \
- lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \
+/*
+ * Iterate over all valid TDP MMU roots. Requires that mmu_lock be held for
+ * write, the implication being that any flow that holds mmu_lock for read is
+ * inherently yield-friendly and should use the yielf-safe variant above.
+ * Holding mmu_lock for write obviates the need for RCU protection as the list
+ * is guaranteed to be stable.
+ */
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_mmu_page_as_id(_root) != _as_id) { \
+ lockdep_assert_held_write(&(_kvm)->mmu_lock); \
} else

static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
@@ -1063,6 +1069,8 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct tdp_iter iter;
bool ret = false;

+ lockdep_assert_held_write(&kvm->mmu_lock);
+
rcu_read_lock();

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:18

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 10/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

Allow yielding when zapping SPTEs for a defunct TDP MMU root. Yielding
is safe from a TDP perspective, as the root is unreachable. The only
potential danger is putting a root from a non-preemptible context, and
KVM currently does not do so.

Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
take a reference to each root (it requires mmu_lock be held for the
entire duration of the walk).

tdp_mmu_next_root() is used only by the yield-friendly iterator.

kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.

kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
but is still yield-friendly in all call sites, as all callers can be
traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
kvm_create_vm().

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3086c6dc74fb..138c7dc41d2c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -79,6 +79,11 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}

+/*
+ * Note, putting a root might sleep, i.e. the caller must have IRQs enabled and
+ * must not explicitly disable preemption (it will be disabled by virtue of
+ * holding mmu_lock, hence the lack of a might_sleep()).
+ */
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -101,7 +106,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* intermediate paging structures, that may be zapped, as such entries
* are associated with the ASID on both VMX and SVM.
*/
- (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+ (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:23

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 08/28] KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root()

Drop the unused @kvm param from kvm_tdp_mmu_get_root(). No functional
change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
arch/x86/kvm/mmu/tdp_mmu.h | 3 +--
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4305ee8e3de3..12a28afce73f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -129,9 +129,10 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
typeof(*next_root), link);

- while (next_root && !kvm_tdp_mmu_get_root(kvm, next_root))
+ while (next_root && !kvm_tdp_mmu_get_root(next_root))
next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &next_root->link, typeof(*next_root), link);
+ &next_root->link,
+ typeof(*next_root), link);

rcu_read_unlock();

@@ -211,7 +212,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
/* Check for an existing root before allocating a new one. */
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
if (root->role.word == role.word &&
- kvm_tdp_mmu_get_root(kvm, root))
+ kvm_tdp_mmu_get_root(root))
goto out;
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3899004a5d91..599714de67c3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,8 +7,7 @@

hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);

-__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
- struct kvm_mmu_page *root)
+__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
if (root->role.invalid)
return false;
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:26

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 07/28] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush

Remove the misleading flush "handling" when zapping invalidated TDP MMU
roots, and document that flushing is unnecessary for all flavors of MMUs
when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU
is dead code, as zap_gfn_range() is called with shared=true, in which
case it will never return true due to the flushing being handled by
tdp_mmu_zap_spte_atomic().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3adac2630c4c..e00e46205730 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5642,9 +5642,13 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
}

/*
- * Trigger a remote TLB flush before freeing the page tables to ensure
- * KVM is not in the middle of a lockless shadow page table walk, which
- * may reference the pages.
+ * Kick all vCPUs (via remote TLB flush) before freeing the page tables
+ * to ensure KVM is not in the middle of a lockless shadow page table
+ * walk, which may reference the pages. The remote TLB flush itself is
+ * not required and is simply a convenient way to kick vCPUs as needed.
+ * KVM performs a local TLB flush when allocating a new root (see
+ * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
+ * running with an obsolete MMU.
*/
kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 981fb0517384..4305ee8e3de3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -833,7 +833,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
struct kvm_mmu_page *next_root;
struct kvm_mmu_page *root;
- bool flush = false;

lockdep_assert_held_read(&kvm->mmu_lock);

@@ -846,7 +845,16 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)

rcu_read_unlock();

- flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
+ /*
+ * A TLB flush is unnecessary, invalidated roots are guaranteed
+ * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
+ * for more details), and unlike the legacy MMU, no vCPU kick
+ * is needed to play nice with lockless shadow walks as the TDP
+ * MMU protects its paging structures via RCU. Note, zapping
+ * will still flush on yield, but that's a minor performance
+ * blip and not a functional issue.
+ */
+ (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);

/*
* Put the reference acquired in
@@ -860,9 +868,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
}

rcu_read_unlock();
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
}

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-20 04:53:28

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 06/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

Explicitly ignore the result of zap_gfn_range() when putting the last
reference to a TDP MMU root, and add a pile of comments to formalize the
TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this
only affects the !shared case, as zap_gfn_range() subtly never returns
true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().

Putting the root without a flush is ok because even if there are stale
references to the root in the TLB, they are unreachable because KVM will
not run the guest with the same ASID without first flushing (where ASID
in this context refers to both SVM's explicit ASID and Intel's implicit
ASID that is constructed from VPID+PCID+EPT4A+etc...).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 31ce913efe37..3adac2630c4c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5098,6 +5098,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
kvm_mmu_sync_roots(vcpu);

kvm_mmu_load_pgd(vcpu);
+
+ /*
+ * Flush any TLB entries for the new root, the provenance of the root
+ * is unknown. In theory, even if KVM ensures there are no stale TLB
+ * entries for a freed root, in theory, an out-of-tree hypervisor could
+ * have left stale entries. Flushing on alloc also allows KVM to skip
+ * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
+ */
static_call(kvm_x86_tlb_flush_current)(vcpu);
out:
return r;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c575df121b19..981fb0517384 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

- zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB flush when
+ * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+ * to a different pCPU. Note, the local TLB flush on reuse also
+ * invalidates any paging-structure-cache entries, i.e. TLB entries for
+ * intermediate paging structures, that may be zapped, as such entries
+ * are associated with the ASID on both VMX and SVM.
+ */
+ (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-22 19:54:57

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 04/28] KVM: x86/mmu: Retry page fault if root is invalidated by memslot update

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Bail from the page fault handler if the root shadow page was obsoleted by
> a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU

*acquiring

> doesn't rely on the memslot/MMU generation, and instead relies on the
> root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
> mmu_lock for write.
>
> For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
> kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
> moved past the gfn associated with the SP.
>
> For other MMUs, the resulting behavior is far more convoluted, though
> unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
> root isn't directly problematic, as the obsolete root will be unloaded
> and dropped before the vCPU re-enters the guest. But because the legacy
> MMU tracks shadow pages by their role, any SP created by the fault can
> can be reused in the new post-reload root. Again, that _shouldn't_ be
> problematic as any leaf child SPTEs will be created for the current/valid
> memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
> the old generation as they will be flagged as obsolete. But, given that
> continuing with the fault is pointess (the root will be unloaded), apply
> the check to all MMUs.
>
> Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
> Cc: [email protected]
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

A couple spelling nits, but otherwise:

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu.c | 23 +++++++++++++++++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
> 2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d2ad12a4d66e..31ce913efe37 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1936,7 +1936,11 @@ static void mmu_audit_disable(void) { }
>
> static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - return sp->role.invalid ||
> + if (sp->role.invalid)
> + return true;
> +
> + /* TDP MMU pages due not use the MMU generation. */

*do

> + return !sp->tdp_mmu_page &&
> unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> }
>
> @@ -3976,6 +3980,20 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> return true;
> }
>
> +/*
> + * Returns true if the page fault is stale and needs to be retried, i.e. if the
> + * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> + */
> +static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault, int mmu_seq)
> +{
> + if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
> + return true;
> +
> + return fault->slot &&
> + mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +}
> +
> static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
> @@ -4013,8 +4031,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> else
> write_lock(&vcpu->kvm->mmu_lock);
>
> - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> + if (is_page_fault_stale(vcpu, fault, mmu_seq))
> goto out_unlock;
> +
> r = make_mmu_pages_available(vcpu);
> if (r)
> goto out_unlock;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index f87d36898c44..708a5d297fe1 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -911,7 +911,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> r = RET_PF_RETRY;
> write_lock(&vcpu->kvm->mmu_lock);
> - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> +
> + if (is_page_fault_stale(vcpu, fault, mmu_seq))
> goto out_unlock;
>
> kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 19:57:36

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 05/28] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Explicitly check for preset SPTEs when clearing dirty bits in the TDP
> MMU. This isn't strictly required for correctness, as setting the dirty
> bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
> However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
> complain if anyone actually turned on KVM's MMU debugging.
>
> Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1db8496259ad..c575df121b19 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1246,6 +1246,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
>
> + if (!is_shadow_present_pte(iter.old_spte))
> + continue;
> +
> if (spte_ad_need_write_protect(iter.old_spte)) {
> if (is_writable_pte(iter.old_spte))
> new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 20:03:24

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 08/28] KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root()

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Drop the unused @kvm param from kvm_tdp_mmu_get_root(). No functional
> change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
> arch/x86/kvm/mmu/tdp_mmu.h | 3 +--
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4305ee8e3de3..12a28afce73f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -129,9 +129,10 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> typeof(*next_root), link);
>
> - while (next_root && !kvm_tdp_mmu_get_root(kvm, next_root))
> + while (next_root && !kvm_tdp_mmu_get_root(next_root))
> next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &next_root->link, typeof(*next_root), link);
> + &next_root->link,
> + typeof(*next_root), link);
>
> rcu_read_unlock();
>
> @@ -211,7 +212,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> /* Check for an existing root before allocating a new one. */
> for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> if (root->role.word == role.word &&
> - kvm_tdp_mmu_get_root(kvm, root))
> + kvm_tdp_mmu_get_root(root))
> goto out;
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3899004a5d91..599714de67c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -7,8 +7,7 @@
>
> hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>
> -__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
> - struct kvm_mmu_page *root)
> +__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> {
> if (root->role.invalid)
> return false;
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 20:10:19

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 09/28] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Assert that mmu_lock is held for write by users of the yield-unfriendly
> TDP iterator. The nature of a shared walk means that the caller needs to
> play nice with other tasks modifying the page tables, which is more or
> less the same thing as playing nice with yielding. Theoretically, KVM
> could gain a flow where it could legitimately take mmu_lock for read in
> a non-preemptible context, but that's highly unlikely and any such case
> should be viewed with a fair amount of scrutiny.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 12a28afce73f..3086c6dc74fb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -159,11 +159,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> if (kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> -#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> - list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
> - lockdep_is_held_type(&kvm->mmu_lock, 0) || \
> - lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \
> +/*
> + * Iterate over all valid TDP MMU roots. Requires that mmu_lock be held for
> + * write, the implication being that any flow that holds mmu_lock for read is
> + * inherently yield-friendly and should use the yielf-safe variant above.

Nit: *yield-safe

> + * Holding mmu_lock for write obviates the need for RCU protection as the list
> + * is guaranteed to be stable.
> + */
> +#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
> if (kvm_mmu_page_as_id(_root) != _as_id) { \
> + lockdep_assert_held_write(&(_kvm)->mmu_lock); \

Did you mean for this lockdep to only be hit in this uncommon
non-matching ASID case?

> } else
>
> static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> @@ -1063,6 +1069,8 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> struct tdp_iter iter;
> bool ret = false;
>
> + lockdep_assert_held_write(&kvm->mmu_lock);
> +
> rcu_read_lock();
>
> /*
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 20:20:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 09/28] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter

On Mon, Nov 22, 2021, Ben Gardon wrote:
> > + * Holding mmu_lock for write obviates the need for RCU protection as the list
> > + * is guaranteed to be stable.
> > + */
> > +#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> > + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
> > if (kvm_mmu_page_as_id(_root) != _as_id) { \
> > + lockdep_assert_held_write(&(_kvm)->mmu_lock); \
>
> Did you mean for this lockdep to only be hit in this uncommon
> non-matching ASID case?

Yes and no. Yes, I intended what I wrote. No, this isn't intended to be limited
to a memslot address space mismatch, but at the time I wrote this I was apparently
lazy or inept :-)

In hindsight, this would be better:

/* blah blah blah */
static inline struct list_head *kvm_get_tdp_mmu_roots_exclusive(struct kvm *kvm)
{
lockdep_assert_held_write(&kvm->mmu_lock);

return &kvm->arch.tdp_mmu_roots;
}

#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
list_for_each_entry(_root, kvm_get_tdp_mmu_roots_exclusive(kvm), link) \
if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

2021-11-22 21:31:00

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 10/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Allow yielding when zapping SPTEs for a defunct TDP MMU root. Yielding
> is safe from a TDP perspective, as the root is unreachable. The only
> potential danger is putting a root from a non-preemptible context, and
> KVM currently does not do so.
>
> Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
> take a reference to each root (it requires mmu_lock be held for the
> entire duration of the walk).
>
> tdp_mmu_next_root() is used only by the yield-friendly iterator.
>
> kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
>
> kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
> but is still yield-friendly in all call sites, as all callers can be
> traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
> kvm_create_vm().
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

I'm glad to see this fixed. I assume we don't usually hit this in
testing because most of the teardown happens in the zap-all path when
we unregister for MMU notifiers and actually deleting a fully
populated root while the VM is running is pretty rare.

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3086c6dc74fb..138c7dc41d2c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -79,6 +79,11 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> tdp_mmu_free_sp(sp);
> }
>
> +/*
> + * Note, putting a root might sleep, i.e. the caller must have IRQs enabled and
> + * must not explicitly disable preemption (it will be disabled by virtue of
> + * holding mmu_lock, hence the lack of a might_sleep()).
> + */
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared)
> {
> @@ -101,7 +106,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> * intermediate paging structures, that may be zapped, as such entries
> * are associated with the ASID on both VMX and SVM.
> */
> - (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> + (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 21:45:20

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 12/28] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Batch TLB flushes (with other MMUs) when handling ->change_spte()
> notifications in the TDP MMU. The MMU notifier path in question doesn't
> allow yielding and correcty flushes before dropping mmu_lock.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

This seems very reasonable to me.

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8e446ef03022..06b500fab248 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1182,13 +1182,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> */
> bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> - bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> -
> - /* FIXME: return 'flush' instead of flushing here. */
> - if (flush)
> - kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
> -
> - return false;
> + /*
> + * No need to handle the remote TLB flush under RCU protection, the
> + * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
> + * shadow page. See the WARN on pfn_changed in __handle_changed_spte().
> + */
> + return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> }
>
> /*
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 21:48:07

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 13/28] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Drop RCU protection after processing each root when handling MMU notifier
> hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily
> drop RCU to let RCU do its thing between roots, and to make it clear that
> there's no special behavior that relies on holding RCU across all roots.
>
> Currently, the RCU protection is completely superficial, it's necessary
> only to make rcu_dereference() of SPTE pointers happy. A future patch
> will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
> ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
> rather, acknowledge the need for a flush), but in that case RCU needs to
> be held until the flush is complete if and only if the flush is needed
> because a shadow page may have been removed. And except for the "unmap"
> path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
> and can't change the PFN for a SP).
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 06b500fab248..3ff7b4cd7d0e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1079,18 +1079,19 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
>
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> - rcu_read_lock();
> -
> /*
> * Don't support rescheduling, none of the MMU notifiers that funnel
> * into this helper allow blocking; it'd be dead, wasteful code.
> */
> for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> + rcu_read_lock();
> +
> tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> ret |= handler(kvm, &iter, range);
> +
> + rcu_read_unlock();
> }
>
> - rcu_read_unlock();
>
> return ret;
> }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 21:55:51

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 14/28] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Add helpers to read and write TDP MMU SPTEs instead of open coding
> rcu_dereference() all over the place, and to provide a convenient
> location to document why KVM doesn't exempt holding mmu_lock for write
> from having to hold RCU (and any future changes to the rules).
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

Modulo a couple nits below.

> ---
> arch/x86/kvm/mmu/tdp_iter.c | 6 +++---
> arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++-------
> 3 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..1f7741c725f6 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> {
> iter->sptep = iter->pt_path[iter->level - 1] +
> SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> - iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> + iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
> }
>
> static gfn_t round_gfn_for_level(gfn_t gfn, int level)
> @@ -86,7 +86,7 @@ static bool try_step_down(struct tdp_iter *iter)
> * Reread the SPTE before stepping down to avoid traversing into page
> * tables that are no longer linked from this entry.
> */
> - iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> + iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
>
> child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> if (!child_pt)
> @@ -120,7 +120,7 @@ static bool try_step_side(struct tdp_iter *iter)
> iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
> iter->next_last_level_gfn = iter->gfn;
> iter->sptep++;
> - iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> + iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
>
> return true;
> }
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index b1748b988d3a..9c04d8677cb3 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -9,6 +9,22 @@
>
> typedef u64 __rcu *tdp_ptep_t;
>
> +/*
> + * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
> + * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
> + * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
> + * the lower* depths of the TDP MMU just to make lockdep happy is a nightmare,
> + * so all* accesses to SPTEs are must be done under RCU protection.

Nit: Are those extra asterisks intentional? I think this line should also be:
so all accesses to SPTEs must be done under RCU protection.

> + */
> +static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)

Nit: function names could also be tdp_iter_read/write_spte. It's a
little shorter, and in the tdp_iter file, but doesn't matter too much
either way.

> +{
> + return READ_ONCE(*rcu_dereference(sptep));
> +}
> +static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
> +{
> + WRITE_ONCE(*rcu_dereference(sptep), val);
> +}
> +
> /*
> * A TDP iterator performs a pre-order walk over a TDP paging structure.
> */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3ff7b4cd7d0e..ca6b30a7130d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -572,7 +572,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * here since the SPTE is going from non-present
> * to non-present.
> */
> - WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
> + kvm_tdp_mmu_write_spte(iter->sptep, 0);
>
> return true;
> }
> @@ -609,7 +609,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> */
> WARN_ON(is_removed_spte(iter->old_spte));
>
> - WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
> + kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
>
> __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> new_spte, iter->level, false);
> @@ -775,7 +775,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> * The iter must explicitly re-read the SPTE because
> * the atomic cmpxchg failed.
> */
> - iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> goto retry;
> }
> }
> @@ -1012,7 +1012,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> * because the new value informs the !present
> * path below.
> */
> - iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> }
>
> if (!is_shadow_present_pte(iter.old_spte)) {
> @@ -1225,7 +1225,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> * The iter must explicitly re-read the SPTE because
> * the atomic cmpxchg failed.
> */
> - iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> goto retry;
> }
> spte_set = true;
> @@ -1296,7 +1296,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> * The iter must explicitly re-read the SPTE because
> * the atomic cmpxchg failed.
> */
> - iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> goto retry;
> }
> spte_set = true;
> @@ -1427,7 +1427,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> * The iter must explicitly re-read the SPTE because
> * the atomic cmpxchg failed.
> */
> - iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);

Wow this was open coded in a LOT of places. Thanks for cleaning it up.

> goto retry;
> }
> }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 21:57:30

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 16/28] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
> which is called out by the comment as being disallowed but not actually
> checked. Keep the WARN on the old_spte as well, because overwriting a
> REMOVED_SPTE in the non-atomic path is also disallowed (as evidence by
> lack of splats with the existing WARN).
>
> Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 085f6b09e5f3..d9524b387221 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -638,13 +638,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> - * No thread should be using this function to set SPTEs to the
> + * No thread should be using this function to set SPTEs to or from the
> * temporary removed SPTE value.
> * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
> * should be used. If operating under the MMU lock in write mode, the
> * use of the removed SPTE should not be necessary.
> */
> - WARN_ON(is_removed_spte(iter->old_spte));
> + WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
>
> kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
>
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 22:20:42

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots

"/

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Take TDP MMU roots off the list of roots when they're invalidated instead
> of walking later on to find the roots that were just invalidated. In
> addition to making the flow more straightforward, this allows warning
> if something attempts to elevate the refcount of an invalid root, which
> should be unreachable (no longer on the list so can't be reached by MMU
> notifier, and vCPUs must reload a new root before installing new SPTE).
>
> Signed-off-by: Sean Christopherson <[email protected]>

There are a bunch of awesome little cleanups and unrelated fixes
included in this commit that could be factored out.

I'm skeptical of immediately moving the invalidated roots into another
list as that seems like it has a lot of potential for introducing
weird races. I'm not sure it actually solves a problem either. Part of
the motive from the commit description "this allows warning if
something attempts to elevate the refcount of an invalid root" can be
achieved already without moving the roots into a separate list.

Maybe this would seem more straightforward with some of the little
cleanups factored out, but this feels more complicated to me.

> ---
> arch/x86/kvm/mmu/mmu.c | 6 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 171 ++++++++++++++++++++-----------------
> arch/x86/kvm/mmu/tdp_mmu.h | 14 ++-
> 3 files changed, 108 insertions(+), 83 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e00e46205730..e3cd330c9532 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5664,6 +5664,8 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
> */
> static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> {
> + LIST_HEAD(invalidated_roots);
> +
> lockdep_assert_held(&kvm->slots_lock);
>
> write_lock(&kvm->mmu_lock);
> @@ -5685,7 +5687,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * could drop the MMU lock and yield.
> */
> if (is_tdp_mmu_enabled(kvm))
> - kvm_tdp_mmu_invalidate_all_roots(kvm);
> + kvm_tdp_mmu_invalidate_all_roots(kvm, &invalidated_roots);
>
> /*
> * Notify all vcpus to reload its shadow page table and flush TLB.
> @@ -5703,7 +5705,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>
> if (is_tdp_mmu_enabled(kvm)) {
> read_lock(&kvm->mmu_lock);
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> + kvm_tdp_mmu_zap_invalidated_roots(kvm, &invalidated_roots);
> read_unlock(&kvm->mmu_lock);
> }
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ca6b30a7130d..085f6b09e5f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -94,9 +94,17 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>
> WARN_ON(!root->tdp_mmu_page);
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_del_rcu(&root->link);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + /*
> + * Remove the root from tdp_mmu_roots, unless the root is invalid in
> + * which case the root was pulled off tdp_mmu_roots when it was
> + * invalidated. Note, this must be an RCU-protected deletion to avoid
> + * use-after-free in the yield-safe iterator!
> + */
> + if (!root->role.invalid) {
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + list_del_rcu(&root->link);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + }
>
> /*
> * A TLB flush is not necessary as KVM performs a local TLB flush when
> @@ -105,18 +113,23 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> * invalidates any paging-structure-cache entries, i.e. TLB entries for
> * intermediate paging structures, that may be zapped, as such entries
> * are associated with the ASID on both VMX and SVM.
> + *
> + * WARN if a flush is reported for an invalid root, as its child SPTEs
> + * should have been zapped by kvm_tdp_mmu_zap_invalidated_roots(), and
> + * inserting new SPTEs under an invalid root is a KVM bug.
> */
> - (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
> + if (zap_gfn_range(kvm, root, 0, -1ull, true, false, shared))
> + WARN_ON_ONCE(root->role.invalid);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
> /*
> - * Finds the next valid root after root (or the first valid root if root
> - * is NULL), takes a reference on it, and returns that next root. If root
> - * is not NULL, this thread should have already taken a reference on it, and
> - * that reference will be dropped. If no valid root is found, this
> - * function will return NULL.
> + * Finds the next root after @prev_root (or the first root if @prev_root is NULL
> + * or invalid), takes a reference on it, and returns that next root. If root is
> + * not NULL, this thread should have already taken a reference on it, and that
> + * reference will be dropped. If no valid root is found, this function will
> + * return NULL.
> */
> static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> struct kvm_mmu_page *prev_root,
> @@ -124,6 +137,27 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> {
> struct kvm_mmu_page *next_root;
>
> + lockdep_assert_held(&kvm->mmu_lock);
> +
> + /*
> + * Restart the walk if the previous root was invalidated, which can
> + * happen if the caller drops mmu_lock when yielding. Restarting the
> + * walke is necessary because invalidating a root also removes it from

Nit: *walk

> + * tdp_mmu_roots. Restarting is safe and correct because invalidating
> + * a root is done if and only if _all_ roots are invalidated, i.e. any
> + * root on tdp_mmu_roots was added _after_ the invalidation event.
> + */
> + if (prev_root && prev_root->role.invalid) {
> + kvm_tdp_mmu_put_root(kvm, prev_root, shared);
> + prev_root = NULL;
> + }
> +
> + /*
> + * Finding the next root must be done under RCU read lock. Although
> + * @prev_root itself cannot be removed from tdp_mmu_roots because this
> + * task holds a reference, its next and prev pointers can be modified
> + * when freeing a different root. Ditto for tdp_mmu_roots itself.
> + */

I'm not sure this is correct with the rest of the changes in this
patch. The new version of invalidate_roots removes roots from the list
immediately, even if they have a non-zero ref-count.

> rcu_read_lock();
>
> if (prev_root)
> @@ -230,10 +264,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> refcount_set(&root->tdp_mmu_root_count, 1);
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -
> + /*
> + * Because mmu_lock must be held for write to ensure that KVM doesn't
> + * create multiple roots for a given role, this does not need to use
> + * an RCU-friendly variant as readers of tdp_mmu_roots must also hold
> + * mmu_lock in some capacity.
> + */

I doubt we're doing it now, but in principle we could allocate new
roots with mmu_lock in read + tdp_mmu_pages_lock. That might be better
than depending on the write lock.

> + list_add(&root->link, &kvm->arch.tdp_mmu_roots);
> out:
> return __pa(root->spt);
> }
> @@ -814,28 +851,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> kvm_flush_remote_tlbs(kvm);
> }
>
> -static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> - struct kvm_mmu_page *prev_root)
> -{
> - struct kvm_mmu_page *next_root;
> -
> - if (prev_root)
> - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &prev_root->link,
> - typeof(*prev_root), link);
> - else
> - next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - typeof(*next_root), link);
> -
> - while (next_root && !(next_root->role.invalid &&
> - refcount_read(&next_root->tdp_mmu_root_count)))
> - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &next_root->link,
> - typeof(*next_root), link);
> -
> - return next_root;
> -}
> -
> /*
> * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
> * invalidated root, they will not be freed until this function drops the
> @@ -844,22 +859,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> * only has to do a trivial amount of work. Since the roots are invalid,
> * no new SPTEs should be created under them.
> */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
> + struct list_head *invalidated_roots)
> {
> - struct kvm_mmu_page *next_root;
> - struct kvm_mmu_page *root;
> + struct kvm_mmu_page *root, *tmp;
>
> + lockdep_assert_held(&kvm->slots_lock);
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - rcu_read_lock();
> -
> - root = next_invalidated_root(kvm, NULL);
> -
> - while (root) {
> - next_root = next_invalidated_root(kvm, root);
> -
> - rcu_read_unlock();
> -
> + /*
> + * Put the ref to each root, acquired by kvm_tdp_mmu_put_root(). The

Nit: s/kvm_tdp_mmu_put_root/kvm_tdp_mmu_get_root/

> + * safe variant is required even though kvm_tdp_mmu_put_root() doesn't
> + * explicitly remove the root from the invalid list, as this task does
> + * not take rcu_read_lock() and so the list object itself can be freed.
> + */
> + list_for_each_entry_safe(root, tmp, invalidated_roots, link) {
> /*
> * A TLB flush is unnecessary, invalidated roots are guaranteed
> * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
> @@ -870,49 +884,50 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> * blip and not a functional issue.
> */
> (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
> -
> - /*
> - * Put the reference acquired in
> - * kvm_tdp_mmu_invalidate_roots
> - */
> kvm_tdp_mmu_put_root(kvm, root, true);
> -
> - root = next_root;
> -
> - rcu_read_lock();
> }
> -
> - rcu_read_unlock();
> }
>
> /*
> - * Mark each TDP MMU root as invalid so that other threads
> - * will drop their references and allow the root count to
> - * go to 0.
> + * Mark each TDP MMU root as invalid so that other threads will drop their
> + * references and allow the root count to go to 0.
> *
> - * Also take a reference on all roots so that this thread
> - * can do the bulk of the work required to free the roots
> - * once they are invalidated. Without this reference, a
> - * vCPU thread might drop the last reference to a root and
> - * get stuck with tearing down the entire paging structure.
> + * Take a reference on each root and move it to a local list so that this task
> + * can do the actual work required to free the roots once they are invalidated,
> + * e.g. zap the SPTEs and trigger a remote TLB flush. Without this reference, a
> + * vCPU task might drop the last reference to a root and get stuck with tearing
> + * down the entire paging structure.
> *
> - * Roots which have a zero refcount should be skipped as
> - * they're already being torn down.
> - * Already invalid roots should be referenced again so that
> - * they aren't freed before kvm_tdp_mmu_zap_all_fast is
> - * done with them.
> + * Roots which have a zero refcount are skipped as they're already being torn
> + * down. Encountering a root that is already invalid is a KVM bug, as this is
> + * the only path that is allowed to invalidate roots and (a) it's proteced by

Nit: protected

> + * slots_lock and (b) pulls each root off tdp_mmu_roots.
> *
> - * This has essentially the same effect for the TDP MMU
> - * as updating mmu_valid_gen does for the shadow MMU.
> + * This has essentially the same effect for the TDP MMU as updating
> + * mmu_valid_gen does for the shadow MMU.
> */
> -void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
> + struct list_head *invalidated_roots)
> {
> - struct kvm_mmu_page *root;
> + struct kvm_mmu_page *root, *tmp;
>
> + /*
> + * mmu_lock must be held for write, moving entries off an RCU-protected
> + * list is not safe, entries can only be deleted. All accesses to
> + * tdp_mmu_roots are required to hold mmu_lock in some capacity, thus
> + * holding it for write ensures there are no concurrent readers.
> + */
> lockdep_assert_held_write(&kvm->mmu_lock);
> - list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
> - if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
> - root->role.invalid = true;
> +
> + list_for_each_entry_safe(root, tmp, &kvm->arch.tdp_mmu_roots, link) {
> + if (!kvm_tdp_mmu_get_root(root))
> + continue;
> +
> + list_move_tail(&root->link, invalidated_roots);
> +
> + WARN_ON_ONCE(root->role.invalid);
> + root->role.invalid = true;
> + }
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 599714de67c3..ced6d8e47362 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -9,7 +9,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>
> __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> {
> - if (root->role.invalid)
> + /*
> + * Acquiring a reference on an invalid root is a KVM bug. Invalid roots
> + * are supposed to be reachable only by references that were acquired
> + * before the invalidation, and taking an additional reference to an
> + * invalid root is not allowed.
> + */
> + if (WARN_ON_ONCE(root->role.invalid))
> return false;
>
> return refcount_inc_not_zero(&root->tdp_mmu_root_count);
> @@ -44,8 +50,10 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> }
>
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> -void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
> + struct list_head *invalidated_roots);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
> + struct list_head *invalidated_roots);
>
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 22:25:26

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 17/28] KVM: x86/mmu: Terminate yield-friendly walk if invalid root observed

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Stop walking TDP MMU roots if the previous root was invalidated while the
> caller yielded (dropped mmu_lock). Any effect that the caller wishes to
> be recognized by a root must be made visible before walking the list of
> roots. Because all roots are invalided when any root is invalidated, if
> the previous root was invalidated, then all roots on the list when the
> caller first started the walk also were invalidated, and any valid roots
> on the list must have been added after the invalidation event.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 19 ++++++++++++-------
> 1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index d9524b387221..cc8d021a1ba5 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -140,16 +140,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> lockdep_assert_held(&kvm->mmu_lock);
>
> /*
> - * Restart the walk if the previous root was invalidated, which can
> - * happen if the caller drops mmu_lock when yielding. Restarting the
> - * walke is necessary because invalidating a root also removes it from
> - * tdp_mmu_roots. Restarting is safe and correct because invalidating
> - * a root is done if and only if _all_ roots are invalidated, i.e. any
> - * root on tdp_mmu_roots was added _after_ the invalidation event.
> + * Terminate the walk if the previous root was invalidated, which can
> + * happen if the caller yielded and dropped mmu_lock. Because invalid
> + * roots are removed from tdp_mmu_roots with mmu_lock held for write,
> + * if the previous root was invalidated, then the invalidation occurred
> + * after this walk started. And because _all_ roots are invalidated
> + * during an invalidation event, any root on tdp_mmu_roots was created
> + * after the invalidation. Lastly, any state change being made by the
> + * caller must be effected before updating SPTEs, otherwise vCPUs could
> + * simply create new SPTEs with the old state. Thus, if the previous
> + * root was invalidated, all valid roots are guaranteed to have been
> + * created after the desired state change and don't need to be updated.
> */
> if (prev_root && prev_root->role.invalid) {
> kvm_tdp_mmu_put_root(kvm, prev_root, shared);
> - prev_root = NULL;
> + return NULL;
> }

Another option might be to walk the list FIFO so that the newly
created roots are encountered, but since they start empty, there might
not be any work to do.

>
> /*
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 22:30:21

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 18/28] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw vals

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Refactor __tdp_mmu_set_spte() to work with raw values instead of a
> tdp_iter objects so that a future patch can modify SPTEs without doing a
> walk, and without having to synthesize a tdp_iter.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 47 +++++++++++++++++++++++---------------
> 1 file changed, 29 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cc8d021a1ba5..7d354344924d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -622,9 +622,13 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>
> /*
> * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
> - * @kvm: kvm instance
> - * @iter: a tdp_iter instance currently on the SPTE that should be set
> - * @new_spte: The value the SPTE should be set to
> + * @kvm: KVM instance
> + * @as_id: Address space ID, i.e. regular vs. SMM
> + * @sptep: Pointer to the SPTE
> + * @old_spte: The current value of the SPTE
> + * @new_spte: The new value that will be set for the SPTE
> + * @gfn: The base GFN that was (or will be) mapped by the SPTE
> + * @level: The level _containing_ the SPTE (its parent PT's level)
> * @record_acc_track: Notify the MM subsystem of changes to the accessed state
> * of the page. Should be set unless handling an MMU
> * notifier for access tracking. Leaving record_acc_track
> @@ -636,9 +640,9 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * Leaving record_dirty_log unset in that case prevents page
> * writes from being double counted.
> */
> -static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> - u64 new_spte, bool record_acc_track,
> - bool record_dirty_log)
> +static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> + u64 old_spte, u64 new_spte, gfn_t gfn, int level,
> + bool record_acc_track, bool record_dirty_log)
> {
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> @@ -649,39 +653,46 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> * should be used. If operating under the MMU lock in write mode, the
> * use of the removed SPTE should not be necessary.
> */
> - WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
> + WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));
>
> - kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> + kvm_tdp_mmu_write_spte(sptep, new_spte);
> +
> + __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>
> - __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, false);
> if (record_acc_track)
> - handle_changed_spte_acc_track(iter->old_spte, new_spte,
> - iter->level);
> + handle_changed_spte_acc_track(old_spte, new_spte, level);
> if (record_dirty_log)
> - handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
> - iter->old_spte, new_spte,
> - iter->level);
> + handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> + new_spte, level);
> +}
> +
> +static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> + u64 new_spte, bool record_acc_track,
> + bool record_dirty_log)
> +{
> + __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte,
> + new_spte, iter->gfn, iter->level,
> + record_acc_track, record_dirty_log);
> }
>
> static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> u64 new_spte)
> {
> - __tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
> + _tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
> }
>
> static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> - __tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
> + _tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
> }
>
> static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> - __tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
> + _tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
> }
>
> #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 22:40:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 10/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > Allow yielding when zapping SPTEs for a defunct TDP MMU root. Yielding
> > is safe from a TDP perspective, as the root is unreachable. The only
> > potential danger is putting a root from a non-preemptible context, and
> > KVM currently does not do so.
> >
> > Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
> > take a reference to each root (it requires mmu_lock be held for the
> > entire duration of the walk).
> >
> > tdp_mmu_next_root() is used only by the yield-friendly iterator.
> >
> > kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
> >
> > kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
> > but is still yield-friendly in all call sites, as all callers can be
> > traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
> > kvm_create_vm().
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> Reviewed-by: Ben Gardon <[email protected]>
>
> I'm glad to see this fixed. I assume we don't usually hit this in
> testing because most of the teardown happens in the zap-all path when
> we unregister for MMU notifiers

Or more likely, when the userspace process exits and kvm_mmu_notifier_ops.release
is invoked. But yeah, same difference, VM teardown is unlikely to trigger zapping
by putting the last TDP MMU reference.

> and actually deleting a fully populated root while the VM is running is pretty
> rare.

Hmm, probably not that rare, e.g. guest reboot (emulated RESET) is all but
guaranteed to trigger kvm_mmu_reset_context() on all vCPUs and thus drop all roots,
and QEMU at least doesn't (always) do memslot updates as part of reboot.

2021-11-22 22:43:39

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 19/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> When recovering a potential hugepage that was shattered for the iTLB
> multihit workaround, precisely zap only the target page instead of
> iterating over the TDP MMU to find the SP that was passed in. This will
> allow future simplification of zap_gfn_range() by having it zap only
> leaf SPTEs.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 7 ++++++-
> arch/x86/kvm/mmu/tdp_iter.h | 2 --
> arch/x86/kvm/mmu/tdp_mmu.c | 28 ++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.h | 18 +-----------------
> 4 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 52c6527b1a06..8ede43a826af 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -30,6 +30,8 @@ extern bool dbg;
> #define INVALID_PAE_ROOT 0
> #define IS_VALID_PAE_ROOT(x) (!!(x))
>
> +typedef u64 __rcu *tdp_ptep_t;
> +
> struct kvm_mmu_page {
> /*
> * Note, "link" through "spt" fit in a single 64 byte cache line on
> @@ -59,7 +61,10 @@ struct kvm_mmu_page {
> refcount_t tdp_mmu_root_count;
> };
> unsigned int unsync_children;
> - struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> + union {
> + struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> + tdp_ptep_t ptep;
> + };
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> struct list_head lpage_disallowed_link;
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 9c04d8677cb3..0693f1fdb81e 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -7,8 +7,6 @@
>
> #include "mmu.h"
>
> -typedef u64 __rcu *tdp_ptep_t;
> -
> /*
> * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
> * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7d354344924d..ea6651e735c2 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -318,12 +318,16 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> *
> * @kvm: kvm instance
> * @sp: the new page
> + * @sptep: pointer to the new page's SPTE (in its parent)
> * @account_nx: This page replaces a NX large page and should be marked for
> * eventual reclaim.
> */
> static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> - bool account_nx)
> + tdp_ptep_t sptep, bool account_nx)
> {
> + WARN_ON_ONCE(sp->ptep);
> + sp->ptep = sptep;
> +
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> if (account_nx)
> @@ -755,6 +759,26 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> return false;
> }
>
> +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + u64 old_spte;
> +
> + rcu_read_lock();
> +
> + old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
> + rcu_read_unlock();
> + return false;
> + }
> +
> + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> + sp->gfn, sp->role.level + 1, true, true);
> +
> + rcu_read_unlock();
> +
> + return true;
> +}
> +

Ooooh this makes me really nervous. There are a lot of gotchas to
modifying SPTEs in a new context without traversing the paging
structure like this. For example, we could modify an SPTE under an
invalidated root here. I don't think that would be a problem since
we're just clearing it, but it makes the code more fragile. Another
approach to this would be to do in-place promotion / in-place
splitting once the patch sets David and I just sent out are merged.
That would avoid causing extra page faults here to bring in the page
after this zap, but it probably wouldn't be safe if we did it under an
invalidated root.
I'd rather avoid this extra complexity and just tolerate the worse
performance on the iTLB multi hit mitigation at this point since new
CPUs seem to be moving past that vulnerability.
If you think this is worth the complexity, it'd be nice to do a little
benchmarking to make sure it's giving us a substantial improvement.

> /*
> * Tears down the mappings for the range of gfns, [start, end), and frees the
> * non-root pages mapping GFNs strictly within that range. Returns true if
> @@ -1062,7 +1086,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> !shadow_accessed_mask);
>
> if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> - tdp_mmu_link_page(vcpu->kvm, sp,
> + tdp_mmu_link_page(vcpu->kvm, sp, iter.sptep,
> fault->huge_page_disallowed &&
> fault->req_level >= iter.level);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index ced6d8e47362..8ad1717f4a1d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -31,24 +31,8 @@ static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> {
> return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> }
> -static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> - gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level + 1);
> -
> - /*
> - * Don't allow yielding, as the caller may have a flush pending. Note,
> - * if mmu_lock is held for write, zapping will never yield in this case,
> - * but explicitly disallow it for safety. The TDP MMU does not yield
> - * until it has made forward progress (steps sideways), and when zapping
> - * a single shadow page that it's guaranteed to see (thus the mmu_lock
> - * requirement), its "step sideways" will always step beyond the bounds
> - * of the shadow page's gfn range and stop iterating before yielding.
> - */
> - lockdep_assert_held_write(&kvm->mmu_lock);
> - return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
> - sp->gfn, end, false, false);
> -}
>
> +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
> struct list_head *invalidated_roots);
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 22:49:40

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 20/28] KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Use the common TDP MMU zap helper when hanlding an MMU notifier unmap
> event, the two flows are semantically identical.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ea6651e735c2..9449cb5baf0b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1112,13 +1112,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> bool flush)
> {
> - struct kvm_mmu_page *root;
> -
> - for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> - flush = zap_gfn_range(kvm, root, range->start, range->end,
> - range->may_block, flush, false);
> -
> - return flush;
> + return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
> + range->end, range->may_block, flush);
> }
>
> typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 22:55:16

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 21/28] KVM: x86/mmu: Add TDP MMU helper to zap a root

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Add a small wrapper to handle zapping a specific root. For now, it's
> little more than syntactic sugar, but in the future it will become a
> unique flow with rules specific to zapping an unreachable root.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9449cb5baf0b..31fb622249e5 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -79,11 +79,18 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> tdp_mmu_free_sp(sp);
> }
>
> +static bool tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + bool shared)
> +{
> + return zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);

Total aside:
Remembering the order of these three boolean parameters through all
these functions drives me nuts.
It'd be really nice to put them into a neat, reusable struct that tracks:
MMU lock mode (read / write / none)
If yielding is okay
If the TLBs are dirty and need to be flushed

I don't know when I'll have time to do that refactor, but it would
make this code so much more sensible.

> +}
> +
> /*
> * Note, putting a root might sleep, i.e. the caller must have IRQs enabled and
> * must not explicitly disable preemption (it will be disabled by virtue of
> * holding mmu_lock, hence the lack of a might_sleep()).
> */
> +
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared)
> {
> @@ -118,7 +125,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> * should have been zapped by kvm_tdp_mmu_zap_invalidated_roots(), and
> * inserting new SPTEs under an invalid root is a KVM bug.
> */
> - if (zap_gfn_range(kvm, root, 0, -1ull, true, false, shared))
> + if (tdp_mmu_zap_root(kvm, root, shared))
> WARN_ON_ONCE(root->role.invalid);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> @@ -923,7 +930,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
> * will still flush on yield, but that's a minor performance
> * blip and not a functional issue.
> */
> - (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
> + (void)tdp_mmu_zap_root(kvm, root, true);
> kvm_tdp_mmu_put_root(kvm, root, true);
> }
> }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 23:00:33

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 22/28] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
> uses the slow version of "zap everything" is when the VM is being
> destroyed or the owning mm has exited. In either case, KVM_RUN is
> unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 31fb622249e5..e5401f0efe8e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -888,14 +888,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>
> void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> {
> - bool flush = false;
> int i;
>
> + /*
> + * A TLB flush is unnecessary, KVM's zap everything if and only the VM
> + * is being destroyed or the userspace VMM has exited. In both cases,
> + * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> + */

Nit: Suggest:
/*
* A TLB flush is unnecessary. KVM's zap_all is used if and
only if the VM
* is being destroyed or the userspace VMM has exited. In both cases,
* the vCPUs are not running and will never run again, so their
TLB state doesn't matter.
*/

> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> - flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
> -
> - if (flush)
> - kvm_flush_remote_tlbs(kvm);
> + (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
> }
>
> /*
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-22 23:04:05

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 10/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Mon, Nov 22, 2021 at 2:40 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Nov 22, 2021, Ben Gardon wrote:
> > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > Allow yielding when zapping SPTEs for a defunct TDP MMU root. Yielding
> > > is safe from a TDP perspective, as the root is unreachable. The only
> > > potential danger is putting a root from a non-preemptible context, and
> > > KVM currently does not do so.
> > >
> > > Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
> > > take a reference to each root (it requires mmu_lock be held for the
> > > entire duration of the walk).
> > >
> > > tdp_mmu_next_root() is used only by the yield-friendly iterator.
> > >
> > > kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
> > >
> > > kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
> > > but is still yield-friendly in all call sites, as all callers can be
> > > traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
> > > kvm_create_vm().
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> >
> > Reviewed-by: Ben Gardon <[email protected]>
> >
> > I'm glad to see this fixed. I assume we don't usually hit this in
> > testing because most of the teardown happens in the zap-all path when
> > we unregister for MMU notifiers
>
> Or more likely, when the userspace process exits and kvm_mmu_notifier_ops.release
> is invoked. But yeah, same difference, VM teardown is unlikely to trigger zapping
> by putting the last TDP MMU reference.
>
> > and actually deleting a fully populated root while the VM is running is pretty
> > rare.
>
> Hmm, probably not that rare, e.g. guest reboot (emulated RESET) is all but
> guaranteed to trigger kvm_mmu_reset_context() on all vCPUs and thus drop all roots,
> and QEMU at least doesn't (always) do memslot updates as part of reboot.

I don't think we have a selftest or kvm-unit-test that builds a large
EPT structure and then reboots the guest though. That'd be a cool test
to have.

2021-11-22 23:08:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > Take TDP MMU roots off the list of roots when they're invalidated instead
> > of walking later on to find the roots that were just invalidated. In
> > addition to making the flow more straightforward, this allows warning
> > if something attempts to elevate the refcount of an invalid root, which
> > should be unreachable (no longer on the list so can't be reached by MMU
> > notifier, and vCPUs must reload a new root before installing new SPTE).
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> There are a bunch of awesome little cleanups and unrelated fixes
> included in this commit that could be factored out.
>
> I'm skeptical of immediately moving the invalidated roots into another
> list as that seems like it has a lot of potential for introducing
> weird races.

I disagree, the entire premise of fast invalidate is that there can't be races,
i.e. mmu_lock must be held for write. IMO, it's actually the opposite, as the only
reason leaving roots on the per-VM list doesn't have weird races is because slots_lock
is held. If slots_lock weren't required to do a fast zap, which is feasible for the
TDP MMU since it doesn't rely on the memslots generation, then it would be possible
for multiple calls to kvm_tdp_mmu_zap_invalidated_roots() to run in parallel. And in
that case, leaving roots on the per-VM list would lead to a single instance of a
"fast zap" zapping roots it didn't invalidate. That's wouldn't be problematic per se,
but I don't like not having a clear "owner" of the invalidated root.

> I'm not sure it actually solves a problem either. Part of
> the motive from the commit description "this allows warning if
> something attempts to elevate the refcount of an invalid root" can be
> achieved already without moving the roots into a separate list.

Hmm, true in the sense that kvm_tdp_mmu_get_root() could be converted to a WARN,
but that would require tdp_mmu_next_root() to manually skip invalid roots.
kvm_tdp_mmu_get_vcpu_root_hpa() is naturally safe because page_role_for_level()
will never set the invalid flag.

I don't care too much about adding a manual check in tdp_mmu_next_root(), what I don't
like is that a WARN in kvm_tdp_mmu_get_root() wouldn't be establishing an invariant
that invalidated roots are unreachable, it would simply be forcing callers to check
role.invalid.

> Maybe this would seem more straightforward with some of the little
> cleanups factored out, but this feels more complicated to me.
> > @@ -124,6 +137,27 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > {
> > struct kvm_mmu_page *next_root;
> >
> > + lockdep_assert_held(&kvm->mmu_lock);
> > +
> > + /*
> > + * Restart the walk if the previous root was invalidated, which can
> > + * happen if the caller drops mmu_lock when yielding. Restarting the
> > + * walke is necessary because invalidating a root also removes it from
>
> Nit: *walk
>
> > + * tdp_mmu_roots. Restarting is safe and correct because invalidating
> > + * a root is done if and only if _all_ roots are invalidated, i.e. any
> > + * root on tdp_mmu_roots was added _after_ the invalidation event.
> > + */
> > + if (prev_root && prev_root->role.invalid) {
> > + kvm_tdp_mmu_put_root(kvm, prev_root, shared);
> > + prev_root = NULL;
> > + }
> > +
> > + /*
> > + * Finding the next root must be done under RCU read lock. Although
> > + * @prev_root itself cannot be removed from tdp_mmu_roots because this
> > + * task holds a reference, its next and prev pointers can be modified
> > + * when freeing a different root. Ditto for tdp_mmu_roots itself.
> > + */
>
> I'm not sure this is correct with the rest of the changes in this
> patch. The new version of invalidate_roots removes roots from the list
> immediately, even if they have a non-zero ref-count.

Roots don't have to be invalidated to be removed, e.g. if the last reference is
put due to kvm_mmu_reset_context(). Or did I misunderstand?

> > rcu_read_lock();
> >
> > if (prev_root)
> > @@ -230,10 +264,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> > refcount_set(&root->tdp_mmu_root_count, 1);
> >
> > - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > - list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> > - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > -
> > + /*
> > + * Because mmu_lock must be held for write to ensure that KVM doesn't
> > + * create multiple roots for a given role, this does not need to use
> > + * an RCU-friendly variant as readers of tdp_mmu_roots must also hold
> > + * mmu_lock in some capacity.
> > + */
>
> I doubt we're doing it now, but in principle we could allocate new
> roots with mmu_lock in read + tdp_mmu_pages_lock. That might be better
> than depending on the write lock.

We're not, this function does lockdep_assert_held_write(&kvm->mmu_lock) a few
lines above. I don't have a preference between using mmu_lock.read+tdp_mmu_pages_lock
versus mmu_lock.write, but I do care that the current code doesn't incorrectly imply
that it's possible for something else to be walking the roots while this runs.

Either way, this should definitely be a separate patch, pretty sure I just lost
track of it.

> > + list_add(&root->link, &kvm->arch.tdp_mmu_roots);
> > out:
> > return __pa(root->spt);
> > }

2021-11-22 23:15:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 21/28] KVM: x86/mmu: Add TDP MMU helper to zap a root

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > Add a small wrapper to handle zapping a specific root. For now, it's
> > little more than syntactic sugar, but in the future it will become a
> > unique flow with rules specific to zapping an unreachable root.
> >
> > No functional change intended.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++--
> > 1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9449cb5baf0b..31fb622249e5 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -79,11 +79,18 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> > tdp_mmu_free_sp(sp);
> > }
> >
> > +static bool tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > + bool shared)
> > +{
> > + return zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
>
> Total aside:
> Remembering the order of these three boolean parameters through all
> these functions drives me nuts.
> It'd be really nice to put them into a neat, reusable struct that tracks:
> MMU lock mode (read / write / none)
> If yielding is okay
> If the TLBs are dirty and need to be flushed
>
> I don't know when I'll have time to do that refactor, but it would
> make this code so much more sensible.

Heh, I did exactly that, then threw away the code when I realized that I could
break up zap_gfn_range() into three separate helpers and avoid control knob hell
(spoiler alert for later patches in this series).

There are still two booleans (to what ends up being tdp_mmu_zap_leafs()), but none
none of the call sites pass true/false for _both_ params, so the call sites end up
being quite readable. At that point, using a struct ended up being a net negative,
e.g. kvm_tdp_mmu_unmap_gfn_range() had to marshall from one struct to another.

2021-11-22 23:38:26

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 21/28] KVM: x86/mmu: Add TDP MMU helper to zap a root

On Mon, Nov 22, 2021 at 3:15 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Nov 22, 2021, Ben Gardon wrote:
> > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > Add a small wrapper to handle zapping a specific root. For now, it's
> > > little more than syntactic sugar, but in the future it will become a
> > > unique flow with rules specific to zapping an unreachable root.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++--
> > > 1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 9449cb5baf0b..31fb622249e5 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -79,11 +79,18 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> > > tdp_mmu_free_sp(sp);
> > > }
> > >
> > > +static bool tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > + bool shared)
> > > +{
> > > + return zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
> >
> > Total aside:
> > Remembering the order of these three boolean parameters through all
> > these functions drives me nuts.
> > It'd be really nice to put them into a neat, reusable struct that tracks:
> > MMU lock mode (read / write / none)
> > If yielding is okay
> > If the TLBs are dirty and need to be flushed
> >
> > I don't know when I'll have time to do that refactor, but it would
> > make this code so much more sensible.
>
> Heh, I did exactly that, then threw away the code when I realized that I could
> break up zap_gfn_range() into three separate helpers and avoid control knob hell
> (spoiler alert for later patches in this series).
>
> There are still two booleans (to what ends up being tdp_mmu_zap_leafs()), but none
> none of the call sites pass true/false for _both_ params, so the call sites end up
> being quite readable. At that point, using a struct ended up being a net negative,
> e.g. kvm_tdp_mmu_unmap_gfn_range() had to marshall from one struct to another.

Awesome! Disregard then! I'll review the remaining few tomorrow.

2021-11-23 00:03:25

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots

On Mon, Nov 22, 2021 at 3:08 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Nov 22, 2021, Ben Gardon wrote:
> > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > Take TDP MMU roots off the list of roots when they're invalidated instead
> > > of walking later on to find the roots that were just invalidated. In
> > > addition to making the flow more straightforward, this allows warning
> > > if something attempts to elevate the refcount of an invalid root, which
> > > should be unreachable (no longer on the list so can't be reached by MMU
> > > notifier, and vCPUs must reload a new root before installing new SPTE).
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> >
> > There are a bunch of awesome little cleanups and unrelated fixes
> > included in this commit that could be factored out.
> >
> > I'm skeptical of immediately moving the invalidated roots into another
> > list as that seems like it has a lot of potential for introducing
> > weird races.
>
> I disagree, the entire premise of fast invalidate is that there can't be races,
> i.e. mmu_lock must be held for write. IMO, it's actually the opposite, as the only
> reason leaving roots on the per-VM list doesn't have weird races is because slots_lock
> is held. If slots_lock weren't required to do a fast zap, which is feasible for the
> TDP MMU since it doesn't rely on the memslots generation, then it would be possible
> for multiple calls to kvm_tdp_mmu_zap_invalidated_roots() to run in parallel. And in
> that case, leaving roots on the per-VM list would lead to a single instance of a
> "fast zap" zapping roots it didn't invalidate. That's wouldn't be problematic per se,
> but I don't like not having a clear "owner" of the invalidated root.

That's a good point, the potential interleaving of zap_alls would be gross.

My mental model for the invariant here was "roots that are still in
use are on the roots list," but I can see how "the roots list contains
all valid, in-use roots" could be a more useful invariant.

>
> > I'm not sure it actually solves a problem either. Part of
> > the motive from the commit description "this allows warning if
> > something attempts to elevate the refcount of an invalid root" can be
> > achieved already without moving the roots into a separate list.
>
> Hmm, true in the sense that kvm_tdp_mmu_get_root() could be converted to a WARN,
> but that would require tdp_mmu_next_root() to manually skip invalid roots.
> kvm_tdp_mmu_get_vcpu_root_hpa() is naturally safe because page_role_for_level()
> will never set the invalid flag.
>
> I don't care too much about adding a manual check in tdp_mmu_next_root(), what I don't
> like is that a WARN in kvm_tdp_mmu_get_root() wouldn't be establishing an invariant
> that invalidated roots are unreachable, it would simply be forcing callers to check
> role.invalid.

That makes sense.

>
> > Maybe this would seem more straightforward with some of the little
> > cleanups factored out, but this feels more complicated to me.
> > > @@ -124,6 +137,27 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > > {
> > > struct kvm_mmu_page *next_root;
> > >
> > > + lockdep_assert_held(&kvm->mmu_lock);
> > > +
> > > + /*
> > > + * Restart the walk if the previous root was invalidated, which can
> > > + * happen if the caller drops mmu_lock when yielding. Restarting the
> > > + * walke is necessary because invalidating a root also removes it from
> >
> > Nit: *walk
> >
> > > + * tdp_mmu_roots. Restarting is safe and correct because invalidating
> > > + * a root is done if and only if _all_ roots are invalidated, i.e. any
> > > + * root on tdp_mmu_roots was added _after_ the invalidation event.
> > > + */
> > > + if (prev_root && prev_root->role.invalid) {
> > > + kvm_tdp_mmu_put_root(kvm, prev_root, shared);
> > > + prev_root = NULL;
> > > + }
> > > +
> > > + /*
> > > + * Finding the next root must be done under RCU read lock. Although
> > > + * @prev_root itself cannot be removed from tdp_mmu_roots because this
> > > + * task holds a reference, its next and prev pointers can be modified
> > > + * when freeing a different root. Ditto for tdp_mmu_roots itself.
> > > + */
> >
> > I'm not sure this is correct with the rest of the changes in this
> > patch. The new version of invalidate_roots removes roots from the list
> > immediately, even if they have a non-zero ref-count.
>
> Roots don't have to be invalidated to be removed, e.g. if the last reference is
> put due to kvm_mmu_reset_context(). Or did I misunderstand?

Ah, sorry that was kind of a pedantic comment. I think the comment
should say "@prev_root itself cannot be removed from tdp_mmu_roots
because this task holds the MMU lock (in read or write mode)"
With the other changes in the patch, holding a reference on a root
provides protection against it being freed, but not against it being
removed from the roots list.

>
> > > rcu_read_lock();
> > >
> > > if (prev_root)
> > > @@ -230,10 +264,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > > root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> > > refcount_set(&root->tdp_mmu_root_count, 1);
> > >
> > > - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > > - list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> > > - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > -
> > > + /*
> > > + * Because mmu_lock must be held for write to ensure that KVM doesn't
> > > + * create multiple roots for a given role, this does not need to use
> > > + * an RCU-friendly variant as readers of tdp_mmu_roots must also hold
> > > + * mmu_lock in some capacity.
> > > + */
> >
> > I doubt we're doing it now, but in principle we could allocate new
> > roots with mmu_lock in read + tdp_mmu_pages_lock. That might be better
> > than depending on the write lock.
>
> We're not, this function does lockdep_assert_held_write(&kvm->mmu_lock) a few
> lines above. I don't have a preference between using mmu_lock.read+tdp_mmu_pages_lock
> versus mmu_lock.write, but I do care that the current code doesn't incorrectly imply
> that it's possible for something else to be walking the roots while this runs.

That makes sense. It seems like it'd be pretty easy to let this run
under the read lock, but I'd be fine with either way. I agree both
options are an improvement.

>
> Either way, this should definitely be a separate patch, pretty sure I just lost
> track of it.
>
> > > + list_add(&root->link, &kvm->arch.tdp_mmu_roots);
> > > out:
> > > return __pa(root->spt);
> > > }

2021-11-23 01:04:28

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 24/28] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Convert tdp_mmu_zap_root() into its own dedicated flow instead of simply
> redirecting into zap_gfn_range(). In addition to hardening zapping of
> roots, this will allow future simplification of zap_gfn_range() by having
> it zap only leaf SPTEs, and by removing its tricky "zap all" heuristic.
> By having all paths that truly need to free _all_ SPs flow through the
> dedicated root zapper, the generic zapper can be freed of those concerns.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 91 +++++++++++++++++++++++++++-----------
> 1 file changed, 66 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 99ea19e763da..0e5a0d40e54a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,10 +53,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> rcu_barrier();
> }
>
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end, bool can_yield, bool flush,
> - bool shared);
> -
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> free_page((unsigned long)sp->spt);
> @@ -79,11 +75,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> tdp_mmu_free_sp(sp);
> }
>
> -static bool tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> - bool shared)
> -{
> - return zap_gfn_range(kvm, root, 0, -1ull, true, false, shared);
> -}
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + bool shared, bool root_is_unreachable);
>
> /*
> * Note, putting a root might sleep, i.e. the caller must have IRQs enabled and
> @@ -120,13 +113,8 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> * invalidates any paging-structure-cache entries, i.e. TLB entries for
> * intermediate paging structures, that may be zapped, as such entries
> * are associated with the ASID on both VMX and SVM.
> - *
> - * WARN if a flush is reported for an invalid root, as its child SPTEs
> - * should have been zapped by kvm_tdp_mmu_zap_invalidated_roots(), and
> - * inserting new SPTEs under an invalid root is a KVM bug.
> */
> - if (tdp_mmu_zap_root(kvm, root, shared))
> - WARN_ON_ONCE(root->role.invalid);
> + tdp_mmu_zap_root(kvm, root, shared, true);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
> @@ -766,6 +754,65 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> return false;
> }
>
> +static inline gfn_t tdp_mmu_max_gfn_host(void)
> +{
> + /*
> + * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
> + * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
> + * and so KVM will never install a SPTE for such addresses.
> + */
> + return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> +}
> +
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + bool shared, bool root_is_unreachable)
> +{
> + struct tdp_iter iter;
> +
> + gfn_t end = tdp_mmu_max_gfn_host();
> + gfn_t start = 0;
> +
> + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> + rcu_read_lock();
> +
> + /*
> + * No need to try to step down in the iterator when zapping an entire
> + * root, zapping an upper-level SPTE will recurse on its children.
> + */
> + for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> + root->role.level, start, end) {
> +retry:
> + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> + continue;
> +
> + if (!is_shadow_present_pte(iter.old_spte))
> + continue;
> +
> + if (!shared) {
> + tdp_mmu_set_spte(kvm, &iter, 0);
> + } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {

Worth adding a comment about why this is used instead of
tdp_mmu_zap_spte_atomic.

> + /*
> + * cmpxchg() shouldn't fail if the root is unreachable.
> + * to be unreachable. Re-read the SPTE and retry so as

Repeated phrase.


> + * not to leak the page and its children.
> + */
> + WARN_ONCE(root_is_unreachable,
> + "Contended TDP MMU SPTE in unreachable root.");
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);

Note this will conflict with the series David sent out Friday.
Hopefully some of the cleanups early in that series get merged, in
which case this line will not be needed.

> + goto retry;
> + }
> + /*
> + * WARN if the root is invalid and is unreachable, all SPTEs
> + * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
> + * and inserting new SPTEs under an invalid root is a KVM bug.
> + */
> + WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
> + }
> +
> + rcu_read_unlock();
> +}
> +
> bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> u64 old_spte;
> @@ -807,8 +854,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> gfn_t start, gfn_t end, bool can_yield, bool flush,
> bool shared)
> {
> - gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> - bool zap_all = (start == 0 && end >= max_gfn_host);
> + bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
> struct tdp_iter iter;
>
> /*
> @@ -817,12 +863,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> */
> int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
>
> - /*
> - * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> - * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> - * and so KVM will never install a SPTE for such addresses.
> - */
> - end = min(end, max_gfn_host);
> + end = min(end, tdp_mmu_max_gfn_host());

tdp_mmu_max_gfn_host and this refactor, could be added in a separate
commit if desired.

>
> kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>
> @@ -898,7 +939,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> */
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
> - (void)tdp_mmu_zap_root(kvm, root, false);
> + tdp_mmu_zap_root(kvm, root, false, true);
> }
> }
>
> @@ -934,7 +975,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm,
> * will still flush on yield, but that's a minor performance
> * blip and not a functional issue.
> */
> - (void)tdp_mmu_zap_root(kvm, root, true);
> + tdp_mmu_zap_root(kvm, root, true, false);
> kvm_tdp_mmu_put_root(kvm, root, true);
> }
> }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 01:16:12

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 19/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > @@ -755,6 +759,26 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> > return false;
> > }
> >
> > +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +{
> > + u64 old_spte;
> > +
> > + rcu_read_lock();
> > +
> > + old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> > + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
> > + rcu_read_unlock();
> > + return false;
> > + }
> > +
> > + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> > + sp->gfn, sp->role.level + 1, true, true);
> > +
> > + rcu_read_unlock();
> > +
> > + return true;
> > +}
> > +
>
> Ooooh this makes me really nervous. There are a lot of gotchas to
> modifying SPTEs in a new context without traversing the paging
> structure like this. For example, we could modify an SPTE under an
> invalidated root here. I don't think that would be a problem since
> we're just clearing it, but it makes the code more fragile.

Heh, it better not be a problem, because there are plently of flows in the TDP MMU
that can modify SPTEs under an invalidated root, e.g. fast_page_fault(),
tdp_mmu_zap_leafs(), kvm_age_gfn(), kvm_test_age_gfn(), etc... And before the
patch that introduced is_page_fault_stale(), kvm_tdp_mmu_map() was even installing
SPTEs into an invalid root! Anything that takes a reference to a root and yields
(or never takes mmu_lock) can potentially modify a SPTE under an invalid root.

Checking the paging structures for this flow wouldn't change anything. Invalidating
a root doesn't immediately zap SPTEs, it just marks the root invalid. The other
subtle gotcha is that kvm_reload_remote_mmus() doesn't actually gaurantee all vCPUs
will have dropped the invalid root or performed a TLB flush when mmu_lock is dropped,
those guarantees are only with respect to re-entering the guest!

All of the above is no small part of why I don't want to walk the page tables:
it's completely misleading as walking the page tables doesn't actually provide any
protection, it's holding RCU that guarantees KVM doesn't write memory it doesn't own.

> Another approach to this would be to do in-place promotion / in-place
> splitting once the patch sets David and I just sent out are merged. That
> would avoid causing extra page faults here to bring in the page after this
> zap, but it probably wouldn't be safe if we did it under an invalidated root.

I agree that in-place promotion would be better, but if we do that, I think a logical
intermediate step would be to stop zapping unrelated roots and entries. If there's
a bug that is exposed/introduced by not zapping other stuff, I would much rather it
show up when KVM stops zapping other stuff, not when KVM stops zapping other stuff
_and_ promotes in place. Ditto for if in-place promotion introduces a bug.

> I'd rather avoid this extra complexity and just tolerate the worse
> performance on the iTLB multi hit mitigation at this point since new
> CPUs seem to be moving past that vulnerability.

IMO, it reduces complexity, especially when looking at the series as a whole, which
I fully realize you haven't yet done :-) Setting aside the complexities of each
chunk of code, what I find complex with the current TDP MMU zapping code is that
there are no precise rules for what needs to be done in each situation. I'm not
criticizing how we got to this point, I absolutely think that hitting everything
with a big hammer to get the initial version stable was the right thing to do.

But a side effect of the big hammer approach is that it makes reasoning about things
more difficult, e.g. "when is it safe to modify a SPTE versus when is it safe to insert
a SPTE into the paging structures?" or "what needs to be zapped when the mmu_notifier
unmaps a range?".

And I also really want to avoid another snafu like the memslots with passthrough
GPUs bug, where using a big hammer (zap all) when a smaller hammer (zap SPTEs for
the memslot) _should_ work allows bugs to creep in unnoticed because they're hidden
by overzealous zapping.

> If you think this is worth the complexity, it'd be nice to do a little
> benchmarking to make sure it's giving us a substantial improvement.

Performance really isn't a motivating factor. Per the changelog, the motivation
is mostly to allow later patches to simplify zap_gfn_range() by having it zap only
leaf SPTEs, and now that I've typed it up, also all of the above :-)

2021-11-23 19:35:42

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 19/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

On Mon, Nov 22, 2021 at 5:16 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Nov 22, 2021, Ben Gardon wrote:
> > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > > @@ -755,6 +759,26 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> > > return false;
> > > }
> > >
> > > +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > +{
> > > + u64 old_spte;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> > > + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
> > > + rcu_read_unlock();
> > > + return false;
> > > + }
> > > +
> > > + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> > > + sp->gfn, sp->role.level + 1, true, true);
> > > +
> > > + rcu_read_unlock();
> > > +
> > > + return true;
> > > +}
> > > +
> >
> > Ooooh this makes me really nervous. There are a lot of gotchas to
> > modifying SPTEs in a new context without traversing the paging
> > structure like this. For example, we could modify an SPTE under an
> > invalidated root here. I don't think that would be a problem since
> > we're just clearing it, but it makes the code more fragile.
>
> Heh, it better not be a problem, because there are plently of flows in the TDP MMU
> that can modify SPTEs under an invalidated root, e.g. fast_page_fault(),
> tdp_mmu_zap_leafs(), kvm_age_gfn(), kvm_test_age_gfn(), etc... And before the
> patch that introduced is_page_fault_stale(), kvm_tdp_mmu_map() was even installing
> SPTEs into an invalid root! Anything that takes a reference to a root and yields
> (or never takes mmu_lock) can potentially modify a SPTE under an invalid root.

That's true, I don't think there's really a problem with this commit,
just a different way of dealing with the PTs.


>
> Checking the paging structures for this flow wouldn't change anything. Invalidating
> a root doesn't immediately zap SPTEs, it just marks the root invalid. The other
> subtle gotcha is that kvm_reload_remote_mmus() doesn't actually gaurantee all vCPUs
> will have dropped the invalid root or performed a TLB flush when mmu_lock is dropped,
> those guarantees are only with respect to re-entering the guest!
>
> All of the above is no small part of why I don't want to walk the page tables:
> it's completely misleading as walking the page tables doesn't actually provide any
> protection, it's holding RCU that guarantees KVM doesn't write memory it doesn't own.

That's a great point. I was thinking about the RCU protection being
sort of passed down through the RCU dereferences from the root of the
paging structure to whatever SPTE we modify, but since we protect the
SPs with RCU too, dereferencing from them is just as good, I suppose.

>
> > Another approach to this would be to do in-place promotion / in-place
> > splitting once the patch sets David and I just sent out are merged. That
> > would avoid causing extra page faults here to bring in the page after this
> > zap, but it probably wouldn't be safe if we did it under an invalidated root.
>
> I agree that in-place promotion would be better, but if we do that, I think a logical
> intermediate step would be to stop zapping unrelated roots and entries. If there's
> a bug that is exposed/introduced by not zapping other stuff, I would much rather it
> show up when KVM stops zapping other stuff, not when KVM stops zapping other stuff
> _and_ promotes in place. Ditto for if in-place promotion introduces a bug.

That makes sense. I think this is a good first step.

>
> > I'd rather avoid this extra complexity and just tolerate the worse
> > performance on the iTLB multi hit mitigation at this point since new
> > CPUs seem to be moving past that vulnerability.
>
> IMO, it reduces complexity, especially when looking at the series as a whole, which
> I fully realize you haven't yet done :-) Setting aside the complexities of each
> chunk of code, what I find complex with the current TDP MMU zapping code is that
> there are no precise rules for what needs to be done in each situation. I'm not
> criticizing how we got to this point, I absolutely think that hitting everything
> with a big hammer to get the initial version stable was the right thing to do.
>
> But a side effect of the big hammer approach is that it makes reasoning about things
> more difficult, e.g. "when is it safe to modify a SPTE versus when is it safe to insert
> a SPTE into the paging structures?" or "what needs to be zapped when the mmu_notifier
> unmaps a range?".
>
> And I also really want to avoid another snafu like the memslots with passthrough
> GPUs bug, where using a big hammer (zap all) when a smaller hammer (zap SPTEs for
> the memslot) _should_ work allows bugs to creep in unnoticed because they're hidden
> by overzealous zapping.
>
> > If you think this is worth the complexity, it'd be nice to do a little
> > benchmarking to make sure it's giving us a substantial improvement.
>
> Performance really isn't a motivating factor. Per the changelog, the motivation
> is mostly to allow later patches to simplify zap_gfn_range() by having it zap only
> leaf SPTEs, and now that I've typed it up, also all of the above :-)

That makes sense, and addresses all my concerns. Carry on.
Thanks for writing all that up.

2021-11-23 19:58:34

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 27/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> When yielding in the TDP MMU iterator, service any pending TLB flush
> before dropping RCU protections in anticipation of using the callers RCU
> "lock" as a proxy for vCPUs in the guest.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 79a52717916c..55c16680b927 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -732,11 +732,11 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> return false;
>
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> - rcu_read_unlock();
> -
> if (flush)
> kvm_flush_remote_tlbs(kvm);
>
> + rcu_read_unlock();
> +

Just to check my understanding:
Theoretically PT memory could be freed as soon as we release the RCU
lock, if this is the only thread in a read critical section. In order
to ensure that we can use RCU as a proxy for TLB flushes we need to
flush the TLBs while still holding the RCU read lock. Without this
change (and with the next one) we could wind up in a situation where
we drop the RCU read lock, then the RCU callback runs and frees the
memory, and then the guest does a lookup through the paging structure
caches and we get a use-after-free bug. By flushing in an RCU critical
section, we ensure that the TLBs will have been flushed by the time
the RCU callback runs to free the memory. Clever!

> if (shared)
> cond_resched_rwlock_read(&kvm->mmu_lock);
> else
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 19:58:41

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 26/28] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
> functions according. When removing mappings for functional correctness
> (except for the stupid VFIO GPU passthrough memslots bug), zapping the
> leaf SPTEs is sufficient as the paging structures themselves do not point
> at guest memory and do not directly impact the final translation (in the
> TDP MMU).
>
> Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
> the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
> kvm_unmap_gfn_range().
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>


> ---
> arch/x86/kvm/mmu/mmu.c | 4 ++--
> arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
> arch/x86/kvm/mmu/tdp_mmu.h | 8 +-------
> 3 files changed, 14 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e3cd330c9532..ef689b8bab12 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5796,8 +5796,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> if (is_tdp_mmu_enabled(kvm)) {
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> - flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> - gfn_end, flush);
> + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> + gfn_end, true, flush);
> }
>
> if (flush)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 926e92473e92..79a52717916c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> }
>
> /*
> - * Tears down the mappings for the range of gfns, [start, end), and frees the
> - * non-root pages mapping GFNs strictly within that range. Returns true if
> - * SPTEs have been cleared and a TLB flush is needed before releasing the
> - * MMU lock.
> + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> + * have been cleared and a TLB flush is needed before releasing the MMU lock.
> *
> * If can_yield is true, will release the MMU lock and reschedule if the
> * scheduler needs the CPU or there is contention on the MMU lock. If this
> @@ -845,18 +843,11 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> * the caller must ensure it does not supply too large a GFN range, or the
> * operation can cause a soft lockup.
> */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end, bool can_yield, bool flush)
> +static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, bool can_yield, bool flush)
> {
> - bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
> struct tdp_iter iter;
>
> - /*
> - * No need to try to step down in the iterator when zapping all SPTEs,
> - * zapping the top-level non-leaf SPTEs will recurse on their children.
> - */
> - int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> -
> end = min(end, tdp_mmu_max_gfn_host());
>
> lockdep_assert_held_write(&kvm->mmu_lock);
> @@ -864,24 +855,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> rcu_read_lock();
>
> for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> - min_level, start, end) {
> + PG_LEVEL_4K, start, end) {
> if (can_yield &&
> tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> flush = false;
> continue;
> }
>
> - if (!is_shadow_present_pte(iter.old_spte))
> - continue;
> -
> - /*
> - * If this is a non-last-level SPTE that covers a larger range
> - * than should be zapped, continue, and zap the mappings at a
> - * lower level, except when zapping all SPTEs.
> - */
> - if (!zap_all &&
> - (iter.gfn < start ||
> - iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> + if (!is_shadow_present_pte(iter.old_spte) ||
> !is_last_spte(iter.old_spte, iter.level))
> continue;
>
> @@ -899,13 +880,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> * SPTEs have been cleared and a TLB flush is needed before releasing the
> * MMU lock.
> */
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> - gfn_t end, bool can_yield, bool flush)
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> + bool can_yield, bool flush)
> {
> struct kvm_mmu_page *root;
>
> for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
> - flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
>
> return flush;
> }
> @@ -1147,8 +1128,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> bool flush)
> {
> - return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
> - range->end, range->may_block, flush);
> + return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
> + range->end, range->may_block, flush);
> }
>
> typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 8ad1717f4a1d..6e7c32170608 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -24,14 +24,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared);
>
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
> gfn_t end, bool can_yield, bool flush);
> -static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> - gfn_t start, gfn_t end, bool flush)
> -{
> - return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> -}
> -
> bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
> void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm,
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 19:58:51

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 25/28] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
> support for zapping with mmu_lock held for read. That all callers hold
> mmu_lock for write isn't a random coincedence; now that the paths that
> need to zap _everything_ have their own path, the only callers left are
> those that need to zap for functional correctness. And when zapping is
> required for functional correctness, mmu_lock must be held for write,
> otherwise the caller has no guarantees about the state of the TDP MMU
> page tables after it has run, e.g. the SPTE(s) it zapped can be
> immediately replaced by a vCPU faulting in a page.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>


> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 29 ++++++-----------------------
> 1 file changed, 6 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 0e5a0d40e54a..926e92473e92 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -844,15 +844,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> * function cannot yield, it will not release the MMU lock or reschedule and
> * the caller must ensure it does not supply too large a GFN range, or the
> * operation can cause a soft lockup.
> - *
> - * If shared is true, this thread holds the MMU lock in read mode and must
> - * account for the possibility that other threads are modifying the paging
> - * structures concurrently. If shared is false, this thread should hold the
> - * MMU lock in write mode.
> */
> static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end, bool can_yield, bool flush,
> - bool shared)
> + gfn_t start, gfn_t end, bool can_yield, bool flush)
> {
> bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
> struct tdp_iter iter;
> @@ -865,15 +859,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
> end = min(end, tdp_mmu_max_gfn_host());
>
> - kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> + lockdep_assert_held_write(&kvm->mmu_lock);
>
> rcu_read_lock();
>
> for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> min_level, start, end) {
> -retry:
> if (can_yield &&
> - tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
> + tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> flush = false;
> continue;
> }
> @@ -892,17 +885,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> !is_last_spte(iter.old_spte, iter.level))
> continue;
>
> - if (!shared) {
> - tdp_mmu_set_spte(kvm, &iter, 0);
> - flush = true;
> - } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> - /*
> - * The iter must explicitly re-read the SPTE because
> - * the atomic cmpxchg failed.
> - */
> - iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> - goto retry;
> - }
> + tdp_mmu_set_spte(kvm, &iter, 0);
> + flush = true;
> }
>
> rcu_read_unlock();
> @@ -921,8 +905,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> struct kvm_mmu_page *root;
>
> for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
> - flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
> - false);
> + flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
>
> return flush;
> }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 20:12:33

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH 28/28] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
> of immediately flushing. Because the shadow pages are freed in an RCU
> callback, so long as at least one CPU holds RCU, all CPUs are protected.
> For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
> needs to ensure the caller services the pending TLB flush before dropping
> its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs
> running in the guest.
>
> Deferring the flushes allows batching flushes, e.g. when installing a
> 1gb hugepage and zapping a pile of SPs, and when zapping an entire root,
> allows skipping the flush entirely (becaues flushes are not needed in
> that case).
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++
> arch/x86/kvm/mmu/tdp_iter.h | 7 +++----
> arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++++------------
> 3 files changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ef689b8bab12..7aab9737dffa 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6237,6 +6237,12 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
> rcu_idx = srcu_read_lock(&kvm->srcu);
> write_lock(&kvm->mmu_lock);
>
> + /*
> + * Zapping TDP MMU shadow pages, including the remote TLB flush, must
> + * be done under RCU protection, the pages are freed via RCU callback.
> + */
> + rcu_read_lock();
> +
> ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
> for ( ; to_zap; --to_zap) {
> @@ -6261,12 +6267,18 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> + rcu_read_unlock();
> +
> cond_resched_rwlock_write(&kvm->mmu_lock);
> flush = false;
> +
> + rcu_read_lock();
> }
> }
> kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
>
> + rcu_read_unlock();
> +
> write_unlock(&kvm->mmu_lock);
> srcu_read_unlock(&kvm->srcu, rcu_idx);
> }
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 0693f1fdb81e..0299703fc844 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -9,10 +9,9 @@
>
> /*
> * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
> - * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
> - * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
> - * the lower* depths of the TDP MMU just to make lockdep happy is a nightmare,
> - * so all* accesses to SPTEs are must be done under RCU protection.
> + * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
> + * batched without having to collect the list of zapped SPs. Flows that can
> + * remove SPs must service pending TLB flushes prior to dropping RCU protection.
> */
> static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
> {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 55c16680b927..62cb357b1dff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -433,9 +433,6 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> shared);
> }
>
> - kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
> - KVM_PAGES_PER_HPAGE(level + 1));
> -
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
> @@ -815,21 +812,14 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>
> bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - u64 old_spte;
> + u64 old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
>
> - rcu_read_lock();
> -
> - old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> - if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
> - rcu_read_unlock();
> + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
> return false;
> - }
>
> __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> sp->gfn, sp->role.level + 1, true, true);
>
> - rcu_read_unlock();
> -
> return true;
> }
>
> @@ -871,6 +861,11 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> }
>
> rcu_read_unlock();
> +
> + /*
> + * Because this flows zaps _only_ leaf SPTEs, the caller doesn't need
> + * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
> + */
> return flush;
> }
>
> @@ -1011,6 +1006,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> ret = RET_PF_SPURIOUS;
> else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> return RET_PF_RETRY;
> + else if (is_shadow_present_pte(iter->old_spte) &&
> + !is_last_spte(iter->old_spte, iter->level))
> + kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
> + KVM_PAGES_PER_HPAGE(iter->level + 1));
>
> /*
> * If the page fault was caused by a write but the page is write
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-24 18:42:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 27/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

On Tue, Nov 23, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > When yielding in the TDP MMU iterator, service any pending TLB flush
> > before dropping RCU protections in anticipation of using the callers RCU
> > "lock" as a proxy for vCPUs in the guest.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> Reviewed-by: Ben Gardon <[email protected]>
>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 79a52717916c..55c16680b927 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -732,11 +732,11 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> > return false;
> >
> > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> > - rcu_read_unlock();
> > -
> > if (flush)
> > kvm_flush_remote_tlbs(kvm);
> >
> > + rcu_read_unlock();
> > +
>
> Just to check my understanding:
> Theoretically PT memory could be freed as soon as we release the RCU
> lock, if this is the only thread in a read critical section.In order
> to ensure that we can use RCU as a proxy for TLB flushes we need to
> flush the TLBs while still holding the RCU read lock. Without this
> change (and with the next one) we could wind up in a situation where
> we drop the RCU read lock, then the RCU callback runs and frees the
> memory, and then the guest does a lookup through the paging structure
> caches and we get a use-after-free bug. By flushing in an RCU critical
> section, we ensure that the TLBs will have been flushed by the time
> the RCU callback runs to free the memory. Clever!

Yep, exactly.

2021-11-30 11:29:41

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 27/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

On 11/20/21 05:50, Sean Christopherson wrote:
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> - rcu_read_unlock();
> -
> if (flush)
> kvm_flush_remote_tlbs(kvm);
>
> + rcu_read_unlock();
> +

Couldn't this sleep in kvm_make_all_cpus_request, whilst in an RCU
read-side critical section?

Paolo

2021-11-30 15:45:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 27/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

On Tue, Nov 30, 2021, Paolo Bonzini wrote:
> On 11/20/21 05:50, Sean Christopherson wrote:
> > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> > - rcu_read_unlock();
> > -
> > if (flush)
> > kvm_flush_remote_tlbs(kvm);
> > + rcu_read_unlock();
> > +
>
> Couldn't this sleep in kvm_make_all_cpus_request, whilst in an RCU read-side
> critical section?

No. And if kvm_make_all_cpus_request() can sleep, the TDP MMU is completely hosed
as tdp_mmu_zap_spte_atomic() and handle_removed_tdp_mmu_page() currently call
kvm_flush_remote_tlbs_with_range() while under RCU protection.

kvm_make_all_cpus_request_except() disables preemption via get_cpu(), and
smp_call_function() doubles down on disabling preemption as the inner helpers
require preemption to be disabled, so anything below them should complain if
there's a might_sleep(). hv_remote_flush_tlb_with_range() takes a spinlock, so
nothing in there should be sleeping either.

2021-11-30 16:18:10

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 27/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched

On 11/30/21 16:45, Sean Christopherson wrote:
>> Couldn't this sleep in kvm_make_all_cpus_request, whilst in an RCU read-side
>> critical section?
> No. And if kvm_make_all_cpus_request() can sleep, the TDP MMU is completely hosed
> as tdp_mmu_zap_spte_atomic() and handle_removed_tdp_mmu_page() currently call
> kvm_flush_remote_tlbs_with_range() while under RCU protection.
>
> kvm_make_all_cpus_request_except() disables preemption via get_cpu(), and
> smp_call_function() doubles down on disabling preemption as the inner helpers
> require preemption to be disabled, so anything below them should complain if
> there's a might_sleep(). hv_remote_flush_tlb_with_range() takes a spinlock, so
> nothing in there should be sleeping either.

Yeah, of course you're right.

Paolo

2021-12-01 17:54:43

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
>
> Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
> number of TLB flushes, and to clean up the zapping code. The final patch
> realizes the biggest change, which is to use RCU to defer any TLB flush
> due to zapping a SP to the caller. The largest cleanup is to separate the
> flows for zapping roots (zap _everything_), zapping leaf SPTEs (zap guest
> mappings for whatever reason), and zapping a specific SP (NX recovery).
> They're currently smushed into a single zap_gfn_range(), which was a good
> idea at the time, but became a mess when trying to handle the different
> rules, e.g. TLB flushes aren't needed when zapping a root because KVM can
> safely zap a root if and only if it's unreachable.
>
> For booting an 8 vCPU, remote_tlb_flush (requests) goes from roughly
> 180 (600) to 130 (215).
>
> Please don't apply patches 02 and 03, they've been posted elsehwere and by
> other people. I included them here because some of the patches have
> pseudo-dependencies on their changes. Patch 01 is also posted separately.
> I had a brain fart and sent it out realizing that doing so would lead to
> oddities.

What's the base commit for this series?

>
> Hou Wenlong (1):
> KVM: x86/mmu: Skip tlb flush if it has been done in zap_gfn_range()
>
> Sean Christopherson (27):
> KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier
> unmapping
> KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible
> path
> KVM: x86/mmu: Retry page fault if root is invalidated by memslot
> update
> KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP
> MMU
> KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush
> logic
> KVM: x86/mmu: Document that zapping invalidated roots doesn't need to
> flush
> KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root()
> KVM: x86/mmu: Require mmu_lock be held for write in unyielding root
> iter
> KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU
> root
> KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP
> removal
> KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier
> change_spte
> KVM: x86/mmu: Drop RCU after processing each root in MMU notifier
> hooks
> KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
> KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots
> KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
> KVM: x86/mmu: Terminate yield-friendly walk if invalid root observed
> KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw
> vals
> KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
> KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap
> hook
> KVM: x86/mmu: Add TDP MMU helper to zap a root
> KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
> KVM: x86/mmu: Use "zap root" path for "slow" zap of all TDP MMU SPTEs
> KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
> KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
> KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
> KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU
> resched
> KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow
> pages
>
> arch/x86/kvm/mmu/mmu.c | 74 +++--
> arch/x86/kvm/mmu/mmu_internal.h | 7 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> arch/x86/kvm/mmu/tdp_iter.c | 6 +-
> arch/x86/kvm/mmu/tdp_iter.h | 15 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 526 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/tdp_mmu.h | 48 +--
> 7 files changed, 406 insertions(+), 273 deletions(-)
>
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-12-01 20:50:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 04/28] KVM: x86/mmu: Retry page fault if root is invalidated by memslot update

On 11/20/21 05:50, Sean Christopherson wrote:
> Bail from the page fault handler if the root shadow page was obsoleted by
> a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU
> doesn't rely on the memslot/MMU generation, and instead relies on the
> root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
> mmu_lock for write.
>
> For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
> kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
> moved past the gfn associated with the SP.
>
> For other MMUs, the resulting behavior is far more convoluted, though
> unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
> root isn't directly problematic, as the obsolete root will be unloaded
> and dropped before the vCPU re-enters the guest. But because the legacy
> MMU tracks shadow pages by their role, any SP created by the fault can
> can be reused in the new post-reload root. Again, that _shouldn't_ be
> problematic as any leaf child SPTEs will be created for the current/valid
> memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
> the old generation as they will be flagged as obsolete. But, given that
> continuing with the fault is pointess (the root will be unloaded), apply
> the check to all MMUs.
>
> Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
> Cc: [email protected]
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 23 +++++++++++++++++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
> 2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d2ad12a4d66e..31ce913efe37 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1936,7 +1936,11 @@ static void mmu_audit_disable(void) { }
>
> static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - return sp->role.invalid ||
> + if (sp->role.invalid)
> + return true;
> +
> + /* TDP MMU pages due not use the MMU generation. */
> + return !sp->tdp_mmu_page &&
> unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> }
>
> @@ -3976,6 +3980,20 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> return true;
> }
>
> +/*
> + * Returns true if the page fault is stale and needs to be retried, i.e. if the
> + * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> + */
> +static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault, int mmu_seq)
> +{
> + if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
> + return true;
> +
> + return fault->slot &&
> + mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +}
> +
> static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
> @@ -4013,8 +4031,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> else
> write_lock(&vcpu->kvm->mmu_lock);
>
> - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> + if (is_page_fault_stale(vcpu, fault, mmu_seq))
> goto out_unlock;
> +
> r = make_mmu_pages_available(vcpu);
> if (r)
> goto out_unlock;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index f87d36898c44..708a5d297fe1 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -911,7 +911,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> r = RET_PF_RETRY;
> write_lock(&vcpu->kvm->mmu_lock);
> - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> +
> + if (is_page_fault_stale(vcpu, fault, mmu_seq))
> goto out_unlock;
>
> kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
>

Queued, thanks.

Paolo

2021-12-02 02:03:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

On Wed, Dec 01, 2021, David Matlack wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
> > number of TLB flushes, and to clean up the zapping code. The final patch
> > realizes the biggest change, which is to use RCU to defer any TLB flush
> > due to zapping a SP to the caller. The largest cleanup is to separate the
> > flows for zapping roots (zap _everything_), zapping leaf SPTEs (zap guest
> > mappings for whatever reason), and zapping a specific SP (NX recovery).
> > They're currently smushed into a single zap_gfn_range(), which was a good
> > idea at the time, but became a mess when trying to handle the different
> > rules, e.g. TLB flushes aren't needed when zapping a root because KVM can
> > safely zap a root if and only if it's unreachable.
> >
> > For booting an 8 vCPU, remote_tlb_flush (requests) goes from roughly
> > 180 (600) to 130 (215).
> >
> > Please don't apply patches 02 and 03, they've been posted elsehwere and by
> > other people. I included them here because some of the patches have
> > pseudo-dependencies on their changes. Patch 01 is also posted separately.
> > I had a brain fart and sent it out realizing that doing so would lead to
> > oddities.
>
> What's the base commit for this series?

Pretty sure it's based on a stale kvm/queue, commit 81d7c6659da0 ("KVM: VMX: Remove
vCPU from PI wakeup list before updating PID.NV"). Time to add useAutoBase=true...

2021-12-03 00:17:08

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

On Wed, Dec 1, 2021 at 6:03 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Dec 01, 2021, David Matlack wrote:
> > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
> > > number of TLB flushes, and to clean up the zapping code. The final patch
> > > realizes the biggest change, which is to use RCU to defer any TLB flush
> > > due to zapping a SP to the caller. The largest cleanup is to separate the
> > > flows for zapping roots (zap _everything_), zapping leaf SPTEs (zap guest
> > > mappings for whatever reason), and zapping a specific SP (NX recovery).
> > > They're currently smushed into a single zap_gfn_range(), which was a good
> > > idea at the time, but became a mess when trying to handle the different
> > > rules, e.g. TLB flushes aren't needed when zapping a root because KVM can
> > > safely zap a root if and only if it's unreachable.
> > >
> > > For booting an 8 vCPU, remote_tlb_flush (requests) goes from roughly
> > > 180 (600) to 130 (215).
> > >
> > > Please don't apply patches 02 and 03, they've been posted elsehwere and by
> > > other people. I included them here because some of the patches have
> > > pseudo-dependencies on their changes. Patch 01 is also posted separately.
> > > I had a brain fart and sent it out realizing that doing so would lead to
> > > oddities.
> >
> > What's the base commit for this series?
>
> Pretty sure it's based on a stale kvm/queue, commit 81d7c6659da0 ("KVM: VMX: Remove
> vCPU from PI wakeup list before updating PID.NV"). Time to add useAutoBase=true...

That applied cleanly. Thanks!

2021-12-08 19:17:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 04/28] KVM: x86/mmu: Retry page fault if root is invalidated by memslot update

On Sat, Nov 20, 2021, Sean Christopherson wrote:
> @@ -3976,6 +3980,20 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> return true;
> }
>
> +/*
> + * Returns true if the page fault is stale and needs to be retried, i.e. if the
> + * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> + */
> +static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault, int mmu_seq)
> +{
> + if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))

Ugh, I got so focused on TDP MMU that I completely forgot to test this with shadow
paging. PAE roots are not backed by shadow pages, which means this explodes on the
very first page fault with TDP disabled. Nested NPT will suffer the same fate.

I'll figure out a patch for 5.16. Long term, it might be nice to actually allocate
shadow pages for the special roots.

> + return true;
> +
> + return fault->slot &&
> + mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +}

2021-12-14 23:35:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Mon, Nov 22, 2021 at 3:08 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Mon, Nov 22, 2021, Ben Gardon wrote:
> > > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > Take TDP MMU roots off the list of roots when they're invalidated instead
> > > > of walking later on to find the roots that were just invalidated. In
> > > > addition to making the flow more straightforward, this allows warning
> > > > if something attempts to elevate the refcount of an invalid root, which
> > > > should be unreachable (no longer on the list so can't be reached by MMU
> > > > notifier, and vCPUs must reload a new root before installing new SPTE).
> > > >
> > > > Signed-off-by: Sean Christopherson <[email protected]>
> > >
> > > There are a bunch of awesome little cleanups and unrelated fixes
> > > included in this commit that could be factored out.
> > >
> > > I'm skeptical of immediately moving the invalidated roots into another
> > > list as that seems like it has a lot of potential for introducing
> > > weird races.
> >
> > I disagree, the entire premise of fast invalidate is that there can't be races,
> > i.e. mmu_lock must be held for write. IMO, it's actually the opposite, as the only
> > reason leaving roots on the per-VM list doesn't have weird races is because slots_lock
> > is held. If slots_lock weren't required to do a fast zap, which is feasible for the
> > TDP MMU since it doesn't rely on the memslots generation, then it would be possible
> > for multiple calls to kvm_tdp_mmu_zap_invalidated_roots() to run in parallel. And in
> > that case, leaving roots on the per-VM list would lead to a single instance of a
> > "fast zap" zapping roots it didn't invalidate. That's wouldn't be problematic per se,
> > but I don't like not having a clear "owner" of the invalidated root.
>
> That's a good point, the potential interleaving of zap_alls would be gross.
>
> My mental model for the invariant here was "roots that are still in
> use are on the roots list," but I can see how "the roots list contains
> all valid, in-use roots" could be a more useful invariant.

Sadly, my idea of taking invalid roots off the list ain't gonna happen.

The immediate issue is that the TDP MMU doesn't zap invalid roots in mmu_notifier
callbacks. This leads to use-after-free and other issues if the mmu_notifier runs
to completion while an invalid root zapper yields as KVM fails to honor the
requirement that there must be _no_ references to the page after the mmu_notifier
returns.

This is most noticeable with set_nx_huge_pages() + kvm_mmu_notifier_release(),
but the bug exists between kvm_mmu_notifier_invalidate_range_start() and memslot
updates as well. The pages aren't accessible by the guest, and KVM won't read or
write the data itself, but KVM will trigger e.g. kvm_set_pfn_dirty() when zapping
SPTEs, _after_ the mmu_notifier returns.

2021-12-14 23:45:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 10/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > Allow yielding when zapping SPTEs for a defunct TDP MMU root. Yielding
> > is safe from a TDP perspective, as the root is unreachable. The only
> > potential danger is putting a root from a non-preemptible context, and
> > KVM currently does not do so.
> >
> > Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
> > take a reference to each root (it requires mmu_lock be held for the
> > entire duration of the walk).
> >
> > tdp_mmu_next_root() is used only by the yield-friendly iterator.
> >
> > kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
> >
> > kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
> > but is still yield-friendly in all call sites, as all callers can be
> > traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
> > kvm_create_vm().
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> Reviewed-by: Ben Gardon <[email protected]>
>
> I'm glad to see this fixed. I assume we don't usually hit this in
> testing because most of the teardown happens in the zap-all path when
> we unregister for MMU notifiers and actually deleting a fully
> populated root while the VM is running is pretty rare.

Another *sigh*.

AFAIK, the above analysis is 100% correct, but there's a subtle problem with
yielding while putting the last reference to a root. If the mmu_notifier runs
in parallel, it (obviously) won't be able to get a reference to the root, and so
KVM will fail to ensure all references to an unmapped range are removed prior to
returning from the mmu_notifier.

But, I have a idea. Instead of synchronously zapping the defunct root, mark it
invalid, set the refcount back to '1', and then use a helper kthread to do the
teardown. Assuming there is exactly one helper, that would also address my
concerns with kvm_tdp_mmu_zap_invalidated_roots() being unsafe to call in parallel,
e.g. two zappers processing an invalid root would both put the last reference to
a root and trigger use-after-free of a different kind.

2021-12-14 23:52:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 10/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Tue, Dec 14, 2021, Sean Christopherson wrote:
> Assuming there is exactly one helper, that would also address my
> concerns with kvm_tdp_mmu_zap_invalidated_roots() being unsafe to call in parallel,
> e.g. two zappers processing an invalid root would both put the last reference to
> a root and trigger use-after-free of a different kind.

I take that back. So long as both callers grabbed a reference to the root, multiple
instances are ok. I forgot that kvm_tdp_mmu_zap_invalidated_roots() doesn't take
roots off the list directly, that's handled by kvm_tdp_mmu_put_root().