2022-02-26 01:42:05

by Sean Christopherson

Subject: [PATCH v3 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
number of TLB flushes, fix soft lockups and RCU stalls, avoid blocking
vCPUs for long durations while zapping paging structures, and to clean up
the zapping code.

The largest cleanup is to separate the flows for zapping roots (zap
_everything_), zapping leaf SPTEs (zap guest mappings for whatever reason),
and zapping a specific SP (NX recovery). They're currently smushed into a
single zap_gfn_range(), which was a good idea at the time, but became a
mess when trying to handle the different rules, e.g. TLB flushes aren't
needed when zapping a root because KVM can safely zap a root if and only
if it's unreachable.

To solve the soft lockups, stalls, and vCPU performance issues:

- Defer remote TLB flushes to the caller when zapping TDP MMU shadow
pages by relying on RCU to ensure the paging structure isn't freed
until all vCPUs have exited the guest.

- Allow yielding when zapping TDP MMU roots in response to the root's
last reference being put. This requires a bit of trickery to ensure
the root is reachable via mmu_notifier, but it's not too gross.

- Zap roots in two passes to avoid holding RCU for potentially hundreds of
seconds when zapping a guest with terabytes of memory that is backed
entirely by 4kb SPTEs (see the sketch after this list).

- Zap defunct roots asynchronously via the common work_queue so that a
vCPU doesn't get stuck doing the work if the vCPU happens to drop the
last reference to a root.
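
As a conceptual sketch of the two-pass approach called out above (the helper
name and level choices here are illustrative, not the literal patch):

/*
 * Zap the root in two passes so that RCU can be dropped on yield between
 * chunks of work.  Pass 1 zaps the lower levels, which covers the vast
 * majority of SPTEs; pass 2 zaps the top-level entries, which by then
 * point at mostly-empty page tables and are cheap to recurse through.
 */
static void example_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
			     bool shared)
{
	example_zap_root_at_level(kvm, root, shared, PG_LEVEL_1G);
	example_zap_root_at_level(kvm, root, shared, root->role.level);
}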

The selftest at the end allows populating a guest with the max amount of
memory allowed by the underlying architecture. The most I've tested is
~64tb (MAXPHYADDR=46) as I don't have easy access to a system with
MAXPHYADDR=52. The selftest compiles on arm64 and s390x, but otherwise
hasn't been tested outside of x86-64. It will hopefully do something
useful as is, but there's a non-zero chance it won't get past init with
a high max memory. Running on x86 without the TDP MMU is comically slow.

v3:
- Drop patches that were applied.
- Rebase to latest kvm/queue.
- Collect a review. [David]
- Use helper instead of goto to zap roots in two passes. [David]
- Add patches to disallow REMOVED "old" SPTE when atomically
setting SPTE.

v2:
- https://lore.kernel.org/all/[email protected]
- Drop patches that were applied.
- Collect reviews for patches that weren't modified. [Ben]
- Abandon the idea of taking invalid roots off the list of roots.
- Add a patch to fix misleading/wrong comments with respect to KVM's
responsibilities in the "fast zap" flow, specifically that all SPTEs
must be dropped before the zap completes.
- Rework yielding in kvm_tdp_mmu_put_root() to keep the root visible
while yielding.
- Add patch to zap roots in two passes. [Mingwei, David]
- Add a patch to asynchronously zap defunct roots.
- Add the selftest.

v1: https://lore.kernel.org/all/[email protected]


Sean Christopherson (28):
KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots
KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP
MMU
KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush
logic
KVM: x86/mmu: Document that zapping invalidated roots doesn't need to
flush
KVM: x86/mmu: Require mmu_lock be held for write in unyielding root
iter
KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP
removal
KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier
change_spte
KVM: x86/mmu: Drop RCU after processing each root in MMU notifier
hooks
KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw
vals
KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU
resched
KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow
pages
KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU
root
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
KVM: x86/mmu: Zap defunct roots via asynchronous worker
KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE
KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE
KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
KVM: selftests: Split out helper to allocate guest mem via memfd
KVM: selftests: Define cpu_relax() helpers for s390 and x86
KVM: selftests: Add test to populate a VM with the max possible guest
mem

arch/x86/kvm/mmu/mmu.c | 42 +-
arch/x86/kvm/mmu/mmu_internal.h | 15 +-
arch/x86/kvm/mmu/tdp_iter.c | 6 +-
arch/x86/kvm/mmu/tdp_iter.h | 15 +-
arch/x86/kvm/mmu/tdp_mmu.c | 595 ++++++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 26 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../selftests/kvm/include/kvm_util_base.h | 5 +
.../selftests/kvm/include/s390x/processor.h | 8 +
.../selftests/kvm/include/x86_64/processor.h | 5 +
tools/testing/selftests/kvm/lib/kvm_util.c | 66 +-
.../selftests/kvm/max_guest_memory_test.c | 292 +++++++++
.../selftests/kvm/set_memory_region_test.c | 35 +-
14 files changed, 832 insertions(+), 282 deletions(-)
create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c


base-commit: 625e7ef7da1a4addd8db41c2504fe8a25b93acd5
--
2.35.1.574.g5d30c73bfb-goog


2022-02-26 01:46:23

by Sean Christopherson

Subject: [PATCH v3 19/28] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages

Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
of immediately flushing. Because the shadow pages are freed in an RCU
callback, so long as at least one CPU holds RCU, all CPUs are protected.
For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
needs to ensure the caller services the pending TLB flush before dropping
its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs
running in the guest.

Deferring the flushes allows batching flushes, e.g. when installing a
1gb hugepage and zapping a pile of SPs. And when zapping an entire root,
deferring flushes allows skipping the flush entirely (because flushes are
not needed in that case).

Avoiding flushes when zapping an entire root is especially important as
synchronizing with other CPUs via IPI after zapping every shadow page can
cause significant performance issues for large VMs. The issue is
exacerbated by KVM zapping entire top-level entries without dropping
RCU protection, which can lead to RCU stalls even when zapping roots
backing relatively "small" amounts of guest memory, e.g. 2tb. Removing
the IPI bottleneck largely mitigates the RCU issues, though it's likely
still a problem for 5-level paging. A future patch will further address
the problem by zapping roots in multiple passes to avoid holding RCU for
an extended duration.
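
To make the new contract concrete, here is a minimal caller-side sketch
(example_zap_tdp_mmu_sptes() is a placeholder; the kvm_recover_nx_lpages()
hunk below is the real instance of this pattern):

/*
 * The zap helpers only report that a flush is needed; the caller must
 * service the flush before dropping its RCU read lock, as the zapped
 * shadow pages are freed via RCU callback once all readers are done.
 */
rcu_read_lock();

flush = example_zap_tdp_mmu_sptes(kvm);	/* may batch many zaps */

if (flush)
	kvm_flush_remote_tlbs(kvm);

rcu_read_unlock();	/* only now can the freed shadow pages be reclaimed */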

Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++
arch/x86/kvm/mmu/tdp_iter.h | 7 +++----
arch/x86/kvm/mmu/tdp_mmu.c | 20 ++++++++++----------
3 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 92fe1ba27213..c1deaec795c2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6290,6 +6290,12 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
rcu_idx = srcu_read_lock(&kvm->srcu);
write_lock(&kvm->mmu_lock);

+ /*
+ * Zapping TDP MMU shadow pages, including the remote TLB flush, must
+ * be done under RCU protection, because the pages are freed via RCU callback.
+ */
+ rcu_read_lock();
+
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
for ( ; to_zap; --to_zap) {
@@ -6314,12 +6320,18 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+ rcu_read_unlock();
+
cond_resched_rwlock_write(&kvm->mmu_lock);
flush = false;
+
+ rcu_read_lock();
}
}
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

+ rcu_read_unlock();
+
write_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, rcu_idx);
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e2a7e267a77d..b1eaf6ec0e0b 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,10 +9,9 @@

/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
- * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
- * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
- * the lower depths of the TDP MMU just to make lockdep happy is a nightmare, so
- * all accesses to SPTEs are done under RCU protection.
+ * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
+ * batched without having to collect the list of zapped SPs. Flows that can
+ * remove SPs must service pending TLB flushes prior to dropping RCU protection.
*/
static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ef594af246f5..3031b42c27a6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -405,9 +405,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
shared);
}

- kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
- KVM_PAGES_PER_HPAGE(level + 1));
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

@@ -831,19 +828,13 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
if (WARN_ON_ONCE(!sp->ptep))
return false;

- rcu_read_lock();
-
old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
- if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
- rcu_read_unlock();
+ if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;
- }

__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
sp->gfn, sp->role.level + 1, true, true);

- rcu_read_unlock();
-
return true;
}

@@ -884,6 +875,11 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
}

rcu_read_unlock();
+
+ /*
+ * Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
+ * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
+ */
return flush;
}

@@ -1013,6 +1009,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
ret = RET_PF_SPURIOUS;
else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
return RET_PF_RETRY;
+ else if (is_shadow_present_pte(iter->old_spte) &&
+ !is_last_spte(iter->old_spte, iter->level))
+ kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+ KVM_PAGES_PER_HPAGE(iter->level + 1));

/*
* If the page fault was caused by a write but the page is write
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 01:49:01

by Sean Christopherson

Subject: [PATCH v3 02/28] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU

Explicitly check for present SPTEs when clearing dirty bits in the TDP
MMU. This isn't strictly required for correctness, as setting the dirty
bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
complain if anyone actually turned on KVM's MMU debugging.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 25148e8b711d..9357780ec28f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1446,6 +1446,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;

+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
if (spte_ad_need_write_protect(iter.old_spte)) {
if (is_writable_pte(iter.old_spte))
new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 01:50:18

by Sean Christopherson

Subject: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

Zap defunct roots, a.k.a. roots that have been invalidated after their
last reference was initially dropped, asynchronously via the system work
queue instead of forcing the work upon the unfortunate task that happened
to drop the last reference.

If a vCPU task drops the last reference, the vCPU is effectively blocked
by the host for the entire duration of the zap. If the root being zapped
happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
being active, the zap can take several hundred seconds. Unsurprisingly,
most guests are unhappy if a vCPU disappears for hundreds of seconds.

E.g. running a synthetic selftest that triggers a vCPU root zap with
~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
Offloading the zap to a worker drops the block time to <100ms.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 8 +++-
arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++-----
2 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index be063b6c91b7..1bff453f7cbe 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -65,7 +65,13 @@ struct kvm_mmu_page {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
};
- DECLARE_BITMAP(unsync_child_bitmap, 512);
+ union {
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
+ struct {
+ struct work_struct tdp_mmu_async_work;
+ void *tdp_mmu_async_data;
+ };
+ };

struct list_head lpage_disallowed_link;
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ec28a88c6376..4151e61245a7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);

+static void tdp_mmu_zap_root_async(struct work_struct *work)
+{
+ struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
+ tdp_mmu_async_work);
+ struct kvm *kvm = root->tdp_mmu_async_data;
+
+ read_lock(&kvm->mmu_lock);
+
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB flush when
+ * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+ * to a different pCPU. Note, the local TLB flush on reuse also
+ * invalidates any paging-structure-cache entries, i.e. TLB entries for
+ * intermediate paging structures, that may be zapped, as such entries
+ * are associated with the ASID on both VMX and SVM.
+ */
+ tdp_mmu_zap_root(kvm, root, true);
+
+ /*
+ * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
+ * avoiding an infinite loop. By design, the root is reachable while
+ * it's being asynchronously zapped, thus a different task can put its
+ * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
+ * asynchronously zapped root is unavoidable.
+ */
+ kvm_tdp_mmu_put_root(kvm, root, true);
+
+ read_unlock(&kvm->mmu_lock);
+
+ kvm_put_kvm(kvm);
+}
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -142,15 +174,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
refcount_set(&root->tdp_mmu_root_count, 1);

/*
- * Zap the root, then put the refcount "acquired" above. Recursively
- * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
- * infinite loop by freeing invalid roots. By design, the root is
- * reachable while it's being zapped, thus a different task can put its
- * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
- * defunct root is unavoidable.
+ * Attempt to acquire a reference to KVM itself. If KVM is alive, then
+ * zap the root asynchronously in a worker, otherwise it must be zapped
+ * directly here. Wait to do this check until after the refcount is
+ * reset so that tdp_mmu_zap_root() can safely yield.
+ *
+ * In both flows, zap the root, then put the refcount "acquired" above.
+ * When putting the reference, use kvm_tdp_mmu_put_root() to test the
+ * above logic for avoiding an infinite loop by freeing invalid roots.
+ * By design, the root is reachable while it's being zapped, thus a
+ * different task can put its last reference, i.e. flowing through
+ * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
*/
- tdp_mmu_zap_root(kvm, root, shared);
- kvm_tdp_mmu_put_root(kvm, root, shared);
+ if (kvm_get_kvm_safe(kvm)) {
+ root->tdp_mmu_async_data = kvm;
+ INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
+ schedule_work(&root->tdp_mmu_async_work);
+ } else {
+ tdp_mmu_zap_root(kvm, root, shared);
+ kvm_tdp_mmu_put_root(kvm, root, shared);
+ }
}

enum tdp_mmu_roots_iter_type {
@@ -954,7 +997,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)

/*
* Zap all roots, including invalid roots, as all SPTEs must be dropped
- * before returning to the caller.
+ * before returning to the caller. Zap directly even if the root is
+ * also being zapped by a worker. Walking zapped top-level SPTEs isn't
+ * all that expensive and mmu_lock is already held, which means the
+ * worker has yielded, i.e. flushing the work instead of zapping here
+ * isn't guaranteed to be any faster.
*
* A TLB flush is unnecessary, KVM zaps everything if and only if the VM
* is being destroyed or the userspace VMM has exited. In both cases,
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 01:52:44

by Sean Christopherson

Subject: [PATCH v3 26/28] KVM: selftests: Split out helper to allocate guest mem via memfd

Extract the code for allocating guest memory via memfd out of
vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
A future selftest to populate a guest with the maximum amount of guest
memory will abuse KVM's memslots to alias guest memory regions to a
single memfd-backed host region, i.e. needs to back a guest with memfd
memory without a 1:1 association between a memslot and a memfd instance.

No functional change intended.
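
A hypothetical usage sketch (not the actual max_guest_memory_test code) of how
a test can alias one memfd across many memslots; 'vm', 'first_slot', 'nr_slots'
and 'gpa_base' are assumed to exist, and vm_set_user_memory_region() is the raw
helper moved to common code earlier in this series:

size_t size = 2ul * 1024 * 1024 * 1024;
int fd = kvm_memfd_alloc(size, false);
int i;

for (i = 0; i < nr_slots; i++) {
	/* Every mapping aliases the same host memory backing the memfd. */
	void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);

	TEST_ASSERT(hva != MAP_FAILED, "mmap() failed");
	vm_set_user_memory_region(vm, first_slot + i, 0,
				  gpa_base + (uint64_t)i * size, size, hva);
}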

Signed-off-by: Sean Christopherson <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 42 +++++++++++--------
2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 573de0354175..92cef0ffb19e 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -123,6 +123,7 @@ int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
size_t len);

void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename);
+int kvm_memfd_alloc(size_t size, bool hugepages);

void vm_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent);

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index dcb8e96c6a54..1665a220abcb 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -718,6 +718,27 @@ void kvm_vm_free(struct kvm_vm *vmp)
free(vmp);
}

+int kvm_memfd_alloc(size_t size, bool hugepages)
+{
+ int memfd_flags = MFD_CLOEXEC;
+ int fd, r;
+
+ if (hugepages)
+ memfd_flags |= MFD_HUGETLB;
+
+ fd = memfd_create("kvm_selftest", memfd_flags);
+ TEST_ASSERT(fd != -1, "memfd_create() failed, errno: %i (%s)",
+ errno, strerror(errno));
+
+ r = ftruncate(fd, size);
+ TEST_ASSERT(!r, "ftruncate() failed, errno: %i (%s)", errno, strerror(errno));
+
+ r = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size);
+ TEST_ASSERT(!r, "fallocate() failed, errno: %i (%s)", errno, strerror(errno));
+
+ return fd;
+}
+
/*
* Memory Compare, host virtual to guest virtual
*
@@ -970,24 +991,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->mmap_size += alignment;

region->fd = -1;
- if (backing_src_is_shared(src_type)) {
- int memfd_flags = MFD_CLOEXEC;
-
- if (src_type == VM_MEM_SRC_SHARED_HUGETLB)
- memfd_flags |= MFD_HUGETLB;
-
- region->fd = memfd_create("kvm_selftest", memfd_flags);
- TEST_ASSERT(region->fd != -1,
- "memfd_create failed, errno: %i", errno);
-
- ret = ftruncate(region->fd, region->mmap_size);
- TEST_ASSERT(ret == 0, "ftruncate failed, errno: %i", errno);
-
- ret = fallocate(region->fd,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
- region->mmap_size);
- TEST_ASSERT(ret == 0, "fallocate failed, errno: %i", errno);
- }
+ if (backing_src_is_shared(src_type))
+ region->fd = kvm_memfd_alloc(region->mmap_size,
+ src_type == VM_MEM_SRC_SHARED_HUGETLB);

region->mmap_start = mmap(NULL, region->mmap_size,
PROT_READ | PROT_WRITE,
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 01:53:30

by Sean Christopherson

Subject: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

Explicitly ignore the result of zap_gfn_range() when putting the last
reference to a TDP MMU root, and add a pile of comments to formalize the
TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this
only affects the !shared case, as zap_gfn_range() subtly never returns
true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().

Putting the root without a flush is ok because even if there are stale
references to the root in the TLB, they are unreachable because KVM will
not run the guest with the same ASID without first flushing (where ASID
in this context refers to both SVM's explicit ASID and Intel's implicit
ASID that is constructed from VPID+PCID+EPT4A+etc...).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 80607513a1f2..5a931c89d27b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5069,6 +5069,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
kvm_mmu_sync_roots(vcpu);

kvm_mmu_load_pgd(vcpu);
+
+ /*
+ * Flush any TLB entries for the new root, the provenance of the root
+ * is unknown. In theory, even if KVM ensures there are no stale TLB
+ * entries for a freed root, an out-of-tree hypervisor could
+ * have left stale entries. Flushing on alloc also allows KVM to skip
+ * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
+ */
static_call(kvm_x86_flush_tlb_current)(vcpu);
out:
return r;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 12866113fb4f..e35bd88d92fd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

- zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB flush when
+ * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+ * to a different pCPU. Note, the local TLB flush on reuse also
+ * invalidates any paging-structure-cache entries, i.e. TLB entries for
+ * intermediate paging structures, that may be zapped, as such entries
+ * are associated with the ASID on both VMX and SVM.
+ */
+ (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 01:56:44

by Sean Christopherson

Subject: [PATCH v3 17/28] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()

Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
functions accordingly. When removing mappings for functional correctness
(except for the stupid VFIO GPU passthrough memslots bug), zapping the
leaf SPTEs is sufficient as the paging structures themselves do not point
at guest memory and do not directly impact the final translation (in the
TDP MMU).

Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
kvm_unmap_gfn_range().

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
arch/x86/kvm/mmu/tdp_mmu.h | 8 +-------
3 files changed, 14 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1c4b84e80841..92fe1ba27213 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5775,8 +5775,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

if (is_tdp_mmu_enabled(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
- gfn_end, flush);
+ flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
+ gfn_end, true, flush);
}

if (flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b55eb7bec308..42803d5dbbf9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -848,10 +848,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
}

/*
- * Tears down the mappings for the range of gfns, [start, end), and frees the
- * non-root pages mapping GFNs strictly within that range. Returns true if
- * SPTEs have been cleared and a TLB flush is needed before releasing the
- * MMU lock.
+ * Zap leaf SPTEs for the range of gfns, [start, end). Returns true if SPTEs
+ * have been cleared and a TLB flush is needed before releasing the MMU lock.
*
* If can_yield is true, will release the MMU lock and reschedule if the
* scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -859,42 +857,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
*/
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush)
+static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
{
- bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;

- /*
- * No need to try to step down in the iterator when zapping all SPTEs,
- * zapping the top-level non-leaf SPTEs will recurse on their children.
- */
- int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
-
end = min(end, tdp_mmu_max_gfn_host());

lockdep_assert_held_write(&kvm->mmu_lock);

rcu_read_lock();

- for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+ for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}

- if (!is_shadow_present_pte(iter.old_spte))
- continue;
-
- /*
- * If this is a non-last-level SPTE that covers a larger range
- * than should be zapped, continue, and zap the mappings at a
- * lower level, except when zapping all SPTEs.
- */
- if (!zap_all &&
- (iter.gfn < start ||
- iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+ if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;

@@ -912,13 +893,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* SPTEs have been cleared and a TLB flush is needed before releasing the
* MMU lock.
*/
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
- gfn_t end, bool can_yield, bool flush)
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+ bool can_yield, bool flush)
{
struct kvm_mmu_page *root;

for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);

return flush;
}
@@ -1179,8 +1160,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
- range->end, range->may_block, flush);
+ return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
+ range->end, range->may_block, flush);
}

typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5e5ef2576c81..54bc8118c40a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);

-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
gfn_t end, bool can_yield, bool flush);
-static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
- gfn_t start, gfn_t end, bool flush)
-{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
-}
-
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 01:58:41

by Sean Christopherson

Subject: [PATCH v3 05/28] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush

Remove the misleading flush "handling" when zapping invalidated TDP MMU
roots, and document that flushing is unnecessary for all flavors of MMUs
when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU
is dead code, as zap_gfn_range() is called with shared=true, in which
case it will never return true due to the flushing being handled by
tdp_mmu_zap_spte_atomic().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5a931c89d27b..1c4b84e80841 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5615,9 +5615,13 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
}

/*
- * Trigger a remote TLB flush before freeing the page tables to ensure
- * KVM is not in the middle of a lockless shadow page table walk, which
- * may reference the pages.
+ * Kick all vCPUs (via remote TLB flush) before freeing the page tables
+ * to ensure KVM is not in the middle of a lockless shadow page table
+ * walk, which may reference the pages. The remote TLB flush itself is
+ * not required and is simply a convenient way to kick vCPUs as needed.
+ * KVM performs a local TLB flush when allocating a new root (see
+ * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are
+ * running with an obsolete MMU.
*/
kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e35bd88d92fd..5994db5d5226 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -843,12 +843,20 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
- bool flush = false;

lockdep_assert_held_read(&kvm->mmu_lock);

for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
- flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
+ /*
+ * A TLB flush is unnecessary, invalidated roots are guaranteed
+ * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
+ * for more details), and unlike the legacy MMU, no vCPU kick
+ * is needed to play nice with lockless shadow walks as the TDP
+ * MMU protects its paging structures via RCU. Note, zapping
+ * will still flush on yield, but that's a minor performance
+ * blip and not a functional issue.
+ */
+ (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);

/*
* Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
@@ -856,9 +864,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
*/
kvm_tdp_mmu_put_root(kvm, root, true);
}
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
}

/*
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:05:38

by Sean Christopherson

Subject: [PATCH v3 24/28] KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE

Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE.
This solves a conundrum introduced by commit 3255530ab191 ("KVM: x86/mmu:
Automatically update iter->old_spte if cmpxchg fails"); if the helper
doesn't update old_spte in the REMOVED case, then theoretically the
caller could get stuck in an infinite loop as it will fail indefinitely
on the REMOVED SPTE. E.g. until recently, clear_dirty_gfn_range() didn't
check for a present SPTE and would have spun until getting rescheduled.

In practice, only the page fault path should "create" a new SPTE, all
other paths should only operate on existing, a.k.a. shadow present,
SPTEs. Now that the page fault path pre-checks for a REMOVED SPTE in all
cases, require all other paths to indirectly pre-check by verifying the
target SPTE is a shadow-present SPTE.

Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that
scenario disallowed. The invariant is only that the caller mustn't
invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last
observed by the caller.
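
As a hedged illustration of the resulting calling convention ('kvm', 'root',
'start', 'end' and a computed 'new_spte' are assumed):

struct tdp_iter iter;

/*
 * Paths other than the page fault handler operate only on SPTEs they
 * have already observed as shadow-present, which by definition excludes
 * the REMOVED marker, so the WARN in tdp_mmu_set_spte_atomic() is never
 * tripped by a known-REMOVED "old" SPTE.
 */
tdp_root_for_each_pte(iter, root, start, end) {
	if (!is_shadow_present_pte(iter.old_spte))
		continue;	/* skips both !PRESENT and REMOVED SPTEs */

	if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
		continue;	/* lost a race; iter.old_spte was refreshed */
}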

Cc: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1acd12bf309f..d223870b3790 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -634,16 +634,15 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
u64 *sptep = rcu_dereference(iter->sptep);
u64 old_spte;

- WARN_ON_ONCE(iter->yielded);
-
- lockdep_assert_held_read(&kvm->mmu_lock);
-
/*
- * Do not change removed SPTEs. Only the thread that froze the SPTE
- * may modify it.
+ * The caller is responsible for ensuring the old SPTE is not a REMOVED
+ * SPTE. KVM should never attempt to zap or manipulate a REMOVED SPTE,
+ * and pre-checking before inserting a new SPTE is advantageous as it
+ * avoids unnecessary work.
*/
- if (is_removed_spte(iter->old_spte))
- return -EBUSY;
+ WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
+
+ lockdep_assert_held_read(&kvm->mmu_lock);

/*
* Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:09:08

by Sean Christopherson

Subject: [PATCH v3 16/28] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range

Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
support for zapping with mmu_lock held for read. That all callers hold
mmu_lock for write isn't a random coincidence; now that the paths that
need to zap _everything_ have their own path, the only callers left are
those that need to zap for functional correctness. And when zapping is
required for functional correctness, mmu_lock must be held for write,
otherwise the caller has no guarantees about the state of the TDP MMU
page tables after it has run, e.g. the SPTE(s) it zapped can be
immediately replaced by a vCPU faulting in a page.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++------------------
1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c5df9a552470..b55eb7bec308 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -858,15 +858,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* function cannot yield, it will not release the MMU lock or reschedule and
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
- *
- * If shared is true, this thread holds the MMU lock in read mode and must
- * account for the possibility that other threads are modifying the paging
- * structures concurrently. If shared is false, this thread should hold the
- * MMU lock in write mode.
*/
static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush,
- bool shared)
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
{
bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;
@@ -879,14 +873,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

end = min(end, tdp_mmu_max_gfn_host());

- kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ lockdep_assert_held_write(&kvm->mmu_lock);

rcu_read_lock();

for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
-retry:
if (can_yield &&
- tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
+ tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}
@@ -905,12 +898,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

- if (!shared) {
- tdp_mmu_set_spte(kvm, &iter, 0);
- flush = true;
- } else if (tdp_mmu_zap_spte_atomic(kvm, &iter)) {
- goto retry;
- }
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ flush = true;
}

rcu_read_unlock();
@@ -929,8 +918,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
struct kvm_mmu_page *root;

for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
- false);
+ flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);

return flush;
}
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:13:03

by Sean Christopherson

Subject: [PATCH v3 06/28] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter

Assert that mmu_lock is held for write by users of the yield-unfriendly
TDP iterator. The nature of a shared walk means that the caller needs to
play nice with other tasks modifying the page tables, which is more or
less the same thing as playing nice with yielding. Theoretically, KVM
could gain a flow where it could legitimately take mmu_lock for read in
a non-preemptible context, but that's highly unlikely and any such case
should be viewed with a fair amount of scrutiny.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5994db5d5226..189f21e71c36 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,13 +29,16 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
return true;
}

-static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
+/* Arbitrarily returns true so that this may be used in if statements. */
+static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
bool shared)
{
if (shared)
lockdep_assert_held_read(&kvm->mmu_lock);
else
lockdep_assert_held_write(&kvm->mmu_lock);
+
+ return true;
}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
@@ -187,11 +190,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)

-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
- lockdep_is_held_type(&kvm->mmu_lock, 0) || \
- lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+/*
+ * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
+ * the implication being that any flow that holds mmu_lock for read is
+ * inherently yield-friendly and should use the yield-safe variant above.
+ * Holding mmu_lock for write obviates the need for RCU protection as the list
+ * is guaranteed to be stable.
+ */
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
+ kvm_mmu_page_as_id(_root) != _as_id) { \
} else

static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:16:15

by Sean Christopherson

Subject: [PATCH v3 13/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

When recovering a potential hugepage that was shattered for the iTLB
multihit workaround, precisely zap only the target page instead of
iterating over the TDP MMU to find the SP that was passed in. This will
allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs.
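
For context, a rough sketch of how the NX recovery loop consumes the new
helper (not the literal mmu.c code, which this patch does not touch):

if (is_tdp_mmu_page(sp))
	flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
else
	kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);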

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 7 ++++++-
arch/x86/kvm/mmu/tdp_iter.h | 2 --
arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 18 +----------------
4 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index da6166b5c377..be063b6c91b7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -30,6 +30,8 @@ extern bool dbg;
#define INVALID_PAE_ROOT 0
#define IS_VALID_PAE_ROOT(x) (!!(x))

+typedef u64 __rcu *tdp_ptep_t;
+
struct kvm_mmu_page {
/*
* Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -59,7 +61,10 @@ struct kvm_mmu_page {
refcount_t tdp_mmu_root_count;
};
unsigned int unsync_children;
- struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ union {
+ struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ tdp_ptep_t ptep;
+ };
DECLARE_BITMAP(unsync_child_bitmap, 512);

struct list_head lpage_disallowed_link;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index bb9b581f1ee4..e2a7e267a77d 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -7,8 +7,6 @@

#include "mmu.h"

-typedef u64 __rcu *tdp_ptep_t;
-
/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
* to be zapped while holding mmu_lock for read. Holding RCU isn't required for
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9e8ba6f12ebf..c231b60e1726 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -213,13 +213,14 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
return sp;
}

-static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, gfn_t gfn,
- union kvm_mmu_page_role role)
+static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
+ gfn_t gfn, union kvm_mmu_page_role role)
{
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

sp->role = role;
sp->gfn = gfn;
+ sp->ptep = sptep;
sp->tdp_mmu_page = true;

trace_kvm_mmu_get_page(sp, true);
@@ -236,7 +237,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
role = parent_sp->role;
role.level--;

- tdp_mmu_init_sp(child_sp, iter->gfn, role);
+ tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}

hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
@@ -258,7 +259,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
}

root = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_sp(root, 0, role);
+ tdp_mmu_init_sp(root, NULL, 0, role);

refcount_set(&root->tdp_mmu_root_count, 1);

@@ -750,6 +751,33 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
return iter->yielded;
}

+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ u64 old_spte;
+
+ /*
+ * This helper intentionally doesn't allow zapping a root shadow page,
+ * which doesn't have a parent page table and thus no associated entry.
+ */
+ if (WARN_ON_ONCE(!sp->ptep))
+ return false;
+
+ rcu_read_lock();
+
+ old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
+ if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
+ sp->gfn, sp->role.level + 1, true, true);
+
+ rcu_read_unlock();
+
+ return true;
+}
+
/*
* Tears down the mappings for the range of gfns, [start, end), and frees the
* non-root pages mapping GFNs strictly within that range. Returns true if
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 57c73d8f76ce..5e5ef2576c81 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -22,24 +22,8 @@ static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
{
return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
}
-static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level + 1);
-
- /*
- * Don't allow yielding, as the caller may have a flush pending. Note,
- * if mmu_lock is held for write, zapping will never yield in this case,
- * but explicitly disallow it for safety. The TDP MMU does not yield
- * until it has made forward progress (steps sideways), and when zapping
- * a single shadow page that it's guaranteed to see (thus the mmu_lock
- * requirement), its "step sideways" will always step beyond the bounds
- * of the shadow page's gfn range and stop iterating before yielding.
- */
- lockdep_assert_held_write(&kvm->mmu_lock);
- return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
- sp->gfn, end, false, false);
-}

+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:18:44

by Sean Christopherson

Subject: [PATCH v3 27/28] KVM: selftests: Define cpu_relax() helpers for s390 and x86

Add cpu_relax() for s390 and x86 for use in arch-agnostic tests. arm64
already defines its own version.
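
A usage illustration (hypothetical snippet, not part of the patch): an
arch-agnostic busy-wait in a selftest can now be written as:

/* Spin until another thread flips 'ready', yielding pipeline resources. */
while (!READ_ONCE(ready))
	cpu_relax();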

Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/include/s390x/processor.h | 8 ++++++++
tools/testing/selftests/kvm/include/x86_64/processor.h | 5 +++++
2 files changed, 13 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/s390x/processor.h b/tools/testing/selftests/kvm/include/s390x/processor.h
index e0e96a5f608c..255c9b990f4c 100644
--- a/tools/testing/selftests/kvm/include/s390x/processor.h
+++ b/tools/testing/selftests/kvm/include/s390x/processor.h
@@ -5,6 +5,8 @@
#ifndef SELFTEST_KVM_PROCESSOR_H
#define SELFTEST_KVM_PROCESSOR_H

+#include <linux/compiler.h>
+
/* Bits in the region/segment table entry */
#define REGION_ENTRY_ORIGIN ~0xfffUL /* region/segment table origin */
#define REGION_ENTRY_PROTECT 0x200 /* region protection bit */
@@ -19,4 +21,10 @@
#define PAGE_PROTECT 0x200 /* HW read-only bit */
#define PAGE_NOEXEC 0x100 /* HW no-execute bit */

+/* Is there a portable way to do this? */
+static inline void cpu_relax(void)
+{
+ barrier();
+}
+
#endif
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 8a470da7b71a..37db341d4cc5 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -363,6 +363,11 @@ static inline unsigned long get_xmm(int n)
return 0;
}

+static inline void cpu_relax(void)
+{
+ asm volatile("rep; nop" ::: "memory");
+}
+
bool is_intel_cpu(void);
bool is_amd_cpu(void);

--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:19:30

by Sean Christopherson

Subject: [PATCH v3 23/28] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE

Explicitly check for a REMOVED leaf SPTE prior to attempting to map
the final SPTE when handling a TDP MMU fault. Functionally, this is a
nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE.
Pre-checking for a REMOVED SPTE is a minor optimization, but the real goal
is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old"
SPTE is never a REMOVED SPTE.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4151e61245a7..1acd12bf309f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1250,7 +1250,11 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
}

- if (iter.level != fault->goal_level) {
+ /*
+ * Force the guest to retry the access if the upper level SPTEs aren't
+ * in place, or if the target leaf SPTE is frozen by another CPU.
+ */
+ if (iter.level != fault->goal_level || is_removed_spte(iter.old_spte)) {
rcu_read_unlock();
return RET_PF_RETRY;
}
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:19:40

by Sean Christopherson

Subject: [PATCH v3 09/28] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks

Drop RCU protection after processing each root when handling MMU notifier
hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily
drop RCU to let RCU do its thing between roots, and to make it clear that
there's no special behavior that relies on holding RCU across all roots.

Currently, the RCU protection is completely superficial, it's necessary
only to make rcu_dereference() of SPTE pointers happy. A future patch
will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
rather, acknowledge the need for a flush), but in that case RCU needs to
be held until the flush is complete if and only if the flush is needed
because a shadow page may have been removed. And except for the "unmap"
path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
and can't change the PFN for a SP).

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 634a2838e117..4f460782a848 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1100,18 +1100,18 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct tdp_iter iter;
bool ret = false;

- rcu_read_lock();
-
/*
* Don't support rescheduling, none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ rcu_read_lock();
+
tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
- }

- rcu_read_unlock();
+ rcu_read_unlock();
+ }

return ret;
}
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:28:20

by Sean Christopherson

Subject: [PATCH v3 03/28] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap

Fix misleading and arguably wrong comments in the TDP MMU's fast zap
flow. The comments, and the fact that actually zapping invalid roots was
added separately, strongly suggests that zapping invalid roots is an
optimization and not required for correctness. That is a lie.

KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
KVM is relying on it to fully remove all references to the memslot. Once
the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
stale references as the hva=>gfn translation is done via the memslots.
If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
deleting a memslot, KVM will fail to zap in response to the mmu_notifier
due to not finding a memslot corresponding to the notifier's range, which
leads to a variation of use-after-free.

The other misleading comment (and code) explicitly states that roots
without a reference should be skipped. While that's technically true,
it's also extremely misleading as it should be impossible for KVM to
encounter a defunct root on the list while holding mmu_lock for write.
Opportunistically add a WARN to enforce that invariant.

Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Fixes: 4c6654bd160d ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +++++++
arch/x86/kvm/mmu/tdp_mmu.c | 46 +++++++++++++++++++++-----------------
2 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..80607513a1f2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5662,6 +5662,14 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)

write_unlock(&kvm->mmu_lock);

+ /*
+ * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
+ * returning to the caller, e.g. if the zap is in response to a memslot
+ * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
+ * associated with the deleted memslot once the update completes, and
+ * Deferring the zap until the final reference to the root is put would
+ * lead to use-after-free.
+ */
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
kvm_tdp_mmu_zap_invalidated_roots(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9357780ec28f..12866113fb4f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -826,12 +826,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
}

/*
- * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
- * invalidated root, they will not be freed until this function drops the
- * reference. Before dropping that reference, tear down the paging
- * structure so that whichever thread does drop the last reference
- * only has to do a trivial amount of work. Since the roots are invalid,
- * no new SPTEs should be created under them.
+ * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
+ * zap" completes. Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
+ * reference to each invalidated root, roots will not be freed until after this
+ * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
+ * tearing down paging structures.
*/
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
@@ -855,21 +854,25 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
}

/*
- * Mark each TDP MMU root as invalid so that other threads
- * will drop their references and allow the root count to
- * go to 0.
+ * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
+ * is about to be zapped, e.g. in response to a memslots update. The caller is
+ * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to do the actual
+ * zapping.
*
- * Also take a reference on all roots so that this thread
- * can do the bulk of the work required to free the roots
- * once they are invalidated. Without this reference, a
- * vCPU thread might drop the last reference to a root and
- * get stuck with tearing down the entire paging structure.
+ * Take a reference on all roots to prevent the root from being freed before it
+ * is zapped by this thread. Freeing a root is not a correctness issue, but if
+ * a vCPU drops the last reference to a root prior to the root being zapped, it
+ * will get stuck with tearing down the entire paging structure.
*
- * Roots which have a zero refcount should be skipped as
- * they're already being torn down.
- * Already invalid roots should be referenced again so that
- * they aren't freed before kvm_tdp_mmu_zap_all_fast is
- * done with them.
+ * Get a reference even if the root is already invalid,
+ * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
+ * invalid roots, e.g. there's no epoch to identify roots that were invalidated
+ * by a previous call. Roots stay on the list until the last reference is
+ * dropped, so even though all invalid roots are zapped, a root may not go away
+ * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
+ *
+ * Because mmu_lock is held for write, it should be impossible to observe a
+ * root with zero refcount, i.e. the list of roots cannot be stale.
*
* This has essentially the same effect for the TDP MMU
* as updating mmu_valid_gen does for the shadow MMU.
@@ -879,9 +882,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
struct kvm_mmu_page *root;

lockdep_assert_held_write(&kvm->mmu_lock);
- list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
- if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
root->role.invalid = true;
+ }
}

/*
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:28:31

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 07/28] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal

Look for a !leaf=>leaf conversion instead of a PFN change when checking
if a SPTE change removed a TDP MMU shadow page. Convert the PFN check
into a WARN, as KVM should never change the PFN of a shadow page (except
when it's being zapped or replaced).

From a purely theoretical perspective, it's not illegal to replace a SP
with a hugepage pointing at the same PFN. In practice, it's impossible
as that would require mapping guest memory overtop a kernel-allocated SP.
Either way, the check is odd.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 189f21e71c36..848448b65703 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -505,9 +505,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,

/*
* Recursively handle child PTs if the change removed a subtree from
- * the paging structure.
+ * the paging structure. Note the WARN on the PFN changing without the
+ * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
+ * pages are kernel allocations and should never be migrated.
*/
- if (was_present && !was_leaf && (pfn_changed || !is_present))
+ if (was_present && !was_leaf &&
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
}

--
2.35.1.574.g5d30c73bfb-goog
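
As a reading aid, the new condition can be spelled out as a standalone
predicate. This is only an illustrative sketch, not part of the patch; the
helper name and boolean parameters are invented for clarity, the real logic
lives inline in __handle_changed_spte():

static bool child_pt_was_removed(bool was_present, bool was_leaf,
				 bool is_present, bool is_leaf,
				 bool pfn_changed)
{
	/* Only a present non-leaf SPTE can point at a child page table. */
	if (!was_present || was_leaf)
		return false;

	/*
	 * The child page table is gone if the SPTE was converted to a leaf
	 * (hugepage) or zapped.  A PFN change without either of those is a
	 * KVM bug, as shadow pages are kernel allocations and are never
	 * migrated, hence the WARN in the patch.
	 */
	return is_leaf || !is_present || WARN_ON_ONCE(pfn_changed);
}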

2022-02-26 02:32:24

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 08/28] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte

Batch TLB flushes (with other MMUs) when handling ->change_spte()
notifications in the TDP MMU. The MMU notifier path in question doesn't
allow yielding and correctly flushes before dropping mmu_lock.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 848448b65703..634a2838e117 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1203,13 +1203,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
*/
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
-
- /* FIXME: return 'flush' instead of flushing here. */
- if (flush)
- kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
-
- return false;
+ /*
+ * No need to handle the remote TLB flush under RCU protection, the
+ * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
+ * shadow page. See the WARN on pfn_changed in __handle_changed_spte().
+ */
+ return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
}

/*
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:34:38

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 10/28] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU

Add helpers to read and write TDP MMU SPTEs instead of open coding
rcu_dereference() all over the place, and to provide a convenient
location to document why KVM doesn't exempt holding mmu_lock for write
from having to hold RCU (and any future changes to the rules).

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_iter.c | 6 +++---
arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index be3f096db2eb..6d3b3e5a5533 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}

static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -89,7 +89,7 @@ static bool try_step_down(struct tdp_iter *iter)
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);

child_pt = spte_to_child_pt(iter->old_spte, iter->level);
if (!child_pt)
@@ -123,7 +123,7 @@ static bool try_step_side(struct tdp_iter *iter)
iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
iter->next_last_level_gfn = iter->gfn;
iter->sptep++;
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);

return true;
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 216ebbe76ddd..bb9b581f1ee4 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,6 +9,22 @@

typedef u64 __rcu *tdp_ptep_t;

+/*
+ * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
+ * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
+ * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
+ * the lower depths of the TDP MMU just to make lockdep happy is a nightmare, so
+ * all accesses to SPTEs are done under RCU protection.
+ */
+static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
+{
+ return READ_ONCE(*rcu_dereference(sptep));
+}
+static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+{
+ WRITE_ONCE(*rcu_dereference(sptep), val);
+}
+
/*
* A TDP iterator performs a pre-order walk over a TDP paging structure.
*/
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4f460782a848..8fbf3364f116 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -609,7 +609,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* here since the SPTE is going from non-present
* to non-present.
*/
- WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+ kvm_tdp_mmu_write_spte(iter->sptep, 0);

return 0;
}
@@ -648,7 +648,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
*/
WARN_ON(is_removed_spte(iter->old_spte));

- WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+ kvm_tdp_mmu_write_spte(iter->sptep, new_spte);

__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
new_spte, iter->level, false);
@@ -1046,7 +1046,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* because the new value informs the !present
* path below.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
}

if (!is_shadow_present_pte(iter.old_spte)) {
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:35:43

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

Allow yielding when zapping SPTEs after the last reference to a valid
root is put. Because KVM must drop all SPTEs in response to relevant
mmu_notifier events, mark defunct roots invalid and reset their refcount
prior to zapping the root. Keeping the refcount elevated while the zap
is in-progress ensures the root is reachable via mmu_notifier until the
zap completes and the last reference to the invalid, defunct root is put.

Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
root being put has a massive paging structure, e.g. zapping a root
that is backed entirely by 4kb pages for a guest with 32tb of memory can
take hundreds of seconds to complete.

watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
__handle_changed_spte+0x1b2/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
tdp_mmu_zap_root+0x307/0x4d0 [kvm]
kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
kvm_mmu_free_roots+0x22d/0x350 [kvm]
kvm_mmu_reset_context+0x20/0x60 [kvm]
kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
__se_sys_ioctl+0x77/0xc0
__x64_sys_ioctl+0x1d/0x20
do_syscall_64+0x44/0xa0
entry_SYSCALL_64_after_hwframe+0x44/0xae

KVM currently doesn't put a root from a non-preemptible context, so other
than the mmu_notifier wrinkle, yielding when putting a root is safe.

Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
take a reference to each root (it requires mmu_lock be held for the
entire duration of the walk).

tdp_mmu_next_root() is used only by the yield-friendly iterator.

kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.

kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
but is still yield-friendly in all call sites, as all callers can be
traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
kvm_create_vm().

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 122 ++++++++++++++++++++++++-------------
1 file changed, 81 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3031b42c27a6..b838cfa984ad 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -91,21 +91,66 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,

WARN_ON(!root->tdp_mmu_page);

- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- list_del_rcu(&root->link);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ /*
+ * Ensure root->role.invalid is read after the refcount reaches zero to
+ * avoid zapping the root multiple times, e.g. if a different task
+ * acquires a reference (after the root was marked invalid) and puts
+ * the last reference, all while holding mmu_lock for read. Pairs
+ * with the smp_mb__before_atomic() below.
+ */
+ smp_mb__after_atomic();
+
+ /*
+ * Free the root if it's already invalid. Invalid roots must be zapped
+ * before their last reference is put, i.e. there's no work to be done,
+ * and all roots must be invalidated (see below) before they're freed.
+ * Re-zapping invalid roots would put KVM into an infinite loop (again,
+ * see below).
+ */
+ if (root->role.invalid) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_del_rcu(&root->link);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+
+ call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ return;
+ }
+
+ /*
+ * Invalidate the root to prevent it from being reused by a vCPU, and
+ * so that KVM doesn't re-zap the root when its last reference is put
+ * again (see above).
+ */
+ root->role.invalid = true;
+
+ /*
+ * Ensure role.invalid is visible if a concurrent reader acquires a
+ * reference after the root's refcount is reset. Pairs with the
+ * smp_mb__after_atomic() above.
+ */
+ smp_mb__before_atomic();
+
+ /*
+ * Note, if mmu_lock is held for read this can race with other readers,
+ * e.g. they may acquire a reference without seeing the root as invalid,
+ * and the refcount may be reset after the root is skipped. Both races
+ * are benign, as flows that must visit all roots, e.g. need to zap
+ * SPTEs for correctness, must take mmu_lock for write to block page
+ * faults, and the only flow that must not consume an invalid root is
+ * allocating a new root for a vCPU, which also takes mmu_lock for write.
+ */
+ refcount_set(&root->tdp_mmu_root_count, 1);

/*
- * A TLB flush is not necessary as KVM performs a local TLB flush when
- * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
- * to a different pCPU. Note, the local TLB flush on reuse also
- * invalidates any paging-structure-cache entries, i.e. TLB entries for
- * intermediate paging structures, that may be zapped, as such entries
- * are associated with the ASID on both VMX and SVM.
+ * Zap the root, then put the refcount "acquired" above. Recursively
+ * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
+ * infinite loop by freeing invalid roots. By design, the root is
+ * reachable while it's being zapped, thus a different task can put its
+ * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
+ * defunct root is unavoidable.
*/
tdp_mmu_zap_root(kvm, root, shared);
-
- call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ kvm_tdp_mmu_put_root(kvm, root, shared);
}

enum tdp_mmu_roots_iter_type {
@@ -760,12 +805,23 @@ static inline gfn_t tdp_mmu_max_gfn_host(void)
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
- bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
struct tdp_iter iter;

gfn_t end = tdp_mmu_max_gfn_host();
gfn_t start = 0;

+ /*
+ * The root must have an elevated refcount so that it's reachable via
+ * mmu_notifier callbacks, which allows this path to yield and drop
+ * mmu_lock. When handling an unmap/release mmu_notifier command, KVM
+ * must drop all references to relevant pages prior to completing the
+ * callback. Dropping mmu_lock with an unreachable root would result
+ * in zapping SPTEs after a relevant mmu_notifier callback completes
+ * and lead to use-after-free as zapping a SPTE triggers "writeback" of
+ * dirty accessed bits to the SPTE's associated struct page.
+ */
+ WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
+
kvm_lockdep_assert_mmu_lock_held(kvm, shared);

rcu_read_lock();
@@ -776,42 +832,16 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
*/
for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
retry:
- /*
- * Yielding isn't allowed when zapping an unreachable root as
- * the root won't be processed by mmu_notifier callbacks. When
- * handling an unmap/release mmu_notifier command, KVM must
- * drop all references to relevant pages prior to completing
- * the callback. Dropping mmu_lock can result in zapping SPTEs
- * for an unreachable root after a relevant callback completes,
- * which leads to use-after-free as zapping a SPTE triggers
- * "writeback" of dirty/accessed bits to the SPTE's associated
- * struct page.
- */
- if (!root_is_unreachable &&
- tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;

if (!is_shadow_present_pte(iter.old_spte))
continue;

- if (!shared) {
+ if (!shared)
tdp_mmu_set_spte(kvm, &iter, 0);
- } else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
- /*
- * cmpxchg() shouldn't fail if the root is unreachable.
- * Retry so as not to leak the page and its children.
- */
- WARN_ONCE(root_is_unreachable,
- "Contended TDP MMU SPTE in unreachable root.");
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
goto retry;
- }
-
- /*
- * WARN if the root is invalid and is unreachable, all SPTEs
- * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
- * and inserting new SPTEs under an invalid root is a KVM bug.
- */
- WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
}

rcu_read_unlock();
@@ -906,6 +936,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
int i;

/*
+ * Zap all roots, including invalid roots, as all SPTEs must be dropped
+ * before returning to the caller.
+ *
* A TLB flush is unnecessary, KVM zaps everything if and only if the VM
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
@@ -931,6 +964,13 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)

for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
/*
+ * Zap the root regardless of what marked it invalid, e.g. even
+ * if the root was marked invalid by kvm_tdp_mmu_put_root() due
+ * to its last reference being put. All SPTEs must be dropped
+ * before returning to the caller, e.g. if a memslot is deleted
+ * or moved, the memslot's associated SPTEs are unreachable via
+ * the mmu_notifier once the memslot update completes.
+ *
* A TLB flush is unnecessary, invalidated roots are guaranteed
* to be unreachable by the guest (see kvm_tdp_mmu_put_root()
* for more details), and unlike the legacy MMU, no vCPU kick
--
2.35.1.574.g5d30c73bfb-goog
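
Condensed, the new flow in kvm_tdp_mmu_put_root() boils down to the
following. This is a simplified orientation sketch only; the memory barriers
and the pages_lock are elided and the surrounding code is paraphrased, see
the patch above for the real ordering requirements:

	if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
		return;

	if (root->role.invalid) {
		/* Second put: the root was already zapped, just free it. */
		list_del_rcu(&root->link);
		call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
		return;
	}

	/*
	 * First put: mark the root invalid, re-elevate the refcount so the
	 * root stays reachable (and thus yield-friendly) while it's zapped,
	 * then put the root again, which lands in the branch above.
	 */
	root->role.invalid = true;
	refcount_set(&root->tdp_mmu_root_count, 1);
	tdp_mmu_zap_root(kvm, root, shared);
	kvm_tdp_mmu_put_root(kvm, root, shared);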

2022-02-26 02:37:04

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 14/28] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU

Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
uses the slow version of "zap everything" is when the VM is being
destroyed or the owning mm has exited. In either case, KVM_RUN is
unreachable for the VM, i.e. the guest TLB entries cannot be consumed.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c231b60e1726..87706e9cc6f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -874,14 +874,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,

void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
- bool flush = false;
int i;

+ /*
+ * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
+ * is being destroyed or the userspace VMM has exited. In both cases,
+ * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
+ */
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
+ (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
}

/*
--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:39:44

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 11/28] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path

WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
which is called out by the comment as being disallowed but not actually
checked. Keep the WARN on the old_spte as well, because overwriting a
REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by the
lack of splats with the existing WARN).

Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8fbf3364f116..1dcdf1a4fcc1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -640,13 +640,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
lockdep_assert_held_write(&kvm->mmu_lock);

/*
- * No thread should be using this function to set SPTEs to the
+ * No thread should be using this function to set SPTEs to or from the
* temporary removed SPTE value.
* If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
* should be used. If operating under the MMU lock in write mode, the
* use of the removed SPTE should not be necessary.
*/
- WARN_ON(is_removed_spte(iter->old_spte));
+ WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));

kvm_tdp_mmu_write_spte(iter->sptep, new_spte);

--
2.35.1.574.g5d30c73bfb-goog

2022-02-26 02:41:10

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 01/28] KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots

Now that tdp_mmu_next_root() can process both valid and invalid roots,
extend it to be able to process _only_ invalid roots, add yet another
iterator macro for walking invalid roots, and use the new macro in
kvm_tdp_mmu_zap_invalidated_roots().

No functional change intended.

Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------
1 file changed, 26 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index debf08212f12..25148e8b711d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

+enum tdp_mmu_roots_iter_type {
+ ALL_ROOTS = -1,
+ VALID_ROOTS = 0,
+ INVALID_ROOTS = 1,
+};
+
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
* NULL). A reference to the returned root is acquired, and the reference to
@@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
- bool shared, bool only_valid)
+ bool shared,
+ enum tdp_mmu_roots_iter_type type)
{
struct kvm_mmu_page *next_root;

+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+ /* Ensure correctness for the below comparison against role.invalid. */
+ BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS);
+
rcu_read_lock();

if (prev_root)
@@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);

while (next_root) {
- if ((!only_valid || !next_root->role.invalid) &&
+ if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&
kvm_tdp_mmu_get_root(next_root))
break;

@@ -151,18 +163,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* mode. In the unlikely event that this thread must free a root, the lock
* will be temporarily dropped and reacquired in write mode.
*/
-#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\
- for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _type) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _type); \
_root; \
- _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+ _root = tdp_mmu_next_root(_kvm, _root, _shared, _type)) \
+ if (_as_id > 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else

+#define for_each_invalid_tdp_mmu_root_yield_safe(_kvm, _root) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, -1, true, INVALID_ROOTS)
+
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, VALID_ROOTS)

#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)

#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
@@ -810,28 +825,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
kvm_flush_remote_tlbs(kvm);
}

-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
- struct kvm_mmu_page *prev_root)
-{
- struct kvm_mmu_page *next_root;
-
- if (prev_root)
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &prev_root->link,
- typeof(*prev_root), link);
- else
- next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- typeof(*next_root), link);
-
- while (next_root && !(next_root->role.invalid &&
- refcount_read(&next_root->tdp_mmu_root_count)))
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &next_root->link,
- typeof(*next_root), link);
-
- return next_root;
-}
-
/*
* Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
* invalidated root, they will not be freed until this function drops the
@@ -842,36 +835,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
*/
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
- struct kvm_mmu_page *next_root;
struct kvm_mmu_page *root;
bool flush = false;

lockdep_assert_held_read(&kvm->mmu_lock);

- rcu_read_lock();
-
- root = next_invalidated_root(kvm, NULL);
-
- while (root) {
- next_root = next_invalidated_root(kvm, root);
-
- rcu_read_unlock();
-
+ for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);

/*
- * Put the reference acquired in
- * kvm_tdp_mmu_invalidate_roots
+ * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
+ * Note, the iterator holds its own reference.
*/
kvm_tdp_mmu_put_root(kvm, root, true);
-
- root = next_root;
-
- rcu_read_lock();
}

- rcu_read_unlock();
-
if (flush)
kvm_flush_remote_tlbs(kvm);
}
--
2.35.1.574.g5d30c73bfb-goog
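
The trick that makes the filtering work is that the enum values line up with
role.invalid, which the BUILD_BUG_ON() above enforces. A minimal sketch of
the check in tdp_mmu_next_root(), with a helper name invented purely for
illustration:

enum tdp_mmu_roots_iter_type {
	ALL_ROOTS = -1,
	VALID_ROOTS = 0,
	INVALID_ROOTS = 1,
};

/*
 * VALID_ROOTS == 0 == !!false and INVALID_ROOTS == 1 == !!true, so
 * "type == !!invalid" selects exactly the requested subset.  ALL_ROOTS is
 * -1 and can never equal !!invalid, i.e. it matches every root via the
 * first half of the OR.
 */
static bool root_matches_iter_type(bool invalid,
				   enum tdp_mmu_roots_iter_type type)
{
	return type == ALL_ROOTS || type == !!invalid;
}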

2022-02-26 02:42:12

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v3 28/28] KVM: selftests: Add test to populate a VM with the max possible guest mem

Add a selftest that enables populating a VM with the maximum amount of
guest memory allowed by the underlying architecture. Abuse KVM's
memslots by mapping a single host memory region into multiple memslots so
that the selftest doesn't require a system with terabytes of RAM.

Default to 512gb of guest memory, which isn't all that interesting, but
should work on all MMUs and doesn't take an exorbitant amount of memory
or time. E.g. testing with ~64tb of guest memory takes the better part
of an hour, and requires 200gb of memory for KVM's page tables when using
4kb pages.

To inflict maximum abuse on KVM's MMU, default to 4kb pages (or whatever
the not-hugepage size is) in the backing store (memfd). Use memfd for
the host backing store to ensure that hugepages are guaranteed when
requested, and to give the user explicit control of the size of hugepage
being tested.

By default, spin up as many vCPUs as there are available to the selftest,
and distribute the work of dirtying each 4kb chunk of memory across all
vCPUs. Dirtying guest memory forces KVM to populate its page tables, and
also forces KVM to write back accessed/dirty information to struct page
when the guest memory is freed.

On x86, perform two passes with a MMU context reset between each pass to
coerce KVM into dropping all references to the MMU root, e.g. to emulate
a vCPU dropping the last reference. Perform both passes and all
rendezvous on all architectures in the hope that arm64 and s390x can gain
similar shenanigans in the future.

Measure and report the duration of each operation, which is helpful not
only to verify the test is working as intended, but also to easily
evaluate the performance differences between different page sizes.

Provide command line options to limit the amount of guest memory, set the
size of each slot (i.e. of the host memory region), set the number of
vCPUs, and to enable usage of hugepages.

Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../selftests/kvm/max_guest_memory_test.c | 292 ++++++++++++++++++
3 files changed, 296 insertions(+)
create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 052ddfe4b23a..9b67343dc4ab 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -58,6 +58,7 @@
/hardware_disable_test
/kvm_create_max_vcpus
/kvm_page_table_test
+/max_guest_memory_test
/memslot_modification_stress_test
/memslot_perf_test
/rseq_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index f7fa5655e535..c06b1f8bc649 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -93,6 +93,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
TEST_GEN_PROGS_x86_64 += hardware_disable_test
TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
TEST_GEN_PROGS_x86_64 += kvm_page_table_test
+TEST_GEN_PROGS_x86_64 += max_guest_memory_test
TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
TEST_GEN_PROGS_x86_64 += memslot_perf_test
TEST_GEN_PROGS_x86_64 += rseq_test
@@ -112,6 +113,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
TEST_GEN_PROGS_aarch64 += kvm_page_table_test
+TEST_GEN_PROGS_aarch64 += max_guest_memory_test
TEST_GEN_PROGS_aarch64 += memslot_modification_stress_test
TEST_GEN_PROGS_aarch64 += memslot_perf_test
TEST_GEN_PROGS_aarch64 += rseq_test
@@ -127,6 +129,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
TEST_GEN_PROGS_s390x += dirty_log_test
TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
TEST_GEN_PROGS_s390x += kvm_page_table_test
+TEST_GEN_PROGS_s390x += max_guest_memory_test
TEST_GEN_PROGS_s390x += rseq_test
TEST_GEN_PROGS_s390x += set_memory_region_test
TEST_GEN_PROGS_s390x += kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/max_guest_memory_test.c
new file mode 100644
index 000000000000..360c88288295
--- /dev/null
+++ b/tools/testing/selftests/kvm/max_guest_memory_test.c
@@ -0,0 +1,292 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/atomic.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "guest_modes.h"
+#include "processor.h"
+
+static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
+{
+ uint64_t gpa;
+
+ for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+ *((volatile uint64_t *)gpa) = gpa;
+
+ GUEST_DONE();
+}
+
+struct vcpu_info {
+ struct kvm_vm *vm;
+ uint32_t id;
+ uint64_t start_gpa;
+ uint64_t end_gpa;
+};
+
+static int nr_vcpus;
+static atomic_t rendezvous;
+
+static void rendezvous_with_boss(void)
+{
+ int orig = atomic_read(&rendezvous);
+
+ if (orig > 0) {
+ atomic_dec_and_test(&rendezvous);
+ while (atomic_read(&rendezvous) > 0)
+ cpu_relax();
+ } else {
+ atomic_inc(&rendezvous);
+ while (atomic_read(&rendezvous) < 0)
+ cpu_relax();
+ }
+}
+
+static void run_vcpu(struct kvm_vm *vm, uint32_t vcpu_id)
+{
+ vcpu_run(vm, vcpu_id);
+ ASSERT_EQ(get_ucall(vm, vcpu_id, NULL), UCALL_DONE);
+}
+
+static void *vcpu_worker(void *data)
+{
+ struct vcpu_info *vcpu = data;
+ struct kvm_vm *vm = vcpu->vm;
+ struct kvm_sregs sregs;
+ struct kvm_regs regs;
+
+ vcpu_args_set(vm, vcpu->id, 3, vcpu->start_gpa, vcpu->end_gpa,
+ vm_get_page_size(vm));
+
+ /* Snapshot regs before the first run. */
+ vcpu_regs_get(vm, vcpu->id, &regs);
+ rendezvous_with_boss();
+
+ run_vcpu(vm, vcpu->id);
+ rendezvous_with_boss();
+ vcpu_regs_set(vm, vcpu->id, &regs);
+ vcpu_sregs_get(vm, vcpu->id, &sregs);
+#ifdef __x86_64__
+ /* Toggle CR0.WP to trigger a MMU context reset. */
+ sregs.cr0 ^= X86_CR0_WP;
+#endif
+ vcpu_sregs_set(vm, vcpu->id, &sregs);
+ rendezvous_with_boss();
+
+ run_vcpu(vm, vcpu->id);
+ rendezvous_with_boss();
+
+ return NULL;
+}
+
+static pthread_t *spawn_workers(struct kvm_vm *vm, uint64_t start_gpa,
+ uint64_t end_gpa)
+{
+ struct vcpu_info *info;
+ uint64_t gpa, nr_bytes;
+ pthread_t *threads;
+ int i;
+
+ threads = malloc(nr_vcpus * sizeof(*threads));
+ TEST_ASSERT(threads, "Failed to allocate vCPU threads");
+
+ info = malloc(nr_vcpus * sizeof(*info));
+ TEST_ASSERT(info, "Failed to allocate vCPU gpa ranges");
+
+ nr_bytes = ((end_gpa - start_gpa) / nr_vcpus) &
+ ~((uint64_t)vm_get_page_size(vm) - 1);
+ TEST_ASSERT(nr_bytes, "C'mon, no way you have %d CPUs", nr_vcpus);
+
+ for (i = 0, gpa = start_gpa; i < nr_vcpus; i++, gpa += nr_bytes) {
+ info[i].vm = vm;
+ info[i].id = i;
+ info[i].start_gpa = gpa;
+ info[i].end_gpa = gpa + nr_bytes;
+ pthread_create(&threads[i], NULL, vcpu_worker, &info[i]);
+ }
+ return threads;
+}
+
+static void rendezvous_with_vcpus(struct timespec *time, const char *name)
+{
+ int i, rendezvoused;
+
+ pr_info("Waiting for vCPUs to finish %s...\n", name);
+
+ rendezvoused = atomic_read(&rendezvous);
+ for (i = 0; abs(rendezvoused) != 1; i++) {
+ usleep(100);
+ if (!(i & 0x3f))
+ pr_info("\r%d vCPUs haven't rendezvoused...",
+ abs(rendezvoused) - 1);
+ rendezvoused = atomic_read(&rendezvous);
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, time);
+
+ /* Release the vCPUs after getting the time of the previous action. */
+ pr_info("\rAll vCPUs finished %s, releasing...\n", name);
+ if (rendezvoused > 0)
+ atomic_set(&rendezvous, -nr_vcpus - 1);
+ else
+ atomic_set(&rendezvous, nr_vcpus + 1);
+}
+
+static void calc_default_nr_vcpus(void)
+{
+ cpu_set_t possible_mask;
+ int r;
+
+ r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+ TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)",
+ errno, strerror(errno));
+
+ nr_vcpus = CPU_COUNT(&possible_mask);
+ TEST_ASSERT(nr_vcpus > 0, "Uh, no CPUs?");
+}
+
+int main(int argc, char *argv[])
+{
+ /*
+ * Skip the first 4gb and slot0. slot0 maps <1gb and is used to back
+ * the guest's code, stack, and page tables. Because selftests creates
+ * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
+ * just below the 4gb boundary. This test could create memory at
+ * 1gb-3gb, but it's simpler to skip straight to 4gb.
+ */
+ const uint64_t size_1gb = (1 << 30);
+ const uint64_t start_gpa = (4ull * size_1gb);
+ const int first_slot = 1;
+
+ struct timespec time_start, time_run1, time_reset, time_run2;
+ uint64_t max_gpa, gpa, slot_size, max_mem, i;
+ int max_slots, slot, opt, fd;
+ bool hugepages = false;
+ pthread_t *threads;
+ struct kvm_vm *vm;
+ void *mem;
+
+ /*
+ * Default to 2gb so that maxing out systems with MAXPHYADDR=46, which
+ * are quite common for x86, requires changing only max_mem (KVM allows
+ * 32k memslots, 32k * 2gb == ~64tb of guest memory).
+ */
+ slot_size = 2 * size_1gb;
+
+ max_slots = kvm_check_cap(KVM_CAP_NR_MEMSLOTS);
+ TEST_ASSERT(max_slots > first_slot, "KVM is broken");
+
+ /* All KVM MMUs should be able to survive a 512gb guest. */
+ max_mem = 512 * size_1gb;
+
+ calc_default_nr_vcpus();
+
+ while ((opt = getopt(argc, argv, "c:h:m:s:u")) != -1) {
+ switch (opt) {
+ case 'c':
+ nr_vcpus = atoi(optarg);
+ TEST_ASSERT(nr_vcpus, "#DE");
+ break;
+ case 'm':
+ max_mem = atoi(optarg) * size_1gb;
+ TEST_ASSERT(max_mem, "#DE");
+ break;
+ case 's':
+ slot_size = atoi(optarg) * size_1gb;
+ TEST_ASSERT(slot_size, "#DE");
+ break;
+ case 'u':
+ hugepages = true;
+ break;
+ case 'h':
+ default:
+ printf("usage: %s [-c nr_vcpus] [-m max_mem_in_gb] [-s slot_size_in_gb] [-u [huge_page_size]]\n", argv[0]);
+ exit(1);
+ }
+ }
+
+ vm = vm_create_default_with_vcpus(nr_vcpus, 0, 0, guest_code, NULL);
+
+ max_gpa = vm_get_max_gfn(vm) << vm_get_page_shift(vm);
+ TEST_ASSERT(max_gpa > (4 * slot_size), "MAXPHYADDR <4gb ");
+
+ fd = kvm_memfd_alloc(slot_size, hugepages);
+ mem = mmap(NULL, slot_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() failed");
+
+ TEST_ASSERT(!madvise(mem, slot_size, MADV_NOHUGEPAGE), "madvise() failed");
+
+ /* Pre-fault the memory to avoid taking mmap_sem on guest page faults. */
+ for (i = 0; i < slot_size; i += vm_get_page_size(vm))
+ ((uint8_t *)mem)[i] = 0xaa;
+
+ gpa = 0;
+ for (slot = first_slot; slot < max_slots; slot++) {
+ gpa = start_gpa + ((slot - first_slot) * slot_size);
+ if (gpa + slot_size > max_gpa)
+ break;
+
+ if ((gpa - start_gpa) >= max_mem)
+ break;
+
+ vm_set_user_memory_region(vm, slot, 0, gpa, slot_size, mem);
+
+#ifdef __x86_64__
+ /* Identity map memory in the guest using 1gb pages. */
+ for (i = 0; i < slot_size; i += size_1gb)
+ __virt_pg_map(vm, gpa + i, gpa + i, X86_PAGE_SIZE_1G);
+#else
+ for (i = 0; i < slot_size; i += vm_get_page_size(vm))
+ virt_pg_map(vm, gpa + i, gpa + i);
+#endif
+ }
+
+ atomic_set(&rendezvous, nr_vcpus + 1);
+ threads = spawn_workers(vm, start_gpa, gpa);
+
+ pr_info("Running with %lugb of guest memory and %u vCPUs\n",
+ (gpa - start_gpa) / size_1gb, nr_vcpus);
+
+ rendezvous_with_vcpus(&time_start, "spawning");
+ rendezvous_with_vcpus(&time_run1, "run 1");
+ rendezvous_with_vcpus(&time_reset, "reset");
+ rendezvous_with_vcpus(&time_run2, "run 2");
+
+ time_run2 = timespec_sub(time_run2, time_reset);
+ time_reset = timespec_sub(time_reset, time_run1);
+ time_run1 = timespec_sub(time_run1, time_start);
+
+ pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 = %ld.%.9lds\n",
+ time_run1.tv_sec, time_run1.tv_nsec,
+ time_reset.tv_sec, time_reset.tv_nsec,
+ time_run2.tv_sec, time_run2.tv_nsec);
+
+ /*
+ * Delete even numbered slots (arbitrary) and unmap the first half of
+ * the backing (also arbitrary) to verify KVM correctly drops all
+ * references to the removed regions.
+ */
+ for (slot = (slot - 1) & ~1ull; slot >= first_slot; slot -= 2)
+ vm_set_user_memory_region(vm, slot, 0, 0, 0, NULL);
+
+ munmap(mem, slot_size / 2);
+
+ /* Sanity check that the vCPUs actually ran. */
+ for (i = 0; i < nr_vcpus; i++)
+ pthread_join(threads[i], NULL);
+
+ /*
+ * Deliberately exit without deleting the remaining memslots or closing
+ * kvm_fd to test cleanup via mmu_notifier.release.
+ */
+}
--
2.35.1.574.g5d30c73bfb-goog

2022-02-28 23:28:15

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 03/28] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Fix misleading and arguably wrong comments in the TDP MMU's fast zap
> flow. The comments, and the fact that actually zapping invalid roots was
> added separately, strongly suggests that zapping invalid roots is an
> optimization and not required for correctness. That is a lie.
>
> KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
> because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
> KVM is relying on it to fully remove all references to the memslot. Once
> the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
> stale references as the hva=>gfn translation is done via the memslots.
> If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
> deleting a memslot, KVM will fail to zap in response to the mmu_notifier
> due to not finding a memslot corresponding to the notifier's range, which
> leads to a variation of use-after-free.
>
> The other misleading comment (and code) explicitly states that roots
> without a reference should be skipped. While that's technically true,
> it's also extremely misleading as it should be impossible for KVM to
> encounter a defunct root on the list while holding mmu_lock for write.
> Opportunistically add a WARN to enforce that invariant.
>
> Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
> Fixes: 4c6654bd160d ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
> Signed-off-by: Sean Christopherson <[email protected]>

A couple nits about missing words, but otherwise looks good.

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu.c | 8 +++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 46 +++++++++++++++++++++-----------------
> 2 files changed, 33 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b2c1c4eb6007..80607513a1f2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5662,6 +5662,14 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>
> write_unlock(&kvm->mmu_lock);
>
> + /*
> + * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
> + * returning to the caller, e.g. if the zap is in response to a memslot
> + * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> + * associated with the deleted memslot once the update completes, and
> + * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> + * lead to use-after-free.
> + */
> if (is_tdp_mmu_enabled(kvm)) {
> read_lock(&kvm->mmu_lock);
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9357780ec28f..12866113fb4f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -826,12 +826,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> }
>
> /*
> - * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
> - * invalidated root, they will not be freed until this function drops the
> - * reference. Before dropping that reference, tear down the paging
> - * structure so that whichever thread does drop the last reference
> - * only has to do a trivial amount of work. Since the roots are invalid,
> - * no new SPTEs should be created under them.
> + * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
> + * zap" completes. Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
> + * reference to each invalidated root, roots will not be freed until after this
> + * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
> + * tearing paging structures.

Nit: tearing down paging structures

> */
> void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> {
> @@ -855,21 +854,25 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> }
>
> /*
> - * Mark each TDP MMU root as invalid so that other threads
> - * will drop their references and allow the root count to
> - * go to 0.
> + * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
> + * is about to be zapped, e.g. in response to a memslots update. The caller is
> + * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to the actual

Nit: to do

> + * zapping.
> *
> - * Also take a reference on all roots so that this thread
> - * can do the bulk of the work required to free the roots
> - * once they are invalidated. Without this reference, a
> - * vCPU thread might drop the last reference to a root and
> - * get stuck with tearing down the entire paging structure.
> + * Take a reference on all roots to prevent the root from being freed before it
> + * is zapped by this thread. Freeing a root is not a correctness issue, but if
> + * a vCPU drops the last reference to a root prior to the root being zapped, it
> + * will get stuck with tearing down the entire paging structure.
> *
> - * Roots which have a zero refcount should be skipped as
> - * they're already being torn down.
> - * Already invalid roots should be referenced again so that
> - * they aren't freed before kvm_tdp_mmu_zap_all_fast is
> - * done with them.
> + * Get a reference even if the root is already invalid,
> + * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
> + * invalid roots, e.g. there's no epoch to identify roots that were invalidated
> + * by a previous call. Roots stay on the list until the last reference is
> + * dropped, so even though all invalid roots are zapped, a root may not go away
> + * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
> + *
> + * Because mmu_lock is held for write, it should be impossible to observe a
> + * root with zero refcount, i.e. the list of roots cannot be stale.
> *
> * This has essentially the same effect for the TDP MMU
> * as updating mmu_valid_gen does for the shadow MMU.
> @@ -879,9 +882,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> struct kvm_mmu_page *root;
>
> lockdep_assert_held_write(&kvm->mmu_lock);
> - list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
> - if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
> + list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> + if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
> root->role.invalid = true;
> + }
> }
>
> /*
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-02-28 23:45:39

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 06/28] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Assert that mmu_lock is held for write by users of the yield-unfriendly
> TDP iterator. The nature of a shared walk means that the caller needs to
> play nice with other tasks modifying the page tables, which is more or
> less the same thing as playing nice with yielding. Theoretically, KVM
> could gain a flow where it could legitimately take mmu_lock for read in
> a non-preemptible context, but that's highly unlikely and any such case
> should be viewed with a fair amount of scrutiny.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5994db5d5226..189f21e71c36 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -29,13 +29,16 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> return true;
> }
>
> -static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
> +/* Arbitrarily returns true so that this may be used in if statements. */
> +static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
> bool shared)
> {
> if (shared)
> lockdep_assert_held_read(&kvm->mmu_lock);
> else
> lockdep_assert_held_write(&kvm->mmu_lock);
> +
> + return true;
> }
>
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> @@ -187,11 +190,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
> __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)
>
> -#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> - list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
> - lockdep_is_held_type(&kvm->mmu_lock, 0) || \
> - lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \
> - if (kvm_mmu_page_as_id(_root) != _as_id) { \
> +/*
> + * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
> + * the implication being that any flow that holds mmu_lock for read is
> + * inherently yield-friendly and should use the yield-safe variant above.
> + * Holding mmu_lock for write obviates the need for RCU protection as the list
> + * is guaranteed to be stable.
> + */
> +#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
> + if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
> + kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-01 00:46:38

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 07/28] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Look for a !leaf=>leaf conversion instead of a PFN change when checking
> if a SPTE change removed a TDP MMU shadow page. Convert the PFN check
> into a WARN, as KVM should never change the PFN of a shadow page (except
> when it's being zapped or replaced).
>
> From a purely theoretical perspective, it's not illegal to replace a SP
> with a hugepage pointing at the same PFN. In practice, it's impossible
> as that would require mapping guest memory overtop a kernel-allocated SP.
> Either way, the check is odd.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 189f21e71c36..848448b65703 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -505,9 +505,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>
> /*
> * Recursively handle child PTs if the change removed a subtree from
> - * the paging structure.
> + * the paging structure. Note the WARN on the PFN changing without the
> + * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> + * pages are kernel allocations and should never be migrated.
> */
> - if (was_present && !was_leaf && (pfn_changed || !is_present))
> + if (was_present && !was_leaf &&
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> }
>
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-01 01:13:08

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 26/28] KVM: selftests: Split out helper to allocate guest mem via memfd

On Sat, 2022-02-26 at 00:15 +0000, Sean Christopherson wrote:
> Extract the code for allocating guest memory via memfd out of
> vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
> A future selftest to populate a guest with the maximum amount of guest
> memory will abuse KVM's memslots to alias guest memory regions to a
> single memfd-backed host region, i.e. needs to back a guest with memfd
> memory without a 1:1 association between a memslot and a memfd instance.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>

While we're at it, please can we make the whole thing go away and just
return failure #ifndef MFD_CLOEXEC, instead of breaking the build on
older userspace?


Attachments:
smime.p7s (5.83 kB)
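
For concreteness, one shape the suggestion could take; this is purely a
hypothetical sketch, not the actual helper (the signature mirrors the
kvm_memfd_alloc() call in the selftest, but the body, the failure
convention, and the headers are assumptions):

/*
 * Hypothetical sketch: report failure at runtime when the userspace headers
 * are too old to define MFD_CLOEXEC, instead of failing the build.  Assumes
 * <sys/mman.h> (or <linux/memfd.h>) for memfd_create() and the MFD_* flags.
 */
int kvm_memfd_alloc(size_t size, bool hugepages)
{
#ifndef MFD_CLOEXEC
	pr_info("memfd_create() not supported by these headers, skipping\n");
	return -1;
#else
	unsigned int flags = MFD_CLOEXEC;
	int fd, r;

	if (hugepages)
		flags |= MFD_HUGETLB;

	fd = memfd_create("kvm_selftest", flags);
	TEST_ASSERT(fd != -1, "memfd_create() failed, errno = %d (%s)",
		    errno, strerror(errno));

	r = ftruncate(fd, size);
	TEST_ASSERT(!r, "ftruncate() failed, errno = %d (%s)",
		    errno, strerror(errno));

	return fd;
#endif
}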

2022-03-01 01:20:54

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 05/28] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Remove the misleading flush "handling" when zapping invalidated TDP MMU
> roots, and document that flushing is unnecessary for all flavors of MMUs
> when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU
> is dead code, as zap_gfn_range() is called with shared=true, in which
> case it will never return true due to the flushing being handled by
> tdp_mmu_zap_spte_atomic().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu.c | 10 +++++++---
> arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
> 2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5a931c89d27b..1c4b84e80841 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5615,9 +5615,13 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
> }
>
> /*
> - * Trigger a remote TLB flush before freeing the page tables to ensure
> - * KVM is not in the middle of a lockless shadow page table walk, which
> - * may reference the pages.
> + * Kick all vCPUs (via remote TLB flush) before freeing the page tables
> + * to ensure KVM is not in the middle of a lockless shadow page table
> + * walk, which may reference the pages. The remote TLB flush itself is
> + * not required and is simply a convenient way to kick vCPUs as needed.
> + * KVM performs a local TLB flush when allocating a new root (see
> + * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are
> + * running with an obsolete MMU.
> */
> kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e35bd88d92fd..5994db5d5226 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -843,12 +843,20 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> {
> struct kvm_mmu_page *root;
> - bool flush = false;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
> - flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
> + /*
> + * A TLB flush is unnecessary, invalidated roots are guaranteed
> + * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
> + * for more details), and unlike the legacy MMU, no vCPU kick
> + * is needed to play nice with lockless shadow walks as the TDP
> + * MMU protects its paging structures via RCU. Note, zapping
> + * will still flush on yield, but that's a minor performance
> + * blip and not a functional issue.
> + */
> + (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
>
> /*
> * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
> @@ -856,9 +864,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> */
> kvm_tdp_mmu_put_root(kvm, root, true);
> }
> -
> - if (flush)
> - kvm_flush_remote_tlbs(kvm);
> }
>
> /*
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-01 07:26:43

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 14/28] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
> uses the slow version of "zap everything" is when the VM is being
> destroyed or the owning mm has exited. In either case, KVM_RUN is
> unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c231b60e1726..87706e9cc6f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -874,14 +874,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>
> void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> {
> - bool flush = false;
> int i;
>
> + /*
> + * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
> + * is being destroyed or the userspace VMM has exited. In both cases,
> + * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> + */
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> - flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
> -
> - if (flush)
> - kvm_flush_remote_tlbs(kvm);
> + (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
> }
>
> /*
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-01 19:12:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On 2/26/22 01:15, Sean Christopherson wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3031b42c27a6..b838cfa984ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -91,21 +91,66 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>
> WARN_ON(!root->tdp_mmu_page);
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_del_rcu(&root->link);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + /*
> + * Ensure root->role.invalid is read after the refcount reaches zero to
> + * avoid zapping the root multiple times, e.g. if a different task
> + * acquires a reference (after the root was marked invalid) and puts
> + * the last reference, all while holding mmu_lock for read. Pairs
> + * with the smp_mb__before_atomic() below.
> + */
> + smp_mb__after_atomic();
> +
> + /*
> + * Free the root if it's already invalid. Invalid roots must be zapped
> + * before their last reference is put, i.e. there's no work to be done,
> + * and all roots must be invalidated (see below) before they're freed.
> + * Re-zapping invalid roots would put KVM into an infinite loop (again,
> + * see below).
> + */
> + if (root->role.invalid) {
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + list_del_rcu(&root->link);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +
> + call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + return;
> + }
> +
> + /*
> + * Invalidate the root to prevent it from being reused by a vCPU, and
> + * so that KVM doesn't re-zap the root when its last reference is put
> + * again (see above).
> + */
> + root->role.invalid = true;
> +
> + /*
> + * Ensure role.invalid is visible if a concurrent reader acquires a
> + * reference after the root's refcount is reset. Pairs with the
> + * smp_mb__after_atomic() above.
> + */
> + smp_mb__before_atomic();

I have reviewed the series and I only have very minor comments... but
this part is beyond me. The lavish comments don't explain what is an
optimization and what is a requirement, and after spending quite some
time I wonder if all this should just be

if (refcount_dec_not_one(&root->tdp_mmu_root_count))
return;

if (!xchg(&root->role.invalid, true) {
tdp_mmu_zap_root(kvm, root, shared);

/*
* Do not assume the refcount is still 1: because
* tdp_mmu_zap_root can yield, a different task
* might have grabbed a reference to this root.
*
if (refcount_dec_not_one(&root->tdp_mmu_root_count))
return;
}

/*
* The root is invalid, and its reference count has reached
* zero. It must have been zapped either in the "if" above or
* by someone else, and we're definitely the last thread to see
* it apart from RCU-protected page table walks.
*/
refcount_set(&root->tdp_mmu_root_count, 0);

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);

(Yay for xchg's implicit memory barriers)

Paolo

> + /*
> + * Note, if mmu_lock is held for read this can race with other readers,
> + * e.g. they may acquire a reference without seeing the root as invalid,
> + * and the refcount may be reset after the root is skipped. Both races
> + * are benign, as flows that must visit all roots, e.g. need to zap
> + * SPTEs for correctness, must take mmu_lock for write to block page
> + * faults, and the only flow that must not consume an invalid root is
> + * allocating a new root for a vCPU, which also takes mmu_lock for write.
> + */
> + refcount_set(&root->tdp_mmu_root_count, 1);
>
> /*
> - * A TLB flush is not necessary as KVM performs a local TLB flush when
> - * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> - * to a different pCPU. Note, the local TLB flush on reuse also
> - * invalidates any paging-structure-cache entries, i.e. TLB entries for
> - * intermediate paging structures, that may be zapped, as such entries
> - * are associated with the ASID on both VMX and SVM.
> + * Zap the root, then put the refcount "acquired" above. Recursively
> + * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> + * infinite loop by freeing invalid roots. By design, the root is
> + * reachable while it's being zapped, thus a different task can put its
> + * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> + * defunct root is unavoidable.
> */
> tdp_mmu_zap_root(kvm, root, shared);
> -
> - call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + kvm_tdp_mmu_put_root(kvm, root, shared);
> }
>
> enum tdp_mmu_roots_iter_type {

2022-03-01 19:59:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Tue, Mar 01, 2022, Paolo Bonzini wrote:
> On 2/26/22 01:15, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 3031b42c27a6..b838cfa984ad 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -91,21 +91,66 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > WARN_ON(!root->tdp_mmu_page);
> > - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > - list_del_rcu(&root->link);
> > - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > + /*
> > + * Ensure root->role.invalid is read after the refcount reaches zero to
> > + * avoid zapping the root multiple times, e.g. if a different task
> > + * acquires a reference (after the root was marked invalid) and puts
> > + * the last reference, all while holding mmu_lock for read. Pairs
> > + * with the smp_mb__before_atomic() below.
> > + */
> > + smp_mb__after_atomic();
> > +
> > + /*
> > + * Free the root if it's already invalid. Invalid roots must be zapped
> > + * before their last reference is put, i.e. there's no work to be done,
> > + * and all roots must be invalidated (see below) before they're freed.
> > + * Re-zapping invalid roots would put KVM into an infinite loop (again,
> > + * see below).
> > + */
> > + if (root->role.invalid) {
> > + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > + list_del_rcu(&root->link);
> > + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > +
> > + call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > + return;
> > + }
> > +
> > + /*
> > + * Invalidate the root to prevent it from being reused by a vCPU, and
> > + * so that KVM doesn't re-zap the root when its last reference is put
> > + * again (see above).
> > + */
> > + root->role.invalid = true;
> > +
> > + /*
> > + * Ensure role.invalid is visible if a concurrent reader acquires a
> > + * reference after the root's refcount is reset. Pairs with the
> > + * smp_mb__after_atomic() above.
> > + */
> > + smp_mb__before_atomic();
>
> I have reviewed the series and I only have very minor comments... but this
> part is beyond me. The lavish comments don't explain what is an
> optimization and what is a requirement,

Ah, they're all requirements, but the invalid part also optimizes the case where
a root was marked invalid before its last reference was ever put.

What I really meant to refer to by "zapping" was the entire sequence of restoring
the refcount to '1', zapping the root, and recursively re-dropping that ref. Avoiding
that "zap" is a requirement, otherwise KVM would get stuck in an infinite loop.

> and after spending quite some time I wonder if all this should just be
>
> if (refcount_dec_not_one(&root->tdp_mmu_root_count))
> return;
>
> if (!xchg(&root->role.invalid, true) {

The refcount being '1' means there's another task currently using root, marking
the root invalid will mean checks on the root's validity are non-deterministic
for the other task.

> tdp_mmu_zap_root(kvm, root, shared);
>
> /*
> * Do not assume the refcount is still 1: because
> * tdp_mmu_zap_root can yield, a different task
> * might have grabbed a reference to this root.
> *
> if (refcount_dec_not_one(&root->tdp_mmu_root_count))

This is wrong, _this_ task can't drop a reference taken by the other task.

> return;
> }
>
> /*
> * The root is invalid, and its reference count has reached
> * zero. It must have been zapped either in the "if" above or
> * by someone else, and we're definitely the last thread to see
> * it apart from RCU-protected page table walks.
> */
> refcount_set(&root->tdp_mmu_root_count, 0);

Not sure what you intended here, KVM should never force a refcount to '0'.

> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_del_rcu(&root->link);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>
> (Yay for xchg's implicit memory barriers)

xchg() is a very good idea. The smp_mb_*() stuff was carried over from the previous
version where this sequence set another flag in addition to role.invalid.

Is this less funky (untested)?

/*
* Invalidate the root to prevent it from being reused by a vCPU while
* the root is being zapped, i.e. to allow yielding while zapping the
* root (see below).
*
* Free the root if it's already invalid. Invalid roots must be zapped
* before their last reference is put, i.e. there's no work to be done,
* and all roots must be invalidated before they're freed (this code).
* Re-zapping invalid roots would put KVM into an infinite loop.
*
* Note, xchg() provides an implicit barrier to ensure role.invalid is
* visible if a concurrent reader acquires a reference after the root's
* refcount is reset.
*/
if (xchg(root->role.invalid, true))
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
return;
}


2022-03-02 03:28:00

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap. If the root being zapped
> happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds. Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu_internal.h | 8 +++-
> arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++-----
> 2 files changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index be063b6c91b7..1bff453f7cbe 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -65,7 +65,13 @@ struct kvm_mmu_page {
> struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> tdp_ptep_t ptep;
> };
> - DECLARE_BITMAP(unsync_child_bitmap, 512);
> + union {
> + DECLARE_BITMAP(unsync_child_bitmap, 512);
> + struct {
> + struct work_struct tdp_mmu_async_work;
> + void *tdp_mmu_async_data;
> + };
> + };

At some point (probably not in this series since it's so long already)
it would be good to organize kvm_mmu_page. It looks like we've got
quite a few anonymous unions in there for TDP / Shadow MMU fields.

>
> struct list_head lpage_disallowed_link;
> #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ec28a88c6376..4151e61245a7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared);
>
> +static void tdp_mmu_zap_root_async(struct work_struct *work)
> +{
> + struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> + tdp_mmu_async_work);
> + struct kvm *kvm = root->tdp_mmu_async_data;
> +
> + read_lock(&kvm->mmu_lock);
> +
> + /*
> + * A TLB flush is not necessary as KVM performs a local TLB flush when
> + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> + * to a different pCPU. Note, the local TLB flush on reuse also
> + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> + * intermediate paging structures, that may be zapped, as such entries
> + * are associated with the ASID on both VMX and SVM.
> + */
> + tdp_mmu_zap_root(kvm, root, true);
> +
> + /*
> + * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> + * avoiding an infinite loop. By design, the root is reachable while
> + * it's being asynchronously zapped, thus a different task can put its
> + * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> + * asynchronously zapped root is unavoidable.
> + */
> + kvm_tdp_mmu_put_root(kvm, root, true);
> +
> + read_unlock(&kvm->mmu_lock);
> +
> + kvm_put_kvm(kvm);
> +}
> +
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared)
> {
> @@ -142,15 +174,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> refcount_set(&root->tdp_mmu_root_count, 1);
>
> /*
> - * Zap the root, then put the refcount "acquired" above. Recursively
> - * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> - * infinite loop by freeing invalid roots. By design, the root is
> - * reachable while it's being zapped, thus a different task can put its
> - * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> - * defunct root is unavoidable.
> + * Attempt to acquire a reference to KVM itself. If KVM is alive, then
> + * zap the root asynchronously in a worker, otherwise it must be zapped
> + * directly here. Wait to do this check until after the refcount is
> + * reset so that tdp_mmu_zap_root() can safely yield.
> + *
> + * In both flows, zap the root, then put the refcount "acquired" above.
> + * When putting the reference, use kvm_tdp_mmu_put_root() to test the
> + * above logic for avoiding an infinite loop by freeing invalid roots.
> + * By design, the root is reachable while it's being zapped, thus a
> + * different task can put its last reference, i.e. flowing through
> + * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
> */
> - tdp_mmu_zap_root(kvm, root, shared);
> - kvm_tdp_mmu_put_root(kvm, root, shared);
> + if (kvm_get_kvm_safe(kvm)) {
> + root->tdp_mmu_async_data = kvm;
> + INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
> + schedule_work(&root->tdp_mmu_async_work);
> + } else {
> + tdp_mmu_zap_root(kvm, root, shared);
> + kvm_tdp_mmu_put_root(kvm, root, shared);
> + }
> }
>
> enum tdp_mmu_roots_iter_type {
> @@ -954,7 +997,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>
> /*
> * Zap all roots, including invalid roots, as all SPTEs must be dropped
> - * before returning to the caller.
> + * before returning to the caller. Zap directly even if the root is
> + * also being zapped by a worker. Walking zapped top-level SPTEs isn't
> + * all that expensive and mmu_lock is already held, which means the
> + * worker has yielded, i.e. flushing the work instead of zapping here
> + * isn't guaranteed to be any faster.
> *
> * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
> * is being destroyed or the userspace VMM has exited. In both cases,
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-02 04:36:22

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

/facepalm

After typing up all of the below, I actually tried the novel idea of compiling
the code... and we can't do xchg() on role.invalid because it occupies a single
bit, it's not a standalone boolean. I completely agree that the xchg() code is
far, far cleaner, but we'd have to sacrifice using a full byte for "smm" _and_
write some rather ugly code for retrieving a pointer to "invalid".

TL;DR: this

smp_mb__after_atomic();

if (root->role.invalid) {
return;
}

root->role.invalid = true;

smp_mb__before_atomic();

is just a weirdly open coded xchg() that operates on a single bit field.
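
A minimal sketch of the constraint, using a simplified stand-in for kvm_mmu_page_role rather than the real layout: xchg() needs an addressable word and a C bit-field has no address, so the exchange has to operate on the containing word, which is essentially what Paolo proposes further down the thread:

    #include <linux/atomic.h>
    #include <linux/types.h>

    /* Simplified stand-in for kvm_mmu_page_role; not the real definition. */
    union demo_role {
            u32 word;
            struct {
                    unsigned level:4;
                    unsigned invalid:1;     /* one bit wide, not addressable */
                    unsigned smm:8;
            };
    };

    static bool demo_mark_invalid(union demo_role *role)
    {
            union demo_role old, new = *role;

            new.invalid = 1;

            /*
             * xchg(&role->invalid, 1) does not compile because a bit-field
             * has no address; exchanging the containing word works and still
             * reports the old value of the bit.
             */
            old.word = xchg(&role->word, new.word);
            return old.invalid;
    }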


On Tue, Mar 01, 2022, Paolo Bonzini wrote:
> On 3/1/22 20:43, Sean Christopherson wrote:
> > > and after spending quite some time I wonder if all this should just be
> > >
> > > if (refcount_dec_not_one(&root->tdp_mmu_root_count))
> > > return;
> > >
> > > if (!xchg(&root->role.invalid, true) {
> >
> > The refcount being '1' means there's another task currently using root, marking
> > the root invalid will mean checks on the root's validity are non-deterministic
> > for the other task.
>
> Do you mean it's not possible to use refcount_dec_not_one, otherwise
> kvm_tdp_mmu_get_root is not guaranteed to reject the root?

Scratch my objection, KVM already assumes concurrent readers may or may not see
role.invalid as true. I deliberately went that route so as to avoid having to
require specific ordering between checking role.invalid and getting a reference.

As my comment further down states, "allocating" a new root is the only flow that
absolutely cannot consume a soon-to-be-invalid root, and it takes mmu_lock for
write so it can't be running concurrently.

So, we don't need to rely on xchg() for barriers, the only consumers of the barriers
are kvm_tdp_mmu_put_root() and they'll obviously always do an atomic xchg().

Invalidating the root while its refcount is >=1 is also ok, but I think that's
flawed for a different reason (see comments on refcount_set(..., 0)).

> > > tdp_mmu_zap_root(kvm, root, shared);
> > >
> > > /*
> > > * Do not assume the refcount is still 1: because
> > > * tdp_mmu_zap_root can yield, a different task
> > > * might have grabbed a reference to this root.
> > > *
> > > if (refcount_dec_not_one(&root->tdp_mmu_root_count))
> >
> > This is wrong, _this_ task can't drop a reference taken by the other task.
>
> This is essentially the "kvm_tdp_mmu_put_root(kvm, root, shared);" (or "goto
> beginning_of_function;") part of your patch.

Gah, I didn't read the code/comments for refcount_dec_not_one(). I assumed it
was "decrement and return true if the result is not '1'", not "decrement unless
the count is already '1', and return true if there was a decrement". In hindsight,
the former makes no sense at all...
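
For reference, a tiny sketch of the two primitives as described above; demo_put() is a hypothetical helper, not code from the series:

    #include <linux/refcount.h>

    /*
     *   refcount_dec_not_one(r)  - decrements unless the count is 1;
     *                              returns true if it decremented.
     *   refcount_dec_and_test(r) - always decrements; returns true if the
     *                              new count is 0 (last reference dropped).
     */
    static bool demo_put(refcount_t *r)
    {
            if (refcount_dec_not_one(r))
                    return false;   /* other references remain */

            /* The count was exactly 1; this decrement takes it to 0. */
            return refcount_dec_and_test(r);
    }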

> > > return;
> > > }
> > >
> > > /*
> > > * The root is invalid, and its reference count has reached
> > > * zero. It must have been zapped either in the "if" above or
> > > * by someone else, and we're definitely the last thread to see
> > > * it apart from RCU-protected page table walks.
> > > */
> > > refcount_set(&root->tdp_mmu_root_count, 0);
> >
> > Not sure what you intended here, KVM should never force a refcount to '0'.
>
> It's turning a refcount_dec_not_one into a refcount_dec_and_test. It seems
> legit to me, because the only refcount_inc_not_zero is in a write-side
> critical section. If the refcount goes to zero on the read-side, the root
> is gone for good.

The issue is that by using refcount_dec_not_one() above, there's no guarantee that
this task is the last one to see it as kvm_tdp_mmu_get_root() can succeed and bump
the refcount between refcount_dec_not_one() and here. Freeing the root would lead
to use-after-free because iterators (rightly) assume that RCU protection isn't
needed once they have a reference. RCU protection is needed only if the user of the
iterator wants to dereference page table memory.

> > xchg() is a very good idea. The smp_mb_*() stuff was carried over from the previous
> > version where this sequence set another flag in addition to role.invalid.
> >
> > Is this less funky (untested)?
> >
> > /*
> > * Invalidate the root to prevent it from being reused by a vCPU while
> > * the root is being zapped, i.e. to allow yielding while zapping the
> > * root (see below).
> > *
> > * Free the root if it's already invalid. Invalid roots must be zapped
> > * before their last reference is put, i.e. there's no work to be done,
> > * and all roots must be invalidated before they're freed (this code).
> > * Re-zapping invalid roots would put KVM into an infinite loop.
> > *
> > * Note, xchg() provides an implicit barrier to ensure role.invalid is
> > * visible if a concurrent reader acquires a reference after the root's
> > * refcount is reset.
> > */
> > if (xchg(root->role.invalid, true))
> > spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > list_del_rcu(&root->link);
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> > call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > return;
> > }
>
> Based on my own version, I guess you mean (without comments due to family
> NMI):
>
> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
> return;
>
> if (!xchg(&root->role.invalid, true) {
> > refcount_set(&root->tdp_mmu_root_count, 1);
> tdp_mmu_zap_root(kvm, root, shared);
> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
> return;
> }
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_del_rcu(&root->link);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);

That would work, though I'd prefer to recurse on kvm_tdp_mmu_put_root() instead
of open coding refcount_dec_and_test() so that we get coverage of the xchg()
doing the right thing.

I still slightly prefer having the "free" path be inside the xchg(). To me, even
though the "free" path is the only one that's guaranteed to be reached for every root,
the fall-through to resetting the refcount and zapping the root is the "normal" path,
and the "free" path is the exception.

2022-03-02 04:45:27

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v3 23/28] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <[email protected]> wrote:
>
> Explicitly check for a REMOVED leaf SPTE prior to attempting to map
> the final SPTE when handling a TDP MMU fault. Functionally, this is a
> nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE.
> Pre-checking for a REMOVED SPTE is a minor optimization, but the real goal
> is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old"
> SPTE is never a REMOVED SPTE.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4151e61245a7..1acd12bf309f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1250,7 +1250,11 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> }
> }
>
> - if (iter.level != fault->goal_level) {
> + /*
> + * Force the guest to retry the access if the upper level SPTEs aren't
> + * in place, or if the target leaf SPTE is frozen by another CPU.
> + */
> + if (iter.level != fault->goal_level || is_removed_spte(iter.old_spte)) {
> rcu_read_unlock();
> return RET_PF_RETRY;
> }
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-02 09:09:51

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On 3/1/22 20:43, Sean Christopherson wrote:
>> and after spending quite some time I wonder if all this should just be
>>
>> if (refcount_dec_not_one(&root->tdp_mmu_root_count))
>> return;
>>
>> if (!xchg(&root->role.invalid, true) {
>
> The refcount being '1' means there's another task currently using root, marking
> the root invalid will mean checks on the root's validity are non-deterministic
> for the other task.

Do you mean it's not possible to use refcount_dec_not_one, otherwise
kvm_tdp_mmu_get_root is not guaranteed to reject the root?

>> tdp_mmu_zap_root(kvm, root, shared);
>>
>> /*
>> * Do not assume the refcount is still 1: because
>> * tdp_mmu_zap_root can yield, a different task
>> * might have grabbed a reference to this root.
>> *
>> if (refcount_dec_not_one(&root->tdp_mmu_root_count))
>
> This is wrong, _this_ task can't drop a reference taken by the other task.

This is essentially the "kvm_tdp_mmu_put_root(kvm, root, shared);" (or
"goto beginning_of_function;") part of your patch.

>> return;
>> }
>>
>> /*
>> * The root is invalid, and its reference count has reached
>> * zero. It must have been zapped either in the "if" above or
>> * by someone else, and we're definitely the last thread to see
>> * it apart from RCU-protected page table walks.
>> */
>> refcount_set(&root->tdp_mmu_root_count, 0);
>
> Not sure what you intended here, KVM should never force a refcount to '0'.

It's turning a refcount_dec_not_one into a refcount_dec_and_test. It
seems legit to me, because the only refcount_inc_not_zero is in a
write-side critical section. If the refcount goes to zero on the
read-side, the root is gone for good.

> xchg() is a very good idea. The smp_mb_*() stuff was carried over from the previous
> version where this sequence set another flag in addition to role.invalid.
>
> Is this less funky (untested)?
>
> /*
> * Invalidate the root to prevent it from being reused by a vCPU while
> * the root is being zapped, i.e. to allow yielding while zapping the
> * root (see below).
> *
> * Free the root if it's already invalid. Invalid roots must be zapped
> * before their last reference is put, i.e. there's no work to be done,
> * and all roots must be invalidated before they're freed (this code).
> * Re-zapping invalid roots would put KVM into an infinite loop.
> *
> * Note, xchg() provides an implicit barrier to ensure role.invalid is
> * visible if a concurrent reader acquires a reference after the root's
> * refcount is reset.
> */
> if (xchg(root->role.invalid, true))
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_del_rcu(&root->link);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> return;
> }

Based on my own version, I guess you mean (without comments due to
family NMI):

if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;

if (!xchg(&root->role.invalid, true) {
refcount_set(&root->tdp_mmu_root_count, 1);
tdp_mmu_zap_root(kvm, root, shared);
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;
}

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);

Paolo

2022-03-02 20:48:39

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On 2/26/22 01:15, Sean Christopherson wrote:
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap. If the root being zapped
> happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds. Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---

Do we even need kvm_tdp_mmu_zap_invalidated_roots() now? That is,
something like the following:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd3625a875ef..5fd8bc858c6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
lockdep_assert_held(&kvm->slots_lock);

+ /*
+ * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
+ * count. If we're dying, zap everything as it's going to happen
+ * soon anyway.
+ */
+ if (!refcount_read(&kvm->users_count)) {
+ kvm_mmu_zap_all(kvm);
+ return;
+ }
+
write_lock(&kvm->mmu_lock);
trace_kvm_mmu_zap_all_fast(kvm);

@@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_zap_obsolete_pages(kvm);

write_unlock(&kvm->mmu_lock);
-
- /*
- * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
- * returning to the caller, e.g. if the zap is in response to a memslot
- * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
- * associated with the deleted memslot once the update completes, and
- * Deferring the zap until the final reference to the root is put would
- * lead to use-after-free.
- */
- if (is_tdp_mmu_enabled(kvm)) {
- read_lock(&kvm->mmu_lock);
- kvm_tdp_mmu_zap_invalidated_roots(kvm);
- read_unlock(&kvm->mmu_lock);
- }
}

static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cd1bf68e7511..af9db5b8f713 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
WARN_ON(!root->tdp_mmu_page);

/*
- * The root now has refcount=0 and is valid. Readers cannot acquire
- * a reference to it (they all visit valid roots only, except for
- * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
- * any reference itself.
+ * The root now has refcount=0. It is valid, but readers already
+ * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
+ * rejects it. This remains true for the rest of the execution
+ * of this function, because readers visit valid roots only
+ * (except for tdp_mmu_zap_root_work(), which however operates only
+ * on one specific root and does not acquire any reference itself).

*
* Even though there are flows that need to visit all roots for
* correctness, they all take mmu_lock for write, so they cannot yet
@@ -996,103 +994,16 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
}
}

-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
- struct kvm_mmu_page *prev_root)
-{
- struct kvm_mmu_page *next_root;
-
- if (prev_root)
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &prev_root->link,
- typeof(*prev_root), link);
- else
- next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- typeof(*next_root), link);
-
- while (next_root && !(next_root->role.invalid &&
- refcount_read(&next_root->tdp_mmu_root_count)))
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &next_root->link,
- typeof(*next_root), link);
-
- return next_root;
-}
-
-/*
- * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
- * zap" completes. Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
- * reference to each invalidated root, roots will not be freed until after this
- * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
- * tearing paging structures.
- */
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
-{
- struct kvm_mmu_page *next_root;
- struct kvm_mmu_page *root;
-
- lockdep_assert_held_read(&kvm->mmu_lock);
-
- rcu_read_lock();
-
- root = next_invalidated_root(kvm, NULL);
-
- while (root) {
- next_root = next_invalidated_root(kvm, root);
-
- rcu_read_unlock();
-
- /*
- * Zap the root regardless of what marked it invalid, e.g. even
- * if the root was marked invalid by kvm_tdp_mmu_put_root() due
- * to its last reference being put. All SPTEs must be dropped
- * before returning to the caller, e.g. if a memslot is deleted
- * or moved, the memslot's associated SPTEs are unreachable via
- * the mmu_notifier once the memslot update completes.
- *
- * A TLB flush is unnecessary, invalidated roots are guaranteed
- * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
- * for more details), and unlike the legacy MMU, no vCPU kick
- * is needed to play nice with lockless shadow walks as the TDP
- * MMU protects its paging structures via RCU. Note, zapping
- * will still flush on yield, but that's a minor performance
- * blip and not a functional issue.
- */
- tdp_mmu_zap_root(kvm, root, true);
-
- /*
- * Put the reference acquired in
- * kvm_tdp_mmu_invalidate_roots
- */
- kvm_tdp_mmu_put_root(kvm, root, true);
-
- root = next_root;
-
- rcu_read_lock();
- }
-
- rcu_read_unlock();
-}
-
/*
* Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
- * is about to be zapped, e.g. in response to a memslots update. The caller is
- * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to the actual
- * zapping.
- *
- * Take a reference on all roots to prevent the root from being freed before it
- * is zapped by this thread. Freeing a root is not a correctness issue, but if
- * a vCPU drops the last reference to a root prior to the root being zapped, it
- * will get stuck with tearing down the entire paging structure.
- *
- * Get a reference even if the root is already invalid,
- * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
- * invalid roots, e.g. there's no epoch to identify roots that were invalidated
- * by a previous call. Roots stay on the list until the last reference is
- * dropped, so even though all invalid roots are zapped, a root may not go away
- * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
+ * is about to be zapped, e.g. in response to a memslots update. The actual
+ * zapping is performed asynchronously, so a reference is taken on all roots
+ * as well as (once per root) on the struct kvm.
*
- * Because mmu_lock is held for write, it should be impossible to observe a
- * root with zero refcount, i.e. the list of roots cannot be stale.
+ * Get a reference even if the root is already invalid, the asynchronous worker
+ * assumes it was gifted a reference to the root it processes. Because mmu_lock
+ * is held for write, it should be impossible to observe a root with zero refcount,
+ * i.e. the list of roots cannot be stale.
*
* This has essentially the same effect for the TDP MMU
* as updating mmu_valid_gen does for the shadow MMU.
@@ -1103,8 +1014,11 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)

lockdep_assert_held_write(&kvm->mmu_lock);
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
- if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+ if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
root->role.invalid = true;
+ kvm_get_kvm(kvm);
+ tdp_mmu_schedule_zap_root(kvm, root);
+ }
}
}


It passes a smoke test, and also resolves the debate on the fate of patch 1.

However, I think we now need a module_get/module_put when creating/destroying
a VM; the workers can outlive kvm_vm_release and therefore any reference
automatically taken by VFS's fops_get/fops_put.

Paolo
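
A rough sketch of the module-pinning idea; which module pointer gets pinned (THIS_MODULE here) and where the hooks live are assumptions, not the fix that was actually posted:

    #include <linux/module.h>

    /* The point is only that workers holding a kvm reference must not
     * outlive the module text. */
    static int demo_vm_created(void)
    {
            if (!try_module_get(THIS_MODULE))
                    return -ENODEV;
            return 0;
    }

    static void demo_vm_destroyed(void)
    {
            module_put(THIS_MODULE);
    }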

2022-03-02 21:30:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 19:01, Sean Christopherson wrote:
> > > It passes a smoke test, and also resolves the debate on the fate of patch 1.
> > +1000, I love this approach. Do you want me to work on a v3, or shall I let you
> > have the honors?
>
> I'm already running the usual battery of tests, so I should be able to post
> it either tomorrow (early in my evening) or Friday morning.

Gah, now I remember why I didn't use an async worker. kvm_mmu_zap_all_fast()
must ensure all SPTEs are zapped and their dirty/accessed data written back to
the primary MMU prior to returning. Once the memslot update completes, the old
deleted/moved memslot is no longer reachable by the mmu_notifier. If an mmu_notifier
zaps pfns reachable via the root, KVM will do nothing because there's no relevant
memslot.

So we can use workers, but kvm_mmu_zap_all_fast() would need to flush all workers
before returning, which ends up being no different than putting the invalid roots
on a different list.

What about that idea? Put roots invalidated by "fast zap" on _another_ list?
My very original idea of moving the roots to a separate list didn't work because
the roots needed to be reachable by the mmu_notifier. But we could just add
another list_head (inside the unsync_child_bitmap union) and add the roots to
_that_ list.

Let me go resurrect that patch from v1 and tweak it to keep the roots on the old
list, but add them to a new list as well. That would get rid of the invalid
root iterator stuff.
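
A rough sketch of the "second list" idea under discussion; the type, field, and list names below are hypothetical, not from any posted patch:

    #include <linux/bitmap.h>
    #include <linux/list.h>

    struct demo_mmu_page {
            union {
                    DECLARE_BITMAP(unsync_child_bitmap, 512);
                    struct list_head invalid_link;  /* new, aliases the bitmap */
            };
    };

    static LIST_HEAD(demo_invalid_roots);

    static void demo_track_invalid_root(struct demo_mmu_page *root)
    {
            /*
             * The root stays on the main roots list so the mmu_notifier can
             * still reach it; it is additionally linked here so the "fast
             * zap" only walks the roots it invalidated.
             */
            list_add_tail(&root->invalid_link, &demo_invalid_roots);
    }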

2022-03-02 22:23:00

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On 3/2/22 19:33, David Matlack wrote:
> On Wed, Mar 2, 2022 at 9:35 AM Sean Christopherson <[email protected]> wrote:
>>
>> On Wed, Mar 02, 2022, Paolo Bonzini wrote:
>>> However, I think we now need a module_get/module_put when creating/destroying
>>> a VM; the workers can outlive kvm_vm_release and therefore any reference
>>> automatically taken by VFS's fops_get/fops_put.
>>
>> Haven't read the rest of the patch, but this caught my eye. We _already_ need
>> to handle this scenario. As you noted, any worker, i.e. anything that takes a
>> reference via kvm_get_kvm() without any additional guarantee that the module can't
>> be unloaded is suspect. x86 is mostly fine, though kvm_setup_async_pf() is likely
>> affected, and other architectures seem to have bugs.
>>
>> Google has an internal patch that addresses this. I believe David is going to post
>> the fix... David?
>
> This was towards the back of my queue but I can bump it to the front.
> I'll have the patches out this week.

Thanks!

Paolo

2022-03-02 22:33:10

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> However, I think we now need a module_get/module_put when creating/destroying
> a VM; the workers can outlive kvm_vm_release and therefore any reference
> automatically taken by VFS's fops_get/fops_put.

Haven't read the rest of the patch, but this caught my eye. We _already_ need
to handle this scenario. As you noted, any worker, i.e. anything that takes a
reference via kvm_get_kvm() without any additional guarantee that the module can't
be unloaded is suspect. x86 is mostly fine, though kvm_setup_async_pf() is likely
affected, and other architectures seem to have bugs.

Google has an internal patch that addresses this. I believe David is going to post
the fix... David?

2022-03-02 22:41:23

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On 3/2/22 21:47, Sean Christopherson wrote:
> On Wed, Mar 02, 2022, Paolo Bonzini wrote:
>> For now let's do it the simple but ugly way. Keeping
>> next_invalidated_root() does not make things worse than the status quo, and
>> further work will be easier to review if it's kept separate from this
>> already-complex work.
>
> Oof, that's not gonna work. My approach here in v3 doesn't work either. I finally
> remembered why I had the dedicated tdp_mmu_defunct_root flag and thus the smp_mb_*()
> dance.
>
> kvm_tdp_mmu_zap_invalidated_roots() assumes that it was gifted a reference to
> _all_ invalid roots by kvm_tdp_mmu_invalidate_all_roots(). This works in the
> current code base only because kvm->slots_lock is held for the entire duration,
> i.e. roots can't become invalid between the end of kvm_tdp_mmu_invalidate_all_roots()
> and the end of kvm_tdp_mmu_zap_invalidated_roots().

Yeah, of course that doesn't work if kvm_tdp_mmu_zap_invalidated_roots()
calls kvm_tdp_mmu_put_root() and the worker also does the same
kvm_tdp_mmu_put_root().

But, it seems to me that we were so close to something that works and is
elegant with the worker idea. It does avoid the possibility of two
"puts", because the work item is created on the valid->invalid
transition. What do you think of having a separate workqueue for each
struct kvm, so that kvm_tdp_mmu_zap_invalidated_roots() can be replaced
with a flush? I can probably do it next Friday.

Paolo
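
A minimal sketch of the per-VM workqueue idea, with hypothetical names and a file-scope pointer standing in for a field in struct kvm:

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *demo_zap_wq;

    static int demo_init_zap_wq(void)
    {
            demo_zap_wq = alloc_workqueue("kvm-tdp-mmu-zap", WQ_UNBOUND, 0);
            return demo_zap_wq ? 0 : -ENOMEM;
    }

    static void demo_schedule_zap_root(struct work_struct *work)
    {
            /* Queued once per root on the valid->invalid transition. */
            queue_work(demo_zap_wq, work);
    }

    static void demo_zap_invalidated_roots(void)
    {
            /*
             * Instead of walking invalidated roots itself, the "fast zap"
             * just waits for every queued zap to finish before the memslot
             * update is allowed to complete.
             */
            flush_workqueue(demo_zap_wq);
    }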

>
> Marking a root invalid in kvm_tdp_mmu_put_root() breaks that assumption, e.g. if a
> new root is created and then dropped, it will be marked invalid but the "fast zap"
> will not have a reference. The "defunct" flag prevents this scenario by allowing
> the "fast zap" path to identify invalid roots for which it did not take a reference.
> By virtue of holding a reference, "fast zap" also guarantees that the roots it needs
> to invalidate and put can't become defunct.
>
> My preference would be to either go back to a variant of v2, or to implement my
> "second list" idea.
>
> I also need to figure out why I didn't encounter errors in v3, because I distinctly
> remember underflowing the refcount before adding the defunct flag...

2022-03-02 23:20:04

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 26/28] KVM: selftests: Split out helper to allocate guest mem via memfd

On Wed, 2022-03-02 at 19:36 +0100, Paolo Bonzini wrote:
> On 3/1/22 00:36, David Woodhouse wrote:
> > On Sat, 2022-02-26 at 00:15 +0000, Sean Christopherson wrote:
> > > Extract the code for allocating guest memory via memfd out of
> > > vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
> > > A future selftest to populate a guest with the maximum amount of guest
> > > memory will abuse KVM's memslots to alias guest memory regions to a
> > > single memfd-backed host region, i.e. needs to back a guest with memfd
> > > memory without a 1:1 association between a memslot and a memfd instance.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> >
> > While we're at it, please can we make the whole thing go away and just
> > return failure #ifndef MFD_CLOEXEC, instead of breaking the build on
> > older userspace?
>
> We can just use old school F_SETFD if that's helpful for you.

The system on which I had that problem doesn't have memfd at all. As a
local hack just to build the TSC test case on a system that actually
supports TSC scaling, I just made it all go away thus:

--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -938,6 +938,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,

region->fd = -1;
if (backing_src_is_shared(src_type)) {
+#ifdef MFD_CLOEXEC
int memfd_flags = MFD_CLOEXEC;

if (src_type == VM_MEM_SRC_SHARED_HUGETLB)
@@ -954,6 +955,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
region->mmap_size);
TEST_ASSERT(ret == 0, "fallocate failed, errno: %i", errno);
+#else
+ TEST_ASSERT(false, "memfd support not present");
+#endif
}

region->mmap_start = mmap(NULL, region->mmap_size,



2022-03-02 23:21:18

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On 3/2/22 19:01, Sean Christopherson wrote:
>> + */
>> + if (!refcount_read(&kvm->users_count)) {
>> + kvm_mmu_zap_all(kvm);
>> + return;
>> + }
>
> I'd prefer we make this an assertion and shove this logic to set_nx_huge_pages(),
> because in that case there's no need to zap anything, the guest can never run
> again. E.g. (I'm trying to remember why I didn't do this before...)

I did it this way because it seemed like a reasonable fallback for any
present or future caller.

> One thing that keeps tripping me up is the "readers" verbiage. I get confused
> because taking mmu_lock for read vs. write doesn't really have anything to do with
> reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
> "readers" means anything iterating over the set of roots. Not sure if there's a
> shorthand that won't be confusing.

Not that I know of. You really need to know that the rwlock is being
used for its shared/exclusive locking behavior. But even on other OSes
that use shared/exclusive instead of read/write, there are no analogous
nouns, and people end up using readers/writers anyway.

>> It passes a smoke test, and also resolves the debate on the fate of patch 1.
> +1000, I love this approach. Do you want me to work on a v3, or shall I let you
> have the honors?

I'm already running the usual battery of tests, so I should be able to
post it either tomorrow (early in my evening) or Friday morning.

Paolo

2022-03-02 23:28:01

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On 3/2/22 20:33, Sean Christopherson wrote:
> What about that idea? Put roots invalidated by "fast zap" on _another_ list?
> My very original idea of moving the roots to a separate list didn't work because
> the roots needed to be reachable by the mmu_notifier. But we could just add
> another list_head (inside the unsync_child_bitmap union) and add the roots to
> _that_ list.

Perhaps the "separate list" idea could be extended to have a single
worker for all kvm_tdp_mmu_put_root() work, and then indeed replace
kvm_tdp_mmu_zap_invalidated_roots() with a flush of _that_ worker. The
disadvantage is a little less parallelism in zapping invalidated roots;
but what is good for kvm_tdp_mmu_zap_invalidated_roots() is just as good
for kvm_tdp_mmu_put_root(), I suppose. If one wants separate work
items, KVM could have its own workqueue, and then you flush that workqueue.

For now let's do it the simple but ugly way. Keeping
next_invalidated_root() does not make things worse than the status quo,
and further work will be easier to review if it's kept separate from
this already-complex work.

Paolo

2022-03-02 23:38:14

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Wed, Mar 2, 2022 at 9:35 AM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> > However, I think we now need a module_get/module_put when creating/destroying
> > a VM; the workers can outlive kvm_vm_release and therefore any reference
> > automatically taken by VFS's fops_get/fops_put.
>
> Haven't read the rest of the patch, but this caught my eye. We _already_ need
> to handle this scenario. As you noted, any worker, i.e. anything that takes a
> reference via kvm_get_kvm() without any additional guarantee that the module can't
> be unloaded is suspect. x86 is mostly fine, though kvm_setup_async_pf() is likely
> affected, and other architectures seem to have bugs.
>
> Google has an internal patch that addresses this. I believe David is going to post
> the fix... David?

This was towards the back of my queue but I can bump it to the front.
I'll have the patches out this week.

2022-03-02 23:55:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 01/28] KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots

On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > Now that tdp_mmu_next_root() can process both valid and invalid roots,
> > extend it to be able to process _only_ invalid roots, add yet another
> > iterator macro for walking invalid roots, and use the new macro in
> > kvm_tdp_mmu_zap_invalidated_roots().
> >
> > No functional change intended.
> >
> > Reviewed-by: David Matlack <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------
> > 1 file changed, 26 insertions(+), 48 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index debf08212f12..25148e8b711d 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > }
> >
> > +enum tdp_mmu_roots_iter_type {
> > + ALL_ROOTS = -1,
> > + VALID_ROOTS = 0,
> > + INVALID_ROOTS = 1,
> > +};
>
> I am wondering what the trick is to start from -1?

-1 is arbitrary, any non-zero value would work. More below.

> > /*
> > * Returns the next root after @prev_root (or the first root if @prev_root is
> > * NULL). A reference to the returned root is acquired, and the reference to
> > @@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > */
> > static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > struct kvm_mmu_page *prev_root,
> > - bool shared, bool only_valid)
> > + bool shared,
> > + enum tdp_mmu_roots_iter_type type)
> > {
> > struct kvm_mmu_page *next_root;
> >
> > + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > +
> > + /* Ensure correctness for the below comparison against role.invalid. */
> > + BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS);
> > +
> > rcu_read_lock();
> >
> > if (prev_root)
> > @@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > typeof(*next_root), link);
> >
> > while (next_root) {
> > - if ((!only_valid || !next_root->role.invalid) &&
> > + if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&

This is the code that deals with the enums. It's making the type a tri-state,
where the values of VALID_ROOTS and INVALID_ROOTS align with converting role.invalid
to a boolean (always '0' or '1') so that they can be directly compared as above.

Any value for ALL_ROOTS (other than '0' or '1' obviously) would work since the
above logic requires ALL_ROOTS to be explicitly checked first.
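
Spelled out as a hypothetical helper (the patch open codes this in the while loop), the filter reads:

    #include <linux/types.h>

    enum demo_roots_iter_type {
            DEMO_ALL_ROOTS = -1,
            DEMO_VALID_ROOTS = 0,
            DEMO_INVALID_ROOTS = 1,
    };

    static bool demo_root_matches(enum demo_roots_iter_type type, bool invalid)
    {
            /*
             * ALL_ROOTS (-1) never equals 0 or 1, so it is checked first;
             * otherwise the enum value doubles as the expected value of
             * !!role.invalid: 0 selects valid roots, 1 invalid ones.
             */
            return type == DEMO_ALL_ROOTS || type == invalid;
    }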

> > kvm_tdp_mmu_get_root(next_root))
> > break;
> >

2022-03-02 23:58:10

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 03:13, Sean Christopherson wrote:
> > That would work, though I'd prefer to recurse on kvm_tdp_mmu_put_root() instead
> > of open coding refcount_dec_and_test() so that we get coverage of the xchg()
> > doing the right thing.
> >
> > I still slightly prefer having the "free" path be inside the xchg(). To me, even
> > though the "free" path is the only one that's guaranteed to be reached for every root,
> > the fall-through to resetting the refcount and zapping the root is the "normal" path,
> > and the "free" path is the exception.
>
> Hmm I can see how that makes sense, especially once you add in the worker logic.
> But it seems to me that the "basic" logic should be "do both the xchg and the
> free", and coding the function with tail recursion obfuscates this. Even with
> the worker, you grow an
>
> + if (kvm_get_kvm_safe(kvm)) {
> + ... let the worker do it ...
> + return;
> + }
> +
> tdp_mmu_zap_root(kvm, root, shared);
>
> but you still have a downwards flow that matches what happens even if multiple
> threads pick up different parts of the job.
>
> So, I tried a bunch of alternatives including with gotos and with if/else, but
> really the above one remains my favorite.

Works for me.

> My plan would be:
>
> 1) splice the mini series I'm attaching before this patch, and
> remove patch 1 of this series. next_invalidated_root() is a
> bit yucky, but notably it doesn't need to add/remove a reference
> in kvm_tdp_mmu_zap_invalidated_roots().
>
> Together, these two steps ensure that readers never acquire a
> reference to either refcount=0/valid or invalid pages. In other
> words, the three states that kvm_tdp_mmu_put_root moves the root
> through (refcount=0/valid -> refcount=0/invalid -> refcount=1/invalid)
> are exactly the same to readers, and there are essentially no races
> to worry about.
>
> In other other words, it's trading slightly uglier code for simpler
> invariants.

I'm glad you thought of the "immediately send to a worker" idea, I don't know if
I could stomach next_invalidated_root() continuing to live :-)

> @@ -879,7 +879,8 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> {
> struct kvm_mmu_page *root;
>
> - for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
> + lockdep_assert_held_write(&kvm->mmu_lock);
> + for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
>
> return flush;
> @@ -895,8 +896,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> * is being destroyed or the userspace VMM has exited. In both cases,
> * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> */
> + lockdep_assert_held_write(&kvm->mmu_lock);
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> - for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
> + for_each_tdp_mmu_root_yield_safe(kvm, root, i)
> tdp_mmu_zap_root(kvm, root, false);
> }
> }
> --

If you hoist the patch "KVM: x86/mmu: Require mmu_lock be held for write in unyielding
root iter" to be the first in the series, then you can instead add the lockdep
assertion to the iterator itself. Slotted in earlier in the series, it becomes...

From: Paolo Bonzini <[email protected]>
Date: Wed, 2 Mar 2022 09:25:30 -0800
Subject: [PATCH] KVM: x86/mmu: do not allow readers to acquire references to
invalid roots

Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus
ensuring that readers do not ever acquire a reference to an invalid root.
After this patch, all readers except kvm_tdp_mmu_zap_invalidated_roots()
treat refcount=0/valid, refcount=0/invalid and refcount=1/invalid in
exactly the same way. kvm_tdp_mmu_zap_invalidated_roots() is different
but it also does not acquire a reference to the invalid root, and it
cannot see refcount=0/invalid because it is guaranteed to run after
kvm_tdp_mmu_invalidate_all_roots().

Opportunistically add a lockdep assertion to the yield-safe iterator.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0cb834aa5406..46752093b79c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -158,14 +158,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
_root; \
_root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+ if (kvm_lockdep_assert_mmu_lock_held(_kvm, _shared) && \
+ kvm_mmu_page_as_id(_root) != _as_id) { \
} else

#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)

-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, false, false)

/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
@@ -800,7 +801,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
{
struct kvm_mmu_page *root;

- for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
false);


base-commit: 57352b7efee6b38eb9eb899b8f013652afe4e110
--

2022-03-02 23:58:40

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 26/28] KVM: selftests: Split out helper to allocate guest mem via memfd

On 3/1/22 00:36, David Woodhouse wrote:
> On Sat, 2022-02-26 at 00:15 +0000, Sean Christopherson wrote:
>> Extract the code for allocating guest memory via memfd out of
>> vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
>> A future selftest to populate a guest with the maximum amount of guest
>> memory will abuse KVM's memslots to alias guest memory regions to a
>> single memfd-backed host region, i.e. needs to back a guest with memfd
>> memory without a 1:1 association between a memslot and a memfd instance.
>>
>> No functional change intended.
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>
> While we're at it, please can we make the whole thing go away and just
> return failure #ifndef MFD_CLOEXEC, instead of breaking the build on
> older userspace?

We can just use old school F_SETFD if that's helpful for you.

Paolo
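
A sketch of the F_SETFD alternative, purely illustrative; it only covers the missing-MFD_CLOEXEC case and still assumes memfd_create() itself is available:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Not the code that went into kvm_memfd_alloc(). */
    static int demo_memfd_alloc(const char *name, size_t size)
    {
            int fd = memfd_create(name, 0);

            if (fd < 0)
                    return fd;

            /* Same effect as passing MFD_CLOEXEC at creation time. */
            if (fcntl(fd, F_SETFD, FD_CLOEXEC) || ftruncate(fd, size)) {
                    close(fd);
                    return -1;
            }
            return fd;
    }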

2022-03-03 00:00:24

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

On 3/2/22 03:13, Sean Christopherson wrote:
> After typing up all of the below, I actually tried the novel idea of compiling
> the code... and we can't do xchg() on role.invalid because it occupies a single
> bit, it's not a standalone boolean.

Yeah I thought the same right after sending the email, but I'll just say it
was pseudocode. :) We can do

static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
{
union kvm_mmu_page_role role = page->role;
role.invalid = true;

/* No need to use cmpxchg, only the invalid bit can change. */
role.word = xchg(&page->role.word, role.word);
return role.invalid;
}

Either way, barriers or xchg, it needs to be a separate function.

> by using refcount_dec_not_one() above, there's no guarantee that this
> task is the last one to see it as kvm_tdp_mmu_get_root() can succeed
> and bump the refcount between refcount_dec_not_one() and here.
Yep, I agree refcount_dec_and_test is needed.

>> Based on my own version, I guess you mean (without comments due to family
>> NMI):
>>
>> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
>> return;
>>
>> if (!xchg(&root->role.invalid, true) {
>> refcount_set(&root->tdp_mmu_root_count, 1);
>> tdp_mmu_zap_root(kvm, root, shared);
>> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
>> return;
>> }
>>
>> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>> list_del_rcu(&root->link);
>> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>
> That would work, though I'd prefer to recurse on kvm_tdp_mmu_put_root() instead
> of open coding refcount_dec_and_test() so that we get coverage of the xchg()
> doing the right thing.
>
> I still slightly prefer having the "free" path be inside the xchg(). To me, even
> though the "free" path is the only one that's guaranteed to be reached for every root,
> the fall-through to resetting the refcount and zapping the root is the "normal" path,
> and the "free" path is the exception.

Hmm, I can see how that especially makes sense once you add in the worker logic.
But it seems to me that the "basic" logic should be "do both the xchg and the
free", and coding the function with tail recursion obfuscates this. Even with
the worker, you grow an

+ if (kvm_get_kvm_safe(kvm)) {
+ ... let the worker do it ...
+ return;
+ }
+
tdp_mmu_zap_root(kvm, root, shared);

but you still have a downwards flow that matches what happens even if multiple
threads pick up different parts of the job.

So, I tried a bunch of alternatives including with gotos and with if/else, but
really the above one remains my favorite.

My plan would be:

1) splice the mini series I'm attaching before this patch, and
remove patch 1 of this series. next_invalidated_root() is a
bit yucky, but notably it doesn't need to add/remove a reference
in kvm_tdp_mmu_zap_invalidated_roots().

Together, these two steps ensure that readers never acquire a
reference to either refcount=0/valid or invalid pages. In other
words, the three states that kvm_tdp_mmu_put_root() moves the root
through (refcount=0/valid -> refcount=0/invalid -> refcount=1/invalid)
are exactly the same to readers, and there are essentially no races
to worry about.

In other other words, it's trading slightly uglier code for simpler
invariants.

2) here, replace the change to kvm_tdp_mmu_put_root with the following:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b6ffa91fb9d7..aa0669f54d96 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -81,6 +81,16 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);

+static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
+{
+ union kvm_mmu_page_role role = page->role;
+ role.invalid = true;
+
+ /* No need to use cmpxchg, only the invalid bit can change. */
+ role.word = xchg(&page->role.word, role.word);
+ return role.invalid;
+}
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -91,20 +101,44 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,

WARN_ON(!root->tdp_mmu_page);

+ /*
+ * The root now has refcount=0 and is valid. Readers cannot acquire
+ * a reference to it (they all visit valid roots only, except for
+ * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
+ * any reference itself.
+ *
+ * Even though there are flows that need to visit all roots for
+ * correctness, they all take mmu_lock for write, so they cannot yet
+ * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
+ * since the root still has refcount=0.
+ *
+ * However, tdp_mmu_zap_root can yield, and writers do not expect to
+ * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
+ * So the root temporarily gets an extra reference, going to refcount=1
+ * while staying invalid. Readers still cannot acquire any reference;
+ * but writers are now allowed to run if tdp_mmu_zap_root yields and
+ * they might take an extra reference if they themselves yield. Therefore,
+ * when the reference is given back after tdp_mmu_zap_root terminates,
+ * there is no guarantee that the refcount is still 1. If not, whoever
+ * puts the last reference will free the page, but they will not have to
+ * zap the root because a root cannot go from invalid to valid.
+ */
+ if (!kvm_tdp_root_mark_invalid(root)) {
+ refcount_set(&root->tdp_mmu_root_count, 1);
+ tdp_mmu_zap_root(kvm, root, shared);
+
+ /*
+ * Give back the reference that was added back above. We now
+ * know that the root is invalid, so go ahead and free it if
+ * no one has taken a reference in the meanwhile.
+ */
+ if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
+ return;
+ }
+
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-
- /*
- * A TLB flush is not necessary as KVM performs a local TLB flush when
- * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
- * to a different pCPU. Note, the local TLB flush on reuse also
- * invalidates any paging-structure-cache entries, i.e. TLB entries for
- * intermediate paging structures, that may be zapped, as such entries
- * are associated with the ASID on both VMX and SVM.
- */
- tdp_mmu_zap_root(kvm, root, shared);
-
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}


3) for the worker patch, the idea would be

+static void tdp_mmu_zap_root_work(struct work_struct *work)
+{
+ ...
+}
+
+
+static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ root->tdp_mmu_async_data = kvm;
+ INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
+ schedule_work(&root->tdp_mmu_async_work);
+}
+
static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
{
union kvm_mmu_page_role role = page->role;
@@ -125,13 +165,24 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
*/
if (!kvm_tdp_root_mark_invalid(root)) {
refcount_set(&root->tdp_mmu_root_count, 1);
- tdp_mmu_zap_root(kvm, root, shared);

/*
- * Give back the reference that was added back above. We now
+ * If the struct kvm is alive, we might as well zap the root
+ * in a worker. The worker takes ownership of the reference we
+ * have just added to root as well as the new reference to kvm.
+ */
+ if (kvm_get_kvm_safe(kvm)) {
+ tdp_mmu_schedule_zap_root(kvm, root);
+ return;
+ }
+
+ /*
+ * The struct kvm is being destroyed, zap synchronously and give
+ * back immediately the reference that was added above. We now
* know that the root is invalid, so go ahead and free it if
* no one has taken a reference in the meanwhile.
*/
+ tdp_mmu_zap_root(kvm, root, shared);
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;
}


Again, I appreciate the idea behind the recursive call, but I think
overall it's clearer to have a clear flow from the beginning to the
end of the function, with the exceptions and optimizations noted as
early returns.

Let me know what you think. Tomorrow I have a day off, but later
today I will have my changes tested and pushed to kvm/queue for you
to look at.

Thanks,

Paolo


Attachments:
readers.patch (3.68 kB)

2022-03-03 00:10:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 2/26/22 01:15, Sean Christopherson wrote:
> > Zap defunct roots, a.k.a. roots that have been invalidated after their
> > last reference was initially dropped, asynchronously via the system work
> > queue instead of forcing the work upon the unfortunate task that happened
> > to drop the last reference.
> >
> > If a vCPU task drops the last reference, the vCPU is effectively blocked
> > by the host for the entire duration of the zap. If the root being zapped
> > happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> > being active, the zap can take several hundred seconds. Unsurprisingly,
> > most guests are unhappy if a vCPU disappears for hundreds of seconds.
> >
> > E.g. running a synthetic selftest that triggers a vCPU root zap with
> > ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> > Offloading the zap to a worker drops the block time to <100ms.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
>
> Do we even need kvm_tdp_mmu_zap_invalidated_roots() now? That is,
> something like the following:

Nice! I initially did something similar (moving invalidated roots to a separate
list), but never circled back to the idea after implementing the worker stuff.

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd3625a875ef..5fd8bc858c6f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> {
> lockdep_assert_held(&kvm->slots_lock);
> + /*
> + * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
> + * count. If we're dying, zap everything as it's going to happen
> + * soon anyway.
> + */
> + if (!refcount_read(&kvm->users_count)) {
> + kvm_mmu_zap_all(kvm);
> + return;
> + }

I'd prefer we make this an assertion and shove this logic into set_nx_huge_pages(),
because in that case there's no need to zap anything; the guest can never run
again. E.g. (I'm trying to remember why I didn't do this before...)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..d4d25ab88ae7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6132,7 +6132,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)

list_for_each_entry(kvm, &vm_list, vm_list) {
mutex_lock(&kvm->slots_lock);
- kvm_mmu_zap_all_fast(kvm);
+ if (refcount_read(&kvm->users_count))
+ kvm_mmu_zap_all_fast(kvm);
mutex_unlock(&kvm->slots_lock);

wake_up_process(kvm->arch.nx_lpage_recovery_thread);


> +
> write_lock(&kvm->mmu_lock);
> trace_kvm_mmu_zap_all_fast(kvm);
> @@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_zap_obsolete_pages(kvm);
> write_unlock(&kvm->mmu_lock);
> -
> - /*
> - * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
> - * returning to the caller, e.g. if the zap is in response to a memslot
> - * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> - * associated with the deleted memslot once the update completes, and
> - * Deferring the zap until the final reference to the root is put would
> - * lead to use-after-free.
> - */
> - if (is_tdp_mmu_enabled(kvm)) {
> - read_lock(&kvm->mmu_lock);
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> - read_unlock(&kvm->mmu_lock);
> - }
> }
> static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cd1bf68e7511..af9db5b8f713 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> WARN_ON(!root->tdp_mmu_page);
> /*
> - * The root now has refcount=0 and is valid. Readers cannot acquire
> - * a reference to it (they all visit valid roots only, except for
> - * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
> - * any reference itself.
> + * The root now has refcount=0. It is valid, but readers already
> + * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
> + * rejects it. This remains true for the rest of the execution
> + * of this function, because readers visit valid roots only

One thing that keeps tripping me up is the "readers" verbiage. I get confused
because taking mmu_lock for read vs. write doesn't really have anything to do with
reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
"readers" means anything iterating over the set of roots. Not sure if there's a
shorthand that won't be confusing.

> + * (except for tdp_mmu_zap_root_work(), which however operates only
> + * on one specific root and does not acquire any reference itself).
>
> *
> * Even though there are flows that need to visit all roots for
> * correctness, they all take mmu_lock for write, so they cannot yet

...

> It passes a smoke test, and also resolves the debate on the fate of patch 1.

+1000, I love this approach. Do you want me to work on a v3, or shall I let you
have the honors?

2022-03-03 00:13:26

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 02/28] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Explicitly check for present SPTEs when clearing dirty bits in the TDP
> MMU. This isn't strictly required for correctness, as setting the dirty
> bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
> However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
> complain if anyone actually turned on KVM's MMU debugging.
>
> Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Ben Gardon <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 25148e8b711d..9357780ec28f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1446,6 +1446,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
>
> + if (!is_shadow_present_pte(iter.old_spte))
> + continue;
> +
> if (spte_ad_need_write_protect(iter.old_spte)) {
> if (is_writable_pte(iter.old_spte))
> new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 00:17:38

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 01/28] KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Now that tdp_mmu_next_root() can process both valid and invalid roots,
> extend it to be able to process _only_ invalid roots, add yet another
> iterator macro for walking invalid roots, and use the new macro in
> kvm_tdp_mmu_zap_invalidated_roots().
>
> No functional change intended.
>
> Reviewed-by: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------
> 1 file changed, 26 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index debf08212f12..25148e8b711d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
> +enum tdp_mmu_roots_iter_type {
> + ALL_ROOTS = -1,
> + VALID_ROOTS = 0,
> + INVALID_ROOTS = 1,
> +};

I am wondering what the trick is to start from -1?
> +
> /*
> * Returns the next root after @prev_root (or the first root if @prev_root is
> * NULL). A reference to the returned root is acquired, and the reference to
> @@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> */
> static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> struct kvm_mmu_page *prev_root,
> - bool shared, bool only_valid)
> + bool shared,
> + enum tdp_mmu_roots_iter_type type)
> {
> struct kvm_mmu_page *next_root;
>
> + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> + /* Ensure correctness for the below comparison against role.invalid. */
> + BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS);
> +
> rcu_read_lock();
>
> if (prev_root)
> @@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> typeof(*next_root), link);
>
> while (next_root) {
> - if ((!only_valid || !next_root->role.invalid) &&
> + if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&
> kvm_tdp_mmu_get_root(next_root))
> break;
>
> @@ -151,18 +163,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> * mode. In the unlikely event that this thread must free a root, the lock
> * will be temporarily dropped and reacquired in write mode.
> */
> -#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\
> - for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
> +#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _type) \
> + for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _type); \
> _root; \
> - _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
> - if (kvm_mmu_page_as_id(_root) != _as_id) { \
> + _root = tdp_mmu_next_root(_kvm, _root, _shared, _type)) \
> + if (_as_id > 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> +#define for_each_invalid_tdp_mmu_root_yield_safe(_kvm, _root) \
> + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, -1, true, INVALID_ROOTS)
> +
> #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
> - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
> + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, VALID_ROOTS)
>
> #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
> - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
> + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)
>
> #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
> @@ -810,28 +825,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> kvm_flush_remote_tlbs(kvm);
> }
>
> -static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> - struct kvm_mmu_page *prev_root)
> -{
> - struct kvm_mmu_page *next_root;
> -
> - if (prev_root)
> - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &prev_root->link,
> - typeof(*prev_root), link);
> - else
> - next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - typeof(*next_root), link);
> -
> - while (next_root && !(next_root->role.invalid &&
> - refcount_read(&next_root->tdp_mmu_root_count)))
> - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &next_root->link,
> - typeof(*next_root), link);
> -
> - return next_root;
> -}
> -
> /*
> * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
> * invalidated root, they will not be freed until this function drops the
> @@ -842,36 +835,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> */
> void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> {
> - struct kvm_mmu_page *next_root;
> struct kvm_mmu_page *root;
> bool flush = false;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - rcu_read_lock();
> -
> - root = next_invalidated_root(kvm, NULL);
> -
> - while (root) {
> - next_root = next_invalidated_root(kvm, root);
> -
> - rcu_read_unlock();
> -
> + for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
> flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
>
> /*
> - * Put the reference acquired in
> - * kvm_tdp_mmu_invalidate_roots
> + * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
> + * Note, the iterator holds its own reference.
> */
> kvm_tdp_mmu_put_root(kvm, root, true);
> -
> - root = next_root;
> -
> - rcu_read_lock();
> }
>
> - rcu_read_unlock();
> -
> if (flush)
> kvm_flush_remote_tlbs(kvm);
> }
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 00:19:14

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 20:33, Sean Christopherson wrote:
> > What about that idea? Put roots invalidated by "fast zap" on _another_ list?
> > My very original idea of moving the roots to a separate list didn't work because
> > the roots needed to be reachable by the mmu_notifier. But we could just add
> > another list_head (inside the unsync_child_bitmap union) and add the roots to
> > _that_ list.
>
> Perhaps the "separate list" idea could be extended to have a single worker
> for all kvm_tdp_mmu_put_root() work, and then indeed replace
> kvm_tdp_mmu_zap_invalidated_roots() with a flush of _that_ worker. The
> disadvantage is a little less parallelism in zapping invalidated roots; but
> what is good for kvm_tdp_mmu_zap_invalidated_roots() is just as good for
> kvm_tdp_mmu_put_root(), I suppose. If one wants separate work items, KVM
> could have its own workqueue, and then you flush that workqueue.
>
> For now let's do it the simple but ugly way. Keeping
> next_invalidated_root() does not make things worse than the status quo, and
> further work will be easier to review if it's kept separate from this
> already-complex work.

Oof, that's not gonna work. My approach here in v3 doesn't work either. I finally
remembered why I had the dedicated tdp_mmu_defunct_root flag and thus the smp_mb_*()
dance.

kvm_tdp_mmu_zap_invalidated_roots() assumes that it was gifted a reference to
_all_ invalid roots by kvm_tdp_mmu_invalidate_all_roots(). This works in the
current code base only because kvm->slots_lock is held for the entire duration,
i.e. roots can't become invalid between the end of kvm_tdp_mmu_invalidate_all_roots()
and the end of kvm_tdp_mmu_zap_invalidated_roots().

Marking a root invalid in kvm_tdp_mmu_put_root() breaks that assumption, e.g. if a
new root is created and then dropped, it will be marked invalid but the "fast zap"
will not have a reference. The "defunct" flag prevents this scenario by allowing
the "fast zap" path to identify invalid roots for which it did not take a reference.
By virtue of holding a reference, "fast zap" also guarantees that the roots it needs
to invalidate and put can't become defunct.
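
Spelled out as a sequence, purely to illustrate the broken assumption:

  1. kvm_tdp_mmu_invalidate_all_roots() marks the existing roots invalid and
     takes a reference to each of them.
  2. A vCPU allocates a new root, then drops the last reference to it;
     kvm_tdp_mmu_put_root() marks that root invalid too and temporarily bumps
     its refcount while zapping it.
  3. kvm_tdp_mmu_zap_invalidated_roots() walks _all_ invalid roots and puts a
     reference on each, including the root from step 2 for which it never took
     one, so its puts are no longer balanced against its gets.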

My preference would be to either go back to a variant of v2, or to implement my
"second list" idea.

I also need to figure out why I didn't encounter errors in v3, because I distinctly
remember underflowing the refcount before adding the defunct flag...

2022-03-03 00:31:44

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 21:47, Sean Christopherson wrote:
> > On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> > > For now let's do it the simple but ugly way. Keeping
> > > next_invalidated_root() does not make things worse than the status quo, and
> > > further work will be easier to review if it's kept separate from this
> > > already-complex work.
> >
> > Oof, that's not gonna work. My approach here in v3 doesn't work either. I finally
> > remembered why I had the dedicated tdp_mmu_defunct_root flag and thus the smp_mb_*()
> > dance.
> >
> > kvm_tdp_mmu_zap_invalidated_roots() assumes that it was gifted a reference to
> > _all_ invalid roots by kvm_tdp_mmu_invalidate_all_roots(). This works in the
> > current code base only because kvm->slots_lock is held for the entire duration,
> > i.e. roots can't become invalid between the end of kvm_tdp_mmu_invalidate_all_roots()
> > and the end of kvm_tdp_mmu_zap_invalidated_roots().
>
> Yeah, of course that doesn't work if kvm_tdp_mmu_zap_invalidated_roots()
> calls kvm_tdp_mmu_put_root() and the worker also does the same
> kvm_tdp_mmu_put_root().
>
> But, it seems to me that we were so close to something that works and is
> elegant with the worker idea. It does avoid the possibility of two "puts",
> because the work item is created on the valid->invalid transition. What do
> you think of having a separate workqueue for each struct kvm, so that
> kvm_tdp_mmu_zap_invalidated_roots() can be replaced with a flush?

I definitely like the idea, but I'm getting another feeling of deja vu. Ah, I
think the mess I created was zapping via async worker without a dedicated workqueue,
and so the flush became very annoying/painful.

I have the "dedicated list" idea coded up. If testing looks good, I'll post it as
a v3.5 (without your xchg() magic or other kvm_tdp_mmu_put_root() changes). That
way we have a less-awful backup (and/or an intermediate step) if the workqueue
idea is delayed or doesn't work. Assuming it works, it's much prettier than having
a defunct flag.

> I can probably do it next Friday.

Early-ish warning, I'll be offline March 11th - March 23rd inclusive.

FWIW, other than saving me from another painful rebase, there's no urgent need to
get this series into 5.18.

2022-03-03 00:43:03

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Explicitly ignore the result of zap_gfn_range() when putting the last
> reference to a TDP MMU root, and add a pile of comments to formalize the
> TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this
> only affects the !shared case, as zap_gfn_range() subtly never returns
> true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
>
> Putting the root without a flush is ok because even if there are stale
> references to the root in the TLB, they are unreachable because KVM will
> not run the guest with the same ASID without first flushing (where ASID
> in this context refers to both SVM's explicit ASID and Intel's implicit
> ASID that is constructed from VPID+PCID+EPT4A+etc...).
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 80607513a1f2..5a931c89d27b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5069,6 +5069,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> kvm_mmu_sync_roots(vcpu);
>
> kvm_mmu_load_pgd(vcpu);
> +
> + /*
> + * Flush any TLB entries for the new root, the provenance of the root
> + * is unknown. Even if KVM ensures there are no stale TLB
> + * entries for a freed root, in theory, an out-of-tree hypervisor could
> + * have left stale entries. Flushing on alloc also allows KVM to skip
> + * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
> + */
> static_call(kvm_x86_flush_tlb_current)(vcpu);
> out:
> return r;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 12866113fb4f..e35bd88d92fd 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> list_del_rcu(&root->link);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> - zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> + /*
> + * A TLB flush is not necessary as KVM performs a local TLB flush when
> + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> + * to a different pCPU. Note, the local TLB flush on reuse also
> + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> + * intermediate paging structures, that may be zapped, as such entries
> + * are associated with the ASID on both VMX and SVM.
> + */
> + (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);

Understood that we could avoid the TLB flush here. Just curious why the
"(void)" is needed here? Is it for compile time reason?
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 00:49:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 12866113fb4f..e35bd88d92fd 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > list_del_rcu(&root->link);
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> > - zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > + /*
> > + * A TLB flush is not necessary as KVM performs a local TLB flush when
> > + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > + * to a different pCPU. Note, the local TLB flush on reuse also
> > + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > + * intermediate paging structures, that may be zapped, as such entries
> > + * are associated with the ASID on both VMX and SVM.
> > + */
> > + (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
>
> Understood that we could avoid the TLB flush here. Just curious why the
> "(void)" is needed here? Is it for compile time reason?

Nope, no functional purpose, though there might be some "advanced" warning or
static checkers that care.

The "(void)" is to communicate to human readers that the result is intentionally
ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
the result of zap_gfn_range(). The comment should suffice, but it's nice to have
the code be self-documenting as much as possible.

2022-03-03 01:22:20

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 01/28] KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots

On Wed, Mar 02, 2022, Sean Christopherson wrote:
> On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > Now that tdp_mmu_next_root() can process both valid and invalid roots,
> > > extend it to be able to process _only_ invalid roots, add yet another
> > > iterator macro for walking invalid roots, and use the new macro in
> > > kvm_tdp_mmu_zap_invalidated_roots().
> > >
> > > No functional change intended.
> > >
> > > Reviewed-by: David Matlack <[email protected]>
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------
> > > 1 file changed, 26 insertions(+), 48 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index debf08212f12..25148e8b711d 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > > }
> > >
> > > +enum tdp_mmu_roots_iter_type {
> > > + ALL_ROOTS = -1,
> > > + VALID_ROOTS = 0,
> > > + INVALID_ROOTS = 1,
> > > +};
> >
> > I am wondering what the trick is to start from -1?
>
> -1 is arbitrary, any non-zero value would work. More below.
>
> > > /*
> > > * Returns the next root after @prev_root (or the first root if @prev_root is
> > > * NULL). A reference to the returned root is acquired, and the reference to
> > > @@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > */
> > > static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > > struct kvm_mmu_page *prev_root,
> > > - bool shared, bool only_valid)
> > > + bool shared,
> > > + enum tdp_mmu_roots_iter_type type)
> > > {
> > > struct kvm_mmu_page *next_root;
> > >
> > > + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > > +
> > > + /* Ensure correctness for the below comparison against role.invalid. */
> > > + BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS);
> > > +
> > > rcu_read_lock();
> > >
> > > if (prev_root)
> > > @@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > > typeof(*next_root), link);
> > >
> > > while (next_root) {
> > > - if ((!only_valid || !next_root->role.invalid) &&
> > > + if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&
>
> This is the code that deals with the enums. It's making the type a tri-state,
> where the values of VALID_ROOTS and INVALID_ROOTS align with converting role.invalid
> to a boolean (always '0' or '1') so that they can be directly compared as above.
>
> Any value for ALL_ROOTS (other than '0' or '1' obviously) would work since the
> above logic requires ALL_ROOTS to be explicitly checked first.
>
Yeah, I see that. The other thing that feels strange to me is that VALID_ROOTS
is _0_ while INVALID_ROOTS is _1_. But when I see !!next_root->role.invalid,
that resolves my concern.
> > > kvm_tdp_mmu_get_root(next_root))
> > > break;
> > >
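
A compact way to read that tri-state check (illustration only; the helper name
is made up, the condition is lifted straight from the macro):

        /*
         * type == !!role.invalid:
         *   VALID_ROOTS   (0) matches roots with role.invalid == 0
         *   INVALID_ROOTS (1) matches roots with role.invalid == 1
         *   ALL_ROOTS    (-1) never equals !!role.invalid, so it must be
         *                     (and is) checked explicitly first.
         */
        static bool root_matches(struct kvm_mmu_page *root,
                                 enum tdp_mmu_roots_iter_type type)
        {
                return type == ALL_ROOTS || type == !!root->role.invalid;
        }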

2022-03-03 01:34:05

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 12866113fb4f..e35bd88d92fd 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > list_del_rcu(&root->link);
> > > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > >
> > > - zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > + /*
> > > + * A TLB flush is not necessary as KVM performs a local TLB flush when
> > > + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > > + * to a different pCPU. Note, the local TLB flush on reuse also
> > > + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > > + * intermediate paging structures, that may be zapped, as such entries
> > > + * are associated with the ASID on both VMX and SVM.
> > > + */
> > > + (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> >
> > Understood that we could avoid the TLB flush here. Just curious why the
> > "(void)" is needed here? Is it for compile time reason?
>
> Nope, no functional purpose, though there might be some "advanced" warning or
> static checkers that care.
>
> The "(void)" is to communicate to human readers that the result is intentionally
> ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
> the result of zap_gfn_range(). The comment should suffice, but it's nice to have
> the code be self-documenting as much as possible.

Right, I got the point. Thanks.

Coming back. It seems that I pretended to understand that we should
avoid the TLB flush without really knowing why.

I mean, leaving (part of the) stale TLB entries unflushed will still be
dangerous right? Or am I missing something that guarantees to flush the
local TLB before returning to the guest? For instance,
kvm_mmu_{re,}load()?

2022-03-03 01:41:34

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

On Thu, Mar 03, 2022, Mingwei Zhang wrote:
> On Thu, Mar 03, 2022, Sean Christopherson wrote:
> > On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index 12866113fb4f..e35bd88d92fd 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > list_del_rcu(&root->link);
> > > > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > >
> > > > - zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > > + /*
> > > > + * A TLB flush is not necessary as KVM performs a local TLB flush when
> > > > + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > > > + * to a different pCPU. Note, the local TLB flush on reuse also
> > > > + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > > > + * intermediate paging structures, that may be zapped, as such entries
> > > > + * are associated with the ASID on both VMX and SVM.
> > > > + */
> > > > + (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > >
> > > Understood that we could avoid the TLB flush here. Just curious why the
> > > "(void)" is needed here? Is it for compile time reason?
> >
> > Nope, no functional purpose, though there might be some "advanced" warning or
> > static checkers that care.
> >
> > The "(void)" is to communicate to human readers that the result is intentionally
> > ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
> > the result of zap_gfn_range(). The comment should suffice, but it's nice to have
> > the code be self-documenting as much as possible.
>
> Right, I got the point. Thanks.
>
> Coming back. It seems that I pretended to understand that we should
> avoid the TLB flush without really knowing why.
>
> I mean, leaving (part of the) stale TLB entries unflushed will still be
> dangerous right? Or am I missing something that guarantees to flush the
> local TLB before returning to the guest? For instance,
> kvm_mmu_{re,}load()?

Heh, if SVM's ASID management wasn't a mess[*], it'd be totally fine. The idea,
and what EPT architectures mandates, is that each TDP root is associated with an
ASID. So even though there may be stale entries in the TLB for a root, because
that root is no longer used those stale entries are unreachable. And if KVM ever
happens to reallocate the same physical page for a root, that's ok because KVM must
be paranoid and flush that root (see code comment in this patch).

What we're missing on SVM is proper ASID handling. If KVM uses ASIDs the way AMD
intends them to be used, then this works as intended because each root is again
associated with a specific ASID, and KVM just needs to flush when (re)allocating
a root and when reusing an ASID (which it already handles).

[*] https://lore.kernel.org/all/Yh%[email protected]

2022-03-03 06:10:27

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Mingwei Zhang wrote:
> > On Thu, Mar 03, 2022, Sean Christopherson wrote:
> > > On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > > > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > index 12866113fb4f..e35bd88d92fd 100644
> > > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > > list_del_rcu(&root->link);
> > > > > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > > >
> > > > > - zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > > > + /*
> > > > > + * A TLB flush is not necessary as KVM performs a local TLB flush when
> > > > > + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > > > > + * to a different pCPU. Note, the local TLB flush on reuse also
> > > > > + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > > > > + * intermediate paging structures, that may be zapped, as such entries
> > > > > + * are associated with the ASID on both VMX and SVM.
> > > > > + */
> > > > > + (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > >
> > > > Understood that we could avoid the TLB flush here. Just curious why the
> > > > "(void)" is needed here? Is it for compile time reason?
> > >
> > > Nope, no functional purpose, though there might be some "advanced" warning or
> > > static checkers that care.
> > >
> > > The "(void)" is to communicate to human readers that the result is intentionally
> > > ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
> > > the result of zap_gfn_range(). The comment should suffice, but it's nice to have
> > > the code be self-documenting as much as possible.
> >
> > Right, I got the point. Thanks.
> >
> > Coming back. It seems that I pretended to understand that we should
> > avoid the TLB flush without really knowing why.
> >
> > I mean, leaving (part of the) stale TLB entries unflushed will still be
> > dangerous right? Or am I missing something that guarantees to flush the
> > local TLB before returning to the guest? For instance,
> > kvm_mmu_{re,}load()?
>
> Heh, if SVM's ASID management wasn't a mess[*], it'd be totally fine. The idea,
> and what the EPT architecture mandates, is that each TDP root is associated with an
> ASID. So even though there may be stale entries in the TLB for a root, because
> that root is no longer used those stale entries are unreachable. And if KVM ever
> happens to reallocate the same physical page for a root, that's ok because KVM must
> be paranoid and flush that root (see code comment in this patch).
>
> What we're missing on SVM is proper ASID handling. If KVM uses ASIDs the way AMD
> intends them to be used, then this works as intended because each root is again
> associated with a specific ASID, and KVM just needs to flush when (re)allocating
> a root and when reusing an ASID (which it already handles).
>
> [*] https://lore.kernel.org/all/Yh%[email protected]

Oh, putting AMD issues aside for now.

I think I was too narrowly focused on the zapping logic previously. So,
I originally thought that anytime we want to zap, we have to do the following
things in strict order:

1) zap SPTEs.
2) flush TLBs.
3) flush cache (AMD SEV only).
4) deallocate shadow pages.

However, if you have already invalidated EPTP (pgd ptr), then step 2)
becomes optional, since those stale TLBs are no longer useable by the
guest due to the change of ASID.

Am I understanding the point correctly? So, for all invalidated roots,
the assumption is that we have already called "kvm_reload_remote_mmus()",
which basically updates the ASID.

2022-03-03 17:25:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

On Thu, Mar 03, 2022, Mingwei Zhang wrote:
> On Thu, Mar 03, 2022, Sean Christopherson wrote:
> > Heh, if SVM's ASID management wasn't a mess[*], it'd be totally fine. The idea,
> > and what the EPT architecture mandates, is that each TDP root is associated with an
> > ASID. So even though there may be stale entries in the TLB for a root, because
> > that root is no longer used those stale entries are unreachable. And if KVM ever
> > happens to reallocate the same physical page for a root, that's ok because KVM must
> > be paranoid and flush that root (see code comment in this patch).
> >
> > What we're missing on SVM is proper ASID handling. If KVM uses ASIDs the way AMD
> > intends them to be used, then this works as intended because each root is again
> > associated with a specific ASID, and KVM just needs to flush when (re)allocating
> > a root and when reusing an ASID (which it already handles).
> >
> > [*] https://lore.kernel.org/all/Yh%[email protected]
>
> Oh, putting AMD issues aside for now.
>
> I think I was too narrowly focused on the zapping logic previously. So,
> I originally thought that anytime we want to zap, we have to do the following
> things in strict order:
>
> 1) zap SPTEs.
> 2) flush TLBs.
> 3) flush cache (AMD SEV only).
> 4) deallocate shadow pages.

Not necessarily. 1-3 are actually all optional. E.g. for #1, if KVM somehow
knew that the host didn't care about A/D bits (no writeback needed, no LRU info
needed), then KVM could skip straight to freeing the shadow pages when destroying
a VM.

Flushing the TLB before freeing pages is optional because KVM only needs to ensure
the guest can no longer access the memory. E.g. at kvm_mmu_notifier_release(),
because KVM disallows KVM_RUN from a different mm, KVM knows that the guest will
never run again and so can skip the TLB flushes.

For the TLB, that does mean KVM needs to flush when using an ASID/EPT4A for the
first time, but KVM needs to do that regardless to guard against a different
hypervisor being loaded previously (where a "different" hypervisor could very
well be an older, buggier version of KVM).

> However, if you have already invalidated EPTP (pgd ptr), then step 2)
> becomes optional, since those stale TLBs are no longer useable by the
> guest due to the change of ASID.

Mostly. It doesn't require an "invalidated EPTP", just a different EPT4A (or
ASID on SVM).

> Am I understanding the point correctly? So, for all invalidated roots,
> the assumption is that we have already called "kvm_reload_remote_mmus()",
> which basically updates the ASID.

No, the assumption (though I'd describe it as a requirement) is that vCPUs can no
longer consume the TLB entries. That could be due to a reload, but as above it
could also be due to KVM knowing KVM_RUN is unreachable.

2022-03-03 19:10:05

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 09/28] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Drop RCU protection after processing each root when handling MMU notifier
> hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily
> drop RCU to let RCU do its thing between roots, and to make it clear that
> there's no special behavior that relies on holding RCU across all roots.
>
> Currently, the RCU protection is completely superficial, it's necessary
> only to make rcu_dereference() of SPTE pointers happy. A future patch
> will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
> ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
> rather, acknowledge the need for a flush), but in that case RCU needs to
> be held until the flush is complete if and only if the flush is needed
> because a shadow page may have been removed. And except for the "unmap"
> path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
> and can't change the PFN for a SP).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Ben Gardon <[email protected]>

Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 634a2838e117..4f460782a848 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1100,18 +1100,18 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> struct tdp_iter iter;
> bool ret = false;
>
> - rcu_read_lock();
> -
> /*
> * Don't support rescheduling, none of the MMU notifiers that funnel
> * into this helper allow blocking; it'd be dead, wasteful code.
> */
> for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> + rcu_read_lock();
> +
> tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> ret |= handler(kvm, &iter, range);
> - }
>
> - rcu_read_unlock();
> + rcu_read_unlock();
> + }
>
> return ret;
> }
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 19:22:03

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 11/28] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
> which is called out by the comment as being disallowed but not actually
> checked. Keep the WARN on the old_spte as well, because overwriting a
> REMOVED_SPTE in the non-atomic path is also disallowed (as evidence by
> lack of splats with the existing WARN).
>
> Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
> Cc: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Ben Gardon <[email protected]>

Reviewed-by: Mingwei Zhang <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8fbf3364f116..1dcdf1a4fcc1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -640,13 +640,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> - * No thread should be using this function to set SPTEs to the
> + * No thread should be using this function to set SPTEs to or from the
> * temporary removed SPTE value.
> * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
> * should be used. If operating under the MMU lock in write mode, the
> * use of the removed SPTE should not be necessary.
> */
> - WARN_ON(is_removed_spte(iter->old_spte));
> + WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
>
> kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
>
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 19:40:05

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 14/28] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
> uses the slow version of "zap everything" is when the VM is being
> destroyed or the owning mm has exited. In either case, KVM_RUN is
> unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c231b60e1726..87706e9cc6f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -874,14 +874,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>
> void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> {
> - bool flush = false;
> int i;
>
> + /*
> + * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
> + * is being destroyed or the userspace VMM has exited. In both cases,
> + * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> + */
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> - flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
> -
> - if (flush)
> - kvm_flush_remote_tlbs(kvm);
> + (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
> }
>
> /*
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 19:40:35

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 07/28] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Look for a !leaf=>leaf conversion instead of a PFN change when checking
> if a SPTE change removed a TDP MMU shadow page. Convert the PFN check
> into a WARN, as KVM should never change the PFN of a shadow page (except
> when it's being zapped or replaced).
>
> From a purely theoretical perspective, it's not illegal to replace a SP
> with a hugepage pointing at the same PFN. In practice, it's impossible
> as that would require mapping guest memory overtop a kernel-allocated SP.
> Either way, the check is odd.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 189f21e71c36..848448b65703 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -505,9 +505,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>
> /*
> * Recursively handle child PTs if the change removed a subtree from
> - * the paging structure.
> + * the paging structure. Note the WARN on the PFN changing without the
> + * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> + * pages are kernel allocations and should never be migrated.
> */
> - if (was_present && !was_leaf && (pfn_changed || !is_present))
> + if (was_present && !was_leaf &&
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> }
>
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-03 22:44:48

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 09/28] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Drop RCU protection after processing each root when handling MMU notifier
> hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily
> drop RCU to let RCU do its thing between roots, and to make it clear that
> there's no special behavior that relies on holding RCU across all roots.
>
> Currently, the RCU protection is completely superficial, it's necessary
> only to make rcu_dereference() of SPTE pointers happy. A future patch
> will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
> ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
> rather, acknowledge the need for a flush), but in that case RCU needs to
> be held until the flush is complete if and only if the flush is needed
> because a shadow page may have been removed. And except for the "unmap"
> path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
> and can't change the PFN for a SP).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Ben Gardon <[email protected]>

Reviewed-by: Mingwei Zhang <[email protected]>

> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 634a2838e117..4f460782a848 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1100,18 +1100,18 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> struct tdp_iter iter;
> bool ret = false;
>
> - rcu_read_lock();
> -
> /*
> * Don't support rescheduling, none of the MMU notifiers that funnel
> * into this helper allow blocking; it'd be dead, wasteful code.
> */
> for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> + rcu_read_lock();
> +
> tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> ret |= handler(kvm, &iter, range);
> - }
>
> - rcu_read_unlock();
> + rcu_read_unlock();
> + }
>
> return ret;
> }
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-04 02:08:40

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 10/28] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Add helpers to read and write TDP MMU SPTEs instead of open coding
> rcu_dereference() all over the place, and to provide a convenient
> location to document why KVM doesn't exempt holding mmu_lock for write
> from having to hold RCU (and any future changes to the rules).
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Ben Gardon <[email protected]>

Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_iter.c | 6 +++---
> arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
> 3 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index be3f096db2eb..6d3b3e5a5533 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> {
> iter->sptep = iter->pt_path[iter->level - 1] +
> SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> - iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> + iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
> }
>
> static gfn_t round_gfn_for_level(gfn_t gfn, int level)
> @@ -89,7 +89,7 @@ static bool try_step_down(struct tdp_iter *iter)
> * Reread the SPTE before stepping down to avoid traversing into page
> * tables that are no longer linked from this entry.
> */
> - iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> + iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
>
> child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> if (!child_pt)
> @@ -123,7 +123,7 @@ static bool try_step_side(struct tdp_iter *iter)
> iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
> iter->next_last_level_gfn = iter->gfn;
> iter->sptep++;
> - iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> + iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
>
> return true;
> }
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 216ebbe76ddd..bb9b581f1ee4 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -9,6 +9,22 @@
>
> typedef u64 __rcu *tdp_ptep_t;
>
> +/*
> + * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
> + * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
> + * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
> + * the lower depths of the TDP MMU just to make lockdep happy is a nightmare, so
> + * all accesses to SPTEs are done under RCU protection.
> + */
> +static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
> +{
> + return READ_ONCE(*rcu_dereference(sptep));
> +}
> +static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
> +{
> + WRITE_ONCE(*rcu_dereference(sptep), val);
> +}
> +
> /*
> * A TDP iterator performs a pre-order walk over a TDP paging structure.
> */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4f460782a848..8fbf3364f116 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -609,7 +609,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * here since the SPTE is going from non-present
> * to non-present.
> */
> - WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
> + kvm_tdp_mmu_write_spte(iter->sptep, 0);
>
> return 0;
> }
> @@ -648,7 +648,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> */
> WARN_ON(is_removed_spte(iter->old_spte));
>
> - WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
> + kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
>
> __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> new_spte, iter->level, false);
> @@ -1046,7 +1046,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> * because the new value informs the !present
> * path below.
> */
> - iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> + iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> }
>
> if (!is_shadow_present_pte(iter.old_spte)) {
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-04 05:18:33

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v3 08/28] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte

On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Batch TLB flushes (with other MMUs) when handling ->change_spte()
> notifications in the TDP MMU. The MMU notifier path in question doesn't
> allow yielding and correcty flushes before dropping mmu_lock.
nit: correctly
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Ben Gardon <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 848448b65703..634a2838e117 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1203,13 +1203,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> */
> bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> - bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> -
> - /* FIXME: return 'flush' instead of flushing here. */
> - if (flush)
> - kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
> -
> - return false;
> + /*
> + * No need to handle the remote TLB flush under RCU protection, the
> + * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
> + * shadow page. See the WARN on pfn_changed in __handle_changed_spte().
> + */
> + return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> }
>
> /*
> --
> 2.35.1.574.g5d30c73bfb-goog
>