Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
number of TLB flushes, fix soft lockups and RCU stalls, avoid blocking
vCPUs for long durations while zapping paging structure, and to clean up
the zapping code.
Patches 01-03 were allegedly queued when posted separately, but they haven't
showed up yet, and this series depends/conflicts on/with them, so here they
are again.
Based on kvm/queue-5.17, commit 1c4261809af0 ("KVM: SVM: include CR3 ...").
The largest cleanup is to separate the flows for zapping roots (zap
_everything_), zapping leaf SPTEs (zap guest mappings for whatever reason),
and zapping a specific SP (NX recovery). They're currently smushed into a
single zap_gfn_range(), which was a good idea at the time, but became a
mess when trying to handle the different rules, e.g. TLB flushes aren't
needed when zapping a root because KVM can safely zap a root if and only
if it's unreachable.
To solve the soft lockups, stalls, and vCPU performance issues:
- Defer remote TLB flushes to the caller when zapping TDP MMU shadow
pages by relying on RCU to ensure the paging structure isn't freed
until all vCPUs have exited the guest.
- Allowing yielding when zapping TDP MMU roots in response to the root's
last reference being put. This requires a bit of trickery to ensure
the root is reachable via mmu_notifier, but it's not too gross.
- Zap roots in two passes to avoid holding RCU for potential hundreds of
seconds when zapping guest with terabytes of memory that is backed
entirely by 4kb SPTEs.
- Zap defunct roots asynchronously via the common work_queue so that a
vCPU doesn't get stuck doing the work if the vCPU happens to drop the
last reference to a root.
The selftest at the end allows populating a guest with the max amount of
memory allowed by the underlying architecture. The most I've tested is
~64tb (MAXPHYADDR=46) as I don't have easy access to a system with
MAXPHYADDR=52. The selftest compiles on arm64 and s390x, but otherwise
hasn't been tested outside of x86-64. It will hopefully do something
useful as is, but there's a non-zero chance it won't get past init with
a high max memory. Running on x86 without the TDP MMU is comically slow.
v2:
- Drop patches that were applied.
- Collect reviews for patches that weren't modified. [Ben]
- Abandon the idea of taking invalid roots off the list of roots.
- Add a patch to fix misleading/wrong comments with respect to KVM's
responsibilities in the "fast zap" flow, specifically that all SPTEs
must be dropped before the zap completes.
- Rework yielding in kvm_tdp_mmu_put_root() to keep the root visibile
while yielding.
- Add patch to zap roots in two passes. [Mingwei, David]
- Add a patch to asynchronously zap defunct roots.
- Add the selftest.
v1: https://lore.kernel.org/all/[email protected]
Sean Christopherson (30):
KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap
hook
KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root()
KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU
KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots
KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP
MMU
KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush
logic
KVM: x86/mmu: Document that zapping invalidated roots doesn't need to
flush
KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root()
KVM: x86/mmu: Require mmu_lock be held for write in unyielding root
iter
KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP
removal
KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier
change_spte
KVM: x86/mmu: Drop RCU after processing each root in MMU notifier
hooks
KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw
vals
KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU
resched
KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow
pages
KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU
root
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
KVM: x86/mmu: Zap defunct roots via asynchronous worker
KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
KVM: selftests: Split out helper to allocate guest mem via memfd
KVM: selftests: Define cpu_relax() helpers for s390 and x86
KVM: selftests: Add test to populate a VM with the max possible guest
mem
arch/x86/kvm/mmu/mmu.c | 42 +-
arch/x86/kvm/mmu/mmu_internal.h | 16 +-
arch/x86/kvm/mmu/tdp_iter.c | 6 +-
arch/x86/kvm/mmu/tdp_iter.h | 15 +-
arch/x86/kvm/mmu/tdp_mmu.c | 642 ++++++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 32 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 6 +
.../selftests/kvm/include/s390x/processor.h | 8 +
.../selftests/kvm/include/x86_64/processor.h | 5 +
tools/testing/selftests/kvm/lib/kvm_util.c | 66 +-
.../selftests/kvm/max_guest_memory_test.c | 292 ++++++++
.../selftests/kvm/set_memory_region_test.c | 35 +-
14 files changed, 870 insertions(+), 299 deletions(-)
create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c
--
2.34.1.448.ga2b2bfdf31-goog
Use the common TDP MMU zap helper when handling an MMU notifier unmap
event, the two flows are semantically identical. Consolidate the code in
preparation for a future bug fix, as both kvm_tdp_mmu_unmap_gfn_range()
and __kvm_tdp_mmu_zap_gfn_range() are guilty of not zapping SPTEs in
invalid roots.
No functional change intended.
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b1bc816b7c3..d320b56d5cd7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1032,13 +1032,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
- struct kvm_mmu_page *root;
-
- for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
- flush = zap_gfn_range(kvm, root, range->start, range->end,
- range->may_block, flush, false);
-
- return flush;
+ return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
+ range->end, range->may_block, flush);
}
typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
--
2.34.1.448.ga2b2bfdf31-goog
Move the check for an invalid root out kvm_tdp_mmu_get_root() and into
the one place it actually matters, tdp_mmu_next_root(), as the other user
already has an implicit validity check. A future bug fix will need to
get references to invalid roots to honor mmu_notifier requests, there's
no point in forcing what will be a common path to open code getting a
reference to a root.
No functional change intended.
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 ---
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d320b56d5cd7..200001190fcf 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -121,9 +121,14 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
typeof(*next_root), link);
- while (next_root && !kvm_tdp_mmu_get_root(kvm, next_root))
+ while (next_root) {
+ if (!next_root->role.invalid &&
+ kvm_tdp_mmu_get_root(kvm, next_root))
+ break;
+
next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
&next_root->link, typeof(*next_root), link);
+ }
rcu_read_unlock();
@@ -200,7 +205,10 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
- /* Check for an existing root before allocating a new one. */
+ /*
+ * Check for an existing root before allocating a new one. Note, the
+ * role check prevents consuming an invalid root.
+ */
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
if (root->role.word == role.word &&
kvm_tdp_mmu_get_root(kvm, root))
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3899004a5d91..08c917511fed 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,9 +10,6 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
struct kvm_mmu_page *root)
{
- if (root->role.invalid)
- return false;
-
return refcount_inc_not_zero(&root->tdp_mmu_root_count);
}
--
2.34.1.448.ga2b2bfdf31-goog
Zap both valid and invalid roots when zapping/unmapping a gfn range, as
KVM must ensure it holds no references to the freed page after returning
from the unmap operation. Most notably, thee TDP MMU doesn't zap invalid
roots in mmu_notifier callbacks. This leads to use-after-free and other
issues if the mmu_notifier runs to completion while an invalid root
zapper yields as KVM fails to honor the requirement that there must be
_no_ references to the page after the mmu_notifier returns.
The bug is most easily reproduced by hacking KVM to cause a collision
between set_nx_huge_pages() + kvm_mmu_notifier_release(), but the bug
exists between kvm_mmu_notifier_invalidate_range_start() and memslot
updates as well. Invalidating a root ensure pages aren't accessible by
the guest, and KVM won't read or write page data itself, but KVM will
trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, and thus completing
a zap _after_ the mmu_notifier returns is fatal.
WARNING: CPU: 24 PID: 1496 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:173 [kvm]
RIP: 0010:kvm_is_zone_device_pfn+0x96/0xa0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0xa8/0xe0 [kvm]
__handle_changed_spte+0x2ab/0x5e0 [kvm]
__handle_changed_spte+0x2ab/0x5e0 [kvm]
__handle_changed_spte+0x2ab/0x5e0 [kvm]
zap_gfn_range+0x1f3/0x310 [kvm]
kvm_tdp_mmu_zap_invalidated_roots+0x50/0x90 [kvm]
kvm_mmu_zap_all_fast+0x177/0x1a0 [kvm]
set_nx_huge_pages+0xb4/0x190 [kvm]
param_attr_store+0x70/0x100
module_attr_store+0x19/0x30
kernfs_fop_write_iter+0x119/0x1b0
new_sync_write+0x11c/0x1b0
vfs_write+0x1cc/0x270
ksys_write+0x5f/0xe0
do_syscall_64+0x38/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: [email protected]
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 200001190fcf..577985fa001d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -99,15 +99,18 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
}
/*
- * Finds the next valid root after root (or the first valid root if root
- * is NULL), takes a reference on it, and returns that next root. If root
- * is not NULL, this thread should have already taken a reference on it, and
- * that reference will be dropped. If no valid root is found, this
- * function will return NULL.
+ * Returns the next root after @prev_root (or the first root if @prev_root is
+ * NULL). A reference to the returned root is acquired, and the reference to
+ * @prev_root is released (the caller obviously must hold a reference to
+ * @prev_root if it's non-NULL).
+ *
+ * If @only_valid is true, invalid roots are skipped.
+ *
+ * Returns NULL if the end of tdp_mmu_roots was reached.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
- bool shared)
+ bool shared, bool only_valid)
{
struct kvm_mmu_page *next_root;
@@ -122,7 +125,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);
while (next_root) {
- if (!next_root->role.invalid &&
+ if ((!only_valid || !next_root->role.invalid) &&
kvm_tdp_mmu_get_root(kvm, next_root))
break;
@@ -148,13 +151,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* mode. In the unlikely event that this thread must free a root, the lock
* will be temporarily dropped and reacquired in write mode.
*/
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- for (_root = tdp_mmu_next_root(_kvm, NULL, _shared); \
- _root; \
- _root = tdp_mmu_next_root(_kvm, _root, _shared)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
+ _root; \
+ _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
+ if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else
+#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
+
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
+
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
lockdep_is_held_type(&kvm->mmu_lock, 0) || \
@@ -1224,7 +1233,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
lockdep_assert_held_read(&kvm->mmu_lock);
- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages, min_level);
@@ -1294,7 +1303,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
lockdep_assert_held_read(&kvm->mmu_lock);
- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages);
@@ -1419,7 +1428,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
lockdep_assert_held_read(&kvm->mmu_lock);
- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
zap_collapsible_spte_range(kvm, root, slot);
}
--
2.34.1.448.ga2b2bfdf31-goog
Explicitly check for present SPTEs when clearing dirty bits in the TDP
MMU. This isn't strictly required for correctness, as setting the dirty
bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
complain if anyone actually turned on KVM's MMU debugging.
Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 41e975841ea6..fcbae282af6f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1239,6 +1239,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
if (spte_ad_need_write_protect(iter.old_spte)) {
if (is_writable_pte(iter.old_spte))
new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
--
2.34.1.448.ga2b2bfdf31-goog
Now that tdp_mmu_next_root() can process both valid and invalid roots,
extend it to be able to process _only_ invalid roots, add yet another
iterator macro for walking invalid roots, and use the new macro in
kvm_tdp_mmu_zap_invalidated_roots().
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------
1 file changed, 26 insertions(+), 48 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 577985fa001d..41e975841ea6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
+enum tdp_mmu_roots_iter_type {
+ ALL_ROOTS = -1,
+ VALID_ROOTS = 0,
+ INVALID_ROOTS = 1,
+};
+
/*
* Returns the next root after @prev_root (or the first root if @prev_root is
* NULL). A reference to the returned root is acquired, and the reference to
@@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root,
- bool shared, bool only_valid)
+ bool shared,
+ enum tdp_mmu_roots_iter_type type)
{
struct kvm_mmu_page *next_root;
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+ /* Ensure correctness for the below comparison against role.invalid. */
+ BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS);
+
rcu_read_lock();
if (prev_root)
@@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
typeof(*next_root), link);
while (next_root) {
- if ((!only_valid || !next_root->role.invalid) &&
+ if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&
kvm_tdp_mmu_get_root(kvm, next_root))
break;
@@ -151,18 +163,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* mode. In the unlikely event that this thread must free a root, the lock
* will be temporarily dropped and reacquired in write mode.
*/
-#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\
- for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _type) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _type); \
_root; \
- _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+ _root = tdp_mmu_next_root(_kvm, _root, _shared, _type)) \
+ if (_as_id > 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
} else
+#define for_each_invalid_tdp_mmu_root_yield_safe(_kvm, _root) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, -1, true, INVALID_ROOTS)
+
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, VALID_ROOTS)
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
@@ -811,28 +826,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
kvm_flush_remote_tlbs(kvm);
}
-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
- struct kvm_mmu_page *prev_root)
-{
- struct kvm_mmu_page *next_root;
-
- if (prev_root)
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &prev_root->link,
- typeof(*prev_root), link);
- else
- next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- typeof(*next_root), link);
-
- while (next_root && !(next_root->role.invalid &&
- refcount_read(&next_root->tdp_mmu_root_count)))
- next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
- &next_root->link,
- typeof(*next_root), link);
-
- return next_root;
-}
-
/*
* Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
* invalidated root, they will not be freed until this function drops the
@@ -843,36 +836,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
*/
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
- struct kvm_mmu_page *next_root;
struct kvm_mmu_page *root;
bool flush = false;
lockdep_assert_held_read(&kvm->mmu_lock);
- rcu_read_lock();
-
- root = next_invalidated_root(kvm, NULL);
-
- while (root) {
- next_root = next_invalidated_root(kvm, root);
-
- rcu_read_unlock();
-
+ for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
/*
- * Put the reference acquired in
- * kvm_tdp_mmu_invalidate_roots
+ * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
+ * Note, the iterator holds its own reference.
*/
kvm_tdp_mmu_put_root(kvm, root, true);
-
- root = next_root;
-
- rcu_read_lock();
}
- rcu_read_unlock();
-
if (flush)
kvm_flush_remote_tlbs(kvm);
}
--
2.34.1.448.ga2b2bfdf31-goog
Fix misleading and arguably wrong comments in the TDP MMU's fast zap
flow. The comments, and the fact that actually zapping invalid roots was
added separately, strongly suggests that zapping invalid roots is an
optimization and not required for correctness. That is a lie.
KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
KVM is relying on it to fully remove all references to the memslot. Once
the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
stale references as the hva=>gfn translation is done via the memslots.
If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
deleting a memslot, KVM will fail to zap in response to the mmu_notifier
due to not finding a memslot corresponding to the notifier's range, which
leads to a variation of use-after-free.
The other misleading comment (and code) explicitly states that roots
without a reference should be skipped. While that's technically true,
it's also extremely misleading as it should be impossible for KVM to
encounter a defunct root on the list while holding mmu_lock for write.
Opportunstically add a WARN to enforce that invariant.
Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Fixes: 4c6654bd160d ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +++++++
arch/x86/kvm/mmu/tdp_mmu.c | 46 +++++++++++++++++++++-----------------
2 files changed, 33 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d275e9d76b5..94590bc97a67 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5692,6 +5692,14 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
write_unlock(&kvm->mmu_lock);
+ /*
+ * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
+ * returning to the caller, e.g. if the zap is in response to a memslot
+ * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
+ * associated with the deleted memslot once the update completes, and
+ * Deferring the zap until the final reference to the root is put would
+ * lead to use-after-free.
+ */
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
kvm_tdp_mmu_zap_invalidated_roots(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fcbae282af6f..4f5c8e7380a9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -827,12 +827,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
}
/*
- * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
- * invalidated root, they will not be freed until this function drops the
- * reference. Before dropping that reference, tear down the paging
- * structure so that whichever thread does drop the last reference
- * only has to do a trivial amount of work. Since the roots are invalid,
- * no new SPTEs should be created under them.
+ * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
+ * zap" completes. Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
+ * reference to each invalidated root, roots will not be freed until after this
+ * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
+ * tearing paging structures.
*/
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
@@ -856,21 +855,25 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
}
/*
- * Mark each TDP MMU root as invalid so that other threads
- * will drop their references and allow the root count to
- * go to 0.
+ * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
+ * is about to be zapped, e.g. in response to a memslots update. The caller is
+ * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to the actual
+ * zapping.
*
- * Also take a reference on all roots so that this thread
- * can do the bulk of the work required to free the roots
- * once they are invalidated. Without this reference, a
- * vCPU thread might drop the last reference to a root and
- * get stuck with tearing down the entire paging structure.
+ * Take a reference on all roots to prevent the root from being freed before it
+ * is zapped by this thread. Freeing a root is not a correctness issue, but if
+ * a vCPU drops the last reference to a root prior to the root being zapped, it
+ * will get stuck with tearing down the entire paging structure.
*
- * Roots which have a zero refcount should be skipped as
- * they're already being torn down.
- * Already invalid roots should be referenced again so that
- * they aren't freed before kvm_tdp_mmu_zap_all_fast is
- * done with them.
+ * Get a reference even if the root is already invalid,
+ * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
+ * invalid roots, e.g. there's no epoch to identify roots that were invalidated
+ * by a previous call. Roots stay on the list until the last reference is
+ * dropped, so even though all invalid roots are zapped, a root may not go away
+ * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
+ *
+ * Because mmu_lock is held for write, it should be impossible to observe a
+ * root with zero refcount, i.e. the list of roots cannot be stale.
*
* This has essentially the same effect for the TDP MMU
* as updating mmu_valid_gen does for the shadow MMU.
@@ -880,9 +883,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
struct kvm_mmu_page *root;
lockdep_assert_held_write(&kvm->mmu_lock);
- list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
- if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(kvm, root)))
root->role.invalid = true;
+ }
}
/*
--
2.34.1.448.ga2b2bfdf31-goog
Assert that mmu_lock is held for write by users of the yield-unfriendly
TDP iterator. The nature of a shared walk means that the caller needs to
play nice with other tasks modifying the page tables, which is more or
less the same thing as playing nice with yielding. Theoretically, KVM
could gain a flow where it could legitimately take mmu_lock for read in
a non-preemptible context, but that's highly unlikely and any such case
should be viewed with a fair amount of scrutiny.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cd093fa73d14..3b13249bbbe1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,13 +29,16 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
return true;
}
-static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
+/* Arbitrarily returns true so that this may be used in if statements. */
+static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
bool shared)
{
if (shared)
lockdep_assert_held_read(&kvm->mmu_lock);
else
lockdep_assert_held_write(&kvm->mmu_lock);
+
+ return true;
}
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
@@ -187,11 +190,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)
-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
- lockdep_is_held_type(&kvm->mmu_lock, 0) || \
- lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+/*
+ * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
+ * the implication being that any flow that holds mmu_lock for read is
+ * inherently yield-friendly and should use the yielf-safe variant above.
+ * Holding mmu_lock for write obviates the need for RCU protection as the list
+ * is guaranteed to be stable.
+ */
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
+ kvm_mmu_page_as_id(_root) != _as_id) { \
} else
static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
--
2.34.1.448.ga2b2bfdf31-goog
Drop the unused @kvm param from kvm_tdp_mmu_get_root(). No functional
change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 +--
2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 87785dce1bd4..cd093fa73d14 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -146,7 +146,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
while (next_root) {
if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&
- kvm_tdp_mmu_get_root(kvm, next_root))
+ kvm_tdp_mmu_get_root(next_root))
break;
next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
@@ -243,7 +243,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
*/
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
if (root->role.word == role.word &&
- kvm_tdp_mmu_get_root(kvm, root))
+ kvm_tdp_mmu_get_root(root))
goto out;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 08c917511fed..6b9bdd652bca 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,8 +7,7 @@
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
-__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
- struct kvm_mmu_page *root)
+__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
{
return refcount_inc_not_zero(&root->tdp_mmu_root_count);
}
--
2.34.1.448.ga2b2bfdf31-goog
Remove the misleading flush "handling" when zapping invalidated TDP MMU
roots, and document that flushing is unnecessary for all flavors of MMUs
when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU
is dead code, as zap_gfn_range() is called with shared=true, in which
case it will never return true due to the flushing being handled by
tdp_mmu_zap_spte_atomic().
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
2 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6549c13e89d9..f660906c8230 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5645,9 +5645,13 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
}
/*
- * Trigger a remote TLB flush before freeing the page tables to ensure
- * KVM is not in the middle of a lockless shadow page table walk, which
- * may reference the pages.
+ * Kick all vCPUs (via remote TLB flush) before freeing the page tables
+ * to ensure KVM is not in the middle of a lockless shadow page table
+ * walk, which may reference the pages. The remote TLB flush itself is
+ * not required and is simply a convenient way to kick vCPUs as needed.
+ * KVM performs a local TLB flush when allocating a new root (see
+ * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
+ * running with an obsolete MMU.
*/
kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 66b75c197c94..87785dce1bd4 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -844,12 +844,20 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
- bool flush = false;
lockdep_assert_held_read(&kvm->mmu_lock);
for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
- flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
+ /*
+ * A TLB flush is unnecessary, invalidated roots are guaranteed
+ * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
+ * for more details), and unlike the legacy MMU, no vCPU kick
+ * is needed to play nice with lockless shadow walks as the TDP
+ * MMU protects its paging structures via RCU. Note, zapping
+ * will still flush on yield, but that's a minor performance
+ * blip and not a functional issue.
+ */
+ (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
/*
* Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
@@ -857,9 +865,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
*/
kvm_tdp_mmu_put_root(kvm, root, true);
}
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
}
/*
--
2.34.1.448.ga2b2bfdf31-goog
Explicitly ignore the result of zap_gfn_range() when putting the last
reference to a TDP MMU root, and add a pile of comments to formalize the
TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this
only affects the !shared case, as zap_gfn_range() subtly never returns
true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
Putting the root without a flush is ok because even if there are stale
references to the root in the TLB, they are unreachable because KVM will
not run the guest with the same ASID without first flushing (where ASID
in this context refers to both SVM's explicit ASID and Intel's implicit
ASID that is constructed from VPID+PCID+EPT4A+etc...).
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 94590bc97a67..6549c13e89d9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5100,6 +5100,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
kvm_mmu_sync_roots(vcpu);
kvm_mmu_load_pgd(vcpu);
+
+ /*
+ * Flush any TLB entries for the new root, the provenance of the root
+ * is unknown. In theory, even if KVM ensures there are no stale TLB
+ * entries for a freed root, in theory, an out-of-tree hypervisor could
+ * have left stale entries. Flushing on alloc also allows KVM to skip
+ * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
+ */
static_call(kvm_x86_tlb_flush_current)(vcpu);
out:
return r;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4f5c8e7380a9..66b75c197c94 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
- zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB flush when
+ * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+ * to a different pCPU. Note, the local TLB flush on reuse also
+ * invalidates any paging-structure-cache entries, i.e. TLB entries for
+ * intermediate paging structures, that may be zapped, as such entries
+ * are associated with the ASID on both VMX and SVM.
+ */
+ (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
--
2.34.1.448.ga2b2bfdf31-goog
Look for a !leaf=>leaf conversion instead of a PFN change when checking
if a SPTE change removed a TDP MMU shadow page. Convert the PFN check
into a WARN, as KVM should never change the PFN of a shadow page (except
when its being zapped or replaced).
From a purely theoretical perspective, it's not illegal to replace a SP
with a hugepage pointing at the same PFN. In practice, it's impossible
as that would require mapping guest memory overtop a kernel-allocated SP.
Either way, the check is odd.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b13249bbbe1..05f35541ff2f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -518,9 +518,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
/*
* Recursively handle child PTs if the change removed a subtree from
- * the paging structure.
+ * the paging structure. Note the WARN on the PFN changing without the
+ * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
+ * pages are kernel allocations and should never be migrated.
*/
- if (was_present && !was_leaf && (pfn_changed || !is_present))
+ if (was_present && !was_leaf &&
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_tdp_mmu_page(kvm,
spte_to_child_pt(old_spte, level), shared);
}
--
2.34.1.448.ga2b2bfdf31-goog
Batch TLB flushes (with other MMUs) when handling ->change_spte()
notifications in the TDP MMU. The MMU notifier path in question doesn't
allow yielding and correcty flushes before dropping mmu_lock.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 05f35541ff2f..6c51548d89b1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,13 +1174,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
*/
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
-
- /* FIXME: return 'flush' instead of flushing here. */
- if (flush)
- kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
-
- return false;
+ /*
+ * No need to handle the remote TLB flush under RCU protection, the
+ * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
+ * shadow page. See the WARN on pfn_changed in __handle_changed_spte().
+ */
+ return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
}
/*
--
2.34.1.448.ga2b2bfdf31-goog
Add helpers to read and write TDP MMU SPTEs instead of open coding
rcu_dereference() all over the place, and to provide a convenient
location to document why KVM doesn't exempt holding mmu_lock for write
from having to hold RCU (and any future changes to the rules).
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_iter.c | 6 +++---
arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++-------
3 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index caa96c270b95..de31f3e68668 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}
static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -87,7 +87,7 @@ static bool try_step_down(struct tdp_iter *iter)
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
child_pt = spte_to_child_pt(iter->old_spte, iter->level);
if (!child_pt)
@@ -121,7 +121,7 @@ static bool try_step_side(struct tdp_iter *iter)
iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
iter->next_last_level_gfn = iter->gfn;
iter->sptep++;
- iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
return true;
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e19cabbcb65c..3cdfaf391a49 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,6 +9,22 @@
typedef u64 __rcu *tdp_ptep_t;
+/*
+ * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
+ * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
+ * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
+ * the lower* depths of the TDP MMU just to make lockdep happy is a nightmare,
+ * so all* accesses to SPTEs are must be done under RCU protection.
+ */
+static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
+{
+ return READ_ONCE(*rcu_dereference(sptep));
+}
+static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+{
+ WRITE_ONCE(*rcu_dereference(sptep), val);
+}
+
/*
* A TDP iterator performs a pre-order walk over a TDP paging structure.
*/
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 47424e22a681..41c3a1cff3e7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -603,7 +603,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* here since the SPTE is going from non-present
* to non-present.
*/
- WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+ kvm_tdp_mmu_write_spte(iter->sptep, 0);
return true;
}
@@ -642,7 +642,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
*/
WARN_ON(is_removed_spte(iter->old_spte));
- WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+ kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
new_spte, iter->level, false);
@@ -807,7 +807,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
}
@@ -1011,7 +1011,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* because the new value informs the !present
* path below.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
}
if (!is_shadow_present_pte(iter.old_spte)) {
@@ -1217,7 +1217,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
spte_set = true;
@@ -1288,7 +1288,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
spte_set = true;
@@ -1419,7 +1419,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
* The iter must explicitly re-read the SPTE because
* the atomic cmpxchg failed.
*/
- iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
}
--
2.34.1.448.ga2b2bfdf31-goog
Drop RCU protection after processing each root when handling MMU notifier
hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily
drop RCU to let RCU do its thing between roots, and to make it clear that
there's no special behavior that relies on holding RCU across all roots.
Currently, the RCU protection is completely superficial, it's necessary
only to make rcu_dereference() of SPTE pointers happy. A future patch
will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
rather, acknowledge the need for a flush), but in that case RCU needs to
be held until the flush is complete if and only if the flush is needed
because a shadow page may have been removed. And except for the "unmap"
path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
and can't change the PFN for a SP).
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c51548d89b1..47424e22a681 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1071,18 +1071,19 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct tdp_iter iter;
bool ret = false;
- rcu_read_lock();
-
/*
* Don't support rescheduling, none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ rcu_read_lock();
+
tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
+
+ rcu_read_unlock();
}
- rcu_read_unlock();
return ret;
}
--
2.34.1.448.ga2b2bfdf31-goog
WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
which is called out by the comment as being disallowed but not actually
checked. Keep the WARN on the old_spte as well, because overwriting a
REMOVED_SPTE in the non-atomic path is also disallowed (as evidence by
lack of splats with the existing WARN).
Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 41c3a1cff3e7..e2d217cbeca3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -634,13 +634,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
lockdep_assert_held_write(&kvm->mmu_lock);
/*
- * No thread should be using this function to set SPTEs to the
+ * No thread should be using this function to set SPTEs to or from the
* temporary removed SPTE value.
* If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
* should be used. If operating under the MMU lock in write mode, the
* use of the removed SPTE should not be necessary.
*/
- WARN_ON(is_removed_spte(iter->old_spte));
+ WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
--
2.34.1.448.ga2b2bfdf31-goog
Refactor __tdp_mmu_set_spte() to work with raw values instead of a
tdp_iter objects so that a future patch can modify SPTEs without doing a
walk, and without having to synthesize a tdp_iter.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++---------------
1 file changed, 31 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e2d217cbeca3..61596b4a8121 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -611,9 +611,13 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
/*
* __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
- * @kvm: kvm instance
- * @iter: a tdp_iter instance currently on the SPTE that should be set
- * @new_spte: The value the SPTE should be set to
+ * @kvm: KVM instance
+ * @as_id: Address space ID, i.e. regular vs. SMM
+ * @sptep: Pointer to the SPTE
+ * @old_spte: The current value of the SPTE
+ * @new_spte: The new value that will be set for the SPTE
+ * @gfn: The base GFN that was (or will be) mapped by the SPTE
+ * @level: The level _containing_ the SPTE (its parent PT's level)
* @record_acc_track: Notify the MM subsystem of changes to the accessed state
* of the page. Should be set unless handling an MMU
* notifier for access tracking. Leaving record_acc_track
@@ -625,12 +629,10 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* Leaving record_dirty_log unset in that case prevents page
* writes from being double counted.
*/
-static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
- u64 new_spte, bool record_acc_track,
- bool record_dirty_log)
+static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
+ u64 old_spte, u64 new_spte, gfn_t gfn, int level,
+ bool record_acc_track, bool record_dirty_log)
{
- WARN_ON_ONCE(iter->yielded);
-
lockdep_assert_held_write(&kvm->mmu_lock);
/*
@@ -640,39 +642,48 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
* should be used. If operating under the MMU lock in write mode, the
* use of the removed SPTE should not be necessary.
*/
- WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
+ WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));
- kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+ kvm_tdp_mmu_write_spte(sptep, new_spte);
+
+ __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
- __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, false);
if (record_acc_track)
- handle_changed_spte_acc_track(iter->old_spte, new_spte,
- iter->level);
+ handle_changed_spte_acc_track(old_spte, new_spte, level);
if (record_dirty_log)
- handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
- iter->old_spte, new_spte,
- iter->level);
+ handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+ new_spte, level);
+}
+
+static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+ u64 new_spte, bool record_acc_track,
+ bool record_dirty_log)
+{
+ WARN_ON_ONCE(iter->yielded);
+
+ __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte,
+ new_spte, iter->gfn, iter->level,
+ record_acc_track, record_dirty_log);
}
static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
+ _tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
}
static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
+ _tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
}
static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+ _tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
}
#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
--
2.34.1.448.ga2b2bfdf31-goog
When recovering a potential hugepage that was shattered for the iTLB
multihit workaround, precisely zap only the target page instead of
iterating over the TDP MMU to find the SP that was passed in. This will
allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 7 ++++++-
arch/x86/kvm/mmu/tdp_iter.h | 2 --
arch/x86/kvm/mmu/tdp_mmu.c | 28 ++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 18 +-----------------
4 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index da6166b5c377..be063b6c91b7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -30,6 +30,8 @@ extern bool dbg;
#define INVALID_PAE_ROOT 0
#define IS_VALID_PAE_ROOT(x) (!!(x))
+typedef u64 __rcu *tdp_ptep_t;
+
struct kvm_mmu_page {
/*
* Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -59,7 +61,10 @@ struct kvm_mmu_page {
refcount_t tdp_mmu_root_count;
};
unsigned int unsync_children;
- struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ union {
+ struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ tdp_ptep_t ptep;
+ };
DECLARE_BITMAP(unsync_child_bitmap, 512);
struct list_head lpage_disallowed_link;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 3cdfaf391a49..1de6c1c9ff7b 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -7,8 +7,6 @@
#include "mmu.h"
-typedef u64 __rcu *tdp_ptep_t;
-
/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
* to be zapped while holding mmu_lock for read. Holding RCU isn't required for
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 61596b4a8121..d23c2d42ad60 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -305,12 +305,16 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
*
* @kvm: kvm instance
* @sp: the new page
+ * @sptep: pointer to the new page's SPTE (in its parent)
* @account_nx: This page replaces a NX large page and should be marked for
* eventual reclaim.
*/
static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool account_nx)
+ tdp_ptep_t sptep, bool account_nx)
{
+ WARN_ON_ONCE(sp->ptep);
+ sp->ptep = sptep;
+
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
if (account_nx)
@@ -745,6 +749,26 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
return iter->yielded;
}
+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ u64 old_spte;
+
+ rcu_read_lock();
+
+ old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
+ if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
+ sp->gfn, sp->role.level + 1, true, true);
+
+ rcu_read_unlock();
+
+ return true;
+}
+
/*
* Tears down the mappings for the range of gfns, [start, end), and frees the
* non-root pages mapping GFNs strictly within that range. Returns true if
@@ -1041,7 +1065,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
!shadow_accessed_mask);
if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
- tdp_mmu_link_page(vcpu->kvm, sp,
+ tdp_mmu_link_page(vcpu->kvm, sp, iter.sptep,
fault->huge_page_disallowed &&
fault->req_level >= iter.level);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 6b9bdd652bca..ccb12f1914ba 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -22,24 +22,8 @@ static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
{
return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
}
-static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level + 1);
-
- /*
- * Don't allow yielding, as the caller may have a flush pending. Note,
- * if mmu_lock is held for write, zapping will never yield in this case,
- * but explicitly disallow it for safety. The TDP MMU does not yield
- * until it has made forward progress (steps sideways), and when zapping
- * a single shadow page that it's guaranteed to see (thus the mmu_lock
- * requirement), its "step sideways" will always step beyond the bounds
- * of the shadow page's gfn range and stop iterating before yielding.
- */
- lockdep_assert_held_write(&kvm->mmu_lock);
- return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
- sp->gfn, end, false, false);
-}
+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
--
2.34.1.448.ga2b2bfdf31-goog
Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
uses the slow version of "zap everything" is when the VM is being
destroyed or the owning mm has exited. In either case, KVM_RUN is
unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d23c2d42ad60..e9232ef2194e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -871,14 +871,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
- bool flush = false;
int i;
+ /*
+ * A TLB flush is unnecessary, KVM zaps everything if and only the VM
+ * is being destroyed or the userspace VMM has exited. In both cases,
+ * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
+ */
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
+ (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
}
/*
--
2.34.1.448.ga2b2bfdf31-goog
Add a dedicate helper for zapping a TDP MMU root, and use it in the three
flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
are zapped (zapping an entire root is safe if and only if it cannot be in
use by any vCPU). Because a TLB flush is never required, unconditionally
pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.
Opportunistically document why KVM must not yield when zapping roots that
are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
reached zero, and further harden the flow to detect improper KVM behavior
with respect to roots that are supposed to be unreachable.
In addition to hardening zapping of roots, isolating zapping of roots
will allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs, and by removing its tricky "zap all" heuristic. By having
all paths that truly need to free _all_ SPs flow through the dedicated
root zapper, the generic zapper can be freed of those concerns.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 100 +++++++++++++++++++++++++++++++------
1 file changed, 84 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e9232ef2194e..c4bfd6aac999 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -56,10 +56,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
rcu_barrier();
}
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush,
- bool shared);
-
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
free_page((unsigned long)sp->spt);
@@ -82,6 +78,9 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared);
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -104,7 +103,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* intermediate paging structures, that may be zapped, as such entries
* are associated with the ASID on both VMX and SVM.
*/
- (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+ tdp_mmu_zap_root(kvm, root, shared);
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -749,6 +748,78 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
return iter->yielded;
}
+static inline gfn_t tdp_mmu_max_gfn_host(void)
+{
+ /*
+ * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
+ * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
+ * and so KVM will never install a SPTE for such addresses.
+ */
+ return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+}
+
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared)
+{
+ bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
+ struct tdp_iter iter;
+
+ gfn_t end = tdp_mmu_max_gfn_host();
+ gfn_t start = 0;
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+ rcu_read_lock();
+
+ /*
+ * No need to try to step down in the iterator when zapping an entire
+ * root, zapping an upper-level SPTE will recurse on its children.
+ */
+ for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+ root->role.level, start, end) {
+retry:
+ /*
+ * Yielding isn't allowed when zapping an unreachable root as
+ * the root won't be processed by mmu_notifier callbacks. When
+ * handling an unmap/release mmu_notifier command, KVM must
+ * drop all references to relevant pages prior to completing
+ * the callback. Dropping mmu_lock can result in zapping SPTEs
+ * for an unreachable root after a relevant callback completes,
+ * which leads to use-after-free as zapping a SPTE triggers
+ * "writeback" of dirty/accessed bits to the SPTE's associated
+ * struct page.
+ */
+ if (!root_is_unreachable &&
+ tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ continue;
+
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ if (!shared) {
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
+ /*
+ * cmpxchg() shouldn't fail if the root is unreachable.
+ * to be unreachable. Re-read the SPTE and retry so as
+ * not to leak the page and its children.
+ */
+ WARN_ONCE(root_is_unreachable,
+ "Contended TDP MMU SPTE in unreachable root.");
+ iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
+ goto retry;
+ }
+ /*
+ * WARN if the root is invalid and is unreachable, all SPTEs
+ * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
+ * and inserting new SPTEs under an invalid root is a KVM bug.
+ */
+ WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
+ }
+
+ rcu_read_unlock();
+}
+
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
u64 old_spte;
@@ -790,8 +861,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush,
bool shared)
{
- gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
- bool zap_all = (start == 0 && end >= max_gfn_host);
+ bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;
/*
@@ -800,12 +870,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
*/
int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
- /*
- * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
- * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
- * and so KVM will never install a SPTE for such addresses.
- */
- end = min(end, max_gfn_host);
+ end = min(end, tdp_mmu_max_gfn_host());
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
@@ -871,6 +936,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
+ struct kvm_mmu_page *root;
int i;
/*
@@ -878,8 +944,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
*/
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
+ tdp_mmu_zap_root(kvm, root, false);
+ }
}
/*
@@ -905,7 +973,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* will still flush on yield, but that's a minor performance
* blip and not a functional issue.
*/
- (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
+ tdp_mmu_zap_root(kvm, root, true);
/*
* Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
--
2.34.1.448.ga2b2bfdf31-goog
Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
support for zapping with mmu_lock held for read. That all callers hold
mmu_lock for write isn't a random coincedence; now that the paths that
need to zap _everything_ have their own path, the only callers left are
those that need to zap for functional correctness. And when zapping is
required for functional correctness, mmu_lock must be held for write,
otherwise the caller has no guarantees about the state of the TDP MMU
page tables after it has run, e.g. the SPTE(s) it zapped can be
immediately replaced by a vCPU faulting in a page.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 29 ++++++-----------------------
1 file changed, 6 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c4bfd6aac999..c7529de776be 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -851,15 +851,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* function cannot yield, it will not release the MMU lock or reschedule and
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
- *
- * If shared is true, this thread holds the MMU lock in read mode and must
- * account for the possibility that other threads are modifying the paging
- * structures concurrently. If shared is false, this thread should hold the
- * MMU lock in write mode.
*/
static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush,
- bool shared)
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
{
bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;
@@ -872,15 +866,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
end = min(end, tdp_mmu_max_gfn_host());
- kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ lockdep_assert_held_write(&kvm->mmu_lock);
rcu_read_lock();
for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
min_level, start, end) {
-retry:
if (can_yield &&
- tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
+ tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}
@@ -899,17 +892,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;
- if (!shared) {
- tdp_mmu_set_spte(kvm, &iter, 0);
- flush = true;
- } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
- /*
- * The iter must explicitly re-read the SPTE because
- * the atomic cmpxchg failed.
- */
- iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
- goto retry;
- }
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ flush = true;
}
rcu_read_unlock();
@@ -928,8 +912,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
- false);
+ flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
return flush;
}
--
2.34.1.448.ga2b2bfdf31-goog
Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
functions accordingly. When removing mappings for functional correctness
(except for the stupid VFIO GPU passthrough memslots bug), zapping the
leaf SPTEs is sufficient as the paging structures themselves do not point
at guest memory and do not directly impact the final translation (in the
TDP MMU).
Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
kvm_unmap_gfn_range().
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
arch/x86/kvm/mmu/tdp_mmu.h | 8 +-------
3 files changed, 14 insertions(+), 39 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f660906c8230..f40773dc4c92 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5804,8 +5804,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
if (is_tdp_mmu_enabled(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
- gfn_end, flush);
+ flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
+ gfn_end, true, flush);
}
if (flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7529de776be..21d015b38ac1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -841,10 +841,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
}
/*
- * Tears down the mappings for the range of gfns, [start, end), and frees the
- * non-root pages mapping GFNs strictly within that range. Returns true if
- * SPTEs have been cleared and a TLB flush is needed before releasing the
- * MMU lock.
+ * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
+ * have been cleared and a TLB flush is needed before releasing the MMU lock.
*
* If can_yield is true, will release the MMU lock and reschedule if the
* scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -852,18 +850,11 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
*/
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush)
+static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
{
- bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
struct tdp_iter iter;
- /*
- * No need to try to step down in the iterator when zapping all SPTEs,
- * zapping the top-level non-leaf SPTEs will recurse on their children.
- */
- int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
-
end = min(end, tdp_mmu_max_gfn_host());
lockdep_assert_held_write(&kvm->mmu_lock);
@@ -871,24 +862,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
- min_level, start, end) {
+ PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}
- if (!is_shadow_present_pte(iter.old_spte))
- continue;
-
- /*
- * If this is a non-last-level SPTE that covers a larger range
- * than should be zapped, continue, and zap the mappings at a
- * lower level, except when zapping all SPTEs.
- */
- if (!zap_all &&
- (iter.gfn < start ||
- iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+ if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
@@ -906,13 +887,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* SPTEs have been cleared and a TLB flush is needed before releasing the
* MMU lock.
*/
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
- gfn_t end, bool can_yield, bool flush)
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+ bool can_yield, bool flush)
{
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
return flush;
}
@@ -1143,8 +1124,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
- range->end, range->may_block, flush);
+ return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
+ range->end, range->may_block, flush);
}
typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ccb12f1914ba..9759cbd821ef 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
gfn_t end, bool can_yield, bool flush);
-static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
- gfn_t start, gfn_t end, bool flush)
-{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
-}
-
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
--
2.34.1.448.ga2b2bfdf31-goog
When yielding in the TDP MMU iterator, service any pending TLB flush
before dropping RCU protections in anticipation of using the callers RCU
"lock" as a proxy for vCPUs in the guest.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 21d015b38ac1..e7086eb35599 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -728,11 +728,11 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
return false;
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
- rcu_read_unlock();
-
if (flush)
kvm_flush_remote_tlbs(kvm);
+ rcu_read_unlock();
+
if (shared)
cond_resched_rwlock_read(&kvm->mmu_lock);
else
--
2.34.1.448.ga2b2bfdf31-goog
Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
of immediately flushing. Because the shadow pages are freed in an RCU
callback, so long as at least one CPU holds RCU, all CPUs are protected.
For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
needs to ensure the caller services the pending TLB flush before dropping
its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs
running in the guest.
Deferring the flushes allows batching flushes, e.g. when installing a
1gb hugepage and zapping a pile of SPs. And when zapping an entire root,
deferring flushes allows skipping the flush entirely (because flushes are
not needed in that case).
Avoiding flushes when zapping an entire root is especially important as
synchronizing with other CPUs via IPI after zapping every shadow page can
cause significant performance issues for large VMs. The issue is
exacerbated by KVM zapping entire top-level entries without dropping
RCU protection, which can lead to RCU stalls even when zapping roots
backing relatively "small" amounts of guest memory, e.g. 2tb. Removing
the IPI bottleneck largely mitigates the RCU issues, though it's likely
still a problem for 5-level paging. A future patch will further address
the problem by zapping roots in multiple passes to avoid holding RCU for
an extended duration.
Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++
arch/x86/kvm/mmu/tdp_iter.h | 7 +++----
arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++++------------
3 files changed, 26 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f40773dc4c92..ff5a7f763b1e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6268,6 +6268,12 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
rcu_idx = srcu_read_lock(&kvm->srcu);
write_lock(&kvm->mmu_lock);
+ /*
+ * Zapping TDP MMU shadow pages, including the remote TLB flush, must
+ * be done under RCU protection, the pages are freed via RCU callback.
+ */
+ rcu_read_lock();
+
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
for ( ; to_zap; --to_zap) {
@@ -6292,12 +6298,18 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+ rcu_read_unlock();
+
cond_resched_rwlock_write(&kvm->mmu_lock);
flush = false;
+
+ rcu_read_lock();
}
}
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+ rcu_read_unlock();
+
write_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, rcu_idx);
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 1de6c1c9ff7b..95bd54027e08 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,10 +9,9 @@
/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
- * to be zapped while holding mmu_lock for read. Holding RCU isn't required for
- * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
- * the lower* depths of the TDP MMU just to make lockdep happy is a nightmare,
- * so all* accesses to SPTEs are must be done under RCU protection.
+ * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
+ * batched without having to collect the list of zapped SPs. Flows that can
+ * remove SPs must service pending TLB flushes prior to dropping RCU protection.
*/
static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e7086eb35599..72bcec2cd23c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -424,9 +424,6 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
shared);
}
- kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
- KVM_PAGES_PER_HPAGE(level + 1));
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -822,21 +819,14 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- u64 old_spte;
+ u64 old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
- rcu_read_lock();
-
- old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
- if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
- rcu_read_unlock();
+ if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;
- }
__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
sp->gfn, sp->role.level + 1, true, true);
- rcu_read_unlock();
-
return true;
}
@@ -878,6 +868,11 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
}
rcu_read_unlock();
+
+ /*
+ * Because this flows zaps _only_ leaf SPTEs, the caller doesn't need
+ * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
+ */
return flush;
}
@@ -1007,6 +1002,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
ret = RET_PF_SPURIOUS;
else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
return RET_PF_RETRY;
+ else if (is_shadow_present_pte(iter->old_spte) &&
+ !is_last_spte(iter->old_spte, iter->level))
+ kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+ KVM_PAGES_PER_HPAGE(iter->level + 1));
/*
* If the page fault was caused by a write but the page is write
--
2.34.1.448.ga2b2bfdf31-goog
Allow yielding when zapping SPTEs after the last reference to a valid
root is put. Because KVM must drop all SPTEs in response to relevant
mmu_notifier events, mark defunct roots invalid and reset their refcount
prior to zapping the root. Keeping the refcount elevated while the zap
is in-progress ensures the root is reachable via mmu_notifier until the
zap completes and the last reference to the invalid, defunct root is put.
Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
root in being put has a massive paging structure, e.g. zapping a root
that is backed entirely by 4kb pages for a guest with 32tb of memory can
take hundreds of seconds to complete.
watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
__handle_changed_spte+0x1b2/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
tdp_mmu_zap_root+0x307/0x4d0 [kvm]
kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
kvm_mmu_free_roots+0x22d/0x350 [kvm]
kvm_mmu_reset_context+0x20/0x60 [kvm]
kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
__se_sys_ioctl+0x77/0xc0
__x64_sys_ioctl+0x1d/0x20
do_syscall_64+0x44/0xa0
entry_SYSCALL_64_after_hwframe+0x44/0xae
KVM currently doesn't put a root from a non-preemptible context, so other
than the mmu_notifier wrinkle, yielding when putting a root is safe.
Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
take a reference to each root (it requires mmu_lock be held for the
entire duration of the walk).
tdp_mmu_next_root() is used only by the yield-friendly iterator.
kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
but is still yield-friendly in all call sites, as all callers can be
traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
kvm_create_vm().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 7 +-
arch/x86/kvm/mmu/tdp_mmu.c | 145 +++++++++++++++++++++++---------
2 files changed, 109 insertions(+), 43 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index be063b6c91b7..8ce3d58fdf7f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -65,7 +65,12 @@ struct kvm_mmu_page {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
};
- DECLARE_BITMAP(unsync_child_bitmap, 512);
+ union {
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
+ struct {
+ bool tdp_mmu_defunct_root;
+ };
+ };
struct list_head lpage_disallowed_link;
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 72bcec2cd23c..aec97e037a8d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -91,21 +91,67 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
WARN_ON(!root->tdp_mmu_page);
- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- list_del_rcu(&root->link);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ /*
+ * Ensure root->role.invalid is read after the refcount reaches zero to
+ * avoid zapping the root multiple times, e.g. if a different task
+ * acquires a reference (after the root was marked invalid+defunct) and
+ * puts the last reference, all while holding mmu_lock for read. Pairs
+ * with the smp_mb__before_atomic() below.
+ */
+ smp_mb__after_atomic();
+
+ /*
+ * Free the root if it's already invalid. Invalid roots must be zapped
+ * before their last reference is put, i.e. there's no work to be done,
+ * and all roots must be invalidated (see below) before they're freed.
+ * Re-zapping defunct roots, which are always invalid, would put KVM
+ * into an infinite loop (again, see below).
+ */
+ if (root->role.invalid) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_del_rcu(&root->link);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+
+ call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ return;
+ }
+
+ /*
+ * Invalidate the root to prevent it from being reused by a vCPU, and
+ * mark it defunct so that kvm_tdp_mmu_zap_invalidated_roots() doesn't
+ * try to put a reference it didn't acquire.
+ */
+ root->role.invalid = true;
+ root->tdp_mmu_defunct_root = true;
+
+ /*
+ * Ensure tdp_mmu_defunct_root is visible if a concurrent reader acquires
+ * a reference after the root's refcount is reset. Pairs with the
+ * smp_mb__after_atomic() above and in kvm_tdp_mmu_zap_invalidated_roots().
+ */
+ smp_mb__before_atomic();
+
+ /*
+ * Note, if mmu_lock is held for read this can race with other readers,
+ * e.g. they may acquire a reference without seeing the root as invalid,
+ * and the refcount may be reset after the root is skipped. Both races
+ * are benign, as flows that must visit all roots, e.g. need to zap
+ * SPTEs for correctness, must take mmu_lock for write to block page
+ * faults, and the only flow that must not consume an invalid root is
+ * allocating a new root for a vCPU, which also takes mmu_lock for write.
+ */
+ refcount_set(&root->tdp_mmu_root_count, 1);
/*
- * A TLB flush is not necessary as KVM performs a local TLB flush when
- * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
- * to a different pCPU. Note, the local TLB flush on reuse also
- * invalidates any paging-structure-cache entries, i.e. TLB entries for
- * intermediate paging structures, that may be zapped, as such entries
- * are associated with the ASID on both VMX and SVM.
+ * Zap the root, then put the refcount "acquired" above. Recursively
+ * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
+ * infinite loop by freeing invalid roots. By design, the root is
+ * reachable while it's being zapped, thus a different task can put its
+ * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
+ * defunct root is unavoidable.
*/
tdp_mmu_zap_root(kvm, root, shared);
-
- call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ kvm_tdp_mmu_put_root(kvm, root, shared);
}
enum tdp_mmu_roots_iter_type {
@@ -758,12 +804,23 @@ static inline gfn_t tdp_mmu_max_gfn_host(void)
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
- bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
struct tdp_iter iter;
gfn_t end = tdp_mmu_max_gfn_host();
gfn_t start = 0;
+ /*
+ * The root must have an elevated refcount so that it's reachable via
+ * mmu_notifier callbacks, which allows this path to yield and drop
+ * mmu_lock. When handling an unmap/release mmu_notifier command, KVM
+ * must drop all references to relevant pages prior to completing the
+ * callback. Dropping mmu_lock with an unreachable root would result
+ * in zapping SPTEs after a relevant mmu_notifier callback completes
+ * and lead to use-after-free as zapping a SPTE triggers "writeback" of
+ * dirty accessed bits to the SPTE's associated struct page.
+ */
+ WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
+
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
rcu_read_lock();
@@ -775,19 +832,7 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
root->role.level, start, end) {
retry:
- /*
- * Yielding isn't allowed when zapping an unreachable root as
- * the root won't be processed by mmu_notifier callbacks. When
- * handling an unmap/release mmu_notifier command, KVM must
- * drop all references to relevant pages prior to completing
- * the callback. Dropping mmu_lock can result in zapping SPTEs
- * for an unreachable root after a relevant callback completes,
- * which leads to use-after-free as zapping a SPTE triggers
- * "writeback" of dirty/accessed bits to the SPTE's associated
- * struct page.
- */
- if (!root_is_unreachable &&
- tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
if (!is_shadow_present_pte(iter.old_spte))
@@ -796,22 +841,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
if (!shared) {
tdp_mmu_set_spte(kvm, &iter, 0);
} else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
- /*
- * cmpxchg() shouldn't fail if the root is unreachable.
- * to be unreachable. Re-read the SPTE and retry so as
- * not to leak the page and its children.
- */
- WARN_ONCE(root_is_unreachable,
- "Contended TDP MMU SPTE in unreachable root.");
iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
goto retry;
}
- /*
- * WARN if the root is invalid and is unreachable, all SPTEs
- * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
- * and inserting new SPTEs under an invalid root is a KVM bug.
- */
- WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
}
rcu_read_unlock();
@@ -899,6 +931,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
int i;
/*
+ * Zap all roots, including defunct roots, as all SPTEs must be dropped
+ * before returning to the caller.
+ *
* A TLB flush is unnecessary, KVM zaps everything if and only the VM
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
@@ -924,6 +959,12 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
/*
+ * Zap the root, even if it's defunct, as all SPTEs must be
+ * dropped before returning to the caller, e.g. if the root was
+ * invalidated by a memslot update, then SPTEs associated with
+ * a deleted/moved memslot are unreachable via the mmu_notifier
+ * once the memslot update completes.
+ *
* A TLB flush is unnecessary, invalidated roots are guaranteed
* to be unreachable by the guest (see kvm_tdp_mmu_put_root()
* for more details), and unlike the legacy MMU, no vCPU kick
@@ -935,10 +976,24 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
tdp_mmu_zap_root(kvm, root, true);
/*
- * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
+ * Leverages kvm_tdp_mmu_get_root() in the iterator, pairs with
+ * the smp_mb__before_atomic() in kvm_tdp_mmu_put_root() to
+ * ensure a defunct root is seen as such.
+ */
+ smp_mb__after_atomic();
+
+ /*
+ * Put the reference acquired in kvm_tdp_mmu_invalidate_roots()
+ * unless the root is defunct worker data, in which case no
+ * reference was taken. Roots become defunct only when a valid
+ * root has its last reference put, thus holding a reference
+ * means the root can't become defunct between invalidating the
+ * root and re-checking the data here.
+ *
* Note, the iterator holds its own reference.
*/
- kvm_tdp_mmu_put_root(kvm, root, true);
+ if (!root->tdp_mmu_defunct_root)
+ kvm_tdp_mmu_put_root(kvm, root, true);
}
}
@@ -953,13 +1008,17 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* a vCPU drops the last reference to a root prior to the root being zapped, it
* will get stuck with tearing down the entire paging structure.
*
- * Get a reference even if the root is already invalid,
+ * Get a reference even if the root is already invalid, unless it's defunct, as
* kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
* invalid roots, e.g. there's no epoch to identify roots that were invalidated
* by a previous call. Roots stay on the list until the last reference is
* dropped, so even though all invalid roots are zapped, a root may not go away
* for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
*
+ * Don't take a reference if the root is defunct, vCPUs cannot hold references
+ * to defunct roots and so will never get stuck with zapping the root. Note,
+ * defunct roots still need to be zapped by kvm_tdp_mmu_zap_invalidated_roots().
+ *
* Because mmu_lock is held for write, it should be impossible to observe a
* root with zero refcount, i.e. the list of roots cannot be stale.
*
@@ -971,8 +1030,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
struct kvm_mmu_page *root;
lockdep_assert_held_write(&kvm->mmu_lock);
+
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
- if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(kvm, root)))
+ if (!root->tdp_mmu_defunct_root &&
+ !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
root->role.invalid = true;
}
}
--
2.34.1.448.ga2b2bfdf31-goog
Zap defunct roots, a.k.a. roots that have been invalidated after their
last reference was initially dropped, asynchronously via the system work
queue instead of forcing the work upon the unfortunate task that happened
to drop the last reference.
If a vCPU task drops the last reference, the vCPU is effectively blocked
by the host for the entire duration of the zap. If the root being zapped
happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
being active, the zap can take several hundred seconds. Unsurprisingly,
most guests are unhappy if a vCPU disappears for hundreds of seconds.
E.g. running a synthetic selftest that triggers a vCPU root zap with
~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
Offloading the zap to a worker drops the block time to <100ms.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++-----
2 files changed, 58 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8ce3d58fdf7f..ac365631e4fe 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -69,6 +69,8 @@ struct kvm_mmu_page {
DECLARE_BITMAP(unsync_child_bitmap, 512);
struct {
bool tdp_mmu_defunct_root;
+ struct work_struct tdp_mmu_async_work;
+ void *tdp_mmu_async_data;
};
};
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2e28f5e4b761..a706328a5658 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);
+static void tdp_mmu_zap_root_async(struct work_struct *work)
+{
+ struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
+ tdp_mmu_async_work);
+ struct kvm *kvm = root->tdp_mmu_async_data;
+
+ read_lock(&kvm->mmu_lock);
+
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB flush when
+ * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+ * to a different pCPU. Note, the local TLB flush on reuse also
+ * invalidates any paging-structure-cache entries, i.e. TLB entries for
+ * intermediate paging structures, that may be zapped, as such entries
+ * are associated with the ASID on both VMX and SVM.
+ */
+ tdp_mmu_zap_root(kvm, root, true);
+
+ /*
+ * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
+ * avoiding an infinite loop. By design, the root is reachable while
+ * it's being asynchronously zapped, thus a different task can put its
+ * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
+ * asynchronously zapped root is unavoidable.
+ */
+ kvm_tdp_mmu_put_root(kvm, root, true);
+
+ read_unlock(&kvm->mmu_lock);
+
+ kvm_put_kvm(kvm);
+}
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -143,15 +175,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
refcount_set(&root->tdp_mmu_root_count, 1);
/*
- * Zap the root, then put the refcount "acquired" above. Recursively
- * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
- * infinite loop by freeing invalid roots. By design, the root is
- * reachable while it's being zapped, thus a different task can put its
- * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
- * defunct root is unavoidable.
+ * Attempt to acquire a reference to KVM itself. If KVM is alive, then
+ * zap the root asynchronously in a worker, otherwise it must be zapped
+ * directly here. Wait to do this check until after the refcount is
+ * reset so that tdp_mmu_zap_root() can safely yield.
+ *
+ * In both flows, zap the root, then put the refcount "acquired" above.
+ * When putting the reference, use kvm_tdp_mmu_put_root() to test the
+ * above logic for avoiding an infinite loop by freeing invalid roots.
+ * By design, the root is reachable while it's being zapped, thus a
+ * different task can put its last reference, i.e. flowing through
+ * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
*/
- tdp_mmu_zap_root(kvm, root, shared);
- kvm_tdp_mmu_put_root(kvm, root, shared);
+ if (kvm_get_kvm_safe(kvm)) {
+ root->tdp_mmu_async_data = kvm;
+ INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
+ schedule_work(&root->tdp_mmu_async_work);
+ } else {
+ tdp_mmu_zap_root(kvm, root, shared);
+ kvm_tdp_mmu_put_root(kvm, root, shared);
+ }
}
enum tdp_mmu_roots_iter_type {
@@ -949,7 +992,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
/*
* Zap all roots, including defunct roots, as all SPTEs must be dropped
- * before returning to the caller.
+ * before returning to the caller. Zap directly even if the root is
+ * also being zapped by a worker. Walking zapped top-level SPTEs isn't
+ * all that expensive and mmu_lock is already held, which means the
+ * worker has yielded, i.e. flushing the work instead of zapping here
+ * isn't guaranteed to be any faster.
*
* A TLB flush is unnecessary, KVM zaps everything if and only the VM
* is being destroyed or the userspace VMM has exited. In both cases,
--
2.34.1.448.ga2b2bfdf31-goog
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.
With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs take up to ~7 or so seconds (time varies based number of kernel
config, CPUs, vCPUs, etc...). With 5-level paging, that time can balloon
well into hundreds of seconds.
Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.
By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.
Zapping at 1gb in the first pass is not arbitrary. First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
when repacing a shadow page with a hugepage. Zapping a 1gb shadow page
that is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due writing back the struct page accessed/dirty bits for each 4kb
page, i.e. the two-pass approach is guaranteed to work so long as KVM can
cleany zap a 1gb shadow page.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
softirq=15759/15759 fqs=5058
(t=21016 jiffies g=66453 q=238577)
NMI backtrace for cpu 52
Call Trace:
...
mark_page_accessed+0x266/0x2f0
kvm_set_pfn_accessed+0x31/0x40
handle_removed_tdp_mmu_page+0x259/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
zap_gfn_range+0x141/0x3b0
kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
kvm_mmu_zap_all_fast+0x121/0x190
kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
kvm_page_track_flush_slot+0x5c/0x80
kvm_arch_flush_shadow_memslot+0xe/0x10
kvm_set_memslot+0x172/0x4e0
__kvm_set_memory_region+0x337/0x590
kvm_vm_ioctl+0x49c/0xf80
Reported-by: David Matlack <[email protected]>
Cc: Ben Gardon <[email protected]>
Cc: Mingwei Zhang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++++-----
1 file changed, 22 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index aec97e037a8d..2e28f5e4b761 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -809,6 +809,18 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t end = tdp_mmu_max_gfn_host();
gfn_t start = 0;
+ /*
+ * To avoid RCU stalls due to recursively removing huge swaths of SPs,
+ * split the zap into two passes. On the first pass, zap at the 1gb
+ * level, and then zap top-level SPs on the second pass. "1gb" is not
+ * arbitrary, as KVM must be able to zap a 1gb shadow page without
+ * inducing a stall to allow in-place replacement with a 1gb hugepage.
+ *
+ * Because zapping a SP recurses on its children, stepping down to
+ * PG_LEVEL_4K in the iterator itself is unnecessary.
+ */
+ int zap_level = PG_LEVEL_1G;
+
/*
* The root must have an elevated refcount so that it's reachable via
* mmu_notifier callbacks, which allows this path to yield and drop
@@ -825,12 +837,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- /*
- * No need to try to step down in the iterator when zapping an entire
- * root, zapping an upper-level SPTE will recurse on its children.
- */
+start:
for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
- root->role.level, start, end) {
+ zap_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -838,6 +847,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
if (!is_shadow_present_pte(iter.old_spte))
continue;
+ if (iter.level > zap_level)
+ continue;
+
if (!shared) {
tdp_mmu_set_spte(kvm, &iter, 0);
} else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
@@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
}
}
+ if (zap_level < root->role.level) {
+ zap_level = root->role.level;
+ goto start;
+ }
+
rcu_read_unlock();
}
--
2.34.1.448.ga2b2bfdf31-goog
Extract the code for allocating guest memory via memfd out of
vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
A future selftest to populate a guest with the maximum amount of guest
memory will abuse KVM's memslots to alias guest memory regions to a
single memfd-backed host region, i.e. needs to back a guest with memfd
memory without a 1:1 association between a memslot and a memfd instance.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
.../testing/selftests/kvm/include/kvm_util.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 42 +++++++++++--------
2 files changed, 25 insertions(+), 18 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index fba971189390..ee4a7fafb442 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -104,6 +104,7 @@ int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
size_t len);
void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename);
+int kvm_memfd_alloc(size_t size, bool hugepages);
void vm_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 45f39dd9b4da..97514ece798f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -658,6 +658,27 @@ void kvm_vm_free(struct kvm_vm *vmp)
free(vmp);
}
+int kvm_memfd_alloc(size_t size, bool hugepages)
+{
+ int memfd_flags = MFD_CLOEXEC;
+ int fd, r;
+
+ if (hugepages)
+ memfd_flags |= MFD_HUGETLB;
+
+ fd = memfd_create("kvm_selftest", memfd_flags);
+ TEST_ASSERT(fd != -1, "memfd_create() failed, errno: %i (%s)",
+ errno, strerror(errno));
+
+ r = ftruncate(fd, size);
+ TEST_ASSERT(!r, "ftruncate() failed, errno: %i (%s)", errno, strerror(errno));
+
+ r = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size);
+ TEST_ASSERT(!r, "fallocate() failed, errno: %i (%s)", errno, strerror(errno));
+
+ return fd;
+}
+
/*
* Memory Compare, host virtual to guest virtual
*
@@ -910,24 +931,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->mmap_size += alignment;
region->fd = -1;
- if (backing_src_is_shared(src_type)) {
- int memfd_flags = MFD_CLOEXEC;
-
- if (src_type == VM_MEM_SRC_SHARED_HUGETLB)
- memfd_flags |= MFD_HUGETLB;
-
- region->fd = memfd_create("kvm_selftest", memfd_flags);
- TEST_ASSERT(region->fd != -1,
- "memfd_create failed, errno: %i", errno);
-
- ret = ftruncate(region->fd, region->mmap_size);
- TEST_ASSERT(ret == 0, "ftruncate failed, errno: %i", errno);
-
- ret = fallocate(region->fd,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
- region->mmap_size);
- TEST_ASSERT(ret == 0, "fallocate failed, errno: %i", errno);
- }
+ if (backing_src_is_shared(src_type))
+ region->fd = kvm_memfd_alloc(region->mmap_size,
+ src_type == VM_MEM_SRC_SHARED_HUGETLB);
region->mmap_start = mmap(NULL, region->mmap_size,
PROT_READ | PROT_WRITE,
--
2.34.1.448.ga2b2bfdf31-goog
Add a selftest that enables populating a VM with the maximum amount of
guest memory allowed by the underlying architecture. Abuse KVM's
memslots by mapping a single host memory region into multiple memslots so
that the selftest doesn't require a system with terabytes of RAM.
Default to 512gb of guest memory, which isn't all that interesting, but
should work on all MMUs and doesn't take an exorbitant amount of memory
or time. E.g. testing with ~64tb of guest memory takes the better part
of an hour, and requires 200gb of memory for KVM's page tables when using
4kb pages.
To inflicit maximum abuse on KVM' MMU, default to 4kb pages (or whatever
the not-hugepage size is) in the backing store (memfd). Use memfd for
the host backing store to ensure that hugepages are guaranteed when
requested, and to give the user explicit control of the size of hugepage
being tested.
By default, spin up as many vCPUs as there are available to the selftest,
and distribute the work of dirtying each 4kb chunk of memory across all
vCPUs. Dirtying guest memory forces KVM to populate its page tables, and
also forces KVM to write back accessed/dirty information to struct page
when the guest memory is freed.
On x86, perform two passes with a MMU context reset between each pass to
coerce KVM into dropping all references to the MMU root, e.g. to emulate
a vCPU dropping the last reference. Perform both passes and all
rendezvous on all architectures in the hope that arm64 and s390x can gain
similar shenanigans in the future.
Measure and report the duration of each operation, which is helpful not
only to verify the test is working as intended, but also to easily
evaluate the performance differences different page sizes.
Provide command line options to limit the amount of guest memory, set the
size of each slot (i.e. of the host memory region), set the number of
vCPUs, and to enable usage of hugepages.
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../selftests/kvm/max_guest_memory_test.c | 292 ++++++++++++++++++
3 files changed, 296 insertions(+)
create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 3cb5ac5da087..ffb4da5b9d03 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -52,6 +52,7 @@
/hardware_disable_test
/kvm_create_max_vcpus
/kvm_page_table_test
+/max_guest_memory_test
/memslot_modification_stress_test
/memslot_perf_test
/rseq_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 17342b575e85..63640f59e96b 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -83,6 +83,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
TEST_GEN_PROGS_x86_64 += hardware_disable_test
TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
TEST_GEN_PROGS_x86_64 += kvm_page_table_test
+TEST_GEN_PROGS_x86_64 += max_guest_memory_test
TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
TEST_GEN_PROGS_x86_64 += memslot_perf_test
TEST_GEN_PROGS_x86_64 += rseq_test
@@ -101,6 +102,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
TEST_GEN_PROGS_aarch64 += kvm_page_table_test
+TEST_GEN_PROGS_aarch64 += max_guest_memory_test
TEST_GEN_PROGS_aarch64 += memslot_modification_stress_test
TEST_GEN_PROGS_aarch64 += memslot_perf_test
TEST_GEN_PROGS_aarch64 += rseq_test
@@ -115,6 +117,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
TEST_GEN_PROGS_s390x += dirty_log_test
TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
TEST_GEN_PROGS_s390x += kvm_page_table_test
+TEST_GEN_PROGS_s390x += max_guest_memory_test
TEST_GEN_PROGS_s390x += rseq_test
TEST_GEN_PROGS_s390x += set_memory_region_test
TEST_GEN_PROGS_s390x += kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/max_guest_memory_test.c
new file mode 100644
index 000000000000..360c88288295
--- /dev/null
+++ b/tools/testing/selftests/kvm/max_guest_memory_test.c
@@ -0,0 +1,292 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/atomic.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "guest_modes.h"
+#include "processor.h"
+
+static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
+{
+ uint64_t gpa;
+
+ for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+ *((volatile uint64_t *)gpa) = gpa;
+
+ GUEST_DONE();
+}
+
+struct vcpu_info {
+ struct kvm_vm *vm;
+ uint32_t id;
+ uint64_t start_gpa;
+ uint64_t end_gpa;
+};
+
+static int nr_vcpus;
+static atomic_t rendezvous;
+
+static void rendezvous_with_boss(void)
+{
+ int orig = atomic_read(&rendezvous);
+
+ if (orig > 0) {
+ atomic_dec_and_test(&rendezvous);
+ while (atomic_read(&rendezvous) > 0)
+ cpu_relax();
+ } else {
+ atomic_inc(&rendezvous);
+ while (atomic_read(&rendezvous) < 0)
+ cpu_relax();
+ }
+}
+
+static void run_vcpu(struct kvm_vm *vm, uint32_t vcpu_id)
+{
+ vcpu_run(vm, vcpu_id);
+ ASSERT_EQ(get_ucall(vm, vcpu_id, NULL), UCALL_DONE);
+}
+
+static void *vcpu_worker(void *data)
+{
+ struct vcpu_info *vcpu = data;
+ struct kvm_vm *vm = vcpu->vm;
+ struct kvm_sregs sregs;
+ struct kvm_regs regs;
+
+ vcpu_args_set(vm, vcpu->id, 3, vcpu->start_gpa, vcpu->end_gpa,
+ vm_get_page_size(vm));
+
+ /* Snapshot regs before the first run. */
+ vcpu_regs_get(vm, vcpu->id, ®s);
+ rendezvous_with_boss();
+
+ run_vcpu(vm, vcpu->id);
+ rendezvous_with_boss();
+ vcpu_regs_set(vm, vcpu->id, ®s);
+ vcpu_sregs_get(vm, vcpu->id, &sregs);
+#ifdef __x86_64__
+ /* Toggle CR0.WP to trigger a MMU context reset. */
+ sregs.cr0 ^= X86_CR0_WP;
+#endif
+ vcpu_sregs_set(vm, vcpu->id, &sregs);
+ rendezvous_with_boss();
+
+ run_vcpu(vm, vcpu->id);
+ rendezvous_with_boss();
+
+ return NULL;
+}
+
+static pthread_t *spawn_workers(struct kvm_vm *vm, uint64_t start_gpa,
+ uint64_t end_gpa)
+{
+ struct vcpu_info *info;
+ uint64_t gpa, nr_bytes;
+ pthread_t *threads;
+ int i;
+
+ threads = malloc(nr_vcpus * sizeof(*threads));
+ TEST_ASSERT(threads, "Failed to allocate vCPU threads");
+
+ info = malloc(nr_vcpus * sizeof(*info));
+ TEST_ASSERT(info, "Failed to allocate vCPU gpa ranges");
+
+ nr_bytes = ((end_gpa - start_gpa) / nr_vcpus) &
+ ~((uint64_t)vm_get_page_size(vm) - 1);
+ TEST_ASSERT(nr_bytes, "C'mon, no way you have %d CPUs", nr_vcpus);
+
+ for (i = 0, gpa = start_gpa; i < nr_vcpus; i++, gpa += nr_bytes) {
+ info[i].vm = vm;
+ info[i].id = i;
+ info[i].start_gpa = gpa;
+ info[i].end_gpa = gpa + nr_bytes;
+ pthread_create(&threads[i], NULL, vcpu_worker, &info[i]);
+ }
+ return threads;
+}
+
+static void rendezvous_with_vcpus(struct timespec *time, const char *name)
+{
+ int i, rendezvoused;
+
+ pr_info("Waiting for vCPUs to finish %s...\n", name);
+
+ rendezvoused = atomic_read(&rendezvous);
+ for (i = 0; abs(rendezvoused) != 1; i++) {
+ usleep(100);
+ if (!(i & 0x3f))
+ pr_info("\r%d vCPUs haven't rendezvoused...",
+ abs(rendezvoused) - 1);
+ rendezvoused = atomic_read(&rendezvous);
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, time);
+
+ /* Release the vCPUs after getting the time of the previous action. */
+ pr_info("\rAll vCPUs finished %s, releasing...\n", name);
+ if (rendezvoused > 0)
+ atomic_set(&rendezvous, -nr_vcpus - 1);
+ else
+ atomic_set(&rendezvous, nr_vcpus + 1);
+}
+
+static void calc_default_nr_vcpus(void)
+{
+ cpu_set_t possible_mask;
+ int r;
+
+ r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+ TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)",
+ errno, strerror(errno));
+
+ nr_vcpus = CPU_COUNT(&possible_mask);
+ TEST_ASSERT(nr_vcpus > 0, "Uh, no CPUs?");
+}
+
+int main(int argc, char *argv[])
+{
+ /*
+ * Skip the first 4gb and slot0. slot0 maps <1gb and is used to back
+ * the guest's code, stack, and page tables. Because selftests creates
+ * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
+ * just below the 4gb boundary. This test could create memory at
+ * 1gb-3gb,but it's simpler to skip straight to 4gb.
+ */
+ const uint64_t size_1gb = (1 << 30);
+ const uint64_t start_gpa = (4ull * size_1gb);
+ const int first_slot = 1;
+
+ struct timespec time_start, time_run1, time_reset, time_run2;
+ uint64_t max_gpa, gpa, slot_size, max_mem, i;
+ int max_slots, slot, opt, fd;
+ bool hugepages = false;
+ pthread_t *threads;
+ struct kvm_vm *vm;
+ void *mem;
+
+ /*
+ * Default to 2gb so that maxing out systems with MAXPHADDR=46, which
+ * are quite common for x86, requires changing only max_mem (KVM allows
+ * 32k memslots, 32k * 2gb == ~64tb of guest memory).
+ */
+ slot_size = 2 * size_1gb;
+
+ max_slots = kvm_check_cap(KVM_CAP_NR_MEMSLOTS);
+ TEST_ASSERT(max_slots > first_slot, "KVM is broken");
+
+ /* All KVM MMUs should be able to survive a 512gb guest. */
+ max_mem = 512 * size_1gb;
+
+ calc_default_nr_vcpus();
+
+ while ((opt = getopt(argc, argv, "c:h:m:s:u")) != -1) {
+ switch (opt) {
+ case 'c':
+ nr_vcpus = atoi(optarg);
+ TEST_ASSERT(nr_vcpus, "#DE");
+ break;
+ case 'm':
+ max_mem = atoi(optarg) * size_1gb;
+ TEST_ASSERT(max_mem, "#DE");
+ break;
+ case 's':
+ slot_size = atoi(optarg) * size_1gb;
+ TEST_ASSERT(slot_size, "#DE");
+ break;
+ case 'u':
+ hugepages = true;
+ break;
+ case 'h':
+ default:
+ printf("usage: %s [-c nr_vcpus] [-m max_mem_in_gb] [-s slot_size_in_gb] [-u [huge_page_size]]\n", argv[0]);
+ exit(1);
+ }
+ }
+
+ vm = vm_create_default_with_vcpus(nr_vcpus, 0, 0, guest_code, NULL);
+
+ max_gpa = vm_get_max_gfn(vm) << vm_get_page_shift(vm);
+ TEST_ASSERT(max_gpa > (4 * slot_size), "MAXPHYADDR <4gb ");
+
+ fd = kvm_memfd_alloc(slot_size, hugepages);
+ mem = mmap(NULL, slot_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() failed");
+
+ TEST_ASSERT(!madvise(mem, slot_size, MADV_NOHUGEPAGE), "madvise() failed");
+
+ /* Pre-fault the memory to avoid taking mmap_sem on guest page faults. */
+ for (i = 0; i < slot_size; i += vm_get_page_size(vm))
+ ((uint8_t *)mem)[i] = 0xaa;
+
+ gpa = 0;
+ for (slot = first_slot; slot < max_slots; slot++) {
+ gpa = start_gpa + ((slot - first_slot) * slot_size);
+ if (gpa + slot_size > max_gpa)
+ break;
+
+ if ((gpa - start_gpa) >= max_mem)
+ break;
+
+ vm_set_user_memory_region(vm, slot, 0, gpa, slot_size, mem);
+
+#ifdef __x86_64__
+ /* Identity map memory in the guest using 1gb pages. */
+ for (i = 0; i < slot_size; i += size_1gb)
+ __virt_pg_map(vm, gpa + i, gpa + i, X86_PAGE_SIZE_1G);
+#else
+ for (i = 0; i < slot_size; i += vm_get_page_size(vm))
+ virt_pg_map(vm, gpa + i, gpa + i);
+#endif
+ }
+
+ atomic_set(&rendezvous, nr_vcpus + 1);
+ threads = spawn_workers(vm, start_gpa, gpa);
+
+ pr_info("Running with %lugb of guest memory and %u vCPUs\n",
+ (gpa - start_gpa) / size_1gb, nr_vcpus);
+
+ rendezvous_with_vcpus(&time_start, "spawning");
+ rendezvous_with_vcpus(&time_run1, "run 1");
+ rendezvous_with_vcpus(&time_reset, "reset");
+ rendezvous_with_vcpus(&time_run2, "run 2");
+
+ time_run2 = timespec_sub(time_run2, time_reset);
+ time_reset = timespec_sub(time_reset, time_run1);
+ time_run1 = timespec_sub(time_run1, time_start);
+
+ pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 = %ld.%.9lds\n",
+ time_run1.tv_sec, time_run1.tv_nsec,
+ time_reset.tv_sec, time_reset.tv_nsec,
+ time_run2.tv_sec, time_run2.tv_nsec);
+
+ /*
+ * Delete even numbered slots (arbitrary) and unmap the first half of
+ * the backing (also arbitrary) to verify KVM correctly drops all
+ * references to the removed regions.
+ */
+ for (slot = (slot - 1) & ~1ull; slot >= first_slot; slot -= 2)
+ vm_set_user_memory_region(vm, slot, 0, 0, 0, NULL);
+
+ munmap(mem, slot_size / 2);
+
+ /* Sanity check that the vCPUs actually ran. */
+ for (i = 0; i < nr_vcpus; i++)
+ pthread_join(threads[i], NULL);
+
+ /*
+ * Deliberately exit without deleting the remaining memslots or closing
+ * kvm_fd to test cleanup via mmu_notifier.release.
+ */
+}
--
2.34.1.448.ga2b2bfdf31-goog
Add cpu_relax() for s390 and x86 for use in arch-agnostic tests. arm64
already defines its own version.
Signed-off-by: Sean Christopherson <[email protected]>
---
tools/testing/selftests/kvm/include/s390x/processor.h | 8 ++++++++
tools/testing/selftests/kvm/include/x86_64/processor.h | 5 +++++
2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/kvm/include/s390x/processor.h b/tools/testing/selftests/kvm/include/s390x/processor.h
index e0e96a5f608c..255c9b990f4c 100644
--- a/tools/testing/selftests/kvm/include/s390x/processor.h
+++ b/tools/testing/selftests/kvm/include/s390x/processor.h
@@ -5,6 +5,8 @@
#ifndef SELFTEST_KVM_PROCESSOR_H
#define SELFTEST_KVM_PROCESSOR_H
+#include <linux/compiler.h>
+
/* Bits in the region/segment table entry */
#define REGION_ENTRY_ORIGIN ~0xfffUL /* region/segment table origin */
#define REGION_ENTRY_PROTECT 0x200 /* region protection bit */
@@ -19,4 +21,10 @@
#define PAGE_PROTECT 0x200 /* HW read-only bit */
#define PAGE_NOEXEC 0x100 /* HW no-execute bit */
+/* Is there a portable way to do this? */
+static inline void cpu_relax(void)
+{
+ barrier();
+}
+
#endif
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 05e65ca1c30c..224574ee9967 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -346,6 +346,11 @@ static inline unsigned long get_xmm(int n)
return 0;
}
+static inline void cpu_relax(void)
+{
+ asm volatile("rep; nop" ::: "memory");
+}
+
bool is_intel_cpu(void);
struct kvm_x86_state;
--
2.34.1.448.ga2b2bfdf31-goog
Move set_memory_region_test's KVM_SET_USER_MEMORY_REGION helper to KVM's
utils so that it can be used by other tests. Provide a raw version as
well as an assert-success version to reduce the amount of boilerplate
code need for basic usage.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
.../testing/selftests/kvm/include/kvm_util.h | 5 +++
tools/testing/selftests/kvm/lib/kvm_util.c | 24 +++++++++++++
.../selftests/kvm/set_memory_region_test.c | 35 +++++--------------
3 files changed, 37 insertions(+), 27 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 2d62edc49d67..fba971189390 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -128,6 +128,11 @@ void vcpu_dump(FILE *stream, struct kvm_vm *vm, uint32_t vcpuid,
void vm_create_irqchip(struct kvm_vm *vm);
+void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva);
+int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva);
+
void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 53d2b5d04b82..45f39dd9b4da 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -779,6 +779,30 @@ static void vm_userspace_mem_region_hva_insert(struct rb_root *hva_tree,
rb_insert_color(®ion->hva_node, hva_tree);
}
+
+int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva)
+{
+ struct kvm_userspace_memory_region region = {
+ .slot = slot,
+ .flags = flags,
+ .guest_phys_addr = gpa,
+ .memory_size = size,
+ .userspace_addr = (uintptr_t)hva,
+ };
+
+ return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION, ®ion);
+}
+
+void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+ uint64_t gpa, uint64_t size, void *hva)
+{
+ int ret = __vm_set_user_memory_region(vm, slot, flags, gpa, size, hva);
+
+ TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed, errno = %d (%s)",
+ errno, strerror(errno));
+}
+
/*
* VM Userspace Memory Region Add
*
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index 72a1c9b4882c..73bc297dabe6 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -329,22 +329,6 @@ static void test_zero_memory_regions(void)
}
#endif /* __x86_64__ */
-static int test_memory_region_add(struct kvm_vm *vm, void *mem, uint32_t slot,
- uint32_t size, uint64_t guest_addr)
-{
- struct kvm_userspace_memory_region region;
- int ret;
-
- region.slot = slot;
- region.flags = 0;
- region.guest_phys_addr = guest_addr;
- region.memory_size = size;
- region.userspace_addr = (uintptr_t) mem;
- ret = ioctl(vm_get_fd(vm), KVM_SET_USER_MEMORY_REGION, ®ion);
-
- return ret;
-}
-
/*
* Test it can be added memory slots up to KVM_CAP_NR_MEMSLOTS, then any
* tentative to add further slots should fail.
@@ -382,23 +366,20 @@ static void test_add_max_memory_regions(void)
TEST_ASSERT(mem != MAP_FAILED, "Failed to mmap() host");
mem_aligned = (void *)(((size_t) mem + alignment - 1) & ~(alignment - 1));
- for (slot = 0; slot < max_mem_slots; slot++) {
- ret = test_memory_region_add(vm, mem_aligned +
- ((uint64_t)slot * MEM_REGION_SIZE),
- slot, MEM_REGION_SIZE,
- (uint64_t)slot * MEM_REGION_SIZE);
- TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
- " rc: %i errno: %i slot: %i\n",
- ret, errno, slot);
- }
+ for (slot = 0; slot < max_mem_slots; slot++)
+ vm_set_user_memory_region(vm, slot, 0,
+ ((uint64_t)slot * MEM_REGION_SIZE),
+ MEM_REGION_SIZE,
+ mem_aligned + (uint64_t)slot * MEM_REGION_SIZE);
/* Check it cannot be added memory slots beyond the limit */
mem_extra = mmap(NULL, MEM_REGION_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
TEST_ASSERT(mem_extra != MAP_FAILED, "Failed to mmap() host");
- ret = test_memory_region_add(vm, mem_extra, max_mem_slots, MEM_REGION_SIZE,
- (uint64_t)max_mem_slots * MEM_REGION_SIZE);
+ ret = __vm_set_user_memory_region(vm, max_mem_slots, 0,
+ (uint64_t)max_mem_slots * MEM_REGION_SIZE,
+ MEM_REGION_SIZE, mem_extra);
TEST_ASSERT(ret == -1 && errno == EINVAL,
"Adding one more memory slot should fail with EINVAL");
--
2.34.1.448.ga2b2bfdf31-goog
On Thu, Dec 23, 2021 at 10:22:52PM +0000, Sean Christopherson wrote:
> Now that tdp_mmu_next_root() can process both valid and invalid roots,
> extend it to be able to process _only_ invalid roots, add yet another
> iterator macro for walking invalid roots, and use the new macro in
> kvm_tdp_mmu_zap_invalidated_roots().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Nice cleanup, it's great to see next_invalidated_root() go.
Reviewed-by: David Matlack <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------
> 1 file changed, 26 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 577985fa001d..41e975841ea6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
> +enum tdp_mmu_roots_iter_type {
> + ALL_ROOTS = -1,
> + VALID_ROOTS = 0,
> + INVALID_ROOTS = 1,
> +};
> +
> /*
> * Returns the next root after @prev_root (or the first root if @prev_root is
> * NULL). A reference to the returned root is acquired, and the reference to
> @@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> */
> static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> struct kvm_mmu_page *prev_root,
> - bool shared, bool only_valid)
> + bool shared,
> + enum tdp_mmu_roots_iter_type type)
> {
> struct kvm_mmu_page *next_root;
>
> + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> + /* Ensure correctness for the below comparison against role.invalid. */
> + BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS);
> +
> rcu_read_lock();
>
> if (prev_root)
> @@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> typeof(*next_root), link);
>
> while (next_root) {
> - if ((!only_valid || !next_root->role.invalid) &&
> + if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) &&
> kvm_tdp_mmu_get_root(kvm, next_root))
> break;
>
> @@ -151,18 +163,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> * mode. In the unlikely event that this thread must free a root, the lock
> * will be temporarily dropped and reacquired in write mode.
> */
> -#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\
> - for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
> +#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _type) \
> + for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _type); \
> _root; \
> - _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
> - if (kvm_mmu_page_as_id(_root) != _as_id) { \
> + _root = tdp_mmu_next_root(_kvm, _root, _shared, _type)) \
> + if (_as_id > 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> +#define for_each_invalid_tdp_mmu_root_yield_safe(_kvm, _root) \
> + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, -1, true, INVALID_ROOTS)
> +
> #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
> - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
> + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, VALID_ROOTS)
>
> #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
> - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
> + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS)
>
> #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
> @@ -811,28 +826,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> kvm_flush_remote_tlbs(kvm);
> }
>
> -static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> - struct kvm_mmu_page *prev_root)
> -{
> - struct kvm_mmu_page *next_root;
> -
> - if (prev_root)
> - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &prev_root->link,
> - typeof(*prev_root), link);
> - else
> - next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - typeof(*next_root), link);
> -
> - while (next_root && !(next_root->role.invalid &&
> - refcount_read(&next_root->tdp_mmu_root_count)))
> - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
> - &next_root->link,
> - typeof(*next_root), link);
> -
> - return next_root;
> -}
> -
> /*
> * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
> * invalidated root, they will not be freed until this function drops the
> @@ -843,36 +836,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> */
> void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> {
> - struct kvm_mmu_page *next_root;
> struct kvm_mmu_page *root;
> bool flush = false;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - rcu_read_lock();
> -
> - root = next_invalidated_root(kvm, NULL);
> -
> - while (root) {
> - next_root = next_invalidated_root(kvm, root);
> -
> - rcu_read_unlock();
> -
> + for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
> flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
>
> /*
> - * Put the reference acquired in
> - * kvm_tdp_mmu_invalidate_roots
> + * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
> + * Note, the iterator holds its own reference.
> */
> kvm_tdp_mmu_put_root(kvm, root, true);
> -
> - root = next_root;
> -
> - rcu_read_lock();
> }
>
> - rcu_read_unlock();
> -
> if (flush)
> kvm_flush_remote_tlbs(kvm);
> }
> --
> 2.34.1.448.ga2b2bfdf31-goog
>
On Thu, Dec 23, 2021 at 10:23:13PM +0000, Sean Christopherson wrote:
> When zapping a TDP MMU root, perform the zap in two passes to avoid
> zapping an entire top-level SPTE while holding RCU, which can induce RCU
> stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
> zap top-level entries in the second pass.
>
> With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
> SPTEs take up to ~7 or so seconds (time varies based number of kernel
> config, CPUs, vCPUs, etc...). With 5-level paging, that time can balloon
> well into hundreds of seconds.
>
> Before remote TLB flushes were omitted, the problem was even worse as
> waiting for all active vCPUs to respond to the IPI introduced significant
> overhead for VMs with large numbers of vCPUs.
>
> By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
> the amount of work that is done without dropping RCU protection is
> strictly bounded, with the worst case latency for a single operation
> being less than 100ms.
>
> Zapping at 1gb in the first pass is not arbitrary. First and foremost,
> KVM relies on being able to zap 1gb shadow pages in a single shot when
> when repacing a shadow page with a hugepage.
When dirty logging is disabled, zap_collapsible_spte_range() does the
bulk of the work zapping leaf SPTEs and allows yielding. I guess that
could race with a vCPU faulting in the huge page though and the vCPU
could do the bulk of the work.
Are there any other scenarios where KVM relies on zapping 1GB worth of
4KB SPTEs without yielding?
In any case, 100ms is a long time to hog the CPU. Why not just do the
safe thing and zap each level? 4K, then 2M, then 1GB, ..., then root
level. The only argument against it I can think of is performance (lots
of redundant walks through the page table). But I don't think root
zapping is especially latency critical.
> Zapping a 1gb shadow page
> that is fully populated with 4kb dirty SPTEs also triggers the worst case
> latency due writing back the struct page accessed/dirty bits for each 4kb
> page, i.e. the two-pass approach is guaranteed to work so long as KVM can
> cleany zap a 1gb shadow page.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
> softirq=15759/15759 fqs=5058
> (t=21016 jiffies g=66453 q=238577)
> NMI backtrace for cpu 52
> Call Trace:
> ...
> mark_page_accessed+0x266/0x2f0
> kvm_set_pfn_accessed+0x31/0x40
> handle_removed_tdp_mmu_page+0x259/0x2e0
> __handle_changed_spte+0x223/0x2c0
> handle_removed_tdp_mmu_page+0x1c1/0x2e0
> __handle_changed_spte+0x223/0x2c0
> handle_removed_tdp_mmu_page+0x1c1/0x2e0
> __handle_changed_spte+0x223/0x2c0
> zap_gfn_range+0x141/0x3b0
> kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
> kvm_mmu_zap_all_fast+0x121/0x190
> kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
> kvm_page_track_flush_slot+0x5c/0x80
> kvm_arch_flush_shadow_memslot+0xe/0x10
> kvm_set_memslot+0x172/0x4e0
> __kvm_set_memory_region+0x337/0x590
> kvm_vm_ioctl+0x49c/0xf80
>
> Reported-by: David Matlack <[email protected]>
> Cc: Ben Gardon <[email protected]>
> Cc: Mingwei Zhang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++++-----
> 1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index aec97e037a8d..2e28f5e4b761 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -809,6 +809,18 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> gfn_t end = tdp_mmu_max_gfn_host();
> gfn_t start = 0;
>
> + /*
> + * To avoid RCU stalls due to recursively removing huge swaths of SPs,
> + * split the zap into two passes. On the first pass, zap at the 1gb
> + * level, and then zap top-level SPs on the second pass. "1gb" is not
> + * arbitrary, as KVM must be able to zap a 1gb shadow page without
> + * inducing a stall to allow in-place replacement with a 1gb hugepage.
> + *
> + * Because zapping a SP recurses on its children, stepping down to
> + * PG_LEVEL_4K in the iterator itself is unnecessary.
> + */
> + int zap_level = PG_LEVEL_1G;
> +
> /*
> * The root must have an elevated refcount so that it's reachable via
> * mmu_notifier callbacks, which allows this path to yield and drop
> @@ -825,12 +837,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>
> rcu_read_lock();
>
> - /*
> - * No need to try to step down in the iterator when zapping an entire
> - * root, zapping an upper-level SPTE will recurse on its children.
> - */
> +start:
> for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> - root->role.level, start, end) {
> + zap_level, start, end) {
> retry:
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> continue;
> @@ -838,6 +847,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> if (!is_shadow_present_pte(iter.old_spte))
> continue;
>
> + if (iter.level > zap_level)
> + continue;
> +
> if (!shared) {
> tdp_mmu_set_spte(kvm, &iter, 0);
> } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
> @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> }
> }
>
> + if (zap_level < root->role.level) {
> + zap_level = root->role.level;
> + goto start;
> + }
This is probably just person opinion but I find the 2 iteration goto
loop harder to understand than just open-coding the 2 passes.
e.g.
static void tdp_mmu_zap_root(...)
{
/*
* To avoid RCU stalls due to recursively removing huge swaths of SPs,
* split the zap into two passes. On the first pass, zap at the 1gb
* level, and then zap top-level SPs on the second pass. "1gb" is not
* arbitrary, as KVM must be able to zap a 1gb shadow page without
* inducing a stall to allow in-place replacement with a 1gb hugepage.
*
* Because zapping a SP recurses on its children, stepping down to
* PG_LEVEL_4K in the iterator itself is unnecessary.
*/
tdp_mmu_zap_root_level(..., PG_LEVEL_1G);
tdp_mmu_zap_root_level(..., root->role.level);
}
Or just go ahead and zap each level from 4K up to root->role.level as I
mentioned above.
> +
> rcu_read_unlock();
> }
>
> --
> 2.34.1.448.ga2b2bfdf31-goog
>
On Thu, Dec 23, 2021 at 10:23:12PM +0000, Sean Christopherson wrote:
> Allow yielding when zapping SPTEs after the last reference to a valid
> root is put. Because KVM must drop all SPTEs in response to relevant
> mmu_notifier events, mark defunct roots invalid and reset their refcount
> prior to zapping the root. Keeping the refcount elevated while the zap
> is in-progress ensures the root is reachable via mmu_notifier until the
> zap completes and the last reference to the invalid, defunct root is put.
>
> Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
> root in being put has a massive paging structure, e.g. zapping a root
> that is backed entirely by 4kb pages for a guest with 32tb of memory can
> take hundreds of seconds to complete.
>
> watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
> RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
> __handle_changed_spte+0x1b2/0x2f0 [kvm]
> handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
> __handle_changed_spte+0x1f4/0x2f0 [kvm]
> handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
> __handle_changed_spte+0x1f4/0x2f0 [kvm]
> tdp_mmu_zap_root+0x307/0x4d0 [kvm]
> kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
> kvm_mmu_free_roots+0x22d/0x350 [kvm]
> kvm_mmu_reset_context+0x20/0x60 [kvm]
> kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
> kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
> __se_sys_ioctl+0x77/0xc0
> __x64_sys_ioctl+0x1d/0x20
> do_syscall_64+0x44/0xa0
> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> KVM currently doesn't put a root from a non-preemptible context, so other
> than the mmu_notifier wrinkle, yielding when putting a root is safe.
>
> Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
> take a reference to each root (it requires mmu_lock be held for the
> entire duration of the walk).
>
> tdp_mmu_next_root() is used only by the yield-friendly iterator.
>
> kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
>
> kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
> but is still yield-friendly in all call sites, as all callers can be
> traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
> kvm_create_vm().
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 7 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 145 +++++++++++++++++++++++---------
> 2 files changed, 109 insertions(+), 43 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index be063b6c91b7..8ce3d58fdf7f 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -65,7 +65,12 @@ struct kvm_mmu_page {
> struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> tdp_ptep_t ptep;
> };
> - DECLARE_BITMAP(unsync_child_bitmap, 512);
> + union {
> + DECLARE_BITMAP(unsync_child_bitmap, 512);
> + struct {
> + bool tdp_mmu_defunct_root;
> + };
> + };
>
> struct list_head lpage_disallowed_link;
> #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 72bcec2cd23c..aec97e037a8d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -91,21 +91,67 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>
> WARN_ON(!root->tdp_mmu_page);
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_del_rcu(&root->link);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + /*
> + * Ensure root->role.invalid is read after the refcount reaches zero to
> + * avoid zapping the root multiple times, e.g. if a different task
> + * acquires a reference (after the root was marked invalid+defunct) and
> + * puts the last reference, all while holding mmu_lock for read. Pairs
> + * with the smp_mb__before_atomic() below.
> + */
> + smp_mb__after_atomic();
> +
> + /*
> + * Free the root if it's already invalid. Invalid roots must be zapped
> + * before their last reference is put, i.e. there's no work to be done,
> + * and all roots must be invalidated (see below) before they're freed.
> + * Re-zapping defunct roots, which are always invalid, would put KVM
> + * into an infinite loop (again, see below).
> + */
> + if (root->role.invalid) {
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + list_del_rcu(&root->link);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +
> + call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + return;
> + }
> +
> + /*
> + * Invalidate the root to prevent it from being reused by a vCPU, and
> + * mark it defunct so that kvm_tdp_mmu_zap_invalidated_roots() doesn't
> + * try to put a reference it didn't acquire.
> + */
> + root->role.invalid = true;
> + root->tdp_mmu_defunct_root = true;
Ah ok so tdp_mmu_defunct_root indicates the root became invalid due to
losing all its references while it was valid. This is in contrast to
kvm_tdp_mmu_invalidate_all_roots() which marks roots invalid while they
still have valid references.
But I wonder if tdp_mmu_defunct_root is necessary? It's only used to
skip a put in zap_invalidated_roots. Could we instead unconditionally
grab a reference in invalidate_all_roots and then unconditionally drop
it?
e.g. Apply this on top:
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8ce3d58fdf7f..be063b6c91b7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -65,12 +65,7 @@ struct kvm_mmu_page {
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
tdp_ptep_t ptep;
};
- union {
- DECLARE_BITMAP(unsync_child_bitmap, 512);
- struct {
- bool tdp_mmu_defunct_root;
- };
- };
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
struct list_head lpage_disallowed_link;
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index aec97e037a8d..6bc5556b4cb7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -122,7 +122,6 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* try to put a reference it didn't acquire.
*/
root->role.invalid = true;
- root->tdp_mmu_defunct_root = true;
/*
* Ensure tdp_mmu_defunct_root is visible if a concurrent reader acquires
@@ -140,7 +139,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
* faults, and the only flow that must not consume an invalid root is
* allocating a new root for a vCPU, which also takes mmu_lock for write.
*/
- refcount_set(&root->tdp_mmu_root_count, 1);
+ refcount_add(1, &root->tdp_mmu_root_count);
/*
* Zap the root, then put the refcount "acquired" above. Recursively
@@ -984,16 +983,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
/*
* Put the reference acquired in kvm_tdp_mmu_invalidate_roots()
- * unless the root is defunct worker data, in which case no
- * reference was taken. Roots become defunct only when a valid
- * root has its last reference put, thus holding a reference
- * means the root can't become defunct between invalidating the
- * root and re-checking the data here.
*
* Note, the iterator holds its own reference.
*/
- if (!root->tdp_mmu_defunct_root)
- kvm_tdp_mmu_put_root(kvm, root, true);
+ kvm_tdp_mmu_put_root(kvm, root, true);
}
}
@@ -1015,10 +1008,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
* dropped, so even though all invalid roots are zapped, a root may not go away
* for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
*
- * Don't take a reference if the root is defunct, vCPUs cannot hold references
- * to defunct roots and so will never get stuck with zapping the root. Note,
- * defunct roots still need to be zapped by kvm_tdp_mmu_zap_invalidated_roots().
- *
* Because mmu_lock is held for write, it should be impossible to observe a
* root with zero refcount, i.e. the list of roots cannot be stale.
*
@@ -1032,9 +1021,8 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
lockdep_assert_held_write(&kvm->mmu_lock);
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
- if (!root->tdp_mmu_defunct_root &&
- !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
- root->role.invalid = true;
+ refcount_add(1, &root->tdp_mmu_root_count);
+ root->role.invalid = true;
}
}
> +
> + /*
> + * Ensure tdp_mmu_defunct_root is visible if a concurrent reader acquires
> + * a reference after the root's refcount is reset. Pairs with the
> + * smp_mb__after_atomic() above and in kvm_tdp_mmu_zap_invalidated_roots().
> + */
> + smp_mb__before_atomic();
> +
> + /*
> + * Note, if mmu_lock is held for read this can race with other readers,
> + * e.g. they may acquire a reference without seeing the root as invalid,
> + * and the refcount may be reset after the root is skipped. Both races
> + * are benign, as flows that must visit all roots, e.g. need to zap
> + * SPTEs for correctness, must take mmu_lock for write to block page
> + * faults, and the only flow that must not consume an invalid root is
> + * allocating a new root for a vCPU, which also takes mmu_lock for write.
> + */
> + refcount_set(&root->tdp_mmu_root_count, 1);
>
> /*
> - * A TLB flush is not necessary as KVM performs a local TLB flush when
> - * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> - * to a different pCPU. Note, the local TLB flush on reuse also
> - * invalidates any paging-structure-cache entries, i.e. TLB entries for
> - * intermediate paging structures, that may be zapped, as such entries
> - * are associated with the ASID on both VMX and SVM.
> + * Zap the root, then put the refcount "acquired" above. Recursively
> + * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> + * infinite loop by freeing invalid roots. By design, the root is
> + * reachable while it's being zapped, thus a different task can put its
> + * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> + * defunct root is unavoidable.
> */
> tdp_mmu_zap_root(kvm, root, shared);
> -
> - call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + kvm_tdp_mmu_put_root(kvm, root, shared);
> }
>
> enum tdp_mmu_roots_iter_type {
> @@ -758,12 +804,23 @@ static inline gfn_t tdp_mmu_max_gfn_host(void)
> static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared)
> {
> - bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
> struct tdp_iter iter;
>
> gfn_t end = tdp_mmu_max_gfn_host();
> gfn_t start = 0;
>
> + /*
> + * The root must have an elevated refcount so that it's reachable via
> + * mmu_notifier callbacks, which allows this path to yield and drop
> + * mmu_lock. When handling an unmap/release mmu_notifier command, KVM
> + * must drop all references to relevant pages prior to completing the
> + * callback. Dropping mmu_lock with an unreachable root would result
> + * in zapping SPTEs after a relevant mmu_notifier callback completes
> + * and lead to use-after-free as zapping a SPTE triggers "writeback" of
> + * dirty accessed bits to the SPTE's associated struct page.
> + */
> + WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
> +
> kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>
> rcu_read_lock();
> @@ -775,19 +832,7 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> root->role.level, start, end) {
> retry:
> - /*
> - * Yielding isn't allowed when zapping an unreachable root as
> - * the root won't be processed by mmu_notifier callbacks. When
> - * handling an unmap/release mmu_notifier command, KVM must
> - * drop all references to relevant pages prior to completing
> - * the callback. Dropping mmu_lock can result in zapping SPTEs
> - * for an unreachable root after a relevant callback completes,
> - * which leads to use-after-free as zapping a SPTE triggers
> - * "writeback" of dirty/accessed bits to the SPTE's associated
> - * struct page.
> - */
> - if (!root_is_unreachable &&
> - tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> continue;
>
> if (!is_shadow_present_pte(iter.old_spte))
> @@ -796,22 +841,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> if (!shared) {
> tdp_mmu_set_spte(kvm, &iter, 0);
> } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
> - /*
> - * cmpxchg() shouldn't fail if the root is unreachable.
> - * to be unreachable. Re-read the SPTE and retry so as
> - * not to leak the page and its children.
> - */
> - WARN_ONCE(root_is_unreachable,
> - "Contended TDP MMU SPTE in unreachable root.");
> iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> goto retry;
> }
> - /*
> - * WARN if the root is invalid and is unreachable, all SPTEs
> - * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
> - * and inserting new SPTEs under an invalid root is a KVM bug.
> - */
> - WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
> }
>
> rcu_read_unlock();
> @@ -899,6 +931,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> int i;
>
> /*
> + * Zap all roots, including defunct roots, as all SPTEs must be dropped
> + * before returning to the caller.
> + *
> * A TLB flush is unnecessary, KVM zaps everything if and only the VM
> * is being destroyed or the userspace VMM has exited. In both cases,
> * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
> @@ -924,6 +959,12 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>
> for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) {
> /*
> + * Zap the root, even if it's defunct, as all SPTEs must be
> + * dropped before returning to the caller, e.g. if the root was
> + * invalidated by a memslot update, then SPTEs associated with
> + * a deleted/moved memslot are unreachable via the mmu_notifier
> + * once the memslot update completes.
> + *
> * A TLB flush is unnecessary, invalidated roots are guaranteed
> * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
> * for more details), and unlike the legacy MMU, no vCPU kick
> @@ -935,10 +976,24 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> tdp_mmu_zap_root(kvm, root, true);
>
> /*
> - * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
> + * Leverages kvm_tdp_mmu_get_root() in the iterator, pairs with
> + * the smp_mb__before_atomic() in kvm_tdp_mmu_put_root() to
> + * ensure a defunct root is seen as such.
> + */
> + smp_mb__after_atomic();
> +
> + /*
> + * Put the reference acquired in kvm_tdp_mmu_invalidate_roots()
Correction: kvm_tdp_mmu_invalidate_all_roots
(Looks like the typo comes from a previous patch though.)
> + * unless the root is defunct worker data, in which case no
> + * reference was taken. Roots become defunct only when a valid
> + * root has its last reference put, thus holding a reference
> + * means the root can't become defunct between invalidating the
> + * root and re-checking the data here.
> + *
> * Note, the iterator holds its own reference.
> */
> - kvm_tdp_mmu_put_root(kvm, root, true);
> + if (!root->tdp_mmu_defunct_root)
> + kvm_tdp_mmu_put_root(kvm, root, true);
> }
> }
>
> @@ -953,13 +1008,17 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> * a vCPU drops the last reference to a root prior to the root being zapped, it
> * will get stuck with tearing down the entire paging structure.
> *
> - * Get a reference even if the root is already invalid,
> + * Get a reference even if the root is already invalid, unless it's defunct, as
> * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
> * invalid roots, e.g. there's no epoch to identify roots that were invalidated
> * by a previous call. Roots stay on the list until the last reference is
> * dropped, so even though all invalid roots are zapped, a root may not go away
> * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
> *
> + * Don't take a reference if the root is defunct, vCPUs cannot hold references
> + * to defunct roots and so will never get stuck with zapping the root. Note,
> + * defunct roots still need to be zapped by kvm_tdp_mmu_zap_invalidated_roots().
> + *
> * Because mmu_lock is held for write, it should be impossible to observe a
> * root with zero refcount, i.e. the list of roots cannot be stale.
> *
> @@ -971,8 +1030,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> struct kvm_mmu_page *root;
>
> lockdep_assert_held_write(&kvm->mmu_lock);
> +
> list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> - if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(kvm, root)))
> + if (!root->tdp_mmu_defunct_root &&
> + !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
> root->role.invalid = true;
> }
> }
> --
> 2.34.1.448.ga2b2bfdf31-goog
>
On Wed, Jan 05, 2022, David Matlack wrote:
> On Thu, Dec 23, 2021 at 10:23:13PM +0000, Sean Christopherson wrote:
> > Zapping at 1gb in the first pass is not arbitrary. First and foremost,
> > KVM relies on being able to zap 1gb shadow pages in a single shot when
> > when repacing a shadow page with a hugepage.
>
> When dirty logging is disabled, zap_collapsible_spte_range() does the
> bulk of the work zapping leaf SPTEs and allows yielding. I guess that
> could race with a vCPU faulting in the huge page though and the vCPU
> could do the bulk of the work.
>
> Are there any other scenarios where KVM relies on zapping 1GB worth of
> 4KB SPTEs without yielding?
Yes. Zapping executable shadow pages that were forced to be small because of the
iTLB multihit mitigation. If the VM is using nested EPT and a shadow page is
unaccounted, in which case decrementing disallow_lpage could allow a hugepage
and a fault in the 1gb region that installs a 1gb hugepage would then zap the
shadow page.
There are other scenarios, though they are much more contrived, e.g. if the guest
changes its MTRRs such that a previously disallowed hugepage is now allowed.
> In any case, 100ms is a long time to hog the CPU. Why not just do the
> safe thing and zap each level? 4K, then 2M, then 1GB, ..., then root
> level. The only argument against it I can think of is performance (lots
> of redundant walks through the page table). But I don't think root
> zapping is especially latency critical.
I'm not opposed to that approach, assuming putting a root is done asynchronously
so that the high latency isn't problematic for vCPUs. Though from a test coverage
perspective, I do like zapping at the worst case level (for the above flows).
And regarding the latency, if it's problematic we could track the number of present
SPTEs and skip the walk if there are none. The downside is that doing so would
require an atomic operation when installing SPTEs to play nice with parallel page
faults.
> > @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > }
> > }
> >
> > + if (zap_level < root->role.level) {
> > + zap_level = root->role.level;
> > + goto start;
> > + }
>
> This is probably just person opinion but I find the 2 iteration goto
> loop harder to understand than just open-coding the 2 passes.
Yeah, but it's clever! I'll add another helper unless it turns out gross for
some reason. :-)
On Wed, Jan 05, 2022, David Matlack wrote:
> On Thu, Dec 23, 2021 at 10:23:12PM +0000, Sean Christopherson wrote:
> > + /*
> > + * Invalidate the root to prevent it from being reused by a vCPU, and
> > + * mark it defunct so that kvm_tdp_mmu_zap_invalidated_roots() doesn't
> > + * try to put a reference it didn't acquire.
> > + */
> > + root->role.invalid = true;
> > + root->tdp_mmu_defunct_root = true;
>
> Ah ok so tdp_mmu_defunct_root indicates the root became invalid due to
> losing all its references while it was valid. This is in contrast to
> kvm_tdp_mmu_invalidate_all_roots() which marks roots invalid while they
> still have valid references.
>
> But I wonder if tdp_mmu_defunct_root is necessary? It's only used to
> skip a put in zap_invalidated_roots. Could we instead unconditionally
> grab a reference in invalidate_all_roots and then unconditionally drop
> it?
Hmm, it's probably not necessary. I added tdp_mmu_defunct_root before realizing
that that kvm_tdp_mmu_invalidate_all_roots() was wrong about this:
* Roots which have a zero refcount should be skipped as
* they're already being torn down.
IIRC, I added the second flag because I was trying to honor that (incorrect) logic,
and never reconsidered the need for a second flag once I got everything working.
The only downside would be keeping the memory for defunct roots around a wee bit
longer, and that's not a big deal.
I'll yank it out for the next version, assuming I didn't forget some detail...