For all intents and purposes, this is an x86/mmu series, but it touches
s390 and common KVM code because KVM_REQ_MMU_RELOAD is currently a generic
request despite its use being encapsulated entirely within arch code.
The meat of the series is to zap only obsolete (a.k.a. invalid) roots in
response to KVM marking a root obsolete/invalid due to it being zapped.
KVM currently drops/zaps all roots, which, aside from being a performance
hit if the guest is using multiple roots, complicates x86 KVM paths that
load a new root because it raises the question of what should be done if
there's a pending KVM_REQ_MMU_RELOAD, i.e. if the path _knows_ that any
root it loads will be obliterated.
Paolo, I'm hoping you can squash patch 01 with your patch it "fixes".
I'm also speculating that this will be applied after my patch to remove
KVM_REQ_GPC_INVALIDATE, otherwise the changelog in patch 06 will be
wrong.
v2:
- Collect reviews. [Claudio, Janosch]
- Rebase to latest kvm/queue.
v1: https://lore.kernel.org/all/[email protected]
Sean Christopherson (7):
KVM: x86: Remove spurious whitespaces from kvm_post_set_cr4()
KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change
KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users
KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped
KVM: s390: Replace KVM_REQ_MMU_RELOAD usage with arch specific request
KVM: Drop KVM_REQ_MMU_RELOAD and update vcpu-requests.rst
documentation
KVM: WARN if is_unsync_root() is called on a root without a shadow
page
Documentation/virt/kvm/vcpu-requests.rst | 7 +-
arch/s390/include/asm/kvm_host.h | 2 +
arch/s390/kvm/kvm-s390.c | 8 +--
arch/s390/kvm/kvm-s390.h | 2 +-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 83 ++++++++++++++++++++----
arch/x86/kvm/x86.c | 10 +--
include/linux/kvm_host.h | 4 +-
virt/kvm/kvm_main.c | 5 --
10 files changed, 90 insertions(+), 34 deletions(-)
base-commit: f4bc051fc91ab9f1d5225d94e52d369ef58bec58
--
2.35.1.574.g5d30c73bfb-goog
Zap only obsolete roots when responding to zapping a single root shadow
page. Because KVM keeps root_count elevated when stuffing a previous
root into its PGD cache, shadowing a 64-bit guest means that zapping any
root causes all vCPUs to reload all roots, even if their current root is
not affected by the zap.
For many kernels, zapping a single root is a frequent operation, e.g. in
Linux it happens whenever an mm is dropped, e.g. process exits, etc...
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 65 +++++++++++++++++++++++++++++----
arch/x86/kvm/x86.c | 4 +-
4 files changed, 63 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 713e08f62385..343041e892c6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -102,6 +102,8 @@
#define KVM_REQ_MSR_FILTER_CHANGED KVM_ARCH_REQ(29)
#define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
+ KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1d0c1904d69a..bf8dbc4bb12a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -80,6 +80,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 32c6d4b33d03..825996408465 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2310,7 +2310,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
struct list_head *invalid_list,
int *nr_zapped)
{
- bool list_unstable;
+ bool list_unstable, zapped_root = false;
trace_kvm_mmu_prepare_zap_page(sp);
++kvm->stat.mmu_shadow_zapped;
@@ -2352,14 +2352,20 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
* in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
* treats invalid shadow pages as being obsolete.
*/
- if (!is_obsolete_sp(kvm, sp))
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
+ zapped_root = !is_obsolete_sp(kvm, sp);
}
if (sp->lpage_disallowed)
unaccount_huge_nx_page(kvm, sp);
sp->role.invalid = 1;
+
+ /*
+ * Make the request to free obsolete roots after marking the root
+ * invalid, otherwise other vCPUs may not see it as invalid.
+ */
+ if (zapped_root)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
return list_unstable;
}
@@ -3947,7 +3953,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
* previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
* to reload even if no vCPU is actively using the root.
*/
- if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
+ if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
return true;
return fault->slot &&
@@ -4180,8 +4186,8 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
/*
* It's possible that the cached previous root page is obsolete because
* of a change in the MMU generation number. However, changing the
- * generation number is accompanied by KVM_REQ_MMU_RELOAD, which will
- * free the root set here and allocate a new one.
+ * generation number is accompanied by KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
+ * which will free the root set here and allocate a new one.
*/
kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
@@ -5085,6 +5091,51 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
}
+static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root_hpa))
+ return false;
+
+ /*
+ * When freeing obsolete roots, treat roots as obsolete if they don't
+ * have an associated shadow page. This does mean KVM will get false
+ * positives and free roots that don't strictly need to be freed, but
+ * such false positives are relatively rare:
+ *
+ * (a) only PAE paging and nested NPT has roots without shadow pages
+ * (b) remote reloads due to a memslot update obsoletes _all_ roots
+ * (c) KVM doesn't track previous roots for PAE paging, and the guest
+ * is unlikely to zap an in-use PGD.
+ */
+ sp = to_shadow_page(root_hpa);
+ return !sp || is_obsolete_sp(kvm, sp);
+}
+
+static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
+{
+ unsigned long roots_to_free = 0;
+ int i;
+
+ if (is_obsolete_root(kvm, mmu->root.hpa))
+ roots_to_free |= KVM_MMU_ROOT_CURRENT;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ if (roots_to_free)
+ kvm_mmu_free_roots(kvm, mmu, roots_to_free);
+}
+
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
+{
+ __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu);
+ __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
+}
+
static bool need_remote_flush(u64 old, u64 new)
{
if (!is_shadow_present_pte(old))
@@ -5656,7 +5707,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* Note: we need to do this under the protection of mmu_lock,
* otherwise, vcpu would purge shadow page but miss tlb flush.
*/
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
kvm_zap_obsolete_pages(kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 579b26ffc124..d6bf0562c4c4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9856,8 +9856,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto out;
}
}
- if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
- kvm_mmu_unload(vcpu);
+ if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+ kvm_mmu_free_obsolete_roots(vcpu);
if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
__kvm_migrate_timers(vcpu);
if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
--
2.35.1.574.g5d30c73bfb-goog
Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
when the guest's CR4.PCIDE changes. This will allow tweaking the logic
of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is
the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
obsolete, needs to unconditionally unload the entire MMU, _and_ affects
only the current vCPU.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2157284d05b0..579b26ffc124 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1077,7 +1077,7 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
*/
if (!tdp_enabled &&
(cr4 & X86_CR4_PCIDE) && !(old_cr4 & X86_CR4_PCIDE))
- kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_mmu_unload(vcpu);
/*
* The TLB has to be flushed for all PCIDs if any of the following
--
2.35.1.574.g5d30c73bfb-goog
Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest
prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD. This will
allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't
actually used by generic KVM.
No functional change intended.
Reviewed-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/s390/include/asm/kvm_host.h | 2 ++
arch/s390/kvm/kvm-s390.c | 8 ++++----
arch/s390/kvm/kvm-s390.h | 2 +-
3 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a22c9266ea05..766028d54a3e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -45,6 +45,8 @@
#define KVM_REQ_START_MIGRATION KVM_ARCH_REQ(3)
#define KVM_REQ_STOP_MIGRATION KVM_ARCH_REQ(4)
#define KVM_REQ_VSIE_RESTART KVM_ARCH_REQ(5)
+#define KVM_REQ_REFRESH_GUEST_PREFIX \
+ KVM_ARCH_REQ_FLAGS(6, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define SIGP_CTRL_C 0x80
#define SIGP_CTRL_SCN_MASK 0x3f
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 577f1ead6a51..db8c113562cf 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3394,7 +3394,7 @@ static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start,
if (prefix <= end && start <= prefix + 2*PAGE_SIZE - 1) {
VCPU_EVENT(vcpu, 2, "gmap notifier for %lx-%lx",
start, end);
- kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_s390_sync_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
}
}
}
@@ -3796,19 +3796,19 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
if (!kvm_request_pending(vcpu))
return 0;
/*
- * We use MMU_RELOAD just to re-arm the ipte notifier for the
+ * If the guest prefix changed, re-arm the ipte notifier for the
* guest prefix page. gmap_mprotect_notify will wait on the ptl lock.
* This ensures that the ipte instruction for this request has
* already finished. We might race against a second unmapper that
* wants to set the blocking bit. Lets just retry the request loop.
*/
- if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu)) {
+ if (kvm_check_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu)) {
int rc;
rc = gmap_mprotect_notify(vcpu->arch.gmap,
kvm_s390_get_prefix(vcpu),
PAGE_SIZE * 2, PROT_WRITE);
if (rc) {
- kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
return rc;
}
goto retry;
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 098831e815e6..45b7c1edd85f 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -105,7 +105,7 @@ static inline void kvm_s390_set_prefix(struct kvm_vcpu *vcpu, u32 prefix)
prefix);
vcpu->arch.sie_block->prefix = prefix >> GUEST_PREFIX_SHIFT;
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
- kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
}
static inline u64 kvm_s390_get_base_disp_s(struct kvm_vcpu *vcpu, u8 *ar)
--
2.35.1.574.g5d30c73bfb-goog
Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the
unoccupied space, and update vcpu-requests.rst, which was missing an
entry for KVM_REQ_VM_DEAD. Switching KVM_REQ_VM_DEAD to entry '1' also
fixes the stale comment about bits 4-7 being reserved.
Reviewed-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
Documentation/virt/kvm/vcpu-requests.rst | 7 +++----
include/linux/kvm_host.h | 3 +--
2 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index ad2915ef7020..b61d48aec36c 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -112,11 +112,10 @@ KVM_REQ_TLB_FLUSH
choose to use the common kvm_flush_remote_tlbs() implementation will
need to handle this VCPU request.
-KVM_REQ_MMU_RELOAD
+KVM_REQ_VM_DEAD
- When shadow page tables are used and memory slots are removed it's
- necessary to inform each VCPU to completely refresh the tables. This
- request is used for that.
+ This request informs all VCPUs that the VM is dead and unusable, e.g. due to
+ fatal error or because the VM's state has been intentionally destroyed.
KVM_REQ_UNBLOCK
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0aeb47cffd43..9536ffa0473b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,10 +153,9 @@ static inline bool is_error_page(struct page *page)
* Bits 4-7 are reserved for more arch-independent bits.
*/
#define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UNBLOCK 2
#define KVM_REQ_UNHALT 3
-#define KVM_REQ_VM_DEAD (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_GPC_INVALIDATE (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQUEST_ARCH_BASE 8
--
2.35.1.574.g5d30c73bfb-goog
Remove the generic kvm_reload_remote_mmus() and open code its
functionality into the two x86 callers. x86 is (obviously) the only
architecture that uses the hook, and is also the only architecture that
uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That
will change in a future patch, as x86 doesn't actually _need_ to reload
all vCPUs' MMUs when zapping a single shadow page; only MMUs whose root
is being zapped need to be reloaded.
s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
Drop the generic code in anticipation of implementing s390 and x86 arch
specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
Opportunistically reword the x86 TDP MMU comment to avoid making
references to functions (and requests!) when possible, and to remove the
rather ambiguous "this".
No functional change intended.
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 14 +++++++-------
include/linux/kvm_host.h | 1 -
virt/kvm/kvm_main.c | 5 -----
3 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..32c6d4b33d03 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2353,7 +2353,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
* treats invalid shadow pages as being obsolete.
*/
if (!is_obsolete_sp(kvm, sp))
- kvm_reload_remote_mmus(kvm);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
}
if (sp->lpage_disallowed)
@@ -5639,11 +5639,11 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
*/
kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
- /* In order to ensure all threads see this change when
- * handling the MMU reload signal, this must happen in the
- * same critical section as kvm_reload_remote_mmus, and
- * before kvm_zap_obsolete_pages as kvm_zap_obsolete_pages
- * could drop the MMU lock and yield.
+ /*
+ * In order to ensure all vCPUs drop their soon-to-be invalid roots,
+ * invalidating TDP MMU roots must be done while holding mmu_lock for
+ * write and in the same critical section as making the reload request,
+ * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
*/
if (is_tdp_mmu_enabled(kvm))
kvm_tdp_mmu_invalidate_all_roots(kvm);
@@ -5656,7 +5656,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* Note: we need to do this under the protection of mmu_lock,
* otherwise, vcpu would purge shadow page but miss tlb flush.
*/
- kvm_reload_remote_mmus(kvm);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
kvm_zap_obsolete_pages(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f11039944c08..0aeb47cffd43 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1325,7 +1325,6 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target);
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
void kvm_flush_remote_tlbs(struct kvm *kvm);
-void kvm_reload_remote_mmus(struct kvm *kvm);
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 83c57bcc6eb6..66bb1631cb89 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -354,11 +354,6 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
#endif
-void kvm_reload_remote_mmus(struct kvm *kvm)
-{
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
-}
-
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
--
2.35.1.574.g5d30c73bfb-goog
Remove leading spaces that snuck into kvm_post_set_cr4().  Fixing the
KVM_REQ_TLB_FLUSH_CURRENT request in particular is helpful, as the stray
whitespace misaligns the body of the if-statement relative to its
condition check.
Fixes: f4bc051fc91a ("KVM: x86: flush TLB separately from MMU reset")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6552360d8888..2157284d05b0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1089,7 +1089,7 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
*/
if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
(!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
- kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+ kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
/*
* The TLB has to be flushed for the current PCID if any of the
@@ -1099,7 +1099,7 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
*/
else if (((cr4 ^ old_cr4) & X86_CR4_PAE) ||
((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
- kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+ kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
}
EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
--
2.35.1.574.g5d30c73bfb-goog
WARN and bail if is_unsync_root() is passed a root for which there is no
shadow page, i.e. is passed the physical address of one of the special
roots, which do not have an associated shadow page. The current usage
squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
and 5-level AMD CPUs are not generally available, i.e. no one can coerce
KVM into calling is_unsync_root() on pml5_root.
Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
prevents KVM from crashing.
Cc: Lai Jiangshan <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 825996408465..3e7c8ad5bed9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3634,6 +3634,14 @@ static bool is_unsync_root(hpa_t root)
*/
smp_rmb();
sp = to_shadow_page(root);
+
+ /*
+ * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
+ * PDPTEs for a given PAE root need to be synchronized individually.
+ */
+ if (WARN_ON_ONCE(!sp))
+ return false;
+
if (sp->unsync || sp->unsync_children)
return true;
--
2.35.1.574.g5d30c73bfb-goog
On Fri, Feb 25, 2022 at 10:23 AM Sean Christopherson <[email protected]> wrote:
>
> Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the
> unoccupied space, and update vcpu-requests.rst, which was missing an
> entry for KVM_REQ_VM_DEAD. Switching KVM_REQ_VM_DEAD to entry '1' also
> fixes the stale comment about bits 4-7 being reserved.
>
> Reviewed-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> Documentation/virt/kvm/vcpu-requests.rst | 7 +++----
> include/linux/kvm_host.h | 3 +--
> 2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
> index ad2915ef7020..b61d48aec36c 100644
> --- a/Documentation/virt/kvm/vcpu-requests.rst
> +++ b/Documentation/virt/kvm/vcpu-requests.rst
> @@ -112,11 +112,10 @@ KVM_REQ_TLB_FLUSH
> choose to use the common kvm_flush_remote_tlbs() implementation will
> need to handle this VCPU request.
>
> -KVM_REQ_MMU_RELOAD
> +KVM_REQ_VM_DEAD
>
> - When shadow page tables are used and memory slots are removed it's
> - necessary to inform each VCPU to completely refresh the tables. This
> - request is used for that.
> + This request informs all VCPUs that the VM is dead and unusable, e.g. due to
> + fatal error or because the VM's state has been intentionally destroyed.
>
> KVM_REQ_UNBLOCK
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0aeb47cffd43..9536ffa0473b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -153,10 +153,9 @@ static inline bool is_error_page(struct page *page)
> * Bits 4-7 are reserved for more arch-independent bits.
> */
> #define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_UNBLOCK 2
> #define KVM_REQ_UNHALT 3
> -#define KVM_REQ_VM_DEAD (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_GPC_INVALIDATE (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQUEST_ARCH_BASE 8
>
> --
> 2.35.1.574.g5d30c73bfb-goog
>
On Fri, Feb 25, 2022 at 10:22 AM Sean Christopherson <[email protected]> wrote:
>
> Zap only obsolete roots when responding to zapping a single root shadow
> page. Because KVM keeps root_count elevated when stuffing a previous
> root into its PGD cache, shadowing a 64-bit guest means that zapping any
> root causes all vCPUs to reload all roots, even if their current root is
> not affected by the zap.
>
> For many kernels, zapping a single root is a frequent operation, e.g. in
> Linux it happens whenever an mm is dropped, e.g. process exits, etc...
>
Reviewed-by: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 65 +++++++++++++++++++++++++++++----
> arch/x86/kvm/x86.c | 4 +-
> 4 files changed, 63 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 713e08f62385..343041e892c6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -102,6 +102,8 @@
> #define KVM_REQ_MSR_FILTER_CHANGED KVM_ARCH_REQ(29)
> #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
> KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
> + KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>
> #define CR0_RESERVED_BITS \
> (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 1d0c1904d69a..bf8dbc4bb12a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -80,6 +80,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>
> int kvm_mmu_load(struct kvm_vcpu *vcpu);
> void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> +void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
> void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
> void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 32c6d4b33d03..825996408465 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2310,7 +2310,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
> struct list_head *invalid_list,
> int *nr_zapped)
> {
> - bool list_unstable;
> + bool list_unstable, zapped_root = false;
>
> trace_kvm_mmu_prepare_zap_page(sp);
> ++kvm->stat.mmu_shadow_zapped;
> @@ -2352,14 +2352,20 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
> * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
> * treats invalid shadow pages as being obsolete.
> */
> - if (!is_obsolete_sp(kvm, sp))
> - kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> + zapped_root = !is_obsolete_sp(kvm, sp);
> }
>
> if (sp->lpage_disallowed)
> unaccount_huge_nx_page(kvm, sp);
>
> sp->role.invalid = 1;
> +
> + /*
> + * Make the request to free obsolete roots after marking the root
> + * invalid, otherwise other vCPUs may not see it as invalid.
> + */
> + if (zapped_root)
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
> return list_unstable;
> }
>
> @@ -3947,7 +3953,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> * to reload even if no vCPU is actively using the root.
> */
> - if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
> + if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> return true;
>
> return fault->slot &&
> @@ -4180,8 +4186,8 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
> /*
> * It's possible that the cached previous root page is obsolete because
> * of a change in the MMU generation number. However, changing the
> - * generation number is accompanied by KVM_REQ_MMU_RELOAD, which will
> - * free the root set here and allocate a new one.
> + * generation number is accompanied by KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> + * which will free the root set here and allocate a new one.
> */
> kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>
> @@ -5085,6 +5091,51 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
> vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
> }
>
> +static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
> +{
> + struct kvm_mmu_page *sp;
> +
> + if (!VALID_PAGE(root_hpa))
> + return false;
> +
> + /*
> + * When freeing obsolete roots, treat roots as obsolete if they don't
> + * have an associated shadow page. This does mean KVM will get false
> + * positives and free roots that don't strictly need to be freed, but
> + * such false positives are relatively rare:
> + *
> + * (a) only PAE paging and nested NPT has roots without shadow pages
> + * (b) remote reloads due to a memslot update obsoletes _all_ roots
> + * (c) KVM doesn't track previous roots for PAE paging, and the guest
> + * is unlikely to zap an in-use PGD.
> + */
> + sp = to_shadow_page(root_hpa);
> + return !sp || is_obsolete_sp(kvm, sp);
> +}
> +
> +static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
> +{
> + unsigned long roots_to_free = 0;
> + int i;
> +
> + if (is_obsolete_root(kvm, mmu->root.hpa))
> + roots_to_free |= KVM_MMU_ROOT_CURRENT;
> +
> + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> + if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
> + roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
> + }
> +
> + if (roots_to_free)
> + kvm_mmu_free_roots(kvm, mmu, roots_to_free);
> +}
> +
> +void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
> +{
> + __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu);
> + __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
> +}
> +
> static bool need_remote_flush(u64 old, u64 new)
> {
> if (!is_shadow_present_pte(old))
> @@ -5656,7 +5707,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * Note: we need to do this under the protection of mmu_lock,
> * otherwise, vcpu would purge shadow page but miss tlb flush.
> */
> - kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
>
> kvm_zap_obsolete_pages(kvm);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 579b26ffc124..d6bf0562c4c4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9856,8 +9856,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> goto out;
> }
> }
> - if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
> - kvm_mmu_unload(vcpu);
> + if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> + kvm_mmu_free_obsolete_roots(vcpu);
> if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> __kvm_migrate_timers(vcpu);
> if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
> --
> 2.35.1.574.g5d30c73bfb-goog
>
On Fri, Feb 25, 2022 at 10:22 AM Sean Christopherson <[email protected]> wrote:
>
> Remove the generic kvm_reload_remote_mmus() and open code its
> functionality into the two x86 callers. x86 is (obviously) the only
> architecture that uses the hook, and is also the only architecture that
> uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That
> will change in a future patch, as x86 doesn't actually _need_ to reload
> all vCPUs' MMUs when zapping a single shadow page; only MMUs whose root
> is being zapped need to be reloaded.
>
> s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
>
> Drop the generic code in anticipation of implementing s390 and x86 arch
> specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
>
> Opportunistically reword the x86 TDP MMU comment to avoid making
> references to functions (and requests!) when possible, and to remove the
> rather ambiguous "this".
>
> No functional change intended.
>
> Cc: Ben Gardon <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 14 +++++++-------
> include/linux/kvm_host.h | 1 -
> virt/kvm/kvm_main.c | 5 -----
> 3 files changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b2c1c4eb6007..32c6d4b33d03 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2353,7 +2353,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
> * treats invalid shadow pages as being obsolete.
> */
> if (!is_obsolete_sp(kvm, sp))
> - kvm_reload_remote_mmus(kvm);
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> }
>
> if (sp->lpage_disallowed)
> @@ -5639,11 +5639,11 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> */
> kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
>
> - /* In order to ensure all threads see this change when
> - * handling the MMU reload signal, this must happen in the
> - * same critical section as kvm_reload_remote_mmus, and
> - * before kvm_zap_obsolete_pages as kvm_zap_obsolete_pages
> - * could drop the MMU lock and yield.
> + /*
> + * In order to ensure all vCPUs drop their soon-to-be invalid roots,
> + * invalidating TDP MMU roots must be done while holding mmu_lock for
> + * write and in the same critical section as making the reload request,
> + * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
> */
> if (is_tdp_mmu_enabled(kvm))
> kvm_tdp_mmu_invalidate_all_roots(kvm);
> @@ -5656,7 +5656,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * Note: we need to do this under the protection of mmu_lock,
> * otherwise, vcpu would purge shadow page but miss tlb flush.
> */
> - kvm_reload_remote_mmus(kvm);
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
>
> kvm_zap_obsolete_pages(kvm);
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f11039944c08..0aeb47cffd43 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1325,7 +1325,6 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target);
> void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
>
> void kvm_flush_remote_tlbs(struct kvm *kvm);
> -void kvm_reload_remote_mmus(struct kvm *kvm);
>
> #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 83c57bcc6eb6..66bb1631cb89 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -354,11 +354,6 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
> #endif
>
> -void kvm_reload_remote_mmus(struct kvm *kvm)
> -{
> - kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> -}
> -
> #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> gfp_t gfp_flags)
> --
> 2.35.1.574.g5d30c73bfb-goog
>
On Fri, Feb 25, 2022 at 10:23 AM Sean Christopherson <[email protected]> wrote:
>
> WARN and bail if is_unsync_root() is passed a root for which there is no
> shadow page, i.e. is passed the physical address of one of the special
> roots, which do not have an associated shadow page. The current usage
> squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
> kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
> and 5-level AMD CPUs are not generally available, i.e. no one can coerce
> KVM into calling is_unsync_root() on pml5_root.
>
> Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
> prevents KVM from crashing.
>
> Cc: Lai Jiangshan <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 825996408465..3e7c8ad5bed9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3634,6 +3634,14 @@ static bool is_unsync_root(hpa_t root)
> */
> smp_rmb();
> sp = to_shadow_page(root);
> +
> + /*
> + * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
> + * PDPTEs for a given PAE root need to be synchronized individually.
> + */
> + if (WARN_ON_ONCE(!sp))
> + return false;
> +
I was trying to figure out if this should be returning true or false,
but neither really seems correct. Since we never expect this to fire,
perhaps it doesn't matter and it's easier to just return false so the
callers don't need to be changed. If this did fire in a production
scenario, I'd want it to terminate the VM too.
> if (sp->unsync || sp->unsync_children)
> return true;
>
> --
> 2.35.1.574.g5d30c73bfb-goog
>
On Mon, Feb 28, 2022, Ben Gardon wrote:
> On Fri, Feb 25, 2022 at 10:23 AM Sean Christopherson <[email protected]> wrote:
> >
> > WARN and bail if is_unsync_root() is passed a root for which there is no
> > shadow page, i.e. is passed the physical address of one of the special
> > roots, which do not have an associated shadow page. The current usage
> > squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
> > kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
> > and 5-level AMD CPUs are not generally available, i.e. no one can coerce
> > KVM into calling is_unsync_root() on pml5_root.
> >
> > Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
> > prevents KVM from crashing.
> >
> > Cc: Lai Jiangshan <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 825996408465..3e7c8ad5bed9 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3634,6 +3634,14 @@ static bool is_unsync_root(hpa_t root)
> > */
> > smp_rmb();
> > sp = to_shadow_page(root);
> > +
> > + /*
> > + * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
> > + * PDPTEs for a given PAE root need to be synchronized individually.
> > + */
> > + if (WARN_ON_ONCE(!sp))
> > + return false;
> > +
>
> I was trying to figure out if this should be returning true or false,
> but neither really seems correct. Since we never expect this to fire,
> perhaps it doesn't matter and it's easier to just return false so the
> callers don't need to be changed.
Yep, neither is correct.
> If this did fire in a production scenario, I'd want it to terminate the VM
> too.
Me too, but practically speaking this should never get anywhere near production.
IMO, it's not worth plumbing in @kvm just to be able to do KVM_BUG_ON.
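
For posterity, the plumbing in question would look something like the below
(purely hypothetical, not something I'm proposing); every caller would also
need to pass in vcpu->kvm:

static bool is_unsync_root(struct kvm *kvm, hpa_t root)
{
	struct kvm_mmu_page *sp;

	if (!VALID_PAGE(root))
		return false;

	/* Barrier as in the existing helper, comment elided for brevity. */
	smp_rmb();
	sp = to_shadow_page(root);

	/* Terminate the VM instead of limping along if this ever fires. */
	if (KVM_BUG_ON(!sp, kvm))
		return false;

	return sp->unsync || sp->unsync_children;
}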
On 2/25/22 19:22, Sean Christopherson wrote:
> @@ -5656,7 +5707,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * Note: we need to do this under the protection of mmu_lock,
> * otherwise, vcpu would purge shadow page but miss tlb flush.
> */
> - kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
>
I was going to squash in this:
* invalidating TDP MMU roots must be done while holding mmu_lock for
- * write and in the same critical section as making the reload request,
+ * write and in the same critical section as making the free request,
* e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
But then I realized that this needs better comments and that my knowledge of
this has serious holes. Regarding this comment, this is my proposal:
/*
* Invalidated TDP MMU roots are zapped within MMU read_lock to be
* able to walk the list of roots, but with the expectation of no
* concurrent change to the pages themselves. There cannot be
* any yield between kvm_tdp_mmu_invalidate_all_roots and the free
* request, otherwise somebody could grab a reference to the root
* and break that assumption.
*/
if (is_tdp_mmu_enabled(kvm))
kvm_tdp_mmu_invalidate_all_roots(kvm);
However, for the second comment (the one in the context above), there's much
more. From easier to harder:
1) I'm basically clueless about the TLB flush "note" above.
2) It's not clear to me what needs to use for_each_tdp_mmu_root; for
example, why would anything but the MMU notifiers use for_each_tdp_mmu_root?
It is used in kvm_tdp_mmu_write_protect_gfn, kvm_tdp_mmu_try_split_huge_pages
and kvm_tdp_mmu_clear_dirty_pt_masked.
3) Does it make sense that yielding users of for_each_tdp_mmu_root must
either look at valid roots only, or take MMU lock for write? If so, can
this be enforced in tdp_mmu_next_root?
4) If the previous point is correct, _who_ could grab a reference and
not release it before kvm_tdp_mmu_zap_invalidated_roots runs? That is,
is "somebody could grab a reference" an accurate explanation in the first
comment above?
Thanks,
Paolo
On 2/25/22 19:22, Sean Christopherson wrote:
> For all intents and purposes, this is an x86/mmu series, but it touches
> s390 and common KVM code because KVM_REQ_MMU_RELOAD is currently a generic
> request despite its use being encapsulated entirely within arch code.
>
> The meat of the series is to zap only obsolete (a.k.a. invalid) roots in
> response to KVM marking a root obsolete/invalid due to it being zapped.
> KVM currently drops/zaps all roots, which, aside from being a performance
> hit if the guest is using multiple roots, complicates x86 KVM paths that
> load a new root because it raises the question of what should be done if
> there's a pending KVM_REQ_MMU_RELOAD, i.e. if the path _knows_ that any
> root it loads will be obliterated.
>
> Paolo, I'm hoping you can squash patch 01 with your patch it "fixes".
>
> I'm also speculating that this will be applied after my patch to remove
> KVM_REQ_GPC_INVALIDATE, otherwise the changelog in patch 06 will be
> wrong.
Queued, thanks.
Paolo
> v2:
> - Collect reviews. [Claudio, Janosch]
> - Rebase to latest kvm/queue.
>
> v1: https://lore.kernel.org/all/[email protected]
>
> Sean Christopherson (7):
> KVM: x86: Remove spurious whitespaces from kvm_post_set_cr4()
> KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change
> KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users
> KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped
> KVM: s390: Replace KVM_REQ_MMU_RELOAD usage with arch specific request
> KVM: Drop KVM_REQ_MMU_RELOAD and update vcpu-requests.rst
> documentation
> KVM: WARN if is_unsync_root() is called on a root without a shadow
> page
>
> Documentation/virt/kvm/vcpu-requests.rst | 7 +-
> arch/s390/include/asm/kvm_host.h | 2 +
> arch/s390/kvm/kvm-s390.c | 8 +--
> arch/s390/kvm/kvm-s390.h | 2 +-
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 83 ++++++++++++++++++++----
> arch/x86/kvm/x86.c | 10 +--
> include/linux/kvm_host.h | 4 +-
> virt/kvm/kvm_main.c | 5 --
> 10 files changed, 90 insertions(+), 34 deletions(-)
>
>
> base-commit: f4bc051fc91ab9f1d5225d94e52d369ef58bec58
On 3/1/22 18:55, Paolo Bonzini wrote:
> On 2/25/22 19:22, Sean Christopherson wrote:
>> @@ -5656,7 +5707,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>> * Note: we need to do this under the protection of mmu_lock,
>> * otherwise, vcpu would purge shadow page but miss tlb flush.
>> */
>> - kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
>> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
>
> I was going to squash in this:
>
> * invalidating TDP MMU roots must be done while holding mmu_lock for
> - * write and in the same critical section as making the reload
> request,
> + * write and in the same critical section as making the free request,
> * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and
> yield.
>
> But then I realized that this needs better comments and that my
> knowledge of
> this has serious holes. Regarding this comment, this is my proposal:
>
> /*
> * Invalidated TDP MMU roots are zapped within MMU read_lock to be
> * able to walk the list of roots, but with the expectation of no
> * concurrent change to the pages themselves. There cannot be
> * any yield between kvm_tdp_mmu_invalidate_all_roots and the free
> * request, otherwise somebody could grab a reference to the root
> * and break that assumption.
> */
> if (is_tdp_mmu_enabled(kvm))
> kvm_tdp_mmu_invalidate_all_roots(kvm);
>
> However, for the second comment (the one in the context above), there's
> much
> more. From easier to harder:
>
> 1) I'm basically clueless about the TLB flush "note" above.
>
> 2) It's not clear to me what needs to use for_each_tdp_mmu_root; for
> example, why would anything but the MMU notifiers use
> for_each_tdp_mmu_root?
> It is used in kvm_tdp_mmu_write_protect_gfn,
> kvm_tdp_mmu_try_split_huge_pages
> and kvm_tdp_mmu_clear_dirty_pt_masked.
>
> 3) Does it make sense that yielding users of for_each_tdp_mmu_root must
> either look at valid roots only, or take MMU lock for write? If so, can
> this be enforced in tdp_mmu_next_root?
Ok, I could understand this a little better now, but please correct me
if this is incorrect:
2) if I'm not wrong, kvm_tdp_mmu_try_split_huge_pages indeed does not
need to walk invalid roots. The others do because the TDP MMU does
not necessarily kick vCPUs after marking roots as invalid. But
because TDP MMU roots are gone for good once their refcount hits 0,
I wonder if we could do something like
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7e3d1f985811..a4a6dfee27f9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -164,6 +164,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
*/
if (!kvm_tdp_root_mark_invalid(root)) {
refcount_set(&root->tdp_mmu_root_count, 1);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
/*
* If the struct kvm is alive, we might as well zap the root
@@ -1099,12 +1100,16 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
+ bool invalidated_root = false;
lockdep_assert_held_write(&kvm->mmu_lock);
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
- root->role.invalid = true;
+ invalidated_root |= !kvm_tdp_root_mark_invalid(root);
}
+
+ if (invalidated_root)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
}
/*
(based on my own version of Sean's patches) and stop walking invalid roots
in kvm_tdp_mmu_write_protect_gfn and kvm_tdp_mmu_clear_dirty_pt_masked.
3) Yes, it makes sense that yielding users of for_each_tdp_mmu_root must
either look at valid roots only, or take MMU lock for write. The only
exception is kvm_tdp_mmu_try_split_huge_pages, which does not need to
walk invalid roots. And kvm_tdp_mmu_zap_invalidated_pages(), but that
one is basically an asynchronous worker [and this is where I had the
inspiration to get rid of the function altogether]
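
If we wanted to enforce it, I think a one-liner at the top of
tdp_mmu_next_root() would be enough; a rough sketch, assuming the current
"shared"/"only_valid" parameters (where "shared" means mmu_lock is held for
read):

	/*
	 * Yielding walkers hold mmu_lock for read, so they must restrict
	 * themselves to valid roots; walking invalid roots requires holding
	 * mmu_lock for write.
	 */
	WARN_ON_ONCE(shared && !only_valid);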
Paolo
On 3/2/22 20:45, Sean Christopherson wrote:
> AMD NPT is hosed because KVM's awful ASID scheme doesn't assign an ASID per root
> and doesn't force a new ASID. IMO, this is an SVM mess and not a TDP MMU bug.
I agree.
> In the short term, I think something like the following would suffice. Long term,
> we really need to redo SVM ASID management so that ASIDs are tied to a KVM root.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c5e3f219803e..7899ca4748c7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3857,6 +3857,9 @@ static void svm_load_mmu_pgd(struct kvm_vcpu
*vcpu, hpa_t root_hpa,
unsigned long cr3;
if (npt_enabled) {
+ if (is_tdp_mmu_root(root_hpa))
+ svm->current_vmcb->asid_generation = 0;
+
svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
Why not just new_asid (even unconditionally, who cares)?
BTW yeah, the smoke test worked but the actual one failed horribly.
Paolo
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/1/22 18:55, Paolo Bonzini wrote:
> > On 2/25/22 19:22, Sean Christopherson wrote:
> > > @@ -5656,7 +5707,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > >       * Note: we need to do this under the protection of mmu_lock,
> > >       * otherwise, vcpu would purge shadow page but miss tlb flush.
> > >       */
> > > -    kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> > > +    kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
> >
> > I was going to squash in this:
> >
> >      * invalidating TDP MMU roots must be done while holding mmu_lock for
> > -    * write and in the same critical section as making the reload
> > request,
> > +    * write and in the same critical section as making the free request,
> >      * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and
> > yield.
> >
> > But then I realized that this needs better comments and that my
> > knowledge of
> > this has serious holes.  Regarding this comment, this is my proposal:
> >
> >         /*
> >          * Invalidated TDP MMU roots are zapped within MMU read_lock to be
> >          * able to walk the list of roots, but with the expectation of no
> >          * concurrent change to the pages themselves.  There cannot be
> >          * any yield between kvm_tdp_mmu_invalidate_all_roots and the free
> >          * request, otherwise somebody could grab a reference to the root
> >          * and break that assumption.
> >          */
> >         if (is_tdp_mmu_enabled(kvm))
> >                 kvm_tdp_mmu_invalidate_all_roots(kvm);
> >
> > However, for the second comment (the one in the context above), there's
> > much more.  From easier to harder:
> >
> > 1) I'm basically clueless about the TLB flush "note" above.
I assume you're referring to this ancient thing?
* Note: we need to do this under the protection of mmu_lock,
* otherwise, vcpu would purge shadow page but miss tlb flush.
The "vcpu" part should be "KVM", or more precisely kvm_zap_obsolete_pages().
The fast zap (not a vCPU) will drop mmu_lock() if it yields when "preparing" the
zap, so the remote TLB flush via the kvm_mmu_commit_zap_page() is too late.
> > 2) It's not clear to me what needs to use for_each_tdp_mmu_root; for
> > example, why would anything but the MMU notifiers use
> > for_each_tdp_mmu_root?
> > It is used in kvm_tdp_mmu_write_protect_gfn,
> > kvm_tdp_mmu_try_split_huge_pages
> > and kvm_tdp_mmu_clear_dirty_pt_masked.
> >
> > 3) Does it make sense that yielding users of for_each_tdp_mmu_root must
> > either look at valid roots only, or take MMU lock for write?  If so, can
> > this be enforced in tdp_mmu_next_root?
>
> Ok, I could understand this a little better now, but please correct me
> if this is incorrect:
>
> 2) if I'm not wrong, kvm_tdp_mmu_try_split_huge_pages indeed does not
> need to walk invalid roots.
Correct, it doesn't need to walk invalid roots. The only flows that need to walk
invalid roots are the mmu_notifiers (or kvm_arch_flush_shadow_all() if KVM x86 were
somehow able to survive without notifiers).
> The others do because the TDP MMU does not necessarily kick vCPUs after
> marking roots as invalid.
Fudge. I'm pretty sure AMD/SVM TLB management is broken for the TDP MMU (though
I would argue that KVM's ASID management is broken regardless of the TDP MMU...).
The notifiers need to walk all roots because they need to guarantee any metadata
accounting, e.g. propagation of dirty bits, for the associated (host) pfn occurs
before the notifier returns. It's not an issue of vCPUs having stale references,
or at least it shouldn't be, it's an issue of the "writeback" occurring after the
pfn is fully released.
In the "fast zap", the KVM always kicks vCPUs after marking them invalid, before
dropping mmu_lock (which is held for write). This is mandatory because the memslot
is being deleted/moved, so KVM must guarantee the old slot can't be accessed by
the guest.
In the put_root() path, there _shouldn't_ be a need to kick because the vCPU doesn't
have a reference to the root, and the last vCPU to drop a reference to the root
_should_ ensure it's unreachable.
Intel EPT is fine, because the EPT4A ensures a unique ASID, i.e. KVM can defer
any TLB flush until the same physical root page is reused.
Shadow paging is fine because kvm_mmu_free_roots()'s call to kvm_mmu_commit_zap_page()
will flush TLBs for all vCPUs when the last reference is put.
AMD NPT is hosed because KVM's awful ASID scheme doesn't assign an ASID per root
and doesn't force a new ASID. IMO, this is an SVM mess and not a TDP MMU bug.
In the short term, I think something like the following would suffice. Long term,
we really need to redo SVM ASID management so that ASIDs are tied to a KVM root.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 54bc8118c40a..2dbbf67dfd21 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -70,11 +70,8 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
-static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
+static inline bool is_tdp_mmu_root(hpa_t hpa)
{
- struct kvm_mmu_page *sp;
- hpa_t hpa = mmu->root.hpa;
-
if (WARN_ON(!VALID_PAGE(hpa)))
return false;
@@ -86,10 +83,16 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
sp = to_shadow_page(hpa);
return sp && is_tdp_mmu_page(sp) && sp->root_count;
}
+
+static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
+{
+ return is_tdp_mmu_root(mmu->root.hpa);
+}
#else
static inline bool kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return false; }
static inline void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) {}
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
+static inline bool is_tdp_mmu_root(hpa_t hpa) { return false; }
static inline bool is_tdp_mmu(struct kvm_mmu *mmu) { return false; }
#endif
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c5e3f219803e..7899ca4748c7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3857,6 +3857,9 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
unsigned long cr3;
if (npt_enabled) {
+ if (is_tdp_mmu_root(root_hpa))
+ svm->current_vmcb->asid_generation = 0;
+
svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
> But because TDP MMU roots are gone for good once their refcount hits 0, I
> wonder if we could do something like
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7e3d1f985811..a4a6dfee27f9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -164,6 +164,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> */
> if (!kvm_tdp_root_mark_invalid(root)) {
> refcount_set(&root->tdp_mmu_root_count, 1);
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
> /*
> * If the struct kvm is alive, we might as well zap the root
> @@ -1099,12 +1100,16 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> {
> struct kvm_mmu_page *root;
> + bool invalidated_root = false;
> lockdep_assert_held_write(&kvm->mmu_lock);
> list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
> - root->role.invalid = true;
> + invalidated_root |= !kvm_tdp_root_mark_invalid(root);
> }
> +
> + if (invalidated_root)
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
> }
This won't work, see my other response about not being able to use a worker for
this path (my brain is finally getting up to speed today...).
>
> 3) Yes, it makes sense that yielding users of for_each_tdp_mmu_root must
> either look at valid roots only, or take MMU lock for write. The only
> exception is kvm_tdp_mmu_try_split_huge_pages, which does not need to
> walk invalid roots. And kvm_tdp_mmu_zap_invalidated_pages(), but that
> one is basically an asynchronous worker [and this is where I had the
> inspiration to get rid of the function altogether]
>
> Paolo
>
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 20:45, Sean Christopherson wrote:
> > AMD NPT is hosed because KVM's awful ASID scheme doesn't assign an ASID per root
> > and doesn't force a new ASID. IMO, this is an SVM mess and not a TDP MMU bug.
>
> I agree.
>
> > In the short term, I think something like the following would suffice. Long term,
> > we really need to redo SVM ASID management so that ASIDs are tied to a KVM root.
>
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index c5e3f219803e..7899ca4748c7 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3857,6 +3857,9 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu,
> hpa_t root_hpa,
> unsigned long cr3;
>
> if (npt_enabled) {
> + if (is_tdp_mmu_root(root_hpa))
> + svm->current_vmcb->asid_generation = 0;
> +
> svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
> vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
>
> Why not just new_asid
My mental coin flip came up tails? new_asid() is definitely more intuitive.
> (even unconditionally, who cares)?
Heh, I was going to say we do care to some extent for nested transitions, then
I remembered we flush on every nested transition anyways, in no small part because
the ASID handling is a mess.
On 3/2/22 23:53, Sean Christopherson wrote:
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index c5e3f219803e..7899ca4748c7 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -3857,6 +3857,9 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu,
>> hpa_t root_hpa,
>> unsigned long cr3;
>>
>> if (npt_enabled) {
>> + if (is_tdp_mmu_root(root_hpa))
>> + svm->current_vmcb->asid_generation = 0;
>> +
>> svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
>> vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
>>
>> Why not just new_asid
> My mental coin flip came up tails? new_asid() is definitely more intuitive.
>
Can you submit a patch (seems like 5.17+stable material)?
Paolo
On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> On 3/2/22 23:53, Sean Christopherson wrote:
> > >
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index c5e3f219803e..7899ca4748c7 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -3857,6 +3857,9 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu,
> > > hpa_t root_hpa,
> > > unsigned long cr3;
> > >
> > > if (npt_enabled) {
> > > + if (is_tdp_mmu_root(root_hpa))
> > > + svm->current_vmcb->asid_generation = 0;
> > > +
> > > svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
> > > vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
> > >
> > > Why not just new_asid
> > My mental coin flip came up tails? new_asid() is definitely more intuitive.
> >
>
> Can you submit a patch (seems like 5.17+stable material)?
After a lot more thinking, there's no bug. If KVM unloads all roots, e.g. fast zap,
then all vCPUs are guaranteed to go through kvm_mmu_load(), and that will flush the
current ASID.
So the only problematic path is KVM_REQ_LOAD_MMU_PGD, which has two users,
kvm_mmu_new_pgd() and load_pdptrs(). load_pdptrs() is benign because it triggers
a "false" PGD load only top get PDPTRs updated on EPT, the actual PGD doesn't change
(or rather isn't forced to change by load_pdptrs().
Nested SVM's use of kvm_mmu_new_pgd() is "ok" because KVM currently flushes on
every transition.
That leaves kvm_set_cr3() via kvm_mmu_new_pgd(). For NPT, lack of a flush is
moot because KVM shouldn't be loading a new PGD in the first place (see our other
discussion about doing:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf17af4d6904..f11199b41ca8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1212,7 +1212,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
return 1;
- if (cr3 != kvm_read_cr3(vcpu))
+ if (!tdp_enabled && cr3 != kvm_read_cr3(vcpu))
kvm_mmu_new_pgd(vcpu, cr3);
vcpu->arch.cr3 = cr3;
Non-NPT shadow paging is ok because either the MOV CR3 will do a TLB flush, or the
guest explicitly says "don't do a TLB flush", in which case KVM is off the hook
from a correctness perspective (it's the guest's responsibility to ensure the MMU
is sync'd), and is ok from a safety perspective because the legacy MMU does a
remote TLB flush if it zaps any pages, i.e. the guest can't trigger a use-after-free.
All that said, this is another argument against dropping kvm_mmu_unload() from
kvm_mmu_reset_context()[*]: SMM would theoretically be broken on AMD due to reusing
the same ASID for both non-SMM and SMM roots/memslots.
In practice, I don't think it can actually happen, but that's mostly dumb luck.
em_rsm() temporarily transitions back to Real Mode before loading the actual
non-SMM guest state, so only SMI that arrives with CR0.PG=0 is problematic. In
that case, TLB flushes may not be triggered by kvm_set_cr0() or kvm_set_cr4(),
but kvm_set_cr3() will always trigger a flush because the "no flush" PCID bit
will always be clear. Well, unless the SMM handler writes the read-only SMRAM
field, at which point it deserves to die :-)
Anyways, before we transition SMM away from kvm_mmu_reset_context(), we should
add an explicit KVM_REQ_TLB_FLUSH_CURRENT in svm_{enter,leave}_smm(), with a TODO
similar to nested_svm_transition_tlb_flush() to document that the explicit flush
can go away when KVM ensures unique ASIDs for non-SMM vs. SMM.
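
E.g. something like the below, added to both svm_enter_smm() and svm_leave_smm()
(untested, just to illustrate the shape of it):

	/*
	 * TODO: Remove the explicit flush once KVM ensures unique ASIDs for
	 *	 non-SMM vs. SMM roots, a la nested_svm_transition_tlb_flush().
	 */
	kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);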
[*] https://lore.kernel.org/all/[email protected]