Patch 01 fixes a complete braino and lack of testing :-(
The rest of the series is an enhancement to address a performance issue
I encountered when implementing the aforementioned fix. I wanted to WARN
if KVM_REQ_MMU_RELOAD was pending with a valid root shadow page, but it
fired like crazy because the roots in prev_roots have an elevated
root_count and thus can trigger KVM_REQ_MMU_RELOAD when the guest zaps a
cached root's corresponding PGD (in the guest).
Patches 2+ haven't been super well tested, I'll beat on 'em more and
holler if anything pops up.
Sean Christopherson (7):
KVM: x86: Retry page fault if MMU reload is pending and root has no sp
KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change
KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users
KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped
KVM: s390: Replace KVM_REQ_MMU_RELOAD usage with arch specific request
KVM: Drop KVM_REQ_MMU_RELOAD and update vcpu-requests.rst
documentation
KVM: WARN if is_unsync_root() is called on a root without a shadow
page
Documentation/virt/kvm/vcpu-requests.rst | 7 +-
arch/s390/include/asm/kvm_host.h | 2 +
arch/s390/kvm/kvm-s390.c | 8 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 98 +++++++++++++++++++++---
arch/x86/kvm/x86.c | 15 ++--
include/linux/kvm_host.h | 4 +-
virt/kvm/kvm_main.c | 5 --
10 files changed, 107 insertions(+), 37 deletions(-)
--
2.34.1.400.ga245620fadb-goog
Play nice with a NULL shadow page when checking for an obsolete root in
the page fault handler by flagging the page fault as stale if there's no
shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
Invalidating memslots, which is the only case where _all_ roots need to
be reloaded, requests all vCPUs to reload their MMUs while holding
mmu_lock for write.
The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
backed by a shadow page. Running with TDP disabled or with nested NPT
explodes spectacularly due to dereferencing a NULL shadow page pointer.
Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
root. Zapping shadow pages in response to guest activity, e.g. when the
guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
vCPU isn't using the affected root. I.e. KVM_REQ_MMU_RELOAD can be seen
with a completely valid root shadow page. This is a bit of a moot point
as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
be cleaned up in the future.
Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Cc: [email protected]
Cc: Maxim Levitsky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1ccee4d17481..1d275e9d76b5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3971,7 +3971,21 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault, int mmu_seq)
{
- if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
+ struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
+
+ /* Special roots, e.g. pae_root, are not backed by shadow pages. */
+ if (sp && is_obsolete_sp(vcpu->kvm, sp))
+ return true;
+
+ /*
+ * Roots without an associated shadow page are considered invalid if
+ * there is a pending request to free obsolete roots. The request is
+ * only a hint that the current root _may_ be obsolete and needs to be
+ * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
+ * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
+ * to reload even if no vCPU is actively using the root.
+ */
+ if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
return true;
return fault->slot &&
--
2.34.1.400.ga245620fadb-goog
Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
when the guest's CR4.PCIDE changes. This will allow tweaking the logic
of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is
the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
obsolete, needs to unconditionally unload the entire MMU, _and_ affects
only the current vCPU.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1aaf37e1bd0f..ca1f0350a868 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1045,19 +1045,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
* If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
* according to the SDM; however, stale prev_roots could be reused
* incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
- * free them all. KVM_REQ_MMU_RELOAD is fit for the both cases; it
- * is slow, but changing CR4.PCIDE is a rare case.
+ * free them all. kvm_mmu_unload() is fit for the both cases; it is
+ * slow, but changing CR4.PCIDE is a rare case.
*
* If CR4.PGE is changed, the guest TLB must be flushed.
*
- * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
- * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
- * the usage of "else if".
+ * Note: resetting MMU is a superset of unloading an MMU, and unloading
+ * an MMU is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence the "else if".
*/
if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
kvm_mmu_reset_context(vcpu);
else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
- kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_mmu_unload(vcpu);
else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}
--
2.34.1.400.ga245620fadb-goog
Remove the generic kvm_reload_remote_mmus() and open code its
functionality into the two x86 callers. x86 is (obviously) the only
architecture that uses the hook, and is also the only architecture that
uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That
will change in a future patch, as when zapping a single shadow page,
x86 doesn't actually _need_ to reload all vCPUs' MMUs; only MMUs whose
root is being zapped need to be reloaded.
s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
Drop the generic code in anticipation of implementing s390 and x86 arch
specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
Opportunistically reword the x86 TDP MMU comment to avoid making
references to functions (and requests!) when possible, and to remove the
rather ambiguous "this".
No functional change intended.
Cc: Ben Gardon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 14 +++++++-------
include/linux/kvm_host.h | 1 -
virt/kvm/kvm_main.c | 5 -----
3 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d275e9d76b5..31605cd3c09f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2388,7 +2388,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
* treats invalid shadow pages as being obsolete.
*/
if (!is_obsolete_sp(kvm, sp))
- kvm_reload_remote_mmus(kvm);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
}
if (sp->lpage_disallowed)
@@ -5669,11 +5669,11 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
*/
kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
- /* In order to ensure all threads see this change when
- * handling the MMU reload signal, this must happen in the
- * same critical section as kvm_reload_remote_mmus, and
- * before kvm_zap_obsolete_pages as kvm_zap_obsolete_pages
- * could drop the MMU lock and yield.
+ /*
+ * In order to ensure all vCPUs drop their soon-to-be invalid roots,
+ * invalidating TDP MMU roots must be done while holding mmu_lock for
+ * write and in the same critical section as making the reload request,
+ * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
*/
if (is_tdp_mmu_enabled(kvm))
kvm_tdp_mmu_invalidate_all_roots(kvm);
@@ -5686,7 +5686,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* Note: we need to do this under the protection of mmu_lock,
* otherwise, vcpu would purge shadow page but miss tlb flush.
*/
- kvm_reload_remote_mmus(kvm);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
kvm_zap_obsolete_pages(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f8ed799e8674..636e62c09964 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1112,7 +1112,6 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target);
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
void kvm_flush_remote_tlbs(struct kvm *kvm);
-void kvm_reload_remote_mmus(struct kvm *kvm);
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f3acff708bf5..e5a89592e89d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -355,11 +355,6 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
#endif
-void kvm_reload_remote_mmus(struct kvm *kvm)
-{
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
-}
-
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
--
2.34.1.400.ga245620fadb-goog
Zap only obsolete roots when responding to zapping a single root shadow
page. Because KVM keeps root_count elevated when stuffing a previous
root into its PGD cache, shadowing a 64-bit guest means that zapping any
root causes all vCPUs to reload all roots, even if their current root is
not affected by the zap.
For many guest kernels, zapping a single root is a frequent operation;
in Linux, for example, it happens whenever an mm is dropped, e.g. when a
process exits.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 66 +++++++++++++++++++++++++++++----
arch/x86/kvm/x86.c | 4 +-
4 files changed, 64 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d5fede05eb5f..62e5e842b692 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -102,6 +102,8 @@
#define KVM_REQ_MSR_FILTER_CHANGED KVM_ARCH_REQ(29)
#define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
+ KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9fbb2c8bbe2..923e0e95e7d7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -79,6 +79,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 31605cd3c09f..b6115d8ea696 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2345,7 +2345,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
struct list_head *invalid_list,
int *nr_zapped)
{
- bool list_unstable;
+ bool list_unstable, zapped_root = false;
trace_kvm_mmu_prepare_zap_page(sp);
++kvm->stat.mmu_shadow_zapped;
@@ -2387,14 +2387,20 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
* in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
* treats invalid shadow pages as being obsolete.
*/
- if (!is_obsolete_sp(kvm, sp))
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
+ zapped_root = !is_obsolete_sp(kvm, sp);
}
if (sp->lpage_disallowed)
unaccount_huge_nx_page(kvm, sp);
sp->role.invalid = 1;
+
+ /*
+ * Make the request to free obsolete roots after marking the root
+ * invalid, otherwise other vCPUs may not see it as invalid.
+ */
+ if (zapped_root)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
return list_unstable;
}
@@ -3985,7 +3991,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
* previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
* to reload even if no vCPU is actively using the root.
*/
- if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
+ if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
return true;
return fault->slot &&
@@ -4187,8 +4193,8 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
/*
* It's possible that the cached previous root page is obsolete because
* of a change in the MMU generation number. However, changing the
- * generation number is accompanied by KVM_REQ_MMU_RELOAD, which will
- * free the root set here and allocate a new one.
+ * generation number is accompanied by KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
+ * which will free the root set here and allocate a new one.
*/
kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
@@ -5113,6 +5119,52 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
}
+static bool is_obsolete_root(struct kvm_vcpu *vcpu, hpa_t root_hpa)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root_hpa))
+ return false;
+
+ /*
+ * When freeing obsolete roots, treat roots as obsolete if they don't
+ * have an associated shadow page. This does mean KVM will get false
+ * positives and free roots that don't strictly need to be freed, but
+ * such false positives are relatively rare:
+ *
+ * (a) only PAE paging and nested NPT has roots without shadow pages
+ * (b) remote reloads due to a memslot update obsoletes _all_ roots
+ * (c) KVM doesn't track previous roots for PAE paging, and the guest
+ * is unlikely to zap an in-use PGD.
+ */
+ sp = to_shadow_page(root_hpa);
+ return !sp || is_obsolete_sp(vcpu->kvm, sp);
+}
+
+static void __kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ unsigned long roots_to_free = 0;
+ int i;
+
+ if (is_obsolete_root(vcpu, mmu->root_hpa))
+ roots_to_free |= KVM_MMU_ROOT_CURRENT;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ if (is_obsolete_root(vcpu, mmu->prev_roots[i].hpa))
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ if (roots_to_free)
+ kvm_mmu_free_roots(vcpu, mmu, roots_to_free);
+}
+
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
+{
+ __kvm_mmu_free_obsolete_roots(vcpu, &vcpu->arch.root_mmu);
+ __kvm_mmu_free_obsolete_roots(vcpu, &vcpu->arch.guest_mmu);
+}
+
static bool need_remote_flush(u64 old, u64 new)
{
if (!is_shadow_present_pte(old))
@@ -5686,7 +5738,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* Note: we need to do this under the protection of mmu_lock,
* otherwise, vcpu would purge shadow page but miss tlb flush.
*/
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
kvm_zap_obsolete_pages(kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ca1f0350a868..35a4f07b4f40 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9714,8 +9714,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto out;
}
}
- if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
- kvm_mmu_unload(vcpu);
+ if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+ kvm_mmu_free_obsolete_roots(vcpu);
if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
__kvm_migrate_timers(vcpu);
if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
--
2.34.1.400.ga245620fadb-goog
Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest
prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD. This will
allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't
actually used by generic KVM.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/s390/include/asm/kvm_host.h | 2 ++
arch/s390/kvm/kvm-s390.c | 8 ++++----
arch/s390/kvm/kvm-s390.h | 2 +-
3 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a22c9266ea05..766028d54a3e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -45,6 +45,8 @@
#define KVM_REQ_START_MIGRATION KVM_ARCH_REQ(3)
#define KVM_REQ_STOP_MIGRATION KVM_ARCH_REQ(4)
#define KVM_REQ_VSIE_RESTART KVM_ARCH_REQ(5)
+#define KVM_REQ_REFRESH_GUEST_PREFIX \
+ KVM_ARCH_REQ_FLAGS(6, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define SIGP_CTRL_C 0x80
#define SIGP_CTRL_SCN_MASK 0x3f
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index dd099d352753..e161df69520c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3394,7 +3394,7 @@ static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start,
if (prefix <= end && start <= prefix + 2*PAGE_SIZE - 1) {
VCPU_EVENT(vcpu, 2, "gmap notifier for %lx-%lx",
start, end);
- kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_s390_sync_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
}
}
}
@@ -3796,19 +3796,19 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
if (!kvm_request_pending(vcpu))
return 0;
/*
- * We use MMU_RELOAD just to re-arm the ipte notifier for the
+ * If the guest prefix changed, re-arm the ipte notifier for the
* guest prefix page. gmap_mprotect_notify will wait on the ptl lock.
* This ensures that the ipte instruction for this request has
* already finished. We might race against a second unmapper that
* wants to set the blocking bit. Lets just retry the request loop.
*/
- if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu)) {
+ if (kvm_check_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu)) {
int rc;
rc = gmap_mprotect_notify(vcpu->arch.gmap,
kvm_s390_get_prefix(vcpu),
PAGE_SIZE * 2, PROT_WRITE);
if (rc) {
- kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
return rc;
}
goto retry;
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 60f0effcce99..219f92ffd556 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -105,7 +105,7 @@ static inline void kvm_s390_set_prefix(struct kvm_vcpu *vcpu, u32 prefix)
prefix);
vcpu->arch.sie_block->prefix = prefix >> GUEST_PREFIX_SHIFT;
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
- kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
}
static inline u64 kvm_s390_get_base_disp_s(struct kvm_vcpu *vcpu, u8 *ar)
--
2.34.1.400.ga245620fadb-goog
Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the
unoccupied space, and update vcpu-requests.rst, which was missing an
entry for KVM_REQ_VM_DEAD. Switching KVM_REQ_VM_DEAD to entry '1' also
fixes the stale comment about bits 4-7 being reserved.
Signed-off-by: Sean Christopherson <[email protected]>
---
Documentation/virt/kvm/vcpu-requests.rst | 7 +++----
include/linux/kvm_host.h | 3 +--
2 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index ad2915ef7020..98b8f02b7a19 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -112,11 +112,10 @@ KVM_REQ_TLB_FLUSH
choose to use the common kvm_flush_remote_tlbs() implementation will
need to handle this VCPU request.
-KVM_REQ_MMU_RELOAD
+KVM_REQ_VM_DEAD
- When shadow page tables are used and memory slots are removed it's
- necessary to inform each VCPU to completely refresh the tables. This
- request is used for that.
+ This request informs all VCPUs that the VM is dead and unusable, e.g. due to
+ fatal error or because the VM's state has been intentionally destroyed.
KVM_REQ_UNBLOCK
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 636e62c09964..7e444c4e406d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,10 +151,9 @@ static inline bool is_error_page(struct page *page)
* Bits 4-7 are reserved for more arch-independent bits.
*/
#define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_UNBLOCK 2
#define KVM_REQ_UNHALT 3
-#define KVM_REQ_VM_DEAD (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQUEST_ARCH_BASE 8
#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
--
2.34.1.400.ga245620fadb-goog
WARN and bail if is_unsync_root() is passed a root for which there is no
shadow page, i.e. is passed the physical address of one of the special
roots, which do not have an associated shadow page. The current usage
squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
and 5-level AMD CPUs are not generally available, i.e. no one can coerce
KVM into calling is_unsync_root() on pml5_root.
Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
prevents KVM from crashing.
Cc: Lai Jiangshan <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b6115d8ea696..18ecaadcf616 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3666,6 +3666,14 @@ static bool is_unsync_root(hpa_t root)
*/
smp_rmb();
sp = to_shadow_page(root);
+
+ /*
+ * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
+ * PDPTEs for a given PAE root need to be synchronized individually.
+ */
+ if (WARN_ON_ONCE(!sp))
+ return false;
+
if (sp->unsync || sp->unsync_children)
return true;
--
2.34.1.400.ga245620fadb-goog
On Thu, 9 Dec 2021 06:05:51 +0000
Sean Christopherson <[email protected]> wrote:
> Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the
> unoccupied space, and update vcpu-requests.rst, which was missing an
> entry for KVM_REQ_VM_DEAD. Switching KVM_REQ_VM_DEAD to entry '1' also
> fixes the stale comment about bits 4-7 being reserved.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
> ---
> Documentation/virt/kvm/vcpu-requests.rst | 7 +++----
> include/linux/kvm_host.h | 3 +--
> 2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
> index ad2915ef7020..98b8f02b7a19 100644
> --- a/Documentation/virt/kvm/vcpu-requests.rst
> +++ b/Documentation/virt/kvm/vcpu-requests.rst
> @@ -112,11 +112,10 @@ KVM_REQ_TLB_FLUSH
> choose to use the common kvm_flush_remote_tlbs() implementation will
> need to handle this VCPU request.
>
> -KVM_REQ_MMU_RELOAD
> +KVM_REQ_VM_DEAD
>
> - When shadow page tables are used and memory slots are removed it's
> - necessary to inform each VCPU to completely refresh the tables. This
> - request is used for that.
> + This request informs all VCPUs that the VM is dead and unusable, e.g. due to
> + fatal error or because the VM's state has been intentionally destroyed.
>
> KVM_REQ_UNBLOCK
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 636e62c09964..7e444c4e406d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -151,10 +151,9 @@ static inline bool is_error_page(struct page *page)
> * Bits 4-7 are reserved for more arch-independent bits.
> */
> #define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_UNBLOCK 2
> #define KVM_REQ_UNHALT 3
> -#define KVM_REQ_VM_DEAD (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQUEST_ARCH_BASE 8
>
> #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
On Thu, 9 Dec 2021 06:05:50 +0000
Sean Christopherson <[email protected]> wrote:
> Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest
> prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD. This will
> allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't
> actually used by generic KVM.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/include/asm/kvm_host.h | 2 ++
> arch/s390/kvm/kvm-s390.c | 8 ++++----
> arch/s390/kvm/kvm-s390.h | 2 +-
> 3 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index a22c9266ea05..766028d54a3e 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -45,6 +45,8 @@
> #define KVM_REQ_START_MIGRATION KVM_ARCH_REQ(3)
> #define KVM_REQ_STOP_MIGRATION KVM_ARCH_REQ(4)
> #define KVM_REQ_VSIE_RESTART KVM_ARCH_REQ(5)
> +#define KVM_REQ_REFRESH_GUEST_PREFIX \
> + KVM_ARCH_REQ_FLAGS(6, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>
> #define SIGP_CTRL_C 0x80
> #define SIGP_CTRL_SCN_MASK 0x3f
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index dd099d352753..e161df69520c 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -3394,7 +3394,7 @@ static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start,
> if (prefix <= end && start <= prefix + 2*PAGE_SIZE - 1) {
> VCPU_EVENT(vcpu, 2, "gmap notifier for %lx-%lx",
> start, end);
> - kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu);
> + kvm_s390_sync_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
> }
> }
> }
> @@ -3796,19 +3796,19 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
> if (!kvm_request_pending(vcpu))
> return 0;
> /*
> - * We use MMU_RELOAD just to re-arm the ipte notifier for the
> + * If the guest prefix changed, re-arm the ipte notifier for the
> * guest prefix page. gmap_mprotect_notify will wait on the ptl lock.
> * This ensures that the ipte instruction for this request has
> * already finished. We might race against a second unmapper that
> * wants to set the blocking bit. Lets just retry the request loop.
> */
> - if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu)) {
> + if (kvm_check_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu)) {
> int rc;
> rc = gmap_mprotect_notify(vcpu->arch.gmap,
> kvm_s390_get_prefix(vcpu),
> PAGE_SIZE * 2, PROT_WRITE);
> if (rc) {
> - kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> + kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
> return rc;
> }
> goto retry;
> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index 60f0effcce99..219f92ffd556 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -105,7 +105,7 @@ static inline void kvm_s390_set_prefix(struct kvm_vcpu *vcpu, u32 prefix)
> prefix);
> vcpu->arch.sie_block->prefix = prefix >> GUEST_PREFIX_SHIFT;
> kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> - kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> + kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
> }
>
> static inline u64 kvm_s390_get_base_disp_s(struct kvm_vcpu *vcpu, u8 *ar)
On 12/9/21 07:05, Sean Christopherson wrote:
> Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest
> prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD. This will
> allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't
> actually used by generic KVM.
Yes, that does sound nicer.
>
> No functional change intended.
Reviewed-by: Janosch Frank <[email protected]>
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/s390/include/asm/kvm_host.h | 2 ++
> arch/s390/kvm/kvm-s390.c | 8 ++++----
> arch/s390/kvm/kvm-s390.h | 2 +-
> 3 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index a22c9266ea05..766028d54a3e 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -45,6 +45,8 @@
> #define KVM_REQ_START_MIGRATION KVM_ARCH_REQ(3)
> #define KVM_REQ_STOP_MIGRATION KVM_ARCH_REQ(4)
> #define KVM_REQ_VSIE_RESTART KVM_ARCH_REQ(5)
> +#define KVM_REQ_REFRESH_GUEST_PREFIX \
> + KVM_ARCH_REQ_FLAGS(6, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>
> #define SIGP_CTRL_C 0x80
> #define SIGP_CTRL_SCN_MASK 0x3f
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index dd099d352753..e161df69520c 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -3394,7 +3394,7 @@ static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start,
> if (prefix <= end && start <= prefix + 2*PAGE_SIZE - 1) {
> VCPU_EVENT(vcpu, 2, "gmap notifier for %lx-%lx",
> start, end);
> - kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu);
> + kvm_s390_sync_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
> }
> }
> }
> @@ -3796,19 +3796,19 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
> if (!kvm_request_pending(vcpu))
> return 0;
> /*
> - * We use MMU_RELOAD just to re-arm the ipte notifier for the
> + * If the guest prefix changed, re-arm the ipte notifier for the
> * guest prefix page. gmap_mprotect_notify will wait on the ptl lock.
> * This ensures that the ipte instruction for this request has
> * already finished. We might race against a second unmapper that
> * wants to set the blocking bit. Lets just retry the request loop.
> */
> - if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu)) {
> + if (kvm_check_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu)) {
> int rc;
> rc = gmap_mprotect_notify(vcpu->arch.gmap,
> kvm_s390_get_prefix(vcpu),
> PAGE_SIZE * 2, PROT_WRITE);
> if (rc) {
> - kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> + kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
> return rc;
> }
> goto retry;
> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index 60f0effcce99..219f92ffd556 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -105,7 +105,7 @@ static inline void kvm_s390_set_prefix(struct kvm_vcpu *vcpu, u32 prefix)
> prefix);
> vcpu->arch.sie_block->prefix = prefix >> GUEST_PREFIX_SHIFT;
> kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> - kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> + kvm_make_request(KVM_REQ_REFRESH_GUEST_PREFIX, vcpu);
> }
>
> static inline u64 kvm_s390_get_base_disp_s(struct kvm_vcpu *vcpu, u8 *ar)
>
On 12/9/21 07:05, Sean Christopherson wrote:
> Play nice with a NULL shadow page when checking for an obsolete root in
> the page fault handler by flagging the page fault as stale if there's no
> shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
> Invalidating memslots, which is the only case where _all_ roots need to
> be reloaded, requests all vCPUs to reload their MMUs while holding
> mmu_lock for write.
>
> The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
> backed by a shadow page. Running with TDP disabled or with nested NPT
> explodes spectacularly due to dereferencing a NULL shadow page pointer.
>
> Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
> root. Zapping shadow pages in response to guest activity, e.g. when the
> guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
> vCPU isn't using the affected root. I.e. KVM_REQ_MMU_RELOAD can be seen
> with a completely valid root shadow page. This is a bit of a moot point
> as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
> be cleaned up in the future.
>
> Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
> Cc: [email protected]
> Cc: Maxim Levitsky <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1ccee4d17481..1d275e9d76b5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3971,7 +3971,21 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault, int mmu_seq)
> {
> - if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
> + struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
> +
> + /* Special roots, e.g. pae_root, are not backed by shadow pages. */
> + if (sp && is_obsolete_sp(vcpu->kvm, sp))
> + return true;
> +
> + /*
> + * Roots without an associated shadow page are considered invalid if
> + * there is a pending request to free obsolete roots. The request is
> + * only a hint that the current root _may_ be obsolete and needs to be
> + * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> + * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> + * to reload even if no vCPU is actively using the root.
> + */
> + if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
> return true;
>
> return fault->slot &&
>
Queued this one for 5.16, thanks.
Paolo
On 12/9/21 07:05, Sean Christopherson wrote:
> + /* Special roots, e.g. pae_root, are not backed by shadow pages. */
> + if (sp && is_obsolete_sp(vcpu->kvm, sp))
> + return true;
> +
> + /*
> + * Roots without an associated shadow page are considered invalid if
> + * there is a pending request to free obsolete roots. The request is
> + * only a hint that the current root _may_ be obsolete and needs to be
> + * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> + * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> + * to reload even if no vCPU is actively using the root.
> + */
> + if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
> return true;
Hmm I don't understand this (or maybe I do and I just don't like what I
understand).
KVM_REQ_MMU_RELOAD is raised after kvm->arch.mmu_valid_gen is fixed (of
course, otherwise the other CPU might just not see any obsoleted page
from the legacy MMU), therefore any check on KVM_REQ_MMU_RELOAD is just
advisory.
This is not a problem per se; in the other commit message you said,
For other MMUs, the resulting behavior is far more convoluted,
though unlikely to be truly problematic.
but it's unnecessarily complicating the logic. I'm more inclined to
just play it simple and make the special roots process the page fault;
Jiangshan's work should clean things up a bit:
--------- 8< -------------
From 0c1e30d4e7e17692668d960452107f983dd2c9a9 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <[email protected]>
Date: Fri, 10 Dec 2021 07:41:02 -0500
Subject: [PATCH] KVM: x86: Do not check obsoleteness of roots that have no sp
attached
The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
backed by a shadow page. Running with TDP disabled or with nested NPT
explodes spectacularly due to dereferencing a NULL shadow page pointer.
Play nice with a NULL shadow page when checking for an obsolete root in
is_page_fault_stale().
Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Cc: [email protected]
Cc: Maxim Levitsky <[email protected]>
Analyzed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2e1d012df22..4a3bcdd3cfe7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3987,7 +3987,17 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault, int mmu_seq)
{
- if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
+ struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
+
+ /*
+ * Special roots, e.g. pae_root, are not backed by shadow pages
+ * so there isn't an easy way to detect if they're obsolete.
+ * If they are, any child SPTE created by the fault will be useless
+ * (they will instantly be treated as obsolete because they don't
+ * match the mmu_valid_gen); but they will not leak, so just play
+ * it simple.
+ */
+ if (sp && is_obsolete_sp(vcpu->kvm, sp))
return true;
return fault->slot &&
On Fri, Dec 10, 2021, Paolo Bonzini wrote:
> On 12/9/21 07:05, Sean Christopherson wrote:
> > + /* Special roots, e.g. pae_root, are not backed by shadow pages. */
> > + if (sp && is_obsolete_sp(vcpu->kvm, sp))
> > + return true;
> > +
> > + /*
> > + * Roots without an associated shadow page are considered invalid if
> > + * there is a pending request to free obsolete roots. The request is
> > + * only a hint that the current root _may_ be obsolete and needs to be
> > + * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> > + * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> > + * to reload even if no vCPU is actively using the root.
> > + */
> > + if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
> > return true;
>
> Hmm I don't understand this (or maybe I do and I just don't like what I
> understand).
>
> KVM_REQ_MMU_RELOAD is raised after kvm->arch.mmu_valid_gen is fixed (of
> course, otherwise the other CPU might just not see any obsoleted page
> from the legacy MMU), therefore any check on KVM_REQ_MMU_RELOAD is just
> advisory.
I disagree. IMO, KVM should not be installing SPTEs into obsolete shadow pages,
which is what continuing on allows. I don't _think_ it's problematic, but I do
think it's wrong.
> This is not a problem per se; in the other commit message you said,
>
> For other MMUs, the resulting behavior is far more convoluted,
> though unlikely to be truly problematic.
>
> but it's unnecessarily complicating the logic. I'm more inclined to
> just play it simple and make the special roots process the page fault;
> Jiangshan's work should clean things up a bit:
Ya.
> --------- 8< -------------
> From 0c1e30d4e7e17692668d960452107f983dd2c9a9 Mon Sep 17 00:00:00 2001
> From: Paolo Bonzini <[email protected]>
> Date: Fri, 10 Dec 2021 07:41:02 -0500
> Subject: [PATCH] KVM: x86: Do not check obsoleteness of roots that have no sp
> attached
>
> The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
> backed by a shadow page. Running with TDP disabled or with nested NPT
> explodes spectacularly due to dereferencing a NULL shadow page pointer.
> Play nice with a NULL shadow page when checking for an obsolete root in
> is_page_fault_stale().
>
> Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
> Cc: [email protected]
> Cc: Maxim Levitsky <[email protected]>
> Analyzed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2e1d012df22..4a3bcdd3cfe7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3987,7 +3987,17 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault, int mmu_seq)
> {
> - if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa)))
> + struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
> +
> + /*
> + * Special roots, e.g. pae_root, are not backed by shadow pages
> + * so there isn't an easy way to detect if they're obsolete.
Eh, for all intents and purposes, KVM_REQ_MMU_RELOAD very much says special roots
are obsolete. The root will be unloaded, i.e. will no longer be used, i.e. is obsolete.
The other way to check for an invalid special root would be to treat it as obsolete
if any of its children in entries 0-3 are present and obsolete. That would be more
precise, but it provides no benefit given KVM's current implementation.
I'm not completely opposed to doing nothing, but I do think it's silly to continue
on knowing that the work done by the page fault is all but guaranteed to be useless.
> + * If they are, any child SPTE created by the fault will be useless
> + * (they will instantly be treated as obsolete because they don't
> + * match the mmu_valid_gen); but they will not leak, so just play
> + * it simple.
> + */
> + if (sp && is_obsolete_sp(vcpu->kvm, sp))
> return true;
> return fault->slot &&
On 12/10/21 17:01, Sean Christopherson wrote:
>> KVM_REQ_MMU_RELOAD is raised after kvm->arch.mmu_valid_gen is fixed (of
>> course, otherwise the other CPU might just not see any obsoleted page
>> from the legacy MMU), therefore any check on KVM_REQ_MMU_RELOAD is just
>> advisory.
>
> I disagree. IMO, KVM should not be installing SPTEs into obsolete shadow pages,
> which is what continuing on allows. I don't _think_ it's problematic, but I do
> think it's wrong.
>
> [...] Eh, for all intents and purposes, KVM_REQ_MMU_RELOAD very much says
> special roots are obsolete. The root will be unloaded, i.e. will no
> longer be used, i.e. is obsolete.
I understand that---but it takes some unspoken details to understand
that. In particular that both kvm_reload_remote_mmus and
is_page_fault_stale are called under mmu_lock write-lock, and that
there's no unlock between updating mmu_valid_gen and calling
kvm_reload_remote_mmus.
(This also suggests, for the other six patches, keeping
kvm_reload_remote_mmus and just moving it to arch/x86/kvm/mmu/mmu.c,
with an assertion that the MMU lock is held for write).
But since we have a way forward for having no special roots to worry
about, it seems an unnecessary overload for 1) a patch that will last
one or two releases at most 2) a case that has been handled in the
inefficient way forever.
Paolo
> The other way to check for an invalid special root would be to treat
> it as obsolete if any of its children in entries 0-3 are present and
> obsolete. That would be more precise, but it provides no benefit
> given KVM's current implementation.
>
> I'm not completely opposed to doing nothing, but I do think it's
> silly to continue on knowing that the work done by the page fault is
> all but guaranteed to be useless.
>
On Fri, Dec 10, 2021, Paolo Bonzini wrote:
> On 12/10/21 17:01, Sean Christopherson wrote:
> > > KVM_REQ_MMU_RELOAD is raised after kvm->arch.mmu_valid_gen is fixed (of
> > > course, otherwise the other CPU might just not see any obsoleted page
> > > from the legacy MMU), therefore any check on KVM_REQ_MMU_RELOAD is just
> > > advisory.
> >
> > I disagree. IMO, KVM should not be installing SPTEs into obsolete shadow pages,
> > which is what continuing on allows. I don't _think_ it's problematic, but I do
> > think it's wrong.
> >
> > [...] Eh, for all intents and purposes, KVM_REQ_MMU_RELOAD very much says
> > special roots are obsolete. The root will be unloaded, i.e. will no
> > longer be used, i.e. is obsolete.
>
> I understand that---but it takes some unspoken details to understand that.
Eh, it takes just as many unspoken details to understand why it's safe-ish to
install SPTEs into an obsolete shadow page.
> In particular that both kvm_reload_remote_mmus and is_page_fault_stale are
> called under mmu_lock write-lock, and that there's no unlock between
> updating mmu_valid_gen and calling kvm_reload_remote_mmus.
>
> (This also suggests, for the other six patches, keeping
> kvm_reload_remote_mmus and just moving it to arch/x86/kvm/mmu/mmu.c, with an
> assertion that the MMU lock is held for write).
>
> But since we have a way forward for having no special roots to worry about,
> it seems an unnecessary overload for 1) a patch that will last one or two
> releases at most
Yeah, I don't disagree, which is why I'm not totally opposed to punting this and
naturally fixing it by allocating shadow pages for the special roots. But this
code needs to be modified by Jiangshan's series either way, so it's not like we're
saving anything meaningful.
> 2) a case that has been handled in the inefficient way forever.
I don't care about inefficiency, I'm worried about correctness. It's extremely
unlikely this fixes a true bug in the legacy MMU, but there's also no real
downside to adding the check.
Anyways, either way is fine.
On Fri, Dec 10, 2021, Sean Christopherson wrote:
> On Fri, Dec 10, 2021, Paolo Bonzini wrote:
> > On 12/10/21 17:01, Sean Christopherson wrote:
> > > > KVM_REQ_MMU_RELOAD is raised after kvm->arch.mmu_valid_gen is fixed (of
> > > > course, otherwise the other CPU might just not see any obsoleted page
> > > > from the legacy MMU), therefore any check on KVM_REQ_MMU_RELOAD is just
> > > > advisory.
> > >
> > > I disagree. IMO, KVM should not be installing SPTEs into obsolete shadow pages,
> > > which is what continuing on allows. I don't _think_ it's problematic, but I do
> > > think it's wrong.
> > >
> > > [...] Eh, for all intents and purposes, KVM_REQ_MMU_RELOAD very much says
> > > special roots are obsolete. The root will be unloaded, i.e. will no
> > > longer be used, i.e. is obsolete.
> >
> > I understand that---but it takes some unspoken details to understand that.
>
> Eh, it takes just as many unspoken details to understand why it's safe-ish to
> install SPTEs into an obsolete shadow page.
>
> > In particular that both kvm_reload_remote_mmus and is_page_fault_stale are
> > called under mmu_lock write-lock, and that there's no unlock between
> > updating mmu_valid_gen and calling kvm_reload_remote_mmus.
> >
> > (This also suggests, for the other six patches, keeping
> > kvm_reload_remote_mmus and just moving it to arch/x86/kvm/mmu/mmu.c, with an
> > assertion that the MMU lock is held for write).
> >
> > But since we have a way forward for having no special roots to worry about,
> > it seems an unnecessary overload for 1) a patch that will last one or two
> > releases at most
>
> Yeah, I don't disagree, which is why I'm not totally opposed to punting this and
> naturally fixing it by allocating shadow pages for the special roots. But this
> code needs to be modified by Jiangshan's series either way, so it's not like we're
> saving anything meaningful.
>
> > 2) a case that has been handled in the inefficient way forever.
>
> I don't care about inefficiency, I'm worried about correctness. It's extremely
> unlikely this fixes a true bug in the legacy MMU, but there's also no real
> downside to adding the check.
>
> Anyways, either way is fine.
Ping, in case this dropped off your radar. Regardless of how we fix this goof,
it needs to get fixed in 5.16.
On 12/15/21 19:53, Sean Christopherson wrote:
>>
>>> 2) a case that has been handled in the inefficient way forever.
>> I don't care about inefficiency, I'm worried about correctness. It's extremely
>> unlikely this fixes a true bug in the legacy MMU, but there's also no real
>> downside to adding the check.
>>
>> Anyways, either way is fine.
> Ping, in case this dropped off your radar. Regardless of how we fix this goof,
> it needs to get fixed in 5.16.
Something has happened to my home directory in the middle of tests
running for this rc, and I can't really do anything about it until
someone sees my ticket. :/ This is the Linux maintainer version of a
dog eating my homework, I suppose.
Since there was another report of this, I'll just queue up this version
and send it out.
Paolo