The TDP MMU has a performance regression compared to the legacy MMU
when CR0 changes often. This was reported for the grsecurity kernel,
which uses CR0.WP to implement kernel W^X. In that case, each change to
CR0.WP unloads the MMU and causes a lot of unnecessary work. When running
nested, this can even prevent L1 from making much progress, as the L0
hypervisor is overwhelmed by the amount of MMU work that is needed.
Initially, my plan for this was to pull kvm_mmu_unload from
kvm_mmu_reset_context into kvm_init_mmu. To that end, I started by separating
the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.) from the shadow
page table format. Right now the "MMU role" is a messy mix of the two
and, whenever something is different between the MMU and the CPU, it is
stored as an extra field in struct kvm_mmu; for extra bonus complication,
sometimes the same thing is stored in both the role and an extra field.
The aim was to keep kvm_mmu_unload only if the MMU role changed, and
drop it if the CPU role changed.
I even posted that cleanup, but it occurred to me later that even
a conditional kvm_mmu_unload in kvm_init_mmu would be overkill.
kvm_mmu_unload is only needed in the rare cases where a TLB flush is
needed (e.g. CR0.PG changing from 1 to 0) or where the guest page table
interpretation changes in a way not captured by the role (that is, CPUID
changes). But the implementation of fast PGD switching is subtle
and requires a call to kvm_mmu_new_pgd (and therefore knowing the
new MMU role) before kvm_init_mmu, therefore kvm_mmu_reset_context
chickens out and drops all the roots.
Therefore, the meat of this series is a reorganization of fast PGD
switching; it makes it possible to call kvm_mmu_new_pgd *after*
the MMU has been set up, just using the MMU role instead of
kvm_mmu_calc_root_page_role.
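To give an idea of the end state, after patch 12 kvm_mmu_reset_context
boils down to roughly the following (shown here only to illustrate the
new ordering of MMU setup and PGD load; the exact change is in the last
patch):

	void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
	{
		kvm_init_mmu(vcpu);
		kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
	}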
Patches 1 to 3 are bugfixes found while working on the series.
Patches 4 and 5 add more sanity checks that triggered a lot during
development.
Patches 6 and 7 are related cleanups. In particular patch 7 makes
the cache lookup code a bit more pleasant.
Patches 8 and 9 rework the fast PGD switching. Patches 10 and
11 are cleanups enabled by the rework, and the only survivors
of the CPU role patchset.
Finally, patch 12 optimizes kvm_mmu_reset_context.
Paolo
Paolo Bonzini (12):
KVM: x86: host-initiated EFER.LME write affects the MMU
KVM: MMU: move MMU role accessors to header
KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
KVM: MMU: rename kvm_mmu_reload
KVM: x86: use struct kvm_mmu_root_info for mmu->root
KVM: MMU: do not consult levels when freeing roots
KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
KVM: MMU: load new PGD after the shadow MMU is initialized
KVM: MMU: remove kvm_mmu_calc_root_page_role
KVM: x86: do not unload MMU roots on all role changes
arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/kvm/mmu.h | 28 +++-
arch/x86/kvm/mmu/mmu.c | 253 ++++++++++++++++----------------
arch/x86/kvm/mmu/mmu_audit.c | 4 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
arch/x86/kvm/svm/nested.c | 6 +-
arch/x86/kvm/vmx/nested.c | 8 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 39 +++--
11 files changed, 190 insertions(+), 159 deletions(-)
--
2.31.1
The name of kvm_mmu_reload is very confusing for two reasons:
first, KVM_REQ_MMU_RELOAD actually does not call it; second,
it only does anything if there is no valid root.
Rename it to kvm_mmu_ensure_valid_root, which matches the actual
behavior better.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/x86.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b9d06a218b2c..c9f1c2162ade 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -104,7 +104,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
-static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
+static inline int kvm_mmu_ensure_valid_root(struct kvm_vcpu *vcpu)
{
if (likely(vcpu->arch.mmu->root_hpa != INVALID_PAGE))
return 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 98aca0f2af12..2685fb62807e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9976,7 +9976,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}
- r = kvm_mmu_reload(vcpu);
+ r = kvm_mmu_ensure_valid_root(vcpu);
if (unlikely(r)) {
goto cancel_injection;
}
@@ -12164,7 +12164,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
work->wakeup_all)
return;
- r = kvm_mmu_reload(vcpu);
+ r = kvm_mmu_ensure_valid_root(vcpu);
if (unlikely(r))
return;
--
2.31.1
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e0c0f0bc2e8b..7b5765ced928 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5065,12 +5065,21 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
return r;
}
+static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+ int i;
+ kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(mmu->root_hpa));
+ if (mmu->pae_root) {
+ for (i = 0; i < 4; ++i)
+ WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
+ }
+}
+
void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
- kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
- kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
+ __kvm_mmu_unload(vcpu, &vcpu->arch.root_mmu);
+ __kvm_mmu_unload(vcpu, &vcpu->arch.guest_mmu);
}
static bool need_remote_flush(u64 old, u64 new)
--
2.31.1
Right now, PGD caching requires a complicated dance of first computing
the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
kvm_init_mmu.
Part of this is due to kvm_mmu_free_roots using mmu->root_level and
mmu->shadow_root_level to distinguish whether the page table uses a single
root or 4 PAE roots. Because kvm_init_mmu can overwrite mmu->root_level,
kvm_mmu_free_roots must be called before kvm_init_mmu.
However, even after kvm_init_mmu there is a way to detect whether the page table
has a single root or four, because the pae_root does not have an associated
struct kvm_mmu_page.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c3f597ea00d..95d0fa0bb876 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
struct kvm *kvm = vcpu->kvm;
int i;
LIST_HEAD(invalid_list);
- bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
+ bool free_active_root;
BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
/* Before acquiring the MMU lock, see if we need to do any real work. */
- if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
+ free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
+ && VALID_PAGE(mmu->root.hpa);
+
+ if (!free_active_root) {
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
VALID_PAGE(mmu->prev_roots[i].hpa))
@@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
&invalid_list);
if (free_active_root) {
- if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
- (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
+ if (to_shadow_page(mmu->root.hpa)) {
mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
} else if (mmu->pae_root) {
for (i = 0; i < 4; ++i) {
--
2.31.1
Since the guest PGD is now loaded after the MMU has been set up
completely, the desired role for a cache hit is simply the current
mmu_role. There is no need to compute it again, so __kvm_mmu_new_pgd
can be folded into kvm_mmu_new_pgd.
For the !tdp_enabled case, it would also have been possible to use
the role that is already in vcpu->arch.mmu.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 29 ++++-------------------------
1 file changed, 4 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df9e0a43513c..38b40ddcaad7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -190,8 +190,6 @@ struct kmem_cache *mmu_page_header_cache;
static struct percpu_counter kvm_total_used_mmu_pages;
static void mmu_spte_set(u64 *sptep, u64 spte);
-static union kvm_mmu_page_role
-kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
struct kvm_mmu_role_regs {
const unsigned long cr0;
@@ -4172,9 +4170,9 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
return cached_root_find_and_replace(vcpu, new_pgd, new_role);
}
-static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
- union kvm_mmu_page_role new_role)
+void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
{
+ union kvm_mmu_page_role new_role = vcpu->arch.mmu->mmu_role.base;
if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
/* kvm_mmu_ensure_valid_pgd will set up a new root. */
return;
@@ -4209,11 +4207,6 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
__clear_sp_write_flooding_count(
to_shadow_page(vcpu->arch.mmu->root.hpa));
}
-
-void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
-{
- __kvm_mmu_new_pgd(vcpu, new_pgd, kvm_mmu_calc_root_page_role(vcpu));
-}
EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
static unsigned long get_cr3(struct kvm_vcpu *vcpu)
@@ -4883,7 +4876,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
new_role = kvm_calc_shadow_npt_root_page_role(vcpu, ®s);
shadow_mmu_init_context(vcpu, context, ®s, new_role);
- __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
+ kvm_mmu_new_pgd(vcpu, nested_cr3);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
@@ -4939,7 +4932,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
reset_ept_shadow_zero_bits_mask(context, execonly);
}
- __kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
+ kvm_mmu_new_pgd(vcpu, new_eptp);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
@@ -5024,20 +5017,6 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_init_mmu);
-static union kvm_mmu_page_role
-kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
- union kvm_mmu_role role;
-
- if (tdp_enabled)
- role = kvm_calc_tdp_mmu_root_page_role(vcpu, ®s, true);
- else
- role = kvm_calc_shadow_mmu_root_page_role(vcpu, ®s, true);
-
- return role.base;
-}
-
void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
/*
--
2.31.1
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.
The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
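For reference, struct kvm_mmu_root_info is just the (pgd, hpa) pair that
the two fields are being unified into; roughly (quoted from memory, see
kvm_host.h for the actual definition):

	struct kvm_mmu_root_info {
		gpa_t pgd;
		hpa_t hpa;
	};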
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/kvm/mmu.h | 4 +-
arch/x86/kvm/mmu/mmu.c | 69 +++++++++++++++------------------
arch/x86/kvm/mmu/mmu_audit.c | 4 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
10 files changed, 42 insertions(+), 50 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a0d2925b6651..6da9a460e584 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -432,8 +432,7 @@ struct kvm_mmu {
int (*sync_page)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp);
void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
- hpa_t root_hpa;
- gpa_t root_pgd;
+ struct kvm_mmu_root_info root;
union kvm_mmu_role mmu_role;
u8 root_level;
u8 shadow_root_level;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c9f1c2162ade..f896c438c8ee 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -106,7 +106,7 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
static inline int kvm_mmu_ensure_valid_root(struct kvm_vcpu *vcpu)
{
- if (likely(vcpu->arch.mmu->root_hpa != INVALID_PAGE))
+ if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
return 0;
return kvm_mmu_load(vcpu);
@@ -128,7 +128,7 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
{
- u64 root_hpa = vcpu->arch.mmu->root_hpa;
+ u64 root_hpa = vcpu->arch.mmu->root.hpa;
if (!VALID_PAGE(root_hpa))
return;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d0f2077bd798..3c3f597ea00d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2141,7 +2141,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
* prev_root is currently only used for 64-bit hosts. So only
* the active root_hpa is valid here.
*/
- BUG_ON(root != vcpu->arch.mmu->root_hpa);
+ BUG_ON(root != vcpu->arch.mmu->root.hpa);
iterator->shadow_addr
= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
@@ -2155,7 +2155,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, u64 addr)
{
- shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root_hpa,
+ shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
addr);
}
@@ -3224,7 +3224,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
/* Before acquiring the MMU lock, see if we need to do any real work. */
- if (!(free_active_root && VALID_PAGE(mmu->root_hpa))) {
+ if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
VALID_PAGE(mmu->prev_roots[i].hpa))
@@ -3244,7 +3244,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
if (free_active_root) {
if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
(mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
- mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
+ mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
} else if (mmu->pae_root) {
for (i = 0; i < 4; ++i) {
if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
@@ -3255,8 +3255,8 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
mmu->pae_root[i] = INVALID_PAE_ROOT;
}
}
- mmu->root_hpa = INVALID_PAGE;
- mmu->root_pgd = 0;
+ mmu->root.hpa = INVALID_PAGE;
+ mmu->root.pgd = 0;
}
kvm_mmu_commit_zap_page(kvm, &invalid_list);
@@ -3329,10 +3329,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
if (is_tdp_mmu_enabled(vcpu->kvm)) {
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
- mmu->root_hpa = root;
+ mmu->root.hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
- mmu->root_hpa = root;
+ mmu->root.hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
if (WARN_ON_ONCE(!mmu->pae_root)) {
r = -EIO;
@@ -3347,15 +3347,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_mask;
}
- mmu->root_hpa = __pa(mmu->pae_root);
+ mmu->root.hpa = __pa(mmu->pae_root);
} else {
WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
r = -EIO;
goto out_unlock;
}
- /* root_pgd is ignored for direct MMUs. */
- mmu->root_pgd = 0;
+ /* root.pgd is ignored for direct MMUs. */
+ mmu->root.pgd = 0;
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
return r;
@@ -3468,7 +3468,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
if (mmu->root_level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, root_gfn, 0,
mmu->shadow_root_level, false);
- mmu->root_hpa = root;
+ mmu->root.hpa = root;
goto set_root_pgd;
}
@@ -3518,14 +3518,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
- mmu->root_hpa = __pa(mmu->pml5_root);
+ mmu->root.hpa = __pa(mmu->pml5_root);
else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
- mmu->root_hpa = __pa(mmu->pml4_root);
+ mmu->root.hpa = __pa(mmu->pml4_root);
else
- mmu->root_hpa = __pa(mmu->pae_root);
+ mmu->root.hpa = __pa(mmu->pae_root);
set_root_pgd:
- mmu->root_pgd = root_pgd;
+ mmu->root.pgd = root_pgd;
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
@@ -3638,13 +3638,13 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
if (vcpu->arch.mmu->direct_map)
return;
- if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+ if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
return;
vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
- hpa_t root = vcpu->arch.mmu->root_hpa;
+ hpa_t root = vcpu->arch.mmu->root.hpa;
sp = to_shadow_page(root);
if (!is_unsync_root(root))
@@ -3935,7 +3935,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault, int mmu_seq)
{
- struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
+ struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
/* Special roots, e.g. pae_root, are not backed by shadow pages. */
if (sp && is_obsolete_sp(vcpu->kvm, sp))
@@ -4092,34 +4092,27 @@ static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
/*
* Find out if a previously cached root matching the new pgd/role is available.
* The current root is also inserted into the cache.
- * If a matching root was found, it is assigned to kvm_mmu->root_hpa and true is
+ * If a matching root was found, it is assigned to kvm_mmu->root.hpa and true is
* returned.
- * Otherwise, the LRU root from the cache is assigned to kvm_mmu->root_hpa and
+ * Otherwise, the LRU root from the cache is assigned to kvm_mmu->root.hpa and
* false is returned. This root should now be freed by the caller.
*/
static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
union kvm_mmu_page_role new_role)
{
uint i;
- struct kvm_mmu_root_info root;
struct kvm_mmu *mmu = vcpu->arch.mmu;
- root.pgd = mmu->root_pgd;
- root.hpa = mmu->root_hpa;
-
- if (is_root_usable(&root, new_pgd, new_role))
+ if (is_root_usable(&mmu->root, new_pgd, new_role))
return true;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
- swap(root, mmu->prev_roots[i]);
+ swap(mmu->root, mmu->prev_roots[i]);
- if (is_root_usable(&root, new_pgd, new_role))
+ if (is_root_usable(&mmu->root, new_pgd, new_role))
break;
}
- mmu->root_hpa = root.hpa;
- mmu->root_pgd = root.pgd;
-
return i < KVM_MMU_NUM_PREV_ROOTS;
}
@@ -4175,7 +4168,7 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
*/
if (!new_role.direct)
__clear_sp_write_flooding_count(
- to_shadow_page(vcpu->arch.mmu->root_hpa));
+ to_shadow_page(vcpu->arch.mmu->root.hpa));
}
void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
@@ -5071,7 +5064,7 @@ static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
{
int i;
kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(mmu->root_hpa));
+ WARN_ON(VALID_PAGE(mmu->root.hpa));
if (mmu->pae_root) {
for (i = 0; i < 4; ++i)
WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
@@ -5266,7 +5259,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
int r, emulation_type = EMULTYPE_PF;
bool direct = vcpu->arch.mmu->direct_map;
- if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+ if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;
r = RET_PF_INVALID;
@@ -5338,7 +5331,7 @@ void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return;
if (root_hpa == INVALID_PAGE) {
- mmu->invlpg(vcpu, gva, mmu->root_hpa);
+ mmu->invlpg(vcpu, gva, mmu->root.hpa);
/*
* INVLPG is required to invalidate any global mappings for the VA,
@@ -5374,7 +5367,7 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
uint i;
if (pcid == kvm_get_active_pcid(vcpu)) {
- mmu->invlpg(vcpu, gva, mmu->root_hpa);
+ mmu->invlpg(vcpu, gva, mmu->root.hpa);
tlb_flush = true;
}
@@ -5487,8 +5480,8 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
struct page *page;
int i;
- mmu->root_hpa = INVALID_PAGE;
- mmu->root_pgd = 0;
+ mmu->root.hpa = INVALID_PAGE;
+ mmu->root.pgd = 0;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index f31fdb874f1f..3e5d62a25350 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -56,11 +56,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
int i;
struct kvm_mmu_page *sp;
- if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+ if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
return;
if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
- hpa_t root = vcpu->arch.mmu->root_hpa;
+ hpa_t root = vcpu->arch.mmu->root.hpa;
sp = to_shadow_page(root);
__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->root_level);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..346f3bad3cb9 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -668,7 +668,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (FNAME(gpte_changed)(vcpu, gw, top_level))
goto out_gpte_changed;
- if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+ if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
goto out_gpte_changed;
for (shadow_walk_init(&it, vcpu, fault->addr);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8def8f810cb0..debf08212f12 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -657,7 +657,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
else
#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, to_shadow_page(_mmu->root_hpa), _start, _end)
+ for_each_tdp_pte(_iter, to_shadow_page(_mmu->root.hpa), _start, _end)
/*
* Yield if the MMU lock is contended or this thread needs to return control
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3f987785702a..57c73d8f76ce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -95,7 +95,7 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
{
struct kvm_mmu_page *sp;
- hpa_t hpa = mmu->root_hpa;
+ hpa_t hpa = mmu->root.hpa;
if (WARN_ON(!VALID_PAGE(hpa)))
return false;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c73e4d938ddc..29289ecca223 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5466,7 +5466,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
roots_to_free = 0;
- if (nested_ept_root_matches(mmu->root_hpa, mmu->root_pgd,
+ if (nested_ept_root_matches(mmu->root.hpa, mmu->root.pgd,
operand.eptp))
roots_to_free |= KVM_MMU_ROOT_CURRENT;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 70e7f00362bc..5542a2b536e0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2957,7 +2957,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- u64 root_hpa = mmu->root_hpa;
+ u64 root_hpa = mmu->root.hpa;
/* No flush required if the current context is invalid. */
if (!VALID_PAGE(root_hpa))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2685fb62807e..0d3646535cc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -762,7 +762,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
if ((fault->error_code & PFERR_PRESENT_MASK) &&
!(fault->error_code & PFERR_RSVD_MASK))
kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
- fault_mmu->root_hpa);
+ fault_mmu->root.hpa);
fault_mmu->inject_page_fault(vcpu, fault);
return fault->nested_page_fault;
--
2.31.1
If kvm_mmu_free_roots encounters a PAE page table where a 64-bit page
table is expected, the result is a NULL pointer dereference. Instead
just WARN and exit.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7b5765ced928..d0f2077bd798 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3201,6 +3201,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
return;
sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
+ if (WARN_ON(!sp))
+ return;
if (is_tdp_mmu_page(sp))
kvm_tdp_mmu_put_root(kvm, sp, false);
--
2.31.1
While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore
EFER.NX is the only bit that can affect the MMU role. However, set_efer accepts
a host-initiated change to EFER.LME even with CR0.PG=1. In that case, the
MMU has to be reset.
Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/x86.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 51faa2c76ca5..a5a50cfeffff 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -48,6 +48,7 @@
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
#define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
+#define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
static __always_inline u64 rsvd_bits(int s, int e)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a9006226501..5e1298aef9e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1647,7 +1647,7 @@ static int set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
/* Update reserved bits */
- if ((efer ^ old_efer) & EFER_NX)
+ if ((efer ^ old_efer) & KVM_MMU_EFER_ROLE_BITS)
kvm_mmu_reset_context(vcpu);
return 0;
--
2.31.1
Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
shadow_root_level anymore, pull the PGD load after the initialization of
the shadow MMUs.
Besides being more intuitive, this enables future simplifications
and optimizations because it's not necessary anymore to compute the
role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not
attempting to use a cached PGD to avoid having to figure out the new role.
It will soon be able to follow what nested_{vmx,svm}_load_cr3 are doing,
and avoid unloading all the cached roots.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 37 +++++++++++++++++--------------------
arch/x86/kvm/svm/nested.c | 6 +++---
arch/x86/kvm/vmx/nested.c | 6 +++---
3 files changed, 23 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f61208ccce43..df9e0a43513c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4882,9 +4882,8 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
new_role = kvm_calc_shadow_npt_root_page_role(vcpu, ®s);
- __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
-
shadow_mmu_init_context(vcpu, context, ®s, new_role);
+ __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
@@ -4922,27 +4921,25 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
execonly, level);
- __kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
-
- if (new_role.as_u64 == context->mmu_role.as_u64)
- return;
-
- context->mmu_role.as_u64 = new_role.as_u64;
+ if (new_role.as_u64 != context->mmu_role.as_u64) {
+ context->mmu_role.as_u64 = new_role.as_u64;
- context->shadow_root_level = level;
+ context->shadow_root_level = level;
- context->ept_ad = accessed_dirty;
- context->page_fault = ept_page_fault;
- context->gva_to_gpa = ept_gva_to_gpa;
- context->sync_page = ept_sync_page;
- context->invlpg = ept_invlpg;
- context->root_level = level;
- context->direct_map = false;
+ context->ept_ad = accessed_dirty;
+ context->page_fault = ept_page_fault;
+ context->gva_to_gpa = ept_gva_to_gpa;
+ context->sync_page = ept_sync_page;
+ context->invlpg = ept_invlpg;
+ context->root_level = level;
+ context->direct_map = false;
+ update_permission_bitmask(context, true);
+ context->pkru_mask = 0;
+ reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
+ reset_ept_shadow_zero_bits_mask(context, execonly);
+ }
- update_permission_bitmask(context, true);
- context->pkru_mask = 0;
- reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
- reset_ept_shadow_zero_bits_mask(context, execonly);
+ __kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f284e61451c8..96bab464967f 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -492,14 +492,14 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
CC(!load_pdptrs(vcpu, cr3)))
return -EINVAL;
- if (!nested_npt)
- kvm_mmu_new_pgd(vcpu, cr3);
-
vcpu->arch.cr3 = cr3;
/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
kvm_init_mmu(vcpu);
+ if (!nested_npt)
+ kvm_mmu_new_pgd(vcpu, cr3);
+
return 0;
}
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 29289ecca223..abfcd71f787f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1126,15 +1126,15 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
return -EINVAL;
}
- if (!nested_ept)
- kvm_mmu_new_pgd(vcpu, cr3);
-
vcpu->arch.cr3 = cr3;
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
kvm_init_mmu(vcpu);
+ if (!nested_ept)
+ kvm_mmu_new_pgd(vcpu, cr3);
+
return 0;
}
--
2.31.1
kvm_mmu_reset_context is called on all role changes and right now it
calls kvm_mmu_unload. With the legacy MMU this is a relatively cheap
operation; the previous PGDs remain in the hash table and are picked
up immediately on the next page fault. With the TDP MMU, however, the
roots are thrown away for good and a full rebuild of the page tables is
necessary, which is many times more expensive.
Fortunately, throwing away the roots is not necessary except when
the manual says a TLB flush is required:
- changing CR0.PG from 1 to 0 (because it flushes the TLB according to
the x86 architecture specification)
- changing CPUID (which changes the interpretation of page tables in
ways not reflected by the role).
- changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
Except for these cases, once the MMU has updated the CPU/MMU roles
and metadata it is enough to force-reload the current value of CR3.
KVM will look up the cached roots for an entry with the right role and
PGD, and only if the cache misses a new root will be created.
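Conceptually, the lookup that replaces the unconditional unload is just
a role+PGD match against the current and previous roots; in pseudocode
(not the exact helpers in mmu.c):

	/* A cached root is reusable only if both its PGD and role match. */
	usable = VALID_PAGE(root->hpa) &&
		 root->pgd == new_pgd &&
		 to_shadow_page(root->hpa)->role.word == new_role.word;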
Measuring with vmexit.flat from kvm-unit-tests shows the following
improvement:
            TDP      legacy   shadow
  before    46754    5096     5150
  after     4879     4875     5006
which is for very small page tables. The impact is, however, much larger
when running as an L1 hypervisor, because the new page tables cause
extra work for L0 to shadow them.
Reported-by: Brad Spengler <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 7 ++++---
arch/x86/kvm/x86.c | 27 ++++++++++++++++++---------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 38b40ddcaad7..dbd4e98ba426 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
/*
- * Invalidate all MMU roles to force them to reinitialize as CPUID
- * information is factored into reserved bit calculations.
+ * Invalidate all MMU roles and roots to force them to reinitialize,
+ * as CPUID information is factored into reserved bit calculations.
*
* Correctly handling multiple vCPU models with respect to paging and
* physical address properties) in a single VM would require tracking
@@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
+ kvm_mmu_unload(vcpu);
kvm_mmu_reset_context(vcpu);
/*
@@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
{
- kvm_mmu_unload(vcpu);
kvm_init_mmu(vcpu);
+ kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
}
EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d3646535cc5..97c4f5fc291f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
kvm_async_pf_hash_reset(vcpu);
}
- if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
+ if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
+ /* Flush the TLB if CR0 is changed 1 -> 0. */
+ if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
+ kvm_mmu_unload(vcpu);
kvm_mmu_reset_context(vcpu);
+ }
if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1067,15 +1071,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
* free them all. KVM_REQ_MMU_RELOAD is fit for the both cases; it
* is slow, but changing CR4.PCIDE is a rare case.
*
- * If CR4.PGE is changed, the guest TLB must be flushed.
+ * Setting SMEP also needs to flush the TLB, in addition to resetting
+ * the MMU.
*
- * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
- * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
- * the usage of "else if".
+ * If CR4.PGE is changed, the guest TLB must be flushed. Because
+ * the shadow MMU ignores global pages, this bit is not part of
+ * KVM_MMU_CR4_ROLE_BITS.
*/
- if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+ if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
kvm_mmu_reset_context(vcpu);
- else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+ if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
+ kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+ } else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
@@ -11329,8 +11336,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* paging related bits are ignored if paging is disabled, i.e. CR0.WP,
* CR4, and EFER changes are all irrelevant if CR0.PG was '0'.
*/
- if (old_cr0 & X86_CR0_PG)
- kvm_mmu_reset_context(vcpu);
+ if (old_cr0 & X86_CR0_PG) {
+ kvm_mmu_unload(vcpu);
+ kvm_init_mmu(vcpu);
+ }
/*
* Intel's SDM states that all TLB entries are flushed on INIT. AMD's
--
2.31.1
On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> > On 2/11/22 01:54, Sean Christopherson wrote:
> > > > > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > > &invalid_list);
> > > > > if (free_active_root) {
> > > > > - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > > > > - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > > > > + if (to_shadow_page(mmu->root.hpa)) {
> > > > > mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > > > > } else if (mmu->pae_root) {
> > >
> > > Gah, this is technically wrong. It shouldn't truly matter, but it's wrong. root.hpa
> > > will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> > > case freeing the PAE root is wrong. They should obviously be invalid already, but
> > > it's a little confusing because KVM wanders down a path that may not be relevant
> > > to the current mode.
> >
> > pml4_root and pml5_root are dummy, and the first "real" level of page tables
> > is stored in pae_root for that case too, so I think that should DTRT.
>
> Ugh, completely forgot that detail. You're correct. Probably worth a comment?
Actually, can't this be
	if (to_shadow_page(mmu->root.hpa)) {
		...
	} else if (!WARN_ON(!mmu->pae_root)) {
		...
	}
now that it's wrapped with VALID_PAGE(root.hpa)?
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> If kvm_mmu_free_roots encounters a PAE page table where a 64-bit page
> table is expected, the result is a NULL pointer dereference. Instead
> just WARN and exit.
This confused the heck out of me, because we obviously free PAE page tables. What
we don't do is back the root that gets shoved into CR3 with a shadow page. It'd
be especially confusing without the context that this WARN was helpful during
related development, as it's not super obvious why mmu_free_root_page() is a special
snowflake and deserves a WARN.
Something like this?
WARN and bail if KVM attempts to free a root that isn't backed by a shadow
page. KVM allocates a bare page for "special" roots, e.g. when using PAE
paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa
will be valid but won't be backed by a shadow page. It's all too easy to
blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of
crashing KVM and possibly the kernel.
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7b5765ced928..d0f2077bd798 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3201,6 +3201,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
> return;
>
> sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> + if (WARN_ON(!sp))
Should this be KVM_BUG_ON()? I.e. when you triggered these, would continuing on
potentially corrupt guest data, or was it truly benign-ish?
> + return;
>
> if (is_tdp_mmu_page(sp))
> kvm_tdp_mmu_put_root(kvm, sp, false);
> --
> 2.31.1
>
>
On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Fri, Feb 11, 2022, Sean Christopherson wrote:
> > On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> > > On 2/11/22 01:54, Sean Christopherson wrote:
> > > > > > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > > > &invalid_list);
> > > > > > if (free_active_root) {
> > > > > > - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > > > > > - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > > > > > + if (to_shadow_page(mmu->root.hpa)) {
> > > > > > mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > > > > > } else if (mmu->pae_root) {
> > > >
> > > > Gah, this is technically wrong. It shouldn't truly matter, but it's wrong. root.hpa
> > > > will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> > > > case freeing the PAE root is wrong. They should obviously be invalid already, but
> > > > it's a little confusing because KVM wanders down a path that may not be relevant
> > > > to the current mode.
> > >
> > > pml4_root and pml5_root are dummy, and the first "real" level of page tables
> > > is stored in pae_root for that case too, so I think that should DTRT.
> >
> > Ugh, completely forgot that detail. You're correct.
Mostly correct. The first "real" level will be PML4 in the hCR4.LA57=1, gCR4.LA57=0
nested NPT case. Ditto for shadowing PAE NPT with 4/5-level NPT, though in that
case KVM still allocates pae_root entries, it just happens to be a "real" level.
And now I realize why I'm so confused, mmu_alloc_shadow_roots() is also broken
with respect to 5-level shadowing 4-level. I believe the part that got fixed
was 5-level with a 32-bit guest. Ugh.
For the stuff that actually works in KVM, this will do just fine. 5-level nNPT
can be punted to the future.
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e0c0f0bc2e8b..7b5765ced928 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5065,12 +5065,21 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> return r;
> }
>
> +static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> +{
> + int i;
> + kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
> + WARN_ON(VALID_PAGE(mmu->root_hpa));
> + if (mmu->pae_root) {
> + for (i = 0; i < 4; ++i)
> + WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
> + }
I'm somewhat ambivalent, but if you're at all on the fence, I vote to drop this
one. I've always viewed the WARN on root_hpa as gratuitous.
But, if it helped during development, then why not...
On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> On 2/11/22 01:54, Sean Christopherson wrote:
> > > free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
> > > VALID_PAGE(mmu->root.hpa);
> > >
> > > Isn't this a separate bug fix? E.g. call kvm_mmu_unload() without a valid current
> > > root, but with valid previous roots? In which case we'd try to free garbage, no?
>
> mmu_free_root_page checks VALID_PAGE(*root_hpa). If that's what you meant,
> then it wouldn't be a preexisting bug (and I think it'd be a fairly common
> case).
Ahh, yep.
> > > > +
> > > > + if (!free_active_root) {
> > > > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > > > if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> > > > VALID_PAGE(mmu->prev_roots[i].hpa))
> > > > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > &invalid_list);
> > > > if (free_active_root) {
> > > > - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > > > - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > > > + if (to_shadow_page(mmu->root.hpa)) {
> > > > mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > > > } else if (mmu->pae_root) {
> >
> > Gah, this is technically wrong. It shouldn't truly matter, but it's wrong. root.hpa
> > will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> > case freeing the PAE root is wrong. They should obviously be invalid already, but
> > it's a little confusing because KVM wanders down a path that may not be relevant
> > to the current mode.
>
> pml4_root and pml5_root are dummy, and the first "real" level of page tables
> is stored in pae_root for that case too, so I think that should DTRT.
Ugh, completely forgot that detail. You're correct. Probably worth a comment?
> That's why I also disliked the shadow_root_level/root_level/direct check:
> even though there's half a dozen of cases involved, they all boil down to
> either 4 pae_roots or a single root with a backing kvm_mmu_page.
>
> It's even more obscure to check shadow_root_level/root_level/direct in
> fast_pgd_switch, where it's pretty obvious that you cannot cache 4 pae_roots
> in a single (hpa, pgd) pair...
Heh, apparently not obvious enough for me :-)
On 2/9/2022 10:30 PM, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
> kvm_async_pf_hash_reset(vcpu);
> }
>
> - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> + /* Flush the TLB if CR0 is changed 1 -> 0. */
^^ CR0.PG here ?
> + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> + kvm_mmu_unload(vcpu);
> kvm_mmu_reset_context(vcpu);
> + }
Regards
Nikunj
On 2/11/22 01:54, Sean Christopherson wrote:
> On Fri, Feb 11, 2022, Sean Christopherson wrote:
>> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>>> Right now, PGD caching requires a complicated dance of first computing
>>> the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
>>
>> Nit, adding () after function names helps readers easily recognize when you're
>> taking about a specific function, e.g. as opposed to a concept or whatever.
>>
>>> kvm_init_mmu.
>>>
>>> Part of this is due to kvm_mmu_free_roots using mmu->root_level and
>>> mmu->shadow_root_level to distinguish whether the page table uses a single
>>> root or 4 PAE roots. Because kvm_init_mmu can overwrite mmu->root_level,
>>> kvm_mmu_free_roots must be called before kvm_init_mmu.
>>>
>>> However, even after kvm_init_mmu there is a way to detect whether the page table
>>> has a single root or four, because the pae_root does not have an associated
>>> struct kvm_mmu_page.
>>
>> Suggest a reword on the final paragraph, because there's a discrepancy with the
>> code (which handles 0, 1, or 4 "roots", versus just "single or four").
>>
>> However, even after kvm_init_mmu() there is a way to detect whether the
>> page table may hold PAE roots, as root.hpa isn't backed by a shadow when
>> it points at PAE roots.
>>
>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 10 ++++++----
>>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 3c3f597ea00d..95d0fa0bb876 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>> struct kvm *kvm = vcpu->kvm;
>>> int i;
>>> LIST_HEAD(invalid_list);
>>> - bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
>>> + bool free_active_root;
>>>
>>> BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
>>>
>>> /* Before acquiring the MMU lock, see if we need to do any real work. */
>>> - if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
>>> + free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
>>> + && VALID_PAGE(mmu->root.hpa);
>>
>> free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
>> VALID_PAGE(mmu->root.hpa);
>>
>> Isn't this a separate bug fix? E.g. call kvm_mmu_unload() without a valid current
>> root, but with valid previous roots? In which case we'd try to free garbage, no?
mmu_free_root_page checks VALID_PAGE(*root_hpa). If that's what you
meant, then it wouldn't be a preexisting bug (and I think it'd be a
fairly common case).
>>> +
>>> + if (!free_active_root) {
>>> for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>>> if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
>>> VALID_PAGE(mmu->prev_roots[i].hpa))
>>> @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>> &invalid_list);
>>>
>>> if (free_active_root) {
>>> - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
>>> - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
>>> + if (to_shadow_page(mmu->root.hpa)) {
>>> mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
>>> } else if (mmu->pae_root) {
>
> Gah, this is technically wrong. It shouldn't truly matter, but it's wrong. root.hpa
> will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> case freeing the PAE root is wrong. They should obviously be invalid already, but
> it's a little confusing because KVM wanders down a path that may not be relevant
> to the current mode.
pml4_root and pml5_root are dummy, and the first "real" level of page
tables is stored in pae_root for that case too, so I think that should DTRT.
That's why I also disliked the shadow_root_level/root_level/direct
check: even though there's half a dozen of cases involved, they all boil
down to either 4 pae_roots or a single root with a backing kvm_mmu_page.
It's even more obscure to check shadow_root_level/root_level/direct in
fast_pgd_switch, where it's pretty obvious that you cannot cache 4
pae_roots in a single (hpa, pgd) pair...
Paolo
On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > Right now, PGD caching requires a complicated dance of first computing
> > the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
>
> Nit, adding () after function names helps readers easily recognize when you're
> taking about a specific function, e.g. as opposed to a concept or whatever.
>
> > kvm_init_mmu.
> >
> > Part of this is due to kvm_mmu_free_roots using mmu->root_level and
> > mmu->shadow_root_level to distinguish whether the page table uses a single
> > root or 4 PAE roots. Because kvm_init_mmu can overwrite mmu->root_level,
> > kvm_mmu_free_roots must be called before kvm_init_mmu.
> >
> > However, even after kvm_init_mmu there is a way to detect whether the page table
> > has a single root or four, because the pae_root does not have an associated
> > struct kvm_mmu_page.
>
> Suggest a reword on the final paragraph, because there's a discrepancy with the
> code (which handles 0, 1, or 4 "roots", versus just "single or four").
>
> However, even after kvm_init_mmu() there is a way to detect whether the
> page table may hold PAE roots, as root.hpa isn't backed by a shadow when
> it points at PAE roots.
>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 10 ++++++----
> > 1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c3f597ea00d..95d0fa0bb876 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > struct kvm *kvm = vcpu->kvm;
> > int i;
> > LIST_HEAD(invalid_list);
> > - bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
> > + bool free_active_root;
> >
> > BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
> >
> > /* Before acquiring the MMU lock, see if we need to do any real work. */
> > - if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
> > + free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
> > + && VALID_PAGE(mmu->root.hpa);
>
> free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
> VALID_PAGE(mmu->root.hpa);
>
> Isn't this a separate bug fix? E.g. call kvm_mmu_unload() without a valid current
> root, but with valid previous roots? In which case we'd try to free garbage, no?
>
> > +
> > + if (!free_active_root) {
> > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> > VALID_PAGE(mmu->prev_roots[i].hpa))
> > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > &invalid_list);
> >
> > if (free_active_root) {
> > - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > + if (to_shadow_page(mmu->root.hpa)) {
> > mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > } else if (mmu->pae_root) {
Gah, this is technically wrong. It shouldn't truly matter, but it's wrong. root.hpa
will not be backed by shadow page if the root is pml4_root or pml5_root, in which
case freeing the PAE root is wrong. They should obviously be invalid already, but
it's a little confusing because KVM wanders down a path that may not be relevant
to the current mode.
For clarity, I think it's worth doing:
} else if (mmu->root.hpa == __pa(mmu->pae_root)) {
> > for (i = 0; i < 4; ++i) {
> > --
> > 2.31.1
> >
> >
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore
> EFER.NX is the only bit that can affect the MMU role. However, set_efer accepts
> a host-initiated change to EFER.LME even with CR0.PG=1. In that case, the
> MMU has to be reset.
>
> Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
> Cc: [email protected]
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
Ugh, but KVM_SET_SREGS handles this... It's basically KVM's equivalent of VMX putting
EFER in the VMCS, but then also allowing EFER in the load/store lists.
Reviewed-by: Sean Christopherson <[email protected]>
On 2/11/22 01:24, Sean Christopherson wrote:
>>
>> sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
>> + if (WARN_ON(!sp))
>
> Should this be KVM_BUG_ON()? I.e. when you triggered these, would continuing on
> potentially corrupt guest data, or was it truly benign-ish?
It only triggered on the mode_switch SVM unit test (with npt=0), i.e. in
a very small test which just hung after the bug. The WARN, however, was
the 10-minute difference between rmmod and reboot...
I didn't use KVM_BUG_ON because we're in a pretty deep call stack
(kvm_mmu_new_pgd, itself called from nested vmentry) and all sort of
stuff will happen before bailing out. My mental model is to use
KVM_BUG_ON in situations for which error propagation is possible and clean.
Paolo
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
> shadow_root_level anymore, pull the PGD load after the initialization of
> the shadow MMUs.
>
> Besides being more intuitive, this enables future simplifications
> and optimizations because it's not necessary anymore to compute the
> role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not
> attempting to use a cached PGD to avoid having to figure out the new role.
> It will soon be able to follow what nested_{vmx,svm}_load_cr3 are doing,
> and avoid unloading all the cached roots.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
If you add a sanity check as described in the other thread[*],
Reviewed-by: Sean Christopherson <[email protected]>
[*] https://lore.kernel.org/all/[email protected]
On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> On 2/11/22 01:27, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > The name of kvm_mmu_reload is very confusing for two reasons:
> > > first, KVM_REQ_MMU_RELOAD actually does not call it; second,
> > > it only does anything if there is no valid root.
> > >
> > > Rename it to kvm_mmu_ensure_valid_root, which matches the actual
> > > behavior better.
> >
> > 100% agree that kvm_mmu_reload() is a terrible name, but kvm_mmu_ensure_valid_root()
> > isn't much better, e.g. it sounds like a sanity check and nothing more.
>
> I would have thought that would be more of a check_valid_root(). There are
> other functions in the kernel following the idea that "ensure" means
> idempotency: skb_ensure_writable, perf_cgroup_ensure_storage,
> btf_ensure_modifiable and libbpf_ensure_mem in libbpf. I'm not a native
> speaker but, at least in computing, "ensure" seems to mean not just "to make
> certain that (something) will be true", but also taking steps if that's not
> already the case.
There's no ambiguity on the "make certain that <x> will be true", it's the second
part about taking steps that's ambiguous. Specifically, it doesn't convey any
information about _what_ steps will be taken, e.g. the below implementation is
also a possibility since it ensures the root is valid by preventing forward
progress if the root is invalid.
static inline int kvm_mmu_ensure_valid_root(struct kvm_vcpu *vcpu)
{
	if (unlikely(vcpu->arch.mmu->root.hpa == INVALID_PAGE))
		return -EFAULT;
	return 0;
}
Existing example of that interpretation are input_dev_ensure_poller() and
rtnl_ensure_unique_netns().
The other nuance that I want to avoid is the implication that KVM is checking for
a valid root because it doesn't trust what has happened before, i.e. that the call
is there as a safeguard. That's misleading for the most common path, vcpu_enter_guest(),
because when the helper does do some work, it's usually because KVM deliberately
invalidated the root.
> I also thought of "establish_valid_root", but it has the opposite
> problem---it does not convey well, if at all, that the root could be valid
> already.
Heh, I agree that "establish" would imply the root is always invalid, but amusingly
"establish" is listed as a synonym for "ensure" on the few sites of checked. Yay English.
I was going to suggest we just open code it in vcpu_enter_guest, but async #PF
uses it too :-/
Can we put this on the backburner for now? IMO, KVM_REQ_MMU_RELOAD is far more
misleading than kvm_mmu_reload(), and I posted a series to remedy that (though I
need to check if it's still viable since you vetoed adding the check for a pending
request in the page fault handler).
https://lore.kernel.org/all/[email protected]
IMO, the shortlog is too literal and doesn't help understand the implications of
the change. I prefer something like:
KVM: x86/mmu: Always use current mmu's role when loading new PGD
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Since the guest PGD is now loaded after the MMU has been set up
> completely, the desired role for a cache hit is simply the current
> mmu_role. There is no need to compute it again, so __kvm_mmu_new_pgd
> can be folded in kvm_mmu_new_pgd.
>
> For the !tdp_enabled case, it would also have been possible to use
> the role that is already in vcpu->arch.mmu.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
With a different shortlog and newline,
Reviewed-by: Sean Christopherson <[email protected]>
> arch/x86/kvm/mmu/mmu.c | 29 ++++-------------------------
> 1 file changed, 4 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index df9e0a43513c..38b40ddcaad7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -190,8 +190,6 @@ struct kmem_cache *mmu_page_header_cache;
> static struct percpu_counter kvm_total_used_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
> -static union kvm_mmu_page_role
> -kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
>
> struct kvm_mmu_role_regs {
> const unsigned long cr0;
> @@ -4172,9 +4170,9 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> return cached_root_find_and_replace(vcpu, new_pgd, new_role);
> }
>
> -static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> - union kvm_mmu_page_role new_role)
> +void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
> {
> + union kvm_mmu_page_role new_role = vcpu->arch.mmu->mmu_role.base;
Newline needed.
> if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
> /* kvm_mmu_ensure_valid_pgd will set up a new root. */
> return;
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> kvm_mmu_reset_context is called on all role changes and right now it
> calls kvm_mmu_unload. With the legacy MMU this is a relatively cheap
> operation; the previous PGDs remain in the hash table and are picked
> up immediately on the next page fault. With the TDP MMU, however, the
> roots are thrown away for good and a full rebuild of the page tables is
> necessary, which is many times more expensive.
>
> Fortunately, throwing away the roots is not necessary except when
> the manual says a TLB flush is required:
>
> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
> the x86 architecture specification)
>
> - changing CPUID (which changes the interpretation of page tables in
> ways not reflected by the role).
>
> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
>
> Except for these cases, once the MMU has updated the CPU/MMU roles
> and metadata it is enough to force-reload the current value of CR3.
> KVM will look up the cached roots for an entry with the right role and
> PGD, and a new root will be created only if the cache misses.
>
> Measuring with vmexit.flat from kvm-unit-tests shows the following
> improvement:
>
>             TDP   legacy   shadow
> before    46754     5096     5150
> after      4879     4875     5006
>
> which is for very small page tables. The impact is however much larger
> when running as an L1 hypervisor, because the new page tables cause
> extra work for L0 to shadow them.
>
> Reported-by: Brad Spengler <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 7 ++++---
> arch/x86/kvm/x86.c | 27 ++++++++++++++++++---------
> 2 files changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 38b40ddcaad7..dbd4e98ba426 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
> void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> /*
> - * Invalidate all MMU roles to force them to reinitialize as CPUID
> - * information is factored into reserved bit calculations.
> + * Invalidate all MMU roles and roots to force them to reinitialize,
> + * as CPUID information is factored into reserved bit calculations.
> *
> * Correctly handling multiple vCPU models with respect to paging and
> * physical address properties) in a single VM would require tracking
> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
> vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
> vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
> + kvm_mmu_unload(vcpu);
> kvm_mmu_reset_context(vcpu);
>
> /*
> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>
> void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> {
> - kvm_mmu_unload(vcpu);
> kvm_init_mmu(vcpu);
> + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
affected, e.g. SMM transitions, KVM_SET_SREG, etc...
Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
The call to kvm_mmu_new_pgd() is also
To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}(). In
the future we can/should work on avoiding unload in all paths, but again, future
problem.
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
> kvm_async_pf_hash_reset(vcpu);
> }
>
> - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> + /* Flush the TLB if CR0 is changed 1 -> 0. */
> + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> + kvm_mmu_unload(vcpu);
Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
to the comment, or with SMEP handling. And the SMEP handling isn't coherent with
respect to the changelog. Please elaborate :-)
> kvm_mmu_reset_context(vcpu);
> + }
>
> if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1067,15 +1071,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
> * free them all. KVM_REQ_MMU_RELOAD is fit for the both cases; it
> * is slow, but changing CR4.PCIDE is a rare case.
> *
> - * If CR4.PGE is changed, the guest TLB must be flushed.
> + * Setting SMEP also needs to flush the TLB, in addition to resetting
> + * the MMU.
> *
> - * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> - * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> - * the usage of "else if".
> + * If CR4.PGE is changed, the guest TLB must be flushed. Because
> + * the shadow MMU ignores global pages, this bit is not part of
> + * KVM_MMU_CR4_ROLE_BITS.
> */
> - if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> + if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
> kvm_mmu_reset_context(vcpu);
> - else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> + if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
> + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
This mishandles CR4.PGE. Per the comment above, the if-elif-elif sequence relies
on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.
For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
doesn't emulate global pages.
This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.
---
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/x86.c | 42 +++++++++++++++++++++++++++++-------------
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e41834748d52..c477c519c784 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
/*
- * Invalidate all MMU roles to force them to reinitialize as CPUID
- * information is factored into reserved bit calculations.
+ * Invalidate all MMU roles and roots to force them to reinitialize,
+ * as CPUID information is factored into reserved bit calculations.
*
* Correctly handling multiple vCPU models with respect to paging and
* physical address properties) in a single VM would require tracking
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 782dc9cd31d8..b8dad04301ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
}
EXPORT_SYMBOL_GPL(load_pdptrs);
+static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
+{
+ kvm_init_mmu(vcpu);
+ kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
+}
+
void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
{
if ((cr0 ^ old_cr0) & X86_CR0_PG) {
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
+
+ /*
+ * Clearing CR0.PG is architecturally defined as flushing the
+ * TLB from the guest's perspective.
+ */
+ if (!(cr0 & X86_CR0_PG))
+ kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}
if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
- kvm_mmu_reset_context(vcpu);
+ kvm_post_set_cr_reinit_mmu(vcpu);
if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
{
/*
- * If any role bit is changed, the MMU needs to be reset.
- *
* If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
* If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
* according to the SDM; however, stale prev_roots could be reused
* incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
* free them all. KVM_REQ_MMU_RELOAD is fit for the both cases; it
* is slow, but changing CR4.PCIDE is a rare case.
- *
- * If CR4.PGE is changed, the guest TLB must be flushed.
- *
- * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
- * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
- * the usage of "else if".
*/
- if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
- kvm_mmu_reset_context(vcpu);
- else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+ if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
- else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
+ return;
+ }
+
+ /* If any role bit is changed, the MMU needs to be reinitialized. */
+ if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+ kvm_post_set_cr_reinit_mmu(vcpu);
+
+ /*
+ * Setting SMEP or toggling PGE is architecturally defined as flushing
+ * the TLB from the guest's perspective. Note, because the shadow MMU
+ * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
+ */
+ if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
+ ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}
EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
--
On 2/11/22 00:20, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
>> 1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index e0c0f0bc2e8b..7b5765ced928 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5065,12 +5065,21 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>> return r;
>> }
>>
>> +static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>> +{
>> + int i;
>> + kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
>> + WARN_ON(VALID_PAGE(mmu->root_hpa));
>> + if (mmu->pae_root) {
>> + for (i = 0; i < 4; ++i)
>> + WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>> + }
>
> I'm somewhat ambivalent, but if you're at all on the fence, I vote to drop this
> one. I've always viewed the WARN on root_hpa as gratuitous.
>
> But, if it helped during development, then why not...
Well, it was not really helping in the sense that the WARN triggered; rather,
it was ruling out the more blatant violations of invariants. The one in
patch 5 triggered a lot, though.
Paolo
On 2/11/22 01:27, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> The name of kvm_mmu_reload is very confusing for two reasons:
>> first, KVM_REQ_MMU_RELOAD actually does not call it; second,
>> it only does anything if there is no valid root.
>>
>> Rename it to kvm_mmu_ensure_valid_root, which matches the actual
>> behavior better.
>
> 100% agree that kvm_mmu_reload() is a terrible name, but kvm_mmu_ensure_valid_root()
> isn't much better, e.g. it sounds like a sanity check and nothing more.
I would have thought that would be more of a check_valid_root(). There
are other functions in the kernel following the idea that "ensure" means
idempotency: skb_ensure_writable, perf_cgroup_ensure_storage,
btf_ensure_modifiable and libbpf_ensure_mem in libbpf. I'm not a native
speaker but, at least in computing, "ensure" seems to mean not just "to
make certain that (something) will be true", but also taking steps if
that's not already the case.
I also thought of "establish_valid_root", but it has the opposite
problem---it does not convey well, if at all, that the root could be
valid already.
Paolo
>
> Maybe just be very literal?
>
> kvm_mmu_load_if_necessary()
> kvm_mmu_load_if_invalid()
>
> Or follow cond_sched()?
>
> kvm_mmu_cond_load()
>
On 2/11/22 17:16, Sean Christopherson wrote:
> The other nuance that I want to avoid is the implication that KVM is checking for
> a valid root because it doesn't trust what has happened before, i.e. that the call
> is there as a safeguard. That's misleading for the most common path, vcpu_enter_guest(),
> because when the helper does do some work, it's usually because KVM deliberately
> invalidated the root.
Fair enough.
>> I also thought of "establish_valid_root", but it has the opposite
>> problem---it does not convey well, if at all, that the root could be valid
>> already.
>
> Heh, I agree that "establish" would imply the root is always invalid, but amusingly
> "establish" is listed as a synonym for "ensure" on the few sites of checked. Yay English.
Well, synonyms rarely have a perfectly matching meaning.
> Can we put this on the backburner for now?
Yes, of course.
Paolo
> IMO, KVM_REQ_MMU_RELOAD is far more
> misleading than kvm_mmu_reload(), and I posted a series to remedy that (though I
> need to check if it's still viable since you vetoed adding the check for a pending
> request in the page fault handler).
>
> https://lore.kernel.org/all/[email protected]
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
> Use the struct to have more consistency between mmu->root and
> mmu->prev_roots.
>
> The patch is entirely search and replace except for cached_root_available,
> which does not need a temporary struct kvm_mmu_root_info anymore.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
Reviewed-by: Sean Christopherson <[email protected]>
On 2/11/22 18:45, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
>> shadow_root_level anymore, pull the PGD load after the initialization of
>> the shadow MMUs.
>>
>> Besides being more intuitive, this enables future simplifications
>> and optimizations because it's not necessary anymore to compute the
>> role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not
>> attempting to use a cached PGD to avoid having to figure out the new role.
>> It will soon be able to follow what nested_{vmx,svm}_load_cr3 are doing,
>> and avoid unloading all the cached roots.
>>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>
> If you add a sanity check as described in the other thread[*],
>
> Reviewed-by: Sean Christopherson <[email protected]>
It's not as easy as I thought, but it becomes almost trivial with the
CPU/MMU role refactoring. I'll get that posted as soon as I can push a
final-ish version of this one to kvm/queue.
Paolo
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> The name of kvm_mmu_reload is very confusing for two reasons:
> first, KVM_REQ_MMU_RELOAD actually does not call it; second,
> it only does anything if there is no valid root.
>
> Rename it to kvm_mmu_ensure_valid_root, which matches the actual
> behavior better.
100% agree that kvm_mmu_reload() is a terrible name, but kvm_mmu_ensure_valid_root()
isn't much better, e.g. it sounds like a sanity check and nothing more.
Maybe just be very literal?
kvm_mmu_load_if_necessary()
kvm_mmu_load_if_invalid()
Or follow cond_sched()?
kvm_mmu_cond_load()
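For reference, the helper in question is tiny; roughly this, as a sketch
(modulo the root_hpa => root.hpa conversion elsewhere in this series):

/* Only does any work when no root is currently loaded. */
static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
	if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
		return 0;

	return kvm_mmu_load(vcpu);
}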
On 2/11/22 19:48, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> kvm_mmu_reset_context is called on all role changes and right now it
>> calls kvm_mmu_unload. With the legacy MMU this is a relatively cheap
>> operation; the previous PGDs remain in the hash table and are picked
>> up immediately on the next page fault. With the TDP MMU, however, the
>> roots are thrown away for good and a full rebuild of the page tables is
>> necessary, which is many times more expensive.
>>
>> Fortunately, throwing away the roots is not necessary except when
>> the manual says a TLB flush is required:
>>
>> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>> the x86 architecture specification)
>>
>> - changing CPUID (which changes the interpretation of page tables in
>> ways not reflected by the role).
>>
>> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
>>
>> Except for these cases, once the MMU has updated the CPU/MMU roles
>> and metadata it is enough to force-reload the current value of CR3.
>> KVM will look up the cached roots for an entry with the right role and
>> PGD, and a new root will be created only if the cache misses.
>>
>> Measuring with vmexit.flat from kvm-unit-tests shows the following
>> improvement:
>>
>>             TDP   legacy   shadow
>> before    46754     5096     5150
>> after      4879     4875     5006
>>
>> which is for very small page tables. The impact is however much larger
>> when running as an L1 hypervisor, because the new page tables cause
>> extra work for L0 to shadow them.
>>
>> Reported-by: Brad Spengler <[email protected]>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 7 ++++---
>> arch/x86/kvm/x86.c | 27 ++++++++++++++++++---------
>> 2 files changed, 22 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 38b40ddcaad7..dbd4e98ba426 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>> void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> {
>> /*
>> - * Invalidate all MMU roles to force them to reinitialize as CPUID
>> - * information is factored into reserved bit calculations.
>> + * Invalidate all MMU roles and roots to force them to reinitialize,
>> + * as CPUID information is factored into reserved bit calculations.
>> *
>> * Correctly handling multiple vCPU models with respect to paging and
>> * physical address properties) in a single VM would require tracking
>> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>> vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>> vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
>> + kvm_mmu_unload(vcpu);
>> kvm_mmu_reset_context(vcpu);
>>
>> /*
>> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>
>> void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>> {
>> - kvm_mmu_unload(vcpu);
>> kvm_init_mmu(vcpu);
>> + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
>
> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> affected, e.g. SMM transitions, KVM_SET_SREG, etc...
SMM exit does flush the TLB because RSM clears CR0.PG (I did check this
:)). SMM re-entry then does not need to flush. But I don't think SMM
exit should flush the TLB *for non-SMM roles*.
For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree
it is certainly safer to keep it that way.
> Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> The call to kvm_mmu_new_pgd() is also
*white noise*
> To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}(). In
> the future we can/should work on avoiding unload in all paths, but again, future
> problem.
I disagree on this. There aren't many calls to kvm_mmu_reset_context.
>>
>> - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>> + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>> + /* Flush the TLB if CR0 is changed 1 -> 0. */
>> + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>> + kvm_mmu_unload(vcpu);
>
> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> to the comment, or with SMEP handling. And the SMEP handling isn't coherent with
> respect to the changelog. Please elaborate :-)
Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case
below). Using kvm_mmu_unload() avoids loading a cached root just to
throw it away immediately after, but I can change this to a new
KVM_REQ_MMU_UPDATE_ROOT flag that does
kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
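As a rough sketch, assuming the hypothetical request were serviced in
vcpu_enter_guest() like the other MMU requests, it would boil down to:

	/*
	 * Hypothetical; KVM_REQ_MMU_UPDATE_ROOT does not exist yet.  Reuse a
	 * cached root for the unchanged CR3 instead of unloading everything.
	 */
	if (kvm_check_request(KVM_REQ_MMU_UPDATE_ROOT, vcpu))
		kvm_mmu_new_pgd(vcpu, kvm_read_cr3(vcpu));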
By the way, I have a possibly stupid question. In kvm_set_cr3 (called
e.g. from emulator_set_cr()) there is
if (cr3 != kvm_read_cr3(vcpu))
	kvm_mmu_new_pgd(vcpu, cr3);
What makes this work if mmu_is_nested(vcpu)? Should this also have an
"if (... & !tdp_enabled)"?
>> - else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
>> + if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
>> + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>
> This mishandles CR4.PGE. Per the comment above, the if-elif-elif sequence relies
> on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.
>
> For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
> MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
> doesn't emulate global pages.
Makes sense, yes. It also needs to handle flushing the current PCID
when changing CR4.PAE (previously done for "free" by
kvm_mmu_reset_context), but I agree with the idea.
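Purely as a sketch of what I mean (a full guest TLB flush is a superset of
flushing just the current PCID):

	/* Sketch only: toggling CR4.PAE must be visible to the guest as a
	 * TLB flush, which kvm_mmu_reset_context() used to provide for free. */
	if ((cr4 ^ old_cr4) & X86_CR4_PAE)
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);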
Paolo
> This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.
>
>
> ---
> arch/x86/kvm/mmu/mmu.c | 4 ++--
> arch/x86/kvm/x86.c | 42 +++++++++++++++++++++++++++++-------------
> 2 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e41834748d52..c477c519c784 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
> void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> /*
> - * Invalidate all MMU roles to force them to reinitialize as CPUID
> - * information is factored into reserved bit calculations.
> + * Invalidate all MMU roles and roots to force them to reinitialize,
> + * as CPUID information is factored into reserved bit calculations.
> *
> * Correctly handling multiple vCPU models with respect to paging and
> * physical address properties) in a single VM would require tracking
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 782dc9cd31d8..b8dad04301ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
> }
> EXPORT_SYMBOL_GPL(load_pdptrs);
>
> +static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
> +{
> + kvm_init_mmu(vcpu);
> + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> +}
> +
> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
> {
> if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> kvm_clear_async_pf_completion_queue(vcpu);
> kvm_async_pf_hash_reset(vcpu);
> +
> + /*
> + * Clearing CR0.PG is architecturally defined as flushing the
> + * TLB from the guest's perspective.
> + */
> + if (!(cr0 & X86_CR0_PG))
> + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> }
>
> if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> - kvm_mmu_reset_context(vcpu);
> + kvm_post_set_cr_reinit_mmu(vcpu);
>
> if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
> void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
> {
> /*
> - * If any role bit is changed, the MMU needs to be reset.
> - *
> * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
> * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
> * according to the SDM; however, stale prev_roots could be reused
> * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
> * free them all. KVM_REQ_MMU_RELOAD is fit for the both cases; it
> * is slow, but changing CR4.PCIDE is a rare case.
> - *
> - * If CR4.PGE is changed, the guest TLB must be flushed.
> - *
> - * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> - * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> - * the usage of "else if".
> */
> - if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> - kvm_mmu_reset_context(vcpu);
> - else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> + if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
> kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> - else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
> + return;
> + }
> +
> + /* If any role bit is changed, the MMU needs to be reinitialized. */
> + if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> + kvm_post_set_cr_reinit_mmu(vcpu);
> +
> + /*
> + * Setting SMEP or toggling PGE is architecturally defined as flushing
> + * the TLB from the guest's perspective. Note, because the shadow MMU
> + * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
> + */
> + if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
> + ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
> kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
>
> base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
> --
>
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Right now, PGD caching requires a complicated dance of first computing
> the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
Nit, adding () after function names helps readers easily recognize when you're
talking about a specific function, e.g. as opposed to a concept or whatever.
> kvm_init_mmu.
>
> Part of this is due to kvm_mmu_free_roots using mmu->root_level and
> mmu->shadow_root_level to distinguish whether the page table uses a single
> root or 4 PAE roots. Because kvm_init_mmu can overwrite mmu->root_level,
> kvm_mmu_free_roots must be called before kvm_init_mmu.
>
> However, even after kvm_init_mmu there is a way to detect whether the page table
> has a single root or four, because the pae_root does not have an associated
> struct kvm_mmu_page.
Suggest a reword on the final paragraph, because there's a discrepancy with the
code (which handles 0, 1, or 4 "roots", versus just "single or four").
However, even after kvm_init_mmu() there is a way to detect whether the
page table may hold PAE roots, as root.hpa isn't backed by a shadow when
it points at PAE roots.
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c3f597ea00d..95d0fa0bb876 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> struct kvm *kvm = vcpu->kvm;
> int i;
> LIST_HEAD(invalid_list);
> - bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
> + bool free_active_root;
>
> BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
>
> /* Before acquiring the MMU lock, see if we need to do any real work. */
> - if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
> + free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
> + && VALID_PAGE(mmu->root.hpa);
	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
			   VALID_PAGE(mmu->root.hpa);
Isn't this a separate bug fix? E.g. call kvm_mmu_unload() without a valid current
root, but with valid previous roots? In which case we'd try to free garbage, no?
> +
> + if (!free_active_root) {
> for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> VALID_PAGE(mmu->prev_roots[i].hpa))
> @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> &invalid_list);
>
> if (free_active_root) {
> - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> + if (to_shadow_page(mmu->root.hpa)) {
> mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> } else if (mmu->pae_root) {
> for (i = 0; i < 4; ++i) {
> --
> 2.31.1
>
>
On Mon, Feb 14, 2022, Paolo Bonzini wrote:
> On 2/11/22 19:48, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> > > {
> > > - kvm_mmu_unload(vcpu);
> > > kvm_init_mmu(vcpu);
> > > + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> >
> > This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> > affected, e.g. SMM transitions, KVM_SET_SREG, etc...
>
> SMM exit does flush the TLB because RSM clears CR0.PG (I did check this :)).
> SMM re-entry then does not need to flush. But I don't think SMM exit should
> flush the TLB *for non-SMM roles*.
I'm not concerned about the TLB flush aspects so much as the addition of
kvm_mmu_new_pgd() in new paths.
> For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree it is
> certainly safer to keep it that way.
>
> > Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> > and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> > The call to kvm_mmu_new_pgd() is also
>
> *white noise*
>
> > To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> > necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}(). In
> > the future we can/should work on avoiding unload in all paths, but again, future
> > problem.
>
> I disagree on this. There aren't many calls to kvm_mmu_reset_context.
All the more reason to do things incrementally. I have no objection to allowing
all flows to reuse a cached (or current) root, I'm objecting to converting them
all in a single patch.
> > > - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> > > + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> > > + /* Flush the TLB if CR0 is changed 1 -> 0. */
> > > + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> > > + kvm_mmu_unload(vcpu);
> >
> > Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> > to the comment, or with SMEP handling. And the SMEP handling isn't coherent with
> > respect to the changelog. Please elaborate :-)
>
> Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case below).
Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
can't reuse a stale root. That's necessary if and only if the MMU is shadowing
the guest, non-nested TDP MMUs just need to flush the guest's TLB. The same is
true for the PCIDE case, i.e. we could optimize that too, though the main motivation
would be to clarify why all roots are unloaded.
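Roughly, and only as a sketch (the exact condition for "shadowing the guest"
is glossed over here):

	/* Sketch: a non-nested TDP MMU only owes the guest a TLB flush; a
	 * shadowing MMU must drop its roots so that a later MOV CR3 with
	 * NOFLUSH can't pick up a stale one. */
	if (tdp_enabled && !mmu_is_nested(vcpu))
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
	else
		kvm_mmu_unload(vcpu);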
> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
> immediately after,
The shadow paging case will throw it away, but not the non-nested TDP MMU case?
> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
>
> kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
I don't think that's necessary, I was just confused by the discrepancy.
> By the way, I have a possibly stupid question. In kvm_set_cr3 (called e.g.
> from emulator_set_cr()) there is
>
> if (cr3 != kvm_read_cr3(vcpu))
> 	kvm_mmu_new_pgd(vcpu, cr3);
>
> What makes this work if mmu_is_nested(vcpu)?
Hmm, nothing. VMX is "broken" anyways because it will kick out to userspace with
X86EMUL_UNHANDLEABLE due to the CR3 intercept check. SVM is also broken in that
it doesn't check INTERCEPT_CR3_WRITE, e.g. will do the wrong thing even if L1 wants
to intercept CR3 accesses.
> Should this also have an "if (... && !tdp_enabled)"?
Yes? That should avoid the nested mess. This patch also needs to handle CR0 and
CR4 modifications if L2 is active, e.g. if L1 choose not to intercept CR0/CR4.
kvm_post_set_cr_reinit_mmu() would be a lovely landing spot for that check :-D
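E.g., something along these lines, as a sketch (the exact condition is an
open question):

static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
{
	kvm_init_mmu(vcpu);

	/* Sketch: don't try to reuse a cached root while L2 is active, as
	 * vcpu->arch.cr3 and the current MMU may not line up in that case. */
	if (!is_guest_mode(vcpu))
		kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
}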
On 2/14/22 20:24, Sean Christopherson wrote:
> On Mon, Feb 14, 2022, Paolo Bonzini wrote:
>> On 2/11/22 19:48, Sean Christopherson wrote:
>>> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>>>> - kvm_mmu_unload(vcpu);
>>>> kvm_init_mmu(vcpu);
>>>> + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
>>>
>>> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
>>> affected, e.g. SMM transitions, KVM_SET_SREG, etc...
>
> I'm not concerned about the TLB flush aspects so much as the addition of
> kvm_mmu_new_pgd() in new paths.
Okay, yeah those are more complex and the existing ones are broken too.
>>>> - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>>>> + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>>>> + /* Flush the TLB if CR0 is changed 1 -> 0. */
>>>> + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>>>> + kvm_mmu_unload(vcpu);
>>>
>>> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
>>> to the comment, or with SMEP handling. And the SMEP handling isn't coherent with
>>> respect to the changelog. Please elaborate :-)
>>
>> Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case below).
>
> Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
> can't reuse a stale root. That's necessary if and only if the MMU is shadowing
> the guest, non-nested TDP MMUs just need to flush the guest's TLB. The same is
> true for the PCIDE case, i.e. we could optimize that too, though the main motivation
> would be to clarify why all roots are unloaded.
Yes. Clarifying all this should be done before the big change to
kvm_mmu_reset_context().
>> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
>> immediately after,
>
> The shadow paging case will throw it away, but not the non-nested TDP MMU case?
Yes, the TDP case is okay since the role is the same. kvm_init_mmu() is
enough.
>> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
>>
>> kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
>
> I don't think that's necessary, I was just confused by the discrepancy.
It may not be necessary but it is clearer IMO. Let me post a new patch.
Paolo